Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
94 changes: 94 additions & 0 deletions Root Cause Analysis and Fix Proposal
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
Root Cause Analysis and Fix Proposal
I've completed the investigation into the HCP test BackupRepository Kopia initialization failure. Here are my findings:
Root Cause
The error occurs in Velero's PrepareRepo function at pkg/repository/provider/unified_repo.go:170-215:
func (urp *unifiedRepoProvider) PrepareRepo(ctx context.Context, param RepoParam) error {
// ... setup code ...

repoOption, err := udmrepo.NewRepoOptions(
udmrepo.WithPassword(urp, param),
udmrepo.WithConfigFile(urp.workPath, string(param.BackupRepo.UID)),
// ... other options ...
)

// Check if repo exists in backend storage
if created, err := urp.repoService.IsCreated(ctx, *repoOption); err != nil {
return errors.Wrap(err, "error to check backup repo")
} else if created {
log.Debug("Repo has already been initialized remotely")
return nil // ❌ BUG: Returns without creating local config file!
}

// Only reached if repo doesn't exist in backend
err = urp.repoService.Create(ctx, *repoOption)
// ...
}
The Problem:
Kopia Architecture: Kopia repositories have two components:
Backend storage (S3/Azure/GCS) containing repository metadata
Local config file (/home/velero/udmrepo/repo-<uid>.conf) containing connection info
Current Flow:
repo.Initialize(): Creates backend storage metadata AND creates initial local config file
repo.Connect(): Connects to existing backend AND creates/updates local config file
repo.Open(): Opens repo using EXISTING local config file
The Bug:
When PrepareRepo finds the repository already exists in backend storage (via IsCreated()), it returns immediately without calling Connect()
This means the local config file is never created/recreated
Since the config file is stored in ephemeral pod storage (/home/velero), it's lost when:
- A different node-agent pod tries to use the repository
- The pod restarts
- The creating pod is deleted
Later, when DataUpload tries to open the repository via repoService.Open(), it fails because the config file doesn't exist
When This Occurs
This bug manifests in these scenarios:
Multi-pod environments: One node-agent pod creates the BackupRepository, another pod tries to use it
Pod restarts: After the creating pod restarts, it loses the local config file (ephemeral storage)
HCP tests: Multiple backups in sequence where different resources trigger repository initialization
The Fix
The fix is to ensure PrepareRepo always calls Connect() when the repository exists remotely:
func (urp *unifiedRepoProvider) PrepareRepo(ctx context.Context, param RepoParam) error {
// ... setup code ...

repoOption, err := udmrepo.NewRepoOptions(/* ... */)
if err != nil {
return errors.Wrap(err, "error to get repo options")
}

if created, err := urp.repoService.IsCreated(ctx, *repoOption); err != nil {
return errors.Wrap(err, "error to check backup repo")
} else if created {
log.Debug("Repo has already been initialized remotely, connecting to it")
// ✅ FIX: Call Connect to create local config file
return urp.repoService.Connect(ctx, *repoOption)
}

// Repository doesn't exist, create it
if param.BackupLocation.Spec.AccessMode == velerov1api.BackupStorageLocationAccessModeReadOnly {
return errors.Errorf("cannot create new backup repo for read-only backup storage location %s/%s",
param.BackupLocation.Namespace, param.BackupLocation.Name)
}

err = urp.repoService.Create(ctx, *repoOption)
if err != nil {
return errors.Wrap(err, "error to create backup repo")
}

log.Debug("Prepare repo complete")
return nil
}
File to Fix
Location: /Users/tkaovila/git/velero/pkg/repository/provider/unified_repo.go
Lines: 211-216 (specifically line 215)
Change: Replace `return nil` with `return urp.repoService.Connect(ctx, *repoOption)`
Additional Context
From lib_repo.go:98-106, Create() calls writeInitParameters() which opens the repo. This means:
The initial repo.Initialize() does create a local config file
However, this config file is stored in ephemeral pod storage (/home/velero/udmrepo/)
The issue is that subsequent PrepareRepo() calls don't recreate the config file when it's missing

This explains the failure pattern:
The first DataUpload might succeed (if it runs on the same pod that created the repo)
Subsequent DataUploads fail when:
- Running on different node-agent pods (don't have the config file)
- Running after pod restart (config file lost from ephemeral storage)
All HCP resources fail with the same error because they're processed by pods without the config file
1 change: 1 addition & 0 deletions changelogs/unreleased/0000-kaovilai
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix BackupRepository PrepareRepo to ensure local config file exists when repo exists remotely, using Open() fallback to Connect() pattern. This fixes failures in multi-pod environments and after pod restarts where config files are missing or invalid.
1 change: 1 addition & 0 deletions changelogs/unreleased/9379-Lyndon-Li
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Refactor repo provider interface for static configuration
115 changes: 115 additions & 0 deletions design/wildcard-namespace-support-design.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@

# Wildcard Namespace Support

## Abstract

Velero currently treats namespace patterns with glob characters as literal strings. This design adds wildcard expansion to support flexible namespace selection using patterns like `app-*` or `test-{dev,staging}`.

## Background

Requested in [#1874](https://github.com/vmware-tanzu/velero/issues/1874) for more flexible namespace selection.

## Goals

- Support glob pattern expansion in namespace includes/excludes
- Maintain backward compatibility with existing `*` behavior

## Non-Goals

- Complex regex patterns beyond basic globs

## High-Level Design

Wildcard expansion occurs early in both backup and restore flows, converting patterns to literal namespace lists before normal processing.

### Backup Flow

Expansion happens in `getResourceItems()` before namespace collection:
1. Check if wildcards exist using `ShouldExpandWildcards()`
2. Expand patterns against active cluster namespaces
3. Replace includes/excludes with expanded literal namespaces
4. Continue with normal backup processing

### Restore Flow

Expansion occurs in `execute()` after parsing backup contents:
1. Extract available namespaces from backup tar
2. Expand patterns against backup namespaces (not cluster namespaces)
3. Update restore context with expanded namespaces
4. Continue with normal restore processing

This ensures restore wildcards match actual backup contents, not current cluster state.

## Detailed Design

### Status Fields

Add wildcard expansion tracking to backup and restore CRDs:

```go
type WildcardNamespaceStatus struct {
// IncludeWildcardMatches records namespaces that matched include patterns
// +optional
IncludeWildcardMatches []string `json:"includeWildcardMatches,omitempty"`

// ExcludeWildcardMatches records namespaces that matched exclude patterns
// +optional
ExcludeWildcardMatches []string `json:"excludeWildcardMatches,omitempty"`

// WildcardResult records final namespaces after wildcard processing
// +optional
WildcardResult []string `json:"wildcardResult,omitempty"`
}

// Added to both BackupStatus and RestoreStatus
type BackupStatus struct {
// WildcardNamespaces contains wildcard expansion results
// +optional
WildcardNamespaces *WildcardNamespaceStatus `json:"wildcardNamespaces,omitempty"`
}
```

### Wildcard Expansion Package

New `pkg/util/wildcard/expand.go` package provides:

- `ShouldExpandWildcards()` - Skip expansion for simple "*" case
- `ExpandWildcards()` - Main expansion function using `github.com/gobwas/glob`
- Pattern validation rejecting unsupported regex symbols

**Supported patterns**: `*`, `?`, `[abc]`, `{a,b,c}`
**Unsupported**: `|()`, `**`

### Implementation Details

#### Backup Integration (`pkg/backup/item_collector.go`)

Expansion in `getResourceItems()`:
- Call `wildcard.ExpandWildcards()` with cluster namespaces
- Update `NamespaceIncludesExcludes` with expanded results
- Populate status fields with expansion results

#### Restore Integration (`pkg/restore/restore.go`)

Expansion in `execute()`:
```go
if wildcard.ShouldExpandWildcards(includes, excludes) {
availableNamespaces := extractNamespacesFromBackup(backupResources)
expandedIncludes, expandedExcludes, err := wildcard.ExpandWildcards(
availableNamespaces, includes, excludes)
// Update context and status
}
```

## Alternatives Considered

1. **Client-side expansion**: Rejected because it wouldn't work for scheduled backups
2. **Expansion in `collectNamespaces`**: Rejected because these functions expect literal namespaces

## Compatibility

Maintains full backward compatibility - existing "*" behavior unchanged.

## Implementation

Target: Velero 1.18
68 changes: 65 additions & 3 deletions pkg/repository/manager/manager.go
Original file line number Diff line number Diff line change
Expand Up @@ -60,7 +60,19 @@ type Manager interface {
BatchForget(context.Context, *velerov1api.BackupRepository, []string) []error

// DefaultMaintenanceFrequency returns the default maintenance frequency from the specific repo
DefaultMaintenanceFrequency(repo *velerov1api.BackupRepository) (time.Duration, error)
DefaultMaintenanceFrequency(*velerov1api.BackupRepository) (time.Duration, error)

// ClientSideCacheLimit returns the max cache size required on client side
ClientSideCacheLimit(*velerov1api.BackupRepository) (int64, error)
}

// ConfigProvider defines the methods to get configurations of a backup repository
type ConfigManager interface {
// DefaultMaintenanceFrequency returns the default maintenance frequency from the specific repo
DefaultMaintenanceFrequency(string) (time.Duration, error)

// ClientSideCacheLimit returns the max cache size required on client side
ClientSideCacheLimit(string, map[string]string) (int64, error)
}

type manager struct {
Expand All @@ -74,6 +86,11 @@ type manager struct {
log logrus.FieldLogger
}

type configManager struct {
providers map[string]provider.ConfigProvider
log logrus.FieldLogger
}

// NewManager create a new repository manager.
func NewManager(
namespace string,
Expand Down Expand Up @@ -101,6 +118,20 @@ func NewManager(
return mgr
}

// NewConfigManager create a new repository config manager.
func NewConfigManager(
log logrus.FieldLogger,
) ConfigManager {
mgr := &configManager{
providers: map[string]provider.ConfigProvider{},
log: log,
}

mgr.providers[velerov1api.BackupRepositoryTypeKopia] = provider.NewUnifiedRepoConfigProvider(velerov1api.BackupRepositoryTypeKopia, mgr.log)

return mgr
}

func (m *manager) InitRepo(repo *velerov1api.BackupRepository) error {
m.repoLocker.LockExclusive(repo.Name)
defer m.repoLocker.UnlockExclusive(repo.Name)
Expand Down Expand Up @@ -227,12 +258,16 @@ func (m *manager) DefaultMaintenanceFrequency(repo *velerov1api.BackupRepository
return 0, errors.WithStack(err)
}

param, err := m.assembleRepoParam(repo)
return prd.DefaultMaintenanceFrequency(), nil
}

func (m *manager) ClientSideCacheLimit(repo *velerov1api.BackupRepository) (int64, error) {
prd, err := m.getRepositoryProvider(repo)
if err != nil {
return 0, errors.WithStack(err)
}

return prd.DefaultMaintenanceFrequency(context.Background(), param), nil
return prd.ClientSideCacheLimit(repo.Spec.RepositoryConfig), nil
}

func (m *manager) getRepositoryProvider(repo *velerov1api.BackupRepository) (provider.Provider, error) {
Expand All @@ -256,3 +291,30 @@ func (m *manager) assembleRepoParam(repo *velerov1api.BackupRepository) (provide
BackupRepo: repo,
}, nil
}

func (cm *configManager) DefaultMaintenanceFrequency(repoType string) (time.Duration, error) {
prd, err := cm.getRepositoryProvider(repoType)
if err != nil {
return 0, errors.WithStack(err)
}

return prd.DefaultMaintenanceFrequency(), nil
}

func (cm *configManager) ClientSideCacheLimit(repoType string, repoOption map[string]string) (int64, error) {
prd, err := cm.getRepositoryProvider(repoType)
if err != nil {
return 0, errors.WithStack(err)
}

return prd.ClientSideCacheLimit(repoOption), nil
}

func (cm *configManager) getRepositoryProvider(repoType string) (provider.ConfigProvider, error) {
switch repoType {
case velerov1api.BackupRepositoryTypeKopia:
return cm.providers[velerov1api.BackupRepositoryTypeKopia], nil
default:
return nil, fmt.Errorf("failed to get provider for repository %s", repoType)
}
}
17 changes: 17 additions & 0 deletions pkg/repository/manager/manager_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -47,3 +47,20 @@ func TestGetRepositoryProvider(t *testing.T) {
_, err = mgr.getRepositoryProvider(repo)
require.Error(t, err)
}

func TestGetRepositoryConfigProvider(t *testing.T) {
mgr := NewConfigManager(nil).(*configManager)

// empty repository type
_, err := mgr.getRepositoryProvider("")
require.Error(t, err)

// valid repository type
provider, err := mgr.getRepositoryProvider(velerov1.BackupRepositoryTypeKopia)
require.NoError(t, err)
assert.NotNil(t, provider)

// invalid repository type
_, err = mgr.getRepositoryProvider(velerov1.BackupRepositoryTypeRestic)
require.Error(t, err)
}
Loading