
core: subvolumegroup clean up #14026

Merged · 1 commit merged into rook:master on Apr 10, 2024

Conversation

@sp98 (Contributor) commented Apr 4, 2024

Clean up the resources created by a subvolumegroup when it is deleted. The following resources will be cleaned up (a rough sketch of the cleanup order is included after the lists below):

Cleanup:

  • OMAP values
  • OMAP keys
  • Cancel pending clones
  • Snapshots
  • Subvolumes
  • Unit tests

Pending (maybe for a follow-up PR):

  • Handle the OMAP not-found scenario
  • Integration tests
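
A rough sketch of the cleanup order referred to above (not this PR's exact code): `ListSubvolumesInGroup`, `DeleteSubvolumeSnapshot`, `DeleteSubVolume`, and `CleanUpOMAPDetails` all appear in this PR, while `cancelPendingClones` and `listSnapshots` are placeholders for the corresponding helpers, and the exact ordering is illustrative.

```go
package cleanup

import (
	"github.com/pkg/errors"
	"github.com/rook/rook/pkg/clusterd"
	cephclient "github.com/rook/rook/pkg/daemon/ceph/client"
)

// cleanUpSubVolumeGroup sketches the deletion order: pending clones, snapshots,
// CSI OMAP details, and finally the subvolumes themselves.
func cleanUpSubVolumeGroup(context *clusterd.Context, clusterInfo *cephclient.ClusterInfo,
	fsName, svg, poolName, csiNamespace string) error {
	subVolumes, err := cephclient.ListSubvolumesInGroup(context, clusterInfo, fsName, svg)
	if err != nil {
		return errors.Wrapf(err, "failed to list subvolumes in group %q", svg)
	}
	for _, subVolume := range subVolumes {
		// Cancel pending clones first so snapshots are no longer referenced.
		if err := cancelPendingClones(context, clusterInfo, fsName, subVolume.Name, svg); err != nil {
			return err
		}
		// Delete the subvolume's snapshots.
		snapshots, err := listSnapshots(context, clusterInfo, fsName, subVolume.Name, svg)
		if err != nil {
			return err
		}
		for _, snapshot := range snapshots {
			if err := cephclient.DeleteSubvolumeSnapshot(context, clusterInfo, fsName, subVolume.Name, svg, snapshot.Name); err != nil {
				return err
			}
		}
		// Remove the OMAP keys and values that CSI tracks for this subvolume.
		if err := CleanUpOMAPDetails(context, clusterInfo, subVolume.Name, poolName, csiNamespace); err != nil {
			return err
		}
		// Finally delete the subvolume itself.
		if err := cephclient.DeleteSubVolume(context, clusterInfo, fsName, subVolume.Name, svg); err != nil {
			return err
		}
	}
	return nil
}
```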

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@Madhu-1 (Member) left a comment

@sp98 I thought we agreed on not duplicating code between Rook and the rook-ceph krew plugin. It looks like this PR is not importing any of the existing code from kubectl-rook-ceph. What's the plan going forward?

@sp98 sp98 changed the title [WPP]core: subvolumegroup clean up [WIP]core: subvolumegroup clean up Apr 4, 2024
@sp98 (Contributor Author) commented Apr 4, 2024

@sp98 I thought we agreed on not duplicating code between Rook and the rook-ceph krew plugin. It looks like this PR is not importing any of the existing code from kubectl-rook-ceph. What's the plan going forward?

Agreed that duplicating the code would be bad. I'll discuss this in the huddle to decide on the approach.

@sp98 sp98 added the skip-ci label Apr 4, 2024
@sp98 sp98 force-pushed the resource-cleanup branch 2 times, most recently from 383f127 to e7939c2 on April 4, 2024 08:34
@@ -0,0 +1,25 @@
---
title: Ceph Cleanup
Member

Instead of a new doc, how about a new section in ceph-teardown.md?

Contributor Author

moved it to ceph-teardown.md doc.


Once the cleanup job is completed successfully, Rook will remove the finalizers from the deleted custom resource.

This cleanup is supported only for the following custom resources:
Member

Since we're going to have multiple types of resources supported, it would be nice to have a table, one row for each resource, and its corresponding resources that will be cleaned up.

Contributor Author

using table.


logger.Infof("starting clean up ceph SubVolumeGroup resource %q", subVolumeGroupName)

subVolumeList, err := cephclient.ListSubvolumesInGroup(context, clusterInfo, fsName, subVolumeGroupName)
Member

Only the CLI processing should be in the cmd package. The rest of this code should likely move under pkg/daemon/ceph

Contributor Author

moved to pkg/daemon/ceph

@@ -143,8 +143,8 @@ var (
DefaultRegistrarImage = "registry.k8s.io/sig-storage/csi-node-driver-registrar:v2.10.0"
DefaultProvisionerImage = "registry.k8s.io/sig-storage/csi-provisioner:v4.0.0"
DefaultAttacherImage = "registry.k8s.io/sig-storage/csi-attacher:v4.5.0"
DefaultSnapshotterImage = "registry.k8s.io/sig-storage/csi-snapshotter:v7.0.1"
DefaultResizerImage = "registry.k8s.io/sig-storage/csi-resizer:v1.10.0"
DefaultSnapshotterImage = "registry.k8s.io/sig-storage/csi-snapshotter:v6.3.2"
Member

Was it intentional to downgrade these image versions?

Contributor Author

removed this change.

}

var cleanUpDisksCmd = &cobra.Command{
Use: "disks",
Member

Aren't we cleaning subvolumes not disks?

Contributor Author

We already had a `ceph clean` command to clean up the data on the disks.

I've extended the clean command with more sub-commands.

`ceph clean disks` runs the existing data cleanup job. This is existing behavior.

`ceph clean CephFSSubVolumeGroup` will clean up the subvolumegroups.
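
A rough sketch of that command layout; the `cleanUp*Cmd` variable names and wiring here are illustrative rather than the PR's exact code, while `startCleanUp` and `startSubVolumeGroupsCleanUp` are the handlers shown later in this conversation.

```go
package main // illustrative; the PR's CLI wiring lives under the rook command package

import "github.com/spf13/cobra"

var cleanUpCmd = &cobra.Command{
	Use:   "clean",
	Short: "Starts the ceph cleanup process",
}

var cleanUpDisksCmd = &cobra.Command{
	Use:   "disks",
	Short: "Cleans up the data on the disks (existing behavior)",
	RunE:  startCleanUp,
}

var cleanUpSubVolumeGroupCmd = &cobra.Command{
	Use:   "CephFSSubVolumeGroup",
	Short: "Cleans up the ceph resources created for a CephFS subvolumegroup",
	RunE:  startSubVolumeGroupsCleanUp,
}

func init() {
	// Attach the new sub-commands to the clean command.
	cleanUpCmd.AddCommand(cleanUpDisksCmd, cleanUpSubVolumeGroupCmd)
}
```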

Member

ceph clean disks is called by which container?

@sp98 sp98 force-pushed the resource-cleanup branch 5 times, most recently from f09904b to cb0c42f on April 5, 2024 08:09
@sp98 sp98 requested review from travisn and Madhu-1 April 5, 2024 14:38
@sp98 sp98 marked this pull request as ready for review April 5, 2024 14:38
@sp98 sp98 force-pushed the resource-cleanup branch 2 times, most recently from 6472c34 to 9c23c8a on April 5, 2024 14:58
@sp98 sp98 removed the skip-ci label Apr 5, 2024
@@ -2,6 +2,23 @@
title: Cleanup
---

## Cleaning up a Custom Resource
Member

Suggested change
## Cleaning up a Custom Resource
## Force Delete Resources

@@ -2,6 +2,23 @@
title: Cleanup
---

## Cleaning up a Custom Resource

To cleanup a specific custom resource, add `ceph.rook.io/force-deletion="true"` annotation before deleting it. Rook will start a cleanup job that will delete the all the ceph resources created by that custom resource. For example, run the following commands to clean the `CephFSSubVolumeGroup` resource named `myfs-csi`:
Member

Suggested change
To cleanup a specific custom resource, add `ceph.rook.io/force-deletion="true"` annotation before deleting it. Rook will start a cleanup job that will delete the all the ceph resources created by that custom resource. For example, run the following commands to clean the `CephFSSubVolumeGroup` resource named `myfs-csi`:
To keep your data safe in the cluster, Rook disallows deleting critical cluster resources by default. To override this behavior and force delete a specific custom resource, add the annotation `ceph.rook.io/force-deletion="true"` to the resource and then delete it. Rook will start a cleanup job that will delete the all the related ceph resources created by that custom resource.
For example, run the following commands to clean the `CephFSSubVolumeGroup` resource named `my-subvolumegroup`:

Member

CephFSSubVolumeGroup to CephFilesystemSubVolumeGroup

To cleanup a specific custom resource, add `ceph.rook.io/force-deletion="true"` annotation before deleting it. Rook will start a cleanup job that will delete the all the ceph resources created by that custom resource. For example, run the following commands to clean the `CephFSSubVolumeGroup` resource named `myfs-csi`:

``` console
kubectl annotate cephfilesystemsubvolumegroups.ceph.rook.io myfs-csi -n rook-ceph ceph.rook.io/force-deletion="true"
Member

Suggested change
kubectl annotate cephfilesystemsubvolumegroups.ceph.rook.io myfs-csi -n rook-ceph ceph.rook.io/force-deletion="true"
kubectl annotate cephfilesystemsubvolumegroups.ceph.rook.io my-subvolumegroup -n rook-ceph ceph.rook.io/force-deletion="true"


``` console
kubectl annotate cephfilesystemsubvolumegroups.ceph.rook.io myfs-csi -n rook-ceph ceph.rook.io/force-deletion="true"
kubectl delete cephfilesystemsubvolumegroups.ceph.rook.io myfs-csi -n rook-ceph
Member

For consistency in the docs, let's put the namespace first. We aren't very consistent about it, but IMO it is a nice convention.

Suggested change
kubectl delete cephfilesystemsubvolumegroups.ceph.rook.io myfs-csi -n rook-ceph
kubectl -n rook-ceph delete cephfilesystemsubvolumegroups.ceph.rook.io my-subvolumegroup

if !success {
return opcontroller.ImmediateRetryResult, errors.Wrapf(err, "Waiting for ceph cleanup job to complete successfully for cephFilesystemSubVolumeGroup %q", cephFilesystemSubVolumeGroup.Name)
}
logger.Infof("successfully cleaned up all the ceph resources created by the cephFilesystemSubVolumeGroup %q", cephFilesystemSubVolumeGroup.Name)
Member

If we get here it just means the job was successfully started. We don't know if it successfully completed, so we should always requeue the event, right?
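
A minimal sketch of the requeue-until-complete pattern being asked for here, written against plain controller-runtime types; the `waitForCleanupJob` helper and the 10-second retry interval are illustrative, not the PR's actual code.

```go
package controller // illustrative package name

import (
	"context"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForCleanupJob requeues the reconcile until the cleanup job has actually
// completed, instead of treating "job created" as "cleanup done".
func waitForCleanupJob(ctx context.Context, c client.Client, jobName, namespace string) (ctrl.Result, error) {
	job := &batchv1.Job{}
	if err := c.Get(ctx, types.NamespacedName{Name: jobName, Namespace: namespace}, job); err != nil {
		return ctrl.Result{Requeue: true}, err
	}
	if job.Status.Succeeded == 0 {
		// The job exists but has not reported success yet: check again later and
		// keep the finalizer in place.
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}
	// The job succeeded, so the caller can remove the finalizer.
	return ctrl.Result{}, nil
}
```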

cleanupConfig := map[string]string{
opcontroller.CephFSSubVolumeGroupNameEnv: cephFilesystemSubVolumeGroup.Spec.Name,
opcontroller.CephFSVolumeNameEnv: cephFilesystemSubVolumeGroup.Spec.FilesystemName,
opcontroller.CephFSNamesaceEnv: "csi", // TODO: is it always "csi" for cephFS
Member

We create a default "csi" subvolumegroup for each cephfilesystem, but if the user creates their own subvolumegroup CR, the name will be different. And isn't this the namespace, not the svg name?

Suggested change
opcontroller.CephFSNamesaceEnv: "csi", // TODO: is it always "csi" for cephFS
opcontroller.CephFSNamespaceEnv: namespace

Member

This is the RADOS namespace where the OMAP data is stored; it is always "csi" for CSI-created PVCs. I suggested above renaming this variable to avoid confusion.

@@ -187,6 +187,25 @@ func (r *ReconcileCephFilesystemSubVolumeGroup) reconcile(request reconcile.Requ
if cephCluster.Spec.External.Enable {
logger.Warningf("external subvolume group %q deletion is not supported, delete it manually", namespacedName)
} else {
if opcontroller.IsCleanupRequired(cephFilesystemSubVolumeGroup.GetAnnotations()) {
Member

How about moving this whole block to a helper method? This method is so long already, it helps for code readability.

},
}

existingJob := &batch.Job{}
Member

Before we get or create the job, let's check if there are any subvolumes in the subvolumegroup. if not, then we don't need to create the job at all in the first place and it can be immediately deleted. Or is that check somewhere else in the finalizer already?

Contributor Author

Currently we always create a job to clean up the subvolumegroup, whether it has subvolumes or not.

Contributor Author

That seems good to have. But since we only create this job when the annotation is present on the resource, it should be OK to run the job, as the user has deliberately asked to clean up the resource (subvolumes, snapshots, etc.).

Member

If there are no subvolumes in the svg, seems like the finalizer should be immediately removed. This would cover both cases of 1) no subvolumes existing (and no need for creating the job), or 2) if the subvolumes were already cleaned up by the job and the operator is retrying to delete the svg CR.
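
A fragment sketching that check, reusing `ListSubvolumesInGroup` from this PR; variable names follow the surrounding snippets, and leaving finalizer removal to the caller is an assumption.

```go
subVolumes, err := cephclient.ListSubvolumesInGroup(context, clusterInfo, fsName, subVolumeGroupName)
if err != nil {
	return errors.Wrapf(err, "failed to list subvolumes in group %q", subVolumeGroupName)
}
if len(subVolumes) == 0 {
	// Nothing left to clean up: skip creating the job so the finalizer can be
	// removed right away.
	return nil
}
// Otherwise create (or reuse) the cleanup job as below.
```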

}

existingJob := &batch.Job{}
err := client.Get(ctx, types.NamespacedName{Name: jobName, Namespace: c.resource.GetNamespace()}, existingJob)
Member

Instead of calling Get(), how about just calling Create() and ignore if there is an error that it already exists? Then no need for two api calls.

Contributor Author

Get() also lets us fetch the currently running job and validate its status.

Member

The only status we need to know is if there are any subvolumes remaining. The status of the job seems like an extra detail we don't need? Let's discuss.
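
For reference, a fragment of the Create-and-tolerate-AlreadyExists pattern suggested earlier in this thread, assuming `kerrors` aliases `k8s.io/apimachinery/pkg/api/errors` and that `client`, `job`, `jobName`, and `logger` are as in the snippet above.

```go
if err := client.Create(ctx, job); err != nil {
	if kerrors.IsAlreadyExists(err) {
		// The job from a previous reconcile is still present; nothing to do here.
		logger.Debugf("cleanup job %q already exists", jobName)
		return nil
	}
	return errors.Wrapf(err, "failed to create cleanup job %q", jobName)
}
```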



| Custom Resource | Ceph Resources to be cleaned up |
| -------- | ------- |
| CephFSSubVolumeGroups | OMAP details, snapshots, clones, subvolumes |
Member

CephFilesystemSubVolumeGroup
OMAP details -> CSI stored RADOS OMAP details for pvc/volumesnapshots
snapshots -> subvolume snapshots
clones -> subvolume clones
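
Applying these suggestions, the table would read roughly:

| Custom Resource | Ceph Resources to be cleaned up |
| -------- | ------- |
| CephFilesystemSubVolumeGroup | CSI stored RADOS OMAP details for pvc/volumesnapshots, subvolume snapshots, subvolume clones, subvolumes |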


fsName := os.Getenv(opcontroller.CephFSVolumeNameEnv)
if fsName == "" {
return fmt.Errorf("cephfs volume name is not available in the pod environment variables")
Member

Rename "cephfs volume name" to "ceph filesystem name": users understand the filesystem/subvolumegroup relationship, but are less aware of what an "fs volume" is, even though it also refers to the filesystem.

namespace := os.Getenv(k8sutil.PodNamespaceEnvVar)
clusterInfo := client.AdminClusterInfo(ctx, namespace, "")

fsName := os.Getenv(opcontroller.CephFSVolumeNameEnv)
Member

CephFSVolumeNameEnv to CephFSNameEnv

Comment on lines 129 to 130
csiNamespace := os.Getenv(opcontroller.CephFSNamesaceEnv)
if csiNamespace == "" {
return fmt.Errorf("cephfs SubVolumeGroup namespace name is not available in the pod environment variables")
}
Member

Suggested change
csiNamespace := os.Getenv(opcontroller.CephFSNamesaceEnv)
if csiNamespace == "" {
return fmt.Errorf("cephfs SubVolumeGroup namespace name is not available in the pod environment variables")
}
csiNamespace := os.Getenv(opcontroller.CSICephFSRadosNamesaceEnv)
if csiNamespace == "" {
return fmt.Errorf("CSI rados namespace name is not available in the pod environment variables")
}

func DeleteSubVolumeSnapshots(context *clusterd.Context, clusterInfo *client.ClusterInfo, snapshots cephclient.SubVolumeSnapshotList, fsName, subvol, svg string) error {
for _, snapshot := range snapshots {
logger.Info("deleting snapshot %q for subvolume %q in group %q", snapshot.Name, subvol, svg)
err := cephclient.DeleteSubvolumeSnapshot(context, clusterInfo, fsName, subvol, svg, snapshot.Name)
Member

We cannot delete snapshots that have pending clones; we also need to cancel the clone and delete the subvolume created for the clone before deleting the snapshot.
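
A sketch of that ordering, mirroring the signature of the function under review; `cancelClone` and `deleteCloneSubvolume` are placeholders for the clone handling that still needs to be implemented.

```go
func deleteSnapshotAndClones(context *clusterd.Context, clusterInfo *client.ClusterInfo,
	fsName, subvol, svg, snapshot string, pendingClones []string) error {
	for _, clone := range pendingClones {
		// A snapshot with pending clones cannot be deleted, so cancel each clone first.
		if err := cancelClone(context, clusterInfo, fsName, svg, clone); err != nil {
			return errors.Wrapf(err, "failed to cancel pending clone %q", clone)
		}
		// The clone is itself a subvolume and must be removed as well.
		if err := deleteCloneSubvolume(context, clusterInfo, fsName, svg, clone); err != nil {
			return errors.Wrapf(err, "failed to delete clone subvolume %q", clone)
		}
	}
	return cephclient.DeleteSubvolumeSnapshot(context, clusterInfo, fsName, subvol, svg, snapshot)
}
```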

// SubVolumeSnapshotList is the list of snapshots in a CephFS subvolume
type SubVolumeSnapshotList []SubVolumeSnapshot

// ListSubVolumeSnaphotsInGroup lists all the subvolume snapshots present in the given filesystem's subvolume group by
Member

Suggested change
// ListSubVolumeSnaphotsInGroup lists all the subvolume snapshots present in the given filesystem's subvolume group by
// ListSubVolumeSnaphots lists all the subvolume snapshots present in the subvolume in the given filesystem's subvolume group. Times out after 5 seconds.

}

// SubVolumeSnapshotList is the list of snapshots in a CephFS subvolume
type SubVolumeSnapshotList []SubVolumeSnapshot
Member

SubVolumeSnapshotList to SubVolumeSnapshots; we can drop "List" since the name already ends in a plural.

Volumes: volumes,
RestartPolicy: v1.RestartPolicyOnFailure,
PriorityClassName: cephv1.GetCleanupPriorityClassName(c.cluster.Spec.PriorityClassNames),
ServiceAccountName: k8sutil.DefaultServiceAccount,
Member

can we add securityContext to this as well?


@sp98 sp98 force-pushed the resource-cleanup branch 3 times, most recently from fd41ddb to f48bc5a on April 8, 2024 13:26
@@ -2,6 +2,11 @@
title: Cleanup
---

Rook provides following clean up options:
Member

Suggested change
Rook provides following clean up options:
Rook provides the following clean up options:


## Force Delete Resources

To keep your data safe in the cluster, Rook disallows deleting critical cluster resources by default. To override this behavior and force delete a specific custom resource, add the annotation `ceph.rook.io/force-deletion="true"` to the resource and then delete it. Rook will start a cleanup job that will delete the all the related ceph resources created by that custom resource.
Member

Suggested change
To keep your data safe in the cluster, Rook disallows deleting critical cluster resources by default. To override this behavior and force delete a specific custom resource, add the annotation `ceph.rook.io/force-deletion="true"` to the resource and then delete it. Rook will start a cleanup job that will delete the all the related ceph resources created by that custom resource.
To keep your data safe in the cluster, Rook disallows deleting critical cluster resources by default. To override this behavior and force delete a specific custom resource, add the annotation `ceph.rook.io/force-deletion="true"` to the resource and then delete it. Rook will start a cleanup job that will delete all the related ceph resources created by that custom resource.

}

var cleanUpHostsCmd = &cobra.Command{
Use: "hosts",
Member

Only one host is cleaned up by this command for a particular cleanup job, right?

Suggested change
Use: "hosts",
Use: "host",


var cleanUpHostsCmd = &cobra.Command{
Use: "hosts",
Short: "Starts the cleanup process on the hosts after ceph cluster is deleted",
Member

Suggested change
Short: "Starts the cleanup process on the hosts after ceph cluster is deleted",
Short: "Starts the cleanup process on a host after the ceph cluster is deleted",

@@ -74,6 +74,7 @@ func (s *DiskSanitizer) StartSanitizeDisks() {
// Start the sanitizing sequence
s.SanitizeRawDisk(osdRawList)
}

Member

revert the blank line?


err = cephclient.DeleteSubVolume(context, clusterInfo, fsName, subVolume.Name, svg)
if err != nil {
return errors.Wrapf(err, "failed to delete subvolume Group %q.", subVolume.Name)
Member

Instead of returning immediately on failure, store the error in a var and attempt deletion of all subvolumes. Then at the end return the error if any. This way, we can always attempt to cleanup as much as possible.
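
A fragment of the suggested pattern, using the variables from the snippet above; `retErr` matches the name that ends up in the later revision.

```go
var retErr error
for _, subVolume := range subVolumeList {
	if err := cephclient.DeleteSubVolume(context, clusterInfo, fsName, subVolume.Name, svg); err != nil {
		// Remember the failure but keep deleting the remaining subvolumes.
		retErr = errors.Wrapf(err, "failed to delete subvolume %q", subVolume.Name)
		logger.Errorf("%v", retErr)
	}
}
return retErr
```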

@@ -138,7 +138,7 @@ func (c *ClusterController) cleanUpJobContainer(cluster *cephv1.CephCluster, mon
SecurityContext: securityContext,
VolumeMounts: volumeMounts,
Env: envVars,
Args: []string{"ceph", "clean"},
Args: []string{"ceph", "clean", "hosts"},
Member

Suggested change
Args: []string{"ceph", "clean", "hosts"},
Args: []string{"ceph", "clean", "host"},


@sp98 sp98 force-pushed the resource-cleanup branch 3 times, most recently from 7ec4e67 to 509f07c on April 9, 2024 14:42
@sp98 sp98 requested a review from Madhu-1 April 9, 2024 14:42
@sp98 sp98 requested review from travisn and parth-gr April 9, 2024 14:42
@sp98 sp98 changed the title [WIP]core: subvolumegroup clean up core: subvolumegroup clean up Apr 9, 2024
@travisn (Member) left a comment

How is testing looking? In a separate PR we could add integration tests, as long as manual testing with various scenarios is covered.

volumeName = "cleanup-volume"
dataDirHostPath = "ROOK_DATA_DIR_HOST_PATH"
CleanupAppName = "resource-cleanup"
RESOURCE_CLEANUP_ANNOTATION = "ceph.rook.io/force-deletion"
Member

No need for the ceph prefix, so we can use this annotation for any type of resource in the future.

Suggested change
RESOURCE_CLEANUP_ANNOTATION = "ceph.rook.io/force-deletion"
RESOURCE_CLEANUP_ANNOTATION = "rook.io/force-deletion"


// Start a new job to perform clean up of the ceph resources. It returns true if the cleanup job has succeeded
func (c *ResourceCleanup) StartJob(ctx context.Context, clientset kubernetes.Interface) error {
jobName := k8sutil.TruncateNodeNameForJob("cleanup-job-%s", c.resource.GetName())
Member

We need some uniqueness in the resource naming so we don't conflict if other resource types in the future have the same resource name, something like this:

Suggested change
jobName := k8sutil.TruncateNodeNameForJob("cleanup-job-%s", c.resource.GetName())
jobName := k8sutil.TruncateNodeNameForJob("cleanup-svg-%s-%s", svg.Spec.FilesystemName, c.resource.GetName())

Contributor Author

New job name

❯ oc get jobs -n rook-ceph
NAME                             COMPLETIONS   DURATION   AGE
cleanup-svg-myfs-myfs-csi        1/1           10s        2m43s

@sp98 (Contributor Author) commented Apr 9, 2024

How is testing looking? In a separate PR we could add integration tests, as long as manual testing with various scenarios is covered.

Manual testing of the following scenarios looks good:

  • Deleting OMAP data
  • Deleting snapshots
  • Cancelling clones
  • Deleting subvolumes

@Madhu-1 mentioned that the subvolumes created for the clones have to be deleted as well. That is not covered in the implementation as of now.

@travisn (Member) left a comment

LGTM, just a few nits

}
err = cephclient.DeleteSubVolume(context, clusterInfo, fsName, subVolume.Name, svg)
if err != nil {
retErr = errors.Wrapf(err, "failed to delete subvolume Group %q.", subVolume.Name)
Member

Suggested change
retErr = errors.Wrapf(err, "failed to delete subvolume Group %q.", subVolume.Name)
retErr = errors.Wrapf(err, "failed to delete subvolume group %q.", subVolume.Name)

return "", errors.Wrapf(err, "failed to list omapKeys for omapObj %q", omapObj)
}

// TODO: Find better solution
Member

What better solution do we need? If this parsing is working, perhaps it's fine?

@@ -339,7 +340,17 @@ func (r *ReconcileCephFilesystemSubVolumeGroup) deleteSubVolumeGroup(cephFilesys
// If the subvolume group has subvolumes the command will fail with:
// Error ENOTEMPTY: error in rmdir /volumes/csi
if ok && (code == int(syscall.ENOTEMPTY)) {
return errors.Wrapf(err, "failed to delete ceph filesystem subvolume group %q, remove the subvolumes first", cephFilesystemSubVolumeGroup.Name)
msg := fmt.Sprintf("failed to delete ceph filesystem subvolume group %q, remove the subvolumes first", cephFilesystemSubVolumeGroup.Name)
if opcontroller.IsCleanupRequired(cephFilesystemSubVolumeGroup.GetAnnotations()) {
Member

This helper is just checking the annotations, right? How about this method name?

Suggested change
if opcontroller.IsCleanupRequired(cephFilesystemSubVolumeGroup.GetAnnotations()) {
if opcontroller.ForceDeleteRequested(cephFilesystemSubVolumeGroup.GetAnnotations()) {
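
A minimal sketch of what that helper boils down to, assuming the annotation key settles on `rook.io/force-deletion` as suggested earlier in this review; the constant name and the literal `"true"` comparison are illustrative.

```go
const forceDeletionAnnotation = "rook.io/force-deletion"

// ForceDeleteRequested reports whether the user explicitly asked Rook to clean up
// the resource. Matching the literal value "true" follows the documented usage
// above; the exact check in the PR may differ.
func ForceDeleteRequested(annotations map[string]string) bool {
	return annotations[forceDeletionAnnotation] == "true"
}
```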

opcontroller.CephFSSubVolumeGroupNameEnv: svg.Spec.Name,
opcontroller.CephFSNameEnv: svg.Spec.FilesystemName,
opcontroller.CSICephFSRadosNamesaceEnv: "csi",
opcontroller.CephFSMetaDataPoolNameEnv: fmt.Sprintf("%s-%s", svg.Spec.FilesystemName, "metadata"),
Member

How about calling generateMetaDataPoolName() here instead of re-implementing?

Contributor Author

using generateMetaDataPoolName() function.

@sp98 sp98 force-pushed the resource-cleanup branch 2 times, most recently from cd237f6 to 83dbba8 on April 9, 2024 19:27
@Madhu-1 (Member) left a comment

LGTM, small nit

@@ -87,3 +106,37 @@ func startCleanUp(cmd *cobra.Command, args []string) error {

return nil
}

func startSubVolumeGroupsCleanUp(cmd *cobra.Command, args []string) error {
Member

startSubVolumeGroupsCleanUp to startSubVolumeGroupCleanUp, as we are working with a single group here.


err := cleanup.SubVolumeGroupCleanup(context, clusterInfo, fsName, subVolumeGroupName, poolName, csiNamespace)
if err != nil {
rook.TerminateFatal(fmt.Errorf("failed to cleanup cephFS SubVolumeGroup %q in the namespace %q. %v", subVolumeGroupName, namespace, err))
Member

should we log filesystem name as well?

Contributor Author

logging it now.

for _, subVolume := range subVolumeList {
logger.Infof("starting clean up of subvolume %q", subVolume.Name)
err := CleanUpOMAPDetails(context, clusterInfo, subVolume.Name, poolName, csiNamespace)
if err != nil {
Member

We might also need to handle not-found errors as a future enhancement.

Contributor Author

Added it to the pending list in the PR description. Will take it as a future enhancement.

@Madhu-1 (Member) commented Apr 10, 2024

@sp98 can we also add a functional test for this one (as a new PR, not as part of this one)?

Cleanup the resources created by subvolumegroup
when its deleted. Following resources will be cleaned up:
- OMAP value
- OMAP keys
- Clones
- Snapshots
- Subvolumes

Signed-off-by: sp98 <[email protected]>
@travisn travisn merged commit 60efded into rook:master Apr 10, 2024
53 checks passed
travisn added a commit that referenced this pull request Apr 10, 2024