feat: redact PII fields before delete in user retirement flows#38426
feat: redact PII fields before delete in user retirement flows#38426ktyagiapphelix2u wants to merge 32 commits into
Conversation
0d5d5e6 to
2593ac2
Compare
f7e8036 to
07bf642
Compare
Please elaborate the problem statement and solution provided. |
3905bd5 to
03fce31
Compare
f340dae to
ab7b642
Compare
robrap
left a comment
There was a problem hiding this comment.
I didn't get through the review fully. Sorry. So just leaving you with what I have.
robrap
left a comment
There was a problem hiding this comment.
Thanks. Is looking good and getting close. We're almost there.
| table, | ||
| num_redact_delete_pairs=1, | ||
| require_id_filter=False, | ||
| expected_redacted_value=None, |
There was a problem hiding this comment.
Is there a reason why this is optional? At least one of the tests doesn't send this. Is there a reason?
There was a problem hiding this comment.
That is fair. I kept expected_redacted_value optional for helper reuse, but I updated the current PendingEmailChange tests to pass the explicit redacted value so coverage is stronger and no test relies on the default here.
There was a problem hiding this comment.
- I still don't understand why this is optional. This is a test for redaction. The update shouldn't be any update, but one that includes the redacted value. Is there a call where you couldn't send the redacted value?
- I'm not clear on the usefulness of
require_id_filter. Do we care how it is filtered? I would consider dropping this and the accompanying assertions, but maybe you could explain its importance?
There was a problem hiding this comment.
-
agreed on both. I've made expected_redacted_value a required positional parameter there's no valid call site that couldn't provide it, so making it optional was just unnecessary flexibility.
-
Dropped require_id_filter entirely as well, we don't need to assert how the rows are filtered, only that the redaction UPDATE happens before the DELETE, which is what the helper already verifies.
There was a problem hiding this comment.
Both issues are fixed.
expected_redacted_value is now a required positional parameter since every valid call site already provides it, so the optional default was unnecessary.
require_id_filter and its related assertions have been removed. We only need to verify that the redaction UPDATE happens before the DELETE
| record_ids = list(records_matching_user_value.values_list('id', flat=True)) | ||
| records_matching_ids = cls.objects.filter(id__in=record_ids) |
There was a problem hiding this comment.
- If needed at all, shouldn't this live in
if redact_fields:conditional, so it is only needed when we actually need to do an update? - What is the difference between
records_matching_user_valueandrecords_matching_ids, and why do we need to do another lookup? The original code already hadrecords_matching_user_value.delete(), which is presumably deleting the correct records. Isn't this what we want to update?
There was a problem hiding this comment.
I’ve updated the mixin so the second lookup only happens when redaction is actually needed.
-
records_matching_user_valueis the original queryset filtered by the user field (for example email=value). -
We only build the ID-based queryset when redact_fields is non-empty, because redacting may change that same filter field, and we still need to delete the exact same rows.
-
For non-redacting models, we now directly call delete on the original queryset (no extra lookup).
|
|
||
| @classmethod | ||
| def redact_before_delete_fields(cls): | ||
| return {'email': 'redacted-before-delete@safe.com'} |
There was a problem hiding this comment.
Can we adjust this to include two fields to ensure that would work appropriately?
There was a problem hiding this comment.
yes updated
| with mock.patch.object(model_cls, 'objects', create=True) as mock_objects: | ||
| mock_objects.filter.return_value = queryset |
There was a problem hiding this comment.
All this mocking makes me nervous. Can the test models be updated to be real Django models, like?:
class NonRedactingModel(DeletableByUserValue, models.Model):
Not sure how simple this would be, but the less mocking the better, especially in this case.
There was a problem hiding this comment.
Thanks. but
I kept this test mocked because it is a small unit test for mixin logic only.
Making real Django models here would add a lot of setup and make the test harder to read.
We already test this behavior with real database models in:
common/djangoapps/student/tests/test_models.py
openedx/core/djangoapps/course_groups/tests/test_cohorts.py
openedx/core/djangoapps/user_api/accounts/tests/test_retirement_views.py
So this test stays simple, and real DB behavior is already covered elsewhere.
There was a problem hiding this comment.
@ktyagiapphelix2u: I think mock or non-mock would be fine, so feel free to choose what you prefer. However, I'm confused, because I thought when we met you said you liked the full Django model version better. Did you have trouble with it when you removed from INSTALLED_APPS, or something else?
robrap
left a comment
There was a problem hiding this comment.
I didn't get through review, but here is something...
| table, | ||
| num_redact_delete_pairs=1, | ||
| require_id_filter=False, | ||
| expected_redacted_value=None, |
There was a problem hiding this comment.
- I still don't understand why this is optional. This is a test for redaction. The update shouldn't be any update, but one that includes the redacted value. Is there a call where you couldn't send the redacted value?
- I'm not clear on the usefulness of
require_id_filter. Do we care how it is filtered? I would consider dropping this and the accompanying assertions, but maybe you could explain its importance?
| # When redacting the field used for filtering, switch to ID-based | ||
| # operations so the subsequent delete still targets the same rows. | ||
| record_ids = list(records_matching_user_value.values_list('id', flat=True)) | ||
| records_matching_ids = cls.objects.filter(id__in=record_ids) | ||
| records_matching_ids.update(**redact_fields) |
There was a problem hiding this comment.
[New comment for addressing https://github.com//pull/38426#discussion_r3309021075]
I still don't understand why this can't be records_matching_user_value.update(**redact_fields). The existing code uses records_matching_user_value.delete(), and includes all records that need to be deleted. We simply want to redact all records we are about to delete.
Does this code not work for some reason, or are you just trying to delete values outside of the scope of how this function was called? Doesn't that seem dangerous? We only want to redact what is being deleted, and we don't want to delete anything different from what this was already deleting.
There was a problem hiding this comment.
@robrap I got it know that records_matching_user_value.update(**redact_fields) is the correct and simpler approach in most cases and the code now uses it by default. The ID-based path is only taken in the one specific case where it's unavoidable when the filter field is the same field being redacted (e.g. CourseEnrollmentAllowed and UnregisteredLearnerCohortAssignments both filter by email and redact email).
In that case, records_matching_user_value.update(email='redacted-before-delete@safe.com') succeeds, but the subsequent records_matching_user_value.delete() generates DELETE WHERE email = 'original@example.com' — which now matches zero rows because the email was just changed. The rows get redacted but never deleted.
Capturing IDs before the update is the minimal fix for that specific situation, and it only affects those two models. All other callers (filtering by user, not email) now use the simple path you suggested.
Updated the CODE
robrap
left a comment
There was a problem hiding this comment.
@ktyagiapphelix2u: It's looking great. Very minor proposals. Let's try to look at your final changes together on Monday so we can merge. Thank you.
| """ | ||
|
|
||
|
|
||
| def assert_update_before_delete( |
There was a problem hiding this comment.
Let's rename this to assert_redact_before_delete.
| Assert that UPDATE and DELETE queries for ``table`` occur in consecutive pairs, | ||
| and that each UPDATE contains the expected redacted value. |
There was a problem hiding this comment.
Proposed update:
| Assert that UPDATE and DELETE queries for ``table`` occur in consecutive pairs, | |
| and that each UPDATE contains the expected redacted value. | |
| Assert that PII is redacted before deleted. | |
| Redacting before deleting protects downstream sync of data from holding PII | |
| in soft-deleted records. This helper ensures UPDATE and DELETE queries for | |
| ``table`` occur in consecutive pairs, and that each UPDATE contains the | |
| expected redacted value. |
| @@ -0,0 +1,34 @@ | |||
| """ | |||
| SQL assertion helpers for tests. | |||
There was a problem hiding this comment.
Let's just move this to https://github.com/openedx/openedx-platform/blob/master/openedx/core/djangolib/testing/utils.py, rather than creating a new utility file.
| with mock.patch.object(model_cls, 'objects', create=True) as mock_objects: | ||
| mock_objects.filter.return_value = queryset |
There was a problem hiding this comment.
@ktyagiapphelix2u: I think mock or non-mock would be fine, so feel free to choose what you prefer. However, I'm confused, because I thought when we met you said you liked the full Django model version better. Did you have trouble with it when you removed from INSTALLED_APPS, or something else?
| def delete_by_user_value(cls, value, field): | ||
| """ | ||
| Deletes instances of this model where ``field`` equals ``value``. | ||
| Redacts and deletes instances of this model where ``field`` equals ``value``. |
There was a problem hiding this comment.
Minor update:
| Redacts and deletes instances of this model where ``field`` equals ``value``. | |
| Optionally redacts and always deletes instances of this model where ``field`` equals ``value``. |
| If ``redact_before_delete_fields`` returns a non-empty dict, matching | ||
| records are bulk-updated with those redacted values before delete. |
There was a problem hiding this comment.
Proposed update:
| If ``redact_before_delete_fields`` returns a non-empty dict, matching | |
| records are bulk-updated with those redacted values before delete. | |
| If ``redact_before_delete_fields()`` returns a non-empty dict, the | |
| returned PII fields are redacted before any records are deleted. |
| def assert_update_before_delete( | ||
| sql_list, | ||
| table, | ||
| expected_redacted_value, |
There was a problem hiding this comment.
We might want this to be expected_redacted_value_list, and we can ensure it is non-empty, and then loop over it and check each value, in case we redact multiple values. You don't need to check that the right field is updated for each field, although we could. That would require passing in the field as well, so maybe we just keep it simple.
Note: This will allow this to be re-used for the social auth PR, I hope.
Summary
Adds support for redacting sensitive PII fields before deletion in DeletableByUserValue models.
Changes include:
Ticket & Reference
https://2u-internal.atlassian.net/browse/BOMS-498
Related PR
https://2u-internal.atlassian.net/browse/BOMS-565
https://2u-internal.atlassian.net/browse/BOMS-564