-
Notifications
You must be signed in to change notification settings - Fork 344
DAOS-18881 object: increment counters on csum errors #18158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1090,14 +1090,26 @@ obj_singv_ec_rw_filter(daos_unit_oid_t oid, struct daos_oclass_attr *oca, | |
| } | ||
|
|
||
| static inline void | ||
| obj_log_csum_err(daos_unit_oid_t oid) | ||
| obj_log_csum_err(daos_unit_oid_t oid, bool ondisk) | ||
| { | ||
| struct dss_module_info *info = dss_get_module_info(); | ||
| struct bio_xs_context *bxc; | ||
|
|
||
| D_ASSERT(info != NULL); | ||
| ras_notify_eventf(RAS_OBJ_CSUM_ERR, RAS_TYPE_INFO, RAS_SEV_ERROR, NULL, NULL, NULL, NULL, | ||
| NULL, NULL, NULL, NULL, NULL, "CSUM error for " DF_UOID " on target %u\n", | ||
| DP_UOID(oid), info->dmi_tgt_id); | ||
|
|
||
| if (ondisk) { | ||
| bxc = info->dmi_nvme_ctxt; | ||
| if (bxc == NULL) { | ||
| D_ERROR("BIO NVMe context not initialized for xs:%d, tgt:%d\n", info->dmi_xs_id, | ||
| info->dmi_tgt_id); | ||
| return; | ||
| } | ||
|
|
||
| bio_log_data_csum_err(bxc); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -1444,6 +1456,7 @@ obj_local_rw_internal(crt_rpc_t *rpc, struct obj_io_context *ioc, daos_iod_t *io | |
| uint64_t bio_pre_latency = 0; | ||
| uint64_t bio_post_latency = 0; | ||
| uint32_t tgt_off = 0; | ||
| bool rc_ondisk = false; | ||
| int rc = 0; | ||
|
|
||
| create_map = orw->orw_flags & ORF_CREATE_MAP; | ||
|
|
@@ -1784,11 +1797,13 @@ obj_local_rw_internal(crt_rpc_t *rpc, struct obj_io_context *ioc, daos_iod_t *io | |
|
|
||
| rc = obj_verify_bio_csum(orw->orw_oid.id_pub, iods, iod_csums, | ||
| biod, ioc->ioc_coc->sc_csummer, iods_nr); | ||
| if (rc != 0) | ||
| if (rc != 0) { | ||
| rc_ondisk = true; | ||
| D_ERROR(DF_C_UOID_DKEY " verify_bio_csum failed: " | ||
| DF_RC"\n", | ||
| DP_C_UOID_DKEY(orw->orw_oid, dkey), | ||
| DP_RC(rc)); | ||
| } | ||
| /** CSUM Verified on update, now corrupt to fake corruption | ||
| * on disk | ||
| */ | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Set the rc_ondisk as true for this faulty injection? (I suppose it's for simulating disk corruption).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is funny because your comment here literally contradicts the comment you wrote above. 😉 Corruption here either may be caused by disk or it may not. Regardless whether it is injected or not. What do you think?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A real csum mismatch detected here should be caused by network transfer or client. The purpose of fault injection is to simulate disk corruption by writing corrupted data to SSD, though I read the code closer and realized that this simulated corruption won't be detected immediately, I guess it would be detected on client on follow-on fetch, so this fault injection won't work as expected... Anyway, setting rc_ondisk as true here is useless. |
||
|
|
@@ -1807,10 +1822,11 @@ obj_local_rw_internal(crt_rpc_t *rpc, struct obj_io_context *ioc, daos_iod_t *io | |
| obj_iod_recx_daos2vos(iods_nr, iods, &ioc->ioc_oca); | ||
|
|
||
| rc = obj_fetch_create_maps(rpc, biod, iods, iods_nr, skips); | ||
| D_ASSERT(rc != -DER_CSUM); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the purpose of this assert?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This function is very long. And there is a number of execution paths ending quickly or jumping over this place in code. I identified two paths which may result in calling The chances are slim someone will change this in the future but if it would have happened it would affect this on-disk/no-on-disk logic. I want to make sure it won't get unnoticed. |
||
| } | ||
|
|
||
| if (rc == -DER_CSUM) | ||
| obj_log_csum_err(orw->orw_oid); | ||
| obj_log_csum_err(orw->orw_oid, rc_ondisk); | ||
| post: | ||
| time = daos_get_ntime(); | ||
| rc = bio_iod_post_async(biod, rc); | ||
|
|
@@ -3053,7 +3069,7 @@ ds_obj_rw_handler(crt_rpc_t *rpc) | |
| struct dtx_handle *dth; | ||
|
|
||
| if (orw->orw_flags & ORF_CSUM_REPORT) { | ||
| obj_log_csum_err(orw->orw_oid); | ||
| obj_log_csum_err(orw->orw_oid, false /** ondisk */); | ||
| D_GOTO(out, rc = 0); | ||
| } | ||
|
|
||
|
|
@@ -4890,7 +4906,7 @@ ds_cpd_handle_one(crt_rpc_t *rpc, struct daos_cpd_sub_head *dcsh, struct daos_cp | |
| ioc->ioc_coc->sc_csummer, piod_nrs[i]); | ||
| if (rc != 0) { | ||
| if (rc == -DER_CSUM) | ||
| obj_log_csum_err(dcsr->dcsr_oid); | ||
| obj_log_csum_err(dcsr->dcsr_oid, true /** ondisk */); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here is to verify update RPC from client, so the 'ondisk' should be false, right?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Again, does it check the data after transit but before storing it? Or both are checked at the same time?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, I think it's before storing data. |
||
| goto out; | ||
| } | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's verifying update RPC from client, so the rc_ondisk should be false, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not exactly sure. Does the data+csum are checked after the transmission but before being stored here?
If they are not, csum error here might have resulted from both network and/or disk corruption and there is no clear way of differentiate between these two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it's verifying data transferred from client before storing to SSD.