Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
26 changes: 21 additions & 5 deletions src/object/srv_obj.c
Original file line number Diff line number Diff line change
Expand Up @@ -1090,14 +1090,26 @@ obj_singv_ec_rw_filter(daos_unit_oid_t oid, struct daos_oclass_attr *oca,
}

static inline void
obj_log_csum_err(daos_unit_oid_t oid)
obj_log_csum_err(daos_unit_oid_t oid, bool ondisk)
{
struct dss_module_info *info = dss_get_module_info();
struct bio_xs_context *bxc;

D_ASSERT(info != NULL);
ras_notify_eventf(RAS_OBJ_CSUM_ERR, RAS_TYPE_INFO, RAS_SEV_ERROR, NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, "CSUM error for " DF_UOID " on target %u\n",
DP_UOID(oid), info->dmi_tgt_id);

if (ondisk) {
bxc = info->dmi_nvme_ctxt;
if (bxc == NULL) {
D_ERROR("BIO NVMe context not initialized for xs:%d, tgt:%d\n", info->dmi_xs_id,
info->dmi_tgt_id);
return;
}

bio_log_data_csum_err(bxc);
}
}

/**
Expand Down Expand Up @@ -1444,6 +1456,7 @@ obj_local_rw_internal(crt_rpc_t *rpc, struct obj_io_context *ioc, daos_iod_t *io
uint64_t bio_pre_latency = 0;
uint64_t bio_post_latency = 0;
uint32_t tgt_off = 0;
bool rc_ondisk = false;
int rc = 0;

create_map = orw->orw_flags & ORF_CREATE_MAP;
Expand Down Expand Up @@ -1784,11 +1797,13 @@ obj_local_rw_internal(crt_rpc_t *rpc, struct obj_io_context *ioc, daos_iod_t *io

rc = obj_verify_bio_csum(orw->orw_oid.id_pub, iods, iod_csums,
biod, ioc->ioc_coc->sc_csummer, iods_nr);
if (rc != 0)
if (rc != 0) {
rc_ondisk = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's verifying update RPC from client, so the rc_ondisk should be false, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not exactly sure. Does the data+csum are checked after the transmission but before being stored here?

If they are not, csum error here might have resulted from both network and/or disk corruption and there is no clear way of differentiate between these two.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's verifying data transferred from client before storing to SSD.

D_ERROR(DF_C_UOID_DKEY " verify_bio_csum failed: "
DF_RC"\n",
DP_C_UOID_DKEY(orw->orw_oid, dkey),
DP_RC(rc));
}
/** CSUM Verified on update, now corrupt to fake corruption
* on disk
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Set the rc_ondisk as true for this faulty injection? (I suppose it's for simulating disk corruption).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is funny because your comment here literally contradicts the comment you wrote above. 😉

Corruption here either may be caused by disk or it may not. Regardless whether it is injected or not.

What do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A real csum mismatch detected here should be caused by network transfer or client.

The purpose of fault injection is to simulate disk corruption by writing corrupted data to SSD, though I read the code closer and realized that this simulated corruption won't be detected immediately, I guess it would be detected on client on follow-on fetch, so this fault injection won't work as expected...

Anyway, setting rc_ondisk as true here is useless.

Expand All @@ -1807,10 +1822,11 @@ obj_local_rw_internal(crt_rpc_t *rpc, struct obj_io_context *ioc, daos_iod_t *io
obj_iod_recx_daos2vos(iods_nr, iods, &ioc->ioc_oca);

rc = obj_fetch_create_maps(rpc, biod, iods, iods_nr, skips);
D_ASSERT(rc != -DER_CSUM);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this assert?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is very long. And there is a number of execution paths ending quickly or jumping over this place in code.

I identified two paths which may result in calling obj_log_csum_err() below. This one got excluded on the assumption obj_fetch_create_maps() cannot result in -DER_CSUM which is true at this point in time.

The chances are slim someone will change this in the future but if it would have happened it would affect this on-disk/no-on-disk logic. I want to make sure it won't get unnoticed.

}

if (rc == -DER_CSUM)
obj_log_csum_err(orw->orw_oid);
obj_log_csum_err(orw->orw_oid, rc_ondisk);
post:
time = daos_get_ntime();
rc = bio_iod_post_async(biod, rc);
Expand Down Expand Up @@ -3053,7 +3069,7 @@ ds_obj_rw_handler(crt_rpc_t *rpc)
struct dtx_handle *dth;

if (orw->orw_flags & ORF_CSUM_REPORT) {
obj_log_csum_err(orw->orw_oid);
obj_log_csum_err(orw->orw_oid, false /** ondisk */);
D_GOTO(out, rc = 0);
}

Expand Down Expand Up @@ -4890,7 +4906,7 @@ ds_cpd_handle_one(crt_rpc_t *rpc, struct daos_cpd_sub_head *dcsh, struct daos_cp
ioc->ioc_coc->sc_csummer, piod_nrs[i]);
if (rc != 0) {
if (rc == -DER_CSUM)
obj_log_csum_err(dcsr->dcsr_oid);
obj_log_csum_err(dcsr->dcsr_oid, true /** ondisk */);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is to verify update RPC from client, so the 'ondisk' should be false, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, does it check the data after transit but before storing it? Or both are checked at the same time?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's before storing data.

goto out;
}

Expand Down
Loading