Skip to content

Handle crazy large email headers#412

Merged
jsuto merged 5 commits intomasterfrom
ms-header-size
Jan 23, 2026
Merged

Handle crazy large email headers#412
jsuto merged 5 commits intomasterfrom
ms-header-size

Conversation

@jsuto
Copy link
Owner

@jsuto jsuto commented Jan 23, 2026

Fixes #403

Signed-off-by: Janos SUTO <sj@acts.hu>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses Issue #403 by making digest/header parsing resilient to unusually large email headers (eg from Exchange), preventing valid EML files from being rejected with hdr_len:0.

Changes:

  • Add a new max_header_size config option (default 131072) and document it in the sample config.
  • Update make_digests() to use a heap-allocated buffer sized by max_header_size, and propagate digest calculation failures to callers.
  • Update several call sites to handle/report digest calculation errors.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/digest.c Switches digest read buffer to configurable heap allocation and adds error returns/logging.
src/cfg.h Adds max_header_size to the config struct.
src/cfg.c Adds parsing/default for max_header_size.
etc/example.conf Documents max_header_size and provides a default value.
src/import.c Handles make_digests() failure before rejecting on hdr_len.
src/piler.c Aborts processing when digest calculation fails.
src/pilerexport.c Reports digest calculation errors during export; changes default index list.
src/test.c Adds basic error reporting for digest calculation failures.
Comments suppressed due to low confidence (1)

src/pilerexport.c:407

  • If make_digests() fails, the code still performs the digest/bodydigest comparisons and reports a generic “verification FAILED”. This conflates “could not compute digests” with “digests mismatch”. Consider short-circuiting on the error (eg mark verification_status and print a distinct message) to avoid misleading results.
               // I want to export no matter what, so an error won't block
               if(make_digests(sdata, cfg)) {
                  printf("digest calculation error: %s\n", filename);
               }

               if(strcmp(digest, sdata->digest) == 0 && strcmp(bodydigest, sdata->bodydigest) == 0){
                  printf("exported: %10llu\r", n); fflush(stdout);
               }
               else {
                  printf("verification FAILED. %s\n", filename);

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 69 to 78
unsigned char *buf = calloc(1, cfg->max_header_size);
if(!buf) {
syslog(LOG_PRIORITY, "ERROR: calloc() %s:%d", __func__, __LINE__);
EVP_MD_CTX_free(ctx);
EVP_MD_CTX_free(ctx2);
return 1;
}

while((n = read(fd, buf, sizeof(buf))) > 0){
while((n = read(fd, buf, cfg->max_header_size)) > 0){
EVP_DigestUpdate(ctx2, buf, n);
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search_header_end() iterates until it hits a NUL byte, but buf is allocated as exactly max_header_size and read() can fill it completely. When n == cfg->max_header_size, there may be no terminating NUL, so search_header_end() can read past the buffer. Ensure the buffer is NUL-terminated for the first chunk (eg allocate max_header_size+1 and set buf[n]=0, or switch search_header_end() to a length-bounded scan).

Copilot uses AI. Check for mistakes.
Comment on lines 69 to 77
unsigned char *buf = calloc(1, cfg->max_header_size);
if(!buf) {
syslog(LOG_PRIORITY, "ERROR: calloc() %s:%d", __func__, __LINE__);
EVP_MD_CTX_free(ctx);
EVP_MD_CTX_free(ctx2);
return 1;
}

while((n = read(fd, buf, sizeof(buf))) > 0){
while((n = read(fd, buf, cfg->max_header_size)) > 0){
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cfg->max_header_size is parsed as an int and used as a size for calloc/read. A zero/negative value can lead to unexpected behavior (calloc(…,0), huge size_t conversion, etc.). Validate it before use (eg enforce a sane minimum/maximum and return an error if out of range).

Copilot uses AI. Check for mistakes.
Comment on lines 69 to 75
unsigned char *buf = calloc(1, cfg->max_header_size);
if(!buf) {
syslog(LOG_PRIORITY, "ERROR: calloc() %s:%d", __func__, __LINE__);
EVP_MD_CTX_free(ctx);
EVP_MD_CTX_free(ctx2);
return 1;
}
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If calloc() fails after open() succeeds, the function returns without closing fd, leaking a file descriptor. Close(fd) on this error path (and consider a common cleanup label to handle all early returns).

Copilot uses AI. Check for mistakes.
Comment on lines 62 to 74
if(fd == -1) {
syslog(LOG_PRIORITY, "ERROR: open() %s:%d", __func__, __LINE__);
EVP_MD_CTX_free(ctx);
EVP_MD_CTX_free(ctx2);
return 1;
}

memset(buf, 0, sizeof(buf));
unsigned char *buf = calloc(1, cfg->max_header_size);
if(!buf) {
syslog(LOG_PRIORITY, "ERROR: calloc() %s:%d", __func__, __LINE__);
EVP_MD_CTX_free(ctx);
EVP_MD_CTX_free(ctx2);
return 1;
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The syslog messages for open()/calloc() failures don’t include the filename or strerror(errno), which makes diagnosing import failures harder. Log sdata->filename and the OS error string/value.

Copilot uses AI. Check for mistakes.
int verbosity = 0;
int max_matches = 1000;
char *index_list = "main1,dailydelta1,delta1";
char *index_list = "piler1";
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the default Sphinx index list to "piler1" can break setups using the non-RT indices (main1,dailydelta1,delta1). Consider selecting the default after reading config based on cfg->rtindex (eg cfg->sphxdb when rtindex=1, otherwise the legacy list), similar to src/stats.c:123-126 where RT vs non-RT index names are handled.

Suggested change
char *index_list = "piler1";
char *index_list = "main1,dailydelta1,delta1";

Copilot uses AI. Check for mistakes.
Comment on lines 69 to 77
unsigned char *buf = calloc(1, cfg->max_header_size);
if(!buf) {
syslog(LOG_PRIORITY, "ERROR: calloc() %s:%d", __func__, __LINE__);
EVP_MD_CTX_free(ctx);
EVP_MD_CTX_free(ctx2);
return 1;
}

while((n = read(fd, buf, sizeof(buf))) > 0){
while((n = read(fd, buf, cfg->max_header_size)) > 0){
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is meant to fix hdr_len=0 for unusually large headers, but there’s no unit test covering a header larger than the previous fixed BIGBUFSIZE. Add a digest unit test case (eg in unit_tests/check_digest.c) with an EML having a very large header and cfg->max_header_size set accordingly, asserting hdr_len>0 and stable digests.

Copilot uses AI. Check for mistakes.
jsuto and others added 4 commits January 23, 2026 06:41
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Added sanity check for max_header_size value.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +69 to +70
unsigned char *buf = calloc(1, cfg->max_header_size);
if(!buf) {
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces a new behavior/config knob (max_header_size) intended to fix hdr_len=0 for very large headers, but there’s no regression test covering a message with headers larger than the previous fixed buffer. Since unit_tests/check_digest.c already covers make_digests(), please add a large-header fixture and test case validating hdr_len>0 and expected digests when max_header_size is increased.

Copilot uses AI. Check for mistakes.
Comment on lines +69 to +71
unsigned char *buf = calloc(1, cfg->max_header_size);
if(!buf) {
syslog(LOG_PRIORITY, "ERROR: calloc() %s:%d", __func__, __LINE__);
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search_header_end() and the strcasestr() call inside it require a NUL-terminated buffer, but buf is allocated at exactly cfg->max_header_size and read() may fill it completely (n == max_header_size), leaving no terminator and causing out-of-bounds reads/undefined behavior. Allocate one extra byte and ensure buf[n] = '\0' before calling search_header_end (or rewrite search_header_end to iterate by length only).

Copilot uses AI. Check for mistakes.
Comment on lines +78 to 79
while((n = read(fd, buf, cfg->max_header_size)) > 0){
EVP_DigestUpdate(ctx2, buf, n);
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If read() fails (returns -1), the loop exits and the function currently proceeds to finalize and return success using partial data. Detect read errors (eg, check errno / n < 0 after the loop) and return an error after logging, while still freeing buf and closing fd.

Copilot uses AI. Check for mistakes.
Comment on lines +62 to +66
if(fd == -1) {
syslog(LOG_PRIORITY, "ERROR: open() %s:%d", __func__, __LINE__);
EVP_MD_CTX_free(ctx);
EVP_MD_CTX_free(ctx2);
return 1;
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The open() failure log doesn’t include the filename or strerror(errno), which makes troubleshooting difficult. Include sdata->filename and strerror(errno) (and ideally errno) in this syslog message.

Copilot uses AI. Check for mistakes.
Comment on lines +212 to +215
// Some basic sanity checks
if(cfg.max_header_size < 1000) {
__fatal("invalid max_header_size value");
}
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_header_size directly controls a per-message allocation in make_digests(); validating only a minimum value allows configurations that can cause excessive memory use or integer edge cases. Consider adding an upper bound (eg <= max_message_size and/or a reasonable hard cap) and validating it’s positive before use.

Copilot uses AI. Check for mistakes.
@jsuto jsuto merged commit 55cc103 into master Jan 23, 2026
12 of 14 checks passed
@jsuto jsuto deleted the ms-header-size branch January 23, 2026 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Valid EML, reported as hdr_len:0 and not imported

1 participant