JITSU-74 feat(ingest): capture request headers into event context #1343
vklimontovich wants to merge 8 commits into
Add context.headers to the event so destinations can see the raw HTTP headers (accept, content-type, sec-fetch-*, sec-ch-ua*, ...) and tell real browser traffic from bots/agents.
- Browser endpoint derives context.headers from the request only; the body can't redefine them (a browser can't read its own headers anyway).
- S2S endpoint captures the forwarding request's headers but lets the caller override allow-listed headers via the body to forward the original device's headers.
- cookie/authorization are stripped and the write key is masked, so secrets don't leak to destinations.
Types: add AnalyticsContext.headers and an optional RuntimeFacade.headers() so a Node integration can supply the original headers; jitsu-js wires it into the built context (no-op in the browser).
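The difference between the two endpoints described above can be sketched in Go as follows. This is a minimal illustration, not the code in router.go: the function name mergeHeaders, the allowOverride flag and the short overridable list are assumptions made for the example.

```go
package main

import "strings"

// overridable is a hypothetical allow-list of headers that an S2S caller
// may forward on behalf of the original device.
var overridable = map[string]bool{
	"accept":          true,
	"accept-language": true,
	"user-agent":      true,
}

// mergeHeaders starts from the headers of the actual HTTP request.
// When allowOverride is false (browser endpoint) the body is ignored.
// When it is true (S2S endpoint) allow-listed body headers replace
// the request values; everything else in the body is discarded.
func mergeHeaders(request, body map[string]string, allowOverride bool) map[string]string {
	out := make(map[string]string, len(request))
	for name, value := range request {
		out[strings.ToLower(name)] = value
	}
	if !allowOverride {
		return out
	}
	for name, value := range body {
		key := strings.ToLower(name)
		if overridable[key] {
			out[key] = value
		}
	}
	return out
}
```

With this shape, a browser client cannot spoof its headers through the payload, while a server-side forwarder can still pass along the original device's user-agent.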
Reviewed the changes in bulker/ingest/router.go, libs/jitsu-js/src/analytics-plugin.ts, and types/protocols/analytics.d.ts.
The overall direction makes sense (capturing request headers into context.headers and masking sensitive values), but I found one correctness/security edge case in the Go implementation and left an inline comment with details.
Reviewed the changes in bulker/ingest/router.go, libs/jitsu-js/src/analytics-plugin.ts, and types/protocols/analytics.d.ts, focusing on correctness and security implications of the new context.headers flow. I found two issues worth addressing: (1) context.headers sanitization can be bypassed for __sql_type* keys because headers are stored as a plain map, and (2) the current sensitive-header filter is too narrow and may leak credentials carried in custom auth headers.
// browser clients cannot read their own request headers and must not be able to
// spoof them: always derive context.headers from the actual request, ignoring
// whatever the body provided.
ctx.Set("headers", buildContextHeaders(c, nil))
We must find a way to tell the destination to store this object as JSON or as a string containing marshaled JSON.
The __sql_type hint cannot be used: it would require knowing the specific data warehouse in advance.
Implemented in 3b2a95a using bulker's existing warehouse-agnostic JSON data type (types.JSON, enum value 6 in bulkerlib/types/datatype.go). The builtin bulker destination now always declares context_headers in the schema stream option, which:
- puts the key into notFlatteningKeys so the nested object is kept as a single value, and
- maps to the native JSON type per warehouse via GetSQLType: jsonb (Postgres), SUPER (Redshift), JSON (BigQuery/MySQL/ClickHouse-with-JSON), text (Snowflake).
No __sql_type hint and no per-warehouse knowledge needed.
Required two bulker fixes: the schema option now parses from the already-unmarshalled streamOptions header object, and schema-derived notFlatteningKeys are name-transformed so the match works on Snowflake.
Integration-tested with a nested context.headers object on Postgres+MySQL in both batch and stream modes (TestSchemaOptionNestedObject).
Flip header capture from a cookie/authorization blocklist to an allowlist of standard content-negotiation, fetch-metadata and client-hint headers. Non-allow-listed headers keep their name (presence is still a bot-detection signal) but the value is masked, so credentials in custom auth headers (x-api-key, proxy-authorization, vendor JWTs) can't reach destinations.
__sql_type* header keys are dropped entirely: context.headers is a plain map that bypasses types.FilterEvent, so such a key would otherwise become a raw SQL type hint (DDL injection) downstream.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
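The allowlist approach in this commit could look roughly like the sketch below. The names sanitizeHeaders and maskedValue, the mask string, and the exact allow-list contents are assumptions for illustration; the real list and masking format live in router.go and may differ.

```go
package main

import "strings"

// maskedValue is a hypothetical placeholder written in place of any
// value whose header name is not allow-listed.
const maskedValue = "***"

// allowedExact and allowedPrefixes are an assumed allow-list of
// content-negotiation, fetch-metadata and client-hint headers.
var allowedExact = map[string]bool{
	"accept": true, "accept-language": true, "accept-encoding": true,
	"content-type": true, "user-agent": true, "referer": true, "dnt": true,
}

var allowedPrefixes = []string{"sec-fetch-", "sec-ch-ua"}

func isAllowed(name string) bool {
	if allowedExact[name] {
		return true
	}
	for _, prefix := range allowedPrefixes {
		if strings.HasPrefix(name, prefix) {
			return true
		}
	}
	return false
}

// sanitizeHeaders lower-cases names, drops __sql_type* keys entirely,
// keeps allow-listed values and masks every other value while keeping
// the header name as a presence signal.
func sanitizeHeaders(in map[string]string) map[string]string {
	out := make(map[string]string, len(in))
	for name, value := range in {
		key := strings.ToLower(name)
		switch {
		case strings.HasPrefix(key, "__sql_type"):
			continue
		case isAllowed(key):
			out[key] = value
		default:
			out[key] = maskedValue
		}
	}
	return out
}
```

Masking by default means a newly invented credential header is protected without anyone having to remember to add it to a blocklist.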
Header names are client-controlled, so flattening context.headers would create unbounded warehouse columns. The builtin bulker destination now declares context_headers in the stream schema option with bulker's warehouse-agnostic JSON data type (enum value 6), which both prevents flattening and maps to the native JSON type per warehouse (jsonb / SUPER / JSON / string). Applied to every data layout except jitsu-legacy, which never carries context.headers.
Supporting bulker fixes:
- schema option ParseFunc accepts map[string]any: the streamOptions header is unmarshalled into a map before options are parsed, so a nested schema object previously failed with "invalid value type"
- notFlatteningKeys derived from the schema option are now name-transformed, matching the transformed key paths _mapForDwh compares (schema-driven non-flattening never matched on Snowflake)
- TypeFromValue recognizes ordered JSON objects (types.Object is *jsonorder.OrderedMap; reflect sees Ptr, not Map)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
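To illustrate why declaring context_headers matters, here is a simplified sketch of flattening with a set of keys that must not be flattened. It is not bulker's _mapForDwh; the function name flatten, the "_" separator and the plain map[string]any event type are assumptions for the example.

```go
package main

// flatten collapses nested objects into "parent_child" keys.
// Any path present in keep is stored as a single value instead,
// so its children never turn into separate columns.
func flatten(prefix string, in map[string]any, keep map[string]bool, out map[string]any) {
	for key, value := range in {
		path := key
		if prefix != "" {
			path = prefix + "_" + key
		}
		nested, isObject := value.(map[string]any)
		if isObject && !keep[path] {
			flatten(path, nested, keep, out)
			continue
		}
		out[path] = value
	}
}
```

Without the keep entry, every distinct header name a client sends would become its own context_headers_* column; with it, the whole object lands in one JSON-typed column.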
Reviewed the changes across ingest header capture/masking (router.go + tests), bulker schema/type handling (options.go, abstract.go, datatype.go), and destination stream option shaping (bulker-destination.ts).
I focused on correctness/security regressions around header spoofing and secret leakage, SQL type-hint bypasses, and schema flattening behavior for context.headers.
No new actionable correctness or security issues found in this diff.
… carries headers
Avoids eagerly creating the context_headers column (batch mode adds schema columns to the table even without data) on connections whose events never carry context.headers.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The plain schema-option case was already covered per warehouse (see json_test_snowflake in types_test.go) and worked: without toSameCase raw field names match raw key paths. The broken combination was WithSchema + WithToSameCase (what sync-sidecar sends when the sync's sameCase option is on): notFlatteningKeys stayed raw while _mapForDwh compares case-transformed paths, so declared nested JSON fields were flattened. This test fails without the nameTransformer fix in abstract.go and passes with it.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
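The mismatch described in this commit can be reproduced in a few lines. The sketch below is an illustration only: buildKeepSet, identity and upper are hypothetical names, and strings.ToUpper merely stands in for whatever case transformation the warehouse adapter actually applies.

```go
package main

import "strings"

// buildKeepSet applies the given name transformer to every declared
// schema field, so the resulting set uses the same spelling as the
// key paths it will later be compared against.
func buildKeepSet(fields []string, transform func(string) string) map[string]bool {
	keep := make(map[string]bool, len(fields))
	for _, field := range fields {
		keep[transform(field)] = true
	}
	return keep
}

// identity models the old behavior: declared names stay raw.
func identity(name string) string { return name }

// upper is a stand-in for a same-case name transformer.
func upper(name string) string { return strings.ToUpper(name) }
```

If the lookup path is transformed but the set is built with identity, the declared field is never found and the nested object gets flattened; building the set with the same transformer makes the lookup succeed.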
Reviewed the changes in bulker ingest/header handling, bulker schema option parsing, SQL schema mapping behavior, and destination function stream option generation.
Findings:
- Potential runtime error in withContextHeadersSchema: streamOptions.schema.fields is treated as an array without a type guard; malformed or legacy config shapes can throw before event delivery.
…text fields
user-agent, referer and host are removed from context.headers when the event already carries the exact same values in context.userAgent, context.page.referrer and context.page.host respectively. Differing values are kept: the mismatch itself is a bot signal.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
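The de-duplication rule in this commit is small enough to sketch directly. The function name dropDuplicates and its flat string parameters are assumptions; the actual code reads these values from the event context.

```go
package main

// dropDuplicates removes user-agent, referer and host from headers
// when the event already carries exactly the same value elsewhere.
// A differing value is kept, because the mismatch is itself a signal.
func dropDuplicates(headers map[string]string, userAgent, referrer, host string) {
	known := map[string]string{
		"user-agent": userAgent,
		"referer":    referrer,
		"host":       host,
	}
	for name, existing := range known {
		if value, present := headers[name]; present && value == existing {
			delete(headers, name)
		}
	}
}
```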
…chema.fields (review)
streamOptions is untyped user config: a non-array schema.fields would make fields.some() throw, turning a config problem into endless RetryError delivery retries. Leave a malformed schema untouched instead.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ma (review)
bulker accepts the schema stream option as a JSON string; spreading a string schema would explode it into character keys and silently drop the user's schema config. Leave any non-plain-object schema untouched.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
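The two guards from these review fixes are implemented in TypeScript in bulker-destination.ts; the same idea is sketched here in Go to stay consistent with the other examples. Everything in this block is an assumption made for illustration: the function name addContextHeadersField, the field shape {"name", "type"} and the "json" type label are hypothetical.

```go
package main

// addContextHeadersField returns stream options with a context_headers
// field declared in schema.fields. The input is untyped user config, so
// it is returned untouched when schema is not a plain object (for
// example a JSON string) or when schema.fields is not an array.
func addContextHeadersField(opts map[string]any) map[string]any {
	schema := map[string]any{}
	if raw, present := opts["schema"]; present {
		object, isObject := raw.(map[string]any)
		if !isObject {
			return opts
		}
		schema = object
	}
	fields := []any{}
	if raw, present := schema["fields"]; present {
		list, isList := raw.([]any)
		if !isList {
			return opts
		}
		fields = list
	}
	for _, field := range fields {
		if m, isMap := field.(map[string]any); isMap && m["name"] == "context_headers" {
			return opts // already declared by the user
		}
	}
	// Copy instead of mutating the caller's config.
	newField := map[string]any{"name": "context_headers", "type": "json"}
	newSchema := copyMap(schema)
	newSchema["fields"] = append(append([]any{}, fields...), newField)
	out := copyMap(opts)
	out["schema"] = newSchema
	return out
}

func copyMap(in map[string]any) map[string]any {
	out := make(map[string]any, len(in)+1)
	for key, value := range in {
		out[key] = value
	}
	return out
}
```

Returning the options unchanged on any unexpected shape keeps a configuration mistake from turning into a delivery failure that is retried indefinitely.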
Reviewed the changes in ingest header capture, bulker stream option/schema handling, SQL flattening behavior, and related TS/Go tests.
I did not find new actionable correctness or security regressions in this diff. The added tests around context headers, schema parsing, and nested-object handling cover the key risk areas and pass locally for the touched Go packages.
JITSU-74
What
Adds context.headers to ingested events so destinations can see the raw HTTP request headers (accept, content-type, sec-fetch-*, sec-ch-ua*, …) and distinguish real browser traffic from bots/agents. Today only context.userAgent is available.
Behavior
- Browser endpoint: context.headers is derived only from the actual request; the body can't redefine them (a browser can't read its own request headers anyway, and shouldn't be able to spoof them).
- S2S endpoint: captures the forwarding request's headers, but the caller can override allow-listed headers via the body to forward the original device's headers: accept, accept-language, accept-encoding, content-type, user-agent, referer, dnt, sec-fetch-*, sec-ch-ua*.
- cookie/authorization are stripped and the write key is masked before headers reach context (which is forwarded to destinations). Keys are lower-cased. The internal IngestMessage.HttpHeaders (full set) is unchanged.
Types
- AnalyticsContext.headers?: Record<string, string> (Jitsu extension; Segment's spec has no raw-headers field).
- RuntimeFacade.headers() so a Node integration can supply the original device's headers; @jitsu/js wires it into the built context (no-op in the browser).
Notes for bot detection
The sec-fetch-* / sec-ch-ua* set is the strongest tell: raw HTTP clients (curl, python-requests, most non-browser agents) don't send them; only real/headless browsers do.
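A destination could use that signal along the lines of the sketch below. The function name hasBrowserHints is hypothetical, and the check is deliberately coarse: as noted above, headless browsers also send these headers, so their presence does not prove a human visitor.

```go
package main

import "strings"

// hasBrowserHints reports whether any sec-fetch-* or sec-ch-ua*
// header is present in context.headers. Only the header names are
// inspected, so masked values do not affect the result.
func hasBrowserHints(headers map[string]string) bool {
	for name := range headers {
		key := strings.ToLower(name)
		if strings.HasPrefix(key, "sec-fetch-") || strings.HasPrefix(key, "sec-ch-ua") {
			return true
		}
	}
	return false
}
```

A false result suggests a raw HTTP client, which is best treated as one input to a broader scoring rule rather than as a verdict on its own.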