GravityKit/secure-list-sharing

Secure List Sharing

Two stores find their shared customers and figure out cross-sell opportunities — without either side seeing the other's customer list, customer emails, or per-customer purchase history.

Built so two stores can each learn things like "of the customers we share, X% own Product A and only Y% own Product B — let's co-promote the bundle to the rest" — without either side handing over a customer database.

One thing to know before reading: this is privacy by cooperation, not privacy by cryptography. Both sides hold a shared secret and must agree, in writing, what the data is for. If you don't trust the other side to honor that agreement, don't run it. See Threat model for the full picture.


What you get out of running this

After both sides run the script and exchange two small files, each side ends up with:

  1. The number of customers you have in common. A count, plus a list of opaque hashes that you can decode privately against your own customer list if you want to know exactly which of your customers are shared.

  2. A small CSV listing what the shared customers buy from each store, in aggregate. Here's what Party A's CSV looks like (example data, against a synthetic 5,000-customer overlap):

    product,shared_customer_count
    Product A,1302
    Product B,1245
    Product C,895
    Product D,753
    Product E,738
    Product F,263
    Product G,196
    Product H,167
    Product I,100
    Product J,84

    The other side gets an equivalent CSV from your store; you get an equivalent CSV from theirs. That's enough to plan a co-marketing campaign without learning who specifically bought what.

The lopsided rows are where co-promotion makes sense: products that are popular on one side but not the other, in a customer base both sides already share.


What the other side learns about you (and you about them)

The short version, in plain English:

✅ Yes, the other side learns:

  • The opaque hash of every paying customer's email. The list is padded to the next power of 10, so they don't even learn your exact customer count.
  • Which of your customers are also their customers, but only as hashes, which they decode against their own list.
  • For shared customers only: a per-product count from your store, with k-anonymity (counts under 5 are suppressed).

❌ No, the other side doesn't learn:

  • Any raw email address.
  • Which of your customers are not theirs.
  • Per-customer purchase history.
  • Total per-product sales volume across all your customers.
  • Order amounts, dates, addresses, anything else.

The strongest informal way to put it: both sides walk away knowing the shape of the cross-sell opportunity, not the identities of the customers in it.


How the protocol works

The key idea: each side runs the script on their own machine, against their own customer data, restricted to a shared cohort that both sides agreed on. Neither side computes anything against the other's data. The only things that cross between parties are an opaque hash file and a small (product, count) CSV.

Three steps

sequenceDiagram
    autonumber
    participant A as Party A
    participant B as Party B
    Note over A,B: Out-of-band: agree on a shared 32+ char secret
    A->>A: hash + pad customers → stage1-a.txt
    B->>B: hash + pad customers → stage1-b.txt
    A->>B: stage1-a.txt (opaque hashes)
    B->>A: stage1-b.txt (opaque hashes)
    A->>A: intersect → shared.txt
    B->>B: intersect → shared.txt (identical)
    Note over A,B: Decide whether overlap is large enough to act on
    A->>A: product-report on own purchases → stage2-a.csv
    B->>B: product-report on own purchases → stage2-b.csv
    A->>B: stage2-a.csv (aggregate product, count)
    B->>A: stage2-b.csv (aggregate product, count)
    Note over A,B: Securely destroy the shared secret

Step 1 — Hash, then exchange. Both sides agree on a 32+ character secret out-of-band (1Password, signed message, in person — not plain email). Each side independently hashes their own customer list. Output is a file of opaque hashes, padded with random noise so the file size doesn't reveal the customer count. Send your file, get theirs.

Step 2 — Find the overlap. Each side runs intersect on the two hash files. Both sides produce identical output (a built-in honest-broker check: if the counts disagree, something is wrong). Look at the count: is it big enough to be worth Stage 2? If not, stop here.

Step 3 — Each side runs the product report on their own purchases. "Of the shared customers, here's how many bought each of my products." Counts under 5 are suppressed automatically. Each side sends their report; both sides end up with both reports. Now you have everything you need to plan co-marketing.
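The overlap step in Step 2 amounts to a set intersection over two hash files. The following is a minimal Python sketch of that idea, not the script's PHP code; the `intersect` helper and the one-hash-per-line layout are assumptions made for illustration.

```python
def intersect(mine_lines, theirs_lines):
    """Return the sorted hashes present in both lists (blank lines ignored)."""
    mine = {line.strip() for line in mine_lines if line.strip()}
    theirs = {line.strip() for line in theirs_lines if line.strip()}
    return sorted(mine & theirs)

# Toy hash files; real ones would hold hex digests plus random padding.
stage1_a = ["aa11", "bb22", "cc33", "pad-a1"]
stage1_b = ["bb22", "cc33", "dd44", "pad-b1"]

shared_on_a = intersect(stage1_a, stage1_b)  # Party A: mine=a, theirs=b
shared_on_b = intersect(stage1_b, stage1_a)  # Party B: mine=b, theirs=a

# Set intersection is symmetric, so both sides should end up with the same file.
assert shared_on_a == shared_on_b
print(len(shared_on_a), shared_on_a)  # 2 ['bb22', 'cc33']
```

Because the result does not depend on which file is "mine", comparing the two counts is a cheap way to notice a corrupted or altered hash file before deciding on Stage 2.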

Files on each machine

flowchart LR
    subgraph A [Party A]
        direction TB
        a_in[(customers.csv<br/>purchases.csv)]
        a_stage1[stage1.txt]
        a_shared[shared.txt]
        a_stage2[stage2.csv]
        a_in -- hash + pad --> a_stage1
        a_stage1 -- intersect --> a_shared
        a_in -- product-report --> a_stage2
        a_shared --> a_stage2
    end

    subgraph B [Party B]
        direction TB
        b_in[(customers.csv<br/>purchases.csv)]
        b_stage1[stage1.txt]
        b_shared[shared.txt]
        b_stage2[stage2.csv]
        b_in -- hash + pad --> b_stage1
        b_stage1 -- intersect --> b_shared
        b_in -- product-report --> b_stage2
        b_shared --> b_stage2
    end

    a_stage1 <-. exchange .-> b_stage1
    a_stage2 <-. exchange .-> b_stage2

Solid arrows = computation done locally. Dashed double-arrows = files exchanged between parties. Each side's customers.csv and purchases.csv (and secret.txt, not pictured) never leave their own machine.

Why this is privacy-preserving

The shared.txt file is the only thing telling each side "compute over this specific subset." It's a list of opaque hashes. You decode it against your customer list to find which of your customers are shared. The other side decodes it against their list to find which of their customers are shared. Same hashes, different decodings — and the hashes alone reveal nothing about each other's customer lists.
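The "same hashes, different decodings" idea can be sketched as follows. This is a Python illustration under stated assumptions, not the script itself: HMAC-SHA256 is assumed as the keyed hash (the README only says "HMAC" with 32-byte outputs), the emails are assumed to be already normalized, and `keyed_hash` is a hypothetical helper name. Both parties are simulated in one process here; in a real run each lookup table exists only on its owner's machine.

```python
import hashlib
import hmac

def keyed_hash(secret: bytes, email: str) -> str:
    """HMAC the email with the shared secret (SHA-256 is an assumption)."""
    return hmac.new(secret, email.encode("utf-8"), hashlib.sha256).hexdigest()

secret = b"example-secret-of-at-least-32-characters"  # placeholder value

my_customers = ["alice@example.com", "bob@example.com"]
their_customers = ["bob@example.com", "carol@example.com"]

# Each side privately maps hash -> its own customer.
my_lookup = {keyed_hash(secret, e): e for e in my_customers}
their_lookup = {keyed_hash(secret, e): e for e in their_customers}

# shared.txt: only hashes, identical on both sides.
shared = sorted(my_lookup.keys() & their_lookup.keys())

# Decoding happens locally, against your own list only.
my_shared_customers = [my_lookup[h] for h in shared]
print(my_shared_customers)  # ['bob@example.com']
```

Hashes in `shared` that are not in your own lookup table cannot occur, and hashes in the other side's file that are not in `shared` stay opaque to you unless you guess the underlying email.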


Quick start

# 1. Both sides agree on a shared secret out-of-band, save it as secret.txt:
echo 'paste-the-shared-secret-here' > secret.txt
# Or generate one (whichever side generates it shares it with the other):
openssl rand -hex 32 > secret.txt

# 2. Configure once. Copy the example and edit it for your environment.
cp .env.example .env
$EDITOR .env   # set SOURCE, WP_CONFIG (or DB_* / CSV_PATH), and so on

# 3. Sanity-check what the script will use:
php secure-list-sharing.php config

# 4. Hash your customer list:
php secure-list-sharing.php hash
# → writes stage1.txt

# 5. Exchange your stage1.txt with the other side.
#    Save theirs to the THEIRS_FILE path in .env (default: stage1-theirs.txt).

# 6. Find the overlap:
php secure-list-sharing.php intersect --mine=stage1.txt
# → writes shared.txt

# 7. Decide together: is the overlap big enough to act on? If not, stop here.

# 8. Run the product report:
php secure-list-sharing.php product-report
# → writes stage2.csv

# 9. Exchange stage2 files. Now both sides have both reports.

# 10. Securely delete the secret on both sides:
shred -u secret.txt    # Linux
# srm secret.txt       # older macOS only (srm is no longer shipped as of macOS 10.12)

If you'd rather skip .env and pass everything on the command line, every option has a CLI flag — see the CLI reference below.


Operational checklist

Before running:

  • You and the other side have a written agreement (even informal email exchange) on what the data may be used for.
  • You have generated a fresh secret with openssl rand -hex 32.
  • You have shared the secret over a secure channel (1Password Vault, signed encrypted email, in-person). Never send the secret over the same channel you'll send hash files.
  • You have agreed on the k-anonymity threshold for the product report (default 5).
  • You have agreed on an abort threshold (e.g., "only run Stage 2 if shared > 100 customers").

After running:

  • Both sides securely delete the secret file (shred -u, srm, or platform equivalent).
  • Both sides decide what to do with the aggregate product reports.
  • The Stage 1 and Stage 2 outputs are stored under the same data-handling policy as customer data.

Worked example

An illustrative run against an Easy Digital Downloads (EDD) store of a size you'd see in practice:

  • ~20,000 paying customers in EDD; after normalization, a few dozen +alias / gmail-dot duplicates merge into a slightly smaller set of unique mailboxes (see Email normalization)
  • Stage 1 output: padded up to 100,000 hashes (~6 MB) — the next power of 10, so the file size reveals only the order of magnitude, not the real count
  • A synthetic 5,000-customer "shared" cohort (in production this comes from running intersect against the other side's stage1)
  • Stage 2 output: the matching purchase records collapse to a per-product count for the shared cohort, with any product bought by fewer than k=5 shared customers suppressed automatically

Top of the resulting stage2.csv:

product,shared_customer_count
Product A,1302
Product B,1245
Product C,895
Product D,753
Product E,738
Product F,263
Product G,196

When the other party's cohort comes back, the absolute numbers will differ but the shape is what matters: products with high penetration on your side and low on theirs (or vice versa) are your cross-sell candidates.
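To compare shape rather than absolute numbers, each count can be expressed as a share of the shared cohort. The sketch below does that arithmetic in Python for the example rows above, using the 5,000-customer synthetic cohort as the denominator; the `penetration` helper is a hypothetical name and not part of the script.

```python
SHARED_TOTAL = 5000  # size of the synthetic shared cohort in the example

rows = [
    ("Product A", 1302),
    ("Product B", 1245),
    ("Product C", 895),
    ("Product D", 753),
    ("Product E", 738),
    ("Product F", 263),
    ("Product G", 196),
]

def penetration(rows, shared_total):
    """Convert (product, count) rows to (product, percent of shared cohort)."""
    return [(name, round(100 * count / shared_total, 1)) for name, count in rows]

for name, pct in penetration(rows, SHARED_TOTAL):
    print(f"{name}: {pct}% of shared customers")
```

With these numbers Product A sits at about 26% of the shared cohort and Product G at about 4%, which is the kind of gap the co-promotion discussion is looking for.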



Technical reference

Everything below is for setting up your environment, understanding the cryptographic guarantees in detail, or debugging.

Files in this directory

What to share with the other side (just zip the whole directory, or invite them to the repo):

  • secure-list-sharing.php — the script. Single file, no Composer dependencies.
  • README.md — this file.
  • .env.example — template for local configuration. Each side copies it to .env and fills in their values.
  • .gitignore — keeps .env, secret.txt, and stage outputs out of version control.
  • tools/extract-customers.php, tools/extract-purchases.php — read-only wp-cli scripts for extracting CSVs from EDD when you can't connect to the database directly (managed hosting). See Extracting CSVs via wp-cli over SSH.

What you generate and exchange (gitignored; never commit any of these):

  • secret.txt — the shared HMAC secret (32+ chars). Distribute over a secure channel.
  • stage1.txt — your hashed, padded customer list.
  • stage1-theirs.txt — what the other side sends you.
  • shared.txt — the intersection (both sides should produce identical output).
  • stage2.csv — your aggregate product penetration report on the shared cohort.

Requirements

  • PHP 8.0 or newer.
  • pdo_mysql extension (only if using --source=edd).
  • Network access to your WordPress database (only if using --source=edd).

The script is a single file with no Composer dependencies.

Configuration via .env

Copy .env.example to .env and edit. CLI flags always override .env. Run php secure-list-sharing.php config to see what the script will use given your current setup (the secret is masked in the output).

Keys and their purpose:

  • SECRET_FILE: Path to the shared-secret file (recommended).
  • SECRET: Inline secret value (alternative; avoid for shared environments).
  • SOURCE: edd or csv.
  • WP_CONFIG: Path to wp-config.php (when SOURCE=edd).
  • DB_HOST, DB_USER, DB_PASS, DB_NAME, DB_PREFIX: Manual DB credentials (when wp-config.php parsing fails).
  • CSV_PATH: Path to your customer CSV (when SOURCE=csv).
  • K: K-anonymity threshold (default 5).
  • HASH_OUT, INTERSECT_OUT, REPORT_OUT: Output paths for each mode.
  • MINE_FILE, THEIRS_FILE, SHARED_FILE: Input paths used by intersect / product-report.
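The precedence rule (CLI flags over .env over defaults) can be pictured with a short Python sketch. It is an illustration only: `parse_env` and `resolve` are hypothetical helpers, the handling of quotes and trailing `#` comments is an assumption about the .env format, and only the two defaults stated in this README (K=5 and stage1-theirs.txt) are included.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.split(" #", 1)[0]  # drop a trailing inline comment
        values[key.strip()] = value.strip().strip("'\"")
    return values

# Defaults taken from the README text; the real script may define more.
DEFAULTS = {"K": "5", "THEIRS_FILE": "stage1-theirs.txt"}

def resolve(cli, env):
    """Later updates win: defaults, then .env, then CLI flags."""
    merged = dict(DEFAULTS)
    merged.update(env)
    merged.update(cli)
    return merged

env = parse_env("SOURCE=csv\nCSV_PATH=./customers.csv   # for hash mode\nK=10\n")
print(resolve({"K": "7"}, env))
```

Here the CLI value for K (7) wins over the .env value (10), which in turn would win over the default (5).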

CLI reference

Run php secure-list-sharing.php help for the latest help text.

Modes:

  • hash: Read your customer list, output a sorted, padded hash file.
  • intersect: Take both parties' hash files, output the shared hashes.
  • product-report: For shared customers only, output (product, count) with k-anonymity.
  • config: Print resolved settings (CLI > .env > defaults). For debugging.

Flags:

  • --env=PATH: Path to .env file (default: .env next to the script).
  • --secret-file=PATH: Shared secret file. Preferred over inline.
  • --secret=VALUE: Shared secret inline. Visible in shell history; avoid.
  • --source=csv|edd: Data source.
  • --csv=PATH: CSV file (when --source=csv).
  • --wp-config=PATH: wp-config.php path (when --source=edd).
  • --db-host, --db-user, --db-pass, --db-name, --db-prefix: Manual EDD DB credentials.
  • --out=PATH: Output file.
  • --mine, --theirs: Hash files for intersect.
  • --shared=PATH: Shared-hash file for product-report.
  • --k=N: K-anonymity threshold (default 5).

Pure-CLI example (no .env):

php secure-list-sharing.php hash \
    --source=edd --wp-config=/var/www/html/wp-config.php \
    --secret-file=secret.txt --out=stage1.txt

php secure-list-sharing.php intersect \
    --mine=stage1.txt --theirs=stage1-theirs.txt --out=shared.txt

php secure-list-sharing.php product-report \
    --source=edd --wp-config=/var/www/html/wp-config.php \
    --secret-file=secret.txt --shared=shared.txt --out=stage2.csv

Data sources

Easy Digital Downloads (EDD 3.x)

Reads paid customers and product purchases directly from the WordPress database. Uses these tables (with your $table_prefix):

  • {prefix}edd_customers
  • {prefix}edd_orders — filtered to status IN ('complete', 'partially_refunded')
  • {prefix}edd_order_items — filtered to type = 'download' and complete/partially-refunded status
  • {prefix}posts — joined for canonical product titles, so EDD price variations (Product A — Single Site, — Up to 3 Sites, etc.) collapse into one row (Product A)

Connect via:

--source=edd --wp-config=/path/to/wp-config.php
# or
--source=edd --db-host=localhost --db-user=root --db-pass=root \
             --db-name=wordpress --db-prefix=wp_

If your wp-config.php is unusual (multi-line define(), dynamic values pulled from a secrets manager, etc.), parsing may fail. Use the explicit --db-* flags.
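To see why unusual wp-config.php files can defeat parsing, consider a naive single-line matcher. The README does not show the script's actual parser, so the regex and `parse_wp_config` below are hypothetical, written in Python purely to illustrate the failure modes mentioned above.

```python
import re

# Matches only: define( 'DB_X', 'literal' ) written on a single line.
DEFINE_RE = re.compile(
    r"""define\(\s*['"](DB_[A-Z]+)['"]\s*,\s*['"]([^'"]*)['"]\s*\)"""
)

def parse_wp_config(text):
    """Collect DB_* constants whose values are plain string literals."""
    found = {}
    for line in text.splitlines():
        match = DEFINE_RE.search(line)
        if match:
            found[match.group(1)] = match.group(2)
    return found

sample = """
define( 'DB_NAME', 'wordpress' );
define( 'DB_USER', 'root' );
define( 'DB_PASSWORD', getenv('DB_PASS') );
define(
    'DB_HOST',
    'localhost'
);
"""

print(parse_wp_config(sample))  # {'DB_NAME': 'wordpress', 'DB_USER': 'root'}
```

The dynamic password and the multi-line host are both missed by this kind of matcher, which is the situation where passing the explicit --db-* flags is the simpler route.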

CSV

A CSV with a header row.

For hash mode, only an email column is required:

email
alice@example.com
bob@example.com

For product-report mode, you also need a product column. Multiple rows per customer (one per product purchased) is the expected shape:

email,product
alice@example.com,Product A
alice@example.com,Product B
bob@example.com,Product A

Acceptable header aliases: email / customer_email / e-mail / mail, and product / product_name / item / item_name / sku.
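Header alias resolution can be sketched as a lookup over the header row. This Python example is illustrative and not the script's code: `find_column` and `read_purchases` are hypothetical names, and matching headers case-insensitively is an assumption.

```python
import csv
import io

EMAIL_ALIASES = {"email", "customer_email", "e-mail", "mail"}
PRODUCT_ALIASES = {"product", "product_name", "item", "item_name", "sku"}

def find_column(fieldnames, aliases):
    """Return the first header whose lowercased name is a known alias."""
    for name in fieldnames:
        if name.strip().lower() in aliases:
            return name
    return None

def read_purchases(text):
    """Yield (email, product) pairs from CSV text with aliased headers."""
    reader = csv.DictReader(io.StringIO(text))
    headers = reader.fieldnames or []
    email_col = find_column(headers, EMAIL_ALIASES)
    product_col = find_column(headers, PRODUCT_ALIASES)
    if email_col is None or product_col is None:
        raise ValueError("CSV needs an email column and a product column")
    return [(row[email_col], row[product_col]) for row in reader]

sample = "customer_email,item_name\nalice@example.com,Product A\nbob@example.com,Product A\n"
print(read_purchases(sample))
```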

Extracting CSVs via wp-cli over SSH (managed hosting)

If your store runs on managed hosting (Convesio, WP Engine, Pantheon, Kinsta, etc.) where the database isn't reachable from your laptop, use --source=csv and produce the CSVs via wp-cli over SSH. Two read-only extract scripts are bundled in tools/:

  • tools/extract-customers.php → produces customers.csv.
  • tools/extract-purchases.php → produces purchases.csv (uses wp_posts for canonical product titles).

Both run via wp eval-file and only execute SELECT statements against the EDD tables. Recipe:

SSH_USER=youruser
SSH_HOST=your.host.com
SSH_PORT=22
WP_PATH=/var/www/wordpress

# Stage 1 input — customer emails:
ssh -p $SSH_PORT $SSH_USER@$SSH_HOST \
    "tmpf=\$(mktemp); cat > \$tmpf; cd $WP_PATH && wp eval-file \$tmpf; rm \$tmpf" \
    < tools/extract-customers.php > customers.csv

# Stage 2 input — purchases:
ssh -p $SSH_PORT $SSH_USER@$SSH_HOST \
    "tmpf=\$(mktemp); cat > \$tmpf; cd $WP_PATH && wp eval-file \$tmpf; rm \$tmpf" \
    < tools/extract-purchases.php > purchases.csv

The mktemp dance is because some wp installations don't read piped scripts via wp eval-file /dev/stdin reliably over SSH. Writing the script to a remote temp file first works everywhere; the file is removed before SSH disconnects.

Then point your .env at the resulting CSVs:

SOURCE=csv
CSV_PATH=./customers.csv         # for hash mode
# (CSV_PATH=./purchases.csv when running product-report)

Email normalization

Both parties must produce identical hashes for the same logical mailbox. The script applies the following deterministic normalization before hashing:

  1. trim() and strtolower().
  2. Strip +aliases from the local part (bob+netflix@gmail.com → bob@gmail.com).
  3. For gmail.com and googlemail.com: strip dots from the local part and canonicalize the domain to gmail.com.

This is correct deduplication, not data loss. Two examples on a real customer database:

  • One human signed up at your store as bob@gmail.com and at the other side's store as b.ob+netflix@gmail.com. Without normalization those produce different hashes and the overlap is missed. With normalization, both sides hash to the same value and the customer is correctly identified as shared.
  • One human signed up at your store twice: once as alice@gmail.com and later as alice+work@gmail.com. Without normalization your customer list "has 2 customers"; with normalization it has 1 — which is what was true all along.

Normalization does not invent false matches. The two transformations are safe by construction:

  • Gmail dot-stripping: Google enforces uniqueness on the dot-stripped form of every Gmail address. b.ob@gmail.com and bo.b@gmail.com cannot be two different humans — they are literally the same mailbox by Google's design. Zero false positives.
  • +tag stripping: For bob+x@example.com to be in your customer list at all, signup verification had to deliver to it. If the mail server delivered to bob+x@, the server treats + as routing-only, which means bob@ reaches the same mailbox. The only way this is wrong is a corporate mail server deliberately configured to route +aliases to a different person — vanishingly rare and not something you've encountered if you're running a normal e-commerce store.

In practical terms: when secure-list-sharing.php hash reports a "real customer" count slightly lower than your database row count, that's the script correctly merging aliases. Without normalization, you would systematically undercount overlap with the other side.
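The three normalization rules listed above can be sketched in a few lines. This is a Python rendering for illustration, not the script's PHP; `normalize_email` is a hypothetical name, and returning malformed input (no `@`) unchanged apart from trimming and lowercasing is an assumption.

```python
def normalize_email(raw: str) -> str:
    """Apply trim/lowercase, +alias stripping, and Gmail canonicalization."""
    email = raw.strip().lower()
    local, sep, domain = email.rpartition("@")
    if not sep:
        return email  # no "@": leave as-is (assumption)
    local = local.split("+", 1)[0]  # drop everything from the first "+"
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")  # dots are ignored by Gmail
        domain = "gmail.com"
    return f"{local}@{domain}"

for raw in ["  Bob+Netflix@Gmail.com ", "b.ob@googlemail.com", "a.b+work@example.com"]:
    print(repr(raw), "->", normalize_email(raw))
```

The first two inputs both normalize to bob@gmail.com, while the third keeps its dot (a.b@example.com) because dot-stripping applies only to Gmail domains.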

Padding

The hash file is padded with random 32-byte hex strings (indistinguishable from real HMAC outputs) up to the next power of 10:

Real customers → padded total:

  • 1 – 10 → 10
  • 11 – 100 → 100
  • 101 – 1,000 → 1,000
  • 1,001 – 10,000 → 10,000
  • 10,001 – 100,000 → 100,000

This means the size of the hash file no longer leaks your exact customer count. The other side learns only the order of magnitude (e.g. "between 10K and 100K customers") — typically information you've already disclosed in marketing copy.
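The padding rule can be sketched as "grow the set with random hex strings until it reaches the smallest power of 10 that fits". The Python below is an illustration under assumptions, not the script's code: `padded_total` and `pad_hashes` are hypothetical names, and the final sort is assumed from the description of hash mode as producing a sorted file.

```python
import secrets

def padded_total(real_count: int) -> int:
    """Smallest power of 10 (minimum 10) that is >= real_count."""
    total = 10
    while total < real_count:
        total *= 10
    return total

def pad_hashes(real_hashes):
    """Add random 32-byte hex strings until the padded total is reached."""
    unique = set(real_hashes)
    target = padded_total(len(unique))
    while len(unique) < target:
        unique.add(secrets.token_hex(32))  # 64 hex chars, like a SHA-256 HMAC
    return sorted(unique)  # sorting mixes padding in with real entries

real = [format(i, "064x") for i in range(11)]  # 11 stand-in "real" hashes
padded = pad_hashes(real)
print(len(padded))  # 100
```

Sorting matters as much as the padding itself: if the random entries were simply appended, their position would reveal which lines are real.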

K-anonymity

The product-report mode counts distinct shared customers per product. By default, any product with fewer than 5 shared customers is suppressed entirely from the output. This prevents single-customer disclosures like "1 shared customer bought your $50,000 enterprise tier" — which, combined with other context, could identify the customer.

You can change the threshold:

  • --k=5 (default) — conservative.
  • --k=10 — very conservative; products need real penetration to show up.
  • --k=1 — no suppression. Don't use this on real data unless you've thought about it.
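A minimal sketch of the suppression rule: count distinct shared customers per product, then drop any product below k. This is Python for illustration and not the script's implementation; `product_report` is a hypothetical name, `hash_email` is a stand-in for the real keyed hash, and sorting by descending count is an assumption based on the sample CSVs.

```python
from collections import defaultdict

def product_report(purchases, shared_hashes, hash_email, k=5):
    """Return (product, distinct shared buyers) rows with counts >= k."""
    buyers = defaultdict(set)
    for email, product in purchases:
        digest = hash_email(email)
        if digest in shared_hashes:  # ignore customers outside the cohort
            buyers[product].add(digest)  # a set, so repeat purchases count once
    rows = [(p, len(b)) for p, b in buyers.items() if len(b) >= k]
    return sorted(rows, key=lambda row: (-row[1], row[0]))

def hash_email(email):
    """Stand-in for the keyed hash, used only in this demo."""
    return "h:" + email

shared = {hash_email(f"c{i}@example.com") for i in range(6)}
purchases = [(f"c{i}@example.com", "Product A") for i in range(6)]
purchases += [("c0@example.com", "Product A")]        # repeat purchase
purchases += [("c0@example.com", "Product B"), ("c1@example.com", "Product B")]
purchases += [("outsider@example.com", "Product A")]  # not in the shared cohort

print(product_report(purchases, shared, hash_email, k=5))  # [('Product A', 6)]
```

Product B has only two shared buyers, so it disappears from the report entirely at k=5 rather than appearing with a small count.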

Threat model

This protocol is designed for the cooperating-but-cautious counterparty threat model. Each successive class of adversary below is harder to defeat.

What it defends against

  • Casual leaks of the exchanged files (laptop stolen, hash file accidentally committed to git, third party intercepts the file). An attacker without the secret cannot recover any emails — they have only random opaque strings.
  • Naive offline reconstruction. Without the secret, hashing every email in a public breach corpus and looking for matches doesn't work — they can't compute the HMAC without the key.
  • List-size leakage. Padding to a power of 10 ensures neither party learns the other's exact customer count.
  • Per-customer purchase disclosure. The product report only ever emits aggregate counts per product, never (customer, product) pairs.
  • Single-customer identification. K-anonymity suppresses cells small enough to identify individuals.

What it does not defend against

  • A malicious counterparty who has the secret. Once both parties hold the secret, either side can hash any specific email guess and check for membership ("is bob@acme.com your customer?"). They can do this for any list of emails they obtain — competitor scrapes, breach dumps, conferences, LinkedIn — as long as the secret hasn't been destroyed.

    Mitigation: use the secret once, then destroy it. Don't reuse it for future exchanges. If you suspect bad faith on the other side, don't run the protocol with them.

  • Strategic list manipulation. A party could run the protocol with a deliberately curated subset of their customers (e.g. only customers in a specific niche) to learn things about that subset's overlap. The other side has no way to verify the list is complete.

  • Frequency analysis on the product report. If you publish a product report from a tiny shared cohort with --k=1, individual purchases may be re-identifiable through process of elimination. Always use --k >= 5 and don't run the product report when the shared set is itself small (say, <50 customers).

  • Network-level adversaries. This is a file-exchange protocol; it doesn't do anything for transport security. Send the files over TLS-protected channels (encrypted email, signed S3 link, etc.).
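The first limitation above (a counterparty who holds the secret) can be made concrete with a short sketch. It is written in Python for illustration, assumes HMAC-SHA256 as the keyed hash, and uses hypothetical names (`keyed_hash`, `is_customer`); the point is only that membership testing is a single hash and a lookup for anyone who has both the secret and your Stage 1 file.

```python
import hashlib
import hmac

def keyed_hash(secret: bytes, email: str) -> str:
    """HMAC the email with the given secret (SHA-256 is an assumption)."""
    return hmac.new(secret, email.encode("utf-8"), hashlib.sha256).hexdigest()

secret = b"example-secret-of-at-least-32-characters"  # placeholder value

# What the counterparty received from you in Stage 1 (padding omitted).
your_stage1 = {keyed_hash(secret, e) for e in ["bob@acme.com", "alice@example.com"]}

def is_customer(guess: str) -> bool:
    """What a secret-holding counterparty can do with any guessed email."""
    return keyed_hash(secret, guess) in your_stage1

print(is_customer("bob@acme.com"))     # True
print(is_customer("eve@example.org"))  # False

# Someone who only intercepted the file, without the secret, gets no match.
outsider = keyed_hash(b"a-different-secret-that-is-also-long", "bob@acme.com")
print(outsider in your_stage1)         # False
```

This is why the mitigation is procedural: the probe works for as long as a copy of the secret exists, so destroying it after one exchange is what closes the window.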

What you'd reach for if you needed real cryptographic privacy

A Private Set Intersection (PSI) protocol — for example, DDH-based PSI or PSI-CA. Both parties run an interactive cryptographic protocol where the only thing each side learns is the intersection (or even just the intersection size). Neither side learns the other's full set, and neither side can probe individual guesses even with all of their own protocol output.

Building PSI properly is a significant project (hundreds of lines of careful crypto code, multiple round trips, careful side-channel handling). For "two ecosystem peers looking for cross-sell opportunities," the salted-HMAC approach in this script is a pragmatic compromise that trades cryptographic guarantees for simplicity, but you should know the gap exists.

License

MIT © 2026 GravityKit.

Both parties run the same script for symmetry. It's provided as-is, with no warranty — read the Threat model before relying on it for anything sensitive.
