Two stores find their shared customers and figure out cross-sell opportunities — without either side seeing the other's customer list, customer emails, or per-customer purchase history.
Built so two stores can each learn things like "of the customers we share, X% own Product A and only Y% own Product B — let's co-promote the bundle to the rest" — without either side handing over a customer database.
One thing to know before reading: this is privacy by cooperation, not privacy by cryptography. Both sides hold a shared secret and must agree, in writing, what the data is for. If you don't trust the other side to honor that agreement, don't run it. See Threat model for the full picture.
After both sides run the script and exchange two small files, each side ends up with:
- The number of customers you have in common. A count, plus a list of opaque hashes that you can decode privately against your own customer list if you want to know exactly which of your customers are shared.
- A small CSV listing what the shared customers buy from each store, in aggregate. Here's what Party A's CSV looks like (example data, against a synthetic 5,000-customer overlap):

product,shared_customer_count
Product A,1302
Product B,1245
Product C,895
Product D,753
Product E,738
Product F,263
Product G,196
Product H,167
Product I,100
Product J,84
The other side gets an equivalent CSV from your store; you get an equivalent CSV from theirs. That's enough to plan a co-marketing campaign without learning who specifically bought what.
The lopsided rows are where co-promotion makes sense: products that are popular on one side but not the other, in a customer base both sides already share.
The short version, in plain English:
| ✅ Yes, the other side learns | ❌ No, the other side doesn't learn |
|---|---|
| The opaque hash of every paying customer's email, padded up to the next power of 10, so they don't even learn your exact customer count | Any raw email address |
| Which of your customers are also their customers — but only as hashes, which they decode against their own list | Which of your customers are not theirs |
| For shared customers only: a per-product count from your store, with k-anonymity (counts under 5 are suppressed) | Per-customer purchase history |
| | Total per-product sales volume across all your customers |
| | Order amounts, dates, addresses, anything else |
The strongest informal way to put it: both sides walk away knowing the shape of the cross-sell opportunity, not the identities of the customers in it.
The key idea: each side runs the script on their own machine, against their own customer data, restricted to a shared cohort that both sides agreed on. Neither side computes anything against the other's data. The only things that cross between parties are an opaque hash file and a small (product, count) CSV.
sequenceDiagram
autonumber
participant A as Party A
participant B as Party B
Note over A,B: Out-of-band: agree on a shared 32+ char secret
A->>A: hash + pad customers → stage1-a.txt
B->>B: hash + pad customers → stage1-b.txt
A->>B: stage1-a.txt (opaque hashes)
B->>A: stage1-b.txt (opaque hashes)
A->>A: intersect → shared.txt
B->>B: intersect → shared.txt (identical)
Note over A,B: Decide whether overlap is large enough to act on
A->>A: product-report on own purchases → stage2-a.csv
B->>B: product-report on own purchases → stage2-b.csv
A->>B: stage2-a.csv (aggregate product, count)
B->>A: stage2-b.csv (aggregate product, count)
Note over A,B: Securely destroy the shared secret
Step 1 — Hash, then exchange. Both sides agree on a 32+ character secret out-of-band (1Password, signed message, in person — not plain email). Each side independently hashes their own customer list. Output is a file of opaque hashes, padded with random noise so the file size doesn't reveal the customer count. Send your file, get theirs.
Step 2 — Find the overlap. Each side runs intersect on the two hash files. Both sides produce identical output (a built-in honest-broker check: if the counts disagree, something is wrong). Look at the count: is it big enough to be worth Stage 2? If not, stop here.
Step 3 — Each side runs the product report on their own purchases. "Of the shared customers, here's how many bought each of my products." Counts under 5 are suppressed automatically. Each side sends their report; both sides end up with both reports. Now you have everything you need to plan co-marketing.
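The hashing in Step 1 can be sketched in shell. This is an illustration, not the script's code: it assumes the hash is HMAC-SHA256 (the README only says the secret is an HMAC secret and that outputs are 32 bytes), the helper names `hmac_hex` and `hash_list` are hypothetical, and email normalization and padding are left out.

```shell
# Sketch only. Assumes HMAC-SHA256; helper names are hypothetical.

# hmac_hex SECRET VALUE: print the hex HMAC-SHA256 of VALUE under SECRET.
# Caveat: -hmac puts the key on the command line, visible in the process list.
hmac_hex() {
  printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}'
}

# hash_list SECRET: read one email per line on stdin, print sorted hashes.
# Sorting removes any trace of the original list order.
hash_list() {
  while IFS= read -r email; do
    if [ -n "$email" ]; then
      hmac_hex "$1" "$email"
    fi
  done | LC_ALL=C sort
}
```

Under these assumptions, something like `hash_list "$(cat secret.txt)" < emails.txt` would produce the unpadded hash list. Without the secret, the output lines cannot be recomputed from guessed emails.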
flowchart LR
subgraph A [Party A]
direction TB
a_in[(customers.csv<br/>purchases.csv)]
a_stage1[stage1.txt]
a_shared[shared.txt]
a_stage2[stage2.csv]
a_in -- hash + pad --> a_stage1
a_stage1 -- intersect --> a_shared
a_in -- product-report --> a_stage2
a_shared --> a_stage2
end
subgraph B [Party B]
direction TB
b_in[(customers.csv<br/>purchases.csv)]
b_stage1[stage1.txt]
b_shared[shared.txt]
b_stage2[stage2.csv]
b_in -- hash + pad --> b_stage1
b_stage1 -- intersect --> b_shared
b_in -- product-report --> b_stage2
b_shared --> b_stage2
end
a_stage1 <-. exchange .-> b_stage1
a_stage2 <-. exchange .-> b_stage2
Solid arrows = computation done locally. Dashed double-arrows = files exchanged between parties. Each side's customers.csv and purchases.csv (and secret.txt, not pictured) never leave their own machine.
The shared.txt file is the only thing telling each side "compute over this specific subset." It's a list of opaque hashes. You decode it against your customer list to find which of your customers are shared. The other side decodes it against their list to find which of their customers are shared. Same hashes, different decodings — and the hashes alone reveal nothing about each other's customer lists.
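Conceptually, the intersect step is a set intersection over two text files. A minimal sketch, assuming one hash per line in each file (`intersect_hashes` is a hypothetical helper, not part of the script):

```shell
# Sketch only; intersect_hashes is a hypothetical helper name.
# intersect_hashes MINE THEIRS: print the hashes present in both files.
intersect_hashes() {
  # De-duplicate each file first, so a line seen twice afterwards must come
  # from both files. LC_ALL=C pins byte order so both parties sort identically.
  { LC_ALL=C sort -u "$1"; LC_ALL=C sort -u "$2"; } | LC_ALL=C sort | uniq -d
}
```

Because the result does not depend on argument order, both parties get the same output. Piping it through `wc -l` on each side gives the counts to compare.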
# 1. Both sides agree on a shared secret out-of-band, save it as secret.txt:
echo 'paste-the-shared-secret-here' > secret.txt
# Or generate one (whichever side generates it shares it with the other):
openssl rand -hex 32 > secret.txt
# 2. Configure once. Copy the example and edit it for your environment.
cp .env.example .env
$EDITOR .env # set SOURCE, WP_CONFIG (or DB_* / CSV_PATH), and so on
# 3. Sanity-check what the script will use:
php secure-list-sharing.php config
# 4. Hash your customer list:
php secure-list-sharing.php hash
# → writes stage1.txt
# 5. Exchange your stage1.txt with the other side.
# Save theirs to the THEIRS_FILE path in .env (default: stage1-theirs.txt).
# 6. Find the overlap:
php secure-list-sharing.php intersect --mine=stage1.txt
# → writes shared.txt
# 7. Decide together: is the overlap big enough to act on? If not, stop here.
# 8. Run the product report:
php secure-list-sharing.php product-report
# → writes stage2.csv
# 9. Exchange stage2 files. Now both sides have both reports.
# 10. Securely delete the secret on both sides:
shred -u secret.txt # Linux
# srm secret.txt # macOS

If you'd rather skip .env and pass everything on the command line, every option has a CLI flag; see the CLI reference below.
Before running:
- You and the other side have a written agreement (even informal email exchange) on what the data may be used for.
- You have generated a fresh secret with `openssl rand -hex 32`.
- You have shared the secret over a secure channel (1Password vault, signed and encrypted email, in person). Never send the secret over the same channel you'll send hash files.
- You have agreed on the k-anonymity threshold for the product report (default 5).
- You have agreed on an abort threshold (e.g., "only run Stage 2 if shared > 100 customers").
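The abort threshold can be enforced mechanically before Stage 2. A small sketch, assuming `shared.txt` holds one hash per line (`overlap_large_enough` is a hypothetical helper, and 100 is only the example value from the checklist):

```shell
# Sketch only; overlap_large_enough is a hypothetical helper, and it assumes
# shared.txt holds one hash per line.
# overlap_large_enough SHARED_FILE MIN: succeed if the file has more than MIN hashes.
overlap_large_enough() {
  count=$(grep -c . "$1" || true)   # count non-empty lines; 0 if none
  [ "$count" -gt "$2" ]
}

# Example gate before Stage 2:
#   if overlap_large_enough shared.txt 100; then
#     php secure-list-sharing.php product-report
#   else
#     echo "Overlap too small; stopping after Stage 1." >&2
#   fi
```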
After running:
- Both sides securely delete the secret file (`shred -u`, `srm`, or platform equivalent).
- Both sides decide what to do with the aggregate product reports.
- The Stage 1 and Stage 2 outputs are stored under the same data-handling policy as customer data.
An illustrative run against an EDD store of the size you'd see in practice:
- ~20,000 paying customers in EDD; after normalization, a few dozen `+alias` / Gmail-dot duplicates merge into a slightly smaller set of unique mailboxes (see Email normalization).
- Stage 1 output: padded up to 100,000 hashes (~6 MB), the next power of 10, so the file size reveals only the order of magnitude, not the real count.
- A synthetic 5,000-customer "shared" cohort (in production this comes from running `intersect` against the other side's stage1).
- Stage 2 output: the matching purchase records collapse to a per-product count for the shared cohort, with any product bought by fewer than k=5 shared customers suppressed automatically.
Top of the resulting stage2.csv:
product,shared_customer_count
Product A,1302
Product B,1245
Product C,895
Product D,753
Product E,738
Product F,263
Product G,196

When the other party's report comes back, the absolute numbers will differ but the shape is what matters: products with high penetration on your side and low on theirs (or vice versa) are your cross-sell candidates.
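Raw counts are easier to compare across the two reports once they are expressed as a share of the shared cohort. A minimal sketch, assuming you know the cohort size (the line count of `shared.txt`) and that product names contain no commas; `penetration` is a hypothetical helper:

```shell
# Sketch only; penetration is a hypothetical helper. Assumes product names
# contain no commas and that you know the shared cohort size.
# penetration COHORT_SIZE: read a stage2-style CSV on stdin, append a percent column.
penetration() {
  awk -F, -v n="$1" '
    NR == 1 { print $0 ",percent_of_shared"; next }          # extend the header
    NF >= 2 { printf "%s,%s,%.1f\n", $1, $2, 100 * $2 / n }  # count / cohort size
  '
}
```

With the 5,000-customer example cohort, `penetration 5000 < stage2.csv` puts Product A's 1,302 shared customers at about 26% of the cohort.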
Everything below is for setting up your environment, understanding the cryptographic guarantees in detail, or debugging.
What to share with the other side (just zip the whole directory, or invite them to the repo):
- `secure-list-sharing.php`: the script. Single file, no Composer dependencies.
- `README.md`: this file.
- `.env.example`: template for local configuration. Each side copies it to `.env` and fills in their values.
- `.gitignore`: keeps `.env`, `secret.txt`, and stage outputs out of version control.
- `tools/extract-customers.php`, `tools/extract-purchases.php`: read-only `wp-cli` scripts for extracting CSVs from EDD when you can't connect to the database directly (managed hosting). See Extracting CSVs via wp-cli over SSH.
What you generate and exchange (gitignored; never commit any of these):
- `secret.txt`: the shared HMAC secret (32+ chars). Distribute over a secure channel.
- `stage1.txt`: your hashed, padded customer list.
- `stage1-theirs.txt`: what the other side sends you.
- `shared.txt`: the intersection (both sides should produce identical output).
- `stage2.csv`: your aggregate product penetration report on the shared cohort.
- PHP 8.0 or newer.
- `pdo_mysql` extension (only if using `--source=edd`).
- Network access to your WordPress database (only if using `--source=edd`).
The script is a single file with no Composer dependencies.
Copy .env.example to .env and edit. CLI flags always override .env. Run php secure-list-sharing.php config to see what the script will use given your current setup (the secret is masked in the output).
| Key | Purpose |
|---|---|
| `SECRET_FILE` | Path to the shared-secret file (recommended). |
| `SECRET` | Inline secret value (alternative; avoid for shared environments). |
| `SOURCE` | `edd` or `csv`. |
| `WP_CONFIG` | Path to `wp-config.php` (when `SOURCE=edd`). |
| `DB_HOST`, `DB_USER`, `DB_PASS`, `DB_NAME`, `DB_PREFIX` | Manual DB credentials (when `wp-config.php` parsing fails). |
| `CSV_PATH` | Path to your customer CSV (when `SOURCE=csv`). |
| `K` | K-anonymity threshold (default 5). |
| `HASH_OUT`, `INTERSECT_OUT`, `REPORT_OUT` | Output paths for each mode. |
| `MINE_FILE`, `THEIRS_FILE`, `SHARED_FILE` | Input paths used by `intersect` / `product-report`. |
Run php secure-list-sharing.php help for the latest help text.
| Mode | Purpose |
|---|---|
| `hash` | Read your customer list, output a sorted, padded hash file. |
| `intersect` | Take both parties' hash files, output the shared hashes. |
| `product-report` | For shared customers only, output (product, count) with k-anonymity. |
| `config` | Print resolved settings (CLI > .env > defaults). For debugging. |
| Flag | Description |
|---|---|
| `--env=PATH` | Path to .env file (default: .env next to the script). |
| `--secret-file=PATH` | Shared secret file. Preferred over inline. |
| `--secret=VALUE` | Shared secret inline. Visible in shell history; avoid. |
| `--source=csv\|edd` | Data source. |
| `--csv=PATH` | CSV file (when `--source=csv`). |
| `--wp-config=PATH` | `wp-config.php` path (when `--source=edd`). |
| `--db-host`, `--db-user`, `--db-pass`, `--db-name`, `--db-prefix` | Manual EDD DB credentials. |
| `--out=PATH` | Output file. |
| `--mine`, `--theirs` | Hash files for `intersect`. |
| `--shared=PATH` | Shared-hash file for `product-report`. |
| `--k=N` | K-anonymity threshold (default 5). |
Pure-CLI example (no .env):
php secure-list-sharing.php hash \
--source=edd --wp-config=/var/www/html/wp-config.php \
--secret-file=secret.txt --out=stage1.txt
php secure-list-sharing.php intersect \
--mine=stage1.txt --theirs=stage1-theirs.txt --out=shared.txt
php secure-list-sharing.php product-report \
--source=edd --wp-config=/var/www/html/wp-config.php \
--secret-file=secret.txt --shared=shared.txt --out=stage2.csv

With `--source=edd`, the script reads paid customers and product purchases directly from the WordPress database. It uses these tables (with your $table_prefix):
- `{prefix}edd_customers`
- `{prefix}edd_orders`, filtered to `status IN ('complete', 'partially_refunded')`
- `{prefix}edd_order_items`, filtered to `type = 'download'` and complete/partially-refunded status
- `{prefix}posts`, joined for canonical product titles, so EDD price variations (`Product A — Single Site`, `— Up to 3 Sites`, etc.) collapse into one row (`Product A`)
Connect via:
--source=edd --wp-config=/path/to/wp-config.php
# or
--source=edd --db-host=localhost --db-user=root --db-pass=root \
--db-name=wordpress --db-prefix=wp_

If your `wp-config.php` is unusual (multi-line `define()`, dynamic values pulled from a secrets manager, etc.), parsing may fail. Use the explicit `--db-*` flags.
With `--source=csv`, the input is a CSV with a header row.
For hash mode, only an email column is required:
email
alice@example.com
bob@example.com

For product-report mode, you also need a product column. Multiple rows per customer (one per product purchased) is the expected shape:
email,product
alice@example.com,Product A
alice@example.com,Product B
bob@example.com,Product A

Acceptable header aliases: `email` / `customer_email` / `e-mail` / `mail`, and `product` / `product_name` / `item` / `item_name` / `sku`.
If your store runs on managed hosting (Convesio, WP Engine, Pantheon, Kinsta, etc.) where the database isn't reachable from your laptop, use --source=csv and produce the CSVs via wp-cli over SSH. Two read-only extract scripts are bundled in tools/:
- `tools/extract-customers.php` → produces `customers.csv`.
- `tools/extract-purchases.php` → produces `purchases.csv` (uses `wp_posts` for canonical product titles).
Both run via wp eval-file and only execute SELECT statements against the EDD tables. Recipe:
SSH_USER=youruser
SSH_HOST=your.host.com
SSH_PORT=22
WP_PATH=/var/www/wordpress
# Stage 1 input — customer emails:
ssh -p $SSH_PORT $SSH_USER@$SSH_HOST \
"tmpf=\$(mktemp); cat > \$tmpf; cd $WP_PATH && wp eval-file \$tmpf; rm \$tmpf" \
< tools/extract-customers.php > customers.csv
# Stage 2 input — purchases:
ssh -p $SSH_PORT $SSH_USER@$SSH_HOST \
"tmpf=\$(mktemp); cat > \$tmpf; cd $WP_PATH && wp eval-file \$tmpf; rm \$tmpf" \
< tools/extract-purchases.php > purchases.csv

The mktemp dance is because some wp installations don't read piped scripts via `wp eval-file /dev/stdin` reliably over SSH. Writing the script to a remote temp file first works everywhere; the file is removed before SSH disconnects.
Then point your .env at the resulting CSVs:
SOURCE=csv
CSV_PATH=./customers.csv # for hash mode
# (CSV_PATH=./purchases.csv when running product-report)
Both parties must produce identical hashes for the same logical mailbox. The script applies the following deterministic normalization before hashing:
- `trim()` and `strtolower()`.
- Strip `+aliases` from the local part (`bob+netflix@gmail.com` → `bob@gmail.com`).
- For `gmail.com` and `googlemail.com`: strip dots from the local part and canonicalize the domain to `gmail.com`.
This is correct deduplication, not data loss. Two examples on a real customer database:
- One human signed up at your store as `bob@gmail.com` and at the other side's store as `b.ob+netflix@gmail.com`. Without normalization those produce different hashes and the overlap is missed. With normalization, both sides hash to the same value and the customer is correctly identified as shared.
- One human signed up at your store twice: once as `alice@gmail.com` and later as `alice+work@gmail.com`. Without normalization your customer list "has 2 customers"; with normalization it has 1, which is what was true all along.
Normalization does not invent false matches. The two transformations are safe by construction:
- Gmail dot-stripping: Google enforces uniqueness on the dot-stripped form of every Gmail address. `b.ob@gmail.com` and `bo.b@gmail.com` cannot be two different humans; they are literally the same mailbox by Google's design. Zero false positives.
- `+tag` stripping: For `bob+x@example.com` to be in your customer list at all, signup verification had to deliver to it. If the mail server delivered to `bob+x@`, the server treats `+` as routing-only, which means `bob@` reaches the same mailbox. The only way this is wrong is a corporate mail server deliberately configured to route `+aliases` to a different person, which is vanishingly rare and not something you've encountered if you're running a normal e-commerce store.
In practical terms: when secure-list-sharing.php hash reports a "real customer" count slightly lower than your database row count, that's the script correctly merging aliases. Without normalization, you would systematically undercount overlap with the other side.
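The three normalization rules are small enough to sketch in portable shell. This is an illustration of the rules as described here, not the script's code; `normalize_email` is a hypothetical helper and it assumes the input is a syntactically valid address.

```shell
# Sketch only; normalize_email is a hypothetical helper that mirrors the three
# rules described in this section. Assumes a syntactically valid address.
normalize_email() (
  # Rule 1: trim surrounding whitespace and lowercase.
  e=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' |
      sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
  local_part=${e%@*}      # everything before the last @
  domain=${e##*@}         # everything after the last @
  # Rule 2: strip a +alias from the local part.
  local_part=${local_part%%+*}
  # Rule 3 (Gmail only): drop dots and canonicalize the domain.
  case $domain in
    gmail.com|googlemail.com)
      local_part=$(printf '%s' "$local_part" | tr -d '.')
      domain=gmail.com
      ;;
  esac
  printf '%s@%s\n' "$local_part" "$domain"
)
```

Both `bob@gmail.com` and `b.ob+netflix@gmail.com` come out as `bob@gmail.com`, so they would hash to the same value. Dots in non-Gmail local parts are left alone.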
The hash file is padded with random 32-byte values, hex-encoded as 64 characters each (indistinguishable from real HMAC outputs), up to the next power of 10:
| Real customers | Padded total |
|---|---|
| 1 – 10 | 10 |
| 11 – 100 | 100 |
| 101 – 1,000 | 1,000 |
| 1,001 – 10,000 | 10,000 |
| 10,001 – 100,000 | 100,000 |
| … | … |
This means the size of the hash file no longer leaks your exact customer count. The other side learns only the order of magnitude (e.g. "between 10K and 100K customers") — typically information you've already disclosed in marketing copy.
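The padding rule in the table can be sketched as follows. The helper names `next_pow10` and `pad_hashes` are hypothetical, and reading random bytes from `/dev/urandom` is an assumption about the environment; the real script may generate its noise differently.

```shell
# Sketch only; next_pow10 and pad_hashes are hypothetical helpers.

# next_pow10 N: smallest power of 10 (at least 10) that is >= N.
next_pow10() (
  p=10
  while [ "$p" -lt "$1" ]; do p=$((p * 10)); done
  echo "$p"
)

# pad_hashes: read real hashes on stdin, add random 64-char hex lines up to
# the next power of 10, and sort so padding is mixed in with real entries.
pad_hashes() (
  real=$(cat)
  n=$(printf '%s\n' "$real" | grep -c . || true)
  pad=$(( $(next_pow10 "$n") - n ))
  {
    if [ "$n" -gt 0 ]; then printf '%s\n' "$real"; fi
    if [ "$pad" -gt 0 ]; then
      # 32 random bytes per padding line, hex-encoded and folded to 64 chars.
      od -An -v -tx1 -N "$((pad * 32))" /dev/urandom | tr -d ' \n' | fold -w 64
      echo
    fi
  } | LC_ALL=C sort
)
```

For the illustrative ~20,000-customer store, `next_pow10 20000` gives 100000, matching the 100,000-line Stage 1 file described earlier.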
The product-report mode counts distinct shared customers per product. By default, any product with fewer than 5 shared customers is suppressed entirely from the output. This prevents single-customer disclosures like "1 shared customer bought your $50,000 enterprise tier" — which, combined with other context, could identify the customer.
You can change the threshold:
- `--k=5` (default): conservative.
- `--k=10`: very conservative; products need real penetration to show up.
- `--k=1`: no suppression. Don't use this on real data unless you've thought about it.
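The suppression rule amounts to "count distinct customers per product, then drop rows under k". A minimal sketch, assuming the input is already restricted to shared customers and has `customer,product` rows with no commas inside either field (`product_report` here is a hypothetical helper, not the script's mode):

```shell
# Sketch only; product_report is a hypothetical helper. Assumes the input is
# already restricted to shared customers, with rows of customer,product and
# no commas inside either field.
# product_report K: read the CSV (with header) on stdin, print product,count.
product_report() {
  echo "product,shared_customer_count"
  tail -n +2 |                     # skip the header row
    LC_ALL=C sort -u |             # one row per distinct (customer, product)
    awk -F, -v k="$1" '
      { count[$2]++ }              # distinct shared customers per product
      END {
        for (p in count)
          if (count[p] >= k) print p "," count[p]   # suppress cells under k
      }
    ' |
    LC_ALL=C sort -t, -k2,2nr -k1,1   # largest counts first
}
```

De-duplicating before counting matters: a customer who bought the same product twice must still count once, otherwise a cell could clear the threshold with fewer than k real people in it.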
This protocol is designed for the cooperating-but-cautious counterparty threat model. Each successive class of adversary below is harder to defeat. The protocol defends against:
- Casual leaks of the exchanged files (laptop stolen, hash file accidentally committed to git, third party intercepts the file). An attacker without the secret cannot recover any emails — they have only random opaque strings.
- Naive offline reconstruction. Without the secret, hashing every email in a public breach corpus and looking for matches doesn't work — they can't compute the HMAC without the key.
- List-size leakage. Padding to a power of 10 ensures neither party learns the other's exact customer count.
- Per-customer purchase disclosure. The product report only ever emits aggregate counts per product, never `(customer, product)` pairs.
- Single-customer identification. K-anonymity suppresses cells small enough to identify individuals.
The protocol does not defend against:
- A malicious counterparty who has the secret. Once both parties hold the secret, either side can hash any specific email guess and check for membership ("is `bob@acme.com` your customer?"). They can do this for any list of emails they obtain (competitor scrapes, breach dumps, conferences, LinkedIn) as long as the secret hasn't been destroyed.

  Mitigation: use the secret once, then destroy it. Don't reuse it for future exchanges. If you suspect bad faith on the other side, don't run the protocol with them.
- Strategic list manipulation. A party could run the protocol with a deliberately curated subset of their customers (e.g. only customers in a specific niche) to learn things about that subset's overlap. The other side has no way to verify the list is complete.
- Frequency analysis on the product report. If you publish a product report from a tiny shared cohort with `--k=1`, individual purchases may be re-identifiable through process of elimination. Always use `--k=5` or higher, and don't run the product report when the shared set is itself small (say, <50 customers).
- Network-level adversaries. This is a file-exchange protocol; it doesn't do anything for transport security. Send the files over TLS-protected channels (encrypted email, signed S3 link, etc.).
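The first gap is worth seeing concretely, because it takes only a few lines for anyone who holds the secret. A sketch, assuming HMAC-SHA256 and one hex hash per line in the Stage 1 file (`is_member` is a hypothetical helper; a real probe would also apply the same email normalization):

```shell
# Sketch only. Assumes HMAC-SHA256 and one hex hash per line; is_member is a
# hypothetical helper, and real guesses would need the same normalization.
# is_member SECRET EMAIL HASH_FILE: succeed if EMAIL's hash appears in HASH_FILE.
is_member() {
  guess=$(printf '%s' "$2" | openssl dgst -sha256 -hmac "$1" | awk '{print $NF}')
  grep -qxF "$guess" "$3"
}
```

Padding does not help here: random filler lines never match a real guess, so a hit is a true membership signal. This is why the protocol depends on a written agreement and a single-use secret.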
A stronger alternative is a Private Set Intersection (PSI) protocol, for example DDH-based PSI or PSI-CA. Both parties run an interactive cryptographic protocol where the only thing each side learns is the intersection (or even just the intersection size). Neither side learns the other's full set, and neither side can probe individual guesses even with all of their own protocol output.
Building PSI properly is a significant project (hundreds of lines of careful crypto code, multiple round trips, careful side-channel handling). For "two ecosystem peers looking for cross-sell opportunities," the shared-secret HMAC approach in this script is a practical compromise, but you should know the gap exists.
MIT © 2026 GravityKit.
Both parties run the same script for symmetry. It's provided as-is, with no warranty — read the Threat model before relying on it for anything sensitive.