diff --git a/.dockerignore b/.dockerignore index 2e6e678..4791eba 100644 --- a/.dockerignore +++ b/.dockerignore @@ -1,5 +1,4 @@ venv -output -bin -__pycache__ -.pytest_cache +**/output +**/__pycache__ +**/.pytest_cache diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml index 602dcf0..9516be8 100644 --- a/.github/workflows/python-app.yml +++ b/.github/workflows/python-app.yml @@ -18,7 +18,7 @@ jobs: - name: Install dependencies run: | python -m pip install --upgrade pip - pip install flake8 pytest + pip install flake8 pytest pytest-cov if [ -f requirements.txt ]; then pip install -r requirements.txt; fi - name: Lint with flake8 run: | @@ -28,4 +28,4 @@ jobs: flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics - name: Test with pytest run: | - pytest + pytest --cov=src --cov-report=term-missing --capture=no diff --git a/.gitignore b/.gitignore index 2e6e678..322a84d 100644 --- a/.gitignore +++ b/.gitignore @@ -1,5 +1,9 @@ venv -output +output* +bns +data +plots +*meta.txt bin __pycache__ .pytest_cache diff --git a/Dockerfile b/Dockerfile index cd0c712..160dec4 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,8 +1,20 @@ -FROM python:bookworm +FROM python:3.12-slim +# Install system dependencies +RUN apt-get update && apt-get install -y \ + build-essential \ + swig \ + libglpk-dev \ + python3-dev \ + libcdd-dev \ + libgmp-dev \ + && rm -rf /var/lib/apt/lists/* + +# Set working directory WORKDIR /workspace COPY . . +# Install python packages RUN pip install --upgrade pip RUN if [ -f requirements.txt ]; then pip install -r requirements.txt; fi diff --git a/README.md b/README.md index 195d78e..511e356 100644 --- a/README.md +++ b/README.md @@ -2,82 +2,112 @@ Code for paper ["Towards Privacy-Aware Bayesian Networks: A Credal Approach"](https://doi.org/10.3233/FAIA251419) presented at [ECAI 2025](https://ecai2025.org/). 
-## Set up Python environment +## Preliminaries -Create and activate a Python virtual environment with: +### Experiments -```bash -python3 -m venv venv -source venv/bin/activate[.fish] # use `.fish` suffix if using fish shell -``` +`` is the name of the experiment to run. Each `` has its own directory, which is named the same way. Each directory contains the experiment logic, a configuration file (`config.yaml`), any generated models and data, an output directory, and a `Plot_results.ipynb` notebook to plot results. + +`` can be one of the following: + +1. `cn_privacy`: run a membership inference attack against a Bayesian network (BN) and its related credal network (CN), and compute the theoretical privacy estimate of the BN. + +2. `cn_vs_noisybn`: compare two privacy techniques, namely the CN and a noisy version of the BN. All models are naive Bayes with target variable T. First, the CN and noisy BN hyperparameters are fine-tuned so that they achieve the same privacy level; then, their accuracy is computed in terms of most probable explanation (MPE) on variable T. + +For additional details, we refer to the paper. + +### Attacks and defenses -Install all dependencies with: +Each experiment requires the user to specify one defense mechanism and one attack mechanism, plus any related hyperparameters. The mechanism and hyperparameter names are listed below. + +Implemented defenses: +- `def_idm`. Requires: `ess`. +- `def_ran`. Requires: `delta`. + +Implemented attacks: +- `atk_mle`. Requires: `n_bns`. +- `atk_cen`. +- `atk_ran`. +- `atk_ent`. + +## Running code + +### Using Docker (recommended) + +The `compose.yaml` file contains a set of predefined experiments. Additional ones can also be specified. The `generate_compose.py` file helps generate them automatically. 
+ +Generate models and data for all experiments (controlled by `config.yaml`): ```bash -pip install -r requirements.txt +python -m experiments.cn_privacy.generate +python -m experiments.cn_vs_noisybn.generate ``` -Upgrade all Python packages with: +Run one or more experiments with: ```bash -pip install --upgrade $(pip freeze | cut -d '=' -f 1) -pip freeze > requirements.txt +docker compose up [service name] ``` -This updates the requirements file with the upgraded packages. +Results will be available under `experiments//output_*`. -## Experiments +To check the status, run one or more of the following: -`` is the name of the experiment to run. It can be one of the following. - -1. `cn_privacy`: run membership inference attack against a Bayesian network (BN), its related credal network (CN), and compute the theoretical privacy estimate of BN. The pipeline and results are described in the paper. +```bash +docker compose ps +docker compose logs [service name] +docker stats +``` -2. `cn_vs_noisybn`: additional experiment, not reported in the paper. It compares two privacy techniques, namely the CN and a noisy version of BN. All models are naive Bayes with target variable T. First, the CN and noisy BN hyperparameters are fine-tuned so that they achieve the same privacy level; then, their accuracy is computed in terms of most probable explanation (MPE) on variable T. +### Local computation -## Run code +Create and activate a Python virtual environment: -### With Docker (recommended) +```bash +python3 -m venv venv +source venv/bin/activate[.fish] # use `.fish` suffix if using fish shell +``` -1. Build the Docker image: +Install dependencies: ```bash -docker build . -t bnp:2025 +pip install -r requirements.txt ``` -2. Run the experiment: +*Notice*: if some package is missing locally, see the `Dockerfile` for additional packages to be installed (names refer to Ubuntu/Debian). 
+ +Upgrade dependencies: ```bash -docker run [-d] [--rm] -v bnp:/workspace bnp:2025 python -m experiments..main +pip install --upgrade $(pip freeze | cut -d '=' -f 1) +pip freeze > requirements.txt ``` -3. Results available at: +*Notice:* each of the following commands will overwrite any related output. -`/var/lib/docker/volumes/bnp/_data/experiments//output/`. +Generate models and data (controlled by `config.yaml`): -### Without Docker +```bash +python -m experiments..generate +``` -1. Run the experiment: +Run an experiment: ```bash -python -m experiments..main +python -m experiments..exp def_mec= [param=value] atk_mec= [param=value] ``` -2. Results available at: +Results will be available under `experiments//output`. -`experiments//output/`. -## Test code +## Testing code -Run tests with: +Run integration tests: ```bash -pytest +pytest [--cov=src] [--cov-report=term-missing] [--capture=no] ``` -Test results are available at: - -`test//output/`. - ## Formatting and linting Format code by running: @@ -91,7 +121,7 @@ Lint code by running: ```bash flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics --exclude=venv -flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics --exclude=venv +flake8 . --count --exit-zero --max-complexity=10 --ignore=E203 --max-line-length=140 --statistics --exclude=venv ``` Analyze code by running: @@ -99,7 +129,3 @@ Analyze code by running: ```bash pylint $(git ls-files '*.py') ``` - -## Plot results - -Use the `Plot_results.ipynb` notebook available for each experiment. Plots will be saved at: `experiments//output/plots`. diff --git a/compose.yaml b/compose.yaml new file mode 100644 index 0000000..66d6217 --- /dev/null +++ b/compose.yaml @@ -0,0 +1,86 @@ +version: '3.9' +services: + cn_vs_noisybn_def_idm_atk_ran_ess1: + build: . 
+ command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_idm + - atk_mec=atk_ran + - ess=1 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_idm_atk_ran_ess1:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_idm_atk_ran_ess100: + build: . + command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_idm + - atk_mec=atk_ran + - ess=100 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_idm_atk_ran_ess100:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_idm_atk_ran_ess50: + build: . + command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_idm + - atk_mec=atk_ran + - ess=50 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_idm_atk_ran_ess50:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_ran_atk_ran_delta0.001: + build: . + command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_ran + - atk_mec=atk_ran + - delta=0.001 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_ran_atk_ran_delta0.001:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_ran_atk_ran_delta0.05: + build: . 
+ command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_ran + - atk_mec=atk_ran + - delta=0.05 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_ran_atk_ran_delta0.05:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_ran_atk_ran_delta0.1: + build: . + command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_ran + - atk_mec=atk_ran + - delta=0.1 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_ran_atk_ran_delta0.1:/workspace/experiments/cn_vs_noisybn/output diff --git a/experiments/__init__.py b/experiments/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/cn_privacy/Plot_results.ipynb b/experiments/cn_privacy/Plot_results.ipynb index 6492fc0..9c157ed 100644 --- a/experiments/cn_privacy/Plot_results.ipynb +++ b/experiments/cn_privacy/Plot_results.ipynb @@ -15,9 +15,12 @@ "import numpy as np\n", "import sys\n", "from pathlib import Path\n", + "import re\n", + "import ast\n", + "from itertools import cycle, product\n", "\n", "sys.path.insert(0, str(Path().resolve().parents[1]))\n", - "from src.config import *" + "from src.config import * # noqa" ] }, { @@ -28,13 +31,31 @@ "outputs": [], "source": [ "# Choose config file\n", - "config = get_config(\"config.yaml\")\n", + "config = load_config(\"cn_privacy\")\n", "\n", - "# Get results path\n", - "res_path = get_base_path(config) / config[\"results_path\"]\n", + "# Choose what to plot\n", + "folder = \"cn_privacy_20251127_cat_ln\"\n", + "params = dict()\n", + "params[\"def_mec\"] = [\"def_idm\", \"def_ran\"]\n", + "params[\"atk_mec\"] = [\"atk_mle\"]\n", + "params[\"ess\"] = [1]\n", + 
"params[\"delta\"] = [1.0]\n", + "cur_dir = get_cur_dir(config)\n", + "\n", + "# Get results paths\n", + "res_path = {}\n", + "for def_mec, atk_mec in product(params[\"def_mec\"], params[\"atk_mec\"]):\n", + " arg_str = \"ess\" if def_mec == \"def_idm\" else \"delta\"\n", + " arg_vals = [x for x in params[arg_str]]\n", + " for arg_val in arg_vals:\n", + " res_path[(def_mec, atk_mec, arg_str, arg_val)] = (\n", + " cur_dir\n", + " / folder\n", + " / f\"output_{def_mec}_{atk_mec}_{f'{arg_str}{arg_val}'}/results\"\n", + " )\n", "\n", "# Choose where to save plots\n", - "plots_path = get_base_path(config) / \"plots\"\n", + "plots_path = cur_dir / \"plots\"\n", "create_clean_dir(plots_path)" ] }, @@ -65,9 +86,12 @@ ")\n", "\n", "# Colors\n", - "bound_color = \"#ff441a\" # red\n", - "BN_color = \"#3934fe\" # blue\n", - "CN_color = \"#28bd6b\" # green\n", + "palette_cn = dict(\n", + " zip(res_path.keys(), sns.color_palette(palette=\"viridis\", n_colors=len(res_path)))\n", + ")\n", + "palette_bn = sns.color_palette(palette=\"afmhot\")\n", + "bound_color = palette_bn[3]\n", + "BN_color = palette_bn[2]\n", "alpha = 0.2" ] }, @@ -78,33 +102,24 @@ "metadata": {}, "outputs": [], "source": [ - "# Plot BN and bound (semilogx)\n", - "def plot_bn_bound(exp: str, ax):\n", + "# Plot BN (semilogx)\n", + "def plot_bn(path: Path, exp: str, ax, fill: bool = True):\n", "\n", " # Import results\n", - " results = os.listdir(res_path)\n", - " r_path = [r for r in results if f\"{exp}-ess1.csv\" in r][0]\n", - " result = pd.read_csv(f\"{res_path}/{r_path}\")\n", - " error = result[\"error\"]\n", - " bound = result[\"power_bound\"]\n", - " bn_cols = [c for c in result.columns if \"BN\" in c]\n", - " bn_mean = result.loc[:, bn_cols].mean(axis=1)\n", - " bn_max = result.loc[:, bn_cols].max(axis=1)\n", + " files = os.listdir(path / \"bns\")\n", + " r_path = [r for r in files if f\"{exp}\" in r][0]\n", + " res = pd.read_csv(f\"{path}/bns/{r_path}\")\n", + " error = res[\"error\"]\n", "\n", - " # 
Plot bound\n", - " ax.semilogx(\n", - " error,\n", - " bound,\n", - " \"^\",\n", - " color=bound_color,\n", - " label=\"Theoretical estimate\",\n", - " markersize=4,\n", - " zorder=4,\n", - " )\n", + " # Select what to plot\n", + " bn_cols = [c for c in res.columns if \"BN\" in c]\n", + " bn_mean = res.loc[:, bn_cols].mean(axis=1)\n", + " bn_max = res.loc[:, bn_cols].max(axis=1)\n", "\n", " # Plot BN (avg-max)\n", - " ax.fill_between(error, bn_mean, bn_max, color=BN_color, alpha=alpha, zorder=2)\n", - " ax.semilogx(error, bn_mean, \"-\", color=BN_color, label=\"BN\", zorder=3)\n", + " (line,) = ax.semilogx(error, bn_mean, \"-\", color=BN_color, label=\"BN\", zorder=3)\n", + " if fill:\n", + " ax.fill_between(error, bn_mean, bn_max, color=BN_color, alpha=alpha, zorder=2)\n", "\n", " # Title and axes\n", " ax.set_xlabel(\"Error\")\n", @@ -121,38 +136,59 @@ " True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1\n", " )\n", "\n", + " return line\n", + "\n", "\n", - "# Plot CN (for a given ess)\n", - "def plot_cn(exp, ax, ess: int, color: str, type: str):\n", + "# Plot bound (semilogx)\n", + "def plot_bound(path: Path, exp: str, ax):\n", "\n", " # Import results\n", - " results = os.listdir(res_path)\n", - " r_path = [r for r in results if f\"{exp}-ess{ess}.csv\" in r][0]\n", - " result = pd.read_csv(f\"{res_path}/{r_path}\")\n", - " error = result[\"error\"]\n", - " cn_cols = [c for c in result.columns if \"CN\" in c]\n", - " cn_mean = result.loc[:, cn_cols].mean(axis=1)\n", - " cn_max = result.loc[:, cn_cols].max(axis=1)\n", + " files = os.listdir(path / \"bns\")\n", + " r_path = [r for r in files if f\"{exp}\" in r][0]\n", + " res = pd.read_csv(f\"{path}/bns/{r_path}\")\n", + " bound = res[\"power_bound\"]\n", + " error = res[\"error\"]\n", + "\n", + " # Plot bound\n", + " (line,) = ax.semilogx(\n", + " error,\n", + " bound,\n", + " \"^\",\n", + " color=bound_color,\n", + " markersize=3,\n", + " zorder=4,\n", + " label=\"Theoretical 
estimate\",\n", + " mec=None,\n", + " )\n", + "\n", + " return line\n", + "\n", + "\n", + "# Plot CN\n", + "def plot_cn(path: Path, color, exp, ax, type: str, fill: bool = True):\n", + "\n", + " # Import results\n", + " files = os.listdir(path / \"cns\")\n", + " r_path = [r for r in files if f\"{exp}\" in r][0]\n", + " res = pd.read_csv(f\"{path}/cns/{r_path}\")\n", + " error = res[\"error\"]\n", + "\n", + " # Select what to plot\n", + " cn_cols = [c for c in res.columns if \"CN\" in c]\n", + " cn_mean = res.loc[:, cn_cols].mean(axis=1)\n", + " cn_max = res.loc[:, cn_cols].max(axis=1)\n", "\n", " # Plot CN (avg-max)\n", - " ax.fill_between(error, cn_mean, cn_max, color=color, alpha=alpha, zorder=2)\n", - " ax.semilogx(error, cn_mean, type, color=color, label=f\"CN, $S={ess}$\", zorder=3)\n", - "\n", - " # Legend\n", - " if exp == \"exp0\":\n", - " ax.legend(\n", - " loc=\"best\",\n", - " frameon=True,\n", - " fancybox=False,\n", - " framealpha=1,\n", - " facecolor=\"#e6e6e6\",\n", - " edgecolor=\"#8c8c8c\",\n", - " )\n", + " (line,) = ax.semilogx(error, cn_mean, type, color=color, zorder=3)\n", + " if fill:\n", + " ax.fill_between(error, cn_mean, cn_max, color=color, alpha=alpha, zorder=2)\n", + "\n", + " return line\n", "\n", "\n", "# Plot title function\n", "def get_title(exp: str):\n", - " with open(f\"{res_path}/exp_meta.txt\", \"r\") as meta:\n", + " with open(f\"{cur_dir}/exp_meta.txt\", \"r\") as meta:\n", " for row in meta:\n", " if exp in row:\n", " pieces = row.split()\n", @@ -169,33 +205,52 @@ "metadata": {}, "outputs": [], "source": [ + "# Names of experiments\n", + "from natsort import natsorted\n", + "\n", + "exp_names = natsorted(\n", + " [re.findall(\"(\\w+\\d+)\\.csv\", r)[0] for r in os.listdir(cur_dir / \"data\")]\n", + ")\n", + "\n", "# Layout 4x3\n", - "fig, axes = plt.subplots(4, 3, figsize=(10, 9.5))\n", - "# fig.suptitle(\"Power vs Error\", fontsize=18)\n", - "\n", - "exps = [\n", - " \"exp0\",\n", - " \"exp1\",\n", - " \"exp2\",\n", - " 
\"exp3\",\n", - " \"exp4\",\n", - " \"exp5\",\n", - " \"exp6\",\n", - " \"exp7\",\n", - " \"exp8\",\n", - " \"exp9\",\n", - " \"exp10\",\n", - " \"exp11\",\n", - "]\n", - "ess = [1, 1000]\n", - "\n", - "# Loop over subplots\n", - "for i, ax in enumerate(axes.flat):\n", - " plot_bn_bound(f\"{exps[i]}\", ax)\n", - " plot_cn(f\"{exps[i]}\", ax, ess=ess[0], color=CN_color, type=\"-\")\n", - " plot_cn(f\"{exps[i]}\", ax, ess=ess[1], color=CN_color, type=\"--\")\n", - " (n, e, c) = get_title(exps[i])\n", - " ax.set_title(f\"Nodes: {n}, Edges: {e}, Complexity: {c}\")\n", + "fig, axes = plt.subplots(len(exp_names) // 3 + 1, 3, figsize=(9, 8))\n", + "fig.suptitle(f\"Power vs Error\", fontsize=15)\n", + "\n", + "# Loop over results\n", + "for i, exp in enumerate(exp_names):\n", + "\n", + " # Plot BN & bound\n", + " ax = axes.flat[i]\n", + " path_bn = list(res_path.values())[0]\n", + " bn = plot_bn(path_bn, exp, ax)\n", + " plot_bound(path_bn, exp, ax)\n", + "\n", + " # Plot CNs\n", + " for res in res_path:\n", + " (def_mec, atk_mec, arg_str, arg_val) = res\n", + " path = res_path[res]\n", + "\n", + " cn = plot_cn(path, palette_cn[res], exp, ax, type=\"-\", fill=True)\n", + "\n", + " # Legend\n", + " cn.set_label(f\"CN, {def_mec}: {arg_str}={arg_val}, {atk_mec}\")\n", + " if i == 0:\n", + " ax.legend(\n", + " loc=\"best\",\n", + " frameon=True,\n", + " fancybox=False,\n", + " framealpha=1,\n", + " facecolor=\"#e6e6e6\",\n", + " edgecolor=\"#8c8c8c\",\n", + " )\n", + "\n", + " # Title\n", + " (n, e, c) = get_title(exp_names[i])\n", + " ax.set_title(f\"$N$: {n}, $E$: {e}, $C(\\mathcal G)$: {c}\")\n", + "\n", + " # # Log Y scale\n", + " # ax.set_yscale('log')\n", + " # ax.set_ylim(0.9*cn.get_ydata().min(), 1.1*bn.get_ydata().max())\n", "\n", "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", "plt.show()\n", @@ -203,6 +258,40 @@ " f\"{plots_path}/results_logx.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False\n", ")" ] + }, + { + "cell_type": "markdown", + "id": "699bb07e", + 
"metadata": {}, + "source": [ + "def_idm:\n", + " * ess1 is more private than ess1000 (weird!), but only in huge nets\n", + " * atk_mle and atk_cen behave similarly, no real differences for each ess\n", + " * atk_ran only slighty more private only in huge nets, for each ess. With ess1000 it is more visible\n", + "\n", + "def_ran:\n", + " * Privacy increases with delta (expected), for each atk\n", + " * atk_mle and atk_cen behave similarly, no real differences for each delta\n", + " * atk_ran is more private, but not that much\n", + " * delta0.1 is similar to BN, while delta1.0 still leaks info, for each atk\n", + " * delta1.0 is less private than def_idm with ess1, for each atk, in huge nets (weird!)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc764066", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65698b92", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { diff --git a/experiments/cn_privacy/__init__.py b/experiments/cn_privacy/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/cn_privacy/config.yaml b/experiments/cn_privacy/config.yaml index 321871e..ff72094 100644 --- a/experiments/cn_privacy/config.yaml +++ b/experiments/cn_privacy/config.yaml @@ -1,27 +1,27 @@ ## Configuration file # Paths -base_path: experiments/cn_privacy/output # Base path for output +cur_dir: experiments/cn_privacy # Current directory (contains all the following) bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs data_path: data # Where to save data as generated from ground-truth BNs -results_path: results # Where to save the experiment results -meta_file: exp_meta.txt # File of metadata +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt 
# File of metadata for experiments # Models -n_nodes_vec: '[10, 20, 50, 100]' # List of models' number of nodes -edge_ratio_vec: '[1, 2, 4]' # List of models' edge ratio +n_nodes_vec: '[10, 15]' # List of models' number of nodes +edge_ratio_vec: '[1, 1.5]' # List of models' edge ratio +n_modmax: 2 # Maximum number of variables categories # Data -gpop_ss: 10000 # Sample size of general population +gpop_ss: 500 # Sample size of general population rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 5 # Number of data samples # MIA -n_samples: 20 # Number of data samples -n_bns: 500 # Number of BNs to sample within the CN -error: 'np.logspace(-4, 0, 20, endpoint=False)' # Type-I errors vector -ess_vec: '[1, 10, 50, 100, 1000]' # List of ESS +error: 'np.logspace(-4, 0, 10, endpoint=False)' # Type-I errors vector # Other -seed: 42 # Global seed num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization diff --git a/experiments/cn_privacy/config_BAK.yaml b/experiments/cn_privacy/config_BAK.yaml new file mode 100644 index 0000000..9251cec --- /dev/null +++ b/experiments/cn_privacy/config_BAK.yaml @@ -0,0 +1,28 @@ +## Configuration file + +# Paths + +cur_dir: experiments/cn_privacy # Current directory (contains all the following) +bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs +data_path: data # Where to save data as generated from ground-truth BNs +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments + +# Models +n_nodes_vec: '[10, 20, 50, 100]' # List of models' number of nodes +edge_ratio_vec: '[1, 2, 4]' # List of models' edge ratio +n_modmax: 2 # Maximum number of variables categories + +# Data 
+gpop_ss: 10000 # Sample size of general population +rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop +pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 20 # Number of data samples + +# MIA +error: 'np.logspace(-4, 0, 20, endpoint=False)' # Type-I errors vector + +# Other +num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization diff --git a/experiments/cn_privacy/exp.py b/experiments/cn_privacy/exp.py new file mode 100644 index 0000000..0a0349a --- /dev/null +++ b/experiments/cn_privacy/exp.py @@ -0,0 +1,63 @@ +import gc +import multiprocessing # noqa: F401 # pylint: disable=unused-import +import sys + +import numpy as np # noqa: F401 # pylint: disable=unused-import +from joblib import Parallel, delayed + +from src.attack import attack_mechanism +from src.config import create_clean_dir, get_cur_dir, load_config, map_sys_args +from src.defense import defense_mechanism +from src.mia import mia_vs_bn, mia_vs_cn, theoretical_power + + +def main(): + + # Init configs + config = load_config("cn_privacy") + cur_dir = get_cur_dir(config) + create_clean_dir(cur_dir / "output") + num_cores = eval(config["num_cores"]) + + # Get command-line hyperparameters + def_mec, def_args, atk_mec, atk_args = map_sys_args(sys.argv, config) + + # Init the vectors of experiments + exp_vec = [f.stem for f in (cur_dir / config["data_path"]).iterdir() if f.is_file()] + + # Defense mechanism + print("## Defense mechanism: [", def_mec, def_args, "] ##", flush=True) + create_clean_dir(cur_dir / config["cns_path"]) + _ = Parallel(n_jobs=num_cores)( + delayed(defense_mechanism)(exp, config, def_mec, def_args) for exp in exp_vec + ) + + # Attack mechanism + print("## Attack mechanism: [", atk_mec, atk_args, "] ##", flush=True) + create_clean_dir(cur_dir / config["atk_path"]) + _ = Parallel(n_jobs=num_cores)( + delayed(attack_mechanism)(exp, config, atk_mec, atk_args) for exp in exp_vec + ) + + # MIA vs CN + 
print("## MIA vs CN ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "cns") + _ = Parallel(n_jobs=num_cores)(delayed(mia_vs_cn)(exp, config) for exp in exp_vec) + + # MIA vs BN + print("## MIA vs BN ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "bns") + _ = Parallel(n_jobs=num_cores)(delayed(mia_vs_bn)(exp, config) for exp in exp_vec) + + # Compute theoretical power + print("## Compute theoretical power ##", flush=True) + _ = Parallel(n_jobs=num_cores)( + delayed(theoretical_power)(exp, config) for exp in exp_vec + ) + + # Clean + gc.collect() + + +if __name__ == "__main__": + main() diff --git a/experiments/cn_privacy/generate.py b/experiments/cn_privacy/generate.py new file mode 100644 index 0000000..bee8241 --- /dev/null +++ b/experiments/cn_privacy/generate.py @@ -0,0 +1,43 @@ +import gc +import multiprocessing # noqa: F401 # pylint: disable=unused-import + +import numpy as np # noqa: F401 # pylint: disable=unused-import +from joblib import Parallel, delayed + +from src.config import create_clean_dir, get_cur_dir, load_config +from src.data import generate_randombn +from src.learning import estimate_bns + + +def main(): + + # Init configs + config = load_config("cn_privacy") + cur_dir = get_cur_dir(config) + num_cores = eval(config["num_cores"]) + + # Generate BNs and data + print("## Generate BNs and data ##") + create_clean_dir(cur_dir / config["bns_path"]) + create_clean_dir(cur_dir / config["bns_path"] / "gt") + create_clean_dir(cur_dir / config["data_path"]) + open(f'{cur_dir}/{config["exp_meta"]}', "w").close() + generate_randombn(config) + + # Init the vectors of experiments + exp_vec = [f.stem for f in (cur_dir / config["data_path"]).iterdir() if f.is_file()] + + # Estimate BNs from rpop and pool + print("## Estimate BNs from rpop and pool ##") + create_clean_dir(cur_dir / config["bns_path"] / "rpop") + create_clean_dir(cur_dir / config["bns_path"] / "pool") + _ = Parallel(n_jobs=num_cores)( + 
delayed(estimate_bns)(exp, config) for exp in exp_vec + ) + + # Clean + gc.collect() + + +if __name__ == "__main__": + main() diff --git a/experiments/cn_privacy/main.py b/experiments/cn_privacy/main.py deleted file mode 100644 index edc5fb9..0000000 --- a/experiments/cn_privacy/main.py +++ /dev/null @@ -1,14 +0,0 @@ -from src.config import get_config -from src.data import generate_randombn -from src.run_exp import run_cn_privacy - -if __name__ == "__main__": - - # Load config - config = get_config("experiments/cn_privacy/config.yaml") - - # Generate BNs and data - generate_randombn(config) - - # Run experiment - run_cn_privacy(config) diff --git a/experiments/cn_vs_noisybn/Plot_results.ipynb b/experiments/cn_vs_noisybn/Plot_results.ipynb index f602c56..29f30f3 100644 --- a/experiments/cn_vs_noisybn/Plot_results.ipynb +++ b/experiments/cn_vs_noisybn/Plot_results.ipynb @@ -1,5 +1,13 @@ { "cells": [ + { + "cell_type": "markdown", + "id": "80ba8653", + "metadata": {}, + "source": [ + "Ensure the output directories are named as `output_<...>_`, where `def_arg` can be `ess` or `delta`, and `def_value` its value." 
+ ] + }, { "cell_type": "code", "execution_count": null, @@ -13,13 +21,16 @@ "import numpy as np\n", "import re\n", "import sys\n", + "import ast\n", + "import seaborn as sns\n", "from pathlib import Path\n", "from natsort import natsorted\n", "from sklearn.metrics import roc_curve\n", "from matplotlib.ticker import LogLocator\n", + "from statsmodels.stats.proportion import proportion_confint\n", "\n", "sys.path.insert(0, str(Path().resolve().parents[1]))\n", - "from src.config import *" + "from src.config import load_config, get_cur_dir, create_clean_dir" ] }, { @@ -30,20 +41,52 @@ "outputs": [], "source": [ "# Choose config file\n", - "config = get_config(\"config.yaml\")\n", + "config = load_config(\"cn_vs_noisybn\")\n", "\n", "# Get results path\n", - "res_path = get_base_path(config) / config[\"results_path\"]\n", + "cur_dir = get_cur_dir(config)\n", + "# res_path = cur_dir / config[\"results_path\"] / \"inferences\"\n", "\n", "# Choose where to save plots\n", - "plots_path = get_base_path(config) / \"plots\"\n", + "plots_path = cur_dir / \"plots\"\n", "create_clean_dir(plots_path)" ] }, { "cell_type": "code", "execution_count": null, - "id": "49a48f2f", + "id": "e07e1ceb", + "metadata": {}, + "outputs": [], + "source": [ + "# Names of experiments\n", + "pattern = re.compile(\"output_.*_(ess|delta)(\\d+\\.?\\d*)\")\n", + "exp_names = [re.findall(\"(\\w+\\d+)\\.csv\", r)[0] for r in os.listdir(cur_dir / \"data\")]\n", + "\n", + "# Choose what to plot\n", + "folder = \"cn_vs_noisybn_20251120_ln\"\n", + "def_mec = \"def_ran\"\n", + "atk_mec = \"atk_mle\"\n", + "out_dirs = [\n", + " item\n", + " for item in os.listdir(f\"{cur_dir}/{folder}\")\n", + " if pattern.match(item) and atk_mec in item and def_mec in item\n", + "]\n", + "\n", + "# Get ESS/delta list\n", + "def_arg = pattern.findall(out_dirs[0])[0][0]\n", + "x_values = [pattern.findall(i)[0][1] for i in out_dirs]\n", + "x_values = (\n", + " sorted([int(x) for x in x_values])\n", + " if def_mec == 
\"def_idm\"\n", + " else sorted([float(x) for x in x_values])\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad31867b", "metadata": {}, "outputs": [], "source": [ @@ -51,55 +94,42 @@ "def split_data(data: pd.DataFrame, col: str, threshold: float = 0.5) -> tuple:\n", "\n", " # Split results based on probabilities\n", - " cert_idx = data[(data[col] > threshold)].index\n", - " data_cert = data.iloc[cert_idx]\n", - " data_uncert = data[~data.index.isin(cert_idx)]\n", + " cond = data[col] > threshold\n", + " data_cert = data[cond]\n", + " data_uncert = data[~cond]\n", "\n", " return data_cert, data_uncert\n", "\n", "\n", - "# Accuracy function for BN\n", - "def get_acc_bn(data: pd.DataFrame, col: str, vs_col: str) -> float:\n", + "# Accuracy function for a BN\n", + "def get_acc_bn(data: pd.DataFrame, col: str, vs_col: str, alpha=0.05) -> list:\n", + "\n", + " succ = sum(data[col] == data[vs_col])\n", + " acc = succ / len(data)\n", + " lower, upper = proportion_confint(\n", + " count=succ, nobs=len(data), alpha=alpha, method=\"wilson\"\n", + " )\n", "\n", - " return sum(data[col] == data[vs_col]) / len(data)" + " return [acc, lower, upper]" ] }, { "cell_type": "code", "execution_count": null, - "id": "38f1e202", + "id": "84251635", "metadata": {}, "outputs": [], "source": [ - "res_path = get_base_path(config) / config[\"results_path\"]\n", - "dirs = natsorted(\n", - " [f\"{res_path}/{dir}\" for dir in os.listdir(f\"{res_path}/\") if \"results_\" in dir]\n", - ")\n", - "\n", - "res = {\n", - " \"ess\": [],\n", - " \"eps\": [],\n", - " \"acc_cn_cert\": [],\n", - " \"acc_cn_uncert\": [],\n", - " \"acc_cn_tot\": [],\n", - " \"acc_noisy_bn\": [],\n", - " \"cert_cn\": [],\n", - "}\n", - "\n", - "roc = {\n", - " \"roc_cn_cert\": dict(),\n", - " \"roc_cn_uncert\": dict(),\n", - " \"roc_cn_tot\": dict(),\n", - " \"roc_noisy_bn\": dict(),\n", - "}\n", - "\n", - "# For each ess...\n", - "for dir in dirs:\n", - "\n", - " # Get results\n", - " files 
= [f for f in os.listdir(dir) if \".csv\" in f]\n", - " data = pd.concat([pd.read_csv(dir + \"/\" + f) for f in files])\n", - " data.reset_index(inplace=True)\n", + "# Build an inferences data set for each output folder\n", + "res = dict()\n", + "for out_dir in out_dirs:\n", + " inferences_path = os.path.join(cur_dir, folder, out_dir, \"results/inferences\")\n", + " files = [os.path.join(inferences_path, f) for f in os.listdir(inferences_path)]\n", + " data = pd.concat((pd.read_csv(f) for f in files), axis=0)\n", + " data[\"cn_probs_1\"] = data.apply(\n", + " lambda row: row[\"cn_probs\"] if row[\"cn_mpes\"] == 1 else row[\"cn_probs_alt\"],\n", + " axis=1,\n", + " )\n", " data[\"bn_noisy_probs_1\"] = data.apply(\n", " lambda row: (\n", " row[\"bn_noisy_probs\"]\n", @@ -108,206 +138,225 @@ " ),\n", " axis=1,\n", " )\n", - " data[\"cn_probs_1\"] = data.apply(\n", - " lambda row: row[\"cn_probs\"] if row[\"cn_mpes\"] == 1 else row[\"cn_probs_alt\"],\n", - " axis=1,\n", - " )\n", - "\n", - " # Store ess\n", - " reg = re.search(\"nodes(\\d+)_ess(\\d+)\", dir)\n", - " n_nodes = reg.group(1)\n", - " ess = reg.group(2)\n", - "\n", - " # Store avg of eps and std\n", - " with open(dir + \"/exp_meta.txt\", \"r\") as f:\n", - " eps_vec = [\n", - " float(re.search(\"Eps: (.+)\\n\", line).group(1))\n", - " for line in f\n", - " if \"Eps: \" in line\n", - " ]\n", - " eps = (float(np.mean(eps_vec)), float(np.std(eps_vec)))\n", - "\n", - " # Split CN results based on probabilities\n", - " data_cert, data_uncert = split_data(data, \"cn_probs\", 0.5)\n", - "\n", - " # Compute accuracies\n", - " vs = \"gt\"\n", - "\n", - " acc_cn_cert = (\n", - " get_acc_bn(data_cert, f\"{vs}_mpes\", \"cn_mpes\") if len(data_cert) > 0 else None\n", - " )\n", - " acc_cn_uncert = (\n", - " get_acc_bn(data_uncert, f\"{vs}_mpes\", \"cn_mpes\")\n", - " if len(data_uncert) > 0\n", - " else None\n", - " )\n", - " acc_cn_tot = get_acc_bn(data, f\"{vs}_mpes\", \"cn_mpes\")\n", - " acc_noisy_bn = 
get_acc_bn(data, f\"{vs}_mpes\", \"bn_noisy_mpes\")\n", - "\n", - " # Compute CN certainty\n", - " cert_cn = sum(data[\"cn_probs\"] > 0.5) / len(data)\n", - "\n", - " # Compute ROC\n", - " roc_cn_cert = roc_curve(data_cert[f\"{vs}_mpes\"], data_cert[\"cn_probs_1\"])\n", - " roc_cn_uncert = roc_curve(data_uncert[f\"{vs}_mpes\"], data_uncert[\"cn_probs_1\"])\n", - " roc_cn_tot = roc_curve(data[f\"{vs}_mpes\"], data[\"cn_probs_1\"])\n", - " roc_noisy_bn = roc_curve(data[f\"{vs}_mpes\"], data[\"bn_noisy_probs_1\"])\n", - "\n", - " # Store results\n", - " for key in res.keys():\n", - " res[key].append(eval(key))\n", - " for key in roc.keys():\n", - " roc[key][ess] = eval(key)\n", - "\n", - " # Debug\n", - " assert (data[\"cn_probs\"] >= data[\"cn_probs_alt\"]).all()\n", - " assert (data[\"bn_noisy_probs\"] >= 0.5).all()\n", - " assert (data[\"bn_probs\"] >= 0.5).all()\n", - " assert len(data) == len(pd.read_csv(dir + \"/\" + files[0])) * len(files)\n", - "\n", - "# Debug\n", - "length = len(res[\"ess\"])\n", - "for key in res.keys():\n", - " assert len(res[key]) == length\n", - "for key in roc.keys():\n", - " assert len(roc[key]) == length\n", - "assert res[\"ess\"] == sorted(res[\"ess\"])" + " x = pattern.findall(out_dir)[0][1]\n", + " res[f\"{def_arg}{x}\"] = data\n", + "\n", + "# Retrieve AUCs for each result folder\n", + "aucs = dict()\n", + "for out_dir in out_dirs:\n", + " auc_path = os.path.join(cur_dir, folder, out_dir, \"results/auc_meta.csv\")\n", + " data = pd.read_csv(auc_path)\n", + " x = pattern.findall(out_dir)[0][1]\n", + " aucs[f\"{def_arg}{x}\"] = data" ] }, { "cell_type": "code", "execution_count": null, - "id": "b6274696", + "id": "581cee1c", "metadata": {}, "outputs": [], "source": [ "# Style\n", - "plt.style.use(\"seaborn-v0_8-paper\")\n", - "plt.rcParams.update(\n", - " {\n", - " \"font.size\": 9,\n", - " \"axes.labelsize\": 9,\n", - " \"axes.titlesize\": 9,\n", - " \"legend.fontsize\": 7,\n", - " \"xtick.labelsize\": 8,\n", - " 
\"ytick.labelsize\": 8,\n", - " \"lines.linewidth\": 0.8,\n", - " \"figure.dpi\": 300,\n", - " \"savefig.dpi\": 300,\n", - " \"axes.edgecolor\": \"black\",\n", - " \"axes.linewidth\": 0.8,\n", - " \"text.usetex\": True,\n", - " }\n", - ")" + "# plt.style.use(\"seaborn-v0_8-paper\")\n", + "# plt.rcParams.update(\n", + "# {\n", + "# \"font.size\": 9,\n", + "# \"axes.labelsize\": 9,\n", + "# \"axes.titlesize\": 9,\n", + "# \"legend.fontsize\": 7,\n", + "# \"xtick.labelsize\": 8,\n", + "# \"ytick.labelsize\": 8,\n", + "# \"lines.linewidth\": 0.8,\n", + "# \"figure.dpi\": 300,\n", + "# \"savefig.dpi\": 300,\n", + "# \"axes.edgecolor\": \"black\",\n", + "# \"axes.linewidth\": 0.8,\n", + "# \"text.usetex\": True,\n", + "# }\n", + "# )" ] }, { "cell_type": "code", "execution_count": null, - "id": "0e9e8e36", + "id": "2ce6f5be", "metadata": {}, "outputs": [], "source": [ - "# Ess vs eps\n", - "fig, ax = plt.subplots(1, 1, figsize=(5, 3))\n", - "\n", - "eps_mean = np.array([x[0] for x in res[\"eps\"]])\n", - "ax.semilogy(res[\"ess\"], eps_mean, \"-o\", label=\"Mean\", markersize=4)\n", - "ax.set_xlabel(\"S\")\n", + "# Retrieve related epsilon\n", + "eps_median = []\n", + "eps_uq, eps_lq = [], []\n", + "eps_up, eps_lp = [], []\n", + "for x in x_values:\n", + " data = aucs[f\"{def_arg}{x}\"]\n", + " data_eps = [i for i in data[\"epsilon\"].values if i is not None and not np.isnan(i)]\n", + " eps_median.append(np.median(data_eps))\n", + " eps_uq.append(np.percentile(data_eps, 75))\n", + " eps_lq.append(np.percentile(data_eps, 25))\n", + " eps_up.append(np.percentile(data_eps, 95))\n", + " eps_lp.append(np.percentile(data_eps, 5))\n", + "\n", + "# Plot: ess vs eps\n", + "fig, ax = plt.subplots(1, 1)\n", + "\n", + "ax.semilogy(x_values, eps_median, \"-\", color=\"black\", label=\"Median\")\n", + "ax.fill_between(\n", + " x_values,\n", + " eps_lq,\n", + " eps_uq,\n", + " color=\"#ff4d4d\",\n", + " alpha=0.25,\n", + " linewidth=0,\n", + " label=\"Quartiles (1st \\& 3rd)\",\n", + " 
zorder=2,\n", + ")\n", + "ax.fill_between(\n", + " x_values,\n", + " eps_lp,\n", + " eps_up,\n", + " color=\"#ff9999\",\n", + " alpha=0.20,\n", + " linewidth=0,\n", + " label=\"Percentiles (5th \\& 95th)\",\n", + " zorder=2,\n", + ")\n", + "label = \"$S$\" if def_mec == \"def_idm\" else \"$\\delta$\"\n", + "ax.set_xlabel(label)\n", "ax.set_ylabel(\"$\\epsilon$\")\n", - "ax.set_title(\"S vs $\\epsilon$\")\n", - "ax.set_ylim([1e-5, 10])\n", - "ax.set_yticks([4, 1, 1e-1, 1e-2, 1e-3, 1e-4])\n", - "ax.set_yticklabels([\"4\", \"1\", \"1e-1\", \"1e-2\", \"1e-3\", \"1e-4\"])\n", - "ax.yaxis.set_minor_locator(LogLocator(base=10.0, subs=\"auto\"))\n", - "ax.tick_params(axis=\"y\", which=\"minor\", length=2, width=0.5)\n", - "ax.tick_params(axis=\"y\", which=\"major\", length=3, width=0.9)\n", - "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)\n", - "# ax.grid(True, which='minor', linestyle='-', linewidth=0.5, color='#bfbfbf', zorder=1)\n", - "ax.legend(loc=\"best\")\n", + "ax.set_title(\"Balancing privacy\")\n", "\n", - "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", - "plt.show()\n", - "fig.savefig(\n", - " f\"{plots_path}/s_vs_eps.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False\n", - ")" + "ax.set_ylim([1e-9, 100])\n", + "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)\n", + "ax.legend(loc=\"best\")" ] }, { "cell_type": "code", "execution_count": null, - "id": "90bf5cc9", + "id": "a29cb66d", "metadata": {}, "outputs": [], "source": [ - "# Ess vs CN certainty\n", - "fig, ax = plt.subplots(1, 1, figsize=(5, 3))\n", - "\n", - "ax.plot(res[\"ess\"], res[\"cert_cn\"], \"-o\", markersize=4)\n", - "ax.set_xlabel(\"S\")\n", + "# Get CN certainty\n", + "cn_certainty = []\n", + "for x in x_values:\n", + " data = res[f\"{def_arg}{x}\"]\n", + " cn_certainty.append(sum(data[\"cn_probs\"] > 0.5) / len(data))\n", + "\n", + "# Plot: ess vs CN certainty\n", + "fig, ax = plt.subplots(1, 1)\n", + 
"\n", + "ax.plot(x_values, cn_certainty, \"-\", color=\"black\")\n", + "label = \"$S$\" if def_mec == \"def_idm\" else \"$\\delta$\"\n", + "ax.set_xlabel(label)\n", "ax.set_ylabel(\"Ratio of (maxmin CN prob. $> 0.5$)\")\n", "ax.set_title(\"CN certainty\")\n", - "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)\n", - "\n", - "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", - "plt.show()\n", - "fig.savefig(\n", - " f\"{plots_path}/cn_certainty.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False\n", - ")" + "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)" ] }, { "cell_type": "code", "execution_count": null, - "id": "b5c19b7e", + "id": "1f7ebeae", "metadata": {}, "outputs": [], "source": [ - "# Accuracy\n", - "fig, ax = plt.subplots(1, 1, figsize=(5, 3))\n", + "# Store accuracy results\n", + "acc = {\"acc_noisy_bn\": {}, \"acc_cn_tot\": {}, \"acc_cn_cert\": {}, \"acc_cn_uncert\": {}}\n", "\n", - "labels = {\n", - " \"acc_noisy_bn\": \"Noisy BN\",\n", - " \"acc_cn_tot\": \"CN (total)\",\n", - " \"acc_cn_cert\": \"CN (certain)\",\n", - " \"acc_cn_uncert\": \"CN (uncertain)\",\n", + "roc = {\n", + " \"roc_cn_cert\": dict(),\n", + " \"roc_cn_uncert\": dict(),\n", + " \"roc_cn_tot\": dict(),\n", + " \"roc_noisy_bn\": dict(),\n", "}\n", - "for key in res.keys():\n", - " if \"acc\" in key:\n", - " ax.plot(res[\"ess\"], res[key], \"-o\", label=labels[key], markersize=4)\n", "\n", - "ax.set_xlabel(\"S\")\n", - "ax.set_ylabel(\"Accuracy\")\n", - "ax.set_title(\"Accuracy\")\n", - "ax.legend(loc=\"best\")\n", - "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)\n", + "for x in x_values:\n", + " data = res[f\"{def_arg}{x}\"]\n", "\n", - "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", - "plt.show()\n", - "fig.savefig(\n", - " f\"{plots_path}/accuracy.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False\n", - ")" + " # Split CN results based on 
probabilities\n", + " cn_cert, cn_uncert = split_data(data, \"cn_probs\", 0.5)\n", + "\n", + " # Compute accuracies\n", + " vs = \"bn\"\n", + " acc_cn_cert = (\n", + " get_acc_bn(cn_cert, f\"{vs}_mpes\", \"cn_mpes\") if len(cn_cert) > 0 else None\n", + " )\n", + " acc_cn_uncert = (\n", + " get_acc_bn(cn_uncert, f\"{vs}_mpes\", \"cn_mpes\") if len(cn_uncert) > 0 else None\n", + " )\n", + " acc_cn_tot = get_acc_bn(data, f\"{vs}_mpes\", \"cn_mpes\")\n", + " acc_noisy_bn = get_acc_bn(data, f\"{vs}_mpes\", \"bn_noisy_mpes\")\n", + "\n", + " # Compute ROC\n", + " roc_cn_cert = (\n", + " roc_curve(cn_cert[f\"{vs}_mpes\"], cn_cert[\"cn_probs_1\"])\n", + " if len(cn_cert) > 0\n", + " else None\n", + " )\n", + " roc_cn_uncert = (\n", + " roc_curve(cn_uncert[f\"{vs}_mpes\"], cn_uncert[\"cn_probs_1\"])\n", + " if len(cn_uncert) > 0\n", + " else None\n", + " )\n", + " roc_cn_tot = roc_curve(data[f\"{vs}_mpes\"], data[\"cn_probs_1\"])\n", + " roc_noisy_bn = roc_curve(data[f\"{vs}_mpes\"], data[\"bn_noisy_probs_1\"])\n", + "\n", + " # Store results\n", + " for key in acc.keys():\n", + " acc[key][x] = eval(key)\n", + " for key in roc.keys():\n", + " roc[key][x] = eval(key)" ] }, { "cell_type": "code", "execution_count": null, - "id": "5c336efd", + "id": "c158f122", "metadata": {}, "outputs": [], "source": [ - "roc.keys()" + "# Plot accuracies\n", + "fig, ax = plt.subplots(1, 1)\n", + "\n", + "labels = {\n", + " \"acc_cn_cert\": \"CN (certain)\",\n", + " \"acc_cn_uncert\": \"CN (uncertain)\",\n", + " \"acc_cn_tot\": \"CN (total)\",\n", + " \"acc_noisy_bn\": \"Noisy BN\",\n", + "}\n", + "\n", + "colors = dict(\n", + " zip(labels.keys(), sns.color_palette(palette=\"seismic\", n_colors=len(labels)))\n", + ")\n", + "\n", + "for key in acc.keys():\n", + " accuracy = [x[0] if x else np.nan for x in acc[key].values()]\n", + " lower = [x[1] if x else np.nan for x in acc[key].values()]\n", + " upper = [x[2] if x else np.nan for x in acc[key].values()]\n", + " ax.plot(x_values, accuracy, 
\"-\", label=labels[key], color=colors[key])\n", + " ax.fill_between(\n", + " x_values, lower, upper, color=colors[key], alpha=0.25, zorder=2, linewidth=0\n", + " )\n", + "\n", + "label = \"$S$\" if def_mec == \"def_idm\" else \"$\\delta$\"\n", + "ax.set_xlabel(label)\n", + "ax.set_ylabel(\"Accuracy\")\n", + "ax.set_title(\"MAP estimation (Wilson CI 95/%)\")\n", + "ax.legend(loc=\"best\")\n", + "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)" ] }, { "cell_type": "code", "execution_count": null, - "id": "9186101e", + "id": "089a1d9a", "metadata": {}, "outputs": [], "source": [ - "# ROC curves\n", - "fig, axes = plt.subplots(2, 3, figsize=(16, 8))\n", + "# Plot ROCs\n", + "fig, axes = plt.subplots(len(x_values) // 3 + 1, 3, figsize=(14, 7))\n", "\n", "labels = {\n", " \"roc_cn_cert\": \"CN (certain)\",\n", @@ -316,13 +365,20 @@ " \"roc_noisy_bn\": \"Noisy BN\",\n", "}\n", "\n", + "colors = dict(\n", + " zip(labels.keys(), sns.color_palette(palette=\"seismic\", n_colors=len(labels)))\n", + ")\n", + "\n", "i = 0\n", - "for ess in res[\"ess\"]:\n", + "for x in x_values:\n", " ax = axes.flatten()[i]\n", "\n", " for key in roc.keys():\n", - " fpr, tpr, _ = roc[key][ess]\n", - " ax.plot(fpr, tpr, label=labels[key], linewidth=1.3)\n", + " try:\n", + " fpr, tpr, _ = roc[key][x]\n", + " ax.plot(fpr, tpr, label=labels[key], linewidth=1.3, color=colors[key])\n", + " except:\n", + " continue\n", "\n", " ax.plot(\n", " [0, 1],\n", @@ -332,21 +388,19 @@ " linewidth=1.3,\n", " label=\"baseline\",\n", " )\n", + "\n", " ax.set_xlabel(\"FPR\")\n", " ax.set_ylabel(\"TPR\")\n", - " ax.set_title(f\"ROC (S = {ess})\")\n", " ax.grid(\n", " True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1\n", " )\n", + " ax.set_title(f\"ROC ({def_arg} = {x})\")\n", + " if i == 0:\n", + " ax.legend(loc=\"best\")\n", "\n", " i += 1\n", "\n", - "plt.legend(loc=\"best\")\n", - "plt.subplots_adjust(hspace=0.5)\n", - "\n", - 
"plt.tight_layout(rect=[0, 0, 1, 0.96])\n", - "plt.show()\n", - "fig.savefig(f\"{plots_path}/roc.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False)" + "plt.subplots_adjust(hspace=0.3)" ] } ], diff --git a/experiments/cn_vs_noisybn/__init__.py b/experiments/cn_vs_noisybn/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/cn_vs_noisybn/config.yaml b/experiments/cn_vs_noisybn/config.yaml index d8b5777..fb9f76e 100644 --- a/experiments/cn_vs_noisybn/config.yaml +++ b/experiments/cn_vs_noisybn/config.yaml @@ -1,38 +1,45 @@ ## Configuration file # Paths -base_path: experiments/cn_vs_noisybn/output # Base path for output +cur_dir: experiments/cn_vs_noisybn # Current directory (contains all the following) bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs data_path: data # Where to save data as generated from ground-truth BNs -results_path: results # Where to save the experiment results -meta_file: exp_meta.txt # File of metadata +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments +auc_meta: output/results/auc_meta.csv # File of metadata for AUCs # Models (Naive Bayes) target_var: 'T' # Target variable -n_nodes: 10 # Number of nodes for each BN model +n_nodes: 20 # Number of nodes for each BN model +n_modmax: 4 # Maximum number of categories for covariates n_models: 10 # Number of models to evaluate # Data gpop_ss: 1000 # Sample size of general population rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 10 # Number of data samples # MIA -n_samples: 30 # Number of data samples -n_bns: 50 # Number of BNs to sample within the CN +error: 'np.logspace(-4, 0, 20, endpoint=False)' # Type-I errors 
vector + +# Noisy BN tol: 0.01 # To find eps s.t. |AUC(eps) - AUC(CN)| < tol -error: 'np.logspace(-4, 0, 25, endpoint=False)' # Type-I errors vector -ess_dict: # Eps list to evaluate for each ess - 1: 'np.arange(0.1, 10, 0.1)' - 10: 'np.arange(0.1, 10, 0.1)' - 20: 'np.arange(0.05, 5, 0.05)' - 30: 'np.arange(1e-3, 1, 1e-3)' - 40: 'np.arange(5e-6, 1e-2, 5e-6)' - 50: 'np.arange(5e-7, 5e-4, 5e-7)' +eps_vec: 'np.logspace(-8, 2, num=200)' # Epsilon to consider for noisy BN # Inferences -n_infer: 1000 # Number of inferences to perform +n_infer: 100 # Number of inferences to perform # Other -seed: 42 # Global seed num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization + +## Notes +# 1) Suggested pairs (ess: eps_vec) for n_nodes=10: +# - 1 : 'np.arange(0.1, 10, 0.1)' +# - 10: 'np.arange(0.1, 10, 0.1)' +# - 20: 'np.arange(0.05, 5, 0.05)' +# - 30: 'np.arange(1e-3, 1, 1e-3)' +# - 40: 'np.arange(5e-6, 1e-2, 5e-6)' +# - 50: 'np.arange(5e-7, 5e-4, 5e-7)' diff --git a/experiments/cn_vs_noisybn/config_BAK.yaml b/experiments/cn_vs_noisybn/config_BAK.yaml new file mode 100644 index 0000000..6cb9d4f --- /dev/null +++ b/experiments/cn_vs_noisybn/config_BAK.yaml @@ -0,0 +1,45 @@ +## Configuration file + +# Paths +cur_dir: experiments/cn_vs_noisybn # Current directory (contains all the following) +bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs +data_path: data # Where to save data as generated from ground-truth BNs +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments +auc_meta: output/results/auc_meta.csv # File of metadata for AUCs + +# Models (Naive Bayes) +target_var: 'T' # Target variable +n_nodes: 10 # Number of nodes for each BN model +n_modmax: 2 # Maximum number of categories for covariates +n_models: 10 
# Number of models to evaluate + +# Data +gpop_ss: 1000 # Sample size of general population +rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop +pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 30 # Number of data samples + +# MIA +error: 'np.logspace(-4, 0, 25, endpoint=False)' # Type-I errors vector + +# Noisy BN +tol: 0.01 # To find eps s.t. |AUC(eps) - AUC(CN)| < tol +eps_vec: 'np.logspace(-8, 2, num=1000)' # Epsilon to consider for noisy BN + +# Inferences +n_infer: 1000 # Number of inferences to perform + +# Other +num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization + +## Notes +# 1) Suggested pairs (ess: eps_vec) for n_nodes=10: +# - 1 : 'np.arange(0.1, 10, 0.1)' +# - 10: 'np.arange(0.1, 10, 0.1)' +# - 20: 'np.arange(0.05, 5, 0.05)' +# - 30: 'np.arange(1e-3, 1, 1e-3)' +# - 40: 'np.arange(5e-6, 1e-2, 5e-6)' +# - 50: 'np.arange(5e-7, 5e-4, 5e-7)' \ No newline at end of file diff --git a/experiments/cn_vs_noisybn/exp.py b/experiments/cn_vs_noisybn/exp.py new file mode 100644 index 0000000..5110a16 --- /dev/null +++ b/experiments/cn_vs_noisybn/exp.py @@ -0,0 +1,73 @@ +import gc +import multiprocessing # noqa: F401 # pylint: disable=unused-import +import sys + +import numpy as np # noqa: F401 # pylint: disable=unused-import +import pandas as pd +from joblib import Parallel, delayed + +from src.attack import attack_mechanism +from src.config import create_clean_dir, get_cur_dir, load_config, map_sys_args +from src.defense import defense_mechanism +from src.inference import inferences +from src.mia import find_epsilon, mia_vs_cn + + +def main(): + + # Init configs + config = load_config("cn_vs_noisybn") + cur_dir = get_cur_dir(config) + create_clean_dir(cur_dir / "output") + num_cores = eval(config["num_cores"]) + + # Get command-line hyperparameters + def_mec, def_args, atk_mec, atk_args = map_sys_args(sys.argv, config) + + # Init the vectors of experiments + exp_vec = 
[f.stem for f in (cur_dir / config["data_path"]).iterdir() if f.is_file()] + + # Defense mechanism + print("## Defense mechanism: [", def_mec, def_args, "] ##", flush=True) + create_clean_dir(cur_dir / config["cns_path"]) + _ = Parallel(n_jobs=num_cores)( + delayed(defense_mechanism)(exp, config, def_mec, def_args) for exp in exp_vec + ) + + # Attack mechanism + print("## Attack mechanism: [", atk_mec, atk_args, "] ##", flush=True) + create_clean_dir(cur_dir / config["atk_path"]) + _ = Parallel(n_jobs=num_cores)( + delayed(attack_mechanism)(exp, config, atk_mec, atk_args) for exp in exp_vec + ) + + # MIA vs CN + print("## MIA vs CN ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "cns") + res = Parallel(n_jobs=num_cores)(delayed(mia_vs_cn)(exp, config) for exp in exp_vec) + auc_res = pd.concat((i for i in res), axis=0) + auc_res.to_csv(f'{cur_dir}/{config["auc_meta"]}', index=False) + + # Find eps s.t. |AUC(eps) - AUC(CN)| < tol + print("## Find epsilon ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "bn_noisy") + res = Parallel(n_jobs=num_cores)( + delayed(find_epsilon)(exp, config) for exp in exp_vec + ) + auc_res = pd.concat((i for i in res), axis=0) + auc_res.to_csv(f'{cur_dir}/{config["auc_meta"]}', index=False) + + # Run inferences + print("## Inferences ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "inferences") + _ = Parallel(n_jobs=num_cores)( + delayed(inferences)(exp, config, def_mec, def_args) for exp in exp_vec + ) + + # Clean + gc.collect() + + +if __name__ == "__main__": + + main() diff --git a/experiments/cn_vs_noisybn/generate.py b/experiments/cn_vs_noisybn/generate.py new file mode 100644 index 0000000..557f062 --- /dev/null +++ b/experiments/cn_vs_noisybn/generate.py @@ -0,0 +1,44 @@ +import gc +import multiprocessing # noqa: F401 # pylint: disable=unused-import + +import numpy as np # noqa: F401 # pylint: disable=unused-import +from joblib import Parallel, delayed + +from 
src.config import create_clean_dir, get_cur_dir, load_config +from src.data import generate_naivebayes +from src.learning import estimate_bns + + +def main(): + + # Init configs + config = load_config("cn_vs_noisybn") + cur_dir = get_cur_dir(config) + num_cores = eval(config["num_cores"]) + + # Generate BNs and data + print("## Generate BNs and data ##") + create_clean_dir(cur_dir / config["bns_path"]) + create_clean_dir(cur_dir / config["bns_path"] / "gt") + create_clean_dir(cur_dir / config["data_path"]) + open(f'{cur_dir}/{config["exp_meta"]}', "w").close() + generate_naivebayes(config) + + # Init the vectors of experiments + exp_vec = [f.stem for f in (cur_dir / config["data_path"]).iterdir() if f.is_file()] + + # Estimate BNs from rpop and pool + print("## Estimate BNs from rpop and pool ##") + create_clean_dir(cur_dir / config["bns_path"] / "rpop") + create_clean_dir(cur_dir / config["bns_path"] / "pool") + _ = Parallel(n_jobs=num_cores)( + delayed(estimate_bns)(exp, config) for exp in exp_vec + ) + + # Clean + gc.collect() + + +if __name__ == "__main__": + + main() diff --git a/experiments/cn_vs_noisybn/main.py b/experiments/cn_vs_noisybn/main.py deleted file mode 100644 index f0f9b7b..0000000 --- a/experiments/cn_vs_noisybn/main.py +++ /dev/null @@ -1,14 +0,0 @@ -from src.config import get_config -from src.data import generate_naivebayes -from src.run_exp import run_cn_vs_noisybn - -if __name__ == "__main__": - - # Load config - config = get_config("experiments/cn_vs_noisybn/config.yaml") - - # Generate BNs and data - generate_naivebayes(config) - - # Run experiment - run_cn_vs_noisybn(config) diff --git a/generate_compose.py b/generate_compose.py new file mode 100644 index 0000000..d2b74d0 --- /dev/null +++ b/generate_compose.py @@ -0,0 +1,67 @@ +from itertools import product + +import yaml + +# Set hyperparameters +names = ["cn_vs_noisybn"] +def_mecs = {"def_idm": {"ess": [1, 50, 100]}, "def_ran": {"delta": [0.001, 0.05, 0.1]}} +atk_mecs = { + "atk_mle": 
{"n_bns": [100]}, + "atk_cen": {None: [None]}, + "atk_ran": {None: [None]}, +} + +# Initialize the `compose.yaml` file +init = {"version": "3.9"} +with open("compose.yaml", "w") as f: + yaml.dump(init, f, default_flow_style=False) + +# For any configuration ... +# (assumption: each defense and attack mechanism has at most 1 hyperparameter to be set) +data = {"services": dict()} +for name, def_mec, atk_mec in product(names, def_mecs.keys(), atk_mecs.keys()): + + def_params = list(def_mecs[def_mec].values())[0] + atk_params = list(atk_mecs[atk_mec].values())[0] + + for def_par, atk_par in product(def_params, atk_params): + + # ... set the related volume, ... + app = ( + f"_{list(def_mecs[def_mec].keys())[0]}{def_par}" + if def_par is not None + else "" + ) + volumes = [ + f"./experiments/{name}/bns:/workspace/experiments/{name}/bns", + f"./experiments/{name}/data:/workspace/experiments/{name}/data", + f"./experiments/{name}/output_{def_mec}_{atk_mec}{app}:/workspace/experiments/{name}/output", + ] + + # ... set the command, ... + command = [ + "python", + "-m", + f"experiments.{name}.exp", + f"def_mec={def_mec}", + f"atk_mec={atk_mec}", + ] + + if def_par is not None: + command.append(f"{list(def_mecs[def_mec].keys())[0]}={def_par}") + if atk_par is not None: + command.append(f"{list(atk_mecs[atk_mec].keys())[0]}={atk_par}") + + # ... 
and create the experiment + data["services"][f"{name}_{def_mec}_{atk_mec}{app}"] = { + "image": "bnp:2025", + "build": ".", + "volumes": volumes, + "command": command, + } +# Print number of services +print("Number of services: ", len(data["services"])) + +# Write file +with open("compose.yaml", "a") as f: + yaml.dump(data, f, default_flow_style=False) diff --git a/requirements.txt b/requirements.txt index 6e6f634..0c2c7a0 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,3 +1,4 @@ +arviz==0.22.0 astroid==3.3.11 asttokens==3.0.0 black==25.1.0 @@ -5,6 +6,7 @@ click==8.2.1 cloudpickle==3.1.1 comm==0.2.3 contourpy==1.3.3 +coverage==7.11.0 cycler==0.12.1 dask==2025.7.0 debugpy==1.8.15 @@ -14,6 +16,9 @@ executing==2.2.0 flake8==7.3.0 fonttools==4.59.0 fsspec==2025.7.0 +h5netcdf==1.7.3 +h5py==3.15.1 +hopsy==1.6.1 importlib_metadata==8.7.0 iniconfig==2.1.0 ipykernel==6.30.0 @@ -30,10 +35,12 @@ matplotlib==3.10.3 matplotlib-inline==0.1.7 mccabe==0.7.0 more-itertools==10.7.0 +mpmath==1.3.0 mypy_extensions==1.1.0 natsort==8.4.0 nest-asyncio==1.6.0 numpy==2.3.2 +optlang==1.8.3 packaging==25.0 pandarallel==1.6.5 pandas==2.3.1 @@ -45,12 +52,14 @@ pexpect==4.9.0 pillow==11.3.0 platformdirs==4.3.8 pluggy==1.6.0 +PolyRound==0.4.0 prompt_toolkit==3.0.51 psutil==7.0.0 ptyprocess==0.7.0 pure_eval==0.2.3 pyAgrum==2.2.0 pyarrow==21.0.0 +pycddlib==3.0.2 pycodestyle==2.14.0 pydot==4.0.1 pyflakes==3.4.0 @@ -58,6 +67,7 @@ Pygments==2.19.2 pylint==3.3.7 pyparsing==3.2.3 pytest==8.4.1 +pytest-cov==7.0.0 python-dateutil==2.9.0.post0 pytz==2025.2 PyYAML==6.0.2 @@ -69,6 +79,8 @@ six==1.17.0 stack-data==0.6.3 statsmodels==0.14.5 swifter==1.4.0 +swiglpk==5.0.12 +sympy==1.14.0 threadpoolctl==3.6.0 tokenize_rt==6.2.0 tomlkit==0.13.3 @@ -79,4 +91,6 @@ traitlets==5.14.3 typing_extensions==4.14.1 tzdata==2025.2 wcwidth==0.2.13 +xarray==2025.10.1 +xarray-einstats==0.9.1 zipp==3.23.0 diff --git a/src/attack.py b/src/attack.py new file mode 100644 index 0000000..6406d87 --- /dev/null +++ 
b/src/attack.py @@ -0,0 +1,114 @@ +import inspect + +import numpy as np +import pandas as pd +import pyagrum as gum + +from src.config import get_cur_dir, set_seed +from src.mia import get_ll +from src.utils import centroid_cn, maxent_cn, sample_from_cn + + +# Apply attack mechanism to a BN, namely, derive a BN from a CN +def attack_mechanism(exp, config, atk_mec, atk_args) -> None: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + base_path = cur_dir / config["cns_path"] + + # Set seed + set_seed() + + # For each data sample ... + for sample in range(config["samples"]): + + # ... read the related CN + bn_min = gum.loadBN(f"{base_path}/bn_min_{exp}_sample{sample}.bif") + bn_max = gum.loadBN(f"{base_path}/bn_max_{exp}_sample{sample}.bif") + + # ... retrieve rpop, ... + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, : len(bn_min.nodes())] + + # ... and derive the BN + atk_mec_fn = globals()[atk_mec] # Get the related function + sig = inspect.signature(atk_mec_fn) # Get its signature + args = { + k: v + for k, v in { + "bn_min": bn_min, + "bn_max": bn_max, + "data": rpop, + "n_bns": atk_args.get("n_bns", None), + }.items() + if k in sig.parameters + } + bn = atk_mec_fn(**args) + gum.saveBN( + bn, f'{cur_dir / config["atk_path"]}/{f"bn_{exp}_sample{sample}"}.bif' + ) + + return + + +# Get the BN inside a CN with max entropy distribution +def atk_ent(bn_min, bn_max): + + bn = maxent_cn(bn_min, bn_max) + + return bn + + +# Get a random BN inside a CN +def atk_ran(bn_min, bn_max): + + bn = sample_from_cn(bn_min, bn_max, 1) + + return bn[0] + + +# Get the centroid of a CN +def atk_cen(bn_min, bn_max): + + bn = centroid_cn(bn_min, bn_max) + + return bn + + +# Get the maximum likelihood BN inside a CN +def atk_mle(bn_min, bn_max, data, n_bns: int): + + # Sample from the CN ... + bns_sample = sample_from_cn(bn_min, bn_max, n_bns) + + # ... 
and take the MLE one + bn = mle_bn(bns_sample, data) + + return bn + + +# Get the maximum likelihood BN within a set +def mle_bn(bns_sample, data): + """ + Given a list `bns_sample` of BNs, + find argmax_{BN in bns_sample} ll(BN | data), + where ll is the log-likelihood function. + """ + + mle_bn = None + mle = -np.inf + + for bn in bns_sample: + + # Estimate the likelihood of data + bn_ie = gum.LazyPropagation(bn) + llr_im = data.apply(lambda x: get_ll(x.to_dict(), bn_ie), axis=1).dropna() + llr = np.sum(llr_im) + + if llr > mle: + mle_bn = bn + mle = llr + + return mle_bn diff --git a/src/config.py b/src/config.py index 016bb82..8d9cc66 100644 --- a/src/config.py +++ b/src/config.py @@ -1,47 +1,104 @@ +import os import random import shutil +import sys from pathlib import Path import pyagrum as gum import yaml +IN_PYTEST = "pytest" in sys.modules + + +# Get arguments as passed from command-line for experiment +def map_sys_args(sys_args, config) -> tuple: + + # Store parameters + params = dict([arg.split("=") for arg in sys_args if "=" in arg]) + with open(f'{config["cur_dir"]}/{config["exp_meta"]}', "a") as m: + m.write(f"\n Defense & attack parameters: \n {params}") + + # Get defense and attack mechanisms + def_mec = params.pop("def_mec") + atk_mec = params.pop("atk_mec") + + # Get defense parameters + def_args = dict() + if def_mec == "def_idm": + def_args["ess"] = int(params.pop("ess")) + assert def_args["ess"] >= 0 + elif def_mec == "def_ran": + def_args["delta"] = float(params.pop("delta")) + assert def_args["delta"] >= 0 + assert def_args["delta"] <= 1 + else: + raise Exception("Defense not implemented") + + # Save attack parameters + atk_args = dict() + if atk_mec == "atk_mle": + atk_args["n_bns"] = int(params.pop("n_bns")) + assert atk_args["n_bns"] >= 1 + elif atk_mec == "atk_cen" or atk_mec == "atk_ran" or atk_mec == "atk_ent": + pass + else: + raise Exception("Attack not implemented") + + # Exceptions + if len(params) != 0: + raise Exception(f"Unused 
parameters: {params}") + + return (def_mec, def_args, atk_mec, atk_args) + # Read configuration for experiment -def get_config(path): +def load_config(name: str): - with open(path, "r") as f: + subdir = "test" if os.getenv("USE_TEST_CONFIG") == "1" else "experiments" + + config_path = get_root_path() / subdir / name / "config.yaml" + + with open(config_path, "r") as f: config = yaml.safe_load(f) return config # Set global seed -def set_global_seed(seed): +def set_seed(): - random.seed(seed) - gum.initRandom(seed) + random.seed(42) + gum.initRandom(42) -# Get root directory -def get_root_path(): - return Path(__file__).resolve().parents[1] +# Create an empty directory +def create_clean_dir(path: Path): + + # If directory exists, clean it + if path.exists() and path.is_dir(): + for item in path.iterdir(): + shutil.rmtree(item) if item.is_dir() else item.unlink() + + # Else, create a new one + else: + path.mkdir(parents=True, exist_ok=True) -# Get base path -def get_base_path(config): +# Get output path +def get_cur_dir(config): root_path = get_root_path() - base_path = config["base_path"] + cur_dir = config["cur_dir"] - return root_path / base_path + return root_path / cur_dir -# Create an empty directory -def create_clean_dir(path: Path): +# Get root (project) directory +def get_root_path(): + return Path(__file__).resolve().parents[1] - # Remove the folder if already exists - if path.exists() and path.is_dir(): - shutil.rmtree(path) - # Create a new folder - path.mkdir(parents=True, exist_ok=True) +# Only perform an `assert` if code is running in `pytest` +def safe_assert(condition): + if IN_PYTEST: + assert condition diff --git a/src/data.py b/src/data.py index 31fdf2e..3e1cb3d 100644 --- a/src/data.py +++ b/src/data.py @@ -1,30 +1,34 @@ +import ast from itertools import product -from pprint import pformat +import numpy as np import pyagrum as gum +from numpy.random import randint -from src.config import create_clean_dir, get_base_path, set_global_seed -from 
src.utils import compact_dict +from src.config import get_cur_dir, safe_assert, set_seed def generate_naivebayes(config): - # Set seed - set_global_seed(config["seed"]) - # Set paths - base_path = get_base_path(config) - bns_path = base_path / config["bns_path"] - data_path = base_path / config["data_path"] - results_path = base_path / config["results_path"] - - # Create empty directories - create_clean_dir(bns_path) - create_clean_dir(data_path) - create_clean_dir(results_path) - - # Set BN (Naive Bayes) structure - bn_str_gen = (f'{config["target_var"]}->X{i}' for i in range(config["n_nodes"] - 1)) + cur_dir = get_cur_dir(config) + bns_path = cur_dir / config["bns_path"] + data_path = cur_dir / config["data_path"] + + # Set seed + set_seed() + + # Retrieve hyperparameters + n_modmax = config["n_modmax"] + gpop_ss = config["gpop_ss"] + pool_ss = int(gpop_ss * config["pool_prop"]) + rpop_ss = int(gpop_ss * config["rpop_prop"]) + + # Set BN (naive Bayes) structure + bn_str_gen = ( + f'{config["target_var"]}->X{i}[{randint(2, n_modmax+1)}]' + for i in range(config["n_nodes"] - 1) + ) bn_str = "; ".join(bn_str_gen) # For each model ... @@ -32,59 +36,67 @@ def generate_naivebayes(config): # ... generate BN, ... bn = gum.fastBN(bn_str) - gum.saveBN(bn, f"{bns_path}/exp{i}.bif") + gum.saveBN(bn, f'{bns_path / "gt"}/{f"exp{i}"}.bif') + + with open(f'{cur_dir}/{config["exp_meta"]}', "a") as m: + m.write( + f'- exp{i}. Naive Bayes: {config["n_nodes"]} nodes. Complexity: {bn.dim()} Max categories: {n_modmax}\n' + ) # ... and generate gpop from BN data_gen = gum.BNDatabaseGenerator(bn) data_gen.drawSamples(config["gpop_ss"]) data_gen.setDiscretizedLabelModeRandom() gpop = data_gen.to_pandas() - gpop.to_csv(f"{data_path}/exp{i}.csv", index=False) - # For each ESS ... - for ess in config["ess_dict"].keys(): + # For any data sample ... + for sample in range(config["samples"]): - # ... 
create results subdirectories and metadata files - meta_file_path = ( - base_path - / config["results_path"] - / f'results_nodes{config["n_nodes"]}_ess{ess}' - / config["meta_file"] - ) - meta_file_path.parent.mkdir(parents=True, exist_ok=True) + # ... sample pool and rpop + shuffled_idx = np.random.permutation(gpop.index) - with open(meta_file_path, "w") as f: - f.write(pformat(compact_dict(config)) + "\n\n" + "#" * 50 + "\n\n") + pool_idx = shuffled_idx[:pool_ss] + rpop_idx = shuffled_idx[pool_ss : pool_ss + rpop_ss] + gpop[f"in-pool-{sample}"] = gpop.index.isin(pool_idx) + gpop[f"in-rpop-{sample}"] = gpop.index.isin(rpop_idx) -def generate_randombn(config): + # Debug + safe_assert(pool_ss == len(pool_idx)) + safe_assert(rpop_ss == len(rpop_idx)) + safe_assert(sum(gpop[f"in-pool-{sample}"]) == pool_ss) + safe_assert(sum(gpop[f"in-rpop-{sample}"]) == rpop_ss) - # Set seed - set_global_seed(config["seed"]) + # Save gpop + gpop.to_csv(f"{data_path}/exp{i}.csv", index=False) + + +def generate_randombn(config): # Set paths - base_path = get_base_path(config) - bns_path = base_path / config["bns_path"] - data_path = base_path / config["data_path"] - results_path = base_path / config["results_path"] + cur_dir = get_cur_dir(config) + bns_path = cur_dir / config["bns_path"] + data_path = cur_dir / config["data_path"] - # Create empty directories - create_clean_dir(bns_path) - create_clean_dir(data_path) - create_clean_dir(results_path) + # Retrieve hyperparameters + n_nodes_vec = ast.literal_eval(config["n_nodes_vec"]) + edge_ratio_vec = ast.literal_eval(config["edge_ratio_vec"]) + gpop_ss = config["gpop_ss"] + pool_ss = int(gpop_ss * config["pool_prop"]) + rpop_ss = int(gpop_ss * config["rpop_prop"]) - n_nodes_vec = eval(config["n_nodes_vec"]) - edge_ratio_vec = eval(config["edge_ratio_vec"]) + # Set seed + set_seed() # For each configuration ... for i, (n, r) in enumerate(product(n_nodes_vec, edge_ratio_vec)): # ... generate BN, ... 
bn_gen = gum.BNGenerator() - bn = bn_gen.generate(n_nodes=n, n_arcs=int(n * r), n_modmax=2) - gum.saveBN(bn, f"{bns_path}/exp{i}.bif") + bn = bn_gen.generate(n_nodes=n, n_arcs=int(n * r), n_modmax=config["n_modmax"]) + gum.saveBN(bn, f'{bns_path / "gt"}/{f"exp{i}"}.bif') - with open(f'{results_path}/{config["meta_file"]}', "a") as m: + with open(f'{cur_dir}/{config["exp_meta"]}', "a") as m: m.write( f"- exp{i}. Nodes: {n} Edges: {int(n * r)} Complexity: {bn.dim()}\n" ) @@ -94,4 +106,24 @@ def generate_randombn(config): data_gen.drawSamples(config["gpop_ss"]) data_gen.setDiscretizedLabelModeRandom() gpop = data_gen.to_pandas() + + # For any data sample ... + for sample in range(config["samples"]): + + # ... sample pool and rpop + shuffled_idx = np.random.permutation(gpop.index) + + pool_idx = shuffled_idx[:pool_ss] + rpop_idx = shuffled_idx[pool_ss : pool_ss + rpop_ss] + + gpop[f"in-pool-{sample}"] = gpop.index.isin(pool_idx) + gpop[f"in-rpop-{sample}"] = gpop.index.isin(rpop_idx) + + # Debug + safe_assert(pool_ss == len(pool_idx)) + safe_assert(rpop_ss == len(rpop_idx)) + safe_assert(sum(gpop[f"in-pool-{sample}"]) == pool_ss) + safe_assert(sum(gpop[f"in-rpop-{sample}"]) == rpop_ss) + + # Save gpop gpop.to_csv(f"{data_path}/exp{i}.csv", index=False) diff --git a/src/defense.py b/src/defense.py new file mode 100644 index 0000000..35c1e41 --- /dev/null +++ b/src/defense.py @@ -0,0 +1,137 @@ +import inspect + +import numpy as np +import pandas as pd +import pyagrum as gum + +from src.config import get_cur_dir, safe_assert, set_seed +from src.utils import add_counts_to_bn, check_consistency + + +# Apply defense mechanism to a BN, namely, derive a CN from a BN +def defense_mechanism(exp, config, def_mec, def_args) -> None: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + + # Set seed + set_seed() + + # For each data sample ... 
+ for sample in range(config["samples"]): + + # ... read the related BN + bn = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/pool/bn_{exp}_sample{sample}.bif" + ) + + # ... retrieve pool, ... + pool = gpop[gpop[f"in-pool-{sample}"]].iloc[:, : len(bn.nodes())] + + # ... and derive the CN + def_mec_fn = globals()[def_mec] # Get the related function + sig = inspect.signature(def_mec_fn) # Get its signature + args = { + k: v + for k, v in { + "bn": bn, + "ess": def_args.get("ess", None), + "delta": def_args.get("delta", None), + "data": pool, + }.items() + if k in sig.parameters + } + cn = def_mec_fn(**args) # Keep only `def_mec` args + base_path = cur_dir / config["cns_path"] + cn.saveBNsMinMax( + f"{base_path}/bn_min_{exp}_sample{sample}.bif", + f"{base_path}/bn_max_{exp}_sample{sample}.bif", + ) + + return + + +# Estimate a CN from data by local IDM +def def_idm(bn, ess, data): + bn_counts = gum.BayesNet(bn) + add_counts_to_bn(bn_counts, data) + cn = gum.CredalNet(bn_counts) + cn.idmLearning(ess) + + return cn + + +# Build a CN by bloating each BN parameter with a fixed-size random interval +def def_ran(bn, delta): + + # Initialize the extreme BNs + bn_min = gum.BayesNet(bn) + bn_max = gum.BayesNet(bn) + + # For each node ... + for n in bn.nodes(): + + # ... get the CPT, ... + cpt = bn.cpt(n).toarray() + + # ... get a matrix of eta's, ... + eta = np.random.uniform(0, delta, cpt.size).reshape(cpt.shape) + + # ... perturb the CPT, ... + cpt_min = np.minimum(1 - delta, np.maximum(0, cpt - eta)) + cpt_max = np.minimum(1, np.maximum(delta, cpt - eta + delta)) + + # ... 
and store it into the extreme BNs + bn_min.cpt(n).fillWith(cpt_min.flatten()) + bn_max.cpt(n).fillWith(cpt_max.flatten()) + + # Debug + safe_assert(np.all(cpt_min <= cpt)) + safe_assert(np.all(cpt_max >= cpt)) + safe_assert(np.all(np.abs(cpt_max - cpt_min - delta) < 1e-6)) + + # Build the CN from the extreme BNs + cn = gum.CredalNet(bn_min, bn_max) + cn.intervalToCredal() + + # Debug + safe_assert(check_consistency(bn, bn_min, bn_max)) + + return cn + + +# Create noisy BN by adding Laplacian noise (Zhang et al., 2017) +def noisy_bn(bn, scale: float): + + bn_ie = gum.LazyPropagation(bn) + bn_ie.makeInference() + + bn_noisy = gum.BayesNet(bn) + + # For each node X ... + for node in bn.names(): + + # Get the joint P(X, Pa(X)) + joint = bn_ie.jointPosterior(bn.family(node)) + + # Add noise to P(X, Pa(X)) and normalize + noise = np.random.laplace(scale=scale, size=np.prod(joint.shape)) + noisy_joint = np.clip( + joint.toarray().flatten() + noise, a_min=10e-10, a_max=None + ) + noisy_joint = noisy_joint / np.sum(noisy_joint) + joint.fillWith(noisy_joint) + + # Compute the conditional P(X | Pa(X)) + cond = joint / joint.sumOut(node) + + # Fill noisy BN + bn_noisy.cpt(node).fillWith(cond) + + # Check noisy bn + bn_noisy.check() # OK if = (). 
+ + return bn_noisy diff --git a/src/inference.py b/src/inference.py index 7fad96d..273a6b7 100644 --- a/src/inference.py +++ b/src/inference.py @@ -1,21 +1,33 @@ +import inspect import math -from tempfile import TemporaryDirectory import numpy as np import pandas as pd import pyagrum as gum +from more_itertools import random_product -from src.config import get_base_path, set_global_seed -from src.utils import add_counts_to_bn, get_noisy_bn, random_product +import src.defense +from src.config import get_cur_dir, safe_assert, set_seed +from src.defense import noisy_bn +from src.learning import learn_bn_params +from src.utils import get_min_max_bns -def run_inferences(exp, ess, eps, config): +def inferences(exp, config, def_mec, def_args): - base_path = get_base_path(config) + # Read config + cur_dir = get_cur_dir(config) target = config["target_var"] + # Read data + auc_res = pd.read_csv(f'{cur_dir}/{config["auc_meta"]}') + eps_vec = [ + i for i in auc_res.loc[auc_res["exp"] == exp, "epsilon"].values if i is not None + ] + eps = np.mean(eps_vec) + # Set seed - set_global_seed(config["seed"]) + set_seed() # Set list of evidence evid_vec = [ @@ -24,29 +36,39 @@ def run_inferences(exp, ess, eps, config): ] # Store ground-truth BN - gt = gum.loadBN(f'{base_path / config["bns_path"]}/{exp}.bif') - gpop = pd.read_csv(f'{base_path / config["data_path"]}/{exp}.csv') - - # Learn BN from gpop - bn_learner = gum.BNLearner(gpop) - bn_learner.useSmoothingPrior(1e-5) - bn = bn_learner.learnParameters(gt.dag()) - - # Learn CN from gpop - bn_copy = gum.BayesNet(bn) - add_counts_to_bn(bn_copy, gpop) - cn = gum.CredalNet(bn_copy) - cn.idmLearning(ess) - - # Learn noisy BN from gpop + gt = gum.loadBN(f'{cur_dir / config["bns_path"]}/gt/{exp}.bif') + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + + # Learn BN from gpop #TODO: save results + bn = learn_bn_params(gt, gpop) + + # Learn CN from gpop (defense mechanism) #TODO: save results + def_mec_fn = 
getattr(src.defense, def_mec) # Get the related function + sig = inspect.signature(def_mec_fn) # Get its signature + args = { + k: v + for k, v in { + "bn": bn, + "ess": def_args.get("ess", None), + "delta": def_args.get("delta", None), + "data": gpop, + }.items() + if k in sig.parameters + } + cn = def_mec_fn(**args) # Keep only `def_mec` args + + # Learn noisy BN from gpop #TODO: save results scale = (2 * bn.size()) / (len(gpop) * eps) - bn_noisy = get_noisy_bn(bn, scale) + bn_noisy = noisy_bn(bn, scale) # Run inferences - gt_mpes, _ = run_inference_bn(gt, target, evid_vec) - bn_mpes, bn_probs = run_inference_bn(bn, target, evid_vec) - bn_noisy_mpes, bn_noisy_probs = run_inference_bn(bn_noisy, target, evid_vec) - cn_mpes, cn_probs, cn_probs_alt = run_inference_cn(cn, target, evid_vec, exp) + try: + gt_mpes, _ = run_inference_bn(gt, target, evid_vec) + bn_mpes, bn_probs = run_inference_bn(bn, target, evid_vec) + bn_noisy_mpes, bn_noisy_probs = run_inference_bn(bn_noisy, target, evid_vec) + cn_mpes, cn_probs, cn_probs_alt = run_inference_cn(cn, target, evid_vec, exp) + except Exception: + return # Save results results = pd.DataFrame( @@ -62,12 +84,12 @@ def run_inferences(exp, ess, eps, config): } ) - res_path = ( - base_path - / config["results_path"] - / f'results_nodes{config["n_nodes"]}_ess{ess}' + results.to_csv( + f'{cur_dir / config["results_path"]}/inferences/{exp}.csv', + index=False, ) - results.to_csv(f"{res_path}/{exp}.csv", index=False) + + return # MPE function for BN @@ -89,15 +111,14 @@ def mpe_cn( bn_min: gum.BayesNet, bn_max: gum.BayesNet, target: str, children: dict ) -> tuple: """ - Get the MPE of a CN as: argmax_t log P_lower(target=t | children), - together with its lower probability. - bn_min and bn_max derive from a binary CN. - The DAG is assumed to be a Naive Bayes model with `target` the target variable. - Return the MPE, its probability, and the lower probability of the alternative class.
+ Get the MPE of a CN as: argmax_t log P_lower(target=t | children). + bn_min and bn_max derive from a CN. + The DAG is a naive Bayes with `target` a binary target variable. + Returns the MPE, its probability, and the lower probability of the alternative class. """ - lp1 = get_lower_posterior(bn_min, bn_max, target, 1, children) - lp0 = get_lower_posterior(bn_min, bn_max, target, 0, children) + lp1 = nb_log_lower_posterior(bn_min, bn_max, target, 1, children) + lp0 = nb_log_lower_posterior(bn_min, bn_max, target, 0, children) if lp1 > lp0: return (1, math.exp(lp1), math.exp(lp0)) @@ -106,72 +127,73 @@ def mpe_cn( # Get a value from a BN's CPT -def get_cond( +def cpt_value( bn: gum.BayesNet, x_var: str, x_value: float, parents: dict = None ) -> float: """ Get P(X=x | parents) from the BN's CPT of X. - x_var is X, and x_value is x. + `x_var` is the X name, while `x_value` is x. """ cpt = bn.cpt(x_var) inst = gum.Instantiation(cpt) inst[x_var] = x_value - if not parents: - assert len(bn.parents(x_var)) == 0 - else: - assert bn.parents(x_var) == set(bn.ids(parents.keys())) + + if parents: for var in parents.keys(): inst[var] = parents[var] + safe_assert(bn.parents(x_var) == set(bn.ids(parents.keys()))) + else: + safe_assert(len(bn.parents(x_var)) == 0) return max(cpt.get(inst), 1e-10) # Smoothing -# Get a Naive Bayes log-joint -def get_naivebayes_log_joint( - bn: gum.BayesNet, target: str, t: float, children: dict -) -> float: +# Get a naive Bayes log-joint +def nb_log_joint(bn: gum.BayesNet, target: str, t: float, children: dict) -> float: """ - Get log[P(target=t, children)] from the BN's CPT of `target`. - The BN is assumed to be a Naive Bayes model with `target` the target variable. + Get log[P(target=t, children)] by exploiting the BN factorization. + The DAG is a naive Bayes with `target` a binary target variable. 
""" - sum_log = 0 + sum_log = math.log(cpt_value(bn, target, t)) for var, val in children.items(): - sum_log += math.log(get_cond(bn, var, val, {target: t})) - sum_log += math.log(get_cond(bn, target, t)) + sum_log += math.log(cpt_value(bn, var, val, {target: t})) return sum_log # Get the lower posterior from a CN -def get_lower_posterior( +def nb_log_lower_posterior( bn_min: gum.BayesNet, bn_max: gum.BayesNet, target: str, t: float, children: dict ) -> float: """ Get log P_lower(target=t | children). - bn_min and bn_max derive from a binary CN. - The DAG is assumed to be a Naive Bayes model with `target` the target variable. + bn_min and bn_max derive from a CN. + The DAG is a naive Bayes with `target` a binary target variable. """ - lp_lower = get_naivebayes_log_joint(bn_min, target, t, children) - lp_upper = get_naivebayes_log_joint(bn_max, target, 1 - t, children) + l_lower = nb_log_joint(bn_min, target, t, children) + l_upper = nb_log_joint(bn_max, target, 1 - t, children) - return lp_lower - lp_upper - math.log1p(math.exp(lp_lower - lp_upper)) + return l_lower - l_upper - math.log1p(math.exp(l_lower - l_upper)) # Run inferences on a BN def run_inference_bn(bn, target: str, evid_vec): """ - The BN is assumed to be a Naive Bayes model with `target` the target variable. + The BN is assumed to be a naive Bayes model with `target` the target variable. 
""" # Store information cov = sorted(list(bn.names())) cov.remove(target) + # Set seed + set_seed() + # Debug - assert len(cov) == bn.size() - 1 + safe_assert(len(cov) == bn.size() - 1) # Create object for inference bn_ie = gum.LazyPropagation(bn) @@ -186,8 +208,8 @@ def run_inference_bn(bn, target: str, evid_vec): probs.append(prob) # Debug - assert len(mpes) == len(evid_vec) - assert len(probs) == len(evid_vec) + safe_assert(len(mpes) == len(evid_vec)) + safe_assert(len(probs) == len(evid_vec)) return mpes, probs @@ -195,19 +217,19 @@ def run_inference_bn(bn, target: str, evid_vec): # Run inferences on a CN def run_inference_cn(cn, target: str, evid_vec, exp: str): """ - The CN is assumed to be a Naive Bayes model with `target` the target variable. + The CN is assumed to be a naive Bayes model with `target` the target variable. """ # Store information - with TemporaryDirectory() as tmp_path: - cn.saveBNsMinMax(f"{tmp_path}/bn_min_{exp}.bif", f"{tmp_path}/bn_max_{exp}.bif") - bn_min = gum.loadBN(f"{tmp_path}/bn_min_{exp}.bif") - bn_max = gum.loadBN(f"{tmp_path}/bn_max_{exp}.bif") + bn_min, bn_max = get_min_max_bns(cn, exp) cov = sorted(list(bn_min.names())) cov.remove(target) + # Set seed + set_seed() + # Debug - assert len(cov) == bn_min.size() - 1 + safe_assert(len(cov) == bn_min.size() - 1) # Compute all combinations of evidence mpes = [] @@ -221,8 +243,8 @@ def run_inference_cn(cn, target: str, evid_vec, exp: str): probs_alt.append(prob_alt) # Debug - assert len(mpes) == len(evid_vec) - assert len(probs) == len(evid_vec) - assert len(probs_alt) == len(evid_vec) + safe_assert(len(mpes) == len(evid_vec)) + safe_assert(len(probs) == len(evid_vec)) + safe_assert(len(probs_alt) == len(evid_vec)) return mpes, probs, probs_alt diff --git a/src/learning.py b/src/learning.py new file mode 100644 index 0000000..5d7c603 --- /dev/null +++ b/src/learning.py @@ -0,0 +1,66 @@ +import pandas as pd +import pyagrum as gum + +from src.config import get_cur_dir, safe_assert, 
set_seed + + +# Learn BN parameters from a given BN and data +def learn_bn_params(bn, data): + + bn_copy = gum.BayesNet(bn) + + learner = gum.BNLearner(data, bn_copy) + learner.useSmoothingPrior(1e-5) + bn_learnt = learner.learnParameters(bn_copy) + + return bn_learnt + + +# Estimate BNs from rpop and pool +def estimate_bns(exp, config) -> None: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Set seed + set_seed() + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + bn = gum.loadBN(f'{cur_dir / config["bns_path"]}/gt/{exp}.bif') + n_nodes = len(bn.nodes()) + gpop_ss = config["gpop_ss"] + rpop_ss = int(gpop_ss * config["rpop_prop"]) + pool_ss = int(gpop_ss * config["pool_prop"]) + + # Debug + safe_assert(gpop_ss == gpop.shape[0]) + safe_assert(n_nodes == gpop.loc[:, ~gpop.columns.str.contains("in-")].shape[1]) + + # For each data sample ... + for sample in range(config["samples"]): + + # ... retrieve pool and rpop, ... + pool = gpop[gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes] + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, :n_nodes] + + # ... estimate BN from rpop, ... + bn_learnt = learn_bn_params(bn, rpop) + gum.saveBN( + bn_learnt, + f'{cur_dir / config["bns_path"] / "rpop"}/{f"bn_{exp}_sample{sample}"}.bif', + ) + + # ... estimate BN from pool, ... 
+ bn_learnt = learn_bn_params(bn, pool) + gum.saveBN( + bn_learnt, + f'{cur_dir / config["bns_path"] / "pool"}/{f"bn_{exp}_sample{sample}"}.bif', + ) + + # Debug + safe_assert(len(pool) == sum(gpop[f"in-pool-{sample}"])) + safe_assert(len(pool) == pool_ss) + safe_assert(len(rpop) == rpop_ss) + + return diff --git a/src/membership_attack.py b/src/membership_attack.py deleted file mode 100644 index c9c704a..0000000 --- a/src/membership_attack.py +++ /dev/null @@ -1,367 +0,0 @@ -import math -import traceback - -import numpy as np -import pandas as pd -import pyagrum as gum -from scipy.stats import norm -from sklearn import metrics - -from src.config import get_base_path, set_global_seed -from src.utils import (add_counts_to_bn, get_ll, get_llr, get_noisy_bn, - sample_from_cn) - - -# Get the attack power related to a fixed error -def get_power(llr_ref, llr_gen, ground_truth, error) -> float: - - # Compute the threshold - t = np.quantile(llr_ref, 1 - error).item() - - # Test: L(x) > t => reject H_0 => assign `x` to target_pop - y_pred = llr_gen > t - - # Compute power (i.e., true positive rate) - power = sum(ground_truth & y_pred) / sum(ground_truth) - - return power - - -# MIA: membership inference attack -def run_mia(model, baseline, rpop, gpop, ground_truth, error_vec): - - # Compute llr(x) on reference and general populations - llr_ref = ( - rpop.apply(lambda x: get_llr(x.to_dict(), baseline, model), axis=1) - .dropna() - .sort_values() - ) - llr_gen = gpop[[*rpop.columns]].apply( - lambda x: get_llr(x.to_dict(), baseline, model), axis=1 - ) - - power_vec = [] - - # Get the power for each error - for error in error_vec: - power = get_power(llr_ref, llr_gen, ground_truth, error) - power_vec.append(power) - - # Compute and store AUC - auc = metrics.auc(error_vec, power_vec) - - return power_vec, auc - - -# Get the maximum likelihood BN -def get_maxll_bn(bns_sample, rpop): - """ - Given a list `bns_sample` of BNs, - find argmax_{BN in bns_sample} ll(BN | rpop), - where 
ll is the log-likelihood function. - """ - - maxll_bn = None - maxll = -np.inf - - for bn in bns_sample: - - # Estimate the likelihood of rpop - bn_ie = gum.LazyPropagation(bn) - llr_im = rpop.apply(lambda x: get_ll(x.to_dict(), bn_ie), axis=1).dropna() - llr = np.sum(llr_im) - - if llr > maxll: - maxll_bn = bn - maxll = llr - - return maxll_bn - - -# Find eps s.t. |AUC(eps) - AUC(CN)| < tol -def get_eps(exp, ess, config): - - # Get base path - base_path = get_base_path(config) - - # Set seed - set_global_seed(config["seed"]) - - # Init hyperp. - eps_vec = eval(config["ess_dict"][ess]) - results_path = base_path / config["results_path"] - n_samples = config["n_samples"] - n_bns = config["n_bns"] - error = eval(config["error"]) - tol = config["tol"] - - # Read data - gpop = pd.read_csv(f'{base_path / config["data_path"]}/{exp}.csv') - bn = gum.loadBN(f'{base_path / config["bns_path"]}/{exp}.bif') - n_nodes = config["n_nodes"] - gpop_ss = config["gpop_ss"] - rpop_ss = int(gpop_ss * config["rpop_prop"]) - pool_ss = int(gpop_ss * config["pool_prop"]) - - # Debug - assert gpop_ss == gpop.shape[0] - assert n_nodes == gpop.shape[1] - - bn_theta_vec = [] - bn_theta_hat_vec = [] - cn_vec = [] - - # For any data sample ... - for sample in range(n_samples): - - # ... sample pool and rpop, ... - pool_idx = np.random.choice(range(gpop_ss), size=pool_ss, replace=False) - gpop[f"in-pool-{sample}"] = gpop.index.isin(pool_idx) - pool = gpop[gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes] - rpop = gpop[~gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes].sample(rpop_ss) - - # ... estimate BN from rpop, ... - learner = gum.BNLearner(rpop) - learner.useSmoothingPrior(1e-5) - bn_theta_vec.append(learner.learnParameters(bn.dag())) - - # ... estimate BN from pool, ... - learner = gum.BNLearner(pool) - learner.useSmoothingPrior(1e-5) - bn_theta_hat_vec.append(learner.learnParameters(bn.dag())) - - # ... 
and estimate CN from pool (by local IDM) - bn_counts = gum.BayesNet(bn) - add_counts_to_bn(bn_counts, pool) - cn = gum.CredalNet(bn_counts) - cn.idmLearning(ess) - cn_vec.append(cn) - - # Debug - assert len(pool) == sum(gpop[f"in-pool-{sample}"]) - assert len(pool) == pool_ss - assert len(rpop) == rpop_ss - - # Debug - assert len(bn_theta_vec) == n_samples - assert len(bn_theta_hat_vec) == n_samples - assert len(cn_vec) == n_samples - - # Run MIA against CN - auc_cn_vec = [] - for sample in range(n_samples): - - # Retrieve sample-related info - y_true = gpop[f"in-pool-{sample}"] - cn = cn_vec[sample] - bn_theta_ie = gum.LazyPropagation(bn_theta_vec[sample]) - - # Extract random subset within simplex - bns_sample = sample_from_cn(cn, n_bns, "inside") - - # Get the maximum likelihood BN - best_bn = get_maxll_bn(bns_sample, rpop) - bn_ie = gum.LazyPropagation(best_bn) - - # MIA - try: - _, auc = run_mia(bn_ie, bn_theta_ie, rpop, gpop, y_true, error) - auc_cn_vec.append(auc) - - except Exception: - - # Debug - with open(f"{results_path}/log.txt", "a") as log: - log.write(f"{exp}: error with sample {sample} (CN).\n") - log.write(traceback.format_exc()) - - # Compute Avg(AUC(CN)) across data samples - auc_cn = sum(auc_cn_vec) / len(auc_cn_vec) - - # Find eps - eps_best = eps_vec[-1] - - # For each eps ... - for eps in eps_vec: - - auc_bn_noisy_vec = [] - - # ... run MIA against noisy BN ... 
- for sample in range(n_samples): - - # Retrieve sample-related info - y_true = gpop[f"in-pool-{sample}"] - bn_theta_hat = bn_theta_hat_vec[sample] - bn_theta_ie = gum.LazyPropagation(bn_theta_vec[sample]) - - # Get noisy BN - scale = (2 * bn_theta_hat.size()) / (len(pool) * eps) - bn_noisy = get_noisy_bn(bn_theta_hat, scale) - bn_noisy_ie = gum.LazyPropagation(bn_noisy) - - try: - - # MIA - _, auc = run_mia(bn_noisy_ie, bn_theta_ie, rpop, gpop, y_true, error) - auc_bn_noisy_vec.append(auc) - - except Exception: - - # Debug - with open(f"{results_path}/log.txt", "a") as log: - log.write( - f"{exp}: error with sample {sample} (BN noisy, eps: {eps}).\n" - ) - log.write(traceback.format_exc()) - - # ... and compute Avg(AUC(eps)) across data samples - auc_bn = sum(auc_bn_noisy_vec) / n_samples - - # Condition on |AUC(eps) - AUC(CN)| - if abs(auc_cn - auc_bn) <= tol: - eps_best = eps - break - - # Store found eps - meta_file_path = ( - results_path - / f'results_nodes{config["n_nodes"]}_ess{ess}' - / config["meta_file"] - ) - with open(meta_file_path, "a") as m: - m.write(f"- {exp}. Nodes: {n_nodes} Eps: {eps_best}\n") - - return exp, ess, eps_best - - -# Membership attack against CN and BN -def attack_cn_bn(exp, ess, config): - - # Get base path - base_path = get_base_path(config) - - # Set seed - set_global_seed(config["seed"]) - - # Init hyperp. - results_path = base_path / config["results_path"] - n_samples = config["n_samples"] - n_bns = config["n_bns"] - error = eval(config["error"]) - - # Read data - gpop = pd.read_csv(f'{base_path / config["data_path"]}/{exp}.csv') - bn = gum.loadBN(f'{base_path / config["bns_path"]}/{exp}.bif') - n_nodes = len(bn.nodes()) - gpop_ss = config["gpop_ss"] - rpop_ss = int(gpop_ss * config["rpop_prop"]) - pool_ss = int(gpop_ss * config["pool_prop"]) - - # Debug - assert gpop_ss == gpop.shape[0] - assert n_nodes == gpop.shape[1] - - bn_theta_vec = [] - bn_theta_hat_vec = [] - cn_vec = [] - - # For any data sample ... 
- for sample in range(n_samples): - - # ... sample pool and rpop, ... - pool_idx = np.random.choice(range(gpop_ss), size=pool_ss, replace=False) - gpop[f"in-pool-{sample}"] = gpop.index.isin(pool_idx) - pool = gpop[gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes] - rpop = gpop[~gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes].sample(rpop_ss) - - # ... estimate BN from rpop, ... - learner = gum.BNLearner(rpop) - learner.useSmoothingPrior(1e-5) - bn_theta_vec.append(learner.learnParameters(bn.dag())) - - # ... estimate BN from pool, ... - learner = gum.BNLearner(pool) - learner.useSmoothingPrior(1e-5) - bn_theta_hat_vec.append(learner.learnParameters(bn.dag())) - - # ... and estimate CN from pool (by local IDM) - bn_counts = gum.BayesNet(bn) - add_counts_to_bn(bn_counts, pool) - cn = gum.CredalNet(bn_counts) - cn.idmLearning(ess) - cn_vec.append(cn) - - # Debug - assert len(pool) == sum(gpop[f"in-pool-{sample}"]) - assert len(pool) == pool_ss - assert len(rpop) == rpop_ss - - # Debug - assert len(bn_theta_vec) == n_samples - assert len(bn_theta_hat_vec) == n_samples - assert len(cn_vec) == n_samples - - # Compute theoretical bound - compl = bn.dim() - bound = math.sqrt(compl / pool_ss) - - # Find power (beta) for any error (alpha) given theoretical bound - z_alpha = [norm.ppf(1 - i).item() for i in error] - z_one_minus_beta = [bound - i for i in z_alpha] - beta = [norm.cdf(i).item() for i in z_one_minus_beta] - - # Init results - results = pd.DataFrame({"error": error, "power_bound": beta}) - - # Run MIA against CN - for sample in range(n_samples): - - # Retrieve sample-related info - y_true = gpop[f"in-pool-{sample}"] - cn = cn_vec[sample] - bn_theta_ie = gum.LazyPropagation(bn_theta_vec[sample]) - - # Extract random subset within simplex - bns_sample = sample_from_cn(cn, n_bns, "inside") - - # Get the maximum likelihood BN - best_bn = get_maxll_bn(bns_sample, rpop) - bn_ie = gum.LazyPropagation(best_bn) - - # MIA - try: - power_vec, _ = run_mia(bn_ie, bn_theta_ie, rpop, 
gpop, y_true, error) - results[f"power_CN_sample{sample}"] = power_vec - - except Exception: - - # Debug - with open(f"{results_path}/log.txt", "a") as log: - log.write(f"{exp}: error with sample {sample} (CN).\n") - log.write(traceback.format_exc()) - - # Run MIA against BN - for sample in range(n_samples): - - # Retrieve sample-related info - y_true = gpop[f"in-pool-{sample}"] - bn_theta_hat_ie = gum.LazyPropagation(bn_theta_hat_vec[sample]) - bn_theta_ie = gum.LazyPropagation(bn_theta_vec[sample]) - - try: - - # MIA - power_vec, _ = run_mia( - bn_theta_hat_ie, bn_theta_ie, rpop, gpop, y_true, error - ) - results[f"power_BN_sample{sample}"] = power_vec - - except Exception: - - # Debug - with open(f"{results_path}/log.txt", "a") as log: - log.write(f"{exp}: error with sample {sample} (BN).\n") - log.write(traceback.format_exc()) - - # Save results - results.to_csv(f"{results_path}/{exp}-ess{ess}.csv", index=False) diff --git a/src/mia.py b/src/mia.py new file mode 100644 index 0000000..5c843b5 --- /dev/null +++ b/src/mia.py @@ -0,0 +1,323 @@ +import math + +import numpy as np +import pandas as pd +import pyagrum as gum +from scipy.stats import norm +from sklearn import metrics + +from src.config import get_cur_dir, set_seed +from src.defense import noisy_bn + + +# MIA attack vs a BN +def mia_vs_bn(exp, config) -> dict: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Init results + power_res = pd.DataFrame({"error": eval(config["error"])}) + auc_res = pd.DataFrame({"sample": range(config["samples"])}) + auc_res["exp"] = exp + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + + # Set seed + set_seed() + + # For each data sample ... + auc_bns_dict = dict() + for sample in range(config["samples"]): + + # ... read the BNs as estimated from rpop and pool, ... 
+ bn_theta = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/rpop/bn_{exp}_sample{sample}.bif" + ) + bn_theta_hat = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/pool/bn_{exp}_sample{sample}.bif" + ) + + bn_theta_hat_ie = gum.LazyPropagation(bn_theta_hat) + bn_theta_ie = gum.LazyPropagation(bn_theta) + + # ... retrieve rpop, ... + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, : len(bn_theta.nodes())] + + # try: + + # ... and perform membership inference on gpop + power_vec, auc = run_mia( + bn_theta_hat_ie, + bn_theta_ie, + rpop, + gpop, + gpop[f"in-pool-{sample}"], + eval(config["error"]), + ) + power_res[f"power_BN_sample{sample}"] = power_vec + auc_bns_dict[sample] = auc + + # except Exception: + + # # Debug + # with open(f"{results_path}/log.txt", "a") as log: + # log.write(f"{exp}: error with sample {sample} (BN).\n") + # log.write(traceback.format_exc()) + + # Save results + power_res.to_csv( + f'{cur_dir}/{config["results_path"]}/bns/power_bn_{exp}.csv', index=False + ) + + # Return + auc_res["auc_bn"] = auc_res.apply(lambda row: auc_bns_dict[row["sample"]], axis=1) + + return auc_res + + +# MIA attack vs a CN +def mia_vs_cn(exp, config) -> pd.DataFrame: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Init results + power_res = pd.DataFrame({"error": eval(config["error"])}) + auc_res = pd.DataFrame({"sample": range(config["samples"])}) + auc_res["exp"] = exp + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + + # Set seed + set_seed() + + # For each data sample ... + auc_cns_dict = dict() + for sample in range(config["samples"]): + + # ... read the BN as inferred from the CN + bn_theta_hat = gum.loadBN( + f'{cur_dir}/{config["atk_path"]}/bn_{exp}_sample{sample}.bif' + ) + + # ... read the BN as estimated from rpop, ... 
+ bn_theta = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/rpop/bn_{exp}_sample{sample}.bif" + ) + + bn_theta_hat_ie = gum.LazyPropagation(bn_theta_hat) + bn_theta_ie = gum.LazyPropagation(bn_theta) + + # ... retrieve rpop, ... + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, : len(bn_theta.nodes())] + + # try: + + # ... and perform membership inference on gpop + power_vec, auc = run_mia( + bn_theta_hat_ie, + bn_theta_ie, + rpop, + gpop, + gpop[f"in-pool-{sample}"], + eval(config["error"]), + ) + power_res[f"power_CN_sample{sample}"] = power_vec + auc_cns_dict[sample] = auc + + # except Exception: + + # # Debug + # with open(f"{results_path}/log.txt", "a") as log: + # log.write(f"{exp}: error with sample {sample} (BN).\n") + # log.write(traceback.format_exc()) + + # Save results + power_res.to_csv( + f'{cur_dir}/{config["results_path"]}/cns/power_cn_{exp}.csv', index=False + ) + + # Return + auc_res["auc_cn"] = auc_res.apply(lambda row: auc_cns_dict[row["sample"]], axis=1) + + return auc_res + + +# Get theoretical power +def theoretical_power(exp, config) -> None: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Read data + bn = gum.loadBN(f'{get_cur_dir(config) / config["bns_path"]}/gt/{exp}.bif') + results = pd.read_csv(f'{cur_dir}/{config["results_path"]}/bns/power_bn_{exp}.csv') + + # Set seed + set_seed() + + # Compute bound + bound = math.sqrt(bn.dim() / int(config["gpop_ss"] * config["pool_prop"])) + + # Find power (beta) for any error (alpha) given theoretical bound + z_alpha = [norm.ppf(1 - i).item() for i in eval(config["error"])] + z_one_minus_beta = [bound - i for i in z_alpha] + beta = [norm.cdf(i).item() for i in z_one_minus_beta] + + # Save results + results["power_bound"] = beta + results.to_csv( + f'{cur_dir}/{config["results_path"]}/bns/power_bn_{exp}.csv', index=False + ) + + return + + +# Find eps s.t. 
|AUC(eps) - AUC(CN)| < tol +def find_epsilon(exp, config) -> dict: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Init results + power_res = pd.DataFrame({"error": eval(config["error"])}) + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + gpop_ss = config["gpop_ss"] + pool_ss = int(gpop_ss * config["pool_prop"]) + auc_res = pd.read_csv(f'{cur_dir}/{config["auc_meta"]}') + auc_res = auc_res[auc_res["exp"] == exp] + eps_vec = eval(config["eps_vec"]) + + # Set seed + set_seed() + + # For each data sample ... + eps_dict = dict() + auc_noisy_dict = dict() + for sample in range(config["samples"]): + + # ... read the BNs as estimated from rpop and pool, ... + bn_theta = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/rpop/bn_{exp}_sample{sample}.bif" + ) + bn_theta_hat = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/pool/bn_{exp}_sample{sample}.bif" + ) + + # ... retrieve rpop, ... + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, : len(bn_theta.nodes())] + + # ... get CN AUC, ... + auc_cn = auc_res.loc[auc_res["sample"] == sample, "auc_cn"].values[0] + + # ... init results, ... + eps_dict[sample] = None + auc_noisy_dict[sample] = None + + # ... 
and find epsilon + for eps in eps_vec: + + # Get noisy BN + scale = (2 * bn_theta_hat.size()) / (pool_ss * eps) + bn_noisy = noisy_bn(bn_theta_hat, scale) + + bn_noisy_ie = gum.LazyPropagation(bn_noisy) + bn_theta_ie = gum.LazyPropagation(bn_theta) + + # Perform membership inference on gpop + power_vec, auc = run_mia( + bn_noisy_ie, + bn_theta_ie, + rpop, + gpop, + gpop[f"in-pool-{sample}"], + eval(config["error"]), + ) + + # Condition on |AUC(eps) - AUC(CN)| + if abs(auc_cn - auc) < config["tol"]: + eps_dict[sample] = eps + auc_noisy_dict[sample] = auc + power_res[f"power_BN_noisy_sample{sample}"] = power_vec + break + + # Save results + power_res.to_csv( + f'{cur_dir}/{config["results_path"]}/bn_noisy/power_bn_{exp}.csv', index=False + ) + + # Return + auc_res["epsilon"] = auc_res.apply(lambda row: eps_dict[row["sample"]], axis=1) + auc_res["auc_noisy_bn"] = auc_res.apply( + lambda row: auc_noisy_dict[row["sample"]], axis=1 + ) + + return auc_res + + +# MIA: membership inference attack +def run_mia(model, baseline, rpop, gpop, ground_truth, error_vec): + + # Compute llr(x) on reference and general populations + llr_ref = ( + rpop.apply(lambda x: get_llr(x.to_dict(), baseline, model), axis=1) + .dropna() + .sort_values() + ) + llr_gen = gpop[[*rpop.columns]].apply( + lambda x: get_llr(x.to_dict(), baseline, model), axis=1 + ) + + power_vec = [] + + # Get the power for each error + for error in error_vec: + power = get_power(llr_ref, llr_gen, ground_truth, error) + power_vec.append(power) + + # Compute and store AUC + auc = metrics.auc(error_vec, power_vec) + + return power_vec, auc + + +# Get the attack power related to a fixed error +def get_power(llr_ref, llr_gen, ground_truth, error) -> float: + + # Compute the threshold + t = np.quantile(llr_ref, 1 - error).item() + + # Test: L(x) > t => reject H_0 => assign `x` to target_pop + y_pred = llr_gen > t + + # Compute power (i.e., true positive rate) + power = sum(ground_truth & y_pred) / sum(ground_truth) + + 
return power + + +# Log-likelihood function +def get_ll(x: dict, theta): + + # Erase all evidences and apply addEvidence(key,value) for every pairs in x + theta.setEvidence(x) + + # Compute P(x | theta) + ll = theta.evidenceProbability() + + return np.log(ll) + + +# Log-likelihood ratio (llr) function +def get_llr(x: dict, theta, theta_hat): + + # Compute log-likelihoods + ll_theta = get_ll(x, theta) + ll_theta_hat = get_ll(x, theta_hat) + + return ll_theta_hat - ll_theta diff --git a/src/run_exp.py b/src/run_exp.py deleted file mode 100644 index ec51f00..0000000 --- a/src/run_exp.py +++ /dev/null @@ -1,61 +0,0 @@ -import gc -import multiprocessing # noqa: F401 # pylint: disable=unused-import -from itertools import product - -from joblib import Parallel, delayed - -from src.config import get_base_path -from src.inference import run_inferences -from src.membership_attack import attack_cn_bn, get_eps - - -def run_cn_vs_noisybn(config): - - # Get base path - base_path = get_base_path(config) - - # Set number of threads for parallelization - num_cores = eval(config["num_cores"]) - - # For each ESS and each model ... - exp_vec = [ - f.stem for f in (base_path / config["data_path"]).iterdir() if f.is_file() - ] - ess_vec = config["ess_dict"].keys() - - # ... find eps s.t. |AUC(eps) - AUC(CN)| < tol, ... - res = Parallel(n_jobs=num_cores)( - delayed(get_eps)(exp, ess, config) for exp, ess in product(exp_vec, ess_vec) - ) - - # ... and run inferences - _ = Parallel(n_jobs=num_cores)( - delayed(run_inferences)(exp, ess, eps, config) for exp, ess, eps in res - ) - - # Clean - gc.collect() - - -def run_cn_privacy(config): - - # Get base path - base_path = get_base_path(config) - - # Set number of threads for parallelization - num_cores = eval(config["num_cores"]) - - # For each ESS and each model ... - exp_vec = [ - f.stem for f in (base_path / config["data_path"]).iterdir() if f.is_file() - ] - ess_vec = eval(config["ess_vec"]) - - # ... 
run MIA attack on BN and CN - Parallel(n_jobs=num_cores)( - delayed(attack_cn_bn)(exp, ess, config) - for exp, ess in product(exp_vec, ess_vec) - ) - - # Clean - gc.collect() diff --git a/src/utils.py b/src/utils.py index e980b0f..9b75f3e 100644 --- a/src/utils.py +++ b/src/utils.py @@ -1,245 +1,383 @@ -import io -import re -from collections import defaultdict -from contextlib import redirect_stdout +from fractions import Fraction +from tempfile import TemporaryDirectory +import cdd +import cdd.gmp +import hopsy import numpy as np import pyagrum as gum -from more_itertools import random_product -from numpy import random -from numpy.random import random_sample +from src.config import safe_assert -# Log-likelihood function -def get_ll(x: dict, theta): - # Erase all evidences and apply addEvidence(key,value) for every pairs in x - theta.setEvidence(x) +# Add counts of events to a BN +def add_counts_to_bn(bn, data): + + for node in bn.names(): + var = bn.variable(node) + parents = bn.parents(node) + parent_names = [bn.variable(p).name() for p in parents] + + shape = [bn.variable(p).domainSize() for p in parents] + [var.domainSize()] + counts_array = np.zeros(shape, dtype=float) # float, not int + + for _, row in data.iterrows(): + try: + key = tuple([int(row[p]) for p in parent_names] + [int(row[node])]) + counts_array[key] += 1.0 + except KeyError: + continue - # Compute P(x | theta) - ll = theta.evidenceProbability() + bn.cpt(node).fillWith(counts_array.flatten().tolist()) - return np.log(ll) +# Get the BN inside a CN with max entropy distribution +def maxent_cn(bn_min, bn_max) -> gum.BayesNet: -# Log-likelihood ratio (llr) function -def get_llr(x: dict, theta, theta_hat): + # Init an empty BN + bn = gum.BayesNet(bn_min) - # Compute log-likelihoods - ll_theta = get_ll(x, theta) - ll_theta_hat = get_ll(x, theta_hat) + # For each variable ... + for var in bn.names(): - return ll_theta_hat - ll_theta + # ... get the centroid CPT, ... 
+ cpt = maxent_cpt(bn_min.cpt(var), bn_max.cpt(var)) + # ... and fill the BN + bn.cpt(var).fillWith(cpt.flatten()) -# Parse the credal network -def parse_cn(cn) -> tuple: + # Debug + safe_assert(check_consistency(bn, bn_min, bn_max)) - # Get the DAG - dag = gum.BayesNet(cn.current_bn()) + return bn - # Cast CN to string - buffer = io.StringIO() - with redirect_stdout(buffer): - print(cn) - cn_str = buffer.getvalue() - credal_dict = defaultdict(lambda: defaultdict(list)) - current_var = None +# Get the BN CPT inside a CN CPT with max entropy distribution +def maxent_cpt(cpt_min, cpt_max) -> np.array: - lines = cn_str.strip().split("\n") + # Transform CPTs into pandas dataframes + cpt_min = np.atleast_2d(cpt_min.topandas()) + cpt_max = np.atleast_2d(cpt_max.topandas()) - for line in lines: - line = line.strip() + # For each row in the CPT ... + cpt = [] + for row in range(cpt_min.shape[0]): - # Variable identification - var_match = re.match(r"^([A-Za-z0-9_]+):", line) - if var_match: - current_var = var_match.group(1) - continue + # ... get the centroid credal set, ... + c = maxent_cset(cpt_min[row, :], cpt_max[row, :]) + cpt.append(c) - if current_var is None or not line: - continue + # Reshape the CPT + cpt = np.array(cpt) - # CPT identification - cpt_match = re.match(r"^<([^>]*)>\s*:\s*(.*)", line) - if cpt_match: - condition = f"<{cpt_match.group(1).strip()}>" - raw_cpt = cpt_match.group(2) + # Debug + safe_assert(cpt_min.shape == cpt_max.shape) + safe_assert(cpt.shape == cpt_min.shape) - # Extraction of inner lists: [[x,x,x], [x,x,x], ...] 
- vectors = re.findall(r"\[\s*([^\[\]]+?)\s*\]", raw_cpt) - for vec in vectors: - prob_vec = [float(x.strip()) for x in vec.split(",")] - credal_dict[current_var][condition].append(prob_vec) + return cpt - params = [] - for var in credal_dict: - for cond, vectors in credal_dict[var].items(): - params.append((var, cond, vectors)) - return dag, params +# Get the max-entropy distribution inside a credal set +def maxent_cset(vec_min, vec_max) -> np.array: + rank = {v: k for k, v in enumerate(sorted(set(vec_min)))} + vec_order = np.array([rank[val] for val in vec_min]) -# Compute a random subset of BNs from the CN -def sample_from_cn(cn, n: int, where: str) -> list: - """ - Sample random BNs from the CN. - - Parameters: - - `cn`: the given CN. - - `n`: number of BNs to extract from the CN. - - `where`: can be `inside` or `outside`. - "inside": the BNs are taken from within the credal set; - "outside": the BNs are vertices of the credal set. - """ + s = 1 - np.sum(vec_min) + + out = vec_min + while s > 0: + idx0 = np.where(vec_order == 0)[0] + idx1 = np.where(vec_order == 1)[0] + idx_len = len(idx0) + + + + try: + s_cond = s / idx_len < out[idx1[0]] - out[idx0[0]] + mat = np.stack( + [ + ( + (s / idx_len) * np.ones(len(idx0)) + if s_cond + else out[idx1] - out[idx0] + ), + vec_max[idx0] - out[idx0], + ] + ) + except IndexError: + s_cond = True + mat = np.stack( + [(s / idx_len) * np.ones(len(idx0)), vec_max[idx0] - out[idx0]] + ) + + mat_min = np.min(mat) + q = np.argwhere(mat == mat_min) + + if np.any(q[:, 0] == 1): + if len(idx0) > len(q): + vec_order[~np.isin(np.arange(len(out)), idx0[q[:, 1]])] += 1 + elif not s_cond: + vec_order[idx0[q[:, 1]]] += 1 + + out[idx0] += mat_min + s -= mat_min * len(idx0) + vec_order -= 1 + + return out + + +# Get the centroid of a CN +def centroid_cn(bn_min, bn_max) -> gum.BayesNet: + + # Init an empty BN + bn = gum.BayesNet(bn_min) + + # For each variable ... + for var in bn.names(): + + # ... get the centroid CPT, ... 
+ cpt = centroid_cpt(bn_min.cpt(var), bn_max.cpt(var)) + + # ... and fill the BN + bn.cpt(var).fillWith(cpt.flatten()) + + # Debug + safe_assert(check_consistency(bn, bn_min, bn_max)) + + return bn + + +# Get the centroid of a CN CPT +def centroid_cpt(cpt_min, cpt_max) -> np.array: + + # Transform CPTs into pandas dataframes + cpt_min = np.atleast_2d(cpt_min.topandas()) + cpt_max = np.atleast_2d(cpt_max.topandas()) + + # For each row in the CPT ... + cpt = [] + for row in range(cpt_min.shape[0]): + + # ... get the centroid credal set, ... + c = centroid_cset(cpt_min[row, :], cpt_max[row, :]) + cpt.append(c) + + # Reshape the CPT + cpt = np.array(cpt) + + # Debug + safe_assert(cpt_min.shape == cpt_max.shape) + safe_assert(cpt.shape == cpt_min.shape) + + return cpt + + +# Get the centroid of a credal set as the average of its extreme points +def centroid_cset(vec_min, vec_max) -> np.array: + + # Define the (in)equalities (i.e., get the H-representation of the credal set) + n_par = len(vec_min) + A = np.concatenate( + (-np.eye(n_par), np.eye(n_par), np.atleast_2d(np.ones(n_par))), axis=0 + ) + b = np.concatenate((vec_max, -vec_min, np.atleast_1d(-1))).reshape(len(A), 1) + bA = np.concatenate((b, A), axis=1) + bA_frac = np.array( + [[Fraction(x).limit_denominator() for x in row] for row in bA], dtype=object + ) # Needed for numerical stability + mat_frac = cdd.gmp.matrix_from_array( + array=bA_frac, rep_type=cdd.RepType.INEQUALITY, lin_set=set([len(A) - 1]) + ) + + # Get the polytope and extreme points. 
Each point is a row of the matrix `vertices` + poly_frac = cdd.gmp.polyhedron_from_matrix(mat_frac) + ext_frac = cdd.gmp.copy_generators(poly_frac) + vertices_frac = np.array(ext_frac.array)[:, 1:] + vertices = np.array( + [[float(x) for x in row] for row in vertices_frac], dtype=object + ) + + # Compute the centroid as the average across extreme points + centroid = np.sum(vertices, axis=0) / len(vertices) + + # Debug + safe_assert(len(vec_min) == len(vec_max)) + safe_assert(len(b) == 2 * len(vec_min) + 1) + safe_assert(A.shape == (len(b), len(vec_min))) + safe_assert(bA.shape == (2 * len(vec_min) + 1, len(vec_min) + 1)) + safe_assert(vertices.shape[1] == n_par) + + return centroid + + +# BNs sampler from a CN +def sample_from_cn(bn_min, bn_max, n_bns: int) -> list: - random.seed(42) - - # Parse CN - dag, params = parse_cn(cn) - - # Store variables indexes - var_idx = { - var: [idx for idx, elem in enumerate(params) if elem[0] == var] - for var in dag.names() - } - - # Cases - if where == "inside": - sample = sample_inside - elif where == "outside": - sample = sample_outside - else: - msg = "'where' can be either 'inside' or 'outside'" - print(msg) - raise ValueError(msg) - - # Draw n random BNs - k = 0 + # Get the DAG and extreme BNs + dag = gum.BayesNet(bn_min) + + # For each variable ... + cpts_dict = {} + for var in dag.names(): + + # ... sample `n_bns` CPTs from the CN + cpts_dict[var] = sample_from_cpts(bn_min.cpt(var), bn_max.cpt(var), n_bns) + + # For each sample ... bns = [] - while k < n: + for i in range(n_bns): - # Init an empty BN + # ... init an empty BN ... bn = gum.BayesNet(dag) - # Sample from CN - next_sample = next(sample(params)) - - # Fill the BN's CPTs + # ... 
and fill its CPTs for var in dag.names(): - array = np.array([(next_sample[idx]) for idx in var_idx.get(var)]).flatten() - bn.cpt(var).fillWith(array) + bn.cpt(var).fillWith(cpts_dict[var][i]) bns.append(bn) - k += 1 + + # Debug + safe_assert(check_consistency(bn, bn_min, bn_max)) # Debug - # assert(n == len(bns)) + safe_assert(len(cpts_dict) == len(dag.names())) + safe_assert(len(bns) == n_bns) return bns -# Given a parsed CN called `params`, sample a BN inside the credal set -def sample_inside(params): +# Sample from two extreme CPTs +def sample_from_cpts(cpt_min, cpt_max, n_bns) -> list: + + # Transform CPTs into pandas dataframes + cpt_min = np.atleast_2d(cpt_min.topandas()) + cpt_max = np.atleast_2d(cpt_max.topandas()) + + # For each row in the CPT ... + credal_dict = {} + for row in range(cpt_min.shape[0]): - p_1 = [ - (vecs[0][0] - vecs[1][0]) * random_sample() + vecs[1][0] - for _, _, vecs in params - ] - p = [[x, 1 - x] for x in p_1] + # ... sample `n_bns` points from the credal set + credal_dict[row] = sample_from_cset(cpt_min[row, :], cpt_max[row, :], n_bns) + + # For each sample ... + cpt_samples = [] + for i in range(n_bns): + + # ... build the CPT + cpt = [] + for row in range(cpt_min.shape[0]): + cpt.append(credal_dict[row][i]) + + cpt = np.array(cpt).flatten() + cpt_samples.append(cpt) # Debug - # assert(np.sum(np.array(p), axis=1).all() == 1.) + safe_assert(cpt_min.shape == cpt_max.shape) + safe_assert(len(credal_dict) == cpt_min.shape[0]) + safe_assert(len(cpt_samples) == n_bns) - yield p + return cpt_samples -# Given a parsed CN called `params`, sample a vertex of the credal set -def sample_outside(params): +# Sample from a credal set K(x | pi_x), i.e., a constrained polytope. 
+def sample_from_cset(vec_min, vec_max, n_bns) -> list: + """ + We assume a credal set is a polytope in a space of #X parameters, defined by a: + - Multi-dimensional rectangle, i.e., inequality constraint Ax <= b, and + - Hyperplane (provided all the variables sum up to 1), i.e., equality constraint A_eq x = b_eq. + This is true if the CN has been learnt by local IDM, for instance. + """ - yield random_product(*[vecs for _, _, vecs in params]) + # Define the rectangle + n_par = len(vec_min) + A = np.concatenate((np.eye(n_par), -np.eye(n_par)), axis=0) + b = np.concatenate((vec_max, -vec_min)) + rectangle = hopsy.Problem(A=A, b=b) + # Define the hyperplane + A_eq = np.array([np.ones(n_par)]) + b_eq = np.array([1.0]) -# Check BNs sampled from a CN -def are_all_bns_different(bn_vec) -> None: + # Define the polytope as a constrained rectangle (i.e., get the H-representation of the credal set) + constrained_rectangle = hopsy.add_equality_constraints( + rectangle, A_eq=A_eq, b_eq=b_eq + ) - signatures = set() - for bn in bn_vec: - cpt_data = [] - for var in bn.names(): - cpt = bn.cpt(var) - flat = [f"{v:.8f}" for v in cpt.toarray().flatten()] - cpt_data.append(f"{var}:" + ",".join(flat)) - sig = "|".join(cpt_data) - signatures.add(sig) + # Sample from the polytope + mc = hopsy.MarkovChain(constrained_rectangle) + rng = hopsy.RandomNumberGenerator(42) + _, constrained_samples = hopsy.sample(mc, rng, n_bns, thinning=10) + constrained_samples = constrained_samples[0] - print(f"({len(signatures)}/{len(bn_vec)} different BNs.)") + # Debug + safe_assert(np.all(vec_min <= vec_max)) + safe_assert(n_par == len(vec_max)) + safe_assert(n_par == A.shape[1]) + safe_assert(n_par == A_eq.shape[1]) + safe_assert(len(constrained_samples) == n_bns) + for i in constrained_samples: + safe_assert(len(i) == n_par) + return constrained_samples -# Add counts of events to a BN -def add_counts_to_bn(bn, data): - for node in bn.names(): - var = bn.variable(node) - parents = bn.parents(node) - 
parent_names = [bn.variable(p).name() for p in parents] +# Check the consistency of a BN as sampled from a CN +def check_consistency(bn, bn_min, bn_max) -> bool: - shape = [bn.variable(p).domainSize() for p in parents] + [var.domainSize()] - counts_array = np.zeros(shape, dtype=float) # float, not int + for var in bn.names(): + bn_cpt = np.atleast_2d(bn.cpt(var).topandas()) + bn_min_cpt = np.atleast_2d(bn_min.cpt(var).topandas()) + bn_max_cpt = np.atleast_2d(bn_max.cpt(var).topandas()) - for _, row in data.iterrows(): - try: - key = tuple([int(row[p]) for p in parent_names] + [int(row[node])]) - counts_array[key] += 1.0 - except KeyError: - continue + # Check if probabilities sum to 1 + sum_vec = np.sum(bn_cpt, axis=1) + probability_consistency = np.all(np.abs(sum_vec - 1) < 1e-5) - bn.cpt(node).fillWith(counts_array.flatten().tolist()) + # Check if the BN CPT is >= min CPT + min_consistency = np.all(bn_cpt >= bn_min_cpt) + # Check if the BN CPT is <= max CPT + max_consistency = np.all(bn_cpt <= bn_max_cpt) -# Compact a dictionary to be printable -def compact_dict(d): - new_dict = {} - for k, v in d.items(): - if isinstance(v, np.ndarray): - new_dict[k] = ( - f"np.ndarray: [{v[0]:.2g}, {v[1]:.2g}, ..., {v[-1]:.2g}], length={len(v)}" - ) - else: - new_dict[k] = v - return new_dict + consistency = probability_consistency and min_consistency and max_consistency + if consistency: + continue + else: + print("probability_consistency: ", probability_consistency) + print("min_consistency: ", min_consistency) + print("max_consistency: ", max_consistency) + return False -# Create noisy BN by adding Laplacian noise (Zhang et al., 2017) -def get_noisy_bn(bn, scale: float): + return True - bn_ie = gum.LazyPropagation(bn) - bn_ie.makeInference() - bn_noisy = gum.BayesNet(bn) +# Check BNs sampled from a CN +def are_all_bns_different(bn_vec) -> bool: - # For each node X ... 
- for node in bn.names(): + signatures = set() + for bn in bn_vec: + cpt_data = [] + for var in bn.names(): + cpt = bn.cpt(var) + flat = [f"{v:.8f}" for v in cpt.toarray().flatten()] + cpt_data.append(f"{var}:" + ",".join(flat)) + sig = "|".join(cpt_data) + signatures.add(sig) - # Get the joint P(X, Pa(X)) - joint = bn_ie.jointPosterior(bn.family(node)) + print(f"({len(signatures)}/{len(bn_vec)} different BNs.)") - # Add noise to P(X, Pa(X)) and normalize - noise = np.random.laplace(scale=scale, size=np.prod(joint.shape)) - noisy_joint = np.clip( - joint.toarray().flatten() + noise, a_min=10e-10, a_max=None - ) - noisy_joint = noisy_joint / np.sum(noisy_joint) - joint.fillWith(noisy_joint) + return len(signatures) == len(bn_vec) - # Compute the conditional P(X | Pa(X)) - cond = joint / joint.sumOut(node) - # Fill noisy BN - bn_noisy.cpt(node).fillWith(cond) +# Extract BN min and BN max from a CN +def get_min_max_bns(cn, exp: str): - # Check noisy bn - bn_noisy.check() # OK if = (). + with TemporaryDirectory() as tmp_path: + cn.saveBNsMinMax(f"{tmp_path}/bn_min_{exp}.bif", f"{tmp_path}/bn_max_{exp}.bif") + bn_min = gum.loadBN(f"{tmp_path}/bn_min_{exp}.bif") + bn_max = gum.loadBN(f"{tmp_path}/bn_max_{exp}.bif") - return bn_noisy + return bn_min, bn_max diff --git a/test/cn_privacy/config.yaml b/test/cn_privacy/config.yaml index 0124b59..1144679 100644 --- a/test/cn_privacy/config.yaml +++ b/test/cn_privacy/config.yaml @@ -1,27 +1,27 @@ ## Configuration file # Paths -base_path: test/cn_privacy/output # Base path for output +cur_dir: test/cn_privacy # Current directory (contains all the following) bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs data_path: data # Where to save data as generated from ground-truth BNs -results_path: results # Where to save the experiment results -meta_file: exp_meta.txt # 
File of metadata +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments # Models -n_nodes_vec: '[10, 15]' # List of models' number of nodes +n_nodes_vec: '[4, 5]' # List of models' number of nodes edge_ratio_vec: '[1, 1.5]' # List of models' edge ratio +n_modmax: 2 # Maximum number of variables categories # Data -gpop_ss: 1000 # Sample size of general population +gpop_ss: 50 # Sample size of general population rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 2 # Number of data samples # MIA -n_samples: 5 # Number of data samples -n_bns: 10 # Number of BNs to sample within the CN -error: 'np.logspace(-4, 0, 10, endpoint=False)' # Type-I errors vector -ess_vec: '[1, 1000]' # List of ESS +error: 'np.logspace(-4, 0, 5, endpoint=False)' # Type-I errors vector # Other -seed: 42 # Global seed num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization diff --git a/test/cn_privacy/test_integration.py b/test/cn_privacy/test_integration.py index 4dca807..fe165d2 100644 --- a/test/cn_privacy/test_integration.py +++ b/test/cn_privacy/test_integration.py @@ -1,15 +1,83 @@ -from src.config import get_config -from src.data import generate_randombn -from src.run_exp import run_cn_privacy +import sys +from experiments.cn_privacy import exp, generate -def test_integration(): - # Load config - config = get_config("test/cn_privacy/config.yaml") +def test_generation(): - # Generate BNs and data - generate_randombn(config) + # Generate models and data + generate.main() + + +def test_def_ran_atk_mle(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_mle", "n_bns=5"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_mle(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_idm", "ess=1", 
"atk_mec=atk_mle", "n_bns=5"] + ) + + # Run experiment + exp.main() + + +def test_def_ran_atk_cen(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_cen"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_cen(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_cen"]) + + # Run experiment + exp.main() + + +def test_def_ran_atk_ran(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_ran"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_ran(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_ran"]) + + # Run experiment + exp.main() + + +def test_def_ran_atk_ent(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_ent"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_ent(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_ent"]) # Run experiment - run_cn_privacy(config) + exp.main() diff --git a/test/cn_vs_noisybn/config.yaml b/test/cn_vs_noisybn/config.yaml index beb5d12..19cf415 100644 --- a/test/cn_vs_noisybn/config.yaml +++ b/test/cn_vs_noisybn/config.yaml @@ -1,38 +1,45 @@ ## Configuration file # Paths -base_path: test/cn_vs_noisybn/output # Base path for output +cur_dir: test/cn_vs_noisybn # Current directory (contains all the following) bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs data_path: data # Where to save data as generated from ground-truth BNs -results_path: results # Where to save the experiment results -meta_file: exp_meta.txt # File of metadata +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments 
+auc_meta: output/results/auc_meta.csv # File of metadata for AUCs # Models (Naive Bayes) target_var: 'T' # Target variable -n_nodes: 10 # Number of nodes for each BN model -n_models: 5 # Number of models to evaluate +n_nodes: 5 # Number of nodes for each BN model +n_modmax: 2 # Maximum number of categories for covariates +n_models: 2 # Number of models to evaluate # Data -gpop_ss: 1000 # Sample size of general population +gpop_ss: 50 # Sample size of general population rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 2 # Number of data samples # MIA -n_samples: 5 # Number of data samples -n_bns: 10 # Number of BNs to sample within the CN -tol: 0.02 # To find eps s.t. |AUC(eps) - AUC(CN)| < tol -error: 'np.logspace(-4, 0, 10, endpoint=False)' # Type-I errors vector -ess_dict: # Eps list to evaluate for each ess - 1: 'np.arange(0.1, 10, 0.5)' - 10: 'np.arange(0.1, 10, 0.5)' - 20: 'np.arange(0.05, 5, 0.1)' - 30: 'np.arange(1e-3, 1, 5e-3)' - 40: 'np.arange(5e-6, 1e-2, 1e-5)' - 50: 'np.arange(5e-7, 5e-4, 1e-6)' +error: 'np.logspace(-4, 0, 5, endpoint=False)' # Type-I errors vector + +# Noisy BN +tol: 0.05 # To find eps s.t. 
|AUC(eps) - AUC(CN)| < tol +eps_vec: 'np.logspace(-8, 2, num=10)' # Epsilon to consider for noisy BN # Inferences -n_infer: 10 # Number of inferences to perform +n_infer: 2 # Number of inferences to perform # Other -seed: 42 # Global seed num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization + +## Notes +# 1) Suggested pairs (ess: eps_vec) for n_nodes=10: +# - 1 : 'np.arange(0.1, 10, 0.1)' +# - 10: 'np.arange(0.1, 10, 0.1)' +# - 20: 'np.arange(0.05, 5, 0.05)' +# - 30: 'np.arange(1e-3, 1, 1e-3)' +# - 40: 'np.arange(5e-6, 1e-2, 5e-6)' +# - 50: 'np.arange(5e-7, 5e-4, 5e-7)' diff --git a/test/cn_vs_noisybn/test_integration.py b/test/cn_vs_noisybn/test_integration.py index 1ef1eea..abfb7c9 100644 --- a/test/cn_vs_noisybn/test_integration.py +++ b/test/cn_vs_noisybn/test_integration.py @@ -1,15 +1,83 @@ -from src.config import get_config -from src.data import generate_naivebayes -from src.run_exp import run_cn_vs_noisybn +import sys +from experiments.cn_vs_noisybn import exp, generate -def test_integration(): - # Load config - config = get_config("test/cn_vs_noisybn/config.yaml") +def test_generation(): - # Generate BNs and data - generate_naivebayes(config) + # Generate models and data + generate.main() + + +def test_def_ran_atk_mle(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_mle", "n_bns=5"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_mle(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_mle", "n_bns=5"] + ) + + # Run experiment + exp.main() + + +def test_def_ran_atk_cen(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_cen"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_cen(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_cen"]) + + # Run experiment + exp.main() + + +def 
test_def_ran_atk_ran(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_ran"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_ran(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_ran"]) + + # Run experiment + exp.main() + + +def test_def_ran_atk_ent(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_ent"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_ent(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_ent"]) # Run experiment - run_cn_vs_noisybn(config) + exp.main() diff --git a/test/conftest.py b/test/conftest.py new file mode 100644 index 0000000..8948023 --- /dev/null +++ b/test/conftest.py @@ -0,0 +1,8 @@ +import os + +import pytest + + +@pytest.fixture(scope="session", autouse=True) +def enable_test_config(): + os.environ["USE_TEST_CONFIG"] = "1" diff --git a/test/unit/__init__.py b/test/unit/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/test/unit/utils.py b/test/unit/utils.py new file mode 100644 index 0000000..0949ca5 --- /dev/null +++ b/test/unit/utils.py @@ -0,0 +1,12 @@ +import numpy as np + +from src.utils import maxent_cset + + +def test_maxent_cset(): + vec_min = np.array([0.3, 0.4, 0, 0.1]) + vec_max = np.array([0.6, 0.8, 0.12, 0.17]) + + out = maxent_cset(vec_min, vec_max) + + assert np.allclose(out, np.array([0.31, 0.4, 0.12, 0.17]))
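
The unit test above pins `maxent_cset` to a single expected vector. As a complementary sanity check, the sketch below verifies the properties that vector should satisfy, without calling the repository code: it sums to 1, it lies within the `[vec_min, vec_max]` box, and it has higher entropy than another feasible point. The helpers `in_credal_set` and `entropy` and the comparison point `other` are hypothetical names introduced here for illustration; they are not part of `src/utils.py`, although `in_credal_set` loosely mirrors the checks in `check_consistency`.

```python
import numpy as np


def in_credal_set(p, vec_min, vec_max, tol=1e-9):
    """Return True if p sums to 1 and lies elementwise within [vec_min, vec_max]."""
    p = np.asarray(p, dtype=float)
    sums_to_one = abs(p.sum() - 1.0) < tol
    within_bounds = np.all(p >= vec_min - tol) and np.all(p <= vec_max + tol)
    return bool(sums_to_one and within_bounds)


def entropy(p):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))


# Bounds and expected vector copied from test_maxent_cset
vec_min = np.array([0.3, 0.4, 0.0, 0.1])
vec_max = np.array([0.6, 0.8, 0.12, 0.17])
expected = np.array([0.31, 0.4, 0.12, 0.17])

# Another feasible point, chosen by hand for comparison (an assumption of this sketch)
other = np.array([0.30, 0.41, 0.12, 0.17])

assert in_credal_set(expected, vec_min, vec_max)
assert in_credal_set(other, vec_min, vec_max)
assert entropy(expected) > entropy(other)
```

A check of this kind could sit next to `test_maxent_cset` so that a regression in the bounds or in the normalisation is reported separately from a mismatch in the exact values. Note also that pytest's default discovery only collects files named `test_*.py` or `*_test.py`, so a module named `utils.py` may need to be renamed or passed to pytest explicitly for the test to run.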