diff --git a/.dockerignore b/.dockerignore index 2e6e678..4791eba 100644 --- a/.dockerignore +++ b/.dockerignore @@ -1,5 +1,4 @@ venv -output -bin -__pycache__ -.pytest_cache +**/output +**/__pycache__ +**/.pytest_cache diff --git a/.github/workflows/pylint.yml b/.github/workflows/pylint.yml index c15fac9..bb488f9 100644 --- a/.github/workflows/pylint.yml +++ b/.github/workflows/pylint.yml @@ -19,7 +19,6 @@ jobs: run: | python -m pip install --upgrade pip pip install pylint - if [ -f requirements.txt ]; then pip install -r requirements.txt; fi - - name: Analysing the code with pylint + - name: Analyzing code with pylint run: | pylint $(git ls-files '*.py') \ No newline at end of file diff --git a/.github/workflows/python-app.yml b/.github/workflows/python-app.yml index 9516be8..8312b7a 100644 --- a/.github/workflows/python-app.yml +++ b/.github/workflows/python-app.yml @@ -17,6 +17,8 @@ jobs: python-version: ${{ matrix.python-version }} - name: Install dependencies run: | + sudo apt-get update + sudo apt-get install -y build-essential swig libglpk-dev python3-dev libcdd-dev libgmp-dev python -m pip install --upgrade pip pip install flake8 pytest pytest-cov if [ -f requirements.txt ]; then pip install -r requirements.txt; fi diff --git a/.gitignore b/.gitignore index 2e6e678..322a84d 100644 --- a/.gitignore +++ b/.gitignore @@ -1,5 +1,9 @@ venv -output +output* +bns +data +plots +*meta.txt bin __pycache__ .pytest_cache diff --git a/Dockerfile b/Dockerfile index cd0c712..160dec4 100644 --- a/Dockerfile +++ b/Dockerfile @@ -1,8 +1,20 @@ -FROM python:bookworm +FROM python:3.12-slim +# Install system dependencies +RUN apt-get update && apt-get install -y \ + build-essential \ + swig \ + libglpk-dev \ + python3-dev \ + libcdd-dev \ + libgmp-dev \ + && rm -rf /var/lib/apt/lists/* + +# Set working directory WORKDIR /workspace COPY . . 
+# Install python packages RUN pip install --upgrade pip RUN if [ -f requirements.txt ]; then pip install -r requirements.txt; fi diff --git a/README.md b/README.md index 8437bd8..98c7757 100644 --- a/README.md +++ b/README.md @@ -2,104 +2,144 @@ Code for paper ["Towards Privacy-Aware Bayesian Networks: A Credal Approach"](https://doi.org/10.3233/FAIA251419) presented at [ECAI 2025](https://ecai2025.org/). -## Set up Python environment +## Preliminaries -Create and activate a Python virtual environment with: +### Experiments -```bash -python3 -m venv venv -source venv/bin/activate[.fish] # use `.fish` suffix if using fish shell -``` +`` is the name of the experiment to run. Each `` has its own directory, which is named the same way. Each of these contains the experiment logic, configuration file (`config.yaml`), eventually generated models and data, output directory, and a `Plot_results.ipynb` notebook to plot results. -Install all dependencies with: +`` can be one of the following: -```bash -pip install -r requirements.txt -``` +1. `cn_privacy`: run membership inference attack against a Bayesian network (BN), its related credal network (CN), and compute the theoretical privacy estimate of BN. -Upgrade all Python packages with: +2. `cn_vs_noisybn`: compare two privacy techniques, namely the CN and a noisy version of BN. All models are naive Bayes with target variable T. First, the CN and noisy BN hyperparameters are fine-tuned so that they achieve the same privacy level; then, their accuracy is computed in terms of most probable explanation (MPE) on variable T. -```bash -pip install --upgrade $(pip freeze | cut -d '=' -f 1) -pip freeze > requirements.txt -``` +For additional details, we refer to the paper. -This updates the requirements file with the upgraded packages. +### Attacks and defenses -## Experiments +Each experiment requires the user to specify one defense and one attack mechanisms, plus additional related hyperparameters. 
Below, the mechanisms and hyperparameters names are reported. -`` is the name of the experiment to run. It can be one of the following. +Implemented defenses: +- `def_idm`. Requires: `ess`. +- `def_ran`. Requires: `delta`. -1. `cn_privacy`: run membership inference attack against a Bayesian network (BN), its related credal network (CN), and compute the theoretical privacy estimate of BN. The pipeline and results are described in the paper. +Implemented attacks: +- `atk_mle`. Requires: `n_bns`. +- `atk_cen`. +- `atk_ran`. +- `atk_ent`. -2. `cn_vs_noisybn`: additional experiment, not reported in the paper. It compares two privacy techniques, namely the CN and a noisy version of BN. All models are naive Bayes with target variable T. First, the CN and noisy BN hyperparameters are fine-tuned so that they achieve the same privacy level; then, their accuracy is computed in terms of most probable explanation (MPE) on variable T. +## Running code -## Run code +### Using Docker (recommended) -### With Docker (recommended) +The `compose.yaml` file contains a set of pre-set experiments. Additional ones can also be specified. The `generate_compose.py` file helps in generating them automatically. -1. Build the Docker image: +Generate models and data for all experiments (controlled by `config.yaml`): -```bash -docker build . -t bnp:2025 +```sh +python -m experiments.cn_privacy.generate +python -m experiments.cn_vs_noisybn.generate ``` -2. Run the experiment: +Run one or more experiments with: -```bash -docker run [-d] [--rm] -v bnp:/workspace bnp:2025 python -m experiments..main +```sh +docker compose up [service name] ``` -3. Results available at: +Results will be available under `experiments//output_*`. -`/var/lib/docker/volumes/bnp/_data/experiments//output/`. +To check the status, run one or more of the following: -### Without Docker +```sh +docker compose ps +docker compose logs [service name] +docker stats +``` -1. 
Run the experiment: +### Local computation -```bash -python -m experiments..main +Create and activate a Python virtual environment: + +```sh +python3 -m venv venv +source venv/bin/activate[.fish] # use `.fish` suffix if using fish shell ``` -2. Results available at: +Install dependencies: -`experiments//output/`. +```sh +pip install -r requirements.txt +``` -## Test code +*Notice*: if some package is missing locally, see the `Dockerfile` for additional packages to be installed (names refer to Ubuntu/Debian). -Run tests with: +Upgrade dependencies: -```bash -pytest [--cov=src] [--cov-report=term-missing] [--capture=no] +```sh +pip install --upgrade $(pip freeze | cut -d '=' -f 1) +pip freeze > requirements.txt ``` -Test results are available at: +*Notice:* each of the following command will overwrite any related output. -`test//output/`. +Generate models and data (controlled by `config.yaml`): + +```sh +python -m experiments..generate +``` + +Run an experiment: + +```sh +python -m experiments..exp def_mec= [param=value] atk_mec= [param=value] +``` + +Results will be available under `experiments//output`. + + +## Testing code + +Run integration tests: + +```sh +pytest [--cov=src] [--cov-report=term-missing] [--capture=no] +``` ## Formatting and linting Format code by running: -```bash +```sh black . isort . ``` Lint code by running: -```bash +```sh flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics --exclude=venv -flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics --exclude=venv +flake8 . 
--count --exit-zero --max-complexity=10 --ignore=E203 --max-line-length=140 --statistics --exclude=venv ``` Analyze code by running: -```bash +```sh pylint $(git ls-files '*.py') ``` -## Plot results +## Running actions locally + +Install `act`: + +```sh +curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash +``` + +Run `act` with: -Use the `Plot_results.ipynb` notebook available for each experiment. Plots will be saved at: `experiments//output/plots`. +```sh +sudo ./bin/act [-W ] +``` diff --git a/compose.yaml b/compose.yaml new file mode 100644 index 0000000..66d6217 --- /dev/null +++ b/compose.yaml @@ -0,0 +1,86 @@ +version: '3.9' +services: + cn_vs_noisybn_def_idm_atk_ran_ess1: + build: . + command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_idm + - atk_mec=atk_ran + - ess=1 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_idm_atk_ran_ess1:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_idm_atk_ran_ess100: + build: . + command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_idm + - atk_mec=atk_ran + - ess=100 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_idm_atk_ran_ess100:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_idm_atk_ran_ess50: + build: . 
+ command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_idm + - atk_mec=atk_ran + - ess=50 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_idm_atk_ran_ess50:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_ran_atk_ran_delta0.001: + build: . + command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_ran + - atk_mec=atk_ran + - delta=0.001 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_ran_atk_ran_delta0.001:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_ran_atk_ran_delta0.05: + build: . + command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_ran + - atk_mec=atk_ran + - delta=0.05 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_ran_atk_ran_delta0.05:/workspace/experiments/cn_vs_noisybn/output + cn_vs_noisybn_def_ran_atk_ran_delta0.1: + build: . 
+ command: + - python + - -m + - experiments.cn_vs_noisybn.exp + - def_mec=def_ran + - atk_mec=atk_ran + - delta=0.1 + image: bnp:2025 + volumes: + - ./experiments/cn_vs_noisybn/bns:/workspace/experiments/cn_vs_noisybn/bns + - ./experiments/cn_vs_noisybn/data:/workspace/experiments/cn_vs_noisybn/data + - ./experiments/cn_vs_noisybn/output_def_ran_atk_ran_delta0.1:/workspace/experiments/cn_vs_noisybn/output diff --git a/experiments/__init__.py b/experiments/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/cn_privacy/Plot_results.ipynb b/experiments/cn_privacy/Plot_results.ipynb index 6492fc0..9c157ed 100644 --- a/experiments/cn_privacy/Plot_results.ipynb +++ b/experiments/cn_privacy/Plot_results.ipynb @@ -15,9 +15,12 @@ "import numpy as np\n", "import sys\n", "from pathlib import Path\n", + "import re\n", + "import ast\n", + "from itertools import cycle, product\n", "\n", "sys.path.insert(0, str(Path().resolve().parents[1]))\n", - "from src.config import *" + "from src.config import * # noqa" ] }, { @@ -28,13 +31,31 @@ "outputs": [], "source": [ "# Choose config file\n", - "config = get_config(\"config.yaml\")\n", + "config = load_config(\"cn_privacy\")\n", "\n", - "# Get results path\n", - "res_path = get_base_path(config) / config[\"results_path\"]\n", + "# Choose what to plot\n", + "folder = \"cn_privacy_20251127_cat_ln\"\n", + "params = dict()\n", + "params[\"def_mec\"] = [\"def_idm\", \"def_ran\"]\n", + "params[\"atk_mec\"] = [\"atk_mle\"]\n", + "params[\"ess\"] = [1]\n", + "params[\"delta\"] = [1.0]\n", + "cur_dir = get_cur_dir(config)\n", + "\n", + "# Get results paths\n", + "res_path = {}\n", + "for def_mec, atk_mec in product(params[\"def_mec\"], params[\"atk_mec\"]):\n", + " arg_str = \"ess\" if def_mec == \"def_idm\" else \"delta\"\n", + " arg_vals = [x for x in params[arg_str]]\n", + " for arg_val in arg_vals:\n", + " res_path[(def_mec, atk_mec, arg_str, arg_val)] = (\n", + " cur_dir\n", + " / folder\n", + " / 
f\"output_{def_mec}_{atk_mec}_{f'{arg_str}{arg_val}'}/results\"\n", + " )\n", "\n", "# Choose where to save plots\n", - "plots_path = get_base_path(config) / \"plots\"\n", + "plots_path = cur_dir / \"plots\"\n", "create_clean_dir(plots_path)" ] }, @@ -65,9 +86,12 @@ ")\n", "\n", "# Colors\n", - "bound_color = \"#ff441a\" # red\n", - "BN_color = \"#3934fe\" # blue\n", - "CN_color = \"#28bd6b\" # green\n", + "palette_cn = dict(\n", + " zip(res_path.keys(), sns.color_palette(palette=\"viridis\", n_colors=len(res_path)))\n", + ")\n", + "palette_bn = sns.color_palette(palette=\"afmhot\")\n", + "bound_color = palette_bn[3]\n", + "BN_color = palette_bn[2]\n", "alpha = 0.2" ] }, @@ -78,33 +102,24 @@ "metadata": {}, "outputs": [], "source": [ - "# Plot BN and bound (semilogx)\n", - "def plot_bn_bound(exp: str, ax):\n", + "# Plot BN (semilogx)\n", + "def plot_bn(path: Path, exp: str, ax, fill: bool = True):\n", "\n", " # Import results\n", - " results = os.listdir(res_path)\n", - " r_path = [r for r in results if f\"{exp}-ess1.csv\" in r][0]\n", - " result = pd.read_csv(f\"{res_path}/{r_path}\")\n", - " error = result[\"error\"]\n", - " bound = result[\"power_bound\"]\n", - " bn_cols = [c for c in result.columns if \"BN\" in c]\n", - " bn_mean = result.loc[:, bn_cols].mean(axis=1)\n", - " bn_max = result.loc[:, bn_cols].max(axis=1)\n", + " files = os.listdir(path / \"bns\")\n", + " r_path = [r for r in files if f\"{exp}\" in r][0]\n", + " res = pd.read_csv(f\"{path}/bns/{r_path}\")\n", + " error = res[\"error\"]\n", "\n", - " # Plot bound\n", - " ax.semilogx(\n", - " error,\n", - " bound,\n", - " \"^\",\n", - " color=bound_color,\n", - " label=\"Theoretical estimate\",\n", - " markersize=4,\n", - " zorder=4,\n", - " )\n", + " # Select what to plot\n", + " bn_cols = [c for c in res.columns if \"BN\" in c]\n", + " bn_mean = res.loc[:, bn_cols].mean(axis=1)\n", + " bn_max = res.loc[:, bn_cols].max(axis=1)\n", "\n", " # Plot BN (avg-max)\n", - " ax.fill_between(error, bn_mean, 
bn_max, color=BN_color, alpha=alpha, zorder=2)\n", - " ax.semilogx(error, bn_mean, \"-\", color=BN_color, label=\"BN\", zorder=3)\n", + " (line,) = ax.semilogx(error, bn_mean, \"-\", color=BN_color, label=\"BN\", zorder=3)\n", + " if fill:\n", + " ax.fill_between(error, bn_mean, bn_max, color=BN_color, alpha=alpha, zorder=2)\n", "\n", " # Title and axes\n", " ax.set_xlabel(\"Error\")\n", @@ -121,38 +136,59 @@ " True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1\n", " )\n", "\n", + " return line\n", + "\n", "\n", - "# Plot CN (for a given ess)\n", - "def plot_cn(exp, ax, ess: int, color: str, type: str):\n", + "# Plot bound (semilogx)\n", + "def plot_bound(path: Path, exp: str, ax):\n", "\n", " # Import results\n", - " results = os.listdir(res_path)\n", - " r_path = [r for r in results if f\"{exp}-ess{ess}.csv\" in r][0]\n", - " result = pd.read_csv(f\"{res_path}/{r_path}\")\n", - " error = result[\"error\"]\n", - " cn_cols = [c for c in result.columns if \"CN\" in c]\n", - " cn_mean = result.loc[:, cn_cols].mean(axis=1)\n", - " cn_max = result.loc[:, cn_cols].max(axis=1)\n", + " files = os.listdir(path / \"bns\")\n", + " r_path = [r for r in files if f\"{exp}\" in r][0]\n", + " res = pd.read_csv(f\"{path}/bns/{r_path}\")\n", + " bound = res[\"power_bound\"]\n", + " error = res[\"error\"]\n", + "\n", + " # Plot bound\n", + " (line,) = ax.semilogx(\n", + " error,\n", + " bound,\n", + " \"^\",\n", + " color=bound_color,\n", + " markersize=3,\n", + " zorder=4,\n", + " label=\"Theoretical estimate\",\n", + " mec=None,\n", + " )\n", + "\n", + " return line\n", + "\n", + "\n", + "# Plot CN\n", + "def plot_cn(path: Path, color, exp, ax, type: str, fill: bool = True):\n", + "\n", + " # Import results\n", + " files = os.listdir(path / \"cns\")\n", + " r_path = [r for r in files if f\"{exp}\" in r][0]\n", + " res = pd.read_csv(f\"{path}/cns/{r_path}\")\n", + " error = res[\"error\"]\n", + "\n", + " # Select what to plot\n", + " cn_cols = [c for 
c in res.columns if \"CN\" in c]\n", + " cn_mean = res.loc[:, cn_cols].mean(axis=1)\n", + " cn_max = res.loc[:, cn_cols].max(axis=1)\n", "\n", " # Plot CN (avg-max)\n", - " ax.fill_between(error, cn_mean, cn_max, color=color, alpha=alpha, zorder=2)\n", - " ax.semilogx(error, cn_mean, type, color=color, label=f\"CN, $S={ess}$\", zorder=3)\n", - "\n", - " # Legend\n", - " if exp == \"exp0\":\n", - " ax.legend(\n", - " loc=\"best\",\n", - " frameon=True,\n", - " fancybox=False,\n", - " framealpha=1,\n", - " facecolor=\"#e6e6e6\",\n", - " edgecolor=\"#8c8c8c\",\n", - " )\n", + " (line,) = ax.semilogx(error, cn_mean, type, color=color, zorder=3)\n", + " if fill:\n", + " ax.fill_between(error, cn_mean, cn_max, color=color, alpha=alpha, zorder=2)\n", + "\n", + " return line\n", "\n", "\n", "# Plot title function\n", "def get_title(exp: str):\n", - " with open(f\"{res_path}/exp_meta.txt\", \"r\") as meta:\n", + " with open(f\"{cur_dir}/exp_meta.txt\", \"r\") as meta:\n", " for row in meta:\n", " if exp in row:\n", " pieces = row.split()\n", @@ -169,33 +205,52 @@ "metadata": {}, "outputs": [], "source": [ + "# Names of experiments\n", + "from natsort import natsorted\n", + "\n", + "exp_names = natsorted(\n", + " [re.findall(\"(\\w+\\d+)\\.csv\", r)[0] for r in os.listdir(cur_dir / \"data\")]\n", + ")\n", + "\n", "# Layout 4x3\n", - "fig, axes = plt.subplots(4, 3, figsize=(10, 9.5))\n", - "# fig.suptitle(\"Power vs Error\", fontsize=18)\n", - "\n", - "exps = [\n", - " \"exp0\",\n", - " \"exp1\",\n", - " \"exp2\",\n", - " \"exp3\",\n", - " \"exp4\",\n", - " \"exp5\",\n", - " \"exp6\",\n", - " \"exp7\",\n", - " \"exp8\",\n", - " \"exp9\",\n", - " \"exp10\",\n", - " \"exp11\",\n", - "]\n", - "ess = [1, 1000]\n", - "\n", - "# Loop over subplots\n", - "for i, ax in enumerate(axes.flat):\n", - " plot_bn_bound(f\"{exps[i]}\", ax)\n", - " plot_cn(f\"{exps[i]}\", ax, ess=ess[0], color=CN_color, type=\"-\")\n", - " plot_cn(f\"{exps[i]}\", ax, ess=ess[1], color=CN_color, 
type=\"--\")\n", - " (n, e, c) = get_title(exps[i])\n", - " ax.set_title(f\"Nodes: {n}, Edges: {e}, Complexity: {c}\")\n", + "fig, axes = plt.subplots(len(exp_names) // 3 + 1, 3, figsize=(9, 8))\n", + "fig.suptitle(f\"Power vs Error\", fontsize=15)\n", + "\n", + "# Loop over results\n", + "for i, exp in enumerate(exp_names):\n", + "\n", + " # Plot BN & bound\n", + " ax = axes.flat[i]\n", + " path_bn = list(res_path.values())[0]\n", + " bn = plot_bn(path_bn, exp, ax)\n", + " plot_bound(path_bn, exp, ax)\n", + "\n", + " # Plot CNs\n", + " for res in res_path:\n", + " (def_mec, atk_mec, arg_str, arg_val) = res\n", + " path = res_path[res]\n", + "\n", + " cn = plot_cn(path, palette_cn[res], exp, ax, type=\"-\", fill=True)\n", + "\n", + " # Legend\n", + " cn.set_label(f\"CN, {def_mec}: {arg_str}={arg_val}, {atk_mec}\")\n", + " if i == 0:\n", + " ax.legend(\n", + " loc=\"best\",\n", + " frameon=True,\n", + " fancybox=False,\n", + " framealpha=1,\n", + " facecolor=\"#e6e6e6\",\n", + " edgecolor=\"#8c8c8c\",\n", + " )\n", + "\n", + " # Title\n", + " (n, e, c) = get_title(exp_names[i])\n", + " ax.set_title(f\"$N$: {n}, $E$: {e}, $C(\\mathcal G)$: {c}\")\n", + "\n", + " # # Log Y scale\n", + " # ax.set_yscale('log')\n", + " # ax.set_ylim(0.9*cn.get_ydata().min(), 1.1*bn.get_ydata().max())\n", "\n", "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", "plt.show()\n", @@ -203,6 +258,40 @@ " f\"{plots_path}/results_logx.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False\n", ")" ] + }, + { + "cell_type": "markdown", + "id": "699bb07e", + "metadata": {}, + "source": [ + "def_idm:\n", + " * ess1 is more private than ess1000 (weird!), but only in huge nets\n", + " * atk_mle and atk_cen behave similarly, no real differences for each ess\n", + " * atk_ran only slighty more private only in huge nets, for each ess. 
With ess1000 it is more visible\n", + "\n", + "def_ran:\n", + " * Privacy increases with delta (expected), for each atk\n", + " * atk_mle and atk_cen behave similarly, no real differences for each delta\n", + " * atk_ran is more private, but not that much\n", + " * delta0.1 is similar to BN, while delta1.0 still leaks info, for each atk\n", + " * delta1.0 is less private than def_idm with ess1, for each atk, in huge nets (weird!)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fc764066", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "65698b92", + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { diff --git a/experiments/cn_privacy/__init__.py b/experiments/cn_privacy/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/cn_privacy/config.yaml b/experiments/cn_privacy/config.yaml index af7e1ef..ff72094 100644 --- a/experiments/cn_privacy/config.yaml +++ b/experiments/cn_privacy/config.yaml @@ -1,28 +1,27 @@ ## Configuration file # Paths -base_path: experiments/cn_privacy/output # Base path for output +cur_dir: experiments/cn_privacy # Current directory (contains all the following) bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs data_path: data # Where to save data as generated from ground-truth BNs -results_path: results # Where to save the experiment results -meta_file: exp_meta.txt # File of metadata +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments # Models -n_nodes_vec: '[10, 20, 50, 100]' # List of models' number of nodes -edge_ratio_vec: '[1, 2, 4]' # List of models' edge ratio +n_nodes_vec: '[10, 15]' # List of models' number of nodes +edge_ratio_vec: '[1, 1.5]' # List of models' 
edge ratio n_modmax: 2 # Maximum number of variables categories # Data -gpop_ss: 10000 # Sample size of general population +gpop_ss: 500 # Sample size of general population rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 5 # Number of data samples # MIA -n_samples: 20 # Number of data samples -n_bns: 500 # Number of BNs to sample within the CN -error: 'np.logspace(-4, 0, 20, endpoint=False)' # Type-I errors vector -ess_vec: '[1, 10, 50, 100, 1000]' # List of ESS +error: 'np.logspace(-4, 0, 10, endpoint=False)' # Type-I errors vector # Other -seed: 42 # Global seed num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization diff --git a/experiments/cn_privacy/config_BAK.yaml b/experiments/cn_privacy/config_BAK.yaml new file mode 100644 index 0000000..9251cec --- /dev/null +++ b/experiments/cn_privacy/config_BAK.yaml @@ -0,0 +1,28 @@ +## Configuration file + +# Paths + +cur_dir: experiments/cn_privacy # Current directory (contains all the following) +bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs +data_path: data # Where to save data as generated from ground-truth BNs +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments + +# Models +n_nodes_vec: '[10, 20, 50, 100]' # List of models' number of nodes +edge_ratio_vec: '[1, 2, 4]' # List of models' edge ratio +n_modmax: 2 # Maximum number of variables categories + +# Data +gpop_ss: 10000 # Sample size of general population +rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop +pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 20 # Number of data samples + +# MIA +error: 'np.logspace(-4, 0, 
20, endpoint=False)' # Type-I errors vector + +# Other +num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization diff --git a/experiments/cn_privacy/exp.py b/experiments/cn_privacy/exp.py new file mode 100644 index 0000000..0a0349a --- /dev/null +++ b/experiments/cn_privacy/exp.py @@ -0,0 +1,63 @@ +import gc +import multiprocessing # noqa: F401 # pylint: disable=unused-import +import sys + +import numpy as np # noqa: F401 # pylint: disable=unused-import +from joblib import Parallel, delayed + +from src.attack import attack_mechanism +from src.config import create_clean_dir, get_cur_dir, load_config, map_sys_args +from src.defense import defense_mechanism +from src.mia import mia_vs_bn, mia_vs_cn, theoretical_power + + +def main(): + + # Init configs + config = load_config("cn_privacy") + cur_dir = get_cur_dir(config) + create_clean_dir(cur_dir / "output") + num_cores = eval(config["num_cores"]) + + # Get command-line hyperparameters + def_mec, def_args, atk_mec, atk_args = map_sys_args(sys.argv, config) + + # Init the vectors of experiments + exp_vec = [f.stem for f in (cur_dir / config["data_path"]).iterdir() if f.is_file()] + + # Defense mechanism + print("## Defense mechanism: [", def_mec, def_args, "] ##", flush=True) + create_clean_dir(cur_dir / config["cns_path"]) + _ = Parallel(n_jobs=num_cores)( + delayed(defense_mechanism)(exp, config, def_mec, def_args) for exp in exp_vec + ) + + # Attack mechanism + print("## Attack mechanism: [", atk_mec, atk_args, "] ##", flush=True) + create_clean_dir(cur_dir / config["atk_path"]) + _ = Parallel(n_jobs=num_cores)( + delayed(attack_mechanism)(exp, config, atk_mec, atk_args) for exp in exp_vec + ) + + # MIA vs CN + print("## MIA vs CN ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "cns") + _ = Parallel(n_jobs=num_cores)(delayed(mia_vs_cn)(exp, config) for exp in exp_vec) + + # MIA vs BN + print("## MIA vs BN ##", flush=True) + create_clean_dir(cur_dir / 
config["results_path"] / "bns") + _ = Parallel(n_jobs=num_cores)(delayed(mia_vs_bn)(exp, config) for exp in exp_vec) + + # Compute theoretical power + print("## Compute theoretical power ##", flush=True) + _ = Parallel(n_jobs=num_cores)( + delayed(theoretical_power)(exp, config) for exp in exp_vec + ) + + # Clean + gc.collect() + + +if __name__ == "__main__": + main() diff --git a/experiments/cn_privacy/generate.py b/experiments/cn_privacy/generate.py new file mode 100644 index 0000000..bee8241 --- /dev/null +++ b/experiments/cn_privacy/generate.py @@ -0,0 +1,43 @@ +import gc +import multiprocessing # noqa: F401 # pylint: disable=unused-import + +import numpy as np # noqa: F401 # pylint: disable=unused-import +from joblib import Parallel, delayed + +from src.config import create_clean_dir, get_cur_dir, load_config +from src.data import generate_randombn +from src.learning import estimate_bns + + +def main(): + + # Init configs + config = load_config("cn_privacy") + cur_dir = get_cur_dir(config) + num_cores = eval(config["num_cores"]) + + # Generate BNs and data + print("## Generate BNs and data ##") + create_clean_dir(cur_dir / config["bns_path"]) + create_clean_dir(cur_dir / config["bns_path"] / "gt") + create_clean_dir(cur_dir / config["data_path"]) + open(f'{cur_dir}/{config["exp_meta"]}', "w").close() + generate_randombn(config) + + # Init the vectors of experiments + exp_vec = [f.stem for f in (cur_dir / config["data_path"]).iterdir() if f.is_file()] + + # Estimate BNs from rpop and pool + print("## Estimate BNs from rpop and pool ##") + create_clean_dir(cur_dir / config["bns_path"] / "rpop") + create_clean_dir(cur_dir / config["bns_path"] / "pool") + _ = Parallel(n_jobs=num_cores)( + delayed(estimate_bns)(exp, config) for exp in exp_vec + ) + + # Clean + gc.collect() + + +if __name__ == "__main__": + main() diff --git a/experiments/cn_privacy/main.py b/experiments/cn_privacy/main.py deleted file mode 100644 index edc5fb9..0000000 --- 
a/experiments/cn_privacy/main.py +++ /dev/null @@ -1,14 +0,0 @@ -from src.config import get_config -from src.data import generate_randombn -from src.run_exp import run_cn_privacy - -if __name__ == "__main__": - - # Load config - config = get_config("experiments/cn_privacy/config.yaml") - - # Generate BNs and data - generate_randombn(config) - - # Run experiment - run_cn_privacy(config) diff --git a/experiments/cn_vs_noisybn/Plot_results.ipynb b/experiments/cn_vs_noisybn/Plot_results.ipynb index f602c56..29f30f3 100644 --- a/experiments/cn_vs_noisybn/Plot_results.ipynb +++ b/experiments/cn_vs_noisybn/Plot_results.ipynb @@ -1,5 +1,13 @@ { "cells": [ + { + "cell_type": "markdown", + "id": "80ba8653", + "metadata": {}, + "source": [ + "Ensure the output directories are named as `output_<...>_`, where `def_arg` can be `ess` or `delta`, and `def_value` its value." + ] + }, { "cell_type": "code", "execution_count": null, @@ -13,13 +21,16 @@ "import numpy as np\n", "import re\n", "import sys\n", + "import ast\n", + "import seaborn as sns\n", "from pathlib import Path\n", "from natsort import natsorted\n", "from sklearn.metrics import roc_curve\n", "from matplotlib.ticker import LogLocator\n", + "from statsmodels.stats.proportion import proportion_confint\n", "\n", "sys.path.insert(0, str(Path().resolve().parents[1]))\n", - "from src.config import *" + "from src.config import load_config, get_cur_dir, create_clean_dir" ] }, { @@ -30,20 +41,52 @@ "outputs": [], "source": [ "# Choose config file\n", - "config = get_config(\"config.yaml\")\n", + "config = load_config(\"cn_vs_noisybn\")\n", "\n", "# Get results path\n", - "res_path = get_base_path(config) / config[\"results_path\"]\n", + "cur_dir = get_cur_dir(config)\n", + "# res_path = cur_dir / config[\"results_path\"] / \"inferences\"\n", "\n", "# Choose where to save plots\n", - "plots_path = get_base_path(config) / \"plots\"\n", + "plots_path = cur_dir / \"plots\"\n", "create_clean_dir(plots_path)" ] }, { "cell_type": 
"code", "execution_count": null, - "id": "49a48f2f", + "id": "e07e1ceb", + "metadata": {}, + "outputs": [], + "source": [ + "# Names of experiments\n", + "pattern = re.compile(\"output_.*_(ess|delta)(\\d+\\.?\\d*)\")\n", + "exp_names = [re.findall(\"(\\w+\\d+)\\.csv\", r)[0] for r in os.listdir(cur_dir / \"data\")]\n", + "\n", + "# Choose what to plot\n", + "folder = \"cn_vs_noisybn_20251120_ln\"\n", + "def_mec = \"def_ran\"\n", + "atk_mec = \"atk_mle\"\n", + "out_dirs = [\n", + " item\n", + " for item in os.listdir(f\"{cur_dir}/{folder}\")\n", + " if pattern.match(item) and atk_mec in item and def_mec in item\n", + "]\n", + "\n", + "# Get ESS/delta list\n", + "def_arg = pattern.findall(out_dirs[0])[0][0]\n", + "x_values = [pattern.findall(i)[0][1] for i in out_dirs]\n", + "x_values = (\n", + " sorted([int(x) for x in x_values])\n", + " if def_mec == \"def_idm\"\n", + " else sorted([float(x) for x in x_values])\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ad31867b", "metadata": {}, "outputs": [], "source": [ @@ -51,55 +94,42 @@ "def split_data(data: pd.DataFrame, col: str, threshold: float = 0.5) -> tuple:\n", "\n", " # Split results based on probabilities\n", - " cert_idx = data[(data[col] > threshold)].index\n", - " data_cert = data.iloc[cert_idx]\n", - " data_uncert = data[~data.index.isin(cert_idx)]\n", + " cond = data[col] > threshold\n", + " data_cert = data[cond]\n", + " data_uncert = data[~cond]\n", "\n", " return data_cert, data_uncert\n", "\n", "\n", - "# Accuracy function for BN\n", - "def get_acc_bn(data: pd.DataFrame, col: str, vs_col: str) -> float:\n", + "# Accuracy function for a BN\n", + "def get_acc_bn(data: pd.DataFrame, col: str, vs_col: str, alpha=0.05) -> list:\n", + "\n", + " succ = sum(data[col] == data[vs_col])\n", + " acc = succ / len(data)\n", + " lower, upper = proportion_confint(\n", + " count=succ, nobs=len(data), alpha=alpha, method=\"wilson\"\n", + " )\n", "\n", - " return sum(data[col] == 
data[vs_col]) / len(data)" + " return [acc, lower, upper]" ] }, { "cell_type": "code", "execution_count": null, - "id": "38f1e202", + "id": "84251635", "metadata": {}, "outputs": [], "source": [ - "res_path = get_base_path(config) / config[\"results_path\"]\n", - "dirs = natsorted(\n", - " [f\"{res_path}/{dir}\" for dir in os.listdir(f\"{res_path}/\") if \"results_\" in dir]\n", - ")\n", - "\n", - "res = {\n", - " \"ess\": [],\n", - " \"eps\": [],\n", - " \"acc_cn_cert\": [],\n", - " \"acc_cn_uncert\": [],\n", - " \"acc_cn_tot\": [],\n", - " \"acc_noisy_bn\": [],\n", - " \"cert_cn\": [],\n", - "}\n", - "\n", - "roc = {\n", - " \"roc_cn_cert\": dict(),\n", - " \"roc_cn_uncert\": dict(),\n", - " \"roc_cn_tot\": dict(),\n", - " \"roc_noisy_bn\": dict(),\n", - "}\n", - "\n", - "# For each ess...\n", - "for dir in dirs:\n", - "\n", - " # Get results\n", - " files = [f for f in os.listdir(dir) if \".csv\" in f]\n", - " data = pd.concat([pd.read_csv(dir + \"/\" + f) for f in files])\n", - " data.reset_index(inplace=True)\n", + "# Build an inferences data set for each output folder\n", + "res = dict()\n", + "for out_dir in out_dirs:\n", + " inferences_path = os.path.join(cur_dir, folder, out_dir, \"results/inferences\")\n", + " files = [os.path.join(inferences_path, f) for f in os.listdir(inferences_path)]\n", + " data = pd.concat((pd.read_csv(f) for f in files), axis=0)\n", + " data[\"cn_probs_1\"] = data.apply(\n", + " lambda row: row[\"cn_probs\"] if row[\"cn_mpes\"] == 1 else row[\"cn_probs_alt\"],\n", + " axis=1,\n", + " )\n", " data[\"bn_noisy_probs_1\"] = data.apply(\n", " lambda row: (\n", " row[\"bn_noisy_probs\"]\n", @@ -108,206 +138,225 @@ " ),\n", " axis=1,\n", " )\n", - " data[\"cn_probs_1\"] = data.apply(\n", - " lambda row: row[\"cn_probs\"] if row[\"cn_mpes\"] == 1 else row[\"cn_probs_alt\"],\n", - " axis=1,\n", - " )\n", - "\n", - " # Store ess\n", - " reg = re.search(\"nodes(\\d+)_ess(\\d+)\", dir)\n", - " n_nodes = reg.group(1)\n", - " ess = 
reg.group(2)\n", - "\n", - " # Store avg of eps and std\n", - " with open(dir + \"/exp_meta.txt\", \"r\") as f:\n", - " eps_vec = [\n", - " float(re.search(\"Eps: (.+)\\n\", line).group(1))\n", - " for line in f\n", - " if \"Eps: \" in line\n", - " ]\n", - " eps = (float(np.mean(eps_vec)), float(np.std(eps_vec)))\n", - "\n", - " # Split CN results based on probabilities\n", - " data_cert, data_uncert = split_data(data, \"cn_probs\", 0.5)\n", - "\n", - " # Compute accuracies\n", - " vs = \"gt\"\n", - "\n", - " acc_cn_cert = (\n", - " get_acc_bn(data_cert, f\"{vs}_mpes\", \"cn_mpes\") if len(data_cert) > 0 else None\n", - " )\n", - " acc_cn_uncert = (\n", - " get_acc_bn(data_uncert, f\"{vs}_mpes\", \"cn_mpes\")\n", - " if len(data_uncert) > 0\n", - " else None\n", - " )\n", - " acc_cn_tot = get_acc_bn(data, f\"{vs}_mpes\", \"cn_mpes\")\n", - " acc_noisy_bn = get_acc_bn(data, f\"{vs}_mpes\", \"bn_noisy_mpes\")\n", - "\n", - " # Compute CN certainty\n", - " cert_cn = sum(data[\"cn_probs\"] > 0.5) / len(data)\n", - "\n", - " # Compute ROC\n", - " roc_cn_cert = roc_curve(data_cert[f\"{vs}_mpes\"], data_cert[\"cn_probs_1\"])\n", - " roc_cn_uncert = roc_curve(data_uncert[f\"{vs}_mpes\"], data_uncert[\"cn_probs_1\"])\n", - " roc_cn_tot = roc_curve(data[f\"{vs}_mpes\"], data[\"cn_probs_1\"])\n", - " roc_noisy_bn = roc_curve(data[f\"{vs}_mpes\"], data[\"bn_noisy_probs_1\"])\n", - "\n", - " # Store results\n", - " for key in res.keys():\n", - " res[key].append(eval(key))\n", - " for key in roc.keys():\n", - " roc[key][ess] = eval(key)\n", - "\n", - " # Debug\n", - " assert (data[\"cn_probs\"] >= data[\"cn_probs_alt\"]).all()\n", - " assert (data[\"bn_noisy_probs\"] >= 0.5).all()\n", - " assert (data[\"bn_probs\"] >= 0.5).all()\n", - " assert len(data) == len(pd.read_csv(dir + \"/\" + files[0])) * len(files)\n", - "\n", - "# Debug\n", - "length = len(res[\"ess\"])\n", - "for key in res.keys():\n", - " assert len(res[key]) == length\n", - "for key in roc.keys():\n", - " assert 
len(roc[key]) == length\n", - "assert res[\"ess\"] == sorted(res[\"ess\"])" + " x = pattern.findall(out_dir)[0][1]\n", + " res[f\"{def_arg}{x}\"] = data\n", + "\n", + "# Retrieve AUCs for each result folder\n", + "aucs = dict()\n", + "for out_dir in out_dirs:\n", + " auc_path = os.path.join(cur_dir, folder, out_dir, \"results/auc_meta.csv\")\n", + " data = pd.read_csv(auc_path)\n", + " x = pattern.findall(out_dir)[0][1]\n", + " aucs[f\"{def_arg}{x}\"] = data" ] }, { "cell_type": "code", "execution_count": null, - "id": "b6274696", + "id": "581cee1c", "metadata": {}, "outputs": [], "source": [ "# Style\n", - "plt.style.use(\"seaborn-v0_8-paper\")\n", - "plt.rcParams.update(\n", - " {\n", - " \"font.size\": 9,\n", - " \"axes.labelsize\": 9,\n", - " \"axes.titlesize\": 9,\n", - " \"legend.fontsize\": 7,\n", - " \"xtick.labelsize\": 8,\n", - " \"ytick.labelsize\": 8,\n", - " \"lines.linewidth\": 0.8,\n", - " \"figure.dpi\": 300,\n", - " \"savefig.dpi\": 300,\n", - " \"axes.edgecolor\": \"black\",\n", - " \"axes.linewidth\": 0.8,\n", - " \"text.usetex\": True,\n", - " }\n", - ")" + "# plt.style.use(\"seaborn-v0_8-paper\")\n", + "# plt.rcParams.update(\n", + "# {\n", + "# \"font.size\": 9,\n", + "# \"axes.labelsize\": 9,\n", + "# \"axes.titlesize\": 9,\n", + "# \"legend.fontsize\": 7,\n", + "# \"xtick.labelsize\": 8,\n", + "# \"ytick.labelsize\": 8,\n", + "# \"lines.linewidth\": 0.8,\n", + "# \"figure.dpi\": 300,\n", + "# \"savefig.dpi\": 300,\n", + "# \"axes.edgecolor\": \"black\",\n", + "# \"axes.linewidth\": 0.8,\n", + "# \"text.usetex\": True,\n", + "# }\n", + "# )" ] }, { "cell_type": "code", "execution_count": null, - "id": "0e9e8e36", + "id": "2ce6f5be", "metadata": {}, "outputs": [], "source": [ - "# Ess vs eps\n", - "fig, ax = plt.subplots(1, 1, figsize=(5, 3))\n", - "\n", - "eps_mean = np.array([x[0] for x in res[\"eps\"]])\n", - "ax.semilogy(res[\"ess\"], eps_mean, \"-o\", label=\"Mean\", markersize=4)\n", - "ax.set_xlabel(\"S\")\n", + "# Retrieve related 
epsilon\n", + "eps_median = []\n", + "eps_uq, eps_lq = [], []\n", + "eps_up, eps_lp = [], []\n", + "for x in x_values:\n", + " data = aucs[f\"{def_arg}{x}\"]\n", + " data_eps = [i for i in data[\"epsilon\"].values if i is not None and not np.isnan(i)]\n", + " eps_median.append(np.median(data_eps))\n", + " eps_uq.append(np.percentile(data_eps, 75))\n", + " eps_lq.append(np.percentile(data_eps, 25))\n", + " eps_up.append(np.percentile(data_eps, 95))\n", + " eps_lp.append(np.percentile(data_eps, 5))\n", + "\n", + "# Plot: ess vs eps\n", + "fig, ax = plt.subplots(1, 1)\n", + "\n", + "ax.semilogy(x_values, eps_median, \"-\", color=\"black\", label=\"Median\")\n", + "ax.fill_between(\n", + " x_values,\n", + " eps_lq,\n", + " eps_uq,\n", + " color=\"#ff4d4d\",\n", + " alpha=0.25,\n", + " linewidth=0,\n", + " label=\"Quartiles (1st \\& 3rd)\",\n", + " zorder=2,\n", + ")\n", + "ax.fill_between(\n", + " x_values,\n", + " eps_lp,\n", + " eps_up,\n", + " color=\"#ff9999\",\n", + " alpha=0.20,\n", + " linewidth=0,\n", + " label=\"Percentiles (5th \\& 95th)\",\n", + " zorder=2,\n", + ")\n", + "label = \"$S$\" if def_mec == \"def_idm\" else \"$\\delta$\"\n", + "ax.set_xlabel(label)\n", "ax.set_ylabel(\"$\\epsilon$\")\n", - "ax.set_title(\"S vs $\\epsilon$\")\n", - "ax.set_ylim([1e-5, 10])\n", - "ax.set_yticks([4, 1, 1e-1, 1e-2, 1e-3, 1e-4])\n", - "ax.set_yticklabels([\"4\", \"1\", \"1e-1\", \"1e-2\", \"1e-3\", \"1e-4\"])\n", - "ax.yaxis.set_minor_locator(LogLocator(base=10.0, subs=\"auto\"))\n", - "ax.tick_params(axis=\"y\", which=\"minor\", length=2, width=0.5)\n", - "ax.tick_params(axis=\"y\", which=\"major\", length=3, width=0.9)\n", - "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)\n", - "# ax.grid(True, which='minor', linestyle='-', linewidth=0.5, color='#bfbfbf', zorder=1)\n", - "ax.legend(loc=\"best\")\n", + "ax.set_title(\"Balancing privacy\")\n", "\n", - "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", - "plt.show()\n", - 
"fig.savefig(\n", - " f\"{plots_path}/s_vs_eps.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False\n", - ")" + "ax.set_ylim([1e-9, 100])\n", + "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)\n", + "ax.legend(loc=\"best\")" ] }, { "cell_type": "code", "execution_count": null, - "id": "90bf5cc9", + "id": "a29cb66d", "metadata": {}, "outputs": [], "source": [ - "# Ess vs CN certainty\n", - "fig, ax = plt.subplots(1, 1, figsize=(5, 3))\n", - "\n", - "ax.plot(res[\"ess\"], res[\"cert_cn\"], \"-o\", markersize=4)\n", - "ax.set_xlabel(\"S\")\n", + "# Get CN certainty\n", + "cn_certainty = []\n", + "for x in x_values:\n", + " data = res[f\"{def_arg}{x}\"]\n", + " cn_certainty.append(sum(data[\"cn_probs\"] > 0.5) / len(data))\n", + "\n", + "# Plot: ess vs CN certainty\n", + "fig, ax = plt.subplots(1, 1)\n", + "\n", + "ax.plot(x_values, cn_certainty, \"-\", color=\"black\")\n", + "label = \"$S$\" if def_mec == \"def_idm\" else \"$\\delta$\"\n", + "ax.set_xlabel(label)\n", "ax.set_ylabel(\"Ratio of (maxmin CN prob. 
$> 0.5$)\")\n", "ax.set_title(\"CN certainty\")\n", - "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)\n", - "\n", - "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", - "plt.show()\n", - "fig.savefig(\n", - " f\"{plots_path}/cn_certainty.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False\n", - ")" + "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)" ] }, { "cell_type": "code", "execution_count": null, - "id": "b5c19b7e", + "id": "1f7ebeae", "metadata": {}, "outputs": [], "source": [ - "# Accuracy\n", - "fig, ax = plt.subplots(1, 1, figsize=(5, 3))\n", + "# Store accuracy results\n", + "acc = {\"acc_noisy_bn\": {}, \"acc_cn_tot\": {}, \"acc_cn_cert\": {}, \"acc_cn_uncert\": {}}\n", "\n", - "labels = {\n", - " \"acc_noisy_bn\": \"Noisy BN\",\n", - " \"acc_cn_tot\": \"CN (total)\",\n", - " \"acc_cn_cert\": \"CN (certain)\",\n", - " \"acc_cn_uncert\": \"CN (uncertain)\",\n", + "roc = {\n", + " \"roc_cn_cert\": dict(),\n", + " \"roc_cn_uncert\": dict(),\n", + " \"roc_cn_tot\": dict(),\n", + " \"roc_noisy_bn\": dict(),\n", "}\n", - "for key in res.keys():\n", - " if \"acc\" in key:\n", - " ax.plot(res[\"ess\"], res[key], \"-o\", label=labels[key], markersize=4)\n", "\n", - "ax.set_xlabel(\"S\")\n", - "ax.set_ylabel(\"Accuracy\")\n", - "ax.set_title(\"Accuracy\")\n", - "ax.legend(loc=\"best\")\n", - "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)\n", + "for x in x_values:\n", + " data = res[f\"{def_arg}{x}\"]\n", "\n", - "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", - "plt.show()\n", - "fig.savefig(\n", - " f\"{plots_path}/accuracy.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False\n", - ")" + " # Split CN results based on probabilities\n", + " cn_cert, cn_uncert = split_data(data, \"cn_probs\", 0.5)\n", + "\n", + " # Compute accuracies\n", + " vs = \"bn\"\n", + " acc_cn_cert = (\n", + " get_acc_bn(cn_cert, f\"{vs}_mpes\", \"cn_mpes\")
if len(cn_cert) > 0 else None\n", + " )\n", + " acc_cn_uncert = (\n", + " get_acc_bn(cn_uncert, f\"{vs}_mpes\", \"cn_mpes\") if len(cn_uncert) > 0 else None\n", + " )\n", + " acc_cn_tot = get_acc_bn(data, f\"{vs}_mpes\", \"cn_mpes\")\n", + " acc_noisy_bn = get_acc_bn(data, f\"{vs}_mpes\", \"bn_noisy_mpes\")\n", + "\n", + " # Compute ROC\n", + " roc_cn_cert = (\n", + " roc_curve(cn_cert[f\"{vs}_mpes\"], cn_cert[\"cn_probs_1\"])\n", + " if len(cn_cert) > 0\n", + " else None\n", + " )\n", + " roc_cn_uncert = (\n", + " roc_curve(cn_uncert[f\"{vs}_mpes\"], cn_uncert[\"cn_probs_1\"])\n", + " if len(cn_uncert) > 0\n", + " else None\n", + " )\n", + " roc_cn_tot = roc_curve(data[f\"{vs}_mpes\"], data[\"cn_probs_1\"])\n", + " roc_noisy_bn = roc_curve(data[f\"{vs}_mpes\"], data[\"bn_noisy_probs_1\"])\n", + "\n", + " # Store results\n", + " for key in acc.keys():\n", + " acc[key][x] = eval(key)\n", + " for key in roc.keys():\n", + " roc[key][x] = eval(key)" ] }, { "cell_type": "code", "execution_count": null, - "id": "5c336efd", + "id": "c158f122", "metadata": {}, "outputs": [], "source": [ - "roc.keys()" + "# Plot accuracies\n", + "fig, ax = plt.subplots(1, 1)\n", + "\n", + "labels = {\n", + " \"acc_cn_cert\": \"CN (certain)\",\n", + " \"acc_cn_uncert\": \"CN (uncertain)\",\n", + " \"acc_cn_tot\": \"CN (total)\",\n", + " \"acc_noisy_bn\": \"Noisy BN\",\n", + "}\n", + "\n", + "colors = dict(\n", + " zip(labels.keys(), sns.color_palette(palette=\"seismic\", n_colors=len(labels)))\n", + ")\n", + "\n", + "for key in acc.keys():\n", + " accuracy = [x[0] if x else np.nan for x in acc[key].values()]\n", + " lower = [x[1] if x else np.nan for x in acc[key].values()]\n", + " upper = [x[2] if x else np.nan for x in acc[key].values()]\n", + " ax.plot(x_values, accuracy, \"-\", label=labels[key], color=colors[key])\n", + " ax.fill_between(\n", + " x_values, lower, upper, color=colors[key], alpha=0.25, zorder=2, linewidth=0\n", + " )\n", + "\n", + "label = \"$S$\" if def_mec == 
\"def_idm\" else \"$\delta$\"\n", + "ax.set_xlabel(label)\n", + "ax.set_ylabel(\"Accuracy\")\n", + "ax.set_title(\"MAP estimation (Wilson CI 95\%)\")\n", + "ax.legend(loc=\"best\")\n", + "ax.grid(True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1)" ] }, { "cell_type": "code", "execution_count": null, - "id": "9186101e", + "id": "089a1d9a", "metadata": {}, "outputs": [], "source": [ - "# ROC curves\n", - "fig, axes = plt.subplots(2, 3, figsize=(16, 8))\n", + "# Plot ROCs\n", + "fig, axes = plt.subplots(len(x_values) // 3 + 1, 3, figsize=(14, 7))\n", "\n", "labels = {\n", " \"roc_cn_cert\": \"CN (certain)\",\n", @@ -316,13 +365,20 @@ " \"roc_noisy_bn\": \"Noisy BN\",\n", "}\n", "\n", + "colors = dict(\n", + " zip(labels.keys(), sns.color_palette(palette=\"seismic\", n_colors=len(labels)))\n", + ")\n", + "\n", "i = 0\n", - "for ess in res[\"ess\"]:\n", + "for x in x_values:\n", " ax = axes.flatten()[i]\n", "\n", " for key in roc.keys():\n", - " fpr, tpr, _ = roc[key][ess]\n", - " ax.plot(fpr, tpr, label=labels[key], linewidth=1.3)\n", + " try:\n", + " fpr, tpr, _ = roc[key][x]\n", + " ax.plot(fpr, tpr, label=labels[key], linewidth=1.3, color=colors[key])\n", + " except TypeError:\n", + " continue\n", "\n", " ax.plot(\n", " [0, 1],\n", @@ -332,21 +388,19 @@ " linewidth=1.3,\n", " label=\"baseline\",\n", " )\n", + "\n", " ax.set_xlabel(\"FPR\")\n", " ax.set_ylabel(\"TPR\")\n", - " ax.set_title(f\"ROC (S = {ess})\")\n", " ax.grid(\n", " True, which=\"major\", linestyle=\"-\", linewidth=0.5, color=\"#bfbfbf\", zorder=1\n", " )\n", + " ax.set_title(f\"ROC ({def_arg} = {x})\")\n", + " if i == 0:\n", + " ax.legend(loc=\"best\")\n", "\n", " i += 1\n", "\n", - "plt.legend(loc=\"best\")\n", - "plt.subplots_adjust(hspace=0.5)\n", - "\n", - "plt.tight_layout(rect=[0, 0, 1, 0.96])\n", - "plt.show()\n", - "fig.savefig(f\"{plots_path}/roc.pdf\", dpi=1200, bbox_inches=\"tight\", transparent=False)" + "plt.subplots_adjust(hspace=0.3)" ] } ], diff --git
a/experiments/cn_vs_noisybn/__init__.py b/experiments/cn_vs_noisybn/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/experiments/cn_vs_noisybn/config.yaml b/experiments/cn_vs_noisybn/config.yaml index 3c4800b..fb9f76e 100644 --- a/experiments/cn_vs_noisybn/config.yaml +++ b/experiments/cn_vs_noisybn/config.yaml @@ -1,39 +1,45 @@ ## Configuration file # Paths -base_path: experiments/cn_vs_noisybn/output # Base path for output +cur_dir: experiments/cn_vs_noisybn # Current directory (contains all the following) bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs data_path: data # Where to save data as generated from ground-truth BNs -results_path: results # Where to save the experiment results -meta_file: exp_meta.txt # File of metadata +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments +auc_meta: output/results/auc_meta.csv # File of metadata for AUCs # Models (Naive Bayes) target_var: 'T' # Target variable -n_nodes: 10 # Number of nodes for each BN model -n_modmax: 2 # Maximum number of categories for covariates +n_nodes: 20 # Number of nodes for each BN model +n_modmax: 4 # Maximum number of categories for covariates n_models: 10 # Number of models to evaluate # Data gpop_ss: 1000 # Sample size of general population rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 10 # Number of data samples # MIA -n_samples: 30 # Number of data samples -n_bns: 50 # Number of BNs to sample within the CN +error: 'np.logspace(-4, 0, 20, endpoint=False)' # Type-I errors vector + +# Noisy BN tol: 0.01 # To find eps s.t. 
|AUC(eps) - AUC(CN)| < tol -error: 'np.logspace(-4, 0, 25, endpoint=False)' # Type-I errors vector -ess_dict: # Eps list to evaluate for each ess - 1: 'np.arange(0.1, 10, 0.1)' - 10: 'np.arange(0.1, 10, 0.1)' - 20: 'np.arange(0.05, 5, 0.05)' - 30: 'np.arange(1e-3, 1, 1e-3)' - 40: 'np.arange(5e-6, 1e-2, 5e-6)' - 50: 'np.arange(5e-7, 5e-4, 5e-7)' +eps_vec: 'np.logspace(-8, 2, num=200)' # Epsilon to consider for noisy BN # Inferences -n_infer: 1000 # Number of inferences to perform +n_infer: 100 # Number of inferences to perform # Other -seed: 42 # Global seed -num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization \ No newline at end of file +num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization + +## Notes +# 1) Suggested pairs (ess: eps_vec) for n_nodes=10: +# - 1 : 'np.arange(0.1, 10, 0.1)' +# - 10: 'np.arange(0.1, 10, 0.1)' +# - 20: 'np.arange(0.05, 5, 0.05)' +# - 30: 'np.arange(1e-3, 1, 1e-3)' +# - 40: 'np.arange(5e-6, 1e-2, 5e-6)' +# - 50: 'np.arange(5e-7, 5e-4, 5e-7)' diff --git a/experiments/cn_vs_noisybn/config_BAK.yaml b/experiments/cn_vs_noisybn/config_BAK.yaml new file mode 100644 index 0000000..6cb9d4f --- /dev/null +++ b/experiments/cn_vs_noisybn/config_BAK.yaml @@ -0,0 +1,45 @@ +## Configuration file + +# Paths +cur_dir: experiments/cn_vs_noisybn # Current directory (contains all the following) +bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs +data_path: data # Where to save data as generated from ground-truth BNs +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments +auc_meta: output/results/auc_meta.csv # File of metadata for AUCs + +# Models (Naive Bayes) +target_var: 'T' # Target variable +n_nodes: 10 # Number of nodes for each BN model 
+n_modmax: 2 # Maximum number of categories for covariates +n_models: 10 # Number of models to evaluate + +# Data +gpop_ss: 1000 # Sample size of general population +rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop +pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 30 # Number of data samples + +# MIA +error: 'np.logspace(-4, 0, 25, endpoint=False)' # Type-I errors vector + +# Noisy BN +tol: 0.01 # To find eps s.t. |AUC(eps) - AUC(CN)| < tol +eps_vec: 'np.logspace(-8, 2, num=1000)' # Epsilon to consider for noisy BN + +# Inferences +n_infer: 1000 # Number of inferences to perform + +# Other +num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization + +## Notes +# 1) Suggested pairs (ess: eps_vec) for n_nodes=10: +# - 1 : 'np.arange(0.1, 10, 0.1)' +# - 10: 'np.arange(0.1, 10, 0.1)' +# - 20: 'np.arange(0.05, 5, 0.05)' +# - 30: 'np.arange(1e-3, 1, 1e-3)' +# - 40: 'np.arange(5e-6, 1e-2, 5e-6)' +# - 50: 'np.arange(5e-7, 5e-4, 5e-7)' \ No newline at end of file diff --git a/experiments/cn_vs_noisybn/exp.py b/experiments/cn_vs_noisybn/exp.py new file mode 100644 index 0000000..5110a16 --- /dev/null +++ b/experiments/cn_vs_noisybn/exp.py @@ -0,0 +1,73 @@ +import gc +import multiprocessing # noqa: F401 # pylint: disable=unused-import +import sys + +import numpy as np # noqa: F401 # pylint: disable=unused-import +import pandas as pd +from joblib import Parallel, delayed + +from src.attack import attack_mechanism +from src.config import create_clean_dir, get_cur_dir, load_config, map_sys_args +from src.defense import defense_mechanism +from src.inference import inferences +from src.mia import find_epsilon, mia_vs_cn + + +def main(): + + # Init configs + config = load_config("cn_vs_noisybn") + cur_dir = get_cur_dir(config) + create_clean_dir(cur_dir / "output") + num_cores = eval(config["num_cores"]) + + # Get command-line hyperparameters + def_mec, def_args, atk_mec, atk_args = 
map_sys_args(sys.argv, config) + + # Init the vectors of experiments + exp_vec = [f.stem for f in (cur_dir / config["data_path"]).iterdir() if f.is_file()] + + # Defense mechanism + print("## Defense mechanism: [", def_mec, def_args, "] ##", flush=True) + create_clean_dir(cur_dir / config["cns_path"]) + _ = Parallel(n_jobs=num_cores)( + delayed(defense_mechanism)(exp, config, def_mec, def_args) for exp in exp_vec + ) + + # Attack mechanism + print("## Attack mechanism: [", atk_mec, atk_args, "] ##", flush=True) + create_clean_dir(cur_dir / config["atk_path"]) + _ = Parallel(n_jobs=num_cores)( + delayed(attack_mechanism)(exp, config, atk_mec, atk_args) for exp in exp_vec + ) + + # MIA vs CN + print("## MIA vs CN ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "cns") + res = Parallel(n_jobs=num_cores)(delayed(mia_vs_cn)(exp, config) for exp in exp_vec) + auc_res = pd.concat((i for i in res), axis=0) + auc_res.to_csv(f'{cur_dir}/{config["auc_meta"]}', index=False) + + # Find eps s.t. 
|AUC(eps) - AUC(CN)| < tol + print("## Find epsilon ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "bn_noisy") + res = Parallel(n_jobs=num_cores)( + delayed(find_epsilon)(exp, config) for exp in exp_vec + ) + auc_res = pd.concat((i for i in res), axis=0) + auc_res.to_csv(f'{cur_dir}/{config["auc_meta"]}', index=False) + + # Run inferences + print("## Inferences ##", flush=True) + create_clean_dir(cur_dir / config["results_path"] / "inferences") + _ = Parallel(n_jobs=num_cores)( + delayed(inferences)(exp, config, def_mec, def_args) for exp in exp_vec + ) + + # Clean + gc.collect() + + +if __name__ == "__main__": + + main() diff --git a/experiments/cn_vs_noisybn/generate.py b/experiments/cn_vs_noisybn/generate.py new file mode 100644 index 0000000..557f062 --- /dev/null +++ b/experiments/cn_vs_noisybn/generate.py @@ -0,0 +1,44 @@ +import gc +import multiprocessing # noqa: F401 # pylint: disable=unused-import + +import numpy as np # noqa: F401 # pylint: disable=unused-import +from joblib import Parallel, delayed + +from src.config import create_clean_dir, get_cur_dir, load_config +from src.data import generate_naivebayes +from src.learning import estimate_bns + + +def main(): + + # Init configs + config = load_config("cn_vs_noisybn") + cur_dir = get_cur_dir(config) + num_cores = eval(config["num_cores"]) + + # Generate BNs and data + print("## Generate BNs and data ##") + create_clean_dir(cur_dir / config["bns_path"]) + create_clean_dir(cur_dir / config["bns_path"] / "gt") + create_clean_dir(cur_dir / config["data_path"]) + open(f'{cur_dir}/{config["exp_meta"]}', "w").close() + generate_naivebayes(config) + + # Init the vectors of experiments + exp_vec = [f.stem for f in (cur_dir / config["data_path"]).iterdir() if f.is_file()] + + # Estimate BNs from rpop and pool + print("## Estimate BNs from rpop and pool ##") + create_clean_dir(cur_dir / config["bns_path"] / "rpop") + create_clean_dir(cur_dir / config["bns_path"] / "pool") + _ = 
Parallel(n_jobs=num_cores)( + delayed(estimate_bns)(exp, config) for exp in exp_vec + ) + + # Clean + gc.collect() + + +if __name__ == "__main__": + + main() diff --git a/experiments/cn_vs_noisybn/main.py b/experiments/cn_vs_noisybn/main.py deleted file mode 100644 index f0f9b7b..0000000 --- a/experiments/cn_vs_noisybn/main.py +++ /dev/null @@ -1,14 +0,0 @@ -from src.config import get_config -from src.data import generate_naivebayes -from src.run_exp import run_cn_vs_noisybn - -if __name__ == "__main__": - - # Load config - config = get_config("experiments/cn_vs_noisybn/config.yaml") - - # Generate BNs and data - generate_naivebayes(config) - - # Run experiment - run_cn_vs_noisybn(config) diff --git a/generate_compose.py b/generate_compose.py new file mode 100644 index 0000000..d2b74d0 --- /dev/null +++ b/generate_compose.py @@ -0,0 +1,67 @@ +from itertools import product + +import yaml + +# Set hyperparameters +names = ["cn_vs_noisybn"] +def_mecs = {"def_idm": {"ess": [1, 50, 100]}, "def_ran": {"delta": [0.001, 0.05, 0.1]}} +atk_mecs = { + "atk_mle": {"n_bns": [100]}, + "atk_cen": {None: [None]}, + "atk_ran": {None: [None]}, +} + +# Initialize the `compose.yaml` file +init = {"version": "3.9"} +with open("compose.yaml", "w") as f: + yaml.dump(init, f, default_flow_style=False) + +# For any configuration ... +# (assumption: each defense and attack mechanism has at most 1 hyperparameter to be set) +data = {"services": dict()} +for name, def_mec, atk_mec in product(names, def_mecs.keys(), atk_mecs.keys()): + + def_params = list(def_mecs[def_mec].values())[0] + atk_params = list(atk_mecs[atk_mec].values())[0] + + for def_par, atk_par in product(def_params, atk_params): + + # ... set the related volume, ... 
+ app = ( + f"_{list(def_mecs[def_mec].keys())[0]}{def_par}" + if def_par is not None + else "" + ) + volumes = [ + f"./experiments/{name}/bns:/workspace/experiments/{name}/bns", + f"./experiments/{name}/data:/workspace/experiments/{name}/data", + f"./experiments/{name}/output_{def_mec}_{atk_mec}{app}:/workspace/experiments/{name}/output", + ] + + # ... set the command, ... + command = [ + "python", + "-m", + f"experiments.{name}.exp", + f"def_mec={def_mec}", + f"atk_mec={atk_mec}", + ] + + if def_par is not None: + command.append(f"{list(def_mecs[def_mec].keys())[0]}={def_par}") + if atk_par is not None: + command.append(f"{list(atk_mecs[atk_mec].keys())[0]}={atk_par}") + + # ... and create the experiment + data["services"][f"{name}_{def_mec}_{atk_mec}{app}"] = { + "image": "bnp:2025", + "build": ".", + "volumes": volumes, + "command": command, + } +# Print number of services +print("Number of services: ", len(data["services"])) + +# Write file +with open("compose.yaml", "a") as f: + yaml.dump(data, f, default_flow_style=False) diff --git a/requirements.txt b/requirements.txt index 9c22126..0c2c7a0 100644 --- a/requirements.txt +++ b/requirements.txt @@ -59,6 +59,7 @@ ptyprocess==0.7.0 pure_eval==0.2.3 pyAgrum==2.2.0 pyarrow==21.0.0 +pycddlib==3.0.2 pycodestyle==2.14.0 pydot==4.0.1 pyflakes==3.4.0 diff --git a/src/attack.py b/src/attack.py new file mode 100644 index 0000000..6406d87 --- /dev/null +++ b/src/attack.py @@ -0,0 +1,114 @@ +import inspect + +import numpy as np +import pandas as pd +import pyagrum as gum + +from src.config import get_cur_dir, set_seed +from src.mia import get_ll +from src.utils import centroid_cn, maxent_cn, sample_from_cn + + +# Apply attack mechanism to a BN, namely, derive a BN from a CN +def attack_mechanism(exp, config, atk_mec, atk_args) -> None: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + base_path = cur_dir / config["cns_path"] + 
+ # Set seed + set_seed() + + # For each data sample ... + for sample in range(config["samples"]): + + # ... read the related CN + bn_min = gum.loadBN(f"{base_path}/bn_min_{exp}_sample{sample}.bif") + bn_max = gum.loadBN(f"{base_path}/bn_max_{exp}_sample{sample}.bif") + + # ... retrieve rpop, ... + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, : len(bn_min.nodes())] + + # ... and derive the BN + atk_mec_fn = globals()[atk_mec] # Get the related function + sig = inspect.signature(atk_mec_fn) # Get its signature + args = { + k: v + for k, v in { + "bn_min": bn_min, + "bn_max": bn_max, + "data": rpop, + "n_bns": atk_args.get("n_bns", None), + }.items() + if k in sig.parameters + } + bn = atk_mec_fn(**args) + gum.saveBN( + bn, f'{cur_dir / config["atk_path"]}/{f"bn_{exp}_sample{sample}"}.bif' + ) + + return + + +# Get the BN inside a CN with max entropy distribution +def atk_ent(bn_min, bn_max): + + bn = maxent_cn(bn_min, bn_max) + + return bn + + +# Get a random BN inside a CN +def atk_ran(bn_min, bn_max): + + bn = sample_from_cn(bn_min, bn_max, 1) + + return bn[0] + + +# Get the centroid of a CN +def atk_cen(bn_min, bn_max): + + bn = centroid_cn(bn_min, bn_max) + + return bn + + +# Get the maximum likelihood BN inside a CN +def atk_mle(bn_min, bn_max, data, n_bns: int): + + # Sample from the CN ... + bns_sample = sample_from_cn(bn_min, bn_max, n_bns) + + # ... and take the MLE one + bn = mle_bn(bns_sample, data) + + return bn + + +# Get the maximum likelihood BN within a set +def mle_bn(bns_sample, data): + """ + Given a list `bns_sample` of BNs, + find argmax_{BN in bns_sample} ll(BN | data), + where ll is the log-likelihood function. 
+ """ + + mle_bn = None + mle = -np.inf + + for bn in bns_sample: + + # Estimate the likelihood of data + bn_ie = gum.LazyPropagation(bn) + llr_im = data.apply(lambda x: get_ll(x.to_dict(), bn_ie), axis=1).dropna() + llr = np.sum(llr_im) + + if llr > mle: + mle_bn = bn + mle = llr + + return mle_bn diff --git a/src/config.py b/src/config.py index 016bb82..8d9cc66 100644 --- a/src/config.py +++ b/src/config.py @@ -1,47 +1,104 @@ +import os import random import shutil +import sys from pathlib import Path import pyagrum as gum import yaml +IN_PYTEST = "pytest" in sys.modules + + +# Get arguments as passed from command-line for experiment +def map_sys_args(sys_args, config) -> tuple: + + # Store parameters + params = dict([arg.split("=") for arg in sys_args if "=" in arg]) + with open(f'{config["cur_dir"]}/{config["exp_meta"]}', "a") as m: + m.write(f"\n Defense & attack parameters: \n {params}") + + # Get defense and attack mechanisms + def_mec = params.pop("def_mec") + atk_mec = params.pop("atk_mec") + + # Get defense parameters + def_args = dict() + if def_mec == "def_idm": + def_args["ess"] = int(params.pop("ess")) + assert def_args["ess"] >= 0 + elif def_mec == "def_ran": + def_args["delta"] = float(params.pop("delta")) + assert def_args["delta"] >= 0 + assert def_args["delta"] <= 1 + else: + raise Exception("Defense not implemented") + + # Save attack parameters + atk_args = dict() + if atk_mec == "atk_mle": + atk_args["n_bns"] = int(params.pop("n_bns")) + assert atk_args["n_bns"] >= 1 + elif atk_mec == "atk_cen" or atk_mec == "atk_ran" or atk_mec == "atk_ent": + pass + else: + raise Exception("Attack not implemented") + + # Exceptions + if len(params) != 0: + raise Exception(f"Unused parameters: {params}") + + return (def_mec, def_args, atk_mec, atk_args) + # Read configuration for experiment -def get_config(path): +def load_config(name: str): - with open(path, "r") as f: + subdir = "test" if os.getenv("USE_TEST_CONFIG") == "1" else "experiments" + + config_path 
= get_root_path() / subdir / name / "config.yaml" + + with open(config_path, "r") as f: config = yaml.safe_load(f) return config # Set global seed -def set_global_seed(seed): +def set_seed(): - random.seed(seed) - gum.initRandom(seed) + random.seed(42) + gum.initRandom(42) -# Get root directory -def get_root_path(): - return Path(__file__).resolve().parents[1] +# Create an empty directory +def create_clean_dir(path: Path): + + # If directory exists, clean it + if path.exists() and path.is_dir(): + for item in path.iterdir(): + shutil.rmtree(item) if item.is_dir() else item.unlink() + + # Else, create a new one + else: + path.mkdir(parents=True, exist_ok=True) -# Get base path -def get_base_path(config): +# Get output path +def get_cur_dir(config): root_path = get_root_path() - base_path = config["base_path"] + cur_dir = config["cur_dir"] - return root_path / base_path + return root_path / cur_dir -# Create an empty directory -def create_clean_dir(path: Path): +# Get root (project) directory +def get_root_path(): + return Path(__file__).resolve().parents[1] - # Remove the folder if already exists - if path.exists() and path.is_dir(): - shutil.rmtree(path) - # Create a new folder - path.mkdir(parents=True, exist_ok=True) +# Only perform an `assert` if code is running in `pytest` +def safe_assert(condition): + if IN_PYTEST: + assert condition diff --git a/src/data.py b/src/data.py index 167b2ac..3e1cb3d 100644 --- a/src/data.py +++ b/src/data.py @@ -1,31 +1,30 @@ +import ast from itertools import product -from pprint import pformat +import numpy as np import pyagrum as gum from numpy.random import randint -from src.config import create_clean_dir, get_base_path, set_global_seed -from src.utils import compact_dict +from src.config import get_cur_dir, safe_assert, set_seed def generate_naivebayes(config): - # Set seed - set_global_seed(config["seed"]) - # Set paths - base_path = get_base_path(config) - bns_path = base_path / config["bns_path"] - data_path = base_path / 
config["data_path"] - results_path = base_path / config["results_path"] + cur_dir = get_cur_dir(config) + bns_path = cur_dir / config["bns_path"] + data_path = cur_dir / config["data_path"] - # Create empty directories - create_clean_dir(bns_path) - create_clean_dir(data_path) - create_clean_dir(results_path) + # Set seed + set_seed() - # Set BN (naive Bayes) structure + # Retrieve hyperparameters n_modmax = config["n_modmax"] + gpop_ss = config["gpop_ss"] + pool_ss = int(gpop_ss * config["pool_prop"]) + rpop_ss = int(gpop_ss * config["rpop_prop"]) + + # Set BN (naive Bayes) structure bn_str_gen = ( f'{config["target_var"]}->X{i}[{randint(2, n_modmax+1)}]' for i in range(config["n_nodes"] - 1) @@ -37,62 +36,69 @@ def generate_naivebayes(config): # ... generate BN, ... bn = gum.fastBN(bn_str) - gum.saveBN(bn, f"{bns_path}/exp{i}.bif") + gum.saveBN(bn, f'{bns_path / "gt"}/{f"exp{i}"}.bif') + + with open(f'{cur_dir}/{config["exp_meta"]}', "a") as m: + m.write( + f'- exp{i}. Naive Bayes: {config["n_nodes"]} nodes. Complexity: {bn.dim()} Max categories: {n_modmax}\n' + ) # ... and generate gpop from BN data_gen = gum.BNDatabaseGenerator(bn) data_gen.drawSamples(config["gpop_ss"]) data_gen.setDiscretizedLabelModeRandom() gpop = data_gen.to_pandas() - gpop.to_csv(f"{data_path}/exp{i}.csv", index=False) - # For each ESS ... - for ess in config["ess_dict"].keys(): + # For any data sample ... + for sample in range(config["samples"]): - # ... create results subdirectories and metadata files - meta_file_path = ( - base_path - / config["results_path"] - / f'results_nodes{config["n_nodes"]}_ess{ess}' - / config["meta_file"] - ) - meta_file_path.parent.mkdir(parents=True, exist_ok=True) + # ... 
sample pool and rpop + shuffled_idx = np.random.permutation(gpop.index) - with open(meta_file_path, "w") as f: - f.write(pformat(compact_dict(config)) + "\n\n" + "#" * 50 + "\n\n") + pool_idx = shuffled_idx[:pool_ss] + rpop_idx = shuffled_idx[pool_ss : pool_ss + rpop_ss] + gpop[f"in-pool-{sample}"] = gpop.index.isin(pool_idx) + gpop[f"in-rpop-{sample}"] = gpop.index.isin(rpop_idx) -def generate_randombn(config): + # Debug + safe_assert(pool_ss == len(pool_idx)) + safe_assert(rpop_ss == len(rpop_idx)) + safe_assert(sum(gpop[f"in-pool-{sample}"]) == pool_ss) + safe_assert(sum(gpop[f"in-rpop-{sample}"]) == rpop_ss) - # Set seed - set_global_seed(config["seed"]) + # Save gpop + gpop.to_csv(f"{data_path}/exp{i}.csv", index=False) + + +def generate_randombn(config): # Set paths - base_path = get_base_path(config) - bns_path = base_path / config["bns_path"] - data_path = base_path / config["data_path"] - results_path = base_path / config["results_path"] - - # Create empty directories - create_clean_dir(bns_path) - create_clean_dir(data_path) - create_clean_dir(results_path) - - n_nodes_vec = eval(config["n_nodes_vec"]) - edge_ratio_vec = eval(config["edge_ratio_vec"]) - n_modmax = config["n_modmax"] + cur_dir = get_cur_dir(config) + bns_path = cur_dir / config["bns_path"] + data_path = cur_dir / config["data_path"] + + # Retrieve hyperparameters + n_nodes_vec = ast.literal_eval(config["n_nodes_vec"]) + edge_ratio_vec = ast.literal_eval(config["edge_ratio_vec"]) + gpop_ss = config["gpop_ss"] + pool_ss = int(gpop_ss * config["pool_prop"]) + rpop_ss = int(gpop_ss * config["rpop_prop"]) + + # Set seed + set_seed() # For each configuration ... for i, (n, r) in enumerate(product(n_nodes_vec, edge_ratio_vec)): # ... generate BN, ... 
bn_gen = gum.BNGenerator() - bn = bn_gen.generate(n_nodes=n, n_arcs=int(n * r), n_modmax=n_modmax) - gum.saveBN(bn, f"{bns_path}/exp{i}.bif") + bn = bn_gen.generate(n_nodes=n, n_arcs=int(n * r), n_modmax=config["n_modmax"]) + gum.saveBN(bn, f'{bns_path / "gt"}/{f"exp{i}"}.bif') - with open(f'{results_path}/{config["meta_file"]}', "a") as m: + with open(f'{cur_dir}/{config["exp_meta"]}', "a") as m: m.write( - f"- exp{i}. Nodes: {n} Edges: {int(n * r)} Complexity: {bn.dim()} Max categories: {n_modmax}\n" + f"- exp{i}. Nodes: {n} Edges: {int(n * r)} Complexity: {bn.dim()}\n" ) # ... and generate gpop from BN @@ -100,4 +106,24 @@ def generate_randombn(config): data_gen.drawSamples(config["gpop_ss"]) data_gen.setDiscretizedLabelModeRandom() gpop = data_gen.to_pandas() + + # For any data sample ... + for sample in range(config["samples"]): + + # ... sample pool and rpop + shuffled_idx = np.random.permutation(gpop.index) + + pool_idx = shuffled_idx[:pool_ss] + rpop_idx = shuffled_idx[pool_ss : pool_ss + rpop_ss] + + gpop[f"in-pool-{sample}"] = gpop.index.isin(pool_idx) + gpop[f"in-rpop-{sample}"] = gpop.index.isin(rpop_idx) + + # Debug + safe_assert(pool_ss == len(pool_idx)) + safe_assert(rpop_ss == len(rpop_idx)) + safe_assert(sum(gpop[f"in-pool-{sample}"]) == pool_ss) + safe_assert(sum(gpop[f"in-rpop-{sample}"]) == rpop_ss) + + # Save gpop gpop.to_csv(f"{data_path}/exp{i}.csv", index=False) diff --git a/src/defense.py b/src/defense.py new file mode 100644 index 0000000..35c1e41 --- /dev/null +++ b/src/defense.py @@ -0,0 +1,137 @@ +import inspect + +import numpy as np +import pandas as pd +import pyagrum as gum + +from src.config import get_cur_dir, safe_assert, set_seed +from src.utils import add_counts_to_bn, check_consistency + + +# Apply defense mechanism to a BN, namely, derive a CN from a BN +def defense_mechanism(exp, config, def_mec, def_args) -> None: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Read data + gpop = pd.read_csv(f'{cur_dir / 
config["data_path"]}/{exp}.csv') + + # Set seed + set_seed() + + # For each data sample ... + for sample in range(config["samples"]): + + # ... read the related BN + bn = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/pool/bn_{exp}_sample{sample}.bif" + ) + + # ... retrieve pool, ... + pool = gpop[gpop[f"in-pool-{sample}"]].iloc[:, : len(bn.nodes())] + + # ... and derive the CN + def_mec_fn = globals()[def_mec] # Get the related function + sig = inspect.signature(def_mec_fn) # Get its signature + args = { + k: v + for k, v in { + "bn": bn, + "ess": def_args.get("ess", None), + "delta": def_args.get("delta", None), + "data": pool, + }.items() + if k in sig.parameters + } + cn = def_mec_fn(**args) # Keep only `def_mec` args + base_path = cur_dir / config["cns_path"] + cn.saveBNsMinMax( + f"{base_path}/bn_min_{exp}_sample{sample}.bif", + f"{base_path}/bn_max_{exp}_sample{sample}.bif", + ) + + return + + +# Estimate a CN from data by local IDM +def def_idm(bn, ess, data): + bn_counts = gum.BayesNet(bn) + add_counts_to_bn(bn_counts, data) + cn = gum.CredalNet(bn_counts) + cn.idmLearning(ess) + + return cn + + +# Build a CN by bloating each BN parameter with a fixed-size random interval +def def_ran(bn, delta): + + # Initialize the extreme BNs + bn_min = gum.BayesNet(bn) + bn_max = gum.BayesNet(bn) + + # For each node ... + for n in bn.nodes(): + + # ... get the CPT, ... + cpt = bn.cpt(n).toarray() + + # ... get a matrix of eta's, ... + eta = np.random.uniform(0, delta, cpt.size).reshape(cpt.shape) + + # ... perturb the CPT, ... + cpt_min = np.minimum(1 - delta, np.maximum(0, cpt - eta)) + cpt_max = np.minimum(1, np.maximum(delta, cpt - eta + delta)) + + # ... 
and store it into the extreme BNs + bn_min.cpt(n).fillWith(cpt_min.flatten()) + bn_max.cpt(n).fillWith(cpt_max.flatten()) + + # Debug + safe_assert(np.all(cpt_min <= cpt)) + safe_assert(np.all(cpt_max >= cpt)) + safe_assert(np.all(np.abs(cpt_max - cpt_min - delta) < 1e-6)) + + # Build the CN from the extreme BNs + cn = gum.CredalNet(bn_min, bn_max) + cn.intervalToCredal() + + # Debug + safe_assert(check_consistency(bn, bn_min, bn_max)) + + return cn + + +# Create noisy BN by adding Laplacian noise (Zhang et al., 2017) +def noisy_bn(bn, scale: float): + + bn_ie = gum.LazyPropagation(bn) + bn_ie.makeInference() + + bn_noisy = gum.BayesNet(bn) + + # For each node X ... + for node in bn.names(): + + # Get the joint P(X, Pa(X)) + joint = bn_ie.jointPosterior(bn.family(node)) + + # Add noise to P(X, Pa(X)) and normalize + noise = np.random.laplace(scale=scale, size=np.prod(joint.shape)) + noisy_joint = np.clip( + joint.toarray().flatten() + noise, a_min=10e-10, a_max=None + ) + noisy_joint = noisy_joint / np.sum(noisy_joint) + joint.fillWith(noisy_joint) + + # Compute the conditional P(X | Pa(X)) + cond = joint / joint.sumOut(node) + + # Fill noisy BN + bn_noisy.cpt(node).fillWith(cond) + + # Check noisy bn + bn_noisy.check() # OK if = (). 
+ + return bn_noisy diff --git a/src/inference.py b/src/inference.py index 1a59c05..273a6b7 100644 --- a/src/inference.py +++ b/src/inference.py @@ -1,3 +1,4 @@ +import inspect import math import numpy as np @@ -5,17 +6,28 @@ import pyagrum as gum from more_itertools import random_product -from src.config import get_base_path, set_global_seed -from src.utils import add_counts_to_bn, get_min_max_bns, noisy_bn, safe_assert +import src.defense +from src.config import get_cur_dir, safe_assert, set_seed +from src.defense import noisy_bn +from src.learning import learn_bn_params +from src.utils import get_min_max_bns -def run_inferences(exp, ess, eps, config): +def inferences(exp, config, def_mec, def_args): - base_path = get_base_path(config) + # Read config + cur_dir = get_cur_dir(config) target = config["target_var"] + # Read data + auc_res = pd.read_csv(f'{cur_dir}/{config["auc_meta"]}') + eps_vec = [ + i for i in auc_res.loc[auc_res["exp"] == exp, "epsilon"].values if i is not None + ] + eps = np.mean(eps_vec) + # Set seed - set_global_seed(config["seed"]) + set_seed() # Set list of evidence evid_vec = [ @@ -24,29 +36,39 @@ def run_inferences(exp, ess, eps, config): ] # Store ground-truth BN - gt = gum.loadBN(f'{base_path / config["bns_path"]}/{exp}.bif') - gpop = pd.read_csv(f'{base_path / config["data_path"]}/{exp}.csv') - - # Learn BN from gpop - bn_learner = gum.BNLearner(gpop) - bn_learner.useSmoothingPrior(1e-5) - bn = bn_learner.learnParameters(gt.dag()) - - # Learn CN from gpop - bn_copy = gum.BayesNet(bn) - add_counts_to_bn(bn_copy, gpop) - cn = gum.CredalNet(bn_copy) - cn.idmLearning(ess) - - # Learn noisy BN from gpop + gt = gum.loadBN(f'{cur_dir / config["bns_path"]}/gt/{exp}.bif') + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + + # Learn BN from gpop #TODO: save results + bn = learn_bn_params(gt, gpop) + + # Learn CN from gpop (defense mechanism) #TODO: save results + def_mec_fn = getattr(src.defense, def_mec) # Get the related 
function + sig = inspect.signature(def_mec_fn) # Get its signature + args = { + k: v + for k, v in { + "bn": bn, + "ess": def_args.get("ess", None), + "delta": def_args.get("delta", None), + "data": gpop, + }.items() + if k in sig.parameters + } + cn = def_mec_fn(**args) # Keep only `def_mec` args + + # Learn noisy BN from gpop #TODO: save results scale = (2 * bn.size()) / (len(gpop) * eps) bn_noisy = noisy_bn(bn, scale) # Run inferences - gt_mpes, _ = run_inference_bn(gt, target, evid_vec) - bn_mpes, bn_probs = run_inference_bn(bn, target, evid_vec) - bn_noisy_mpes, bn_noisy_probs = run_inference_bn(bn_noisy, target, evid_vec) - cn_mpes, cn_probs, cn_probs_alt = run_inference_cn(cn, target, evid_vec, exp) + try: + gt_mpes, _ = run_inference_bn(gt, target, evid_vec) + bn_mpes, bn_probs = run_inference_bn(bn, target, evid_vec) + bn_noisy_mpes, bn_noisy_probs = run_inference_bn(bn_noisy, target, evid_vec) + cn_mpes, cn_probs, cn_probs_alt = run_inference_cn(cn, target, evid_vec, exp) + except: + return # Save results results = pd.DataFrame( @@ -62,12 +84,12 @@ def run_inferences(exp, ess, eps, config): } ) - res_path = ( - base_path - / config["results_path"] - / f'results_nodes{config["n_nodes"]}_ess{ess}' + results.to_csv( + f'{cur_dir / config["results_path"]}/inferences/{exp}.csv', + index=False, ) - results.to_csv(f"{res_path}/{exp}.csv", index=False) + + return # MPE function for BN @@ -167,6 +189,9 @@ def run_inference_bn(bn, target: str, evid_vec): cov = sorted(list(bn.names())) cov.remove(target) + # Set seed + set_seed() + # Debug safe_assert(len(cov) == bn.size() - 1) @@ -200,6 +225,9 @@ def run_inference_cn(cn, target: str, evid_vec, exp: str): cov = sorted(list(bn_min.names())) cov.remove(target) + # Set seed + set_seed() + # Debug safe_assert(len(cov) == bn_min.size() - 1) diff --git a/src/learning.py b/src/learning.py new file mode 100644 index 0000000..5d7c603 --- /dev/null +++ b/src/learning.py @@ -0,0 +1,66 @@ +import pandas as pd +import pyagrum
as gum + +from src.config import get_cur_dir, safe_assert, set_seed + + +# Learn BN parameters from a given BN and data +def learn_bn_params(bn, data): + + bn_copy = gum.BayesNet(bn) + + learner = gum.BNLearner(data, bn_copy) + learner.useSmoothingPrior(1e-5) + bn_learnt = learner.learnParameters(bn_copy) + + return bn_learnt + + +# Estimate BNs from rpop and pool +def estimate_bns(exp, config) -> None: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Set seed + set_seed() + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + bn = gum.loadBN(f'{cur_dir / config["bns_path"]}/gt/{exp}.bif') + n_nodes = len(bn.nodes()) + gpop_ss = config["gpop_ss"] + rpop_ss = int(gpop_ss * config["rpop_prop"]) + pool_ss = int(gpop_ss * config["pool_prop"]) + + # Debug + safe_assert(gpop_ss == gpop.shape[0]) + safe_assert(n_nodes == gpop.loc[:, ~gpop.columns.str.contains("in-")].shape[1]) + + # For each data sample ... + for sample in range(config["samples"]): + + # ... retrieve pool and rpop, ... + pool = gpop[gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes] + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, :n_nodes] + + # ... estimate BN from rpop, ... + bn_learnt = learn_bn_params(bn, rpop) + gum.saveBN( + bn_learnt, + f'{cur_dir / config["bns_path"] / "rpop"}/{f"bn_{exp}_sample{sample}"}.bif', + ) + + # ... estimate BN from pool, ... 
+ bn_learnt = learn_bn_params(bn, pool) + gum.saveBN( + bn_learnt, + f'{cur_dir / config["bns_path"] / "pool"}/{f"bn_{exp}_sample{sample}"}.bif', + ) + + # Debug + safe_assert(len(pool) == sum(gpop[f"in-pool-{sample}"])) + safe_assert(len(pool) == pool_ss) + safe_assert(len(rpop) == rpop_ss) + + return diff --git a/src/membership_attack.py b/src/membership_attack.py deleted file mode 100644 index 472c65c..0000000 --- a/src/membership_attack.py +++ /dev/null @@ -1,367 +0,0 @@ -import math -import traceback - -import numpy as np -import pandas as pd -import pyagrum as gum -from scipy.stats import norm -from sklearn import metrics - -from src.config import get_base_path, set_global_seed -from src.utils import (add_counts_to_bn, get_ll, get_llr, noisy_bn, - safe_assert, sample_from_cn) - - -# Get the attack power related to a fixed error -def get_power(llr_ref, llr_gen, ground_truth, error) -> float: - - # Compute the threshold - t = np.quantile(llr_ref, 1 - error).item() - - # Test: L(x) > t => reject H_0 => assign `x` to target_pop - y_pred = llr_gen > t - - # Compute power (i.e., true positive rate) - power = sum(ground_truth & y_pred) / sum(ground_truth) - - return power - - -# MIA: membership inference attack -def run_mia(model, baseline, rpop, gpop, ground_truth, error_vec): - - # Compute llr(x) on reference and general populations - llr_ref = ( - rpop.apply(lambda x: get_llr(x.to_dict(), baseline, model), axis=1) - .dropna() - .sort_values() - ) - llr_gen = gpop[[*rpop.columns]].apply( - lambda x: get_llr(x.to_dict(), baseline, model), axis=1 - ) - - power_vec = [] - - # Get the power for each error - for error in error_vec: - power = get_power(llr_ref, llr_gen, ground_truth, error) - power_vec.append(power) - - # Compute and store AUC - auc = metrics.auc(error_vec, power_vec) - - return power_vec, auc - - -# Get the maximum likelihood BN -def get_maxll_bn(bns_sample, rpop): - """ - Given a list `bns_sample` of BNs, - find argmax_{BN in bns_sample} ll(BN | 
rpop), - where ll is the log-likelihood function. - """ - - maxll_bn = None - maxll = -np.inf - - for bn in bns_sample: - - # Estimate the likelihood of rpop - bn_ie = gum.LazyPropagation(bn) - llr_im = rpop.apply(lambda x: get_ll(x.to_dict(), bn_ie), axis=1).dropna() - llr = np.sum(llr_im) - - if llr > maxll: - maxll_bn = bn - maxll = llr - - return maxll_bn - - -# Find eps s.t. |AUC(eps) - AUC(CN)| < tol -def get_eps(exp, ess, config): - - # Get base path - base_path = get_base_path(config) - - # Set seed - set_global_seed(config["seed"]) - - # Init hyperp. - eps_vec = eval(config["ess_dict"][ess]) - results_path = base_path / config["results_path"] - n_samples = config["n_samples"] - n_bns = config["n_bns"] - error = eval(config["error"]) - tol = config["tol"] - - # Read data - gpop = pd.read_csv(f'{base_path / config["data_path"]}/{exp}.csv') - bn = gum.loadBN(f'{base_path / config["bns_path"]}/{exp}.bif') - n_nodes = config["n_nodes"] - gpop_ss = config["gpop_ss"] - rpop_ss = int(gpop_ss * config["rpop_prop"]) - pool_ss = int(gpop_ss * config["pool_prop"]) - - # Debug - safe_assert(gpop_ss == gpop.shape[0]) - safe_assert(n_nodes == gpop.shape[1]) - - bn_theta_vec = [] - bn_theta_hat_vec = [] - cn_vec = [] - - # For any data sample ... - for sample in range(n_samples): - - # ... sample pool and rpop, ... - pool_idx = np.random.choice(range(gpop_ss), size=pool_ss, replace=False) - gpop[f"in-pool-{sample}"] = gpop.index.isin(pool_idx) - pool = gpop[gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes] - rpop = gpop[~gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes].sample(rpop_ss) - - # ... estimate BN from rpop, ... - learner = gum.BNLearner(rpop) - learner.useSmoothingPrior(1e-5) - bn_theta_vec.append(learner.learnParameters(bn.dag())) - - # ... estimate BN from pool, ... - learner = gum.BNLearner(pool) - learner.useSmoothingPrior(1e-5) - bn_theta_hat_vec.append(learner.learnParameters(bn.dag())) - - # ... 
and estimate CN from pool (by local IDM) - bn_counts = gum.BayesNet(bn) - add_counts_to_bn(bn_counts, pool) - cn = gum.CredalNet(bn_counts) - cn.idmLearning(ess) - cn_vec.append(cn) - - # Debug - safe_assert(len(pool) == sum(gpop[f"in-pool-{sample}"])) - safe_assert(len(pool) == pool_ss) - safe_assert(len(rpop) == rpop_ss) - - # Debug - safe_assert(len(bn_theta_vec) == n_samples) - safe_assert(len(bn_theta_hat_vec) == n_samples) - safe_assert(len(cn_vec) == n_samples) - - # Run MIA against CN - auc_cn_vec = [] - for sample in range(n_samples): - - # Retrieve sample-related info - y_true = gpop[f"in-pool-{sample}"] - cn = cn_vec[sample] - bn_theta_ie = gum.LazyPropagation(bn_theta_vec[sample]) - - # Extract random subset within simplex - bns_sample = sample_from_cn(cn, exp, n_bns) - - # Get the maximum likelihood BN - best_bn = get_maxll_bn(bns_sample, rpop) - bn_ie = gum.LazyPropagation(best_bn) - - # MIA - try: - _, auc = run_mia(bn_ie, bn_theta_ie, rpop, gpop, y_true, error) - auc_cn_vec.append(auc) - - except Exception: - - # Debug - with open(f"{results_path}/log.txt", "a") as log: - log.write(f"{exp}: error with sample {sample} (CN).\n") - log.write(traceback.format_exc()) - - # Compute Avg(AUC(CN)) across data samples - auc_cn = sum(auc_cn_vec) / len(auc_cn_vec) - - # Find eps - eps_best = eps_vec[-1] - - # For each eps ... - for eps in eps_vec: - - auc_bn_noisy_vec = [] - - # ... run MIA against noisy BN ... 
- for sample in range(n_samples): - - # Retrieve sample-related info - y_true = gpop[f"in-pool-{sample}"] - bn_theta_hat = bn_theta_hat_vec[sample] - bn_theta_ie = gum.LazyPropagation(bn_theta_vec[sample]) - - # Get noisy BN - scale = (2 * bn_theta_hat.size()) / (len(pool) * eps) - bn_noisy = noisy_bn(bn_theta_hat, scale) - bn_noisy_ie = gum.LazyPropagation(bn_noisy) - - try: - - # MIA - _, auc = run_mia(bn_noisy_ie, bn_theta_ie, rpop, gpop, y_true, error) - auc_bn_noisy_vec.append(auc) - - except Exception: - - # Debug - with open(f"{results_path}/log.txt", "a") as log: - log.write( - f"{exp}: error with sample {sample} (BN noisy, eps: {eps}).\n" - ) - log.write(traceback.format_exc()) - - # ... and compute Avg(AUC(eps)) across data samples - auc_bn = sum(auc_bn_noisy_vec) / n_samples - - # Condition on |AUC(eps) - AUC(CN)| - if abs(auc_cn - auc_bn) <= tol: - eps_best = eps - break - - # Store found eps - meta_file_path = ( - results_path - / f'results_nodes{config["n_nodes"]}_ess{ess}' - / config["meta_file"] - ) - with open(meta_file_path, "a") as m: - m.write(f"- {exp}. Nodes: {n_nodes} Eps: {eps_best}\n") - - return exp, ess, eps_best - - -# Membership attack against CN and BN -def attack_cn_bn(exp, ess, config): - - # Get base path - base_path = get_base_path(config) - - # Set seed - set_global_seed(config["seed"]) - - # Init hyperp. - results_path = base_path / config["results_path"] - n_samples = config["n_samples"] - n_bns = config["n_bns"] - error = eval(config["error"]) - - # Read data - gpop = pd.read_csv(f'{base_path / config["data_path"]}/{exp}.csv') - bn = gum.loadBN(f'{base_path / config["bns_path"]}/{exp}.bif') - n_nodes = len(bn.nodes()) - gpop_ss = config["gpop_ss"] - rpop_ss = int(gpop_ss * config["rpop_prop"]) - pool_ss = int(gpop_ss * config["pool_prop"]) - - # Debug - safe_assert(gpop_ss == gpop.shape[0]) - safe_assert(n_nodes == gpop.shape[1]) - - bn_theta_vec = [] - bn_theta_hat_vec = [] - cn_vec = [] - - # For any data sample ... 
- for sample in range(n_samples): - - # ... sample pool and rpop, ... - pool_idx = np.random.choice(range(gpop_ss), size=pool_ss, replace=False) - gpop[f"in-pool-{sample}"] = gpop.index.isin(pool_idx) - pool = gpop[gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes] - rpop = gpop[~gpop[f"in-pool-{sample}"]].iloc[:, :n_nodes].sample(rpop_ss) - - # ... estimate BN from rpop, ... - learner = gum.BNLearner(rpop) - learner.useSmoothingPrior(1e-5) - bn_theta_vec.append(learner.learnParameters(bn.dag())) - - # ... estimate BN from pool, ... - learner = gum.BNLearner(pool) - learner.useSmoothingPrior(1e-5) - bn_theta_hat_vec.append(learner.learnParameters(bn.dag())) - - # ... and estimate CN from pool (by local IDM) - bn_counts = gum.BayesNet(bn) - add_counts_to_bn(bn_counts, pool) - cn = gum.CredalNet(bn_counts) - cn.idmLearning(ess) - cn_vec.append(cn) - - # Debug - safe_assert(len(pool) == sum(gpop[f"in-pool-{sample}"])) - safe_assert(len(pool) == pool_ss) - safe_assert(len(rpop) == rpop_ss) - - # Debug - safe_assert(len(bn_theta_vec) == n_samples) - safe_assert(len(bn_theta_hat_vec) == n_samples) - safe_assert(len(cn_vec) == n_samples) - - # Compute theoretical bound - compl = bn.dim() - bound = math.sqrt(compl / pool_ss) - - # Find power (beta) for any error (alpha) given theoretical bound - z_alpha = [norm.ppf(1 - i).item() for i in error] - z_one_minus_beta = [bound - i for i in z_alpha] - beta = [norm.cdf(i).item() for i in z_one_minus_beta] - - # Init results - results = pd.DataFrame({"error": error, "power_bound": beta}) - - # Run MIA against CN - for sample in range(n_samples): - - # Retrieve sample-related info - y_true = gpop[f"in-pool-{sample}"] - cn = cn_vec[sample] - bn_theta_ie = gum.LazyPropagation(bn_theta_vec[sample]) - - # Extract random subset within simplex - bns_sample = sample_from_cn(cn, exp, n_bns) - - # Get the maximum likelihood BN - best_bn = get_maxll_bn(bns_sample, rpop) - bn_ie = gum.LazyPropagation(best_bn) - - # MIA - try: - power_vec, _ = 
run_mia(bn_ie, bn_theta_ie, rpop, gpop, y_true, error) - results[f"power_CN_sample{sample}"] = power_vec - - except Exception: - - # Debug - with open(f"{results_path}/log.txt", "a") as log: - log.write(f"{exp}: error with sample {sample} (CN).\n") - log.write(traceback.format_exc()) - - # Run MIA against BN - for sample in range(n_samples): - - # Retrieve sample-related info - y_true = gpop[f"in-pool-{sample}"] - bn_theta_hat_ie = gum.LazyPropagation(bn_theta_hat_vec[sample]) - bn_theta_ie = gum.LazyPropagation(bn_theta_vec[sample]) - - try: - - # MIA - power_vec, _ = run_mia( - bn_theta_hat_ie, bn_theta_ie, rpop, gpop, y_true, error - ) - results[f"power_BN_sample{sample}"] = power_vec - - except Exception: - - # Debug - with open(f"{results_path}/log.txt", "a") as log: - log.write(f"{exp}: error with sample {sample} (BN).\n") - log.write(traceback.format_exc()) - - # Save results - results.to_csv(f"{results_path}/{exp}-ess{ess}.csv", index=False) diff --git a/src/mia.py b/src/mia.py new file mode 100644 index 0000000..5c843b5 --- /dev/null +++ b/src/mia.py @@ -0,0 +1,323 @@ +import math + +import numpy as np +import pandas as pd +import pyagrum as gum +from scipy.stats import norm +from sklearn import metrics + +from src.config import get_cur_dir, set_seed +from src.defense import noisy_bn + + +# MIA attack vs a BN +def mia_vs_bn(exp, config) -> dict: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Init results + power_res = pd.DataFrame({"error": eval(config["error"])}) + auc_res = pd.DataFrame({"sample": range(config["samples"])}) + auc_res["exp"] = exp + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + + # Set seed + set_seed() + + # For each data sample ... + auc_bns_dict = dict() + for sample in range(config["samples"]): + + # ... read the BNs as estimated from rpop and pool, ... 
+ bn_theta = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/rpop/bn_{exp}_sample{sample}.bif" + ) + bn_theta_hat = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/pool/bn_{exp}_sample{sample}.bif" + ) + + bn_theta_hat_ie = gum.LazyPropagation(bn_theta_hat) + bn_theta_ie = gum.LazyPropagation(bn_theta) + + # ... retrieve rpop, ... + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, : len(bn_theta.nodes())] + + # try: + + # ... and perform membership inference on gpop + power_vec, auc = run_mia( + bn_theta_hat_ie, + bn_theta_ie, + rpop, + gpop, + gpop[f"in-pool-{sample}"], + eval(config["error"]), + ) + power_res[f"power_BN_sample{sample}"] = power_vec + auc_bns_dict[sample] = auc + + # except Exception: + + # # Debug + # with open(f"{results_path}/log.txt", "a") as log: + # log.write(f"{exp}: error with sample {sample} (BN).\n") + # log.write(traceback.format_exc()) + + # Save results + power_res.to_csv( + f'{cur_dir}/{config["results_path"]}/bns/power_bn_{exp}.csv', index=False + ) + + # Return + auc_res["auc_bn"] = auc_res.apply(lambda row: auc_bns_dict[row["sample"]], axis=1) + + return auc_res + + +# MIA attack vs a CN +def mia_vs_cn(exp, config) -> pd.DataFrame: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Init results + power_res = pd.DataFrame({"error": eval(config["error"])}) + auc_res = pd.DataFrame({"sample": range(config["samples"])}) + auc_res["exp"] = exp + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + + # Set seed + set_seed() + + # For each data sample ... + auc_cns_dict = dict() + for sample in range(config["samples"]): + + # ... read the BN as inferred from the CN + bn_theta_hat = gum.loadBN( + f'{cur_dir}/{config["atk_path"]}/bn_{exp}_sample{sample}.bif' + ) + + # ... read the BN as estimated from rpop, ... 
+ bn_theta = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/rpop/bn_{exp}_sample{sample}.bif" + ) + + bn_theta_hat_ie = gum.LazyPropagation(bn_theta_hat) + bn_theta_ie = gum.LazyPropagation(bn_theta) + + # ... retrieve rpop, ... + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, : len(bn_theta.nodes())] + + # try: + + # ... and perform membership inference on gpop + power_vec, auc = run_mia( + bn_theta_hat_ie, + bn_theta_ie, + rpop, + gpop, + gpop[f"in-pool-{sample}"], + eval(config["error"]), + ) + power_res[f"power_CN_sample{sample}"] = power_vec + auc_cns_dict[sample] = auc + + # except Exception: + + # # Debug + # with open(f"{results_path}/log.txt", "a") as log: + # log.write(f"{exp}: error with sample {sample} (BN).\n") + # log.write(traceback.format_exc()) + + # Save results + power_res.to_csv( + f'{cur_dir}/{config["results_path"]}/cns/power_cn_{exp}.csv', index=False + ) + + # Return + auc_res["auc_cn"] = auc_res.apply(lambda row: auc_cns_dict[row["sample"]], axis=1) + + return auc_res + + +# Get theoretical power +def theoretical_power(exp, config) -> None: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Read data + bn = gum.loadBN(f'{get_cur_dir(config) / config["bns_path"]}/gt/{exp}.bif') + results = pd.read_csv(f'{cur_dir}/{config["results_path"]}/bns/power_bn_{exp}.csv') + + # Set seed + set_seed() + + # Compute bound + bound = math.sqrt(bn.dim() / int(config["gpop_ss"] * config["pool_prop"])) + + # Find power (beta) for any error (alpha) given theoretical bound + z_alpha = [norm.ppf(1 - i).item() for i in eval(config["error"])] + z_one_minus_beta = [bound - i for i in z_alpha] + beta = [norm.cdf(i).item() for i in z_one_minus_beta] + + # Save results + results["power_bound"] = beta + results.to_csv( + f'{cur_dir}/{config["results_path"]}/bns/power_bn_{exp}.csv', index=False + ) + + return + + +# Find eps s.t. 
|AUC(eps) - AUC(CN)| < tol +def find_epsilon(exp, config) -> dict: + + # Get current directory + cur_dir = get_cur_dir(config) + + # Init results + power_res = pd.DataFrame({"error": eval(config["error"])}) + + # Read data + gpop = pd.read_csv(f'{cur_dir / config["data_path"]}/{exp}.csv') + gpop_ss = config["gpop_ss"] + pool_ss = int(gpop_ss * config["pool_prop"]) + auc_res = pd.read_csv(f'{cur_dir}/{config["auc_meta"]}') + auc_res = auc_res[auc_res["exp"] == exp] + eps_vec = eval(config["eps_vec"]) + + # Set seed + set_seed() + + # For each data sample ... + eps_dict = dict() + auc_noisy_dict = dict() + for sample in range(config["samples"]): + + # ... read the BNs as estimated from rpop and pool, ... + bn_theta = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/rpop/bn_{exp}_sample{sample}.bif" + ) + bn_theta_hat = gum.loadBN( + f"{cur_dir}/{config['bns_path']}/pool/bn_{exp}_sample{sample}.bif" + ) + + # ... retrieve rpop, ... + rpop = gpop[gpop[f"in-rpop-{sample}"]].iloc[:, : len(bn_theta.nodes())] + + # ... get CN AUC, ... + auc_cn = auc_res.loc[auc_res["sample"] == sample, "auc_cn"].values[0] + + # ... init results, ... + eps_dict[sample] = None + auc_noisy_dict[sample] = None + + # ... 
and find epsilon + for eps in eps_vec: + + # Get noisy BN + scale = (2 * bn_theta_hat.size()) / (pool_ss * eps) + bn_noisy = noisy_bn(bn_theta_hat, scale) + + bn_noisy_ie = gum.LazyPropagation(bn_noisy) + bn_theta_ie = gum.LazyPropagation(bn_theta) + + # Perform membership inference on gpop + power_vec, auc = run_mia( + bn_noisy_ie, + bn_theta_ie, + rpop, + gpop, + gpop[f"in-pool-{sample}"], + eval(config["error"]), + ) + + # Condition on |AUC(eps) - AUC(CN)| + if abs(auc_cn - auc) < config["tol"]: + eps_dict[sample] = eps + auc_noisy_dict[sample] = auc + power_res[f"power_BN_noisy_sample{sample}"] = power_vec + break + + # Save results + power_res.to_csv( + f'{cur_dir}/{config["results_path"]}/bn_noisy/power_bn_{exp}.csv', index=False + ) + + # Return + auc_res["epsilon"] = auc_res.apply(lambda row: eps_dict[row["sample"]], axis=1) + auc_res["auc_noisy_bn"] = auc_res.apply( + lambda row: auc_noisy_dict[row["sample"]], axis=1 + ) + + return auc_res + + +# MIA: membership inference attack +def run_mia(model, baseline, rpop, gpop, ground_truth, error_vec): + + # Compute llr(x) on reference and general populations + llr_ref = ( + rpop.apply(lambda x: get_llr(x.to_dict(), baseline, model), axis=1) + .dropna() + .sort_values() + ) + llr_gen = gpop[[*rpop.columns]].apply( + lambda x: get_llr(x.to_dict(), baseline, model), axis=1 + ) + + power_vec = [] + + # Get the power for each error + for error in error_vec: + power = get_power(llr_ref, llr_gen, ground_truth, error) + power_vec.append(power) + + # Compute and store AUC + auc = metrics.auc(error_vec, power_vec) + + return power_vec, auc + + +# Get the attack power related to a fixed error +def get_power(llr_ref, llr_gen, ground_truth, error) -> float: + + # Compute the threshold + t = np.quantile(llr_ref, 1 - error).item() + + # Test: L(x) > t => reject H_0 => assign `x` to target_pop + y_pred = llr_gen > t + + # Compute power (i.e., true positive rate) + power = sum(ground_truth & y_pred) / sum(ground_truth) + + 
return power + + +# Log-likelihood function +def get_ll(x: dict, theta): + + # Erase all evidences and apply addEvidence(key,value) for every pairs in x + theta.setEvidence(x) + + # Compute P(x | theta) + ll = theta.evidenceProbability() + + return np.log(ll) + + +# Log-likelihood ratio (llr) function +def get_llr(x: dict, theta, theta_hat): + + # Compute log-likelihoods + ll_theta = get_ll(x, theta) + ll_theta_hat = get_ll(x, theta_hat) + + return ll_theta_hat - ll_theta diff --git a/src/run_exp.py b/src/run_exp.py deleted file mode 100644 index ec51f00..0000000 --- a/src/run_exp.py +++ /dev/null @@ -1,61 +0,0 @@ -import gc -import multiprocessing # noqa: F401 # pylint: disable=unused-import -from itertools import product - -from joblib import Parallel, delayed - -from src.config import get_base_path -from src.inference import run_inferences -from src.membership_attack import attack_cn_bn, get_eps - - -def run_cn_vs_noisybn(config): - - # Get base path - base_path = get_base_path(config) - - # Set number of threads for parallelization - num_cores = eval(config["num_cores"]) - - # For each ESS and each model ... - exp_vec = [ - f.stem for f in (base_path / config["data_path"]).iterdir() if f.is_file() - ] - ess_vec = config["ess_dict"].keys() - - # ... find eps s.t. |AUC(eps) - AUC(CN)| < tol, ... - res = Parallel(n_jobs=num_cores)( - delayed(get_eps)(exp, ess, config) for exp, ess in product(exp_vec, ess_vec) - ) - - # ... and run inferences - _ = Parallel(n_jobs=num_cores)( - delayed(run_inferences)(exp, ess, eps, config) for exp, ess, eps in res - ) - - # Clean - gc.collect() - - -def run_cn_privacy(config): - - # Get base path - base_path = get_base_path(config) - - # Set number of threads for parallelization - num_cores = eval(config["num_cores"]) - - # For each ESS and each model ... - exp_vec = [ - f.stem for f in (base_path / config["data_path"]).iterdir() if f.is_file() - ] - ess_vec = eval(config["ess_vec"]) - - # ... 
run MIA attack on BN and CN - Parallel(n_jobs=num_cores)( - delayed(attack_cn_bn)(exp, ess, config) - for exp, ess in product(exp_vec, ess_vec) - ) - - # Clean - gc.collect() diff --git a/src/utils.py b/src/utils.py index 8dfad4a..9b75f3e 100644 --- a/src/utils.py +++ b/src/utils.py @@ -1,52 +1,13 @@ -import sys -from math import prod +from fractions import Fraction from tempfile import TemporaryDirectory +import cdd +import cdd.gmp import hopsy import numpy as np import pyagrum as gum -IN_PYTEST = "pytest" in sys.modules - - -# Log-likelihood function -def get_ll(x: dict, theta): - - # Erase all evidences and apply addEvidence(key,value) for every pairs in x - theta.setEvidence(x) - - # Compute P(x | theta) - ll = theta.evidenceProbability() - - return np.log(ll) - - -# Log-likelihood ratio (llr) function -def get_llr(x: dict, theta, theta_hat): - - # Compute log-likelihoods - ll_theta = get_ll(x, theta) - ll_theta_hat = get_ll(x, theta_hat) - - return ll_theta_hat - ll_theta - - -# Check BNs sampled from a CN -def are_all_bns_different(bn_vec) -> bool: - - signatures = set() - for bn in bn_vec: - cpt_data = [] - for var in bn.names(): - cpt = bn.cpt(var) - flat = [f"{v:.8f}" for v in cpt.toarray().flatten()] - cpt_data.append(f"{var}:" + ",".join(flat)) - sig = "|".join(cpt_data) - signatures.add(sig) - - print(f"({len(signatures)}/{len(bn_vec)} different BNs.)") - - return len(signatures) == len(bn_vec) +from src.config import safe_assert # Add counts of events to a BN @@ -70,163 +31,202 @@ def add_counts_to_bn(bn, data): bn.cpt(node).fillWith(counts_array.flatten().tolist()) -# Compact a dictionary to be printable -def compact_dict(d): - new_dict = {} - for k, v in d.items(): - if isinstance(v, np.ndarray): - new_dict[k] = ( - f"np.ndarray: [{v[0]:.2g}, {v[1]:.2g}, ..., {v[-1]:.2g}], length={len(v)}" - ) - else: - new_dict[k] = v - return new_dict +# Get the BN inside a CN with max entropy distribution +def maxent_cn(bn_min, bn_max) -> gum.BayesNet: + # Init an 
empty BN + bn = gum.BayesNet(bn_min) -# Create noisy BN by adding Laplacian noise (Zhang et al., 2017) -def noisy_bn(bn, scale: float): + # For each variable ... + for var in bn.names(): - bn_ie = gum.LazyPropagation(bn) - bn_ie.makeInference() + # ... get the centroid CPT, ... + cpt = maxent_cpt(bn_min.cpt(var), bn_max.cpt(var)) - bn_noisy = gum.BayesNet(bn) + # ... and fill the BN + bn.cpt(var).fillWith(cpt.flatten()) - # For each node X ... - for node in bn.names(): + # Debug + safe_assert(check_consistency(bn, bn_min, bn_max)) - # Get the joint P(X, Pa(X)) - joint = bn_ie.jointPosterior(bn.family(node)) + return bn - # Add noise to P(X, Pa(X)) and normalize - noise = np.random.laplace(scale=scale, size=np.prod(joint.shape)) - noisy_joint = np.clip( - joint.toarray().flatten() + noise, a_min=10e-10, a_max=None - ) - noisy_joint = noisy_joint / np.sum(noisy_joint) - joint.fillWith(noisy_joint) - # Compute the conditional P(X | Pa(X)) - cond = joint / joint.sumOut(node) +# Get the BN CPT inside a CN CPT with max entropy distribution +def maxent_cpt(cpt_min, cpt_max) -> np.array: - # Fill noisy BN - bn_noisy.cpt(node).fillWith(cond) + # Transform CPTs into pandas dataframes + cpt_min = np.atleast_2d(cpt_min.topandas()) + cpt_max = np.atleast_2d(cpt_max.topandas()) - # Check noisy bn - bn_noisy.check() # OK if = (). + # For each row in the CPT ... + cpt = [] + for row in range(cpt_min.shape[0]): - return bn_noisy + # ... get the centroid credal set, ... 
+ c = maxent_cset(cpt_min[row, :], cpt_max[row, :]) + cpt.append(c) + # Reshape the CPT + cpt = np.array(cpt) -# Only perform an `assert` if code is running in `pytest` -def safe_assert(condition): - if IN_PYTEST: - assert condition + # Debug + safe_assert(cpt_min.shape == cpt_max.shape) + safe_assert(cpt.shape == cpt_min.shape) + return cpt -# Extract BN min and BN max from a CN -def get_min_max_bns(cn, exp: str): - with TemporaryDirectory() as tmp_path: - cn.saveBNsMinMax(f"{tmp_path}/bn_min_{exp}.bif", f"{tmp_path}/bn_max_{exp}.bif") - bn_min = gum.loadBN(f"{tmp_path}/bn_min_{exp}.bif") - bn_max = gum.loadBN(f"{tmp_path}/bn_max_{exp}.bif") +# Get the max-entropy distribution inside a credal set +def maxent_cset(vec_min, vec_max) -> np.array: - return bn_min, bn_max + rank = {v: k for k, v in enumerate(sorted(set(vec_min)))} + vec_order = np.array([rank[val] for val in vec_min]) + s = 1 - np.sum(vec_min) -# Sample from a credal set K(x | pi_x), i.e., a constrained polytope. -def sample_from_cset(vec_min, vec_max, n_samples) -> list: - """ - We assume a credal set is a polytope in a space of #X parameters, defined by a: - - Multi-dimensional rectangle, i.e., inequality constraint Ax <= b, and - - Hyperplane (provided all the variables sum up to 1), i.e., equality constraint A_eq x = b_eq. - In the case of local IDM, this is ensured. 
- """ + out = vec_min + while s > 0: + idx0 = np.where(vec_order == 0)[0] + idx1 = np.where(vec_order == 1)[0] + idx_len = len(idx0) - # Define the rectangle - n_par = len(vec_min) - A = np.concat((np.eye(n_par), -np.eye(n_par)), axis=0) - b = np.array(np.concatenate((vec_max, -vec_min))) - rectangle = hopsy.Problem(A=A, b=b) + - # Define the hyperplane - A_eq = np.array([np.ones(n_par)]) - b_eq = np.array([1.0]) + try: + s_cond = s / idx_len < out[idx1[0]] - out[idx0[0]] + mat = np.stack( + [ + ( + (s / idx_len) * np.ones(len(idx0)) + if s_cond + else out[idx1] - out[idx0] + ), + vec_max[idx0] - out[idx0], + ] + ) + except IndexError: + s_cond = True + mat = np.stack( + [(s / idx_len) * np.ones(len(idx0)), vec_max[idx0] - out[idx0]] + ) - # Define the polytope as a constrained rectangle - constrained_rectangle = hopsy.add_equality_constraints( - rectangle, A_eq=A_eq, b_eq=b_eq - ) + mat_min = np.min(mat) + q = np.argwhere(mat == mat_min) - # Sample from the polytope - mc = hopsy.MarkovChain(constrained_rectangle) - rng = hopsy.RandomNumberGenerator(42) - _, constrained_samples = hopsy.sample(mc, rng, n_samples, thinning=10) - constrained_samples = constrained_samples[0] + if np.any(q[:, 0] == 1): + if len(idx0) > len(q): + vec_order[~np.isin(np.arange(len(out)), idx0[q[:, 1]])] += 1 + elif not s_cond: + vec_order[idx0[q[:, 1]]] += 1 + + out[idx0] += mat_min + s -= mat_min * len(idx0) + vec_order -= 1 + + return out + + +# Get the centroid of a CN +def centroid_cn(bn_min, bn_max) -> gum.BayesNet: + + # Init an empty BN + bn = gum.BayesNet(bn_min) + + # For each variable ... + for var in bn.names(): + + # ... get the centroid CPT, ... + cpt = centroid_cpt(bn_min.cpt(var), bn_max.cpt(var)) + + # ... 
and fill the BN + bn.cpt(var).fillWith(cpt.flatten()) # Debug - safe_assert(np.all(vec_min <= vec_max)) - safe_assert(n_par == len(vec_max)) - safe_assert(n_par == A.shape[1]) - safe_assert(n_par == A_eq.shape[1]) - safe_assert(len(constrained_samples) == n_samples) - for i in constrained_samples: - safe_assert(len(i) == n_par) + safe_assert(check_consistency(bn, bn_min, bn_max)) - return constrained_samples + return bn -# Sample from two extreme CPTs -def sample_from_cpts(cpt_min, cpt_max, n_samples) -> list: +# Get the centroid of a CN CPT +def centroid_cpt(cpt_min, cpt_max) -> np.array: # Transform CPTs into pandas dataframes cpt_min = np.atleast_2d(cpt_min.topandas()) cpt_max = np.atleast_2d(cpt_max.topandas()) # For each row in the CPT ... - credal_dict = {} + cpt = [] for row in range(cpt_min.shape[0]): - # ... sample `n_samples` points from the credal set - credal_dict[row] = sample_from_cset(cpt_min[row, :], cpt_max[row, :], n_samples) + # ... get the centroid credal set, ... + c = centroid_cset(cpt_min[row, :], cpt_max[row, :]) + cpt.append(c) - # For each sample ... - cpt_samples = [] - for i in range(n_samples): + # Reshape the CPT + cpt = np.array(cpt) - # ... 
build the CPT - cpt = [] - for row in range(cpt_min.shape[0]): - cpt.append(credal_dict[row][i]) + # Debug + safe_assert(cpt_min.shape == cpt_max.shape) + safe_assert(cpt.shape == cpt_min.shape) - cpt = np.array(cpt).flatten() - cpt_samples.append(cpt) + return cpt + + +# Get the centroid of a credal set as the average of its extreme points +def centroid_cset(vec_min, vec_max) -> np.array: + + # Define the (in)equalities (i.e., get the H-representation of the credal set) + n_par = len(vec_min) + A = np.concatenate( + (-np.eye(n_par), np.eye(n_par), np.atleast_2d(np.ones(n_par))), axis=0 + ) + b = np.concatenate((vec_max, -vec_min, np.atleast_1d(-1))).reshape(len(A), 1) + bA = np.concatenate((b, A), axis=1) + bA_frac = np.array( + [[Fraction(x).limit_denominator() for x in row] for row in bA], dtype=object + ) # Needed for numerical stability + mat_frac = cdd.gmp.matrix_from_array( + array=bA_frac, rep_type=cdd.RepType.INEQUALITY, lin_set=set([len(A) - 1]) + ) + + # Get the polytope and extreme points. 
Each point is a row of the matrix `vertices` + poly_frac = cdd.gmp.polyhedron_from_matrix(mat_frac) + ext_frac = cdd.gmp.copy_generators(poly_frac) + vertices_frac = np.array(ext_frac.array)[:, 1:] + vertices = np.array( + [[float(x) for x in row] for row in vertices_frac], dtype=object + ) + + # Compute the centroid as the average across extreme points + centroid = np.sum(vertices, axis=0) / len(vertices) # Debug - safe_assert(cpt_min.shape == cpt_max.shape) - safe_assert(len(credal_dict) == prod(cpt_min.shape)) - safe_assert(len(cpt_samples) == n_samples) + safe_assert(len(vec_min) == len(vec_max)) + safe_assert(len(b) == 2 * len(vec_min) + 1) + safe_assert(A.shape == (len(b), len(vec_min))) + safe_assert(bA.shape == (2 * len(vec_min) + 1, len(vec_min) + 1)) + safe_assert(vertices.shape[1] == n_par) - return cpt_samples + return centroid # BNs sampler from a CN -def sample_from_cn(cn, exp: str, n_samples: int) -> list: +def sample_from_cn(bn_min, bn_max, n_bns: int) -> list: # Get the DAG and extreme BNs - dag = gum.BayesNet(cn.current_bn()) - bn_min, bn_max = get_min_max_bns(cn, exp) + dag = gum.BayesNet(bn_min) # For each variable ... cpts_dict = {} for var in dag.names(): - # ... sample `n_samples` CPTs from the CN - cpts_dict[var] = sample_from_cpts(bn_min.cpt(var), bn_max.cpt(var), n_samples) + # ... sample `n_bns` CPTs from the CN + cpts_dict[var] = sample_from_cpts(bn_min.cpt(var), bn_max.cpt(var), n_bns) # For each sample ... bns = [] - for i in range(n_samples): + for i in range(n_bns): # ... init an empty BN ... 
bn = gum.BayesNet(dag) @@ -242,11 +242,87 @@ def sample_from_cn(cn, exp: str, n_samples: int) -> list: # Debug safe_assert(len(cpts_dict) == len(dag.names())) - safe_assert(len(bns) == n_samples) + safe_assert(len(bns) == n_bns) return bns +# Sample from two extreme CPTs +def sample_from_cpts(cpt_min, cpt_max, n_bns) -> list: + + # Transform CPTs into pandas dataframes + cpt_min = np.atleast_2d(cpt_min.topandas()) + cpt_max = np.atleast_2d(cpt_max.topandas()) + + # For each row in the CPT ... + credal_dict = {} + for row in range(cpt_min.shape[0]): + + # ... sample `n_bns` points from the credal set + credal_dict[row] = sample_from_cset(cpt_min[row, :], cpt_max[row, :], n_bns) + + # For each sample ... + cpt_samples = [] + for i in range(n_bns): + + # ... build the CPT + cpt = [] + for row in range(cpt_min.shape[0]): + cpt.append(credal_dict[row][i]) + + cpt = np.array(cpt).flatten() + cpt_samples.append(cpt) + + # Debug + safe_assert(cpt_min.shape == cpt_max.shape) + safe_assert(len(credal_dict) == cpt_min.shape[0]) + safe_assert(len(cpt_samples) == n_bns) + + return cpt_samples + + +# Sample from a credal set K(x | pi_x), i.e., a constrained polytope. +def sample_from_cset(vec_min, vec_max, n_bns) -> list: + """ + We assume a credal set is a polytope in a space of #X parameters, defined by a: + - Multi-dimensional rectangle, i.e., inequality constraint Ax <= b, and + - Hyperplane (provided all the variables sum up to 1), i.e., equality constraint A_eq x = b_eq. + This is true if the CN has been learnt by local IDM, for instance. 
+ """ + + # Define the rectangle + n_par = len(vec_min) + A = np.concatenate((np.eye(n_par), -np.eye(n_par)), axis=0) + b = np.concatenate((vec_max, -vec_min)) + rectangle = hopsy.Problem(A=A, b=b) + + # Define the hyperplane + A_eq = np.array([np.ones(n_par)]) + b_eq = np.array([1.0]) + + # Define the polytope as a constrained rectangle (i.e., get the H-representation of the credal set) + constrained_rectangle = hopsy.add_equality_constraints( + rectangle, A_eq=A_eq, b_eq=b_eq + ) + + # Sample from the polytope + mc = hopsy.MarkovChain(constrained_rectangle) + rng = hopsy.RandomNumberGenerator(42) + _, constrained_samples = hopsy.sample(mc, rng, n_bns, thinning=10) + constrained_samples = constrained_samples[0] + + # Debug + safe_assert(np.all(vec_min <= vec_max)) + safe_assert(n_par == len(vec_max)) + safe_assert(n_par == A.shape[1]) + safe_assert(n_par == A_eq.shape[1]) + safe_assert(len(constrained_samples) == n_bns) + for i in constrained_samples: + safe_assert(len(i) == n_par) + + return constrained_samples + + # Check the consistency of a BN as sampled from a CN def check_consistency(bn, bn_min, bn_max) -> bool: @@ -276,3 +352,32 @@ def check_consistency(bn, bn_min, bn_max) -> bool: return False return True + + +# Check BNs sampled from a CN +def are_all_bns_different(bn_vec) -> bool: + + signatures = set() + for bn in bn_vec: + cpt_data = [] + for var in bn.names(): + cpt = bn.cpt(var) + flat = [f"{v:.8f}" for v in cpt.toarray().flatten()] + cpt_data.append(f"{var}:" + ",".join(flat)) + sig = "|".join(cpt_data) + signatures.add(sig) + + print(f"({len(signatures)}/{len(bn_vec)} different BNs.)") + + return len(signatures) == len(bn_vec) + + +# Extract BN min and BN max from a CN +def get_min_max_bns(cn, exp: str): + + with TemporaryDirectory() as tmp_path: + cn.saveBNsMinMax(f"{tmp_path}/bn_min_{exp}.bif", f"{tmp_path}/bn_max_{exp}.bif") + bn_min = gum.loadBN(f"{tmp_path}/bn_min_{exp}.bif") + bn_max = gum.loadBN(f"{tmp_path}/bn_max_{exp}.bif") + + return 
bn_min, bn_max diff --git a/test/cn_privacy/config.yaml b/test/cn_privacy/config.yaml index d05ae95..1144679 100644 --- a/test/cn_privacy/config.yaml +++ b/test/cn_privacy/config.yaml @@ -1,28 +1,27 @@ ## Configuration file # Paths -base_path: test/cn_privacy/output # Base path for output +cur_dir: test/cn_privacy # Current directory (contains all the following) bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs data_path: data # Where to save data as generated from ground-truth BNs -results_path: results # Where to save the experiment results -meta_file: exp_meta.txt # File of metadata +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments # Models -n_nodes_vec: '[10, 15]' # List of models' number of nodes +n_nodes_vec: '[4, 5]' # List of models' number of nodes edge_ratio_vec: '[1, 1.5]' # List of models' edge ratio n_modmax: 2 # Maximum number of variables categories # Data -gpop_ss: 500 # Sample size of general population +gpop_ss: 50 # Sample size of general population rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 2 # Number of data samples # MIA -n_samples: 5 # Number of data samples -n_bns: 5 # Number of BNs to sample within the CN -error: 'np.logspace(-4, 0, 10, endpoint=False)' # Type-I errors vector -ess_vec: '[1, 100]' # List of ESS +error: 'np.logspace(-4, 0, 5, endpoint=False)' # Type-I errors vector # Other -seed: 42 # Global seed num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization diff --git a/test/cn_privacy/test_integration.py b/test/cn_privacy/test_integration.py index 4dca807..fe165d2 100644 --- a/test/cn_privacy/test_integration.py +++ 
b/test/cn_privacy/test_integration.py @@ -1,15 +1,83 @@ -from src.config import get_config -from src.data import generate_randombn -from src.run_exp import run_cn_privacy +import sys +from experiments.cn_privacy import exp, generate -def test_integration(): - # Load config - config = get_config("test/cn_privacy/config.yaml") +def test_generation(): - # Generate BNs and data - generate_randombn(config) + # Generate models and data + generate.main() + + +def test_def_ran_atk_mle(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_mle", "n_bns=5"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_mle(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_mle", "n_bns=5"] + ) + + # Run experiment + exp.main() + + +def test_def_ran_atk_cen(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_cen"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_cen(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_cen"]) + + # Run experiment + exp.main() + + +def test_def_ran_atk_ran(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_ran"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_ran(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_ran"]) + + # Run experiment + exp.main() + + +def test_def_ran_atk_ent(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_ent"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_ent(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_ent"]) # Run experiment - run_cn_privacy(config) + exp.main() diff --git a/test/cn_vs_noisybn/config.yaml b/test/cn_vs_noisybn/config.yaml index 0876424..19cf415 100644 --- a/test/cn_vs_noisybn/config.yaml 
+++ b/test/cn_vs_noisybn/config.yaml @@ -1,35 +1,45 @@ ## Configuration file # Paths -base_path: test/cn_vs_noisybn/output # Base path for output +cur_dir: test/cn_vs_noisybn # Current directory (contains all the following) bns_path: bns # Where to save ground-truth BNs +cns_path: output/cns # Where to save CNs as obtained by def-mec from BNs learnt from pool +atk_path: output/bns_atk # Where to save BNs as obtained by atk-mec from CNs data_path: data # Where to save data as generated from ground-truth BNs -results_path: results # Where to save the experiment results -meta_file: exp_meta.txt # File of metadata +results_path: output/results # Where to save the experiment results +exp_meta: exp_meta.txt # File of metadata for experiments +auc_meta: output/results/auc_meta.csv # File of metadata for AUCs # Models (Naive Bayes) target_var: 'T' # Target variable -n_nodes: 10 # Number of nodes for each BN model +n_nodes: 5 # Number of nodes for each BN model n_modmax: 2 # Maximum number of categories for covariates -n_models: 5 # Number of models to evaluate +n_models: 2 # Number of models to evaluate # Data -gpop_ss: 500 # Sample size of general population +gpop_ss: 50 # Sample size of general population rpop_prop: 0.5 # Sample size of reference population = gpop_ss * rpop_prop pool_prop: 0.25 # Sample size of pool population = gpop_ss * pool_prop +samples: 2 # Number of data samples # MIA -n_samples: 5 # Number of data samples -n_bns: 5 # Number of BNs to sample within the CN -tol: 0.03 # To find eps s.t. |AUC(eps) - AUC(CN)| < tol -error: 'np.logspace(-4, 0, 10, endpoint=False)' # Type-I errors vector -ess_dict: # Eps list to evaluate for each ess - 1: 'np.arange(0.1, 10, 0.5)' - 50: 'np.arange(5e-7, 5e-4, 1e-6)' +error: 'np.logspace(-4, 0, 5, endpoint=False)' # Type-I errors vector + +# Noisy BN +tol: 0.05 # To find eps s.t. 
|AUC(eps) - AUC(CN)| < tol +eps_vec: 'np.logspace(-8, 2, num=10)' # Epsilon to consider for noisy BN # Inferences -n_infer: 5 # Number of inferences to perform +n_infer: 2 # Number of inferences to perform # Other -seed: 42 # Global seed num_cores: 'multiprocessing.cpu_count() - 1' # Number of threads to use for parallelization + +## Notes +# 1) Suggested pairs (ess: eps_vec) for n_nodes=10: +# - 1 : 'np.arange(0.1, 10, 0.1)' +# - 10: 'np.arange(0.1, 10, 0.1)' +# - 20: 'np.arange(0.05, 5, 0.05)' +# - 30: 'np.arange(1e-3, 1, 1e-3)' +# - 40: 'np.arange(5e-6, 1e-2, 5e-6)' +# - 50: 'np.arange(5e-7, 5e-4, 5e-7)' diff --git a/test/cn_vs_noisybn/test_integration.py b/test/cn_vs_noisybn/test_integration.py index 1ef1eea..abfb7c9 100644 --- a/test/cn_vs_noisybn/test_integration.py +++ b/test/cn_vs_noisybn/test_integration.py @@ -1,15 +1,83 @@ -from src.config import get_config -from src.data import generate_naivebayes -from src.run_exp import run_cn_vs_noisybn +import sys +from experiments.cn_vs_noisybn import exp, generate -def test_integration(): - # Load config - config = get_config("test/cn_vs_noisybn/config.yaml") +def test_generation(): - # Generate BNs and data - generate_naivebayes(config) + # Generate models and data + generate.main() + + +def test_def_ran_atk_mle(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_mle", "n_bns=5"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_mle(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_mle", "n_bns=5"] + ) + + # Run experiment + exp.main() + + +def test_def_ran_atk_cen(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_cen"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_cen(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_cen"]) + + # Run experiment + exp.main() + + +def 
test_def_ran_atk_ran(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_ran"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_ran(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_ran"]) + + # Run experiment + exp.main() + + +def test_def_ran_atk_ent(monkeypatch): + + monkeypatch.setattr( + sys, "argv", ["def_mec=def_ran", "delta=0.3", "atk_mec=atk_ent"] + ) + + # Run experiment + exp.main() + + +def test_def_idm_atk_ent(monkeypatch): + + monkeypatch.setattr(sys, "argv", ["def_mec=def_idm", "ess=1", "atk_mec=atk_ent"]) # Run experiment - run_cn_vs_noisybn(config) + exp.main() diff --git a/test/conftest.py b/test/conftest.py new file mode 100644 index 0000000..8948023 --- /dev/null +++ b/test/conftest.py @@ -0,0 +1,8 @@ +import os + +import pytest + + +@pytest.fixture(scope="session", autouse=True) +def enable_test_config(): + os.environ["USE_TEST_CONFIG"] = "1" diff --git a/test/unit/__init__.py b/test/unit/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/test/unit/utils.py b/test/unit/utils.py new file mode 100644 index 0000000..0949ca5 --- /dev/null +++ b/test/unit/utils.py @@ -0,0 +1,12 @@ +import numpy as np + +from src.utils import maxent_cset + + +def test_maxent_cset(): + vec_min = np.array([0.3, 0.4, 0, 0.1]) + vec_max = np.array([0.6, 0.8, 0.12, 0.17]) + + out = maxent_cset(vec_min, vec_max) + + assert np.allclose(out, np.array([0.31, 0.4, 0.12, 0.17]))
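Review note on `test_maxent_cset`: the expected vector can be cross-checked without calling `maxent_cset`. For a credal set given by per-component bounds plus the sum-to-one constraint, the KKT conditions of entropy maximization give a solution of the form `x_i = min(max(lam, vec_min_i), vec_max_i)` for a single level `lam`. The sketch below finds `lam` by bisection. It is not the algorithm used in `src/utils.py`; the name `maxent_by_bisection` and its `tol` and `max_iter` parameters are hypothetical, introduced only for this check.

```python
import numpy as np


def maxent_by_bisection(vec_min, vec_max, tol=1e-12, max_iter=200):
    """Hypothetical helper (not part of src/utils.py).

    Returns the max-entropy point of
    {x : vec_min <= x <= vec_max, sum(x) = 1}.
    """
    # Work on float copies so the caller's arrays are never modified
    vec_min = np.array(vec_min, dtype=float)
    vec_max = np.array(vec_max, dtype=float)

    # The set is non-empty only if the bounds bracket the simplex
    if np.any(vec_min > vec_max) or vec_min.sum() > 1 or vec_max.sum() < 1:
        raise ValueError("Empty credal set")

    # sum(clip(lam)) is non-decreasing in lam, equals sum(vec_min) at the
    # lowest bound and sum(vec_max) at the highest, so bisection applies
    lo, hi = vec_min.min(), vec_max.max()
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        total = np.minimum(np.maximum(lam, vec_min), vec_max).sum()
        if abs(total - 1.0) < tol:
            break
        if total < 1.0:
            lo = lam
        else:
            hi = lam

    # Clip the common level to each component's interval
    return np.minimum(np.maximum(lam, vec_min), vec_max)
```

On the bounds used in `test_maxent_cset`, the level is `lam = 0.31`: the first component stays at 0.31, the second is held at its lower bound 0.4, and the last two are capped at 0.12 and 0.17, which matches the vector the test expects. Because the sketch copies its inputs, it also leaves `vec_min` unchanged after the call.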