somkit is a Python library for Self-Organizing Maps.
- Sequential training via train_pak: rlen total steps, per-step linear decay (alpha to 0, radius to 1), bubble/gaussian neighborhood, linear/inverse_t alpha schedule
- Two-phase training (coarse ordering + fine tuning) via train_two_phase or two train_pak calls
- Best-map selection via SOMTrainer.vfind: train n_trials maps with different seeds, keep the lowest quantization error
- Per-sample BMU / quantization error via compute_visual / write_vis (.vis output)
- Training snapshots: save the codebook every N steps as numbered .cod files
- File interchange: read/write .cod codebook files (load_cod / save_cod), load .dat data files (load_som_pak_data)
- Per-sample metadata via SOMData: missing-component masks (x fields), sample weights, fixed-point BMUs
- vcal-equivalent label calibration (calibrate_labels)
- Reproducible seeding via OrandRNG (the orand linear congruential generator)
- Multiple topology support: hexagonal and rectangular
- Visualization:
  - U-Matrix (interpolated grid with per-unit calibration labels)
  - Component planes, hit map, class distribution map
  - Sammon's Mapping projection
- Evaluation metrics: quantization error, WCSS, Silhouette Score, Topological Error
git clone https://github.com/remokasu/somkit.git
cd somkit
pip install -e .

The canonical usage is two-phase training (coarse ordering then fine tuning). The example below uses the Iris dataset via a DatasetWrapper, which adapts any object with .data, .target, and .target_names attributes.
import somkit
from sklearn.datasets import load_iris
data = load_iris()
som = somkit.create_trainer(
    data=data,             # sklearn Bunch, DatasetWrapper, ndarray, or SOMData
    size=(10, 10),
    learning_rate=0.05,
    topology="hexagonal",  # or "rectangular"
)
som.initialize_weights_randomly() # randinit: per-component [min, max] range
# Two-phase training
som.train_two_phase(
    phase1=dict(rlen=1000, alpha=0.05, radius=10.0, neighborhood="bubble", seed=1),
    phase2=dict(rlen=10000, alpha=0.02, radius=3.0, neighborhood="bubble", seed=1),
)
# Evaluate
evaluator = somkit.SOMEvaluator(som)
print("WCSS:", evaluator.calculate_wcss())
print("Silhouette Score:", evaluator.calculate_silhouette_score())
print("Topological Error:", evaluator.calculate_topological_error())
# Visualize
visualizer = somkit.SOMVisualizer(som)
visualizer.plot_umatrix() # umat style with vcal labels
visualizer.plot_component_planes()
visualizer.plot_hit_map()
visualizer.plot_class_distribution()

The core algorithms and the .cod / .dat / .vis file formats are tested against reference outputs of SOM_PAK 3.1, the original SOM implementation by the Kohonen lab. The reference files live under test/golden/, and the test suite compares somkit's outputs against them; see the tests there for exactly what is covered and with which tolerances.
train_pak runs rlen total steps, presenting one sample per step by cycling the data with optional shuffling. Learning rate and radius decay per step:
- alpha decays linearly from the initial value to 0
- radius decays linearly from the initial value to 1 (floor)
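The linear schedule described above can be written out as a short sketch. The function names are hypothetical, and the formulas are an assumption derived from the description (alpha reaches 0 and radius reaches 1 at step rlen); somkit's internal code may differ in detail.

```python
def linear_alpha(step, rlen, alpha0):
    """Learning rate at a given step: alpha0 at step 0, 0 at step rlen."""
    return alpha0 * (rlen - step) / rlen


def linear_radius(step, rlen, radius0):
    """Neighborhood radius at a given step: radius0 at step 0, 1 at step rlen."""
    return 1.0 + (radius0 - 1.0) * (rlen - step) / rlen
```

With rlen=10000, alpha=0.02 and radius=3.0, the halfway point (step 5000) gives a learning rate of about 0.01 and a radius of 2.0 under these assumptions.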
som.train_pak(
    rlen=10000,             # total steps
    alpha=0.02,             # initial learning rate
    radius=3.0,             # initial neighborhood radius (decays to 1)
    alpha_type="linear",    # "linear" (default) or "inverse_t"
    neighborhood="bubble",  # "bubble" (default) or "gaussian"
    seed=1,                 # seed for the sample presentation order
)

The classic SOM schedule trains in two phases: a short coarse-ordering run with a large radius, followed by a longer fine-tuning run with a small radius. train_two_phase is syntactic sugar that calls train_pak twice on the same trainer, so phase 2 continues from phase 1's weights.
som.train_two_phase(
    phase1=dict(rlen=1000, alpha=0.05, radius=10.0, neighborhood="bubble", seed=1),
    phase2=dict(rlen=10000, alpha=0.02, radius=3.0, neighborhood="bubble", seed=1),
)

The examples/animal.py file uses these exact parameters.
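The idea that phase 2 continues from phase 1's weights can be illustrated with a toy NumPy SOM. This is a minimal sketch, not somkit's implementation: train_sequential is a hypothetical function, the grid is rectangular, the neighborhood is a plain bubble, and the smaller rlen values are chosen only to keep the example fast.

```python
import numpy as np


def train_sequential(weights, data, rlen, alpha, radius, seed=1):
    """Toy sequential SOM training: rectangular grid, bubble neighborhood."""
    rows, cols, _ = weights.shape
    # Grid coordinates (row, col) of every unit, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    order = np.random.default_rng(seed).permutation(len(data))
    for t in range(rlen):
        frac = (rlen - t) / rlen
        a = alpha * frac                      # learning rate decays to 0
        r = 1.0 + (radius - 1.0) * frac       # radius decays to 1
        x = data[order[t % len(data)]]        # cycle through the shuffled data
        d2 = ((weights - x) ** 2).sum(axis=-1)
        bmu = np.unravel_index(np.argmin(d2), d2.shape)
        inside = np.linalg.norm(grid - np.array(bmu), axis=-1) <= r
        weights[inside] += a * (x - weights[inside])
    return weights


rng = np.random.default_rng(0)
data = rng.random((150, 4))
weights = rng.random((6, 6, 4))
# Phase 1: coarse ordering with a large radius.
weights = train_sequential(weights, data, rlen=200, alpha=0.05, radius=6.0)
# Phase 2: fine tuning, starting from the phase 1 weights.
weights = train_sequential(weights, data, rlen=2000, alpha=0.02, radius=2.0)
```

Because the second call receives the array returned by the first, the fine-tuning run refines the ordering found in phase 1 instead of starting over.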
SOM results depend on the random initialization, so a standard workflow trains several maps with different seeds and keeps the one with the smallest quantization error. SOMTrainer.vfind runs that loop:
best = somkit.SOMTrainer.vfind(
    data, (10, 10),
    phase1=dict(rlen=1000, alpha=0.05, radius=10.0),
    phase2=dict(rlen=10000, alpha=0.02, radius=3.0),
    n_trials=5,      # seeds 1..5 (or pass seeds=[...])
    test_data=None,  # None evaluates on the training data
)
best.vfind_best_seed    # winning seed
best.vfind_best_qerror  # its mean per-sample quantization error
best.vfind_qerrors      # {seed: qerror} for every trial

Each trial's quantization error is logged at INFO level.
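The selection loop itself is simple enough to sketch in plain NumPy. The names below (mean_quantization_error, pick_best, toy_train) are hypothetical; toy_train merely samples data points as a codebook and stands in for a real training run, so this shows the selection logic only and is not how vfind is implemented.

```python
import numpy as np


def mean_quantization_error(data, codebook):
    """Mean distance from each sample to its nearest codebook vector."""
    # Pairwise distances, shape (n_samples, n_units).
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=-1)
    return float(dists.min(axis=1).mean())


def pick_best(train_fn, data, seeds, test_data=None):
    """Train one codebook per seed and keep the one with the lowest error."""
    eval_data = data if test_data is None else test_data
    qerrors = {}
    best_seed, best_codebook = None, None
    for seed in seeds:
        codebook = train_fn(data, seed)
        qerrors[seed] = mean_quantization_error(eval_data, codebook)
        if best_seed is None or qerrors[seed] < qerrors[best_seed]:
            best_seed, best_codebook = seed, codebook
    return best_seed, best_codebook, qerrors


def toy_train(data, seed, n_units=9):
    """Hypothetical stand-in for training: pick n_units samples as the codebook."""
    rng = np.random.default_rng(seed)
    return data[rng.choice(len(data), size=n_units, replace=False)]


data = np.random.default_rng(0).random((200, 3))
best_seed, best_codebook, qerrors = pick_best(toy_train, data, seeds=range(1, 6))
```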
train_pak can save the codebook every N steps, producing numbered .cod files for convergence analysis or animation:
som.train_pak(
    rlen=10000, alpha=0.05, radius=10.0,
    snapshot_interval=1000,
    snapshot_path="run/map.cod",  # -> run/map_01000.cod, run/map_02000.cod, ...
)

For a trained map, compute_visual returns each sample's best matching unit and its quantization error; write_vis saves the same data as a .vis file:
res = som.compute_visual() # VisualResult: coords (n,2), qerrors (n,), labels
som.write_vis("result.vis")  # .vis output

Each .vis line is x y qerror [label], where the label is the BMU unit's calibrated label. A fully masked sample (no valid components) is written as -1 -1 -1.
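Reading such lines back is straightforward. The sketch below assumes integer unit coordinates and whitespace-separated tokens, and it handles only per-sample lines (any header line a .vis file may carry is not covered). parse_vis_line and is_masked are hypothetical helpers, not part of somkit.

```python
def parse_vis_line(line):
    """Parse one 'x y qerror [label]' line into (x, y, qerror, label)."""
    tokens = line.split()
    x, y = int(tokens[0]), int(tokens[1])
    qerror = float(tokens[2])
    label = tokens[3] if len(tokens) > 3 else None
    return x, y, qerror, label


def is_masked(record):
    """True for the '-1 -1 -1' record written for a fully masked sample."""
    x, y, qerror, _ = record
    return x == -1 and y == -1 and qerror == -1.0
```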
load_som_pak_data reads .dat files (the SOM_PAK data format) and returns a DatasetWrapper (compatible with create_trainer). Each row holds space-separated feature values; a trailing label token is optional.
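As an illustration of the row layout, here is a minimal sketch of parsing a single data row. parse_dat_row is a hypothetical helper, not somkit's loader; it assumes the feature dimension is already known (in SOM_PAK files it comes from the header line) and that a missing component is written as the token x.

```python
import numpy as np


def parse_dat_row(line, dim):
    """Split one data row into (values, mask, label)."""
    tokens = line.split()
    values = np.zeros(dim)
    mask = np.zeros(dim, dtype=bool)  # True = component is missing
    for i, token in enumerate(tokens[:dim]):
        if token == "x":
            mask[i] = True
        else:
            values[i] = float(token)
    # Anything after the dim feature tokens is treated as the optional label.
    label = tokens[dim] if len(tokens) > dim else None
    return values, mask, label
```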
data = somkit.load_som_pak_data("animal.dat")

To use features such as missing-component masks (x fields in .dat), per-sample learning weights, or fixed-point BMUs, pass a SOMData container to create_trainer.
import numpy as np
from somkit.data_loader import SOMData
X = np.random.rand(100, 8)
# Mask specific components for specific samples (True = ignore)
mask = np.zeros((100, 8), dtype=bool)
mask[5, 2] = True # sample 5, component 2 is missing
# Per-sample learning weight
sample_weights = np.ones(100)
sample_weights[10] = 2.0 # sample 10 counts double
sdata = SOMData(data=X, mask=mask, weights=sample_weights)
som = somkit.create_trainer(data=sdata, size=(10, 10), learning_rate=0.05)
som.initialize_weights_randomly()
som.train_pak(rlen=5000, alpha=0.05, radius=5.0, neighborhood="bubble", seed=1)

SOMData also accepts fixed / fixed_valid to force the BMU for specific samples.
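To make the effect of a mask concrete, the sketch below searches for the best matching unit while skipping masked components. It assumes squared Euclidean distance over the valid components only; masked_bmu is a hypothetical function and somkit's own search may differ.

```python
import numpy as np


def masked_bmu(sample, sample_mask, codebook):
    """Index of the best matching unit, ignoring masked components.

    sample:      (dim,) feature vector
    sample_mask: (dim,) bool array, True = ignore this component
    codebook:    (n_units, dim) weight vectors
    Returns None when every component is masked.
    """
    valid = ~sample_mask
    if not valid.any():
        return None
    diff = codebook[:, valid] - sample[valid]
    return int(np.argmin((diff ** 2).sum(axis=1)))
```

A fully masked sample has no valid components to compare, which is why it yields no BMU here.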
After training, save the codebook. Labels from calibrate_labels can be embedded in the file, equivalent to vcal output.
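Majority-label calibration amounts to counting, per unit, the labels of the samples that map to it. The sketch below is an assumption about that logic, not the calibrate_labels implementation; majority_labels and its arguments are hypothetical.

```python
from collections import Counter, defaultdict


def majority_labels(bmus, labels, numlabs=1):
    """Map each unit to its most frequent sample labels.

    bmus:    iterable of (x, y) best-matching-unit coordinates, one per sample
    labels:  iterable of sample labels, aligned with bmus
    numlabs: maximum number of labels kept per unit (0 keeps all)
    """
    counts = defaultdict(Counter)
    for unit, label in zip(bmus, labels):
        counts[tuple(unit)][label] += 1
    limit = numlabs if numlabs > 0 else None
    return {
        unit: [lab for lab, _ in counter.most_common(limit)]
        for unit, counter in counts.items()
    }
```

Units that no sample maps to simply do not appear in the result.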
labels = som.calibrate_labels(numlabs=1) # majority label per unit
som.save_cod("result.cod", neigh="bubble", labels=labels)

som = somkit.SOMTrainer.load_cod("result.cod")
# attach data before further training or evaluation
som.set_data(data)

The lower-level read_cod / write_cod are also available as standalone functions:
header, weights = somkit.read_cod("result.cod")
somkit.write_cod("copy.cod", weights, topol="hexa", neigh="bubble")

initialize_weights_randomly initializes each component from the per-component [min, max] range of the training data.
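A minimal sketch of this range-based initialization follows. It uses NumPy's default generator rather than OrandRNG, so it will not reproduce somkit's values; random_init is a hypothetical function.

```python
import numpy as np


def random_init(data, size, seed=None):
    """Draw each weight component uniformly from that feature's [min, max] range."""
    rows, cols = size
    lo = data.min(axis=0)  # per-component minimum
    hi = data.max(axis=0)  # per-component maximum
    rng = np.random.default_rng(seed)
    return rng.uniform(lo, hi, size=(rows, cols, data.shape[1]))
```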
som.initialize_weights_randomly() # default
som.initialize_weights_randomly(rng=somkit.functions.OrandRNG(seed=42))  # explicit seed

initialize_weights_linearly initializes weights in the subspace spanned by the two largest principal components of the data, arranged in a linear grid.
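One way to picture linear initialization is the sketch below, which places units on a regular grid in the plane of the two leading principal components. The scaling (one standard deviation along each component, coefficients from -1 to 1) is an assumption made for illustration, and linear_init is a hypothetical function; somkit's exact scaling may differ.

```python
import numpy as np


def linear_init(data, size):
    """Place units on a regular grid in the plane of the two leading PCs."""
    rows, cols = size
    mean = data.mean(axis=0)
    # Rows of vt are principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(data - mean, full_matrices=False)
    # Standard deviation of the data along the two leading directions.
    scale = s[:2] / np.sqrt(len(data) - 1)
    a = np.linspace(-1.0, 1.0, rows)[:, None, None]  # coefficient along PC1
    b = np.linspace(-1.0, 1.0, cols)[None, :, None]  # coefficient along PC2
    return mean + a * scale[0] * vt[0] + b * scale[1] * vt[1]
```

Every weight vector produced this way lies in a two-dimensional plane through the data mean, so the map starts out already ordered.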
som.initialize_weights_linearly()

Normalization is opt-in:
som.normalize_data(method="standard") # Z-score (mean=0, std=1)
som.normalize_data(method="minmax") # scale to [0, 1]
som.normalize_data(method="variance")  # divide by std, preserve mean

Both hexagonal and rectangular topologies are supported. Topology affects BMU search distance, neighborhood function, and visualization grid shape.
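The difference between the two lattices shows up in the distance between units. The sketch below assumes the common hexagonal layout in which odd rows are shifted by half a unit and rows sit sqrt(3)/2 apart; unit_position and grid_distance are hypothetical helpers, and somkit's topology classes may use a different convention.

```python
import math


def unit_position(x, y, topology="hexagonal"):
    """Planar position of the unit in column x, row y."""
    if topology == "rectangular":
        return float(x), float(y)
    # Hexagonal: odd rows shift right by 0.5, rows are sqrt(3)/2 apart.
    return x + 0.5 * (y % 2), y * math.sqrt(3) / 2


def grid_distance(a, b, topology="hexagonal"):
    """Euclidean distance between two units on the map lattice."""
    ax, ay = unit_position(*a, topology)
    bx, by = unit_position(*b, topology)
    return math.hypot(ax - bx, ay - by)
```

Under this layout a unit and the unit directly below it are exactly one step apart on a hexagonal map, whereas on a rectangular map the diagonal neighbor is sqrt(2) away.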
som = somkit.create_trainer(data=data, size=(10, 10), learning_rate=0.05, topology="hexagonal")
som = somkit.create_trainer(data=data, size=(10, 10), learning_rate=0.05, topology="rectangular")

All visualization methods are on SOMVisualizer. Every map shares the same grid orientation (row 0 at top), so the same unit appears at the same position across all plot types.
visualizer = somkit.SOMVisualizer(som)
visualizer.plot_umatrix() # umat style (default)
visualizer.plot_umatrix(
    show_labels=True,        # vcal majority labels per unit
    numlabs=1,               # max labels per unit (0 = all)
    show_nodes=True,         # dot on units with no label
    file_name="umatrix.png",
    show=False,
)

The U-Matrix uses the (2*x-1, 2*y-1) interpolated grid: cells between units show the inter-unit distance, and the darker walls mark cluster boundaries.
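The interpolated grid can be sketched for the rectangular case as follows. This is an illustration under stated assumptions, not compute_umatrix_pak: umatrix_rect is a hypothetical function, diagonal cells take the mean of the two diagonal distances, and unit cells take the mean of their adjacent cells.

```python
import numpy as np


def umatrix_rect(weights):
    """U-Matrix on a (2*rows-1, 2*cols-1) grid for a rectangular map."""
    rows, cols, _ = weights.shape
    u = np.zeros((2 * rows - 1, 2 * cols - 1))

    def dist(p, q):
        return float(np.linalg.norm(weights[p] - weights[q]))

    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:  # cell between horizontal neighbors
                u[2 * i, 2 * j + 1] = dist((i, j), (i, j + 1))
            if i + 1 < rows:  # cell between vertical neighbors
                u[2 * i + 1, 2 * j] = dist((i, j), (i + 1, j))
            if i + 1 < rows and j + 1 < cols:  # diagonal cell
                u[2 * i + 1, 2 * j + 1] = 0.5 * (
                    dist((i, j), (i + 1, j + 1)) + dist((i, j + 1), (i + 1, j))
                )
    # Unit cells (even row, even column): mean of the adjacent cells.
    for i in range(rows):
        for j in range(cols):
            r, c = 2 * i, 2 * j
            near = [
                u[rr, cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < u.shape[0] and 0 <= cc < u.shape[1]
            ]
            u[r, c] = np.mean(near) if near else 0.0
    return u
```

Large values in the cells between two units indicate that their weight vectors are far apart, which is what produces the visible walls between clusters.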
visualizer.plot_component_planes(file_name="planes.png", show=False)
visualizer.plot_hit_map(file_name="hitmap.png", show=False)
visualizer.plot_class_distribution(file_name="classes.png", show=False)

plot_sammon_projection projects high-dimensional data and SOM nodes to 2D using Sammon's mapping. The projection aims to preserve pairwise inter-point distances, providing a topology-independent view of the data structure.
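How well a projection preserves distances is measured by Sammon's stress, which Sammon's mapping minimizes. The sketch below computes that stress for a given pair of point sets in NumPy; it does not perform the optimization, and the function names are hypothetical rather than part of somkit's projection module.

```python
import numpy as np


def pairwise_distances(points):
    """Condensed vector of Euclidean distances between all pairs of rows."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    iu = np.triu_indices(len(points), k=1)
    return dist[iu]


def sammon_stress(high, low, eps=1e-12):
    """Sammon's stress between original points and their projection."""
    d_high = np.maximum(pairwise_distances(high), eps)  # guard against duplicates
    d_low = pairwise_distances(low)
    return float(np.sum((d_high - d_low) ** 2 / d_high) / np.sum(d_high))
```

A stress of 0 means every pairwise distance is reproduced exactly; dividing each term by the original distance gives small distances more weight than a plain least-squares fit would.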
visualizer.plot_sammon_projection(
    show_nodes=True,
    show_data_points=True,
    show_connections=True,
    connection_style="spring",  # "spring" (thickness ~ distance) or "line"
    colormap="tab10",           # auto-switches to tab20 for >10 classes
    max_iter=500,
    learning_rate=0.2,
    random_state=42,
    file_name="sammon.png",
    show=False,
)

evaluator = somkit.SOMEvaluator(som)
print("WCSS:", evaluator.calculate_wcss())
print("Silhouette Score:", evaluator.calculate_silhouette_score())
print("Topological Error:", evaluator.calculate_topological_error())

som.save_model("my_som.h5")  # somkit native HDF5 checkpoint

For .cod output, use save_cod instead (see Codebook I/O).
All examples are in examples/. Run from the examples/ directory.
cd examples
python animal.py
python iris.py
python breast_cancer.py
python digits.py
python wine.py

somkit/
  somkit/
    trainer/        # SOMTrainer, create_trainer, train_pak, train_two_phase
    functions/      # neighborhood, decay, rng (OrandRNG), learning, initialization
    data_loader/    # SOMData, SOMPakDataLoader, load_som_pak_data
    io/             # cod.py: read_cod, write_cod
    visualizer/     # SOMVisualizer, compute_umatrix_pak
    evaluator/      # SOMEvaluator
    topology/       # HexagonalTopology, RectangularTopology
    preprocessing/  # normalization
    projection/     # Sammon's mapping
    decomposition/  # PCA
  examples/
  test/



