GitHub - nvidia-isaac/cuNLS: Nvidia Isaac CUDA-Accelerated Nonlinear Least Square Solver

GPU-Accelerated Nonlinear Least-Squares Solver

CUDA/C++ · Gauss-Newton · Factor Graph · Manifold Optimization · Sparse Linear Algebra

cuNLS is a CUDA/C++ library for solving nonlinear least-squares problems on the GPU. It is built around batched factor evaluation, sparse Jacobian assembly, and sparse linear solvers — designed for large-scale geometric estimation workloads such as bundle adjustment, pose graph optimization, and ICP-style alignment.

cuNLS solves optimization problems of the form:

$$x^* = \arg\min_x \sum_i \rho_i\!\left(\left\|f_i(x)\right\|^2_{\Sigma_i}\right)$$

where $x$ is the optimization variable (often on a manifold), $f_i(x)$ are residual functions, $\rho_i(\cdot)$ are optional robust loss functions, and $\left\|v\right\|^2_{\Sigma} = v^T \Sigma^{-1} v$ is the squared Mahalanobis norm.

Gallery

cuNLS refining two large estimation problems, one Gauss-Newton/LM iteration per frame:

Left — Sphere pose-graph refinement. 200k points that should lie on a sphere are connected by relative ("between") constraints and start as a disturbed blob. cuNLS drives them back onto the sphere; color encodes per-point error, cooling from hot to calm as the solve converges.
Right — Kepler orbit fitting. A family of orbits is observed as noisy 3-D points; cuNLS estimates the five Keplerian elements per orbit (custom NVIDIA Warp factor on an $\mathbb{R}^5$ state) and a jittered tangle of loops organizes into a crisp nested rosette of tilted ellipses.

Features

| Category | Details |
| --- | --- |
| Manifold support | SO(2), SO(3), SE(2), SE(3), Sim(2), Sim(3), SL(4), Euclidean vectors |
| Solvers | Gauss-Newton, Levenberg-Marquardt with adaptive damping |
| Robust losses | Huber, Cauchy, Arctan, SoftL1, Tolerant, Tukey, Scaled |
| Built-in factors | Reprojection, PnP, between (SO(2)/SO(3)/SE(2)/SE(3)/Sim(2)/Sim(3)/SL(4)/vector), point-to-point, point-to-plane, symmetric point-to-plane, prior |
| Custom factors | User-defined CUDA kernels via `SizedFactorBatch` |
| Linear solver | Block-sparse PCG (variable block-Jacobi preconditioner, default), NVIDIA cuDSS, dense LDLT, dense Cholesky (cuSOLVER), dense QR (cuSOLVER) |
| Safety checks | Optional runtime validation (linear-solver diagnostics and more); disable via `MinimizerOptions::disable_safety_checks` for low-latency solves |
| Execution model | Fully asynchronous via CUDA streams |

Prerequisites

NVIDIA GPU with compatible driver
CUDA Toolkit (nvcc, cudart, cuBLAS, cuSPARSE, cuSOLVER)
CMake >= 3.24
C++17 compiler
GNU Make

Installation

Build locally

./scripts/build_cunls.sh <build_dir> <Release|Coverage> [install_dir]

Example — release build with install:

./scripts/build_cunls.sh build Release /tmp/cunls_install

Build with Docker

Install the NVIDIA Container Toolkit.
Run:

./scripts/build_cunls_in_docker.sh <Release|Coverage> [local_install_dir]

The Docker build produces both shared and static variants. Intermediate build directories live inside the container and are discarded; only the final install directory is mounted to the host.

Install artifacts (default build_docker/, or the specified directory):

<install_dir>/
  include/cunls/        # headers
  lib/
    libcunls.so         # shared library
    libcunls.a          # static library (with bundled deps)
    cmake/cunls/        # CMake package config

Direct CMake build

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/tmp/cunls_install
cmake --build build -j
cmake --install build

By default this builds a shared library. Pass -DBUILD_SHARED_LIBS=OFF to build a static library instead.

Quick Start

The following minimal program solves a 1-D prior problem: a scalar variable $x$ pulled toward a target $o = 2$.

main.cpp

#include <cuda_runtime.h>
#include <iostream>
#include <vector>
#include "cunls/cunls.h"

int main() {
  // CUDA stream on which the solve is issued.
  cudaStream_t stream = nullptr;
  cudaStreamCreate(&stream);

  // Host-side initial guess x = 0 and observation o = 2.
  std::vector<float> h_state = {0.0f};
  std::vector<float> h_obs   = {2.0f};

  // Device-side buffers initialized from the host vectors.
  cunls::dvector<float> d_state(h_state);
  cunls::dvector<float> d_obs(h_obs);

  // One 1-D Euclidean state block and one prior factor pulling it toward o.
  cunls::VectorStateBatch<1> state_batch(d_state.data(), 1);
  cunls::PriorVectorFactorBatch<1> prior(
      reinterpret_cast<const cunls::Vector<1>*>(d_obs.data()), 1);

  // Device pointer of the state block that the factor is attached to.
  std::vector<float*> state_ptrs = {state_batch.StateBlockDevicePtr(0)};

  // Assemble the problem: register the state batch, then the factor batch.
  cunls::Problem problem;
  problem.AddStateBatch(&state_batch);
  problem.AddFactorBatch(&prior, state_ptrs);

  // Run Levenberg-Marquardt with default-constructed settings.
  cunls::LevenbergMarquardtMinimizer minimizer;
  auto summary = minimizer.Minimize(stream, problem);

  std::cout << "Iterations: "   << summary.num_iterations << "\n";
  std::cout << "Initial cost: " << summary.initial_cost   << "\n";
  std::cout << "Final cost: "   << summary.final_cost     << "\n";

  cudaStreamDestroy(stream);
  return 0;
}

CMakeLists.txt

cmake_minimum_required(VERSION 3.24)
project(cunls_quick_start LANGUAGES CXX CUDA)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

if(NOT DEFINED CUNLS_INSTALL_DIR)
  message(FATAL_ERROR "Set CUNLS_INSTALL_DIR to cuNLS install prefix.")
endif()

find_package(CUDAToolkit REQUIRED)
find_library(CUNLS_LIBRARY cunls PATHS "${CUNLS_INSTALL_DIR}/lib" REQUIRED NO_DEFAULT_PATH)

add_executable(minimal main.cpp)
target_include_directories(minimal PRIVATE "${CUNLS_INSTALL_DIR}/include")
target_link_libraries(minimal PRIVATE "${CUNLS_LIBRARY}" CUDA::cudart)
set_target_properties(minimal PROPERTIES
  BUILD_RPATH "${CUNLS_INSTALL_DIR}/lib"
  INSTALL_RPATH "${CUNLS_INSTALL_DIR}/lib"
)

Build and run:

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DCUNLS_INSTALL_DIR=/tmp/cunls_install
cmake --build build -j
./build/minimal
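
As a sanity check on what the program should report, the 1-D prior problem can be worked by hand. This assumes the prior residual is $f(x) = x - o$ with unit covariance and no robust loss. The README does not state the sign convention or whether the cost carries a factor of $\tfrac{1}{2}$, so treat the constants as assumptions.

$$
f(x) = x - o, \qquad J = \frac{\partial f}{\partial x} = 1, \qquad
\Delta x = -\frac{J^T f(x_0)}{J^T J + \lambda} = -\frac{x_0 - o}{1 + \lambda}
$$

With $x_0 = 0$ and $o = 2$, an undamped Gauss-Newton step ($\lambda = 0$) gives $\Delta x = 2$ and lands on $x = 2$ in one iteration. With damping $\lambda > 0$, each step covers a fraction $1/(1+\lambda)$ of the remaining distance. Under these assumptions the initial cost is proportional to $(0 - 2)^2 = 4$ and the final cost should be close to zero.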

Tutorial Examples

The examples/ directory contains complete working pipelines:

| Example | Description | Key API |
| --- | --- | --- |
| Sparse Bundle Adjustment | Jointly optimize camera poses and 3D landmarks from multi-view reprojection error | `ReprojectionFactorBatch`, `SE3StateBatch`, `VectorStateBatch<3>` |
| Pose Graph Optimization | Recover a chain of SE(3) poses from consecutive relative-transform measurements | `SE3BetweenFactorBatch`, `SE3StateBatch` |
| Custom Factor | User-defined CUDA kernel for a 1-D difference chain | `SizedFactorBatch<1,1,1>`, `PriorVectorFactorBatch<1>` |

Build all examples:

cmake -S examples -B build/examples/all \
  -DCMAKE_BUILD_TYPE=Release \
  -DCUNLS_INSTALL_DIR=/path/to/cunls_install
cmake --build build/examples/all -j

Or build in Docker:

./examples/build_in_docker.sh Release ./artifacts/examples

Python Bindings (pycunls)

pycunls exposes cuNLS to Python via nanobind, with first-class CuPy interop and optional NVIDIA Warp support for writing custom factor kernels in Python.

Build the wheel in Docker

Requires Docker with the NVIDIA Container Toolkit.

./scripts/build_pycunls_in_docker.sh [local_output_dir]

The output directory defaults to ./dist. The script builds the wheel inside a container with the source mounted read-only. Intermediate build directories live inside the container and are discarded; only the final .whl file is written to the host output directory.

Build the wheel locally

cd python
pip install scikit-build-core nanobind
pip wheel . --no-build-isolation --no-deps --wheel-dir ../dist

Install the wheel

pip install ./dist/pycunls-*.whl

Editable install for development

For an editable (in-place) install that reflects source changes without rebuilding:

cd python
pip install scikit-build-core nanobind
pip install -e ".[test]" --no-build-isolation

This installs pycunls along with all test dependencies (pytest, cupy-cuda12x, warp-lang). Other optional dependency groups:

pip install -e ".[warp]"   # warp-lang only
pip install -e ".[all]"    # all optional extras

Run Python tests

pytest -v python/tests

Python examples

The python/examples/ directory contains end-to-end pipelines using pycunls:

| Example | Description |
| --- | --- |
| `sparse_bundle_adjustment.py` | Joint camera-pose and landmark optimization with CuPy |
| `pose_graph_optimization.py` | SE(3) pose-graph optimization with CuPy |
| `custom_warp_factor.py` | Custom factor kernel using NVIDIA Warp |
| `custom_warp_state.py` | Custom state batch (positive-scalar manifold) using NVIDIA Warp |

C++ Testing

cmake -S . -B build/tests -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=ON
cmake --build build/tests -j
ctest --test-dir build/tests --output-on-failure

Or run the test binary directly:

./build/tests/bin/nls_tests

Coverage build:

./scripts/build_cunls.sh build/coverage Coverage

Building Documentation

python -m pip install -r docs/sphinx/requirements.txt
python -m sphinx -b html docs/sphinx docs/sphinx/_build

Or build in Docker:

bash docs/build_in_docker.sh [output_dir]

Code Style

cuNLS follows the Google C++ Style Guide and uses a pre-commit hook for auto-formatting.

sudo apt install pre-commit
pre-commit install

To manually reformat:

sudo apt install clang-format
find . -iname '*.h' -o -iname '*.cpp' | xargs clang-format -i

License

cuNLS is licensed under the Apache License 2.0. Third-party license notices are in NOTICE and third_party/LICENSES/.
