This repository was archived by the owner on Nov 26, 2025. It is now read-only.

RuntimeError: CUDA error: device-side assert triggered #53

@Mahhos


Hello! I have built a pipeline with the torchFastText library, and it works fine with 1M rows of training data. With a larger dataset (~5M rows), however, I keep getting RuntimeError: CUDA error: device-side assert triggered, and I cannot work out why it happens. The LabelEncoder step looks fine, as shown below. Do you have any idea what might cause this issue?
I am running this in a Databricks notebook on a cluster with a g5.16xlarge instance (256 GB of RAM, 64 cores).

Number of unique classes: 10285
Min label index: 0
Max label index: 10284


X_train shape: (4000000, 1)
X_train dtype: object
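
The shape and dtype alone do not show whether the larger dataset contains degenerate text rows. Casting an object column with astype(str) turns missing values into the literal strings "nan" or "None", and empty strings pass through unchanged. A minimal sketch for counting such rows follows. Whether such rows matter to torchFastText is an assumption, not something confirmed here, and `count_degenerate_texts` is a hypothetical helper name:

```python
import numpy as np

def count_degenerate_texts(texts, placeholders=("nan", "None")):
    """Count rows that are empty, whitespace-only, or stringified missing values."""
    # Flatten so both (n,) and (n, 1) shaped arrays are accepted.
    flat = np.asarray(texts, dtype=object).ravel()
    stripped = [str(t).strip() for t in flat]
    n_empty = sum(1 for s in stripped if s == "")
    n_placeholder = sum(1 for s in stripped if s in placeholders)
    return {"empty": n_empty, "placeholder": n_placeholder, "total": len(stripped)}

# Toy data shaped like X_train in this issue: one text column.
X_demo = np.array(
    ["some text", "", "   ", "nan", "None", "ok"], dtype=object
).reshape(-1, 1)
report = count_degenerate_texts(X_demo)
print(report)
```

Running the same function on X_train and X_val would show whether the 5M-row data contains rows of this kind that the 1M-row data did not.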

Here is how I am reading and training the model:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torchFastText import torchFastText
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import torch
import os


# Load data
train_df = pd.read_parquet("/Volumes/train_5M.parquet")
test_df = pd.read_parquet('/Volumes/test.parquet')

# Fit LabelEncoder on training labels
encoder = LabelEncoder()
encoder.fit(train_df["label"])

# Memory-efficient split using pandas sample
val_df = train_df.sample(frac=0.2, random_state=42)
train_df_small = train_df.drop(val_df.index)

# Prepare training data
X_train = train_df_small["text"].astype(str).values.reshape(-1, 1)
y_train = encoder.transform(train_df_small["label"])

# Prepare validation data
X_val = val_df["text"].astype(str).values.reshape(-1, 1)
y_val = encoder.transform(val_df["label"])

# Filter test set to only include known labels
test_df_filtered = test_df[test_df["label"].isin(encoder.classes_)]

# Prepare test data
X_test = test_df_filtered["text"].astype(str).values.reshape(-1, 1)
y_test = encoder.transform(test_df_filtered["label"])



# Initialize model
model = torchFastText(
    num_tokens=100000,
    embedding_dim=100,
    min_count=5,
    min_n=3,
    max_n=6,
    len_word_ngrams=2,
    sparse=False
)



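# Train model

# Make CUDA kernel launches synchronous so the traceback points at the failing call.
# This only takes effect if it is set before CUDA is initialized in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
gpu_count = torch.cuda.device_count()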

# print(gpu_count)
# num_workers=min(4 * gpu_count if gpu_count > 0 else 2, os.cpu_count() - 1)
# oss= os.cpu_count() - 1
# print("num_workers", num_workers, "oss", oss)

model.train(
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val,
    num_epochs=14,
    batch_size=64,
    lr=4e-3,
    verbose=True,
    num_workers=min(4 * gpu_count if gpu_count > 0 else 2, os.cpu_count() - 1)
)
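
One thing the script above does not check: the encoder is fit on the full train_df before the 80/20 split, so the reported range 0 to 10284 describes the full label set, not necessarily y_train. With 10285 classes, a random split can leave some rare classes only in the validation set. A device-side assert in a classifier is commonly caused by an out-of-range index, for example a target label that is not smaller than the number of output classes. If the library sizes its output layer from the labels it sees in training (an assumption, not verified against the torchFastText source), such validation-only classes could be one trigger. A minimal sketch of a CPU-side check, with `check_label_coverage` as a hypothetical helper name:

```python
import numpy as np

def check_label_coverage(y_train, y_val, num_classes):
    """Report out-of-range labels and classes that appear only in validation."""
    y_train = np.asarray(y_train)
    y_val = np.asarray(y_val)
    # Labels must lie in [0, num_classes) to be valid class indices.
    out_of_range = {
        "train": int(((y_train < 0) | (y_train >= num_classes)).sum()),
        "val": int(((y_val < 0) | (y_val >= num_classes)).sum()),
    }
    train_classes = np.unique(y_train)
    # Classes present in validation but never seen in training.
    val_only = np.setdiff1d(np.unique(y_val), train_classes)
    return {
        "out_of_range": out_of_range,
        "classes_in_train": int(train_classes.size),
        "max_train_label": int(y_train.max()),
        "val_only_classes": val_only.tolist(),
    }

# Toy example: 5 classes overall, class 4 occurs only in the validation split.
y_train_demo = np.array([0, 1, 2, 3, 1, 0])
y_val_demo = np.array([2, 4, 3])
report = check_label_coverage(y_train_demo, y_val_demo, num_classes=5)
print(report)
```

In the pipeline above this would be called as check_label_coverage(y_train, y_val, num_classes=len(encoder.classes_)). A non-empty val_only_classes list, or a classes_in_train value below 10285, would be worth ruling out before looking elsewhere.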
