Hello! I have built a pipeline using the torchFastText library that works fine with 1M rows. However, with a larger dataset (~5M rows), I keep getting RuntimeError: CUDA error: device-side assert triggered, and I cannot work out why it happens. The LabelEncoder output looks fine, as shown below. Do you have any idea what might cause this issue?
I am running this in a Databricks notebook on a g5.16xlarge cluster (256 GB memory, 64 cores).
Number of unique classes: 10285
Min label index: 0
Max label index: 10284
X_train shape: (4000000, 1)
X_train dtype: object
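For reference, a minimal sketch of how a label-range check like the one reported above could be written. The helper name check_label_range is hypothetical and not part of torchFastText; it only assumes the encoded labels are an integer array. It is worth running on both the training and validation labels, since an out-of-range class index is one common source of device-side asserts (an out-of-range index into an embedding table is another).

```python
import numpy as np

def check_label_range(labels, num_classes):
    """Return (min, max, n_unique); raise if any label is outside [0, num_classes)."""
    labels = np.asarray(labels)
    lo, hi = int(labels.min()), int(labels.max())
    if lo < 0 or hi >= num_classes:
        raise ValueError(
            f"labels span [{lo}, {hi}] but valid indices are [0, {num_classes - 1}]"
        )
    return lo, hi, int(np.unique(labels).size)

# Toy stand-in for the encoded labels (hypothetical data, not from the real dataset)
y_demo = np.array([0, 3, 1, 3, 2])
print(check_label_range(y_demo, num_classes=4))  # (0, 3, 4)
```

In the pipeline this would be called as check_label_range(y_train, len(encoder.classes_)) and again with y_val.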
Here is how I am reading and training the model:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torchFastText import torchFastText
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import torch
import os
# Load data
train_df = pd.read_parquet("/Volumes/train_5M.parquet")
test_df = pd.read_parquet('/Volumes/test.parquet')
# Fit LabelEncoder on training labels
encoder = LabelEncoder()
encoder.fit(train_df["label"])
# Memory-efficient split using pandas sample
val_df = train_df.sample(frac=0.2, random_state=42)
train_df_small = train_df.drop(val_df.index)
# Prepare training data
X_train = train_df_small["text"].astype(str).values.reshape(-1, 1)
y_train = encoder.transform(train_df_small["label"])
# Prepare validation data
X_val = val_df["text"].astype(str).values.reshape(-1, 1)
y_val = encoder.transform(val_df["label"])
# Filter test set to only include known labels
test_df_filtered = test_df[test_df["label"].isin(encoder.classes_)]
# Prepare test data
X_test = test_df_filtered["text"].astype(str).values.reshape(-1, 1)
y_test = encoder.transform(test_df_filtered["label"])
# Initialize model
model = torchFastText(
    num_tokens=100000,
    embedding_dim=100,
    min_count=5,
    min_n=3,
    max_n=6,
    len_word_ngrams=2,
    sparse=False
)
# Train model
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
gpu_count = torch.cuda.device_count()
# print(gpu_count)
# num_workers=min(4 * gpu_count if gpu_count > 0 else 2, os.cpu_count() - 1)
# oss= os.cpu_count() - 1
# print("num_workers", num_workers, "oss", oss)
model.train(
    X_train=X_train,
    y_train=y_train,
    X_val=X_val,
    y_val=y_val,
    num_epochs=14,
    batch_size=64,
    lr=4e-3,
    verbose=True,
    num_workers=min(4 * gpu_count if gpu_count > 0 else 2, os.cpu_count() - 1)
)
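As a worked example of the num_workers expression in the training call: the sketch below mirrors it in a small helper (pick_num_workers is a hypothetical name) and additionally guards against os.cpu_count() returning None. Assuming a single visible GPU and 64 cores, the expression evaluates to 4.

```python
import os

def pick_num_workers(gpu_count, cpu_count=None):
    """Mirror the num_workers expression above, tolerating an unknown CPU count."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    wanted = 4 * gpu_count if gpu_count > 0 else 2
    # Never go below zero workers, even on a single-core machine
    return max(0, min(wanted, cpu_count - 1))

print(pick_num_workers(gpu_count=1, cpu_count=64))  # 4
print(pick_num_workers(gpu_count=0, cpu_count=64))  # 2
print(pick_num_workers(gpu_count=1, cpu_count=1))   # 0
```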