Train-Test Split

The processed data is split in a way that keeps all feature matrices and labels aligned for modeling. Outputs are saved in separate folders for the training and validation sets (data/processed/split_train and data/processed/split_val).

Split X_train

The cleaned and processed tabular data (X_train_processed.csv) serves as the basis for splitting.
A reproducible split is generated using train_test_split from sklearn with a fixed random state.
Resulting splits are saved as X_train.csv and X_val.csv in:

  • data/processed/split_train/X_train.csv
  • data/processed/split_val/X_val.csv
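
The tabular split described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the helper name split_tabular, the 80/20 ratio, the seed value 42, and the toy columns are assumptions, since the source only states that train_test_split is used with a fixed random state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_tabular(X, test_size=0.2, random_state=42):
    """Split a feature table into train and validation parts.

    test_size and random_state are illustrative values (assumptions);
    any fixed random_state makes the split reproducible.
    """
    X_train, X_val = train_test_split(
        X, test_size=test_size, random_state=random_state, shuffle=True
    )
    return X_train, X_val


# Toy stand-in for X_train_processed.csv (hypothetical values).
X = pd.DataFrame({"productid": range(100, 110), "imageid": range(200, 210)})
X_train, X_val = split_tabular(X)

# Both parts keep the original row index, which the other splits can reuse.
# In the real pipeline the results would then be written out, for example:
#   X_train.to_csv("data/processed/split_train/X_train.csv")
#   X_val.to_csv("data/processed/split_val/X_val.csv")
```

Because train_test_split preserves the DataFrame index, X_train.index and X_val.index record exactly which original rows ended up in each part.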

Split y_train

The label file (Y_train_CVw08PX.csv) contains the prdtypecode for each sample. Splitting is performed to match the indices of the tabular split, ensuring that each row in y_train and y_val corresponds exactly to the rows in X_train and X_val.
Outputs:

  • data/processed/split_train/y_train.csv
  • data/processed/split_val/y_val.csv
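
One way to match the labels to the tabular split is to select label rows by the index of each tabular part. The sketch below assumes that the label file and the tabular data share the same row index; the helper name split_labels and the toy values are hypothetical.

```python
import pandas as pd


def split_labels(y, train_index, val_index):
    """Select label rows in the same order as the tabular split.

    Assumes y is indexed by the same row identifiers as the tabular data.
    """
    return y.loc[train_index], y.loc[val_index]


# Toy labels standing in for Y_train_CVw08PX.csv (hypothetical values).
y = pd.DataFrame({"prdtypecode": [10, 40, 50, 60, 1140, 2583]})

# In practice these would be X_train.index and X_val.index.
train_index = pd.Index([4, 0, 3, 5])
val_index = pd.Index([1, 2])

y_train, y_val = split_labels(y, train_index, val_index)
```

Selecting with .loc keeps the row order of the index that is passed in, so row i of y_train refers to the same sample as row i of X_train.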

Split TF-IDF Matrix

The full TF-IDF matrix (X_designation_tfidf_train.npz) is split using the same indices as the tabular data, so the TF-IDF rows correspond exactly to the samples in the tabular and label splits. Initially, only the designation matrix is used, because it already contributes many features and the description field has many missing values.
Outputs:

  • data/processed/split_train/X_designation_tfidf_train.npz
  • data/processed/split_val/X_designation_tfidf_val.npz
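
Row selection on the sparse matrix might look like the sketch below. It assumes that the .npz file was written with scipy.sparse.save_npz and that row k of the matrix belongs to the sample at row position k of the tabular data; neither is stated in the source. The helper name split_sparse_rows and the toy matrix are hypothetical.

```python
import numpy as np
from scipy import sparse


def split_sparse_rows(matrix, train_pos, val_pos):
    """Select rows of a sparse matrix by integer position.

    Converting to CSR first makes row indexing efficient.
    """
    matrix = sparse.csr_matrix(matrix)
    return matrix[train_pos], matrix[val_pos]


# Toy 6 x 2 matrix; the real one would be loaded with
#   sparse.load_npz("X_designation_tfidf_train.npz")
tfidf = sparse.csr_matrix(np.arange(12, dtype=float).reshape(6, 2))

# Row positions taken from the tabular split (hypothetical values).
train_pos = np.array([4, 0, 3, 5])
val_pos = np.array([1, 2])

tfidf_train, tfidf_val = split_sparse_rows(tfidf, train_pos, val_pos)

# The parts could then be written with sparse.save_npz(path, tfidf_train).
```

If the tabular index is not a plain 0..n-1 range, the positions would have to be derived from it first, for example by splitting np.arange(n) instead of the DataFrame itself.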

Split Image Data

Image files are referenced by imageid and productid in the tabular data.
Lists of image filenames for train and validation sets are generated based on the split tabular files.
Corresponding image arrays are loaded for each split.
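
Building the filename lists from a split table could be done as in the sketch below. The filename pattern is an assumption, since the source only says that images are referenced by imageid and productid; the helper name image_filenames and the toy values are likewise hypothetical.

```python
import pandas as pd

# Assumed naming scheme; adjust to the actual image files on disk.
FILENAME_PATTERN = "image_{imageid}_product_{productid}.jpg"


def image_filenames(df, pattern=FILENAME_PATTERN):
    """Return one filename per row, in the row order of df."""
    return [
        pattern.format(imageid=image_id, productid=product_id)
        for image_id, product_id in zip(df["imageid"], df["productid"])
    ]


# Toy stand-in for the split tabular file X_val.csv (hypothetical values).
X_val_toy = pd.DataFrame({"imageid": [1, 2], "productid": [10, 20]})
val_files = image_filenames(X_val_toy)
```

Because the list is built row by row from the split table, the images loaded from it come out in the same order as the tabular rows.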

Data Alignment

All splits (tabular, TF-IDF, image, and labels) are aligned by using the same indices/order derived from the initial train-test split.
For any sample index, the tabular features, TF-IDF features, image features, and label all correspond to the same original data point.
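
A simple sanity check for this alignment is sketched below. It only compares row counts and the tabular and label indices, so it is a necessary but not sufficient test; the helper name check_alignment and the toy inputs are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def check_alignment(X, y, tfidf, image_files):
    """Raise ValueError if the artifacts of one split disagree.

    Returns the common number of rows when everything matches.
    """
    sizes = {
        "X": len(X),
        "y": len(y),
        "tfidf": tfidf.shape[0],
        "images": len(image_files),
    }
    if len(set(sizes.values())) != 1:
        raise ValueError(f"row counts differ: {sizes}")
    if not X.index.equals(y.index):
        raise ValueError("X and y indices differ in content or order")
    return sizes["X"]


# Toy split with three samples (hypothetical values).
X_part = pd.DataFrame({"imageid": [7, 8, 9]}, index=[4, 0, 3])
y_part = pd.DataFrame({"prdtypecode": [10, 40, 50]}, index=[4, 0, 3])
tfidf_part = sparse.csr_matrix(np.eye(3))
files_part = ["a.jpg", "b.jpg", "c.jpg"]

n_rows = check_alignment(X_part, y_part, tfidf_part, files_part)
```

Running such a check once for the training part and once for the validation part catches the most common mistakes, such as a matrix split with a different seed or a label file saved in a different order.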