Train-Test Split
The processed data is split so that all feature matrices and labels stay aligned for modeling. All outputs are saved in separate folders for train and validation (data/processed/split_train and data/processed/split_val).
Split X_train
The cleaned and processed tabular data (X_train_processed.csv) serves as the basis for splitting.
A reproducible split is generated using train_test_split from sklearn with a fixed random state.
The resulting splits are saved as X_train.csv and X_val.csv in:
data/processed/split_train/X_train.csv
data/processed/split_val/X_val.csv
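The reproducible split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the helper name split_tabular, the 20% validation share, the random state of 42, and the synthetic columns are all assumptions, since the source only states that train_test_split is used with a fixed random state.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_tabular(X, test_size=0.2, random_state=42):
    """Return the train and validation parts; original index labels are kept.

    test_size and random_state are assumed values, not taken from the project.
    """
    return train_test_split(X, test_size=test_size, random_state=random_state)


# Synthetic stand-in for X_train_processed.csv (hypothetical columns).
X = pd.DataFrame({"productid": range(100, 110), "imageid": range(900, 910)})
X_train, X_val = split_tabular(X)

# With the real data, the two parts could then be written out, for example:
# X_train.to_csv("data/processed/split_train/X_train.csv")
# X_val.to_csv("data/processed/split_val/X_val.csv")
```

Because train_test_split preserves the DataFrame index, X_train.index and X_val.index record which original rows ended up in each part, and the same index values can be reused for every other data source.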
Split y_train
The label file (Y_train_CVw08PX.csv) contains the prdtypecode for each sample. The labels are split using the indices of the tabular split, so that each row in y_train and y_val corresponds exactly to the matching row in X_train and X_val.
Outputs:
data/processed/split_train/y_train.csv
data/processed/split_val/y_val.csv
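One way to split the labels by the indices of the tabular split is label-based selection with pandas. The sketch below assumes that the label table shares its index with the feature table. The helper name split_labels and the example values are hypothetical.

```python
import pandas as pd


def split_labels(y, train_index, val_index):
    """Select label rows by index label, in the same order as the feature split."""
    return y.loc[train_index], y.loc[val_index]


# Synthetic labels; in the project these would come from Y_train_CVw08PX.csv.
y = pd.DataFrame({"prdtypecode": [10, 40, 50, 10, 40, 50]})
train_index = pd.Index([4, 0, 5, 2])  # stands in for X_train.index
val_index = pd.Index([3, 1])          # stands in for X_val.index
y_train, y_val = split_labels(y, train_index, val_index)
```

Selecting with .loc keeps the shuffled order of the feature split, so row i of y_train belongs to row i of X_train without any further sorting.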
Split TF-IDF Matrix
The full TF-IDF matrix (X_designation_tfidf_train.npz) is split using the same indices as the tabular data. This guarantees that the TF-IDF features correspond exactly to the samples in the tabular and label splits. Initially, only the designation matrix was used, because it already yields many features and the description field has many missing values.
Outputs:
data/processed/split_train/X_designation_tfidf_train.npz
data/processed/split_val/X_designation_tfidf_val.npz
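Splitting a sparse matrix with the tabular indices could look like the sketch below. It assumes that the rows of the TF-IDF matrix are stored in the same order as the rows of the tabular file, and that the tabular index is a default 0..n-1 range, so index labels double as row positions. The helper name split_sparse_rows and the small dense example are hypothetical.

```python
import numpy as np
from scipy import sparse


def split_sparse_rows(matrix, train_pos, val_pos):
    """Slice a sparse matrix by integer row positions."""
    matrix = sparse.csr_matrix(matrix)  # CSR supports efficient row indexing
    return matrix[np.asarray(train_pos)], matrix[np.asarray(val_pos)]


# Synthetic stand-in for sparse.load_npz("X_designation_tfidf_train.npz").
tfidf = sparse.csr_matrix(np.arange(24, dtype=float).reshape(6, 4))
tfidf_train, tfidf_val = split_sparse_rows(tfidf, [4, 0, 5, 2], [3, 1])

# With the real data, each part could be stored with sparse.save_npz, e.g.:
# sparse.save_npz("data/processed/split_train/X_designation_tfidf_train.npz", tfidf_train)
```

If the tabular index is not a plain 0..n-1 range, the labels would first have to be translated into row positions before slicing.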
Split Image Data
Image files are referenced by imageid and productid in the tabular data.
Lists of image filenames for train and validation sets are generated based on the split tabular files.
Corresponding image arrays are loaded for each split.
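Generating the filename lists from a split tabular file might be sketched as follows. The naming scheme image_{imageid}_product_{productid}.jpg is an assumption and is not stated in the source, so the pattern should be adjusted to the files actually on disk. The helper name image_filenames is likewise hypothetical, and loading the image arrays is left out.

```python
import pandas as pd

# Assumed naming scheme; adjust to the actual files on disk.
FILENAME_PATTERN = "image_{imageid}_product_{productid}.jpg"


def image_filenames(df, pattern=FILENAME_PATTERN):
    """Build one filename per row, in row order."""
    return [
        pattern.format(imageid=imageid, productid=productid)
        for imageid, productid in zip(df["imageid"], df["productid"])
    ]


# Synthetic stand-in for the validation part of the tabular split.
X_val = pd.DataFrame({"imageid": [903, 901], "productid": [103, 101]}, index=[3, 1])
val_files = image_filenames(X_val)
```

Building the list by iterating over the split DataFrame keeps the filenames in the same row order as X_val, which is what later allows image arrays to be matched to labels by position.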
Data Alignment
All splits (tabular, TF-IDF, image, and labels) are aligned by using the same indices/order derived from the initial train-test split.
For any sample index, the tabular features, TF-IDF features, image features, and label all correspond to the same original data point.
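A simple sanity check for this alignment, under the assumption that features and labels are pandas objects and the TF-IDF part is a SciPy sparse matrix, could look like the sketch below. The function name check_alignment is hypothetical. It only compares row counts and index labels, so it cannot prove that the TF-IDF rows or image files carry the right content.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def check_alignment(X_part, y_part, tfidf_part, image_files):
    """Raise ValueError if the four parts of one split disagree in size or index."""
    n = len(X_part)
    sizes = {
        "labels": len(y_part),
        "tfidf": tfidf_part.shape[0],
        "images": len(image_files),
    }
    for name, size in sizes.items():
        if size != n:
            raise ValueError(f"{name} has {size} rows, expected {n}")
    if not X_part.index.equals(y_part.index):
        raise ValueError("feature and label indices differ")
    return True


# Synthetic validation split with two samples.
X_part = pd.DataFrame({"imageid": [903, 901]}, index=[3, 1])
y_part = pd.DataFrame({"prdtypecode": [10, 40]}, index=[3, 1])
tfidf_part = sparse.csr_matrix(np.ones((2, 4)))
files = ["a.jpg", "b.jpg"]  # placeholder filenames
aligned = check_alignment(X_part, y_part, tfidf_part, files)
```

Running such a check once after writing the split files is a cheap way to catch a forgotten re-split of one data source.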