Feature Selection
High-dimensional features (the TF-IDF matrices and the image arrays) are first reduced in dimensionality and then concatenated into fully numeric feature matrices, which prepares the data for a first baseline model.
Dimension Reduction
High-dimensional features, particularly those derived from TF-IDF vectorization and image data, are reduced in dimensionality to create a more manageable and informative feature set for modeling.
Image Feature Reduction
Principal Component Analysis (PCA) is performed on the image feature arrays. PCA is fit on the training set and then applied to the validation set, ensuring that the same transformation is used for both. The reduced image features are saved as:
/data/processed/split_train/X_train_img_pca.npy
/data/processed/split_val/X_val_img_pca.npy
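The fit-on-train, apply-to-validation pattern described above can be sketched as follows. This is a minimal illustration using scikit-learn's `PCA` on synthetic arrays; the array sizes and the value of `n_components` are assumptions, since the number of image components actually used is not stated here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for the flattened image feature arrays (assumption).
rng = np.random.default_rng(0)
X_train_img = rng.normal(size=(200, 64))
X_val_img = rng.normal(size=(50, 64))

# n_components=16 is illustrative only; the real value is not given in the text.
pca = PCA(n_components=16, random_state=0)

# Fit on the training set only, then reuse the fitted projection for validation
# so that both sets go through exactly the same transformation.
X_train_img_pca = pca.fit_transform(X_train_img)
X_val_img_pca = pca.transform(X_val_img)

print(X_train_img_pca.shape, X_val_img_pca.shape)
```

Fitting on the training set alone keeps information from the validation set out of the learned projection. The resulting arrays could then be written with `np.save` to the paths listed above.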
TF-IDF Dimensionality Reduction
When the textual variables 'designation' and 'description' were vectorized, a maximum of 5,000 TF-IDF features was already set to reduce complexity while retaining as much information as possible. Truncated Singular Value Decomposition (TruncatedSVD) is then applied to the TF-IDF matrices.
A fixed number of 2,500 components is chosen, which retains a substantial portion of the explained variance (~84%; a 95% target was not feasible given limited computational resources) while reducing complexity. The reduced TF-IDF arrays are saved as:
/data/processed/split_train/X_designation_tfidf_svd.npy
/data/processed/split_val/X_designation_tfidf_svd.npy
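As a rough sketch of this step, the snippet below vectorizes a toy corpus with `TfidfVectorizer(max_features=5000)` and reduces it with `TruncatedSVD`. The example texts and the choice of `n_components=3` are assumptions made so the example runs on tiny data; the pipeline described above uses 2,500 components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy product texts standing in for the real text column (assumption).
train_texts = [
    "wooden toy train for children",
    "paperback fantasy novel first edition",
    "garden hose with spray nozzle",
    "video game console controller wireless",
    "inflatable swimming pool for garden",
    "children picture book with animals",
]
val_texts = ["wireless controller for console", "garden pool toy"]

# Cap the vocabulary size, as in the text (5,000 features).
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(train_texts)
X_val_tfidf = vectorizer.transform(val_texts)

# TruncatedSVD works directly on sparse matrices without densifying them.
# n_components=3 is illustrative; it must stay below the vocabulary size.
svd = TruncatedSVD(n_components=3, random_state=0)
X_train_svd = svd.fit_transform(X_train_tfidf)
X_val_svd = svd.transform(X_val_tfidf)

# Share of the training variance retained by the kept components.
explained = svd.explained_variance_ratio_.sum()
print(X_train_svd.shape, X_val_svd.shape, round(float(explained), 3))
```

Summing `explained_variance_ratio_` is one way to arrive at a figure such as the ~84% quoted above and to judge the trade-off between component count and compute cost.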
Feature Concatenation
The final modeling datasets are created by horizontally concatenating the following for each sample:
- Numeric tabular features (only 'productid' and 'imageid' remain, since the textual variables are represented by the TF-IDF matrices)
- Reduced TF-IDF features (from TruncatedSVD)
- Reduced image features (from PCA)
This results in a single, fully numeric feature matrix for both training and validation sets:
/data/processed/split_train/X_train_combined.npy
/data/processed/split_val/X_val_combined.npy
Dataset | Shape |
---|---|
X_train | (67932, 3602) |
X_val | (16984, 3602) |
y_train | (67932,) |
y_val | (16984,) |
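The horizontal concatenation that produces the combined matrices can be illustrated with `np.hstack`. The block below is a sketch on small synthetic arrays; the variable names and column counts are hypothetical and only mirror the three feature groups listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 8

# Hypothetical stand-ins for the three feature groups (sizes are illustrative).
X_tab = rng.integers(0, 1000, size=(n_samples, 2)).astype(np.float64)  # productid, imageid
X_text_svd = rng.normal(size=(n_samples, 5))  # reduced TF-IDF features
X_img_pca = rng.normal(size=(n_samples, 3))   # reduced image features

# All blocks must have the same number of rows, in the same sample order.
X_combined = np.hstack([X_tab, X_text_svd, X_img_pca])

print(X_combined.shape)
```

The number of columns in the result is the sum of the column counts of the individual blocks, while the number of rows stays equal to the number of samples.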