
Convolutional Neural Networks (Multimodal)

This section documents the multimodal neural network approach, which combines both image and text data for product classification.

The multimodal models use the preprocessed datasets split into split_train and split_val directories. Images are stored in class-specific subfolders and are referenced in the tabular data by imageid and productid. The text data is constructed by concatenating the designation and description fields.

Preprocessing

  • Image preprocessing: Images are loaded in grayscale and further resized to 128x128 pixels.
  • Text preprocessing: The designation and description fields are combined and tokenized using a Keras TextVectorization layer.
  • Label encoding: Product type codes (prdtypecode) are label-encoded for use in the neural network (see the sketch after this list).
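
A minimal sketch of these preprocessing steps follows; the CSV file name and the vocabulary size are assumptions, while the field names (designation, description, prdtypecode) and the 128x128 grayscale target come from the pipeline described above.

```python
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("split_train.csv")  # hypothetical file name

def load_image(path):
    """Load an image file in grayscale and resize it to 128x128."""
    raw = tf.io.read_file(path)
    img = tf.image.decode_jpeg(raw, channels=1)      # 1 channel = grayscale
    return tf.image.resize(img, (128, 128)) / 255.0  # scale to [0, 1]

# Text: concatenate designation and description, then vectorize.
df["text"] = df["designation"].fillna("") + " " + df["description"].fillna("")
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10_000,           # assumed vocabulary size
    output_sequence_length=100,  # Model 1 uses the first 100 tokens
)
vectorizer.adapt(df["text"].values)

# Labels: encode the prdtypecode values as integers 0..26.
labels = LabelEncoder().fit_transform(df["prdtypecode"])
```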

CNN Model 1

The first multimodal architecture serves as a baseline for combining image and text information.

Image Branch

A convolutional neural network processes the image input. The architecture includes three convolutional blocks, each with batch normalization, max pooling, and dropout for regularization. The output is flattened and passed through a dense layer.
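
A sketch of this branch is given below; the filter counts, dropout rate, and dense width are assumptions, while the overall structure (three Conv2D blocks with batch normalization, max pooling, and dropout, then flatten and dense) follows the description above.

```python
from tensorflow.keras import layers

image_input = layers.Input(shape=(128, 128, 1), name="image")  # grayscale

x = image_input
for filters in (32, 64, 128):  # three convolutional blocks
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Dropout(0.25)(x)

x = layers.Flatten()(x)
image_features = layers.Dense(128, activation="relu")(x)
```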

Text Branch

The text branch uses a Keras TextVectorization layer (limited to the first 100 tokens of the text for a quick first analysis), followed by an embedding layer and two bidirectional LSTM layers. The output is passed through a dense layer with batch normalization and dropout.
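
A sketch of the text branch follows; the embedding and LSTM sizes are assumptions, while the 100-token input and the two bidirectional LSTM layers come from the description.

```python
from tensorflow.keras import layers

text_input = layers.Input(shape=(100,), dtype="int64", name="text")  # token ids

x = layers.Embedding(input_dim=10_000, output_dim=64)(text_input)  # assumed sizes
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.BatchNormalization()(x)
text_features = layers.Dropout(0.3)(x)
```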

Fusion and Output

The outputs of the image and text branches are concatenated and passed through additional dense, batch normalization, and dropout layers before the final softmax output layer.
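
The fusion head can be sketched as follows, reusing image_features and text_features from the branch sketches above; the layer widths and dropout rate are assumptions, and 27 is the number of product type classes.

```python
from tensorflow.keras import Model, layers

x = layers.Concatenate()([image_features, text_features])
x = layers.Dense(256, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.4)(x)
outputs = layers.Dense(27, activation="softmax")(x)  # one unit per class

model = Model(inputs=[image_input, text_input], outputs=outputs)
```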

Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • ReduceLROnPlateau (factor=0.5, patience=2)
  • ModelCheckpoint (save best model); the full training setup is sketched below
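
A sketch of this configuration, assuming the model from the fusion sketch above; the checkpoint file name and the train_*/val_* arrays are hypothetical placeholders.

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=3),
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2),
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]

model.fit(
    {"image": train_images, "text": train_tokens},
    train_labels,
    validation_data=({"image": val_images, "text": val_tokens}, val_labels),
    batch_size=64,
    epochs=10,
    callbacks=callbacks,
)
```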

Training and Validation Accuracy Plot

Training and Validation Accuracy 1

The training and validation accuracy curves show steady improvement until epoch 3, where validation accuracy peaks (validation loss reaches its minimum at epoch 4). Training accuracy keeps improving and training loss continues to fall over the following epochs (up to epoch 6, when the callbacks stop training), indicating some overfitting but a reasonable validation accuracy of around 80%.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.5927 | 0.7418 | 0.6589 | 612
1 | 0.7438 | 0.6910 | 0.7164 | 521
2 | 0.7857 | 0.7703 | 0.7779 | 357
3 | 0.9272 | 0.8696 | 0.8974 | 161
4 | 0.6688 | 0.7718 | 0.7166 | 539
5 | 0.9215 | 0.9262 | 0.9239 | 786
6 | 0.5641 | 0.3014 | 0.3929 | 146
7 | 0.5743 | 0.6275 | 0.5997 | 961
8 | 0.4820 | 0.5047 | 0.4931 | 424
9 | 0.9408 | 0.8326 | 0.8834 | 974
10 | 0.8812 | 0.8343 | 0.8571 | 169
11 | 0.8106 | 0.7515 | 0.7799 | 507
12 | 0.7560 | 0.6548 | 0.7018 | 672
13 | 0.9025 | 0.7493 | 0.8188 | 1013
14 | 0.9181 | 0.8799 | 0.8986 | 841
15 | 0.8145 | 0.7372 | 0.7739 | 137
16 | 0.6796 | 0.8367 | 0.7500 | 1029
17 | 0.9154 | 0.7000 | 0.7933 | 170
18 | 0.9104 | 0.7335 | 0.8125 | 942
19 | 0.7303 | 0.8185 | 0.7719 | 986
20 | 0.7841 | 0.7712 | 0.7776 | 306
21 | 0.8807 | 0.9092 | 0.8947 | 991
22 | 0.7042 | 0.6905 | 0.6973 | 462
23 | 0.9686 | 0.9629 | 0.9657 | 2047
24 | 0.7519 | 0.7562 | 0.7540 | 525
25 | 0.8013 | 0.9439 | 0.8668 | 517
26 | 0.9691 | 0.9947 | 0.9817 | 189

Metric | Value
Accuracy | 0.7999
Macro avg Precision | 0.7918
Macro avg Recall | 0.7689
Macro avg F1-score | 0.7761
Weighted avg Precision | 0.8085
Weighted avg Recall | 0.7999
Weighted avg F1-score | 0.8010

Summary CNN 1

CNN Model 1 demonstrates strong overall performance as a multimodal classifier, achieving a validation accuracy of 79.99% and a weighted F1-score of 0.8010.

The combination of a deep CNN for image features and bidirectional LSTM layers for text enables the model to capture both visual and sequential textual patterns effectively.

The training and validation curves indicate decent generalization with moderate overfitting.

  • Stronger performance is observed in several classes (e.g., classes 3, 5, 9, 13, 14, 17, 18, 21, 23, 25, 26), with F1-scores above 0.85.
  • Weaker performance is seen in classes with fewer samples or less distinctive features (e.g., class 6).

CNN Model 2

The second multimodal architecture explores a different approach to text sequence modeling and learning rate scheduling.

Image Branch

The image branch is identical to CNN Model 1. It processes grayscale images resized to 128x128 pixels using a series of convolutional blocks (Conv2D, BatchNormalization, MaxPooling2D, Dropout), followed by flattening and a dense layer.

Text Branch

The text branch differs from Model 1 in two key ways:

  • Token Length: The TextVectorization layer uses the first 300 tokens of the concatenated designation and description fields (instead of 100).
  • Sequence Model: Instead of LSTM layers, the branch uses two 1D convolutional layers (Conv1D) to extract local n-gram features from the embedded text sequence, followed by global max pooling and dense layers with batch normalization and dropout for regularization (see the sketch after this list).
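
A sketch of this branch, with assumed kernel sizes, filter counts, and embedding dimensions; the 300-token length, two Conv1D layers, and global max pooling follow the description.

```python
from tensorflow.keras import layers

text_input = layers.Input(shape=(300,), dtype="int64", name="text")

x = layers.Embedding(input_dim=10_000, output_dim=64)(text_input)  # assumed sizes
x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.BatchNormalization()(x)
text_features = layers.Dropout(0.3)(x)
```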

Fusion and Output

See model 1.

Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • CosineDecay learning rate scheduler (replaces ReduceLROnPlateau; see the sketch after this list)
  • ModelCheckpoint (save best model)
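
Unlike ReduceLROnPlateau, CosineDecay lowers the learning rate on a fixed schedule regardless of validation metrics. A sketch with assumed initial rate and decay horizon:

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,  # assumed starting rate
    decay_steps=10_000,          # assumed decay horizon in training steps
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```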

Training and Validation Accuracy Plot

Training and Validation Accuracy 2

Training accuracy improves for the first two epochs, then falls and recovers only to a lower level of around 40% by epoch 4. Validation accuracy is volatile: it drops below 20% during the first two epochs, then increases slowly but steadily until epoch 4, when the callbacks stop training.

The adjustments to the learning rate schedule may have led to this extreme volatility in validation accuracy, which remains lower, at around 71%, than the almost 80% reached by Model 1.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.3903 | 0.3170 | 0.3499 | 612
1 | 0.6631 | 0.4760 | 0.5542 | 521
2 | 0.3909 | 0.8683 | 0.5391 | 357
3 | 0.9672 | 0.7329 | 0.8339 | 161
4 | 0.6345 | 0.6827 | 0.6577 | 539
5 | 0.9656 | 0.7494 | 0.8438 | 786
6 | 0.4444 | 0.0274 | 0.0516 | 146
7 | 0.4961 | 0.7253 | 0.5892 | 961
8 | 0.4977 | 0.2571 | 0.3390 | 424
9 | 0.9089 | 0.8090 | 0.8561 | 974
10 | 0.9015 | 0.7041 | 0.7907 | 169
11 | 0.8571 | 0.6982 | 0.7696 | 507
12 | 0.5655 | 0.6875 | 0.6206 | 672
13 | 0.8743 | 0.7206 | 0.7900 | 1013
14 | 0.9372 | 0.7812 | 0.8521 | 841
15 | 0.5200 | 0.1898 | 0.2781 | 137
16 | 0.7383 | 0.7648 | 0.7513 | 1029
17 | 0.6846 | 0.5235 | 0.5933 | 170
18 | 0.8753 | 0.7006 | 0.7783 | 942
19 | 0.8719 | 0.6765 | 0.7619 | 986
20 | 0.8394 | 0.3758 | 0.5192 | 306
21 | 0.8960 | 0.8345 | 0.8642 | 991
22 | 0.7333 | 0.5238 | 0.6111 | 462
23 | 0.8978 | 0.9658 | 0.9306 | 2047
24 | 0.3760 | 0.7943 | 0.5104 | 525
25 | 0.4540 | 0.8395 | 0.5893 | 517
26 | 0.9894 | 0.9841 | 0.9867 | 189

Metric | Value
Accuracy | 0.7168
Macro avg Precision | 0.7174
Macro avg Recall | 0.6448
Macro avg F1-score | 0.6523
Weighted avg Precision | 0.7555
Weighted avg Recall | 0.7168
Weighted avg F1-score | 0.7192

Summary CNN 2

CNN Model 2 achieves a validation accuracy of 71.68% and a weighted F1-score of 0.7192.

This model uses the same CNN image branch as model 1, but replaces the LSTM-based text branch with two Conv1D layers and increases the text sequence length to 300 tokens. A cosine decay learning rate schedule is also introduced.

The training and validation curves show significant volatility, with validation accuracy dropping sharply in the early epochs before recovering, and overall accuracy remaining lower than Model 1.

  • Stronger performance is still observed in several classes (e.g., classes 3, 5, 9, 13, 14, 18, 21, 23, 26), but with generally lower F1-scores compared to model 1.
  • Weaker performance is especially pronounced in classes with fewer samples or less distinctive features (e.g., class 6, class 15), and the model struggles more with class imbalance.

CNN Model 3

The third multimodal architecture builds on the strengths of previous models, aiming for improved performance and stability.

Image Branch

The image branch remains identical to previous models, processing grayscale images resized to 128x128 pixels through a series of convolutional blocks (Conv2D, BatchNormalization, MaxPooling2D, Dropout), followed by flattening and a dense layer.

Text Branch

The text branch returns to a bidirectional LSTM-based approach, but with moderate enhancements:

  • Token Length: The TextVectorization layer uses the first 150 tokens of the concatenated designation and description fields.
  • Vocabulary Size: The vocabulary is increased to 12,000 (max_tokens=12000), allowing the model to capture a broader range of unique words and product-specific terminology.
  • Sequence Model: Two bidirectional LSTM layers are used, each with 80 units, providing a balance between model capacity and training speed. Dropout is applied for regularization, and the embedding dimension is set to 80 for richer representations (see the sketch after this list).
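
A sketch of this branch; the 150-token input, 12,000-word vocabulary, embedding dimension of 80, and two 80-unit bidirectional LSTM layers come from the description, while the dense width and dropout rates are assumptions.

```python
from tensorflow.keras import layers

text_input = layers.Input(shape=(150,), dtype="int64", name="text")

x = layers.Embedding(input_dim=12_000, output_dim=80)(text_input)
x = layers.Bidirectional(layers.LSTM(80, return_sequences=True, dropout=0.2))(x)
x = layers.Bidirectional(layers.LSTM(80, dropout=0.2))(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.BatchNormalization()(x)
text_features = layers.Dropout(0.3)(x)
```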

Fusion and Output

See model 1.

Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • ReduceLROnPlateau (factor=0.5, patience=2)
  • ModelCheckpoint (save best model)

Training and Validation Accuracy Plot

Training and Validation Accuracy 3

Validation accuracy improves slowly but steadily up to epoch 5, where it drops and then plateaus for the remaining training epochs. Validation loss also reaches its best value at epoch 5.

Training accuracy continues to improve, showing some signs of overfitting; however, both accuracies eventually exceed 80%.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.6744 | 0.6634 | 0.6689 | 612
1 | 0.8057 | 0.6526 | 0.7211 | 521
2 | 0.7841 | 0.7731 | 0.7786 | 357
3 | 0.8916 | 0.9193 | 0.9052 | 161
4 | 0.7276 | 0.7681 | 0.7473 | 539
5 | 0.9017 | 0.9567 | 0.9284 | 786
6 | 0.6581 | 0.5274 | 0.5856 | 146
7 | 0.6262 | 0.6712 | 0.6479 | 961
8 | 0.5869 | 0.5495 | 0.5676 | 424
9 | 0.9124 | 0.8665 | 0.8889 | 974
10 | 0.9187 | 0.8698 | 0.8936 | 169
11 | 0.7773 | 0.7712 | 0.7743 | 507
12 | 0.8146 | 0.6801 | 0.7413 | 672
13 | 0.8413 | 0.8213 | 0.8312 | 1013
14 | 0.9283 | 0.8930 | 0.9103 | 841
15 | 0.8868 | 0.6861 | 0.7737 | 137
16 | 0.7713 | 0.7833 | 0.7772 | 1029
17 | 0.8242 | 0.8000 | 0.8119 | 170
18 | 0.8984 | 0.7696 | 0.8290 | 942
19 | 0.6868 | 0.8519 | 0.7605 | 986
20 | 0.7914 | 0.7810 | 0.7862 | 306
21 | 0.8535 | 0.9173 | 0.8842 | 991
22 | 0.6693 | 0.7273 | 0.6971 | 462
23 | 0.9517 | 0.9712 | 0.9613 | 2047
24 | 0.7871 | 0.7181 | 0.7510 | 525
25 | 0.8336 | 0.9207 | 0.8750 | 517
26 | 0.9691 | 0.9947 | 0.9817 | 189

Metric | Value
Accuracy | 0.8141
Macro avg Precision | 0.8064
Macro avg Recall | 0.7891
Macro avg F1-score | 0.7955
Weighted avg Precision | 0.8168
Weighted avg Recall | 0.8141
Weighted avg F1-score | 0.8136

Summary CNN 3

CNN Model 3 achieves a validation accuracy of 81.41% and a weighted F1-score of 0.8136.

This model combines a robust CNN image branch with an enhanced bidirectional LSTM text branch (two layers, 80 units each, 150 tokens), striking a balance between model capacity and efficiency. The training and validation curves show stable improvement despite overfitting.

  • Stronger performance is observed in many classes (e.g., classes 3, 5, 9, 10, 13, 14, 15, 17, 18, 21, 23, 25, 26), with F1-scores above 0.85.
  • Weaker performance is still present in some underrepresented or less distinctive classes (e.g., class 6, class 8), but overall class balance and generalization are improved compared to previous models.

Camembert Model

This section documents the multimodal neural network approach using pretrained Camembert embeddings combined with a CNN architecture for product classification.

The model utilizes frozen Camembert [CLS] embeddings extracted from the concatenated text fields (designation + description). These embeddings serve as fixed-length feature vectors that are fed into convolutional layers.

Camembert Model Embeddings

The text data is tokenized using the CamembertTokenizer and fed into the frozen TFCamembertModel. The [CLS] token embedding from the last hidden layer is extracted, representing the entire sequence's semantic meaning. Embeddings are computed in batches, saved, and used as input features for the CNN model.
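
A sketch of the extraction step; the batch size, max_length, and output file name are assumptions, while the tokenizer, frozen model, and [CLS] slice follow the description above.

```python
import numpy as np
from transformers import CamembertTokenizer, TFCamembertModel

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = TFCamembertModel.from_pretrained("camembert-base")
camembert.trainable = False  # frozen: used as a fixed feature extractor

def cls_embeddings(texts, batch_size=32, max_length=128):
    """Compute [CLS] embeddings for a list of strings, batch by batch."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(
            texts[i : i + batch_size],
            padding=True, truncation=True, max_length=max_length,
            return_tensors="tf",
        )
        out = camembert(**enc)
        # [CLS] token = first position of the last hidden state
        chunks.append(out.last_hidden_state[:, 0, :].numpy())
    return np.concatenate(chunks)

embeddings = cls_embeddings(df["text"].tolist())     # df from the preprocessing sketch
np.save("camembert_cls_embeddings.npy", embeddings)  # illustrative file name
```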

Image Branch

See other models.

Text Branch

The model's text branch takes the precomputed Camembert [CLS] embeddings as input. These embeddings are passed through several 1D convolutional layers with batch normalization and dropout. This enables capturing local semantic n-gram features within the embedding space. A global max pooling layer follows to aggregate features.
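
Because a single [CLS] vector has no sequence axis, one way to apply Conv1D is to reshape the 768-dimensional embedding to shape (768, 1), as in the sketch below; the filter counts and kernel sizes are assumptions.

```python
from tensorflow.keras import layers

embedding_input = layers.Input(shape=(768,), name="cls_embedding")

x = layers.Reshape((768, 1))(embedding_input)  # add a "sequence" axis
x = layers.Conv1D(64, kernel_size=5, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
x = layers.Conv1D(128, kernel_size=5, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
text_features = layers.GlobalMaxPooling1D()(x)
```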

Fusion and Output

See other models.

Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • ReduceLROnPlateau (factor=0.5, patience=2)
  • ModelCheckpoint (save best model)

Training and Validation Accuracy Plot

Training and Validation Accuracy Camembert

Both training and validation accuracy steadily improve throughout the epochs, with validation accuracy closely tracking training accuracy and ending at roughly 79%.

Minor fluctuations in validation accuracy and loss around epochs 4–7 reflect normal variance rather than overfitting, as the validation metrics mirror the training curves and the gap between losses narrows over time. Overall, the model demonstrates stable learning and good generalization, with no significant signs of overfitting or underfitting.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.6303 | 0.7827 | 0.6983 | 612
1 | 0.7179 | 0.6987 | 0.7082 | 521
2 | 0.7562 | 0.6779 | 0.7149 | 357
3 | 0.9388 | 0.8571 | 0.8961 | 161
4 | 0.6799 | 0.7959 | 0.7333 | 539
5 | 0.9273 | 0.9567 | 0.9418 | 786
6 | 0.6038 | 0.2192 | 0.3216 | 146
7 | 0.5646 | 0.5047 | 0.5330 | 961
8 | 0.5394 | 0.5165 | 0.5277 | 424
9 | 0.7885 | 0.9189 | 0.8487 | 974
10 | 0.9213 | 0.6923 | 0.7905 | 169
11 | 0.7192 | 0.6213 | 0.6667 | 507
12 | 0.7785 | 0.6801 | 0.7260 | 672
13 | 0.7553 | 0.7345 | 0.7447 | 1013
14 | 0.8960 | 0.8811 | 0.8885 | 841
15 | 0.8800 | 0.9635 | 0.9199 | 137
16 | 0.6948 | 0.7677 | 0.7295 | 1029
17 | 0.8364 | 0.8118 | 0.8239 | 170
18 | 0.8719 | 0.8312 | 0.8511 | 942
19 | 0.8052 | 0.7921 | 0.7986 | 986
20 | 0.7315 | 0.7745 | 0.7524 | 306
21 | 0.8560 | 0.8698 | 0.8629 | 991
22 | 0.6815 | 0.6299 | 0.6547 | 462
23 | 0.9367 | 0.9609 | 0.9486 | 2047
24 | 0.7505 | 0.6933 | 0.7208 | 525
25 | 0.9159 | 0.9478 | 0.9316 | 517
26 | 0.9637 | 0.9841 | 0.9738 | 189

Metric | Value
Accuracy | 0.7907
Macro avg Precision | 0.7830
Macro avg Recall | 0.7616
Macro avg F1-score | 0.7669
Weighted avg Precision | 0.7897
Weighted avg Recall | 0.7907
Weighted avg F1-score | 0.7879

Summary Camembert Model

The CNN Camembert model leverages powerful contextual embeddings from a pretrained Camembert language model, using frozen [CLS] embeddings as fixed feature vectors.

Passing these embeddings through convolutional layers lets the model capture local semantic patterns, resulting in a validation accuracy of 79.07% and a weighted F1-score of 0.7879.

Despite strong performance, the model underperforms the enhanced LSTM models, likely because:

  • The Camembert embeddings are not fine-tuned, limiting task adaptation.
  • A CNN over the embeddings captures local patterns but misses the longer-range dependencies modeled by LSTMs.
  • Model 3's improved text branch and training regimen lead to better generalization and class balance.

Nevertheless, the Camembert model provides a strong baseline utilizing pretrained language knowledge effectively in a CNN framework.

Fine-Tuned Camembert Multimodal Model

This section documents the multimodal neural network approach using a jointly fine-tuned Camembert model combined with a CNN architecture for product classification. Unlike previous models that use frozen Camembert embeddings, this architecture enables end-to-end training, allowing Camembert's language representations to adapt to the specific classification task.


Camembert Model Embeddings

Text data is dynamically tokenized using the CamembertTokenizer and fed into a custom Keras layer wrapping the PyTorch CamembertModel. The [CLS] token embedding from the last hidden layer is extracted for each sample, representing the semantic meaning of the concatenated designation and description fields. Crucially, Camembert's weights are updated during training, enabling the model to learn task-specific features.
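
The project wraps the PyTorch CamembertModel in a custom Keras layer; the sketch below substitutes the TensorFlow variant (TFCamembertModel) to show the same end-to-end idea with less glue code, and the 128-token input length is an assumption.

```python
from tensorflow.keras import layers
from transformers import TFCamembertModel

class CamembertCLSLayer(layers.Layer):
    """Return the [CLS] embedding; Camembert weights stay trainable."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.camembert = TFCamembertModel.from_pretrained("camembert-base")
        self.camembert.trainable = True  # weights are updated during training

    def call(self, inputs):
        input_ids, attention_mask = inputs
        out = self.camembert(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0, :]  # [CLS] token embedding

input_ids = layers.Input(shape=(128,), dtype="int32", name="input_ids")
attention_mask = layers.Input(shape=(128,), dtype="int32", name="attention_mask")
cls_embedding = CamembertCLSLayer()([input_ids, attention_mask])
```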


Image Branch

The image branch remains unchanged from previous architectures. Grayscale product images are resized and normalized, then passed through a deep stack of convolutional layers with batch normalization and dropout. These layers extract high-level visual features, which are further processed by dense layers.


Text Branch

The text branch receives [CLS] embeddings from the fine-tuned Camembert model. These embeddings are processed through dense layers with batch normalization and dropout, enhancing feature extraction and regularization. This enables the model to capture both global and local semantic patterns relevant to product classification.


Fusion and Output

Text and image features are concatenated and passed through additional dense layers with dropout and batch normalization. The final output layer uses softmax activation to predict product classes.


Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • ReduceLROnPlateau (factor=0.5, patience=2)
  • ModelCheckpoint (save best model)

Training and Validation Accuracy Plot

Training and Validation Accuracy Camembert Finetuned

Both training and validation accuracy show steady improvement across epochs, with validation accuracy closely tracking training accuracy and reaching approximately 79.4% by the final epoch. Minor fluctuations in validation metrics are observed but the overall trends indicate stable learning and good generalization. The narrowing gap between training and validation loss further suggests the model is not overfitting.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.6865 | 0.7908 | 0.7350 | 612
1 | 0.7603 | 0.6756 | 0.7154 | 521
2 | 0.7437 | 0.7395 | 0.7416 | 357
3 | 0.9262 | 0.8571 | 0.8903 | 161
4 | 0.6672 | 0.8145 | 0.7335 | 539
5 | 0.9470 | 0.9555 | 0.9512 | 786
6 | 0.6486 | 0.3288 | 0.4364 | 146
7 | 0.5998 | 0.5068 | 0.5494 | 961
8 | 0.5546 | 0.4788 | 0.5139 | 424
9 | 0.8154 | 0.9117 | 0.8609 | 974
10 | 0.9197 | 0.7456 | 0.8235 | 169
11 | 0.6823 | 0.6607 | 0.6713 | 507
12 | 0.8153 | 0.6503 | 0.7235 | 672
13 | 0.7526 | 0.7177 | 0.7347 | 1013
14 | 0.8848 | 0.8859 | 0.8853 | 841
15 | 0.8904 | 0.9489 | 0.9187 | 137
16 | 0.6645 | 0.7911 | 0.7223 | 1029
17 | 0.8471 | 0.7824 | 0.8135 | 170
18 | 0.8306 | 0.8747 | 0.8521 | 942
19 | 0.7988 | 0.7809 | 0.7897 | 986
20 | 0.7580 | 0.7778 | 0.7677 | 306
21 | 0.8686 | 0.8668 | 0.8677 | 991
22 | 0.6165 | 0.6645 | 0.6396 | 462
23 | 0.9369 | 0.9580 | 0.9473 | 2047
24 | 0.7533 | 0.6571 | 0.7019 | 525
25 | 0.9419 | 0.9400 | 0.9409 | 517
26 | 0.9541 | 0.9894 | 0.9714 | 189

Metric | Value
Accuracy | 0.7936
Macro avg Precision | 0.7876
Macro avg Recall | 0.7685
Macro avg F1-score | 0.7740
Weighted avg Precision | 0.7930
Weighted avg Recall | 0.7936
Weighted avg F1-score | 0.7909

Summary Fine-Tuned Camembert Model

By fine-tuning Camembert within the multimodal architecture, the model adapts powerful language representations to the product classification task, resulting in improved semantic understanding and class discrimination. The fusion of trainable Camembert embeddings and deep CNN image features yields a validation accuracy of 79.36% and a weighted F1-score of 0.7909, slightly outperforming the frozen Camembert baseline but still not reaching CNN model 3 (Enhanced LSTM).

The model remains robust, with no significant signs of overfitting, and provides a strong foundation for further multimodal enhancements.

Model Comparison

Model | Validation Accuracy | Weighted F1-score
CNN Model 1 (LSTM) | 79.99% | 0.8010
CNN Model 2 (Conv1D) | 71.68% | 0.7192
CNN Model 3 (Enhanced LSTM) | 81.41% | 0.8136
CNN Camembert Model | 79.07% | 0.7879
Fine-Tuned Camembert Multimodal Model | 79.36% | 0.7909

Model 3 achieves the best overall performance, surpassing all other models in both accuracy and F1-score. The enhanced LSTM-based text branch and stable training configuration lead to improved generalization and class balance, making Model 3 the preferred choice for this multimodal product classification task.