
Convolutional Neural Networks (Multimodal)

This section documents the multimodal neural network approach, which combines both image and text data for product classification.

The multimodal models use the preprocessed datasets split into split_train and split_val directories. Images are stored in class-specific subfolders and are referenced in the tabular data by imageid and productid. The text data is constructed by concatenating the designation and description fields.

Preprocessing

  • Image preprocessing: Images are loaded in grayscale and further resized to 128x128 pixels.
  • Text preprocessing: The designation and description fields are combined and tokenized using a Keras TextVectorization layer.
  • Label encoding: Product type codes (prdtypecode) are label-encoded for use in the neural network (see the sketch after this list).
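
A minimal sketch of these preprocessing steps follows; the CSV file name and the vocabulary size are assumptions, while the field names (designation, description, prdtypecode) and the 128x128 grayscale target come from the pipeline described above.

```python
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("split_train.csv")  # hypothetical file name

def load_image(path):
    """Load an image file in grayscale and resize it to 128x128."""
    raw = tf.io.read_file(path)
    img = tf.image.decode_jpeg(raw, channels=1)      # 1 channel = grayscale
    return tf.image.resize(img, (128, 128)) / 255.0  # scale to [0, 1]

# Text: concatenate designation and description, then vectorize.
df["text"] = df["designation"].fillna("") + " " + df["description"].fillna("")
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10_000,           # assumed vocabulary size
    output_sequence_length=100,  # Model 1 uses the first 100 tokens
)
vectorizer.adapt(df["text"].values)

# Labels: encode the prdtypecode values as integers 0..26.
labels = LabelEncoder().fit_transform(df["prdtypecode"])
```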

CNN Model 1

The first multimodal architecture serves as a baseline for combining image and text information.

Image Branch

A convolutional neural network processes the image input. The architecture includes three convolutional blocks, each with batch normalization, max pooling, and dropout for regularization. The output is flattened and passed through a dense layer.
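
A sketch of this branch is given below; the filter counts, dropout rate, and dense width are assumptions, while the overall structure (three Conv2D blocks with batch normalization, max pooling, and dropout, then flatten and dense) follows the description above.

```python
from tensorflow.keras import layers

image_input = layers.Input(shape=(128, 128, 1), name="image")  # grayscale

x = image_input
for filters in (32, 64, 128):  # three convolutional blocks
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Dropout(0.25)(x)

x = layers.Flatten()(x)
image_features = layers.Dense(128, activation="relu")(x)
```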

Text Branch

The text branch uses a Keras TextVectorization layer (limited to the first 100 tokens of the text for a quick first analysis), followed by an embedding layer and two bidirectional LSTM layers. The output is passed through a dense layer with batch normalization and dropout.
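
A sketch of the text branch follows; the embedding and LSTM sizes are assumptions, while the 100-token input and the two bidirectional LSTM layers come from the description.

```python
from tensorflow.keras import layers

text_input = layers.Input(shape=(100,), dtype="int64", name="text")  # token ids

x = layers.Embedding(input_dim=10_000, output_dim=64)(text_input)  # assumed sizes
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(64))(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.BatchNormalization()(x)
text_features = layers.Dropout(0.3)(x)
```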

Fusion and Output

The outputs of the image and text branches are concatenated and passed through additional dense, batch normalization, and dropout layers before the final softmax output layer.
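
The fusion head can be sketched as follows, reusing image_features and text_features from the branch sketches above; the layer widths and dropout rate are assumptions, and 27 is the number of product type classes.

```python
from tensorflow.keras import Model, layers

x = layers.Concatenate()([image_features, text_features])
x = layers.Dense(256, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.4)(x)
outputs = layers.Dense(27, activation="softmax")(x)  # one unit per class

model = Model(inputs=[image_input, text_input], outputs=outputs)
```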

Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • ReduceLROnPlateau (factor=0.5, patience=2)
  • ModelCheckpoint (save best model); the full training setup is sketched below
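
A sketch of this configuration, assuming the model from the fusion sketch above; the checkpoint file name and the train_*/val_* arrays are hypothetical placeholders.

```python
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=3),
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=2),
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", save_best_only=True),
]

model.fit(
    {"image": train_images, "text": train_tokens},
    train_labels,
    validation_data=({"image": val_images, "text": val_tokens}, val_labels),
    batch_size=64,
    epochs=10,
    callbacks=callbacks,
)
```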

Training and Validation Accuracy Plot

Training and Validation Accuracy 1

The training and validation accuracy curves show steady improvement until epoch 3, where validation accuracy peaks (validation loss reaches its minimum at epoch 4). Training accuracy keeps improving and training loss continues to fall over the following epochs (up to epoch 6, when the callbacks stop training), indicating some overfitting but a reasonable validation accuracy of around 80%.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.5927 | 0.7418 | 0.6589 | 612
1 | 0.7438 | 0.6910 | 0.7164 | 521
2 | 0.7857 | 0.7703 | 0.7779 | 357
3 | 0.9272 | 0.8696 | 0.8974 | 161
4 | 0.6688 | 0.7718 | 0.7166 | 539
5 | 0.9215 | 0.9262 | 0.9239 | 786
6 | 0.5641 | 0.3014 | 0.3929 | 146
7 | 0.5743 | 0.6275 | 0.5997 | 961
8 | 0.4820 | 0.5047 | 0.4931 | 424
9 | 0.9408 | 0.8326 | 0.8834 | 974
10 | 0.8812 | 0.8343 | 0.8571 | 169
11 | 0.8106 | 0.7515 | 0.7799 | 507
12 | 0.7560 | 0.6548 | 0.7018 | 672
13 | 0.9025 | 0.7493 | 0.8188 | 1013
14 | 0.9181 | 0.8799 | 0.8986 | 841
15 | 0.8145 | 0.7372 | 0.7739 | 137
16 | 0.6796 | 0.8367 | 0.7500 | 1029
17 | 0.9154 | 0.7000 | 0.7933 | 170
18 | 0.9104 | 0.7335 | 0.8125 | 942
19 | 0.7303 | 0.8185 | 0.7719 | 986
20 | 0.7841 | 0.7712 | 0.7776 | 306
21 | 0.8807 | 0.9092 | 0.8947 | 991
22 | 0.7042 | 0.6905 | 0.6973 | 462
23 | 0.9686 | 0.9629 | 0.9657 | 2047
24 | 0.7519 | 0.7562 | 0.7540 | 525
25 | 0.8013 | 0.9439 | 0.8668 | 517
26 | 0.9691 | 0.9947 | 0.9817 | 189

Metric | Value
Accuracy | 0.7999
Macro avg Precision | 0.7918
Macro avg Recall | 0.7689
Macro avg F1-score | 0.7761
Weighted avg Precision | 0.8085
Weighted avg Recall | 0.7999
Weighted avg F1-score | 0.8010

Summary CNN 1

CNN Model 1 demonstrates strong overall performance as a multimodal classifier, achieving a validation accuracy of 79.99% and a weighted F1-score of 0.8010.

The combination of a deep CNN for image features and bidirectional LSTM layers for text enables the model to capture both visual and sequential textual patterns effectively.

The training and validation curves indicate decent generalization with moderate overfitting.

  • Stronger performance is observed in several classes (e.g., classes 3, 5, 9, 13, 14, 17, 18, 21, 23, 25, 26), with F1-scores above 0.85.
  • Weaker performance is seen in classes with fewer samples or less distinctive features (e.g., class 6).

CNN Model 2

The second multimodal architecture explores a different approach to text sequence modeling and learning rate scheduling.

Image Branch

The image branch is identical to CNN Model 1. It processes grayscale images resized to 128x128 pixels using a series of convolutional blocks (Conv2D, BatchNormalization, MaxPooling2D, Dropout), followed by flattening and a dense layer.

Text Branch

The text branch differs from Model 1 in two key ways:

  • Token Length: The TextVectorization layer uses the first 300 tokens of the concatenated designation and description fields (instead of 100).
  • Sequence Model: Instead of LSTM layers, the branch uses two 1D convolutional layers (Conv1D) to extract local n-gram features from the embedded text sequence, followed by global max pooling and dense layers with batch normalization and dropout for regularization (see the sketch after this list).
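
A sketch of this branch, with assumed kernel sizes, filter counts, and embedding dimensions; the 300-token length, two Conv1D layers, and global max pooling follow the description.

```python
from tensorflow.keras import layers

text_input = layers.Input(shape=(300,), dtype="int64", name="text")

x = layers.Embedding(input_dim=10_000, output_dim=64)(text_input)  # assumed sizes
x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.BatchNormalization()(x)
text_features = layers.Dropout(0.3)(x)
```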

Fusion and Output

See model 1.

Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • CosineDecay learning rate scheduler (replaces ReduceLROnPlateau; see the sketch after this list)
  • ModelCheckpoint (save best model)
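
Unlike ReduceLROnPlateau, CosineDecay lowers the learning rate on a fixed schedule regardless of validation metrics. A sketch with assumed initial rate and decay horizon:

```python
import tensorflow as tf

schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,  # assumed starting rate
    decay_steps=10_000,          # assumed decay horizon in training steps
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```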

Training and Validation Accuracy Plot

Training and Validation Accuracy 2

Training accuracy improves for the first two epochs, then falls and recovers only to a lower level of around 40% by epoch 4. Validation accuracy is volatile: it drops below 20% during the first two epochs, then increases slowly but steadily until epoch 4, when the callbacks stop training.

The adjustments to the learning rate schedule may have led to this extreme volatility in validation accuracy, which remains lower, at around 71%, than the almost 80% reached by Model 1.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.3903 | 0.3170 | 0.3499 | 612
1 | 0.6631 | 0.4760 | 0.5542 | 521
2 | 0.3909 | 0.8683 | 0.5391 | 357
3 | 0.9672 | 0.7329 | 0.8339 | 161
4 | 0.6345 | 0.6827 | 0.6577 | 539
5 | 0.9656 | 0.7494 | 0.8438 | 786
6 | 0.4444 | 0.0274 | 0.0516 | 146
7 | 0.4961 | 0.7253 | 0.5892 | 961
8 | 0.4977 | 0.2571 | 0.3390 | 424
9 | 0.9089 | 0.8090 | 0.8561 | 974
10 | 0.9015 | 0.7041 | 0.7907 | 169
11 | 0.8571 | 0.6982 | 0.7696 | 507
12 | 0.5655 | 0.6875 | 0.6206 | 672
13 | 0.8743 | 0.7206 | 0.7900 | 1013
14 | 0.9372 | 0.7812 | 0.8521 | 841
15 | 0.5200 | 0.1898 | 0.2781 | 137
16 | 0.7383 | 0.7648 | 0.7513 | 1029
17 | 0.6846 | 0.5235 | 0.5933 | 170
18 | 0.8753 | 0.7006 | 0.7783 | 942
19 | 0.8719 | 0.6765 | 0.7619 | 986
20 | 0.8394 | 0.3758 | 0.5192 | 306
21 | 0.8960 | 0.8345 | 0.8642 | 991
22 | 0.7333 | 0.5238 | 0.6111 | 462
23 | 0.8978 | 0.9658 | 0.9306 | 2047
24 | 0.3760 | 0.7943 | 0.5104 | 525
25 | 0.4540 | 0.8395 | 0.5893 | 517
26 | 0.9894 | 0.9841 | 0.9867 | 189

Metric | Value
Accuracy | 0.7168
Macro avg Precision | 0.7174
Macro avg Recall | 0.6448
Macro avg F1-score | 0.6523
Weighted avg Precision | 0.7555
Weighted avg Recall | 0.7168
Weighted avg F1-score | 0.7192

Summary CNN 2

CNN Model 2 achieves a validation accuracy of 71.68% and a weighted F1-score of 0.7192.

This model uses the same CNN image branch as model 1, but replaces the LSTM-based text branch with two Conv1D layers and increases the text sequence length to 300 tokens. A cosine decay learning rate schedule is also introduced.

The training and validation curves show significant volatility, with validation accuracy dropping sharply in the early epochs before recovering, and overall accuracy remaining lower than Model 1.

  • Stronger performance is still observed in several classes (e.g., classes 3, 5, 9, 13, 14, 18, 21, 23, 26), but with generally lower F1-scores compared to model 1.
  • Weaker performance is especially pronounced in classes with fewer samples or less distinctive features (e.g., class 6, class 15), and the model struggles more with class imbalance.

CNN Model 3

The third multimodal architecture builds on the strengths of previous models, aiming for improved performance and stability.

Image Branch

The image branch remains identical to previous models, processing grayscale images resized to 128x128 pixels through a series of convolutional blocks (Conv2D, BatchNormalization, MaxPooling2D, Dropout), followed by flattening and a dense layer.

Text Branch

The text branch returns to a bidirectional LSTM-based approach, but with moderate enhancements:

  • Token Length: The TextVectorization layer uses the first 150 tokens of the concatenated designation and description fields.
  • Vocabulary Size: The vocabulary is increased to 12,000 (max_tokens=12000), allowing the model to capture a broader range of unique words and product-specific terminology.
  • Sequence Model: Two bidirectional LSTM layers are used, each with 80 units, providing a balance between model capacity and training speed. Dropout is applied for regularization, and the embedding dimension is set to 80 for richer representations (see the sketch after this list).
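
A sketch of this branch; the 150-token input, 12,000-word vocabulary, embedding dimension of 80, and two 80-unit bidirectional LSTM layers come from the description, while the dense width and dropout rates are assumptions.

```python
from tensorflow.keras import layers

text_input = layers.Input(shape=(150,), dtype="int64", name="text")

x = layers.Embedding(input_dim=12_000, output_dim=80)(text_input)
x = layers.Bidirectional(layers.LSTM(80, return_sequences=True, dropout=0.2))(x)
x = layers.Bidirectional(layers.LSTM(80, dropout=0.2))(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.BatchNormalization()(x)
text_features = layers.Dropout(0.3)(x)
```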

Fusion and Output

See model 1.

Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • ReduceLROnPlateau (factor=0.5, patience=2)
  • ModelCheckpoint (save best model)

Training and Validation Accuracy Plot

Training and Validation Accuracy 3

Validation accuracy improves slowly but steadily up to epoch 5, where it drops and then plateaus for the remaining training epochs. Validation loss also reaches its best value at epoch 5.

Training accuracy continues to improve, showing some signs of overfitting; however, both accuracies eventually exceed 80%.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.6744 | 0.6634 | 0.6689 | 612
1 | 0.8057 | 0.6526 | 0.7211 | 521
2 | 0.7841 | 0.7731 | 0.7786 | 357
3 | 0.8916 | 0.9193 | 0.9052 | 161
4 | 0.7276 | 0.7681 | 0.7473 | 539
5 | 0.9017 | 0.9567 | 0.9284 | 786
6 | 0.6581 | 0.5274 | 0.5856 | 146
7 | 0.6262 | 0.6712 | 0.6479 | 961
8 | 0.5869 | 0.5495 | 0.5676 | 424
9 | 0.9124 | 0.8665 | 0.8889 | 974
10 | 0.9187 | 0.8698 | 0.8936 | 169
11 | 0.7773 | 0.7712 | 0.7743 | 507
12 | 0.8146 | 0.6801 | 0.7413 | 672
13 | 0.8413 | 0.8213 | 0.8312 | 1013
14 | 0.9283 | 0.8930 | 0.9103 | 841
15 | 0.8868 | 0.6861 | 0.7737 | 137
16 | 0.7713 | 0.7833 | 0.7772 | 1029
17 | 0.8242 | 0.8000 | 0.8119 | 170
18 | 0.8984 | 0.7696 | 0.8290 | 942
19 | 0.6868 | 0.8519 | 0.7605 | 986
20 | 0.7914 | 0.7810 | 0.7862 | 306
21 | 0.8535 | 0.9173 | 0.8842 | 991
22 | 0.6693 | 0.7273 | 0.6971 | 462
23 | 0.9517 | 0.9712 | 0.9613 | 2047
24 | 0.7871 | 0.7181 | 0.7510 | 525
25 | 0.8336 | 0.9207 | 0.8750 | 517
26 | 0.9691 | 0.9947 | 0.9817 | 189

Metric | Value
Accuracy | 0.8141
Macro avg Precision | 0.8064
Macro avg Recall | 0.7891
Macro avg F1-score | 0.7955
Weighted avg Precision | 0.8168
Weighted avg Recall | 0.8141
Weighted avg F1-score | 0.8136

Summary CNN 3

CNN Model 3 achieves a validation accuracy of 81.41% and a weighted F1-score of 0.8136.

This model combines a robust CNN image branch with an enhanced bidirectional LSTM text branch (two layers, 80 units each, 150 tokens), striking a balance between model capacity and efficiency. The training and validation curves show stable improvement despite overfitting.

  • Stronger performance is observed in many classes (e.g., classes 3, 5, 9, 10, 13, 14, 15, 17, 18, 21, 23, 25, 26), with F1-scores above 0.85.
  • Weaker performance is still present in some underrepresented or less distinctive classes (e.g., class 6, class 8), but overall class balance and generalization are improved compared to previous models.

Camembert Model

This section documents the multimodal neural network approach using pretrained Camembert embeddings combined with a CNN architecture for product classification.

The model utilizes frozen Camembert [CLS] embeddings extracted from the concatenated text fields (designation + description). These embeddings serve as fixed-length feature vectors that are fed into convolutional layers.

Camembert Model Embeddings

The text data is tokenized using the CamembertTokenizer and fed into the frozen TFCamembertModel. The [CLS] token embedding from the last hidden layer is extracted, representing the entire sequence's semantic meaning. Embeddings are computed in batches, saved, and used as input features for the CNN model.
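
A sketch of the extraction step; the batch size, max_length, and output file name are assumptions, while the tokenizer, frozen model, and [CLS] slice follow the description above.

```python
import numpy as np
from transformers import CamembertTokenizer, TFCamembertModel

tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
camembert = TFCamembertModel.from_pretrained("camembert-base")
camembert.trainable = False  # frozen: used as a fixed feature extractor

def cls_embeddings(texts, batch_size=32, max_length=128):
    """Compute [CLS] embeddings for a list of strings, batch by batch."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(
            texts[i : i + batch_size],
            padding=True, truncation=True, max_length=max_length,
            return_tensors="tf",
        )
        out = camembert(**enc)
        # [CLS] token = first position of the last hidden state
        chunks.append(out.last_hidden_state[:, 0, :].numpy())
    return np.concatenate(chunks)

embeddings = cls_embeddings(df["text"].tolist())     # df from the preprocessing sketch
np.save("camembert_cls_embeddings.npy", embeddings)  # illustrative file name
```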

Image Branch

See other models.

Text Branch

The model's text branch takes the precomputed Camembert [CLS] embeddings as input. These embeddings are passed through several 1D convolutional layers with batch normalization and dropout. This enables capturing local semantic n-gram features within the embedding space. A global max pooling layer follows to aggregate features.
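
Because a single [CLS] vector has no sequence axis, one way to apply Conv1D is to reshape the 768-dimensional embedding to shape (768, 1), as in the sketch below; the filter counts and kernel sizes are assumptions.

```python
from tensorflow.keras import layers

embedding_input = layers.Input(shape=(768,), name="cls_embedding")

x = layers.Reshape((768, 1))(embedding_input)  # add a "sequence" axis
x = layers.Conv1D(64, kernel_size=5, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
x = layers.Conv1D(128, kernel_size=5, activation="relu")(x)
x = layers.BatchNormalization()(x)
x = layers.Dropout(0.3)(x)
text_features = layers.GlobalMaxPooling1D()(x)
```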

Fusion and Output

See other models.

Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • ReduceLROnPlateau (factor=0.5, patience=2)
  • ModelCheckpoint (save best model)

Training and Validation Accuracy Plot

Training and Validation Accuracy Camembert

Both training and validation accuracy steadily improve throughout the epochs, with validation accuracy closely tracking training accuracy and ending at roughly 79%.

Minor fluctuations in validation accuracy and loss around epochs 4–7 reflect normal variance rather than overfitting, as the validation metrics mirror the training curves and the gap between losses narrows over time. Overall, the model demonstrates stable learning and good generalization, with no significant signs of overfitting or underfitting.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.6303 | 0.7827 | 0.6983 | 612
1 | 0.7179 | 0.6987 | 0.7082 | 521
2 | 0.7562 | 0.6779 | 0.7149 | 357
3 | 0.9388 | 0.8571 | 0.8961 | 161
4 | 0.6799 | 0.7959 | 0.7333 | 539
5 | 0.9273 | 0.9567 | 0.9418 | 786
6 | 0.6038 | 0.2192 | 0.3216 | 146
7 | 0.5646 | 0.5047 | 0.5330 | 961
8 | 0.5394 | 0.5165 | 0.5277 | 424
9 | 0.7885 | 0.9189 | 0.8487 | 974
10 | 0.9213 | 0.6923 | 0.7905 | 169
11 | 0.7192 | 0.6213 | 0.6667 | 507
12 | 0.7785 | 0.6801 | 0.7260 | 672
13 | 0.7553 | 0.7345 | 0.7447 | 1013
14 | 0.8960 | 0.8811 | 0.8885 | 841
15 | 0.8800 | 0.9635 | 0.9199 | 137
16 | 0.6948 | 0.7677 | 0.7295 | 1029
17 | 0.8364 | 0.8118 | 0.8239 | 170
18 | 0.8719 | 0.8312 | 0.8511 | 942
19 | 0.8052 | 0.7921 | 0.7986 | 986
20 | 0.7315 | 0.7745 | 0.7524 | 306
21 | 0.8560 | 0.8698 | 0.8629 | 991
22 | 0.6815 | 0.6299 | 0.6547 | 462
23 | 0.9367 | 0.9609 | 0.9486 | 2047
24 | 0.7505 | 0.6933 | 0.7208 | 525
25 | 0.9159 | 0.9478 | 0.9316 | 517
26 | 0.9637 | 0.9841 | 0.9738 | 189

Metric | Value
Accuracy | 0.7907
Macro avg Precision | 0.7830
Macro avg Recall | 0.7616
Macro avg F1-score | 0.7669
Weighted avg Precision | 0.7897
Weighted avg Recall | 0.7907
Weighted avg F1-score | 0.7879

Summary Camembert Model

The CNN Camembert model leverages powerful contextual embeddings from a pretrained Camembert language model, using frozen [CLS] embeddings as fixed feature vectors.

Passing these embeddings through convolutional layers lets the model capture local semantic patterns, resulting in a validation accuracy of 79.07% and a weighted F1-score of 0.7879.

Despite strong performance, the model underperforms the enhanced LSTM models, likely because:

  • The Camembert embeddings are not fine-tuned, limiting task adaptation.
  • A CNN over the embeddings captures local patterns but misses the longer-range dependencies modeled by LSTMs.
  • Model 3's improved text branch and training regimen lead to better generalization and class balance.

Nevertheless, the Camembert model provides a strong baseline utilizing pretrained language knowledge effectively in a CNN framework.

Fine-Tuned Camembert Multimodal Model

This section documents the multimodal neural network approach using a jointly fine-tuned Camembert model combined with a CNN architecture for product classification. Unlike previous models that use frozen Camembert embeddings, this architecture enables end-to-end training, allowing Camembert's language representations to adapt to the specific classification task.


Camembert Model Embeddings

Text data is dynamically tokenized using the CamembertTokenizer and fed into a custom Keras layer wrapping the PyTorch CamembertModel. The [CLS] token embedding from the last hidden layer is extracted for each sample, representing the semantic meaning of the concatenated designation and description fields. Crucially, Camembert's weights are updated during training, enabling the model to learn task-specific features.
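
The project wraps the PyTorch CamembertModel in a custom Keras layer; the sketch below substitutes the TensorFlow variant (TFCamembertModel) to show the same end-to-end idea with less glue code, and the 128-token input length is an assumption.

```python
from tensorflow.keras import layers
from transformers import TFCamembertModel

class CamembertCLSLayer(layers.Layer):
    """Return the [CLS] embedding; Camembert weights stay trainable."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.camembert = TFCamembertModel.from_pretrained("camembert-base")
        self.camembert.trainable = True  # weights are updated during training

    def call(self, inputs):
        input_ids, attention_mask = inputs
        out = self.camembert(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0, :]  # [CLS] token embedding

input_ids = layers.Input(shape=(128,), dtype="int32", name="input_ids")
attention_mask = layers.Input(shape=(128,), dtype="int32", name="attention_mask")
cls_embedding = CamembertCLSLayer()([input_ids, attention_mask])
```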


Image Branch

The image branch remains unchanged from previous architectures. Grayscale product images are resized and normalized, then passed through a deep stack of convolutional layers with batch normalization and dropout. These layers extract high-level visual features, which are further processed by dense layers.


Text Branch

The text branch receives [CLS] embeddings from the fine-tuned Camembert model. These embeddings are processed through dense layers with batch normalization and dropout, enhancing feature extraction and regularization. This enables the model to capture both global and local semantic patterns relevant to product classification.


Fusion and Output

Text and image features are concatenated and passed through additional dense layers with dropout and batch normalization. The final output layer uses softmax activation to predict product classes.


Training Configuration

Optimizer: Adam
Loss Function: Sparse Categorical Crossentropy
Batch Size: 64
Epochs: 10
Callbacks:

  • EarlyStopping (patience=3)
  • ReduceLROnPlateau (factor=0.5, patience=2)
  • ModelCheckpoint (save best model)

Training and Validation Accuracy Plot

Training and Validation Accuracy Camembert Finetuned

Both training and validation accuracy show steady improvement across epochs, with validation accuracy closely tracking training accuracy and reaching approximately 79.4% by the final epoch. Minor fluctuations in validation metrics are observed but the overall trends indicate stable learning and good generalization. The narrowing gap between training and validation loss further suggests the model is not overfitting.


Classification Report

Class | Precision | Recall | F1-score | Support
0 | 0.6865 | 0.7908 | 0.7350 | 612
1 | 0.7603 | 0.6756 | 0.7154 | 521
2 | 0.7437 | 0.7395 | 0.7416 | 357
3 | 0.9262 | 0.8571 | 0.8903 | 161
4 | 0.6672 | 0.8145 | 0.7335 | 539
5 | 0.9470 | 0.9555 | 0.9512 | 786
6 | 0.6486 | 0.3288 | 0.4364 | 146
7 | 0.5998 | 0.5068 | 0.5494 | 961
8 | 0.5546 | 0.4788 | 0.5139 | 424
9 | 0.8154 | 0.9117 | 0.8609 | 974
10 | 0.9197 | 0.7456 | 0.8235 | 169
11 | 0.6823 | 0.6607 | 0.6713 | 507
12 | 0.8153 | 0.6503 | 0.7235 | 672
13 | 0.7526 | 0.7177 | 0.7347 | 1013
14 | 0.8848 | 0.8859 | 0.8853 | 841
15 | 0.8904 | 0.9489 | 0.9187 | 137
16 | 0.6645 | 0.7911 | 0.7223 | 1029
17 | 0.8471 | 0.7824 | 0.8135 | 170
18 | 0.8306 | 0.8747 | 0.8521 | 942
19 | 0.7988 | 0.7809 | 0.7897 | 986
20 | 0.7580 | 0.7778 | 0.7677 | 306
21 | 0.8686 | 0.8668 | 0.8677 | 991
22 | 0.6165 | 0.6645 | 0.6396 | 462
23 | 0.9369 | 0.9580 | 0.9473 | 2047
24 | 0.7533 | 0.6571 | 0.7019 | 525
25 | 0.9419 | 0.9400 | 0.9409 | 517
26 | 0.9541 | 0.9894 | 0.9714 | 189

Metric | Value
Accuracy | 0.7936
Macro avg Precision | 0.7876
Macro avg Recall | 0.7685
Macro avg F1-score | 0.7740
Weighted avg Precision | 0.7930
Weighted avg Recall | 0.7936
Weighted avg F1-score | 0.7909

Summary Fine-Tuned Camembert Model

By fine-tuning Camembert within the multimodal architecture, the model adapts powerful language representations to the product classification task, resulting in improved semantic understanding and class discrimination. The fusion of trainable Camembert embeddings and deep CNN image features yields a validation accuracy of 79.36% and a weighted F1-score of 0.7909, slightly outperforming the frozen Camembert baseline but still not reaching CNN model 3 (Enhanced LSTM).

The model remains robust, with no significant signs of overfitting, and provides a strong foundation for further multimodal enhancements.

Model Comparison

Model | Validation Accuracy | Weighted F1-score
CNN Model 1 (LSTM) | 79.99% | 0.8010
CNN Model 2 (Conv1D) | 71.68% | 0.7192
CNN Model 3 (Enhanced LSTM) | 81.41% | 0.8136
CNN Camembert Model | 79.07% | 0.7879
Fine-Tuned Camembert Multimodal Model | 79.36% | 0.7909

Model 3 achieves the best overall performance, surpassing all other models in both accuracy and F1-score. The enhanced LSTM-based text branch and stable training configuration lead to improved generalization and class balance, making Model 3 the preferred choice for this multimodal product classification task.