Textual Preprocessing Report
Data preprocessing is split into textual and visual preprocessing; the textual part focuses on unifying and preparing the text elements of the datasets as features for later modelling.
Cleaning the Textual Variables
- Descriptions: Removed HTML tags and URLs from descriptions to clean up the text.
- Descriptions: Replaced common French HTML entities with their appropriate representations (e.g., the encoded apostrophe entity with `'`).
- Lowercasing: Converted all text in designations and descriptions to lowercase for consistency.
- Stop Words Removal: Removed English, French, and German stop words using the TweetTokenizer together with NLTK's stopword lists.
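
A minimal sketch of these cleaning steps, assuming pandas columns named `designation` and `description`; the report's manual entity replacements are approximated here with `html.unescape`, and the helper name is illustrative:

```python
import re
import html

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.tokenize import TweetTokenizer

# Combined stop-word list for the three languages present in the corpus.
STOP_WORDS = set(
    stopwords.words("english")
    + stopwords.words("french")
    + stopwords.words("german")
)
tokenizer = TweetTokenizer()

def clean_text(text: str) -> str:
    """Strip HTML, URLs and entities, lowercase, and drop stop words."""
    text = html.unescape(text)                      # decode entities such as the apostrophe
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = text.lower()                             # lowercase for consistency
    tokens = [t for t in tokenizer.tokenize(text) if t not in STOP_WORDS]
    return " ".join(tokens)

# Applied column-wise, e.g.:
# df_train['description'] = df_train['description'].apply(clean_text)
```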
Measurement Value Normalization
To standardize the diverse and inconsistently formatted numerical expressions of size, mass, and volume in designation and description, a regex was implemented that replaces the majority of measurement values with the placeholder token `measurement_value`. This is especially useful for reducing vocabulary size and focusing the model on semantically relevant structure; a sketch of such a pattern is shown after the list below.
Regex Pattern Matching
The regular expression was crafted to catch the following categories of measurement expressions:
- Dimensional values: formats such as 100x200, 420x220x130cm, or 160 x 190 (covering both x and the Unicode × separator), with optional units like cm, mm, etc.
- Simple unit values: matches like 18 cm, 32 mm, 400m, 3.5kg, 200 ml, 280 gr², 280 g/m², 420g / m2 (handling variable spacing and the different grams-per-square-metre notations, including superscripts).
- Labeled values: keywords followed by values, such as longueur 19.50 or largeur 9.
- Excluded patterns: all percentages (e.g., 80%) are explicitly excluded from replacement to preserve meaningful relative quantities.
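
The following is a hedged sketch of such a pattern, not the exact regex used in the project; the unit list and the set of labeled keywords are assumptions and are likely narrower than the real implementation:

```python
import re

# Illustrative pattern only; unit and keyword coverage is an assumption.
MEASUREMENT_RE = re.compile(
    r"""
    (?:                                           # dimensional values: 100x200, 420x220x130cm, 160 x 190
        \d+(?:[.,]\d+)?(?:\s*[x×]\s*\d+(?:[.,]\d+)?)+\s*(?:cm|mm|m)?
    )
    |
    (?:                                           # simple unit values: 18 cm, 3.5kg, 400m, 420g / m2, 280 g/m²
        \d+(?:[.,]\d+)?\s*(?:cm|mm|kg|gr?|ml|m|l)(?:\s*/?\s*m?\s*[2²])?(?![\w%])
    )
    |
    (?:                                           # labeled values: longueur 19.50, largeur 9
        \b(?:longueur|largeur)\s+\d+(?:[.,]\d+)?
    )
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)

def normalize_measurements(text: str) -> str:
    """Replace measurement expressions with a placeholder token.

    Percentages such as 80% are left untouched because a bare number
    without one of the listed units never matches.
    """
    return MEASUREMENT_RE.sub(" measurement_value ", text)

# normalize_measurements("matelas 160 x 200 cm, 80% coton")
# -> "matelas  measurement_value , 80% coton"
```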
Remaining Issue
- Measurements: Some measurement formats remain hard to capture with regex, and the spaCy pipeline does not handle them correctly either, so many unusable tokens are still left in the text.
Handling Missing Values
- Filled missing descriptions with designations: Where a description is missing, the column is filled with the corresponding designation.
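
With pandas this is a single `fillna`, assuming the raw columns are named `designation` and `description`:

```python
import pandas as pd

# df_train is assumed to hold 'designation' and 'description' columns.
df_train['description'] = df_train['description'].fillna(df_train['designation'])
```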
Transformation of Text into Tokens
- Tokenization: Used the TweetTokenizer to break down both designations and descriptions into tokens. These tokens are stored in `df_train['designation_tokens']` and `df_train['description_tokens']`.
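
A short sketch of this step, assuming the cleaned text lives in `designation` and `description` columns:

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

# Tokenize both text columns; each new column holds a list of tokens per row.
df_train['designation_tokens'] = df_train['designation'].apply(tokenizer.tokenize)
df_train['description_tokens'] = df_train['description'].apply(tokenizer.tokenize)
```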
Transformation of Tokens into Numeric Vectors
- Vectorization: Applied `TfidfVectorizer` to the designation tokens and description tokens to turn them into numeric vectors. The vectorized features are stored in `X_designation_tfidf` and `X_description_tfidf`, respectively.
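
A sketch of the vectorization step; since the columns already hold token lists, a pass-through analyzer is used here, though joining the tokens back into strings before vectorizing would be an equally valid setup (the exact choice is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One vectorizer per text field, each fed pre-tokenized lists via a
# pass-through analyzer (no further tokenization or lowercasing applied).
designation_vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
description_vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)

X_designation_tfidf = designation_vectorizer.fit_transform(df_train['designation_tokens'])
X_description_tfidf = description_vectorizer.fit_transform(df_train['description_tokens'])
```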