Textual Preprocessing Report
Data preprocessing is split into textual and visual preprocessing; the textual part focuses on unifying and preparing the text elements of the datasets as features for later modelling.
Cleaning the Textual Variables
- Descriptions: Removed HTML tags and URLs from descriptions to clean up the text.
- Descriptions: Replaced common French HTML entities with their appropriate representations (e.g., the encoded apostrophe entity with `'`).
- Lowercasing: Converted all text in designations and descriptions to lowercase for consistency.
- Stop Words Removal: Removed English, French, and German stop words using the TweetTokenizer together with NLTK's stopword lists.
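
A minimal sketch of these cleaning steps, assuming pandas columns named `designation` and `description`; the report's manual entity replacements are approximated here with `html.unescape`, and the helper name is illustrative:

```python
import re
import html

from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.tokenize import TweetTokenizer

# Combined stop-word list for the three languages present in the corpus.
STOP_WORDS = set(
    stopwords.words("english")
    + stopwords.words("french")
    + stopwords.words("german")
)
tokenizer = TweetTokenizer()

def clean_text(text: str) -> str:
    """Strip HTML, URLs and entities, lowercase, and drop stop words."""
    text = html.unescape(text)                      # decode entities such as the apostrophe
    text = re.sub(r"<[^>]+>", " ", text)            # remove HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = text.lower()                             # lowercase for consistency
    tokens = [t for t in tokenizer.tokenize(text) if t not in STOP_WORDS]
    return " ".join(tokens)

# Applied column-wise, e.g.:
# df_train['description'] = df_train['description'].apply(clean_text)
```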
Measurement Value Normalization
To standardize the diverse and inconsistently formatted numerical expressions of size, mass, and volume in designation and description, a regex was implemented that replaces the majority of measurement values with the placeholder token `measurement_value`. This is especially useful for reducing vocabulary size and focusing the model on semantically relevant structure; a sketch of such a pattern is shown after the list below.
Regex Pattern Matching
The regular expression was crafted to catch the following categories of measurement expressions:
- Dimensional values: formats such as 100x200, 420x220x130cm, or 160 x 190 (covering both x and the Unicode × separator), with optional units like cm, mm, etc.
- Simple unit values: matches like 18 cm, 32 mm, 400m, 3.5kg, 200 ml, 280 gr², 280 g/m², 420g / m2 (handling variable spacing and the different grams-per-square-metre notations, including superscripts).
- Labeled values: keywords followed by values, such as longueur 19.50 or largeur 9.
- Excluded patterns: all percentages (e.g., 80%) are explicitly excluded from replacement to preserve meaningful relative quantities.
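
The following is a hedged sketch of such a pattern, not the exact regex used in the project; the unit list and the set of labeled keywords are assumptions and are likely narrower than the real implementation:

```python
import re

# Illustrative pattern only; unit and keyword coverage is an assumption.
MEASUREMENT_RE = re.compile(
    r"""
    (?:                                           # dimensional values: 100x200, 420x220x130cm, 160 x 190
        \d+(?:[.,]\d+)?(?:\s*[x×]\s*\d+(?:[.,]\d+)?)+\s*(?:cm|mm|m)?
    )
    |
    (?:                                           # simple unit values: 18 cm, 3.5kg, 400m, 420g / m2, 280 g/m²
        \d+(?:[.,]\d+)?\s*(?:cm|mm|kg|gr?|ml|m|l)(?:\s*/?\s*m?\s*[2²])?(?![\w%])
    )
    |
    (?:                                           # labeled values: longueur 19.50, largeur 9
        \b(?:longueur|largeur)\s+\d+(?:[.,]\d+)?
    )
    """,
    flags=re.IGNORECASE | re.VERBOSE,
)

def normalize_measurements(text: str) -> str:
    """Replace measurement expressions with a placeholder token.

    Percentages such as 80% are left untouched because a bare number
    without one of the listed units never matches.
    """
    return MEASUREMENT_RE.sub(" measurement_value ", text)

# normalize_measurements("matelas 160 x 200 cm, 80% coton")
# -> "matelas  measurement_value , 80% coton"
```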
Remaining Issue
- Measurements: Some measurement formats remain hard to capture with regex, and the spaCy pipeline does not handle them correctly either, so many unusable tokens are still left in the text.
Handling Missing Values
- Filled missing descriptions with designations: Where a description is missing, the column is filled with the corresponding designation.
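
With pandas this is a single `fillna`, assuming the raw columns are named `designation` and `description`:

```python
import pandas as pd

# df_train is assumed to hold 'designation' and 'description' columns.
df_train['description'] = df_train['description'].fillna(df_train['designation'])
```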
Transformation of Text into Tokens
- Tokenization: Used the TweetTokenizer to break down both designations and descriptions into tokens. These tokens are stored in `df_train['designation_tokens']` and `df_train['description_tokens']`.
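
A short sketch of this step, assuming the cleaned text lives in `designation` and `description` columns:

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()

# Tokenize both text columns; each new column holds a list of tokens per row.
df_train['designation_tokens'] = df_train['designation'].apply(tokenizer.tokenize)
df_train['description_tokens'] = df_train['description'].apply(tokenizer.tokenize)
```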
Transformation of Tokens into Numeric Vectors
- Vectorization: Applied `TfidfVectorizer` to the designation tokens and description tokens to turn them into numeric vectors. The vectorized features are stored in `X_designation_tfidf` and `X_description_tfidf`, respectively.
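
A sketch of the vectorization step; since the columns already hold token lists, a pass-through analyzer is used here, though joining the tokens back into strings before vectorizing would be an equally valid setup (the exact choice is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One vectorizer per text field, each fed pre-tokenized lists via a
# pass-through analyzer (no further tokenization or lowercasing applied).
designation_vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
description_vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)

X_designation_tfidf = designation_vectorizer.fit_transform(df_train['designation_tokens'])
X_description_tfidf = description_vectorizer.fit_transform(df_train['description_tokens'])
```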