Heatmap of missing descriptions

[Figure: heatmap of missing 'description' values by product type]

Description and analysis

The heatmap shows, for each product type, the share of missing values in the variable 'description'. 'Description' is the only variable with missing values in the training and test sets.

Product types 2403 and 2462 clearly have the most missing values for that variable, at more than 97% and 96% respectively.

Product Type Code    Missing Description Percentage
2403                 97.36
2462                 96.27
2280                 93.28
1160                 91.17
10                   89.15
1180                 79.97
40                   65.47
1140                 64.62
2705                 37.12
1320                 33.97
50                   28.32
2522                 23.83
1281                 23.77
1300                 23.25
1280                 21.93
60                   15.99
1301                 13.01
1302                 11.12
2583                  8.85
2220                  8.01
2585                  7.93
2060                  5.73
1940                  5.60
1920                  4.81
1560                  3.49
2582                  2.39
2905                  0.00
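
A table like the one above could be produced with a short pandas aggregation. The following is a minimal sketch on toy data, not the original analysis code: the column names `prdtypecode` and `description` and all sample values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the training data; column names and values are assumptions.
df = pd.DataFrame({
    "prdtypecode": [2403, 2403, 2403, 2403, 10, 10, 2905, 2905],
    "description": [None, None, None, "Lot of magazines",
                    None, "Used book", "PC game", "Download code"],
})

# isna() flags missing descriptions; the mean of that boolean flag
# within each product type is the share of missing values.
missing_pct = (
    df["description"].isna()
    .groupby(df["prdtypecode"])
    .mean()
    .mul(100)
    .round(2)
    .sort_values(ascending=False)
)
print(missing_pct)
```

Sorting in descending order mirrors the layout of the table, with the most affected product types first.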

Validation

A chi-square test reveals a significant deviation from a uniform distribution of missing descriptions across product types, indicating that missing descriptions are indeed not evenly distributed. Whether the 'description' variable is missing may therefore in itself be a useful feature for predicting the target variable.

Chi-square statistic: 40627.95

p-value: 0.0
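
The text does not state which variant of the chi-square test was run. One plausible setup is a test of independence between product type and the missing/present status of the description, using `scipy.stats.chi2_contingency`. The sketch below uses invented counts for three hypothetical product types, so its output will not match the figures reported above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts per product type: [missing, present]. Not the real data.
observed = np.array([
    [90, 10],  # type A: description mostly missing
    [50, 50],  # type B: missing about half the time
    [5, 95],   # type C: description rarely missing
])

# Tests whether the missing/present split is independent of product type.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-square statistic: {chi2:.2f}")
print(f"p-value: {p_value:.3g}")
print(f"Degrees of freedom: {dof}")
```

On real data, the contingency table could be built with `pd.crosstab` from the product type column and a boolean missing flag.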

INFO

A p-value of 0 may also be due to the very large sample size of X_train: with this many rows, even a modest deviation produces an extremely small p-value, so the deviation may not be as strong as the values of the chi-square test imply.
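
The effect of sample size can be illustrated with a small pure-Python sketch. It computes the Pearson chi-square statistic for a toy table and for the same table with every count multiplied by 100. The proportions are identical in both tables, yet the statistic grows by the same factor of 100. The helper `chi_square_stat` and the counts are hypothetical and only serve this illustration.

```python
def chi_square_stat(table):
    """Pearson chi-square statistic for rows of [missing, present] counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if missingness were independent of the row.
            exp = row_totals[i] * col_totals[j] / n
            stat += (obs - exp) ** 2 / exp
    return stat

# Two hypothetical product types with only slightly different missing rates.
small = [[12, 8], [10, 10]]
# Same proportions, 100 times as many rows.
large = [[count * 100 for count in row] for row in small]

stat_small = chi_square_stat(small)
stat_large = chi_square_stat(large)
print(f"small sample: {stat_small:.3f}")
print(f"large sample: {stat_large:.3f}")
```

A weak difference that is nowhere near significant in the small table therefore becomes highly significant once the sample is large enough.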

Business relevance

Product types with higher percentages of missing descriptions may indicate areas where product information is lacking. This may hinder customer decision-making and affect SEO performance for Rakuten's e-commerce platform. Addressing these gaps can enhance product visibility and improve customer satisfaction.

Conversely, product types with lower percentages of missing descriptions suggest better documentation and information availability, which can be leveraged in marketing strategies to highlight well-documented products.