Cleaning the Textual Variables

  • Deleting Description
  • Replace common French HTML entities with their characters
  • Language Recognition with langdetect
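  • sent_tokenize with language='french' (?!?)
  • Removal of special characters and numbers: everything matching [^a-zA-ZéàèêëîïôùüçÀÉÈÊËÎÏÔÙÜÇ] is removed, so only these letters are kept
  • word_tokenize --> list of tokens (dic)
  • Stemming each token with SnowballStemmer for the detected language (en/de/fr)
  • Join the tokens in dic back into a single string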
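
The steps listed above could be chained roughly as in the following sketch. It is not the project's actual code: the names clean_text, detect_language, NOT_ALLOWED and STEMMER_LANGS are hypothetical, html.unescape stands in for the French-entity replacement, and a plain str.split() stands in for word_tokenize (after the character filter only letters and whitespace remain). The sent_tokenize step is left out because it needs the NLTK punkt data. Falling back to French when detection fails, and skipping a step when langdetect or NLTK is not installed, are further assumptions.

```python
import html
import re

# Optional third-party dependencies; the sketch skips a step if one is missing.
try:
    from langdetect import detect, DetectorFactory
    DetectorFactory.seed = 0  # make langdetect results reproducible
except ImportError:
    detect = None

try:
    from nltk.stem.snowball import SnowballStemmer
except ImportError:
    SnowballStemmer = None

# Everything that is not one of the listed letters (or whitespace) is dropped.
NOT_ALLOWED = re.compile(r"[^a-zA-ZéàèêëîïôùüçÀÉÈÊËÎÏÔÙÜÇ\s]")

# langdetect language codes -> SnowballStemmer language names.
STEMMER_LANGS = {"en": "english", "de": "german", "fr": "french"}


def detect_language(text, default="fr"):
    """Return 'en', 'de' or 'fr'; use `default` if detection is unavailable or fails."""
    if detect is None:
        return default
    try:
        code = detect(text)
    except Exception:  # langdetect raises on text without usable features
        return default
    return code if code in STEMMER_LANGS else default


def clean_text(raw):
    """Decode entities, filter characters, tokenize, stem, and re-join."""
    text = html.unescape(raw)            # e.g. "&eacute;" -> "é"
    lang = detect_language(text)         # detect before punctuation is stripped
    text = NOT_ALLOWED.sub(" ", text)    # remove digits and special characters
    tokens = text.split()                # stand-in for word_tokenize
    if SnowballStemmer is not None:
        stemmer = SnowballStemmer(STEMMER_LANGS[lang])
        tokens = [stemmer.stem(token) for token in tokens]
    return " ".join(tokens)
```

Applied to a column of a pandas DataFrame, this would be something like df["text"].apply(clean_text), where the column name is again only an illustration.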
