Social media is one of the most common sources of data for NLP. The significant challenge with this resource is that the text rarely follows standard spelling, as it is filled with short forms and colloquial substitutes. The project's goal was to develop a Lexical Normalization system that enables efficient information extraction by converting non-standard text into a ready-to-use standard register.

The process involved experimenting with data augmentation methods and implementing baselines such as Maximum Frequency Replacement. The final system is a hybrid architecture built around a character-based encoder-decoder model that handles in-vocabulary and out-of-vocabulary words differently, as sketched below. This approach increased accuracy by 3%, and I am currently scraping Twitter data and categorizing words that fail spell check to improve the system's overall efficacy.
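The sketch below illustrates how such a hybrid pipeline might route tokens: in-vocabulary words pass through unchanged, out-of-vocabulary words seen in training fall back to the Maximum Frequency Replacement lookup, and unseen words would be handed to the character-level encoder-decoder. The vocabulary, training pairs, and the `char_seq2seq` stand-in are illustrative assumptions, not the project's actual implementation.

```python
from collections import Counter, defaultdict

def build_mfr_lookup(pairs):
    """Maximum Frequency Replacement baseline: for every raw token seen in
    training, remember the normalization it maps to most often."""
    counts = defaultdict(Counter)
    for raw, norm in pairs:
        counts[raw.lower()][norm] += 1
    return {raw: c.most_common(1)[0][0] for raw, c in counts.items()}

def normalize_token(token, vocab, mfr_lookup, char_seq2seq=None):
    """Hybrid routing: in-vocabulary tokens are kept as-is, known
    out-of-vocabulary tokens use the MFR lookup, and unseen
    out-of-vocabulary tokens fall back to a character-level
    encoder-decoder model (stubbed here as an optional callable)."""
    lower = token.lower()
    if lower in vocab:                # in-vocab: assume already standard
        return token
    if lower in mfr_lookup:           # seen in training: most frequent replacement
        return mfr_lookup[lower]
    if char_seq2seq is not None:      # unseen OOV: hand off to the neural model
        return char_seq2seq(token)
    return token                      # last resort: leave unchanged

# Toy usage with hypothetical training pairs and vocabulary
train_pairs = [("u", "you"), ("u", "you"), ("gr8", "great"), ("tmrw", "tomorrow")]
vocab = {"you", "great", "tomorrow", "see", "later"}
mfr = build_mfr_lookup(train_pairs)
print([normalize_token(t, vocab, mfr) for t in ["see", "u", "tmrw", "gr8"]])
# -> ['see', 'you', 'tomorrow', 'great']
```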

View the code -> GitHub

Read the report -> Google Drive

Team

Jonas Oppenheim
Parth Sheth