Question 22
Content Domain 2: Exploratory Data Analysis
A Data Scientist is working on an application that performs sentiment analysis. The validation accuracy is poor, and the Data Scientist thinks that the cause may be a rich vocabulary and a low average frequency of words in the dataset. Which tool should be used to improve the validation accuracy?
Correct answer: D
Explanation
A TF-IDF vectorizer helps when the dataset has a "rich vocabulary" and words appear with "low average frequency": it converts text into weighted features that emphasize terms concentrated in a few documents and downweight terms that appear almost everywhere. Scikit-learn’s term frequency-inverse document frequency (TF-IDF) vectorizer provides this weighting, which can improve validation accuracy in sentiment analysis models.
Why each option is right or wrong
A. Amazon Comprehend syntax analysis and entity detection.
Comprehend syntax analysis and entity detection extract linguistic structure (parts of speech, named entities); they do not produce feature vectors suited to sentiment classification.
B. Amazon SageMaker BlazingText cbow mode.
BlazingText cbow mode learns dense word embeddings from context; it does not reweight sparse term counts, so it does not directly address a vocabulary of low-frequency words.
C. Natural Language Toolkit (NLTK) stemming and stop word removal.
Stemming and stop-word removal are useful preprocessing steps, but on their own they do not weight the sparse terms that remain.
D. Scikit-learn term frequency-inverse document frequency (TF-IDF) vectorizer.
A rich vocabulary with low per-word frequency is exactly the sparse-text scenario addressed by TF-IDF, which assigns each term a weight of \(tf \times \log(N/df)\) so rare but discriminative words contribute more than ubiquitous ones. In scikit-learn, `TfidfVectorizer` is the standard tool for this preprocessing step, and it is commonly used to improve classification performance on sentiment tasks by reducing the impact of very common tokens and producing more informative feature vectors.
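To make the weighting concrete, here is a minimal sketch of the textbook formula \(tf \times \log(N/df)\) in plain Python. The toy reviews and the helper name `tfidf` are hypothetical, chosen only for illustration. This is not scikit-learn's exact computation: `TfidfVectorizer` by default uses a smoothed inverse document frequency and L2-normalizes each document vector, so its numbers will differ while the ranking idea stays the same.

```python
import math
from collections import Counter

# Toy corpus (hypothetical reviews, for illustration only).
docs = [
    "the plot was dull and the acting was dull",
    "the acting was superb",
    "the soundtrack was fine",
]

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # df: number of documents that contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tfidf(docs)

# Print each document's terms from highest to lowest weight.
for doc, w in zip(docs, weights):
    print(doc)
    for term, value in sorted(w.items(), key=lambda kv: -kv[1]):
        print(f"  {term:<12}{value:.3f}")
```

In this sketch, words that occur in every review ("the", "was") receive a weight of zero, while words confined to one review ("dull", "superb") receive the largest weights, which is the behavior the explanation above relies on.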