Question 36
Content Domain 2: Exploratory Data Analysis
A data scientist is preparing a dataset that includes a free-text product description field for use in a machine learning model. Which feature engineering technique is most appropriate as the first step to convert that text into model-usable units?
Correct answer: B
Explanation
Tokenization splits raw text into smaller units, such as words or subwords, so the text can be transformed into features for modeling. Other feature engineering methods address different data issues, such as categorical representation, grouping values, or reducing feature count.
Source material: Analyze and evaluate feature engineering concepts. Key terms include binning, tokenization, outliers, synthetic features, one-hot encoding, and reducing dimensionality of data.
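As an illustration of the tokenization step described in the explanation, here is a minimal sketch of word-level tokenization using only the Python standard library. The `tokenize` helper, the regular expression, and the sample description are assumptions made for this example; they do not come from the source material, and production pipelines often use a library tokenizer (including subword tokenizers) instead.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens.

    Hypothetical helper for illustration; punctuation and hyphens act as
    separators, so "noise-cancelling" becomes two tokens.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


# Sample free-text product description (made up for this example).
description = "Wireless noise-cancelling headphones, 30-hour battery"
tokens = tokenize(description)
print(tokens)
# ['wireless', 'noise', 'cancelling', 'headphones', '30', 'hour', 'battery']
```

Each token can then be mapped to a feature, for example a count or an index into a vocabulary, which is why tokenization comes before any other text transformation.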
Why each option is right or wrong
A. Apply binning to group the text into numeric ranges
Binning groups numeric values into intervals; it does not split free text into language units for modeling.
B. Use tokenization to split the text into smaller units
The source material identifies tokenization as a feature engineering concept specifically suited to text data. Because the field contains free-text product descriptions, the appropriate first step is to break the text into smaller units that can later be represented as features for the model.
C. Use one-hot encoding to detect unusual text values
One-hot encoding represents categorical values as binary indicator columns; it neither detects unusual values nor divides raw text into words or subwords.
D. Reduce dimensionality before extracting any text elements
Dimensionality reduction is applied once features exist; raw text must first be converted into usable components.
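To make the contrast between the options concrete, the sketch below applies binning and one-hot encoding to the kinds of data they suit, then builds a count vector from already-split text. All names (`bin_price`, `one_hot`, `count_vector`), the price thresholds, and the sample values are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bin_price(price: float) -> str:
    """Binning: map a numeric value to an interval label (thresholds are arbitrary)."""
    if price < 20:
        return "low"
    if price < 100:
        return "medium"
    return "high"


def one_hot(value: str, categories: list[str]) -> list[int]:
    """One-hot encoding: one binary indicator per known category."""
    return [1 if value == category else 0 for category in categories]


def count_vector(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Turn tokens into per-word counts; this only works after the text is split."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


price_bin = bin_price(59.99)                                   # numeric input
category_vector = one_hot("video", ["audio", "video", "gaming"])  # categorical input

tokens = "usb cable with usb adapter".split()  # simple whitespace tokenization
vocabulary = sorted(set(tokens))
text_vector = count_vector(tokens, vocabulary)

print(price_bin, category_vector, vocabulary, text_vector)
```

Binning and one-hot encoding operate on numeric and categorical columns respectively, while the count vector exists only because the text was tokenized first. A dimensionality reduction method would take vectors like `text_vector` as input, so it cannot be the first step.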