Question 5
Domain 2: Data Preparation
A Generative AI Engineer is building a RAG application that will rely on context retrieved from source documents that are currently in HTML format. They want to develop a solution using the least amount of lines of code. Which Python package should be used to extract the text from the source documents?
Correct answer: D
Explanation
The exam guide says to “Choose the appropriate Python package to extract document content from provided source data and format.” For HTML source documents, Beautiful Soup is the standard package for parsing HTML and extracting text with minimal code, which fits the least-lines-of-code requirement.
Why each option is right or wrong
A. pytesseract
pytesseract is an OCR library for extracting text from images, not HTML.
B. numpy
numpy is for numerical arrays and computation, not document parsing.
C. pypdf2
pypdf2 is for reading and manipulating PDF files, not HTML documents.
D. beautifulsoup
The source files are already HTML, so a lightweight HTML parser is the right fit. Beautiful Soup (installed as beautifulsoup4 and imported from bs4) parses the markup and returns the visible text in a few lines of code. The other options target different inputs: OCR packages are for scanned images, PDF libraries are for PDF files, and numpy does not parse documents at all.
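To make the "few lines of code" point concrete, here is a minimal sketch of text extraction with Beautiful Soup. The sample HTML and the helper name html_to_text are made up for illustration, and the choice to drop script and style tags is an assumption about what a RAG pipeline would want, not something the question specifies.

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def html_to_text(html: str) -> str:
    """Return the readable text of an HTML document (hypothetical helper)."""
    # "html.parser" is Python's built-in parser, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: script and style contents are noise for retrieval, so remove them.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Join the remaining text nodes with spaces and trim surrounding whitespace.
    return soup.get_text(separator=" ", strip=True)


# Made-up sample document standing in for one of the HTML source files.
sample_html = """
<html>
  <head><title>Returns Policy</title><style>p { color: red; }</style></head>
  <body>
    <h1>Returns</h1>
    <p>Items can be returned within <b>30 days</b>.</p>
    <script>console.log("tracking");</script>
  </body>
</html>
"""

text = html_to_text(sample_html)
print(text)
```

The core of the extraction is two calls, the BeautifulSoup constructor and get_text; the tag-removal loop is optional cleanup before chunking the text for retrieval.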