Question 8
Domain 1
You work on an operations team at an international company that manages a large fleet of on-premises servers located in a few data centers around the world. Your team collects monitoring data from the servers, including CPU/memory consumption. When an incident occurs on a server, your team is responsible for fixing it. Incident data has not been properly labeled yet. Your management team wants you to build a predictive maintenance solution that uses monitoring data from the VMs to detect potential failures and then alerts the service desk team. What should you do first?
Correct answer: D
Explanation
Because the incident data is unlabeled, the first step is to create labels from historical monitoring patterns. A simple heuristic such as a z-score can identify abnormal CPU or memory behavior, letting you "label the machines' historical performance data" and then "train a model to predict anomalies" from that dataset.
Why each option is right or wrong
A. Train a time-series model to predict the machines' performance values. Configure an alert if a machine's actual performance values significantly differ from the predicted performance values.
B. Develop a simple heuristic (e.g., based on z-score) to label the machines' historical performance data. Test this heuristic in a production environment.
C. Hire a team of qualified analysts to review and label the machines' historical performance data. Train a model based on this manually labeled dataset.
D. Implement a simple heuristic (e.g., based on z-score) to label the machines' historical performance data. Train a model to predict anomalies based on this labeled dataset.
In a predictive-maintenance workflow, the blocker is the absence of incident labels, so the first step is to generate a training set from historical telemetry using a rule-based proxy labeler. A z-score heuristic is a common way to flag outliers in CPU and memory time series; once the historical points are labeled as normal or anomalous, a supervised model can be trained on them to detect likely failures and trigger alerts.
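The labeling step in option D can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `zscore_labels`, the threshold of 3.0, and the sample CPU readings are all assumptions made for the example, and it uses only Python's standard library.

```python
from statistics import mean, pstdev


def zscore_labels(values, threshold=3.0):
    """Label each reading 1 (anomalous) or 0 (normal) by its z-score.

    `threshold` is an assumed cutoff; a real deployment would tune it.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A perfectly flat series has no outliers to flag.
        return [0] * len(values)
    return [1 if abs(v - mu) / sigma > threshold else 0 for v in values]


# Hypothetical CPU utilisation (%) history for a single server.
cpu_history = [41, 39, 42, 40, 38, 43, 41, 40, 39, 42, 41, 40, 97, 40, 41, 39]

labels = zscore_labels(cpu_history)

# Pair each reading with its proxy label to form training rows.
training_rows = list(zip(cpu_history, labels))
```

Here only the 97% spike exceeds the threshold, so it is the single reading labeled anomalous. The resulting `training_rows` stand in for the labeled dataset; the second half of option D, training a model on those labels, would be done with whatever ML library the team already uses.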