Random Forest Classification for ML/Data Science Requirement

This file builds a classification model to predict whether a job posting requires machine learning or data science skills. It combines structured fields (like job title and industry) with unstructured job descriptions using TF-IDF to improve prediction accuracy and uncover key features that signal ML-related roles.

Classification report

              precision    recall  f1-score   support

           0       0.76      0.77      0.77      7933
           1       0.71      0.70      0.70      6318

    accuracy                           0.74     14251
   macro avg       0.73      0.73      0.73     14251
weighted avg       0.74      0.74      0.74     14251
The Random Forest model achieved an overall accuracy of 74% when classifying whether a role requires ML skills. Precision and recall for class 0 (non-ML roles) were slightly higher (precision 0.76, recall 0.77) than for class 1 (ML roles), where both were around 0.70. This indicates the model performs reasonably well but is somewhat better at identifying non-ML roles than ML roles. Overall, the model shows decent predictive power, with room for improvement in detecting ML-related positions.
This bar chart displays the feature importance scores from a Random Forest model predicting whether a job role involves ML/data science. The most influential feature by far is the job title (TITLE), whose importance is significantly higher than that of all other variables. Secondary contributors include industry classification (NAICS2_NAME) and minimum years of experience, while education level and SOC code had relatively little influence on the model's predictions. This suggests that the job title alone carries strong predictive power for identifying ML-related roles.
              precision    recall  f1-score   support

           0       0.79      0.87      0.83      9991
           1       0.81      0.70      0.75      7823

    accuracy                           0.79     17814
   macro avg       0.80      0.78      0.79     17814
weighted avg       0.80      0.79      0.79     17814
In this model, we incorporated the job description text by applying TF-IDF vectorization to extract the most informative words from each posting. Using these features, the Random Forest model achieved an accuracy of 79%, a notable improvement over the previous model, which relied only on structured fields like job titles and industries. The model shows higher recall (0.87) for non-ML roles but lower recall (0.70) for ML-related roles, suggesting it is better at identifying traditional roles than at detecting specialized data science positions. Overall, adding the job description significantly enhanced the model's ability to capture complex signals related to AI/ML requirements across different industries.
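Combining the structured fields with TF-IDF features from the description can be done in one pipeline, sketched below. Again this is an assumed setup on synthetic data, not the project's code: the BODY column, the `max_features` cap, and the ML_REQUIRED label are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in; real data would have thousands of distinct postings.
df = pd.DataFrame({
    "TITLE": ["Data Scientist", "Accountant"] * 50,
    "BODY": ["build machine learning models in python",
             "prepare financial statements and reports"] * 50,
    "ML_REQUIRED": [1, 0] * 50,
})

# TfidfVectorizer takes a single column name (string), not a list.
pre = ColumnTransformer([
    ("title", OneHotEncoder(handle_unknown="ignore"), ["TITLE"]),
    ("text", TfidfVectorizer(max_features=500, stop_words="english"), "BODY"),
])

clf = Pipeline([("pre", pre), ("rf", RandomForestClassifier(random_state=42))])

X_tr, X_te, y_tr, y_te = train_test_split(
    df.drop(columns="ML_REQUIRED"), df["ML_REQUIRED"],
    test_size=0.3, random_state=42, stratify=df["ML_REQUIRED"],
)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```

The `classification_report` call is what produces tables in the format shown above.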
This bar chart shows the top words contributing to the classification of job roles as ML-related, based on job description data. Surprisingly, the most influential words are “attention,” “chain,” and “supply,” which could indicate overlap with supply chain roles or reflect noise in the model. More expected terms like “machine,” “learning,” “python,” “AI,” and “analytics” also appear, reinforcing that relevant technical language still plays a role in identifying ML-related positions. The presence of general words like “strong” or “communication” suggests that not all influential terms are strictly technical.
The confusion matrix illustrates that while the model performs well overall, it is particularly strong at identifying non-ML roles (class 0) but has more difficulty correctly predicting ML-related roles (class 1), reflected in a higher number of false negatives. Nevertheless, integrating the unstructured data meaningfully improved classification performance.
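For readers unfamiliar with the matrix's orientation, a small sketch with hypothetical counts (not the model's actual numbers) shows where the false negatives described above sit:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels mimicking the reported pattern: more ML roles missed
# (false negatives) than non-ML roles misflagged (false positives).
y_true = np.array([0] * 10 + [1] * 10)
y_pred = np.array([0] * 9 + [1] * 1 + [0] * 3 + [1] * 7)

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted class
fn = cm[1, 0]  # true ML roles predicted as non-ML
fp = cm[0, 1]  # true non-ML roles predicted as ML
print(cm)
print(f"false negatives: {fn}, false positives: {fp}")
```

Here `cm[1, 0] > cm[0, 1]`, the asymmetry the text attributes to the model.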
Structured features such as TITLE, SOC_2021_4_NAME, NAICS2_NAME, MIN_EDULEVELS_NAME, and MIN_YEARS_EXPERIENCE were chosen for their domain relevance: these fields reflect the role's function, industry, required education, and experience level, all of which can signal ML-related requirements. Additionally, we included the job description (BODY) text, applying TF-IDF vectorization to extract key terms, which allowed the model to learn from nuanced language patterns within postings. Feature importance scores and performance metrics confirm that both structured metadata and text data contribute meaningfully to classification accuracy.