Text classification has evolved significantly over the years, driven by advances in natural language processing (NLP) techniques and the availability of larger and more diverse datasets. From rule-based systems to sophisticated deep learning models that leverage transfer learning, attention mechanisms, and multimodal inputs, ongoing research continues to push the boundaries of accuracy, efficiency, and ethical considerations in text classification.
Specifically, the recent increase in the availability of Large Language Models (LLMs) has made it easier for users to perform text classification tasks. These LLMs can be fine-tuned for specific classification tasks with very few labeled examples, allowing rapid deployment of models for new classification tasks without the need for extensive training data. In this blog post, we look at how we can use pre-trained language models, specifically BERT (Bidirectional Encoder Representations from Transformers) from Hugging Face, for text classification in Dataiku, and how we can monitor the accuracy of the results.
To make the tagging of course domains (“label”) more efficient, we would like to identify key domains for courses based on their course descriptions (“text”). These courses are published on a portal targeting working professionals, where individuals can sign up and build their skills in various domains. Figure 1 below shows examples of course descriptions along with the course domains that were manually tagged. The course description text has already been cleaned (i.e., normalized, with stopwords removed). We want to fine-tune a pre-trained LLM from Hugging Face to categorize the course domains.
Figure 1. Example of course categories and cleaned course descriptions
We first have to set up the code environment with a resource initialization script that downloads the Hugging Face model we want. You can find a code snippet in the developer guide here.
Figure 2. Code environment
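As a reference, the snippet below sketches what such a resource initialization script can look like, following the pattern from the developer guide. The model name matches the DistilBERT checkpoint used later, but treat the script as illustrative rather than the verbatim one used in this project.

```python
# Resource initialization script for the code environment (runs when the
# environment's resources are built). A sketch following the developer guide.
from dataiku.code_env_resources import clear_all_env_vars, set_env_path

# Clear variables set by any previously run initialization script
clear_all_env_vars()

# Cache Hugging Face downloads inside the code environment's resource directory
set_env_path("HF_HOME", "huggingface")

import transformers

# Pre-download the tokenizer and model so recipes can load them without
# fetching from the Hugging Face Hub at run time
MODEL_NAME = "distilbert-base-uncased"
transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
transformers.AutoModel.from_pretrained(MODEL_NAME)
```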
As with a typical machine learning process, we split the dataset into a training set and a test set for performance evaluation. This is a multi-class classification problem; using a Python recipe, we imported the DistilBERT model from Hugging Face and fine-tuned it, searching for the hyperparameters that give the best accuracy.
Figure 3 shows snippets of the Python recipe; the entire code used can be found here. The output of the Python script is a Dataiku managed folder that contains the models and results for the different hyperparameter settings, as shown in Figure 4. The recipe runs in the code environment that we set up earlier.
...
Figure 3. Snippet of the Python script that performs the fine-tuning of the Hugging Face model
Figure 4. Example of output files of the Python script that runs the model training
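For orientation, here is a condensed sketch of what such a fine-tuning recipe can look like using the transformers Trainer API. The dataset, column, and folder names are placeholders, and the hyperparameter grid is a toy example rather than the exact one behind Figure 4; it also assumes a locally hosted managed folder.

```python
import os

import dataiku
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"

# Load the cleaned training data (dataset and column names are hypothetical)
df = dataiku.Dataset("course_descriptions_train").get_dataframe()
labels = sorted(df["label"].unique())
label2id = {name: i for i, name in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Convert to a Hugging Face Dataset, encode the labels, and tokenize the text
ds = Dataset.from_pandas(df[["text", "label"]])
ds = ds.map(lambda row: {"labels": label2id[row["label"]]})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=256),
            batched=True)
ds = ds.train_test_split(test_size=0.2, seed=42)  # hold-out for model selection

def compute_metrics(eval_pred):
    logits, y = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == y).mean())}

# Write one candidate model per hyperparameter setting to a managed folder
folder = dataiku.Folder("bert_models")  # hypothetical, locally hosted folder
for lr in (2e-5, 5e-5):  # toy grid; Figure 4 suggests several such runs
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(labels))
    out_dir = os.path.join(folder.get_path(), f"distilbert_lr_{lr}")
    args = TrainingArguments(output_dir=out_dir, learning_rate=lr,
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=ds["train"],
                      eval_dataset=ds["test"], compute_metrics=compute_metrics)
    trainer.train()
    print(lr, trainer.evaluate())  # pick the lr with the best eval accuracy
    trainer.save_model(out_dir)
    tokenizer.save_pretrained(out_dir)
```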
The model with the best accuracy was then deployed as an MLflow object into the Flow, where it can be applied to the test set for scoring using the Evaluate recipe. Figure 5 gives an overview of the process.
Figure 5. Simple workflow in Dataiku for the text classification exercise
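The snippet below sketches one way to do this deployment programmatically, using MLflow's transformers flavor together with Dataiku's MLflow model import API. The paths, model name, class labels, and code environment name are assumptions for illustration, not the exact values from this project.

```python
import dataiku
import mlflow.transformers
from transformers import pipeline

# Best run selected from the managed folder (hypothetical path)
BEST_DIR = "/path/to/managed_folder/distilbert_lr_2e-05"
clf = pipeline("text-classification", model=BEST_DIR, tokenizer=BEST_DIR)

# Save the pipeline in MLflow format to a local staging directory
mlflow.transformers.save_model(transformers_model=clf, path="/tmp/mlflow_model")

# Import the MLflow model into the Flow as a Dataiku saved model version
project = dataiku.api_client().get_default_project()
sm = project.create_mlflow_pyfunc_model("course_domain_classifier",
                                        prediction_type="MULTICLASS")
version = sm.import_mlflow_version_from_path(
    "v1", "/tmp/mlflow_model", code_env_name="huggingface_env")

# Declare the target column and class labels so the Evaluate recipe can
# compute metrics (labels shown are examples, not the full list)
version.set_core_metadata(
    "label",
    class_labels=["Healthcare", "Information and Communications"],
    get_features_from_dataset="course_descriptions_test")
```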
Upon applying the Evaluate recipe, we get the prediction results. Figure 6 shows the “predicted classes” against the “actual classes” on the test set of over 3,000 records. The accuracy is particularly good for the “Healthcare” and “Information and Communications” classes. In addition, the MAUC (Multi-class Area Under the Curve) for this model is 0.962, which is excellent. It would be worth trying other pre-trained models available on Hugging Face to see whether the accuracy improves further.
Figure 6. Confusion matrix of predicted classes against actual classes
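If you want to reproduce these metrics outside the Evaluate recipe, the sketch below computes a confusion matrix and a one-vs-rest, macro-averaged multi-class AUC with scikit-learn (a common definition of MAUC; the arrays here are toy stand-ins for the scored test set).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy stand-ins: in practice y_true and y_proba come from scoring the test set
y_true = np.array([0, 1, 2, 1, 0, 2])
y_proba = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.7, 0.2],
                    [0.2, 0.2, 0.6],
                    [0.3, 0.6, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.1, 0.3, 0.6]])
y_pred = y_proba.argmax(axis=1)

# One-vs-rest, macro-averaged multi-class AUC
mauc = roc_auc_score(y_true, y_proba, multi_class="ovr")
print(f"MAUC = {mauc:.3f}")

# Rows = actual classes, columns = predicted classes (as in Figure 6)
print(confusion_matrix(y_true, y_pred))
```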
We can see that even though we had only slightly more than 7,000 records for training, we were still able to achieve good accuracy by using pre-trained LLMs from Hugging Face. Pre-trained models from Hugging Face are not the only option, however; to learn about other approaches to text classification, check out this blog post.
For more information on the dataset used, check out this post.