|
1 | 1 | # Research-topic-Prediction
|
2 | 2 |
|
3 |
| -### Problem Statement |
| 3 | +## Overview |
| 4 | + |
4 | 5 | Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging, or topic modelling, attaches identifying labels to research articles, which facilitates recommendation and search.
|
5 | 6 |
|
6 |
| -Given the abstract and title for a set of research articles, predict the topics for each article included in the test set. |
| 7 | +## Dataset |
| 8 | + |
| 9 | +The dataset used in this challenge consists of research papers with their titles, abstracts, and corresponding categories. A paper can belong to more than one category. The categories are: |
| 10 | + |
| 11 | +- Computer Science |
| 12 | +- Physics |
| 13 | +- Mathematics |
| 14 | +- Statistics |
| 15 | +- Quantitative Biology |
| 16 | +- Quantitative Finance |
| 17 | + |
| 18 | +## Approach |
| 19 | + |
| 20 | +The approach used to solve this challenge is as follows: |
| 21 | + |
| 22 | +1. **Data Preprocessing**: The title and abstract of each research paper are combined and preprocessed by removing punctuation, converting to lowercase, and removing stop words. |
| 23 | +2. **Feature Extraction**: The preprocessed text data is then converted into numerical features using the CountVectorizer and TfidfTransformer from scikit-learn. |
| 24 | +3. **Model Training**: A MultiOutputClassifier with a LinearSVC estimator is trained on the feature data to predict the categories of the research papers. |
| 25 | +4. **Model Evaluation**: The performance of the model is evaluated using accuracy score, precision, recall, and F1-score. |
| 26 | +5. **Submission**: The predicted categories for the test data are submitted in a CSV file. |
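The first three steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the helper names `preprocess` and `build_model`, the toy texts, and the use of a scikit-learn `Pipeline` are assumptions made for the example. Stop-word removal is delegated to `CountVectorizer(stop_words="english")`, which may differ from the stop-word list the project uses.

```python
import re

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

CATEGORIES = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]


def preprocess(title, abstract):
    """Combine title and abstract, lowercase, and strip punctuation (hypothetical helper)."""
    text = f"{title} {abstract}".lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    return " ".join(text.split())         # collapse repeated whitespace


def build_model():
    """Counts -> TF-IDF -> one LinearSVC per category (assumed arrangement)."""
    return Pipeline([
        ("counts", CountVectorizer(stop_words="english")),
        ("tfidf", TfidfTransformer()),
        ("clf", MultiOutputClassifier(LinearSVC())),
    ])


# Toy data for illustration only; every category column contains both 0 and 1,
# which LinearSVC needs in order to fit each per-category classifier.
papers = [
    ("Deep learning for images", "We train neural networks and report statistical tests."),
    ("Quantum entanglement", "Photon pairs are measured in an optical experiment."),
    ("Prime gaps", "We prove a bound on gaps between consecutive primes."),
    ("Bayesian regression", "Posterior inference for linear regression models."),
    ("Protein folding", "Folding dynamics of proteins inside living cells."),
    ("Option pricing", "Pricing derivatives under stochastic volatility."),
]
labels = np.array([
    [1, 0, 0, 1, 0, 0],  # a paper can carry more than one category
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
])

texts = [preprocess(title, abstract) for title, abstract in papers]
model = build_model().fit(texts, labels)

# One row per paper, one 0/1 column per category.
predictions = model.predict(texts[:2])
```

Wrapping the vectorizer and classifier in a single pipeline keeps the vocabulary learned on the training set fixed when the test set is transformed, which is one common way to avoid leakage between the two.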
7 | 27 |
|
8 |
| -Note that a research article can possibly have more than 1 topic. The research article abstracts and titles are sourced from the following 6 topics: |
| 28 | +## Code Structure |
9 | 29 |
|
10 |
| -1. Computer Science |
| 30 | +The code is organized into the following sections: |
11 | 31 |
|
12 |
| -2. Physics |
| 32 | +1. **Importing Libraries**: The necessary libraries, including scikit-learn, pandas, and numpy, are imported. |
| 33 | +2. **Loading Data**: The training and test data are loaded from CSV files. |
| 34 | +3. **Data Preprocessing**: The title and abstract of each research paper are combined and preprocessed. |
| 35 | +4. **Feature Extraction**: The preprocessed text data is converted into numerical features. |
| 36 | +5. **Model Training**: The MultiOutputClassifier with a LinearSVC estimator is trained on the feature data. |
| 37 | +6. **Model Evaluation**: The performance of the model is evaluated using accuracy score, precision, recall, and F1-score. |
| 38 | +7. **Submission**: The predicted categories for the test data are submitted in a CSV file. |
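The evaluation and submission sections might look something like the sketch below. The helper names `evaluate` and `make_submission`, the `ID` column, and the choice of micro-averaging are assumptions for illustration; the source does not say how the metrics are averaged or what the submission columns are called.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

CATEGORIES = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]


def evaluate(y_true, y_pred):
    """Multi-label metrics; micro-averaging is an assumption, not the source's stated choice."""
    return {
        # For multi-label input, accuracy_score is the exact-match (subset) accuracy.
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="micro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="micro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
    }


def make_submission(ids, y_pred):
    """Build a submission table with a hypothetical ID column plus one column per category."""
    frame = pd.DataFrame(y_pred, columns=CATEGORIES)
    frame.insert(0, "ID", list(ids))
    return frame


# Toy values for illustration only.
y_true = np.array([[1, 0, 0, 0, 0, 0],
                   [0, 1, 1, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0, 0],
                   [0, 1, 0, 0, 0, 0]])

scores = evaluate(y_true, y_pred)
submission = make_submission([101, 102], y_pred)
csv_text = submission.to_csv(index=False)  # pass a file path instead to write to disk
```

Subset accuracy is strict for multi-label data, since a paper counts as correct only when all six labels match, so the micro-averaged F1-score is usually the more informative of the four numbers.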
13 | 39 |
|
14 |
| -3. Mathematics |
| 40 | +## Dependencies |
15 | 41 |
|
16 |
| -4. Statistics |
| 42 | +The following dependencies are required to run the code: |
17 | 43 |
|
18 |
| -5. Quantitative Biology |
| 44 | +- scikit-learn |
| 45 | +- pandas |
| 46 | +- numpy |
19 | 47 |
|
20 |
| -6. Quantitative Finance |
|