Commit 85e9aec

Update README.md
1 parent be79e97 commit 85e9aec

1 file changed

+36
-9
lines changed

Research Topic Prediction using Deep Learning/Model/README.md

@@ -1,20 +1,47 @@
11
# Research-topic-Prediction
22

3-
### Problem Statement
3+
## Overview
4+
45
Researchers have access to large online archives of scientific articles, which makes finding relevant articles more difficult. Tagging, or topic modelling, assigns identifying labels to research articles, which facilitates recommendation and search.
56

6-
Given the abstract and title for a set of research articles, predict the topics for each article included in the test set.
7+
## Dataset
8+
9+
The dataset used in this challenge consists of research papers with their titles, abstracts, and corresponding categories. The categories include:
10+
11+
- Computer Science
12+
- Physics
13+
- Mathematics
14+
- Statistics
15+
- Quantitative Biology
16+
- Quantitative Finance
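
The data layout described above can be pictured with a small sketch. The column names (`TITLE`, `ABSTRACT`, and one binary column per category) and the two rows are assumptions made for illustration; the README does not specify the CSV schema.

```python
import pandas as pd

# The six categories listed in the README.
CATEGORIES = [
    "Computer Science", "Physics", "Mathematics",
    "Statistics", "Quantitative Biology", "Quantitative Finance",
]

# Two made-up rows standing in for the real training file.
# Column names are hypothetical, not taken from the actual dataset.
train = pd.DataFrame({
    "TITLE": ["Deep nets for image data", "Bayesian models of gene expression"],
    "ABSTRACT": ["We train a convolutional network.", "We fit a hierarchical model."],
    "Computer Science": [1, 0],
    "Physics": [0, 0],
    "Mathematics": [0, 0],
    "Statistics": [0, 1],
    "Quantitative Biology": [0, 1],
    "Quantitative Finance": [0, 0],
})

# Label matrix: one row per paper, one binary column per category.
y = train[CATEGORIES].to_numpy()

# Number of categories assigned to each paper.
labels_per_paper = y.sum(axis=1)
```

Under this assumed layout, a paper can carry more than one category at once, which is why the target is a matrix rather than a single column.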
17+
18+
## Approach
19+
20+
The approach used to solve this challenge is as follows:
21+
22+
1. **Data Preprocessing**: The title and abstract of each research paper are combined and preprocessed by removing punctuation, converting to lowercase, and removing stop words.
23+
2. **Feature Extraction**: The preprocessed text data is then converted into numerical features using the CountVectorizer and TfidfTransformer from scikit-learn.
24+
3. **Model Training**: A MultiOutputClassifier with a LinearSVC estimator is trained on the feature data to predict the categories of the research papers.
25+
4. **Model Evaluation**: The performance of the model is evaluated using accuracy score, precision, recall, and F1-score.
26+
5. **Submission**: The predicted categories for the test data are written to a CSV file for submission.
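
Steps 1 to 3 above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `preprocess` helper, the toy papers, the two-category label matrix, and the use of scikit-learn's built-in English stop-word list are all assumptions.

```python
import string

import numpy as np
from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfTransformer,
)
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def preprocess(title, abstract):
    """Combine title and abstract, lowercase, strip punctuation, drop stop words."""
    text = f"{title} {abstract}".lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in ENGLISH_STOP_WORDS)


# Made-up toy data; the columns of y are [Computer Science, Physics].
papers = [
    ("Neural networks", "We train a deep neural network classifier."),
    ("Graph algorithms", "A fast algorithm for shortest paths in graphs."),
    ("Quantum optics", "We measure photon entanglement in a laser cavity."),
    ("Particle collisions", "Measurements of quark scattering at high energy."),
]
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])

X_text = [preprocess(title, abstract) for title, abstract in papers]

# Token counts -> TF-IDF weights -> one LinearSVC per output column.
model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(),
    MultiOutputClassifier(LinearSVC()),
)
model.fit(X_text, y)

pred = model.predict([preprocess("Laser physics", "Photon measurements in a cavity.")])
```

`MultiOutputClassifier` fits an independent `LinearSVC` for each category column, so every category gets its own yes/no decision and a paper may be assigned several categories.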
727

8-
Note that a research article can possibly have more than 1 topic. The research article abstracts and titles are sourced from the following 6 topics:
28+
## Code Structure
929

10-
1. Computer Science
30+
The code is organized into the following sections:
1131

12-
2. Physics
32+
1. **Importing Libraries**: The necessary libraries, including scikit-learn, pandas, and numpy, are imported.
33+
2. **Loading Data**: The training and test data are loaded from CSV files.
34+
3. **Data Preprocessing**: The title and abstract of each research paper are combined and preprocessed.
35+
4. **Feature Extraction**: The preprocessed text data is converted into numerical features.
36+
5. **Model Training**: The MultiOutputClassifier with a LinearSVC estimator is trained on the feature data.
37+
6. **Model Evaluation**: The performance of the model is evaluated using accuracy score, precision, recall, and F1-score.
38+
7. **Submission**: The predicted categories for the test data are written to a CSV file for submission.
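
The evaluation and submission steps might look like the sketch below. The label matrices are made up, the micro-averaging choice is an assumption (the README does not say how precision, recall, and F1 are averaged), and the `ID` column is a hypothetical submission field.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Shortened category list to keep the example small.
CATEGORIES = ["Computer Science", "Physics", "Mathematics"]

# Made-up ground truth and predictions for four papers.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# For multi-label input, accuracy_score is subset accuracy:
# a paper counts as correct only if every category matches.
subset_acc = accuracy_score(y_true, y_pred)

# Micro-averaging pools true/false positives across all categories.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="micro", zero_division=0
)

# Build the submission table; "ID" is a hypothetical identifier column.
submission = pd.DataFrame(y_pred, columns=CATEGORIES)
submission.insert(0, "ID", range(1, len(submission) + 1))

# to_csv without a path returns the CSV text; pass a filename to write a file.
csv_text = submission.to_csv(index=False)
```

Subset accuracy is strict for multi-label problems, so it is worth reading it alongside the per-label precision, recall, and F1 rather than on its own.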
1339

14-
3. Mathematics
40+
## Dependencies
1541

16-
4. Statistics
42+
The following dependencies are required to run the code:
1743

18-
5. Quantitative Biology
44+
- scikit-learn
45+
- pandas
46+
- numpy
1947

20-
6. Quantitative Finance
