The Simple Search Engine project is a lightweight tool for efficient text search. It lets users search a corpus of documents and obtain relevant results.

The Vector Space Model (VSM) is a core concept in Natural Language Processing (NLP) that represents text numerically as vectors in a high-dimensional space. This project implements a search engine using the VSM approach, allowing users to retrieve relevant information from a given corpus.
## Description
This search engine project involves several key steps:
### Step 0: Importing Corpus
The first step reads the text corpus from the local machine. The Python script then passes the text to the NLTK library for further processing.
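
As a rough illustration of this step, the sketch below reads every `.txt` file in a folder into a dictionary. The function name `load_corpus`, the `.txt` extension, and the UTF-8 encoding are assumptions made for the example; the project's own script may organize its corpus differently.

```python
from pathlib import Path


def load_corpus(corpus_dir):
    """Return {file name: text} for every .txt file in corpus_dir.

    Hypothetical helper: the folder layout and file extension are assumptions.
    """
    corpus = {}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        corpus[path.name] = path.read_text(encoding="utf-8")
    return corpus
```

Calling `load_corpus("corpus")` would then return one entry per text file in a local `corpus` folder.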
### Step 1: Preprocessing & Tokenizing
Text preprocessing is carried out to eliminate unnecessary tokens and simplify calculations. The NLTK library is employed for tasks such as tokenization, lemmatization, and stop-word removal.
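
To make the idea concrete without requiring NLTK's data downloads, here is a simplified stand-in written with the standard library only. It lowercases the text, tokenizes with a regular expression, and filters against a tiny hand-picked stop-word set; it does not lemmatize. In the project itself, NLTK's `word_tokenize`, `stopwords.words("english")`, and `WordNetLemmatizer` would presumably fill these roles. The name `preprocess` and the stop-word set are assumptions.

```python
import re

# Tiny illustrative stop-word set (an assumption for this sketch);
# NLTK's English stop-word list is much larger.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "in", "to", "on"}


def preprocess(text):
    """Lowercase, split into runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The cat is on the mat."))  # ['cat', 'mat']
```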
### Step 2: Creating our Dataset
The preprocessed data is organized into a CSV file, creating a structured dataset for subsequent analysis.
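
One possible shape for such a dataset is one row per document, holding the document name and its space-joined tokens. The sketch below uses Python's built-in `csv` module to stay dependency-free; the project lists pandas as a dependency, so it may well use `DataFrame.to_csv` and `pandas.read_csv` instead. The column names `document` and `tokens` and both function names are assumptions.

```python
import csv
import io


def write_dataset(docs, handle):
    """Write one row per document: its name and its space-joined tokens."""
    writer = csv.writer(handle)
    writer.writerow(["document", "tokens"])  # assumed column names
    for name, tokens in docs.items():
        writer.writerow([name, " ".join(tokens)])


def read_dataset(handle):
    """Read the rows back into {document name: list of tokens}."""
    reader = csv.DictReader(handle)
    return {row["document"]: row["tokens"].split() for row in reader}


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
write_dataset({"d1.txt": ["cat", "mat"], "d2.txt": ["dog"]}, buffer)
buffer.seek(0)
dataset = read_dataset(buffer)
```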
### Step 3: Creating our Matrix
A term-document matrix is generated from the dataset, representing the frequency of terms in each document.
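
A minimal sketch of a term-document matrix built from raw term counts, with one row per term and one column per document. The function name and the row/column orientation are assumptions; the project may store the matrix as a pandas DataFrame or apply a weighting scheme that is not described here.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokens) for tokens in docs.values()]
    vocabulary = sorted(set().union(*docs.values()))
    matrix = [[count[term] for count in counts] for term in vocabulary]
    return vocabulary, matrix


docs = {"d1": ["cat", "mat", "cat"], "d2": ["dog", "mat"]}
vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)  # ['cat', 'dog', 'mat']
print(matrix)      # [[2, 0], [0, 1], [1, 1]]
```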
### Step 5: Calculating Cosine Similarity
Cosine similarity is computed between the input query and each document in the corpus, and the documents are then ranked from most to least similar.
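
For two vectors, cosine similarity is the dot product divided by the product of their lengths. The sketch below applies that to sparse term-count vectors and ranks documents against a query. It assumes raw term counts and already-preprocessed tokens; the names `cosine_similarity` and `rank` are illustrative, not taken from the project.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def rank(query_tokens, docs):
    """Return (document name, score) pairs, highest score first."""
    query = Counter(query_tokens)
    scores = [
        (name, cosine_similarity(query, Counter(tokens)))
        for name, tokens in docs.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


results = rank(["cat"], {"d1": ["cat", "mat", "cat"], "d2": ["dog", "mat"]})
```

Here `d1` ranks first because it is the only document containing the query term, while `d2` scores 0.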
## How to Use
1. Clone the repository to your local machine.
2. Install the necessary dependencies (NLTK, pandas).
3. Run the Python script to build the search engine.
## Dependencies
- NLTK
- Pandas
39
+
40
+
## Author
[Kiarash Rahmani]
## License
This project is licensed under the [MIT License](LICENSE).