Commit 104e4a2

Implement v2.0.0 (#16)
* major code update
* add missing docstring
* linting 🐐
1 parent 14b7518 commit 104e4a2

21 files changed: 922 additions and 686 deletions

.flake8

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+[flake8]
+max-line-length = 88
+extend-ignore = E203

.gitignore

Lines changed: 158 additions & 3 deletions
@@ -1,5 +1,160 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
 dist/
-paper/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintainted in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
-build/
-cazy_parser.egg-info/
+
+# VScode
+.vscode/
+
+# Project-specific
+*.csv
+*.chk
+*.fasta

.isort.cfg

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+[settings]
+profile = black
Lines changed: 1 addition & 8 deletions
@@ -1,17 +1,10 @@
-## cazy-parser
-*A way to extract specific information from the Carbohydrate-Active enZYmes.*
-
-# How to contribute?
+# How to contribute to cazy-parser?
 
 There are still a few features that could be implemented, such as:
 
 * Organism specific selection
 * Retrieve three dimensional structures for each family
 
-and specially
-
-* **Retrieve fasta sequences from NCBIs servers**
-
 ___
 
 Feel free to contact me with **suggestions**, **bugs reports** or if you need any **assistance** running the software.

README.md

Lines changed: 29 additions & 77 deletions
@@ -18,105 +18,57 @@ License: [GNU GPLv3](https://www.gnu.org/licenses/gpl-3.0.html)
 
 doi: 10.21105/joss.00053
 
-# IMPORTANT
-
-Due to changes in the CAZy database, the parser is no longer functional, I will try to revive the code and update it soon. (:
 
 ## Introduction
 *cazy-parser* is a tool that extract information from [CAZy](http://www.cazy.org/) in a more usable and readable format. Firstly, a script reads the HTML structure and creates a mirror of the database as a tab delimited file. Secondly, information is extracted from the database according to user inputted parameters and presented to the user as a set of accession codes.
 
 ## Install / Upgrade
-`$ pip install --upgrade cazy-parser`
-
-or
-
-Download latest source from [this link](https://pypi.python.org/pypi/cazy-parser)
-
 ```
-$ tar -zxvf cazy-parser-x.x.x.tar.gz
-$ cd cazy-parser-x.x.x
-$ python setup.py install
-
+$ pip install --upgrade cazy-parser
 ```
 
-Note: It my be necessary to open a new terminal.
 
 ## Usage
 
 *Internet connection required*
 
-1) Database creation
-
-`$ create_cazy_db`
-
-(-h for help)
-* This script will parse the [CAZy](http://www.cazy.org/) database website and create a comma separated table containing the following information:
-* domain
-* protein_name
-* family
-* tag *(characterized status)*
-* organism_code
-* [EC](http://www.enzyme-database.org/) number (ec stands for enzyme comission number)
-* [GENBANK](https://www.ncbi.nlm.nih.gov/genbank/) id
-* [UNIPROT](https://www.uniprot.org) code
-* subfamily
-* organism
-* [PDB](http://www.rcsb.org/) code
-
-2) Extract accession codes
-
-* Based on the previously generated csv table, extract accession codes for a given protein family.
-
-`$ extract_cazy_ids --db <database> --family <family code>`
-
-(-h for help)
-* Optional:
-
-`--subfamilies` Create a file for each subfamily, default = False
-
-`--characterized` Create a file containing only characterized enzymes, default = False
-
-## Usage examples
-
-1) Extract all accession codes from family 9 of Glycosyl Transferases.
-
-`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GT9`
-
-This will generate the following files:
-```
-GT9.csv
-```
-
-2) Extract all accession codes from family 43 of Glycoside Hydrolase, including subfamilies
-
-`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GH43 --subfamilies`
-
-This will generate the following files:
 
 ```
-GH43.csv
-GH43_sub1.csv
-GH43_sub2.csv
-GH43_sub3.csv
-(...)
-GH43_sub37.csv
+cazy-parser -h
+usage: cazy-parser [-h] [-f FAMILY] [-s SUBFAMILY] [-c CHARACTERIZED] [-v] {GH,GT,PL,CA,AA}
+
+positional arguments:
+{GH,GT,PL,CA,AA}
+
+optional arguments:
+-h, --help show this help message and exit
+-f FAMILY, --family FAMILY
+-s SUBFAMILY, --subfamily SUBFAMILY
+-c CHARACTERIZED, --characterized CHARACTERIZED
+-v, --version show version
 ```
 
-3) Extract all accession codes from family 42 of Polysaccharide Lyases including characterized entries
-
-`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family PL42 --characterized`
+### Example
 
-This will generate the following files:
+Extract all fasta sequences from family 43 of Glycoside Hydrolase subfamily 1
 
 ```
-PL42.csv
-PL42_characterized.csv
+$ cazy-parser GH -f 43 -s 1
+[2022-05-26 16:39:21,511 91 INFO] ------------------------------------------
+[2022-05-26 16:39:21,511 92 INFO]
+[2022-05-26 16:39:21,511 93 INFO] ┌─┐┌─┐┌─┐┬ ┬ ┌─┐┌─┐┬─┐┌─┐┌─┐┬─┐
+[2022-05-26 16:39:21,511 94 INFO] │ ├─┤┌─┘└┬┘───├─┘├─┤├┬┘└─┐├┤ ├┬┘
+[2022-05-26 16:39:21,511 95 INFO] └─┘┴ ┴└─┘ ┴ ┴ ┴ ┴┴└─└─┘└─┘┴└─ v2.0.0
+[2022-05-26 16:39:21,511 96 INFO]
+[2022-05-26 16:39:21,511 97 INFO] ------------------------------------------
+[2022-05-26 16:39:21,511 183 INFO] Fetching links for Glycoside-Hydrolases, url: http://www.cazy.org/Glycoside-Hydrolases.html
+[2022-05-26 16:39:22,454 189 INFO] Only using links of family 43 subfamily 1
+[2022-05-26 16:39:23,029 26 INFO] Dowloading 1415 fasta sequences...
+[2022-05-26 16:40:32,187 51 INFO] Dumping fasta sequences to file GH43_1_26052022.fasta
 ```
 
-### Download fasta sequences
-
-Go to [NCBI's Batch Entrez](https://www.ncbi.nlm.nih.gov/sites/batchentrez) change the database to protein and submit the generated `.csv`.
+This will generate the following file `GH43_1_DDMMYYY.fasta` containing the fasta sequences.
 
 ## To-do and how to contribute
 
-Please refer to CONTRIBUTE.md
+Please refer to [CONTRIBUTING](CONTRIBUTING.md) (:
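
Reviewer note: the new README documents the v2.0.0 command line only through its help text and one log excerpt. The sketch below is one way that interface could be modelled with `argparse`; it is not the code from this commit. The names `build_parser`, `output_name` and `enzyme_class` are hypothetical, and the file-name rule is inferred from a single data point in the log (`GH43_1_26052022.fasta` for `GH -f 43 -s 1` on 2022-05-26), so the behaviour for other argument combinations is an assumption.

```python
import argparse
from datetime import date

# Enzyme classes listed as the positional choices in the README help text.
ENZYME_CLASSES = ["GH", "GT", "PL", "CA", "AA"]


def build_parser():
    """Hypothetical parser mirroring the usage line shown in the README."""
    parser = argparse.ArgumentParser(prog="cazy-parser")
    parser.add_argument("enzyme_class", choices=ENZYME_CLASSES)
    parser.add_argument("-f", "--family")
    parser.add_argument("-s", "--subfamily")
    parser.add_argument("-c", "--characterized")
    parser.add_argument("-v", "--version", action="version", version="v2.0.0")
    return parser


def output_name(enzyme_class, family=None, subfamily=None, today=None):
    """Build a name like GH43_1_26052022.fasta (assumed DDMMYYYY date suffix)."""
    today = today or date.today()
    name = enzyme_class
    if family:
        name += str(family)
    if subfamily:
        name += f"_{subfamily}"
    return f"{name}_{today.strftime('%d%m%Y')}.fasta"


if __name__ == "__main__":
    args = build_parser().parse_args(["GH", "-f", "43", "-s", "1"])
    # Pin the date so the result matches the log excerpt in the README.
    print(output_name(args.enzyme_class, args.family, args.subfamily, date(2022, 5, 26)))
```

With the date pinned as above, the script prints `GH43_1_26052022.fasta`, the file name reported in the README log.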

README.rst

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ cazy-parser
 
 The `Carbohydrate-Active enZYmes Database (CAZy) <https://www.cazy.org>`_ provides access to a sequence based classification of enzyme that are responsible for the assembly, modification and breakdown of oligo and polysaccharides.
 
-This database has been online for eighteen years providing relevant genomic, structural and biochemical data on carbohydrate-active enzymes, such asglycoside hydrolases, glycosyl transferases, polysaccharide lyases, carbohydrateesterases and similar enzymes with auxiliary activities. The database isorganized and presented to the user as a series of highly annotated HTML tables.
+This database has been online for several years providing relevant genomic, structural and biochemical data on carbohydrate-active enzymes, such as glycoside hydrolases, glycosyl transferases, polysaccharide lyases, carbohydrateesterases and similar enzymes with auxiliary activities. The database isorganized and presented to the user as a series of highly annotated HTML tables.
 
 This script provides a way to extract information from the database according to user need.
99
