Commit 96b24ec

Author: rodrigo
Committed: reviewed code according to JOSS feedback
1 parent 70c523f commit 96b24ec

20 files changed: 841 additions, 229 deletions

.gitignore

Lines changed: 1 addition & 0 deletions

dist/cazy-parser*

CONTRIBUTE.md

Lines changed: 19 additions & 0 deletions

# cazy-parser
*A way to extract specific information from the Carbohydrate-Active enZYmes database.*

## How to contribute?

There are still a few features that could be implemented, such as:

* Organism-specific selection
* Retrieving the three-dimensional structures available for each family

and especially

* **Retrieving FASTA sequences from NCBI's servers** (see the sketch below)

___

Feel free to contact me with **suggestions**, **bug reports** or if you need any **assistance** running the software.

*rvhonorato at gmail.com*
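
For anyone interested in picking up the FASTA-retrieval feature, here is a minimal sketch of one possible approach, assuming Biopython (`pip install biopython`) is available; the `fetch_fasta` helper and the example accession are illustrative only and not part of the package:

```python
# Hypothetical sketch of the FASTA-retrieval feature using Biopython's
# Entrez utilities; fetch_fasta() and the example accession are
# illustrative only, not cazy-parser code.
from Bio import Entrez

Entrez.email = "you@example.com"  # NCBI requires a contact e-mail

def fetch_fasta(genbank_ids):
    # Fetch protein records in FASTA format for a list of GenBank ids
    handle = Entrez.efetch(db="protein", id=",".join(genbank_ids),
                           rettype="fasta", retmode="text")
    return handle.read()

print(fetch_fasta(["AAC74689.1"]))  # example accession, for illustration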

README.md

Lines changed: 43 additions & 27 deletions

@@ -5,65 +5,81 @@

License: [GNU GPLv3](https://www.gnu.org/licenses/gpl-3.0.html)

-If you are using this tool, **make sure to cite and visit CAZy website**
+If you are using this tool, **make sure to cite and visit the CAZy website**

* http://www.cazy.org/
* Lombard V, Golaconda Ramulu H, Drula E, Coutinho PM, Henrissat B (2014) The Carbohydrate-active enzymes database (CAZy) in 2013. **Nucleic Acids Res** 42:D490–D495. [PMID: [24270786](http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed&cmd=search&term=24270786)].

-### Introduction
-*cazy-parser* is a tool that extract information from CAZy in a more usable and readable format. Firstly, a script reads the HTML structure and creates a mirror of the database as a tab delimited file. Secondly, information is extracted from the database according to user inputted parameters and presented to the user as a set of accession codes.
+## Introduction
+*cazy-parser* is a tool that extracts information from [CAZy](http://www.cazy.org/) in a more usable and readable format. First, a script reads the HTML structure and creates a mirror of the database as a tab-delimited file. Second, information is extracted from the database according to user-supplied parameters and presented as a set of accession codes.

-### Requirements
+## Installation
+`pip install cazy-parser`

-* Python 2.x
-* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) module
+## Usage

-### Usage
-
-*Both steps require an internet conection*
+*Please note that both steps require an internet connection.*

1) Database creation

-`$ python create-cazy-db.py`
+`$ create_cazy_db`
+
+(`-h` for help)
+* This script will parse the [CAZy](http://www.cazy.org/) database website and create a tab-separated table (saved with a `.csv` extension) containing the following information:
+  * domain
+  * protein_name
+  * family
+  * tag *(characterized status)*
+  * [EC](http://www.enzyme-database.org/) number (EC stands for Enzyme Commission number)
+  * organism_code
+  * [GENBANK](https://www.ncbi.nlm.nih.gov/genbank/) id
+  * [UNIPROT](https://www.uniprot.org/) code
+  * subfamily
+  * organism
+  * [PDB](http://www.rcsb.org/) code
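
As an aside, loading the generated table is straightforward. A minimal sketch, assuming pandas is installed; note that the parser writes tab-separated values despite the `.csv` extension, hence `sep='\t'`, and the date-stamped file name below is only an example:

```python
# Minimal sketch: load the generated table and inspect it.
# The parser writes tab-separated values (despite the .csv extension),
# hence sep='\t'. The file name below is an example.
import pandas as pd

db = pd.read_csv('CAZy_DB_01-01-2016.csv', sep='\t')
print(db['family'].value_counts().head())    # entries per family
print(db[db['family'] == 'GT9']['genbank'])  # GenBank ids for one family
```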

2) Extract sequences

-`$ python select-cazy-sequences --db <database>`
-* Options:
+* Based on the previously generated CSV table, extract accession codes for a given protein family.
+
+`$ extract_cazy_ids --db <database> --family <family code>`

-`--family` Family to be searched, case sensitive
+(`-h` for help)
+* Optional flags (a sketch of the selection logic follows below):

-`--subfamilies` Create a file for each subfamily
+`--subfamilies` Create a file for each subfamily, default = False

-`--characterized` Create a file containing only characterized enzymes
+`--characterized` Create a file containing only characterized enzymes, default = False
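
The optional flags simply subset the mirrored table before the accession codes are written out. A rough sketch of that selection logic, using only the standard library (`extract_ids` is an illustrative helper, not the package's actual code; subfamily splitting is omitted):

```python
# Illustrative sketch of the selection behind extract_cazy_ids;
# extract_ids() is hypothetical and subfamily splitting is omitted.
import csv

def extract_ids(db_path, family, characterized=False):
    with open(db_path) as fh:
        rows = list(csv.DictReader(fh, delimiter='\t'))
    # keep entries whose family matches, e.g. 'GT9'
    selected = [r for r in rows if r['family'] == family]
    if characterized:
        # keep only entries tagged as characterized
        selected = [r for r in selected if r['tag'] == 'characterized']
    return [r['genbank'] for r in selected]

# Example: ids = extract_ids('CAZy_DB_xx-xx-xxxx.csv', 'GT9')
```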

-### Examples
+## Usage examples

1) Extract all accession codes from family 9 of Glycosyl Transferases.

-`$ python select-cazy-sequences --db CAZy_DB_xx-xx-xxxx.csv --family GT9`
+`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GT9`

This will generate the following files:
```
-GT9.fasta
+GT9.csv
```

2) Extract all accession codes from family 43 of Glycoside Hydrolase, including subfamilies

-`$ python select-cazy-sequences --db CAZy_DB_xx-xx-xxxx.csv --family GH43 --subfamilies`
+`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family GH43 --subfamilies`

This will generate the following files:

```
-GH43.fasta
-GH43_sub1.fasta
+GH43.csv
+GH43_sub1.csv
+GH43_sub2.csv
+GH43_sub3.csv
(...)
-GH43_sub37.fasta
+GH43_sub37.csv
```

3) Extract all accession codes from family 42 of Polysaccharide Lyases including characterized entries

-`$ python select-cazy-sequences --db CAZy_DB_xx-xx-xxxx.csv --family PL42 --characterized`
+`$ extract_cazy_ids --db CAZy_DB_xx-xx-xxxx.csv --family PL42 --characterized`

This will generate the following files:

@@ -72,14 +88,14 @@ PL42.fasta
PL42_characterized.fasta
```

-### To-do
+## To-do and how to contribute
+
+Please refer to CONTRIBUTE.md

-1. Extract sequences based on organism/domain
-2. Select structural data

### Known bugs

-**Sequence retrieval was done using the wrong NCBI service, thus blocking access to the site. Issue is being addressed.**
+None, yet.

#### Contact info

README.rst

Lines changed: 19 additions & 0 deletions

cazy-parser
============

The `Carbohydrate-Active enZYmes Database (CAZy) <https://www.cazy.org>`_ provides access to a sequence-based classification of the enzymes that are responsible for the assembly, modification and breakdown of oligo- and polysaccharides.

This database has been online for eighteen years, providing relevant genomic, structural and biochemical data on carbohydrate-active enzymes, such as glycoside hydrolases, glycosyl transferases, polysaccharide lyases, carbohydrate esterases and similar enzymes with auxiliary activities. The database is organized and presented to the user as a series of highly annotated HTML tables.

This script provides a way to extract information from the database according to the user's needs.

Installation
============
::

    pip install cazy-parser

Documentation
=============

Please refer to the `project page <https://github.com/rodrigovrgs/cazy-parser/blob/master/README.md>`_ for usage and more information.

build/lib/cazy_parser/__init__.py

Whitespace-only changes.

Lines changed: 208 additions & 0 deletions
#!/usr/bin/env python
#==============================================================================#
# Copyright (C) 2016 Rodrigo Honorato
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
#==============================================================================#

#==============================================================================#
# create a parsed database exploiting the CAZy HTML structure
# (Python 2: uses print statements, urllib.urlopen and string.uppercase)
import os, sys, urllib, re, string, time
from bs4 import BeautifulSoup

def main():
    #==========================================================================#
    # Species part
    #==========================================================================#
    print '>> Gathering species codes for species with full genomes'
    # a = archaea // b = bacteria // e = eukaryota // v = virus
    species_domain_list = ['a', 'b', 'e', 'v']
    species_dic = {}
    for initial in string.uppercase:
        for domain in species_domain_list:
            link = 'http://www.cazy.org/%s%s.html' % (domain, initial)
            f = urllib.urlopen(link)
            species_list_hp = f.read()
            # parse the webpage; note that this pattern only captures
            # 'b'-prefixed (bacterial) species codes
            index_list = re.findall('"http://www.cazy.org/(b\d.*).html" class="nav">(.*)</a>', species_list_hp)
            for entry in index_list:
                index, species = entry
                try:
                    species_dic[species].append(index)
                except KeyError:
                    species_dic[species] = [index]

    # Double check to see which of the species codes are valid
    for species in species_dic:
        entry_list = species_dic[species]
        if len(entry_list) > 1:
            # More than one entry for this species
            # > This is (likely) a duplicate
            # > Use the higher number, which should be the newer page
            newer_entry = max([int(i.split('b')[-1]) for i in entry_list])
            selected_entry = 'b%i' % newer_entry
            species_dic[species] = selected_entry
        else:
            species_dic[species] = species_dic[species][0]

    #==========================================================================#
    # Enzyme class part
    #==========================================================================#
    enzyme_classes = ['Glycoside-Hydrolases',
                      'GlycosylTransferases',
                      'Polysaccharide-Lyases',
                      'Carbohydrate-Esterases',
                      'Auxiliary-Activities']

    db_dic = {}
    protein_counter = 0
    for e_class in enzyme_classes:
        print '>> %s' % e_class
        main_class_link = 'http://www.cazy.org/%s.html' % e_class

        #======================================================================#
        # Family section
        #======================================================================#
        soup = BeautifulSoup(urllib.urlopen(main_class_link))
        family_table = soup.findAll(name='table')[0]
        rows = family_table.findAll(name='td')

        family_list = [str(r.find('a')['href'].split('/')[-1].split('.html')[0]) for r in rows]

        print '>> %i families found on %s' % (len(family_list), main_class_link)
        #======================================================================#
        # Identification section
        #======================================================================#
        for family in family_list:
            print '> %s' % family
            main_link = 'http://www.cazy.org/%s.html' % family
            family_soup = BeautifulSoup(urllib.urlopen(main_link))
            # each family page has a row of tabs (All / Characterized / ...)
            superfamily_list = [l.find('a')['href'] for l in family_soup.findAll('span', attrs={'class': 'choix'})][1:]

            # remove the structure tab, for now
            superfamily_list = [f for f in superfamily_list if not 'structure' in f]

            for main_link in superfamily_list:

                page_zero = main_link

                soup = BeautifulSoup(urllib.urlopen(main_link))

                # Get the pagination links for the family, e.g. 1, 2, 3, 4, 5, 7
                page_index_list = soup.findAll(name='a', attrs={'class': 'lien_pagination'})
                if bool(page_index_list):
                    # the PRINC parameter holds the entry offset of each page,
                    # so the first link's offset doubles as the per-page step
                    first_page_idx = int(page_index_list[0]['href'].split('PRINC=')[-1].split('#')[0])  # be careful with this
                    last_page_idx = int(page_index_list[-2]['href'].split('PRINC=')[-1].split('#')[0])  # be careful with this

                    # generate page_list
                    page_list = [page_zero]
                    for i in range(first_page_idx, last_page_idx + first_page_idx, first_page_idx):
                        link = 'http://www.cazy.org/' + page_index_list[0]['href'].split('=')[0] + '=' + str(i) + '#' + page_index_list[0]['href'].split('#')[1]
                        page_list.append(link)
                else:
                    page_list = [page_zero]

                for link in page_list:
                    # tr = rows // td = cells
                    soup = BeautifulSoup(urllib.urlopen(link))
                    table = soup.find('table', attrs={'class': 'listing'})
                    domain = ''

                    # consistency check to look for deleted families, e.g. GH21
                    try:
                        check = table.findAll('tr')
                    except AttributeError:
                        # not a valid link, move on
                        continue

                    for row in table.findAll('tr'):
                        # kingdom header rows carry the 'royaume' class;
                        # bs4 returns the class attribute as a list
                        if 'royaume' in row.get('class', []) and row.text != 'Top':
                            domain = str(row.text).lower()

                        tds = row.findAll('td')
                        if len(tds) > 1 and tds[0].text != 'Protein Name':
                            # valid line
                            db_dic[protein_counter] = {}

                            db_dic[protein_counter]['protein_name'] = tds[0].text.replace('&nbsp;', '')
                            db_dic[protein_counter]['family'] = family
                            db_dic[protein_counter]['domain'] = domain
                            db_dic[protein_counter]['ec'] = tds[1].text.replace('&nbsp;', '')
                            db_dic[protein_counter]['organism'] = tds[2].text.replace('&nbsp;', '')
                            try:
                                db_dic[protein_counter]['genbank'] = tds[3].find('a').text.replace('&nbsp;', '')  # get the latest entry
                            except AttributeError:
                                # some entries have no GenBank id at all
                                db_dic[protein_counter]['genbank'] = 'unavailable'

                            db_dic[protein_counter]['uniprot'] = tds[4].text.replace('&nbsp;', '')
                            db_dic[protein_counter]['pdb'] = tds[5].text.replace('&nbsp;', '')

                            # check if this species has a complete genome
                            try:
                                db_dic[protein_counter]['organism_code'] = species_dic[tds[2].text.replace('&nbsp;', '')]
                            except KeyError:
                                db_dic[protein_counter]['organism_code'] = 'invalid'

                            # check if there are subfamilies
                            try:
                                db_dic[protein_counter]['subfamily'] = tds[6].text.replace('&nbsp;', '')
                            except IndexError:
                                db_dic[protein_counter]['subfamily'] = ''

                            if 'characterized' in main_link:
                                db_dic[protein_counter]['tag'] = 'characterized'
                            else:
                                db_dic[protein_counter]['tag'] = ''

                            protein_counter += 1

    # Output
    output_f = 'CAZy_DB_%s.csv' % time.strftime("%d-%m-%Y")
    out = open(output_f, 'w')
    header = '\t'.join(db_dic[0].keys())
    out.write(header + '\n')

    for p in db_dic:
        tbw = '\t'.join(db_dic[p].values())
        tbw = tbw.encode('utf8')  # make sure the encoding is ok
        out.write(tbw + '\n')

    out.close()

    sys.exit(0)

if __name__ == '__main__':
    main()
# done.
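
The README invokes this script as `create_cazy_db`, but how that command maps to `main()` is not shown in this commit. A plausible wiring, assuming a standard setuptools console entry point (the module path here is hypothetical):

```python
# Hypothetical setup.py excerpt: one plausible way the create_cazy_db
# command could be wired to main() via a setuptools entry point.
from setuptools import setup, find_packages

setup(
    name='cazy-parser',
    packages=find_packages(),
    entry_points={
        'console_scripts': [
            'create_cazy_db = cazy_parser:main',
        ],
    },
)
```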

build/lib/paper/__init__.py

Whitespace-only changes.
