#!/usr/bin/env python3
#==============================================================================#
# Copyright (C) 2016 Rodrigo Honorato
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <http://www.gnu.org/licenses/>
#==============================================================================#

#==============================================================================#
# Create a parsed database by exploiting the CAZy HTML structure
import re
import string
import sys
import time
import urllib.request

from bs4 import BeautifulSoup
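# NOTE: BeautifulSoup 4 is the only third-party dependency (assumption:
# installed with `pip install beautifulsoup4`); everything else is stdlib.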

def main():
    #==============================================================================#
    # Species part
    #==============================================================================#
    print('>> Gathering species codes for species with full genomes')
    # a = archaea // b = bacteria // e = eukaryota // v = virus
    species_domain_list = ['a', 'b', 'e', 'v']
    species_dic = {}
    for initial in string.ascii_uppercase:
        for domain in species_domain_list:
            link = 'http://www.cazy.org/%s%s.html' % (domain, initial)
            species_list_hp = urllib.request.urlopen(link).read().decode('utf-8', errors='replace')
            # parse the webpage for (species index code, species name) pairs
            index_list = re.findall(r'"http://www.cazy.org/(b\d.*).html" class="nav">(.*)</a>', species_list_hp)
            for index, species in index_list:
                species_dic.setdefault(species, []).append(index)
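    # At this point species_dic maps each species name to one or more page
    # codes, e.g. (illustrative values, not from a real crawl):
    #   {'Escherichia coli': ['b12', 'b345']}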

    # Double check which of the species codes are valid
    for species in species_dic:
        entry_list = species_dic[species]
        if len(entry_list) > 1:
            # More than one entry for this species:
            # > this is (likely) a duplicate
            # > use the highest number, which should be the newer page
            newer_entry = max([int(i.split('b')[-1]) for i in entry_list])
            species_dic[species] = 'b%i' % newer_entry
        else:
            species_dic[species] = species_dic[species][0]
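    # species_dic now maps each species name to a single page code string,
    # e.g. (illustrative): {'Escherichia coli': 'b345'}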

    #==============================================================================#
    # Enzyme class part
    #==============================================================================#

    enzyme_classes = ['Glycoside-Hydrolases',
                      'GlycosylTransferases',
                      'Polysaccharide-Lyases',
                      'Carbohydrate-Esterases',
                      'Auxiliary-Activities']
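    # These are the five top-level CAZy enzyme classes; each has an index page
    # at http://www.cazy.org/<class>.html that lists its families.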

    db_dic = {}
    protein_counter = 0
    for e_class in enzyme_classes:
        print('>> %s' % e_class)
        main_class_link = 'http://www.cazy.org/%s.html' % e_class

        #==============================================================================#
        # Family section
        #==============================================================================#
        soup = BeautifulSoup(urllib.request.urlopen(main_class_link), 'html.parser')
        family_table = soup.find_all(name='table')[0]
        rows = family_table.find_all(name='td')

        family_list = [str(r.find('a')['href'].split('/')[-1].split('.html')[0]) for r in rows]

        print('>> %i families found on %s' % (len(family_list), main_class_link))
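        # family_list holds the family codes parsed from the class index table,
        # e.g. (illustrative): ['GH1', 'GH2', ...] for Glycoside-Hydrolases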
        #==============================================================================#
        # Identification section
        #==============================================================================#
        for family in family_list:
            print('> %s' % family)
            main_link = 'http://www.cazy.org/%s.html' % family
            family_soup = BeautifulSoup(urllib.request.urlopen(main_link), 'html.parser')
            #====================#
            # the first tab is the family summary page itself; skip it
            superfamily_list = [l.find('a')['href'] for l in family_soup.find_all('span', attrs={'class': 'choix'})][1:]

            # remove the structure tab, for now
            superfamily_list = [f for f in superfamily_list if 'structure' not in f]
            #====================#
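            # superfamily_list holds the per-tab listing pages of the family,
            # e.g. (illustrative): ['http://www.cazy.org/GH5_all.html',
            #                       'http://www.cazy.org/GH5_characterized.html']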
            for main_link in superfamily_list:

                page_zero = main_link

                soup = BeautifulSoup(urllib.request.urlopen(main_link), 'html.parser')

                # Get the page list for the family // 1, 2, 3, 4, 5, 7
                page_index_list = soup.find_all(name='a', attrs={'class': 'lien_pagination'})
                if page_index_list:
                    first_page_idx = int(page_index_list[0]['href'].split('PRINC=')[-1].split('#')[0])  # be careful with this
                    last_page_idx = int(page_index_list[-2]['href'].split('PRINC=')[-1].split('#')[0])  # be careful with this

                    # generate page_list
                    page_list = [page_zero]
                    for i in range(first_page_idx, last_page_idx + first_page_idx, first_page_idx):
                        base, anchor = page_index_list[0]['href'].split('#')
                        link = 'http://www.cazy.org/' + base.split('=')[0] + '=' + str(i) + '#' + anchor
                        page_list.append(link)
                else:
                    page_list = [page_zero]
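                # Each generated link follows the pagination hrefs found on the
                # page itself, e.g. (illustrative; the real parameter name is
                # whatever the site uses):
                #   http://www.cazy.org/GH5_all.html?debut_PRINC=100#pagination_PRINC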

                for link in page_list:
                    # tr = rows // td = cells
                    soup = BeautifulSoup(urllib.request.urlopen(link), 'html.parser')
                    table = soup.find('table', attrs={'class': 'listing'})
                    domain = ''

                    # consistency check to catch deleted families, e.g. GH21
                    if table is None:
                        # not a valid listing page, move on
                        continue

                    for row in table.find_all('tr'):
                        # rows with class 'royaume' carry the kingdom header
                        # (archaea / bacteria / eukaryota / viruses); bs4
                        # returns the class attribute as a list
                        row_class = row.get('class') or []
                        if 'royaume' in row_class and row.text != 'Top':
                            domain = str(row.text).lower()

                        tds = row.find_all('td')
                        if len(tds) > 1 and tds[0].text != 'Protein Name':
                            # valid protein line
                            db_dic[protein_counter] = {}

                            db_dic[protein_counter]['protein_name'] = tds[0].text.replace(' ', '')
                            db_dic[protein_counter]['family'] = family
                            db_dic[protein_counter]['domain'] = domain
                            db_dic[protein_counter]['ec'] = tds[1].text.replace(' ', '')
                            db_dic[protein_counter]['organism'] = tds[2].text.replace(' ', '')
                            try:
                                # get the latest GenBank entry
                                db_dic[protein_counter]['genbank'] = tds[3].find('a').text.replace(' ', '')
                            except AttributeError:
                                # some rows have no GenBank link at all
                                db_dic[protein_counter]['genbank'] = 'unavailable'

                            db_dic[protein_counter]['uniprot'] = tds[4].text.replace(' ', '')
                            db_dic[protein_counter]['pdb'] = tds[5].text.replace(' ', '')

                            # check if this species has a complete genome
                            try:
                                db_dic[protein_counter]['organism_code'] = species_dic[tds[2].text.replace(' ', '')]
                            except KeyError:
                                db_dic[protein_counter]['organism_code'] = 'invalid'

                            # check if there are subfamilies
                            try:
                                db_dic[protein_counter]['subfamily'] = tds[6].text.replace(' ', '')
                            except IndexError:
                                db_dic[protein_counter]['subfamily'] = ''

                            if 'characterized' in main_link:
                                db_dic[protein_counter]['tag'] = 'characterized'
                            else:
                                db_dic[protein_counter]['tag'] = ''
                            protein_counter += 1
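                            # a finished entry looks like (illustrative):
                            #   {'protein_name': 'BglA', 'family': 'GH1',
                            #    'domain': 'bacteria', 'ec': '3.2.1.21', ...}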

    # Output
    # NOTE: despite the .csv extension, fields are tab-separated
    output_f = 'CAZy_DB_%s.csv' % time.strftime("%d-%m-%Y")
    with open(output_f, 'w', encoding='utf-8') as out:
        header = '\t'.join(db_dic[0].keys())
        out.write(header + '\n')

        for p in db_dic:
            out.write('\t'.join(db_dic[p].values()) + '\n')

    sys.exit(0)

if __name__ == '__main__':
    main()
# done.
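
# Usage sketch (assumptions: Python 3, network access to www.cazy.org, and a
# hypothetical script name):
#   $ python cazy_parser.py
# This writes a tab-separated CAZy_DB_<DD-MM-YYYY>.csv to the working directory.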