Welcome to the NeoChemSynthWave: Data project!
Over the past decade, computer-assisted chemical synthesis has re-emerged as a prominent research subject. Although the idea of using computers to assist chemical synthesis has existed for nearly as long as computers themselves, the inherent complexity of the problem has repeatedly exceeded the available resources. Recent machine learning approaches, however, have shown the potential to break this pattern. Their performance depends on data that frequently suffer from limited quantity, quality, visibility, and accessibility, which poses significant challenges to potential scientific breakthroughs. Consequently, the primary objective of the NeoChemSynthWave: Data project is to provide access to essential open computer-assisted chemical synthesis data.
An environment can be created using the git and conda commands as follows:
git clone https://github.com/neo-chem-synth-wave/ncsw-data.git
cd ncsw-data
conda env create -f environment.yaml
conda activate ncsw-data-env
The ncsw_data package can be installed using the pip command as follows:
pip install .
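One way to confirm that the installation succeeded is to check whether the active interpreter can locate the package. The following is a minimal sketch that uses only the Python standard library; the helper name is_package_installed is hypothetical and is not part of ncsw_data.

```python
import importlib.util


def is_package_installed(package_name: str) -> bool:
    """Return True if the active interpreter can locate a top-level package."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "ncsw_data" is the package name stated above.
    print(is_package_installed("ncsw_data"))
```

If the check reports False, the most likely cause is that the command was run outside the activated conda environment.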
The purpose of the case_study directory is to illustrate how to download, extract, and format the relevant data and subsequently construct, manage, and query a version of the Computer-assisted Chemical Synthesis (CaCS) database that reflects the current state of computer-assisted chemical synthesis data.
First, the a_download_extract_and_format_data script can be utilized as follows:
python use_case/scripts/a_download_extract_and_format_data.py \
--data_source_category "reaction" \
--data_source_name "uspto" \
--data_source_version "v_50k_by_20171116_coley_c_w_et_al" \
--output_directory_path "/path/to/the/output/directory"
Next, the b_insert_archive_data script can be utilized as follows:
python use_case/scripts/b_insert_archive_data.py \
--sqlite_database_file_path "sqlite:////path/to/the/cacs_db.sqlite" \
--input_csv_file_path "/path/to/the/xxx_uspto_v_50k_by_20171116_coley_c_w_et_al.csv" \
--smiles_or_smarts_column_name "rxn_smiles" \
--file_name_column_name "file_name" \
--data_source_category "reaction" \
--data_source_name "uspto" \
--data_source_version "v_50k_by_20171116_coley_c_w_et_al"
Next, the c_migrate_archive_to_workbench_data script can be utilized as follows:
python use_case/scripts/c_migrate_archive_to_workbench_data.py \
--sqlite_database_file_path "sqlite:////path/to/the/cacs_db.sqlite" \
--data_source_category "reaction"
Finally, the d_update_workbench_data script can be utilized as follows:
python use_case/scripts/d_update_workbench_data.py \
--sqlite_database_file_path "sqlite:////path/to/the/cacs_db.sqlite"
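The four steps above can also be assembled programmatically. The following is a rough sketch, not part of the repository: the function names (to_sqlite_url, build_commands, run_pipeline) are hypothetical, the script directory is copied from the commands above, and the flags are limited to those shown there. It also assumes that the value passed to --sqlite_database_file_path is a SQLAlchemy-style URL, in which "sqlite:///" is followed by the file path, so an absolute POSIX path produces four slashes. The text does not confirm this, but it matches the form used in the examples.

```python
import shlex
import subprocess
import sys
from pathlib import Path
from typing import List

# Assumption: the script location as written in the commands above.
SCRIPTS_DIRECTORY = "use_case/scripts"


def to_sqlite_url(database_file_path: str) -> str:
    """Build a SQLAlchemy-style SQLite URL (assumed format) from a file path."""
    absolute_path = Path(database_file_path).expanduser().absolute()
    # "sqlite:///" + "/abs/path" yields the four-slash form used above.
    return "sqlite:///" + absolute_path.as_posix()


def build_commands(
    output_directory_path: str,
    database_file_path: str,
    input_csv_file_path: str,
    category: str = "reaction",
    name: str = "uspto",
    version: str = "v_50k_by_20171116_coley_c_w_et_al",
) -> List[List[str]]:
    """Return the argument lists for steps a to d, in execution order."""
    database_url = to_sqlite_url(database_file_path)
    source = [
        "--data_source_category", category,
        "--data_source_name", name,
        "--data_source_version", version,
    ]

    def script(file_name: str) -> List[str]:
        return [sys.executable, f"{SCRIPTS_DIRECTORY}/{file_name}"]

    return [
        script("a_download_extract_and_format_data.py")
        + source
        + ["--output_directory_path", output_directory_path],
        script("b_insert_archive_data.py")
        + ["--sqlite_database_file_path", database_url,
           "--input_csv_file_path", input_csv_file_path,
           "--smiles_or_smarts_column_name", "rxn_smiles",
           "--file_name_column_name", "file_name"]
        + source,
        script("c_migrate_archive_to_workbench_data.py")
        + ["--sqlite_database_file_path", database_url,
           "--data_source_category", category],
        script("d_update_workbench_data.py")
        + ["--sqlite_database_file_path", database_url],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_pipeline(...) to execute them.
    for command in build_commands(
        "/path/to/the/output/directory",
        "/path/to/the/cacs_db.sqlite",
        "/path/to/the/formatted_reactions.csv",  # placeholder CSV path
    ):
        print(shlex.join(command))
```

Passing check=True makes a failing step abort the sequence, which matters here because each step depends on the output of the previous one.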
The SQLite scripts and Jupyter notebooks of the case study, which illustrate how to query the CaCS database, can be found in the notebooks directory.
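Before working through those queries, it can help to see which tables a given database file contains. The schema is not described here, so the following sketch makes no assumptions about table or column names and only reads the built-in sqlite_master catalog through the Python standard library. The function name list_table_names is hypothetical, and the argument is a plain file path rather than the "sqlite:///" URL form used by the scripts.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_table_names(database_file_path: str) -> List[str]:
    """Return the names of all tables in an existing SQLite database file."""
    # Open read-only through a URI so that a mistyped path raises an error
    # instead of silently creating a new, empty database file.
    database_uri = Path(database_file_path).absolute().as_uri() + "?mode=ro"
    connection = sqlite3.connect(database_uri, uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

For example, list_table_names("/path/to/the/cacs_db.sqlite") returns the table names in alphabetical order, which can then be used to write targeted queries.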
The contents of this repository are published under the MIT license. Please refer to the individual references for more details regarding the license information of external resources utilized within the repository.
If you are interested in contributing to this research project by reporting bugs, suggesting improvements, or submitting feedback, feel free to do so using GitHub Issues.