This repository is used by the digital trade team. The project is to use financial transactions data to produce estimates for total household spending by UK cardholders, online and abroad – or household’s digitally ordered imports.
The set up instructions are important to ensure that you do not push data to the repository.
Git should only be used for code and not data, so code files including R files, R markdown files, python files and notebooks should be pushed. However, we need to ensure no data or graphs are included within commits.
There are 2 ways in which data and charts can be commited to Git:
- Seperate files, for example .txt, .csv, .png etc.
- Within scripts, for example Notebooks etc.
Notebooks are a type of code file that allows you to view outputs, they contain information about these outputs. This information needs to be cleared before it is pushed to GitHub, this is done through Pre-commit hooks.
This repository has been created using the 'cookie cutter repo template', therefore the pre-commit hooks are already setup, however they need to be "switched on".
When this isn’t done, it will allow you to push notebooks to GitHub without clearing out the outputs.
The following instructions should be followed to ensure pre-commit hooks are set up correctly: These instructions need to be followed for every repository.
-
Clone the repository.
-
IMPORTANT Install Auto-setup pre-commit hooks by:
- Opening the terminal
- Type the following into the terminal to navigate to the correct directory:
cd <repo_name>
- Once in the correct directory type the following:
make pre_commit
-
Check pre commit hooks run when you push a commit. If you do not see a list of checks in the terminal, there may be a problem.
-
If you are unsure, create a notebook, create a graph using dummy data, run the notebook, push the changes to the repo. When you open GitHub in a browser you should only be able to see the code you created. If you can see the graph the setup is not correct.
Separate file management is done using a git ignore file in the repository. The git ignore file tells git to skip anything with certain file extensions.
This repository has been created using the 'cookie cutter repo template', this means that a git ignore file already exists.
You should check the git ignore file contains all the types of data file in your project
If you require more file extensions to be ignored, these can be added within .gitignore
It is strongly recommended that you create a new environment for each project, to avoid dependency issues. This is where the package requirements for the repo can be installed.
The following instructions should be followed to create a virtual environment for a new repo using Python's venv:
- Open the terminal
- Type the following into the terminal to navigate to the root directory of the repo:
cd <repo_name>
- Once in the correct directory type the following:
make venv
This will:
- create a venv in the new repo named after that repo
- Install the requirements.txt into that venv
- Install an ipykernel based on the new venv
If you want to install new packages into your virtual environment using pip
, you have to activate it first. This has to be done every time a new terminal is opened. The terminal command to activate the venv is:
source <venv_name>/bin/activate
In order to use your virtual environment in a jupyter notebook, a jupyter ipykernel is created. make venv
automatically does this. You must select the correct kernel from the drop-down kernel list in jupyter lab.
If you ever need to delete a virtual environment (for example, if something goes wrong and you want a fresh environment), do the following:
- Open the terminal
- Type the following into the terminal to navigate to the root directory of the repo:
cd <repo_name>
- Once in the correct directory type the following:
make delete_venv
This will:
- delete the venv that is named after the repository (the default name when using
make venv
) - delete the ipykernel associated to this venv
- Data folder and all contents are untracked by GIT
- .gitignore default blocks popular data export file types due to data sensitivity
- Documents folder contains AQA plan, assumptions, caveats, project planning
- Notebooks stores notebooks
- Outputs stores outputs
- src for functions, modules, python scripts
- Use .env for consistent file paths
- Test for tests of src and coverage
- Secrets require .secrets and are not tracked by Git, see Secrets and credentials for more information
See /docs/structure/README.md for an overview of the project structure, and docs/contributor_guide/README.md for an overview on contributing for developers.
Where this documentation refers to the root folder we mean where this README.md is
located.
- Access to Google Cloud Platform, Vertex AI Notebooks, Google Big Query
- Permissions to access project datasets and personal notebook
- Python 3.6.1+ installed
- a
.secrets
file (if needed to manage credentials) with the needed secrets and credentials - load environment variables from
.env
To load environment variables such as file paths from the .env
hidden file you will need to use dotenv
and os
if using Python on windows.
- Open your notebook and import modules
from dotenv import load_dotenv
import os
- Load
.env
file
load_dotenv(override=True)
%env #returns all environment variables
- Select the environment variable you require
path_to_outputs = os.getenv("DIR_OUTPUTS")
print(path_to_outputs) #returns ./outputs
Hidden Files & .Gitignore (if required)
For developers note that if using JuypterLabs that some files such as .gitignore are hidden by default. .Gitignore is set-up for both Python, R, and generally popular output methods. This limits the possibility to commit privileged data to Git through various filetypes but this is not a replacement for best practice.
These hidden files are accessed through GitHub or through changing JuypterLabs server config settings if required for individual users. This latter step is only recommended if developers require and is not recommended for standard or non-technical users. To enable hidden files:
- To locate the jupyter config file, open the terminal and enter:
cd ~/.jupyter && nano jupyter_notebook_config.py
- This will open the nano editor, under additional code at the bottom add this line
c.ContentsManager.allow_hidden = True
-
Press ctrl + O to save and hit enter to confirm the filename with agnostic file type saving. Press ctrl + x to exit.
-
Restart the notebook and kernels, any hidden . files will now be visible such as .secrets or .gitignore.
This is a recommended best practice to employ a .secrets
file but may not be necessary for all projects where credential management is not required. For GCP Vertex AI Notebook work this step may be optional as credential management occurs at an earlier stage and external access is typically disallowed, users may wish to include project ID's and similar as secrets however.
To run this project, you need a .secrets
file with secrets/credentials as environmental variables. The
secrets/credentials should have the following environment variable name(s):
Secret/credential | Environment variable name | Description |
---|---|---|
Secret 1 | SECRET_VARIABLE_1 |
Plain English description of Secret 1. |
Credential 1 | CREDENTIAL_VARIABLE_1 |
Plain English description of Credential 1. |
Once added you can load these environment variables using .env
. See docs/user_guide/loading_environment_variables.md for an overview.
To load environment variables such as credentials from the .secret
hidden file you will need to use dotenv
and os
if using Python on windows.
This is not the best practice solution and should be able to work without changing
directories or specifying the full directory similar to `.env` process.
Any contributions to fix this are welcome.
- Create your
.secrets
file if it does not exist, open your terminal and type:
touch .secrets
-
Add your credentials into .secrets, this is untracked by GIT. Follow the format shown in Secrets and credentials (if required) section.
-
Open your notebook and import modules:
from dotenv import load_dotenv
import os
- Load
.secrets
file by either using the full file path or changing your directory to project root: Using full file path:
%pwd #this shows your full working directory, grab to the project root
load_dotenv("<full_directory_to_project_root>/.secrets", override=True)
%env #returns all environment variables
Alternatively change directory to root:
#change directory to project root
cd ..
load_dotenv(".secrets", override=True)
%env #returns all environment variables
- Select the environment variable you require
my_secret = os.getenv("SECRET_1")
print(my_secret) #returns "yuki is a cute kitten"