Commit 37d3869

committed
Initial commit
Signed-off-by: tirthajyoti <[email protected]>
0 parents  commit 37d3869

15 files changed

+43329
-0
lines changed

.gitignore

+13
@@ -0,0 +1,13 @@
1+
# general things to ignore
2+
build/
3+
dist/
4+
*.egg-info/
5+
*.egg
6+
*.py[cod]
7+
__pycache__/
8+
*.so
9+
*~
10+
11+
# due to using tox and pytest
12+
.tox
13+
.cache

.pypirc

+9
@@ -0,0 +1,9 @@
1+
[distutils]
2+
index-servers=
3+
pypi
4+
testpypi
5+
6+
[testpypi]
7+
repository: https://test.pypi.org/legacy/
8+
username: tirthajyoti
9+
password: Pinku1920

LICENSE.txt

+19
@@ -0,0 +1,19 @@
1+
Copyright (c) 2016 The Python Packaging Authority (PyPA)
2+
3+
Permission is hereby granted, free of charge, to any person obtaining a copy of
4+
this software and associated documentation files (the "Software"), to deal in
5+
the Software without restriction, including without limitation the rights to
6+
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
7+
of the Software, and to permit persons to whom the Software is furnished to do
8+
so, subject to the following conditions:
9+
10+
The above copyright notice and this permission notice shall be included in all
11+
copies or substantial portions of the Software.
12+
13+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
19+
SOFTWARE.

MANIFEST.in

+2
@@ -0,0 +1,2 @@
1+
# Include the license file
2+
include LICENSE.txt

README.rst

+176
@@ -0,0 +1,176 @@
1+
************************************
2+
Random database/dataframe generator
3+
************************************
4+
5+
**(Tirthajyoti Sarkar, Sunnyvale, USA, March 2018)**
6+
7+
Introduction
8+
=============
9+
Beginners in SQL or data science often struggle to find a large sample database file (**.DB** or **.sqlite**) for practicing SQL commands. **Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one's own choice?**
10+
11+
After all, databases break every now and then and it is safest to practice with a randomly generated one :-)
12+
13+
.. image:: https://imgs.xkcd.com/comics/exploits_of_a_mom.png
14+
15+
16+
While it is easy to generate random numbers or simple words for learning Pandas or dataframe operations, it is often non-trivial to generate full data tables with meaningful yet random entries for the fields most commonly encountered in databases, such as ``name, age, birthday, credit card number, SSN, email id, physical address, company name, job title`` etc.
17+
18+
This Python package generates a random database ``TABLE`` (or a Pandas dataframe, or an Excel file) based on the user's choice of data types (database fields). The user can specify the number of samples needed and can also designate a **"PRIMARY KEY"** for the database table. Finally, the ``TABLE`` is inserted into a new or existing database file of the user's choice.
19+
20+
Dependency and Acknowledgement
21+
===============================
22+
At its core, ``pydbgen`` uses ``Faker`` as the default random data generating engine for most of the data types. Original functions are written for a few data types such as ``realistic email`` and ``license plate``. In addition, the default phone number generated by ``Faker`` is free-format and does not correspond to the US 10-digit format, so ``pydbgen`` introduces a ``simple phone number`` data type. The original contribution of ``pydbgen`` is to take the individual data-generating functions from ``Faker`` and use them to generate Pandas data series, dataframes, or SQLite database tables as per the user's specification.
23+
To learn more about the ``Faker`` package, see:
24+
25+
`Faker Documentation Home <https://faker.readthedocs.io/en/latest/index.html>`_
26+
27+
Installation
28+
=============
29+
You can use pip to install pydbgen: ``pip install pydbgen``
30+
31+
Usage
32+
=========
33+
The current version (1.0.0) of ``pydbgen`` provides the following primary methods:
34+
35+
* ``gen_data_series()``
36+
* ``gen_dataframe()``
37+
* ``gen_table()``
38+
* ``gen_excel()``
39+
40+
The ``gen_table()`` method lets you build a database with as many tables as you want, filled with random data in fields of your choice. First, however, you have to create an object of the ``pydb`` class::
41+
42+
myDB = pydbgen.pydb()
43+
44+
gen_data_series()
45+
------------------
46+
Returns a `Pandas series object <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html>`_ with the desired number of entries and data type. Data types available:
47+
48+
* Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
49+
* Month, weekday, year, time, date
50+
* Personal email, official email, SSN
51+
* Company, Job title, phone number, license plate
52+
53+
Phone number can be of two types:
54+
55+
* ``phone_number_simple`` generates a 10-digit US number in xxx-xxx-xxxx format
56+
* ``phone_number_full`` may generate an international number with a different format
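The xxx-xxx-xxxx format can be illustrated with a minimal standard-library sketch. The helper name ``simple_phone_number`` and the choice to start the first two groups at 200 are assumptions made for illustration; this is not the code ``pydbgen`` itself uses.

```python
import random


def simple_phone_number(rng=random):
    """Return a random number formatted as xxx-xxx-xxxx (hypothetical helper)."""
    # Assumption of this sketch: the first two groups never start with 0 or 1.
    area = rng.randint(200, 999)
    exchange = rng.randint(200, 999)
    line = rng.randint(0, 9999)
    # Zero-pad each group so the string always has 3-3-4 digits.
    return f"{area:03d}-{exchange:03d}-{line:04d}"


print(simple_phone_number())
```

Passing a seeded ``random.Random`` instance as ``rng`` makes the output reproducible.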
57+
58+
**Code example**::
59+
60+
se=myDB.gen_data_series(data_type='date')
61+
print(se)
62+
63+
0 1995-08-09
64+
1 2001-08-01
65+
2 1980-06-26
66+
3 2018-02-18
67+
4 1972-10-12
68+
5 1983-11-12
69+
6 1975-09-04
70+
7 1970-11-01
71+
8 1978-03-23
72+
9 1976-06-03
73+
dtype: object
74+
75+
gen_dataframe()
76+
------------------
77+
Generates a `Pandas dataframe <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html>`_ filled with random entries. The user can specify the number of rows and the data types of the fields/columns. Data types available:
78+
79+
* Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
80+
* Month, weekday, year, time, date
81+
* Personal email, official email, SSN
82+
* Company, Job title, phone number, license plate
83+
The customization choices are as follows:
84+
85+
- **real_email**: If ``True`` and a person's name is also among the fields, a realistic email address is generated from that name. For example, with this option enabled the name ``Tirtha Sarkar`` generates emails like ``[email protected]`` or ``[email protected]``.
86+
- **real_city**: If ``True``, a real US city name is picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name is generated.
87+
- **phone_simple**: If ``True``, a 10-digit US number in the format xxx-xxx-xxxx is generated. Otherwise, an international number with a different format may be returned.
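The idea behind the realistic email option can be sketched as follows. Everything here is illustrative: the function name ``sketch_realistic_email`` and the three local-part patterns are assumptions, not the package's actual rules. The domains are a subset of the ``Domains.txt`` list included in this commit.

```python
import random

# A few of the domains listed in the package's Domains.txt
DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]


def sketch_realistic_email(name, rng=random):
    """Build an email address from a person's name (illustrative only)."""
    parts = name.lower().split()
    first, last = parts[0], parts[-1]
    # Hypothetical local-part patterns; the real package may use others.
    local = rng.choice([
        f"{first}.{last}",
        f"{first[0]}{last}",
        f"{first}_{last}{rng.randint(1, 99)}",
    ])
    return f"{local}@{rng.choice(DOMAINS)}"


print(sketch_realistic_email("Tirtha Sarkar"))
```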
88+
89+
**Code example**::
90+
91+
testdf=myDB.gen_dataframe(25,fields=['name','city','phone',
92+
'license_plate','email'],real_email=True,phone_simple=True)
93+
94+
gen_table()
95+
------------------
96+
Attempts to create a table in a database (.db) file using Python's built-in SQLite engine. The user can specify various data types to be included as database table fields. All data types (fields) in the SQLite table will be of VARCHAR type. Data types available:
97+
98+
* Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
99+
* Month, weekday, year, time, date
100+
* Personal email, official email, SSN
101+
* Company, Job title, phone number, license plate
102+
The customization choices are as follows:
103+
104+
- **real_email**: If ``True`` and a person's name is also among the fields, a realistic email address is generated from that name. For example, with this option enabled the name ``Tirtha Sarkar`` generates emails like ``[email protected]`` or ``[email protected]``.
105+
- **real_city**: If ``True``, a real US city name is picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name is generated.
106+
- **phone_simple**: If ``True``, a 10-digit US number in the format xxx-xxx-xxxx is generated. Otherwise, an international number with a different format may be returned.
107+
108+
``db_file``: Name of the database file where the ``TABLE`` will be created or updated. A default database name is used if the user does not specify one.
109+
110+
``table_name``: Name of the table, chosen by the user. A default table name is used if the user does not specify one.
111+
112+
113+
114+
``primarykey``: The user can choose a PRIMARY KEY from among the various fields. If nothing is specified, the first data field is made the PRIMARY KEY. If the user chooses a field that is not in the specified list, an error is raised and no table is generated.
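The table-creation rules described above (all columns of VARCHAR type, the first field as the default PRIMARY KEY, an error for an unknown key) can be sketched with the standard ``sqlite3`` module. The helper ``sketch_create_table`` is hypothetical and written only for illustration; it is not the package's implementation, and the inserted row is made-up sample data.

```python
import sqlite3


def sketch_create_table(conn, table_name, fields, primarykey=None):
    """Create a table whose columns are all VARCHAR, with one PRIMARY KEY."""
    if primarykey is None:
        primarykey = fields[0]  # default described above: the first field
    if primarykey not in fields:
        # No table is created when the key is not one of the fields.
        raise ValueError(f"primary key {primarykey!r} is not among the fields")
    columns = ", ".join(
        f'"{name}" VARCHAR' + (" PRIMARY KEY" if name == primarykey else "")
        for name in fields
    )
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table_name}" ({columns})')


# In-memory database, so nothing is written to disk in this sketch.
conn = sqlite3.connect(":memory:")
sketch_create_table(conn, "People", ["name", "city", "email"], primarykey="name")
conn.execute(
    "INSERT INTO People VALUES (?, ?, ?)",
    ("A. Person", "Springfield", "a.person@example.com"),
)
rows = conn.execute("SELECT name, city FROM People").fetchall()
print(rows)
```

Validating the key before issuing any SQL keeps the database unchanged when the request is invalid, which matches the "no table will be generated" behavior described above.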
115+
116+
**Code example**::
117+
118+
myDB.gen_table(20,fields=['name','city','job_title','phone','company','email'],
119+
db_file='TestDB.db',table_name='People',primarykey='name',real_city=False)
120+
121+
gen_excel()
122+
------------------
123+
Attempts to create an Excel file using the Pandas excel_writer function. The user can specify various data types to be included. All data types (fields) in the Excel file will be of text type. Data types available:
124+
125+
* Name, country, city, real (US) cities, US state, zipcode, latitude, longitude
126+
* Month, weekday, year, time, date
127+
* Personal email, official email, SSN
128+
* Company, Job title, phone number, license plate
129+
The customization choices are as follows:
130+
131+
- **real_email**: If ``True`` and a person's name is also among the fields, a realistic email address is generated from that name. For example, with this option enabled the name ``Tirtha Sarkar`` generates emails like ``[email protected]`` or ``[email protected]``.
132+
- **real_city**: If ``True``, a real US city name is picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name is generated.
133+
- **phone_simple**: If ``True``, a 10-digit US number in the format xxx-xxx-xxxx is generated. Otherwise, an international number with a different format may be returned.
134+
135+
``filename``: Name of the Excel file to be created or updated. A default file name is used if the user does not specify one.
136+
137+
**Code example**::
138+
139+
myDB.gen_excel(15,fields=['name','year','email','license_plate'],
140+
filename='TestExcel.xlsx',real_email=True)
141+
142+
Other auxiliary methods available
143+
----------------------------------
144+
A few other auxiliary functions are available in this package.
145+
146+
* **Realistic email** with a given name as seed::
147+
148+
for _ in range(10):
149+
print(myDB.realistic_email('Tirtha Sarkar'))
150+
151+
152+
153+
154+
155+
156+
157+
158+
159+
160+
161+
162+
* **License plate** in a few different styles (1, 2, or 3)::
163+
164+
for _ in range(10):
165+
print(myDB.license_plate())
166+
167+
1OAG936
168+
LTZ-6460
169+
ODQ-846
170+
8KNW713
171+
MFX-8256
172+
6WMH396
173+
OQX-2780
174+
OOD-124
175+
RXY-8865
176+
JZV-3326
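The sample output above shows three visible shapes: a digit followed by three letters and three digits, three letters with four digits, and three letters with three digits. A minimal sketch that reproduces those shapes is given below. The function name ``sketch_license_plate`` and the mapping of style numbers 1, 2, and 3 to these shapes are assumptions, not the package's documented behavior.

```python
import random
import string


def sketch_license_plate(style=None, rng=random):
    """Return a random plate in one of three shapes (illustrative only)."""
    if style is None:
        style = rng.choice([1, 2, 3])
    letters = "".join(rng.choice(string.ascii_uppercase) for _ in range(3))
    if style == 1:
        # Shape like 1OAG936: digit, three letters, three digits
        return f"{rng.randint(1, 9)}{letters}{rng.randint(0, 999):03d}"
    if style == 2:
        # Shape like LTZ-6460: three letters, hyphen, four digits
        return f"{letters}-{rng.randint(0, 9999):04d}"
    if style == 3:
        # Shape like ODQ-846: three letters, hyphen, three digits
        return f"{letters}-{rng.randint(0, 999):03d}"
    raise ValueError("style must be 1, 2, or 3")


for _ in range(3):
    print(sketch_license_plate())
```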

pydbgen/Domains.txt

+13
@@ -0,0 +1,13 @@
1+
gmail.com
2+
yahoo.com
3+
hotmail.com
4+
comcast.net
5+
outlook.com
6+
aol.com
7+
protonmail.com
8+
yandex.com
9+
zoho.com
10+
mail.com
11+
att.com
12+
xfinity.com
13+
verizon.com
