| 1 | +************************************ |
| 2 | +Random database/dataframe generator |
| 3 | +************************************ |
| 4 | + |
| 5 | +**(Tirthajyoti Sarkar, Sunnyvale, USA, March 2018)** |
| 6 | + |
| 7 | +Introduction |
| 8 | +============= |
| 9 | +Beginners in SQL or data science often struggle to find a large sample database file (**.DB** or **.sqlite**) for practicing SQL commands. **Would it not be great to have a simple tool or library to generate a large database with multiple tables, filled with data of one's own choice?** |
| 10 | + |
| 11 | +After all, databases break every now and then and it is safest to practice with a randomly generated one :-) |
| 12 | + |
| 13 | +.. image:: https://imgs.xkcd.com/comics/exploits_of_a_mom.png |
| 14 | + |
| 15 | + |
| 16 | +While it is easy to generate random numbers or simple words for learning Pandas or dataframe operations, it is often non-trivial to generate full data tables with meaningful yet random entries for the fields most commonly encountered in databases, such as ``name, age, birthday, credit card number, SSN, email id, physical address, company name, job title``, etc. |
| 17 | + |
| 18 | +This Python package generates a random database ``TABLE`` (or a Pandas dataframe, or an Excel file) based on the user's choice of data types (database fields). The user can specify the number of samples needed and can also designate a **"PRIMARY KEY"** for the database table. Finally, the ``TABLE`` is inserted into a new or existing database file of the user's choice. |
| 19 | + |
| 20 | +Dependency and Acknowledgement |
| 21 | +=============================== |
| 22 | +At its core, ``pydbgen`` uses ``Faker`` as the default random data generating engine for most of the data types. Original functions are written for a few data types such as ``realistic email`` and ``license plate``. Also, the default phone number generated by ``Faker`` is free-format and does not correspond to the US 10-digit format. Therefore, a ``simple phone number`` data type is introduced in ``pydbgen``. The original contribution of ``pydbgen`` is to take the single data-generating functions from ``Faker`` and use them to generate Pandas data series, dataframes, or SQLite database tables as per the user's specification. |
| 23 | +Here is the link if you want to learn more about the ``Faker`` package: |
| 24 | + |
| 25 | +`Faker Documentation Home <https://faker.readthedocs.io/en/latest/index.html>`_ |
| 26 | + |
| 27 | +Installation |
| 28 | +============= |
| 29 | +You can use pip to install pydbgen: ``pip install pydbgen`` |
| 30 | + |
| 31 | +Usage |
| 32 | +========= |
| 33 | +The current version (1.0.0) of ``pydbgen`` comes with the following primary methods: |
| 34 | + |
| 35 | +* ``gen_data_series()`` |
| 36 | +* ``gen_dataframe()`` |
| 37 | +* ``gen_table()`` |
| 38 | +* ``gen_excel()`` |
| 39 | + |
| 40 | +The ``gen_table()`` method allows you to build a database with as many tables as you want, filled with random data and fields of your choice. But first, you have to import the module (``from pydbgen import pydbgen``) and create an object of the ``pydb`` class:: |
| 41 | + |
| 42 | + myDB = pydbgen.pydb() |
| 43 | + |
| 44 | +gen_data_series() |
| 45 | +------------------ |
| 46 | +Returns a `Pandas series object <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html>`_ with the desired number of entries and data type. Data types available: |
| 47 | + |
| 48 | +* Name, country, city, real (US) cities, US state, zipcode, latitude, longitude |
| 49 | +* Month, weekday, year, time, date |
| 50 | +* Personal email, official email, SSN |
| 51 | +* Company, Job title, phone number, license plate |
| 52 | + |
| 53 | +Phone numbers can be of two types: |
| 54 | + |
| 55 | +* ``phone_number_simple`` generates a 10-digit US number in xxx-xxx-xxxx format |
| 56 | +* ``phone_number_full`` may generate an international number with a different format |
| 57 | + |
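
To illustrate the ``xxx-xxx-xxxx`` format, here is a minimal standard-library sketch. The function name ``simple_phone_number`` is hypothetical and is not ``pydbgen``'s implementation; it only shows the shape of the output and makes no attempt to produce a valid area code.

```python
import random
import re

# Pattern for the documented 10-digit US format: xxx-xxx-xxxx
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")


def simple_phone_number(rng=random):
    """Return a random string in xxx-xxx-xxxx format (illustrative only)."""
    # Hypothetical helper: draw ten independent digits, then group them 3-3-4.
    digits = [str(rng.randint(0, 9)) for _ in range(10)]
    return "{}-{}-{}".format(
        "".join(digits[:3]), "".join(digits[3:6]), "".join(digits[6:])
    )


if __name__ == "__main__":
    for _ in range(3):
        print(simple_phone_number())
```

Passing a seeded ``random.Random`` instance as ``rng`` makes the output reproducible, which can be handy in tests.
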
| 58 | +**Code example**:: |
| 59 | + |
| 60 | + se=myDB.gen_data_series(data_type='date') |
| 61 | + print(se) |
| 62 | + |
| 63 | + 0 1995-08-09 |
| 64 | + 1 2001-08-01 |
| 65 | + 2 1980-06-26 |
| 66 | + 3 2018-02-18 |
| 67 | + 4 1972-10-12 |
| 68 | + 5 1983-11-12 |
| 69 | + 6 1975-09-04 |
| 70 | + 7 1970-11-01 |
| 71 | + 8 1978-03-23 |
| 72 | + 9 1976-06-03 |
| 73 | + dtype: object |
| 74 | + |
| 75 | +gen_dataframe() |
| 76 | +------------------ |
| 77 | +Generates a `Pandas dataframe <https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html>`_ filled with random entries. The user can specify the number of rows and the data types of the fields/columns. Data types available: |
| 78 | + |
| 79 | +* Name, country, city, real (US) cities, US state, zipcode, latitude, longitude |
| 80 | +* Month, weekday, year, time, date |
| 81 | +* Personal email, official email, SSN |
| 82 | +* Company, Job title, phone number, license plate |
| 83 | +Customization choices are as follows: |
| 84 | + |
| 85 | +- **real_email**: If ``TRUE`` and a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, the name ``Tirtha Sarkar`` with this choice enabled will generate emails like ``[email protected]`` or ``[email protected]``. |
| 86 | +- **real_city**: If ``TRUE``, a real US city's name will be picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated. |
| 87 | +- **phone_simple**: If ``TRUE``, a 10-digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with a different format may be returned. |
| 88 | + |
| 89 | +**Code example**:: |
| 90 | + |
| 91 | + testdf=myDB.gen_dataframe(25,fields=['name','city','phone', |
| 92 | + 'license_plate','email'],real_email=True,phone_simple=True) |
| 93 | + |
| 94 | +gen_table() |
| 95 | +------------------ |
| 96 | +Attempts to create a table in a database (.db) file using Python's built-in SQLite engine. The user can specify various data types to be included as database table fields. All data types (fields) in the SQLite table will be of VARCHAR type. Data types available: |
| 97 | + |
| 98 | +* Name, country, city, real (US) cities, US state, zipcode, latitude, longitude |
| 99 | +* Month, weekday, year, time, date |
| 100 | +* Personal email, official email, SSN |
| 101 | +* Company, Job title, phone number, license plate |
| 102 | +Customization choices are as follows: |
| 103 | + |
| 104 | +- **real_email**: If ``TRUE`` and a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, the name ``Tirtha Sarkar`` with this choice enabled will generate emails like ``[email protected]`` or ``[email protected]``. |
| 105 | +- **real_city**: If ``TRUE``, a real US city's name will be picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated. |
| 106 | +- **phone_simple**: If ``TRUE``, a 10-digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with a different format may be returned. |
| 107 | + |
| 108 | +``db_file``: Name of the database where the ``TABLE`` will be created or updated. A default database name is used if the user does not specify one. |
| 109 | + |
| 110 | +``table_name``: Name of the table, to be chosen by the user. A default table name is used if the user does not specify one. |
| 111 | + |
| 112 | + |
| 113 | + |
| 114 | +``primarykey``: The user can choose a PRIMARY KEY from among the various fields. If nothing is specified, the first data field will be made the PRIMARY KEY. If the user chooses a field that is not in the specified list, an error will be thrown and no table will be generated. |
| 115 | + |
| 116 | +**Code example**:: |
| 117 | + |
| 118 | + myDB.gen_table(20,fields=['name','city','job_title','phone','company','email'], |
| 119 | + db_file='TestDB.db',table_name='People',primarykey='name',real_city=False) |
| 120 | + |
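
The behaviour described above (all columns of VARCHAR type, one field designated as PRIMARY KEY, an error for an unknown key) can be sketched with Python's built-in ``sqlite3`` module. This is a minimal illustration under stated assumptions, not ``pydbgen``'s actual code: the helper ``create_text_table`` and the sample rows are hypothetical, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3


def create_text_table(conn, table_name, fields, rows, primarykey=None):
    """Create a table whose columns are all VARCHAR and insert the rows."""
    # Mirror the documented default: the first field becomes the PRIMARY KEY.
    if primarykey is None:
        primarykey = fields[0]
    # Mirror the documented error case: the key must be one of the fields.
    if primarykey not in fields:
        raise ValueError("primarykey %r is not among the fields" % primarykey)
    columns = ", ".join(
        '"%s" VARCHAR%s' % (f, " PRIMARY KEY" if f == primarykey else "")
        for f in fields
    )
    conn.execute('CREATE TABLE "%s" (%s)' % (table_name, columns))
    # Values go through placeholders; only identifiers are interpolated.
    placeholders = ", ".join("?" for _ in fields)
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table_name, placeholders), rows
    )
    conn.commit()


# Hypothetical sample data standing in for randomly generated entries.
conn = sqlite3.connect(":memory:")
sample_rows = [
    ("Alice Example", "Springfield", "Engineer"),
    ("Bob Example", "Shelbyville", "Analyst"),
]
create_text_table(
    conn, "People", ["name", "city", "job_title"], sample_rows, primarykey="name"
)
people = conn.execute('SELECT name, city FROM "People" ORDER BY name').fetchall()
print(people)
```

Because the chosen key is a PRIMARY KEY, inserting two rows with the same value in that column would raise ``sqlite3.IntegrityError``, which is worth keeping in mind when the key field holds random data that may repeat.
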
| 121 | +gen_excel() |
| 122 | +------------------ |
| 123 | +Attempts to create an Excel file using the Pandas excel_writer function. The user can specify various data types to be included. All data types (fields) in the Excel file will be of text type. Data types available: |
| 124 | + |
| 125 | +* Name, country, city, real (US) cities, US state, zipcode, latitude, longitude |
| 126 | +* Month, weekday, year, time, date |
| 127 | +* Personal email, official email, SSN |
| 128 | +* Company, Job title, phone number, license plate |
| 129 | +Customization choices are as follows: |
| 130 | + |
| 131 | +- **real_email**: If ``TRUE`` and a person's name is also included in the fields, a realistic email will be generated corresponding to the name of the person. For example, the name ``Tirtha Sarkar`` with this choice enabled will generate emails like ``[email protected]`` or ``[email protected]``. |
| 132 | +- **real_city**: If ``TRUE``, a real US city's name will be picked from a list (included as a text data file with the installation package). Otherwise, a fictitious city name will be generated. |
| 133 | +- **phone_simple**: If ``TRUE``, a 10-digit US number in the format xxx-xxx-xxxx will be generated. Otherwise, an international number with a different format may be returned. |
| 134 | + |
| 135 | +``filename``: Name of the Excel file to be created or updated. A default file name is used if the user does not specify one. |
| 136 | + |
| 137 | +**Code example**:: |
| 138 | + |
| 139 | + myDB.gen_excel(15,fields=['name','year','email','license_plate'], |
| 140 | + filename='TestExcel.xlsx',real_email=True) |
| 141 | + |
| 142 | +Other auxiliary methods available |
| 143 | +---------------------------------- |
| 144 | +A few other auxiliary functions are available in this package. |
| 145 | + |
| 146 | +* **Realistic email** with a given name as seed:: |
| 147 | + |
| 148 | + for _ in range(10): |
| 149 | + print(myDB.realistic_email('Tirtha Sarkar')) |
| 150 | + |
| 151 | + |
| 152 | + |
| 153 | + |
| 154 | + |
| 155 | + |
| 156 | + |
| 157 | + |
| 158 | + |
| 159 | + |
| 160 | + |
| 161 | + |
| 162 | +* **License plate** in a few different styles (1, 2, or 3):: |
| 163 | + |
| 164 | + for _ in range(10): |
| 165 | + print(myDB.license_plate()) |
| 166 | + |
| 167 | + 1OAG936 |
| 168 | + LTZ-6460 |
| 169 | + ODQ-846 |
| 170 | + 8KNW713 |
| 171 | + MFX-8256 |
| 172 | + 6WMH396 |
| 173 | + OQX-2780 |
| 174 | + OOD-124 |
| 175 | + RXY-8865 |
| 176 | + JZV-3326 |
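
The sample output above shows three visible shapes: one digit, three letters and three digits (``1OAG936``); three letters, a hyphen and four digits (``LTZ-6460``); and three letters, a hyphen and three digits (``ODQ-846``). A minimal standard-library sketch of such a generator follows. It is not ``pydbgen``'s implementation: the ``style`` and ``rng`` parameters of this stand-alone function, and the mapping of style numbers 1, 2, 3 to these shapes, are assumptions made for illustration.

```python
import random
import string


def license_plate(style=None, rng=random):
    """Return a random plate string in one of three assumed styles."""
    # With no style given, pick one of the three at random.
    if style is None:
        style = rng.choice([1, 2, 3])

    def letters(n):
        return "".join(rng.choice(string.ascii_uppercase) for _ in range(n))

    def digits(n):
        return "".join(rng.choice(string.digits) for _ in range(n))

    # The numbering of the styles below is an assumption, not taken from pydbgen.
    if style == 1:  # shape like 1OAG936
        return digits(1) + letters(3) + digits(3)
    if style == 2:  # shape like LTZ-6460
        return letters(3) + "-" + digits(4)
    if style == 3:  # shape like ODQ-846
        return letters(3) + "-" + digits(3)
    raise ValueError("style must be 1, 2, or 3")


if __name__ == "__main__":
    for _ in range(5):
        print(license_plate())
```
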