Commit c2962ef

Merge pull request #3 from qidi1/format
add format
2 parents 6aca861 + 980bcae commit c2962ef

20 files changed: +7664 −5462 lines changed

README.md (+15 −3)

@@ -4,11 +4,13 @@ sqlgpt-parser is a Python implementation of an SQL parser that effectively conve
 
 ## Quick Start
 
+### Install
 ```sh
 pip install sqlgpt-parser
 ```
+### Parser SQL
 
-```sh
+```python
 >>> from sql_parser.mysql_parser import parser as mysql_parser
 >>> mysql_parser.parse("select * from t")
 Query(query_body=QuerySpecification(select=Select(distinct=False, select_items=[SingleColumn(expression=QualifiedNameReference(name=QualifiedName.of("*")))]), from_=Table(name=QualifiedName.of("t"), for_update=False), order_by=[], limit=0, offset=0, for_update=False, nowait_or_wait=False), order_by=[], limit=0, offset=0)
@@ -20,7 +22,17 @@ Query(query_body=QuerySpecification(select=Select(distinct=False, select_items=[
 Query(query_body=QuerySpecification(select=Select(distinct=False, select_items=[SingleColumn(expression=QualifiedNameReference(name=QualifiedName.of("*")))]), from_=Table(name=QualifiedName.of("t"), for_update=False), order_by=[], limit=0, offset=0, for_update=False, nowait_or_wait=False), order_by=[], limit=0, offset=0)
 ```
 
+### Format SQL
+```python
+>>> from sql_parser.format.formatter import format_sql
+>>> from sql_parser.mysql_parser import parser
+>>> result=parser.parse("select * from t")
+>>> format_sql(result)
+'SELECT\n *\nFROM\n t'
+
+```
 ## Getting Started with SQL Parser Development
 
-English Document: [SQL Parser Development Guide](./docs/docs-en/SQL%20Parser%20Development%20Guide.md)
-中文文档:[SQL Parser 开发指南](./docs/docs-ch/SQL%20Parser%20开发指南.md)
+English Document: [SQL Parser Development Guide](https://github.com/eosphoros-ai/sqlgpt-parser/blob/main/docs/docs-en/SQL%20Parser%20Development%20Guide.md)
+
+中文文档:[SQL Parser 开发指南](https://github.com/eosphoros-ai/sqlgpt-parser/blob/main/docs/docs-ch/SQL%20Parser%20%E5%BC%80%E5%8F%91%E6%8C%87%E5%8D%97.md)

@@ -0,0 +1,256 @@

# Parser Module Development Guide

The `parser` module is the foundational module of `sql-lifecycle-management`. It parses SQL statements according to predefined SQL grammar rules, converting them from text into an abstract syntax tree (`AST`). The SQL rewriting, optimization, and other functionality in `sql-lifecycle-management` is built on top of the `AST` produced by the `parser` module.

The `parser` module in `sql-lifecycle-management` is written with [PLY](https://github.com/dabeaz/ply), a Python tool for building lexers and parsers. PLY analyzes input text against user-specified patterns; before the program runs, it automatically compiles the lexical and grammar rule files in the project's `parser` folder into executable parsing code.

## Lexical Analysis and Syntax Analysis

![Relationship between Lexical Analyzer and Syntax Analyzer](./pictures/parsing-overview.png)

Lexical analysis and syntax analysis are the two steps of SQL parsing, and they fit together as follows: the lexer reads the user's input and converts it into tokens according to the lexical rules; the parser then takes those tokens as input and builds an abstract syntax tree according to the grammar rules. To generate a lexer and a parser that meet their needs, users provide custom lexical rules and grammar rules, which PLY expresses in two different rule formats.

### Lexical Rules

```python
import ply.lex as lex

tokens = (
    'NUMBER',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
)

t_PLUS = r'\+'
t_MINUS = r'-'
t_TIMES = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

# Ignored characters (spaces and tabs), so spaced input lexes cleanly
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

lexer = lex.lex()
```

In `PLY`, `tokens` are specified with regular expression rules. Each rule's name must start with `t_`, followed by a word that corresponds to a value in the `tokens` list.

Simple `tokens` can be defined directly as regular expression strings:

```python
t_PLUS = r'\+'
```

Complex `tokens` are defined as functions whose docstring is the regular expression; when the input matches it, the function body is executed. In the following function, the matched text is converted to an integer, stored in `t.value`, and a `token` of type `NUMBER` is returned:

```python
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t
```

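With these rules compiled by `lex.lex()`, the lexer can be exercised directly. A minimal sketch using PLY's standard `input()` and iteration API (the arithmetic input is just an illustration):

```python
lexer.input("3 + 4 * (10 - 1)")
for tok in lexer:  # PLY lexers support the iteration protocol
    print(tok.type, tok.value)
# NUMBER 3
# PLUS +
# NUMBER 4
# TIMES *
# LPAREN (
# NUMBER 10
# MINUS -
# NUMBER 1
# RPAREN )
```
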
### Syntax Rules

#### Basics of Syntax Analysis

```
%left '+' '-'
%left '*' '/'
%%
expr:
      INTEGER
    | expr '+' expr { $$ = $1 + $3; }
    | expr '-' expr { $$ = $1 - $3; }
    | expr '*' expr { $$ = $1 * $3; }
    | expr '/' expr { $$ = $1 / $3; }
    | '(' expr ')'  { $$ = $2; }
```

The first part declares the `token` types and the associativity of the operators. All four operators are left-associative; operators on the same line have equal precedence, and operators declared on a later line have higher precedence than those declared earlier.

The grammar itself is defined in BNF (Backus-Naur Form). BNF expresses context-free languages, and most modern programming languages can be described with it. Each rule above is a production. The symbol on the left-hand side of a production (e.g., `expr`) is called a non-terminal, while `INTEGER` and `+`, `-`, `*`, `/` are terminals: tokens returned by the lexer.

The parser generated by PLY uses a bottom-up shift-reduce parsing technique, with a stack holding the intermediate states. Here is the parsing process for the expression `1 + 2 * 3`:

```
1    . 1 + 2 * 3
2    1 . + 2 * 3
3    expr . + 2 * 3
4    expr + . 2 * 3
5    expr + 2 . * 3
6    expr + expr . * 3
7    expr + expr * . 3
8    expr + expr * 3 .
9    expr + expr * expr .
10   expr + expr .
11   expr .
```

The action associated with a rule is written inside the curly braces on the right-hand side of the production. For example:

```
expr: expr '+' expr { $$ = $1 + $3; }
```

When a reduction fires, the items on the stack that match the right-hand side of the production are replaced with the production's left-hand side non-terminal. In this example, we pop `expr '+' expr` from the stack and push `expr` back on. Items on the stack are accessed with the `$position` notation: `$1` refers to the first item, `$2` to the second, and so on. `$$` denotes the top of the stack after the reduction. The action here pops three items off the stack, adds the two expressions, and pushes the result back onto the top of the stack.

#### Defining syntax rules with PLY

```python
import ply.yacc as yacc

# Get the token map from the lexer. This is required.
from calclex import tokens

precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)

def p_expr(p):
    """expr : expr PLUS expr
            | expr MINUS expr
            | expr TIMES expr
            | expr DIVIDE expr
    """
    if p.slice[2].type == 'PLUS':
        p[0] = p[1] + p[3]
    elif p.slice[2].type == 'MINUS':
        p[0] = p[1] - p[3]
    elif p.slice[2].type == 'TIMES':
        p[0] = p[1] * p[3]
    elif p.slice[2].type == 'DIVIDE':
        p[0] = p[1] / p[3]

def p_expr_paren(p):
    """expr : LPAREN expr RPAREN"""
    p[0] = p[2]

def p_expr_number(p):
    """expr : NUMBER"""
    p[0] = p[1]

# Build the parser
parser = yacc.yacc()
```

`precedence` defines the associativity and precedence of `tokens`. The first element of each tuple gives the associativity, where `left` means left-associative. `tokens` on the same line share the same precedence, and precedence increases from top to bottom, so in this example `TIMES` and `DIVIDE` have higher precedence than `PLUS` and `MINUS`.

Each grammar rule is defined as a Python method: the method's docstring holds the corresponding context-free grammar, and its body implements the rule's semantic action. Every method takes a parameter `p`, a sequence containing the symbols of the currently matched rule. The correspondence between `p[i]` and the grammar symbols is as follows:

```python
def p_expr_paren(p):
    """expr : LPAREN expr RPAREN"""
    #  ^      ^      ^    ^
    # p[0]   p[1]   p[2] p[3]
    p[0] = p[2]
```

PLY accesses the stack with the notation `p[position]`: `p[0]` corresponds to `$$` above, `p[1]` to `$1`, `p[2]` to `$2`, and so on. Here the action pops the top three elements off the stack, assigns the value of `p[2]` to `p[0]`, and pushes it back onto the stack.

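Combining this parser with the earlier lexer makes the precedence table easy to verify. A minimal sketch, assuming `calclex` also exposes the `lexer` object built in the lexical-rules example:

```python
from calclex import lexer  # assumed export from the module that provides `tokens`

result = parser.parse("3 + 4 * 2", lexer=lexer)
print(result)  # 11 -- TIMES binds tighter than PLUS, so 4 * 2 reduces first
```
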
## Implementation of the `parser` in `sql-lifecycle-management`

There are three SQL parsers in `sql-lifecycle-management`, located in the [mysql_parser](https://github.com/oceanbase/sql-lifecycle-management/tree/main/src/parser/mysql_parser), [oceanbase_parser](https://github.com/oceanbase/sql-lifecycle-management/tree/main/src/parser/oceanbase_parser), and [odps_parser](https://github.com/oceanbase/sql-lifecycle-management/tree/main/src/parser/odps_parser) folders. Each folder contains three files: `lexer.py`, `reserved.py`, and `parser.py`.

`lexer.py` and `reserved.py` are both used for lexical analysis. `reserved.py` defines the SQL keywords, stored in two variables: `reserved` holds all the keywords that cannot be used as column names, table names, or aliases in SQL, while `nonreserved` holds the keywords that can.

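As a shape illustration only (these are not the actual contents of `reserved.py`, whose tuples are far longer):

```python
# Illustrative excerpt -- the real tuples in reserved.py are much longer.
reserved = (
    'ALL', 'AND', 'AS', 'BY', 'DELETE', 'FROM',
    'GROUP', 'ORDER', 'SELECT', 'WHERE',
)  # keywords that can never serve as column/table names or aliases

nonreserved = (
    'BEGIN', 'CAST', 'COMMENT', 'STATUS',
)  # keywords that may also appear as column/table names or aliases
```
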
`lexer.py` has two sections. The `tokens` variable defines all the tokens the parser can use; the SQL keywords from `reserved.py` are imported here and converted into tokens that the parser can recognize.

```python
tokens = (
    [
        'IDENTIFIER',
        'DIGIT_IDENTIFIER',
        ...
    ]
    + list(reserved)
    + list(nonreserved)
)
```

The rest of the file defines which `token` each piece of user input is converted into.

```python
...
t_BIT_MOVE_LEFT = r'<<'
t_BIT_MOVE_RIGHT = r'>>'
t_EXCLA_MARK = r'!'

def t_DOUBLE(t):
    r"[0-9]*\.[0-9]+([eE][-+]?[0-9]+)?|[-+]?[0-9]+([eE][-+]?[0-9]+)"
    if 'e' in t.value or 'E' in t.value or '.' in t.value:
        t.type = "FRACTION"
    else:
        t.type = "NUMBER"
    return t
...
```

As described earlier, simple tokens are defined directly by regular expressions prefixed with `t_`, and matching input is converted into the corresponding token. Complex tokens are defined as methods: in `t_DOUBLE`, for example, the matched value is examined further, and the token type is set to `FRACTION` if it is a decimal number and to `NUMBER` otherwise.

`parser.py` is likewise divided into two sections: `precedence` defines the precedence and associativity of tokens, and the rest of the file defines the grammar rules and their associated actions.

```python
precedence = (
    ('right', 'ASSIGNMENTEQ'),
    ('left', 'PIPES', 'OR'),
    ('left', 'XOR'),
    ('left', 'AND', 'ANDAND'),
    ('right', 'NOT'),
    ...
    ('left', 'EXCLA_MARK'),
    ('left', 'LPAREN'),
    ('right', 'RPAREN'),
)
```

`right` and `left` in each tuple indicate whether the `token` is right- or left-associative, and precedence runs from low to high: in the example above, `RPAREN` has the highest precedence and `ASSIGNMENTEQ` the lowest.

SQL grammar is quite complex, and most of `parser.py` is devoted to defining its rules. The rules can be taken from the corresponding database's documentation; for MySQL, see the [SQL Statements](https://dev.mysql.com/doc/refman/8.0/en/sql-statements.html) chapter of the reference manual. The grammar rule for MySQL's `DELETE` statement is defined as follows:

```python
def p_delete(p):
    r"""delete : DELETE FROM relations where_opt order_by_opt limit_opt
               | DELETE FROM relations partition where_opt order_by_opt limit_opt
               | DELETE table_name_list FROM relations where_opt order_by_opt limit_opt
               | DELETE table_name_list FROM relations partition where_opt order_by_opt limit_opt
               | DELETE FROM table_name_list USING relations where_opt order_by_opt limit_opt
               | DELETE FROM table_name_list USING relations partition where_opt order_by_opt limit_opt
    """
    length = len(p)
    p_limit = p[length - 1]
    if p_limit is not None:
        offset, limit = int(p_limit[0]), int(p_limit[1])
    else:
        offset, limit = 0, 0
    if p.slice[3].type == "relations":
        tables, table_refs = p[3], None
    elif p.slice[2].type == "table_name_list":
        tables, table_refs = p[4], p[2]
    else:
        tables, table_refs = p[3], p[5]
    p[0] = Delete(table=tables, table_refs=table_refs, where=p[length - 3],
                  order_by=p[length - 2], limit=limit, offset=offset)
```

The docstring of `p_delete` is the grammar rule for the `DELETE` statement. When the input matches the rule, the method body runs and constructs the `AST` node for the `DELETE` statement.

Once the grammar rules are written, they can be used to parse SQL statements. Taking `mysql_parser` as an example:

```python
from src.parser.mysql_parser.parser import parser as mysql_parser
from src.parser.mysql_parser.lexer import lexer as mysql_lexer

sql = "DELETE FROM t WHERE a=1"
result = mysql_parser.parse(sql, lexer=mysql_lexer)
```

The result is the abstract syntax tree for the SQL statement, depicted in the following diagram.

![DELETE AST](./pictures/DELETE%20语法树.png)
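
Where the rendered diagram is unavailable, the structure of `result` can be read off the `p_delete` action above. A rough sketch, assuming the `Delete` node exposes its constructor arguments as attributes:

```python
print(type(result).__name__)        # Delete
print(result.where)                 # AST of the `a=1` comparison
print(result.limit, result.offset)  # 0 0 -- no LIMIT clause in the input
```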

pyproject.toml (+1 −1)

@@ -4,7 +4,7 @@ build-backend = "flit_core.buildapi"
 
 [project]
 name = "sqlgpt-parser"
-version = "0.0.1a"
+version = "0.0.1a1"
 authors = [
   { name="luliwjc", email="[email protected]" },
   { name="Ifffff", email="[email protected]" },
