fix: Fix incorrect primitive type detection (#122)
Problem
=======
A `typeLength`, and potentially a `precision`, with value "null" causes
an incorrect primitive type detection result.
Solution
========
We should handle the null values such that when the `typeLength` or
`precision` field has the value "null", the primitive type is detected
as "INT64".
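
A minimal sketch of the idea, not the code in this commit: the types and
the functions `isSet`, `detectNaive`, and `detectNullSafe` are
hypothetical, and the "naive" rule is only an assumed illustration of how
a null `typeLength` could slip past an `undefined` check. A `precision`
field could be passed through the same `isSet` guard.

```typescript
// Hypothetical, simplified shapes for illustration; not the parquetjs types.
type PrimitiveType = "INT64" | "FIXED_LEN_BYTE_ARRAY";

interface SchemaFieldLike {
  typeLength?: number | null;
  precision?: number | null;
}

// A value counts as "set" only if it is neither null nor undefined.
function isSet(value: number | null | undefined): value is number {
  return value !== null && value !== undefined;
}

// Assumed buggy rule: null passes the `!== undefined` check, so a field
// with typeLength === null is treated as fixed-length.
function detectNaive(field: SchemaFieldLike): PrimitiveType {
  return field.typeLength !== undefined ? "FIXED_LEN_BYTE_ARRAY" : "INT64";
}

// Null-safe rule: null and undefined both mean "no typeLength".
function detectNullSafe(field: SchemaFieldLike): PrimitiveType {
  return isSet(field.typeLength) ? "FIXED_LEN_BYTE_ARRAY" : "INT64";
}

const fromFile: SchemaFieldLike = { typeLength: null, precision: null };
console.log(detectNaive(fromFile));    // FIXED_LEN_BYTE_ARRAY
console.log(detectNullSafe(fromFile)); // INT64
```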
Steps to Verify
===============
The bug reproduces when the parquet file contains a Dictionary_Page
with an INT64 field whose typeLength is null upon read. Unfortunately, I
don't have such a test file for now. My debugging was based on a piece
of privately shared data from our customer.
When the bug reproduces, the primitive type parsed from the schema
(Fixed_Length_Byte_Array) won't match the primitive type discovered from
the column data (Int64). Due to a discrepancy in how the library decodes
pages, when the data is in a Dictionary_Page, the decoding logic
will hit the check for `typeLength` and fail. For Data_Page and
Data_Page_V2, decoding ignores the schema and prefers the primitive
type inferred from the column data. However, for Dictionary_Page,
decoding uses the primitive type specified in the schema.
decodeDataPageV2
https://github.com/LibertyDSNP/parquetjs/blob/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L1104
decodeDictionaryPage
https://github.com/LibertyDSNP/parquetjs/blob/91fc71f262c699fdb5be50df2e0b18da8acf8e19/lib/reader.ts#L947
Notice that one uses "opts.type" while the other uses
"opts.column.primitiveType".
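
The mismatch can be sketched roughly as follows. This is a toy model, not
the reader code: `DecodeOptsLike`, `typeForDataPage`,
`typeForDictionaryPage`, and `checkTypeLength` are hypothetical names, and
it assumes `opts.type` carries the type inferred from the column data
while `opts.column.primitiveType` carries the type parsed from the schema.

```typescript
// Hypothetical, simplified shapes for illustration; not the parquetjs types.
type PrimitiveType = "INT64" | "FIXED_LEN_BYTE_ARRAY";

interface DecodeOptsLike {
  type: PrimitiveType; // assumed: inferred from the column data
  column: {
    primitiveType: PrimitiveType; // assumed: parsed from the schema
    typeLength?: number | null;
  };
}

// Data page path: trusts the type from the column data.
function typeForDataPage(opts: DecodeOptsLike): PrimitiveType {
  return opts.type;
}

// Dictionary page path: trusts the type from the schema.
function typeForDictionaryPage(opts: DecodeOptsLike): PrimitiveType {
  return opts.column.primitiveType;
}

// Stand-in for the typeLength check: fixed-length values need a length.
function checkTypeLength(type: PrimitiveType, opts: DecodeOptsLike): void {
  const len = opts.column.typeLength;
  if (type === "FIXED_LEN_BYTE_ARRAY" && (len === null || len === undefined)) {
    throw new Error("missing typeLength for FIXED_LEN_BYTE_ARRAY");
  }
}

// The situation described above: the schema says fixed-length,
// the column data says INT64, and typeLength is null.
const opts: DecodeOptsLike = {
  type: "INT64",
  column: { primitiveType: "FIXED_LEN_BYTE_ARRAY", typeLength: null },
};

checkTypeLength(typeForDataPage(opts), opts); // passes: INT64 needs no length

try {
  checkTypeLength(typeForDictionaryPage(opts), opts);
} catch (err) {
  console.log((err as Error).message); // missing typeLength for FIXED_LEN_BYTE_ARRAY
}
```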