Feature - collect and report multiple field errors (#75)
Problem
=======
This PR implements two enhancements to schema error
reporting.
* When a parquet schema includes an invalid type, encoding or
compression, the current error does not indicate which column has the
problem
* When a parquet schema has multiple issues, the code currently fails on
the first, so fixing a schema with several errors takes one failed
attempt per error
Solution
========
Modified `schema.ts` and added tests to:
* Change error messages from the original `invalid parquet type:
UNKNOWN` to `invalid parquet type: UNKNOWN, for Column: quantity`
* Keep track of schema errors as we loop through each column in the
schema and, if there are any errors at the end, report them all as
below:
`invalid parquet type: UNKNOWN, for Column: quantity`
`invalid parquet type: UNKNOWN, for Column: value`
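
The collect-then-report approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual `schema.ts` code: the `FieldDef` shape, the allow-lists, the function names `collectFieldErrors` and `validateSchema`, and the wording of the encoding and compression messages are all assumptions made for the example. Only the `invalid parquet type: ..., for Column: ...` format comes from this PR.

```typescript
// Hypothetical, simplified field definition; the real schema.ts types differ.
interface FieldDef {
  type?: string;
  encoding?: string;
  compression?: string;
}

// Illustrative allow-lists only, not the library's full tables.
const KNOWN_TYPES = new Set(["BOOLEAN", "INT32", "INT64", "DOUBLE", "UTF8"]);
const KNOWN_ENCODINGS = new Set(["PLAIN", "RLE"]);
const KNOWN_COMPRESSIONS = new Set(["UNCOMPRESSED", "GZIP", "SNAPPY"]);

// Walk every column and record each problem instead of throwing on the first.
function collectFieldErrors(schema: Record<string, FieldDef>): string[] {
  const errors: string[] = [];
  for (const [name, def] of Object.entries(schema)) {
    const type = def.type ?? "";
    if (!KNOWN_TYPES.has(type)) {
      errors.push(`invalid parquet type: ${type}, for Column: ${name}`);
    }
    // Message wording below is assumed for illustration.
    const encoding = def.encoding ?? "PLAIN";
    if (!KNOWN_ENCODINGS.has(encoding)) {
      errors.push(`unsupported parquet encoding: ${encoding}, for Column: ${name}`);
    }
    const compression = def.compression ?? "UNCOMPRESSED";
    if (!KNOWN_COMPRESSIONS.has(compression)) {
      errors.push(`unsupported compression method: ${compression}, for Column: ${name}`);
    }
  }
  return errors;
}

// Report everything at once after the loop has finished.
function validateSchema(schema: Record<string, FieldDef>): void {
  const errors = collectFieldErrors(schema);
  if (errors.length > 0) {
    // Joining with newlines is an assumption about how the errors are combined.
    throw new Error(errors.join("\n"));
  }
}
```

Returning the list from a separate function keeps the loop easy to unit test, while the thrown error still carries one line per failing column.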
Change summary:
---------------
* adding tests and code to ensure multiple field errors are logged, as
well as indicating which column had the error
* also adding code to handle multiple encoding and compression schema
issues
Steps to Verify:
----------------
1. Download this [parquet
file](https://usaz02prismdevmlaas01.blob.core.windows.net/ml-job-config/dataSets/multiple-unsupported-columns.parquet?sv=2020-10-02&st=2023-01-09T15%3A28%3A09Z&se=2025-01-10T15%3A28%3A00Z&sr=b&sp=r&sig=GS0Skk93DCn5CnC64DbnIH2U7JhzHM2nnhq1U%2B2HwPs%3D)
2. Attempt to open this parquet file with this library: `const reader = await
parquet.ParquetReader.openFile(<path to parquet file>)`
3. You should receive errors for more than one column, each of which
includes the name of the affected column
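
To make step 3 easier to check, the open call can be wrapped so that each reported error ends up on its own line. The helper below is a hypothetical sketch: `listSchemaErrors` is not part of the library, and it assumes the combined message separates individual errors with newlines. The opener is passed in as a function so the helper does not depend on how the library is imported.

```typescript
// Hypothetical helper: runs an opener and returns one entry per reported error.
// Assumes multiple errors are separated by newlines in the thrown value.
async function listSchemaErrors(open: () => Promise<unknown>): Promise<string[]> {
  try {
    await open();
    return []; // opened without schema errors
  } catch (err) {
    // Handle both Error objects and plain thrown strings.
    const message = err instanceof Error ? err.message : String(err);
    return message.split("\n").filter((line) => line.trim().length > 0);
  }
}

// Usage with the call from step 2 (path left as a placeholder):
//   const errors = await listSchemaErrors(() =>
//     parquet.ParquetReader.openFile("<path to parquet file>"));
//   errors.forEach((e) => console.log(e));
```

With the sample file, the returned array should have more than one entry, each naming its column.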
---------
Co-authored-by: Wil Wade <[email protected]>