option has_header true is ignored #146

Open
@alexradzin

I tried to run a simple example with a CSV file that has a header row.

name,age
Alice,29
Bob,31

I created an external table as follows:

      context
          .sql("CREATE EXTERNAL TABLE test_table (name VARCHAR, age INT) STORED AS CSV LOCATION '/tmp/test/test.csv' OPTIONS ('has_header' 'true');")
          .thenComposeAsync(df -> df.collect(allocator))
          .join();

... and then executed the query:

      context.sql("select * from test_table").thenComposeAsync(DataFrame::show).join();

As a result, I got the following exception:

Exception in thread "main" java.util.concurrent.CompletionException: java.lang.RuntimeException: Arrow error: Parser error: Error while parsing value age for column 1 at line 0
	at java.base/java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:368)
	at java.base/java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:377)
	at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1152)
	at java.base/java.util.concurrent.CompletableFuture$Completion.exec(CompletableFuture.java:483)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: java.lang.RuntimeException: Arrow error: Parser error: Error while parsing value age for column 1 at line 0
	at org.apache.arrow.datafusion.DefaultDataFrame$RuntimeExceptionCallback.accept(DefaultDataFrame.java:127)
	at org.apache.arrow.datafusion.DefaultDataFrame$RuntimeExceptionCallback.accept(DefaultDataFrame.java:117)
	at org.apache.arrow.datafusion.DataFrames.showDataframe(Native Method)
	at org.apache.arrow.datafusion.DefaultDataFrame.show(DefaultDataFrame.java:70)
	at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1150)
	... 6 more

I have also implemented my own "show()" method:

  // Prints the schema once, then every row of every batch the reader yields.
  private static void show(ArrowReader reader) {
    // try-with-resources closes the reader even if reading fails part-way
    try (reader) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      System.out.println(root.getSchema().getFields());
      while (reader.loadNextBatch()) {
        int n = root.getFieldVectors().size();
        // per-batch column line: name:type for each vector
        System.out.println(root.getFieldVectors().stream().map(v -> v.getField().getName() + ":" + v.getField().getFieldType().getType()).collect(Collectors.joining("|")));
        int rows = root.getRowCount();
        for (int r = 0; r < rows; r++) {
          for (int i = 0; i < n; i++) {
            FieldVector vector = root.getVector(i);
            System.out.print(vector.getObject(r) + " | ");
          }
          System.out.println();
        }
      }
    } catch (IOException e) {
      logger.warn("got IO Exception", e);
    }
  }

and used it as follows:

      context
          .sql("select * from test_table")
          .thenComposeAsync(df -> df.collect(allocator))
          .thenAccept(ExampleMain::show)
          .join();

In this case the error message looks like this:

thread '<unnamed>' panicked at src/dataframe.rs:29:14:
failed to collect dataframe: ArrowError(ParseError("Error while parsing value age for column 1 at line 0"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5

Both examples work if the CSV file does not have a header or if the age column is defined as VARCHAR. In the latter case the code runs, but it reads the header as the first row of data. Using format.has_header instead of has_header does not help.
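To make the symptom concrete, here is a minimal plain-Java sketch of what seems to be happening. It is not DataFusion or Arrow code: HeaderDemo and parseAges are hypothetical names, and the line numbering is only chosen to mirror the error message above. It shows that when the header row is not skipped, the literal text "age" is handed to the integer parser as the first data value.

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderDemo {

  // Hypothetical helper: parses the second CSV column as an int,
  // optionally skipping the first line as a header row.
  static List<Integer> parseAges(List<String> lines, boolean hasHeader) {
    List<Integer> ages = new ArrayList<>();
    int start = hasHeader ? 1 : 0;
    for (int line = start; line < lines.size(); line++) {
      String value = lines.get(line).split(",")[1];
      try {
        ages.add(Integer.parseInt(value));
      } catch (NumberFormatException e) {
        // Message shaped like the one in the report, for comparison only
        throw new IllegalArgumentException(
            "Error while parsing value " + value + " for column 1 at line " + line, e);
      }
    }
    return ages;
  }

  public static void main(String[] args) {
    List<String> csv = List.of("name,age", "Alice,29", "Bob,31");

    // Header honored: only the data rows are parsed.
    System.out.println(parseAges(csv, true)); // prints [29, 31]

    // Header ignored: "age" is parsed as an int and fails on line 0.
    try {
      parseAges(csv, false);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
      // prints: Error while parsing value age for column 1 at line 0
    }
  }
}
```

The second call fails in the same way as the query above, which is consistent with the header option not reaching the CSV reader. Declaring the column as VARCHAR avoids the parse error only because "age" is a valid string.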

Note that the same scenario works correctly for me with datafusion-cli. It looks like OPTIONS ('has_header' 'true') is simply ignored when running with datafusion-java. This is strange because, as far as I can see, datafusion-java is just a thin JNI wrapper over the native DataFusion API.

I am running on Ubuntu and using Java 21 (if it matters).
