Skip to content

Implement support for duckdb_arrow_scan for ingesting data via Arrow #488

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

phillipleblanc
Copy link
Contributor

Adds support for calling the duckdb_arrow_scan when passed an FFI_ArrowArrayStream from the Arrow library that implements the https://arrow.apache.org/docs/format/CStreamInterface.html

This matches behavior that exists in the Go DuckDB library: https://github.com/marcboeker/go-duckdb/blob/c607539e645091c1e3f600f929229d8baa7166e4/arrow.go#L251

This also provides an alternative to the arrow-vtab module that is not zero-copy in all cases for Arrow -> DuckDB buffers, whereas the native C++ DuckDB code has better support for zero-copy Arrow -> DuckDB buffers.

#18)

* Implement support for `duckdb_arrow_scan` for ingesting data via Arrow

* Map the NulError
@Maxxen
Copy link
Member

Maxxen commented Apr 9, 2025

Hello! Thanks a lot for the PR! Im about to cut v1.2.2 so in order to expedite getting this merged I forked this off into #489 with formatting/clippy fixes

@Maxxen
Copy link
Member

Maxxen commented Apr 9, 2025

Actually, ASAN reports memory leaks related to the test. I've done some digging and asked around the office from people with more arrow-knowledge, and my impression is that the arrow_scan_array method is most likely partly broken duckdb-side in that it has some weird assumptions regarding ownership and doesn't always free schemas properly. This PR seems related.

I guess these functions are deprecated for a reason. I don't know exactly when we will get around to replace them with something better, but we're looking into trying to fix what we think is causing this issue in particular - and if we do I'll check against this PR again.

@phillipleblanc
Copy link
Contributor Author

Actually, ASAN reports memory leaks related to the test. I've done some digging and asked around the office from people with more arrow-knowledge, and my impression is that the arrow_scan_array method is most likely partly broken duckdb-side in that it has some weird assumptions regarding ownership and doesn't always free schemas properly. This PR seems related.

I guess these functions are deprecated for a reason. I don't know exactly when we will get around to replace them with something better, but we're looking into trying to fix what we think is causing this issue in particular - and if we do I'll check against this PR again.

Thanks for checking! I'll convert this PR to draft until the memory leak issue is fixed and/or a new C API is available that we can use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants