
Commit 3136f53

Fixes #1796: procedure/function to provide an URL + indexed String or json object (#4375) (#4390)
* Fixes #1796: procedure/function to provide an URL + indexed String or json object (#4375)
* Fixes #1796: procedure/function to provide an URL + indexed String or json object
* added docs, performance test compared with csv
* restored gitmodule stuff and added azure test
* fix tests
* added docs
* cleanup
* fix test
* ignored flaky test
* Update ExportExtendedSecurityTest.java
1 parent e4e7e58 commit 3136f53

14 files changed, +808 -14 lines

docs/asciidoc/modules/ROOT/nav.adoc

+1
@@ -30,6 +30,7 @@ include::partial$generated-documentation/nav.adoc[]
 ** xref::import/gexf.adoc[]
 ** xref::import/arrow.adoc[]
 ** xref::import/load-json.adoc[]
+** xref::import/load-partial.adoc[]

 * xref:export/index.adoc[]
 ** xref::export/xls.adoc[]

docs/asciidoc/modules/ROOT/pages/import/index.adoc

+1
@@ -16,3 +16,4 @@ For more information on these procedures, see:
 * xref::import/gexf.adoc[]
 * xref::import/load-json.adoc[]
 * xref::import/arrow.adoc[]
+* xref::import/load-partial.adoc[]
docs/asciidoc/modules/ROOT/pages/import/load-partial.adoc

+117
@@ -0,0 +1,117 @@
[[load-partial]]
= Load Partial
:description: This section describes a procedure that can be used to read a portion of a file (e.g. a CSV) as a string.

If we have large files, stored locally or on http(s)/S3/GCP storage, that we want to import into Neo4j,
we can use this procedure, which takes a URL together with an offset and reads a string from the file (e.g. a CSV) starting at that offset.
We can also set an optional limit as the 3rd parameter; otherwise (with 0) it reads until the end of the file.

[source]
----
apoc.load.stringPartial(urlOrBinary :: ANY?, offset :: LONG, limit :: LONG = 0, config = {} :: MAP?) :: (value :: STRING?)
----

For reading from files you'll have to enable the config option:

----
apoc.import.file.enabled=true
----

The procedure supports the following config parameters:

.Config parameters
[opts=header]
|===
| name | type | default | description
| headers | Map<String, Object> | Empty map | Additional headers to be added to, or to replace, the default ones
| archiveLimit | int | 1024*1024*10 (10MB) | Size limit used to locate ZIP entries and buffers
| bufferLimit | int | 1024*1024*10 (10MB) | Buffer read limit
| compression | Enum[NONE, GZIP, BZIP2, DEFLATE, BLOCK_LZ4, FRAMED_SNAPPY] | NONE | The compression algorithm used when reading from a byte array
|===
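
Purely as an illustration of the `headers` option (this sketch is not part of the commit, and `HeadersConfigSketch` and `openWithHeaders` are hypothetical names, not the procedure's actual implementation), the config entries could be merged with a set of default request headers along these lines before an http(s) connection is opened:

[source,java]
----
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeadersConfigSketch {

    // Merge the `headers` config entries into the defaults and apply them to the request.
    // Entries with the same name replace the default value; new names are simply added.
    static HttpURLConnection openWithHeaders(String url,
                                             Map<String, Object> defaultHeaders,
                                             Map<String, Object> configHeaders) throws IOException {
        Map<String, Object> headers = new LinkedHashMap<>(defaultHeaders);
        headers.putAll(configHeaders);

        HttpURLConnection connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
        headers.forEach((name, value) -> connection.setRequestProperty(name, String.valueOf(value)));
        return connection;
    }
}
----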

== Usage examples

We can read a portion of a string from a local file URL, a remote URL (i.e. http(s)/GCP/S3/Azure/HDFS), a local/remote file placed in an archive, or a byte array.
Compared to other load procedures it is more efficient, because access to the requested position is handled natively instead of opening a stream and reading up to that location.

That is:

- in case of a local file, under the hood a https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/io/RandomAccessFile.html[RandomAccessFile] will be created.
- in case of an http(s) URL, we will add an HTTP header `Range: bytes=<offset>`; if a limit is set, it becomes `Range: bytes=<offset>-<httpLimit>`, where `httpLimit` is equal to `offset + limit - 1` (see the sketch after this list).
- in case of an S3 location, a https://sdk.amazonaws.com/java/api/latest/software/amazon/awssdk/services/s3/model/GetObjectRequest.html#range()[GetObjectRequest.range()] will be used.
- in the other cases, we will execute an https://docs.oracle.com/javase/8/docs/api/java/io/InputStream.html#skip-long-[InputStream.skip()].

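As a concrete sketch of the offset/limit handling just described (not part of this commit; `PartialReadSketch` and its methods are hypothetical names), the `Range` header value and the equivalent local-file read could be computed as follows. When no limit is set, the sketch uses the standard open-ended HTTP form `bytes=<offset>-`, which means "read to the end of the resource".

[source,java]
----
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class PartialReadSketch {

    // http(s) case: build the Range header value; limit <= 0 means "read to the end".
    static String rangeHeader(long offset, long limit) {
        if (limit <= 0) {
            return "bytes=" + offset + "-";                    // open-ended range
        }
        return "bytes=" + offset + "-" + (offset + limit - 1); // the range end is inclusive
    }

    // Local file case: seek to the offset, then read at most `limit` bytes (or up to EOF).
    static String readLocalPartial(String path, long offset, long limit) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(offset);
            long remaining = Math.max(file.length() - offset, 0);
            int toRead = (int) (limit <= 0 ? remaining : Math.min(limit, remaining));
            byte[] buffer = new byte[toRead];
            file.readFully(buffer);
            return new String(buffer, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) {
        System.out.println(rangeHeader(17, 15)); // bytes=17-31
        System.out.println(rangeHeader(17, 0));  // bytes=17-
    }
}
----
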
If we have the following CSV file:

.test.csv
----
name,age
Selma,8
Rana,11
Selina,18
----

We can execute:

[source,cypher]
----
CALL apoc.load.stringPartial("path/to/localfile/test.csv", 17, 15)
----

.Results
[opts="header"]
|===
| value
| Rana,11
Selina,
|===

The first two lines (`name,age` and `Selma,8`) take up 17 bytes including newlines, so the read starts at `Rana,11`; the 15-byte limit then stops partway through the last row.

Or also, without a limit set:

[source,cypher]
----
CALL apoc.load.stringPartial("path/to/localfile/test.csv", 17)
----

.Results
[opts="header"]
|===
| value
| Rana,11
Selina,18
|===

We can read in the same way, with a similar result, from a remote URL, for example:

[source,cypher]
----
CALL apoc.load.stringPartial("https://raw.githubusercontent.com/neo4j-contrib/neo4j-apoc-procedures/refs/heads/dev/extended/src/test/resources/test.csv", 17)
----

We can also read from an archive file, using the syntax `<pathToArchive>!<fileToRead>`, for example:

[source,cypher]
----
CALL apoc.load.stringPartial("https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip!Data8277.csv", 17)
----

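For the archive case, here is a rough sketch (again not part of the commit; `ZipPartialReadSketch` and its methods are hypothetical, and a real implementation would also handle remote, tar, and compressed archives) of how the `<pathToArchive>!<fileToRead>` syntax can be resolved with an `InputStream.skip()`, as mentioned in the list at the top of this section:

[source,java]
----
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ZipPartialReadSketch {

    // Resolve "path/to/archive.zip!inner/file.csv", skip `offset` bytes of that entry,
    // then read at most `limit` bytes (or to the end of the entry when limit <= 0).
    static String readEntryPartial(String archiveAndEntry, long offset, long limit) throws IOException {
        String[] parts = archiveAndEntry.split("!", 2);
        try (ZipInputStream zip = new ZipInputStream(new FileInputStream(parts[0]))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (!entry.getName().equals(parts[1])) {
                    continue;
                }
                skipFully(zip, offset);
                byte[] bytes = limit <= 0 ? zip.readAllBytes() : zip.readNBytes((int) limit);
                return new String(bytes, StandardCharsets.UTF_8);
            }
        }
        throw new IOException("Entry not found: " + parts[1]);
    }

    // InputStream.skip() may skip fewer bytes than requested, so loop until done.
    static void skipFully(InputStream in, long n) throws IOException {
        while (n > 0) {
            long skipped = in.skip(n);
            if (skipped <= 0) {
                break;
            }
            n -= skipped;
        }
    }
}
----
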
Or also from a byte array, optionally setting the compression type (default 'NONE', i.e. not compressed), for example using the `apoc.util.compress` function (part of APOC Core):

[source,cypher]
----
WITH apoc.util.compress('testFooBar', {compression: 'DEFLATE'}) AS compressed
CALL apoc.load.stringPartial(compressed, 5, 17, {compression: 'DEFLATE'}) YIELD value RETURN value
----

.Results
[opts="header"]
|===
| value
| testFooBar
|===

extended-it/src/test/java/apoc/azure/LoadAzureStorageTest.java

+17-1
@@ -5,6 +5,7 @@
 import apoc.load.LoadHtml;
 import apoc.load.LoadJsonExtended;
 import apoc.load.Xml;
+import apoc.load.partial.LoadPartial;
 import apoc.load.xls.LoadXls;
 import apoc.util.TestUtil;
 import org.junit.BeforeClass;
@@ -19,10 +20,15 @@
 import static apoc.ApocConfig.apocConfig;
 import static apoc.load.LoadCsvTest.commonTestLoadCsv;
 import static apoc.load.LoadHtmlTest.testLoadHtmlWithGetLinksCommon;
+import static apoc.load.partial.LoadPartialTest.PARTIAL_CSV;
 import static apoc.load.xls.LoadXlsTest.testLoadXlsCommon;
 import static apoc.util.ExtendedITUtil.EXTENDED_RESOURCES_PATH;
 import static apoc.util.ExtendedITUtil.testLoadJsonCommon;
 import static apoc.util.ExtendedITUtil.testLoadXmlCommon;
+import static apoc.util.MapUtil.map;
+import static apoc.util.TestUtil.singleResultFirstColumn;
+import static apoc.util.s3.S3Util.putToS3AndGetUrl;
+import static org.junit.Assert.assertEquals;


 public class LoadAzureStorageTest extends AzureStorageBaseTest {
@@ -35,7 +41,7 @@ public class LoadAzureStorageTest extends AzureStorageBaseTest {
     public static void setUp() throws Exception {
         AzureStorageBaseTest.setUp();

-        TestUtil.registerProcedure(db, LoadCsv.class, LoadDirectory.class, LoadJsonExtended.class, LoadHtml.class, LoadXls.class, Xml.class);
+        TestUtil.registerProcedure(db, LoadCsv.class, LoadDirectory.class, LoadJsonExtended.class, LoadHtml.class, LoadXls.class, Xml.class, LoadPartial.class);
         apocConfig().setProperty(APOC_IMPORT_FILE_ENABLED, true);
         apocConfig().setProperty(APOC_IMPORT_FILE_USE_NEO4J_CONFIG, false);
     }
@@ -71,4 +77,14 @@ public void testLoadHtml() {
         testLoadHtmlWithGetLinksCommon(db, url);
     }

+    @Test
+    public void testLoadPartial() {
+        String url = putToAzureStorageAndGetUrl(EXTENDED_RESOURCES_PATH + "test.csv");
+        String result = singleResultFirstColumn(db, "CALL apoc.load.stringPartial($url, 17, 15)",
+                map("url", url)
+        );
+
+        assertEquals(PARTIAL_CSV, result);
+    }
+
 }

extended-it/src/test/java/apoc/gc/LoadGoogleCloudStorageTest.java

+42-3
@@ -4,6 +4,7 @@
 import apoc.load.LoadHtml;
 import apoc.load.LoadJsonExtended;
 import apoc.load.Xml;
+import apoc.load.partial.LoadPartial;
 import apoc.load.xls.LoadXls;
 import apoc.util.GoogleCloudStorageContainerExtension;
 import apoc.util.TestUtil;
@@ -26,11 +27,12 @@
 import java.util.Map;

 import static apoc.load.LoadCsvTest.assertRow;
+import static apoc.load.partial.LoadPartialTest.PARTIAL_CSV;
+import static apoc.load.partial.LoadPartialTest.PARTIAL_CSV_WITHOUT_LIMIT;
 import static apoc.util.ExtendedITUtil.testLoadJsonCommon;
 import static apoc.util.GoogleCloudStorageContainerExtension.gcsUrl;
 import static apoc.util.MapUtil.map;
-import static apoc.util.TestUtil.testCall;
-import static apoc.util.TestUtil.testResult;
+import static apoc.util.TestUtil.*;
 import static java.util.Arrays.asList;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
@@ -40,6 +42,7 @@ public class LoadGoogleCloudStorageTest {

     public static GoogleCloudStorageContainerExtension gcs = new GoogleCloudStorageContainerExtension()
             .withMountedResourceFile("test.csv", "/folder/test.csv")
+            .withMountedResourceFile("testload.zip", "/folder/testload.zip")
             .withMountedResourceFile("map.json", "/folder/map.json")
             .withMountedResourceFile("xml/books.xml", "/folder/books.xml")
             .withMountedResourceFile("load_test.xlsx", "/folder/load_test.xlsx")
@@ -53,7 +56,7 @@ public class LoadGoogleCloudStorageTest {
     @BeforeClass
     public static void setUp() throws Exception {
         gcs.start();
-        TestUtil.registerProcedure(db, LoadCsv.class, LoadJsonExtended.class, LoadHtml.class, LoadXls.class, Xml.class);
+        TestUtil.registerProcedure(db, LoadCsv.class, LoadJsonExtended.class, LoadHtml.class, LoadXls.class, Xml.class, LoadPartial.class);
     }

     @AfterClass
@@ -114,6 +117,42 @@ public void testLoadHtml() {
         });
     }

+    @Test
+    public void testLoadPartial() {
+        String url = gcsUrl(gcs, "test.csv");
+
+        Object result = singleResultFirstColumn(db,
+                "CALL apoc.load.stringPartial($url, 17, 15)",
+                map("url", url)
+        );
+
+        assertEquals(PARTIAL_CSV, result);
+    }
+
+    @Test
+    public void testLoadPartialZip() {
+        String url = gcsUrl(gcs, "testload.zip");
+
+        Object result = singleResultFirstColumn(db,
+                "CALL apoc.load.stringPartial($url, 17, 15)",
+                map("url", url + "!csv/test.csv")
+        );
+
+        assertEquals(PARTIAL_CSV, result);
+    }
+
+    @Test
+    public void testLoadPartialWithoutLimit() {
+        String url = gcsUrl(gcs, "test.csv");
+
+        Object result = singleResultFirstColumn(db,
+                "CALL apoc.load.stringPartial($url, 17)",
+                map("url", url)
+        );
+
+        assertEquals(PARTIAL_CSV_WITHOUT_LIMIT, result);
+    }
+
     static void assertXlsRow(Result r, long lineNo, Object...data) {
         Map<String, Object> row = r.next();
         Map<String, Object> map = map(data);

extended-it/src/test/java/apoc/s3/LoadS3Test.java

+73-7
@@ -1,11 +1,7 @@
 package apoc.s3;

-import apoc.load.LoadCsv;
-import apoc.load.LoadDirectory;
-import apoc.load.LoadHtml;
-import apoc.load.LoadJson;
-import apoc.load.LoadJsonExtended;
-import apoc.load.Xml;
+import apoc.load.*;
+import apoc.load.partial.LoadPartial;
 import apoc.load.xls.LoadXls;
 import apoc.util.TestUtil;
 import apoc.util.s3.S3BaseTest;
@@ -21,11 +17,18 @@
 import static apoc.ApocConfig.apocConfig;
 import static apoc.load.LoadCsvTest.commonTestLoadCsv;
 import static apoc.load.LoadHtmlTest.testLoadHtmlWithGetLinksCommon;
+import static apoc.load.partial.LoadPartialTest.PARTIAL_CSV;
+import static apoc.load.partial.LoadPartialTest.PARTIAL_CSV_WITHOUT_LIMIT;
 import static apoc.load.xls.LoadXlsTest.testLoadXlsCommon;
 import static apoc.util.ExtendedITUtil.EXTENDED_RESOURCES_PATH;
 import static apoc.util.ExtendedITUtil.testLoadJsonCommon;
 import static apoc.util.ExtendedITUtil.testLoadXmlCommon;
+import static apoc.util.MapUtil.map;
+import static apoc.util.TestUtil.singleResultFirstColumn;
+import static apoc.util.TestUtil.testResult;
 import static apoc.util.s3.S3Util.putToS3AndGetUrl;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;

 public class LoadS3Test extends S3BaseTest {

@@ -35,7 +38,7 @@ public class LoadS3Test extends S3BaseTest {

     @Before
     public void setUp() throws Exception {
-        TestUtil.registerProcedure(db, LoadCsv.class, LoadDirectory.class, LoadJsonExtended.class, LoadHtml.class, LoadXls.class, Xml.class);
+        TestUtil.registerProcedure(db, LoadCsv.class, LoadDirectory.class, LoadJsonExtended.class, LoadHtml.class, LoadXls.class, Xml.class, LoadPartial.class);
         apocConfig().setProperty(APOC_IMPORT_FILE_ENABLED, true);
         apocConfig().setProperty(APOC_IMPORT_FILE_USE_NEO4J_CONFIG, false);
     }
@@ -70,4 +73,67 @@ public void testLoadHtml() {
         testLoadHtmlWithGetLinksCommon(db, url);
     }

+    @Test
+    public void testLoadPartial() {
+        String url = putToS3AndGetUrl(s3Container, EXTENDED_RESOURCES_PATH + "test.csv");
+        String result = singleResultFirstColumn(db, "CALL apoc.load.stringPartial($url, 17, 15)",
+                map("url", url)
+        );
+
+        assertEquals(PARTIAL_CSV, result);
+    }
+
+    @Test
+    public void testLoadPartialWithoutLimit() {
+        String url = putToS3AndGetUrl(s3Container, EXTENDED_RESOURCES_PATH + "test.csv");
+        String result = singleResultFirstColumn(db, "CALL apoc.load.stringPartial($url, 17)",
+                map("url", url)
+        );
+
+        assertEquals(PARTIAL_CSV_WITHOUT_LIMIT, result);
+    }
+
+    @Test
+    public void testLoadPartialZip() {
+        String url = putToS3AndGetUrl(s3Container, EXTENDED_RESOURCES_PATH + "testload.zip");
+        String result = singleResultFirstColumn(db, "CALL apoc.load.stringPartial($url, 17, 15)",
+                map("url", url + "!csv/test.csv")
+        );
+
+        assertEquals(PARTIAL_CSV, result);
+    }
+
+
+    @Test
+    public void testLoadPartialTar() {
+        String url = putToS3AndGetUrl(s3Container, EXTENDED_RESOURCES_PATH + "testload.tar");
+        String result = singleResultFirstColumn(db, "CALL apoc.load.stringPartial($url, 17, 15)",
+                map("url", url + "!csv/test.csv")
+        );
+
+        assertEquals(PARTIAL_CSV, result);
+    }
+
+
+    @Test
+    public void testLoadPartialTarGz() {
+        String url = putToS3AndGetUrl(s3Container, EXTENDED_RESOURCES_PATH + "testload.tar.gz");
+        String result = singleResultFirstColumn(db, "CALL apoc.load.stringPartial($url, 17, 15)",
+                map("url", url + "!csv/test.csv")
+        );
+
+        assertEquals(PARTIAL_CSV, result);
+    }
+
+
+    @Test
+    public void testLoadPartialTgz() {
+        String url = putToS3AndGetUrl(s3Container, EXTENDED_RESOURCES_PATH + "testload.tgz");
+        String result = singleResultFirstColumn(db, "CALL apoc.load.stringPartial($url, 17, 15)",
+                map("url", url + "!csv/test.csv")
+        );
+
+        assertEquals(PARTIAL_CSV, result);
+    }
+
 }
