Welcome to CleverCSV Discussions! #31

GjjvdBurg · 2021-01-15T16:47:07Z

GjjvdBurg
Jan 15, 2021
Maintainer

👋 Welcome!

We’re using Discussions as a place to connect with other members of our community. We hope that you:

Ask questions you’re wondering about in the Q&A section
Share ideas about improvements to CleverCSV in the Ideas section
Tell us about use cases where CleverCSV was helpful to you in the Show & Tell section
Engage with other community members: if you know the answer to a question posted by someone else, please help them out!
Welcome others and are open-minded. Remember that this is a community we build together 💪. Please keep the code of conduct in mind.

BoboOpenSource · 2023-06-18T14:27:27Z

BoboOpenSource
Jun 18, 2023

Dear Gertjan van den Burg, after looking through the CleverCSV code, I have some points that I do not quite understand and hope you can help answer. In the detect_pattern.py pattern_score() method, I understand the meanings of k, K, Nk, and Lk but do not understand the formula for calculating P. For example, what is the reason for using (Lk - 1)/Lk? Why do we sum the score for each pattern and then divide by K?

2 replies

GjjvdBurg Jul 8, 2023
Maintainer Author

Hi @BoboOpenSource, thanks for your question. We use (Lk - 1)/Lk because it weighs longer row patterns more than shorter row patterns, while remaining bounded to [0, 1]. This way, it's a weight for the frequency of a row pattern (Nk). Summing the score for each pattern and then dividing by the number of patterns (K) is simply taking the average. So we can think about the pattern score for a dialect as the average of the "pattern weights" that it results in, where a pattern weight is a combination of both the pattern's frequency (Nk) and its length ((Lk - 1)/Lk).

Ultimately, the formula is a heuristic; you can find more details in Section 4.1 of our paper.
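
For readers following along, the averaging described in this reply can be sketched in a few lines of Python. This is a simplified illustration and not the CleverCSV implementation: the function name `pattern_score_sketch` is made up, a row pattern is reduced to the number of cells obtained by a plain split on the delimiter, and quoting and escaping are ignored.

```python
from collections import Counter


def pattern_score_sketch(rows, delimiter):
    """Average of N_k * (L_k - 1) / L_k over the K distinct row patterns.

    Assumption for illustration: a row's "pattern" is just its cell count,
    found with a naive split that ignores quotes and escape characters.
    """
    # Map each distinct pattern length L_k to its frequency N_k.
    patterns = Counter(len(row.split(delimiter)) for row in rows)
    K = len(patterns)
    if K == 0:
        return 0.0
    total = 0.0
    for L_k, N_k in patterns.items():
        # Longer patterns get a weight closer to 1; single-cell rows get 0.
        total += N_k * (L_k - 1) / L_k
    return total / K


rows = ["a,b,c", "d,e,f", "g,h,i"]
comma_score = pattern_score_sketch(rows, ",")      # one pattern, three cells per row
semicolon_score = pattern_score_sketch(rows, ";")  # one pattern, one cell per row
```

With the comma, every row has three cells, so the single pattern contributes 3 * (2/3). With the semicolon, every row is a single cell, so the weight (Lk - 1)/Lk is zero and the candidate dialect scores nothing.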

BoboOpenSource Jul 26, 2023

So that's it. I understand. Thank you very much for your answer.

jas-natimark · 2025-01-07T14:38:05Z

jas-natimark
Jan 7, 2025

I have a use case where we're currently using the Python CSV dialect, but we're retrieving the dialect.lineterminator, which I notice is NOT in CleverCSV.SimpleDialect.

We have a number of regularly occurring odd CSV formats that we need to process. For a couple of the recent error cases I tested CleverCSV, and it identifies the delimiter much more accurately. However, because lineterminator is not defined in SimpleDialect, we run into an exception when I sniff the sample data with CleverCSV.

Is there a reason the line terminator has been left out?
The obvious follow-up is: are there any plans to implement that as part of the sniffer/dialect in the future?

2 replies

ws-garcia Jan 12, 2025

Indeed, the line terminator (record delimiter) is a fundamental part of the dialect detection process. That said, Python's ability to handle files smoothly makes it a secondary need. I developed the CSVsniffer library as the culmination of a research paper, and it includes the line terminator as part of the dialect. The library uses some CleverCSV code and provides improved accuracy in dialect detection. Feel free to try it and adapt it to your needs.

GjjvdBurg Jan 31, 2025
Maintainer Author

Thanks for your question @jas-natimark. The line terminator isn't part of the dialect because generally CSV files are opened using newline="", as suggested in the Python documentation. For that reason, neither the CSV parser in CleverCSV nor the one in CPython uses the line terminator for parsing. So, because it wasn't needed, it was left out of the "simple" dialect.

Can you open an issue with an example of your use-case, where you need to specify the line terminator to process the file correctly? I can then see what modifications need to be made to CleverCSV to support it. Thanks!
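
To illustrate the point about newline="", here is a small sketch using only the standard library csv module. It assumes that io.StringIO(text, newline="") is a fair in-memory stand-in for open(path, newline=""), and the helper name `parse_rows` is made up for this example.

```python
import csv
import io


def parse_rows(text):
    # newline="" leaves line endings untranslated, as with open(path, newline="").
    # The csv reader then recognises the row boundaries on its own.
    return list(csv.reader(io.StringIO(text, newline="")))


unix_text = "a,b\n1,2\n"
windows_text = "a,b\r\n1,2\r\n"

unix_rows = parse_rows(unix_text)
windows_rows = parse_rows(windows_text)
```

Both inputs parse to the same rows, even though no line terminator was specified anywhere, which is why the attribute is not required for reading.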

jas-natimark · 2025-02-03T21:35:50Z

jas-natimark
Feb 3, 2025

Thank you for the response, Gertjan. Our use case is that both the data we receive and our internal systems are composite: we run both Windows and Linux servers to generate and process the data files. Some of our legacy code, which is where I ran into this issue, pulls and stores the newline character from the dialect for later processing, regardless of the system it’s on. After close review I was able to determine that, at least in this case, our processing doesn’t actually use the newline, though it was being pulled from the dialect. I was able to put together some simple makeshift code to determine the newline anyway (#bandaid #hack l33t ᕙ(`▽´)ᕗ), in case it is needed somewhere else. If it becomes more critical to our processing later, I’ll create some sample data, provide a more real-world use case, and open an actual issue.
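
The makeshift code mentioned above was not shared in the thread. As a rough idea of what such a check could look like, the sketch below counts the candidate terminators in a sample and returns the most frequent one. The function name `guess_line_terminator` is hypothetical, and the sample is assumed to have been read with newline="" (or decoded from bytes) so that line endings were not translated.

```python
def guess_line_terminator(sample):
    """Return the most frequent of "\\r\\n", "\\n", "\\r" in sample, or None.

    Caveat: line breaks inside quoted fields are counted too, so this is
    only a heuristic and not a proper CSV-aware detection.
    """
    crlf = sample.count("\r\n")
    # Subtract the pairs so a "\r\n" is not also counted as a lone "\n" or "\r".
    lf = sample.count("\n") - crlf
    cr = sample.count("\r") - crlf
    counts = {"\r\n": crlf, "\n": lf, "\r": cr}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


windows_guess = guess_line_terminator("a,b\r\n1,2\r\n")
unix_guess = guess_line_terminator("a,b\n1,2\n")
```

The result can be stored alongside the dialect that the sniffer returns, which keeps legacy code that expects a line terminator working without changing the dialect class itself.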

0 replies
