-
Dear Gertjan van den Burg, after looking through the CleverCSV code, I have some points that I do not quite understand and hope you can help answer. In pattern_score() in detect_pattern.py, I understand the meanings of k, K, Nk, and Lk, but I do not understand the formula for calculating P. For example, what is the reason for using (Lk - 1) / Lk? And why do we sum the score over all patterns and then divide by K?
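To make the question concrete, here is a minimal sketch of the formula as the question describes it: P = (1/K) * sum over k of Nk * (Lk - 1) / Lk. It is not a copy of CleverCSV's code, and the real pattern_score() may differ in details. The function name `pattern_score_sketch`, the `delimiter_marker` parameter, and the idea that each row arrives as an abstracted pattern string with "D" marking a delimiter are all assumptions made for illustration.

```python
from collections import Counter


def pattern_score_sketch(row_patterns, delimiter_marker="D"):
    """Hypothetical helper: P = (1/K) * sum_k Nk * (Lk - 1) / Lk.

    row_patterns: one abstracted pattern string per row, e.g. "CDC" for
    a row with two cells (assumed representation, not CleverCSV's API).
    """
    counts = Counter(row_patterns)  # pattern -> Nk (rows with that pattern)
    K = len(counts)                 # number of distinct row patterns
    if K == 0:
        return 0.0

    total = 0.0
    for pattern, Nk in counts.items():
        # Lk: number of cells in the pattern (delimiters + 1)
        Lk = len(pattern.split(delimiter_marker))
        # (Lk - 1) / Lk is 0 for a single-cell row and approaches 1
        # as the number of cells grows.
        total += Nk * (Lk - 1) / Lk

    # Dividing by K lowers the score when rows disagree on their pattern.
    return total / K
```

Read this way, (Lk - 1) / Lk gives no credit to a "dialect" that never splits a row, and dividing by K favours dialects under which most rows share the same pattern. That is one possible interpretation of the terms named above, not a statement of the author's intent.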
-
I have a use case where we're currently using the Python CSV dialect, but we're retrieving dialect.lineterminator, which I notice is NOT in CleverCSV.SimpleDialect. We have a number of regularly occurring odd CSV formats that we need to process. With a couple of the recent error cases I tested CleverCSV, and it identifies the delimiter much more accurately. However, because lineterminator is not defined in SimpleDialect, we run into an exception when I sniff the sample data with CleverCSV. Is there a reason the line terminator has been left out?
-
Thank you for the response, Gertjan.
Our use case is that both the data we receive and our internal systems are composite: we run both Windows and Linux servers to generate and process the data files.
Some of our legacy code, which is where I ran into this issue, is pulling and storing the newline character from the dialect for later processing regardless of the system it’s on.
After close review I was able to determine that, at least in this case, our processing doesn’t actually use the newline, though it was being pulled from the dialect. I was able to put together some simple code to determine the newline anyway (#bandaid #hack l33t ᕙ(`▽´)ᕗ), in case it is needed somewhere else.
If it becomes more critical to our processing later, I’ll create some sample data, provide a more real-world use case, and open an actual issue.
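For anyone in the same spot, a stopgap along these lines might look like the sketch below. It is not the code referred to above and not part of CleverCSV; `detect_line_terminator` is a hypothetical helper that just counts terminator sequences in a sample. It assumes the sample was read with newline="" so the original terminators are preserved, and it does not distinguish line breaks inside quoted fields from row terminators.

```python
def detect_line_terminator(sample: str, default: str = "\n") -> str:
    """Hypothetical helper: guess the most common line terminator in a sample."""
    crlf = sample.count("\r\n")
    # Subtract CRLF occurrences so bare "\n" and bare "\r" are counted alone.
    lf = sample.count("\n") - crlf
    cr = sample.count("\r") - crlf

    counts = {"\r\n": crlf, "\n": lf, "\r": cr}
    best = max(counts, key=counts.get)
    # Fall back to the default when the sample has no line breaks at all.
    return best if counts[best] > 0 else default
```

A typical call would read the first few kilobytes with `open(path, newline="")` and pass that text in; the result can then be stored next to the sniffed dialect.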
James S
Developer
On Friday, January 31, 2025 at 2:38 PM, Gertjan van den Burg wrote (Re: [alan-turing-institute/CleverCSV] Welcome to CleverCSV Discussions! (Discussion #31)):
Thanks for your question @jas-natimark (https://github.com/jas-natimark). The line terminator isn't part of the dialect because CSV files are generally opened using newline="", as suggested in the Python documentation (https://docs.python.org/3/library/csv.html#id4). For that reason, neither the CSV parser in CleverCSV nor the one in CPython uses the line terminator for parsing. So, because it wasn't needed, it was left out of the "simple" dialect.
Can you open an issue with an example of your use case, where you need to specify the line terminator to process the file correctly? I can then see what modifications need to be made to CleverCSV to support it. Thanks!
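As an illustration of the newline="" point, the following sketch uses only the standard library csv module (not CleverCSV). The helper name `read_rows` is made up for this example. Opening the file with newline="" leaves line endings untranslated, and csv.reader then handles both Windows and Unix terminators without being told which one the file uses.

```python
import csv


def read_rows(path, encoding="utf-8"):
    """Read all rows of a CSV file without specifying a line terminator."""
    # newline="" hands the raw line endings to the csv module, which
    # recognises "\r\n" and "\n" on its own while reading.
    with open(path, newline="", encoding=encoding) as fp:
        return list(csv.reader(fp))
```

The same file contents written with "\r\n" or with "\n" should come back as identical rows, which is why the terminator is not needed for reading. It only matters when writing, or when downstream code wants to reproduce the original bytes.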
-
👋 Welcome!
We’re using Discussions as a place to connect with other members of our community. We hope that you:
build together 💪. Please keep the code of conduct in mind.