Set csv record delimiter in Python Pandas - python

I created a script to merge a couple of .csv files using the pandas Python library. All files use "\n\r" as the record delimiter.
I ran into an issue with one file where a "\n" sometimes occurs inside a specific field. That causes pandas.read_csv to count it as a new row.
Is there any way to specify the record delimiter (in addition to the field delimiter)? Or is there a better solution to this?
Thank you and best regards

Look through all of the kwargs in pandas.read_csv. There is the lineterminator kwarg:
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
Note that it requires the use of the C parser (see engine kwarg)
Given that your lines end with \r (the carriage return character), I would suggest using that as the lineterminator and doing post-processing to clean up the \n's left behind.
I would think that setting lineterminator='\r' should fix your problem.
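To make that concrete, here is a sketch using inline sample data in the question's "\n\r" convention (your actual files and column names will differ): read with lineterminator='\r', then strip the stray "\n" characters afterwards.

```python
import io
import pandas as pd

# Sample data in the question's "\n\r" convention: records end with "\r",
# and a stray "\n" shows up inside one field.
data = 'a,b\n\rone,two\n\rthree with\nnewline,four\n\r'

# lineterminator requires the C parser and must be a single character.
df = pd.read_csv(io.StringIO(data), lineterminator='\r')

# Post-processing: the "\n" characters are now embedded in the column
# names and string values, so strip/replace them.
df.columns = df.columns.str.replace('\n', '', regex=False)
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.replace('\n', ' ', regex=False).str.strip()
```

After the cleanup, the embedded "\n" survives as an ordinary space inside the field instead of starting a bogus row.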

Related

Read CSV with field having multiple quotes and commas

I'm aware this is a much discussed topic and even though there are similar questions I haven't found one that covers my particular case.
I have a csv file that is as follows:
alarm_id,alarm_incident_id,alarm_sitename,alarm_additionalinfo,alarm_summary
"XXXXXXX","XXXXXXXXX","XXXXX|4G_Availability_Issues","TTN-XXXX","XXXXXXX;[{"severity":"CRITICAL","formula":"${XXXXX} < 85"}];[{"name":"XXXXX","value":"0","updateTimestamp":"Oct 27, 2021, 2:00:00 PM"}];[{"coName":{"XXXX/XXX":"MRBTS-XXXX","LNCEL":"XXXXXX","LNBTS":"XXXXXXX"}}]||"
It has more lines, but this is the trouble line. If you notice, the fifth field contains several quotes and commas, and the comma is also the separator. The quotes are also single quotes instead of the doubled quotes that would normally signal a quote character to be kept in the field. This splits the last field into several when reading with pandas.read_csv(), which raises an error about extra fields. I've tried several configurations and parameters regarding quoting in pandas.read_csv(), but none works...
The csv is badly formatted, I just wanted to know if there is a way to still read it, even if using a roundabout way or it really is just hopeless.
Edit: This can happen in more than one column, and I never know in which column(s) it may happen.
Thank you for your help.
I think I've got what you're looking for, or at least I hope so.
You can read the file as regular text, creating a list of the lines in the csv file.
Then iterate through the lines and split each one on its first four commas, giving five parts for the five columns in the csv.
with open("test.csv", "r") as f:
    lines = f.readlines()

for item in lines:
    new_ls = item.strip().split(",", 4)
    for new_item in new_ls:
        print(new_item)
Now you can iterate through each line's column items and do whatever you have/want to do.
If all your lines' fields are consistently enclosed in quotes, you can try to split the line on "," and remove the initial and terminating quotes. The current line is correctly separated with:
row = line.strip('"').split('","', 4)
But because of the incorrect formatting of your initial file, you will have to manually check that this matches all the lines...
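Sketched on a shortened, made-up line in the same shape as yours (the real line is longer, but the mechanics are the same):

```python
# Hypothetical line in the same malformed format: five quoted fields, with
# raw quotes embedded in the fifth one.
line = '"XX1","XX2","site|4G","TTN-X","sum;[{"severity":"CRITICAL"}]||"\n'

# Strip the outer quotes and split on the first four occurrences of '","',
# which are the only separators we can trust here. maxsplit=4 keeps any
# later '","' sequences inside the fifth field intact.
row = line.rstrip('\n').strip('"').split('","', 4)
```

This relies on the assumption that only the last field is corrupted; if the embedded quotes/commas can appear in earlier columns (as your edit says), the trusted split points move and this breaks.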
Can't post a comment, so just making a post:
One option is to escape the internal quotes / commas, or use a regex.
Also, pandas.read_csv has a quoting parameter where you can adjust how it reacts to quotes, which might be useful.
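As a rough sketch of the escaping idea, assuming a quote is "internal" exactly when it is not at the start/end of the line and not adjacent to a comma (which holds for this shortened sample line, but may not hold for all of your data):

```python
import csv
import io
import re

# Shortened hypothetical line with raw quotes inside the last field.
line = '"A","B;[{"k":"v"}]||"'

# Double every quote that is not at a field boundary, turning it into the
# standard CSV escape that csv/pandas understand.
fixed = re.sub(r'(?<!^)(?<!,)"(?!$)(?!,)', '""', line)

# The repaired line now parses cleanly.
row = next(csv.reader(io.StringIO(fixed)))
```

Because real data can contain commas right next to internal quotes, treat this as a heuristic to adapt, not a general fix.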

Pandas read csv skips some lines

Following an old question of mine. I finally identified what happens.
I have a csv file which has the separator \t, and I am reading it with the following command:
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')
The resulting length is, for example, 800,000.
The problem is that the original file has around 1,400,000 lines. I also know where the issue occurs: one column (let's say columnA) has the following entry:
"HILFE FüR DIE Alten
Do you have any idea what is happening? When I delete that row I get the correct number of lines (length). What is Python doing here?
According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
It may be an issue with the double-quote symbol.
Try this instead:
df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')
or this:
df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
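If the stray double quote is indeed the culprit, another thing worth trying is switching off quote handling entirely with quoting=csv.QUOTE_NONE, so an unmatched quote can no longer swallow the following lines. A sketch with inline sample data standing in for your file (for the real file you would keep your path, sep='\t' and encoding arguments):

```python
import csv
import io
import pandas as pd

# Sample tab-separated data with an unmatched double quote, as in columnA.
data = 'columnA\tcolumnB\n"HILFE FüR DIE Alten\t1\nnext row\t2\n'

# With default quoting, the opening quote makes the parser read on until a
# closing quote appears, merging lines. QUOTE_NONE treats '"' as a normal
# character, so every physical line stays one row.
df = pd.read_csv(io.StringIO(data), sep='\t', quoting=csv.QUOTE_NONE)
```

The quote character then remains in the data, so you may want to strip it afterwards.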

Python 3: how to parse a csv file where the text fields can contain embedded new line characters

When exporting Excel/LibreOffice sheets whose cells can contain newlines as CSV, the resulting file will have those newlines preserved as literal newline characters, not as something like the character string "\n".
The standard csv module in Python 3 apparently does not handle this as would be necessary. The documentation says: "Note: The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future." Well, duh.
Is there some other way to read in such csv files properly? What csv really should do is ignore any newlines within quoted text fields and only recognise newline characters outside a field, but since it does not, is there a different way to solve this short of implementing my own CSV parser?
Try using pandas with something like df = pandas.read_csv('my_data.csv'); you'll have more granular control over how the data is read in. If you're worried about formatting, you can also set the delimiter for the csv in LibreOffice to something that doesn't occur in nature, like ;;
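For what it's worth, pandas handles newlines embedded in quoted fields out of the box. A small sketch with inline data standing in for the exported file:

```python
import io
import pandas as pd

# A CSV as LibreOffice/Excel export it: the quoted cell in the first
# record contains a literal newline.
data = 'id,note\n1,"first line\nsecond line"\n2,plain\n'

# The quoted newline is kept inside the cell, not treated as a new record.
df = pd.read_csv(io.StringIO(data))
```

So reading the exported file directly with pandas.read_csv should preserve the multi-line cells without any extra parsing work.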

Unexpected read_csv result with \W+ separator

I have an input file I am trying to read into a pandas dataframe.
The file is space-delimited, including whitespace before the first value.
I have tried both read_csv and read_table with a "\W+" regex as the separator.
data = pd.io.parsers.read_csv('file.txt',names=header,sep="\W+")
They read in the correct number of columns, but the values themselves are totally bogus. Has anyone else experienced this, or am I using it incorrectly?
I have also tried to read file line by line, create a series from row.split() and append the series to a dataframe, but it appears to crash due to memory.
Are there any other options for creating a data frame from a file?
I am using Pandas v0.11.0, Python 2.7
The regex '\W' means "not a word character" (a "word character" being a letter, digit, or underscore); see the re docs. Hence the strange results. I think you meant to use whitespace, '\s+'.
Note: read_csv offers a delim_whitespace argument (which you can set to True), but personally I prefer to use '\s+'.
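A minimal sketch with made-up whitespace-delimited data (note the leading spaces, which '\s+' copes with):

```python
import io
import pandas as pd

# Space-delimited data with whitespace before the first value, as described.
data = '  1.0  2.0  3.0\n  4.0  5.0  6.0\n'
header = ['a', 'b', 'c']

# '\s+' matches runs of whitespace only. With '\W+', the decimal points
# would also count as separators, mangling the values.
df = pd.read_csv(io.StringIO(data), names=header, sep=r'\s+')
```

Passing delim_whitespace=True instead of sep=r'\s+' should give the same result.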
I don't know what your data looks like, so I can't reproduce your error. I created some sample data and it worked fine, but sometimes using regex in read_csv can be troublesome. If you want to specify the separator, use " " as a separator instead. But I'd advise first trying Andy Hayden's suggestion. It's "delim_whitespace=True". It works well.
You can see it in the documentation here: http://pandas.pydata.org/pandas-docs/dev/io.html

Python CSV module - quotes go missing

I have a CSV file that has data like this
15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE","NORTHUMBERLAND","ENG"
11,"I",3,41350101,2,2935,2,2008-01-09,1,8,0,2003-02-01,,2009-12-22,2003-02-11,377016.00,601912.00,377105.00,602354.00,10
I am reading this and then writing different rows to different CSV files.
However, in the original data there are quotes around the non-numeric fields, as some of them contain commas within the field.
I am not able to keep the quotes.
I have researched a lot and discovered quoting=csv.QUOTE_NONNUMERIC; however, this now results in a quote mark around every field, and I don't know why.
If I try one of the other quoting options, like MINIMAL, I end up with an error message about the date value, 2008-01-09, not being a float.
I have tried to create a dialect and to add the quoting on the csv reader and writer, but nothing I have tried results in an exact match to the original data.
Has anyone had this same problem and found a solution?
When writing, quoting=csv.QUOTE_NONNUMERIC keeps values unquoted as long as they're numbers, i.e. if their type is int or float (for example), which means it will write what you expect.
Your problem could be that, when reading, a csv.reader turns every row it reads into a list of strings (if you read the documentation carefully, you'll see that a reader does not perform automatic data type conversion!).
If you don't perform any kind of conversion after reading, then when you write you'll end up with everything in quotes... because everything you write is a string.
Edit: of course, date fields will be quoted, because they are not numbers, meaning you cannot get the exact expected behaviour using the standard csv.writer.
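To make that concrete, a sketch with in-memory data: the reader hands back strings, and the writer only leaves a field unquoted under QUOTE_NONNUMERIC when its type is numeric, so the numeric fields must be converted back before writing.

```python
import csv
import io

# Reading: every field comes back as a str, even the numeric-looking ones.
row = next(csv.reader(io.StringIO('15,"I",2,41301888,"BYRNESS RAW"\r\n')))
# row == ['15', 'I', '2', '41301888', 'BYRNESS RAW']

# Writing with QUOTE_NONNUMERIC after converting the numeric fields back:
converted = [int(x) if x.isdigit() else x for x in row]
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC).writerow(converted)
# buf now holds: 15,"I",2,41301888,"BYRNESS RAW"
```

Without the conversion step, every field would be a str and would come out quoted.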
Are you sure you have a problem? The behavior you're describing is correct: The csv module will enclose strings in quotes only if it's necessary for parsing them correctly. So you should expect to see quotes only around strings containing a comma, newlines, etc. Unless you're getting errors reading your output back in, there is no problem.
Trying to get an "exact match" of the original data is a difficult and potentially fruitless endeavor. quoting=csv.QUOTE_NONNUMERIC put quotes around everything because every field was a string when you read it in.
Your concern that some of the "quoted" input fields could have commas is usually not that big a deal. If you added a comma to one of your quoted fields and used the default writer, the field with the comma would be automatically quoted in the output.
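A quick sketch of that behaviour: with the default writer, only the field that actually needs quoting gets it.

```python
import csv
import io

buf = io.StringIO()
# Default quoting is QUOTE_MINIMAL: quotes appear only where required,
# here around the field that contains the separator.
csv.writer(buf).writerow(['plain', 'has,comma', 42])
# buf now holds: plain,"has,comma",42
```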
