Read CSV with field having multiple quotes and commas

Read CSV with field having multiple quotes and commas - python

I'm aware this is a much discussed topic and even though there are similar questions I haven't found one that covers my particular case.
I have a csv file that is as follows:
alarm_id,alarm_incident_id,alarm_sitename,alarm_additionalinfo,alarm_summary
"XXXXXXX","XXXXXXXXX","XXXXX|4G_Availability_Issues","TTN-XXXX","XXXXXXX;[{"severity":"CRITICAL","formula":"${XXXXX} < 85"}];[{"name":"XXXXX","value":"0","updateTimestamp":"Oct 27, 2021, 2:00:00 PM"}];[{"coName":{"XXXX/XXX":"MRBTS-XXXX","LNCEL":"XXXXXX","LNBTS":"XXXXXXX"}}]||"
It has more lines but this is the trouble line. If you notice, the fifth field has within it several quotes and commas, which is also the separator. The quotes are also single instead of double quotes which are normally used to signal a quote character that should be kept in the field. What this is doing is splitting this last field into several when reading with pandas.read_csv() method, which throws an error of extra fields. I've tried several configurations and parameters regarding quoting in pandas.read_csv() but none works...
The csv is badly formatted, I just wanted to know if there is a way to still read it, even if using a roundabout way or it really is just hopeless.
Edit: This can happen to more than one column and I never know in which column(s) this may happen
Thank you for your help.

I think I've got what you're looking for, at least I hope.
You can read the file as regular, creating a list of the lines in the csv file.
Then iterate through the lines variable and split it into 4 parts, since you have 4 columns in the csv.
with open("test.csv", "r") as f:
lines = f.readlines()
for item in lines:
new_ls = item.strip().split(",", 4)
for new_item in new_ls:
print(new_item)
Now you can iterate through each lines' column item and do whatever you have/want to do.

If all your lines fields are consistently enclosed in quotes, you can try to split the line on ",", and to remove the initial and terminating quote. The current line is correctly separated with:
row = line.strip('"').split('","', 4)
But because of the incorrect formatting of your initial file, you will have to manually control it matches all the lines...

Can't post a comment so just making a post:
One option is to escape the internal quotes / commas, or use a regex.
Also, pandas.read_csv has a quoting parameter where you can adjust how it reacts to quotes, which might be useful.

Related

How to save data to a file on separate items instead of one long string?

I am having trouble simply saving items into a file for later reading. When I save the file, instead of listing the items as single items, it appends the data together as one long string. According to my Google searches, this should not be appending the items.
What am I doing wrong?
Code:
with open('Ped.dta','w+') as p:
p.write(str(recnum)) # Add record number to top of file
for x in range(recnum):
p.write(dte[x]) # Write date
p.write(str(stp[x])) # Write Steps number

Since you do not show your data or your output I cannot be sure. But it seems you are trying to use the write method like the print function, but there are important differences.
Most important, write does not follow its written characters with any separator (like space by default for print) or end (like \n by default for print).
Therefore there is no space between your data and steps number or between the lines because you did not write them and Python did not add them.
So add those. Try the lines
p.write(dte[x]) # Write date
p.write(' ') # space separator
p.write(str(stp[x])) # Write Steps number
p.write('\n') # line terminator
Note that I do not know the format of your "date" that is written, so you may need to convert that to text before writing it.
Now that I have the time, I'll implement #abarnert's suggestion (in a comment) and show you how to get the advantages of the print function and still write to a file. Just use the file= parameter in Python 3, or in Python 2 after executing the statement
from __future__ import print_function
Using print you can do my four lines above in one line, since print automatically adds the space separator and newline end:
print(dte[x], str(stp[x]), file=p)
This does assume that your date datum dte[x] is to be printed as text.

Try adding a newline ('\n') character at the end of your lines as you see in docs. This should solve the problem of 'listing the items as single items', but the file you create may not be greatly structured nonetheless.
For further of your google searches you may want to check serialization, as well as json and csv formats, covered in python standard library.
You question would have befited if you gave very small example of recnum variable + original f.close() is not necessary as you have a with statement, see here at SO.

Set csv record delimiter in Python Pandas

I created a script to merge couple of .csv files, using pandas python library. All files use "\n\r" as record delimiter.
I ran into issue with one file, where in specific field, sometimes "\n" occurs. That causes for pandas.read_csv to count it as new row.
Is there any chance to (in addition to field delimiter) specify record delimiter? Or would there be any better solution to this?
Thank you and best regards

Look through all of the kwargs in pandas.read_csv
There is the lineterminator kwarg:
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
Note that it requires the use of the C parser (see engine kwarg)
Given that your lines end with \r, which is the carriage return character I would suggest using that as the lineterminator and doing post-processing to clean up the \n's left behind.
I would think that setting the lineterminator='\r' should fix your problem.

Python CSV writer, how to handle quotes in order to avoid triple quotes in output

I am working with Python's CSV module, specifically the writer. My question is how can I add double quotes to a single item in a list and have the writer write the string the same way as a print statement would?
for example:
import csv
#test "data"
test = ['item1','01','001',1]
csvOut = csv.writer(open('file.txt','a')) #'a' used for keeping past results
test[1] = '"'+test[1]+'"'
print test
#prints: ['item1', '"01"', '001', 1]
csvOut.writerow(test)
#written in the output file: item1,"""01""",001,1
#I was expecting: item1,"01",001,1
del csvOut
I tired adding a quoting=csv.QUOTE_NONE option, but that raised an error. I am guessing this is related to the many csv dialects, I was hoping to avoid digging too far into that.
In retrospect I could probably have built my initial data set smarter and perhaps avoided the need for this situation but at this point curiosity is really getting the better of me (this is a simplified example): how do you keep the written output from adding those extra quotes?

It's not actually triple-quoting, although it looks that way. Try it with another example to see:
test = ['item1', 'abc"def']
Now you'll see that it writes this:
"abc""def"
In other words, it's just wrapping quotes around your string, and escaping the literal quote characters by doubling them, because that's how default Excel-style CSV handles quote characters.
The question is, what format do you want here? Almost anything you want (within reason) is doable, but you have to pick something. Backslash-escaping quotes? Backslash-escaping everything instead of using quotes in the first place? Single quotes instead of double quotes?
For example, this looks like an answer:
csvOut = csv.writer(open('file.txt','a'), quotechar="'")
… until you have an item like Filet O'Fish and the whole thing gets single-quoted and the ' gets doubled and you have the exact same problem you were trying to avoid. If you're aiming for human readability, and ' is a lot less common in your data than ", that may actually be the right answer, but it's not a perfect answer.
And really, no answer can be perfect: you need some way to either quote or escape commas—and other things, like newlines—and the way you do that is going to add at least one more character that needs to be quote-doubled or escaped. If you know there are never any commas, newlines, etc. in your data, and there's at least one other character you know will never show up, you can get away with setting either quotechar to that other character, or escapechar to that other character and quoting=QUOTE_NONE. But the first time someone unexpectedly uses the character you were sure would never appear, your code will break, so you'd better actually be sure.

Quotes get escaped because your data could contain a comma. You probably don't want a CSV file if you don't want quotes escaped. Just join on a comma (this will break downstream if your data has a comma in it)

Regex out leading and trailing quotes if not contains comma

I'm at a total loss of how to do this.
My Question: I want to take this:
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
... (continue)
To this:
"A, two words with comma",B,C word without comma,D
"E, two words with comma",F,G more stuff,H no commas here!
... (continue)
I used software that created 1,900 records in a text file and I think it was supposed to be a CSV but whoever wrote the software doesn't know how CSV files work because it only needs quotes if the cell contains a comma (right?). At least I know that in Excel it puts everything in the first cell...
I would prefer this to be solvable using some sort of command line tool like perl or python (I'm on a Mac). I don't want to make a whole project in Java or anything to take care of this.
Any help is greatly appreciated!

Shot in the dark here, but I think that Excel is putting everything in the first column because it doesn't know it's being given comma-separated data.
Excel has a "text-to-columns" feature, where you'll be able to split a column by a delimiter (make sure you choose the comma).
There's more info here:
http://support.microsoft.com/kb/214261
edit
You might also try renaming the file from *.txt to *.csv. That will change the way Excel reads the file, so it better understands how to parse whatever it finds inside.

If just bashing is an option, you can try this one-liner in a terminal:
cat file.csv | sed 's/"\([^,]*\)"/\1/g' >> new-file.csv

That technically should be fine. It is text delimited with the " and separated via the ,
I don't see anything wrong with the first at all, any field may be quoted, only some require it. More than likely the writer of the code didn't want to over complicate the logic and quoted everything.

One way to clean it up is to feed the data to csv and dump it back.
import csv
from cStringIO import StringIO
bad_data = """\
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
"""
buffer = StringIO()
writer = csv.writer(buffer)
writer.writerows(csv.reader(bad_data.split('\n')))
buffer.seek(0)
print buffer.read()
Python's csv.writer will default to the "excel" dialect, so it will not write the commas when not necessary.

Python CSV module - quotes go missing

I have a CSV file that has data like this
15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE","NORTHUMBERLAND","ENG"
11,"I",3,41350101,2,2935,2,2008-01-09,1,8,0,2003-02-01,,2009-12-22,2003-02-11,377016.00,601912.00,377105.00,602354.00,10
I am reading this and then writing different rows to different CSV files.
However, in the original data there are quotes around the non-numeric fields, as some of them contain commas within the field.
I am not able to keep the quotes.
I have researched lots and discovered the quoting=csv.QUOTE_NONNUMERIC however this now results in a quote mark around every field and I dont know why??
If i try one of the other quoting options like MINIMAL I end up with an error message regarding the date value, 2008-01-09, not being a float.
I have tried to create a dialect, add the quoting on the csv reader and writer but nothing I have tried results in the getting an exact match to the original data.
Anyone had this same problem and found a solution.

When writing, quoting=csv.QUOTE_NONNUMERIC keeps values unquoted as long as they're numbers, ie. if their type is int or float (for example), which means it will write what you expect.
Your problem could be that, when reading, a csv.reader will turn every row it reads into a list of strings (if you read the documentation carefully enough, you'll see a reader does not perform automatic data type conversion!
If you don't perform any kind of conversion after reading, then when you write you'll end up with everything on quotes... because everything you write is a string.
Edit: of course, date fields will be quoted, because they are not numbers, meaning you cannot get the exact expected behaviour using the standard csv.writer.

Are you sure you have a problem? The behavior you're describing is correct: The csv module will enclose strings in quotes only if it's necessary for parsing them correctly. So you should expect to see quotes only around strings containing a comma, newlines, etc. Unless you're getting errors reading your output back in, there is no problem.

Trying to get an "exact match" of the original data is a difficult and potentially fruitless endeavor. quoting=csv.QUOTE_NONNUMERIC put quotes around everything because every field was a string when you read it in.
Your concern that some of the "quoted" input fields could have commas is usually not that big a deal. If you added a comma to one of your quoted fields and used the default writer, the field with the comma would be automatically quoted in the output.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.