Remove unwanted commas from CSV using Python


I need some help. I have a CSV file that contains an address field, and whoever entered the data into the original database used commas to separate the different parts of the address, for example:
Flat 5, Park Street
When I try to use the CSV file, this one entry is treated as two separate fields when it is in fact a single field. I have used Python to strip out commas that sit between inverted commas, since those are easy to distinguish from commas that should actually be there, but this problem has me stumped.
Any help would be gratefully received.
Thanks.

You can define the separating and quoting characters with Python's CSV reader. For example:
With this CSV:
1,`Flat 5, Park Street`
And this Python:
import csv

# newline='' is what the csv documentation recommends when opening files for the csv module
with open('14144315.csv', newline='') as csvfile:
    rowreader = csv.reader(csvfile, delimiter=',', quotechar='`')
    for row in rowreader:
        print(row)
You will see this output:
['1', 'Flat 5, Park Street']
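This uses commas to separate values and treats any comma between quote characters (a backtick in this example) as part of the field. Set quotechar to whichever character encloses the fields in your file.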

The CSV file was not generated properly. CSV files should have some form of escaping of text, usually using double-quotes:
1,John Doe,"City, State, Country",12345
Some CSV exports do this to all fields (this is an option when exporting from Excel/LibreOffice), but ambiguous fields (such as those including commas) must be escaped.
Either fix this manually or regenerate the CSV properly. In general this cannot be fixed programmatically, because an unquoted comma inside a field looks exactly like a separator.
Edit: I just noticed something about "inverted commas" being used for escaping - if that is the case see Jason Sperske's answer, which is spot on.
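As an illustration of what a properly regenerated file looks like, here is a minimal sketch that writes rows with Python's csv.writer and reads them back. The sample rows are made up for the example, and it assumes you still have access to the data as separate fields.

```python
import csv
import io

# Made-up sample rows; the address fields contain commas.
rows = [
    ["1", "John Doe", "City, State, Country", "12345"],
    ["2", "Jane Roe", "Flat 5, Park Street", "67890"],
]

# Write to an in-memory buffer; with a real file, use open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, commas included.
round_trip = list(csv.reader(io.StringIO(csv_text)))
print(round_trip == rows)
```

With the default settings the writer quotes only the fields that need it, so the address fields come out wrapped in double quotes while the others stay bare.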

Related

Read CSV with field having multiple quotes and commas

I'm aware this is a much discussed topic and even though there are similar questions I haven't found one that covers my particular case.
I have a csv file that is as follows:
alarm_id,alarm_incident_id,alarm_sitename,alarm_additionalinfo,alarm_summary
"XXXXXXX","XXXXXXXXX","XXXXX|4G_Availability_Issues","TTN-XXXX","XXXXXXX;[{"severity":"CRITICAL","formula":"${XXXXX} < 85"}];[{"name":"XXXXX","value":"0","updateTimestamp":"Oct 27, 2021, 2:00:00 PM"}];[{"coName":{"XXXX/XXX":"MRBTS-XXXX","LNCEL":"XXXXXX","LNBTS":"XXXXXXX"}}]||"
It has more lines, but this is the troublesome one. Notice that the fifth field contains several quotes and commas, and the comma is also the separator. The embedded quotes are also single rather than doubled (""), which is how a quote character inside a quoted field is normally escaped. As a result, pandas.read_csv() splits this last field into several and raises an error about extra fields. I have tried several configurations of the quoting parameters in pandas.read_csv(), but none of them works.
The csv is badly formatted, I just wanted to know if there is a way to still read it, even if using a roundabout way or it really is just hopeless.
Edit: This can happen to more than one column and I never know in which column(s) this may happen
Thank you for your help.

I think I've got what you're looking for, at least I hope.
You can read the file as usual, creating a list of the lines in the CSV file.
Then iterate through the lines and split each one into 5 parts, since you have 5 columns in the CSV.
with open("test.csv", "r") as f:
    lines = f.readlines()
for item in lines:
    # at most 4 splits, so each line yields at most 5 parts
    new_ls = item.strip().split(",", 4)
    for new_item in new_ls:
        print(new_item)
Now you can iterate over the column items of each line and do whatever you need to do.

If all the fields on every line are consistently enclosed in quotes, you can try splitting the line on "," (quote, comma, quote) and removing the initial and terminating quotes. The current line is correctly separated with:
row = line.strip().strip('"').split('","', 4)
But because of the incorrect formatting of your file, you will have to check manually that this works for every line.
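The splitting idea above can be wrapped in a small helper that also reports lines it cannot handle. This is only a sketch: split_quoted_line is a hypothetical name, the sample line is a shortened stand-in for the one in the question, and it assumes every field is wrapped in double quotes and that only the last field contains the "," sequence.

```python
def split_quoted_line(line, n_fields):
    """Split a line whose fields are all wrapped in double quotes.

    Hypothetical helper: only the last field may contain the '","' sequence.
    """
    text = line.rstrip("\r\n")
    if not (text.startswith('"') and text.endswith('"')):
        raise ValueError("line is not wrapped in double quotes")
    # Drop exactly one quote from each end, then split at most n_fields - 1 times.
    inner = text[1:-1]
    fields = inner.split('","', n_fields - 1)
    if len(fields) != n_fields:
        raise ValueError("expected %d fields, got %d" % (n_fields, len(fields)))
    return fields


# Shortened stand-in for the troublesome line; the last field has quotes and commas.
sample = '"a","b","x;[{"k":"v","f":"1, 2"}]||"\n'
fields = split_quoted_line(sample, 3)
print(fields)
```

Because the split is limited to n_fields - 1 separators, anything after the last expected separator stays in the final field. If the malformed field can appear in other columns, as the edit to the question says, this approach will not be enough on its own.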

Can't post a comment so just making a post:
One option is to escape the internal quotes / commas, or use a regex.
Also, pandas.read_csv has a quoting parameter where you can adjust how it reacts to quotes, which might be useful.

Python 3: how to parse a csv file where the text fields can contain embedded new line characters

When exporting excel/libreoffice sheets where the cells can contain new lines as CSV, the resulting file will have those new lines preserved as literal newline characters not something like the char string "\n".
The standard csv module in Python 3 apparently does not handle this as would be necessary. The documentation says "Note The reader is hard-coded to recognise either '\r' or '\n' as end-of-line, and ignores lineterminator. This behavior may change in the future." . Well, duh.
Is there some other way to read in such CSV files properly? What csv really should do is ignore any new lines within quoted text fields and only recognise new line characters outside a field, but since it does not, is there a different way to solve this short of implementing my own CSV parser?

Try using pandas with something like df = pandas.read_csv('my_data.csv'). You'll have more granular control over how the data is read in. If you're worried about formatting, you can also set the delimiter for the csv from libreoffice to something that doesn't occur in nature like ;;
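It may also be worth checking how the file is opened before switching libraries. The csv documentation asks for files to be opened with newline='' so that line breaks inside quoted fields reach the reader unchanged. The following is a minimal sketch with made-up data, assuming the fields that contain line breaks are enclosed in double quotes (which is how Excel and LibreOffice normally export them).

```python
import csv
import io

# Made-up data: the second row has a line break inside a quoted field.
csv_text = 'id,note\n1,"line one\nline two"\n2,plain\n'

# newline="" stops newline translation; with a real file, use open(path, newline="").
with io.StringIO(csv_text, newline="") as f:
    rows = list(csv.reader(f))

for row in rows:
    print(row)
```

Here the reader returns three rows, and the embedded line break stays inside the second field of the second row.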

How to give double quotes to column with strings that have comma's in csv

I have a CSV file with a column of strings that contain commas. When I read the file with pandas, it treats the extra commas as extra columns, which raises an error about there being more fields than expected. I thought of putting double quotes around the strings as a solution.
This is how the csv currently looks
lead,Chat.Event,Role,Data,chatid
lead,x,Lead,Hello, how are you,1
How it should look like
lead,Chat.Event,Role,Data,chatid
lead,x,Lead,"Hello, how are you",1
Is putting double quotes around the strings the best solution? If so, how do I do that? If not, what other solution would you recommend?

If you still have the original file or database from which the CSV was generated, you should export it again using a different separator (the default is a comma), one that does not occur within your strings, such as "|" (vertical bar).
Then, when reading the CSV with pandas, you can just pass the sep argument:
pd.read_csv(file_path, sep="your separator symbol here")
Hope that helps.
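If the source data cannot be exported again, the existing file can sometimes be repaired instead. The sketch below is one possible approach, not a general fix: repair_row is a hypothetical helper, and it assumes that only one column (Data, at index 3) can contain extra commas and that every other column is comma-free.

```python
import csv
import io


def repair_row(line, n_columns, text_index):
    """Re-join the pieces of the one column that may contain commas.

    Hypothetical helper: assumes all other columns never contain a comma.
    """
    parts = line.rstrip("\r\n").split(",")
    extra = len(parts) - n_columns
    if extra < 0:
        raise ValueError("too few fields: %r" % line)
    # The free-text column absorbs the surplus pieces.
    end = text_index + extra + 1
    merged = ",".join(parts[text_index:end])
    return parts[:text_index] + [merged] + parts[end:]


raw_lines = [
    "lead,Chat.Event,Role,Data,chatid",
    "lead,x,Lead,Hello, how are you,1",
]
repaired = [repair_row(line, 5, 3) for line in raw_lines]

# csv.writer adds the double quotes where a field contains a comma.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(repaired)
fixed_text = out.getvalue()
print(fixed_text)
```

The repaired text can then be read with pd.read_csv without changing the separator, since the Data column is now quoted.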

Regex out leading and trailing quotes if not contains comma

I'm at a total loss of how to do this.
My Question: I want to take this:
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
... (continue)
To this:
"A, two words with comma",B,C word without comma,D
"E, two words with comma",F,G more stuff,H no commas here!
... (continue)
I used software that created 1,900 records in a text file and I think it was supposed to be a CSV but whoever wrote the software doesn't know how CSV files work because it only needs quotes if the cell contains a comma (right?). At least I know that in Excel it puts everything in the first cell...
I would prefer this to be solvable using some sort of command line tool like perl or python (I'm on a Mac). I don't want to make a whole project in Java or anything to take care of this.
Any help is greatly appreciated!

Shot in the dark here, but I think that Excel is putting everything in the first column because it doesn't know it's being given comma-separated data.
Excel has a "text-to-columns" feature, where you'll be able to split a column by a delimiter (make sure you choose the comma).
There's more info here:
http://support.microsoft.com/kb/214261
edit
You might also try renaming the file from *.txt to *.csv. That will change the way Excel reads the file, so it better understands how to parse whatever it finds inside.

If the shell is an option, you can try this one-liner in a terminal. It strips the quotes from any quoted field that contains no comma:
sed 's/"\([^",]*\)"/\1/g' file.csv > new-file.csv

Technically, that file should be fine: the text is delimited by " and separated by ,
I don't see anything wrong with the original format at all. Any field may be quoted; only some require it. More than likely, whoever wrote the code did not want to overcomplicate the logic and simply quoted everything.

One way to clean it up is to feed the data to csv and dump it back.
import csv
from io import StringIO

bad_data = """\
"A, two words with comma","B","C word without comma","D"
"E, two words with comma","F","G more stuff","H no commas here!"
"""
buffer = StringIO()
writer = csv.writer(buffer)
# splitlines() avoids the trailing empty row that split('\n') would produce
writer.writerows(csv.reader(bad_data.splitlines()))
buffer.seek(0)
print(buffer.read())
Python's csv.writer defaults to the "excel" dialect, which uses minimal quoting, so it writes quotes only where they are needed.

Python CSV module - quotes go missing

I have a CSV file that has data like this
15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE","NORTHUMBERLAND","ENG"
11,"I",3,41350101,2,2935,2,2008-01-09,1,8,0,2003-02-01,,2009-12-22,2003-02-11,377016.00,601912.00,377105.00,602354.00,10
I am reading this and then writing different rows to different CSV files.
However, in the original data there are quotes around the non-numeric fields, as some of them contain commas within the field.
I am not able to keep the quotes.
I have researched a lot and discovered quoting=csv.QUOTE_NONNUMERIC; however, this now results in quote marks around every field and I don't know why.
If I try one of the other quoting options, such as MINIMAL, I end up with an error message about the date value, 2008-01-09, not being a float.
I have tried creating a dialect and adding the quoting to the csv reader and writer, but nothing I have tried produces an exact match to the original data.
Has anyone had this same problem and found a solution?

When writing, quoting=csv.QUOTE_NONNUMERIC leaves values unquoted as long as they are numbers, i.e. if their type is int or float (for example), which means it will write what you expect.
Your problem could be that, when reading, a csv.reader turns every row it reads into a list of strings (if you read the documentation carefully, you'll see that a reader does not perform automatic data type conversion by default).
If you don't perform any kind of conversion after reading, then when you write you'll end up with everything in quotes, because everything you write is a string.
Edit: of course, date fields will be quoted, because they are not numbers, meaning you cannot get the exact expected behaviour using the standard csv.writer.
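The conversion step described above could look roughly like this. It is a sketch only: to_number is a hypothetical helper, and the sample line is a shortened version of the first row in the question.

```python
import csv
import io


def to_number(value):
    """Return value as int or float if it parses as one, else unchanged.

    Hypothetical helper for illustration.
    """
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            pass
    return value


# Shortened version of the first sample row from the question.
source = '15,"I",2,41301888,"BYRNESS RAW","","BYRNESS VILLAGE"\n'

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
for row in csv.reader(io.StringIO(source)):
    # Numbers are written bare; everything still a string gets quoted.
    writer.writerow([to_number(field) for field in row])

result = out.getvalue()
print(result)
```

For this row the output matches the input. It will not match for every row, though: unquoted dates and unquoted empty fields remain strings and will be quoted, and converted floats may be written in a different form than they had in the source.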

Are you sure you have a problem? The behavior you're describing is correct: The csv module will enclose strings in quotes only if it's necessary for parsing them correctly. So you should expect to see quotes only around strings containing a comma, newlines, etc. Unless you're getting errors reading your output back in, there is no problem.

Trying to get an "exact match" of the original data is a difficult and potentially fruitless endeavor. quoting=csv.QUOTE_NONNUMERIC put quotes around everything because every field was a string when you read it in.
Your concern that some of the "quoted" input fields could have commas is usually not that big a deal. If you added a comma to one of your quoted fields and used the default writer, the field with the comma would be automatically quoted in the output.
