pandas read csv with extra commas and quotations in column - python

I'm reading a basic csv file where the columns are separated by commas. However, the body column is a string which may contain commas and quotations.
For example, there are some cells like "Bahamas\", The" and "Germany, West"
I have tried
text = pd.read_table("input.txt", encoding='utf-16', quotechar='"', sep=',')
and
text = pd.read_table("input.txt", encoding='utf-16', quotechar='"', delimiter=',')
but neither works.
Is there a way around this problem?
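In the sample cell "Bahamas\", The", the embedded quote looks backslash-escaped, so one thing worth trying (only a sketch, assuming that really is how the file was written) is to tell the parser about the escape character:

import pandas as pd

# Sketch: treat backslash as the escape character so \" stays inside the field.
# File name and encoding are taken from the question; adjust as needed.
text = pd.read_csv("input.txt", encoding='utf-16', sep=',', quotechar='"', escapechar='\\')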

Are you able to regenerate the csv? If yes, change the delimiter character to a pipe, i.e. |. If not, you may be forced to take the long route... because there is no way for any code to figure out which characters are delimiting/quoting and which are part of the value if you have both commas and quotes lurking inside the value.
A workaround could involve leveraging the column position where this problem occurs... i.e. first isolate the columns to the left of the troubled column, then isolate all columns to the right; whatever characters remain are your troubled column (a rough sketch of this idea follows below). Can you post a few example rows? It would be good to see a few rows that have this issue, and a few that work fine.
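Here is that positional idea as code (a sketch only: it hypothetically assumes two clean columns to the left and two to the right of the troubled one, so adjust the split counts to your file):

# Split a fixed number of times from the left, then from the right,
# so whatever is left over in the middle is the messy column.
rows = []
with open("input.txt", encoding="utf-16") as f:
    for line in f:
        line = line.rstrip("\n")
        left = line.split(",", 2)      # ['col1', 'col2', 'rest of line']
        rest = left.pop()              # keep only the two clean left columns
        right = rest.rsplit(",", 2)    # ['troubled column', 'col4', 'col5']
        rows.append(left + right)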

Related

Read CSV with field having multiple quotes and commas

I'm aware this is a much-discussed topic, and even though there are similar questions, I haven't found one that covers my particular case.
I have a csv file that is as follows:
alarm_id,alarm_incident_id,alarm_sitename,alarm_additionalinfo,alarm_summary
"XXXXXXX","XXXXXXXXX","XXXXX|4G_Availability_Issues","TTN-XXXX","XXXXXXX;[{"severity":"CRITICAL","formula":"${XXXXX} < 85"}];[{"name":"XXXXX","value":"0","updateTimestamp":"Oct 27, 2021, 2:00:00 PM"}];[{"coName":{"XXXX/XXX":"MRBTS-XXXX","LNCEL":"XXXXXX","LNBTS":"XXXXXXX"}}]||"
It has more lines, but this is the troublesome one. If you notice, the fifth field contains several quotes and commas, and the comma is also the separator. The quotes are also single rather than doubled, which is what is normally used to signal a quote character that should be kept in the field. This splits the last field into several fields when reading with the pandas.read_csv() method, which throws an error about extra fields. I've tried several quoting-related configurations and parameters in pandas.read_csv(), but none works...
The csv is badly formatted, I just wanted to know if there is a way to still read it, even if using a roundabout way or it really is just hopeless.
Edit: This can happen to more than one column and I never know in which column(s) this may happen
Thank you for your help.
I think I've got what you're looking for, at least I hope.
You can read the file as a regular text file, creating a list of the lines in the csv file.
Then iterate through the lines and split each line at the first 4 commas, since the csv has 5 columns; everything after the fourth comma stays together in the last field.
with open("test.csv", "r") as f:
    lines = f.readlines()

for item in lines:
    new_ls = item.strip().split(",", 4)
    for new_item in new_ls:
        print(new_item)
Now you can iterate through each line's column items and do whatever you have/want to do.
If all your lines' fields are consistently enclosed in quotes, you can try to split the line on the "," sequence and remove the initial and terminating quote. The sample line is correctly separated with:
row = line.strip('"').split('","', 4)
But because of the incorrect formatting of your initial file, you will have to manually check that this works for all the lines...
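Applied to the whole file, that could look like this (a sketch only: the file name is a placeholder, and it assumes every data line really is quoted the way the sample line is, with 5 columns):

import pandas as pd

parsed = []
with open("alarms.csv") as f:
    header = next(f).strip().split(",")   # the header line has no quotes
    for line in f:
        # drop the outer quotes, then split only on the quote-comma-quote boundaries
        parsed.append(line.strip().strip('"').split('","', 4))

df = pd.DataFrame(parsed, columns=header)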
Can't post a comment so just making a post:
One option is to escape the internal quotes / commas, or use a regex.
Also, pandas.read_csv has a quoting parameter where you can adjust how it reacts to quotes, which might be useful.
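For reference, a minimal sketch of that quoting parameter (it controls how quote characters are interpreted; with a file as irregular as this one it may not be enough on its own, and the file name is again a placeholder):

import csv
import pandas as pd

# csv.QUOTE_NONE makes pandas treat quote characters as ordinary text
# rather than as field boundaries.
df = pd.read_csv("alarms.csv", quoting=csv.QUOTE_NONE)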

Pandas read_csv not ignoring commas inside quoted string

I have an exported csv dataset which allows html text from users, and I need to turn it into a DataFrame.
The columns with possible extra commas are quoted with ", but the parser is using the commas inside them as separators.
This is the code I'm using, and I've already tried solutions from a github issue and another post here.
pd.read_csv(filePath,sep=',', quotechar='"', error_bad_lines=False)
results in
Here is the csv file itself, with the columns and first entry.
I don't know what the issue is; quotechar was supposed to work. Maybe it's the extra " inside the quoted string?
Here's the issue you're running into:
You set quote (") as your quotechar. Unfortunately, you also have quotes in your text:
<a href ="....">
And so... after that anchor tag, the next few commas are NOT considered inside quotes. Your best bet is probably to remake the original csv file with something else as quotechar (that doesn't appear at all in your text).
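If regenerating the file is an option, that could look something like this (a sketch, assuming the original data is still available as a DataFrame and that the chosen quote character, here '|' purely as an example, never occurs in the text):

import csv
import pandas as pd

# Re-export with an unused character as the quote character, then read it back
# with the same setting so the embedded " and , no longer confuse the parser.
df.to_csv("fixed.csv", quotechar="|", quoting=csv.QUOTE_ALL, index=False)
df2 = pd.read_csv("fixed.csv", quotechar="|")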

How do I export a dataframe to a csv without quotation marks?

I have an output dataframe which I want to get into a .out file (same thing as a csv without commas), but when I do, the values come back wrapped in quotation marks. For example, one cell is 2.328e+00+ j 0.0000e+00. When I look at it in the output file it looks like "2.328e+00+ j 0.0000e+00". I tried using the following code:
Output.to_csv(r'C:\Users\[DIRECTORY OMITTED]\export_dataframe.out', sep=' ', index=False, header=True, quoting=csv.QUOTE_NONE, escapechar=" ")
however the issue with this is that every space comes back doubled. Using another escape character doesn't work either, as this populates all spaces with the escape character. Formatting is extremely important, so is there any way I can export the dataframe while retaining the formatting, without the quotation marks?
Kind regards,
Tom_Ice
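One possible workaround (only a sketch, not a verified answer): skip the csv quoting machinery entirely and write the space-separated lines yourself, so no quote or escape characters are ever added. Output is the DataFrame from the question; the output path is shortened here.

# Write the header and rows manually with single spaces, no quoting, no escaping.
with open('export_dataframe.out', 'w') as f:
    f.write(' '.join(Output.columns) + '\n')
    for row in Output.itertuples(index=False):
        f.write(' '.join(str(value) for value in row) + '\n')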

How to give double quotes to a column with strings that have commas in csv

I have a csv file that has a column of strings with commas inside them. If I read the csv using pandas, it sees the extra commas as extra columns, which gives me an error about getting more fields than expected. I thought of using double quotes around the strings as a solution to the problem.
This is how the csv currently looks
lead,Chat.Event,Role,Data,chatid
lead,x,Lead,Hello, how are you,1
How it should look like
lead,Chat.Event,Role,Data,chatid
lead,x,Lead,"Hello, how are you",1
Is using double quotes around the strings the best solution? And if yes, how do I do that? And if not, what other solution can you recommend?
If you have access to the original file / database from which you generated the csv, you should generate it again using a different kind of separator (the default is a comma), one which does not occur within your strings, such as "|" (vertical bar).
Then, when reading the csv with pandas, you can just pass the argument:
pd.read_csv(file_path, sep="your separator symbol here")
Hope that helps.
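As a concrete sketch of both routes (assuming the data can be re-exported from its source; the DataFrame name df and the file names are placeholders):

import csv
import pandas as pd

# Option 1: use a separator that never occurs in the text, and read with the same symbol.
df.to_csv("chats.csv", sep="|", index=False)
chats = pd.read_csv("chats.csv", sep="|")

# Option 2: keep the comma but let the writer quote every string field,
# which produces the "Hello, how are you" style shown in the question.
df.to_csv("chats_quoted.csv", quoting=csv.QUOTE_NONNUMERIC, index=False)
chats = pd.read_csv("chats_quoted.csv")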

Pandas to_csv with escape characters and other junk causing return to next line

I have a dataframe with a column that has a bunch of manually entered text, some of which contains various escape characters.
Currently, there are a couple of lines where the output creates a new row. The ones causing the most problems are the <br/> tags in the middle and at the end of the text. I'm looking to clean the text just enough so that a new line is not created.
EDIT
Here's some examples of strings that are causing problems
Example<br/>
Example sentence (number two)\r<br/>That caused an issue
Try using converters for read_csv; adapt the example below to your needs:
import pandas as pd

def remove_br(x):
    # strip the break tags that were pushing text onto a new row
    return x.replace('<br/>', '')

convert_dict = {'col_name': remove_br}
df = pd.read_csv('file.csv', converters=convert_dict)
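The question also mentions stray \r characters; a slightly extended converter along the same lines (again just a sketch) could strip those too:

def clean_text(x):
    # drop break tags and replace carriage returns with a space
    return x.replace('<br/>', '').replace('\r', ' ')

df = pd.read_csv('file.csv', converters={'col_name': clean_text})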
