How to read csv lines with pandas that contain " and ' inside the quoting character "? - python

I'm trying to import csv with pandas read_csv and can't get the lines containing the following snippet to work:
"","",""BSF" code - Intermittant, see notes",""
I am able to get past it with the options error_bad_lines=False, low_memory=False, engine='c'. However, it should be possible to parse these lines correctly. I'm not good with regular expressions, so I haven't tried engine='python', sep=regex yet. Thanks for any help.

Well, that's quite a hard one... given that all fields are quoted, you could use a regex so that only a , preceded and followed by " is used as the separator:
data = pd.read_csv(filename,sep=r'(?<="),(?=")',quotechar='"')
However, you will still end up with quotes around every field, but you can fix that by applying
data = data.applymap(lambda s:s[1:-1])
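Putting both steps together, here is a minimal sketch using the problematic row as inline data (the column names col1..col4 are made up for the example):
import io
import pandas as pd
raw = '"","",""BSF" code - Intermittant, see notes",""\n'
# Split only on commas that sit between two quote characters; a regex separator
# forces the Python engine and bypasses normal quote handling.
df = pd.read_csv(io.StringIO(raw), sep=r'(?<="),(?=")', engine='python',
                 header=None, names=['col1', 'col2', 'col3', 'col4'])
# Every field still carries its surrounding quotes, so strip them off.
df = df.applymap(lambda s: s[1:-1])
print(df)
The third column then comes out as "BSF" code - Intermittant, see notes, with the inner quotes intact.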

Related

Pandas read_csv not ignoring commas inside quoted string

I have an exported csv dataset which allows html text from users, and I need to turn it into a DataFrame.
The columns that may contain extra commas are quoted with ", but the parser is using the commas inside them as separators.
This is the code I'm using, and I've already tried solutions from a github issue and another post here.
pd.read_csv(filePath,sep=',', quotechar='"', error_bad_lines=False)
This results in the rows being split on the commas inside the quoted fields.
I don't know what the issue is; quotechar was supposed to work. Maybe it's the extra " inside the quoted string?
Here's the issue you're running into:
You set quote (") as your quotechar. Unfortunately, you also have quotes in your text:
<a href ="....">
And so, after that anchor tag, the next few commas are NOT considered to be inside quotes. Your best bet is probably to remake the original csv file with something else as the quotechar (a character that doesn't appear at all in your text).
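If re-exporting is an option, the read side is then straightforward; a minimal sketch, assuming the new file uses | as the quote character (both the filename and the choice of | are hypothetical):
import pandas as pd
# Hypothetical re-exported file whose fields are quoted with '|' instead of '"'
df = pd.read_csv('reexported.csv', sep=',', quotechar='|')
Alternatively, if the export can be fixed so that embedded quotes are doubled ("" instead of "), pandas' default doublequote handling will parse them without any extra options.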

Pandas read csv skips some lines

Following up on an old question of mine, I finally identified what happens.
I have a csv file which has the separator \t, and I read it with the following command:
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')
the resulting length is, for example, 800,000.
The problem is that the original file has around 1,400,000 lines, and I also know where the issue occurs: one column (let's say columnA) has the following entry:
"HILFE FüR DIE Alten
Do you have any idea what is happening? When I delete that row I get the correct number of lines (length). What is Python doing here?
According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
It may be an issue with the double-quote symbol.
Try this instead:
df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')
or this:
df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
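If the real culprit is the unbalanced " in that column, another option (not part of the answer above, just a sketch) is to switch off quote handling entirely, so a stray quote can no longer swallow the following lines:
import csv
import pandas as pd
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape',
                 quoting=csv.QUOTE_NONE)  # treat '"' as an ordinary character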

Importing a CSV file with values wrapped in " when some of them contain " as well as commas

I think I searched thoroughly, but if I missed something, please let me know.
I am trying to import a CSV file where all non-numerical values are wrapped in ".
I have encountered a problem with:
df = pd.read_csv('file.csv')
Example of CSV:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""
Because of the multiple quotes and commas inside the values, pandas sees more than 4 columns in this case (like 5 or 6).
I have already tried to play with
df = pd.read_csv(file.csv, quotechar='"', quoting=2)
But got
ParserError: Error tokenizing data (...)
What works is skipping bad lines by
error_bad_lines=False
but I'd rather have all the data somehow taken into consideration than just omit it.
Many thanks for any help!
This looks like badly formed CSV data, as the '"' characters within the values should be escaped. I've often seen such values escaped either by doubling them up or by prefixing them with a \. See https://en.wikipedia.org/wiki/Comma-separated_values#cite_ref-13
First thing I'd do is fix whatever is exporting those files. However if you cannot do that you may be able to work around the issue by escaping the " which are part of a value.
Your best bet might be to assume that a " is only followed (or preceded) by a comma or newline when it is the end of a value. Then you could apply a regex something like the following (working from memory, so it may not be 100%, but it should give you the right idea; you'll have to adapt it for whatever regex library you have handy):
s/([^,\n])"([^,\n])/$1""$2/g
So if you were to run your example file through that, it would be escaped something like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company ""MoscowMining"" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company"" Jankowski,A,B"""
or using the following
s/([^,\n])"([^,\n])/$1\"$2/g
the file would be escaped something like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski,A,B\""
Depending on your CSV parser, one of those should be accepted and work as expected.
If, as @exe suggests, your CSV parser also requires the commas within values to be escaped, you can apply a similar regex to replace the commas.
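In Python, the first substitution can be applied before handing the text to pandas; a rough sketch, using the same working-from-memory pattern as above:
import io
import re
import pandas as pd
with open('file.csv', encoding='utf-8') as fh:
    text = fh.read()
# Double every '"' that is not adjacent to a comma or newline, then let
# pandas' default handling of doubled quotes do the rest.
escaped = re.sub(r'([^,\n])"([^,\n])', r'\1""\2', text)
df = pd.read_csv(io.StringIO(escaped))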
If I understand correctly, what you need is to escape the quotes and commas before pandas reads the csv.
Like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1\, Owner2\, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski\,A\,B\""

Python writing into csv without line break \r\n

I am using Python 3 and scrapy to crawl some data. In some instances, I have 2 sentences which I would like to write to Excel as a comma-separated csv file.
How can I keep them from being split onto a new line because of the '\r\n', and instead treat each whole sentence as a single string?
The sentences are as below:
'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...',
'USBからインストールしよう...',
Without seeing a code snippet of how you are parsing the strings, it's a bit difficult to suggest how exactly you can solve your problem. Anyway, you can always use replace to remove the occurrences of \r\n from your string:
>>> string = 'abc\r\ndef'
>>> print(string)
abc
def
>>> string.replace('\r\n', ' ')
'abc def'
Since you want to write to a CSV file, I'd suggest you use a pandas DataFrame, as it makes life a whole lot easier.
import pandas as pd
string = "'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...'\r\n'USBからインストールしよう...'"
string = string.replace('\r\n', ',')  # turn the line breaks into commas
list1 = string.split(',')             # one piece of text per element
df = pd.DataFrame(list1, columns=None)
df.to_csv('file.csv', header=False, index=False)
Thanks for all the advice and possible solutions provided above. In the end, I found a way to solve it.
string="'USBについての質問です\r\n下記のサイトの通りCentOS7を1USBからインストールしよう...'\r\n'USBからインストールしよう...'"
string = string.replace('\r', '\\r').replace('\n', '\\n')
Writing this string to the csv then makes the csv show \r\n together with the rest of the text as a single string.

Unexpected read_csv result with \W+ separator

I have an input file I am trying to read into a pandas dataframe.
The file is space delimited, including white space before the first value.
I have tried both read_csv and read_table with a "\W+" regex as the separator.
data = pd.io.parsers.read_csv('file.txt',names=header,sep="\W+")
They read in the correct number of columns, but the values themselves are totally bogus. Has anyone else experienced this, or am I using it incorrectly?
I have also tried to read the file line by line, create a Series from row.split(), and append the Series to a DataFrame, but it appears to crash due to memory.
Are there any other options for creating a data frame from a file?
I am using Pandas v0.11.0, Python 2.7
The regex '\W' means "not a word character" (a "word character" being letters, digits, and underscores), see the re docs, hence the strange results. I think you meant to use whitespace '\s+'.
Note: read_csv offers a delim_whitespace argument (which you can set to True), but personally I prefer to use '\s+'.
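A minimal sketch of that fix (the column names are placeholders):
import pandas as pd
header = ['col1', 'col2', 'col3']  # placeholder column names
# Split on runs of whitespace instead of "not a word character":
data = pd.read_csv('file.txt', names=header, sep=r'\s+')
# Equivalent alternative using the dedicated flag:
data = pd.read_csv('file.txt', names=header, delim_whitespace=True)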
I don't know what your data looks like, so I can't reproduce your error. I created some sample data and it worked fine, but regex separators in read_csv can sometimes be troublesome. If you want to specify the separator explicitly, use " " instead. That said, I'd advise first trying Andy Hayden's suggestion of delim_whitespace=True; it works well.
You can see it in the documentation here: http://pandas.pydata.org/pandas-docs/dev/io.html
