Unexpected read_csv result with \W+ separator - python

I have an input file I am trying to read into a pandas dataframe.
The file is space delimited, including white space before the first value.
I have tried both read_csv and read_table with a "\W+" regex as the separator.
data = pd.io.parsers.read_csv('file.txt',names=header,sep="\W+")
They read in the correct number of columns, but the values themselves are totally bogus. Has anyone else experienced this, or am I using it incorrectly?
I have also tried reading the file line by line, creating a Series from row.split(), and appending each Series to a DataFrame, but it appears to crash due to memory.
Are there any other options for creating a data frame from a file?
I am using Pandas v0.11.0, Python 2.7

The regex '\W' means "not a word character" (a "word character" being a letter, digit, or underscore; see the re docs), hence the strange results. I think you meant to use whitespace: '\s+'.
Note: read_csv offers a delim_whitespace argument (which you can set to True), but personally I prefer to use '\s+'.
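As a minimal sketch, assuming the same file.txt and header list from the question, either of these should parse the file correctly:
import pandas as pd
# '\s+' matches runs of whitespace (including the leading spaces),
# unlike '\W+', which also eats punctuation between values
data = pd.read_csv('file.txt', names=header, sep=r'\s+')
# equivalent, using the dedicated flag instead of a regex
data = pd.read_csv('file.txt', names=header, delim_whitespace=True)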

I don't know what your data looks like, so I can't reproduce your error. I created some sample data and it worked fine, but using a regex in read_csv can sometimes be troublesome. If you want to specify the separator explicitly, you could use " " instead. But I'd advise first trying Andy Hayden's suggestion of delim_whitespace=True; it works well.
You can see it in the documentation here: http://pandas.pydata.org/pandas-docs/dev/io.html

Pandas read csv skips some lines

Following up on an old question of mine, I finally identified what happens.
I have a CSV file which has the separator \t, and I am reading it with the following command:
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')
The resulting length, for example, is 800,000.
The problem is that the original file has around 1,400,000 lines. I also know where the issue occurs: one column (let's say columnA) has the following entry:
"HILFE FüR DIE Alten
Do you have any idea what is happening? When I delete that row I get the correct number of lines (length). What is Python doing here?
According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
It may be an issue with the double-quote symbol.
Try this instead:
df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')
or this:
df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
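Another option, if the stray double quote really is the culprit, is to switch off quote handling entirely, so an unbalanced quote can no longer swallow the following line breaks. A sketch, reusing the path and encoding from the question:
import csv
import pandas as pd
# QUOTE_NONE makes the parser treat '"' as an ordinary character,
# so the unbalanced quote in columnA cannot merge rows together
df = pd.read_csv(r'C:\..\file.csv', sep='\t',
                 encoding='unicode_escape', quoting=csv.QUOTE_NONE)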

Assigning multi-line raw string to variable for use in read_csv

I am trying to assign a raw file path to a variable for use in read_csv in Python. The ultimate intent is to take the file path as an input in a GUI and use it to run read_csv. The string is very long and, for the time being, I am just trying to get the string-to-variable assignment working.
I followed another thread which suggested using r'''drive:\yada\yada...'''; however, this adds an additional "\" to each step in the file path. Any suggestions for how to prevent this? Also, any suggestions on the best approach to take a file path as input in a GUI and use it with read_csv would be greatly appreciated.
Example of problem below...
In[219]: pathProject = r'''C:\Users\Account\OneDrive\
\Documents\Projects\2016\Shared\
\Project-1\Administrative\Phase-1\
\Final'''
In[220]: pathProject
Out[220]: 'C:\\Users\\Account\\OneDrive\\\n\\Documents\\Projects\\2016\\Shared\\\n\\Project-1\\Administrative\\Phase-1\\\n\\Final'
If you want to enter a long string by splitting it over many lines, you can take advantage of Python's implicit string concatenation. As you want to enter it on many lines, the parts have to be enclosed in parentheses, for example:
pathProject = (r"C:\Users\Account\OneDrive"
r"\Documents\Projects\2016\Shared"
r"\Project-1\Administrative\Phase-1"
r"\Final")
print(pathProject)
# C:\Users\Account\OneDrive\Documents\Projects\2016\Shared\Project-1\Administrative\Phase-1\Final
Note the opening and closing parentheses, and that each part of the string has to be declared as a raw string.
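Once assembled, the path can be handed straight to read_csv. A hypothetical usage, where data.csv is an assumed file name inside that folder:
import os
import pandas as pd
# os.path.join supplies the separator between folder and file name
df = pd.read_csv(os.path.join(pathProject, 'data.csv'))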

importing CSV file with values wrapped in " when some of them contains " as well as commas

I think I have searched thoroughly, but if I missed something, please let me know.
I am trying to import a CSV file where all non-numerical values are wrapped in double quotes (").
I have encountered a problem with:
df = pd.read_csv('file.csv')
Example of CSV:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company "MoscowMining" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company" Jankowski,A,B""
Because of multiple quotes and commas inside them, pandas is seeing more columns than 4 in this case (like 5 or 6).
I have already tried to play with
df = pd.read_csv('file.csv', quotechar='"', quoting=2)
But got
ParserError: Error tokenizing data (...)
What works is skipping bad lines by
error_bad_lines=False
but I'd rather have all the data taken into consideration somehow than just omit it.
Many thanks for any help!
This seems like badly formed CSV data as the '"' characters within the values should be escaped. I've often seen such values escaped by doubling them up or prefixing with a \. See https://en.wikipedia.org/wiki/Comma-separated_values#cite_ref-13
First thing I'd do is fix whatever is exporting those files. However if you cannot do that you may be able to work around the issue by escaping the " which are part of a value.
Your best bet might be to assume that a " is only followed (or preceded) by a comma or newline when it is the end of a value. Then you could use a regex something like the following (working from memory, so it may not be 100% right, but it should give you the right idea; you'll have to adapt it for whatever regex library you have handy):
s/([^,\n])"([^,\n])/$1""$2/g
So if you were to run your example file though that it would be escaped something like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company ""MoscowMining"" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company"" Jankowski,A,B"""
or using the following
s/([^,\n])"([^,\n])/$1\"$2/g
the file would be escaped something like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1, Owner2, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski,A,B\""
Depending on your CSV parser, one of those should be accepted and work as expected.
If, as #exe suggests, your CSV parser also requires the commas within values to be escaped, you can apply a similar regex to replace the commas.
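In Python terms, a sketch of that escape-then-parse idea might look like this (assuming the file.csv from the question; the doubling variant is used, since pandas accepts doubled quotes out of the box):
import io
import re
import pandas as pd

with open('file.csv', encoding='utf-8') as f:
    raw = f.read()

# double any '"' that is not adjacent to a comma or newline,
# i.e. any quote that is part of a value rather than a field boundary
fixed = re.sub(r'([^,\n])"([^,\n])', r'\1""\2', raw)
df = pd.read_csv(io.StringIO(fixed))
If you go with the backslash-escaped variant instead, pandas presumably needs to be told about the escape character as well, e.g. pd.read_csv(..., escapechar='\\').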
If I understand correctly, what you need is to escape the quotes and commas before pandas reads the CSV, like this:
"Business focus","Country","City","Company Name"
"IT","France","Lyon","Societe General"
"Mining","Russia","Moscow","Company \"MoscowMining\" Owner1\, Owner2\, Owner3"
"Agriculture","Poland","Warsaw","Company\" Jankowski\,A\,B\""

Set csv record delimiter in Python Pandas

I created a script to merge a couple of .csv files, using the pandas Python library. All files use "\n\r" as the record delimiter.
I ran into an issue with one file where, in a specific field, "\n" sometimes occurs. That causes pandas.read_csv to count it as a new row.
Is there any chance to specify the record delimiter (in addition to the field delimiter)? Or would there be a better solution to this?
Thank you and best regards
Look through all of the kwargs in pandas.read_csv
There is the lineterminator kwarg:
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
Note that it requires the use of the C parser (see engine kwarg)
Given that your lines end with \r, which is the carriage return character, I would suggest using that as the lineterminator and doing post-processing to clean up the \n's left behind.
I would think that setting the lineterminator='\r' should fix your problem.
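A minimal sketch of that approach, assuming merged.csv stands in for one of your files:
import pandas as pd
# split records on '\r' (lineterminator needs the default C parser),
# then strip the stray '\n' characters left inside the fields
df = pd.read_csv('merged.csv', lineterminator='\r')
df = df.replace(r'\n', '', regex=True)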

Removing special characters in a pandas dataframe

I have found information on how this could be done, but nothing has worked for me. I am trying to replace the special character 'ð'. I imported my data from a CSV file and used encoding='latin1', or else I kept getting errors. However, a simple DF['Column'].str.replace('ð', '') will not do the trick. I also tried decoding and using the hex value for that character, which was recommended in another post, but that still won't work for me. Help is very much appreciated, and I am willing to post code if necessary.
Call str.encode followed by str.decode:
df.YourCol.str.encode('utf-8').str.decode('ascii', 'ignore')
If you want to do this for multiple columns, you can slice and call df.applymap:
df[col_list].applymap(lambda x: x.encode('utf-8').decode('ascii', 'ignore'))
Remember that these operations are not in-place. So, you'll have to assign those columns back to their rightful place.
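A small round-trip illustration with made-up data containing the stray character:
import pandas as pd

df = pd.DataFrame({'YourCol': ['helloð', 'worðld']})
# 'ð' becomes two non-ASCII bytes in UTF-8, which the ASCII decode drops
cleaned = df.YourCol.str.encode('utf-8').str.decode('ascii', 'ignore')
print(cleaned.tolist())  # ['hello', 'world']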
