Following up on an old question of mine, I finally identified what happens.
I have a CSV file whose separator is \t, and I read it with the following command:
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')
The resulting length is, for example, 800,000.
The problem is that the original file has around 1,400,000 lines. I also know where the issue occurs: one column (let's say columnA) has the following entry:
"HILFE FüR DIE Alten
Do you have any idea what is happening? When I delete that row, I get the correct number of lines (length). What is Python doing here?
According to pandas documentation https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
It may be an issue with the double-quote symbol: the unbalanced " in that row opens a quoted field that the parser only closes at the next quote character, so the lines in between get swallowed into a single row.
Try this instead:
df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')
or this:
df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
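If the stray double quote really is the culprit, another option is to switch off quote handling entirely, so that an unbalanced " can no longer swallow the following rows. A minimal sketch, assuming the file contains no legitimately quoted fields:

import csv
import pandas as pd

# QUOTE_NONE makes the parser treat " as ordinary data
df = pd.read_csv(r'C:\..\file.csv', sep='\t',
                 encoding='unicode_escape', quoting=csv.QUOTE_NONE)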
I am importing a CSV file like the one below, using pandas.read_csv:
df = pd.read_csv(Input, delimiter=";")
Example of CSV file:
10;01.02.2015 16:58;01.02.2015 16:58;-0.59;0.1;-4.39;NotApplicable;0.79;0.2
11;01.02.2015 16:58;01.02.2015 16:58;-0.57;0.2;-2.87;NotApplicable;0.79;0.21
The problem is that when I later try to use these values in my code, I get this error: TypeError: can't multiply sequence by non-int of type 'float'.
The error occurs because the numbers are written with a comma (,) as the decimal separator rather than a dot (.). After manually changing the commas to dots, my program works.
I can't change the format of my input, so I have to replace the commas in my DataFrame for my code to work, and I want Python to do this instead of editing the file manually. Do you have any suggestions?
pandas.read_csv has a decimal parameter for exactly this: see the documentation.
I.e. try with:
df = pd.read_csv(Input, delimiter=";", decimal=",")
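A quick sanity check, with io.StringIO standing in for the file (the sample rows have no header, hence header=None):

import io
import pandas as pd

raw = "10;-0,59;0,1\n11;-0,57;0,2\n"
df = pd.read_csv(io.StringIO(raw), delimiter=";", decimal=",", header=None)
print(df.dtypes)  # the comma-decimal columns now parse as float64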
I think the earlier answer of passing decimal="," to pandas.read_csv is the preferred option.
However, I found it is incompatible with the Python parsing engine: e.g., when using skiprows=, read_csv will fall back to that engine, so as far as I know you can't use skiprows= and decimal= in the same read_csv call. Also, I haven't been able to actually get the decimal= option to work (probably my own fault, though).
The long way round I used to achieve the same result is with list comprehensions, .replace, and .astype. The major downside to this method is that it has to be done one column at a time:
import pandas as pd

df = pd.DataFrame({'a': ['120,00', '42,00', '18,00', '23,00'],
                   'b': ['51,23', '18,45', '28,90', '133,00']})
df['a'] = [x.replace(',', '.') for x in df['a']]
df['a'] = df['a'].astype(float)
Now, column a will have float type cells. Column b still contains strings.
Note that the .replace used here is not pandas' but rather Python's built-in version. Pandas' version requires the string to be an exact match or a regex.
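For comparison, a pandas-level equivalent is DataFrame.replace with regex=True; a sketch (it converts every column at once, and the .astype(float) step is still needed afterwards):

df = df.replace(',', '.', regex=True).astype(float)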
stallasia's answer looks like the best one.
However, if you want to change the separator when you already have a DataFrame, you could do:
df['a'] = df['a'].str.replace(',', '.').astype(float)
Thanks for the great answers. I just want to add that in my case decimal=',' alone did not work, because I had numbers with a thousands separator, like 1.450,00, which pandas did not recognize. Passing thousands='.' as well made it read the file correctly:
df = pd.read_csv(
    Input,
    delimiter=";",
    decimal=",",
    thousands="."
)
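A minimal reproduction of that case, with io.StringIO standing in for the input file and made-up column names:

import io
import pandas as pd

raw = "a;b\n1.450,00;2,5\n"
df = pd.read_csv(io.StringIO(raw), delimiter=";", decimal=",", thousands=".")
print(df)  # a = 1450.0, b = 2.5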
Here I answer the question of how to change the decimal comma to a decimal dot with Python pandas.
$ cat test.py
import pandas as pd
df = pd.read_csv("test.csv", quotechar='"', decimal=",")
df.to_csv("test2.csv", sep=',', encoding='utf-8', quotechar='"', decimal='.')
where we specify the decimal separator as a comma when reading, while the output decimal separator is a dot. So:
$ cat test.csv
header,header2
1,"2,1"
3,"4,0"
$ cat test2.csv
,header,header2
0,1,2.1
1,3,4.0
where you see that the decimal separator has changed to a dot. (The extra leading column in test2.csv is the DataFrame index, which to_csv writes by default; pass index=False to omit it.)
I created a script to merge a couple of .csv files using the pandas Python library. All files use "\n\r" as the record delimiter.
I ran into an issue with one file, where a "\n" sometimes occurs in a specific field. That causes pandas.read_csv to count it as a new row.
Is there any way to specify the record delimiter (in addition to the field delimiter)? Or would there be a better solution to this?
Thank you and best regards
Look through all of the kwargs in pandas.read_csv
There is the lineterminator kwarg:
lineterminator : str (length 1), default None
Character to break file into lines. Only valid with C parser.
Note that it requires the use of the C parser (see the engine kwarg).
Given that your lines end with \r, the carriage-return character, I would suggest using that as the lineterminator and then post-processing to clean up the \n's left behind.
I would think that setting lineterminator='\r' should fix your problem.
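A sketch of that approach (the merged file name is hypothetical, and stripping the leftover \n's is one possible cleanup):

import pandas as pd

# lineterminator is only supported by the C engine (the default)
df = pd.read_csv('merged.csv', lineterminator='\r')

# each record's leading \n (left over from the \n\r pair) now sits at
# the start of the first field; strip it from all string columns
df = df.replace('\n', '', regex=True)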
I'm trying to import csv with pandas read_csv and can't get the lines containing the following snippet to work:
"","",""BSF" code - Intermittant, see notes",""
I am able to get past it with the options error_bad_lines=False, low_memory=False, engine='c'. However, it should be possible to parse these lines correctly. I'm not good with regular expressions, so I haven't tried engine='python' with a regex sep yet. Thanks for any help.
Well, that's quite a hard one... given that all fields are quoted, you could use a regex that treats a , as a separator only when it is preceded and followed by ":
data = pd.read_csv(filename,sep=r'(?<="),(?=")',quotechar='"')
You will still end up with quotes around all fields, but you can fix this by applying
data = data.applymap(lambda s:s[1:-1])
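A self-contained demo of the idea, with io.StringIO standing in for the file (engine='python' is required for regex separators; on pandas >= 2.1, DataFrame.map replaces the deprecated applymap):

import io
import pandas as pd

raw = '"1","2, with comma","3"\n"4","5","6"\n'
data = pd.read_csv(io.StringIO(raw), sep=r'(?<="),(?=")',
                   quotechar='"', header=None, engine='python')
data = data.applymap(lambda s: s[1:-1])  # strip the surviving quotes
print(data)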
It appears that the pandas read_csv function only allows single character delimiters/separators. Is there some way to allow for a string of characters to be used like, "*|*" or "%%" instead?
Pandas now supports multi-character delimiters:

import pandas as pd

pd.read_csv(csv_file, sep=r"\*\|\*")
One solution would be to use read_table instead of read_csv. Suppose file.csv looks like this:
1*|*2*|*3*|*4*|*5
12*|*12*|*13*|*14*|*15
21*|*22*|*23*|*24*|*25
So, we could read this with:
pd.read_table('file.csv', header=None, sep=r'\*\|\*')
As Padraic Cunningham writes in the comment above, it's unclear why you want this. The Wiki entry for the CSV Spec states about delimiters:
... separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
It's unsurprising that both the csv module and pandas don't support what you're asking.
However, if you really want to do so, you're pretty much down to using Python's string manipulations. The following example shows how to turn the dataframe to a "csv" with $$ separating lines, and %% separating columns.
'$$'.join('%%'.join(str(r) for r in rec) for rec in df.to_records())
Of course, you don't have to turn it into a string like this prior to writing it into a file.
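For instance, to write it straight to a file (output name is hypothetical):

with open('out.txt', 'w') as f:
    f.write('$$'.join('%%'.join(str(r) for r in rec) for rec in df.to_records()))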
Not a Pythonic way, but definitely a programmatic one: you can use something like this:
import re

def row_reader(row, fd):
    # Split row on the multi-character delimiter fd, re-joining
    # quoted fields that contain the delimiter.
    arr = []
    in_arr = row.split(fd)
    i = 0
    while i < len(in_arr):
        # a piece that opens a quote without closing it was split apart
        if re.match('^".*', in_arr[i]) and not re.match('.*"$', in_arr[i]):
            flag = True
            buf = ''
            while flag and i < len(in_arr):
                buf += in_arr[i]
                if re.match('.*"$', in_arr[i]):
                    flag = False
                i += 1
                buf += fd if flag else ''
            arr.append(buf)
        else:
            arr.append(in_arr[i])
            i += 1
    return arr

with open(file_name, 'r') as infile:
    for row in infile:
        for field in row_reader(row, '%%'):
            print(field)
In pandas 1.1.4, when I try to use a multiple char separator, I get the message:
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
Hence, to be able to use a multi-character separator, a modern solution seems to be to add engine='python' to the read_csv arguments (in my case, I use it with sep='[ ]?;').
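For example, with the *|*-delimited file from earlier (file name assumed):

import pandas as pd

df = pd.read_csv('file.csv', sep=r'\*\|\*', header=None, engine='python')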
I have an input file I am trying to read into a pandas dataframe.
The file is space delimited, including white space before the first value.
I have tried both read_csv and read_table with a "\W+" regex as the separator.
data = pd.io.parsers.read_csv('file.txt',names=header,sep="\W+")
They read in the correct number of columns, but the values themselves are totally bogus. Has anyone else experienced this, or am I using it incorrectly?
I have also tried to read file line by line, create a series from row.split() and append the series to a dataframe, but it appears to crash due to memory.
Are there any other options for creating a data frame from a file?
I am using Pandas v0.11.0, Python 2.7
The regex '\W' means "not a word character" (a "word character" being letters, digits, and underscores); see the re docs. Hence the strange results. I think you meant to use whitespace, '\s+'.
Note: read_csv offers a delim_whitespace argument (which you can set to True), but personally I prefer to use '\s+'.
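A sketch of both variants, with the file name and header list taken from the question (note that recent pandas versions deprecate delim_whitespace in favor of sep=r'\s+'):

import pandas as pd

# whitespace regex ...
data = pd.read_csv('file.txt', names=header, sep=r'\s+')
# ... or the dedicated flag
data = pd.read_csv('file.txt', names=header, delim_whitespace=True)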
I don't know what your data looks like, so I can't reproduce your error. I created some sample data and it worked fine, but using regex in read_csv can sometimes be troublesome. If you want to specify the separator explicitly, use " " as a single-space separator instead. But I'd advise first trying Andy Hayden's suggestion of delim_whitespace=True. It works well.
You can see it in the documentation here: http://pandas.pydata.org/pandas-docs/dev/io.html