It appears that the pandas read_csv function only allows single character delimiters/separators. Is there some way to allow for a string of characters to be used like, "*|*" or "%%" instead?
Pandas does now support multi-character delimiters:

import pandas as pd

# A sep longer than one character is treated as a regular expression;
# engine='python' avoids the fallback ParserWarning.
pd.read_csv(csv_file, sep=r"\*\|\*", engine="python")
The solution would be to use read_table instead of read_csv:
1*|*2*|*3*|*4*|*5
12*|*12*|*13*|*14*|*15
21*|*22*|*23*|*24*|*25
So, we could read this with:
pd.read_table('file.csv', header=None, sep=r'\*\|\*')
As Padraic Cunningham writes in the comment above, it's unclear why you want this. The Wiki entry for the CSV Spec states about delimiters:
... separated by delimiters (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
It's unsurprising that both the csv module and pandas don't support what you're asking.
However, if you really want to do so, you're pretty much down to using Python's string manipulations. The following example shows how to turn the dataframe to a "csv" with $$ separating lines, and %% separating columns.
'$$'.join('%%'.join(str(r) for r in rec) for rec in df.to_records())
Of course, you don't have to turn it into a string like this prior to writing it into a file.
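For instance, a minimal sketch of streaming the records straight to a file instead (the filename is just for illustration):

with open('records.txt', 'w') as f:
    for i, rec in enumerate(df.to_records()):
        if i:
            f.write('$$')  # line separator between records
        f.write('%%'.join(str(r) for r in rec))  # column separator within a record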
Not a Pythonic way, but definitely a programming way; you can use something like this:

import re

def row_reader(row, fd):
    # Split the row on the multi-character delimiter, then re-join the
    # pieces that belong to a single quoted field.
    arr = []
    in_arr = row.split(fd)
    i = 0
    while i < len(in_arr):
        if re.match('^".*', in_arr[i]) and not re.match('.*"$', in_arr[i]):
            # Field opens with a quote but doesn't close it: keep
            # consuming pieces until the closing quote is found.
            flag = True
            buf = ''
            while flag and i < len(in_arr):
                buf += in_arr[i]
                if re.match('.*"$', in_arr[i]):
                    flag = False
                i += 1
                buf += fd if flag else ''
            arr.append(buf)
        else:
            arr.append(in_arr[i])
            i += 1
    return arr

with open(file_name, 'r') as infile:
    for row in infile:
        for field in row_reader(row, '%%'):
            print(field)
In pandas 1.1.4, when I try to use a multiple char separator, I get the message:
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
Hence, to be able to use a multiple-char separator, a modern solution seems to be to add engine='python' to the read_csv arguments (in my case, I use it with sep='[ ]?;').
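A minimal sketch of that call (the filename is just for illustration):

import pandas as pd

# engine='python' avoids the ParserWarning for regex separators
df = pd.read_csv('file.csv', sep='[ ]?;', engine='python')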
Following up on an old question of mine, I finally identified what happens.
I have a csv file which has the separator \t, and I read it with the following command:
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')
The resulting length, for example, is 800.000.
The problem is that the original file has around 1.400.000 lines, and I also know where the issue occurs: one column (let's say columnA) has the following entry:
"HILFE FüR DIE Alten
Do you have any idea what is happening? When I delete that row, I get the correct number of lines (length). What is Python doing here?
According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
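As that passage describes, you can also let pandas detect the separator itself; a minimal sketch (the filename is illustrative):

import pandas as pd

# sep=None forces the python engine, which auto-detects the delimiter
# using csv.Sniffer
df = pd.read_csv('file.csv', sep=None, engine='python')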
It may be an issue with the double-quote symbol.
Try this instead:
df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')
or this:
df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
I am importing a CSV file like the one below, using pandas.read_csv:
df = pd.read_csv(Input, delimiter=";")
Example of CSV file:
10;01.02.2015 16:58;01.02.2015 16:58;-0.59;0.1;-4.39;NotApplicable;0.79;0.2
11;01.02.2015 16:58;01.02.2015 16:58;-0.57;0.2;-2.87;NotApplicable;0.79;0.21
The problem is that when I later on in my code try to use these values, I get this error: TypeError: can't multiply sequence by non-int of type 'float'
The error occurs because the numbers I'm trying to use are not written with a dot (.) as the decimal separator but with a comma (,). After manually changing the commas to dots, my program works.
I can't change the format of my input, and thus have to replace the commas in my DataFrame in order for my code to work, and I want Python to do this without the need to do it manually. Do you have any suggestions?
pandas.read_csv has a decimal parameter for this; see the documentation.
I.e. try with:
df = pd.read_csv(Input, delimiter=";", decimal=",")
I think the earlier mentioned answer of including decimal="," in pandas read_csv is the preferred option.
However, I found it is incompatible with the Python parsing engine, e.g. when using skiprows=, read_csv will fall back to this engine, and thus you can't use skiprows= and decimal= in the same read_csv statement as far as I know. Also, I haven't been able to actually get the decimal= statement to work (probably due to me, though).
The long way round I used to achieve the same result is with list comprehensions, .replace and .astype. The major downside to this method is that it needs to be done one column at a time:
import pandas as pd

df = pd.DataFrame({'a': ['120,00', '42,00', '18,00', '23,00'],
                   'b': ['51,23', '18,45', '28,90', '133,00']})
df['a'] = [x.replace(',', '.') for x in df['a']]
df['a'] = df['a'].astype(float)
Now, column a will have float type cells. Column b still contains strings.
Note that the .replace used here is not pandas' but rather Python's built-in version. Pandas' version requires the string to be an exact match or a regex.
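A quick sketch of the difference, using the string column 'b' from above:

# Python's built-in str.replace does substring replacement:
'51,23'.replace(',', '.')              # -> '51.23'

# pandas' Series.replace only matches whole cell values unless regex=True:
df['b'].replace(',', '.')              # no cell equals ',', so nothing changes
df['b'].replace(',', '.', regex=True)  # behaves like a substring replacement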
stallasia's answer looks like the best one.
However, if you want to change the separator when you already have a dataframe, you could do:
df['a'] = df['a'].str.replace(',', '.').astype(float)
Thanks for the great answers. I just want to add that in my case just using decimal=',' did not work, because I had numbers like 1.450,00 (with a thousands separator), so pandas did not recognize them, but passing thousands='.' helped to read the file correctly:
df = pd.read_csv(
    Input,
    delimiter=";",
    decimal=",",
    thousands="."
)
Here is an answer to the question of how to change the decimal comma to the decimal dot with Python pandas.
$ cat test.py
import pandas as pd
df = pd.read_csv("test.csv", quotechar='"', decimal=",")
df.to_csv("test2.csv", sep=',', encoding='utf-8', quotechar='"', decimal='.')
where we specify the decimal separator for reading as a comma, while the output decimal separator is specified as a dot. So:
$ cat test.csv
header,header2
1,"2,1"
3,"4,0"
$ cat test2.csv
,header,header2
0,1,2.1
1,3,4.0
where you see that the separator has changed to dot.
I'm trying to import a csv with pandas read_csv and can't get the lines containing the following snippet to work:
"","",""BSF" code - Intermittant, see notes",""
I am able to get past it with the options error_bad_lines=False, low_memory=False, engine='c'. However, it should be possible to parse these lines correctly. I'm not good with regular expressions, so I didn't try using engine='python', sep=regex yet. Thanks for any help.
Well, that's quite a hard one... given that all fields are quoted, you could use a regex to only treat a , preceded and followed by " as a separator:
data = pd.read_csv(filename, sep=r'(?<="),(?=")', quotechar='"')
However, you will still end up with quotes around all fields, but you could fix this by applying
data = data.applymap(lambda s:s[1:-1])
I have an input file I am trying to read into a pandas dataframe.
The file is space delimited, including white space before the first value.
I have tried both read_csv and read_table with a "\W+" regex as the separator.
data = pd.io.parsers.read_csv('file.txt',names=header,sep="\W+")
They read in the correct number of columns, but the values themselves are totally bogus. Has anyone else experienced this, or am I using it incorrectly?
I have also tried to read file line by line, create a series from row.split() and append the series to a dataframe, but it appears to crash due to memory.
Are there any other options for creating a data frame from a file?
I am using Pandas v0.11.0, Python 2.7
The regex '\W' means "not a word character" (a "word character" being letters, digits, and underscores), see the re docs, hence the strange results. I think you meant to use whitespace '\s+'.
Note: read_csv offers a delim_whitespace argument (which you can set to True), but personally I prefer to use '\s+'.
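A minimal sketch of the corrected call (the filename and column names are illustrative):

import pandas as pd

# r'\s+' matches any run of whitespace, including the spaces before
# the first value on each line
data = pd.read_csv('file.txt', names=['col1', 'col2', 'col3'], sep=r'\s+')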
I don't know what your data looks like, so I can't reproduce your error. I created some sample data and it worked fine, but sometimes using a regex in read_csv can be troublesome. If you want to specify the separator, you can use " " as the separator instead. But I'd advise first trying Andy Hayden's suggestion of delim_whitespace=True. It works well.
You can see it in the documentation here: http://pandas.pydata.org/pandas-docs/dev/io.html
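A sketch of that variant (the filename and column names are illustrative):

import pandas as pd

# delim_whitespace=True treats any run of whitespace as the delimiter
data = pd.read_csv('file.txt', names=['col1', 'col2', 'col3'],
                   delim_whitespace=True)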
A trivial CSV line could be split using the string split function. But some fields may be quoted with " and contain commas, e.g.
"good,morning", 100, 300, "1998,5,3"
thus directly using string split would not solve the problem.
My solution is to first split the line on , and then combine the pieces that have a " at the beginning or end of the string.
What's the best practice for this problem?
I am interested if there's a Python or F# code snippet for this.
EDIT: I am more interested in the implementation detail, rather than using a library.
There's a csv module in Python, which handles this.
Edit: This task falls into the "build a lexer" category. The standard way to do such tasks is to build a state machine (or use a lexer library/framework that will do it for you).
The state machine for this task would probably only need two states:
The initial one, where it reads every character except comma and newline as part of the field (exception: leading and trailing spaces), a comma as the field separator, and a newline as the record separator. When it encounters an opening quote, it goes into the
read-quoted-field state, where every character (including comma and newline) except the quote is treated as part of the field; a quote not followed by a quote means the end of the quoted field (back to the initial state), and a quote followed by a quote is treated as a single quote (an escaped quote); see the sketch below.
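A minimal Python sketch of that two-state machine, handling a single line with "" as an escaped quote (the function name is just for illustration):

def parse_csv_line(line):
    line = line.rstrip('\n')
    fields = []
    buf = []                 # characters of the current field
    in_quotes = False        # False: initial state; True: read-quoted-field
    was_quoted = False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quotes:
            if c == '"':
                if i + 1 < len(line) and line[i + 1] == '"':
                    buf.append('"')    # quote followed by quote: escaped quote
                    i += 1
                else:
                    in_quotes = False  # lone quote: back to the initial state
            else:
                buf.append(c)          # commas are ordinary data in this state
        elif c == '"':
            if not ''.join(buf).strip():
                buf = []               # drop spaces before the opening quote
            in_quotes = True           # opening quote: enter the quoted state
            was_quoted = True
        elif c == ',':                 # field separator in the initial state
            fields.append(''.join(buf) if was_quoted else ''.join(buf).strip())
            buf, was_quoted = [], False
        else:
            buf.append(c)
        i += 1
    fields.append(''.join(buf) if was_quoted else ''.join(buf).strip())
    return fields

On the example from the question, parse_csv_line('"good,morning", 100, 300, "1998,5,3"') returns ['good,morning', '100', '300', '1998,5,3'].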
By the way, your concatenating solution will break on "Field1","Field2" or "Field1"",""Field2".
From Python's csv module:
Reading a normal CSV file:

import csv

with open("some.csv", newline="") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

Reading a file with an alternate format:

import csv

with open("passwd", newline="") as f:
    reader = csv.reader(f, delimiter=':', quoting=csv.QUOTE_NONE)
    for row in reader:
        print(row)
There are some nice usage examples on LinuxJournal.com.
If you're interested in the details, read "split string at commas respecting quotes when string not in csv format", which shows some nice regexen related to this problem, or simply read the csv module source.
Chapter 4 of The Practice of Programming gave both C and C++ implementations of the CSV parser.
The generic implementation detail would be something like this (untested)
SEPARATOR = ','

def csvline2fields(line):
    fields = []
    while line.strip():
        line = line.strip()
        if line[0] in ("'", '"'):
            # Find the matching closing quote, searching past the opening one:
            end = line.find(line[0], 1)
            fields.append(line[1:end])
            # Find the beginning of the next field:
            next = line.find(SEPARATOR, end)
            if next == -1:
                break
            line = line[next+1:]
            continue
        # Find the next separator:
        next = line.find(SEPARATOR)
        if next == -1:
            # No separator left: the rest is the last field
            fields.append(line)
            break
        fields.append(line[0:next])
        line = line[next+1:]
    return fields