I have a csv database that looks like this:
Date,String
2010-12-31,'This, is, an example string'
2011-12-31,"This is an, example string"
2012-12-31,This is an example, string
I am trying to use pandas, because I believe it is one of the most widespread libraries for working with this kind of situation. Is there a way to create a DataFrame that splits only on the first comma using the read_csv function, regardless of whether the string after it is isolated by "", '' or nothing?
If not, what's the most efficient alternative to do so?
Thanks so much in advance for any help!
You can cheat by passing a regex for the sep argument of read_csv. The regex I used is ^([^,]+), which matches everything up to and including the first comma, capturing the field before it. I also used the engine argument to avoid a pandas warning (since the default C engine does not support a regex sep) and the usecols argument to make sure we only get the columns we want (without it we also get an "unnamed" column; I'm not sure why, to be honest).
You can get more information about each argument in the read_csv docs.
test.csv
Date,String
2010-12-31,'This, is, an example string'
2011-12-31,"This is an, example string"
2012-12-31,This is an example, string
Then
print(pd.read_csv('test.csv', sep='^([^,]+),', engine='python', usecols=['Date', 'String']))
Outputs
         Date                         String
0  2010-12-31  'This, is, an example string'
1  2011-12-31   "This is an, example string"
2  2012-12-31     This is an example, string
Note that this will not work if you have more than 2 "actual" columns in the CSV file.
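If you do have more columns, one possible workaround (a rough sketch, not part of the original answer, and assuming the free-text column is the last one) is to split each line manually on the first len(header) - 1 commas, so any remaining commas stay inside the last field:

import pandas as pd

# split each data line on the leading commas only, so any further
# commas stay inside the final free-text field
with open('test.csv') as f:
    lines = f.read().splitlines()

header = lines[0].split(',')
rows = [line.split(',', maxsplit=len(header) - 1) for line in lines[1:]]
df = pd.DataFrame(rows, columns=header)
print(df)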
Following up on an old question of mine, I finally identified what happens.
I have a csv-file which uses the separator \t, and I read it with the following command:
df = pd.read_csv(r'C:\..\file.csv', sep='\t', encoding='unicode_escape')
The resulting length, for example, is 800,000.
The problem is that the original file has around 1,400,000 lines, and I also know where the issue occurs: one column (let's say columnA) has the following entry:
"HILFE FüR DIE Alten
Do you have any idea what is happening? When I delete that row I get the correct number of lines (length). What is Python doing here?
According to the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html):
sep : str, default ‘,’
Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
It may be an issue with the double-quote symbol.
Try this instead:
df = pd.read_csv(r'C:\..\file.csv', sep='\\t', encoding='unicode_escape', engine='python')
or this:
df = pd.read_csv(r'C:\..\file.csv', sep=r'\t', encoding='unicode_escape')
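If the unmatched double quote in that row is indeed the culprit, another option (a hedged sketch, assuming the file contains no legitimately quoted fields) is to disable quote handling entirely, so a stray " cannot swallow the following lines:

import csv
import pandas as pd

# quoting=csv.QUOTE_NONE makes the parser treat every " as a literal
# character rather than the start of a quoted field
df = pd.read_csv(r'C:\..\file.csv', sep='\t',
                 encoding='unicode_escape', quoting=csv.QUOTE_NONE)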
I am attempting to find a semi-common occurring string and remove all other data in the column. Pandas and Re have been imported. For instance, I have dataframe...
>>> df
COLUMN COUNT DATA
1 this row RA-123: data 8b43a
2 here RA-5372: data 94h63c
I need to keep just the RA-'number that follows' and remove everything before and after. The numbers that follow are not always the same length and the 'RA-' string does not always occur in the same position. There is a colon after every instance that can be used as a delimiter.
I tried this (a friend wrote the regex search piece for me because I am not familiar with it).
df.assign(DATA= df['DATA'].str.extract(re.search('RA[^:]+')))
But python returned
TypeError: search() missing 1 required positional argument: 'string'
What am I missing here? Thanks in advance!
You should use a capturing group with extract:
df['DATA'].str.extract(r'(RA-\d+)')
Here, (RA-\d+) is a capturing group matching RA, then a hyphen and then one or more digits.
You may use your own pattern, but you still need to wrap it with capturing parentheses, r'(RA[^:]+)'.
Looking at the docs, you don't need the re.search method. You can just call df['DATA'] = df['DATA'].str.extract(r'(RA[^:]+)') (note that extract requires the pattern to contain a capturing group).
As I mentioned earlier, no need for re here.
Other answers have addressed well how to use extract directly. However, to answer your question specifically: if you really want to use re, the way to go is to use re.compile instead of re.search (the compiled pattern still needs to contain a capturing group):
df.assign(DATA=df['DATA'].str.extract(re.compile(regex_str)))
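For completeness, a minimal end-to-end sketch (DataFrame contents adapted from the question's example):

import pandas as pd

df = pd.DataFrame({'DATA': ['this row RA-123: data 8b43a',
                            'here RA-5372: data 94h63c']})

# extract() keeps only what the capturing group matches;
# expand=False returns a Series, so it can be assigned straight back
df['DATA'] = df['DATA'].str.extract(r'(RA-\d+)', expand=False)
print(df['DATA'].tolist())  # ['RA-123', 'RA-5372']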
I am importing a CSV file like the one below, using pandas.read_csv:
df = pd.read_csv(Input, delimiter=";")
Example of CSV file:
10;01.02.2015 16:58;01.02.2015 16:58;-0,59;0,1;-4,39;NotApplicable;0,79;0,2
11;01.02.2015 16:58;01.02.2015 16:58;-0,57;0,2;-2,87;NotApplicable;0,79;0,21
The problem is that when I later on in my code try to use these values, I get this error: TypeError: can't multiply sequence by non-int of type 'float'
The error occurs because the numbers are not written with a dot (.) as the decimal separator but with a comma (,). After manually changing the commas to dots, my program works.
I can't change the format of my input, and thus have to replace the commas in my DataFrame in order for my code to work, and I want Python to do this without the need of doing it manually. Do you have any suggestions?
pandas.read_csv has a decimal parameter for this; see the docs.
I.e. try with:
df = pd.read_csv(Input, delimiter=";", decimal=",")
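A quick illustration (a self-contained sketch with made-up values in the question's format):

import pandas as pd
from io import StringIO

data = "10;-0,59;0,1\n11;-0,57;0,2\n"
df = pd.read_csv(StringIO(data), delimiter=';', decimal=',', header=None)
print(df.dtypes)  # columns 1 and 2 are parsed as float64, not object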
I think the earlier mentioned answer of including decimal="," in pandas read_csv is the preferred option.
However, I found it is incompatible with the Python parsing engine: e.g. when using skiprows=, read_csv will fall back to this engine, and thus you can't use skiprows= and decimal= in the same read_csv statement, as far as I know. Also, I haven't been able to actually get the decimal= statement to work (probably due to me, though).
The long way round I used to achieve the same result is with list comprehensions, .replace and .astype. The major downside to this method is that it needs to be done one column at a time:
df = pd.DataFrame({'a': ['120,00', '42,00', '18,00', '23,00'],
                   'b': ['51,23', '18,45', '28,90', '133,00']})
df['a'] = [x.replace(',', '.') for x in df['a']]
df['a'] = df['a'].astype(float)
Now, column a will have float type cells. Column b still contains strings.
Note that the .replace used here is not pandas' but rather Python's built-in version. Pandas' version requires the string to be an exact match or a regex.
stallasia's answer looks like the best one.
However, if you want to change the separator when you already have a dataframe, you could do:
df['a'] = df['a'].str.replace(',', '.').astype(float)
Thanks for the great answers. I just want to add that in my case just using decimal=',' did not work because I had numbers like 1.450,00 (with a thousands separator), so pandas did not recognize them. Passing thousands='.' as well helped to read the file correctly:
df = pd.read_csv(
    Input,
    delimiter=";",
    decimal=",",
    thousands="."
)
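A quick check of that combination (invented sample values in the same format):

import pandas as pd
from io import StringIO

data = "a;b\n1.450,00;2.300,50\n"
df = pd.read_csv(StringIO(data), delimiter=';', decimal=',', thousands='.')
print(df)  # a and b are parsed as 1450.0 and 2300.5 (float64)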
I answer the question about how to change the decimal comma to the decimal dot with Python pandas.
$ cat test.py
import pandas as pd
df = pd.read_csv("test.csv", quotechar='"', decimal=",")
df.to_csv("test2.csv", sep=',', encoding='utf-8', quotechar='"', decimal='.')
where we specify the decimal separator for reading as a comma, while for the output it is specified as a dot. So
$ cat test.csv
header,header2
1,"2,1"
3,"4,0"
$ cat test2.csv
,header,header2
0,1,2.1
1,3,4.0
where you see that the decimal separator has changed to a dot.
I have a csv file that has a column of strings with commas inside them. If I want to read the csv using pandas, it sees the extra commas as extra columns, which gives me an error about seeing more fields than expected. I thought of using double quotes around the strings as a solution to the problem.
This is how the csv currently looks
lead,Chat.Event,Role,Data,chatid
lead,x,Lead,Hello, how are you,1
How it should look like
lead,Chat.Event,Role,Data,chatid
lead,x,Lead,"Hello, how are you",1
Is using double quotes around the strings the best solution? And if yes, how do I do that? And if not, what other solution can you recommend?
If you control the original file / database from which you generated the csv, you could generate it again using a different kind of separator (the default is the comma), one which would not occur within your strings, such as "|" (vertical bar).
Then, when reading the csv with pandas, you can just pass the argument:
pd.read_csv(file_path, sep="your separator symbol here")
Hope that helps.
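If you do want to go with double quotes and can regenerate the file from Python, here is a rough sketch (the file name chat.csv is hypothetical; the rows are taken from the question's example) using the standard csv module, whose default quoting wraps any field containing the delimiter in double quotes:

import csv

rows = [
    ['lead', 'Chat.Event', 'Role', 'Data', 'chatid'],
    ['lead', 'x', 'Lead', 'Hello, how are you', '1'],
]

with open('chat.csv', 'w', newline='') as f:
    # QUOTE_MINIMAL (the default) quotes only fields containing the
    # delimiter, producing: lead,x,Lead,"Hello, how are you",1
    csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)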
I have an input file I am trying to read into a pandas dataframe.
The file is space delimited, including white space before the first value.
I have tried both read_csv and read_table with a "\W+" regex as the separator.
data = pd.io.parsers.read_csv('file.txt',names=header,sep="\W+")
They read in the correct number of columns, but the values themselves are totally bogus. Has anyone else experienced this, or am I using it incorrectly?
I have also tried reading the file line by line, creating a Series from row.split() and appending the Series to a DataFrame, but it appears to crash due to memory.
Are there any other options for creating a data frame from a file?
I am using Pandas v0.11.0, Python 2.7
The regex '\W' means "not a word character" (a "word character" being letters, digits, and underscores), see the re docs, hence the strange results. I think you meant to use whitespace '\s+'.
Note: read_csv offers a delim_whitespace argument (which you can set to True), but personally I prefer to use '\s+'.
I don't know what your data looks like, so I can't reproduce your error. I created some sample data and it worked fine, but using a regex in read_csv can sometimes be troublesome. If you want to specify the separator explicitly, you can use " " as the separator instead. But I'd advise first trying Andy Hayden's suggestion of delim_whitespace=True. It works well.
You can see it in the documentation here: http://pandas.pydata.org/pandas-docs/dev/io.html
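For reference, a minimal sketch of the '\s+' suggestion (sample data and column names invented for illustration):

import pandas as pd
from io import StringIO

# invented sample mimicking the question: leading whitespace,
# space-delimited values
data = "  1  2.5  foo\n  3  4.5  bar\n"
header = ['a', 'b', 'c']

# sep='\s+' is treated as whitespace delimiting, so the leading
# spaces on each line are skipped rather than creating an empty column
df = pd.read_csv(StringIO(data), names=header, sep=r'\s+')
print(df)
#    a    b    c
# 0  1  2.5  foo
# 1  3  4.5  bar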