Unable to read CSV file using pandas, with extra quote char - python

I have a CSV with the following entries:
"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"
The issue comes with the value 8 inches" - because of the stray quote char I am unable to read the CSV using read_csv():
pd.read_csv(io.BytesIO(obj['Body'].read()), sep="|",
            quoting=1,
            engine='c', error_bad_lines=False, warn_bad_lines=True,
            encoding="utf-8", converters=pandas_config['converters'],
            skipinitialspace=True, escapechar='\"')
Is there a way to handle the quote within the cell?

Start by passing appropriate parameters for this case:
sep='[|,]' - there are two separators: a pipe char and a comma,
so define them as a regex.
skipinitialspace=True - your source text contains extra spaces (after
separators), so you should drop them.
engine='python' - to suppress the warning about falling back to the
'python' engine (the C engine does not support regex separators).
The above options alone allow you to call read_csv with no error, but the
downside (for now) is that the double quotes remain.
To eliminate them, at least from the data rows, another trick is needed:
Define a converter (lambda) function:
cnv = lambda txt: txt.replace('"', '')
and apply it to all source columns.
In your case you have 5 columns, so to keep the code concise,
you can use a dictionary comprehension:
{ i: cnv for i in range(5) }
So the whole code can be:
df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
engine='python', converters={ i: cnv for i in range(5) })
and the result is:
"column1" "column2" "column3" "column4" "column5"
0 123 sometext this somedata 8 inches hello
But remember that now all columns are of string type, so you should
convert required columns to numbers.
An alternative is to pass a second converter for numeric columns,
returning a number instead of a string.
To get proper column names (without double quotes), you can pass additional parameters:
skiprows=1 - to omit the initial line,
names=["column1", "column2", "column3", "column4", "column5"] - to
define the column list on your own.
A complete sketch combining these options is shown below.
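For instance, a minimal end-to-end sketch (assuming the two sample lines live in the string txt, and treating column1 as the only numeric column):
import io
import pandas as pd

txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

cnv = lambda s: s.replace('"', '')           # strip the double quotes
cnv_num = lambda s: int(s.replace('"', ''))  # strip the quotes, then parse as int

df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python', skiprows=1,
                 names=["column1", "column2", "column3", "column4", "column5"],
                 converters={"column1": cnv_num, "column2": cnv,
                             "column3": cnv, "column4": cnv, "column5": cnv})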

We can specify a somewhat complicated separator, read the data and strip the extra quote chars:
# Test data:
text='''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''
ff = io.StringIO(text)
df = pd.read_csv(ff, sep=r'"\s*[|,]\s*"', engine="python")
# Make it tidy: strip the quote chars that remain at the line edges
df = df.transform(lambda s: s.str.strip('"'))
df.columns = ["column1"] + list(df.columns[1:-1]) + ["column5"]
The result:
column1 column2 column3 column4 column5
0 123 sometext this somedata 8 inches hello

Related

Python dataframe : strip part of string, on each column row, if it is in specific format [duplicate]

I have read some pricing data into a pandas dataframe; the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply the regex
[0-9]+
to each field, then join the resulting list back together, but is there a non-loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
You could use pandas' replace method; also you may want to keep the thousands separator ',' and the decimal place separator '.'
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True, inplace=True)
print(df)
pricing
0 40,000.32
1 40000
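If you later need the column as numbers rather than strings, a possible follow-up (a sketch, assuming ',' only ever appears as a thousands separator) is to drop the commas and cast:
df['pricing'] = df['pricing'].str.replace(',', '', regex=False).astype(float)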
You could remove all the non-digits using re.sub():
value = re.sub(r"[^0-9]+", "", value)
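To apply that per-value function across the whole column without an explicit loop, one option (a sketch, reusing the df from the first answer above) is Series.map:
import re
df['P'] = df['P'].map(lambda value: re.sub(r"[^0-9]+", "", value)).astype(int)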
You don't need regex for this. This should work (note that convert_objects was removed from modern pandas; pd.to_numeric is the current equivalent):
df['col'] = pd.to_numeric(df['col'], errors='coerce')
In case anyone is still reading this: I'm working on a similar problem and need to replace an entire column of pandas data using a regex I've worked out with re.sub.
To apply this on my entire column, here's the code.
# add_map holds the replacement rules for the strings in the pandas df.
add_map = dict([
("AV", "Avenue"),
("BV", "Boulevard"),
("BP", "Bypass"),
("BY", "Bypass"),
("CL", "Circle"),
("DR", "Drive"),
("LA", "Lane"),
("PY", "Parkway"),
("RD", "Road"),
("ST", "Street"),
("WY", "Way"),
("TR", "Trail"),
])
import re

obj = data_909['Address'].copy() # data_909['Address'] contains the original addresses
for k, v in add_map.items(): # based on the rules in the dict
    rule1 = r"(\b)(%s)(\b)" % k # replace k only when it stands alone (see word boundary \b)
    # the callable looks up the matched abbreviation in add_map and returns the full
    # name; if the match is not a key (e.g. different case) it is returned unchanged
    rule2 = lambda m: add_map.get(m.group(), m.group())
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE) # flags matches lower-case abbreviations too
data_909['Address_n'] = obj # store it!
Hope this helps anyone searching for the problem I had. Cheers
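As a side note, the loop can be collapsed into a single pass by joining the abbreviations into one alternation; a sketch under the same assumptions (the data_909 frame and add_map dict from above):
import re

# one regex matching any abbreviation as a whole word, e.g. \b(AV|BV|...)\b
pattern = r"\b(%s)\b" % "|".join(map(re.escape, add_map))

# upper-case the match before the lookup so IGNORECASE also hits the dict keys
data_909['Address_n'] = data_909['Address'].str.replace(
    pattern,
    lambda m: add_map[m.group(0).upper()],
    regex=True,
    flags=re.IGNORECASE,
)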

Format string in pandas dataframe cell if it contains a pipe

I need to read a standard CSV into a data frame, do some manipulations, and stringify the data frame into a specialized pipe separated format (text file). In order to comply with the file format, I have to add double quotes to the entire string in that cell (if it contains a pipe) before writing the final string to a file.
I wanted to leverage Pandas functions to accomplish this. I tried dabbling with the contains and format functions, but have not been successful.
Does anyone know of a simple way to accomplish this leveraging Pandas?
Expected Input:
colA,colB,colC,colD
cat,waverly way,foo,10.0
dog,smokey | st,foo,9.7
cow,rapid ave,foo,6.6
rabbit,far | blvd,foo,3.2
Expected Output:
cat|waverly way|foo|10.0/
dog|"smokey|st"|foo|9.7/
cow|rapid ave|foo|6.6/
rabbit|"far|blvd"|foo|3.2/
The "/" is intentional
You can use np.where and manipulate the matching strings as below.
df['colB'] = np.where(df['colB'].str.contains(r'\|'), '"' + df['colB'] + '"', df['colB'])
Note: since only colB has the pipe (|) character, the code above checks and manipulates only that column. If the pipe (|) character is expected in other columns as well, you may have to repeat the code for those columns.
For colD you have to convert it into a string (if it is not one already) and append the forward slash as below:
df['colD'] = df['colD'].astype(str) + '/'
Output
colA colB colC colD
0 cat waverly way foo 10.0/
1 dog "smokey | st" foo 9.7/
2 cow rapid ave foo 6.6/
3 rabbit "far | blvd" foo 3.2/
import pandas as pd
import csv
test = pd.read_csv("test.csv")
test.to_csv("final.csv", sep="|", quoting=csv.QUOTE_NONNUMERIC, line_terminator="/\n", header=False, index=False)
Here is the contents of "final.csv":
"cat"|"waverly way"|"foo"|10.0/
"dog"|"smokey | st"|"foo"|9.7/
"cow"|"rapid ave"|"foo"|6.6/
"rabbit"|"far | blvd"|"foo"|3.2/
Edit: this will add quotes to all non-numeric values. If you want quotes only on the values with pipes, you can remove the quoting parameter and use moy's solution:
import pandas as pd
import numpy as np
df = pd.read_csv("test.csv")
for col in list(df.select_dtypes(include=[object]).columns.values):
    # quote values that contain a pipe and are not already wrapped in quotes
    df[col] = np.where(df[col].str.contains(r'\|') & ~(df[col].str.startswith('"') & df[col].str.endswith('"')),
                       '"' + df[col] + '"', df[col])
df.to_csv("final.csv", sep="|", line_terminator="/\n", header=False, index=False)

Extract particular string which appears in multiple lines in cell Pandas

I have to extract every string which starts with "Year" and finishes with "\n", for each line that appears in a cell of a Pandas data frame.
Additionally, I want to remove the \n at the end of the cell.
This is data frame:
df
Column1
not_important1\nnot_important2\nE012-855 Year-1972\nE012-856 Year-1983\nnot_important3\nE012-857 Year-1977\nnot_important4\nnot_important5\nE012-858 Year-2012\n
not_important6\nnot_important7\nE013-200 Year-1982\nE013-201 Year-1984\nnot_important8\nE013-202 Year-1987\n
not_important9\nnot_important10\nE014-652 Year-1988\nE014-653 Year-1980\nnot_important11\nE014-654 Year-1989\n
This is what I want to get:
df
Column1
Year-1972\nYear-1983\nYear-1977\nYear-2012
Year-1982\nYear-1984\nYear-1987
Year-1988\nYear-1980\nYear-1989
How to do this?
You can use findall with the regex r'Year.*?\\n' to catch the substrings (this treats \n as the literal two characters; if your cells contain real newline characters instead, drop one backslash and use [:-1] below). Then build a string from the list of found elements with ''.join and remove the trailing \n with [:-2]:
import re
df['Column1'] = df['Column1'].apply(lambda x: ''.join(re.findall(r'Year.*?\\n', x))[:-2])
Or, if the 4 digits of the year are always followed by \n, you can do it this way:
df['Column1'] = df['Column1'].apply(lambda x: '\n'.join(re.findall(r'Year-\d{4}', x)))
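A vectorized alternative (a sketch, under the same assumption about the year format) uses the .str accessor instead of apply:
df['Column1'] = df['Column1'].str.findall(r'Year-\d{4}').str.join('\n')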

How to replace character in row of dataframe?

I open the raw data using pandas:
df = pd.read_csv(file)
Here's what part of my dataframe looks like:
37280 7092|156|Laboratory Data|A648C751-A4DD-4CZ2-85
47981 7092|156|Laboratory Data|Z22CD01C-8Z4B-4ZCB-8B
57982 7092|156|Laboratory Data|C12CE01C-8F4B-4CZB-8B
I'd like to replace every pipe ('|') with a tab ('\t').
So I tried:
df.replace('|', '\t')
But it never works. How can I do this?
Many thanks!
The replace method on a data frame by default replaces only values that exactly match the string provided; you need to specify regex=True to replace patterns, and since | is a special character in regex, an escape is needed here:
df1 = df.replace(r"\|", "\t", regex=True)
df1
# 0 1
#0 37280 7092\t156\tLaboratory Data\tA648C751-A4DD-4CZ2-85
#1 47981 7092\t156\tLaboratory Data\tZ22CD01C-8Z4B-4ZCB-8B
#2 57982 7092\t156\tLaboratory Data\tC12CE01C-8F4B-4CZB-8B
If we print the cell, the tabs are printed as expected:
print(df1[1].iat[0])
# 7092 156 Laboratory Data A648C751-A4DD-4CZ2-85
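As an aside, when only one string column is involved, Series.str.replace with regex=False substitutes the substring literally and needs no escaping (a sketch, assuming the text sits in column 1 as above):
df[1] = df[1].str.replace('|', '\t', regex=False)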
Just need to set the variable to itself - but note that replace still needs regex=True to match a pipe inside a longer string:
df = df.replace(r'\|', '\t', regex=True)

Remove String Labels from every Row

I am using pandas to read in a csv column where every row has the following format:
IP: XXX:XX:XX:XXX
To get rid of the IP: prefix, I am editing the column after the fact:
logs['ip'] = logs['ip'].str[4:]
Is there a way to perform this operation within read_csv, maybe with regex, to avoid the post-computation?
Update:
Consider this scenario where there are multiple columns that have these prefixes – is there a better way?
logs['mac'] = logs['mac'].str[5:]
logs['id'] = logs['id'].str[4:]
logs['lan'] = logs['lan'].str[5:]
logs['ip'] = logs['ip'].str[4:]
The converters option for read_csv might provide a useful way. Let's say the file looks like this:
id address
1 IP:123.1.1.1
2 IP:456.1.1.1
3 IP:789.1.1.1
Then you could specify that 'IP:' should be converted to '' (blank) like this:
dct = { 'address': lambda x: x.replace('IP:','') }
df = pd.read_csv('foo.txt', delimiter=' *', engine='python', converters=dct)
id address
0 1 123.1.1.1
1 2 456.1.1.1
2 3 789.1.1.1
I'm ignoring the slight complication that if there is a space after IP: then you might be reading IP: in as its own column, but you ought to be able to adapt this fairly easily to handle that.
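To cover the multi-column scenario from the update, the same idea extends to a converters dict built from the prefix lengths; a sketch (the column names and slice lengths are the ones from the question, 'logs.csv' is a stand-in file name):
import pandas as pd

# number of leading characters to strip from each column
prefix_len = {'mac': 5, 'id': 4, 'lan': 5, 'ip': 4}

# bind n per column via a default argument so each lambda keeps its own length
converters = {col: (lambda x, n=n: x[n:]) for col, n in prefix_len.items()}

logs = pd.read_csv('logs.csv', converters=converters)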
You could just convert the CSV column to a string, then use .split("IP: ")[1] on the string, which will contain everything except for "IP: ". I'm not sure if this is the best approach, but it's what came to mind. Over a whole column that would be:
logs['ip'] = logs['ip'].str.split("IP: ").str[1]
