I open the raw data using pandas:
df = pd.read_csv(file)
Here's what part of my dataframe looks like:
37280 7092|156|Laboratory Data|A648C751-A4DD-4CZ2-85
47981 7092|156|Laboratory Data|Z22CD01C-8Z4B-4ZCB-8B
57982 7092|156|Laboratory Data|C12CE01C-8F4B-4CZB-8B
I'd like to replace every pipe ('|') with a tab ('\t').
So I tried:
df.replace('|','\t')
But it doesn't work. How can I do this?
Many thanks!
By default, the DataFrame replace method only replaces cell values that exactly match the string provided. You need to pass regex=True to replace substrings by pattern, and since | is a special character in regex, it has to be escaped:
df1 = df.replace(r"\|", "\t", regex=True)
df1
# 0 1
#0 37280 7092\t156\tLaboratory Data\tA648C751-A4DD-4CZ2-85
#1 47981 7092\t156\tLaboratory Data\tZ22CD01C-8Z4B-4ZCB-8B
#2 57982 7092\t156\tLaboratory Data\tC12CE01C-8F4B-4CZB-8B
If we print a cell, the tabs are rendered as expected:
print(df1[1].iat[0])
# 7092 156 Laboratory Data A648C751-A4DD-4CZ2-85
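For contrast, here is a quick sketch of why the original call appears to do nothing (same df as above): without regex=True, replace only matches whole cell values, so a substring like '|' is never found.
df.replace('|', '\t')  # no-op here: no cell's entire value is exactly '|'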
Also note that replace returns a new DataFrame rather than modifying it in place, so you need to assign the result back (and keep regex=True):
df = df.replace(r'\|', '\t', regex=True)
Related
Problem: Currently, I have a dataframe column with values like 1.Goalkeeper, 4.Midfield... and I can't partially replace the string.
Objective: My goal is to replace them with 1.GK, 4.MD... but the replacement never happens; it's as if these lines weren't executed. Any ideas?
The code works when the input exactly matches the key, for example Goalkeeper, Midfield..., but it fails when the value is prefixed with a number and a dot.
CODE
df2['Posicion'].replace({'Goalkeeper': 'GK', 'Left-Back': 'LI', 'Defensive Midfield': 'MCD',
                         'Right Midfield': 'MD', 'Attacking Midfield': 'MP', 'Right Winger': 'ED',
                         'Centre-Forward': 'DC', 'Centre-Back': 'DFC', 'Right-Back': 'LD',
                         'Central Midfield': 'MC', 'Second Striker': 'SD', 'Left Midfield': 'MI',
                         'Left Winger': 'EI', 'N': '', 'None': '', 'Sweeper': 'DFC'}, inplace=True)
regex=True will do the trick here.
df2 = pd.DataFrame({
'Posicion' : ['1.Goalkeeper', '2.Midfield', '3.Left Winger']
})
df2['Posicion'].replace({'Goalkeeper':'GK',
'Left Winger':'EI',
'N':'',
'None':'',
'Sweeper':'DFC'},
regex=True,
inplace=True)
Output:
Posicion
0 1.GK
1 2.Midfield
2 3.EI
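A hedged aside: inplace=True on a column selection like this can warn or silently fail under copy-on-write in recent pandas, so assigning the result back is the safer pattern. The same call in assignment style, with the shortened mapping from above:
df2['Posicion'] = df2['Posicion'].replace({'Goalkeeper': 'GK',
                                           'Left Winger': 'EI',
                                           'Sweeper': 'DFC'},
                                          regex=True)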
I have read some pricing data into a pandas dataframe; the values appear as:
$40,000*
$40000 conditions attached
I want to strip it down to just the numeric values.
I know I can loop through and apply regex
[0-9]+
to each field and then join the resulting list back together, but is there a non-loopy way?
Thanks
You could use Series.str.replace:
import pandas as pd
df = pd.DataFrame(['$40,000*','$40000 conditions attached'], columns=['P'])
print(df)
# P
# 0 $40,000*
# 1 $40000 conditions attached
df['P'] = df['P'].str.replace(r'\D+', '', regex=True).astype('int')
print(df)
yields
P
0 40000
1 40000
since \D matches any character that is not a decimal digit.
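One caveat worth sketching: stripping every non-digit also removes decimal points, so '$40,000.32*' would collapse to 4000032. If your prices can carry decimals, keep the dot in the character class (assuming the field contains no other stray periods):
df['P'] = df['P'].str.replace(r'[^\d.]+', '', regex=True).astype(float)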
You could also use pandas' replace method if you want to keep the thousands separator ',' and the decimal separator '.':
import pandas as pd
df = pd.DataFrame(['$40,000.32*','$40000 conditions attached'], columns=['pricing'])
df['pricing'].replace(to_replace=r"\$([0-9,.]+).*", value=r"\1", regex=True, inplace=True)
print(df)
pricing
0 40,000.32
1 40000
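If you then want numbers rather than strings, a follow-up sketch that drops the thousands separator and casts to float:
df['pricing'] = df['pricing'].str.replace(',', '', regex=False).astype(float)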
You could remove all the non-digits using re.sub() from the standard library:
import re
value = re.sub(r"[^0-9]+", "", value)
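re.sub works on one string at a time; to apply it across a whole column without an explicit Python loop, you can map it (a sketch, assuming the 'P' column from the earlier example):
import re
df['P'] = df['P'].map(lambda v: re.sub(r"[^0-9]+", "", str(v)))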
You don't need regex for this, though note that convert_objects has long since been removed from pandas; the modern equivalent is pd.to_numeric, which turns non-convertible strings into NaN (so values still containing '$' or '*' will come back as NaN):
df['col'] = pd.to_numeric(df['col'], errors='coerce')
In case anyone is still reading this: I was working on a similar problem and needed to replace an entire column of pandas data using a regex substitution I figured out with re.sub.
To apply this to my entire column, here's the code.
import re  # needed below for re.IGNORECASE

# add_map holds the replacement rules for the strings in the df
add_map = dict([
("AV", "Avenue"),
("BV", "Boulevard"),
("BP", "Bypass"),
("BY", "Bypass"),
("CL", "Circle"),
("DR", "Drive"),
("LA", "Lane"),
("PY", "Parkway"),
("RD", "Road"),
("ST", "Street"),
("WY", "Way"),
("TR", "Trail"),
])
obj = data_909['Address'].copy()  # data_909['Address'] contains the original addresses
for k, v in add_map.items():  # apply each rule from the dict
    rule1 = r"\b%s\b" % k  # replace k only when it stands alone (word boundaries)
    # The callable looks the matched text up in add_map and falls back to the
    # match itself if the key isn't found (e.g. a lower-case match like 'av'):
    rule2 = lambda m: add_map.get(m.group(), m.group())
    obj = obj.str.replace(rule1, rule2, regex=True, flags=re.IGNORECASE)  # case-insensitive match
data_909['Address_n'] = obj  # store it!
Hope this helps anyone searching for the problem I had. Cheers
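For what it's worth, the per-key loop can be collapsed into a single pass with one combined pattern; the callable then does exactly what the lambda above does, looking each matched abbreviation up in add_map (a sketch, assuming the same data_909 frame):
import re
pattern = r"\b(?:%s)\b" % "|".join(map(re.escape, add_map))
data_909['Address_n'] = data_909['Address'].str.replace(
    pattern,
    lambda m: add_map[m.group(0).upper()],  # upper() so lower-case matches still hit the dict
    regex=True, flags=re.IGNORECASE)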
I have the following strings in a column of a dataframe:
"LOCATION: FILE-ABC.txt"
"DRAFT-1-FILENAME-ADBCD.txt"
And I want to extract everything that is between the word FILE and the ".". But I want to include the first delimiter. Basically I am trying to return the following result:
"FILE-ABC"
"FILENAME-ABCD"
For that I am using the script below:
df['field'] = df.string_value.str.extract('FILE/(.w+)')
But I am not able to return the desired information (always getting NA).
How can I do this?
You can accomplish this entirely within the regex, without any string slicing:
df['field'] = df.string_value.str.extract(r'(FILE.*(?=\.txt))')
FILE is what we begin the match on
.* grabs any number of characters
(?=) is a lookahead assertion that matches without consuming
Handy regex tool https://pythex.org/
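A runnable sketch of that lookahead (column name taken from the question):
import pandas as pd
df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt"]})
df['field'] = df['string_value'].str.extract(r'(FILE.*(?=\.txt))')
print(df['field'].tolist())  # ['FILE-ABC', 'FILENAME-ADBCD']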
If the strings will always end in .txt then you can try with the following:
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Example:
import pandas as pd
text = ["LOCATION: FILE-ABC.txt","DRAFT-1-FILENAME-ADBCD.txt"]
data = {'index':[0,1],'string_value':text}
df = pd.DataFrame(data)
df['field'] = df['string_value'].str.extract('(FILE.*)')[0].str[:-4]
Output:
index string_value field
0 0 LOCATION: FILE-ABC.txt FILE-ABC
1 1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD
You can make a capturing group that captures from 'FILE' (inclusive) greedily up to the last period, or make it non-greedy so it stops at the first '.' after FILE.
import pandas as pd
df = pd.DataFrame({'string_value': ["LOCATION: FILE-ABC.txt", "DRAFT-1-FILENAME-ADBCD.txt",
"BADFILENAME.foo.txt"]})
df['field_greedy'] = df['string_value'].str.extract(r'(FILE.*)\.')
df['field_not_greedy'] = df['string_value'].str.extract(r'(FILE.*?)\.')
print(df)
string_value field_greedy field_not_greedy
0 LOCATION: FILE-ABC.txt FILE-ABC FILE-ABC
1 DRAFT-1-FILENAME-ADBCD.txt FILENAME-ADBCD FILENAME-ADBCD
2 BADFILENAME.foo.txt FILENAME.foo FILENAME
I have the following CSV with the following entries:
"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"
The issue comes when I try to read the 8 inches"" value; I am unable to read the CSV using read_csv():
pd.read_csv(io.BytesIO(obj['Body'].read()), sep="|",
            quoting=1,
            engine='c', error_bad_lines=False, warn_bad_lines=True,
            encoding="utf-8", converters=pandas_config['converters'], skipinitialspace=True, escapechar='\"')
Is there a way to handle the quote within the cell?
Start by passing appropriate parameters for this case:
sep='[|,]' - there are two separators, a pipe char and a comma, so define them as a regex.
skipinitialspace=True - your source text contains extra spaces (after separators), so you should drop them.
engine='python' - to suppress the warning about falling back to the 'python' engine.
The above options alone allow read_csv to be called without error, but the downside (for now) is that the double quotes remain.
To eliminate them, at least from the data rows, another trick is needed:
Define a converter (lambda) function:
cnv = lambda txt: txt.replace('"', '')
and apply it to all source columns.
In your case you have 5 columns, so to keep the code concise,
you can use a dictionary comprehension:
{ i: cnv for i in range(5) }
So the whole code can be:
df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
engine='python', converters={ i: cnv for i in range(5) })
and the result is:
"column1" "column2" "column3" "column4" "column5"
0 123 sometext this somedata 8 inches hello
But remember that now all columns are of string type, so you should
convert required columns to numbers.
An alternative is to pass a second converter for the numeric columns, returning a number instead of a string.
To have proper column names (without double quotes), you can pass additional parameters:
skiprows=1 - to omit the initial line,
names=["column1", "column2", "column3", "column4", "column5"] - to
define the column list on your own.
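Putting all of those pieces together into one runnable sketch (txt holds the two sample lines from the question; note the small .strip() added to the converter to drop the stray space before the pipe):
import io
import pandas as pd

txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

cnv = lambda s: s.replace('"', '').strip()  # drop quotes and stray spaces
df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python', skiprows=1,
                 names=["column1", "column2", "column3", "column4", "column5"],
                 converters={i: cnv for i in range(5)})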
We can specify a somewhat complicated separator, read the data, and strip the extra quote chars:
# Test data:
import io
import pandas as pd

text = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''
ff = io.StringIO(text)
df = pd.read_csv(ff, sep=r'"\s*[|,]\s*"', engine="python")

# Make it tidy:
df = df.transform(lambda s: s.str.strip('"'))
df.columns = ["column1"] + list(df.columns[1:-1]) + ["column5"]
The result:
column1 column2 column3 column4 column5
0 123 sometext this somedata 8 inches hello
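As in the previous answer, every column comes back as a string; a one-line follow-up sketch to make the numeric column usable:
df['column1'] = pd.to_numeric(df['column1'])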