I'm trying to modify the ADDRESS column data by removing everything up to and including the comma.
Sample data:
**ADDRESS**
0 Ksfc Layout,Bangalore
1 Vishweshwara Nagar,Mysore
2 Jigani,Bangalore
3 Sector-1 Vaishali,Ghaziabad
4 New Town,Kolkata
Expected Output:
**ADDRESS**
0 Bangalore
1 Mysore
2 Bangalore
3 Ghaziabad
4 Kolkata
I tried this code, but it's not working. Can someone correct it?
import pandas as pd
import regex as re
data = pd.read_csv("train.csv")
data.ADDRESS.replace(re.sub(r'.*,',"", data.ADDRESS), regex=True, inplace=True)
Try this:
data.ADDRESS = data.ADDRESS.str.split(',').str[-1]
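For context, a minimal self-contained sketch (using the sample data above instead of train.csv): str.split returns a list per row and .str[-1] picks the last element. This is what the re.sub attempt could not do, since re.sub expects a single string rather than a whole Series.

import pandas as pd

data = pd.DataFrame({'ADDRESS': ['Ksfc Layout,Bangalore',
                                 'Vishweshwara Nagar,Mysore',
                                 'Jigani,Bangalore']})
# split each value on the comma and keep the last piece
data.ADDRESS = data.ADDRESS.str.split(',').str[-1]
print(data.ADDRESS.tolist())  # ['Bangalore', 'Mysore', 'Bangalore']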
You can do it without a regex:
def removeFirst(x):
    return x.split(",")[-1]

data['ADDRESS'] = data['ADDRESS'].apply(removeFirst)
Use Series.str.replace (note that recent pandas versions require regex=True for a pattern like this):
data['ADDRESS'] = data['ADDRESS'].str.replace(r'.*,', '', regex=True)
It should be fairly simple, yet I'm not able to achieve it.
I have a dataframe df1, having a column "name_str". Example below:
name_str
0 alp:ha
1 bra:vo
2 charl:ie
I have to create another column comprising, say, the 5 characters that start after the colon (:). I've written the following code:
import pandas as pd
data = {'name_str':["alp:ha", "bra:vo", "charl:ie"]}
#indx = ["name_1",]
df1 = pd.DataFrame(data=data)
n = df1['name_str'].str.find(":") + 1
df1['slize'] = df1['name_str'].str.slice(n,2)
print(df1)
But the output is disappointing, all NaN:
name_str slize
0 alp:ha NaN
1 bra:vo NaN
2 charl:ie NaN
The output should've been:
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Would anyone please help? Appreciate it.
You can use str.extract to extract everything after the colon with this regular expression: :(.*)
df1['slize'] = df1.name_str.str.extract(':(.*)')
>>> df1
name_str slize
0 alp:ha ha
1 bra:vo vo
2 charl:ie ie
Edit, based on your updated question
If you'd like to extract up to 5 characters after the colon, then you can use this modification:
df1['slize'] = df1.name_str.str.extract(':(.{,5})')
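As an aside, the original attempt returns NaN because str.slice expects scalar start and stop positions, not a Series of per-row offsets. A non-regex sketch that does the same job (assuming the same df1):

# split once on the colon, keep the part after it, then take at most 5 characters
df1['slize'] = df1['name_str'].str.split(':', n=1).str[1].str[:5]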
I am using Pandas and Python. My data is:
import pandas as pd

a = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                  'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})
I want to extract all the substrings between < and > and join them with a space. For example, the data above should produce:
aafae afre
433
1234334 a
bijf 9tu0 vie
nan
So all the substrings between < > are extracted, and the result should be NaN if there are no such strings. I have already tried the re library and the str functions, but I am really new to regex. Could anyone help me out here?
Use pandas.Series.str.findall:
a['Str'].str.findall('<(.*?)>').apply(' '.join)
Output:
0 aafae afre
1 433
2 1234334 a
3 bijf 9tu0 vie
4
Name: Str, dtype: object
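The question asks for NaN when a row has no <...> pair, while the join above leaves an empty string (row 4). A small follow-up sketch (my addition, assuming the same frame a) converts the empty joins:

import numpy as np

extracted = a['Str'].str.findall('<(.*?)>').apply(' '.join)
a['new_str'] = extracted.replace('', np.nan)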
Maybe this expression might work to some extent:
import pandas as pd
a = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                  'Str': ['aa <aafae><afre> ht4',
                          'v fef <><433>',
                          '<1234334> <a>',
                          '<bijf> 04<9tu0>q4g <vie>',
                          'aaa 1']})
a["new_str"] = a["Str"].str.replace(r'.*?<([^>]+)>|(?:.+)', r'\1 ', regex=True)
print(a)
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used the .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower()  # About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, since they work on the same About column; and because the values are converted to lowercase first, the regex can be simplified to remove everything that is not a lowercase letter or a space:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print(df)
About
0 aasd
1 sdd aa
import pandas as pd

columns = ['About']
data = ["ALPHA", "OMEGA", "ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex (its definition was missing here, so it is assumed from the explanation below):
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$')  # assumed pattern: strip trailing non-lowercase characters
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z]+ matches a single character not present in the list a-z
+ quantifier: matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
a-z: a single character in the range between a (index 97) and z (index 122) (case sensitive)
$ asserts position at the end of a line
I have a Python 2.7 Pandas DataFrame with a column that looks something like this:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com']
I want to extract the email from it without the square brackets and single quotes. The output should look like this:
email
jsaw#yahoo.com
jfsjhj#yahoo.com
jwrk#yahoo.com
rankw#yahoo.com
I have tried the suggestions from this answer: Replace all occurrences of a string in a pandas dataframe (Python). But it's not working. Any help will be appreciated.
Edit:
What if I have arrays with more than one element? Something like:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com','fsffsnl#gmail.com']
['mklcu#yahoo.com','riserk#gmail.com', 'funkdl#yahoo.com']
Is it possible to get the output in three different columns without square brackets and single quotes?
You can use str.strip if the type of the values is string:
print type(df.at[0,'email'])
<type 'str'>
df['email'] = df.email.str.strip("[]'")
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
If the type is list, apply pd.Series:
print type(df.at[0,'email'])
<type 'list'>
df['email'] = df.email.apply(pd.Series)
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
EDIT: If you have multiple values in the array, you can use:
df1 = df['email'].apply(pd.Series).fillna('')
print df1
0 1 2
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com fsffsnl#gmail.com
4 mklcu#yahoo.com riserk#gmail.com funkdl#yahoo.com
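If the cells are plain strings rather than lists, a hedged sketch (my addition, not from the original answer) that strips the brackets and quotes and expands the addresses into columns:

df1 = (df['email'].str.strip("[]")
                  .str.replace("'", '')
                  .str.split(',', expand=True)
                  .fillna(''))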
Try this one:
from re import findall
s = "['rankw#yahoo.com']"
m = findall(r"\[([A-Za-z0-9#'._]+)\]", s)
print(m[0].replace("'",''))
Simple question here: how do I replace all of the whitespace values in a column with a zero?
For example:
Name Age
John 12
Mary
Tim 15
into
Name Age
John 12
Mary 0
Tim 15
I've been trying something like this, but I am unsure how Pandas actually reads whitespace:
merged['Age'].replace(" ", 0).bfill()
Any ideas?
merged['Age'] = merged['Age'].apply(lambda x: 0 if x == ' ' else x)
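The lambda above only catches a cell that is exactly one space character. A variant (an assumption about the data, not from the original answer) that catches any all-whitespace string:

merged['Age'] = merged['Age'].apply(lambda x: 0 if isinstance(x, str) and x.strip() == '' else x)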
Use the built-in method convert_objects and set the param convert_numeric=True:
In [12]:
# convert objects will handle multiple whitespace, this will convert them to NaN
# we then call fillna to convert those to 0
df.Age = df[['Age']].convert_objects(convert_numeric=True).fillna(0)
df
Out[12]:
Name Age
0 John 12
1 Mary 0
2 Tim 15
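Note that convert_objects has since been deprecated and removed from pandas; a sketch of the modern equivalent (assuming the Age column should be numeric) uses pd.to_numeric:

import pandas as pd

# non-numeric entries (including whitespace) become NaN, then 0
df['Age'] = pd.to_numeric(df['Age'], errors='coerce').fillna(0)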
Here's an answer modified from this more thorough question. I'll make it a little more Pythonic and resolve your basestring issue.
def ws_to_zero(maybe_ws):
    try:
        if maybe_ws.isspace():
            return 0
        else:
            return maybe_ws
    except AttributeError:
        return maybe_ws

d = d.applymap(ws_to_zero)
where d is your dataframe.
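A quick usage sketch (assuming the question's sample data):

import pandas as pd

d = pd.DataFrame({'Name': ['John', 'Mary', 'Tim'], 'Age': [12, ' ', 15]})
d = d.applymap(ws_to_zero)
print(d)  # Mary's Age is now 0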
If you want to use NumPy, then you can use the snippet below:
import numpy as np
df['column_of_interest'] = np.where(df['column_of_interest'] == ' ', 0, df['column_of_interest']).astype(float)
While Paulo's response is excellent, my snippet above may be useful when multiple criteria are required during advanced data manipulation.
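The equality check above only matches a cell that is exactly one space. If the blanks may contain several whitespace characters, a variant (my assumption about the data, not part of the original answer) is:

import numpy as np

mask = df['column_of_interest'].astype(str).str.strip() == ''
df['column_of_interest'] = np.where(mask, 0, df['column_of_interest']).astype(float)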