Replacing pandas column with a subset of itself through regex - python

I have data streaming in the following format:
from io import StringIO
data ="""\
ANI/IP
sip:5554447777#10.94.2.15
sip:10.66.7.34#6665554444
sip:3337775555#10.94.2.11
"""
import pandas as pd
df = pd.read_table(StringIO(data), sep=r'\s+', dtype='str')
What I would like to do is replace the column content with just the phone number part of the string above. I tried the suggestions from this thread like so:
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '', regex=True).astype('str')
print(df)
However, this results in:
ANI/IP
0 sip:#10.94.2.15
1 sip:#10.66.7.34
2 sip:#10.94.2.11
I need the phone numbers instead. How do I get the following?
ANI/IP
0 5554447777
1 6665554444
2 3337775555

The regex \d{10} matches a run of ten consecutive digits, so
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '', regex=True).astype('str')
replaces each phone number with an empty string. This removes the numbers!
Note: You shouldn't call astype('str'). It's not needed, because the column already holds strings (stored as object dtype in the pandas version shown here).
You want to extract these phone numbers:
In [11]: df["ANI/IP"].str.extract(r'(\d{10})', expand=False) # before overwriting!
Out[11]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
Assign this to a new column and you're done:
In [12]: df["phone_number"] = df["ANI/IP"].str.extract(r'(\d{10})', expand=False)
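As a hedged sketch of the same idea, the snippet below uses a slightly stricter pattern with lookarounds, so that a run longer than ten digits is not picked up by accident. The fourth sample row is an assumption added for illustration, and expand=False is passed so that a Series comes back for the single capture group.

```python
import pandas as pd

df = pd.DataFrame({"ANI/IP": [
    "sip:5554447777#10.94.2.15",
    "sip:10.66.7.34#6665554444",
    "sip:3337775555#10.94.2.11",
    "sip:12345678901#10.94.2.12",  # hypothetical row: 11 digits, should not match
]})

# (?<!\d) and (?!\d) reject digit runs that are longer than ten characters.
pattern = r"(?<!\d)(\d{10})(?!\d)"

# expand=False returns a Series when there is a single capture group.
df["phone_number"] = df["ANI/IP"].str.extract(pattern, expand=False)
print(df)
```

Rows without a match end up as missing values, which you can then filter with isna() or drop.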

You could use Series.str.extract (pandas.core.strings.StringMethods.extract) to pull out the digits:
In [10]: df['ANI/IP'].str.extract(r"(\d{10})", expand=False)
Out[10]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object

Related

Removing comma from values in column (csv file) using Python Pandas

I want to remove commas from a column named size.
CSV looks like below:
number name size
1 Car 9,32,123
2 Bike 1,00,000
3 Truck 10,32,111
I want the output as below:
number name size
1 Car 932123
2 Bike 100000
3 Truck 1032111
I am using python3 and Pandas module for handling this csv.
I am trying replace method but I don't get the desired output.
Snapshot from my code :
import pandas as pd
df = pd.read_csv("file.csv")
# df.replace(",","")
# df['size'] = df['size'].replace(to_replace = "," , value = "")
# df['size'] = df['size'].replace(",", "")
df['size'] = df['size'].replace({",", ""})
print(df['size'])  # expecting to see 'size' column without comma
I don't see any error/exception. The last line, print(df['size']), simply displays the values as they are, i.e., with commas.
Series.replace needs regex=True here. Without it, replace only matches cells whose entire value is ",", not commas inside a longer string:
>>> df["size"] = df["size"].replace(",", "", regex=True)
>>> df
number name size
0 1 Car 932123
1 2 Bike 100000
2 3 Truck 1032111
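To make the difference concrete, here is a small sketch that compares the three calls on a toy Series. The sample values (including the lone "," cell) are made up for illustration.

```python
import pandas as pd

s = pd.Series(["9,32,123", ",", "1,00,000"])

# Series.replace without regex: only a cell that is exactly "," is replaced.
whole_cell = s.replace(",", "")

# Series.replace with regex=True: every comma inside every cell is replaced.
substring = s.replace(",", "", regex=True)

# Series.str.replace always works on substrings; regex=False treats "," literally.
via_str = s.str.replace(",", "", regex=False)

print(whole_cell.tolist())
print(substring.tolist())
print(via_str.tolist())
```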
Regarding "I am using python3 and Pandas module for handling this csv":
Note that pandas.read_csv has an optional thousands argument. If the commas denote thousands separators, you can set thousands=",". Consider the following example:
import io
import pandas as pd
some_csv = io.StringIO('value\n"1"\n"1,000"\n"1,000,000"\n')
df = pd.read_csv(some_csv, thousands=",")
print(df)
Output:
value
0 1
1 1000
2 1000000
For brevity I used io.StringIO; the same effect can be achieved by passing the name of a file with the same content as the first argument to pd.read_csv.
Try with str.replace instead:
df['size'] = df['size'].str.replace(',', '')
Optionally convert to int with astype:
df['size'] = df['size'].str.replace(',', '').astype(int)
number name size
0 1 Car 932123
1 2 Bike 100000
2 3 Truck 1032111
Sample Frame Used:
df = pd.DataFrame({'number': [1, 2, 3], 'name': ['Car', 'Bike', 'Truck'],
'size': ['9,32,123', '1,00,000', '10,32,111']})
number name size
0 1 Car 9,32,123
1 2 Bike 1,00,000
2 3 Truck 10,32,111
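If some cells might not be clean numbers, astype(int) will raise. A possible alternative, sketched below, is pd.to_numeric with errors="coerce", which turns unparseable cells into NaN. The "n/a" row and the size_num column name are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"size": ["9,32,123", "1,00,000", "n/a"]})  # "n/a" is a hypothetical bad cell

# Strip the commas literally, then convert; bad cells become NaN instead of raising.
cleaned = df["size"].str.replace(",", "", regex=False)
df["size_num"] = pd.to_numeric(cleaned, errors="coerce")
print(df)
```

Note that the resulting column is float because of the NaN; drop or fill the missing rows first if you need integers.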

How to convert a data frame column in python?

After reading from a large Excel file, I have the following data:
Mode Fiscal Year/Period Amount
ABC 12.2001 10243.00
CAB 2.201 987.87
I need to convert the above data frame to the form below:
Mode Fiscal Year/Period Amount
ABC 012.2001 10243.00
CAB 002.2010 987.87
I need help converting the Fiscal Year/Period column.
It is always easier for us and you will get better help if you provide your attempts at the solution (your code).
Try this,
import pandas as pd
Recreating your data
data = {'mode':['abc', 'cab'], 'Fiscal Year/Period':[12.2001, 2.201]}
And put it in a dataframe,
data=pd.DataFrame(data)
Convert the column to a str,
data['Fiscal Year/Period']=data['Fiscal Year/Period'].astype(str)
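And use zfill() to left-pad with zeros:
data['Fiscal Year/Period'].apply(lambda x: x.zfill(8))
This yields the output below. zfill pads only on the left, so the second row comes out as 0002.201 rather than the requested 002.2010: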
0 012.2001
1 0002.201
Name: Fiscal Year/Period, dtype: object
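If I understand correctly, you can split on the dot, then zfill the period part and ljust the year part (this assumes the column holds strings):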
s = df['Fiscal_Year/Period'].str.split('.',expand=True)
s[0] = s[0].str.zfill(3)
s[1] = s[1].str.ljust(4,'0')
df['Year'] = s.agg('.'.join,axis=1)
print(df)
Mode Fiscal_Year/Period Amount Year
0 ABC 12.2001 10243.00 012.2001
1 CAB 2.201 987.87 002.2010
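The same split, zfill and ljust steps can also be wrapped in a small helper, which works even when the column was read as floats (where 2.2010 has already lost its trailing zero). This is only a sketch; format_period and the Formatted column are names made up for the example.

```python
import pandas as pd

def format_period(value):
    """Pad the period to 3 digits on the left and the year to 4 digits on the right."""
    period, year = str(value).split(".")
    return period.zfill(3) + "." + year.ljust(4, "0")

df = pd.DataFrame({"Mode": ["ABC", "CAB"],
                   "Fiscal Year/Period": [12.2001, 2.201]})

df["Formatted"] = df["Fiscal Year/Period"].map(format_period)
print(df)
```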

python - Replace first five characters in a column with asterisks

I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five digits with asterisks, keeping the hyphens as they are?
Similar to Mr. Me's answer, this replaces the first six characters (three digits, a hyphen, and two digits) with the masked prefix and keeps the rest of the string:
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can achieve this with the replace() method and a regex.
Example dataframe (borrowed from AkshayNevrekar's answer):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
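If you would rather not hard-code the "***-**" string, str.replace also accepts a callable as the replacement when regex=True. The sketch below masks every digit in the matched prefix and leaves the hyphen untouched; mask_digits and the masked column are hypothetical names.

```python
import re
import pandas as pd

df = pd.DataFrame({"ssn": ["111-22-3333", "121-22-1123", "345-87-3425"]})

def mask_digits(match):
    # Replace each digit in the matched prefix with "*", leaving hyphens alone.
    return re.sub(r"\d", "*", match.group(0))

# The callable receives a regex match object for each match.
df["masked"] = df["ssn"].str.replace(r"^\d{3}-\d{2}", mask_digits, regex=True)
print(df)
```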
Put your asterisks in front, then grab the last 4 digits of each value with the .str accessor:
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]
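One pitfall with this approach is worth spelling out: slicing the Series directly selects rows, while slicing through .str selects characters inside each value. A minimal sketch with made-up sample values:

```python
import pandas as pd

ssn = pd.Series(["111-22-3333", "121-22-1123", "345-87-3425",
                 "289-31-9165", "555-66-7777"])

last_rows = ssn[-4:]        # slices the Series itself: the last four rows
last_chars = ssn.str[-4:]   # slices each string: the last four characters

masked = "***-**-" + last_chars
print(masked.tolist())
```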
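You can use a regex with re.sub:
import re
import pandas as pd
df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})
def func(x):
    # mask the first three digits and the next two digits, keeping the hyphen
    return re.sub(r'\d{3}-\d{2}','***-**', x)
df['ssn'] = df['ssn'].apply(func)
print(df)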
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425

Python Pandas: Dataframe is not updating using string methods

I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains the column name 'about' which contains the rows of data I want to manipulate.
I've already used str. to update but it is not reflecting in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. Both operations can also be chained, because they work on the same column About. Since the values are already lowercase after str.lower(), the regex only needs to exclude a-z and the space:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print (df)
About
0 aasd
1 sdd aa
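The underlying point is that string methods return a new Series and never modify the column in place. A short sketch of the difference, reusing the sample frame above (the unchanged variable is only there to show the effect):

```python
import pandas as pd

df = pd.DataFrame({"About": ["AaSD14%", "SDD Aa"]})

# The result is computed and then thrown away; df is not modified.
df["About"].str.lower()
unchanged = df["About"].tolist()

# Assigning the result back is what actually updates the column.
df["About"] = df["About"].str.lower().str.replace(r"[^a-z ]", "", regex=True)
print(unchanged, df["About"].tolist())
```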
import pandas as pd
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
About
0 alpha
1 omega
2 alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
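Solution: another way, using a compiled regex (which can also carry flags). regex_pat is not defined in the original snippet; the definition below is reconstructed from the explanation that follows:
>>> import re
>>> regex_pat = re.compile(r'[^a-z]+$')
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)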
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation of the pattern [^a-z]+$:
[^a-z] matches a single character not in the range between a (index 97) and z (index 122) (case sensitive).
+ is a quantifier that matches between one and unlimited times, as many times as possible, giving back as needed (greedy).
$ asserts position at the end of a line.
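For completeness, here is a sketch of a compiled pattern that does carry a flag. It is not the pattern from the answer above: keep_letters is a hypothetical name, and the pattern strips every character that is not a letter or a space, using re.IGNORECASE so that lowercasing can happen afterwards.

```python
import re
import pandas as pd

df = pd.DataFrame({"About": ["JOHN23", "PINKO22", "MERRY jen", "Soojan San", "Remo55"]})

# The flag lives on the compiled pattern; do not also pass flags= to str.replace.
keep_letters = re.compile(r"[^a-z ]+", flags=re.IGNORECASE)

# regex=True is passed explicitly because the pattern is a compiled regex.
cleaned = df["About"].str.replace(keep_letters, "", regex=True).str.lower()
print(cleaned.tolist())
```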

How to extract content from the regex output which has square bracket in python

I have a pandas DataFrame (Python 2.7) with a column that looks something like this:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com']
I want to extract the email from it without the square brackets and single quotes. The output should look like this:
email
jsaw#yahoo.com
jfsjhj#yahoo.com
jwrk#yahoo.com
rankw#yahoo.com
I have tried the suggestions from this answer: Replace all occurrences of a string in a pandas dataframe (Python). But it's not working. Any help will be appreciated.
Edit:
What if a row holds more than one email? Something like:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com','fsffsnl#gmail.com']
['mklcu#yahoo.com','riserk#gmail.com', 'funkdl#yahoo.com']
Is it possible to get the output in three different columns, without square brackets and single quotes?
You can use str.strip if the values are strings:
print type(df.at[0,'email'])
<type 'str'>
df['email'] = df.email.str.strip("[]'")
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
If the values are lists, apply pd.Series:
print type(df.at[0,'email'])
<type 'list'>
df['email'] = df.email.apply(pd.Series)
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
EDIT: If you have multiple values in a list, you can use:
df1 = df['email'].apply(pd.Series).fillna('')
print df1
0 1 2
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com fsffsnl#gmail.com
4 mklcu#yahoo.com riserk#gmail.com funkdl#yahoo.com
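If the cells turn out to be strings that merely look like lists, one option is to parse them first and then spread them into columns. The sketch below assumes every cell is a valid Python list literal; it uses ast.literal_eval from the standard library, and the names parsed and wide are made up for the example.

```python
import ast
import pandas as pd

df = pd.DataFrame({"email": [
    "['jsaw#yahoo.com']",
    "['rankw#yahoo.com','fsffsnl#gmail.com']",
]})

# Turn each string such as "['a','b']" into a real Python list.
parsed = df["email"].map(ast.literal_eval)

# One column per list position; shorter rows are padded and then filled with "".
wide = pd.DataFrame(parsed.tolist(), index=df.index).fillna("")
print(wide)
```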
Try this one:
from re import findall
s = "['rankw#yahoo.com']"
m = findall(r"\[([A-Za-z0-9#'._]+)\]", s)
print(m[0].replace("'",''))
