I want to remove duplicates in a column via Pandas.
I tried df.drop_duplicates() but no luck.
How to achieve this in Pandas?
Input:
A
team=red, Manager=Travis
team=Blue, Manager=John, team=Blue
Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC
Expected_Output:
A
team=red, Manager=Travis
team=Blue, Manager=John
Manager=David, Bank=HDFC, team=XYZ
Code
df = df.drop_duplicates('A', keep='last')
You can use some data structures to achieve this result:
1) split the entries
2) convert to a set (or some other structure that cannot hold duplicates)
3) join back to a string
print(df['A'])
0 team=red, Manager=Travis
1 team=Blue, Manager=John, team=Blue
2 Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC
Name: A, dtype: object
out = (
df['A'].str.split(r',\s+')
.map(set)
.str.join(", ")
)
print(out)
0 Manager=Travis, team=red
1 team=Blue, Manager=John
2 Bank=HDFC, team=XYZ, Manager=David
Name: A, dtype: object
Alternatively, if the order of your string entries is important, you can use dict.fromkeys instead of a set, since dictionaries preserve insertion order (guaranteed by the language from Python 3.7 on, and an implementation detail of CPython 3.6).
out = (
df['A'].str.split(r',\s+')
.map(dict.fromkeys)
.str.join(", ")
)
print(out)
0 team=red, Manager=Travis
1 team=Blue, Manager=John
2 Manager=David, Bank=HDFC, team=XYZ
Name: A, dtype: object
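To see why dict.fromkeys keeps the original order while a set does not promise any, here is a minimal plain-Python sketch on one of the sample rows. The variable names are only for illustration.

```python
# One cell from the sample data, already split on ", ".
entries = "Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC".split(", ")

# A set removes duplicates but makes no promise about iteration order.
as_set = set(entries)

# dict.fromkeys keeps the first occurrence of each entry, in original order.
as_ordered = list(dict.fromkeys(entries))

print(as_ordered)
```

The set still contains the same three entries; only their order when joined back together is unspecified.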
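Try (note that drop_duplicates here runs over the whole exploded Series, so a value that also appears in another row would be dropped from that row as well; that does not happen with the sample data):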
df['A'].str.split(',').explode().str.strip(' ')\
.drop_duplicates().groupby(level=0).agg(','.join)
Output:
0 team=red,Manager=Travis
1 team=Blue,Manager=John
2 Manager=David,Bank=HDFC,team=XYZ
Name: A, dtype: object
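If the same entry can legitimately appear in more than one row, a per-row variant of the explode approach may be safer. This is only a sketch: the fourth row is made up to show the difference, and the de-duplication is done inside each group with dict.fromkeys.

```python
import pandas as pd

# Sample data plus one hypothetical extra row that repeats "team=red".
df = pd.DataFrame({"A": [
    "team=red, Manager=Travis",
    "team=Blue, Manager=John, team=Blue",
    "Manager=David, Bank=HDFC, team=XYZ, Bank=HDFC",
    "team=red",
]})

# One entry per line, keeping the original row label as the index.
parts = df["A"].str.split(",").explode().str.strip()

# De-duplicate within each original row only, preserving order.
out = parts.groupby(level=0).agg(lambda s: ", ".join(dict.fromkeys(s)))
print(out)
```

With a global drop_duplicates the last row would lose its only entry; here it is kept.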
I have a column called SSN in a CSV file with values like this
289-31-9165
I need to loop through the values in this column and replace the first five characters so it looks like this
***-**-9165
Here's the code I have so far:
emp_file = "Resources/employee_data1.csv"
emp_pd = pd.read_csv(emp_file)
new_ssn = emp_pd["SSN"].str.replace([:5], "*")
emp_pd["SSN"] = new_ssn
How do I loop through the values and replace just the first five digits with asterisks while keeping the hyphens as they are?
Similar to Mr. Me's answer, this replaces the first six characters (the first five digits and the hyphen between them) with the masked prefix and keeps the rest of the string.
emp_pd["SSN"] = emp_pd["SSN"].apply(lambda x: "***-**" + x[6:])
You can simply achieve this with the replace() method.
Example dataframe (borrowed from #AkshayNevrekar):
>>> df
ssn
0 111-22-3333
1 121-22-1123
2 345-87-3425
Result:
>>> df.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
OR
>>> df.ssn.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
Name: ssn, dtype: object
OR:
df['ssn'] = df['ssn'].str.replace(r'^\d{3}-\d{2}', "***-**", regex=True)
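Put your asterisks in front, then grab the last 4 digits of each value with the .str accessor (a plain [-4:] on the Series would select the last four rows instead).
new_ssn = '***-**-' + emp_pd["SSN"].str[-4:]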
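A small self-contained sketch of that idea, using a made-up Series rather than the asker's CSV, and assuming every value ends in exactly four digits:

```python
import pandas as pd

# Hypothetical sample values standing in for the SSN column.
ssn = pd.Series(["111-22-3333", "121-22-1123", "345-87-3425"])

# .str[-4:] slices each string; ssn[-4:] would slice the Series itself.
masked = "***-**-" + ssn.str[-4:]
print(masked)
```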
You can use regex
import re
import pandas as pd
df = pd.DataFrame({'ssn':['111-22-3333','121-22-1123','345-87-3425']})
def func(x):
    # mask the first three digits, the hyphen and the next two digits
    return re.sub(r'\d{3}-\d{2}','***-**', x)
df['ssn'] = df['ssn'].apply(func)
print(df)
Output:
ssn
0 ***-**-3333
1 ***-**-1123
2 ***-**-3425
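If the column might contain values that are not in the NNN-NN-NNNN shape, one option is to anchor the pattern on both ends so that only well-formed values are masked. This is a sketch under that assumption; the malformed sample values are invented for illustration.

```python
import pandas as pd

# One well-formed value and two hypothetical malformed ones.
ssn = pd.Series(["111-22-3333", "12-345-6789", "not available"])

# Only a full NNN-NN-NNNN match is rewritten; \1 keeps the last four digits.
masked = ssn.str.replace(r"^\d{3}-\d{2}-(\d{4})$", r"***-**-\1", regex=True)
print(masked)
```

Values that do not match are left untouched, which makes them easy to spot afterwards.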
I have a dataframe that contains userdata. There is a column that includes filenames that users have accessed. The filenames look like this:
blah-blah-blah/dss_outline.pdf
doot-doot/helper_doc.pdf
blah-blah-blah/help_file.pdf
My goal is to chop off everything after and including the / so that I can just look at the top-level programs people are examining (which the numerous different files are organized under).
So, I'm having two challenges:
1 - How do I 'grab' everything up to the '/'? I've been looking at regex, but I'm having a hard time writing the correct expression.
2 - How do I replace all of the filenames with the concatenated filename? I found that I could use df['Filename'] = df['Filename'].str.split('/')[0] to grab the proper portion, but it won't apply across the series object. That's the logic of what I want to do, but I can't figure out how to do it.
Thanks
You have a lot of solutions handy:
1) Just with split() method:
>>> df
col1
0 blah-blah-blah/dss_outline.pdf
1 doot-doot/helper_doc.pdf
2 blah-blah-blah/help_file.pdf
>>> df['col1'].str.split('/', n=1).str[0].str.strip()
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
Name: col1, dtype: object
2) You can use apply() + split()
>>> df['col1'].apply(lambda s: s.split('/')[0])
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
Name: col1, dtype: object
3) You can use rsplit() + str[0] to keep only the part before the slash:
>>> df['col1'].str.rsplit('/').str[0]
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
Name: col1, dtype: object
4) You can use pandas' native regex support with extract() (expand=False returns a Series rather than a one-column DataFrame):
>>> df['col1'] = df['col1'].str.extract('([^/]+)', expand=False)
>>> df
col1
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
OR
# df.col1.str.extract('([^/]+)', expand=False)
You may use \/.*$ to match the part you don't need and remove it.
This matches a forward slash and every following character up to the end of the string (be careful to use a multiline flag if your engine needs it!).
Or you may use ^[^/]+ to match the part you want and extract it.
This matches any run of consecutive characters other than / from the beginning of the string (again, a multiline flag may be needed!).
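Both patterns can be tried directly in pandas. The following is a rough sketch on the three sample filenames; since each value is a single-line string, no multiline flag is needed here, and the slash does not have to be escaped in Python.

```python
import re
import pandas as pd

names = pd.Series([
    "blah-blah-blah/dss_outline.pdf",
    "doot-doot/helper_doc.pdf",
    "blah-blah-blah/help_file.pdf",
])

# Pattern 1: remove the first slash and everything after it.
removed = names.str.replace(r"/.*$", "", regex=True)

# Pattern 2: extract the leading run of non-slash characters.
extracted = names.str.extract(r"^([^/]+)", expand=False)

# The same removal on a single plain string with the re module.
plain = re.sub(r"/.*$", "", "doot-doot/helper_doc.pdf")
print(removed.tolist(), extracted.tolist(), plain)
```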
Use series.apply():
>>> import pandas
>>> data = {'filename': ["blah-blah-blah/dss_outline.pdf", "doot-doot/helper_doc.pdf", "blah-blah-blah/help_file.pdf"]}
>>> df = pandas.DataFrame(data=data)
>>> df
filename
0 blah-blah-blah/dss_outline.pdf
1 doot-doot/helper_doc.pdf
2 blah-blah-blah/help_file.pdf
>>> def get_top_level_from(string):
...     return string.split('/')[0]
...
>>> series = df["filename"]
>>> series
0 blah-blah-blah/dss_outline.pdf
1 doot-doot/helper_doc.pdf
2 blah-blah-blah/help_file.pdf
Name: filename, dtype: object
>>> series.apply(get_top_level_from)
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
Name: filename, dtype: object
Code:
def get_top_level_from(string):
    return string.split('/')[0]
results = df["filename"].apply(get_top_level_from)
Use df.replace
df.replace(r'/.*$', '', regex=True)
col
0 blah-blah-blah
1 doot-doot
2 blah-blah-blah
I have this sample data in a cell:
EmployeeID
2016-CT-1028
2016-CT-1028
2017-CT-1063
2017-CT-1063
2015-CT-948
2015-CT-948
My problem is how to add a 0 to values like 2015-CT-948
so that they become 2015-CT-0948.
I tried this code:
pattern = re.compile(r'(\d\d+)-(\w\w)-(\d\d\d)')
newlist = list(filter(pattern.match, idList))
I wanted to get the values that match the regex pattern and then add the 0 with zfill(), but it's not working. Can someone give me an idea of how to do it? Is there any way to do it with regex or in pandas? Thank you!
This is one approach using zfill
Ex:
import pandas as pd
def custZfill(val):
    val = val.split("-")
    # alternative: split on the last "-" only
    # val = val.rsplit("-", 1)
    val[-1] = val[-1].zfill(4)
    return "-".join(val)
df = pd.DataFrame({"EmployeeID": ["2016-CT-1028", "2016-CT-1028",
"2017-CT-1063", "2017-CT-1063",
"2015-CT-948", "2015-CT-948"]})
print(df["EmployeeID"].apply(custZfill))
Output:
0 2016-CT-1028
1 2016-CT-1028
2 2017-CT-1063
3 2017-CT-1063
4 2015-CT-0948
5 2015-CT-0948
Name: EmployeeID, dtype: object
With pandas it can be solved with split instead of regex:
df['EmployeeID'].apply(lambda x: '-'.join(x.split('-')[:-1] + [x.split('-')[-1].zfill(4)]))
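The same split-and-pad idea can also be written with the vectorized string methods instead of apply. This is a sketch that assumes every ID contains at least one hyphen and that the part after the last hyphen is the number to pad.

```python
import pandas as pd

ids = pd.Series(["2016-CT-1028", "2017-CT-1063", "2015-CT-948"], name="EmployeeID")

# Split once from the right: column 0 is the prefix, column 1 the number.
parts = ids.str.rsplit("-", n=1, expand=True)

# Pad the number to four digits and glue the pieces back together.
padded = parts[0] + "-" + parts[1].str.zfill(4)
print(padded)
```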
In pandas, you could use str.replace
df['EmployeeID'] = df.EmployeeID.str.replace(r'-(\d{3})$', r'-0\1', regex=True)
# Output:
0 2016-CT-1028
1 2016-CT-1028
2 2017-CT-1063
3 2017-CT-1063
4 2015-CT-0948
5 2015-CT-0948
Name: EmployeeID, dtype: object
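The pattern above only handles three-digit suffixes. If shorter suffixes can occur, str.replace also accepts a callable as the replacement, so the trailing digits can be padded with zfill whatever their length. A sketch, with the one-digit ID invented for illustration:

```python
import pandas as pd

# The last value is hypothetical and only shows the general padding.
ids = pd.Series(["2016-CT-1028", "2015-CT-948", "2015-CT-7"])

# The callable receives each regex match and returns the replacement text.
padded = ids.str.replace(r"(\d+)$", lambda m: m.group(1).zfill(4), regex=True)
print(padded)
```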
If the format of the IDs is strictly defined, you can also use a simple list comprehension to do this job:
ids = [
'2017-CT-1063',
'2015-CT-948',
'2015-CT-948'
]
new_ids = [emp_id if len(emp_id) == 12 else emp_id[0:8]+'0'+emp_id[8:] for emp_id in ids]
print(new_ids)
# ['2017-CT-1063', '2015-CT-0948', '2015-CT-0948']
Here's a one-liner:
df['EmployeeID'].apply(lambda x: '-'.join(xi if i != 2 else '%04d' % int(xi) for i, xi in enumerate(x.split('-'))))
I have an SQL database which has two columns. One has the timestamp, the other holds data in JSON format
for example df:
ts data
'2017-12-18 02:30:20.553' {'name':'bob','age':10, 'location':{'town':'miami','state':'florida'}}
'2017-12-18 02:30:21.101' {'name':'dan','age':15, 'location':{'town':'new york','state':'new york'}}
'2017-12-18 02:30:21.202' {'name':'jay','age':11, 'location':{'town':'tampa','state':'florida'}}
If I do the following :
df = df['data'][0]
print (df['name'].encode('ascii', 'ignore'))
I get :
'bob'
Is there a way I can get all of the data corresponding to a JSON key for the whole column?
(i.e. for the df column 'data' get 'name')
'bob'
'dan'
'jay'
Essentially I would like to be able to make a new df column called 'name'
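You can use json_normalize (available as pd.json_normalize in current pandas; the older pd.io.json.json_normalize path is deprecated), i.e.
pd.json_normalize(df['data'].tolist())['name']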
0 bob
1 dan
2 jay
Name: name, dtype: object
IIUC, let's use apply with a lambda function to select the value from each dictionary by key:
df['data'].apply(lambda x: x['name'])
Output:
0 bob
1 dan
2 jay
Name: data, dtype: object
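To end up with an actual new column, and to reach nested keys as well, something along these lines should work. It is a sketch that assumes the data column already holds Python dicts (not raw JSON strings) and that df has the default integer index; the state column name is just an example.

```python
import pandas as pd

df = pd.DataFrame({
    "ts": ["2017-12-18 02:30:20.553", "2017-12-18 02:30:21.101", "2017-12-18 02:30:21.202"],
    "data": [
        {"name": "bob", "age": 10, "location": {"town": "miami", "state": "florida"}},
        {"name": "dan", "age": 15, "location": {"town": "new york", "state": "new york"}},
        {"name": "jay", "age": 11, "location": {"town": "tampa", "state": "florida"}},
    ],
})

# Flatten the dicts; nested keys become dotted column names.
flat = pd.json_normalize(df["data"].tolist())

# Copy the keys of interest back as ordinary columns.
df["name"] = flat["name"]
df["state"] = flat["location.state"]
print(df[["ts", "name", "state"]])
```

If the column holds JSON text instead, each value would first need to be parsed, for example with json.loads.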
I have data streaming in the following format:
from io import StringIO
data ="""\
ANI/IP
sip:5554447777#10.94.2.15
sip:10.66.7.34#6665554444
sip:3337775555#10.94.2.11
"""
import pandas as pd
df = pd.read_table(StringIO(data), sep=r'\s+', dtype=str)
What I would like to do is replace the column content with just the phone number part of the string above. I tried the suggestions from this thread like so:
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '', regex=True).astype('str')
print(df)
However, this results in:
ANI/IP
0 sip:#10.94.2.15
1 sip:#10.66.7.34
2 sip:#10.94.2.11
I need the phone numbers instead. How do I achieve this?
ANI/IP
0 5554447777
1 6665554444
2 3337775555
The regex \d{10} searches for a substring of exactly 10 consecutive digits.
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '', regex=True).astype('str')
This removes the numbers!
Note: You shouldn't do astype('str'). It's not needed: the values are already strings, which pandas stores as object dtype by default.
You want to extract these phone numbers:
In [11]: df["ANI/IP"].str.extract(r'(\d{10})', expand=False) # before overwriting!
Out[11]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
Set this as another column and you're away:
In [12]: df["phone_number"] = df["ANI/IP"].str.extract(r'(\d{10})', expand=False)
You could use Series.str.extract (pandas.core.strings.StringMethods.extract) to extract the numbers:
In [10]: df['ANI/IP'].str.extract(r"(\d{10})", expand=False)
Out[10]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
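Putting the pieces together for Python 3, a complete run could look roughly like this. It is a sketch: it uses read_csv with a whitespace separator in place of read_table, and the phone_number column name follows the earlier answer.

```python
from io import StringIO

import pandas as pd

data = """\
ANI/IP
sip:5554447777#10.94.2.15
sip:10.66.7.34#6665554444
sip:3337775555#10.94.2.11
"""

df = pd.read_csv(StringIO(data), sep=r"\s+", dtype=str)

# expand=False returns a Series holding the first 10-digit run per row.
df["phone_number"] = df["ANI/IP"].str.extract(r"(\d{10})", expand=False)
print(df)
```

The IP addresses are never picked up because the dots break their digits into runs shorter than ten.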