Sort Pandas Dataframe by substrings of a column - python

Given a DataFrame:
name email
0 Carl carl#yahoo.com
1 Bob bob#gmail.com
2 Alice alice#yahoo.com
3 David dave#hotmail.com
4 Eve eve#gmail.com
How can it be sorted according to the email's domain name (alphabetically, ascending), and then, within each domain group, according to the string before the "#"?
The result of sorting the above should then be:
name email
0 Bob bob#gmail.com
1 Eve eve#gmail.com
2 David dave#hotmail.com
3 Alice alice#yahoo.com
4 Carl carl#yahoo.com

Use:
df = df.reset_index(drop=True)
idx = df['email'].str.split('#', expand=True).sort_values([1,0]).index
df = df.reindex(idx).reset_index(drop=True)
print (df)
name email
0 Bob bob#gmail.com
1 Eve eve#gmail.com
2 David dave#hotmail.com
3 Alice alice#yahoo.com
4 Carl carl#yahoo.com
Explanation:
First, call reset_index with drop=True so the index is the unique default one.
Then split the values into a new DataFrame and sort it with sort_values.
Finally, reindex the original DataFrame to the new order.
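The steps above can be sketched end to end as follows. This is a minimal illustration that rebuilds the question's sample data; the variable names parts, order and result are only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Carl", "Bob", "Alice", "David", "Eve"],
    "email": ["carl#yahoo.com", "bob#gmail.com", "alice#yahoo.com",
              "dave#hotmail.com", "eve#gmail.com"],
})

# Column 0 holds the part before "#", column 1 holds the domain.
parts = df["email"].str.split("#", expand=True)

# Sort by domain first, then by the part before "#"; keep only the row order.
order = parts.sort_values([1, 0]).index

# Reorder the original rows and renumber them from 0.
result = df.loc[order].reset_index(drop=True)
print(result)
```

Here df already has a default RangeIndex, so the initial reset_index is not needed; with a non-unique index it would be.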

Option 1
sorted + reindex
df = df.set_index('email')
df.reindex(sorted(df.index, key=lambda x: x.split('#')[::-1])).reset_index()
email name
0 bob#gmail.com Bob
1 eve#gmail.com Eve
2 dave#hotmail.com David
3 alice#yahoo.com Alice
4 carl#yahoo.com Carl
Option 2
sorted + pd.DataFrame
As an alternative, you can ditch the reindex call from Option 1 by re-creating a new DataFrame.
pd.DataFrame(
sorted(df.values, key=lambda x: x[1].split('#')[::-1]),
columns=df.columns
)
name email
0 Bob bob#gmail.com
1 Eve eve#gmail.com
2 David dave#hotmail.com
3 Alice alice#yahoo.com
4 Carl carl#yahoo.com
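Both options rely on the same sort key: splitting on "#" and reversing the pieces puts the domain first, so Python's list comparison orders by domain and then by the part before "#". A small sketch on plain strings, where domain_then_local is a hypothetical helper name used only here:

```python
emails = ["carl#yahoo.com", "bob#gmail.com", "alice#yahoo.com",
          "dave#hotmail.com", "eve#gmail.com"]

def domain_then_local(email):
    # "bob#gmail.com" -> ["bob", "gmail.com"] -> ["gmail.com", "bob"]
    return email.split("#")[::-1]

sorted_emails = sorted(emails, key=domain_then_local)
print(sorted_emails)
```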

Related

How to auto increment counter by repeated values in a column

I have a data frame with the column name, and I need to create the column seq, which lets me identify each separate time a name appears in the data frame. It's important to preserve the order.
import pandas as pd
data = {'name': ['Tom', 'Joseph','Joseph','Joseph', 'Tom', 'Tom', 'John','Tom','Tom','John','Joseph']
, 'seq': ['Tom 0', 'Joseph 0','Joseph 0','Joseph 0', 'Tom 1', 'Tom 1', 'John 0','Tom 2','Tom 2','John 1','Joseph 1']}
df = pd.DataFrame(data)
print(df)
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
Create a boolean mask that marks where the name changes from the previous row. Use it to keep only the first row of each run before grouping by name. cumcount then numbers the runs of each name, reindex and ffill spread that number over the rest of each run, and finally the name and sequence number are concatenated.
# Boolean mask
m = df['name'].ne(df['name'].shift())
# Create sequence number
seq = df.loc[m].groupby('name').cumcount().astype(str) \
.reindex(df.index, fill_value=pd.NA).ffill()
# Concatenate name and seq
df['seq'] = df['name'] + ' ' + seq
Output:
>>> df
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
>>> m
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 False
9 True
10 True
Name: name, dtype: bool
You need to check where a new run of a name starts and then number the runs of each name using groupby and cumsum. The resulting string Series can be concatenated with str.cat:
df['seq'] = df['name'].str.cat(
df['name'].ne(df['name'].shift()).groupby(df['name']).cumsum().sub(1).astype(str),
sep=' '
)
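To see why the cumulative sum works, it can help to look at the intermediate Series. The sketch below rebuilds the question's name column; starts and run_number are names chosen for this illustration, and the cast to int before cumsum is just a precaution to make the integer result explicit.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Joseph", "Joseph", "Joseph", "Tom", "Tom",
                            "John", "Tom", "Tom", "John", "Joseph"]})

# True on the first row of every run of identical consecutive names.
starts = df["name"].ne(df["name"].shift())

# Within each name, count how many runs have started so far (0-based).
run_number = starts.astype(int).groupby(df["name"]).cumsum().sub(1)

df["seq"] = df["name"] + " " + run_number.astype(str)
print(df)
```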
Assuming your data frame is indexed sequentially (0, 1, 2, 3, ...):
Group the data frame by name.
For each group, apply a gaps-and-islands algorithm: every time the index jumps by more than 1, start a new island.
def sequencer(group):
idx = group.index.to_series()
# Every time the index has a gap >1, create a new island
return idx.diff().ne(1).cumsum().sub(1)
seq = df.groupby('name').apply(sequencer).droplevel(0).rename('seq')
df.merge(seq, left_index=True, right_index=True)
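The same gaps-and-islands idea can also be sketched without groupby.apply, by taking the per-name difference of the row positions directly. This is a variation for illustration rather than the code above; it assumes a default 0, 1, 2, ... index, and idx, new_island and island are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Tom", "Joseph", "Joseph", "Joseph", "Tom", "Tom",
                            "John", "Tom", "Tom", "John", "Joseph"]})

# Row positions as a Series so they can be grouped and differenced.
idx = df.index.to_series()

# Within each name, a step other than 1 (or no previous row) starts a new island.
new_island = idx.groupby(df["name"]).diff().ne(1)

# Number the islands of each name starting from 0.
island = new_island.astype(int).groupby(df["name"]).cumsum().sub(1)

df["seq"] = island
print(df)
```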

How can I generate a new column to group by membership in Pandas?

I have a dataframe:
df = pd.DataFrame({'name':['John','Fred','John','George','Fred']})
How can I transform this to generate a new column giving me group membership by value? Such that:
new_df = pd.DataFrame({'name':['John','Fred','John','George','Fred'], 'group':[1,2,1,3,2]})
Use factorize:
df['group'] = pd.factorize(df['name'])[0] + 1
print (df)
name group
0 John 1
1 Fred 2
2 John 1
3 George 3
4 Fred 2
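For context, pd.factorize returns two things: an array of integer codes and the unique values in order of first appearance. A short sketch of both pieces, using the question's data:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Fred", "John", "George", "Fred"]})

# codes are 0-based positions into uniques, assigned in order of first appearance.
codes, uniques = pd.factorize(df["name"])

# Shift by one so the group numbers start at 1, as in the question.
df["group"] = codes + 1
print(df)
print(list(uniques))
```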

Select Row by Username with Pandas

I have a Table with multiple users and the data belonging to them.
Now I want to create separate tables for each user like this:
Each account belonging to the users has a different ID so I can't use the ID to select.
How can I select all the rows belonging to one specific name in the "User" column and then create a separate table?
I would also like to take data out of a column and split it into two new columns.
One example would be an email like:
John.tomson#email.com, split at the dot to create two new columns "Name" and "Surname".
Breaking down by User
df.groupby('User').get_group('John')
ID User Email
0 1 John john.tomson#email.com
1 2 John john.tomson#email.com
2 3 John john.tomson#email.com
Can also be done in a loop
grp = df.groupby('User')
for group in grp.groups:
print(grp.get_group(group))
Email ID User
3 david.matty#email.com 4 David
4 david.matty#email.com 5 David
Email ID User
5 fred.brainy#email.com 6 Fred
Email ID User
0 john.tomson#email.com 1 John
1 john.tomson#email.com 2 John
2 john.tomson#email.com 3 John
Splitting the Email column
email_df = df['Email'].str.extract(r'(.+)\.(.+)#(.+)')
pd.concat([df, email_df], axis=1)
Email ID User 0 1 2
0 john.tomson#email.com 1 John john tomson email.com
1 john.tomson#email.com 2 John john tomson email.com
2 john.tomson#email.com 3 John john tomson email.com
3 david.matty#email.com 4 David david matty email.com
4 david.matty#email.com 5 David david matty email.com
5 fred.brainy#email.com 6 Fred fred brainy email.com
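Both parts of the question can be put together in one small sketch. The sample frame below is reconstructed from the output shown above, and the dictionary tables plus the named capture groups Name and Surname are choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "User": ["John", "John", "John", "David", "David", "Fred"],
    "Email": ["john.tomson#email.com"] * 3
             + ["david.matty#email.com"] * 2
             + ["fred.brainy#email.com"],
})

# One separate table per user, keyed by the user name.
tables = {user: group for user, group in df.groupby("User")}

# Named groups in the pattern become the column names of the extracted frame.
names = df["Email"].str.extract(r"(?P<Name>[^.]+)\.(?P<Surname>[^#]+)#")
df = pd.concat([df, names], axis=1)
print(df)
```

tables["John"] then holds the three rows for John.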

Check if value from one dataframe exists in another dataframe

I have 2 dataframes.
Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad']})
Df2 = pd.DataFrame({'IDs': ['Jake', 'John', 'Marc', 'Tony', 'Bob']})
I want to loop over every row in Df1['name'] and check if each name is somewhere in Df2['IDs'].
The result should be 1 if the name is in there and 0 if it is not, like so:
Marc 1
Jake 1
Sam 0
Brad 0
Thank you.
Use isin
Df1.name.isin(Df2.IDs).astype(int)
0 1
1 1
2 0
3 0
Name: name, dtype: int32
Show result in data frame
Df1.assign(InDf2=Df1.name.isin(Df2.IDs).astype(int))
name InDf2
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
In a Series object
pd.Series(Df1.name.isin(Df2.IDs).values.astype(int), Df1.name.values)
Marc 1
Jake 1
Sam 0
Brad 0
dtype: int32
This should do it:
Df1 = Df1.assign(result=Df1['name'].isin(Df2['IDs']).astype(int))
By using merge
s=Df1.merge(Df2,left_on='name',right_on='IDs',how='left')
s.IDs=s.IDs.notnull().astype(int)
s
Out[68]:
name IDs
0 Marc 1
1 Jake 1
2 Sam 0
3 Brad 0
This is one way. Convert to set for O(1) lookup and use astype(int) to represent Boolean values as integers.
values = set(Df2['IDs'])
Df1['Match'] = Df1['name'].isin(values).astype(int)
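For reference, here is a self-contained version of the isin approach that can be run as-is. It uses the question's data; the column name Match follows the snippet above.

```python
import pandas as pd

Df1 = pd.DataFrame({"name": ["Marc", "Jake", "Sam", "Brad"]})
Df2 = pd.DataFrame({"IDs": ["Jake", "John", "Marc", "Tony", "Bob"]})

# isin gives True/False per row of Df1; astype(int) turns that into 1/0.
Df1["Match"] = Df1["name"].isin(Df2["IDs"]).astype(int)
print(Df1)
```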

Set value of a row in a Pandas dataframe equal to that of a row in a different dataframe

I have two dataframes. Both have the same number of columns and contain text data. The problem is that the data in the second dataframe is missing details:
A B
1 Bob Hoskins
2 Laura Hogan
3 Tom Jones
A B
1 Bob x
2 Bob x
3 Bob x
4 Laura x
5 Laura x
6 Tom x
What is the fastest way in Pandas to set the value of the 'B' column in the second dataframe equal to its respective conditional value in the first? So any row where 'A' = 'Bob' will have 'B' set to Hoskins, Laura to Hogan and so on? The second dataframe is quite large as well, with 100,000 rows so a speedy solution is preferred.
Perform a left join on the second df:
# Keep only 'A' from df2 so its placeholder 'B' does not clash with df1's 'B'
output = df2[['A']].merge(df1, how="left", on="A")
Desired df:
A B
0 Bob Hoskins
1 Bob Hoskins
2 Bob Hoskins
3 Laura Hogan
4 Laura Hogan
5 Tom Jones
You can set A as the index of the first data frame and then select rows using the A values of the second data frame:
df1.set_index('A').loc[df2['A']].reset_index()
# A B
# 0 Bob Hoskins
# 1 Bob Hoskins
# 2 Bob Hoskins
# 3 Laura Hogan
# 4 Laura Hogan
# 5 Tom Jones
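A different technique that is sometimes used for this kind of lookup is Series.map with a Series indexed by the key. The sketch below is not one of the answers above; it assumes each value of A appears only once in the first frame, and lookup is a name chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["Bob", "Laura", "Tom"],
                    "B": ["Hoskins", "Hogan", "Jones"]})
df2 = pd.DataFrame({"A": ["Bob", "Bob", "Bob", "Laura", "Laura", "Tom"],
                    "B": ["x"] * 6})

# Series mapping each A value to its B value (assumes A is unique in df1).
lookup = df1.set_index("A")["B"]

# Overwrite the placeholder column by looking up each A value.
df2["B"] = df2["A"].map(lookup)
print(df2)
```

Values of A that are missing from df1 would come out as NaN rather than raising an error.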
