Select Row by Username with Pandas - python

I have a table with multiple users and the data belonging to them.
Now I want to create separate tables for each user.
Each account belonging to a user has a different ID, so I can't use the ID to select.
How can I select all rows belonging to one specific name in the "User" column and then create a separate table?
I would also like to take data out of a column and split it into two new columns.
For example, take an email like john.tomson@email.com, split it at the dot, and create two new columns, "Name" and "Surname".

Breaking down by User
df.groupby('User').get_group('John')
ID User Email
0 1 John john.tomson@email.com
1 2 John john.tomson@email.com
2 3 John john.tomson@email.com
This can also be done in a loop:
grp = df.groupby('User')
for group in grp.groups:
    print(grp.get_group(group))
Email ID User
3 david.matty@email.com 4 David
4 david.matty@email.com 5 David
Email ID User
5 fred.brainy@email.com 6 Fred
Email ID User
0 john.tomson@email.com 1 John
1 john.tomson@email.com 2 John
2 john.tomson@email.com 3 John
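To actually materialize one table per user, iterating the groupby is enough. A minimal sketch (tables is an illustrative name, not from the original):
# One DataFrame per user, keyed by the value in the "User" column
tables = {name: group for name, group in df.groupby('User')}
tables['John']  # the John-only table shown above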
Splitting the Email column
email_df = df['Email'].str.split(r'(.+)\.(.+)@', expand=True)
pd.concat([df, email_df], axis=1)
Email ID User 0 1 2
0 john.tomson@email.com 1 John john tomson email.com
1 john.tomson@email.com 2 John john tomson email.com
2 john.tomson@email.com 3 John john tomson email.com
3 david.matty@email.com 4 David david matty email.com
4 david.matty@email.com 5 David david matty email.com
5 fred.brainy@email.com 6 Fred fred brainy email.com
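If only the "Name" and "Surname" columns asked for in the question are needed, str.extract with named groups is a tidier alternative. A sketch, with the column names taken from the question:
# Capture the part before the dot and the part between dot and "@"
names = df['Email'].str.extract(r'(?P<Name>[^.]+)\.(?P<Surname>[^@]+)@')
df = pd.concat([df, names], axis=1)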

Related

Python Pandas: split text in a cell into multiple rows

I have a dataframe in the format sketched below.
How can I split the "Items ordered" column into multiple rows, one item per row?
Thanks in advance!
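The question's example table didn't survive formatting; judging from the expected output, the input presumably looks like this (a reconstruction, with assumed literal values):
import pandas as pd

df = pd.DataFrame({
    'Transaction ID': [1, 2, 3],
    'Client Name': ['Sam', 'Peter', 'Han'],
    'Items ordered': ['Fruit; Water; Coffee;',
                      'Fruit; Soup; Sandwich;',
                      'Fruit; Coffee; Ice Cream;'],
})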
You can use:
>>> (df.assign(**{'Items ordered': lambda x: x['Items ordered'].str.rstrip(';').str.split(r';\s*')})
...     .explode('Items ordered', ignore_index=True))
Transaction ID Client Name Items ordered
0 1 Sam Fruit
1 1 Sam Water
2 1 Sam Coffee
3 2 Peter Fruit
4 2 Peter Soup
5 2 Peter Sandwich
6 3 Han Fruit
7 3 Han Coffee
8 3 Han Ice Cream
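Note the rstrip(';'): without it, the trailing separator would leave an empty string after the split, and explode would emit an extra empty row for each transaction.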

Add blank row to a pandas dataframe after every period

I have a pandas dataframe that is quite similar to this:-
name    status
eric    single
.       0
xavier  couple
sarah   couple
.       0
aaron   divorced
.       0
I would like to add a new row after every period as below:-
name    status
eric    single
.       0

xavier  couple
sarah   couple
.       0

aaron   divorced
.       0

Appreciate any guidance on this!
You can use groupby with a key that starts a new group on the row after each '.', and apply a concatenation of each group with a dummy blank row:
(df
 .groupby(df['name'].shift().eq('.').cumsum(), group_keys=False)
 .apply(lambda g: pd.concat([g, pd.DataFrame(index=[0], columns=g.columns)]).fillna(''))
)
output:
name status
0 eric single
1 . 0
0
2 xavier couple
3 sarah couple
4 . 0
0
5 aaron divorced
6 . 0
0
Or extract the rows with '.', blank them out, and concat with a stable index sort:
df2 = df[df['name'].eq('.')].copy()
df2.loc[:] = ''
pd.concat([df, df2]).sort_index(kind='stable')
output:
name status
0 eric single
1 . 0
1
2 xavier couple
3 sarah couple
4 . 0
4
5 aaron divorced
6 . 0
6

Map names to column values pandas

The Problem
I had a hard time phrasing this question, but essentially I have a series of X columns that represent weights at specific points in time, and another set of X columns that represent the names of the people that were measured.
That table looks like this (there are more than two columns; this is just a toy example):
a_weight  b_weight  a_name  b_name
10        5         John    Michael
1         2         Jake    Michelle
21        3         Alice   Bob
2         1         Ashley  Brian
What I Want
I want two columns containing the maximum weight and the corresponding name at each point in time. I need this vectorized because there is a lot of data; I can do it with a for loop or an .apply(lambda row: row[col]), but it is very slow.
So the final table would look something like this:
a_weight  b_weight  a_name  b_name    max_weight  max_name
10        5         John    Michael   a_weight    John
1         2         Jake    Michelle  b_weight    Michelle
21        3         Alice   Bob       a_weight    Alice
2         1         Ashley  Brian     a_weight    Ashley
What I've Tried
I've been able to create a mirror df_subset with just the weights, then use the idxmax function to make a max_weight column:
df_subset = df[[c for c in df.columns if "weight" in c]]
max_weight_col = df_subset.idxmax(axis="columns")
This returns a column that is the max_weight column in the section above. Now I run:
df["max_name_col"] = max_weight_col.str.replace("_weight","_name")
and I have this:
a_weight  b_weight  a_name  b_name    max_weight  max_name_col
10        5         John    Michael   a_weight    a_name
1         2         Jake    Michelle  b_weight    b_name
21        3         Alice   Bob       a_weight    a_name
2         1         Ashley  Brian     a_weight    a_name
I basically want to run code like the loop below, but without the for-loop:
df["max_name"] = [row[row["max_name_col"]] for _, row in df.iterrows()]
How do I move on from here? I feel like I'm so close but I'm stuck. Any help? I'm also open to throwing away the entire code and doing something else if there's a faster way.
You can do that, just pass the weights to numpy argmax:
import numpy as np

v1 = df.filter(like='weight').to_numpy()
v2 = df.filter(like='name').to_numpy()
idx = v1.argmax(axis=1)
df['max_weight'] = v1[np.arange(len(df)), idx]
df['max_name'] = v2[np.arange(len(df)), idx]
df
Out[921]:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael 10 John
1 1 2 Jake Michelle 2 Michelle
2 21 3 Alice Bob 21 Alice
3 2 1 Ashley Brian 2 Ashley
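Note that this puts the maximum value (e.g. 10) in max_weight rather than the column name shown in the desired output; the idxmax-based answers below return the column name instead.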
This would do the trick assuming you only have 2 weight columns:
df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
mask = df["max_weight"] == "a_weight"
df.loc[mask, "max_name"] = df[mask]["a_name"]
df.loc[~mask, "max_name"] = df[~mask]["b_name"]
We could use idxmax to find the winning weight column, then numpy advanced indexing to pick the matching name (this assumes the name columns appear in the same order as the weight columns):
weight_cols = df.columns[df.columns.str.contains('weight')]
df['max_weight'] = df[weight_cols].idxmax(axis=1)
df['max_name'] = (df.loc[:, df.columns.str.contains('name')].to_numpy()
                  [np.arange(len(df)), weight_cols.get_indexer(df['max_weight'])])
Output:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael a_weight John
1 1 2 Jake Michelle b_weight Michelle
2 21 3 Alice Bob a_weight Alice
3 2 1 Ashley Brian a_weight Ashley

Correct way to look up missing values from one dataframe in another

I have a dataframe called "df1" with two fields, "name" and "team". I want to add an additional column called "user_id", based on each person's user_id, which can be found in a separate dataframe depending on that person's "team".
The "user_id" values live in other dataframes, separated by the team field and named "df_a", "df_b", "df_c", etc. Each of these dataframes contains the same three fields ("name", "team", and "user_id"), but only the names from that team, and each is complete (no NaNs in any column).
I was wondering what the most pythonic way is to add the "user_id" column to df1 using the data from my team dataframes (there could be many team dataframes, but each is relatively small). So far I've tried looping through each team dataframe and merging it onto df1 on the "name" field using inner and left merges, but the output either drops rows from the original dataframe or produces many "user_id_x", "user_id_y" columns full of NaNs.
Example dataframes:
df1:
name team
0 john doe a
2 jane doe b
3 amy doe b
4 jane smith c
5 john johnson c
df_a:
name team user_id
0 john doe a 15368
1 john smith a 15382
2 sally smith a 15212
df_b:
name team user_id
0 jane doe b 6325
1 amy doe b 6164
2 sally doe b 6294
df_c:
name team user_id
0 steve doe c 52956
1 jane smith c 83635
2 john johnson c 54871
This is my desired output after taking the user_id values from each team dataframe:
name team user_id
0 john doe a 15368
2 jane doe b 6325
3 amy doe b 6164
4 jane smith c 83635
5 john johnson c 54871
Let me know if there is anything I can clarify, and thanks in advance!
Try this:
main_df = pd.concat([df_a, df_b, df_c], ignore_index=True)
df1 = pd.merge(df1, main_df, how='left', on=['name', 'team'])
Concat all the df_x dataframes, then perform a left join.
Output:
name team user_id
0 john doe a 15368
1 jane doe b 6325
2 amy doe b 6164
3 jane smith c 83635
4 john johnson c 54871
Output for print(main_df):
name team user_id
0 john doe a 15368
1 john smith a 15382
2 sally smith a 15212
3 jane doe b 6325
4 amy doe b 6164
5 sally doe b 6294
6 steve doe c 52956
7 jane smith c 83635
8 john johnson c 54871
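If names happen to be unique across all teams, a merge-free alternative is a plain map. A sketch (lookup is an illustrative name):
# Map each name straight to its user_id; assumes no duplicate names
lookup = main_df.set_index('name')['user_id']
df1['user_id'] = df1['name'].map(lookup)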

Sort Pandas Dataframe by substrings of a column

Given a DataFrame:
name email
0 Carl carl@yahoo.com
1 Bob bob@gmail.com
2 Alice alice@yahoo.com
3 David dave@hotmail.com
4 Eve eve@gmail.com
How can it be sorted according to the email's domain name (alphabetically, ascending), and then, within each domain group, according to the string before the "@"?
The result of sorting the above should then be:
name email
0 Bob bob@gmail.com
1 Eve eve@gmail.com
2 David dave@hotmail.com
3 Alice alice@yahoo.com
4 Carl carl@yahoo.com
Use:
df = df.reset_index(drop=True)
idx = df['email'].str.split('@', expand=True).sort_values([1,0]).index
df = df.reindex(idx).reset_index(drop=True)
print (df)
name email
0 Bob bob@gmail.com
1 Eve eve@gmail.com
2 David dave@hotmail.com
3 Alice alice@yahoo.com
4 Carl carl@yahoo.com
Explanation:
First reset_index with drop=True to get unique default indices.
Then split the values into a new DataFrame and sort_values.
Last, reindex to the new order.
Option 1
sorted + reindex
df = df.set_index('email')
df.reindex(sorted(df.index, key=lambda x: x.split('@')[::-1])).reset_index()
email name
0 bob@gmail.com Bob
1 eve@gmail.com Eve
2 dave@hotmail.com David
3 alice@yahoo.com Alice
4 carl@yahoo.com Carl
Option 2
sorted + pd.DataFrame
As an alternative, you can ditch the reindex call from Option 1 by re-creating a new DataFrame.
pd.DataFrame(
    sorted(df.values, key=lambda x: x[1].split('@')[::-1]),
    columns=df.columns
)
name email
0 Bob bob@gmail.com
1 Eve eve@gmail.com
2 David dave@hotmail.com
3 Alice alice@yahoo.com
4 Carl carl@yahoo.com
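With pandas >= 1.1, the same two-level sort can be written as a single sort_values call with a key. A sketch:
# Sort by (domain, local-part) tuples derived from each address
out = df.sort_values(
    'email',
    key=lambda s: s.map(lambda e: tuple(reversed(e.split('@')))),
    ignore_index=True,
)
The key maps each address to a (domain, local-part) tuple, so rows sort by domain first and by the part before the "@" within each domain.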
