The Problem
I had a hard time phrasing this question, but essentially I have a series of X columns that represent weights at specific points in time, and another set of X columns that represent the names of the people who were measured.
That table looks like this (there are more than two columns in the real data; this is just a toy example):
a_weight  b_weight  a_name  b_name
10        5         John    Michael
1         2         Jake    Michelle
21        3         Alice   Bob
2         1         Ashley  Brian
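For reference, a minimal sketch that reconstructs this toy frame (column names taken from the example above; the real data has more columns):

import pandas as pd

df = pd.DataFrame({
    "a_weight": [10, 1, 21, 2],
    "b_weight": [5, 2, 3, 1],
    "a_name": ["John", "Jake", "Alice", "Ashley"],
    "b_name": ["Michael", "Michelle", "Bob", "Brian"],
})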
What I Want
I want two columns with the maximum weight and the corresponding name at each point in time. I want this to be vectorized because the dataset is large. I can do it with a for loop or with .apply(lambda row: row[col]), but it is very slow.
So the final table would look something like this:
a_weight  b_weight  a_name  b_name    max_weight  max_name
10        5         John    Michael   a_weight    John
1         2         Jake    Michelle  b_weight    Michelle
21        3         Alice   Bob       a_weight    Alice
2         1         Ashley  Brian     a_weight    Ashley
What I've Tried
I've been able to create a mirror df_subset with just the weights, then use the idxmax function to make a max_weight column:
df_subset = df[[c for c in df.columns if "weight" in c]]
max_weight_col = df_subset.idxmax(axis="columns")
This returns a Series that matches the max_weight column shown above. Now I run:
df["max_name_col"] = max_weight_col.str.replace("_weight","_name")
and I have this:
a_weight  b_weight  a_name  b_name    max_weight  max_name_col
10        5         John    Michael   a_weight    a_name
1         2         Jake    Michelle  b_weight    b_name
21        3         Alice   Bob       a_weight    a_name
2         1         Ashley  Brian     a_weight    a_name
I basically want to run something like the code below, but without the row-wise loop:
df["max_name"] = [row[row["max_name_col"]] for _, row in df.iterrows()]
How do I move on from here? I feel like I'm so close, but I'm stuck. I'm also open to throwing away this code entirely if there's a faster approach.
You can certainly do that; just drop down to NumPy and use argmax:
v1 = df.filter(like='weight').values
v2 = df.filter(like='name').values
# note: indexing rows with df.index assumes a default RangeIndex (0, 1, 2, ...)
df['max_weight'] = v1[df.index, v1.argmax(1)]
df['max_name'] = v2[df.index, v1.argmax(1)]
df
Out[921]:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael 10 John
1 1 2 Jake Michelle 2 Michelle
2 21 3 Alice Bob 21 Alice
3 2 1 Ashley Brian 2 Ashley
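If the frame doesn't have a default RangeIndex, a positional variant of the same idea avoids indexing with df.index (a sketch):

import numpy as np

pos = v1.argmax(axis=1)    # column position of the winning weight per row
rows = np.arange(len(df))  # positional row indexer, independent of df.index
df['max_weight'] = v1[rows, pos]
df['max_name'] = v2[rows, pos]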
This would do the trick assuming you only have 2 weight columns:
df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
mask = df["max_weight"] == "a_weight"
df.loc[mask, "max_name"] = df[mask]["a_name"]
df.loc[~mask, "max_name"] = df[~mask]["b_name"]
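For the two-column case, the same mask also works with numpy.where instead of the two .loc assignments (a sketch):

import numpy as np

# pick a_name where a_weight won, b_name otherwise
df["max_name"] = np.where(mask, df["a_name"], df["b_name"])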
We could use idxmax to find the column names, then use factorize + NumPy advanced indexing to get the names. Note that factorize assigns codes in order of first appearance, so this relies on the first occurrences in max_weight matching the order of the name columns (which holds for this data):
import numpy as np

df['max_weight'] = df.loc[:, df.columns.str.contains('weight')].idxmax(axis=1)
df['max_name'] = (df.loc[:, df.columns.str.contains('name')].to_numpy()
                  [np.arange(len(df)), df['max_weight'].factorize()[0]])
Output:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael a_weight John
1 1 2 Jake Michelle b_weight Michelle
2 21 3 Alice Bob a_weight Alice
3 2 1 Ashley Brian a_weight Ashley
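If you'd rather not rely on factorize's first-appearance ordering, the column positions can be looked up explicitly. A sketch that replaces the max_name step above (run it before max_name is added, since the selector matches any column containing 'name'):

import numpy as np

names = df.loc[:, df.columns.str.contains('name')]
# map each winning weight column to the position of its matching name column
pos = names.columns.get_indexer(df['max_weight'].str.replace('_weight', '_name'))
df['max_name'] = names.to_numpy()[np.arange(len(df)), pos]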
Related
I have a data frame with the column name, and I need to create the column seq, which identifies the separate runs in which a name appears in the data frame. It's important to preserve the order.
import pandas as pd
data = {'name': ['Tom', 'Joseph','Joseph','Joseph', 'Tom', 'Tom', 'John','Tom','Tom','John','Joseph']
, 'seq': ['Tom 0', 'Joseph 0','Joseph 0','Joseph 0', 'Tom 1', 'Tom 1', 'John 0','Tom 2','Tom 2','John 1','Joseph 1']}
df = pd.DataFrame(data)
print(df)
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
Create a boolean mask that flags where the name changes from the previous row. Then keep only the first row of each run before grouping by name; cumcount increments the sequence number per name, and finally name and sequence number are concatenated.
# Boolean mask
m = df['name'].ne(df['name'].shift())
# Create sequence number
seq = df.loc[m].groupby('name').cumcount().astype(str) \
        .reindex(df.index, fill_value=pd.NA).ffill()
# Concatenate name and seq
df['seq'] = df['name'] + ' ' + seq
Output:
>>> df
name seq
0 Tom Tom 0
1 Joseph Joseph 0
2 Joseph Joseph 0
3 Joseph Joseph 0
4 Tom Tom 1
5 Tom Tom 1
6 John John 0
7 Tom Tom 2
8 Tom Tom 2
9 John John 1
10 Joseph Joseph 1
>>> m
0 True
1 True
2 False
3 False
4 True
5 False
6 True
7 True
8 False
9 True
10 True
Name: name, dtype: bool
You need to check for the start of a new run of each name, then create a per-name run index using groupby and cumsum; the resulting string Series can be concatenated with str.cat:
df['seq'] = df['name'].str.cat(
    df['name'].ne(df['name'].shift()).groupby(df['name']).cumsum().sub(1).astype(str),
    sep=' '
)
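Unpacked, the one-liner does the following (a sketch of the same logic):

new_run = df['name'].ne(df['name'].shift())        # True at the start of each run
run_number = new_run.groupby(df['name']).cumsum()  # per-name count of runs so far (1-based)
df['seq'] = df['name'] + ' ' + run_number.sub(1).astype(str)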
Assuming your data frame is indexed sequentially (0, 1, 2, 3, ...):
Group the data frame by name
For each group, apply a gap-and-island algorithm: every time the index jumps by more than 1, create a new island
def sequencer(group):
    idx = group.index.to_series()
    # Every time the index has a gap > 1, start a new island
    return idx.diff().ne(1).cumsum().sub(1)

num = df.groupby('name').apply(sequencer).droplevel(0)
# Concatenate name and island number to get the desired strings
df['seq'] = df['name'] + ' ' + num.astype(str)
I'm trying to find a way to insert a zero into a pandas dataframe where the result of the .count() aggregate function is < 1. I've tried a condition that looks for null/None values and a simple < 1 comparison. So far I can only count instances where a categorical value actually exists. Below is some example code to demonstrate my issue:
import pandas as pd

data = {'Person': ['Jim', 'Jim', 'Jim', 'Jim', 'Jim', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob'],
        'Result': ['Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Bad', 'Good', 'Bad', 'Bad']}
dtf = pd.DataFrame.from_dict(data)
names = ['Jim','Bob']
append = []
for i in names:
    good = dtf[dtf['Person']==i]
    good = good[good['Result']=='Good']
    if good['Result'].count() > 0:
        good.insert(2,"Count",good['Result'].count())
    elif good['Result'].count() < 1:
        good.insert(2,"Count",0)
    bad = dtf[dtf['Person']==i]
    bad = bad[bad['Result']=='Bad']
    if bad['Result'].count() > 0:
        bad.insert(2,"Count",bad['Result'].count())
    elif bad['Result'].count() < 1:
        bad.insert(2,"Count",0)
    res = [good,bad]
    res = pd.concat(res)
    append.append(res)
    print(res)
The current output is:
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
Person Result Count
5 Bob Good 2
7 Bob Good 2
6 Bob Bad 3
8 Bob Bad 3
9 Bob Bad 3
What I am trying to achieve is a zero count for Jim for the 'Bad' value in the dtf['Result'] column. Like this:
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0
Person Result Count
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3
I hope this makes sense.
First create a MultiIndex mi from the product of Person and Result so that missing combinations are kept. Then count all groups with size and reindex by that MultiIndex, filling missing combinations with 0. Finally, merge the counts back onto df using an outer join (the union of keys from both):
mi = pd.MultiIndex.from_product([df["Person"].unique(),
                                 df["Result"].unique()],
                                names=["Person", "Result"])
out = df.groupby(["Person", "Result"]) \
        .size() \
        .reindex(mi, fill_value=0) \
        .rename("Count") \
        .reset_index()
out = out.merge(df, on=["Person", "Result"], how="outer")
>>> out
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3
To recover the names and append structures from your original loop:
names, append = list(zip(*out.groupby("Person")))
>>> names
('Bob', 'Jim')
>>> append
( Person Result Count
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3,
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0)
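A crosstab gives the same zero-filled counts without building the MultiIndex by hand (a sketch; crosstab fills absent Person/Result combinations with 0):

counts = (pd.crosstab(df["Person"], df["Result"])
            .stack()
            .rename("Count")
            .reset_index())
out = df.merge(counts, on=["Person", "Result"], how="outer")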
I have a large dataframe with a bunch of names that appear in two columns. It is laid out as follows:
Winner Value_W Loser Value_L
Jack 5 Sally -3
Sally 2 Max -1
Max 4 Jack -2
Lucy 1 Jack -6
Jack 6 Henry -3
Henry 5 Lucy -4
I then filtered on the 'Winner' and 'Loser' columns to get all rows in which Jack appears, using the following code:
df.loc[(df['Winner'] == 'Jack') | (df['Loser'] == 'Jack')]
Which returns the following:
Winner Value_W Loser Value_L
Jack 5 Sally -3
Max 4 Jack -2
Lucy 1 Jack -6
Jack 6 Henry -3
I am now looking to generate one column which only has Jack and his corresponding values.
So in this example, the output I want is:
New_1 New_2
Jack 5
Jack -2
Jack -6
Jack 6
I am unsure of how to do this.
You could use wide_to_long after renaming the columns slightly. This lets you capture additional information, like whether the row was a win or a loss; if you don't care about that, just do df1 = df1.reset_index(drop=True) afterwards.
d = {'Winner': 'Person_W', 'Loser': 'Person_L'}
df1 = pd.wide_to_long(df.rename(columns=d).reset_index(),
                      stubnames=['Person', 'Value'],
                      i='index',
                      j='Win_Lose',
                      sep='_',
                      suffix='.*')
df1[df1.Person == 'Jack']
# Person Value
#index Win_Lose
#0 W Jack 5
#4 W Jack 6
#2 L Jack -2
#3 L Jack -6
If that specific ordering is important, we still have the original Index so:
df1.sort_index(level=0).query('Person == "Jack"').reset_index(drop=True)
# Person Value
#0 Jack 5
#1 Jack -2
#2 Jack -6
#3 Jack 6
wide_to_long is the safer choice, but there is also a hidden function called lreshape (it is undocumented and may be removed in a future pandas release):
pd.lreshape(df,{'name':['Winner','Loser'],'v':['Value_W','Value_L']}).query("name=='Jack'")
Out[75]:
name v
0 Jack 5
4 Jack 6
8 Jack -2
9 Jack -6
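If you'd rather avoid the undocumented API, the same long layout can be built with a plain concat (a sketch; long_df is a hypothetical name):

long_df = pd.concat([
    df[['Winner', 'Value_W']].set_axis(['name', 'v'], axis=1),
    df[['Loser', 'Value_L']].set_axis(['name', 'v'], axis=1),
], ignore_index=True)
long_df.query("name == 'Jack'")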
name = 'Jack'
>>> pd.DataFrame({
...     'New_1': name,
...     'New_2': df.loc[df['Winner'].eq(name), 'Value_W'].tolist()
...              + df.loc[df['Loser'].eq(name), 'Value_L'].tolist()})
New_1 New_2
0 Jack 5
1 Jack 6
2 Jack -2
3 Jack -6
I think you could use numpy.where after you've selected only the rows containing 'Jack':
import numpy as np
df['New_2'] = np.where(df['Winner'] == 'Jack', df['Value_W'], df['Value_L'])
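Putting that together with the filter from the question (a sketch):

import numpy as np

jack = df.loc[(df['Winner'] == 'Jack') | (df['Loser'] == 'Jack')].copy()
jack['New_1'] = 'Jack'
# take Value_W where Jack won, Value_L where he lost
jack['New_2'] = np.where(jack['Winner'] == 'Jack', jack['Value_W'], jack['Value_L'])
print(jack[['New_1', 'New_2']].reset_index(drop=True))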
Possibly:
Split it into two dataframes
Rename some columns
Concatenate them
Possibly drop extra rows
df_win = df[['Winner', 'Value_W']].rename(columns={'Winner': 'Name', 'Value_W': 'Value'})
df_lose = df[['Loser', 'Value_L']].rename(columns={'Loser': 'Name', 'Value_L': 'Value'})
out = pd.concat([df_win, df_lose], ignore_index=True)
out.loc[out.Name == 'Jack']
I do really like ALollz's answer though.
Alternatively, DataFrame.where + DataFrame.shift with axis=1:
new_df = (df.where(df.eq('Jack').shift(axis=1))
            .sum(axis=1, min_count=1)
            .dropna()
            .to_frame('value'))
new_df.insert(0,'Name','Jack')
print(new_df)
Name value
0 Jack 5.0
2 Jack -2.0
3 Jack -6.0
4 Jack 6.0
Given a DataFrame:
name email
0 Carl carl@yahoo.com
1 Bob bob@gmail.com
2 Alice alice@yahoo.com
3 David dave@hotmail.com
4 Eve eve@gmail.com
How can it be sorted according to the email's domain name (alphabetically, ascending), and then, within each domain group, according to the string before the "@"?
The result of sorting the above should then be:
name email
0 Bob bob@gmail.com
1 Eve eve@gmail.com
2 David dave@hotmail.com
3 Alice alice@yahoo.com
4 Carl carl@yahoo.com
Use:
df = df.reset_index(drop=True)
idx = df['email'].str.split('@', expand=True).sort_values([1,0]).index
df = df.reindex(idx).reset_index(drop=True)
print (df)
name email
0 Bob bob@gmail.com
1 Eve eve@gmail.com
2 David dave@hotmail.com
3 Alice alice@yahoo.com
4 Carl carl@yahoo.com
Explanation:
First reset_index with drop=True to get unique default indices
Then split the values into a new DataFrame and sort_values by domain (column 1), then by local part (column 0)
Last reindex to the new order
Option 1
sorted + reindex
df = df.set_index('email')
df.reindex(sorted(df.index, key=lambda x: x.split('@')[::-1])).reset_index()
email name
0 bob@gmail.com Bob
1 eve@gmail.com Eve
2 dave@hotmail.com David
3 alice@yahoo.com Alice
4 carl@yahoo.com Carl
Option 2
sorted + pd.DataFrame
As an alternative, you can ditch the reindex call from Option 1 by re-creating a new DataFrame.
pd.DataFrame(
    sorted(df.values, key=lambda x: x[1].split('@')[::-1]),
    columns=df.columns
)
name email
0 Bob bob@gmail.com
1 Eve eve@gmail.com
2 David dave@hotmail.com
3 Alice alice@yahoo.com
4 Carl carl@yahoo.com
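On pandas 1.1+, sort_values also accepts a key callable, which makes this a one-liner (a sketch; the key reverses each split email so the domain sorts first):

df_sorted = (df.sort_values('email',
                            key=lambda s: s.str.split('@').str[::-1].map(tuple))
               .reset_index(drop=True))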
I have two dataframes, both with the same number of columns, containing text data. The problem is that the second dataframe is missing the details in its 'B' column:
A B
1 Bob Hoskins
2 Laura Hogan
3 Tom Jones
A B
1 Bob x
2 Bob x
3 Bob x
4 Laura x
5 Laura x
6 Tom x
What is the fastest way in Pandas to set the value of the 'B' column in the second dataframe to its corresponding value in the first, so that any row where 'A' is 'Bob' gets 'B' set to 'Hoskins', 'Laura' to 'Hogan', and so on? The second dataframe is quite large (100,000 rows), so a speedy solution is preferred.
Perform a left join on the second df:
output = df2.drop(columns="B").merge(df1, how="left", on="A")
Desired output:
A B
0 Bob Hoskins
1 Bob Hoskins
2 Bob Hoskins
3 Laura Hogan
4 Laura Hogan
5 Tom Jones
You can set A as index for the first data frame and then filter rows based on the index:
df.set_index('A').loc[df1.A].reset_index()
# A B
# 0 Bob Hoskins
# 1 Bob Hoskins
# 2 Bob Hoskins
# 3 Laura Hogan
# 4 Laura Hogan
# 5 Tom Jones
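A Series.map lookup is another common fast option here (a sketch, reusing the df1/df2 names from the merge answer):

mapping = df1.set_index('A')['B']   # name -> detail lookup
df2['B'] = df2['A'].map(mapping)    # overwrite the placeholder values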