Pandas: Merging two columns into one with corresponding values - python

I have a large dataframe with a bunch of names that appear in two columns. It is laid out as follows:
Winner Value_W Loser Value_L
Jack 5 Sally -3
Sally 2 Max -1
Max 4 Jack -2
Lucy 1 Jack -6
Jack 6 Henry -3
Henry 5 Lucy -4
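(For reference, a frame matching this layout can be built as below; treat it as a hypothetical reconstruction of the real data.)
import pandas as pd

df = pd.DataFrame({
    'Winner':  ['Jack', 'Sally', 'Max', 'Lucy', 'Jack', 'Henry'],
    'Value_W': [5, 2, 4, 1, 6, 5],
    'Loser':   ['Sally', 'Max', 'Jack', 'Jack', 'Henry', 'Lucy'],
    'Value_L': [-3, -1, -2, -6, -3, -4],
})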
I then filtered on columns 'Winner' and 'Loser' to get all rows in which Jack appears, using the following code:
df.loc[(df['Winner'] == 'Jack') | (df['Loser'] == 'Jack')]
Which returns the following:
Winner Value_W Loser Value_L
Jack 5 Sally -3
Max 4 Jack -2
Lucy 1 Jack -6
Jack 6 Henry -3
I am now looking to generate one column which only has Jack and his corresponding values.
So in this example, the output I want is:
New_1 New_2
Jack 5
Jack -2
Jack -6
Jack 6
I am unsure of how to do this.

You could use wide_to_long after renaming the columns slightly. This allows you to capture additional information, like whether a given row was a win or a loss. If you don't care about that, drop it afterwards with df1 = df1.reset_index(drop=True).
d = {'Winner': 'Person_W', 'Loser': 'Person_L'}
df1 = pd.wide_to_long(df.rename(columns=d).reset_index(),
                      stubnames=['Person', 'Value'],
                      i='index',
                      j='Win_Lose',
                      sep='_',
                      suffix='.*')
df1[df1.Person == 'Jack']
# Person Value
#index Win_Lose
#0 W Jack 5
#4 W Jack 6
#2 L Jack -2
#3 L Jack -6
If that specific ordering is important, we still have the original Index so:
df1.sort_index(level=0).query('Person == "Jack"').reset_index(drop=True)
# Person Value
#0 Jack 5
#1 Jack -2
#2 Jack -6
#3 Jack 6

You should go with wide_to_long for sure, but here is a hidden function called lreshape (it may be removed in the future, depending on the pandas developers):
pd.lreshape(df,{'name':['Winner','Loser'],'v':['Value_W','Value_L']}).query("name=='Jack'")
Out[75]:
name v
0 Jack 5
4 Jack 6
8 Jack -2
9 Jack -6

name = 'Jack'
>>> pd.DataFrame({
...     'New_1': name,
...     'New_2': df.loc[df['Winner'].eq(name), 'Value_W'].tolist()
...              + df.loc[df['Loser'].eq(name), 'Value_L'].tolist()})
New_1 New_2
0 Jack 5
1 Jack 6
2 Jack -2
3 Jack -6

I think you could use numpy.where after you've selected only the rows with 'Jack':
import numpy as np
df['New_2'] = np.where(df['Winner'] == 'Jack', df['Value_W'], df['Value_L'])
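Putting that together, a rough end-to-end sketch (assuming the original df from the question and pandas imported as pd):
import numpy as np

# Filter to the rows Jack appears in, then pick his value from whichever
# side of the row he appears on.
sub = df.loc[(df['Winner'] == 'Jack') | (df['Loser'] == 'Jack')]
out = pd.DataFrame({
    'New_1': 'Jack',
    'New_2': np.where(sub['Winner'] == 'Jack', sub['Value_W'], sub['Value_L']),
})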

Possibly:
Split it into two dataframes
Rename some columns
Concatenate them
Possibly drop extra rows
df_win = df[['Winner', 'Value_W']].rename(columns={'Winner': 'Name', 'Value_W': 'Value'})
df_lose = df[['Loser', 'Value_L']].rename(columns={'Loser': 'Name', 'Value_L': 'Value'})
df2 = pd.concat([df_win, df_lose], ignore_index=True)
df2.loc[df2.Name == 'Jack']
I do really like ALollz's answer though.

Also DataFrame.where + DataFrame.shift with axis=1:
# df.eq('Jack') marks the name cells; shifting that mask one column to the
# right marks each matching value column instead, and sum(axis=1) collapses
# each row to its single surviving value.
new_df = df.where(df.eq('Jack').shift(axis=1)).sum(axis=1, min_count=1).dropna().to_frame('value')
new_df.insert(0, 'Name', 'Jack')
print(new_df)
Name value
0 Jack 5.0
2 Jack -2.0
3 Jack -6.0
4 Jack 6.0

Related

I sorted two dataframes by id that contain the same values but am getting that they are not equal

I have two dataframes:
df1
ID Name
15 Max
7 Stacy
3 Frank
2 Joe
df2
ID Name
2 Abigail
3 Josh
15 Jake
7 Brian
I sorted them by doing
df1 = df1.sort_values(by=['ID'])
df2 = df2.sort_values(by=['ID'])
to get
df1
ID Name
2 Joe
3 Frank
7 Stacy
15 Max
df2
ID Name
2 Abigail
3 Josh
7 Brian
15 Jake
However when I check that the 'ID' column is the same across both dataframes by doing
print(df1['ID'].equals(df2['ID']))
it returns False, why is this so? Is there another method I can use to return that the two columns are equal?
They're still being compared on the original indices:
import io
df1 = pd.read_csv(io.StringIO('''ID Name
15 Max
7 Stacy
3 Frank
2 Joe'''), sep='\s+')
df2 = pd.read_csv(io.StringIO('''ID Name
2 Abigail
3 Josh
15 Jake
7 Brian'''), sep='\s+')
df1 = df1.sort_values(by=['ID'])
df2 = df2.sort_values(by=['ID'])
What is basically happening is that it is checking whether ID and ID_other in the following data frame are equal; they are not.
>>> df1.join(df2, rsuffix='_other')
ID Name ID_other Name_other
3 2 Joe 7 Brian
2 3 Frank 15 Jake
1 7 Stacy 3 Josh
0 15 Max 2 Abigail
If you want to check equality without regard to index, compare the underlying arrays (and wrap with .all() to reduce the elementwise result to a single boolean):
(df1['ID'].values == df2['ID'].values).all()
Or reset the indices on both sides and then use eq.
The frames most probably have different indices. You should do:
df1 = df1.sort_values(by=['ID']).reset_index(drop=True)
df2 = df2.sort_values(by=['ID']).reset_index(drop=True)
print(df1['ID'].equals(df2['ID'])) # this returns True
Alternative:
import pandas as pd
df1 = pd.DataFrame({'ID': [15, 7, 3, 2], 'Name': ['Max', 'Stacy', 'Frank', 'Joe']})
df2 = pd.DataFrame({'ID': [2, 3, 15, 7], 'Name': ['Abigail', 'Josh', 'Jake', 'Brian']})
df1 = df1.sort_values(by=['ID']).reset_index(drop=True)
df2 = df2.sort_values(by=['ID']).reset_index(drop=True)
print(df1['ID'].equals(df2['ID'])) # should return True
You don't need to sort. You can use pandas.DataFrame.set_index, then use pandas.DataFrame.eq.
df1.set_index('ID').eq(df2.set_index('ID'))
For example if df1 and df2 like:
>>> print(df1)
# ID Name
# 0 15 Max
# 1 7 Stacy
# 2 3 Frank
# 3 2 Joe
>>> print(df2)
# ID Name
# 0 2 Abigail
# 1 3 Josh
# 2 15 Max
# 3 7 Brian
>>> df1.set_index('ID').eq(df2.set_index('ID'))
Name
ID
2 False
3 False
7 False
15 True
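If the goal is only to confirm that both frames contain the same IDs regardless of order and index, a small sketch (assuming the df1 and df2 from the question):
# Order- and index-independent comparison of the ID columns.
same_ids = set(df1['ID']) == set(df2['ID'])

# If duplicate IDs matter, compare the sorted arrays instead.
import numpy as np
same_multiset = np.array_equal(np.sort(df1['ID'].to_numpy()),
                               np.sort(df2['ID'].to_numpy()))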

Get count of unique year repeated throughout the event by groupby pandas [duplicate]

I would like to count the unique observations by a group in a pandas dataframe and create a new column that has the unique count. Importantly, I would not like to reduce the rows in the dataframe; effectively performing something similar to a window function in SQL.
df = pd.DataFrame({
    'uID': ['James', 'Henry', 'Abe', 'James', 'Henry', 'Brian', 'Claude', 'James'],
    'mID': ['A', 'B', 'A', 'B', 'A', 'A', 'A', 'C']
})
df.groupby('mID')['uID'].nunique()
will get the unique count per group, but it summarises (reduces the rows). I would effectively like to do something along the lines of:
df['ncount'] = df.groupby('mID')['uID'].transform('nunique')
(this obviously does not work)
It is possible to accomplish the desired outcome by taking the unique summarised dataframe and joining it to the original dataframe but I am wondering if there is a more minimal solution.
Thanks
GroupBy.transform('nunique')
On v0.23.4, your solution works for me.
df['ncount'] = df.groupby('mID')['uID'].transform('nunique')
df
uID mID ncount
0 James A 5
1 Henry B 2
2 Abe A 5
3 James B 2
4 Henry A 5
5 Brian A 5
6 Claude A 5
7 James C 1
GroupBy.nunique + pd.Series.map
Additionally, with your existing solution, you could map the series back to mID:
df['ncount'] = df.mID.map(df.groupby('mID')['uID'].nunique())
df
uID mID ncount
0 James A 5
1 Henry B 2
2 Abe A 5
3 James B 2
4 Henry A 5
5 Brian A 5
6 Claude A 5
7 James C 1
You are very close!
df['ncount'] = df.groupby('mID')['uID'].transform(pd.Series.nunique)
uID mID ncount
0 James A 5
1 Henry B 2
2 Abe A 5
3 James B 2
4 Henry A 5
5 Brian A 5
6 Claude A 5
7 James C 1

Map names to column values pandas

The Problem
I had a hard time phrasing this question but essentially I have a series of X columns that represent weights at specific points in time. Then another set of X columns that represent the names of those people that were measured.
That table looks like this (there's more than two columns, this is just a toy example):
a_weight  b_weight  a_name  b_name
10        5         John    Michael
1         2         Jake    Michelle
21        3         Alice   Bob
2         1         Ashley  Brian
What I Want
I want to have two columns with the maximum weight and the corresponding name at each point in time. I want this to be vectorized because there is a lot of data. I can do it using a for-loop or an .apply(lambda row: row[col]), but it is very slow.
So the final table would look something like this:
a_weight  b_weight  a_name  b_name    max_weight  max_name
10        5         John    Michael   a_weight    John
1         2         Jake    Michelle  b_weight    Michelle
21        3         Alice   Bob       a_weight    Alice
2         1         Ashley  Brian     a_weight    Ashley
What I've Tried
I've been able to create a mirror df_subset with just the weights, then use the idxmax function to make a max_weight column:
df_subset = df[[c for c in df.columns if "weight" in c]]
max_weight_col = df_subset.idxmax(axis="columns")
This returns a column that is the max_weight column in the section above. Now I run:
df["max_name_col"] = max_weight_col.str.replace("_weight","_name")
and I have this:
a_weight
b_weight
a_name
b_name
max_weight
max_name_col
10
5
John
Michael
a_weight
a_name
1
2
Jake
Michelle
b_weight
b_name
21
3
Alice
Bob
a_weight
a_name
2
1
Ashley
Brian
a_weight
a_name
I basically want to run code like the one below, but without a for-loop:
df["max_name"] = [row[row["max_name_col"]] for _, row in df.iterrows()]
How do I move on from here? I feel like I'm so close but I'm stuck. Any help? I'm also open to throwing away the entire code and doing something else if there's a faster way.
You can do that for sure, just pass it to numpy argmax:
v1 = df.filter(like='weight').values
v2 = df.filter(like='name').values
# Note: indexing with df.index assumes the default RangeIndex 0..n-1.
df['max_weight'] = v1[df.index, v1.argmax(1)]
df['max_name'] = v2[df.index, v1.argmax(1)]
df
Out[921]:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael 10 John
1 1 2 Jake Michelle 2 Michelle
2 21 3 Alice Bob 21 Alice
3 2 1 Ashley Brian 2 Ashley
This would do the trick assuming you only have 2 weight columns:
df["max_weight"] = df[["a_weight", "b_weight"]].idxmax(axis=1)
mask = df["max_weight"] == "a_weight"
df.loc[mask, "max_name"] = df[mask]["a_name"]
df.loc[~mask, "max_name"] = df[~mask]["b_name"]
We could use idxmax to find the column names, then use factorize + numpy advanced indexing to get the names:
import numpy as np
df['max_weight'] = df.loc[:, df.columns.str.contains('weight')].idxmax(axis=1)
df['max_name'] = (df.loc[:, df.columns.str.contains('name')].to_numpy()
                  [np.arange(len(df)), df['max_weight'].factorize()[0]])
Output:
a_weight b_weight a_name b_name max_weight max_name
0 10 5 John Michael a_weight John
1 1 2 Jake Michelle b_weight Michelle
2 21 3 Alice Bob a_weight Alice
3 2 1 Ashley Brian a_weight Ashley
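For completeness, here is a sketch that finishes the asker's own idxmax + str.replace idea without a loop, using numpy advanced indexing. The frame construction is a hypothetical re-creation of the toy example above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a_weight': [10, 1, 21, 2],
                   'b_weight': [5, 2, 3, 1],
                   'a_name': ['John', 'Jake', 'Alice', 'Ashley'],
                   'b_name': ['Michael', 'Michelle', 'Bob', 'Brian']})

# idxmax over the weight columns gives the winning column label per row...
df['max_weight'] = df.filter(like='weight').idxmax(axis=1)
# ...rename each label to its matching name column, as in the question...
name_cols = df['max_weight'].str.replace('_weight', '_name', regex=False)
# ...then pull one value per row with numpy advanced indexing.
col_idx = df.columns.get_indexer(name_cols)
df['max_name'] = df.to_numpy()[np.arange(len(df)), col_idx]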

Count how many times two unique values occur in the same group without order

Consider a pandas DataFrame with 2 columns: image_id and name
Each row represents one person (name) located in an image (image_id)
Each image can have 1 or more people
Each name can appear at most once in an image
Friendship order does not matter, e.g. Bob & Mary = Mary & Bob
How can I count how many times two people occur in the same image across the entire dataset?
data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])
# Now what?
Expected dataframe (order of rows or names doesn't matter):
name1 name2 count
Mary Susan 3
Bob Susan 2
Mary Bob 2
Bob Joe 2
Mary Joe 1
Susan Joe 1
Alternative solution:
It would also be acceptable to have a symmetric 2D matrix where rows and columns are the names of all people, and the cell value is the number of times those two people have appeared in the same image.
We can use crosstab to compute a frequency table, then take the inner product of that table with itself to count the number of times two people occur in the same image:
s = pd.crosstab(df['image_id'], df['name'])
import numpy as np

c = s.T @ s
c = c.mask(np.triu(c, 1) == 0).stack()\
     .rename_axis(['name1', 'name2']).reset_index(name='count')
name1 name2 count
0 Bob Joe 2.0
1 Bob Mary 2.0
2 Bob Susan 2.0
3 Joe Mary 1.0
4 Joe Susan 1.0
5 Mary Susan 3.0
EDIT by OP:
Here's a detailed explanation of the above code:
# Compute a frequency table of names that appear in each image.
s = pd.crosstab(df['image_id'], df['name'])
name Bob Isaac Joe Mary Susan
image_id
1 1 0 0 1 1
2 1 0 1 0 0
3 0 1 0 0 0
4 0 0 0 1 1
5 1 0 1 1 1
# Inner product counts the occurrences of each pair.
# The diagonal counts the number of times a name appeared in any image.
c = s.T @ s
name Bob Isaac Joe Mary Susan
name
Bob 3 0 2 2 2
Isaac 0 1 0 0 0
Joe 2 0 2 1 1
Mary 2 0 1 3 3
Susan 2 0 1 3 3
# Keep the non-zero elements in the upper triangle, since matrix is symmetric.
c = c.mask(np.triu(c, 1) == 0)
name Bob Isaac Joe Mary Susan
name
Bob NaN NaN 2.0 2.0 2.0
Isaac NaN NaN NaN NaN NaN
Joe NaN NaN NaN 1.0 1.0
Mary NaN NaN NaN NaN 3.0
Susan NaN NaN NaN NaN NaN
# Group all counts in a single column.
# Each row represents a unique pair of names.
c = c.stack()
name name
Bob Joe 2.0
Mary 2.0
Susan 2.0
Joe Mary 1.0
Susan 1.0
Mary Susan 3.0
# Expand the MultiIndex into separate columns.
c = c.rename_axis(['name1', 'name2']).reset_index(name='count')
name1 name2 count
0 Bob Joe 2.0
1 Bob Mary 2.0
2 Bob Susan 2.0
3 Joe Mary 1.0
4 Joe Susan 1.0
5 Mary Susan 3.0
See crosstab, @ (matrix mult.), T (transpose), triu, mask and stack for more details.
I know an answer has already been accepted by the user, but I still want to share my code. This is my code to achieve the expected output the "HARD WAY".
import itertools
import pandas as pd

data = [[1, 'Mary'], [1, 'Bob'], [1, 'Susan'],
        [2, 'Bob'], [2, 'Joe'],
        [3, 'Isaac'],
        [4, 'Mary'], [4, 'Susan'],
        [5, 'Mary'], [5, 'Susan'], [5, 'Bob'], [5, 'Joe']]
df = pd.DataFrame(data, columns=['image_id', 'name'])

# Group the df by 'image_id' and collect the names in each group into a list
groups = df.groupby(['image_id'])['name'].apply(list).reset_index()
output = {}
# Loop through the groups dataframe
for index, row in groups.iterrows():
    # Sort the list of names in ascending order
    row['name'].sort()
    # Get all possible combinations of the list in pairs of two
    temp = list(itertools.combinations(row['name'], 2))
    # Maintain an occurrence count per pair in the output dictionary:
    # initialize to 1 on first sight, increment on every further occurrence
    for val in temp:
        if val not in output:
            output[val] = 1
        else:
            output[val] += 1
temp_output = []
# Reshape the output dictionary into rows for the DataFrame constructor
for key, val in output.items():
    temp = [key[0], key[1], val]
    temp_output.append(temp)
df = pd.DataFrame(temp_output, columns=['name1', 'name2', 'count'])
print(df.sort_values(by=['count'], ascending=False))
And this is the output I am getting:
name1 name2 count
2 Mary Susan 3
0 Bob Mary 2
1 Bob Susan 2
3 Bob Joe 2
4 Joe Mary 1
5 Joe Susan 1
This is "NOT THE PYTHONIC" way, but this is how I solve most of my problems; it is not that elegant, but it does the job.
NOTE: How code works is already mentioned in the comments but still if anyone of you has any doubts/questions/suggestions, then kindly let me know.
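For comparison, a more compact sketch of the same pair-counting idea, using collections.Counter and itertools.combinations (assuming the df from the question):
from collections import Counter
from itertools import combinations

# Count unordered name pairs per image in one pass; sorting each group's
# names makes (Bob, Mary) and (Mary, Bob) land on the same key.
pair_counts = Counter(
    pair
    for _, names in df.groupby('image_id')['name']
    for pair in combinations(sorted(names), 2)
)
result = pd.DataFrame(
    [(a, b, n) for (a, b), n in pair_counts.items()],
    columns=['name1', 'name2', 'count'],
)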

Insert a Zero in a Pandas Dataframe pd.count() Result < 1

I'm trying to find a method of inserting a zero into a pandas dataframe where the result of the .count() aggregate function is < 1. I've tried putting in a condition where it looks for null/None values, and using a simple < 1 operator. So far I can only count instances where a categorical variable exists. Below is some example code to demonstrate my issue:
data = {'Person': ['Jim', 'Jim', 'Jim', 'Jim', 'Jim', 'Bob', 'Bob', 'Bob', 'Bob', 'Bob'],
        'Result': ['Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Bad', 'Good', 'Bad', 'Bad']}
dtf = pd.DataFrame.from_dict(data)
names = ['Jim', 'Bob']
append = []
for i in names:
    good = dtf[dtf['Person'] == i]
    good = good[good['Result'] == 'Good']
    if good['Result'].count() > 0:
        good.insert(2, "Count", good['Result'].count())
    elif good['Result'].count() < 1:
        good.insert(2, "Count", 0)
    bad = dtf[dtf['Person'] == i]
    bad = bad[bad['Result'] == 'Bad']
    if bad['Result'].count() > 0:
        bad.insert(2, "Count", bad['Result'].count())
    elif bad['Result'].count() < 1:
        bad.insert(2, "Count", 0)
    res = [good, bad]
    res = pd.concat(res)
    append.append(res)
    print(res)
The current output is:
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
Person Result Count
5 Bob Good 2
7 Bob Good 2
6 Bob Bad 3
8 Bob Bad 3
9 Bob Bad 3
What I am trying to achieve is a zero count for Jim for the 'Bad' variable in the dtf['Result'] column, like this:
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0
Person Result Count
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3
I hope this makes sense. Vive la Resistance! └[∵┌]└[ ∵ ]┘[┐∵]┘
First create a MultiIndex mi from the product of Person and Result so that missing combinations are kept. Then count the size of all groups and reindex by the MultiIndex, filling missing combinations with 0. Finally, merge the two dataframes using the union of keys from both.
mi = pd.MultiIndex.from_product([dtf["Person"].unique(),
                                 dtf["Result"].unique()],
                                names=["Person", "Result"])
out = dtf.groupby(["Person", "Result"]) \
         .size() \
         .reindex(mi, fill_value=0) \
         .rename("Count") \
         .reset_index()
out = out.merge(dtf, on=["Person", "Result"], how="outer")
>>> out
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3
To recover the per-person frames (like the append list in the question):
names, append = list(zip(*out.groupby("Person")))
>>> names
('Bob', 'Jim')
>>> append
( Person Result Count
6 Bob Good 2
7 Bob Good 2
8 Bob Bad 3
9 Bob Bad 3
10 Bob Bad 3,
Person Result Count
0 Jim Good 5
1 Jim Good 5
2 Jim Good 5
3 Jim Good 5
4 Jim Good 5
5 Jim Bad 0)
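As a possible shorthand for the zero-filling step: pd.crosstab fills absent Person/Result combinations with 0 automatically. A sketch, assuming the dtf defined in the question:
# Count every Person/Result combination, including absent ones (0).
counts = (pd.crosstab(dtf['Person'], dtf['Result'])
            .stack()
            .rename('Count')
            .reset_index())
# Attach the counts back onto the original rows; the outer merge keeps
# zero-count combinations such as Jim/Bad as extra rows.
out = counts.merge(dtf, on=['Person', 'Result'], how='outer')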
