Merging two dataframes with an added condition - python

I have two dataframes that I need to merge, as follows:
Df1
Name Type Speed
a    x    1
a    y    0
a    z    1

Df2
Type Fast Slow
x    2    3
y    3    5
z    4    6

Df3 - DESIRED OUTCOME
Name Type Speed Time
a    x    1     2
a    y    0     5
a    z    1     4
So basically I need to create a new 'Time' column that takes its value from either the 'Fast' or the 'Slow' column, based on the 'Speed' flag and the object's 'Type'. I have no idea how to do this, so any help would be much appreciated. Thanks in advance, and apologies for the confusing explanation.
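For reference, the two frames can be reconstructed like this (a minimal sketch of the data above; the answers below also assume numpy is imported as np):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Name': ['a', 'a', 'a'],
                    'Type': ['x', 'y', 'z'],
                    'Speed': [1, 0, 1]})
df2 = pd.DataFrame({'Type': ['x', 'y', 'z'],
                    'Fast': [2, 3, 4],
                    'Slow': [3, 5, 6]})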

Use merge + np.where for a more succinct solution:
v = df1.merge(df2, on=['Type'])
v['Time'] = np.where(v['Speed'], v.pop('Fast'), v.pop('Slow'))
Name Type Speed Time
0 a x 1 2
1 a y 0 5
2 a z 1 4

Use melt to reshape first, then map to match Speed, and finally merge with a left join:
df = df2.melt('Type', var_name='Speed', value_name='Time')
df['Speed'] = df['Speed'].map({'Fast':1, 'Slow':0})
print (df)
Type Speed Time
0 x 1 2
1 y 1 3
2 z 1 4
3 x 0 3
4 y 0 5
5 z 0 6
df3 = df1.merge(df, how='left', on=['Type','Speed'])
print (df3)
Name Type Speed Time
0 a x 1 2
1 a y 0 5
2 a z 1 4
If performance is important, merge is not necessary: map by a Series created with set_index, then pick with numpy.where. Since df1['Speed'] holds only 0 and 1, it is processed like False and True:
s1 = df2.set_index('Type')['Fast']
s2 = df2.set_index('Type')['Slow']
df1['Time'] = np.where(df1['Speed'], df1['Type'].map(s1), df1['Type'].map(s2))
print (df1)
Name Type Speed Time
0 a x 1 2
1 a y 0 5
2 a z 1 4

(Pandas, Python) Selecting indices of a parent DF based on shared column values with a child DF

(I recently asked this question on r/learnpython (here), but didn't get any feedback, so am re-posting it verbatim here. Hope that is okay!)
Suppose I have a DataFrame Y that looks like this:
index  x1  x2  A
0      1   2   5
1      3   7   1
And then I have a parent DataFrame X like this:
index  x1  x2  A
0      1   2   0
1      1   3   0
2      3   4   0
3      3   7   0
Further, suppose that every ['x1','x2'] combination is unique: each combination appears at most once in X and at most once in Y, and every combination that appears in Y also appears in X (although the 'A' value may differ).
For all ['x1','x2'] combinations in Y, I would like to find the corresponding indices in X. So here, those indices would be a list [0,3] that I want.
My goal is, for all such rows in X with such an index (I'll call it j here), to set
X['A'].loc[j] = Y['A'].iloc[j]
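(As an aside, the matching positions can be found without an explicit loop; a minimal sketch, assuming the X and Y frames above:)
import pandas as pd

# membership test of X's (x1, x2) pairs against Y's pairs
pairs_X = pd.MultiIndex.from_frame(X[['x1', 'x2']])
pairs_Y = pd.MultiIndex.from_frame(Y[['x1', 'x2']])
matching = X.index[pairs_X.isin(pairs_Y)].tolist()  # [0, 3] for the example data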
Currently I have this:
for i in range(len(X)):
    v1 = X['x1'].iloc[i]
    v2 = X['x2'].iloc[i]
    extract = Y.query("x1 == %d" % v1).query("x2 == %d" % v2)
    if len(extract) != 0:
        X['A'].loc[i] = extract['A'].iloc[0]
This does what I want, except that it is pretty slow, and it seems like there should be a faster way to do it. Wondering what that might be, if it exists!
So the resulting DataFrame X should look like:
index  x1  x2  A
0      1   2   5
1      1   3   0
2      3   4   0
3      3   7   1
One option is to use MultiIndex.map:
cols = ['x1','x2']
X['A'] = X.set_index(cols).index.map(Y.set_index(cols)['A']).fillna(0).astype(int)
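(Note that the fillna(0) only reproduces the desired output because the unmatched rows of X already had A == 0; in general you would want to fall back to the original X['A'] values rather than a constant.)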
Another option is a left merge on the two columns (keeping the index column and restoring the integer dtype):
cols = ['x1', 'x2']
X = X.drop(columns='A').merge(Y[cols + ['A']], on=cols, how='left').fillna(0).astype({'A': int})
Output:
index x1 x2 A
0 0 1 2 5
1 1 1 3 0
2 2 3 4 0
3 3 3 7 1
You could set_index on the x1 and x2 columns and update the dataframe on the A column.
X1 = X.set_index(['x1', 'x2'])
Y1 = Y.set_index(['x1', 'x2'])
X1['A'].update(Y1['A']) # works inplace
X1.reset_index(inplace=True)
print(X1)
x1 x2 index A
0 1 2 0 5
1 1 3 1 0
2 3 4 2 0
3 3 7 3 1
Update: the shorter version:
X.set_index(['x1','x2'], inplace=True)
X.update(Y.set_index(['x1','x2']))
X.reset_index(inplace=True)
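Note that DataFrame.update aligns on the index, modifies the frame in place, and only overwrites cells where the other frame has non-missing values, which is why rows absent from Y are left untouched.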

Updating values from different dataframe on a certain id value [duplicate]

Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult in Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values in one column to replace all the zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
Y 1 1 1 1 1
Y 1 1 1 1 1
X 1 1 0 nan nan
Z 1 1 1 1 1
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(To clarify: the value in the 'Business' column where Name = Z should be 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
# str_to_regex and searchnamesre are my own helpers (not shown)
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.loc[pubunis.index, ['Education', 'Public']] = 1
Attention: in recent versions of pandas, both of the answers below no longer work.
KSD's answer will raise an error:
df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 0, 0]], columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name), ['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if the values in the 'Name' column are unique and sorted identically in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2, on='Name', how='left')
# only the overlapping columns (Nonprofit, Education) get _x/_y suffixes;
# Business exists only in df1, so it keeps its name
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Nonprofit_x', 'Education_x'], inplace=True, axis=1)
df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'}, inplace=True)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
A few more notes about update: the column names set as the index do not need to be the same in both data frames before calling 'update'; you could use 'Name1' and 'Name2'. It also works if df2 contains extra rows that don't appear in df1; they simply update nothing. In other words, df2 does not need to be a superset of df1.
Example:
df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 1, 0]], columns=["Name1", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1],
                    ["U", 1, 3]], columns=["Name2", "Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
This is the correct one:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows in df1 exist in df; in other words, df should be a superset of df1.
In case df1 has some rows that don't match df (that is, df is not a superset of df1), use the following instead:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = \
    df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
Another option:
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
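For completeness, a mask-based variant that replaces only the zero cells and leaves everything else untouched (a minimal sketch, assuming the df1/df2 from the question):
# align df2's rows to df1 by Name, then overwrite only the zeros
aligned = df1[['Name']].merge(df2, on='Name', how='left')
for col in ['Nonprofit', 'Education']:
    df1[col] = df1[col].mask(df1[col].eq(0) & aligned[col].notna(),
                             aligned[col]).astype(int)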

Pandas: return the occurrences of the most frequent value for each group (possibly without apply)

Let's assume the input dataset:
import pandas as pd

test1 = [[0, 7, 50], [0, 3, 51], [0, 3, 45], [1, 5, 50], [1, 0, 50], [2, 6, 50]]
df_test = pd.DataFrame(test1, columns=['A', 'B', 'C'])
that corresponds to:
A B C
0 0 7 50
1 0 3 51
2 0 3 45
3 1 5 50
4 1 0 50
5 2 6 50
I would like to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of occurrences of that value:
A most_freq freq
0 3 2
1 5 1
2 6 1
I can obtain the first 2 columns with:
grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())
but I am having problems with the last column.
Also: is there a faster way that doesn't involve apply? That solution doesn't scale well with larger inputs (I also tried dask).
Thanks a lot!
Use SeriesGroupBy.value_counts, which sorts by count in descending order by default, then Series.reset_index, and finally DataFrame.drop_duplicates to keep the top value per group:
df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A', 'most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print (df)
A most_freq freq
0 0 3 2
2 1 0 1
4 2 6 1
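An equivalent route built from size() instead of value_counts(), in case you prefer explicit column names throughout (a sketch; ties within a group are broken arbitrarily either way):
counts = (df_test.groupby(['A', 'B']).size()   # occurrences per (A, B) pair
                 .reset_index(name='freq')
                 .sort_values(['A', 'freq'], ascending=[True, False])
                 .drop_duplicates('A')          # keep the top row per group
                 .rename(columns={'B': 'most_freq'}))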

For each row return the column names of the smallest value - pandas

Assuming that I have a dataframe with the following values:
id product1sold product2sold product3sold
1 2 3 3
2 0 0 5
3 3 2 1
How do I add a 'most_sold' and 'least_sold' column containing all most and least sold products in a list per id?
It should look like this.
id product1 product2 product3 most_sold least_sold
1 2 3 3 [product2, product3] [product1]
2 0 0 5 [product3] [product1, product2]
3 3 2 1 [product1] [product3]
Use a list comprehension, testing against the row-wise maximal and minimal values to build the lists of products:
# select all columns except the first (id)
df1 = df.iloc[:, 1:]
cols = df1.columns.to_numpy()
df['most_sold'] = [cols[x].tolist() for x in df1.eq(df1.max(axis=1), axis=0).to_numpy()]
df['least_sold'] = [cols[x].tolist() for x in df1.eq(df1.min(axis=1), axis=0).to_numpy()]
print (df)
id product1sold product2sold product3sold most_sold \
0 1 2 3 3 [product2sold, product3sold]
1 2 0 0 5 [product3sold]
2 3 3 2 1 [product1sold]
least_sold
0 [product1sold]
1 [product1sold, product2sold]
2 [product3sold]
If performance is not important, it is possible to use DataFrame.apply:
df1 = df.iloc[:, 1:]
f = lambda x: x.index[x].tolist()
df['most_sold'] = df1.eq(df1.max(axis=1), axis=0).apply(f, axis=1)
df['least_sold'] = df1.eq(df1.min(axis=1), axis=0).apply(f, axis=1)
You can do something like this.
minValueCol = yourDataFrame.idxmin(axis=1)
maxValueCol = yourDataFrame.idxmax(axis=1)
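Note that idxmax/idxmin return only the first matching column per row, so ties are dropped, and the id column must be excluded first; a minimal sketch along those lines, assuming the df from the question:
products = df.drop(columns='id')           # exclude the id column first
df['most_sold'] = products.idxmax(axis=1)  # first column holding the row max
df['least_sold'] = products.idxmin(axis=1) # first column holding the row min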

Pandas: conditional rolling count

I have a Series that looks the following:
col
0 B
1 B
2 A
3 A
4 A
5 B
It's a time series, therefore the index is ordered by time.
For each row, I'd like to count how many times the value has appeared consecutively, i.e.:
Output:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
I found 2 related questions, but I can't figure out how to "write" that information as a new column in the DataFrame, for each row (as above). Using rolling_apply does not work well.
Related:
Counting consecutive events on pandas dataframe by their index
Finding consecutive segments in a pandas data frame
I think there is a nice way to combine the solutions of @chrisb and @CodeShaman (as was pointed out, CodeShaman's solution counts total occurrences, not consecutive ones).
df['count'] = df.groupby((df['col'] != df['col'].shift(1)).cumsum()).cumcount()+1
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
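The comparison with shift marks the start of each new run of values, cumsum turns those marks into a run id, and cumcount then numbers the rows within each run.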
One-liner:
df['count'] = df.groupby('col').cumcount()
or
df['count'] = df.groupby('col').cumcount() + 1
if you want the counts to begin at 1.
Based on the second answer you linked, assuming s is your series.
df = pd.DataFrame(s)
df['block'] = (df['col'] != df['col'].shift(1)).astype(int).cumsum()
df['count'] = df.groupby('block')['col'].transform(lambda x: np.arange(1, len(x) + 1))
In [88]: df
Out[88]:
col block count
0 B 1 1
1 B 1 2
2 A 2 1
3 A 2 2
4 A 2 3
5 B 3 1
I like the answer by @chrisb but wanted to share my own solution, since some people might find it more readable and easier to use with similar problems...
1) Create a function that uses static variables
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0        # static variable
rolling_count.previous = None  # static variable
2) Apply it to your Series after converting it to a dataframe:
df = pd.DataFrame(s)
df['count'] = df['col'].apply(rolling_count) #new column in dataframe
output of df
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
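Note that the static attributes persist between calls, so reset rolling_count.count and rolling_count.previous before applying the function to another Series.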
If you wish to do the same thing but track runs across two (or more) columns jointly, you can use this:
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist()
        for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df
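A hypothetical call producing the output below (the example frame is assumed; it is not part of the original answer):
df = pd.DataFrame({'col_a': [1, 1, 1, 2, 2, 2],
                   'col_b': ['B', 'B', 'A', 'A', 'A', 'B']})
df = count_consecutive_items_n_cols(df, ['col_a', 'col_b'], 'count')
print(df)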
col_a col_b count
0 1 B 1
1 1 B 2
2 1 A 1
3 2 A 1
4 2 A 2
5 2 B 1
