Merge (or concat) two dataframes by index with duplicate indexes - python

I have two dataframes A and B with common indexes. These common indexes can appear several times (duplicates) in both A and B.
I want to merge A and B in these 3 ways:
Case 0: if index i appears once in A (i1) and once in B (i1), I want my merged-by-index dataframe to add the row A(i1), B(i1).
Case 1: if index i appears once in A (i1) and twice in B, in this order (i1 and i2), I want my merged-by-index dataframe to add the rows A(i1), B(i1) and A(i1), B(i2).
Case 2: if index i appears twice in A, in this order (i1, i2), and twice in B, in this order (i1 and i2), I want my merged-by-index dataframe to add the rows A(i1), B(i1) and A(i2), B(i2).
These 3 cases are all the possible cases that can appear in my data.
When using pandas.merge, case 0 and case 1 work. But for case 2, the returned dataframe adds the rows A(i1), B(i1), A(i1), B(i2), A(i2), B(i1) and A(i2), B(i2) instead of A(i1), B(i1) and A(i2), B(i2).
I could use pandas.merge and then delete the undesired merged rows, but is there a way to handle these 3 cases at the same time?
A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
pd.merge(A,B, left_index=True, right_index=True, how='inner')
For example, in the merge above, I want exactly that result but without the second and third rows for index 'a'.

Basically, your 3 cases can be summarized into 2 cases:
Index i occurs the same number of times (1 or 2) in A and B: merge according to the order.
Index i occurs 2 times in A and 1 time in B: merge using the single B row for all A rows.
Prep code:
import pandas as pd

def add_secondary_index(df):
    # number duplicate index labels (0, 1, ...) so they can be aligned by order
    df.index.name = 'Old'
    df['Order'] = df.groupby(df.index).cumcount()
    df.set_index('Order', append=True, inplace=True)
    return df

A = pd.DataFrame([[1, 2], [4, 2], [5,5], [5,5], [1,1]], index=['a', 'a', 'b', 'c', 'c'])
B = pd.DataFrame([[1, 5], [4, 8], [7,7], [5,5]], index=['b', 'c', 'a', 'a'])
index_times = A.groupby(A.index).count() == B.groupby(B.index).count()
The first case (same number of occurrences) is easy to solve; you just need to add the secondary index:
same_times_index = index_times[index_times[0].values].index
A_same = A.loc[same_times_index].copy()
B_same = B.loc[same_times_index].copy()
add_secondary_index(A_same)
add_secondary_index(B_same)
result_merge_same = pd.merge(A_same,B_same,left_index=True,right_index=True)
The second case (different number of occurrences) needs to be considered separately:
not_same_times_index = index_times[~index_times.index.isin(same_times_index)].index
A_notsame = A.loc[not_same_times_index].copy()
B_notsame = B.loc[not_same_times_index].copy()
result_merge_notsame = pd.merge(A_notsame,B_notsame,left_index=True,right_index=True)
You could consider whether to add a secondary index to result_merge_notsame, or drop it from result_merge_same.
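For instance, a minimal sketch of stitching the two partial results back into one dataframe (assuming the variables built above) could drop the helper 'Order' level and concatenate:
final = pd.concat([
    result_merge_same.droplevel('Order'),   # back to the plain original index
    result_merge_notsame,
])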

Related

Apply arithmetic calculations on specific rows of a large dataframe

Suppose that we have a dataframe (df) with a large number of rows (1600000 x 4). Also, we have a list of lists such as this one:
inx = [[1,2],[4,5], [8,9,10], [15,16]]
We need to calculate the average of the first and third columns of this dataframe and the median of the second and fourth columns for every list in inx. For example, for the first list of inx ([1,2]), we should do this for rows 1 and 2 and replace both rows with a single new row that contains the output of these calculations. What is the fastest way to do this?
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
a b c d
0 1 2 3 3
1 4 5 6 1
2 7 8 9 3
3 1 1 1 1
The output for just the first list inside of inx ([1,2]) will be something like this:
a b c d
0 1 2 3 3
1 5.5 6.5 7.5 2
3 1 1 1 1
As you can see, we don't change the first row (0), because it's not in any list of inx. After that, we do the same for [4,5]. We don't change anything in row 3 because it's not in any list either. inx is a large list of lists (more than 100000 elements).
EDIT: NEW APPROACH AVOIDING LOOPS
Below you find an approach relying on pandas and avoiding loops.
After generating some fake data of the same size as yours, I basically create a list of group labels parallel to your inx list of rows; i.e., with your inx being:
[[2,3], [5,6,7], [10,11], ...]
the created list is:
[[1,1], [2,2,2], [3,3],...]
After that, this list is flattened and added to the original dataframe to mark the various groups of rows to operate on.
After the proper calculations, the resulting dataframe is joined back with the original rows which don't need calculations (in my example above, rows [0, 1, 4, 8, 9, ...]).
You will find more comments in the code.
At the end of the answer I also leave my previous approach, for the record.
On my box, the old algorithm involving a loop takes more than 18 minutes... unbearable!
Using pandas only, it takes less than half a second!! Pandas is great!
import pandas as pd
import numpy as np
import random
# Prepare some fake data to test
data = np.random.randint(0, 9, size=(160000, 4))
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
inxl = random.sample(range(1, 160000), 140000)
inxl.sort()
inx=[]
while len(inxl) > 3:
    i = random.randint(2,3)
    l = inxl[0:i]
    inx.append(l)
    inxl = inxl[i:]
inx.append(inxl)
# flatten inx (used below)
flat_inx = [item for sublist in inx for item in sublist]
# for each element (list) in inx create equivalent list (same length)
# of increasing ints. They'll be used to group corresponding rows
gr=[len(sublist) for sublist in inx]
t = list(zip(gr, range(1, len(inx)+1)))
group_list = [a*[b] for (a,b) in t]
# the groups are flattened as well
flat_group_list = [item for sublist in group_list for item in sublist]
# create a new dataframe to mark rows to group retaining
# original index for each row
df_groups = pd.DataFrame({'groups': flat_group_list}, index=flat_inx)
# and join the group dataframe to the original df
df['groups'] = df_groups
# rows not belonging to a group are marked with 0
df['groups']=df['groups'].fillna(0)
# save rows not belonging to a group for later
df_untouched = df[df['groups'] == 0]
df_untouched = df_untouched.drop('groups', axis=1)
# new dataframe containg only rows belonging to a group
df_to_operate = df[df['groups']>0]
df_to_operate = df_to_operate.assign(ind=df_to_operate.index)
# at last, we group the rows according to original inx
df_grouped = df_to_operate.groupby('groups')
# calculate mean and median
# for each group we retain the index of first row of group
df_operated = df_grouped.agg({'a': 'mean',
                              'b': 'median',
                              'c': 'mean',
                              'd': 'median',
                              'ind': 'first'})
# set correct index on dataframe
df_operated=df_operated.set_index('ind')
# finally, join the previous dataframe with the saved
# dataframe of rows which don't need calculations
df_final = df_operated.combine_first(df_untouched)
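As a sanity check, here is a compact re-statement of the same idea (just a sketch) applied to the small example from the question, to confirm it reproduces the expected output:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]),
                  columns=['a', 'b', 'c', 'd'])
inx = [[1, 2]]

# mark the rows of each inx sublist with a common group label
flat_inx = [i for sub in inx for i in sub]
flat_groups = [g + 1 for g, sub in enumerate(inx) for _ in sub]
df['groups'] = pd.Series(flat_groups, index=flat_inx)
df['groups'] = df['groups'].fillna(0)

untouched = df[df['groups'] == 0].drop('groups', axis=1)
collapsed = (df[df['groups'] > 0]
             .assign(ind=lambda d: d.index)
             .groupby('groups')
             .agg({'a': 'mean', 'b': 'median', 'c': 'mean', 'd': 'median', 'ind': 'first'})
             .set_index('ind'))

print(collapsed.combine_first(untouched))
# rows 0 and 3 are kept as-is; rows 1 and 2 collapse into one row (a=5.5, b=6.5, c=7.5, d=2)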
OLD ALGO, TOO SLOW FOR SO MUCH DATA
This algorithm, involving a loop, gives a correct result but takes too long for such a big amount of data:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3, 3], [4, 5, 6, 1], [7, 8, 9, 3], [1, 1, 1, 1]]), columns=['a', 'b', 'c', 'd'])
inx = [[1,2]]
for l in inx:
    means = df.iloc[l][['a', 'c']].mean()
    medians = df.iloc[l][['b', 'd']].median()
    df.iloc[l[0]] = pd.DataFrame([means, medians]).fillna(method='bfill').iloc[0]
    df.drop(index=l[1:], inplace=True)

Pandas: renaming columns that have the same name

I have a dataframe that has duplicated column names: a, b and b. I would like to rename the second b to c.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "b1": [7, 8, 9]})
df.rename(index=str, columns={'b1' : 'b'})
Trying this with no success:
df.rename(index=str, columns={2 : "c"})
try:
>>> df.columns = ['a', 'b', 'c']
>>> df
a b c
0 1 4 7
1 2 5 8
2 3 6 9
You can always just manually rename all the columns.
df.columns = ['a', 'b', 'c']
You can simply do:
df.columns = ['a','b','c']
If your columns are ordered and you want lettered columns, don't type the names out manually; that is prone to error.
You can use string.ascii_lowercase, assuming you have a maximum of 26 columns:
from string import ascii_lowercase
df = pd.DataFrame(columns=['a', 'b', 'b1'])
df.columns = list(ascii_lowercase[:len(df.columns)])
print(df.columns)
Index(['a', 'b', 'c'], dtype='object')
These solutions don't take into account the problem of having many columns.
Here is a solution where, independently of the number of columns, you can rename the columns that share the same name to unique names:
df.columns = ['name'+str(col[0]) if col[1] == 'name' else col[1] for col in enumerate(df.columns)]
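For example, a small hypothetical frame with three columns all literally called name shows what the comprehension produces (note that the suffix is the column's overall position, not a per-name counter):
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['name', 'name', 'name'])  # hypothetical example
df.columns = ['name' + str(col[0]) if col[1] == 'name' else col[1]
              for col in enumerate(df.columns)]
print(df.columns)  # Index(['name0', 'name1', 'name2'], dtype='object')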

How to add a value to specific columns of a pandas dataframe?

I have to perform the same arithmetic operation on specific columns of a pandas DataFrame. I do it as
c.loc[:,'col3'] += cons
c.loc[:,'col5'] += cons
c.loc[:,'col6'] += cons
There should be a simpler approach to do all of these in one operation, i.e. updating col3, col5 and col6 in one command.
pd.DataFrame.loc label indexing accepts lists:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.loc[:, ['B', 'C']] += 10
print(df)
A B C
0 1 12 13
1 4 15 16
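Applied to the names from the question (c, cons and col3/col5/col6 are the question's own identifiers), the same idea would presumably be:
c.loc[:, ['col3', 'col5', 'col6']] += cons  # one statement instead of three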

Get pandas dataframe records according to a specific column's quantiles

I would like to get the records of dataframe df whose values in column C are equal to a list of specified quantiles.
For a single quantile this works:
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'], 'C': [1, 2, 3, 4, 5]})
print(df[df['C'] == df['C'].quantile(q = 0.25)])
and outputs:
A C
1 b 2
but it looks clunky to me, and also fails when there are multiple quantiles: print(df[df['C'] == df['C'].quantile(q = [0.25, 0.75])]) throws ValueError: Can only compare identically-labeled Series objects
related to Retrieve the Kth quantile within each group in Pandas
You can do it this way: all you have to do is keep your desired quantiles in a list, as shown below. You will have your result in final_df:
quantile_list = [0.1, 0.5, 0.4]
final_df = pd.DataFrame(columns=df.columns)
for i in quantile_list:
    temp = df[df['C'] == df['C'].quantile(q=i)]
    final_df = pd.concat([final_df, temp])
final_df.reset_index(drop=True, inplace=True)  # optional, in case you want to reset the index
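A loop-free alternative (just a sketch; like the loop above, it only matches rows when the computed quantile value actually occurs in the column) is to compare against all quantiles at once with isin:
quantile_list = [0.1, 0.5, 0.4]
final_df = df[df['C'].isin(df['C'].quantile(q=quantile_list))].reset_index(drop=True)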

What's the most efficient way to get a variable number of rows for each group of a dataframe

To illustrate my question clearly, for a dummy dataframe like this:
df = pd.DataFrame({'X' : ['B', 'B', 'A', 'A', 'A'], 'Y' : [1, 2, 3, 4, 5]})
How can I get the top 1 row of group A and the top 2 rows of group B, and get rid of the rest of the rows in each group? By the way, the real dataset is big, with hundreds of thousands of rows and thousands of groups.
And the output looks like this:
pd.DataFrame({'X' : ['B', 'B', 'A'], 'Y' : [1, 2, 3]})
My main gripe is that .groupby().head() only gives me a fixed number of rows within each group, and I want a different number of rows for different groups.
One way to do this is to create a dictionary containing the number of rows each group should keep; then, in groupby.apply, use g.name as the key to look up that value in the dictionary. With the head method you can keep a different number of rows for each group:
rows_per_group = {"A": 1, "B": 2}
df.groupby("X", group_keys=False).apply(lambda g: g.head(rows_per_group[g.name]))
# X Y
#2 A 3
#0 B 1
#1 B 2
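An apply-free variant (a sketch, assuming the same rows_per_group dictionary) keeps a row while its position within its group is below that group's limit; note that it returns the surviving rows in their original order:
df[df.groupby('X').cumcount() < df['X'].map(rows_per_group)]
#    X  Y
# 0  B  1
# 1  B  2
# 2  A  3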
