I have two pandas DataFrames with some index labels and some column names in common (like partially overlapping time series for common quantities).
I need to merge these two DataFrames into a single one containing all the index labels and all the columns, keeping the values from the left (or right) frame whenever an index-column combination appears in both.
Neither merge nor join helps here: merge duplicates the overlapping columns (suffixing them with _x/_y), and join has the same problem.
What's an efficient method to obtain the result I need?
EDIT:
If for example I have the two data frames
df1 = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3],
    'C2': [2.1, 2.2, 2.3],
    'C3': [3.1, 3.2, 3.3]},
    index=['a', 'b', 'c'])
df2 = pd.DataFrame({
    'C3': [3.1, 3.2, 33.3],
    'C4': [4.1, 4.2, 4.3]},
    index=['b', 'c', 'd'])
What I need is a method that allows me to create:
merged = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3, np.nan],
    'C2': [2.1, 2.2, 2.3, np.nan],
    'C3': [3.1, 3.2, 3.3, 33.3],
    'C4': [np.nan, 4.1, 4.2, 4.3]},
    index=['a', 'b', 'c', 'd'])
Here are three possibilities:
Use concat/groupby: First concatenate both DataFrames vertically. Then group by the index and select the first row in each group.
Use combine_first: Make a new index which is the union of df1 and df2. Reindex df1 using the new index. Then use combine_first to fill in NaNs with values from df2 (a short sketch applying this to the example above follows this list).
Use manual construction: We could use df2.index.difference(df1.index) to find exactly which rows need to be added to df1. So we could manually select those rows from df2 and concatenate them on to df1.
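For instance, here is a minimal sketch of the combine_first approach applied to the df1/df2 from the question:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'C1': [1.1, 1.2, 1.3],
                    'C2': [2.1, 2.2, 2.3],
                    'C3': [3.1, 3.2, 3.3]},
                   index=['a', 'b', 'c'])
df2 = pd.DataFrame({'C3': [3.1, 3.2, 33.3],
                    'C4': [4.1, 4.2, 4.3]},
                   index=['b', 'c', 'd'])

# reindex df1 on the union of both indexes, then let df2 fill the remaining NaNs
merged = df1.reindex(df1.index.union(df2.index)).combine_first(df2)
This produces the merged frame shown in the question, with df1's values winning wherever both frames hold a value.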
For small DataFrames, using_concat is faster. For larger DataFrames, using_combine_first appears to be slightly faster than the other options:
import numpy as np
import pandas as pd
import perfplot

def make_dfs(N):
    df1 = pd.DataFrame(np.random.randint(10, size=(N, 2)))
    df2 = pd.DataFrame(np.random.randint(10, size=(N, 2)),
                       index=range(N//2, N//2 + N))
    return df1, df2

def using_concat(dfs):
    df1, df2 = dfs
    result = pd.concat([df1, df2], sort=False)
    n = result.index.nlevels
    return result.groupby(level=list(range(n))).first()

def using_combine_first(dfs):
    df1, df2 = dfs
    index = df1.index.union(df2.index)
    result = df1.reindex(index)
    result = result.combine_first(df2)
    return result

def using_manual_construction(dfs):
    df1, df2 = dfs
    index = df2.index.difference(df1.index)
    cols = df2.columns.difference(df1.columns)
    result = pd.concat([df1, df2.loc[index]], sort=False)
    result.loc[df2.index, cols] = df2
    return result

perfplot.show(
    setup=make_dfs,
    kernels=[using_concat, using_combine_first, using_manual_construction],
    n_range=[2**k for k in range(5, 21)],
    logx=True,
    logy=True,
    xlabel='len(df)')
Without seeing your code I can only give a generic answer:
To merge two DataFrames, use
df3 = pd.merge(df1, df2, how='right', on=('col1', 'col2'))
or
a.merge(b, how='right', on=('c1', 'c2'))
Related
I have a large DataFrame with multiple columns; one holds string IDs and the others hold numeric values.
Example of the DataFrame:
df = pd.DataFrame({'ID': ['Child', 'Child', 'Child', 'Child', 'Baby', 'Baby', 'Baby', 'Baby'],
                   'income': [40000, 50000, 42000, 300, 2000, 4000, 2000, 3000],
                   'Height': [1.3, 1.5, 1.9, 2.0, 2.3, 1.4, 0.9, 0.8]})
What I want to do is, within every ID group, take the average of every n rows across all columns.
desired output:
steps = 3
df = pd.DataFrame({'ID': ['Child', 'Child', 'Baby', 'Baby'],
                   'income': [44000, 300, 2666.67, 3000],
                   'Height': [1.567, 2.0, 1.533, 0.8],
                   'Values': [3, 1, 3, 1]})
Where the rows are first grouped by ID and then the mean is taken over every 3 values within the same group. I added the Values column so that I can track how many rows went into each row's averages.
I have found similar questions but I cannot seem to combine them to solve my problem:
This question gives averages of every n rows.
This question covers pd.cut, which I might need as well; I just don't understand how the bins work.
How can I make this happen?
You can use a double groupby:
# set up secondary grouper (steps = 3, as in the question)
group = df.groupby('ID').cumcount().floordiv(steps)

# groupby + named aggregation
(df.groupby(['ID', group], as_index=False, sort=False)
   .agg(**{'income': ('income', 'mean'),
           'Height': ('Height', 'mean'),
           'Values': ('Height', 'count'),
           })
)
output:
      ID        income    Height  Values
0  Child  44000.000000  1.566667       3
1  Child    300.000000  2.000000       1
2   Baby   2666.666667  1.533333       3
3   Baby   3000.000000  0.800000       1
I am trying to transform the DataFrame below into a wide format: one row per id, with the symbol and value entries spread across multiple columns.
Here is the code to create the sample df
df = pd.DataFrame(data=[[1, 'A', 0, '2021-07-01'],
                        [1, 'B', 1, '2021-07-02'],
                        [2, 'D', 3, '2021-07-02'],
                        [2, 'C', 2, '2021-07-02'],
                        [2, 'E', 4, '2021-07-02']],
                  columns=['id', 'symbol', 'value', 'date'])
symbol_list = [['A', 'B', ''], ['C', 'D', 'E']]
The end result is grouped by the id field, with the symbol column turned into multiple columns whose ordering follows the user-supplied symbol_list.
I was using the .apply() method to construct each row of the result, but it takes a very long time for 10,000+ data points.
I am trying to find a more efficient way to do this transformation. I am thinking that I will need a pivot/unstack combined with resetting the index (to turn the category values into columns). Appreciate any help on this!
Use GroupBy.cumcount with DataFrame.unstack for the reshape, then extract the date columns with DataFrame.pop and take the row-wise max, flatten the column MultiIndex, and finally add the date column back with DataFrame.assign:
df = pd.DataFrame(data=[[1, 'A', 0, '2021-07-01'],
                        [1, 'B', 1, '2021-07-02'],
                        [2, 'D', 3, '2021-07-02'],
                        [2, 'C', 2, '2021-07-02'],
                        [2, 'E', 4, '2021-07-02']],
                  columns=['id', 'symbol', 'value', 'date'])

# IMPORTANT: all non-empty values from symbol_list occur in the symbol column
symbol_list = [['A', 'B', ''], ['C', 'D', 'E']]

order = [y for x in symbol_list for y in x if y]
print(order)
['A', 'B', 'C', 'D', 'E']

# convert symbol to an ordered Categorical, using the flattened list as categories
df['symbol'] = pd.Categorical(df['symbol'], ordered=True, categories=order)
df['date'] = pd.to_datetime(df['date'])

# sort by id and symbol
df = df.sort_values(['id', 'symbol'])

df1 = df.set_index(['id', df.groupby('id').cumcount()]).unstack()
date_max = df1.pop('date').max(axis=1)
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.assign(date=date_max)
print(df1)
   symbol_0 symbol_1 symbol_2  value_0  value_1  value_2       date
id
1         A        B      NaN      0.0      1.0      NaN 2021-07-02
2         C        D        E      2.0      3.0      4.0 2021-07-02
I have a dictionary of DataFrames (df_t). Each DataFrame has the same two columns but a different number of rows. I merged all the DataFrames into one DataFrame.
However, I would like to use df_A['label_1'] as a reference and drop df_B['label_2'] and df_C['label_3'], like in the example:
df_A = pd.DataFrame({'A': [1, 0.5, 1, 0.5],'label_1': ['-1', '1', '-1', '1']})
df_B = pd.DataFrame({'A': [1, 1.5, 2.3],'label_2': ['-1', '1','-1']})
df_C = pd.DataFrame({'A': [2.1, 5.5],'label_3': ['-1', '1']})
df_t = {'1': df_A, '2': df_B, '3': df_C}
#d = { k: v.set_index('label') for k, v in df_t.items()}
dfx = pd.concat(df_t, axis=1)
dfx.columns = dfx.columns.droplevel(0)
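If the goal is simply to keep label_1 from df_A and drop the other two label columns after the concat (this reading of the question is an assumption), a minimal sketch would be:
# drop the label columns coming from df_B and df_C, keeping df_A's label_1
dfx = dfx.drop(columns=['label_2', 'label_3'])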
I have two dataframes that look like this:
df_1 = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [100, 200, 300, 400],
    'C': [2, 3, 4, 5]
})
df_2 = pd.DataFrame({
    'B': [1.0, 2.0, 3.0, 4.0],
    'C': [100, 200, 300, 400],
    'D': [2, 3, 4, 5]
})
Now if I utilize pandas' .isin function I can do something nifty like this:
>>> df_2.columns.isin(df_1.columns)
array([ True,  True, False])
Columns B and C from df_2 exist in df_1 while D doesn't
My question is: does anyone know of a way to return the column labels for columns that exist in df_2 but not in df_1?
Something like this:
array(['D'], dtype=object)
Thank you in advance!
Pandas Index objects have set-like properties, so you can directly do:
df_2.columns.difference(df_1.columns)
Index([u'D'], dtype='object')
You can also use the &, | and ^ operators to compute intersection, union and symmetric difference (note that in recent pandas versions these operators are deprecated for set operations in favour of the named methods):
df_1.columns & df_2.columns
Index([u'B', u'C'], dtype='object')
df_1.columns | df_2.columns
Index([u'A', u'B', u'C', u'D'], dtype='object')
df_1.columns ^ df_2.columns
Index([u'A', u'D'], dtype='object')
There used to be a - operator for difference, now deprecated:
df_2.columns - df_1.columns
FutureWarning: using '-' to provide set differences with Indexes is deprecated, use .difference()
Index([u'D'], dtype='object')
Numpy solution with numpy.setdiff1d:
a = np.setdiff1d(df_2.columns, df_1.columns)
print (a)
['D']
Pandas solution with Index.difference:
a = df_2.columns.difference(df_1.columns)
print (a)
Index(['D'], dtype='object')
Other pandas methods are intersection, union and symmetric_difference:
print (df_2.columns.intersection(df_1.columns))
Index(['B', 'C'], dtype='object')
print (df_2.columns.union(df_1.columns))
Index(['A', 'B', 'C', 'D'], dtype='object')
print (df_2.columns.symmetric_difference(df_1.columns))
Index(['A', 'D'], dtype='object')
And numpy functions are intersect1d, union1d and setxor1d:
print (np.intersect1d(df_2.columns, df_1.columns))
['B' 'C']
print (np.union1d(df_2.columns, df_1.columns))
['A' 'B' 'C' 'D']
print (np.setxor1d(df_2.columns, df_1.columns))
['A' 'D']
Here it is, buddy:
set(df_2.columns).difference(df_1.columns)
Out[76]: {'D'}
You can use:
set(df_2.columns.values) - set(df_1.columns.values)
which returns a set containing column labels of columns in df_2 but not in df_1.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.random(36).reshape((9, 4)),
                  index=np.arange(1, 10),
                  columns=['A', 'B', 'C', 'D'])
te = pd.DataFrame(data=np.random.randint(low=1, high=10, size=1000),
                  columns=['Number_test'])
I want to join the two DataFrames so that each value in te's Number_test column is matched against the corresponding index label of df.
Use pandas.DataFrame.merge:
pd.merge(df, te, left_index=True, right_on='Number_test')
or
pd.merge(df.reset_index(), te, left_on='index', right_on='Number_test')
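As a rough sketch of an equivalent approach (assuming every Number_test value appears in df's index), DataFrame.join can match a column of the caller against the index of the other frame:
# join te's Number_test column against df's index; keeps te's row order
result = te.join(df, on='Number_test')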