Merging dataframes based on a column's values - Python

I have a dictionary of dataframes (df_t). Each dataframe has exactly the same two columns but a different number of rows. I merged all the dataframes into one dataframe.
However, I would like to use df_A['label_1'] as a reference and drop df_B['label_2'] and df_C['label_3'], as in the example:
import pandas as pd

df_A = pd.DataFrame({'A': [1, 0.5, 1, 0.5], 'label_1': ['-1', '1', '-1', '1']})
df_B = pd.DataFrame({'A': [1, 1.5, 2.3],'label_2': ['-1', '1','-1']})
df_C = pd.DataFrame({'A': [2.1, 5.5],'label_3': ['-1', '1']})
df_t = {'1': df_A, '2': df_B, '3': df_C}
#d = { k: v.set_index('label') for k, v in df_t.items()}
dfx = pd.concat(df_t, axis=1)
dfx.columns = dfx.columns.droplevel(0)
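No answer is recorded here, but one possible approach is sketched below (assuming the goal is to keep df_A's labels as the single reference column and drop the other label columns; the common column name 'label' is introduced here purely for illustration):
# Rename each label_* column to a common name, then concatenate.
dfs = {k: v.set_axis(['A', 'label'], axis=1) for k, v in df_t.items()}
dfx = pd.concat(dfs, axis=1)               # MultiIndex columns: (key, column)
labels = dfx[('1', 'label')]               # reference labels from df_A
dfx = dfx.xs('A', axis=1, level=1)         # keep only the A columns
dfx.columns = [f'A_{k}' for k in dfx.columns]
dfx['label_1'] = labels                    # attach the reference labels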

Related

Efficiently sort each and every row of a pandas data frame by reference to a predefined relation

I have a data frame where each row is some permutation of an (ordered) list of elements. A row cannot contain the same element twice, but it may contain none of them (empty strings fill the gaps). For example, if a row contains five values and the possible values are "alpha" through "epsilon", {"alpha", "beta", "", "", ""} is allowed, {"beta", "alpha", "", "", ""} is also allowed, but {"alpha", "alpha", "", "", ""} is not; by construction it cannot appear in the frame.
The rows of the data frame are therefore unordered. What I want is to sort each row according to a predefined relation, e.g. a dict. For example, the data frame may look like
import pandas as pd

yy = {
    'x1': ['alpha', '', 'beta', '', 'gamma'],
    'x2': ['', '', '', '', 'alpha'],
    'x3': ['', 'beta', '', 'alpha', ''],
}
df = pd.DataFrame(yy)
df
The given (= predefined) order is sort_order = {'alpha': 0, 'beta': 1, 'gamma': 2} and using this the desired output is
# Desired output
yy = {
    'x1': ['alpha', '', '', 'alpha', 'alpha'],
    'x2': ['', 'beta', 'beta', '', ''],
    'x3': ['', '', '', '', 'gamma']
}
df = pd.DataFrame(yy)
df
How is it possible to do that? My actual data frame is not really big, but it's still ~20K x 200, so it pays to (1) avoid looping over all rows with if-then statements that order each row within each loop iteration, and (2) pass all the columns at once rather than specifying something like [['x1', 'x2', ..., 'x200']].
First create a helper Series from the ordering, with the column names as its index. Then, per row, test membership with Series.isin and use Series.where to set non-matching values to the empty string:
sort_order = {'alpha': 0, 'beta': 1, 'gamma': 2}
s = pd.Series(list(sort_order), index=df.columns)
df = df.apply(lambda x: s.where(s.isin(x), ''), axis=1)
print (df)
      x1    x2     x3
0  alpha
1         beta
2         beta
3  alpha
4  alpha        gamma
Alternative solution with numpy.where:
import numpy as np

s = pd.Series(list(sort_order), index=df.columns)
df = pd.DataFrame(np.where(df.apply(lambda x: s.isin(x), axis=1), s, ''),
                  index=df.index,
                  columns=df.columns)
print (df)
      x1    x2     x3
0  alpha
1         beta
2         beta
3  alpha
4  alpha        gamma
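For a 20K x 200 frame both versions above still call apply once per row. A fully vectorized variant with plain numpy broadcasting is sketched below (it assumes, as in this example, that the columns correspond one-to-one to the keys of sort_order):
vals = np.array(list(sort_order))                       # ['alpha', 'beta', 'gamma']
mask = (df.to_numpy()[:, :, None] == vals).any(axis=1)  # mask[i, j]: row i contains vals[j]
df = pd.DataFrame(np.where(mask, vals, ''),
                  index=df.index, columns=df.columns)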

How can I add a column with dates (starting from today)?

How can I fill column C with dates starting from today? Column B is an example of what I want to get.
df = pd.DataFrame({'N': ['1', '2', '3', '4'],
                   'B': ['16.11.2021', '17.11.2021', '18.11.2021', '19.11.2021'],
                   'C': ['nan', 'nan', 'nan', 'nan']})
If I understood your question correctly, you want something like this:
import datetime
base = datetime.datetime.today()
date_list = [(base + datetime.timedelta(days=x)).strftime('%d.%m.%Y') for x in range(len(df))]
df['C'] = date_list
This produces consecutive dates starting from today, in the same format as column B.
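A pandas-native alternative is a one-liner with pd.date_range (a sketch; DatetimeIndex.strftime returns the formatted strings directly):
import pandas as pd

df['C'] = pd.date_range(start=pd.Timestamp.today().normalize(),
                        periods=len(df)).strftime('%d.%m.%Y')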

Map dictionary values to key values in a dataframe column

I have a dataframe:
values
NaN
NaN
[1,2,5]
[2]
[5]
And a dictionary
{nan: nan,
 '1': '10',
 '2': '11',
 '5': '12'}
The dataframe contains keys from the dictionary.
How can I replace these keys with the corresponding values from the same dictionary?
Output:
values
NaN
NaN
[10,11,12]
[11]
[12]
I have tried
so_df['values'].replace(my_dictionary, inplace=True)
so_df.head()
You can use the apply() method of the pandas DataFrame. Check the implementation below:
import pandas as pd
import numpy as np

df = pd.DataFrame([np.nan,
                   np.nan,
                   ['1', '2', '5'],
                   ['2'],
                   ['5']], columns=['values'])
my_dict = {np.nan: np.nan,
           '1': '10',
           '2': '11',
           '5': '12'}

def update(row):
    # Map each element of list cells; map scalar cells (e.g. NaN) directly.
    if isinstance(row['values'], list):
        row['values'] = [my_dict.get(val) for val in row['values']]
    else:
        row['values'] = my_dict.get(row['values'])
    return row

df = df.apply(lambda row: update(row), axis=1)
A simple implementation. Just make sure that if your dataframe contains strings, your dictionary keys are also strings.
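For the setup above, the result should look something like this (the mapped values stay strings, since the dictionary values are strings):
print(df['values'].tolist())
# [nan, nan, ['10', '11', '12'], ['11'], ['12']]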
Try:
df['values'] = (pd.to_numeric(df['values'].explode().astype(str).map(my_dict),
                              errors='coerce')
                .groupby(level=0)
                .agg(list))
Setup
import numpy as np

df = pd.DataFrame({'values': [np.nan, np.nan, [1, 2, 5], [2], 5]})
my_dict = {np.nan: np.nan, '1': '10', '2': '11', '5': '12'}
Use Series.explode with Series.map:
df['values'] = (df['values'].explode()
                            .astype(str)
                            .map(my_dict)
                            .dropna()
                            .astype(int)
                            .groupby(level=0)
                            .agg(list))
If there are other strings in your values column, you need pd.to_numeric with errors='coerce'; to keep the original values for rows where the mapping produces nothing, add fillna:
df['values'] = (pd.to_numeric(df['values'].explode()
                                          .astype(str)
                                          .replace(my_dict),
                              errors='coerce')
                .dropna()
                .groupby(level=0)
                .agg(list)
                .fillna(df['values']))
Output
         values
0           NaN
1           NaN
2  [10, 11, 12]
3          [11]
4          [12]
UPDATE
Solution without explode:
df['values'] = (pd.to_numeric(df['values'].apply(pd.Series)
                                          .stack()
                                          .reset_index(level=1, drop=True)
                                          .astype(str)
                                          .replace(my_dict),
                              errors='coerce')
                .dropna()
                .groupby(level=0)
                .agg(list)
                .fillna(df['values']))

Which Pandas function do I need? group_by or pivot

I'm still relatively new to Pandas and I can't tell which of the functions I'm best off using to get to my answer. I have looked at pivot, pivot_table, group_by and aggregate but I can't seem to get it to do what I require. Quite possibly user error, for which I apologise!
I have data like this (code to create the df):
import pandas as pd

df = pd.DataFrame([
    ['1', '1', 'A', 3, 7],
    ['1', '1', 'B', 2, 9],
    ['1', '1', 'C', 2, 9],
    ['1', '2', 'A', 4, 10],
    ['1', '2', 'B', 4, 0],
    ['1', '2', 'C', 9, 8],
    ['2', '1', 'A', 3, 8],
    ['2', '1', 'B', 10, 4],
    ['2', '1', 'C', 0, 1],
    ['2', '2', 'A', 1, 6],
    ['2', '2', 'B', 10, 2],
    ['2', '2', 'C', 10, 3]
], columns=['Field1', 'Field2', 'Type', 'Price1', 'Price2'])
print(df)
I am trying to reshape it so that each Field1/Field2 pair becomes one row and the Type values spread across the columns. My end goal will be to end up with one column for A, one for B and one for C, as A will use Price1 and B & C will use Price2.
I don't necessarily want the max, min, average or sum of the Price, as theoretically (although unlikely) there could be two different Price1 values for the same Fields & Type.
What's the best function to use in Pandas to get to what I need?
Use DataFrame.set_index with DataFrame.unstack to reshape. The output has a MultiIndex in the columns, so sort the second level with DataFrame.sort_index, flatten the column names, and finally turn the Field levels back into columns with reset_index:
df1 = (df.set_index(['Field1', 'Field2', 'Type'])
         .unstack(fill_value=0)
         .sort_index(axis=1, level=1))
df1.columns = [f'{b}-{a}' for a, b in df1.columns]
df1 = df1.reset_index()
print(df1)
  Field1 Field2  A-Price1  A-Price2  B-Price1  B-Price2  C-Price1  C-Price2
0      1      1         3         7         2         9         2         9
1      1      2         4        10         4         0         9         8
2      2      1         3         8        10         4         0         1
3      2      2         1         6        10         2        10         3
A solution with DataFrame.pivot_table is also possible, but it aggregates values (with mean by default) if the first three columns contain duplicates:
df2 = (df.pivot_table(index=['Field1', 'Field2'],
                      columns='Type',
                      values=['Price1', 'Price2'],
                      aggfunc='mean')
         .sort_index(axis=1, level=1))
df2.columns = [f'{b}-{a}' for a, b in df2.columns]
df2 = df2.reset_index()
print(df2)
Use pivot_table:
pd.pivot_table(df, values=['Price1', 'Price2'],
               index=['Field1', 'Field2'],
               columns='Type').reset_index()
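If the aggregation itself is the concern (the question notes there could be two different Price1 values for the same Fields & Type), pivot_table also accepts a non-reducing aggfunc such as list, which keeps duplicates instead of averaging them; a sketch:
df3 = df.pivot_table(index=['Field1', 'Field2'],
                     columns='Type',
                     values=['Price1', 'Price2'],
                     aggfunc=list).reset_index()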

pandas aggregate data from two data frames

I have two pandas data frames, with some indexes and some column names in common (like partially overlapping time-series related to common quantities).
I need to merge these two dataframes in a single one containing all the indexes and all the values for each index, keeping the values of the left (right) one in case an index-column combination appears in both data frames.
Both merge and join methods are unhelpful as the merge method will duplicate information I don't need and join causes the same problem.
What's an efficient method to obtain the result I need?
EDIT:
If for example I have the two data frames
df1 = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3],
    'C2': [2.1, 2.2, 2.3],
    'C3': [3.1, 3.2, 3.3]},
    index=['a', 'b', 'c'])
df2 = pd.DataFrame({
    'C3': [3.1, 3.2, 33.3],
    'C4': [4.1, 4.2, 4.3]},
    index=['b', 'c', 'd'])
What I need is a method that allows me to create:
merged = pd.DataFrame({
    'C1': [1.1, 1.2, 1.3, np.nan],
    'C2': [2.1, 2.2, 2.3, np.nan],
    'C3': [3.1, 3.2, 3.3, 33.3],
    'C4': [np.nan, 4.1, 4.2, 4.3]},
    index=['a', 'b', 'c', 'd'])
Here are three possibilities:
Use concat/groupby: First concatenate both DataFrames vertically. Then group by the index and select the first row in each group.
Use combine_first: Make a new index which is the union of df1 and df2. Reindex df1 using the new index. Then use combine_first to fill in NaNs with values from df2 (demonstrated after this list).
Use manual construction: We could use df2.index.difference(df1.index) to find exactly which rows need to be added to df1. So we could manually select those rows from df2 and concatenate them on to df1.
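Applied to the example data above, the combine_first route looks like this (values from df1 win wherever an index-column combination appears in both frames):
index = df1.index.union(df2.index)
merged = df1.reindex(index).combine_first(df2)
print(merged)
#     C1   C2    C3   C4
# a  1.1  2.1   3.1  NaN
# b  1.2  2.2   3.2  4.1
# c  1.3  2.3   3.3  4.2
# d  NaN  NaN  33.3  4.3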
For small DataFrames, using_concat is faster. For larger DataFrames, using_combine_first appears to be slightly faster than the other options:
import numpy as np
import pandas as pd
import perfplot

def make_dfs(N):
    df1 = pd.DataFrame(np.random.randint(10, size=(N, 2)))
    df2 = pd.DataFrame(np.random.randint(10, size=(N, 2)),
                       index=range(N//2, N//2 + N))
    return df1, df2

def using_concat(dfs):
    df1, df2 = dfs
    result = pd.concat([df1, df2], sort=False)
    n = result.index.nlevels
    return result.groupby(level=range(n)).first()

def using_combine_first(dfs):
    df1, df2 = dfs
    index = df1.index.union(df2.index)
    result = df1.reindex(index)
    result = result.combine_first(df2)
    return result

def using_manual_construction(dfs):
    df1, df2 = dfs
    index = df2.index.difference(df1.index)
    cols = df2.columns.difference(df1.columns)
    result = pd.concat([df1, df2.loc[index]], sort=False)
    result.loc[df2.index, cols] = df2
    return result

perfplot.show(
    setup=make_dfs,
    kernels=[using_concat, using_combine_first, using_manual_construction],
    n_range=[2**k for k in range(5, 21)],
    logx=True,
    logy=True,
    xlabel='len(df)')
Without seeing your code I can only give a generic answer:
To merge two dataframes, use
df3 = pd.merge(df1, df2, how='right', on=('col1', 'col2'))
or
a.merge(b, how='right', on=('c1', 'c2'))
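For this question's data specifically, the duplication mentioned in the question looks like this: an index-based outer merge keeps both copies of the shared column (a sketch):
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True)
# df3 has columns C1, C2, C3_x, C3_y, C4 -- the shared C3 is duplicated,
# which is why combine_first (shown above) fits this case better.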
