Pandas: moving data from two dataframes to another with tuple index

I have three dataframes like the following:
final_df
other ref
(2014-12-24 13:20:00-05:00, a) NaN NaN
(2014-12-24 13:40:00-05:00, b) NaN NaN
(2018-07-03 14:00:00-04:00, d) NaN NaN
ref_df
a b c d
2014-12-24 13:20:00-05:00 1 2 3 4
2014-12-24 13:40:00-05:00 2 3 4 5
2017-11-24 13:10:00-05:00 ..............
2018-07-03 13:25:00-04:00 ..............
2018-07-03 14:00:00-04:00 9 10 11 12
2019-07-03 13:10:00-04:00 ..............
other_df
a b c d
2014-12-24 13:20:00-05:00 10 20 30 40
2014-12-24 13:40:00-05:00 20 30 40 50
2017-11-24 13:10:00-05:00 ..............
2018-07-03 13:20:00-04:00 ..............
2018-07-03 13:25:00-04:00 ..............
2018-07-03 14:00:00-04:00 90 100 110 120
2019-07-03 13:10:00-04:00 ..............
And I need to replace the NaN values in my final_df with the corresponding values from the related dataframes, to end up with this:
other ref
(2014-12-24 13:20:00-05:00, a) 10 1
(2014-12-24 13:40:00-05:00, b) 30 3
(2018-07-03 14:00:00-04:00, d) 110 11
How can I do that?

pandas.DataFrame.lookup
final_df['ref'] = ref_df.lookup(*zip(*final_df.index))
final_df['other'] = other_df.lookup(*zip(*final_df.index))
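Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions the same label-pair lookup can be done with positional indexing; a minimal sketch, assuming the row and column labels exist and are unique:
# translate label pairs to positions, then pull values in one fancy-indexing step
rows, cols = zip(*final_df.index)
final_df['ref'] = ref_df.to_numpy()[ref_df.index.get_indexer(rows), ref_df.columns.get_indexer(cols)]
final_df['other'] = other_df.to_numpy()[other_df.index.get_indexer(rows), other_df.columns.get_indexer(cols)]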
map and get
For when some (date, column) pairs are missing from the source frames:
final_df['ref'] = list(map(ref_df.stack().get, final_df.index))
final_df['other'] = list(map(other_df.stack().get, final_df.index))
Demo
Setup
idx = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b'), (3, 'd')])
final_df = pd.DataFrame(index=idx, columns=['other', 'ref'])
ref_df = pd.DataFrame([
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [9, 10, 11, 12]
], [1, 2, 3], ['a', 'b', 'c', 'd'])
other_df = pd.DataFrame([
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [90, 100, 110, 120]
], [1, 2, 3], ['a', 'b', 'c', 'd'])
print(final_df, ref_df, other_df, sep='\n\n')
other ref
1 a NaN NaN
2 b NaN NaN
3 d NaN NaN
a b c d
1 1 2 3 4
2 2 3 4 5
3 9 10 11 12
a b c d
1 10 20 30 40
2 20 30 40 50
3 90 100 110 120
Result
final_df['ref'] = ref_df.lookup(*zip(*final_df.index))
final_df['other'] = other_df.lookup(*zip(*final_df.index))
final_df
other ref
1 a 10 1
2 b 30 3
3 d 120 12

Another solution that can work with missing dates in ref_df and other_df:
index = pd.MultiIndex.from_tuples(final_df.index)
ref = ref_df.stack().rename('ref')
other = other_df.stack().rename('other')
result = pd.DataFrame(index=index).join(ref).join(other)
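Equivalently, because the stacked frames are Series indexed by (date, label) pairs, reindex pulls the same values directly (a sketch; missing pairs come back as NaN):
index = pd.MultiIndex.from_tuples(final_df.index)
final_df['ref'] = ref_df.stack().reindex(index).to_numpy()
final_df['other'] = other_df.stack().reindex(index).to_numpy()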

Related

Using pandas cut function with groupby and group-specific bins

I have the following sample DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'Tag': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
                   'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18, 21, 1, 2],
                   'Value': [1, 13, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
to which I add the percentage of the Value using
df['Percent_value'] = df['Value'].rank(method='dense', pct=True)
and add the Order using pd.cut() with pre-defined percentage bins
percentage = np.array([10, 20, 50, 70, 100])/100
df['Order'] = pd.cut(df['Percent_value'], bins=np.insert(percentage, 0, 0), labels = [1,2,3,4,5])
which gives
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 2
5 B 9 3 0.230769 3
6 B 4 4 0.307692 3
7 C 13 5 0.384615 3
8 C 6 6 0.461538 3
9 C 18 7 0.538462 4
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 5
My Question
Now, instead of a single percentage array (bins) for all Tags (groups), I have a separate percentage array for each Tag group, i.e. A, B and C. How can I apply df.groupby('Tag') and then pd.cut() using different percentage bins for each group, taken from the following dictionary? Is there a direct way that avoids the for loop I use below?
percentages = {'A': np.array([10, 20, 50, 70, 100])/100,
               'B': np.array([20, 40, 60, 90, 100])/100,
               'C': np.array([30, 50, 60, 80, 100])/100}
Desired outcome (Note: Order is now computed for each Tag using different bins):
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
My Attempt
orders = []
for k, g in df.groupby('Tag'):  # scalar key, so percentages[k] works directly
    percentage = percentages[k]
    g['Order'] = pd.cut(g['Percent_value'], bins=np.insert(percentage, 0, 0), labels=[1, 2, 3, 4, 5])
    orders.append(g)
df_final = pd.concat(orders, axis=0, join='outer')
You can apply pd.cut within groupby:
df['Order'] = df.groupby('Tag').apply(
    lambda x: pd.cut(x['Percent_value'], bins=np.insert(percentages[x.name], 0, 0), labels=[1, 2, 3, 4, 5])
).reset_index(drop=True)
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
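One caveat: the final reset_index(drop=True) assumes df is already ordered by Tag, because groupby emits the groups in sorted key order. A variant that aligns on the original row index instead, so the input order does not matter (a sketch):
df['Order'] = pd.concat([
    pd.cut(g['Percent_value'], bins=np.insert(percentages[k], 0, 0), labels=[1, 2, 3, 4, 5])
    for k, g in df.groupby('Tag')
])  # assignment aligns each group's result back on df's index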

Grouping the range of intervals based on 2 columns

I am a geologist needing to clean up data.
I have a .csv file containing drilling intervals, that I imported as a pandas dataframe that looks like this:
hole_name from to interval_type
0 A 0 1 Gold
1 A 1 2 Gold
2 A 2 4 Inferred_fault
3 A 4 6 NaN
4 A 6 7 NaN
5 A 7 8 NaN
6 A 8 9 Inferred_fault
7 A 9 10 NaN
8 A 10 11 Inferred_fault
9 B2 11 12 Inferred_fault
10 B2 12 13 Inferred_fault
11 B2 13 14 NaN
For each individual "hole_name", I would like to group/merge the "from" and "to" range for consecutive intervals associated with the same "interval_type". The NaN values can be dropped, they are of no use to me (but I already know how to do this, so it is fine).
Based on the example above, I would like to get something like this:
hole_name from to interval_type
0 A 0 2 Gold
2 A 2 4 Inferred_fault
3 A 4 8 NaN
6 A 8 9 Inferred_fault
7 A 9 10 NaN
8 A 10 11 Inferred_fault
9 B2 11 13 Inferred_fault
11 B2 13 14 NaN
I have looked around and tried to use groupby or pyranges but cannot figure out how to do this...
Thanks a lot in advance for your help!
This should do the trick:
import pandas as pd
import numpy as np
from itertools import groupby
# create dataframe
data = {
    'hole_name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'from': [0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13],
    'to': [1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    'interval_type': ['Gold', 'Gold', 'Inferred_fault', np.nan, np.nan, np.nan,
                      'Inferred_fault', np.nan, 'Inferred_fault', 'Inferred_fault',
                      'Inferred_fault', np.nan]
}
df = pd.DataFrame(data=data)
# create auxiliary column that gives each run of repeated consecutive values its own id
grouped = [list(g) for k, g in groupby(list(zip(df.hole_name.tolist(), df.interval_type.tolist())))]
df['interval_type_id'] = np.repeat(range(len(grouped)), [len(x) for x in grouped]) + 1
# aggregate results
cols = df.columns[:-1]
vals = []
for idx, group in df.groupby(['interval_type_id', 'hole_name']):
    vals.append([group['hole_name'].iloc[0], group['from'].min(), group['to'].max(), group['interval_type'].iloc[0]])
result = pd.DataFrame(data=vals, columns=cols)
result
result should be:
hole_name from to interval_type
A 0 2 Gold
A 2 4 Inferred_fault
A 4 8
A 8 9 Inferred_fault
A 9 10
A 10 11 Inferred_fault
B 11 13 Inferred_fault
B 13 14
EDIT: added hole_name to the groupby function.
You can first build an indicator column for grouping; interval_type is filled with '' first because NaN never compares equal to itself, which would otherwise split consecutive NaN rows into separate groups. Then use agg to merge each sub group and get from and to.
(
    df.assign(ind=df.interval_type.fillna(''))
      .assign(ind=lambda x: x.ind.ne(x.ind.shift(1).bfill()).cumsum())
      .groupby(['hole_name', 'ind'])
      .agg({'from': 'first', 'to': 'last', 'interval_type': 'first'})
      .reset_index()
      .drop(columns='ind')  # drop('ind', 1) in older pandas; the positional axis argument was removed in 2.0
)
hole_name from to interval_type
0 A 0 2 Gold
1 A 2 4 Inferred_fault
2 A 4 8 NaN
3 A 8 9 Inferred_fault
4 A 9 10 NaN
5 A 10 11 Inferred_fault
6 B 11 13 Inferred_fault
7 B 13 14 NaN
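One assumption both answers share: the rows must already be ordered by depth within each hole, so that consecutive runs (and the 'first'/'last' aggregations) see the intervals in order. If that is not guaranteed, sorting with df.sort_values(['hole_name', 'from']) beforehand is a safe first step.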

How to fill NAs with median of means of 2-column groupby in pandas?

Working with pandas, I have a dataframe with two hierarchies A and B, where B can be NaN, and I want to fill some NaNs in D in a particular way:
In the example below, A has "B-subgroups" where there are no values at all for D (e.g. (1, 1)), while A also has values for D in other subgroups (e.g. (1, 3)).
Now I want to get the mean of each subgroup (120, 90 and 75 for A==1), find the median of these means (90 for A==1) and use this median to fill NaNs in the other subgroups of A==1.
Groups like A==2, where there are only NaNs for D, should not be filled.
Groups like A==3, where there are some values for D but only rows with B being NaN have NaN in D, should not be filled if possible (I intend to fill these later with the mean of all values of D of their whole A groups).
Example df:
d = {'A': [1, 1, 1, 1, 1, 1, 1, 2, 3, 3, 3],
     'B': [1, 2, 3, 3, 4, 5, 6, 1, 1, np.nan, np.nan],
     'D': [np.nan, np.nan, 120, 120, 90, 75, np.nan, np.nan, 60, 50, np.nan]}
df = pd.DataFrame(data=d)
A B D
1 1 NaN
1 2 NaN
1 3 120
1 3 120
1 4 90
1 5 75
1 6 NaN
2 1 NaN
3 1 60
3 NaN 50
3 NaN NaN
Expected result:
A B D
1 1 90
1 2 90
1 3 120
1 3 120
1 4 90
1 5 75
1 6 90
2 1 NaN
3 1 60
3 NaN 50
3 NaN NaN
With df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median') or .median() I seem to get the right values, but using
df['D'] = df['D'].fillna(
    df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
)
does not seem to change any values in D.
Any help is greatly appreciated, I've been stuck on this for a while and cannot find any solution anywhere.
Your first step is correct; the fillna call does nothing because the medians Series is indexed by A while df has a plain RangeIndex, so the two never align. After computing the medians, we use Series.map to broadcast the correct median to each group in column A.
Finally we use np.where to fill in column D conditionally, only where B is not NaN:
medians = df.groupby(['A', 'B'])['D'].mean().groupby(['A']).agg('median')
df['D'] = np.where(df['B'].notna(),                      # if B is not NaN
                   df['D'].fillna(df['A'].map(medians)), # fill in the median
                   df['D'])                              # else keep the value of column D
A B D
0 1 1.00 90.00
1 1 2.00 90.00
2 1 3.00 120.00
3 1 3.00 120.00
4 1 4.00 90.00
5 1 5.00 75.00
6 1 6.00 90.00
7 2 1.00 nan
8 3 1.00 60.00
9 3 nan 50.00
10 3 nan nan
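For reference, the same conditional fill can be written with Series.mask, which replaces values only where the condition holds (a sketch equivalent to the np.where version above):
medians = df.groupby(['A', 'B'])['D'].mean().groupby('A').median()
# fill only rows where B is present and D is missing
df['D'] = df['D'].mask(df['B'].notna() & df['D'].isna(), df['A'].map(medians))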

Pandas Dataframe Merging Issue

I need to merge the following 2 dataframes:
df1:
A B C D F
0 1 a zz 10 11
1 1 a zz 15 11
2 2 b yy 20 12
3 3 c xx 30 13
4 4 d ww 40 14
5 5 e vv 50 15
6 6 f uu 60 16
7 7 g NaN 70 17
8 8 h ss 80 18
9 9 NaN rr 90 19
10 13 m nn 130 113
11 15 o ll 150 115
df2:
A B C D G
0 1 NaN zz 15 100
1 6 f uu 60 600
2 7 g tt 70 700
3 10 j qq 100 1000
4 12 l NaN 120 1200
5 14 n NaN 140 1400
The merged dataframe should be:
A B C D F G
0 1 a zz 10 11 None
1 1 a zz 15 11 100
2 2 b yy 20 12 None
3 3 c xx 30 13 None
4 4 d ww 40 14 None
5 5 e vv 50 15 None
6 6 f uu 60 16 600
7 7 g tt 70 17 700
8 8 h ss 80 18 None
9 9 NaN rr 90 19 None
10 13 m nn 130 113 None
11 15 o ll 150 115 None
12 10 j qq 100 None 1000
13 12 l NaN 120 None 1200
14 14 n NaN 140 None 1400
Following is the code to generate df1 and df2:
df1 = pd.DataFrame({'A': [1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 13, 15],
                    'B': ['a', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', np.nan, 'm', 'o'],
                    'C': ['zz', 'zz', 'yy', 'xx', 'ww', 'vv', 'uu', np.nan, 'ss', 'rr', 'nn', 'll'],
                    'D': [10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 130, 150],
                    'F': [11, 11, 12, 13, 14, 15, 16, 17, 18, 19, 113, 115]})
df2 = pd.DataFrame({'A': [1, 6, 7, 10, 12, 14],
                    'B': [np.nan, 'f', 'g', 'j', 'l', 'n'],
                    'C': ['zz', 'uu', 'tt', 'qq', np.nan, np.nan],
                    'D': [15, 60, 70, 100, 120, 140],
                    'G': [100, 600, 700, 1000, 1200, 1400]})
I tried the following methods:
md1 = df1.merge(df2, how='outer')
md2 = df1.merge(df2, how='outer', on=['A', 'D'])
md3 = df1.merge(df2, how='outer', left_on=['A', 'D'], right_on=['A', 'D'])
md4 = df1.merge(df2, how='outer', left_on=['A', 'B', 'C', 'D'], right_on=['A', 'B', 'C', 'D'])
Following are the results of md1 and md4 (same result):
print(md1.to_string())
A B C D F G
0 1 a zz 10 11.0 NaN
1 1 a zz 15 11.0 NaN
2 2 b yy 20 12.0 NaN
3 3 c xx 30 13.0 NaN
4 4 d ww 40 14.0 NaN
5 5 e vv 50 15.0 NaN
6 6 f uu 60 16.0 600.0
7 7 g NaN 70 17.0 NaN
8 8 h ss 80 18.0 NaN
9 9 NaN rr 90 19.0 NaN
10 13 m nn 130 113.0 NaN
11 15 o ll 150 115.0 NaN
12 1 NaN zz 15 NaN 100.0
13 7 g tt 70 NaN 700.0
14 10 j qq 100 NaN 1000.0
15 12 l NaN 120 NaN 1200.0
16 14 n NaN 140 NaN 1400.0
Following are the results of md2 and md3 (same result):
print(md2.to_string())
A B_x C_x D F B_y C_y G
0 1 a zz 10 11.0 NaN NaN NaN
1 1 a zz 15 11.0 NaN zz 100.0
2 2 b yy 20 12.0 NaN NaN NaN
3 3 c xx 30 13.0 NaN NaN NaN
4 4 d ww 40 14.0 NaN NaN NaN
5 5 e vv 50 15.0 NaN NaN NaN
6 6 f uu 60 16.0 f uu 600.0
7 7 g NaN 70 17.0 g tt 700.0
8 8 h ss 80 18.0 NaN NaN NaN
9 9 NaN rr 90 19.0 NaN NaN NaN
10 13 m nn 130 113.0 NaN NaN NaN
11 15 o ll 150 115.0 NaN NaN NaN
12 10 NaN NaN 100 NaN j qq 1000.0
13 12 NaN NaN 120 NaN l NaN 1200.0
14 14 NaN NaN 140 NaN n NaN 1400.0
But none of the above results is what I need from the merge operation!
So, I wrote a function to get what I want:
def merge_df(d1, d2, on_columns):
    d1_row_count = d1.shape[0]
    d2_row_count = d2.shape[0]
    d1_columns = list(d1.columns)
    d2_columns = list(d2.columns)
    extra_columns_in_d1 = []
    extra_columns_in_d2 = []
    common_columns = []
    for c in d1_columns:
        if c not in d2_columns:
            extra_columns_in_d1.append(c)
        else:
            common_columns.append(c)
    for c in d2_columns:
        if c not in d1_columns:
            extra_columns_in_d2.append(c)
    print(common_columns)
    # start with the merged dataframe equal to d1
    md = d1.copy(deep=True)
    # append the extra columns to md (with None values in the newly appended columns)
    for c in extra_columns_in_d2:
        md[c] = [None] * d1_row_count
    d1_new_row_number = d1_row_count
    # iterate through each row in d2
    for i in range(d2_row_count):
        # create the match query string
        d1_match_condition = ''
        for p, c in enumerate(on_columns):
            d1_match_condition += c + ' == ' + str(d2.loc[i, c])
            if p < (len(on_columns) - 1):
                d1_match_condition += ' and '
        match_in_d1 = d1.query(d1_match_condition)
        # if no match is found, append the row
        if match_in_d1.shape[0] == 0:
            # build a list representing the row to append
            row_list = []
            for c in common_columns:
                row_list.append(d2.loc[i, c])
            for c in extra_columns_in_d1:
                row_list.append(None)
            for c in extra_columns_in_d2:
                row_list.append(d2.loc[i, c])
            md.loc[d1_new_row_number] = row_list
            d1_new_row_number += 1
        # if a match is found, modify the found row
        else:
            match_in_d1_index = list(match_in_d1.index)[0]
            for c in common_columns:
                if (md.loc[match_in_d1_index, c]) is None or (md.loc[match_in_d1_index, c]) is np.nan:
                    md.loc[match_in_d1_index, c] = d2.loc[i, c]
            for c in extra_columns_in_d2:
                md.loc[match_in_d1_index, c] = d2.loc[i, c]
    return md
When I use this function, I get the desired merged dataframe:
md5 = merge_df(df1, df2, ['A', 'D'])
Am I missing something basic with the inbuilt dataframe merge method to get the desired result?
You could merge first, then use .assign and .combine_first. The duplicated columns produced by the merge need to be put back together correctly: take the value from the right-hand frame and update it with the left-hand frame's value wherever the left frame has an entry at that position. That is exactly what .combine_first does.
m = pd.merge(df1, df2, on=['A', 'D'], how='outer')
m.assign(B=m['B_x'].combine_first(m['B_y']), C=m['C_x'].combine_first(m['C_y']))\
 .drop(columns=['B_x', 'C_x', 'B_y', 'C_y'])[['A', 'B', 'C', 'D', 'F', 'G']]
result
A B C D F G
0 1 a zz 10 11.0 NaN
1 1 a zz 15 11.0 100.0
2 2 b yy 20 12.0 NaN
3 3 c xx 30 13.0 NaN
4 4 d ww 40 14.0 NaN
5 5 e vv 50 15.0 NaN
6 6 f uu 60 16.0 600.0
7 7 g tt 70 17.0 700.0
8 8 h ss 80 18.0 NaN
9 9 NaN rr 90 19.0 NaN
10 13 m nn 130 113.0 NaN
11 15 o ll 150 115.0 NaN
12 10 j qq 100 NaN 1000.0
13 12 l NaN 120 NaN 1200.0
14 14 n NaN 140 NaN 1400.0
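As an aside (a sketch, not taken from the question): when A and D together identify a row, the whole merge-and-patch can be collapsed into a single combine_first on an (A, D) index, at the cost of different row and column ordering:
merged = (df1.set_index(['A', 'D'])
             .combine_first(df2.set_index(['A', 'D']))  # df1 values win, df2 fills the gaps
             .reset_index())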
You have the wrong format for the merge operation. Try the following code:
result = df1.merge(df2,on=['A','D'], how='outer')
Try this:
df1 = df1.merge(df2, on=['A', 'D'], how='outer')
# pd.isna is more reliable than "x is np.nan", which depends on object identity
df1['C'] = df1[['C_x', 'C_y']].apply(lambda x: x['C_y'] if pd.isna(x['C_x']) else x['C_x'], axis=1)
df1['B'] = df1[['B_x', 'B_y']].apply(lambda x: x['B_y'] if pd.isna(x['B_x']) else x['B_x'], axis=1)
df1 = df1.drop(labels=['B_x', 'B_y', 'C_x', 'C_y'], axis=1)

Pandas code to PySpark with groupby operations

In pandas I managed to write the following transformation, which splits each non-null value evenly across itself and the null values that follow it, turning
[100, None, None, 40, None, 120]
into
[33.33, 33.33, 33.33, 20, 20, 120]
Thanks to the solution given here, I managed to produce the following code for my specific task:
cols = ['CUSTOMER', 'WEEK', 'PRODUCT_ID']
colsToSplit = ['VOLUME', 'REVENUE']
df = pd.concat([
    d.asfreq('W')
    for _, d in df.set_index('WEEK').groupby(['CUSTOMER', 'PRODUCT_ID'])
]).reset_index()
df[cols] = df[cols].ffill()
df['nb_nan'] = df.groupby(['CUSTOMER', 'PRODUCT_ID', df['VOLUME'].notnull().cumsum()])['VOLUME'].transform('size')
df[colsToSplit] = df.groupby(['CUSTOMER', 'PRODUCT_ID'])[colsToSplit].ffill()[colsToSplit].div(df.nb_nan, axis=0)
df
My full dataframe looks like this:
df = pd.DataFrame(map(list, zip(*[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
                                  ['2018-01-14', '2018-01-28', '2018-01-14', '2018-01-28', '2018-01-14', '2018-02-04', '2018-02-11', '2018-01-28', '2018-02-11'],
                                  [1, 1, 2, 2, 1, 1, 1, 3, 3],
                                  [50, 44, 22, 34, 42, 41, 43, 12, 13],
                                  [15, 14, 6, 11, 14, 13.5, 13.75, 3, 3.5]])),
                  columns=['CUSTOMER', 'WEEK', 'PRODUCT_ID', 'VOLUME', 'REVENUE'])
df
Out[16]:
CUSTOMER WEEK PRODUCT_ID VOLUME REVENUE
0 a 2018-01-14 1 50 15.00
1 a 2018-01-28 1 44 14.00
2 a 2018-01-14 2 22 6.00
3 a 2018-01-28 2 34 11.00
4 b 2018-01-14 1 42 14.00
5 b 2018-02-04 1 41 13.50
6 b 2018-02-11 1 43 13.75
7 c 2018-01-28 3 12 3.00
8 c 2018-02-11 3 13 3.50
In this case, for example, the result would be:
CUSTOMER WEEK PRODUCT_ID VOLUME REVENUE
a 2018-01-14 1 25 7.50
a 2018-01-21 1 25 7.50
a 2018-01-28 1 44 14.00
a 2018-01-14 2 11 3.00
a 2018-01-21 2 11 3.00
a 2018-01-28 2 34 11.00
b 2018-01-14 1 14 4.67
b 2018-01-21 1 14 4.67
b 2018-01-28 1 14 4.67
b 2018-02-04 1 41 13.50
b 2018-02-11 1 43 13.75
c 2018-01-28 3 6 1.50
c 2018-02-04 3 6 1.50
c 2018-02-11 3 13 3.50
Sadly, my dataframe is way too big for further use and joins with other datasets, so I would like to move this to Spark. I checked out many tutorials that cover most of these steps in PySpark, but none of them really showed how to include the groupby part: I found how to do a transform('size'), but not a df.groupby(...).transform('size'), nor how to combine all of my steps.
Is there a tool that can translate pandas code to PySpark? Otherwise, could I get a hint on how to translate this piece of code? Thanks, maybe I'm just overcomplicating this.
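No answer is recorded here, but for the df.groupby(...).transform('size') step specifically, the usual PySpark equivalent is a pair of window functions: a running count of non-null values reproduces notnull().cumsum(), and a count over that group id reproduces transform('size'). A minimal sketch, assuming a Spark DataFrame sdf with the same columns (sdf and this translation are illustrative, not from a posted answer):
from pyspark.sql import functions as F, Window

# running count of non-null VOLUME rows per (CUSTOMER, PRODUCT_ID),
# i.e. the group id that df['VOLUME'].notnull().cumsum() produces in pandas
cum_w = (Window.partitionBy('CUSTOMER', 'PRODUCT_ID')
               .orderBy('WEEK')
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
sdf = sdf.withColumn('grp', F.sum(F.col('VOLUME').isNotNull().cast('int')).over(cum_w))

# size of each run, i.e. the groupby(...).transform('size') step
size_w = Window.partitionBy('CUSTOMER', 'PRODUCT_ID', 'grp')
sdf = sdf.withColumn('nb_nan', F.count(F.lit(1)).over(size_w))

# the forward fill itself would then be F.last(F.col('VOLUME'), ignorenulls=True).over(cum_w),
# divided by nb_nan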
