Using pandas cut function with groupby and group-specific bins - python

I have the following sample DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'Tag': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
                   'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18, 21, 1, 2],
                   'Value': [1, 13, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
to which I add the percentile rank of Value using
df['Percent_value'] = df['Value'].rank(method='dense', pct=True)
and add the Order using pd.cut() with pre-defined percentage bins
percentage = np.array([10, 20, 50, 70, 100])/100
df['Order'] = pd.cut(df['Percent_value'], bins=np.insert(percentage, 0, 0), labels = [1,2,3,4,5])
which gives
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 2
5 B 9 3 0.230769 3
6 B 4 4 0.307692 3
7 C 13 5 0.384615 3
8 C 6 6 0.461538 3
9 C 18 7 0.538462 4
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 5
My Question
Now, instead of a single percentage array (bins) for all Tags (groups), I have a separate percentage array for each Tag group, i.e. A, B and C. How can I apply df.groupby('Tag') and then pd.cut() with different percentage bins for each group, taken from the dictionary below? Is there a direct way that avoids the for loop in my attempt?
percentages = {'A': np.array([10, 20, 50, 70, 100])/100,
               'B': np.array([20, 40, 60, 90, 100])/100,
               'C': np.array([30, 50, 60, 80, 100])/100}
Desired outcome (Note: Order is now computed for each Tag using different bins):
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
My Attempt
orders = []
for k, g in df.groupby('Tag'):
    percentage = percentages[k]
    g['Order'] = pd.cut(g['Percent_value'], bins=np.insert(percentage, 0, 0), labels=[1, 2, 3, 4, 5])
    orders.append(g)
df_final = pd.concat(orders, axis=0, join='outer')

You can apply pd.cut within groupby,
df['Order'] = (df.groupby('Tag')
                 .apply(lambda x: pd.cut(x['Percent_value'],
                                         bins=np.insert(percentages[x.name], 0, 0),
                                         labels=[1, 2, 3, 4, 5]))
                 .reset_index(drop=True))
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
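A hedged, index-aligned variant of the same idea (it assumes the df and percentages dictionary from the question): returning a Series per group and assigning it back lets pandas align on the index, so the result does not depend on the row order produced by apply.
def cut_group(g):
    # bins for this Tag, with a leading 0 edge as in the question
    bins = np.insert(percentages[g.name], 0, 0)
    return pd.cut(g['Percent_value'], bins=bins, labels=[1, 2, 3, 4, 5])

df['Order'] = df.groupby('Tag', group_keys=False).apply(cut_group)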

Related

Reshape data frame, so the index column values become the columns

I want to reshape the data so that the values in the index column become the columns
My Data frame:
Gender_Male Gender_Female Location_london Location_North Location_South
Cat
V 5 4 4 2 3
W 15 12 12 7 8
X 11 15 16 4 6
Y 22 18 21 9 9
Z 8 7 7 4 4
Desired Data frame:
Is there an easy way to do this? I also have 9 other categorical variables in my data set in addition to the Gender and Location variables. I have only included two variables to keep the example simple.
Code to create the example dataframe:
df1 = pd.DataFrame({
    'Cat': ['V', 'W', 'X', 'Y', 'Z'],
    'Gender_Male': [5, 15, 11, 22, 8],
    'Gender_Female': [4, 12, 15, 18, 7],
    'Location_london': [4, 12, 16, 21, 7],
    'Location_North': [2, 7, 4, 9, 4],
    'Location_South': [3, 8, 6, 9, 4]
}).set_index('Cat')
df1
You can transpose the dataframe and then split and set the new index:
Transpose
dft = df1.T
print(dft)
Cat V W X Y Z
Gender_Male 5 15 11 22 8
Gender_Female 4 12 15 18 7
Location_london 4 12 16 21 7
Location_North 2 7 4 9 4
Location_South 3 8 6 9 4
Split and set the new index
dft.index = dft.index.str.split('_', expand=True)
dft.columns.name = None
print(dft)
V W X Y Z
Gender Male 5 15 11 22 8
Female 4 12 15 18 7
Location london 4 12 16 21 7
North 2 7 4 9 4
South 3 8 6 9 4
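With the split MultiIndex in place you can also pull out any one variable's block (a small usage sketch, assuming the dft built above):
print(dft.loc['Gender'])    # only the Gender_* rows, indexed by Male/Female
print(dft.loc['Location'])  # only the Location_* rows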

Grouping the range of intervals based on 2 columns

I am a geologist needing to clean up data.
I have a .csv file containing drilling intervals, that I imported as a pandas dataframe that looks like this:
hole_name from to interval_type
0 A 0 1 Gold
1 A 1 2 Gold
2 A 2 4 Inferred_fault
3 A 4 6 NaN
4 A 6 7 NaN
5 A 7 8 NaN
6 A 8 9 Inferred_fault
7 A 9 10 NaN
8 A 10 11 Inferred_fault
9 B2 11 12 Inferred_fault
10 B2 12 13 Inferred_fault
11 B2 13 14 NaN
For each individual "hole_name", I would like to group/merge the "from" and "to" range for consecutive intervals associated with the same "interval_type". The NaN values can be dropped, they are of no use to me (but I already know how to do this, so it is fine).
Based on the example above, I would like to get something like this:
hole_name from to interval_type
0 A 0 2 Gold
2 A 2 4 Inferred_fault
3 A 4 8 NaN
6 A 8 9 Inferred_fault
7 A 9 10 NaN
8 A 10 11 Inferred_fault
9 B2 11 13 Inferred_fault
11 B2 13 14 NaN
I have looked around and tried to use groupby or pyranges but cannot figure out how to do this...
Thanks a lot in advance for your help!
This should do the trick:
import pandas as pd
import numpy as np
from itertools import groupby

# create dataframe
data = {
    'hole_name': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'from': [0, 1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13],
    'to': [1, 2, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    'interval_type': ['Gold', 'Gold', 'Inferred_fault', np.nan, np.nan, np.nan,
                      'Inferred_fault', np.nan, 'Inferred_fault', 'Inferred_fault',
                      'Inferred_fault', np.nan]
}
df = pd.DataFrame(data=data)

# create auxiliary column that groups repeated consecutive values
grouped = [list(g) for k, g in groupby(list(zip(df.hole_name.tolist(), df.interval_type.tolist())))]
df['interval_type_id'] = np.repeat(range(len(grouped)), [len(x) for x in grouped]) + 1

# aggregate results
cols = df.columns[:-1]
vals = []
for idx, group in df.groupby(['interval_type_id', 'hole_name']):
    vals.append([group['hole_name'].iloc[0], group['from'].min(), group['to'].max(), group['interval_type'].iloc[0]])
result = pd.DataFrame(data=vals, columns=cols)
result
result should be:
hole_name from to interval_type
A 0 2 Gold
A 2 4 Inferred_fault
A 4 8
A 8 9 Inferred_fault
A 9 10
A 10 11 Inferred_fault
B 11 13 Inferred_fault
B 13 14
EDIT: added hole_name to the groupby function.
You can first build an indicator column for grouping. Then use agg to merge the sub groups to get from and to.
(
    df.assign(ind=df.interval_type.fillna(''))
      .assign(ind=lambda x: x.ind.ne(x.ind.shift(1).bfill()).cumsum())
      .groupby(['hole_name', 'ind'])
      .agg({'from': 'first', 'to': 'last', 'interval_type': 'first'})
      .reset_index()
      .drop(columns='ind')
)
hole_name from to interval_type
0 A 0 2 Gold
1 A 2 4 Inferred_fault
2 A 4 8 NaN
3 A 8 9 Inferred_fault
4 A 9 10 NaN
5 A 10 11 Inferred_fault
6 B 11 13 Inferred_fault
7 B 13 14 NaN
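To see how the indicator column works (a small sketch, assuming the df built in the first answer): consecutive rows with the same interval_type share one group id, and the id increments whenever the value changes.
s = df.interval_type.fillna('')
ind = s.ne(s.shift(1).bfill()).cumsum()
print(pd.concat([df[['hole_name', 'interval_type']], ind.rename('ind')], axis=1))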

Pandas: moving data from two dataframes to another with tuple index

I have three dataframes like the following:
final_df
other ref
(2014-12-24 13:20:00-05:00, a) NaN NaN
(2014-12-24 13:40:00-05:00, b) NaN NaN
(2018-07-03 14:00:00-04:00, d) NaN NaN
ref_df
a b c d
2014-12-24 13:20:00-05:00 1 2 3 4
2014-12-24 13:40:00-05:00 2 3 4 5
2017-11-24 13:10:00-05:00 ..............
2018-07-03 13:25:00-04:00 ..............
2018-07-03 14:00:00-04:00 9 10 11 12
2019-07-03 13:10:00-04:00 ..............
other_df
a b c d
2014-12-24 13:20:00-05:00 10 20 30 40
2014-12-24 13:40:00-05:00 20 30 40 50
2017-11-24 13:10:00-05:00 ..............
2018-07-03 13:20:00-04:00 ..............
2018-07-03 13:25:00-04:00 ..............
2018-07-03 14:00:00-04:00 90 100 110 120
2019-07-03 13:10:00-04:00 ..............
And I need to replace the NaN values in my final_df with the corresponding values from the related dataframes, like this:
other ref
(2014-12-24 13:20:00-05:00, a) 10 1
(2014-12-24 13:40:00-05:00, b) 30 3
(2018-07-03 14:00:00-04:00, d) 120 12
How can I get it?
pandas.DataFrame.lookup
final_df['ref'] = ref_df.lookup(*zip(*final_df.index))
final_df['other'] = other_df.lookup(*zip(*final_df.index))
map and get
For when you have missing bits
final_df['ref'] = list(map(ref_df.stack().get, final_df.index))
final_df['other'] = list(map(other_df.stack().get, final_df.index))
Demo
Setup
idx = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b'), (3, 'd')])
final_df = pd.DataFrame(index=idx, columns=['other', 'ref'])

ref_df = pd.DataFrame([
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [9, 10, 11, 12]
], [1, 2, 3], ['a', 'b', 'c', 'd'])

other_df = pd.DataFrame([
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [90, 100, 110, 120]
], [1, 2, 3], ['a', 'b', 'c', 'd'])

print(final_df, ref_df, other_df, sep='\n\n')
other ref
1 a NaN NaN
2 b NaN NaN
3 d NaN NaN
a b c d
1 1 2 3 4
2 2 3 4 5
3 9 10 11 12
a b c d
1 10 20 30 40
2 20 30 40 50
3 90 100 110 120
Result
final_df['ref'] = ref_df.lookup(*zip(*final_df.index))
final_df['other'] = other_df.lookup(*zip(*final_df.index))
final_df
other ref
1 a 10 1
2 b 30 3
3 d 120 12
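Note that DataFrame.lookup was removed in pandas 2.0. A hedged replacement for the Demo above (pair_lookup is a hypothetical helper; it assumes every (row, column) pair actually exists in the frame):
def pair_lookup(frame, pairs):
    # pairs: iterable of (row_label, column_label), e.g. final_df.index
    rows, cols = zip(*pairs)
    r = frame.index.get_indexer(rows)
    c = frame.columns.get_indexer(cols)
    return frame.to_numpy()[r, c]

final_df['ref'] = pair_lookup(ref_df, final_df.index)
final_df['other'] = pair_lookup(other_df, final_df.index)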
Another solution that can work with missing dates in ref_df and other_df:
index = pd.MultiIndex.from_tuples(final_df.index)
ref = ref_df.stack().rename('ref')
other = other_df.stack().rename('other')
result = pd.DataFrame(index=index).join(ref).join(other)

Is there an opposite function of pandas.DataFrame.droplevel (like keeplevel)?

Is there an opposite function of pandas.DataFrame.droplevel where I can keep some levels of the multi-level index/columns using either the level name or index?
Example:
df = pd.DataFrame([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
], columns=['a', 'b', 'c', 'd']).set_index(['a', 'b', 'c']).T
a 1 5 9 13
b 2 6 10 14
c 3 7 11 15
d 4 8 12 16
Both of the following commands return this dataframe:
df.droplevel(['a','b'], axis=1)
df.droplevel([0, 1], axis=1)
c 3 7 11 15
d 4 8 12 16
I am looking for a "keeplevel" command such that both of the following commands would return this dataframe:
df.keeplevel(['a','b'], axis=1)
df.keeplevel([0, 1], axis=1)
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
There is no keeplevel because it would be redundant: in a closed, well-defined set, defining what you want to drop automatically defines what you want to keep.
You can get the levels to drop by dropping the ones you want to keep from the axis and passing the remaining names to droplevel:
def keeplevel(df, levels, axis=1):
    return df.droplevel(df.axes[axis].droplevel(levels).names, axis=axis)
>>> keeplevel(df, [0, 1])
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
Using set to find the difference:
df.droplevel(list(set(df.columns.names)-set(['a','b'])),axis=1)
Out[134]:
a 1 5 9 13
b 2 6 10 14
d 4 8 12 16
You can modify the Index objects, which should be fast. Note that this will even modify the dataframe in place.
def keep_level(df, keep, axis):
    idx = pd.MultiIndex.from_arrays([df.axes[axis].get_level_values(x) for x in keep])
    df.set_axis(idx, axis=axis, inplace=True)
    return df
keep_level(df.copy(), ['a', 'b'], 1) # Copy to not modify original for illustration
#a 1 5 9 13
#b 2 6 10 14
#d 4 8 12 16
keep_level(df.copy(), [0, 1], 1)
#a 1 5 9 13
#b 2 6 10 14
#d 4 8 12 16
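On pandas 2.0+ the inplace argument of set_axis is gone, so a hedged non-mutating variant of the same idea would be (keep_level_v2 is a hypothetical name):
def keep_level_v2(df, keep, axis=1):
    idx = pd.MultiIndex.from_arrays([df.axes[axis].get_level_values(x) for x in keep])
    return df.set_axis(idx, axis=axis)

keep_level_v2(df, ['a', 'b'], axis=1)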

How to do R dplyr's do in pandas?

I would like to split my pandas DataFrame into groups and then run a complex function on each chunk. For each chunk, the complex function returns a DataFrame with an arbitrary number of rows and arbitrary column names. I would like those result DataFrames to be combined after the operation. In R I am able to do this with
library(tibble)
library(dplyr)

df = tribble(
  ~g, ~c1, ~c2,
  "a", 1, 6,
  "a", 2, 7,
  "b", 3, 8,
  "b", 4, 9,
  "b", 5, 10
)

myfct <- function(x, y){
  data.frame(c1 = x,
             c2 = y,
             res = c(x * y, x + y, x / y),
             type = c('mult', 'add', 'div'))
}

df %>% group_by(g) %>% do(myfct(.$c1, .$c2))
with the result being
Source: local data frame [15 x 5]
Groups: g [2]
g c1 c2 res type
<chr> <dbl> <dbl> <dbl> <fctr>
1 a 1 6 6.0000000 mult
2 a 2 7 14.0000000 add
3 a 1 6 7.0000000 div
4 a 2 7 9.0000000 mult
5 a 1 6 0.1666667 add
6 a 2 7 0.2857143 div
7 b 3 8 24.0000000 mult
8 b 4 9 36.0000000 add
9 b 5 10 50.0000000 div
10 b 3 8 11.0000000 mult
11 b 4 9 13.0000000 add
12 b 5 10 15.0000000 div
13 b 3 8 0.3750000 mult
14 b 4 9 0.4444444 add
15 b 5 10 0.5000000 div
This - of course - is only an example.
I think you need apply; see also flexible apply:
def myfct(x):
    print(x)
    return pd.DataFrame({'mult': x['c1'] * x['c2'],
                         'add': x['c1'] + x['c2'],
                         'div': x['c1'] / x['c2'],
                         'g': x.name,
                         'c1': x['c1'],
                         'c2': x['c2']})

df = df.groupby('g')[['c1', 'c2']].apply(myfct)
print(df)
add c1 c2 div g mult
0 7 1 6 0.166667 a 6
1 9 2 7 0.285714 a 14
2 11 3 8 0.375000 b 24
3 13 4 9 0.444444 b 36
4 15 5 10 0.500000 b 50
It is also possible to reshape the result with melt:
df = (df.groupby('g')[['c1', 'c2']].apply(myfct)
        .melt(id_vars=['g', 'c1', 'c2'], value_name='res', var_name='type'))
print (df)
g c1 c2 type res
0 a 1 6 add 7.000000
1 a 2 7 add 9.000000
2 b 3 8 add 11.000000
3 b 4 9 add 13.000000
4 b 5 10 add 15.000000
5 a 1 6 div 0.166667
6 a 2 7 div 0.285714
7 b 3 8 div 0.375000
8 b 4 9 div 0.444444
9 b 5 10 div 0.500000
10 a 1 6 mult 6.000000
11 a 2 7 mult 14.000000
12 b 3 8 mult 24.000000
13 b 4 9 mult 36.000000
14 b 5 10 mult 50.000000
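A hedged, more literal translation of dplyr's do() (names like myfct_long are illustrative only): each group maps to one long DataFrame and the pieces are concatenated, which is essentially what do() does. The df below mirrors the tibble from the question; row order differs from the R output.
import pandas as pd

df = pd.DataFrame({'g':  ['a', 'a', 'b', 'b', 'b'],
                   'c1': [1, 2, 3, 4, 5],
                   'c2': [6, 7, 8, 9, 10]})

def myfct_long(grp):
    parts = {'mult': grp['c1'] * grp['c2'],
             'add':  grp['c1'] + grp['c2'],
             'div':  grp['c1'] / grp['c2']}
    # one long frame per group: g, c1, c2, res, type
    return pd.concat([grp[['g', 'c1', 'c2']].assign(res=v, type=k)
                      for k, v in parts.items()], ignore_index=True)

out = pd.concat(myfct_long(grp) for _, grp in df.groupby('g')).reset_index(drop=True)
print(out)  # 15 rows with columns g, c1, c2, res, type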
