Obtaining the class with maximum frequency (python)

Based on the following groupby code:
aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size().unstack()
I get the following output:
class_predict             0      1
S3bin2 S105_9bin2
50     50           16058.0  133.0
       100            256.0    7.0
       150            161.0    NaN
       200            160.0    1.0
       400000        4195.0   58.0
100    50            3480.0   20.0
       100             68.0    NaN
       150             43.0    1.0
       200             48.0    1.0
       400000         689.0    2.0
150    50            1617.0    6.0
       100             73.0    NaN
       150             33.0    NaN
       200             52.0    NaN
       400000         935.0    3.0
200    50            1155.0    8.0
       100             73.0    1.0
       150             37.0    NaN
       200             45.0    NaN
       400000         937.0    NaN
300000 50           11508.0  178.0
       100            748.0   11.0
       150            446.0    5.0
       200            350.0    9.0
       400000       13080.0   49.0
So for the group 50 in both S3bin2 and S105_9bin2, the frequency of 0 is the highest. Is it possible to run a function that prints the groups for which 0 has the highest count, along with the count itself? I've tried transform(max) and other things, but I'm not getting it.

Solution for testing the maximum over all of the data:
First remove the unstack, aggregate with max and idxmax, and finally create the output string with format:
s = aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size()
a = s.agg(['idxmax', 'max'])
print (a)
idxmax (50, 50, 0)
max 16058
dtype: object
print (s.index.names)
['S3bin2', 'S105_9bin2', None]
a,b,c = a['max'], a['idxmax'], s.index.names
d = 'Maximum failure ({0}) at {1[0]}({2[0]}) and {1[1]}({2[1]})'.format(a,b,c)
print (d)
Maximum failure (16058) at 50(S3bin2) and 50(S105_9bin2)
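The same message can also be built with an f-string, if you prefer (a minor variant of the format call above, assuming Python 3.6+):
a, b, c = a['max'], a['idxmax'], s.index.names
d = f'Maximum failure ({a}) at {b[0]}({c[0]}) and {b[1]}({c[1]})'
print(d)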
But if you want to test only column 0 or 1:
df = aps1.groupby(['S3bin2','S105_9bin2', 'class_predict']).size().unstack()
#change 0 to 1 for test column 1
a = df[0].agg(['idxmax', 'max'])
print (a)
idxmax (50, 50)
max 16058
Name: 0, dtype: object
a,b,c = a['max'], a['idxmax'], df.index.names
d = 'Maximum failure ({0}) at {1[0]}({2[0]}) and {1[1]}({2[1]})'.format(a,b,c)
print (d)
Maximum failure (16058.0) at 50(S3bin2) and 50(S105_9bin2)
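If instead you want every group in which class 0 outnumbers class 1 (another reading of the question), a minimal sketch on the unstacked frame, assuming aps1 and the 0/1 class_predict values from the question:
df = aps1.groupby(['S3bin2', 'S105_9bin2', 'class_predict']).size().unstack(fill_value=0)
wins = df[df[0] > df[1]]   # groups where class 0 has the higher count
print(wins[0])             # the counts of class 0, indexed by (S3bin2, S105_9bin2)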

Related

pandas create column as lagged difference of two other columns grouped by key

I have the following dataframe (df)
              AmountNeeded  AmountAvailable
Source Target
1      2             290.0            600.0
       4             300.0            600.0
       6             200.0            600.0
3      2             290.0            450.0
       5             100.0            450.0
7      8               0.0            500.0
I would like to compute the remaining availability per source:
              AmountNeeded  AmountAvailable  RemainingAvailability
Source Target
1      2             290.0            600.0                    600
       4             300.0            600.0                    310
       6             200.0            600.0                     10
3      2             290.0            450.0                    450
       5             100.0            450.0                    160
7      8               0.0            500.0                    500
So if a Source appears more than once, I need to subtract the sum of lagged values of AmountNeeded for that particular Source.
If we take Source 1 and Target 4 the remaining amount should be AmountAvailable-AmountNeeded(previous_row) = 600 - 290 = 310
If we move to Source 1 and Target 6 this will be: 600 - (290+300) = 10.
This can also be computed as RemainingAvailability - AmountNeeded = 310 - 300 = 10.
I tried to use different combinations of groupby and diff but without much success.
Use Series.sub with a helper Series created by a lambda function that combines Series.shift with the cumulative sum Series.cumsum:
# lagged per-group running total: shift AmountNeeded down one row within each
# Source (filling the gap with 0), then take the cumulative sum
s = df.groupby(level=0)['AmountNeeded'].apply(lambda x: x.shift(fill_value=0).cumsum())
df['RemainingAvailability'] = df['AmountAvailable'].sub(s)
print (df)
              AmountNeeded  AmountAvailable  RemainingAvailability
Source Target
1      2             290.0            600.0                  600.0
       4             300.0            600.0                  310.0
       6             200.0            600.0                   10.0
3      2             290.0            450.0                  450.0
       5             100.0            450.0                  160.0
7      8               0.0            500.0                  500.0
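A self-contained variant of the same idea (a sketch of my own, reconstructing the example frame with an assumed MultiIndex on Source/Target): the lagged cumulative sum per group equals the group cumsum minus the current row, which avoids the per-group apply entirely.
import pandas as pd

df = pd.DataFrame(
    {'AmountNeeded': [290.0, 300.0, 200.0, 290.0, 100.0, 0.0],
     'AmountAvailable': [600.0, 600.0, 600.0, 450.0, 450.0, 500.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, 2), (1, 4), (1, 6), (3, 2), (3, 5), (7, 8)],
        names=['Source', 'Target']))

# group cumsum minus the current value == the shifted (lagged) cumsum per group
s = df.groupby(level=0)['AmountNeeded'].cumsum() - df['AmountNeeded']
df['RemainingAvailability'] = df['AmountAvailable'] - s
print(df)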

Get summary data columns in new pandas dataframe from existing dataframe based on other column-ID

I want to summarize the data in a dataframe and add the new columns to another dataframe. My data contains apartments with an ID number, and it has surface and volume values for each room in the apartment. I want a dataframe that summarizes this and gives me the total surface and volume per apartment. There are two conditions on the original dataframe:
- the dataframe can contain empty cells
- when the values of surface or volume are equal for all of the rows within an ID
(all the same values for the same ID), the data (surface, volume) is not
summed; instead a single value/row is passed to the new summary column (example: 'ID 4'), as
this could be a mistake in the original dataframe where the government employee
inserted the total surface/volume for all the rooms
Initial dataframe 'data':
print(data)
ID Surface Volume
0 2 10.0 25.0
1 2 12.0 30.0
2 2 24.0 60.0
3 2 8.0 20.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 NaN NaN
8 52 96.0 240.0
9 95 8.0 20.0
10 95 6.0 15.0
11 95 12.0 30.0
12 95 30.0 75.0
13 95 12.0 30.0
Desired output from 'df':
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0 #-> as the values are the same for each row of this ID in the original data, the sum is not taken, but only one of the rows is passed (see the second condition)
2 52 96.0 240.0
3 95 68.0 170.0
Tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [2,4,52,95]})

data = pd.DataFrame({"ID": [2,2,2,2,4,4,4,52,52,95,95,95,95,95],
                     "Surface": [10,12,24,8,84,84,84,np.nan,96,8,6,12,30,12],
                     "Volume": [25,30,60,20,200,200,200,np.nan,240,20,15,30,75,30]})

print(data)

#Tried something, but no idea how to do this actually:
df["Surface"] = data.groupby("ID").agg(sum)
df["Volume"] = data.groupby("ID").agg(sum)
print(df)
Two conditions are needed here. First, test whether each column has only one unique value per group, using GroupBy.transform with DataFrameGroupBy.nunique and comparing to 1 with eq. Second, use DataFrame.duplicated on each column paired with the ID column.
Chain both masks with & for bitwise AND, replace the matched values with NaN using DataFrame.mask, and finally aggregate with sum:
cols = ['Surface','Volume']
# m1: True where the group has a single unique value in that column
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
# m2: True for repeated (value, ID) pairs, keeping the first occurrence
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
df = data[cols].mask(m1 & m2).groupby(data["ID"]).sum().reset_index()
print(df)
ID Surface Volume
0 2 54.0 135.0
1 4 84.0 200.0
2 52 96.0 240.0
3 95 68.0 170.0
If you need new columns filled with the aggregated sums, use GroupBy.transform:
cols = ['Surface','Volume']
m1 = data.groupby("ID")[cols].transform('nunique').eq(1)
m2 = data[cols].apply(lambda x: x.to_frame().join(data['ID']).duplicated())
data[cols] = data[cols].mask(m1 & m2).groupby(data["ID"]).transform('sum')
print(data)
ID Surface Volume
0 2 54.0 135.0
1 2 54.0 135.0
2 2 54.0 135.0
3 2 54.0 135.0
4 4 84.0 200.0
5 4 84.0 200.0
6 4 84.0 200.0
7 52 96.0 240.0
8 52 96.0 240.0
9 95 68.0 170.0
10 95 68.0 170.0
11 95 68.0 170.0
12 95 68.0 170.0
13 95 68.0 170.0
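An alternative sketch (my own, not from the answer above; assumes data is the original frame from the question): aggregate with a small function that sums each group unless all of its non-NaN values are identical, in which case it keeps a single value.
cols = ['Surface', 'Volume']

def collapse(s):
    # all non-NaN values equal -> keep one value instead of summing
    # (assumes each group has at least one non-NaN value, as in the sample data)
    if s.nunique(dropna=True) == 1:
        return s.dropna().iloc[0]
    return s.sum()

df = data.groupby('ID')[cols].agg(collapse).reset_index()
print(df)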

Python: Array-based equation

I have a dataframe 500 rows long by 4 columns. I need to find the proper python code that would divide the current row by the row below and then multiply that value by the value in the last row for every value in each column. I need to replicate this excel formula basically.
It's not clear whether your data is stored in a NumPy array. If it is, with the original data contained in a, you'd write:
b = a[-1]*(a[:-1]/a[+1:])
a[-1] is the last row, a[:-1] is the array without the last row, and a[+1:] is the array without the first (index zero) row.
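A quick runnable check of that one-liner on a small illustrative array (my own sample data):
import numpy as np

a = np.array([[10., 72.], [72., 42.], [60., 57.]])
b = a[-1] * (a[:-1] / a[1:])   # rows 0..n-2 transformed, shape (n-1, ncols)
print(b)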
Assuming you are talking about a pandas DataFrame:
import pandas as pd
import random

# sample DataFrame object
df = pd.DataFrame((float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)),
                   float(random.randint(1, 100)))
                  for _ in range(10))

def function(col):
    # divide each value by the one below it, then scale by the column's last value
    col = col.copy()
    for i in range(len(col) - 1):
        col[i] = (col[i] / col[i + 1]) * col[len(col) - 1]
    return col

print(df)              # before applying the formula
df = df.apply(function)
print(df)              # after applying the formula
>>>
0 1 2 3
0 10.0 78.0 27.0 23.0
1 72.0 42.0 77.0 86.0
2 82.0 12.0 58.0 98.0
3 27.0 92.0 19.0 86.0
4 48.0 83.0 14.0 43.0
5 55.0 18.0 58.0 77.0
6 20.0 58.0 20.0 22.0
7 76.0 19.0 63.0 82.0
8 23.0 99.0 58.0 15.0
9 60.0 57.0 89.0 100.0
0 1 2 3
0 8.333333 105.857143 31.207792 26.744186
1 52.682927 199.500000 118.155172 87.755102
2 182.222222 7.434783 271.684211 113.953488
3 33.750000 63.180723 120.785714 200.000000
4 52.363636 262.833333 21.482759 55.844156
5 165.000000 17.689655 258.100000 350.000000
6 15.789474 174.000000 28.253968 26.829268
7 198.260870 10.939394 96.672414 546.666667
8 23.000000 99.000000 58.000000 15.000000
9 60.000000 57.000000 89.000000 100.000000
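The same formula can also be written without an explicit loop (a vectorized sketch of my own, not from the answer): divide each row by the row below via shift(-1), scale by the last row, and keep the last row unchanged since it has no row below it.
out = df.div(df.shift(-1)).mul(df.iloc[-1], axis=1)
out.iloc[-1] = df.iloc[-1]   # no row below the last one; leave it as-is
print(out)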

Replace the element of a column (if consecutive value differ by 10 ) by mean of upper and lower value in pandas

I have a dataframe with a temperature column. In some rows, consecutive values differ by more than 10, and I want to clean my data set by replacing such a value with the mean of the values above and below it.
I have tried some conditional replacement, but that is not working:
df.loc[df['Temperature1'] > 50, 'Temperature'] = 23
This changes all elements above 50 to 23, but I want to compare two rows and replace only when the difference is greater than 10.
EDIT: added example with rolling window (see also: window functions)
You can use shift() to put the values from the upper row and the lower row into the middle row.
import pandas as pd
df = pd.DataFrame({'Temperature': [10,30,20,40,50]})
df['upper_row'] = df['Temperature'].shift()
df['lower_row'] = df['Temperature'].shift(-1)
print(df)
Result
Temperature upper_row lower_row
0 10 NaN 30.0
1 30 10.0 20.0
2 20 30.0 40.0
3 40 20.0 50.0
4 50 40.0 NaN
Then you have three values in one row, and you can subtract them, calculate the mean, compare them, etc. as usual:
df['difference'] = (df['Temperature'] - df['upper_row']).abs()
df['mean'] = (df['upper_row'] + df['lower_row'])/2
print(df)
Result
Temperature upper_row lower_row difference mean
0 10 NaN 30.0 NaN NaN
1 30 10.0 20.0 20.0 15.0
2 20 30.0 40.0 10.0 35.0
3 40 20.0 50.0 20.0 35.0
4 50 40.0 NaN 10.0 NaN
And you can replace the values in Temperature (using .loc to avoid chained assignment):
df.loc[df['difference'] > 10, 'Temperature'] = df['mean']
print(df)
Result
Temperature upper_row lower_row difference mean
0 10 NaN 30.0 NaN NaN
1 15 10.0 20.0 20.0 15.0
2 20 30.0 40.0 10.0 35.0
3 35 20.0 50.0 20.0 35.0
4 50 40.0 NaN 10.0 NaN
Full example:
import pandas as pd
df = pd.DataFrame({'Temperature': [10,30,20,40,50]})
df['upper_row'] = df['Temperature'].shift()
df['lower_row'] = df['Temperature'].shift(-1)
print(df)
df['difference'] = (df['Temperature'] - df['upper_row']).abs()
df['mean'] = (df['upper_row'] + df['lower_row'])/2
print(df)
df.loc[df['difference'] > 10, 'Temperature'] = df['mean']
print(df)
EDIT: you can also use a rolling window to work with two or three consecutive rows. See the comments in the code.
import pandas as pd
df = pd.DataFrame({'Temperature': [10,30,20,40,50]})
# work with two consecutive rows and assign the result to the last row
rw2 = df['Temperature'].rolling(2)
df['difference'] = rw2.apply(lambda rows: abs(rows[1] - rows[0]), raw=True)
# work with three consecutive rows and assign the result to the middle/center row
rw3 = df['Temperature'].rolling(3, center=True)
df['mean'] = rw3.apply(lambda rows: (rows[0] + rows[2]) / 2, raw=True)
print(df)
df.loc[df['difference'] > 10, 'Temperature'] = df['mean']
print(df)
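The whole idea can also be condensed (a sketch of my own, same logic as the step-by-step version above): compute the difference to the previous row with diff(), the neighbour mean with two shifts, and swap the values in with Series.mask.
import pandas as pd

df = pd.DataFrame({'Temperature': [10, 30, 20, 40, 50]})
diff = df['Temperature'].diff().abs()
neighbour_mean = (df['Temperature'].shift() + df['Temperature'].shift(-1)) / 2
df['Temperature'] = df['Temperature'].mask(diff > 10, neighbour_mean)
print(df)   # rows 1 and 3 replaced by 15 and 35, as in the step-by-step version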

finding intersection of intervals in pandas

I have two dataframes
df_a=
Start Stop Value
0 0 100 0.0
1 101 200 1.0
2 201 1000 0.0
df_b=
Start Stop Value
0 0 50 0.0
1 51 300 1.0
2 301 1000 0.0
I would like to generate a DataFrame which contains the intervals identified by Start and Stop where Value was the same in df_a and df_b. For each interval I would like to store whether Value was the same, and what the value was in df_a and in df_b.
Desired output:
df_out=
Start Stop SameValue Value_dfA Value_dfB
0 50 1 0 0
51 100 0 0 1
101 200 1 1 1
201 300 0 0 1
[...]
Not sure if this is the best way to do this but you can reindex, join, groupby and agg to get your intervals, e.g.:
Expand each df so that the index is every single value of the range (Start to Stop) using reindex() and padding the values:
In []:
df_a_expanded = df_a.set_index('Start').reindex(range(max(df_a['Stop'])+1)).ffill()
df_a_expanded
Out[]:
Stop Value
Start
0 100.0 0.0
1 100.0 0.0
2 100.0 0.0
3 100.0 0.0
4 100.0 0.0
...
997 1000.0 0.0
998 1000.0 0.0
999 1000.0 0.0
1000 1000.0 0.0
[1001 rows x 2 columns]
In []:
df_b_expanded = df_b.set_index('Start').reindex(range(max(df_b['Stop'])+1)).ffill()
Join the two expanded dfs:
In []:
df = df_a_expanded.join(df_b_expanded, lsuffix='_dfA', rsuffix='_dfB').reset_index()
df
Out[]:
Start Stop_dfA Value_dfA Stop_dfB Value_dfB
0 0 100.0 0.0 50.0 0.0
1 1 100.0 0.0 50.0 0.0
2 2 100.0 0.0 50.0 0.0
3 3 100.0 0.0 50.0 0.0
4 4 100.0 0.0 50.0 0.0
...
Note: you can ignore the Stop columns and could have dropped them in the previous step.
There is no standard way to groupby only consecutive values (à la itertools.groupby), so resorting to a cumsum() hack:
In []:
groups = (df[['Value_dfA', 'Value_dfB']] != df[['Value_dfA', 'Value_dfB']].shift()).any(axis=1).cumsum()
g = df.groupby([groups, 'Value_dfA', 'Value_dfB'], as_index=False)
Now you can get the result you want by aggregating the group with min, max:
In []:
df_out = g['Start'].agg(Start='min', Stop='max')
df_out
Out[]:
Value_dfA Value_dfB Start Stop
0 0.0 0.0 0 50
1 0.0 1.0 51 100
2 1.0 1.0 101 200
3 0.0 1.0 201 300
4 0.0 0.0 301 1000
Now you just have to add the SameValue column and, if desired, order the columns to get the exact output you want:
In []:
df_out['SameValue'] = (df_out['Value_dfA'] == df_out['Value_dfB'])*1
df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Out[]:
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
This assumes the ranges of the two dataframes are the same, or you will need to handle the NaNs you will get with the join().
I found a way but not sure it is the most efficient. You have the input data:
import pandas as pd
dfa = pd.DataFrame({'Start': [0, 101, 201], 'Stop': [100, 200, 1000], 'Value': [0., 1., 0.]})
dfb = pd.DataFrame({'Start': [0, 51, 301], 'Stop': [50, 300, 1000], 'Value': [0., 1., 0.]})
First I would create the columns Start and Stop of df_out with:
df_out = pd.DataFrame({'Start': sorted(set(dfa['Start']) | set(dfb['Start'])),
                       'Stop': sorted(set(dfa['Stop']) | set(dfb['Stop']))})
Then, to get the value of dfa (and dfb) associated with the right range (Start, Stop) in a column named Value_dfA (and Value_dfB), I would do:
df_out['Value_dfA'] = df_out['Start'].apply(lambda x: dfa['Value'][dfa['Start'] <= x].iloc[-1])
df_out['Value_dfB'] = df_out['Start'].apply(lambda x: dfb['Value'][dfb['Start'] <= x].iloc[-1])
To get the column SameValue, do:
df_out['SameValue'] = df_out.apply(lambda x: 1 if x['Value_dfA'] == x['Value_dfB'] else 0,axis=1)
If it matters, you can reorder the columns with:
df_out = df_out[['Start', 'Stop', 'SameValue', 'Value_dfA', 'Value_dfB']]
Your output is then
Start Stop SameValue Value_dfA Value_dfB
0 0 50 1 0.0 0.0
1 51 100 0 0.0 1.0
2 101 200 1 1.0 1.0
3 201 300 0 0.0 1.0
4 301 1000 1 0.0 0.0
I have an O(n log(n)) solution where n is the total number of rows of df_a and df_b. Here's how it goes:
Rename the value column of both dataframes to value_a and value_b respectively. Next, concatenate df_b to df_a (append() is deprecated in modern pandas, so concat is used here):
df = pd.concat([df_a, df_b])
Sort the df by the start column:
df = df.sort_values('start')
Resulting dataframe will look like this:
start stop value_a value_b
0 0 100 0.0 NaN
0 0 50 NaN 0.0
1 51 300 NaN 1.0
1 101 200 1.0 NaN
2 201 1000 0.0 NaN
2 301 1000 NaN 0.0
Forward fill the missing values:
df = df.ffill()
Compute same_value column:
df['same_value'] = df['value_a'] == df['value_b']
Recompute stop column:
df.stop = df.start.shift(-1)
You will get the dataframe you desire (except for the first and last rows, which are pretty easy to fix):
start stop value_a value_b same_value
0 0 0.0 0.0 NaN False
0 0 51.0 0.0 0.0 True
1 51 101.0 0.0 1.0 False
1 101 201.0 1.0 1.0 True
2 201 301.0 0.0 1.0 False
2 301 NaN 0.0 0.0 True
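A hedged sketch of that fix (my own; assumes the overall range ends at 1000, as in the sample data): drop the spurious first interval and fill the final stop.
df = df.iloc[1:].copy()
df['stop'] = df['stop'].fillna(1000)   # assumed known overall end of the range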
Here is an answer which computes the overlapping intervals really quickly (which answers the question in the title):
from io import StringIO
import pandas as pd
from ncls import NCLS
c1 = StringIO("""Start Stop Value
0 100 0.0
101 200 1.0
201 1000 0.0""")
c2 = StringIO("""Start Stop Value
0 50 0.0
51 300 1.0
301 1000 0.0""")
df1 = pd.read_table(c1, sep=r"\s+")
df2 = pd.read_table(c2, sep=r"\s+")
ncls = NCLS(df1.Start.values, df1.Stop.values, df1.index.values)
x1, x2 = ncls.all_overlaps_both(df2.Start.values, df2.Stop.values, df2.index.values)
df1 = df1.reindex(x2).reset_index(drop=True)
df2 = df2.reindex(x1).reset_index(drop=True)
# print(df1)
# print(df2)
df = df1.join(df2, rsuffix="2")
print(df)
# Start Stop Value Start2 Stop2 Value2
# 0 0 100 0.0 0 50 0.0
# 1 0 100 0.0 51 300 1.0
# 2 101 200 1.0 51 300 1.0
# 3 201 1000 0.0 51 300 1.0
# 4 201 1000 0.0 301 1000 0.0
With this final df it should be simple to get to the result you need (but it is left as an exercise for the reader).
See NCLS for the interval overlap data structure.
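For completeness, a hedged sketch of that exercise (my own, using the column names from the print above): the intersection of each overlapping pair runs from the larger Start to the smaller Stop.
out = pd.DataFrame({
    'Start': df[['Start', 'Start2']].max(axis=1),
    'Stop': df[['Stop', 'Stop2']].min(axis=1),
    'Value_dfA': df['Value'],
    'Value_dfB': df['Value2']})
out['SameValue'] = (out['Value_dfA'] == out['Value_dfB']).astype(int)
print(out)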
