Dataframe to count conditional occurrences - Python

I have a data frame like the one below. For each row where Sales > 20, I want to count how many times Inventory was > 10 within its previous 5 data points (the 5 rows below it, since the frame is sorted by date descending).
The ideal output is:
2018/12/26 has Sales 36 when 2 times.
2018/11/19 has Sales 34 when 2 times.
Here is what I do with xlrd:
import xlrd
from datetime import datetime

old_file = xlrd.open_workbook("C:\\Sales.xlsx")
the_sheet = old_file.sheet_by_name("Sales")

for row_index in range(1, the_sheet.nrows):
    Dates = the_sheet.cell(row_index, 0).value
    Inventory = the_sheet.cell(row_index, 1).value
    Sales = the_sheet.cell(row_index, 2).value
    list_of_Inventory = []
    for i in range(1, 5):
        list_of_Inventory.append(the_sheet.cell(row_index - i, 1).value)
    if Sales > 20:
        print str(Dates) + " has Sales " + str(Sales) + " when " + str(sum(i > 10 for i in list_of_Inventory)) + " times."
It doesn't work well.
What would be the proper way to work this out? I'd appreciate some guidance in pandas.
Thank you.
P.S. Here is the data:
data = {'Date': ["2018/12/29","2018/12/26","2018/12/24","2018/12/15","2018/12/11","2018/12/8","2018/11/28","2018/11/20","2018/11/19","2018/11/11","2018/11/6","2018/11/1","2018/10/28","2018/10/11","2018/9/25","2018/9/24"],
'Inventory': [5,5,5,22,5,25,5,15,15,5,5,15,0,22,2,10],
'Sales' : [0,36,18,0,0,17,18,17,34,16,0,0,18,18,51,18]}
df = pd.DataFrame(data)

I don't think you're going to get around iterating over the dataframe (based on the specifics of your output). So provided your data isn't huge, it shouldn't be a problem. Here's another quick solution you can implement:
for idx in df.loc[df.Sales > 20].index:
    inv = df.loc[idx-4:idx, 'Inventory'].ge(10)
    date, _, sales = df.loc[idx]
    if len(inv) >= 5:
        print(f'{date} has Sales {sales} when {inv.sum()} times')
2018/11/19 has Sales 34 when 2 times
2018/9/25 has Sales 51 when 2 times

I think you can get there with a couple of "cheater" columns to do some intermediate work, using pandas' rolling function. Note: 'HSHIC' = High Sales High Inventory Count (needed an acronym). This also handles your desire to exclude the first 4 rows, because rolling(window=5) yields NaN for them automatically.
In [42]: df = pd.DataFrame(data)
In [43]: df
Out[43]:
Date Inventory Sales
0 2018/12/29 5 0
1 2018/12/26 5 36
2 2018/12/24 5 18
3 2018/12/15 6 0
4 2018/12/11 5 0
5 2018/12/8 0 17
6 2018/11/28 5 18
7 2018/11/20 15 17
8 2018/11/19 15 34
9 2018/11/11 5 16
10 2018/11/6 5 0
11 2018/11/1 15 0
12 2018/10/28 0 18
13 2018/10/11 10 18
14 2018/9/25 2 51
15 2018/9/24 10 18
In [44]: df['High Inventory'] = df['Inventory'] > 10
In [45]: df['High Inv Cnt'] = df['High Inventory'].rolling(window=5).sum()
In [46]: df
Out[46]:
Date Inventory Sales High Inventory High Inv Cnt
0 2018/12/29 5 0 False NaN
1 2018/12/26 5 36 False NaN
2 2018/12/24 5 18 False NaN
3 2018/12/15 6 0 False NaN
4 2018/12/11 5 0 False 0.0
5 2018/12/8 0 17 False 0.0
6 2018/11/28 5 18 False 0.0
7 2018/11/20 15 17 True 1.0
8 2018/11/19 15 34 True 2.0
9 2018/11/11 5 16 False 2.0
10 2018/11/6 5 0 False 2.0
11 2018/11/1 15 0 True 3.0
12 2018/10/28 0 18 False 2.0
13 2018/10/11 10 18 False 1.0
14 2018/9/25 2 51 False 1.0
15 2018/9/24 10 18 False 1.0
In [47]: df['HSHIC'] = df['High Inv Cnt'][df.Sales > 20]
In [48]: df
Out[48]:
Date Inventory Sales High Inventory High Inv Cnt HSHIC
0 2018/12/29 5 0 False NaN NaN
1 2018/12/26 5 36 False NaN NaN
2 2018/12/24 5 18 False NaN NaN
3 2018/12/15 6 0 False NaN NaN
4 2018/12/11 5 0 False 0.0 NaN
5 2018/12/8 0 17 False 0.0 NaN
6 2018/11/28 5 18 False 0.0 NaN
7 2018/11/20 15 17 True 1.0 NaN
8 2018/11/19 15 34 True 2.0 2.0
9 2018/11/11 5 16 False 2.0 NaN
10 2018/11/6 5 0 False 2.0 NaN
11 2018/11/1 15 0 True 3.0 NaN
12 2018/10/28 0 18 False 2.0 NaN
13 2018/10/11 10 18 False 1.0 NaN
14 2018/9/25 2 51 False 1.0 1.0
15 2018/9/24 10 18 False 1.0 NaN
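If you want the question's sentence-style output from here, the non-NaN HSHIC rows are exactly the qualifying dates (a short sketch building on the frame above):

for _, row in df.dropna(subset=['HSHIC']).iterrows():
    print(f"{row['Date']} has Sales {row['Sales']} when {row['HSHIC']:.0f} times.")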

There was an error in the first post of the question (what's on the page now is correct), so let me put down a working solution in Python 2.
Thanks to @manwithfewneeds and @kantal.
for idx in df.index[df.Sales > 20]:
    inv = df.loc[idx + 1 : idx + 5, 'Inventory'].ge(10)  # downwards 5 rows, Inventory > 10
    date, _, sales = df.loc[idx]
    if len(inv) >= 5:
        print '%s. has Sales %s. when %s. times' % (date, sales, inv.sum())
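For anyone on Python 3, the same loop with a print function (a direct translation, nothing else changed):

for idx in df.index[df.Sales > 20]:
    inv = df.loc[idx + 1 : idx + 5, 'Inventory'].ge(10)  # the 5 rows below, i.e. the 5 earlier dates
    date, _, sales = df.loc[idx]
    if len(inv) >= 5:
        print(f'{date} has Sales {sales} when {inv.sum()} times')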

Related

Adding rows to dataframes in a list based on a condition

I am trying to add rows based on a condition, but I am having difficulty achieving this.
Currently, I have pandas dataframes in a list that looks like the following.
The objective is to add rows with a fixed 'ID', incrementing the month by 3.
For example, for this[1] I want the result to look like the following:
ID | month | num
6 | 0 | 5
6 | 3 | NaN
6 | 6 | 4
6 | 9 | NaN
6 | 12 | 3
...
6 | 36 | 1
I am trying to create a function that takes a dataframe from the list, the maximum month of that dataframe, and the increment I want the month to grow by (3). Roughly:

def add_rows(df, max_mon, res):
    if max_mon > res:
        # add rows with fixed ID and NaN num
        # skip the months that already exist
        ...

final = []
for i in range(len(this)):
    final.append(add_rows(this[i], this[i]['month'].max(), 3))
I have tried to insert rows but did not manage to get it to work.
The toy data
d = {'ID':[5,5,5,5,5], 'month':[0,6,12,24,36], 'num':[5,4,3,2,1]}
tempo = pd.DataFrame(data = d)
d2 = {'ID':[6,6,6,6,6], 'month':[0,6,12,18,36], 'num':[5,4,3,2,1]}
tempo2 = pd.DataFrame(data = d2)
this = []
this.append(tempo)
this.append(tempo2)
I would really appreciate if I could get help on building the function!
You can use:
for i, df in enumerate(this):
    this[i] = (df
               .set_index('month')
               .groupby('ID')
               .apply(lambda x: x.drop(columns='ID')
                                 .reindex(range(x.index.min(), x.index.max() + 3, 3)))
               .reset_index()[df.columns])
Updated this:
[ ID month num
0 5 0 5.0
1 5 3 NaN
2 5 6 4.0
3 5 9 NaN
4 5 12 3.0
5 5 15 NaN
6 5 18 NaN
7 5 21 NaN
8 5 24 2.0
9 5 27 NaN
10 5 30 NaN
11 5 33 NaN
12 5 36 1.0,
ID month num
0 6 0 5.0
1 6 3 NaN
2 6 6 4.0
3 6 9 NaN
4 6 12 3.0
5 6 15 NaN
6 6 18 2.0
7 6 21 NaN
8 6 24 NaN
9 6 27 NaN
10 6 30 NaN
11 6 33 NaN
12 6 36 1.0]
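If you want this wrapped in the add_rows shape the question sketches, a minimal version could look like the following (a sketch: it reuses the reindex idea above and honours the max_mon argument, assuming one ID per dataframe as in the toy data):

def add_rows(df, max_mon, res):
    # reindex 'month' from its minimum up to max_mon in steps of res;
    # existing months keep their num, new months get NaN
    return (df.set_index('month')
              .groupby('ID')
              .apply(lambda x: x.drop(columns='ID')
                                .reindex(range(x.index.min(), max_mon + res, res)))
              .reset_index()[df.columns])

final = [add_rows(df, df['month'].max(), 3) for df in this]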

How to use each vector entry to fill NaNs of separate groups in a dataframe

Say I have a vector valsHR which looks like this:
valsHR=[78.8, 82.3, 91.0]
And I have a dataframe MainData
Age Patient HR
21 1 NaN
21 1 NaN
21 1 NaN
30 2 NaN
30 2 NaN
24 3 NaN
24 3 NaN
24 3 NaN
I want to fill the NaNs so that the first value in valsHR will only fill in the NaNs for patient 1, the second will fill the NaNs for patient 2 and the third will fill in for patient 3.
So far I've tried using:
mainData['HR'] = mainData['HR'].fillna(valsHR)
but it fills all the NaNs with the first value in the vector.
I've also tried:
mainData['HR'] = mainData.groupby('Patient').fillna(valsHR)
which fills the NaNs with values that aren't in the valsHR vector at all.
I was wondering if anyone knew a way to do this?
Create a dictionary from the Patient values that still have missing HR, map it to the Patient column, and replace only the missing values:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- value is not replaced
4 30 2 NaN
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
If some groups have no NaNs:
print (df)
Age Patient HR
0 21 1 NaN
1 21 1 NaN
2 21 1 NaN
3 30 2 100.0 <- group 2 is not replaced
4 30 2 100.0 <- group 2 is not replaced
5 24 3 NaN
6 24 3 NaN
7 24 3 NaN
p = df.loc[df.HR.isna(), 'Patient'].unique()
valsHR = [78.8, 82.3, 91.0]
df['HR'] = df['HR'].fillna(df['Patient'].map(dict(zip(p, valsHR))))
print (df)
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 100.0
4 30 2 100.0
5 24 3 82.3
6 24 3 82.3
7 24 3 82.3
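It can help to inspect the intermediate mapping in that second case: only patients 1 and 3 have NaNs, so they pair positionally with the first two entries of valsHR, and patient 2 is left alone:

p = df.loc[df.HR.isna(), 'Patient'].unique()  # array([1, 3])
dict(zip(p, valsHR))                          # {1: 78.8, 3: 82.3}

Whether that positional pairing is what you want depends on how valsHR is defined; if each entry belongs to a fixed patient number, build the dictionary from patient IDs instead.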
It is simple mapping, if all the NaNs should be replaced:
import pandas as pd
from io import StringIO
valsHR=[78.8, 82.3, 91.0]
vals = {i:k for i,k in enumerate(valsHR, 1)}
df = pd.read_csv(StringIO("""Age Patient
21 1
21 1
21 1
30 2
30 2
24 3
24 3
24 3"""), sep="\s+")
df["HR"] = df["Patient"].map(vals)
>>> df
Age Patient HR
0 21 1 78.8
1 21 1 78.8
2 21 1 78.8
3 30 2 82.3
4 30 2 82.3
5 24 3 91.0
6 24 3 91.0
7 24 3 91.0
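The difference from the previous answer is worth spelling out: plain map rebuilds the whole HR column and would overwrite any existing readings, while the fillna(...map...) form only touches the gaps:

df['HR'] = df['Patient'].map(vals)                   # rebuilds HR, overwrites existing values
df['HR'] = df['HR'].fillna(df['Patient'].map(vals))  # fills only the NaNs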

How to conditionally select previous row's value in python?

I want to select the previous row's value only if it meets a certain condition
E.g.
df:
Value Marker
10 0
12 0
50 1
42 1
52 0
23 1
I want to select the previous row's Value where Marker == 0 if the current row's Marker == 1.
Result:
df:
Value Marker Prev_Value
10 0 nan
12 0 nan
50 1 12
42 1 12
52 0 nan
23 1 52
I tried:
df['Prev_Value'] = np.where(df['Marker'] == 1, df['Value'].shift(), np.nan)
but that does not take the conditional previous value like I want.
condition = (df.Marker.shift() == 0) & (df.Marker == 1)
df['Prev_Value'] = np.where(condition, df.Value.shift(), np.nan)
Output:
df
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 NaN
4 52 0 NaN
5 23 1 52.0
You could try this:
df['Prev_Value'] = np.where(df['Marker'].diff() == 1, df['Value'].shift(1), np.nan)
Output:
df
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 NaN
4 52 0 NaN
5 23 1 52.0
If you want to get the previous non-1 marker value, if marker==1, you could try this:
prevro = []
for i in reversed(df.index):
    if df.iloc[i, 1] == 1:
        prevro_zero = df.iloc[0:i, 0][df.iloc[0:i, 1].eq(0)].tolist()
        if len(prevro_zero) > 0:
            prevro.append(prevro_zero[len(prevro_zero) - 1])
        else:
            prevro.append(np.nan)
    else:
        prevro.append(np.nan)
df['Prev_Value'] = list(reversed(prevro))
print(df)
Output:
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 12.0
4 52 0 NaN
5 23 1 52.0
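A vectorized way to get that same "last Marker == 0 value" result, as a sketch: keep Value only where Marker is 0, forward-fill it, then expose it on the Marker == 1 rows:

prev0 = df['Value'].where(df['Marker'].eq(0)).ffill()           # last Value seen on a Marker == 0 row
df['Prev_Value'] = np.where(df['Marker'].eq(1), prev0, np.nan)  # show it only where Marker == 1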

Filter out zeros in np.percentile

I am trying to decile the column score of a DataFrame.
I use the following code:
np.percentile(df['score'], np.arange(0, 100, 10))
My problem is that score contains lots of zeros. How can I filter out these 0 values and decile only the remaining values?
Filter them with boolean indexing:
df.loc[df['score']!=0, 'score']
or
df['score'][lambda x: x!=0]
and pass that to the percentile function.
np.percentile(df['score'][lambda x: x!=0], np.arange(0,100,10))
Consider the dataframe df:
df = pd.DataFrame(
    dict(score=np.random.rand(20))
).where(
    np.random.choice([True, False], (20, 1), p=(.8, .2)),
    0
)
score
0 0.380777
1 0.559356
2 0.103099
3 0.800843
4 0.262055
5 0.389330
6 0.477872
7 0.393937
8 0.189949
9 0.571908
10 0.133402
11 0.033404
12 0.650236
13 0.593495
14 0.000000
15 0.013058
16 0.334851
17 0.000000
18 0.999757
19 0.000000
Use pd.qcut to decile
pd.qcut(df.loc[df.score != 0, 'score'], 10, range(10))
0 4
1 6
2 1
3 9
4 3
5 4
6 6
7 5
8 2
9 7
10 1
11 0
12 8
13 8
15 0
16 3
18 9
Name: score, dtype: category
Categories (10, int64): [0 < 1 < 2 < 3 ... 6 < 7 < 8 < 9]
Or all together
df.assign(decile=pd.qcut(df.loc[df.score != 0, 'score'], 10, range(10)))
score decile
0 0.380777 4.0
1 0.559356 6.0
2 0.103099 1.0
3 0.800843 9.0
4 0.262055 3.0
5 0.389330 4.0
6 0.477872 6.0
7 0.393937 5.0
8 0.189949 2.0
9 0.571908 7.0
10 0.133402 1.0
11 0.033404 0.0
12 0.650236 8.0
13 0.593495 8.0
14 0.000000 NaN
15 0.013058 0.0
16 0.334851 3.0
17 0.000000 NaN
18 0.999757 9.0
19 0.000000 NaN
You can simply mask zeros and then remove them from your column using boolean indexing:
score = df['score']
score_no_zero = score[score != 0]
np.percentile(score_no_zero, np.arange(0,100,10))
or in one step:
np.percentile(df['score'][df['score'] != 0], np.arange(0,100,10))
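An equivalent that avoids indexing altogether, if you prefer NaN-aware functions: replace the zeros with NaN and let np.nanpercentile ignore them (a sketch):

np.nanpercentile(df['score'].replace(0, np.nan), np.arange(0, 100, 10))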

Modify function to return dataframe with specified values

With reference to the test data below and the function I use to identify values within thresh of each other: can anyone please help me modify this to produce the desired output shown at the end?
Test data
import pandas as pd
import numpy as np
from itertools import combinations
df2 = pd.DataFrame(
    {'AAA': [4, 5, 6, 7, 9, 10],
     'BBB': [10, 20, 30, 40, 11, 10],
     'CCC': [100, 50, 25, 10, 10, 11],
     'DDD': [98, 50, 25, 10, 10, 11],
     'EEE': [103, 50, 25, 10, 10, 11]})
Function:
thresh = 5

def closeCols2(df):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value
Data Before function applied:
AAA BBB CCC DDD EEE
0 4 10 100 98 103
1 5 20 50 50 50
2 6 30 25 25 25
3 7 40 10 10 10
4 9 11 10 10 10
5 10 10 11 11 11
Current Series output after the function is applied:
df2.apply(closeCols2, axis=1)
0 103
1 50
2 25
3 10
4 11
5 11
dtype: int64
The desired output is a dataframe showing all values within thresh, with NaN for any value not within thresh:
AAA BBB CCC DDD EEE
0 nan nan 100 98 103
1 nan nan 50 50 50
2 nan 30 25 25 25
3 7 nan 10 10 10
4 9 11 10 10 10
5 10 10 11 11 11
Use mask and sub with the row-wise maxima from apply:
df2.mask(df2.sub(df2.apply(closeCols2, 1), 0).abs() > thresh)
AAA BBB CCC DDD EEE
0 NaN NaN 100 98 103
1 NaN NaN 50 50 50
2 NaN 30.0 25 25 25
3 7.0 NaN 10 10 10
4 9.0 11.0 10 10 10
5 10.0 10.0 11 11 11
Note: I'd redefine closeCols2 to take thresh as a parameter; then you can pass it in the apply call.
def closeCols2(df, thresh):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value

df2.apply(closeCols2, 1, thresh=5)
Extra credit:
I vectorized and embedded your closeCols2 for some mind-numbing fun. Notice there is no apply.
- numpy broadcasting gets all combinations of columns subtracted from each other;
- np.abs and <= 5 flag the column pairs within the threshold;
- sum(-1): I arranged the broadcasting so that the differences of, say, row 0 column AAA with all of row 0 are laid out across the last dimension, and summing over that last dimension counts the close columns;
- <= 1: every value is within 5 of itself, so a cell with a real partner has a count greater than 1; we therefore mask everything less than or equal to one.
v = df2.values
df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)
AAA BBB CCC DDD EEE
0 NaN NaN 100 98 103
1 NaN NaN 50 50 50
2 NaN 30.0 25 25 25
3 7.0 NaN 10 10 10
4 9.0 11.0 10 10 10
5 10.0 10.0 11 11 11
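If the broadcasting is hard to follow, the shapes tell the story (the same expression, annotated step by step):

v = df2.values                      # shape (6, 5): 6 rows, 5 columns
diff = v[:, :, None] - v[:, None]   # shape (6, 5, 5): each cell minus every cell in its row
close = np.abs(diff) <= 5           # True where a pair of columns is within 5 of each other
counts = close.sum(-1)              # shape (6, 5): how many columns each cell is close to
df2.mask(counts <= 1)               # every cell is close to itself, so <= 1 means "no partner"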
