Filter out zeros in np.percentile - python

I am trying to decile the column score of a DataFrame.
I use the following code:
np.percentile(df['score'], np.arange(0, 100, 10))
My problem is in score, there are lots of zeros. How can I filter out these 0 values and only decile the rest of values?

Filter them with boolean indexing:
df.loc[df['score']!=0, 'score']
or
df['score'][lambda x: x!=0]
and pass that to the percentile function.
np.percentile(df['score'][lambda x: x!=0], np.arange(0,100,10))

Consider the dataframe df
df = pd.DataFrame(
dict(score=np.random.rand(20))
).where(
np.random.choice([True, False], (20, 1), p=(.8, .2)),
0
)
score
0 0.380777
1 0.559356
2 0.103099
3 0.800843
4 0.262055
5 0.389330
6 0.477872
7 0.393937
8 0.189949
9 0.571908
10 0.133402
11 0.033404
12 0.650236
13 0.593495
14 0.000000
15 0.013058
16 0.334851
17 0.000000
18 0.999757
19 0.000000
Use pd.qcut to decile
pd.qcut(df.loc[df.score != 0, 'score'], 10, range(10))
0 4
1 6
2 1
3 9
4 3
5 4
6 6
7 5
8 2
9 7
10 1
11 0
12 8
13 8
15 0
16 3
18 9
Name: score, dtype: category
Categories (10, int64): [0 < 1 < 2 < 3 ... 6 < 7 < 8 < 9]
Or all together
df.assign(decile=pd.qcut(df.loc[df.score != 0, 'score'], 10, range(10)))
score decile
0 0.380777 4.0
1 0.559356 6.0
2 0.103099 1.0
3 0.800843 9.0
4 0.262055 3.0
5 0.389330 4.0
6 0.477872 6.0
7 0.393937 5.0
8 0.189949 2.0
9 0.571908 7.0
10 0.133402 1.0
11 0.033404 0.0
12 0.650236 8.0
13 0.593495 8.0
14 0.000000 NaN
15 0.013058 0.0
16 0.334851 3.0
17 0.000000 NaN
18 0.999757 9.0
19 0.000000 NaN

You can simply mask zeros and then remove them from your column using boolean indexing:
score = df['score']
score_no_zero = score[score != 0]
np.percentile(score_no_zero, np.arange(0,100,10))
or in one step:
np.percentile(df['score'][df['score'] != 0], np.arange(0,100,10))

Related

How to do pandas rolling window in both forward and backward at the same time

I have a pd.DataFrame df with one column, say:
A = [1,2,3,4,5,6,7,8,2,4]
df = pd.DataFrame(A,columns = ['A'])
For each row, I want to take previous 2 values, current value and next 2 value (a window= 5) and get the sum and store it in new column. Desire output,
A A_sum
1 6
2 10
3 15
4 20
5 25
6 30
7 28
8 27
2 21
4 14
I have tried,
df['A_sum'] = df['A'].rolling(2).sum()
Tried with shift, but all doing either forward or backward, I'm looking for a combination of both.
Use rolling by 5, add parameter center=True and min_periods=1 to Series.rolling:
df['A_sum'] = df['A'].rolling(5, center=True, min_periods=1).sum()
print (df)
A A_sum
0 1 6.0
1 2 10.0
2 3 15.0
3 4 20.0
4 5 25.0
5 6 30.0
6 7 28.0
7 8 27.0
8 2 21.0
9 4 14.0
If you are allowed to use numpy, then you might use numpy.convolve to get desired output
import numpy as np
import pandas as pd
A = [1,2,3,4,5,6,7,8,2,4]
B = np.convolve(A,[1,1,1,1,1], 'same')
df = pd.DataFrame({"A":A,"A_sum":B})
print(df)
output
A A_sum
0 1 6
1 2 10
2 3 15
3 4 20
4 5 25
5 6 30
6 7 28
7 8 27
8 2 21
9 4 14
You can use shift for this (straightforward if not elegant):
df["A_sum"] = df.A + df.A.shift(-2).fillna(0) + df.A.shift(-1).fillna(0) + df.A.shift(1).fillna(0)
output:
A A_sum
0 1 6.0
1 2 10.0
2 3 14.0
3 4 18.0
4 5 22.0
5 6 26.0
6 7 23.0
7 8 21.0
8 2 14.0
9 4 6.0

Save dictionary to Pandas dataframe with keys as columns and merge indices

I know there are already lots of posts on how to convert a pandas dict to a dataframe, however I could not find one discussing the issue I have.
My dictionary looks as follows:
[Out 23]:
{'atmosphere': 0
2 5
9 4
15 1
26 5
29 5
... ..
2621 4
6419 3
[6934 rows x 1 columns],
'communication': 0
13 1
15 1
26 1
2621 2
3119 5
... ..
6419 4
6532 1
[714 rows x 1 columns]
Now, what I want is to create a dataframe out of this dictionary, where the 'atmosphere' and 'communication' are the columns, and the indices of both items are merged, so that the dataframe looks as follows:
index atmosphere commmunication
2 5
9 4
13 1
15 1 1
26 5 1
29 5
2621 4 2
3119 5
6419 3 4
6532 1
I already tried pd.DataFrame.from_dict, but it saves all values in one row.
Any help is much appreciated!
Use concat with DataFrame.droplevel for remove second level 0 from MultiIndex in columns:
d = {'atmosphere':pd.DataFrame({0: {2: 5, 9: 4, 15: 1, 26: 5, 29: 5,
2621: 4, 6419: 3}}),
'communication':pd.DataFrame({0: {13: 1, 15: 1, 26: 1, 2621: 2,
3119: 5, 6419: 4, 6532: 1}})}
print (d['atmosphere'])
0
2 5
9 4
15 1
26 5
29 5
2621 4
6419 3
print (d['communication'])
0
13 1
15 1
26 1
2621 2
3119 5
6419 4
6532 1
df = pd.concat(d, axis=1).droplevel(1, axis=1)
print (df)
atmosphere communication
2 5.0 NaN
9 4.0 NaN
13 NaN 1.0
15 1.0 1.0
26 5.0 1.0
29 5.0 NaN
2621 4.0 2.0
3119 NaN 5.0
6419 3.0 4.0
6532 NaN 1.0
Alternative solution:
df = pd.concat({k: v[0] for k, v in d.items()}, axis=1)
You can use pandas.concat on the values and set_axis with the dictionary keys:
out = pd.concat(d.values(), axis=1).set_axis(d, axis=1)
output:
atmosphere communication
2 5.0 NaN
9 4.0 NaN
13 NaN 1.0
15 1.0 1.0
26 5.0 1.0
29 5.0 NaN
2621 4.0 2.0
3119 NaN 5.0
6419 3.0 4.0
6532 NaN 1.0

Dataframe to count conditional occurrence

A data frame like below.
I want to find out when sales was >20, (in its previous 5 data) how many times the inventory was > 10.
The ideal output is:
2018/12/26 has Sales 36 when 2 times.
2018/11/19 has Sales 34 when 2 times.
Here is what I do with xlrd:
import xlrd
from datetime import datetime
old_file = xlrd.open_workbook("C:\\Sales.xlsx")
the_sheet = old_file.sheet_by_name("Sales")
for row_index in range(1, the_sheet.nrows):
Dates = the_sheet.cell(row_index, 0).value
Inventory = the_sheet.cell(row_index, 1).value
Sales = the_sheet.cell(row_index, 2).value
list_of_Inventory = []
for i in range(1,5):
list_of_Inventory.append(the_sheet.cell(row_index - i, 1).value)
if Sales > 20:
print str(Dates) + " has Sales " + str(Sales) + " when " + str(sum(i > 10 for i in list_of_Inventory)) + " times."
It doesn't work well.
What would be the proper way to work it out? Appreciate some guidance in pandas.
Thank you.
p.s. here is the data.
data = {'Date': ["2018/12/29","2018/12/26","2018/12/24","2018/12/15","2018/12/11","2018/12/8","2018/11/28","2018/11/20","2018/11/19","2018/11/11","2018/11/6","2018/11/1","2018/10/28","2018/10/11","2018/9/25","2018/9/24"],
'Inventory': [5,5,5,22,5,25,5,15,15,5,5,15,0,22,2,10],
'Sales' : [0,36,18,0,0,17,18,17,34,16,0,0,18,18,51,18]}
df = pd.DataFrame(data)
I don't think you're going to get around iterating over the dataframe (based on the specifics of your output). So provided your data isn't huge, it shouldn't be a problem. Here's another quick solution you can implement:
for idx in df.loc[df.Sales > 20].index:
inv = df.loc[idx-4:idx, 'Inventory'].ge(10)
date, _, sales = df.loc[idx]
if len(inv) >= 5:
print(f'{date} has Sales {sales} when {inv.sum()} times')
2018/11/19 has Sales 34 when 2 times
2018/9/25 has Sales 51 when 2 times
I think you can get there with a couple of "cheater" columns to do some intermediate work using pandas rolling function. Note 'HSHIC' = High Sales High Inventory Count. (Needed an acronym). This actually works well for your desire to exclude first 4 rows because rolling will exclude them automatically.
In [42]: df = pd.DataFrame(data)
In [43]: df
Out[43]:
Date Inventory Sales
0 2018/12/29 5 0
1 2018/12/26 5 36
2 2018/12/24 5 18
3 2018/12/15 6 0
4 2018/12/11 5 0
5 2018/12/8 0 17
6 2018/11/28 5 18
7 2018/11/20 15 17
8 2018/11/19 15 34
9 2018/11/11 5 16
10 2018/11/6 5 0
11 2018/11/1 15 0
12 2018/10/28 0 18
13 2018/10/11 10 18
14 2018/9/25 2 51
15 2018/9/24 10 18
In [44]: df['High Inventory'] = df['Inventory'] > 10
In [45]: df['High Inv Cnt'] = df['High Inventory'].rolling(window=5).sum()
In [46]: df
Out[46]:
Date Inventory Sales High Inventory High Inv Cnt
0 2018/12/29 5 0 False NaN
1 2018/12/26 5 36 False NaN
2 2018/12/24 5 18 False NaN
3 2018/12/15 6 0 False NaN
4 2018/12/11 5 0 False 0.0
5 2018/12/8 0 17 False 0.0
6 2018/11/28 5 18 False 0.0
7 2018/11/20 15 17 True 1.0
8 2018/11/19 15 34 True 2.0
9 2018/11/11 5 16 False 2.0
10 2018/11/6 5 0 False 2.0
11 2018/11/1 15 0 True 3.0
12 2018/10/28 0 18 False 2.0
13 2018/10/11 10 18 False 1.0
14 2018/9/25 2 51 False 1.0
15 2018/9/24 10 18 False 1.0
In [47]: df['HSHIC'] = df['High Inv Cnt'][df.Sales > 20]
In [48]: df
Out[48]:
Date Inventory Sales High Inventory High Inv Cnt HSHIC
0 2018/12/29 5 0 False NaN NaN
1 2018/12/26 5 36 False NaN NaN
2 2018/12/24 5 18 False NaN NaN
3 2018/12/15 6 0 False NaN NaN
4 2018/12/11 5 0 False 0.0 NaN
5 2018/12/8 0 17 False 0.0 NaN
6 2018/11/28 5 18 False 0.0 NaN
7 2018/11/20 15 17 True 1.0 NaN
8 2018/11/19 15 34 True 2.0 2.0
9 2018/11/11 5 16 False 2.0 NaN
10 2018/11/6 5 0 False 2.0 NaN
11 2018/11/1 15 0 True 3.0 NaN
12 2018/10/28 0 18 False 2.0 NaN
13 2018/10/11 10 18 False 1.0 NaN
14 2018/9/25 2 51 False 1.0 1.0
15 2018/9/24 10 18 False 1.0 NaN
In [49]:
Error in the first post of the question (what's on page now is correct), so let me put down a working solution by Python 2.
Thanks to #manwithfewneeds and #kantal.
for idx in df.index[df.Sales > 20]:
inv = df.loc[idx + 1 : idx + 5, 'Inventory'].ge(10) # downwards 5 rows, Inventory > 10
date, _, sales = df.loc[idx]
if len(inv) >= 5:
print '%s. has Sales %s. when %s. times' % (date, sales, inv.sum())

DataFrame only keep higher/lower values

I am trying to clean up a dataset. Only values smaller than the last value should be kept.
Right now it look slike this:
my_data
0 10
1 8
2 7
3 10
4 5
5 8
6 2
after the cleanup it should look like this:
my_data
0 10
1 8
2 7
3 7
4 5
5 5
6 2
I also have some working code but I am looking for a faster and more pythonic way of doing it.
import pandas as pd
df_results = pd.DataFrame()
df_results['my_data'] = [10, 8, 7, 10, 5, 8, 2]
data_idx = list(df_results['my_data']._index)
for i in range(1, len(df_results['my_data'])):
current_value = df_results['my_data'][data_idx[i]]
last_value = df_results['my_data'][data_idx[i - 1]]
df_results['my_data'][data_idx[i]] = current_value if current_value < last_value else last_value
You can use:
In [53]: df[df.my_data.diff() > 0] = np.nan
In [54]: df
Out[54]:
my_data
0 10.0
1 8.0
2 7.0
3 NaN
4 5.0
5 NaN
6 2.0
In [55]: df.ffill()
Out[55]:
my_data
0 10.0
1 8.0
2 7.0
3 7.0
4 5.0
5 5.0
6 2.0
I am using shift with diff
s=df.my_data.diff().gt(0)
df.loc[s,'my_data']=df.loc[s.shift(-1).fillna(False),'my_data'].values
Out[71]:
my_data
0 10.0
1 8.0
2 7.0
3 7.0
4 5.0
5 5.0
6 2.0

Pandas DataFrame, calculate max column value relative to current row column value

I have a dataframe:
df = pd.DataFrame( {
'epoch' : [1, 4, 7, 8, 9, 11, 12, 15, 16, 17],
'price' : [1, 2, 3, 3, 1, 4, 2, 3, 4, 4]
} )
epoch price
0 1 1
1 4 2
2 7 3
3 8 3
4 9 1
5 11 4
6 12 2
7 15 3
8 16 4
9 17 4
I have to create a new column that should be calculated in the following way:
For each row
Find current row's epoch (let's say e_cur)
Calculate e_cur-3 = e_cur – 3 (three is a constant here but it will be variable)
Calculate price max value where epoch >= e-3_cur and epoch <= e_cur
In other words, find maximum price in rows that are three epoch away from current row's epoch.
For example:
Index=0, e_cur = epoch = 1, e_cur-3 = 1 -3 = -2, there is only one (first) row whose epoch is between -2 and 1 so the price from the first row is maximum price
Index =6, e_cur = epoch = 12, e_cur-3 = 12 – 3 = 9, there are three rows whose epoch is between 9 and 12, but row with index=5 has the maximum price = 4.
Here are the results for every row that I calculated manually:
epoch price max_price_where_epoch_is_between_e_cur-3_and_e_cur
0 1 1 1
1 4 2 2
2 7 3 3
3 8 3 3
4 9 1 3
5 11 4 4
6 12 2 4
7 15 3 3
8 16 4 4
9 17 4 4
As you can see, epoch something goes one by one, but sometimes there are "holes".
How to calculate that with pandas?
Using rolling window:
In [161]: df['between'] = df.epoch.map(df.set_index('epoch')
...: .reindex(np.arange(df.epoch.min(), df.epoch.max()+1))
...: .rolling(3, min_periods=1)
...: .max()['price'])
...:
In [162]: df
Out[162]:
epoch price between
0 1 1 1.0
1 4 2 2.0
2 7 3 3.0
3 8 3 3.0
4 9 1 3.0
5 11 4 4.0
6 12 2 4.0
7 15 3 3.0
8 16 4 4.0
9 17 4 4.0
Explanation:
Helper DF:
In [165]: df.set_index('epoch').reindex(np.arange(df.epoch.min(), df.epoch.max()+1))
Out[165]:
price
epoch
1 1.0
2 NaN
3 NaN
4 2.0
5 NaN
6 NaN
7 3.0
8 3.0
9 1.0
10 NaN
11 4.0
12 2.0
13 NaN
14 NaN
15 3.0
16 4.0
17 4.0
In [166]: df.set_index('epoch').reindex(np.arange(df.epoch.min(), df.epoch.max()+1)).rolling(3, min_periods=1).max()
Out[166]:
price
epoch
1 1.0
2 1.0
3 1.0
4 2.0
5 2.0
6 2.0
7 3.0
8 3.0
9 3.0
10 3.0
11 4.0
12 4.0
13 4.0
14 2.0
15 3.0
16 4.0
17 4.0
Consider applying function on epoch column where you can locate the required rows and calculate their price max value
>> df['between'] = df['epoch'].apply(lambda e: df.loc[
>> (df['epoch'] >= e - 3) & (df['epoch'] <= e), 'price'].max())
>> df
epoch price between
0 1 1 1
1 4 2 2
2 7 3 3
3 8 3 3
4 9 1 3
5 11 4 4
6 12 2 4
7 15 3 3
8 16 4 4
9 17 4 4
I have tried both solutions, from tarashypka and MaxU.
The first solution I have tried was Tarashypka's. I tested it on 100k rows. It took about one minute.
Than I tried MaxU's solution, that has finished in about 4 seconds.
I prefer MaxU's solution because of the speed, but with Tarashypka's solution I also learned how to use lambda function with DataFrame.
Thank you very very much to all of you.
Best regards and wishes.

Categories