Pandas DataFrame - Access Values That are created on the fly - python

I am trying to figure out something that I can easily do in Excel, but I am having a hard time understanding how to do it on a pandas DataFrame without using loops.
Suppose that I have a data frame as follows:
+------------+-------+-------+-----+------+
| Date       | Price | Proxy | Div | Days |
+------------+-------+-------+-----+------+
| 13/01/2021 | 10    | 20    | 0.5 | NaN  |
| 08/01/2021 | NaN   | 30    | 0.6 | 5    |
| 04/01/2021 | NaN   | 40    | 0.7 | 4    |
| 03/01/2021 | NaN   | 50    | 0.8 | 1    |
| 01/01/2021 | NaN   | 60    | 0.9 | 2    |
+------------+-------+-------+-----+------+
The task is to fill in Price wherever it is null. In Excel, with Date in column A and the first data row in row 2, I would fill the NaN in the second data row of Price with the formula =(B2)/(((C3/C2)*D3)*E3), which gives 2.22.
I then want to use that 2.22 on the fly to fill the NaN in the third data row, because filling it requires the value just filled in above. The Excel formula for the third data row's Price would therefore be =(B3)/(((C4/C3)*D4)*E4).
One way would be to loop over all the rows of the DataFrame, which I want to avoid. What would be a vectorised approach to this problem?
Expected Output
+------------+-------+-------+-----+------+
| Date       | Price | Proxy | Div | Days |
+------------+-------+-------+-----+------+
| 13/01/2021 | 10    | 20    | 0.5 | NaN  |
| 08/01/2021 | 2.22  | 30    | 0.6 | 5    |
| 04/01/2021 | 0.60  | 40    | 0.7 | 4    |
| 03/01/2021 | 0.60  | 50    | 0.8 | 1    |
| 01/01/2021 | 0.28  | 60    | 0.9 | 2    |
+------------+-------+-------+-----+------+
Current_Price = Prev Price (non-nan) / (((Current_Proxy/Prev_Proxy) * Div) * Days)
Edit
Create the initial data frame using the code below:
import numpy as np
import pandas as pd

data = {'Date': ['2021-01-13', '2021-01-08', '2021-01-04', '2021-01-03', '2021-01-01'],
        'Price': [10, np.nan, np.nan, np.nan, np.nan],
        'Proxy': [20, 30, 40, 50, 60],
        'Div': [0.5, 0.6, 0.7, 0.8, 0.9],
        'Days': [np.nan, 5, 4, 1, 2]}
df = pd.DataFrame(data)

What you want to achieve is actually a cumulative product:
df['Price'] = (df['Price']
               .combine_first(df['Proxy'].shift() / df.eval('Proxy*Div*Days'))
               .cumprod()
               .round(2))
Output:
         Date  Price  Proxy  Div  Days
0  2021-01-13  10.00     20  0.5   NaN
1  2021-01-08   2.22     30  0.6   5.0
2  2021-01-04   0.60     40  0.7   4.0
3  2021-01-03   0.60     50  0.8   1.0
4  2021-01-01   0.28     60  0.9   2.0
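To see why a cumulative product reproduces the recurrence, here is a minimal sketch that rebuilds the sample frame from the Edit above and spells out the per-row multiplier before taking the running product:

import numpy as np
import pandas as pd

data = {'Date': ['2021-01-13', '2021-01-08', '2021-01-04', '2021-01-03', '2021-01-01'],
        'Price': [10, np.nan, np.nan, np.nan, np.nan],
        'Proxy': [20, 30, 40, 50, 60],
        'Div': [0.5, 0.6, 0.7, 0.8, 0.9],
        'Days': [np.nan, 5, 4, 1, 2]}
df = pd.DataFrame(data)

# Rearranging Price_i = Price_{i-1} / ((Proxy_i / Proxy_{i-1}) * Div_i * Days_i)
# gives Price_i = Price_{i-1} * factor_i, where
#   factor_i = Proxy_{i-1} / (Proxy_i * Div_i * Days_i)
factor = df['Proxy'].shift() / (df['Proxy'] * df['Div'] * df['Days'])

# Row 0 keeps its known price; every later row holds only its multiplier,
# so a single running product unrolls the whole recurrence.
df['Price'] = df['Price'].combine_first(factor).cumprod().round(2)
print(df)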

Related

Filter according to dynamic conditions that are dependent

I have a dataframe that looks like:
+----+---------+---------+
|    |   Count |   Value |
|----+---------+---------|
|  0 |      10 |     0.5 |
|  1 |      17 |     0.9 |
|  2 |      56 |     0.6 |
|  3 |      25 |     0.7 |
|  4 |      80 |     0.7 |
|  5 |     190 |     0.6 |
|  6 |       3 |     0.8 |
|  7 |      60 |     0.5 |
+----+---------+---------+
Now I want to filter: rows with a smaller Count require a higher Value to be kept in the result.
The dependencies could look like: dict({100:0.5, 50:0.6, 40:0.7, 20:0.75, 10:0.8})
Examples:
if Count is at least 100, Value only needs to be greater/equal 0.5
if Count is only 10 to 19, Value needs to be greater/equal 0.8
I could filter it easily with:
df[((df["Count"]>=100) & (df["Value"]>=0.5)) |
((df["Count"]>=50) & (df["Value"]>=0.6)) |
((df["Count"]>=40) & (df["Value"]>=0.7)) |
((df["Count"]>=20) & (df["Value"]>=0.75)) |
((df["Count"]>=10) & (df["Value"]>=0.8))]
+----+---------+---------+
|    |   Count |   Value |
|----+---------+---------|
|  1 |      17 |     0.9 |
|  2 |      56 |     0.6 |
|  4 |      80 |     0.7 |
|  5 |     190 |     0.6 |
+----+---------+---------+
But I want to change the thresholds periodically (also adding or removing threshold steps) without constantly rewriting the filter. How could I do this in pandas?
MWE
import pandas as pd
df = pd.DataFrame({
    "Count": [10, 17, 56, 25, 80, 190, 3, 60],
    "Value": [0.5, 0.9, 0.6, 0.7, 0.7, 0.6, 0.8, 0.5]
})
limits = {100: 0.5, 50: 0.6, 40: 0.7, 20: 0.75, 10: 0.8}
R equivalent
In R I could solve a similar question with the following code (thanks to akrun), but I don't know how to adapt it to pandas.
library(data.table)
set.seed(33)
df = data.table(CPE = sample(1:500, 100),
                PERC = runif(min = 0.1, max = 1, n = 100))
lst1 <- list(c(20, 0.95), c(50, 0.9), c(100,0.85), c(250,0.8))
df[Reduce(`|`, lapply(lst1, \(x) CPE > x[1] & PERC > x[2]))]
Let's simplify your code by using boolean reduction with np.logical_or. This is also very close to what you are trying to do in R:
import numpy as np

c = ['Count', 'Value']
df[np.logical_or.reduce([df[c].ge(t).all(1) for t in limits.items()])]
   Count  Value
1     17    0.9
2     56    0.6
4     80    0.7
5    190    0.6
I would use pandas.cut to perform the comparison in linear time. If you have many groups, performing multiple comparisons becomes inefficient (O(n*m) complexity):
import numpy as np

# sorted bins and matching labels
bins = sorted(limits)
# [10, 20, 40, 50, 100]
labels = [limits[x] for x in bins]
# [0.8, 0.75, 0.7, 0.6, 0.5]

# map each Count to its threshold; Counts below the lowest bin get inf and never pass
s = pd.cut(df['Count'], bins=[0]+bins+[np.inf], labels=[np.inf]+labels, right=False).astype(float)
out = df[df['Value'].ge(s)]
Output:
   Count  Value
1     17    0.9
2     56    0.6
4     80    0.7
5    190    0.6
Intermediate s:
0    0.80
1    0.80
2    0.60
3    0.75
4    0.60
5    0.50
6     inf
7    0.60
Name: Count, dtype: float64
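For completeness, here is a small sketch (the filter_by_limits helper name is mine, not from the answers) that wraps the pd.cut approach so the threshold dict can be swapped, extended, or shrunk without touching the filter logic:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Count": [10, 17, 56, 25, 80, 190, 3, 60],
    "Value": [0.5, 0.9, 0.6, 0.7, 0.7, 0.6, 0.8, 0.5]
})

def filter_by_limits(df, limits):
    # Keep rows whose Value meets the threshold for their Count bucket.
    bins = sorted(limits)
    labels = [limits[x] for x in bins]
    thresholds = pd.cut(df['Count'], bins=[0] + bins + [np.inf],
                        labels=[np.inf] + labels, right=False).astype(float)
    return df[df['Value'].ge(thresholds)]

print(filter_by_limits(df, {100: 0.5, 50: 0.6, 40: 0.7, 20: 0.75, 10: 0.8}))
print(filter_by_limits(df, {100: 0.5, 25: 0.65}))  # hypothetical alternative thresholds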

Populate the current row based on the previous row

A question was posted on the link below where one wanted to use a previous row to populate the current row:
Is there a way in Pandas to use previous row value in dataframe.apply when previous value is also calculated in the apply?
In this case, there was only one index, the date.
Now I want to add a second index, employee ID. On the first occurrence of each Index_EmpID, B should be populated with the value from A; on any subsequent occurrence, B should be the previous row's B multiplied by the current row's A.
I have the following data frame:
|Index_EmpID |Index_Date | A    | B   |
|============|===========|======|=====|
|A123        |2022-01-31 | 1    | NaN |
|A123        |2022-02-28 | 1    | NaN |
|A123        |2022-03-31 | 1.05 | NaN |
|A123        |2022-04-30 | 1    | NaN |
|A567        |2022-01-31 | 1    | NaN |
|A567        |2022-02-28 | 1.05 | NaN |
|A567        |2022-03-31 | 1    | NaN |
|A567        |2022-04-30 | 1.05 | NaN |
I require:
|Index_EmpID |Index_Date | A    | B      |
|============|===========|======|========|
|A123        |2022-01-31 | 1    | 1      |
|A123        |2022-02-28 | 1    | 1      |
|A123        |2022-03-31 | 1.05 | 1.05   |
|A123        |2022-04-30 | 1    | 1.05   |
|A567        |2022-01-31 | 1    | 1      |
|A567        |2022-02-28 | 1.05 | 1.05   |
|A567        |2022-03-31 | 1    | 1.05   |
|A567        |2022-04-30 | 1.05 | 1.1025 |
Something like
df["B"] = df.groupby("Index_EmpID")["A"].cumprod()
should work.
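As a quick sanity check, a minimal sketch against the sample data from the question (the DataFrame construction is mine):

import pandas as pd

df = pd.DataFrame({'Index_EmpID': ['A123'] * 4 + ['A567'] * 4,
                   'Index_Date': ['2022-01-31', '2022-02-28',
                                  '2022-03-31', '2022-04-30'] * 2,
                   'A': [1, 1, 1.05, 1, 1, 1.05, 1, 1.05]})

# The running product of A within each employee reproduces the requested B column.
df['B'] = df.groupby('Index_EmpID')['A'].cumprod()
print(df)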
A solution that uses iterrows is not as nice as the groupby one, but it follows directly from the description and uses only the most elementary pandas facilities.
import numpy as np
import pandas as pd

empdf = pd.DataFrame({'Index_EmpID': ['A123'] * 4 + ['A567'] * 4,
                      'Index_Date': ['2022-01-31', '2022-02-28',
                                     '2022-03-31', '2022-04-30'] * 2,
                      'A': [1, 1, 1.05, 1, 1, 1.05, 1, 1.05],
                      'B': [np.nan] * 8})

past_id, past_b, bs = None, 1, []
for label, row in empdf.iterrows():
    if row['Index_EmpID'] == past_id:
        bs.append(past_b * row['A'])   # continue the running product for this employee
    else:
        bs.append(row['A'])            # first row of a new employee: start from A
    past_b = bs[-1]
    past_id = row['Index_EmpID']
empdf['B'] = bs
This would produce exactly the dataframe you requested:
  Index_EmpID  Index_Date     A       B
0        A123  2022-01-31  1.00  1.0000
1        A123  2022-02-28  1.00  1.0000
2        A123  2022-03-31  1.05  1.0500
3        A123  2022-04-30  1.00  1.0500
4        A567  2022-01-31  1.00  1.0000
5        A567  2022-02-28  1.05  1.0500
6        A567  2022-03-31  1.00  1.0500
7        A567  2022-04-30  1.05  1.1025

How to create cumulative bins in dataframe?

I have a df which looks like this:
date       | user_id | purchase_probability | sales
2020-01-01 | 1       | 0.19                 | 10
2020-01-20 | 1       | 0.04                 | 0
2020-01-01 | 3       | 0.31                 | 5
2020-01-10 | 2       | 0.05                 | 18
How can I best create a new dataframe that creates cumulative buckets in 10% increments such as:
probability_bin | total_users | total_sales
0-10%           | 2           | 18+0=18
0-20%           | 2           | 18+0+10=28
0-30%           | 2           | 28
0-40%           | 3           | 10+0+5+18=33
0-50%           | 3           | 33
0-60%           | 3           | 33 (same for all rows below)
0-70%           | 3           | 33
0-80%           | 3           | 33
0-90%           | 3           | 33
0-100%          | 3           | 33
I tried using a custom function and also pandas cut and qcut, but I am not sure how to get to that cumulative output.
Any ideas are appreciated.
Use cut to create normal bins, then aggregate and cumsum:
import numpy as np
import pandas as pd

bins = np.arange(0, 101, 10)
labels = [f'0-{int(i)}%' for i in bins[1:]]
group = pd.cut(df['purchase_probability'], bins=bins/100, labels=labels)

(df.groupby(group, observed=False)  # observed=False keeps the empty bins in the output
   .agg(total_users=('user_id', 'count'), total_sales=('sales', 'sum'))
   .cumsum()
)
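A minimal runnable version with the sample rows from the question (the DataFrame construction is mine; if total_users should count distinct users per bin rather than rows, ('user_id', 'nunique') could be used instead of 'count'):

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': ['2020-01-01', '2020-01-20', '2020-01-01', '2020-01-10'],
                   'user_id': [1, 1, 3, 2],
                   'purchase_probability': [0.19, 0.04, 0.31, 0.05],
                   'sales': [10, 0, 5, 18]})

bins = np.arange(0, 101, 10)
labels = [f'0-{int(i)}%' for i in bins[1:]]
group = pd.cut(df['purchase_probability'], bins=bins / 100, labels=labels)

out = (df.groupby(group, observed=False)  # keep the empty high-probability bins
         .agg(total_users=('user_id', 'count'), total_sales=('sales', 'sum'))
         .cumsum())
print(out)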

How to find average after sorting month column in python

I have a challenge in front of me in python.
| Growth_rate | Month |
| ----------- | ----- |
| 0           | 1     |
| -2          | 1     |
| 1.2         | 1     |
| 0.3         | 2     |
| -0.1        | 2     |
| 7           | 2     |
| 9           | 3     |
| 4.1         | 3     |
Now I want to add the average growth rate per month in a new column. For the 1st month the average would be roughly -0.27, and the result should look like the table below.
| Growth_rate | Month | Mean  |
| ----------- | ----- | ----- |
| 0           | 1     | -0.27 |
| -2          | 1     | -0.27 |
| 1.2         | 1     | -0.27 |
| 0.3         | 2     | 2.4   |
| -0.1        | 2     | 2.4   |
| 7           | 2     | 2.4   |
| 9           | 3     | 6.55  |
| 4.1         | 3     | 6.55  |
This will calculate the mean growth rate and put it into the Mean column.
Any help would be great.
df.groupby('Month').mean().reset_index().rename(columns={'Growth_rate': 'Mean'}).merge(df, on='Month')
Out[59]:
   Month      Mean  Growth_rate
0      1 -0.266667          0.0
1      1 -0.266667         -2.0
2      1 -0.266667          1.2
3      2  2.400000          0.3
4      2  2.400000         -0.1
5      2  2.400000          7.0
6      3  6.550000          9.0
7      3  6.550000          4.1
Assuming that you are using the pandas package and your table is in a DataFrame df:
In [91]: means = df.groupby('Month').mean().reset_index()
In [92]: means.columns = ['Month', 'Mean']
Then join via merge
In [93]: pd.merge(df, means, how='outer', on='Month')
Out[93]:
   Growth_rate  Month      Mean
0          0.0      1 -0.266667
1         -2.0      1 -0.266667
2          1.2      1 -0.266667
3          0.3      2  2.400000
4         -0.1      2  2.400000
5          7.0      2  2.400000
6          9.0      3  6.550000
7          4.1      3  6.550000
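For reference, an equivalent one-step alternative (not from the answers above) is groupby().transform('mean'), which broadcasts each month's mean back onto its rows and avoids the merge entirely:

import pandas as pd

df = pd.DataFrame({'Growth_rate': [0, -2, 1.2, 0.3, -0.1, 7, 9, 4.1],
                   'Month': [1, 1, 1, 2, 2, 2, 3, 3]})

# transform returns a Series aligned with df's index: one group mean per row
df['Mean'] = df.groupby('Month')['Growth_rate'].transform('mean')
print(df)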

Select rows in one DataFrame based on rows in another

Let's assume I have a very large pandas DataFrame dfBig with columns Param1, Param2, ..., ParamN, score, step, and a smaller DataFrame dfSmall with columns Param1, Param2, ..., ParamN (i.e. missing the score and step columns).
I want to select all the rows of dfBig for which the values of columns Param1, Param2, ..., ParamN match those of some row in dfSmall. Is there a clean way of doing this in pandas?
Edit: To give an example, consider this DataFrame dfBig:
Arch | Layers | Score | Time
A    | 1      | 0.3   | 10
A    | 1      | 0.6   | 20
A    | 1      | 0.7   | 30
A    | 2      | 0.4   | 10
A    | 2      | 0.5   | 20
A    | 2      | 0.6   | 30
B    | 1      | 0.1   | 10
B    | 1      | 0.2   | 20
B    | 1      | 0.7   | 30
B    | 2      | 0.7   | 10
B    | 2      | 0.8   | 20
B    | 2      | 0.8   | 30
Let's imagine a model is specified by a pair (Arch, Layers). I want to query dfBig and get the time series for scores over time for the best performing models with Arch A and Arch B.
Following EdChum's answer below, I take it that the best solution is to do something like this procedurally:
modelColumns = [col for col in dfBig.columns if col not in ["Time", "Score"]]
groupedBest = dfBig.loc[dfBig.groupby("Arch")["Score"].idxmax()]
dfSmall = groupedBest[modelColumns]
dfBest = pd.merge(dfSmall, dfBig)
which yields:
Arch | Layers | Score | Time
A    | 1      | 0.3   | 10
A    | 1      | 0.6   | 20
A    | 1      | 0.7   | 30
B    | 2      | 0.7   | 10
B    | 2      | 0.8   | 20
B    | 2      | 0.8   | 30
If there's a better way to do this, I'm happy to hear it.
If I understand your question correctly, you should be able to just call merge on dfBig and pass dfSmall; it will look for matches in the common columns and only return those rows.
Example:
In [71]:
dfBig = pd.DataFrame({'a':np.arange(100), 'b':np.arange(100), 'c':np.arange(100)})
dfSmall = pd.DataFrame({'a':[3,4,5,6]})
dfBig.merge(dfSmall)
Out[71]:
   a  b  c
0  3  3  3
1  4  4  4
2  5  5  5
3  6  6  6
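One detail worth noting as a follow-up: merge joins on every column name the two frames share by default, so if dfBig and dfSmall happened to share columns that are not parameters, the join keys could be pinned down explicitly. A small sketch reusing the example frames above:

import numpy as np
import pandas as pd

dfBig = pd.DataFrame({'a': np.arange(100), 'b': np.arange(100), 'c': np.arange(100)})
dfSmall = pd.DataFrame({'a': [3, 4, 5, 6]})

# Restrict the join to dfSmall's columns instead of relying on the default
# intersection of column names.
matched = dfBig.merge(dfSmall, on=list(dfSmall.columns))
print(matched)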
