I have columns in the pandas dataframe df_profit:
profit_date profit
0 01.04 70
1 02.04 80
2 03.04 80
3 04.04 100
4 05.04 120
5 06.04 120
6 07.04 120
7 08.04 130
8 09.04 140
9 10.04 140
And I have the second dataframe df_deals:
deals_date
0 03.04
1 05.04
2 06.04
I want to create a new column 'delta' in df_profit equal to the difference between the current value and a previous value in the 'profit' column. However, the delta should only be calculated once 'profit_date' has passed the first date that appears in the 'deals_date' column of df_deals, and the "previous value" in the calculation should always stay the same: the profit on the row where 'profit_date' first equals the first date in 'deals_date'.
So, the result would look like:
profit_date profit delta
0 01.04 70
1 02.04 80
2 03.04 80
3 04.04 100 20
4 05.04 120 40
5 06.04 120 40
6 07.04 120 40
7 08.04 130 50
8 09.04 140 60
9 10.04 140 60
Next time, please provide your data in a form that is easier to work with (e.g. the DataFrame-creation code, so that we can copy and paste it).
I think this code does what you want:
import pandas as pd
df_profit = pd.DataFrame(columns=["profit_date", "profit"],
                         data=[["01.04", 70],
                               ["02.04", 80],
                               ["03.04", 80],
                               ["04.04", 100],
                               ["05.04", 120],
                               ["06.04", 120],
                               ["07.04", 120],
                               ["08.04", 130],
                               ["09.04", 140],
                               ["10.04", 140]])
df_deals = pd.DataFrame(columns=["deals_date"], data=["03.04", "05.04", "06.04"])
# combine both dataframes, based on date columns
df = df_profit.merge(right=df_deals, left_on="profit_date", right_on="deals_date", how="left")
# find the first value (first row with deals date) and set it to 'base'
df["base"] = df.loc[df["deals_date"].first_valid_index()]["profit"]
# calculate delta
df["delta"] = df["profit"] - df["base"]
# clear delta up to and including the first deals date
df.loc[:df["deals_date"].first_valid_index(), "delta"] = None
# remove temporary cols
df.drop(columns=["base", "deals_date"], inplace=True)
print(df)
output is:
profit_date profit delta
0 01.04 70 NaN
1 02.04 80 NaN
2 03.04 80 NaN
3 04.04 100 20.0
4 05.04 120 40.0
5 06.04 120 40.0
6 07.04 120 40.0
7 08.04 130 50.0
8 09.04 140 60.0
9 10.04 140 60.0
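If you prefer to avoid the merge and the temporary columns, here is a more compact sketch (assuming the same df_profit and df_deals as above): locate the row of the first deals date and subtract its profit from all later rows.
# index of the first row whose profit_date equals the first deals date
start_idx = df_profit.index[df_profit["profit_date"] == df_deals["deals_date"].iloc[0]][0]
# the profit on that row is the fixed "previous value"
base = df_profit.loc[start_idx, "profit"]
# fill delta only for the rows after that date; earlier rows stay NaN
df_profit.loc[df_profit.index > start_idx, "delta"] = df_profit["profit"] - base
print(df_profit)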
You can try this one if you don't want to get NaN values:
start_profit = df_profit.loc[(df_profit["profit_date"] == df_deals.iloc[0][0])]
start_profit = start_profit.iloc[0][1]
for i in range(len(df_profit)):
    # note: the day/month check below is hard-coded for this sample data (dates after 03.04)
    if int(str(df_profit.iloc[i][0]).split(".")[0]) > 3 and int(str(df_profit.iloc[i][0]).split(".")[1]) >= 4:
        df_profit.loc[i, "delta"] = df_profit.iloc[i][1] - start_profit
Hope it helps
Here's what my data look like:
user_id  prior_elapse_time  timestamp
    115                NaN          0   <-- set this one to 0
    115                 10       1000
    115                  5       2000
 222212                NaN          0   <-- set this one to 0
 222212                  8        500
 222212                 12       3000
 222212                NaN       5000
 222212                 15       8000
I found similar posts that teach me how to get the first occurrence of a user:
train_df.groupby('user_id')['prior_elapse_time'].first()
This nicely gets me the first appearance of each user. However, I'm now at a loss as to how to correctly assign 0 to the NaN only at the first occurrence of each user. Due to a logging error, NaN appears elsewhere too, but I only want to assign 0 to the NaNs marked above.
I also tried
train_df['prior_elapse_time'][(train_df['prior_elapse_time'].isna()) & (train_df['timestamp'] == 0)] = 0
But then I get the "copy" vs. "view" assignment problem (which I don't fully understand).
Any help?
If your df is sorted by user_id:
>>> df.loc[df.user_id.diff().ne(0), 'prior_elapse_time'] = 0
>>> df
user_id prior_elapse_time timestamp
0 115 0.0 0
1 115 10.0 1000
2 115 5.0 2000
3 222212 0.0 0
4 222212 8.0 500
5 222212 12.0 3000
6 222212 NaN 5000
7 222212 15.0 8000
Alternatively, use pandas.Series.mask
>>> df['prior_elapse_time'] = df.prior_elapse_time.mask(df.user_id.diff().ne(0), 0)
If not sorted, then get the indices via groupby:
>>> idx = df.reset_index().groupby('user_id')['index'].first()
>>> df.loc[idx, 'prior_elapse_time'] = 0
If you want to set 0 only where the value was previously NaN, add a pandas.Series.isnull mask to the condition:
>>> df.loc[
(df.user_id.diff().ne(0) & df.prior_elapse_time.isnull()),
'prior_elapse_time'
] = 0
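And if the frame is not sorted by user_id and you also only want to touch rows that are currently NaN, you can combine the two ideas; a small sketch, assuming the same df as above:
# first row of each user, regardless of sort order
first_idx = df.groupby('user_id').head(1).index
# restrict to those first rows that are currently NaN, then set them to 0
mask = df.index.isin(first_idx) & df['prior_elapse_time'].isnull()
df.loc[mask, 'prior_elapse_time'] = 0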
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
Following this answer to a previous question I had asked, I used this code to summarise the ultrasound measurements using the maximum measurement recorded in a single trimester (13 weeks):
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns = 'gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
This results in the following output:
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
However, MotherID and PregnancyID no longer appear as columns in the output of df.info(). Similarly, when I output the dataframe to a csv file, I only get columns 1,2 and 3. The id columns only appear when running df.head() as can be seen in the dataframe above.
I need to preserve the id columns as I want to use them to merge this dataframe with another one using the ids. Therefore, my question is, how do I preserve these id columns as part of my dataframe after running the code above?
Chain that with reset_index:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   # .drop(columns = 'gestationalAgeInWeeks')                # don't need this
   .groupby(['MotherID', 'PregnancyID', 'tm'])['abdomCirc']  # change here
   .max().add_prefix('abdomCirc_')                           # here
   .unstack()
   .reset_index()                                            # and here
)
Or a more friendly version with pivot_table:
(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                values='abdomCirc', aggfunc='max')
   .add_prefix('abdomCirc_')   # remove this if you don't want the prefix
   .reset_index()
)
Output:
tm MotherID PregnancyID abdomCirc_1 abdomCirc_2 abdomCirc_3
0 abdomCirc_0 abdomCirc_0 NaN 200.0 NaN
1 abdomCirc_1 abdomCirc_1 NaN 315.0 350.0
2 abdomCirc_2 abdomCirc_2 180.0 NaN NaN
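Once MotherID and PregnancyID are ordinary columns again, they survive to_csv and can be used as merge keys. A minimal usage sketch, where other_df and the file name are just placeholders:
summary = (df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
             .pivot_table(index=['MotherID', 'PregnancyID'], columns='tm',
                          values='abdomCirc', aggfunc='max')
             .add_prefix('abdomCirc_')
             .reset_index())
summary.to_csv('abdom_circ_by_trimester.csv', index=False)   # id columns are written out too
# other_df is a placeholder for the dataframe you want to merge with
merged = summary.merge(other_df, on=['MotherID', 'PregnancyID'], how='left')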
I am trying to write code which adds missing periods to the dataframe and calculates their respective averages. Refer to the example below:
Invoice Date Amount
9 01/2020 227500
4 02/2020 56000
0 03/2020 22000
1 05/2020 25000
5 06/2020 75000
2 07/2020 27000
6 08/2020 48000
3 09/2020 35000
7 10/2020 115000
8 12/2020 85000
In the above dataframe, we see that the record for '11/2020' is missing. I am trying to add a record for the period 11/2020 and calculate its mean from the last three months, i.e. if 11/2020 is missing, take the amounts for 12/2020, 10/2020 and 9/2020, calculate their mean and append it to the dataframe.
Expected output:
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 75000.00
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67
9 12/2020 85000.00
Please note that, I am able to arrive at the above result with the following code:
import pandas as pd

FundAdmin = {
    'Invoice Date': ['03/2020', '05/2020', '07/2020', '09/2020', '02/2020', '04/2020',
                     '06/2020', '08/2020', '10/2020', '12/2020', '01/2020'],
    'Amount': [22000, 25000, 27000, 35000, 56000, 75000, 48000, 115000, 77000, 85000, 227500]
}
expected_dates = ['01/2020', '02/2020', '03/2020', '04/2020', '05/2020', '06/2020',
                  '07/2020', '08/2020', '09/2020', '10/2020', '11/2020', '12/2020']
df = pd.DataFrame(FundAdmin, columns=['Invoice Date', 'Amount'])
current_dates = df['Invoice Date']
missing_dates = list(set(expected_dates) - set(current_dates))
sorted_df = df.sort_values(by='Invoice Date')
for i in missing_dates:
    Top_3_Rows = sorted_df.tail(3)  # print(Top_3_Rows)
    Top_3_Rows_Amount = round(Top_3_Rows.mean(), 2)
    CalcDF = {
        'Invoice Date': i,
        'Amount': float(Top_3_Rows_Amount)
    }
    FullDF = df.append(CalcDF, ignore_index=True)
print(FullDF)
However, my code cannot handle the calculation for missing records in the middle of the dataframe. It adds the missing period to the dataframe, but it does not pick up the values of the previous 3 months; it assigns the same mean amount to all of the missing periods. For example, if the record for 4/2020 is missing, the code should add a new record for 4/2020 and assign it the mean of 1/2020, 2/2020 and 3/2020. Instead, it assigns the mean value of the other missing period. Please refer to the below:
Expected Output (if both 11/2020 and 4/2020 are missing):
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 101833.33 <---- new record for 4/2020, calculated as the mean of 3/2020, 2/2020 and 1/2020
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67 <---- new record for 11/2020, calculated as the mean of 12/2020, 10/2020 and 9/2020
9 12/2020 85000.00
My Output (if both 11/2020 and 4/2020 are missing):
Invoice Date Amount
10 01/2020 227500.00
4 02/2020 56000.00
0 03/2020 22000.00
5 04/2020 65666.67 <--- Value same as 11/2020
1 05/2020 25000.00
6 06/2020 48000.00
2 07/2020 27000.00
7 08/2020 115000.00
3 09/2020 35000.00
8 10/2020 77000.00
11 11/2020 65666.67 <--- This works fine.
9 12/2020 85000.00
From my observation, my code cannot fetch the last 3 records when the missing period falls in the middle of the dataframe: because I am using the tail() method, it fetches the records for 9/2020, 10/2020 and 12/2020, calculates their mean, and assigns that same value to 4/2020. I am a complete beginner to Python, and any assistance in resolving this issue would be greatly appreciated.
Would this work for you?
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from random import randint
df_len = 100
df = pd.DataFrame({
    'Invoice': [randint(1, 10) for _ in range(df_len)],
    'Dates': [(datetime.today() - pd.DateOffset(months=mnths_ago)).date()
              for mnths_ago in range(df_len)],
    'Amount': [randint(1, 100000) for _ in range(df_len)],
})
# Drop 10 random rows
drop_indices = np.random.choice(df.index, 10, replace=False)
df = df.drop(drop_indices)
df
Invoice Dates Amount
0 1 2020-05-19 23797
1 6 2020-04-19 54101
2 10 2020-03-19 91522
3 5 2020-02-19 48762
4 1 2020-01-19 54497
.. ... ... ...
93 1 2012-08-19 56834
94 10 2012-07-19 21382
95 2 2012-06-19 33056
96 1 2012-05-19 93336
98 7 2012-03-19 12406
from dateutil import relativedelta
def get_prev_mean(date):
    return df[:df.loc[df.Dates == date].index[0]].tail(3)['Amount'].mean()
r = relativedelta.relativedelta(df.Dates.min(), df.Dates.max())
n_months = -(r.years * 12) + r.months
all_months = [(df.Dates.max() - pd.DateOffset(months=mnths_ago)).date() for mnths_ago in range(n_months)]
missing_months = [mnth for mnth in all_months if mnth in list(df.Dates)]
dct = {mnth: get_prev_mean(mnth) for mnth in missing_months}
to_merge = pd.DataFrame(data=dct.values(), index=dct.keys()).reset_index()
to_merge.columns = ['Dates', 'Amount']
out = pd.concat([df, to_merge], sort=False).sort_values(by='Dates').reset_index(drop=True)
out
Invoice Dates Amount
0 7.0 2012-03-19 12406.0
1 1.0 2012-05-19 93336.0
2 2.0 2012-06-19 33056.0
3 10.0 2012-07-19 21382.0
4 1.0 2012-08-19 56834.0
.. ... ... ...
171 10.0 2020-03-19 91522.0
172 NaN 2020-04-19 23797.0
173 6.0 2020-04-19 54101.0
174 NaN 2020-05-19 NaN
175 1.0 2020-05-19 23797.0
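A different sketch that stays closer to your original frame (it assumes the FundAdmin dict from your code above): convert 'Invoice Date' to monthly periods, reindex over the full range so missing months show up as NaN, and fill each gap in order from the mean of the three months before it. Note this follows the "previous 3 months" rule you describe for 4/2020; for 11/2020 it would therefore use 8, 9 and 10/2020 rather than 9, 10 and 12/2020.
import pandas as pd

df = pd.DataFrame(FundAdmin, columns=['Invoice Date', 'Amount'])
# monthly periods as the index, so missing months can simply be reindexed in
df['Invoice Date'] = pd.to_datetime(df['Invoice Date'], format='%m/%Y').dt.to_period('M')
s = df.set_index('Invoice Date')['Amount'].sort_index()
s = s.reindex(pd.period_range(s.index.min(), s.index.max(), freq='M'))
# fill each missing month with the mean of the three months immediately before it
for period in s.index[s.isna()]:
    pos = s.index.get_loc(period)
    s.loc[period] = round(s.iloc[max(pos - 3, 0):pos].mean(), 2)
result = s.rename('Amount').reset_index().rename(columns={'index': 'Invoice Date'})
print(result)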
This is my dataframe
Order Time Profit
0 1 106 NaN
1 1 111 -296.0
2 2 14 NaN
3 2 16 -296.0
4 3 62 NaN
.. ... ... ...
335 106 32 -297.6
336 107 44 NaN
337 107 44 138.0
338 108 58 NaN
339 108 63 -303.4
So the way I want it to work is to plot a chart where X is the time and Y is the price (positive or negative), so we need to have 2 bars. Now, the time should not come from the same row, but from the first row with the same order number.
For example, the -296.0 would be plotted under time 106, not 111, because 106 was the first time under order nr. 1. How would we do something like that?
This is my code so far:
data = pd.read_csv(filename)
df = pd.DataFrame(data, columns = ['Order','Time','Profit']).astype(str)
#turns time column into hours of week
df['Time'] = df['Time'].apply(lambda x: findHourOfWeek(x))
df['Profit'] = df['Profit'].astype(float)
Assuming the structure we see in the sample of your data holds over the entire data set, i.e. there is only one Profit value per Order, you can do it like this: Group the DataFrame by Order, and aggregate by taking the minimum:
df_grouped = df.groupby(by='Order').min()
resulting in this DataFrame:
Time Profit
Order
1 106 -296.0
2 14 -296.0
3 62 NaN
...
106 32 -297.6
107 44 138.0
108 58 -303.4
Then you can sort by Time and do the plot:
import matplotlib.pyplot as plt
df_grouped.sort_values(by='Time', inplace=True)
plt.plot(df_grouped['Time'], df_grouped['Profit'])
If you would rather rely on position in the data table, you can also do this:
plot_df = pd.DataFrame()
plot_df["Order"] = df.Order.unique()
plot_df["Profit"] = list(df.groupby("Order").nth(-1)["Profit"])
plot_df["Time"] = list(df.groupby("Order").nth(0)["Time"])
However, if you want the minimum value for Time, you had better use the solution provided by Arne, since it is safer and more correct (provided that you only have one profit value for each order number).
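Since you mentioned bars for positive and negative values, here is a small plotting sketch on top of df_grouped from Arne's answer (the colours and labels are just an example):
import matplotlib.pyplot as plt

df_grouped = df_grouped.sort_values(by='Time')
# one bar per order, drawn upward for gains and downward for losses
colors = ['green' if p >= 0 else 'red' for p in df_grouped['Profit'].fillna(0)]
plt.bar(df_grouped['Time'], df_grouped['Profit'], color=colors)
plt.xlabel('Time (hour of week)')
plt.ylabel('Profit')
plt.show()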
I am trying to get the weighted mean for each column (A-F) of a Pandas.Dataframe with "Value" as the weight. I can only find solutions for problems with categories, which is not what I need.
The comparable solution for normal means would be
df.mean()
Notice the df has NaN values in the columns and in "Value".
A B C D E F Value
0 17656 61496 83 80 117 99 2902804
1 75078 61179 14 3 6 14 3761964
2 21316 60648 86 NaN 107 93 127963
3 6422 48468 28855 26838 27319 27011 131354
4 12378 42973 47153 46062 46634 42689 3303909572
5 54292 35896 59 6 3 18 27666367
6 21272 NaN 126 12 3 5 9618047
7 26434 35787 113 17 4 8 309943
8 10508 34314 34197 7100 10 10 NaN
I can use this for a single column.
df1 = df[['A','Value']]
df1 = df1.dropna()
np.average(df1['A'], weights=df1['Value'])
There must be a simple method. It's driving me nuts I don't see it.
I would appreciate any help.
You could use masked arrays. We can drop the rows where the Value column has NaN values.
In [353]: dff = df.dropna(subset=['Value'])
In [354]: dff.apply(lambda x: np.ma.average(
np.ma.MaskedArray(x, mask=np.isnan(x)), weights=dff.Value))
Out[354]:
A 1.282629e+04
B 4.295120e+04
C 4.652817e+04
D 4.545254e+04
E 4.601520e+04
F 4.212276e+04
Value 3.260246e+09
dtype: float64
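If you would rather stay in plain pandas, the same weighted means can be computed from the NaN pattern directly; a short sketch, assuming 'Value' is the weight column as above:
dff = df.dropna(subset=['Value'])    # rows without a weight cannot contribute
vals = dff.drop(columns='Value')
w = dff['Value']
# numerator: sum of value * weight (NaNs are skipped); denominator: sum of weights where the value exists
weighted_means = vals.mul(w, axis=0).sum() / vals.notna().mul(w, axis=0).sum()
print(weighted_means)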