I have created a data frame which has rolling quarter mapping using the code
import numpy as np
import pandas as pd

abcd = pd.DataFrame()
abcd['Month'] = pd.date_range(start='2020-04-01', end='2022-04-01', freq='MS')
abcd['Time_1'] = np.arange(1, abcd.shape[0] + 1)
abcd['Time_2'] = np.arange(0, abcd.shape[0])
abcd['Time_3'] = np.arange(-1, abcd.shape[0] - 1)
db_nd_ad_unpivot = pd.melt(abcd, id_vars=['Month'],
                           value_vars=['Time_1', 'Time_2', 'Time_3'],
                           var_name='Time_name', value_name='Time')
abcd_map = db_nd_ad_unpivot[(db_nd_ad_unpivot['Time'] > 0) & (db_nd_ad_unpivot['Time'] < abcd.shape[0] + 1)]
abcd_map = abcd_map[['Month', 'Time']].copy()
The output of the code looks like this:
Now, I have created an additional column name that gives me the name of the month and year in format Mon-YY using the code
abcd_map['Month'] = pd.to_datetime(abcd_map.Month)
# abcd_map['Month'] = abcd_map['Month'].astype(str)
abcd_map['Time_Period'] = abcd_map['Month'].apply(lambda x: x.strftime("%b'%y"))
Now I want to see, for a specific time, the minimum and maximum of the Month column. For example, for time instance 17, a simple groupby gives:
Time Time_Period
17 Aug'21-Sep'21
The desired output is:
Time Time_Period
17 Aug'21-Oct'21
I think this happens because strftime converts the Month column to string/object type, so the min and max are computed on strings rather than on dates.
How about converting to string after finding the min and max?
New_df = abcd_map.groupby('Time')['Month'].agg(['min', 'max']).apply(lambda x: x.dt.strftime("%b'%y")).agg('-'.join, axis=1).reset_index(name='Time_Period')
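A minimal sketch of why the order of operations matters, using only the three months that belong to time instance 17. The variable names are illustrative, and it assumes an English locale so that `%b` yields `Aug`, `Sep`, `Oct`:

```python
import pandas as pd

# The three months mapped to Time == 17 in the question
months = pd.Series(pd.to_datetime(["2021-08-01", "2021-09-01", "2021-10-01"]))
labels = months.dt.strftime("%b'%y")

# Strings compare alphabetically, so "Sep'21" sorts after "Oct'21"
string_max = labels.max()

# Datetimes compare chronologically; format only after aggregating
date_max = months.max().strftime("%b'%y")

print(string_max, date_max)
```

Taking the maximum of the formatted labels picks the alphabetically last string, while taking the maximum of the datetimes and formatting afterwards picks the latest month.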
Aggregate on the datetime column Month and format the result afterwards. Taking the min and max of the already formatted strings compares them alphabetically rather than chronologically:
abcd_map['Month'] = pd.to_datetime(abcd_map['Month'])
df = abcd_map.groupby('Time').agg(
    sum_col=('Time', 'sum'),
    first_date=('Month', 'min'),
    last_date=('Month', 'max'),
).reset_index()
fmt = "%b'%y"
df['TimePeriod'] = df['first_date'].dt.strftime(fmt) + '-' + df['last_date'].dt.strftime(fmt)
df = df.drop(['first_date', 'last_date'], axis=1)
df
which returns
Time sum_col TimePeriod
0 1 3 Apr'20-Jun'20
1 2 6 May'20-Jul'20
2 3 9 Jun'20-Aug'20
3 4 12 Jul'20-Sep'20
4 5 15 Aug'20-Oct'20
5 6 18 Sep'20-Nov'20
6 7 21 Oct'20-Dec'20
7 8 24 Nov'20-Jan'21
8 9 27 Dec'20-Feb'21
9 10 30 Jan'21-Mar'21
10 11 33 Feb'21-Apr'21
11 12 36 Mar'21-May'21
12 13 39 Apr'21-Jun'21
13 14 42 May'21-Jul'21
14 15 45 Jun'21-Aug'21
15 16 48 Jul'21-Sep'21
16 17 51 Aug'21-Oct'21
17 18 54 Sep'21-Nov'21
18 19 57 Oct'21-Dec'21
19 20 60 Nov'21-Jan'22
20 21 63 Dec'21-Feb'22
21 22 66 Jan'22-Mar'22
22 23 69 Feb'22-Apr'22
23 24 48 Mar'22-Apr'22
24 25 25 Apr'22-Apr'22
I need to generate a list of data. The data is randomised based on a seed. As the list has potentially no limit to size, I am thinking of using pagination to send the data back to requester. The list has to be replicable with a given seed by requester.
Unlike getting data from a database, where I can specify an offset and the number of records to retrieve, the random list has to be generated each time. How do I avoid having to start from the beginning to get to the nth page? For example:
import numpy as np
np.random.seed(0)
for i in range(20):
    print(f'{i+1}\t=\t{np.random.randint(100)}')
1 = 44
2 = 47
3 = 64
4 = 67
5 = 67
6 = 9
7 = 83
8 = 21
9 = 36
10 = 87
11 = 70
12 = 88
13 = 88
14 = 12
15 = 58
16 = 65
17 = 39
18 = 87
19 = 46
20 = 88
If my page size is 10, how do I avoid generating items 1-10 again when I am generating items 11-20 for the second page?
Thanks.
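One possible approach, sketched under assumptions: instead of drawing one long sequence, derive an independent generator for every page from the pair (seed, page). NumPy's `default_rng` accepts a sequence of integers as its seed, so any page can be produced directly. Note that this is a swapped-in technique: it does not reproduce the `np.random.seed(0)` sequence shown above, only a list that is repeatable for a given seed. The function name `get_page` and its parameters are hypothetical.

```python
import numpy as np

def get_page(seed, page, page_size=10):
    """Return page `page` (0-based) of a reproducible pseudo-random list."""
    # Seeding with [seed, page] gives each page its own generator,
    # so earlier pages never have to be generated first.
    rng = np.random.default_rng([seed, page])
    return rng.integers(0, 100, size=page_size).tolist()

# Second page for seed 0, produced without touching the first page
second_page = get_page(seed=0, page=1)
print(second_page)
```

The cost of this design is that the page size becomes part of the contract: asking for the same seed with a different page size yields a different overall list.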
I want to create a dataframe in which each row's value is calculated from the value in the row above.
Currently I store the values in separate variables, collect them into a list, and push that list into the cf dataframe to calculate discounted cash flows.
Current reproducible code:
import math
import pandas as pd
#User input
cashflow = 3.6667
fcf_growth_for_first_5_years = 14/100
fcf_growth_for_last_5_years = 7/100
no_of_years = 10
t_g_r = 3.50/100 ##Terminal Growth Rate
discount_rate = 10/100
##fcf calculation for 10 years
future_cash_1_year = cashflow*(1+fcf_growth_for_first_5_years)
future_cash_2_year = future_cash_1_year*(1+fcf_growth_for_first_5_years)
future_cash_3_year = future_cash_2_year*(1+fcf_growth_for_first_5_years)
future_cash_4_year = future_cash_3_year*(1+fcf_growth_for_first_5_years)
future_cash_5_year = future_cash_4_year*(1+fcf_growth_for_first_5_years)
future_cash_6_year = future_cash_5_year*(1+fcf_growth_for_last_5_years)
future_cash_7_year = future_cash_6_year*(1+fcf_growth_for_last_5_years)
future_cash_8_year = future_cash_7_year*(1+fcf_growth_for_last_5_years)
future_cash_9_year = future_cash_8_year*(1+fcf_growth_for_last_5_years)
future_cash_10_year = future_cash_9_year*(1+fcf_growth_for_last_5_years)
fcf = []
fcf.extend(value for name, value in locals().items() if name.startswith('future_cash_'))
cf = pd.DataFrame()
cf.insert(0, 'Sr_No', range(1,11))
cf.insert(1, 'Year', range(23,33))
cf['fcf'] = fcf
cf
Desired output:
I get the desired output with the list-based code above, but I am looking for a more efficient way to calculate the values with a pandas DataFrame instead of a list and separate variables.
Sr_No Year fcf
0 1 23 4.180038
1 2 24 4.765243
2 3 25 5.432377
3 4 26 6.192910
4 5 27 7.059918
5 6 28 7.554112
6 7 29 8.082900
7 8 30 8.648703
8 9 31 9.254112
9 10 32 9.901900
Using a for loop makes this much easier to handle:
import math
import pandas as pd
#User input
cashflow = 3.6667
fcf_growth_for_first_5_years = 14/100
fcf_growth_for_last_5_years = 7/100
no_of_years = 10
t_g_r = 3.50/100 ##Terminal Growth Rate
discount_rate = 10/100
cf = pd.DataFrame()
cf.insert(0, 'Sr_No', range(1,11))
cf.insert(1, 'Year', range(23,33))
##fcf calculation for 10 years
fcf = []
for row in range(len(cf)):
    if cf.Sr_No[row] == 1:
        fcf.append(cashflow * (1 + fcf_growth_for_first_5_years))
    elif cf.Sr_No[row] < 6:
        fcf.append(fcf[row - 1] * (1 + fcf_growth_for_first_5_years))
    else:
        fcf.append(fcf[row - 1] * (1 + fcf_growth_for_last_5_years))
cf['fcf'] = fcf
cf
... I am looking for more efficient way to calculate values using pandas df ...
You could use .cumprod():
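import pandas as pd

cashflow = 3.6667
fcf_growth_for_first_5_years = 14 / 100
fcf_growth_for_last_5_years = 7 / 100
no_of_years = 10

df = pd.DataFrame({
    "Sr_No": range(1, 1 + no_of_years), "Year": range(23, 23 + no_of_years)
})
# Growth factor per row: 1.14 for the first half, 1.07 for the rest
df.loc[df.index[:no_of_years // 2], "fcf"] = 1 + fcf_growth_for_first_5_years
# The cumulative product compounds the factors; scale by the starting cash flow
df["fcf"] = df["fcf"].fillna(1 + fcf_growth_for_last_5_years).cumprod() * cashflow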
Result:
Sr_No Year fcf
0 1 23 4.180038
1 2 24 4.765243
2 3 25 5.432377
3 4 26 6.192910
4 5 27 7.059918
5 6 28 7.554112
6 7 29 8.082900
7 8 30 8.648703
8 9 31 9.254112
9 10 32 9.901900
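Because each year only compounds a fixed growth factor, the same column can also be written in closed form: year n is the starting cash flow times 1.14 raised to min(n, 5) times 1.07 raised to max(n - 5, 0). The following is a sketch of that idea with NumPy, not part of the answers above; the short names `g_first`, `g_last` and `n` are illustrative.

```python
import numpy as np
import pandas as pd

cashflow = 3.6667
g_first, g_last = 14 / 100, 7 / 100   # growth for years 1-5 and 6-10
no_of_years = 10

n = np.arange(1, no_of_years + 1)
# Years 1-5 compound g_first n times; later years add (n - 5) steps of g_last
fcf = (cashflow
       * (1 + g_first) ** np.minimum(n, 5)
       * (1 + g_last) ** np.maximum(n - 5, 0))

cf = pd.DataFrame({"Sr_No": n, "Year": n + 22, "fcf": fcf})
print(cf)
```

This avoids both the loop and the intermediate NaN column, at the price of hard-coding the five-year switch point in the exponents.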
It is possibly done with regular expressions, which I am not very strong at.
My dataframe is like this:
import pandas as pd
import re
data = {'postcode': ['DG14','EC3M','BN45','M2','WC2A','W1C','PE35'], 'total':[44, 54,56, 78,87,35,36]}
df = pd.DataFrame(data)
df
postcode total
0 DG14 44
1 EC3M 54
2 BN45 56
3 M2 78
4 WC2A 87
5 W1C 35
6 PE35 36
I want to get these strings in my column with the last letter stripped like so:
postcode total
0 DG14 44
1 EC3 54
2 BN45 56
3 M2 78
4 WC2 87
5 W1 35
6 PE35 36
Probably something using re.sub('', '\D')?
Thank you.
You could use str.replace here:
df["postcode"] = df["postcode"].str.replace(r'[A-Za-z]$', '', regex=True)
One of the approaches:
import pandas as pd
import re
data = {'postcode': ['DG14','EC3M','BN45','M2','WC2A','W1C','PE35'], 'total':[44, 54,56, 78,87,35,36]}
data['postcode'] = [re.sub(r'[a-zA-Z]$', '', item) for item in data['postcode']]
df = pd.DataFrame(data)
print(df)
Output:
postcode total
0 DG14 44
1 EC3 54
2 BN45 56
3 M2 78
4 WC2 87
5 W1 35
6 PE35 36
I have a dataframe called df_location:
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value':[60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(locations)
I have another dataframe called df_islands:
islands = {'island_id':[10,20,30,40,50,60],
'list_of_locations':[[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is to search the list_of_locations for each unique location and merge it to df_location in a way where each island_id will correspond to a specific location.
Final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
'temperature_value': [20,21,22,23,24,25,26,27,28,29],
'humidity_value':[60,61,62,63,64,65,66,67,68,69],
'island_id':[10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in python to do so. I would really appreciate it if someone can give me a solution to this problem.
The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join method using the location_id as the key.
import pandas as pd

locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
# One row per (island_id, location) pair instead of one list per island
df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
# Joins on the shared column name, location_id
pd.merge(df_locations, df_islands)
Out[]:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
Series.apply() also works here. It's a bit long-winded, but it works:
df_location['island_id'] = df_location['location_id'].apply(
    lambda x: [
        df_islands['island_id'][i]
        for i in df_islands.index
        if x in df_islands['list_of_locations'][i]
        # comment above line and use this instead if list is stored in a string
        # if x in eval(df_islands['list_of_locations'][i])
    ][0]
)
First we select the value we want to return when the if condition is True: df_islands['island_id'][i]
Then we loop over each row of df_islands by iterating over df_islands.index
Then the if condition checks, for each row, whether the current df_location['location_id'] value is in that row's list in df_islands['list_of_locations'].
Finally, because the whole expression is a list comprehension, the result is a list. We know it contains exactly one value, so we take it with [0] at the end.
I hope this helps and happy for other editors to make the answer more legible!
print(df_location)
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
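The row-by-row lookup described above scans every island for every location. A possible alternative, sketched here with the data from the question, is to build the location-to-island lookup once as a dictionary and apply it with Series.map(). The name `location_to_island` is illustrative, and the sketch assumes each location belongs to exactly one island.

```python
import pandas as pd

df_location = pd.DataFrame({
    'location_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'temperature_value': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
    'humidity_value': [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
})
df_islands = pd.DataFrame({
    'island_id': [10, 20, 30, 40, 50, 60],
    'list_of_locations': [[1], [2, 3], [4, 5], [6, 7, 8], [9], [10]],
})

# Flatten the lists into a {location_id: island_id} lookup, built once
location_to_island = {
    loc: island
    for island, locs in zip(df_islands['island_id'], df_islands['list_of_locations'])
    for loc in locs
}

# map() looks up each location_id in the dictionary
df_location['island_id'] = df_location['location_id'].map(location_to_island)
print(df_location)
```

A location that is missing from every list would come back as NaN here rather than raising an IndexError, which may or may not be the behaviour you want.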