Set a column to one date format in pandas (Python)

I am filtering out records for last month's data. However, when doing
emp_df = emp_df[emp_df['Date'].dt.month == (currentMonth-1)]
it misses some records (it treats some records' months as days). Link to File
from datetime import datetime, date
import pandas as pd
import numpy as np
cholareport = pd.read_excel("D:/Automations/HealthCheck and Audit Trail/report.xlsx")
uniqueemp = set(cholareport['Email'])
cholareport['Date'] = pd.to_datetime(cholareport['Date'])
uniqueemp = set(cholareport['Email'])
daystoignore = ['Holiday_COE', 'Leave_COE']
# datedfforemp = pd.DataFrame(columns=uniqueemp)
cholareport['Date'] = cholareport['Date'].apply(lambda x:
pd.to_datetime(x).strftime('%d/%m/%Y'))
cholareport["Date"] = pd.to_datetime(cholareport["Date"], utc=True)
for emp in uniqueemp:
    emp_df = cholareport[cholareport['Email'].isin([emp])]
    emp_df = emp_df[~emp_df['Task: Task Name'].isin(daystoignore)]
    # s1 = pd.to_datetime(emp_df['Date']).dt.strftime('%Y-%m')
    # s2 = (pd.to_datetime('today').strftime('%Y-%m') - pd.DateOffset(months=1)).strftime('%Y-%m')
    # emp_df = emp_df[s1 == s2]
    currentMonth = datetime.now().month
    # print(currentMonth)
    # print(emp_df['Date'])
    emp_df['Date'] = pd.to_datetime(emp_df['Date']).dt.strftime("%dd-%mm-%YYYY")
    format_data = "%dd-%mm-%YYYY"
    empdfdate = []
    for i in emp_df['Date']:
        empdfdate.append(datetime.strptime(i, format_data))
    print(empdfdate)
    emp_df['Date'] = empdfdate
    for i in emp_df['Date']:
        print(i.month, i.day)
    # emp_df['Date'] = pd.to_datetime(emp_df['Date']).dt.strftime('%Y-%m')
    emp_df = emp_df[emp_df['Date'].dt.month == (currentMonth-1)]
    for i in emp_df['Date']:
        print(i.month, i.day)
Results:
6 10
7 10
10 10
11 10
12 10
10 13
10 14
Expected:
6 10
7 10
10 10
11 10
12 10
13 10
14 10

I am not entirely sure what you want to accomplish. If I understand correctly, you simply want to count the number of entries per day for the past month. In that case, you can simply do the following.
from datetime import datetime
import pandas as pd
report = pd.read_excel('report.xlsx')
print('day: counts', report.Date[report.Date.dt.month == datetime.now().month - 1].dt.day.value_counts(), sep='\n')
I do not get your expected results. It might be that you also want to filter by email somehow; however, I cannot understand from your code what it is that you want to do.
Output:
day: counts
3 101
5 101
6 101
7 101
4 101
24 84
28 84
27 84
26 84
25 84
10 82
11 82
12 82
13 82
14 82
17 67
21 67
20 67
19 67
18 67
31 2
Name: Date, dtype: int64
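A likely cause of the month/day swap described in the question: after formatting the dates as '%d/%m/%Y' strings, pd.to_datetime parses ambiguous values month-first unless dayfirst=True is passed. A minimal sketch with made-up dates (not the original file):

```python
import pandas as pd

# Day-first strings like the ones produced with strftime('%d/%m/%Y')
dates = pd.Series(['10/06/2022', '10/07/2022', '01/12/2022'])

wrong = pd.to_datetime(dates)                 # ambiguous dates read month-first
right = pd.to_datetime(dates, dayfirst=True)  # read day-first, as intended

print(list(wrong.dt.month))  # [10, 10, 1]
print(list(right.dt.month))  # [6, 7, 12]
```

Better still, skip the round-trip through strings entirely and keep the column as datetime64 after the first pd.to_datetime call.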

Related

How do I convert date (YYYY-MM-DD) to Month-YY and groupby on some other column to get minimum and maximum month?

I have created a data frame with a rolling-quarter mapping using the code below:
abcd = pd.DataFrame()
abcd['Month'] = np.nan
abcd['Month'] = pd.date_range(start='2020-04-01', end='2022-04-01', freq = 'MS')
abcd['Time_1'] = np.arange(1, abcd.shape[0]+1)
abcd['Time_2'] = np.arange(0, abcd.shape[0])
abcd['Time_3'] = np.arange(-1, abcd.shape[0]-1)
db_nd_ad_unpivot = pd.melt(abcd, id_vars=['Month'],
                           value_vars=['Time_1', 'Time_2', 'Time_3'],
                           var_name='Time_name', value_name='Time')
abcd_map = db_nd_ad_unpivot[(db_nd_ad_unpivot['Time']>0)&(db_nd_ad_unpivot['Time']< abcd.shape[0]+1)]
abcd_map = abcd_map[['Month','Time']]
The output is a table mapping each Month to its Time values (shown as an image in the original post).
Now, I have created an additional column that gives the name of the month and year in the format Mon-YY, using the code:
abcd_map['Month'] = pd.to_datetime(abcd_map.Month)
# abcd_map['Month'] = abcd_map['Month'].astype(str)
abcd_map['Time_Period'] = abcd_map['Month'].apply(lambda x: x.strftime("%b'%y"))
Now I want to see, for a specific Time, the minimum and maximum in the Month column. For example, for time instance 17, a simple groupby gives:
Time  Period
17    Aug'21-Sep'21
The desired output is:
Time  Time_Period
17    Aug'21-Oct'21
I think this is because min and max are taken on the Month column after the strftime call has converted it to string/object type.
How about converting to string after finding the min and max:
New_df = abcd_map.groupby('Time')['Month'].agg(['min', 'max']).apply(lambda x: x.dt.strftime("%b'%y")).agg('-'.join, axis=1).reset_index()
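The reason this answer aggregates before formatting: month abbreviations do not sort chronologically as strings, so taking max on the formatted column can return the wrong month. A small illustration with three hypothetical months:

```python
import pandas as pd

months = pd.Series(pd.to_datetime(['2021-08-01', '2021-09-01', '2021-10-01']))
labels = months.dt.strftime("%b'%y")  # Aug'21, Sep'21, Oct'21

print(labels.max())                    # Sep'21 -- alphabetical, not chronological
print(months.max().strftime("%b'%y"))  # Oct'21 -- correct
```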
Do this:
abcd_map['Time_Period'] = abcd_map['Month'].apply(lambda x: x.strftime("%b'%y"))
df = abcd_map.groupby(['Time']).agg(
    sum_col=('Time', np.sum),
    first_date=('Time_Period', np.min),
    last_date=('Time_Period', np.max)
).reset_index()
df['TimePeriod'] = df['first_date'] + '-' + df['last_date']
df = df.drop(['first_date', 'last_date'], axis=1)
df
which returns
Time sum_col TimePeriod
0 1 3 Apr'20-May'20
1 2 6 Jul'20-May'20
2 3 9 Aug'20-Jun'20
3 4 12 Aug'20-Sep'20
4 5 15 Aug'20-Sep'20
5 6 18 Nov'20-Sep'20
6 7 21 Dec'20-Oct'20
7 8 24 Dec'20-Nov'20
8 9 27 Dec'20-Jan'21
9 10 30 Feb'21-Mar'21
10 11 33 Apr'21-Mar'21
11 12 36 Apr'21-May'21
12 13 39 Apr'21-May'21
13 14 42 Jul'21-May'21
14 15 45 Aug'21-Jun'21
15 16 48 Aug'21-Sep'21
16 17 51 Aug'21-Sep'21
17 18 54 Nov'21-Sep'21
18 19 57 Dec'21-Oct'21
19 20 60 Dec'21-Nov'21
20 21 63 Dec'21-Jan'22
21 22 66 Feb'22-Mar'22
22 23 69 Apr'22-Mar'22
23 24 48 Apr'22-Mar'22
24 25 25 Apr'22-Apr'22

Pagination for Seeded Random List

I need to generate a list of data, randomised based on a seed. As the list potentially has no size limit, I am thinking of using pagination to send the data back to the requester. The list has to be replicable by the requester from a given seed.
Unlike getting data from a database, where I can specify an offset and the number of records to retrieve, the random list needs to be created each time. How do I avoid having to start from the beginning to get to the nth page? e.g.
import numpy as np
np.random.seed(0)
for i in range(20):
    print(f'{i+1}\t=\t{np.random.randint(100)}')
1 = 44
2 = 47
3 = 64
4 = 67
5 = 67
6 = 9
7 = 83
8 = 21
9 = 36
10 = 87
11 = 70
12 = 88
13 = 88
14 = 12
15 = 58
16 = 65
17 = 39
18 = 87
19 = 46
20 = 88
If my page size is 10, how do I avoid generating items 1-10 on the way to generating 11-20 for the 2nd page?
Thanks.
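No answer was posted in the thread; one common approach (a sketch, not from the original post) is to derive a fresh, deterministic generator per page by seeding with the (base seed, page number) pair, so any page can be produced without replaying earlier ones:

```python
import numpy as np

def get_page(seed, page_no, page_size=10):
    # Seed a fresh generator with the (base seed, page number) pair so that
    # any page can be reproduced directly, without replaying earlier pages.
    rng = np.random.default_rng([seed, page_no])
    return rng.integers(0, 100, size=page_size).tolist()

# The second page is identical whether or not page 1 was ever generated.
print(get_page(seed=0, page_no=2))
```

Note this yields a different sequence from the legacy np.random.seed example above, since it uses the newer Generator API. If you must jump ahead within one long stream instead, bit generators such as np.random.PCG64 expose an advance() method, but bounded-integer draws can consume a variable number of raw values (rejection sampling), so per-page seeding is the simpler contract.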

Use the above row to calculate the value for the below row iteratively using a pandas dataframe

I want to create a dataframe by reusing the above row to calculate the value of the row below.
Currently I am using variables to store values, building a list, and pushing the list into the cf dataframe to calculate discounted cash flows.
Current reproducible code:
import math
import pandas as pd
#User input
cashflow = 3.6667
fcf_growth_for_first_5_years = 14/100
fcf_growth_for_last_5_years = 7/100
no_of_years = 10
t_g_r = 3.50/100 ##Terminal Growth Rate
discount_rate = 10/100
##fcf calculation for 10 Years
future_cash_1_year = cashflow*(1+fcf_growth_for_first_5_years)
future_cash_2_year = future_cash_1_year*(1+fcf_growth_for_first_5_years)
future_cash_3_year = future_cash_2_year*(1+fcf_growth_for_first_5_years)
future_cash_4_year = future_cash_3_year*(1+fcf_growth_for_first_5_years)
future_cash_5_year = future_cash_4_year*(1+fcf_growth_for_first_5_years)
future_cash_6_year = future_cash_5_year*(1+fcf_growth_for_last_5_years)
future_cash_7_year = future_cash_6_year*(1+fcf_growth_for_last_5_years)
future_cash_8_year = future_cash_7_year*(1+fcf_growth_for_last_5_years)
future_cash_9_year = future_cash_8_year*(1+fcf_growth_for_last_5_years)
future_cash_10_year = future_cash_9_year*(1+fcf_growth_for_last_5_years)
fcf = []
fcf.extend(value for name, value in locals().items() if name.startswith('future_cash_'))
cf = pd.DataFrame()
cf.insert(0, 'Sr_No', range(1,11))
cf.insert(1, 'Year', range(23,33))
cf['fcf'] = fcf
cf
Desired Output-
I am getting the desired output with the list-based code above, but I am looking for a more efficient way to calculate the values using a pandas df instead of lists and variables.
Sr_No Year fcf
0 1 23 4.180038
1 2 24 4.765243
2 3 25 5.432377
3 4 26 6.192910
4 5 27 7.059918
5 6 28 7.554112
6 7 29 8.082900
7 8 30 8.648703
8 9 31 9.254112
9 10 32 9.901900
Using a for loop makes this much easier to handle:
import math
import pandas as pd
#User input
cashflow = 3.6667
fcf_growth_for_first_5_years = 14/100
fcf_growth_for_last_5_years = 7/100
no_of_years = 10
t_g_r = 3.50/100 ##Terminal Growth Rate
discount_rate = 10/100
cf = pd.DataFrame()
cf.insert(0, 'Sr_No', range(1,11))
cf.insert(1, 'Year', range(23,33))
##fcf calculation for 10 Years
fcf = []
for row in range(len(cf)):
    if cf.Sr_No[row] == 1:
        fcf.append(cashflow*(1+fcf_growth_for_first_5_years))
    elif cf.Sr_No[row] < 6:
        fcf.append(fcf[row-1]*(1+fcf_growth_for_first_5_years))
    else:
        fcf.append(fcf[row-1]*(1+fcf_growth_for_last_5_years))
cf['fcf'] = fcf
cf
... I am looking for more efficient way to calculate values using pandas df ...
You could use .cumprod():
import pandas as pd

cashflow = 3.6667
fcf_growth_for_first_5_years = 14 / 100
fcf_growth_for_last_5_years = 7 / 100
no_of_years = 10
df = pd.DataFrame({
    "Sr_No": range(1, 1 + no_of_years), "Year": range(23, 23 + no_of_years)
})
df.loc[df.index[:no_of_years // 2], "fcf"] = 1 + fcf_growth_for_first_5_years
df["fcf"] = df["fcf"].fillna(1 + fcf_growth_for_last_5_years).cumprod() * cashflow
Result:
Sr_No Year fcf
0 1 23 4.180038
1 2 24 4.765243
2 3 25 5.432377
3 4 26 6.192910
4 5 27 7.059918
5 6 28 7.554112
6 7 29 8.082900
7 8 30 8.648703
8 9 31 9.254112
9 10 32 9.901900
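The same recurrence can also be written directly in NumPy: build the per-year growth factors up front and take a cumulative product (a sketch using the question's inputs, with hard-coded 1.14/1.07 factors):

```python
import numpy as np
import pandas as pd

cashflow = 3.6667
# Growth factors: five years at 14%, then five years at 7%.
factors = np.concatenate([np.full(5, 1.14), np.full(5, 1.07)])
fcf = cashflow * np.cumprod(factors)  # each year compounds on the previous one

cf = pd.DataFrame({"Sr_No": range(1, 11), "Year": range(23, 33), "fcf": fcf})
print(cf)
```

This reproduces the table above (4.180038 in year 23 through 9.901900 in year 32) without any explicit loop.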

Strip the last character from a string if it is a letter in python dataframe

It is possibly done with regular expressions, which I am not very strong at.
My dataframe is like this:
import pandas as pd
import regex as re
data = {'postcode': ['DG14','EC3M','BN45','M2','WC2A','W1C','PE35'], 'total':[44, 54,56, 78,87,35,36]}
df = pd.DataFrame(data)
df
postcode total
0 DG14 44
1 EC3M 54
2 BN45 56
3 M2 78
4 WC2A 87
5 W1C 35
6 PE35 36
I want to get these strings in my column with the last letter stripped like so:
postcode total
0 DG14 44
1 EC3 54
2 BN45 56
3 M2 78
4 WC2 87
5 W1 35
6 PE35 36
Probably something using re.sub('', '\D')?
Thank you.
You could use str.replace here:
df["postcode"] = df["postcode"].str.replace(r'[A-Za-z]$', '', regex=True)
Note that regex=True is needed in recent pandas versions, where str.replace defaults to literal replacement.
One of the approaches:
import pandas as pd
import re
data = {'postcode': ['DG14','EC3M','BN45','M2','WC2A','W1C','PE35'], 'total':[44, 54,56, 78,87,35,36]}
data['postcode'] = [re.sub(r'[a-zA-Z]$', '', item) for item in data['postcode']]
df = pd.DataFrame(data)
print(df)
Output:
postcode total
0 DG14 44
1 EC3 54
2 BN45 56
3 M2 78
4 WC2 87
5 W1 35
6 PE35 36

How to merge two columns of a dataframe based on values from a column in another dataframe?

I have a dataframe called df_location:
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_location = pd.DataFrame(locations)
I have another dataframe called df_islands:
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
Each island_id corresponds to one or more locations. As you can see, the locations are stored in a list.
What I'm trying to do is to search the list_of_locations for each unique location and merge it to df_location in a way where each island_id will correspond to a specific location.
Final dataframe should be the following:
merged = {'location_id': [1,2,3,4,5,6,7,8,9,10],
          'temperature_value': [20,21,22,23,24,25,26,27,28,29],
          'humidity_value': [60,61,62,63,64,65,66,67,68,69],
          'island_id': [10,20,20,30,30,40,40,40,50,60]}
df_merged = pd.DataFrame(merged)
I don't know whether there is a method or function in python to do so. I would really appreciate it if someone can give me a solution to this problem.
The pandas method you're looking for to expand your df_islands dataframe is .explode(column_name). From there, rename your column to location_id and then join the dataframes using pd.merge(). It'll perform a SQL-like join method using the location_id as the key.
import pandas as pd
locations = {'location_id': [1,2,3,4,5,6,7,8,9,10],
             'temperature_value': [20,21,22,23,24,25,26,27,28,29],
             'humidity_value': [60,61,62,63,64,65,66,67,68,69]}
df_locations = pd.DataFrame(locations)
islands = {'island_id': [10,20,30,40,50,60],
           'list_of_locations': [[1],[2,3],[4,5],[6,7,8],[9],[10]]}
df_islands = pd.DataFrame(islands)
df_islands = df_islands.explode(column='list_of_locations')
df_islands.columns = ['island_id', 'location_id']
pd.merge(df_locations, df_islands)
Out[]:
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
The df.apply() method works here. It's a bit long-winded but it works:
df_location['island_id'] = df_location['location_id'].apply(
lambda x: [
df_islands['island_id'][i] \
for i in df_islands.index \
if x in df_islands['list_of_locations'][i]
# comment above line and use this instead if list is stored in a string
# if x in eval(df_islands['list_of_locations'][i])
][0]
)
First we select the value we want when the condition is True: df_islands['island_id'][i].
Then we loop over each row of df_islands by using df_islands.index.
The if condition checks each list in df_islands['list_of_locations'] and keeps the island whose list contains the value of df_location['location_id'].
Finally, since the whole expression is wrapped in square brackets it is a list comprehension; we know the result holds exactly one value, so we take it with [0] at the end.
I hope this helps and happy for other editors to make the answer more legible!
print(df_location)
location_id temperature_value humidity_value island_id
0 1 20 60 10
1 2 21 61 20
2 3 22 62 20
3 4 23 63 30
4 5 24 64 30
5 6 25 65 40
6 7 26 66 40
7 8 27 67 40
8 9 28 68 50
9 10 29 69 60
