Organizing dates and holidays in a dataframe - python

Scenario: I have one dataframe with several columns of data, and another dataframe with lists of dates.
Example of dataframe1:
iterationcount datecolumn list
iteration5 1
iteration5 2
iteration3 2
iteration3 2
iteration4 33
iteration3 4
iteration1 5
iteration2 3
iteration5 2
iteration4 22
Example of dataframe2:
iteration1 01.01.2018 26.01.2018 30.03.2018
iteration2 01.01.2018 30.03.2018 02.04.2018 25.12.2018 26.12.2018
iteration3
iteration4 01.01.2018 15.01.2018 19.02.2018
iteration5 01.01.2018 19.02.2018 30.03.2018 21.05.2018 02.07.2018 06.08.2018 03.09.2018 08.10.2018 12.11.2018
The second dataframe holds the list of holidays for each iteration, and it will be used to fill the second column of the first dataframe.
Constraints: For each iteration of the first dataframe the user will select a month and year; the script will then find the first date of that month. If that date is on the list of dates in dataframe2 for that iteration, pick the next working date based on the program calendar.
Ex: The user selects January 2018, so the code returns 01/01/2018. For iteration1, that date is a holiday, so pick the next workday, in this case 02/01/2018, and then write this date to every row of dataframe1 belonging to that iteration:
iterationcount datecolumn list
iteration5 1
iteration5 2
iteration3 2
iteration3 2
iteration4 33
iteration3 4
iteration1 02/01/2018 5
iteration2 3
iteration5 2
iteration4 22
Then move to the next iteration (some iterations will have the same calendar dates).
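A hedged sketch of that lookup, assuming dataframe2 is first reshaped into a long table of (iterationcount, holiday) pairs (all names below are illustrative), using pandas' CustomBusinessDay offset to roll the first of the month forward past weekends and that iteration's holidays:
import pandas as pd
# Illustrative long-form version of dataframe2: one row per (iteration, holiday) pair
holidays_long = pd.DataFrame({
    'iterationcount': ['iteration1', 'iteration1', 'iteration1'],
    'holiday': pd.to_datetime(['2018-01-01', '2018-01-26', '2018-03-30']),
})
def first_working_day(year, month, iteration, holidays_long):
    # holidays belonging to this iteration only
    hols = holidays_long.loc[holidays_long['iterationcount'] == iteration, 'holiday']
    # business-day calendar that skips weekends and this iteration's holidays
    cbd = pd.offsets.CustomBusinessDay(holidays=list(hols))
    # rollforward returns the date itself if it is a working day, otherwise the next one
    return cbd.rollforward(pd.Timestamp(year=year, month=month, day=1))
print(first_working_day(2018, 1, 'iteration1', holidays_long))  # 2018-01-02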
Code: I have tried multiple approaches so far, but could not achieve the result. The closest I think I got was with:
import pandas as pd
import datetime
import os
from os import listdir
from os.path import isfile, join
import glob
## Get Adjustments
mypath3 = "//DGMS/Desktop/Uploader_v1.xlsm"
ApplyOnDates = pd.read_excel(open(mypath3, 'rb'), sheet_name='Holidays')
# Get content
mypath = "//DGMS/Desktop/Uploaded"
all_files = glob.glob(os.path.join(mypath, "*.xls*"))
contentdataframes = []
for f in all_files:
    df = pd.read_excel(f)
    df['Name'] = os.path.basename(f).split('.')[0].split('_')[0]
    df['ApplyOn'] = ''
    mask = df.columns.str.contains('Base|Last|Fixing|Cash')
    c2 = df.columns[~mask].tolist()
    df = df[c2]
    contentdataframes.append(df)
finalfinal = pd.concat(contentdataframes)
for row in finalfinal.itertuples():
    datedatedate = datetime.datetime(2018, 1, 1)
    # this matching step is the part I cannot get right: look up the holiday rows
    # for this iteration and check whether the date is one of them
    holidaymask = ApplyOnDates.Index.str.contains(row.Name)
    if datedatedate in ApplyOnDates.loc[holidaymask].values:
        datetouse = datedatedate + datetime.timedelta(days=1)
    else:
        datetouse = datedatedate
    finalfinal.loc[finalfinal.Name == row.Name, 'ApplyOn'] = datetouse
Question: Basically, my main trouble here is being able to match the rows in both dataframes and search the date in the column of the holidays dataframe. Is there a proper way to do this?
Obs: I was able to achieve a similar result directly in VBA by using Excel functions (VLOOKUP, MATCH, ...); the problem is that doing this in Excel for this amount of data basically crashes the file every time.

So you basically want to merge the columns of dataframe2 into dataframe1, right? Try merge:
newdf = dataframe1.merge(dataframe2, on='iterationcount', how='inner')
That should give you a new frame.
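If all you need from the merge is the adjusted date for each iteration, a simple map over iterationcount also works (this builds on the hedged first_working_day sketch above; the month and year are just examples):
dataframe1['datecolumn'] = dataframe1['iterationcount'].map(
    lambda it: first_working_day(2018, 1, it, holidays_long))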

Related

Poor performance filtering one dataframe with another

I have two dataframes: one holds unique records of episodic data, the other holds lists of events. There are multiple events per episode. I need to loop through the episode data, find all the events that correspond to each episode and write the matching events to a new dataframe. There are around 4,000 episodes and 20,000 events. The process is painfully slow because for each episode I search all 20,000 events. I am guessing there is a way to reduce the number of events searched in each loop by removing the matched ones, but I am not sure. This is my code (there is additional filtering to assist with matching):
for idx, row in episode_df.iterrows():
    total_episodes += 1
    icu_admission = datetime.strptime(row['ICU_ADM'], '%d/%m/%Y %H:%M:%S')
    tmp_df = event_df.loc[event_df['ur'] == row['HRN']]
    if len(tmp_df.index) < 1:
        empty_episodes += 1
        continue
    # Loop through temp dataframe and write all records with an admission date
    # close to icu_admission to new dataframe
    for idx_a, row_a in tmp_df.iterrows():
        admission = datetime.strptime(row_a['admission'], '%Y-%m-%d %H:%M:%S')
        difference = admission - icu_admission
        if abs(difference.total_seconds()) > 14400:
            continue
        new_df = new_df.append(row_a)
        selected_records += 1
A simplified version of the dataframes:
episode_df:
episode_no HRN name ICU_ADM
1 12345 joe date1
2 78124 ann date1
3 98374 bill date2
4 76523 lucy date3
event_df
episode_no ur admission
1 12345 date1
1 12345 date1
1 12345 date5
7 67899 date9
Not all episodes have events and only events with episodes need to be copied.
This could work:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['ICU_ADM'] = [pd.to_datetime(f'2020-01-{x}') for x in range(1,10)]
df1['test_day'] = df1['ICU_ADM'].dt.day
df2 = pd.DataFrame()
df2['admission'] = [pd.to_datetime(f'2020-01-{x}') for x in range(2,10,3)]
df2['admission_day'] = df2['admission'].dt.day
df2['random_val'] = np.random.rand(len(df2),1)
pd.merge_asof(df1, df2, left_on=['ICU_ADM'], right_on=['admission'], tolerance=pd.Timedelta('1 day'))
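Applied to the frames in the question, a hedged sketch could look like this (the column names HRN/ur, ICU_ADM/admission come from the question; the rest is an assumption). It matches each event to the nearest episode for the same patient within the 4-hour window, replacing the double loop:
episode_df['ICU_ADM'] = pd.to_datetime(episode_df['ICU_ADM'], format='%d/%m/%Y %H:%M:%S')
event_df['admission'] = pd.to_datetime(event_df['admission'], format='%Y-%m-%d %H:%M:%S')
matched = pd.merge_asof(
    event_df.sort_values('admission'),
    episode_df.sort_values('ICU_ADM'),
    left_on='admission', right_on='ICU_ADM',
    left_by='ur', right_by='HRN',
    tolerance=pd.Timedelta(seconds=14400),  # the same 4-hour window as in the loop
    direction='nearest')
# keep only the events that actually found an episode
new_df = matched.dropna(subset=['ICU_ADM'])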

concatenating and saving multiple pair of CSV in pandas

I am a beginner in Python. I have a hundred pairs of CSV files. The files look like this:
25_13oct_speed_0.csv
26_13oct_speed_0.csv
25_13oct_speed_0.1.csv
26_13oct_speed_0.1.csv
25_13oct_speed_0.2.csv
26_13oct_speed_0.2.csv
and others
I want to concatenate each pair of 25_ and 26_ files. Each pair of files has a speed threshold (speed_0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0), which is part of the file name. The files have the same data structure.
Mac Annotation X Y
A first 0 0
A last 0 0
B first 0 0
B last 0 0
Therefore, a simple concatenation is enough to join each pair of files. I use this method:
df1 = pd.read_csv('25_13oct_speed_0.csv')
df2 = pd.read_csv('26_13oct_speed_0.csv')
frames = [df1, df2]
result = pd.concat(frames)
for each pair of files, but it takes time and is not an elegant way. Is there a good way to combine the file pairs automatically and save them at the same time?
The idea is to create a DataFrame from the list of files and add 2 new columns with Series.str.split on the first _:
print (files)
['25_13oct_speed_0.csv', '26_13oct_speed_0.csv',
'25_13oct_speed_0.1.csv', '26_13oct_speed_0.1.csv',
'25_13oct_speed_0.2.csv', '26_13oct_speed_0.2.csv']
df1 = pd.DataFrame({'files': files})
df1[['g','names']] = df1['files'].str.split('_', n=1, expand=True)
print (df1)
files g names
0 25_13oct_speed_0.csv 25 13oct_speed_0.csv
1 26_13oct_speed_0.csv 26 13oct_speed_0.csv
2 25_13oct_speed_0.1.csv 25 13oct_speed_0.1.csv
3 26_13oct_speed_0.1.csv 26 13oct_speed_0.1.csv
4 25_13oct_speed_0.2.csv 25 13oct_speed_0.2.csv
5 26_13oct_speed_0.2.csv 26 13oct_speed_0.2.csv
Then group by the names column, loop over each group's rows with DataFrame.itertuples, create a new DataFrame with read_csv (if necessary, add a new column filled with the values from g), append to a list, concat, and finally save to a new file named from the names column:
for i, g in df1.groupby('names'):
    out = []
    for n in g.itertuples():
        df = pd.read_csv(n.files).assign(source=n.g)
        out.append(df)
    dfbig = pd.concat(out, ignore_index=True)
    print (dfbig)
    dfbig.to_csv(g['names'].iat[0])
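If the files list itself still needs to be built, a small glob sketch (assuming the CSVs sit in the working directory; the pattern is illustrative):
import glob
files = sorted(glob.glob('*_13oct_speed_*.csv'))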

New column based off certain input parameter to select what columns to use - Python

I have a pandas dataframe that includes multiple columns of monthly finance data. There is a period input that is specified by the person running the program. It's currently just saved as period, as shown below within the code.
#coded into python
period = ?? (user adds this in from input screen)
I need to create another column of data that uses the input period number to perform a calculation of other columns.
So, in the above table I'd like to create a new column 'calculation' that depends on the period input. For example, if a period of 1 is used, calc1 is performed (with the math actually done); period = 2, then calc2; period = 3, then calc3. I only need one column calculated for the given period number, but I added three examples in the picture below as an example of how it would work.
I can do this in SQL using CASE WHEN, summing whichever columns I need based on the input period:
select Account #,
'&Period' AS Period,
'&Year' AS YR,
case
When '&Period' = '1' then sum(d_cf+d_1)
when '&Period' = '2' then sum(d_cf+d_1+d_2)
when '&Period' = '3' then sum(d_cf+d_1+d_2+d_3)
I am unsure on how to do this easily in python (newer learner). Yes, I could create a column that does each calculation via new column for every possible period (1-12), and then only select that column but I'd like to learn and do it a more efficient way.
Can you help more or point me in a better direction?
You could certainly do something like
df[['d_cf'] + [f'd_{i}' for i in range(1, period+1)]].sum(axis=1)
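For example, with a toy frame whose column names mirror the SQL above (values are made up), period = 2 sums d_cf, d_1 and d_2 row-wise:
import pandas as pd
df = pd.DataFrame({'d_cf': [1, 2], 'd_1': [10, 20], 'd_2': [100, 200], 'd_3': [1000, 2000]})
period = 2
df['calculation'] = df[['d_cf'] + [f'd_{i}' for i in range(1, period + 1)]].sum(axis=1)
print(df['calculation'].tolist())  # [111, 222]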
You can do this using a simple function in python:
def get_calculation(df, period=None):
    '''
    df = pandas data frame
    period = integer type
    '''
    if period == 1:
        return df.apply(lambda x: x['d_0'] + x['d_1'], axis=1)
    if period == 2:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'], axis=1)
    if period == 3:
        return df.apply(lambda x: x['d_0'] + x['d_1'] + x['d_2'] + x['d_3'], axis=1)
new_df = get_calculation(df, period = 1)
Setup:
df = pd.DataFrame({'d_0':list(range(1,7)),
'd_1': list(range(10,70,10)),
'd_2':list(range(100,700,100)),
'd_3': list(range(1000,7000,1000))})
Setup:
import pandas as pd
ddict = {
'Year':['2018','2018','2018','2018','2018',],
'Account_Num':['1111','1122','1133','1144','1155'],
'd_cf':['1','2','3','4','5'],
}
data = pd.DataFrame(ddict)
Create value calculator:
def get_calcs(period):
    # Convert the period to a string
    s = str(period)
    # Convert to an integer and add one
    n = int(period) + 1
    # Repeat each digit of the period (period + 1) times
    return ''.join([i * n for i in s])
Main function copies data frame, iterates through period values, and sets calculated values to the correct spot index-wise for each relevant column:
def process_data(data_frame=data, period_column='d_cf'):
    # Copy data_frame argument
    df = data_frame.copy(deep=True)
    # Run through each value in our period column
    for i in df[period_column].values.tolist():
        # Create a temporary column
        new_column = 'd_{}'.format(i)
        # Pass the period into our calculator; capture the result
        calculated_value = get_calcs(i)
        # Create a new column based on our period number
        df[new_column] = ''
        # Use indexing to place the calculated value into our desired location
        df.loc[df[period_column] == i, new_column] = calculated_value
    # Return the result
    return df
Start:
Year Account_Num d_cf
0 2018 1111 1
1 2018 1122 2
2 2018 1133 3
3 2018 1144 4
4 2018 1155 5
Result:
process_data(data)
   Year Account_Num d_cf d_1  d_2   d_3    d_4     d_5
0  2018        1111    1  11
1  2018        1122    2       222
2  2018        1133    3            3333
3  2018        1144    4                  44444
4  2018        1155    5                         555555

Python Pandas: Count quarterly occurrence from start and end date range

I have a dataframe of jobs for different people, with start and end times for each job. I'd like to count, every four months, how many jobs each person is responsible for. I figured out a way to do it, but I'm sure it's tremendously inefficient (I'm new to pandas). It takes quite a while to compute when I run the code on my complete dataset (hundreds of persons and jobs).
Here is what I have so far.
#create a data frame
import pandas as pd
import numpy as np
df = pd.DataFrame({'job': pd.Categorical(['job1', 'job2', 'job3', 'job4']),
                   'person': pd.Categorical(['p1', 'p1', 'p2', 'p2']),
                   'start': ['2015-01-01', '2015-06-01', '2015-01-01', '2016-01-01'],
                   'end': ['2015-07-01', '2015-12-31', '2016-03-01', '2016-12-31']})
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
Which gives me
I then create a new dataset with
bdate = min(df['start'])
edate = max(df['end'])
dates = pd.date_range(bdate, edate, freq='4MS')
people = sorted(set(list(df['person'])))
df2 = pd.DataFrame(np.zeros((len(dates), len(people))), index=dates, columns=people)
for d in pd.date_range(bdate, edate, freq='MS'):
    for p in people:
        contagem = df[(df['person'] == p) &
                      (df['start'] <= d) &
                      (df['end'] >= d)]
        pos = np.argmin(np.abs(dates - d))
        df2.iloc[pos][p] = len(contagem.index)
df2
And I get
I'm sure there must be a better way of doing this without having to loop through all dates and persons. But how?
This answer assumes that each job-person combination is unique. It creates a series for every row with the value equal to the job and an index that expands the dates. Then it resamples every 4 months (which is not quarterly, but is what your solution describes) and counts the unique non-null occurrences.
def make_date_range(x):
    return pd.Series(index=pd.date_range(x.start.values[0], x.end.values[0], freq='M'),
                     data=x.job.values[0])

# Iterate through each job-person combo and make an entry for each month with the job as the value
df1 = df.groupby(['job', 'person']).apply(make_date_range).unstack('person')
# remove outer level from index
df1.index = df1.index.droplevel('job')
# resample every 4 months counting only unique values
df1.resample('4MS').agg(lambda x: len(x[x.notnull()].unique()))
Output
person p1 p2
2015-01-01 1 1
2015-05-01 2 1
2015-09-01 1 1
2016-01-01 0 2
2016-05-01 0 1
2016-09-01 0 1
And here is a long one-line solution that iterates over every row, creates a new dataframe per row, stacks them all together via pd.concat, and then resamples.
pd.concat([pd.DataFrame(index = pd.date_range(tup.start, tup.end, freq='4MS'),
data=[[tup.job]],
columns=[tup.person]) for tup in df.itertuples()])\
.resample('4MS').count()
And another one that is faster
df1 = pd.melt(df, id_vars=['job', 'person'], value_name='date').set_index('date')
g = df1.groupby([pd.Grouper(freq='4MS'), 'person'])['job']
g.agg('nunique').unstack('person', fill_value=0)

Python Pandas: Find index based on value in DataFrame

Is there a way to specify a DataFrame index (row) based on matching text inside the dataframe?
I am importing a text file from the internet (located here) into a Python pandas DataFrame every day. I am parsing out just some of the data and doing calculations to give me the peak value for each day. The specific group of data I need to gather starts with the section headed "RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW".
I only need part of the data for the calculations, and I am able to manually specify which index line to start with, but this number could change daily due to text added to the top of the file by the authors.
Updated as of: 05-05-2016 1700 Constrained operations ARE expected in
the AEP, APS, BC, COMED, DOM,and PS zones on 05-06-2016. Constrained
operations ARE expected in the AEP, APS, BC, COMED, DOM,and PS zones
on 05-07-2016. The PS/ConEd 600/400 MW contract will be limited to
700MW on 05-06-16.
Is there a way to match text in the pandas DataFrame and specify the index of that match? Currently I am manually specifying the index I want to start with using the variable 'day' below on the 6th line. I would like this variable to hold the index (row) of the dataframe that includes the text I want to match.
The code below works but may stop working if the line number (index) changes:
def forecastload():
    wb = load_workbook(filename='pjmactualload.xlsx')
    ws = wb['PJM Load']
    printRow = 13
    # put this in iteration to pull 2 rows of data at a time (one for each day) for 7 days max
    day = 239
    while day < 251:
        # pulls in first day only
        data = pd.read_csv("http://oasis.pjm.com/doc/projload.txt", skiprows=day,
                           delim_whitespace=True, header=None, nrows=2)
        # sets data at HE 24 = to data that is in HE 13 - so I can delete column 0 data to allow checking 'max'
        data.at[1, 13] = data.at[1, 1]
        # get date for printing it with max load later on
        newDate = str(data.at[0, 0])
        # now delete first column to get rid of date data. date already saved as newDate
        data = data.drop(0, 1)
        data = data.drop(1, 1)
        # pull out max value of day
        # add index to this for iteration ie dayMax[x] = data.values.max()
        dayMax = data.max().max()
        dayMin = data.min().min()
        # print date and max load for that date
        actualMax = "Forecast Max"
        actualMin = "Forecast Min"
        dayMax = int(dayMax)
        maxResults = [str(newDate), int(dayMax), actualMax, dayMin, actualMin]
        d = 1
        for items in maxResults:
            ws.cell(row=printRow, column=d).value = items
            d += 1
        printRow += 1
        # print maxResults
        # l.writerows(maxResults)
        day = day + 2
    wb.save('pjmactualload.xlsx')
In this case I recommend using the command line to obtain a dataset that you can later read with pandas and process however you want.
To retrieve the data you can use curl and grep:
$ curl -s http://oasis.pjm.com/doc/projload.txt | grep -A 17 "RTO COMBINED HOUR ENDING INTEGRATED FORECAST" | tail -n +5
05/06/16 am 68640 66576 65295 65170 66106 70770 77926 83048 84949 85756 86131 86089
pm 85418 85285 84579 83762 83562 83289 82451 82460 84009 82771 78420 73258
05/07/16 am 66809 63994 62420 61640 61848 63403 65736 68489 71850 74183 75403 75529
pm 75186 74613 74072 73950 74386 74978 75135 75585 77414 76451 72529 67957
05/08/16 am 63583 60903 59317 58492 58421 59378 60780 62971 66289 68997 70436 71212
pm 71774 71841 71635 71831 72605 73876 74619 75848 78338 77121 72665 67763
05/09/16 am 63865 61729 60669 60651 62175 66796 74620 79930 81978 83140 84307 84778
pm 85112 85562 85568 85484 85766 85924 85487 85737 87366 84987 78666 72166
05/10/16 am 67581 64686 62968 62364 63400 67603 75311 80515 82655 84252 86078 87120
pm 88021 88990 89311 89477 89752 89860 89256 89327 90469 87730 81220 74449
05/11/16 am 70367 67044 65125 64265 65054 69060 76424 81785 84646 87097 89541 91276
pm 92646 93906 94593 94970 95321 95073 93897 93162 93615 90974 84335 77172
05/12/16 am 71345 67840 65837 64892 65600 69547 76853 82077 84796 87053 89135 90527
pm 91495 92351 92583 92473 92541 92053 90818 90241 90750 88135 81816 75042
Let's use the previous output (saved in the rto.txt file) to obtain more readable data using awk and sed:
$ awk '/^ [0-9]/{d=$1;print $0;next}{print d,$0}' rto.txt | sed 's/^ //;s/\s\+/,/g'
05/06/16,am,68640,66576,65295,65170,66106,70770,77926,83048,84949,85756,86131,86089
05/06/16,pm,85418,85285,84579,83762,83562,83289,82451,82460,84009,82771,78420,73258
05/07/16,am,66809,63994,62420,61640,61848,63403,65736,68489,71850,74183,75403,75529
05/07/16,pm,75186,74613,74072,73950,74386,74978,75135,75585,77414,76451,72529,67957
05/08/16,am,63583,60903,59317,58492,58421,59378,60780,62971,66289,68997,70436,71212
05/08/16,pm,71774,71841,71635,71831,72605,73876,74619,75848,78338,77121,72665,67763
05/09/16,am,63865,61729,60669,60651,62175,66796,74620,79930,81978,83140,84307,84778
05/09/16,pm,85112,85562,85568,85484,85766,85924,85487,85737,87366,84987,78666,72166
05/10/16,am,67581,64686,62968,62364,63400,67603,75311,80515,82655,84252,86078,87120
05/10/16,pm,88021,88990,89311,89477,89752,89860,89256,89327,90469,87730,81220,74449
05/11/16,am,70367,67044,65125,64265,65054,69060,76424,81785,84646,87097,89541,91276
05/11/16,pm,92646,93906,94593,94970,95321,95073,93897,93162,93615,90974,84335,77172
05/12/16,am,71345,67840,65837,64892,65600,69547,76853,82077,84796,87053,89135,90527
05/12/16,pm,91495,92351,92583,92473,92541,92053,90818,90241,90750,88135,81816,75042
Now, read and reshape the above result (saved as rto2.txt) with pandas:
df = pd.read_csv("rto2.txt",names=["date","period"]+list(range(1,13)),index_col=[0,1])
df = df.stack().reset_index().rename(columns={"level_2":"hour",0:"value"})
df.index = pd.to_datetime(df.apply(lambda x: "{date} {hour}:00 {period}".format(**x),axis=1))
df.drop(["date", "hour", "period"], axis=1, inplace=True)
At this point you have a beautiful time series :)
In [10]: df.head()
Out[10]:
value
2016-05-06 01:00:00 68640
2016-05-06 02:00:00 66576
2016-05-06 03:00:00 65295
2016-05-06 04:00:00 65170
2016-05-06 05:00:00 66106
to obtain the statistics:
In[11]: df.groupby(df.index.date).agg([min,max])
Out[11]:
value
min max
2016-05-06 65170 86131
2016-05-07 61640 77414
2016-05-08 58421 78338
2016-05-09 60651 87366
2016-05-10 62364 90469
2016-05-11 64265 95321
2016-05-12 64892 92583
I hope this can help you.
Regards.
Here is how you can do what you are looking for. Sample code:
And the sample code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
df.loc[df['a'] < 0.5, 'a'] = 1
You can refer to this documentation
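To answer the text-matching part of the question directly, here is a hedged sketch (the heading string comes from the question; how the raw lines are loaded is up to you, the frame below is only illustrative): find the row index whose text contains the section heading, then use it to drive skiprows.
import pandas as pd
# illustrative stand-in for the raw file read as one column of text lines
lines = pd.DataFrame({'text': [
    'Updated as of: 05-05-2016 1700 ...',
    'RTO COMBINED HOUR ENDING INTEGRATED FORECAST LOAD MW',
    '05/06/16  am  68640  66576 ...',
]})
hits = lines.index[lines['text'].str.contains('RTO COMBINED HOUR ENDING', na=False)]
start = int(hits[0])  # row index of the heading; the data follows it
# data = pd.read_csv("http://oasis.pjm.com/doc/projload.txt", skiprows=start + 1,
#                    delim_whitespace=True, header=None, nrows=2)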
