How to compare differently transposed data in pandas or python

I am trying to compare or merge two different data sets using pandas.
The challenge I am facing is that the data is spread across rows in the first data set (Data1), while the other data set (Data2) has the same data spread across columns; below are the screenshots.
Screenshot 1: Data1
Screenshot 2: Data2
Also, I have attached the same Excel workbook here for your reference.
What I am trying to do is convert one of them to the other's format so the datasets match, and then perform the merge.
Note: Transpose is not helping me, since I need to do it per department, and a plain transpose puts everything, including the department, into rows or columns, whereas I only want to transpose the weekly data.
What is the best way to achieve this in Python?

One option to transform the second dataframe is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor

df = pd.read_excel('Test_Data_Set.xlsx', sheet_name=None)  # dict with one frame per sheet
df1 = df['Data1']
df2 = df['Data2']
# reshape the weekday columns into rows; the regex keeps the part of each
# column name before the last space as the new 'day_of_week' value
df3 = df2.pivot_longer(index=['code', 'name'], names_to='day_of_week', names_pattern=r'(.+)\s.+')
df1.merge(df3, on=['code', 'name', 'day_of_week'])
code name day_of_week start_time end_time value
0 test2 Test_Department2 Monday 900 1900 08:00 - 20:00
1 test2 Test_Department2 Tuesday 900 1900 08:00 - 20:00
2 test2 Test_Department2 Wednesday 900 1900 08:00 - 20:00
3 test2 Test_Department2 Thursday 900 1900 08:00 - 20:00
4 test2 Test_Department2 Friday 900 1900 10:00 - 19:00
5 test2 Test_Department2 Saturday 900 1900 10:00 - 19:00
6 test2 Test_Department2 Sunday 900 1900 12:00 - 17:00
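
If you would rather stay with plain pandas, the same reshape can be sketched with melt. This assumes, like the regex above, that each weekday column name in Data2 is the weekday followed by a space and some suffix:
# plain-pandas sketch of the pivot_longer step (column layout assumed as above)
df3 = df2.melt(id_vars=['code', 'name'], var_name='day_of_week')
# keep only the weekday part of each original column name
df3['day_of_week'] = df3['day_of_week'].str.extract(r'(.+)\s.+', expand=False)
df1.merge(df3, on=['code', 'name', 'day_of_week'])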

Related

Combine weekday with hours in Pandas

I have a data frame with a weekday column that contains the names of the weekdays and a time column that contains hours on those days. How can I combine these two columns so that the result is also sortable?
I have tried the plain string version, but it does not sort correctly by weekday and hour.
This is a sample of how the table looks:
weekday  time
Monday   12:00
Monday   13:00
Tuesday  20:00
Friday   10:00
This is what I want to get.
weekday_hours
Monday 12:00
Monday 13:00
Tuesday 20:00
Friday 10:00
Assuming that df is your initial dataframe:
import json
import pandas as pd

# round-trip through JSON records, then concatenate the two fields row by row
datas = json.loads(df.to_json(orient="records"))
final_data = {"weekday_hours": []}
for data in datas:
    final_data["weekday_hours"].append(data['weekday'] + ' ' + data['time'])
final_df = pd.DataFrame(final_data)
final_df
Output:
  weekday_hours
0  Monday 12:00
1  Monday 13:00
2 Tuesday 20:00
3  Friday 10:00
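For what it's worth, the JSON round-trip is not required; plain column concatenation builds the same frame (a minimal equivalent, not part of the original answer):
final_df = pd.DataFrame({'weekday_hours': df['weekday'] + ' ' + df['time']})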
You first need to create a datetime series covering 7 days at an hourly level to sort by. In a normal data-warehousing world you would have a calendar and a time dimension with all the different representations of your date data to merge and sort by; this is an adaptation of that methodology.
import pandas as pd
df1 = pd.DataFrame({'date' : pd.date_range('01 Jan 2021', '08 Jan 2021',freq='H')})
df1['str_date'] = df1['date'].dt.strftime('%A %H:%M')
print(df1.head(5))
date str_date
0 2021-01-01 00:00:00 Friday 00:00
1 2021-01-01 01:00:00 Friday 01:00
2 2021-01-01 02:00:00 Friday 02:00
3 2021-01-01 03:00:00 Friday 03:00
4 2021-01-01 04:00:00 Friday 04:00
Then create your column to merge on.
df['str_date'] = df['weekday'] + ' ' + df['time']
df2 = pd.merge(df[['str_date']], df1, on=['str_date'], how='left') \
        .sort_values('date').drop(columns='date')
print(df2)
str_date
3 Friday 10:00
0 Monday 12:00
1 Monday 13:00
2 Tuesday 20:00
Based on my understanding of the question, you want a single column, "weekday_hours," but you also want to be able to sort the data based on this column. This is a bit tricky because "Monday" doesn't provide enough information to define a valid datetime. Parsing with pd.to_datetime(df['weekday_hours'], format='%A %H:%M'), for example, returns 1900-01-01 <hour:minute:second> when given just a weekday and a time, so sorting effectively sorts by time only.
One workaround is to use dateutil to parse the dates. In lieu of a date, it will return the next date corresponding to the day of the week. For example, today (9 April 2021) dateutil.parser.parse('Friday 10:00') returns datetime.datetime(2021, 4, 9, 10, 0) and dateutil.parser.parse('Monday 10:00') returns datetime.datetime(2021, 4, 12, 10, 0). Therefore, we need to set the "default" date to something corresponding to our "first" day of the week. Here is an example starting with unsorted dates:
import datetime
import dateutil
import pandas as pd

weekdays = ['Friday', 'Monday', 'Monday', 'Tuesday']
times = ['10:00', '13:00', '12:00', '20:00']
df = pd.DataFrame({'weekday': weekdays, 'time': times})
df2 = pd.DataFrame()
df2['weekday_hours'] = df[['weekday', 'time']].agg(' '.join, axis=1)
amonday = datetime.datetime(2021, 2, 1, 0, 0)  # a Monday, assuming the week starts on Monday
# parse each 'Weekday HH:MM' string relative to that Monday so the sort is chronological
sorter = lambda t: [dateutil.parser.parse(ti, default=amonday) for ti in t]
print(df2.sort_values('weekday_hours', key=sorter))
Produces the output:
weekday_hours
2 Monday 12:00
1 Monday 13:00
3 Tuesday 20:00
0 Friday 10:00
Note there are probably more computationally efficient ways if you are working with a lot of data, but this should illustrate the idea of a sortable weekday/time pair.
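
One such option (a sketch, not part of the original answer) is to make the weekday an ordered categorical and sort on the pair before joining. This avoids date parsing entirely and works because zero-padded HH:MM strings sort correctly as plain strings:
import pandas as pd

order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['weekday'] = pd.Categorical(df['weekday'], categories=order, ordered=True)
out = df.sort_values(['weekday', 'time'])  # weekday order, then hour within each day
out['weekday_hours'] = out['weekday'].astype(str) + ' ' + out['time']
print(out[['weekday_hours']])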

Calculate the time difference between two hh:mm columns in a pandas dataframe

I am reading some data from a csv file where two of the columns hold times in hh:mm format. Here is an example:
Start End
11:15 15:00
22:30 2:00
In the above example, the End in the 2nd row happens on the next day. I am trying to get the time difference between these two columns in the most efficient way, as the dataset is huge. Is there a good pythonic way of doing this? Also, since there is no date and some End times fall on the next day, I get wrong results when I calculate the diff:
>>> import pandas as pd
>>> df = pd.read_csv(file_path)
>>> pd.to_datetime(df['End'])-pd.to_datetime(df['Start'])
0 0 days 03:45:00
1 0 days 03:00:00
2 -1 days +03:30:00
You can use the trick (a + x) % x with a timedelta x of 24 hours (or 1 day, which is the same):
the + timedelta(hours=24) makes all values become positive
the % timedelta(hours=24) wraps values of 24h or more back around
from datetime import timedelta

df['duration'] = (pd.to_datetime(df['End']) - pd.to_datetime(df['Start'])
                  + timedelta(hours=24)) % timedelta(hours=24)
Gives
Start End duration
0 11:15 15:00 0 days 03:45:00
1 22:30 2:00 0 days 03:30:00
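
Here is the same trick as a self-contained sketch using pd.Timedelta, with the sample data inlined:
import pandas as pd

df = pd.DataFrame({'Start': ['11:15', '22:30'], 'End': ['15:00', '2:00']})
day = pd.Timedelta(hours=24)
diff = pd.to_datetime(df['End']) - pd.to_datetime(df['Start'])
df['duration'] = (diff + day) % day  # wraps the negative (overnight) differences
print(df)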

Setting values in a time frame to zero

I have a df with DateTimeIndex (hourly readings) and light intensity.
Time Light
1/2/2017 18:00 31
1/2/2017 19:00 -5
1/2/2017 20:00 NA
......
......
2/2/2017 05:00 NA
2/2/2017 06:00 20
The issue is that after sunset (6 pm) until sunrise (6 am), the sensor doesn't work and has bad readings. I would like to set any readings in this period to 0.
You can create a mask with these conditions and set the value based on it.
hours = df.index.to_series().dt.hour  # hour of each reading
mask = (hours > 6) & (hours < 18)     # daytime: 07:00 through 17:00
df.loc[~mask, 'Light'] = 0            # zero everything outside that window
The DatetimeIndex is converted to a Series here so the .dt accessor is available; df.index.hour would also work directly on the index.
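
An equivalent selection (a sketch assuming the hourly DatetimeIndex from the question) uses DataFrame.between_time, which is inclusive on both ends, so '07:00' to '17:00' matches the hours-7-to-17 mask above:
# zero out everything except the 07:00-17:00 readings
daytime = df.between_time('07:00', '17:00').index
df.loc[~df.index.isin(daytime), 'Light'] = 0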

How to add a new column by searching for data in a Pandas time series dataframe

I have a Pandas time series dataframe.
It has minute data for a stock for 30 days.
I want to create a new column stating the price of the stock at noon for that day, e.g. for all lines for January 1, I want a new column with the price at noon on January 1; for all lines for January 2, a new column with the price at noon on January 2, etc.
Existing timeframe:
Date     Time   Last_Price  12amT
1/1/19   08:00  100         ?
1/1/19   08:01  101         ?
1/1/19   08:02  100.50      ?
...
31/1/19  21:00  106         ?
I used this hack, but it is very slow, and I assume there is a quicker and easier way to do this.
for lab, row in df.iterrows():
    t = row["Date"]
    df.loc[lab, "12amT"] = df[(df['Date'] == t) & (df['Time'] == "12:00")]["Last_Price"].values[0]
One way to do this is to use groupby with pd.Grouper:
For pandas 0.24.1+:
df.groupby(pd.Grouper(freq='D'))[0] \
  .transform(lambda x: x.loc[(x.index.hour == 12) &
                             (x.index.minute == 0)].to_numpy()[0])
For older pandas, use:
df.groupby(pd.Grouper(freq='D'))[0] \
  .transform(lambda x: x.loc[(x.index.hour == 12) &
                             (x.index.minute == 0)].values[0])
MVCE:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(48*60), index=pd.date_range('02-01-2019', periods=48*60, freq='T'))
df['12amT'] = df.groupby(pd.Grouper(freq='D'))[0].transform(lambda x: x.loc[(x.index.hour == 12) & (x.index.minute == 0)].to_numpy()[0])
Output (head):
0 12amT
2019-02-01 00:00:00 0 720
2019-02-01 00:01:00 1 720
2019-02-01 00:02:00 2 720
2019-02-01 00:03:00 3 720
2019-02-01 00:04:00 4 720
I'm not sure why you have two DateTime columns, so I made my own example to demonstrate:
import numpy as np
import pandas as pd

ind = pd.date_range('1/1/2019', '30/1/2019', freq='H')
df = pd.DataFrame({'Last_Price': np.random.random(len(ind)) + 100}, index=ind)

def noon_price(df):
    noon_price = df.loc[df.index.hour == 12, 'Last_Price'].values
    noon_price = noon_price[0] if len(noon_price) > 0 else np.nan
    df['noon_price'] = noon_price
    return df

df.groupby(df.index.day).apply(noon_price).reindex(ind)
The apply fills every row of a day with that day's noon price; the reindex restores the original chronological order.
To add a column with the next day's noon price, you can shift the column 24 rows down, like this:
df['T+1'] = df.noon_price.shift(-24)
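Bear in mind that shift(-24) assumes exactly 24 rows per day (hourly data with no gaps). A time-aware variant (a sketch) shifts by the frequency instead, so the alignment happens on the index:
# each row receives the noon_price from exactly one day later
df['T+1'] = df['noon_price'].shift(-1, freq='D')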

Pandas day for day

I have a lot of data in a Pandas dataframe:
Timestamp Value
2015-07-15 07:16:39.034 49.960
2015-07-15 07:16:39.036 49.940
......
2015-08-12 23:16:39.235 42.958
I have about 50 000 entries per day, and I would like to perform different operations on this data, day by day.
For example, if I would like to find the rolling mean, I would enter this:
df['rm5000'] = pd.rolling_mean(df['Value'], window=5000)
But that would give me the rolling mean across dates: the first rolling-mean datapoint on August 12th would include 4999 datapoints from August 11th. Instead, I would like to start over each day, so that the first 4999 datapoints of each day do not get a 5000-point rolling mean, since there might be a large jump between the last data of one date and the first data of the next day.
Do I have to slice the data into separate dataframes for each date for Pandas to do certain operations on the data for each separate date?
If you set the timestamps as the index, you can group by a TimeGrouper with a frequency code to partition the data by days, like below:
In [2]: df = pd.DataFrame({'Timestamp': pd.date_range('2015-07-15', '2015-07-18', freq='10min'),
   ...:                    'Value': np.linspace(49, 51, 433)})
In [3]: df = df.set_index('Timestamp')
In [4]: df.groupby(pd.TimeGrouper('D'))['Value'].apply(lambda x: pd.rolling_mean(x, window=15))
Out[4]:
Timestamp
2015-07-15 00:00:00 NaN
2015-07-15 00:10:00 NaN
.....
2015-07-15 23:30:00 49.620370
2015-07-15 23:40:00 49.625000
2015-07-15 23:50:00 49.629630
2015-07-16 00:00:00 NaN
2015-07-16 00:10:00 NaN
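
pd.TimeGrouper and pd.rolling_mean were removed in later pandas; on a current version the same per-day restart can be sketched with pd.Grouper and Series.rolling:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Timestamp': pd.date_range('2015-07-15', '2015-07-18', freq='10min'),
                   'Value': np.linspace(49, 51, 433)}).set_index('Timestamp')
# rolling mean that restarts at every day boundary
out = df.groupby(pd.Grouper(freq='D'))['Value'].transform(lambda x: x.rolling(window=15).mean())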
