How to group data with similar dates in pandas - python

I have two csv files. Both files contain Date, Stock, Open, High, Low, Close columns for a single day. I made one dataframe from these two files, so in this single dataframe the data of Stock 1 appears first, from the day's open to the day's close, followed by the data of Stock 2. The data is at a 15-minute interval; a day starts at 2019-01-01 09:15:00 and ends at 2019-01-01 15:15:00.
What I want is to create a dataframe where the row of stock1 at 2019-01-01 09:15:00 is followed by the row of stock2 at the same time, and so on for 2019-01-01 09:30:00, 2019-01-01 09:45:00, ...

New Answer:
After reading your response, I figured the best course of action for your issue would be to move your data into a two-level index using a pandas MultiIndex:
import numpy as np
import pandas as pd

# build a two-level (MultiIndex) row index from two parallel arrays
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df
Out[16]:
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061
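Applied to the stock data described in the question, a minimal sketch could look like the following (the file names and the Date/Stock column names are assumptions taken from the question, not part of the original answer):
import pandas as pd

# hypothetical file names; substitute your actual CSVs
stock1 = pd.read_csv("stock1.csv", parse_dates=["Date"])
stock2 = pd.read_csv("stock2.csv", parse_dates=["Date"])

combined = (
    pd.concat([stock1, stock2])
      .set_index(["Date", "Stock"])  # two-level index: timestamp first, then stock
      .sort_index()                  # interleaves the two stocks per 15-minute bar
)
print(combined.head())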
Old Answer
You can use the pandas concat method. If their index format matches, the Pandas API will take care of the rest.
import pandas as pd

idx = pd.date_range("2018-01-01", periods=5, freq="H")
ts = pd.DataFrame(range(len(idx)), index=idx)
| | 0 |
|:--------------------|----:|
| 2018-01-01 00:00:00 | 0 |
| 2018-01-01 01:00:00 | 1 |
| 2018-01-01 02:00:00 | 2 |
| 2018-01-01 03:00:00 | 3 |
| 2018-01-01 04:00:00 | 4 |
idy = pd.date_range("2018-01-02", periods=10, freq="H")
tsy = pd.DataFrame(range(len(idy)), index=idy)
| | 0 |
|:--------------------|----:|
| 2018-01-02 00:00:00 | 0 |
| 2018-01-02 01:00:00 | 1 |
| 2018-01-02 02:00:00 | 2 |
| 2018-01-02 03:00:00 | 3 |
| 2018-01-02 04:00:00 | 4 |
| 2018-01-02 05:00:00 | 5 |
| 2018-01-02 06:00:00 | 6 |
| 2018-01-02 07:00:00 | 7 |
| 2018-01-02 08:00:00 | 8 |
| 2018-01-02 09:00:00 | 9 |
Result:
pd.concat([ts, tsy])
| | 0 |
|:--------------------|----:|
| 2018-01-01 00:00:00 | 0 |
| 2018-01-01 01:00:00 | 1 |
| 2018-01-01 02:00:00 | 2 |
| 2018-01-01 03:00:00 | 3 |
| 2018-01-01 04:00:00 | 4 |
| 2018-01-02 00:00:00 | 0 |
| 2018-01-02 01:00:00 | 1 |
| 2018-01-02 02:00:00 | 2 |
| 2018-01-02 03:00:00 | 3 |
| 2018-01-02 04:00:00 | 4 |
| 2018-01-02 05:00:00 | 5 |
| 2018-01-02 06:00:00 | 6 |
| 2018-01-02 07:00:00 | 7 |
| 2018-01-02 08:00:00 | 8 |
| 2018-01-02 09:00:00 | 9 |
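For the interleaving described in the original question, you would additionally sort the concatenated frame by its datetime index so that rows sharing a timestamp end up next to each other; a one-line sketch using the ts/tsy frames above:
combined = pd.concat([ts, tsy]).sort_index(kind="stable")  # stable sort keeps ts rows before tsy rows at equal timestamps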

Related

How can I measure if there is overlap in begin time to end time within each group using Python?

How can I see if there is overlap between start time and/or end time for each group (by ID)? That is to say, whether two "services" occurred together for any length of time for one employee (ID). I have a table like the following, but would like to calculate the Overlap column.
| ID | Begin Time     | End Time       | Overlap |
|---:|:---------------|:---------------|:--------|
| 1  | 1/1/2023 13:30 | 1/1/2023 13:55 | False   |
| 1  | 1/7/2023 12:30 | 1/1/2023 13:45 | False   |
| 2  | 1/3/2023 15:30 | 1/3/2023 16:30 | True    |
| 1  | 1/5/2023 07:30 | 1/5/2023 08:30 | True    |
| 2  | 1/3/2023 14:55 | 1/3/2023 15:55 | True    |
| 1  | 1/5/2023 06:30 | 1/5/2023 09:30 | True    |
| 1  | 1/7/2023 06:30 | 1/7/2023 09:30 | True    |
| 1  | 1/7/2023 06:00 | 1/7/2023 06:45 | True    |
Here is a chunk of code that creates this dataframe:
import pandas as pd

id_list = [1, 1, 2, 1, 2, 1, 1, 1]
begin_time = ['1/1/2023 13:30', '1/7/2023 12:30', '1/3/2023 15:30', '1/5/2023 07:30', '1/3/2023 14:55',
              '1/5/2023 06:30', '1/7/2023 06:30', '1/7/2023 06:00']
end_time = ['1/1/2023 13:55', '1/1/2023 13:45', '1/3/2023 16:30', '1/5/2023 08:30', '1/3/2023 15:55',
            '1/5/2023 09:30', '1/7/2023 09:30', '1/7/2023 06:45']
df = pd.DataFrame(list(zip(id_list, begin_time, end_time)), columns=['ID', 'Begin_Time', 'End_Time'])
df['Begin_Time'] = pd.to_datetime(df['Begin_Time'])
df['End_Time'] = pd.to_datetime(df['End_Time'])
df
Use Interval.overlaps in a custom function, with enumerate to filter out the interval itself:
import numpy as np

def f(x):
    # one Interval per row of the group, closed on both ends
    i = pd.IntervalIndex.from_arrays(x['Begin_Time'],
                                     x['End_Time'],
                                     closed="both")
    a = np.arange(len(x))
    # for each interval, test overlap against every other interval in the group
    x['overlap'] = [i[a != j].overlaps(y).any() for j, y in enumerate(i)]
    return x

df = df.groupby('ID').apply(f)
print(df)
ID Begin_Time End_Time overlap
0 1 2023-01-01 13:30:00 2023-01-01 13:55:00 False
1 1 2023-01-08 12:30:00 2023-01-08 13:45:00 False <- data was changed
2 2 2023-01-03 15:30:00 2023-01-03 16:30:00 True
3 1 2023-01-05 07:30:00 2023-01-05 08:30:00 True
4 2 2023-01-03 14:55:00 2023-01-03 15:55:00 True
5 1 2023-01-05 06:30:00 2023-01-05 09:30:00 True
6 1 2023-01-07 06:30:00 2023-01-07 09:30:00 True
7 1 2023-01-07 06:00:00 2023-01-07 06:45:00 True
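If the IntervalIndex approach feels heavy, a broadcast-based sketch over each group gives the same closed-interval overlap test; this is an alternative, not part of the original answer:
import numpy as np

def mark_overlaps(g):
    s = g['Begin_Time'].to_numpy()
    e = g['End_Time'].to_numpy()
    # pairwise closed-interval test: start_i <= end_j and start_j <= end_i
    m = (s[:, None] <= e[None, :]) & (s[None, :] <= e[:, None])
    np.fill_diagonal(m, False)  # an interval does not overlap itself
    g['overlap'] = m.any(axis=1)
    return g

df = df.groupby('ID', group_keys=False).apply(mark_overlaps)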

How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occurred per user per day. Users may have 0 events on a day, and (since the table is an aggregate from a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week so that each user has one entry per week (including 0 if applicable).
Here is an example of my input:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
According to the docs of reindex, I should be able to pass level=1 as a kwarg directly, without the extra groupby. However, when I do this I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every level of my desired index occurs in my current index (and vice versa).
Reindexing with level= can only align against labels that already exist in the MultiIndex, so it drops combinations rather than creating the missing ones. You need to reindex by both levels of the MultiIndex, built explicitly with MultiIndex.from_product:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print(result)
print (result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0
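As a side note (not part of the original answer), the hardcoded desired_index could also be generated with date_range, assuming the weeks end on Sundays as with the "w" grouper above:
# weekly labels ("W" defaults to week ending on Sunday) covering the desired span
desired_index = pd.date_range("2022-01-02", "2022-01-23", freq="W")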

How can I set last rows of a dataframe based on condition in Python?

I have one dataframe, df1, with two columns. The first column, 'col1', is a datetime column, and the second is an int column with only two possible values (0 or 1). Here is an example of the dataframe:
+----------------------+----------+
| col1 | col2 |
+----------------------+----------+
| 2020-01-01 10:00:00 | 0 |
+----------------------+----------+
| 2020-01-01 11:00:00 | 1 |
+----------------------+----------+
| 2020-01-01 12:00:00 | 1 |
+----------------------+----------+
| 2020-01-02 11:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 12:00:00 | 1 |
+----------------------+----------+
| ... | ... |
+----------------------+----------+
As you can see, the datetimes are sorted in ascending order. What I would like is: for each different date (in this example there are two different dates, 2020-01-01 and 2020-01-02, with different times) I would like to keep the first 1 value and set the previous and following ones on that date to 0. So the resulting dataframe would be:
+----------------------+----------+
| col1 | col2 |
+----------------------+----------+
| 2020-01-01 10:00:00 | 0 |
+----------------------+----------+
| 2020-01-01 11:00:00 | 1 |
+----------------------+----------+
| 2020-01-01 12:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 11:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 12:00:00 | 1 |
+----------------------+----------+
| ... | ... |
+----------------------+----------+
How can I do it in Python?
Use:
df['col1'] = pd.to_datetime(df.col1)
# cumulative count of ones per calendar day; only rows where it equals 1 keep their value
mask = df.groupby(df.col1.dt.date)['col2'].cumsum().eq(1)
df['col2'] = df.col2.where(mask, 0)
Output:
>>> df
col1 col2
0 2020-01-01 10:00:00 0
1 2020-01-01 11:00:00 1
2 2020-01-01 12:00:00 0
3 2020-01-02 11:00:00 0
4 2020-01-02 12:00:00 1
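For reference, a self-contained reproduction of the example (the input values are assumed from the tables above):
import pandas as pd

df = pd.DataFrame({
    "col1": ["2020-01-01 10:00:00", "2020-01-01 11:00:00", "2020-01-01 12:00:00",
             "2020-01-02 11:00:00", "2020-01-02 12:00:00"],
    "col2": [0, 1, 1, 0, 1],
})
df["col1"] = pd.to_datetime(df["col1"])

# keep only the first 1 per calendar day, zero out everything else
mask = df.groupby(df["col1"].dt.date)["col2"].cumsum().eq(1)
df["col2"] = df["col2"].where(mask, 0)
print(df)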

How to create rows that fill the time between events in Python

I am building a data frame for survival analysis starting from 2018-01-01 00:00:00 and ending TODAY. I have two columns with start and end times only for the events that occurred, associated with an ID.
However, I need to add rows for the times during which the event was not observed.
Here I show what I have:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
And what I need is:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2018-01-01 00:00:00 | 2019-12-04 04:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 19:30:00 | 2019-12-08 06:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-20 10:00:00 | 2019-12-22 11:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 23:00:00 | 2019-12-26 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-29 16:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
| State1 | 112 | AA1 | 2018-01-01 00:00:00 | 2018-09-19 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA1 | 2018-09-20 04:30:00 | 2018-09-25 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA1 | 2018-09-26 23:00:00 | 2018-09-27 01:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 10:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
I have tried this code (borrowed from: How to find the start time and end time of an event in python?) and the answer provided by Fredy Montaño (below), but it gives me only the sequence of events, not the desired rows:
fill_date = []
for item in range(1, df.shape[0], 1):
    if (df['End_Time'][item-1] - df['Start_Time'][item]) == 0:
        ""
    else:
        fill_date.append([df["State"][item-1], df["ID1"][item-1], df["ID2"][item-1],
                          df['End_Time'][item-1], df['Start_Time'][item]])
df_add = pd.DataFrame(fill_date)
df_add.columns = ["State", "ID1", "ID2", 'Start_Time', 'End_Time']
df_output = pd.concat([df[["State", "ID1", "ID2", "Start_Time", "End_Time"]], df_add], axis=0)
df_output = df_output.sort_values(["State", "ID2", "Start_Time"], ascending=True)
I think I have to put a condition on the STATE, ID1 and ID2 variables in order not to take times from the previous groups.
Any suggestion?
Maybe this solution works for you.
I slice the dataframe to take only the dates; if it works for you, you can repeat it taking the States and IDs into account.
df = df[['Start_Time', 'End_Time']]
fill_date = []
for item in range(1, df.shape[0], 1):
    if df['Start_Time'][item] - df['End_Time'][item-1] == 0:
        ""
    else:
        fill_date.append([df['End_Time'][item-1], df['Start_Time'][item]])
df_add = pd.DataFrame(fill_date)
df_add.columns = ['Start_Time', 'End_Time']
And finally, I do a concat to join your original dataframe with the new dataframe of dates of the not-observed events:
df_final = pd.concat([df, df_add], axis=0)
df_final.sort_index(axis=0)
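Building on that idea, a hedged sketch that also respects the question's grouping concern (State, ID1, ID2) could look like the following; it assumes the original, unsliced frame and uses 2018-01-01 and today as the boundary timestamps mentioned in the question:
import pandas as pd

ORIGIN = pd.Timestamp("2018-01-01")
TODAY = pd.Timestamp.today().normalize()

def add_gaps(g):
    # build the "not observed" rows for one (State, ID1, ID2) group
    g = g.sort_values("Start_Time")
    gaps = pd.DataFrame({
        "Start_Time": [ORIGIN, *g["End_Time"].iloc[:-1], g["End_Time"].iloc[-1]],
        "End_Time":   [g["Start_Time"].iloc[0], *g["Start_Time"].iloc[1:], TODAY],
    })
    for col in ["State", "ID1", "ID2"]:
        gaps[col] = g[col].iloc[0]
    out = pd.concat([g, gaps]).sort_values("Start_Time")
    # drop zero-length gaps where consecutive events touch exactly
    return out[out["Start_Time"] != out["End_Time"]]

df_output = (df.groupby(["State", "ID1", "ID2"], group_keys=False)
               .apply(add_gaps)
               .reset_index(drop=True))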

Iterate over each day, calculate average price of first and last 3 hours and take difference of those averages in Python

I have a data frame that looks like this,
+---------------------+------------+----------+-------+
| Timestamp | Date | Time | Price |
+---------------------+------------+----------+-------+
| 2017-01-01 00:00:00 | 2017-01-01 | 00:00:00 | 20 |
| 2017-01-01 00:01:00 | 2017-01-01 | 00:01:00 | 25 |
| 2017-01-01 00:02:00 | 2017-01-01 | 00:02:00 | 15 |
| 2017-01-01 00:03:00 | 2017-01-01 | 00:03:00 | 20 |
| ... | | | |
| 2017-01-01 00:20:00 | 2017-01-01 | 00:20:00 | 25 |
| 2017-01-01 00:21:00 | 2017-01-01 | 00:21:00 | 15 |
| 2017-01-01 00:22:00 | 2017-01-01 | 00:22:00 | 10 |
| 2017-01-01 00:23:00 | 2017-01-01 | 00:23:00 | 25 |
| 2017-02-01 00:00:00 | 2017-02-01 | 00:00:00 | 10 |
| 2017-02-01 00:01:00 | 2017-02-01 | 00:01:00 | 25 |
| 2017-02-01 00:02:00 | 2017-02-01 | 00:02:00 | 10 |
| 2017-02-01 00:03:00 | 2017-02-01 | 00:03:00 | 25 |
| ... | | | |
| 2017-02-01 00:20:00 | 2017-02-01 | 00:20:00 | 15 |
| 2017-02-01 00:21:00 | 2017-02-01 | 00:21:00 | 10 |
| 2017-02-01 00:22:00 | 2017-02-01 | 00:22:00 | 25 |
| 2017-02-01 00:23:00 | 2017-02-01 | 00:23:00 | 10 |
+---------------------+------------+----------+-------+
Timestamp datetime64[ns]
Date datetime64[ns]
Time object
Price float64
and I'm trying to calculate the difference between the average price of the first 3 hours and the last 3 hours of each day.
The design in my mind is to do something like this:
For every unique date in Date:
    a = avg(price.first(3))
    b = avg(price.last(3))
    dif = a - b
    append dif to another dataset
---------EDIT----------
And the expected result is:
+------------+---------+
| Date | Diff |
+------------+---------+
| 2017-01-01 | 3.33334 |
| 2017-02-01 | 0 |
+------------+---------+
My real query will be in seconds rather than hours (I didn't want to put 120 rows in here to show 2 minutes of the data), so the hours stand in for seconds.
And there can be some missing rows in the dataset, so if I just do price.first(3600) it can overshoot for some days, right? If I can solve this using df.Timestamp.dt.hour that would be more precise, I think.
I really can't get my head around how to get the first and last 3 Price values for every day. Any help will be much appreciated! Thank you in advance!
As you showed, the hours are ordered, so you can group by day, get the list of prices for the day, and then apply a function that computes the difference. You could try something like this:
import pandas as pd
from statistics import mean

def getavg(ls):
    # mean of the first three prices minus the mean of the last three
    mean3first = mean(ls[:3])
    mean3last = mean(ls[len(ls) - 3:])
    return mean3first - mean3last

diff_means = df.groupby(['Date']).agg(list)['Price'].apply(getavg).reset_index()
diff_means.columns = ['Date', 'Diff']
print(diff_means)
I'm not entirely sure what format you want the result in, but I found a solution that I find pretty elegant:
import numpy as np
import pandas as pd

unique_dates = df.Date.unique()
rows = []
for u_date in unique_dates:
    day = df[df.Date == u_date].reset_index()
    first_3 = np.mean(day.head(3).Price)
    last_3 = np.mean(day.tail(3).Price)
    # DataFrame.append was removed in pandas 2.0; collect the per-day rows and concat once instead
    rows.append(pd.DataFrame([[u_date, last_3 - first_3]], columns=['Date', 'PriceDiff']))
new_df = pd.concat(rows, ignore_index=True)
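A groupby-based sketch (not from either answer above) that avoids the explicit loop over dates, assuming the rows are already ordered by Timestamp within each day:
first3 = df.groupby("Date")["Price"].apply(lambda s: s.head(3).mean())
last3 = df.groupby("Date")["Price"].apply(lambda s: s.tail(3).mean())
diff = (last3 - first3).rename("PriceDiff").reset_index()
print(diff)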
