(newbie to python and pandas)
I have a data set of 15 to 20 million rows, each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visit-per-day patterns of each user, normalized to their first visit. So, I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e., I need to get a series indexed by a timedelta and with values of visits in the period ending with that delta [0:1, 3:5, 4:2, 6:8,] But I'm stuck very early ...
I start with something like this:
rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
'2000-01-01 08:15', '2000-01-02 18:00',
'2000-01-02 17:00', '2000-03-01 08:00',
'2000-03-01 08:20','2000-01-02 18:00'])
uid=Series(['u1','u2','u1','u2','u1','u2','u2','u3'])
misc=Series(['','x1','A123','1.23','','','','u3'])
df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
df=df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts values are unique to each uid, but can be duplicated (two uid can be seen at the same time, but any one uid is seen only once at any one timestamp)
The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But, I'm stuck getting that column in the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
return row.ts - firstseen.ts[row.uid]
df['sinceseen'] = Series([{idx:f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df gets a sinceseen but it's all NaN and shows a type of float for type(df.sinceseen[0]) - though, if I just print the Series (in iPython) it generates a nice list of timedeltas.
I'm working back and forth through "Python for Data Analysis" and it seems like apply() should work, but
def fg(ugroup):
ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
return ugroup
df = df.groupby('uid').apply(fg)
gives me a TypeError on the "ugroup.index - ugroup.index.min(" even though each of the two operands is a Timestamp.
So, I'm flailing - can someone point me at the "pandas" way to get to the data structure Ineed?
Does this help you get started?
>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
Related
I have a pandas dataframe (python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all functions I find start their averaging at even hours (e.g. hour 9 includes data from 08.00:00 to 08:59:50). Let's say I have the dataframe below.
Timestamp value data
2022-01-01 00:00:00 0.0 5.31
2022-01-01 00:00:10 0.0 0.52
2022-01-01 00:00:20 1.0 9.03
2022-01-01 00:00:30 1.0 4.37
2022-01-01 00:00:40 1.0 8.03
...
2022-01-01 13:52:30 1.0 9.75
2022-01-01 13:52:40 1.0 0.62
2022-01-01 13:52:50 1.0 3.58
2022-01-01 13:53:00 1.0 8.23
2022-01-01 13:53:10 1.0 3.07
Freq: 10S, Length: 5000, dtype: float64
So what I want to do:
Only look at data where we have data that consistently through 1 hour has a value of 1
Find an hourly average of these hours (could e.g. be between 01:30:00-02:29:50 and 11:16:30 - 12:16:20)..
I hope I made my problem clear enough. How do I do this?
EDIT:
Maybe the question was a bit unclear phrased.
I added a third column data, which is what I want to find the mean of. I am only interested in time intervals where, value = 1 consistently through one hour, the rest of the data can be excluded.
EDIT #2:
A bit of background to my problem: I have a sensor giving me data every 10 seconds. For data to be "approved" certain requirements are to be fulfilled (value in this example), and I need the hourly averages (and preferably timestamps for when this occurs). So in order to maximize the number of possible hours to include in my analysis, I would like to find full hours even if they don't start at an even timestamp.
If I understand you correctly you want a conditional mean - calculate the mean per hour of the data column conditional on the value column being all 1 for every 10s row in that hour.
Assuming your dataframe is called df, the steps to do this are:
Create a grouping column
This is your 'hour' column that can be created by
df['hour'] = df.Timestamp.hour
Create condition
Now we've got a column to identify groups we can check which groups are eligible - only those with value consistently equal to 1. If we have 10s intervals and it's per hour then if we group by hour and sum this column then we should get 360 as there are 360 10s intervals per hour.
Group and compute
We can now group and use the aggregate function to:
sum the value column to evaluate against our condition
compute the mean of the data column to return for the valid hours
# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]
That's it - you are left with a dataframe that contains the mean value of data for only the hours where you have a complete hour of value=1.
If you want to augment this so you don't have to start with the grouping as per hour starting as 08:00:00-09:00:00 and maybe you want to start as 08:00:10-09:00:10 then the solution is simple - augment the grouping column but don't change anything else in the process.
To do this you can use datetime.timedelta to shift things forward or back so that df.Timestamp.hour can still be leveraged to keep things simple.
Infer grouping from data
One final idea - if you want to infer which hours on a rolling basis you have complete data for then you can do this with a rolling sum - this is even easier. You:
compute the rolling sum of value and mean of data
only select where value is equal to 360
df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
Yes, there is. You need resample with an offset.
Make some test data
Please make sure to provide meaningful test data next time.
import pandas as pd
import numpy as np
# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)
This is what we got:
>>> df
data value
2018-01-01 00:00:00 0.377082 1
2018-01-01 00:00:10 0.574471 1
2018-01-01 00:00:20 0.284629 1
2018-01-01 00:00:30 0.678923 1
2018-01-01 00:00:40 0.094724 1
... ... ...
2018-01-01 23:59:20 0.839973 0
2018-01-01 23:59:30 0.890321 0
2018-01-01 23:59:40 0.426595 0
2018-01-01 23:59:50 0.089174 0
2018-01-02 00:00:00 0.351624 0
Get the mean per hour with an offset
Here is a small function that checks if all value rows in the slice are equal to 1 and returns the mean if so, otherwise it (implicitly) returns None.
def get_conditioned_average(frame):
if frame.value.eq(1).all():
return frame.data.mean()
Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.
df2 = df.resample('H', offset='10S').apply(get_conditioned_average)
This is the final result:
>>> df2
2017-12-31 23:00:10 0.377082
2018-01-01 00:00:10 0.522144
2018-01-01 01:00:10 0.506536
2018-01-01 02:00:10 0.505334
2018-01-01 03:00:10 0.504431
... ... ...
2018-01-01 19:00:10 NaN
2018-01-01 20:00:10 NaN
2018-01-01 21:00:10 NaN
2018-01-01 22:00:10 NaN
2018-01-01 23:00:10 NaN
Freq: H, dtype: float64
I am working on some code that will rearrange a time series. Currently I have a standard time series. I have a three columns with with the header being [Date, Time, Value]. I want to reformat the dataframe to index with the date and use a header with the time (i.e. 0:00, 1:00, ... , 23:00). The dataframe will be filled in with the value.
Here is the Dataframe currently have
essentially I'd like to mve the index toa single day and show the hours through the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
You could try this.
Split the time part to get only the hour. Add hr to it.
df = pd.DataFrame([['2019-01-01', '00:00:00',-127.57],['2019-01-01', '01:00:00',-137.57],['2019-01-02', '00:00:00',-147.57],], columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr'+ str(int(x.split(':')[0])))
print(pd.pivot_table(df, values ='Totals', index=['Date'], columns = 'hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN
So I am reading in a csv file of a 30 minute timeseries going from "2015-01-01 00:00" upto and including "2020-12-31 23:30". There are five sets of these timeseries, each being at a certain location, and there are 105215 rows going down for each 30 minutes. My job is to go through and find the timedelta between each row, for each column. It should be 30 minutes for each one, except sometimes it isn't, and I have to find that.
So far I'm reading in the data fine via
ca_time = np.array(ca.iloc[0:, 1], dtype= "datetime64")
ny_time = np.array(ny.iloc[0:, 1], dtype = "datetime64")
tx_time = np.array(tx.iloc[0:, 1], dtype = "datetime64")
#I'm then passing these to a pandas dataframe for more convenient manipulation
frame_ca = pd.DataFrame(data = ca_time, dtype = "datetime64[s]")
frame_ny = pd.DataFrame(data = ny_time, dtype = "datetime64[s]")
frame_tx = pd.DataFrame(data = tx_time, dtype = "datetime64[s]")
#Then concatenating them into an array with 100k+ rows, and the five columns represent each location
full_array = pd.concat([frame_ca, frame_ny, frame_tx], axis = 1)
I now want to find the timedelta between each cell for each respective location.
Currently I'm trying this as a simply test
first_row = full_array2.loc[1:1, :1]
second_row = full_array2.loc[2:2, :1]
delta = first_row - second_row
I'm getting back
0 0 0
1 NaT NaT NaT
2 NaT NaT NaT
These seems simple enough but don't know how I'm getting Not a Time here.
For reference, below are both those rows I'm trying to subtract
ca ny tx fl az
1 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00, 0 0 0 0 0
2 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00
Any help appreciated!
I have three dataframes in Pandas, say df1, df2 and df3. The first column of all dataframes is the Timestamp (DateTime format like 2017-01-01 12:30:00 etc.) Here is an example of each's first column:-
df1 TimeStamp
2016-01-01 12:00:00
2016-01-01 12:10:00
.....
df2 TimeStamp
2016-01-01 12:00:00
2016-01-01 12:10:00
.....
df3 TimeStamp
2016-13-01 12:00:00
2016-13-01 12:30:00
.....
As you can see, for the first two are at 10 minutes intervals, while the third one is at 30 minutes intervals. What I would like to do is to merge all 3 dataframes together, such that for cases where there is not exact match due to non-available data(like 12:10:00 not available for 3rd dataframe ), it would be considered as 12:00:00 (the preceding measurement) for merging purposes. (But of course, the Date should be the same) Note that all the dataframes have different sizes, but I would like to merge them based on Timestamp together for analytical purposes. Thank you!
DESIRED RESULT:
df_final TimeStamp .. Columns of df1 Columns of df2 Columns of df3
2016-13-01 12:00:00
2016-13-01 12:10:00
2016-13-01 12:20:00
.....
MORE DETAILS BASED ON ANSWER SUGGESTED
Firstly, as my dataframes (all 3) did not have index as TimeStamps, but had columns as TimeStamps, I set index for each as the TimeStamps:
df1.index = df1.TimeStamp
df2.index = df2.TimeStamp
df3.index = df3.TimeStamp
On using this
u_index = df3.index.union(df2.index.union(df1.index))
I get a weird output strangely which is not at regularly 10 min intervals like needed.
Index(['2016-01-01 00:00:00.000', '2016-01-01 00:00:00.000',
'2016-01-01 00:00:00.000', '2016-01-01 00:00:00.000',
...
'2017-12-31 23:50:00.000', '2017-12-31 23:50:00.000',
'2017-12-31 23:50:00.000', '2017-12-31 23:50:00.000',
dtype='object', name='TimeStamp', length=3199372)
Accordingly, the final df1_n dataframe is at 30 min intervals and not 10 mins (as the Union of indices was not properly done). I think that there is something going wrong here and once Step 2 suggested (u_index) is working properly, everything will be easy to merge the dataframes.
So I'm not 100% sure if what you asked for is how to complete the missing values after merging the three dataframes with the next valid observation.
if so, that's the quickest way I found to do this (not the most elegant...):
create a new index which is the union of the three indexes (will result in timestamp with intervals of 10 minutes in you case).
reindex all three dfs according to the new index while filling in missing values separately.
merge columns of the three dfs (which will be easy since after step NO.2 they will have the same index).
taking a portion of the data:
df1
Out[48]:
val_1
TimeStamp
2016-01-01 12:00:00 11
2016-01-01 12:10:00 12
df2
Out[49]:
val_2
TimeStamp
2016-01-01 12:00:00 21
2016-01-01 12:10:00 22
df3
Out[50]:
val_3
TimeStamp
2016-01-01 12:00:00 31
2016-13-01 12:30:00 32
step NO.1
u_index = df3.index.union(df2.index.union(df1.index))
u_index
Out[38]: Index(['2016-01-01 12:00:00', '2016-01-01 12:10:00', '2016-13-01 12:30:00'], dtype='object', name='TimeStamp')
step NO.2
df3_n = df3.reindex(index=u_index,method='bfill')
df2_n = df2.reindex(index=u_index,method='bfill')
df1_n = df1.reindex(index=u_index,method='bfill')
step NO.3
df1_n.merge(df2_n,on='TimeStamp').merge(df3_n,on='TimeStamp')
Out[47]:
val_1 val_2 val_3
TimeStamp
2016-01-01 12:00:00 11.0 21.0 31
2016-01-01 12:10:00 12.0 22.0 32
2016-13-01 12:30:00 NaN NaN 32
You might need to adjust the last row, since it has no following row to fill values from. but that's it pretty much.
UsageDate CustID1 CustID2 .... CustIDn
0 2018-01-01 00:00:00 1.095
1 2018-01-01 01:00:00 1.129
2 2018-01-01 02:00:00 1.165
3 2018-01-01 04:00:00 1.697
.
.
m 2018-31-01 23:00:00 1.835 (m,n)
The dataframe (df) has m rows and n columns. m is a Hourly TimeSeries Index which starts from first hour of month to last hour of month.
The columns are the customers which are almost 100,000.
The values at each cell of Dataframe are energy consumption values.
For every customer, I need to calculate:
1) Mean of every hour usage - so basically average of 1st hour of every day in a month, 2nd hour of every day in a month etc.
2) Summation of usage of every customer
3) Top 3 usage hours - for a customer x, it can be "2018-01-01 01:00:00",
"2018-11-01 05:00:00" "2018-21-01 17:00:00"
4) Bottom 3 usage hours - Similar explanation as above
5) Mean of usage for every customer in the month
My main point of trouble is how to aggregate data both for every customer and the hour of day, or day together.
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
`df_temp[col,"TotalUsage"] = df[col].apply.sum()`
However, this and many version of this which I tried are not helping me solve the problem.
Please help me with an approach and how to think about such problems.
Also, since the dataframe is large, it would be helpful if we can talk about Computational Complexity and how can we decrease computation time.
This looks like a job for pandas.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour']=df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour=df.groupby('hour').mean()
2) Summation by user.
sum_by_uers=df.sum()
3) Top usage by customer. Bottom 3 usage hours - Similar explanation as above.--I don't quite understand your desired output, you might be asking too many different questions in this question. If you want the hour and not the value, I think you may have to iterate through the columns. Adding an example may help.
4) Same comment.
5) Mean by customer.
mean_by_cust = df.mean()
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods= 72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72,5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
# print('Sample Data:')
# print(df.head())
# print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
# print('Mean Data:')
# print(mean_data.head())
# print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
# print('Top 3:')
# print(top_3_Usage_hours)
# print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
# print('Bottom 3:')
# print(bottom_3_Usage_hours)
out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For top and bottom 3 if you want to find the min sum across rows then:
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)