Generating DataFrame with combination of columns and sum of grouped values - python

So, I have a DataFrame in which each row represents an event, with essentially four columns:
happen start_date end_date number
0 2015-01-01 2015-01-01 2015-01-03 100.0
1 2015-01-01 2015-01-01 2015-01-01 20.0
2 2015-01-01 2015-01-02 2015-01-02 50.0
3 2015-01-02 2015-01-02 2015-01-02 40.0
4 2015-01-02 2015-01-02 2015-01-03 50.0
where happen is the date the event took place, start_date and end_date are the validity of that event, and number is just a summable variable.
What I'd like to get is a DataFrame with one row for each combination of happen date and validity date, together with the sum of the number column for that combination.
What I've tried so far is a double for loop over all dates, knowing that start_date >= happen:
startdate = pd.to_datetime('01/06/2014', format='%d/%m/%Y')  # the minimum possible happen
enddate = pd.to_datetime('31/12/2021', format='%d/%m/%Y')  # the maximum possible happen (and validity)
df_day = pd.DataFrame()
for dt1 in pd.date_range(start=startdate, end=enddate):
    for dt2 in pd.date_range(start=dt1, end=enddate):
        num_sum = df[(df['happen'] == dt1) & (df['start_date'] <= dt2) &
                     (df['end_date'] >= dt2)]['number'].sum()
        row = {'happen': dt1, 'valid': dt2, 'number': num_sum}
        df_day = df_day.append(row, ignore_index=True)
and that never came to an end. So I tried another way: I generated the DataFrame with all date combinations first (around 3.8e6 rows), and then tried to fill it with a lambda function (it's crazy, I know, but I don't know how to work around it):
from itertools import product

dt1 = pd.date_range(start=startdate, end=enddate).tolist()
df_day = pd.DataFrame()
for i in dt1:
    dt_acc1 = [i]
    dt2 = pd.date_range(start=i, end=enddate).tolist()
    df_comb = pd.DataFrame(list(product(dt_acc1, dt2)), columns=['happen', 'valid'])
    df_day = df_day.append(df_comb, ignore_index=True)
df_day['number'] = 0

def append_num(happen, valid):
    return df[(df['happen'] == happen) & (df['start_date'] <= valid) &
              (df['end_date'] >= valid)]['number'].sum()

df_day['number'] = df_day.apply(lambda x: append_num(x['happen'], x['valid']), axis=1)
and this loop also takes forever.
My expected output is something like this:
happen valid number
0 2015-01-01 2015-01-01 120.0
1 2015-01-01 2015-01-02 150.0
2 2015-01-01 2015-01-03 100.0
3 2015-01-02 2015-01-02 90.0
4 2015-01-02 2015-01-03 50.0
5 2015-01-03 2015-01-03 0.0
As you can see, the first row represents the sum of all rows with happen on 2015-01-01 whose validity range (start_date to end_date) contains the valid date 2015-01-01; the number column holds that sum (120.0 = 100.0 + 20.0). On the second row, with valid moving one day forward, I "lose" the row with index 1 and "gain" the row with index 2 (150.0 = 100.0 + 50.0).
Every help or suggestion is appreciated!
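For what it's worth, here is a vectorized sketch of the same computation (an editorial suggestion, not tested on the full data), assuming the startdate/enddate bounds defined above and a pandas version with explode (0.25+). The idea is to expand each event into one row per day of its validity window, sum per (happen, valid) pair, and then reindex onto the full grid so missing combinations show up as 0:
# Expand each event into one row per day of its validity window.
expanded = df.assign(
    valid=[pd.date_range(s, e) for s, e in zip(df['start_date'], df['end_date'])]
).explode('valid')
# Sum number for every (happen, valid) pair that actually occurs.
sums = expanded.groupby(['happen', 'valid'])['number'].sum()
# Build the full grid of combinations with valid >= happen and fill the gaps with 0.
all_dates = pd.date_range(startdate, enddate)
grid = pd.MultiIndex.from_tuples(
    [(d1, d2) for d1 in all_dates for d2 in pd.date_range(d1, enddate)],
    names=['happen', 'valid'])
df_day = sums.reindex(grid, fill_value=0).reset_index()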

Related

Finding the Timedelta through a pandas dataframe, I keep returning NaT

So I am reading in a CSV file of a 30-minute time series running from "2015-01-01 00:00" up to and including "2020-12-31 23:30". There are five of these time series, one for each location, with 105215 rows going down, one per 30 minutes. My job is to go through and find the timedelta between consecutive rows, for each column. It should be 30 minutes every time, except sometimes it isn't, and I have to find those cases.
So far I'm reading in the data fine via
ca_time = np.array(ca.iloc[0:, 1], dtype= "datetime64")
ny_time = np.array(ny.iloc[0:, 1], dtype = "datetime64")
tx_time = np.array(tx.iloc[0:, 1], dtype = "datetime64")
#I'm then passing these to a pandas dataframe for more convenient manipulation
frame_ca = pd.DataFrame(data = ca_time, dtype = "datetime64[s]")
frame_ny = pd.DataFrame(data = ny_time, dtype = "datetime64[s]")
frame_tx = pd.DataFrame(data = tx_time, dtype = "datetime64[s]")
#Then concatenating them into an array with 100k+ rows, and the five columns represent each location
full_array = pd.concat([frame_ca, frame_ny, frame_tx], axis = 1)
I now want to find the timedelta between each cell for each respective location.
Currently I'm trying this as a simple test:
first_row = full_array2.loc[1:1, :1]
second_row = full_array2.loc[2:2, :1]
delta = first_row - second_row
I'm getting back
0 0 0
1 NaT NaT NaT
2 NaT NaT NaT
This seems simple enough, but I don't know why I'm getting Not a Time here.
For reference, below are both those rows I'm trying to subtract
ca ny tx fl az
1 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00 2015-01-01 01:00:00
2 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00 2015-01-01 01:30:00
Any help appreciated!
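A quick note on the NaT (an editorial aside): subtracting two one-row slices aligns them on their index labels (1 vs 2 here), so nothing lines up and every cell comes back missing. A minimal sketch of a per-column, row-to-row difference instead, assuming full_array is the concatenated frame built above:
import pandas as pd

# diff() subtracts each row from the previous one, column by column.
deltas = full_array.diff().iloc[1:]  # drop the first row, which has no predecessor
# Rows where any location deviates from the expected 30-minute spacing.
irregular = deltas[(deltas != pd.Timedelta(minutes=30)).any(axis=1)]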

Pandas Dataframe: Find the conditional mean of all observations that meet certain conditions that are DIFFERENT in each row

Let's say that I have a dataframe like this:
date M1_start M1_end SimPrices_t0_exp
0 2017-12-31 2018-01-01 2018-01-31 16.151667
1 2018-01-01 2018-02-01 2018-02-28 45.138445
2 2018-01-02 2018-02-01 2018-02-28 56.442648
3 2018-01-03 2018-02-01 2018-02-28 59.769931
4 2018-01-04 2018-02-01 2018-02-28 50.171695
And I want to get the mean of the SimPrices_t0_exp observations whose 'date' value falls between the M1_start and M1_end of each observation.
I have tried this
mask = ((df['date'] >= df['M1_start']) & (df['date'] <= df['M1_end']))
df['mymean'] = df['SimPrices_t0_exp'][mask].mean()
However, this returns NaN for every observation - I believe because the mask checks each row's conditions against its own date, which never returns true.
Can somebody help me? I have been struggling with this problem for two days
Example: for the first observation, the resulting column would contain the average of 45.13, 56.44, 59.76, and 50.17 in this particular case.
If it helps somebody, the pseudocode would be something like this:
for obs in observations:
    start = obs.start
    end = obs.end
    sum = 0
    obs_count = 0
    for obs2 in observations:
        if obs2.date >= start and obs2.date <= end:
            sum += obs2.SimPrices_t0_exp
            obs_count += 1
    obs.mean = sum / obs_count
Thanks!!
Here's one way to do this using a cartesian merge (not a good choice for large datasets), filtering, and groupby:
df = df.assign(key=1)
df_m = df.merge(df, on='key')
df_m.query('M1_start_x <= date_y <= M1_end_x').groupby(['M1_start_x','M1_end_x'])['SimPrices_t0_exp_y'].mean()
Output:
M1_start_x M1_end_x
2018-01-01 2018-01-31 52.88068
Name: SimPrices_t0_exp_y, dtype: float64
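As an aside (not part of the answer above): on pandas 1.2 or later the dummy key column isn't needed, since merge accepts how='cross' directly. A sketch of the same idea:
# Cross join every row with every other row, keep the pairs whose date falls
# inside the other row's M1 window, and average per window.
df_m = df.merge(df, how='cross')
out = (df_m.query('M1_start_x <= date_y <= M1_end_x')
           .groupby(['M1_start_x', 'M1_end_x'])['SimPrices_t0_exp_y']
           .mean())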

Pandas resample doesn't return anything

I am learning to use the pandas resample() function; however, the following code does not return anything as expected. I resampled the time series by day.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print df.head()
weekly_summary = pd.DataFrame()
weekly_summary['speed'] = df.speed.resample('D').mean()
weekly_summary['distance'] = df.distance.resample('D').sum()
print weekly_summary.head()
Output
speed distance cumulative_distance
2015-01-01 00:00:00 40 10.00 10.00
2015-01-01 00:15:00 6 1.50 11.50
2015-01-01 00:30:00 31 7.75 19.25
2015-01-01 00:45:00 41 10.25 29.50
2015-01-01 01:00:00 59 14.75 44.25
[5 rows x 3 columns]
Empty DataFrame
Columns: [speed, distance]
Index: []
[0 rows x 2 columns]
Depending on your pandas version, how you will do this will vary.
In pandas 0.19.0, your code works as expected:
In [7]: pd.__version__
Out[7]: '0.19.0'
In [8]: df.speed.resample('D').mean().head()
Out[8]:
2015-01-01 28.562500
2015-01-02 30.302083
2015-01-03 30.864583
2015-01-04 29.197917
2015-01-05 30.708333
Freq: D, Name: speed, dtype: float64
In older versions, your solution might not work but at least in 0.14.1, you can tweak it to do so:
>>> pd.__version__
'0.14.1'
>>> df.speed.resample('D').mean()
29.41087328767123
>>> df.speed.resample('D', how='mean').head()
2015-01-01 29.354167
2015-01-02 26.791667
2015-01-03 31.854167
2015-01-04 26.593750
2015-01-05 30.312500
Freq: D, Name: speed, dtype: float64
This looks like an issue with an old version of pandas; in newer versions, assigning a new column will enlarge the df even when the index is not the same shape. What should work is to not make an empty df and instead pass the initial call to resample as the data arg for the DataFrame constructor:
In [8]:
range = pd.date_range('2015-01-01','2015-12-31',freq='15min')
df = pd.DataFrame(index = range)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['distance'] = df['speed'] * 0.25
df['cumulative_distance'] = df.distance.cumsum()
print (df.head())
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
weekly_summary['distance'] = df.distance.resample('D').sum()
print( weekly_summary.head())
speed distance cumulative_distance
2015-01-01 00:00:00 28 7.0 7.0
2015-01-01 00:15:00 8 2.0 9.0
2015-01-01 00:30:00 10 2.5 11.5
2015-01-01 00:45:00 56 14.0 25.5
2015-01-01 01:00:00 6 1.5 27.0
speed distance
2015-01-01 27.895833 669.50
2015-01-02 29.041667 697.00
2015-01-03 27.104167 650.50
2015-01-04 28.427083 682.25
2015-01-05 27.854167 668.50
Here I pass the call to resample as the data arg for the DataFrame constructor; this takes the index and column name and creates a single-column df:
weekly_summary = pd.DataFrame(df.speed.resample('D').mean())
Then subsequent assignments should work as expected.
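One more option (an editorial note, not part of the original answers): on a reasonably recent pandas, resample().agg can build both daily columns in a single call:
# Aggregate each column with its own function in one resample pass.
weekly_summary = df.resample('D').agg({'speed': 'mean', 'distance': 'sum'})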

pandas rolling computation to find set of small numbers

I have a dataframe that has one variable and an equally spaced datetime index (the index is at 1-second granularity). Say there are 1000 samples overall:
dates = pd.date_range('2015-1-1', periods=(1000) ,freq='S')
df = pd.DataFrame(np.random.rand(1000),index=dates, columns=['X'])
X
2015-01-01 00:00:00 2.2
2015-01-01 00:00:01 2.5
2015-01-01 00:00:02 1.2
2015-01-01 00:00:03 1.5
2015-01-01 00:00:04 3.7
2015-01-01 00:00:05 3.1
etc
I want to determine the start of the rolling window (of a given length) that contains the largest number of the smallest values in the series.
So in the example above, if the window was of size two, the answer would be:
start_index = 2015-01-01 00:00:02
end_index = 2015-01-01 00:00:03
I've tried to read the pandas documentation to see if there is a rolling computation that can help, but no luck! Thanks.
You just need to do a rolling sum over df['X'] == df['X'].min(). Then the end of the window is simply:
>>> ts = df['X'] == df['X'].min()
>>> pd.rolling_sum(ts, win_size).argmax()
and in order to obtain the start of the window you may either shift the end of the window back or, alternatively, shift the series:
>>> pd.rolling_sum(ts.shift(-win_size), win_size).argmax()
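In recent pandas releases the standalone pd.rolling_sum helper has been removed, so here is a sketch of the same idea with the rolling() method (win_size is whatever window length you chose; the 1-second spacing comes from the question):
import pandas as pd

ts = (df['X'] == df['X'].min()).astype(int)  # 1 where the global minimum occurs
end_index = ts.rolling(win_size).sum().idxmax()  # label closing the best window
start_index = end_index - pd.Timedelta(seconds=win_size - 1)  # back up win_size - 1 seconds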

adding column with per-row computed time difference from group start?

(newbie to python and pandas)
I have a data set of 15 to 20 million rows; each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visits-per-day pattern of each user, normalized to their first visit. So, I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e., I need to get a series indexed by a timedelta and with values of visit counts in the period ending with that delta [0:1, 3:5, 4:2, 6:8]. But I'm stuck very early ...
I start with something like this:
import pandas as pd
from pandas import Series, DataFrame

rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
                      '2000-01-01 08:15', '2000-01-02 18:00',
                      '2000-01-02 17:00', '2000-03-01 08:00',
                      '2000-03-01 08:20', '2000-01-02 18:00'])
uid = Series(['u1', 'u2', 'u1', 'u2', 'u1', 'u2', 'u2', 'u3'])
misc = Series(['', 'x1', 'A123', '1.23', '', '', '', 'u3'])
df = DataFrame({'uid': uid, 'misc': misc, 'ts': rng})
df = df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts values are unique within each uid, but can be duplicated across uids (two uids can be seen at the same time, but any one uid is seen only once at any one timestamp).
The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But, I'm stuck getting that column in the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
    return row.ts - firstseen.ts[row.uid]

df['sinceseen'] = Series([{idx: f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df gets a sinceseen column, but it's all NaN and type(df.sinceseen[0]) comes back as float - though if I just print the Series (in IPython) it shows a nice list of timedeltas.
I'm working back and forth through "Python for Data Analysis" and it seems like apply() should work, but
def fg(ugroup):
    ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
    return ugroup

df = df.groupby('uid').apply(fg)
gives me a TypeError on the "ugroup.index - ugroup.index.min()" even though each of the two operands is a Timestamp.
So, I'm flailing - can someone point me at the "pandas" way to get to the data structure I need?
Does this help you get started?
>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
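A small follow-up (not from the original answer): in modern pandas DataFrame.sort has become sort_values, and the per-group first timestamp can also be broadcast with transform, which avoids the apply entirely:
df = df.sort_values(["uid", "ts"])
# Subtract each user's earliest timestamp from every one of their visits.
df["since_seen"] = df["ts"] - df.groupby("uid")["ts"].transform("min")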
