Given the following time series (for illustration purposes):
From | Till | Precipitation
2022-01-01 06:00:00 | 2022-01-02 06:00:00 | 0.5
2022-01-02 06:00:00 | 2022-01-03 06:00:00 | 1.2
2022-01-03 06:00:00 | 2022-01-04 06:00:00 | 0.0
2022-01-04 06:00:00 | 2022-01-05 06:00:00 | 1.3
2022-01-05 06:00:00 | 2022-01-06 06:00:00 | 9.8
2022-01-06 06:00:00 | 2022-01-07 06:00:00 | 0.1
I'd like to estimate the daily precipitation between 2022-01-02 00:00:00 and 2022-01-06 00:00:00. We can assume that the rate of precipitation is constant for each given interval in the table.
Doing it manually, I'd assume something like
2022-01-02 00:00:00 | 2022-01-03 00:00:00 | 0.25 * 0.5 + 0.75 * 1.2
since 6 of the first interval's 24 hours and 18 of the second interval's 24 hours fall inside that day.
Note: the real-world data will most likely look much less regular, somewhat like the following (missing intervals can be assumed to be 0.0):
From | Till | Precipitation
2022-01-01 05:45:12 | 2022-01-02 02:11:20 | 0.8
2022-01-03 02:01:59 | 2022-01-04 12:01:00 | 5.4
2022-01-04 06:00:00 | 2022-01-05 06:00:00 | 1.3
2022-01-05 07:10:00 | 2022-01-06 07:10:00 | 9.2
2022-01-06 02:54:00 | 2022-01-07 02:53:59 | 0.1
Maybe there's a library with a general and efficient solution?
If there's no such library, how do I compute the resampled time series in the most efficient way?
just calculate the period overlaps ... I think this will be plenty fast
import pandas as pd
import numpy as np
def create_test_data():
# just a helper to construct a test dataframe
from_dates = pd.date_range(start='2022-01-01 06:00:00', freq='D', periods=6)
till_dates = pd.date_range(start='2022-01-02 06:00:00', freq='D', periods=6)
precip_amounts = [0.5, 1.2, 1, 2, 3, 0.5]
return pd.DataFrame({'From': from_dates, 'Till': till_dates, 'Precip': precip_amounts})
def get_between(df, start_datetime, end_datetime):
# all the entries that end (Till) after start_time
# and start(From) before the end
mask1 = df['Till'] > start_datetime
mask2 = df['From'] < end_datetime
return df[mask1 & mask2]
def get_ratio_values(df, start_datetime, end_datetime, debug=True):
# get the ratios of the period windows
df2 = get_between(df, start_datetime, end_datetime) # get only the rows of interest
    precip_values = df2['Precip']
    # get overlap from the end time of row to start of our period of interest
    overlap_period1 = df2['Till'] - start_datetime
    # get overlap from end of our period of interest and the start time of row
    overlap_period2 = end_datetime - df2['From']
# get the "best" overlap for each row
best_overlap = np.minimum(overlap_period1, overlap_period2)
# get the period of each duration
window_durations = df2['Till'] - df2['From']
# calculate the ratios of overlap (cannot be greater than 1)
ratios = np.minimum(1.0, best_overlap / window_durations)
# calculate the value * the ratio
ratio_values = ratios * precip_values
if debug:
# just some prints for verification
print("Ratio * value = result")
print("----------------------")
        print("\n".join(f"{x:0.3f} * {y:0.2f} = {z}" for x, y, z in zip(ratios, precip_values, ratio_values)))
print("----------------------")
return ratio_values
start = pd.to_datetime('2022-01-02 00:00:00')
end = pd.to_datetime('2022-01-04 00:00:00')
ratio_vals = get_ratio_values(create_test_data(), start, end)
total_precip = ratio_vals.sum()
print("SUM RESULT =", total_precip)
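If you need the full daily series between 2022-01-02 and 2022-01-06 rather than a single total, one option (a usage sketch on top of the code above, not part of the original answer) is to call the function once per daily bin:
df = create_test_data()
bins = pd.date_range('2022-01-02', '2022-01-06', freq='D')
daily = pd.Series(
    [get_ratio_values(df, d, d + pd.Timedelta(days=1), debug=False).sum() for d in bins[:-1]],
    index=bins[:-1], name='Precip')
print(daily)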
You could also calculate the ratio only for the first and last overlapping entry, since for anything in the middle it will always be 1 (which is probably both simpler and faster):
def get_ratio_values(df, start_datetime, end_datetime, debug=True):
    # get only the rows of interest
    df2 = get_between(df, start_datetime, end_datetime)
    precip_values = df2['Precip']
    # overlap with first row and duration of first row
    overlap_start = df2.iloc[0]['Till'] - start_datetime
    duration_start = df2.iloc[0]['Till'] - df2.iloc[0]['From']
    # overlap with last row and duration of last row
    overlap_end = end_datetime - df2.iloc[-1]['From']
    duration_end = df2.iloc[-1]['Till'] - df2.iloc[-1]['From']
    # every row in between is fully covered, so its ratio is 1
    ratios = np.ones(len(df2))
    ratios[0] = overlap_start / duration_start
    ratios[-1] = overlap_end / duration_end
    return ratios * precip_values
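For completeness, the whole computation can also be vectorized with the standard interval-intersection formula min(Till, end) - max(From, start), clipped at zero; unlike the capped-ratio trick above, this stays exact even when a single source interval spans the whole query window. A minimal sketch, not part of the original answer:
def resample_interval(df, start, end):
    # elementwise min(Till, end) and max(From, start)
    till = df['Till'].where(df['Till'] < end, end)
    frm = df['From'].where(df['From'] > start, start)
    # rows that do not overlap [start, end) contribute 0
    overlap_s = (till - frm).dt.total_seconds().clip(lower=0.0)
    duration_s = (df['Till'] - df['From']).dt.total_seconds()
    return (overlap_s / duration_s * df['Precip']).sum()
print(resample_interval(create_test_data(), start, end))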
Related
I have hourly data covering several years, with a start date and an end date in datetime format. I now want to make a new column for each year of data, in which the mean value across the columns is calculated and stored for each hour of the year. All years should have the same format, so the leap year is ignored. So to summarize, I have the following data:
input_data:
datetime | A | B | C | D | ... | Z |
---------------------|---|---|---|---| --- |---|
2015-01-01 00:00:00 |123| 23| 67|189| ... | 78|
................... |...|...|...|...| ... |...|
2021-06-01 00:00:00 |345| 87|456| 89| ... | 23|
where I have 2015-01-01 00:00:00 as start date and 2021-06-01 08:00:00 as end date. I would like to get something like:
output:
datetime | 2015 | 2016 | 2017| 2018 | ... | 2021 |
----------------|---------|---------|---------|-----------|-----|----------|
01-01 00:00:00 |mean(A:Z)| mean(A:Z)| mean(A:Z)|mean(A:Z)| ... | mean(A:Z)|
................|.........|..........|..........|.........| ... |..........|
12-31 23:00:00 |mean(A:Z)| mean(A:Z)|mean(A:Z)| mean(A:Z)| ... | mean(A:Z)|
where mean(A:Z) is the mean value for each month of the columns A to Z. I would like to avoid iterating over each hour for each year. How can I best achieve this? Sorry if the question is too simple, but I am currently stuck....
IIUC, you can use:
# Update
out = (df.assign(datetime=df['datetime'].dt.strftime('%m-%d %H:%M:%S'),
year=df['datetime'].dt.year.values)
.set_index(['datetime', 'year']).mean(axis=1)
.unstack('year'))
print(out)
# Alternative
# out = (df.set_index('datetime').mean(axis=1).to_frame('mean')
# .assign(datetime=df['datetime'].dt.strftime('%m-%d %H:%M:%S').values,
# year=df['datetime'].dt.year.values)
# .pivot(index='datetime', columns='year', values='mean'))
# Output
year 2015 2016 2017
datetime
01-01 00:00:00 259.000000 420.000000 263.333333
01-01 01:00:00 263.000000 205.333333 169.000000
01-01 02:00:00 342.000000 268.000000 302.000000
01-01 03:00:00 63.000000 243.000000 220.000000
01-01 04:00:00 299.333333 282.666667 421.666667
... ... ... ...
12-31 19:00:00 82.666667 215.000000 84.333333
12-31 20:00:00 316.000000 367.000000 237.666667
12-31 21:00:00 319.666667 170.666667 275.666667
12-31 22:00:00 119.666667 263.666667 325.333333
12-31 23:00:00 252.666667 300.000000 94.666667
[8784 rows x 3 columns]
Setup:
import pandas as pd
import numpy as np
np.random.seed(2022)
dti = pd.date_range('2015-01-01', '2017-12-31 23:00:00', freq='H', name='datetime')
df = pd.DataFrame(np.random.randint(1, 500, (len(dti), 3)),
index=dti, columns=list('ABC')).reset_index()
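If "ignoring the leap year" means every year should end up with 8760 rows, one way (my reading of that requirement, not something the answer above does) is to drop Feb 29 before building out:
df = df[~((df['datetime'].dt.month == 2) & (df['datetime'].dt.day == 29))].reset_index(drop=True)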
I would start by creating a new column for the year in the original data frame
input_data['year'] = input_data['datetime'].dt.year
Then I would use the groupby method with a for loop to calculate the means as follows:
output = pd.DataFrame()
output['datetime'] = input_data['datetime']
for name, group in input_data.groupby('year'):
    group = group.drop(['year', 'datetime'], axis=1)
    output[name] = group.mean(axis=1).reset_index(drop=True)
Output image
That being said, I am making an assumption here, based on your question, that the leap year is to be ignored and that all years have the same format and number of samples. If you have any further questions, or the years don't have the same number of samples, please tell me.
I have a large dataset with ~9 million rows and 4 columns, one of which is a UTC timestamp. Data in this set has been recorded from 507 sites across Australia, and there is a site ID column. I have another dataset that has the timezones for each site ID in the format 'Australia/Brisbane'. I've written a function to create a new column in the main dataset that is the UTC timestamp converted to the local time. However, the wrong new time is being matched up with the UTC timestamp, for example 2019-01-05 12:10:00+00:00 and 2019-01-13 18:55:00+11:00 (wrong timezone). I believe that sites are not mixed up in the data, but I've tried sorting the data in case that was the problem. Below is my code and images of the first row of each dataset; any help is much appreciated!
import pytz
from dateutil import tz
def update_timezone(df):
newtimes = []
df = df.sort_values('site_id')
sites = df['site_id'].unique().tolist()
for site in sites:
timezone = solarbom.loc[solarbom['site_id'] == site].iloc[0, 1]
dfsub = df[df['site_id'] == site].copy()
dfsub['utc_timestamp'] = dfsub['utc_timestamp'].dt.tz_convert(timezone)
newtimes.extend(dfsub['utc_timestamp'].tolist())
df['newtimes'] = newtimes
Main large dataset
Site info dataset
IIUC, you're looking to group your data by ID, then convert the timestamp specific to each ID. You could achieve this by using groupby, then applying a converter function to each group. Ex:
import pandas as pd
# dummy data:
df = pd.DataFrame({'utc_timestamp': [pd.Timestamp("2022-01-01 00:00 Z"),
pd.Timestamp("2022-01-01 01:00 Z"),
pd.Timestamp("2022-01-05 00:00 Z"),
pd.Timestamp("2022-01-03 00:00 Z"),
pd.Timestamp("2022-01-03 01:00 Z"),
pd.Timestamp("2022-01-03 02:00 Z")],
'site_id': [1, 1, 5, 3, 3, 3],
'values': [11, 11, 55, 33, 33, 33]})
# time zone info for each ID:
timezdf = pd.DataFrame({'site_id': [1, 3, 5],
'timezone_id_x': ["Australia/Adelaide", "Australia/Perth", "Australia/Darwin"]})
### what we want:
# for row, data in timezdf.iterrows():
# print(f"ID: {data['site_id']}, tz: {data['timezone_id_x']}")
# print(pd.Timestamp("2022-01-01 00:00 Z"), "to", pd.Timestamp("2022-01-01 00:00 Z").tz_convert(data['timezone_id_x']))
# ID: 1, tz: Australia/Adelaide
# 2022-01-01 00:00:00+00:00 to 2022-01-01 10:30:00+10:30
# ID: 3, tz: Australia/Perth
# 2022-01-01 00:00:00+00:00 to 2022-01-01 08:00:00+08:00
# ID: 5, tz: Australia/Darwin
# 2022-01-01 00:00:00+00:00 to 2022-01-01 09:30:00+09:30
###
def converter(group, timezdf):
# get the time zone by looking for the current group ID in timezdf
z = timezdf.loc[timezdf["site_id"] == group["site_id"].iloc[0], 'timezone_id_x'].iloc[0]
group["localtime"] = group["localtime"].dt.tz_convert(z)
return group
df["localtime"] = df["utc_timestamp"]
df = df.groupby("site_id").apply(lambda g: converter(g, timezdf))
now df looks like
df
Out[71]:
utc_timestamp site_id values localtime
0 2022-01-01 00:00:00+00:00 1 11 2022-01-01 10:30:00+10:30
1 2022-01-01 01:00:00+00:00 1 11 2022-01-01 11:30:00+10:30
2 2022-01-05 00:00:00+00:00 5 55 2022-01-05 09:30:00+09:30
3 2022-01-03 00:00:00+00:00 3 33 2022-01-03 08:00:00+08:00
4 2022-01-03 01:00:00+00:00 3 33 2022-01-03 09:00:00+08:00
5 2022-01-03 02:00:00+00:00 3 33 2022-01-03 10:00:00+08:00
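An alternative sketch using the same dummy frames, in case you prefer to avoid groupby().apply: build a site_id -> time zone mapping and convert one group at a time, then put the pieces back together.
tz_map = dict(zip(timezdf['site_id'], timezdf['timezone_id_x']))
parts = []
for site, group in df.groupby('site_id'):
    g = group.copy()
    g['localtime'] = g['utc_timestamp'].dt.tz_convert(tz_map[site])
    parts.append(g)
df = pd.concat(parts).sort_index()  # restore the original row order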
I have a time series of visibility data that contains half-hourly measurements of visibility. A fog event starts when the visibility falls below 1 km and ends when the visibility exceeds 1 km. Please find the code attached below. I intend to find the number of such fog events and the duration of each fog event.
from IPython.display import display
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import files
uploaded = files.upload()
import io
df = pd.read_csv(io.BytesIO(uploaded['visibility.csv']))
df.set_index('Unnamed: 0',inplace=True)
df.index = pd.to_datetime(df.index)
df=df.interpolate(method='linear', limit_direction='forward')
display(df)
Unnamed: 0 visibility_km
2016-01-01 00:00:00 0.595456
2016-01-01 00:30:00 0.595456
2016-01-01 01:00:00 0.595456
2016-01-01 01:30:00 0.595456
2016-01-01 02:00:00 0.595456
... ...
2020-12-31 21:30:00 0.925370
2020-12-31 22:00:00 0.901230
2020-12-31 22:30:00 0.804670
2020-12-31 23:00:00 0.804670
2020-12-31 23:30:00 0.692016
# FOG Events
fog_events=df[df<1.0].count()
print('no. of fog events',fog_events)
no. of fog events 10318
But it simply gives the number of times the visibility drops below 1 km and not the number of fog events.
You can create sample time series data like this:
import pandas as pd
tdf = pd.DataFrame({'Time':pd.date_range(start='1/1/2016', periods=11, freq='30s'),
'Visibility_km': [0.56, 0.75, 0.99, 1.01, 1.1, 1.3, 0.5, 0.6, 0.7, 1.2, 1.3]})
Data in this format makes it easier to copy and paste your problem. To get the total number of fog events and their durations, start by creating a column for the events and one to mark when the event starts and ends
# Create column to mark duration of events
tdf['fog_event'] = (tdf['Visibility_km'] < 1.).astype(int)
# Create column to mark event start and end
tdf['event_diff'] = tdf['fog_event'] != tdf['fog_event'].shift(1)
print(tdf)
Time Visibility_km fog_event event_diff
0 2016-01-01 00:00:00 0.56 1 True
1 2016-01-01 00:00:30 0.75 1 False
2 2016-01-01 00:01:00 0.99 1 False
3 2016-01-01 00:01:30 1.01 0 True
4 2016-01-01 00:02:00 1.10 0 False
5 2016-01-01 00:02:30 1.30 0 False
6 2016-01-01 00:03:00 0.50 1 True
7 2016-01-01 00:03:30 0.60 1 False
8 2016-01-01 00:04:00 0.70 1 False
9 2016-01-01 00:04:30 1.20 0 True
10 2016-01-01 00:05:00 1.30 0 False
Now you can get the events in two ways:
The first way doesn't use Pandas and was the original way I grouped the events.
from itertools import groupby
import numpy as np
groups = [list(g) for _, g in groupby(tdf.fog_event.values)]
fog_durations = np.array([sum(g) for g in groups])
duration_each_event = fog_durations[fog_durations != 0]
total_fog_events = sum(fog_durations != 0)
print(duration_each_event)
array([3, 3])
print(total_fog_events)
2
To do it using Pandas, you can group by the cumulative sum of the event difference
fdf = tdf.groupby([tdf['event_diff'].cumsum(), 'fog_event']).size()
fdf = fdf.reset_index(name = 'duration').rename(columns = {'event_diff': 'index'})
duration_each_event = fdf.loc[fdf['fog_event'] == 1, 'duration'].values
total_fog_events = fdf.loc[fdf['fog_event'] == 1, 'fog_event'].sum()
print(duration_each_event)
[3, 3]
print(total_fog_events)
2
Assuming the time interval between measurements doesn't change (i.e. always measured 30 seconds apart), you can multiply duration_each_event by 30 (for seconds) or 0.5 (for minutes) to get the duration in time units.
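Applied to the original half-hourly frame from the question, the same steps look like this (a sketch; I'm assuming the column is named visibility_km as in the displayed frame and that samples are spaced 30 minutes apart):
df['fog_event'] = (df['visibility_km'] < 1.0).astype(int)
df['event_diff'] = df['fog_event'] != df['fog_event'].shift(1)
events = df.groupby([df['event_diff'].cumsum(), 'fog_event']).size().reset_index(name='n_samples')
fog = events[events['fog_event'] == 1]
total_fog_events = len(fog)
duration_hours = fog['n_samples'] * 0.5  # half-hourly samples -> hours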
I've got a time series of intermittent daily data like this.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08','2020-01-08','2020-01-10','2020-01-13','2020-01-15'],
'Price': [200, 324, 320, 421, 240, np.NaN, 500, 520, 531, np.NaN, 571, np.NaN]})
df['Date']= pd.to_datetime(df['Date'])
df.set_index('Date')
df
Result:
+------------+-------+
| Date | Price |
+------------+-------+
| 2020-01-01 | 200 |
+------------+-------+
| 2020-01-02 | 324 |
+------------+-------+
| 2020-01-02 | 320 | -- 1st duplicate for 2020-01-02
+------------+-------+
| 2020-01-02 | 421 | -- 2nd duplicate for 2020-01-02
+------------+-------+
| 2020-01-03 | 240 |
+------------+-------+
| 2020-01-04 | NaN |
+------------+-------+
| 2020-01-07 | 500 |
+------------+-------+
| 2020-01-08 | 520 |
+------------+-------+
| 2020-01-08 | 531 | -- 1st duplicate for 2020-01-08
+------------+-------+
| 2020-01-10 | NaN |
+------------+-------+
| 2020-01-13 | 571 |
+------------+-------+
| 2020-01-15 | NaN |
+------------+-------+
I need to fill the NaN values with prices from the nearest available date where there is more than one price recorded (a duplicate), i.e.
320 should be moved from 2020-01-02 to 2020-01-04
421 from 2020-01-02 to 2020-01-10
531 from 2020-01-08 to 2020-01-15
Here is a Pandas solution, step by step
First, we group Price by Date and put the values in a list for each date, which we then unwrap into separate columns and rename:
df2 = (
df.groupby('Date')['Price']
.apply(list)
.apply(pd.Series)
.rename(columns = {0:'Price',1:'Other'})
)
df2
so we get
Price Other
Date
2020-01-01 200.0 NaN
2020-01-02 324.0 320.0
2020-01-03 240.0 NaN
2020-01-04 NaN NaN
2020-01-07 500.0 NaN
2020-01-08 520.0 NaN
Here Price has the first price for that date, and Other the second price for that date, if available.
Now we ffill() Other, so that second values are propagated forward until a new second value appears:
df2['Other'] = df2['Other'].ffill()
so we get
Price Other
Date
2020-01-01 200.0 NaN
2020-01-02 324.0 320.0
2020-01-03 240.0 320.0
2020-01-04 NaN 320.0
2020-01-07 500.0 320.0
2020-01-08 520.0 320.0
Now we can replace NaNs in the Price column with the values from the Other column, and drop Other:
df2['Price'] = df2['Price'].fillna(df2['Other'])
df2.drop(columns = ['Other'], inplace = True)
df2
to get
Price
Date
2020-01-01 200.0
2020-01-02 324.0
2020-01-03 240.0
2020-01-04 320.0
2020-01-07 500.0
2020-01-08 520.0
I have never been good with pandas, so I treat the frame more or less as a 2D array (thus there might be more efficient ways to do this with pandas; my attempt at that is the 2nd solution).
So, the idea is:
Loop your frame row-by-row
Always keep a pointer to the previous row so you can compare Dates (and values)
When a duplicate Date is found, set last_dup_row_index, which always points to the latest "movable" row (see inline comments for edge cases)
While iterating, if you hit a missing Price, add a "move" to to_move_indexes, a list of (from, to) index tuples of moves that can be performed
At the end of the above loop you have all you need to modify your frame:
The possible price moves
The indexes you move from so you can delete those rows if you want to
The code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08'],
'Price': [200, 324, 320, 240, np.NaN, 500, 520]})
df['Date']= pd.to_datetime(df['Date'])
df.set_index('Date')
prev_row = None
last_dup_row_index = None
to_move_indexes = []
for index, row in df.iterrows():
# Check if we have a None in the current row and
# a duplicate row waiting for us
if pd.isna(row['Price']) and last_dup_row_index is not None:
print(f"To move price from {last_dup_row_index} to {index}")
to_move_indexes.append((last_dup_row_index, index))
# Check if this row and the prev one have the
# same date
if prev_row is not None and prev_row['Date'] == row['Date']:
# There is a case where for the same Date you have
# two entries out of which one is NaN. Here use the
# other one
if not pd.isna(row['Price']):
print(f"Setting duplicate to: idx={index},\n{row}")
last_dup_row_index = index
elif not pd.isna(prev_row['Price']):
print(f"Setting duplicate to: idx={index - 1},\n{prev_row}")
last_dup_row_index = index - 1
else:
# There is an edge case where two NaNs follow each
# other - not changing last duplicate
print(f"Warning: two NaN #{row['Date']}")
prev_row = row
print(to_move_indexes)
# Perform moves
for from_idx, to_idx in to_move_indexes:
df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
print("\nFrame after moves:")
print(df)
# Perform deletes if you need to
df.drop([x for x, _ in to_move_indexes], inplace=True)
print("\nFrame after deletes:")
print(df)
And the output you get is:
# Here we detected that row index 2 is a duplicate (with index 1)
Setting duplicate to: idx=2,
Date 2020-01-02 00:00:00
Price 320
Name: 2, dtype: object
# Here we see that row index 4 is missing Price. However
# we have the previous duplicate (2) waiting for us so we
# add a "move" as (2, 4) to our list
To move price from 2 to 4
# The final list is
[(2, 4)]
Frame after moves:
Date Price
0 2020-01-01 200.0
1 2020-01-02 324.0
2 2020-01-02 320.0
3 2020-01-03 240.0
4 2020-01-04 320.0
5 2020-01-07 500.0
6 2020-01-08 520.0
Frame after deletes:
Date Price
0 2020-01-01 200.0
1 2020-01-02 324.0
3 2020-01-03 240.0
4 2020-01-04 320.0
5 2020-01-07 500.0
6 2020-01-08 520.0
UPDATE: Second way
# Calculate diff on dates column and keep the
# ones that are same (returns series)
dups = df.Date.diff() == "0 days"
# A cryptic (for me) way to get all the indexes
# where the value is True
dup_indexes = dups.index[dups].to_list()
# Now get the indexes where the Price is NaN
nans = pd.isnull(df).any(axis=1)
nan_indexes = nans.index[nans].to_list()
# Create moves: the nan_index should be greater than
# the dup_index but as close as possible
moves = []
for nan_index in nan_indexes:
    # dup_indexes are sorted, so take the last one
    # smaller than the nan_index (if there is one)
    earlier = [x for x in dup_indexes if x < nan_index]
    if earlier:
        moves.append((earlier[-1], nan_index))
# Do moves and deletes
for from_idx, to_idx in moves:
df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
df.drop([x for x, _ in moves], inplace=True)
print(df)
How to tackle it, and the intuition behind ffill
What you are looking for is a method called forward fill. Forward fill locates a null value and then checks whether there is a valid value before it. If so, it uses that value.
To understand more on how to apply the method on your data, please check the documentation of pandas fillna here. It is detailed and provides examples, take a careful look at them and understand what each argument does.
Note that if the previous value is also NaN, ffill won't change it (obviously).
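A minimal illustration of that behaviour on a toy series (not the frame from the question):
import pandas as pd
s = pd.Series([200.0, None, 320.0, None, None])
print(s.ffill().tolist())  # [200.0, 200.0, 320.0, 320.0, 320.0]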
Pseudo code
Since you have changed the data, I can only offer pseudo-code (see the sketch after this list):
First, collect all the missing rows in your table using df[df.Price.isnull()].
Then, for each missing value, check whether there are duplicate dates prior to it.
If so, choose the closest duplicate and use its price; otherwise keep it NaN.
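A rough sketch of that pseudo-code against the frame from the question. It hands the earliest not-yet-used duplicate price to each later gap, which reproduces the mapping asked for (320 -> 2020-01-04, 421 -> 2020-01-10, 531 -> 2020-01-15); treat the "earliest unused" rule as my assumption about what "closest" should mean here:
dups = df[df.duplicated(subset='Date', keep='first')]  # the extra prices per date
used = []
for idx in df.index[df['Price'].isna()]:
    candidates = dups[(dups['Date'] < df.at[idx, 'Date']) & (~dups.index.isin(used))]
    if not candidates.empty:
        src = candidates.index[0]  # earliest duplicate not used yet
        df.at[idx, 'Price'] = df.at[src, 'Price']
        used.append(src)
df = df.drop(used)  # drop the rows whose prices were moved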
I am looking at this drone rental dataset on my python discovery journey and was trying to GroupBy the Result column to show how much each drone made in each month.
I could usually do this if the result was associated with a particular date, but as this is a longer-term rental business I need to work out how much of the result is attributable to each month between the start and end dates.
+-------+------------------+------------------+--------+
| Drone | Start            | End              | Result |
+-------+------------------+------------------+--------+
| DR1   | 16/06/2013 10:30 | 22/08/2013 07:00 |   2786 |
| DR1   | 20/04/2013 23:30 | 16/06/2013 10:30 |   7126 |
| DR1   | 24/01/2013 23:00 | 20/04/2013 23:30 |   2964 |
| DR2   | 01/03/2014 19:00 | 07/05/2014 18:00 |   8884 |
| DR2   | 04/09/2015 09:00 | 04/11/2015 07:00 |   7828 |
| DR2   | 04/10/2013 05:00 | 24/12/2013 07:00 |   5700 |
+-------+------------------+------------------+--------+
I was able to find the difference in the dates using this:
import datetime
from dateutil.relativedelta import relativedelta
df.Start = pd.to_datetime(df.Start)
df.End = pd.to_datetime(df.End)
a = df.loc[0, 'Start']
b = df.loc[0, 'End']
relativedelta(a,b)
However the output prints out as such:
relativedelta(months=-2, days=-5, hours=-20, minutes=-30)
and I can't use this to calculate the cash attributable like I would if the dataset had one date, using a GroupBy
df.groupby(['Device', 'Date']).agg(sum)['Result']
I would appreciate some help on the correct thought process for approaching a problem like this and what the code would look like.
Taking the first example from each drone type, my expected output would be:
+-------+--------+------+--------+
| Drone | Month  | Days | Result |
+-------+--------+------+--------+
| DR1   | June   | X    | $YY    |
| DR1   | July   | X    | $YY    |
| DR1   | August | X    | $YY    |
| DR2   | March  | Y    | $ZZ    |
| DR2   | April  | Y    | $ZZ    |
| DR2   | May    | Y    | $ZZ    |
+-------+--------+------+--------+
Thanks
This is a loopy solution, but I think it does what you want.
# Just load the sample data
from io import StringIO
data = 'Drone,Start,End,Result\n' + \
'DR1,16/06/2013 10:30,22/08/2013 07:00,2786\n' + \
'DR1,20/04/2013 23:30,16/06/2013 10:30,7126\n' + \
'DR1,24/01/2013 23:00,20/04/2013 23:30,2964\n' + \
'DR2,01/03/2014 19:00,07/05/2014 18:00,8884\n' + \
'DR2,04/09/2015 09:00,04/11/2015 07:00,7828\n' + \
'DR2,04/10/2013 05:00,24/12/2013 07:00,5700\n'
stream = StringIO(data)
# Actual solution
import pandas as pd
from datetime import datetime
df = pd.read_csv(stream, sep=',', parse_dates=[1, 2])
def get_month_spans(row):
month_spans = []
start = row['Start']
total_delta = (row['End'] - row['Start']).total_seconds()
while row['End'] > start:
if start.month != 12:
end = datetime(year=start.year, month=start.month+1, day=1)
else:
end = datetime(year=start.year+1, month=1, day=1)
if end > row['End']:
end = row['End']
delta = (end - start).total_seconds()
proportional = row['Result'] * (delta / total_delta)
month_spans.append({'Drone': row['Drone'],
'Month': datetime(year=start.year,
month=start.month,
day=1),
'Result': proportional,
'Days': delta / (24 * 3600)})
start = end
print(delta)
return month_spans
month_spans = []
for index, row in df.iterrows():
month_spans += get_month_spans(row)
monthly = pd.DataFrame(month_spans).groupby(['Drone', 'Month']).agg(sum)[['Result', 'Days']]
print(monthly)
Which outputs how much each drone made each month along with the number of days:
Result Days
Drone Month
DR1 2013-01-01 242.633083 7.041667
2013-02-01 964.789537 28.000000
2013-03-01 1068.159845 31.000000
2013-04-01 1953.216797 30.000000
2013-05-01 3912.726199 31.000000
2013-06-01 2555.334620 30.000000
2013-07-01 1291.856653 31.000000
2013-08-01 887.283266 21.291667
DR2 2013-04-01 459.202454 20.791667
2013-05-01 684.662577 31.000000
2013-06-01 662.576687 30.000000
2013-07-01 684.662577 31.000000
2013-08-01 684.662577 31.000000
2013-09-01 662.576687 30.000000
2013-10-01 684.662577 31.000000
2013-11-01 662.576687 30.000000
2013-12-01 514.417178 23.291667
2014-01-01 1369.726258 28.208333
2014-02-01 1359.610112 28.000000
2014-03-01 1505.282624 31.000000
2014-04-01 1456.725120 30.000000
2014-05-01 1505.282624 31.000000
2014-06-01 1456.725120 30.000000
2014-07-01 230.648144 4.750000
2015-04-01 7828.000000 1.916667
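One caveat, not part of the calculation above: the Start/End strings in the question look day-first (e.g. 24/01/2013), so you probably want to tell read_csv that explicitly; otherwise ambiguous dates such as 04/09/2015 are parsed month-first, which is why DR2's two-month 2015 rental collapses into a single ~1.9-day span in April 2015 in the output above. With day-first parsing the monthly breakdown changes accordingly.
df = pd.read_csv(StringIO(data), sep=',', parse_dates=[1, 2], dayfirst=True)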