Python function that mimics the distribution of my dataset

I have a food order dataset that looks like this, with a few thousand orders over the span of a few months:
Date                  Item Name       Price
2021-10-09 07:10:00   Water Bottle    1.5
2021-10-09 12:31:00   Pizza           12
2021-10-09 17:07:56   Chocolate bar   3
Those orders are time-dependent. Nobody will eat a pizza at midnight, usually. There will be more 3PM Sunday orders than 3PM Monday orders (because people are at work). I want to extract the daily order distribution for each weekday (Monday through Sunday) from those few thousand orders so I can generate new orders later that fit this distribution. I do not want to fill in the gaps in my dataset.
How can I do so?
I want to create a generate_order_date() function that would generate a random hours:minutes:seconds depending on the day. I can already identify which weekday a date corresponds to. I just need to extract the 7 daily distributions so I can call my function like this:
generate_order_date(day=Monday, nb_orders=1)
[12:30:00]
generate_order_date(day=Friday, nb_orders=5)
[12:30:00, 07:23:32, 13:12:09, 19:15:23, 11:44:59]
Generated timestamps do not have to be in chronological order, just as if I were calling
np.random.normal(mu, sigma, 1000)

Try np.histogram(data)
https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
The first return value gives you the counts (pass density=True for a normalised density), which would be your distribution. You can visualise it with
plt.plot(np.histogram(data)[0])
data here would be the time of day a particular item was ordered. For this approach, I would suggest rounding your times to 5-minute intervals or more, depending on the frequency. For example, round 12:34pm down to 12:30pm and 12:36pm down to 12:35pm. Choose a suitable bin width.
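A rough sketch of how that could be wired into the requested generate_order_date (the helper names, the 5-minute bins, and the assumption that df.Date already holds datetimes and that every weekday has orders are mine, not the answer's):

import numpy as np
import pandas as pd

def weekday_histograms(df: pd.DataFrame, bin_minutes: int = 5) -> dict:
    # per-weekday histogram of order times, binned into bin_minutes slots
    seconds = df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
    edges = np.arange(0, 86400 + 1, bin_minutes * 60)
    return {
        weekday: np.histogram(seconds[df.Date.dt.weekday == weekday], bins=edges)
        for weekday in range(7)
    }

def sample_times(histograms: dict, weekday: int, nb_orders: int = 1) -> list:
    counts, edges = histograms[weekday]
    probs = counts / counts.sum()  # empirical probability of each time slot
    starts = np.random.choice(edges[:-1], size=nb_orders, p=probs)
    offsets = np.random.randint(0, int(edges[1] - edges[0]), size=nb_orders)
    return [pd.Timestamp(int(s), unit="s").strftime("%H:%M:%S") for s in starts + offsets]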
Another method would be scipy.stats.gaussian_kde. This would use a Gaussian kernel. Below is an implementation I have used previously:
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def get_kde(df: pd.DataFrame) -> list:
    xs = np.round(np.linspace(-1, 1, 3000), 3)  # evaluation grid; adjust the range to your data
    kde = gaussian_kde(df.values.ravel())  # flatten a single-column frame to 1-D
    kde_vals = np.round(kde(xs), 3)
    # keep only the grid points where the estimated density is non-negligible
    data = [[xs[i], kde_vals[i]] for i in range(len(xs)) if kde_vals[i] > 0.1]
    return data
where df.values is your data. There are plenty more kernels which you can use to get a density estimate. The most suitable one depends on the nature of your data.
Also see https://en.wikipedia.org/wiki/Kernel_density_estimation#Statistical_implementation
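A rough sketch of how the scipy KDE could feed the requested generate_order_date directly (the names, the seconds-since-midnight encoding, and the assumption that each weekday has enough orders to fit a KDE are my own):

import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def fit_weekday_kdes(df: pd.DataFrame) -> dict:
    # one KDE per weekday, fitted on the order time in seconds since midnight
    seconds = df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
    return {
        weekday: gaussian_kde(seconds[df.Date.dt.weekday == weekday].values)
        for weekday in range(7)
    }

def generate_order_date(kdes: dict, weekday: int, nb_orders: int = 1) -> list:
    # draw from the fitted density and clamp the draws to one day
    samples = kdes[weekday].resample(nb_orders)[0]
    samples = np.clip(samples, 0, 86399).astype(int)
    return [pd.Timestamp(int(s), unit="s").strftime("%H:%M:%S") for s in samples]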

Here's a rough sketch of what you could do:
Assumption: the column Date of the DataFrame df contains datetimes. If not, do df.Date = pd.to_datetime(df.Date) first.
First 2 steps:
import pickle
from datetime import datetime, timedelta

import numpy as np
from sklearn.neighbors import KernelDensity

# STEP 1: Data preparation
df["Seconds"] = (
    df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
) / 86400
data = {
    day: sdf.Seconds.to_numpy()[:, np.newaxis]
    for day, sdf in df.groupby(df.Date.dt.weekday)
}

# STEP 2: Kernel density estimation
kdes = {
    day: KernelDensity(bandwidth=0.1).fit(X) for day, X in data.items()
}
with open("kdes.pkl", "wb") as file:
    pickle.dump(kdes, file)
STEP 1: Build a normalised column Seconds (scaled to between 0 and 1). Then group by weekday (numbered 0, ..., 6) and, for every day of the week, prepare the data for the kernel density estimation.
STEP 2: Estimate the kernel densities for every day of the week with KernelDensity from Scikit-learn and pickle the results.
Based on these estimates build the desired sample function:
# STEP 3: Sampling
with open("kdes.pkl", "rb") as file:
    kdes = pickle.load(file)

def generate_order_date(day, orders):
    fmt = "%H:%M:%S"
    base = datetime(year=2022, month=1, day=1)
    kde = kdes[day]
    # clip, because KDE samples can stray slightly outside [0, 1]
    samples = kde.sample(orders)[:, 0].clip(0.0, 1.0)
    return [
        (base + timedelta(seconds=int(s * 86399))).time().strftime(fmt)
        for s in samples
    ]
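A call would then look something like this (the day keys are pandas weekday numbers, 0 = Monday; the times shown are just illustrative):

generate_order_date(0, 3)
# e.g. ['08:03:12', '12:17:41', '19:46:05'] -- different values on every call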
I won't pretend that this is anywhere near perfect. But maybe you could use it as a starting point.

Related

Poisson in sales

I see that Poisson is often used to estimate the number of sales in a certain time period (month, for example).
from scipy import stats

monthly_average_sales = 30
current_month_sales = 35
mu = monthly_average_sales
x = current_month_sales
up_to_35 = stats.poisson.pmf(x, mu)
above_35 = 1 - up_to_35
Suppose I want to estimate the probability that a specific order will close this month. Is this possible? For example, today is the 15th. If a customer initially called me on the 1st of the month, what is the probability that they will place the order before the month is over? They might place the order tomorrow (the 16th) or on the last day of the month. I don't care when, as long as it's by the end of this month.
from scipy import stats

monthly_average_sales = 30
current_sale_days_open = 15
number_of_days_this_month = 31
equivalent_number_of_sales = number_of_days_this_month / current_sale_days_open
mu = monthly_average_sales
x = equivalent_number_of_sales
up_to_days_open = stats.poisson.pmf(x, mu)
above_days_open = 1 - up_to_days_open
I don't want to abuse statistics to the point that they become meaningless (I'm not a politician!). Am I going about this the right way?
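One note on the code above: poisson.pmf(x, mu) is the probability of exactly x sales, so a quantity like up_to_35 would normally come from the cumulative distribution instead. A minimal sketch of the distinction:

from scipy import stats

mu = 30                                 # monthly average sales
exactly_35 = stats.poisson.pmf(35, mu)  # P(X = 35)
up_to_35 = stats.poisson.cdf(35, mu)    # P(X <= 35)
above_35 = 1 - up_to_35                 # P(X > 35)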

How to groupby a custom time range in xarray?

I have a DataArray with date, x, and y coordinate dims.
date: 265, y: 1458, x: 1159
For each year, over the course of 35 years, there are about 10 arrays with a date and x and y dims.
I'd like to groupby a custom annual season to integrate values for each raster over that custom annual season (maybe October to April for each year). The xarray docs show that I can do something like this:
arr.groupby("date.season")
which results in
DataArrayGroupBy, grouped over 'season'
4 groups with labels 'DJF', 'JJA', 'MAM', 'SON'.
But this groups by season over the entire 35 year record and I can't control the start and ending months that are used to group.
Similarly this does not quite get at what I want:
all_eeflux_arr.groupby("date.year")
DataArrayGroupBy, grouped over 'year'
36 groups with labels 1985, 1986, 1987, ..., 2019, 2020.
The start and the ending date for each year is automatically January/December.
I'd really like to be able to groupby an arbitrary period with a start time and an end time in units of day of year.
If this grouping can also discard dates that fall outside the grouping window, even better, since I may want to group by a day of year range that doesn't span all months.
The DataArray.resample method seems to be able to select a custom offset (see this SO post) for the start of the year, but I can't figure out how to access this with groupby. I can't use resample because it does not return a DataArray and I need to call the .integrate xarray method on each group DataArray (across the date dim, to get custom annual totals).
Data to reproduce (2Gb, but can be tested on a subset):
https://ucsb.box.com/s/zhjxkqje18m61rivv1reig2ixiigohk0
Code to reproduce
import rioxarray as rio
import xarray as xr
import numpy as np
from pathlib import Path
from datetime import datetime

all_scenes_f_et = Path('/home/serdp/rhone/rhone-ecostress/rasters/eeflux/PDR')
all_pdr_et_paths = list(all_scenes_f_et.glob("*.tif"))

def eeflux_path_date(path):
    year, month, day, _ = path.name.split("_")
    return datetime(int(year), int(month), int(day))

def open_eeflux(path, da_for_match):
    data_array = rio.open_rasterio(path)  # passing chunks would make it lazily executed
    data_array = data_array.rio.reproject_match(da_for_match)
    data_array = data_array.sel(band=1).drop("band")  # drop the old band coordinate since bands need unique coord ids
    data_array["date"] = eeflux_path_date(path)  # makes a new coordinate
    return data_array.expand_dims({"date": 1})  # makes this coordinate a dimension

da_for_match = rio.open_rasterio(all_pdr_et_paths[0])
daily_eeflux_arrs = [open_eeflux(path, da_for_match) for path in all_pdr_et_paths]
all_eeflux_arr = xr.concat(daily_eeflux_arrs, dim="date")
all_eeflux_arr = all_eeflux_arr.sortby("date")

### not sure what should go here
all_eeflux_arr.groupby(????????).integrate(dim="date", datetime_unit="D")
Advice is much appreciated!
I ended up writing a function that works well enough. Since my dataset isn't that large the integration doesn't take very long to run in a for loop that iterates over each group.
def group_by_custom_doy(all_eeflux_arr, doy_start, doy_end):
    ey = max(all_eeflux_arr['date.year'].values)
    sy = min(all_eeflux_arr['date.year'].values)
    start_years = range(sy, ey)
    end_years = range(sy + 1, ey + 1)
    start_end_years = list(zip(start_years, end_years))
    water_year_arrs = []
    for water_year in start_end_years:
        # days after doy_start in the first year of the pair ...
        start_mask = ((all_eeflux_arr['date.dayofyear'].values > doy_start) &
                      (all_eeflux_arr['date.year'].values == water_year[0]))
        # ... combined with days before doy_end in the following year
        end_mask = ((all_eeflux_arr['date.dayofyear'].values < doy_end) &
                    (all_eeflux_arr['date.year'].values == water_year[1]))
        water_year_arrs.append(all_eeflux_arr[start_mask | end_mask])
    return water_year_arrs
water_year_arrs = group_by_custom_doy(all_eeflux_arr, 125, 300)
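From here, the integration from the question can run on each group; a minimal sketch of that loop (my addition, using the integrate call from the question positionally so it works across xarray versions):

# integrate each custom-season group over the date dimension (in days)
# to get one annual total raster per "water year"
annual_totals = [
    arr.integrate("date", datetime_unit="D") for arr in water_year_arrs
]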

Resample to Pandas DataFrame to Hourly using Hour as mid-point

I have a data frame with temperature measurements at a frequency of 5 minutes. I would like to resample this dataset to find the mean temperature per hour.
This is typically done using df['temps'].resample('H', how='mean') but this averages all values that fall within the hour - using all times where '12' is the hour, for example. I want something that gets all values from 30 minutes either side of the hour (or times nearest to the actual hour) and finds the mean that way. In other words, for the resampled time step of 1200, use all temperature values from 1130 to 1230 to calculate the mean.
Example code below to create a test data frame:
index = pd.date_range('1/1/2000', periods=200, freq='5min')
temps = pd.Series(range(200), index=index)
df = pd.DataFrame(index=index)
df['temps'] = temps
Can this be done using the built-in resample method? I'm sure I've done it before using pandas but cannot find any reference to it.
It seems you need:
print (df['temps'].shift(freq='30Min').resample('H').mean())
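The shift moves every timestamp forward by half an hour, so the values originally stamped 11:30-12:25 land in the 12:00 bin. A quick check on the example frame from the question:

import pandas as pd

index = pd.date_range('1/1/2000', periods=200, freq='5min')
temps = pd.Series(range(200), index=index)

hourly = temps.shift(freq='30min').resample('H').mean()
print(hourly.head())  # the 01:00 bin now averages the original 00:30-01:25 values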

How can I extract a subset of months from all NetCDF files in one directory

I need to calculate the 90th percentile based on temperature data for 1961-1990. I have 30 NetCDF files, and every file includes daily data for one year. I need to calculate the 90th percentile for a specific lat/lon while considering just the summer days out of all 30 years of daily data. I also need to account for the years when February has 29 days. When I run my code it just considers the first summer (summer 1961) and cannot pool all the summer days together.
import numpy as np
import xarray as xr

data = xr.open_mfdataset('/Tmax-2m/Control/*.nc')
time = data.variables['time']
lon = data.variables['lon'][:]
lat = data.variables['lat'][:]
tmax = data.variables['tmax'][:]

df = data.sel(lat=39.18, lon=-95.57, method='nearest')
time2 = df.variables['time'][151:243]
dg = df.sel(time=time2, method='nearest')
print(np.percentile(dg.tmax, 90))
I tried this way, but it calculates the percentile for every summer of every year separately:
splits = [151,516,881,1247,1612,1977,2342,2708,3073,3438,3803,4169,4534,4899,5264,5630,5995,6360,6725,7091,7456,7821,8186,8552,8917,9282,9647,10013,10378,10743]
t0 = 92
result = []
for i in splits:
    time3 = df.variables['time'][i:(i + t0)]
    dg = df.sel(time=time3, method='nearest')
    result.append(np.percentile(dg.tmax, 90))
np.savetxt("percentile1.csv", result, fmt="%s")
Did you consider using CDO for this task? (If you are running under Linux this is easy; if you are on Windows, you probably need to install it under Cygwin.)
You can merge the 30 files into one timeseries like this:
cdo mergetime file_y*.nc timeseries.nc
Here the * is a wildcard for the year (1961, 1962, etc.) in the filenames, which I assume are file_y1961.nc, file_y1962.nc, etc. Adapt as appropriate. timeseries.nc is the output file.
and then calculate the seasonal percentiles like this :
cdo yseaspctl,90 timeseries.nc -yseasmin timeseries.nc -yseasmax timeseries.nc percen.nc
percen.nc will have the seasonal percentiles in and you can extract the one for summer.
further details here: https://code.mpimet.mpg.de/projects/cdo/
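If you would rather stay in Python/xarray, a rough sketch of the equivalent selection (my assumption: the time coordinate decodes to datetimes, so the season accessor also handles 29 February automatically):

import numpy as np
import xarray as xr

data = xr.open_mfdataset('/Tmax-2m/Control/*.nc')
point = data.sel(lat=39.18, lon=-95.57, method='nearest')

# keep every June-August day from all 30 years, then take one pooled percentile
summer_mask = point['time'].dt.season == 'JJA'
summer_tmax = point['tmax'].where(summer_mask, drop=True)
print(np.percentile(summer_tmax.values, 90))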

Dataset statistics with custom begin of the year

I would like to do some annual statistics (a cumulative sum) on a daily time series of data in an xarray dataset. The tricky part is that the day on which my considered year begins must be flexible, and the time series contains leap years.
I tried e.g. the following:
rollday = -181
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})
foo_groups = foo.roll(time=rollday).groupby(foo.time.dt.year)
foo_cumsum = foo_groups.apply(lambda x: x.cumsum(dim='time', skipna=True))
which is "unfavorable" mainly because of two things:
(1) the rolling doesn't account for the leap years, so we get an offset of one day per leap year, and
(2) the beginning of the first year (until the end of June) is appended to the end of the rolled time series, which creates a "fake year" where the cumulative sums don't make sense anymore.
I also tried to first cut off the ends of the time series, but then the rolling doesn't work anymore. Resampling did not seem to be an option either, as I could not find a fitting pandas freq string.
I'm sure there is a better/correct way to do this. Can somebody help?
You can use an xarray.DataArray that specifies the groups. One way to do this is to create an array of values (years) that define the group ids:
import numpy as np
import pandas as pd
import xarray as xr

# setup sample data
dr = pd.date_range('2015-01-01', '2017-08-23')
foo = xr.Dataset({'data': (['time'], np.ones(len(dr)))}, coords={'time': dr})

# create an array of years (modify day/month for your use case)
my_years = xr.DataArray(
    [t.year if ((t.month < 9) or ((t.month == 9) and (t.day < 15))) else (t.year + 1)
     for t in foo.indexes['time']],
    dims='time', name='my_years', coords={'time': dr})

# use that array of years (integers) to do the groupby
foo_cumsum = foo.groupby(my_years).apply(lambda x: x.cumsum(dim='time', skipna=True))

# Voila!
foo_cumsum['data'].plot()
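Since the group ids come straight from the calendar dates, leap years are handled automatically. If the custom start date changes often, the group-id construction could be wrapped in a small helper; a sketch (the helper name and parameters are mine, not from the answer above):

def custom_year(times, start_month=9, start_day=15):
    # group id = the "custom year" a timestamp belongs to, for a year that
    # starts on start_month/start_day (tuple comparison does the date test)
    return xr.DataArray(
        [t.year if (t.month, t.day) < (start_month, start_day) else t.year + 1
         for t in times],
        dims='time', name='my_years', coords={'time': times})

foo_cumsum = foo.groupby(custom_year(foo.indexes['time'])).apply(
    lambda x: x.cumsum(dim='time', skipna=True))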
