I see that the Poisson distribution is often used to estimate the number of sales in a certain time period (a month, for example).
from scipy import stats
monthly_average_sales = 30
current_month_sales = 35
mu = monthly_average_sales
x = current_month_sales
up_to_35 = stats.poisson.cdf(x, mu)  # P(X <= 35): use the CDF, not the pmf, via the imported stats module
above_35 = 1 - up_to_35
Suppose I want to estimate the probability that a specific order will close this month. Is this possible? For example, today is the 15th. If a customer initially called me on the 1st of the month, what is the probability that they will place the order before the month is over? They might place the order tomorrow (the 16th) or on the last day of the month. I don't care when, as long as it's by the end of this month.
from scipy import stats
monthly_average_sales = 30
current_sale_days_open = 15
number_of_days_this_month = 31
equivalent_number_of_sales = number_of_days_this_month / current_sale_days_open
mu = monthly_average_sales
x = equivalent_number_of_sales
up_to_days_open = stats.poisson.cdf(x, mu)  # again the CDF, via the imported stats module
above_days_open = 1 - up_to_days_open
I don't want to abuse statistics to the point that they become meaningless (I'm not a politician!). Am I going about this the right way?
I have a database with a number of events for a given date. I want to display a graph showing the number of events each day, displaying the week number on the x-axis and creating a curve for each year.
Doing this the simple way is no problem.
My concern is that I should not display "calendar" years (January 1st to December 31st, in other words ISO weeks 1 to 53), but the so-called "winter" years, grouping the 12-month periods that run from August to the following August.
I wrote the code below to do this. I get a correctly indexed table with the week numbers in order (from 35 to 53, then from 1 to 34), the number of cases ("count"), and a "winter_year" column which lets me group my curves.
Despite all my attempts, the x-axis of the plot is still displayed from 1 to 53...
I have recreated an example with random numbers below.
Can you help me to get the graph I want?
I am also open to any suggestions for improving my code which, I know, is probably not very optimal... I'm still learning Python.
#%%
import pandas
import numpy as np
from datetime import date
def winter_year(date):
    if date.month > 8:
        x = date.year
    else:
        x = date.year - 1
    return "Winter " + str(x) + "-" + str(x + 1)
#%%
np.random.seed(10)
data = pandas.DataFrame()
data["dates"] = pandas.date_range("2017-07-12","2022-08-10")
data["count"] = pandas.Series(np.random.randint(150, size = len(data["dates"])))
data = data.set_index("dates")
print(data)
#%%
data["week"] = data.index.isocalendar().week
data["year"] = data.index.year
data["date"] = data.index.date
data["winter_year"] = data["date"].apply(winter_year)
datapiv = pandas.pivot_table(data,values = ["count"],index = ["week"], columns = ["winter_year"],aggfunc=np.sum)
order_weeks = [i for i in range(35, 54)]
for i in range(1, 35):
    order_weeks.append(i)
datapiv = datapiv.reindex(order_weeks)
datapiv.plot(use_index=True)
Add this line before the plot:
datapiv.index = [str(x) for x in order_weeks]
It would be something like this (converting the index to strings makes matplotlib treat the x-axis as categorical, so the weeks keep the order given in order_weeks instead of being sorted numerically):
...
order_weeks = [i for i in range(35, 54)]
for i in range(1, 35):
    order_weeks.append(i)
datapiv = datapiv.reindex(order_weeks)
# this is the new line
datapiv.index = [str(x) for x in order_weeks]
datapiv.plot(use_index=True)
Output: (the resulting plot, with the x-axis ordered from week 35 to 53 and then 1 to 34)
I have a food order dataset that looks like this, with a few thousand orders over the span of a few months:
Date                 | Item Name     | Price
2021-10-09 07:10:00  | Water Bottle  | 1.5
2021-10-09 12:30:60  | Pizza         | 12
2021-10-09 17:07:56  | Chocolate bar | 3
Those orders are time-dependent. Nobody will eat a pizza at midnight, usually, and there will be more 3PM Sunday orders than 3PM Monday orders (because people are at work). I want to extract the daily order distribution for each weekday (Monday through Sunday) from those few thousand orders so I can later generate new orders that fit this distribution. I do not want to fill in the gaps in my dataset.
How can I do so?
I want to create a generate_order_date() function that would generate random hours:minutes:seconds timestamps depending on the day. I can already identify which weekday a date corresponds to. I just need to extract the 7 daily distributions so I can call my function like this:
generate_order_date(day=Monday, nb_orders=1)
[12:30:00]
generate_order_date(day=Friday, nb_orders=5)
[12:30:00, 07:23:32, 13:12:09, 19:15:23, 11:44:59]
The generated timestamps do not have to be in chronological order, just as if I were calling
np.random.normal(mu, sigma, 1000)
Try np.histogram(data)
https://numpy.org/doc/stable/reference/generated/numpy.histogram.html
The first element of the returned tuple gives you the counts (pass density=True for a normalised density), which would be your distribution. You can visualise it with:
plt.plot(np.histogram(data)[0])
Here, data would be the time of day at which a particular item was ordered. For this approach, I would suggest rounding your times to 5-minute intervals or more, depending on the frequency; for example, round 12:34pm down to 12:30pm and 12:36pm down to 12:35pm. Choose a suitable frequency.
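A minimal sketch of that histogram approach, assuming a DataFrame df with a parsed datetime column Date as in the question (the function names and the 5-minute binning are my own choices): build one normalised histogram of order times per weekday, then sample new times from it.
import numpy as np
def build_histograms(df, bin_minutes=5):
    # returns {weekday: (probabilities, bin_edges_in_minutes)}, weekday 0 = Monday
    minutes = df["Date"].dt.hour * 60 + df["Date"].dt.minute
    edges = np.arange(0, 24 * 60 + bin_minutes, bin_minutes)
    hists = {}
    for day, grp in minutes.groupby(df["Date"].dt.weekday):
        counts, _ = np.histogram(grp, bins=edges)
        hists[day] = (counts / counts.sum(), edges)
    return hists
def sample_times(hists, day, n):
    # draw n order times (as minutes past midnight) for the given weekday
    probs, edges = hists[day]
    bins = np.random.choice(len(probs), size=n, p=probs)
    # add a uniform offset inside each chosen bin
    return edges[bins] + np.random.randint(0, edges[1] - edges[0], size=n)
For example, sample_times(build_histograms(df), day=0, n=3) would return three Monday times in minutes, which can then be formatted as hh:mm:ss.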
Another method would be scipy.stats.gaussian_kde, which uses a Gaussian kernel. Below is an implementation I have used previously:
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

def get_kde(df: pd.DataFrame) -> list:
    xs = np.round(np.linspace(-1, 1, 3000), 3)
    kde = gaussian_kde(df.values)
    kde_vals = np.round(kde(xs), 3)
    data = [[xs[i], kde_vals[i]] for i in range(len(xs)) if kde_vals[i] > 0.1]
    return data
where df.values is your data. There are plenty more kernels which you can use to get a density estimate. The most suitable one depends on the nature of your data.
Also see https://en.wikipedia.org/wiki/Kernel_density_estimation#Statistical_implementation
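Since the goal here is to generate new order times, note that gaussian_kde can also draw samples directly via its .resample() method; a small sketch with made-up data (times in minutes past midnight for one weekday):
import numpy as np
from scipy.stats import gaussian_kde
times_in_minutes = np.array([450, 455, 750, 760, 765, 1140, 1150], dtype=float)  # toy data
kde = gaussian_kde(times_in_minutes)
new_times = kde.resample(5).ravel() % (24 * 60)  # wrap samples back into one day
print(sorted(int(t) for t in new_times))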
Here's a rough sketch of what you could do:
Assumption: the column Date of the DataFrame df contains datetimes. If not, do df.Date = pd.to_datetime(df.Date) first.
First 2 steps:
import pickle
from datetime import datetime, timedelta

import numpy as np
from sklearn.neighbors import KernelDensity

# STEP 1: Data preparation
df["Seconds"] = (
    df.Date.dt.hour * 3600 + df.Date.dt.minute * 60 + df.Date.dt.second
) / 86400
data = {
    day: sdf.Seconds.to_numpy()[:, np.newaxis]
    for day, sdf in df.groupby(df.Date.dt.weekday)
}

# STEP 2: Kernel density estimation
kdes = {
    day: KernelDensity(bandwidth=0.1).fit(X) for day, X in data.items()
}
with open("kdes.pkl", "wb") as file:
    pickle.dump(kdes, file)
STEP 1: Build a normalised column Seconds (values between 0 and 1). Then group over the weekdays (numbered 0, ..., 6) and prepare, for every day of the week, the data for kernel density estimation.
STEP 2: Estimate the kernel densities for every day of the week with KernelDensity from Scikit-learn and pickle the results.
Based on these estimates build the desired sample function:
# STEP 3: Sampling
with open("kdes.pkl", "rb") as file:
    kdes = pickle.load(file)

def generate_order_date(day, orders):
    fmt = "%H:%M:%S"
    base = datetime(year=2022, month=1, day=1)
    kde = kdes[day]
    return [
        (base + timedelta(seconds=int(s * 86399))).time().strftime(fmt)
        for s in kde.sample(orders)
    ]
I won't pretend that this is anywhere near perfect. But maybe you could use it as a starting point.
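A possible usage sketch (weekday numbers follow pandas' dt.weekday as used in STEP 1, i.e. 0 = Monday ... 6 = Sunday):
print(generate_order_date(0, 1))  # one sampled Monday time, e.g. ['12:30:00']
print(generate_order_date(4, 5))  # five sampled Friday times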
I am a beginner in Python. I'm using the mip package to optimize a standalone battery given hourly power prices over a year. I need the program to pick the 5 lowest-price hours to charge the battery and the 4 highest-price hours to discharge it, every day, for a year. But first I'm trying out the solver on 24 hours.
Data:
time, month, day, hour, power price (24 entries)
Q:
Solve for a standalone battery's optimal charging and discharging pattern
Battery rating: 1MW with 4MWh storage capability (4-hour storage)
Output: two columns of binary variables
Battery needs 4.7 hours to fully charge, and discharges for 4 hours
Round-trip efficiency 85%, charging 1 hour enables discharging of 0.85 hour
Constraints:
Battery state: available power > 0, i.e. (cumulative charge - cumulative discharge) > 0
0 < cumulative discharge < 4
0 < cumulative charge < 4.7
Below is my code:
import numpy as np
import pandas as pd
import mip
from mip import Model, xsum, maximize, BINARY, CONTINUOUS, OptimizationStatus
# Define model and var
m = mip.Model(sense=maximize)
maxdischargepower = 4
maxchargepower = 4.7
H = 24
charge = [m.add_var(var_type = BINARY) for i in range(H)]
discharge = [m.add_var(var_type = BINARY) for i in range(H)]
batterystate = np.cumsum(charge) - np.cumsum(discharge)
# Define objective function
m.objective = xsum(discharge[i]*price[i] for i in range(H)) - xsum(charge[i]*price[i] for i in range (H))
# Constraints
m += np.cumsum(discharge) <= maxdischargepower
m += np.cumsum(charge) <= maxchargepower
m += np.cumsum(discharge) >= 0
m += batterystate >= 0
I have several questions:
I get a result of -1277, which is the negative of the sum of the power prices over the 24 hours. There must be something wrong with the optimization code, but I cannot find it.
How do I save the charge and discharge binaries in the input data file?
Should I iterate the optimization model for 365 days for year-round data?
Thank you.
---------------------Edit 2/19-------------------------
Here's some sample data I've been running the code on:
Or in fact, 24 random numbers would also work but these are the actual prices I've been using.
I have an entire year of data; once I figure out how to optimize within a single day, should I iterate the optimization over all 366 days?
sample data
Seems like the main part of the algorithm is to ask these questions at the top of each hour:
At what time will the battery be discharged?
Which hour between now and then will be the cheapest for recharging?
IF that is the current hour, then turn on recharging.
It seems like that would be optimal if you can recharge in less than an hour. I don't understand the numbers you have -- 4 and 4.7 -- sounds like you need to be recharging almost all the time.
If that is the case, the algorithm can be turned around to "avoid recharging during costly hours".
You can't wait too long to decide whether to charge or not. I assume the cost of making the decision is essentially zero. So, recompute every hour (at least) and don't reach forward too far. It sounds like you have less than 5 hours before the battery will die; trying to optimize further into the future will be useless if it is dead.
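A rough sketch of that "avoid recharging during costly hours" idea, using the question's own framing (pick the cheapest hours to charge and the most expensive ones to discharge). Here prices is an assumed list of 24 hourly prices, and the sketch deliberately ignores the chronology and state-of-charge constraints, so it is a heuristic rather than a substitute for the MIP model:
import numpy as np
def naive_schedule(prices, n_charge=5, n_discharge=4):
    # cheapest hours -> charge, most expensive hours -> discharge
    order = np.argsort(prices)
    charge_hours = set(order[:n_charge].tolist())
    discharge_hours = set(order[-n_discharge:].tolist())
    charge = [int(h in charge_hours) for h in range(len(prices))]
    discharge = [int(h in discharge_hours) for h in range(len(prices))]
    return charge, discharge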
I am working with a CSV file of hourly temperature data for a specific location for a year. From that I made a dataframe in pandas with the following columns: DOY (day of year, repeated 24 times per day), Time (in minutes: 0, 60, 120, etc.), and temperature.
These data are serving as input data for a model that I have in Python (Jupyter notebooks, running on iOS) that predicts animal body temperatures when solving for a bunch of biophysical heat flux equations. For that model, I have a function that has the arguments of min and max temperature for every day. In the csv file, every day has 24 rows of data since they're giving hourly temperatures. I need to be able to iterate through this csv file and select the minimum and maximum temperature value for the day before current day [i-1], the current day [i], and the following day [i+1] in another function that I already have. Does anyone have suggestions for how to set up those functions? I'm still rather new to Python (< 1 year experience) so any help would be really appreciated! :)
Edit to clarify:
import math
import itertools
%pylab inline
import matplotlib as plt
import pandas as pd
import numpy as np
%cd "/Users/lauren/Desktop"
Ta_input = pd.DataFrame(pd.read_csv("28Jan_Mesa_Ta.csv"))
Ta_input.columns = ['', 'doy', 'time', 'Ta']
Ta_input.to_numpy()
doy =list(Ta_input['doy'])
time=list(Ta_input['time'])
Ta=list(Ta_input['Ta'])
micro_df = pd.DataFrame(list(zip(doy,time, Ta)),
columns=['doy','time', 'Ta'])
print(micro_df)
##### Below is the readout showing what the df looks like ###
Populating the interactive namespace from numpy and matplotlib
/Users/lauren/Desktop
doy time Ta
0 1 0 4.434094
1 1 60 4.383863
2 1 120 4.115001
3 1 180 3.831146
4 1 240 3.537708
... ... ... ...
8755 365 1140 6.478684
8756 365 1200 5.744720
8757 365 1260 5.212801
8758 365 1320 4.568695
8759 365 1380 4.398663
[8760 rows x 3 columns]
/usr/local/Caskroom/miniconda/base/lib/python3.7/site-packages/IPython/core/magics/pylab.py:160: UserWarning: pylab import has clobbered these variables: ['time', 'polyint', 'plt', 'insert']
`%matplotlib` prevents importing * from pylab and numpy
"\n`%matplotlib` prevents importing * from pylab and numpy"
I have these functions
def anytime_temp(t, max_t_yesterday, min_t_today, max_t_today, min_t_tomorrow):
    # t = time
    # i = today
    # i-1 = yesterday
    # i+1 = tomorrow
    # Tn = daily min
    # Tx = daily max
    if 0. <= t <= 5.:
        return max_t_yesterday*Gamma_t(t) + min_t_today*(1 - Gamma_t(t))
    elif 5. < t <= 14.:
        return max_t_today*Gamma_t(t) + min_t_today*(1 - Gamma_t(t))
    else:
        return max_t_today*Gamma_t(t) + min_t_tomorrow*(1 - Gamma_t(t))

# Rabs = amount of radiation absorbed by an organism
def Rabs(s, alpha_s, h, lat, J, time, long, d, tau, alt, rg, alpha_l,
         max_t_yesterday, min_t_today, max_t_today, min_t_tomorrow,
         eg, Tave, amp, z, D):
    if math.cos(zenith(latitude, julian, time, longitude)) > 0.:
        return s*alpha_s*(Fh(h, lat, J, time, long, d)*hS(J, lat, time, long, tau, alt)
                          + 0.5*Sd(J, lat, time, long, tau, alt)
                          + 0.5*Sr(rg, J, lat, time, long, tau, alt)) \
               + 0.5*alpha_l*(Sla(anytime_temp(time, max_t_yesterday, min_t_today, max_t_today, min_t_tomorrow))
                              + Slg(eg, Tzt(Tave, amp, z, D, time)))
    else:
        return 0.5*alpha_l*(Sla(anytime_temp(time, max_t_yesterday, min_t_today, max_t_today, min_t_tomorrow))
                            + Slg(eg, Tzt(Tave, amp, z, D, time)))
...which both take maximum temperature yesterday, minimum and maximum temperature today, and minimum temperature tomorrow as inputs. My code is set up to run with me just setting those values to single numbers (e.g., min_t_today = 25.). But now that I have a list of hourly temperatures for the entire year, I am trying to figure out the best way to either modify these functions, or define new functions that I could call from them, so I can pull the minimum and maximum temperature values for each specific DOY (day of year, which is another column in my df).
In other words, my csv file has hourly temperatures for every DOY, so 24 temps per day. I need to iterate through to calculate and call on the min and max temperatures for a given day in these functions. Any tips would be helpful! Thanks!
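A minimal sketch of one way to do that, assuming the micro_df built above (the helper name day_extremes is mine): compute the per-day minima and maxima once, then look up yesterday/today/tomorrow for a given DOY and pass the four values into anytime_temp or Rabs.
daily_min = micro_df.groupby("doy")["Ta"].min()
daily_max = micro_df.groupby("doy")["Ta"].max()
last_doy = int(daily_min.index.max())
def day_extremes(i):
    # wrap around at the ends of the year
    yesterday = i - 1 if i > 1 else last_doy
    tomorrow = i + 1 if i < last_doy else 1
    return (daily_max[yesterday], daily_min[i], daily_max[i], daily_min[tomorrow])
# e.g. the four temperature arguments for day-of-year 100:
max_t_yesterday, min_t_today, max_t_today, min_t_tomorrow = day_extremes(100)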
I have times in SQLite in the form of '2012-02-21 00:00:00.000000' and would like to average times of day together. Dates don't matter--just times. So, e.g., if the data is:
'2012-02-18 20:00:00.000000'
'2012-02-19 21:00:00.000000'
'2012-02-20 22:00:00.000000'
'2012-02-21 23:00:00.000000'
The average of 20, 21, 22, and 23 should be 21.5, or 21:30 (or 9:30pm in the U.S.).
Q1) Is there a best way to do this in a SELECT query in SQLite?
But more difficult: what if one or more of the datetimes crosses midnight? They definitely will in my data set. Example:
'2012-02-18 22:00:00.000000'
'2012-02-19 23:00:00.000000'
'2012-02-21 01:00:00.000000'
Now the average seems like it should be (22 + 23 + 1)/3 = 15.33 or 15:20 (3:20pm). But that would misrepresent the data, as these events are all happening at night, from 22:00 to 01:00 (10pm to 1am). Really, the better approach would be to average them like (22 + 23 + 25)/3 = 23.33 or 23:20 (11:20pm).
Q2) Is there anything I should do to my SELECT query to take this into account, or is this something I have to code in Python?
What do you really want to compute?
Datetimes (or times within one day) are usually represented as real numbers.
Times on a 24-hour clock, however, are better treated as points on a circle, i.e. as complex numbers / unit vectors.
Averaging the real-number representations of the times will give you dubious results...
I don't know what you want to do with edge cases like [1:00, 13:00], but let's consider the following example: [01:30, 06:30, 13:20, 15:30, 16:15, 16:45, 17:10]
I suggest implementing the following algorithm in Python:
Convert the times to complex numbers, i.e. compute their coordinates on a circle of radius 1.
Compute the average using vector addition.
Convert the angle of the resulting vector back to minutes, and compute the relevance of the result (e.g. the relevance of the average of [1:00, 13:00] should be 0, whatever angle falls out of the rounding errors).
import math

def complex_average(minutes):
    # first convert the times from minutes (0:00 - 23:59) to radians,
    # so we get a list of quasi polar coordinates (1, radians)
    # (no point in rotating/flipping to get real polar coordinates)
    # 180° = 1/2 day = 24*60/2 minutes
    radians = [t*math.pi/(24*60/2) for t in minutes]
    xs = []
    ys = []
    for r in radians:
        # convert polar coordinates (1, r) to cartesian (x, y)
        # the vectors start at (0, 0) and end in (x, y)
        x, y = (math.cos(r), math.sin(r))
        xs.append(x)
        ys.append(y)
    # result vector = vector addition
    sum_x, sum_y = (sum(xs), sum(ys))
    # convert result vector coordinates to radians, then to minutes
    # note the cumulative ROUNDING ERRORS, however
    result_radians = math.atan2(sum_y, sum_x)
    result_minutes = int(result_radians / math.pi * (24*60/2))
    if result_minutes < 0:
        result_minutes += 24*60
    # relevance = magnitude of the result vector / number of data points
    # (<0.0001 means that all vectors cancel each other out, e.g. [1:00, 13:00]
    # => result_minutes would be random due to rounding error)
    # FYI: standard_deviation = 6*60 - 6*60*relevance
    relevance = round(math.sqrt(sum_x**2 + sum_y**2) / len(minutes), 4)
    return result_minutes, relevance
And test it like this:
# let's say the select returned a bunch of integers in minutes representing times
selected_times = [90, 390, 800, 930, 975, 1005, 1030]
# or create other test data:
#selected_times = [hour*60 for hour in [23,22,1]]
complex_avg_minutes, relevance = complex_average(selected_times)
print("complex_avg_minutes = {:02}:{:02}".format(complex_avg_minutes//60,
complex_avg_minutes%60),
"(relevance = {}%)".format(int(round(relevance*100))))
simple_avg = int(sum(selected_times) / len(selected_times))
print("simple_avg = {:02}:{:02}".format(simple_avg//60,
simple_avg%60))
hh_mm = ["{:02}:{:02}".format(t//60, t%60) for t in selected_times]
print("\ntimes = {}".format(hh_mm))
Output for my example:
complex_avg_minutes = 15:45 (relevance = 44%)
simple_avg = 12:25
I'm not sure you can average dates.
What I would do is get the average of the difference in hours between the row values and a fixed date, then add that average to the fixed date. Using minutes may cause an integer overflow and require some type conversion.
Sort of (note that dateadd, datediff and getdate are SQL Server functions, not SQLite, but the idea carries over):
select dateadd(hh,avg(datediff(hh,getdate(),myrow)),getdate())
from mytable;
If I understand correctly, you want to get the average distance of the times from midnight?
How about this?
SELECT SUM(mins) / COUNT(*) FROM
  ( SELECT
      CASE
        WHEN strftime('%H', t) * 1 BETWEEN 0 AND 11
          THEN strftime('%H', t) * 60 + strftime('%M', t)
        ELSE strftime('%H', t) * 60 + strftime('%M', t) - 24 * 60
      END mins
    FROM timestamps
  );
So we calculate the minutes offset from midnight: after noon we get a negative value, before noon a positive one. The outer SELECT averages them and gives us a result in minutes. Converting that back to an hh:mm time is left as an "exercise for the student" ;-)
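If you do that last conversion in Python, a tiny sketch (the -160 is just a hypothetical value returned by the query):
avg_mins = -160          # hypothetical result of the query above
avg_mins %= 24 * 60      # wrap negative offsets back into 0..1439
print("{:02d}:{:02d}".format(avg_mins // 60, avg_mins % 60))  # -> 21:20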
The Rosetta Code site has a task and code on this subject, and in researching that I came across this Wikipedia link. Check out the talk/discussion pages too for discussions on applicability etc.