The columns in the dataset below are:
A: Date the contract opened;
B: Date the contract ends;
C: Unique account ID the contract is associated with (one ID can have multiple live contracts);
D: Monthly revenue for the contract period; for simplicity, assume revenue is generated from the first month of the contract up to the month before the contract closes.
Opp Start Date   OPP contract end date   Unique Account Field   MRR(expected)
1/2/2013         1/2/2015                50e55                  195.00
1/2/2013         1/2/2014                4ee75                  50.00
1/2/2013         1/2/2014                4f031                  75.00
1/2/2013         1/2/2016                4c3b2                  133.00
1/2/2013         1/2/2016                49ec8                  132.00
1/3/2013         1/3/2014                49fc8                  59.00
1/4/2013         1/4/2015                49wc8                  87.00
12/27/2013       12/27/2014              50bf7                  190.00
12/27/2013       12/27/2014              49cc8                  179.00
12/27/2013       12/27/2014              49ec8                  147.00
etc....
I would like to calculate the following:
1. How much revenue was generated per month between Jan 2013 and Dec 2014?
2. How many active contracts (contracts that generated revenue in that month) were there per month between Jan 2013 and Dec 2014?
3. How many active accounts (accounts that generated revenue from at least one contract) were there per month between Jan 2013 and Dec 2014?
I was able to use sum() to get revenue totals, but I'm not sure what to do beyond this. Here is the code I tried:
import pandas as pd

df['Opp Start Date'] = pd.to_datetime(df['Opp Start Date'])
df.groupby(df['Opp Start Date'].dt.strftime('%B'))['MRR(expected)'].sum().sort_values()
Result I got from the above code:
Opp Start Date
February 221744
January 241268
July 245811
August 247413
April 249702
March 251219
June 251494
May 259149
September 263395
October 293990
November 296590
December 311659
I need to calculate the three figures listed above. How can I achieve this in Python?
You could do it in either place, or a combination of the two, and there are lots of different ways to approach this. Personally, I would create a query that pulls the information for one month, then iterate through the months, storing the results in an array or a temp table.
For a single month in the period, the query would look something like this:
SELECT count(unique_account_field), sum(MRR)
FROM contracts
WHERE Opp_start_date <= #month_end
  AND Opp_contract_end_date > #month_start
That would take care of 1 and 2. #3 is a little different. That's a simple count(distinct unique_account_field) from the set over the whole period.
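Since the question asks for Python, here is a minimal pandas sketch of the same month-by-month logic (a sketch under the assumption that df holds the sample data with the column names shown above):

import pandas as pd

# Parse both date columns (the sample uses month/day/year strings).
df['Opp Start Date'] = pd.to_datetime(df['Opp Start Date'])
df['OPP contract end date'] = pd.to_datetime(df['OPP contract end date'])

rows = []
for m in pd.period_range('2013-01', '2014-12', freq='M'):
    # Same filter as the SQL above: started on or before the month's end,
    # and ends after the month's start.
    active = df[(df['Opp Start Date'] <= m.end_time) &
                (df['OPP contract end date'] > m.start_time)]
    rows.append({'month': str(m),
                 'revenue': active['MRR(expected)'].sum(),
                 'active_contracts': len(active),
                 'active_accounts': active['Unique Account Field'].nunique()})
result = pd.DataFrame(rows)
print(result)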
I have a SQLite table with a datetime column of dates and a column of real numbers. Any ideas on how I can query to get all the years within a given date range?
Note: the dates are stored as yyyy-mm-dd.
For example, if I had a table with all the dates from Jan 1 1990 to Dec 31 2021, and I wanted the dates between Feb 1 and Feb 2, I would get:
2022-02-01
2022-02-02
2021-02-01
2021-02-02
...
1991-02-01
1991-02-02
1990-02-01
1990-02-02
or if I query Jan 31 to Feb 2, I would get:
2022-01-31
2022-02-01
2022-02-02
2021-01-31
2021-02-01
2021-02-02
...
1991-01-31
1991-02-01
1991-02-02
1990-01-31
1990-02-01
1990-02-02
I'm trying to do this using peewee for Python, but I'm not even sure how to write this SQL statement. From what I've seen, SQL has a BETWEEN operator, but that won't work here, as it would give me only one year when I want every matching record in the database, regardless of year.
Since it appears that the author's originally posted sample data was not representative of the actual data, and the actual data is stored as YYYY-MM-DD, this is quite simple using SQLite's strftime():
from peewee import fn

start = '01-31'
end = '02-02'
query = (Reg
         .select()
         .where(fn.strftime('%m-%d', Reg.date).between(start, end)))
for row in query:
    print(row.date)
Should print something like:
2018-01-31
2018-02-01
2018-02-02
2019-01-31
...
Consider filtering the query by building dates with date(), keeping the year of the current date. Date ranges that cross a year boundary may need to be split with a self-join or a union:
SELECT *
FROM my_table
WHERE my_date BETWEEN date(strftime('%Y', my_date) || '-01-31')
                  AND date(strftime('%Y', my_date) || '-02-02')
To get the years within a date range in a SQLite table, you can use the strftime function to extract the year from the date column, then use the DISTINCT keyword to return only unique years.
SELECT DISTINCT strftime('%Y', date_column) as year
FROM table_name
WHERE date_column BETWEEN '2022-01-01' AND '2022-12-31'
I accidentally ran into this solution (it was a problem for me) while working in SQL Server, where I was trying to get records created in the last week:
SELECT *
FROM table1
WHERE DATEPART(WK, table1.date_entered) = DATEPART (WK, GETDATE()) -1
This returns everything that was created in week now-1 of all previous years, similar to SQLite's strftime('%W', my-date-here).
If the dates you are querying don't span a change in years (i.e., not Dec 29 → Jan 5), you could do something like the below:
SELECT date_column
FROM myTable
WHERE strftime('%j', date_column) >= strftime('%j', **MY_START_DATE**)
  AND strftime('%j', date_column) <= strftime('%j', **MY_END_DATE**)
Here we get the day of the year (that's what '%j' gives us) and select anything from date_column where the day of the year is between our start and end dates. If I understood your question properly, it will give you the list below:
2022-02-01
2022-02-02
2021-02-01
2021-02-02
...
If you want further information on working with dates, I can't do better than refer you to Robyn Page's "SQL Server DATE/TIME Workbench".
Obviously that article is about SQL Server, but most of it translates to other databases.
I have a pandas DataFrame of stock records. My goal is to pass in a particular day, e.g. 8, and get the filtered DataFrame for the 8th of each month and year in the dataset.
I have gone through some SO questions and managed to get one part of my requirement, which was getting the records for a particular day. However, if the data for, say, the 8th does not exist for a particular month and year, I need to get the records for the closest day with records for that month and year.
As an example, if I pass in 8 and there is no record for 8th Jan 2022, I need to check whether records exist for the 7th and 9th Jan 2022, and so on, and get the record for the nearest date.
If records are present for both the 7th and the 9th, I should get the record for the 9th (the higher date).
However, if the record for the 7th exists and the 9th does not, then I should get the record for the 7th (the closest).
Code I have written so far:
filtered_df = data.loc[(data['Date'].dt.day == 8)]
If the dataset is required, please let me know. I tried to make the question clear, but if there is any doubt, please ask. Any help in the right direction is appreciated.
Alternative 1
Resample to a daily resolution, selecting the nearest day to fill in missing values:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
Alternative 2
A more general method (and a tiny bit faster) is to generate the dates/times of your choice, then use reindex() with method='nearest'. It is more general because you can use any series of timestamps you can come up with (not necessarily aligned with any frequency).
dates = pd.date_range(
    start=df.first_valid_index().normalize(),
    end=df.last_valid_index(),
    freq='D')
dates = dates[dates.day == 8]
df2 = df.reindex(dates, method='nearest')
Example
Let's start with a reproducible example:
import yfinance as yf
df = yf.download(['AAPL', 'AMZN'], start='2022-01-01', end='2022-12-31')
>>> df.iloc[:10, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
Date
2022-01-03 180.959747 170.404495 182.009995 170.404495 182.880005
2022-01-04 178.663086 167.522003 179.699997 167.522003 182.940002
2022-01-05 173.910645 164.356995 174.919998 164.356995 180.169998
2022-01-06 171.007523 163.253998 172.000000 163.253998 175.300003
2022-01-07 171.176529 162.554001 172.169998 162.554001 174.139999
2022-01-10 171.196426 161.485992 172.190002 161.485992 172.500000
2022-01-11 174.069748 165.362000 175.080002 165.362000 175.179993
2022-01-12 174.517136 165.207001 175.529999 165.207001 177.179993
2022-01-13 171.196426 161.214005 172.190002 161.214005 176.619995
2022-01-14 172.071335 162.138000 173.070007 162.138000 173.779999
Now:
df2 = df.resample('D').nearest()
df2 = df2.loc[df2.index.day == 8]
>>> df2.iloc[:5, :5]
Adj Close Close High
AAPL AMZN AAPL AMZN AAPL
2022-01-08 171.176529 162.554001 172.169998 162.554001 174.139999
2022-02-08 174.042633 161.413498 174.830002 161.413498 175.350006
2022-03-08 156.730942 136.014496 157.440002 136.014496 162.880005
2022-04-08 169.323975 154.460495 170.089996 154.460495 171.779999
2022-05-08 151.597595 108.789001 152.059998 108.789001 155.830002
Warning
Replacing a missing day with data from the future (which is what happens when the nearest day falls after the missing one) is peeking ahead, and can cause look-ahead bias in quant research that uses that data. It is usually considered dangerous. You'd be safer using method='ffill', as in the sketch below.
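For instance, a minimal sketch of that safer variant of Alternative 1 (my assumption, not part of the original answer):

# Fill each missing day from the most recent past observation only,
# so no future data leaks into the selected rows.
df2 = df.resample('D').ffill()
df2 = df2.loc[df2.index.day == 8]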
I am trying to replace some hardcoded SQL queries related to timezone changes with a more dynamic, data-driven Python script. I have a dataset that looks like the spreadsheet below. WEEK_START/DAY/MONTH are the week, day, and month when daylight saving time begins (for example, Canberra starts the first Sunday of April while Vienna starts the last Sunday of March). The END variables are in the same format and indicate when it ends.
Dataset
Here is the issue. I have seen solutions for specific use cases, such as this one for finding the last Sunday of the month:
import calendar
from datetime import date

today = date.today()
current_year = today.year
current_month = today.month
current_day = today.day
month = calendar.monthcalendar(current_year, current_month)
day_of_month = max(month[-1][calendar.SUNDAY], month[-2][calendar.SUNDAY])
print(day_of_month)
31
This tells me that the last Sunday of this month is the 31st. I can adjust the attributes for one given month/scenario, but how would I make a column for each and every row (city) to retrieve each date? That is, for several cities that change times on different dates? I thought that if I could set attributes in day_of_month in an apply function it would work, but when I do something like weekday='SUNDAY' it returns an error, because of course the string 'SUNDAY' is not the same as SUNDAY, the attribute of calendar.
My SQL queries are grouped by cities that change on the same day, but ideally anyone would be able to edit the CSV that loads the above dataset as needed, and then the script would run once each day to see if today is between the start and end of daylight saving time. We might have new cities to add in the future. I'm confident in doing that bit, but quite lost on how to retrieve the dates for a given year.
My alternative, less resilient option is to look at the distinct list of potential dates (last Sunday of March, first Sunday of April, etc.), write code to retrieve each one up front (as in the snippet above), and assign the dates that way. I say this is less resilient because if a city is added that does not fit an existing time-change group, the code would need to be altered as well.
So Stack Overflow, is there a way to do this in a data-driven way in pandas, through an apply or something similar? Thanks in advance.
Basically, I think you have most of what you need. Just map the WEEK_START / WEEK_END columns ({-1, 1}) to the last or first day of the month, put it all in a function, and apply it to each row. Ex:
import calendar
import operator
import pandas as pd

def get_date(year: int, month: int, dayname: str, first=-1) -> pd.Timestamp:
    """
    Get the first or last day "dayname" in the given month and year.
    Returns the last by default.
    """
    daysinmonth = calendar.monthcalendar(year, month)
    getday = operator.attrgetter(dayname.upper())
    if first == 1:
        # the first week may not contain the day (slot is 0); fall back to week 2
        day = daysinmonth[0][getday(calendar)] or daysinmonth[1][getday(calendar)]
    else:
        day = max(daysinmonth[-1][getday(calendar)], daysinmonth[-2][getday(calendar)])
    return pd.Timestamp(year, month, day)
year = 2021  # we need a year...

df['date_start'] = df.apply(lambda row: get_date(year,
                                                 row['MONTH_START'],
                                                 row['DAY_START'],
                                                 row['WEEK_START']),  # selects first or last
                            axis=1)  # apply to each row
df['date_end'] = df.apply(lambda row: get_date(year,
                                               row['MONTH_END'],
                                               row['DAY_END'],
                                               row['WEEK_END']),
                          axis=1)
giving you, for the sample data:
df[['CITY', 'date_start', 'date_end']]
CITY date_start date_end
0 Canberra 2021-04-04 2021-10-03
1 Melbourne 2021-04-04 2021-10-03
2 Sydney 2021-04-04 2021-10-03
3 Kitzbuhel 2021-03-28 2021-10-31
4 Vienna 2021-03-28 2021-10-31
5 Antwerp 2021-03-28 2021-10-31
6 Brussels 2021-03-28 2021-10-31
7 Louvain-la-Neuve 2021-03-28 2021-10-31
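The daily check described in the question could then be a simple comparison against those two columns (a sketch of my own, assuming the date_start/date_end semantics above):

# Normalize to midnight so the comparison is date-based, not time-based.
today = pd.Timestamp.today().normalize()
# True where today falls inside the [date_start, date_end) window.
df['dst_active'] = (df['date_start'] <= today) & (today < df['date_end'])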
Once you start working with time zones and DST transitions, the question "Is there a way to infer in Python if a date is the actual day on which the DST (Daylight Saving Time) change is made?" might also be interesting to you.
I am trying to forecast daily profit using time series analysis, but the daily profit is not only recorded unevenly, some of it is missing altogether.
Raw Data:
Date       Revenue
2020/1/19  10$
2020/1/20  7$
2020/1/25  14$
2020/1/29  18$
2020/2/1   12$
2020/2/2   17$
2020/2/9   28$
The above table is an example of the kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 do not exist. Not only that: say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing from the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onward.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. So my cleaned data will look something like this:
Date                    Revenue
2020/1/16 ~ 2020/1/21   17$
2020/1/22 ~ 2020/1/27   14$
2020/1/28 ~ 2020/2/2    47$
2020/2/3 ~ 2020/2/8     ? (to predict)
After applying this to a time series model, I would like to further predict the profit from 2020/2/9 onward.
This is my general idea, but as a beginner in Python using the pandas library, I have trouble executing my ideas. Could you please help me aggregate the profit every 6 days so that the data looks like the table above?
The easiest way is to use pandas' resample function.
Provided you have a DatetimeIndex, aggregating profits every 6 days is as simple as your_dataframe.resample('6D').sum().
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
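For instance, a minimal self-contained sketch on a slice of the question's data:

import pandas as pd

df = pd.DataFrame({'Date': ['2020/1/19', '2020/1/20', '2020/1/25'],
                   'Revenue': [10, 7, 14]})
df['Date'] = pd.to_datetime(df['Date'])
# Sum revenue in consecutive 6-day bins, starting at the first date present.
print(df.set_index('Date').resample('6D').sum())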
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
import pandas as pd

df = pd.DataFrame([['2020/1/19', 10],
                   ['2020/1/20', 7],
                   ['2020/1/25', 14],
                   ['2020/1/29', 18],
                   ['2020/2/1', 12],
                   ['2020/2/2', 17],
                   ['2020/2/9', 28]], columns=['Date', 'Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum over each 6-day period. We're not interested in every day's six-day revenue total, though, only every 6th day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :]
The last thing to do with summary_df is to format it the way you'd like, so that it clearly states the date range each row refers to.
summary_df['Start Date'] = summary_df.index - pd.Timedelta('5d')  # inclusive start of each 6-day window
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True, inplace=True)
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
I have a dataset of daily meteorological data for the period 1983-2019. I am doing some data wrangling, and I would like to create a function that takes the values of one variable over a certain period and sums them.
This will be used several times in the code, so I don't want to use .resample('MS').sum() directly. Moreover, I would like to become stronger at programming, so I am trying to solve it with a hand-written function.
The dataset looks like this. I've just created a new variable 'MONTH', but it could be a quarter or a half-year.
filled_values['MONTH'] = filled_values['DAY'].dt.strftime('%b %Y')
filled_values.tail(n=2)
out:
DAY RAIN TEMP TMAX TMIN WIND MONTH
13030 2019-03-04 0.1240 22.38 26.500 18.840 1.16 Mar 2019
13031 2019-03-05 0.1900 22.77 29.220 17.510 1.08 Mar 2019
And now I am trying to create the function.
prec_sums_per_month = []

def sums_prec(dataset, date, variable_for_sum, new_variable, new_dataset):
    for date in dataset.items:
        new_dataset[new_variable] = variable_for_sum.sum()
    return new_dataset

prec_sums_per_month = sums_prec(filled_values, 'MONTH', 'RAIN', 'RAIN_MONTH', prec_sums_per_month)
print(prec_sums_per_month)
I expected a new DataFrame (or dictionary?) with the 'MONTH' variable and the sum of rain for each month. But here is my result:
TypeError: 'method' object is not iterable
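For what it's worth, the error comes from dataset.items: DataFrame.items is a method, so without parentheses the for loop tries to iterate over a bound method object. A minimal sketch of the intended monthly sums using groupby (an assumption about what the function was meant to produce, using the question's column names):

# groupby does the per-month iteration and the summing in one step.
prec_sums_per_month = filled_values.groupby('MONTH')['RAIN'].sum()
print(prec_sums_per_month)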