How to sum up monthly total in sqlite3 - python

I have a few lines of code which can sum up the total value for a single month, but what I want to do is sum up each month and print them all. When I change the month value it works alright, but I want the total value of every month. For example:
january = 200
february = 240
march = 310
....
december = 8764
taking the year into consideration as well.
CODE
import sqlite3
conn = sqlite3.connect("test.db")
cur = conn.cursor()
cur.execute("SELECT SUM(AMOUNT) FROM `cash` WHERE strftime('%m', `DATE`) = '10'")
rows = cur.fetchall()
for row in rows:
    print(row)
The table has the columns name, date and amount, and date is in this format: 2019-05-23, 2016-05-30
NAME DATE AMOUNT
JOE 2018-01-23 50.00
BEN 2018-01-21 61.00
FRED 2018-02-23 31.00
FRED 2018-02-03 432.00
DAN 2018-03-23 69.00
FRED 2018-03-23 61.00
BRYAN 2018-04-21 432.00
FELIX 2018-04-25 907.00
.......................
......................

Yeah, you need GROUP BY. Something like:
SELECT strftime('%Y-%m', date) AS sales_month
, sum(amount) AS total_sales
FROM sales
GROUP BY sales_month
ORDER BY sales_month;
SQLFiddle example
(You get bonus points for using a date format that sqlite date and time functions understand; too many people use "MM/DD/YYYY" or the like.)
Edit: If you have a lot of data and run this frequently, you might want to use a computed index to speed it up:
CREATE INDEX sales_idx_monthly ON sales(strftime('%Y-%m', date));

A GROUP BY instead of the WHERE clause should work:
GROUP BY strftime('%Y',`DATE`), strftime('%m',`DATE`)
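For completeness, plugging the GROUP BY into the Python code from the question (a minimal sketch, reusing the question's cash table and its DATE/AMOUNT columns):
import sqlite3

conn = sqlite3.connect("test.db")
cur = conn.cursor()
# one row per year-month with its summed amount
cur.execute("""
    SELECT strftime('%Y-%m', `DATE`) AS month, SUM(AMOUNT) AS total
    FROM `cash`
    GROUP BY month
    ORDER BY month
""")
for month, total in cur.fetchall():
    print(month, total)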

Related

Python dataframe returning closest value above specified input in one row (pivot_table)

I have the following DataFrame, output_excel, containing inventory data and sales data for different products. See the DataFrame below:
Product 2022-04-01 2022-05-01 2022-06-01 2022-07-01 2022-08-01 2022-09-01 AvgMonthlySales Current Inventory
1 BE37908 1500 1400 1200 1134 1110 1004 150.208333 1500
2 BE37907 2000 1800 1800 1540 1300 1038 189.562500 2000
3 DE37907 5467 5355 5138 4926 4735 4734 114.729167 5467
Please note that in my example, today's date is 2022-04-01, so all inventory numbers for the months May through September are predicted values, while the AvgMonthlySales are the mean of actual, past sales for that specific product. The current inventory just displays today's value.
I also have another dataframe, df2, containing the lead time, the same sales data, and the calculated security stock for the same products. The formula for the security stock is ((leadtime in weeks / 4) + 1) * AvgMonthlySales:
Product AvgMonthlySales Lead time in weeks Security Stock
1 BE37908 250.208333 16 1251.04166
2 BE37907 189.562500 24 1326.9375
3 DE37907 114.729167 10 401.552084
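(As a worked example: for BE37908 the security stock is (16 / 4 + 1) * 250.208333 ≈ 1251.04, which is the value shown above.)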
What I am trying to achieve:
I want to create a new dataframe, which tells me how many months are left until our inventory drops below the security stock. For example, for the first product, BE37908, the security stock is ~1251 units, and by 2022-06-01 our inventory will drop below that number. So I want to return 2022-05-01, as this is the last month where our inventories are projected to be above the security stock. The whole output should look something like this:
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
Please also note that the timeframe for the projections (the columns) can be set by the user, so we couldn't just select columns 2 through 7. However, the Product column will always be the first one, and the AvgMonthlySales and the Current Inventory columns will always be the last two.
To recap, I want to return the column with the smallest value above the security stock for each product. I have an idea on how to do that by column using argsort, but not by row. What is the best way to achieve this? Any tips?
You could try as follows:
import numpy as np

# create list with columns with dates
cols = [col for col in df.columns if col.startswith('20')]
# select cols, apply df.gt row-wise, sum and subtract 1
idx = df.loc[:,cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)
# get the correct dates from the cols
# if the value == len(cols)-1, *all* values will have been greater so: np.nan
idx = [cols[i] if i != len(cols)-1 else np.nan for i in idx]
out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = idx
print(out)
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN

Comparison between Group By technique of Python and SQL partition by

I want to ask a conceptual question.
I have a table that looks like
UPC_CODE A_PRICE A_QTY DATE COMPANY_CODE A_CAT
1001 100.25 2 2021-05-06 1 PB
1001 2122.75 10 2021-05-01 1 PB
1002 212.75 5 2021-05-07 2 PT
1002 3100.75 10 2021-05-01 2 PB
I want to pick up the latest row for each UPC_CODE and COMPANY_CODE.
To achieve this, I have a SQL version and a Python version.
Using SQL:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY UPC_CODE, COMPANY_CODE ORDER BY DATE DESC) rn
FROM yourTable)
SELECT UPC_CODE, A_PRICE, A_QTY, DATE, COMPANY_CODE, A_CAT
FROM cte
WHERE rn = 1;
Using Python:
df = df.groupby(['UPC_CODE', 'COMPANY_CODE']).\
     agg(Date=('DATE', 'max'), A_PRICE=('A_PRICE', 'first'),
         A_QTY=('A_QTY', 'first'), A_CAT=('A_CAT', 'first')).reset_index()
Ideally I should be getting the following resultant table:
UPC_CODE A_PRICE A_QTY DATE COMPANY_CODE A_CAT
1001 100.25 2 2021-05-06 1 PB
1002 212.75 5 2021-05-07 2 PT
However, I am getting the above with SQL, but not with Python.
What am I missing here?
The DATE column can be ranked within each group with rank(method='first', ascending=False) (descending order, mirroring the SQL PARTITION BY ... ORDER BY DATE DESC). Convert the DATE column to datetime, apply dataframe.groupby(), and then keep the rows where df['rn'] == 1:
import pandas as pd

df['DATE'] = pd.to_datetime(df['DATE'])
df['rn'] = df.groupby(['UPC_CODE', 'COMPANY_CODE'])['DATE'].rank(method='first', ascending=False)
print(df[df['rn'] == 1])
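A shorter equivalent (just a sketch, not part of the original answer) is to sort by DATE and keep the first row per group, which matches the rn == 1 filter:
import pandas as pd

# keep the newest row per (UPC_CODE, COMPANY_CODE)
df['DATE'] = pd.to_datetime(df['DATE'])
latest = (df.sort_values('DATE', ascending=False)
            .drop_duplicates(subset=['UPC_CODE', 'COMPANY_CODE'])
            .sort_values(['UPC_CODE', 'COMPANY_CODE'])
            .reset_index(drop=True))
print(latest)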

Pandas advanced problem: For each row, get complex info from another dataframe

Problem
I have a dataframe df :
Index Client_ID Date
1 johndoe 2019-01-15
2 johndoe 2015-11-25
3 pauldoe 2015-05-26
And I have another dataframe df_prod, with products like this:
Index Product-Type Product-Date Buyer Price
1 A 2020-01-01 pauldoe 300
2 A 2018-01-01 pauldoe 200
3 A 2019-01-01 johndoe 600
4 A 2017-01-01 johndoe 800
5 A 2020-11-05 johndoe 100
6 B 2014-12-12 johndoe 200
7 B 2016-11-15 johndoe 300
What I want is to add a column to df that sums the Prices of the last product of each type known at the current date (with Product-Date <= df.Date). An example is the best way to explain:
For the first row of df
1 johndoe 2019-01-15
The last A-Product known at this date bought by johndoe is this one :
3 A 2019-01-01 johndoe 600
(since the 4th one is older, and the 5th one has a Product-Date > Date)
The last B-Product known at this date bought by johndoe is this one :
7 B 2016-11-15 johndoe 300
So the row in df, after transformation, will look like this (900 being 600 + 300, the prices of the 2 products of interest):
1 johndoe 2019-01-15 900
The full df after transformation will then be :
Index Client_ID Date LastProdSum
1 johndoe 2019-01-15 900
2 johndoe 2015-11-25 200
3 pauldoe 2015-05-26 0
As you can see, there are multiple possibilities:
Buyers didn't necessarily buy all products (see pauldoe, who only bought A-products)
Sometimes, no product is known at df.Date (see row 3 of the new df: in 2015, we don't know of any product bought by pauldoe)
Sometimes, only one product is known at df.Date, and the value is the price of that product (see row 2 of the new df: in 2015, we only have one product for johndoe, a B-product bought in 2014, whose price is 200)
What I did:
I found a solution to this problem, but it takes far too long to run, since my dataframe is huge.
I iterate over the rows of df with iterrows; for each row I select the products of df_prod linked to the Buyer with Product-Date <= Date, keep the most recent one per Product-Type (grouping by Product-Type and taking the max date), and finally sum the prices of those products.
Solving the problem by iterating over each row (with a for over iterrows) and extracting, for each row of df, a slice of df_prod to work on before summing makes it really slow.
I'm almost sure there is a better way to solve this with pandas functions (pivot, for example), but I couldn't find it. I've been searching a lot.
Thanks in advance for your help
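For reference, here is a rough sketch of that row-by-row approach (column names taken from the sample data above; this is the slow baseline, not a recommendation):
import pandas as pd

# slow baseline: for each row of df, filter df_prod, keep the latest product
# per Product-Type known at that date, and sum their prices
def last_prod_sum(row):
    known = df_prod[(df_prod['Buyer'] == row['Client_ID']) &
                    (df_prod['Product-Date'] <= row['Date'])]
    if known.empty:
        return 0
    latest = known.sort_values('Product-Date').groupby('Product-Type').tail(1)
    return latest['Price'].sum()

df['LastProdSum'] = df.apply(last_prod_sum, axis=1)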
Edit after Dani's answer
Thanks a lot for your answer. It looks really good, I accepted it since you spent a lot of time on it.
Execution is still pretty slow, because I left out a detail.
In fact, Product-Types are not shared across buyers: each buyer has its own set of product types. The data really looks like this:
Index Product-Type Product-Date Buyer Price
1 pauldoe-ID1 2020-01-01 pauldoe 300
2 pauldoe-ID1 2018-01-01 pauldoe 200
3 johndoe-ID2 2019-01-01 johndoe 600
4 johndoe-ID2 2017-01-01 johndoe 800
5 johndoe-ID2 2020-11-05 johndoe 100
6 johndoe-ID3 2014-12-12 johndoe 200
7 johndoe-ID3 2016-11-15 johndoe 300
As you can understand, product types are not shared across different buyers (it can happen, but only in rare situations that we won't consider here).
The problem remains the same: since you want to sum prices, you add the prices of the last occurrences of johndoe-ID2 and johndoe-ID3 to get the same final result row
1 johndoe 2019-01-15 900
But as you now understand, there are actually more Product-Types than Buyers, so the step "get unique product types" from your answer, which looked fast on the initial problem, actually takes a lot of time.
Sorry for being unclear on this point; I hadn't thought of the possibility of creating a new df based on product types.
The main idea is to use merge_asof to fetch the last date for each Product-Type and Client_ID, so do the following:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
                      df_prod.sort_values(['Product-Date', 'Buyer']),
                      left_on='Date',
                      right_on='Product-Date',
                      left_by=['Client_ID', 'Product-Type'],
                      right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(columns={'Price' : 'LastProdSum'})
print(res)
Output
Client_ID Date LastProdSum
0 johndoe 2015-11-25 200.0
1 johndoe 2019-01-15 900.0
2 pauldoe 2015-05-26 0.0
The problem is that merge_asof won't work with duplicate values, so we need to create unique values. These new values are the cartesian product of Client_ID and Product-Type; this part is done in:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
Finally, fill the missing Prices with fillna, then do a groupby and sum the Price.
UPDATE
You could try:
# get unique product types
product_types = df_prod.groupby('Buyer')['Product-Type'].apply(lambda x: list(set(x)))
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = df['Client_ID'].map(product_types)
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
                      df_prod.sort_values(['Product-Date', 'Buyer']),
                      left_on='Date',
                      right_on='Product-Date',
                      left_by=['Client_ID', 'Product-Type'],
                      right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(columns={'Price' : 'LastProdSum'})
print(res)
The idea here is to change how you generate the unique values.
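(Mapping each Client_ID to only the product types that buyer actually has, via groupby('Buyer'), keeps the exploded frame at one row per owned product type instead of the full Client_ID × Product-Type cartesian product, which is what made the first version slow in this setting.)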

Increase Count Sequentially

I have a dataset which tracks when a user reads a website. A user can read a website at any time, so the same user will appear multiple times. I want to create a column which tracks the number of times a user has read a specific website. But since it is a time series, the count should be incremental. I have about 28 GB of data, so pandas will not be able to handle the workload, and I have to write it in SQL.
Sample data below:
Date ID WebID
201901 Bob X-001
201902 Bob X-002
201903 Bob X-001
201901 Sue X-001
Expected Results:
Date ID WebID Count
201901 Bob X-001 1
201902 Bob X-002 1
201903 Bob X-001 2
201901 Sue X-001 1
use row_number()
select *,row_number() over(partition by id,webid order by date) cnt
from table
order by date,id
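For reference, the same window-function query can be run from Python with sqlite3 (a minimal sketch; it assumes SQLite 3.25+ for window functions and a hypothetical table named 'reads' holding the sample columns):
import sqlite3

conn = sqlite3.connect("test.db")
cur = conn.cursor()
# running count per (ID, WebID), ordered by Date
cur.execute("""
    SELECT Date, ID, WebID,
           ROW_NUMBER() OVER (PARTITION BY ID, WebID ORDER BY Date) AS cnt
    FROM reads
    ORDER BY Date, ID
""")
for row in cur.fetchall():
    print(row)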
You can use the SQL query below:
Select count(*) "Count", Date, ID, WebID from table group by webid, id, date

Dividing rows for specific columns by date+n in Pandas

I want to divide rows in my dataframe via specific columns.
That is, I have a 'ticker' column, and each ticker has 'date' and 'price' values.
I want to divide the price on date[i+2] by the price on date[i], where i and i+2 just mean the DAY and the DAY + 2 for that ticker. The date is in proper datetime format for operations using Pandas.
The data looks like:
date | ticker | price |
2002-01-30 A 20
2002-01-31 A 21
2002-02-01 A 21.4
2002-02-02 A 21.3
.
.
That means I want to select the price based on the ticker and the DAY and the DAY + 2, specifically for each ticker, to calculate the ratio of the price on DAY + 2 to the price on DAY.
I've considered using iloc, but I'm not sure how to select only specific tickers to do the math on.
use groupby:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64
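A self-contained version built from the sample rows above, assigning the result to a new column (the column name 'ratio' is just illustrative):
import pandas as pd

# rebuild the sample data from the question
df = pd.DataFrame({
    'date': pd.to_datetime(['2002-01-30', '2002-01-31', '2002-02-01', '2002-02-02']),
    'ticker': ['A', 'A', 'A', 'A'],
    'price': [20, 21, 21.4, 21.3],
})

# ratio of each price to the price two rows earlier, per ticker
df['ratio'] = df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
print(df)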
