Increase Count Sequentially - python

I have a dataset which tracks when a user reads a website. A user can read a website at any time, so the same user will appear multiple times. I want to create a column which tracks the number of times a user has read a specific website. But since it is a time series, the count should be incremental. I have about 28 GB of data, so pandas will not be able to handle the workload and I have to write it in SQL.
Sample data below:
Date ID WebID
201901 Bob X-001
201902 Bob X-002
201903 Bob X-001
201901 Sue X-001
Expected Results:
Date ID WebID Count
201901 Bob X-001 1
201902 Bob X-002 1
201903 Bob X-001 2
201901 Sue X-001 1

Use row_number():
select *, row_number() over (partition by id, webid order by date) cnt
from table
order by date, id
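If you want to sanity-check the window-function approach before running it against the full 28 GB, here is a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25+); the table name reads is made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (date TEXT, id TEXT, webid TEXT)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?)",
    [("201901", "Bob", "X-001"), ("201902", "Bob", "X-002"),
     ("201903", "Bob", "X-001"), ("201901", "Sue", "X-001")],
)

# Cumulative read count per (id, webid), ordered by date.
rows = conn.execute("""
    SELECT date, id, webid,
           ROW_NUMBER() OVER (PARTITION BY id, webid ORDER BY date) AS cnt
    FROM reads
    ORDER BY date, id
""").fetchall()

for row in rows:
    print(row)   # ('201903', 'Bob', 'X-001', 2) is Bob's second read of X-001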

You can use the SQL query below:
select count(*) "Count", Date, ID, WebID from table group by webid, id, date

Related

Comparison between Group By technique of Python and SQL partition by

I want to ask a conceptual question.
I have a table that looks like
UPC_CODE A_PRICE A_QTY DATE COMPANY_CODE A_CAT
1001 100.25 2 2021-05-06 1 PB
1001 2122.75 10 2021-05-01 1 PB
1002 212.75 5 2021-05-07 2 PT
1002 3100.75 10 2021-05-01 2 PB
I want the latest data to be picked up for each UPC_CODE and COMPANY_CODE.
To achieve this, I have SQL and Python
Using SQL:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY UPC_CODE, COMPANY_CODE ORDER BY DATE DESC) rn
FROM yourTable)
SELECT UPC_CODE, A_PRICE, A_QTY, DATE, COMPANY_CODE, A_CAT
FROM cte
WHERE rn = 1;
Using Python:
df = df.groupby(['UPC_CODE','COMPANY_CODE']).\
        agg(Date = ('DATE','max'), A_PRICE = ('A_PRICE','first'),\
            A_QTY = ('A_QTY','first'), A_CAT = ('A_CAT','first')).reset_index()
Ideally I should be getting the following resultant table:
UPC_CODE A_PRICE A_QTY DATE COMPANY_CODE A_CAT
1001 100.25 2 2021-05-06 1 PB
1002 212.75 5 2021-05-07 2 PT
However, using SQL I am getting the above, but this is not the case with Python.
What am I missing here?
The upc_code and date columns can be used together with rank(method='first', ascending=False), i.e. descending order when determining the first rows, applied after dataframe.groupby() and after converting the date column to datetime, in order to filter out the rows where df['rn'] equals 1:
df['date']=pd.to_datetime(df['date'])
df['rn']=df.groupby('upc_code')['date'].rank(method='first',ascending = False)
print(df[df['rn']==1])
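If you want a version that keeps all columns and, like the SQL, partitions by both keys, a sort-then-take-first sketch could look like this (built here on the question's sample rows):

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'UPC_CODE': [1001, 1001, 1002, 1002],
    'A_PRICE': [100.25, 2122.75, 212.75, 3100.75],
    'A_QTY': [2, 10, 5, 10],
    'DATE': ['2021-05-06', '2021-05-01', '2021-05-07', '2021-05-01'],
    'COMPANY_CODE': [1, 1, 2, 2],
    'A_CAT': ['PB', 'PB', 'PT', 'PB'],
})
df['DATE'] = pd.to_datetime(df['DATE'])

# Newest row first within each (UPC_CODE, COMPANY_CODE), then keep one per group,
# the pandas analogue of filtering rn = 1 in the SQL version.
latest = (df.sort_values('DATE', ascending=False)
            .groupby(['UPC_CODE', 'COMPANY_CODE'])
            .head(1))
print(latest)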

SQLite removing a row, and reordering the rest of the rows

I'm struggling with something relatively simple.
Let's say I have a table where the first column is an autoincrement primary key:
id  name   age
1   tom    22
2   harry  33
3   greg   44
4   sally  55
I want to remove row 2 and have the rest of the rows automatically renumber so it looks like this:
id  name   age
1   tom    22
2   greg   44
3   sally  55
I have tried every available bit of advice I can find online; it all involves deleting the table name from the sqlite_sequence table, which doesn't work.
Is there a simple way to achieve what I am after?
I don't see the point/need to resequence the id column, as all values there will continue to be unique, even after deleting one or more records. If you really wanted to view your data this way, you could delete the record mentioned, and then use ROW_NUMBER:
DELETE
FROM yourTable
WHERE id = 2;
SELECT ROW_NUMBER() OVER (ORDER BY id) id, name, age
FROM yourTable
ORDER BY id;
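A small sketch of the same idea from Python's sqlite3 module (the table name people is invented for the example; ROW_NUMBER needs SQLite 3.25+):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people (name, age) VALUES (?, ?)",
                 [("tom", 22), ("harry", 33), ("greg", 44), ("sally", 55)])

conn.execute("DELETE FROM people WHERE id = 2")

# The stored ids stay 1, 3, 4; ROW_NUMBER just renumbers them in the output.
rows = conn.execute("""
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS id, name, age
    FROM people
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)   # (1, 'tom', 22), (2, 'greg', 44), (3, 'sally', 55)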

How do I use pandas to count the number of times a name and type occur together within a 60 day period from the first instance?

My dataframe is this:
Date Name Type Description Number
2020-07-24 John Doe Type1 NaN NaN
2020-08-10 Jo Doehn Type1 NaN NaN
2020-08-15 John Doe Type1 NaN NaN
2020-09-10 John Doe Type2 NaN NaN
2020-11-24 John Doe Type1 NaN NaN
I want the Number column to have the instance number with the 60 day period. So for entry 1, the Number should just be 1 since it's the first instance - same with entry 2 since it's a different name. Entry 3 however, should have 2 in the Number column since it's the second instance of John Doe and Type 1 in the 60 day period starting 7/24 (the first instance date). Entry 4 would be 1 as well since the Type is different. Entry 5 would also be 1 since it's outside the 60 day period from 7/24. However, any entries after this with John Doe, Type 1 would have a new 60 day period starting 11/24.
Sorry, I know this is a pretty loaded question with a lot of aspects to it, but I'm trying to get up to speed on dataframes again and I'm not sure where to begin.
As a starting point, you could create a pivot table. (The assign statement just creates a temporary column of ones, to support counting.) In the example below, each row is a date, and each column is a (name, type) pair.
Then, use the resample function (to get one row for every calendar day), and the rolling function (to sum the numbers in the 60-day window).
x = (df.assign(temp=1)
       .pivot_table(index='date',
                    columns=['name', 'type'],
                    values='temp',
                    aggfunc='count',
                    fill_value=0)
    )
x.resample('1d').sum().rolling(60).sum()
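A rough, self-contained usage sketch with the question's rows (column names lower-cased to match the snippet above; the rule that a new 60-day period starts after the first one expires would still need extra logic on top of this):

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-07-24', '2020-08-10', '2020-08-15',
                            '2020-09-10', '2020-11-24']),
    'name': ['John Doe', 'Jo Doehn', 'John Doe', 'John Doe', 'John Doe'],
    'type': ['Type1', 'Type1', 'Type1', 'Type2', 'Type1'],
})

x = (df.assign(temp=1)
       .pivot_table(index='date', columns=['name', 'type'],
                    values='temp', aggfunc='count', fill_value=0))

# Daily totals, then the number of occurrences in each trailing 60-day window.
rolling_counts = x.resample('1d').sum().rolling(60, min_periods=1).sum()
print(rolling_counts.loc[df['date']])   # window count looked up on each event date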
Can you post sample data in text format (for copy/paste)?

Pull value from dataframe based on date

Suppose I have the following data frame below:
userid recorddate
0 tom 2018-06-12
1 nick 2019-06-01
2 tom 2018-02-12
3 nick 2019-06-02
How would I go about determining and pulling the value of the earliest recorddate for each user, i.e. 2018-02-12 for tom and 2019-06-01 for nick?
In addition, what if I added a parameter such as the earliest recorddate that is greater than 2019-01-01?
Here is a solution with loc:
df['recorddate'] = pd.to_datetime(df['recorddate'])
date = pd.to_datetime("2019-01-01")
df.loc[df['recorddate']>date]
Output will be:
userid recorddate
1 nick 2019-06-01
3 nick 2019-06-02
You can change the greater-than sign to an equals or less-than sign to get a different result.
Cheers
Everything will be easier if you convert your date strings into datetime objects. Once that's done you can sort them then take the first record per userid. Additionally you can filter the dataframe by passing a date string in your conditional, and proceed the same way.
df['recorddate'] = pd.to_datetime(df['recorddate'])
df.sort_values(by='recorddate', inplace=True)
df.groupby('userid').first()
output
recorddate
userid
nick 2019-06-01
tom 2018-02-12
or
df[df['recorddate']>'2019-01-01'].groupby('userid').first()
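If you also want to keep the full row for that earliest date rather than the grouped view, one more sketch, rebuilding the question's dataframe so it runs on its own:

import pandas as pd

df = pd.DataFrame({'userid': ['tom', 'nick', 'tom', 'nick'],
                   'recorddate': ['2018-06-12', '2019-06-01', '2018-02-12', '2019-06-02']})
df['recorddate'] = pd.to_datetime(df['recorddate'])

# Full row of the earliest recorddate per user.
print(df.loc[df.groupby('userid')['recorddate'].idxmin()])

# Same idea, but only considering dates after 2019-01-01.
after = df[df['recorddate'] > '2019-01-01']
print(after.loc[after.groupby('userid')['recorddate'].idxmin()])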

How to sum up monthly total in sqlite3

I have these few lines of code which can sum up the total value for a single month, but what I want is to sum up every month and print the totals. When I change the month value it works fine, but I want the total value of each month,
example:
january = 200
february = 240
march = 310
....
december = 8764
taking the year into consideration as well.
CODE
import sqlite3
conn = sqlite3.connect("test.db")
cur = conn.cursor()
cur.execute("SELECT SUM(AMOUNT) FROM `cash` WHERE strftime('%m', `DATE`) = '10'")
rows = cur.fetchall()
for row in rows:
print(row)
The table has the columns name, date and amount, and date is in this format: 2019-05-23, 2016-05-30
NAME DATE AMOUNT
JOE 2018-01-23 50.00
BEN 2018-01-21 61.00
FRED 2018-02-23 31.00
FRED 2018-02-03 432.00
DAN 2018-03-23 69.00
FRED 2018-03-23 61.00
BRYAN 2018-04-21 432.00
FELIX 2018-04-25 907.00
.......................
......................
Yeah, you need GROUP BY. Something like:
SELECT strftime('%Y-%m', date) AS sales_month
, sum(amount) AS total_sales
FROM sales
GROUP BY sales_month
ORDER BY sales_month;
SQLFiddle example
(You get bonus points for using a date format that sqlite date and time functions understand; too many people use "MM/DD/YYYY" or the like.)
Edit: If you have a lot of data and run this frequently, you might want to use a computed index to speed it up:
CREATE INDEX sales_idx_monthly ON sales(strftime('%Y-%m', date));
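Wired into the question's own Python snippet, that would look roughly like this (keeping the question's test.db database and cash table; the column aliases are just for readability):

import sqlite3

conn = sqlite3.connect("test.db")
cur = conn.cursor()

# One row per year-month with the summed amount.
cur.execute("""
    SELECT strftime('%Y-%m', `DATE`) AS sales_month,
           SUM(AMOUNT)               AS total_sales
    FROM `cash`
    GROUP BY sales_month
    ORDER BY sales_month
""")

for month, total in cur.fetchall():
    print(month, total)

conn.close()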
a "group by" instead the where part should work
GROUP BY strftime('%Y',`DATE`), strftime('%m',`DATE`)
