I want to ask a conceptual question.
I have a table that looks like this:
UPC_CODE A_PRICE A_QTY DATE COMPANY_CODE A_CAT
1001 100.25 2 2021-05-06 1 PB
1001 2122.75 10 2021-05-01 1 PB
1002 212.75 5 2021-05-07 2 PT
1002 3100.75 10 2021-05-01 2 PB
I want that, for each UPC_CODE and COMPANY_CODE, the row with the latest DATE is picked up.
To achieve this, I have a SQL version and a Python version.
Using SQL:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY UPC_CODE, COMPANY_CODE ORDER BY DATE DESC) rn
FROM yourTable)
SELECT UPC_CODE, A_PRICE, A_QTY, DATE, COMPANY_CODE, A_CAT
FROM cte
WHERE rn = 1;
Using Python:
df = df.groupby(['UPC_CODE','COMPANY_CODE']).agg(
        Date=('DATE','max'), A_PRICE=('A_PRICE','first'),
        A_QTY=('A_QTY','first'), A_CAT=('A_CAT','first')).reset_index()
Ideally I should be getting the following resultant table:
UPC_CODE A_PRICE A_QTY DATE COMPANY_CODE A_CAT
1001 100.25 2 2021-05-06 1 PB
1002 212.75 5 2021-05-07 2 PT
However, the SQL gives me the table above, while the Python does not.
What am I missing here?
You can convert the DATE column to datetime, group the dataframe, and use rank(method='first', ascending=False) on the dates (i.e. rank them in descending order within each group), then keep the rows where df['rn'] == 1:
df['DATE'] = pd.to_datetime(df['DATE'])
df['rn'] = df.groupby(['UPC_CODE','COMPANY_CODE'])['DATE'].rank(method='first', ascending=False)
print(df[df['rn'] == 1])
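Alternatively, if you want to mirror the SQL ROW_NUMBER() approach more directly, here is a minimal sketch (assuming the frame is called df and has the upper-case column names from the question): sort by DATE descending and keep one row per UPC_CODE/COMPANY_CODE pair.

import pandas as pd

# make sure DATE is a real datetime so the sort is chronological
df['DATE'] = pd.to_datetime(df['DATE'])

# newest date first, then keep the first row of every group --
# the pandas equivalent of ROW_NUMBER() ... ORDER BY DATE DESC ... WHERE rn = 1
latest = (df.sort_values('DATE', ascending=False)
            .drop_duplicates(subset=['UPC_CODE', 'COMPANY_CODE'], keep='first')
            .reset_index(drop=True))
print(latest)

This keeps the whole row (A_PRICE, A_QTY, A_CAT) belonging to the latest date, which the groupby/agg attempt with 'first' does not guarantee: 'first' simply takes the first value that appears in each group, regardless of date order.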
Problem
I have a dataframe df:
Index Client_ID Date
1 johndoe 2019-01-15
2 johndoe 2015-11-25
3 pauldoe 2015-05-26
And I have another dataframe df_prod, with products like this:
Index Product-Type Product-Date Buyer Price
1 A 2020-01-01 pauldoe 300
2 A 2018-01-01 pauldoe 200
3 A 2019-01-01 johndoe 600
4 A 2017-01-01 johndoe 800
5 A 2020-11-05 johndoe 100
6 B 2014-12-12 johndoe 200
7 B 2016-11-15 johndoe 300
What I want is to add a column to df that sums the Prices of the last product of each type known at the current date (with Product-Date <= df.Date). An example is the best way to explain:
For the first row of df
1 johndoe 2019-01-15
The last A-Product known at this date bought by johndoe is this one:
3 A 2019-01-01 johndoe 600
(since the 4th one is older, and the 5th one has a Product-Date > Date)
The last B-Product known at this date bought by johndoe is this one:
7 B 2016-11-15 johndoe 300
So the row in df, after transformation, will look like this (900 being 600 + 300, the prices of the 2 products of interest):
1 johndoe 2019-01-15 900
The full df after transformation will then be:
Index Client_ID Date LastProdSum
1 johndoe 2019-01-15 900
2 johndoe 2015-11-25 200
3 pauldoe 2015-05-26 0
As you can see, there are multiple possibilities:
Buyers didn't necessarily buy all product types (see pauldoe, who only bought A-products)
Sometimes, no product is known at df.Date (see row 3 of the new df: in 2015, we don't know of any product bought by pauldoe)
Sometimes, only one product is known at df.Date, and the value is that product's price (see row 2 of the new df: in 2015, we only know one product for johndoe, a B-product bought in 2014, whose price is 200)
What I did:
I found a solution to this problem, but it takes far too long to run, since my dataframe is huge.
For that, I iterate over the rows of df with iterrows, select from df_prod the products linked to the Buyer with Product-Date < Date, keep the most recent one per Product-Type by grouping and taking the max date, and finally sum the prices of those products.
Solving the problem by iterating over each row (a for loop over iterrows) and extracting, for each row of df, a slice of df_prod to work on before computing the sum makes it really slow.
I'm almost sure there's a better way to solve the problem with pandas functions (pivot, for example), but I couldn't find it despite searching a lot.
Thanks in advance for your help
Edit after Dani's answer
Thanks a lot for your answer. It looks really good, I accepted it since you spent a lot of time on it.
Execution is still pretty slow, though, because of something I didn't specify.
In fact, Product-Types are not shared across buyers: each buyer has their own set of product types. The true way to see the data is like this:
Index Product-Type Product-Date Buyer Price
1 pauldoe-ID1 2020-01-01 pauldoe 300
2 pauldoe-ID1 2018-01-01 pauldoe 200
3 johndoe-ID2 2019-01-01 johndoe 600
4 johndoe-ID2 2017-01-01 johndoe 800
5 johndoe-ID2 2020-11-05 johndoe 100
6 johndoe-ID3 2014-12-12 johndoe 200
7 johndoe-ID3 2016-11-15 johndoe 300
As you can understand, product types are not shared across different buyers (it can happen, but only in really rare situations that we won't consider here).
The problem remains the same: since you want to sum prices, you add the prices of the last occurrences of johndoe-ID2 and johndoe-ID3 to get the same final result row
1 johndoe 2019-01-15 900
But as you now understand, there are actually more Product-Types than Buyers, so the "get unique product types" step from your answer, which looked pretty fast on the initial problem, actually takes a lot of time.
Sorry for being unclear on this point; I didn't think of the possibility of creating a new df based on product types.
The main idea is to use merge_asof to fetch the last date for each Product-Type and Client_ID, so do the following:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
                      df_prod.sort_values(['Product-Date', 'Buyer']),
                      left_on='Date',
                      right_on='Product-Date',
                      left_by=['Client_ID', 'Product-Type'], right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(columns={'Price' : 'LastProdSum'})
print(res)
Output
Client_ID Date LastProdSum
0 johndoe 2015-11-25 200.0
1 johndoe 2019-01-15 900.0
2 pauldoe 2015-05-26 0.0
The problem is that merge_asof only returns one match per left-hand row, so to pick up the last product of every type we need one row per (Client_ID, Product-Type) pair. These pairs are the cartesian product of Client_ID and Product-Type, which is built in:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
Finally do a groupby and sum the Price, after first doing a fillna to fill the missing values.
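To see what the explode step actually builds, here is a small sketch on a toy frame (same shape as the question's df; the values are just for illustration). Each client row is repeated once per Product-Type, which gives merge_asof a left-hand row for every (Client_ID, Product-Type) pair; merge_asof's default direction='backward' then picks the last df_prod row with Product-Date <= Date, which is exactly the "last product known at the current date" rule.

import pandas as pd

df = pd.DataFrame({'Client_ID': ['johndoe', 'johndoe', 'pauldoe'],
                   'Date': pd.to_datetime(['2019-01-15', '2015-11-25', '2015-05-26'])})
product_types = ['A', 'B']

# one row per client and date becomes one row per client, date and product type
df['Product-Type'] = [product_types for _ in range(len(df))]
print(df.explode('Product-Type'))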
UPDATE
You could try:
# get unique product types
product_types = df_prod.groupby('Buyer')['Product-Type'].apply(lambda x: list(set(x)))
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = df['Client_ID'].map(product_types)
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
                      df_prod.sort_values(['Product-Date', 'Buyer']),
                      left_on='Date',
                      right_on='Product-Date',
                      left_by=['Client_ID', 'Product-Type'], right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(columns={'Price' : 'LastProdSum'})
print(res)
The idea here is to change how you generate the unique values.
I am trying to identify only the first orders of unique "items" purchased by "test" customers, using the simplified sample dataframe created below:
df=pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900', 'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
"date":['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016','03/01/2016',
'04/30/2016', '05/16/2016','09/27/2016', '04/20/2016','04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016' ],
"item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A', 'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21',
'BG10A', 'CG10BA', 'BG10A']
})
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending = True)
The desired output would have all unique items ranked by date of purchase in a new column called "cust_item_rank", with any repeated (duplicated) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should have the same order/rank; for example, for customer A55, A10BABA and A10DBDB are both ranked 1.
I have spent a fair bit of time using a combination of group by and/or rank operations but have been unsuccessful thus far. As an example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
Yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-01-11 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this question, I built upon the excellent initial answer by cs95 and called on the rank function in pandas as follows:
#remove duplicates as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
.reset_index(drop=True))
# rank by date after grouping by customer
df2["cust_item_rank"]= df2.groupby(["cust"])["date"].rank(ascending=1,method='dense').astype(int)
This produced the desired output, with items purchased on the same date by the same customer sharing the same rank.
This problem can be solved using either the "min" or the "dense" ranking method; I chose "dense" to avoid skipping any ranks.
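For anyone unsure of the difference between the two methods, here is a tiny sketch (toy dates, not the question's data) showing why "dense" avoids gaps after ties:

import pandas as pd

dates = pd.to_datetime(pd.Series(['2016-01-11', '2016-01-11', '2016-03-05']))
print(dates.rank(method='min'))    # 1.0, 1.0, 3.0 -- rank 2 is skipped after the tie
print(dates.rank(method='dense'))  # 1.0, 1.0, 2.0 -- no gap after the tie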
I have these few lines of code that sum up the total value for a single month, but what I want is to sum up each month and print them all. When I change the month value it works fine, but I want the total value of every month, for example:
january = 200
february = 240
march = 310
....
december = 8764
taking the year into consideration as well.
CODE
import sqlite3
conn = sqlite3.connect("test.db")
cur = conn.cursor()
cur.execute("SELECT SUM(AMOUNT) FROM `cash` WHERE strftime('%m', `DATE`) = '10'")
rows = cur.fetchall()
for row in rows:
    print(row)
The table has the columns NAME, DATE and AMOUNT, and DATE is in this format: 2019-05-23, 2016-05-30.
NAME DATE AMOUNT
JOE 2018-01-23 50.00
BEN 2018-01-21 61.00
FRED 2018-02-23 31.00
FRED 2018-02-03 432.00
DAN 2018-03-23 69.00
FRED 2018-03-23 61.00
BRYAN 2018-04-21 432.00
FELIX 2018-04-25 907.00
.......................
......................
Yeah, you need GROUP BY. Something like:
SELECT strftime('%Y-%m', date) AS sales_month
, sum(amount) AS total_sales
FROM sales
GROUP BY sales_month
ORDER BY sales_month;
(You get bonus points for using a date format that SQLite's date and time functions understand; too many people use "MM/DD/YYYY" or the like.)
Edit: If you have a lot of data and run this frequently, you might want to use a computed index to speed it up:
CREATE INDEX sales_idx_monthly ON sales(strftime('%Y-%m', date));
a "group by" instead the where part should work
GROUP BY strftime('%Y',`DATE`), strftime('%m',`DATE`)
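Putting that together with the Python snippet from the question, here is a minimal sketch (assuming the same test.db and cash table) that prints a total for every month of every year:

import sqlite3

conn = sqlite3.connect("test.db")
cur = conn.cursor()
# group by year and month so each month gets its own total,
# and the year is taken into account as well
cur.execute("""
    SELECT strftime('%Y', `DATE`) AS year,
           strftime('%m', `DATE`) AS month,
           SUM(AMOUNT) AS total
    FROM `cash`
    GROUP BY year, month
    ORDER BY year, month
""")
for year, month, total in cur.fetchall():
    print(year, month, total)
conn.close()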
NOTE: I'm looking for help with an efficient way to do this, other than a mega join followed by calculating the difference between dates.
I have table1 with a country ID and a date (no duplicates of these values), and I want to summarize information from table2, which has country, date, cluster_x and a count variable (where cluster_x is cluster_1, cluster_2, cluster_3). For each cluster, I want to append to table1 the summed count from table2 for the rows whose date falls within the 30 days prior to the date in table1.
I believe this is simple in SQL; how do I do it in pandas?
select a.date,a.country,
sum(case when a.date - b.date between 1 and 30 then b.cluster_1 else 0 end) as cluster1,
sum(case when a.date - b.date between 1 and 30 then b.cluster_2 else 0 end) as cluster2,
sum(case when a.date - b.date between 1 and 30 then b.cluster_3 else 0 end) as cluster3
from table1 a
left outer join table2 b
on a.country=b.country
group by a.date,a.country
EDIT:
Here is a somewhat altered example. Say this is table1, an aggregated data set with date, country, cluster and count. Below it is the "query" dataset (table2). In this case we want to sum the count field from table1 for cluster1, cluster2, cluster3 (there are actually 100 of them) corresponding to the country id, as long as the date field in table1 is within the 30 days prior.
So for example, the first row of the query dataset has date 2015-02-01 and country 1. In table1, there is only one row within the 30 days prior, and it is for cluster 2 with count 2.
Here is a dump of the two tables in CSV:
date,country,cluster,count
2014-01-30,1,1,1
2015-02-03,1,1,3
2015-01-30,1,2,2
2015-04-15,1,2,5
2015-03-01,2,1,6
2015-07-01,2,2,4
2015-01-31,2,3,8
2015-01-21,2,1,2
2015-01-21,2,1,3
and table2:
date,country
2015-02-01,1
2015-04-21,1
2015-02-21,2
Edit: Oops - I wish I had seen that edit about the join before submitting. No problem, I'll leave this up since it was fun practice. Critiques welcome.
Assuming table1 and table2 are located in the same directory as this script, as "table1.csv" and "table2.csv", this should work.
I didn't get the same result as your examples with 30 days - had to bump it to 31 days, but I think the spirit is here:
import pandas as pd
import numpy as np
table1_path = './table1.csv'
table2_path = './table2.csv'
with open(table1_path) as f:
    table1 = pd.read_csv(f)
table1.date = pd.to_datetime(table1.date)
with open(table2_path) as f:
    table2 = pd.read_csv(f)
table2.date = pd.to_datetime(table2.date)
joined = pd.merge(table2, table1, how='outer', on=['country'])
joined['datediff'] = joined.date_x - joined.date_y
filtered = joined[(joined.datediff >= np.timedelta64(1, 'D')) & (joined.datediff <= np.timedelta64(31, 'D'))]
gb_date_x = filtered.groupby(['date_x', 'country', 'cluster'])
summed = pd.DataFrame(gb_date_x['count'].sum())
result = summed.unstack()
result.reset_index(inplace=True)
result.fillna(0, inplace=True)
My test output:
ipdb> table1
date country cluster count
0 2014-01-30 00:00:00 1 1 1
1 2015-02-03 00:00:00 1 1 3
2 2015-01-30 00:00:00 1 2 2
3 2015-04-15 00:00:00 1 2 5
4 2015-03-01 00:00:00 2 1 6
5 2015-07-01 00:00:00 2 2 4
6 2015-01-31 00:00:00 2 3 8
7 2015-01-21 00:00:00 2 1 2
8 2015-01-21 00:00:00 2 1 3
ipdb> table2
date country
0 2015-02-01 00:00:00 1
1 2015-04-21 00:00:00 1
2 2015-02-21 00:00:00 2
...
ipdb> result
date_x country count
cluster 1 2 3
0 2015-02-01 00:00:00 1 0 2 0
1 2015-02-21 00:00:00 2 5 0 8
2 2015-04-21 00:00:00 1 0 5 0
UPDATE:
I think it doesn't make much sense to use pandas for processing data that can't fit into memory. Of course there are some tricks for dealing with that, but it's painful.
If you want to process your data efficiently, you should use a proper tool for it.
I would recommend having a closer look at Apache Spark SQL, where you can process distributed data on multiple cluster nodes with far more memory, processing power and IO than a single-machine pandas approach.
Alternatively, you can try an RDBMS like Oracle DB (very expensive, especially the software licences, and its free version is full of limitations) or free alternatives like PostgreSQL (I can't say much about it, for lack of experience) or MySQL (not as powerful as Oracle; for example, there is no native/clean solution for dynamic pivoting, which you will most probably want to use).
OLD answer:
You can do it this way (explanations are in the code comments):
#
# <setup>
#
import pandas as pd
import numpy as np

dates1 = pd.date_range('2016-03-15', '2016-04-15')
dates2 = ['2016-02-01', '2016-05-01', '2016-04-01', '2015-01-01', '2016-03-20']
dates2 = [pd.to_datetime(d) for d in dates2]
countries = ['c1', 'c2', 'c3']
t1 = pd.DataFrame({
    'date': dates1,
    'country': np.random.choice(countries, len(dates1)),
    'cluster': np.random.randint(1, 4, len(dates1)),
    'count': np.random.randint(1, 10, len(dates1))
})
t2 = pd.DataFrame({'date': np.random.choice(dates2, 10), 'country': np.random.choice(countries, 10)})
#
# </setup>
#
# merge two DFs by `country`
merged = pd.merge(t1.rename(columns={'date':'date1'}), t2, on='country')
# filter dates and drop 'date1' column
merged = merged[(merged.date <= merged.date1 + pd.Timedelta('30days'))
                & (merged.date >= merged.date1)].drop(['date1'], axis=1)
# group `merged` DF by ['country', 'date', 'cluster'],
# sum up `counts` for overlapping dates,
# reset the index,
# pivot: convert `cluster` values to columns,
# taking sum's of `count` as values,
# NaN's will be replaced with zeroes
# and finally reset the index
r = (merged.groupby(['country', 'date', 'cluster'])
           .sum()
           .reset_index()
           .pivot_table(index=['country', 'date'],
                        columns='cluster',
                        values='count',
                        aggfunc='sum',
                        fill_value=0)
           .reset_index())
# rename numeric columns to: 'cluster_N'
rename_cluster_cols = {x: 'cluster_{0}'.format(x) for x in t1.cluster.unique()}
r = r.rename(columns=rename_cluster_cols)
Output (for my datasets):
In [124]: r
Out[124]:
cluster country date cluster_1 cluster_2 cluster_3
0 c1 2016-04-01 8 0 11
1 c2 2016-04-01 0 34 22
2 c3 2016-05-01 4 18 36
Sorry, I apologise in advance - I've just started learning Python and am trying to get something working.
OK, the dataset is:
Buy, typeid, volume, issued, duration, Volume Entered,Minimum Volume, range, price, locationid, locationname
SELL 20 2076541 2015-09-12T06:31:13 90 2076541 1 region 331.21 60008494 Amarr
SELL 20 194642 2015-09-07T19:36:49 90 194642 1 region 300 60008494 Amarr
SELL 20 2320 2015-09-13T07:48:54 3 2320 1 region 211 60008491 Irnin
I would like to filter for a specific location, either by name or ID (it doesn't matter which), and then pick the minimum price for that location. Preferably hardcoded, since there are only a few locations I'm interested in, e.g. locationid = 60008494.
I see you can do two conditions on one line, but I don't see how to apply it.
So I'm trying to nest it.
It doesn't have to be pandas; it's just the first thing I found that did one part of what I required.
The code I've got so far only does the minimum-price part of what I'm trying to achieve:
import pandas as pd

data = pd.read_csv('orders.csv')
length = len(data['typeid'].unique())
res = pd.DataFrame(columns=('Buy', 'typeid', 'volume', 'duration', 'volumeE',
                            'Minimum', 'range', 'price', 'locationid', 'locationname'))
for i in range(0, length):
    name_filter = data[data['typeid'] == data['typeid'].unique()[i]]
    price_min_filter = name_filter[name_filter['price'] == name_filter['price'].min()]
    res = res.append(price_min_filter, ignore_index=True)
res.to_csv('format.csv')  # writes output to csv
print("Complete")
UPDATED.
OK, so for the latest part, it seems like the following code is the direction I should be going in. If I could get s to hold typeid, locationid and price, that would be perfect. I've written what I want to do; what's the correct syntax to get that in Python? Sorry, I'm used to Excel and SQL.
import pandas as pd
df = pd.read_csv('orders.csv')
df[df['locationid'] ==60008494]
s= df.groupby(['typeid'])['price'].min()
s.to_csv('format.csv')
If what you really want is -
I would like to filter for a specific location, either by name or ID (it doesn't matter which), and then pick the minimum price for that location. Preferably hardcoded, since there are only a few locations I'm interested in, e.g. locationid = 60008494.
You can simply filter the df on the locationid first and then use ['price'].min(). Example -
In [1]: import pandas as pd
In [2]: s = """Buy,typeid,volume,issued,duration,Volume Entered,Minimum Volume,range,price,locationid,locationname
...: SELL,20,2076541,2015-09-12T06:31:13,90,2076541,1,region,331.21,60008494,Amarr
...: SELL,20,194642,2015-09-07T19:36:49,90,194642,1,region,300,60008494,Amarr
...: SELL,20,2320,2015-09-13T07:48:54,3,2320,1,region,211,60008491,Irnin"""
In [3]: import io
In [4]: df = pd.read_csv(io.StringIO(s))
In [5]: df
Out[5]:
Buy typeid volume issued duration Volume Entered \
0 SELL 20 2076541 2015-09-12T06:31:13 90 2076541
1 SELL 20 194642 2015-09-07T19:36:49 90 194642
2 SELL 20 2320 2015-09-13T07:48:54 3 2320
Minimum Volume range price locationid locationname
0 1 region 331.21 60008494 Amarr
1 1 region 300.00 60008494 Amarr
2 1 region 211.00 60008491 Irnin
In [8]: df[df['locationid']==60008494]['price'].min()
Out[8]: 300.0
If you want to do it for all the locationids, then, as said in the other answer, you can use DataFrame.groupby and then take the ['price'] column of the grouped object and call .min(). Example -
data = pd.read_csv('orders.csv')
data.groupby(['locationid'])['price'].min()
Demo -
In [9]: df.groupby(['locationid'])['price'].min()
Out[9]:
locationid
60008491 211
60008494 300
Name: price, dtype: float64
To get the complete row that has the minimum value in each group, you can use idxmin() to get the index of the minimum value and then pass it to df.loc to select those rows. Example -
In [9]: df.loc[df.groupby(['locationid'])['price'].idxmin()]
Out[9]:
Buy typeid volume issued duration Volume Entered \
2 SELL 20 2320 2015-09-13T07:48:54 3 2320
1 SELL 20 194642 2015-09-07T19:36:49 90 194642
Minimum Volume range price locationid locationname
2 1 region 211 60008491 Irnin
1 1 region 300 60008494 Amarr
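As a follow-up to the update in the question: if the end goal is a CSV with just typeid, locationid and price for one hard-coded location, here is a minimal sketch (assuming the same orders.csv and column names as above):

import pandas as pd

df = pd.read_csv('orders.csv')
# keep the result of the filter -- filtering alone does not modify df in place
df = df[df['locationid'] == 60008494]
# cheapest price per typeid at that location
out = df.groupby(['typeid', 'locationid'], as_index=False)['price'].min()
out.to_csv('format.csv', index=False)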
If I understand your question correctly, you really won't need much more than DataFrame.groupby(). As an example, you can group the dataframe by locationname, select the price column from the resulting groupby object, then use the min() method to output the minimum value for each group:
data.groupby('locationname')['price'].min()
which will give you the minimum value of price for each group. So it will look something like:
locationname
Amarr 300
Irnin 211
Name: price, dtype: float64