Python pandas multiple conditions

Sorry, I apologise in advance; I've just started learning Python and I'm trying to get something working.
OK, the dataset is:
Buy, typeid, volume, issued, duration, Volume Entered,Minimum Volume, range, price, locationid, locationname
SELL 20 2076541 2015-09-12T06:31:13 90 2076541 1 region 331.21 60008494 Amarr
SELL 20 194642 2015-09-07T19:36:49 90 194642 1 region 300 60008494 Amarr
SELL 20 2320 2015-09-13T07:48:54 3 2320 1 region 211 60008491 Irnin
I would like to filter for a specific location, either by name or ID (it doesn't matter which), and then pick the minimum price for that location. Preferably hardcoded, since there are only a few locations I'm interested in, e.g. locationid = 60008494.
I see you can do two conditions on one line, but I don't see how to apply it, so I'm trying to nest it.
It doesn't have to be pandas; it just seems to be the first thing I found that did one part of what I required.
The code I've got so far is below, and it only does the minimum part of what I'm looking to achieve.
import pandas as pd

data = pd.read_csv('orders.csv')
length = len(data['typeid'].unique())
res = pd.DataFrame(columns=('Buy', 'typeid', 'volume', 'duration', 'volumeE',
                            'Minimum', 'range', 'price', 'locationid', 'locationname'))
for i in range(length):
    # rows belonging to the i-th unique typeid
    name_filter = data[data['typeid'] == data['typeid'].unique()[i]]
    # of those, keep the row(s) with the minimum price
    price_min_filter = name_filter[name_filter['price'] == name_filter['price'].min()]
    res = res.append(price_min_filter, ignore_index=True)
res.to_csv('format.csv')  # writes output to csv
print("Complete")
UPDATED.
OK, so it seems like the following code is the direction I should be going in. If I could have s = typeid, locationid and price, that's perfect. I've written out what I want to do; what's the correct syntax to get that in Python? Sorry, I'm used to Excel and SQL.
import pandas as pd
df = pd.read_csv('orders.csv')
df[df['locationid'] ==60008494]
s= df.groupby(['typeid'])['price'].min()
s.to_csv('format.csv')

If what you really want is -
I would like to filter for a specific location, either by name or ID (it doesn't matter which), and then pick the minimum price for that location. Preferably hardcoded, since there are only a few locations I'm interested in, e.g. locationid = 60008494.
You can simply filter the df on the locationid first and then use ['price'].min(). Example -
In [1]: import pandas as pd
In [2]: s = """Buy,typeid,volume,issued,duration,Volume Entered,Minimum Volume,range,price,locationid,locationname
...: SELL,20,2076541,2015-09-12T06:31:13,90,2076541,1,region,331.21,60008494,Amarr
...: SELL,20,194642,2015-09-07T19:36:49,90,194642,1,region,300,60008494,Amarr
...: SELL,20,2320,2015-09-13T07:48:54,3,2320,1,region,211,60008491,Irnin"""
In [3]: import io
In [4]: df = pd.read_csv(io.StringIO(s))
In [5]: df
Out[5]:
Buy typeid volume issued duration Volume Entered \
0 SELL 20 2076541 2015-09-12T06:31:13 90 2076541
1 SELL 20 194642 2015-09-07T19:36:49 90 194642
2 SELL 20 2320 2015-09-13T07:48:54 3 2320
Minimum Volume range price locationid locationname
0 1 region 331.21 60008494 Amarr
1 1 region 300.00 60008494 Amarr
2 1 region 211.00 60008491 Irnin
In [8]: df[df['locationid']==60008494]['price'].min()
Out[8]: 300.0
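Since the question is about multiple conditions: if you also wanted to restrict to a particular typeid on the same line, a hedged sketch (using the column names from the sample above) is to combine the boolean masks with & and parentheses -
df[(df['locationid'] == 60008494) & (df['typeid'] == 20)]['price'].min()  # 300.0 for this sample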
If you want to do it for all the locationids, then, as said in the other answer, you can use DataFrame.groupby for that, take the ['price'] column for the group you want, and use .min(). Example -
data = pd.read_csv('orders.csv')
data.groupby(['locationid'])['price'].min()
Demo -
In [9]: df.groupby(['locationid'])['price'].min()
Out[9]:
locationid
60008491 211
60008494 300
Name: price, dtype: float64
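And if, as in your update, you want typeid, locationid and price together, a possible variant of the same idea (a sketch, not tested against your full file) is to group by both columns, or to filter to your hardcoded location first and then group by typeid -
df.groupby(['locationid', 'typeid'])['price'].min()
# or, for just the one location:
df[df['locationid'] == 60008494].groupby('typeid')['price'].min().to_csv('format.csv')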
For getting the complete row which has minimum values in the corresponding groups, you can use idxmin() to get the index for the minimum value and then pass it to df.loc to get those rows. Example -
In [9]: df.loc[df.groupby(['locationid'])['price'].idxmin()]
Out[9]:
Buy typeid volume issued duration Volume Entered \
2 SELL 20 2320 2015-09-13T07:48:54 3 2320
1 SELL 20 194642 2015-09-07T19:36:49 90 194642
Minimum Volume range price locationid locationname
2 1 region 211 60008491 Irnin
1 1 region 300 60008494 Amarr

If I understand your question correctly, you really won't need to do much more than a DataFrame.groupby(). As an example, you can group the dataframe by locationname, then select the price column from the resulting groupby object, then use the min() method to output the minimum value for each group:
data.groupby('locationname')['price'].min()
which will give you the minimum value of price for each group. So it will look something like:
locationname
Amarr 300
Irnin 211
Name: price, dtype: float64

Related

Python dataframe returning closest value above specified input in one row (pivot_table)

I have the following DataFrame, output_excel, containing inventory data and sales data for different products. See the DataFrame below:
Product 2022-04-01 2022-05-01 2022-06-01 2022-07-01 2022-08-01 2022-09-01 AvgMonthlySales Current Inventory
1 BE37908 1500 1400 1200 1134 1110 1004 150.208333 1500
2 BE37907 2000 1800 1800 1540 1300 1038 189.562500 2000
3 DE37907 5467 5355 5138 4926 4735 4734 114.729167 5467
Please note that in my example, today's date is 2022-04-01, so all inventory numbers for the months May through September are predicted values, while AvgMonthlySales is the mean of actual, past sales for that specific product. The current inventory just displays today's value.
I also have another dataframe, df2, containing the lead time, the same sales data, and the calculated security stock for the same products. The formula for the security stock is ((leadtime in weeks / 4) + 1) * AvgMonthlySales:
Product AvgMonthlySales Lead time in weeks Security Stock
1 BE37908 250.208333 16 1251.04166
2 BE37907 189.562500 24 1326.9375
3 DE37907 114.729167 10 401.552084
What I am trying to achieve:
I want to create a new dataframe, which tells me how many months are left until our inventory drops below the security stock. For example, for the first product, BE37908, the security stock is ~1251 units, and by 2022-06-01 our inventory will drop below that number. So I want to return 2022-05-01, as this is the last month where our inventories are projected to be above the security stock. The whole output should look something like this:
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
Please also note that the timeframe for the projections (the columns) can be set by the user, so we couldn't just select columns 2 through 7. However, the Product column will always be the first one, and the AvgMonthlySales and the Current Inventory columns will always be the last two.
To recap, I want to return the column with the smallest value above the security stock for each product. I have an idea on how to do that by column using argsort, but not by row. What is the best way to achieve this? Any tips?
You could try as follows:
# assumed import for this snippet
import numpy as np

# create list with columns with dates
cols = [col for col in df.columns if col.startswith('20')]
# select cols, apply df.gt row-wise, sum and subtract 1
idx = df.loc[:, cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)
# get the correct dates from the cols
# if the value == len(cols)-1, *all* values will have been greater, so: np.nan
idx = [cols[i] if i != len(cols) - 1 else np.nan for i in idx]
out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = idx
print(out)
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
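For reference, a self-contained sketch of the same approach, with the two frames rebuilt from the example tables above (only the date columns and the Security Stock column are needed for this step; everything else is trimmed, so adjust to your real data) -
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product': ['BE37908', 'BE37907', 'DE37907'],
    '2022-04-01': [1500, 2000, 5467],
    '2022-05-01': [1400, 1800, 5355],
    '2022-06-01': [1200, 1800, 5138],
    '2022-07-01': [1134, 1540, 4926],
    '2022-08-01': [1110, 1300, 4735],
    '2022-09-01': [1004, 1038, 4734],
})
df2 = pd.DataFrame({
    'Product': ['BE37908', 'BE37907', 'DE37907'],
    'Security Stock': [1251.04166, 1326.9375, 401.552084],
})

cols = [col for col in df.columns if col.startswith('20')]
# count how many months stay above the security stock; minus 1 gives the index of the last such column
idx = df.loc[:, cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)
idx = [cols[i] if i != len(cols) - 1 else np.nan for i in idx]
out = df[['Product']].copy()
out['Last Date Above Security Stock'] = idx
print(out)
Note that the row-wise comparison relies on df and df2 sharing the same row order/index.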

Filter for most recent event by group with pandas

I'm trying to filter a pandas dataframe so that I'm able to get the most recent data point for each account number in the dataframe.
Here is an example of what the data looks like.
I'm looking for an output of one instance of an account with the product and most recent date.
account_number product sale_date
0 123 rental 2021-12-01
1 423 rental 2021-10-01
2 513 sale 2021-11-02
3 123 sale 2022-01-01
4 513 sale 2021-11-30
I was trying to use groupby and idxmax() but it doesn't work with dates.
And I did want to change the dtype away from datetime.
data_grouped = data.groupby('account_number')['sale_date'].max().idxmax()
Any ideas would be awesome.
To retain a subsetted data frame, consider sorting by account number and descending sale date, then calling DataFrame.groupby().head (which, unlike DataFrame.groupby().first(), will return NaNs if they appear in the first row of a group):
data_grouped = (
    data.sort_values(
        ["account_number", "sale_date"], ascending=[True, False]
    )
    .reset_index(drop=True)
    .groupby("account_number")
    .head(1)
)
It seems the sale_date column has strings. If you convert it to datetime dtype, then you can use groupby + idxmax:
df['sale_date'] = pd.to_datetime(df['sale_date'])
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()]
Output:
account_number product sale_date
3 123 sale 2022-01-01
1 423 rental 2021-10-01
4 513 sale 2021-11-30
Would the keyword 'first' work? That would be:
data.groupby('account_number')['sale_date'].first()
You want the last keyword in order to get the most recent date after grouping, like this:
df.groupby(by=["account_number"])["sale_date"].last()
which will provide this output:
account_number
123 2022-01-01
423 2021-10-01
513 2021-11-30
Name: sale_date, dtype: datetime64[ns]
It is unclear why you want to transition away from using the datetime dtype, but you need it in order to correctly sort for the value you are looking for. Consider doing this as an intermediate step, then reformatting the column after processing.
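If you do go that route, a small hedged sketch of the round trip (column names as in the question; the output format string is just an assumption) could be:
df['sale_date'] = pd.to_datetime(df['sale_date'])            # datetime for correct ordering
out = df.loc[df.groupby('account_number')['sale_date'].idxmax()].copy()
out['sale_date'] = out['sale_date'].dt.strftime('%Y-%m-%d')  # back to plain strings afterwards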
I'll change my answer to use @Daniel Weigelbut's answer... and also add here that you can apply .nth(n) to find the nth value for the general case (-1 for the most recent date).
new_data = data.groupby('account_number')['sale_date'].nth(-1)
My previous suggestion of creating a sorted multi index with
data.set_index(['account_number', 'sale_date'], inplace = True)
data_sorted = data.sort_index(level = [0, 1])
still works and might be more useful for any more complex sorting. As others have said, make sure your date strings are datetime objects if you sort like this.
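For completeness, a hedged sketch of pulling the most recent row per account out of that sorted multi index (assuming sale_date was converted to datetime beforehand):
most_recent = data_sorted.groupby(level='account_number').tail(1)  # last row per account = latest sale_date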

Pandas subtract above row

Basically, this is the challenge I have:
I have a data set with a time range and unique IDs, and what I need to do is find whether an ID is duplicated within a date range.
123 transaction 1/1/2021
345 transaction 1/1/2021
123 transaction 1/2/2021
123 transaction 1/20/2021
I want to return 1 for ID 123 because the duplicate transaction is within a range of 7 days.
I can do this with Excel, and I added some more date ranges depending on the day, for example a Wednesday range of up to 6 days, Thursday 5 days, Friday 4 days. But I have no idea how to accomplish this with pandas...
The reason I want to do this with pandas is that each data set has up to 1M rows, which takes forever in Excel, and on top of that I need to split by category, so it's a lot of manual work.
Are there any recommendations or ideas on how to accomplish this task?
The df:
from io import StringIO

import pandas as pd

df = pd.read_csv(StringIO(
    """id,trans_date
123,1/1/2021
345,1/1/2021
123,1/2/2021
123,1/20/2021
345,1/3/2021
"""
))  # added an extra record for the demo
df
id trans_date
0 123 1/1/2021
1 345 1/1/2021
2 123 1/2/2021
3 123 1/20/2021
4 345 1/3/2021
df['trans_date'] = pd.to_datetime(df['trans_date'])
As you have to look at each of the ids separately, you can group by id and then get the maximum and minimum dates; if the difference is greater than 7 days, the result would be 1, otherwise 0.
result = df.groupby('id')['trans_date'].apply(
    lambda x: True if (x.max() - x.min()).days > 7 else False)
result
id
123 True
345 False
Name: trans_date, dtype: bool
If you just need the required ids, then
result.index[result].values
array([123])
The context and data you've provided about your situation are scanty, but you can probably do something like this:
>>> df
id type date
0 123 transaction 2021-01-01
1 345 transaction 2021-01-01
2 123 transaction 2021-01-02
3 123 transaction 2021-01-20
>>> dupes = df.groupby(pd.Grouper(key='date', freq='W'))['id'].apply(pd.Series.duplicated)
>>> dupes
0 False
1 False
2 True
3 False
Name: id, dtype: bool
There, item 2 (the third item) is True because 123 already occurred within the past week.
As far as I can understand the question, I think this is what you need.
from datetime import datetime

import pandas as pd

df = pd.DataFrame({
    "id": [123, 345, 123, 123],
    "name": ["transaction", "transaction", "transaction", "transaction"],
    "date": ["01/01/2021", "01/01/2021", "01/02/2021", "01/10/2021"]
})

def dates_in_range(dates):
    num_days_frame = 6
    processed_dates = sorted([datetime.strptime(date, "%m/%d/%Y") for date in dates])
    difference_in_range = any(
        abs(processed_dates[i] - processed_dates[i - 1]).days < num_days_frame
        for i in range(1, len(processed_dates)))
    return 1 if difference_in_range else 0

group = df.groupby("id")
df_new = group.apply(lambda x: dates_in_range(x["date"]))
print(df_new)
"""
print(df_new)
id
123 1
345 0
"""
Here you first group by the id, so that all the dates for a particular id end up together.
A function is then applied to each group's dates: they are sorted first, and the differences between consecutive items are compared against the defined range. Sorting ensures that checking only consecutive differences is enough to tell whether any two dates are close together.
Finally, if any pair of consecutive sorted dates differs by less than num_days_frame (6) days, we return 1, otherwise 0.
All that being said, this might not be very performant, as each group's dates are sorted separately. One way to avoid that is to sort the entire df first and then apply the group operation, so the dates arrive already sorted; see the sketch below.
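A hedged sketch of that variant, reusing the df built in this answer (the 6-day num_days_frame is kept as an assumption from the code above):
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
df = df.sort_values(['id', 'date'])
# day gaps between consecutive transactions within each id (NaN for the first row of each group)
gaps = df.groupby('id')['date'].diff().dt.days
# flag ids that have at least one gap below the 6-day frame
flagged = (gaps < 6).groupby(df['id']).any().astype(int)
print(flagged)  # 123 -> 1, 345 -> 0 for the df above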

Pandas advanced problem : For each row, get complex info from another dataframe

Problem
I have a dataframe df:
Index Client_ID Date
1 johndoe 2019-01-15
2 johndoe 2015-11-25
3 pauldoe 2015-05-26
And I have another dataframe df_prod, with products like this:
Index Product-Type Product-Date Buyer Price
1 A 2020-01-01 pauldoe 300
2 A 2018-01-01 pauldoe 200
3 A 2019-01-01 johndoe 600
4 A 2017-01-01 johndoe 800
5 A 2020-11-05 johndoe 100
6 B 2014-12-12 johndoe 200
7 B 2016-11-15 johndoe 300
What I want is to add a column to df that sums the Prices of the last product of each type known at the current date (with Product-Date <= df.Date). An example is the best way to explain:
For the first row of df
1 johndoe 2019-01-15
The last A-Product known at this date bought by johndoe is this one :
3 A 2019-01-01 johndoe 600
(since the 4th one is older, and the 5th one has a Product-Date > Date)
The last B-Product known at this date bought by johndoe is this one :
7 B 2016-11-15 johndoe 300
So the row in df, after transformation, will look like this (900 being 600 + 300, the prices of the 2 products of interest):
1 johndoe 2019-01-15 900
The full df after transformation will then be:
Index Client_ID Date LastProdSum
1 johndoe 2019-01-15 900
2 johndoe 2015-11-25 200
3 pauldoe 2015-05-26 0
As you can see, there are multiple possibilities:
Buyers didn't necessarily buy all products (see pauldoe, who only bought A-products).
Sometimes no product is known at df.Date (see row 3 of the new df: in 2015, we don't know of any product bought by pauldoe).
Sometimes only one product is known at df.Date, and the value is simply that product's price (see row 2 of the new df: in 2015, we only know one product for johndoe, a B-product bought in 2014, whose price is 200).
What I did:
I found a solution to this problem, but it takes too much time to be usable, since my dataframe is huge.
I iterate over the rows of df with iterrows; for each row I select, in df_prod, the products linked to the Buyer that have Product-Date < Date, then keep the most recent one per Product-Type by grouping on Product-Type and taking the max date, and finally sum those products' prices.
Solving the problem by iterating on each row (with a for over iterrows) and extracting, for each row of df, a slice of df_prod to work on makes it really slow.
I'm almost sure there is a better way to solve this with pandas functions (pivot, for example), but I couldn't find it. I've been searching a lot.
Thanks in advance for your help.
Edit after Dani's answer
Thanks a lot for your answer. It looks really good; I accepted it since you spent a lot of time on it.
Execution is still pretty long, though, because I didn't specify something.
In fact, Product-Types are not shared across buyers: each buyer has its own set of product types. The true way to see the data is like this:
Index Product-Type Product-Date Buyer Price
1 pauldoe-ID1 2020-01-01 pauldoe 300
2 pauldoe-ID1 2018-01-01 pauldoe 200
3 johndoe-ID2 2019-01-01 johndoe 600
4 johndoe-ID2 2017-01-01 johndoe 800
5 johndoe-ID2 2020-11-05 johndoe 100
6 johndoe-ID3 2014-12-12 johndoe 200
7 johndoe-ID3 2016-11-15 johndoe 300
As you can understand, product types are not shared across different buyers (in fact it can happen, but in really rare situations that we won't consider here).
The problem remains the same: since you want to sum prices, you add the prices of the last occurrences of johndoe-ID2 and johndoe-ID3 to get the same final result row
1 johndoe 2019-01-15 900
But as you now understand, there are actually more Product-Types than Buyers, so the "get unique product types" step from your answer, which looked pretty fast on the initial problem, actually takes a lot of time.
Sorry for being unclear on this point; I didn't think of the possibility of creating a new df based on product types.
The main idea is to use merge_asof to fetch the last date for each Product-Type and Client_ID, so do the following:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
                      df_prod.sort_values(['Product-Date', 'Buyer']),
                      left_on='Date',
                      right_on='Product-Date',
                      left_by=['Client_ID', 'Product-Type'],
                      right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(
    columns={'Price': 'LastProdSum'})
print(res)
Output
Client_ID Date LastProdSum
0 johndoe 2015-11-25 200.0
1 johndoe 2019-01-15 900.0
2 pauldoe 2015-05-26 0.0
The problem is that merge_asof won't work with duplicate values, so we need to create unique values. These new values are the Cartesian product of Client_ID and Product-Type; this part is done in:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
Finally, do a groupby and sum the Price, after first doing a fillna to fill the missing values.
UPDATE
You could try:
# get unique product types per buyer
product_types = df_prod.groupby('Buyer')['Product-Type'].apply(lambda x: list(set(x)))
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = df['Client_ID'].map(product_types)
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
                      df_prod.sort_values(['Product-Date', 'Buyer']),
                      left_on='Date',
                      right_on='Product-Date',
                      left_by=['Client_ID', 'Product-Type'],
                      right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(
    columns={'Price': 'LastProdSum'})
print(res)
The idea here is to change how you generate the unique values.
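One practical note (an assumption about your input, since the dtypes aren't shown): merge_asof requires both frames to be sorted on the on keys, and the nearest-date matching is far more predictable if Date and Product-Date are real datetimes rather than strings, so if they come straight from a CSV it is probably worth converting them first:
df['Date'] = pd.to_datetime(df['Date'])
df_prod['Product-Date'] = pd.to_datetime(df_prod['Product-Date'])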

find first unique items selected by user and ranking them in order of user selection by date

I am trying to identify only the first orders of unique "items" purchased by "test" customers, in a simplified sample dataframe created below:
df=pd.DataFrame({"cust": ['A55', 'A55', 'A55', 'B080', 'B080', 'D900', 'D900', 'D900', 'D900', 'C019', 'C019', 'Z09c', 'A987', 'A987', 'A987'],
"date":['01/11/2016', '01/11/2016', '01/11/2016', '08/17/2016', '6/17/2016','03/01/2016',
'04/30/2016', '05/16/2016','09/27/2016', '04/20/2016','04/29/2016', '07/07/2016', '1/29/2016', '10/17/2016', '11/11/2016' ],
"item": ['A10BABA', 'A10BABA', 'A10DBDB', 'A9GABA', 'A11AD', 'G198A', 'G198A', 'F673', 'A11BB', 'CBA1', 'CBA1', 'DA21',
'BG10A', 'CG10BA', 'BG10A']
})
df.date = pd.to_datetime(df.date)
df = df.sort_values(["cust", "date"], ascending = True)
The desired output would look as shown in the picture: all unique items ordered by date of purchase, with the rank in a new column called "cust_item_rank", and any repeated (duplicated) orders of the same item by the same user removed.
To clarify further, items purchased on the same date by the same user should have the same order/rank, as shown in the picture for customer A55 (A10BABA and A10DBDB are both ranked 1).
I have spent a fair bit of time using a combination of groupby and/or rank operations, but have been unsuccessful thus far. As an example:
df["cust_item_rank"] = df.groupby("cust")["date"]["item"].rank(ascending = 1, method = "min")
Yields an error (Exception: Column(s) date already selected).
Can somebody please guide me to the desired solution here?
# Remove duplicates
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
         .reset_index(drop=True))
df2['cust_item_rank'] = df2.groupby('cust').cumcount().add(1)
df2
cust date item cust_item_rank
0 A55 2016-01-11 A10BABA 1
1 A55 2016-11-01 A10DBDB 2
2 A987 2016-01-29 BG10A 1
3 A987 2016-10-17 CG10BA 2
4 B080 2016-06-17 A11AD 1
5 B080 2016-08-17 A9GABA 2
6 C019 2016-04-20 CBA1 1
7 D900 2016-03-01 G198A 1
8 D900 2016-05-16 F673 2
9 D900 2016-09-27 A11BB 3
10 Z09c 2016-07-07 DA21 1
To solve this question, I built upon the excellent initial answer by cs95, calling on the rank function in pandas as follows:
# remove duplicates as recommended by cs95
df2 = (df.loc[~df.groupby(['cust'])['item'].apply(pd.Series.duplicated)]
         .reset_index(drop=True))
# rank by date after grouping by customer
df2["cust_item_rank"] = df2.groupby(["cust"])["date"].rank(ascending=1, method='dense').astype(int)
This resulted in the desired output.
It appears that this problem can be solved using either the "min" or the "dense" ranking method, but I chose the latter, "dense", to avoid skipping any ranks.
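To illustrate the difference between the two methods on tied dates (a toy example, not taken from the original data): 'min' leaves gaps in the ranks while 'dense' does not -
s = pd.Series(pd.to_datetime(['2016-01-11', '2016-01-11', '2016-03-01']))
s.rank(method='min')    # 1.0, 1.0, 3.0  (rank 2 is skipped)
s.rank(method='dense')  # 1.0, 1.0, 2.0  (no gaps)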
