I am trying to record monthly sales totals over the course of 2.5 years from a CSV data set.
I started with a CSV file of transaction history for a SKU, sorted by date (MM/DD/YYYY), with varying statuses indicating whether the item was sold, archived (quoted, not sold), or open. I managed to figure out how to display only the "Sold" rows, but I cannot figure out how to display a total amount sold per month.
Here's what I have thus far.
# Import libraries
import pandas as pd

# Set variables
fields = ['Date', 'Qty', 'Status']
file = r'kp4.csv'
df = pd.read_csv(file, usecols=fields)

# Filter the dataset to only the "Sold" rows in the Status column
data = df[df['Status'] == "Sold"]
print(data)
Output:
Date Qty Status
4 2/21/2018 5 Sold
4 2/21/2018 5 Sold
11 2/16/2018 34 Sold
14 3/16/2018 1 Sold
My ideal output would look something like this:
Date Qty Status
4 02/2018 39 Sold
5 03/2018 1 Sold
I've tried groupby, manipulating the year format, and assigning indexes per other tutorials, and have gotten nothing but errors. If anyone can point me in the right direction, it would be greatly appreciated.
Thanks!
IIUC (if I understand correctly):
# parse the dates, drop the duplicated row, then total Qty per month/year
df.Date = pd.to_datetime(df.Date)
df = df.drop_duplicates()
df.groupby(df.Date.dt.strftime('%m/%Y')).agg({'Qty': 'sum', 'Status': 'first'})
Out[157]:
Qty Status
Date
02/2018 39 Sold
03/2018 1 Sold
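A small variation on the same idea, in case month ordering matters across the 2.5 years of data (a sketch, assuming you still only want the "Sold" rows): grouping on a monthly Period instead of an '%m/%Y' string keeps the result in chronological order, whereas strings like '01/2019' would sort before '02/2018'.
df['Date'] = pd.to_datetime(df['Date'])
sold = df[df['Status'] == 'Sold'].drop_duplicates()
# group by calendar month and total the quantities
monthly = (sold.groupby(sold['Date'].dt.to_period('M'))
               .agg({'Qty': 'sum', 'Status': 'first'}))
print(monthly)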
I have the following DataFrame, output_excel, containing inventory data and sales data for different products. See the DataFrame below:
Product 2022-04-01 2022-05-01 2022-06-01 2022-07-01 2022-08-01 2022-09-01 AvgMonthlySales Current Inventory
1 BE37908 1500 1400 1200 1134 1110 1004 150.208333 1500
2 BE37907 2000 1800 1800 1540 1300 1038 189.562500 2000
3 DE37907 5467 5355 5138 4926 4735 4734 114.729167 5467
Please note that in my example, today's date is 2022-04-01, so all inventory numbers for the months May through September are predicted values, while AvgMonthlySales is the mean of actual, past sales for that specific product. The Current Inventory column just shows today's value.
I also have another dataframe, df2, containing the lead time, the same sales data, and the calculated security stock for the same products. The formula for the security stock is ((leadtime in weeks / 4) + 1) * AvgMonthlySales:
Product AvgMonthlySales Lead time in weeks Security Stock
1 BE37908 250.208333 16 1251.04166
2 BE37907 189.562500 24 1326.9375
3 DE37907 114.729167 10 401.552084
What I am trying to achieve:
I want to create a new dataframe, which tells me how many months are left until our inventory drops below the security stock. For example, for the first product, BE37908, the security stock is ~1251 units, and by 2022-06-01 our inventory will drop below that number. So I want to return 2022-05-01, as this is the last month where our inventories are projected to be above the security stock. The whole output should look something like this:
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
Please also note that the timeframe for the projections (the columns) can be set by the user, so we couldn't just select columns 2 through 7. However, the Product column will always be the first one, and the AvgMonthlySales and the Current Inventory columns will always be the last two.
To recap, I want to return the column with the smallest value above the security stock for each product. I have an idea on how to do that by column using argsort, but not by row. What is the best way to achieve this? Any tips?
You could try as follows:
import numpy as np

# collect the columns that hold dates (they all start with '20')
cols = [col for col in df.columns if col.startswith('20')]
# compare each row against its security stock, count the months above it,
# and subtract 1 to get the positional index of the last month above
idx = df.loc[:,cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)
# map positions back to dates; if the count equals len(cols)-1, *all*
# months were above the security stock, so return np.nan instead
idx = [cols[i] if i != len(cols)-1 else np.nan for i in idx]
out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = idx
print(out)
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
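Two assumptions worth spelling out (they are not stated in the snippet itself): df and df2 must share the same index, and counting the True values only identifies the last month above the threshold because the projected inventories decrease month over month. The intermediate comparison looks like this:
# the row-wise comparison behind the count-and-subtract step
mask = df.loc[:, cols].gt(df2['Security Stock'], axis=0)
print(mask)                     # True while inventory is still above the security stock
print(mask.sum(axis=1).sub(1))  # positional index of the last month above it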
Problem
I have a dataframe df:
Index Client_ID Date
1 johndoe 2019-01-15
2 johndoe 2015-11-25
3 pauldoe 2015-05-26
And I have another dataframe df_prod, with products like this:
Index Product-Type Product-Date Buyer Price
1 A 2020-01-01 pauldoe 300
2 A 2018-01-01 pauldoe 200
3 A 2019-01-01 johndoe 600
4 A 2017-01-01 johndoe 800
5 A 2020-11-05 johndoe 100
6 B 2014-12-12 johndoe 200
7 B 2016-11-15 johndoe 300
What I want is to add a column to df that sums the Prices of the last products of each type known at the current date (with Product-Date <= df.Date). An example is the best way to explain:
For the first row of df
1 johndoe 2019-01-15
The last A-Product known at this date bought by johndoe is this one :
3 A 2019-01-01 johndoe 600
(since the 4th one is older, and the 5th one has a Product-Date > Date)
The last B-Product known at this date bought by johndoe is this one :
7 B 2016-11-15 johndoe 300
So the row in df, after transformation, will look like this (900 being 600 + 300, the prices of the two products of interest):
1 johndoe 2019-01-15 900
The full df after transformation will then be :
Index Client_ID Date LastProdSum
1 johndoe 2019-01-15 900
2 johndoe 2015-11-25 200
3 pauldoe 2015-05-26 0
As you can see, there are multiple possibilities:
Buyers didn't necessarily buy all products (see pauldoe, who only bought A-products)
Sometimes, no product is known at df.Date (see row 3 of the new df: in 2015, we don't know any product bought by pauldoe)
Sometimes, only one product is known at df.Date, and the value is that product's price (see row 2 of the new df: in 2015, we only have one product for johndoe, a B-product bought in 2014, whose price is 200)
What I did:
I found a solution to this problem, but it's taking too much time to be used, since my dataframe is huge.
For each row of df, I iterate with iterrows: I select the products in df_prod linked to the Buyer with Product-Date <= Date, keep the most recent one per Product-Type (grouping by Product-Type and taking the max date), and finally sum their prices.
Because a slice of df_prod has to be extracted and processed for every single row of df, this takes far too long.
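Roughly, the loop looks like this (a sketch using the sample column names):
sums = []
for _, row in df.iterrows():
    # products of this buyer already known at row['Date']
    known = df_prod[(df_prod['Buyer'] == row['Client_ID']) &
                    (df_prod['Product-Date'] <= row['Date'])]
    # keep only the most recent product of each type, then sum the prices
    last = known.sort_values('Product-Date').groupby('Product-Type').tail(1)
    sums.append(last['Price'].sum())
df['LastProdSum'] = sums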
I'm almost sure there's a better way to solve the problem, with pandas functions (pivot for example), but I couldn't find the way. I've been searching a lot.
Thanks in advance for your help
Edit after Dani's answer
Thanks a lot for your answer. It looks really good, I accepted it since you spent a lot of time on it.
Execution is still pretty slow, because of something I didn't specify.
In fact, Product-Types are not shared across buyers: each buyer has their own set of product types. The data really looks like this:
Index Product-Type Product-Date Buyer Price
1 pauldoe-ID1 2020-01-01 pauldoe 300
2 pauldoe-ID1 2018-01-01 pauldoe 200
3 johndoe-ID2 2019-01-01 johndoe 600
4 johndoe-ID2 2017-01-01 johndoe 800
5 johndoe-ID2 2020-11-05 johndoe 100
6 johndoe-ID3 2014-12-12 johndoe 200
7 johndoe-ID3 2016-11-15 johndoe 300
As you can understand, product types are not shared across different buyers (it can happen, but only in rare situations that we won't consider here).
The problem remains the same: since you want to sum prices, you add the prices of the last occurrences of johndoe-ID2 and johndoe-ID3 to get the same final result row
1 johndoe 2019-01-15 900
But as you now understand, there are actually more Product-Types than Buyers, so the step "get unique product types" from your answer, which looked pretty fast on the initial problem, actually takes a lot of time.
Sorry for being unclear on this point, I didn't think of a possibility of creating a new df based on product types.
The main idea is to use merge_asof to fetch the last date for each Product-Type and Client_ID, so do the following:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
                      df_prod.sort_values(['Product-Date', 'Buyer']),
                      left_on='Date',
                      right_on='Product-Date',
                      left_by=['Client_ID', 'Product-Type'],
                      right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(columns={'Price' : 'LastProdSum'})
print(res)
Output
Client_ID Date LastProdSum
0 johndoe 2015-11-25 200.0
1 johndoe 2019-01-15 900.0
2 pauldoe 2015-05-26 0.0
The problem is that merge_asof won't work with duplicate values, so we need to create unique values. These new values are the cartesian product of Client_ID and Product-Type; this part is done in:
# get unique product types
product_types = list(df_prod['Product-Type'].unique())
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = [product_types for _ in range(len(df))]
df_with_prod = df.explode('Product-Type')
Finally, fill the missing prices with fillna and then do a groupby to sum the Price per client and date.
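For the sample frames, the explode step repeats each client row once per product type, so df_with_prod ends up with six rows (a schematic of the intermediate result, not exact pandas output):
print(df_with_prod[['Client_ID', 'Date', 'Product-Type']])
#   Client_ID        Date Product-Type
#     johndoe  2019-01-15            A
#     johndoe  2019-01-15            B
#     johndoe  2015-11-25            A
#     johndoe  2015-11-25            B
#     pauldoe  2015-05-26            A
#     pauldoe  2015-05-26            B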
UPDATE
You could try:
# get unique product types
product_types = df_prod.groupby('Buyer')['Product-Type'].apply(lambda x: list(set(x)))
# create a new DataFrame with a row for each Product-Type for each Client_ID
df['Product-Type'] = df['Client_ID'].map(product_types)
df_with_prod = df.explode('Product-Type')
# merge only the closest date by each client and product type
merge = pd.merge_asof(df_with_prod.sort_values(['Date', 'Client_ID']),
                      df_prod.sort_values(['Product-Date', 'Buyer']),
                      left_on='Date',
                      right_on='Product-Date',
                      left_by=['Client_ID', 'Product-Type'],
                      right_by=['Buyer', 'Product-Type'])
# fill na in prices
merge['Price'] = merge['Price'].fillna(0)
# sum Price by client and date
res = merge.groupby(['Client_ID', 'Date'], as_index=False)['Price'].sum().rename(columns={'Price' : 'LastProdSum'})
print(res)
The idea here is to change how you generate the unique values.
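For the edited sample data, the per-buyer mapping produced by that first line comes out roughly like this (the order inside each list is not guaranteed, since set() is used):
product_types = df_prod.groupby('Buyer')['Product-Type'].apply(lambda x: list(set(x)))
# Buyer
# johndoe    ['johndoe-ID2', 'johndoe-ID3']
# pauldoe    ['pauldoe-ID1']
# Name: Product-Type, dtype: object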
I have been using pandas.groupby to pivot data and create descriptive charts and tables for my data. While doing groupby for three variables, I keep running into a DataError: No numeric types to aggregate error while working with the cancelled column.
To describe my data: Year and Month contain yearly and monthly data for multiple columns (multiple years, all months), Type contains the type of order item (Clothes, Appliances, etc.), and cancelled contains yes or no string values indicating whether an order was cancelled or not.
I am hoping to plot a graph and show a table of the cancellation rate (and success rate) by order item. The following is what I'm using so far:
df.groupby(['Year', 'Month', 'Type'])['cancelled'].mean()
But this doesn't seem to be working.
Sample
Year Month Type cancelled
2012 1 electronics yes
2012 10 fiber yes
2012 9 clothes no
2013 4 vegetables yes
2013 5 appliances no
2016 3 fiber no
2017 1 clothes yes
Use:
df = pd.DataFrame({
    'Year': [2020] * 6,
    'Month': [7, 8, 7, 8, 7, 8],
    'cancelled': ['yes', 'no'] * 3,
    'Type': list('aaaaba')
})
print (df)
Get counts per Year, Month, Type columns:
df1 = df.groupby(['Year', 'Month', 'Type','cancelled']).size().unstack(fill_value=0)
print (df1)
cancelled no yes
Year Month Type
2020 7 a 0 2
b 0 1
8 a 3 0
And then divide each column by its total to get percentages (each group's share of all no and yes answers):
df2 = df1.div(df1.sum()).mul(100)
print (df2)
cancelled no yes
Year Month Type
2020 7 a 0.0 66.666667
b 0.0 33.333333
8 a 100.0 0.000000
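If what you want instead is the cancellation rate within each Year/Month/Type group (each row summing to 100%), normalize across rows rather than down columns (a sketch):
df3 = df1.div(df1.sum(axis=1), axis=0).mul(100)
print (df3)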
It's possible I have misunderstood what you want your output to look like, but to find the cancellation rate for each item type, you could do something like this:
# change 'cancelled' to numeric values
df.loc[df['cancelled'] == 'yes', 'cancelled'] = 1
df.loc[df['cancelled'] == 'no', 'cancelled'] = 0
# get the mean of 'cancelled' for each item type
res = {}
for t in df['Type'].unique():
    res[t] = df.loc[df['Type'] == t, 'cancelled'].mean()
# if desired, put it into a dataframe
results = pd.DataFrame([res], index=['Rate']).T
Output:
Rate
electronics 1.0
fiber 0.5
clothes 0.5
vegetables 1.0
appliances 0.0
Note: if you want to restrict to specific years or months, you can do that with loc as well; but since your example data has no repeats within a given year or month, that would just return your original dataframe for this example.
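A more compact variant of the same idea, in case it helps (assuming cancelled still contains the 'yes'/'no' strings): map the strings to 1/0 and take the mean per Type in one go.
# map yes/no to 1/0, then average per item type
rates = (df['cancelled'].map({'yes': 1, 'no': 0})
           .groupby(df['Type'])
           .mean()
           .rename('Rate')
           .to_frame())
print(rates)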
I need to create a data frame for 100 customer_ids along with their expenses for each day from 1st June 2019 to 31st August 2019. I already have the customer ids in a list and the dates in a list as well. How do I make a data frame in the format shown below?
CustomerID TrxnDate
1 1-Jun-19
1 2-Jun-19
1 3-Jun-19
1 Upto....
1 31-Aug-19
2 1-Jun-19
2 2-Jun-19
2 3-Jun-19
2 Upto....
2 31-Aug-19
and so on for the other customer ids.
I already have the customer_id data frame built with pandas; now I need to map each customer_id to every date, i.e. if we have customer id 1, then 1 should have all dates from 1st June 2019 to 31st August 2019, and then customer id 2 should have the same dates, and so on. Please see the data frame required above.
# import module
import pandas as pd
# list of dates
lst = ['1-Jun-19', '2-Jun-19', '3-Jun-19']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
Repeat the operations for Customer ID and store in df2 or something and then
frames = [df, df2]
result = pd.concat(frames)
There are simpler methods, but this will give you an idea of how it is carried out.
I see you want a specific dataframe, so first create the dataframe for customer ID 1, then repeat the same for customer ID 2, and then concat those dataframes.
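Another option, in case it's useful (a sketch that assumes customer_ids is your existing list of IDs): pd.MultiIndex.from_product builds the full cross join of customers and dates in one step.
import pandas as pd

customer_ids = list(range(1, 101))  # hypothetical list of 100 customer ids
dates = pd.date_range('2019-06-01', '2019-08-31', freq='D')

# every customer paired with every date, in the required long format
result = pd.MultiIndex.from_product(
    [customer_ids, dates], names=['CustomerID', 'TrxnDate']
).to_frame(index=False)
The TrxnDate column comes out as real timestamps; if the '1-Jun-19' text form matters, it can be formatted afterwards with dt.strftime.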
Sorry, I apologise now, just started learning Python and trying to get something working.
OK, the dataset is:
Buy, typeid, volume, issued, duration, Volume Entered,Minimum Volume, range, price, locationid, locationname
SELL 20 2076541 2015-09-12T06:31:13 90 2076541 1 region 331.21 60008494 Amarr
SELL 20 194642 2015-09-07T19:36:49 90 194642 1 region 300 60008494 Amarr
SELL 20 2320 2015-09-13T07:48:54 3 2320 1 region 211 60008491 Irnin
I would like to filter for a specific location, either by name or ID, doesn't bother me, and then pick the minimum price for that location. Preferably hardcoded, since I only have a few locations I'm interested in, e.g. locationid = 60008494.
I see you can do two conditions on one line, but I don't see how to apply it.
So I'm trying to nest it.
Doesn't have to be pandas, just seems the first thing I found that did one part of what I required.
The code I've got so far only does the minimum-price part of what I'm looking to achieve.
data = pd.read_csv('orders.csv')
length = len(data['typeid'].unique())
res = pd.DataFrame(columns=('Buy', 'typeid', 'volume', 'duration', 'volumeE',
                            'Minimum', 'range', 'price', 'locationid', 'locationname'))
for i in range(0, length):
    name_filter = data[data['typeid'] == data['typeid'].unique()[i]]
    price_min_filter = name_filter[name_filter['price'] == name_filter['price'].min()]
    res = res.append(price_min_filter, ignore_index=True)
res.to_csv('format.csv')  # writes output to csv
print("Complete")
UPDATED.
OK so, for the latest part, it seems like the following code is the direction I should be going in. If I could have s = typeid, locationid and price, that's perfect. I've written what I want to do; what's the correct syntax to get that in Python? Sorry, I'm used to Excel and SQL.
import pandas as pd
df = pd.read_csv('orders.csv')
df[df['locationid'] ==60008494]
s= df.groupby(['typeid'])['price'].min()
s.to_csv('format.csv')
If what you really want is -
I would like to filter for a specific location, either by name or ID, doesn't bother me, and then pick the minimum price for that location. Preferably hardcoded, since I only have a few locations I'm interested in, e.g. locationid = 60008494.
You can simply filter the df on the locationid first and then use ['price'].min(). Example -
In [1]: import pandas as pd
In [2]: s = """Buy,typeid,volume,issued,duration,Volume Entered,Minimum Volume,range,price,locationid,locationname
...: SELL,20,2076541,2015-09-12T06:31:13,90,2076541,1,region,331.21,60008494,Amarr
...: SELL,20,194642,2015-09-07T19:36:49,90,194642,1,region,300,60008494,Amarr
...: SELL,20,2320,2015-09-13T07:48:54,3,2320,1,region,211,60008491,Irnin"""
In [3]: import io
In [4]: df = pd.read_csv(io.StringIO(s))
In [5]: df
Out[5]:
Buy typeid volume issued duration Volume Entered \
0 SELL 20 2076541 2015-09-12T06:31:13 90 2076541
1 SELL 20 194642 2015-09-07T19:36:49 90 194642
2 SELL 20 2320 2015-09-13T07:48:54 3 2320
Minimum Volume range price locationid locationname
0 1 region 331.21 60008494 Amarr
1 1 region 300.00 60008494 Amarr
2 1 region 211.00 60008491 Irnin
In [8]: df[df['locationid']==60008494]['price'].min()
Out[8]: 300.0
If you want to do it for all the locationids, then, as said in the other answer, you can use DataFrame.groupby and then take the ['price'] column of each group and use .min(). Example -
data = pd.read_csv('orders.csv')
data.groupby(['locationid'])['price'].min()
Demo -
In [9]: df.groupby(['locationid'])['price'].min()
Out[9]:
locationid
60008491 211
60008494 300
Name: price, dtype: float64
For getting the complete row which has minimum values in the corresponding groups, you can use idxmin() to get the index for the minimum value and then pass it to df.loc to get those rows. Example -
In [9]: df.loc[df.groupby(['locationid'])['price'].idxmin()]
Out[9]:
Buy typeid volume issued duration Volume Entered \
2 SELL 20 2320 2015-09-13T07:48:54 3 2320
1 SELL 20 194642 2015-09-07T19:36:49 90 194642
Minimum Volume range price locationid locationname
2 1 region 211 60008491 Irnin
1 1 region 300 60008494 Amarr
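And if you only care about one hard-coded location and want the whole cheapest row rather than just the price, the same idxmin idea works on the filtered frame (a small sketch, using the sample locationid):
amarr = df[df['locationid'] == 60008494]
cheapest_row = df.loc[amarr['price'].idxmin()]  # row with the lowest price at that location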
If I understand your question correctly, you really won't need to do much more than a DataFrame.groupby(). As an example, you can group the dataframe by locationname, select the price column from the resulting groupby object, and then use the min() method to output the minimum value for each group:
data.groupby('locationname')['price'].min()
which will give you the minimum value of price for each group. So it will look something like:
locationname
Amarr 300
Irnin 211
Name: price, dtype: float64
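And since your update mentions wanting typeid, locationid and price together, the same pattern extends to two grouping keys (a sketch):
s = df.groupby(['locationid', 'typeid'])['price'].min().reset_index()
s.to_csv('format.csv', index=False)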