Plotting time series data grouped by month per product - python

Let's say the data used is something like this:
import pandas as pd

df = pd.DataFrame({'Order_id': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                   'Order_date': ['10/1/2020', '10/1/2020', '11/1/2020', '11/1/2020', '12/1/2020',
                                  '12/1/2020', '12/1/2020', '12/1/2020', '13/1/2020', '13/1/2020'],
                   'Product_nr': [0, 2, 1, 0, 2, 0, 2, 1, 2, 0],
                   'Quantity': [3, 1, 6, 5, 10, 1, 2, 5, 4, 3]})
# transforming the date column into datetime (dayfirst=True, since dates such as 13/1/2020 are day-first)
df['Order_date'] = pd.to_datetime(df['Order_date'], dayfirst=True)
and I'm trying to plot the number of ordered products per day per product over the given time span.
My initial idea would be something like
product_groups = df.groupby(['Product_nr'])
products_daily = pd.DataFrame()
for product, total_orders in product_groups:
    products_daily[product.day] = total_orders.values
products_daily.plot(subplots=True, legend=False)
pyplot.show()
I know there must be a groupby('Product_nr'), and the dates should be split into days using Grouper(freq='D'). There should probably also be a for loop to combine them and then plot them all, but I really have no clue how to put those pieces together. How can I achieve this? My ultimate goal is actually to plot them per month per product over 4 years of sales records, but given the example data here I changed it to daily.
Any suggestions or links to guides or tutorials are welcome too. Thank you very much!

You can pivot the table and use pandas' plot function:
(df.groupby(['Order_date', 'Product_nr'])
['Quantity'].sum()
.unstack('Product_nr')
.plot(subplots=True, layout=(1,3)) # change layout to fit your data
)
Output: three subplots, one per Product_nr, showing the summed Quantity per Order_date (plot image not reproduced here).
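Since the ultimate goal is monthly figures over several years of records, a hedged variation of the same pivot (assuming the df defined above, plus matplotlib for displaying the figure) swaps the raw dates for a monthly pd.Grouper:
import matplotlib.pyplot as plt

(df.groupby([pd.Grouper(key='Order_date', freq='M'), 'Product_nr'])  # freq='ME' in newer pandas
   ['Quantity'].sum()
   .unstack('Product_nr')
   .plot(subplots=True, layout=(1, 3))  # change layout to fit your data
)
plt.show()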


How do I quickly drop rows based on the max value in a groupby? [duplicate]

This question already has answers here: Get the row(s) which have the max value in groups using groupby.
I have a large dataframe containing information on people and their job change history. Sometimes, someone had multiple changes to their record on one day, each of which is assigned a transaction sequence number. I just want to keep the rows with the highest transaction sequence number of that day. Currently, I'm using the for loop below to do this, but it takes forever.
list_indexes_to_drop = []
for (associate_id, date), df in df_job_his.groupby(["Employee ID", "Event Date"]):
    if len(df) > 1:
        list_indexes_to_drop += list(df.index[df["Transaction Sequence Number"] != df["Transaction Sequence Number"].max()])
I also have this code below, but I'm not sure how to use it to filter the dataframe.
df_job_his.groupby(["Employee ID", "Event Date"])["Transaction Sequence Number"].max()
Is there a more efficient way to go about this?
Here's an example of some random data in the same format:
df_job_his = pd.DataFrame({
    "Employee ID": [1, 1, 1, 2, 3, 3, 4, 4, 5, 6, 6, 6, 7, 8, 9, 9, 10],
    "Event Date": ["2020-04-05", "2020-06-08", "2020-06-08", "2022-09-01", "2022-02-15", "2022-02-15",
                   "2021-07-29", "2021-07-29", "2021-08-14", "2021-09-14", "2022-01-04", "2022-01-04",
                   "2022-01-04", "2022-04-04", "2020-08-13", "2020-08-13", "2020-03-17"],
    "Transaction Sequence Number": [1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1]})
Your groupby was almost a correct answer!
A trick to get the row with the highest "Transaction Sequence Number" is to use .groupby(...).last() after sorting the dataframe by Transaction Sequence Number.
Here's a solution:
import pandas as pd
import numpy as np
df_job_his = pd.DataFrame({
    'Employee ID': [0, 0, 0, 0, 1, 1, 1],
    'Event Date': [1, 2, 3, 3, 1, 2, 3],
    'Transaction Sequence Number': [1, 2, 4, 3, 5, 6, 7],
    'Important info about transaction': np.random.random(7)
})
df_job_his.sort_values('Transaction Sequence Number').groupby(
    ["Employee ID", "Event Date"]).last()
It outputs something like this, where employee 0 for date 3 keeps the last row only.
Employee ID  Event Date  Transaction Sequence Number  Important info about transaction
0            1           1                            0.00571239
0            2           2                            0.0484783
0            3           4                            0.958739
1            1           5                            0.0690461
1            2           6                            0.721041
1            3           7                            0.763681
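As an extra, hedged sketch (not from the thread): the asker's own groupby from the question can also be turned into a filter directly, because transform('max') broadcasts each group's maximum back onto every row (assuming df_job_his is the example DataFrame above):
max_seq = df_job_his.groupby(["Employee ID", "Event Date"])["Transaction Sequence Number"].transform("max")
highest_only = df_job_his[df_job_his["Transaction Sequence Number"] == max_seq]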

How to apply rolling mean function while keeping all the observations with duplicated indices in time

I have a dataframe that has duplicated time indices, and I would like to get the mean across all observations from the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked the pandas documentation and read previous posts on Stack Overflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of what my data frame looks like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t    v2
1    -
2    -
3    4.167
4    5
5    6.667
A rough proposal: concatenate two copies of the input frame in which the values in 't' are shifted to 't+1' and 't+2' respectively. This way, the column 't' comes to mean "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4, 4, 4],
                   't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
                   'v1': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
n = df.shape[0]
incr = pd.DataFrame({'id': [0] * n, 't': [1] * n, 'v1': [0] * n})  # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1]  # drop the days that do not have full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling with a 2-day window, and then dropping duplicates (keeping the last observation).
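That final code isn't shown, so here is only a minimal sketch of one way to read that description, reusing the df from the setup above and treating 't' as a day count (days without two full previous days, or days absent from the data, would still need separate handling):
import pandas as pd

tmp = df.sort_values('t')
tmp = tmp.set_index(pd.to_datetime(tmp['t'], unit='D'))   # rolling('2D') needs a datetime index
# closed='left' makes the window [t - 2 days, t), i.e. only the previous 2 days
rolled = tmp['v1'].rolling('2D', closed='left').mean()
v2 = rolled.groupby(level=0).last()   # duplicates within a day share the same window, so keep one value per day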

Splitting data into bins in histplot

I have a problem with sns.histplot(). As far as I understand, the bins parameter should indicate how many bins should be in the plot. Here is some code to visualize the strange (at least for me) behavior:
import pandas as pd
import seaborn as sns

d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
     'col2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]}
df = pd.DataFrame(data=d)
sns.histplot(data=df, x='col1', multiple='dodge', hue='col2', binwidth=2, bins=8)
I have almost the same problem in my original code where I have:
hist = sns.histplot(data=Data, x=y_data, multiple='dodge', ax=axes[0], hue=i[2][1], binwidth=2, bins=10)
And as you can see, there is only a single bin where the data has its minimum and std; the data is not split into the number of bins I declared. Why is this not splitting the data into the provided number of bins? How can I change the code to ensure a constant number of bins?
I think the problem is the binwidth parameter. Maybe just try to delete that parameter, or set it to a smaller value (0.2 or 0.1).
From the docs, regarding the binwidth parameter:
Width of each bin, overrides bins but can be used with binrange.
So if you pass both bins and binwidth, binwidth takes precedence and bins is effectively ignored.
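A quick, hedged way to check this with the toy data above: drop binwidth (or pair binwidth with binrange instead of bins) so that bins controls the count again:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

d = {'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
     'col2': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]}
df = pd.DataFrame(data=d)

# no binwidth, so the 8 requested bins are actually used
sns.histplot(data=df, x='col1', multiple='dodge', hue='col2', bins=8)
plt.show()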

function on previous N rows in data frame?

I need to go through each row in a dataframe looking at a column called 'source_IP_address', and look at the previous 100 rows, so that I can find if there are any rows with the same 'source_IP_address' and where another column states 'authentication failure'.
I have written some code that does this, since I couldn't use pandas rolling over two columns. The problem is that it is not very fast, and I want to know if there is a better way to do it.
The function below finds, within the previous window of n rows, the number of rows whose axis column matches the current row's value and whose attribute column equals the given attribute value:
def check_window_for_match(df_w, window_size, axis_col, attr_col, attr_var):
    l = []
    n_rows = df_w.shape[0]
    for i in range(n_rows):
        # create a temp dataframe with the previous window_size rows, including the current row
        # (max(0, ...) keeps the slice from wrapping around at the start of the frame)
        temp_df = df_w.iloc[max(0, i - (window_size - 1)):i + 1]
        # the current row's axis value
        current_value = df_w[axis_col].iloc[i]
        # count rows in the window that match the axis value and carry the attribute value
        l_temp = temp_df.loc[(temp_df[axis_col] == current_value) & (temp_df[attr_col] == attr_var)].shape[0]
        l.append(l_temp)
    return l
e.g.
df_test = pd.DataFrame({'B': [0, 1, 2, np.nan, 4, 6, 7, 8, 10, 8, 7], 'C': [2, 10, 'fail', np.nan, 6, 7, 8, 'fail', 8, 'fail', 9]})
df_test
matches_list = check_window_for_match(df_test, window_size=3, axis_col='B', attr_col='C', attr_var='fail')
output: [0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0]
I want to know if my code is correct and if it is the best way to do it, or if there is a better alternative.
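No answer is recorded here, but as a hedged sketch of one possible vectorized alternative (not from the thread): build one indicator column per failing B value with get_dummies, take a positional rolling sum, and then pick each row's own column. On the example above this reproduces the loop's output:
import numpy as np
import pandas as pd

df_test = pd.DataFrame({'B': [0, 1, 2, np.nan, 4, 6, 7, 8, 10, 8, 7],
                        'C': [2, 10, 'fail', np.nan, 6, 7, 8, 'fail', 8, 'fail', 9]})
window_size = 3

# one indicator column per B value that ever fails; a 1 marks a failing row for that value
# (assumes at least one failing row exists in the data)
fails = pd.get_dummies(df_test['B'].where(df_test['C'].eq('fail')), dtype=float)
# count of failures per B value within the trailing window (current row included)
rolled = fails.rolling(window_size, min_periods=1).sum()

# for each row, pick the count belonging to its own B value (0 if that value never fails)
col_idx = rolled.columns.get_indexer(df_test['B'])
picked = rolled.to_numpy()[np.arange(len(df_test)), np.clip(col_idx, 0, None)]
matches = np.where(col_idx >= 0, picked, 0).astype(int).tolist()
# matches -> [0, 0, 1, 0, 0, 0, 0, 1, 0, 2, 0]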

Joining pandas dataframes based on minimization of distance

I have a dataset of stores with 2D locations at daily timestamps. I am trying to match up each row with weather measurements made at stations at some other locations, also with daily timestamps, such that the Cartesian distance between each store and matched station is minimized. The weather measurements have not been performed daily, and the station positions may vary, so this is a matter of finding the closest station for each specific store at each specific day.
I realize that I can construct nested loops to perform the matching, but I am wondering if anyone here can think of some neat way of using pandas dataframe operations to accomplish this. A toy example dataset is shown below. For simplicity, it has static weather station positions.
store_df = pd.DataFrame({
    'store_id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'x': [1, 1, 1, 4, 4, 4, 4, 4, 4],
    'y': [1, 1, 1, 1, 1, 1, 4, 4, 4],
    'date': [1, 2, 3, 1, 2, 3, 1, 2, 3]})
weather_station_df = pd.DataFrame({
    'station_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'weather': [20, 21, 19, 17, 16, 18, 19, 17],
    'x': [0, 0, 0, 5, 5, 3, 3, 3],
    'y': [2, 2, 2, 1, 1, 3, 3, 3],
    'date': [1, 2, 3, 1, 3, 1, 2, 3]})
The data below is the desired outcome. I have included station_id only for clarification.
store_id date station_id weather
0 1 1 1 20
1 1 2 1 21
2 1 3 1 19
3 2 1 2 17
4 2 2 3 19
5 2 3 2 16
6 3 1 3 18
7 3 2 3 19
8 3 3 3 17
The idea of the solution is to build the table of all combinations,
df = store_df.merge(weather_station_df, on='date', suffixes=('_store', '_station'))
calculate the (squared) distance, which is enough for picking the closest station,
df['dist'] = (df.x_store - df.x_station)**2 + (df.y_store - df.y_station)**2
and choose the minimum per group:
df.groupby(['store_id', 'date']).apply(lambda x: x.loc[x.dist.idxmin(), ['station_id', 'weather']]).reset_index()
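If the apply turns out to be slow on a large table, a hedged alternative with the same intermediate frame is to select the minimizing rows with groupby + idxmin instead:
best = df.loc[df.groupby(['store_id', 'date'])['dist'].idxmin(),
              ['store_id', 'date', 'station_id', 'weather']].reset_index(drop=True)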
If you have a lot of dates, you can do the join per group.
import numpy as np

def distance(x1, x2, y1, y2):
    return np.sqrt((x2 - x1)**2 + (y2 - y1)**2)

# join on date to get all combinations of stores and stations per day
df_all = store_df.merge(weather_station_df, on=['date'])
# apply the distance formula to each combination
df_all['distances'] = distance(df_all['x_y'], df_all['x_x'], df_all['y_y'], df_all['y_x'])
# get the minimum distance for each day per store_id
df_mins = df_all.groupby(['date', 'store_id'])['distances'].min().reset_index()
# use the resulting minimums to get the station_id matching the min distances
closest_stations_df = df_mins.merge(df_all, on=['date', 'store_id', 'distances'], how='left')
# filter out the unnecessary columns
result_df = closest_stations_df[['store_id', 'date', 'station_id', 'weather', 'distances']].sort_values(['store_id', 'date'])
Edited to use a vectorized distance formula.
