More efficient way to group by and count values Pandas dataframe - python

Is there a more efficient way to do this?
I have sales records imported from a spreadsheet. I start by reading that list into a dataframe, and I then need the average number of orders per customer by month and year.
The spreadsheet does not contain counts, just order and customer IDs.
So I count each ID, drop duplicates, and then reset the index.
The final dataframe is exported back to a spreadsheet and a SQL database.
The code below works and gives the desired output, but it seems like it should be more efficient. I am new to pandas and Python, so I'm sure I could do this better.
df_customers = df.filter(
    ['Month', 'Year', 'Order_Date', 'Customer_ID', 'Customer_Name', 'Patient_ID', 'Order_ID'], axis=1)
df_order_count = df.filter(
    ['Month', 'Year'], axis=1)
df_order_count['Order_cnt'] = df_customers.groupby(['Month', 'Year'])['Order_ID'].transform('nunique')
df_order_count['Customer_cnt'] = df_customers.groupby(['Month', 'Year'])['Customer_ID'].transform('nunique')
df_order_count['Avg'] = (df_order_count['Order_cnt'] / df_order_count['Customer_cnt']).astype(float).round(decimals=2)
df_order_count = df_order_count.drop_duplicates().reset_index(drop=True)

Try this:
g = df.groupby(['Month', 'Year'])
df_order_count['Avg'] = g['Order_ID'].transform('nunique') / g['Customer_ID'].transform('nunique')
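If one row per (Month, Year) is all you need, a single aggregation avoids the transform / drop_duplicates / reset_index round trip altogether. A minimal sketch, assuming the same column names as the question:
# One groupby with named aggregation yields one row per (Month, Year) directly.
df_order_count = (
    df.groupby(['Month', 'Year'], as_index=False)
      .agg(Order_cnt=('Order_ID', 'nunique'),
           Customer_cnt=('Customer_ID', 'nunique'))
)
df_order_count['Avg'] = (df_order_count['Order_cnt']
                         / df_order_count['Customer_cnt']).round(2)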

Related

Pandas DataFrame sorting issues, grouping for no reason?

I have a data frame containing stats for an NBA season. I'm simply trying to sort it by date, but for some reason it's grouping all games that have the same date and changing the values for that date to the same values.
df = pd.read_csv("gamedata.csv")
df["Total"] = df["Tm"] + df["Opp.1"]
teams = df['Team']
df = df.drop(columns=['Team'])
df.insert(loc=4, column='Team', value=teams)
df["W/L"] = df["W/L"]=="W"
df["W/L"] = df["W/L"].astype(int)
df = df.sort_values("Date")
df.to_csv("gamedata_clean.csv")
(Before and after screenshots of the dataframe were attached here.)
I expected the df to be unchanged except for the rows being in ascending date order, but it's changing values in other columns for reasons I do not know.
Please add this line to your code to sort your dataframe by date:
df = df.sort_values(by='Date')
I hope you will get the desired output.
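One additional thing worth checking, offered as a hedged guess since the CSV itself isn't shown: if the Date column is still plain text, sorting is lexicographic rather than chronological, so parsing it first may help. A minimal sketch:
import pandas as pd

df = pd.read_csv("gamedata.csv")

# Parse the Date column so sorting is chronological rather than string-based
# (assumes the column really is called "Date"; adjust the format if needed).
df["Date"] = pd.to_datetime(df["Date"])

# sort_values returns a new DataFrame, so assign the result back.
df = df.sort_values("Date").reset_index(drop=True)
df.to_csv("gamedata_clean.csv", index=False)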

Is it possible to create a for loop for a pandas df with 2 variables?

I have a pandas dataframe that contains weight information (weight column) for different users (Id column) and dates (date column, a pandas datetime object).
I would like to calculate the weight difference between the earliest and latest measurement for each user.
To find the earliest and latest measurement for each user, I use the following loop:
earliest_date = []
latest_date = []
for x in Id_list:
    a = weight_info[weight_info['Id'] == x]
    earliest_date.append(a['date'].min())
    latest_date.append(a['date'].max())
Then I want to create a for loop that passes in the Id and the earliest date to get the weight information, something like:
df = weight_info[(weight_info['date']==x) & (weight_info['Id']==y)]
df['weight']
But I am not sure how to do this with a for loop based on two variables. Or is there any easier way to run the whole calculation?
Get the min/max dates per user with groupby:
min_dates = weight_info.groupby('Id', as_index=False).agg({'date': 'min'})
max_dates = weight_info.groupby('Id', as_index=False).agg({'date': 'max'})
Then join with the weights to get the weight for the min/max date per user:
min_weights = weight_info.merge(min_dates[['Id', 'date']],
                                on=['Id', 'date'], how='inner')
max_weights = weight_info.merge(max_dates[['Id', 'date']],
                                on=['Id', 'date'], how='inner')
Finally, subtract the two for the same user.
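The last step could look something like this; a minimal sketch building on min_weights and max_weights above (the weight column name is assumed from the question):
# Align the first and last weights per user, then take the difference.
diff = max_weights[['Id', 'weight']].merge(
    min_weights[['Id', 'weight']], on='Id', suffixes=('_last', '_first'))
diff['weight_change'] = diff['weight_last'] - diff['weight_first']
print(diff[['Id', 'weight_change']])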
You could try using pandasql. This library lets you manipulate data in a pandas data frame using SQL. I have found it useful for working with data frames built from arbitrary csv files.
import pandasql as psql

df = your_pandas_df  # the DataFrame you want to query

# Show the record count in your dataset
record_count = psql.sqldf('''
    SELECT COUNT(*) AS record_count
    FROM df''')
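If you go that route for this particular problem, the same idea can pull the earliest and latest date per user in one query. A minimal sketch, assuming the weight_info frame described in the question:
# Earliest and latest measurement date per user, via SQL on the DataFrame.
date_span = psql.sqldf('''
    SELECT Id,
           MIN(date) AS earliest_date,
           MAX(date) AS latest_date
    FROM weight_info
    GROUP BY Id
''')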

Unique Count in Pandas with a Condition

Above is the data frame (shown in a screenshot) from which I need to get the count of account IDs that transacted newly in the month of May.
Condition for a new account: an account that has not transacted in the last 3 months.
The highlighted cells are the new accounts, from which I need only the distinct count of account IDs.
(Desired output shown in a screenshot.)
How can I do this using pandas in Python?
Similar to @tlouarn's answer: drop_duplicates() first and then use agg('count').
month = 'May'                          # pick the desired month
mdf = df[df['Month'] == month]         # rows of May
odf = df.iloc[:max(mdf.index) + 1]     # rows up to and including the last May row
                                       # (assumes the default RangeIndex)
odf = odf.drop_duplicates(             # keep only each account's first transaction
    subset=['Acc ID'], keep='first')
odf = odf[odf['Month'] == month]       # keep rows of May
ndf = odf.groupby(                     # thanks to @tlouarn
    by=['Month']
).agg(['count'])
print(ndf)
Assuming the dataframe is sorted on Month, I would use drop_duplicates(subset=['Acc ID', 'Month']) to filter out the duplicates, then groupby(by=['Month']).agg(['count']) to get to the desired output.
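A minimal sketch of that suggestion, assuming the column names 'Acc ID' and 'Month' used above:
# Keep one row per (account, month), then count per month.
deduped = df.drop_duplicates(subset=['Acc ID', 'Month'])
new_counts = deduped.groupby(by=['Month']).agg(['count'])
print(new_counts)
Counting a single column instead, e.g. deduped.groupby('Month')['Acc ID'].count(), gives a flat Series if the multi-level columns are not wanted.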

Creating a table for time series analysis including counts

I have a dataframe on which I would like to perform some analysis. An easy example of what I would like to achieve: starting from the dataframe
data = ['2017-02-13', '2017-02-13', '2017-02-13', '2017-02-15', '2017-02-16']
df = pd.DataFrame(data = data, columns = ['date'])
I would like to create a new dataframe from this. The new dataframe should contain two columns: the entire date span (so it should also include 2017-02-14) and the number of times each date appears in the original data.
I managed to construct a dataframe that includes all the dates as so:
dates = pd.to_datetime(df['date'], format = "%Y-%m-%d")
dateRange = pd.date_range(start = dates.min(), end = dates.max()).tolist()
df2 = pd.DataFrame(data = dateRange, columns = ['datum'])
My question is: how would I add the counts of each date from df to df2? I've been messing around trying to write my own functions but have not managed it. I assume this is a common task and that I am overcomplicating it...
Try this:
df2['counts'] = df2['datum'].map(pd.to_datetime(df['date']).value_counts()).fillna(0).astype(int)
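An equivalent way to build the whole thing, if it reads more clearly, is to reindex the value counts over the full date range; a hedged alternative using the same df as above:
# Count occurrences of each date, then align them to the full date range,
# filling dates with no rows with 0.
counts = pd.to_datetime(df['date']).value_counts()
df2 = counts.reindex(pd.date_range(counts.index.min(), counts.index.max()),
                     fill_value=0).rename_axis('datum').reset_index(name='counts')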

Adding correction column to dataframe

I have a pandas dataframe I read from a csv file with df = pd.read_csv("data.csv"):
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,5,8
2020-01-03,place2,2,9
I also have a dataframe with corrections, df_corr = pd.read_csv("corrections.csv"):
date,location,value
2020-01-02,place2,-1
2020-01-03,place2,2
How do I apply these corrections where date and location match to get the following?
date,location,value1,value2
2020-01-01,place1,1,2
2020-01-02,place2,4,8
2020-01-03,place2,4,9
EDIT:
I got two good answers and decided to go with set_index(). Here is how I did it 'non-destructively'.
df = pd.read_csv("data.csv")
df_corr = pd.read_csv("corr.csv")
idx = ['date', 'location']
df_corrected = df.set_index(idx).add(
    df_corr.set_index(idx).rename(
        columns={"value": "value1"}), fill_value=0
).astype(int).reset_index()
It looks like you want to join the two DataFrames on the date and location columns. After that, it's a simple matter of applying the correction by adding the value1 and value columns (and finally dropping the column containing the corrections).
# Join on the date and location columns.
df_corrected = pd.merge(df, df_corr, on=['date', 'location'], how='left')
# Apply the correction by adding the columns; rows with no correction get 0.
df_corrected.value1 = df_corrected.value1 + df_corrected.value.fillna(0)
# Drop the correction column.
df_corrected.drop(columns='value', inplace=True)
Set date and location as the index in both dataframes, add the two, and fillna:
df.set_index(['date', 'location'], inplace=True)
df_corr.set_index(['date', 'location'], inplace=True)
df['value1'] = (df['value1'] + df_corr['value']).fillna(df['value1'])
