Above is the data frame from which I need to get the count of account IDs that transacted newly in the month of May.
Condition for a new account: an account that has not transacted in the last 3 months.
The highlighted cells are the new accounts, from which I need only the distinct count of account IDs.
Desired output:
Using pandas (Python).
Similar to @tlouarn's answer: drop_duplicates() first, then use agg('count'):
month = 'May'                        # pick the desired month
mdf = df[df['Month'] == month]       # rows of May
odf = df.iloc[:max(mdf.index) + 1]   # rows up to the end of May (assumes a default RangeIndex)
odf = odf.drop_duplicates(           # keep only each account's first transaction
    subset=['Acc ID'],
    keep='first')
odf = odf[odf['Month'] == month]     # keep accounts whose first transaction is in May
ndf = odf.groupby(                   # thanks to @tlouarn
    by=['Month']
).agg(['count'])
print(ndf)
Assume that the dataframe is sorted on Month.
I would use drop_duplicates(subset=['Acc ID', 'Month']) to filter out the duplicates, then use groupby(by=['Month']).agg(['count']) to get to the desired output.
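A minimal sketch of that approach (assuming the frame is named df with 'Acc ID' and 'Month' columns, as above):
dedup = df.drop_duplicates(subset=['Acc ID', 'Month'])  # one row per account per month
counts = dedup.groupby(by=['Month']).agg(['count'])     # count of distinct accounts per month
print(counts)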
This is the data format; the date column shows the first entry date. I have data for 30 months, and I am trying to get the max of dues missed in the first 3 months for each ID.
Considering your dataframe, as shown in your picture:
import pandas as pd

df = pd.DataFrame({
    'ID': [100, 100, 100, 100, 100,
           101, 101, 101, 101, 101],
    'Entry Date': ['2020-04-10', '2020-05-10', '2020-06-10', '2020-07-10', '2020-08-10',
                   '2020-07-25', '2020-08-25', '2020-09-25', '2020-10-25', '2020-11-25'],
    'Due missed': [0, 0, 7, 0, 5,
                   9, 1, 5, 7, 10]
})
df['Entry Date'] = pd.to_datetime(df['Entry Date'])
What you will want to do is sort the dataframe so the first 3 months are at the top per ID:
df.sort_values(['ID', 'Entry Date'], inplace=True)
Then you can group by ID and select the top 3 rows (using head(3)), and select the maximum value of those three rows (using ['Due missed'].max()):
df.groupby('ID').apply(lambda x: x.head(3)['Due missed'].max())
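An equivalent sketch that avoids apply, under the same assumption that the frame is sorted within each ID: take the first three rows per ID with head(3) on the groupby, then aggregate:
first3 = df.sort_values(['ID', 'Entry Date']).groupby('ID').head(3)  # first 3 entries per ID
print(first3.groupby('ID')['Due missed'].max())                      # max dues missed among them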
A more efficient way to do this?
I have sales records imported from a spreadsheet. I start by importing that list into a dataframe. I then need to get the average orders per customer by month and year.
The spreadsheet does not contain counts, just order and customer IDs.
So I have to count each ID, then drop duplicates, and then reset the index.
The final dataframe is exported back into a spreadsheet and a SQL database.
The code below works and I get the desired output, but it seems like it should be more efficient. I am new to pandas and Python, so I'm sure I could do this better.
df_customers = df.filter(
    ['Month', 'Year', 'Order_Date', 'Customer_ID', 'Customer_Name', 'Patient_ID', 'Order_ID'], axis=1)
df_order_count = df.filter(
    ['Month', 'Year'], axis=1)
df_order_count['Order_cnt'] = df_customers.groupby(['Month', 'Year'])['Order_ID'].transform('nunique')
df_order_count['Customer_cnt'] = df_customers.groupby(['Month', 'Year'])['Customer_ID'].transform('nunique')
df_order_count['Avg'] = (df_order_count['Order_cnt'] / df_order_count['Customer_cnt']).astype(float).round(decimals=2)
df_order_count = df_order_count.drop_duplicates().reset_index(drop=True)
Try this:
g = df.groupby(['Month', 'Year'])
df_order_count['Avg'] = g['Order_ID'].transform('nunique')/g['Customer_ID'].transform('nunique')
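If one row per (Month, Year) is the end goal, a named-aggregation sketch (column names as in the question) avoids the transform-then-drop_duplicates round trip entirely:
df_order_count = df.groupby(['Month', 'Year'], as_index=False).agg(
    Order_cnt=('Order_ID', 'nunique'),        # distinct orders per month/year
    Customer_cnt=('Customer_ID', 'nunique'))  # distinct customers per month/year
df_order_count['Avg'] = (df_order_count['Order_cnt'] / df_order_count['Customer_cnt']).round(2)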
I have an original df with 4 columns: user (user ID visiting the website), month (month the user visited), year (year the user visited), and num_hits (number of times the user visited that month for that year).
I want to plot by user and year, the month (x-axis) and the num_hits (y-axis). I created a list of tuples in pandas as a column using:
df['tup'] = list(zip(df['month'], df['num_hits']))
df1 = df.groupby(['user', 'year'], as_index = False)['tup'].agg(list)
But here is where I got stuck: I wanted to sort the list of tuples in the 'tup' column by their first element so I could then plot each of these lists of tuples. My solution was to create a list of lists from the df and then sort each inner list by its first element, like this:
df2 = df1['tup'].values.tolist()
for i in df2:
    i.sort(key=lambda x: x[0])
So then I could plot them using:
for i in range(len(df2)):
    plt.plot(*zip(*df2[i]))
But by doing this, I lost the user and year information that I wanted to keep in order to display it in the plot's legend for the corresponding line. Is there any way of sorting the list of tuples in the pandas df and then plotting it directly with matplotlib, so that I can display the user and the year in the legend for the corresponding line? Thank you in advance.
The simplest solution is to not use tuples at all. You can create a pivot table with the user and year columns as the index, the month column as the columns, and the num_hits column as the values. By sorting the rows by month first, the columns will be in the correct order. By then transposing the dataframe, so that month is the index and user and year are the columns, you can simply call .plot(), which will return what you need:
df.sort_values("month").pivot(index=["user", "year"], columns="month", values="num_hits").T.plot()
This could be broken up into stages, if you would prefer:
# create the pivot table
df1 = df.sort_values("month").pivot(index=["user", "year"], columns="month", values="num_hits")
# transpose
df2 = df1.T
# plot
df2.plot()
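The legend entries are then the (user, year) column pairs of the transposed frame. A small, hypothetical tweak (the title text is mine, not from the original) makes that explicit:
import matplotlib.pyplot as plt

ax = df2.plot()                   # one line per (user, year) column
ax.legend(title='(user, year)')   # label what each legend entry means
plt.show()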
And the data I used, ensuring that the months were not sorted to start with, so that it would definitely need to change to be correct:
import pandas as pd
import numpy as np
df = pd.DataFrame({"user": [1]*12*3 + [2]*12*3 + [3]*12*3 + [4]*12*3 + [5]*12*3,
"month": list(np.arange(12, 0, -1))*3*5,
"year": ([2019]*12 + [2020]*12 + [2021]*12)*5,
"num_hits": np.random.randint(0, 1000, 12*3*5)})
Although it is not stated in the documentation from what I can see, .pivot() appears to sort the columns anyway, so you shouldn't even need .sort_values().
I have a list of stock data over the years, and I want to remove the bottom x% in terms of market cap in each month. My idea was to make a loop that creates a new pandas dataframe for each month and then, within that month, removes the bottom x% by market cap. This is what the data looks like:
df['date'] = pd.to_datetime(df['date'])
df['year-month'] = df['date'].dt.strftime('%Y-%m')  # grouping key, e.g. '2021-03'
df = df.sort_values(['year-month', 'MarketCap'], ascending=[True, False])
# within each month, keep only rows above the 10th percentile of MarketCap
df = df.groupby('year-month').apply(lambda x: x[x['MarketCap'] > x['MarketCap'].quantile(.1)]).reset_index(1, drop=True)
df = df.reset_index(drop=True).drop(columns=['year-month'])  # tidy up the index and helper column
First, create a column that contains only the year and month.
Then sort by the year-month and MarketCap columns, ascending and descending respectively.
Finally, group by the year-month column and filter out the rows below the 10th percentile of MarketCap in each group.
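An equivalent sketch that skips groupby.apply, using transform to compute each month's 10th-percentile cutoff (same column names assumed):
cutoff = df.groupby('year-month')['MarketCap'].transform(lambda s: s.quantile(.1))
df = df[df['MarketCap'] > cutoff].drop(columns=['year-month'])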
I have the following Excel sheet, which I've imported into pandas using read_csv:
df
<table><tbody><tr><th>Order ID</th><th>Platform</th><th>Media Source</th><th>Campaign</th><th>1st order</th><th>Order fulfilled</th><th>Date</th></tr><tr><td>1</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>2</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td></tr><tr><td>3</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>4</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>5</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td></tr><tr><td>6</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td></tr><tr><td>7</td><td>Mobile</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>8</td><td>Web</td><td>Google</td><td>Cmp2</td><td>FALSE</td><td>FALSE</td><td>2/1/2019</td></tr><tr><td>9</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr><tr><td>10</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td></tr></tbody></table>
I want to add a new column NewOrderForDate which gives me a count of all the orders for that campaign on that date where 1st order = TRUE.
Here's how the dataframe should look after adding this column:
<table><tbody><tr><th>Order ID</th><th>Platform</th><th>Media Source</th><th>Campaign</th><th>1st order</th><th>Order fulfilled</th><th>Date</th><th>NewOrderForDate </th></tr><tr><td>1</td><td>Web</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>2</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>FALSE</td><td>TRUE</td><td>2/1/2019</td><td>2</td></tr><tr><td>3</td><td>Web</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>4</td><td>Web</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>5</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>2/1/2019</td><td>2</td></tr><tr><td>6</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>1/1/2019</td><td>5</td></tr><tr><td>7</td><td>Mobile</td><td>Facebook</td><td>FBCmp</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>8</td><td>Web</td><td>Google</td><td>Cmp2</td><td>TRUE</td><td>FALSE</td><td>2/1/2019</td><td>2</td></tr><tr><td>9</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>TRUE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr><tr><td>10</td><td>Mobile</td><td>Google</td><td>Cmp1</td><td>FALSE</td><td>TRUE</td><td>1/1/2019</td><td>5</td></tr></tbody></table>
If I had to do this in Excel, I'd probably use
=COUNTIFS(G$2:G$11,G2,E$2:E$11,"TRUE")
Basically, I want to group by campaign and date, get a count of all the orders where 1st order = TRUE, and write these values to a new column.
Group by 'Campaign', count the rows where '1st order' is TRUE, and add the 'NewOrderForDate' column for each group.
def udf(grp_df):
    # count rows in this campaign where '1st order' is True
    grp_df['NewOrderForDate'] = len(grp_df[grp_df['1st order'] == True])
    return grp_df

result = df.groupby('Campaign', as_index=False, group_keys=False).apply(udf)
Use transform to keep the index shape, and sum the boolean values of '1st order':
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform('sum')
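One caveat, as an assumption about the import rather than something stated above: if read_csv left '1st order' as the strings 'TRUE'/'FALSE' instead of booleans, map it to booleans first so the sum counts correctly:
# hypothetical pre-step, only needed if the flags were read as text
df['1st order'] = df['1st order'].map({'TRUE': True, 'FALSE': False})
df['NewOrderForDate'] = df.groupby(['Date', 'Campaign'])['1st order'].transform('sum')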