I have a data frame with months (by year) and an ID number. I am trying to calculate the attrition rate, but I am getting stuck on obtaining unique ID counts when a month equals a certain month in pandas.
ID.   Month
1     Sept. 2022
2     Oct. 2022
etc., with possible duplicates in ID and 1.75 years' worth of data.
import pandas as pd
path = some path on my computer
data = pd.read_excel(path)
if data["Month"] == "Sept. 2022":
    ID_SEPT = data["ID."].unique()
    return ID_SEPT
I am trying to figure out what I am doing incorrectly in this if statement. Ideally, I want to collect all the unique ID values for each month of each year and then calculate the attrition rate. Is there something obvious I am doing wrong here?
Thank you.
I tried an if-then statement and was expecting unique value counts of ID per month.
You need to use one of the iterator functions, like items().
for (columnName, columnData) in data.items():
    if columnName == 'Month':
        [code]
The way you do this with a dataframe, conceptually, is to filter the entire dataframe down to just the rows where your comparison is true, and then do whatever you need (get the uniques) from there.
That would look like this:
filtered_df = df[df['Month'] == 'Sept. 2022']
ids_sept = list(filtered_df['ID.'].unique())
The first line there can look a little strange, but what it is doing is:
df['Month'] == 'Sept. 2022' returns an array/column/series (it actually returns a Series) of True/False values indicating whether or not the comparison holds for each row.
You then run that series of bools through df[series_of_bools], which filters the dataframe and returns only the rows where the value is True.
Thus, you have a filter.
If you are looking for the number of unique items, rather than the list of unique items, you can also use filtered_df['ID.'].nunique() and save yourself the step later of getting the length of the list.
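To make that concrete, here is a minimal, self-contained sketch of the filter-then-unique pattern on made-up data; the column names "ID." and "Month" are taken from the question:
import pandas as pd

# Toy data mirroring the question's columns
data = pd.DataFrame({
    "ID.": [1, 2, 2, 3, 1],
    "Month": ["Sept. 2022", "Sept. 2022", "Sept. 2022", "Oct. 2022", "Oct. 2022"],
})

# Boolean mask -> filtered frame -> unique IDs for that month
filtered_df = data[data["Month"] == "Sept. 2022"]
ids_sept = list(filtered_df["ID."].unique())  # [1, 2]
n_ids_sept = filtered_df["ID."].nunique()     # 2
print(ids_sept, n_ids_sept)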
You are looking for pandas.groupby.
Use it like this to get the unique values of each group (Month):
data.groupby("Month")["ID."].unique()  # You have a "." after ID in your example; check if that's correct
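A quick, hedged illustration of what that returns, using a small made-up frame (column names assumed from the question):
import pandas as pd

data = pd.DataFrame({
    "ID.": [1, 2, 2, 3, 1],
    "Month": ["Sept. 2022", "Sept. 2022", "Sept. 2022", "Oct. 2022", "Oct. 2022"],
})

# Unique IDs per month...
print(data.groupby("Month")["ID."].unique())
# ...or just the count of unique IDs per month, which is what an attrition rate needs
print(data.groupby("Month")["ID."].nunique())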
Try this:
data[data.Month=='Sept. 2022']['ID.'].unique()
Related
I have a certain business requirement for which I am having trouble implementing a faster solution (the current solution takes 3 hours per iteration).
E.g., say I have a df, and there's a list:
l = [[a,b,c],[d,e,f]]
To do:
Compare all the list values across customers and check whether they exist.
If they exist, find the corresponding min and max date1.
Currently, the (pseudo) working code I have is:
for each customer:
    group by customer and collect the code column into a list
    for each list value:
        check if that particular list value exists (e.g., check if [a,b,c] exists in the first loop)
        if it exists:
            check for the min date by group, etc.
This nested for loop is taking too long to execute since I have 100k+ customers.
Is there any way to improve this further? I already eliminated one for loop, reducing the time from 10 hours to 3.
l = [['a','b','c'],['d','e','f']]
First, flatten your list:
from pandas.core.common import flatten
l = list(flatten(l))
Then do boolean masking to check whether the codes exist in your dataframe:
newdf = df[df['code'].isin(l)]
Finally, do a groupby():
# The code below groups by 'code':
newdf = newdf.groupby('code').agg(max_date1=('date1', 'max'), min_date1=('date1', 'min'))
# If you want to group by customerid and code, use:
newdf = newdf.groupby(['customerid', 'code']).agg(max_date1=('date1', 'max'), min_date1=('date1', 'min'))
Now if you print newdf, you will get your desired output.
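For reference, a self-contained sketch of the whole sequence on made-up data; the column names customerid, code, and date1 are assumptions carried over from the answer above:
import pandas as pd
from pandas.core.common import flatten  # internal helper used above; a list comprehension works too

# Made-up data with the assumed column names
df = pd.DataFrame({
    "customerid": [1, 1, 1, 2, 2],
    "code": ["a", "b", "x", "d", "e"],
    "date1": pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01",
                             "2021-01-15", "2021-04-01"]),
})

l = [["a", "b", "c"], ["d", "e", "f"]]
l = list(flatten(l))                     # ['a', 'b', 'c', 'd', 'e', 'f']

newdf = df[df["code"].isin(l)]           # keep only rows whose code is in the list
newdf = newdf.groupby(["customerid", "code"]).agg(
    max_date1=("date1", "max"),
    min_date1=("date1", "min"),
)
print(newdf)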
I slightly modified my approach.
Instead of looping through each customer (I have 100k+ customers), I looped through each list:
I checked whether the customers were present and then looped through the filtered customers.
This reduced the time by a couple of hours.
Thanks again for your help
I have the following dataframe called df1 that contains data for a number of regions in the column NUTS_ID:
The index, called Date, has all the days of 2010. That is, for each code in NUTS_ID I have every day of 2010 (all days of the year for AT1, AT2, and so on). I created a list containing the dates corresponding to workdays, and I want to add a column with 0 for non-workdays and 1 for workdays.
For this, I simply used a for loop that checks, day by day, whether the date is in the workdays list I created:
for day in df1.index:
    if day not in workdays_list:
        df1.loc[day, 'Workday'] = 0  # assign 0 to non-workdays
    else:
        df1.loc[day, 'Workday'] = 1  # assign 1 to workdays
This works well enough if the dataset is not big, but with some of the datasets I'm processing it takes a very long time. I would like to ask for ideas on how to make the process faster and more efficient. Thank you in advance for your input.
EDIT: One of the things I have thought is that maybe a groupby could be helpful, but I don't know if that is correct.
You can use np.where with isin to check if your Date (i.e. your index) is in the list you created:
import numpy as np
df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
I can't reproduce your dataset, but something along those lines should work.
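For what it's worth, a minimal sketch of that answer on a made-up frame (the DatetimeIndex, the NUTS_ID value, and the contents of workdays_list are all assumptions):
import numpy as np
import pandas as pd

# Made-up frame with a DatetimeIndex called Date, as described in the question
idx = pd.date_range("2010-01-01", "2010-01-07", name="Date")
df1 = pd.DataFrame({"NUTS_ID": "AT1"}, index=idx)

# Assumed list of workdays for that week
workdays_list = list(pd.date_range("2010-01-04", "2010-01-07"))

df1["Workday"] = np.where(df1.index.isin(workdays_list), 1, 0)
print(df1)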
I am trying to assign a proportion value to a column in a specific row inside my df. Each row represents a unique product's sales in a specific month, in a dataframe (called testingAgain) like this:
Month  ProductID(SKU)  Family  Sales    ProporcionVenta
1      1234            FISH    10000.0  0.0
This row represents product 1234's sales during January. (It is an aggregate, so it represents every January in the DB)
Now I am trying to find the proportion of sales of that unique productid-month relative to the sum of sales of its family-month. For example, the family FISH has sold 100,000 in month 1, so in this specific case it would be calculated as 10,000/100,000 (productid-month sales / family-month sales).
I am trying to do so like this:
for family in uniqueFamilies:
    for month in months:
        salesFamilyMonth = testingAgain[(testingAgain['Family'] == family) & (testingAgain['Month'] == month)]['Qty'].sum()
        for sku in uniqueSKU:
            salesSKUMonth = testingAgain[(testingAgain['Family'] == family) & (testingAgain['Month'] == month) & (testingAgain['SKU'] == sku)]['Qty'].sum()
            proporcion = salesSKUMonth / salesFamilyMonth
            testingAgain[(testingAgain['SKU'] == sku) & (testingAgain['Family'] == family) & (testingAgain['Month'] == month)]['ProporcionVenta'] = proporcion
The code works and runs, and I have even individually printed the proportions and checked them in Excel; they are correct. The problem is with the last line: as soon as the code finishes running, I print testingAgain and see all proportions still listed as 0.0, even though they should have been assigned the new values.
I'm not completely convinced about my approach, but I think it is decent.
Any ideas on how to solve this problem?
Thanks, appreciate it.
Generally, in Pandas (and even NumPy), unlike general-purpose Python, analysts should avoid using for loops, since there are many vectorized options for running conditional or grouped calculations. In your case, consider groupby().transform(), which returns inline aggregates (i.e., aggregate values without collapsing rows) or, as the docs indicate, values broadcast to match the shape of the input array.
Currently, your code is attempting to assign a value to a subsetted slice of a data frame column, which should raise SettingWithCopyWarning. Such an operation does not affect the original data frame. Your loop can use .loc for conditional assignment:
testingAgain.loc[(testingAgain['SKU'] == sku) &
                 (testingAgain['Family'] == family) &
                 (testingAgain['Month'] == month), 'ProporcionVenta'] = proporcion
However, avoid looping altogether, since transform works nicely for assigning new data frame columns. Also, div below is the Series division method (functionally equivalent to the / operator).
testingAgain['ProporcionVenta'] = (testingAgain.groupby(['SKU', 'Family', 'Month'])['Qty'].transform('sum')
                                   .div(testingAgain.groupby(['Family', 'Month'])['Qty'].transform('sum'))
                                  )
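A quick toy check of that transform expression; the column names SKU, Family, Month, and Qty are taken from the question's code, and the values are made up:
import pandas as pd

testingAgain = pd.DataFrame({
    "Month":  [1, 1, 1],
    "SKU":    [1234, 5678, 9999],
    "Family": ["FISH", "FISH", "MEAT"],
    "Qty":    [10000.0, 90000.0, 5000.0],
})

sku_month = testingAgain.groupby(["SKU", "Family", "Month"])["Qty"].transform("sum")
family_month = testingAgain.groupby(["Family", "Month"])["Qty"].transform("sum")
testingAgain["ProporcionVenta"] = sku_month.div(family_month)
print(testingAgain)  # the FISH rows get 0.1 and 0.9, the MEAT row gets 1.0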
I am trying to write some simple code in which I have the units produced in a dataframe column 'Yield' and the 'Date' on which they were produced. Multiple records are present for the same date. I am going to use the numpy cumsum function to get a running total for each row and then subtract the value of the current row. I do not wish to aggregate by date, since I need the original raw records to remain.
I can do this for one date by making a .loc-filtered variable for that date and then applying the function, but I can't figure out how to do this iteratively.
data_43102 = data[['Yield_Done', 'PDate']].loc[data['PDate'] == 43102]
# gives me Yield_Done for only 43102
data_43102['Running_total'] = np.cumsum(data_43102['Yield_Done'])  # gives me the cumulative total
data_43102['Running_total'] = data_43102['Running_total'] - data_43102['Yield_Done']
What I expect after running the code is output like in the single-date case I had.
You can store all the dates in a list and then use isin to get data filtered for all the dates:
dates = [43102, 23102, ...]
data_filtered_by_date = data[['Yield_Done', 'PDate']].loc[data['PDate'].isin(dates)]
I hope this helps.
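If the end goal is the running total per date across every date at once (without filtering one date at a time), one option worth mentioning, not from the answer above, is groupby().cumsum(); a sketch assuming columns named 'PDate' and 'Yield_Done':
import pandas as pd

data = pd.DataFrame({
    "PDate":      [43102, 43102, 43102, 43103, 43103],
    "Yield_Done": [10, 20, 30, 5, 15],
})

# Cumulative sum within each date, minus the current row's own yield,
# so each row gets the running total of the rows before it on that date
data["Running_total"] = (
    data.groupby("PDate")["Yield_Done"].cumsum() - data["Yield_Done"]
)
print(data)  # 0, 10, 30 for 43102 and 0, 5 for 43103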
I need to write a function which takes a df with data and returns a string with the country whose GDP is the maximum among countries with an area (sq km) of less than 200 OR a population of less than 1000.
How do I write this code correctly?
def find_country(df):
    df.loc[((df.Area < 200).Max(df.GDP)) | (df.Population < 1000)]
First of all, you should make your first column the index. This can be done using the following command:
df.set_index('Country', inplace=True)
This assumes you want to replace your dataframe with the reworked version.
To find your desired country, you simply look for the row which has the maximum GDP and return its index. The subscript on the index is needed to get the actual value.
def find_Country(df):
    return df[df['GDP'] == max(df['GDP'])].index[0]
I hope this will help,
Fabian
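For completeness, a hedged sketch that combines the filter from the question (area < 200 OR population < 1000) with the max-GDP lookup; the column names Country, Area, Population, and GDP are assumptions based on the question, and the data is made up:
import pandas as pd

def find_country(df):
    # Keep only countries with area < 200 sq km OR population < 1000,
    # then return the name of the one with the highest GDP
    small = df[(df["Area"] < 200) | (df["Population"] < 1000)]
    return small.loc[small["GDP"].idxmax(), "Country"]

# Toy usage
df = pd.DataFrame({
    "Country":    ["A", "B", "C"],
    "Area":       [150, 5000, 180],
    "Population": [900, 2000000, 50000],
    "GDP":        [10.0, 999.0, 25.0],
})
print(find_country(df))  # "C"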