I have a business requirement for which I am having trouble implementing a faster solution (the current solution takes 3 hours per iteration).
E.g., say I have a df and there's a list:
l = [[a,b,c],[d,e,f]]
To do:
Compare all the list values across each customer and check whether they exist.
If they exist, find the corresponding min and max date1.
The working pseudocode I currently have is:
for each customer:
    group by customer and collect the code column into a list
    for each list value:
        check if that particular list value exists (e.g. check if [a,b,c] exists in the first loop)
        if it exists:
            find the min date by group, etc.
These nested for loops take too long to execute since I have 100k+ customers.
Is there any way to improve this further? I already eliminated one for loop, reducing the time from 10 hours to 3.
l = [['a','b','c'],['d','e','f']]
First, flatten your list:
from itertools import chain
l = list(chain.from_iterable(l))  # flatten the nested list (avoids the private pandas.core.common module)
Then use boolean masking to keep only the rows whose code appears in the list:
newdf = df[df['code'].isin(l)]
Finally do groupby():
# Group by 'code' only:
newdf = newdf.groupby('code').agg(max_date1=('date1', 'max'), min_date1=('date1', 'min'))
# If you want to group by customerid and code, use:
newdf = newdf.groupby(['customerid', 'code']).agg(max_date1=('date1', 'max'), min_date1=('date1', 'min'))
Now if you print newdf you will get your desired output.
I slightly modified my approach. Instead of looping through each customer (I have 100k+ customers), I looped through each list, checked whether the customers were present, and then looped through the filtered customers (roughly the sketch below). This reduced the time by a couple of hours.
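For anyone curious, the revised approach looks roughly like this. Column names customerid, code, and date1 are assumed from the example above, and a sub-list such as ['a','b','c'] only counts when a customer has all of its codes:
# For each sub-list, find the customers that have every code in it,
# then get the min/max date1 over just those rows.
codes_per_customer = df.groupby('customerid')['code'].agg(set)

results = {}
for sub in l:  # l = [['a','b','c'], ['d','e','f']]
    wanted = set(sub)
    has_all = codes_per_customer[codes_per_customer.apply(wanted.issubset)].index
    matched = df[df['customerid'].isin(has_all) & df['code'].isin(wanted)]
    results[tuple(sub)] = matched.groupby('customerid')['date1'].agg(['min', 'max'])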
Thanks again for your help
I have a data frame with months (by year) and an ID number. I am trying to calculate the attrition rate, but I am getting stuck on obtaining unique ID counts when a month equals a certain month in pandas.
ID.  Month
1    Sept. 2022
2    Oct. 2022
etc., with possible duplicates in ID. and 1.75 years' worth of data.
import pandas as pd
path = some path on my computer
data = pd.read_excel(path)
if data["Month"] == "Sept. 2022":
    ID_SEPT = data["ID."].unique()
    return ID_SEPT
I am trying to discover what I am doing incorrectly here in this if-then statement. Ideally I am trying to collect all the unique ID values for each month of each year to then calculate the attrition rate. Is there something obvious I am doing wrong here?
Thank you.
I tried an if-then statement and I was expecting unique value counts of ID per month.
You need to use one of the iterator functions, like items().
for (columnName, columnData) in data.items():
    if columnName == 'Month':
        [code]
The way you do this with a dataframe, conceptually, is to filter the entire dataframe to be just the rows where your comparison is true, and then do whatever (get uniques) from there.
That would look like this:
filtered_df = df[df['Month'] == 'Sept. 2022']
ids_sept = list(filtered_df['ID.'].unique())
The first line there can look a little strange, but what it is doing is:
df['Month'] == 'Sept. 2022' will return an array/column/series (it actually returns a series) of True/False whether or not the comparison is, well, true or false.
You then run that series of bools through df[series_of_bools], which filters the dataframe to return only the rows where it is True.
Thus, you have a filter.
If you are looking for the number of unique items, rather than the list of unique items, you can also use filtered_df['ID.'].nunique() and save yourself the step later of getting the length of the list.
You are looking for pandas.groupby.
Use it like this to get the unique values of each Group (Month)
data.groupby("Month")["ID."].unique() # You have a . after ID in your example, check if thats correct
Try this:
data[data.Month=='Sept. 2022']['ID.'].unique()
I have the following dataframe called df1 that contains data for a number of regions in the column NUTS_ID:
The index, called Date, has all the days of 2010. That is, for each code in NUTS_ID I have every day of 2010 (all days of the year for AT1, AT2, and so on). I created a list containing the dates corresponding to workdays, and I want to add a column with 0 for non-workdays and 1 for workdays.
For this, I simply used a for loop that checks day by day if it's in the workday list I created:
for day in df1.index:
    if day not in workdays_list:
        df1.loc[day, 'Workday'] = 0  # Assigning 0 to non-workdays
    else:
        df1.loc[day, 'Workday'] = 1  # Assigning 1 to workdays
This works well enough if the dataset is not big, but with some of the datasets I'm processing it takes a very long time. I would like to ask for ideas on how to make the process faster and more efficient. Thank you in advance for your input.
EDIT: One of the things I have thought is that maybe a groupby could be helpful, but I don't know if that is correct.
You can use np.where with isin to check if your Date (i.e. your index) is in the list you created:
import numpy as np
df1['Workday'] = np.where(df1.index.isin(workdays_list), 1, 0)
I can't reproduce your dataset, but something along those lines should work.
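A tiny self-contained check of that pattern (toy index and workday list, just for illustration):
import numpy as np
import pandas as pd

# Toy frame: five days of January 2010 as the index.
df1 = pd.DataFrame(index=pd.date_range("2010-01-01", periods=5, freq="D"))
workdays_list = [pd.Timestamp("2010-01-04"), pd.Timestamp("2010-01-05")]

df1["Workday"] = np.where(df1.index.isin(workdays_list), 1, 0)
print(df1)  # 0 for the first three days, 1 for the last two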
I have a dataframe with three columns
The first column has 3 unique values. I used the code below to create a separate dataframe for each unique value; however, I am unable to iterate over those dataframes and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find the length of the first unique dataframe. If I manually type the name of the df I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over that dataframe just as I would by typing its name.
What I'm looking for is this: if I try the code below,
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the df has more than two columns, where the key would be the unique group and the value contains the other two columns of the same row.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group  Description                         Document
Group A           Text to be updated on the ticket 1  doc1.pdf
Group B           Text to be updated on the ticket 2  doc2.pdf
Group A           Text to be updated on the ticket 3  doc3.pdf
Group B           Text to be updated on the ticket 4  doc4.pdf
Group A           Text to be updated on the ticket 5  doc5.pdf
Group B           Text to be updated on the ticket 6  doc6.pdf
Group C           Text to be updated on the ticket 7  doc7.pdf
Group C           Text to be updated on the ticket 8  doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A's tickets should go to one group; however, for each description a unique task has to be created. We can club 10 tasks at once and submit them as one request, so if I divide the df into different dfs based on Assignment_group it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002 ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, like getting the mean value of price for each group if that were a column.
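Rather than creating named dataframes at all, you can also iterate the groups directly. A sketch of the ticket-batching idea follows (10 tasks per request, as described; create_request and create_subtask are hypothetical placeholders for the ServiceNow calls, and the group column is written both as 'Assignment Group' and 'Assignment_group' above, so adjust to match your file):
# Iterate each group's sub-dataframe directly; no globals() or eval() needed.
for group_name, sub_df in df.groupby('Assignment Group'):
    rows = list(sub_df.itertuples(index=False))
    # Club up to 10 tasks into a single request, as described above.
    for i in range(0, len(rows), 10):
        request_id = create_request(group_name)  # hypothetical ServiceNow call
        for row in rows[i:i + 10]:
            create_subtask(request_id, row.Description, row.Document)  # hypothetical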
I think you want to try something like len(eval('df%s' % 0))
I am trying to write a simple piece of code in which I have units produced in a dataframe, with a 'Yield' column and the 'Date' on which they were produced. Multiple records are present for the same date. I am going to use NumPy's cumsum function to get a running total for each row and then subtract the current row's value. I do not wish to aggregate by date, since I need the original raw records to remain.
I can do this for one date at a time by making a .loc selection for that date and then applying the function, but I can't figure out how to do this iteratively.
data_43102 = data.loc[data['PDate'] == 43102, ['Yield_Done', 'PDate']]
# gives me Yield_Done for only 43102
data_43102['Running_total'] = data_43102['Yield_Done'].cumsum()  # gives me the cumulative total
data_43102['Running_total'] = data_43102['Running_total'] - data_43102['Yield_Done']
What I expect after running the code is output like in the single-date case I had.
You can store all the dates in a list and then use isin to get the data filtered for all the dates at once:
dates = ['43102', '23102', '43102'...]
data_filtered_by_date = data.loc[data['PDate'].isin(dates), ['Yield_Done', 'PDate']]
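And if the end goal is a running total within each date, without any per-date loop, a groupby-based sketch (column names taken from the question) may get there directly:
# Cumulative Yield_Done within each PDate, minus the current row's value,
# computed in one pass while keeping every raw record.
data['Running_total'] = (
    data.groupby('PDate')['Yield_Done'].cumsum() - data['Yield_Done']
)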
I hope this helps.
I have a dataframe (df_data) with 14 columns of info covering 1 month. I pulled out one week's data (df1) and made a list of all the account numbers in it (accounts1).
What I am trying to do is go through each value in that accounts1 list, check whether it is counted more than once in df_data, and if so, save that account number to a new list of repeats only.
Then I want to take that repeats list and pull the 14 columns out of the original df_data so I have all the rows, across all 14 columns, for each occurrence of those account numbers.
I'm getting stuck with the list of repeated account numbers. I used the following code, which seems to have worked to create a list of results:
import collections

repeats = []
cnt = collections.Counter(accounts1)
repeats.append([k for k, v in cnt.items() if v > 1])
print((repeats).count)
But the number of elements in that list is just under 3,000, whereas when I used .unique() and checked the difference it should be a little over 5,000. What am I doing wrong? And how can I then use those elements to pull the columns from the original dataframe?
Basically say I had
accounts1 = df1['accntnum']  # values: [0, 1, 2, 5, 8, 2, 5, 0, 0, 7]
I would want it to cycle through and pull out each repeat from df_data and return a list of them like
repeats = [0, 2, 5, 7]
(There are numbers in the monthly df_data that are in weekly df1 but may not be repeated there yet)
Then I'd like to use that list to pull from df_data['accntnum'], thinking something like:
df_repeats = df_data[df_data['accntnum'].isin(repeats)]
Oh also, I'm really only interested in the first occurrence of a repeat. There is a date and time column that can help sort those out in the end though. Thank you in advance!
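A sketch of the whole pipeline using value_counts, with the date/time column name assumed (here called 'datetime'):
# Count how often each account number appears in the full month of data.
counts = df_data['accntnum'].value_counts()
repeats = counts[counts > 1].index  # accounts appearing more than once

# All 14 columns for every occurrence of a repeated account number...
df_repeats = df_data[df_data['accntnum'].isin(repeats)]

# ...then keep only the first occurrence, assuming a 'datetime' column
# holds the date-and-time mentioned above.
first_occurrences = (
    df_repeats.sort_values('datetime').drop_duplicates('accntnum', keep='first')
)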