Check for Duplicate values and Pull Info to New Dataframe - python

I have a dataframe (df_data) with 14 columns of info covering one month. I pulled out one week's data (df1) and made a list of all the account numbers in it (accounts1).
What I am trying to do is go through each value in the accounts1 list, check whether it is counted more than once in df_data, and if so save that account number to a new list of repeats only.
Then I want to take that repeats list and pull the 14 columns out of the original df_data, so I have all the rows of all 14 columns for each occurrence of those account numbers.
I'm getting stuck with the list of repeated account numbers. I used the following code, which seems to have worked to create a list with results:
cnt = collections.Counter(accounts1)
repeats.append([k for k, v in cnt.items() if v > 1])
print((repeats).count)
but the number of elements in that list is just under 3,000. When I used .unique() and checked the difference, it should be a little over 5,000. What am I doing wrong? And how can I then use those elements to pull the columns from the original dataframe?
Basically say I had
accounts1 = df1['accntnum'] = [0,1,2,5,8,2,5,0,0,7]
I would want it to cycle through and pull out each repeat from df_data and return a list of them like
repeats = [0, 2, 5, 7]
(There are numbers in the monthly df_data that are in weekly df1 but may not be repeated there yet)
Then I'd like to use that list to pull from df_data['accntnum'], thinking something like
df_repeats = df_data[df_data['accntnum'] isin repeats]]
Oh also, I'm really only interested in the first occurrence of a repeat. There is a date and time column that can help sort those out in the end though. Thank you in advance!
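A rough sketch of the full flow, under the question's assumptions (the account column is accntnum; the date/time column name used here, date_time, is only a placeholder):
import collections
import pandas as pd
# count each account's occurrences in the full month of data
cnt = collections.Counter(df_data['accntnum'])
# flat list of weekly accounts that repeat in the monthly data
repeats = [a for a in pd.unique(accounts1) if cnt[a] > 1]
print(len(repeats))
# pull every row (all 14 columns) for those accounts
df_repeats = df_data[df_data['accntnum'].isin(repeats)]
# keep only the first occurrence per account, earliest first
df_first = (df_repeats.sort_values('date_time')
                      .drop_duplicates(subset='accntnum', keep='first'))
Note that repeats.append([...]) in the question nests the whole comprehension as a single element, and (repeats).count without arguments is just the bound method; building the flat list directly and using len() avoids both pitfalls.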


How to iterate through each row of a groupby object created by groupby()?

I'm working with a large dataset that includes all police stops in my city since 2014. The dataset has millions of rows, but sometimes there are multiple rows for a single stop (so, if the police stopped a group of 4 people, it is included in the database as 4 separate rows even though it's all the same stop). I'm looking to create a new column in the dataset orderInStop, which is a count of how many people were stopped in sequential order. The first person caught up in the stop would have a value of 1, the second person a value of 2, and so on.
To do so, I have used the groupby() function to group all rows that match on time & location, which is the indication that the rows are all part of the same stop. I can manage to create a new column that includes the TOTAL count of the number of people in the stop (so, if there were 4 rows with the same time & location, all four rows have a value of 4 for the new orderInStop variable). But I need the first row in the group to have a value of 1, the second a value of 2, the third 3, and the fourth 4.
Below is my code attempt at iterating through each group I've created to sequentially count each row within each group, but the code doesn't quite work (it populates the entire column rather than each row within the groups). Any help to tweak this code would be much appreciated!
Note: I also tried using logical operators in a for loop, to essentially ask IF the time & location column values match for the current and previous rows, but ran into too many problems with 'The truth value of a Series is ambiguous' errors, so instead I'm trying to use groupby().
Attempt that creates a total count rather than sequential count:
df['order2'] = df.groupby(by=["Date_Time_Occur", "Location"])['orderInStop'].transform('count')
Attempt that fails, to iterate through each row in each group:
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
    count = 1
    for row in groups:
        df['order3'] = count
        count = count + 1
In your example for row in groups iterates over the column names, since groups is a DataFrame.
To iterate over each row you could do
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
    count = 1
    for i, row in groups.iterrows():  # i is the row's index label, row a pandas Series
        df.loc[i, 'order3'] = count   # assign only this row, not the whole column
        count = count + 1
Note that your solution relies on pandas groupby to preserve row order. This should be the case, see this question, but there is very likely a shorter & safer solution (see fsimonjetz comment for a starting point).
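For reference, a shorter vectorized alternative (assuming the same column names) is cumcount, which numbers the rows within each group directly:
# sequential position of each row within its stop, starting at 1
df['orderInStop'] = df.groupby(["Date_Time_Occur", "Location"]).cumcount() + 1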

Pandas comparing list across column values

I have a certain business requirement for which I am having trouble implementing a faster solution (the current solution takes 3 hours per iteration).
E.g., say I have a df
and there's a list:
l = [[a,b,c],[d,e,f]]
To do:
Compare all the list values across customer and check if they exist or not
If they exist then find the corresponding min and max date1
Currently the pseudo working code I have is :
for each customer:
    group by customer and collect the code column values into a list
    for each list value:
        check if that particular list value exists (e.g. check if [a,b,c] exists in the first loop)
        if exists:
            check for min date by group etc.
These nested for loops are taking too long to execute since I have 100k+ customers.
Any way to further improve this? I already eliminated one for loop, reducing the time from 10 hours to 3.
l = [['a','b','c'],['d','e','f']]
Firstly flatten your list:
from pandas.core.common import flatten
l = list(flatten(l))
Then do boolean masking to keep only the rows whose code appears in your list:
newdf = df[df['code'].isin(l)]
Finally do groupby():
# The code below groups by 'code':
newdf = newdf.groupby('code').agg(max_date1=('date1', 'max'), min_date1=('date1', 'min'))
# If you want to group by customerid and code instead, use:
newdf = newdf.groupby(['customerid', 'code']).agg(max_date1=('date1', 'max'), min_date1=('date1', 'min'))
Now if you print newdf you will get your desired output.
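If the actual requirement is that a customer must have all codes of a sublist (rather than any of them), here is a rough sketch of one way to check that per sublist; the column names customerid, code and date1 are taken from the code above:
l = [['a', 'b', 'c'], ['d', 'e', 'f']]  # the original nested list, before flattening
# one set of codes per customer, built once and reused for every sublist
codes_per_customer = df.groupby('customerid')['code'].agg(set)
for sublist in l:
    needed = set(sublist)
    # customers whose code set contains every code in the sublist
    has_all = codes_per_customer.apply(needed.issubset)
    matching = has_all[has_all].index
    # min/max date1 per matching customer, restricted to the sublist's codes
    subset = df[df['customerid'].isin(matching) & df['code'].isin(needed)]
    print(sublist)
    print(subset.groupby('customerid')['date1'].agg(['min', 'max']))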
I slightly modified my approach.
Instead of looping through each customer (I have 100k+ customers),
I looped through each list:
I checked if the customers were present or not, and then looped through the filtered customers.
This reduced the time by a couple of hours.
Thanks again for your help

Remove duplicated cell content using python?

I filtered the duplicates, got the duplicates onto the same row, and joined the items with a comma using the code below, but I don't really understand why the content in the Join_Dup column is replicated.
dd = sales_all[sales_all['Order ID'].duplicated(keep=False)]
dd['Join_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
print(dd.head())
dd = dd[['Order ID','Join_Dup']].drop_duplicates()
dd
Order ID Join_Dup
0 176558 USB-C Charging Cable,USB-C Charging Cable,USB-...
2 176559 Bose SoundSport Headphones,Bose SoundSport Hea...
3 176560 Google Phone,Wired Headphones,Google Phone,Wir...
5 176561 Wired Headphones,Wired Headphones,Wired Headph...
... ... ...
186846 259354 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186847 259355 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186848 259356 34in Ultrawide Monitor,34in Ultrawide Monitor,...
186849 259357 USB-C Charging Cable,USB-C Charging Cable,USB-...
[178437 rows x 2 columns]
I need to remove the duplicates from the cell in each row, can someone please help.
IIUC, let's try to prevent the duplicates in the groupby transform statement:
dd['Join_No_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(set(x)))
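Note that set() does not preserve the order of the products; if order matters, a dict-based variant deduplicates while keeping first-seen order (Python 3.7+):
dd['Join_No_Dup'] = dd.groupby('Order ID')['Product'].transform(
    lambda x: ','.join(dict.fromkeys(x))  # dict keys keep insertion order
)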
Edit: disregard the second part of the answer. I will delete that portion if it ends up not being useful.
So you commented that you want unique product strings for each Order ID. You can get that in a single step:
dd = (
    sales_all.groupby(['Order ID', 'Product'])['some_other_column']
    .size().rename('quantity').reset_index()
)
Now you have unique rows of OrderID/Product with the count of repeated products (or quantity, as in a regular invoice). You can work with that or you can groupby to form a list of products:
orders = dd.groupby('Order ID').Product.apply(list)
---apply vs transform---
Please note that if you use .transform as in your question you will invariably get a result with the same shape as the dataframe/series being grouped (i.e. grouping will be reversed and you will end up with the same number of rows, thus creating duplicates). The function .apply will pass the groups of your groupby to the same function, any function, but will not broadcast back to the original shape (it will return only one row per group).
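A tiny illustration of that shape difference, on toy data rather than the question's:
import pandas as pd
toy = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1, 2, 3]})
print(toy.groupby('g')['v'].transform('sum'))  # 3 values, one per original row: 3, 3, 3
print(toy.groupby('g')['v'].sum())             # 2 values, one per group: a -> 3, b -> 3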
Old Answer
So you are removing ALL Order IDs that appear in multiple rows (if ID 14 appears in two rows you discard both rows). This makes the groupby in the next line redundant, as every grouped ID will have just one line.
Ok, now that's out of the way. Then presumably each row in Product contains a list which you are joining with a lambda. This step would be a little faster with a pandas native function.
dd['Join_Dup'] = dd.Product.str.join(', ')
# perhaps choose a better name for the column, once you remove duplicates it will not mean much (does 'Join_Products' work?)
Now to handle duplicates. You didn't actually need to join in the last step if all you wanted was to remove dups. Pandas can handle lists as well. But the part you were missing is the subset attribute.
dd = dd[['Order ID', 'Join_Dup']].drop_duplicates(subset='Join_Dup')

How to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over that?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create separate dataframes for each unique value; however, I am unable to iterate over those dataframes and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique()) ### lets assume Unique values are 0,1,2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:,0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique dataframe.
If I manually type the name of the DF I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over that dataframe as I normally would by typing its name.
What I'm looking for is:
if I try the code below,
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the DF has more than two columns, where the key would be the unique group and the value contains the remaining columns of the same rows.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
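For what it's worth, iterating that dictionary works the same no matter how many columns each sub-dataframe has; a small sketch (column names other than 'Assignment Group' are taken from the sample data below):
for name, sub_df in df_dict.items():
    print(name, len(sub_df))          # length of each group's dataframe
    for _, row in sub_df.iterrows():  # row is a Series holding all columns
        print(row['Description'], row['Document'])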
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; however, a unique task has to be created for each description, but we can club 10 tasks together and submit them as one request. So if I divide the df into different dfs based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001;
within that request it will have multiple subtasks such as STASK001, STASK002 ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, like getting the mean value of price for each group if that were a column.
I think you want to try something like len(eval('df%s' % 0))
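Rather than creating variables dynamically with globals() or eval(), a rough sketch of iterating the groups directly and batching rows ten at a time (column names are from the sample data; submit_request is only a placeholder for the ServiceNow call):
import pandas as pd
df = pd.read_excel("input.xlsx")
for group_name, group_df in df.groupby("Assignment_group"):
    print(group_name, len(group_df))  # the length you would otherwise get from len(df0)
    # club the rows into batches of 10, one request per batch
    for start in range(0, len(group_df), 10):
        batch = group_df.iloc[start:start + 10]
        # submit_request(group_name, batch)  # placeholder for the actual ticket creation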

Faster way to append ordered frequencies of pandas series

I am trying to make a list of the number of elements in each group in a pandas series. In my dataframe I have a column called ID, and all values occur multiple times. I want to make a list containing the frequency of each element in the order in which they occur.
So an example of the column ID is [1,2,3,3,3,2,1,5,2,3,1,2,4,3]
This should produce [3,4,5,1,1], since group-ID 1 occurs 3 times, group-ID 2 occurs 4 times, etc. I have written code that does this correctly:
group_list = df.ID.unique().tolist()
group_size = []
for i in group_list:
    group_size.append(df.ID.value_counts()[i])
The problem is that it takes way too long to finish. I have 5 million rows, and I let it run for 50 minutes and it still didn't finish! I tried running it on the first 30-50 rows and it works as intended.
To me it would be logical to simply use value_counts(sort=False), but it doesn't give me the group-ID frequencies in the order they occur in my series. I also tried implementing extend because I read it should be faster, but I get a "'numpy.int64' object is not iterable" error.
Given a Series
ser = pd.Series([1,2,3,3,3,2,1,5,2,3,1,2,4,3])
You can do the following:
ser.value_counts().reindex(ser.unique()).tolist()
Out: [3, 4, 5, 1, 1]
reindex will reorder the value_counts items based on the order in which they first appear in the series.
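An alternative that also keeps first-appearance order is collections.Counter, which relies on dict insertion order (Python 3.7+):
from collections import Counter
group_size = list(Counter(df.ID).values())  # counts in order of first appearance
The original loop is slow mainly because df.ID.value_counts() recomputes the counts over all 5 million rows for every unique ID; computing them once, as in either approach, avoids that.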
