Remove duplicated cell content using Python?

I filtered for duplicate Order IDs and joined the Product items for each order with a comma using the code below, but I don't understand why the Join_Dup column still contains repeated products.
dd = sales_all[sales_all['Order ID'].duplicated(keep=False)]
dd['Join_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(x))
print(dd.head())
dd = dd[['Order ID','Join_Dup']].drop_duplicates()
dd
Order ID Join_Dup
0 176558 USB-C Charging Cable,USB-C Charging Cable,USB-...
2 176559 Bose SoundSport Headphones,Bose SoundSport Hea...
3 176560 Google Phone,Wired Headphones,Google Phone,Wir...
5 176561 Wired Headphones,Wired Headphones,Wired Headph...
... ... ...
186846 259354 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186847 259355 iPhone,iPhone,iPhone,iPhone,iPhone,iPhone
186848 259356 34in Ultrawide Monitor,34in Ultrawide Monitor,...
186849 259357 USB-C Charging Cable,USB-C Charging Cable,USB-...
[178437 rows x 2 columns]
I need to remove the duplicates from the cell in each row. Can someone please help?

IIUC, let's try to prevent the duplicates in the groupby transform statement:
dd['Join_No_Dup'] = dd.groupby('Order ID')['Product'].transform(lambda x: ','.join(set(x)))
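One caveat with set() is that it does not preserve the original product order. If order matters, dict.fromkeys de-duplicates while keeping first-seen order; a minimal sketch with made-up sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Order ID': [176558, 176558, 176560, 176560],
    'Product': ['USB-C Charging Cable', 'USB-C Charging Cable',
                'Google Phone', 'Wired Headphones'],
})

# dict.fromkeys de-duplicates while preserving first-seen order,
# unlike set(), which may reorder the products
df['Join_No_Dup'] = df.groupby('Order ID')['Product'].transform(
    lambda x: ','.join(dict.fromkeys(x))
)
print(df)
```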

Edit: disregard the second part of the answer ("Old Answer" below). I will delete that portion if it ends up not being useful.
So you commented that you want unique product strings for each Order ID. You can get that in a single step:
dd = (
    sales_all.groupby(['Order ID', 'Product'])
    .size().rename('quantity').reset_index()
)
Now you have unique rows of OrderID/Product with the count of repeated products (or quantity, as in a regular invoice). You can work with that or you can groupby to form a list of products:
orders = dd.groupby('Order ID').Product.apply(list)
---apply vs transform---
Please note that if you use .transform as in your question you will invariably get a result with the same shape as the dataframe/series being grouped (i.e. grouping will be reversed and you will end up with the same number of rows, thus creating duplicates). The function .apply will pass the groups of your groupby to the same function, any function, but will not broadcast back to the original shape (it will return only one row per group).
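The shape difference can be seen on a toy two-order frame (invented data):

```python
import pandas as pd

df = pd.DataFrame({'Order ID': [1, 1, 2], 'Product': ['a', 'b', 'c']})

# transform broadcasts the per-group result back to the original shape:
# one value per input row, so grouped values are repeated
t = df.groupby('Order ID')['Product'].transform(','.join)
print(len(t))  # 3 rows, same as df

# apply returns one row per group
a = df.groupby('Order ID')['Product'].apply(','.join)
print(len(a))  # 2 rows, one per group
```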
Old Answer
So you are removing ALL Order IDs that appear in multiple rows (if ID 14 appears in two rows you discard both rows). This makes the groupby in the next line redundant, as every grouped ID will have just one line.
Ok, now that's out of the way. Then presumably each row in Product contains a list which you are joining with a lambda. This step would be a little faster with a pandas native function.
dd['Join_Dup'] = dd.Product.str.join(', ')
# perhaps choose a better name for the column, once you remove duplicates it will not mean much (does 'Join_Products' work?)
Now to handle duplicates. You didn't actually need to join in the last step if all you wanted was to remove dups. Pandas can handle lists as well. But the part you were missing is the subset attribute.
dd = dd[['Order ID', 'Join_Dup']].drop_duplicates(subset='Join_Dup')

Related

How to iterate through each row of a groupby object created by groupby()?

I'm working with a large dataset that includes all police stops in my city since 2014. The dataset has millions of rows, but sometimes there are multiple rows for a single stop (so, if the police stopped a group of 4 people, it is included in the database as 4 separate rows even though it's all the same stop). I'm looking to create a new column in the dataset orderInStop, which is a count of how many people were stopped in sequential order. The first person caught up in the stop would have a value of 1, the second person a value of 2, and so on.
To do so, I have used the groupby() function to group all rows that match on time & location, which is the indication that the rows are all part of the same stop. I can manage to create a new column that includes the TOTAL count of the number of people in the stop (so, if there were 4 rows with the same time & location, all four rows have a value of 4 for the new orderInStop variable). But I need the first row in the group to have a value of 1, the second a value of 2, the third 3, and the fourth 4.
Below is my code attempt at iterating through each group I've created to sequentially count each row within each group, but the code doesn't quite work (it populates the entire column rather than each row within the groups). Any help to tweak this code would be much appreciated!
Note: I also tried using logical operators in a for loop, to essentially ask IF the time & location column values match for the current and previous rows, but ran into too many problems with 'the truth values of a Series is ambiguous' errors, so instead I'm trying to use groupby().
Attempt that creates a total count rather than sequential count:
df['order2'] = df.groupby(by=["Date_Time_Occur", "Location"])['orderInStop'].transform('count')
Attempt that fails, to iterate through each row in each group:
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
    count = 1
    for row in groups:
        df['order3'] = count
        count = count + 1
In your example for row in groups iterates over the column names, since groups is a DataFrame.
To iterate over each row you could do
df['order3'] = 1
grp = df.groupby(by=["Date_Time_Occur", "Location"])
for name, groups in grp:
    count = 1
    for i, row in groups.iterrows():  # i will be the index label, row a pandas Series
        df.loc[i, 'order3'] = count   # assign to the single row i, not the whole column
        count = count + 1
Note that your solution relies on pandas groupby to preserve row order. This should be the case, see this question, but there is very likely a shorter & safer solution (see fsimonjetz comment for a starting point).
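One such shorter approach is groupby().cumcount(), which numbers the rows within each group directly, with no explicit loop; a minimal sketch with invented sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Date_Time_Occur': ['2021-01-01 10:00'] * 3 + ['2021-01-02 11:00'],
    'Location': ['Main St'] * 3 + ['Oak Ave'],
})

# cumcount numbers rows within each group starting at 0,
# so adding 1 gives the 1-based order of each person in the stop
df['orderInStop'] = df.groupby(['Date_Time_Occur', 'Location']).cumcount() + 1
print(df['orderInStop'].tolist())  # [1, 2, 3, 1]
```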

Is there a better way to group by a category, and then select values based on different column values in Pandas?

I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data that looks something like this
time value date
0 12.850000 19.195359 08-22-2019
1 9.733333 13.519543 09-19-2019
2 14.083333 9.191413 08-26-2019
3 16.616667 18.346598 08-19-2019
...
Where every date can occur multiple times, recording values at different points during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min' , 'min'), ('Max', 'max')])
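Another common idiom for the "group by A, filter on B, grab C" pattern is idxmin, which returns the index label of the row with the smallest B per group and avoids both the lambda and the nested index; a sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'time': [12.85, 9.73, 8.50, 14.08],
    'value': [19.20, 13.52, 11.11, 9.19],
    'date': ['08-22-2019', '09-19-2019', '08-22-2019', '08-26-2019'],
})

# idxmin gives the index label of the smallest 'time' in each date group;
# .loc then pulls the matching 'value' for that row
day_open = df.loc[df.groupby('date')['time'].idxmin(), ['date', 'value']]
print(day_open)
```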

how to divide pandas dataframe into different dataframes based on unique values from one column and itterate over that?

I have a dataframe with three columns
The first column has 3 unique values. I used the code below to create a separate dataframe for each value; however, I am unable to iterate over those dataframes and I am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # lets assume unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
for example lets say I want to find out the length of the first unique dataframe
if I manually type the name of the DF I get the correct output
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over that dataframe as I normally would by typing its name.
What I'm looking for is
if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the DF has more than two columns, where the key would be the unique group and the value contains the remaining columns of the same rows.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Lets assume there are 100 rows of data
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; however, for each description a unique task has to be created. We can club 10 tasks together and submit them as one request, so if I divide the df into different dfs based on Assignment_group it would be easier to iterate over (that's the only idea I could think of).
For example lets say we have REQUEST001
within that request it will have multiple sub tasks such as STASK001,STASK002 ... STASK010.
hope this helps
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kind of operations (sum, count, std, etc) on your remaining columns, like getting the mean value of price for each group if that was a column.
I think you want to try something like len(eval('df%s' % 0))
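As an alternative to eval/globals(), note that a groupby object is itself iterable as (name, sub-DataFrame) pairs, which covers both the iteration and the dictionary use case in one step; a small sketch with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    'Assignment Group': ['Group A', 'Group B', 'Group A', 'Group C'],
    'Description': ['ticket 1', 'ticket 2', 'ticket 3', 'ticket 4'],
})

# iterating the groupby yields (group name, sub-DataFrame) pairs,
# so no globals()/eval tricks are needed
for name, sub_df in df.groupby('Assignment Group'):
    print(name, len(sub_df))

# the same pairs can be collected into a dict keyed by group name
df_dict = dict(tuple(df.groupby('Assignment Group')))
print(len(df_dict['Group A']))  # 2
```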

How to exclude more than one group in a groupby using python?

I have grouped the number of customers by region and year joined using groupby in Python. However I want to remove several regions from the region group.
I know in order to exclude one group from a groupby you can use the following code:
grouped = df.groupby(['Region'])
df1 = df.drop(grouped.get_group(('Southwest')).index)
Therefore I initially tried the following:
grouped = df.groupby(['Region'])
df1 = df.drop(grouped.get_group(('Southwest','Northwest')).index)
However, that gave me an error naming ('Southwest', 'Northwest'), since no single group has that name.
Now I am wondering if there is a way to drop several groups at once instead of me having to type out the above code repeatedly for each region I want to remove.
I expect the output of the final query to be similar to the image shown below however information regarding the Northwest and Southwest regions should be removed.
It's not df1 = df.drop(grouped.get_group(('Southwest','Northwest')).index). grouped.get_group takes a single group name as argument. If you want to drop more than one group, combine the two indexes first, since drop accepts an Index of labels: df1 = df.drop(grouped.get_group('Southwest').index.union(grouped.get_group('Northwest').index))
As a side note, ('Southwest') evaluates to 'Southwest' (i.e. it's not a tuple). If you want to make a tuple of size 1, it's ('Southwest', )
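A simpler boolean-mask alternative skips the groupby entirely: Series.isin builds the exclusion mask for any number of regions in one step. A sketch with made-up region data:

```python
import pandas as pd

df = pd.DataFrame({
    'Region': ['Southwest', 'Northwest', 'Midwest', 'Southeast'],
    'Customers': [10, 20, 30, 40],
})

# ~isin keeps every row whose Region is NOT in the exclusion list
df1 = df[~df['Region'].isin(['Southwest', 'Northwest'])]
print(df1['Region'].tolist())  # ['Midwest', 'Southeast']
```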

Return subset/slice of Pandas dataframe based on matching column of other dataframe, for each element in column?

So I think this is a relatively simple question:
I have a Pandas data frame (A) that has a key column (which is not unique/will have repeats of the key)
I have another Pandas data frame (B) that has a key column, which may have many matching entries/repeats.
So what I'd like is a bunch of data frames (a list, or a bunch of slice parameters, etc.), one for each key in A (regardless of whether it's unique or not)
In [bad] pseudocode:
for each key in A:
    resultDF[] = Rows in B where B.key = key
I can easily do this iteratively with loops, but I've read that you're supposed to slice/merge/join data frames holistically, so I'm trying to see if I can find a better way of doing this.
A join will give me all the stuff that matches, but that's not exactly what I'm looking for, since I need a resulting dataframe for each key (i.e. for every row) in A.
Thanks!
EDIT:
I was trying to be brief, but here are some more details:
Eventually, what I need to do is generate some simple statistical metrics for elements in the columns of each row.
In other words, I have a DF, call it A, and it has a r rows, with c columns, one of which is a key. There may be repeats on the key.
I want to "match" that key with another [set of?] dataframe, returning however many rows match the key. Then, for that set of rows, I want to, say, determine the min and max of certain element (and std. dev, variance, etc.) and then determine if the corresponding element in A falls within that range.
You're absolutely right that it's possible that if row 1 and row 3 of DF A have the same key -- but potentially DIFFERENT elements -- they'd be checked against the same result set (the ranges of which obviously won't change). That's fine. These won't likely ever be big enough to make that an issue (but if there's the better way of doing it, that's great).
The point is that I need to be able to do the "in range" and stat summary computation for EACH key in A.
Again, I can easily do all of this iteratively. But this seems like the sort of thing pandas could do well, and I'm just getting into using it.
Thanks again!
FURTHER EDIT
The DF looks like this:
df = pd.DataFrame([[1,2,3,4,1,2,3,4], [28,15,13,11,12,23,21,15],['keyA','keyB','keyC','keyD', 'keyA','keyB','keyC','keyD']]).T
df.columns = ['SEQ','VAL','KEY']
SEQ VAL KEY
0 1 28 keyA
1 2 15 keyB
2 3 13 keyC
3 4 11 keyD
4 1 12 keyA
5 2 23 keyB
6 3 21 keyC
7 4 15 keyD
Both DF's A and B are of this format.
I can iterative get the resultant sets by:
loop_iter = len(A) // max(A['SEQ_NUM'])  # integer division so range() gets an int
for start in range(0, loop_iter):
    matchA = A.iloc[start::loop_iter, :]['KEY']
That's simple. But I guess I'm wondering if I can do this "inline". Also, if for some reason the numeric ordering breaks (i.e. the SEQ values get out of order) then this won't work. There seems to be no reason NOT to split explicitly on the keys, right? So perhaps I have TWO questions: 1) How to split on keys, iteratively (i.e. accessing a DF one row at a time), and 2) How to match a DF and do summary statistics, etc., on a DF that matches on the key.
So, once again:
1). Iterate through DF A, going one at a time, and grabbing a key.
2). Match the key to the SET (matchB) of keys in B that match
3). Do some stats on "values" of matchB, check to see if val.A is in range, etc.
4). Profit!
Ok, from what I understand, the problem at its most simple is that you have a pd.Series of values (i.e. a["key"], which let's just call keys), which correspond to the rows of a pd.DataFrame (the df called b), such that set(b["key"]).issuperset(set(keys)). You then want to apply some function to each group of rows in b where the b["key"] is one of the values in keys.
I'm purposefully disregarding the other df -- a -- that you mention in your prompt, because it doesn't seem to bear any significance to the problem, other than being the source of keys.
Anyway, this is a fairly standard sort of operation -- it's a groupby-apply.
def descriptive_func(df):
    """
    Takes a df where key is always equal and returns some summary.

    :type df: pd.DataFrame
    :rtype: pd.Series | pd.DataFrame
    """
    pass
# filter down to those rows we're interested in
valid_rows = b[b["key"].isin(set(keys))]
# this groups by the value and applies the descriptive func to each sub df in turn
summary = valid_rows.groupby("key").apply(descriptive_func)
There are a few built-in methods on the groupby object that are useful. For example, check out valid_rows.groupby("key").sum() or valid_rows.groupby("key").describe(). Under the covers, these are really similar uses of apply. The shape of the returned summary is determined by the applied function. The unique grouped-by values -- those of b["key"] -- always constitute the index, but if the applied function returns a scalar, summary is a Series; if the applied function returns a Series, summary is a DataFrame with the returned Series as rows; and if the applied function returns a DataFrame, the result is a multi-index DataFrame. This is a core pattern in pandas, and there's a whole, whole lot to explore here.
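Putting the pieces together with the sample frame from the question (and an assumed keys series containing 'keyA' and 'keyC'), using the built-in agg in place of a custom descriptive_func:

```python
import pandas as pd

df = pd.DataFrame({
    'SEQ': [1, 2, 3, 4, 1, 2, 3, 4],
    'VAL': [28, 15, 13, 11, 12, 23, 21, 15],
    'KEY': ['keyA', 'keyB', 'keyC', 'keyD'] * 2,
})

# stand-in for a["key"]: the keys whose matching rows we want to summarize
keys = pd.Series(['keyA', 'keyC'])

# keep only the rows whose KEY appears in `keys`, then summarize per group
valid_rows = df[df['KEY'].isin(set(keys))]
summary = valid_rows.groupby('KEY')['VAL'].agg(['min', 'max', 'mean'])
print(summary)
```

With the min/max per key in hand, checking whether a value from the other frame is "in range" is a simple comparison against summary.loc[key, 'min'] and summary.loc[key, 'max'].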
