python pandas: group by condition on rows

I have a large pandas dataframe from which I'm trying to form pairs for some rows.
My df looks as follows:
object_id  increment  location  event
        0          1         d      A
        0          2         d      B
        0          3         z      C
        0          4         g      A
        0          5         g      B
        0          6         i      C
        1          1         k      A
        1          2         k      B
      ...        ...       ...    ...
The object_id identifies a specific object.
increment is a counter that goes up every time something happens (to keep track of the order), location is where that thing happens, and the last column is the type of event.
Now, I want to group these rows because sometimes (but not always) when A happens at a location, B happens right after it; C is a completely different event and can be ignored. I only want to group an A with a B when the location is the same, the object_id is the same, and the events appear right after each other (so the increment differs by exactly 1).
The problem is that the events and increment numbers sometimes restart from zero for the same object, so I only want to group rows that sit directly after each other in the dataframe (groups should contain two entries at most). I'm having a really hard time pulling this off, since groupby offers no obvious way to compare one row with the next.
Any tips what direction I should try?
edit:
The output I'm looking for is forming groups of the form:
group_id  object_id  increment  location  event
       0          0          1         d      A
       0          0          2         d      B
       1          0          3         z      C
       2          0          4         g      A
       2          0          5         g      B
       3          0          6         i      C
       4          1          1         k      A
       4          1          2         k      B
     ...        ...        ...       ...    ...
So a group of two is only formed when the "first" entry of the pair has event A with some increment value x, and the "second" entry has event B with increment value x+1, and is therefore part of the same sequence. Hope this clarifies my question a bit!

Your question is not entirely clear, so you may need to adjust the conditions in the if statement, but this might help you.
The dataframe setup:
import pandas as pd

d = {'object_id': [0, 0, 0, 0], 'increment': [1, 2, 3, 4],
     'location': ['d', 'd', 'z', 'g'], 'event': ['A', 'B', 'C', 'A']}
df = pd.DataFrame(data=d)
Let's make a list to save the indices where the location is the same. You should adapt the conditions to whatever works for your case, since that was not entirely clear from your question. From there you can run the following function:
lst = []

def functionGrouping(dataset):
    # compare each row with the next one
    for i in range(len(dataset) - 1):
        if dataset['event'].iloc[i + 1] == 'C':
            # C is an unrelated event, skip it
            continue
        if (dataset['location'].iloc[i + 1] == dataset['location'].iloc[i]
                and dataset['object_id'].iloc[i + 1] == dataset['object_id'].iloc[i]):
            # sum the increments into the second row of the pair
            dataset.iloc[i + 1, dataset.columns.get_loc('increment')] += dataset['increment'].iloc[i]
            lst.append(i)

functionGrouping(df)
And from there, drop the rows that were summed into their successor (dropping by position inside a loop would shift the index and remove the wrong rows):
df = df.drop(df.index[lst])
I hope this helps a bit; however, your question was not very clear. For future questions, please simplify them and include an example of the desired output.
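For the group_id output shown in the question there is also a vectorized alternative: compare every row with a shifted copy of the frame and start a new group whenever a row is not the second half of an A/B pair. This is only a sketch, starting from the original dataframe in the question (before the merging above) and using the column names shown there:
# flag rows that are the 'B' half of a valid A/B pair
prev = df.shift()
paired = (
    (df['event'] == 'B')
    & (prev['event'] == 'A')
    & (df['object_id'] == prev['object_id'])
    & (df['location'] == prev['location'])
    & (df['increment'] == prev['increment'] + 1)
)
# every row that does not continue a pair starts a new group,
# so a cumulative sum over that flag yields consecutive group ids
df['group_id'] = (~paired).cumsum() - 1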

Related

How to sort a dataframe's column of values so that no identical elements are contiguous/adjacent [duplicate]

If I have A, A, B, B, B, C, E and want to sort the values so that no identical elements are contiguous, then the sort may result in: A, B, A, B, C, B, E. There may be other valid orderings for this same set of data.
I would like to sort the item column of the dataframe below in this way, and I want to know the technical/formal name for this sorting method. Googling did not help much.
The closest approach I can think of is a conditional count of the item values over a cumulative window and then sorting by that conditional-count column, but coding that appears very challenging. Any insights will be appreciated.
df = pd.DataFrame({
    'item': ['A', 'A', 'A', 'B', 'B', 'C', 'E'],
    'person': [1, 1, 2, 2, 2, 2, 1]
})
df.sort_values('item', kind='heapsort')  # <<< of course, heapsort does not solve the purpose
The closest approach I can think of is a conditional count of the item values over a cumulative window and then sorting by that conditional-count column, but coding that appears very challenging.
You can use something like:
>>> (df.assign(count=df.groupby('item').cumcount())
...    .sort_values(['count', 'item'])
...    .drop(columns='count'))
  item  person
0    A       1
3    B       2
5    C       2
6    E       1
1    A       1
4    B       2
2    A       2
Update
I found another answer from @mozway which is more performant and elegant:
>>> df.sort_values(by='item', kind='stable', key=lambda s: df.groupby(s).cumcount())
  item  person
0    A       1
3    B       2
5    C       2
6    E       1
1    A       1
4    B       2
2    A       2
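Note that a cumcount-based sort cannot guarantee non-adjacent duplicates for every input (if one value dominates the data, no valid arrangement may exist at all), so a quick shift-based check of the result can be worthwhile; a minimal sketch:
result = df.sort_values(by='item', kind='stable',
                        key=lambda s: df.groupby(s).cumcount())
# True when no two neighbouring rows share the same item
print((result['item'] != result['item'].shift()).all())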

How to iterate over pairs: a group and its next group?

I have a grouped dataframe:
df = pd.DataFrame({'a': [0, 0, 1, 1, 2], 'b': range(5)})
g = df.groupby('a')
for key, gr in g:
    print(gr, '\n')
   a  b
0  0  0
1  0  1

   a  b
2  1  2
3  1  3

   a  b
4  2  4
I want to do a computation that needs each group and its next one (except the last group, of course).
So with this example I want to get two pairs:
# First pair:
   a  b
0  0  0
1  0  1

   a  b
2  1  2
3  1  3

# Second pair:
   a  b
2  1  2
3  1  3

   a  b
4  2  4
My attempt
If the groups were in a list instead, this would be easy:
for x, x_next in zip(lst, lst[1:]):
    ...
But unfortunately, selecting a slice doesn't work with a pd.DataFrameGroupBy object:
g[1:] # TypeError: unhashable type: 'slice'. (It thinks I want to access the column by its name.)
g.iloc[1:] # AttributeError: 'DataFrameGroupBy' object has no attribute 'iloc'
This question
is related but it doesn't answer my question.
I am posting an answer myself, but maybe there are better or more efficient solutions (maybe pandas-native?).
You can convert a pd.DataFrameGroupBy to a list that contains all groups (as tuples of the grouping value and the group),
and then iterate over this list:
lst = list(g)
for current, next_one in zip(lst, lst[1:]):
    ...
Alternatively, create an iterator, and skip its first value:
it = iter(g)
next(it)
for current, next_one in zip(g, it):
    ...
A more complicated way:
g.groups returns a dictionary whose keys are the unique values of your grouping column and whose values are the row labels of each group.
You could iterate over that dictionary (pulling each group out with g.get_group), but I think it would be unnecessarily complicated.
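On Python 3.10+ you can also let itertools.pairwise do the zipping, without materialising the whole list first; a minimal sketch:
from itertools import pairwise  # Python 3.10+

for (key1, grp1), (key2, grp2) in pairwise(df.groupby('a')):
    # grp1 and grp2 are consecutive groups (DataFrames); key1/key2 are their keys
    ...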

What is the optimal method for multiple if/then's to evaluate a column's value in a dataframe and conditionally modify another column value?

I have a dataframe (1580 rows x 48 columns) where each column contains answers to questions, but not every row contains an answer to every question (leaving it NaN). Groups of questions are related, and I'd like to tabulate the answers to the group of questions into new columns (c_answers and i_answers). I have generated lists of the correct answers for each group of questions. Here is an example of the data:
ex_df = pd.DataFrame([["a", "b", "d"],[np.nan, "a", "b"], ["c", "e", np.nan]], columns=["q1", "q2", "q3"])
correct_answers = ["a", "b", "c"]
ex_df
which generates the following dataframe:
    q1 q2   q3
0    a  b    d
1  NaN  a    b
2    c  e  NaN
What I would like to do, ideally, is create a function that scores each column: for each correct answer in a row (one that appears in the correct_answers list) it would increment a c_answers column by 1; for each answer that is not in correct_answers it would increment an i_answers column by 1 instead; and if the provided answer is NaN it would do neither (not counted as correct or incorrect). This function could then be applied to each group of questions, calculating the number of correct and incorrect answers per row for that group.
What I have been able to make a bit of progress with instead is something like this:
ex_df['q1score'] = np.where(ex_df['q1'].isna(), np.nan,
                            np.where(ex_df['q1'].isin(correct_answers), 1, 100))
which updates the dataframe like so:
    q1 q2   q3  q1score
0    a  b    d      1.0
1  NaN  a    b      NaN
2    c  e  NaN      1.0
I could then re-use this code to score q2 and q3 into their own new columns, sum those up into another column, and from that sum generate two more columns holding the number of correct and incorrect answers. Finally, I could go back and drop the four intermediate columns and keep only the two I wanted in the first place.
Looking around and trying different methods for the last two hours, I'm finding a lot of answers that deal with one or another of the issues I'm facing, but nothing I could finagle into actually working for my situation. Maybe the solution I've kludged together is the best one, but I'm still relatively new to programming (<18 months) and it doesn't seem like the most efficient or most Pythonic way to solve this problem. Hoping someone out there has a better answer. Thank you!
Edit with more information regarding output: I'd like the final output to look something like this:
    q1 q2   q3  c_answers  i_answers
0    a  b    d          2          1
1  NaN  a    b          2          0
2    c  e  NaN          1          1
Like I said, I can kind of finagle that using the nested np.where() to create numeric columns that I can then sum up and reverse-engineer to get a raw count from. While this is a solution, it's cumbersome and probably not the optimal one, especially with the amount of repetition involved (I'll have to do this process for 9 different groups of columns, each being a cluster of questions).
Use sum to count True values for correct and incorrect answers per row:
m1 = ex_df.isin(correct_answers)
m2 = ex_df.notna() & ~m1
df = ex_df.assign(c_answers=m1.sum(axis=1), i_answers=m2.sum(axis=1))
print (df)
    q1 q2   q3  c_answers  i_answers
0    a  b    d          2          1
1  NaN  a    b          2          0
2    c  e  NaN          1          1
Possible solution for multiple groups:
groups = {'g1':['q1','q2'], 'g2':['q2','q3'], 'g3':['q1','q2','q3']}
for k, v in groups.items():
    m1 = ex_df[v].isin(correct_answers)
    m2 = ex_df[v].notna() & ~m1
    ex_df = ex_df.assign(**{f'c_answers_{k}': m1.sum(axis=1),
                            f'i_answers_{k}': m2.sum(axis=1)})
print (ex_df)
    q1 q2   q3  c_answers_g1  i_answers_g1  c_answers_g2  i_answers_g2  \
0    a  b    d             2             0             1             1
1  NaN  a    b             1             0             2             0
2    c  e  NaN             1             1             0             1

   c_answers_g3  i_answers_g3
0             2             1
1             2             0
2             1             1
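Since the question asks for a reusable function per group of questions, the same masks can also be wrapped up; score_group is a hypothetical helper name and this is only a sketch of that idea:
def score_group(frame, cols, correct):
    # count correct answers and answered-but-incorrect answers per row
    is_correct = frame[cols].isin(correct)
    answered = frame[cols].notna()
    return pd.DataFrame({'c_answers': is_correct.sum(axis=1),
                         'i_answers': (answered & ~is_correct).sum(axis=1)},
                        index=frame.index)

# usage: score the q1-q3 group and join the counts back onto the original frame
ex_df = ex_df.join(score_group(ex_df, ['q1', 'q2', 'q3'], correct_answers))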

Pandas: count number of duplicate rows using groupby

I have a dataframe with duplicate rows
>>> d = pd.DataFrame({'n': ['a', 'a', 'a'], 'v': [1,2,1]})
>>> d
   n  v
0  a  1
1  a  2
2  a  1
I would like to understand how to use the .groupby() method specifically, so that I can add a new column to the dataframe which shows the count of rows identical to the current one.
>>> dd = d.groupby(by=['n','v'], as_index=False) # Use all columns to find groups of identical rows
>>> for k, v in dd:
...     print(k, "\n", v, "\n")  # check what we found
...
('a', 1)
    n  v
0  a  1
2  a  1

('a', 2)
    n  v
1  a  2
When I try dd.count() on the resulting DataFrameGroupBy object, I get IndexError: list index out of range. This seems to happen because all columns are used in the grouping operation and there is no other column left to count. Similarly, dd.agg({'n', 'count'}) fails with ValueError: no results.
I could use .apply() to achieve something that looks like the desired result.
>>> dd.apply(lambda x: x.assign(freq=len(x)))
       n  v  freq
0 0    a  1     2
  2    a  1     2
1 1    a  2     1
However this has two issues: 1) something happens to the index so that it is hard to map this back to the original index, 2) this does not seem idiomatic Pandas and manuals discourage using .apply() as it could be slow.
Is there a more idiomatic way to count duplicate rows when using .groupby()?
One solution is to use GroupBy.size for an aggregated output with the counts:
d = d.groupby(by=['n','v']).size().reset_index(name='c')
print (d)
   n  v  c
0  a  1  2
1  a  2  1
Your solution works if you specify some column name after the groupby, because there are no columns other than n and v in the input DataFrame:
d = d.groupby(by=['n','v'])['n'].count().reset_index(name='c')
print (d)
   n  v  c
0  a  1  2
1  a  2  1
If you instead need a new column in the original dataframe, use GroupBy.transform - the new column is filled with the aggregated values:
d['c'] = d.groupby(by=['n','v'])['n'].transform('size')
print (d)
   n  v  c
0  a  1  2
1  a  2  1
2  a  1  2
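If you only need a boolean flag rather than the count itself, DataFrame.duplicated with keep=False marks every row that occurs more than once; a small sketch on the same data:
# True for every row whose (n, v) combination appears more than once
d['is_dup'] = d.duplicated(subset=['n', 'v'], keep=False)
print(d)
   n  v  c  is_dup
0  a  1  2    True
1  a  2  1   False
2  a  1  2    True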

Fastest way to compare row and previous row in pandas dataframe with millions of rows

I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.
As an example, this is a simplified version of my problem:
   User  Time                 Col1  newcol1  newcol2  newcol3  newcol4
0     1     6     [cat, dog, goat]        0        0        0        0
1     1     6         [cat, sheep]        0        0        0        0
2     1    12        [sheep, goat]        0        0        0        0
3     2     3          [cat, lion]        0        0        0        0
4     2     5  [fish, goat, lemur]        0        0        0        0
5     3     9           [cat, dog]        0        0        0        0
6     4     4          [dog, goat]        0        0        0        0
7     4    11                [cat]        0        0        0        0
At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.
Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this but it's pretty similar to what I'm actually doing in ipython notebook):
def myJFunc(df):
    # initialize jnum counter
    jnum = 0
    # loop through each row of the dataframe (not including the first/zeroth)
    for i in range(1, len(df)):
        # is it the same user as in the previous row?
        if df.User.loc[i] == df.User.loc[i-1]:
            # has time increased by more than 1 (hour)?
            if abs(df.Time.loc[i] - df.Time.loc[i-1]) > 1:
                # update new columns
                df['newcol2'].loc[i-1] = 1
                df['newcol1'].loc[i] = 1
                # increase jnum
                jnum += 1
            # has content changed?
            if df.Col1.loc[i][0] != df.Col1.loc[i-1][0]:
                # record this change
                df['newcol4'].loc[i-1] = [df.Col1.loc[i-1][0], df.Col2.loc[i][0]]
        # different user?
        elif df.User.loc[i] != df.User.loc[i-1]:
            # update new columns
            df['newcol1'].loc[i] = 1
            df['newcol2'].loc[i-1] = 1
            # store jnum elsewhere (code not included here) and reset jnum
            jnum = 1
I now need to apply this function to several million rows and it's impossibly slow so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython? Any other speed tricks would be greatly appreciated!
(As regards using .loc, I compared .loc, .iloc and .ix and this one was marginally faster so that's the only reason I'm using that currently)
(Also, my User column in reality is unicode not int, which could be problematic for speedy comparisons)
I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer. Adding groupby is just going to have the effect of putting a NaN in the first row whenever you do a diff or shift. (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)
df['time_diff'] = df.groupby('User')['Time'].diff()
df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()
   User  Time                 Col1  time_diff Col1_0 Col1_0_prev
0     1     6     [cat, dog, goat]        NaN    cat         NaN
1     1     6         [cat, sheep]          0    cat         cat
2     1    12        [sheep, goat]          6  sheep         cat
3     2     3          [cat, lion]        NaN    cat         NaN
4     2     5  [fish, goat, lemur]          2   fish         cat
5     3     9           [cat, dog]        NaN    cat         NaN
6     4     4          [dog, goat]        NaN    dog         NaN
7     4    11                [cat]          7    cat         dog
As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this you only have to do an expensive extraction once and after that can stick to standard pandas methods.
Use pandas constructs and vectorize your code, i.e. don't use for loops; use pandas/numpy functions instead.
'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.
Calculate these separately:
df['newcol1'] = df['User'].shift() == df['User']
df.loc[0, 'newcol1'] = True  # possibly tweak the first row??
df['newcol2'] = (df['Time'].shift() - df['Time']).abs() > 1
The purpose of Col1 is unclear to me, but general Python objects in columns don't scale well (you can't use the fast paths, and the contents are scattered in memory). Most of the time you can get away with using something else...
Cython is the very last option, and not needed in 99% of use-cases, but see enhancing performance section of the docs for tips.
In your problem, it seems like you want to iterate through row pairwise. The first thing you could do is something like this:
from itertools import tee

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

for (idx1, row1), (idx2, row2) in pairwise(df.iterrows()):
    ...  # your stuff here
However, you cannot modify row1 and row2 directly; you will still need to use .loc or .iloc with their indexes.
If iterrows is still too slow, I suggest doing something like this:
Create a user_id column from your unicode names using pd.unique(df['User']) and a dictionary mapping each name to an integer id.
Create a delta dataframe: subtract the original dataframe from a shifted copy of the user_id and Time columns, e.g. df[['user_id', 'Time']].shift() - df[['user_id', 'Time']].
Where the user_id delta is non-zero, the user changed between two consecutive rows. The time column can be filtered directly with delta[delta['Time'] > 1].
With this delta dataframe you record the changes row-wise; you can use it as a mask to update the columns you need in your original dataframe, as in the sketch below.
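A rough sketch of those steps with the column names from the question (the sign of the delta is flipped relative to the description above, which does not matter for the comparisons used, and the exact conditions will likely need tweaking):
# integer ids compare much faster than unicode strings
df['user_id'] = pd.factorize(df['User'])[0]

# difference between each row and the previous one (NaN in the first row)
delta = df[['user_id', 'Time']] - df[['user_id', 'Time']].shift()

user_changed = delta['user_id'] != 0   # NaN != 0 is True, so row 0 counts as a boundary
time_gap = delta['Time'].abs() > 1     # same user but more than 1 hour apart

# newcol1 flags the row that starts a new segment,
# newcol2 flags the row that ends the previous one
start = user_changed | time_gap
df['newcol1'] = start.astype(int)
df['newcol2'] = start.shift(-1, fill_value=False).astype(int)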
