I have a grouped dataframe:
df = pd.DataFrame({'a': [0, 0, 1, 1, 2], 'b': range(5)})
g = df.groupby('a')
for key, gr in g:
    print(gr, '\n')
a b
0 0 0
1 0 1
a b
2 1 2
3 1 3
a b
4 2 4
I want to do a computation that needs each group and its next one (except the last group, of course).
So with this example I want to get two pairs:
# First pair:
a b
0 0 0
1 0 1
a b
2 1 2
3 1 3
# Second pair:
a b
2 1 2
3 1 3
a b
4 2 4
My attempt
If the groups were in a list instead, this would be easy:
for x, x_next in zip(lst, lst[1:]):
    ...
But unfortunately, selecting a slice doesn't work with a pd.DataFrameGroupBy object:
g[1:] # TypeError: unhashable type: 'slice'. (It thinks I want to access the column by its name.)
g.iloc[1:] # AttributeError: 'DataFrameGroupBy' object has no attribute 'iloc'
This question
is related but it doesn't answer my question.
I am posting an answer myself, but maybe there are better or more efficient solutions (maybe pandas-native?).
You can convert a pd.DataFrameGroupBy to a list that contains all groups (in tuples: grouping value and a group),
and then iterate over this list:
lst = list(g)
for current, next_one in zip(lst, lst[1:]):
    ...
Alternatively, create an iterator, and skip its first value:
it = iter(g)
next(it)
for current, next_one in zip(g, it):
    ...
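On Python 3.10+, itertools.pairwise expresses the same idea a bit more directly; a small sketch, assuming each iteration step still yields (key, group) tuples:
from itertools import pairwise  # Python 3.10+

for (key, current), (next_key, next_group) in pairwise(g):
    # current and next_group are consecutive group DataFrames
    ...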
A more complicated way:
g.groups returns a dictionary-like object whose keys are the unique values of your grouping column and whose values are the index labels of the rows in each group. You could iterate over it and pull out each group with g.get_group(), but I think it would be unnecessarily complicated.
Related
Say I have the following DataFrame df:
time person attributes
----------------------------
1 1 a
2 2 b
3 1 c
4 3 d
5 2 e
6 1 f
7 3 g
... ... ...
I want to write a function get_latest() that, when given a request_time and a list of person ids, returns a DataFrame containing the latest entry (row) for each person, up to the request_time.
So for instance, if I called get_latest(request_time = 4.5, ids = [1, 2]), then I want it to return
time person attributes
----------------------------
2 2 b
3 1 c
since those are the latest entries for persons 1 and 2 up to the time 4.5.
I've thought about truncating the DataFrame and then searching upwards from there, but that is still O(n), and I was wondering if there are functions or logic that make this a faster computation.
EDIT: I made this example DataFrame on the fly but it is perhaps important that I point out that the times are Python datetimes.
How about pd.DataFrame.query?
def latest_entries(request_time: float, ids: list) -> pd.DataFrame:
    return (
        df
        .query("time <= @request_time and person in @ids")
        .sort_values(["time"], ascending=False)
        .drop_duplicates(subset=["person"], keep="first")
        .reset_index(drop=True)
    )
print(latest_entries(4.5, [1, 2]))
time person attributes
0 3 1 c
1 2 2 b
def get_latest(tme, ids):
    df2 = df[(df['time'] <= tme) & (df['person'].isin(ids))]
    return df2[~df2.duplicated(subset=['person'], keep='last')]
get_latest(4.5, [1,2])
time person attributes
1 2 2 b
2 3 1 c
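For a more groupby-flavored variant, you could also filter first and then take the last row per person; a rough sketch, assuming df is sorted by time as in the example:
def get_latest(request_time, ids):
    # keep rows at or before request_time for the requested persons,
    # then take the latest row per person
    subset = df[(df['time'] <= request_time) & df['person'].isin(ids)]
    return subset.sort_values('time').groupby('person').tail(1)

print(get_latest(4.5, [1, 2]))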
I have the following data structure (see the example df below):
The columns s and d indicate the transitions of the object in column x. What I want to do is get a transition string per object present in column x, e.g. as a new column.
Is there a lean way to do it using pandas, without using too many loops?
This was the code I tried:
obj = df['x'].tolist()
rows = []
for o in obj:
    locs = df[df['x'] == o]['s'].tolist()
    str_locs = '->'.join(str(l) for l in locs)
    print(str_locs)
    d = dict()
    d['x'] = o
    d['new'] = str_locs
    rows.append(d)
tmp = pd.DataFrame(rows)
This gives the output tmp as:
x new
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
a 1->2->4->8
b 1->2
b 1->2
Example df:
df = pd.DataFrame({"x":["a","a","a","a","b","b"], "s":[1,2,4,8,5,11],"d":[2,4,8,9,11,12]})
print(df)
x s d
0 a 1 2
1 a 2 4
2 a 4 8
3 a 8 9
4 b 5 11
5 b 11 12
The following code will generate a transition string for every object present in column x:
1. groupby with respect to column x and get a list of [s, d] pairs for every object in x
2. Merge the lists of pairs sequentially into one flat list per object
3. Remove consecutive duplicates from the merged list using itertools.groupby
4. Join the items of the merged list with "->" to make a single string
5. Finally, map the resulting Series to column x of the input df
from itertools import groupby
grp = df.groupby('x')[['s', 'd']].apply(lambda x: x.values.tolist())
grp = grp.apply(lambda x: [str(item) for tup in x for item in tup])
sr = grp.apply(lambda x: "->".join([i[0] for i in groupby(x)]))
df["new"] = df["x"].map(sr)
print(df)
x s d new
0 a 1 2 1->2->4->8->9
1 a 2 4 1->2->4->8->9
2 a 4 8 1->2->4->8->9
3 a 8 9 1->2->4->8->9
4 b 5 11 5->11->12
5 b 11 12 5->11->12
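If each row's d always matches the next row's s within a group (as it does in the example df), a shorter variant, sketched under that assumption, is to join all s values of a group plus its last d:
sr = df.groupby('x')[['s', 'd']].apply(
    lambda g: "->".join(map(str, g['s'].tolist() + [g['d'].iloc[-1]]))
)
df["new"] = df["x"].map(sr)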
I have a dataframe with duplicate rows
>>> d = pd.DataFrame({'n': ['a', 'a', 'a'], 'v': [1,2,1]})
>>> d
n v
0 a 1
1 a 2
2 a 1
I would like to understand how to use the .groupby() method specifically so that I can add a new column to the dataframe which shows the count of rows identical to the current one.
>>> dd = d.groupby(by=['n','v'], as_index=False) # Use all columns to find groups of identical rows
>>> for k,v in dd:
...     print(k, "\n", v, "\n")  # Check what we found
...
('a', 1)
n v
0 a 1
2 a 1
('a', 2)
n v
1 a 2
When I try to do dd.count() on the resulting DataFrameGroupBy object I get IndexError: list index out of range. This seems to happen because all columns are used in the grouping operation and there is no other column left to count on. Similarly dd.agg({'n', 'count'}) fails with ValueError: no results.
I could use .apply() to achieve something that looks like the result.
>>> dd.apply(lambda x: x.assign(freq=len(x)))
n v freq
0 0 a 1 2
2 a 1 2
1 1 a 2 1
However, this has two issues: 1) something happens to the index, making it hard to map the result back to the original index, and 2) this does not seem like idiomatic pandas, and the manuals discourage using .apply() as it can be slow.
Is there a more idiomatic way to count duplicate rows when using .groupby()?
One solution is to use GroupBy.size for an aggregated output with the counts:
d = d.groupby(by=['n','v']).size().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
Your solution works if you specify some column name after the groupby, because there are no columns other than n and v in the input DataFrame:
d = d.groupby(by=['n','v'])['n'].count().reset_index(name='c')
print (d)
n v c
0 a 1 2
1 a 2 1
If what you need instead is a new column, use GroupBy.transform - the new column is filled with the aggregated values:
d['c'] = d.groupby(by=['n','v'])['n'].transform('size')
print (d)
n v c
0 a 1 2
1 a 2 1
2 a 1 2
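On pandas 1.1+, DataFrame.value_counts gives a similar aggregated result in one call; a small sketch, assuming all columns define what counts as a duplicate:
out = d.value_counts(['n', 'v']).reset_index(name='c')
print(out)
   n  v  c
0  a  1  2
1  a  2  1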
I have a large pandas dataframe from which I'm trying to form pairs for some rows.
My df looks as follows:
object_id increment location event
0 1 d A
0 2 d B
0 3 z C
0 4 g A
0 5 g B
0 6 i C
1 1 k A
1 2 k B
... ... ... ...
Object ids identify a specific object.
Increment is a value that increments every time something happens (to keep track of the order), location is the location at which this thing happens, and the last column is the type of event.
Now, I want to group these rows: sometimes (but not always), when A happens at a location, B happens right after it, while C is a completely different event that can be ignored. But I only want to group A and B together when the location is the same, the object id is the same, and the events are listed right after each other (so the increments should only differ by 1).
The problem is that the events and increment numbers restart from zero for the same object at certain times, so I only want to group rows when they sit directly after each other in the dataframe (so groups should contain at most two entries). I'm having a really hard time pulling this off, as there is no way to compare rows within the groupby function.
Any tips what direction I should try?
edit:
The output I'm looking for is forming groups of the form:
group_id object_id increment location event
0 0 1 d A
0 0 2 d B
1 0 3 z C
2 0 4 g A
2 0 5 g B
3 0 6 i C
4 1 1 k A
4 1 2 k B
... ... ... ... ...
So only forming groups when the "first" entry of the pair has event A and some increment value x, and the "second" entry has event B and increment value x+1, and is therefore part of the same sequence. Hope this clarifies my question a bit!
Your question is not entirely clear, so you might need to adjust the conditions in the if statement, but this might help you.
The dataframe set up:
import pandas as pd
d = {'object_id': [0,0,0,0], 'increment': [1,2,3,4],
'location': ['d', 'd', 'z', 'g'], 'event': ['A', 'B', 'C', 'A']}
df = pd.DataFrame(data=d)
Let's make a list to save the indices where the location is the same. You should adapt the conditions so they work for your case, since that was not entirely clear from your question. From there you can run the following function:
lst = []

def functionGrouping(dataset):
    for i in range(len(dataset) - 1):
        if dataset['event'].iloc[i + 1] == 'C':
            continue  # C is a different event and can be ignored
        else:
            if dataset['location'].iloc[i + 1] == dataset['location'].iloc[i] and dataset['object_id'].iloc[i + 1] == dataset['object_id'].iloc[i]:
                # add the increment of row i into row i+1 and remember row i for dropping later
                dataset.loc[dataset.index[i + 1], 'increment'] += dataset['increment'].iloc[i]
                lst.append(i)
functionGrouping(df)
And from there drop the rows which you summed up in the function.
df = df.drop(df.index[lst])
I hope that this helped a bit, however your question was not really clear. For future questions, please simplify your question and include an example of the desired output.
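A more vectorized sketch, not from the original answer, that builds the group_id column from the question's edit by comparing each row with the previous one (assuming a pair is always an A row immediately followed by its matching B row):
import pandas as pd

# example data matching the question's desired-output table
df = pd.DataFrame({
    'object_id': [0, 0, 0, 0, 0, 0, 1, 1],
    'increment': [1, 2, 3, 4, 5, 6, 1, 2],
    'location':  ['d', 'd', 'z', 'g', 'g', 'i', 'k', 'k'],
    'event':     ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
})

prev = df.shift()
# a row joins the previous row's group only if it is the B of an A->B pair
# for the same object, at the same location, with increment exactly +1
pairs_with_prev = (
    (df['event'] == 'B') & (prev['event'] == 'A')
    & (df['object_id'] == prev['object_id'])
    & (df['location'] == prev['location'])
    & (df['increment'] == prev['increment'] + 1)
)
# every row that does not continue the previous row starts a new group
df['group_id'] = (~pairs_with_prev).cumsum() - 1
print(df)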
Problem:
I have a somewhat complicated cross-referencing task I need to perform between a long list (~600,000 entries) and a short list (~300,000 entries). I'm trying to find the similar entries between the two lists, and each unique entry is identified by three different integers (call them int1, int2, and int3). Based on the three integer identifiers in one list, I want to see if those same three integers are in the other list, and return which ones they are.
Attempt:
First I zipped each three-integer tuple in the long list into an array called a. Similarly, I zipped each three-int tuple in the short list into an array called b:
a = [(int1,int2,int3),...] # 600,000 entries
b = [(int1,int2,int3),...] # 300,000 entries
Then I iterated through each entry in a to see if it was in b. If it was, I appended the corresponding tuples to an array outside the loop called c:
c = []
for i in range(0, len(a), 1):
    if a[i] in b:
        c.append(a[i])
The iteration is (not surprisingly) very slow. I'm guessing Python has to check b for a[i] at each iteration (~300,000 times!), and it's iterating 600,000 times. It has taken over an hour now and still hasn't finished, so I know I should be optimizing something.
My question is: what is the most Pythonic or fastest way to perform this cross-referencing?
You can use sets:
c = set(b).intersection(a)
I chose to convert b to a set as it is the shorter of the two lists. Using intersection() does not require that list a first be converted to a set.
You can also do this:
c = set(a) & set(b)
however, both lists require conversion to type set first.
Either way you have an O(n) operation, see time complexity.
Pandas solution:
a = [(1,2,3),(4,5,6),(4,5,8),(1,2,8) ]
b = [(1,2,3),(0,3,7),(4,5,8)]
df1 = pd.DataFrame(a)
print (df1)
0 1 2
0 1 2 3
1 4 5 6
2 4 5 8
3 1 2 8
df2 = pd.DataFrame(b)
print (df2)
0 1 2
0 1 2 3
1 0 3 7
2 4 5 8
df = pd.merge(df1, df2)
print (df)
0 1 2
0 1 2 3
1 4 5 8
Pure python solution with sets:
c = list(set(b).intersection(set(a)))
print (c)
[(4, 5, 8), (1, 2, 3)]
Another interesting way to do it:
from itertools import compress
list(compress(b, map(lambda x: x in a, b)))
And another one (on Python 3, wrap it in list() since filter returns an iterator):
list(filter(lambda x: x in a, b))
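Note that both of the last two approaches still test membership against the list a, which makes every lookup O(len(a)); converting a to a set first keeps each lookup roughly constant time. A small sketch along those lines:
a_set = set(a)                    # one-time conversion
c = [t for t in b if t in a_set]  # preserves the order of b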