Group by categorical column and by range of ids - python

I have a dataframe similar to this one:
col inc_idx
0 A 1
1 B 1
2 C 1
3 A 2
4 A 3
5 B 2
6 D 1
7 E 1
8 F 1
9 F 2
10 Z 1
And I'm trying to iterate the df by batches:
First loop: All col rows with inc_idx >= 1 and inc_idx <=2
A 1
A 2
B 1
B 2
...
Second loop: All col rows with inc_idx >= 3 and inc_idx <=4
A 3
The way I'm doing it now leaves a lot of room for improvement:
i = 0
while True:
    for col, grouped_rows in df.groupby(by=['col']):
        from_idx = i * 2
        to_idx = from_idx + 2
        items = grouped_rows.iloc[from_idx:to_idx].to_list()
    i += 2
I think there has to be a more efficient approach, and also a way to remove the "while True" loop and instead just wait for the internal loop to run out of items.

I don't know exactly what you want to do. Here's something that groups the rows.
df.groupby((df.inc_idx + 1) // 2).agg(list)
                                    col                         inc_idx
inc_idx
1        [A, B, C, A, B, D, E, F, F, Z]  [1, 1, 1, 2, 2, 1, 1, 1, 2, 1]
2                                   [A]                             [3]

I've found (I think) a simpler way to solve it. I'll add a new "batch" column:
df['batch'] = (df['inc_idx'] - 1) // 2  # inc_idx 1-2 -> batch 0, inc_idx 3-4 -> batch 1
With this new column, now I can simply do something like:
df.groupby(by=['col', 'batch'])
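A minimal sketch of how the batch idea could replace the "while True" loop. It assumes batches of width 2 that start at inc_idx 1 (so the batch number is (inc_idx - 1) // 2); the names batch_size and batches are made up for illustration. The loop simply ends when there are no more batches.

```python
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    "col": list("ABCAABDEFFZ"),
    "inc_idx": [1, 1, 1, 2, 3, 2, 1, 1, 1, 2, 1],
})

batch_size = 2  # assumed batch width
# inc_idx 1-2 -> batch 0, inc_idx 3-4 -> batch 1, ...
df["batch"] = (df["inc_idx"] - 1) // batch_size

batches = {}
# Sorting first keeps each batch ordered by col, then inc_idx
for batch, rows in df.sort_values(["col", "inc_idx"]).groupby("batch"):
    batches[int(batch)] = list(zip(rows["col"].tolist(), rows["inc_idx"].tolist()))
    print(f"batch {batch}:", batches[int(batch)])
```

Grouping on the batch number alone gives one iteration per batch; grouping on both col and batch would instead give one iteration per (col, batch) pair.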

Related

Get the middle value from a column according to a criteria

I have a dataframe with 3 columns. I need to get the values of A and B from the middle row among the rows where C = 1. If the number of rows with C = 1 is even, I want the first of the two middle rows.
For example, this one is for an odd amount of C = 1
A B C
w y 0
c v 0
t o 1
e p 1
t b 1
u e 0
q p 0
The row in the middle when C = 1 is
A B C
e p 1
Therefore, it should return
df_return
A B C
e p 1
When we have an even number of rows with C = 1:
df
A B C
w y 0
c v 0
t o 1
e p 1
t b 1
u e 1
r e 1
u f 1
q p 0
The ones in the middle when C = 1 are
A B C
t b 1
u e 1
However, I want only 1 of them, and it should be the first one. So
df_return
A B C
t b 1
How can I do it?
One thing you should know is that A and B are ordered
Focus on the relevant part, discarding rows holding zeros:
df = df[df.C == 1]
Now it's simple. Just find the midpoint, based on length or .shape.
if len(df) > 0:
    mid = (len(df) - 1) // 2  # for an even count this picks the first of the two middle rows
    df_return = df.iloc[[mid]]  # one-row DataFrame
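As a hedged illustration, the same midpoint logic can be wrapped in a small function and checked against both examples from the question. The function name middle_row and its flag_col parameter are hypothetical, not part of pandas.

```python
import pandas as pd

def middle_row(df, flag_col="C"):
    """Return the middle row (as a one-row DataFrame) among rows where flag_col == 1."""
    ones = df[df[flag_col] == 1]
    if ones.empty:
        return ones  # nothing to pick from
    # (n - 1) // 2 is the exact middle for odd n and the first of the two middles for even n
    mid = (len(ones) - 1) // 2
    return ones.iloc[[mid]]

odd = pd.DataFrame({
    "A": list("wcteuq".replace("teuq", "tetuq")),
    "B": list("yvopbep"),
    "C": [0, 0, 1, 1, 1, 0, 0],
})
even = pd.DataFrame({
    "A": list("wcteturuq"),
    "B": list("yvopbeefp"),
    "C": [0, 0, 1, 1, 1, 1, 1, 1, 0],
})

print(middle_row(odd))
print(middle_row(even))
```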

Nested loop through list avoiding same element

I can't explain the concept well at all, but I am trying to loop through a list using a nested loop, and I can't figure out how to avoid them using the same element.
list = [1, 2, 2, 4]
for i in list:
    for j in list:
        print(i, j) # But only if they are not the same element
So the output should be:
1 2
1 2
1 4
2 1
2 2
2 4
2 1
2 2
2 4
4 1
4 2
4 2
Edit as the solutions don't work in all scenarios:
The if i != j solution only works if all elements in the list are different. I clearly chose a poor example: I meant the same element rather than the same number, and I have changed the example accordingly.
You can compare the indices of the two iterations instead:
lst = [1, 2, 2, 4]
for i, a in enumerate(lst):
    for j, b in enumerate(lst):
        if i != j:
            print(a, b)
You can also consider using itertools.permutations for your purpose:
from itertools import permutations
lst = [1, 2, 2, 4]
for i, j in permutations(lst, 2):
    print(i, j)
Both would output:
1 2
1 2
1 4
2 1
2 2
2 4
2 1
2 2
2 4
4 1
4 2
4 2
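Simply (note that this compares values, not positions, so it also skips the 2 2 pairs when the list contains duplicates):
if i != j:
    print(i, j)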

Compare current column value to different column value by row slices

Assuming a dataframe like this
In [5]: data = pd.DataFrame([[9,4],[5,4],[1,3],[26,7]])
In [6]: data
Out[6]:
0 1
0 9 4
1 5 4
2 1 3
3 26 7
I want to count how many times the values in a rolling window/slice of 2 on column 0 are greater or equal to the value in col 1 (4).
On the first number 4 at col 1, a slice of 2 on column 0 yields 5 and 1, so the output would be 2 since both numbers are greater than 4, then on the second 4 the next slice values on col 0 would be 1 and 26, so the output would be 1 because only 26 is greater than 4 but not 1. I can't use rolling window since iterating through rolling window values is not implemented.
I need something like a slice of the previous n rows and then I can iterate, compare and count how many times any of the values in that slice are above the current row.
I have done this using lists instead of working on the dataframe directly. Check the code below:
list1, list2 = data[0].tolist(), data[1].tolist()
outList = []
for ix in range(len(list1)):
    if ix < len(list1) - 2:
        # Compare the current col 1 value with the next two col 0 values
        if list2[ix] < list1[ix + 1] and list2[ix] < list1[ix + 2]:
            outList.append(2)
        elif list2[ix] < list1[ix + 1] or list2[ix] < list1[ix + 2]:
            outList.append(1)
        else:
            outList.append(0)
    else:
        # Fewer than two rows remain below this one, so there is no full window
        outList.append(0)
data['2_rows_forward_moving_tag'] = pd.Series(outList)
Output:
0 1 2_rows_forward_moving_tag
0 9 4 1
1 5 4 1
2 1 3 0
3 26 7 0
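The list-based loop can probably be replaced by shifted comparisons. This is only a sketch under the same assumptions as the loop (a forward window of 2 rows, strict "greater than", and 0 for rows without a full window); the names window, count and full are illustrative.

```python
import pandas as pd

data = pd.DataFrame([[9, 4], [5, 4], [1, 3], [26, 7]])
window = 2  # number of rows to look ahead

count = pd.Series(0, index=data.index)
for k in range(1, window + 1):
    # shift(-k) lines up the col 0 value k rows below with the current row;
    # comparisons against the NaN padding at the end evaluate to False
    count += (data[0].shift(-k) > data[1]).astype(int)

# Rows that do not have a full window below them get 0, as in the loop version
full = data[0].shift(-window).notna()
data["2_rows_forward_moving_tag"] = count.where(full, 0)
print(data)
```

Changing `>` to `>=` would count values equal to the current row as well, if that is what "greater or equal" is meant to cover.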

Get first row of dataframe in Python Pandas based on criteria

Let's say that I have a dataframe like this one
import pandas as pd
df = pd.DataFrame([[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]], columns=['A', 'B', 'C'])
>> df
A B C
0 1 2 1
1 1 3 2
2 4 6 3
3 4 3 4
4 5 4 5
The original table is more complicated with more columns and rows.
I want to get the first row that fulfils some criteria. Examples:
Get first row where A > 3 (returns row 2)
Get first row where A > 4 AND B > 3 (returns row 4)
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
But if no row fulfils the specific criteria, then I want to get the first row after sorting descending by A (or, in other cases, by B, C, etc.)
Get first row where A > 6 (returns row 4 by ordering it by A desc and getting the first one)
I was able to do it by iterating over the dataframe (I know that's crap :P), so I would prefer a more pythonic way to solve it.
This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:
>>> df[condition]
This will return a slice of your dataframe which you can index using iloc. Here are your examples:
Get first row where A > 3 (returns row 2)
>>> df[df.A > 3].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
If what you actually want is the row number, rather than using iloc, it would be df[df.A > 3].index[0].
Get first row where A > 4 AND B > 3:
>>> df[(df.A > 4) & (df.B > 3)].iloc[0]
A 5
B 4
C 5
Name: 4, dtype: int64
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
>>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
Now, with your last case we can write a function that handles the default case of returning the descending-sorted frame:
>>> def series_or_default(X, condition, default_col, ascending=False):
...     sliced = X[condition]
...     if sliced.shape[0] == 0:
...         return X.sort_values(default_col, ascending=ascending).iloc[0]
...     return sliced.iloc[0]
>>>
>>> series_or_default(df, df.A > 6, 'A')
A 5
B 4
C 5
Name: 4, dtype: int64
As expected, it returns row 4.
For existing matches, use query:
df.query(' A > 3' ).head(1)
Out[33]:
A B C
2 4 6 3
df.query(' A > 4 and B > 3' ).head(1)
Out[34]:
A B C
4 5 4 5
df.query(' A > 3 and (B > 3 or C > 2)' ).head(1)
Out[35]:
A B C
2 4 6 3
You can take care of the first 3 items with slicing and head:
df[df.A > 3].head(1)
df[(df.A > 4) & (df.B > 3)].head(1)
df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].head(1)
The case where nothing comes back can be handled with a simple if:
output = df[df.A > 6].head(1)
if output.empty:
    output = df.sort_values('A', ascending=False).head(1)
For the point that 'returns the value as soon as you find the first row/record that meets the requirements and NOT iterating other rows', the following code would work:
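def pd_iter_func(df):
    for row in df.itertuples():
        # Define your criteria here
        if row.A > 4 and row.B > 3:
            return row
It can be faster than boolean indexing on a large dataframe when a match occurs near the top, because it stops at the first match. If the match is late or absent, the Python-level loop is usually slower than a vectorized mask.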
To make the function above more general, one can pass the criteria as a callable (for example, a lambda):
from typing import Callable, NamedTuple, Optional
from pandas import DataFrame
def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
    for row in df.itertuples():
        if criteria(row):
            return row
pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)
As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax would also be a nice choice.
def pd_idxmax_func(df, mask):
    return df.loc[mask.idxmax()]
pd_idxmax_func(df, (df.A > 4) & (df.B > 3))
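One caveat with idxmax: when no row matches, the mask is all False and idxmax returns the first index label, so the first row would be returned as if it matched. A possible guard, combined with the sorted fallback from the question, is sketched below. The function name first_match_or_default is hypothetical, and the sketch assumes the index labels are unique.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]],
    columns=["A", "B", "C"],
)

def first_match_or_default(df, mask, default_col, ascending=False):
    """First row where mask is True; otherwise the top row after sorting by default_col."""
    if mask.any():
        # idxmax returns the label of the first True value in a boolean mask
        return df.loc[mask.idxmax()]
    return df.sort_values(default_col, ascending=ascending).iloc[0]

print(first_match_or_default(df, df.A > 3, "A").name)                 # a match exists
print(first_match_or_default(df, (df.A > 4) & (df.B > 3), "A").name)  # a match exists
print(first_match_or_default(df, df.A > 6, "A").name)                 # no match, falls back to the sort
```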

How to return all opposite pairs in a Pandas DataFrame?

For the dataframe below, how to return all opposite pairs?
import pandas as pd
df1 = pd.DataFrame([1,2,-2,2,-1,-1,1,1], columns=['a'])
a
0 1
1 2
2 -2
3 2
4 -1
5 -1
6 1
7 1
The output should be as below:
(1) sum of all rows is 0
(2) as there are 3 "1" and 2 "-1" in the original data, the output includes 2 "1" and 2 "-1".
a
0 1
1 2
2 -2
4 -1
5 -1
6 1
Thank you very much.
Well, I thought this would take fewer lines (and probably can) but this does work. First just create a couple of new columns to simplify the later syntax:
>>> import numpy as np
>>> df1['abs_a'] = np.abs( df1['a'] )
>>> df1['ones'] = 1
Then the main thing you need is to do some counting. For example, are there fewer 1s or fewer -1s?
>>> df2 = df1.groupby(['abs_a','a']).count()
          ones
abs_a a
1     -1     2
       1     3
2     -2     1
       2     2
>>> df3 = df2.groupby(level=0).min()
       ones
abs_a
1         2
2         1
That's basically the answer right there, but I'll put it closer to the form you asked for:
>>> lst = [ [i]*j for i, j in zip( df3.index.tolist(), df3['ones'].tolist() ) ]
>>> arr = np.array( [item for sublist in lst for item in sublist] )
>>> np.hstack( [arr,-1*arr] )
array([ 1, 1, 2, -1, -1, -2], dtype=int64)
Or if you want to put it back into a dataframe:
>>> pd.DataFrame( np.hstack( [arr,-1*arr] ) )
0
0 1
1 1
2 2
3 -1
4 -1
5 -2
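The result above has the right values but loses the original row labels that the requested output keeps (0, 1, 2, 4, 5, 6). A different technique that may fit better is sketched here: number each occurrence of a value with groupby(...).cumcount() and keep a row only while its occurrence number is below the count of the opposite value. The names occurrence, counts and opposite_count are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([1, 2, -2, 2, -1, -1, 1, 1], columns=["a"])

# Occurrence number of each value: the first 1 is 0, the second 1 is 1, ...
occurrence = df1.groupby("a").cumcount()

# How often each value appears, then how often each row's opposite value appears
counts = df1["a"].value_counts()
opposite_count = (-df1["a"]).map(counts).fillna(0).astype(int)

# Keep the k-th occurrence of x only if -x appears more than k times
result = df1[occurrence < opposite_count]
print(result)
```

With the sample data this keeps the first two 1s and both -1s, one 2 and the single -2, so the kept rows sum to zero.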
