Group by categorical column and by range of ids

Group by categorical column and by range of ids - python

I have a dataframe similar to this one:
col inc_idx
0 A 1
1 B 1
2 C 1
3 A 2
4 A 3
5 B 2
6 D 1
7 E 1
8 F 1
9 F 2
10 Z 1
And I'm trying to iterate the df by batches:
First loop: All col rows with inc_idx >= 1 and inc_idx <=2
A 1
A 2
B 1
B 2
...
Second loop: All col rows with inc_idx >= 3 and inc_idx <=4
A 3
The way I'm doing it now leaves a lot of room for improvement:
i = 0
while True:
for col, grouped_rows in df.groupby(by=['col']):
from_idx = i * 2
to_idx = from_idx + 2
items = grouped_rows .iloc[from_idx:to_idx].to_list()
i += 2
I think that there's got to be a more efficient approach and also a way to remove the "while True" loop and instead just waiting for the internal loop to run out of items.

I don't know exactly what you want to do. Here's something that groups the rows.
df.groupby((df.inc_idx + 1) // 2).agg(list)
col inc_idx
inc_idx
1 [A, B, C, A, B, D, E, F, F, Z] [1, 1, 1, 2, 2, 1, 1, 1, 2, 1]
2 [A] [3]

I've found (I think) a simpler way to solve it. I'll add a new "batch" column:
df['batch'] = df.apply(lambda x: x['inc_idx'] // 2, axis=1)
With this new column, now I can simply do something like:
df.groupby(by=['col', 'batch'])

Related

Get the middle value from a column according to a criteria

I have a dataframe with 3 columns. I need to get the value from col A and B in the middle of C when C = 1. If the amount of C = 1 is even, I want the first one from the middle
For example, this one is for an odd amount of C = 1
A B C
w y 0
c v 0
t o 1
e p 1
t b 1
u e 0
q p 0
The row in the middle when C = 1 is
A B C
e p 1
Therefore, it should return
df_return
A B C
e p 1
When we have an even amount of C = 1:
df_return
A B C
w y 0
c v 0
t o 1
e p 1
t b 1
u e 1
r e 1
u f 1
q p 0
The ones in the middle when C = 1 are
A B C
t b 1
u e 1
However, I want only 1 of them, and it should be the first one. So
df_return
A B C
t b 1
How can I do it?
One thing you should know is that A and B are ordered

Focus on the relevant part, discarding rows holding zeros:
df = df[df.C == 1]
Now it's simple. Just find the midpoint, based on length or .shape.
if len(df) > 0:
mid = (len(df) - 1) // 2
return df.iloc[mid, :]

Nested loop through list avoiding same element

I can't explain the concept well at all, but I am trying to loop through a list using a nested loop, and I can't figure out how to avoid them using the same element.
list = [1, 2, 2, 4]
for i in list:
for j in list:
print(i, j) # But only if they are not the same element
So the output should be:
1 2
1 2
1 4
2 1
2 2
2 4
2 1
2 2
2 4
4 1
4 2
4 2
Edit as the solutions don't work in all scenarios:
The if i != j solution only works if all elements in the list are different, I clearly chose a poor example, but I meant same element rather than the same number; I have changed the example

You can compare the indices of the two iterations instead:
lst = [1, 2, 2, 4]
for i, a in enumerate(lst):
for j, b in enumerate(lst):
if i != j:
print(a, b)
You can also consider using itertools.permutations for your purpose:
lst = [1, 2, 2, 4]
from itertools import permutations
for i, j in permutations(lst, 2):
print(i, j)
Both would output:
1 2
1 2
1 4
2 1
2 2
2 4
2 1
2 2
2 4
4 1
4 2
4 2

Simply:
if i != j:
print(i, j)

Compare current column value to different column value by row slices

Assuming a dataframe like this
In [5]: data = pd.DataFrame([[9,4],[5,4],[1,3],[26,7]])
In [6]: data
Out[6]:
0 1
0 9 4
1 5 4
2 1 3
3 26 7
I want to count how many times the values in a rolling window/slice of 2 on column 0 are greater or equal to the value in col 1 (4).
On the first number 4 at col 1, a slice of 2 on column 0 yields 5 and 1, so the output would be 2 since both numbers are greater than 4, then on the second 4 the next slice values on col 0 would be 1 and 26, so the output would be 1 because only 26 is greater than 4 but not 1. I can't use rolling window since iterating through rolling window values is not implemented.
I need something like a slice of the previous n rows and then I can iterate, compare and count how many times any of the values in that slice are above the current row.

I have done this using list instead of doing it in data frame. Check the code below:
list1, list2 = df['0'].values.tolist(), df['1'].values.tolist()
outList = []
for ix in range(len(list1)):
if ix < len(list1) - 2:
if list2[ix] < list1[ix + 1] and list2[ix] < list1[ix + 2]:
outList.append(2)
elif list2[ix] < list1[ix + 1] or list2[ix] < list1[ix + 2]:
outList.append(1)
else:
outList.append(0)
else:
outList.append(0)
df['2_rows_forward_moving_tag'] = pd.Series(outList)
Output:
0 1 2_rows_forward_moving_tag
0 9 4 1
1 5 4 1
2 1 3 0
3 26 7 0

Get first row of dataframe in Python Pandas based on criteria

Let's say that I have a dataframe like this one
import pandas as pd
df = pd.DataFrame([[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]], columns=['A', 'B', 'C'])
>> df
A B C
0 1 2 1
1 1 3 2
2 4 6 3
3 4 3 4
4 5 4 5
The original table is more complicated with more columns and rows.
I want to get the first row that fulfil some criteria. Examples:
Get first row where A > 3 (returns row 2)
Get first row where A > 4 AND B > 3 (returns row 4)
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
But, if there isn't any row that fulfil the specific criteria, then I want to get the first one after I just sort it descending by A (or other cases by B, C etc)
Get first row where A > 6 (returns row 4 by ordering it by A desc and get the first one)
I was able to do it by iterating on the dataframe (I know that craps :P). So, I prefer a more pythonic way to solve it.

This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:
>>> df[condition]
This will return a slice of your dataframe which you can index using iloc. Here are your examples:
Get first row where A > 3 (returns row 2)
>>> df[df.A > 3].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
If what you actually want is the row number, rather than using iloc, it would be df[df.A > 3].index[0].
Get first row where A > 4 AND B > 3:
>>> df[(df.A > 4) & (df.B > 3)].iloc[0]
A 5
B 4
C 5
Name: 4, dtype: int64
Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)
>>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
A 4
B 6
C 3
Name: 2, dtype: int64
Now, with your last case we can write a function that handles the default case of returning the descending-sorted frame:
>>> def series_or_default(X, condition, default_col, ascending=False):
... sliced = X[condition]
... if sliced.shape[0] == 0:
... return X.sort_values(default_col, ascending=ascending).iloc[0]
... return sliced.iloc[0]
>>>
>>> series_or_default(df, df.A > 6, 'A')
A 5
B 4
C 5
Name: 4, dtype: int64
As expected, it returns row 4.

For existing matches, use query:
df.query(' A > 3' ).head(1)
Out[33]:
A B C
2 4 6 3
df.query(' A > 4 and B > 3' ).head(1)
Out[34]:
A B C
4 5 4 5
df.query(' A > 3 and (B > 3 or C > 2)' ).head(1)
Out[35]:
A B C
2 4 6 3

you can take care of the first 3 items with slicing and head:
df[df.A>=4].head(1)
df[(df.A>=4)&(df.B>=3)].head(1)
df[(df.A>=4)&((df.B>=3) * (df.C>=2))].head(1)
The condition in case nothing comes back you can handle with a try or an if...
try:
output = df[df.A>=6].head(1)
assert len(output) == 1
except:
output = df.sort_values('A',ascending=False).head(1)

For the point that 'returns the value as soon as you find the first row/record that meets the requirements and NOT iterating other rows', the following code would work:
def pd_iter_func(df):
for row in df.itertuples():
# Define your criteria here
if row.A > 4 and row.B > 3:
return row
It is more efficient than Boolean Indexing when it comes to a large dataframe.
To make the function above more applicable, one can implements lambda functions:
def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
for row in df.itertuples():
if criteria(row):
return row
pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)
As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax would also be a nice choice.
def pd_idxmax_func(df, mask):
return df.loc[mask.idxmax()]
pd_idxmax_func(df, (df.A > 4) & (df.B > 3))

How to return all opposite pairs in a Pandas DataFrame?

For the dataframe below, how to return all opposite pairs?
import pandas as pd
df1 = pd.DataFrame([1,2,-2,2,-1,-1,1,1], columns=['a'])
a
0 1
1 2
2 -2
3 2
4 -1
5 -1
6 1
7 1
The output should be as below:
(1) sum of all rows is 0
(2) as there are 3 "1" and 2 "-1" in
original data, output includes 2 "1" and 2"-1".
a
0 1
1 2
2 -2
4 -1
5 -1
6 1
Thank you very much.

Well, I thought this would take fewer lines (and probably can) but this does work. First just create a couple of new columns to simplify the later syntax:
>>> df1['abs_a'] = np.abs( df1['a'] )
>>> df1['ones'] = 1
Then the main thing you need is to do some counting. For example, are there fewer 1s or fewer -1s?
>>> df2 = df1.groupby(['abs_a','a']).count()
ones
abs_a a
1 -1 2
1 3
2 -2 1
2 2
>>> df3 = df2.groupby(level=0).min()
ones
abs_a
1 2
2 1
That's basically the answer right there, but I'll put it closer to the form you asked for:
>>> lst = [ [i]*j for i, j in zip( df3.index.tolist(), df3['ones'].tolist() ) ]
>>> arr = np.array( [item for sublist in lst for item in sublist] )
>>> np.hstack( [arr,-1*arr] )
array([ 1, 1, 2, -1, -1, -2], dtype=int64)
Or if you want to put it back into a dataframe:
>>> pd.DataFrame( np.hstack( [arr,-1*arr] ) )
0
0 1
1 1
2 2
3 -1
4 -1
5 -2

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Group by categorical column and by range of ids - python

I don't know exactly what you want to do. Here's something that groups the rows. df.groupby((df.inc_idx + 1) // 2).agg(list) col inc_idx inc_idx 1 [A, B, C, A, B, D, E, F, F, Z] [1, 1, 1, 2, 2, 1, 1, 1, 2, 1] 2 [A] [3]

I've found (I think) a simpler way to solve it. I'll add a new "batch" column: df['batch'] = df.apply(lambda x: x['inc_idx'] // 2, axis=1) With this new column, now I can simply do something like: df.groupby(by=['col', 'batch'])

Related

Get the middle value from a column according to a criteria

Nested loop through list avoiding same element

Compare current column value to different column value by row slices

Get first row of dataframe in Python Pandas based on criteria

How to return all opposite pairs in a Pandas DataFrame?

Categories

Resources