Find the rows in which the (absolute) value is bigger than - python

I have a pandas dataframe like:
one  two  three
  1    3      4
  2    4      6
  1    3      4
 10    3      4
  2    4      5
  0    3      4
-10    3      4
Looking at the first column (labeled 'one'), I would like to find the rows where the value is bigger than, say, 9 (in this case, the fourth row).
Ideally, I would also like to find the rows where the absolute value is bigger than, say, 9 (that would be the fourth and seventh rows).
How can I do this? So far I have only converted the columns into Series, and even into Series of True and False values, but my dataframe is huge and I cannot inspect it visually. I need to get the row numbers automatically.

You can take the absolute value with abs, compare it with the threshold, and filter with loc:
out = df.loc[df['one'].abs() > 9]
Output:
   one  two  three
3   10    3      4
6  -10    3      4
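Since the question asks for the row numbers rather than the filtered rows, here is a minimal sketch of how they might be pulled out of the same boolean mask. It assumes the default integer index; the names `mask`, `labels` and `positions` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "one":   [1, 2, 1, 10, 2, 0, -10],
    "two":   [3, 4, 3, 3, 4, 3, 3],
    "three": [4, 6, 4, 4, 5, 4, 4],
})

# Boolean Series: True where the absolute value of 'one' exceeds 9.
mask = df["one"].abs() > 9

# Index labels of the matching rows.
labels = df.index[mask].tolist()

# 0-based positions of the matching rows (independent of the index labels).
positions = np.flatnonzero(mask.to_numpy()).tolist()

print(labels)
print(positions)
```

With a default index the labels and positions coincide; they differ as soon as the frame has been filtered, sorted, or given a custom index.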

You could use abs() (see the pandas documentation for Series.abs):
import pandas as pd
df = pd.DataFrame({
    'a': [1, 4, -6, 3, 7],
    'b': [2, 3, 5, 3, 1],
    'c': [4, 2, 7, 1, 3]
})
df[df.a.abs() > 5]
This returns the two rows with index 2 and 4.

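# For every column, collect the index labels of the rows whose
# absolute value is at least 9 (use > for a strict comparison)
row = {}
for column in df:
    index = df.loc[df[column].abs() >= 9].index
    row[column] = list(index)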

Related

First appearance of a condition in a dataframe

I have a pandas dataframe like this:
col
0 3
1 5
2 9
3 5
4 6
5 6
6 11
7 6
8 2
9 10
that could be created in Python with the code:
import pandas as pd
df = pd.DataFrame(
{
'col': [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]
}
)
I want to find the rows that have a value greater than 8 and are preceded by at least one row with a value less than 4.
So the output should be:
col
2 9
9 10
Index 0 has a value of 3 (less than 4) and index 2 then has a value greater than 8, so index 2 is added to the output. The search then resets: indexes 0, 1 and 2 are no longer considered for the following rows.
Index 6 has a value of 11, but none of the indexes 3, 4 and 5 has a value less than 4, so index 6 is not added to the output.
Index 8 has a value of 2 (less than 4) and index 9 has a value of 10 (greater than 8), so index 9 is added to the output.
I would strongly prefer a solution that does not use any for-loops.
Do you have any ideas?
Boolean indexing to the rescue:
# value > 8
m1 = df['col'].gt(8)
# get previous value <4
# check if any occurred previously
m2 = df['col'].shift().lt(4).groupby(m1[::-1].cumsum()).cummax()
df[m1&m2]
Output:
col
2 9
9 10
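To see why this works, it can help to look at the intermediate steps. The following is only a sketch that spells out the same masks one at a time; the names `prev_low`, `group` and `steps` are mine, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"col": [3, 5, 9, 5, 6, 6, 11, 6, 2, 10]})

# Rows with a value greater than 8.
m1 = df["col"].gt(8)

# True where the previous row has a value less than 4.
prev_low = df["col"].shift().lt(4)

# Reversed cumulative count of the hits: every hit shares a key with the
# rows leading up to it, so the search "resets" after each hit.
group = m1[::-1].cumsum()

# Within each group, remember whether a value < 4 has been seen so far.
m2 = prev_low.groupby(group).cummax()

# Side-by-side view of the intermediate columns (aligned on the index).
steps = pd.DataFrame({"col": df["col"], "m1": m1, "group": group, "m2": m2})
print(steps)

result = df[m1 & m2]
print(result)
```

The groups come out as rows 0-2, rows 3-6 and rows 7-9, which matches the "reset" described in the question.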
Check the code below, which groups the rows with a cumulative sum:
import numpy as np
df['val'] = np.where(df['col']>8, True, False).cumsum()
df['val'] = np.where(df['col']>8, df['val']-1, df['val'])
df.assign(min_value = df.groupby('val')['col'].transform('min')).\
query('col>8 and min_value<4')[['col']]
Output:
   col
2    9
9   10

Change names of columns which contain only positive values

Consider the following data frame:
import pandas as pd
df = pd.DataFrame([[1, -2, 3, -5, 4 ,2 ,7 ,-8 ,2], [2, -4, 6, 7, -8, 9, 5, 3, 2], [2, 4, 6, 7, 8, 9, 5, 3, 2], [1, 2, 3, 4, 5, 6, 7, 8, 9]]).transpose()
df.columns = ["A", "B", "C", "D"]
   A  B  C  D
0  1  2  2  1
1 -2 -4  4  2
2  3  6  6  3
3 -5  7  7  4
4  4 -8  8  5
5  2  9  9  6
6  7  5  5  7
7 -8  3  3  8
8  2  2  2  9
I want to append "pos" to the column name if the column contains only positive values. My attempt was:
pos_idx = df.loc[:, (df>0).all()].columns
df[pos_idx].columns = df[pos_idx].columns + "pos"
However, this does not seem to work: it raises no error, but it does not change the column names either. Interestingly, the following code:
df.columns = df.columns + "anything"
does append the word "anything" to the column names. Could you please explain why this happens (it works in the general case, but not when selecting columns by index), and how to do this correctly?
You are saving the new column names onto a copy of the dataframe. The statement below does not overwrite the column names of df, only those of the temporary slice df[pos_idx]:
df[pos_idx].columns = df[pos_idx].columns + "pos"
Your second code example accesses df directly, which is why that one works.
How to make it work? Build the full list of column names separately, then assign it to df directly.
How to build the full list? Add "_pos" as a suffix to every column that has no values <= 0:
my_col_list = [col+(count==0)*"_pos" for col, count in (df <= 0).sum().to_dict().items()]
df.columns = my_col_list
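The difference between the two assignments can be checked directly. This is a small sketch, not the only way to do it: it uses the plain "pos" suffix from the question, and the variable `unchanged` is only there to record the column names after the failed attempt.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, -2, 3, -5, 4, 2, 7, -8, 2],
    "B": [2, -4, 6, 7, -8, 9, 5, 3, 2],
    "C": [2, 4, 6, 7, 8, 9, 5, 3, 2],
    "D": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# Names of the columns that contain only positive values.
pos_idx = df.columns[(df > 0).all()]

# df[pos_idx] builds a new DataFrame, so assigning to its .columns
# changes that temporary object and leaves df untouched.
df[pos_idx].columns = df[pos_idx].columns + "pos"
unchanged = df.columns.tolist()

# Renaming through df itself does change the names.
df = df.rename(columns={col: col + "pos" for col in pos_idx})

print(unchanged)
print(df.columns.tolist())
```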
First of all, use the .rename() method to change the name of a column.
To add ' pos' to the columns that contain only positive values, you can use this:
renamed_columns = {i:i+' pos' for i in df.columns if df[i].min()>0}
df.rename(columns=renamed_columns,inplace=True)

Efficient way in Pandas to count occurrences of Series of values by row

I have a large dataframe for which I want to count, row by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility, let's assume the following simplified dataframe:
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
   A  B  C  D  E
0  3  4  1  1  4
1  2  3  2  1  4
2  1  2  3  2  4
3  0  1  4  2  4
How can I count the number of occurrences of specific values (given by a series of the same size) by row?
Again for simplicity, let's assume this values_series is given by the max of each row.
values_series = df.max(axis=1)
0    4
1    4
2    4
3    4
dtype: int64
The solution I came up with does not seem very pythonic (e.g. I'm using iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there a more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:
(df.T == df.max(axis=1)).sum()
# result
0 2
1 1
2 1
3 2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
You can try
df.eq(df.max(axis=1), axis=0).sum(axis=1)
Output:
0    2
1    1
2    1
3    2
dtype: int64
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
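For reference, here is a sketch that runs the three vectorised approaches on the sample data and against an arbitrary `values_series` (here still the row maximum, as in the question). The names `via_transpose`, `via_eq` and `via_numpy` are only illustrative.

```python
import pandas as pd

data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
        'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
values_series = df.max(axis=1)

# 1) Transpose so that the Series index lines up with the columns.
via_transpose = (df.T == values_series).sum()

# 2) Compare row-wise with eq(..., axis=0), then count along each row.
via_eq = df.eq(values_series, axis=0).sum(axis=1)

# 3) Plain numpy broadcasting: an (n, m) array against an (n, 1) column.
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
via_numpy = pd.Series((a == b).sum(axis=1), index=df.index)

print(via_eq)
```

All three give the same counts; the numpy version skips index alignment, so it relies on `values_series` being in the same row order as `df`.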

How to get the (relative) place of values in a dataframe when sorted using Python?

How can I create a Pandas DataFrame that shows the relative position of each value, when those values are sorted from low to high for each column?
So in this case, how can you transform 'df' into 'dfOut'?
import pandas as pd
import numpy as np
#create DataFrame
df = pd.DataFrame({'A': [12, 18, 9, 21, 24, 15],
'B': [18, 22, 19, 14, 14, 11],
'C': [5, 7, 7, 9, 12, 9]})
# How to assign a value to the order in the column, when sorted from low to high?
dfOut = pd.DataFrame({'A': [2, 4, 1, 5, 6, 3],
'B': [3, 5, 4, 2, 2, 1],
'C': [1, 2, 2, 3, 4, 3]})
If you need to map the same values to the same output, try using the rank method of a DataFrame. Like this:
>> dfOut = df.rank(method="dense").astype(int) # Type transformation added to match your output
>> dfOut
A B C
0 2 3 1
1 4 5 2
2 1 4 2
3 5 2 3
4 6 2 4
5 3 1 3
The rank method computes the rank for each column following a specific criterion. According to the pandas documentation, the "dense" method ensures that "rank always increases by 1 between groups", which might match your use case.
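To illustrate how the choice of method affects tied values, here is a small sketch on column C of the question. Only three of the available methods are compared, and the name `methods` is mine.

```python
import pandas as pd

# Column C from the question: 7 and 9 each appear twice.
c = pd.Series([5, 7, 7, 9, 12, 9], name="C")

# One column of ranks per tie-breaking method.
methods = pd.DataFrame({m: c.rank(method=m) for m in ["dense", "min", "first"]})
print(methods)
```

Only "dense" gives tied values the same rank without leaving gaps afterwards, which is the pattern in the expected dfOut.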
Original answer: If repeated numbers are not required to map to the same output value, np.argsort can be applied to each column to retrieve the indices that would sort the column. Combine this with the apply method of a DataFrame to apply the function to each column, and you have this:
>> dfOut = df.apply(lambda column: np.argsort(column.values))
>> dfOut
A B C
0 2 5 0
1 0 3 1
2 5 4 2
3 1 0 3
4 3 2 5
5 4 1 4
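Note that np.argsort returns the sorting order (which row comes first, second, and so on), not the rank of each row. If ranks are wanted, the permutation has to be inverted. The following is a sketch under that reading; `ordinal_rank` and `dfRank` are hypothetical names, and ties are broken by order of appearance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [12, 18, 9, 21, 24, 15],
                   'B': [18, 22, 19, 14, 14, 11],
                   'C': [5, 7, 7, 9, 12, 9]})

def ordinal_rank(column):
    # order[k] is the position of the k-th smallest value.
    order = np.argsort(column.to_numpy(), kind="stable")
    # Invert the permutation: the row at order[k] gets rank k + 1.
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return pd.Series(ranks, index=column.index)

dfRank = df.apply(ordinal_rank)
print(dfRank)
```

This gives the same numbers as df.rank(method="first"), so tied values still receive different ranks.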
Here is my attempt using some functions:
def sorted_idx(l, num):
    # 1-based position of num among the sorted unique values of l
    x = sorted(list(set(l)))
    for i in range(len(x)):
        if x[i]==num:
            return i+1
def output_list(l):
    ret = [sorted_idx(l, elem) for elem in l]
    return ret
dfOut = df.apply(lambda column: output_list(column))
print(dfOut)
I reduce the original list to its unique values and then sort them. Finally, for each element in the original list, I return the index+1 at which it appears in this unique, sorted list, which gives the values in your expected output.
Output:
A B C
0 2 3 1
1 4 5 2
2 1 4 2
3 5 2 3
4 6 2 4
5 3 1 3

How to find all the zero cells in a python pandas dataframe and replace them?

My data is like this:
df = pd.DataFrame({'a': [5,0,0, 6, 0, 0, 0 , 12]})
I want to count the zeros above the 6 and replace them with 6/(count+1) = 6/3 = 2 (I will also replace the original 6).
I also want to do a similar thing with the zeros above the 12.
So, 12/(count+1) = 12/4 = 3.
So the final result will be:
[5,2,2, 2, 3, 3, 3 , 3]
I am not sure how to start. Are there any functions that do this?
Thanks.
Use GroupBy.transform with mean on custom groups. The groups are created by testing for values not equal to 0, reversing the order, taking the cumulative sum, and reversing back to the original order:
g = df['a'].ne(0).iloc[::-1].cumsum().iloc[::-1]
df['b'] = df.groupby(g)['a'].transform('mean')
print (df)
    a    b
0   5  5.0
1   0  2.0
2   0  2.0
3   6  2.0
4   0  3.0
5   0  3.0
6   0  3.0
7  12  3.0
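To make the grouping step easier to follow, here is a sketch that prints the group keys before averaging. The intermediate name `nonzero` is mine; the rest follows the answer.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 0, 0, 6, 0, 0, 0, 12]})

# True at every non-zero value, i.e. at the end of each block.
nonzero = df['a'].ne(0)

# Cumulative sum from the bottom up: each non-zero value gets the same
# key as the run of zeros directly above it.
g = nonzero.iloc[::-1].cumsum().iloc[::-1]
print(g.tolist())

# The mean of each block is value / (number of zeros + 1).
df['b'] = df.groupby(g)['a'].transform('mean')
print(df['b'].tolist())
```

The keys come out as one group for the 5, one for the two zeros plus the 6, and one for the three zeros plus the 12.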
