For example, I have the following:
   a  b  benchmark
0  1  2  1
1  1  5  3
and I would like to apply a condition in pandas to each column, like this:
def f(x):
    if x > benchmark:
        # x being the value of a or b
        return x
    else:
        return 0
But I don't know how to do that. If I use df.apply(f), I can't access the other cells in the row, since x is just the value of a single cell.
I don't want to create a new column either. I want to change the value of each cell in place as I compare it to benchmark, zeroing out the cells that do not meet the benchmark.
Any insight?
You don't need a function; instead, use vectorized operations:
out = df.where(df.gt(df['benchmark'], axis=0), 0)
To change the values in place:
df[df.le(df['benchmark'], axis=0)] = 0
Output:
   a  b  benchmark
0  0  2  0
1  0  5  0
If you don't want to affect benchmark:
m = df.le(df['benchmark'], axis=0)
m['benchmark'] = False
df[m] = 0
Output:
   a  b  benchmark
0  0  2  1
1  0  5  3
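For reference, here is a small self-contained sketch of both variants, rebuilt from the sample data in the question:

import pandas as pd

df = pd.DataFrame({'a': [1, 1], 'b': [2, 5], 'benchmark': [1, 3]})

# non-destructive: keep values strictly greater than the row's benchmark, else 0
out = df.where(df.gt(df['benchmark'], axis=0), 0)

# in-place, leaving the benchmark column untouched
m = df.le(df['benchmark'], axis=0)  # True where the value fails the benchmark
m['benchmark'] = False              # never overwrite the benchmark column itself
df[m] = 0
print(df)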
Related
I need to check whether the last value of dataframe['position'] is different from 0; if so, count backwards through the previous values until the value changes, and store that count in a variable, without for loops (using loc or iloc, for example).
dataframe:

   position
0  1
1  0
2  1  <4
3  1  <3
4  1  <2
5  1  <1

count = 4
I achieved this with a for loop, but I need to avoid it:
count = 1
if data['position'].iloc[-1] != 0:
    for i in data['position']:
        if data['position'].iloc[-count] == data['position'].iloc[-1]:
            count = count + 1
        else:
            break
    if data['position'].iloc[-count] != data['position'].iloc[-1]:
        count = count - 1
You can reverse your Series, convert it to boolean using the target condition (here "not equal 0", with ne), apply cummin to propagate the first False backwards, and sum to count the trailing True values:
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
Output: 4
If you have multiple columns:
counts = df.loc[::-1].ne(0).cummin().sum()
Alternative
A slightly faster alternative (~25% faster), relying on the assumptions that there is at least one zero and that the index has no duplicates, is to find the last zero and use indexing:
m = df['position'].eq(0)
count = len(df.loc[m[m].index[-1]:])-1
Without the requirement to have at least one zero:
m = df['position'].eq(0)
m = m[m]
count = len(df) if m.empty else len(df.loc[m.index[-1]:])-1
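For reference, a quick self-contained check of both approaches, rebuilding the question's position column:

import pandas as pd

df = pd.DataFrame({'position': [1, 0, 1, 1, 1, 1]})

# cummin approach: reverse, flag non-zero values, keep only the trailing run, sum it
count = df.loc[::-1, 'position'].ne(0).cummin().sum()
print(count)  # 4

# last-zero approach (assumes at least one zero and a non-duplicated index)
m = df['position'].eq(0)
count = len(df.loc[m[m].index[-1]:]) - 1
print(count)  # 4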
This should do the trick:
((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
This builds a condition ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])) indicating whether the value in each row (counting backwards from the end) is nonzero and equals the last value. Then, the values are coerced into 0 or 1 and the cumulative product is taken, so that the first non-matching zero will break the sequence and all subsequent values will be zero. Then the flags are summed to get the count of these consecutive matched values.
Depending on your data, though, stepping iteratively backwards from the end may be faster. This solution is vectorized, but it requires working with the entire column of data and doing several computations which are the same size as the original series.
Example:
In [12]: data = pd.DataFrame(np.random.randint(0, 3, size=(10, 5)), columns=list('ABCDE'))
...: data
Out[12]:
A B C D E
0 2 0 1 2 0
1 1 0 1 2 1
2 2 1 2 1 0
3 1 0 1 2 2
4 1 1 0 0 2
5 2 2 1 0 2
6 2 1 1 2 2
7 0 0 0 1 0
8 2 2 0 0 1
9 2 0 0 2 1
In [13]: ((data.iloc[-1] != 0) & (data[::-1] == data.iloc[-1])).cumprod().sum()
Out[13]:
A 2
B 0
C 0
D 1
E 2
dtype: int64
I am currently trying to get hands-on with pandas DataFrames. I have constructed a certain matrix which looks like this:
x y z
A 1 0 1
B 1 1 0
C 1 0 0
D 0 1 0
What I want to have is this (for each cell = 1, append the column name to the result per row):
A x,z
B x,y
C x
D y
My current best solution iterates over the columns in a for loop, gets all columns with a value > 0, extracts the column names and then passes it on to my next function. However, since I have a lot of columns (>1000) the for loop is very slow and I am sure there is a better way which I cannot figure out. Can you give me a helping hand?
If there are only 1 and 0 values, use matrix multiplication (DataFrame.dot) with the column names, then remove the trailing separator with Series.str.rstrip:
df['new'] = df.dot(df.columns + ',').str.rstrip(',')
print (df)
x y z new
A 1 0 1 x,z
B 1 1 0 x,y
C 1 0 0 x
D 0 1 0 y
If other integer values are possible and you need to test for values greater than 0, use DataFrame.gt first:
df['new'] = df.gt(0).dot(df.columns + ',').str.rstrip(',')
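For reference, a minimal runnable sketch of the dot trick on the question's matrix:

import pandas as pd

df = pd.DataFrame({'x': [1, 1, 1, 0], 'y': [0, 1, 0, 1], 'z': [1, 0, 0, 0]},
                  index=list('ABCD'))

# each row is multiplied against the column names (plus a trailing comma),
# so only the names of the 1-columns survive; the extra comma is stripped
df['new'] = df.dot(df.columns + ',').str.rstrip(',')
print(df)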
I can't find a similar question for this query. However, I have a pandas DataFrame where I want to use two of the columns to build a condition and, if it's true, replace the values in one of those columns.
For example, one of my columns is 'itemname' and the other is 'value'.
The 'itemname' may be repeated many times. For each 'itemname', I want to check whether all items with that name have value 0, and if so replace their 'value' with 100.
I know this should be simple, however I cannot get my head around it.
Just to make it clearer, here is the data frame:
itemname value
0 a 0
1 b 100
2 c 0
3 a 0
3 b 75
3 c 90
I would like my statement to change this data frame to
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
Hope that makes sense. I checked whether someone else has asked something similar and couldn't find anything for this case.
Using transform with any:
df.loc[~df.groupby('itemname').value.transform('any'), 'value'] = 100
Using numpy.where:
s = ~df.groupby('itemname').value.transform('any')
df.assign(value=np.where(s, 100, df.value))
Using addition and multiplication:
s = ~df.groupby('itemname').value.transform('any')
df.assign(value=df.value + (100 * s))
All three produce the correct output; however, np.where and the addition/multiplication solution don't modify the DataFrame in place:
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
Explanation
~df.groupby('itemname').value.transform('any')
0 True
1 False
2 False
3 True
3 False
3 False
Name: value, dtype: bool
Since 0 is a falsy value, we can use any and negate the result to find groups where all values are equal to 0.
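For reference, a self-contained sketch of this approach end to end, rebuilding the question's data:

import pandas as pd

df = pd.DataFrame({'itemname': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'value': [0, 100, 0, 0, 75, 90]},
                  index=[0, 1, 2, 3, 3, 3])

mask = ~df.groupby('itemname').value.transform('any')  # True where the whole group is 0
df.loc[mask, 'value'] = 100
print(df)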
You can use GroupBy + transform to create a mask. Then assign via pd.DataFrame.loc and Boolean indexing:
mask = df.groupby('itemname')['value'].transform(lambda x: x.eq(0).all())
df.loc[mask.astype(bool), 'value'] = 100
print(df)
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
If all your values are positive or 0, you could use transform with 'sum' and check whether the group sum is 0:
m = (df.groupby('itemname').transform('sum') == 0)['value']
df.loc[m, 'value'] = 100
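As a quick sanity check (a sketch, valid only when no value is negative):

import pandas as pd

df = pd.DataFrame({'itemname': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'value': [0, 100, 0, 0, 75, 90]})

m = (df.groupby('itemname').transform('sum') == 0)['value']
df.loc[m, 'value'] = 100
print(df)  # group 'a' becomes 100, everything else is unchanged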
I have a boolean matrix of M x N, where M = 6000 and N = 1000
1 | 0 1 0 0 0 1  ----> 1000 columns
2 | 1 0 1 0 1 0  ----> 1000 columns
3 | 0 0 1 1 0 0  ----> 1000 columns
|
v
6000 rows
Now for each column, I want to find the first occurrence where the value is 1. For the above example, across the 6 columns shown, I want 2 1 2 3 2 1.
Now the code I have is
sig_matrix = list()
num_columns = df.columns
for col_name in num_columns:
    print('Processing column {}'.format(col_name))
    sig_index = df.filter(df[col_name] == 1).\
        select('perm').limit(1).collect()[0]['perm']
    sig_matrix.append(sig_index)
The above code is really slow; it takes 5-7 minutes for me to parse 1000 columns. Is there any faster way to do this? I am also willing to use a pandas DataFrame instead of a PySpark DataFrame if that is faster.
Here is a numpy version that runs <1s for me, so should be preferable for this size of data:
import numpy as np

arr = np.random.choice([0, 1], size=(6000, 1000))
[np.argwhere(arr[:, i] == 1)[0][0] for i in range(1000)]
There could well be more efficient numpy solutions.
I ended up solving my problem using numpy. Here is how I did it.
import numpy as np

sig_matrix = list()
columns = list(df)
for col_name in columns:
    sig_index = np.argmax(df[col_name]) + 1
    sig_matrix.append(sig_index)
As the values in my columns are 0 and 1, argmax will return the first occurrence of value 1.
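For reference, the same idea can be vectorized across all columns at once (a sketch; note that argmax also returns 0 for an all-zero column, so such columns would need a separate check):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([0, 1], size=(6000, 1000)))
sig_matrix = df.values.argmax(axis=0) + 1  # 1-based row of the first 1 in each column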
I have a Series that looks the following:
col
0 B
1 B
2 A
3 A
4 A
5 B
It's a time series, therefore the index is ordered by time.
For each row, I'd like to count how many times the value has appeared consecutively, i.e.:
Output:
col count
0 B 1
1 B 2
2 A 1 # Value does not match previous row => reset counter to 1
3 A 2
4 A 3
5 B 1 # Value does not match previous row => reset counter to 1
I found 2 related questions, but I can't figure out how to "write" that information as a new column in the DataFrame, for each row (as above). Using rolling_apply does not work well.
Related:
Counting consecutive events on pandas dataframe by their index
Finding consecutive segments in a pandas data frame
I think there is a nice way to combine the solutions of @chrisb and @CodeShaman (as was pointed out, CodeShaman's solution counts total, not consecutive, values).
df['count'] = df.groupby((df['col'] != df['col'].shift(1)).cumsum()).cumcount()+1
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
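For reference, a sketch showing the intermediate grouping key: each run of equal values gets its own block id, and cumcount numbers the rows within each block:

import pandas as pd

df = pd.DataFrame({'col': list('BBAAAB')})
blocks = (df['col'] != df['col'].shift(1)).cumsum()  # B,B -> 1; A,A,A -> 2; B -> 3
df['count'] = df.groupby(blocks).cumcount() + 1
print(df)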
One-liner:
df['count'] = df.groupby('col').cumcount()
or
df['count'] = df.groupby('col').cumcount() + 1
if you want the counts to begin at 1.
Based on the second answer you linked, assuming s is your series.
df = pd.DataFrame(s)
df['block'] = (df['col'] != df['col'].shift(1)).astype(int).cumsum()
df['count'] = df.groupby('block').transform(lambda x: range(1, len(x) + 1))
In [88]: df
Out[88]:
col block count
0 B 1 1
1 B 1 2
2 A 2 1
3 A 2 2
4 A 2 3
5 B 3 1
I like the answer by @chrisb but wanted to share my own solution, since some people might find it more readable and easier to adapt to similar problems.
1) Create a function that uses static variables
def rolling_count(val):
    if val == rolling_count.previous:
        rolling_count.count += 1
    else:
        rolling_count.previous = val
        rolling_count.count = 1
    return rolling_count.count

rolling_count.count = 0        # static variable
rolling_count.previous = None  # static variable
2) Apply it to your Series after converting it to a DataFrame
df = pd.DataFrame(s)
df['count'] = df['col'].apply(rolling_count) #new column in dataframe
output of df
col count
0 B 1
1 B 2
2 A 1
3 A 2
4 A 3
5 B 1
If you wish to do the same thing but base the consecutive count on two columns, you can use this:
def count_consecutive_items_n_cols(df, col_name_list, output_col):
    cum_sum_list = [
        (df[col_name] != df[col_name].shift(1)).cumsum().tolist()
        for col_name in col_name_list
    ]
    df[output_col] = df.groupby(
        ["_".join(map(str, x)) for x in zip(*cum_sum_list)]
    ).cumcount() + 1
    return df
col_a col_b count
0 1 B 1
1 1 B 2
2 1 A 1
3 2 A 1
4 2 A 2
5 2 B 1
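For reference, a hypothetical usage example matching the output above (column names assumed):

import pandas as pd

df = pd.DataFrame({'col_a': [1, 1, 1, 2, 2, 2],
                   'col_b': list('BBAAAB')})
df = count_consecutive_items_n_cols(df, ['col_a', 'col_b'], 'count')
print(df)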