Currently I am creating a series of columns that each contain a boolean flag based on a date column in the DataFrame I am using:
df['bool1'] = [1 if x > pd.to_datetime('20190731') else 0 for x in df['date']]
df['bool2'] = [1 if x > pd.to_datetime('20190803') else 0 for x in df['date']]
df['bool3'] = [1 if x > pd.to_datetime('20190813') else 0 for x in df['date']]
I figured that a list comprehension like this is a Pythonic way of solving the problem. I feel my code is very clear about what it is doing, and somebody could easily follow it.
There is a potential improvement to be made by, say, creating a dictionary such as {bool1: '20190731'} and then looping through the key:value pairs so that I don't repeat the line of code. But this will only reduce the line count while increasing readability and scalability; it won't actually make my code run faster.
However, my problem is that this code is actually very slow to run. Should I be using a lambda function to speed it up? What is the fastest way to write this code?
I think a dictionary of new column names with the values to compare against is a nice idea:
d = {'bool1':'20190731', 'bool2':'20190803', 'bool3':'20190813'}
Then it is possible to create the new columns in a loop:
for k, v in d.items():
df[k] = (df['date'] > pd.to_datetime(v)).astype(int)
#alternative
#df[k] = np.where(df['date'] > pd.to_datetime(v), 1, 0)
To improve performance, use broadcasting in NumPy:
rng = pd.date_range('20190731', periods=20)
df = pd.DataFrame({'date': rng})
d = {'bool1':'20190731', 'bool2':'20190803', 'bool3':'20190813'}
#pandas 0.24+
mask = df['date'].to_numpy()[:, None] > pd.to_datetime(list(d.values())).to_numpy()
#pandas below
#mask = df['date'].values[:, None] > pd.to_datetime(list(d.values())).values
arr = np.where(mask, 1, 0)
df = df.join(pd.DataFrame(arr, columns=d.keys()))
print (df)
date bool1 bool2 bool3
0 2019-07-31 0 0 0
1 2019-08-01 1 0 0
2 2019-08-02 1 0 0
3 2019-08-03 1 0 0
4 2019-08-04 1 1 0
5 2019-08-05 1 1 0
6 2019-08-06 1 1 0
7 2019-08-07 1 1 0
8 2019-08-08 1 1 0
9 2019-08-09 1 1 0
10 2019-08-10 1 1 0
11 2019-08-11 1 1 0
12 2019-08-12 1 1 0
13 2019-08-13 1 1 0
14 2019-08-14 1 1 1
15 2019-08-15 1 1 1
16 2019-08-16 1 1 1
17 2019-08-17 1 1 1
18 2019-08-18 1 1 1
19 2019-08-19 1 1 1
With numpy.where it should be faster:
df['bool1'] = np.where(df['date'] > pd.to_datetime('20190731'), 1, 0)
df['bool2'] = np.where(df['date'] > pd.to_datetime('20190803'), 1, 0)
df['bool3'] = np.where(df['date'] > pd.to_datetime('20190813'), 1, 0)
Related
Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A, I need to know how many consecutive following and preceding rows have an A value greater than the current row's A value but less than their own B value.
So my expected output is:
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation:
First row's next count is 2: the next values 3 and 2 are greater than 0 but less than their corresponding B values 8 and 3.
Second row's next count is 0: the next value 2 is not greater than 3.
Third row's next count is 0: the next value 9 is greater than 2 but not less than its corresponding B value.
The previous count is calculated similarly.
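For instance, row 0's next count of 2 can be reproduced with a small snippet (just an illustration of the rule on the sample frame; nxt, cond and next_count are throwaway names):
import pandas as pd

df = pd.DataFrame({'A': [0, 3, 2, 9, 1, 0, 4, 7, 3, 2],
                   'B': [9, 8, 3, 5, 5, 5, 5, 8, 0, 4]})

nxt = df.loc[1:, ['A', 'B']]                           # rows after row 0
cond = (nxt['A'] > df.loc[0, 'A']) & (nxt['A'] < nxt['B'])
next_count = cond.astype(int).cumprod().sum()          # consecutive True values from the start -> 2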
Note: I know how to solve this problem by looping with a list comprehension or using the pandas apply method, but I still wouldn't mind a clear and concise apply approach. I was looking for a more pandas-idiomatic solution.
My Solution
Here is the apply solution, which I think is inefficient. Also, as people have said, there may be no vectorized solution for this question, so as mentioned, a more efficient apply solution will be accepted.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
next_nrow = df.loc[row['index']+1:,['A', 'B']]
prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
if (next_nrow.size == 0):
return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
if (prev_nrow.size == 0):
return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output:
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output:
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need reset_index(); you can access the index with .name.
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0)
Applied different logic to get the desired value via first_false; this speeds things up significantly.
def first_false(val1, val2, A):
i = 0
for x, y in zip(val1, val2):
if A < x < y:
i += 1
else:
break
return i
def get_prev_next_count(row):
A = row['A']
next_nrow = df.loc[row.name+1:,['A', 'B']]
prev_nrow = df2.loc[row.name-1:,['A', 'B']]
if next_nrow.empty:
return 0, first_false(prev_nrow.A, prev_nrow.B, A)
if prev_nrow.empty:
return first_false(next_nrow.A, next_nrow.B, A), 0
return (first_false(next_nrow.A, next_nrow.B, A),
first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec
My code pulls a dataframe object and I'd like to mask the dataframe.
If a value is <= 15 then change it to 1, else change it to 0.
import pandas as pd
XTrain = pd.read_excel('C:\\blahblahblah.xlsx')
for each in XTrain:
if each <= 15:
each = 1
else:
each = 0
I'm coming from VBA and .NET, so I know it's not very Pythonic, but it seems super easy to me...
The code hits an error since it iterates through the df header.
So I tried to check the type:
for each in XTrain:
if isinstance(each, str) is False:
if each <= 15:
each = 1
else:
each = 0
This time it got to the final header but did not progress into the dataframe.
This makes me think I am not looping through the dataframe correctly?
I've been stumped for hours, could anyone send me a little help?
Thank you!
for each in XTrain always loops through the column names only; that's how pandas is designed.
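A quick way to see this (a tiny throwaway frame, just for illustration):
import pandas as pd

tmp = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
for each in tmp:
    print(each)   # prints 'A' then 'B' -- the column labels, not the cell values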
Pandas allows comparison/arithmetic operations with numbers directly, so you want:
# le is less than or equal to
XTrain.le(15).astype(int)
# same as
# (XTrain <= 15).astype(int)
If you really want to iterate (don't), remember that a dataframe is two dimensional. So something like this:
for index, row in df.iterrows():
for cell in row:
if cell <= 15:
# do something
# cell = 1 might not modify the cell in original dataframe
# this is a python thing and you will get used to it
else:
# do something else
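If you really did need the loop to write values back, one possibility (a sketch only; the vectorized .le approach above is still preferable) is to assign through the index with .at:
for idx, row in XTrain.iterrows():
    for col in XTrain.columns:
        val = row[col]
        if not isinstance(val, str):
            # .at writes into the original dataframe, unlike rebinding the loop variable
            XTrain.at[idx, col] = 1 if val <= 15 else 0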
Setup
df = pd.DataFrame({'A' : range(0, 20, 2), 'B' : list(range(10, 19)) + ['a']})
print(df)
A B
0 0 10
1 2 11
2 4 12
3 6 13
4 8 14
5 10 15
6 12 16
7 14 17
8 16 18
9 18 a
Solution: pd.to_numeric
Use pd.to_numeric to avoid problems with the str values, combined with DataFrame.le:
df.apply(lambda x: pd.to_numeric(x, errors='coerce')).le(15).astype(int)
Output
A B
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 0 0
If you want to keep the string values:
df2 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
new_df = df2.where(lambda x: x.isna(), df2.le(15).astype(int)).fillna(df)
print(new_df)
A B
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 0 a
Use applymap to apply the function to each element of the dataframe and lambda to write the function.
df.applymap(lambda x: x if isinstance(x, str) else 1 if x <= 15 else 0)
I have a dataframe column with 1s and 0s like this:
df['working'] =
1
1
0
0
0
1
1
0
0
1
which represents when a machine is working (1) or stopped (0). I need to classify these stops by their length: if there are n or fewer consecutive 0s, change all of them to short-stop (2); if there are more than n, change them to long-stop (3). The expected result should look like this when applied to the example with n=2:
df[['working', 'result']]=
1 1
1 1
0 3
0 3
0 3
1 1
1 1
0 2
0 2
1 1
Of course this is just an example; my df has more than 1M rows.
I tried looping through it, but it's really slow, and I also tried using this approach, but I couldn't adapt it to my problem.
Can anyone help? Thanks so much in advance.
Series.map with Series.value_counts can be used to improve performance:
n = 2
# mask of rows where working is 0
m = df['working'].eq(0)
# cumsum stays constant within each block of consecutive 0s, so it acts as a group id
s = df['working'].cumsum()[m]
# size of the 0-group each masked row belongs to
out = s.map(s.value_counts())
# default everything to 1, then set 2 or 3 on the masked rows depending on group size
df['result'] = 1
df.loc[m, 'result'] = np.where(out > n, 3, 2)
print (df)
working result
0 1 1
1 1 1
2 0 3
3 0 3
4 0 3
5 1 1
6 1 1
7 0 2
8 0 2
9 1 1
Here's one approach:
# Group id that increments wherever 'working' changes value
m = df.working.ne(df.working.shift()).cumsum()
# mask where working is 0
eq0 = df.working.eq(0)
# Get a count of consecutive 0s
count = df[eq0].groupby(m[eq0]).transform('count')
# replace 0s accordingly
df.loc[eq0, 'result'] = np.where(count > 2, 3, 2).ravel()
# fill the remaining values with 1
df['result'] = df.result.fillna(1)
print(df)
working result
0 1 1.0
1 1 1.0
2 0 3.0
3 0 3.0
4 0 3.0
5 1 1.0
6 1 1.0
7 0 2.0
8 0 2.0
9 1 1.0
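Note that the NaN-based fillna leaves result as a float column; if integer output is wanted, a final cast should work (assuming no NaNs remain at that point):
df['result'] = df['result'].astype(int)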
I want to count the number of consecutive zeros in each row of my DataFrame, shown below. Help please.
DEC JAN FEB MARCH APRIL MAY consecutive zeros
0 X X X 1 0 1 0
1 X X X 1 0 1 0
2 0 0 1 0 0 1 2
3 1 0 0 0 1 1 3
4 0 0 0 0 0 1 5
5 X 1 1 0 0 0 3
6 1 0 0 1 0 0 2
7 0 0 0 0 1 0 4
For each row, you want cumsum(1 - row) with a reset at every point where row == 1; then you take the row max.
For example
ts = pd.Series([0,0,0,0,1,1,0,0,1,1,1,0])
ts2 = 1-ts
tsgroup = ts.cumsum()
consec_0 = ts2.groupby(tsgroup).transform(pd.Series.cumsum)
consec_0.max()
will give you 4 as needed.
Write that in a function and apply it to your dataframe.
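A minimal sketch of that wrapping, assuming the month columns are already numeric 0/1 (the 'X' entries in the question's frame would need converting first, e.g. with pd.to_numeric(..., errors='coerce')); month_cols is a hypothetical list of the month column names:
import pandas as pd

def consec_zeros(ts):
    # ts is one row as a 0/1 Series; count the longest run of zeros
    ts2 = 1 - ts
    tsgroup = ts.cumsum()                  # group id increments at every 1
    return ts2.groupby(tsgroup).cumsum().max()

# df['consecutive zeros'] = df[month_cols].apply(consec_zeros, axis=1)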
Here's my two cents...
Think of all the non-zero elements as 1; then you have a binary code. All you need to do now is find the largest interval, starting with a 0, in which there is no bit flip.
We can write a function and 'apply' it with a lambda:
def len_consec_zeros(a):
a = np.array(list(a)) # convert elements to `str`
rr = np.argwhere(a == '0').ravel() # find out positions of `0`
if not rr.size: # if there are no zeros, return 0
return 0
full = np.arange(rr[0], rr[-1]+1) # get the range of spread of 0s
# get the indices where `0` was flipped to something else
diff = np.setdiff1d(full, rr)
if not diff.size: # if there are no bit flips, return the
return len(full) # size of the full range
# break the array into pieces wherever there's a bit flip
# and the result is the size of the largest chunk
pos, difs = full[0], []
for el in diff:
difs.append(el - pos)
pos = el + 1
difs.append(full[-1]+1 - pos)
# return size of the largest chunk
res = max(difs) if max(difs) != 1 else 0
return res
Now that you have this function, call it on every row...
# join all columns to get a string column
# assuming you have your data in `df`
df['concated'] = df.astype(str).apply(lambda x: ''.join(x), axis=1)
df['consecutive_zeros'] = df.concated.apply(lambda x: len_consec_zeros(x))
Here's one approach -
# Inspired by https://stackoverflow.com/a/44385183/
def pos_neg_counts(mask):
idx = np.flatnonzero(mask[1:] != mask[:-1])
if len(idx)==0: # To handle all 0s or all 1s cases
if mask[0]:
return np.array([mask.size]), np.array([0])
else:
return np.array([0]), np.array([mask.size])
else:
count = np.r_[ [idx[0]+1], idx[1:] - idx[:-1], [mask.size-1-idx[-1]] ]
if mask[0]:
return count[::2], count[1::2] # True, False counts
else:
return count[1::2], count[::2] # True, False counts
def get_consecutive_zeros(df):
arr = df.values
mask = (arr==0) | (arr=='0')
zero_count = np.array([pos_neg_counts(i)[0].max() for i in mask])
zero_count[zero_count<2] = 0
return zero_count
Sample run -
In [272]: df
Out[272]:
DEC JAN FEB MARCH APRIL MAY
0 X X X 1 0 1
1 X X X 1 0 1
2 0 0 1 0 0 1
3 1 0 0 0 1 1
4 0 0 0 0 0 1
5 X 1 1 0 0 0
6 1 0 0 1 0 0
7 0 0 0 0 1 0
In [273]: df['consecutive_zeros'] = get_consecutive_zeros(df)
In [274]: df
Out[274]:
DEC JAN FEB MARCH APRIL MAY consecutive_zeros
0 X X X 1 0 1 0
1 X X X 1 0 1 0
2 0 0 1 0 0 1 2
3 1 0 0 0 1 1 3
4 0 0 0 0 0 1 5
5 X 1 1 0 0 0 3
6 1 0 0 1 0 0 2
7 0 0 0 0 1 0 4
What's the best way to do the following in Python/pandas, please?
I want to count the occurrences where trend data 2 steps out of line with trend data 1, and reset the counter each time trend data 1 changes.
I'm struggling with the right way to do it on the dataframe, creating a new column df['D'] in this example.
df['A'] = trend data 1
df['B'] = boolean indicator if trend data 1 changes
df['C'] = trend data 2
df['D'] = desired result
df['A'] df['B'] df['C'] df['D']
1 0 1 0
1 0 1 0
-1 1 -1 0
-1 0 -1 0
-1 0 1 1
-1 0 -1 1
-1 0 -1 1
-1 0 1 2
-1 0 1 2
-1 0 -1 2
1 1 1 0
1 0 1 0
1 0 -1 1
1 0 1 1
1 0 -1 2
1 0 1 2
1 0 1 2
In Excel I would simply use:
=IF(B2=1,0,IF(AND((C2<>C1),(C2<>A2)),D1+1,D1))
However, I've always struggled with not being able to reference prior cells in pandas.
I can't use np.where(). I'm sure it's just a matter of applying a function in the correct way, but I can't seem to make it work while referencing other columns and resetting the variable. I've looked at other answers but can't find anything that works in this situation.
Something like:
note: create df['E'] = df['C'].shift(1)
def corrections(x):
if df['B'] == 1:
x = 0
elif ((df['C'] != df['E']) AND ( df['C'] != df['A'])):
x = x + 1
else:
x
Apologies, as I feel I'm missing something rather simple with this question, but I just keep going round in circles!
def make_D(df):
    counter = 0
    array = []
    for index in df.index:
        # reset whenever trend data 1 changes, i.e. B == 1 (as in the Excel formula)
        if df.loc[index, 'B'] == 1:
            counter = 0
        # increment when C differs from the previous C and also differs from A
        elif index > 0 and df.loc[index, 'C'] != df.loc[index - 1, 'C'] and df.loc[index, 'C'] != df.loc[index, 'A']:
            counter = counter + 1
        array.append(counter)
    df['D'] = array
    return df
new_df = make_D(df)
Hope it helps!
#Set a list to store values for column D
d = []
#calculate D using the given conditions
df.apply(lambda x: d.append(0) if ((x.name==0)|(x.B==1)) else d.append(d[-1]+1) if (x.C!=df.iloc[x.name-1].C) & (x.C!=x.A) else d.append(d[-1]), axis=1)
#set column D using the values from the list d
df['D'] = d
Out[594]:
A B C D
0 1 0 1 0
1 1 0 1 0
2 -1 1 -1 0
3 -1 0 -1 0
4 -1 0 1 1
5 -1 0 -1 1
6 -1 0 -1 1
7 -1 0 1 2
8 -1 0 1 2
9 -1 0 -1 2
10 1 1 1 0
11 1 0 1 0
12 1 0 -1 1
13 1 0 1 1
14 1 0 -1 2
15 1 0 1 2
16 1 0 1 2