Iterate through data frame - python

My code pulls a dataframe object and I'd like to mask the dataframe.
If a value is <= 15, change it to 1, else change it to 0.
import pandas as pd
XTrain = pd.read_excel('C:\\blahblahblah.xlsx')
for each in XTrain:
    if each <= 15:
        each = 1
    else:
        each = 0
I'm coming from VBA and .NET, so I know it's not very pythonic, but it seems super easy to me...
The code hits an error since it iterates through the df header.
So I tried to check for the type:
for each in XTrain:
    if isinstance(each, str) is False:
        if each <= 15:
            each = 1
        else:
            each = 0
This time it got to the final header but did not progress into the dataframe.
This makes me think I am not looping through the dataframe correctly.
Been stumped for hours, could anyone send me a little help?
Thank you!

for each in XTrain always loops through the column names only; that's how pandas is designed.
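You can see this with a minimal throwaway sketch (not your data):

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
for each in df:
    print(each)  # prints 'A' then 'B' -- the column names, not the cell values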
Pandas allows comparison/arithmetic operations with numbers directly, so you want:
# le is less than or equal to
XTrain.le(15).astype(int)
# same as
# (XTrain <= 15).astype(int)
If you really want to iterate (don't), remember that a dataframe is two dimensional. So something like this:
for index, row in df.iterrows():
    for cell in row:
        if cell <= 15:
            pass  # do something
            # note: cell = 1 might not modify the cell in the original dataframe
            # this is a Python thing and you will get used to it
        else:
            pass  # do something else
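If you do need to write values back while iterating, one way is to assign through the frame itself with DataFrame.at; a minimal sketch (my addition, not part of the answer above):

import pandas as pd

df = pd.DataFrame({'A': [10, 20], 'B': [5, 30]})
for index, row in df.iterrows():
    for col in df.columns:
        # assign through the frame; rebinding the loop variable would not stick
        df.at[index, col] = 1 if row[col] <= 15 else 0
print(df)  # A becomes 1, 0 and B becomes 1, 0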

SetUp
df = pd.DataFrame({'A' : range(0, 20, 2), 'B' : list(range(10, 19)) + ['a']})
print(df)
A B
0 0 10
1 2 11
2 4 12
3 6 13
4 8 14
5 10 15
6 12 16
7 14 17
8 16 18
9 18 a
Solution: pd.to_numeric
to avoid problems with the str values, combined with DataFrame.le:
df.apply(lambda x: pd.to_numeric(x, errors='coerce')).le(15).astype(int)
Output
A B
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 0 0
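For reference, errors='coerce' turns anything that cannot be parsed as a number into NaN; a quick sketch:

import pandas as pd

s = pd.Series([10, '15', 'a'])
print(pd.to_numeric(s, errors='coerce'))  # 10.0, 15.0, NaN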
If you want to keep the string values:
df2 = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
new_df = df2.where(lambda x: x.isna(), df2.le(15).astype(int)).fillna(df)
print(new_df)
A B
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 0 a

Use applymap to apply the function to each element of the dataframe, with a lambda to write the function (note the lambda argument is x, not each):
df.applymap(lambda x: x if isinstance(x, str) else 1 if x <= 15 else 0)
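Applied to the setup frame above, this yields the same result as the where/fillna approach, keeping the string 'a' intact. Note that in pandas 2.1+ DataFrame.applymap was renamed to DataFrame.map, so df.map(...) is the forward-compatible spelling.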

Related

Count number of consecutive rows that are greater than current row value but less than the value from other column

Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For column A, I need to know how many consecutive next and previous rows have an A value greater than the current row's A value but less than their own B value.
So my expected output is:
A  B  next count  previous count
0  9  2           0
3  8  0           0
2  3  0           1
9  5  0           0
1  5  0           0
0  5  2           1
4  5  1           0
7  8  0           0
3  0  0           2
2  4  0           0
Explanation:
First row: next count is 2, since the next A values 3 and 2 are greater than 0 and less than their corresponding B values 8 and 3.
Second row: next count is 0, since the next value 2 is not greater than 3.
Third row: next count is 0, since 9 is greater than 2 but not less than its corresponding B value 5.
The previous counts are calculated in the same way.
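To make the first row concrete, here is a quick hand-check of its next count (a sketch, not part of the solutions below):

import pandas as pd

df = pd.DataFrame({'A': [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
nxt = df.loc[1:, ['A', 'B']]
mask = (nxt.A > df.A[0]) & (nxt.A < nxt.B)
print(mask.head(3).tolist())  # [True, True, False] -> two consecutive hits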
Note: I know how to solve this problem by looping, either with a list comprehension or with the pandas apply method, but I was looking for a more pandas-idiomatic approach; still, I won't mind a clear and concise apply approach.
My Solution
Here is the apply solution, which I think is inefficient. Also, as people have said, there might be no vectorized solution for the question. So, as mentioned, a more efficient apply solution will be accepted for this question.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
    next_nrow = df.loc[row['index']+1:, ['A', 'B']]
    prev_nrow = df.loc[:row['index']-1, ['A', 'B']][::-1]
    if next_nrow.size == 0:
        return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
    if prev_nrow.size == 0:
        return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
    return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(),
            ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
Generating output:
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
Output:
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
You don't need to reset_index(); you can access the index with .name
If you only pass df[['A']] instead of the whole frame, that may help.
prev_nrow.empty is the same as (prev_nrow.size == 0)
Applied different logic to get the desired value via first_false; this speeds things up significantly.
def first_false(val1, val2, A):
    i = 0
    for x, y in zip(val1, val2):
        if A < x < y:
            i += 1
        else:
            break
    return i

def get_prev_next_count(row):
    A = row['A']
    next_nrow = df.loc[row.name+1:, ['A', 'B']]
    prev_nrow = df2.loc[row.name-1:, ['A', 'B']]
    if next_nrow.empty:
        return 0, first_false(prev_nrow.A, prev_nrow.B, A)
    if prev_nrow.empty:
        return first_false(next_nrow.A, next_nrow.B, A), 0
    return (first_false(next_nrow.A, next_nrow.B, A),
            first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Timing
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
Gave up at 15 minutes.
New Method:
1m 20sec
Throw pandarallel at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')
26sec

How to create a new column based on a condition in another column

In pandas, how can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
 A   B
 5   1 (the first B_i)
12   1
14   0
22   1
20   0
33   1
Use diff with a le (less than or equal) comparison to your value, then convert from boolean to int:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB. using a le(5) comparison with inversion yields 1 for the first value, because the first diff is NaN and NaN comparisons return False.
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
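To see why the inversion handles the first row, here are the intermediates (a quick sketch on the example column):

import pandas as pd

df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33]})
print(df['A'].diff().tolist())           # [nan, 7.0, 2.0, 8.0, -2.0, 13.0]
print(df['A'].diff().le(5).tolist())     # [False, False, True, False, True, False]
print((~df['A'].diff().le(5)).tolist())  # [True, True, False, True, False, True] -> 1,1,0,1,0,1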
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5) | df['A'].lt(10)).astype(int)
Output: same as above with the provided data.
I was a little confused by your row numbering, because if B_i is based on the condition A_(i+1)-A_(i), the missing value should be on the last row instead of the first (the first row has both A_(i) and A_(i+1), while the last row is missing A_(i+1)).
Anyway, based on your example I assumed that we calculate B_(i+1).
import pandas as pd

df = pd.DataFrame(columns=["A"], data=[5, 12, 14, 22, 20, 33])

df['shifted_A'] = df['A'].shift(1)  # this column can be removed - it was added only to show how shift works on the final dataframe
df['B'] = ''

df.loc[((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10), 'B'] = 1  # update rows that fulfill one of the conditions with 1
df.loc[(df['A'] - df['A'].shift(1)) <= 5, 'B'] = 0  # update rows that fulfill the condition with 0
df.loc[df.index == 0, 'B'] = 1  # update first row in B column

print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I guess it should be one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue allows us to update values in the given column where the condition (mask) is fulfilled.
In (df['A']-df['A'].shift(1) > 5) + (df['A'].shift(1) <= 10)
each condition returns True or False. When we add them, the result is True if either is True (which is simply OR). In case we need AND, we can multiply the conditions instead.
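A quick sketch showing that addition and multiplication on boolean Series behave as OR and AND:

import pandas as pd

a = pd.Series([True, True, False, False])
b = pd.Series([True, False, True, False])
print((a + b).tolist())  # [True, True, True, False]  -> OR
print((a * b).tolist())  # [True, False, False, False] -> AND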
Use Series.diff and fill the first missing value with N, so that the first row passes the greater-or-equal comparison done with Series.ge:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1

pandas - take last N rows from one subgroup

Let's suppose we have a dataframe that can be generated using this code:
import numpy as np
import pandas as pd

d = {'p1': np.random.rand(32),
     'a1': np.random.rand(32),
     'phase': [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3, 0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3],
     'file_number': [1,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1, 2,2,2,2, 2,2,2,2, 2,2,2,2, 2,2,2,2]}
df = pd.DataFrame(d)
For each file number I want to take only the last N rows of phase number 3, keeping all other rows. For N == 2 the result keeps only the last two phase-3 rows of each file (indices 14-15 and 30-31 here).
Currently I'm doing it this way:
def phase_3_last_n_observations(df, n):
    result = []
    for fn in df['file_number'].unique():
        file_df = df[df['file_number'] == fn]
        for phase in [0, 1, 2, 3]:
            phase_df = file_df[file_df['phase'] == phase]
            if phase == 3:
                phase_df = phase_df[-n:]
            result.append(phase_df)
    df = pd.concat(result, axis=0)
    return df
phase_3_last_n_observations(df, 2)
However, it is very slow and I have terabytes of data, so I need to worry about performance. Does anyone have any idea how to speed my solution up? Thanks!
Filter the rows where phase is 3, then groupby and use tail to select the last two rows per file_number; finally concatenate with the remaining rows to get the result (pd.concat rather than DataFrame.append, which was removed in pandas 2.0):
m = df['phase'].eq(3)
pd.concat([df[~m], df[m].groupby('file_number').tail(2)]).sort_index()
p1 a1 phase file_number
0 0.223906 0.164288 0 1
1 0.214081 0.748598 0 1
2 0.567702 0.226143 0 1
3 0.695458 0.567288 0 1
4 0.760710 0.127880 1 1
5 0.592913 0.397473 1 1
6 0.721191 0.572320 1 1
7 0.047981 0.153484 1 1
8 0.598202 0.203754 2 1
9 0.296797 0.614071 2 1
10 0.961616 0.105837 2 1
11 0.237614 0.640263 2 1
14 0.500415 0.220355 3 1
15 0.968630 0.351404 3 1
16 0.065283 0.595144 0 2
17 0.308802 0.164214 0 2
18 0.668811 0.826478 0 2
19 0.888497 0.186267 0 2
20 0.199129 0.241900 1 2
21 0.345185 0.220940 1 2
22 0.389895 0.761068 1 2
23 0.343100 0.582458 1 2
24 0.182792 0.245551 2 2
25 0.503181 0.894517 2 2
26 0.144294 0.351350 2 2
27 0.157116 0.847499 2 2
30 0.194274 0.143037 3 2
31 0.542183 0.060485 3 2
I use an idea from a deleted answer: get the indices of all but the last N phase-3 rows per file with GroupBy.cumcount, and remove them with DataFrame.drop:
def phase_3_last_n_observations(df, N):
    df1 = df[df['phase'].eq(3)]
    idx = df1[df1.groupby('file_number').cumcount(ascending=False).ge(N)].index
    return df.drop(idx)

# the index is reset to the default first, because it is used to remove rows
df = phase_3_last_n_observations(df.reset_index(drop=True), 2)
As an alternative solution to what already exists: you can calculate the last elements for all phase/file groups and afterwards just use .loc to get the needed group's result. I have written the code for N == 2; if you want N == 3, use [-1, -2, -3]. Note that file_number has to be part of the grouping key, otherwise the phase-3 rows of the two files get mixed together.
result = df.groupby(['phase', 'file_number']).nth([-1, -2])
PHASE = 3
result.loc[PHASE]
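Note: in pandas 2.0+, GroupBy.nth behaves as a filter and no longer sets the group keys as the index, so the result.loc[PHASE] step assumes an older pandas version; on recent versions, select the phase-3 rows from result by its phase column instead.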

Pandas: sum column until condition met in other column

I need to sum the value column until I hit a break.
df = pd.DataFrame({'value': [1,2,3,4,5,6,7,8], 'break': [0,0,1,0,0,1,0,0]})
value break
0 1 0
1 2 0
2 3 1
3 4 0
4 5 0
5 6 1
6 7 0
7 8 0
Expected output
value break
0 6 1
1 15 1
I was thinking of a groupby, but I can't seem to get anywhere with it. I don't even need the break column at the end.
You're on the right track; try groupby on the reversed cumsum:
(df.groupby(df['break'][::-1].cumsum()[::-1],
            as_index=False, sort=False)
   .sum()
   .query('`break` != 0')  # remove this for full data
)
Output:
value break
0 6 1
1 15 1
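To see why the reversed cumsum works as a grouping key, here are the intermediate values (a quick sketch):

import pandas as pd

df = pd.DataFrame({'value': [1,2,3,4,5,6,7,8], 'break': [0,0,1,0,0,1,0,0]})
key = df['break'][::-1].cumsum()[::-1]
print(key.tolist())  # [2, 2, 2, 1, 1, 1, 0, 0] -- each group ends at its break row

Rows after the last break get key 0 and a break sum of 0, which is exactly what the query filters out.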

Python crossover or switch formula

I would like a formula or anything that acts like a "switch". If the column 'position' goes to 3 or above, the switch is turned on (=1). If 'position' goes above 5, the switch is turned off (=0). And if 'position' goes below 3, the switch is also turned off (=0). I have included the column 'desired' to show what I would like this new column to produce.
df = pd.DataFrame()
df['position'] = [1,2,3,4,5,6,7,8,7,6,5,4,3,2,1,2,3,4,5,4,3,2,1]
df['desired'] = [0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0]
I would use .shift() to create a column with the shifted position, so that the current and previous values are in one row. Then I can check whether it goes above 3, above 5, or below 3, and change the value assigned to the column 'desired'.
After creating the column 'desired' I drop the shifted data.
import pandas as pd

df = pd.DataFrame()
df['position'] = [1,2,3,4,5,6,7,8,7,6,5,4,3,2,1,2,3,4,5,4,3,2,1]
#df['desired'] = [0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0]

df['previous'] = df['position'].shift()

# ---

value = 0

def change(row):
    global value
    #print(row)
    if (row['previous'] < 3) and (row['position'] >= 3):
        value = 1
    if (row['previous'] >= 3) and (row['position'] < 3):
        value = 0
    if (row['previous'] <= 5) and (row['position'] > 5):
        value = 0
    return value

# ---

#for ind, row in df.iterrows():
#    print(int(row['position']), change(row))

df['desired'] = df.apply(change, axis=1)
df = df.drop('previous', axis=1)  # assign back, otherwise 'previous' stays in df

print(df)
Result
position desired
0 1 0
1 2 0
2 3 1
3 4 1
4 5 1
5 6 0
6 7 0
7 8 0
8 7 0
9 6 0
10 5 0
11 4 0
12 3 0
13 2 0
14 1 0
15 2 0
16 3 1
17 4 1
18 5 1
19 4 1
20 3 1
21 2 0
22 1 0
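For large frames, the same switch can be written without apply or global state: mark the on/off crossing events, then forward-fill the last event. A vectorized sketch of my own (same on/off rules as above, not from the answer itself):

import numpy as np
import pandas as pd

df = pd.DataFrame({'position': [1,2,3,4,5,6,7,8,7,6,5,4,3,2,1,2,3,4,5,4,3,2,1]})
prev = df['position'].shift()

on = (prev < 3) & (df['position'] >= 3)  # crossed up through 3
off = ((prev <= 5) & (df['position'] > 5)) | ((prev >= 3) & (df['position'] < 3))  # crossed above 5 or below 3

# mark the events, carry the last event forward, default to off
df['desired'] = np.select([on, off], [1.0, 0.0], default=np.nan)
df['desired'] = df['desired'].ffill().fillna(0).astype(int)
print(df['desired'].tolist())  # matches the 'desired' column above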
