I am trying to write a function that enables me to do some arithmetic iteratively on a subset of rows when a condition is met in another column. My DataFrame looks like this:
          Value     store  flag
0     16051.249         0     0
36    16140.792  0.019822     0
0     16150.500       AAA     1
37    16155.223   1.24698     0
1     16199.700       BBB     1
38    16235.732   1.90162     0
41    16252.594   2.15627     0
2     16256.300       CCC     1
42    16260.678   2.15627     0
1048  17071.513   14.7752     0
3     17071.600       DDD     1
1049  17072.347   14.7752     0
1391  17134.538   16.7026     0
4     17134.600       EEE     1
1392  17134.635   16.7026     0
1675  17227.600   19.4348     0
5     17227.800       EFG     1
1676  17228.796   19.4348     0
1722  17262.189   20.5822     0
6     17264.300       XYZ     1
1723  17266.625   20.6702     0
2630  17442.770   32.7927     0
7     17442.800       ZZZ     1
2631  17442.951   32.7927     0
3068  17517.492   37.6485     0
8     17517.500       TTT     1
3069  17518.296   37.6485     0
3295  17565.776   38.2871     0
9     17565.800       SDF     1
3296  17565.888   38.2871     0
...         ...       ...   ...
I'd like to apply the following function to all rows where the flag value equals 1:
def f(x):
    return df.iloc[0, 1] + (df.iloc[2, 1] - df.iloc[0, 1]) * ((df.iloc[1, 0] - df.iloc[0, 0]) / (df.iloc[2, 0] - df.iloc[0, 0]))
and finally put the return value into a dictionary with its corresponding key; for example {'AAA': 123, 'BBB': 456, ...}.
This function requires the rows directly above and below the row where flag == 1.
I have tried to re-structure my df in a way that lets me use a rolling window with my function, i.e.:
idx = (df['flag'] == "1").fillna(False)
idx |= idx.shift(1) | idx.shift(2)
idx |= idx.shift(-1) | idx.shift(-2)
df = df[idx]
df.rolling(window=3, min_periods=1).apply(f)[::3].reset_index(drop=True)
but this doesn't work!
Since the function is location dependent, I am not sure how to apply it to every triplet of rows where the flag value is 1. Any suggestion is much appreciated!
IIUC, your calculation can be handled directly at the column level of the df; there is no need to apply a function to specific rows.
# convert to numeric so that the column can be used for arithmetic calculations
df['store2'] = pd.to_numeric(df.store, errors='coerce')
# calculate the f(x) based on 'Value' and 'store2' column
df['result'] = df.store2.shift(1) + (df.store2.shift(-1) - df.store2.shift(1))*(df.Value - df.Value.shift(1))/(df.Value.shift(-1) - df.Value.shift(1))
# export the resultset:
df.loc[df.flag==1,['store','result']].set_index('store')['result'].to_json()
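Since the question asks for a dictionary, the same result set can also be returned as a plain Python dict instead of JSON (Series.to_dict is a standard pandas method):
# same selection as above, but as a dict, e.g. {'AAA': 123, 'BBB': 456, ...}
df.loc[df.flag == 1, ['store', 'result']].set_index('store')['result'].to_dict()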
just keep the state and use apply:
zero_vals = []

def func(row):
    if row.flag == 0:
        zero_vals.append(row)
    elif row.flag == 1:
        # do math here using previous rows of data and current row
        zero_vals.clear()
    else:
        raise ValueError('unexpected flag value')
then it's just:
df.apply(func, axis=1)
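If the math in question is the interpolation from the question's f(), a plain positional loop is another way to build the requested dictionary. This is only a minimal sketch: it assumes the columns are named Value, store and flag as shown, and that no flagged row is first, last, or adjacent to another flagged row.
import numpy as np

results = {}
for pos in np.flatnonzero(df['flag'].astype(str) == '1'):
    prev_row, cur_row, next_row = df.iloc[pos - 1], df.iloc[pos], df.iloc[pos + 1]
    s0, s2 = float(prev_row['store']), float(next_row['store'])  # numeric neighbours
    v0, v1, v2 = prev_row['Value'], cur_row['Value'], next_row['Value']
    # same formula as f(): linear interpolation of store at the flagged Value
    results[cur_row['store']] = s0 + (s2 - s0) * (v1 - v0) / (v2 - v0)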
-) I'm working on an automation task in Python wherein, in each row, the first negative value should be added to the first non-negative value from the left. The result should then replace the positive value, and 0 should replace the negative value.
-) This process should continue until the entire row contains all negative or all positive values.
CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC            -2          23          2           3            2            2        -1

(>360Days) + (180-360Days)
-1 + 2

CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC            -2          23          2           3            2            1         0

(<30Days) + (180-360Days)
-2 + 1

CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC             0          23          2           3            2           -1         0

(180-360Days) + (120-180Days)
-1 + 2

CUSTOMER  <30Days  31-60 Days  61-90Days  91-120Days  120-180Days  180-360Days  >360Days
ABC             0          23          2           3            1            0         0
Check this code:
import pandas as pd

# Enter the data (DataFrame.append is deprecated, so build the frame directly)
new_row = {'CUSTOMER': 'ABC', '<30Days': -2, '31-60 Days': 23, '61-90Days': 2,
           '91-120Days': 3, '120-180Days': 2, '180-360Days': 2, '>360Days': -1}
df = pd.DataFrame([new_row])

# Keep the column order as per the requirement
df = df[['CUSTOMER', '<30Days', '31-60 Days', '61-90Days', '91-120Days',
         '120-180Days', '180-360Days', '>360Days']]

# Take the column names and reverse the order
ls = list(df.columns)
ls.reverse()
# Remove the non-integer column
ls.remove('CUSTOMER')

# Initialize variables
flag1 = 1
flag = 0
new_ls = []
new_ls_index = []

for j in range(len(df)):
    while flag1 != 0:
        # Perform the logic: remember the first negative value, then the first
        # non-negative value after it, then fold them together
        for i in ls:
            if int(df[i][j]) < 0 and flag == 0:
                new_ls.append(int(df[i][j]))
                new_ls_index.append(i)
                flag = 1
            elif flag == 1 and int(df[i][j]) >= 0:
                new_ls.append(int(df[i][j]))
                new_ls_index.append(i)
                flag = 2
            elif flag == 2:
                df[new_ls_index[1]] = new_ls[0] + new_ls[1]
                df[new_ls_index[0]] = 0
                flag = 0
                new_ls = []
                new_ls_index = []
        # Check whether all values in the row are either positive or negative
        if new_ls == []:
            new_ls_neg = []
            new_ls_pos = []
            for i in ls:
                if int(df[i][j]) < 0:
                    new_ls_neg.append(int(df[i][j]))
                if int(df[i][j]) >= 0:
                    new_ls_pos.append(int(df[i][j]))
            if len(new_ls_neg) == len(ls) or len(new_ls_pos) == len(ls):
                flag1 = 0  # Set flag to stop the loop
I have a pandas data frame that looks like:
Index Activity
0 0
1 0
2 1
3 1
4 1
5 0
...
1167 1
1168 0
1169 0
I want to count how many times it changes from 0 to 1 and when it changes from 1 to 0, but I do not want to count how many 1's or 0's there are.
For example, if I only wanted to count index 0 to 5, the count for 0 to 1 would be one.
How would I go about this?
This is a simple approach that can also tell you the index value when the change happens. Just add the index to a list.
c_1to0 = 0
c_0to1 = 0
for i in range(0, df.shape[0] - 1):
    if df.iloc[i]['Activity'] == 0 and df.iloc[i+1]['Activity'] == 1:
        c_0to1 += 1
    elif df.iloc[i]['Activity'] == 1 and df.iloc[i+1]['Activity'] == 0:
        c_1to0 += 1
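For larger frames, the same counts can be computed without an explicit loop using diff — a minimal vectorized sketch, assuming Activity contains only 0s and 1s:
# diff is 1 at a 0->1 step and -1 at a 1->0 step
changes = df['Activity'].diff()
c_0to1 = int((changes == 1).sum())
c_1to0 = int((changes == -1).sum())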
I want to find the starting index and ending index of every piece of data chunk in the dataset.
The data is like:
index     A  wanted_column1  wanted_column2
2000/1/1  0                  0
2000/1/2  1  2000/1/2        1
2000/1/3  1                  1
2000/1/4  1                  1
2000/1/5  0                  0
2000/1/6  1  2000/1/6        2
2000/1/7  1                  2
2000/1/8  1                  2
2000/1/9  0                  0
As shown in the data, index and A are the given columns and wanted_column1 and wanted_column2 are what I want to get.
The idea is that the data contains distinct chunks of consecutive rows. I want to record the starting index of every chunk and keep a running count of how many chunks have appeared so far.
I tried to use shift(-1), but with it I could not distinguish a chunk's starting index from its ending index.
Is this what you need?
df['change'] = df['A'].diff().eq(1)
df['wanted_column1'] = df[['index','change']].apply(lambda x: x[0] if x[1] else None, axis=1)
df['wanted_column2'] = df['change'].cumsum()
df['wanted_column2'] = df[['wanted_column2','A']].apply(lambda x: 0 if x[1]==0 else x[0], axis=1)
df.drop('change', axis=1, inplace=True)
That yields:
index A wanted_column1 wanted_column2
0 2000/1/1 0 None 0
1 2000/1/2 1 2000/1/2 1
2 2000/1/3 1 None 1
3 2000/1/4 1 None 1
4 2000/1/5 0 None 0
5 2000/1/6 1 2000/1/6 2
6 2000/1/7 1 None 2
7 2000/1/8 1 None 2
8   2000/1/9  1  None            0
Edit: performance comparison
gehbiszumeis's solution: 19.9 ms
my solution: 4.07 ms
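The two row-wise apply calls can also be avoided entirely with Series.where, which should be faster still — a sketch under the same column layout as above:
change = df['A'].diff().eq(1)
# start date where a chunk begins, NaN elsewhere
df['wanted_column1'] = df['index'].where(change)
# running chunk counter, forced to 0 outside chunks
df['wanted_column2'] = change.cumsum().where(df['A'].ne(0), 0)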
Assuming your dataframe is df, you can find the indices where df['A'] != 1. The indices just before those are the last indices of a chunk, the ones just after are the first indices of a chunk. Afterwards you count the number of found indices to get the number of data chunks.
import pandas as pd
# Read your data
df = pd.read_csv('my_txt.txt', sep=',')
df['wanted_column1'] = None  # create the dummy columns up front
df['wanted_column2'] = None
# Find the indices right after each index where 'A' is not 1, unless it is the
# last value of the dataframe
first = [x + 1 for x in df[df['A'] != 1].index.values if x != len(df)-1]
# Find the indices right before each index where 'A' is not 1, unless it is the
# first value of the dataframe
last = [x - 1 for x in df[df['A'] != 1].index.values if x != 0]
# Set the first indices of each chunk at its corresponding position in your dataframe
df.loc[first, 'wanted_column1'] = df.loc[first, 'index']
# You can also set the last indices of each chunk (you only mentioned this in
# the text, not in your expected-result listing). Uncomment for last indices.
# df.loc[last, 'wanted_column1'] = df.loc[last, 'index']
# Count the number of chunks and fill it to wanted_column2
for i in df.index: df.loc[i, 'wanted_column2'] = sum(df.loc[:i, 'wanted_column1'].notna())
# Some polishing of the df after to match your expected result
df.loc[df['A'] != 1, 'wanted_column2'] = 0
This gives
index A wanted_column1 wanted_column2
0 2000/1/1 0 None 0
1 2000/1/2 1 2000/1/2 1
2 2000/1/3 1 None 1
3 2000/1/4 1 None 1
4 2000/1/5 0 None 0
5 2000/1/6 1 2000/1/6 2
6 2000/1/7 1 None 2
7 2000/1/8 1 None 2
8 2000/1/9 0 None 0
and works for any length of df and any number of chunks in your data
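Note that the counting loop in the second-to-last step is quadratic in the number of rows; the same running count can be taken in one pass with notna().cumsum() — a sketch, reusing the columns built above:
# cumulative count of chunk starts seen so far
df['wanted_column2'] = df['wanted_column1'].notna().cumsum()
# rows outside a chunk are still forced to 0 afterwards
df.loc[df['A'] != 1, 'wanted_column2'] = 0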
I used the code below to map the value 2 in the S column to 0, but it didn't work. Any suggestion on how to solve this?
N.B.: I want to use an external function inside the map.
df = pd.DataFrame({
    'Age': [30, 40, 50, 60, 70, 80],
    'Sex': ['F', 'M', 'M', 'F', 'M', 'F'],
    'S': [1, 1, 2, 2, 1, 2]
})

def app(value):
    for n in df['S']:
        if n == 1:
            return 1
        if n == 2:
            return 0

df["S"] = df.S.map(app)
Use eq to create a boolean series and convert that boolean series to int with astype:
df['S'] = df['S'].eq(1).astype(int)
OR
df['S'] = (df['S'] == 1).astype(int)
Output:
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
Don't use apply, simply use loc to assign the values:
df.loc[df.S.eq(2), 'S'] = 0
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you need a more performant option, use np.select. This is also more scalable, as you can always add more conditions:
df['S'] = np.select([df.S.eq(2)], [0], 1)
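For instance, adding a second (hypothetical) rule is just another entry in each list; values matching no condition fall back to the default:
# keep 1 as 1, map 2 to 0, anything else to -1 (illustrative default)
df['S'] = np.select([df.S.eq(2), df.S.eq(1)], [0, 1], default=-1)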
You're close, but you need a few corrections. Since you want to use a function, remove the for loop and test value itself: as written, app loops over df['S'] and returns on the first element it sees, so every call returns the same result regardless of value. With that fixed, either map or apply works, since both call a plain function element-wise on a Series; see this answer for how to properly use apply vs applymap vs map.
def app(value):
    if value == 1:
        return 1
    elif value == 2:
        return 0

df['S'] = df.S.apply(app)
Age Sex S
0 30 F 1
1 40 M 1
2 50 M 0
3 60 F 0
4 70 M 1
5 80 F 0
If you only wish to change values equal to 2, you can use pd.DataFrame.loc:
df.loc[df['S'] == 2, 'S'] = 0
pd.Series.apply is not recommended here, as it is just a thinly veiled, inefficient loop.
You could use .replace as follows:
df["S"] = df["S"].replace([2], 0)
This replaces all 2 values with 0 in one line.
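The dict form of replace expresses the same mapping and extends naturally to several values at once:
# equivalent, and easy to extend with more old -> new pairs
df["S"] = df["S"].replace({2: 0})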
Go with a vectorized NumPy operation:
df['S'] = np.abs(df['S'] - 2)
This works because S contains only the values 1 and 2 (1 maps to 1, 2 maps to 0), and it helps you stand out from the competition in interviews and SO answers :)
>>> df = pd.DataFrame({'Age': [30, 40, 50, 60, 70, 80],
...                    'Sex': ['F', 'M', 'M', 'F', 'M', 'F'],
...                    'S': [1, 1, 2, 2, 1, 2]})
>>> def app(value):
...     return 1 if value == 1 else 0
>>> # or: app = lambda value: 1 if value == 1 else 0
>>> df["S"] = df["S"].map(app)
>>> df
   Age  S Sex
0   30  1   F
1   40  1   M
2   50  0   M
3   60  0   F
4   70  1   M
5   80  0   F
You can do:
import numpy as np
df['S'] = np.where(df['S'] == 2, 0, df['S'])
I have a dataframe of 143999 rows which contains position and time data.
I already made a column "dt" which calculates the time difference between rows.
Now I want to create a new column which gives the dt values a group number.
So it starts with group = 0 and when dt > 60 the group number should increase by 1.
I tried the following:
def group(x):
    c = 0
    if densdata["dt"] < 60:
        densdata["group"] = c
    elif densdata["dt"] >= 60:
        c += 1
        densdata["group"] = c

densdata["group"] = densdata.apply(group, axis=1)
The error that I get is: The truth value of a Series is ambiguous.
Any ideas how to fix this problem?
This is what I want:
dt group
0.01 0
2 0
0.05 0
300 1
2 1
60 2
You can take advantage of the fact that True evaluates to 1 and use .cumsum().
import numpy as np
import pandas as pd

densdata = pd.DataFrame({'dt': np.random.randint(low=50, high=70, size=20),
                         'group': np.zeros(20, dtype=np.int32)})
print(densdata.head())
dt group
0 52 0
1 59 0
2 69 0
3 55 0
4 63 0
densdata['group'] = (densdata.dt >= 60).cumsum()
print(densdata.head())
dt group
0 52 0
1 59 0
2 69 1
3 55 1
4 63 2
If you want to guarantee that the first value of group will be 0, even if the first value of dt is >= 60, mask out the first row before taking the cumulative sum (note that replace(densdata.dt[0], np.nan) would wrongly blank every row whose dt equals the first value, not just the first row):
mask = densdata.dt >= 60
mask.iloc[0] = False
densdata['group'] = mask.cumsum()