Data:
day cost new_column
1 1 0
2 1 0
3 5 enter position
4 3 stay in position
5 10 stay in position
6 1 exit position
7 1 0
Hi, I'm wondering if there is a way to reference a previous row in a calculated column without looping and using loc/iloc. In the above example, I want to calculate this 'new_column'. Once the cost hits 5, I want to enter the position. Once I am in a position, I want to check on the next line whether I am already in a position and whether the cost is not 1; if so, I stay in the position. On the first row where the cost is 1 and the previous 'new_column' is 'stay in position', I want to set 'exit position'. Then on the next row with 1, 'new_column' should go back to zero.
How I am solving this now is by looping over each row, looking at the cost on row i and the status of new_column on row i-1, then inserting the result in new_column on row i.
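A rough sketch of that loop (the column names and the entry threshold of 5 come from the example above; the exact conditions are illustrative, not a definitive implementation):
df['new_column'] = 0
df['new_column'] = df['new_column'].astype(object)  # will hold both 0 and string labels
for i in range(len(df)):  # assumes the default RangeIndex
    prev = df.loc[i - 1, 'new_column'] if i > 0 else 0
    cost = df.loc[i, 'cost']
    if prev == 0 and cost >= 5:
        df.loc[i, 'new_column'] = 'enter position'
    elif prev in ('enter position', 'stay in position') and cost != 1:
        df.loc[i, 'new_column'] = 'stay in position'
    elif prev in ('enter position', 'stay in position') and cost == 1:
        df.loc[i, 'new_column'] = 'exit position'
    # otherwise it stays 0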
This takes a while on large data sets and I would like to find a more efficient way to do things. I looked into list(map()), but I don't see a way to reference a previous row, because I don't think that data will have been created yet to reference. Any ideas?
Thank you
Hey, as smj suggested, one option is using shift:
day = list(range(1,8))
cost = [1,1,5,3,10,1,1]
import pandas as pd
import numpy as np
df = pd.DataFrame({'day':day,'cost':cost}, columns = ['day', 'cost'])
print(df)
df['new'] = np.where(
    df['cost'] > 1,
    np.where(df['cost'].shift(-1) >= 1, 'stay', 'a'),
    np.where(df['cost'].shift() > 1, 'exit', 0)
)
print(df)
I want to split all rows into two groups that have similar means.
I have a dataframe of about 50 rows, but this could grow to several thousand, with a column of interest called 'value'.
value total bucket
300048137 3.0741 3.0741 0
352969997 2.1024 5.1765 0
abc13.com 4.5237 9.7002 0
abc7.com 5.8202 15.5204 0
abcnews.go.com 6.7270 22.2474 0
........
www.legacy.com 12.6609 263.0797 1
www.math-aids.com 10.9832 274.0629 1
So far I tried using a cumulative sum, for which the total column was created, and then essentially made the split based on where the midpoint of the total column falls, based on this solution.
test['total'] = test['value'].cumsum()
df_sum = test['value'].sum()//2
test['bucket'] = np.where(test['total'] <= df_sum, 0,1)
If I try to group them and take the average for each group, the difference is quite significant:
display(test.groupby('bucket')['value'].mean())
bucket
0 7.456262
1 10.773905
Is there a way I could achieve this partition based on means instead of sums? I was thinking about using expanding means from pandas but couldn't find a proper way to do it.
I am not sure I understand what you are trying to do, but possibly you want to group by quantiles of a column. If so:
test['bucket'] = pd.qcut(test['value'], q=2, labels=False)
which will have bucket=0 for the half of the rows with the smaller value, and 1 for the rest. By tweaking the q parameter you can have as many groups as you want (as long as it is <= the number of rows).
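For example, a four-way split on the same test frame (purely illustrative):
test['bucket'] = pd.qcut(test['value'], q=4, labels=False)
print(test.groupby('bucket')['value'].mean())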
Edit:
New attempt, now that I think I understand your aim better:
import numpy as np
df = pd.DataFrame({'value': np.arange(100)})
df['group'] = df['value'].argsort().mod(2)
df.groupby('group')['value'].mean()
# group
# 0 49.0
# 1 50.0
# Name: value, dtype: float64
df['group'] = df['value'].argsort().mod(3)
df.groupby('group')['value'].mean()
#group
# 0 49.5
# 1 49.0
# 2 50.0
# Name: value, dtype: float64
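Applied to the question's test frame, whose value column is not sorted, a rank-based variation of the same idea (my addition, not part of the original snippet) keeps the alternating assignment in value order:
test['bucket'] = test['value'].rank(method='first').astype(int).mod(2)
print(test.groupby('bucket')['value'].mean())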
I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:
data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6
#I would like to subtract 1 from PopDF['Pop2'] column cells 0-remainder.
#The remaining cells in the column I would like to stay as is (retain original pop values).
PopDF['Pop2']= PopDF['Pop2'].iloc[:(remainder)]-1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]
The first line works to subtract 1 in the correct locations; however, the remaining cells become NaN. The second line of code does not work; the error is:
ValueError: Length of values (1) does not match length of index (8)
Instead of selecting the first N rows and subtracting from them, subtract 1 from the entire column and assign back only the rows up to remainder (note that .loc slicing is inclusive of the end label):
df.loc[:remainder, 'Pop2'] = df['Pop2'] - 1
Output:
>>> df
Pop Pop2
0 728375 728374
1 733355 733354
2 695395 695394
3 734658 734657
4 732811 732810
5 789396 789395
6 727761 727760
7 751967 751967
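An equivalent single-assignment alternative (just a sketch, using the question's PopDF and the same inclusive cut-off):
PopDF['Pop2'] = PopDF['Pop'].where(PopDF.index > remainder, PopDF['Pop'] - 1)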
I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not use iteration (my current implementation is something like the code below; as you can see, using 2 loops is not optimal, and I guess I could get rid of one by using apply on each row):
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use the apply() function, vectorisation, or a list comprehension because this is not optimal?
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need in order to conditionally update the cell), and I think the core reason I can't make the vectorised or list-comprehension options work is that I do not have a small, fixed set of column names but rather a wide range (all the examples I see use a handful of named columns...).
What would be the more optimal / more elegant solution here?
OR are DataFrames not suited for my data at all? what should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This will fill all zeros in your dataframe (except for zeros occurring at the start of a column) with the previous non-zero value, per column.
If you want to do it row-wise, unfortunately the .replace() function does not accept an axis argument, but you can transpose your dataframe, replace the zeros, and transpose it again: df.T.replace(0, method='ffill').T
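Another row-wise option, sketched here under the assumption that all values are numeric (the one-row frame and its year labels below are made up for illustration), is to mask the zeros, forward-fill along the rows, and then restore the leading zeros:
import pandas as pd
df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]],
                  columns=[str(y) for y in range(1991, 2001)])
filled = df.mask(df.eq(0)).ffill(axis=1).fillna(0).astype(int)
print(filled)  # 0 0 1 1 1 3 3 3 2 1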
I've got a dataframe like this
Day,Minute,Second,Value
1,1,0,1
1,2,1,2
1,2,6,2
1,3,1,0
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2
Sometimes the sensor records an incorrect value, and the reading gets added again later with the correct value. For example, here we should delete rows two through four, since they are overridden by row five, which comes from an earlier timestamp. How do I filter out unnecessary 'bad' rows like those? For this example, the expected output should be:
Day,Minute,Second,Value
1,1,0,1
1,2,1,1
1,2,5,1
2,0,1,1
2,0,5,2
Here's pseudocode for an iterative solution:
for row in dataframe:
    for previous_row in rows of dataframe before row:
        if previous_row's timestamp > row's timestamp:
            delete previous_row
I think there should be a vectorized solution, especially for the second loop. I also don't want to modify what I'm iterating over but I'm not sure there is another option other than duplicating the dataframe.
Here is some starter code to set up the example dataframe:
import pandas as pd
data = [{'Day':1, 'Minute':1, 'Second':0, 'Value':1},
{'Day':1, 'Minute':2, 'Second':1, 'Value':2},
{'Day':1, 'Minute':2, 'Second':6, 'Value':2},
{'Day':1, 'Minute':3, 'Second':1, 'Value':0},
{'Day':1, 'Minute':2, 'Second':1, 'Value':1},
{'Day':1, 'Minute':2, 'Second':5, 'Value':1},
{'Day':2, 'Minute':0, 'Second':1, 'Value':1},
{'Day':2, 'Minute':0, 'Second':5, 'Value':2}]
df = pd.DataFrame(data)
If you have multiple rows with the same combination of Day, Minute, Second but a different Value, I am assuming you want to retain the last recorded value and discard all the previous ones, considering them "bad".
You can do this simply by using drop_duplicates:
df.drop_duplicates(subset=['Day', 'Minute', 'Second'], keep='last')
UPDATE v2:
If you need to retain the last group of ['Minute', 'Second'] combinations for each day, identify monotonically increasing Minute groups (since it's the bigger time unit of the two) and select the group with the max value of Group_Id for each ['Day']:
res = pd.DataFrame()
for _, g in df.groupby(['Day']):
    g['Group_Id'] = (g.Minute.diff() < 0).cumsum()
    res = pd.concat([res, g[g['Group_Id'] == max(g['Group_Id'].values)]])
OUTPUT:
Day Minute Second Value Group_Id
1 2 1 1 1
1 2 5 1 1
2 0 1 1 0
2 0 5 2 0
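The same selection can also be written without the explicit Python loop; a possible loop-free sketch (my own variation, not part of the original answer):
group_id = df.groupby('Day')['Minute'].transform(lambda m: (m.diff() < 0).cumsum())
last_group = group_id.groupby(df['Day']).transform('max')
res = df[group_id == last_group]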
Let's say I have the following dataframe:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
Now I want to multiply the variable Shots by a random value (multiplier in the code) and recalculate the StG variable, which is simply Shots/Goals. The code I used is:
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5+1)
    row['Shots'] *= multiplier
    row['StG'] = float(row['Shots'])/float(row['Goals'])
Then I saved the .csv and it was identical to the original one, so after the for loop I simply used print(df) to obtain:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
If I print the values row by row during the iteration I see them change, but it's like they don't get saved in the df.
I think it is because I'm simply accessing the values, not the actual dataframe.
I tried something like df.row[], but it says DataFrame has no row attribute.
Thanks for the help.
____EDIT____
for index, row in df.iterrows():
    multiplier = np.random.randint(1, 5+1)
    row['Impresions'] *= multiplier
    row['Clicks'] *= np.random.randint(1, multiplier+1)
    row['Ctr'] = float(row['Clicks'])/float(row['Impresions'])
    row['Mult'] = multiplier
    #print(row['Clicks'], row['Impresions'], row['Ctr'], row['Mult'])
The main condition is that the number of Clicks can never be higher than the number of Impressions.
Then I recalculate the Clicks/Impressions ratio as Ctr.
I am not sure whether multiplying the entire column is the best choice for maintaining the condition Impressions >= Clicks on each row, hence I went row by row.
From the pandas docs about iterrows(): pandas.DataFrame.iterrows
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
The good news is you don't need to iterate over rows - you can perform the operations on columns:
# Generate an array of random integers of same length as your DataFrame
multipliers = np.random.randint(1, 5+1, size=len(df))
# Multiply corresponding elements from df['Shots'] and multipliers
df['Shots'] *= multipliers
# Recalculate df['StG']
df['StG'] = df['Shots']/df['Goals']
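For the edited version of the question (the Impresions/Clicks columns with the per-row cap), the same column-wise idea can be extended; this is just a sketch assuming the column names from the edit:
multipliers = np.random.randint(1, 5+1, size=len(df))
# Each row's Clicks multiplier is drawn between 1 and that row's Impresions multiplier,
# so Clicks stays <= Impresions whenever it was already below before.
click_multipliers = np.random.randint(1, multipliers + 1)
df['Impresions'] *= multipliers
df['Clicks'] *= click_multipliers
df['Ctr'] = df['Clicks'] / df['Impresions']
df['Mult'] = multipliers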
Define a function that returns a series:
def f(x):
    m = np.random.randint(1, 5+1)
    return pd.Series([x.Shots * m, x.Shots/x.Goals * m])
Apply the function to the data frame row-wise; it will return another data frame, which can be used to replace some columns in the existing data frame or to create new columns:
df[['Shots', 'StG']] = df.apply(f, axis=1)
This approach is very flexible as long as the new column values depend only on other values in the same row.