access previous rows in python dataframe apply method - python

I want to create a new column comp in a pandas DataFrame containing a single column price. The value of this new column should be generated by a function that works on the current and last 3 values of price.
df.apply() works on a single row, and shift() doesn't seem to work. Do experts have any suggestions for making this work as a vectorized operation?

Use a series-summing function with apply(). The code below assumes you have an index or column named ID of increasing row values 1, 2, 3, ... that can be used to count back 3 values.
# SERIES SUM FUNCTION
def intsum(x):
    if x < 3:
        ser = df.price[(df.ID < x)]
    else:
        ser = df.price[(df.ID >= x - 3) & (df.ID < x)]
    return ser.sum()

# APPLY FUNCTION
df['comp'] = df['ID'].apply(intsum)
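For a genuinely vectorized version of "current plus last 3 values", pandas' rolling window does the job directly; a minimal sketch on toy data (the column name price matches the question, the values are made up):

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 20, 30, 40, 50]})
# Sum over the current row and up to 3 previous rows (window of 4);
# min_periods=1 avoids NaN for the first rows of the frame
df['comp'] = df['price'].rolling(window=4, min_periods=1).sum()
# df['comp'] is now [10, 30, 60, 100, 140]
```

Swap .sum() for any rolling aggregation (or .apply()) if the function on the 4 values is more complex.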

Related

'Oversampling' cartesian data in a dataframe without for loop?

I have 3D data in a pandas dataframe that I would like to 'oversample'/smooth by replacing the value at each x,y point with the average value of all the points that are within 5 units of that point. I can do it using a for loop like this (starting with a dataframe with three columns X,Y,Z):
import pandas as pd

Z_OS = []
X_OS = []
Y_OS = []
for index, row in df.iterrows():
    Z_OS += [df[(df['X'] > row['X']-5) & (df['X'] < row['X']+5) & (df['Y'] > row['Y']-5) & (df['Y'] < row['Y']+5)]['Z'].mean()]
    X_OS += [row['X']]
    Y_OS += [row['Y']]
data = {
    'X': X_OS,
    'Y': Y_OS,
    'Z': Z_OS
}
OSdf = pd.DataFrame.from_dict(data)
but this method is very slow for large datasets and feels very 'unpythonic'. How could I do this without for loops? Is it possible via complex use of the groupby function?
xy = df[['X','Y']]
df['smoothed z'] = df[['Z']].apply(
    lambda row: df['Z'][(xy - xy.loc[row.name]).abs().lt(5).all(1)].mean(),
    axis=1
)
Here I used df[['Z']] to get column 'Z' as a data frame, so that the applied function receives each row and can use its index, row.name.
.abs().lt(5).all(1) reads as: absolute values that are all less than 5 along the row.
Update
The code below is effectively the same but seems more consistent, since it addresses the index directly:
df.index.to_series().apply(lambda i: df.loc[(xy - xy.loc[i]).abs().lt(5).all(1), 'Z'].mean())
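A fully vectorized alternative (my sketch, not from the answers above) computes all pairwise axis distances at once with NumPy broadcasting; it matches the per-axis "within 5 units" box used by the loop and the apply-based answer. For very large frames a spatial index such as scipy's cKDTree would scale better, since this builds an n-by-n mask:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the OP's X, Y, Z data
df = pd.DataFrame({'X': [0, 1, 10], 'Y': [0, 1, 10], 'Z': [1.0, 2.0, 3.0]})

xy = df[['X', 'Y']].to_numpy()
# mask[i, j] is True when point j lies within 5 units of point i on both axes
mask = (np.abs(xy[:, None, :] - xy[None, :, :]) < 5).all(axis=2)
z = df['Z'].to_numpy()
# Sum of Z over each neighborhood divided by its size: the neighborhood mean
df['smoothed z'] = mask @ z / mask.sum(axis=1)
```

On this toy data the first two points smooth each other while the third, being far away, keeps its own value.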
df['column_name'].rolling(rolling_window).mean()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html

Apply function over portion of dataframe multiple times

I have a dataframe with 4 columns: "Date" (in string format), "Hour" (in string format), "Energia_Attiva_Ingresso_Delta" and "Energia_Attiva_Uscita_Delta".
Obviously for every date there are multiple hours. I'd like to calculate a column for the overall dataframe, but on a daily basis. Basically, the operation of the function must be calculated for every single date.
So, I thought I'd iterate over the values of the date column and filter the dataframe with .loc, then pass the filtered df to the function. In the function I have to re-filter the df with .loc (for the purpose of the calculation).
Here's the code I wrote. As you can see, in the function I need to operate iteratively on the row with the maximum value of 'Energia_Ingresso_Delta'; to do so I use the .loc function again:
# function
def optimize(df):
    min_index = np.argmin(df.Margine)
    max_index = np.argmax(df.Margine)
    Energia_Prelevata_Da_Rete = df[df.Margine < 0]['Margine'].sum().round(1)
    Energia_In_Eccesso = df[df.Margine > 0]['Margine'].sum().round(1)
    carico_medio = (Energia_In_Eccesso / df[df['Margine'] < 0]['Margine'].count()).round(1)
    while (Energia_In_Eccesso != 0):
        max_index = np.argmax(df.Energia_Ingresso_Delta)
        df.loc[max_index, 'Energia_Attiva_Ingresso_Delta'] = df.loc[max_index, 'Energia_Attiva_Ingresso_Delta'] + carico_medio
        Energia_In_Eccesso = (Energia_In_Eccesso - carico_medio).round(1)

# Call the function with a "partial dataframe". The dataframe is called "prova"
for items in list(prova.Data.unique()):
    optimize(prova.loc[[items]])
But I keep getting this error:
"None of [Index(['2021-05-01'], dtype='object')] are in the [index]"
Can someone help me? :)
Thanks in advance
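No answer was posted for this question here, but the error itself is explainable: prova.loc[[items]] looks the date up in the row index, while the dates live in the ordinary Data column, hence "None of [Index(['2021-05-01'], ...)] are in the [index]". A sketch of the likely fix (using a toy prova frame and a placeholder for the per-day computation) filters with a boolean mask, or equivalently groups by date:

```python
import pandas as pd

# Toy frame standing in for the OP's "prova" dataframe
prova = pd.DataFrame({
    'Data': ['2021-05-01', '2021-05-01', '2021-05-02'],
    'Margine': [1.0, -2.0, 3.0],
})

def optimize(day_df):
    # Placeholder for the per-day computation in the question
    return day_df['Margine'].sum()

# Boolean-mask version of the original loop: select rows by column value, not by index label
results = {d: optimize(prova[prova['Data'] == d]) for d in prova['Data'].unique()}

# Equivalent groupby version: pandas hands optimize one sub-frame per date
results_gb = prova.groupby('Data').apply(optimize)
```

Alternatively, prova.set_index('Data') first would make the original .loc[[items]] lookup work as written.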

using python read a column 'H' from csv and implement this function SUM(H16:H$280)/H$14*100

Using python read a column 'H' from a dataframe and implement this function:
CDF = {SUM(H1:H$266)/G$14}*100
Where:
H$266 is the last element of the column, and
G$14 is the total sum of the column H.
In sum(), the first variable iterates (H1, H2, H3 ... H266) but the last value remains the same (H$266). So the first value of CDF is obviously 100 and then it goes on decreasing downwards.
I want to implement this using dataframe.
As an example, you could do this:
from pandas import Series

s = Series([1, 2, 3])   # H1:H266 data
sum_of_s = s.sum()      # G14

def calculus(subset, total_sum):
    return subset.sum() / total_sum * 100

result = Series([calculus(s.iloc[i:], sum_of_s) for i in range(len(s))])
print(result)
You should adapt it to your dataset, but that's the basic idea. Let me know if it works.
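The loop can also be collapsed into a single vectorized expression with a reversed cumulative sum, since element i of the result needs the sum of all values from i to the end; a sketch on the same toy data:

```python
import pandas as pd

s = pd.Series([1, 2, 3])  # stands in for column H
# Reverse, cumulatively sum, reverse back: element i holds sum(s[i:])
cdf = s[::-1].cumsum()[::-1] / s.sum() * 100
```

The first value is 100 and the series decreases downwards, matching the spreadsheet formula's behaviour.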

How can I count occurrences of a string in a dataframe in Python?

I'm trying to count the number of ships in a column of a dataframe. In this case I'm trying to count the number of 77Hs. I can do it for individual elements, but actions on the whole column don't seem to work.
E.g. this works with an individual element in my dataframe:
df = pd.DataFrame({'Route':['Callais','Dover','Portsmouth'],'shipCode':[['77H','77G'],['77G'],['77H','77H']]})
df['shipCode'][2].count('77H')
But when I try to perform the action on every row using either
df['shipCode'].count('77H')
df['shipCode'].str.count('77H')
it fails with both attempts. Any help on how to code this would be much appreciated.
Thanks
What if you did something like this?
Assuming your initial dictionary...
import pandas as pd
from collections import Counter
df = pd.DataFrame(df) #where df is the dictionary defined in OP
you can generate a Counter for all of the elements in the lists in each row like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x))
output:
Route shipCode counts
0 Callais [77H, 77G] {'77H': 1, '77G': 1}
1 Dover [77G] {'77G': 1}
2 Portsmouth [77H, 77H] {'77H': 2}
or if you want one in particular, i.e. '77H', you can do something like this:
df['counts'] = df['shipCode'].apply(lambda x: Counter(x)['77H'])
output:
Route shipCode counts
0 Callais [77H, 77G] 1
1 Dover [77G] 0
2 Portsmouth [77H, 77H] 2
or even this using the first method (full Counter in each row):
[count['77H'] for count in df['counts']]
output:
[1, 0, 2]
The data frame has a shipCode column with a list of values.
First produce a True or False value to identify rows whose shipCode list contains the string '77H'.
> df['shipCode'].map(lambda val: val.count('77H') > 0)
Now filter the data frame based on those True/False values obtained in the previous step.
> df[df['shipCode'].map(lambda val: val.count('77H') > 0)]
Finally, get a count of all rows where the shipCode list contains a value matching '77H', using the built-in len function.
> len(df[df['shipCode'].map(lambda val: val.count('77H') > 0)])
Another way that makes it easy to remember what's been analyzed is to create a column in the same data frame to store the True/False value, then filter by those values. It's really the same as above but a little prettier in my opinion.
> df['filter_column'] = df['shipCode'].map(lambda val: val.count('77H') > 0)
> len(df[df['filter_column']])
Good luck and enjoy working with Python and Pandas to process your data!
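Note the OP asked for the number of 77Hs, not the number of rows containing one; Portsmouth's ['77H', '77H'] counts twice. A sketch of the total count, using the OP's frame, via per-row counts or via explode, which flattens the lists into one long Series:

```python
import pandas as pd

df = pd.DataFrame({'Route': ['Callais', 'Dover', 'Portsmouth'],
                   'shipCode': [['77H', '77G'], ['77G'], ['77H', '77H']]})

# Per-row counts summed: total occurrences of '77H' across all lists
total = df['shipCode'].apply(lambda codes: codes.count('77H')).sum()

# Equivalent one-liner: flatten the lists, then count matching elements
total_explode = df['shipCode'].explode().eq('77H').sum()
```

Both give 3 here, whereas the row-level filter above gives 2.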

Filling each row of one column of a DataFrame with different values (a random distribution)

I have a DataFrame with aprox. 4 columns and 200 rows. I created a 5th column with null values:
df['minutes'] = np.nan
Then, I want to fill each row of this new column with random inverse log normal values. The code to generate 1 inverse log normal:
Note: if the code below is run multiple times it will generate a new result each time, because of the value inside ppf(): random.random()
df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
What's happening when I do that is that it's filling all 200 rows of df['minutes'] with the same number, instead of triggering the random.random() for each row as I expected it to.
What do I have to do? I tried using a for loop, but apparently I'm not getting it right (it gives the same results):
for i in range(1, len(df)):
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
what am I doing wrong?
Also, I'll add that later I'll need to change some parameters of the inverse log normal above depending on whether the value of another column is 0 or 1, as in:
if df['type'] == 0:
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
elif df['type'] == 1:
    df['minutes'] = df['minutes'].fillna(stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int))
thanks in advance.
The problem with your use of fillna here is that this function takes a value as argument and applies it to every element along the specified axis. So your stat value is calculated once and then distributed into every row.
What you need is your function called for every element on the axis, so your argument must be the function itself and not a value. That's a job for apply, which takes a function and applies it on elements along an axis.
I'm straight jumping to your final requirements:
You could use apply on the minutes column alone (as a pandas.Series method) with a lambda function, then assign the results to the rows of minutes selected by filtering on the type column:
import numpy as np
import pandas as pd
import scipy.stats as stats
import random

# setup
df = pd.DataFrame(np.random.randint(0, 2, size=(8, 4)),
                  columns=list('ABC') + ['type'])
df['minutes'] = np.nan

df.loc[df.type == 0, 'minutes'] = \
    df['minutes'].apply(lambda _: stats.lognorm(
        0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int),
        convert_dtype=False)
df.loc[df.type == 1, 'minutes'] = \
    df['minutes'].apply(lambda _: stats.lognorm(
        1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int),
        convert_dtype=False)
... or you use apply as a DataFrame method with a function wrapping your logic to distinguish between values of type-column and assign the result back to the minutes-column:
def calc_minutes(row):
    if row['type'] == 0:
        return stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int)
    elif row['type'] == 1:
        return stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int)

df['minutes'] = df.apply(calc_minutes, axis=1)
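As a vectorized alternative (my sketch, not part of the original answer): NumPy can draw all the samples in one call, since stats.lognorm(s, scale=np.exp(mu)) describes the same distribution as NumPy's lognormal sampler with mean mu and sigma s, and np.where then picks the right draw per row by type:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'type': [0, 1, 0, 1]})  # toy frame standing in for the OP's data

# One draw per row for each parameter set, vectorized over the whole column
draws_type0 = rng.lognormal(mean=1.8, sigma=0.5, size=len(df)).astype(int)
draws_type1 = rng.lognormal(mean=2.7, sigma=1.2, size=len(df)).astype(int)
# Select per row according to the 'type' column
df['minutes'] = np.where(df['type'].eq(0), draws_type0, draws_type1)
```

This avoids both apply and the per-element ppf(random.random()) calls.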
Managed to do it in a few steps with a different mindset:
Created 2 lists, each with its own parameters, and used the list append method so that each row gets a different random number:
lognormal_tone = []
lognormal_ttwo = []
for i in range(len(s)):
    lognormal_tone.append(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
    lognormal_ttwo.append(stats.lognorm(0.4, scale=np.exp(2.7)).ppf(random.random()).astype(int))
Then, included them in the DataFrame with another previously created list:
df = pd.DataFrame({'arrival': arrival, 'minTypeOne': lognormal_tone, 'minTypeTwo': lognormal_ttwo})
