Python Pandas using lambda on index row by row - python

So I'm trying to apply a function to the index row by row, but I'm having some problems.
import math
import pandas as pd

startDate = '2015-05-01 00:00'
endDate = '2015-05-08 00:00'
idx = pd.date_range(startDate, endDate, freq="1min")
df = pd.DataFrame(columns=['F(t)'])
df = df.reindex(idx, fill_value=0)

def circadian_function(T):
    return math.cos(math.pi * (T - delta) / 12)
Everything is okay up to here, but when trying to apply the function I'm not sure what to do:
df['F(t)'] = df.index.apply(lambda x: circadian_function x[index].hour, axis=1)
Should I be using a lambda? Or just an apply?

I don't have 50 rep so I can't comment on @Ted Petrou's answer ;-;
I just wanted to say a couple of things that you should know.
If you are going to feed df.index.hour into your circadian_function, make sure you use numpy instead of math. Otherwise the interpreter will throw a TypeError (I just found out about this).
Make sure to define delta.
Example:
import numpy as np
def circadian_function(T, delta):
    return np.cos(np.pi * (T - delta) / 12)
What @Ted Petrou recommends you do in full:
df['F(t)'] = circadian_function(df.index.hour, 0.5)  # I picked an arbitrary delta
Numpy will automatically vectorize the function for you. Props to Ted I learned something new as well :>

Use apply only as a last resort. This can easily be vectorized. Make sure you define delta.
import numpy as np
df['F(t)'] = np.cos(np.pi*(idx.hour-delta)/12)
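For completeness, a minimal end-to-end sketch of the vectorized approach with delta defined (delta = 0.5 here is an arbitrary assumed value, as in the answer above):
import numpy as np
import pandas as pd

delta = 0.5  # assumed phase shift; substitute your own
idx = pd.date_range('2015-05-01 00:00', '2015-05-08 00:00', freq='1min')
df = pd.DataFrame(index=idx)
# idx.hour is an integer array, so np.cos broadcasts over the whole index at once
df['F(t)'] = np.cos(np.pi * (idx.hour - delta) / 12)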

Related

Replace unknown values (with different median values)

I have a particular problem: I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). As there is no missing information for "peak_id", I calculated the median value of the height per peak_id to be more accurate.
I would like to do two steps: 1) add a new column to my "members" dataframe holding the median value, which differs depending on the "peak_id" (the value calculated by the code below); 2) have the code check whether the value in highpoint_metres is null and, if it is, put the value of the new column there instead. I hope this is clear.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of python is very bad ;-))
I believe this is what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# per-row median of highpoint_metres within each peak_id group
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
is_highpoint_missing = np.isnan(members.highpoint_metres)
# keep the original value where present, fall back to the group median where missing
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
So one way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use the apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x==0 else x)
Let me know if this helps.
So adding a little bit more info based on Marie's question.
One way to get median is through groupby and then left join it with the original dataframe.
df_gp = df.groupby(['peak_id']).agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id')
# note: x == np.nan is always False, so use pd.isna to test for missing values
df['highpoint_metres'] = df.apply(lambda x: x['Median'] if pd.isna(x['highpoint_metres']) else x['highpoint_metres'], axis=1)
Let me know if this solves your issue

Function returns only one iteration, instead of multiple. What is wrong?

First of all, I'm a beginner and I'm having an issue with functions and return values. After that, I need to do some matrix operations to take the minimum value of the right column. However, since I cannot return these values (I could not figure out why), I'm not able to do any operations on them. The problem here is that every time I try to use return, it gives me only the first or the last row of the matrix. If you can help, I'd really appreciate it. Thanks.
import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\Yunus Özer\Downloads/MA.csv")
df.head()
x = df["x"]
def minreg():
    for k in range(2, 16):
        x_pred = np.full(x.shape, np.nan)
        for t in range(k, x.size):
            x_pred[t] = np.mean(x[(t-k):t])
        mape_value = (np.mean(np.abs(x - x_pred) / np.abs(x)) * 100)
        m = np.array([k, mape_value])
        return m
print(minreg())
The return m statement basically terminates the function and returns m. As a result, the function terminates after the first iteration of the outer loop. So, firstly, you need to call return after your loop ends. Secondly, you need to append each m value generated in the loop to an array to store them, and return that array.
import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\Yunus Özer\Downloads/MA.csv")
df.head()
x = df["x"]
def minreg():
    m_arr = []
    for k in range(2, 16):
        x_pred = np.full(x.shape, np.nan)
        for t in range(k, x.size):
            x_pred[t] = np.mean(x[(t-k):t])
        mape_value = (np.mean(np.abs(x - x_pred) / np.abs(x)) * 100)
        m_arr.append(np.array([k, mape_value]))
    return m_arr
print(minreg())
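Since the original goal was to take the minimum of the right column (the MAPE values), a small follow-up sketch on top of the returned list:
import numpy as np

m = np.array(minreg())  # shape (14, 2): column 0 holds k, column 1 the MAPE
best_row = m[np.argmin(m[:, 1])]  # the k with the lowest MAPE, plus that MAPE
print(best_row)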

How to subtract a percentage from a CSV file and then output it into another file? I'd preferably like a formula like x*.10=y

Sorry if I haven't explained things very well. I'm a complete novice, so please feel free to critique.
I've searched everywhere but I haven't found anything close to subtracting a percent. When it's done on its own (x-.10=y) it works wonderfully. The only problem is I'm trying to make 'x' stand for sample_.csv[0], that is, the numerical value from the first column, from my understanding.
import csv
import numpy as np
import pandas as pd
readdata = csv.reader(open("sample_.csv"))
x = input(sample_.csv[0])
y = input(x * .10)
print(x + y)
The column looks something like this:
"20,a,"
"25,b,"
"35,c,"
"45,d,"
I think you should only need pandas for this task. I'm guessing you want to apply this operation on one column:
import pandas as pd
df = pd.read_csv('sample_.csv') # assuming columns within csv header.
df['new_col'] = df['20,a'] * 1.1 # Faster than adding to a percentage x + 0.1x = 1.1*x
df.to_csv('new_sample.csv', index=False) # Default behavior is to write index, which I personally don't like.
BTW: input is a built-in function in Python that asks for input from the user. I'm guessing you don't want this behavior, but I could be wrong.
import pandas as pd
df = pd.read_csv("sample_.csv")
df['newcolumn'] = df['column'].apply(lambda x : x * .10)
Please try this.
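Since the sample rows ("20,a,") suggest the file has no header row, here is a sketch under that assumption (pandas then labels the columns 0, 1, 2):
import pandas as pd

df = pd.read_csv('sample_.csv', header=None)  # assumed headerless file
df['new_col'] = df[0] * 1.1  # x + 0.1*x, matching the original attempt
df.to_csv('new_sample.csv', index=False, header=False)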

Python equivalent to Spark rangeBetween for window?

I am trying to find out if there is a way in Python to do the equivalent of a rangeBetween in a rolling aggregation. In Spark, you can use rangeBetween so that the window does not have to be symmetrical around the targeted row, i.e. for each row I can look from -5h to +3h: all rows that happen between 5 hours before and 3 hours after, based on a datetime column. I know that pandas has the rolling option, but after reading all the documentation I can find on it, it looks like it only takes one input as the window. You can change whether that window is centered on each row or not, but I can't find a way to explicitly set it so it can look at a range of my choosing.
Does anyone know of another function or functionality that I am not aware of that would work to do this?
I'm not sure if it's the best answer, but it's mine and it works, so I guess it'll have to do until there is a better option. I made a Python function out of it so you can sub in whatever aggregation function you want.
import pandas as pd
from datetime import timedelta

def rolling_stat(pdf, lower_bound, upper_bound, group, statistic='mean'):
    data_agg = pd.DataFrame()
    groups = pdf[group].drop_duplicates()
    for grp in groups:
        # restrict to one group and make sure it is in time order
        dataframe_grp = pdf[pdf[group] == grp].sort_index()
        for index, row in dataframe_grp.iterrows():
            lower = index - timedelta(minutes=lower_bound)
            upper = index + timedelta(minutes=upper_bound)
            agg = dataframe_grp.loc[lower:upper]['nbr'].agg([statistic])
            dataframe_grp.at[index, 'agg'] = agg.iloc[0]
        data_agg = pd.concat([data_agg, dataframe_grp])
    return data_agg
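If iterrows proves too slow, the window bounds can also be computed in one shot with numpy's searchsorted. A sketch, assuming a sorted DatetimeIndex and the same hypothetical 'nbr' value column as above:
import numpy as np
import pandas as pd

def range_between(df, col, lower='5h', upper='3h', stat=np.mean):
    ts = df.index.values  # datetime64[ns], assumed sorted
    vals = df[col].to_numpy()
    # first row at or after (t - lower), first row strictly after (t + upper)
    start = np.searchsorted(ts, ts - pd.Timedelta(lower).to_timedelta64(), side='left')
    end = np.searchsorted(ts, ts + pd.Timedelta(upper).to_timedelta64(), side='right')
    return pd.Series([stat(vals[s:e]) for s, e in zip(start, end)], index=df.index)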

Separating pandas dataframe by offset string

Let's say I have a pandas.DataFrame that has hourly data for 3 days:
import pandas as pd
import numpy as np
import datetime as dt
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))
I would like to take every, let's say, 6 hours of data and independently fit a curve to that data. Since pandas' resample function has a how keyword that is supposed to accept any numpy array function, I thought that I could maybe use resample to do that with polyfit, but apparently there is no way (right?).
So the only alternative way I thought of doing that is separating df into a sequence of DataFrames, so I am trying to create a function that would work such as
l=splitDF(df, '6H')
and it would return a list of dataframes, each one with 6 hours of data (except maybe the first and last ones). So far I have nothing that works, except something like the following manual method:
def splitDF(data, rule):
    res_index = data.resample(rule).index
    out = []
    cont = 0
    for date in data.index:
        ... check for date in res_index ...
        ... and start cutting at those points ...
But this method would be extremely slow and there is probably a faster way to do it. Is there a fast (maybe even pythonic) way of doing this?
Thank you!
EDIT
A better method (that needs some improvement but it's faster) would be the following:
def splitDF(data, rule):
    res_index = data.resample(rule).index
    out = []
    pdate = res_index[0]
    for date in res_index:
        out.append(data[pdate:date][:-1])
        pdate = date
    out.append(data[pdate:])
    return out
But it still seems to me that there should be a better method.
Ok, so this sounds like a textbook case for using groupby. Here's my thinking:
import numpy as np
import pandas as pd

# let's define a function that'll group a datetime-indexed dataframe by hour-interval/date
def create_date_hour_groups(df, hr):
    new_df = df.copy()
    hr_int = int(hr)
    # integer division, so hours 0-5 -> group 0, 6-11 -> group 1, etc.
    new_df['hr_group'] = new_df.index.hour // hr_int
    new_df['dt_group'] = new_df.index.date
    return new_df

# now we define a wrapper for polyfit to pass to groupby.apply
def polyfit_x_y(df, x_col='A', y_col='B', poly_deg=3):
    df_new = df.copy()
    coef_array = np.polyfit(df_new[x_col], df_new[y_col], poly_deg)
    poly_func = np.poly1d(coef_array)
    df_new['poly_fit'] = poly_func(df_new[x_col])
    return df_new

# to the actual stuff
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24, 2), index=dates, columns=list('AB'))
df = create_date_hour_groups(df, 6)
df_fit = df.groupby(['dt_group', 'hr_group'], as_index=False).apply(polyfit_x_y)
How about:
np.array_split(df, len(df) // 6)
(Note that the second argument is the number of chunks, so len(df) // 6 gives chunks of roughly six rows each; integer division is needed in Python 3.)
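Or, to split by the clock rather than by row count, a minimal sketch using pd.Grouper (assuming a reasonably recent pandas):
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24, 2), index=dates, columns=list('AB'))
# one DataFrame per 6-hour bucket, preserving time order
chunks = [group for _, group in df.groupby(pd.Grouper(freq='6H'))]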
