Python equivalent to Spark rangeBetween for window?

I am trying to find out whether there is a way in Python to do the equivalent of Spark's rangeBetween in a rolling aggregation. In Spark, rangeBetween lets the window be asymmetric around the target row; for example, for each row I can look from -5h to +3h: all rows that fall between 5 hours before and 3 hours after it, based on a datetime column. I know that pandas has the rolling option, but after reading all the documentation I can find on it, it looks like it takes only a single value as the window. You can change whether that window is centered on each row or not, but I can't find a way to explicitly set a range of my choosing.
Does anyone know of another function or feature that would do this?

I'm not sure if it's the best answer but it's mine and it works so I guess it'll have to do until there is a better option. I made a python function out of it so you can sub in whatever aggregation function you want.
import pandas as pd

def rolling_stat(pdf, lower_bound, upper_bound, group, statistic='mean'):
    # pdf needs a DatetimeIndex and a numeric 'nbr' column; bounds are in minutes.
    results = []
    for grp in pdf[group].drop_duplicates():
        # Sort a copy of the group so label-based slicing on the index works.
        pdf_grp = pdf[pdf[group] == grp].sort_index()
        aggs = []
        for index in pdf_grp.index:
            lower = index - pd.Timedelta(minutes=lower_bound)
            upper = index + pd.Timedelta(minutes=upper_bound)
            # .loc slicing on a sorted DatetimeIndex includes both endpoints.
            aggs.append(pdf_grp.loc[lower:upper, 'nbr'].agg(statistic))
        pdf_grp['agg'] = aggs
        results.append(pdf_grp)
    # Stack the per-group results back into one dataframe.
    return pd.concat(results)
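For larger frames, the row-by-row loop can get slow. Below is a minimal vectorized sketch of the same idea for the mean only. It is a swapped-in technique (numpy searchsorted plus prefix sums), not part of the answer above. The function name range_between_mean and the sample data are made up for illustration, and it assumes a DatetimeIndex, no NaN values, and inclusive window ends (as with Spark's rangeBetween).

```python
import numpy as np
import pandas as pd

def range_between_mean(series, before, after):
    """Mean over rows with timestamps in [t - before, t + after], for every row t.

    `series` needs a DatetimeIndex; `before` and `after` are pd.Timedelta values.
    Both window ends are inclusive.
    """
    s = series.sort_index()
    times = s.index.values  # sorted datetime64 array
    # First position inside each window, and one past the last position.
    start = np.searchsorted(times, times - before.to_timedelta64(), side="left")
    stop = np.searchsorted(times, times + after.to_timedelta64(), side="right")
    # Prefix sums turn every window sum into a single subtraction.
    csum = np.concatenate(([0.0], np.cumsum(s.to_numpy(dtype=float))))
    return pd.Series((csum[stop] - csum[start]) / (stop - start), index=s.index)

# Hypothetical sample data: look 5 hours back and 3 hours ahead of each row.
stamps = pd.to_datetime(["2021-01-01 00:00", "2021-01-01 01:00", "2021-01-01 02:00",
                         "2021-01-01 06:00", "2021-01-01 10:00"])
nbr = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=stamps)
result = range_between_mean(nbr, pd.Timedelta(hours=5), pd.Timedelta(hours=3))
```

To get the per-group behaviour of rolling_stat, the same function could be called once per group. Other statistics (median, for example) would need a different approach than prefix sums.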

Related

efficient way to find unique values within time windows in python?

I have a large pandas dataframe that contains data similar to the image attached.
I want to get a count of how many unique TN values exist within each 2-second window of the data. I've done this with a simple loop, but it is incredibly slow. Is there a better technique I can use to get this?
My original code is:
uniqueTN = []
tmstart = 5400; tmstop = 86400
for tm in range(int(tmstart), int(tmstop), 2):
    df = rundf[(rundf['time'] >= (tm - 2)) & (rundf['time'] < tm)]
    uniqueTN.append(df['TN'].unique())
This solution would be fine if the data set were not so large.

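Here is how you can do it with the groupby() and nunique() methods. Flooring each time to a multiple of 2 labels every row with the start of its 2-second window:
rundf['time'] = (rundf['time'] // 2) * 2
grouped = rundf.groupby('time')['TN'].nunique()
Another alternative, provided the time column is datetime-like (resample() does not work on plain numbers), is to use the resample() method of pandas and then the nunique() method.
grouped = rundf.resample('2S', on='time')['TN'].nunique()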
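As a small sanity check of the floor-and-group idea, here is a sketch on made-up data. The values in rundf are hypothetical; only the column names time and TN come from the question. It groups by a derived key instead of overwriting the time column.

```python
import pandas as pd

# Hypothetical sample: times in seconds, TN is the value to count.
rundf = pd.DataFrame({
    "time": [5400.0, 5400.5, 5401.9, 5402.0, 5403.5, 5403.9],
    "TN":   [1, 1, 2, 3, 3, 3],
})

# Label each row with the start of its 2-second window, without mutating rundf.
window_start = (rundf["time"] // 2) * 2

# Count distinct TN values per window.
unique_counts = rundf.groupby(window_start)["TN"].nunique()
```

Windows that contain no rows simply do not appear in the result, unlike in the original loop, which appends an empty array for them.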

Replace unknown values (with different median values)

I would like to clean and prepare my data, and I have a lot of unknown values in the "highpoint_metres" column of my dataframe (members). As "peak_id" has no missing values, I calculated the median height per peak_id so that the replacement is more accurate.
I would like to do two things: 1) add a new column to my "members" dataframe holding the median, which differs depending on the "peak_id" (computed with the code below); 2) check whether the value in highpoint_metres is null and, if it is, use the value of the new column instead. I hope this is clear enough.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id","highpoint_metres"]].groupby("peak_id",as_index=False).median()
And I don't know how to continue from there (my level of python is very bad ;-))

I believe that's what you're looking for:
import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
is_highpoint_missing = np.isnan(members.highpoint_metres)
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)
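The same idea can be checked on a tiny hand-made frame without downloading the CSV. This is only a sketch: the rows below are invented, the column name highpoint_metres_filled is hypothetical, and fillna is used here in place of np.where.

```python
import numpy as np
import pandas as pd

# Invented rows with the two relevant columns from the question.
members = pd.DataFrame({
    "peak_id": ["A", "A", "A", "B", "B"],
    "highpoint_metres": [8000.0, np.nan, 8200.0, 7000.0, np.nan],
})

# transform("median") returns one value per row, aligned with the original index.
group_median = members.groupby("peak_id")["highpoint_metres"].transform("median")

# fillna with a Series fills each missing value from the row with the same index.
members["highpoint_metres_filled"] = members["highpoint_metres"].fillna(group_median)
```

If every value for a peak_id is missing, its median is NaN as well, so those rows stay unfilled.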

One way to go about replacing 0 with the median could be:
import numpy as np
df[col_name] = df[col_name].replace({0: np.median(df[col_name])})
You can also use the apply function:
df[col_name] = df[col_name].apply(lambda x: np.median(df[col_name]) if x == 0 else x)
Let me know if this helps.
Adding a little more info based on Marie's question:
One way to get the median per peak is through groupby, and then left join it with the original dataframe.
df_gp = df.groupby('peak_id').agg(Median=('highpoint_metres', 'median')).reset_index()
df = pd.merge(df, df_gp, on='peak_id', how='left')
# NaN never compares equal to anything, so test for it with pd.isna
df['highpoint_metres'] = df.apply(lambda x: x['Median'] if pd.isna(x['highpoint_metres']) else x['highpoint_metres'], axis=1)
Let me know if this solves your issue.

Pandas select rows based on a function of a column

I am trying to learn Pandas. I have found several examples on how to construct a pandas dataframe and how to add columns, they work nicely. I would like to learn to select all rows based on a value of a column. I have found multiple examples on how to perform selection if a value of a column should be smaller or greater than a certain number, that also works. My question is how to do a more general selection, where I would like to first compute a function of a column, then select all rows for which the value of a function would be greater or smaller than a certain number
import names
import numpy as np
import pandas as pd
from datetime import date
import random
def randomBirthday(startyear, endyear):
    T1 = date.today().replace(day=1, month=1, year=startyear).toordinal()
    T2 = date.today().replace(day=1, month=1, year=endyear).toordinal()
    return date.fromordinal(random.randint(T1, T2))
def age(birthday):
    today = date.today()
    return today.year - birthday.year - ((today.month, today.day) < (birthday.month, birthday.day))
N_PEOPLE = 20
dict_people = { }
dict_people['gender'] = np.array(['male','female'])[np.random.randint(0, 2, N_PEOPLE)]
dict_people['names'] = [names.get_full_name(gender=g) for g in dict_people['gender']]
peopleFrame = pd.DataFrame(dict_people)
# Example 1: Add new columns to the data frame
peopleFrame['birthday'] = [randomBirthday(1920, 2020) for i in range(N_PEOPLE)]
# Example 2: Select all people with a certain age
peopleFrame.loc[age(peopleFrame['birthday']) >= 20]
This code works except for the last line. Please suggest the correct way to write it. I have considered adding an extra column with the value of the function age and then selecting on that column. That would work, but I am wondering whether I have to do it. What if I don't want to store the age of a person and only want to use it for selection?

Use Series.apply:
peopleFrame.loc[peopleFrame['birthday'].apply(age) >= 20]
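The question's last line fails because age() receives the whole Series, which has no .year attribute, while apply calls the function once per element. A minimal sketch with fixed data follows. The helper age_on and its reference_day parameter are hypothetical; they are added only so the result does not depend on today's date.

```python
from datetime import date

import pandas as pd

def age_on(birthday, today):
    # Same arithmetic as age() in the question, with an explicit reference day.
    return today.year - birthday.year - ((today.month, today.day) < (birthday.month, birthday.day))

people = pd.DataFrame({
    "names": ["Ann", "Bob", "Cid"],
    "birthday": [date(1990, 6, 15), date(2010, 1, 1), date(2000, 12, 31)],
})

reference_day = date(2020, 6, 14)

# The boolean mask is computed on the fly; no age column is stored.
mask = people["birthday"].apply(age_on, args=(reference_day,)) >= 20
adults = people.loc[mask]
```

Only the mask lives temporarily; people itself keeps its two original columns.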

Python Pandas using lambda on index row by row

So I'm trying to apply a function to the index row by row but having some problems
startDate = '2015-05-01 00:00'
endDate = '2015-05-08 00:00'
idx = pd.date_range(startDate, endDate, freq="1min")
df = pd.DataFrame(columns=['F(t)'])
df = df.reindex(idx, fill_value=0)
def circadian_function(T):
    return math.cos(math.pi*(T-delta)/12)
Everything is okay up to here but trying to apply the function I'm not sure what to do
df['F(t)'] = df.index.apply(lambda x: circadian_function x[index].hour, axis=1)
Should I be using a lambda? Or just an apply?

I don't have 50 rep so I can't comment on @Ted Petrou's answer ;-;
I just wanted to mention a couple of things that you should know.
If you are going to feed df.index.hour into your circadian_function, make sure you use numpy instead of math. Otherwise the interpreter will throw a TypeError (I just found out about this).
Make sure to define delta.
Example:
import numpy as np
def circadian_function(T, delta):
    return np.cos(np.pi * (T - delta) / 12)
What @Ted Petrou recommends you do, in full:
df['F(t)'] = circadian_function(df.index.hour, 0.5)  # I picked an arbitrary delta
Numpy will automatically vectorize the function for you. Props to Ted, I learned something new as well :>
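To see the difference between math and numpy here, a small sketch may help. The four timestamps and the delta of 0.5 are arbitrary choices for illustration.

```python
import math

import numpy as np
import pandas as pd

# Four arbitrary timestamps; .hour gives one integer per timestamp.
idx = pd.DatetimeIndex(["2015-05-01 00:00", "2015-05-01 06:00",
                        "2015-05-01 12:00", "2015-05-01 18:00"])
hours = idx.hour.to_numpy()
delta = 0.5  # arbitrary, as in the answer above

# math.cos only accepts a single number, so an array of hours is rejected.
try:
    math.cos(math.pi * (hours - delta) / 12)
    math_failed = False
except TypeError:
    math_failed = True

# np.cos works element-wise on the whole array at once.
vectorized = np.cos(np.pi * (hours - delta) / 12)
```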

Use apply only as a last resort. This can be easily vectorized. Make sure you define delta.
import numpy as np
df['F(t)'] = np.cos(np.pi*(idx.hour-delta)/12)

Shifting all rows in dask dataframe

In Pandas, there is a method DataFrame.shift(n) which shifts the contents of an array by n rows, relative to the index, similarly to np.roll(a, n). I can't seem to find a way to get a similar behaviour working with Dask. I realise things like row-shifts may be difficult to manage with Dask's chunked system, but I don't know of a better way to compare each row with the subsequent one.
What I'd like to be able to do is this:
import numpy as np
import pandas as pd
import dask.dataframe as dd
with pd.HDFStore(path) as store:
    data = dd.from_hdf(store, 'sim')[col1]
shifted = data.shift(1)
idx = data.apply(np.sign) != shifted.apply(np.sign)
in order to create a boolean series indicating the locations of sign changes in the data. (I am aware that method would also catch changes from a signed value to zero)
I would then use the boolean series to index a different Dask dataframe for plotting.

Rolling functions
Currently dask.dataframe does not implement the shift operation. It could, though, if you raise an issue. In principle this is not so dissimilar from the rolling operations that dask.dataframe does support, like rolling_mean, rolling_sum, etc.
Actually, if you were to create a Pandas function that adheres to the same API as these pandas.rolling_foo functions, then you could use the dask.dataframe.rolling.wrap_rolling function to turn your pandas-style rolling function into a dask.dataframe rolling function.
dask.dataframe.rolling_sum = wrap_rolling(pandas.rolling_sum)

The following code might help to shift the series down by one row.
s = dd_df['column'].rolling(window=2).sum() - dd_df['column']
Edit (03/09/2019):
When you take a rolling sum with a window of 2, the result for a particular row is
result[i] = row[i-1] + row[i]
Subtracting the original value of the column from that result then performs the following operation:
final_row[i] = result[i] - row[i]
Which equals:
final_row[i] = row[i-1] + row[i] - row[i]
Which ultimately results in the whole column being shifted down by one row.
Tip:
If you want to shift it down by multiple rows, execute the whole operation again that many times with the same window.
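The identity behind this trick can be checked quickly in plain pandas, whose rolling API has the same shape. This is only a sketch on a made-up series; it uses pandas rather than dask so that it runs without a dask installation.

```python
import pandas as pd

# Made-up values with a couple of sign changes.
s = pd.Series([3.0, -1.0, 4.0, -2.0])

# Rolling sum over two rows minus the current row leaves the previous row.
shifted_via_rolling = s.rolling(window=2).sum() - s

# Reference result from the built-in shift.
shifted_direct = s.shift(1)

# Sign changes between consecutive rows, as asked for in the question.
sign_changed = (s > 0) != (shifted_via_rolling > 0)
```

The first row of both shifted series is NaN, since it has no predecessor, so that row needs separate handling when looking for sign changes. With floating-point data the subtraction can also introduce small rounding differences compared to a true shift.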
