Interpolation across different dataframes with Pandas - python

Context of the Problem
I am running a discrete event simulation where, at the end of each event, I store the state of the system in a row of a dataframe. Each simulation run has a clock which determines which events are run and when; all runs share the same initial clock (0) and the same end clock (simulation end), but the number of rows in each dataframe may differ because the simulation has several stochastic components. The clock column is then converted to a timestamp and used as the index of the dataframe.
Whenever one runs a simulation, it is good practice to repeat it many times and average over all the replicates. Doing so in this setup is a bit complicated because each simulation produces a dataframe with a different index.
Proposed Solution
So far this is the solution I found:
# Auxiliary functions
import pandas as pd

def get_union_index(dfs):
    # Union of the (timestamp) indexes of all replicate dataframes
    index_union = dfs[0].index
    for df in dfs[1:]:
        index_union = index_union.union(df.index)
    return pd.Series(index_union).drop_duplicates().reset_index(drop=True)

def interpolate(df, index, method='time'):
    # 'method' is kept for more general interpolation schemes later on;
    # for now the gaps are filled with a forward fill followed by a backward fill
    aux = df.astype(float)
    aux = aux.reindex(index).ffill().bfill()
    return aux
# Simulation
replicates = 30
seeds = list(range(40, 40 + replicates))
dfs = [simulate(seed=seed, **parameters) for seed in seeds]
# Main Code
union_index = get_union_index(dfs)
dfs_interpolates = [interpolate(df, union_index) for df in dfs]
df_concat = pd.concat(dfs_interpolates)
by_row_index = df_concat.groupby(df_concat.index)
# Averaging
df_means = by_row_index.mean()
df_std = by_row_index.std()
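The same groupby object supports any other row-wise statistic across the replicates, for example (a sketch reusing by_row_index from above):
# Sketch: other row-wise aggregations across the replicates
df_median = by_row_index.median()
df_p90 = by_row_index.quantile(0.9)  # e.g. an upper percentile band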
Explanation
First, all the indexes are combined; this combined index is then used to re-index each dataframe, and the resulting NaN values are filled with a forward fill followed by a backward fill.
Questions
1. Is there a native pandas function that could simplify this?
2. If not (1), is there an alternative way to combine the datasets directly? Since the majority of the indexes are disjoint, union_index has a length of approximately len(df) * len(dfs), which is huge.
3. Here I am using ffill and then bfill, but the solution should allow the more general interpolate method, in order to support different and more complex approaches. The main issue is that each row is an event (before interpolation), so only a few columns (sometimes none) change between two consecutive rows. This produces many consecutive rows with identical values in several columns, and thus a lot of redundant values, even more so after interpolation (especially with ffill and bfill). A sketch of a time-based variant of the interpolate helper is shown below.
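For illustration, a variant of the interpolate helper that actually uses its method argument might look like this (a sketch, not tested against the simulation output; it assumes the index is a DatetimeIndex so that method='time' is valid):
def interpolate_general(df, index, method='time'):
    # Sketch: reindex onto the union index, then let pandas fill the gaps.
    # Leading/trailing NaNs are still closed with ffill/bfill because
    # interpolate does not extrapolate outside the known values.
    aux = df.astype(float).reindex(index)
    aux = aux.interpolate(method=method).ffill().bfill()
    return aux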
EDIT
Added data to reproduce and changed question 3.
dfs is a list of 3 DataFrames as follows:
Then dfs_interpolates is also a list of 3 DataFrames, but this time the indexes are exactly the same, namely:
Finally, the expected result should be a row-wise application of a function (mean, median, etc.) across the different dataframes in dfs_interpolates. For the case of the mean, the result should be:

Related

Applying Column-based Data Transformations on PySpark DataFrames in Parallel

I've searched across SO a bit and haven't been able to find a question that resembles mine; I hope this isn't a duplicate, but feel free to point me in the right direction if a similar question has already been asked!
I'm in the process of K-Fold mean-encoding a set of categorical vectors in a very large dataset (think, 30+ million rows).
I currently have my code set up such that:
the dataframe is split into random subsets using randomSplit()
for each split, I iterate through each column of type categorical and calculate the mean-encoding for that column and split
I keep track of the split's mean-encoding results in a dictionary
following completion of all splits, I average the results
My problem is that this takes a long time (the mean-encoding calculation on a single column across 5 splits takes a little over 6 minutes, and I have several hundred categorical columns), and I'm fairly certain I can speed it up by running the task in parallel, i.e. applying the same function to all splits simultaneously. However, I can't seem to figure out how to do that using PySpark's built-in functionality. I'm not interested in bringing in threading or pools, simply because I'm unsure how they actually interact with PySpark (if I'm totally wrong and that is the optimal way to go, please let me know).
If it helps, here is the function I've put together for calculating the mean-encoding of a specified column of a specified DataFrame, followed by the loop I'm talking about. Any way to increase the efficiency and speed of this would be hugely appreciated.
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def determine_means(df, col, target):
    """
    :param df: pyspark.sql.dataframe
        dataframe to apply target mean encoding
    :param col: str
        column to apply target encoding
    :param target: str
        target column
    :return:
        dict of {string: float}
    """
    means = df.groupby(F.col(col)).agg(F.mean(target).alias(f"{col}_mean_encoding"))
    means = means.withColumn(f"{col}_mean_encoding",
                             means[f"{col}_mean_encoding"].cast(FloatType()))
    means = means.toPandas()
    return dict(zip(list(means[col].values), list(means[f"{col}_mean_encoding"].values)))
meta_means_dict = dict()
splits = PYSPARK_DF.randomSplit([.2, .2, .2, .2, .2])
for sp in splits:
    for col in CATEGORICAL_COLUMNS:
        if col not in meta_means_dict.keys():
            meta_means_dict[col] = dict()
        for k, v in determine_means(sp, col, TARGET_COL).items():
            if k in meta_means_dict[col].keys():
                meta_means_dict[col][k].append(v)
            else:
                meta_means_dict[col][k] = [v]
Does anyone have any advice or tips?
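For reference, a hedged sketch (not the poster's code) of one way to avoid the per-column toPandas round trips: melt the categorical columns into long format so that a single groupBy and a single collect cover every column of a split. The helper below assumes the same CATEGORICAL_COLUMNS and TARGET_COL names used above.
from pyspark.sql import functions as F

def encode_all_columns(df, cat_cols, target):
    # Turn N wide categorical columns into N (col_name, col_value) rows per input row
    pairs = F.array(*[
        F.struct(F.lit(c).alias("col_name"), F.col(c).cast("string").alias("col_value"))
        for c in cat_cols
    ])
    long_df = (df
               .select(F.explode(pairs).alias("kv"), F.col(target).alias("target"))
               .select(F.col("kv.col_name").alias("col_name"),
                       F.col("kv.col_value").alias("col_value"),
                       "target"))
    means = (long_df
             .groupBy("col_name", "col_value")
             .agg(F.mean("target").alias("mean_encoding"))
             .toPandas())
    # Re-nest the flat result into {column: {level: encoding}}
    # (note: levels become strings here because of the cast above)
    out = {c: {} for c in cat_cols}
    for _, r in means.iterrows():
        out[r["col_name"]][r["col_value"]] = float(r["mean_encoding"])
    return out

# e.g. per split: encodings = encode_all_columns(sp, CATEGORICAL_COLUMNS, TARGET_COL)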

Pyspark random split changes distribution of data

I found a very strange behavior in pyspark when I use randomSplit. I have a column is_clicked that takes values 0 or 1, and there are far more zeros than ones. After the random split I would expect the data to be uniformly distributed. Instead, I see that the first rows in the splits are all is_clicked=1, followed by rows that are all is_clicked=0. You can see that the number of clicks in the original dataframe df is 9 out of 1000 (which is what I expect), but after the random split the number of clicks is 1000 out of 1000. If I take more rows I will see that they are all is_clicked=1 until there are no more such rows, and only then do rows with is_clicked=0 follow.
Anyone knows why there is distribution change after random split? How can I make is_clicked be uniformly distributed after split?
So indeed pyspark does sort the data when doing randomSplit. Here is a quote from the code:
It is possible that the underlying dataframe doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized which could result in overlapping splits. To prevent this, we explicitly sort each input partition to make the ordering deterministic. Note that MapTypes cannot be sorted and are explicitly pruned out from the sort order.
The solution is to either reshuffle the data after the split or just use filter instead of randomSplit:
Solution 1:
import pyspark.sql.functions as sf

df = df.withColumn('rand', sf.rand(seed=42)).orderBy('rand')
df_train, df_test = df.randomSplit([0.5, 0.5])
df_train = df_train.orderBy('rand')  # orderBy returns a new dataframe, so reassign it
Solution 2:
# Reuses the 'rand' column added in Solution 1
df_train = df.filter(df.rand < 0.5)
df_test = df.filter(df.rand >= 0.5)
Here is a blog post with more details.
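As a quick sanity check (a sketch, assuming the is_clicked column from the question), the label distribution of each split can be compared directly:
# Sketch: verify the label distribution in each split
df_train.groupBy('is_clicked').count().show()
df_test.groupBy('is_clicked').count().show()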

Is there a way to loop through pandas dataframe and drop windows of rows dependent on condition?

Problem Summary - I have a dataframe of ~10,000 rows. Some rows contain data aberrations that I would like to get rid of, and those aberrations are tied to observations made at certain temperatures (one of the data columns).
What I've tried - My thought is that the easiest way to remove the rows of bad data is to loop through the temperature intervals, find the maximum index that is less than each of the temperature interval observations, and use the df.drop function to get rid of a window of rows around that index. Between every temperature interval at which bad data is observed, I reset the index of the dataframe. However, it is completely unstable! Sometimes it nearly works, other times it throws key errors. I think my problem may be in working with the dataframe in place, but I don't see another way to do it.
Example Code:
Here is an example with a synthetic dataframe and a function that uses the same principles that I've tried. Note that I've tried different renditions with .loc and .iloc (commented out below).
#Create synthetic dataframe
import pandas as pd
import numpy as np

temp_series = pd.Series(range(25, 126, 1))
temp_noise = np.random.rand(len(temp_series)) * 3
df = pd.DataFrame({'temp': (temp_series + temp_noise),
                   'data': (np.random.rand(len(temp_series))) * 400})

#Calculate length of the original and copy it, because the function works in place
before_length = len(df)
df_dup = df.copy()  # an actual copy; 'df_dup = df' would only create another reference

temp_intervals = [50, 70, 92.7]
window = 5
From here, run a function based on the dataframe (df), the temperature observations (temp_intervals) and the window size (window):
def remove_window(df, intervals, window):
    '''Loop through the temperature intervals to define a window of indices around
    given temperatures in the dataframe to drop. Drop the window of indices in place
    and reset the index prior to moving to the next interval.
    '''
    for temp in intervals[0:len(intervals)]:
        #Find index where temperature first crosses the interval input
        cent_index = max(df.index[df['temp'] <= temp].tolist())
        #Define window of indices to remove from the df
        drop_indices = list(range(cent_index - window, cent_index + window))
        #Use df.drop
        df.drop(drop_indices, inplace=True)
        df.reset_index(drop=True)
    return df
So, is this a problem with the function I've defined, or is there a problem with df.drop?
Thank you,
Brad
It can be tricky to repeatedly delete parts of the dataframe and keep track of what you're doing. A cleaner approach is to keep track of which rows you want to delete within the loop, but only delete them outside of the loop, all at once. This should also be faster.
def remove_window(df, intervals, window):
    # Create a Boolean array indicating which rows to keep
    keep_row = np.repeat(True, len(df))
    for temp in intervals[0:len(intervals)]:
        # Find index where temperature first crosses the interval input
        cent_index = max(df.index[df['temp'] <= temp].tolist())
        # Define window of indices to remove from the df
        keep_row[range(cent_index - window, cent_index + window)] = False
    # Delete all unwanted rows at once, outside the loop
    df = df[keep_row]
    df.reset_index(drop=True, inplace=True)
    return df
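For example, applied to the synthetic dataframe from the question (a sketch, not part of the original answer):
# Sketch: run the corrected function on the synthetic data defined above
df_filtered = remove_window(df_dup, temp_intervals, window)
print(before_length, len(df_filtered))  # up to 2 * window rows dropped per interval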

How to apply multiple functions to several chunks of a dask dataframe?

I have a dataframe of 500,000 lines and 3 columns. I would like to compute the result of three functions for every chunk of 5,000 lines in the dataframe (that is, 100 chunks). Two of the three functions are user-defined, while the third is the mean of the values in column 3.
At the moment, I am first extracting a chunk, and then computing the results of the functions for that chunk. For the mean of column 3 I am using df.iloc[:,2].compute().mean() but the other functions are performed outside of dask.
Is there a way to leverage dask's multithreading ability, taking the entire dataframe and a chunk size as input, and have it compute the same functions automatically? This feels like the more appropriate way of using Dask.
Also, this feels like a basic dask question to me, so please if this is a duplicate, just point me to the right place (I'm new to dask and I might have not looked for the right thing so far).
I would repartition your dataframe and then use the map_partitions function to apply each of your functions in parallel:
import dask

df = df.repartition(npartitions=100)
a = df.map_partitions(func1)
b = df.map_partitions(func2)
c = df.map_partitions(func3)
a, b, c = dask.compute(a, b, c)
You can create an artificial column for grouping indices into those 100 chunks.
ranges = np.arange(0, df.shape[0], 5000)
df['idx_group'] = ranges.searchsorted(df.index, side='right')
Then use this idx_group to perform your operations using pandas groupby.
NOTE: You can play with searchsorted to exactly fit your chunk requirements.
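A minimal sketch of that last step, assuming a pandas DataFrame and that func1 and func2 are the user-defined functions from the question, each taking a chunk of rows:
# Sketch: apply the three computations per 5,000-row chunk via the artificial column
grouped = df.groupby('idx_group')
res1 = grouped.apply(func1)
res2 = grouped.apply(func2)
res3 = grouped[df.columns[2]].mean()  # mean of the third column per chunk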

Creating dataframe by merging a number of unknown length dataframes

I am trying to do some analysis on baseball pitch F/x data. All the pitch data is stored in a pandas dataframe with columns like 'Pitch speed' and 'X location.' I have a wrapper function (using pandas.query) that, for a given pitch, will find other pitches with similar speed and location. This function returns a pandas dataframe of unknown size. I would like to use this function over large numbers of pitches; for example, to find all pitches similar to those thrown in a single game. I have a function that does this correctly, but it is quite slow (probably because it is constantly resizing resampled_pitches):
def get_pitches_from_templates(template_pitches, all_pitches):
    resampled_pitches = pd.DataFrame(columns=all_pitches.columns.values.tolist())
    for i, row in template_pitches.iterrows():
        resampled_pitches = resampled_pitches.append(get_pitches_from_template(row, all_pitches))
    return resampled_pitches
I have tried to rewrite the function using pandas.apply on each row, or by creating a list of dataframes and then merging, but can't quite get the syntax right.
What would be the fastest way to do this type of sampling and merging?
It sounds like you should use pd.concat for this: collect the partial results in a list and concatenate them once at the end.
def get_pitches_from_templates(template_pitches, all_pitches):
    res = []
    for i, row in template_pitches.iterrows():
        res.append(get_pitches_from_template(row, all_pitches))
    return pd.concat(res)
I think that a merge might be even faster. Using df.iterrows() isn't recommended, as it generates a Series for every row.
