Spark Streaming: From DStream to Pandas Dataframe - python

In the snippet below I try to transform a DStream of temperatures (received from Kafka) into a pandas DataFrame.
def main_process(time, dStream):
    print("========= %s =========" % str(time))
    try:
        # Get the singleton instance of SparkSession
        spark = getSparkSessionInstance(dStream.context.getConf())
        # Convert RDD[String] to RDD[Row] to DataFrame
        rowRdd = dStream.map(lambda t: Row(Temperatures=t))
        df = spark.createDataFrame(rowRdd)
        df.show()
        print("The mean is: %m" % df.mean())
As is, the mean is never calculated, which I suppose is because "df" is not a pandas DataFrame (?).
I tried using df = spark.createDataFrame(df.toPandas()) according to the relevant documentation, but toPandas() is not recognized and the transformation never occurs.
Am I on the right path, and if so, how should I apply the transformation?
Or maybe my approach is wrong and I must handle the DStream in a different way?
Thank you in advance!
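For reference, a minimal sketch of two ways the mean could be obtained at this point, assuming the Temperatures column is numeric (values coming off Kafka are often strings and may need a cast first):

from pyspark.sql import functions as F

# Option 1: let Spark compute the mean directly, no pandas needed
mean_temp = df.agg(F.mean('Temperatures')).collect()[0][0]
print("The mean is: %s" % mean_temp)

# Option 2: convert the (small) micro-batch to pandas, then use pandas' mean
pandas_df = df.toPandas()
print("The mean is: %s" % pandas_df['Temperatures'].mean())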

Related

Using pd.DataFrame.sample on dask dataframe with groupby

I have a very large dataframe that I am resampling a large number of times, so I'd like to use dask to speed up the process. However, I'm running into challenges with the groupby apply. An example data frame would be
import numpy as np
import pandas as pd
import random

test_df = pd.DataFrame({'sample_id': np.array(['a', 'b', 'c', 'd']).repeat(100),
                        'param1': random.sample(range(1, 1000), 400)})
test_df.set_index('sample_id', inplace=True)
which I can normally groupby and resample using
N = 5; i = 1
test = test_df\
    .groupby(['sample_id'])\
    .apply(pd.DataFrame.sample, n=N, replace=False)\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
I wrap this into a method that iterates over an N gradient i times. The actual dataframe is very large with a number of columns, and before anyone suggests, this method is a little bit faster than an np.random.choice approach on the index -- it's all in the groupby. I've run the overall procedure through a multiprocessing method, but I wanted to see if I could get a bit more speed out of a dask version of the same. The problem is that the documentation suggests that if you index and partition then you get complete groups per partition -- which is not proving true.
import dask.dataframe as dd

df1 = dd.from_pandas(test_df, npartitions=8)
df1 = df1.persist()
df1.divisions
creates
('a', 'b', 'c', 'd', 'd')
which unsurprisingly results in a failure
N = 5; i = 1
test = df1\
    .groupby(['sample_id'])\
    .apply(pd.DataFrame.sample, n=N, replace=False)\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
ValueError: Metadata inference failed in groupby.apply(sample).
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.
Original error is below:
ValueError("Cannot take a larger sample than population when 'replace=False'")
I have dug all around the documentation on keywords, dask dataframes & partitions, and groupby aggregations, and am simply missing the solution if it's there in the documents. Any advice on how to create a smarter set of partitions and/or get the groupby with sample playing nicely with dask would be deeply appreciated.
It's not quite clear to me what you are trying to achieve and why you need to add replace=False (which is the default), but the following code works for me. I just needed to add meta.
import dask.dataframe as dd

df1 = dd.from_pandas(test_df.reset_index(), npartitions=8)

N = 5
i = 1
test = df1\
    .groupby(['sample_id'])\
    .apply(lambda x: x.sample(n=N),
           meta={"sample_id": "object",
                 "param1": "f8"})\
    .reset_index(drop=True)
test['bootstrap'] = i
test['resample'] = N
If you then want to drop sample_id, you just need to add
df = df.drop("sample_id", axis=1)
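As a side note, dask also accepts an empty pandas frame as meta instead of a dict of dtypes; a sketch, reusing the test_df defined above:

test = df1\
    .groupby(['sample_id'])\
    .apply(lambda x: x.sample(n=N),
           meta=test_df.reset_index().head(0))\
    .reset_index(drop=True)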

PySpark: Partition a dataframe based on column values and store the resulting dataframes in a list

I have a pyspark dataframe with 4 columns: city, season, weather_variable, variable_value. I have to partition the frame into partitions for the different combinations of city, season, weather_variable. Following that, I'll apply k-means on these partitions.
I'm using the following code in order to create the partitions:
a = df_in.select('city', 'season', 'variable').distinct().toPandas().as_matrix()
dfArray = [df_in.filter("city = '{}' and season = '{}' and variable = '{}'".format(x[0], x[1], x[2])) for x in a]
But the issue is that the process is very slow, as it is using filter iteratively. Can someone suggest a way to make the process more efficient? Something like groupby might help.
Please find below the python version of the code that I want to convert to pyspark:
from sklearn.cluster import KMeans

def k_means_on_partition(group):
    v = group['variable_value']
    kmeans = KMeans(n_clusters=7)
    kmeans.fit(v.values.reshape(-1, 1))
    group['cluster'] = kmeans.labels_
    return group

df_out = df_in.groupby(['city', 'season', 'variable']).apply(k_means_on_partition)
Also, for pyspark, I'll be using pyspark.ml instead of sklearn.
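For what it's worth, a minimal sketch of the grouped version on the Spark side, assuming Spark 3.x where groupBy().applyInPandas is available; it runs the pandas/sklearn function once per (city, season, variable) group instead of filtering once per combination, and the output schema types below are assumptions based on the question:

from sklearn.cluster import KMeans

def k_means_on_partition(pdf):
    # pdf is a pandas DataFrame holding one (city, season, variable) group
    kmeans = KMeans(n_clusters=7)
    kmeans.fit(pdf['variable_value'].values.reshape(-1, 1))
    pdf['cluster'] = kmeans.labels_
    return pdf

df_out = df_in.groupBy('city', 'season', 'variable').applyInPandas(
    k_means_on_partition,
    schema='city string, season string, variable string, variable_value double, cluster int')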

pytest assert for pyspark dataframe comparison

I have 2 pyspark dataframes, expected_df and actual_df, as shown in the attached file.
In my unit test I am trying to check if both are equal or not.
for which my code is
expected = map(lambda row: row.asDict(), expected_df.collect())
actual = map(lambda row: row.asDict(), actual_df.collect())
assert expected = actual
Since both dfs are the same but the row order is different, the assert fails here.
What is the best way to compare such dfs?
You can try pyspark-test
https://pypi.org/project/pyspark-test/
This is inspired by the pandas testing module, built for pyspark.
Usage is simple
from pyspark_test import assert_pyspark_df_equal
assert_pyspark_df_equal(df_1, df_2)
Also, apart from just comparing dataframes, just like the pandas testing module it accepts many optional params that you can check in the documentation.
Note:
The datatypes in pandas and pyspark are a bit different, that's why directly converting with .toPandas and using the pandas testing module might not be the right approach.
This package is for unit/integration testing, so it is meant to be used with small-size dfs.
This is done in some of the pyspark documentation:
assert sorted(expected_df.collect()) == sorted(actual_df.collect())
We solved this by hashing each row with Spark's hash function and then summing the resultant column.
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def hash_df(df):
    """Hashes a DataFrame for comparison.

    Arguments:
        df (DataFrame): A dataframe to generate a hash from

    Returns:
        int: Summed value of hashed rows of an input DataFrame
    """
    # Hash every row into a new hash column
    df = df.withColumn('hash_value', F.hash(*sorted(df.columns))).select('hash_value')
    # Sum the hashes, see https://shortest.link/28YE
    value = df.agg(F.sum('hash_value')).collect()[0][0]
    return value
expected_hash = hash_df(expected_df)
actual_hash = hash_df(actual_df)
assert expected_hash == actual_hash
Unfortunately this cannot be done without applying a sort on one of the columns (especially on the key column), the reason being that there isn't any guarantee for the ordering of records in a DataFrame. You cannot predict the order in which the records are going to appear in the dataframe. The approach below works fine for me:
expected = expected_df.orderBy('period_start_time').collect()
actual = actual_df.orderBy('period_start_time').collect()
assert expected == actual
If the overhead of an additional library such as pyspark_test is a problem, you could try sorting both dataframes by the same columns, converting them to pandas, and using pd.testing.assert_frame_equal.
I know that the .toPandas method for pyspark dataframes is generally discouraged because the data is loaded into the driver's memory (see the pyspark documentation here), but this solution works for relatively small unit tests.
For example:
sort_cols = actual_df.columns
pd.testing.assert_frame_equal(
    actual_df.sort(sort_cols).toPandas(),
    expected_df.sort(sort_cols).toPandas()
)
Try to have "==" instead of "=".
assert expected == actual
I have two DataFrames with the same order.
To compare these two I use:
def test_df(df1, df2):
    assert df1.values.tolist() == df2.values.tolist()
Another way to go about this, ensuring sort order, would be:
from pandas.testing import assert_frame_equal

def assert_frame_with_sort(results, expected, key_columns):
    results_sorted = results.sort_values(by=key_columns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=key_columns).reset_index(drop=True)
    assert_frame_equal(results_sorted, expected_sorted)

Data imputation with fancyimpute and pandas

I have a large pandas data frame df. It has quite a few missing values. Dropping rows or columns is not an option. Imputing medians, means or the most frequent values is not an option either (hence imputation with pandas and/or scikit unfortunately doesn't do the trick).
I came across what seems to be a neat package called fancyimpute (you can find it here). But I have some problems with it.
Here is what I do:
# the necessary imports
import pandas as pd
import numpy as np
from fancyimpute import KNN

# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float])

# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
However, df_filled is a single vector somehow, instead of the filled data frame. How do I get a hold of the data frame with imputations?
Update
I realized fancyimpute needs a numpy array. I hence converted df_numeric to an array using as_matrix().
# df is my data frame with the missings. I keep only floats
df_numeric = df.select_dtypes(include=[np.float]).as_matrix()
# I now run fancyimpute KNN,
# it returns a np.array which I store as a pandas dataframe
df_filled = pd.DataFrame(KNN(3).complete(df_numeric))
The output is a dataframe with the column labels gone missing. Any way to retrieve the labels?
Add the following lines after your code:
df_filled.columns = df_numeric.columns
df_filled.index = df_numeric.index
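One caveat: after the update in the question, df_numeric is a NumPy array (the result of as_matrix()), so it no longer carries columns or an index. A small variation that keeps the DataFrame around for its labels, assuming the same fancyimpute version as in the question (where KNN(3).complete() exists), could look like:

# Keep the DataFrame; hand only its values to the imputer
df_numeric = df.select_dtypes(include=[np.float])
df_filled = pd.DataFrame(KNN(3).complete(df_numeric.values),
                         columns=df_numeric.columns,
                         index=df_numeric.index)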
I see the frustration with fancyimpute and pandas. Here is a fairly basic wrapper using the recursive override method. It takes in and outputs a dataframe, column names intact. These sorts of wrappers work well with pipelines.
from fancyimpute import SoftImpute

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100, max_rank=None, n_power_iterations=1, init_fill_method="zero",
                 min_value=None, max_value=None, normalizer=None, verbose=True):
        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value,
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters, max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value, max_value=max_value,
                                           normalizer=normalizer, verbose=False)

    def fit_transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col].fillna(0.0, inplace=True)
        z = super(SoftImputeDf, self).fit_transform(X.values)
        return pd.DataFrame(z, index=X.index, columns=X.columns)
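For illustration, usage would then look something like this (df_numeric being the float-only frame from the question):

imputer = SoftImputeDf()
df_filled = imputer.fit_transform(df_numeric)  # returns a DataFrame with labels intact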
I really appreciate @jander081's approach, and expanded on it a tiny bit to deal with setting categorical columns. I had a problem where the categorical columns would get unset and create errors during training, so I modified the code as follows:
from fancyimpute import SoftImpute
import pandas as pd

class SoftImputeDf(SoftImpute):
    """DataFrame Wrapper around SoftImpute"""

    def __init__(self, shrinkage_value=None, convergence_threshold=0.001,
                 max_iters=100, max_rank=None, n_power_iterations=1, init_fill_method="zero",
                 min_value=None, max_value=None, normalizer=None, verbose=True):
        super(SoftImputeDf, self).__init__(shrinkage_value=shrinkage_value,
                                           convergence_threshold=convergence_threshold,
                                           max_iters=max_iters, max_rank=max_rank,
                                           n_power_iterations=n_power_iterations,
                                           init_fill_method=init_fill_method,
                                           min_value=min_value, max_value=max_value,
                                           normalizer=normalizer, verbose=False)

    def fit_transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame), "Must be pandas dframe"
        for col in X.columns:
            if X[col].isnull().sum() < 10:
                X[col].fillna(0.0, inplace=True)
        z = super(SoftImputeDf, self).fit_transform(X.values)
        df = pd.DataFrame(z, index=X.index, columns=X.columns)
        cats = list(X.select_dtypes(include='category'))
        df[cats] = df[cats].astype('category')
        # return pd.DataFrame(z, index=X.index, columns=X.columns)
        return df
df = pd.DataFrame(data=mice.complete(d), columns=d.columns, index=d.index)
The np.array that is returned by the .complete() method of the fancyimpute object (be it mice or KNN) is fed as the content (argument data=) of a pandas dataframe whose columns and index are the same as those of the original data frame.

dask.DataFrame.apply and variable length data

I would like to apply a function that returns a Series of variable length to a dask.DataFrame. An example to illustrate this:
import numpy as np
import pandas as pd
import dask.dataframe as dd

def generate_variable_length_series(x):
    '''returns pd.Series with variable length'''
    n_columns = np.random.randint(100)
    return pd.Series(np.random.randn(n_columns))

# apply this function to a dask.DataFrame
pdf = pd.DataFrame(dict(A=[1, 2, 3, 4, 5, 6]))
ddf = dd.from_pandas(pdf, npartitions=3)
result = ddf.apply(generate_variable_length_series, axis=1).compute()
Apparently, this works fine.
Concerning this, I have two questions:
Is this supposed to always work or am I just lucky here? Does dask expect all partitions to have the same number of columns?
In case the metadata inference fails, how can I provide metadata if the number of columns is not known beforehand?
Background / use case: In my dataframe each row represents a simulation trial. The function I want to apply extracts time points of certain events from it. Since I do not know the number of events per trial in advance, I do not know how many columns the resulting dataframe will have.
Edit:
As MRocklin suggested, here is an approach that uses dask.delayed to compute the result:
import dask

# convert ddf to delayed objects
ddf_delayed = ddf.to_delayed()

# delayed version of pd.DataFrame.apply
delayed_apply = dask.delayed(lambda x: x.apply(generate_variable_length_series, axis=1))

# use this function on every delayed object
apply_on_every_partition_delayed = [delayed_apply(d) for d in ddf_delayed]

# calculate the result. This gives a list of pd.DataFrame objects
result = dask.compute(*apply_on_every_partition_delayed)

# concatenate them
result = pd.concat(result)
Short answer
No, dask.dataframe does not support this
Long answer
Dask.dataframe expects to know the columns of every partition ahead of time and it expects those columns to match.
However, you can still use Dask and Pandas together through dask.delayed, which is far more capable of handling problems like these.
http://dask.pydata.org/en/latest/delayed.html
