Efficiently compute 2D histogram with Pyspark (Numpy and UDF) - python

I'm trying to do something really simple which somehow translates into something really difficult when Pyspark is involved.
I have a really large dataframe (~2B rows) on our platform which I'm not allowed to download, only analyse using Pyspark code. The dataframe contains the positions of some objects over Europe during the last year, and I want to compute the density of those objects over time. I've successfully used the function numpy.histogram2d in the past with good results (it's the fastest I've found in numpy, at least). Since there is no equivalent of this function in pyspark, I've defined a UDF to compute the density and return a new dataframe. This works when I only process a few rows (I've tried with 100K rows):
import pandas as pd
import numpy as np
def compute_density(df):
    # Fixed lat/lon grid covering Europe
    lon_bins = np.linspace(-15, 45, 100)
    lat_bins = np.linspace(35, 70, 100)
    density, xedges, yedges = np.histogram2d(df["corrected_latitude_degree"].values,
                                             df["corrected_longitude_degree"].values,
                                             [lat_bins, lon_bins])
    # Flatten the bin edges and counts into columns
    x2d, y2d = np.meshgrid(xedges[:-1], yedges[:-1])
    x_out = x2d.ravel()
    y_out = y2d.ravel()
    density_out = density.ravel()
    data = {
        'latitude': x_out,
        'longitude': y_out,
        'density': density_out
    }
    return pd.DataFrame(data)
which I then call like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, DoubleType
schema = StructType([
    StructField("latitude", DoubleType()),
    StructField("longitude", DoubleType()),
    StructField("density", DoubleType())
])
preproc = (
    inp
    .limit(100000)
    .withColumn("groups", F.lit(0))
)
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def compute_density_udf(df):
    return compute_density(df)
result = preproc.groupby(["groups"]).apply(compute_density_udf)
Why am I using the GROUPED_MAP version to apply the UDF? I didn't manage to get it to work with a SCALAR type UDF when returning a schema, although I don't really need to group.
When I try to use this UDF on the full dataset I get an OOM, I believe because there is only one group and that is too much data for the UDF to process at once. I'm sure there is a smarter way to compute this directly with pyspark without a UDF, or alternatively to split into groups and then assemble the results at the end. Does anyone have any idea/suggestion?
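One direction that avoids the pandas UDF entirely is to compute each row's bin index with Spark column expressions and let groupBy/count do the aggregation, so only the (at most 99 x 99) cell counts ever leave the executors. A minimal sketch, assuming the same grid as compute_density and two made-up column names lat_bin and lon_bin:
from pyspark.sql import functions as F
# Grid matching compute_density: 100 edges, i.e. 99 bins per axis
lon_min, lon_max, lat_min, lat_max, n_edges = -15.0, 45.0, 35.0, 70.0, 100
lon_width = (lon_max - lon_min) / (n_edges - 1)
lat_width = (lat_max - lat_min) / (n_edges - 1)
density = (
    inp
    # Keep only points that fall inside the grid
    .where(F.col("corrected_longitude_degree").between(lon_min, lon_max)
           & F.col("corrected_latitude_degree").between(lat_min, lat_max))
    # Integer bin index per axis (hypothetical column names)
    .withColumn("lat_bin", F.floor((F.col("corrected_latitude_degree") - lat_min) / lat_width))
    .withColumn("lon_bin", F.floor((F.col("corrected_longitude_degree") - lon_min) / lon_width))
    .groupBy("lat_bin", "lon_bin")
    .count()
)
# The result has at most one row per non-empty cell, so it is small enough
# to collect or convert to pandas and reshape into a 2D array locally.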

Related

How to make a data frame combining different regression results in python?

I am running some regression models to predict performance.
After running the models I created a variable to see the predictions (y_pred_* are lists with 2567 values):
y_pred_LR = regressor.predict(X_test)
y_pred_SVR = regressor2.predict(X_test)
y_pred_RF = regressor3.predict(X_test)
The types of these predictions are arrays of float64, while y_test is a DataFrame.
I wanted to create a table with the results. I tried several different approaches (passing them as lists, trying to convert them, selecting .values), but I have not succeeded so far. Could anyone help?
My last attempt was the one below:
comparison = pd.DataFrame({'Real': y_test, 'LR': y_pred_LR, 'RF': y_pred_RF, 'SVM': y_pred_SVM})
In this case the DataFrame is created but the values don't appear.
Additionally, I would like to create two new rows with the mean and standard deviation of the results, and these rows should be located at the beginning (i.e. as the first rows) of the DataFrame.
Thanks
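A likely cause of the empty values is that y_test keeps its original (shuffled) index from the train/test split, so pandas tries to align it against the new default index and fills NaN; converting it to a flat array side-steps that. The example below reproduces the pattern with dummy data and then stacks the mean/std rows on top of the comparison frame.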
import pandas as pd
import numpy as np
# Dummy stand-ins for y_test and the three prediction arrays
real = np.array([2] * 10).reshape(-1, 1)
y_pred_LR = np.array([0] * 10)
y_pred_SVR = np.array([1] * 10)
y_pred_RF = np.array([5] * 10)
# The 'real' values must be 1-D so they line up with the prediction arrays
real = real.flatten()
comparison = pd.DataFrame({'real': real,
                           'y_pred_LR': y_pred_LR,
                           'y_pred_SVR': y_pred_SVR,
                           'y_pred_RF': y_pred_RF})
# Column-wise mean and standard deviation, turned into two rows
Mean = comparison.mean(axis=0)
StD = comparison.std(axis=0)
Mean_StD = pd.concat([Mean, StD], axis=1).T
# Put the mean/std rows in front of the per-sample rows
result = pd.concat([Mean_StD, comparison], ignore_index=True)
print(result)
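Applied to the original variables, the same pattern would look roughly like this (a sketch only; y_test.values.ravel() is the step that turns the single-column DataFrame into a flat array so it lines up with the prediction arrays):
comparison = pd.DataFrame({'Real': y_test.values.ravel(),  # flatten the y_test DataFrame
                           'LR': y_pred_LR,
                           'SVR': y_pred_SVR,
                           'RF': y_pred_RF})
Mean_StD = pd.concat([comparison.mean(axis=0), comparison.std(axis=0)], axis=1).T
result = pd.concat([Mean_StD, comparison], ignore_index=True)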

NameError: name 'rabinerJuangStepPattern' is not defined when using dtw

I'm trying to run this code from Kaggle for clustering time series using DTW.
More specifically, the part in cells In [24/25]:
"""
From a list of series, compute a distance matrix by computing the
DTW distance of all pairwise combinations of series.
"""
diff_matrix = {}
cross = itertools.product(cols, cols)
for (col1, col2) in cross:
series1 = daily_sales_item_lookup_scaled_weekly[col1]
series2 = daily_sales_item_lookup_scaled_weekly[col2]
diff = dtw(
series1,
series2,
keep_internals=True,
step_pattern=rabinerJuangStepPattern(2, "c")
)\
.normalizedDistance
diff_matrix[(col1, col2)] = [diff]
return diff_matrix
One of the parameters the authors pass is step_pattern=rabinerJuangStepPattern(2, "c"); however, when I run it, I get the error mentioned in the title.
Does anyone know what might be wrong?
Thank you!
You need to import this function from the dtw package like this first:
from dtw import *
If you scroll to the top of the Kaggle page, you can see that it is imported there too.
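If you prefer explicit imports over the wildcard, something along these lines should also work, assuming the dtw-python package (the one providing rabinerJuangStepPattern) is installed; the two toy series are made up for illustration:
import numpy as np
from dtw import dtw, rabinerJuangStepPattern  # pip install dtw-python
series1 = np.array([1.0, 2.0, 3.0, 4.0, 3.0])
series2 = np.array([1.0, 1.5, 3.5, 4.0, 2.5])
alignment = dtw(series1, series2, keep_internals=True,
                step_pattern=rabinerJuangStepPattern(2, "c"))
print(alignment.normalizedDistance)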

Rolling PCA on pandas dataframe

I'm wondering if anyone knows of how to implement a rolling/moving window PCA on a pandas dataframe. I've looked around and found implementations in R and MATLAB but not Python. Any help would be appreciated!
This is not a duplicate - moving window PCA is not the same as PCA on the entire dataframe. Please see pandas.DataFrame.rolling() if you do not understand the difference
Unfortunately, pandas.DataFrame.rolling().apply() only passes 1-D windows (one column at a time) to the applied function, so it cannot be used as one might expect to roll over the rows of the df and pass 2-D windows of rows to the PCA.
The following is a work-around for this based on rolling over indices instead of rows. It may not be very elegant but it works:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
# Generate some data (1000 time points, 10 features)
data = np.random.random(size=(1000, 10))
df = pd.DataFrame(data)
# Set the window size
window = 100
# Initialize an empty df of appropriate size for the output
df_pca = pd.DataFrame(np.zeros((data.shape[0] - window + 1, data.shape[1])))
# Define PCA fit-transform function
# Note: Instead of attempting to return the result,
# it is written into the previously created output array.
def rolling_pca(window_data):
    pca = PCA()
    transf = pca.fit_transform(df.iloc[window_data])
    df_pca.iloc[int(window_data[0])] = transf[0, :]
    return True
# Create a df containing row indices for the workaround
df_idx = pd.DataFrame(np.arange(df.shape[0]))
# Use `rolling` to apply the PCA function
_ = df_idx.rolling(window).apply(rolling_pca)
# The results are now contained here:
print(df_pca)
A quick check reveals that the values produced by this are identical to control values computed by slicing appropriate windows manually and running PCA on them.
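For reference, that spot check can be written in a couple of lines (the window start index 123 is just an arbitrary example):
i = 123  # arbitrary window start
manual = PCA().fit_transform(df.iloc[i:i + window])[0, :]
assert np.allclose(manual, df_pca.iloc[i].values)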

Converting between projections using pyproj in Pandas dataframe

This is undoubtedly a bit of a "can't see the wood for the trees" moment. I've been staring at this code for an hour and can't see what I've done wrong. I know it's staring me in the face but I just can't see it!
I'm trying to convert between two geographical co-ordinate systems using Python.
I have longitude (x-axis) and latitude (y-axis) values and want to convert to OSGB 1936. For a single point, I can do the following:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
x1,y1 = (-2.772048, 53.364265)
x2,y2 = pyproj.transform(inProj,outProj,x1,y1)
print(x1,y1)
print(x2,y2)
This produces the following:
-2.772048 53.364265
348721.01039783185 385543.95241055806
This seems reasonable and suggests that the longitude of -2.772048 is converted to an x co-ordinate (easting) of 348721.0103978.
In fact, I want to do this in a Pandas dataframe. The dataframe contains columns containing longitude and latitude and I want to add two additional columns that contain the converted co-ordinates (called newLong and newLat).
An exemplar dataframe might be:
latitude longitude
0 53.364265 -2.772048
1 53.632481 -2.816242
2 53.644596 -2.970592
And the code I've written is:
import numpy as np
import pandas as pd
import shapefile
import pyproj
inProj = pyproj.Proj(init='epsg:4326')
outProj = pyproj.Proj(init='epsg:27700')
df = pd.DataFrame({'longitude': [-2.772048, -2.816242, -2.970592],
                   'latitude': [53.364265, 53.632481, 53.644596]})
def convertCoords(row):
    x2, y2 = pyproj.transform(inProj, outProj, row['longitude'], row['latitude'])
    return pd.Series({'newLong': x2, 'newLat': y2})
df[['newLong', 'newLat']] = df.apply(convertCoords, axis=1)
print(df)
Which produces:
latitude longitude newLong newLat
0 53.364265 -2.772048 385543.952411 348721.010398
1 53.632481 -2.816242 415416.003113 346121.990302
2 53.644596 -2.970592 416892.024217 335933.971216
But now it seems that the newLong and newLat values have been mixed up (compared with the results of the single point conversion shown above).
Where have I got my wires crossed to produce this result? (I apologise if it's completely obvious!)
When you do df[['newLong','newLat']] = df.apply(convertCoords,axis=1), you are indexing the columns of the df.apply output. However, the column order is arbitrary because your series was defined using a dictionary (which is inherently unordered).
You can opt to return a Series with a fixed column ordering:
return pd.Series([x2, y2])
Alternatively, if you want to keep the convertCoords output labelled, then you can use .join to combine results instead:
return pd.Series({'newLong':x2,'newLat':y2})
...
df = df.join(df.apply(convertCoords, axis=1))
Please note that the transform function of pyproj also accepts arrays, which is quite useful when it comes to large dataframes, and is much faster than the lambda/apply approach:
import pandas as pd
from pyproj import Proj, transform
inProj, outProj = Proj(init='epsg:4326'), Proj(init='epsg:27700')
df['newLon'], df['newLat'] = transform(inProj, outProj,
                                       df['longitude'].tolist(),
                                       df['latitude'].tolist())
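On pyproj 2 and later the Proj(init=...)/transform style is deprecated, so under that assumption the same vectorised conversion would look roughly like this with the Transformer API:
from pyproj import Transformer
# always_xy=True keeps the (longitude, latitude) argument order
transformer = Transformer.from_crs("EPSG:4326", "EPSG:27700", always_xy=True)
df['newLong'], df['newLat'] = transformer.transform(df['longitude'].values,
                                                    df['latitude'].values)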

Making histogram with Spark DataFrame column

I am trying to make a histogram with a column from a dataframe which looks like
DataFrame[C0: int, C1: int, ...]
If I were to make a histogram with the column C1, what should I do?
Some things I have tried are
df.groupBy("C1").count().histogram()
df.C1.countByValue()
These do not work because of a mismatch in data types.
The pyspark_dist_explore package that @Chris van den Berg mentioned is quite nice. If you prefer not to add an additional dependency you can use this bit of code to plot a simple histogram.
import matplotlib.pyplot as plt
# Show histogram of the 'C1' column
bins, counts = df.select('C1').rdd.flatMap(lambda x: x).histogram(20)
# This is a bit awkward but I believe this is the correct way to do it
plt.hist(bins[:-1], bins=bins, weights=counts)
What worked for me is
df.groupBy("C1").count().rdd.values().histogram(10)  # e.g. 10 buckets; histogram() needs a bucket count
I had to convert to an RDD because the histogram method exists on pyspark.RDD but not in the Spark SQL (DataFrame) API.
You can use the histogram_numeric Hive UDAF:
import random
random.seed(323)
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
n = 3  # Number of buckets
df = sqlContext.createDataFrame(
    sc.parallelize(enumerate(random.random() for _ in range(1000))),
    ["id", "v"]
)
hists = df.selectExpr("histogram_numeric({0}, {1})".format("v", n))
hists.show(1, False)
## +------------------------------------------------------------------------------------+
## |histogram_numeric(v,3) |
## +------------------------------------------------------------------------------------+
## |[[0.2124888140177466,415.0], [0.5918851340384337,330.0], [0.8890271451209697,255.0]]|
## +------------------------------------------------------------------------------------+
You can also extract the column of interest and use the histogram method on the underlying RDD:
df.select("v").rdd.flatMap(lambda x: x).histogram(n)
## ([0.002028109534323752,
## 0.33410233677189705,
## 0.6661765640094703,
## 0.9982507912470436],
## [327, 326, 347])
Let's say your values in C1 are between 1 and 1000 and you want a histogram with 10 bins. You can do something like:
df.withColumn("bins", (df.C1 / 100).cast("int")).groupBy("bins").count()
If your binning is more complex you can make a UDF for it (and at worst, you might need to analyze the column first, e.g. by using describe or some other method).
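As a rough end-to-end sketch of that approach (assuming the 1-1000 range and equal-width bins of 100), the binned counts are small enough to collect and plot locally:
from pyspark.sql import functions as F
import matplotlib.pyplot as plt
binned = (df.withColumn("bin", F.floor(df["C1"] / 100))  # bin 0 covers 0-99, bin 1 covers 100-199, ...
            .groupBy("bin").count().orderBy("bin").collect())
plt.bar([row["bin"] for row in binned], [row["count"] for row in binned])
plt.show()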
If you want to plot the histogram, you could use the pyspark_dist_explore package:
import matplotlib.pyplot as plt
from pyspark_dist_explore import hist, pandas_histogram
fig, ax = plt.subplots()
hist(ax, df.groupBy("C1").count().select("count"))
If you would like the data in a pandas DataFrame you could use:
pandas_df = pandas_histogram(df.groupBy("C1").count().select("count"))
One easy way could be
import pandas as pd
x = df.select('symboling').toPandas() # symboling is the column for histogram
x.plot(kind='hist')
