Pyspark - cache and union - python

I am running code like this:

def cache_and_union(df1, df2):
    df1.cache()
    df = df1.union(df2)
    return df.collect()

For some reason, when the collect method runs, the cache is neither performed nor used (it does not appear in the Stages or Storage tabs). Why could this be happening?

Judging by this code example, you're not actually making use of the cache. cache() hints to Spark that df1 will be used multiple times and should therefore be kept around instead of being recomputed each time. In your code sample, you just union df1 with df2 once and collect the result, so the cached data is never reused.
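For contrast, here is a minimal sketch (with made-up example data) of a case where caching df1 does pay off, because two separate actions reuse it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.range(1000000).withColumnRenamed("id", "value")   # made-up data
df2 = spark.range(500000).withColumnRenamed("id", "value")

df1.cache()                        # hint: keep df1 around after it is first computed

total = df1.union(df2).count()     # first action materializes df1 and fills the cache
only_df1 = df1.count()             # second action reads df1 from the cache instead of recomputing

After the second action, df1 should show up under the Storage tab of the Spark UI.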

Related

Running the same Databricks Python Notebook concurrently

I am running the same notebook three times in parallel using the code below:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def notebook1_function(country, days):
    dbutils.notebook.run(path = "/pathtonotebook1/notebook1", \
                         timeout_seconds = 300, \
                         arguments = {"Country": country, "Days": days})

countries = ['US', 'Canada', 'UK']
days = [2] * len(countries)

with ThreadPoolExecutor() as executor:
    results = executor.map(notebook1_function, countries, days)
Each time, I am passing a different value for 'country' and 2 for 'days'. Inside notebook1 I have df1.
I want to know the following:
How to append all the df1's from the three concurrent runs into a single dataframe.
How to get the status [Success/Failure] of each run after completion.
Thank you in advance.
When you're using dbutils.notebook.run (so-called notebook workflows), the notebook is executed as a separate job, and the caller doesn't share anything with it: all communication happens via the parameters that you pass to the notebook, and the notebook may return only a string value specified via a call to dbutils.notebook.exit. So your code doesn't have access to the df1 inside the notebook that you're calling.
Usually, if you use such a notebook workflow, you need to persist the content of df1 from the called notebook into some table, and then read that content back in the caller notebook.
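For instance, a minimal sketch of that approach (the table name shared_results.df1_output is an assumption for illustration and the database must already exist):

Called notebook (notebook1):

# persist the result so the caller can pick it up later
df1.write.mode("append").saveAsTable("shared_results.df1_output")
dbutils.notebook.exit("Success")

Caller notebook:

# after all runs have finished, read everything back as one dataframe
df_all = spark.table("shared_results.df1_output")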
Another possibility is to extract the code from the called notebook into a function that receives arguments and returns a dataframe, include that notebook via %run, call the function with different arguments, and combine the results using union. Something like this:
Notebook 1 (called):
def my_function(country, days):
    # do something and build a dataframe
    return dataframe
Caller notebook:
%run "./Notebook 1"
df_us = my_function('US', 10)
df_canada = my_function('Canada', 10)
df_uk = my_function('UK', 10)
df_all = df_us.union(df_canada).union(df_uk)
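If you still want the three calls to run concurrently, a possible variation (a sketch only, reusing the my_function defined above) is to submit the calls to a thread pool inside the caller notebook; jobs started from different threads share the same SparkContext:

from concurrent.futures import ThreadPoolExecutor
from functools import reduce

countries = ['US', 'Canada', 'UK']

with ThreadPoolExecutor() as executor:
    dfs = list(executor.map(lambda c: my_function(c, 10), countries))

# combine the per-country dataframes into one
df_all = reduce(lambda a, b: a.union(b), dfs)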

how can I clear specific streamlit cache?

I run a Streamlit script with a few cached functions.
When I use the following code, it clears all the caches:
from streamlit import caching
caching.clear_cache()
I would like to clear only a specific cache. How can I do this?
This isn’t currently (easily) doable.
Here is an optional solution that can be applied in some cases:
You can use the allow_output_mutation option:
import streamlit as st

@st.cache(allow_output_mutation=True)
def mutable_cache():
    return some_list

mutable_object = mutable_cache()
if st.button("Clear history cache"):
    mutable_object.clear()
I wrote the returned cached object as a list, but you can use other object types as well (then you have to replace the clear method, which is specific to lists).
For more info, please look at the answers I got in the Streamlit community forum.
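To make the trick concrete, here is a small self-contained sketch (the history list and its contents are made up for illustration): the cached list survives reruns, so appending to it accumulates state, and clearing it resets just that one cache.

import streamlit as st

@st.cache(allow_output_mutation=True)
def get_history():
    return []                          # the same list object is returned on every rerun

history = get_history()
history.append("page viewed")          # mutate the cached object

st.write(str(len(history)) + " events recorded")
if st.button("Clear history cache"):
    history.clear()                    # empties the cached list without touching other caches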
FYI: Streamlit currently proposes a way to clear the cache of individual functions (still in experimental mode, though):
https://docs.streamlit.io/library/advanced-features/experimental-cache-primitives#clear-memo-and-singleton-caches-procedurally
@st.experimental_memo
def square(x):
    return x**2

if st.button("Clear Square"):
    # Clear square's memoized values:
    square.clear()

if st.button("Clear All"):
    # Clear values from *all* memoized functions:
    st.experimental_memo.clear()

PySpark 2: KMeans The input data is not directly cached

I don't know why I receive the message
WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
When I try to use Spark KMeans
df_Part = assembler.transform(df_Part)
df_Part.cache()

while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1
It says that my input (DataFrame) is not cached!
I tried printing df_Part.is_cached and I received True, which means that my dataframe is cached, so why does Spark still warn me about this?
This message is generated by o.a.s.mllib.clustering.KMeans and there is nothing you can really do about it without patching the Spark code.
Internally o.a.s.ml.clustering.KMeans:
Converts DataFrame to RDD[o.a.s.mllib.linalg.Vector].
Executes o.a.s.mllib.clustering.KMeans.
While you cache the DataFrame, the RDD which is used internally is not cached. This is why you see the warning. While it is annoying, I wouldn't worry too much about it.
This was fixed in Spark 2.2.0; see SPARK-18356.
The discussion there also suggests this is not a big deal, but the fix may reduce runtime slightly, as well as avoid the warning.
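If the warning really bothers you on an older Spark version, one possible workaround (a sketch only, not the ml pipeline used above, and assuming the assembler's output column is named "features") is to drop down to the RDD-based mllib API, where you control the caching of the vector RDD yourself:

from pyspark.mllib.clustering import KMeans as MLlibKMeans
from pyspark.mllib.linalg import Vectors as MLlibVectors

# build and cache the RDD[Vector] that KMeans actually consumes
vectors = df_Part.select("features").rdd \
    .map(lambda row: MLlibVectors.fromML(row.features)) \
    .cache()

model = MLlibKMeans.train(vectors, k)
wssse = model.computeCost(vectors)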

Pyspark multiple jobs in parallel

I have the following situation with my PySpark application:
In my driver program (driver.py), I call a function from another file (prod.py)
latest_prods = prod.featurize_prods().
Driver code:
from Featurize import Featurize
from LatestProd import LatestProd
from Oldprod import Oldprod

sc = SparkContext()

if __name__ == '__main__':
    print 'Into main'
    featurize_latest = Featurize('param1', 'param2', sc)
    latest_prod = LatestProd(featurize_latest)
    latest_prods = latest_prod.featurize_prods()

    featurize_old = Featurize('param3', 'param3', sc)
    old_prod = Oldprod(featurize_old)
    old_prods = old_prod.featurize_oldprods()

    total_prods = sc.union([latest_prods, old_prods])
Then I do some reduceByKey code here... that generates total_prods_processed.
Finally I call:
total_prods_processed.saveAsTextFile(...)
I would like to generate latest_prods and old_prods in parallel. Both are created in the same SparkContext. Is it possible to do that? If not, how can I achieve that functionality?
Is this something that Spark does automatically? I am not seeing this behavior when I run the code, so please let me know if it is a configuration option.
After searching on the internet, I think your problem can be addressed with threads. It is as simple as creating two threads for your old_prod and latest_prod work.
Check this post for a simplified example. Since Spark is thread-safe, you gain the parallel efficiency without sacrificing anything.
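A minimal sketch of that idea (the build_latest_prods/build_old_prods helpers stand in for the featurize calls above and are assumptions for illustration). Because RDD transformations are lazy, each thread triggers an action (cache plus count) so the expensive featurization actually runs as its own job; jobs submitted from different threads share the same SparkContext and can execute concurrently if the cluster has free resources:

from concurrent.futures import ThreadPoolExecutor

def build_latest_prods():
    rdd = latest_prod.featurize_prods()
    rdd.cache()
    rdd.count()          # action: forces this featurization job to run now
    return rdd

def build_old_prods():
    rdd = old_prod.featurize_oldprods()
    rdd.cache()
    rdd.count()
    return rdd

with ThreadPoolExecutor(max_workers=2) as executor:
    latest_future = executor.submit(build_latest_prods)
    old_future = executor.submit(build_old_prods)
    latest_prods = latest_future.result()
    old_prods = old_future.result()

total_prods = sc.union([latest_prods, old_prods])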
The short answer is no, you can't schedule operations on two distinct RDDs at the same time in the same Spark context. However, there are some workarounds: you could process them in two distinct SparkContexts on the same cluster and call saveAsTextFile, then read both in another job to perform the union (this is not recommended by the documentation).
If you want to try this method, it is discussed here using spark-jobserver, since Spark doesn't support multiple contexts by default: https://github.com/spark-jobserver/spark-jobserver/issues/147
However, given the operations you perform, there would be no reason to process both at the same time, since you need the full results to perform the union; Spark will split those operations into two different stages that are executed one after the other.

Aerospike Python Client. UDF module to count records. Cannot register module

I am currently implementing the Aerospike Python Client in order to benchmark it along with our Redis implementation, to see which is faster and/or more stable.
I'm still taking baby steps, currently unit-testing basic functionality, for example whether I correctly add records to my set. For that reason, I want to create a function to count them.
I saw in Aerospike's documentation that:
"to perform an aggregation on query, you first need to register a UDF
with the database".
It seems that this is the suggested way that aggregations, counting and other custom functionality should be run in Aerospike.
Therefore, to count the records in a set I have, I created the following module:
# "counter.lua"
function count(s)
return s : map(function() return 1 end) : reduce (function(a,b) return a+b end)
end
I'm trying to use the Aerospike Python client's function to register a UDF (User Defined Function) module:
udf_put(filename, udf_type, policy)
My code is as follows:
# aerospike_client.py:
# "udf_put" parameters
policy = {'timeout': 1000}
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua") #same folder
udf_type = aerospike.UDF_TYPE_LUA # equals to "0", which is for "Lua"
self.client.udf_put(lua_module, udf_type, policy) # Exception is thrown here
query = self.client.query(self.aero_namespace, self.aero_set)
query.select()
result = query.apply('counter', 'count')
An exception is thrown:
exceptions.Exception: (-2L, 'Filename should be a string', 'src/main/client/udf.c', 82)
Is there anything I'm missing or doing wrong?
Is there a way to "debug" it without compiling C code?
Is there any other suggested way to count the records in my set? Or am I fine with the Lua module?
First, I'm not seeing that exception, but I am seeing a bug with udf_put where the module is registered but the python process hangs. I can see the module appear on the server using AQL's show modules.
I opened a bug with the Python client's repo on Github, aerospike/aerospike-client-python.
There's a best practices document regarding UDF development here: https://www.aerospike.com/docs/udf/best_practices.html
In general, using a stream UDF to aggregate the records through the count function is the correct way to go about it. There are examples here and here.
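For completeness, here is a hedged end-to-end sketch of how registration plus aggregation could look (the connection config and the namespace/set names 'test'/'demo' are assumptions for illustration):

import os
import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}           # assumed local server
client = aerospike.client(config).connect()

# register the stream UDF module; the client expects a plain str path
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua")
client.udf_put(lua_module, aerospike.UDF_TYPE_LUA, {'timeout': 1000})

# run the aggregation: apply counter.count to the stream of records in the set
query = client.query('test', 'demo')
query.apply('counter', 'count')
result = query.results()                            # aggregated values, e.g. [42]

print(result)
client.close()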
