I don't know why I receive the message
WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
When I try to use Spark KMeans
df_Part = assembler.transform(df_Part)
df_Part.cache()
while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1
It says that my input (DataFrame) is not cached!
I tried printing df_Part.is_cached and got True, which means that my DataFrame is cached. So why does Spark still warn me about this?
This message is generated by o.a.s.mllib.clustering.KMeans and there is nothing you can really do about it without patching the Spark code.
Internally, o.a.s.ml.clustering.KMeans:
- converts the DataFrame to an RDD[o.a.s.mllib.linalg.Vector], and
- executes o.a.s.mllib.clustering.KMeans.
While you cache the DataFrame, the RDD that is used internally is not cached. This is why you see the warning. While it is annoying, I wouldn't worry too much about it.
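To make that hand-off concrete, here is a simplified illustrative sketch of what happens internally (not the actual Spark source; it assumes your assembler writes to the default "features" column):

# Rough sketch of what the ml wrapper does under the hood (illustrative only):
cached_df = df_Part                                        # the DataFrame you cached
vector_rdd = cached_df.rdd.map(lambda row: row.features)   # a fresh, uncached RDD
# o.a.s.mllib.clustering.KMeans then runs on vector_rdd; it checks
# vector_rdd.getStorageLevel(), sees StorageLevel.NONE, and logs the warning.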
This was fixed in Spark 2.2.0. Here is the relevant ticket: SPARK-18356.
The discussion there also suggests this is not a big deal, but the fix may reduce runtime slightly, as well as avoid the warning.
I'm trying to understand why the following simple example doesn't successfully complete execution and seems to get stuck on the first line of really_simple_func (on Ubuntu machines, but not Windows). The code is:
import torch as t
import numpy as np
import multiprocessing as mp          # I've tried both multiprocessing
# import torch.multiprocessing as mp  # and torch.multiprocessing

def really_simple_func():
    temp_val_2 = t.tensor(np.zeros(425447)[0:400000])  # this is the line that blocks.
    return 4.3

if __name__ == "__main__":
    print("Run brief starting")
    some_zeros = np.zeros(425447)
    temp_val = t.tensor(some_zeros[0:400000])  # DELETE THIS LINE TO MAKE IT WORK
    pool = mp.Pool(processes=1)
    job = pool.apply_async(really_simple_func)
    print("just before job.get()")
    result = job.get()
    print("Run brief completed. Reward = {}".format(result))
I have torch 1.11.0 installed, numpy 1.22.3 and have tried both CPU and GPU versions of Torch. When I run this code on two different Ubuntu machines, I get the following output:
Run brief starting
just before job.get()
However, the code never successfully completes (doesn't print the "Run brief completed" line). (It does complete on a third Windows box).
On the Ubuntu machines, if I delete the line with the comment "#DELETE THIS LINE TO MAKE IT WORK" the execution DOES complete, printing the final line as expected. Similarly, if I leave the line defining temp_val in but delete the line with the comment "This is the line that blocks" it will also complete. Moreover, if I reduce the size of the temp_val tensor (say from 400000 to 4000) it will also complete successfully. Finally, it is worth noting that while I can reproduce this behaviour on two different Ubuntu machines, this code does actually complete on my Windows machine - though, as far as I can tell, the versions of key packages, such as torch, are the same.
I don't understand this behaviour. I suspect it is something to do with the way torch allocates memory or stores information. I've tried calling del temp_val to free up memory, but that doesn't seem to fix things. It seems to me that the async call to t.tensor within really_simple_func is stopped from completing if there has already been a call to t.tensor in the main code block, creating a sufficiently large tensor.
I don't understand why this is happening, or even if that is the correct explanation. In any case, what would be best practice if I do need to do some tensor processing within apply_async as well as in the main thread? More generally, what is Torch waiting on when I make a call to t.tensor?
(Obviously, this is just the simplest version of the real code I'm trying to get to work that reproduced this issue. I realise that calling mp.Pool with only one process doesn't really make sense...nor, indeed, does using apply_async to call a function that returns a constant!)
Unfortunately, I cannot provide any answers to your questions.
I can, however, share experiences with seemingly the same issue. I use a Linux machine with torch 1.8.1 and numpy 1.19.2.
When I run the following code on my machine:
with Pool(max_pool) as p:
    pool_outputs = list(
        tqdm(
            p.imap(lambda f: get_model_results_per_query_file(get_preds, tokenizer, f), query_files),
            total=len(query_files)
        )
    )
For which the function get_model_results_per_query_file contains operations similar to the following:
feats = features.unsqueeze(0).repeat(batch_size, 1, 1).to(device)
(features is a torch tensor)
The first round of jobs fails automatically, and new ones are immediately started (which, for some reason, do not fail). The whole process never completes, though, since the main process still seems to be waiting for the first failed jobs.
If I remove the lines in my code involving the repeat function, no jobs fail.
I managed to solve my issue and preserve the same results by adapting a similar solution to yours:
feats = torch.as_tensor(np.tile(features, (batch_size, 1, 1))).to(device)
I believe as_tensor works in a similar fashion to from_numpy in this case.
I only managed to find this solution thanks to your post and your proposed workaround, so thank you!
After some further exploration, here is a brief answer to my own question.
While I still don't fully understand the blocking behaviour (and would welcome any further explanation), I have just seen that the way I'm generating torch tensors from a numpy array is not correct.
In particular, instead of using torch.tensor(temp_val) where temp_val is a numpy array, I should be using torch.from_numpy(temp_val). Doing this fixes the problem.
Alternatively, I can convert temp_val into a list and then create the tensor via torch.tensor(temp_val_as_list) - which also avoids the issue.
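For reference, here is a minimal before/after sketch of that change (using the same array sizes as in the question):

import numpy as np
import torch as t

temp_val = np.zeros(425447)[0:400000]

# What I was doing (the pattern that led to the blocking behaviour):
# blocked = t.tensor(temp_val)

# What fixed it for me:
fixed_a = t.from_numpy(temp_val)       # build the tensor directly from the numpy array
fixed_b = t.tensor(temp_val.tolist())  # or convert to a plain Python list first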
I am using this article https://actsusanli.medium.com/ to implement the Doc2Vec model and I have a problem in the training step.
model_dbow.train(utils.shuffle([x for x in tqdm(train_tagged.values)]), total_examples=len(train_tagged.values), epochs = 40)
As you can see, I am using the tqdm function. When I run the code, tqdm reaches 100% after a few minutes, but the algorithm still keeps running in the same shell for a long time.
Do you have any idea whether this is a problem with the tqdm function or something else?
By using the "list comprehension" ([..])...
[x for x in tqdm(train_tagged.values)]
...you are having tqdm iterate once over your train_tagged.values sequence, materializing it into an actual in-memory Python list. This will show the tqdm progress rather quickly; after that, tqdm's involvement is completely finished.
Then, you're passing that plain result list (without any tqdm features) into Doc2Vec.train(), where Doc2Vec does its epochs=40 training passes. tqdm is no longer involved, so there'll be no incremental progress-bar output.
You might be tempted to try (or have already tried) something that skips the extra list creation, passing the tqdm-wrapped sequence directly in like:
corpus = utils.shuffle(train_tagged.values)
model_dbow.train(tqdm(corpus), total_examples=len(corpus), epochs = 40)
But this has a different problem: the tqdm-wrapper is only designed to allow (& report the progress of) one iteration over the wrapped sequence. So this will show that one iteration's incremental progress.
But when .train() tries its next necessary 39 re-iterations, to complete its epochs=40 training-runs, the single-pass tqdm object will be exhausted, preventing full & proper training.
Note that there is an option for progress-logging within Gensim, by setting the Python logging level (globally, or just for the class Doc2Vec) to INFO. Doc2Vec will then emit a log-line showing progress, within each epoch and between epochs, about every 1 second. But: you can also make such logging less-frequent by supplying a different seconds value to the optional report_delay argument of .train(), for example report_delay=60 (for a log line every minute instead of every second).
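For example, a minimal sketch of that logging route might look like this (assuming a Gensim version whose .train() accepts report_delay, as described above, and reusing the corpus variable from the earlier snippet):

import logging

# Global INFO-level logging; Gensim's training loop will then emit progress lines.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)

model_dbow.train(corpus, total_examples=len(corpus), epochs=40,
                 report_delay=60)  # roughly one progress log line per minute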
If you really want a progress-bar, it should be possible to use tqdm - but you will have to work around its assumption that the iterable you're wrapping with tqdm() will only be iterated over once.
I believe there'd be two possible approaches, each with different tradeoffs:
(1) Instead of letting .train() repeat the corpus N times, do it yourself - adjusting the other .train() parameters accordingly. Roughly, that'd mean changing a line like...
model.train(corpus, total_examples=len(corpus), epochs=40)
...into something that turns your desired 40 epochs into something that looks like just one iteration to both tqdm & Gensim's .train(), like...
import itertools

repeated_corpus = itertools.chain(*[corpus] * 40)
repeated_len = 40 * len(corpus)
model.train(tqdm(repeated_corpus, total=repeated_len), total_examples=repeated_len, epochs=1)
(Note that you now have to give tqdm a hint as to the sequence's length, because the one-time chained-iterator from itertools.chain() doesn't report its own length.)
Then you'll get one progress-bar across the whole training corpus - which the model is now seeing as one pass over a larger corpus, but ultimately involves the same 40 passes.
You'll want to reinterpret any remaining log lines with this change in mind, and you'll lose a chance to install your own per-epoch callbacks via the model's end-of-epoch callback mechanism. (But, that's a seldom-used feature, anyway.)
(2) Instead of wrapping the corpus with a single tqdm() (which can only show a progress-bar for one-iteration), wrap the corpus as a new fully-re-iterable object that itself will start a new tqdm() each time. For example, something like:
class TqdmEveryIteration(object):

    def __init__(self, inner_iterable):
        self.inner_iterable = inner_iterable

    def __iter__(self):
        return iter(tqdm(self.inner_iterable))
Then, using this new extra tqdm-adding wrapper, you should be able to do:
corpus = utils.shuffle(train_tagged.values)
model_dbow.train(TqdmEveryIteration(corpus), total_examples=len(corpus), epochs = 40)
In this case, you should get one progress bar per epoch, because a new tqdm() wrapper will be started each training pass.
(If you try either of these approaches & they work well, please let me know! They should be roughly correct, but I haven't tested them yet.)
Separately: if the article from the author at actsusanli.medium.com that you're modeling your work on is...
https://towardsdatascience.com/multi-class-text-classification-with-doc2vec-logistic-regression-9da9947b43f4
...note that it's using an overly-complex & fragile anti-pattern, calling .train() multiple times in a loop with manual alpha management. That has problems as described in this other answer. But that approach would also have the side-effect of re-wrapping the corpus each time in a new tqdm (like the TqdmEveryIteration class above), so despite its other issues, would achieve one actual progress-bar each call to .train().
(I sent the author a private note via Medium about a month ago about this problem.)
I am running a code like this:
def cache_and_union(df1, df2):
    df1.cache()
    df = df1.union(df2)
    return df.collect()
For some reason, when the collect method runs, the cache is not populated or used (it does not appear in either the Stages or the Storage tab). Why could this be happening?
Judging by this code example, you're not actually using the cache in any way. cache tells Spark that df1 will be used multiple times, so it should be kept around rather than recomputed each time. In your code sample, you union df1 with df2 once and collect the result, so there is never a second use of df1 that would read from the cache.
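For contrast, here is a minimal hypothetical sketch (not your code) of a case where the cache actually gets populated and reused, because df1 is evaluated more than once:

def cache_and_union_twice(df1, df2):
    df1.cache()             # mark df1 for caching (lazy, nothing happens yet)
    df1.count()             # first action materializes df1 into the cache
    df = df1.union(df2)     # this second use of df1 now reads from the cache
    return df.collect()

With that second action in place, df1 should show up in the Storage tab and the cached stages become visible.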
I have the following situation with my Pyspark:
In my driver program (driver.py), I call a function from another file (prod.py):
latest_prods = prod.featurize_prods()
Driver code:
from pyspark import SparkContext

from Featurize import Featurize
from LatestProd import LatestProd
from Oldprod import Oldprod

sc = SparkContext()

if __name__ == '__main__':
    print 'Into main'

    featurize_latest = Featurize('param1', 'param2', sc)
    latest_prod = LatestProd(featurize_latest)
    latest_prods = latest_prod.featurize_prods()

    featurize_old = Featurize('param3', 'param3', sc)
    old_prod = Oldprod(featurize_old)
    old_prods = old_prod.featurize_oldprods()

    total_prods = sc.union([latest_prods, old_prods])
Then I do some reduceByKey code here... that generates total_prods_processed.
Finally I call:
total_prods_processed.saveAsTextFile(...)
I would like to generate latest_prods and old_prods in parallel. Both are created in the same SparkContext. Is it possible to do that? If not, how can I achieve that functionality?
Is this something that does Spark automatically? I am not seeing this behavior when I run the code so please let me know if it is a configuration option.
After searching on the internet, I think your problem can be addressed with threads. It is as simple as creating two threads, one for your old_prod work and one for your latest_prod work, as in the sketch below.
Check this post for a simplified example. Since Spark is thread-safe, you gain the parallel efficiency without sacrificing anything.
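A rough sketch of what that could look like (hedged: the wrapper functions and the results dict are mine, not from the linked post; latest_prod, old_prod and sc are the objects from your driver):

import threading

results = {}

def run_latest():
    results['latest'] = latest_prod.featurize_prods()

def run_old():
    results['old'] = old_prod.featurize_oldprods()

# Two threads sharing the same SparkContext.
t1 = threading.Thread(target=run_latest)
t2 = threading.Thread(target=run_old)
t1.start()
t2.start()
t1.join()
t2.join()

total_prods = sc.union([results['latest'], results['old']])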
The short answer is no, you can't schedule operations on two distinct RDDs at the same time in the same Spark context. However, there are some workarounds: you could process them in two distinct SparkContexts on the same cluster and call saveAsTextFile, then read both back in another job to perform the union. (This is not recommended by the documentation.)
If you want to try this method, it is discussed here using spark-jobserver, since Spark doesn't support multiple contexts by default: https://github.com/spark-jobserver/spark-jobserver/issues/147
However, given the operations you perform, there would be no reason to process both at the same time: since you need the full results to perform the union, Spark will split those operations into 2 different stages that will be executed one after the other.
I am currently implementing the Aerospike Python Client in order to benchmark it along with our Redis implementation, to see which is faster and/or more stable.
I'm still taking baby steps, currently unit-testing basic functionality, for example whether I correctly add records to my set. For that reason, I want to create a function to count them.
I saw in Aerospike's documentation that:
"to perform an aggregation on query, you first need to register a UDF
with the database".
It seems that this is the suggested way that aggregations, counting and other custom functionality should be run in Aerospike.
Therefore, to count the records in a set I have, I created the following module:
# "counter.lua"
function count(s)
return s : map(function() return 1 end) : reduce (function(a,b) return a+b end)
end
I'm trying to use the Aerospike Python client's function to register a UDF (User-Defined Function) module:
udf_put(filename, udf_type, policy)
My code is as follows:
# aerospike_client.py:
# "udf_put" parameters
policy = {'timeout': 1000}
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua") #same folder
udf_type = aerospike.UDF_TYPE_LUA # equals to "0", which is for "Lua"
self.client.udf_put(lua_module, udf_type, policy) # Exception is thrown here
query = self.client.query(self.aero_namespace, self.aero_set)
query.select()
result = query.apply('counter', 'count')
An exception is thrown:
exceptions.Exception: (-2L, 'Filename should be a string', 'src/main/client/udf.c', 82)
Is there anything I'm missing or doing wrong?
Is there a way to "debug" it without compiling C code?
Is there any other suggested way to count the records in my set? Or I'm fine with the Lua module?
First, I'm not seeing that exception, but I am seeing a bug with udf_put where the module is registered but the Python process hangs. I can see the module appear on the server using AQL's show modules.
I opened a bug with the Python client's repo on Github, aerospike/aerospike-client-python.
There's a best practices document regarding UDF development here: https://www.aerospike.com/docs/udf/best_practices.html
In general, using a stream UDF to aggregate the records through the count function is the correct way to go about it. There are examples here and here.
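As a rough end-to-end sketch of that flow (hedged: the namespace and set names are placeholders, and I'm assuming query.results() is how your client version returns the aggregation output; double-check that against the client docs):

import os
import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

# Register the Lua stream UDF once.
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'counter.lua')
client.udf_put(lua_module, aerospike.UDF_TYPE_LUA, {'timeout': 1000})

# Aggregate: map every record to 1, then reduce by summing.
query = client.query('my_namespace', 'my_set')
query.apply('counter', 'count')
result = query.results()   # expected: a list like [record_count]
print(result)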