I have a quite complex Apache PySpark pipeline which performs several transformations on a (very large) set of text files. The intended outputs of my pipeline are several different stages of the pipeline. What is the best way (i.e., more efficient, but also more sparkling, in the sense of better fitting the Spark programming model and style) to do this?
Right now, my code looks like the following:
# initialize the pipeline and perform the first set of transformations.
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)
# first checkpoint: the `first_serialization` function serializes
# the data into a properly formatted string.
rdd.map(first_serialization).saveAsTextFile("ckpt1")
# here, I have to read again from the previously saved checkpoint
# using a `first_deserialization` function that deserializes what has
# been serialized by the `first_serialization` function. Then it performs
# other transformations.
rdd = ctx.textFile("ckpt1").map(...).map(...)
and so on. I would like to get rid of the serialization methods and of the multiple saves/reads -- by the way, does this impact efficiency? I assume it does.
Any hint?
Thanks in advance.
This seems obviously simple, because it is, but I would recommend writing out the intermediate stages while continuing to reuse the existing RDD (side bar: use Datasets/DataFrames instead of RDDs to get more performance) and then continuing to process, writing out intermediate results as you go.
There's no need to pay the penalty of reading from disk/network when you already have the data processed (ideally even cached!) for further usage.
Example using your own code:
# initialize the pipeline and perform the first set of transformations.
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)
# first checkpoint: the `first_serialization` function serializes
# the data into a properly formatted string.
string_rdd = rdd.map(first_serialization)
string_rdd.saveAsTextFile("ckpt1")
# reuse the existing RDD after writing out the intermediate results
rdd = rdd.map(...).map(...)
# `rdd` here is the same variable we used to create the `string_rdd` results above.
# Alternatively, you may want to use the `string_rdd` variable here instead of the original `rdd` variable.
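If you also cache the shared RDD before the first write (as hinted above), the follow-up transformations won't recompute the whole lineage from the input files. A minimal sketch, reusing the same (hypothetical) first_serialization function:
# cache the shared RDD so that both the checkpoint write and the later
# transformations reuse the computed data instead of re-reading the input
rdd = ctx.textFile(...).map(...).map(...).cache()
string_rdd = rdd.map(first_serialization)
string_rdd.saveAsTextFile("ckpt1")
# continue from the cached RDD without ever touching "ckpt1" again
rdd = rdd.map(...).map(...)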
I want to create a gensim dictionary from the rows of a dataframe. Each entry of df.preprocessed_text is a list of words.
from gensim.models.phrases import Phrases, Phraser
from gensim.corpora.dictionary import Dictionary
def create_dict(df, bigram=True, min_occ_token=3):
    token_ = df.preprocessed_text.values
    if not bigram:
        return Dictionary(token_)
    bigram = Phrases(token_,
                     min_count=3,
                     threshold=1,
                     delimiter=b' ')
    bigram_phraser = Phraser(bigram)
    bigram_token = []
    for sent in token_:
        bigram_token.append(bigram_phraser[sent])
    dictionary = Dictionary(bigram_token)
    dictionary.filter_extremes(no_above=0.8, no_below=min_occ_token)
    dictionary.compactify()
    return dictionary
I couldn't find a progress-bar option for it, and the callbacks don't seem to work for it either. Since my corpus is huge, I would really appreciate a way to show the progress. Is there any?
I'd recommend against changing prune_at for monitoring purposes, as it changes the behavior around which bigrams/words are remembered, possibly discarding many more than is strictly required for capping memory usage.
Wrapping tqdm around the iterables used (including the token_ use in the Phrases constructor and the bigram_token use in the Dictionary constructor) should work.
Alternatively, enabling INFO or greater logging should display logging that, while not as pretty/accurate as a progress-bar, will give some indication of progress.
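For example, a minimal logging setup (gensim emits its progress messages at INFO level):
import logging

# print gensim's own progress/status messages to the console
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)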
Further, if, as shown in the code, bigram_token is only used to build the next Dictionary, it need not be created as a full in-memory list. You should be able to just use layered iterators to transform the text and tally the Dictionary item by item, e.g.:
# ...
dictionary = Dictionary(tqdm(bigram_phraser[token_]))
# ...
(Also, if you're only using the Phraser once, you may not be getting any benefit from creating it at all - it's an optional memory optimization for when you want to keep applying the same phrase-creation operation without the full overhead of the original Phrases survey object. But if the Phrases is still in-scope, and all of it will be discarded immediately after this step, it might be just as fast to use the Phrases object directly without ever taking a detour to create the Phraser - so give that a try.)
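For example, a rough sketch combining both ideas, using the variable names from the function above and skipping the Phraser step entirely (untested, so treat it as a starting point rather than a drop-in replacement):
from tqdm import tqdm

# survey the corpus once to learn the bigrams, with a progress bar
bigram = Phrases(tqdm(token_), min_count=3, threshold=1, delimiter=b' ')
# stream the phrased sentences straight into the Dictionary,
# without materialising bigram_token as an in-memory list
dictionary = Dictionary(tqdm(bigram[token_]))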
In Python, I'm trying to merge multiple JSON files obtained from TinyDB.
I was not able to find a way to directly merge two TinyDB JSON files so that the auto-generated keys continue in sequence instead of restarting when the next file is opened.
In code terms, I want to merge a large amount of data like this:
hello1 = {"1":"bye", "2":"good", ..., "20000":"goodbye"}
hello2 = {"1":"dog", "2":"cat", ..., "15000":"monkey"}
As:
hello3 = {"1":"bye", "2":"good", ..., "20000":"goodbye", "20001":"dog", "20002":"cat", ..., "35000":"monkey"}
Because I couldn't find the correct way to do it with TinyDB, I opened the files and treated them simply as classic JSON files, loading each one and then doing:
Data = Data['_default']
The problem I have is that, at the moment, the code works, but it has serious memory problems. After a few seconds the merged DB contains about 28 MB of data, but then (probably) the cache saturates and it starts adding the remaining data really slowly.
So I need to empty the cache after a certain amount of data, or probably I need to change the way I do this!
This is the code that I use:
Try1.purge()
Try1 = TinyDB('FullDB.json')

with open('FirstDataBase.json') as Part1:
    Datapart1 = json.load(Part1)
    Datapart1 = Datapart1['_default']
    for dets in range(1, len(Datapart1)):
        Try1.insert(Datapart1[str(dets)])

with open('SecondDatabase.json') as Part2:
    Datapart2 = json.load(Part2)
    Datapart2 = Datapart2['_default']
    for dets in range(1, len(Datapart2)):
        Try1.insert(Datapart2[str(dets)])
Question: Merge Two TinyDB Databases ... probably I need to change the way to do this!
From TinyDB Documentation
Why Not Use TinyDB?
...
You are really concerned about performance and need a high speed database.
Single-row insertions into a DB are always slow; try db.insert_multiple(...) instead.
The second variant below, using a generator, gives you the option to keep the memory footprint down.
# From a list of documents
Try1.insert_multiple([{"1": "bye"}, {"2": "good"}, ..., {"20000": "goodbye"}])
or
# From generator function
Try1.insert_multiple(generator())
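A rough sketch of what that generator could look like for the merge in the question, assuming Datapart1 and Datapart2 are the '_default' dicts loaded from the two JSON files (TinyDB's auto-generated keys run from "1" upward):
def generator():
    # yield every document from both source files one at a time,
    # so the merged data never has to sit in a single in-memory list
    for part in (Datapart1, Datapart2):
        for key in range(1, len(part) + 1):
            yield part[str(key)]

# TinyDB assigns fresh, consecutive doc ids while inserting
Try1.insert_multiple(generator())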
I want to create sequences of my dataset. However, Tensorflow only provides the function:
tf.parse_single_example()
I tried to work around this problem by using tf.py_func and something like this:
dataset.map(lambda x: tf.py_func(_parse_tf_record, [x, sequence_length]))
for sequence_id in range(0, sequence_length):
    filename = x
    # files only contain one record
    for record in tf.python_io.tf_record_iterator(filename, options):
        ...
        tf.parse_single_example()
        ...
        break  # only one sample per file
So for every map call I read sequence_length files. However, this cannot be done in parallel, since tf.py_func does not allow for it.
A TensorFlow Example is a single conceptual unit and it should be independent of the other examples (so that batching and shuffling work properly).
If you want more data to be grouped together you should write it as a single example.
To make things easier there is tf.train.SequenceExample, which works with tf.parse_single_sequence_example. It has a context part that is common to all entries in the sequence and a sequence part that is repeated for every step. This is commonly used when working with recurrent networks (LSTMs and the like), but you can use it whenever it makes sense in your context.
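A minimal sketch (TF 1.x style, to match the question; the "label"/"tokens" feature names and the file name are just placeholders): one SequenceExample holds the whole sequence, and tf.parse_single_sequence_example reads it back inside the dataset pipeline.
import tensorflow as tf

# writing: pack an entire sequence into a single SequenceExample
def make_sequence_example(label, token_ids):
    context = tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })
    feature_lists = tf.train.FeatureLists(feature_list={
        "tokens": tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=[t]))
            for t in token_ids
        ]),
    })
    return tf.train.SequenceExample(context=context, feature_lists=feature_lists)

# reading: parse it back, one sequence per record
def _parse_sequence(serialized):
    context, sequence = tf.parse_single_sequence_example(
        serialized,
        context_features={"label": tf.FixedLenFeature([], tf.int64)},
        sequence_features={"tokens": tf.FixedLenSequenceFeature([], tf.int64)},
    )
    return context["label"], sequence["tokens"]

dataset = tf.data.TFRecordDataset(["sequences.tfrecord"]).map(_parse_sequence)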
For a pandas dataframe, the info() function provides memory usage.
Is there any equivalent in PySpark?
Thanks
Try to use the _to_java_object_rdd() function:
import py4j.protocol
from py4j.protocol import Py4JJavaError
from py4j.java_gateway import JavaObject
from py4j.java_collections import JavaArray, JavaList
from pyspark import RDD, SparkContext
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
# your dataframe -- the one whose size you'd like to estimate
df
# Helper function to convert Python objects to Java objects
def _to_java_object_rdd(rdd):
    """ Return a JavaRDD of Object by unpickling
    It will convert each Python object into Java object by Pyrolite, whenever the
    RDD is serialized in batch or not.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
# First you have to convert it to an RDD
JavaObj = _to_java_object_rdd(df.rdd)
# Now we can run the estimator
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
I have something in mind, but it's just a rough estimation. As far as I know, Spark doesn't have a straightforward way to get a dataframe's memory usage, but a pandas dataframe does. So what you can do is:
Select a 1% sample of the data: sample = df.sample(fraction = 0.01)
Convert it to pandas: pdf = sample.toPandas()
Get the pandas dataframe's memory usage with pdf.info()
Multiply that value by 100; this should give a rough estimate of your whole Spark dataframe's memory usage.
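Putting those steps together, a rough sketch (df is assumed to be your Spark dataframe; memory_usage(deep=True) is used instead of info() so the number can be used directly in code):
# take a 1% sample, bring it to the driver as pandas, and extrapolate
sample_pdf = df.sample(fraction=0.01).toPandas()
sample_bytes = sample_pdf.memory_usage(deep=True).sum()
estimated_total_bytes = sample_bytes * 100
print(estimated_total_bytes)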
Correct me if I am wrong :|
As per the documentation:
The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.
To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.
For the dataframe df you can do this:
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
You can persist the dataframe in memory and take an action such as df.count(). You would then be able to check its size under the Storage tab of the Spark web UI. Let me know if it works for you.
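A minimal sketch of that approach:
from pyspark import StorageLevel

# materialise the dataframe in memory, then check the Storage tab in the web UI
df.persist(StorageLevel.MEMORY_ONLY)
df.count()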
How about the below? Check the cached sample's size in the web UI (shown in KB) and multiply by 100 to get the estimated real size.
df.sample(fraction = 0.01).cache().count()
I have partitioned an RDD and would like each partition to be saved to a separate file with a specific name. This is the repartitioned RDD I am working with:
# Repartition to # key partitions and map each row to a partition given their key rank
my_rdd = df.rdd.partitionBy(len(keys), lambda row: int(row[0]))
Now, I would like to saveAsTextFile() on each partition. Naturally, I should do something like
my_rdd.foreachPartition(lambda iterator_obj: save_all_items_to_text_fxn)
However, as a test, I defined save_all_items_to_text_fxn() as follows:
def save_all_items_to_text_fxn(iterator_obj):
    print 'Test'
... and I noticed that it's really only called twice instead of |partitions| times.
I would like to find out if I am on the wrong track. Thanks
I would like to find out if I am on the wrong track.
Well, it looks like you are. You won't be able to call saveAsTextFile on a partition iterator (not to mention from inside any action or transformation), so the whole idea doesn't make sense. It is not impossible to write to HDFS from Python code using external libraries, but I doubt it is worth all the fuss.
Instead you can handle this using standard Spark tools:
An expensive way
def filter_partition(x):
    def filter_partition_(i, iter):
        return iter if i == x else []
    return filter_partition_

for i in range(rdd.getNumPartitions()):
    tmp = rdd.mapPartitionsWithIndex(filter_partition(i)).coalesce(1)
    tmp.saveAsTextFile('some_name_{0}'.format(i))
A cheap way.
Each partition is already saved to a single file with a name corresponding to its partition number (part-00000, part-00001, and so on). That means you can simply save the whole RDD using saveAsTextFile and rename the individual files afterwards.
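For example, a minimal sketch of the rename step, assuming the RDD was written to a directory on the local filesystem (the directory and target names are placeholders):
import glob
import os

my_rdd.saveAsTextFile('all_partitions')

# rename part-00000, part-00001, ... to some_name_0, some_name_1, ...
for path in sorted(glob.glob(os.path.join('all_partitions', 'part-*'))):
    index = int(os.path.basename(path).split('-')[1])
    os.rename(path, os.path.join('all_partitions', 'some_name_{0}'.format(index)))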