How to save partitions to files of a specific name? - python

I have partitioned an RDD and would like each partition to be saved to a separate file with a specific name. This is the repartitioned RDD I am working with:
# Repartition to # key partitions and map each row to a partition given their key rank
my_rdd = df.rdd.partitionBy(len(keys), lambda row: int(row[0]))
Now, I would like to saveAsTextFile() on each partition. Naturally, I should do something like
my_rdd.foreachPartition(lambda iterator_obj: save_all_items_to_text_fxn)
However, as a test, I defined save_all_items_to_text_fxn() as follows:
def save_all_items_to_text_fxn(iterator_obj):
    print('Test')
... and I noticed that it's really only called twice instead of |partitions| number of times.
I would like to find out if I am on the wrong track. Thanks

I would like to find out if I am on the wrong track.
Well, it looks like you are. You won't be able to call saveAsTextFile on a partition iterator (not to mention from inside any action or transformation), so the whole idea doesn't make sense. It is not impossible to write to HDFS from Python code using external libraries, but I doubt it is worth all the fuss.
Instead you can handle this using standard Spark tools:
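An expensive way
def filter_partition(x):
    # Build a function that keeps only the rows of partition x
    def filter_partition_(i, iter):
        return iter if i == x else []
    return filter_partition_
# One pass over the RDD per partition, which is why this is expensive
for i in range(rdd.getNumPartitions()):
    tmp = rdd.mapPartitionsWithIndex(filter_partition(i)).coalesce(1)
    tmp.saveAsTextFile('some_name_{0}'.format(i))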
A cheap way.
Each partition is saved to a single file with a name corresponding to the partition number (part-00000, part-00001, ...). It means you can simply save the whole RDD using saveAsTextFile and rename the individual files afterwards.
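The renaming step could be sketched roughly as follows. This is only a minimal sketch: it assumes the output directory sits on a local filesystem (not HDFS) and that the files are uncompressed, so their names are exactly part-NNNNN. The helper rename_part_files and its name_for_partition callback are hypothetical names made up for illustration.

```python
import os
import re

def rename_part_files(output_dir, name_for_partition):
    """Rename part-NNNNN files in output_dir and return the new file names.

    name_for_partition is a hypothetical callback mapping a partition
    index (int) to the desired file name.
    """
    renamed = []
    for file_name in sorted(os.listdir(output_dir)):
        match = re.fullmatch(r"part-(\d+)", file_name)
        if match is None:
            continue  # skip _SUCCESS, checksum files and anything else
        index = int(match.group(1))
        new_name = name_for_partition(index)
        os.rename(os.path.join(output_dir, file_name),
                  os.path.join(output_dir, new_name))
        renamed.append(new_name)
    return renamed
```

After my_rdd.saveAsTextFile('some_dir') one might call rename_part_files('some_dir', lambda i: 'some_name_{0}'.format(i)). On HDFS the same idea would need the HDFS client tools instead of os.rename.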

Related

when converting XML to SEVERAL dataframes, how to name these dfs in a dynamic way?

My code is at the bottom.
The "parse_XML" function converts an XML file to a df; for example, df=parse_XML("example.xml", lst_level2_tags) works.
But I want to save the results to several dfs, so I want to have names like df_first_level_tag, etc.
When I run the code at the bottom, I get this error:
f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
^
SyntaxError: can't assign to literal
I also tried the .format method instead of an f-string, but that didn't work either.
There are at least 30 dfs to save and I don't want to do it one by one. I have always succeeded with f-strings in Python outside pandas, though.
Is the problem here the f-string/format method, or does my code have another logic problem?
if necessary for you, the parse_xml function is directly from this link
the function definition
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    f'df_{first_level_tag}'=parse_XML("example.xml", lst_level2_tags)
This seems like a situation where you'd be best served by putting them into a dictionary:
dfs = {}
for first_level_tag in first_level_tags:
    lst_level2_tags = []
    for subchild in root[0]:
        lst_level2_tags.append(subchild.tag)
    dfs[first_level_tag] = parse_XML("example.xml", lst_level2_tags)
There's nothing structurally wrong with your f-string, but you generally can't get dynamic variable names in Python without doing ugly things. In general, storing the values in a dictionary ends up being a much cleaner solution when you want something like that.
One advantage of working with them this way is that you can then just iterate over the dictionary later on if you want to do something to each of them. For example, if you wanted to write each of them to disk as a CSV with a name matching the tag, you could do something like:
for key, df in dfs.items():
    df.to_csv(f'{key}.csv')
You can also just refer to them individually (so if there was a tag named a, you could refer to dfs['a'] to access it in your code later).

How can I efficiently replace all instances of multiple regex patterns from a dataframe of strings, in pySpark?

I have a table in Hadoop which contains 7 billion strings which can themselves contain anything. I need to remove every name from the column containing the strings. An example string would be 'John went to the park' and I'd need to remove 'John' from that, ideally just replacing it with '[NAME]'.
In the case of 'John and Mary went to market', the output would be '[NAME] and [NAME] went to market'.
To support this I have an ordered list of the most frequently occurring 20k names.
I have access to Hue (Hive, Impala) and Zeppelin (Spark, Python & libraries) to execute this.
I've tried this in the DB, but being unable to update columns or iterate over a variable made it a non-starter, so using Python and PySpark seems to be the best option especially considering the number of calculations (20k names * 7bil input strings)
# nameList contains ['John','Emma',etc]
def removeNames(line, nameList):
    str_line = line[0]
    for name in nameList:
        rx = f"(^| |[[:^alpha:]])({name})( |$|[[:^alpha:]])"
        str_line = re.sub(rx,'[NAME]', str_line)
    str_line = [str_line]
    return tuple(str_line)
df = session.sql("select free_text from table")
rdd = df.rdd.map(lambda line: removeNames(line, nameList))
rdd.toDF().show()
The code is executing, but it's taking an hour and a half even if I limit the input text to 1000 lines (which is nothing for Spark), and the lines aren't actually being replaced in the final output.
What I'm wondering is: Why isn't map actually updating the lines of the RDD, and how could I make this more efficient so it executes in a reasonable amount of time?
This is my first time posting so if there's essential info missing, I'll fill in as much as I can.
Thank you!
In case you're still curious about this: by running a Python function such as removeNames inside rdd.map, Spark has to serialize every row out of the JVM to Python worker processes and back, and then loop over all 20k patterns for each row, which gives up most of the benefit of Spark's own execution engine. As suggested in the comments, if you go with the regexp_replace() method, Spark can do the replacement inside that engine on the distributed nodes, which improves performance.
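One way to cut the 20k separate substitutions down to a single pass per string is to combine all names into one alternation pattern. The sketch below uses plain Python so it can be tried without a cluster; build_name_pattern and mask_names are hypothetical helper names, and the use of \b word boundaries is an assumption about what should count as a name. Note also that Python's re module does not support POSIX classes such as [[:^alpha:]], which may be why the pattern in the question never matched.

```python
import re

def build_name_pattern(names):
    """Combine all names into one regex that matches any of them as a whole word."""
    escaped = (re.escape(name) for name in names)
    return r"\b(?:" + "|".join(escaped) + r")\b"

def mask_names(text, pattern, replacement="[NAME]"):
    """Replace every match of pattern in text with the replacement marker."""
    return re.sub(pattern, replacement, text)

name_list = ["John", "Mary", "Emma"]  # stand-in for the 20k-name list
pattern = build_name_pattern(name_list)
print(mask_names("John and Mary went to market", pattern))
```

In PySpark the same pattern string could then be handed to pyspark.sql.functions.regexp_replace, for example df.withColumn('free_text', regexp_replace('free_text', pattern, '[NAME]')), so that the replacement runs inside Spark rather than in Python.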

Merge Two TinyDB Databases

In Python, I'm trying to merge multiple JSON files obtained from TinyDB.
I was not able to find a way to directly merge two TinyDB JSON files so that the autogenerated keys continue in sequence instead of restarting when the next file is opened.
In code, I want to merge a large amount of data like this:
hello1={"1":"bye",2:"good"....,"20000":"goodbye"}
hello2={"1":"dog",2:"cat"....,"15000":"monkey"}
As:
Hello3= {"1":"bye",2:"good"....,"20000":"goodbye","20001":"dog",20002:"cat"....,"35000":"monkey"}
Because I could not find the correct way to do it with TinyDB, I simply opened the files as ordinary JSON, loading each file and then doing:
Data = Data['_default']
The problem is that the code currently works, but it has serious memory problems. After a few seconds the merged DB contains about 28 MB of data, but then (probably) the cache saturates and the remaining data is added really slowly.
So I need to empty the cache after a certain amount of data, or I probably need to change the way I do this!
This is the code that I use:
import json
from tinydb import TinyDB
Try1 = TinyDB('FullDB.json')
Try1.purge()  # empty the target DB; this has to come after Try1 is created
with open('FirstDataBase.json') as Part1:
    Datapart1 = json.load(Part1)
Datapart1 = Datapart1['_default']
# Keys run from "1" to "N", so the range has to include len(...)
for dets in range(1, len(Datapart1) + 1):
    Try1.insert(Datapart1[str(dets)])
with open('SecondDatabase.json') as Part2:
    Datapart2 = json.load(Part2)
Datapart2 = Datapart2['_default']
for dets in range(1, len(Datapart2) + 1):
    Try1.insert(Datapart2[str(dets)])
Question: Merge Two TinyDB Databases ... probably i need to change the way to do this!
From TinyDB Documentation
Why Not Use TinyDB?
...
You are really concerned about performance and need a high speed database.
Single-row insertions into a DB are always slow; try db.insert_multiple(....
The second variant below, with a generator, lets you keep the memory footprint down.
# From list
Try1.insert_multiple([{"1":"bye",2:"good"....,"20000":"goodbye"}])
or
# From generator function
Try1.insert_multiple(generator())
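A generator for this could look roughly like the sketch below. It is an illustration only: iter_documents is a hypothetical name, and it assumes each file uses the '_default' table layout shown in the question, with numeric string keys. Each source file is still read fully by json.load, so the saving comes from not building one combined list of all documents.

```python
import json

def iter_documents(paths, table="_default"):
    """Yield every document from the given TinyDB JSON files, one at a time."""
    for path in paths:
        with open(path) as handle:
            records = json.load(handle)[table]
        # TinyDB stores document IDs as string keys; visit them in numeric order
        for doc_id in sorted(records, key=int):
            yield records[doc_id]
```

It could then be passed straight to the bulk insert, for example Try1.insert_multiple(iter_documents(['FirstDataBase.json', 'SecondDatabase.json'])), which lets TinyDB assign fresh, continuous IDs.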

Pyspark - reducer task iterates over values

I am working with PySpark for the first time.
I want my reducer task to iterate over the values that are returned with the key from the mapper, just like in Java.
I saw that there is only the accumulator option and no iteration, as in an add function add(data1, data2) where data1 is the accumulator.
I want to get as input a list of the values that belong to the key.
That's what I want to do. Does anyone know if there is an option for doing that?
Please use the reduceByKey function. In Python, it should look like:
from operator import add
rdd = sc.textFile(....)
res = rdd.map(...).reduceByKey(add)
Note: Spark and MR have fundamental differences, so it is suggested not to force-fit one onto the other. Spark also supports pair functions pretty nicely; look for aggregateByKey if you want something fancier.
By the way, the word count problem is discussed in depth (especially the usage of flatMap) in the Spark docs; you may want to have a look.
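To make the difference concrete, here is a plain-Python analogy (not Spark code) contrasting the two styles: collecting the full list of values per key, which is roughly what a Java MapReduce reducer sees and what the RDD method groupByKey provides, versus folding the values pairwise, which is how reduceByKey combines them. The function names grouped_values and reduced_values are made up for this sketch.

```python
from functools import reduce
from itertools import groupby
from operator import add, itemgetter

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def grouped_values(pairs):
    """Collect all values per key into a list (the MapReduce-style view)."""
    ordered = sorted(pairs, key=itemgetter(0))  # groupby needs sorted input
    return {key: [value for _, value in group]
            for key, group in groupby(ordered, key=itemgetter(0))}

def reduced_values(pairs, func):
    """Fold the values of each key pairwise with func (the reduceByKey-style view)."""
    return {key: reduce(func, values)
            for key, values in grouped_values(pairs).items()}

print(grouped_values(pairs))
print(reduced_values(pairs, add))
```

If the logic really needs the whole list of values at once, grouping fits better; if it can be expressed as a pairwise combination, the reduce style avoids materializing the list.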

write table cell real-time python

I would like to loop through a database, find the appropriate values and insert them in the appropriate cell in a separate file. It may be a CSV or any other human-readable format.
In pseudo-code:
for item in huge_db:
    for list_of_objects_to_match:
        if itemmatch():
            if there_arent_three_matches_yet_in_list():
                matches++
                result=performoperationonitem()
                write_in_file(result, row=object_to_match_id, col=matches)
            if matches is 3:
                remove_this_object_from_object_to_match_list()
Can you think of any way other than going through the whole output file line by line every time?
I don't even know what to search for...
Even better, are there better ways to find three matching objects in a db and have the results in real time? (The operation will take a while, but I'd like to see the results popping out in real time.)
Assuming itemmatch() is a reasonably simple function, this will do what I think you want better than your pseudocode:
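for match_obj in list_of_objects_to_match:
    # list() because a database cursor has no len()
    db_objects = list(query_db_for_matches(match_obj))
    if len(db_objects) >= 3:
        # write the first three matches into columns 1..3 of this object's row
        for col, item in enumerate(db_objects[:3], start=1):
            result = performoperationonitem(item)
            write_in_file(result, row=match_obj.id, col=col)
    else:
        write_blank_line(row=match_obj.id)  # if you want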
Then the trick becomes writing the query_db_for_matches() function. Without detail, I'll assume you're looking for objects that match in one particular field, call it type. In pymongo such a query would look like:
def query_db_for_matches(match_obj):
    return pymongo_collection.find({"type": match_obj.type})
To get this to run efficiently, make sure your database has an index on the field(s) you're querying on by first calling:
pymongo_collection.create_index([("type", 1)])
(Older PyMongo versions offered ensure_index for this; it was deprecated in PyMongo 3 and removed in PyMongo 4.) The first time you create the index it could take a long time for a huge collection. But each call after that will be fast, because the index already exists -- fast enough that you could even put it into query_db_for_matches before your find and it would be fine.
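For the output side, one possible approach is to write one complete row per object and flush after each row, so that nothing already written has to be revisited and the results can be watched as they appear. This is only a sketch under that assumption; write_matches and its arguments are hypothetical names, and rows stands for any iterable of (object_id, results) pairs produced by the matching loop.

```python
import csv

def write_matches(rows, path, max_matches=3):
    """Write one CSV row per object as soon as its matches are known."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for object_id, results in rows:
            # first column is the object id, followed by up to max_matches results
            writer.writerow([object_id] + list(results)[:max_matches])
            handle.flush()  # push the row to disk so a viewer sees it right away
```

Because rows can be a generator, each row is written as soon as the database query for that object returns, and the file can be followed with a tool such as tail -f while the job runs.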
