I am extremely new to Python and not very familiar with the syntax. I am looking at some sample implementations of the pyspark mapPartitions method. To articulate the ask better, I have written the Java equivalent of what I need.
JavaRDD<Row> modified = auditSet.javaRDD().mapPartitions(new FlatMapFunction<Iterator<Row>, Row>() {
    public Iterator<Row> call(Iterator<Row> t) throws Exception {
        Iterable<Row> iterable = () -> t;
        return StreamSupport.stream(iterable.spliterator(), false).map(m -> enrich(m)).iterator();
    }

    private Row enrich(Row r) {
        // <code to enrich row r>
        return RowFactory.create(/* new row from enriched row r */);
    }
});
I have an RDD and I need to call mapPartitions on it. I am not sure how to pass/handle the iterator inside Python. Once the call reaches the method, I am looking to iterate over each record, enrich it, and return the result.
Any help is appreciated.
Not sure if this is the right way, but this is what I have done. Open to comments and corrections.
def mpImpl(itr, broadcastList):
    lst = broadcastList.value
    for x in itr:
        yield enrich(x, lst)

modified = auditSetDF.rdd.mapPartitions(lambda itr: mpImpl(itr, locationListBrdcast))
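Outside Spark, the same generator pattern can be sanity-checked with a plain iterator. This is only a sketch: enrich and the lookup dict below are hypothetical stand-ins for the real enrichment logic and the broadcast value.

```python
def enrich(row, lookup):
    # Hypothetical enrichment: append a value found in the lookup table.
    return row + (lookup.get(row[0]),)

def mpImpl(itr, lookup):
    # Lazily enrich each record of the partition, as mapPartitions expects.
    for x in itr:
        yield enrich(x, lookup)

partition = iter([("a", 1), ("b", 2)])  # stand-in for one partition's iterator
lookup = {"a": "east", "b": "west"}     # stand-in for broadcastList.value
print(list(mpImpl(partition, lookup)))  # [('a', 1, 'east'), ('b', 2, 'west')]
```

Because mpImpl is a generator, it never materializes the whole partition in memory; Spark pulls records from it one at a time.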
Is the following ever done in Python to minimize the "allocation time" of creating new objects in a for loop? Or is this considered bad practice / is there a better alternative?

for row in rows:
    data_saved_for_row = []  # re-initializes every time (takes a while)
    for item in row:
        do_something()
    do_something_with_row()
vs. the "c-version" --
data_saved_for_row = []
for row in rows:
    for index, item in enumerate(row):
        do_something()
        data_saved_for_row[index + 1] = '\0'  # now we have a crude way of knowing
    do_something_with_row()                   # when it ends without having
                                              # to always reinitialize
Normally the second approach seems like a terrible idea, but I've run into situations when iterating over a million-plus items where the initialization line:

data_saved_for_row = []

has taken a second or more in total.
Here's an example:
>>> print timeit.timeit(stmt="l = list();", number=int(1e8))
7.77035903931
If you want functionality for this sort of performance, you may as well just write it in C yourself and import it with ctypes or something. But then, if you're writing this kind of performance-driven application, why are you using Python to do it in the first place?
You can use list.clear() as a middle-ground here, not having to reallocate anything immediately:
data_saved_for_row = []
for row in rows:
    data_saved_for_row.clear()
    for item in row:
        do_something()
    do_something_with_row()
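A rough way to compare the cost of the two patterns yourself (absolute numbers will vary by machine; this only sketches the measurement, it doesn't promise a winner):

```python
import timeit

# Allocate a fresh list on every iteration.
fresh = timeit.timeit("l = []", number=1_000_000)

# Reuse a single list, clearing it each time.
reuse = timeit.timeit("l.clear()", setup="l = []", number=1_000_000)

print(f"fresh list: {fresh:.3f}s  clear(): {reuse:.3f}s")
```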
but this isn't a perfect solution, as shown by the CPython source for list.clear() (comments omitted):
static int
_list_clear(PyListObject *a)
{
    Py_ssize_t i;
    PyObject **item = a->ob_item;
    if (item != NULL) {
        i = Py_SIZE(a);
        Py_SIZE(a) = 0;
        a->ob_item = NULL;
        a->allocated = 0;
        while (--i >= 0) {
            Py_XDECREF(item[i]);
        }
        PyMem_FREE(item);
    }
    return 0;
}
I'm not perfectly fluent in C, but this code looks like it's freeing the memory stored by the list, so that memory will have to be reallocated every time you add something to that list anyway. This strongly implies that the python language just doesn't natively support your approach.
Or you could write your own python data structure (as a subclass of list, maybe) that implements this paradigm (never actually clearing its own list, but maintaining a continuous notion of its own length), which might be a cleaner solution to your use case than implementing it in C.
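A minimal sketch of that idea (all names here are made up): keep the underlying storage alive and track a logical length, so "clearing" is O(1) and appends can reuse already-allocated slots.

```python
class ReusableBuffer:
    """List-like buffer that 'clears' by resetting a logical length,
    keeping its allocated slots alive for reuse.

    Caveat: stale slots keep holding references until overwritten."""

    def __init__(self):
        self._items = []   # physical storage, never shrunk
        self._len = 0      # logical length

    def append(self, value):
        if self._len < len(self._items):
            self._items[self._len] = value  # reuse an existing slot
        else:
            self._items.append(value)       # grow physical storage
        self._len += 1

    def reset(self):
        self._len = 0  # O(1): nothing is deallocated

    def __len__(self):
        return self._len

    def __iter__(self):
        return iter(self._items[:self._len])

buf = ReusableBuffer()
buf.append(1)
buf.append(2)
buf.reset()
buf.append(3)
print(list(buf))  # [3]
```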
I have an array of dictionaries, but I am running into a scenario where I have to get a value out of the first element of that array. The following is the expression I am querying with:
address_data = record.get('Rdata')[0].get('Adata')
This throws the following error:
TypeError: 'NoneType' object is not subscriptable
I tried following:
if record.get('Rdata') and record.get('Rdata')[0].get('Adata'):
    address_data = record.get('Rdata')[0].get('Adata')
but I don't know if the above approach is good or not.
So how should I handle this in Python?
Edit:
"partyrecord": {
"Rdata": [
{
"Adata": [
{
"partyaddressid": 172,
"addressid": 142165
}
]
}
]
}
Your expression assumes that record['Rdata'] will return a list with at least one element, so provide one if that isn't the case.
address_data = record.get('Rdata', [{}])[0].get('Adata')
Now if record['Rdata'] doesn't exist, you'll still have an empty dict on which to invoke get('Adata'), and address_data ends up set to None. (Note that the default only kicks in when the key is absent; if 'Rdata' is present but is None or an empty list, you'd still get an error.)
(Checking for the key first is preferable if a suitable default is expensive to create, since it will be created whether get needs to return it or not. But [{}] is fairly lightweight, and the compiler can generate it immediately.)
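To see the fallback in action (the records here just mirror the question's structure):

```python
def get_address_data(record):
    # Fall back to a one-element list holding an empty dict when 'Rdata' is absent.
    return record.get('Rdata', [{}])[0].get('Adata')

record_full = {
    "Rdata": [
        {"Adata": [{"partyaddressid": 172, "addressid": 142165}]}
    ]
}
record_missing = {}  # no 'Rdata' key at all

print(get_address_data(record_full))     # [{'partyaddressid': 172, 'addressid': 142165}]
print(get_address_data(record_missing))  # None
```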
You might want to go for the simple, not exciting route:
role_data = record.get('Rdata')
if role_data:
    address_data = role_data[0].get('Adata')
else:
    address_data = None
I am trying to understand some code better, and there is one line that is doing a bunch of things that I'm not fully understanding. I was wondering if there's a way to split it up into multiple lines so I can get a better idea of what is happening in each step.
This code is part of a function and a long bit of code (many lines), so it's a bit hard to paste the whole thing in here when this is just one line. However, I can do that if necessary or attach the code somehow. The line in particular is from a function that, given a partial assignment, returns the full assignment. It's as follows:
return min([(sum([int(self.get_weight(assign, var, action)>0) for action in self.domains[var]]), var) for var in self.my_function if var not in assign])[1]
These are calling on a bunch of previously coded weights, etc. I'm just asking how you can split up this expression into multiple lines. Some thoughts I had were:
for var in self.my_function:
if var not in assign:
for action in self.domains...
Any thoughts on how to do this? I saw that an error was thrown up when I tried something like this, "unhashable type: list", so I know it's not quite the same expression. Thanks!
First split the original code across multiple lines:
min([
    (
        sum([
            int(self.get_weight(assign, var, action) > 0)
            for action in self.domains[var]
        ]),
        var,
    )
    for var in self.my_function
    if var not in assign
])[1]
Which can be rewritten as:
list_to_take_min_of = []
for var in self.my_function:
    if var not in assign:
        weights = []
        for action in self.domains[var]:
            weights.append(int(self.get_weight(assign, var, action) > 0))
        list_to_take_min_of.append((sum(weights), var))
return min(list_to_take_min_of)[1]
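As a quick sanity check of the min(...)[1] idiom: tuples compare element-wise, so min picks the pair with the smallest first element (the weight sum), and [1] pulls out its var. Toy values below, not the question's real data:

```python
pairs = [(3, "x"), (1, "z"), (2, "y")]  # (sum_of_weights, var) pairs
best_var = min(pairs)[1]
print(best_var)  # 'z', the var with the smallest weight sum
```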
You could also just break up the original code into multiple lines to facilitate reading (inside brackets, line continuations don't need backslashes):

return min([(sum([int(self.get_weight(assign, var, action) > 0)
                  for action in self.domains[var]]), var)
            for var in self.my_function
            if var not in assign])[1]
When breaking up one-liners, I find it helpful to take it one step at a time and follow this template for each nested list comprehension:

out = [expression(value) for value in iterable if condition]

# becomes

out = []
for value in iterable:
    if condition:
        out.append(
            expression(value)
        )
In your case you have a nested list comprehension, so expression(...) will be another copy of this boilerplate...
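Applying that template twice to a small nested comprehension shows the two forms are interchangeable (toy data, not the question's solver):

```python
rows = [[1, 2], [3, 4, 5]]

# One-liner: doubled sum of each row, as a nested comprehension.
one_liner = [sum([x * 2 for x in row]) for row in rows]

# Expanded with the template, once per comprehension level.
expanded = []
for row in rows:
    inner = []
    for x in row:
        inner.append(x * 2)
    expanded.append(sum(inner))

print(one_liner, expanded)  # [6, 24] [6, 24]
```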
I am trying to solve the following problem using pyspark.
I have a file on HDFS which is a dump of a lookup table, in the format:
key1, value1
key2, value2
...
I want to load this into a Python dictionary in pyspark and use it for some other purpose. So I tried to do:
table = {}

def populateDict(line):
    (k, v) = line.split(",", 1)
    table[k] = v

kvfile = sc.textFile("pathtofile")
kvfile.foreach(populateDict)
I found that the table variable is not modified. So, is there a way to create a large in-memory hashtable in Spark?
foreach is a distributed computation, so you can't expect it to modify a data structure that is only visible in the driver. What you want is:
kv.map(line => line.split(",", 2) match {
  case Array(k, v) => (k, v)
  case _ => ("", "")
}).collectAsMap()
This is in Scala, but you get the idea; the important function is collectAsMap(), which returns a map to the driver.
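In pyspark the equivalent would be something like table = kvfile.map(parse).collectAsMap(). The parsing step itself can be checked without Spark (the sample lines below are made up):

```python
lines = ["key1,value1", "key2,value2"]  # stand-in for kvfile's records

def parse(line):
    # Split on the first comma only, mirroring line.split(",", 1).
    k, v = line.split(",", 1)
    return (k, v)

# With Spark this would be: table = kvfile.map(parse).collectAsMap()
table = dict(map(parse, lines))
print(table)  # {'key1': 'value1', 'key2': 'value2'}
```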
If your data is very large, you can use a pair RDD as a map. First map to pairs:

kv.map(line => line.split(",", 2) match {
  case Array(k, v) => (k, v)
  case _ => ("", "")
})
then you can access with rdd.lookup("key") which returns a sequence of values associated with the key, though this definitely will not be as efficient as other distributed KV stores, as spark isn't really built for that.
For efficiency, see: sortByKey() and lookup()
lookup(key):
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
The RDD will be re-partitioned by sortByKey() (see: OrderedRDD) and efficiently searched during lookup() calls. In code, something like,
kvfile = sc.textFile("pathtofile")
sorted_kv = kvfile.map(lambda x: tuple(x.split(",", 1))).sortByKey()
sorted_kv.lookup('key1')
will do the trick both as an RDD and efficiently.
Right now I have this to find a method
def getMethod(text, a, filetype):
    start = a
    fin = a
    if filetype == "cs":
        for x in range(a, 0, -1):
            if text[x] == "{":
                start = x
                break
        for x in range(a, len(text)):
            if text[x] == "}":
                fin = x
                break
    return text[start:fin + 1]
How can I get the method the index a is in?
I can't just find { and } because you can have things like new { } which won't work
If I had a file with a few methods and I wanted to find which method the index x is in, then I want the body of that method. For example, if I had the file:
private string x(){
    return "x";
}

private string b(){
    return "b";
}

private string z(){
    return "z";
}

private string a(){
    var n = new {l = "l"};
    return "a";
}
And I got the index of "a", which let's say is 100. Then I want to find the body of that method, i.e. everything within { and }.
So this...
{
    var n = new {l = "l"};
    return "a";
}
But using what I have now it would return:
{l = "l"};
return "a";
}
If my interpretation is correct, it seems you are attempting to parse C# source code to find the C# method that includes a given position a in a .cs file, the content of which is contained in text.
Unfortunately, if you want to do a complete and accurate job, I think you would need a full C# parser.
If that sounds like too much work, I'd think about using a version of ctags that is compatible with C# to generate a tag file and then search in the tag file for the method that applies to a given source file line instead of the original source file.
As Simon stated, if your problem is to parse source code, the best bet is to get a proper parser for that language.
If you're just looking to match up the braces however, there is a well-known algorithm for that: Python parsing bracketed blocks
Just be aware that since source code is a complex beast, don't expect this to work 100%.
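A minimal depth-counting sketch of that brace-matching idea (it deliberately ignores string literals and comments, so braces inside those will still fool it):

```python
def enclosing_braces(text, a):
    """Return (start, end) indices of the innermost {...} pair enclosing
    position a, found by scanning outward while tracking brace depth."""
    depth = 0
    start = None
    # Walk left: the enclosing '{' is the first one that takes depth below zero.
    for i in range(a, -1, -1):
        if text[i] == "}":
            depth += 1
        elif text[i] == "{":
            depth -= 1
            if depth < 0:
                start = i
                break
    depth = 0
    end = None
    # Walk right symmetrically for the matching '}'.
    for i in range(a + 1, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth < 0:
                end = i
                break
    return start, end

src = 'private string a(){ var n = new {l = "l"}; return "a"; }'
s, e = enclosing_braces(src, src.index("return"))
print(src[s:e + 1])  # the whole method body, nested braces included
```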