Pyspark - reducer task iterates over values - python

I am working with pyspark for the first time.
I want my reducer task to iterate over the values that come back with the key from the mapper, just like in Java.
I saw that there is only an accumulator-style option and no iteration - for example, in a function add(data1, data2), data1 acts as the accumulator.
I want my input to be a list of the values that belong to the key.
Does anyone know if there is a way to do that?

Use the reduceByKey function. In Python, it should look like:
from operator import add
rdd = sc.textFile(....)
res = rdd.map(...).reduceByKey(add)
Note: Spark and MapReduce have fundamental differences, so it is best not to force-fit one onto the other. Spark also supports pair functions quite nicely; look at aggregateByKey if you want something fancier.
By the way, the word count problem is discussed in depth (especially the usage of flatMap) in the Spark docs, so you may want to have a look.
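For a concrete picture, here is a minimal word-count style sketch (the input path and the line-splitting logic are assumptions); if you really need every value that belongs to a key as a list, like a Java reducer's Iterable, groupByKey gives you exactly that, at the cost of shuffling all the values:
from operator import add

rdd = sc.textFile("data.txt")  # assumed input path; sc is the SparkContext
counts = rdd.flatMap(lambda line: line.split()) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(add)

# To get every value that belongs to a key as a list (closer to a Java reducer's Iterable):
grouped = rdd.map(lambda line: (line.split()[0], line)) \
             .groupByKey() \
             .mapValues(list)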

Related

In Python 3.5.2, how to elegantly chain an unknown quantity of functions on an object that changes type?

Introduction
I am not certain the title is clear. I am not a native English speaker, so if someone has a better summary for what this post is about, please edit!
Environment
python 3.5.2
pyspark 2.3.0
The context
I have a Spark dataframe. Before being written to Elasticsearch, this data gets transformed.
In my case, I have two transformations. They are map functions on the rdd of the dataframe.
However, instead of hard-coding them, I want my function (the one that handles the data transformation) to accept X functions that are applied one by one: the first to the dataframe, and each subsequent one to the result of the previous transformation.
Initial work
This is the previous state, not desired, hard written:
df.rdd.map(transfo1) \
    .map(transfo2) \
    .saveAsNewAPIHadoopFile
What I have so far
def write_to_index(self, transformation_functions: list, dataframe):
    # stuff
    for transfo in transformation_functions:
        dataframe = dataframe.rdd.map(transfo)
    dataframe.saveAsNewAPIHadoopFile
However, this has a problem: if the return of the first transformation is not a dataframe, it will fail on the second iteration of the loop because the resulting object does not have an rdd property.
Working solution
object_to_process = dataframe.rdd
for transfo in transformation_functions:
    object_to_process = object_to_process.map(transfo)
object_to_process.saveAsNewAPIHadoopFile
The solution above seems to work (it does not throw any error, at least). But I would like to know if there is a more elegant solution or any built-in Python solution for this.
You can use this one-liner:
from functools import reduce
def write_to_index(self, transformation_functions: list, dataframe):
    reduce(lambda x, y: x.map(y), transformation_functions, dataframe.rdd).saveAsNewAPIHadoopFile
which, if written verbosely, should be identical to
dataframe.rdd.map(transformation_functions[0]) \
    .map(transformation_functions[1]) \
    .map(...) \
    .saveAsNewAPIHadoopFile
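To see what the fold does outside Spark, here is the same chaining idea applied to a plain Python list; the two transformation functions below are made up purely for illustration:
from functools import reduce

def to_upper(row):
    return {k: v.upper() if isinstance(v, str) else v for k, v in row.items()}

def drop_empty(row):
    return {k: v for k, v in row.items() if v not in (None, "")}

transformation_functions = [to_upper, drop_empty]
rows = [{"name": "alice", "note": ""}, {"name": "bob", "note": "x"}]

# reduce folds the list of functions over the data, just like chaining .map() calls
result = reduce(lambda acc, f: [f(r) for r in acc], transformation_functions, rows)
print(result)  # [{'name': 'ALICE'}, {'name': 'BOB', 'note': 'X'}]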

get/access each chunk of dask.dataframe(df, chunksize=100)

I used the code below to split a dataframe using dask:
result=dd.from_pandas(df, chunksize=75)
I use the code below to create a custom JSON file:
for z in result:
    createjson(z)
It just didn't work! How can I access each chunk?
There may be a more native way (feels like there should be) but you can do:
for i in range(result.npartitions):
    partition = result.get_partition(i)
    # your code here
We do not know what your createjson function does, but perhaps it is covered by to_json().
Alternatively, if you really want to do something specific to each of your partitions, and it is not covered by the JSON case, then you will want the method map_partitions().
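A minimal sketch of both routes, assuming createjson accepts a pandas DataFrame; the column names and output file names are made up:
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"id": range(300), "value": range(300)})
result = dd.from_pandas(df, chunksize=75)

# Route 1: pull each partition out as a plain pandas DataFrame and handle it yourself
for i in range(result.npartitions):
    chunk = result.get_partition(i).compute()
    chunk.to_json("chunk_{0}.json".format(i), orient="records")

# Route 2: let dask apply a function (e.g. your createjson) to every partition;
# dask infers the output metadata by running the function on a small dummy frame
# result.map_partitions(createjson).compute()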

string comparison for multiple values python

I have two sets of data. The first (A) is a list of equipment with elaborate names. The second (B) is a list of broader equipment categories, into which I have to group the first list using string comparisons. I'm aware this won't be perfect.
For each entry in list A, I'd like to compute the Levenshtein distance to each entry in list B. The record in list B with the highest score will be the group to which I assign that data point.
I'm very rusty in Python and am playing around with FuzzyWuzzy to get the distance between two string values. However, I can't quite figure out how to iterate through each list to produce what I need.
I presumed I'd just create a list for each data set and write a pretty basic loop over each, but like I said, I'm a little rusty and not having any luck.
Any help would be greatly appreciated! If there is another package that will allow me to do this (not FuzzyWuzzy), I'm glad to take suggestions.
It looks like the process.extractOne function is what you're looking for. A simple use case is something like
from fuzzywuzzy import process
from collections import defaultdict
complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']
group = defaultdict(list)
for name in complicated_names:
    group[process.extractOne(name, generic_names)[0]].append(name)
defaultdict is a dictionary that supplies a default value for missing keys.
We loop over all the complicated names, use fuzzywuzzy to find the closest match, and then add the name to the list associated with that match.
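Continuing the snippet above, the resulting grouping should come out roughly like this (the exact matches depend on FuzzyWuzzy's default scorer):
print(dict(group))
# {'couch': ['leather couch'],
#  'screwdriver': ['left-handed screwdriver'],
#  'peeler': ['tomato peeler']}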

How to save partitions to files of a specific name?

I have partitioned an RDD and would like each partition to be saved to a separate file with a specific name. This is the repartitioned RDD I am working with:
# Repartition to # key partitions and map each row to a partition given their key rank
my_rdd = df.rdd.partitionBy(len(keys), lambda row: int(row[0]))
Now, I would like to saveAsTextFile() on each partition. Naturally, I should do something like
my_rdd.foreachPartition(lambda iterator_obj: save_all_items_to_text_fxn)
However, as a test, I defined save_all_items_to_text_fxn() as follows:
def save_all_items_to_text_fxn(iterator_obj):
    print 'Test'
... and I noticed that it's really only called twice instead of |partitions| number of times.
I would like to find out if I am on the wrong track. Thanks
I would like to find out if I am on the wrong track.
Well, it looks like you are. You won't be able to call saveAsTextFile on a partition iterator (not to mention from inside an action or transformation), so the whole idea doesn't make sense. It is not impossible to write to HDFS from Python code using external libraries, but I doubt it is worth all the fuss.
Instead you can handle this using standard Spark tools:
An expensive way
def filter_partition(x):
    def filter_partition_(i, iter):
        return iter if i == x else []
    return filter_partition_

for i in range(my_rdd.getNumPartitions()):
    tmp = my_rdd.mapPartitionsWithIndex(filter_partition(i)).coalesce(1)
    tmp.saveAsTextFile('some_name_{0}'.format(i))
A cheap way
Each partition is saved to a single file with a name corresponding to the partition number. That means you can simply save the whole RDD using saveAsTextFile and rename the individual files afterwards.
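A minimal sketch of that rename step, assuming the RDD is written to a local path; on HDFS you would do the equivalent renames with hdfs dfs -mv or the FileSystem API:
import glob
import os

my_rdd.saveAsTextFile('output')

# Spark writes the partitions as part-00000, part-00001, ... in partition order,
# so renaming those files gives one file per partition with the name you want
for path in glob.glob('output/part-*'):
    i = int(path.rsplit('-', 1)[1])
    os.rename(path, os.path.join('output', 'some_name_{0}'.format(i)))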

Python equiv. of PHP foreach []?

I am fetching rows from the database and wish to populate a multi-dimensional dictionary.
The PHP version would be roughly this:
foreach($query as $rows):
    $values[$rows->id][] = $rows->name;
endforeach;
return $values;
I can't seem to figure out the following:
What is the Python way to add keys to a dictionary using automatic numbering, e.g. $values[]?
How do I populate a Python dictionary using variables? Using, for example, values[id] = name will not add keys but overwrite existing ones.
I have no idea how to achieve this, as I am a beginner in Python (and in programming in general, actually).
import collections

values = collections.defaultdict(list)
for rows in query:
    values[rows.id].append(rows.name)
return values
Just a general note:
Python's dictionaries are unordered mappings: while adding numerical keys would allow "sequential" access, when iterating there is no guarantee that the order will coincide with the natural order of the keys.
It's better not to translate from PHP to Python (or any other language), but rather to write code idiomatic to that particular language. Have a look at the plenty of open-source code that does the same or similar things; you might even find a useful module (library).
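For example, if you do need the entries back in key order, sort the keys explicitly when iterating (a small illustrative snippet, reusing the values dictionary from above):
# Iterate in ascending key order instead of relying on dict ordering
for key in sorted(values):
    print(key, values[key])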
all_rows = []
for row in query:
    all_rows.append(row['col'])
print(all_rows)
You can do:
from collections import defaultdict
values = defaultdict(list)
for row in query:
    values[row.id].append(row.name)
return values
Edit: forgot to return the values.
