I have this code where I am reading a file in IPython using PySpark. I am trying to add a piece that builds a list based on a particular column read from the file, but when I execute it the list comes out empty and nothing gets appended to it. My code is:
list1 = []

def file_read(line):
    list1.append(line[10])
    # bunch of other code which processes other column indexes on `line`
inputData = sc.textFile(fileName).zipWithIndex().filter(lambda (line,rownum): rownum>0).map(lambda (line, rownum): line)
column_val = (inputData
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1)
    .map(file_read))
When I execute this part of the code, list1 still comes out empty even though there is data in line[10], which I use in other parts of the same function. It seems as if nothing is ever appended to the list. How can I build the list?
Well, it actually does append to list1; the problem is just that it is not the list1 you are thinking about. Every variable referenced in a closure is serialized and sent to the workers, and that applies to list1 as well.
Every partition receives its own copy of list1. When file_read is called, data is appended to this copy, and when a given map phase is finished the copy goes out of scope and is discarded.
It is not a particularly elegant piece of code, but you should see that this is really what is happening here:
rdd = sc.parallelize(range(100), 5)
list1 = []

def file_read(line):
    list1.append(line)
    print len(list1)
    return line

xs = rdd.map(file_read).collect()
Edit
Spark provides two types of shared variables: broadcast variables, which are read-only from the worker perspective, and accumulators, which are write-only from the worker perspective (their value can be read only on the driver).
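For reference, a minimal broadcast sketch (the lookup data here is made up):

lookup = sc.broadcast({"a": 1, "b": 2})  # shipped to each worker once, read-only there
sc.parallelize(["a", "b", "c"]).map(lambda k: lookup.value.get(k, 0)).collect()
# [1, 2, 0]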
By default accumulators support only numeric values and are intended to be used mostly as counters. It is possible to define custom accumulators though. To do that you have to extend the AccumulatorParam class and provide custom zero and addInPlace implementations:
from pyspark.accumulators import AccumulatorParam

class ListParam(AccumulatorParam):
    def zero(self, v):
        return []

    def addInPlace(self, acc1, acc2):
        acc1.extend(acc2)
        return acc1
Next you can redefine file_read as follows:
def file_read1(line):
    global list1  # Required otherwise the next line will fail
    list1 += [line]
    return line
Example usage:
list1 = sc.accumulator([], ListParam())
rdd = sc.parallelize(range(10)).map(file_read1).collect()
list1.value
Even though it is possible to use an accumulator like this, it is probably too expensive to be used in practice, and in the worst-case scenario it can crash the driver. Instead, you can simply use another transformation:
tmp = (inputData
    .map(lambda line: line.split(","))
    .filter(lambda line: len(line) > 1))

def line_read2(line): return ...  # just the core logic

line1 = tmp.map(lambda line: line[10])
column_val = tmp.map(line_read2)
Side note:
The code you've provided doesn't actually do anything by itself. Transformations in Spark are just descriptions of what has to be done; nothing is really executed until you call an action.
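For example, collecting line1 from the snippet above is what actually materializes the list on the driver:

list1 = line1.collect()  # an action; only now does Spark read the file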
Related
Python is completely new to me and I'm still trying to figure out the basics... we were given a project to analyse and determine particular things in a CSV file that was given to us.
There are many columns, but the first is the most important one, as one of the variables in the function we need to create refers to it. It's labelled 'adultids', where a combination of letters and numbers is given, and one 'adultid' takes up 15 rows of different information; there are many different 'adultids' within the file.
To start off, I am trying to make a list from that CSV file that contains only the information for the 'adultids' given (which, as a variable in the function, is a list of two 'adultids' from the CSV file), basically trying to single out that information from the rest of the data in the file. When I run it, it comes up with '[]', and I can't figure out why... can someone tell me what's wrong?
I'm not sure if any of that makes sense, it's very hard to describe, so I apologise in advance, but here is the code I tried :)
def file_to_read(csvfile, adultIDs):
    with open(csvfile, 'r') as asymfile:
        lst = asymfile.read().split("\n")
    new_lst = []
    if adultIDs == True:
        for row in lst:
            adultid, point, x, y, z = row.split(',')
            if adultid == adultIDs:
                new_lst.append([adultid, point, x, y, z])
    return new_lst
Try this. The problem is that when adultIDs is not literally True, the if block is skipped entirely, so you fall through to return new_lst, which is still the empty list [] you assigned at the start. Returning lst as a fallback makes that case visible:
def file_to_read(csvfile, adultIDs):
    with open(csvfile, 'r') as asymfile:
        lst = asymfile.read().split("\n")
    new_lst = []
    if adultIDs == True:
        for row in lst:
            adultid, point, x, y, z = row.split(',')
            if adultid == adultIDs:
                new_lst.append([adultid, point, x, y, z])
        return new_lst
    return lst
As far as I understand, you pass a list of ids like ['R10', 'R20', 'R30'] as the second argument of your function. Those ids are also contained in the CSV file you are trying to parse. In this case you should probably rewrite your function so that it checks whether the adultid from a row of your CSV file is contained in the list adultIDs that you pass in. I'd rather do it like this:
def file_to_read(csvfile, adult_ids):  # [1]
    lst = []
    with open(csvfile, 'r') as asymfile:
        for row in asymfile:  # [2]
            r = row[:-1].split(',')  # [3]
            if r[0] in adult_ids:  # [4]
                lst.append(r)
    return lst
Descriptions for the numbered comments:
Python programmers usually prefer snake_case names for variables and arguments. You can learn more about this in PEP 8. Although it's not connected to your question, it may be helpful for your future projects, when other programmers review your code.
You don't need to read the whole file into a variable. You can iterate over it row by row, which saves memory. This can be helpful if you work with huge files or are short on memory.
You need to take the whole string except the last character, which is \n.
in checks whether the adult_id from the row of the CSV file is contained in the argument that you pass. For this reason I would recommend using the set datatype for adult_ids rather than a list, as sketched below. You can read about sets in the documentation.
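A quick illustration (the file name and ids here are made up):

adult_ids = {'R10', 'R20'}  # set literal: O(1) membership tests
rows = file_to_read('asym.csv', adult_ids)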
I hope I got your task right, and that helps you. Have a nice day!
So I'm running into an issue trying to get my dictionaries to change within a function without returning anything. Here is my code:
def load_twitter_dicts_from_file(filename, emoticons_to_ids, ids_to_emoticons):
    in_file = open(filename, 'r')
    emoticons_to_ids = {}
    ids_to_emoticons = {}
    for line in in_file:
        data = line.split()
        if len(data) > 0:
            emoticon = data[0].strip('"')
            id = data[2].strip('"')
            if emoticon not in emoticons_to_ids:
                emoticons_to_ids[emoticon] = []
            if id not in ids_to_emoticons:
                ids_to_emoticons[id] = []
            emoticons_to_ids[emoticon].append(id)
            ids_to_emoticons[id].append(emoticon)
Basically what I'm trying to do is pass in two dictionaries and fill them with information from the file, which works out fine, but after I call it in main and try to print the two dictionaries, it says they are empty. Any ideas?
def load_twitter_dicts_from_file(filename, emoticons_to_ids, ids_to_emoticons):
    ...
    emoticons_to_ids = {}
    ids_to_emoticons = {}
These two lines replace whatever you pass to the function. So if you passed two dictionaries to the function, those dictionaries are never touched. Instead, you create two new dictionaries which are never passed to the outside.
If you want to mutate the dictionaries you pass to the function, then remove those two lines and create the dictionaries first.
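A minimal sketch of that fix (the caller owns the dictionaries and the function only fills them; setdefault stands in for the membership checks, and the file name is illustrative):

def load_twitter_dicts_from_file(filename, emoticons_to_ids, ids_to_emoticons):
    with open(filename, 'r') as in_file:  # the with-block also closes the file
        for line in in_file:
            data = line.split()
            if len(data) > 0:
                emoticon = data[0].strip('"')
                id = data[2].strip('"')
                emoticons_to_ids.setdefault(emoticon, []).append(id)
                ids_to_emoticons.setdefault(id, []).append(emoticon)

emoticons_to_ids = {}
ids_to_emoticons = {}
load_twitter_dicts_from_file('twitter.txt', emoticons_to_ids, ids_to_emoticons)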
Alternatively, you could also return those two dictionaries from the function at the end:
return emoticons_to_ids, ids_to_emoticons
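The caller would then rebind the names, for example (the file name again being illustrative):

emoticons_to_ids, ids_to_emoticons = load_twitter_dicts_from_file('twitter.txt', {}, {})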
I have this class that consists of 3 functions. Each function is in charge of one part of the whole process.
.load() loads up two files, re-formats their content and writes them to two new files.
.compare() takes two files and prints out their differences in a specific format.
.final() takes the result of .compare() and creates a file for every set of values.
Please ignore the Frankenstein nature of the logic, as it is not my main concern at the moment. I know it can be written a thousand times better, and that's fine by me for now, as I am still new to Python and programming in general. I do have some theoretical experience but very limited technical practice, and that is something I am working on.
Here is the code:
from collections import defaultdict, deque
from operator import itemgetter
from itertools import groupby
import os

class avs_auto:

    def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
        with open(fileIn1+'.txt') as fin1, open(fileIn2+'.txt') as fin2:
            frame_rects = defaultdict(list)
            for row in (map(str, line.split()) for line in fin1):
                id, frame, rect = row[0], row[2], [row[3], row[4], row[5], row[6]]
                frame_rects[frame].append(id)
                frame_rects[frame].append(rect)
            for row in (map(str, line.split()) for line in fin2):
                id, frame, rect = row[0], row[2], [row[3], row[4], row[5], row[6]]
                frame_rects[frame].append(id)
                frame_rects[frame].append(rect)
        with open(fileOut1+'.txt', 'w') as fout1, open(fileOut2+'.txt', 'w') as fout2:
            for frame, rects in sorted(frame_rects.iteritems()):
                fout1.write('{{{}:{}}}\n'.format(frame, rects))
                fout2.write('{{{}:{}}}\n'.format(frame, rects))

    def compare(self, f1, f2):
        with open(f1+'.txt', 'r') as fin1:
            with open(f2+'.txt', 'r') as fin2:
                lines1 = fin1.readlines()
                lines2 = fin2.readlines()
        diff_lines = [l.strip() for l in lines1 if l not in lines2]
        diffs = defaultdict(list)
        with open(f1+'x'+f2+'Result.txt', 'w') as fout:
            for line in diff_lines:
                d = eval(line)
                for k in d:
                    list_ids = d[k]
                    for i in range(0, len(d[k]), 2):
                        diffs[d[k][i]].append(k)
            for id_ in diffs:
                diffs[id_].sort()
                for k, g in groupby(enumerate(diffs[id_]), lambda (i, x): i - x):
                    group = map(itemgetter(1), g)
                    fout.write('{0} {1} {2}\n'.format(id_, group[0], group[-1]))

    def final(self):
        with open('hw1load3xhw1load2Result.txt', 'r') as fin:
            lines = (line.split() for line in fin)
            for k, g in groupby(lines, itemgetter(0)):
                fst = next(g)
                lst = next(iter(deque(g, 1)), fst)
                with open('final/{}.avs'.format(k), 'w') as fout:
                    fout.write('video0=ImageSource("MovieName\original\%06d.jpeg", {}, {}, 15)\n'.format(fst[1], lst[2]))
Now to my question: how do I make it so that each function passes its output files as values to the next function and calls it?
So for example:
Running .load() should output two files and call the .compare() function, passing it those two files.
Then, when .compare() is done, it should pass .final() the output file and call it.
So .final() will open whatever file is passed to it from .compare() and not "test123.txt" as it is defined above.
I hope this all makes sense. Let me know if you need clarification. Any criticism is welcome concerning the code itself. Thanks in advance.
There are a couple of ways to do this, but I would write a master function that calls the other three in sequence. Something like:
def load_and_compare(self, input_file1, input_file2, output_file1, output_file2, result_file):
    self.load(input_file1, input_file2, output_file1, output_file2)
    self.compare(output_file1, output_file2)
    self.final(result_file)
Looking over your code, I think you have a problem in load. You only declare a single dictionary, load the contents of both files into it, and then write those same contents out to both output files. Because each file then has the same content, compare won't do anything meaningful.
Also, do you really want to write out the file contents and then re-read it into memory? I would keep the frame definitions in memory for use in compare after loading rather than reading them back in.
I don't really see a reason for this to be a class at all rather than just a trio of functions, but maybe if you have to read multiple files with mildly varying formats you could get some benefit of using class attributes to define the format while inheriting the general logic.
Do you mean call with the name of the two files? Well you defined a class, so you can just do:
def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
    ...  # do stuff here
    # when done:
    self.compare(fileOut1, fileOut2)
And so on.
I might be totally off here, but why don't you do it exactly as you're saying?
Just call self.compare() out of your load() method.
You can also add return statements to load() and return a tuple with the files.
Then add a 4th method to your class which collects the returned files and pipes them to the compare() method, as in the sketch below.
Best Regards!
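A minimal sketch of that idea (the run method name is made up, and compare() would likewise need to return its result file name so final() can receive it):

class avs_auto:
    def load(self, fileIn1, fileIn2, fileOut1, fileOut2):
        # ... existing logic ...
        return fileOut1, fileOut2

    def compare(self, f1, f2):
        result_file = f1 + 'x' + f2 + 'Result'
        # ... existing logic, writing to result_file + '.txt' ...
        return result_file

    def run(self, fileIn1, fileIn2, fileOut1, fileOut2):
        out1, out2 = self.load(fileIn1, fileIn2, fileOut1, fileOut2)
        result = self.compare(out1, out2)
        self.final(result)  # assumes final() is changed to accept a file name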
One of the more powerful aspects of Python is that you can return something called a tuple. To answer this in a more generic Python sense consider this code:
>>> def load(file1, file2):
...     return file1+'.txt', file2+'.txt'
>>> def convert(file1, file2):
...     return 'converted_'+file1, 'converted_'+file2
>>> convert(*load("Java", "C#"))
('converted_Java.txt', 'converted_C#.txt')
Each function takes two named arguments, but the returned tuple of the first can be "unpacked" into the input arguments of the second by adding a * in front of it.
I realise the info to answer this question is probably already on here, but as a Python newbie I've been trying to piece together the info for a few weeks now and I'm hitting some trouble.
This question, Python "join" function like unix "join", answers how to do a join on two lists easily, but the problem is that DictReader objects are iterables and not straightforward lists, which adds a layer of complication.
I'm basically looking for an inner join on two CSV files, using the DictReader object. Here's the code I have so far:
def test(dictreader1, dictreader2):
    matchedlist = []
    for dictline1 in dictreader1:
        for dictline2 in dictreader2:
            if dictline1['member']=dictline2['member']:
                matchedlist.append(dictline1, dictline2)
            else: continue
    return matchedlist
This is giving me an error at the if statement, but more importantly, I don't seem to be able to access the ['member'] element of the dictionary within the iterable, as it says it has no attribute "__getitem__".
Does anyone have any thoughts on how to do this? For reference, I need to keep the lists as iterables because each file is too big to fit in memory. The plan is to control this entire function within another for loop that only feeds it a few lines at a time to iterate over. So it will read one line of the left hand file, iterate over the whole second file to find a member field that matches and then join the two lines, similar to an SQL join statement.
Thanks for any help in advance, please forgive any obvious errors on my part.
A few thoughts:
Replace the = with ==. The latter is used for equality tests; the former for assignments.
Add a line at the beginning, dictreader2 = list(dictreader2). That will make it possible to loop over those entries more than once.
Add a second pair of parentheses to matchedlist.append((dictline1, dictline2)). The list.append method takes just one argument, so you want to create a tuple out of dictline1 and dictline2.
The final else: continue is unnecessary. A for-loop automatically continues with the next iteration.
Use a print statement or somesuch to verify that dictline1 and dictline2 are both dictionary objects that have member as a key. It could be that your function is correct, but is being called with something other than a dictreader object.
Here is a worked out example using a list of dicts as input (similar to what a DictReader would return):
>>> def test(dictreader1, dictreader2):
...     dictreader2 = list(dictreader2)
...     matchedlist = []
...     for dictline1 in dictreader1:
...         for dictline2 in dictreader2:
...             if dictline1['member'] == dictline2['member']:
...                 matchedlist.append((dictline1, dictline2))
...     return matchedlist
>>> dr1 = [{'member': 2, 'value':'abc'}, {'member':3, 'value':'def'}]
>>> dr2 = [{'member': 4, 'tag':'t4'}, {'member':3, 'tag':'t3'}]
>>> test(dr1, dr2)
[({'member': 3, 'value': 'def'}, {'member': 3, 'tag': 't3'})]
A further suggestion is to combine the two dictionaries into a single entry (this is closer to what an SQL inner join would do):
>>> def test(dictreader1, dictreader2):
...     dictreader2 = list(dictreader2)
...     matchedlist = []
...     for dictline1 in dictreader1:
...         for dictline2 in dictreader2:
...             if dictline1['member'] == dictline2['member']:
...                 entry = dictline1.copy()
...                 entry.update(dictline2)
...                 matchedlist.append(entry)
...     return matchedlist
>>> test(dr1, dr2)
[{'member': 3, 'tag': 't3', 'value': 'def'}]
Good luck with your project :-)
I have written a little program that parses log files of anywhere between a few thousand and a few hundred thousand lines. For this, I have a function in my code which parses every line, looks for keywords, and returns the keywords with their associated values.
These log files consist of little sections. Each section has some values I'm interested in and want to store as a dictionary.
I have simplified the sample below, but the idea is the same.
My original function looked like this; it gets called between 100 and 10000 times per run, so you can understand why I want to optimize it:
def parse_txt(f):
    d = {}
    for line in f:
        if not line:
            pass
        elif 'apples' in line:
            d['apples'] = True
        elif 'bananas' in line:
            d['bananas'] = True
        elif line.startswith('End of section'):
            return d

f = open('fruit.txt', 'r')
d = parse_txt(f)
print d
The problem I run into is that I have a lot of conditionals in my program, because it checks for a lot of different things and stores the values for them. When every line is checked for anywhere between 0 and 30 keywords, this gets slow fast. I don't want that, because I'm not interested in everything every time I run the program. I'm only ever interested in 5-6 keywords, yet I'm parsing every line for 30 or so.
In order to optimize it, I wrote the following by using exec on a string:
def make_func(args):
    func_str = """
def parse_txt(f):
    d = {}
    for line in f:
        if not line:
            pass
"""
    if 'apples' in args:
        func_str += """
        elif 'apples' in line:
            d['apples'] = True
"""
    if 'bananas' in args:
        func_str += """
        elif 'bananas' in line:
            d['bananas'] = True
"""
    func_str += """
        elif line.startswith('End of section'):
            return d
"""
    print func_str
    exec(func_str)
    return parse_txt

args = ['apples', 'bananas']
fun = make_func(args)
f = open('fruit.txt', 'r')
d = fun(f)
print d
This solution works great, because it speeds up the program by an order of magnitude and is relatively simple. Depending on the arguments I pass in, it gives me the first function, but without the checks for all the stuff I don't need.
For example, if I give it args=['bananas'], it will not check for 'apples', which is exactly what I want to do.
This makes it much more efficient.
However, I do not like this solution very much, because it is not very readable, difficult to change, and very error prone whenever I modify something. Besides that, it feels a little bit dirty.
I am looking for alternative or better ways to do this. I have tried using a set of functions to call on every line, and while this worked, it did not offer me the speed increase that my current solution gives me, because it adds a few function calls for every line. My current solution doesn't have this problem, because it only has to be called once at the start of the program. I have read about the security issues with exec and eval, but I do not really care about that, because I'm the only one using it.
EDIT:
I should add that, for the sake of clarity, I have greatly simplified my function. From the answers I understand that I didn't make this clear enough.
I do not check for keywords in a consistent way. Sometimes I need to check for 2 or 3 keywords in a single line, sometimes just for 1. I also do not treat the result in the same way. For example, sometimes I extract a single value from the line I'm on, sometimes I need to parse the next 5 lines.
I would try defining a list of keywords you want to look for ("keywords") and doing this:
for word in keywords:
    if word in line:
        d[word] = True
Or, using a list comprehension:
dict([(word,True) for word in keywords if word in line])
Unless I'm mistaken this shouldn't be much slower than your version.
No need to use eval here, in my opinion. You're right in that an eval based solution should raise a red flag most of the time.
Edit: as you have to perform a different action depending on the keyword, I would just define function handlers and then use a dictionary like this:
def keyword_handler_word1(line):
    (...)

(...)

def keyword_handler_wordN(line):
    (...)

keyword_handlers = {'word1': keyword_handler_word1, (...), 'wordN': keyword_handler_wordN}
Then, in the actual processing code:
for word in keywords:
    # keyword_handlers[word] is a function
    keyword_handlers[word](line)
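For concreteness, a runnable sketch of this dispatch pattern (the handlers, keywords and file name are invented for illustration):

d = {}

def handle_apples(line):
    d['apples'] = True  # or extract a value from the line here

def handle_bananas(line):
    d['bananas'] = True

keyword_handlers = {'apples': handle_apples, 'bananas': handle_bananas}

keywords = ['apples']  # only the keywords you care about on this run
for line in open('fruit.txt'):
    for word in keywords:
        if word in line:
            keyword_handlers[word](line)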
Use regular expressions. Something like the following:
>>> import re
>>> lookup = {'a': 'apple', 'b': 'banane'}  # keyword: characters to look for
>>> pattern = '|'.join('(?P<%s>%s)' % (key, val) for key, val in lookup.items())
>>> re.search(pattern, 'apple aaa').groupdict()
{'a': 'apple', 'b': None}
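Wired into the parsing loop, that might look like this (a sketch; pattern is the expression built above):

def parse_txt(f, pattern):
    d = {}
    for line in f:
        if line.startswith('End of section'):
            return d
        m = re.search(pattern, line)
        if m:
            # record every keyword group that actually matched
            for key, val in m.groupdict().items():
                if val is not None:
                    d[key] = True
    return d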
def create_parser(fruits):
    def parse_txt(f):
        d = {}
        for line in f:
            if not line:
                pass
            elif line.startswith('End of section'):
                return d
            else:
                for testfruit in fruits:
                    if testfruit in line:
                        d[testfruit] = True
    return parse_txt  # without this return, create_parser would hand back None
This is what you want - create a test function dynamically.
Depending on what you really want to do, it is, of course, possible to remove one level of complexity and define
def parse_txt(f, fruits):
    [...]
or
def parse_txt(fruits, f):
    [...]
and work with functools.partial.
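For instance (a sketch, assuming the second signature):

import functools

parse_fruit = functools.partial(parse_txt, ['apples', 'bananas'])
d = parse_fruit(open('fruit.txt'))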
You can use the set structure, like this:

fruit = set(['cocos', 'apple', 'lime'])
need = set(['cocos', 'pineapple'])
need.intersection(fruit)

which returns set(['cocos']).