Unknown error on PySpark map + broadcast - python

I have a big group of tuples with tuple[0] = integer and tuple[1] = list of integers (resulting from a groupBy). I call the value tuple[0] key for simplicity.
The values inside the lists tuple[1] can themselves be other keys.
If key = n, all elements of its list are greater than n, and the list is sorted and contains distinct values.
In the problem I am working on, I need to find the number of common elements in the following way:
0, [1,2]
1, [3,4,5]
2, [3,7,8]
.....
list of values of key 0:
1: [3,4,5]
2: [3,7,8]
common_elements between list of 1 and list of 2: 3 -> len(common_elements) = 1
Then I apply the same for keys 1, 2 etc, so:
list of values of 1:
3: ....
4: ....
5: ....
The sequential script I wrote is based on a pandas DataFrame df, with the first column v as the list of 'keys' (as index = True) and the second column n as a list of lists of values:
for i in df.v:  # iterate over each key
    for j in df.n[i]:  # iterate within the list
        common_values = set(df.n[i]).intersection(df.n[j])
        if len(common_values) > 0:
            return len(common_values)
Since this is a big dataset, I'm trying to write a parallelized version with PySpark.
df.A #column of integers
df.B #column of integers
val_colA = sc.parallelize(df.A)
val_colB = sc.parallelize(df.B)
n_values = val_colA.zip(val_colB).groupByKey().mapValues(sorted) # RDD -> n_values[0] will be the key, n_values[1] is the list of values
n_values_broadcast = sc.broadcast(n_values.collectAsMap()) #read only dictionary
def f(element):
    for i in element[1]:  # iterating the values of "key" element[0]
        common_values = set(element[1]).intersection(n_values_broadcast.value[i])
        if len(common_values) > 0:
            return len(common_values)
collection = n_values.map(f).collect()
The program fails after a few seconds with an error like KeyError: 665, but does not provide any more specific failure reason.
I'm a Spark beginner, so I'm not sure whether this is the correct approach (should I consider foreach instead? or mapPartitions?) and, especially, where the error is.
Thanks for the help.

The error is actually pretty clear and not Spark specific. You are accessing a Python dict with __getitem__ ([]):
n_values_broadcast.value[i]
and if key is missing in the dictionary you'll get KeyError. Use get method instead:
n_values_broadcast.value.get(i, [])
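Applied to the function above, a minimal runnable sketch might look like the following; a plain dict stands in for the broadcast variable's .value so it can run outside Spark:

```python
def f(element, lookup):
    """Return the size of the first non-empty intersection between
    element's value list and the list of any key it references."""
    key, values = element
    for i in values:
        # .get returns [] for keys missing from the map, instead of
        # raising KeyError the way lookup[i] would
        common_values = set(values).intersection(lookup.get(i, []))
        if len(common_values) > 0:
            return len(common_values)
    return None

lookup = {0: [1, 3], 1: [3, 4, 5], 2: [3, 7, 8]}
f((0, [1, 3]), lookup)     # [1, 3] and lookup[1] share the element 3
f((2, [3, 7, 8]), lookup)  # keys 3, 7, 8 are absent: no KeyError now
```

Inside Spark the lookup argument would simply be n_values_broadcast.value.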


Python - list index out of range in if statement

I have a list that usually contains items but is sometimes empty.
Three items from the list are added to the database, but I run into errors if it's empty, even though I'm using an if statement.
if item_list[0]:
    one = item_list[0]
else:
    one = "Unknown"
if item_list[1]:
    two = item_list[1]
else:
    two = "Unknown"
if item_list[2]:
    three = item_list[2]
else:
    three = "Unknown"
This still raises the list index out of range error if the list is empty. I can't find any other way to do it, but there must be a better one (I've also read that you should avoid using else statements?).
If a list is empty, it has no indices, and trying to access any index of the list causes an error.
The error actually occurs in the if statement itself.
You could obtain the result you expect by doing this:
one, two, three = item_list + ["unknown"] * (3 - len(item_list))
This line of code creates a temporary list consisting of the concatenation of item_list and a list of (3 minus the size of item_list) "unknown" strings, which is always a 3-item list. It then unpacks this list into the one, two and three variables.
details:
You can multiply a list to obtain a bigger list with duplicate items: ['a', 1, None] * 2 gives ['a', 1, None, 'a', 1, None]. This is used to create a list of "unknown" strings. Note that multiplying a list by 0 results in an empty list (as expected).
You can use the addition operator to concatenate 2 (or more) lists: ['a', 'b'] + [1, 2] gives ['a', 'b', 1, 2]. This is used to create a 3-items list from item_list and the 'unknown' list created by multiplication.
You can unpack a list into several variables with the assignment operator: a, b = [1, 2] gives a = 1 and b = 2. It is even possible to use extended unpacking: a, *b = [1, 2, 3] gives a = 1 and b = [2, 3].
example:
>>> item_list = [42, 77]
>>> one, two, three = item_list + ["unknown"] * (3 - len(item_list))
>>> one, two, three
(42, 77, 'unknown')
Python will throw this error if you try to access an element of an array that doesn't exist. So an empty array won't have index 0.
if item_list: # an empty list will be evaluated as False
    one = item_list[0]
else:
    one = "Unknown"
if 1 < len(item_list):
    two = item_list[1]
else:
    two = "Unknown"
if 2 < len(item_list):
    three = item_list[2]
else:
    three = "Unknown"
item_list[1] will immediately raise an error if there aren't 2 elements in the list; the behavior isn't like that of languages like Clojure, where a null value is instead returned.
Use len(item_list) > 1 instead.
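For example, the same guarded lookup can be written as a conditional expression; a short sketch of the len check:

```python
item_list = []  # sometimes empty

# fall back to "Unknown" whenever the index doesn't exist
one = item_list[0] if len(item_list) > 0 else "Unknown"
two = item_list[1] if len(item_list) > 1 else "Unknown"
three = item_list[2] if len(item_list) > 2 else "Unknown"
```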
You need to check if your list is long enough to have a value in the index position you are trying to retrieve from. If you are also trying to avoid using else in your condition statement, you can pre-assign your variables with default values.
count = len(item_list)
one, two, three = "Unknown", "Unknown", "Unknown"
if count > 0:
    one = item_list[0]
if count > 1:
    two = item_list[1]
if count > 2:
    three = item_list[2]

Compare 1 column of 2D array and remove duplicates Python

Say I have a 2D array like:
array = [['abc', 2, 3],
         ['abc', 2, 3],
         ['bb', 5, 5],
         ['bb', 4, 6],
         ['sa', 3, 5],
         ['tt', 2, 1]]
I want to remove any rows whose first column is duplicated,
i.e. compare the array[x][0] values and return only:
removeDups = [['sa', 3, 5],
              ['tt', 2, 1]]
I think it should be something like:
# set first col as tmp variable, compare tmp with remaining and
# set array as returned from compare
for x in range(len(array)):
    tmpCol = array[x][0]
    del array[x]
    removed = compare(array, tmpCol)
    array = copy.deepcopy(removed)
    print repr(len(removed)) # testing
where compare is:
# compare first col of each remaining array item with tmp;
# if match, remove, else return original array
def compare(valid, tmpCol):
    for x in range(len(valid)):
        if valid[x][0] != tmpCol:
            del valid[x]
            return valid
        else:
            return valid
I keep getting 'index out of range' error. I've tried other ways of doing this, but I would really appreciate some help!
Similar to other answers, but using a plain dictionary instead of importing Counter:
counts = {}
for elem in array:
    # add 1 to counts for this string, creating a new element at this key
    # with an initial value of 0 if needed
    counts[elem[0]] = counts.get(elem[0], 0) + 1

new_array = []
for elem in array:
    # check that there's only 1 instance of this element
    if counts[elem[0]] == 1:
        new_array.append(elem)
One option is to create a Counter for the first column of your array beforehand and then filter the list based on the count value, i.e., keep an element only if its first entry appears exactly once:
from collections import Counter
count = Counter(a[0] for a in array)
[a for a in array if count[a[0]] == 1]
# [['sa', 3, 5], ['tt', 2, 1]]
You can use a dictionary and count the occurrences of each key.
You can also use Counter from the collections library, which does exactly that.
Do as follows:
from collections import Counter

counts = Counter(k for k, _, _ in array)
removed = []
for k, val1, val2 in array:
    if counts[k] == 1:
        removed.append([k, val1, val2])

How to get each element of numpy array?

I have a numpy array, Keys, which stores some values, for example:
Keys = [2,3,4,7,8]
How do I get the index of 4 and store that index in an int variable?
For example, the index of the value 4 is 2, so 2 will be stored in an int variable.
I have tried with the following code segment:
for i in np.nditer(Keys):
    print(keys[i])
I am using Python 3.5, Spyder 3.5.2 and Anaconda 4.2.0.
Is keys a list or a numpy array?
keys = [2,3,4,7,8]  # or
keys = np.array([2,3,4,7,8])
You don't need to iterate to see the elements of either. But you can do
for i in keys:
    print(i)
for i in range(len(keys)):
    print(keys[i])
[i for i in keys]
these work for either.
If you want the index of the value 4, the list has a method:
keys.index(4)
for the array
np.where(keys==4)
is a useful bit of code. Also
np.in1d(keys, 4)
np.where(np.in1d(keys, 4))
Forget about np.nditer. That's for advanced programming, not routine iteration.
There are several ways. If the list is not too large, then:
where_is_4 = [i for i, e in enumerate(Keys) if e == 4][0]
What this does is loop over the list with an enumerator and build a list containing the enumerator's index every time the value 4 occurs; the first entry of that list is then taken.
Why not just do:
for i in range(len(Key)):
    if Key[i] == 4:
        print(i)
You can find all indices where the value is 4 using:
>>> keys = np.array([2,3,4,7,8])
>>> np.flatnonzero(keys == 4)
array([2])
There is a native numpy function for this called where.
It will return an array of the indices where some given condition is true, so you can just pick the first entry if that array isn't empty:
N = 4
indices = np.where(x == N)[0]
index = None
if indices.size:
    index = indices[0]
Use of numpy.where(condition) is a good choice here. The code below gives the location of 4.
import numpy as np
keys = np.array([2,3,4,7,8])
result = np.where(keys==4)
result[0][0]
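To end up with a plain Python int, as the question asks, the first match can be cast explicitly. A small sketch (the -1 "not found" sentinel is just an illustrative choice):

```python
import numpy as np

keys = np.array([2, 3, 4, 7, 8])
matches = np.flatnonzero(keys == 4)  # indices where the value is 4
# cast the numpy scalar to a plain int, with a fallback when absent
index = int(matches[0]) if matches.size else -1
```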

More efficient way to retrieve first occurrence of every unique value from a csv column in Python

A large csv I was given has a large table of flight data. A function I wrote to help parse it iterates over the column of Flight IDs, and then returns a dictionary containing the index and value of every unique Flight ID in order of first appearance.
Dictionary = { Index: FID, ... }
This comes as a quick adjustment to an older function that didn't have to worry about FID repeats in the column (a few hundred thousand rows later...).
Right now, it iterates over the column and compares each value in order. If a value is equal to the value after it, it skips it. If the next value is different, it stores the value in the dictionary. I changed it to also check whether that value has already occurred before, and if so, to skip it.
Here's my code:
def DiscoverEarliestIndex(self, number):
    finaldata = {}
    columnvalues = self.column(number)
    columnenum = {}
    for a, b in enumerate(columnvalues):
        columnenum[a] = b
    i = 0
    while i < (len(columnvalues) - 1):
        next = columnenum[i+1]
        if columnvalues[i] == next:
            i += 1
        else:
            if next in finaldata.values():
                i += 1
                continue
            else:
                finaldata[i+1] = next
                i += 1
    else:
        return finaldata
It's very inefficient, and slows down as the dictionary grows. The column has 5.2 million rows, so it's obviously not a good idea to handle this much with Python, but I'm stuck with it for now.
Is there a more efficient way to write this function?
if next in finaldata.values():
is probably your problem. What you are doing here is:
creating a list
searching the list
Maybe you can use a set to hold the values and search that instead, something like this:
searchable_data = set()  # holds the stored values, for O(1) membership tests
while i < (len(columnvalues) - 1):
    next = columnenum[i+1]
    if columnvalues[i] == next:
        i += 1
    else:
        if next in searchable_data:
            i += 1
            continue
        else:
            finaldata[i+1] = next
            searchable_data.add(next)
            i += 1
else:
    return finaldata
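The while loop, the columnenum dict, and the duplicate checks can also be collapsed into a single pass with enumerate. A sketch (the function name is just illustrative):

```python
def first_indices(columnvalues):
    """Map the index of each value's first occurrence to the value itself."""
    finaldata = {}
    seen = set()
    for i, value in enumerate(columnvalues):
        if value not in seen:  # O(1) membership test, unlike dict.values()
            finaldata[i] = value
            seen.add(value)
    return finaldata

first_indices(['a', 'a', 'c', 'c', 'd'])  # {0: 'a', 2: 'c', 4: 'd'}
```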
You are essentially looking for a database. Databases are made exactly for such operations on large datasets. It will be much faster to parse the entire CSV at once using the csv module and send the rows into a database than to store them in a dict and run checks against the entire dict.
*large* python dictionary with persistence storage for quick look-ups
To answer your question directly, you should be able to do this with dict comprehensions and the itertools module.
>>> import itertools as it
>>> data = {1: 'a', 2: 'a', 3: 'c', 4: 'c', 5: 'd'}
>>> grouped = {k: list(v) for (k, v) in it.groupby(data.items(), lambda kv: kv[1])}
>>> first_occurrences = {v[0][0]: k for (k, v) in grouped.items()}
>>> first_occurrences
{1: 'a', 3: 'c', 5: 'd'}
I think this can be tweaked a bit--I'm not super happy about going over the dict twice. But anyway, the dict comprehensions are pretty efficient. Also, groupby assumes that your keys are in order---that is, it assumes that all the 'a' indices are grouped together, which seems to be true in your case.

How to understand this python code?

This code is from the book Learning Python and is used to sum columns in a text file separated by commas. I really can't understand lines 7, 8 and 9.
Thanks for the help. Here is the code:
filename = 'data.txt'
sums = {}
for line in open(filename):
    cols = line.split(',')
    nums = [int(col) for col in cols]
    for (ix, num) in enumerate(nums):
        sums[ix] = sums.get(ix, 0) + num
for key in sorted(sums):
    print(key, '=', sums[key])
It looks like the input file contains rows of comma-separated integers. This program prints out the sum of each column.
You've mixed up the indentation, which changes the meaning of the program, and it wasn't terribly nicely written to begin with. Here it is with lots of comments:
filename = 'data.txt'  # name of the text file
sums = {}  # dictionary of { column: sum }
# not initialized, because you don't know how many columns there are

# for each line in the input file,
for line in open(filename):
    # split the line by commas, resulting in a list of strings
    cols = line.split(',')
    # convert each string to an integer, resulting in a list of integers
    nums = [int(col) for col in cols]
    # Enumerating a list numbers the items - ie,
    # enumerate([7,8,9]) -> [(0,7), (1,8), (2,9)]
    # It's used here to figure out which column each value gets added to
    for ix, num in enumerate(nums):
        # sums.get(index, defaultvalue) is the same as sums[index] IF sums already has a value for index
        # if not, sums[index] throws an error but .get returns defaultvalue
        # So this gets a running sum for the column if it exists, else 0;
        # then we add the new value and store it back to sums.
        sums[ix] = sums.get(ix, 0) + num

# Go through the sums in ascending order by column -
# this is necessary because dictionaries have no inherent ordering
for key in sorted(sums):
    # and for each, print the column# and sum
    print(key, '=', sums[key])
I would write it a bit differently; something like
from collections import Counter

sums = Counter()
with open('data.txt') as inf:
    for line in inf:
        values = [int(v) for v in line.split(',')]
        # Counter.update accepts a mapping; dict(enumerate(...)) maps
        # column index -> value, so the values are summed per column
        sums.update(dict(enumerate(values)))

for col, total in sorted(sums.items()):
    print("{}: {}".format(col, total))
Assuming you understand lines 1-6…
Line 7:
sums[ix]=sums.get(ix, 0)+num
sums.get(ix, 0) is the same as sums[ix], except that if ix not in sums it returns 0 instead. So, this is just like sums[ix] += num, except that it first sets the value to 0 if this is the first time you've seen ix.
So, it should be clear that by the end of this loop, sums[ix] is going to have the sum of all values in column ix.
This is a silly way to do this. As mgilson points out, you could just use defaultdict so you don't need that extra logic. Or, even more simply, you could just use a list instead of a dict, because this (indexing by sequential small numbers) is exactly what lists are for…
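A quick demonstration of the difference between the two lookups:

```python
sums = {0: 4}
value_present = sums.get(0, 0)  # key exists: same as sums[0]
value_missing = sums.get(9, 0)  # key missing: the default, not a KeyError
# sums[9] would raise KeyError instead
```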
Line 8:
for key in sorted(sums):
First, you can iterate over any dict as if it were a list or other iterable, and it has the same effect as iterating over sums.keys(). So, if sums looks like { 0: 4, 1: 6, 2: 3 }, you're going to iterate over 0, 1, 2.
Except that dicts don't have any inherent order. You may get 0, 1, 2, or you may get 1, 0, 2, or any other order.
So, sorted(sums) just returns a copy of that list of keys in sorted order, guaranteeing that you'll get 0, 1, 2 in that order.
Again, this is silly, because if you just used a list in the first place, you'd get things in order.
Line 9:
print(key, '=', sums[key])
This one should be obvious. If key iterates over 0, 1, 2, then this is going to print 0 = 4, 1 = 6, 2 = 3.
So, in other words, it's printing out each column number, together with the sum of all values in that column.
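A list-based version, as suggested above, might look like this; it is written as a function over lines of text (the name is illustrative) so it doesn't need a file on disk:

```python
def column_sums(lines):
    """Sum each comma-separated column across the given lines of text."""
    sums = []  # sums[ix] is the running total for column ix
    for line in lines:
        for ix, num in enumerate(int(col) for col in line.split(',')):
            if ix == len(sums):
                sums.append(0)  # first time we've seen this column
            sums[ix] += num
    return sums

column_sums(['1,2,3', '4,5,6'])  # [5, 7, 9]
```

Because column indices are sequential small integers, the list gives the ordering for free, with no need for sorted().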
