Spark Streaming updateStateByKey with tuple as a value - python

Is it possible to use the updateStateByKey() function with a tuple as a value? I am using PySpark and my input is (word, (count, tweet_id)), which means word is the key and the tuple (count, tweet_id) is the value. The task of updateStateByKey is, for each word, to sum its counts and build a list of all tweet_ids that contain the word.
I implemented the following update function, but I get a "list index out of range" error for new_values at index 1:
def updateFunc(new_values, last_sum):
    count = 0
    tweets_id = []
    if last_sum:
        count = last_sum[0]
        tweets_id = last_sum[1]
    return sum(new_values[0]) + count, tweets_id.extend(new_values[1])
And calling the method:
running_counts.updateStateByKey(updateFunc)

I've found the solution. The problem was with checkpointing, which persists the current state to disk in case of a failure. It caused problems because when I changed my definition of the state, the checkpoint still held the old state without a tuple. Therefore, I deleted the checkpoint from disk and implemented the final solution as:
def updateFunc(new_values, last_sum):
    count = 0
    counts = [field[0] for field in new_values]
    ids = [field[1] for field in new_values]
    if last_sum:
        count = last_sum[0]
        new_ids = last_sum[1] + ids
    else:
        new_ids = ids
    return sum(counts) + count, new_ids
Finally, the answer to my question is: yes, the state can be a tuple or any other data type for storing more values.
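For reference, here is a minimal sketch of how this updateFunc can be wired into a streaming job. The app name, checkpoint path, and queued sample data are illustrative assumptions, not part of the original job; the key point is that updateStateByKey requires a checkpoint directory to be set.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="word-tweet-state")
ssc = StreamingContext(sc, 5)                  # 5-second batches
ssc.checkpoint("/tmp/word_tweet_checkpoint")   # delete this dir if the state layout changes

# Fake input: a queue of RDDs of (word, (count, tweet_id)) pairs
rdd1 = sc.parallelize([("spark", (1, "t1")), ("python", (2, "t2"))])
rdd2 = sc.parallelize([("spark", (3, "t3"))])
pairs = ssc.queueStream([rdd1, rdd2])

# State per word: (total_count, [tweet_id, ...]), maintained by updateFunc above
running_counts = pairs.updateStateByKey(updateFunc)
running_counts.pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(20)
ssc.stop(stopSparkContext=True, stopGraceFully=True)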

Related

Count occurrences of a specific string within multi-valued elements in a set

I have generated a list of genes
genes = ['geneName1', 'geneName2', ...]
and a set of their interactions:
geneInt = {('geneName1', 'geneName2'), ('geneName1', 'geneName3'),...}
I want to find out how many interactions each gene has and put that in a vector (or dictionary) but I struggle to count them. I tried the usual approach:
interactionList = []
for gene in genes:
    interactions = geneInt.count(gene)
    interactionList.append(interactions)
but of course the code fails because my set contains elements that are made out of two values while I need to iterate over the single values within.
I would argue that you are using the wrong data structure to hold interactions. You can represent interactions as a dictionary keyed by gene name, whose values are a set of all the genes it interacts with.
Let's say you currently have a process that does something like this at some point:
geneInt = set()
...
geneInt.add((gene1, gene2))
Change it to
geneInt = collections.defaultdict(set)
...
geneInt[gene1].add(gene2)
If the interactions are symmetrical, add a line
geneInt[gene2].add(gene1)
Now, to count the number of interactions, you can do something like
intCounts = {gene: len(ints) for gene, ints in geneInt.items()}
Counting from your original set is simple if the interactions are one-way as well:
intCounts = dict.fromkeys(genes, 0)
for gene, _ in geneInt:
    intCounts[gene] += 1
If each interaction is two-way, there are three possibilities:
Both interactions are represented in the set: the above loop will work.
Only one interaction of a pair is represented: change the loop to
for gene1, gene2 in geneInt:
    intCounts[gene1] += 1
    if gene1 != gene2:
        intCounts[gene2] += 1
Some reverse interactions are represented, some are not. In this case, transform geneInt into a dictionary of sets as shown in the beginning.
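To tie the pieces together, here is a small self-contained sketch of this approach (the gene names and pairs below are made-up placeholders, symmetrical interactions assumed):
import collections

# Build the interaction map keyed by gene name
geneInt = collections.defaultdict(set)
pairs = [('geneName1', 'geneName2'), ('geneName1', 'geneName3')]
for gene1, gene2 in pairs:
    geneInt[gene1].add(gene2)
    geneInt[gene2].add(gene1)

# Count interactions per gene
intCounts = {gene: len(ints) for gene, ints in geneInt.items()}
print(intCounts)  # {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}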
Try something like this,
interactions = {}
for gene in genes:
    interactions_count = 0
    for tup in geneInt:
        interactions_count += tup.count(gene)
    interactions[gene] = interactions_count
Use a dictionary, and keep incrementing the value for every gene you see in each tuple in the set geneInt.
interactions_counter = dict()
for interaction in geneInt:
    for gene in interaction:
        interactions_counter[gene] = interactions_counter.get(gene, 0) + 1
The dict.get(key, default) method returns the value at the given key, or the specified default if the key doesn't exist.
For the set geneInt={('geneName1', 'geneName2'), ('geneName1', 'geneName3')}, we get:
interactions_counter = {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}

Iterating through a list and assigning an ID to each group of elements

I have a list of lists called new_order_list, and I am iterating through it. I would like to create sub-batches of 20 unique ids from these lists. The same id may appear in the next list, so I keep track of the ids in the order_chk_lst list. If an id is repeated, I would like to skip that element and check the next one. I am assigning a unique ID to each sub-batch (of 20 elements). I have tried the following code, but I never get more than 20 ids. I would really appreciate your feedback. Thank you.
new_order_list
[5029339601, 5029339775, 5029338374, 5029338219, 5029339927, 5029338917, 5029338917, 5029338219, 5029339601, 5029338905, 5029339320, 5029338282, 5029338374, 5029339109, 5029339320, 5029369758, 5029338282, 5029369758, 5029368075, 5029368652, 5029339941, 5029368652, 5029369810, 5029339584, 5029339584, 5029339775, 5029369810, 5029338531, 5029368003, 5029339536, 5029340252, 5029338531, 5029339137, 5029340252, 5029368003, 5029339137, 5029339536, 5029338531, 5029367966, 5029339109, 5029338390, 5029368075, 5029339576, 5029368083, 5029338209, 5029338417, 5029338905, 5029339576, 5029339941, 5029368075, 5029339895, 5029340051, 5029368075, 5029338390, 5029370218, 5029370218, 5029338209, 5029340051, 5029339895, 5029367966, 5029338417]
[5029370469, 5029368482, 5029370383, 5029340357, 5029340357, 5029370563, 5029370469, 5029340412, 5029339528, 5029370121, 5029370121, 5029370121, 5029368482, 5029368535, 5029370563, 5029339528, 5029370328, 5029368866, 5029369260, 5029369260, 5029369326, 5029370469, 5029338175, 5029338175, 5029368535, 5029368866, 5029368248, 5029340270, 5029339842, 5029339528, 5029340287, 5029338230, 5029368248, 5029368535, 5029368866, 5029340270, 5029339513, 5029369326, 5029368528, 5029340412, 5029339842, 5029338230, 5029370469, 5029370328, 5029369961, 5029340287, 5029370563, 5029370383, 5029340476, 5029340476]
implementation
MAX_ORDER = 20
batch_id = 10000000
sub_batch_id = 10000000
order_chk_lst = []  # ids already assigned to a sub-batch
for i, order in enumerate(new_order_list):
    # Increment batch_id every time the order count reaches MAX_ORDER
    if order in order_chk_lst:
        # if the id is repeated then go to the next one
        # (I think I am making a mistake here, as the value of `i` will keep changing)
        continue
    order_chk_lst.append(order)
    if i % MAX_ORDER == 0:
        batch_id = 1
        # assign sub_batch_id for each zone (i == 0 will be the first assignment within the batch)
        # This is my function which will assign the batch id (I have added this for reference)
        sub_batch_assign, sub_batch_id = assign_sub_batch(zones, sub_batch_id)
        # e.g. sub_batch_assign = {"1A": 10000000, "1B": 10000001, "1D": 10000002}

def assign_sub_batch(zones: list, sub_batch_id: int) -> (dict, int):
    sub_batch_assign = {}
    for zone in zones:
        sub_batch_assign[zone] = sub_batch_id
        sub_batch_id += 1
    return (sub_batch_assign, sub_batch_id)
If you want unique items, just use a set; it will remove all the duplicates. Iterate over the list of lists and use update to add the items to order_chk_lst:
order_chk_lst = set()
for lst in new_order_list:
    order_chk_lst.update(lst)
You can also change it back to a list if you really need one:
order_chk_lst = list(order_chk_lst)
If the order is important, you can use the fact that dicts preserve insertion order since Python 3.6:
order_chk_dict = {}
for lst in new_order_list:
    order_chk_dict.update(dict.fromkeys(lst))
order_chk_lst = list(order_chk_dict.keys())
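The question also wants the unique ids grouped into sub-batches of 20, each with its own ID. Here is a minimal sketch of that step, building on the ordered deduplication above; the starting id of 10000000 follows the question, while the chunking itself is my own assumption about the intent:
MAX_ORDER = 20

# Deduplicate while preserving order, as shown above
seen = {}
for lst in new_order_list:
    seen.update(dict.fromkeys(lst))
unique_orders = list(seen)

# Slice the unique ids into sub-batches of MAX_ORDER and give each one an id
sub_batches = {}
sub_batch_id = 10000000
for start in range(0, len(unique_orders), MAX_ORDER):
    sub_batches[sub_batch_id] = unique_orders[start:start + MAX_ORDER]
    sub_batch_id += 1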

Python: Summing a pig tuple containing float values

I'm fairly new to Pig/Python and in need of help. I'm trying to write a Pig script that reconciles financial data. The parameters used follow a syntax like (grand_tot, x1, x2, ... xn), meaning that the first value should equal the sum of the remaining values.
I don't know of a way to accomplish this using Pig alone, so I've been trying to write a Python UDF. Pig passes a tuple to Python; if the sum of x1:xn equals grand_tot, then Python should return a "1" to Pig to show that the numbers match, otherwise it returns a "0".
Here is what I have so far:
register 'myudf.py' using jython as myfuncs;
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d);
A1 = GROUP A ALL;
B = FOREACH A1 GENERATE TOTUPLE($recon1) as flds;
C = FOREACH B GENERATE myfuncs.isReconciled(flds) AS res;
DUMP C;
$recon1 is passed as a parameter, and defined as:
grand_tot, west_region, east_region
I will later pass $recon2 as:
grand_tot, prod_line_a, prod_line_b, prod_line_c, prod_line_d
Sample row of data (in $file_nm) looks like:
grand_tot,west_region,east_region,prod_line_a,prod_line_b, prod_line_c, prod_line_d
10000,4500,5500,900,2200,450,3700,2750
12500,7500,5000,3180,2770,300,3950,2300
9900,7425,2475,1320,460,3070,4630,1740
Lastly... here is what I'm trying to do with Python UDF code:
#outputSchema("result")
def isReconciled(arrTuple):
    arrTemp = []
    arrNew = []
    string1 = ""
    result = 0
    ## the first element of the Tuple should be the sum of remaining values
    varGrandTot = arrTuple[0]
    ## create a new array with the remaining Tuple values
    arrTemp = arrTuple[1:]
    for item in arrTuple:
        arrNew.append(item)
    ## sum the second to the nth values
    varSum = sum(arrNew)
    ## if the first value in the tuple equals the sum of all remaining values
    if varGrandTot = varSum then:
        #reconciled to the penny
        result = 1
    else:
        result = 0
    return result
The error message I receive is:
unsupported operand type(s) for +: 'int' and 'array.array'
I've tried numerous things attempting to convert the array values into numeric and convert to float so that I can sum, but with no success.
Any ideas??? Thanks for looking!
You can do this in PIG itself.
First, specify the data types in the schema. PigStorage uses bytearray as the default data type, hence your Python script is throwing the error. Also, your sample data has ints, but in your question you mentioned floats.
Second, add the fields starting from the second field, or the fields of your choice.
Third, use the bincond operator to compare the first field's value with the sum.
A = LOAD '$file_nm' USING PigStorage(',') AS (grand_tot:float,west_region:float,east_region:float,prod_line_a:float,prod_line_b:float, prod_line_c:float, prod_line_d:float);
A1 = FOREACH A GENERATE grand_tot,SUM(TOBAG(prod_line_a,prod_line_b,prod_line_c,prod_line_d)) as SUM_ALL;
B = FOREACH A1 GENERATE (grand_tot == SUM_ALL ? 1 : 0);
DUMP B;
It is very likely that your arrTuple is not an array of numbers, but that some item in it is itself an array.
To check it, modify your code by adding some checks:
#outputSchema("result")
def isReconciled(arrTuple):
    # some checks
    tmpl = "Item # {i} shall be a number (has value {itm} of type {tp})"
    for i, num in enumerate(arrTuple):
        msg = tmpl.format(i=i, itm=num, tp=type(num))
        assert isinstance(num, (int, long, float)), msg
    # end of checks
    arrTemp = []
    arrNew = []
    string1 = ""
    result = 0
    ## the first element of the Tuple should be the sum of remaining values
    varGrandTot = arrTuple[0]
    ## create a new array with the remaining Tuple values
    arrTemp = arrTuple[1:]
    for item in arrTuple:
        arrNew.append(item)
    ## sum the second to the nth values
    varSum = sum(arrNew)
    ## if the first value in the tuple equals the sum of all remaining values
    if varGrandTot == varSum:
        # reconciled to the penny
        result = 1
    else:
        result = 0
    return result
It is very likely that it will throw an AssertionError on one of the items. Read the assertion message to learn which item is causing the trouble.
Anyway, if you want to return 1 or 0 depending on whether the first number equals the sum of the rest of the array, the following would work too:
#outputSchema("result")
def isReconciled(arrTuple):
    if arrTuple[0] == sum(arrTuple[1:]):
        return 1
    else:
        return 0
and in case you are happy getting True instead of 1 and False instead of 0:
#outputSchema("result")
def isReconciled(arrTuple):
    return arrTuple[0] == sum(arrTuple[1:])
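As a quick sanity check outside Pig, here is a small usage example against the first sample row, laid out per the $recon1 parameter (grand_tot, west_region, east_region); with the 1/0 version it prints 1, with the boolean version it prints True:
row = (10000.0, 4500.0, 5500.0)
print(isReconciled(row))  # 4500 + 5500 == 10000, so the row reconciles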

How to append data to an existing LMDB?

I have around 1 million images to put in this dataset, appended to the set 10,000 at a time.
I'm sure the map_size is wrong, based on this article.
I used this line to create the set:
env = lmdb.open(Path+'mylmdb', map_size=int(1e12))
and I use this line every 10,000 samples to write data to the file, where X and Y are placeholders for the data to be put in the LMDB:
env = create(env, X[:counter, :, :, :], Y, counter)

def create(env, X, Y, N):
    with env.begin(write=True) as txn:
        # txn is a Transaction object
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = X.shape[1]
            datum.height = X.shape[2]
            datum.width = X.shape[3]
            datum.data = X[i].tostring()  # or .tobytes() if numpy >= 1.9
            datum.label = int(Y[i])
            str_id = '{:08}'.format(i)
            # The encode is only essential in Python 3
            txn.put(str_id.encode('ascii'), datum.SerializeToString())
            #pdb.set_trace()
    return env
How can I edit this code so that new data is added to the LMDB rather than replaced, as the present method overwrites entries at the same positions?
I have checked the length after generation with env.stat().
Let me expand on my comment above.
All entries in LMDB are stored according to unique keys, and your database already contains keys for i = 0, 1, 2, .... You need a way to find unique keys for each i. The simplest way to do that is to find the largest key in the existing DB and keep adding to it.
Assuming that existing keys are consecutive,
max_key = env.stat()["entries"]
Otherwise, a more thorough approach is to iterate over all keys:
max_key = 0
with env.begin() as txn:
    for key, _value in txn.cursor():
        max_key = max(max_key, int(key))  # keys were written as zero-padded integers
Finally, simply replace the line in your for loop that builds the key,
str_id = '{:08}'.format(i)
by
str_id = '{:08}'.format(max_key + 1 + i)
to append to the existing database.
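Putting the pieces together, here is a minimal sketch of an append-aware writer (the helper name and key handling are my own assumptions, following the answer above); it assumes existing keys are consecutive and start at 0, so the next free key is simply the entry count:
import caffe

def append_batch(env, X, Y, N):
    next_key = env.stat()["entries"]  # number of existing entries == next free key
    with env.begin(write=True) as txn:
        for i in range(N):
            datum = caffe.proto.caffe_pb2.Datum()
            datum.channels = X.shape[1]
            datum.height = X.shape[2]
            datum.width = X.shape[3]
            datum.data = X[i].tostring()
            datum.label = int(Y[i])
            str_id = '{:08}'.format(next_key + i)  # continue after existing keys
            txn.put(str_id.encode('ascii'), datum.SerializeToString())
    return env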

Python Linear Search Better Efficiency

I've got a question regarding Linear Searching in Python. Say I've got the base code of
for l in lines:
    for f in search_data:
        if my_search_function(l[1], [f[0], f[2]]):
            print "Found it!"
            break
in which we want to determine where in search_data the value stored in l[1] exists. Say my_search_function() looks like this:
def my_search_function(search_key, search_values):
    for s in search_values:
        if search_key in s:
            return True
    return False
Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.
for line in lines:
    negative_index = -1
    positive_index = 0
    middle_element = len(search_data) / 2 if len(search_data) % 2 == 0 else (len(search_data) - 1) / 2
    found = False
    while positive_index < middle_element:
        # print str(positive_index) + "," + str(negative_index)
        if my_search_function(line[1], [search_data[positive_index][0], search_data[negative_index][0]]):
            print "Found it!"
            break
        positive_index = positive_index + 1
        negative_index = negative_index - 1
However, I'm not seeing any speed increase from this. Does anyone have a better approach? I'm looking to cut the processing time in half, as I'm working with large amounts of CSV and the processing time for one file is > 00:15, which is unacceptable since I'm processing batches of 30+ files. Basically the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK, and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require breaking the values down like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!
Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.
Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with help of the bisect module.
Or, instead of a large number of searches, sort also a combined list of all the search keys and traverse both lists in parallel, looking for matches. (Proceeding in a similar manner to the merge step in merge sort, without actually outputting a merged list)
Code to illustrate the second approach:
lines = ['AS12', 'AS123', 'AS123J', 'AS123JK', 'AS124']
search_keys = ['AS123', 'AS125']
try:
    iter_keys = iter(sorted(search_keys))
    key = next(iter_keys)
    for line in sorted(lines):
        if line.startswith(key):
            print('Line {} matches {}'.format(line, key))
        else:
            while key < line[:len(key)]:
                key = next(iter_keys)
except StopIteration:  # all keys processed
    pass
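For the first suggestion (a sorted flat list plus binary search), here is a minimal sketch using the bisect module; it assumes the field layout from the question (l[1] holds the SKU-like value, f[0] and f[2] hold the search values) and relies on the fact that all strings sharing a prefix are contiguous in a sorted list:
import bisect

# Extract the line values once, keeping the original row index, and sort them
flat = sorted((row[1], idx) for idx, row in enumerate(lines))
values = [v for v, _ in flat]

for f in search_data:
    for key in (f[0], f[2]):
        pos = bisect.bisect_left(values, key)
        # every line value starting with `key` sits in one contiguous block here
        while pos < len(values) and values[pos].startswith(key):
            print "Found it!", key, "matches row", flat[pos][1]
            pos += 1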
It depends on the details of the problem.
For instance, if you were searching for complete words, you could create a hash table of the searchable elements, and the final search would be a simple lookup.
Filling the hash table is pseudo-linear.
Ultimately, I broke down and implemented binary search on my multidimensional lists by sorting with the sorted() function and a lambda as the key argument. Here is the first-pass code that I whipped up. It's not 100% efficient, but it's a vast improvement over where we were:
def binary_search(master_row, source_data, master_search_index, source_search_index):
    lower_bound = 0
    upper_bound = len(source_data) - 1
    found = False
    while lower_bound <= upper_bound and not found:
        middle_pos = (lower_bound + upper_bound) // 2
        if source_data[middle_pos][source_search_index] < master_row[master_search_index]:
            if search([source_data[middle_pos][source_search_index]], [master_row[master_search_index]]):
                return {"result": True, "index": middle_pos}
            lower_bound = middle_pos + 1
        elif source_data[middle_pos][source_search_index] > master_row[master_search_index]:
            if search([master_row[master_search_index]], [source_data[middle_pos][source_search_index]]):
                return {"result": True, "index": middle_pos}
            upper_bound = middle_pos - 1
        else:
            if len(source_data[middle_pos][source_search_index]) > 5:
                return {"result": True, "index": middle_pos}
            else:
                break
    # nothing matched in this column
    return {"result": False, "index": -1}
and then where we actually make the Binary Search call
# where master_copy is the first multidimensional list, data_copy is the second
# the search columns are the columns we want to search against
for line in master_copy:
    for m in master_search_columns:
        found = False
        for d in data_search_columns:
            data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
            results = binary_search(line, data_copy, m, d)
            found = results["result"]
            if found:
                line = update_row(line, data_copy[results["index"]], column_mapping)
                found_count = found_count + 1
                break
        if found:
            break
Here's the info for sorting a multidimensional list: Python Sort Multidimensional Array Based on 2nd Element of Subarray.
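Since the per-column sort is what makes the binary search possible, here is a tiny illustrative snippet (made-up rows) of sorting a multidimensional list on a chosen column with a lambda key:
rows = [['AS124', 3], ['AS123', 1], ['AS125', 2]]
rows_sorted = sorted(rows, key=lambda x: x[0])  # sort on column 0
print rows_sorted  # [['AS123', 1], ['AS124', 3], ['AS125', 2]]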
