I am trying to do some analytics against a large dictionary created by reading a file from disk. The read operation results in a stable memory footprint. I then have a method which performs some calculations based on data I copy out of that dictionary into a temporary dictionary. I do this so that all the copying and data use is scoped in the method, and would, I had hoped, disappear at the end of the method call.
Sadly, I am doing something wrong. The customerdict definition is as follows (defined at the top of the .py file):
customerdict = collections.defaultdict(dict)
The format of the object is {customerid: {id: 0 or 1}}.
There is also a similarly defined dictionary called allids.
I have a method for calculating the sim_pearson distance (modified code from Programming Collective Intelligence book), which is below.
def sim_pearson(custID1, custID2):
    si = []
    smallcustdict = {}
    smallcustdict[custID1] = customerdict[custID1]
    smallcustdict[custID2] = customerdict[custID2]
    # a loop to round out the remaining allids object to fill in 0 values
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][asin] = 0.0
    # get the list of mutually rated items
    for id in smallcustdict[custID1]:
        if id in smallcustdict[custID2]:
            si.append(id)  # = 1
    # return 0 if there are no matches
    if len(si) == 0: return 0
    # add up all the preferences
    sum1 = sum([smallcustdict[custID1][id] for id in si])
    sum2 = sum([smallcustdict[custID2][id] for id in si])
    # sum up the squares
    sum1sq = sum([pow(smallcustdict[custID1][id], 2) for id in si])
    sum2sq = sum([pow(smallcustdict[custID2][id], 2) for id in si])
    # sum up the products
    psum = sum([smallcustdict[custID1][id] * smallcustdict[custID2][id] for id in si])
    # calc Pearson score
    num = psum - (sum1 * sum2 / len(si))
    den = sqrt((sum1sq - pow(sum1, 2) / len(si)) * (sum2sq - pow(sum2, 2) / len(si)))
    del smallcustdict
    del si
    del sum1
    del sum2
    del sum1sq
    del sum2sq
    del psum
    if den == 0:
        return 0
    return num / den
Every call to the sim_pearson method grows the memory footprint of python.exe without bound. I tried using the del statement to explicitly delete the locally scoped variables.
Looking at Task Manager, the memory jumps up in 6-10 MB increments. Once the initial customerdict is set up, the footprint is 137 MB.
Any ideas why I am running out of memory doing it this way?
I presume the issue is here:
smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]
# a loop to round out the remaining allids object to fill in 0 values
for customerID, catalog in smallcustdict.iteritems():
    for id in allids:
        if id not in catalog:
            smallcustdict[customerID][asin] = 0.0
The dictionaries from customerdict are being referenced in smallcustdict, so when you add to them, the additions persist. This is the only point I can see where you do anything that persists outside the method's scope, so I would imagine this is the problem.
Note that you are making a lot of work for yourself in many places by not using list comprehensions, by repeating the same computation, and by not writing things generically. A better version might be as follows:
import collections
import functools
import operator

customerdict = collections.defaultdict(dict)

def sim_pearson(custID1, custID2):
    # Declaring as a dict literal is nicer.
    smallcustdict = {
        custID1: customerdict[custID1],
        custID2: customerdict[custID2],
    }
    # Unchanged, as I'm not sure what the intent is here.
    for customerID, catalog in smallcustdict.iteritems():
        for id in allids:
            if id not in catalog:
                smallcustdict[customerID][asin] = 0.0
    # Dict views are set-like, so the easier way to do what you want is the intersection of the two.
    si = smallcustdict[custID1].viewkeys() & smallcustdict[custID2].viewkeys()
    # "if not" is a cleaner way of checking for no values.
    if not si:
        return 0
    # Made more generic to avoid repetition and wastefully looping repeatedly.
    parts = [list(part) for part in zip(*((value[id] for value in smallcustdict.values()) for id in si))]
    sums = [sum(part) for part in parts]
    sumsqs = [sum(pow(i, 2) for i in part) for part in parts]
    psum = sum(functools.reduce(operator.mul, part) for part in zip(*parts))
    sum1, sum2 = sums
    sum1sq, sum2sq = sumsqs
    # Unchanged.
    num = psum - (sum1 * sum2 / len(si))
    den = sqrt((sum1sq - pow(sum1, 2) / len(si)) * (sum2sq - pow(sum2, 2) / len(si)))
    # Again using "if not".
    if not den:
        return 0
    else:
        return num / den
Note that this is entirely untested, as the code you gave isn't a complete example. However, it should be easy enough to use as a basis for improvement.
Try changing the following two lines:
smallcustdict[custID1] = customerdict[custID1]
smallcustdict[custID2] = customerdict[custID2]
to
smallcustdict[custID1] = customerdict[custID1].copy()
smallcustdict[custID2] = customerdict[custID2].copy()
That way the changes you make to the two dictionaries do not persist in customerdict when the sim_pearson() function returns.
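To see why the copy matters, here is a minimal, self-contained sketch (hypothetical customer and item ids, not your real data) showing that assigning the inner dictionary only copies a reference, while .copy() creates an independent dictionary:
import collections

source = collections.defaultdict(dict)
source['cust1'] = {'item1': 1}

alias = {}
alias['cust1'] = source['cust1']          # same inner dict object
alias['cust1']['item2'] = 0.0             # mutates source as well
print(source['cust1'])                    # {'item1': 1, 'item2': 0.0}

copied = {}
copied['cust1'] = source['cust1'].copy()  # shallow copy of the inner dict
copied['cust1']['item3'] = 0.0            # source is unaffected this time
print('item3' in source['cust1'])         # False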
I'm currently trying to get into Python and Ren'Py a bit. Since I like to build things dynamically and don't want to copy and paste so often, I'm trying to create a page that will have as many ImageButtons as the number I specify.
In the example below I use 4, but this can be higher.
I have built a class for this purpose:
Example:
init python:
    class PictureSettings(object):
        def __init__(self, ImgIdle, ImgHover, LabelCall):
            self.ImgIdle = ImgIdle
            self.ImgHover = ImgHover
            self.LabelCall = LabelCall
            return
It holds the images for idle/hover and the label for the jump.
If I now append each entry to the list "manually" in the code, all 4 pictures are displayed as desired.
Example: (Works - but is not dynamic)
python:
    var_pictures = []
    var_pictures.append(PictureSettings("img_picture_1_idle", "img_picture_1_hover", "picture_1"))
    var_pictures.append(PictureSettings("img_picture_2_idle", "img_picture_2_hover", "picture_2"))
    var_pictures.append(PictureSettings("img_picture_3_idle", "img_picture_3_hover", "picture_3"))
    var_pictures.append(PictureSettings("img_picture_4_idle", "img_picture_4_hover", "picture_4"))
I would like it to be like this:
Example (here I only get "img_picture_4_idle", "img_picture_4_hover", "picture_4"):
$ countlimit = 4
$ count = 1

python:
    while count < countlimit:
        var_pictures = []
        var_pictures.append(PictureSettings(
            ImgIdle = "img_picture_[count]_idle",
            ImgHover = "img_picture_[count]_hover",
            LabelCall = "picture_[count]"))
        count += 1
I have already tried various things, unfortunately without success.
For example, I tried add instead of append (because append seemed to overwrite the result and leave only the last entry), and I get the following error:
var_pictures.add(PictureSettings(
AttributeError: 'RevertableList' object has no attribute 'add'
Maybe someone can help me with the solution so I can keep my code dynamic without copying something X times.
Thanks for your help
You are creating your list inside your loop, so it is recreated every time.
At the end, you only get the last created list.
var_pictures = []
while count < countlimit:
    var_pictures.append(PictureSettings(
        ImgIdle = "img_picture_[count]_idle",
        ImgHover = "img_picture_[count]_hover",
        LabelCall = "picture_[count]"))
    count += 1
On another subject, if you want to do this in a more pythonic way:
pictures = []  # no need for var_, we know it's a variable
for i in range(1, 5):
    pictures.append(PictureSettings(
        # in Python, we prefer snake_case attributes
        img_idle=f'img_picture_{i}_idle',
        img_hover=f'img_picture_{i}_hover',
        ...
    ))

# or even shorter with a list comprehension
pictures = [
    PictureSettings(
        img_idle=f'img_picture_{i}_idle',
    )
    for i in range(1, 5)
]
By the way, there is no need to return in your class constructor.
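For illustration, a minimal sketch of the class from the question without the explicit return (attribute names kept as-is); __init__ always returns None implicitly:
init python:
    class PictureSettings(object):
        def __init__(self, ImgIdle, ImgHover, LabelCall):
            # no explicit return needed: __init__ implicitly returns None
            self.ImgIdle = ImgIdle
            self.ImgHover = ImgHover
            self.LabelCall = LabelCall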
I have generated a list of genes
genes = ['geneName1', 'geneName2', ...]
and a set of their interactions:
geneInt = {('geneName1', 'geneName2'), ('geneName1', 'geneName3'),...}
I want to find out how many interactions each gene has and put that in a vector (or dictionary) but I struggle to count them. I tried the usual approach:
interactionList = []
for gene in genes:
    interactions = geneInt.count(gene)
    interactionList.append(interactions)
but of course the code fails, because my set contains tuples of two values while I need to count over the single values within them.
I would argue that you are using the wrong data structure to hold interactions. You can represent interactions as a dictionary keyed by gene name, whose values are a set of all the genes it interacts with.
Let's say you currently have a process that does something like this at some point:
geneInt = set()
...
geneInt.add((gene1, gene2))
Change it to
geneInt = collections.defaultdict(set)
...
geneInt[gene1].add(gene2)
If the interactions are symmetrical, add a line
geneInt[gene2].add(gene1)
Now, to count the number of interactions, you can do something like
intCounts = {gene: len(ints) for gene, ints in geneInt.items()}
Counting interactions directly from your original set is simple if the interactions are one-way as well:
intCounts = dict.fromkeys(genes, 0)
for gene, _ in geneInt:
    intCounts[gene] += 1
If each interaction is two-way, there are three possibilities:
Both interactions are represented in the set: the above loop will work.
Only one interaction of a pair is represented: change the loop to
for gene1, gene2 in geneInt:
    intCounts[gene1] += 1
    if gene1 != gene2:
        intCounts[gene2] += 1
Some reverse interactions are represented, some are not. In this case, transform geneInt into a dictionary of sets as shown at the beginning; a sketch of that transformation follows below.
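As a minimal, self-contained sketch of that transformation (made-up gene names, symmetric interactions assumed):
import collections

genes = ['geneName1', 'geneName2', 'geneName3']
geneInt = {('geneName1', 'geneName2'), ('geneName1', 'geneName3')}

neighbours = collections.defaultdict(set)
for gene1, gene2 in geneInt:
    neighbours[gene1].add(gene2)
    neighbours[gene2].add(gene1)  # drop this line if interactions are one-way

intCounts = {gene: len(neighbours[gene]) for gene in genes}
print(intCounts)  # {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}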
Try something like this,
interactions = {}
for gene in genes:
    interactions_count = 0
    for tup in geneInt:
        interactions_count += tup.count(gene)
    interactions[gene] = interactions_count
Use a dictionary, and keep incrementing the value for every gene you see in each tuple in the set geneInt.
interactions_counter = dict()
for interaction in geneInt:
    for gene in interaction:
        interactions_counter[gene] = interactions_counter.get(gene, 0) + 1
The dict.get(key, default) method returns the value at the given key, or the specified default if the key doesn't exist.
For the set geneInt={('geneName1', 'geneName2'), ('geneName1', 'geneName3')}, we get:
interactions_counter = {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}
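Equivalently, as a small variant (not in the answer above), collections.Counter does the same accumulation for you:
from collections import Counter

interactions_counter = Counter(
    gene for interaction in geneInt for gene in interaction
)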
I am pretty new to optimization in general and pyomo in particular, so I apologize in advance for any rookie mistakes.
I have defined a simple unit commitment exercise (example 3.1 from [1]) using [2] as starting point. I got the correct result and my code runs, but I have a few questions regarding how to access stuff.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import sys
import os.path
import pyomo.environ as pyo
import pyomo.gdp as gdp  # necessary if you use booleans to select active and inactive units

def bounds_rule(m, n, param='Cap_MW'):
    # m because it passes the model
    # n because it needs a variable from each set, in this case there was only m.N
    return (0, Gen[n][param])  # returns lower and upper bounds.

def unit_commitment():
    m = pyo.ConcreteModel()
    m.dual = pyo.Suffix(direction=pyo.Suffix.IMPORT_EXPORT)
    N = Gen.keys()
    m.N = pyo.Set(initialize=N)
    m.Pgen = pyo.Var(m.N, bounds=bounds_rule)  # amount of generation
    m.Rgen = pyo.Var(m.N, bounds=bounds_rule)  # amount of reserve
    # m.OnOff = pyo.Var(m.N, domain=pyo.Binary)  # boolean on/off marker

    # objective
    m.cost = pyo.Objective(expr=sum(m.Pgen[n]*Gen[n]['energy_$MWh'] + m.Rgen[n]*Gen[n]['res_$MW'] for n in m.N), sense=pyo.minimize)

    # demand
    m.demandP = pyo.Constraint(rule=lambda m: sum(m.Pgen[n] for n in N) == Demand['ener_MWh'])
    m.demandR = pyo.Constraint(rule=lambda m: sum(m.Rgen[n] for n in N) == Demand['res_MW'])

    # machine production limits
    # m.lb = pyo.Constraint(m.N, rule=lambda m, n: Gen[n]['Cap_min']*m.OnOff[n] <= m.Pgen[n]+m.Rgen[n])
    # m.ub = pyo.Constraint(m.N, rule=lambda m, n: Gen[n]['Cap_MW']*m.OnOff[n] >= m.Pgen[n]+m.Rgen[n])
    m.lb = pyo.Constraint(m.N, rule=lambda m, n: Gen[n]['Cap_min'] <= m.Pgen[n]+m.Rgen[n])
    m.ub = pyo.Constraint(m.N, rule=lambda m, n: Gen[n]['Cap_MW'] >= m.Pgen[n]+m.Rgen[n])

    m.rc = pyo.Suffix(direction=pyo.Suffix.IMPORT)
    return m

Gen = {
    'GenA': {'Cap_MW': 100, 'energy_$MWh': 10, 'res_$MW': 0, 'Cap_min': 0},
    'GenB': {'Cap_MW': 100, 'energy_$MWh': 30, 'res_$MW': 25, 'Cap_min': 0},
}  # starting data

Demand = {
    'ener_MWh': 130, 'res_MW': 20,
}  # starting data

m = unit_commitment()
pyo.SolverFactory('glpk').solve(m).write()
m.display()

df = pd.DataFrame.from_dict([m.Pgen.extract_values(), m.Rgen.extract_values()]).T.rename(columns={0: "P", 1: "R"})
print(df)
print("Cost Function result: " + str(m.cost.expr()) + "$.")

print(m.rc.display())
print(m.dual.display())
print(m.dual[m.demandR])

da = {'duals': m.dual[m.demandP],
      'uslack': m.demandP.uslack(),
      'lslack': m.demandP.lslack(),
      }
db = {'duals': m.dual[m.demandR],
      'uslack': m.demandR.uslack(),
      'lslack': m.demandR.lslack(),
      }
duals = pd.DataFrame.from_dict([da, db]).T.rename(columns={0: "demandP", 1: "demandR"})
print(duals)
Here come my questions.
1 - Duals/shadow prices: by definition the shadow prices are the dual variables of the constraints (m.demandP and m.demandR). Is there a way to access these values and put them into a DataFrame without the clumsy thing I did, namely defining da and db manually and then building the DataFrame from the two joined dictionaries? I would like something cleaner, like the df that holds the P and R results for each generator in the system.
2 - Usually, in the unit commitment problem, binary variables are used to mark or select active and inactive units, hence the m.OnOff variable (commented line). From what I found in [3], duals don't exist in models containing binary variables, so I rewrote the problem without the binaries. This is not a problem in this simplistic exercise in which all units run, but it is for larger ones: I need to let the optimization decide which units will and won't run, and I still need the shadow prices. Is there a way to obtain the shadow prices/duals in a problem containing binary variables?
I left the constraint definitions based on binary variables in the code (commented out) in case someone finds them useful.
Note: the code also runs with the binary variables and gets the correct result; however, I couldn't figure out how to get the shadow prices, hence my question.
[1] Morales, J. M., Conejo, A. J., Madsen, H., Pinson, P., & Zugno, M. (2013). Integrating renewables in electricity markets: operational problems (Vol. 205). Springer Science & Business Media.
[2] https://jckantor.github.io/ND-Pyomo-Cookbook/04.06-Unit-Commitment.html
[3] Dual Variable Returns Nothing in Pyomo
To answer 1, you can dynamically get the constraint objects from your model using model.component_objects(pyo.Constraint), which returns an iterator over your constraints and keeps you from having to hard-code the constraint names. It gets trickier for indexed constraints, because you have to do an extra step to get the slacks for each index, not just the constraint object. For the duals, you can iterate over the keys attribute to retrieve those values.
duals_dict = {str(key): m.dual[key] for key in m.dual.keys()}

u_slack_dict = {
    # uslacks for non-indexed constraints
    **{str(con): con.uslack() for con in m.component_objects(pyo.Constraint)
       if not con.is_indexed()},
    # indexed constraint uslack
    # loop through the indexed constraints
    # get all the indices then retrieve the slacks for each index of constraint
    **{k: v for con in m.component_objects(pyo.Constraint) if con.is_indexed()
       for k, v in {'{}[{}]'.format(str(con), key): con[key].uslack()
                    for key in con.keys()}.items()}
}

l_slack_dict = {
    # lslacks for non-indexed constraints
    **{str(con): con.lslack() for con in m.component_objects(pyo.Constraint)
       if not con.is_indexed()},
    # indexed constraint lslack
    # loop through the indexed constraints
    # get all the indices then retrieve the slacks for each index of constraint
    **{k: v for con in m.component_objects(pyo.Constraint) if con.is_indexed()
       for k, v in {'{}[{}]'.format(str(con), key): con[key].lslack()
                    for key in con.keys()}.items()}
}

# combine into a single df
df = pd.concat([pd.Series(d, name=name)
                for name, d in {'duals': duals_dict,
                                'uslack': u_slack_dict,
                                'lslack': l_slack_dict}.items()],
               axis='columns')
Regarding 2, I agree with @Erwin's comment about solving with the binary variables to get the optimal solution, then removing the binary restriction but fixing the variables to their optimal values in order to get some dual variable values.
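A rough sketch of that approach (untested; it assumes m.OnOff is declared as in the commented-out lines above, and the transformation name may differ between Pyomo versions):
solver = pyo.SolverFactory('glpk')
solver.solve(m)  # MIP solve with the binaries active

# pin the commitment decisions at their optimal values
for n in m.N:
    m.OnOff[n].fix(round(pyo.value(m.OnOff[n])))

# relax integrality so the re-solve is a plain LP with the units fixed
pyo.TransformationFactory('core.relax_integer_vars').apply_to(m)

solver.solve(m)  # LP re-solve; the dual suffix is now populated
print(m.dual[m.demandP], m.dual[m.demandR])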
I'm writing code to analyze a (8477960, 1) column vector. I am not sure whether the while loops in my code are running infinitely, or whether the way I've written things is just really slow.
This is a section of my code up to the first while loop, which I cannot get to run to completion.
import numpy as np
import pandas as pd

data = pd.read_csv(r'C:\Users\willo\Desktop\TF_60nm_2_2.txt')

def recursive_low_pass(rawsignal, startcoeff, endcoeff, filtercoeff):
    # The current signal length
    ni = len(rawsignal)  # signal size
    rougheventlocations = np.zeros(shape=(100000, 3))

    # The algorithm parameters
    # filter coefficient
    a = filtercoeff
    raw = np.array(rawsignal).astype(np.float)
    # thresholds
    s = startcoeff
    e = endcoeff  # for event start and end thresholds

    # The recursive algorithm
    # loop init
    ml = np.zeros(ni)
    vl = np.zeros(ni)
    s = np.zeros(ni)
    ml[0] = np.mean(raw)  # local mean init
    vl[0] = np.var(raw)   # local variance init
    i = 0                 # sample counter
    numberofevents = 0    # number of detected events

    # main loop
    while i < (ni - 1):
        i = i + 1
        # local mean low pass filtering
        ml[i] = a * ml[i - 1] + (1 - a) * raw[i]
        # local variance low pass filtering
        vl[i] = a * vl[i - 1] + (1 - a) * np.power([raw[i] - ml[i]], 2)
        # local threshold to detect event start
        sl = ml[i] - s * np.sqrt(vl[i])
I'm not getting any error messages, but I've let the program run for more than 10 minutes without any results, so I assume I'm doing something incorrectly.
You should try to vectorize this process rather than accessing and processing individual indices (otherwise, why use numpy at all?).
The other thing is that you seem to be doing unnecessary work (unless we're not seeing the whole function).
the line:
sl = ml[i] - s * np.sqrt(vl[i])
assigns the variable sl, which you're not using inside the loop (or anywhere else). This assignment performs a whole-vector multiplication by s, which is all zeros. If you do need the sl variable, you should calculate it outside of the loop using the last encountered values of ml[i] and vl[i], which you can store in temporary variables instead of recomputing on every iteration.
If ni is in the millions, this unnecessary vector multiplication (of millions of zeros) is going to be very costly.
You probably didn't mean to override the value of s = startcoeff with s = np.zeros(ni) in the first place.
In order to vectorize these calculations you will need to use NumPy's accumulate machinery (a ufunc's accumulate method) with some customized functions.
The non-numpy equivalent would be as follows (using itertools instead):
from itertools import accumulate

ml = [np.mean(raw)] + [0] * (ni - 1)
mlSums = accumulate(zip(ml, raw), lambda r, d: (a * r[0] + (1 - a) * d[1], 0))
ml = [v for v, _ in mlSums]

vl = [np.var(raw)] + [0] * (ni - 1)
vlSums = accumulate(zip(vl, raw, ml), lambda r, d: (a * r[0] + (1 - a) * (d[1] - d[2]) ** 2, 0, 0))
vl = [v for v, _, _ in vlSums]
In each case, the ml / vl vectors are initialized with the base value at index zero and the rest filled with zeroes.
The accumulate(zip(... function calls go through the array and call the lambda function with the current sum in r and the paired elements in d. For the ml calculation, this corresponds to r = (ml[i-1],_) and d = (0,raw[i]).
Because accumulate outputs the same data type as it is given as input (which here are zipped tuples), the actual result is only the first value of each tuple in the mlSums/vlSums lists.
This took 9.7 seconds to process for 8,477,960 items in the lists.
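A NumPy-flavoured sketch of the same idea (assuming raw and a are defined as in the question): np.frompyfunc turns the recursive update into a ufunc whose accumulate method applies it along the array. Note this still calls a Python-level function per element, so it mainly clarifies the pattern rather than making it dramatically faster.
import numpy as np

# recursive update y[i] = a*y[i-1] + (1-a)*x[i] as a 2-in, 1-out ufunc
ema = np.frompyfunc(lambda prev, x: a * prev + (1 - a) * x, 2, 1)

# seed index 0 with the mean / variance, as in the original loop
ml_in = np.concatenate(([np.mean(raw)], raw[1:]))
ml = ema.accumulate(ml_in, dtype=object).astype(float)

vl_in = np.concatenate(([np.var(raw)], (raw[1:] - ml[1:]) ** 2))
vl = ema.accumulate(vl_in, dtype=object).astype(float)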
I need some help getting my brain around designing an (efficient) Markov chain in Spark (via Python). I've written it as best as I could, but the code I came up with doesn't scale. Basically, for the various map stages I wrote custom functions, and they work fine for sequences of a couple thousand, but when we get into the 20,000+ range (and I've got some up to 800k) things slow to a crawl.
For those of you not familiar with Markov models, this is the gist of it.
This is my data.. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point, where I return a dictionary (for each row of data) that I then serialize and store in an in-memory database.
{500:
    {'HNLLNH': 0.333,
     'LNHMLH': 0.333,
     'MLHHML': 0.333,
     'LNHHNL': 0.000,
     etc..
    }
}
So in essence, each sequence is combined with the next (HNL,LNH become 'HNLLNH'), then for all possible transitions (combinations of sequences) we count their occurrence and then divide by the total number of transitions (3 in this case) and get their frequency of occurrence.
There were 3 transitions above, and one of those was HNLLNH.. So for HNLLNH, 1/3 = 0.333
As a side note, and I'm not sure if it's relevant, but the values for each position in a sequence are limited: 1st position (H/M/L), 2nd position (M/L), 3rd position (H/M/L).
What my code had previously done was to collect() the rdd, and map it a couple times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc.. so I ended up with something like this..
[HNLLNH],[LNHMLH],[MHLHML], etc..
Then the next function created a dictionary out of that list, using each list item as a key, counting the total occurrences of that key in the full list, and dividing by len(list) to get the frequency. I then wrapped that dictionary in another dictionary along with its ID number (resulting in the 2nd code block, up above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind, this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows of data varying between lengths of 500-800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code as follows; it probably makes sense to start at the bottom. Please keep in mind I'm learning this (Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great.
def f(x):
    # Custom RDD map function
    # Combines two separate transactions
    # into a single transition state
    cust_id = x[0]
    trans = ','.join(x[1])
    y = trans.split(",")
    s = ''
    for i in range(len(y) - 1):
        s = s + str(y[i] + str(y[i + 1])) + ","
    return str(cust_id + ',' + s[:-1])

def g(x):
    # Custom RDD map function
    # Calculates the transition state probabilities
    # by adding up state-transition occurrences
    # and dividing by total transitions
    cust_id = str(x.split(",")[0])
    trans = x.split(",")[1:]
    temp_list = []
    middle = int((len(trans[0]) + 1) / 2)
    for i in trans:
        temp_list.append((''.join(i)[:middle], ''.join(i)[middle:]))
    state_trans = {}
    for i in temp_list:
        state_trans[i] = temp_list.count(i) / (len(temp_list))
    my_dict = {}
    my_dict[cust_id] = state_trans
    return my_dict

def gen_tsm_dict_spark(lines):
    # Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
    # Returns RDD of dict with CUST_ID and tsm per customer
    # i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}
    # creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
    cust_trans = lines.map(lambda s: (s.split(",")[0], s.split(",")[1:]))
    with_seq = cust_trans.map(f)
    full_tsm_dict = with_seq.map(g)
    return full_tsm_dict

def main():
    result = gen_tsm_dict_spark(my_rdd)
    # Insert into DB
    for x in result.collect():
        for k, v in x.iteritems():
            db_insert(k, v)
You can try something like below. It depends heavily on toolz, but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.
from __future__ import division
from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge

# Generate all possible transitions
defaults = sc.broadcast(dict(map(
    lambda x: ("".join(concat(x)), 0.0),
    product(product("HNL", "NL", "HNL"), repeat=2))))

rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])

def process(line):
    """
    >>> process("000, HHH, LLL, NNN")
    ('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
    """
    bits = line.split(", ")
    transactions = bits[1:]
    n = len(transactions) - 1
    frequencies = pipe(
        sliding_window(2, transactions),  # Get all transitions
        map(lambda p: "".join(p)),        # Join strings
        Counter,                          # Count
        lambda cnt: {k: v / n for (k, v) in cnt.items()}  # Get frequencies
    )
    return bits[0], frequencies

def store_partition(iter):
    for (k, v) in iter:
        db_insert(k, merge([defaults.value, v]))

rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions, I would recommend using a sparse representation and ignoring zeros. Moreover, you can replace the dictionaries with sparse vectors to reduce the memory footprint.
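As a rough sketch of that suggestion (untested; it reuses the broadcast defaults and process from above, and on older Spark you would import SparseVector from pyspark.mllib.linalg instead of pyspark.ml.linalg), you can assign every possible transition a fixed index once and keep only the non-zero frequencies per customer:
from pyspark.ml.linalg import SparseVector

# fixed index for every possible transition, reusing the broadcast defaults
transition_index = {t: i for i, t in enumerate(sorted(defaults.value))}
n_transitions = len(transition_index)

def to_sparse(record):
    cust_id, frequencies = record  # output of process() above
    pairs = sorted((transition_index[k], v) for k, v in frequencies.items())
    return cust_id, SparseVector(n_transitions,
                                 [i for i, _ in pairs],
                                 [v for _, v in pairs])

sparse_rdd = rdd.map(process).map(to_sparse)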
You can achieve this result using pure PySpark; I did it that way. To create the frequencies, let's say you have already got the input RDD into this shape:
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and you want to get frequencies of transitions like (HNL, LNH), (LNH, MLH), ...:
inputRDD.map(lambda (k, list): get_frequencies(list)).flatMap(lambda x: x) \
    .reduceByKey(lambda v1, v2: v1 + v2)
def get_frequencies(states_list):
    """
    :param states_list: A list of customer states.
    :return: State frequencies list.
    """
    rest = []
    tuples_list = []
    for idx in range(0, len(states_list)):
        if idx + 1 < len(states_list):
            tuples_list.append((states_list[idx], states_list[idx + 1]))
    unique = set(tuples_list)
    for value in unique:
        rest.append((value, tuples_list.count(value)))
    return rest
and you will get results like:
((HNL, LNH), 98), ((LNH, MLH), 458), ...
After this you may convert the result RDDs into DataFrames, or you can insert directly into the DB using the RDD's partition-level operations such as mapPartitions, as sketched below.
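A minimal sketch of those two follow-ups (untested; it assumes an active SparkSession so that rdd.toDF works, reuses get_frequencies from above, uses foreachPartition, the action counterpart of mapPartitions, and a db_insert helper like the one in the question):
freq_rdd = inputRDD.map(lambda kv: get_frequencies(kv[1])) \
    .flatMap(lambda x: x) \
    .reduceByKey(lambda v1, v2: v1 + v2)

# option 1: convert the ((state1, state2), count) pairs to a DataFrame
df = freq_rdd.map(lambda kv: (kv[0][0], kv[0][1], kv[1])) \
    .toDF(["from_state", "to_state", "count"])

# option 2: insert straight into the DB, one partition at a time
def insert_partition(rows):
    for (pair, count) in rows:
        db_insert(pair, count)

freq_rdd.foreachPartition(insert_partition)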