Creating a large dictionary in pyspark - python

I am trying to solve the following problem using pyspark.
I have a file on HDFS which is a dump of a lookup table, in this format:
key1, value1
key2, value2
...
I want to load this into a Python dictionary in pyspark and use it for some other purpose. So I tried to do:
table = {}

def populateDict(line):
    (k, v) = line.split(",", 1)
    table[k] = v

kvfile = sc.textFile("pathtofile")
kvfile.foreach(populateDict)
I found that the table variable is not modified. So, is there a way to create a large in-memory hashtable in Spark?

foreach is a distributed computation, so you can't expect it to modify a data structure that is only visible in the driver. What you want is:
kv.map(line => line.split(",") match {
  case Array(k, v) => (k, v)
  case _ => ("", "")
}).collectAsMap()
This is Scala, but you get the idea; the important function is collectAsMap(), which returns a map to the driver.
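In pyspark the same thing is sc.textFile("pathtofile").map(...).collectAsMap(). The parsing half can be checked locally without Spark (a minimal sketch; the sample lines stand in for the HDFS file):

```python
# In pyspark this would be:
#   table = sc.textFile("pathtofile").map(parse_line).collectAsMap()
def parse_line(line):
    k, v = line.split(",", 1)
    return k.strip(), v.strip()

lines = ["key1, value1", "key2, value2"]  # stand-in for the HDFS file
table = dict(parse_line(line) for line in lines)
# table == {'key1': 'value1', 'key2': 'value2'}
```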
If your data is very large, you can use a PairRDD as a map. First map to pairs:
kv.map(line => line.split(",") match {
  case Array(k, v) => (k, v)
  case _ => ("", "")
})
Then you can access it with rdd.lookup("key"), which returns a sequence of values associated with the key, though this definitely will not be as efficient as other distributed KV stores, as Spark isn't really built for that.

For efficiency, see: sortByKey() and lookup()
lookup(key):
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
The RDD will be re-partitioned by sortByKey() (see: OrderedRDD) and efficiently searched during lookup() calls. In code, something like,
kvfile = sc.textFile("pathtofile")
sorted_kv = kvfile.map(lambda x: tuple(x.split(",", 1))).sortByKey()
sorted_kv.lookup('key1')[:10]
will do the trick both as an RDD and efficiently.

Which serialization format is this?

I’m dealing with some serialized data fetched from an SQL Server database that looks like this:
('|AFoo|BBaar|C61|DFoo Baar|E200060|F200523|G200240|', )
Any idea which format this is? And is there any Python package that can deserialize it?
What you show is a tuple that contains one value: a string. You can use str.split to construct a list of the string's component parts, i.e. AFoo, BBaar, etc.
t = ('|AFoo|BBaar|C61|DFoo Baar|E200060|F200523|G200240|', )
for e in t:
    values = [v for v in e.split('|') if v]
    print(values)
Output:
['AFoo', 'BBaar', 'C61', 'DFoo Baar', 'E200060', 'F200523', 'G200240']
Note:
The for loop is used as a generic approach that allows for multiple strings in the tuple. For the data fragment shown in the question, this isn't actually necessary
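This doesn't look like a standard serialization format; it's more likely a bespoke delimited record. If the leading letter of each part is a field tag (an assumption, not something the data confirms), you can split the parts further into a dict:

```python
raw = '|AFoo|BBaar|C61|DFoo Baar|E200060|F200523|G200240|'
parts = [v for v in raw.split('|') if v]
# Treat the first character as a field code and the rest as the value
record = {p[0]: p[1:] for p in parts}
# record == {'A': 'Foo', 'B': 'Baar', 'C': '61', 'D': 'Foo Baar',
#            'E': '200060', 'F': '200523', 'G': '200240'}
```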

python list to dictionary for dataflow

I am trying to convert a JSON file into a dictionary and apply key/value pairs, so I can then use GroupByKey() to basically deduplicate the key/value pairs.
This is the original content of the file:
{"tax_pd":"200003","ein":"720378282"}
{"tax_pd":"200012","ein":"274027765"}
{"tax_pd":"200012","ein":"042746989"}
{"tax_pd":"200012","ein":"205993971"}
I have formatted it like so:
(u'201208', u'010620100')
(u'201208', u'860785769')
(u'201208', u'371650138')
(u'201208', u'237253410')
I want to turn these into key/value pairs, so I can apply GroupByKey in my Dataflow pipeline. I believe I need to turn it into a dictionary first?
I'm new to python and the google cloud applications and some help would be great!
EDIT : Code snippets
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'ReadInputText' >> beam.io.ReadFromText(known_args.input)
     | 'YieldWords' >> beam.ParDo(ExtractWordsFn())
     # | 'GroupByKey' >> beam.GroupByKey()
     | 'WriteInputText' >> beam.io.WriteToText(known_args.output))

class ExtractWordsFn(beam.DoFn):
    def process(self, element):
        words = re.findall(r'[0-9]+', element)
        yield tuple(words)
A quick pure-Python solution would be:
import json

with open('path/to/my/file.json', 'rb') as fh:
    lines = [json.loads(l) for l in fh.readlines()]
# [{'tax_pd': '200003', 'ein': '720378282'}, {'tax_pd': '200012', 'ein': '274027765'}, {'tax_pd': '200012', 'ein': '042746989'}, {'tax_pd': '200012', 'ein': '205993971'}]
Looking at your data, you don't have unique keys to do key:value by tax_pd and ein. Assuming there will be collisions, you could do the following:
myresults = {}
for line in lines:
    # Assuming we want tax_pd as the key and ein as the value; this can be extended to other keys
    # dict.get returns None if the tax_pd has not been seen yet
    if not myresults.get(line.get('tax_pd')):
        myresults[line.get('tax_pd')] = [line.get('ein')]
    else:
        myresults[line.get('tax_pd')] = list(set([line.get('ein'), *myresults[line.get('tax_pd')]]))

# results
# {'200003': ['720378282'], '200012': ['205993971', '042746989', '274027765']}
This way you have unique keys, with lists of corresponding unique ein values. Not completely sure if that's what you're going for or not. set automatically dedups the values, and wrapping it in list converts the result back to a list.
You can then lookup by the tax_id explicitly:
myresults.get('200012')
# ['205993971', '042746989', '274027765']
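The same grouping with deduplication can also be written more compactly with collections.defaultdict and a set (a sketch using sample rows shaped like the question's data, including a duplicate to show the dedup):

```python
from collections import defaultdict

lines = [
    {'tax_pd': '200003', 'ein': '720378282'},
    {'tax_pd': '200012', 'ein': '274027765'},
    {'tax_pd': '200012', 'ein': '274027765'},  # duplicate on purpose
]

grouped = defaultdict(set)
for line in lines:
    grouped[line['tax_pd']].add(line['ein'])  # set membership dedups for free
# grouped == {'200003': {'720378282'}, '200012': {'274027765'}}
```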
EDIT: To read from the cloud storage, the code snippet here translated to be a bit easier to use:
with gcs.open(filename) as fh:
    lines = fh.read().split('\n')
You can set up your gcs object using their API docs.

Return key if variable in associated values array?

I have a technical dictionary that I am using to correct various spellings of technical terms.
How can I use this structure (or restructure the below code to work) in order to return the key for any alternate spelling?
For example, if someone has written "craniem" I wish to return "cranium". I've tried a number of different constructions, including the one below, and cannot quite get it to work.
def techDict():
    myDict = {
        'cranium' : ['cranum','crenium','creniam','craniem'],
        'coccyx' : ['coscyx','cossyx','koccyx','kosicks'],
        '1814A' : ['Aero1814','A1814','1814'],
        'SodaAsh' : ['sodaash','na2co3', 'soda', 'washingsoda','sodacrystals']
    }
    return myDict

techDict = techDict()
correctedSpelling = next(val for key, val in techDict.iteritems() if val=='1814')
print(correctedSpelling)
Using in instead of == will do the trick:
next(k for k, v in techDict.items() if 'craniem' in v)
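Note that next raises StopIteration when no value list contains the word; passing a default avoids that:

```python
techDict = {
    'cranium': ['cranum', 'crenium', 'creniam', 'craniem'],
    'coccyx': ['coscyx', 'cossyx', 'koccyx', 'kosicks'],
}

# With a default of None, a missing word returns None instead of raising
match = next((k for k, v in techDict.items() if 'craniem' in v), None)
missing = next((k for k, v in techDict.items() if 'zzz' in v), None)
# match == 'cranium', missing is None
```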
Just reverse and flatten your dictionary:
tech_dict = {
    'cranium': ['cranum', 'crenium', 'creniam', 'craniem'],
    'coccyx': ['coscyx', 'cossyx', 'koccyx', 'kosicks'],
    '1814A': ['Aero1814', 'A1814', '1814'],
    'SodaAsh': ['sodaash', 'na2co3', 'soda', 'washingsoda', 'sodacrystals'],
}
lookup = {val: key for key, vals in tech_dict.items() for val in vals}
# ^ note dict.iteritems doesn't exist in 3.x
Then you can trivially get:
corrected_spelling = lookup['1814']
This is far more efficient than potentially scanning through every list for every key in the dictionary to find your search term.
Also note: 1. the code now complies with the official style guide; and 2. I've removed the techDict function entirely. It was pointless to write a function just to create a dictionary, especially as you immediately shadowed the function with the dictionary it returned, so you couldn't even call it again.
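Building on the reversed dict, a small helper can also accept canonical spellings and unknown words (a sketch; the pass-through-on-unknown behaviour is an assumption about what the corrector should do):

```python
tech_dict = {
    'cranium': ['cranum', 'crenium', 'creniam', 'craniem'],
    'coccyx': ['coscyx', 'cossyx', 'koccyx', 'kosicks'],
}
lookup = {alt: key for key, alts in tech_dict.items() for alt in alts}

def correct(word):
    if word in tech_dict:          # already the canonical spelling
        return word
    return lookup.get(word, word)  # unknown words pass through unchanged
```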

Populating Neo4j using python Dictionary

So I just started with Neo4j, and I'm trying to figure out how I might populate my database. I have a dictionary with words as keys and lists of synonyms as values, and populating Neo4j with it seems like an interesting way to learn how to use the database.
An example would be:
'CRUNK' : [u'drunk', u'wasted', u'high', u'crunked', u'crazy', u'hammered', u'alcohol', u'hyphy', u'party']
The lists are not going to be of equal length so converting it to a more typical csv format is not an option, and I haven't found an explanation of how I could populate the database like I would for the SQL database in a Django app. I want to do something like this:
for each k, v in dictionary:
    add k and add a relationship to each value in v
Does anyone have any tutorials, documentation or answers that could help point me in the right direction?
I think what you want to do you can do in Cypher directly:
MERGE (w:Word {text:{root}})
UNWIND {words} AS word
MERGE (w2:Word {text:word})
MERGE (w2)-[:SYNONYM]->(w)
You would then run this statement with py2neo's (http://py2neo.org) Cypher API and the two parameters: a single root word and a list of words.
You can also use FOREACH instead of UNWIND:
MERGE (w:Word {text:{root}})
FOREACH (word IN {words} |
MERGE (w2:Word {text:word})
MERGE (w2)-[:SYNONYM]->(w)
)
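Either statement would be run once per dictionary entry. A sketch of preparing the (query, parameters) payloads on the driver side; the exact py2neo call that executes each pair (e.g. graph.run(query, **params)) depends on the py2neo version, so only the preparation is shown:

```python
QUERY = """
MERGE (w:Word {text:{root}})
UNWIND {words} AS word
MERGE (w2:Word {text:word})
MERGE (w2)-[:SYNONYM]->(w)
"""

def to_statements(synonyms):
    # One (query, parameters) pair per root word in the dictionary
    return [(QUERY, {'root': root, 'words': words})
            for root, words in synonyms.items()]

statements = to_statements({'CRUNK': ['drunk', 'wasted', 'high']})
```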
FINAL EDIT INCORPORATING MERGE:
This uses a dictionary, checks to make sure the values aren't NoneType or 'NOT FOUND', and populates the graph with 'SYNONYM' relationships, using the merge functions to ensure there aren't duplicates.
import pickle

from py2neo import Graph
from py2neo import Node, Relationship

graph = Graph(f'http://neo4j:{pw}@localhost:7474/db/data/')

udSyn = pickle.load(open('lookup_ud', 'rb'))
myWords = udSyn.keys()
for key in myWords:
    print(key)
    values = udSyn[key]
    if values in [None, 'NOT FOUND']:
        continue
    node = graph.merge_one('WORD', 'name', key)
    for value in values:
        node2 = graph.merge_one('WORD', 'name', value)
        synOfNode = Relationship(node, 'SYNONYM', node2)
        graph.create(synOfNode)
graph.push()

accessing values from a list in python

If I had the following code
buttonParameters = [
    ("button1", "button1.png"),
    ("button2", "button2.png"),
    ("button3", "button3.png"),
    ("button4", "button4.png"),
    ("button5", "button5.png"),
]
how would I go about accessing "button1" from buttonParameters?
Also, what type of list structure is this? I was recommended using it, but I'm not sure I know what its name is, and would like to search some to understand it more.
It seems like you are trying to retrieve a value from a mapping, given a key. What you have is a list of tuples, but for this purpose you should be using a dictionary:
buttonParameters = {
    "button1": "button1.png",
    "button2": "button2.png",
    "button3": "button3.png",
    "button4": "button4.png",
    "button5": "button5.png",
}

buttonParameters['button1']  #=> "button1.png"
A solution involving a List traversal to extract a value has linear worst-case performance whilst dictionary retrieval is amortised constant time.
You can convert your list of tuples into the above dictionary with:
buttonParameters = dict(buttonParameters)
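To tie the two answers together, a quick demonstration of both access patterns, positional indexing on the list of tuples and keyed retrieval after the dict() conversion:

```python
buttonParameters = [
    ("button1", "button1.png"),
    ("button2", "button2.png"),
]

first_name = buttonParameters[0][0]  # positional access on the list of tuples
lookup = dict(buttonParameters)      # one-time conversion to a dictionary
png = lookup["button1"]              # constant-time retrieval by name
```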
