I am trying to solve the following problem using pyspark.
I have a file on HDFS that is a dump of a lookup table, in this format:
key1, value1
key2, value2
...
I want to load this into a Python dictionary in PySpark and use it for some other purpose. So I tried to do:
table = {}

def populateDict(line):
    (k, v) = line.split(",", 1)
    table[k] = v

kvfile = sc.textFile("pathtofile")
kvfile.foreach(populateDict)
I found that the table variable is not modified. So, is there a way to create a large in-memory hashtable in Spark?
foreach is a distributed computation, so you can't expect it to modify a data structure that is only visible in the driver. What you want is:
kv.map(line => line.split(",", 2) match {
  case Array(k, v) => (k, v)
  case _           => ("", "")
}).collectAsMap()
This is in Scala, but you get the idea; the important function is collectAsMap(), which returns a map to the driver.
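For completeness, a rough PySpark equivalent (just a sketch, assuming the comma-separated format from the question; the space after each comma stays part of the value):

kvfile = sc.textFile("pathtofile")
pairs = kvfile.map(lambda line: line.split(",", 1)) \
              .filter(lambda kv: len(kv) == 2)   # skip malformed lines
table = pairs.collectAsMap()                     # a plain Python dict on the driver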
If your data is very large, you can use a PairRDD as a map. First map to pairs:
kv.map(line => line.split(",", 2) match {
  case Array(k, v) => (k, v)
  case _           => ("", "")
})
Then you can access it with rdd.lookup("key"), which returns a sequence of values associated with the key, though this definitely will not be as efficient as other distributed KV stores, as Spark isn't really built for that.
For efficiency, see: sortByKey() and lookup()
lookup(key):
Return the list of values in the RDD for key key. This operation is done efficiently if the RDD has a known partitioner by only searching the partition that the key maps to.
The RDD will be re-partitioned by sortByKey() (see: OrderedRDD) and efficiently searched during lookup() calls. In code, something like,
kvfile = sc.textFile("pathtofile")
sorted_kv = kvfile.map(lambda x: tuple(x.split(",", 1))).sortByKey()
sorted_kv.lookup('key1')
will do the trick, keeping the data as an RDD while still allowing efficient lookups.
Related
I'm dealing with some serialized data fetched from an SQL Server database that looks like this:
('|AFoo|BBaar|C61|DFoo Baar|E200060|F200523|G200240|', )
Any idea which format this is? And is there any Python package that can deserialize it?
What you show is a tuple that contains one value: a string. You can use str.split to construct a list of the string's component parts, i.e., AFoo, BBaar, etc.
t = ('|AFoo|BBaar|C61|DFoo Baar|E200060|F200523|G200240|', )
for e in t:
    values = [v for v in e.split('|') if v]
    print(values)
Output:
['AFoo', 'BBaar', 'C61', 'DFoo Baar', 'E200060', 'F200523', 'G200240']
Note:
The for loop is used as a generic approach that allows for multiple strings in the tuple. For the data fragment shown in the question, this isn't actually necessary.
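If the leading letter of each field is a field code - which is only a guess about this home-grown, pipe-delimited format - you could go a step further and build a dict keyed by that code:

t = ('|AFoo|BBaar|C61|DFoo Baar|E200060|F200523|G200240|', )
record = {v[0]: v[1:] for v in t[0].split('|') if v}
print(record)
# {'A': 'Foo', 'B': 'Baar', 'C': '61', 'D': 'Foo Baar', 'E': '200060', 'F': '200523', 'G': '200240'}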
I am trying to convert a JSON file into a dictionary and apply key/value pairs, so I can then use GroupByKey() to basically deduplicate the key/value pairs.
This is the original content of the file:
{"tax_pd":"200003","ein":"720378282"}
{"tax_pd":"200012","ein":"274027765"}
{"tax_pd":"200012","ein":"042746989"}
{"tax_pd":"200012","ein":"205993971"}
I have formatted it like so:
(u'201208', u'010620100')
(u'201208', u'860785769')
(u'201208', u'371650138')
(u'201208', u'237253410')
I want to turn these into key/value pairs so I can apply GroupByKey in my Dataflow pipeline. I believe I need to turn it into a dictionary first?
I'm new to Python and the Google Cloud applications, and some help would be great!
EDIT: Code snippets
with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'ReadInputText' >> beam.io.ReadFromText(known_args.input)
     | 'YieldWords' >> beam.ParDo(ExtractWordsFn())
     # | 'GroupByKey' >> beam.GroupByKey()
     | 'WriteInputText' >> beam.io.WriteToText(known_args.output))
class ExtractWordsFn(beam.DoFn):
    def process(self, element):
        words = re.findall(r'[0-9]+', element)
        yield tuple(words)
A quick pure-Python solution would be:
import json
with open('path/to/my/file.json', 'rb') as fh:
    lines = [json.loads(l) for l in fh.readlines()]
# [{'tax_pd': '200003', 'ein': '720378282'}, {'tax_pd': '200012', 'ein': '274027765'}, {'tax_pd': '200012', 'ein': '042746989'}, {'tax_pd': '200012', 'ein': '205993971'}]
Looking at your data, tax_pd and ein don't give you unique keys for a simple key:value mapping. Assuming there will be collisions, you could do the following:
myresults = {}
for line in lines:
    # I'm assuming we want to use tax_pd as the key, and ein as the value, but this can be extended to other keys
    # This will return None if the tax_pd is not already found
    if not myresults.get(line.get('tax_pd')):
        myresults[line.get('tax_pd')] = [line.get('ein')]
    else:
        myresults[line.get('tax_pd')] = list(set([line.get('ein'), *myresults[line.get('tax_pd')]]))
#results
#{'200003': ['720378282'], '200012': ['205993971', '042746989', '274027765']}
This way you have unique keys, with lists of corresponding unique ein values. Not completely sure if that's what you're going for or not. set will automatically dedup the values, and wrapping it in list converts it back to a list.
You can then lookup by the tax_id explicitly:
myresults.get('200012')
# ['205993971', '042746989', '274027765']
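As an aside, the collision handling can be written a bit more compactly with a defaultdict of sets (same idea, just a sketch of an alternative; the values end up as sets rather than lists):

from collections import defaultdict

myresults = defaultdict(set)
for line in lines:
    myresults[line['tax_pd']].add(line['ein'])
# defaultdict(<class 'set'>, {'200003': {'720378282'}, '200012': {'274027765', '042746989', '205993971'}})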
EDIT: To read from Cloud Storage, the code snippet here, translated to be a bit easier to use:
with gcs.open(filename) as fh:
    lines = fh.read().split('\n')
You can set up your gcs object by following the API docs.
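And if you want to stay inside the Dataflow pipeline rather than using pure Python, here is a rough sketch of a DoFn that yields (tax_pd, ein) tuples directly, which is all GroupByKey needs; the class and step names are just illustrative:

import json
import apache_beam as beam

class ExtractKVFn(beam.DoFn):
    def process(self, element):
        record = json.loads(element)                # each input line is one JSON object
        yield (record['tax_pd'], record['ein'])     # a 2-tuple is Beam's key/value pair

with beam.Pipeline(options=pipeline_options) as p:
    (p
     | 'ReadInputText' >> beam.io.ReadFromText(known_args.input)
     | 'ExtractKV' >> beam.ParDo(ExtractKVFn())
     | 'GroupByKey' >> beam.GroupByKey()            # e.g. ('200012', ['274027765', '042746989', ...])
     | 'WriteOutputText' >> beam.io.WriteToText(known_args.output))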
I have a technical dictionary that I am using to correct various spellings of technical terms.
How can I use this structure (or restructure the below code to work) in order to return the key for any alternate spelling?
For example, if someone has written "craniem" I wish to return "cranium". I've tried a number of different constructions, including the one below, and cannot quite get it to work.
def techDict():
    myDict = {
        'cranium' : ['cranum','crenium','creniam','craniem'],
        'coccyx' : ['coscyx','cossyx','koccyx','kosicks'],
        '1814A' : ['Aero1814','A1814','1814'],
        'SodaAsh' : ['sodaash','na2co3', 'soda', 'washingsoda','sodacrystals']
    }
    return myDict
techDict = techDict()
correctedSpelling = next(val for key, val in techDict.iteritems() if val=='1814')
print(correctedSpelling)
Using in instead of == will do the trick:
next(k for k, v in techDict.items() if 'craniem' in v)
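If the misspelling isn't in any of the lists, next() will raise StopIteration; you can pass a default to fall back to, for example the original word (assuming that's the behaviour you want):

word = 'craniem'
corrected = next((k for k, v in techDict.items() if word in v), word)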
Just reverse and flatten your dictionary:
tech_dict = {
'cranium': ['cranum', 'crenium', 'creniam', 'craniem'],
'coccyx': ['coscyx', 'cossyx', 'koccyx', 'kosicks'],
'1814A': ['Aero1814', 'A1814', '1814'],
'SodaAsh': ['sodaash', 'na2co3', 'soda', 'washingsoda', 'sodacrystals'],
}
lookup = {val: key for key, vals in tech_dict.items() for val in vals}
# ^ note dict.iteritems doesn't exist in 3.x
Then you can trivially get:
corrected_spelling = lookup['1814']
This is far more efficient than potentially scanning through every list for every key in the dictionary to find your search term.
Also note: 1. the code above complies with the official style guide (PEP 8); and 2. I've removed the techDict function entirely - it was pointless to write a function just to create a dictionary, especially as you immediately shadowed the function with the dictionary it returned, so you couldn't even call it again.
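One small usage note: with the flattened lookup dict you can also fall back to the original word when a spelling isn't known, e.g. (assuming that's the behaviour you want):

def correct(word):
    return lookup.get(word, word)

correct('kosicks')  # 'coccyx'
correct('flange')   # 'flange' (unknown spelling, returned unchanged)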
So I just started with Neo4j, and I'm trying to figure out how I might populate my database. I have a dictionary with words as keys and lists of synonyms as values, and I want to populate Neo4j with it; that seems like it would be an interesting way to learn how to use the database.
An example would be:
'CRUNK' : [u'drunk', u'wasted', u'high', u'crunked', u'crazy', u'hammered', u'alcohol', u'hyphy', u'party']
The lists are not going to be of equal length, so converting it to a more typical CSV format is not an option, and I haven't found an explanation of how I could populate the database like I would for a SQL database in a Django app. I want to do something like this:
for each k, v in dictionary:
    add k and add a relationship to each value in v
Does anyone have any tutorials, documentation or answers that could help point me in the right direction?
I think what you want to do, you can do directly in Cypher:
MERGE (w:Word {text:{root}})
UNWIND {words} as word
MERGE (w2:Word {text:word})
MERGE (w2)-[:SYNONYM]->(w)
You would then run this statement with py2neo's (http://py2neo.org) Cypher session API and the two parameters: a single root word and a list of words.
You can also use FOREACH instead of UNWIND:
MERGE (w:Word {text:{root}})
FOREACH (word IN {words} |
MERGE (w2:Word {text:word})
MERGE (w2)-[:SYNONYM]->(w)
)
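A minimal sketch of driving either statement from Python with py2neo - graph.run with parameters is the py2neo v3+ API, and on newer Neo4j servers the {root} placeholders are written $root instead:

from py2neo import Graph

graph = Graph()  # adjust the connection URI / auth for your setup

statement = """
MERGE (w:Word {text:{root}})
FOREACH (word IN {words} |
  MERGE (w2:Word {text:word})
  MERGE (w2)-[:SYNONYM]->(w)
)
"""

for root, words in udSyn.items():   # udSyn: the word -> synonyms dictionary from the question
    graph.run(statement, root=root, words=words)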
FINAL EDIT INCORPORATING MERGE:
This iterates over the dictionary, checks that each value isn't None or 'NOT FOUND', and populates the graph with SYNONYM relationships, using the merge function to ensure there aren't duplicates.
import pickle
from py2neo import Graph, Relationship

graph = Graph(f'http://neo4j:{pw}@localhost:7474/db/data/')
udSyn = pickle.load(open('lookup_ud', 'rb'))
myWords = udSyn.keys()
for key in myWords:
    print(key)
    values = udSyn[key]
    if values in [None, 'NOT FOUND']:
        continue
    node = graph.merge_one('WORD', 'name', key)
    for value in values:
        node2 = graph.merge_one('WORD', 'name', value)
        synOfNode = Relationship(node, 'SYNONYM', node2)
        graph.create(synOfNode)
graph.push()
If I had the following code
buttonParameters = [
("button1", "button1.png"),
("button2", "button2.png"),
("button3", "button3.png"),
("button4", "button4.png"),
("button5", "button5.png"),
]
how would I go about accessing "button1" from buttonParameters?
Also, what type of list structure is this? It was recommended to me, but I'm not sure what its name is, and I would like to search for it to understand it more.
(What you have, by the way, is a list of tuples.) It seems like you are trying to retrieve a value from a mapping, given a key.
For this you are using a List when you should be using a Dictionary:
buttonParameters = {
"button1": "button1.png",
"button2": "button2.png",
"button3": "button3.png",
"button4": "button4.png",
"button5": "button5.png",
}
buttonParameters['button1'] #=> "button1.png"
A solution involving a List traversal to extract a value has linear worst-case performance whilst dictionary retrieval is amortised constant time.
You can convert your list of tuples into the above dictionary with:
buttonParameters = dict(buttonParameters)