Populating Neo4j using a Python dictionary

So I just started with Neo4j, and I'm trying to figure out how I might populate my database. I have a dictionary with words as keys and lists of synonyms as values, and loading it into Neo4j seems like an interesting way to learn how to use the database.
An example would be:
'CRUNK' : [u'drunk', u'wasted', u'high', u'crunked', u'crazy', u'hammered', u'alcohol', u'hyphy', u'party']
The lists are not going to be of equal length so converting it to a more typical csv format is not an option, and I haven't found an explanation of how I could populate the database like I would for the SQL database in a Django app. I want to do something like this:
for each k, v in dictionary:
    add k and add relationship to each value in v
Does anyone have any tutorials, documentation or answers that could help point me in the right direction?

I think you can do what you want directly in Cypher:
MERGE (w:Word {text:{root}})
WITH w
UNWIND {words} as word
MERGE (w2:Word {text:word})
MERGE (w2)-[:SYNONYM]->(w)
You would then run this statement through py2neo's (http://py2neo.org) cypher-session API, passing two parameters: a single root word and a list of words.
You can also use FOREACH instead of UNWIND:
MERGE (w:Word {text:{root}})
FOREACH (word IN {words} |
    MERGE (w2:Word {text:word})
    MERGE (w2)-[:SYNONYM]->(w)
)
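To connect this back to the dictionary in the question, one possible way to prepare the two parameters is sketched below. It is only a sketch in plain Python: the function name build_params and the sample data are made up for illustration, and actually sending each parameter set to Neo4j (through py2neo or another driver) is left out.

```python
def build_params(synonyms):
    """Turn {word: [synonyms]} into one parameter dict per root word."""
    params = []
    for root, words in synonyms.items():
        # Skip entries with no usable synonym list.
        if not isinstance(words, list) or not words:
            continue
        params.append({"root": root, "words": words})
    return params


# Hypothetical sample data in the shape described in the question.
sample = {
    "CRUNK": ["drunk", "wasted", "high"],
    "EMPTY": None,
}

param_sets = build_params(sample)
for p in param_sets:
    # Each dict would be passed as the parameters of one statement run.
    print(p["root"], "->", p["words"])
```

Running the statement once per root word keeps each call small; because the lists have different lengths, no CSV-style padding is needed.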

FINAL EDIT INCORPORATING MERGE:
This loops over the dictionary, checks that each value isn't None or 'NOT FOUND', and populates the graph with a 'SYNONYM' relationship, using the merge function to ensure there aren't duplicate nodes.
import pickle
from py2neo import Graph
from py2neo import Node, Relationship
import random

# pw is the Neo4j password and must be defined before this line
graph = Graph(f'http://neo4j:{pw}@localhost:7474/db/data/')
udSyn = pickle.load(open('lookup_ud', 'rb'))
myWords = udSyn.keys()
for key in myWords:
    print(key)
    values = udSyn[key]
    # skip words that have no usable synonym list
    if values in [None, 'NOT FOUND']:
        continue
    node = graph.merge_one('WORD', 'name', key)
    for value in values:
        node2 = graph.merge_one('WORD', 'name', value)
        synOfNode = Relationship(node, 'SYNONYM', node2)
        graph.create(synOfNode)
graph.push()

Related

Python Refactoring - Changing Variable Type and Value Within a Loop

I'm working on automating some Word and PDF documents that need to be updated on a certain cadence.
The way I'm doing this is by using dictionaries that replace variables within the Word documents.
My code works, but because my area is not tech savvy I'm using an Excel file so people can replace the values in that file whenever they need to update the documents.
I was also successful in pulling the dictionary keys and values from Excel, but I'm trying to refactor this code, which is repetitive. Here is an excerpt with 2 of the 7 dictionaries I'm creating:
dic = pd.read_excel('test.xlsx',"AD")
AD = dict(zip(dic.Key,dic.Value))
dic = pd.read_excel('test.xlsx',"RSM")
RSM = dict(zip(dic.Key,dic.Value))
I'm trying to refactor this so I can run it all within a single loop and trying something like this:
import pandas as pd
AD = "AD"
RSM = "RSM"
groups = [AD, RSM]
for item in groups:
    dic = pd.read_excel('test.xlsx',item)
    item = dict(zip(dic.Key,dic.Value))
So I'm basically first using the variable as a string to call the excel tab within the read_excel method and then I want to replace that same variable to become the output dictionary.
When I print item within the loop I do get the correct dictionaries but I'm not able to output a variable that stores each dictionary that the loop creates.
Any help would be appreciated.
Thanks!
You're almost there; you can just use a dictionary of dictionaries:
import pandas as pd
groups = ['AD', 'RSM']
dicts = {}
for item in groups:
    dic = pd.read_excel('test.xlsx', item)
    dicts[item] = dict(zip(dic.Key, dic.Value))
Now you can just access them like this:
print(dicts['AD']['some key'])
The values of a dictionary can be anything, including other dictionaries. Keys of dictionaries can be many things as well, as long as they're hashable, and strings are a common choice of course - and the names of your groups are just that.
Also note that I removed the variables named AD and RSM. You don't really achieve anything by having variables that are named after the string value they are assigned. It only serves to be able to leave off the quotes where you use the values, but it creates an additional indirection that serves no purpose.
If you don't even need the list of groups, but just want groups to be the actual dictionaries:
import pandas as pd
groups = {}
for item in ['AD', 'RSM']:
    dic = pd.read_excel('test.xlsx', item)
    groups[item] = dict(zip(dic.Key, dic.Value))
The problem is that you assign the result to the item variable and not to an entry in the list.
A simple fix would be to use a dictionary instead of a list to save the result, e.g.
import pandas as pd
AD = "AD"
RSM = "RSM"
groups = {AD: None, RSM: None}
for item in groups.keys():
    dic = pd.read_excel('test.xlsx',item)
    groups[item] = dict(zip(dic.Key,dic.Value))
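The point about item can be seen without Excel at all. The sketch below uses made-up sheet contents (the fake_sheets dictionary is a stand-in for pd.read_excel, not part of the original code) to show that rebinding the loop variable leaves the list untouched, while assigning to a dictionary entry keeps the result.

```python
# Stand-in for the Excel tabs; the contents are invented for illustration.
fake_sheets = {
    "AD": [("name", "Alice"), ("region", "North")],
    "RSM": [("name", "Bob"), ("region", "South")],
}

groups = ["AD", "RSM"]

# Rebinding the loop variable only changes what the name `item` points to.
for item in groups:
    item = dict(fake_sheets[item])
print(groups)  # still the original list of strings

# Assigning to a dictionary entry stores each result under its tab name.
results = {}
for item in groups:
    results[item] = dict(fake_sheets[item])
print(results["RSM"]["region"])
```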
My suggestion would be to use an overall dictionary to track your work and also to save the results there. I refactored your code slightly to this:
import pandas as pd
groups = dict.fromkeys(('AD', 'RSM'))  # setup main dict containing dicts
for item in groups:
    dic = pd.read_excel('test.xlsx', item)
    groups[item] = dict(zip(dic.Key, dic.Value))  # store individual dict
There's no need for your global constants that are used only once, so I removed those. I also added some spaces to help your Python code conform with PEP 8, the standard style guide for Python code.
Now you can access each dictionary as you like, for example, groups['AD'].

Updating dictionary within loops

I have a list of dictionaries in which keys are "group_names" and values are gene_lists.
I want to update each dictionary with a new list of genes by looping through a species_list.
Here is my pseudocode:
groups = ["group1", "group2"]
species_list = ["spA", "spB"]
def get_genes(group, sp):
    return gene_list
for sp in species_list:
    for group in groups:
        gene_list[group] = get_genes(group, sp)
        gene_list.update(get_genes(group, sp))
The problem with this code is that new genes are replaced/overwritten by the previous ones instead of being added to the dictionary. My question is where should I put the following line. Although, I'm not sure if this is the only problem.
gene_list.update(get_genes(group,sp))
The data I have looks like this dataframe:
data={"group1":["geneA1", "geneA2"],
"group2":[ "geneB1","geneB2"]}
pd.DataFrame.from_dict(data).T
The data I want to create should look like this:
data={"group1":["geneA1", "geneA2", "geneX"],
"group2":[ "geneB1","geneB2", "geneX"]}
pd.DataFrame.from_dict(data).T
So in this case, "gene_x" refers to the new genes obtained by the get_genes function for each species and finally updated to the existing dictionary.
Any help would be much appreciated!!
You need to append to the list in the dictionary entry, not assign it.
Use setdefault() to provide a default empty list if the dictionary key doesn't exist yet.
for sp in species_list:
    for group in groups:
        gene_list.setdefault(group, []).extend(get_genes(group, sp))
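A self-contained version of this can be run as follows. The body of get_genes is a placeholder invented for the example (the question does not show it); it simply returns one made-up gene name per group and species.

```python
groups = ["group1", "group2"]
species_list = ["spA", "spB"]

# Existing data from the question.
gene_list = {
    "group1": ["geneA1", "geneA2"],
    "group2": ["geneB1", "geneB2"],
}


def get_genes(group, sp):
    """Placeholder: the real lookup is not shown in the question."""
    return ["gene_{}_{}".format(group, sp)]


for sp in species_list:
    for group in groups:
        # extend() adds to the existing list instead of replacing it.
        gene_list.setdefault(group, []).extend(get_genes(group, sp))

print(gene_list["group1"])
```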
From what I understand, you want to append a new gene to the list under each key. To do that:
new_gene = "gene_x"
data = {"group1": ["geneA1", "geneA2"], "group2": ["geneB1", "geneB2"]}
for value in data.values():
    value.append(new_gene)
print(data)
You can also use defaultdict where you can append directly (read the docs for that).
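A minimal sketch of the defaultdict approach, assuming the genes arrive as (group, gene) pairs; the pairs below are invented for illustration.

```python
from collections import defaultdict

# Missing keys get an empty list automatically on first access.
genes_by_group = defaultdict(list)

# Hypothetical (group, gene) pairs, as might come from repeated lookups.
found = [
    ("group1", "geneA1"),
    ("group2", "geneB1"),
    ("group1", "geneX"),
]

for group, gene in found:
    genes_by_group[group].append(gene)

print(dict(genes_by_group))
```

Compared with setdefault(), this moves the "start with an empty list" decision to the place where the dictionary is created, so the loop body is a plain append.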

Create nested python dictionary

I am using the Python code below to extract some values from an Excel spreadsheet and then push them to an HTML page for further processing. I would like to modify the code so that I can add additional values against each task. Any help would be appreciated.
The code below currently prints the following:
{'line items': {'AMS Upgrade': '30667', 'BMS works': '35722'}}
How can I revise the code below so that I can add 2 more values against each task (i.e. AMS Upgrade and BMS works) and get something like this (note that the structure below could be wrong)?
{'line items': {'AMS Upgrade': {'30667','100%', '25799'}},{'BMS works': {'10667','10%', '3572'}} }
Code:
import json
import xlrd

book = xlrd.open_workbook("Example - supporting doc.xls")
first_sheet = book.sheet_by_index(-1)
nested_dict = {}
nested_dict["line items"] = {}
for i in range(21, 175):
    Line_items = first_sheet.row_slice(rowx=i, start_colx=2, end_colx=8)
    # keep rows that have both a task name and a value
    if str(Line_items[0].value) and str(Line_items[1].value):
        if not Line_items[5].value == 0:
            nested_dict["line items"].update({str(Line_items[0].value): str(Line_items[1].value)})
print(nested_dict)
print(json.dumps(nested_dict))
*** as requested see excel extract below
In Python, each key of a dict can only be associated with a single value. However, that single value can be a dict, list, set, etc., that holds many values.
You will need to decide which type to use for the value associated with the 'AMS Upgrade' key if you want it to hold multiple values like '30667','10%', '222'.
Note that what you have written:
{'30667','100%', '25799'}
is a set literal in Python.
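As an illustration of that choice, the sketch below stores a list per task, which keeps the order of the three values and survives json.dumps. A set would do neither: it is unordered, and json.dumps raises TypeError for it. The three values per task are copied from the question; what each one means is not stated there, so they are kept as plain strings, and the rows list is a stand-in for the spreadsheet.

```python
import json

nested_dict = {"line items": {}}

# Values copied from the desired output in the question.
rows = [
    ("AMS Upgrade", "30667", "100%", "25799"),
    ("BMS works", "10667", "10%", "3572"),
]

for task, *values in rows:
    # One key, one value: the value is a list holding all three entries.
    nested_dict["line items"][task] = list(values)

encoded = json.dumps(nested_dict)
print(encoded)

# A set literal cannot be serialized to JSON.
try:
    json.dumps({"AMS Upgrade": {"30667", "100%", "25799"}})
    set_is_serializable = True
except TypeError:
    set_is_serializable = False
print(set_is_serializable)
```

If the values need names rather than positions, an inner dict per task works the same way; the field names would have to come from the spreadsheet's column headers.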

I need to create a social network using Python and Mongodb

I want to create a graph from a Mongodb collection. Nodes of this graph should be inventors of patents and they should be linked by a common id (that represents the patent in common).
Here is the code I wrote in order to print only nodes.
from pymongo import MongoClient
from pymongo import ASCENDING, DESCENDING
import networkx as nx
import matplotlib.pyplot as plt
uri ="mongodb://127.0.0.1:27017/Patent"
client = MongoClient(uri)
righe ={1:'CODINV2', 2:'INCY', 3:'INNAME', 4:'INADDR',5:'INADOTH',6:'INCITY',7:'INCOUNTY',8:'INREGION',9:'INSTATE',10:'INZIP',11:'nuts3',12:'alive',13:'APPLN_ID',14:'PROGR'}
db = client['Patent']
collection2 = db['projects']
collection = db['myprova']
nodi={}
i=0
G=nx.Graph()
k=1 #this parameter represents the fact that an inventor is still alive
db.projects.aggregate([{"$match": {"$and": [{"alive": k}, {"INCY": "IT"}]}}, {"$group": {"_id": "$CODINV2"}}, {"$out": "myprova"}], allowDiskUse=True)
inventor = collection.find()
newList=[]
for inv in inventor:
    newList.append(inv)
print(newList)
for idi in newList:
    nodi[idi] = i
    G.add_node(i)
    i += 1
#print(G.number_of_nodes())
nx.draw(G)
plt.show()
The attribute CODINV2 represents each inventor's id.
When I run this code, the following error appears in the console:
http://i.stack.imgur.com/BC9wd.png
How can I solve this problem? Do you know another way to reach my goal? I'm new to MongoDB and Python.
From the error, I infer that idi is a dictionary. A dictionary cannot be hashed and therefore cannot be used as a key in another dictionary. It seems that your find query is returning a sequence of dictionaries.
You are trying to use a dictionary as a key in another dictionary, and that is not allowed because a dictionary is not hashable. As the Python documentation puts it:
A dictionary’s keys are almost arbitrary values. Values that are not
hashable, that is, values containing lists, dictionaries or other
mutable types (that are compared by value rather than by object
identity) may not be used as keys.
Basically you have
nodi = {}  # which is a dictionary
and the code below tries to store the dictionary idi as a key of nodi:
for idi in newList:
    nodi[idi] = i
Because idi is a dictionary (as shown below), you get the error:
{"uid": xxxxxx} where xxxxxx is a number
If you replace
nodi[idi] = i
with
nodi[i] = idi
then you won't get an error, because i is hashable (like a string, and unlike a list or a dictionary).
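You might then need to change the way you add nodes to G. networkx nodes must be hashable as well, so add a hashable value taken from the document rather than the document itself, something like:
G.add_node(nodi[i]["uid"]) where nodi[i] is nothing but {"uid": xxxxx}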
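Another option, sketched here under an assumption about the data, is to key nodi on a hashable field of each document rather than on the document itself. The aggregation in the question groups on $CODINV2, so each resulting document should carry that value under _id; the sample documents below are invented to match that shape, and no MongoDB connection or networkx graph is involved.

```python
# Invented documents in the assumed shape {"_id": <CODINV2 value>}.
new_list = [{"_id": "INV001"}, {"_id": "INV002"}, {"_id": "INV003"}]

# A dict cannot be a dictionary key because it is not hashable.
try:
    {new_list[0]: 0}
    dict_key_allowed = True
except TypeError:
    dict_key_allowed = False
print(dict_key_allowed)

# Map each inventor id (a string, hence hashable) to a node number.
nodi = {}
for i, doc in enumerate(new_list):
    nodi[doc["_id"]] = i

print(nodi)
```

With this mapping, the integer stored for each inventor id can be used as the node in the graph, as the original loop already does with G.add_node(i).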

How can I filter by key, or keys, a query in Python for Google App Engine?

I have a query and I can apply filters to it without any problem. This works fine:
query.filter('foo =', 'bar')
But what if I want to filter my query by key or a list of keys?
I have them as a Key() property or as strings, and trying something like this didn't work:
query.filter('key =', 'some_key') #no success
query.filter('key IN', ['key1', 'key2']) #no success
Whilst it's possible to filter on key (see @dplouffe's answer), it's not a good idea. 'IN' clauses execute one query for each item in the clause, so you end up doing as many queries as there are keys, which is a particularly inefficient way to achieve your goal.
Instead, use a batch fetch operation, as @Luke documents, then filter any elements you don't want out of the list in your code.
You can filter queries by doing a GQL Query like this:
result = db.GqlQuery('select * from Model where __key__ IN :1', [db.Key.from_path('Model', 'Key1'), db.Key.from_path('Model', 'Key2')]).fetch(2)
or
result = Model.get([db.Key.from_path('Model', 'Key1'), db.Key.from_path('Model', 'Key2')])
You cannot filter on a Key. Oops, I was wrong about that. You can filter on a key and other properties at the same time if you have an index set up to handle it. It would look like this:
key = db.Key.from_path('MyModel', 'keyname')
MyModel.all().filter("__key__ =", key).filter('foo = ', 'bar')
You can also look up a number of models by their keys, key IDs, or key names with the get family of methods.
# if you have the key already, or can construct it from its path
models = MyModel.get(Key.from_path(...), ...)
# if you have keys with names
models = MyModel.get_by_key_name('asdf', 'xyz', ...)
# if you have keys with IDs
models = MyModel.get_by_id(123, 456, ...)
You can fetch many entities this way. I don't know the exact limit. If any of the keys doesn't exist, you'll get a None in the list for that entity.
If you need to filter on some property as well as the key, you'll have to do that in two steps. Either fetch by the keys and check for the property, or query on the property and validate the keys.
Here's an example of filtering after fetching. Note that you don't use the Query class's filter method. Instead just filter the list.
models = MyModel.get_by_key_name('asdf', ...)
filtered = itertools.ifilter(lambda x: x.foo == 'bar', models)
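The same idea in a form that runs without App Engine: the Entity class and the fetched list below are stand-ins invented for the example, with a None entry playing the role of a key that was not found.

```python
class Entity(object):
    """Stand-in for a datastore model instance (hypothetical)."""

    def __init__(self, name, foo):
        self.name = name
        self.foo = foo


# What a batch get might return: one entry per key, None for a missing one.
fetched = [Entity("asdf", "bar"), None, Entity("xyz", "baz")]

# Drop missing entities first, then keep those whose property matches.
filtered = [e for e in fetched if e is not None and e.foo == "bar"]

print([e.name for e in filtered])
```

Checking for None before reading the property matters here, because a batch get returns None for any key that does not exist.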
Have a look at: https://developers.google.com/appengine/docs/python/ndb/entities?hl=de#multiple
list_of_entities = ndb.get_multi(list_of_keys)
