Python: Finding a path between nodes within groups with nested dictionaries

I have a dataset containing historical transaction records for real estate properties. Each property has an ID number. To check if the data is complete, for each property I am identifying a "transaction chain": I take the original buyer, and go through all intermediate buyer/seller combinations until I reach the final buyer of record. So for data that looks like this:
Buyer|Seller|propertyID
Bob|Jane|23
Tim|Bob|23
Karl|Tim|23
The transaction chain will look like: [Jane, Bob, Tim, Karl]
I am using three datasets to do this. The first contains the names of only the first buyer of each property. The second contains the names of all intermediate buyers and sellers, and the third contains only the final buyer for each property. I use three datasets so I can follow the process given in vikramls' answer here.
In my version of the graph dictionary, each seller is a key to its corresponding buyer, and the oft-cited find_path function finds the path from the first seller to the last buyer. The problem is that the dataset is very large, so I get a maximum recursion depth exceeded error. I think I can solve this by nesting the graph dictionary inside another dictionary where the key is the property ID number, and then searching for the path within ID groups. However, when I tried:
graph = {}
propertyIDgraph = {}

with open('buyersAndSellers.txt', 'r') as f:
    for row in f:
        propertyid, seller, buyer = row.strip('\n').split('|')
        graph.setdefault(seller, []).append(buyer)
        propertyIDgraph.setdefault(propertyid, []).append(graph)
f.close()
It assigned every buyer/seller combination to every property id. I would like it to assign the buyers and sellers to only their corresponding property ID.
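To be clear, for the sample rows above the structure I am after would look something like this (sketched by hand, one graph per property ID):
propertyIDgraph = {'23': {'Jane': ['Bob'], 'Bob': ['Tim'], 'Tim': ['Karl']}}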

You might attempt something like the following, which I adapted from https://www.python.org/doc/essays/graphs/:
from collections import namedtuple

Transaction = namedtuple('Transaction', ['buyer', 'property_id'])

graph = {}
## datasource might be a db cursor or a file
for data in datasource:
    graph.setdefault(data.seller, []).append(Transaction(data.buyer, data.property_id))

## builds something like
## graph = {'Jane': [Transaction('Bob', 23)],
##          'Bob':  [Transaction('Tim', 23)],
##          'Tim':  [Transaction('Karl', 23)]}

def find_transaction_path(graph, original_seller, current_owner, target_property_id, path=None):
    assert target_property_id is not None
    path = (path or []) + [original_seller]
    if original_seller == current_owner:
        return path
    if original_seller not in graph:
        return None
    shortest = None
    for node in graph[original_seller]:
        if node.buyer not in path and node.property_id == target_property_id:
            newpath = find_transaction_path(graph, node.buyer, current_owner,
                                            target_property_id, path)
            if newpath and (not shortest or len(newpath) < len(shortest)):
                shortest = newpath
    return shortest
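With the sample rows from the question, a call would look something like this (assuming the graph was built as above; an untested sketch):
path = find_transaction_path(graph, 'Jane', 'Karl', 23)
# expected: ['Jane', 'Bob', 'Tim', 'Karl']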

I wouldn't recommend appending the graph like that: it appends the same graph object under every property id. Better to check whether the key already exists first, and only then append to the existing entry.
Try this:
graph = {}
propertyIDgraph = {}

with open('buyersAndSellers.txt', 'r') as f:
    for row in f:
        propertyid, seller, buyer = row.strip('\n').split('|')
        if seller in graph:
            graph[seller] = graph[seller] + [buyer]
        else:
            graph[seller] = [buyer]
        if propertyid in propertyIDgraph:
            propertyIDgraph[propertyid] = propertyIDgraph[propertyid] + [graph]
        else:
            propertyIDgraph[propertyid] = [graph]
f.close()
Here is a link that may be useful:
syntax for creating a dictionary into another dictionary in python
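If the goal is to keep each property's buyers and sellers completely separate, another option (a sketch only, not tested against your full data) is to nest a fresh graph dict under each property id instead of appending one shared graph:
propertyIDgraph = {}

with open('buyersAndSellers.txt', 'r') as f:
    for row in f:
        propertyid, seller, buyer = row.strip('\n').split('|')
        # setdefault creates a new, empty graph the first time a property id is seen,
        # so edges never leak between different property ids
        propertyIDgraph.setdefault(propertyid, {}).setdefault(seller, []).append(buyer)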

Related

Python Gitlab API - list shared projects of a group/subgroup

I need to find all projects and shared projects within a Gitlab group with subgroups. I managed to list the names of all projects like this:
group = gl.groups.get(11111, lazy=True)

# find all projects, also in subgroups
projects = group.projects.list(include_subgroups=True, all=True)
for prj in projects:
    print(prj.attributes['name'])
print("")
What I am still missing is how to also list the shared projects within the group. Or, to put it in other words: how do I find all projects where my group is a member? Is this possible with the Python API?
So, inspired by the answer of sytech, I found out that it was not working in the first place, as the shared projects were still hidden in the subgroups. So I came up with the following code that digs through all the levels of subgroups to find all shared projects. I assume this can be written much more elegantly, but it works for me:
# group definition
main_group_id = 11111

# create empty list that will contain the final result
list_subgroups_id_all = []
# create empty list that acts as temporary storage of the results outside the function
list_subgroups_id_stored = []

# function to create a list of subgroups of a group (id)
def find_subgroups(group_id):
    # retrieve group object
    group = gl.groups.get(group_id)
    # create an empty list to store the ids of subgroups
    list_subgroups_id = []
    # iterate through the group to find the ids of all subgroups
    for sub in group.subgroups.list():
        list_subgroups_id.append(sub.id)
    return list_subgroups_id

# function to iterate over the various groups for subgroup detection
def iterate_subgroups(group_id, list_subgroups_id_all):
    # for a given id, find existing subgroups (id) and store them in a list
    list_subgroups_id = find_subgroups(group_id)
    # add the found items to the list storage variable, so that the results are not overwritten
    list_subgroups_id_stored.append(list_subgroups_id)
    # for each found subgroup_id, test whether it is already part of the total id list
    # if not, store it and test for more subgroups
    for test_id in list_subgroups_id:
        if test_id not in list_subgroups_id_all:
            # add it to the total subgroup id list (final results list)
            list_subgroups_id_all.append(test_id)
            # check whether test_id contains more subgroups
            list_subgroups_id_tmp = iterate_subgroups(test_id, list_subgroups_id_all)
            # if so, append to the stored subgroup list that is currently being checked
            list_subgroups_id_stored.append(list_subgroups_id_tmp)
    return list_subgroups_id_all

# find all subgroups and sub-subgroups, etc., and store their ids in a list
list_subgroups_id_all = iterate_subgroups(main_group_id, list_subgroups_id_all)
print("***ids of all subgroups***")
print(list_subgroups_id_all)
print("")

print("***names of all subgroups***")
list_names = []
for ids in list_subgroups_id_all:
    group = gl.groups.get(ids)
    group_name = group.attributes['name']
    list_names.append(group_name)
print(list_names)
print("")

# print all directly integrated projects of the main group, also those in subgroups
print("***integrated projects***")
group = gl.groups.get(main_group_id)
projects = group.projects.list(include_subgroups=True, all=True)
for prj in projects:
    print(prj.attributes['name'])
print("")

# print all shared projects
print("***shared projects***")
for sub in list_subgroups_id_all:
    group = gl.groups.get(sub)
    for shared_prj in group.shared_projects:
        print(shared_prj['path_with_namespace'])
print("")
One question that remains - at the very beginning I retrieve the main group by its id (here: 11111), but can I actually also get this id by looking for the name of the group? Something like: group_id = gl.group.get(attribute={'name','foo'}) (not working)?
You can get the shared projects by the .shared_projects attribute:
group = gl.groups.get(11111)
for proj in group.shared_projects:
    print(proj['path_with_namespace'])
However, you cannot use the lazy=True argument to gl.groups.get.
>>> group = gl.groups.get(11111, lazy=True)
>>> group.shared_projects
AttributeError: shared_projects
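As for the follow-up about looking the group up by name instead of by id: the groups API accepts a search parameter, so something along these lines may work (an untested sketch; 'foo' is a placeholder group name):
matches = gl.groups.list(search='foo')
# search matches on name/path, so filter for the exact name if several groups come back
group_id = next(g.attributes['id'] for g in matches if g.attributes['name'] == 'foo')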

How can I get the specific element in a json-like element in python?

I have an XML file containing some Open Street Map nodes. I am trying to select a node randomly. To do this first I am going to get the ids of all the nodes into an array and then select an id randomly. Then I wish to get the node which has that id number.
Now, I read the xml file and do the following:
tree = ET.parse('/Users/XXX/Documents/map.osm.xml')
root = tree.getroot()

idd = []  # ids of the nodes
for n in root.iter('node'):
    idd.append(n.attrib["id"])
Each n.attrib in the loop is something like this:
{'id': '6676298011', 'visible': 'true', 'version': '1', 'changeset': '72944617', 'timestamp': '2019-08-02T14:49:11Z', 'user': 'bkrc', 'uid': '8150490', 'lat': '41.0836908', 'lon': '29.0511424'}
How can I get the one with the id, for example, 6677592585?
# whole code:
import xml.etree.ElementTree as ET
import random
import json

tree = ET.parse('/Users/XXX/Documents/map.osm.xml')
root = tree.getroot()

idd = []
for n in root.iter('node'):
    idd.append(n.attrib["id"])

i = idd[0]
print(i)
Perhaps this isn't what you're looking for, but I would take advantage of the fact that you're already looping through all the nodes to get their id.
Assuming the id is unique for each node, while making a list of the ids I would create a new dict where the given id is a key to the node. Then, when you randomly select your id, you can just use it to get the node out of your new dict. This might not be a good solution if you have memory limitations, but the only other solution I can think of is looping through the original structure each time you want a node until you find the selected id, which would be CPU intensive. Maybe it would look something like this:
idd = []  # ids of the nodes
mapped_nodes = {}
for n in root.iter('node'):
    idd.append(n.attrib["id"])
    mapped_nodes[n.attrib["id"]] = n
If you just need n.attrib, you could also just put that in your mapped_nodes instead of the whole node.
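Picking a random node would then look something like this (a sketch reusing the idd list and mapped_nodes built above, with random already imported in your code):
random_id = random.choice(idd)         # pick one of the collected ids at random
random_node = mapped_nodes[random_id]  # O(1) lookup instead of re-scanning the tree
print(random_node.attrib)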

Google's radarsearch API results

I'm trying to geolocate all the businesses related to a keyword in my city using, first, the radarsearch API in order to retrieve the Place ID and later using the Places API to get more information of each Place ID (such as the name, or the formatted address).
In my first approach I split my city into 9 circles, each with a radius of 22 km, avoiding rural zones where there are not supposed to be any businesses. This way I obtained approximately 150 businesses (after removing duplicate results caused by the overlapping circles). This result is not reliable, because the company's official webpage asserts there are 245.
In order to retrieve ALL the businesses, I split my city into circles of radius 10 km. With approximately 50 pairs of coordinates I now cover the whole city, including both rural and non-rural zones. Surprisingly, I now obtain only 81 businesses! How is this possible?
I'm storing all the information in separate dictionaries, and I noticed that the amount of data in each dictionary grows with the radius and is always the same for a fixed radius.
Apart from the previous question, is there any way to limit the number of results each request yields?
The code I'm using is the following:
dict1 = {}
radius = 20000
keyword = 'keyword'
key = YOUR_API_KEY
url_base = "https://maps.googleapis.com/maps/api/place/radarsearch/json?"
list_dicts = []

for i, (lo, la) in enumerate(zip(lon_txt, lat_txt)):
    url = url_base + 'location=' + str(lo) + ',' + str(la) + '&radius=' + str(radius) + '&keyword=' + keyword + '&key=' + key
    response = urllib2.urlopen(url)
    table = json.load(response)
    if table['status'] == 'OK':
        for j, line in enumerate(table['results']):
            temp = {j: line['place_id']}
            dict1.update(temp)
        list_dicts.append(dict1)
    else:
        pass
Finally I managed to solve this problem. The issue was that the dict initialization must be done in each loop iteration. Now it stores all the information, and I retrieve what I wanted from the beginning.
dict1 = {}
radius = 20000
keyword = 'keyword'
key = YOUR_API_KEY
url_base = "https://maps.googleapis.com/maps/api/place/radarsearch/json?"
list_dicts = []

for i, (lo, la) in enumerate(zip(lon_txt, lat_txt)):
    url = url_base + 'location=' + str(lo) + ',' + str(la) + '&radius=' + str(radius) + '&keyword=' + keyword + '&key=' + key
    response = urllib2.urlopen(url)
    table = json.load(response)
    if table['status'] == 'OK':
        for j, line in enumerate(table['results']):
            temp = {j: line['place_id']}
            dict1.update(temp)
        list_dicts.append(dict1)
        dict1 = {}  # re-initialize for the next request
    else:
        pass
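If you then want the distinct place IDs across all requests (the overlapping circles can produce duplicates), one way is to collect the values from every per-request dict into a set, for example:
unique_place_ids = {pid for d in list_dicts for pid in d.values()}
print(len(unique_place_ids))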

Python find average of element in list with multiple elements

I have a ticker that grabs current information of multiple elements and adds it to a list in the format: trade_list.append([[trade_id, results]]).
Say we're tracking trade_ids 4555, 5555 and 23232; the trade_list will keep ticking away, adding their results to the list. I then want to find the averages of their results individually.
The code works as such:
# Find accounts
for a in accounts:
    # find open trades of the account
    for t in range(len(trades)):
        # do some math
        trades_list.append([trade_id, result])
        avernum = 0
        average = []
        for r in range(len(trades_list)):
            average.append(trades_list[r][1])  # this is the value attached to the trade_id
            avernum += 1
        results = float(sum(average) / avernum)
        results_list.append([[trade_id, results]])
This fills out really quickly. This is after two ticks:
print(results_list)
[[[53471, 28.36432]], [[53477, 31.67835]], [[53474, 32.27664]], [[52232, 1908.30604]], [[52241, 350.4758]], [[53471, 28.36432]], [[53477, 31.67835]], [[53474, 32.27664]], [[52232, 1908.30604]], [[52241, 350.4758]]]
These averages will move and change very quickly. I want to use results_list to track and watch them, then compare previous averages to current ones.
Thinking:
for r in range(len(results_list)):
    if results_list[r][0] == trade_id:
        restick.append(results_list[r][1])
        resnum = len(restick)
        if restick[resnum] > restick[resnum-1]:
            # do fancy things
Here is some short code that does what I think you have described, although I might have misunderstood. It basically does exactly what you say: select everything that has a certain trade_id and return its average:
TID_INDEX = 0
DATA_INDEX = 1

def id_average(t_id, arr):
    filt_arr = [i[DATA_INDEX] for i in arr if i[TID_INDEX] == t_id]
    return sum(filt_arr) / len(filt_arr)
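For the nested results_list shown in the question (each entry is wrapped as [[trade_id, result]]), usage might look like this (a sketch):
flat = [entry[0] for entry in results_list]  # unwrap one level of nesting
print(id_average(53471, flat))               # -> 28.36432 for the sample data above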

Python - Iterative self cross referencing

I have a bit of a logical challenge. I have a single table in Excel that contains an identifier column and a cross-reference column. There can be multiple rows for a single identifier, which indicates multiple cross references (see the basic example below).
Any record that ends in the letter "X" indicates that it is a cross reference, and not an actual identifier. I need to generate a list of the cross references for each identifier, but trace each one down to the actual cross-reference identifier. So using "A1" as an example from the table, I would need the list returned as follows: "A2, A3, B1, B3". Notice there are no identifiers ending in "X" in the list; they have been traced down to the actual source record through the table.
Any ideas or help would be much appreciated. I'm using python and xlrd to read the table.
t = [
    ["a1", "a2"],
    ["a1", "a3"],
    ["a1", "ax"],
    ["ax", "b1"],
    ["ax", "bx"],
    ["bx", "b3"]
]

import itertools

def find_matches(t, key):
    return list(itertools.chain(*[[v] if not v.endswith("x") else find_matches(t, v)
                                  for k, v in t if k == key]))

print find_matches(t, "a1")
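For the sample data above this prints:
['a2', 'a3', 'b1', 'b3']
i.e. every entry ending in "x" has been traced down to actual identifiers.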
You could treat your list as the adjacency list of a graph. Something like:
t = [
    ["a1", "a2"],
    ["a1", "a3"],
    ["a1", "ax"],
    ["ax", "b1"],
    ["ax", "bx"],
    ["bx", "b3"]
]

class MyGraph:
    def __init__(self, adjacency_table):
        self.table = adjacency_table
        self.graph = {}
        for from_node, to_node in adjacency_table:
            if from_node in self.graph:
                self.graph[from_node].append(to_node)
            else:
                self.graph[from_node] = [to_node]
        print self.graph

    def find_leaves(self, v):
        seen = set([v])  # note: set(v) would split the string into characters
        def search(v):
            for vertex in self.graph[v]:
                if vertex in seen:
                    continue
                seen.add(vertex)
                if vertex in self.graph:
                    for p in search(vertex):
                        yield p
                else:
                    yield vertex
        for p in search(v):
            yield p

print list(MyGraph(t).find_leaves("a1"))
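For the sample table this prints ['a2', 'a3', 'b1', 'b3'], the leaves reachable from "a1", which matches the list the question asks for (just in lowercase).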
