Appending To Class List Within Multiple For Loops - python

After having read the following question and its various answers ("Least Astonishment" and the Mutable Default Argument), as well as the official documentation (https://docs.python.org/3/tutorial/controlflow.html#default-argument-values), I've written my ResultsClass so that each instance of it has a separate list without affecting the defaults (at least, that is what should be happening, given my newly gained understanding):
class ResultsClass:
    def __init__(self,
                 project = None,
                 badpolicynames = None,
                 nonconformpolicydisks = None,
                 diskswithoutpolicies = None,
                 dailydifferences = None,
                 weeklydifferences = None):
        self.project = project
        if badpolicynames is None:
            self.badpolicynames = []
        if nonconformpolicydisks is None:
            self.nonconformpolicydisks = []
        if diskswithoutpolicies is None:
            self.diskswithoutpolicies = []
        if dailydifferences is None:
            self.dailydifferences = []
        if weeklydifferences is None:
            self.weeklydifferences = []
By itself, this works as expected:
i = 0
for result in results:
    result.diskswithoutpolicies.append("count is " + str(i))
    print(result.diskswithoutpolicies)
    i = i + 1
['count is 0']
['count is 1']
['count is 2']
['count is 3']
etc.
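(As a side note, I understand the same default handling could also be written more compactly, something like the sketch below; it additionally keeps any non-None argument that is passed in, which my version above does not, but the behaviour relevant to this question should be the same.)

class ResultsClass:
    def __init__(self, project=None, badpolicynames=None,
                 nonconformpolicydisks=None, diskswithoutpolicies=None,
                 dailydifferences=None, weeklydifferences=None):
        self.project = project
        # "x if x is not None else []" gives each instance its own fresh list
        self.badpolicynames = badpolicynames if badpolicynames is not None else []
        self.nonconformpolicydisks = nonconformpolicydisks if nonconformpolicydisks is not None else []
        self.diskswithoutpolicies = diskswithoutpolicies if diskswithoutpolicies is not None else []
        self.dailydifferences = dailydifferences if dailydifferences is not None else []
        self.weeklydifferences = weeklydifferences if weeklydifferences is not None else []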
The context of this script is that I'm trying to obtain information from each project within our Google Cloud infrastructure: chiefly, a list of disks that have a snapshot schedule associated with them, a list of the scheduled snapshots taken of each disk within the last 24 hours, the disks whose schedule names do not fit our naming convention, and the disks that do not have any snapshot schedules associated with them at all.
Within the full script, I use this exact same ResultsClass; yet when used within multiple for loops, the append again seems to be adding to the default values, and in all honesty I don't understand why.
The shortened version of the code is as follows:
# Code to obtain a list of projects
results = [ResultsClass() for i in range(len(projects))]
for result in results:
    for project in projects:
        result.project = project
        # Code to obtain each zone in the project
        for zone in zones:
            # Code to get each disk in zone
            for disk in disks:
                resourcepolicy = disk.get('resourcePolicies')
                if resourcepolicy:
                    # Code to action if a resource policy exists
                    if policy_name_is_valid:  # placeholder for the naming-convention check shortened out here
                        ...
                    else:
                        result.badpolicynames.append(resourcepolicy[0].split('/')[-1])
                        result.nonconformpolicydisks.append(disk['id'])
                else:
                    result.diskswithoutpolicies.append(disk['id'])
        pprint(vars(result))
This then comes back with the results:
{'badpolicynames': [],
 'dailydifferences': None,
 'diskswithoutpolicies': ['**1098762112354315432**'],
 'nonconformpolicydisks': [],
 'project': '**project0**',
 'weeklydifferences': None}
{'badpolicynames': [],
 'dailydifferences': None,
 'diskswithoutpolicies': ['**1098762112354315432**', '**1031876156872354739**'],
 'nonconformpolicydisks': [],
 'project': '**project1**',
 'weeklydifferences': None}
Does a for loop (or multiple for loops) somehow negate the separate lists created within the ResultsClass? I need to understand why this is happening within Python and then how I can correct it.

Based on my best understanding, one of the glaring problems is that you're nesting the results and projects loops inside each other, whereas you should be looping over only one of them. I'd suggest looping over the projects and creating a result for each one, instead of instantiating all the class instances in a list beforehand.
results = []
for project in projects:
    result = ResultsClass(project)
    # Code to obtain each zone in the project
    for zone in zones:
        # Code to get each disk in zone
        for disk in disks:
            resourcepolicy = disk.get('resourcePolicies')
            if resourcepolicy:
                # Code to action if a resource policy exists
                if policy_name_is_valid:  # placeholder for the naming-convention check
                    ...
                else:
                    result.badpolicynames.append(resourcepolicy[0].split('/')[-1])
                    result.nonconformpolicydisks.append(disk['id'])
            else:
                result.diskswithoutpolicies.append(disk['id'])
    results.append(result)
    pprint(vars(result))
With that, results is the list of your ResultsClass instances, and each result contains only one project, whereas your previous attempt would end with every ResultsClass holding the same, last project.
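For example, you can then iterate over the collected results afterwards (a quick sketch):

for result in results:
    print(result.project, len(result.diskswithoutpolicies), "disks without policies")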

I'm not sure if I understand correctly what you are trying to achieve, but you want to transfer data from each project into a single result, right?
If so, you might want to use zip to get a single result per project:
for result, project in zip(results, projects):
    # rest of the code
Otherwise you are overriding each result's data from the previous project on the next iteration of the inner loop.
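A slightly fuller sketch of the zip approach (the zone/disk handling is assumed to be the same as in your question):

results = [ResultsClass() for _ in projects]
for result, project in zip(results, projects):
    result.project = project
    # Code to obtain each zone in the project
    for zone in zones:
        # Code to get each disk in zone
        for disk in disks:
            resourcepolicy = disk.get('resourcePolicies')
            if resourcepolicy:
                # ... naming-convention handling as in the question ...
                pass
            else:
                result.diskswithoutpolicies.append(disk['id'])
    pprint(vars(result))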
Another option would be to create the result in the loop:
results = []
for project in projects:
    result = ResultsClass()
    # ... your fetching code ...
    results.append(result)

Related

Ren'py / Python - insert dynamic variable in class with a while loop - without overwriting with append

I'm currently trying to get into Python and Ren'Py a bit. Since I like to design a lot of things dynamically and don't want to copy and paste so often, I'm trying to create a page that will have ImageButtons matching the number I specify.
In the example below I use 4, but this can be higher.
I have built a class for this purpose:
Example:
init python:
    class PictureSettings(object):
        def __init__(self, ImgIdle, ImgHover, LabelCall):
            self.ImgIdle = ImgIdle
            self.ImgHover = ImgHover
            self.LabelCall = LabelCall
            return
These hold the idle/hover images and the label for the jump.
If I now append each entry to the list "manually" in the code, I get all 4 pictures displayed as desired.
Example: (Works - but is not dynamic)
python:
    var_pictures = []
    var_pictures.append(PictureSettings("img_picture_1_idle", "img_picture_1_hover", "picture_1"))
    var_pictures.append(PictureSettings("img_picture_2_idle", "img_picture_2_hover", "picture_2"))
    var_pictures.append(PictureSettings("img_picture_3_idle", "img_picture_3_hover", "picture_3"))
    var_pictures.append(PictureSettings("img_picture_4_idle", "img_picture_4_hover", "picture_4"))
I would like it to be like this:
Example (here I only get "img_picture_4_idle", "img_picture_4_hover", "picture_4"):
$ countlimit = 4
$ count = 1
python:
    while count < countlimit:
        var_pictures = []
        var_pictures.append(PictureSettings(
            ImgIdle = "img_picture_[count]_idle",
            ImgHover = "img_picture_[count]_hover",
            LabelCall = "picture_[count]"))
        count += 1
I have already tried various things, unfortunately without success. For example, I tried add instead of append (because append seemed to overwrite the result and leave only the last entry), and I get the following error:
var_pictures.add(PictureSettings(
AttributeError: 'RevertableList' object has no attribute 'add'
Maybe someone can help me with the solution so I can keep my code dynamic without copying something X times.
Thanks for your help
You are creating your list inside your loop, so it is recreated every time.
At the end, you only get the last created list.
var_pictures = []
while count < countlimit:
    var_pictures.append(PictureSettings(
        ImgIdle = "img_picture_[count]_idle",
        ImgHover = "img_picture_[count]_hover",
        LabelCall = "picture_[count]"))
    count += 1
On another subject, if you want to do this in a more pythonic way:
pictures = []  # no need for var_, we know it's a variable
for i in range(1, 5):
    pictures.append(PictureSettings(
        # in Python, we prefer snake_case attributes
        img_idle=f'img_picture_{i}_idle',
        img_hover=f'img_picture_{i}_hover',
        ...
    ))
# or even shorter with a list comprehension
pictures = [
    PictureSettings(
        img_idle=f'img_picture_{i}_idle',
    )
    for i in range(1, 5)
]
By the way, there is no need for return in your class constructor.
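If it helps, here is a rough, untested sketch of how such a list might then be used to lay out the image buttons in a screen (the screen name picture_page is just an example):

screen picture_page():
    hbox:
        for p in var_pictures:
            imagebutton:
                idle p.ImgIdle
                hover p.ImgHover
                action Jump(p.LabelCall)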

Pattern for serial-to-parallel-to-serial data processing

I'm working with arrays of datasets, iterating over each dataset to extract information, and using the extracted information to build a new dataset that I then pass to a parallel processing function that might do parallel I/O (requests) on the data.
The return is a new dataset array with new information, which I then have to consolidate with the previous one. The pattern ends up being Loop->parallel->Loop.
parallel_request = []
for item in dataset:
    transform(item)
    subdata = extract(item)
    parallel_request.append(subdata)

new_dataset = parallel_function(parallel_request)

for item in dataset:
    transform(item)
    subdata = extract(item)
    if subdata in new_dataset:
        item[subdata] = new_dataset[subdata]
I'm forced to use two loops: one to build the parallel request, and another to consolidate the parallel results with my old data. Large chunks of these loops end up repeating the same steps. This pattern is becoming uncomfortably prevalent and repetitive in my code.
Is there some technique to "yield" inside the first loop after adding data to parallel_request and continue on to the next item, then, once parallel_request is filled, execute the parallel function and resume the loop for each item again, restoring the previously saved context (local variables)?
EDIT: I think one solution would be to use a function instead of a loop, and call it recursively. The downside is that I would definitely hit the recursion limit.
parallel_requests = []
final_output = []
index = 0

def process_data(dataset, last=False):
    global index
    data = dataset[index]
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    index += 1
    parallel_requests.append(subdata)
    # If not last, recurse.
    # Otherwise, call the processing function.
    if not last:
        process_data(dataset, index == len(dataset))
    else:
        new_data = process_requests(parallel_requests)
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[index], data, data2, data3)
    final_output.append(final_data)

process_data(original_dataset)
Any solution would involve somehow preserving data, data2, data3, subdata, etc., which would have to be stored somewhere. Recursion uses the stack to store them, which will trigger the recursion limit. Another way would be to store them in some array outside of the loop, which makes the code much more cumbersome. Another solution would be to just recompute them, which would also require code duplication.
So I suspect that to achieve this you'd need some specific Python facility that enables it.
I believe I have solved the issue:
Based on the previous recursive code, you can exploit the generator facilities offered by Python to preserve the serial context when calling the parallel function:
def process_data(datum, parallel_requests, final_output):
    data = datum
    data2 = transform(data)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    # ... some other work
    parallel_requests.append(subdata)
    yield
    # Now processing of each item can resume, keeping its
    # local data variables, transforms, subdata...etc.
    final_data = merge(subdata, new_data[subdata], data, data2, data3)
    final_output.append(final_data)

final_output = []
parallel_requests = []
funcs = [process_data(datum, parallel_requests, final_output) for datum in dataset]
[next(f) for f in funcs]
new_data = process_requests(parallel_requests)
[next(f, None) for f in funcs]  # resume each generator to merge and collect
The output list and generator calls are general enough that you can abstract these lines away into a helper function that sets everything up and calls the generators for you, leading to a very clean result: the code overhead is one line for the function definition and one line to call the helper.
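One possible shape for such a helper (just a sketch; run_serial_parallel_serial and the two-yield generator contract are illustrative rather than taken from the original post, and here the parallel results are handed back to each generator with send() instead of a global):

def run_serial_parallel_serial(dataset, make_gen, parallel_func):
    """Drive one generator per item: collect requests, run the parallel step, resume."""
    requests = []
    gens = [make_gen(datum, requests) for datum in dataset]
    for g in gens:
        next(g)                               # first serial half: each generator fills `requests`
    results = parallel_func(requests)         # single parallel step
    return [g.send(results) for g in gens]    # second serial half: resume and collect outputs

# The per-item generator appends its request, pauses, then finishes with the results:
def process_item(datum, requests):
    data2 = transform(datum)
    data3 = expensive_slow_transform(data2)
    subdata = extract(data3)
    requests.append(subdata)
    new_data = yield                          # parallel results arrive here via send()
    yield merge(subdata, new_data[subdata], datum, data2, data3)

final_output = run_serial_parallel_serial(dataset, process_item, process_requests)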

Validating that all components required for an object to exist are present

I need to write a script that gets a list of components from an external source and, based on a pre-defined list, validates whether the service is complete. This is needed because the presence of a single component doesn't automatically imply that the service is present: some components are pre-installed even when there is no service. I've devised something really simple below, but I was wondering what the intelligent way of doing this is. There must be a cleaner, simpler way.
# Components that make up a complete service
serviceComponents = ['A', 'B']
# Input from JSON
data = ['B', 'A', 'C']

serviceComplete = True
for i in serviceComponents:
    if i in data:
        print 'yay ' + i + ' found from ' + ', '.join(data)
    else:
        serviceComplete = False
        break
# If serviceComplete = True do blabla...
You could do it a few different ways:
set(serviceComponents) <= set(data)
set(serviceComponents).issubset(data)
all(c in data for c in serviceComponents)
You can make it shorter, but you lose readability. What you have now is probably fine. I'd go with the first approach personally, since it expresses your intent clearly with set operations.
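For example, with the lists from the question (a quick sketch):

serviceComponents = ['A', 'B']
data = ['B', 'A', 'C']

service_complete = set(serviceComponents) <= set(data)
print(service_complete)  # True: both 'A' and 'B' are present in data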
# Components that make up a complete service
serviceComponents = ['A', 'B']
# Input from JSON
data = ['B', 'A', 'C']

if all(item in data for item in serviceComponents):
    print("All required components are present")
The built-in set would serve you well here; use set.issubset to verify that your required service components are a subset of the input data:
serviceComponents = set(['A', 'B'])
input_data = set(['B', 'A', 'C'])

if serviceComponents.issubset(input_data):
    # perform actions ...
    pass

Failed WriteBatch Operation with py2neo

I am trying to find a workaround to the following problem. I have seen it quasi-described in this SO question, yet not really answered.
The following code fails, starting with a fresh graph:
from py2neo import neo4j

def add_test_nodes():
    # Add a test node manually
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345, {"user_id": 12345})

def do_batch(graph):
    # Begin batch write transaction
    batch = neo4j.WriteBatch(graph)
    # get some updated node properties to add
    new_node_data = {"user_id": 12345, "name": "Alice"}
    # batch requests
    a = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
    batch.set_properties(a, new_node_data)  # <-- I'm the problem
    # execute batch requests and clear
    batch.run()
    batch.clear()

if __name__ == '__main__':
    # Initialize Graph DB service and create a Users node index
    g = neo4j.GraphDatabaseService()
    users_idx = g.get_or_create_index(neo4j.Node, "Users")
    # run the test functions
    add_test_nodes()
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
    print alice
    do_batch(g)
    # get alice back and assert additional properties were added
    alice = g.get_or_create_indexed_node("Users", "user_id", 12345)
    assert "name" in alice
In short, I wish, in one batch transaction, to update existing indexed node properties. The failure occurs at the batch.set_properties line, and it happens because the BatchRequest object returned by the previous line is not being interpreted as a valid node. Though not entirely identical, it feels like I am attempting something like the answer posted here.
Some specifics
>>> import py2neo
>>> py2neo.__version__
'1.6.0'
>>> g = py2neo.neo4j.GraphDatabaseService()
>>> g.neo4j_version
(2, 0, 0, u'M06')
Update
If I split the problem into separate batches, then it can run without error:
def do_batch(graph):
    # Begin batch write transaction
    batch = neo4j.WriteBatch(graph)
    # get some updated node properties to add
    new_node_data = {"user_id": 12345, "name": "Alice"}
    # batch request 1
    batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
    # execute batch request and clear
    alice = batch.submit()
    batch.clear()
    # batch request 2
    batch.set_properties(alice, new_node_data)
    # execute batch request and clear
    batch.run()
    batch.clear()
This works for many nodes as well. Though I do not love the idea of splitting the batch up, this might be the only way at the moment. Anyone have some comments on this?
After reading up on all the new features of Neo4j 2.0.0-M06, it seems that the older workflow of node and relationship indexes is being superseded. There is presently a bit of divergence on Neo4j's part in the way indexing is done, namely labels and schema indexes.
Labels
Labels can be arbitrarily attached to nodes and can serve as a reference for an index.
Indexes
Indexes can be created in Cypher by referencing a label (here, User) and a node property key (screen_name):
CREATE INDEX ON :User(screen_name)
Cypher MERGE
Furthermore, the indexed get_or_create methods are now possible via the new Cypher MERGE function, which incorporates labels and their indexes quite succinctly:
MERGE (me:User{screen_name:"SunPowered"}) RETURN me
Batch
Queries of this sort can be batched in py2neo by appending a CypherQuery instance to the batch object:
from py2neo import neo4j

graph_db = neo4j.GraphDatabaseService()

cypher_merge_user = neo4j.CypherQuery(graph_db,
    "MERGE (user:User {screen_name:{name}}) RETURN user")

def get_or_create_user(screen_name):
    """Return the user if it exists, create one if not"""
    return cypher_merge_user.execute_one(name=screen_name)

def get_or_create_users(screen_names):
    """Apply the get-or-create user cypher query to many usernames in a
    batch transaction"""
    batch = neo4j.WriteBatch(graph_db)
    for screen_name in screen_names:
        batch.append_cypher(cypher_merge_user, params=dict(name=screen_name))
    return batch.submit()

root = get_or_create_user("Root")
users = get_or_create_users(["alice", "bob", "charlie"])
Limitation
There is a limitation, however, in that the results of a Cypher query in a batch transaction cannot be referenced later in the same transaction. The original question was about updating a collection of indexed user properties in one batch transaction. This is still not possible, as far as I can tell. For example, the following snippet throws an error:
batch = neo4j.WriteBatch(graph_db)
b1 = batch.append_cypher(cypher_merge_user, params=dict(name="Alice"))
batch.set_properties(b1, dict(last_name="Smith"))
resp = batch.submit()
So it seems that, although there is a bit less overhead in implementing get_or_create over a labelled node with py2neo because the legacy indexes are no longer necessary, the original question still needs two separate batch transactions to complete.
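A sketch of that two-step workflow, using only the calls shown above (treat it as an outline rather than tested code; update_users and the example property dicts are mine):

def update_users(user_properties):
    """user_properties: dict mapping screen_name -> dict of properties to set."""
    # Step 1: resolve (get or create) each node with the MERGE query, one call per user
    nodes = dict((name, cypher_merge_user.execute_one(name=name))
                 for name in user_properties)
    # Step 2: batch all the property updates together in a single WriteBatch
    # (set_properties replaces the node's existing properties, so keep screen_name in the dict)
    batch = neo4j.WriteBatch(graph_db)
    for name, props in user_properties.items():
        batch.set_properties(nodes[name], props)
    batch.run()

update_users({
    "alice": {"screen_name": "alice", "last_name": "Smith"},
    "bob": {"screen_name": "bob", "last_name": "Jones"},
})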
Your problem seems not to be in batch.set_properties() but rather in the output of batch.get_or_create_in_index(). If you add the node with batch.create(), it works:
db = neo4j.GraphDatabaseService()
batch = neo4j.WriteBatch(db)
# create a node instead of getting it from index
test_node = batch.create({'key': 'value'})
# set new properties on the node
batch.set_properties(test_node, {'key': 'foo'})
batch.submit()
If you have a look at the properties of the BatchRequest object returned by batch.create() and batch.get_or_create_in_index() there is a difference in the URI because the methods use different parts of the neo4j REST API:
test_node = batch.create({'key': 'value'})
print test_node.uri # node
print test_node.body # {'key': 'value'}
print test_node.method # POST
index_node = batch.get_or_create_in_index(neo4j.Node, "Users", "user_id", 12345, {})
print index_node.uri # index/node/Users?uniqueness=get_or_create
print index_node.body # {u'value': 12345, u'key': 'user_id', u'properties': {}}
print index_node.method # POST
batch.submit()
So I guess batch.set_properties() somehow can't handle the URI of the indexed node? I.e. it doesn't really get the correct URI for the node?
This doesn't solve the problem, but it could be a pointer for somebody else ;)

Simple example of retrieving 500 items from dynamodb using Python

Looking for a simple example of retrieving 500 items from dynamodb minimizing the number of queries. I know there's a "multiget" function that would let me break this up into chunks of 50 queries, but not sure how to do this.
I'm starting with a list of 500 keys. I'm then thinking of writing a function that takes this list of keys, breaks it up into "chunks," retrieves the values, stitches them back together, and returns a dict of 500 key-value pairs.
Or is there a better way to do this?
As a corollary, how would I "sort" the items afterwards?
Depending on your schema, there are 2 ways of efficiently retrieving your 500 items.
1. Items are under the same hash_key, using a range_key
Use the query method with the hash_key. You may ask to sort the range_keys A-Z or Z-A.
2. Items are on "random" keys
You said it: use the BatchGetItem method. Good news: the limit is actually 100 per request or 1 MB max. You will have to sort the results on the Python side.
On the practical side, since you use Python, I highly recommend the Boto library for low-level access or the dynamodb-mapper library for higher-level access (disclaimer: I am one of the core devs of dynamodb-mapper).
Sadly, neither of these libraries provides an easy way to wrap the batch_get operation. On the contrary, there is a generator for scan and for query which 'pretends' you get everything in a single query.
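For illustration, the scan generator in Boto's layer2 can be consumed like this (a minimal sketch; table is assumed to be a boto.dynamodb Table as in the example further down):

# The generator transparently follows the pagination for you
for item in table.scan():
    print item  # each item behaves like a dict of attributes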
In order to get optimal results with the batch query, I recommend this workflow:
- submit a batch with all of your 500 items
- store the results in your dicts
- re-submit with the UnprocessedKeys as many times as needed
- sort the results on the Python side (see the short sketch right after this list)
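For that last sorting step, plain Python sorting is enough once all items are back (a sketch; the attribute name user_id is only an example):

# items: list of item dicts collected from the batch responses
items.sort(key=lambda item: item['user_id'])                    # in-place sort by an attribute
sorted_items = sorted(items, key=lambda item: item['user_id'])  # or build a new sorted list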
Quick example
I assume you have created a table "MyTable" with a single hash_key
import boto

# Helper function. This is more or less the code
# I added to the develop branch
def resubmit(batch, prev):
    # Empty (re-use) the batch
    del batch[:]
    # The batch answer contains the list of
    # unprocessed keys grouped by tables
    if 'UnprocessedKeys' in prev:
        unprocessed = prev['UnprocessedKeys']
    else:
        return None
    # Load the unprocessed keys
    for table_name, table_req in unprocessed.iteritems():
        table_keys = table_req['Keys']
        table = batch.layer2.get_table(table_name)
        keys = []
        for key in table_keys:
            h = key['HashKeyElement']
            r = None
            if 'RangeKeyElement' in key:
                r = key['RangeKeyElement']
            keys.append((h, r))
        attributes_to_get = None
        if 'AttributesToGet' in table_req:
            attributes_to_get = table_req['AttributesToGet']
        batch.add_batch(table, keys, attributes_to_get=attributes_to_get)
    return batch.submit()

# Main
db = boto.connect_dynamodb()
table = db.get_table('MyTable')
batch = db.new_batch_list()
keys = range(100)  # Get items from 0 to 99
batch.add_batch(table, keys)
res = batch.submit()
while res:
    print res  # Do some useful work here
    res = resubmit(batch, res)
# The END
EDIT:
I've added a resubmit() function to BatchList in the Boto develop branch. It greatly simplifies the workflow:
- add all of your requested keys to the BatchList
- submit()
- resubmit() as long as it does not return None
This should be available in the next release.
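A sketch of that simplified workflow (assuming the new resubmit() is exposed as a no-argument method on the BatchList, which is my reading of the description above; untested):

import boto

db = boto.connect_dynamodb()
table = db.get_table('MyTable')

batch = db.new_batch_list()
batch.add_batch(table, keys=range(100))  # add all requested keys in one go

res = batch.submit()
while res:
    # consume res here, e.g. collect the items into a dict keyed by hash_key
    res = batch.resubmit()  # keeps re-sending UnprocessedKeys; returns None when done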
