Iterating through large lists with potential conditions in Python

Iterating through large lists with potential conditions in Python - python

I have large chunks of data, normally at around 2000+ entries, but in this report we have the ability to look as far as we want so it could be up to 10,000 records
The report is split up into: Two categories and then within each Category, we split by Currency so we have several sub categories within the list.
My issue comes in efficiently calculating the various subtotals. I am using Django and pass a templatetag the currency and category, if it applies, and then the templatetag renders the total. Note that sometimes I have a subtotal just for the category, with no currency passed.
Initially, I was using a seperate query for each subtotal by just using .filter() if there was a currency/category like so:
if currency:
entries = entries.filter(item_currency=currency)
This became a problem as I would have too many queries, and too long of a generation time (2,000+ ms), so I opted to use list(entries) to execute my query right off the bat, and then loop through it with simple list comprehensions:
totals['quantity'] = sum([e.quantity for e in entries])
My problem if you don't see it yet, lies in .. how can I efficiently add the condition for currency / category on each list comprehension? Sometimes they won't be there, sometimes they will so I can't simply type:
totals['quantity'] = sum([e.quantity for e in entries if item_currency = currency])
I could make a huge if-block, but that's not very clean and is a maintenance disaster, so I'm reaching out to the Stackoverflow community for a bit of insight .. thanks in advance :)

You could define a little inline function:
def EntryMatches(e):
if use_currency and not (e.currency == currency):
return False
if use_category and not (e.category == category):
return False
return True
then
totals['quantity'] = sum([e.quantity for e in entries if EntryMatches(e)])
EntryMatches() will have access to all variables in enclosing scope, so no need to pass in any more arguments. You get the advantage that all of the logic for which entries to use is in one place, you still get to use the list comprehension to make the sum() more readable, but you can have arbitrary logic in EntryMatches() now.

Related

Unexpected output when looping through subdictionaries in Python

I have a nested dictionary, where I have tickers to identifiy certain assets in my dictionary and then for each of these assets I would like to store characteristics in a subdictionary for the asset, creating them in a simple loop like the below:
ticker = ["a","bb","ccc"]
ticker_dict = dict.fromkeys(ticker, {"Var":[]})
for key in ticker_dict:
ticker_dict[key]["Var"] = len(key)
From the above output I would expect, that for each ticker/asset it saves the "Var" variable as the length of its name, meaning the following:
{"a":{"Var":1},
"bb":{"Var":2},
"ccc":{"Var":3}}
But, in my view rather weirdly, the result is this
{"a":{"Var":3},
"bb":{"Var":3},
"ccc":{"Var":3}}
To provide further context, the real process is that I have four assets, for which I would like to store dataframes in their subdictionaries as this makes it easy for me to access them later in loops etc. Somehow though, the data from the last asset is simply copied over all assets, eventhogh I explicitly loop through different keys.
What's going on?
PS: I'm not sure how to explain the problem without the sample code, so I might have missed a similar entry on this site. If so, any hints to it would be appreciated as well of course.

In your code, {"Var":[]} is only evaluated once, causing there to be only 1 inner dictionary shared by all keys. Instead, you can use a dictionary comprehension:
ticker_dict = {key:{"Var":[]} for key in ticker}
and it will work as expected.

How to remove extraneous square brackets from a nested list inside a dictionary?

I have been working on a problem which involves sorting a large data set of shop orders, extracting shop and user information based on some parameters. Mostly this has involved creating dictionaries by iterating through a data set with a for loop and appending a new list, like this:
sshop = defaultdict(list)
for i in range(df_subset.shape[0]):
orderid, sid, userid, time = df.iloc[i]
sshop[sid].append(userid)
sData = dict(sshop)
#CREATES DICTIONARY OF UNIQUE SHOPS WITH USER DATA AS THE VALUE
shops = df_subset['shopid'].unique()
shops_dict = defaultdict(list)
for shop in shops:
shops_dict[shop].append(sData[shop])
shops_dict = dict(shops_dict)
shops_dict looks like this at this point:
{10009: [[196962305]], 10051: [[2854032, 48600461]], 10061: [[168750452, 194819216, 130633421,
62464559]]}
To get to the final stages I have had to repeat lines of code similar to these a couple of times. What seems to happen everytime I do this is that the VALUES in the dictionaries gain a set of square brackets.
This is one of my final dictionaries:
{10159: [[[1577562540.0, 1577736960.0, 1577737080.0]], [[1577651880.0, 1577652000.0, 1577652960.0]]],
10208: [[[1577651040.0, 1577651580.0, 1577797080.0]]]}
I don't entirely understand why this is happening, asides from I believe it is something to do with using defaultdict(list) and then converting that into a dictionary with dict().
These extra brackets, asides from being a little confusing, appear to be causing some problems for accessing the data using certain functions. I understand that there needs to be two sets of square brackets in total, one set that encases all the values in the dictionary key and another inside of that for each of the specific sets of values within that key.
My first question would be, is it possible to remove a specific set of square brackets from a dictionary like that?
My second question would be, if not - is there a better way of creating new dictionaries out the data from an older one without using defaultdict(list) and having all those extra square brackets?
Any help much appreciated!
Thanks :)!

In second loop use extend instead of append.
for shop in shops:
shops_dict[shop].extend(sData[shop])
shops_dict = dict(shops_dict)

How do I do a Simple Scan in DynamoDB?

Trying to understand the docs has been very difficult in relation to trying to understand how to do a simple scan in AWS DynamoDB.
Can someone please explain to me in simple terms how to do a basic scan?

What is a Scan?
The Scan operation returns one or more items and item attributes by accessing every item in a table or a secondary index.
Explanation
A scan operation in it's simplest form looks through everything in your table. Most of the time, you probably don't need the whole table to be returned or even looked at. As a result, many often decide to use filters to cut down on the stuff to look through, process and return.
How do I Scan?
Here is a simple scan operation in python. Even if you aren't using python, this guide will be very helpful.
# Table = 'grades'
# Year_levels = {0-12}
# Sort_key = overall_rank
# Attribute_categories = math, english, science | out of 100
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table("grades")
result = table.scan(
FilterExpression ='math > :math AND english > :eng',
ExpressionAttributeValues = {':math': 80,':eng': 70},
Select='SPECIFIC_ATTRIBUTES',
ProjectionExpression='year_level,overall_rank,math,english',
Limit = 50 #This is the amount of items to SCAN, not necessarily RETURN.
)
# return or print result
Explanation
FilterExpression and ExpressionAttributeValues. There are multiple ways of understanding how these work. One way to understand it is by seeing it as an item attribute value checker. In other words, every item that the scan goes through, the filter's applied upon it's attributes must be true for the item to be returned. e.g (a math score of 80%+ and an english score of 70%+)
Select and Projection Expression. In technical terms, the way in which I explain this is incorrect, however, in practical terms this way of understanding holds up: You can see that there is a SECOND filter, not for the item, but the ATTRIBUTES of the item that will be returned. e.g (I only want the
year_level, overall_rank, math, english to be returned, but no science)
Now if we combine the two we have an example: If an item is checked and matches the criteria placed upon it by the FilterExpression, it will be returned. HOWEVER, we only want SPECIFIC_ATTRIBUTES to be returned. At this point, the item will then be checked AGAIN against, this time, the Select criteria. The select criteria tells you what attributes FROM the item to return.
Limit is just the amount of items to check through, but not necessarily return.
References
References

How to resolve inconsistent annotation in Django?

I have the following 2 lines of
CategoryContext = Somemodel.objects.values('title__categories__category').distinct()
CategoryContextSum = CategoryContext.annotate(Total_Revenue=Sum('revenue')).order_by('-Total_Revenue')
CategoryContextAvg = CategoryContext.annotate(Average_Revenue=Avg('revenue')).order_by('-Average_Revenue')
The avg query yields a querylist of objects where the category comes first, followed by the revenue. So basically:
<QuerySet [{'title__categories__category':'Category', 'Average_Revenue':Decimal('100'),}, {'title__categories__category':'Category2':'Average_Revenue':Decimal('120'), }]>
The sum query on the other hand yields the revenue followed by the category, so basically:
<QuerySet [{'Total_Revenue':Decimal('100'), 'title__categories__category':'Category'}, {'Total_Revenue':Decimal('120'), 'title__categories__category':'Category2'}]>
Now I have tried flipping the queries around and changing the variable names so far, but I cannot seem to figure out why in the heck these 2 statements are behaving so differently. Does anybody know what could influence annotation behavior in Django?
Edit:
In case you are wondering why I need to understand this: I am passing the queryset to a method that turns it into data for generating a barchart and the first object in the dataset must be the identifier of the value. I could make it so that it inverts the whole process by checking whether this indeed is the case and ivnerting otherwise, but it seems to me that this shouldnt be necessary

This has little or nothing to do with annotate. Dictionaries in Python have no conventional sense of ordering (at least not until Python 3.6), and keys can be ordered differently across different queryset results.
And this constitutes little or not problem since you'll be access required values by key and not serially (as with sequences):
for obj_dct in your_qs:
print(obj_dct[some_key])
If your plot function takes dicts, no need to worry about ordering.

What is the best way to store set data in Python?

I have a list of data in the following form:
[(id\__1_, description, id\_type), (id\__2_, description, id\_type), ... , (id\__n_, description, id\_type))
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update
based on some of the comments I received let me clarify a little. Most of what I do with data-structure is insert into it. I only read it twice, once to annotate it with additional information,* and once to do be inserted into the database. However down the line there may be additional annotation that is done before I insert into the database. Unfortunately I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash-table (ie. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure or is a hash-table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.

Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b','type1') ],
'id2': [ ('description2', 'type2') ],
...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b' ), 'type1' ),
'id2': ( ('description2',), 'type2' ),
...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.

I'm assuming the problem you try to solve by cutting down on the memory you use is the address space limit of your process. Additionally you search for a data structure that allows you fast insertion and reasonable sequential read out.
Use less structures except strings (str)
The question you ask is how to structure your data in one process to use less memory. The one canonical answer to this is (as long as you still need associative lookups), use as little other structures then python strings (str, not unicode) as possible. A python hash (dictionary) stores the references to your strings fairly efficiently (it is not a b-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure to something that is harder to insert or interprete.
Split your information up in multiple processes, each holding whatever datastructure is convinient for that.
Implement inter process communication with sockets such that processes might reside on other machines altogether.
Try to divide your data such as to minimize inter process communication (i/o is glacially slow compared to cpu cycles).
The advantage of the approach I outline is that
You get to use two ore more cores on a machine fully for performance
You are not limited by the address space of one process, or even the physical memory of one machine
There are numerous packages and aproaches to distributed processing, some of which are
linda
processing

If you're doing an n-way merge with removing duplicates, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
keyPos= 0
for s in sources:
s.sort()
while any( [len(s)>0 for s in sources] ):
topEnum= enumerate([ s[0][keyPos] if len(s) > 0 else None for s in sources ])
top= [ t for t in topEnum if t[1] is not None ]
top.sort( key=lambda a:a[1] )
src, key = top[0]
#print src, key
yield sources[ src ].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
keyPos= 0
seqIter= iter(sequence)
curr= seqIter.next()
for next in seqIter:
if next[keyPos] == curr[keyPos]:
# might want to create a sub-list of matches
continue
yield curr
curr= next
yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.

How about using {id: (description, id_type)} dictionary? Or {(id, id_type): description} dictionary if (id,id_type) is the key.

Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using sets, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hashcode, you'd have to subclass tuple and override the hashcode method:
class ProteinTuple(tuple):
def __new__(cls, m1, m2, m3):
return tuple.__new__(cls, (m1, m2, m3))
def __hash__(self):
return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.

It's still murky, but it sounds like you have some several lists of [(id, description, type)...]
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result= []
while len(source1) > 0 or len(source2) > 0:
if len(source1) == 0:
result.append( source2.pop(0) )
elif len(source2) == 0:
result.append( source1.pop(0) )
elif source1[0][0] < source2[0][0]:
result.append( source1.pop(0) )
elif source2[0][0] < source1[0][0]:
result.append( source2.pop(0) )
else:
# keys are equal
result.append( source1.pop(0) )
# check for source2, to see if the description is different.
This assembles a union of two lists by sorting and merging. No mapping, no hash.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.