This is my code for iterating over "ListOfDocuments", which is a list of over 500,000 dicts. Each of those dicts has around 30 key-value pairs that I need.
for document in ListOfDocuments:
    for field in document:
        if field == "USELESS":
            continue
        ExtraList[AllParameters[field]] = document[field]
    ExtraList[AllParameters["C_Name"]] = filename.split(".")[0]
    AppendingDataframe.loc[len(AppendingDataframe)] = ExtraList
What I'm trying to do is store all possible column names in AllParameters, loop through ListOfDocuments, then loop through each obtained dict, saving each key-value pair into ExtraList, which I finally append to AppendingDataframe.
This approach is extremely slow even on the most powerful of machines, and I know it is not the right way to do it. Any help would be much appreciated.
Edit:
A sample document is a normal key-value dict with over 30 keys, e.g.
{' FKey':12,'Skey':22,'NConfig':'NA','SCHEMA':'CD123...}
And I'd like to extract and store the individual key-value pairs.
Make threads. You can work out how many files you need to look through and split them across, say, 4 threads. This will make the process much faster, as it allows the documents to be read at the same time.
You could start by making a method that accepts a list of files and loops through them. Then you could pass a few sections of the main list to that method and run them in threads. That should provide a decent increase in speed.
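A minimal sketch of that idea, using concurrent.futures and a hypothetical process_documents helper (the function body, chunk count, and names are assumptions, not from the original post):

from concurrent.futures import ThreadPoolExecutor

def process_documents(docs):
    # hypothetical worker: turn one chunk of ListOfDocuments into plain row dicts
    rows = []
    for document in docs:
        rows.append({k: v for k, v in document.items() if k != "USELESS"})
    return rows

# split the main list into 4 roughly equal chunks and process them concurrently
chunk_size = len(ListOfDocuments) // 4 + 1
chunks = [ListOfDocuments[i:i + chunk_size]
          for i in range(0, len(ListOfDocuments), chunk_size)]

with ThreadPoolExecutor(max_workers=4) as pool:
    chunk_results = list(pool.map(process_documents, chunks))

Keep in mind that threads only help if the per-document work releases the GIL (for example file or network I/O); for pure-Python, CPU-bound processing, the multiprocessing approach below tends to scale better.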
You can do this by writing a function that processes a single entry of the list and then using multiprocessing:
import multiprocessing as multi
from multiprocessing import Manager

manager = Manager()
data = manager.list([])

def func(a):        # implement here the function
    data.append(a)  # that processes one dict from the list

p = multi.Pool(processes=16)
p.map(func, ListOfDocuments)
print(data)
Related
I'm using Python with the binance.client wrapper. I'm gathering all of the BTC trade pairs from the exchange and want to create a simple dict of tradepair: price.
I have figured a way to do this but it seems clunky to me and takes a minute or so to run. I'm currently a programming student just getting started with python as well as some other languages.
Is there a better way to do it than this?
def BTCPair():
    BTCPair = []
    BTCPrice = []
    BTCPairAndPrice = {}
    exchange_info = client.get_exchange_info()
    for s in exchange_info['symbols']:
        if 'BTC' in (s['symbol'])[-3:]:
            BTCPair.append(s['symbol'])
            BTCPrice.append(client.get_avg_price(symbol=s['symbol'])['price'])
    for i in range(len(BTCPair)):
        BTCPairAndPrice[BTCPair[i]] = BTCPrice[i]
    return BTCPairAndPrice
I don't understand why you are using two loops; one to put the data into lists, and another to convert those lists into a dictionary - why not just construct the dictionary directly?
You could just construct your dictionary directly using a comprehension:
BTCPairAndPrice = {
    s['symbol']: client.get_avg_price(symbol=s['symbol'])['price']
    for s in exchange_info['symbols']
    if 'BTC' in (s['symbol'])[-3:]
}
The way the dictionary is constructed is unlikely to have a big impact on performance, but not iterating through all the data twice should have an impact if there is a lot of data.
Also consider that contacting a web service takes time, so the calls to the exchange might well be the slowest part.
First of all, for i in range(len(BTCPair)) is an antipattern. You could instead iterate over the two lists zipped together.
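For example, a small sketch of the zip version, reusing the two lists from the question:

for pair, price in zip(BTCPair, BTCPrice):
    BTCPairAndPrice[pair] = price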
But we don't actually need to do that either! Rather than creating two lists and then iterating over them to fill in your dictionary, you could create everything in one go with a dictionary comprehension. Also, a cleaner way to check the end of a string is endswith().
def btc_pair():
    symbols = client.get_exchange_info()['symbols']
    return {
        s['symbol']: client.get_avg_price(symbol=s['symbol'])['price']
        for s in symbols
        if s['symbol'].endswith('BTC')
    }
This probably will run a little bit faster, but I doubt that the dictionary creation itself is the real performance bottleneck in your code.
I'm putting around 4 million different keys into a Python dictionary.
Creating this dictionary takes about 15 minutes and consumes about 4GB of memory on my machine. After the dictionary is fully created, querying the dictionary is fast.
I suspect that dictionary creation is so resource consuming because the dictionary is very often rehashed (as it grows enormously).
Is it possible to create a dictionary in Python with some initial size or bucket count?
My dictionary points from a number to an object.
class MyObject:
    def __init__(self):
        # some fields...
        pass

d = {}
d[i] = MyObject()  # 4M times, on different keys...
With performance issues it's always best to measure. Here are some timings:
import itertools

d = {}
for i in xrange(4000000):
    d[i] = None
# 722ms

d = dict(itertools.izip(xrange(4000000), itertools.repeat(None)))
# 634ms

dict.fromkeys(xrange(4000000))
# 558ms

s = set(xrange(4000000))
dict.fromkeys(s)
# Not including set construction: 353ms
The last option doesn't do any resizing, it just copies the hashes from the set and increments references. As you can see, the resizing isn't taking a lot of time. It's probably your object creation that is slow.
I tried:
a = dict.fromkeys(range(4000000))
It creates a dictionary with 4,000,000 entries in about 3 seconds. After that, setting values is really fast. So I guess dict.fromkeys is definitely the way to go.
If you know C, you can take a look at dictobject.c and the Notes on Optimizing Dictionaries. There you'll notice the parameter PyDict_MINSIZE:
PyDict_MINSIZE. Currently set to 8.
This parameter is defined in dictobject.h, so you could change it when compiling Python, but that is probably a bad idea.
You can try to separate key hashing from the content filling with the dict.fromkeys classmethod. It'll create a dict of a known size with all values defaulting to either None or a value of your choice. After that you can iterate over it to fill in the values. It'll help you to time the actual hashing of all keys. Not sure if you'd be able to significantly increase the speed, though.
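For example, a sketch that reuses the numeric keys and MyObject class from the question:

# pre-create all 4M slots in one go, then fill in the values in a second pass
d = dict.fromkeys(range(4000000))  # every value starts as None
for key in d:
    d[key] = MyObject()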
If your data needs to (or can) be stored on disk, perhaps you can store it in a BSDDB database or use cPickle to load/store your dictionary.
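A minimal sketch of the pickle route (in Python 3 the pickle module plays the role of cPickle; the file name is just an example):

import pickle

# write the dictionary to disk once ...
with open('lookup.pkl', 'wb') as f:
    pickle.dump(d, f, protocol=pickle.HIGHEST_PROTOCOL)

# ... and load it back later without rebuilding it key by key
with open('lookup.pkl', 'rb') as f:
    d = pickle.load(f)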
Do you initialize all keys with new "empty" instances of the same type? Is it not possible to write a defaultdict or something that will create the object when it is accessed?
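For instance, a sketch with collections.defaultdict, so objects are only created when a key is first touched:

from collections import defaultdict

d = defaultdict(MyObject)  # MyObject() is called lazily, on first access of a missing key
obj = d[12345]             # the object for key 12345 is created here, on demand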
Let's suppose I have a for loop. After each loop, I get a result that I want to store in a list.
I want to write something like this but it's not working:
for i in range(50):
    "result_%d" % i = result
Supposing that result is a list containing the results after each loop.
I want to do that in order to have a different list for each result, so I can use them after the loop is finished.
Is there any way to do that?
Note: I thought about storing all the result lists in one big list. But won't that be heavy for the code, given that each result list has a size of 60?
I thought about storing all the result arrays in one big array. But won't that be heavy for the code?
In Python, everything is an object and names and list contents are just references to them. Creating new list objects containing references to existing values is quite lightweight really.
Don't try to create dynamic variables, just store your results in another list or a dictionary.
In this case that's as easy as:
results = []
for i in range(50):
    # do things
    results.append(result)
and results is then a list with 50 references to other list objects. That's no different from having 50 names referencing those 50 list objects, other than that it is much easier to address them now.
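If you prefer name-like access, a dictionary works the same way (a small sketch in the spirit of the answer above):

results = {}
for i in range(50):
    # do things
    results["result_%d" % i] = result

# later: results["result_7"] is the list produced by the 8th iteration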
I have a dict that has unix epoch timestamps for keys, like so:
lookup_dict = {
    1357899: {},  # some dict of data
    1357910: {},  # some other dict of data
}
Except, you know, millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R, like:
lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}
But I confess, I can't find any actual proof that this is something Python can do without having, one way or the other, to iterate over every entry. If I understand Python correctly (and I might not), key lookup of the form key in dict uses binary search and is thus very fast; is there any way to do a binary search on dict keys?
To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first one >= lookup_value, instead of checking each one for >= lookup_value.
If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++, Java, etc.) and blist (which uses a B+Tree). For example, with bintrees, it's as simple as this:
dict_subset = lookup_dict[lookup_value:]
And this will be as efficient as you'd hope—basically, it adds a single O(log N) search on top of whatever the cost of using that subset. (Of course usually what you want to do with that subset is iterate the whole thing, which ends up being O(N) anyway… but maybe you're doing something different, or maybe the subset is only 10 keys out of 1000000.)
Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).
If you want to build this yourself, you can. You don't even necessarily need a tree; just a sorted list of keys alongside a dict. You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap up bisect to make a sorted list object—or, better, get one of the recipes on ActiveState or PyPI to do it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees, if you want.
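A minimal sketch of that do-it-yourself version, using bisect and the lookup_dict from the question:

import bisect

sorted_keys = sorted(lookup_dict)  # keep this list in sync with the dict

def subset_from(lookup_value):
    # binary search for the first key >= lookup_value, then slice from there
    start = bisect.bisect_left(sorted_keys, lookup_value)
    return {k: lookup_dict[k] for k in sorted_keys[start:]}

dict_subset = subset_from(1357900)  # -> {1357910: {...}}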
The following code will work:
some_time_to_filter_for = ...  # blah unix time

# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items()
            if key >= some_time_to_filter_for}
Basically, we just iterate through all the keys in your dictionary and, given a time to filter for, take all the keys that are greater than or equal to that value and place them into our new dictionary.
I have a list of data in the following form:
[(id_1, description, id_type), (id_2, description, id_type), ..., (id_n, description, id_type)]
The data are loaded from files that belong to the same group. In each group there could be multiples of the same id, each coming from different files. I don't care about the duplicates, so I thought that a nice way to store all of this would be to throw it into a Set type. But there's a problem.
Sometimes for the same id the descriptions can vary slightly, as follows:
IPI00110753
Tubulin alpha-1A chain
Tubulin alpha-1 chain
Alpha-tubulin 1
Alpha-tubulin isotype M-alpha-1
(Note that this example is taken from the uniprot protein database.)
I don't care if the descriptions vary. I cannot throw them away because there is a chance that the protein database I am using will not contain a listing for a certain identifier. If this happens I will want to be able to display the human readable description to the biologists so they know roughly what protein they are looking at.
I am currently solving this problem by using a dictionary type. However I don't really like this solution because it uses a lot of memory (I have a lot of these ID's). This is only an intermediary listing of them. There is some additional processing the ID's go through before they are placed in the database so I would like to keep my data-structure smaller.
I have two questions really. First, will I get a smaller memory footprint using the Set type (over the dictionary type) for this, or should I use a sorted list where I check every time I insert into the list to see if the ID exists, or is there a third solution that I haven't thought of? Second, if the Set type is the better answer how do I key it to look at just the first element of the tuple instead of the whole thing?
Thank you for reading my question,
Tim
Update: Based on some of the comments I received, let me clarify a little. Most of what I do with the data structure is insert into it. I only read it twice: once to annotate it with additional information,* and once to insert it into the database. However, down the line there may be additional annotation before I insert into the database. Unfortunately I don't know if that will happen at this time.
Right now I am looking into storing this data in a structure that is not based on a hash table (i.e. a dictionary). I would like the new structure to be fairly quick on insertion, but reading it can be linear since I only really do it twice. I am trying to move away from the hash table to save space. Is there a better structure, or is a hash table about as good as it gets?
*The information is a list of Swiss-Prot protein identifiers that I get by querying uniprot.
Sets don't have keys. The element is the key.
If you think you want keys, you have a mapping. More-or-less by definition.
Sequential list lookup can be slow, even using a binary search. Mappings use hashes and are fast.
Are you talking about a dictionary like this?
{ 'id1': [ ('description1a', 'type1'), ('description1b','type1') ],
'id2': [ ('description2', 'type2') ],
...
}
This sure seems minimal. ID's are only represented once.
Perhaps you have something like this?
{ 'id1': ( ('description1a', 'description1b' ), 'type1' ),
'id2': ( ('description2',), 'type2' ),
...
}
I'm not sure you can find anything more compact unless you resort to using the struct module.
I'm assuming the problem you are trying to solve by cutting down on memory is the address space limit of your process. Additionally, you are looking for a data structure that allows fast insertion and reasonably fast sequential read-out.
Use few structures other than strings (str)
The question you ask is how to structure your data in one process to use less memory. The one canonical answer to this (as long as you still need associative lookups) is: use as few structures other than Python strings (str, not unicode) as possible. A Python hash (dictionary) stores the references to your strings fairly efficiently (it is not a B-tree implementation).
However I think that you will not get very far with that approach, since what you face are huge datasets that might eventually just exceed the process address space and the physical memory of the machine you're working with altogether.
Alternative Solution
I would propose a different solution that does not involve changing your data structure to something that is harder to insert or interpret.
Split your information up across multiple processes, each holding whatever data structure is convenient for that.
Implement inter-process communication with sockets so that processes might reside on other machines altogether.
Try to divide your data so as to minimize inter-process communication (I/O is glacially slow compared to CPU cycles); see the sketch after the package list below.
The advantage of the approach I outline is that
You get to use two or more cores on a machine fully for performance
You are not limited by the address space of one process, or even the physical memory of one machine
There are numerous packages and approaches to distributed processing, some of which are
linda
processing
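A minimal sketch of the sharding idea described above, using the standard-library multiprocessing module (the successor of the processing package listed here); the example records, the shard count, and the routing scheme are assumptions for illustration, and real sockets would be needed to span machines:

import multiprocessing as mp

NUM_SHARDS = 4

def shard_worker(inbox):
    local = {}  # this process's share of the id data
    for rec_id, description, rec_type in iter(inbox.get, None):  # None = shutdown signal
        local.setdefault(rec_id, (description, rec_type))
    # ... annotate / write this shard to the database from here ...

if __name__ == '__main__':
    records = [  # stand-in for the real merged lists of (id, description, id_type)
        ('IPI00110753', 'Tubulin alpha-1A chain', 'protein'),
        ('IPI00110753', 'Alpha-tubulin 1', 'protein'),
    ]
    queues = [mp.Queue() for _ in range(NUM_SHARDS)]
    workers = [mp.Process(target=shard_worker, args=(q,)) for q in queues]
    for w in workers:
        w.start()
    for rec_id, description, rec_type in records:
        # route each record to a shard by its id, so duplicates land in the same process
        queues[hash(rec_id) % NUM_SHARDS].put((rec_id, description, rec_type))
    for q in queues:
        q.put(None)  # tell each worker to finish
    for w in workers:
        w.join()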
If you're doing an n-way merge with duplicate removal, the following may be what you're looking for.
This generator will merge any number of sources. Each source must be a sequence.
The key must be in position 0. It yields the merged sequence one item at a time.
def merge( *sources ):
    keyPos = 0
    for s in sources:
        s.sort()
    while any( [len(s) > 0 for s in sources] ):
        topEnum = enumerate([ s[0][keyPos] if len(s) > 0 else None for s in sources ])
        top = [ t for t in topEnum if t[1] is not None ]
        top.sort( key=lambda a: a[1] )
        src, key = top[0]
        # print src, key
        yield sources[ src ].pop(0)
This generator removes duplicates from a sequence.
def unique( sequence ):
    keyPos = 0
    seqIter = iter(sequence)
    curr = seqIter.next()
    for next in seqIter:
        if next[keyPos] == curr[keyPos]:
            # might want to create a sub-list of matches
            continue
        yield curr
        curr = next
    yield curr
Here's a script which uses these functions to produce a resulting sequence which is the union of all the sources with duplicates removed.
for u in unique( merge( source1, source2, source3, ... ) ):
    print u
The complete set of data in each sequence must exist in memory once because we're sorting in memory. However, the resulting sequence does not actually exist in memory. Indeed, it works by consuming the other sequences.
How about using an {id: (description, id_type)} dictionary? Or an {(id, id_type): description} dictionary, if (id, id_type) is the key?
Sets in Python are implemented using hash tables. In earlier versions, they were actually implemented using dictionaries, but that has changed AFAIK. The only thing you save by using a set would then be the size of a pointer for each entry (the pointer to the value).
To use only a part of a tuple for the hash code, you'd have to subclass tuple and override the __hash__ method:
class ProteinTuple(tuple):
    def __new__(cls, m1, m2, m3):
        return tuple.__new__(cls, (m1, m2, m3))

    def __hash__(self):
        return hash(self[0])
Keep in mind that you pay for the extra function call to __hash__ in this case, because otherwise it would be a C method.
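A brief usage sketch with made-up values; note that set membership also consults __eq__, which here still compares the whole tuple, so two entries sharing an id but differing in description stay distinct unless __eq__ is overridden as well:

entries = set()
entries.add(ProteinTuple('IPI00110753', 'Tubulin alpha-1A chain', 'protein'))
entries.add(ProteinTuple('IPI00110753', 'Alpha-tubulin 1', 'protein'))
len(entries)  # 2: same hash bucket, but __eq__ still sees two different tuples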
I'd go for Constantin's suggestions and take out the id from the tuple and see how much that helps.
It's still murky, but it sounds like you have several lists of [(id, description, type), ...].
The id's are unique within a list and consistent between lists.
You want to create a UNION: a single list, where each id occurs once, with possibly multiple descriptions.
For some reason, you think a mapping might be too big. Do you have any evidence of this? Don't over-optimize without actual measurements.
This may be (if I'm guessing correctly) the standard "merge" operation from multiple sources.
source1.sort()
source2.sort()
result = []
while len(source1) > 0 or len(source2) > 0:
    if len(source1) == 0:
        result.append( source2.pop(0) )
    elif len(source2) == 0:
        result.append( source1.pop(0) )
    elif source1[0][0] < source2[0][0]:
        result.append( source1.pop(0) )
    elif source2[0][0] < source1[0][0]:
        result.append( source2.pop(0) )
    else:
        # keys are equal
        result.append( source1.pop(0) )
        # check source2, to see if the description is different.
This assembles a union of two lists by sorting and merging. No mapping, no hash.