Clean way of using a python dictionary to hold program statistics

I often write short programs that collect statistics while they run and report them at the end. I normally gather these stats in a dictionary to display when the run finishes.
I end up writing these like the simple example below, but I expect there is a cleaner, more Pythonic way to do this. These structures can grow quite large (or nested) when there are several metrics.
stats = {}

def add_result_to_stats(result, func_name):
    if not func_name in stats.keys():
        stats[func_name] = {}
    if not result in stats[func_name].keys():
        stats[func_name][result] = 1
    else:
        stats[func_name][result] += 1

You could combine defaultdict with Counter, which reduces add_result_to_stats to one line:
from collections import defaultdict, Counter

stats = defaultdict(Counter)

def add_result_to_stats(result, func_name):
    stats[func_name][result] += 1

add_result_to_stats('foo', 'bar')
print(stats)  # defaultdict(<class 'collections.Counter'>, {'bar': Counter({'foo': 1})})
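If the nested defaultdict/Counter repr is too noisy for the final report, one quick sketch (not part of the original answer) converts it to plain dicts for display:

# Sketch: flatten the nested defaultdict/Counter into plain dicts for printing
plain_stats = {name: dict(counter) for name, counter in stats.items()}
print(plain_stats)  # {'bar': {'foo': 1}}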

If you just need to count (func_name, result) pairs, go with a single Counter keyed on tuples:
import collections

stats = collections.Counter()

def add_result_to_stats(result, func_name):
    stats.update({(func_name, result): 1})
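A quick usage sketch (the calls and output here are assumed, not from the original answer):

add_result_to_stats('ok', 'parse')
add_result_to_stats('ok', 'parse')
add_result_to_stats('error', 'fetch')
print(stats)  # Counter({('parse', 'ok'): 2, ('fetch', 'error'): 1})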

Related

How can I auto-initialise key-value in python dictionary if key-value pair does not exist, without doing IF & ELSE?

How do I automatically initialize a dictionary value to 1 if the key does not exist yet (I have no idea in advance which keys will be used), and just increment it if it does exist? I guess this concept can be used for other logic too.
Example:
The code below will raise an error because char_counts[char] is not initialised yet for some characters. And I have no way of initialising it up front, as I don't know which keys I will use at the start. (Side question, if I did know them: is there a convenient way in Python to initialise several key-value pairs in one shot, aside from looping?)
Anyway, my main question is below.
for ~some loop~:
    char_counts[char] += 1
This is my current workaround, which seems a little lengthy for such a simple operation. Is there a better way to streamline/shorten this?
for ~some loop~:
    if char_counts.get(char, None):
        char_counts[char] += 1
    else:
        char_counts[char] = 1
Thanks in advance!
Use defaultdict from Python's collections module:
from collections import defaultdict

char_counts = defaultdict(lambda: 0)

for ~some loop~:
    char_counts[char] += 1
defaultdict will never throw a KeyError; if a key is missing, it is initialized to 0 (here via the lambda; defaultdict(int) is equivalent).
You can use the get method of a dictionary object. If the key is not present in the dictionary, get returns the default value you pass, which is the value you want missing entries to start from.
char_counts = {}

for ~some loop~:
    # If the key is not present, 0 is returned by default
    char_counts[char] = char_counts.get(char, 0) + 1
Once you understand how the above code works, read about defaultdict in the collections module and try the code below.
from collections import defaultdict

char_counts = defaultdict(int)

for ~some loop~:
    char_counts[char] += 1
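For a concrete run, here is a minimal sketch with an assumed sample string standing in for the loop:

from collections import defaultdict

char_counts = defaultdict(int)
for char in "banana":  # assumed sample input
    char_counts[char] += 1
print(dict(char_counts))  # {'b': 1, 'a': 3, 'n': 2}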

Using python multiprocessing on a for loop that appends results to dictionary

So I've looked at the documentation of the multiprocessing module and also at the other questions asked here, and none seem similar to my case, hence I started a new question.
For simplicity, I have a piece of code of the form:
import pandas as pd
from collections import defaultdict

# simple dataframe of some users and their properties.
data = {'userId': [1, 2, 3, 4],
        'property': [12, 11, 13, 43]}
df = pd.DataFrame.from_dict(data)

# a function that generates permutations of the above users, in the form of a list of lists
# such as [[1,2,3,4], [2,1,3,4], [2,3,4,1], [2,4,1,3]]
user_perm = generate_permutations(nr_perm=4)

# a function that computes some relation between users
def comp_rel(df, permutation, user_dict):
    df1 = df.userId.isin(permutation[0])
    df2 = df.userId.isin(permutation[1])
    user_dict[permutation[0]] += permutation[1]
    return user_dict

# and finally a loop:
user_dict = defaultdict(int)
for permutation in user_perm:
    user_dict = comp_rel(df, permutation, user_dict)
I know this code makes very little (if any) sense right now, but I just wrote a small example that is close to the structure of the actual code that I am working on. That user_dict should finally contain userIds and some value.
I have the actual code, and it works fine, gives the correct dict and everything, but... it runs on a single thread. And it's painfully slow, keeping in mind that I have another 15 threads totally free.
My question is, how can I use Python's multiprocessing module to change the last for loop so it can run on all the threads/cores available? I looked at the documentation; it's not very easy to understand.
EDIT: I am trying to use pool as:
p = multiprocessing.Pool(multiprocessing.cpu_count())
p.map(comp_rel(df, permutation, user_dict), user_perm)
p.close()
p.join()
however this breaks because I am using the line:
user_dict = comp_rel(df, permutation, user_dict)
in the initial code, and I don't know how these dictionaries should be merged after the pool is done.
After a short discussion in the comments, I've decided to post a solution using ProcessPoolExecutor:
import concurrent.futures
from collections import defaultdict

def comp_rel(df, perm):
    ...
    return perm[0], perm[1]

user_dict = defaultdict(int)
with concurrent.futures.ProcessPoolExecutor() as executor:
    futures = {executor.submit(comp_rel, df, perm): perm for perm in user_perm}
    for future in concurrent.futures.as_completed(futures):
        try:
            k, v = future.result()
        except Exception as e:
            print(f"{futures[future]} throws {e}")
        else:
            user_dict[k] += v
It works the same as #tzaman's answer, but it gives you the possibility to handle exceptions. There are also more interesting features in this module; check the docs.
There are two parts to your comp_rel that need to be separated. The first is the calculation itself, which computes some value for some userId. The second is the "accumulation" step, which adds that value to the user_dict result.
You can separate the calculation so that it returns a tuple of (id, value), farm it out with multiprocessing, and then accumulate the results afterwards in the main process:
from multiprocessing import Pool
from functools import partial
from collections import defaultdict

# We make this a pure function that just returns a result instead of mutating anything
def comp_rel(df, perm):
    ...
    return perm[0], perm[1]

comp_with_df = partial(comp_rel, df)  # df is always the same, so factor it out

with Pool(None) as pool:  # Pool(None) uses cpu_count automatically
    results = pool.map(comp_with_df, user_perm)

# Now add up the results at the end:
user_dict = defaultdict(int)
for k, v in results:
    user_dict[k] += v
Alternatively you could also pass a Manager().dict() object into the processing function directly (see the sketch at the end of this answer), but that's a little more complicated and likely won't get you any additional speed.
Based on #Masklinn's suggestion, here's a slightly better way to do it to avoid memory overhead:
user_dict = defaultdict(int)
with Pool(None) as pool:
    for k, v in pool.imap_unordered(comp_with_df, user_perm):
        user_dict[k] += v
This way we add up the results as they complete, instead of having to store them all in an intermediate list first.
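For completeness, here is a minimal sketch of the Manager().dict() alternative mentioned above. The comp_rel_shared helper is hypothetical, and note that the read-modify-write on the shared dict is not atomic, so real code would need a Lock:

from multiprocessing import Manager, Pool

def comp_rel_shared(perm, shared_dict):
    # hypothetical worker: writes straight into the managed dict
    key, value = perm[0], perm[1]
    shared_dict[key] = shared_dict.get(key, 0) + value  # not atomic!

with Manager() as manager:
    shared_dict = manager.dict()
    with Pool(None) as pool:
        pool.starmap(comp_rel_shared, [(p, shared_dict) for p in user_perm])
    user_dict = dict(shared_dict)  # copy out before the manager shuts down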

How to use defaultdict in dict.fromkeys?

I want to build a histogram of a property value (depth here) for 3 different samples with one dictionary.
SamplesList = ('Sa','Sb','Sc')
from collections import defaultdict
DepthCnt = dict.fromkeys(SamplesList, defaultdict(int))
This code makes DepthCnt contain three references to one and the same defaultdict(int), so I cannot count the samples separately.
How can I do it right?
It is OK to use either DepthCnt[sample][depth] or DepthCnt[depth][sample].
I tested these 3 ways:
from collections import defaultdict
DepthCnt = {key:defaultdict(int) for key in SamplesList}
yDepthCnt = defaultdict(lambda: defaultdict(int))
from collections import Counter
cDepthCnt = {key:Counter() for key in SamplesList}
The memory sizes are:
DepthCnt[sample][depth]: 993487
yDepthCnt[depth][sample]: 1953307
cDepthCnt[sample][depth]: 994207
It seems good to change to Counter().
Use a dictionary comprehension (also known as a dict display):
DepthCnt = {key:defaultdict(int) for key in SamplesList}
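A quick usage sketch (sample values assumed) showing that each sample now has its own independent counter:

from collections import defaultdict

SamplesList = ('Sa', 'Sb', 'Sc')
DepthCnt = {key: defaultdict(int) for key in SamplesList}

DepthCnt['Sa'][10] += 1  # hypothetical depth reading
DepthCnt['Sb'][10] += 2
print(DepthCnt['Sa'][10], DepthCnt['Sb'][10])  # 1 2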
It sounds like you're trying to count occurrences of samples in SamplesList. If so, you're looking for a collections.Counter.
Given:
SamplesList = ('Sa','Sb','Sc')
Counter:
from collections import Counter
DepthCnt = Counter(SamplesList)
print(DepthCnt)
#Counter({'Sc': 1, 'Sa': 1, 'Sb': 1})
Edit:
You can always use a counter instead of a defaultdict as well
DepthCnt = {key:Counter() for key in SamplesList}
print(DepthCnt)
#DepthCnt = {'Sa': Counter(), 'Sb': Counter(), 'Sc': Counter()}
P.S.
If you're working over a large dataset, also take a look at the Counter class. Counter and defaultdict are similar; below is the TL;DR from this great answer to a question on collections.Counter vs defaultdict(int):
Counter supports most of the operations you can do on a multiset. So, if you want to use those operations, go for Counter.
Counter won't add new keys to the dict when you query for missing keys. So, if your queries include keys that may not be present in the dict, better use Counter.
Counter also has a method called most_common that allows you to sort items by their count. To get the same thing with a defaultdict you'd have to use sorted.
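For instance, a small most_common sketch (the depth readings are assumed):

from collections import Counter

depths = Counter([10, 10, 12, 10, 12, 15])  # hypothetical depth readings
print(depths.most_common(2))  # [(10, 3), (12, 2)]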

How can you return a default value instead of a key error when accessing a multi-dimensional dictionary in python?

I'm trying to get a value out of a multi-dimensional dictionary, which looks for example like this:
count = {'animals': {'dogs': {'chihuahua': 23}}}
So if I want to know how many chihuahuas I've got, I print count['animals']['dogs']['chihuahua'].
But I want to access count['vehicles']['cars']['vw golf'] too, and instead of KeyErrors I want to get 0 back.
Currently I'm doing this:
if not 'vehicles' in count:
    count['vehicles'] = {}
if not 'cars' in count['vehicles']:
    count['vehicles']['cars'] = {}
if not 'vw golf' in count['vehicles']['cars']:
    count['vehicles']['cars']['vw golf'] = 0
How can I do this better?
I'm thinking of some type of class that inherits from dict, but that's just an idea.
You can just do:
return count.get('vehicles', {}).get('cars', {}).get('vw golf', 0)
basically, return an empty dictionary if not found, and get the count at the end.
This works assuming the dataset is in the specified format. It will not raise errors, though you might have to tweak it for other data types.
Demo
>>> count = {'animals': {'dogs': {'chihuahua': 23}}}
>>> count.get('vehicles', {}).get('cars', {}).get('vw golf', 0)
0
>>> count = {'vehicles': {'cars': {'vw golf': 100}}}
>>> count.get('vehicles', {}).get('cars', {}).get('vw golf', 0)
100
>>>
Use a combination of collections.defaultdict and collections.Counter:
from collections import Counter
from collections import defaultdict
counts = defaultdict(lambda: defaultdict(Counter))
Usage:
>>> counts['animals']['dogs']['chihuahua'] = 23
>>> counts['vehicles']['cars']['vw golf'] = 100
>>>
>>> counts['animals']['dogs']['chihuahua']
23
>>> # No fancy cars yet, Counter defaults to 0
... counts['vehicles']['cars']['porsche']
0
>>>
>>> # No bikes yet, empty counter
... counts['vehicles']['bikes']
Counter()
The lambda in the construction of the defaultdict is needed because defaultdict expects a factory. So lambda: defaultdict(Counter) creates a function that returns defaultdict(Counter) when called, which is what's required to create the multi-dimensional dictionary you described:
A dictionary whose values default to a dictionary whose values default to an instance of Counter.
The advantage of this solution is that you don't have to keep track of which categories you already defined. You can simply assign two new categories and a new count in one go, and use the same syntax to add a new count for existing categories:
>>> counts['food']['fruit']['bananas'] = 42
>>> counts['food']['fruit']['apples'] = 3
(This assumes that you'll always want exactly three dimensions to your data structure, the first two being category dictionaries and the third being a Counter where the actual counts of things will be stored).
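One caveat worth knowing, shown in a small sketch below: unlike a Counter, the defaultdict layers create entries as a side effect of a mere lookup, so exploratory reads mutate the structure.

from collections import Counter, defaultdict

counts = defaultdict(lambda: defaultdict(Counter))
counts['vehicles']['cars']['porsche']  # lookup only, returns 0
print('vehicles' in counts)  # True: the lookup created the empty layers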

return output of dictionary to alphabetical order

The following code prints out each word in the txt file and then how many instances there are of that word (e.g. a, 26). The problem is that it doesn't print them out in alphabetical order. Any help would be much appreciated.
import re

def print_word_counts(filename):
    s = open(filename).read()
    words = re.findall('[a-zA-Z]+', s)
    e = [x.lower() for x in words]
    e.sort()
    from collections import Counter
    dic = Counter(e)
    for key, value in dic.items():
        print(key, value)

print_word_counts('engltreaty.txt')
You just need to sort the items. The builtin sorted should work wonderfully:
for key, value in sorted(dic.items()):
    ...
If you drop the e.sort() line, this should run in approximately the same amount of time. The reason the original doesn't work is that dictionaries are based on hash tables, which store items in order of their hash values (with some extra complexity when hash collisions occur). Since the hashing function is never specified anywhere, you can't count on a dictionary keeping any order you try to give it; the order is implementation- and version-dependent. (Note that since Python 3.7, plain dicts do preserve insertion order, but alphabetical output still requires sorting.) For other simple cases, the collections module has a dict subclass, OrderedDict, which does keep insertion order; however, that won't really help you here.
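A minimal sketch of the sorted output (toy data assumed):

from collections import Counter

dic = Counter(['the', 'and', 'the', 'a'])  # hypothetical words
for key, value in sorted(dic.items()):
    print(key, value)
# a 1
# and 1
# the 2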
Note that Counter is a subclass of dict, so sorting before you add to the Counter:
e.sort()
dic = Counter(e)
won't achieve ordered output.
import re
from collections import Counter

def print_word_counts(filename):
    c = Counter()
    with open(filename) as f:  # the with block closes the file at the end
        for line in f:  # go line by line, don't load it all into memory at once
            c.update(w.lower() for w in re.findall('[a-zA-Z]+', line))
    for k, v in sorted(c.items()):  # sorts alphabetically by word
        print(k, v)

print_word_counts('engltreaty.txt')
