In my attempt to understand the Python GIL (in Python 3.7.6), I played with sys.getrefcount() and the results are a bit bizarre.
From the documentation for sys.getrefcount(object)
Return the reference count of the object. The count returned is generally one higher than you might expect, because it includes the (temporary) reference as an argument to getrefcount().
In an attempt to grok it myself, here's the progression/confusion:
Firstly, should sys.getrefcount(object) work on values/literals? (please correct my terminology if I'm wrong), and why are the refcounts so random?
>>> import sys
>>> [sys.getrefcount(i) for i in range (10)]
[320, 195, 128, 43, 69, 24, 32, 18, 44, 17]
>>> [sys.getrefcount(str(i)) for i in range (10)] #refcount of every value is same now (?)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Then, I tried to explore further
>>> # Let's probe further
>>> import random
>>> [sys.getrefcount(str(random.randint(1,20))) for i in range (10)]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> [sys.getrefcount(str(random.randint(1,20)*'a')) for i in range (10)]
[1, 1, 1, 1, 14, 1, 1, 1, 1, 1] # not every item is same
>>> [sys.getrefcount(random.choice('abcde')) for i in range (10)]
[20, 20, 12, 12, 9, 13, 12, 13, 12, 12]
>>> [sys.getrefcount(str(random.choice('abcde'))) for i in range (10)]
[9, 20, 12, 12, 12, 9, 12, 12, 12, 12]
What is going on above? I'm not sure whether all of these behaviors can be explained by a single misunderstanding on my part, or whether there are multiple things at play. You can safely assume that the above lines were run sequentially in the Python interpreter and that nothing else was run in between.
For the question to make more sense, everything began here:
>>> sys.getrefcount(1)
187
>>> a = 1
>>> sys.getrefcount(a)
185
EDIT: I get it all, but why should sys.getrefcount(1) be so high?
Firstly, should sys.getrefcount(object) work on values/literals? (please correct my terminology if I'm wrong)
Yes. The literal is an expression that returns an object. That object might be cached (e.g., small numbers) or not (arbitrary strings).
…and why are the refcounts so random?
Coincidence and idiosyncrasies of the interpreter state.
>>> [sys.getrefcount(i) for i in range (10)]
[320, 195, 128, 43, 69, 24, 32, 18, 44, 17]
Small integer literals are cached in CPython: every occurrence of the same small number refers to one shared object, and every object anywhere in the interpreter (including its internal caches) that holds that value contributes to its refcount. That is why the counts look large and essentially arbitrary.
>>> [sys.getrefcount(str(i)) for i in range (10)] #refcount of every value is same now (?)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
These are brand-new string objects, not looked up in any cache, so almost nothing else references them.
For the rest: short strings are often interned (cached), so the counts reflect however many references to them happened to exist in memory at the time.
>>> sys.getrefcount(1)
187
>>> a = 1
>>> sys.getrefcount(a)
185
a is merely another reference to the (cached) 1 object; whatever small difference there is between the two counts comes from whatever the REPL itself does to hold a reference to the literal and print the result.
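A quick way to see the caching at work is to compare object identities. This is a minimal sketch; the exact refcount numbers will differ from session to session:
>>> import sys
>>> a = 1
>>> b = 1
>>> a is b                # small ints are cached, so a and b name the same object
True
>>> sys.getrefcount(1)    # counts every live reference to that shared object; varies by session
187
>>> s = str(12345)
>>> t = str(12345)
>>> s is t                # each str() call builds a new string object
False
>>> sys.getrefcount(s)    # just our name s plus getrefcount's temporary argument
2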
I was reading an article and came across the piece of code below. I ran it and it worked for me:
x = df.columns
x_labels = [v for v in sorted(x.unique())]
x_to_num = {p[1]:p[0] for p in enumerate(x_labels)}
#till here it is okay. But I don't understand what is going on with this map.
x.map(x_to_num)
The final result from the map is given below:
Int64Index([ 0, 3, 28, 1, 26, 23, 27, 22, 20, 21, 24, 18, 10, 7, 8, 15, 19,
13, 14, 17, 25, 16, 9, 11, 6, 12, 5, 2, 4],
dtype='int64')
Can someone please explain to me how the .map() worked here. I searched online, but could not find anything related.
ps: df is a pandas dataframe.
Let's look at what the map() function does in general in Python.
>>> l = [1, 2, 3]
>>> list(map(str, l))
['1', '2', '3']
Here a list of numeric elements is converted to a list of string elements: map() applies a function to every item of an iterable.
You may have been confused because the general syntax of map (map(MappingFunction, IterableObject)) is not used here, and yet things still work.
The index x plays the role of the IterableObject, while the dictionary x_to_num supplies the mapping and hence plays the role of the MappingFunction: pandas' Index.map() accepts a dict as the mapper, looking each label up as a key and replacing it with the corresponding value.
Edit: this behavior is not specific to this particular DataFrame; x can be any pandas Index or Series, since their .map() accepts a dict (or a callable) as the mapper.
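Here is a minimal, self-contained sketch of the same pattern; the labels 'a', 'b', 'c' are made up, not from the original article:
import pandas as pd

x = pd.Index(['b', 'c', 'a'])                           # stand-in for df.columns
x_labels = [v for v in sorted(x.unique())]              # ['a', 'b', 'c']
x_to_num = {p[1]: p[0] for p in enumerate(x_labels)}    # {'a': 0, 'b': 1, 'c': 2}

print(x.map(x_to_num))   # Int64Index([1, 2, 0], dtype='int64') -- plain Index in newer pandas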
Can you please tell me why S[f'{colors[0]}'] is changing after calling this function, and how to fix it.
S = {"1": list(range(0,5)), "2": list(range(20,25)), "3": list(range(10,15))}
colors = [1, 2 ,3]
def count_bycolor(colors):
countries_bycolor = S[f'{colors[0]}']
for i in range(1, len(colors)):
countries_bycolor.extend(S[f'{colors[i]}'])
return countries_bycolor
count_bycolor(colors)
len(S[f'{colors[0]}'])
count_bycolor(colors)
len(S[f'{colors[0]}'])
Thanks for your help, and happy holidays!
You are performing operations on a list inside a dict. Both of these are mutable objects, and because Python passes object references around (pass-by-object-reference), countries_bycolor and S[f'{colors[0]}'] are the same object, so extending one changes the "original".
This means that you need to make copies of those objects if you want to operate on them without changing the original.
Based on your question it could be as simple as one line change:
import copy

def count_bycolor(colors):
    countries_bycolor = copy.copy(S[f'{colors[0]}'])
    for i in range(1, len(colors)):
        countries_bycolor.extend(S[f'{colors[i]}'])
    return countries_bycolor
count_bycolor(colors)
>>> [0, 1, 2, 3, 4, 20, 21, 22, 23, 24, 10, 11, 12, 13, 14]
S
>>> {'1': [0, 1, 2, 3, 4], '2': [20, 21, 22, 23, 24], '3': [10, 11, 12, 13, 14]}
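Note that copy.copy() makes a shallow copy, which is enough here because the inner lists hold only integers. An equivalent one-line alternative (just a sketch, not the only option) is to build a new list directly:
countries_bycolor = list(S[f'{colors[0]}'])   # or S[f'{colors[0]}'][:]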
I have a dictionary containing a variable number of numpy arrays (all the same length); each array is stored under its own key.
For each index I want to replace the value in one of the arrays with a newly calculated value. (This is a very simplified version of what I'm actually doing.)
The problem is that when I try this as shown below, the value at the current index of every array in the dictionary is replaced, not just the one I specify.
Sorry if the formatting of the example code is confusing, it's my first question here (Don't quite get how to show the line example_dict["key1"][idx] = idx+10 properly indented in the next line of the for loop...).
>>> import numpy as np
>>> example_dict = dict.fromkeys(["key1", "key2"], np.array(range(10)))
>>> example_dict["key1"]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> example_dict["key2"]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for idx in range(10):
...     example_dict["key1"][idx] = idx+10
>>> example_dict["key1"]
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> example_dict["key2"]
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
I expected the loop to only access the array in example_dict["key1"], but somehow the same operation is applied to the array stored in example_dict["key2"] as well.
>>> hex(id(example_dict["key1"]))
'0x26a543ea990'
>>> hex(id(example_dict["key2"]))
'0x26a543ea990'
example_dict["key1"] and example_dict["key2"] are pointing at the same address. To fix this, you can use a dict comprehension.
import numpy
keys = ["key1", "key2"]
example_dict = {key: numpy.array(range(10)) for key in keys}
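A small sketch of the difference (illustrative only; the identities are what matter, not the particular values):
import numpy as np

keys = ["key1", "key2"]

shared = dict.fromkeys(keys, np.array(range(10)))       # one array shared by every key
separate = {key: np.array(range(10)) for key in keys}   # a fresh array per key

print(shared["key1"] is shared["key2"])      # True
print(separate["key1"] is separate["key2"])  # False

separate["key1"][0] = 99
print(separate["key2"][0])                   # 0 -- the other array is untouched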
I think this is a very subtle issue, maybe an unknown bug in Python 2.7. I'm making an interactive application that fits a WLS (weighted least squares) model to a cloud of points. At the beginning, the script reads the data from a text file (just a simple table with indexes, values, and errors for each point). But the data may contain points with a NULL value, marked by nocompl=99.9999. I have to know which points these are so I can reject them before the script starts the fitting. I do this in the following way:
# read the data from input file
Bunchlst = [Bunch(val1=D[:, i], err_val1=D[:, i+1], val2=D[:, i+2], err_val2=D[:, i+3])
            for i in range(1, D.shape[1] - 1, 4)]
# here is the problem
for b in Bunchlst:
    b.compl = list(np.logical_not([1 if nocompl in [im, ie, sm, se] else 0
                                   for im, ie, sm, se in zip(b.val1, b.err_val1, b.val2, b.err_val2)]))
    # fit the model to the "good" points
    wls = sm.WLS(list(compress(b.val1, b.compl)),
                 sm.add_constant(list(compress(b.val2, b.compl)), prepend=False),
                 weights=[1.0/i for i in list(compress(b.err_val2, b.compl))]).fit()
WLS is the weighted least squares model implemented in Python (statsmodels); compress() lets me filter the data (omitting the NULL values). But this case generates the error:
wls = sm.WLS(...).fit()
AttributeError: 'numpy.float64' object has no attribute 'WLS'
I investigated, and when I zip only two lists the problem disappears and the WLS model is computed correctly:
for b in Bunchlst:
    b.compl = list(np.logical_not([1 if nocompl in [v1, v2] else 0 for v1, v2 in zip(b.val1, b.val2)]))
I wrote that it may be a bug because I checked b.compl in both cases: it was always the same list of True/False values (depending on the data in the input file). Moreover, a simple test suggests it should work with many lists:
>>> K = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
>>> L = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> M = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
>>> N = [32, 33, 34, 35, 36, 37, 38, 39, 40, 32]
>>> [1 if 26 in [k,l,m,n] else 0 for k,l,m,n in zip(K,L,M,N)]
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
All the best,
Benek
No, there is no difference in how zip() operates with 2 or more lists. Instead, your list comprehension assigned to the name sm in the loop, while at the same time you used the name sm to reference the statsmodels module.
Your simpler two-list version doesn't do this, so the name sm isn't rebound, and you don't run into the issue.
In Python 2, names used in the list comprehension are part of the local scope:
>>> foo
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'foo' is not defined
>>> [foo for foo in 'bar']
['b', 'a', 'r']
>>> foo
'r'
Here the name foo was set in the for loop of the list comprehension, and the name is still available after the loop.
Either rename your import, or rename your loop variables.
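Here is a minimal sketch of the same clash using a stand-in module name (math) instead of the original statsmodels code; under Python 2 the comprehension variable leaks into the enclosing scope (in Python 3 it would not):
import math as m

values = [1, 2, 3]
squares = [m * m for m in values]   # in Python 2 this rebinds the name m to 3
# m.sqrt(2)                         # would now fail: AttributeError: 'int' object has no attribute 'sqrt'

# Fix: pick a loop variable that doesn't collide with the import
squares = [v * v for v in values]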
I need a number of unique random permutations of a list without replacement, efficiently. My current approach:
total_permutations = math.factorial(len(population))
permutation_indices = random.sample(xrange(total_permutations), k)
k_permutations = [get_nth_permutation(population, x) for x in permutation_indices]
where get_nth_permutation does exactly what it sounds like, efficiently (meaning O(N)). However, this only works for len(population) <= 20, simply because 21! is so mind-bogglingly large that xrange(math.factorial(21)) won't work:
OverflowError: Python int too large to convert to C long
Is there a better algorithm to sample k unique permutations without replacement in O(N)?
Up to a certain point, it's unnecessary to use get_nth_permutation to get permutations. Just shuffle the list!
>>> import random
>>> l = range(21)
>>> def random_permutations(l, n):
... while n:
... random.shuffle(l)
... yield list(l)
... n -= 1
...
>>> list(random_permutations(l, 5))
[[11, 19, 6, 10, 0, 3, 12, 7, 8, 16, 15, 5, 14, 9, 20, 2, 1, 13, 17, 18, 4],
[14, 8, 12, 3, 5, 20, 19, 13, 6, 18, 9, 16, 2, 10, 4, 1, 17, 15, 0, 7, 11],
[7, 20, 3, 8, 18, 17, 4, 11, 15, 6, 16, 1, 14, 0, 13, 5, 10, 9, 2, 19, 12],
[10, 14, 5, 17, 8, 15, 13, 0, 3, 16, 20, 18, 19, 11, 2, 9, 6, 12, 7, 4, 1],
[1, 13, 15, 18, 16, 6, 19, 8, 11, 12, 10, 20, 3, 4, 17, 0, 9, 5, 2, 7, 14]]
The odds are overwhelmingly against duplicates appearing in this list for len(l) > 15 and n < 100000, but if you need guarantees, or for lower values of len(l), just use a set to record and skip duplicates if that's a concern (though as you've observed in your comments, if n gets close to len(l)!, this will stall). Something like:
def random_permutations(l, n):
    pset = set()
    while len(pset) < n:
        random.shuffle(l)
        pset.add(tuple(l))
    return pset
However, as len(l) gets longer and longer, random.shuffle becomes less reliable, because the number of possible permutations of the list increases beyond the period of the random number generator! So not all permutations of l can be generated that way. At that point, not only do you need to map get_nth_permutation over a sequence of random numbers, you also need a random number generator capable of producing every random number between 0 and len(l)! with relatively uniform distribution. That might require you to find a more robust source of randomness.
However, once you have that, the solution is as simple as Mark Ransom's answer.
To understand why random.shuffle becomes unreliable for large len(l), consider the following. random.shuffle only needs to pick random numbers between 0 and len(l) - 1. But it picks those numbers based on its internal state, and it can take only a finite (and fixed) number of states. Likewise, the number of possible seed values you can pass to it is finite. This means that the set of unique sequences of numbers it can generate is also finite; call that set s. For len(l)! > len(s), some permutations can never be generated, because the sequences that correspond to those permutations aren't in s.
What are the exact lengths at which this becomes a problem? I'm not sure. But for what it's worth, the period of the Mersenne Twister, as implemented by the random module, is 2**19937-1. The shuffle docs reiterate my point in a general way; see also what Wikipedia has to say on the matter here.
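As a rough back-of-the-envelope check, you can ask where len(l)! first exceeds that period:
import math

# Find the smallest n with n! > 2**19937 - 1 (the Mersenne Twister's period).
period = 2**19937 - 1
n = 1
while math.factorial(n) <= period:
    n += 1
print(n)   # 2081 -- so shuffling lists of more than ~2080 elements cannot reach every permutation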
Instead of using xrange simply keep generating random numbers until you have as many as you need. Using a set makes sure they're all unique.
permutation_indices = set()
while len(permutation_indices) < k:
    permutation_indices.add(random.randrange(total_permutations))
I had an implementation of nth_permutation lying around (not sure where I got it from), which I modified for your purpose. I believe this will be fast enough to suit your needs:
>>> import math, random
>>> def get_nth_permutation(population):
        total_permutations = math.factorial(len(population))
        while True:
            temp_population = population[:]
            # valid ranks run from 0 to total_permutations - 1
            n = random.randint(0, total_permutations - 1)
            size = len(temp_population)
            def generate(s, n, population):
                # decode the rank n in the factorial number system
                for x in range(s - 1, -1, -1):
                    fact = math.factorial(x)
                    d = n // fact
                    n -= d * fact
                    yield temp_population[d]
                    temp_population.pop(d)
            next_perm = generate(size, n, population)
            yield [e for e in next_perm]
>>> nth_perm = get_nth_permutation(range(21))
>>> [next(nth_perm) for k in range(1,10)]
You seem to be searching for the Knuth Shuffle! Good luck!
You could use itertools.islice instead of xrange():
CPython implementation detail: xrange() is intended to be simple and fast. Implementations may impose restrictions to achieve this. The C implementation of Python restricts all arguments to native C longs ("short" Python integers), and also requires that the number of elements fit in a native C long. If a larger range is needed, an alternate version can be crafted using the itertools module: islice(count(start, step), (stop-start+step-1+2*(step<0))//step).
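A minimal sketch of that recipe with small, made-up numbers, just to show how the expression maps onto (start, stop, step):
from itertools import islice, count

start, stop, step = 0, 10, 3
xrange_like = islice(count(start, step), (stop - start + step - 1 + 2*(step < 0)) // step)
print(list(xrange_like))   # [0, 3, 6, 9] -- same as list(xrange(0, 10, 3))

Note that islice only supports sequential iteration, so it is not a drop-in replacement for random.sample(xrange(total_permutations), k); for very large ranges, drawing indices with random.randrange as in the answer above is still the simpler route.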