This particular way of using .map() in Python

I was reading an article and came across the piece of code below. I ran it and it worked for me:
x = df.columns
x_labels = [v for v in sorted(x.unique())]
x_to_num = {p[1]:p[0] for p in enumerate(x_labels)}
# till here it is okay, but I don't understand what is going on with this map:
x.map(x_to_num)
The final result from the map is given below:
Int64Index([ 0, 3, 28, 1, 26, 23, 27, 22, 20, 21, 24, 18, 10, 7, 8, 15, 19,
13, 14, 17, 25, 16, 9, 11, 6, 12, 5, 2, 4],
dtype='int64')
Can someone please explain how .map() worked here? I searched online but could not find anything related.
PS: df is a pandas DataFrame.

Let's look at what the built-in map() function does in Python:
>>> l = [1, 2, 3]
>>> list(map(str, l))
# ['1', '2', '3']
Here a list of numeric elements is converted into a list of string elements: map applies the given function to every element of an iterable.
You probably got confused because the general syntax of map (map(MappingFunction, IterableObject)) is not used here, yet things still work.
The variable x takes the role of the IterableObject, while the dictionary x_to_num contains the mapping and hence takes the role of the MappingFunction.
Edit: the dictionary lookup is a convenience of pandas here -- Index.map (like Series.map) accepts a dict and replaces each element with the value it maps to. The built-in map() only accepts a callable, so with a plain iterable you would pass x_to_num.get instead.
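A minimal sketch of both behaviours, using a small made-up index in place of df.columns:
import pandas as pd

x = pd.Index(["b", "c", "a"])
x_to_num = {label: i for i, label in enumerate(sorted(x.unique()))}

# pandas' .map() accepts a dict and looks each element up in it
print(x.map(x_to_num))             # Index([1, 2, 0], dtype='int64')

# the built-in map() needs a callable, so pass the dict's .get method
print(list(map(x_to_num.get, x)))  # [1, 2, 0]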

Related

Replace entry in specific numpy array stored in dictionary

I have a dictionary containing a variable number of numpy arrays (all same length), each array is stored in its respective key.
For each index I want to replace the value in one of the arrays with a newly calculated value. (This is a very simplified version of what I'm actually doing.)
The problem is that when I try this as shown below, the value at the current index of every array in the dictionary is replaced, not just the one I specify.
Sorry if the formatting of the example code is confusing; it's my first question here (I don't quite get how to show the line example_dict["key1"][idx] = idx+10 properly indented inside the for loop...).
>>> import numpy as np
>>> example_dict = dict.fromkeys(["key1", "key2"], np.array(range(10)))
>>> example_dict["key1"]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> example_dict["key2"]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> for idx in range(10):
...     example_dict["key1"][idx] = idx+10
>>> example_dict["key1"]
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
>>> example_dict["key2"]
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
I expected the loop to only access the array in example_dict["key1"], but somehow the same operation is applied to the array stored in example_dict["key2"] as well.
>>> hex(id(example_dict["key1"]))
'0x26a543ea990'
>>> hex(id(example_dict["key2"]))
'0x26a543ea990'
example_dict["key1"] and example_dict["key2"] are pointing at the same address. To fix this, you can use a dict comprehension.
import numpy
keys = ["key1", "key2"]
example_dict = {key: numpy.array(range(10)) for key in keys}
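A quick check that the comprehension gives each key its own array:
>>> example_dict = {key: numpy.array(range(10)) for key in keys}
>>> example_dict["key1"] is example_dict["key2"]
False
>>> example_dict["key1"][0] = 10
>>> example_dict["key2"]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])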

Python: I want to avoid using the 'copy' module

(The code below is not working code; it's just to convey the idea.)
My intention is to recursively call sumsubset(arr.remove(el), num, org), passing the updated arr.
However, this would remove the element from the original arr as well, which causes errors. So I often fall back on the copy module, which feels somewhat awkward.
Is there a better way to pass on an updated arr without using the copy module?
Thanks for answering my first question.
arr = [1, 1, 2, 4, 4, 4, 7, 9, 9, 13, 13, 13, 15, 15, 16, 16, 16, 19, 19, 20]
num = 36

import copy

def sumsubset(arr, num, org):
    for el in arr:
        if el == org:
            return [el]
        tmp = copy.copy(arr)
        tmp.remove(el)
        result = [el] + sumsubset(tmp, num - el, org)
        return result

a = sumsubset(arr, 36, 36)
tmp = arr[:]
or
tmp = list(arr)
Both create a new list object (a shallow copy) rather than just copying the reference to the original arr.
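A minimal sketch showing that the slice copy leaves the original list untouched:
>>> arr = [1, 2, 3]
>>> tmp = arr[:]
>>> tmp.remove(2)
>>> arr
[1, 2, 3]
>>> tmp
[1, 3]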

Most efficient way to iterate through list of lists

I'm currently collecting data from Quandl and saving it as a list of lists. A list looks something like this (price data):
['2', 1L, datetime.date(1998, 1, 2), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), '82.1900', '83.6200', '81.7500', '83.5000', '28.5183', 1286500.0]
This is typically 1 of about 5000 lists, and every once in a while Quandl will spit back some NaN values that can't be saved into the database.
['2', 1L, datetime.date(1998, 1, 2), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), datetime.datetime(2016, 9, 26, 1, 35, 3, 830563), 'nan', 'nan', 'nan', 'nan', 'nan', 0]
What would be the most efficient way of iterating through the list of lists to change 'nan' values into zeros?
I know I could do something like this, but it seems rather inefficient. This operation will need to be performed on 11 different values * 5000 different dates * 500 companies:
def screen_data(data):
    new_data = []
    for d in data:
        new_list = []
        for x in d:
            new_value = x
            if math.isnan(x):
                new_value = 0
            new_list.append(new_value)
        new_data.append(new_list)
    return new_data
I would be interested in any solution that could reduce the time. I know DataFrames might work, but I'm not sure how they would solve the NaN issue.
Or if there is a way to include NaN values in an SQLServer5.6 database along with floats, changing the database is also a viable option.
Don't create a new list - rather, edit the old list in-place:
import math
def screenData(L):
    for subl in L:
        for i, n in enumerate(subl):
            if math.isnan(n):
                subl[i] = 0
The only way I can think of to make this faster would be with multiprocessing.
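For instance, a minimal sketch of handing the rows to a process pool (the isinstance guard for mixed-type rows and the chunksize are my own assumptions, not from the question):
import math
from multiprocessing import Pool

def clean_row(row):
    # replace NaN floats with 0; leave strings, dates, etc. untouched
    return [0 if isinstance(x, float) and math.isnan(x) else x for x in row]

if __name__ == "__main__":
    data = [[82.19, float("nan"), 1286500.0]] * 5000
    with Pool() as pool:
        cleaned = pool.map(clean_row, data, chunksize=500)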
I haven't timed it, but have you tried a nested list comprehension with conditional expressions?
For example:
import datetime
data = [
    ['2', 1, datetime.date(1998, 1, 2),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     '82.1900', '83.6200', '81.7500', '83.5000',
     '28.5183', 1286500.0],
    ['2', 1, datetime.date(1998, 1, 2),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     datetime.datetime(2016, 9, 26, 1, 35, 3, 830563),
     'nan', 'nan', 'nan', 'nan', 'nan', 0],
]
new_data = [[y if str(y).lower() != 'nan' else 0 for y in x] for x in data]
print(new_data)
I did not use math.isnan(y) because you would have to be sure that y is a float, or you'll get an error. That is much harder to guarantee, whereas almost everything has a string representation. I also made the comparison lowercase (with .lower()), since 'NaN' or 'Nan' are also legal ways to express "Not a Number".
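A quick illustration of that pitfall (the exact error message varies by Python version):
>>> import math
>>> math.isnan(float('nan'))
True
>>> math.isnan('nan')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be real number, not str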
How about this:
import math
def clean_nan(data_list, value=0):
    for i, x in enumerate(data_list):
        if math.isnan(x):
            data_list[i] = value
    return data_list
(The return is optional, since the modification is made in-place, but it is needed if the function is used with map or similar -- assuming, of course, that data_list really is a list or similar mutable container.)
Depending on how you get your data and how you work with it, usage will vary. For instance, if you do something like this:
for data in (my database/Quandl/whatever):
    #do stuff with data
you can change it to
for data in (my database/Quandl/whatever):
    clean_nan(data)
    #do stuff with data
or use map (or itertools.imap if you are on Python 2):
for data in map(clean_nan, (my database/Quandl/whatever)):
    #do stuff with data
That way you process each row as soon as it arrives from the database/Quandl/whatever -- provided the source behaves like a generator and doesn't hand you the whole dataset at once. If it doesn't, try to change it to a generator where possible. Either way, you get to start working with your data as soon as possible.
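A sketch of that lazy pattern, with a made-up fetch_rows() generator standing in for the data source (the rows here are plain floats, so clean_nan's math.isnan check applies):
def fetch_rows():
    # hypothetical stand-in for the database/Quandl call;
    # yields rows one at a time instead of building the full list up front
    yield [82.19, float('nan'), 1286500.0]
    yield [83.62, 81.75, float('nan')]

for data in map(clean_nan, fetch_rows()):
    print(data)  # each row arrives already cleaned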

Is there any difference between zipping two and more than two lists?

I think this is a very subtle issue, maybe an unknown bug in Python 2.7. I'm making an interactive application. It should fit a WLS (Weighted Linear Regression) model to a cloud of points. At the beginning the script reads the data from a text file (just a simple table with indexes, values and errors of each point). But the data may contain some points with a NULL value, marked by nocompl=99.9999. I have to know which points these are, so I can reject them before the script starts the fitting. I do this in the following way:
# read the data from input file
Bunchlst = [Bunch(val1 = D[:,i], err_val1 = D[:,i+1], val2 = D[:,i+2], err_val2 = D[:,i+3]) for i in range(1, D.shape[1] - 1, 4)]
# here is the problem
for b in Bunchlst:
    b.compl = list(np.logical_not([1 if nocompl in [im,ie,sm,se] else 0 for im,ie,sm,se in zip(b.val1,b.err_val1,b.val2,b.err_val2)]))
# fit the model to the "good" points
wls = sm.WLS(list(compress(b.val1,b.compl)), sm.add_constant(list(compress(b.val2,b.compl)), prepend=False), weights=[1.0/i for i in list(compress(b.err_val2,b.compl))]).fit()
WLS is the model implemented in statsmodels, and compress() (from itertools) allows me to filter the data (omitting NULL values). But this case generates the bug:
wls = sm.WLS(...).fit()
AttributeError: 'numpy.float64' object has no attribute 'WLS'
I investigated, and when I zip only two lists the problem disappears and the WLS model computes correctly:
for b in Bunchlst:
    b.compl = list(np.logical_not([1 if nocompl in [v1,v2] else 0 for v1,v2 in zip(b.val1,b.val2)]))
I wrote that it is probably a bug because I checked b.compl in both cases: they were always the same lists of True/False values (depending on the data from the input file). Moreover, a simple check suggests it has to work for many lists:
>>> K = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
>>> L = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> M = [21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
>>> N = [32, 33, 34, 35, 36, 37, 38, 39, 40, 32]
>>> [1 if 26 in [k,l,m,n] else 0 for k,l,m,n in zip(K,L,M,N)]
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
All the best,
Benek
No, there is no difference in how zip() operates with 2 or more lists. Instead, your list comprehension assigned to the name sm in the loop, while at the same time you used the name sm to reference the statsmodels module.
Your simpler two-list version doesn't do this, so the name sm isn't rebound, and you don't run into the issue.
In Python 2, names used in the list comprehension are part of the local scope:
>>> foo
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'foo' is not defined
>>> [foo for foo in 'bar']
['b', 'a', 'r']
>>> foo
'r'
Here the name foo was set in the for loop of the list comprehension, and the name is still available after the loop.
Either rename your import, or rename your loop variables.
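The same trap can be reproduced with the math module in place of statsmodels (Python 2 only -- in Python 3, list comprehensions get their own scope, so the module name would survive):
import math

pairs = [(1.0, 2.0), (3.0, 4.0)]
# in Python 2 this comprehension rebinds the module name `math`
flags = [1 if 99.9999 in [im, math] else 0 for im, math in pairs]
math.sqrt(4)  # Python 2: AttributeError: 'float' object has no attribute 'sqrt'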

Find lists which together contain all values from 0-23 in list of lists python

I have a list of lists. The lists within this list look like the following:
[0,2,5,8,7,12,16,18], [0,9,18,23,5,8,15,16], [1,3,4,17,19,6,13,23],
[9,22,21,10,11,20,14,15], [2,8,23,0,7,16,9,15], [0,5,8,7,9,11,20,16]
Every small list has 8 values from 0-23 and there are no value repeats within a small list.
What I need now are the three lists which together contain all the values 0-23. There may be several combinations that accomplish this, but I only need one.
In this particular case the output would be:
[0,2,5,8,7,12,16,18], [1,3,4,17,19,6,13,23], [9,22,21,10,11,20,14,15]
I thought about doing something with ordering, but I'm not a Python pro, so it is hard for me to handle all the lists within the list (to compare them all).
Thanks for your help.
The following appears to work:
from itertools import combinations, chain
lol = [[0,2,5,8,7,12,16,18], [0,9,18,23,5,8,15,16], [1,3,4,17,19,6,13,23], [9,22,21,10,11,20,14,15], [2,8,23,0,7,16,9,15], [0,5,8,7,9,11,20,16]]
for p in combinations(lol, 3):
    if len(set(chain.from_iterable(p))) == 24:
        print(p)
        break  # if only one is required
This displays the following:
([0, 2, 5, 8, 7, 12, 16, 18], [1, 3, 4, 17, 19, 6, 13, 23], [9, 22, 21, 10, 11, 20, 14, 15])
If it will always happen that 3 of the lists form the numbers 0-23, and you only want the first such triple, then this can be done by creating combinations of length 3 and checking that the sets are pairwise disjoint (three disjoint 8-element lists drawn from 0-23 necessarily cover all 24 values):
>>> li = [[0,2,5,8,7,12,16,18], [0,9,18,23,5,8,15,16], [1,3,4,17,19,6,13,23], [9,22,21,10,11,20,14,15], [2,8,23,0,7,16,9,15], [0,5,8,7,9,11,20,16]]
>>> import itertools
>>> for t in itertools.combinations(li, 3):
...     if not set(t[0]) & set(t[1]) and not set(t[0]) & set(t[2]) and not set(t[1]) & set(t[2]):
...         print t
...         break
([0, 2, 5, 8, 7, 12, 16, 18], [1, 3, 4, 17, 19, 6, 13, 23], [9, 22, 21, 10, 11, 20, 14, 15])
Let's do a recursive solution.
We need a list of lists that contain these values:
target_set = set(range(24))
This is a function that recursively tries to find a list of lists that match exactly that set:
def find_covering_lists(target_set, list_of_lists):
    if not target_set:
        # Done
        return []
    if not list_of_lists:
        # Failed
        raise ValueError()
    # Two cases -- either the first element works, or it doesn't
    try:
        first_as_set = set(list_of_lists[0])
        if first_as_set <= target_set:
            # If it's a subset, call this recursively for the rest
            return [list_of_lists[0]] + find_covering_lists(
                target_set - first_as_set, list_of_lists[1:])
    except ValueError:
        pass  # The recursive call failed to find a solution
    # If we get here, the first element failed.
    return find_covering_lists(target_set, list_of_lists[1:])
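For example, with the data from the question:
target_set = set(range(24))
lol = [[0,2,5,8,7,12,16,18], [0,9,18,23,5,8,15,16], [1,3,4,17,19,6,13,23],
       [9,22,21,10,11,20,14,15], [2,8,23,0,7,16,9,15], [0,5,8,7,9,11,20,16]]

print(find_covering_lists(target_set, lol))
# [[0, 2, 5, 8, 7, 12, 16, 18], [1, 3, 4, 17, 19, 6, 13, 23], [9, 22, 21, 10, 11, 20, 14, 15]]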
