Having issue with hierarchical index Set Behavior - python

I can't figure out this weird behavior I am getting from the hierarchical index I have on a dataframe. In short, what I am trying to do is very simple: I am trying to figure out whether or not a tuple is in the index of my dataframe.
This is the behavior I expect:
import datetime as dt
import numpy as np
import pandas as pd
from numpy.random import randn
arrays = [[dt.date(2014,6,4), dt.date(2014,6,4), dt.date(2014,6,21), dt.date(2014,6,21),
           dt.date(2014,6,13), dt.date(2014,6,13), dt.date(2014,6,7), dt.date(2014,6,7)],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(randn(8), index=index)
print (dt.date(2014,6,4),'one') in s.index
print (dt.date(2014,6,4),'fifty') in s.index
print (dt.date(2014,1,1),'one') in s.index
which returns:
True
False
False
Here is what I am facing:
WeirdIdx = pd.MultiIndex(
    levels=[[dt.date(2014,7,4), dt.date(2014,7,5), dt.date(2014,7,6), dt.date(2014,7,7), dt.date(2014,7,8), dt.date(2014,7,9)],
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]],
    labels=[[0, 0, 0, 0, 0], [8, 8, 8, 8, 8]],
    names=[u'day', u'hour'])
frame = pd.DataFrame({'a':np.random.normal(0,1,5)},index=WeirdIdx)
print type(frame)
print frame.index
print frame
yields:
<class 'pandas.core.frame.DataFrame'>
day         hour
2014-07-04  8
            8
            8
            8
            8
                        a
day        hour
2014-07-04 8     0.335840
           8     0.801193
           8    -0.092492
           8     0.610675
           8    -0.044947
and:
print (dt.date(2014,7,4),8) in frame.index
print (dt.date(2014,7,4),1) in frame.index
print (dt.date(2014,8,4),1) in frame.index
yields:
True
True
True
and finally:
frame.index
yields:
MultiIndex(levels=[[2014-07-04, 2014-07-05, 2014-07-06, 2014-07-07, 2014-07-08, 2014-07-09], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]],
labels=[[0, 0, 0, 0, 0], [8, 8, 8, 8, 8]],
names=[u'day', u'hour'])
One issue is that (dt.date(2014,8,4),1) in frame.index SHOULD be False!
What am I missing here?

The problem appears to be due to the fact that your MultiIndex is nonunique. Pandas has strange behavior in this situation, which I would consider a bug. The problem has nothing to do with dates or even DataFrames at all; it is purely a MultiIndex problem. Here is a simpler example:
WeirdIdx = pandas.MultiIndex(
    levels=[[0], [1]],
    labels=[[0, 0], [0, 0]],
    names=[u'X', u'Y']
)
Then any tuple of the right size and types is considered contained in the MultiIndex:
>>> (0, 0) in WeirdIdx
True
>>> (1, 0) in WeirdIdx
True
>>> (100, 0) in WeirdIdx
True
>>> (100, 100) in WeirdIdx
True
Looking in the source code, I can see how these results arise: indexing falls back to slicing if the MultiIndex is nonunique, and slicing always works even if the values aren't present (just returning a zero-length slice). But I don't understand why things were implemented that way.
I can't find a bug about this on the pandas bug tracker, although there are a variety of bugs having to do with duplicate MultiIndexes, such as this bug. Some comments on this bug suggest the problem should have been fixed in pandas 0.14, but I don't know whether it actually has been fixed, and the bug is still open. My impression from the various bug reports is that MultiIndexes basically do not work unless they are unique. I would suggest opening a bug report and/or asking on the pandas mailing list.
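In the meantime, one way to sidestep the slicing fallback is to test membership against the index entries themselves rather than relying on `in`. This is only a sketch under a few assumptions: the helper name `in_index` is made up for illustration, and the index is built with `from_arrays` because the `labels=` keyword was renamed `codes=` in later pandas releases.

```python
import pandas as pd

# A non-unique MultiIndex: the pair (0, 1) appears twice.
idx = pd.MultiIndex.from_arrays([[0, 0], [1, 1]], names=["X", "Y"])

def in_index(index, key):
    """Return True only if `key` is an actual entry of `index`.

    Hypothetical helper: it compares against the materialized tuples,
    so it never goes through the index's own lookup logic.
    """
    return key in set(index.tolist())

print(idx.is_unique)              # the index contains duplicates
print(in_index(idx, (0, 1)))      # a pair that really is in the index
print(in_index(idx, (100, 100)))  # a pair that is not
```

Building a set costs a full pass over the index, so for repeated lookups it is probably worth building the set once and reusing it.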

Pandas Convert Particular columns to a list [duplicate]

I have a Python dataFrame with multiple columns.
LogBlk Page BayFail
0 0 [0, 1, 8, 9]
1 16 [0, 1, 4, 5, 6, 8, 9, 12, 13, 14]
2 32 [0, 1, 4, 5, 6, 8, 9, 12, 13, 14]
3 48 [0, 1, 4, 5, 6, 8, 9, 12, 13, 14]
I want to find the BayFail values associated with LogBlk=0 and Page=0.
df2 = df[ (df['Page'] == 0) & (df['LogBlk'] == 0) ]['BayFail']
This will return [0,1,8,9]
What I want to do is to convert this pandas.Series into a list. Does anyone know how to do that?
pandas.Series has a tolist method:
In [10]: import pandas as pd
In [11]: s = pd.Series([0,1,8,9], name = 'BayFail')
In [12]: s.tolist()
Out[12]: [0L, 1L, 8L, 9L]
Technical note: In my original answer I said that Series was a subclass of numpy.ndarray and inherited its tolist method. While that's true for Pandas version 0.12 or older, in the soon-to-be-released Pandas version 0.13, Series has been refactored to be a subclass of NDFrame. Series still has a tolist method, but it has no direct relationship to the numpy.ndarray method of the same name.
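Applied to the frame in the question, this could look roughly like the sketch below. The frame is reconstructed from the printed table, so the column layout (LogBlk 0 to 3, Page 0 to 48) is an assumption. Note that because each BayFail cell already holds a list, `tolist()` on the filtered Series gives a list containing one list.

```python
import pandas as pd

# Reconstructed from the table in the question (layout is assumed).
long_fail = [0, 1, 4, 5, 6, 8, 9, 12, 13, 14]
df = pd.DataFrame({
    "LogBlk": [0, 1, 2, 3],
    "Page": [0, 16, 32, 48],
    "BayFail": [[0, 1, 8, 9], long_fail, long_fail, long_fail],
})

# Filter rows and pick the column in a single .loc call.
selected = df.loc[(df["Page"] == 0) & (df["LogBlk"] == 0), "BayFail"]

as_list = selected.tolist()  # one entry per matching row
first = selected.iloc[0]     # the list stored in the first matching row

print(as_list)
print(first)
```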
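You can also convert them to NumPy arrays. Import numpy directly, since the pd.np alias is not available in recent pandas versions:
In [123]: import numpy as np
In [124]: s = pd.Series([0,1,8,9], name='BayFail')
In [125]: a = np.array(s)
In [126]: a
Out[126]: array([0, 1, 8, 9], dtype=int64)
In [127]: a[0]
Out[127]: 0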

find repeating dates between two datetime arrays python

I have two datetime arrays, and I am trying to output an array with only those dates which are repeated between both arrays. I feel like this is something I should be able to answer myself, but I have spent a lot of time searching and I do not understand how to solve this.
>>> datetime1[0:4]
array([datetime.datetime(2014, 6, 19, 4, 0),
datetime.datetime(2014, 6, 19, 5, 0),
datetime.datetime(2014, 6, 19, 6, 0),
datetime.datetime(2014, 6, 19, 7, 0)], dtype=object)
>>> datetime2[0:4]
array([datetime.datetime(2014, 6, 19, 3, 0),
datetime.datetime(2014, 6, 19, 4, 0),
datetime.datetime(2014, 6, 19, 5, 0),
datetime.datetime(2014, 6, 19, 6, 0)], dtype=object)
I've tried this below but I still do not understand why this does not work
>>> np.where(datetime1==datetime2)
(array([], dtype=int64),)
This:
datetime1==datetime2
Is an element-wise comparison. It compares [0] with [0], then [1] with [1], and gives you a boolean array.
Instead, try:
np.in1d(datetime1, datetime2)
This gives you a boolean array the same size as datetime1, set to True for those elements which exist in datetime2.
If your goal is only to get the values rather than the indexes, use this:
np.intersect1d(datetime1, datetime2)
https://docs.scipy.org/doc/numpy/reference/generated/numpy.intersect1d.html
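Here is a small sketch of both approaches on arrays shaped like the ones in the question. It uses `np.isin`, which is the newer spelling of `np.in1d` in current NumPy; the sample values are copied from the question, everything else is illustrative.

```python
import datetime as dt
import numpy as np

# Object arrays of datetimes, like the first four entries in the question.
datetime1 = np.array([dt.datetime(2014, 6, 19, h) for h in (4, 5, 6, 7)], dtype=object)
datetime2 = np.array([dt.datetime(2014, 6, 19, h) for h in (3, 4, 5, 6)], dtype=object)

# Element-wise comparison only matches equal values at equal positions.
same_position = datetime1 == datetime2

# Membership test: True where an element of datetime1 occurs anywhere in datetime2.
mask = np.isin(datetime1, datetime2)
common = datetime1[mask]

# Sorted unique values present in both arrays.
shared = np.intersect1d(datetime1, datetime2)

print(same_position)
print(mask)
print(shared)
```

Use the mask when you need positions in datetime1, and `intersect1d` when you only need the values.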
I would say just iterate over the values of datetime1 and datetime2 and check for containment. So for example:
for date in datetime1:
    if date in datetime2:
        print(date)

Slicing without views (or: shuffling multiple arrays)

I have two different numpy arrays and I would like to shuffle them in a synchronized way.
The current solution is taken from https://www.tensorflow.org/versions/r0.8/tutorials/mnist/pros/index.html and proceeds as follows:
perm = np.arange(self.no_images_train)
np.random.shuffle(perm)
self.images_train = self.images_train[perm]
self.labels_train = self.labels_train[perm]
The problem is that it doubles memory each time I do it. Somehow the old arrays are not getting deleted, probably because the slicing operator creates views I guess. I tried the following change, out of pure desperation:
perm = np.arange(self.no_images_train)
np.random.shuffle(perm)
n_images_train = self.images_train[perm]
n_labels_train = self.labels_train[perm]
del self.images_train
del self.labels_train
gc.collect()
self.images_train = n_images_train
self.labels_train = n_labels_train
Still the same, memory leaks and I am running out of memory after a couple of operations.
Btw, the two arrays have shapes (100000, 224, 244, 1) and (100000, 1).
I know that this has been dealt with here (Better way to shuffle two numpy arrays in unison), but the answer didn't help me, as the provided solution needs slicing again.
Thanks for any help.
One way to permute two large arrays in-place in a synchronized way is to save the state of the random number generator and then shuffle the first array. Then restore the state and shuffle the second array.
For example, here are my two arrays:
In [48]: a
Out[48]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
In [49]: b
Out[49]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
Save the current internal state of the random number generator:
In [50]: state = np.random.get_state()
Shuffle a in-place:
In [51]: np.random.shuffle(a)
Restore the internal state of the random number generator:
In [52]: np.random.set_state(state)
Shuffle b in-place:
In [53]: np.random.shuffle(b)
Check that the permutations are the same:
In [54]: a
Out[54]: array([13, 12, 11, 15, 10, 5, 1, 6, 14, 3, 9, 7, 0, 8, 4, 2])
In [55]: b
Out[55]: array([13, 12, 11, 15, 10, 5, 1, 6, 14, 3, 9, 7, 0, 8, 4, 2])
For your code, this would look like:
state = np.random.get_state()
np.random.shuffle(self.images_train)
np.random.set_state(state)
np.random.shuffle(self.labels_train)
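Wrapped up as a function, the same idea might look like the sketch below. The name `shuffle_in_unison` and the small stand-in arrays are made up for illustration; the only requirement assumed here is that both arrays have the same length along the first axis.

```python
import numpy as np

def shuffle_in_unison(first, second):
    """Shuffle two arrays in place with the same permutation along axis 0."""
    if len(first) != len(second):
        raise ValueError("arrays must have the same length along axis 0")
    state = np.random.get_state()   # remember the generator state
    np.random.shuffle(first)        # in-place shuffle of the first array
    np.random.set_state(state)      # rewind the generator
    np.random.shuffle(second)       # same random draws, so same permutation

# Stand-ins for the image and label arrays.
images = np.arange(20).reshape(10, 2)   # row i holds [2*i, 2*i + 1]
labels = np.arange(10).reshape(10, 1)   # row i holds [i]

shuffle_in_unison(images, labels)

# Each image row should still sit next to its original label.
print(np.array_equal(images[:, 0] // 2, labels[:, 0]))
```

Since nothing is indexed with a permutation array, no full-size copy of either array is made.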
Actually I don't think that there is any issue with numpy or python. Numpy uses the system malloc / free to allocate the array and this leads to memory fragmentation (see Memory Fragmentation on SO).
So I guess that your memory usage may keep increasing and then suddenly drop once the system is able to reduce fragmentation, if that is possible.

Use bool list to retrieve elements from another list - Python

I have this situation:
main_list = [12, 10, 30, 10, 11,10, 31]
get_indices = [1, 0, 1, 1, 0, 0, 0]
What I want to do is extract elements from main_list according to the boolean values in get_indices.
I tried the following but it is not working :
result = main_list[bool(get_indices)]
print result
10
It should be 12, 30, 10
I just came across compress() which is best suited for this task.
compress() generates an iterator for efficient looping.
from itertools import compress
main_list = [12, 10, 30, 10, 11,10, 31]
get_indices = [1, 0, 1, 1, 0, 0, 0]
print list(compress(main_list, get_indices))
[12, 30, 10]
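For reference, a Python 3 version of the same idea, plus the inverted selection, might look like this sketch. The names `selected` and `rejected` are only illustrative.

```python
from itertools import compress

main_list = [12, 10, 30, 10, 11, 10, 31]
get_indices = [1, 0, 1, 1, 0, 0, 0]

# Keep the elements whose selector is truthy.
selected = list(compress(main_list, get_indices))

# Invert the selectors on the fly to keep the other elements.
rejected = list(compress(main_list, (not flag for flag in get_indices)))

print(selected)
print(rejected)
```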
You can use list comprehension -
result = [x for i, x in enumerate(main_list) if get_indices[i]]
You do not need to use bool(): 0 is considered False-like in boolean context, and anything non-zero is True-like.
Example/Demo -
>>> main_list = [12, 10, 30, 10, 11,10, 31]
>>> get_indices = [1, 0, 1, 1, 0, 0, 0]
>>> result = [x for i, x in enumerate(main_list) if get_indices[i]]
>>> result
[12, 30, 10]
Or if you really want something similar to your method, since you say -
But if I was able to get my method working, I could also do for example : result = main_list[~bool(get_indices)] to get : [10, 11, 10, 31]
You can use numpy and convert the lists to numpy arrays. Example -
In [30]: import numpy as np
In [31]: main_list = [12, 10, 30, 10, 11,10, 31]
In [32]: get_indices = [1, 0, 1, 1, 0, 0, 0]
In [33]: main_array = np.array(main_list)
In [34]: get_ind_array = np.array(get_indices).astype(bool)
In [35]: main_array[get_ind_array]
Out[35]: array([12, 30, 10])
In [36]: main_array[~get_ind_array]
Out[36]: array([10, 11, 10, 31])

Python Pandas GroupBy equivalent of If A and not B where clause in SQL

I am using pandas groupby and was wondering how to implement the following:
Dataframes A and B have the same variable to index on, but A has 20 unique index values and B has 5.
I want to create a dataframe C that contains rows whose indices are present in A and not in B.
Assume that the 5 unique index values in B are all present in A. C in this case would have only those rows associated with index values in A and not in B (i.e. 15).
Using inner, outer, left and right do not do this (unless I misread something).
In SQL I might do this as where A.index <> (not equal) B.index
My Left handed solution:
a) get the respective index columns from each data set, say x and y.
def match(x,y,compareCol):
    """
    x and y are series
    compare col is the name to the series being returned .
    It is the same name as the name of x and y in their respective dataframes"""
    x = x.unique()
    y = y.unique()
    """ Need to compare arrays x.unique() returns arrays"""
    new = []
    for item in (x):
        if item not in y:
            new.append(item)
    returnADataFrame = pa.DataFrame(pa.Series(new, name = compareCol))
    return returnADataFrame
b) now do a left join on this on the data set A.
I am reasonably confident that my elementwise comparison is slow as a tortoise on weed with no motivation.
What about something like:
A.ix[A.index - B.index]
A.index - B.index is a set difference:
In [30]: A.index
Out[30]: Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=int64)
In [31]: B.index
Out[31]: Int64Index([ 0, 1, 2, 3, 999], dtype=int64)
In [32]: A.index - B.index
Out[32]: Int64Index([ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=int64)
In [33]: B.index - A.index
Out[33]: Int64Index([999], dtype=int64)
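On newer pandas versions, where `.ix` has been removed and subtracting two indexes no longer means set difference, the same idea can be written with `Index.difference` or a boolean mask. This is a sketch with made-up frames that mirror the indexes above; the column name `value` is an assumption.

```python
import pandas as pd

A = pd.DataFrame({"value": range(20)}, index=range(20))
B = pd.DataFrame({"value": range(5)}, index=[0, 1, 2, 3, 999])

# Index labels present in A but not in B.
only_in_a = A.index.difference(B.index)
C = A.loc[only_in_a]

# Equivalent boolean-mask form, which also keeps duplicate labels in A.
C_mask = A[~A.index.isin(B.index)]

print(list(C.index))
print(list(B.index.difference(A.index)))
```

The mask form is probably the safer choice if A's index is not unique, because `difference` returns unique labels only.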
