Pandas Convert Particular columns to a list [duplicate] - python

This question already has answers here:
Pandas DataFrame column to list [duplicate]
(4 answers)
How do I convert a Pandas series or index to a NumPy array? [duplicate]
(8 answers)
Closed 3 years ago.
I have a pandas DataFrame with multiple columns.
LogBlk Page BayFail
0 0 [0, 1, 8, 9]
1 16 [0, 1, 4, 5, 6, 8, 9, 12, 13, 14]
2 32 [0, 1, 4, 5, 6, 8, 9, 12, 13, 14]
3 48 [0, 1, 4, 5, 6, 8, 9, 12, 13, 14]
I want to find the BayFail that is associated with LogBlk=0 and Page=0.
df2 = df[ (df['Page'] == 0) & (df['LogBlk'] == 0) ]['BayFail']
This will return [0, 1, 8, 9].
What I want to do is convert this pandas.Series into a list. Does anyone know how to do that?

pandas.Series has a tolist method:
In [10]: import pandas as pd
In [11]: s = pd.Series([0,1,8,9], name = 'BayFail')
In [12]: s.tolist()
Out[12]: [0L, 1L, 8L, 9L]
Technical note: In my original answer I said that Series was a subclass of numpy.ndarray and inherited its tolist method. While that's true for Pandas 0.12 or older, in the soon-to-be-released Pandas 0.13, Series has been refactored to be a subclass of NDFrame. Series still has a tolist method, but it has no direct relationship to the numpy.ndarray method of the same name.

You can also convert them to numpy arrays:
In [124]: import numpy as np
In [125]: s = pd.Series([0, 1, 8, 9], name='BayFail')
In [126]: a = np.array(s)
In [127]: a
Out[127]: array([0, 1, 8, 9], dtype=int64)
In [128]: a[0]
Out[128]: 0
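Putting the pieces together on the question's own data — a minimal sketch, assuming the DataFrame shown in the question (the column values are copied from the post; the variable names match, bay_fail, and pages are purely illustrative):
import pandas as pd

# Rebuild the question's DataFrame (values copied from the post above).
df = pd.DataFrame({
    'LogBlk': [0, 1, 2, 3],
    'Page': [0, 16, 32, 48],
    'BayFail': [[0, 1, 8, 9],
                [0, 1, 4, 5, 6, 8, 9, 12, 13, 14],
                [0, 1, 4, 5, 6, 8, 9, 12, 13, 14],
                [0, 1, 4, 5, 6, 8, 9, 12, 13, 14]],
})

# Filter, then pull the single matching cell out of the resulting Series.
match = df[(df['Page'] == 0) & (df['LogBlk'] == 0)]['BayFail']
bay_fail = match.iloc[0]        # the stored list itself: [0, 1, 8, 9]

# If the column held scalars rather than lists, tolist() on the
# filtered Series would already give a plain Python list:
pages = df[df['LogBlk'] == 0]['Page'].tolist()   # [0]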

Is there a way to change axis of an 2d array in Python? [duplicate]

This question already has answers here:
Transpose list of lists
(14 answers)
Closed 2 years ago.
I want to change the x and y axis of a matrix. For example, I want to store the first element of each nested array in the first line, the second element of each nested array on the second line, and so on.
For example:
list = [[1, 2, 3, 4, 5, 6],
        [7, 8, 9, 10, 11, 12],
        [13, 14, 15, 16, 17, 18],
        [19, 20, 21, 22, 23, 24]]
I want to change into this:
new_list = [[1, 7, 13, 19],
            [2, 8, 14, 20],
            [3, 9, 15, 21],
            [4, 10, 16, 22],
            [5, 11, 17, 23],
            [6, 12, 18, 24]]
Note: this isn't a rotation.
Use numpy.ndarray.T, which is the same as numpy.transpose
import numpy as np

data = [[1, 2, 3, 4, 5, 6],
        [7, 8, 9, 10, 11, 12],
        [13, 14, 15, 16, 17, 18],
        [19, 20, 21, 22, 23, 24]]

# convert the list of lists to an array
data = np.array(data)

# transpose the array
data_t = data.T
data_t now holds:
array([[ 1,  7, 13, 19],
       [ 2,  8, 14, 20],
       [ 3,  9, 15, 21],
       [ 4, 10, 16, 22],
       [ 5, 11, 17, 23],
       [ 6, 12, 18, 24]])
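If pulling in numpy feels like overkill, the same reshaping can be done with plain Python's zip, as in the linked "Transpose list of lists" duplicate — a minimal sketch using only the list from the question:
data = [[1, 2, 3, 4, 5, 6],
        [7, 8, 9, 10, 11, 12],
        [13, 14, 15, 16, 17, 18],
        [19, 20, 21, 22, 23, 24]]

# zip(*data) groups the i-th element of every row together;
# wrapping each tuple in list() gives lists back.
new_list = [list(row) for row in zip(*data)]
# [[1, 7, 13, 19], [2, 8, 14, 20], [3, 9, 15, 21],
#  [4, 10, 16, 22], [5, 11, 17, 23], [6, 12, 18, 24]]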

How can I find the final cumulative sum across numpy axis? [duplicate]

This question already has answers here:
How to calculate the sum of all columns of a 2D numpy array (efficiently)
(6 answers)
Closed 4 years ago.
I have a numpy array
np.array(data).shape
(50,50)
Now, I want to find the cumulative sums across axis=1. The problem is cumsum creates an array of cumulative sums, but I just care about the final value of every row.
This is incorrect of course:
np.cumsum(data, axis=1)[-1]
Is there a succinct way of doing this without looping through the array?
You are almost there, but as you have it now, you are selecting just the final row. What you need is to select all rows from the last column, so your indexing at the end should be: [:,-1].
Example:
>>> a
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
>>> a.cumsum(axis=1)[:,-1]
array([ 10, 35, 60, 85, 110])
Note, I'm leaving this up as I think it explains what was going wrong with your attempt, but admittedly, there are more effective ways of doing this in the other answers!
The final cumulative sum of every row is in fact simply the sum of every row, i.e. the row-wise sum, so we can implement this as:
>>> a.sum(axis=1)
array([ 10, 35, 60, 85, 110])
So here, for every row, we calculate the sum over all the columns. We thus never need to materialize the intermediate cumulative sums: numpy may keep a running accumulator internally, but it is never emitted as an array.
You can use numpy.ufunc.reduce if you don't need the intermediary accumulated results of any ufunc.
>>> a = np.arange(9).reshape(3, 3)
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> np.add.reduce(a, axis=1)
array([ 3, 12, 21])
However, in the case of sum, Willem's answer is clearly superior and to be preferred. Just keep in mind that in the general case, there's ufunc.reduce.
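Applied to the (50, 50) array from the question, all three approaches give the same row-wise totals — a minimal sketch with random data standing in for the real array:
import numpy as np

data = np.random.rand(50, 50)                 # stand-in for the question's array

via_cumsum = np.cumsum(data, axis=1)[:, -1]   # last column of the cumulative sums
via_sum    = data.sum(axis=1)                 # row-wise sum directly
via_reduce = np.add.reduce(data, axis=1)      # ufunc.reduce, no intermediate array

assert via_cumsum.shape == (50,)
assert np.allclose(via_cumsum, via_sum) and np.allclose(via_sum, via_reduce)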

How to convert numpy datetime64 into datetime [duplicate]

This question already has answers here:
Converting between datetime, Timestamp and datetime64
(14 answers)
Closed 7 years ago.
I basically face the same problem posted here: Converting between datetime, Timestamp and datetime64
but I couldn't find a satisfying answer there. My question is how to extract a datetime from the numpy.datetime64 type.
If I try:
np.datetime64('2012-06-18T02:00:05.453000000-0400').astype(datetime.datetime)
it gives me:
1339999205453000000L
My current solution is to convert the datetime64 into a string and then turn it back into a datetime, but that seems like quite a silly method.
Borrowing from
Converting between datetime, Timestamp and datetime64
In [220]: x
Out[220]: numpy.datetime64('2012-06-17T23:00:05.453000000-0700')
In [221]: datetime.datetime.utcfromtimestamp(x.tolist()/1e9)
Out[221]: datetime.datetime(2012, 6, 18, 6, 0, 5, 452999)
Accounting for timezones I think that's right. Looks rather clunky though.
Using int() is more explicit (I think) than tolist():
In [294]: datetime.datetime.utcfromtimestamp(int(x)/1e9)
Out[294]: datetime.datetime(2012, 6, 18, 6, 0, 5, 452999)
or to get the datetime in local time:
In [295]: datetime.datetime.fromtimestamp(x.astype('O')/1e9)
But in the test_datetime.py file
https://github.com/numpy/numpy/blob/master/numpy/core/tests/test_datetime.py
I find some other options - first convert the general datetime64 to one of the formats that specify units:
In [296]: x.astype('M8[D]').astype('O')
Out[296]: datetime.date(2012, 6, 18)
In [297]: x.astype('M8[ms]').astype('O')
Out[297]: datetime.datetime(2012, 6, 18, 6, 0, 5, 453000)
This works for arrays:
In [303]: np.array([[x,x],[x,x]],dtype='M8[ms]').astype('O')[0,1]
Out[303]: datetime.datetime(2012, 6, 18, 6, 0, 5, 453000)
Note that Timestamp IS a subclass of datetime.datetime, so the approach in [4] below will generally work:
In [4]: pd.Timestamp(np.datetime64('2012-06-18T02:00:05.453000000-0400'))
Out[4]: Timestamp('2012-06-18 06:00:05.453000')
In [5]: pd.Timestamp(np.datetime64('2012-06-18T02:00:05.453000000-0400')).to_pydatetime()
Out[5]: datetime.datetime(2012, 6, 18, 6, 0, 5, 453000)
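For whole arrays of datetime64 values, pandas can also do the conversion in one shot — a minimal sketch (the sample timestamps here are made up; DatetimeIndex.to_pydatetime returns an object array of datetime.datetime):
import numpy as np
import pandas as pd

arr = np.array(['2012-06-18T06:00:05.453', '2012-06-19T06:00:05.453'],
               dtype='datetime64[ms]')

# DatetimeIndex.to_pydatetime() yields plain datetime.datetime objects
py_datetimes = pd.to_datetime(arr).to_pydatetime()
print(py_datetimes[0])      # 2012-06-18 06:00:05.453000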

Having issue with hierarchical index Set Behavior

I can't figure out this weird behavior I am getting from the hierarchical index I have on a dataframe. In short, what I am trying to do is very simple: I am trying to figure out whether or not a tuple is in the index of my dataframe.
This is the behavior I expect:
import datetime as dt
import numpy as np
import pandas as pd

arrays = [[dt.date(2014,6,4), dt.date(2014,6,4), dt.date(2014,6,21), dt.date(2014,6,21),
           dt.date(2014,6,13), dt.date(2014,6,13), dt.date(2014,6,7), dt.date(2014,6,7)],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)
print (dt.date(2014,6,4),'one') in s.index
print (dt.date(2014,6,4),'fifty') in s.index
print (dt.date(2014,1,1),'one') in s.index
which returns:
True
False
False
Here is what I am facing:
WeirdIdx = pd.MultiIndex(
    levels=[[dt.date(2014,7,4), dt.date(2014,7,5), dt.date(2014,7,6),
             dt.date(2014,7,7), dt.date(2014,7,8), dt.date(2014,7,9)],
            [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
             13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]],
    labels=[[0, 0, 0, 0, 0], [8, 8, 8, 8, 8]],
    names=[u'day', u'hour'])
frame = pd.DataFrame({'a':np.random.normal(0,1,5)},index=WeirdIdx)
print type(frame)
print frame.index
print frame
yields:
<class 'pandas.core.frame.DataFrame'>
day hour
2014-07-04 8
8
8
8
8
a
day hour
2014-07-04 8 0.335840
8 0.801193
8 -0.092492
8 0.610675
8 -0.044947
and:
print (dt.date(2014,7,4),8) in frame.index
print (dt.date(2014,7,4),1) in frame.index
print (dt.date(2014,8,4),1) in frame.index
yields:
True
True
True
and finally:
frame.index
yields:
MultiIndex(levels=[[2014-07-04, 2014-07-05, 2014-07-06, 2014-07-07, 2014-07-08, 2014-07-09], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]],
labels=[[0, 0, 0, 0, 0], [8, 8, 8, 8, 8]],
names=[u'day', u'hour'])
One issue is that (dt.date(2014,8,4),1) in frame.index SHOULD be False!
What am I missing here?
The problem appears to be due to the fact that your MultiIndex is nonunique. Pandas has strange behavior in this situation, which I would consider a bug. The problem has nothing to do with dates or even DataFrames at all; it is purely a MultiIndex problem. Here is a simpler example:
WeirdIdx = pandas.MultiIndex(
    levels=[[0], [1]],
    labels=[[0, 0], [0, 0]],
    names=[u'X', u'Y']
)
Then any tuple of the right size and types is considered contained in the MultiIndex:
Then any tuple of the right size and types is considered contained in the MultiIndex:
>>> (0, 0) in WeirdIdx
True
>>> (1, 0) in WeirdIdx
True
>>> (100, 0) in WeirdIdx
True
>>> (100, 100) in WeirdIdx
True
Looking in the source code, I can see how these results arise: indexing falls back to slicing if the MultiIndex is nonunique, and slicing always works even if the values aren't present (just returning a zero-length slice). But I don't understand why things were implemented that way.
I can't find a bug about this on the pandas bug tracker, although there are a variety of bugs having to do with duplicate MultiIndexes, such as this bug. Some comments on that bug suggest the problem should have been fixed in pandas 0.14, but I don't know whether it actually has been, and the bug is still open. My impression from the various bug reports is that MultiIndexes basically do not work unless they are unique. I would suggest opening a bug report and/or asking on the pandas mailing list.
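As a practical workaround in the meantime, you can check is_unique up front and fall back to a plain set of tuples for membership tests — a minimal sketch against the small WeirdIdx above (note that newer pandas versions spell the labels argument codes):
import pandas as pd

WeirdIdx = pd.MultiIndex(
    levels=[[0], [1]],
    labels=[[0, 0], [0, 0]],   # 'codes=' on newer pandas versions
    names=[u'X', u'Y'],
)
print(WeirdIdx.is_unique)      # False -- the source of the surprising lookups

# Iterating a MultiIndex yields tuples, so a set gives exact membership tests.
members = set(WeirdIdx)
print((0, 1) in members)       # True  (the only entry actually present)
print((100, 100) in members)   # False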

Python Pandas GroupBy equivalent of If A and not B where clause in SQL

I am using pandas groupby and was wondering how to implement the following:
Dataframes A and B have the same variable to index on, but A has 20 unique index values and B has 5.
I want to create a dataframe C that contains rows whose indices are present in A and not in B.
Assume that the 5 unique index values in B are all present in A. C in this case would contain only those rows associated with index values that are in A and not in B (i.e. 15 of them).
Using inner, outer, left, and right joins does not do this (unless I misread something).
In SQL I might do this as where A.index <> (not equal) B.index
My left-handed solution:
a) Get the respective index columns from each data set, say x and y.
import pandas as pa

def match(x, y, compareCol):
    """x and y are Series.
    compareCol is the name of the Series being returned;
    it is the same as the name of x and y in their respective dataframes."""
    # need to compare arrays -- unique() returns arrays
    x = x.unique()
    y = y.unique()
    new = []
    for item in x:
        if item not in y:
            new.append(item)
    returnADataFrame = pa.DataFrame(pa.Series(new, name=compareCol))
    return returnADataFrame
b) Now do a left join of this onto data set A.
I am reasonably confident that my elementwise comparison is as slow as a tortoise on weed with no motivation (a vectorized alternative is sketched below).
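For what it's worth, the elementwise loop above can be replaced with a vectorized membership test — a minimal sketch of the same idea using Series.isin (match_vectorized is just an illustrative name; pa is the pandas alias used above):
import pandas as pa

def match_vectorized(x, y, compareCol):
    """Return a one-column DataFrame of the unique values present in x but not in y."""
    keep = x[~x.isin(y.unique())].unique()
    return pa.DataFrame(pa.Series(keep, name=compareCol))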
What about something like:
A.ix[A.index - B.index]
A.index - B.index is a set difference:
In [30]: A.index
Out[30]: Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=int64)
In [31]: B.index
Out[31]: Int64Index([ 0, 1, 2, 3, 999], dtype=int64)
In [32]: A.index - B.index
Out[32]: Int64Index([ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], dtype=int64)
In [33]: B.index - A.index
Out[33]: Int64Index([999], dtype=int64)
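On newer pandas versions, `-` between two indexes no longer means set difference and .ix has been removed, so the same idea is spelled with Index.difference and .loc — a minimal sketch, assuming A and B are already indexed on the shared key:
import pandas as pd

A = pd.DataFrame({'val': range(20)}, index=range(20))
B = pd.DataFrame({'val': range(5)}, index=[0, 1, 2, 3, 999])

# rows of A whose index labels do not appear in B's index
C = A.loc[A.index.difference(B.index)]

# equivalent, and keeps A's original row order:
C = A[~A.index.isin(B.index)]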
