How can I use a pandas column as an index into a numpy array? Say I have
>>> import numpy as np, pandas as pd
>>> grid = np.arange(10, 20)
>>> df = pd.DataFrame([0,1,1,5], columns=['i'])
I would like to do
>>> df['j'] = grid[df['i']]
IndexError: unsupported iterator index
What is a short and clean way to actually perform this operation?
Update
To be precise, I want an additional column that holds the values of grid at the indices given by the first column: df['j'][0] = grid[df['i'][0]] for row 0, and so on.
expected output:
index  i   j
0      0  10
1      1  11
2      1  11
3      5  15
Parallel Case: Numpy-to-Numpy
Just to show where the idea comes from, in standard python / numpy, if you have
>>> keys = [0, 1, 1, 5]
>>> grid = np.arange(10, 20)
>>> grid[keys]
array([10, 11, 11, 15])
Which is exactly what I want to do, except that my keys are not stored in a plain vector; they are stored in a DataFrame column.
This is a numpy bug that surfaced with pandas 0.13.0 / numpy 1.8.0.
You can do:
In [5]: grid[df['i'].values]
Out[5]: array([10, 11, 11, 15])

In [6]: pd.Series(grid)[df['i']]
Out[6]:
i
0    10
1    11
1    11
5    15
dtype: int64
This matches your output. You can assign an array to a column, as long as the length of the array/list is the same as the frame (otherwise how would you align it?)
In [14]: grid[keys]
Out[14]: array([10, 11, 11, 15])
In [15]: df['j'] = grid[df['i'].values]
In [17]: df
Out[17]:
   i   j
0  0  10
1  1  11
2  1  11
3  5  15
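As a side note of mine, not from the answer above: on pandas 0.24 and later, .to_numpy() is the documented successor to .values, so a modern spelling of the same fix is:

import numpy as np
import pandas as pd

grid = np.arange(10, 20)
df = pd.DataFrame({'i': [0, 1, 1, 5]})

# Convert the column to a plain ndarray before the fancy indexing,
# so numpy receives integer positions rather than a Series.
df['j'] = grid[df['i'].to_numpy()]
print(df)
#    i   j
# 0  0  10
# 1  1  11
# 2  1  11
# 3  5  15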
Related
I'm trying to figure out a way to slice non-contiguous, unequal-length row ranges of a pandas / numpy matrix so I can set the values to a common value. Has anyone come across an elegant solution for this?
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(12).reshape(3,4))
#x is the matrix we want to index into
"""
x before:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
"""
y = pd.DataFrame([[0,3],[2,2],[1,2],[0,0]])
#y gives, for each column of x, an inclusive [start, end] row range
"""
y:
   0  1
0  0  3
1  2  2
2  1  2
3  0  0
"""
What I'm looking for is a way to effectively select different-length slices of x based on the rows of y, something like:
x[y] = 0
"""
x afterwards:
array([[ 0,  1,  2,  0],
       [ 0,  5,  0,  7],
       [ 0,  0,  0, 11]])
"""
Masking can still be useful, because even if a loop cannot be entirely avoided, the main dataframe x would not need to be involved in the loop, so this should speed things up:
mask = np.zeros_like(x, dtype=bool)
for i in range(len(y)):
    # Mark rows start..end (inclusive) of column i.
    mask[y.iloc[i, 0]:(y.iloc[i, 1] + 1), i] = True
x[mask] = 0
x
   0  1  2   3
0  0  1  2   0
1  0  5  0   7
2  0  0  0  11
As a further improvement, consider defining y as a NumPy array if possible.
I customized a broadcasting-based answer to your problem:
y_t = y.values.transpose()  # shape (2, 4): row 0 = starts, row 1 = inclusive ends
r = np.arange(x.shape[0])   # row positions of x
# Broadcast: mask[i, j] is True when start_j <= i <= end_j.
mask = ((y_t[0, :, None] <= r) & (y_t[1, :, None] >= r)).transpose()
res = x.where(~mask, 0)
res
#    0  1  2   3
# 0  0  1  2   0
# 1  0  5  0   7
# 2  0  0  0  11
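To make the "define y as a NumPy array" suggestion above concrete, here is a sketch of my own, assuming inclusive [start, end] pairs as in the question:

import numpy as np

x_arr = np.arange(12).reshape(3, 4)
y_arr = np.array([[0, 3], [2, 2], [1, 2], [0, 0]])  # inclusive [start, end] per column

r = np.arange(x_arr.shape[0])[:, None]          # column vector of row positions
mask = (y_arr[:, 0] <= r) & (r <= y_arr[:, 1])  # broadcasts to shape (3, 4)
x_arr[mask] = 0
print(x_arr)
# [[ 0  1  2  0]
#  [ 0  5  0  7]
#  [ 0  0  0 11]]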
I know that using pandas lookup I can choose particular dataframe cells using pairs of (row,column) values. For example
frame = pd.DataFrame([[1,2,3],[4,5,6]])
frame.lookup([0,1],[1,2])
gives me
array([2, 6], dtype=int64)
Is there a similar way to assign values to cells? I am looking for something like this:
Pseudocode:
frame.lookup([0,1],[1,2]) = [7,8]
I don't know of a built-in pandas solution for that, but the code below should work.
frame = pd.DataFrame([[1,2,3],[4,5,6]])
#    0  1  2
# 0  1  2  3
# 1  4  5  6
frame.lookup([0,1],[1,2])
# array([2, 6])
def set_dataframe_values(dataframe, coords, values):
    dataframe_ = pd.DataFrame(dataframe)  # shares data with the original frame
    for (x, y), value in zip(coords, values):
        # .at avoids chained indexing, which can silently fail to write.
        dataframe_.at[x, y] = value
    return dataframe_
set_dataframe_values(frame, coords=[(0,1), (1,2)], values=[7,8])
#    0  1  2
# 0  1  7  3
# 1  4  5  8
frame.lookup([0,1],[1,2])
# array([7, 8])
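An alternative worth mentioning (my addition, with a caveat): when the frame holds a single dtype, .values is typically a view on the underlying ndarray in pre-2.0 pandas, so NumPy fancy indexing can write all the cells in one shot:

import pandas as pd

frame = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Writes through the underlying array; this relies on .values being a
# view, which usually holds for a single-dtype frame but is not guaranteed.
frame.values[[0, 1], [1, 2]] = [7, 8]
print(frame)
#    0  1  2
# 0  1  7  3
# 1  4  5  8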
The following code finds the index labels where df['A'] == 1:
import pandas as pd
import numpy as np
import random

index = list(range(10))
random.shuffle(index)
df = pd.DataFrame(np.zeros((10, 1)).astype(int), columns=['A'], index=index)
df.iloc[3:6, 0] = 1
df.iloc[6:, 0] = 2
print(df)
print(df.loc[df['A'] == 1].index.tolist())
It returns pandas index correctly. How do I get the integer index ([3,4,5]) instead using pandas API?
   A
8  0
4  0
6  0
3  1
7  1
1  1
5  2
0  2
2  2
9  2
[3, 7, 1]
What about:
In [11]: df
Out[11]:
   A
8  0
4  0
6  0
3  1
7  1
1  1
5  2
0  2
2  2
9  2

In [12]: df.index[df.A == 1]
Out[12]: Int64Index([3, 7, 1], dtype='int64')

or, depending on your goals:

In [15]: df.reset_index().index[df.A == 1]
Out[15]: Int64Index([3, 4, 5], dtype='int64')
Here is one way:
df.reset_index().index[df.A == 1].tolist()
This re-indexes the data frame with [0, 1, 2, ...], then extracts the integer index values based on the boolean mask df.A == 1.
Edit: credits to @Max for the index[df.A == 1] idea.
No need for numpy here; pure Python with a list comprehension finds the positions where the value is 1:
print([i for i, x in enumerate(df['A'].values) if x == 1])
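Yet another option (my addition, not from the answers above): numpy.flatnonzero returns the integer positions directly, without touching the index:

import numpy as np

# Positional indices of the rows where the condition holds.
print(np.flatnonzero(df.A == 1).tolist())
# [3, 4, 5]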
Given the following inputs:
In [18]: input
Out[18]:
   1  2   3  4
0  1  5   9  1
1  2  6  10  2
2  1  5   9  1
3  1  5   9  1

In [26]: df = input.drop_duplicates()

In [27]: df
Out[27]:
   1  2   3  4
0  1  5   9  1
1  2  6  10  2
How would I go about getting an array that gives, for each row of input, the index of the equivalent row in the subset? E.g.:
resultant = [0, 1, 0, 0]
That is, resultant[1] == 1 states that row 1 of input equals row 1 of df. Since there are fewer unique rows, several entries of resultant will point at the same row of df: row k and row k+N of input may be identical and both map to, say, row 1 of df. I am looking for an actual row-number mapping from input to df. While this example is trivial, in my case I have a ton of dropped mappings that might map to one index.
Why do I want this? I am training an autoencoder type system where the target sequence is non-unique.
One way would be to treat it as a groupby on all columns:
>>> input.groupby(list(input.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:
>>> ds = input.sort_values(list(input.columns))
>>> eqs = (ds != ds.shift()).any(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}
This seems like the right data structure to me, but if you really do want an array with the group ids, that's easy too, e.g.
>>> eqs.sort_index() - 1
0    0
1    1
2    0
3    0
dtype: int64
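If what you want is the flat resultant list itself, here is a merge-based sketch of my own (an addition, assuming the rows of df are unique, which drop_duplicates guarantees):

import pandas as pd

# Attach each row of df to its positional number, then match every row of
# input against it on all columns; how='left' preserves input's row order.
df_keyed = df.reset_index(drop=True).reset_index()
resultant = input.merge(df_keyed, on=list(input.columns), how='left')['index'].tolist()
print(resultant)
# [0, 1, 0, 0]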
Don't have pandas installed on this computer, but I think you could use df.iterrows() like:
def find_matching_row(row, df_slimmed):
    for index, slimmed_row in df_slimmed.iterrows():
        if slimmed_row.equals(row[df_slimmed.columns]):
            return index

def rows_mappings(df, df_slimmed):
    for _, row in df.iterrows():
        yield find_matching_row(row, df_slimmed)

list(rows_mappings(input, df))
This generates the resultant list in your example; I don't quite follow the latter part of your reasoning.
I have a dataframe with categorical attributes where the index contains duplicates. I am trying to find the sum of each possible combination of index and attribute.
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
print(y)
print(y.groupby(level=[0,1]).sum())
output:
11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64
11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64
The stacked result and its groupby sum are exactly the same.
However, the output I expect is:
11  x     2
11  y     6
12  x     6
12  y    10
EDIT 2:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
output:
11  x    1
    y    3
    x    1
    y    3
12  x    3
    y    5
    x    3
    y    5
dtype: int64
EDIT 3:
An issue has been logged:
https://github.com/pydata/pandas/issues/10417
With pandas 0.16.2 and Python 3, I was able to get the correct result via:
x.stack().reset_index().groupby(['level_0','level_1']).sum()
Which produces:
                 0
level_0 level_1
11      x        2
        y        6
12      x        6
        y       10
You can then change the index and column names to more desirable ones with rename() and by assigning to columns.
Based on my research, I agree that the failure of the original approach appears to be a bug. I think the bug is in Series, which is what x.stack() produces. My workaround is to turn the Series into a DataFrame via reset_index(). In this case the DataFrame no longer has a MultiIndex - I'm just grouping on labeled columns.
To make sure that grouping and summing works on a DataFrame with a MultiIndex, you can try this to get the same correct output:
x.stack().reset_index().set_index(['level_0','level_1'],drop=True).\
groupby(level=[0,1]).sum()
Either of these workarounds should take care of things until the bug is resolved.
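For completeness, a small sketch of the renaming step mentioned above (the names 'id', 'attr' and 'total' are purely illustrative):

res = x.stack().reset_index().groupby(['level_0', 'level_1']).sum()
res.index.names = ['id', 'attr']  # rename the index levels
res.columns = ['total']           # rename the lone value column
print(res)
#          total
# id attr
# 11 x         2
#    y         6
# 12 x         6
#    y        10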
I wonder if the bug has something to do with the MultiIndex instances that are created on a Series vs. a DataFrame. For example:
In[1]: obj = x.stack()
type(obj)
Out[1]: pandas.core.series.Series
In[2]: obj.index
Out[2]: MultiIndex(levels=[[11, 11, 12, 12], ['x', 'y']],
                   labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
vs.
In[3]: obj = x.stack().reset_index().set_index(['level_0','level_1'],drop=True)
type(obj)
Out[3]: pandas.core.frame.DataFrame
In[4]: obj.index
Out[4]: MultiIndex(levels=[[11, 12], ['x', 'y']],
                   labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1]],
                   names=['level_0', 'level_1'])
Notice how the MultiIndex on the DataFrame describes the levels correctly, with unique level values, while the Series version carries duplicated entries in its first level.
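A related repair I would try (my own sketch, not verified against the affected 0.13-0.16 releases): rebuild the malformed index from its tuples, which collapses the duplicated levels and may be enough to make groupby behave:

s = x.stack()
# from_tuples rebuilds the levels as unique values: [[11, 12], ['x', 'y']].
s.index = pd.MultiIndex.from_tuples(s.index.tolist())
print(s.groupby(level=[0, 1]).sum())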
Series.sum allows you to specify the levels to sum over when the index is a MultiIndex:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack()
y.sum(level=[0,1])
11  x     2
    y     6
12  x     6
    y    10
dtype: int64
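A forward-looking note of mine, not part of the original answer: sum(level=...) was deprecated around pandas 1.3 in favor of the groupby spelling, and the duplicate-level bug tracked in the issue above has since been fixed, so on a current install the straightforward groupby works:

import pandas as pd

x = pd.DataFrame({'x': [1, 1, 3, 3], 'y': [3, 3, 5, 5]}, index=[11, 11, 12, 12])

# On current pandas this groups correctly despite the duplicated index labels.
print(x.stack().groupby(level=[0, 1]).sum())
# 11  x     2
#     y     6
# 12  x     6
#     y    10
# dtype: int64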
Using pandas 0.15.2, you just need one more iteration of groupby:
x = pd.DataFrame({'x':[1,1,3,3],'y':[3,3,5,5]},index=[11,11,12,12])
y = x.stack().groupby(level=[0,1]).sum()
print(y.groupby(level=[0,1]).sum())
prints
11  x     2
    y     6
12  x     6
    y    10
dtype: int64