Selecting a subset based on multiple slices in pandas/NumPy?

I want to select a subset of some pandas DataFrame columns based on several slices.
In [1]: df = pd.DataFrame(data={'A': np.random.rand(100), 'B': np.random.rand(100), 'C': np.random.rand(100)})
df.head()
Out[1]:
          A         B         C
0  0.745487  0.146733  0.594006
1  0.212324  0.692727  0.244113
2  0.954276  0.318949  0.199224
3  0.606276  0.155027  0.247255
4  0.155672  0.464012  0.229516
Something like:
In [2]: df.loc[[slice(1, 4), slice(42, 44)], ['B', 'C']]
Expected output:
Out[2]:
           B         C
1   0.692727  0.244113
2   0.318949  0.199224
3   0.155027  0.247255
42  0.335285  0.000997
43  0.019172  0.237810
I've seen that NumPy's r_ object can help when you want to use multiple slices, e.g.:
In [3]: arr = np.array([1, 2, 3, 4, 5, 5, 5, 5])
arr[np.r_[1:3, 4:6]]
Out[3]: array([2, 3, 5, 5])
But I can't get this to work with a predefined collection (list) of slices. Ideally I would like to be able to specify a collection of ranges/slices and subset based on that. It doesn't seem like r_ accepts iterables? I've seen that one could, for example, create an index array with hstack and then use it, like:
In [4]: idx = np.hstack((np.arange(1, 4), np.arange(42, 44)))
df.loc[idx, ['B', 'C']]
Out[4]:
           B         C
1   0.692727  0.244113
2   0.318949  0.199224
3   0.155027  0.247255
42  0.335285  0.000997
43  0.019172  0.237810
Which gets me what I need, but is there any other faster/cleaner/preferred/whatever way of doing this?
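For what it's worth, r_ does accept a predefined collection if you pass it as a tuple: np.r_[1:3, 4:6] is just syntactic sugar for indexing r_ with a tuple of slice objects, so a prebuilt list of slices works too. A minimal sketch (these slices are half-open, like np.arange):
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'A': np.random.rand(100), 'B': np.random.rand(100), 'C': np.random.rand(100)})

# A predefined collection of slices, flattened into one index array
slices = [slice(1, 4), slice(42, 44)]
idx = np.r_[tuple(slices)]  # array([ 1,  2,  3, 42, 43])

df.loc[idx, ['B', 'C']]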

A bit late, but it might also help others:
pd.concat([df.loc[sl, ['B', 'C']] for sl in [slice(1, 4), slice(42, 44)]])
This also works when you're dealing with other kinds of slices, e.g. time windows.

You can do:
df.loc[list(range(1, 4)) + list(range(42, 44)), ['B', 'C']]
which took about a quarter of the time of your np.hstack option.
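If you want to reproduce that comparison on your own data, a quick timeit sketch (absolute numbers will of course vary by machine and frame size):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 3), columns=list('ABC'))

# The two candidate index constructions for the same selection
hstack_sel = lambda: df.loc[np.hstack((np.arange(1, 4), np.arange(42, 44))), ['B', 'C']]
list_sel = lambda: df.loc[list(range(1, 4)) + list(range(42, 44)), ['B', 'C']]

print(timeit.timeit(hstack_sel, number=1000))
print(timeit.timeit(list_sel, number=1000))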

Related

Adding 2D array to DataFrame

I am trying to create a DataFrame from these two lists
a = ['a', 'b', 'c']
b = [[1,2,3], [4,5], [7,8,9]]
df = pd.DataFrame(a, columns=['First'])
df['Second'] = b
df
This is the output I got-
First Second
0 a [1, 2, 3]
1 b [4, 5]
2 c [7, 8, 9]
How can I get rid of the [ ] brackets to get my expected output?
My expected output is
First Second
0 a 1, 2, 3
1 b 4, 5
2 c 7, 8, 9
Convert the column to a string type and strip the square brackets
df['Second'] = df['Second'].astype(str).str.strip('[]')
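For the sample frame above, this should yield comma-and-space separated strings, since str() of a list renders ', ' between items. A quick self-contained check:
import pandas as pd

df = pd.DataFrame({'First': ['a', 'b', 'c']})
df['Second'] = [[1, 2, 3], [4, 5], [7, 8, 9]]

# Stringify each list, then strip only the enclosing brackets
df['Second'] = df['Second'].astype(str).str.strip('[]')
print(df['Second'].tolist())  # ['1, 2, 3', '4, 5', '7, 8, 9']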
What are you trying to achieve here? A column whose values are lists of numbers, but displayed without looking like lists? That seems a bit counter-intuitive. You can convert the values to strings to get rid of the square brackets of the list representation.
c = [", ".join(str(x) for x in y) for y in b]
df['Second'] = c
This will get rid of the brackets. But I am not really sure what the use of it is, or what your actual use case would be.
One option is to first process the list of lists to convert it to the required type.
The question about the usefulness of storing comma-separated numbers in a dataframe still remains, as you won't be able to perform any computations on them.
a = ['a', 'b', 'c']
b = [[1,2,3], [4,5], [7,8,9]]
df = pd.DataFrame(a, columns=['First'])
df['Second'] = [','.join(map(str, i)) for i in b]
First Second
0 a 1,2,3
1 b 4,5
2 c 7,8,9

Map index of numpy matrix

How should I map indices of a numpy matrix?
For example:
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6]])
The row/column indices are 0, 1, 2.
So:
>>> mx[0,0]
5
Let's say I need to map these indices, converting 0, 1, 2 into e.g. 10, 'A', 'B', so that:
mx[10,10] #returns 5
mx[10,'A'] #returns 6 and so on..
I can just set up a dict and use it to access the elements, but I would like to know if it is possible to do something like what I just described.
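For reference, the dict workaround mentioned above might look like this (idx is a hypothetical mapping from the new labels to positions):
import numpy as np

mx = np.matrix([[5, 6, 2], [3, 3, 7], [0, 1, 6]])

# Hypothetical label-to-position mapping
idx = {10: 0, 'A': 1, 'B': 2}

print(mx[idx[10], idx[10]])   # 5
print(mx[idx[10], idx['A']])  # 6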
I would suggest putting the data into a pandas dataframe, using the new mappings as the index and columns for row and column indexing respectively. This makes indexing easy and lets us select a single element, or an entire row or column, with the familiar colon operator.
Consider a generic (non-square) 4x3 matrix -
mx = np.matrix([[5,6,2],[3,3,7],[0,1,6],[4,5,2]])
Consider the mappings for rows and columns -
row_idx = [10, 'A', 'B','C']
col_idx = [10, 'A', 'B']
Let's take a look at the workflow with the given sample -
# Get data into dataframe with given mappings
In [57]: import pandas as pd
In [58]: df = pd.DataFrame(mx,index=row_idx, columns=col_idx)
# Here's how dataframe data looks like
In [60]: df
Out[60]:
    10  A  B
10   5  6  2
A    3  3  7
B    0  1  6
C    4  5  2
# Get one scalar element
In [61]: df.loc['C',10]
Out[61]: 4
# Get one entire col
In [63]: df.loc[:,10].values
Out[63]: array([5, 3, 0, 4])
# Get one entire row
In [65]: df.loc['A'].values
Out[65]: array([3, 3, 7])
And best of all, we are not making any extra copies, as the dataframe and its slices still index into the original matrix/array memory -
In [98]: np.shares_memory(mx,df.loc[:,10].values)
Out[98]: True
Try this:
import numpy as np

# Build a structured array whose named fields act like column labels
A = np.array(((1, 2), (3, 4), (50, 100)))
dt = np.dtype([('ID', np.int32), ('Ring', np.int32)])
B = np.array(list(map(tuple, A)), dtype=dt)

# Columns are now accessible by name instead of position
print(B['ID'])  # [ 1  3 50]
You can use the __getitem__ and __setitem__ special methods and create a new class as shown.
Store the index map as a dictionary in an instance variable self.index_map.
import numpy as np

class Matrix(np.matrix):
    def __init__(self, lis):
        self.matrix = np.matrix(lis)
        self.index_map = {}

    def setIndexMap(self, index_map):
        self.index_map = index_map

    def getIndex(self, key):
        if type(key) is slice:
            return key
        elif key not in self.index_map.keys():
            return key
        else:
            return self.index_map[key]

    def __getitem__(self, idx):
        return self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])]

    def __setitem__(self, idx, value):
        self.matrix[self.getIndex(idx[0]), self.getIndex(idx[1])] = value
Usage:
Creating a matrix.
>>> mx = Matrix([[5,6,2],[3,3,7],[0,1,6]])
>>> mx
Matrix([[5, 6, 2],
        [3, 3, 7],
        [0, 1, 6]])
Defining the Index Map.
>>> mx.setIndexMap({10:0, 'A':1, 'B':2})
Different ways to index the matrix.
>>> mx[0,0]
5
>>> mx[10,10]
5
>>> mx[10,'A']
6
It also handles slicing as shown.
>>> mx[1:3, 1:3]
matrix([[3, 7],
        [1, 6]])

Align python arrays with missing data

I have some time series data, say:
# [ [time] [ data ] ]
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4]['f','g','h']]
and I would like an output with some filler value, let's say None for now:
a_new = [[0,1,2,3,4],['a','b','c','d','e']]
b_new = [[0,1,2,3,4],['f',None,None,'g','h']]
Is there a built-in function in python/numpy to do this (or something like it)? Basically I would like to have all of my time vectors be of equal size so I can calculate statistics (np.mean) and deal with the missing data accordingly.
How about this? (I'm assuming your definition of b was a typo, and I'm also assuming you know in advance how many entries you want.)
>>> b = [[0,3,4], ['f','g','h']]
>>> b_new = [list(range(5)), [None] * 5]
>>> for index, value in zip(*b): b_new[1][index] = value
>>> b_new
[[0, 1, 2, 3, 4], ['f', None, None, 'g', 'h']]
smarx has a fine answer, but pandas was made exactly for things like this.
# your data
a = [[0,1,2,3,4],['a','b','c','d','e']]
b = [[0,3,4],['f','g','h']]
# make an empty DataFrame (can do this faster but I'm going slow so you see how it works)
df_a = pd.DataFrame()
df_a['time'] = a[0]
df_a['A'] = a[1]
df_a.set_index('time',inplace=True)
# same for b (a faster way this time)
df_b = pd.DataFrame({'B':b[1]}, index=b[0])
# now merge the two Series together (the NaNs are in the right place)
df = pd.merge(df_a, df_b, left_index=True, right_index=True, how='outer')
In [28]: df
Out[28]:
A B
0 a f
1 b NaN
2 c NaN
3 d g
4 e h
Now the fun is just beginning. Within a DataFrame you can:
- compute all of your summary statistics (e.g. df.mean())
- make plots (e.g. df.plot())
- slice/dice your data basically however you want (e.g. df.groupby())
- fill in or drop missing data using a specified method (e.g. df.fillna())
- take quarterly or monthly averages (e.g. df.resample()) and a lot more.
If you're just getting started (sorry for the infomercial if you aren't), I recommend reading 10 minutes to pandas for a quick overview.
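A footnote on the pandas route: when the full time axis is known up front, reindex does the alignment in one step. A minimal sketch:
import pandas as pd

b = [[0, 3, 4], ['f', 'g', 'h']]

# Build a Series indexed by time, then align it to the full axis;
# missing positions are filled with NaN
s_b = pd.Series(b[1], index=b[0]).reindex(range(5))
print(s_b.tolist())  # ['f', nan, nan, 'g', 'h']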
Here's a vectorized NumPythonic approach -
def align_arrays(A):
    time, data = A
    time_new = np.arange(np.max(time)+1)
    data_new = np.full(time_new.size, None, dtype=object)
    data_new[np.in1d(time_new, time)] = data
    return time_new, data_new
Sample runs -
In [113]: a = [[0,1,2,3,4],['a','b','c','d','e']]
In [114]: align_arrays(a)
Out[114]: (array([0, 1, 2, 3, 4]), array(['a', 'b', 'c', 'd', 'e'], dtype=object))
In [115]: b = [[0,3,4],['f','g','h']]
In [116]: align_arrays(b)
Out[116]: (array([0, 1, 2, 3, 4]), array(['f', None, None, 'g', 'h'], dtype=object))

pandas selection by callables

EDIT: Turns out my initial question was simply a versionitis issue. However, in the course of answering my initial question a few other questions were addressed, so I've reworded the questions and listed them below:
I'm familiarizing myself with some pandas capabilities, namely selection by callables. The docs advise use of lambda functions, e.g. to extract all samples in dataframe df1 with value > 0 for feature 'A':
df1.loc[lambda df: df.A > 0, :]
Is there a more compact, pythonic way to do this?
Let's say df1 is now a dataframe with feature A, but the values are mixed doubles and triples (2- and 3-tuples). How can I extract the samples which contain only doubles? I tried doing this as df1.loc[len(df1.A)>2,:], but it's clear that pandas doesn't broadcast the values the way I expect.
You have to restart the IDE.
For your other question, use apply with len:
import pandas as pd
data = {'A': [(1,2), (1,2), (1,2), (1,2), (1,2,4), (1,2,3)],
'B': [13, 98, 23, 45, 64, 10]}
df = pd.DataFrame(data)
print(df)
A B
0 (1, 2) 13
1 (1, 2) 98
2 (1, 2) 23
3 (1, 2) 45
4 (1, 2, 4) 64
5 (1, 2, 3) 10
print(df[df.A.apply(len) > 2])
A B
4 (1, 2, 4) 64
5 (1, 2, 3) 10
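A possibly more compact variant of the same filter uses the .str accessor; its len() is documented to work on sequence elements such as tuples and lists, not just strings:
import pandas as pd

df = pd.DataFrame({'A': [(1, 2), (1, 2, 3)], 'B': [13, 10]})

# .str.len() returns the length of each element, tuples included
print(df[df.A.str.len() > 2])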
You can do what you want to do without the lambda function, as follows:
df1.loc[df1.A>0,:]
Perhaps the docs are outdated.
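Another compact option, if the column names are valid identifiers, is DataFrame.query, which evaluates a boolean expression against the columns:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': np.random.randn(5)})

# Equivalent to df1.loc[df1.A > 0, :]
print(df1.query('A > 0'))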

How do I do a SQL style disjoint or set difference on two Pandas DataFrame objects?

I'm trying to use Pandas to solve an issue courtesy of an idiot DBA not doing a backup of a now crashed data set, so I'm trying to find differences between two columns. For reasons I won't get into, I'm using Pandas rather than a database.
What I'd like to do is, given:
Dataset A = [A, B, C, D, E]
Dataset B = [C, D, E, F]
I would like to find values which are disjoint.
Dataset A!=B = [A, B, F]
In SQL, this is standard set logic; the exact approach differs by dialect, but it's a standard function. How do I elegantly apply this in Pandas? I would love to include some code, but nothing I have is even remotely correct. It's a situation in which I don't know what I don't know... Pandas has set logic for intersection and union, but nothing for disjoint/set difference.
Thanks!
You can use the set.symmetric_difference method:
In [1]: df1 = pd.DataFrame(list('ABCDE'), columns=['x'])
In [2]: df1
Out[2]:
x
0 A
1 B
2 C
3 D
4 E
In [3]: df2 = pd.DataFrame(list('CDEF'), columns=['y'])
In [4]: df2
Out[4]:
y
0 C
1 D
2 E
3 F
In [5]: set(df1.x).symmetric_difference(df2.y)
Out[5]: set(['A', 'B', 'F'])
Here's a solution for multiple columns, probably not very efficient, I would love to get some feedback on making this faster:
input = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'a', 'c']})
limit = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
def set_difference(input_set, limit_on_set):
    limit_on_set_sub = limit_on_set[['A', 'B']]
    limit_on_tuples = [tuple(x) for x in limit_on_set_sub.values]
    limit_on_dict = dict.fromkeys(limit_on_tuples, 1)
    entries_in_limit = input_set.apply(
        lambda row: (row['A'], row['B']) in limit_on_dict, axis=1)
    return input_set[~entries_in_limit]
>>> set_difference(input, limit)
   A  B
1  2  a
3  3  a
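For the multi-column case, an outer merge with indicator=True gives the symmetric difference directly (a sketch; it assumes you merge on all shared columns and that neither frame has duplicate rows):
import pandas as pd

df_a = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df_b = pd.DataFrame({'A': [2, 3, 4], 'B': ['b', 'c', 'd']})

# Outer merge on all shared columns, tagging each row's origin
m = df_a.merge(df_b, how='outer', indicator=True)

# Keep rows found in only one of the two frames
disjoint = m[m['_merge'] != 'both'].drop(columns='_merge')
print(disjoint)
#    A  B
# 0  1  a
# 3  4  d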
