Selecting a range of columns in a dataframe - python

I have a dataset that consists of columns 0 to 10, and I would like to extract only the information in columns 1 to 5 (skipping 6) and columns 7 to 9 (that is, excluding the last column). So far, I have tried the following:
A = B[:, [[1:5], [7:-1]]]
but I got a syntax error. How can I retrieve that data?

Advanced indexing doesn't take a list of lists of slices. Instead, you can use numpy.r_. This function doesn't accept negative indices (they are not interpreted relative to the end of the array), but you can get around this by using np.ndarray.shape:
A = B[:, np.r_[1:6, 7:B.shape[1]-1]]
Remember to add 1 to each stop value, since a:b does not include b, just as slice(a, b) does not include b. Also note that indexing begins at 0.
Here's a demo:
import numpy as np
B = np.random.randint(0, 10, (3, 11))
print(B)
[[5 8 8 8 3 0 7 2 1 6 7]
[4 3 8 7 3 7 5 6 0 5 7]
[1 0 4 0 2 2 5 1 4 2 3]]
A = B[:,np.r_[1:6, 7:B.shape[1]-1]]
print(A)
[[8 8 8 3 0 2 1 6]
[3 8 7 3 7 6 0 5]
[0 4 0 2 2 1 4 2]]

Another way would be to get your slices independently, and then concatenate:
A = np.concatenate([B[:, 1:6], B[:, 7:-1]], axis=1)
Using similar example data to @jpp's:
B = np.random.randint(0, 10, (3, 10))
>>> B
array([[0, 5, 0, 6, 8, 5, 9, 3, 2, 0],
[8, 8, 1, 7, 3, 5, 7, 7, 4, 8],
[5, 5, 5, 2, 3, 1, 6, 4, 9, 6]])
A = np.concatenate([B[:, 1:6], B[:, 7:-1]], axis=1)
>>> A
array([[5, 0, 6, 8, 5, 3, 2],
[8, 1, 7, 3, 5, 7, 4],
[5, 5, 2, 3, 1, 4, 9]])

How about taking the union of the two ranges?
B[:, np.union1d(range(1,6), range(7,10))]
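For example, with the same 3 x 11 B as in the demo above, a minimal sketch (np.union1d returns the sorted union of the two index ranges):
import numpy as np

B = np.random.randint(0, 10, (3, 11))
cols = np.union1d(range(1, 6), range(7, 10))  # array([1, 2, 3, 4, 5, 7, 8, 9])
A = B[:, cols]
print(A.shape)  # (3, 8)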

Just to add some of my thoughts. There are two approaches you can take, using either numpy or pandas. I will demonstrate with some data, assuming it represents a student's grades in the different courses they are enrolled in.
import pandas as pd
import numpy as np
data = {'Course A': [84, 82, 81, 89, 73, 94, 92, 70, 88, 95],
        'Course B': [85, 82, 72, 77, 75, 89, 95, 84, 77, 94],
        'Course C': [97, 94, 93, 95, 88, 82, 78, 84, 69, 78],
        'Course D': [84, 82, 81, 89, 73, 94, 92, 70, 88, 95],
        'Course E': [85, 82, 72, 77, 75, 89, 95, 84, 77, 94],
        'Course F': [97, 94, 93, 95, 88, 82, 78, 84, 69, 78]}
df = pd.DataFrame(data=data)
df.head()
CA CB CC CD CE CF
0 84 85 97 84 85 97
1 82 82 94 82 82 94
2 81 72 93 81 72 93
3 89 77 95 89 77 95
4 73 75 88 73 75 88
NOTE: CA through CF represent Course A through Course F.
To help us remember column names and their associated indices, we can build a list of columns and their indices via a list comprehension.
map_cols = [f"{c[0]}:{c[1]}" for c in enumerate(df.columns)]
['0:Course A',
'1:Course B',
'2:Course C',
'3:Course D',
'4:Course E',
'5:Course F']
Now, to select, say, Course A and Course D through Course F using numpy-style indexing, you can do the following:
df.iloc[:, np.r_[0, 3:df.shape[1]]]
CA CD CE CF
0 84 84 85 97
1 82 82 82 94
2 81 81 72 93
3 89 89 77 95
4 73 73 75 88
You can also use pandas to the same effect.
df[[df.columns[0], *df.columns[3:]]]
CA CD CE CF
0 84 84 85 97
1 82 82 82 94
2 81 81 72 93
3 89 89 77 95
4 73 73 75 88
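If you prefer label-based selection, .loc also accepts a label slice. Here is a minimal sketch (note that, unlike positional slices, label slices include both endpoints):
import pandas as pd

selected = pd.concat([df[['Course A']], df.loc[:, 'Course D':'Course F']], axis=1)
print(selected.columns.tolist())
# ['Course A', 'Course D', 'Course E', 'Course F']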

You can solve this by summing two range lists:
[In]: columns = list(range(1,6)) + list(range(7,10))
[Out]:
[1, 2, 3, 4, 5, 7, 8, 9]
Then, assuming your DataFrame is called df, use iloc to select those columns:
newdf = df.iloc[:, columns]
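As a quick check, here is a minimal sketch on a throwaway 11-column DataFrame (the names df and newdf are just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 10, (3, 11)))
columns = list(range(1, 6)) + list(range(7, 10))  # skip column 6 and the last column
newdf = df.iloc[:, columns]
print(newdf.columns.tolist())  # [1, 2, 3, 4, 5, 7, 8, 9]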

Related

Which API can implement tensor expansion in TensorFlow?

If I have a tensor of shape (30, 40, 50) and I want to flatten it along the first axis into a second-order tensor of shape (30, 2000), I don't know whether TensorFlow has an API that implements this.
import tensorflow as tf
import numpy as np

data1 = tf.constant([
    [[2, 5, 7, 8], [6, 4, 9, 10], [14, 16, 86, 54]],
    [[16, 43, 65, 76], [43, 65, 7, 24], [15, 75, 23, 75]]])
data5 = tf.reshape(data1, [3, 8])
data2, data3, data4 = tf.split(data1, 3, 1)
data6 = tf.reshape(data2, [1, 8])
data7 = tf.reshape(data3, [1, 8])
data8 = tf.reshape(data4, [1, 8])
data9 = tf.concat([data6, data7, data8], 0)
with tf.Session() as sess:
    print(sess.run(data5))
    print(sess.run(data9))
This gives:
data5
[[ 2 5 7 8 6 4 9 10]
[14 16 86 54 16 43 65 76]
[43 65 7 24 15 75 23 75]]
data9
[[ 2 5 7 8 16 43 65 76]
[ 6 4 9 10 43 65 7 24]
[14 16 86 54 15 75 23 75]]
How do I get data9 directly?
Looks like you're trying to take the sub-tensors ranging across axis 0 (data1[0], data1[1], ...) and concatenate them along axis 2.
Transposing before reshaping should do the trick:
tf.reshape(tf.transpose(data1, [1,0,2]), [data1.shape[1], data1.shape[0] * data1.shape[2]])
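Putting that together with the tensors above, a minimal sketch (assuming the TF 1.x tf.Session API used in the question):
data9 = tf.reshape(tf.transpose(data1, [1, 0, 2]),
                   [data1.shape[1], data1.shape[0] * data1.shape[2]])
with tf.Session() as sess:
    print(sess.run(data9))
# [[ 2  5  7  8 16 43 65 76]
#  [ 6  4  9 10 43 65  7 24]
#  [14 16 86 54 15 75 23 75]]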
You can try:
data9 = tf.layers.flatten(tf.transpose(data1, perm=[1, 0, 2]))
Output:
array([[ 2, 5, 7, 8, 16, 43, 65, 76],
[ 6, 4, 9, 10, 43, 65, 7, 24],
[14, 16, 86, 54, 15, 75, 23, 75]], dtype=int32)

python resampling from two dataframes

I have two data frames
import pandas as pd
df = pd.DataFrame({'x': [10, 47, 58, 68, 75, 80],
'y': [10, 9, 8, 7, 6, 5]})
df2 = pd.DataFrame({'x': [45, 55, 66, 69, 79, 82], 'y': [10, 9, 8, 7, 6, 5]})
df
x y
10 10
47 9
58 8
68 7
75 6
80 5
df2
x y
45 10
55 9
66 8
69 7
79 6
82 5
I want to interpolate between them and generate a new data frame with a sampling rate of N. Assume N=3 for this example.
The desired output is
x y
10 10
27.5 10
45 10
...
75 6
77 6
79 6
80 5
81 5
82 5
How can I use my data frames to create the desired output?
If you don't mind using numpy, this solution will give you your desired output:
import pandas as pd
import numpy as np
N = 3
df = pd.DataFrame({'x': [10, 47, 58, 68, 75, 80],
'y': [10, 9, 8, 7, 6, 5]})
df2 = pd.DataFrame({'x': [45, 55, 66, 69, 79, 82], 'y': [10, 9, 8, 7, 6, 5]})
new_x = np.array([np.linspace(i, j, N) for i, j in zip(df['x'], df2['x'])]).flatten()
new_y = df['y'].loc[np.repeat(df.index.values, N)].values
final_df = pd.DataFrame({'x': new_x, 'y': new_y})
print(final_df)
Output
x y
0 10.0 10
1 27.5 10
2 45.0 10
3 47.0 9
...
15 80.0 5
16 81.0 5
17 82.0 5

How to get the N maximum values per row in a numpy ndarray?

We know how to do it when N = 1
import numpy as np
m = np.arange(15).reshape(3, 5)
m[range(len(m)), m.argmax(axis=1)]  # array([ 4, 9, 14])
What is the best way to get the top N, when N > 1? (say, 5)
Doing a partial sort using np.partition can be much cheaper than a full sort:
gen = np.random.RandomState(0)
x = gen.permutation(100)
# full sort
print(np.sort(x)[-10:])
# [90 91 92 93 94 95 96 97 98 99]
# partial sort such that the largest 10 items are in the last 10 indices
print(np.partition(x, -10)[-10:])
# [90 91 93 92 94 96 98 95 97 99]
If you need the largest N items to be sorted, you can call np.sort on the last N elements in your partially sorted array:
print(np.sort(np.partition(x, -10)[-10:]))
# [90 91 92 93 94 95 96 97 98 99]
This can still be much faster than a full sort on the whole array, provided your array is sufficiently large.
To sort across each row of a two-dimensional array you can use the axis= arguments to np.partition and/or np.sort:
y = np.repeat(np.arange(100)[None, :], 5, 0)
gen.shuffle(y.T)
# partial sort, followed by a full sort of the last 10 elements in each row
print(np.sort(np.partition(y, -10, axis=1)[:, -10:], axis=1))
# [[90 91 92 93 94 95 96 97 98 99]
# [90 91 92 93 94 95 96 97 98 99]
# [90 91 92 93 94 95 96 97 98 99]
# [90 91 92 93 94 95 96 97 98 99]
# [90 91 92 93 94 95 96 97 98 99]]
Benchmarks:
In [1]: %%timeit x = np.random.permutation(10000000)
...: np.sort(x)[-10:]
...:
1 loop, best of 3: 958 ms per loop
In [2]: %%timeit x = np.random.permutation(10000000)
np.partition(x, -10)[-10:]
....:
10 loops, best of 3: 41.3 ms per loop
In [3]: %%timeit x = np.random.permutation(10000000)
np.sort(np.partition(x, -10)[-10:])
....:
10 loops, best of 3: 78.8 ms per loop
Why not do something like:
np.sort(m)[:,-N:]
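A minimal sketch with the m from the question (using N = 2, since each row only has 5 elements):
import numpy as np

m = np.arange(15).reshape(3, 5)
N = 2
print(np.sort(m)[:, -N:])
# [[ 3  4]
#  [ 8  9]
#  [13 14]]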
partition, sort, argsort, etc. take an axis parameter.
Let's shuffle some values
In [161]: A=np.arange(24)
In [162]: np.random.shuffle(A)
In [163]: A=A.reshape(4,6)
In [164]: A
Out[164]:
array([[ 1, 2, 4, 19, 12, 11],
[20, 5, 13, 21, 22, 3],
[10, 6, 16, 18, 17, 8],
[23, 9, 7, 0, 14, 15]])
Partition:
In [165]: A.partition(4,axis=1)
In [166]: A
Out[166]:
array([[ 2, 1, 4, 11, 12, 19],
[ 5, 3, 13, 20, 21, 22],
[ 6, 8, 10, 16, 17, 18],
[14, 7, 9, 0, 15, 23]])
the 4 smallest values of each row are first, the 2 largest last; slice to get an array of the 2 largest:
In [167]: A[:,-2:]
Out[167]:
array([[12, 19],
[21, 22],
[17, 18],
[15, 23]])
Sort is probably slower, but on a small array like this it probably doesn't matter much. Plus, it lets you pick any N.
In [169]: A.sort(axis=1)
In [170]: A
Out[170]:
array([[ 1, 2, 4, 11, 12, 19],
[ 3, 5, 13, 20, 21, 22],
[ 6, 8, 10, 16, 17, 18],
[ 0, 7, 9, 14, 15, 23]])

Remove duplicate values from entire dataframe

I have a Pandas DataFrame as follows;
data = pd.DataFrame({'A':[1,2,3,1,23,3,76,2,45,76],'B':[12,56,22,45,1,3,98,79,77,67]})
To remove duplicate values from the dataframe I have done this;
set(data['A'].unique()).union(set(data['B'].unique()))
which results in;
set([1, 2, 3, 12, 76, 77, 79, 67, 22, 23, 98, 45, 56])
Is there a better way of doing this? Is there a way of achieving this by using drop_duplicates?
Edit:
Also, what if I had two more columns 'C' and 'D' but needed to drop duplicates only from 'A' and 'B'?
If you are intent on collapsing this into a flat array of unique values:
In [10]: np.unique(data.values.ravel())
Out[10]: array([ 1, 2, 3, 12, 22, 23, 45, 56, 67, 76, 77, 79, 98])
This will work as well
In [12]: data.unstack().drop_duplicates()
Out[12]:
A 0 1
1 2
2 3
4 23
6 76
8 45
B 0 12
1 56
2 22
6 98
7 79
8 77
9 67
dtype: int64
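Regarding the edit: a minimal sketch, assuming extra columns 'C' and 'D' exist and you only want the unique values from 'A' and 'B', is to restrict the column selection before collapsing:
np.unique(data[['A', 'B']].values.ravel())
# array([ 1,  2,  3, 12, 22, 23, 45, 56, 67, 76, 77, 79, 98])
The drop_duplicates variant works the same way: data[['A', 'B']].unstack().drop_duplicates()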

reshaping ndarrays versus regular arrays in numpy?

I have an object of type 'numpy.ndarray', called "myarray", that when printed to the screen using python's "print" looks like this:
[[[ 84 0 213 232] [153 0 304 363]]
[[ 33 0 56 104] [ 83 0 77 238]]
[[ 0 0 9 61] [ 0 0 2 74]]]
"myarray" is made by another library. The value of myarray.shape equals (3, 2). I expected this to be a 3dimensional array, with three indices. When I try to make this structure myself, using:
second_array = array([[[84, 0, 213, 232], [153, 0, 304, 363]],
[[33, 0, 56, 104], [83, 0, 77, 238]],
[[0, 0, 9, 61], [0, 0, 2, 74]]])
I get that second_array.shape is equal to (3, 2, 4), as expected. Why is there this difference? Also, given this, how can I reshape "myarray" so that the two columns are merged, i.e. so that the result is:
[[[ 84 0 213 232 153 0 304 363]]
[[ 33 0 56 104 83 0 77 238]]
[[ 0 0 9 61 0 0 2 74]]]
Edit: to clarify, I know that in the case of second_array, I can do second_array.reshape((3,8)). But how does this work for the ndarray which has the format of myarray but does not have a 3d index?
myarray.dtype is "object" but can be changed to be ndarray too.
Edit 2: Getting closer, but still cannot quite get the ravel/flatten followed by reshape. I have:
a = array([[1, 2, 3],
[4, 5, 6]])
b = array([[ 7, 8, 9],
[10, 11, 12]])
arr = array([a, b])
I try:
arr.ravel().reshape((2,6))
But this gives [[1, 2, 3, 4, 5, 6], ...] and I wanted [[1, 2, 3, 7, 8, 9], ...]. How can this be done?
thanks.
Indeed, ravel and hstack can be useful tools for reshaping arrays:
import numpy as np
myarray = np.empty((3,2),dtype = object)
myarray[:] = [[np.array([ 84, 0, 213, 232]), np.array([153, 0, 304, 363])],
[np.array([ 33, 0, 56, 104]), np.array([ 83, 0, 77, 238])],
[np.array([ 0, 0, 9, 61]), np.array([ 0, 0, 2, 74])]]
myarray = np.hstack(myarray.ravel()).reshape(3,2,4)
print(myarray)
# [[[ 84 0 213 232]
# [153 0 304 363]]
# [[ 33 0 56 104]
# [ 83 0 77 238]]
# [[ 0 0 9 61]
# [ 0 0 2 74]]]
myarray = myarray.ravel().reshape(3,8)
print(myarray)
# [[ 84 0 213 232 153 0 304 363]
# [ 33 0 56 104 83 0 77 238]
# [ 0 0 9 61 0 0 2 74]]
Regarding Edit 2:
import numpy as np
a = np.array([[1, 2, 3],
[4, 5, 6]])
b = np.array([[ 7, 8, 9],
[10, 11, 12]])
arr = np.array([a, b])
print(arr)
# [[[ 1 2 3]
# [ 4 5 6]]
# [[ 7 8 9]
# [10 11 12]]]
Notice that
In [45]: arr[:,0,:]
Out[45]:
array([[1, 2, 3],
[7, 8, 9]])
Since you want the first row to be [1,2,3,7,8,9], the above shows that you want the second axis to be the first axis. This can be accomplished with the swapaxes method:
print(arr.swapaxes(0,1).reshape(2,6))
# [[ 1 2 3 7 8 9]
# [ 4 5 6 10 11 12]]
Or, given a and b (equivalently, arr[0] and arr[1]), you could form the desired 2x6 result directly with np.hstack:
arr = np.hstack([a, b])
# [[ 1 2 3 7 8 9]
# [ 4 5 6 10 11 12]]
