Turning a sequence/two-dimensional array into a DataFrame column in pandas - python

So I use my previously trained model to predict new data:
y_pred_aplikasi = model.predict(X_aplikasi)
y_pred_aplikasi
It returns
array([[7.7066602e-07, 9.9993092e-01, 4.6858725e-07],
[7.1568817e-02, 4.3571211e-07, 7.3567069e-01],
[9.8825598e-01, 6.3803792e-03, 4.4066067e-07],
...,
[3.8332163e-15, 1.0000000e+00, 1.4775689e-11],
[1.8400473e-14, 1.0000000e+00, 6.1960957e-11],
[7.0748132e-01, 5.9783965e-02, 5.7850748e-02]], dtype=float32)
I want to turn that array into something like this, where the largest value in each row becomes 1 and the rest become 0:
A B C
0 1 0
0 0 1
1 0 0
....
1 0 0
0 0 1
1 0 0
How can I achieve this with pandas?

Considering this to be your array:
In [841]: a
Out[841]:
array([[7.7066602e-07, 9.9993092e-01, 4.6858725e-07],
[7.1568817e-02, 4.3571211e-07, 7.3567069e-01],
[9.8825598e-01, 6.3803792e-03, 4.4066067e-07],
[3.8332163e-15, 1.0000000e+00, 1.4775689e-11],
[1.8400473e-14, 1.0000000e+00, 6.1960957e-11],
[7.0748132e-01, 5.9783965e-02, 5.7850748e-02]])
Convert the above array into a DataFrame using the pd.DataFrame constructor:
In [851]: df = pd.DataFrame(a, columns=['A', 'B', 'C'])
In [852]: df
Out[852]:
A B C
0 7.706660e-07 9.999309e-01 4.685873e-07
1 7.156882e-02 4.357121e-07 7.356707e-01
2 9.882560e-01 6.380379e-03 4.406607e-07
3 3.833216e-15 1.000000e+00 1.477569e-11
4 1.840047e-14 1.000000e+00 6.196096e-11
5 7.074813e-01 5.978397e-02 5.785075e-02
Replace each row's max value with 1 and everything else with 0, using df.eq, df.where, and df.max(axis=1) (df.where(df != 0) masks zeros as NaN, so an all-zero row is not flagged):
In [854]: df = df.eq(df.where(df != 0).max(1), axis=0).astype(int)
In [855]: df
Out[855]:
A B C
0 0 1 0
1 0 0 1
2 1 0 0
3 0 1 0
4 0 1 0
5 1 0 0
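If no row can be all zeros (e.g. softmax probabilities that sum to 1), the df.where masking step can be dropped; a simpler sketch of the same idea:
df.eq(df.max(axis=1), axis=0).astype(int)  # mark each row's max as 1, everything else as 0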

Manually looping through each element could also work, though it may not be feasible for large arrays:
for i in range(len(y_pred_aplikasi)):
    row_max = y_pred_aplikasi[i].max()  # capture the row max before mutating the row
    for j in range(3):
        # for j in range(len(y_pred_aplikasi[i])):  # to be more dynamic
        if y_pred_aplikasi[i][j] == row_max:
            y_pred_aplikasi[i][j] = 1
        else:
            y_pred_aplikasi[i][j] = 0
y_pred_aplikasi.astype(int)
Out[5]:
array([[0, 1, 0],
[0, 0, 1],
[1, 0, 0],
...,
[0, 1, 0],
[0, 1, 0],
[1, 0, 0]])
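For large arrays, a fully vectorized alternative avoids the Python loop entirely (a sketch, not from the answers above; it assumes the raw float predictions and picks one winner per row via argmax):
import numpy as np
import pandas as pd
# row-wise index of the largest value, then one-hot encode it via an identity matrix
one_hot = np.eye(y_pred_aplikasi.shape[1], dtype=int)[y_pred_aplikasi.argmax(axis=1)]
df = pd.DataFrame(one_hot, columns=['A', 'B', 'C'])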

Related

Sum n values of n-lists with the same index in Python

For example, I have 5 lists of 10 elements each, generated with random values simulating coin tosses.
I get my 5 lists with 10 elements in the following way:
result = [0, 1]  # 0 is tail, 1 is head
probability = [1/2, 1/2]
N = 10
list = []
def list_generator(number):  # this number would be 5 in this case
    for i in range(number):
        n_round = np.array(rnd.choices(result, probability, k=N))
        print(n_round)
list_generator(5)
And for example I would get this
[1 1 0 0 0 1 0 1 1 0]
[0 1 0 0 0 1 1 1 0 1]
[1 1 0 0 1 1 1 0 1 1]
[0 0 0 1 0 0 0 1 0 0]
[0 0 1 1 0 0 0 0 1 1]
How can I sum only the numbers in the same column? That is, I would like a list whose first element is the sum of the first column (1+0+1+0+0), whose second element is the sum of the second column (1+1+1+0+0), and so on for all ten coin tosses.
(I need it in a list because I will use this to plot a graph)
I have thought about building a matrix from the arrays and summing only the nth column, appending that value to the list, but I do not know how to do that; I do not have much knowledge about using arrays.
Have your function return a 2d numpy array and then sum along the required axis. Separately, you don't need to pass probability to random.choices as equal probabilities are the default.
import random
import numpy as np
def list_generator(number):
    return np.array([np.array(random.choices([0,1], k=10)) for i in range(number)])
a = list_generator(5)
>>> a
array([[0, 1, 1, 1, 0, 1, 1, 0, 0, 0],
[1, 0, 1, 0, 1, 1, 1, 1, 1, 0],
[1, 1, 0, 1, 1, 1, 0, 0, 1, 1],
[1, 1, 0, 0, 1, 1, 1, 1, 0, 0],
[0, 1, 1, 0, 0, 1, 1, 1, 0, 0]])
>>> a.sum(axis=0)
array([3, 4, 3, 2, 3, 5, 4, 3, 2, 1])
You can use numpy.random.randint to generate your randomized data. Then use sum to get the sum of the columns:
import numpy as np
N = 10
data = np.random.randint(2, size=(N, N))
print(data)
print(data.sum(axis=0))
[[1 0 1 1 1 1 0 0 1 1]
[0 0 1 1 0 0 1 1 1 0]
[1 1 0 1 1 1 0 0 1 1]
[1 1 0 0 0 0 1 1 1 1]
[1 0 0 1 1 1 0 1 1 1]
[1 0 1 1 0 1 0 1 1 1]
[0 0 0 1 0 1 0 1 1 0]
[0 0 0 1 0 1 0 1 0 1]
[1 0 0 0 1 0 1 0 1 1]
[1 0 1 1 0 1 0 0 0 1]]
[7 2 4 8 4 7 3 6 8 8]
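As an aside, newer NumPy code usually draws random integers through the Generator API instead of the legacy np.random.randint; a minimal sketch (assuming NumPy >= 1.17):
import numpy as np
rng = np.random.default_rng(seed=42)  # seeded only for reproducibility
data = rng.integers(2, size=(5, 10))  # 5 rounds of 10 coin tosses each
print(data.sum(axis=0))               # per-column sums, as above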

Create several new dataframes or dictionaries from one dataframe

I have a dataframe like this:
evt pcle bin_0 bin_1 bin_2 ... bin_49
1 pi 1 0 0 0
1 pi 1 0 0 0
1 k 0 0 0 1
1 pi 0 0 1 0
2 pi 0 0 1 0
2 k 0 1 0 0
3 J 0 1 0 0
3 pi 0 0 0 1
3 pi 1 0 0 0
3 k 0 1 0 0
...
5000 J 0 0 1 0
5000 pi 0 1 0 0
5000 k 0 0 0 1
With this information, I want to create several other dataframes df_{evt} (or would dictionaries be better?):
df_1 :
pcle cant bin_0 bin_1 bin_2 ... bin_49
pi 3 2 0 1 0
k 1 0 0 0 1
df_2 :
pcle cant bin_0 bin_1 bin_2 ... bin_49
pi 1 0 0 1 0
k 0 1 0 0 0
In total there would be 5000 dataframes (one for each evt), where in each of them:
* the column "cant" has the number of occurrences of "pcle" in the particular "evt".
* bin_0 ... bin_49 have the sum of the values for this particular "pcle" in the particular "evt".
What is the best way to achieve this goal?
Here's a possible solution:
import pandas as pd
import numpy as np
columns = ["evt", "pcle", "bin_0", "bin_1", "bin_2", "bin_3"]
data = [[1, "pi", 1, 0, 0, 0],
        [1, "pi", 0, 0, 0, 0],
        [1, "k", 0, 0, 0, 1],
        [1, "pi", 0, 0, 1, 0],
        [2, "pi", 0, 0, 1, 0],
        [2, "k", 0, 1, 0, 0],
        [3, "J", 0, 1, 0, 0],
        [3, "pi", 0, 0, 0, 1],
        [3, "pi", 1, 0, 0, 0],
        [3, "k", 0, 1, 0, 0]]
df = pd.DataFrame(data=data, columns=columns)
# group your data by the columns you want
grouped = df.groupby(["evt", "pcle"])
# compute the aggregates for the bin_X
df_t = grouped.aggregate(np.sum)
# move pcle from index to column
df_t.reset_index(level=["pcle"], inplace=True)
# count occurrences of pcle
df_t["cant"] = grouped.size().values
# filter evt with .loc
df_t.loc[1]
If you want to make it into a dictionary then you can run:
d = {i:j.reset_index(drop=True) for i, j in df_t.groupby(df_t.index)}
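Each event's table can then be looked up by key; hypothetical usage with the sample data above:
print(d[1])           # aggregated rows for evt == 1
print(d[1]["cant"])   # how many times each pcle occurred in evt 1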

Index of identical rows in a NumPy array

I already asked a variation of this question, but I still have a problem regarding the runtime of my code.
Given a NumPy array consisting of 15000 rows and 44 columns, my goal is to find out which rows are equal and to collect their indices in lists, like this:
1 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
1 0 0 0 0
1 2 3 4 5
Result:
equal_rows1 = [1,2,3]
equal_rows2 = [0,4]
What I did up till now is using the following code:
import numpy as np
input_data = np.load('IN.npy')
equal_inputs1 = []
equal_inputs2 = []
for i in range(len(input_data)):
    for j in range(i+1, len(input_data)):
        if np.array_equal(input_data[i], input_data[j]):
            equal_inputs1.append(i)
            equal_inputs2.append(j)
The problem is that it takes a lot of time to return the desired lists, and that this approach only allows for 2 groups of identical rows although there can be more. Is there any better solution, especially regarding the runtime?
This is pretty simple with pandas groupby:
df
A B C D E
0 1 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 1 0 0 0 0
5 1 2 3 4 5
[g.index.tolist() for _, g in df.groupby(df.columns.tolist()) if len(g.index) > 1]
# [[1, 2, 3], [0, 4]]
If you are dealing with many rows and many unique groups, this might get a bit slow. The performance depends on your data. Perhaps there is a faster NumPy alternative, but this is certainly the easiest to understand.
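One such NumPy-only route is np.unique with axis=0 (a sketch, assuming NumPy >= 1.13; it labels each row by its unique-row group and collects the indices):
import numpy as np
arr = df.values  # or the original 15000 x 44 array
# rows that share an inverse label are identical
_, inverse = np.unique(arr, axis=0, return_inverse=True)
inverse = inverse.ravel()  # flatten, in case a newer NumPy returns a non-flat inverse
groups = [np.flatnonzero(inverse == g).tolist() for g in range(inverse.max() + 1)]
print([g for g in groups if len(g) > 1])
# [[1, 2, 3], [0, 4]]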
You can use collections.defaultdict, which retains the row values as keys:
from collections import defaultdict
dd = defaultdict(list)
for idx, row in enumerate(df.values):
    dd[tuple(row)].append(idx)
print(list(dd.values()))
# [[0, 4], [1, 2, 3], [5]]
print(dd)
# defaultdict(<class 'list'>, {(1, 0, 0, 0, 0): [0, 4],
# (0, 0, 0, 0, 0): [1, 2, 3],
# (1, 2, 3, 4, 5): [5]})
You can, if you wish, filter out unique rows via a dictionary comprehension.
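For instance, keeping only groups with more than one row (a minimal sketch following the code above):
dup_only = {k: v for k, v in dd.items() if len(v) > 1}
print(list(dup_only.values()))
# [[0, 4], [1, 2, 3]]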

Numpy - How to shift values at indexes where change happened

So I would like to shift the values in a 1D NumPy array where a change happened, with a configurable shift size.
input = np.array([0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0])
shiftSize = 2
out = np.magic(input, shiftSize)  # "np.magic" stands in for the function I am looking for
print(out)
# desired output:
np.array([0,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,0])
For example, the first switch happens at index 4, so indices 2 and 3 become '1'.
The next switch happens at index 5, so indices 5 and 6 become '1'.
EDIT: It is also important to avoid a for loop, because that might be slow (this is needed for large data sets).
EDIT2: indexes and variable name
I tried np.diff to find where the changes happened, and then np.put, but with multiple index ranges that seems impossible.
Thank you for the help in advance!
What you want is called "binary dilation" and is contained in scipy.ndimage:
import numpy as np
import scipy.ndimage
input = np.array([0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0], dtype=bool)
out = scipy.ndimage.binary_dilation(input, iterations=2).astype(int)
# array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
Nils' answer seems good. Here is an alternative using NumPy only:
import numpy as np
def dilate(ar, amount):
    # Convolve with a kernel as big as the dilation scope
    dil = np.convolve(np.abs(ar), np.ones(2 * amount + 1), mode='same')
    # Crop in case the convolution kernel was bigger than the array
    dil = dil[-len(ar):]
    # Take non-zero entries and convert back to the input dtype
    return (dil != 0).astype(ar.dtype)
# Test
inp = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0])
print(inp)
print(dilate(inp, 2))
Output:
[0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0]
[0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0]
Another NumPy solution:
def dilatation(seed, shift):
    out = seed.copy()
    for sh in range(1, shift + 1):
        out[sh:] |= seed[:-sh]
    for sh in range(-shift, 0):
        out[:sh] |= seed[-sh:]
    return out
Example (shift = 2):
in : [0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0]
out: [0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1]
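A hypothetical call on the question's own input, for completeness (the seed array must have an integer or boolean dtype for |= to work):
seed = np.array([0,0,0,0,1,0,0,0,0,0,1,1,1,0,0,1,0,0,0,0])
print(dilatation(seed, 2))
# [0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0]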

Insert data from one sorted array into another sorted array

I apologize if this has been asked here - I've hunted around here and in the Tentative NumPy Tutorial for an answer.
I have 2 numpy arrays. The first array is similar to:
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
(etc... It's ~700x10 in actuality)
I then have a 2nd array similar to
3 1
4 18
5 2
(again, longer - maybe 400 or so rows)
The first column of the 2nd array is always completely contained within the first column of the first array.
What I'd like to do is insert the 2nd column of the 2nd array into the first array as part of an existing column, i.e.:
array a:
1 0 0 0 0
2 0 0 0 0
3 1 0 0 0
4 18 0 0 0
5 2 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
(I'd be filling in each of those columns in turn, but each covers a different range within the original)
My first try was along the lines of a[b[:,0],1] = b[:,1], which puts them into the indices of b, not the values (i.e., in my example above, instead of filling in rows 3, 4, and 5, I filled in 2, 3, and 4). I should have realized that!
Since then, I've tried to make it work pretty inelegantly with where(), and I think I could make it work by finding the difference in the starting values of the first columns.
I'm new to python, so perhaps I'm overly optimistic - but it seems like there should be a more elegant way and I'm just missing it.
Thanks for any insights!
If the numbers in the first column of a are in sorted order, then you could use
a[a[:,0].searchsorted(b[:,0]),1] = b[:,1]
For example:
import numpy as np
a = np.array([(1,0,0,0,0),
(2,0,0,0,0),
(3,0,0,0,0),
(4,0,0,0,0),
(5,0,0,0,0),
(6,0,0,0,0),
(7,0,0,0,0),
(8,0,0,0,0),
])
b = np.array([(3, 1),
(5, 18),
(7, 2)])
a[a[:,0].searchsorted(b[:,0]),1] = b[:,1]
print(a)
yields
[[ 1 0 0 0 0]
[ 2 0 0 0 0]
[ 3 1 0 0 0]
[ 4 0 0 0 0]
[ 5 18 0 0 0]
[ 6 0 0 0 0]
[ 7 2 0 0 0]
[ 8 0 0 0 0]]
(I changed your example a bit to show that the values in b's first column do not have to be contiguous.)
If a[:,0] is not in sorted order, then you could use np.argsort to work around this:
a = np.array( [(1,0,0,0,0),
(2,0,0,0,0),
(5,0,0,0,0),
(3,0,0,0,0),
(4,0,0,0,0),
(6,0,0,0,0),
(7,0,0,0,0),
(8,0,0,0,0),
])
b = np.array([(3, 1),
(5, 18),
(7, 2)])
perm = np.argsort(a[:,0])
a[:,1][perm[a[:,0][perm].searchsorted(b[:,0])]] = b[:,1]
print(a)
yields
[[ 1 0 0 0 0]
[ 2 0 0 0 0]
[ 5 18 0 0 0]
[ 3 1 0 0 0]
[ 4 0 0 0 0]
[ 6 0 0 0 0]
[ 7 2 0 0 0]
[ 8 0 0 0 0]]
The setup:
a = np.arange(20).reshape(2,10).T
b = np.array([[1, 100], [3, 300], [8, 800]])
This will work if you don't know anything about a[:, 0] except that it is sorted.
index = a[:, 0].searchsorted(b[:, 0])
a[index, 1] = b[:, 1]
print(a)
array([[ 0, 10],
[ 1, 100],
[ 2, 12],
[ 3, 300],
[ 4, 14],
[ 5, 15],
[ 6, 16],
[ 7, 17],
[ 8, 800],
[ 9, 19]])
But if you know that a[:, 0] is a sequence of contiguous integers, like your example, you can compute the positions directly (note the subtraction: a value v sits at row v - a[0, 0]):
index = b[:, 0] - a[0, 0]
a[index, 1] = b[:, 1]
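Since searchsorted returns an insertion position even for keys absent from a[:, 0], a quick guard before assigning can help (a sketch added here, not part of the original answers):
index = a[:, 0].searchsorted(b[:, 0])
index = np.minimum(index, len(a) - 1)  # keep indices in bounds for the check below
valid = a[index, 0] == b[:, 0]         # True where the key was actually found
a[index[valid], 1] = b[valid, 1]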
