I have some data that looks like:
data = [1,2,4,5,9] (random pattern of increasing integers)
And I want to plot it as a binary horizontal line, so that y = 1 for every x value specified in data and y = 0 otherwise.
I have a few different data arrays that I'd like to stack, similar to this style (this is CCD clocking data, but the plot format looks ideal).
I think I need to create a list of ones for my data array, but how do I specify the zero value for everything not in the array?
Thanks
You've got the right idea. You can create a list with 1 in every position specified in data and 0 elsewhere. This can be done very easily with a list comprehension:
def binary_data(data):
    return [1 if x in data else 0 for x in range(data[-1] + 1)]
which will act like this:
>>> data = [1, 2, 4, 5, 9]
>>> bindata = binary_data(data)
>>> bindata
[0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
Now all you have to do is plot it... or better, step it, since it's binary data and step() looks much better:
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data):
    return [1 if x in data else 0 for x in range(data[-1] + 1)]

data = [1, 2, 4, 5, 9]
bindata = binary_data(data)
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
step(xaxis, yaxis)
show()
To plot multiple data arrays stacked on the same figure you could tweak binary_data() like this:
def binary_data(data, yshift=0):
    return [yshift + 1 if x in data else yshift for x in range(data[-1] + 1)]
so now you can set the yshift parameter to shift the data arrays on the y-axis. E.g.,
>>> data = [1, 2, 4, 5, 9]
>>> bindata1 = binary_data(data)
>>> bindata1
[0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
>>> bindata2 = binary_data(data, 2)
>>> bindata2
[2, 3, 3, 2, 3, 3, 2, 2, 2, 3]
Let's say you have data1, data2 and data3 to plot stacked; you'd go like this:
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data, yshift=0):
    return [yshift + 1 if x in data else yshift for x in range(data[-1] + 1)]

data1 = [1, 2, 4, 5, 9]
bindata1 = binary_data(data1)
x1 = np.arange(0, data1[-1] + 1)
y1 = np.array(bindata1)

data2 = [1, 4, 9]
bindata2 = binary_data(data2, 2)
x2 = np.arange(0, data2[-1] + 1)
y2 = np.array(bindata2)

data3 = [1, 2, 8, 9]
bindata3 = binary_data(data3, 4)
x3 = np.arange(0, data3[-1] + 1)
y3 = np.array(bindata3)

step(x1, y1, x2, y2, x3, y3)
show()
You can easily edit this to make it work with an arbitrary number of data arrays:
data = [[1, 2, 4, 5, 9],
        [1, 4, 9],
        [1, 2, 8, 9]]

for shift, d in enumerate(data):
    bindata = binary_data(d, 2 * shift)
    x = np.arange(0, d[-1] + 1)
    y = np.array(bindata)
    step(x, y)
show()
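If you want to tell the stacked traces apart, one optional tweak (a sketch; the label text is just an example) is to label each step() call and add a legend:

import numpy as np
from matplotlib.pyplot import step, show, legend

def binary_data(data, yshift=0):
    return [yshift + 1 if x in data else yshift for x in range(data[-1] + 1)]

data = [[1, 2, 4, 5, 9],
        [1, 4, 9],
        [1, 2, 8, 9]]

for shift, d in enumerate(data):
    x = np.arange(0, d[-1] + 1)
    y = np.array(binary_data(d, 2 * shift))
    step(x, y, label=f"data{shift + 1}")  # name each trace for the legend
legend()
show()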
Finally, if you are dealing with data arrays of different lengths (say [1, 2] and [15, 16]) and you don't like plots that vanish in the middle of the figure, you can tweak binary_data() again to force its range to the maximum range of your data:
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data, limit, yshift=0):
    return [yshift + 1 if x in data else yshift for x in range(limit)]

data = [[1, 2, 4, 5, 9, 12, 13, 14],
        [1, 4, 10, 11, 20, 21, 22],
        [1, 2, 3, 4, 15, 16, 17, 18]]

# find out the longest data to plot
limit = max(x[-1] + 1 for x in data)

x = np.arange(0, limit)
for shift, d in enumerate(data):
    bindata = binary_data(d, limit, 2 * shift)
    y = np.array(bindata)
    step(x, y)
show()
Edit: As @ImportanceOfBeingErnest suggested, if you prefer to perform the data-to-bindata conversion without defining your own binary_data() function, you can use numpy.zeros_like(). Just pay more attention when you stack them:
import numpy as np
from matplotlib.pyplot import step, show

data = [[1, 2, 4, 5, 9, 12, 13, 14],
        [1, 4, 10, 11, 20, 21, 22],
        [1, 2, 3, 4, 15, 16, 17, 18]]

# find out the longest data to plot
limit = max(x[-1] + 1 for x in data)

x = np.arange(0, limit)
for shift, d in enumerate(data):
    y = np.zeros_like(x)
    y[d] = 1
    # don't forget to shift
    y += 2 * shift
    step(x, y)
show()
You can create an array with all zeros and assign 1 to those elements listed in data:
import numpy as np
data = [1,2,4,5,9]
t = np.arange(0,data[-1]+1)
x = np.zeros_like(t)
x[data] = 1
You might then plot it with the step function
import matplotlib.pyplot as plt
plt.step(t,x, where="post")
plt.show()
or with where="pre", depending on how to interpret your data.
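If you're unsure which convention fits your data, here is a quick sketch drawing both on the same figure (the vertical offset is only there so the curves don't overlap):

import numpy as np
import matplotlib.pyplot as plt

data = [1, 2, 4, 5, 9]
t = np.arange(0, data[-1] + 1)
x = np.zeros_like(t)
x[data] = 1

# "post" holds each value until the next sample; "pre" jumps before it
plt.step(t, x, where="post", label='where="post"')
plt.step(t, x + 2, where="pre", label='where="pre"')  # shifted up by 2 for visibility
plt.legend()
plt.show()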
Related
I have this ValueError problem.
x is an array of the digits 0-9, 10 values in total.
x is passed into the for loop and put into the equation.
I'm struggling to see why y and x aren't the same size after the equation has run 10 times.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a = np.array([2])
b = np.array([-3])
print(f'Scalar check for 0 dimensions a {a.ndim}, b {b.ndim} x {x.ndim}')
for i in x:
    print(i)
    y = i*a + b
plt.plot(x, y)
raise ValueError(f"x and y must have same first dimension, but "
ValueError: x and y must have same first dimension, but have shapes (10,) and (1,)
I thought it would have run once I changed the dimensions of a and b from scalars to 1-D arrays, but that was obviously not what was causing the error.
You are overwriting the y value on each iteration, so in the end you have y = [15].
You can re-write it as follows:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a = np.array(2)   # note the removed brackets: []
b = np.array(-3)  # likewise a 0-d array now
y = []
for i in x:
    y.append(i * a + b)
An even simpler approach is:
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a = np.array(2)
b = np.array(-3)
y = x * a + b
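To finish it off with the plot from the question, a minimal sketch (plain Python scalars work just as well as 0-d arrays here):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
a = 2
b = -3
y = x * a + b  # y now has shape (10,), matching x

plt.plot(x, y)
plt.show()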
I need to average the Y values corresponding to the values in the X array...
X=np.array([ 1, 1, 2, 2, 2, 2, 3, 3 ... ])
Y=np.array([ 10, 30, 15, 10, 16, 10, 15, 20 ... ])
In other words, the Y values corresponding to the 1s in the X array are 10 and 30, whose average is 20; the values corresponding to the 2s are 15, 10, 16, and 10, whose average is 12.75; and so on...
How can I calculate these average values?
One option is to use a property of linear regression with categorical variables: when each category is encoded as a 0/1 dummy column, the least-squares coefficients are exactly the per-group means.
import numpy as np
x = np.array([ 1, 1, 2, 2, 2, 2, 3, 3 ])
y = np.array([ 10, 30, 15, 10, 16, 10, 15, 20 ])
x_dummies = x[:, None] == np.unique(x)
means = np.linalg.lstsq(x_dummies, y, rcond=None)[0]
print(means) # [20. 12.75 17.5 ]
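For comparison, here is a sketch of the same grouped means using np.bincount; this assumes the values in x are small non-negative integers:

import numpy as np

x = np.array([1, 1, 2, 2, 2, 2, 3, 3])
y = np.array([10, 30, 15, 10, 16, 10, 15, 20])

sums = np.bincount(x, weights=y)  # index i holds the sum of y where x == i
counts = np.bincount(x)           # index i holds the number of times i appears in x
means = sums[counts > 0] / counts[counts > 0]  # skip values that never occur
print(means)  # [20.   12.75 17.5 ]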
You can try using pandas:
import pandas as pd
import numpy as np
N = pd.DataFrame(np.transpose([X, Y]),
                 columns=['X', 'Y']).groupby('X')['Y'].mean().to_numpy()
# array([20.  , 12.75, 17.5 ])
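A slightly more compact variant of the same idea is to group a Series directly by the X array:

import numpy as np
import pandas as pd

X = np.array([1, 1, 2, 2, 2, 2, 3, 3])
Y = np.array([10, 30, 15, 10, 16, 10, 15, 20])

# groupby accepts an array of the same length as the Series
means = pd.Series(Y).groupby(X).mean().to_numpy()
print(means)  # [20.   12.75 17.5 ]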
import numpy as np

X = np.array([1, 1, 2, 2, 2, 2, 3, 3])
Y = np.array([10, 30, 15, 10, 16, 10, 15, 20])

# Only unique values
unique_vals = np.unique(X)

# Loop over every unique value
for val in unique_vals:
    # Find the matching indexes in X
    idx = np.where(X == val)
    # Mean of Y at the found indexes
    aver = np.mean(Y[idx])
    print(f"Average for {val}: {aver}")
Result:
Average for 1: 20.0
Average for 2: 12.75
Average for 3: 17.5
You can use something like the code below:
import numpy as np

X = np.array([1, 1, 2, 2, 2, 2, 3, 3])
Y = np.array([10, 30, 15, 10, 16, 10, 15, 20])

def groupby(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    # Split input array with those start, stop ones
    out = [a_sorted[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])]
    return out

group_by_array = groupby(Y, X)
for item in group_by_array:
    print(np.average(item))
I used the information in the link below to answer the question:
Group numpy into multiple sub-arrays using an array of values
I think this solution should work:
avg_arr = []
i = 1
while i <= np.max(x):
    inds = np.where(x == i)
    # slice from the first matching index through the last (inclusive)
    my_val = np.average(y[inds[0][0]:inds[0][-1] + 1])
    avg_arr.append(my_val)
    i += 1
Definitely not the cleanest, but I was able to test it quickly and it does indeed work. (Note that it assumes the values in x are sorted, consecutive integers starting at 1.)
My data frame contains purchases. A buyer (buyer_id) can buy several items (item_id).
I split the data with splitter() and put it into a dok matrix with generate_matrix(). Then I feed this data into get_train_samples() to get my x_train, x_test, y_train and y_test.
How can I compress this code?
And how can I combine generate_matrix() and get_train_samples() and turn them into a 'real' one-hot encoding matrix?
Dataframe:
d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
   purchaseid  itemid
0           0       3
1           0       8
2           0       2
3           1      10
4           2       3
Code:
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp

PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4

def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(df_main, dataframe, name):
    mat = sp.dok_matrix((df_main.shape[0], len(df_main['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat

dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat = generate_matrix(df, df_tr, 'train')
val_mat = generate_matrix(df, df_val, 'val')

def get_train_samples(train_mat, num_negatives):
    user_input, item_input, labels = [], [], []
    num_user, num_item = train_mat.shape
    for (u, i) in train_mat.keys():
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        # negative instances
        for t in range(num_negatives):
            j = np.random.randint(num_item)
            while (u, j) in train_mat.keys():
                j = np.random.randint(num_item)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels

num_users, num_items = train_mat.shape
model = get_model(num_users, num_items, ...)
user_input, item_input, labels = get_train_samples(train_mat, NUM_NEGATIVES)
val_user_input, val_item_input, val_labels = get_train_samples(val_mat, NUM_NEGATIVES)
What I need:
user_input
item_input
labels
val_user_input
val_item_input
val_labels
num_users
It is pretty vague what kind of one-hot encoding matrix you are looking for. From get_train_samples, it seems you don't really need the sparse matrix for model training at the end of the day. Besides, I'm not sure how you would one-hot encode observations with three variables (user_id, item_id, purchased or not).
As for the problem of combining generate_matrix with get_train_samples, it is pretty simple:
def generate_matrix(df_main, df, num_negatives):
    n_samples, n_classes = df_main.shape[0], df_main['itemid'].nunique()
    mat = sp.dok_matrix((n_samples, n_classes), dtype=np.float32)
    user_input, item_input, labels = [], [], []
    for purchaseid, itemid in zip(df['purchaseid'], df['itemid']):
        mat[purchaseid, itemid] = 1.0
        # the data with label 0 in OP's original code
        fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives)
        # mark the fake items with -1.0
        mat[np.repeat(purchaseid, num_negatives), fake_items] = -1.0
        # the three lists
        user_input.extend([purchaseid] * (num_negatives + 1))
        item_input.append(itemid)
        item_input.extend(fake_items.tolist())
        labels.append(1.0)
        labels.extend(np.zeros(num_negatives).tolist())
    return mat, user_input, item_input, labels
As you can see, in my generate_matrix the fake samples (items not purchased by a user) are encoded with -1.0 in the sparse matrix. Besides, I use a pretty compact expression, fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives), to generate the fake itemids, compared to the while loop in your code.
With this function, you can run:
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)

PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4

def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round(sum_purchase * (PERCENTAGE_SPLIT / 100))
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)

train_mat, user_input_t, item_id_t, labels_t = generate_matrix(df, df_tr, NUM_NEGATIVES)
val_mat, user_input_v, item_id_v, labels_v = generate_matrix(df, df_val, NUM_NEGATIVES)
By running this code you will see that the length of train_mat.keys() can differ from the length of user_input_t. This is because the same item can be chosen multiple times by fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives). If you want the two lengths to stay the same, you need to pass replace=False to that np.random.choice call.
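That tweak would look like the sketch below; note that numpy's keyword is replace, and this assumes np.setdiff1d leaves at least num_negatives candidate items:

# sample negatives without replacement so the same fake item
# can't be drawn twice for one purchase
fake_items = np.random.choice(
    a=np.setdiff1d(range(n_classes), itemid),
    size=num_negatives,
    replace=False,
)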
Is there a more efficient way of determining the averages of a certain area in a given numpy array? For simplicity, let's say I have a 5x5 array:
values = np.array([[0, 1, 2, 3, 4],
[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8]])
I would like to get the average around each coordinate, with a specified area size, assuming the array wraps around. Let's say the area size is 2, so anything within distance 2 of a given point will be considered. For example, to get the average of the area around coordinate (2, 2), we need to consider
      2
   2, 3, 4
2, 3, 4, 5, 6
   4, 5, 6
      6
Thus, the average will be 4.
For coordinate (4, 4) we need to consider:
      6
   6, 7, 3
6, 7, 8, 4, 5
   3, 4, 0
      5
Thus the average will be 4.92.
Currently, I have the code below, but since it uses a for loop I feel it could be improved. Is there a way to do this with just numpy built-in functions?
For instance, is there a way to use np.vectorize to gather the subarrays (the areas), put them all in an array, and then use np.einsum or something similar?
import numpy as np

def get_average(matrix, loc, dist):
    sum = 0
    num = 0
    size, size = matrix.shape  # assumes a square matrix
    for y in range(-dist, dist + 1):
        for x in range(-dist + abs(y), dist - abs(y) + 1):
            y_ = (y + loc.y) % size
            x_ = (x + loc.x) % size
            sum += matrix[y_, x_]
            num += 1
    return sum / num

class Coord():
    def __init__(self, x, y):
        self.x = x
        self.y = y

values = np.array([[0, 1, 2, 3, 4],
                   [1, 2, 3, 4, 5],
                   [2, 3, 4, 5, 6],
                   [3, 4, 5, 6, 7],
                   [4, 5, 6, 7, 8]])

height, width = values.shape
averages = np.zeros((height, width), dtype=np.float16)
for r in range(height):
    for c in range(width):
        loc = Coord(c, r)
        averages[r][c] = get_average(values, loc, 2)
print(averages)
Output:
[[ 3.07617188 2.92382812 3.5390625 4.15234375 4. ]
[ 2.92382812 2.76953125 3.38476562 4. 3.84570312]
[ 3.5390625 3.38476562 4. 4.6171875 4.4609375 ]
[ 4.15234375 4. 4.6171875 5.23046875 5.078125 ]
[ 4. 3.84570312 4.4609375 5.078125 4.921875 ]]
This solution is less efficient (slower) than yours, but it is just an example of using the numpy.ma module.
Required libraries:
import numpy as np
import numpy.ma as ma
Define methods to do the job:
# build the shape of the area as a rhomboid
def rhomboid2(dim):
    size = 2 * dim + 1
    matrix = np.ones((size, size))
    for y in range(-dim, dim + 1):
        for x in range(-dim + abs(y), dim - abs(y) + 1):
            matrix[(y + dim) % size, (x + dim) % size] = 0
    return matrix

# build a mask using the shaped area
def mask(matrix_shape, rhom_dim):
    mask = np.zeros(matrix_shape)
    bound = 2 * rhom_dim + 1
    rhom = rhomboid2(rhom_dim)
    mask[0:bound, 0:bound] = rhom
    # roll to set the position of the rhomboid to 0,0
    mask = np.roll(mask, -rhom_dim, axis=0)
    mask = np.roll(mask, -rhom_dim, axis=1)
    return mask
Then, iterate to build the result:
# build the mask, sized like the values array, with a rhomboid area of size 2
mask_ = mask((5, 5), 2)
averages = np.zeros_like(values, dtype=np.float16)  # initialize the recipient

# iterate over the mask to calculate the averages
for y in range(len(mask_)):
    for x in range(len(mask_)):
        masked = ma.array(values, mask=mask_)
        averages[y, x] = np.mean(masked)
        mask_ = np.roll(mask_, 1, axis=1)
    mask_ = np.roll(mask_, 1, axis=0)
which returns:
# [[3.076 2.924 3.54 4.152 4. ]
# [2.924 2.77 3.385 4. 3.846]
# [3.54 3.385 4. 4.617 4.46 ]
# [4.152 4. 4.617 5.23 5.08 ]
# [4. 3.846 4.46 5.08 4.92 ]]
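For completeness: assuming scipy is available, the wrap-around averages can also be computed fully vectorized by building the rhomboid as a convolution kernel and letting scipy.ndimage handle the wrapping. A minimal sketch:

import numpy as np
from scipy import ndimage

values = np.array([[0, 1, 2, 3, 4],
                   [1, 2, 3, 4, 5],
                   [2, 3, 4, 5, 6],
                   [3, 4, 5, 6, 7],
                   [4, 5, 6, 7, 8]])

dist = 2
# rhomboid kernel: 1 where |dx| + |dy| <= dist, 0 elsewhere
dy, dx = np.mgrid[-dist:dist + 1, -dist:dist + 1]
kernel = (np.abs(dx) + np.abs(dy) <= dist).astype(float)

# mode='wrap' reproduces the modulo (wrap-around) indexing; the kernel is
# symmetric, so convolution and correlation agree here
averages = ndimage.convolve(values.astype(float), kernel, mode='wrap') / kernel.sum()
print(averages)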
x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
I want to grab the first 2 rows of array x from every block of 5; the result should be:
x[fancy_indexing] = [1,2, 6,7, 11,12]
It's easy enough to build up an index like that using a for loop.
Is there a one-liner slicing trick that will pull it off? Points for simplicity here.
Approach #1: Here's a vectorized one-liner using boolean indexing -
x[np.mod(np.arange(x.size),M)<N]
Approach #2: If you are going for performance, here's another vectorized approach using NumPy strides -
n = x.strides[0]
shp = (x.size//M,N)
out = np.lib.stride_tricks.as_strided(x, shape=shp, strides=(M*n,n)).ravel()
Sample run -
In [61]: # Inputs
...: x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
...: N = 2
...: M = 5
...:
In [62]: # Approach 1
...: x[np.mod(np.arange(x.size),M)<N]
Out[62]: array([ 1, 2, 6, 7, 11, 12])
In [63]: # Approach 2
...: n = x.strides[0]
...: shp = (x.size//M,N)
...: out=np.lib.stride_tricks.as_strided(x,shape=shp,strides=(M*n,n)).ravel()
...:
In [64]: out
Out[64]: array([ 1, 2, 6, 7, 11, 12])
I first thought you needed this to work for 2D arrays, due to your phrasing of "first N rows of every block of M rows", so I'll leave my solution in that form.
You could work some magic by reshaping your array into 3d:
M = 5  # size of blocks
N = 2  # number of columns to keep from each block
x = np.arange(3*4*M).reshape(4, -1)  # (4, 3*M)-shaped dummy input
x = x.reshape(x.shape[0], -1, M)[:, :, :N].reshape(x.shape[0], -1)  # (4, 3*N)-shaped output
This will extract the first N columns of every block of M columns. In order to use it for your 1D case, you'd need to make your 1D array into a 2D one using x = x[None, :].
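For the 1D case from the question, that would look something like this sketch:

import numpy as np

M, N = 5, 2
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

x2d = x[None, :]  # promote to 2d: shape (1, 15)
out = x2d.reshape(x2d.shape[0], -1, M)[:, :, :N].reshape(x2d.shape[0], -1)
print(out.ravel())  # [ 1  2  6  7 11 12]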
Reshape the array into rows of five columns, then take (slice) the first two columns of each row.
>>> x
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> x.reshape(x.shape[0] // 5, 5)[:,:2]
array([[ 1, 2],
[ 6, 7],
[11, 12]])
Or
>>> x.reshape(x.shape[0] // 5, 5)[:,:2].flatten()
array([ 1, 2, 6, 7, 11, 12])
It only works with 1-D arrays whose length is a multiple of five.
import numpy as np
x = np.array(range(1, 16))
y = np.vstack([x[0::5], x[1::5]]).T.ravel()
y
# => array([ 1, 2, 6, 7, 11, 12])
Taking the first N rows of every block of M rows in the array [1, 2, ..., K]:
import numpy as np
K = 30
M = 5
N = 2
x = np.array(range(1, K+1))
y = np.vstack([x[i::M] for i in range(N)]).T.ravel()
y
# => array([ 1, 2, 6, 7, 11, 12, 16, 17, 21, 22, 26, 27])
Notice that .T is essentially free: it doesn't copy any data, it just swaps the dimensions and strides of the array. .ravel() does copy here, because the transposed array is no longer contiguous, but it's still a single fast pass over the data.
If you insist on getting your slice using fancy indexing:
import numpy as np
K = 30
M = 5
N = 2
x = np.array(range(1, K+1))
fancy_indexing = [i*M+n for i in range(len(x)//M) for n in range(N)]
x[fancy_indexing]
# => array([ 1, 2, 6, 7, 11, 12, 16, 17, 21, 22, 26, 27])