Compress methods and get the data into a One Hot Encoding Matrix - python

My data frame contains purchase data. A buyer (buyer_id) can buy several items (item_id).
I split the data with splitter() and build a dok matrix with generate_matrix(). Then I feed that matrix into get_train_samples() to get my x_train, x_test, y_train and y_test.
How can I compress this code?
And how can I combine generate_matrix() and get_train_samples() and turn their output into a 'real' one hot encoding matrix?
Dataframe:
d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
   purchaseid  itemid
0           0       3
1           0       8
2           0       2
3           1      10
4           2       3
Code:
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp
PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4
def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(df_main, dataframe, name):
    mat = sp.dok_matrix((df_main.shape[0], len(df_main['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat
dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat = generate_matrix(df, df_tr, 'train')
val_mat = generate_matrix(df, df_val, 'val')
def get_train_samples(train_mat, num_negatives):
    user_input, item_input, labels = [], [], []
    num_user, num_item = train_mat.shape
    for (u, i) in train_mat.keys():
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        # negative instances
        for t in range(num_negatives):
            j = np.random.randint(num_item)
            while (u, j) in train_mat.keys():
                j = np.random.randint(num_item)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels
num_users, num_items = train_mat.shape
model = get_model(num_users, num_items, ...)
user_input, item_input, labels = get_train_samples(train_mat, NUM_NEGATIVES)
val_user_input, val_item_input, val_labels = get_train_samples(val_mat, NUM_NEGATIVES)
What I need
user_input
item_input
labels
val_user_input
val_item_input
val_labels
num_users

It is pretty vague what kind of one hot encoding matrix you are looking for. Judging from get_train_samples, you don't really need the sparse matrix for model training at the end of the day. Besides, I'm not sure how you would one-hot encode observations with three variables (user_id, item_id, purchased or not).
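If what you have in mind is simply a purchase-by-item indicator matrix (one row per purchaseid, one column per itemid, 1 where the item was bought), a minimal sketch with pandas could look like this; this is my interpretation, not necessarily what you need:

import pandas as pd

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)

# counts of each item per purchase, clipped to 0/1 so repeated items stay binary
one_hot = pd.crosstab(df['purchaseid'], df['itemid']).clip(upper=1)
print(one_hot.head())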
As for combining generate_matrix with get_train_samples, it is pretty simple:
def generate_matrix(df_main, df, num_negatives):
    n_samples, n_classes = df_main.shape[0], df_main['itemid'].nunique()
    mat = sp.dok_matrix((n_samples, n_classes), dtype=np.float32)
    user_input, item_input, labels = [], [], []
    for purchaseid, itemid in zip(df['purchaseid'], df['itemid']):
        mat[purchaseid, itemid] = 1.0
        # the data with label 0 in OP's original code
        fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives)
        # label the fake items with -1.0
        mat[np.repeat(purchaseid, num_negatives), fake_items] = -1.0
        # the three lists
        user_input.extend([purchaseid] * (num_negatives + 1))
        item_input.append(itemid)
        item_input.extend(fake_items.tolist())
        labels.append(1.0)
        labels.extend(np.zeros(num_negatives).tolist())
    return mat, user_input, item_input, labels
As you can see, in my generate_matrix the fake samples (items not purchased by a user) are encoded with -1.0 in the sparse matrix. Besides, I use a pretty compact expression, fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives), to generate the fake itemids, as compared to the while loop in your code.
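To illustrate that one line in isolation, here is a toy sketch; n_classes, itemid and num_negatives are just example values taken from the question's data:

import numpy as np

n_classes = 14       # the example frame has itemids 0..13
itemid = 3           # the item that was actually purchased
num_negatives = 4

# candidate negatives: every class except the purchased item
candidates = np.setdiff1d(np.arange(n_classes), itemid)
fake_items = np.random.choice(a=candidates, size=num_negatives)  # replace=True by default
print(fake_items)    # e.g. [ 7  1  7 12] -- duplicates are possible unless replace=False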
With this function, you can run
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)

PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4

def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round(sum_purchase * (PERCENTAGE_SPLIT / 100))
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)

train_mat, user_input_t, item_id_t, labels_t = generate_matrix(df, df_tr, NUM_NEGATIVES)
val_mat, user_input_v, item_id_v, labels_v = generate_matrix(df, df_val, NUM_NEGATIVES)
Running this code, you will see that the length of train_mat.keys() can differ from the length of user_input_t. This is because the same item can be chosen multiple times in fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives). If you want the two lengths to stay the same, pass replace=False to that np.random.choice call.
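For reference, that single change inside generate_matrix would look like this:

# sample without replacement so each negative item for a purchase is distinct
fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid),
                              size=num_negatives, replace=False)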

Related

Calculating the averages of elements in one array based on data in another array

I need to average the Y values corresponding to the values in the X array...
X=np.array([ 1, 1, 2, 2, 2, 2, 3, 3 ... ])
Y=np.array([ 10, 30, 15, 10, 16, 10, 15, 20 ... ])
In other words, the Y values corresponding to the 1s in X are 10 and 30, whose average is 20; the values corresponding to the 2s are 15, 10, 16 and 10, whose average is 12.75; and so on...
How can I calculate these average values?
One option is to use a property of linear regression (with categorical variables):
import numpy as np
x = np.array([ 1, 1, 2, 2, 2, 2, 3, 3 ])
y = np.array([ 10, 30, 15, 10, 16, 10, 15, 20 ])
x_dummies = x[:, None] == np.unique(x)
means = np.linalg.lstsq(x_dummies, y, rcond=None)[0]
print(means) # [20. 12.75 17.5 ]
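Another vectorized option (not from the original answers; just a common NumPy idiom when the labels are small non-negative integers) is np.bincount:

import numpy as np

x = np.array([1, 1, 2, 2, 2, 2, 3, 3])
y = np.array([10, 30, 15, 10, 16, 10, 15, 20])

# per-label sums and counts, then divide where a label actually occurs
sums = np.bincount(x, weights=y)
counts = np.bincount(x)
means = sums[counts > 0] / counts[counts > 0]
print(means)  # [20.   12.75 17.5 ]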
You can try using pandas
import pandas as pd
import numpy as np

N = pd.DataFrame(np.transpose([X, Y]),
                 columns=['X', 'Y']).groupby('X')['Y'].mean().to_numpy()
# array([20.  , 12.75, 17.5 ])
import numpy as np

X = np.array([1, 1, 2, 2, 2, 2, 3, 3])
Y = np.array([10, 30, 15, 10, 16, 10, 15, 20])

# Only unique values
unique_vals = np.unique(X)
# Loop over every unique value
for val in unique_vals:
    # Search for matching indexes in X
    idx = np.where(X == val)
    # Mean of Y at the found indexes
    aver = np.mean(Y[idx])
    print(f"Average for {val}: {aver}")
Result:
Average for 1: 20.0
Average for 2: 12.75
Average for 3: 17.5
You can use something like the code below:
import numpy as np

X = np.array([1, 1, 2, 2, 2, 2, 3, 3])
Y = np.array([10, 30, 15, 10, 16, 10, 15, 20])

def groupby(a, b):
    # Get argsort indices, to be used to sort a and b in the next steps
    sidx = b.argsort(kind='mergesort')
    a_sorted = a[sidx]
    b_sorted = b[sidx]
    # Get the group limit indices (start, stop of groups)
    cut_idx = np.flatnonzero(np.r_[True, b_sorted[1:] != b_sorted[:-1], True])
    # Split input array with those start, stop pairs
    out = [a_sorted[i:j] for i, j in zip(cut_idx[:-1], cut_idx[1:])]
    return out

group_by_array = groupby(Y, X)
for item in group_by_array:
    print(np.average(item))
I used the information in the link below to answer the question:
Group numpy into multiple sub-arrays using an array of values
I think this solution should work:
avg_arr = []
i = 1
while i <= np.max(x):
    inds = np.where(x == i)
    my_val = np.average(y[inds])  # average of all Y values whose X equals i
    avg_arr.append(my_val)
    i += 1
Definitely not the cleanest, but I was able to test it quickly and it does indeed work.

How to efficiently shuffle some values of a numpy array while keeping their relative order?

I have a numpy array and a mask specifying which entries from that array to shuffle while keeping their relative order. Let's have an example:
In [2]: arr = np.array([5, 3, 9, 0, 4, 1])
In [4]: mask = np.array([True, False, False, False, True, True])
In [5]: arr[mask]
Out[5]: array([5, 4, 1]) # These entries shall be shuffled inside arr, while keeping their order.
In [6]: np.where(mask==True)
Out[6]: (array([0, 4, 5]),)
In [7]: shuffle_array(arr, mask) # I'm looking for an efficient realization of this function!
Out[7]: array([3, 5, 4, 9, 0, 1]) # See how the entries 5, 4 and 1 haven't changed their order.
I've written some code that can do this, but it's really slow.
import numpy as np

def shuffle_array(arr, mask):
    perm = np.arange(len(arr))  # permutation array
    n = mask.sum()
    if n > 0:
        old_true_pos = np.where(mask == True)[0]    # old positions for which mask is True
        old_false_pos = np.where(mask == False)[0]  # old positions for which mask is False
        new_true_pos = np.random.choice(perm, n, replace=False)  # draw new positions
        new_true_pos.sort()
        new_false_pos = np.setdiff1d(perm, new_true_pos)
        new_pos = np.hstack((new_true_pos, new_false_pos))
        old_pos = np.hstack((old_true_pos, old_false_pos))
        perm[new_pos] = perm[old_pos]
    return arr[perm]
To make things worse, I actually have two large matrices A and B with shape (M,N). Matrix A holds arbitrary values, while each row of matrix B is the mask to use for shuffling the corresponding row of matrix A according to the procedure that I outlined above. So what I want is shuffled_matrix = row_wise_shuffle(A, B).
The only way I have so far found to do it is via my shuffle_array() function and a for loop.
Can you think of any numpy'onic way to accomplish this task avoiding loops? Thank you so much in advance!
For 1d case:
import numpy as np
a = np.arange(8)
b = np.array([1,1,1,1,0,0,0,0])
# Get ordered values
ordered_values = a[np.where(b==1)]
# We'll shuffle both arrays
shuffled_ix = np.random.permutation(a.shape[0])
a_shuffled = a[shuffled_ix]
b_shuffled = b[shuffled_ix]
# Replace the values with correct order
a_shuffled[np.where(b_shuffled==1)] = ordered_values
a_shuffled # Notice that 0, 1, 2, 3 preserves order.
>>>
array([0, 1, 2, 6, 3, 4, 7, 5])
for 2d case, columnwise shuffle (along axis=1):
import numpy as np
a = np.arange(24).reshape(4,6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
# The code below works for column shuffle (i.e. axis=1).
# Get ordered values
i,j = np.where(b==1)
values = a[i, j]
values
# We'll shuffle both arrays for axis=1
# taken from https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis
idx = np.random.rand(*a.shape).argsort(axis=1)
a_shuffled = np.take_along_axis(a,idx,axis=1)
b_shuffled = np.take_along_axis(b,idx,axis=1)
# Replace the values with correct order
a_shuffled[np.where(b_shuffled==1)] = values
# Get the result
a_shuffled # see that 4,5 | 6,7,8 | 12,13,14,15 | 20, 21 preserves order
>>>
array([[ 4, 1, 0, 3, 2, 5],
[ 9, 6, 7, 11, 8, 10],
[12, 13, 16, 17, 14, 15],
[23, 20, 19, 22, 21, 18]])
for 2d case, rowwise shuffle (along axis=0), we can use the same code, first transpose arrays and after shuffle transpose back:
import numpy as np
a = np.arange(24).reshape(4,6)
b = np.array([[0,0,0,0,1,1], [1,1,1,0,0,0], [1,1,1,1,0,0], [0,0,1,1,0,0]])
# The code below works for column shuffle (i.e. axis=1).
# As you said rowwise, we first transpose
at = a.T
bt = b.T
# Get ordered values
i,j = np.where(bt==1)
values = at[i, j]
values
# We'll shuffle both arrays for axis=1
# taken from https://stackoverflow.com/questions/5040797/shuffling-numpy-array-along-a-given-axis
idx = np.random.rand(*at.shape).argsort(axis=1)
at_shuffled = np.take_along_axis(at,idx,axis=1)
bt_shuffled = np.take_along_axis(bt,idx,axis=1)
# Replace the values with correct order
at_shuffled[np.where(bt_shuffled==1)] = values
# Get the result
a_shuffled = at_shuffled.T
a_shuffled # see that 6,12 | 7, 13 | 8,14,20 | 15, 21 preserves order
>>>
array([[ 6, 7, 2, 3, 10, 17],
[18, 19, 8, 15, 16, 23],
[12, 13, 14, 21, 4, 5],
[ 0, 1, 20, 9, 22, 11]])
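Wrapping the axis=1 recipe above into the row_wise_shuffle(A, B) function the question asks for might look like the sketch below; the function name comes from the question, and np.random.default_rng is used here only as one way to generate the random sort keys:

import numpy as np

def row_wise_shuffle(A, B, seed=None):
    # Shuffle each row of A while keeping the relative order of entries where B is True.
    rng = np.random.default_rng(seed)
    A = np.asarray(A)
    B = np.asarray(B, dtype=bool)
    # masked values in row-major order (grouped per row, original order within each row)
    i, j = np.nonzero(B)
    values = A[i, j]
    # independent random permutation of every row via argsort of random keys
    idx = rng.random(A.shape).argsort(axis=1)
    A_shuffled = np.take_along_axis(A, idx, axis=1)
    B_shuffled = np.take_along_axis(B, idx, axis=1)
    # write the ordered values back into the True slots, which are again in row-major order
    A_shuffled[np.nonzero(B_shuffled)] = values
    return A_shuffled

A = np.arange(24).reshape(4, 6)
B = np.array([[0, 0, 0, 0, 1, 1],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 0, 0]])
print(row_wise_shuffle(A, B))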

Watching a counter, tally total and counting missed counts

I am attempting to create a piece of code that will watch a counter with an output something like:
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
I want the code to be able to tally the total and tell me how many counts are missed for example if this happened:
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24, 25, 26, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,1,2]
I would get a total of 92 still, but get feedback that 8 are missing.
I have gotten very close with the following code:
Blk_Tot = 0
CBN = 0
LBN = 0
x = 0
y = 0
z = 0
MissedBlocks = 0

for i in range(len(a1)):
    CBN = a1[i]
    if CBN - LBN <= 0:
        if LBN == 30:
            y = 30 - abs(CBN - LBN)
        elif LBN < 30:
            z = 30 - LBN
            y = 30 - abs(CBN - LBN) + z
            print(z)
        Blk_Tot = Blk_Tot + y
    else:
        x = CBN - LBN
        Blk_Tot = Blk_Tot + x
        if x > 1:
            MissedBlocks = MissedBlocks - 1 + x
    LBN = CBN

print(Blk_Tot)
print(MissedBlocks)
If I delete anywhere between 1 and 30 it works perfectly; however, if I delete across 30, say 29, 30, 1, 2, it breaks. I don't expect it to be able to miss 30 in a row and still come up with a proper total, however.
Anyone have any ideas on how this might be achieved? I feel like I am missing an obvious answer :D
Sorry, I think I was unclear: a1 is a counter coming from an external device that counts from 1 to 30 and then wraps around to 1 again. Each count is actually part of a message to show that the message was received; so if I see 1 2 4, I know that the 3rd message is missing. What I am trying to do is find out the total that should have been received and how many are missing from the count.
Update after an idea from the posts below, another method of doing this maybe:
Input:
123456
List[1,2,3,4,5,6]
1. Check the first input to see which part of the list it is in and start from there (in case we don't start from zero)
2. Every time an input is received, check whether it matches the next value in the array
3. If not, how many steps does it take to find that value?
You don't need to keep track of whether you passed the 30 mark.
Just compare with the ideal sequence and count the missing numbers.
Caveat: it cannot detect parts missing at the end.
It also cannot detect when more than 30 parts are missing in a row.
from itertools import cycle

def idealSeqGen():
    for i in cycle(range(1, 31)):
        yield(i)

def receivedSeqGen():
    a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
          1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
          5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
          1, 2]
    for i in a1:
        yield(i)

receivedSeq = receivedSeqGen()
idealSeq = idealSeqGen()
missing = 0
ideal = next(idealSeq)
try:
    while True:
        received = next(receivedSeq)
        while received != ideal:
            missing += 1
            ideal = next(idealSeq)
        ideal = next(idealSeq)
except StopIteration:
    pass
print(f'There are {missing} items missing')
Edit
The loop part can be a little bit simpler
missing = 0
try:
    while True:
        ideal = next(idealSeq)
        received = next(receivedSeq)
        while received != ideal:
            missing += 1
            ideal = next(idealSeq)
except StopIteration:
    pass
print(f'There are {missing} items missing')
In general, if you want to count the number of differences between two lists, you can easily use a dictionary. The other answer would also work, but it is highly inefficient for even slightly larger lists.
def counter(lst):
    # create a dictionary with the count of each element
    d = {}
    for i in lst:
        if d.get(i, None):
            d[i] += 1
        else:
            d[i] = 1
    return d

def compare(d1, d2):
    # d1 and d2 are dictionaries of counts
    ans = 0
    for i in d1:
        if d2.get(i, None):
            # compares the common values in both lists
            ans += abs(d1[i] - d2[i])
            d2[i] = 0
        else:
            # for elements only in the first list
            ans += d1[i]
    for i in d2:
        # for elements only in the second list
        if d2[i] > 0:
            ans += d2[i]
    return ans

l1 = [...]
l2 = [...]
print(compare(counter(l1), counter(l2)))
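For what it's worth, the standard library's collections.Counter already does this counting, and its subtraction keeps only positive counts, so the same comparison can be written more compactly. This is a sketch, not part of the original answer:

from collections import Counter

def compare_counts(l1, l2):
    c1, c2 = Counter(l1), Counter(l2)
    # (c1 - c2) keeps elements over-represented in l1, (c2 - c1) those over-represented in l2
    return sum(((c1 - c2) + (c2 - c1)).values())

print(compare_counts([1, 2, 2, 3], [2, 3, 3, 4]))  # 4: an extra 1 and 2 in l1, an extra 3 and 4 in l2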
New code to check for missing elements from a repeating sequence pattern
Now that I have understood your question more clearly, here's the code. The assumption in this code is that the list is always in ascending order from 1 through 30 and then repeats from 1. Elements can be missing between 1 and 30, but the order within each cycle is always ascending.
If the source data is as shown in list a1, then the code will result in 8 missing elements.
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1,2]
a2 = a1.copy()
c = 1
missing = 0
while a2:
    if a2[0] == c:
        c += 1
        a2.pop(0)
    elif a2[0] > c:
        missing += 1
        c += 1
    elif a2[0] < c:
        missing += 31 - c
        c = 1
    if c == 31:
        c = 1

print(f'There are {missing} items missing in the list')
The output of this will be:
There are 8 items missing in the list
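If you also need the total count that should have been received (92 in your example), one simple way, given the missing count computed above, is:

total = len(a1) + missing   # 84 received + 8 missing = 92
print(f'Total expected count: {total}')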
Let me know if this addresses your question
earlier code to compare two lists
You cannot use set as the items are repeated. So you need to sort them and find out how many times each element is in both lists. The difference will give you the missing counts. You may have an element in a1 but not in a2 or vice versa. So finding out the absolute count of missing items will give you the results.
I will update the response with better variables in my next update.
Here's how I did it:
code with comments:
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
a2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,1,2]
#step 1: Find out which list is longer. We will use that as the master list
if len(a1) > len(a2):
    master_list = a1.copy()
    second_list = a2.copy()
else:
    master_list = a2.copy()
    second_list = a1.copy()

#step 2: We must sort both master and second list
# so we can compare against each other
master_list.sort()
second_list.sort()

#set the counter to zero
missing = 0

#iterate through the master list and look up each value in the second list
#for each iteration, remove the value in master[0] from both master and second list
#when you have iterated through the full list, you will get an empty master_list
#this lets you use a while statement to iterate until master_list is empty
while master_list:
    #pick the first element of master list to search for
    x = master_list[0]
    #count the number of times master_list[0] is found in both master and second list
    a_count = master_list.count(x)
    b_count = second_list.count(x)
    #absolute difference of both gives you how many are missing from each other
    #master may have 4 occurrences and second may have 2 occurrences. abs diff is 2
    #master may have 2 occurrences and second may have 5 occurrences. abs diff is 3
    missing += abs(a_count - b_count)
    #now remove all occurrences of master_list[0] from both master and second list
    master_list = [i for i in master_list if i != x]
    second_list = [i for i in second_list if i != x]
    #iterate until master_list is empty

#you may end up with a few items in second_list that are not found in master list
#add them to the missing items count
#that's your absolute total of all missing items between lists a1 and a2
#if you want to know the difference between the bigger list and the shorter one,
#then don't add the missing items from second list
missing += len(second_list)

#now print the count of missing elements between the two lists
print('Total number of missing elements are:', missing)
The output from this is:
Total number of missing elements are: 7
If you want to find out which elements are missing, then you need to add a few more lines of code.
In the above example, elements 27, 28, 29, 30, 3 and 4 are missing from a2, and 31 is missing from a1. So the total number of missing elements is 7.
code without comments:
a1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30]
a2 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,
5, 6, 7, 8, 9, 10, 11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,1,2]
if len(a1) > len(a2):
    master_list = a1.copy()
    second_list = a2.copy()
else:
    master_list = a2.copy()
    second_list = a1.copy()

master_list.sort()
second_list.sort()

missing = 0
while master_list:
    x = master_list[0]
    a_count = master_list.count(x)
    b_count = second_list.count(x)
    missing += abs(a_count - b_count)
    master_list = [i for i in master_list if i != x]
    second_list = [i for i in second_list if i != x]

missing += len(second_list)
print('Total number of missing elements are:', missing)

Plotting binary data in python

I have some data that looks like:
data = [1,2,4,5,9] (random pattern of increasing integers)
And I want to plot it as a binary horizontal line, so that y=1 for every x value specified in data and zero otherwise.
I have a few different data arrays that I'd like to stack, similar to this style (this is CCD clocking data but the plot format looks ideal)
I think I need to create a list of ones for my data array, but how do I specify the zero value for everything not in the array?
Thanks
You got the point. You can create a list with 1 in any position specified in data and 0 elsewhere. This can be done very easily with a list comprehension
def binary_data(data):
    return [1 if x in data else 0 for x in range(data[-1] + 1)]
which will act like this:
>>> data = [1, 2, 4, 5, 9]
>>> bindata = binary_data(data)
>>> bindata
[0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
Now all you have to do is plot it... or better step it since it's binary data and step() looks way better:
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data):
    return [1 if x in data else 0 for x in range(data[-1] + 1)]

data = [1, 2, 4, 5, 9]
bindata = binary_data(data)
xaxis = np.arange(0, data[-1] + 1)
yaxis = np.array(bindata)
step(xaxis, yaxis)
show()
To plot multiple data arrays stacked on the same figure you could tweak binary_data() like this:
def binary_data(data, yshift=0):
    return [yshift+1 if x in data else yshift for x in range(data[-1] + 1)]
so now you can set yshift parameter to shift data arrays on the y-axis. E.g,
>>> data = [1, 2, 4, 5, 9]
>>> bindata1 = binary_data(data)
>>> bindata1
[0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
>>> bindata2 = binary_data(data, 2)
>>> bindata2
[2, 3, 3, 2, 3, 3, 2, 2, 2, 3]
Let's say you have data1, data2 and data3 to plot stacked, you'd go like:
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data, yshift=0):
    return [yshift+1 if x in data else yshift for x in range(data[-1] + 1)]

data1 = [1, 2, 4, 5, 9]
bindata1 = binary_data(data1)
x1 = np.arange(0, data1[-1] + 1)
y1 = np.array(bindata1)

data2 = [1, 4, 9]
bindata2 = binary_data(data2, 2)
x2 = np.arange(0, data2[-1] + 1)
y2 = np.array(bindata2)

data3 = [1, 2, 8, 9]
bindata3 = binary_data(data3, 4)
x3 = np.arange(0, data3[-1] + 1)
y3 = np.array(bindata3)

step(x1, y1, x2, y2, x3, y3)
show()
that you can easily edit to make it work with an arbitrary amount of data arrays:
data = [ [1, 2, 4, 5, 9],
         [1, 4, 9],
         [1, 2, 8, 9] ]

for shift, d in enumerate(data):
    bindata = binary_data(d, 2 * shift)
    x = np.arange(0, d[-1] + 1)
    y = np.array(bindata)
    step(x, y)
show()
Finally if you are dealing with data arrays with different length (say [1,2] and [15,16]) and you don't like plots that vanish in the middle of the figure you can tweak binary_data() again to force its range to the maximum range of your data.
import numpy as np
from matplotlib.pyplot import step, show

def binary_data(data, limit, yshift=0):
    return [yshift+1 if x in data else yshift for x in range(limit)]

data = [ [1, 2, 4, 5, 9, 12, 13, 14],
         [1, 4, 10, 11, 20, 21, 22],
         [1, 2, 3, 4, 15, 16, 17, 18] ]

# find out the longest data to plot
limit = max([x[-1] + 1 for x in data])
x = np.arange(0, limit)

for shift, d in enumerate(data):
    bindata = binary_data(d, limit, 2 * shift)
    y = np.array(bindata)
    step(x, y)
show()
Edit: As #ImportanceOfBeingErnest suggested, if you prefer to perform data to bindata conversion without having to define your own binary_data() function you could use numpy.zeros_like(). Just pay more attention when you stack them:
import numpy as np
from matplotlib.pyplot import step, show

data = [ [1, 2, 4, 5, 9, 12, 13, 14],
         [1, 4, 10, 11, 20, 21, 22],
         [1, 2, 3, 4, 15, 16, 17, 18] ]

# find out the longest data to plot
limit = max([x[-1] + 1 for x in data])
x = np.arange(0, limit)

for shift, d in enumerate(data):
    y = np.zeros_like(x)
    y[d] = 1
    # don't forget to shift
    y += 2*shift
    step(x, y)
show()
You can create an array with all zeros and assign 1 for those elements in data
import numpy as np
data = [1,2,4,5,9]
t = np.arange(0,data[-1]+1)
x = np.zeros_like(t)
x[data] = 1
You might then plot it with the step function
import matplotlib.pyplot as plt
plt.step(t,x, where="post")
plt.show()
or with where="pre", depending on how you interpret your data.
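As a side note (not from the original answers), matplotlib also provides eventplot, which draws exactly this kind of stacked raster of event positions and may save you the binary conversion entirely; a minimal sketch:

import matplotlib.pyplot as plt

data = [ [1, 2, 4, 5, 9],
         [1, 4, 9],
         [1, 2, 8, 9] ]

# one row of tick marks per data array, stacked vertically
plt.eventplot(data, lineoffsets=list(range(len(data))), linelengths=0.8)
plt.show()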

Split list into dictionary

I am trying to split my list into a dictionary and I wrote a function for it. It takes in a list and gets its length. If the length of the list is 253, it creates a dictionary with 26 keys; this is calculated by rounding up to the next multiple of 10 (250 will create 25 keys, 251-259 will create 26 keys). Each key stores a chunk of the original list (as of now I am storing 10 elements per chunk). I feel it could be cleaner.
def limit_files(file_list, at_a_time=10):
    l = file_list
    n = len(l)
    d, r = divmod(n, at_a_time)
    num_keys = d + 1 if r else d
    slice = n // num_keys
    vals = (l[i:i+slice] for i in range(0, n, slice+1))
    dct = dict(zip(range(1, num_keys+1), vals))
    return (dct)
I just wanted to know if there is way to improve the code
You can use itertools.izip_longest (itertools.zip_longest in Python 3) to group items in your list into equally sized chunks.
import itertools

def limit_files(file_list, at_a_time=10):
    d, r = divmod(len(file_list), at_a_time)
    num_keys = d + 1 if r > 0 else d
    chunks = itertools.izip_longest(*([iter(file_list)] * at_a_time))
    return dict((x, y) for x, y in enumerate(chunks))
Note that, it will be padding with None values to fill out the last chunk.
>>> limit_files([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
{0: (1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
1: (11, 12, None, None, None, None, None, None, None, None)}
Alternatively, without using padding values, you can just iterate through the range and call itertools.islice until the iterator exhausts as follows:
def limit_files(file_list, at_a_time=10):
    d, r = divmod(len(file_list), at_a_time)
    num_keys = d + 1 if r > 0 else d
    iterator = iter(file_list)
    dictionary = {}
    for chunk_id in xrange(num_keys):
        chunk = list(itertools.islice(iterator, at_a_time))
        if not chunk:
            break
        dictionary[chunk_id] = chunk
    return dictionary
>>> limit_files([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
{0: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1: [11, 12]}
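If plain slicing is acceptable and you want keys starting at 1 like your dict(zip(range(1, num_keys+1), vals)), a shorter version could be the sketch below; note it produces fixed-size chunks of at_a_time rather than equally divided ones:

def limit_files(file_list, at_a_time=10):
    # chunk k holds file_list[(k-1)*at_a_time : k*at_a_time]
    return {k: file_list[i:i + at_a_time]
            for k, i in enumerate(range(0, len(file_list), at_a_time), start=1)}

print(limit_files(list(range(1, 13))))
# {1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 2: [11, 12]}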
