Hi, so I'm trying to make a 3D matrix here. I'm working with the MovieLens 100K data (https://grouplens.org/datasets/movielens/100k/), taking the u1.base and u1.test pair as the training and test sets respectively. The training_set variable you'll find in the code holds tab-separated rows of the form (user ID, movie ID, rating, timestamp).
The 3D matrix I'm trying to create has the format (User, Movie, Timestamp), and each cell holds the rating given by, for example, user 1 to movie 1 at time 1.
If it's any help, below is the code where a 2D matrix is created with users as the rows and all the movies as the columns.
import numpy as np
import pandas as pd
training_set = pd.read_csv('ml-100k/u1.base', delimiter = '\t')
training_set = np.array(training_set, dtype='int')
test_set = pd.read_csv('ml-100k/u1.test', delimiter = '\t')
test_set = np.array(test_set, dtype = 'int64')
nb_users = int(max(max(training_set[:, 0]), max(test_set[:, 0])))
nb_movies = int(max(max(training_set[:, 1]), max(test_set[:, 1])))
def convert(data):
    new_data = []  # final list that we will return
    for id_users in range(1, nb_users + 1):
        id_movies = data[:, 1][data[:, 0] == id_users]   # IDs of the movies rated by id_users
        id_ratings = data[:, 2][data[:, 0] == id_users]  # all movie ratings given by this user
        ratings = np.zeros(nb_movies)
        ratings[id_movies - 1] = id_ratings  # movies not rated by the user keep null (0) values
        new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)
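After this conversion, training_set is a nested list; viewed as an array it has one row per user and one column per movie (a quick sanity check, the exact shape depends on your split):
check = np.array(training_set)
print(check.shape)  # (nb_users, nb_movies)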
Below is the code that I tried. It gave so many errors that I couldn't scroll up to the first one it threw.
import numpy as np
import pandas as pd
training_set = pd.read_csv('ml-100k/u1.base', delimiter = '\t')
training_set = np.array(training_set, dtype='int')
test_set = pd.read_csv('ml-100k/u1.test', delimiter = '\t')
test_set = np.array(test_set, dtype = 'int64')
nb_users = int(max(max(training_set[:, 0]), max(test_set[:, 0])))
nb_movies = int(max(max(training_set[:, 1]), max(test_set[:, 1])))
#The changes I made start here --
nb_timestamps = int(max(len(training_set[:, 3]), len(test_set[:, 3])))
ts_min = int(min(min(training_set[:, 3]), min(test_set[:, 3])))
ts_max = int(max(max(training_set[:, 3]), max(test_set[:, 3])))
def convert(data):
    new_data = []  # final list that we will return
    for timestamp in range(ts_min, ts_max + 1):
        for id_users in range(1, nb_users + 1):
            id_movies = data[:, 1][data[:, 0] == id_users][data[:, 3] == timestamp]
            # contains the IDs of the movies rated by id_users
            id_ratings = data[:, 2][data[:, 0] == id_users][data[:, 3] == timestamp]
            ratings = np.zeros(nb_movies)
            ratings[id_movies - 1] = id_ratings
            new_data.append(list(ratings))
    return new_data
training_set = convert(training_set)
test_set = convert(test_set)
Remark: please don't take this as an answer (yet).
There are a few things to improve in your code:
When you read the CSV you're taking the first row as a header, which means you're not considering all the data.
If, as should be the case here, a user can rate a movie only once, you can use pd.pivot_table to get your 2D matrix.
import pandas as pd
import numpy as np
training_set = pd.read_csv('ml-100k/u1.base',
                           delimiter='\t',
                           header=None,  # first row is not a header
                           names=["user", "movie", "rating", "timestamp"])  # name the columns
# with pd.pivot_table you get a df where users are in the rows
# and movies in the columns; the value is the rating for pair (i, j)
ratings = pd.pivot_table(training_set,
                         index=["user"],
                         columns=["movie"],
                         values="rating")
In case you want 0s instead of NaN you can use ratings.fillna(0). But I wouldn't do so: be careful, because the zeros will mess up any statistics you later want to extract.
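To see why, here is a toy sketch (made-up values, not from the MovieLens data): pandas ignores NaN when computing a mean, while filled-in zeros count as real ratings:
s = pd.Series([5.0, np.nan, np.nan, 4.0])
print(s.mean())            # 4.5, NaNs are skipped
print(s.fillna(0).mean())  # 2.25, the zeros drag the mean down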
In case you need the plain 2D matrix you can just use ratings.values.
UPDATE
In order to get your 3D matrix, we can do the same pivoting with the timestamps:
timestamps = pd.pivot_table(training_set,
                            index=["user"],
                            columns=["movie"],
                            values="timestamp")
# get the matrices
mat_ratings = ratings.values
mat_timestamps = timestamps.values
# stack the matrices along a third axis
mat3d = np.dstack((mat_ratings, mat_timestamps))
You can now check that from two matrices with shape (943, 1650) we get one with shape (943, 1650, 2). Note that to get the shape of a matrix mat you can just run mat.shape.
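As a usage sketch (looking up user 1 and movie 1 is illustrative): since pivot_table sorts rows and columns by the user and movie IDs, you can locate a particular (user, movie) pair through the pivoted DataFrame and then read both values from mat3d:
i = ratings.index.get_loc(1)    # row position of user 1
j = ratings.columns.get_loc(1)  # column position of movie 1
print(mat3d[i, j, 0], mat3d[i, j, 1])  # rating and timestamp, NaN if unrated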
I am using the function below to augment audio data generated from WAV audio files.
def generate_augmented_data(file_path):
    augmented_data = []
    samples = load_wav(file_path, get_duration=False)
    for time_value in [0.7, 1, 1.3]:
        for pitch_value in [-1, 0, 1]:
            time_stretch_data = librosa.effects.time_stretch(samples, rate=time_value)
            final_data = librosa.effects.pitch_shift(time_stretch_data, sr=sample_rate, n_steps=pitch_value)
            augmented_data.append(final_data)
    return augmented_data
I also need to augment the class labels and am facing difficulties with it.
I tried the code below, but it's not getting me the expected result.
## generating augmented data and labels
def generate_augmented_data_label(file_path, label):
    augmented_data = []
    augmented_label = []
    samples = load_wav(file_path, get_duration=False)
    for time_value in [0.7, 1, 1.3]:
        for pitch_value in [-1, 0, 1]:
            time_stretch_data = librosa.effects.time_stretch(samples, rate=time_value)
            final_data = librosa.effects.pitch_shift(time_stretch_data, sr=sample_rate, n_steps=pitch_value)
            augmented_data.append(final_data)
            augmented_label.append(label)
    return augmented_data, augmented_label
Before augmentation, the lengths of the data and labels are as below:
X_train.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
X_train_augmented_data = []
y_train_augmented_data = []
for i in range(len(X_train)):
    t1 = X_train.iloc[i]
    t2 = y_train[i]
    tmp1, tmp2 = generate_augmented_data_label(t1, t2)
    X_train_augmented_data.append(tmp1)
    y_train_augmented_data.append(tmp2)
len(X_train)  # 1600
len(y_train)  # 1600
print(len(X_train_augmented_data))
print(len(y_train_augmented_data))
After data augmentation and an additional masking step, the shapes come out as:
augmented_train_data_mask = []
for i in range(0, len(augmented_train_data_pad)):
    augmented_train_data_mask.append(list(map(bool, augmented_train_data_pad[i])))
augmented_train_data_mask = np.array(augmented_train_data_mask)
print(augmented_train_data_pad.shape)   # (14400, 17640)
print(augmented_train_data_mask.shape)  # (14400, 17640)
However, the label length is still 1600. Later, when I pass these into an LSTM model, I get a shape mismatch error:
ValueError: Data cardinality is ambiguous:
x sizes: 14400, 14400
y sizes: 1600
Make sure all arrays contain the same number of samples.
Looking for some help to resolve this issue.
You can use NumPy's repeat function to replicate your array.
Example:
In:  arr = np.arange(3); arr
Out: array([0, 1, 2])
In:  arr.repeat(3)
Out: array([0, 0, 0, 1, 1, 1, 2, 2, 2])
Hope this suffices your requirement.
You may refer to this link for reference:
https://www.geeksforgeeks.org/python-add-similar-value-multiple-times-in-list/
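Applied to this question (a sketch; the factor 9 comes from the 3 time-stretch values × 3 pitch-shift values per file):
# each input file yields 3 * 3 = 9 augmented samples, so repeat each label 9 times
y_train_augmented = np.asarray(y_train).repeat(9)
print(len(y_train_augmented))  # 1600 * 9 = 14400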
Since type(y_train) is a pandas Series, you can also build the repeated labels with itertools:
from itertools import repeat
new_label = []
for index, value in y_train.items():
    new_label.extend(repeat(value, 9))  # 9 augmented samples per original file
len(new_label)  # 1600 * 9 = 14400
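Note also that the loop in the question appends whole lists, so X_train_augmented_data ends up as 1600 nested lists of 9 samples each. Here is a sketch of that loop using extend instead, so the sample axis matches the labels:
X_train_augmented_data = []
y_train_augmented_data = []
for i in range(len(X_train)):
    tmp1, tmp2 = generate_augmented_data_label(X_train.iloc[i], y_train[i])
    X_train_augmented_data.extend(tmp1)  # 9 samples per file, one row each
    y_train_augmented_data.extend(tmp2)  # 9 matching labels
# both now have length 1600 * 9 = 14400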
I have a data frame that I need to iterate over. I want to use either apply or broadcasting and masking. This is the pseudocode I am trying to improve upon.
Algorithm 1: The algorithm

initialize the population (of size n) uniformly at random, obeying the bounds;
while a pre-determined number of iterations is not complete do
    set the random parameters (two independent parameters for each of the d variables);
    find the best and the worst vectors in the population;
    for each vector in the population do
        create a new vector using the current vector, the best vector, the worst vector, and the random parameters;
        if the new vector is at least as good as the current vector then
            current vector = new vector;
This is the code I have so far.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size=(20, 5)), columns=list('ABCDE'))
pd.set_option('display.max_columns', 500)
df
# while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best = final_func.idxmin()
xti_worst = final_func.idxmax()
print(final_func)
print(df.head())
print(df.tail())
# for loop of pseudocode: implement the equation from the assignment in array math
# for row in df.iterrows():
#     xi_new = row.to_numpy() + np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_best].values - np.absolute(row.to_numpy())) - np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_worst].values - np.absolute(row.to_numpy()))
#     print(xi_new)
df2 = df.apply(lambda row: 0 if row == 0 else row + np.random.uniform(0, 1, size=(1, 5)) * (df.iloc[xti_best].values - np.absolute(axis=1)))
print(df2)
The formula I am trying to use for xi_new is:
xi_new = xi_current + r1 * (xti_best - abs(xi_current)) - r2 * (xti_worst - abs(xi_current)), where r1 and r2 are random values in [0, 1].
I'm not sure I'm implementing your formula correctly, but hopefully this helps:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-5.0, 10.0, size=(20, 5)), columns=list('ABCDE'))
# while portion of pseudocode
f_func = np.square(df).sum(axis=1)
final_func = np.square(f_func)
xti_best_idx = final_func.idxmin()
xti_worst_idx = final_func.idxmax()
xti_best = df.loc[xti_best_idx]
xti_worst = df.loc[xti_worst_idx]
# calculate random values for the whole df, for the two places where you need randomness
nrows, ncols = df.shape
r1 = np.random.uniform(0, 1, size=(nrows, ncols))
r2 = np.random.uniform(0, 1, size=(nrows, ncols))
# xi_new = xi_current + r1 * (xti_best - abs(xi_current)) - r2 * (xti_worst - abs(xi_current))
df = df + r1 * xti_best.sub(df.abs()) - r2 * xti_worst.sub(df.abs())
df
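To also cover the while loop and the greedy acceptance step from the pseudocode, here is a sketch continuing from the df above, under the assumption that a lower objective value is better and with a hypothetical iteration budget:
n_iterations = 100  # hypothetical budget (an assumption, pick your own)
for _ in range(n_iterations):
    obj = np.square(np.square(df).sum(axis=1))  # objective value per row
    xti_best = df.loc[obj.idxmin()]             # best vector this iteration
    xti_worst = df.loc[obj.idxmax()]            # worst vector this iteration
    r1 = np.random.uniform(0, 1, size=df.shape)
    r2 = np.random.uniform(0, 1, size=df.shape)
    candidate = df + r1 * (xti_best - df.abs()) - r2 * (xti_worst - df.abs())
    cand_obj = np.square(np.square(candidate).sum(axis=1))
    improved = cand_obj <= obj                  # at least as good as the current vector
    df.loc[improved] = candidate.loc[improved]  # greedy acceptance, row by row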
I have 1000 samples of class 1 and 2500 samples of class 2, so naturally when using sklearn's train_test_split(test_size=200, stratify=y) I get an imbalanced test set, since the split preserves the class distribution of the original data set. However, I would like the test set to contain 100 samples of class 1 and 100 samples of class 2.
How would I do it? Any suggestions would be appreciated.
Split Manually
A manual solution isn't that scary. The main steps explained:
Isolate the indexes of the class-1 and class-2 rows.
Use np.random.permutation() to select n1 and n2 random test samples for class 1 and class 2 respectively.
Use df.index.difference() to perform the inverse selection for the train samples.
The code can easily be generalized to an arbitrary number of classes and arbitrary test sizes (just put n1/n2, idx1/idx2, etc. into lists and process them in a loop), but that's outside the scope of the question itself; a shorter groupby-based alternative is sketched after the code.
Code
import numpy as np
import pandas as pd
# data
df = pd.DataFrame(
    data={
        "label": np.array([1]*1000 + [2]*2500),
        # label 1 has value > 0, label 2 has value < 0
        "value": np.hstack([np.random.uniform(0, 1, 1000),
                            np.random.uniform(-1, 0, 2500)])
    }
)
df = df.sample(frac=1).reset_index(drop=True)
# sampling number for each class
n1 = 100
n2 = 100
# 1. get the indexes and lengths for the two classes
idx1 = df.index.values[df["label"] == 1]
idx2 = df.index.values[df["label"] == 2]
len1 = len(idx1)  # 1000
len2 = len(idx2)  # 2500
# 2. draw the indexes for the test dataset
draw1 = np.random.permutation(len1)[:n1]  # keep the first n1 entries
idx1_test = idx1[draw1]
draw2 = np.random.permutation(len2)[:n2]
idx2_test = idx2[draw2]
# combine the drawn indexes
idx_test = np.hstack([idx1_test, idx2_test])
# 3. derive the indexes for the train dataset
idx_train = df.index.difference(idx_test)
# split
df_train = df.loc[idx_train, :]  # optional: .reset_index(drop=True)
df_test = df.loc[idx_test, :]
# len(df_train) = 3300
# len(df_test) = 200
# verify that no row went missing
idx_merged = np.hstack([df_train.index.values, df_test.index.values])
assert len(np.unique(idx_merged)) == 3500
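Alternatively (not part of the manual solution above; requires pandas >= 1.1), groupby().sample() achieves the same balanced draw in two lines and generalizes to any number of classes:
# draw 100 test rows from each class, then take the complement as the train set
df_test = df.groupby("label").sample(n=100)
df_train = df.drop(df_test.index)
# len(df_test) = 200, len(df_train) = 3300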
I got this script for generating time-series data of water molecules, and I want to add one more header row to the generated matrix with the residue IDs of the water molecules. Could anybody help with reworking this script? Thanks!
import numpy as np
import MDAnalysis as mda
u = mda.Universe(PSF, DCD)
water_oxygens = u.select_atoms("name OW")
# pre-allocate the array for the data
data = np.zeros((u.trajectory.n_frames, water_oxygens.n_atoms + 1))
for i, ts in enumerate(u.trajectory):
    data[i, 0] = ts.time  # store the current time
    data[i, 1:] = water_oxygens.positions[:, 2]  # extract all z-coordinates
Here is an adjusted code example. You might need to install the package MDAnalysisTests to run it:
import numpy as np
import MDAnalysis as mda
from MDAnalysisTests.datafiles import waterPSF, waterDCD
u = mda.Universe(waterPSF, waterDCD)
water_oxygens = u.select_atoms("name OH2")
# pre-allocate the array for the data,
# with one extra row for the header of water residue IDs
data = np.zeros((u.trajectory.n_frames + 1, water_oxygens.n_atoms + 1))
# initialise the water residue IDs
data[0, 0] = np.NaN  # the time column has no residue ID
data[0, 1:] = water_oxygens.atoms.resids
for i, ts in enumerate(u.trajectory, start=1):
    data[i, 0] = ts.time  # store the current time
    data[i, 1:] = water_oxygens.positions[:, 2]  # extract all z-coordinates
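If you then want to write the matrix out, one option (a sketch; the filename and format are assumptions) is np.savetxt. Keep in mind that the residue IDs in the first row are stored as floats, because the array has a single dtype:
# hypothetical output step: row 1 holds the residue IDs (its time column is NaN),
# the remaining rows hold the time followed by the z-coordinates
np.savetxt("water_z.dat", data, fmt="%.4f")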
I would like to divide my data table (my_data2) into two samples, a learning sample and a test sample. How can I apply logistic regression to the first part of my table (the learning sample) and then apply predict to the second part? Thank you.
This is my code:
import numpy as np
from statsmodels.formula.api import logit
FNAME2 = "C:/Users/lenovo/Desktop/Nouveau dossier (2)/table.csv"
FinalTableau = np.savetxt(FNAME2, my_data[index_to_use], delimiter=",")
my_data2 = np.genfromtxt(FNAME2, delimiter=',')
x = my_data2[:, 1]
a = my_data2[:, 3]
# x with values 1 and 2
print x
# convert my binary data series from (1, 2) to (0, 1)
x = my_data2[:, 1] - 1
print x
form = 'x ~ a'
affair_model = logit(form, my_data2)
affair_result = affair_model.fit()
print affair_result.summary()
print affair_result.predict()
To split my_data2 into two arrays of roughly equal size:
N = len(my_data2)//2
learning_sample, test_sample = my_data2[:N], my_data2[N:]
For example,
import numpy as np
from statsmodels.formula.api import logit
FNAME2 = "C:/Users/lenovo/Desktop/Nouveau dossier (2)/table.csv"
FinalTableau = np.savetxt(FNAME2, my_data[index_to_use], delimiter=",")
my_data2 = np.genfromtxt(FNAME2, delimiter=',')
# converts my binary data series from (1, 2) to (0,1)
my_data2[:, 1] -= 1
# print my_data2
N = len(my_data2)//2
learning_sample, test_sample = my_data2[:N], my_data2[N:]
x = learning_sample[:, 1]
a = learning_sample[:, 3]
# x with values 1 and 2
print x
form = 'x ~ a'
affair_model = logit(form, learning_sample)
affair_result = affair_model.fit()
print affair_result.summary()
print affair_result.predict()