matrix vs. list - switching from Matlab to Python

Coming from a Matlab background, where everything is a matrix/vector, it was very easy to loop through a given data set and build a matrix successively. Since the final object was a matrix, it was also very easy to extract specific elements of the matrix. I'm finding it rather problematic in Python. I've reproduced the code here to explain where I am getting stuck.
The original data is just a time series with a month and a price. The goal is to simulate select subsets of these prices. The loop starts by collecting all months into one set, and then drops one month in each successive loop. For 12 months, I will have (n^2 - n)/2 + n = 78 columns in total in this example. To be clear, n is the total number of time periods; 12 in this data set. The rows of the matrix will be the Z scores sampled from the standard normal variable - the goal is to simulate all 78 prices in one go in a matrix. The number of Z scores is determined by the variable num_terminal_values, currently set to 5 to keep things simple and easy to visualize at this point.
Here's a link to a Google Sheet with the original data (google sheet with corr mat). The code below may not run straight from the Google Sheet; the sheet is only intended to show what the original data looks like. My steps (and Python code) are as follows:
#1 read the data
import pandas as pd

# xl refers to the Excel workbook being read (defined earlier, not shown in the post)
dfCrv = pd.read_excel(xl, sheet_name='upload', usecols=range(0, 2)).dropna(axis=0)
#2 create the looper variables and then loop through the data to build a matrix. The rows in the matrix are Z values sampled from the standard normal (this is the variable num_terminal_values). Each column refers to an individual simulation month.
import datetime as dt
import numpy as np
import numpy.random as npr

lst_zUCorr = []
num_terminal_values = 5
as_of = dt.datetime(2020, 12, 1)
max_months = dfCrv.shape[0]
sim_months = pd.date_range(dfCrv['term'].iloc[0], dfCrv['term'].iloc[-1], freq='MS')
end_month = dfCrv['term'].iloc[-1]
dfCrv = dfCrv.set_index('term', drop=False)

for runNum in range(max_months):
    sim_month = dfCrv['term'].iloc[runNum]
    ttm = ((sim_month - as_of).days) / 365
    num_months = (end_month.year - sim_month.year) * 12 + (end_month.month - sim_month.month) + 1
    # one block of Z draws per simulation month: num_terminal_values rows x num_months columns
    zUCorr = npr.standard_normal(size=(num_terminal_values, num_months))
    lst_zUCorr.append(zUCorr)
investigate the objects
lst_zUCorr
z = np.hstack(lst_zUCorr)
z
So far, everything works fine. However, I don't know how to transform the object lst_zUCorr into a simple matrix. I've tried np.hstack etc., but the result still doesn't look like a matrix to me.
The next set of operations requires the data in simple matrix form, but what I'm getting here isn't a matrix.
Key point/question - the final 5x78 matrix in Matlab can be used to do more operations. Is there a way to convert the equivalent Python object into a 5x78 matrix, or will I now need to do more coding to access specific subsets of the Python objects?
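A minimal sketch (my own, with made-up stand-in data since the spreadsheet isn't reproduced here) suggesting that np.hstack already gives the 5x78 object you are after - a plain 2-D ndarray, which can be sliced much like a MATLAB matrix:
import numpy as np
import numpy.random as npr
num_terminal_values = 5
# stand-in for lst_zUCorr: 12 blocks of shape (5, 12), (5, 11), ..., (5, 1)
lst_zUCorr = [npr.standard_normal(size=(num_terminal_values, k)) for k in range(12, 0, -1)]
z = np.hstack(lst_zUCorr)
print(z.shape)      # (5, 78)
print(z[0, :])      # first row: the first Z draw across all 78 simulation columns
print(z[:, 10:13])  # columns 10-12, analogous to MATLAB's z(:, 11:13)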

Related

Calculate the mean from excel sheet for specific rows

Hello guys! I am struggling to calculate the mean of certain rows from an Excel sheet using Python. In particular, I would like to calculate the mean for every three rows, starting from the first three and then moving to the next three, and so on. My Excel sheet contains 156 rows of data.
My data sheet looks like this:
And this is my code:
import numpy
import pandas as pd
df = pd.read_excel("My Excel.xlsx")
x = df.iloc[[0,1,2], [9,10,11]].mean()
print(x)
To sum up, I am trying to calculate the mean of Part 1 Measurements 1 (rows 1,2,3) and the mean of Part 2 Measurements 1 (rows 9,10,11) using one line of code, or some kind of index. I am expecting to receive two lists of numbers: one for the mean of Part 1 Measurement 1 (rows 1,2,3) and the other for the mean of Part 2 Measurements 1 (rows 10,11,12). I am also aware that Python counts row number one as 0, so the index should have the form n+1.
Thank you in advance.
You could (e.g.) generate a list for each mean you want to calculate:
x1, x2 = list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())
Or you could also generate a list of lists:
x = [list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())]
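If the goal is actually the mean of every block of three rows across all 156 rows (not just those two slices), a rough sketch of a more general approach (my addition, assuming the columns being averaged are numeric):
import numpy as np
import pandas as pd
df = pd.read_excel("My Excel.xlsx")
# group rows 0-2, 3-5, 6-8, ... together and take the mean of each group
means_per_triplet = df.groupby(np.arange(len(df)) // 3).mean(numeric_only=True)
print(means_per_triplet)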

Partition matrix into smaller matrices based on multiple values

So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity, I'll work with this much smaller matrix as an example. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). So using pandas, I've imported from an excel sheet my matrix called A that looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv("C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
index = pd.MultiIndex.from_product([labels01, labels02])
df.set_index(index, inplace=True)
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2", and created a MultiIndex based on all possible combinations of Label 1 vs. Label 2. In the df.set_index line, we attached that MultiIndex to the DataFrame, so those label combinations now act as the index for your other columns. For example, in order to access the slice of your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create your sliced DataFrame, you perform the calculations, save them to an output/results file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
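As a side note (my addition, not part of the original answer), pandas' groupby expresses the same split-then-calculate pattern directly, without building the MultiIndex by hand. A rough sketch, assuming the label columns are literally named "Label 1" and "Label 2" and the numeric column of interest is "Marker":
import pandas as pd
df = pd.read_csv(r"C:\path\cancerdata.csv")
# one pass over every (Label 1, Label 2) combination; each sub_df is one of the
# small matrices (e.g. 13G + Aa) you would otherwise store as M1, M2, ...
for (l1, l2), sub_df in df.groupby(["Label 1", "Label 2"]):
    stats = sub_df["Marker"].describe()  # or any custom statistics
    print(l1, l2)
    print(stats)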

Taking the mean of every n elements in an array and converting MATLAB code to Python

I am attempting to convert a MATLAB program to Python and have run into a snag with a certain loop: I have a 5868x3500 matrix comprising 5868 daily observations of the ratio of returns to volumes for 3500 stocks. This data is used to produce a measure of market liquidity by taking monthly averages of the ratio of each stock's return over its volume. I have a 5868x1 vector called Dummymonth which assigns an integer to each month from 1 to 270, with ~22 trading days per month (1,1,1,1,1,1,1,1,1,1,1... 2,2,2,2,2,2... 270,270,270).
The loop I'm stuck on needs to convert the 5868x3500 matrix into a 270x3500 matrix by taking the monthly average according to the Dummymonth values (i.e. basically taking the average of every ~22 values).
I've tried converting the code as cleanly as possible (substituting NumPy's np.argwhere() for MATLAB's find()), but I am relatively new to Python (and MATLAB, really), so the problems with the code are not immediately obvious to me.
Here is the section of MATLAB code I am trying to emulate:
numberofmonth = Dummymonth(size(Ret,1));
i = 1;
for di = 1:numberofmonth
    v = find(Dummymonth == di);
    for j = 1:size(Ret, 2)
        Amihud2(i,j) = nanmean(Amihud1(v,j));
    end
    i = i + 1;
end
And here is what I have in Python:
import numpy as np
Amihud2 = np.empty((270, len(Amihud1)))
for month_num in range(0, 270):
    v = np.argwhere(dummy == month_num)
    for i in range(1, len(Amihud1)):
        for j in range(1, len(Amihud1[0])):
            Amihud2[i][j] = np.mean(Amihud1[v][j])
The errors I am usually seeing are "index out of bounds errors".
I think one of the errors is related to Python's 0 indexing. If you loop over something and start at 1, you miss the first (index 0) values. Here is one solution (there are many):
import numpy as np

# Create dummy index
dummy = np.array([np.repeat(i, 22) for i in np.arange(270) + 1]).flatten()

# Make dataset for example
dat = np.random.random((len(dummy), 3500))

# Calculate average per month
dat2 = np.empty((270, 3500))
i = -1
for m in np.unique(dummy):
    i = i + 1
    dat2[i, :] = dat[dummy == m].mean(axis=0)
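For completeness (my addition, not part of the original answer), the same monthly average can be written without explicit loops by letting pandas group the daily rows on the dummy-month key. A sketch with stand-in data of the same shape as above:
import numpy as np
import pandas as pd
dummy = np.repeat(np.arange(1, 271), 22)        # month label per daily row
Amihud1 = np.random.random((len(dummy), 3500))  # stand-in for the daily ratio matrix
# group the daily rows by month and average each group; NaNs are skipped by
# default, which mirrors MATLAB's nanmean
Amihud2 = pd.DataFrame(Amihud1).groupby(dummy).mean().to_numpy()
print(Amihud2.shape)  # (270, 3500)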

Append multiple columns into two columns python

I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
with open("original_file.csv") as ofile:
stack_vec = []
next(ofile)
for line in ofile:
columns = lineo.split(',') # get all the columns
for i in range (0,len(columns)):
stack_vec.append(columnso[i])
np.savetxt("converted.csv",stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.
If I got this correct, you have a csv with 96 rows and 100 columns and want to stack it into one vector, day after day, giving a vector with 9600 entries, right?
An easy approach would be to use numpy:
import numpy as np
x = np.genfromtxt('original_file.csv', delimiter=',')
data = x.ravel(order ='F')
Note that numpy is a third-party library, but it is the go-to library for math.
The first line will read the csv into an ndarray, which is like a matrix (even though it behaves differently for some mathematical operations).
Then with ravel you flatten it into one vector. The order='F' argument makes it walk down each column in turn, i.e. day after day, instead of across each row. (Leave it as the default if you want it time point after time point instead.)
For your date problem, see How can I make a python numpy arange of datetime; I guess I couldn't give a better example.
Once you have these two arrays, you can ensure the shape with x.reshape(9600, 1) and then stack them with np.concatenate([x, dates], axis=1), with dates being your date vector.
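A rough sketch of both halves put together (column stacking plus the 15-minute timestamps), assuming the header row holds parseable dates and each column has 96 readings - my illustration, not the answerer's code:
import pandas as pd
# read with the header so the dates in row 1 are kept as column names
df = pd.read_csv("original_file.csv")
temps = df.to_numpy().ravel(order='F')  # stack the readings day after day
# one 15-minute timestamp per reading: 96 slots for each date in the header
times = pd.concat(
    [pd.Series(pd.date_range(day, periods=96, freq='15min')) for day in df.columns],
    ignore_index=True,
)
out = pd.DataFrame({'datetime': times, 'temperature': temps})
out.to_csv("converted.csv", index=False)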

efficient, fast numpy histograms

I have a 2D numpy array consisting of ca. 15'000'000 datapoints. Each datapoint has a timestamp and an integer value (between 40 and 200). I must create histograms of the datapoint distribution (16 bins: 40-49, 50-59, etc.), sorted by year, by month within the current year, by week within the current year, and by day within the current month.
Now, I wonder what might be the most efficient way to accomplish this. Given the size of the array, performance is a conspicuous consideration. I am considering nested "for" loops, breaking down the arrays by year, by month, etc. But I was reading that numpy arrays are highly memory-efficient and have all kinds of tricks up their sleeve for fast processing. So I was wondering if there is a faster way to do that. As you may have realized, I am an amateur programmer (a molecular biologist in "real life") and my questions are probably rather naïve.
First, fill in your 16 bins without considering date at all.
Then, sort the elements within each bin by date.
Now, you can use binary search to efficiently locate a given year/month/week within each bin.
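A rough sketch of that idea (my own illustration, not the answerer's code), assuming timestamps (in seconds) and values are 1-D arrays as described in the question:
import numpy as np
rng = np.random.default_rng(0)
timestamps = rng.integers(0, 5 * 365 * 86400, size=1_000_000)  # stand-in data
values = rng.integers(40, 200, size=1_000_000)                 # 40..199
# step 1: assign every datapoint to one of the 16 value bins (40-49, 50-59, ...)
bin_ids = (values - 40) // 10
# step 2: within each bin, keep the timestamps sorted
bins = [np.sort(timestamps[bin_ids == b]) for b in range(16)]
# step 3: binary search for a time window inside one bin, e.g. days 100-130 of the 70-79 bin
lo, hi = 100 * 86400, 130 * 86400
count = np.searchsorted(bins[3], hi) - np.searchsorted(bins[3], lo)
print(count)  # number of datapoints in that bin within that window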
There is a function in numpy for this: numpy.bincount. It is blazingly fast - so fast that you can create a bin for each integer (161 bins) and each day (maybe 30000 different days?), resulting in a few million bins.
The procedure:
calculate an integer index for each datapoint (e.g. 17 x the number of days since the first day in the file + (value - 40)//10)
run np.bincount
reshape to the correct shape (number of days, 17)
Now you have the binned data which can then be clumped into whatever bins are needed in the time dimension.
Without knowing the form of your input data the integer bin calculation code could be something like this:
import numpy as np

# let us assume we have the data as:
# timestamps: 64-bit integer (seconds since something)
# values: 8-bit unsigned integer with integers between 40 and 200

# find the first day in the sample (86400 seconds per day)
first_day = np.min(timestamps) // 86400

# we intend to do this but fast:
indices = (timestamps // 86400 - first_day) * 17 + ((values - 40) // 10)

# get the bincount vector
b = np.bincount(indices)

# calculate the number of days in the sample
no_days = (len(b) + 16) // 17

# reshape b
b.resize((no_days, 17))
It should be noted that the first and last days in b depend on the data. In testing, most of the time is spent calculating the indices (around 400 ms with an i7 processor). If that needs to be reduced, it can be done in approximately 100 ms with the numexpr module. However, the actual implementation depends really heavily on the form of the timestamps; some are faster to calculate, some slower.
However, I doubt if any other binning method will be faster if the data is needed up to the daily level.
I did not quite understand from your question whether you wanted separate views on the data (one by year, one by week, etc.) or some other binning method. In any case, that boils down to summing the relevant rows together.
Here is a solution, employing the group_by functionality found in the link below:
http://pastebin.com/c5WLWPbp
import numpy as np

dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
np.random.shuffle(dates)
values = np.random.randint(40, 200, len(dates))

years = np.array(dates, dtype='datetime64[Y]')
months = np.array(dates, dtype='datetime64[M]')
weeks = np.array(dates, dtype='datetime64[W]')

from grouping import group_by  # the module from the pastebin link above

bins = np.linspace(40, 200, 17)
for m, g in zip(*group_by(months)(values)):  # (unique month, values falling in that month)
    print(m)
    print(np.histogram(g, bins=bins)[0])
Alternatively, you could take a look at the pandas package, which probably has an elegant solution to this problem as well.
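For reference, a rough sketch of the pandas route hinted at in that last sentence (my own illustration, reusing the same synthetic dates and values):
import numpy as np
import pandas as pd
dates = np.arange('2004-02', '2005-05', dtype='datetime64[D]')
values = np.random.randint(40, 200, len(dates))
df = pd.DataFrame({'date': dates, 'value': values})
bins = np.linspace(40, 200, 17)
# 16-bin histogram of the values falling in each calendar month
monthly = (
    df.groupby(df['date'].dt.to_period('M'))['value']
      .apply(lambda v: np.histogram(v, bins=bins)[0])
)
print(monthly)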
