Measuring covariance on several rows - python

I'm new to Python and am trying to find my way by performing some calculations (I can do them easily in Excel, but now I want to know how to do them in Python).
One calculation is the covariance.
I have a simple example with 3 items that are sold, and we have the demand per item over 24 months.
Here is a snapshot of the Excel file:
Items and their demand over 24 months
The goal is to measure the covariance between all three items, that is, between items 1 and 2, 1 and 3, and 2 and 3. I also want to know how to do it for more than 3 items, let's say for a thousand items.
The calculations are as follows:
First I have to calculate the average per item. I already worked out how to do this.
After importing the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
I imported the file:
df = pd.read_excel("Directory\\Covariance.xlsx")
And calculated the average per row:
x=df.iloc[:,1:].values
df['avg'] = x.mean(axis=1)
This gives the file with an extra column, the average (avg):
Items, their demand and the average
The next step is to calculate the covariance between, for example, item 1 and item 2. Mathematically, this is done as follows:
(column "1" of item 1 - column "avg" of item 1) * (column "1" of item 2 - column "avg" of item 2). This has to be done for columns "1" to "24", so 24 times, which adds 24 columns to df.
After this, we take the average of these 24 columns, which gives the covariance between item 1 and item 2. Each item has a covariance with each of the other N-1 items, so in this simple case there are 2 covariance numbers per item (item 1 with items 2 and 3, item 2 with items 1 and 3, and item 3 with items 1 and 2).
So the first question is: how can I achieve this for these 3 items, so that the file has columns displaying the 2 covariance outcomes per item (for the first item, one column with the covariance between items 1 and 2 and a second column with the covariance between items 1 and 3, and so on)?
The second question is, of course: what if I have 1000 items? Then I have 999 covariance numbers per item and thus 999 extra columns, but also 999*25 extra columns if I calculate it via the above methodology. How do I perform this calculation for every item as efficiently as possible?

Pandas has a built-in function to calculate the covariance matrix, but first you need to make sure your dataframe is in the correct format. The first column in your data actually contains the row labels, so let's put those in the index:
df = pd.read_excel("Directory\\Covariance.xlsx", index_col=0)
Then you can also calculate the mean more easily, but don't put it back in your dataframe yet!
avg = df.mean(axis=1)
To calculate the covariance matrix, just call .cov(). However, this calculates pair-wise covariances of columns, so transpose the dataframe first:
cov = df.T.cov()
If you want, you can put everything together in 1 dataframe:
df['avg'] = avg
df = df.join(cov, rsuffix='_cov')
Note: the covariance matrix also includes each item's covariance with itself, which is that item's variance.
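One detail worth checking against the hand calculation in the question: by default, DataFrame.cov() divides by N-1 (sample covariance), whereas taking the plain average of the 24 products divides by N. The sketch below uses made-up demand numbers (3 items, 4 months; the labels item1..item3 are assumptions) to show both versions and how to convert between them.

```python
import numpy as np
import pandas as pd

# Hypothetical demand data: 3 items (rows) x 4 months (columns).
df = pd.DataFrame(
    [[10, 12, 14, 16], [5, 7, 6, 8], [20, 18, 17, 15]],
    index=["item1", "item2", "item3"],
    columns=[1, 2, 3, 4],
)

# pandas default: sum of products divided by N - 1.
cov_sample = df.T.cov()

# Rescale to divide by N instead, matching "take the average of the products".
n = df.shape[1]
cov_population = cov_sample * (n - 1) / n

# Manual check for item1 vs item2, following the steps in the question.
dev = df.sub(df.mean(axis=1), axis=0)            # demand minus row average
manual = (dev.loc["item1"] * dev.loc["item2"]).mean()

# NumPy alternative: rows are treated as variables, bias=True divides by N.
cov_np = np.cov(df.to_numpy(), bias=True)
```

For a thousand items this still produces a single 1000 x 1000 matrix in one call, so no intermediate columns are needed.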

Related

Calculate the mean from excel sheet for specific rows

Hello guys! I am struggling to calculate the mean of certain rows from an Excel sheet using Python. In particular, I would like to calculate the mean for every three rows, starting from the first three and then moving to the next three and so on. My Excel sheet contains 156 rows of data.
My data sheet looks like this:
And this is my code:
import numpy
import pandas as pd
df = pd.read_excel("My Excel.xlsx")
x = df.iloc[[0,1,2], [9,10,11]].mean()
print(x)
To sum up, I am trying to calculate the mean of Part 1 Measurements 1 (rows 1,2,3) and the mean of Part 2 Measurements 1 (rows 9,10,11) using one line of code, or some kind of index. I am expecting to receive two lists of numbers, one that stands for the mean of Part 1 Measurement 1 (rows 1,2,3) and the other for the mean of Part 2 Measurements 1 (rows 10,11,12). I am also familiar with the fact that Python counts row number one as 0. The index should have a form of n+1.
Thank you in advance.
You could (e.g.) generate a list for each mean you want to calculate:
x1, x2 = list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())
Or you could also generate a list of lists:
x = [list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())]
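For the "every three rows" part of the question, one possible approach is to label each block of three rows with an integer and group on that label. This is only a sketch on a small made-up frame (column names a and b are assumptions); on the real 156-row sheet it should give 52 rows of means, provided all columns are numeric.

```python
import numpy as np
import pandas as pd

# Hypothetical sheet: 6 rows, 2 numeric columns.
df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                   "b": [10, 20, 30, 40, 50, 60]})

# Rows 0-2 get label 0, rows 3-5 get label 1, and so on.
block = np.arange(len(df)) // 3

# One row of column means per block of three rows.
block_means = df.groupby(block).mean()

# Means of one specific block as a plain list, e.g. the second block.
second_block = list(block_means.iloc[1])
```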

How to get consecutive averages of the column values based on the condition from another column in the same data frame using pandas

I have a large data frame in pandas with two columns, Time and Values. I want to calculate consecutive averages of the values in column Values, based on a condition formed from the column Time.
I want to calculate average of the first l values in column Values, then next l values from the same column and so on, till the end of the data frame. The value l is the number of values that go into every average and it is determined by the time difference in column Time. Starting data frame looks like this
Time Values
t1 v1
t2 v2
t3 v3
... ...
tk vk
For example, average needs to be taken at every 2 seconds and the number of time values inside that time difference will determine the number of values l for which the average will be calculated.
a1 would be the first average of l values, a2 next, and so on.
Second part of the question is the same calculation of averages, but if the number l is known in advance. I tried this
df['Time'].iloc[0:l].mean()
which works for the first l values.
In addition, I would need to store the average values in another data frame with columns Time and Averages for plotting using matplotlib.
How can I use pandas to achieve my goal?
I have tried the following
df = pd.DataFrame({'Time': [1595006371.756430732,1595006372.502789381 ,1595006373.784446912 ,1595006375.476658051], 'Values': [4,5,6,10]},index=list('abcd'))
I get
Time Values
a 1595006371.756430732 4
b 1595006372.502789381 5
c 1595006373.784446912 6
d 1595006375.476658051 10
Time is in the format seconds.milliseconds.
If I expect to have the same number of values in every 2 seconds till the end of the data frame, I can use the following loop to calculate value of l:
s=1
l=0
while df['Time'][s] - df['Time'][0] <= 2:
    s+=1
    l+=1
Could this be done differently, without the loop?
How can I do this if number l is not expected to be the same inside each averaging interval?
For the given l, I want to calculate average values of l elements in another column, for example column Values, and to populate column Averages of data frame df1 with these values.
I tried with the following code
p = 0
df1 = pd.DataFrame(columns=['Time', 'Averages'])
for w in range(0, len(df) - 1, 2):
    # .loc with a new row label p adds that row to df1
    df1.loc[p, 'Averages'] = df['Values'].iloc[w:w+2].mean()
    p = p + 1
Is there any other way to calculate these averages?
To clarify a bit more.
I have two columns Time and Values. I want to determine how many consecutive values from the column Values should be averaged at one point. I do that by determining this number l from the column Time by calculating how many rows are inside the time difference of 2 seconds. When I determined that value, for example 2, then I average first two values from the column Values, and then next 2, and so on till the end of the data frame. At the end, I store this value in the separate column of another data frame.
I would appreciate your assistance.
You talk about Time and Value and then groups of consecutive rows.
If you want to group by consecutive rows and get the mean of the Time and Value this does it for you. You really need to show by example what you are really trying to achieve.
import datetime as dt
import random
import pandas as pd
d = list(pd.date_range(dt.datetime(2020,7,1), dt.datetime(2020,7,2), freq="15min"))
df = pd.DataFrame({"Time":d,
                   "Value":[round(random.uniform(0, 1),6) for x in d]})
df
n = 5  # number of consecutive rows per group
df.assign(grp=df.index//n).groupby("grp").agg({"Time":lambda s: s.mean(),"Value":"mean"})
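If the number of rows per 2-second interval is not constant, one option is to group on the elapsed time instead of on the row position. The sketch below reuses the four sample rows from the question and assumes Time holds seconds as floats; the names window, bins and Count are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": [1595006371.756430732, 1595006372.502789381,
             1595006373.784446912, 1595006375.476658051],
    "Values": [4, 5, 6, 10],
})

window = 2.0  # seconds per averaging interval (assumed from the question)

# Bin number of each row: 0 for the first 2 s after the start, 1 for the next 2 s, ...
bins = ((df["Time"] - df["Time"].iloc[0]) // window).astype(int).rename("bin")

# One row per interval: first timestamp, mean of Values, and how many rows went in.
df1 = (
    df.groupby(bins)
      .agg(Time=("Time", "first"),
           Averages=("Values", "mean"),
           Count=("Values", "size"))
      .reset_index(drop=True)
)
```

Here Count plays the role of l for each interval, so no loop is needed to determine it.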

How to arrange a loop in order to loop over columns and then do something

I'm a complete newbie to python, and I'm currently trying to work on a problem that allows me to take the average of each column except the number of columns is unknown.
I figured how to do it if I knew how many columns it is and to do each calculation separate. I'm supposed to do it by creating an empty list and looping the columns back into it.
import numpy as np
#average of all data not including NAN
def average (dataset):
    return np.mean (dataset [np.isfinite (dataset)])
#this is how I did it by each column separate
dataset = np.genfromtxt("some file")
print (average(dataset [:,0]))
print (average(dataset [:,1]))
#what I'm trying to do with a loop
def avg (dataset):
    for column in dataset:
        lst = []
        column = #i'm not sure how to define how many columns I have
        Avg = average (column)
    return Avg
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis indicates whether you are taking the average along columns or rows (axis = 0 means you take the average of each column, what you are trying to do). The output will be a vector whose length is the same as the number of columns (or rows) along which you took the average, and each element is the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance to do this.
You CAN do this using a for loop, but it's not a good idea -- looping over matrices in numpy is slow, whereas using vectorized operations like np.mean() is very very fast. So in general when using numpy one tries to use those types of built-in operations instead of looping over everything at least if possible.
Also, if you want the number of columns in your matrix, my_matrix.shape[1] returns the number of columns and my_matrix.shape[0] the number of rows.
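Since the question's average() skips NaN values, note that np.mean(..., axis=0) returns NaN for any column that contains one. A sketch of two alternatives on a small made-up array: np.nanmean, which ignores NaN, and a loop over the transposed array that reuses the question's isfinite filter (which also drops infinities, unlike np.nanmean).

```python
import numpy as np

# Hypothetical data with missing values in both columns.
dataset = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])

# Vectorized: mean of each column, ignoring NaN entries.
col_means = np.nanmean(dataset, axis=0)

# Loop version: iterating over dataset.T yields one column at a time,
# so the number of columns never has to be known in advance.
loop_means = [np.mean(col[np.isfinite(col)]) for col in dataset.T]
```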

Is there any way to do a task for each member of a list without loop in python?

I have a large number of arrays in pandas, each with 256 rows and 5 columns, and I would like to calculate statistical features (min, max, mean, ...) for every 4 consecutive elements in each column. I wrote the following code, but it is very time-consuming:
for col in array:
    for j in range(0,256,1):
        min = array[col].iloc[j:j+4].min()
        max= array[col].iloc[j:j+4].max()
        # (other functions)
As I have many arrays and would like to do this task for each of them, it is very time-consuming. Is there any way to write simpler code without a loop that decreases the execution time?
You want to calculate min and max for 4 consecutive elements of a pandas.DataFrame?
This can be done using pandas rolling:
df.rolling(4).agg(['min', 'max']).shift(-3)
The shift is necessary as the default for pandas is to have the window right aligned.
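To see what the shift does, here is a small sketch on a made-up single-column frame (the column name a is an assumption). After shifting, row j holds the statistics of rows j to j+3. The last three rows become NaN because no full window of 4 exists there, whereas the loop in the question would compute them on shorter slices.

```python
import pandas as pd

# Hypothetical data: one column, six rows.
df = pd.DataFrame({"a": [3, 1, 4, 1, 5, 9]})

# Right-aligned windows of 4, then shifted up by 3 so that
# row j describes rows j..j+3.
stats = df.rolling(4).agg(["min", "max"]).shift(-3)

# The result has two-level columns: (original column, statistic).
first_min = stats[("a", "min")].iloc[0]   # min of rows 0..3
first_max = stats[("a", "max")].iloc[0]   # max of rows 0..3
```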

Pandas not saving changes when iterating rows

Let's say I have the following dataframe:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
Now I want to multiply the variable Shots by a random value (multiplier in the code) and recalculate the StG variable, which is nothing but Shots/Goals. The code I used is:
for index,row in df.iterrows():
    multiplier = (np.random.randint(1,5+1))
    row['Shots'] *= multiplier
    row['StG']=float(row['Shots'])/float(row['Goals'])
Then I saved the .csv and it was identical to the original one, so after the for loop I simply used print(df) and obtained:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
If I print the values row by row during the for iteration I see that they change, but it's as if they are not saved in the df.
I think it is because I'm only accessing the values, not the actual dataframe.
I should add something like df.row[], but that returns "DataFrame has no row property".
Thanks for the help.
____EDIT____
for index,row in df.iterrows():
    multiplier = (np.random.randint(1,5+1))
    row['Impresions']*=multiplier
    row['Clicks']*=(np.random.randint(1,multiplier+1))
    row['Ctr']= float(row['Clicks'])/float(row['Impresions'])
    row['Mult']=multiplier
    #print (row['Clicks'],row['Impresions'],row['Ctr'],row['Mult'])
The main condition is that the number of Clicks can never be higher than the number of Impressions.
Then I recalculate the ratio Clicks/Impressions as CTR.
I am not sure if multiplying the entire column is the best choice to maintain the condition that Impr >= Clicks for each row, hence I went row by row.
From the pandas docs about iterrows() (pandas.DataFrame.iterrows):
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
The good news is you don't need to iterate over rows - you can perform the operations on columns:
# Generate an array of random integers of same length as your DataFrame
multipliers = np.random.randint(1, 5+1, size=len(df))
# Multiply corresponding elements from df['Shots'] and multipliers
df['Shots'] *= multipliers
# Recalculate df['StG']
df['StG'] = df['Shots']/df['Goals']
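The same column-wise idea can be extended to the EDIT in the question, where the click multiplier must not exceed the impression multiplier. This is a sketch under assumptions: the sample numbers are made up, the column spellings follow the question's code, and it relies on NumPy's Generator.integers accepting an array as the upper bound so that each row gets its own range.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; Clicks <= Impresions holds in every row.
df = pd.DataFrame({"Impresions": [10, 20, 30], "Clicks": [2, 20, 5]})

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

# One impression multiplier per row, drawn from 1..5 inclusive.
mult = rng.integers(1, 5 + 1, size=len(df))

# Per-row click multiplier drawn from 1..mult, so it never exceeds mult.
click_mult = rng.integers(1, mult + 1)

df["Impresions"] *= mult
df["Clicks"] *= click_mult
df["Ctr"] = df["Clicks"] / df["Impresions"]
df["Mult"] = mult
```

Because click_mult <= mult in every row, rows that started with Clicks <= Impresions keep that property without any row-by-row loop.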
Define a function that returns a series:
def f(x):
    m = np.random.randint(1,5+1)
    return pd.Series([x.Shots * m, x.Shots/x.Goals * m])
Apply the function to the data frame row-wise. It returns another data frame, which can be used to replace some columns in the existing data frame or to create new columns in it:
df[['Shots', 'StG']] = df.apply(f, axis=1)
This approach is very flexible as long as the new column values depend only on other values in the same row.
