Hello guys! I am struggling to calculate the mean of certain rows from
an excel sheet using python. In particular, I would like to calculate the mean for every three rows starting from the first three and then moving to the next three and so on. My excel sheet contains 156 rows of data.
My data sheet looks like this:
And this is my code:
import numpy
import pandas as pd
df = pd.read_excel("My Excel.xlsx")
x = df.iloc[[0,1,2], [9,10,11]].mean()
print(x)
To sum up, I am trying to calculate the mean of Part 1 Measurements 1 (rows 1,2,3) and the mean of Part 2
Measurements 1 (rows 9,10,11) using one line of code, or some kind of index. I am expecting to receive two lists of numbers, one that stands for the mean of Part 1 Measurement 1 (rows 1,2,3) and the other for the mean of Part 2 Measurements 1 (rows 10,11,12). I am also familiar with the fact that python counts row number one as 0. The index should have a form of n+1.
Thank you in advance.
You could (e.g.) generate a list for each mean you want to calculate:
x1, x2 = list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())
Or you could also generate a list of lists:
x = [list(df.iloc[[0,1,2]].mean()), list(df.iloc[[9,10,11]].mean())]
Related
I have large data frame in pandas which has two columns Time and Values. I want to calculate consecutive averages for values in column Values based on the condition which is formed from the column Time.
I want to calculate average of the first l values in column Values, then next l values from the same column and so on, till the end of the data frame. The value l is the number of values that go into every average and it is determined by the time difference in column Time. Starting data frame looks like this
Time Values
t1 v1
t2 v2
t3 v3
... ...
tk vk
For example, average needs to be taken at every 2 seconds and the number of time values inside that time difference will determine the number of values l for which the average will be calculated.
a1 would be the first average of l values, a2 next, and so on.
Second part of the question is the same calculation of averages, but if the number l is known in advance. I tried this
df['Time'].iloc[0:l].mean()
which works for the first l values.
In addition, I would need to store the average values in another data frame with columns Time and Averages for plotting using matplotlib.
How can I use pandas to achieve my goal?
I have tried the following
df = pd.DataFrame({'Time': [1595006371.756430732,1595006372.502789381 ,1595006373.784446912 ,1595006375.476658051], 'Values': [4,5,6,10]},index=list('abcd'))
I get
Time Values
a 1595006371.756430732 4
b 1595006372.502789381 5
c 1595006373.784446912 6
d 1595006375.476658051 10
Time is in the format seconds.milliseconds.
If I expect to have the same number of values in every 2 seconds till the end of the data frame, I can use the following loop to calculate value of l:
s=1
l=0
while df['Time'][s] - df['Time'][0] <= 2:
s+=1
l+=1
Could this be done differently, without the loop?
How can I do this if number l is not expected to be the same inside each averaging interval?
For the given l, I want to calculate average values of l elements in another column, for example column Values, and to populate column Averages of data frame df1 with these values.
I tried with the following code
p=0
df1=pd.DataFrame(columns=['Time','Averages']
for w in range (0, len(df)-1,2):
df1['Averages'][p]=df['Values'].iloc[w:w+2].mean()
p=p+1
Is there any other way to calculate these averages?
To clarify a bit more.
I have two columns Time and Values. I want to determine how many consecutive values from the column Values should be averaged at one point. I do that by determining this number l from the column Time by calculating how many rows are inside the time difference of 2 seconds. When I determined that value, for example 2, then I average first two values from the column Values, and then next 2, and so on till the end of the data frame. At the end, I store this value in the separate column of another data frame.
I would appreciate your assistance.
You talk about Time and Value and then groups of consecutive rows.
If you want to group by consecutive rows and get the mean of the Time and Value this does it for you. You really need to show by example what you are really trying to achieve.
d = list(pd.date_range(dt.datetime(2020,7,1), dt.datetime(2020,7,2), freq="15min"))
df = pd.DataFrame({"Time":d,
"Value":[round(random.uniform(0, 1),6) for x in d]})
df
n = 5
df.assign(grp=df.index//5).groupby("grp").agg({"Time":lambda s: s.mean(),"Value":"mean"})
So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity of what I'm trying to do, I'll work with this much smaller matrix as an example for what I'm trying to do. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). So using pandas, I've imported from an excel sheet my matrix called A that looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself how the resulting DataFrame looks like:
df = pd.read_csv("C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
index = pd.MultiIndex.from_product([labels01, labels02])
df.set_index(index, inplace=True)
print(df)
Here, we extracted all unique values in the columns "Label 1" and "Label 2", and created an MultiIndex based on all possible combinations of Label 1 vs. Label 2. In the df.set_index line, we extracted those columns from the DataFrame - now they act as indices for your other columns. For example, in order to access the DataFrame slice from your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
Lastly, instead of saving each sliced DataFrame into a list or dictionary, and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create your sliced DataFrame, you peform the calculations, save them to a output/results file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
for L2 in labels02:
sliced_df = df.loc[L1].loc[L2]
results = perform_calculations(sub_df)
save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
I'm new to Python and I'm trying to find my way by trying to perform some calculations (i can do them easily in excel, but now I want to know how to do it in Python).
One calculation is the covariance.
I have a simple example where I have 3 items that are sold and we have the demand per item of 24 months.
Here, you see a snapshot of the excel file:
Items and their demand over 24 months
The goal is to measure the covariance between all the three items. Thus the covariance between item 1 and 2, 1 and 3 and 2 and 3. But also, I want to know how to do it for more than 3 items, let's say for a thousand items.
The calculations are as follows:
First I have to calculate the averages per item. This is already something I found by doing the following code:
after importing the following:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
I imported the file:
df = pd.read_excel("Directory\\Covariance.xlsx")
And calculated the average per row:
x=df.iloc[:,1:].values
df['avg'] = x.mean(axis=1)
This gives the file with an extra column, the average (avg):
Items, their demand and the average
The following calculation that should be done is to calculate the covariance between, lets say for example, item 1 and 2. this is mathematically done as follows:
(column "1" of item 1- column "avg" of item 1)*(column "1" of item 2- column "avg" of item 2). This has to be done for column "1" to "24", so 24 times. This should add 24 columns to the file df.
After this, we should take the average of these columns and that displays the covariance between item 1 and 2. Because we have to do this N-1 times, so in this simple case we should have 2 covariance numbers (for the first item, the covariance with item 2 and 3, for the second item, the covariance with item 1 and 3 and for the third item, the covariance with item 1 and 2).
So the first question is; how can I achieve this for these 3 items, so that the file has a column that displays 2 covariance outcomes per item (first item should have a column with the covariance number of item 1 and 2 and a second column with the covariance number between item 1 and 3, and so on...).
The second question is of course: what if I have a 1000 items; how do I then efficiently do this, because then I have 999 covariance numbers per item and thus 999 extra columns, but also 999*25 columns extra if I calculate it via the above methodology. So how do I perform this calculation for every item as efficient as possible?
Pandas has a builtin function to calculate the covariance matrix, but first you need to make sure your dataframe is in the correct format. The first column in your data actually contains the row labels, so let's put those in the index:
df = pd.read_excel("Directory\\Covariance.xlsx", index_col=0)
Then you can calculate also the mean more easily, but don't put it back in your dataframe yet!
avg = df.mean(axis=1)
To calculate the covariance matrix, just call .cov(). This however calculates pair-wise covariances of columns, to transpose the dataframe first:
cov = df.T.cov()
If you want, you can put everything together in 1 dataframe:
df['avg'] = avg
df = df.join(cov, rsuffix='_cov')
Note: the covariance matrix includes the covariance with itself = the variance per item.
I have a csv file which contains approximately 100 columns of data. Each column represents temperature values taken every 15 minutes throughout the day for each of the 100 days. The header of each column is the date for that day. I want to convert this into two columns, the first being the date time (I will have to create this somehow), and the second being the temperatures stacked on top of each other for each day.
My attempt:
with open("original_file.csv") as ofile:
stack_vec = []
next(ofile)
for line in ofile:
columns = lineo.split(',') # get all the columns
for i in range (0,len(columns)):
stack_vec.append(columnso[i])
np.savetxt("converted.csv",stack_vec, delimiter=",", fmt='%s')
In my attempt, I am trying to create a new vector with each column appended to the end of it. However, the code is extremely slow and likely not working! Once I have this step figured out, I then need to take the date from each column and add 15 minutes to the date time for each row. Any help would be greatly appreciated.
If i got this correct you have a csv with 96 rows and 100 Columns and want to stack in into one vector day after day to a vector with 960 entries , right ?
An easy approach would be to use numpy:
import numpy as np
x = np.genfromtxt('original_file.csv', delimiter=',')
data = x.ravel(order ='F')
Note numpy is a third party library but the go-to library for math.
the first line will read the csv into a ndarray which is like matrix ( even through it behaves different for mathematical operations)
Then with ravel you vectorize it. the oder is so that it stacks rows ontop of each other instead of columns, i.e day after day. (Leave it as default / blank if you want time point after point)
For your date problem see How can I make a python numpy arange of datetime i guess i couldn't give a better example.
if you have this two array you can ensure the shape by x.reshape(960,1) and then stack them with np.concatenate([x,dates], axis = 1 ) with dates being you date vector.
let's say I have the following dataframe:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
Now I want to multiply the variable Shots for a random value (multiplier in the code) and recaclucate the StG variable that is nothing but Shots/Goals, the code I used is:
for index,row in df.iterrows():
multiplier = (np.random.randint(1,5+1))
row['Shots'] *= multiplier
row['StG']=float(row['Shots'])/float(row['Goals'])
Then I saved the .csv and it was identically at the original one, so after the for I simply used print(df) to obtain:
Shots Goals StG
0 1 2 0.5
1 3 1 0.33
2 4 4 1
If I print the values row per row during the for iteration I see they change, but its like they don't save in the df.
I think it is because I'm simply accessing to the values,not the actual dataframe.
I should add something like df.row[], but it returns DataFrame has no row property.
Thanks for the help.
____EDIT____
for index,row in df.iterrows():
multiplier = (np.random.randint(1,5+1))
row['Impresions']*=multiplier
row['Clicks']*=(np.random.randint(1,multiplier+1))
row['Ctr']= float(row['Clicks'])/float(row['Impresions'])
row['Mult']=multiplier
#print (row['Clicks'],row['Impresions'],row['Ctr'],row['Mult'])
The main condition is that the number of Clicks cant be ever higher than the number of impressions.
Then I recalculate the ratio between Clicks/Impressions on CTR.
I am not sure if multiplying the entire column is the best choice to maintain the condition that for each row Impr >= Clicks, hence I went row by row
Fom the pandas docs about iterrows(): pandas.DataFrame.iterrows
"You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect."
The good news is you don't need to iterate over rows - you can perform the operations on columns:
# Generate an array of random integers of same length as your DataFrame
multipliers = np.random.randint(1, 5+1, size=len(df))
# Multiply corresponding elements from df['Shots'] and multipliers
df['Shots'] *= multipliers
# Recalculate df['StG']
df['StG'] = df['Shots']/df['Goals']
Define a function that returns a series:
def f(x):
m = np.random.randint(1,5+1)
return pd.Series([x.Shots * m, x.Shots/x.Goals * m])
Apply the function to the data frame row-wise, it will return another data frame which can be used to replace some columns in the existing data frame, or create new columns in data frame
df[['Shots', 'StG']] = df.apply(f, axis=1)
This approach is very flexible as long as the new column values depend only on other values in the same row.