Pandas - Use values from rows with equal values in iteration - python

Apologies in case this has been answered in the past; I was not sure how to phrase the question.
I have a dataframe with 3D coordinates and a scalar value (magnetic field, in this case) for each point in space. For each point I calculated the radius as its distance from the line at (x, y) = (0, 0). The unique radius and z values were transferred into a new dataframe. Now I want to calculate the scalar value for every point (Z, R) in the volume by averaging over all points in the 3D system with equal radius.
Currently I am iterating over all unique Z and R values. It works, but it is awfully slow.
df is the original dataframe; dfn is the new one which, in the beginning, contains only the unique combinations of R and Z values.
for r in dfn.R.unique():
    for z in df.Z.unique():
        # average B over all original points with this (R, Z) combination
        dfn.loc[(dfn["R"] == r) & (dfn["Z"] == z), "B"] = df["B"][(df["R"] == r) & (df["Z"] == z)].mean()
Is there any way to speed this up, ideally a single statement that tells pandas to grab all rows from the original dataframe where Z and R match the values in each row of the new dataframe?
Thank you in advance for your help.

Try groupby!!!
It looks like you can achieve this with something like:
df[['R', 'Z', 'B']].groupby(['R', 'Z']).mean()
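The groupby result is indexed by the (R, Z) pairs; if you need those means back in dfn, a merge attaches them in one step. A minimal sketch, assuming dfn already holds the unique R/Z combinations as described in the question:

# average the field value over every (R, Z) group in the original frame
means = df[['R', 'Z', 'B']].groupby(['R', 'Z'], as_index=False).mean()

# attach the averaged B to the matching unique (R, Z) row of the new frame
dfn = dfn.merge(means, on=['R', 'Z'], how='left')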

Related

Weird correlation results

I'm trying to calculate the correlation between two multi-index dataframes (a and b) in two ways:
1) calculate the date-to-date correlation directly with a.corr(b), which returns a result X
2) take the mean values for all dates and calculate the correlation with a.mean().corr(b.mean()), which gives a result Y
I then made a scatter plot, for which I needed both dataframes to have the same index, so I calculated:
a.mean().corr(b.reindex_like(a).mean())
and I again got the value X.
This is strange to me, because I expected to get Y. I thought the corr function reindexes the dataframes to one another. If not, what is this value Y that I am getting?
Thanks in advance!
I have found the answer: when I do the reindex, I cut most of the values. One of the dataframes contains only one value per date, so after reindexing the mean is equal to that single value.
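To see why the reindex changes the result, here is a minimal, made-up example: reindex_like keeps only the labels present in the target frame and discards everything else, so the subsequent mean runs over far fewer values.

import pandas as pd

a = pd.DataFrame({'v': [1.0, 2.0, 3.0]}, index=[0, 1, 2])
b = pd.DataFrame({'v': [float(i) for i in range(10)]}, index=range(10))

trimmed = b.reindex_like(a)   # only rows 0..2 of b survive
print(trimmed['v'].mean())    # 1.0 -- mean of the three surviving values
print(b['v'].mean())          # 4.5 -- mean of all ten values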

How to arrange a loop in order to loop over columns and then do something

I'm a complete newbie to Python, and I'm currently trying to work on a problem that requires taking the average of each column when the number of columns is unknown.
I figured out how to do it when I know how many columns there are, doing each calculation separately. I'm supposed to do it by creating an empty list and looping the column averages back into it.
import numpy as np

# average of all data, not including NaN
def average(dataset):
    return np.mean(dataset[np.isfinite(dataset)])

# this is how I did it for each column separately
dataset = np.genfromtxt("some file")
print(average(dataset[:, 0]))
print(average(dataset[:, 1]))

# what I'm trying to do with a loop
def avg(dataset):
    lst = []
    for column in dataset.T:  # transposing makes the loop visit columns, not rows
        lst.append(average(column))
    return lst
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis argument selects whether you average along columns or rows: axis=0 takes the average of each column, which is what you are trying to do. The output is a vector whose length equals the number of columns, and each element is the average of the corresponding column. You do not need to know the shape of the matrix in advance.
You CAN do this with a for loop, but it's not a good idea: looping over matrices in numpy is slow, whereas vectorized operations like np.mean() are very fast. In general, prefer numpy's built-in operations to looping over everything whenever possible.
Also, if you want the number of columns in your matrix: my_matrix.shape[1] is the number of columns and my_matrix.shape[0] is the number of rows.
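Since the original average() skips non-finite entries, the NaN-aware variant np.nanmean is worth knowing: it averages each column while ignoring NaN, with no loop. A small sketch with a made-up array:

import numpy as np

data = np.array([[1.0, 2.0],
                 [3.0, np.nan],
                 [5.0, 6.0]])

print(np.mean(data, axis=0))     # [ 3. nan] -- NaN propagates
print(np.nanmean(data, axis=0))  # [ 3.  4.] -- NaN ignored, like average()
print(data.shape[1])             # 2 columns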

Comparing two dataframes in pandas for all values greater than the other

I have two dataframes with numeric values.
I want to compare them and check whether one has all values greater than the other.
I have a formula where the mean is mr, the variance is vr, and alpha is a scalar value; I want to check whether the dataframe r > (mr + alpha * vr), where mr is a dataframe of mean values, vr is a variance dataframe, and r is an individual dataframe for comparison.
if r > (mr + alpha * vr):
    # do something
For example, my r dataframe is
r = pd.DataFrame({"a": [5, 1, 8, 9, 10], "b": [4, 5, 6, 7, 8], "c": [11, 12, 12, 14, 15]})
and the other part, entirely on the right, is say
toCompare = pd.DataFrame({"a": [6, 7, 8, 9, 10], "b": [2, 3, 5, 6, 6], "c": [4, 5, 17, 8, 9]})
So r > toCompare should result in True, since the elements in "b" are all greater.
I just needed to check whether any value in the resulting DataFrame is True. I finally got this to work; it was a bit difficult to spot in the large piece of code:
any((r > (mr + alpha * vr)).any())
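For reference (this is not part of the original answer), pandas' boolean reductions can express each variant of the test directly, and they give different answers on the example frames above:

import pandas as pd

r = pd.DataFrame({"a": [5, 1, 8, 9, 10], "b": [4, 5, 6, 7, 8], "c": [11, 12, 12, 14, 15]})
toCompare = pd.DataFrame({"a": [6, 7, 8, 9, 10], "b": [2, 3, 5, 6, 6], "c": [4, 5, 17, 8, 9]})

mask = r > toCompare
print(mask.any().any())  # True  -- at least one value is greater
print(mask.all().any())  # True  -- at least one column ("b") is entirely greater
print(mask.all().all())  # False -- not every value is greater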

Center of mass for pandas dataframe in Python

I am looking to find the center of mass for an N-dimensional space in Python. I have a dataframe with K columns (some contain text and some contain numbers):
{X1...Xk}
...
{Z1..Zk}
k > 10000
I need to calculate center of mass for all numerical values in the dataframe.
What is the best way to do it?
The center of mass is simply the mean of the values along each dimension, and you only want to calculate it over the non-object (i.e. non-text) columns, so:
df.loc[:, df.dtypes != 'O'].mean()
EDIT: although the OP only mentioned "text" and "numbers", the following alternative is indeed more general (thanks MaxU):
df.select_dtypes(include=['number']).mean()
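A quick demonstration on a made-up frame that mixes text and numeric columns (the names here are hypothetical):

import pandas as pd

df = pd.DataFrame({'label': ['p1', 'p2', 'p3'],
                   'x': [0.0, 1.0, 2.0],
                   'y': [0.0, 2.0, 4.0]})

print(df.select_dtypes(include=['number']).mean())
# x    1.0
# y    2.0
# dtype: float64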

Iterating over rows of a pandas dataframe, finding point of minimum distance in another dataframe

I have two very large (100,000+ rows) pandas dataframes. Both frames hold lists of points with geographical data, and what I am trying to do is find, for each entry in the first frame, the data from the closest point in the second frame and place it with the corresponding entry. I currently have the following code:
Data1['MatchInd'] = None
for i, row in Data1.iterrows():
    # squared Euclidean distance from this row to every point in Data2
    Dist = (Data2['x'] - row['x'])**2 + (Data2['y'] - row['y'])**2
    Data1.loc[i, 'MatchInd'] = Dist.idxmin()  # .loc avoids the chained-assignment pitfall
While there is nothing wrong with the result, it is very slow because it loops over every row. I was curious if anyone had a better idea?
Best,
Matt
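One common way to eliminate the Python-level loop (not from the original thread) is a spatial index such as scipy's cKDTree, which answers every nearest-neighbour query in a single vectorized call. A sketch, assuming both frames have numeric 'x' and 'y' columns as in the code above:

from scipy.spatial import cKDTree

# build the tree once over the candidate points in Data2
tree = cKDTree(Data2[['x', 'y']].to_numpy())

# query all points in Data1 at once; k=1 returns the single nearest neighbour
dist, idx = tree.query(Data1[['x', 'y']].to_numpy(), k=1)

# map positional indices back to Data2's index labels
Data1['MatchInd'] = Data2.index[idx]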
