Center of mass for pandas dataframe in Python

I am looking to find the center of mass for an N-dimensional space in Python. I have a dataframe with k columns (some contain text and some contain numbers):
{X1...Xk}
...
{Z1..Zk}
k > 10000
I need to calculate the center of mass for all numerical values in the dataframe.
What is the best way to do it?

With equal weights, the center of mass is simply the mean of the values along each dimension, and you just want to calculate it on the non-object columns, so:
df.loc[:, df.dtypes != 'O'].mean()
EDIT: although the OP only mentioned "text" and "numbers", the following alternative is indeed more general (thanks MaxU):
df.select_dtypes(include=['number']).mean()
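For instance, on a small made-up frame (the column names here are purely illustrative), the mean of the selected columns gives one coordinate per numeric column:
import pandas as pd

df = pd.DataFrame({
    "label": ["a", "b", "c"],   # text column, ignored by the selection
    "x": [0.0, 1.0, 2.0],
    "y": [0.0, 0.0, 3.0],
})

center = df.select_dtypes(include=['number']).mean()
print(center)   # x    1.0
                # y    1.0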

Related

Extract values from XArray's DataArray to column using indices

So, I'm doing something that is maybe a bit unorthodox. I have a number of 9-billion-pixel raster maps based on the NLCD, and I want to get the values from these rasters for the pixels which have ever been built up, which are about 500 million:
built_up_index = pandas.DataFrame(np.column_stack(np.where(unbuilt == 0)), columns = ["row", "column"]).sort_values(["row", "column"])
That piece of code above gives me a dataframe where one column is the row index and the other is the column index of all the pixels which show construction in any of the NLCD raster maps (unbuilt is the ones and zeros raster which contains that).
I want to use this to then read values from these NLCD maps and others, so that each pixel is a row and each column is a variable, say, its value in the NLCD 2001, then its value in 2004, 2006 and so on (As well as other indices I have calculated). So the dataframe would look as such:
|row | column | value_2001 | value_2004 | var3 | ...
(VALUES HERE)
I have tried the following thing:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[:,0]), 'x': np.array(built_up_frame.iloc[:,1])}, drop = True).to_dataset(name="var").to_dataframe()
which works if I take a subsample as such:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[0:10000,0]), 'x': np.array(built_up_frame.iloc[0:10000,1])}, drop = True).to_dataset(name="var").to_dataframe()
but it doesn't do what I want: the length of the result is squared, because it seems to build a 2-D array of every (y, x) combination and then flatten it, whereas what I want is a vector containing only the values of the pixels I subsampled.
I could obviously do this in a loop, pixel by pixel, but I imagine this would be extremely slow for 500 million values and there has to be a more efficient way.
Any advice here?
EDIT: In the end I gave up on using the index, because I get the impression xarray will only build an array with the same dimensions as my original dataset (about 161000 columns and 104000 rows), full of missing values, rather than a column vector with the values I want. I'm using np.extract instead:
def src_to_frame(src, unbuilt, varname):
    # keep only the pixels that have ever been built up and return them as a single column
    return pd.DataFrame(np.extract(unbuilt == 0, src), columns=[varname])
where src is the raster containing the variable of interest, unbuilt is the raster of the same size where 0s are the pixels that have ever been built, and varname is the name of the variable. It does what I want and fits in the RAM I have. Maybe not the most optimal, but it works!
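A usage sketch under that setup (the raster array names here are assumptions, not from my actual data):
# one column per raster year; array names are illustrative
value_2001 = src_to_frame(nlcd_2001, unbuilt, "value_2001")
value_2004 = src_to_frame(nlcd_2004, unbuilt, "value_2004")

# combine with the row/column index computed earlier
built_up_values = pd.concat(
    [built_up_index.reset_index(drop=True), value_2001, value_2004], axis=1
)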
This looks like a good application for advanced indexing with DataArrays:
sprawl_2001.isel(
    y=built_up_frame.iloc[0:10000, 0].to_xarray(),
    x=built_up_frame.iloc[0:10000, 1].to_xarray(),
).to_dataset(name="var").to_dataframe()
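The key point is that both indexers share a dimension, so isel selects one value per (row, column) pair instead of the full outer product. As a rough sketch only, the same pointwise selection can be built explicitly over the whole frame:
import xarray as xr

rows = xr.DataArray(built_up_frame.iloc[:, 0].to_numpy(), dims="points")
cols = xr.DataArray(built_up_frame.iloc[:, 1].to_numpy(), dims="points")

# pointwise (vectorized) selection: one value per (row, col) pair
values = sprawl_2001.isel(y=rows, x=cols)
df_out = values.to_dataset(name="var").to_dataframe()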

Generate missing values on the dataset based on ZIPF distribution

Currently, I want to observe the impact of missing values on my dataset. I replace a fraction of the data points (10, 20, 90 %) with missing values and observe the impact. The function below replaces a certain percentage of data points with missing values.
import random

import numpy as np
import pandas as pd

def dropout(df, percent):
    # create df copy
    mat = df.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat indices to mask
    mask = random.sample(range(mat.size), prop)
    # replace with NaN (np.put needs an ndarray, not a DataFrame)
    values = mat.to_numpy(dtype=object)
    np.put(values, mask, [np.nan] * len(mask))
    return pd.DataFrame(values, index=mat.index, columns=mat.columns).infer_objects()
My question is: I want to introduce the missing values based on a Zipf distribution / power law / long tail. For instance, I have a dataset that contains 10 columns (5 categorical and 5 numerical). I want to replace some data points in the 5 categorical columns according to Zipf's law, so that columns on the left side have more missing values than columns on the right side.
I am using Python for this task.
I looked at the documentation for the Zipf distribution at this link: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.zipf.html but it still doesn't help me much.
Zipf distributions are a family of distributions on the positive integers (1 to infinity), whereas you want to delete values from only 5 discrete columns, so you will have to make some arbitrary decisions to do this. Here is one way:
1. Pick a parameter for your Zipf distribution, say a = 2 as in the example given on the documentation page.
2. Looking at the plot on that same page, you could decide to truncate at 10, i.e. if any sampled value greater than 10 comes up, you simply discard it.
3. Map the remaining domain of 1 to 10 linearly onto your five categorical columns: values 1 and 2 correspond to the first column, 3 and 4 to the second, and so on.
4. Iteratively sample single values from your Zipf distribution using numpy.random.zipf. For every sampled value, delete one data point in the column the value corresponds to (see step 3), until you have reached the overall desired percentage of missing values. A rough code sketch of this is given below.
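A minimal sketch of those steps, assuming the categorical columns are passed left to right; zipf_dropout, cutoff and the seed handling are illustrative choices rather than a prescribed implementation:
import numpy as np
import pandas as pd

def zipf_dropout(df, cat_cols, percent, a=2.0, cutoff=10, seed=None):
    # delete values in cat_cols until `percent` of them are missing,
    # with the leftmost columns losing the most (Zipf-shaped)
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_target = int(len(out) * len(cat_cols) * percent)
    n_deleted = 0
    while n_deleted < n_target:
        z = rng.zipf(a)                                     # step 1: sample a Zipf value
        if z > cutoff:                                      # step 2: truncate the tail
            continue
        col = cat_cols[(z - 1) * len(cat_cols) // cutoff]   # step 3: map 1..cutoff to a column
        row = out.index[rng.integers(len(out))]
        if pd.notna(out.at[row, col]):                      # step 4: delete one value there
            out.at[row, col] = np.nan
            n_deleted += 1
    return out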

Pandas - Use values from rows with equal values in iteration

In case this has been answered in the past I want to apologize, I was not sure how to phrase the question.
I have a dataframe with 3D coordinates and a scalar value (magnetic field in this case) for each point in space. For each point I calculated the radius as its distance from the line at (x, y) = (0, 0). The unique radius and z values are transferred into a new dataframe. Now I want to calculate the scalar value for every point (Z, R) in the volume by averaging over all points in the 3D system with equal radius.
Currently I am iterating over all unique Z and R values. It works but is awfully slow.
df is the original dataframe, dfn is the new one which - in the beginning - only contains the unique combinations of R and Z values.
for r in dfn.R.unique():
    for z in df.Z.unique():
        dfn.loc[(df["R"] == r) & (df["Z"] == z), "B"] = df["B"][(df["R"] == r) & (df["Z"] == z)].mean()
Is there any way to speed this up with a single line of code, in which pandas is told to grab all rows from the original dataframe where Z and R match the values in each row of the new dataframe?
Thank you in advance for your help.
Try groupby!!!
It looks like you can achieve this with something like:
df[['R', 'Z', 'B']].groupby(['R', 'Z']).mean()
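If you then want those averages attached back to dfn rather than kept as a separate grouped frame, one way (assuming dfn does not already have a B column) is to merge on the shared keys:
means = df.groupby(['R', 'Z'], as_index=False)['B'].mean()
dfn = dfn.merge(means, on=['R', 'Z'], how='left')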

Removing points which deviate too much from adjacent point in Pandas

So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe with the first column as a date and the second column as the data.
As you can see, those interspersed points of similar value that look like lines are likely instrument quirks and should be removed. I've tried using rolling_mean, a rolling median, and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying a rolling mean:
gauge = gauge.rolling(5, center=True).mean()  # gauge.diff()
gauge['1990':'1995'].plot(style='*')
After rolling median
You can demand that each data point has at least N "nearby" data points within a certain distance D.
N can be 2 or more.
"Nearby" for element gauge[i] could mean the pair gauge[i-1] and gauge[i+1], but since some points only have neighbors on one side, you can instead ask for at least two elements whose index (date) distance is at most 2. So, let's say at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy Distance(gauge[i], gauge[ix]) < D.
D you can decide based on how close you expect the real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset. A rough sketch of such a filter follows.
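A minimal sketch of that neighbor-count filter, assuming a plain positional window; filter_outliers, window, min_neighbors and max_dist are illustrative names and thresholds you would need to tune:
import numpy as np

def filter_outliers(series, window=2, min_neighbors=2, max_dist=5.0):
    # keep a point only if at least `min_neighbors` of the points within
    # `window` positions on either side lie within `max_dist` of its value
    values = series.to_numpy()
    keep = np.zeros(len(values), dtype=bool)
    for i in range(len(values)):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbors = np.concatenate([values[lo:i], values[i + 1:hi]])
        keep[i] = np.sum(np.abs(neighbors - values[i]) < max_dist) >= min_neighbors
    return series[keep]

clean = filter_outliers(gauge['Gauge'], max_dist=10.0)  # the threshold here is a placeholder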

Comparing two dataframes in pandas for all values greater than the other

I have two data frames with numeric values.
I want to compare both of them and check if one has all values greater than the other.
I have a formula where, say, the mean is mr, the variance is vr, and alpha is a scalar value. I want to check whether the dataframe r > (mr + alpha * vr), where mr is a dataframe of mean values, vr is a dataframe of variances, and r is an individual dataframe being compared.
if r > (mr + alpha * vr):
    do something
For example, my r DataFrame is r = pd.DataFrame({"a": [5, 1, 8, 9, 10], "b": [4, 5, 6, 7, 8], "c": [11, 12, 12, 14, 15]}) and the whole expression on the right is, say, toCompare = pd.DataFrame({"a": [6, 7, 8, 9, 10], "b": [2, 3, 5, 6, 6], "c": [4, 5, 17, 8, 9]}).
So r > toCompare should result in True, since every element in "b" is greater.
I needed to just check if all values are True in the DataFrame. I finally got this to work; it was a bit difficult to figure out in the large piece of code.
any((r > (mr + alpha * vr)).any())
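For reference, a short sketch with the example frames from the question; which reduction you chain depends on whether "all values greater" should hold for one column or for the whole frame:
import pandas as pd

r = pd.DataFrame({"a": [5, 1, 8, 9, 10], "b": [4, 5, 6, 7, 8], "c": [11, 12, 12, 14, 15]})
toCompare = pd.DataFrame({"a": [6, 7, 8, 9, 10], "b": [2, 3, 5, 6, 6], "c": [4, 5, 17, 8, 9]})

per_column = (r > toCompare).all()   # a: False, b: True, c: False
print(per_column.any())              # True: every value in "b" is greater
print((r > toCompare).all().all())   # False: not every column is uniformly greater
print((r > toCompare).any().any())   # True: at least one value somewhere is greater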
