Extract values from XArray's DataArray to column using indices - python

So, I'm doing something that is maybe a bit unorthodox: I have a number of 9-billion-pixel raster maps based on the NLCD, and I want to get the values from these rasters for the pixels which have ever been built up, which are about 500 million:
built_up_index = pandas.DataFrame(np.column_stack(np.where(unbuilt == 0)), columns = ["row", "column"]).sort_values(["row", "column"])
The piece of code above gives me a dataframe where one column is the row index and the other is the column index of all the pixels which show construction in any of the NLCD raster maps (unbuilt is the raster of ones and zeros that encodes this).
I want to use this to then read values from these NLCD maps and others, so that each pixel is a row and each column is a variable, say, its value in the NLCD 2001, then its value in 2004, 2006 and so on (as well as other indices I have calculated). So the dataframe would look like this:
|row | column | value_2001 | value_2004 | var3 | ...
(VALUES HERE)
I have tried the following thing:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[:,0]), 'x': np.array(built_up_frame.iloc[:,1])}, drop = True).to_dataset(name="var").to_dataframe()
which works if I take a subsample as such:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[0:10000,0]), 'x': np.array(built_up_frame.iloc[0:10000,1])}, drop = True).to_dataset(name="var").to_dataframe()
but it doesn't do what I want, because the length is squared, as it seems it's trying to create a 2-d array which it then flattens, when what I want is a vector containing the values of the pixels I subsampled.
I could obviously do this in a loop, pixel by pixel, but I imagine this would be extremely slow for 500 million values and there has to be a more efficient way.
Any advice here?
EDIT: In the end I gave up on using the index, because I get the impression xarray will only make an array of the same dimensions as my original dataset (about 161000 columns and 104000 rows) with a bunch of missing values, rather than creating a column vector with the values I want. I'm using np.extract instead:
def src_to_frame(src, unbuilt, varname):
    return pd.DataFrame(np.extract(unbuilt == 0, src), columns=[varname])
where src is the raster containing the variable of interest, unbuilt is the raster of the same size where 0s are the pixels that have ever been built, and varname is the name of the variable. It does what I want and fits in the RAM I have. Maybe not the most optimal, but it works!

This looks like a good application for advanced indexing with DataArrays:
sprawl_2001.isel(
    y=built_up_frame.iloc[0:10000, 0].to_xarray(),
    x=built_up_frame.iloc[0:10000, 1].to_xarray(),
).to_dataset(name="var").to_dataframe()
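This works because indexers that share a common dimension trigger xarray's pointwise (vectorized) indexing: you get one value per (y, x) pair instead of the cross product that plain integer arrays produce. A minimal sketch of the same idea with explicitly constructed DataArray indexers (variable names follow the question; the shared "points" dimension name is an arbitrary choice):

import xarray as xr

# 1-d indexers sharing a common "points" dimension, so isel selects one
# pixel per (row, column) pair rather than building a 2-d cross product
y_idx = xr.DataArray(built_up_frame.iloc[:, 0].to_numpy(), dims="points")
x_idx = xr.DataArray(built_up_frame.iloc[:, 1].to_numpy(), dims="points")
test = sprawl_2001.isel(y=y_idx, x=x_idx).to_dataset(name="var").to_dataframe()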

Related

Generate missing values in the dataset based on Zipf distribution

Currently, I want to observe the impact of missing values on my dataset. I replace a fraction of the data points (10, 20, 90%) with missing values and observe the impact. The function below replaces a given percentage of data points with missing values.
import random
import numpy as np

def dropout(df, percent):
    # create df copy
    mat = df.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # indices to mask
    mask = random.sample(range(mat.size), prop)
    # replace with NaN
    np.put(mat, mask, [np.nan] * len(mask))
    return mat
My question is, I want to introduce missing values based on a Zipf distribution / power law / long tail. For instance, I have a dataset that contains 10 columns (5 columns of categorical data and 5 columns of numerical data). I want to replace some data points in the 5 categorical columns based on Zipf's law, so that columns on the left side have more missing values than those on the right side.
I used Python to do this task.
I saw the SciPy/NumPy manual about the Zipf distribution at this link: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.zipf.html but it still doesn't help me much.
Zipf distributions are a family of distributions supported on the positive integers (1 to infinity), whereas you want to delete values from only 5 discrete columns, so you will have to make some arbitrary decisions to do this. Here is one way (a sketch follows the steps):
1. Pick a parameter for your Zipf distribution, say a = 2 as in the example given on the SciPy documentation page.
2. Looking at the plot given on that same page, you could decide to truncate at 10, i.e. if any sampled value of more than 10 comes up, you just discard it.
3. Then you could map the remaining domain of 1 to 10 linearly onto your five categorical columns: any value of 1 or 2 corresponds to the first column, and so on.
4. Now sample single values from your Zipf distribution iteratively. For every sampled value, delete one data point in the column that value corresponds to (see 3.), until you have reached the overall desired percentage of missing values.
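A rough sketch of these four steps (only an illustration: it assumes a DataFrame df, a list cat_cols with the five categorical column names ordered left to right, a = 2, truncation at 10, and a target fraction percent well below 1; zipf_dropout is a made-up helper name):

import numpy as np
import pandas as pd

def zipf_dropout(df, cat_cols, percent, a=2.0, truncate=10, seed=None):
    rng = np.random.default_rng(seed)
    out = df.copy()
    target = int(len(out) * len(cat_cols) * percent)  # values to delete in the categorical block
    deleted = 0
    while deleted < target:
        k = rng.zipf(a)                  # step 1/4: sample from the Zipf distribution
        if k > truncate:
            continue                     # step 2: discard samples above the cutoff
        col = cat_cols[(k - 1) * len(cat_cols) // truncate]  # step 3: map 1..10 onto the 5 columns
        row = rng.integers(len(out))
        if pd.notna(out.iat[row, out.columns.get_loc(col)]):
            out.iat[row, out.columns.get_loc(col)] = np.nan  # delete one data point
            deleted += 1
    return out

# e.g. 10% missing values, concentrated in the leftmost categorical columns:
# df_10 = zipf_dropout(df, cat_cols=list(df.columns[:5]), percent=0.10)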

Partition matrix into smaller matrices based on multiple values

So I have this giant matrix (~1.5 million rows x 7 columns) and am trying to figure out an efficient way to split it up. For simplicity of what I'm trying to do, I'll work with this much smaller matrix as an example for what I'm trying to do. The 7 columns consist of (in this order): item number, an x and y coordinate, 1st label (non-numeric), data #1, data #2, and 2nd label (non-numeric). So using pandas, I've imported from an excel sheet my matrix called A that looks like this:
What I need to do is partition this based on both labels (i.e. so I have one matrix that is all the 13G + Aa together, another matrix that is 14G + Aa, and another one that is 14G + Ab together -- this would have me wind up with 3 separate 2x7 matrices). The reason for this is because I need to run a bunch of statistics on the dataset of numbers of the "Marker" column for each individual matrix (e.g. in this example, break the 6 "marker" numbers into three sets of 2 "marker" numbers, and then run statistics on each set of two numbers). Since there are going to be hundreds of these smaller matrices on the real data set I have, I was trying to figure out some way to make the smaller matrices be labeled something like M1, M2, ..., M500 (or whatever number it ends up being) so that way later, I can use some loops to apply statistics to each individual matrix all at once without having to write it 500+ times.
What I've done so far is to use pandas to import my data set into python as a matrix with the command:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\path\cancerdata.csv")
A = df.as_matrix() #Convert excel sheet to matrix
A = np.delete(A, (0),axis=0) #Delete header row
Unfortunately, I haven't come across many resources for how to do what I want, which is why I wanted to ask here to see if anyone knows how to split up a matrix into smaller matrices based on multiple labels.
Your question has many implications, so instead of giving you a straight answer I'll try to give you some pointers on how to tackle this problem.
First off, don't transform your DataFrame into a Matrix. DataFrames are well-optimised for slicing and indexing operations (a Pandas Series object is in reality a fancy Numpy array anyway), so you only lose functionality by converting it to a Matrix.
You could probably convert your label columns into a MultiIndex. This way, you'll be able to access slices of your original DataFrame using df.loc, with a syntax similar to df.loc[label1].loc[label2].
A MultiIndex may sound confusing at first, but it really isn't. Try executing this code block and see for yourself what the resulting DataFrame looks like:
df = pd.read_csv("C:\path\cancerdata.csv")
labels01 = df["Label 1"].unique()
labels02 = df["Label 2"].unique()
index = pd.MultiIndex.from_product([labels01, labels02])
df.set_index(index, inplace=True)
print(df)
Here, we collected all unique values in the columns "Label 1" and "Label 2" (we will use them to loop over the slices below), and in the df.set_index line we moved those two columns into a MultiIndex - now they act as indices for your other columns. For example, in order to access the slice of your original DataFrame whose Label 1 = 13G and Label 2 = Aa, you can simply write:
sliced_df = df.loc["13G"].loc["Aa"]
And perform whatever calculations/statistics you need with it.
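For example, assuming (as in the question) that the statistics are computed on the "Marker" column:

# summary statistics for the "Marker" values of this label combination
print(sliced_df["Marker"].describe())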
Lastly, instead of saving each sliced DataFrame into a list or dictionary and then iterating over them to perform the calculations, consider rearranging your code so that, as soon as you create a sliced DataFrame, you perform the calculations, save the results to an output file/DataFrame, and move on to the next slicing operation. Something like:
for L1 in labels01:
    for L2 in labels02:
        sliced_df = df.loc[L1].loc[L2]
        results = perform_calculations(sliced_df)
        save_results(results)
This will both improve memory consumption and performance, which may be important considering your large dataset.
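For completeness, the whole slice-and-loop pattern can also be collapsed into a single groupby pass; this is only a sketch, assuming the label columns are named "Label 1" and "Label 2" as above and that the statistics of interest are on the question's "Marker" column:

import pandas as pd

df = pd.read_csv(r"C:\path\cancerdata.csv")

# partition by both labels at once and compute per-group statistics
stats = df.groupby(["Label 1", "Label 2"])["Marker"].agg(["count", "mean", "std"])
print(stats)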

Pandas - Use values from rows with equal values in iteration

In case this has been answered in the past I want to apologize, I was not sure how to phrase the question.
I have a dataframe with 3d coordinates and rows with a scalar value (magnetic field in this case) for each point in space. I calculated the radius as the distance from the line at (x,y)=(0,0) for each point. The unique radius and z values are transferred into a new dataframe. Now I want to calculate the scalar values for every point (Z,R) in the volume by averaging over all points in the 3d system with equal radius.
Currently I am iterating over all unique Z and R values. It works but is awfully slow.
df is the original dataframe, dfn is the new one which - in the beginning - only contains the unique combinations of R and Z values.
for r in dfn.R.unique():
    for z in df.Z.unique():
        dfn.loc[(df["R"]==r)&(df["Z"]==z), "B"] = df["B"][(df["R"]==r)&(df["Z"]==z)].mean()
Is there any way to speed this up by writing a single line of code, in which pandas is given the command to grab all rows from the original dataframe, where Z and R have the values according to each row in the new dataframe?
Thank you in advance for your help.
Try groupby!!!
It looks like you can achieve this with something like:
df[['R', 'Z', 'B']].groupby(['R', 'Z']).mean()
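If the per-(R, Z) means then need to end up in the rows of dfn, one way (a sketch using the question's column names) is to merge the grouped result back in:

# mean B for every (R, Z) combination, merged onto the new dataframe
means = df.groupby(["R", "Z"], as_index=False)["B"].mean()
dfn = dfn.merge(means, on=["R", "Z"], how="left")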

How to arrange a loop in order to loop over columns and then do something

I'm a complete newbie to python, and I'm currently trying to work on a problem that allows me to take the average of each column except the number of columns is unknown.
I figured out how to do it if I know how many columns there are, doing each calculation separately. I'm supposed to do it by creating an empty list and looping the columns back into it.
import numpy as np

#average of all data not including NAN
def average(dataset):
    return np.mean(dataset[np.isfinite(dataset)])

#this is how I did it by each column separate
dataset = np.genfromtxt("some file")
print(average(dataset[:,0]))
print(average(dataset[:,1]))
#what I'm trying to do with a loop
def avg(dataset):
    for column in dataset:
        lst = []
        column = #i'm not sure how to define how many columns I have
        Avg = average(column)
        return Avg
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis indicates whether you are taking the average along columns or rows (axis = 0 means you take the average of each column, what you are trying to do). The output will be a vector whose length is the same as the number of columns (or rows) along which you took the average, and each element is the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance to do this.
You CAN do this using a for loop, but it's not a good idea -- looping over matrices in numpy is slow, whereas using vectorized operations like np.mean() is very very fast. So in general when using numpy one tries to use those types of built-in operations instead of looping over everything at least if possible.
Also, if you want the number of columns in your matrix: my_matrix.shape[1] is the number of columns and my_matrix.shape[0] is the number of rows.
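And since the question's average() deliberately skips non-finite values, note that np.mean will propagate NaNs; np.nanmean is the NaN-aware counterpart. A minimal sketch, reusing the question's placeholder filename:

import numpy as np

dataset = np.genfromtxt("some file")     # placeholder filename from the question
col_means = np.nanmean(dataset, axis=0)  # per-column averages, ignoring NaN entries
print(col_means)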

How to interpolate Excel file of unordered coordinates to a grid

I have csv files which are 1200 rows x 3 columns. The number of rows can vary from as low as 500 to as large as 5000, but the columns remain the same.
I want to create a feature vector from these files which maintains a consistent cell/vector length, and thus helps in finding the distance between these vectors.
FILE_1
A, B, C
(267.09669678867186, 6.3664069175720197, 1257325.5809999991),
(368.24070923984374, 9.0808353424072301, 49603.662999999884),
(324.21470826328124, 11.489830970764199, 244391.04699999979),
(514.33452027500005, 7.5162401199340803, 56322.424999999988),
(386.19673340976561, 9.4927110671997106, 175958.77100000033),
(240.09965330898439, 10.3463039398193, 457819.8519411764),
(242.17559998691405, 8.4401674270629901, 144891.51100000029),
(314.23066895664061, 7.4405002593994096, 58433.818999999959),
(933.3073596304688, 7.1564397811889604, 41977.960000000014),
(274.04136473476564, 4.8482465744018599, 48782.314891525479),
(584.2639294320312, 7.90128517150879, 49730.705000000096),
(202.13173096835936, 10.559995651245099, 20847.805144088608),
(324.98563963710939, 2.2546300888061501, 43767.774800000007),
(464.35059935390626, 11.573680877685501, 1701597.3915132943),
(776.28339964687495, 8.7755222320556605, 106882.2469999999),
(310.11652952968751, 10.3175926208496, 710341.19162800116),
(331.19962889492189, 10.7578010559082, 224621.80632433048),
(452.31337752387947, 7.3100395202636701, 820707.26700000139),
(430.16615111171876, 10.134071350097701, 18197.691999999963),
(498.24687010585939, 11.0102319717407, 45423.269964585743),
.....,
.....,
500th row
FILE_2
(363.02781861484374, 8.8369808197021502, 72898.479666666608),
(644.20353882968755, 8.6263589859008807, 22776.78799999999),
(259.25105469882811, 9.8575859069824201, 499615.64068339905),
(410.19474608242189, 9.8795070648193395, 316146.18800000293),
(288.12153809726561, 4.7451887130737296, 58615.577999999943),
(376.25868409335936, 10.508985519409199, 196522.12200000012),
(261.11118895351564, 8.5228433609008807, 32721.110000000026),
(319.98896605312501, 3.2100667953491202, 60587.077000000027),
(286.94926268398439, 4.7687568664550799, 47842.133999999867),
(121.00206177890625, 7.9372291564941397, 239813.20531182736),
(308.19895750820314, 6.0029039382934597, 26354.519000000011),
(677.17011839687495, 9.0299625396728498, 10391.757655172449),
(182.1304913216797, 8.0010566711425799, 145583.55700000061),
(187.06341736972655, 9.9460496902465803, 77488.229000000007),
(144.07867615878905, 3.6044106483459499, 104651.56499999999),
(288.92317015468751, 4.3750333786010698, 151872.1949999998),
(228.2089825326172, 4.4475774765014604, 658120.07628214348),
(496.18831055820311, 11.422966003418001, 2371155.6659999997),
(467.30134398281251, 11.0771179199219, 109702.48440899582),
(163.08418089687501, 5.7271881103515598, 38107.106791666629),
.....,
.....,
3400th row
You can see that there is no correspondence between the two files, i.e. if someone asked you to calculate the distance between these two vectors, it would not be possible.
The aim is to interpolate the rows of both files in such a manner that there is consistency across all such files, i.e. when I look up the first row, it should represent the same feature across all the files. Now let's look at FILE_1.
The range of values for the three columns is (considering only the first 20 rows for the time being):
A: 202.13173096835936 to 933.3073596304688
B: 2.2546300888061501 to 11.573680877685501
C: 18197.691999999963 to 1701597.3915132943
I want to put these points into a 3-d array whose grid cell size will be 0.1 x 0.1 x 0.1 (or let's say 10 x 10 x 10, or any arbitrary cell size).
But for that to work we need to normalize the data (mean normalization, etc.).
Now the data we have is 3-d, and it needs to be normalized in order to interpolate it into this 3-d array. The result need not even be 3-d; if it ends up as a vector, that will also do.
When I said I need to average the points, I meant that if more than two points happen to fall in a cell (which will happen if the cell size is big, e.g. 100 x 100 x 100), then we will take the average of their x, y, z coordinates as the value of that cell.
These interpolated vectors will have the same length and correspondence, because a given entry of one vector will represent the same point as the corresponding entry of every other vector.
NOTE: The min and max ranges for the coordinates across all files are A: 100 to 1000, B: 2 to 12, C: 10000 to 2000000.
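A minimal sketch of the normalize-bin-average procedure described above (grid_average is a hypothetical helper; the per-axis minima and maxima are taken from the NOTE, the number of bins per axis is arbitrary, and empty cells are left as NaN):

import numpy as np

def grid_average(points, bins=10, mins=(100, 2, 10000), maxs=(1000, 12, 2000000)):
    # normalize each coordinate to [0, 1] using the global ranges from the NOTE
    mins, maxs = np.asarray(mins, float), np.asarray(maxs, float)
    norm = (points - mins) / (maxs - mins)
    # cell index of every point along each axis
    idx = np.clip((norm * bins).astype(int), 0, bins - 1)
    sums = np.zeros((bins, bins, bins, 3))
    counts = np.zeros((bins, bins, bins, 1))
    np.add.at(sums, (idx[:, 0], idx[:, 1], idx[:, 2]), points)
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    with np.errstate(invalid="ignore"):
        return sums / counts          # average per occupied cell, NaN elsewhere

# points = (N, 3) array of (A, B, C) rows read from one file
# feature = grid_average(points).ravel()   # fixed-length vector, comparable across files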
