Generate missing values in a dataset based on a Zipf distribution - Python

Currently, I want to observe the impact of missing values on my dataset. I replace a certain percentage of data points (10%, 20%, 90%) with missing values and observe the impact. The function below replaces a given percentage of the data points with missing values.
import random
import numpy as np
import pandas as pd

def dropout(df, percent):
    # create df copy so the original frame is untouched
    mat = df.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat indices to mask, sampled uniformly without replacement
    mask = random.sample(range(mat.size), prop)
    # replace with NaN; np.put needs an ndarray, so work on the values
    # (object dtype keeps categorical columns intact)
    arr = mat.to_numpy(dtype=object)
    np.put(arr, mask, [np.nan] * len(mask))
    return pd.DataFrame(arr, index=mat.index, columns=mat.columns)
My question is: I want to introduce missing values based on a Zipf distribution / power law / long tail. For instance, I have a dataset that contains 10 columns (5 categorical columns and 5 numerical columns). I want to replace some data points in the 5 categorical columns according to Zipf's law, so that columns on the left side have more missing values than columns on the right side.
I am using Python for this task.
I have looked at the documentation for numpy.random.zipf (https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.zipf.html), but it hasn't helped me much.

Zipf distributions are a family of distributions on the positive integers (1 to infinity), whereas you want to delete values from only 5 discrete columns, so you will have to make some arbitrary decisions to do this. Here is one way:
1. Pick a parameter for your Zipf distribution, say a = 2 as in the example given on the documentation page linked above.
2. Looking at the plot on that same page, you could decide to truncate at 10, i.e. if any sampled value greater than 10 comes up, you simply discard it.
3. Map the remaining domain of 1 to 10 linearly onto your five categorical columns: the values 1 and 2 correspond to the first column, 3 and 4 to the second, and so on.
4. Iteratively sample single values from your Zipf distribution with the NumPy function. For every sampled value, delete one data point in the column the value corresponds to (see 3.), until you have reached the overall desired percentage of missing values.
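A minimal sketch of this procedure (the function name zipf_dropout, the cat_cols argument, and the seed handling are illustrative choices, not part of the original question; it assumes the categorical columns hold object/string values so NaN can be assigned):

import numpy as np
import pandas as pd

def zipf_dropout(df, cat_cols, percent, a=2.0, truncate=10, seed=None):
    # delete `percent` of the cells in cat_cols, with left-most columns
    # losing the most values (Zipf-shaped across the columns)
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_target = int(len(df) * len(cat_cols) * percent)
    n_dropped = 0
    while n_dropped < n_target:
        k = rng.zipf(a)
        if k > truncate:  # discard samples beyond the truncation point
            continue
        # map 1..truncate evenly onto the columns (smallest k -> left-most column)
        col_pos = out.columns.get_loc(cat_cols[(k - 1) * len(cat_cols) // truncate])
        row = rng.integers(len(out))
        if pd.notna(out.iloc[row, col_pos]):  # only count newly deleted cells
            out.iloc[row, col_pos] = np.nan
            n_dropped += 1
    return out

With a = 2 and truncation at 10, roughly three quarters of the accepted samples land on 1 or 2, so the left-most categorical column ends up with far more missing values than the right-most one. Note that for very high percentages the left-most columns can run out of non-missing cells before the target is reached, so this sketch is only sensible for moderate levels of missingness.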

Related

Extract values from XArray's DataArray to column using indices

So, I'm doing something that is maybe a bit unorthodox. I have a number of 9-billion-pixel raster maps based on the NLCD, and I want to get the values from these rasters for the pixels which have ever been built up, which number about 500 million:
built_up_index = pandas.DataFrame(np.column_stack(np.where(unbuilt == 0)), columns = ["row", "column"]).sort_values(["row", "column"])
That piece of code gives me a dataframe in which one column holds the row indices and the other the column indices of all the pixels that show construction in any of the NLCD raster maps (unbuilt is the ones-and-zeros raster that encodes this).
I want to use this to read values from these NLCD maps and others, so that each pixel is a row and each column is a variable, say its value in the NLCD 2001, then its value in 2004, 2006 and so on (as well as other indices I have calculated). So the dataframe would look like this:
|row | column | value_2001 | value_2004 | var3 | ...
(VALUES HERE)
I have tried the following thing:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[:,0]), 'x': np.array(built_up_frame.iloc[:,1])}, drop = True).to_dataset(name="var").to_dataframe()
which works if I take a subsample as such:
test = sprawl_2001.isel({'y': np.array(built_up_frame.iloc[0:10000,0]), 'x': np.array(built_up_frame.iloc[0:10000,1])}, drop = True).to_dataset(name="var").to_dataframe()
but it doesn't do what I want: the result's length is squared, because it seems to build a 2-D array (every y against every x) which it then flattens, when what I want is a vector containing the values of just the pixels I listed.
I could obviously do this in a loop, pixel by pixel, but I imagine this would be extremely slow for 500 million values and there has to be a more efficient way.
Any advice here?
EDIT: In the end I gave up on using the index, because I got the impression xarray would only build an array of the same dimensions as my original dataset (about 161,000 columns and 104,000 rows) with a bunch of missing values, rather than a column vector with the values I want. I'm using np.extract instead:
import numpy as np
import pandas as pd

def src_to_frame(src, unbuilt, varname):
    # keep only the pixels flagged as ever built (unbuilt == 0), as a single column
    return pd.DataFrame(np.extract(unbuilt == 0, src), columns=[varname])
where src is the raster containing the variable of interest, unbuilt is the raster of the same size where 0s are the pixels that have ever been built, and varname is the name of the variable. It does what I want and fits in the RAM I have. Maybe not the most optimal, but it works!
This looks like a good application for advanced (pointwise) indexing with DataArrays:
sprawl_2001.isel(
    y=built_up_frame.iloc[0:10000, 0].to_xarray(),
    x=built_up_frame.iloc[0:10000, 1].to_xarray(),
).to_dataset(name="var").to_dataframe()
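For reference, here is the same idea with the shared dimension written out explicitly (sprawl_2001 and built_up_frame are the variables from the question; the "points" dimension name is an arbitrary choice):

import xarray as xr

# build 1-D indexers that share a single "points" dimension
y_idx = xr.DataArray(built_up_frame.iloc[:, 0].to_numpy(), dims='points')
x_idx = xr.DataArray(built_up_frame.iloc[:, 1].to_numpy(), dims='points')

# because the indexers share a dimension, isel selects pixel-by-pixel and returns
# a 1-D result instead of the 2-D outer product the original attempt produced
values = sprawl_2001.isel(y=y_idx, x=x_idx, drop=True).to_dataset(name='var').to_dataframe()

The .to_xarray() calls in the snippet above achieve the same thing, since both Series share the same pandas index, which becomes a common dimension.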

Multiply by unique number based on which pandas interval a number falls within

I am trying to take a number and multiply it by a unique factor determined by which interval it falls within.
I did a groupby on my pandas dataframe according to which bin each value of A fell into:
bins = pd.cut(df['A'], 50)
grouped = df['B'].groupby(bins)
interval_averages = grouped.mean()
A
(0.00548, 0.0209] 0.010970
(0.0209, 0.0357] 0.019546
(0.0357, 0.0504] 0.036205
(0.0504, 0.0651] 0.053656
(0.0651, 0.0798] 0.068580
(0.0798, 0.0946] 0.086754
(0.0946, 0.109] 0.094038
(0.109, 0.124] 0.114710
(0.124, 0.139] 0.136236
(0.139, 0.153] 0.142115
(0.153, 0.168] 0.161752
(0.168, 0.183] 0.185066
(0.183, 0.198] 0.205451
I need to be able to check which interval a number falls into, and then multiply it by the average value of the B column for that interval range.
From the docs I know I can use the in keyword to check whether a number is in an interval, but I cannot find how to access the value for a given interval. In addition, I don't want to loop through the Series checking whether the number is in each interval; that seems quite slow.
Does anybody know how to do this efficiently?
Thanks a lot.
You can store the numbers being tested in an array and use pd.cut() with the same bins to sort the values into their respective intervals. This returns an array of the bins each number falls into, which tells you the row of the means Series you need, so you can access the value via iloc.
Hopefully this helps a bit.
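A minimal sketch of that approach follows; df and new_values are hypothetical stand-ins for the question's data, and it assumes every bin actually contains data so the positions of interval_averages line up with the bin categories:

import numpy as np
import pandas as pd

# stand-in for the question's dataframe
df = pd.DataFrame({'A': np.random.rand(1000), 'B': np.random.rand(1000)})
bins = pd.cut(df['A'], 50)
interval_averages = df['B'].groupby(bins).mean()

# numbers we want to scale by the mean of B in their interval
new_values = np.array([0.1, 0.3, 0.7])

# cut against the same IntervalIndex; .codes gives each value's positional bin
codes = pd.cut(new_values, interval_averages.index.categories).codes

# positional lookup of each interval's mean, then elementwise multiplication;
# values outside the original bin range get code -1, so check for that in real use
result = new_values * interval_averages.iloc[codes].to_numpy()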

Find specific combination of values in pandas dataframe

I am preparing a dataframe for machine learning. The dataset contains weather data from several weather stations in Australia over a period of 10 years. One of the measured attributes is Evaporation, which has about 50% missing values.
Now I want to find out whether the missing values are evenly distributed over all weather stations or whether roughly half of the weather stations simply never measured Evaporation.
How can I find out about the distribution of missing values in one attribute in combination with another attribute?
I basically want to loop over the weather stations and get a count of NaNs and normal values.
rain_df.query('Location == "Albury"').Location.count()
This gives me the number of measurement points from the weather station in Albury. Now how can I find out how many NaNs were recorded in Albury compared to normal (non-NaN) measurements?
You can use .isnull() to mask a series with True for NaNs and False for everything else, then .value_counts(normalize=True) to get the proportions of NaN and non-NaN values in that series. Apply it to the Evaporation column (Location itself is never null after the query):
rain_df.query('Location == "Albury"').Evaporation.isnull().value_counts(normalize=True)
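If you want the same proportion for every station at once rather than querying one location at a time, one way (a small sketch using the question's rain_df) is to group the NaN mask by station:

# fraction of missing Evaporation values per weather station
rain_df['Evaporation'].isnull().groupby(rain_df['Location']).mean()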

Removing points which deviate too much from adjacent point in Pandas

So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe whose first column is a date and whose second column is the data.
As you can see, those points of similar values, interspersed so that they look like lines, are likely instrument quirks and should be removed. I've tried a rolling mean, a rolling median, and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
import pandas as pd

gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying a rolling median:
gauge = gauge.rolling(5, center=True).median()  # also tried a rolling mean and gauge.diff()
gauge['1990':'1995'].plot(style='*')
After rolling median
You can demand that each data point has at least "N" "nearby" data points within a certain distance "D".
N can be 2 or more.
Nearby for element gauge[i] could mean the pair gauge[i-1] and gauge[i+1], but since some points only have neighbours on one side, you can instead ask for at least two elements whose index (date) is at most 2 away. So, say, at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy Distance(gauge[i], gauge[ix]) < D.
D - you can choose this based on how close you expect the real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
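A rough sketch of that rule, reusing the gauge series from the question (the function name and the parameter values are illustrative; with roughly forty years of daily data the plain loop is only a few tens of thousands of iterations):

import numpy as np

def keep_points_with_neighbors(series, max_dist, n_required=2, window=2):
    # keep a point only if at least n_required of the up-to-`window` points
    # on each side lie within max_dist of it
    values = series.to_numpy()
    keep = np.zeros(len(values), dtype=bool)
    for i in range(len(values)):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbors = np.concatenate([values[lo:i], values[i + 1:hi]])
        keep[i] = np.sum(np.abs(neighbors - values[i]) < max_dist) >= n_required
    return series[keep]

# e.g. drop points that have fewer than 2 of their 4 nearest-in-time
# neighbours within D units of their own value
# cleaned = keep_points_with_neighbors(gauge['Gauge'], max_dist=D)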

Understanding the percentiles calculation in describe() in Python

I am trying to understand the following:
1) How are the percentiles calculated?
2) Why did Python not return the values in a sorted order (which was my expectation) as output?
3) My requirement is to know the actual value below which x% of the population lies. How do I do that?
Thanks
Python 2:
import pandas as pd

new = pd.DataFrame({'a': range(10), 'b': [60510, 60053, 54968, 62269, 91107, 29812, 45503, 6460, 62521, 37128]})
print new.describe(percentiles=[0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
1) How are the percentiles calculated?
The 90th percentile/quantile is the value below which 90% of the data falls and above which 10% of the data lies. By default, describe() uses linear interpolation. This is why, in your a column, the quantiles increment by 0.9 instead of hitting the original data values [0, 1, 2, ...]. If you want the nearest actual values instead of interpolated ones, you can use the quantile method instead of describe and change its interpolation parameter.
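For example, a small sketch using the new frame from the question:

# nearest actual observations rather than interpolated values
print(new['a'].quantile([0.1, 0.5, 0.9], interpolation='nearest'))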
2) Why did Python not return the values in a sorted order (which was my expectation) as output?
Your question is unclear here. It does return values in a sorted order, indexed by the statistics that describe() outputs: count, mean, std, min, the quantiles from low to high, then max. If you only want the quantiles and not the other statistics, you can use the quantile method instead.
3) My requirement is to know the actual value below which x% of the population lies. How do I do that?
Nothing is wrong with the output. Those quantiles are accurate, although they aren't very meaningful when your data only has 10 observations.
Edit: It wasn't originally clear to me that you were attempting to do stats on a frequency table. I don't know of a direct solution in pandas that doesn't involve moving your data over to a NumPy array. You could use numpy.repeat as below to get a raw list of observations to put back into pandas for descriptive stats.
import numpy as np

vals = np.array(new.a)    # the observed values
freqs = np.array(new.b)   # how many times each value occurs
observations = np.repeat(vals, freqs)  # expand the frequency table into raw observations
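From there, one way to read off the value below which x% of the population lies (continuing with the names above) is to move the expanded observations back into pandas:

import pandas as pd

expanded = pd.Series(observations)
print(expanded.quantile([0.1, 0.5, 0.9]))  # values below which 10%, 50%, 90% of the population lie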
