I am preparing a dataframe for machine learning. The dataset contains weather data from several weather stations in Australia over a period of 10 years. One of the measured attributes is Evaporation. It has about 50% missing values.
Now I want to find out whether the missing values are evenly distributed over all weather stations, or whether roughly half of the weather stations just never measured Evaporation.
How can I find out about the distribution of a value in combination with another attribute?
I basically want to loop over the weather stations and get a count of NaNs and normal values.
rain_df.query('Location == "Albury"').Location.count()
This gives me the number of measurement points from the weather station in Albury. Now how can I find out how many NaNs were measured in Albury compared to normal (non-NaN) measurements?
You can use .isnull() to mask a series with True for NaNs and False for everything else. Then you can use .value_counts(normalize=True) to get the proportions of NaN and non-NaN values in that series.
rain_df.query('Location == "Albury"').Evaporation.isnull().value_counts(normalize=True)
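To get these proportions for every station at once instead of looping, a groupby works (a sketch, assuming the measurements live in an Evaporation column):

# fraction of missing Evaporation values per station; the mean of a
# boolean mask is the proportion of True (i.e. NaN) entries
rain_df['Evaporation'].isnull().groupby(rain_df['Location']).mean()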
I have a dataset with quite a lot of data missing, which stores hourly data for several years. I would now like to implement a seasonal filling method, for which I need the best data I have for two consecutive years (2*8760 entries). This means the least amount of data missing (i.e. the fewest NaN values) over two consecutive years. I then need the start time and end time of this period in datetime format. My data is stored in a dataframe where the index is the hourly datetime. How can I achieve this?
EDIT:
To make it a bit clearer: I need to select all entries (values and NaN values) from a time period of two years (or of 2*8760 rows) in which the least amount of NaN values occurs.
You can remove all the NaN values from your data by using df = df.dropna()
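If, instead of dropping the NaNs, you want the two-year window that contains the fewest of them, a rolling count works. A minimal sketch, assuming df is your hourly-indexed dataframe:

# df: your dataframe with the hourly DatetimeIndex (assumed)
WINDOW = 2 * 8760                                    # two years of hourly rows

# NaNs per row, summed over each WINDOW-sized block (label = window end)
nan_per_window = df.isna().sum(axis=1).rolling(WINDOW).sum()

end_time = nan_per_window.idxmin()                   # datetime where the best window ends
end_pos = df.index.get_loc(end_time)
start_time = df.index[end_pos - WINDOW + 1]          # datetime where it starts

best_two_years = df.iloc[end_pos - WINDOW + 1 : end_pos + 1]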
I have a dataset on health indicators, with columns such as 'Country', 'Year', 'GDP', and 'Life expectancy'. The data covers the years 2000-2015.
So, there is data for many health indicators for each country for each of the years from 2000-2015.
Many of the variables have missing (NaN) data for specific years/countries.
So, for instance, how would I replace NaN values with mean values specific to the given country and year range, for all countries?
Additionally, since this is longitudinal data, it would be great to maintain the general trend over time within each country's 16 years of data. Is there a way to replace NaN data for each country, accounting for the general trend for that country/variable over time?
If you guys could explain both methods, that would be phenomenal.
link to data: https://www.kaggle.com/kumarajarshi/life-expectancy-who
Thanks,
D
You probably want to look into the pd.DataFrame.interpolate() method. It has different methods for filling NaNs in a time series or filling in missing values.
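For illustration, a sketch of both approaches, assuming the column names from your post ('Country', 'Year', 'GDP', 'Life expectancy') and a hypothetical CSV file name:

import pandas as pd

df = pd.read_csv('life_expectancy.csv')   # hypothetical file name
df = df.sort_values(['Country', 'Year'])

# Method 1: fill NaNs with the mean for that country over its 16 years
df['GDP'] = df.groupby('Country')['GDP'].transform(lambda s: s.fillna(s.mean()))

# Method 2: interpolate within each country, so the fill follows each
# country's trend over time
df['Life expectancy'] = df.groupby('Country')['Life expectancy'].transform(
    lambda s: s.interpolate(limit_direction='both'))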
Currently, I want to observe the impact of missing values on my dataset. I replace a fraction of the data points (10, 20, 90%) with missing values and observe the impact. The function below replaces a given percentage of the data points with missing values.
import random
import numpy as np

def dropout(df, percent):
    # work on a copy so the original frame is untouched
    mat = df.copy()
    # number of values to replace
    prop = int(mat.size * percent)
    # flat indices to mask
    mask = random.sample(range(mat.size), prop)
    # replace with NaN: unravel each flat index into a row/column
    # position and assign through .iat (np.put needs a plain ndarray)
    for idx in mask:
        row, col = divmod(idx, df.shape[1])
        mat.iat[row, col] = np.nan
    return mat
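For example, to knock out 10% of the values of a (hypothetical) dataframe df and check the result:

df_10 = dropout(df, 0.10)
print(df_10.isna().mean())   # per-column fraction of NaNs, ~0.10 overall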
My question is: I want to introduce missing values based on a Zipf distribution / power law / long tail. For instance, I have a dataset that consists of 10 columns (5 categorical columns and 5 numerical columns). I want to replace some data points in the 5 categorical columns based on Zipf's law, so that columns on the left side have more missing values than those on the right side.
I used Python to do this task.
I looked at the documentation for the Zipf distribution at this link: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.zipf.html but it still doesn't help me much.
Zipf distributions are a family of distributions on the positive integers (1 to infinity), whereas you want to delete values from only 5 discrete columns, so you will have to make some arbitrary decisions to do this. Here is one way:
1. Pick a parameter for your Zipf distribution, say a = 2, as in the example given on the documentation page.
2. Looking at the plot given on that same page, you could decide to truncate at 10, i.e. if any sampled value greater than 10 comes up, you just discard it.
3. Then you can map the remaining domain of 1 to 10 linearly onto your five categorical columns: values 1 and 2 correspond to the first column, 3 and 4 to the second, and so on.
4. Then you iteratively sample single values from your Zipf distribution. For every sampled value, you delete one data point in the column the value corresponds to (see 3.), until you have reached the overall desired percentage of missing values. A sketch follows below.
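A sketch of those four steps, assuming a hypothetical dataframe df whose first five columns, cat_cols, are the categorical ones:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def zipf_dropout(df, cat_cols, percent, a=2.0, truncate=10):
    # delete `percent` of the values in cat_cols, with the column chosen
    # by a truncated Zipf draw so left-hand columns lose more data;
    # assumes percent is small enough that enough non-NaN cells exist
    out = df.copy()
    n_cols = len(cat_cols)
    target = int(len(out) * n_cols * percent)
    deleted = 0
    while deleted < target:
        z = rng.zipf(a)                 # step 1: sample from Zipf(a)
        if z > truncate:
            continue                    # step 2: discard values above 10
        # step 3: map 1..truncate linearly onto the columns
        col = cat_cols[(z - 1) * n_cols // truncate]
        row = out.index[rng.integers(len(out))]
        if pd.notna(out.at[row, col]):
            out.at[row, col] = np.nan   # step 4: delete one data point
            deleted += 1
    return out

missing = zipf_dropout(df, list(df.columns[:5]), 0.10)   # hypothetical df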
The dataset records occurrences of particular insects at a location for a given year and month. This is available for about 30 years. Now, given an arbitrary location and a future year and month, I want to estimate the probability of finding those insects at that place based on the historic data.
I tried to treat it as a classification problem by labelling all available data as 1, and wanted to check the probability of a new data point having label 1. But an error was thrown because there should be at least two classes to train on.
The data looks like this (x and y are longitude and latitude):

x      y      year  month
17.01  22.87  2013  01
42.32  33.09  2015  12
Think about the problem as a map. You'll need a map for each time period you're interested in, so sum all the occurrences in each month and year for each location. Unless the locations are already binned, you'll need some binning, as otherwise the counts are pretty meaningless: round the values in x and y to a reasonable precision level, or use numpy to bin the data. Then you can create a map with the counts, or use a Markov model to predict the occurrence.
The reason you're not getting anywhere at the moment is that the chance of finding an insect at any random point is virtually 0.
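A sketch of that counting step, assuming the dataframe df has the columns x, y, year, month shown above:

# bin coordinates by rounding to whole degrees; pick precision to taste
df['x_bin'] = df['x'].round()
df['y_bin'] = df['y'].round()

# sightings per cell and calendar month, summed over all ~30 years
counts = df.groupby(['x_bin', 'y_bin', 'month']).size()

# empirical probability: fraction of observed years with at least one
# sighting in that cell/month
years_with_sighting = df.groupby(['x_bin', 'y_bin', 'month'])['year'].nunique()
prob = years_with_sighting / df['year'].nunique()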
So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The plot below is based on a dataframe whose first column is a date and whose second column is the data.
As you can see, those interspersed points of similar values that look like lines are likely instrument quirks and should be removed. I've tried rolling mean, rolling median, and removal based on standard deviation, to no avail. For an idea of density, it's daily measurements from 1984 to the present. Any ideas?
gauge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying a rolling median:
gauge = gauge.rolling(5, center=True).median()
gauge['1990':'1995'].plot(style='*')
After rolling median
You can demand that each data point has at least "N" "nearby" data points within a certain distance "D".
N can be 2 or more.
"Nearby" for element gauge[i] could mean the pair gauge[i-1] and gauge[i+1], but since some points only have neighbours on one side, you can instead ask for at least two elements whose distance in indexes (dates) is at most 2. So, at least 2 of {gauge[i-2], gauge[i-1], gauge[i+1], gauge[i+2]} should satisfy: Distance(gauge[i], gauge[ix]) < D
D - you can decide this based on how close you expect those real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.
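A sketch of that filter, with N (n_required), the neighbour window, and D (max_dist) as tunable parameters; the values below are placeholders:

import numpy as np

def drop_isolated(series, n_required=2, window=2, max_dist=5.0):
    # keep series[i] only if at least n_required of the up-to-2*window
    # surrounding points lie within max_dist of its value
    values = series.to_numpy()
    keep = np.zeros(len(values), dtype=bool)
    for i in range(len(values)):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        neighbours = np.concatenate([values[lo:i], values[i + 1:hi]])
        keep[i] = np.sum(np.abs(neighbours - values[i]) < max_dist) >= n_required
    return series[keep]

# D (max_dist) should reflect how close you expect real consecutive readings to be
clean = drop_isolated(gauge['Gauge'], n_required=2, window=2, max_dist=5.0)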