Finding minimum and maximum value for each row, excluding NaN values - python

I have code that plots wind speed values during a day at 50 different altitudes. I'm trying to get the minimum and maximum values at each altitude so I can see the minimum and maximum winds experienced during the day.
I've tried np.min(wind_speed, axis=0), but this gives me nan. I have a line that reads bad wind speed values as nan. How can I avoid the nan values and get the actual minimum and maximum values occurring during the day?

To ignore the NaN values, use np.nanmin and the analogous np.nanmax:
np.nanmin(wind_speed, axis=0)
np.nanmax(wind_speed, axis=0)
This will ignore the NaN values as desired.
Example:
In [93]:
wind_speed = np.array([234,np.NaN,343, np.NaN])
wind_speed
Out[93]:
array([ 234., nan, 343., nan])
In [94]:
print(np.nanmin(wind_speed, axis=0), np.nanmax(wind_speed, axis=0))
234.0 343.0
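For the original 2-D case (wind speeds over a day at 50 altitudes), the same calls reduce over the time axis and return one value per altitude. A minimal sketch, assuming the rows are time samples and the columns are altitudes (your array layout may differ):
import numpy as np

# rows = time samples, columns = altitudes; axis=0 reduces over time
wind_speed = np.array([[5.0,    np.nan, 7.2],
                       [4.1,    6.3,    np.nan],
                       [np.nan, 5.9,    8.0]])

daily_min = np.nanmin(wind_speed, axis=0)   # array([4.1, 5.9, 7.2])
daily_max = np.nanmax(wind_speed, axis=0)   # array([5. , 6.3, 8. ])
Note that a column consisting entirely of NaN still yields NaN (with a RuntimeWarning).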

Related

How is the mean of a dataframe column that contains missing values calculated?

Let's take an example.
Suppose we have a dataframe with a column named "f1":
f1 : {2, 4, NaN, 1, NaN, 15}
When we apply mean imputation to it, we write code like this:
dataframe['f1'].fillna(dataframe['f1'].mean())
My doubt is about how dataframe['f1'].mean() is computed. I know it excludes the NaN values from the summation (the numerator) because they cannot be added, but what I want to know is whether they are included or excluded in the denominator when dividing by the total number of values.
Is the mean computed like this
mean(f1) = (2+4+1+15)/6 (include NaN in the total number of values)
or this way
mean(f1) = (2+4+1+15)/4 (exclude NaN from the total number of values)?
Also, please explain why.
Thanks in advance.
pd.Series.mean calculates the mean over the non-NaN values only, so for the data above the mean is (2+4+1+15)/4 = 5.5, where 4 is the number of non-NaN values. This is the default behaviour. If you want the mean computed with all rows in the denominator, you can call fillna(0) before mean():
Calling mean() directly:
df['f1'].fillna(df['f1'].mean())
0 2.0
1 4.0
2 5.5 <------
3 1.0
4 5.5 <------
5 15.0
Name: f1, dtype: float64
Calling mean() after fillna(0):
df['f1'].fillna(df['f1'].fillna(0).mean())
0 2.000000
1 4.000000
2 3.666667 <------
3 1.000000
4 3.666667 <------
5 15.000000
Name: f1, dtype: float64
According to the official documentation of pandas.DataFrame.mean, the skipna parameter excludes NA/null values. If they were excluded from the numerator but not from the denominator, this would be explicitly mentioned in the documentation. You can prove to yourself that they are excluded from the denominator by performing a simple experiment with a dummy dataframe such as the one in the question.
The reason NA/null values should be excluded from the denominator is statistical correctness. The mean is the sum of the numbers divided by how many of them there are. If a value could not be added to the sum, it is pointless to add an extra count for it in the denominator. Counting it in the denominator amounts to treating the NA/null value as 0; however, the value is not 0, it is unknown, unobserved, hidden, etc.
If you know something about the nature of the distribution in practice, you could interpolate or fill the NA/null values accordingly and then take the mean of all the values. For instance, if you realize that the feature in question behaves linearly, you could interpolate the missing values with the "linear" approach.
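A quick check of the denominator, using the f1 data from the question (a small sketch; the dataframe name df is assumed):
import numpy as np
import pandas as pd

df = pd.DataFrame({'f1': [2, 4, np.nan, 1, np.nan, 15]})

df['f1'].mean()                     # 5.5 -> 22 / 4, NaN excluded from the count
df['f1'].sum() / df['f1'].count()   # 5.5 -> count() also ignores NaN
df['f1'].sum() / len(df['f1'])      # 3.666... -> what counting NaN in the denominator would give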

Data cleaning of (lat,long) coordinates

I'm new to Python and I want to understand how I can remove values from my dataset that are 0.00000.
In context, I am working on the dataset https://www.kaggle.com/ksuchris2000/oklahoma-earthquakes-and-saltwater-injection-wells
The file InjectionWells.csv has some values in its coordinates (LAT and LONG) which I need to remove, but I don't know exactly how. This is so I can make a scatter plot with X = longitude and Y = latitude.
I tried the following but it didn't work. Can you please guide me?
You need to discover the outlier values in LAT and LONG.
Your plot is one way, but here's an automated way.
First, use dat.info() to see which columns are numeric and what the dtypes are. You are interested in LAT and LONG.
Use dat[['LAT','LONG']].describe() on your two columns of interest to get descriptive statistics and find out their outlier values.
.describe() takes a percentiles argument, which is a list; it defaults to
[.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
...but you want to exclude rare/outlier values, so try also including (say) the 1st/99th and 5th/95th percentiles:
>>> pd.options.display.float_format = '{:.2f}'.format  # suppress unwanted dp's
>>> dat[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
LAT LONG
count 11125.00 11125.00
mean 35.21 -96.85
std 2.69 7.58
min 0.00 -203.63
1% 33.97 -101.80 # <---- 1st percentile
5% 34.20 -99.76
10% 34.29 -98.25
25% 34.44 -97.63
50% 35.15 -97.37
90% 36.78 -95.95
95% 36.85 -95.74
99% 36.96 -95.48 # <---- 99th percentile
max 73.99 97.70
So the 1st-99th percentile ranges of your LAT and LONG values are:
33.97 <= LAT <= 36.96
-101.80 <= LONG <= -95.48
So now you can keep only the rows inside these ranges, either with a one-line apply(..., axis=1) or with vectorized between():
dat2 = dat[ dat.apply(lambda row: (33.97<=row['LAT']<=36.96) and (-101.80<=row['LONG']<=-95.48), axis=1) ]
# OR, equivalently:
dat2 = dat[dat['LAT'].between(33.97, 36.96) & dat['LONG'].between(-101.80, -95.48)]
API# Operator Operator ID WellType ... ZONE Unnamed: 18 Unnamed: 19 Unnamed: 20
0 3500300026.00 PHOENIX PETROCORP INC 19499.00 2R ... CHEROKEE NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
11121 3515323507.00 SANDRIDGE EXPLORATION & PRODUCTION LLC 22281.00 2D ... MUSSELLEM, OKLAHOMA NaN NaN NaN
[10760 rows x 21 columns]
Note this has gone from 11125 down to 10760 rows, so we dropped 365 rows.
Finally, it's always a good idea to check that the extreme values of your filtered LAT and LONG are in the range you expected:
>>> dat2[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
LAT LONG
count 10760.00 10760.00
mean 35.33 -97.25
std 0.91 1.11
min 33.97 -101.76
1% 34.08 -101.62
5% 34.21 -99.19
10% 34.30 -98.20
25% 34.44 -97.62
50% 35.13 -97.36
90% 36.77 -95.99
95% 36.83 -95.80
99% 36.93 -95.56
max 36.96 -95.49
PS: there's nothing magical about taking the 1st/99th percentiles. You can play with the describe(..., percentiles) argument yourself; you could use 0.005, 0.002, 0.001, etc. You get to decide what constitutes an outlier.
You can create a Boolean series by comparing a column of a dataframe to a single value. Then you can use that series to index the dataframe, so that only those rows that meet the condition are selected:
data = df[['LONG', 'LAT']]
data = data[data['LONG'] < -75]
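If all you need is to drop the literal 0.00000 coordinates (rather than every statistical outlier) before plotting, a minimal sketch along the same lines, assuming the LAT/LONG column names from the CSV:
import pandas as pd
import matplotlib.pyplot as plt

dat = pd.read_csv('InjectionWells.csv')

# Keep rows where both coordinates are present and non-zero,
# then scatter-plot longitude (x) against latitude (y).
dat2 = dat.dropna(subset=['LAT', 'LONG'])
dat2 = dat2[(dat2['LAT'] != 0) & (dat2['LONG'] != 0)]
dat2.plot.scatter(x='LONG', y='LAT')
plt.show()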

Why NaNs were printed when trying to find Min & Max in a column?

Load the dataset:
dataImf = pd.read_csv('/home/anubhav/datasets/lifesat/gdp_per_capita.csv', thousands=',', delimiter='\t', encoding='latin1',na_values='n/a')
Collect the unique entries in each column:
dum11,dum22,dum33 = dataImf.Country.unique() , dataImf['GDP per capita'].unique() , dataImf['Estimates Start After'].unique()
Get the minimum, maximum, and number of unique entries, and print them if required:
print(dum22.min(),"-->",dum22.max(),len(dum22),"\n",np.sort(dum22),"\n")
#nan --> nan 188
print(dataImf['GDP per capita'].min(),"-->",dataImf['GDP per capita'].max(),len(dum22),"\n",np.sort(dum22),"\n")
#220.86 --> 101994.093 188
print(dum33.min(),"-->",dum33.max(),len(dum33),"\n",np.sort(dum33),"\n")
#nan --> nan 17
print(dataImf['Estimates Start After'].min(),"-->",dataImf['Estimates Start After'].max(),len(dum33),"\n",np.sort(dum33),"\n")
#0.0 --> 2015.0 17
Question: if I take the unique values and then try to get the min and max, the output is NaN, but if I don't apply the unique() method and use df['col_name'].min() or max(), I get the correct values. Why?
(I took the distinct values of a column to reduce the search for the min/max by avoiding redundant comparisons.)
Please suggest why the output is NaN after applying the unique() method.
Series.unique returns a numpy array:
df = pd.DataFrame({'A': [1, 2, 3, np.nan]})
df
Out:
A
0 1.0
1 2.0
2 3.0
3 NaN
df['A'].unique()
Out: array([ 1., 2., 3., nan])
Now the method you call on df['A'].unique() will be a numpy method. ndarray.min() returns nan if there are NaNs in the array; pd.Series.min(), however, returns the minimum ignoring the NaNs.
If you want to work with the array, you need to use np.nanmin:
df['A'].unique().min()
Out: nan
np.nanmin(df['A'].unique())
Out: 1.0
or convert the result to a Series:
pd.Series(df['A'].unique()).min()
Out: 1.0
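Applied to the arrays in the question, either approach gives the proper min/max. A sketch, reusing dum22 and dum33 from the code above:
# NumPy's NaN-aware reductions on the unique() arrays
print(np.nanmin(dum22), "-->", np.nanmax(dum22))
print(np.nanmin(dum33), "-->", np.nanmax(dum33))

# or wrap them back in a Series, whose min()/max() skip NaN by default
print(pd.Series(dum22).min(), "-->", pd.Series(dum22).max())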

Pandas describe vs scipy.stats percentileofscore with NaN?

I'm having a weird situation where pd.describe is giving me percentile markers that disagree with scipy.stats.percentileofscore, because of NaNs, I think.
My df is:
f_recommend
0 3.857143
1 4.500000
2 4.458333
3 NaN
4 3.600000
5 NaN
6 4.285714
7 3.587065
8 4.200000
9 NaN
When I run df.describe(percentiles=[.25, .5, .75]) I get:
f_recommend
count 7.000000
mean 4.069751
std 0.386990
min 3.587065
25% 3.728571
50% 4.200000
75% 4.372024
max 4.500000
I get the same values when I run with NaN removed.
When I want to look up a specific value, however, and run scipy.stats.percentileofscore(df['f_recommend'], 3.61, kind='mean'), I get the 28th percentile with NaN and the 20th without.
Any thoughts to explain this discrepancy?
ETA:
I don't believe the problem is that we're calculating percentiles differently, because that only matters when you're calculating percentiles of the same numbers in different ways. Here, describe gives the 25th percentile as 3.72, so there is absolutely no way that 3.61 can be the 28th percentile; none of the formulas should give that.
In particular, when I use describe on the values without NaN, I get the same result, so describe is ignoring NaN, which is fine. But when I run percentileofscore without the NaN, I get a number that doesn't match.
ETA 2:
Simpler example:
In [48]: d = pd.DataFrame([1,2,3,4,5,6,7])
In [49]: d.describe()
Out[49]:
0
count 7.000000
mean 4.000000
std 2.160247
min 1.000000
25% 2.500000
50% 4.000000
75% 5.500000
max 7.000000
In [50]: sp.stats.percentileofscore(d[0], 2.1, kind = 'mean')
Out[50]: 28.571428571428573
the "kind" argument doesn't matter because 2.1 is unique.
scipy.stats.percentileofscore does not ignore nan, nor does it check for the value and handle it in some special way. It is just another floating point value in your data. This means the behavior of percentileofscore with data containing nan is undefined, because of the behavior of nan in comparisons:
In [44]: np.nan > 0
Out[44]: False
In [45]: np.nan < 0
Out[45]: False
In [46]: np.nan == 0
Out[46]: False
In [47]: np.nan == np.nan
Out[47]: False
Those results are all correct--that is how nan is supposed to behave. But that means, in order to know how percentileofscore handles nan, you have to know how the code does comparisons. And that is an implementation detail that you shouldn't have to know, and that you can't rely on to be the same in future versions of scipy.
If you investigate the behavior of percentileofscore, you'll find that it behaves as if nan were infinite. For example, if you replace nan with a value larger than any other value in the input, you'll get the same results:
In [53]: percentileofscore([10, 20, 25, 30, np.nan, np.nan], 18)
Out[53]: 16.666666666666664
In [54]: percentileofscore([10, 20, 25, 30, 999, 999], 18)
Out[54]: 16.666666666666664
Unfortunately, you can't rely on this behavior. If the implementation changes in the future, nan might end up behaving like negative infinity, or have some other unspecified behavior.
The solution to this "problem" is simple: don't give percentileofscore any nan values. You'll have to clean up your data first. Note that this can be as simple as:
result = percentileofscore(a[~np.isnan(a)], score)
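For a pandas Series like the one in the question, dropna() does the same cleanup (a sketch; df and the column name are taken from the question):
from scipy.stats import percentileofscore

clean = df['f_recommend'].dropna()              # remove the NaN rows first
result = percentileofscore(clean, 3.61, kind='mean')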
The answer is very simple.
There is no universally accepted formula for computing percentiles, in particular when your data contains ties or when it cannot be broken down perfectly into equal-size buckets.
For instance, have a look at the documentation in R; there are more than seven types of formulas: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html
At the end, it comes down to understanding which formula is used and whether the differences are big enough to be a problem in your case.
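You can see the effect of the formula choice directly: NumPy exposes several of those quantile definitions through np.percentile's method argument (called interpolation in older NumPy releases). For example, with the seven values used above:
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7]
np.percentile(data, 25, method='linear')   # 2.5 -- linear interpolation (pandas' describe() default)
np.percentile(data, 25, method='lower')    # 2.0 -- take the lower neighbouring value
np.percentile(data, 25, method='higher')   # 3.0 -- take the higher neighbouring value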

Subsetting a Pandas series

I have a Pandas series. Basically one specific row of a pandas data frame.
Name: NY.GDP.PCAP.KD.ZG, dtype: int64
NY.GDP.DEFL.ZS_logdiff 0.341671
NY.GDP.DISC.CN 0.078261
NY.GDP.DISC.KN 0.083890
NY.GDP.FRST.RT.ZS 0.296574
NY.GDP.MINR.RT.ZS 0.264811
NY.GDP.MKTP.CD_logdiff 0.522725
NY.GDP.MKTP.CN_logdiff 0.884601
NY.GDP.MKTP.KD_logdiff 0.990679
NY.GDP.MKTP.KD.ZG 0.992603
NY.GDP.MKTP.KN_logdiff -0.077253
NY.GDP.MKTP.PP.CD_logDiff 0.856861
NY.GDP.MKTP.PP.KD_logdiff 0.990679
NY.GDP.NGAS.RT.ZS -0.018126
NY.GDP.PCAP.CD_logdiff 0.523433
NY.GDP.PCAP.KD.ZG 1.000000
NY.GDP.PCAP.KN_logdiff 0.999456
NY.GDP.PCAP.PP.CD_logdff 0.857321
NY.GDP.PCAP.PP.KD_logdiff 0.999456
The first column is the index, as you would find in a series. I want to get all the index names in a list, but only those whose absolute value in the right-hand column is less than 0.5. For context, this series is the row of a correlation matrix corresponding to the variable NY.GDP.PCAP.KD.ZG, and I want to retain this variable along with the variables whose correlation with it is less than 0.5; the rest I will drop from the dataframe.
Currently I do something like this, but it also keeps NaN:
print(tourism[columns].corr().ix[14].where(np.absolute(tourism[columns].corr().ix[14]<0.5)))
where tourism is the dataframe, columns is the set of columns on which I did the correlation analysis, and 14 is the row in the correlation matrix corresponding to the column mentioned above.
This gives:
NY.GDP.DEFL.ZS_logdiff 0.341671
NY.GDP.DISC.CN 0.078261
NY.GDP.DISC.KN 0.083890
NY.GDP.FRST.RT.ZS 0.296574
NY.GDP.MINR.RT.ZS 0.264811
NY.GDP.MKTP.CD_logdiff NaN
NY.GDP.MKTP.CN_logdiff NaN
NY.GDP.MKTP.KD_logdiff NaN
NY.GDP.MKTP.KD.ZG NaN
NY.GDP.MKTP.KN_logdiff -0.077253
NY.GDP.MKTP.PP.CD_logDiff NaN
NY.GDP.MKTP.PP.KD_logdiff NaN
NY.GDP.NGAS.RT.ZS -0.018126
NY.GDP.PCAP.CD_logdiff NaN
NY.GDP.PCAP.KD.ZG NaN
NY.GDP.PCAP.KN_logdiff NaN
NY.GDP.PCAP.PP.CD_logdff NaN
NY.GDP.PCAP.PP.KD_logdiff NaN
Name: NY.GDP.PCAP.KD.ZG, dtype: float64
If x is your series, then:
x[x.abs() < 0.5].index
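Putting that together for the correlation use case, here is a sketch that assumes the tourism dataframe and columns list from the question, and uses .loc instead of the deprecated .ix:
corr = tourism[columns].corr()
row = corr.loc['NY.GDP.PCAP.KD.ZG']

# Boolean indexing drops NaN correlations automatically (NaN comparisons are False),
# unlike .where(), which keeps them as NaN.
keep = list(row[row.abs() < 0.5].index) + ['NY.GDP.PCAP.KD.ZG']
tourism_reduced = tourism[keep]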
