I'm new to Python and I want to understand how I can remove values from my dataset that are 0.00000.
For context, I am working with this dataset: https://www.kaggle.com/ksuchris2000/oklahoma-earthquakes-and-saltwater-injection-wells
The file InjectionWells.csv has some bad values in its coordinate columns (LAT and LONG) which I need to remove, but I don't know exactly how. This is so I can make a scatterplot with longitude on X and latitude on Y.
I tried the following but it didn't work. Can you please guide me?
You need to discover the outlier values in LAT and LONG.
Plotting them is one way to spot them, but here's a more automated way.
First, use dat.info() to see which columns are numeric and what their dtypes are. You are interested in LAT and LONG.
Use dat[['LAT','LONG']].describe() on your two columns of interest to get descriptive statistics and find out their outlier values.
.describe() takes a percentiles argument (a list), which defaults to
[.25, .5, .75], i.e. the 25th, 50th, and 75th percentiles.
But you want to exclude rare/outlier values, so try including (say) the 1st/99th and 5th/95th percentiles as well:
>>> pd.options.display.float_format = '{:.2f}'.format # suppress unwanted dp's
>>> dat[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
LAT LONG
count 11125.00 11125.00
mean 35.21 -96.85
std 2.69 7.58
min 0.00 -203.63
1% 33.97 -101.80 # <---- 1st percentile
5% 34.20 -99.76
10% 34.29 -98.25
25% 34.44 -97.63
50% 35.15 -97.37
90% 36.78 -95.95
95% 36.85 -95.74
99% 36.96 -95.48 # <---- 99th percentile
max 73.99 97.70
So the 1st-99th percentile ranges of your LAT and LONG values are:
33.97 <= LAT <= 36.96
-101.80 <= LONG <= -95.48
So now you can keep only the rows inside those ranges (i.e. drop the outliers) with a one-line apply(..., axis=1), or equivalently with .between():
dat2 = dat[dat.apply(lambda row: (33.97 <= row['LAT'] <= 36.96) and (-101.80 <= row['LONG'] <= -95.48), axis=1)]
# OR:
dat2 = dat[dat['LAT'].between(33.97, 36.96) & dat['LONG'].between(-101.80, -95.48)]
API# Operator Operator ID WellType ... ZONE Unnamed: 18 Unnamed: 19 Unnamed: 20
0 3500300026.00 PHOENIX PETROCORP INC 19499.00 2R ... CHEROKEE NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
11121 3515323507.00 SANDRIDGE EXPLORATION & PRODUCTION LLC 22281.00 2D ... MUSSELLEM, OKLAHOMA NaN NaN NaN
[10760 rows x 21 columns]
Note this has gone from 11125 down to 10760 rows. So we dropped 365 rows.
Finally, it's always a good idea to check that the extreme values of your filtered LAT and LONG are in the range you expected:
>>> dat2[['LAT','LONG']].describe(percentiles=[.01,.05,.1,.25,.5,.9,.95,.99])
LAT LONG
count 10760.00 10760.00
mean 35.33 -97.25
std 0.91 1.11
min 33.97 -101.76
1% 34.08 -101.62
5% 34.21 -99.19
10% 34.30 -98.20
25% 34.44 -97.62
50% 35.13 -97.36
90% 36.77 -95.99
95% 36.83 -95.80
99% 36.93 -95.56
max 36.96 -95.49
PS: there's nothing magical about taking the 1st/99th percentiles. You can play with describe(..., percentiles=...) yourself. You could use 0.005/0.995, 0.002/0.998, 0.001/0.999, and so on; you get to decide what constitutes an outlier.
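Putting it together for the original scatterplot question, here is a minimal sketch (my addition, assuming the CSV is available locally as InjectionWells.csv and that matplotlib is installed) that derives the 1st/99th percentile bounds programmatically instead of hard-coding them, filters, and plots:
import pandas as pd
import matplotlib.pyplot as plt

dat = pd.read_csv('InjectionWells.csv')  # path is an assumption; adjust as needed

# derive the 1st/99th percentile bounds instead of hard-coding them
lo, hi = dat[['LAT', 'LONG']].quantile([0.01, 0.99]).values
dat2 = dat[dat['LAT'].between(lo[0], hi[0]) & dat['LONG'].between(lo[1], hi[1])]

dat2.plot.scatter(x='LONG', y='LAT', s=2)
plt.show()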
You can create a Boolean series by comparing a column of a dataframe to a single value. Then you can use that series to index the dataframe, so that only those rows that meet the condition are selected:
data = df[['LONG', 'LAT']]
data = data[data['LONG'] < -75]
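Applied to the original question of removing the 0.00000 coordinates, a minimal sketch of the same idea (my addition, assuming the frame has LAT and LONG columns and the file path is as below):
import pandas as pd

df = pd.read_csv('InjectionWells.csv')  # path is an assumption
data = df[['LONG', 'LAT']]
# keep only rows whose coordinates were not recorded as 0
data = data[(data['LAT'] != 0) & (data['LONG'] != 0)]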
I have a dataset storing marathon segment splits (5K, 10K, ...) in seconds and identifiers (age, gender, country) as columns and individuals as rows. Each cell for a marathon segment split column may contain either a float (specifying the number of seconds required to reach the segment) or "NaN". A row may contain up to 4 NaN values. Here is some sample data:
Age M/F Country 5K 10K 15K 20K Half Official Time
2323 38 M CHI 1300.0 2568.0 3834.0 5107.0 5383.0 10727.0
2324 23 M USA 1286.0 2503.0 3729.0 4937.0 5194.0 10727.0
2325 36 M USA 1268.0 2519.0 3775.0 5036.0 5310.0 10727.0
2326 37 M POL 1234.0 2484.0 3723.0 4972.0 5244.0 10727.0
2327 32 M DEN NaN 2520.0 3782.0 5046.0 5319.0 10728.0
I intend to calculate a best fit line for marathon split times (using only the columns from "5K" to "Half") for each row with at least one NaN; from the row's best fit line, I want to impute a data point to replace the NaN.
From the sample data, I intend to calculate a best fit line for row 2327 only (using values 2520.0, 3782.0, 5046.0, and 5319.0). Using this best fit line for row 2327, I intend to replace the NaN 5K time with the predicted 5K time.
How can I calculate this best fit line for each row with NaN?
Thanks in advance.
I "extrapolated" a solution from here from 2015 https://stackoverflow.com/a/31340344/6366770 (pun intended). Extrapolation definition I am not sure if in 2021 pandas has reliable extrapolation methods, so you might have to use scipy or other libraries.
When doing the Extrapolation , I excluded the "Half" column. That's because the running distances of 5K, 10K, 15K and 20K are 100% linear. It is literally a straight line if you exclude the half marathon column. But, that doesn't mean that expected running times are linear. Obviously, as you run a longer distance your average time per kilometer is lower. But, this "gets the job done" without getting too involved in an incredibly complex calculation.
Also, this is worth noting. Let's say that the first column was 1K instead of 5K. Then, this method would fail. It only works because the distances are linear. If it was 1K, you would also have to use the data from the rows of the other runners, unless you were making calculations based off the kilometers in the column names themselves. Either way, this is an imperfect solution, but much better than pd.interpolation. I linked another potential solution in the comments of tdy's answer.
import pandas as pd
import scipy as sp
import scipy.interpolate  # ensure sp.interpolate is actually loaded
# We focus on the four numeric columns from 5K-20K (iloc columns 4:8, since this frame has a
# leading 'index' column) and transpose, because we are working horizontally across columns.
# The index must also be numeric, so we drop it; don't worry, alignment with the original
# index is restored when the results are assigned back later on.
df_extrap = df.iloc[:, 4:8].T.reset_index(drop=True)
# create a scipy interpolation function to be called by a custom extrapolation function later on
def scipy_interpolate_func(s):
    s_no_nan = s.dropna()
    return sp.interpolate.interp1d(s_no_nan.index.values, s_no_nan.values, kind='linear', bounds_error=False)

# linear extrapolation through the first and last known points of the fitted interpolator
def my_extrapolate_func(interp_func, new_x):
    x1, x2 = interp_func.x[0], interp_func.x[-1]
    y1, y2 = interp_func.y[0], interp_func.y[-1]
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (new_x - x1)
# Concatenate each extrapolated column and transpose back to the initial shape, to be added to the original dataframe
s_extrapolated = pd.concat([pd.Series(my_extrapolate_func(scipy_interpolate_func(df_extrap[s]),
                                                          df_extrap[s].index.values),
                                      index=df_extrap[s].index) for s in df_extrap.columns], axis=1).T
cols = ['5K', '10K', '15K', '20K']
df[cols] = s_extrapolated
df
Out[1]:
index Age M/F Country 5K 10K 15K 20K Half \
0 2323 38 M CHI 1300.0 2569.0 3838.0 5107.0 5383.0
1 2324 23 M USA 1286.0 2503.0 3720.0 4937.0 5194.0
2 2325 36 M USA 1268.0 2524.0 3780.0 5036.0 5310.0
3 2326 37 M POL 1234.0 2480.0 3726.0 4972.0 5244.0
4 2327 32 M DEN 1257.0 2520.0 3783.0 5046.0 5319.0
Official Time
0 10727.0
1 10727.0
2 10727.0
3 10727.0
4 10728.0
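If you want a genuine least-squares fit per row (rather than a line through the first and last known points), here is a hedged alternative sketch using np.polyfit on the distances implied by the column names; this is my addition, not part of the answer above, and it assumes df is the questioner's frame:
import numpy as np

cols = ['5K', '10K', '15K', '20K']
dists = np.array([5.0, 10.0, 15.0, 20.0])  # x-values taken from the column names

def fill_row(row):
    y = row[cols].astype(float)
    known = y.notna()
    if known.all() or known.sum() < 2:
        return row                      # nothing to impute, or too few points to fit
    slope, intercept = np.polyfit(dists[known.values], y[known].values, 1)
    y[~known] = slope * dists[~known.values] + intercept
    row[cols] = y
    return row

df = df.apply(fill_row, axis=1)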
I have a pandas DataFrame called Hourly that contains river levels and rainfall readings together. I would like to loop through every rainfall value and collect its river level value, plus the next river level reading.
For Example:
River Level Rainfall
0.876 0.0
0.877 0.8
0.882 0.0
In this case if I was looking for the values for 0.8mm of rainfall, I would like it to return the 0.877 that is in the same row as the 0.8 and also the 0.882 in the row immediately after.
I would like it to output:
0.877
0.882
Currently I have a loop that goes through and locates all the rows for a given rainfall value, but I cannot figure out how to also get the value in the row immediately after.
Any help greatly appreciated.
Try this:
s = df.Rainfall.eq(0.8)
out = df.loc[s | s.shift(fill_value=False), 'River Level']
Out[364]:
1 0.877
2 0.882
Name: River Level, dtype: float64
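One caveat (my addition, not part of the answer above): exact float equality like .eq(0.8) can be brittle if the rainfall values are the result of arithmetic rather than parsed directly from a file. A tolerant variant using np.isclose:
import numpy as np
import pandas as pd

s = pd.Series(np.isclose(df['Rainfall'], 0.8), index=df.index)
out = df.loc[s | s.shift(fill_value=False), 'River Level']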
shift is the way to go, as suggested by @Andy L. If you are looking for alternatives, here's another way (after Wayne's suggestion):
rainfall_value = 0.8
# note: this assumes the default RangeIndex, so index labels and positions coincide for .iloc
index = data[data.Rainfall == rainfall_value].index.tolist()
index = [item for x in index for item in [x, x + 1]]
result = data.iloc[index]
print(result['River Level'])
# 1 0.877
# 2 0.882
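If the frame's index is not the default RangeIndex, the label/position mix above can misbehave; here is a purely positional sketch (my addition) that also guards against running off the end of the frame:
import numpy as np

pos = np.flatnonzero(data['Rainfall'] == rainfall_value)   # positional matches
pos = np.unique(np.concatenate([pos, pos + 1]))            # add the following row
pos = pos[pos < len(data)]                                 # drop positions past the last row
print(data.iloc[pos]['River Level'])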
I have a DataFrame for a fast Fourier transformed signal.
There is one column for the frequency in Hz and another column for the corresponding amplitude.
I read a post from a couple of years ago saying that you can use a simple boolean expression to exclude, or only include, outliers in the final data frame that lie more than a few standard deviations from the mean.
df = pd.DataFrame({'Data':np.random.normal(size=200)}) # example dataset of normally distributed data
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] # keep only points within 3 standard deviations
The problem is that my signal drops by several orders of magnitude (up to 10,000 times smaller) as the frequency increases up to 50,000 Hz. Therefore, I am unable to use a function that only exports values above 3 standard deviations of the whole series, because I will then only pick up the "peak" outliers from the first 50 Hz.
Is there a way I can export outliers in my dataframe that are above 3 rolling standard deviations of a rolling mean instead?
This is maybe best illustrated with a quick example. Basically you're comparing your existing data to a new column that is the rolling mean plus three standard deviations, also on a rolling basis.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Create a few outliers (3 of them, at index locations 10, 55, 80)
df.iloc[[10, 55, 80]] = 40.
r = df.rolling(window=20) # Create a rolling object (no computation yet)
mps = r.mean() + 3. * r.std() # Combine a mean and stdev on that object
print(df[df.Data > mps.Data]) # Boolean filter
# Data
# 55 40.0
# 80 40.0
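# (The outlier at index 10 is not flagged because the rolling statistics are
#  still NaN inside the first window of 20 observations, so the comparison is False there.)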
To add a new column filtering only to outliers, with NaN elsewhere:
df['Peaks'] = df['Data'].where(df.Data > mps.Data, np.nan)
print(df.iloc[50:60])
Data Peaks
50 -1.29409 NaN
51 -1.03879 NaN
52 1.74371 NaN
53 -0.79806 NaN
54 0.02968 NaN
55 40.00000 40.0
56 0.89071 NaN
57 1.75489 NaN
58 1.49564 NaN
59 1.06939 NaN
Here .where returns
An object of same shape as self and whose corresponding entries are
from self where cond is True and otherwise are from other.
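If you also want unusually low points (the two-sided check the questioner's original snippet used), a small sketch reusing the rolling object r from above; this is my addition, not part of the original answer:
lower = r.mean() - 3. * r.std()
upper = r.mean() + 3. * r.std()
outliers = df[(df.Data < lower.Data) | (df.Data > upper.Data)]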
I have a code that plots multiple wind speed values during a day at 50 different altitudes. I'm trying to program it where it gives me the minimum and maximum values at each different altitude so I can see the minimum and maximum winds experienced during the day.
I've tried np.min(wind_speed, axis=0) but this is giving me nan. I have a line that reads bad values of wind speed as nan. How would I be able to avoid the nan values and get the actual minimum and maximum values occurring during the day?
To ignore the NaN values use nanmin and the analogous nanmax:
np.nanmin(wind_speed, axis=0)
np.nanmax(wind_speed, axis=0)
This will ignore the NaN values as desired
Example:
In [93]:
wind_speed = np.array([234,np.NaN,343, np.NaN])
wind_speed
Out[93]:
array([ 234., nan, 343., nan])
In [94]:
print(np.nanmin(wind_speed, axis=0), np.nanmax(wind_speed, axis=0))
234.0 343.0
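Since the question mentions 50 different altitudes, here is a hedged 2-D sketch (my example data, not the questioner's) where rows are times of day and columns are altitudes; axis=0 then gives one min/max per altitude while ignoring NaNs:
import numpy as np

wind_speed = np.array([[5.0, np.nan, 7.5],
                       [4.2, 6.1, np.nan],
                       [np.nan, 5.8, 8.0]])
print(np.nanmin(wind_speed, axis=0))   # [4.2 5.8 7.5]
print(np.nanmax(wind_speed, axis=0))   # [5.  6.1 8. ]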
I am attempting to divide one column by another inside of a function:
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
As can be seen, I am dividing by a column within the DataFrame, but I am getting a rather strange error:
ValueError: putmask: mask and data must be the same size
I must confess, this is the first time I have seen this error. It seems to suggest that the DF and the column are of different lengths, but clearly (since the column comes from the DataFrame) they are not.
A further twist is that I am using this function to loop a data management procedure over year-specific sets (the data are from the Quarterly Census of Employment and Wages 'singlefiles' in the beta series). The sets associated with the 1990-2000 period go off without a hitch, but 2001 throws this error. I am afraid I have not been able to identify a difference in structure across years, and even if I could, how would it explain the length mismatch?
Any thoughts would be greatly appreciated.
EDIT (2/1/2014): Thanks for taking a look Tom. As requested, the pandas version is 0.13.0, and the data file in question is located here on the BLS FTP site. Just to clarify what I meant by consistent structure, every year has the same variable set and dtype (in addition to a consistent data code structure).
EDIT (2/1/2014): Perhaps it would be useful to share the entire function:
def qcew(f,m_dict):
'''Function reads in file and captures county level aggregations with government contributions'''
#Read in file
cew=pd.read_csv(f)
#Create string version of area fips
cew['fips']=cew['area_fips'].astype(str)
#Generate description variables
cew['area']=cew['fips'].map(m_dict['area'])
cew['industry']=cew['industry_code'].map(m_dict['industry'])
cew['agglvl']=cew['agglvl_code'].map(m_dict['agglvl'])
cew['own']=cew['own_code'].map(m_dict['ownership'])
cew['size']=cew['size_code'].map(m_dict['size'])
#Generate boolean masks
lagg_mask=cew['agglvl_code']==73
lsize_mask=cew['size_code']==0
#Subset data to above specifications
cew_super=cew[lagg_mask & lsize_mask]
#Define column subset
lsub_cols=['year','fips','area','industry_code','industry','own','annual_avg_estabs_count','annual_avg_emplvl',\
'total_annual_wages','own_code']
#Subset to desired columns
cew_sub=cew_super[lsub_cols]
#Rename columns
cew_sub.columns=['year','fips','cty','ind_code','industry','own','estabs','emp','tot_wages','own_code']
#Set index
cew_sub.set_index(['year','fips','cty'],inplace=True)
#Capture total wage base and the contributions of Federal, State, and Local
cew_base=cew_sub['tot_wages'].groupby(level=['year','fips','cty']).sum()
cew_fed=cew_sub[cew_sub['own_code']==1]['tot_wages'].groupby(level=['year','fips','cty']).sum()
cew_st=cew_sub[cew_sub['own_code']==2]['tot_wages'].groupby(level=['year','fips','cty']).sum()
cew_loc=cew_sub[cew_sub['own_code']==3]['tot_wages'].groupby(level=['year','fips','cty']).sum()
    #Convert to DFs for join
    lbase=pd.DataFrame(cew_base).rename(columns={0:'base'})
    lfed=pd.DataFrame(cew_fed).rename(columns={0:'fed_wage'})
    lstate=pd.DataFrame(cew_st).rename(columns={0:'st_wage'})
    llocal=pd.DataFrame(cew_loc).rename(columns={0:'loc_wage'})
#Join these series
lcontrib_lev=pd.concat([lbase,lfed,lstate,llocal],axis='index').fillna(0)
#Diag prints
print f
print lcontrib_lev.head()
print lcontrib_lev.describe()
print '*****************************\n'
#Calculate proportional contributions (failure point)
lcontrib=lcontrib_lev.div(lcontrib_lev['base'],axis='index')
#Group base data by year, county, and industry
cew_g=cew_sub.reset_index().groupby(['year','fips','cty','ind_code','industry']).sum().reset_index()
#Join contributions to joined data
cew_contr=cew_g.set_index(['year','fips','cty']).join(lcontrib[['fed_wage','st_wage','loc_wage']])
return cew_contr[[x for x in cew_contr.columns if x != 'own_code']]
Works ok for me (this is on 0.13.1; IIRC nothing in this particular area changed, but it's possible it was a bug that has since been fixed).
In [48]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').head()
Out[48]:
base fed_wage st_wage loc_wage
year fips cty
2001 1000 1000 NaN NaN NaN NaN
1000 NaN NaN NaN NaN
10000 10000 NaN NaN NaN NaN
10000 NaN NaN NaN NaN
10001 10001 NaN NaN NaN NaN
[5 rows x 4 columns]
In [49]: lcontrib_lev.div(lcontrib_lev['base'],axis='index').tail()
Out[49]:
base fed_wage st_wage loc_wage
year fips cty
2001 CS566 CS566 1 0.000000 0.000000 0.000000
US000 US000 1 0.022673 0.027978 0.073828
USCMS USCMS 1 0.000000 0.000000 0.000000
USMSA USMSA 1 0.000000 0.000000 0.000000
USNMS USNMS 1 0.000000 0.000000 0.000000
[5 rows x 4 columns]
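For anyone unsure what div(..., axis='index') is meant to do here, a toy illustration (my own data, not the asker's): it divides every column by the 'base' column, row by row, which is what turns wage levels into proportional contributions.
import pandas as pd

toy = pd.DataFrame({'base': [100.0, 200.0],
                    'fed_wage': [10.0, 50.0],
                    'st_wage': [20.0, 40.0]})
print(toy.div(toy['base'], axis='index'))
#    base  fed_wage  st_wage
# 0   1.0      0.10      0.2
# 1   1.0      0.25      0.2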