Pandas Dataframe replace outliers - python

Thank you in advance for your help! (Code Provided Below) (Data Here)
I would like to remove the outliers beyond 5-6 standard deviations for the columns 5 cm through 225 cm and replace them with the average value for that date (month/day) and depth. What is the best way to do that?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
raw_data = pd.read_csv('all-deep-soil-temperatures.csv', index_col=1, parse_dates=True)
df_all_stations = raw_data.copy()
df_selected_station = df_all_stations  # station-selection step omitted in the post
df_selected_station.fillna(method='ffill', inplace=True)
df_selected_station_D=df_selected_station.resample(rule='D').mean()
df_selected_station_D['Day'] = df_selected_station_D.index.dayofyear
mean=df_selected_station_D.groupby(by='Day').mean()
mean['Day']=mean.index
mean.head()

For a more general solution, assuming that you are given a dataframe df with some column a.
from scipy import stats
df.loc[np.abs(stats.zscore(df['a'])) > 5, 'a'] = df['a'].mean()

Note the .loc in the assignment: chained indexing like df[mask]['a'] = ... writes to a temporary copy and leaves df unchanged.
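For the per-date, per-depth replacement the question actually asks about, here is a minimal sketch. The 'Day' column and the single depth column '5 cm' are toy stand-ins for the real resampled station data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the resampled station data: a day-of-year column and one depth column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Day': np.tile(np.arange(1, 11), 10),   # day-of-year labels
    '5 cm': rng.normal(10.0, 1.0, 100),     # soil temperature at 5 cm depth
})
df.loc[3, '5 cm'] = 100.0                   # plant an obvious outlier

col = '5 cm'
z = (df[col] - df[col].mean()) / df[col].std()   # column-wide z-score
mask = z.abs() > 5                               # points beyond 5 standard deviations

# Per-Day mean computed WITHOUT the flagged outliers, then mapped back onto the flagged rows
clean_mean = df.loc[~mask].groupby('Day')[col].mean()
df.loc[mask, col] = df.loc[mask, 'Day'].map(clean_mean)
```

The same steps can be repeated for each depth column (5 cm through 225 cm), using the day-of-year grouping already built with `dayofyear` above. Computing the replacement mean from the non-outlier rows matters; otherwise the outlier inflates the very average used to replace it.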

Related

How to specify and locate cells using Pandas and use fillna

I am using a dataset that can be found on Kaggle website (https://www.kaggle.com/claytonmiller/lbnl-automated-fault-detection-for-buildings-data).
I am trying to write code that selects rows based on the Timestamp (in the context of this dataset, times between 10:01 PM and 6:59 AM) and fills all the columns in those rows with zero.
I have tried the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
%matplotlib inline
df = pd.read_csv('RTU.csv')
def fill_na(row):
    if dt.time(22, 1) <= pd.to_datetime(row['Timestamp']).time() <= dt.time(6, 59):
        row.fillna(0)
### df = df.apply(fill_na, axis=1) ###
df = df.apply(lambda row: fill_na(row), axis=1)
#### df.fillna(0, inplace=True) ###
df.head(2000)
However, after applying the function row by row, the dataframe no longer works as intended.
I don't think you need a function to do that. Just filter the rows using a condition and then fillna.
import datetime as dt
import pandas as pd
df = pd.read_csv('RTU.csv',parse_dates=['Timestamp'])
df.head()
cond = (df.Timestamp.dt.time > dt.time(22, 0)) | (df.Timestamp.dt.time < dt.time(7, 0))
df[cond] = df[cond].fillna(0)
This shows that the NaNs before 7 AM are filled with 0. Note the two half-conditions are combined with OR, because the window wraps around midnight (the chained comparison in the question, 22:01 <= t <= 06:59, can never be true).
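A small self-contained check of the mask-then-fillna approach, using a hypothetical miniature of RTU.csv with one sensor column:

```python
import datetime as dt
import numpy as np
import pandas as pd

# Hypothetical miniature of RTU.csv: a Timestamp column plus one sensor column with NaNs
df = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2019-01-01 23:30', '2019-01-02 03:00',
                                 '2019-01-02 12:00']),
    'RTU_value': [np.nan, np.nan, np.nan],
})

# The 10 PM - 7 AM window wraps midnight, so OR the two half-conditions
cond = (df['Timestamp'].dt.time > dt.time(22, 0)) | (df['Timestamp'].dt.time < dt.time(7, 0))
df.loc[cond, 'RTU_value'] = df.loc[cond, 'RTU_value'].fillna(0)
```

The 11:30 PM and 3:00 AM rows are filled with 0, while the noon row keeps its NaN.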

Why is my correlation matrix coming empty?

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv(r"C:\Users\amgup\Downloads\Model_Wells\combined.csv", sep=',', usecols=['ACOUSTICIMPEDANCE1', 'CALI', 'DT','GR','NPHI','RHOB','LLD','PIGN','SP','VCL'], dtype='unicode')
cor=data.corr()
print(cor)
Probably none of your columns are numeric: dtype='unicode' reads every column as strings, and corr() drops non-numeric columns, leaving an empty matrix. Cast the columns to numeric values and you can compute the correlation.
Example:
data['CALI'] = data['CALI'].astype(np.float64)
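To cast every column at once rather than one at a time, a sketch with a toy string-typed frame (which is what dtype='unicode' produces):

```python
import pandas as pd

# Toy frame read as strings, mimicking what dtype='unicode' produces
data = pd.DataFrame({'CALI': ['1.0', '2.0', '3.0'],
                     'GR':   ['10',  '20',  '30']})

# Convert every column in one pass; errors='coerce' turns unparseable cells into NaN
data = data.apply(pd.to_numeric, errors='coerce')
cor = data.corr()
```

With numeric dtypes in place, corr() returns the full matrix instead of an empty one.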

Selecting between timestamps pandas

Hello, I cannot understand why this code does not select rows between the dates; it still shows me the full dataset from the first date in 2004. Here is my code below:
import pandas as pd
from pandas import DataFrame
import datetime
from matplotlib import pyplot as plt
df1 = pd.read_csv('time_series_15min_singleindex.csv',header=0,index_col=0,parse_dates=True)
df = DataFrame(df1, columns=['utc_timestamp','DE_solar_generation_actual','DE_wind_onshore_generation_actual'])
df['utc_timestamp'] = pd.to_datetime(df['utc_timestamp'],utc=True)
start_date=pd.to_datetime('2008-12-31',utc=True)
end_date=pd.to_datetime('2009-01-01',utc=True)
df[df['utc_timestamp'].between(start_date,end_date)]
df.plot()
You forgot to assign the result back; use:
df = df[df['utc_timestamp'].between(start_date,end_date)]
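A quick demonstration of why the assignment matters, on a hypothetical three-row miniature of the time series file:

```python
import pandas as pd

# Hypothetical miniature of the 15-minute time series
df = pd.DataFrame({
    'utc_timestamp': pd.to_datetime(['2008-12-30 12:00', '2008-12-31 12:00',
                                     '2009-01-02 12:00'], utc=True),
    'DE_solar_generation_actual': [1.0, 2.0, 3.0],
})

start_date = pd.to_datetime('2008-12-31', utc=True)
end_date = pd.to_datetime('2009-01-01', utc=True)

# Without "df = ", the filtered frame is built and then immediately discarded
df = df[df['utc_timestamp'].between(start_date, end_date)]
```

Only the 2008-12-31 row survives; a bare `df[df['utc_timestamp'].between(...)]` on its own line would leave df untouched, which is why the plot still showed data from 2004 onward.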

How can I pick out the July month of this time series (runoff) to plot?

How can I pick out just the July values from these time series? My series runs from 1985 to 2018, with runoff values in the right-hand column. I need help with further code to pick out the July values and then plot them.
my code:
from pandas import read_csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cartopy
from datetime import date,datetime
dir1 = "mystiations/"
files = os.listdir(dir1)
files = np.sort(files)
files_txt = [i for i in files if i.endswith('.txt_')]
df = pd.read_csv(dir1 + files_txt[0], skiprows=6, header=None, index_col=0, sep=" ", na_values=-9999)
df.index = pd.to_datetime(df.index, format="%Y%m%d/%H%M")
myperiod = df["1985":"2018"]
myperiod
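Since the index is already a DatetimeIndex, the July rows across all years can be selected with the index's month attribute. A sketch on toy daily data standing in for the real station file:

```python
import numpy as np
import pandas as pd

# Toy daily runoff series standing in for the real station file (1985-2018)
idx = pd.date_range('1985-01-01', '1986-12-31', freq='D')
df = pd.DataFrame({'runoff': np.random.default_rng(1).random(len(idx))}, index=idx)

# A DatetimeIndex exposes .month, so every July across all years is one mask away
july = df[df.index.month == 7]
```

The result can be plotted directly, e.g. `july['runoff'].plot()`, or grouped by year first if one curve per year is wanted.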

Getting the Date attribute correct for subgrouping and regression lmplot

Given the following data from a CSV file, I want to plot a regression (using seaborn/matplotlib) of the mean price of two-bedroom listings.
I have managed to use groupby to get the mean. However, after trying solutions from Stack Overflow, I mostly end up with other never-ending data-related problems; most of the errors are about converting to string, or about the column not being an index.
   Bedrooms      Price       Date
0       2.0        NaN   3/9/2016
1       NaN  1480000.0  3/12/2016
2       2.0  1035000.0   4/2/2016
3       3.0        NaN   4/2/2016
4       3.0  1465000.0   4/2/2016
%matplotlib inline
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('justtesting.csv', nrows=50, usecols=['Price','Date','Bedrooms'])
df = df.dropna(axis=0)
df['Date'] = pd.to_datetime(df.Date)
df.sort_values("Date", axis = 0, ascending = True, inplace = True)
df2 = df[df['Bedrooms'] == 2].groupby(["Date"]).agg(['sum'])
df2.head()
df2.info()
sns.set()
g=sns.lmplot(x="Date", y="Price", data=df2, lowess=True)
# Assume you have the dataframe df described in the question
%matplotlib inline
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = x.copy()  # x is the original dataframe read from the CSV
df = df.dropna(axis=0)
df.sort_values("Date", axis = 0, ascending = True, inplace = True)
df2 = df[df['Bedrooms'] == 2].groupby(["Date", 'Bedrooms'], as_index=False).sum()
df2.head()
df2.info()
sns.set()
g=sns.lmplot(x='Date', y="Price", data=df2, lowess=True)
groupby makes the grouped-by columns the index by default; passing as_index=False fixes that. However, seaborn's lmplot requires the x variable to be numeric (float). More info can be found on this question.
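One common way to satisfy lmplot's numeric-x requirement is to map each date to its ordinal day number. A sketch on a hypothetical two-row slice of the grouped data:

```python
import pandas as pd

# Hypothetical slice of the grouped listings data (df2 in the question)
df2 = pd.DataFrame({'Date': pd.to_datetime(['2016-03-09', '2016-04-02']),
                    'Price': [1480000.0, 1035000.0]})

# lmplot needs a numeric x axis, so map each Timestamp to its ordinal day number
df2['Date_ord'] = df2['Date'].map(pd.Timestamp.toordinal)
# sns.lmplot(x='Date_ord', y='Price', data=df2, lowess=True)
```

Because consecutive days map to consecutive integers, the regression slope stays interpretable (price change per day); the ordinal ticks can be relabeled back to dates on the final plot if needed.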
