Why is my correlation matrix coming back empty? - python

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv(r"C:\Users\amgup\Downloads\Model_Wells\combined.csv", sep=',', usecols=['ACOUSTICIMPEDANCE1', 'CALI', 'DT','GR','NPHI','RHOB','LLD','PIGN','SP','VCL'], dtype='unicode')
cor=data.corr()
print(cor)

Probably none of your columns are numeric: passing dtype='unicode' to read_csv loads every column as strings, so data.corr() finds no numeric columns and returns an empty DataFrame. Typecast the columns to numeric values first; then the correlation can be calculated.
Example:
data['CALI'] = data['CALI'].astype(np.float64)
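If you want to convert all of the selected columns at once, a minimal sketch (not from the original answer) is to run them through pd.to_numeric; errors='coerce' turns any value that cannot be parsed into NaN instead of raising:
import pandas as pd
# Convert every selected column to a numeric dtype; unparseable values become NaN.
data = data.apply(pd.to_numeric, errors='coerce')
cor = data.corr()
print(cor)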

Related

Why does a Pandas plot look different when using csv or xlsx data?

I've got two datasets with the exact same data, but they look different when plotted the same way. One is a .xlsx file and one is a .csv file.
Here are the two codes:
For the CSV:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
daten = pd.read_csv(r"Path\Übungsdaten.csv", header=0, sep=";")
print("Total rows: {0}".format(len(daten)))
print(daten.columns)
plt.scatter(daten['InsuredValue'], daten['Policy'])
plt.xlim(2500000)
plt.ylim(100100)
plt.show()
And for the xlsx:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.cluster import KMeans
daten = pd.read_excel(r"Path\Übungsdaten.xlsx")
print("Total rows: {0}".format(len(daten)))
plt.scatter(daten['InsuredValue'],daten['Policy'] )
plt.xlim(2500000)
plt.ylim(100100)
plt.show()
Here are the plots (screenshots omitted): the csv with plt.xlim(2500000) and plt.ylim(100100), the csv without axis limits, and finally the .xlsx plot.
My question is, first of all, why is there a black bar at the bottom of the first two plots? (I'm guessing this is every single value of "InsuredValue".) And how can I get the csv plot to the same ratio as the xlsx plot?
Thank you very much
I had to convert the "InsuredValue" column to int with the following code:
# astype returns a new DataFrame, so assign the result back
daten = daten.astype({'InsuredValue': 'int'})
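The root cause is usually that the semicolon-separated csv loads the numeric columns as strings, which is also why the int conversion fixes it: matplotlib treats string values as categories rather than numbers. A sketch of reading the file with numeric columns up front (the decimal=',' argument is an assumption about how the file was exported; drop it if the values already parse as numbers):
import pandas as pd
import matplotlib.pyplot as plt

# Read the semicolon-separated csv and force the plotted columns to numbers.
daten = pd.read_csv(r"Path\Übungsdaten.csv", sep=";", decimal=",")
daten["InsuredValue"] = pd.to_numeric(daten["InsuredValue"], errors="coerce")
daten["Policy"] = pd.to_numeric(daten["Policy"], errors="coerce")

plt.scatter(daten["InsuredValue"], daten["Policy"])
plt.show()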

Decompose time series in Python with statsmodels

I am trying to decompose a time series. The dataset is a 2x8638 matrix. The code follows.
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
df1 = pd.read_csv("u_x_ts.csv").set_index("0")
df1.head()
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df1, model='multiplicative')
result.plot()
plt.show()
then python returns the error message:
ValueError: Multiplicative seasonality is not appropriate for zero and negative values
I think statsmodels doesn't support such small values, because at the beginning of the series the values are too small.
But if anyone knows a way out of this problem, I'd appreciate it.
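The message is about zeros and negative values rather than small ones: a multiplicative decomposition needs a strictly positive series. A sketch of two common workarounds (not from the original thread): switch to an additive model, or shift the series so it is strictly positive before decomposing.
from statsmodels.tsa.seasonal import seasonal_decompose

# Option 1: an additive model accepts zero and negative values.
result = seasonal_decompose(df1, model='additive')
result.plot()
plt.show()

# Option 2: shift the series above zero if a multiplicative model is really wanted.
shifted = df1 - df1.min() + 1
result = seasonal_decompose(shifted, model='multiplicative')
result.plot()
plt.show()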

Fitting a cumulative Gaussian to data

I am trying to fit a cumulative Gaussian distribution to my data, but I get a strange result with a negative mu:
libraries:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.stats import norm
import numpy as np
First I import the data from an Excel file:
data = pd.read_excel ('....xlsx',sheet_name='test', na_filter=True)
The data (shown as a screenshot in the original post) have two columns, x and y. Then I create a data frame:
data_sort = pd.DataFrame(data, columns=['x','y'])
and fit the CDF:
mu,sigma = curve_fit(norm.cdf, data_sort['x'], data_sort['y'], p0=[0,1])[0]
and I get back mu= -0.512, sigma=0.106, which is just totally wrong...
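The curve_fit call itself is a standard way to fit a CDF, so it can help to sanity-check it on synthetic data with known parameters; if the synthetic case recovers mu and sigma, the problem is likely in the data (for example, y not running from 0 to 1). A small sketch with made-up data:
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Synthetic cumulative-Gaussian data with known mu=1.0 and sigma=0.5.
x = np.linspace(-1, 3, 50)
y = norm.cdf(x, 1.0, 0.5) + np.random.normal(0, 0.01, x.size)

mu, sigma = curve_fit(norm.cdf, x, y, p0=[0, 1])[0]
print(mu, sigma)  # should come back close to 1.0 and 0.5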

Pandas Dataframe replace outliers

Thank you in advance for your help! (Code Provided Below) (Data Here)
I would like to remove the outliers beyond 5 or 6 standard deviations for the columns 5 cm through 225 cm and replace them with the average value for that date (Month/Day) and depth. What is the best way to do that?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
raw_data = pd.read_csv('all-deep-soil-temperatures.csv', index_col=1, parse_dates=True)
df_all_stations = raw_data.copy()
# (the original post does not show how df_selected_station is selected from df_all_stations)
df_selected_station.fillna(method='ffill', inplace=True)
df_selected_station_D=df_selected_station.resample(rule='D').mean()
df_selected_station_D['Day'] = df_selected_station_D.index.dayofyear
mean=df_selected_station_D.groupby(by='Day').mean()
mean['Day']=mean.index
mean.head()
For a more general solution, assume you are given a dataframe df with some column a:
from scipy import stats
# use .loc so the assignment actually writes back to df (chained indexing would not)
df.loc[np.abs(stats.zscore(df['a'])) > 5, 'a'] = df['a'].mean()
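A rough sketch of one way to also honor the "replace with the average for that date" part of the question (the helper function, the n_std default, and the '5 cm' column name are assumptions, not from the original post; it assumes a DatetimeIndex, as after the daily resample above):
import numpy as np

def replace_outliers_with_day_mean(df, col, n_std=5):
    # Flag values more than n_std standard deviations from the column mean.
    z = (df[col] - df[col].mean()) / df[col].std()
    outliers = np.abs(z) > n_std
    # Day-of-year mean of the same depth column (computed over all values for simplicity).
    day_mean = df.groupby(df.index.dayofyear)[col].transform('mean')
    df.loc[outliers, col] = day_mean[outliers]
    return df

# Usage for one depth column:
# df_selected_station_D = replace_outliers_with_day_mean(df_selected_station_D, '5 cm')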

plot graph from python dataframe

I want to convert the original dataframe (screenshot omitted) into the changed dataframe (screenshot omitted) and plot a matplotlib graph with the date along the x-axis.
Use df.T.plot(kind='bar'):
import pandas as pd
import matplotlib.pyplot as plt
# DataFrame.from_csv was removed from pandas; read_csv with index_col=0 is the equivalent
df = pd.read_csv('./housing_price_index_2010-11_100.csv', index_col=0, parse_dates=True)
df.T.plot(kind='bar')
plt.show()
You can also assign the transpose to a new variable and plot that (as you asked in the comment):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('./housing_price_index_2010-11_100.csv', index_col=0, parse_dates=True)
df_transposed = df.T
df_transposed.plot(kind='bar')
plt.show()
Both produce the same result:
