I'm characterizing biological samples with infrared spectroscopy (mid-infrared region) and the resulting data is used in predictive models for disease prediction. I'm now working on the spectral data with supervised learning, and the first step is to prepare the data (smoothing, peak finding, peak filtering, etc.). I now have a 93x1 matrix (the dependent variable, i.e. the disease/not-disease label), where 93 is the number of samples, and a 93x210 matrix, where 210 is the number of wavelengths at which the pre-filtered absorption peaks occur. From these 210 wavelengths I need to extract the features (absorption peaks) that I'll feed into my model.
For this, I'm using a Pearson correlation matrix in Python, where the header is the 210 xi wavelengths. I want to find correlations between the absorption peaks at wavelength xi and the samples. The issue is that the resulting matrix gives me '1' everywhere.
Disclaimer: I'm a newbie in Python.
import io
import numpy as np
import pandas as pd
from google.colab import files

# Upload the Excel file with the pre-filtered peak table
uploaded = files.upload()
df2 = pd.read_excel(io.BytesIO(uploaded['20191201-Peaks.xlsx']), header=None, index_col=None)

# First two columns: label and sample ID; remaining columns: wavelengths
df2.columns = ['Label_0_1', 'Sample', 1700.105, ..., 1500.49]
print (df2.dtypes)
Label_0_1 object
Sample object
1700.105 float64
1699.141 float64
1698.177 float64
...
1504.35 float64
1503.38 float64
1502.42 float64
1501.45 float64
1500.49 float64
Length: 210, dtype: object
df2.shape
(93, 210)
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(150, 150))
# Pairwise Pearson correlation between the columns of df2
cor = df2.corr(method='pearson')
cor
[screenshot: the resulting correlation matrix, with 1 in every cell]
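For reference, a minimal sketch on synthetic data (shapes only, not the real spectra) of what the column-wise Pearson correlation is expected to do, plus each wavelength's correlation with the label:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
peaks = pd.DataFrame(rng.normal(size=(93, 210)))   # 93 samples x 210 wavelengths
labels = pd.Series(rng.integers(0, 2, size=93))    # disease / not-disease label

# df.corr() correlates columns with each other across rows (samples),
# so off-diagonal values should normally differ from 1
cor = peaks.corr(method='pearson')
print(cor.shape)   # (210, 210)

# Correlation of each wavelength column with the label, often more useful
# for feature selection than the wavelength-by-wavelength matrix
label_corr = peaks.corrwith(labels)
print(label_corr.abs().sort_values(ascending=False).head())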
I have a gridded temperature dataset df (time: 2920 x: 349 y: 277) and a land sea mask for the same grid mf (time: 1 x: 349 y: 277) where mf.land = 1 for land grid points and mf.land = 0 for ocean points. I want to use the land sea mask to eliminate ocean points from my temperature dataset df, i.e. I only want grid points in df where mf.land = 1.
Here's what df looks like: [dataset printout omitted]
And here's what mf looks like: [dataset printout omitted]
I'm trying this:
#import libraries
import os
import matplotlib.pyplot as plt
from netCDF4 import Dataset as netcdf_dataset
import numpy as np
from cartopy import config
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import xarray as xr
import pandas as pd
import netCDF4 as nc
#open temperature data and land sea mask
df=xr.open_dataset('/home/mmartin/LauNath/air.2m.2015.nc')
mf=xr.open_dataset('/home/mmartin/WinterMaxThesis/NOAAGrid/land.nc')
#apply mask
mask = (mf.land >= 1)
LandOnly=df.air.loc[mask]
But I am having trouble because of the difference in dimensions. How can I mask out these ocean grid points?
The problem is actually occurring because the data arrays do have the same dimensions, but they shouldn’t. What I mean by that is that the time dimension on the land mask makes xarray think that it needs to align the two time dimensions. However, there is no overlap in the time coordinate on the two datasets, so when xarray aligns them all the data is mis-aligned in time and thus dropped. Since the land mask doesn't change through time (at least, that's what I'm assuming) it's best to exclude the time dimension from the land mask so xarray can broadcast it against the full time dimension of the data.
If you drop the time dimension on the land mask, it will broadcast as you expect:
mask = (mf.land >= 1).squeeze(['time'], drop=True)
Now you can mask your data with .where, optionally dropping all-NaN slices with drop=True:
LandOnly=df.air.where(mask, drop=True)
See the user guide sections on broadcasting and automatic alignment for more info.
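For reference, a minimal self-contained sketch of the same broadcasting behaviour, with made-up dimension names and sizes:
import numpy as np
import xarray as xr

air = xr.DataArray(np.random.rand(4, 3, 5), dims=("time", "y", "x"))
land = xr.DataArray(
    np.array([[1, 0, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 0, 1]]),
    dims=("y", "x"),
)

# A (y, x) mask broadcasts against the (time, y, x) data:
# land points are kept, ocean points become NaN
land_only = air.where(land == 1)
print(land_only.dims)   # ('time', 'y', 'x')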
There are several tutorials (example 1, example 2, example 3) about masking NetCDF files using a shapefile and calculating average measures. However, I was confused by those workflows for masking NetCDF data and extracting measures such as the average, and they did not cover extracting anomalies (for example, the difference between the temperature in 2019 and a baseline average temperature).
I'll make an example here. I have downloaded monthly temperature data (download temperature file) from 2000 to 2019 and the state-level US shapefile (download shapefile). I want to get the state-level average temperature based on the monthly average temperature from 2000 to 2019, and the temperature anomaly of 2019 relative to the baseline temperature from 2000 to 2010. Specifically, the final dataframe looks as follows:
state    avg_temp    anom_temp2019
AL       xx          xx
AR       xx          xx
...      ...         ...
WY       xx          xx
# Load libraries
%matplotlib inline
import regionmask
import numpy as np
import xarray as xr
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
# Read shapefile
us = gpd.read_file('./shp/state_cus.shp')
# Read gridded data
ds = xr.open_mfdataset('./temp/monthly_mean_t2m_*.nc')
......
I would really appreciate help with an explicit workflow that could do the above task. Thanks a lot.
This can be achieved using regionmask. I don't use your files but the xarray tutorial data and naturalearth data for the US states.
import numpy as np
import regionmask
import xarray as xr
# load polygons of US states
us_states_50 = regionmask.defined_regions.natural_earth.us_states_50
# load an example dataset
air = xr.tutorial.load_dataset("air_temperature")
# turn into monthly time resolution
air = air.resample(time="M").mean()
# create a mask
mask3D = us_states_50.mask_3D(air)
# latitude weights
wgt = np.cos(np.deg2rad(air.lat))
# calculate regional averages
reg_ave = air.weighted(mask3D * wgt).mean(("lat", "lon"))
# calculate the average temperature (over 2013-2014)
avg_temp = reg_ave.sel(time=slice("2013", "2014")).mean("time")
# calculate the anomaly (w.r.t. 2013-2014)
reg_ave_anom = reg_ave - avg_temp
# select a single timestep (January 2013)
reg_ave_anom_ts = reg_ave_anom.sel(time="2013-01")
# remove the time dimension
reg_ave_anom_ts = reg_ave_anom_ts.squeeze(drop=True)
# convert to a pandas dataframe so it's in tabular form
df = reg_ave_anom_ts.air.to_dataframe()
# set the state codes as index
df = df.set_index("abbrevs")
# remove other columns
df = df.drop(columns="names")
You can find info on how to use your own shapefile in the regionmask docs (Working with geopandas).
disclaimer: I am the main author of regionmask.
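As a rough sketch of that geopandas route (the column names 'NAME' and 'STUSPS' are assumptions about the shapefile's attribute table, and the gridded file is assumed to have lon/lat coordinates):
import geopandas as gpd
import regionmask
import xarray as xr

# Build a Regions object from your own shapefile
us = gpd.read_file('./shp/state_cus.shp')
states = regionmask.from_geopandas(us, names='NAME', abbrevs='STUSPS', name='us_states')

# Create the 3D mask on your own grid, then proceed as in the example above
ds = xr.open_mfdataset('./temp/monthly_mean_t2m_*.nc')
mask3D = states.mask_3D(ds)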
I am new here but I hope you guys can help me out.
I'm trying to conduct factor analysis with word vectors in Python using the FactorAnalyzer module. I have a DataFrame with 100 columns and more than 15,000 rows.
I did not receive any error when I performed a factor analysis. Below is the output:
FactorAnalyzer(bounds=(0.005, 1), impute='median', is_corr_matrix=False,
method='minres', n_factors=10, rotation='varimax',
rotation_kwargs={}, use_smc=True)
But when I try to get the loadings, it only returns 100 rows. I want to get the loadings for all rows.
Here is my code:
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
import numpy as np
import pickle
factor_df = pd.read_pickle("word_vectors.pkl")
factor_df = pd.DataFrame(data=factor_df)
fa = FactorAnalyzer(n_factors=10, rotation='varimax')
fa.fit(factor_df)
loading = fa.loadings_
loadings_df = pd.DataFrame(fa.loadings_)
loadings_df
The pickle file for my dataset is here.
Factor loadings are the weights and correlations between each variable (column in your DataFrame) and the factor, so the fa.loadings_ object is an array with shape (number_of_variables, number_of_factors) - in your example (100, 10).
If you would like to get the transformed data with only 10 columns per row, you should call fa.transform(factor_df) after fa.fit(factor_df). The returned array will have shape (number_of_rows, number_of_factors) - in your example (15_000, 10).
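A short sketch of the difference, reusing factor_df from the question:
import pandas as pd
from factor_analyzer import FactorAnalyzer

# factor_df is the (15000 x 100) word-vector DataFrame from the question
fa = FactorAnalyzer(n_factors=10, rotation='varimax')
fa.fit(factor_df)

print(fa.loadings_.shape)   # (100, 10): one row per variable (column)
scores = fa.transform(factor_df)
print(scores.shape)         # (15000, 10): one row per observation

scores_df = pd.DataFrame(scores, index=factor_df.index)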
I've provided sample data below. It contains an 8x10 matrix of draws from two-dimensional normal distributions. For example, col1 and col2 form one set, col3/col4 another, and so on. I'm trying to calculate the covariance of each individual set in Python. So far I've been unsuccessful, and I'm new to Python. However, below is what I've tried:
import pandas
import numpy
import matplotlib.pyplot as plg
data = pandas.read_excel("testfile.xlsx", header=None)
dataNpy = pandas.DataFrame.to_numpy(data)
mean = numpy.mean(dataNpy, axis=0)
dataAWithoutMean = dataNpy - mean
covB = numpy.cov(dataAWithoutMean)
print("cov is: " + str(covB))
I've been tasked to calculate 4 separate covariance matrices and plot the covariance value for each set. In addition, plot the variance of each set.
dataset:
5.583566716 -0.441667252 -0.663300181 -1.249623134 -6.530464227 -4.984165997 2.594874802 2.646629654
6.129721509 2.374902708 -2.583949571 -2.224729817 0.279965502 -0.850298098 -1.542499771 -2.686894831
5.793226266 1.133844629 -1.939493549 1.570726544 -2.125423302 -1.33966397 -0.42901856 -0.09814741
3.413049714 -0.1133744 -0.032092831 -0.122147373 2.063549449 0.685517481 5.887909556 4.056242954
-2.639701885 -0.716557389 -0.851273969 -0.522784614 -7.347432606 -2.653482175 1.043389849 0.774192416
-1.84827484 -0.636893709 -2.223488277 -1.227420764 0.253999505 0.540299783 -1.593071594 -0.70980532
0.754029441 1.427571018 5.486147486 2.956320758 2.054346142 1.939929175 -3.559875405 -3.074861749
2.009806308 1.916796155 7.820990369 2.953681659 2.071682641 0.105056782 -1.120995825 -0.036335483
1.875128481 1.785216268 -2.607698929 0.244415372 -0.793431956 -1.598343481 -2.120852679 -2.777871862
0.168442246 0.324606905 0.53741174 0.274617158 -2.99037756 -3.323958514 -3.288399345 -2.482277047
Thanks in advance for your help :)
Is this what you need?
import pandas
import numpy
import matplotlib.pyplot as plt
data = pandas.read_excel("Book1.xlsx", header=None)
mean = data.mean(axis=0)
dataAWithoutMean = data - mean
# Variance of each set
dataAWithoutMean.var()
# Covariance matrix
cov = dataAWithoutMean.cov()
plt.matshow(cov)
plt.show()
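If what's needed is the four separate 2x2 covariance matrices (one per column pair), a sketch along these lines should be close (same assumed Book1.xlsx layout as above):
import pandas
import matplotlib.pyplot as plt

data = pandas.read_excel("Book1.xlsx", header=None)

# Loop over the column pairs (0-1, 2-3, 4-5, 6-7)
for i in range(0, data.shape[1], 2):
    pair = data.iloc[:, i:i + 2]
    print(f"Set {i // 2 + 1} covariance:\n{pair.cov()}\n")
    print(f"Set {i // 2 + 1} variance:\n{pair.var()}\n")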
I know how to convert a netCDF4.Dataset to an xarray DataArray manually. However, I would like to know whether there is any simple and elegant way, e.g. using an xarray backend, to convert the following netCDF4.Dataset object to an xarray DataArray object:
<type 'netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
Originating_or_generating_Center: US National Weather Service, National Centres for Environmental Prediction (NCEP)
Originating_or_generating_Subcenter: NCEP Ensemble Products
GRIB_table_version: 2,1
Type_of_generating_process: Ensemble forecast
Analysis_or_forecast_generating_process_identifier_defined_by_originating_centre: Global Ensemble Forecast System (GEFS)
Conventions: CF-1.6
history: Read using CDM IOSP GribCollection v3
featureType: GRID
History: Translated to CF-1.0 Conventions by Netcdf-Java CDM (CFGridWriter2)
Original Dataset = /data/ldm/pub/native/grid/NCEP/GEFS/Global_1p0deg_Ensemble/member/GEFS_Global_1p0deg_Ensemble_20170926_0600.grib2.ncx3#LatLon_181X360-p5S-180p0E; Translation Date = 2017-09-26T17:50:23.259Z
geospatial_lat_min: 0.0
geospatial_lat_max: 90.0
geospatial_lon_min: 0.0
geospatial_lon_max: 359.0
dimensions(sizes): time2(2), ens(21), isobaric1(12), lat(91), lon(360)
variables(dimensions): float32 u-component_of_wind_isobaric_ens(time2,ens,isobaric1,lat,lon), float64 time2(time2), int32 ens(ens), float32 isobaric1(isobaric1), float32 lat(lat), float32 lon(lon), float32 v-component_of_wind_isobaric_ens(time2,ens,isobaric1,lat,lon)
groups:
I've got this using siphon.ncss.
The next release of xarray (0.10) has support for this very thing, or at least getting an xarray dataset from a netCDF4 one, for exactly the reason you're trying to use it:
import netCDF4 as nc4
import xarray as xr

nc = nc4.Dataset('filename.nc', mode='r')  # Or from siphon.ncss
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
Or with siphon.ncss, this would look like:
from datetime import datetime
from siphon.catalog import TDSCatalog
import xarray as xr
gfs_cat = TDSCatalog('http://thredds.ucar.edu/thredds/catalog'
'/grib/NCEP/GFS/Global_0p5deg/catalog.xml')
latest = gfs_cat.latest
ncss = latest.subset()
query = ncss.query().variables('Temperature_isobaric')
query.time(datetime.utcnow()).accept('netCDF4')
nc = ncss.get_data(query)
dataset = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
Until it's released, you could install xarray from master. Otherwise, the only other solution is to do everything manually.
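For reference, the manual route could look roughly like this (a sketch only; the variable name is taken from the dump above, and coordinate/attribute handling is kept minimal):
import netCDF4 as nc4
import xarray as xr

nc = nc4.Dataset('filename.nc', mode='r')
var = nc.variables['u-component_of_wind_isobaric_ens']

# Pull the 1-D coordinate variables that match the variable's dimensions
coords = {dim: nc.variables[dim][:] for dim in var.dimensions}

da = xr.DataArray(
    var[:],                   # load the values into memory
    dims=var.dimensions,      # ('time2', 'ens', 'isobaric1', 'lat', 'lon')
    coords=coords,
    name=var.name,
    attrs={k: var.getncattr(k) for k in var.ncattrs()},
)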