boxplot structure disappears when pandas contains NaN [duplicate]

I am using matplotlib to draw a boxplot, but some of the values are missing (NaN). I found that the boxes are not drawn for the columns that contain NaN values.
Do you know how to solve this problem?
Here is the code.
import numpy as np
import matplotlib.pyplot as plt
#==============================================================================
# open data
#==============================================================================
filename='C:\\Users\\liren\\OneDrive\\Data\\DATA in the first field-final\\ks.csv'
AllData=np.genfromtxt(filename,delimiter=";",skip_header=0,dtype='str')
TreatmentCode = AllData[1:,0]
RepCode = AllData[1:,1]
KsData= AllData[1:,2:].astype('float')
DepthHeader = AllData[0,2:].astype('float')
TreatmentUnique = np.unique(TreatmentCode)[[3,1,4,2,8,6,9,7,0,5,10],]
nT = TreatmentUnique.size  # nT = number of treatments
# nD = number of depths; nR = number of replications; iT = treatment index
nD = 5
nR = 6
KsData_3D = np.zeros((nT, nD, nR))
for iT in range(nT):
    Treatment = TreatmentUnique[iT]
    TreatmentFilter = TreatmentCode == Treatment
    KsData_Filtered = KsData[TreatmentFilter, :]
    KsData_3D[iT, :, :] = KsData_Filtered.transpose()

iD = 4
fig=plt.figure()
ax = fig.add_subplot(111)
plt.boxplot(KsData_3D[:,iD,:].transpose())
ax.set_xticks(range(1,nT+1))
ax.set_xticklabels(TreatmentUnique)
ax.set_title(DepthHeader[iD])
Here is the final figure; the boxes for some of the treatments are missing.

You can remove the NaNs from the data first, then plot the filtered data.
To do that, first find the NaNs using np.isnan(data), then invert that Boolean array with the ~ (bitwise inversion) operator. Using the result to index the data array filters out the NaNs.
filtered_data = data[~np.isnan(data)]
In a complete example (adapted from here)
Tested in python 3.10, matplotlib 3.5.1, seaborn 0.11.2, numpy 1.21.5, pandas 1.4.2
For 1D data:
import matplotlib.pyplot as plt
import numpy as np
# fake up some data
np.random.seed(2022) # so the same data is created each time
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)
# Add a NaN
data[40] = np.NaN
# Filter data using np.isnan
filtered_data = data[~np.isnan(data)]
# basic plot
plt.boxplot(filtered_data)
plt.show()
For 2D data:
For 2D data, you can't simply use the mask above, since then each column of the data array would have a different length. Instead, we can create a list, with each item in the list being the filtered data for each column of the data array.
A list comprehension can do this in one line: [d[m] for d, m in zip(data.T, mask.T)]
import matplotlib.pyplot as plt
import numpy as np
# fake up some data
np.random.seed(2022) # so the same data is created each time
spread = np.random.rand(50) * 100
center = np.ones(25) * 50
flier_high = np.random.rand(10) * 100 + 100
flier_low = np.random.rand(10) * -100
data = np.concatenate((spread, center, flier_high, flier_low), 0)
data = np.column_stack((data, data * 2., data + 20.))
# Add a NaN
data[30, 0] = np.NaN
data[20, 1] = np.NaN
# Filter data using np.isnan
mask = ~np.isnan(data)
filtered_data = [d[m] for d, m in zip(data.T, mask.T)]
# basic plot
plt.boxplot(filtered_data)
plt.show()
I'll leave it as an exercise to the reader to extend this to 3 or more dimensions, but you get the idea.
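That said, for a 3D array shaped like KsData_3D in the question, a minimal sketch (with hypothetical shapes and random stand-in data, not the real measurements) is to select one 2D slice and reuse the per-row filtering:
import matplotlib.pyplot as plt
import numpy as np
# hypothetical 3D array shaped (treatments, depths, replicates), mimicking KsData_3D
rng = np.random.default_rng(0)
data3d = rng.normal(size=(11, 5, 6))
data3d[2, 4, :3] = np.nan  # sprinkle some NaNs into depth index 4
iD = 4  # pick one depth, as in the question
slice2d = data3d[:, iD, :]  # shape (treatments, replicates)
filtered = [row[~np.isnan(row)] for row in slice2d]  # one entry per treatment
plt.boxplot(filtered)
plt.show()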
Use seaborn, which is a high-level API for matplotlib
seaborn.boxplot filters NaN under the hood
import seaborn as sns
sns.boxplot(data=data)
[figures: 1D and 2D seaborn boxplots]
NaNs are also ignored when plotting with df.plot(kind='box') in pandas, which uses matplotlib as the default plotting backend.
import pandas as pd
df = pd.DataFrame(data)
df.plot(kind='box')
[figures: 1D and 2D pandas boxplots]

Related

Extract More Than Two Dimensions via Python: sklearn.cross_decomposition import CCA & transform

I am very interested in using Python to extract 3-4 dimensions via Canonical Correlation Analysis (CCA). I am pasting my very basic code below; it always seems to extract only two dimensions, even though each of my input arrays is 10,000+ x 3. Even if I have 4 columns for my X and Y matrices, it still gives just two dimensions, and I was hoping for three and eventually four as I add more raw features to my X and Y arrays. I am trying to keep it simple for now. Could part of my problem also be that some of my field names have spaces in them?
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
data = "G:\Shared drives\Data Intelligence\ZF\Segmentation/Data.csv"
df = pd.read_csv(data)
df.head()
print(df.columns)
X = df[['Altcurr Ext Stk Sales Cc',\
'Altcurr Ext Dss Sales Cc',\
'LBM Sales']]
X.head()
X_mc = (X-X.mean())/(X.std())
X_mc.head()
Y = df[['Primary_Supplier_0_org1',\
'Primary_Supplier_1_org2',\
'Primary_Supplier_2_TV']]
Y.head()
Y_mc = (Y-Y.mean())/(Y.std())
Y_mc.head()
from sklearn.cross_decomposition import CCA
ca = CCA()
ca.fit(X_mc, Y_mc)
X_c, Y_c = ca.transform(X_mc, Y_mc)
By default the CCA() function sets n_components=2; you can check the documentation:
Parameters:
n_components int, default=2
Number of components to keep. Should be in [1, min(n_samples, n_features, n_targets)].
For your dataset, X and Y both have 3 columns, so you can go up to n_components = 3. Using an example dataset:
from sklearn.datasets import make_blobs
from sklearn.cross_decomposition import CCA
X, _ = make_blobs(n_samples=10000, centers=3, n_features=6,random_state=0)
y = X[:,3:]
X = X[:,:3]
ca = CCA(n_components = 3)
ca.fit(X, y)
X_c, Y_c = ca.transform(X, y)
print(X_c.shape)
(10000, 3)
print(Y_c.shape)
(10000, 3)
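Applied to your own data, a minimal sketch (assuming the standardized X_mc and Y_mc frames from your snippet above), where the only change is the n_components argument:
from sklearn.cross_decomposition import CCA
# X_mc and Y_mc are the standardized 3-column DataFrames built in the question
ca = CCA(n_components=3)  # keep all three canonical components
ca.fit(X_mc, Y_mc)
X_c, Y_c = ca.transform(X_mc, Y_mc)
print(X_c.shape, Y_c.shape)  # expected: (n_rows, 3) (n_rows, 3)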

Reverse Array in a dataframe

Hi, I am trying to extract data from a netCDF file, but the data is upside down. How can I reverse the DataFrame?
The data I want to extract is the height data from the netCDF file at the points I have in the CSV file. My data:
import numpy as np
from netCDF4 import Dataset
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits.basemap import Basemap
from matplotlib.patches import Path, PathPatch
csv_data = np.loadtxt('CSV with target coordinates',skiprows=1,delimiter=',')
num_el = csv_data[:,0]
lat = csv_data[:,1]
lon = csv_data[:,2]
value = csv_data[:,3]
data = Dataset("elevation Data",'r')
lon_range = data.variables['x_range'][:]
lat_range = data.variables['y_range'][:]
topo_range = data.variables['z_range'][:]
spacing = data.variables['spacing'][:]
dimension = data.variables['dimension'][:]
z = data.variables['z'][:]
lon_num = dimension[0]
lat_num = dimension[1]
etopo_lon = np.linspace(lon_range[0],lon_range[1],dimension[0])
etopo_lat = np.linspace(lat_range[0],lat_range[1],dimension[1])
topo = np.reshape(z, (lat_num, lon_num))
height = np.empty_like(num_el)
desired_lat_idx = np.empty_like(num_el)
desired_lon_idx = np.empty_like(num_el)
for i in range(len(num_el)):
tmp_lat = np.abs(etopo_lat - lat[i]).argmin()
tmp_lon = np.abs(etopo_lon - lon[i]).argmin()
desired_lat_idx[i] = tmp_lat
desired_lon_idx[i] = tmp_lon
height[i] = topo[tmp_lat,tmp_lon]
height[height<-10]=0
print(len(desired_lat_idx))
print(len(desired_lon_idx))
print(len(height))
dfl= pd.DataFrame({
'Latitude' : lat.reshape(-1),
'Longitude': lon.reshape(-1),
'Altitude': height.reshape(-1)
});
print(dfl)
# but the Lat should not be changed here (the dfl must be correct)
df =dfl
lat=np.array(df['Latitude'])
lon=np.array(df['Longitude'])
val=np.array(df['Altitude'])
m = Basemap(projection='robin', lon_0=0, lat_0=0, resolution='l', area_thresh=1000)
m.drawcoastlines(color = 'black')
x,y = m(lon,lat)
colormesh= m.contourf(x,y,val,100, tri=True, cmap = 'terrain')
plt.colorbar(location='bottom',pad=0.04,fraction=0.06)
plt.show()
I have already tried:
lat = csv_data[:,1]
lat = lat * (-1)
But this didn't work.
It's a plotting artifact.
Just do:
colormesh= m.contourf(x,y[::-1],val,100, tri=True, cmap = 'terrain')
y[::-1] reverses the order of the y (latitude) elements of your data (the land-mass outlines are unaffected, and the x longitude coordinates stay the same), and hence flips the plotted values.
I've often had this problem with plotting numpy image data in the past.
Your raw CSV data are unlikely to be flipped themselves (why would they be?). You should try sanity-checking them [I am not a domain expert I'm afraid]! Overlaying an actual coordinate grid can help with this.
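For example, a minimal sketch of such a grid overlay (assuming the Basemap instance m from your code above):
import numpy as np
# draw labelled parallels and meridians on the existing Basemap instance m
m.drawparallels(np.arange(-90., 91., 30.), labels=[1, 0, 0, 0])   # label on the left
m.drawmeridians(np.arange(-180., 181., 60.), labels=[0, 0, 0, 1])  # label on the bottom
If the latitude labels and coastlines disagree with where your contours land, the data (or its latitude axis) is flipped.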
Another way to do it is given here: Reverse Y-Axis in PyPlot
You could also therefore just do
ax = plt.gca()
ax.invert_yaxis()

Masking a variable with lat and lon but needed 3d array

What I am trying to do is mask a value from an nc file with a numpy array, according to a specific location, but it gives me a 1d array and I cannot use this array for plotting. Here is my code.
from netCDF4 import Dataset
import matplotlib.pyplot as plt
import numpy as np
import numpy.ma as ma
file = './sample_data/NSS.AMBX.NK.D08214.S0740.E0931.B5312324.WI.nc'
data = Dataset(file,mode='r')
fcdBT89gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_1'][:])
fcdBT150gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_2'][:])
fcdBT183_1gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_3'][:])
fcdBT183_3gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_4'][:])
fcdBT183_7gHz = np.asarray(data.groups['Data_Fields']['fcdr_brightness_temperature_5'][:])
lats = data.groups['Geolocation_Time_Fields']['latitude'] #Enlem degerleri
lons = data.groups['Geolocation_Time_Fields']['longitude'] #Boylam degerleri
latlar = np.asarray(lats[:]) # Lati
lonlar = np.asarray(lons[:]) # Long
lo = ma.masked_outside(lonlar,105,110)
la = ma.masked_outside(latlar,30,35)
merged_coord=~ma.mask_or(la.mask,lo.mask)
h = plt.plot(fcdBT150gHz[merged_coord])
The output looks like that, but I need latitudes on the x axis, like in this plot:
If you need the shapes of the variables:
lo.shape = (2495, 90)
la.shape = (2495, 90)
fcdBT150gHz[merged_coord].shape = (701,)
Maybe I did not use the right way of masking. The data is available here if needed.
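For reference, here is a minimal stand-alone sketch of the shape issue (hypothetical random stand-ins for latlar, lonlar and fcdBT150gHz, not the real satellite data): Boolean indexing flattens the result to 1D, while a masked array keeps the original 2D shape.
import numpy as np
import numpy.ma as ma
# stand-in 2D arrays with the same shape as the real variables
lat2d = np.random.uniform(25, 40, size=(2495, 90))
lon2d = np.random.uniform(100, 115, size=(2495, 90))
tb = np.random.uniform(150, 300, size=(2495, 90))
inside = (lat2d >= 30) & (lat2d <= 35) & (lon2d >= 105) & (lon2d <= 110)
flat = tb[inside]                    # Boolean indexing -> 1D array, shape (N,)
kept = ma.masked_where(~inside, tb)  # masked array -> still shape (2495, 90)
print(flat.shape, kept.shape)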

2D bin (x,y) and calculate mean of values (c) of 10 deepest data points (z)

For a data set consisting of:
coordinates x, y
depth z
a certain value c
I would like to do the following more efficiently:
bin the data set in 2D bins based on the coordinates (x, y)
take the 10 deepest data points (z) per bin
calculate the mean value of c of these 10 data points per bin
Finally show a 2d heatmap with the calculated mean values.
I have found a working solution, but this takes too long for small bins and/or large data sets.
Is there a more efficient way of achieving the same result?
Current working example
Example dataframe:
import numpy as np
from numpy.random import rand
import pandas as pd
import math
import matplotlib.pyplot as plt
n = 10000
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
Bin the data set:
cell_size = 0.01
nx = math.ceil((max(df['x']) - min(df['x'])) / cell_size)
ny = math.ceil((max(df['y']) - min(df['y'])) / cell_size)
x_range = np.arange(0, nx)
y_range = np.arange(0, ny)
df['xbin'], x_edges = pd.cut(x=df['x'], bins=nx, labels=x_range, retbins=True)
df['ybin'], y_edges = pd.cut(x=df['y'], bins=ny, labels=y_range, retbins=True)
Code that currently takes too long:
df = df.groupby(['xbin', 'ybin']).apply(
    lambda d: d.sort_values('z').head(10).mean())
Update an empty DataFrame for the bins without data and show result:
index = pd.MultiIndex.from_product([x_range, y_range],
                                   names=['xbin', 'ybin'])
tot_df = pd.DataFrame(index=index, columns=['z', 'c'])
tot_df.update(df)
zval = tot_df['c'].astype('float').values
zval = zval.reshape((nx, ny))
zval = zval.T
zval = np.flipud(zval)
extent = [min(x_edges), max(x_edges), min(y_edges), max(y_edges)]
plt.matshow(zval, aspect='auto', extent=extent)
plt.show()
You can use np.searchsorted to bin the rows by x and y, then use groupby to take the 10 deepest values per bin and calculate the means. Since groupby maintains the order within each group, you can sort the values once before assigning the bins. groupby also performs better without apply.
# reusing the imports and n = 10000 from the question above
df = pd.DataFrame({'x': rand(n), 'y': rand(n), 'z': rand(n), 'c': rand(n)})
df = df.sort_values("z", ascending=False)
bins = np.linspace(0, 1, 11)
df["bin_x"] = np.searchsorted(bins, df['x'].values) - 1
df["bin_y"] = np.searchsorted(bins, df['y'].values) - 1
result = df.groupby(["bin_x", "bin_y"]).head(10)
result.groupby(["bin_x", "bin_y"])["c"].mean()
Result
bin_x  bin_y
0      0        0.369531
       1        0.601803
       2        0.554452
       3        0.575464
       4        0.455198
                  ...
9      5        0.469838
       6        0.420772
       7        0.367549
       8        0.379200
       9        0.523083
Name: c, Length: 100, dtype: float64
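To get back to the 2D heatmap from the question, a minimal sketch (reusing result and bins from the snippet above): unstack the grouped means into a 2D grid and show it, which keeps everything vectorized.
import matplotlib.pyplot as plt
means = result.groupby(["bin_x", "bin_y"])["c"].mean()
grid = means.unstack("bin_y")  # rows = bin_x, columns = bin_y; empty bins become NaN
plt.imshow(grid.T.values, origin="lower", aspect="auto",
           extent=[bins[0], bins[-1], bins[0], bins[-1]])
plt.colorbar()
plt.show()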

Pandas finding local max and min

I have a pandas data frame with two columns: one is temperature, the other is time.
I would like to make a third and a fourth column, called min and max. Each of these columns would be filled with NaNs, except where there is a local min or max; there it would have the value of that extremum.
Here is a sample of what the data looks like; essentially, I am trying to identify all the peaks and low points in the figure.
Are there any built in tools with pandas that can accomplish this?
The solution offered by fuglede is great, but if your data is very noisy (like the data in the picture) you will end up with lots of misleading local extrema. I suggest that you use the scipy.signal.argrelextrema() method instead. argrelextrema() has its own limitations, but it has a useful feature: you can specify the number of points to be compared, which acts a bit like a noise-filtering algorithm. For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.signal import argrelextrema
# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
xs.append(xs[-1] * 0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
n = 5 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal,
order=n)[0]]['data']
df['max'] = df.iloc[argrelextrema(df.data.values, np.greater_equal,
order=n)[0]]['data']
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
plt.plot(df.index, df['data'])
plt.show()
Some points:
you might need to check the points afterwards to ensure there are no twin points very close to each other (see the sketch after this list)
you can play with n to filter the noisy points
argrelextrema returns a tuple, and the [0] at the end extracts a numpy array
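A minimal sketch of such a post-check (reusing the df from the example above, with a hypothetical min_gap spacing): keep only extrema that are at least min_gap index positions apart.
def thin_extrema(series, min_gap):
    # drop extrema whose index is within min_gap of the previously kept one
    kept = []
    for i in series.dropna().index:
        if not kept or i - kept[-1] >= min_gap:
            kept.append(i)
    return series.loc[kept]
min_gap = 3  # hypothetical minimum spacing between accepted extrema
thinned_min = thin_extrema(df['min'], min_gap)
thinned_max = thin_extrema(df['max'], min_gap)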
Assuming that the column of interest is labelled data, one solution would be
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
For example:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Generate a noisy AR(1) sample
np.random.seed(0)
rs = np.random.randn(200)
xs = [0]
for r in rs:
xs.append(xs[-1]*0.9 + r)
df = pd.DataFrame(xs, columns=['data'])
# Find local peaks
df['min'] = df.data[(df.data.shift(1) > df.data) & (df.data.shift(-1) > df.data)]
df['max'] = df.data[(df.data.shift(1) < df.data) & (df.data.shift(-1) < df.data)]
# Plot results
plt.scatter(df.index, df['min'], c='r')
plt.scatter(df.index, df['max'], c='g')
df.data.plot()
using Numpy
ser = np.random.randint(-40, 40, 100) # 100 points
peak = np.where(np.diff(ser) < 0)[0]
or
double_difference = np.diff(np.sign(np.diff(ser)))
peak = np.where(double_difference == -2)[0]
using Pandas
ser = pd.Series(np.random.randint(2, 5, 100))
peak_df = ser[(ser.shift(1) < ser) & (ser.shift(-1) < ser)]
peak = peak_df.index
You can do something similar to Foad's .argrelextrema() solution, but with the Pandas .rolling() function:
# Find local peaks
n = 5 #rolling period
local_min_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).min()]
local_max_vals = df.loc[df['data'] == df['data'].rolling(n, center=True).max()]
plt.scatter(local_min_vals.index, local_min_vals, c='r')
plt.scatter(local_max_vals.index, local_max_vals, c='g')
