Determine number of consecutive identical points in a grid - python

I have a dataframe describing points on a 12*8 grid.
I want to count the consecutive red dots (in the vertical direction only) and store the count in a new column (consecutive red); for blue dots the value should be zero.
For example:
X  y  red/blue  consecutive red
1  1  blue      0
1  3  red       3
1  4  red       3
1  2  blue      0
1  5  red       3
9  4  red       5
I already have the data for the first 3 columns. What I have tried so far:
from sklearn.neighbors import BallTree
red_points = df[df['red/blue'] == 'red']
blue_points = df[df['red/blue'] != 'red']
tree = BallTree(red_points[['x','y']], leaf_size=40, metric='minkowski')
distance, index = tree.query(df[['x','y']], k=2)

I am not aware of an existing algorithm for this (there may very well be one), but writing one isn't that hard. I work with numpy because I'm used to it, and because it is easy to accelerate with CUDA and to port to other Python data science tools.
The data (0=blue, 1=red):
import numpy as np
import pandas as pd
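# Generating dummy data for testing
ROWS, COLS = 8, 12  # assumed orientation of the 12*8 grid from the question
X = np.random.randint(2, size=(ROWS, COLS))
# Visualizing
df = pd.DataFrame(data=X)
bg = 'background-color: '
df.style.apply(lambda x: [bg+'red' if v>=1 else bg+'blue' for v in x])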
The algorithm:
result = np.zeros((ROWS, COLS), dtype=int)
for y, x in np.ndindex(X.shape):
    if X[y, x] == 0:
        continue  # blue cells keep a count of 0
    cons = 1  # consecutive in any direction including current
    # Going backward while we can
    prev = y - 1
    while prev >= 0:
        if X[prev, x] == 0:
            break
        cons += 1
        prev -= 1
    # Going forward while we can
    nxt = y + 1
    while nxt <= ROWS - 1:
        if X[nxt, x] == 0:
            break
        cons += 1
        nxt += 1
    result[y, x] = cons
df2 = pd.DataFrame(data=result)
df2.style.apply(lambda x: [bg+'red' if v>=1 else bg+'blue' for v in x])
And the result:
Please note that in numpy the first coordinate is the row index (y in your case) and the second is the column index (x in your case); you can transpose your data if you want to swap to x, y.
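If the data stays in the long format from the question (one row per point), the same count can be sketched directly in pandas. This is only an illustration, not part of the answer above: the function name add_consecutive_red and the lowercase column names x and y are assumptions, and a run is taken to be a set of red points with the same x and y values that increase by exactly 1.

```python
import pandas as pd

def add_consecutive_red(points, color_col="red/blue"):
    """Return a copy of `points` with a 'consecutive red' column (hypothetical helper)."""
    out = points.sort_values(["x", "y"]).copy()
    is_red = out[color_col].eq("red")
    # A new run starts when the column (x) changes, the colour changes,
    # or y does not directly follow the previous y.
    new_run = (
        out["x"].ne(out["x"].shift())
        | is_red.ne(is_red.shift())
        | out["y"].diff().ne(1)
    )
    run_id = new_run.cumsum()
    # Length of the run each row belongs to, broadcast back to every row.
    run_length = is_red.groupby(run_id).transform("count")
    # Blue points get 0, red points get the length of their vertical run.
    out["consecutive red"] = run_length.where(is_red, 0)
    return out.sort_index()  # restore the original row order

points = pd.DataFrame({
    "x": [1, 1, 1, 1, 1, 9],
    "y": [1, 3, 4, 2, 5, 4],
    "red/blue": ["blue", "red", "red", "blue", "red", "red"],
})
result = add_consecutive_red(points)
```

Only the six rows shown in the question are used here, so the point at x=9 forms a run on its own; with the full grid its count would depend on its neighbours.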


Calculate gap between two datasets (pandas, matplotlib, fill_between already used)

I'd like to ask for suggestions on how to calculate the length of the gap between two datasets plotted with matplotlib from a pandas dataframe. Ideally, I would like to have these gap values written in the plot and also, if possible, included in the dataframe.
Here is my simplified example of dataframe:
import pandas as pd
d = {'Mean-1': [0.195842, 0.295069, 0.321345, 0.773725], 'SEM-1': [0.001216, 0.002687, 0.005267, 0.029974], 'Mean-2': [0.143103, 0.250505, 0.305767, 0.960804],'SEM-2': [0.000959, 0.001368, 0.003722, 0.150025], 'Atom Number': [1, 3, 5, 7]}
df = pd.DataFrame(d)
     Mean-1     SEM-1    Mean-2     SEM-2  Atom Number
0  0.195842  0.001216  0.143103  0.000959            1
1  0.295069  0.002687  0.250505  0.001368            3
2  0.321345  0.005267  0.305767  0.003722            5
3  0.773725  0.029974  0.960804  0.150025            7
Then I made a plot with two lines representing Mean-1 and Mean-2, and a shaded area around each line representing the standard error of the mean. This is done for the selected atom numbers.
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'])
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-1']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
What I would like to do next is calculate the gap for each residue. The gap is the white space only, that is, the space where neither the lines nor the shaded areas (SEMs) overlap.
I would also like to know whether I can print the gap values on the plot and save them into a column. Thank you for any suggestions.
It's not a compact solution, but you could try something like this (check the order of things). First calculate all the positions (the upper and lower limits around each y_i).
import numpy as np
df['y1_upper'] = y_1+error_1
df['y1_lower'] = y_1-error_1
df['y2_upper'] = y_2+error_2
df['y2_lower'] = y_2-error_2
which gives
     Mean-1     SEM-1    Mean-2     SEM-2  Atom Number  y1_upper  y1_lower  \
0  0.195842  0.001216  0.143103  0.000959            1  0.197058  0.194626
1  0.295069  0.002687  0.250505  0.001368            3  0.297756  0.292382
2  0.321345  0.005267  0.305767  0.003722            5  0.326612  0.316078
3  0.773725  0.029974  0.960804  0.150025            7  0.803699  0.743751

   y2_upper  y2_lower
0  0.144319  0.141887
1  0.253192  0.247818
2  0.311034  0.300500
3  0.990778  0.930830
The distances (gaps) are calculated differently depending on whether y_1 is above y_2 or vice versa. So use conditions on the upper and lower limits and use linalg.norm to compute the distance.
conditions = [
    (df['y1_lower'] >= df['y2_upper']),
    (df['y1_lower'] < df['y2_upper'])]
choices = [np.linalg.norm(df['y1_lower']-df['y2_upper']), np.linalg.norm(df['y2_lower']-df['y1_upper'])]
df['dist'] = np.select(conditions, choices)
This gives
     Mean-1     SEM-1    Mean-2     SEM-2  Atom Number  y1_upper  y1_lower  \
0  0.195842  0.001216  0.143103  0.000959            1  0.197058  0.194626
1  0.295069  0.002687  0.250505  0.001368            3  0.297756  0.292382
2  0.321345  0.005267  0.305767  0.003722            5  0.326612  0.316078
3  0.773725  0.029974  0.960804  0.150025            7  0.803699  0.743751

   y2_upper  y2_lower      dist
0  0.144319  0.141887  0.255175
1  0.253192  0.247818  0.255175
2  0.311034  0.300500  0.255175
3  0.990778  0.930830  0.149605
As I said, check the order, but this is a possible solution.
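Note that np.linalg.norm applied to a whole column returns a single number, which is why the dist column above repeats the same value in several rows. A row-by-row gap can be sketched as follows. This is an assumption-laden illustration rather than the answer's method: it takes the second band's error from 'SEM-2' (the snippets in this thread assign error_2 from 'SEM-1'), and it defines the gap as the distance between the lower edge of the higher band and the upper edge of the lower band, set to 0 where the bands overlap. The column name gap is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Mean-1": [0.195842, 0.295069, 0.321345, 0.773725],
    "SEM-1": [0.001216, 0.002687, 0.005267, 0.029974],
    "Mean-2": [0.143103, 0.250505, 0.305767, 0.960804],
    "SEM-2": [0.000959, 0.001368, 0.003722, 0.150025],
    "Atom Number": [1, 3, 5, 7],
})

# Edges of the two shaded bands (assumption: band 2 uses SEM-2).
upper_1 = df["Mean-1"] + df["SEM-1"]
lower_1 = df["Mean-1"] - df["SEM-1"]
upper_2 = df["Mean-2"] + df["SEM-2"]
lower_2 = df["Mean-2"] - df["SEM-2"]

# Lower edge of whichever band is higher, minus upper edge of whichever is lower.
# The value is negative when the bands overlap, so clip it to 0 (no white space).
band_gap = np.maximum(lower_1, lower_2) - np.minimum(upper_1, upper_2)
df["gap"] = band_gap.clip(lower=0)
```

Because the expression is symmetric in the two bands, no separate condition is needed for which line is on top.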
IIUC, do you want something like this:
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'], figsize=(15,8))
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-1']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
ax.fill_between(df['Atom Number'], y_1+error_1, y_2-error_2, alpha=.2, edgecolor='k', facecolor='blue')
for i in range(len(x)):
    gap = y_1[i]+error_1[i] - y_2[i]-error_2[i]
    ylabel = min(y_1[i], y_2[i]) + abs(gap) / 2
    _ = ax.annotate(f'{gap:0.4f}', xy=(x[i],ylabel), xytext=(x[i]-.14,y_1[i]+gap/abs(gap)*.2), arrowprops=dict(arrowstyle="-"))

Create and pass random values to Pandas dataframes with hard bounds

I am trying to simulate a pandas dataframe, using random values, with a combination of hard upper/lower values. I am using np.random.normal, as the original data is fairly normally distributed.
The code I am using to create the dataframe is:
df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827,93),
    "Sun": np.random.normal(1.615054,2.053996,93),
    "Rel Hum": np.random.normal(87.153118,5.529958,93)
})
In the above example, I would like a hard lower and upper bound for all three values. For example, Rel Hum could not go below 0 or above 100. (Edit: the three values would not share the same bounds, either upper or lower. Temp can go negative, while Sun would be bounded at 0 and 24.)
How can I enforce these bounds while keeping a roughly normal distribution, and pass the values to the dataframe at the same time?
Edit : Note that this samples from a truncated normal for the given parameters and will most likely not be truly normally distributed, sorry for the confusion.
Use scipy truncated normal defined as :
"The standard form of this distribution is a standard normal truncated to the range [a, b]"
from scipy.stats import truncnorm
low_bound = 0
upper_bound = 100
mean = 8
std = 2
a, b = (low_bound - mean) / std, (upper_bound - mean) / std
n_samples = 1000
samples = truncnorm.rvs(a = a, b = b,
                        loc = mean, scale = std,
                        size = n_samples)
Thanks to ALollz for the corrections !
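Applied to the dataframe from the question, the same idea might look like the sketch below. The helper name bounded_normal is made up for this example, and the Temp bounds of -30 and 40 are purely an assumption, since the question only says that Temp can go negative.

```python
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

def bounded_normal(mean, std, low, high, size, rng):
    """Draw from a normal(mean, std) truncated to [low, high] (hypothetical helper)."""
    # truncnorm expects the bounds in units of standard deviations from the mean.
    a, b = (low - mean) / std, (high - mean) / std
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size, random_state=rng)

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Temp": bounded_normal(6.809892, 2.975827, -30, 40, 93, rng),  # assumed bounds
    "Sun": bounded_normal(1.615054, 2.053996, 0, 24, 93, rng),
    "Rel Hum": bounded_normal(87.153118, 5.529958, 0, 100, 93, rng),
})
```

Each column gets its own bounds, and every value is guaranteed to fall inside them without any clipping afterwards.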
Try the clip() function to bound the values, for example:
>>> df[df['Rel Hum']>100].head()
Temp Sun Rel Hum
32 4.734005 4.102939 100.064077
>>> df['Rel Hum'] = df['Rel Hum'].clip(0, 100) # values outside the bounds are set to 0 or 100
>>> df.head()
Temp Sun Rel Hum
0 9.714943 6.255931 93.105135
1 0.551001 3.063972 85.923184
2 7.780588 3.580514 79.124139
3 3.766066 3.684801 84.543149
4 8.541507 -3.066196 83.598925
>>> df[df['Rel Hum']>100].head()
Empty DataFrame
Columns: [Temp, Sun, Rel Hum]
Index: []
Just do a clip:
df = pd.DataFrame({
    "Temp": np.random.normal(6.809892, 2.975827,93),
    "Sun": np.random.normal(1.615054,2.053996,93),
    "Rel Hum": np.random.normal(87.153118,5.529958,93)
})
And plot:
You can clip, though this leaves you with a spike at the edges:
import pandas as pd
import numpy as np
N = 10**5
df = pd.DataFrame({"Rel Hum": np.random.normal(87.153118,5.529958, N)})
df['Rel Hum'].clip(lower=0, upper=100).plot(kind='hist', bins=np.arange(60,101,1))
If you want to avoid that spike, redraw the out-of-bounds points until everything is within bounds:
while not df['Rel Hum'].between(0, 100).all():
    m = ~df['Rel Hum'].between(0, 100)
    df.loc[m, 'Rel Hum'] = np.random.normal(87.153118, 5.529958, m.sum())
df['Rel Hum'].plot(kind='hist', bins=np.arange(60,101,1))
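Since each column in the question has its own bounds, the redraw loop can be wrapped in a small function. This is only a sketch: the name redraw_normal and the seeded generator are assumptions made for the example.

```python
import numpy as np

def redraw_normal(mean, std, low, high, size, rng):
    """Draw normal samples and redraw any that fall outside [low, high] (hypothetical helper)."""
    values = rng.normal(mean, std, size)
    out_of_bounds = (values < low) | (values > high)
    while out_of_bounds.any():
        # Replace only the offending samples with fresh draws.
        values[out_of_bounds] = rng.normal(mean, std, out_of_bounds.sum())
        out_of_bounds = (values < low) | (values > high)
    return values

rng = np.random.default_rng(42)
sun = redraw_normal(1.615054, 2.053996, 0, 24, 93, rng)
rel_hum = redraw_normal(87.153118, 5.529958, 0, 100, 93, rng)
```

The loop can take many passes when the bounds leave only a small part of the distribution, so this approach suits cases like these where most draws already fall inside the bounds.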

2D bin (x,y) and calculate mean of values (c) of 10 deepest data points (z)

For a data set consisting of:
coordinates x, y
depth z
a certain value c
I would like to do the following more efficiently:
bin the data set in 2D bins based on the coordinates (x, y)
take the 10 deepest data points (z) per bin
calculate the mean value of c of these 10 data points per bin
Finally show a 2d heatmap with the calculated mean values.
I have found a working solution, but this takes too long for small bins and/or large data sets.
Is there a more efficient way of achieving the same result?
Current working example
Example dataframe:
import numpy as np
from numpy.random import rand
import pandas as pd
import math
import matplotlib.pyplot as plt
n = 10000
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
Bin the data set:
cell_size = 0.01
nx = math.ceil((max(df['x']) - min(df['x'])) / cell_size)
ny = math.ceil((max(df['y']) - min(df['y'])) / cell_size)
x_range = np.arange(0, nx)
y_range = np.arange(0, ny)
df['xbin'], x_edges = pd.cut(x=df['x'], bins=nx, labels=x_range, retbins=True)
df['ybin'], y_edges = pd.cut(x=df['y'], bins=ny, labels=y_range, retbins=True)
Code that currently takes too long:
df = df.groupby(['xbin', 'ybin']).apply(
lambda d: d.sort_values('z').head(10).mean())
Update an empty DataFrame for the bins without data and show result:
index = pd.MultiIndex.from_product([x_range, y_range],
                                   names=['xbin', 'ybin'])
tot_df = pd.DataFrame(index=index, columns=['z', 'c'])
tot_df.update(df)
zval = tot_df['c'].astype('float').values
zval = zval.reshape((nx, ny))
zval = zval.T
zval = np.flipud(zval)
extent = [min(x_edges), max(x_edges), min(y_edges), max(y_edges)]
plt.matshow(zval, aspect='auto', extent=extent)
You can use np.searchsorted to bin the rows by x and y, and then use groupby to take the 10 deepest values and calculate their mean. Because groupby maintains the row order within each group, you can sort the values by z before binning. groupby will also perform better without apply.
df = pd.DataFrame({'x':rand(n), 'y':rand(n), 'z':rand(n), 'c':rand(n)})
df = df.sort_values("z", ascending=False)
bins = np.linspace(0, 1, 11)
df["bin_x"] = np.searchsorted(bins, df['x'].values) - 1
df["bin_y"] = np.searchsorted(bins, df['y'].values) - 1
result = df.groupby(["bin_x", "bin_y"]).head(10)
result.groupby(["bin_x", "bin_y"])["c"].mean()
bin_x  bin_y
0      0        0.369531
       1        0.601803
       2        0.554452
       3        0.575464
       4        0.455198
                  ...
9      5        0.469838
       6        0.420772
       7        0.367549
       8        0.379200
       9        0.523083
Name: c, Length: 100, dtype: float64
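To get from this result to the 2D array needed for a heatmap, the grouped means can be unstacked into a grid. The sketch below is an illustration under a few assumptions: a larger z means deeper (as in the sort above), the data lie in [0, 1], and the names n_bins, edges and grid are made up for the example. Bins without data come out as NaN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, n_bins = 10_000, 10
df = pd.DataFrame(rng.random((n, 4)), columns=["x", "y", "z", "c"])

# Assign every point to a bin index in 0..n_bins-1 along each axis.
edges = np.linspace(0, 1, n_bins + 1)
for axis in ("x", "y"):
    idx = np.searchsorted(edges, df[axis].to_numpy(), side="right") - 1
    df["bin_" + axis] = np.clip(idx, 0, n_bins - 1)

# Sort once by depth, then keep the first 10 rows of every bin.
deepest = df.sort_values("z", ascending=False).groupby(["bin_x", "bin_y"]).head(10)
mean_c = deepest.groupby(["bin_x", "bin_y"])["c"].mean()

# Rows are y bins, columns are x bins; reindex so empty bins appear as NaN.
grid = mean_c.unstack("bin_x").reindex(index=range(n_bins), columns=range(n_bins))
```

The resulting grid can then be passed to something like plt.imshow(grid.to_numpy(), origin='lower') to draw the heatmap.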

switch axis for a data cube (fits file)

I have a problem that I could not find an answer to.
I'm trying to create a data cube in Python whose three axes are (RA, DEC, z), that is, two sky positions and redshift.
I think my code for generating the cube works. I define the cube as:
cube = np.zeros([int(size_x),int(size_y),int(Nchannel)])
where x and y are pixel coordinates and the redshift is sliced into channels. I then fill this cube with the intensities of some lines. At the end I define my .fits header as follows:
hdr = fits.Header()
hdr['EQUINOX'] = 2000
hdr['CRPIX1'] = round(size_ra*3600./pix_size/2.)
hdr['CRPIX2'] = round(size_dec*3600./pix_size/2.)
hdr['CRPIX3'] = 0
hdr['CRVAL1'] = ra0
hdr['CRVAL2'] = dec0
hdr['CRVAL3'] = z_min
hdr['CD1_1'] = pix_size/3600.
hdr['CD1_2'] = 0.
hdr['CD2_1'] = 0.
hdr['CD2_2'] = pix_size/3600.
hdr['CTYPE1'] = "RA---TAN"
hdr['CTYPE2'] = "DEC--TAN"
hdr['CTYPE3'] = "Z"
hdr['BUNIT'] = "Jy/pixel"
And here is the problem: my cube.fits has its axes in the wrong order. When I open it in ds9, the z-axis is not the redshift z...
I suspect a bad header, but where can I specify the axis order in the FITS header?
The axes are indeed inverted, FITS uses the Fortran convention (column-major order) whereas Python/Numpy uses the C convention (row-major order).
So for your cube you need to define the axes as (z, y, x):
In [1]: import numpy as np
In [2]: from astropy.io import fits
In [3]: fits.ImageHDU(data=np.zeros((5,4,3))).header
XTENSION= 'IMAGE ' / Image extension
BITPIX = -64 / array data type
NAXIS = 3 / number of array dimensions
NAXIS1 = 3
NAXIS2 = 4
NAXIS3 = 5
PCOUNT = 0 / number of parameters
GCOUNT = 1 / number of groups
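If the cube has already been filled in (x, y, z) order, one option is to reorder its axes before writing it out. The sketch below uses only numpy and small made-up sizes; the names cube_fits and n_channel are assumptions for the example.

```python
import numpy as np

size_x, size_y, n_channel = 3, 4, 5

# Cube indexed as in the question: cube[x, y, channel].
cube = np.zeros((size_x, size_y, n_channel))
cube[2, 1, 4] = 1.0  # mark one voxel so the reordering can be checked

# Reverse the axis order to (channel, y, x), which is what numpy needs
# so that the written file has NAXIS1 = x, NAXIS2 = y, NAXIS3 = channel.
cube_fits = np.ascontiguousarray(cube.transpose(2, 1, 0))
```

The marked voxel is now at cube_fits[4, 1, 2], and cube_fits is the array to hand to the FITS writer together with the header from the question.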

Use diagonal fill to eliminate 8-connectivity of the background in Python (similar to bwmorph diag in MATLAB)

I'm looking for a way to connect 8-connected pixels in Python, similar to MATLAB's bwmorph 'diag' function:
BW = bwmorph(BW, 'diag')
For example,
0 1 0      0 1 0
1 0 0  ->  1 1 0
0 0 0      0 0 0
Thanks in advance!
That works, thanks! Here's the python code:
import numpy as np
from scipy import ndimage

def bwmorphDiag(bw):
    # filter for 8-connectivity of the background
    f = np.array(([1, -1, 0], [-1, 1, 0], [0, 0, 0]), dtype=int)
    # initialize result with original image
    bw = bw.astype(int)
    res2 = bw.copy().astype(bool)
    for ii in range(4):  # all orientations
        # add results where sum equals 2 -> two background pixels on the
        # diagonal with 2 foreground pixels on the crossing mini-anti-diagonal
        res2 = res2 | (ndimage.convolve(1 - bw, f) == 2)  # 1 - bw is the background mask
        f = np.rot90(f)  # rotate filter to next orientation
    return res2
You can achieve the same result using simple image filtering. I did it in MATLAB, but it should be straightforward to do in Python as well:
% random binary image
bw = rand(50) > 0.5;
% result using bwmorph(bw,'diag')
res1 = bwmorph(bw,'diag');
% filter for 8-connectivity of the background
f = [1 -1 0;-1 1 0;0 0 0];
% initialize result with original image
res2 = bw;
for ii = 1:4 % all orientations
    % add results where sum equals 2 -> two background pixels on the
    % diagonal with 2 foreground pixels on the crossing mini-anti-diagonal
    res2 = res2 | ( imfilter(double(~bw),f) == 2 );
    f = rot90(f); % rotate filter to next orientation
end
isequal(res2,res1) % yes
I was actually looking for the same python equivalent, the bwmorph('diag') of MATLAB. But since I couldn't find it I eventually decided to code it. Please check the MATLAB help for bwmorph and the diag option to get further info about what it does.
import numpy as np
import scipy.ndimage.morphology as smorph
import skimage.morphology as skm
class bwmorph:
    @staticmethod
    def diag(imIn):
        strl = np.array([
        bwIm = np.zeros(imIn.shape,dtype=int)
        imIn = np.array(imIn)
        imIn = imIn/np.max(np.max(imIn)) #normalizing to be added later
        for i in range(7):
            bwIm = bwIm + smorph.binary_hit_or_miss(imIn,strl[i,:,:])
        bwIm = ((bwIm>0) + imIn)>0
        return bwIm # output is boolean
I used 'hit or miss' transform, with the structural element 'strl' defined at the beginning. I guess that's a classic way to do it.
Please watch the @staticmethod if you're running it on an older Python.
Usage example would be bwmorph().diag(BinaryImage)
All the best ;)
