Two colour scatter plot in R or in python - python

I have a dataset of three columns and n number of rows. column 1 contains name, column 2 value1, and column 3 value2 (rank2).
I want to plot a scatter plot with the outlier values displaying names.
The R commands I am using in are:
tiff('scatterplot.tiff')
data<-read.table("scatterplot_data", header=T)
attach(data)
reg1<-lm(A~B)
plot(A,B,col="red")
abline(reg1)
outliers<-data[which(2^(data[,2]-data[,3]) >= 4 | 2^(data[,2]-data[,3]) <=0.25),]
text(outliers[,2], outliers[,3],labels=outliers[,1],cex=0.50)
dev.off()
and I get a figure like this:
What I want is the labels on the lower half should be of one colour and the labels in upper half should be of another colour say green and red respectively.
Any suggestions, or adjustment in the commands?

You already have a logical test that works to your satisfaction. Just use it in the color spec to text:
text(outliers[,2], outliers[,3],labels=outliers[,1],cex=0.50,
col=c("blue", "green")[
which(2^(data[,2]-data[,3]) >= 4 , 2^(data[,2]-data[,3]) <=0.25)] )
It's untested of course because you offered no test case, but my reasoning is that the which() function should return 1 for the differences >= 4, and 2 for the ones <= 0.25, and integer(0) for all the others and that this should give you the proper alignment of color choices with the 'outliers' vector.

Using python, matplotlib (pylab) to plot, and scipy, numpy to fit data. The trick with numpy is to create a index or mask to filter out the results that you want.
EDIT: Want to selectively color the top and bottom outliers? It's a simple combination of both masks that we created:
import scipy as sci
import numpy as np
import pylab as plt
# Create some data
N = 1000
X = np.random.normal(5,1,size=N)
Y = X + np.random.normal(0,5.5,size=N)/np.random.normal(5,.1)
NAMES = ["foo"]*1000 # Customize names here
# Fit a polynomial
(a,b)=sci.polyfit(X,Y,1)
# Find all points above the line
idx = (X*a + b) < Y
# Scatter according to that index
plt.scatter(X[idx],Y[idx], color='r')
plt.scatter(X[~idx],Y[~idx], color='g')
# Find top 10 outliers
err = ((X*a+b) - Y) ** 2
idx_L = np.argsort(err)[-10:]
for i in idx_L:
plt.text(X[i], Y[i], NAMES[i])
# Color the outliers purple or black
top = idx_L[idx[idx_L]]
bot = idx_L[~idx[idx_L]]
plt.scatter(X[top],Y[top], color='purple')
plt.scatter(X[bot],Y[bot], color='black')
XF = np.linspace(0,10,1000)
plt.plot(XF, XF*a + b, 'k--')
plt.axis('tight')
plt.show()

Related

How can I cut a piece away from a plot and set the point I need to zero?

In my work I have the task to read in a CSV file and do calculations with it. The CSV file consists of 9 different columns and about 150 lines with different values acquired from sensors. First the horizontal acceleration was determined, from which the distance was derived by double integration. This represents the lower plot of the two plots in the picture. The upper plot represents the so-called force data. The orange graph shows the plot over the 9th column of the CSV file and the blue graph shows the plot over the 7th column of the CSV file.
As you can see I have drawn two vertical lines in the lower plot in the picture. These lines represent the x-value, which in the upper plot is the global minimum of the orange function and the intersection with the blue function. Now I want to do the following, but I need some help: While I want the intersection point between the first vertical line and the graph to be (0,0), i.e. the function has to be moved down. How do I achieve this? Furthermore, the piece of the function before this first intersection point (shown in purple) should be omitted, so that the function really only starts at this point. How can I do this?
In the following picture I try to demonstrate how I would like to do that:
If you need my code, here you can see it:
import numpy as np
import matplotlib.pyplot as plt
import math as m
import loaddataa as ld
import scipy.integrate as inte
from scipy.signal import find_peaks
import pandas as pd
import os
# Loading of the values
print(os.path.realpath(__file__))
a,b = os.path.split(os.path.realpath(__file__))
print(os.chdir(a))
print(os.chdir('..'))
print(os.chdir('..'))
path=os.getcwd()
path=path+"\\Data\\1 Fabienne\\Test1\\left foot\\50cm"
print(path)
dataListStride = ld.loadData(path)
indexStrideData = 0
strideData = dataListStride[indexStrideData]
#%%Calculation of the horizontal acceleration
def horizontal(yAngle, yAcceleration, xAcceleration):
a = ((m.cos(m.radians(yAngle)))*yAcceleration)-((m.sin(m.radians(yAngle)))*xAcceleration)
return a
resultsHorizontal = list()
for i in range (len(strideData)):
strideData_yAngle = strideData.to_numpy()[i, 2]
strideData_xAcceleration = strideData.to_numpy()[i, 4]
strideData_yAcceleration = strideData.to_numpy()[i, 5]
resultsHorizontal.append(horizontal(strideData_yAngle, strideData_yAcceleration, strideData_xAcceleration))
resultsHorizontal.insert(0, 0)
#plt.plot(x_values, resultsHorizontal)
#%%
#x-axis "convert" into time: 100 Hertz makes 0.01 seconds
scale_factor = 0.01
x_values = np.arange(len(resultsHorizontal)) * scale_factor
#Calculation of the global high and low points
heel_one=pd.Series(strideData.iloc[:,7])
plt.scatter(heel_one.idxmax()*scale_factor,heel_one.max(), color='red')
plt.scatter(heel_one.idxmin()*scale_factor,heel_one.min(), color='blue')
heel_two=pd.Series(strideData.iloc[:,9])
plt.scatter(heel_two.idxmax()*scale_factor,heel_two.max(), color='orange')
plt.scatter(heel_two.idxmin()*scale_factor,heel_two.min(), color='green')#!
#Plot of force data
plt.plot(x_values[:-1],strideData.iloc[:,7]) #force heel
plt.plot(x_values[:-1],strideData.iloc[:,9]) #force toe
# while - loop to calculate the point of intersection with the blue function
i = heel_one.idxmax()
while strideData.iloc[i,7] > strideData.iloc[i,9]:
i = i-1
# Length calculation between global minimum orange function and intersection with blue function
laenge=(i-heel_two.idxmin())*scale_factor
print(laenge)
#%% Integration of horizontal acceleration
velocity = inte.cumtrapz(resultsHorizontal,x_values)
plt.plot(x_values[:-1], velocity)
#%% Integration of the velocity
s = inte.cumtrapz(velocity, x_values[:-1])
plt.plot(x_values[:-2],s)
I hope it's clear what I want to do. Thanks for helping me!
I didn't dig all the way through your code, but the following tricks may be useful.
Say you have x and y values:
x = np.linspace(0,3,100)
y = x**2
Now, you only want the values corresponding to, say, .5 < x < 1.5. First, create a boolean mask for the arrays as follows:
mask = np.logical_and(.5 < x, x < 1.5)
(If this seems magical, then run x < 1.5 in your interpreter and observe the results).
Then use this mask to select your desired x and y values:
x_masked = x[mask]
y_masked = y[mask]
Then, you can translate all these values so that the first x,y pair is at the origin:
x_translated = x_masked - x_masked[0]
y_translated = y_masked - y_masked[0]
Is this the type of thing you were looking for?

How to make 3d (4 variable) ternary (pyramid) plot in R or Python?

I have data that are multidimensional compositional data (all dimensions sum to 1 or 100). I have learned how to use three of the variables to create a 2d ternary plot.
I would like to add a fourth dimension such that my plot looks like this.
I am willing to use python or R. I am using pyr2 to create the ternary plots in python using R right now, but just because that's an easy solution. If the ternary data could be transformed into 3d coordinates a simple wire plot could be used.
This post shows how 3d compositional data can be transformed into 2d data so that normal plotting method can be used. One solution would be to do the same thing in 3d.
Here is some sample Data:
c1 c2 c3 c4
0 0.082337 0.097583 0.048608 0.771472
1 0.116490 0.065047 0.066202 0.752261
2 0.114884 0.135018 0.073870 0.676229
3 0.071027 0.097207 0.070959 0.760807
4 0.066284 0.079842 0.103915 0.749959
5 0.016074 0.074833 0.044532 0.864561
6 0.066277 0.077837 0.058364 0.797522
7 0.055549 0.057117 0.045633 0.841701
8 0.071129 0.077620 0.049066 0.802185
9 0.089790 0.086967 0.083101 0.740142
10 0.084430 0.094489 0.039989 0.781093
Well, I solved this myself using a wikipedia article, an SO post, and some brute force. Sorry for the wall of code, but you have to draw all the plot outlines and labels and so forth.
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d, Axes3D
from itertools import combinations
import pandas as pd
def plot_ax(): #plot tetrahedral outline
verts=[[0,0,0],
[1,0,0],
[0.5,np.sqrt(3)/2,0],
[0.5,0.28867513, 0.81649658]]
lines=combinations(verts,2)
for x in lines:
line=np.transpose(np.array(x))
ax.plot3D(line[0],line[1],line[2],c='0')
def label_points(): #create labels of each vertices of the simplex
a=(np.array([1,0,0,0])) # Barycentric coordinates of vertices (A or c1)
b=(np.array([0,1,0,0])) # Barycentric coordinates of vertices (B or c2)
c=(np.array([0,0,1,0])) # Barycentric coordinates of vertices (C or c3)
d=(np.array([0,0,0,1])) # Barycentric coordinates of vertices (D or c3)
labels=['a','b','c','d']
cartesian_points=get_cartesian_array_from_barycentric([a,b,c,d])
for point,label in zip(cartesian_points,labels):
if 'a' in label:
ax.text(point[0],point[1]-0.075,point[2], label, size=16)
elif 'b' in label:
ax.text(point[0]+0.02,point[1]-0.02,point[2], label, size=16)
else:
ax.text(point[0],point[1],point[2], label, size=16)
def get_cartesian_array_from_barycentric(b): #tranform from "barycentric" composition space to cartesian coordinates
verts=[[0,0,0],
[1,0,0],
[0.5,np.sqrt(3)/2,0],
[0.5,0.28867513, 0.81649658]]
#create transformation array vis https://en.wikipedia.org/wiki/Barycentric_coordinate_system
t = np.transpose(np.array(verts))
t_array=np.array([t.dot(x) for x in b]) #apply transform to all points
return t_array
def plot_3d_tern(df,c='1'): #use function "get_cartesian_array_from_barycentric" to plot the scatter points
#args are b=dataframe to plot and c=scatter point color
bary_arr=df.values
cartesian_points=get_cartesian_array_from_barycentric(bary_arr)
ax.scatter(cartesian_points[:,0],cartesian_points[:,1],cartesian_points[:,2],c=c)
#Create Dataset 1
np.random.seed(123)
c1=np.random.normal(8,2.5,20)
c2=np.random.normal(8,2.5,20)
c3=np.random.normal(8,2.5,20)
c4=[100-x for x in c1+c2+c3] #make sur ecomponents sum to 100
#df unecessary but that is the format of my real data
df1=pd.DataFrame(data=[c1,c2,c3,c4],index=['c1','c2','c3','c4']).T
df1=df1/100
#Create Dataset 2
np.random.seed(1234)
c1=np.random.normal(16,2.5,20)
c2=np.random.normal(16,2.5,20)
c3=np.random.normal(16,2.5,20)
c4=[100-x for x in c1+c2+c3]
df2=pd.DataFrame(data=[c1,c2,c3,c4],index=['c1','c2','c3','c4']).T
df2=df2/100
#Create Dataset 3
np.random.seed(12345)
c1=np.random.normal(25,2.5,20)
c2=np.random.normal(25,2.5,20)
c3=np.random.normal(25,2.5,20)
c4=[100-x for x in c1+c2+c3]
df3=pd.DataFrame(data=[c1,c2,c3,c4],index=['c1','c2','c3','c4']).T
df3=df3/100
fig = plt.figure()
ax = Axes3D(fig) #Create a 3D plot in most recent version of matplot
plot_ax() #call function to draw tetrahedral outline
label_points() #label the vertices
plot_3d_tern(df1,'b') #call function to plot df1
plot_3d_tern(df2,'r') #...plot df2
plot_3d_tern(df3,'g') #...
The accepted answer explains how to do this in python but the question was also asking about R.
I've provided an answer in this thread on how to do this 'manually' in R.
Otherwise, you can use the klaR package directly for this:
df <- matrix(c(
0.082337, 0.097583, 0.048608, 0.771472,
0.116490, 0.065047, 0.066202, 0.752261,
0.114884, 0.135018, 0.073870, 0.676229,
0.071027, 0.097207, 0.070959, 0.760807,
0.066284, 0.079842, 0.103915, 0.749959,
0.016074, 0.074833, 0.044532, 0.864561,
0.066277, 0.077837, 0.058364, 0.797522,
0.055549, 0.057117, 0.045633, 0.841701,
0.071129, 0.077620, 0.049066, 0.802185,
0.089790, 0.086967, 0.083101, 0.740142,
0.084430, 0.094489, 0.039989, 0.781094
), byrow = TRUE, nrow = 11, ncol = 4)
# install.packages(c("klaR", "scatterplot3d"))
library(klaR)
#> Loading required package: MASS
quadplot(df)
Created on 2020-08-14 by the reprex package (v0.3.0)

Matplotlib graphing distribution with two colors

The goal here is to color value above a certain threshold into one color and values below this threshold into another color. The code below tries to just separate it into two histographs but it only looks balanced if the threshold is at 50%. I'm assuming I must play around with the discreetlevel variable.
finalutilityrange is some vector with a bunch of values(you must generate it to test the code), which I am trying to graph. The value deter is the value that determines whether they will be blue or red. discreetlevel is just the amount of bins I would want.
import random
import numpy as np
import matplotlib.pyplot as plt
discreetlevel = 10
deter = 2
for x in range(0,len(finalutilityrange)):
if finalutilityrange[x-1]>=deter:
piraterange.append(finalutilityrange[x-1])
else:
nonpiraterange.append(finalutilityrange[x-1])
plt.hist(piraterange,bins=discreetlevel,normed=False,cumulative=False,color = 'b')
plt.hist(nonpiraterange,bins=discreetlevel),normed=False,cumulative=False,color = 'r')
plt.title("Histogram")
plt.xlabel("Utlity")
plt.ylabel("Probability")
plt.show()
This solution is a bit more complex than #user2699's. I am just presenting it for completeness. You have full control over the patch objects that hist returns, so if you can ensure that the threshold you are using is exactly on a bin edge, it is easy to change to color of selected patches. You can do this because hist can accept a sequence of bin edges as the bins parameter.
import numpy as np
from matplotlib import pyplot as plt
# Make sample data
finalutilityrange = np.random.randn(100)
discreetlevel = 10
deter = 0.2
# Manually create `discreetlevel` bins anchored to `deter`
binsAbove = round(discreetlevel * np.count_nonzero(finalutilityrange > deter) / finalutilityrange.size)
binsBelow = discreetlevel - binsAbove
binwidth = max((finalutilityrange.max() - deter) / binsAbove,
(deter - finalutilityrange.min()) / binsBelow)
bins = np.concatenate([
np.arange(deter - binsBelow * binwidth, deter, binwidth),
np.arange(deter, deter + (binsAbove + 0.5) * binwidth, binwidth)
])
# Use the bins to make a single histogram
h, bins, patches = plt.hist(finalutilityrange, bins, color='b')
# Change the appropriate patches to red
plt.setp([p for p, b in zip(patches, bins) if b >= deter], color='r')
The result is a homogenous histogram with bins of different colors:
The bins may be a tad wider than if you did not anchor to deter. Either the first or last bin will generally go a little past the edge of the data.
This answer doesn't address your code since it isn't self-contained, but for what you're trying to do the default histogram should work (assuming numpy/pyplot is loaded)
x = randn(100)
idx = x < 0.2 # Threshold to separate values
hist([x[idx], x[~idx]], color=['b', 'r'])
Explanation:
first line just generates some random data to test,
creates an index for where the data is below some threshold, this can be negated with ~ to find where it's above the threshold
Last line plots the histogram. The command takes a list of separate groups to plot, which doesn't make a big difference here but if normed=True it will
There's more the hist plot can do, so look over the documentation before you accidentally implement it yourself.
Just as above do:
x = np.random.randn(100)
threshold_x = 0.2 # Threshold to separate values
x_lower, x_upper = (
[_ for _ in x if _ < threshold_x],
[_ for _ in x if _ >= threshold_x]
)
hist([x_lower, x_upper], color=['b', 'r'])

Unequal width binned histogram in python

I have an array with probability values stored in it. Some values are 0. I need to plot a histogram such that there are equal number of elements in each bin. I tried using matplotlibs hist function but that lets me decide number of bins. How do I go about plotting this?(Normal plot and hist work but its not what is needed)
I have 10000 entries. Only 200 have values greater than 0 and lie between 0.0005 and 0.2. This distribution isnt even as 0.2 only one element has whereas 2000 approx have value 0.0005. So plotting it was an issue as the bins had to be of unequal width with equal number of elements
The task does not make much sense to me, but the following code does, what i understood as the thing to do.
I also think the last lines of the code are what you really wanted to do. Using different bin-widths to improve visualization (but don't target the distribution of equal amount of samples within each bin)! I used astroml's hist with method='blocks' (astropy supports this too)
Code
# Python 3 -> beware the // operator!
import numpy as np
import matplotlib.pyplot as plt
from astroML import plotting as amlp
N_VALUES = 1000
N_BINS = 100
# Create fake data
prob_array = np.random.randn(N_VALUES)
prob_array /= np.max(np.abs(prob_array),axis=0) # scale a bit
# Sort array
prob_array = np.sort(prob_array)
# Calculate bin-borders,
bin_borders = [np.amin(prob_array)] + [prob_array[(N_VALUES // N_BINS) * i] for i in range(1, N_BINS)] + [np.amax(prob_array)]
print('SAMPLES: ', prob_array)
print('BIN-BORDERS: ', bin_borders)
# Plot hist
counts, x, y = plt.hist(prob_array, bins=bin_borders)
plt.xlim(bin_borders[0], bin_borders[-1] + 1e-2)
print('COUNTS: ', counts)
plt.show()
# And this is, what i think, what you really want
fig, (ax1, ax2) = plt.subplots(2)
left_blob = np.random.randn(N_VALUES/10) + 3
right_blob = np.random.randn(N_VALUES) + 110
both = np.hstack((left_blob, right_blob)) # data is hard to visualize with equal bin-widths
ax1.hist(both)
amlp.hist(both, bins='blocks', ax=ax2)
plt.show()
Output

creating a color coded time chart using colorbar and colormaps in python

I'm trying to make a time tracking chart based on a daily time tracking file that I used. I wrote code that crawls through my files and generates a few lists.
endTimes is a list of times that a particular activity ends in minutes going from 0 at midnight the first day of the month to however many minutes are in a month.
labels is a list of labels for the times listed in endTimes. It is one shorter than endtimes since the trackers don't have any data about before 0 minute. Most labels are repeats.
categories contains every unique value of labels in order of how well I regard that time.
I want to create a colorbar or a stack of colorbars (1 for eachday) that will depict how I spend my time for a month and put a color associated with each label. Each value in categories will have a color associated. More blue for more good. More red for more bad. It is already in order for the jet colormap to be right, but I need to get desecrate color values evenly spaced out for each value in categories. Then I figure the next step would be to convert that to a listed colormap to use for the colorbar based on how the labels associated with the categories.
I think this is the right way to do it, but I am not sure. I am not sure how to associate the labels with color values.
Here is the last part of my code so far. I found one function to make a discrete colormaps. It does, but it isn't what I am looking for and I am not sure what is happening.
Thanks for the help!
# now I need to develop the graph
import numpy as np
from matplotlib import pyplot,mpl
import matplotlib
from scipy import interpolate
from scipy import *
def contains(thelist,name):
# checks if the current list of categories contains the one just read
for val in thelist:
if val == name:
return True
return False
def getCategories(lastFile):
'''
must determine the colors to use
I would like to make a gradient so that the better the task, the closer to blue
bad labels will recieve colors closer to blue
read the last file given for the information on how I feel the order should be
then just keep them in the order of how good they are in the tracker
use a color range and develop discrete values for each category by evenly spacing them out
any time not found should assume to be sleep
sleep should be white
'''
tracker = open(lastFile+'.txt') # open the last file
# find all the categories
categories = []
for line in tracker:
pos = line.find(':') # does it have a : or a ?
if pos==-1: pos=line.find('?')
if pos != -1: # ignore if no : or ?
name = line[0:pos].strip() # split at the : or ?
if contains(categories,name)==False: # if the category is new
categories.append(name) # make a new one
return categories
# find good values in order of last day
newlabels=[]
for val in getCategories(lastDay):
if contains(labels,val):
newlabels.append(val)
categories=newlabels
# convert discrete colormap to listed colormap python
for ii,val in enumerate(labels):
if contains(categories,val)==False:
labels[ii]='sleep'
# create a figure
fig = pyplot.figure()
axes = []
for x in range(endTimes[-1]%(24*60)):
ax = fig.add_axes([0.05, 0.65, 0.9, 0.15])
axes.append(ax)
# figure out the colors to use
# stole this function to make a discrete colormap
# http://www.scipy.org/Cookbook/Matplotlib/ColormapTransformations
def cmap_discretize(cmap, N):
"""Return a discrete colormap from the continuous colormap cmap.
cmap: colormap instance, eg. cm.jet.
N: Number of colors.
Example
x = resize(arange(100), (5,100))
djet = cmap_discretize(cm.jet, 5)
imshow(x, cmap=djet)
"""
cdict = cmap._segmentdata.copy()
# N colors
colors_i = np.linspace(0,1.,N)
# N+1 indices
indices = np.linspace(0,1.,N+1)
for key in ('red','green','blue'):
# Find the N colors
D = np.array(cdict[key])
I = interpolate.interp1d(D[:,0], D[:,1])
colors = I(colors_i)
# Place these colors at the correct indices.
A = zeros((N+1,3), float)
A[:,0] = indices
A[1:,1] = colors
A[:-1,2] = colors
# Create a tuple for the dictionary.
L = []
for l in A:
L.append(tuple(l))
cdict[key] = tuple(L)
# Return colormap object.
return matplotlib.colors.LinearSegmentedColormap('colormap',cdict,1024)
# jet colormap goes from blue to red (good to bad)
cmap = cmap_discretize(mpl.cm.jet, len(categories))
cmap.set_over('0.25')
cmap.set_under('0.75')
#norm = mpl.colors.Normalize(endTimes,cmap.N)
print endTimes
print labels
# make a color list by matching labels to a picture
#norm = mpl.colors.ListedColormap(colorList)
cb1 = mpl.colorbar.ColorbarBase(axes[0],cmap=cmap
,orientation='horizontal'
,boundaries=endTimes
,ticks=endTimes
,spacing='proportional')
pyplot.show()
It sounds like you want something like a stacked bar chart with the color values mapped to a given range? In that case, here's a rough example:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
# Generate data....
intervals, weights = [], []
max_weight = 5
for _ in range(30):
numtimes = np.random.randint(3, 15)
times = np.random.randint(1, 24*60 - 1, numtimes)
times = np.r_[0, times, 24*60]
times.sort()
intervals.append(np.diff(times) / 60.0)
weights.append(max_weight * np.random.random(numtimes + 1))
# Plot the data as a stacked bar chart.
for i, (interval, weight) in enumerate(zip(intervals, weights)):
# We need to calculate where the bottoms of the bars will be.
bottoms = np.r_[0, np.cumsum(interval[:-1])]
# We want the left edges to all be the same, but increase with each day.
left = len(interval) * [i]
patches = plt.bar(left, interval, bottom=bottoms, align='center')
# And set the colors of each bar based on the weights
for val, patch in zip(weight, patches):
# We need to normalize the "weight" value between 0-1 to feed it into
# a given colorbar to generate an actual color...
color = cm.jet(float(val) / max_weight)
patch.set_facecolor(color)
# Setting the ticks and labels manually...
plt.xticks(range(0, 30, 2), range(1, 31, 2))
plt.yticks(range(0, 24 + 4, 4),
['12am', '4am', '8am', '12pm', '4pm', '8pm', '12am'])
plt.xlabel('Day')
plt.ylabel('Hour')
plt.axis('tight')
plt.show()

Categories