How to interpolate values between points - python

I have this dataset, shown below:
import pandas as pd

temp = [0.1, 1, 4, 10, 15, 20, 25, 30, 35, 40]
sg = [0.999850, 0.999902, 0.999975, 0.999703, 0.999103, 0.998207, 0.997047, 0.995649, 0.99403, 0.99222]
sg_temp = pd.DataFrame({'temp': temp,
                        'sg': sg})
temp sg
0 0.1 0.999850
1 1.0 0.999902
2 4.0 0.999975
3 10.0 0.999703
4 15.0 0.999103
5 20.0 0.998207
6 25.0 0.997047
7 30.0 0.995649
8 35.0 0.994030
9 40.0 0.992220
I would like to interpolate all the values between 0.1 and 40 in steps of 0.001 with a spline interpolation, and have those points in the dataframe as well. I have used resample() before but can't seem to find an equivalent for this case.
I have tried this, based on other questions, but it doesn't work.
import numpy as np
from scipy import interpolate

scale = np.linspace(0, 40, 40 * 1000)
interpolation_sg = interpolate.CubicSpline(list(sg_temp.temp), list(sg_temp.sg))

It works very well for me. What exactly does not work for you?
Have you correctly used the returned CubicSpline to generate your interpolated values? Or is there some kind of error?
Basically you obtain your interpolated y values by plugging in the new x values (scale) to your returned CubicSpline function:
y = interpolation_sg(scale)
I believe this is the issue here. You probably expect the interpolation to return the values directly, but CubicSpline returns a function, and you use that function to obtain your values.
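To make the point concrete, here is a minimal sketch using the data from the question: the object returned by CubicSpline is a callable, and evaluating it at the new x values is what produces the interpolated array.

```python
import numpy as np
from scipy import interpolate

temp = [0.1, 1, 4, 10, 15, 20, 25, 30, 35, 40]
sg = [0.999850, 0.999902, 0.999975, 0.999703, 0.999103,
      0.998207, 0.997047, 0.995649, 0.99403, 0.99222]

# CubicSpline returns a callable spline object, not an array of values
spline = interpolate.CubicSpline(temp, sg)

# Evaluating the callable at new x values gives the interpolated y values
scale = np.linspace(0.1, 40, 400)
values = spline(scale)
```

Because a cubic spline passes through its knots, evaluating it at any of the original temperatures returns the original sg value exactly.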
If I plot this, I obtain this graph:
import matplotlib.pyplot as plt
plt.plot(sg_temp['temp'], sg_temp['sg'], marker='o', ls='') # Plots the original data
plt.plot(scale, interpolation_sg(scale)) # Plots the interpolated data

Call the result of the interpolation with scale:
from scipy import interpolate
out = pd.DataFrame(
    {'temp': scale,
     'sg': interpolate.CubicSpline(sg_temp['temp'],
                                   sg_temp['sg'])(scale)})
Visual output:
Code for the plot
ax = plt.subplot()
out.plot(x='temp', y='sg', label='interpolated', ax=ax)
sg_temp.plot(x='temp', y='sg', marker='o', label='sg', ls='', ax=ax)

Related

What is the most simple way to set scatterplot color based on category in python?

I'm trying to, in the simplest way possible, color points in a scatterplot using Python. X is one column, Y is another, and the last (let's say Z) has values (for example A, B, C). I would like to color the points (X, Y) using the value in Z.
I realize somewhat similar questions have been asked in the past, but this just isn't working out for me. Possibly because I had to force everything to be a float?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import scipy.stats as stats
df = pd.read_csv(r"C:\......combinedsheet2.csv")
df['crowd1'] = pd.to_numeric(df['c1'], errors='coerce')
df['crowd3'] = pd.to_numeric(df['c3'], errors='coerce')
df['dist1'] = pd.to_numeric(df['d1'], errors='coerce')
I'm not sure why these specific values were read as anything other than floats, since everything else was. I haven't used this command enough to know whether it affected any later data analysis; it may be the source of some of my trouble when trying to do mixed-model analysis and such.
To plot I use:
df.plot(x="c1", y="d1", c="black", kind="scatter")
ax = plt.gca()
ax.set_ylim([0, 610])
ax.set_xlim([0, 30])
And to plot all of my data together I use:
df.plot(x=["c1", "c2", "c3", "c4"], y=["d1", "d2", "d3", "d4"], c="black", kind="scatter")
ax = plt.gca()
ax.set_ylim([0, 450])
ax.set_xlim([0, 20])
Here is my csv file contents, minus a few decimal points in some cases (first 3 lines):
bwc  c1  d1     dbz  c2    d2     lmr  c3  d3     tti  c4    d4
A    12  67.00  F    20.0  454.2  I    4   405.4  L    14.0  137.9
B    8   122.0  G    20.0  265.0  J    3   490    M    0.0   144.9
A    0   217.0  F    15.0  235.0  I    0   62.80  N    11.0  418.7
I would like to in each instance be able to see each different point (A, B, C, etc) as a different color. Thanks!
I suggest using the seaborn package to do this. The first plot can be created like this:
sns.scatterplot(data=df, x='c1', y='d1', hue='bwc')
When plotting all the data together, you first need to reshape the dataframe to have the x, y, and hue variables in single columns. There is more than one way to do this. The following example uses pd.wide_to_long which requires renaming the columns containing the letters:
import io
import pandas as pd # v 1.2.3
import seaborn as sns # v 0.11.1
data = """
bwc c1 d1 dbz c2 d2 lmr c3 d3 tti c4 d4
A 12 67.00 F 20.0 454.2 I 4 405.4 L 14.0 137.9
B 8 122.0 G 20.0 265.0 J 3 490 M 0.0 144.9
A 0 217.0 F 15.0 235.0 I 0 62.80 N 11.0 418.7
"""
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
# Melt dataframe to have x, y and hue variables in single columns
dfren = (df.rename(dict(bwc='let1', dbz='let2', lmr='let3', tti='let4'), axis=1)
.reset_index())
dfmelt = pd.wide_to_long(dfren, stubnames=['let', 'c', 'd'], i='index', j='j')
# Plot scatter plot with seaborn
ax = sns.scatterplot(data=dfmelt, x='c', y='d', hue='let')
ax.figure.set_size_inches(8,6)
ax.set_ylim([0, 450])
ax.set_xlim([0, 20]);
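Since there is more than one way to do the reshaping, here is an alternative sketch using plain pd.concat instead of pd.wide_to_long: stack each (letter, c, d) column triple on top of the others under common column names. The sample rows are copied from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'bwc': ['A', 'B', 'A'], 'c1': [12, 8, 0], 'd1': [67.0, 122.0, 217.0],
    'dbz': ['F', 'G', 'F'], 'c2': [20.0, 20.0, 15.0], 'd2': [454.2, 265.0, 235.0],
    'lmr': ['I', 'J', 'I'], 'c3': [4, 3, 0], 'd3': [405.4, 490.0, 62.8],
    'tti': ['L', 'M', 'N'], 'c4': [14.0, 0.0, 11.0], 'd4': [137.9, 144.9, 418.7],
})

# Each group is one (hue, x, y) triple; rename them all to the same columns
groups = [('bwc', 'c1', 'd1'), ('dbz', 'c2', 'd2'),
          ('lmr', 'c3', 'd3'), ('tti', 'c4', 'd4')]
dflong = pd.concat(
    [df[[let, c, d]].set_axis(['let', 'c', 'd'], axis=1)
     for let, c, d in groups],
    ignore_index=True)
```

The resulting long-format frame can then be passed to sns.scatterplot(data=dflong, x='c', y='d', hue='let') exactly as in the wide_to_long version.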

Seaborn distplot only whole numbers

How can I make a distplot with seaborn to only have whole numbers?
My data is an array of numbers between 0 and ~18. I would like to plot the distribution of the numbers.
Impressions
0 210
1 1084
2 2559
3 4378
4 5500
5 5436
6 4525
7 3329
8 2078
9 1166
10 586
11 244
12 105
13 51
14 18
15 5
16 3
dtype: int64
Code I'm using:
sns.distplot(Impressions,
             # bins=np.arange(Impressions.min(), Impressions.max() + 1),
             # kde=False,
             axlabel=False,
             hist_kws={'edgecolor': 'black', 'rwidth': 1})
plt.xticks = range(current.Impressions.min(), current.Impressions.max() + 1, 1)
Plot looks like this:
What I'm expecting:
The xlabels should be whole numbers
Bars should touch each other
The kde line should simply connect the tops of the bars. By the looks of it, the current one assumes there are 0s between x and x + 1, hence the downward spikes (this isn't required; I can turn off kde)
Am I using the correct tool for the job, or should distplot not be used for whole numbers?
Your problem can be solved with the code below:
import seaborn as sns # for data visualization
import numpy as np # for numeric computing
import matplotlib.pyplot as plt # for data visualization
arr = np.array([1,2,3,4,5,6,7,8,9])
sns.distplot(arr, bins = arr, kde = False)
plt.xticks(arr)
plt.show()
In this way, you can plot a histogram using seaborn's sns.distplot() function.
Note: whatever data you pass to bins and to plt.xticks() should be in ascending order.
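If seaborn's defaults keep fighting you, the same effect can be had with plain numpy and matplotlib by putting the bin edges at half-integers, so every bar is centered on a whole number and the bars touch. A small sketch with made-up counts:

```python
import numpy as np

# Made-up integer data between 0 and 5 (stand-in for the Impressions counts)
data = np.array([0]*2 + [1]*5 + [2]*8 + [3]*6 + [4]*3 + [5]*1)

# Bin edges at half-integers: each bin [k - 0.5, k + 0.5) is centered on k
edges = np.arange(data.min() - 0.5, data.max() + 1.5)
counts, edges = np.histogram(data, bins=edges)
```

Plotting with plt.bar(np.arange(data.min(), data.max() + 1), counts, width=1, edgecolor='black') then gives touching bars with whole-number tick positions.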

colouring data points in seaborn plot using vector of RGB values for each datapoint

I have a pandas dataframe with some values. I wanted to use seaborn's stripplot to visualise the spread of my data, although this is the first time I'm using seaborn. I thought it would be interesting to colour the datapoints that were outliers, so I created a column containing RGB tuples for each value. I have used this approach before and I find it very convenient so I would love to find a way to make this work because seaborn is quite nice.
This is how the dataframe might look:
SUBJECT CONDITION(num) hit hit_box_outliers \
0 4.0 1.0 0.807692 0
1 4.0 2.0 0.942308 0
2 4.0 3.0 1.000000 0
3 4.0 4.0 1.000000 0
4 5.0 1.0 0.865385 0
hit_colours
0 (0.38823529411764707, 0.38823529411764707, 0.3...
1 (0.38823529411764707, 0.38823529411764707, 0.3...
2 (0.38823529411764707, 0.38823529411764707, 0.3...
3 (0.38823529411764707, 0.38823529411764707, 0.3...
4 (0.38823529411764707, 0.38823529411764707, 0.3...
Then I try to plot it here:
sns.stripplot(x='CONDITION(num)', y='hit', data=edfg, jitter=True, color=edfg['hit_colours'])
and I am given the following error:
ValueError: Could not generate a palette for <map object at 0x000002265939FB00>
Any ideas for how I can achieve this seemingly easy task?
It seems you want to distinguish between a point being an outlier or not. There are hence two possible cases, which are determined by the column hit_box_outliers.
You may use this column as the hue for the stripplot. To get a custom color for the two events, use a palette (or list of colors).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.DataFrame({"CONDITION(num)": np.tile([1, 2, 3, 4], 25),
                   "hit": np.random.rand(100),
                   "hit_box_outliers": np.random.randint(2, size=100)})
sns.stripplot(x='CONDITION(num)', y='hit', hue="hit_box_outliers", data=df,
              jitter=True, palette=("limegreen", (0.4, 0, 0.8)))
plt.show()

Plotting frequency distribution/histogram with frequency table

I'm familiar with the matplotlib histogram reference:
http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist
However, I don't actually have the original series/data to pass into the plot. I have only the summary statistics in a DataFrame. Example:
df
lower upper occurrences frequency
0.0 0.5 17 .111
0.5 1.0 65 .426
1.0 1.5 147 .963
1.5 2.0 210 1.376
.
.
.
You don't want to calculate a histogram here, because you already have the histogrammed data. Therefore you may simply plot a bar chart.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar(df.lower, df.occurrences, width=df.upper - df.lower, ec="k", align="edge")
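A self-contained version of this sketch, with the summary rows copied from the question (assuming the bin edges were meant to be contiguous):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Summary statistics in the shape of the question's table
df = pd.DataFrame({'lower': [0.0, 0.5, 1.0, 1.5],
                   'upper': [0.5, 1.0, 1.5, 2.0],
                   'occurrences': [17, 65, 147, 210]})

fig, ax = plt.subplots()
# One bar per bin: left edge at `lower`, width = bin size, height = count
bars = ax.bar(df.lower, df.occurrences, width=df.upper - df.lower,
              ec="k", align="edge")
```

align="edge" anchors each bar at its left bin edge rather than centering it, which is what makes the bars line up with the original bin boundaries.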

How can I set the x-axis tick locations for a bar plot created from a pandas DataFrame?

I have a simple plot, with x labels of 1, 1.25, 1.5, 1.75, 2 etc. up to 15:
The plot was created from a pandas.DataFrame without specifying the xtick interval:
speed.plot(kind='bar',figsize=(15, 7))
Now I would like the x-interval to be in increments of 1 rather than 0.25, so the labels would read 1,2,3,4,5 etc.
I'm sure this is easy but I cannot for the life of me figure it out.
I've found plt.xticks() which seems like it's the right call but maybe it's set_xticks?
I've changed the x ticks a great amount without doing what I wanted up until this point. Any help would be greatly appreciated.
The way that pandas handles x-ticks for bar plots can be quite confusing if your x-labels have numeric values. Let's take this example:
import pandas as pd
import numpy as np
x = np.linspace(0, 1, 21)
y = np.random.rand(21)
s = pd.Series(y, index=x)
ax = s.plot(kind='bar', figsize=(10, 3))
ax.figure.tight_layout()
You might expect the tick locations to correspond directly to the values in x, i.e. 0, 0.05, 0.1, ..., 1.0. However, this isn't the case:
print(ax.get_xticks())
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
Instead pandas sets the tick locations according to the indices of each element in x, but then sets the tick labels according to the values in x:
print(' '.join(label.get_text() for label in ax.get_xticklabels()))
# 0.0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1.0
Because of this, setting the tick positions directly (either by using ax.set_xticks or by passing the xticks= argument to pd.Series.plot()) will not give you the effect you are expecting:
new_ticks = np.linspace(0, 1, 11) # 0.0, 0.1, 0.2, ..., 1.0
ax.set_xticks(new_ticks)
Instead you would need to update the positions and the labels of your x-ticks separately:
# positions of each tick, relative to the indices of the x-values
ax.set_xticks(np.interp(new_ticks, s.index, np.arange(s.size)))
# labels
ax.set_xticklabels(new_ticks)
This behavior actually makes a lot of sense in most cases. For bar plots it is common for the x-labels to be non-numeric (e.g. strings corresponding to categories), in which case it wouldn't be possible to use the values in x to set the tick locations. Without introducing another argument to specify their locations, the most logical choice would be to use their indices instead.
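The mapping that np.interp performs can be checked directly without any plotting; for the 21-label example above, a desired tick label of 0.5 lands at bar position 10:

```python
import numpy as np

x = np.linspace(0, 1, 21)        # the numeric bar labels
positions = np.arange(x.size)    # where pandas actually places the bars: 0..20

# Map desired tick label values to their positions on the bar axis
new_ticks = np.linspace(0, 1, 11)
tick_positions = np.interp(new_ticks, x, positions)
```

Since the labels here are evenly spaced, the mapped positions are simply every second bar index (0, 2, 4, ..., 20); np.interp also handles the general case where the labels are unevenly spaced.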
