Plotting frequency distribution/histogram with frequency table - python

I'm familiar with the matplotlib histogram reference:
http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.hist
However, I don't actually have the original series/data to pass into the plot. I have only the summary statistics in a DataFrame. Example:
df
lower upper occurrences frequency
0.0 0.5 17 .111
0.5 1.0 65 .426
1.0 1.5 147 .963
1.5 2.0 210 1.376
.
.
.

You don't want to calculate a histogram here, because you already have the histogrammed data. Therefore you may simply plot a bar chart.
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# align="edge" puts the left edge of each bar at df.lower
ax.bar(df.lower, df.occurrences, width=df.upper-df.lower, ec="k", align="edge")
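For completeness, here is a minimal self-contained sketch of the same idea. The DataFrame is a hypothetical reconstruction of the table in the question, and the bin edges are assumed to be contiguous intervals of width 0.5.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical reconstruction of the summary table from the question;
# the bin edges are an assumption (contiguous bins of width 0.5).
df = pd.DataFrame({
    "lower": [0.0, 0.5, 1.0, 1.5],
    "upper": [0.5, 1.0, 1.5, 2.0],
    "occurrences": [17, 65, 147, 210],
})

fig, ax = plt.subplots()
# The data is already binned, so draw one bar per bin instead of calling hist().
# align="edge" places the left edge of each bar at `lower`.
bars = ax.bar(df["lower"], df["occurrences"],
              width=df["upper"] - df["lower"], ec="k", align="edge")
ax.set_xlabel("value")
ax.set_ylabel("occurrences")
```

Passing the `frequency` column instead of `occurrences` as the bar heights should give the normalised version of the same chart.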


How to interpolate values between points

I have the dataset shown below:
temp = [0.1, 1, 4, 10, 15, 20, 25, 30, 35, 40]
sg =[0.999850, 0.999902, 0.999975, 0.999703, 0.999103, 0.998207, 0.997047, 0.995649, 0.99403, 0.99222]
sg_temp = pd.DataFrame({'temp' : temp,
'sg' : sg})
temp sg
0 0.1 0.999850
1 1.0 0.999902
2 4.0 0.999975
3 10.0 0.999703
4 15.0 0.999103
5 20.0 0.998207
6 25.0 0.997047
7 30.0 0.995649
8 35.0 0.994030
9 40.0 0.992220
I would like to interpolate all the values between 0.1 and 40 in steps of 0.001 with a spline interpolation and have those points in the dataframe as well. I have used resample() before but can't seem to find an equivalent for this case.
I have tried this based off of other questions but it doesn't work.
scale = np.linspace(0, 40, 40*1000)
interpolation_sg = interpolate.CubicSpline(list(sg_temp.temp), list(sg_temp.sg))
It works very well for me. What exactly does not work for you?
Have you correctly used the returned CubicSpline to generate your interpolated values? Or is there some kind of error?
Basically you obtain your interpolated y values by plugging in the new x values (scale) to your returned CubicSpline function:
y = interpolation_sg(scale)
I believe this is the issue here. You probably expect that the interpolation function returns you the values, but it returns a function. And you use this function to obtain your values.
If I plot this, I obtain this graph:
import matplotlib.pyplot as plt
plt.plot(sg_temp['temp'], sg_temp['sg'], marker='o', ls='') # Plots the original data
plt.plot(scale, interpolation_sg(scale)) # Plots the interpolated data
Call the result of the interpolation with scale:
from scipy import interpolate
out = pd.DataFrame(
{'temp': scale,
'sg': interpolate.CubicSpline(sg_temp['temp'],
sg_temp['sg'])(scale)
})
Visual output:
Code for the plot
ax = plt.subplot()
out.plot(x='temp', y='sg', label='interpolated', ax=ax)
sg_temp.plot(x='temp', y='sg', marker='o', label='sg', ls='', ax=ax)
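As a rough end-to-end sketch of both answers: build the spline, evaluate it on a grid with a step of 0.001, and store the result in a DataFrame. The grid here starts at 0.1 rather than 0 (an assumption about what the question intends), because `CubicSpline` extrapolates by default and values below the first data point would not be interpolated. The names `spline`, `grid` and `n_points` are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import CubicSpline

temp = [0.1, 1, 4, 10, 15, 20, 25, 30, 35, 40]
sg = [0.999850, 0.999902, 0.999975, 0.999703, 0.999103,
      0.998207, 0.997047, 0.995649, 0.99403, 0.99222]

# CubicSpline returns a callable, not the interpolated values themselves
spline = CubicSpline(temp, sg)

# Grid from 0.1 to 40 inclusive with a spacing of 0.001
n_points = int(round((40 - 0.1) / 0.001)) + 1
grid = np.linspace(0.1, 40, n_points)

# Evaluate the spline on the grid and collect everything in a DataFrame
out = pd.DataFrame({"temp": grid, "sg": spline(grid)})
```

The spline passes through the original data points, so checking `spline(temp)` against `sg` is a quick sanity test.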

Create a plot with x axis as timestamp and y axis as shifted price

I am new to time-series programming with pandas. Can somebody help me with this.
Create a plot with x axis as timestamp and y axis as shifted price. In the plot draw the following dotted lines:
Green dotted line which indicates mean
Say mean of shifted price distribution is 0.5 and standard deviation is 2.25
Line should be y = 0.5 ie horizontal line parallel to x-axis
Red dotted lines which indicates one standard deviation above and below x-axis.
Line should be y=2.25 and y=-2.25
Following is a sample image which shows the shifted price in y-axis, time in x-axis, green dotted
line on mean and red dotted line on +- standard deviation
here is the sample data:
0 2017-11-05 09:20:01.134 2123.0 12.23 34.12 300.0
1 2017-11-05 09:20:01.789 2133.0 32.43 45.62 330.0
2 2017-11-05 09:20:02.238 2423.0 35.43 55.62 NaN
3 2017-11-05 09:20:02.567 3423.0 65.43 56.62 NaN
4 2017-11-05 09:20:02.948 2463.0 45.43 58.62 NaN
Consider your price as a Series and plot it as follows:
import numpy as np
import pandas as pd
# Date
rng = pd.date_range('1/1/2000', periods=1000)
# Create a Random Series
ts = pd.Series(np.random.randn(len(rng)), index=rng)
# Create plot
ax = ts.plot()
# Plot the mean as a horizontal dashed line
ax.axhline(y=ts.mean(), color='r', linestyle='--', lw=2)
# Plot the mean +/- 1.96 standard deviations (an approximate 95% band)
ax.axhline(y=ts.mean() + 1.96*np.sqrt(np.var(ts)), color='g', linestyle=':', lw=2)
ax.axhline(y=ts.mean() - 1.96*np.sqrt(np.var(ts)), color='g', linestyle=':', lw=2)
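A variant closer to the colours requested in the question might look like the sketch below: a green dotted line at the mean and red dotted lines one standard deviation either side of it. The series `shifted_price` is a random stand-in (the real column name is not given in the question), and placing the red lines at mean ± 1 std is an assumption about what is wanted.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the shifted price series, indexed by timestamp
rng = np.random.default_rng(0)
idx = pd.date_range("2017-11-05 09:20:00", periods=500, freq="min")
shifted_price = pd.Series(rng.normal(0.5, 2.25, len(idx)), index=idx)

mean, std = shifted_price.mean(), shifted_price.std()

fig, ax = plt.subplots()
shifted_price.plot(ax=ax, label="shifted price")
# Green dotted line at the mean
ax.axhline(mean, color="g", linestyle=":", lw=2, label="mean")
# Red dotted lines one standard deviation above and below the mean
ax.axhline(mean + std, color="r", linestyle=":", lw=2, label="mean + 1 std")
ax.axhline(mean - std, color="r", linestyle=":", lw=2, label="mean - 1 std")
ax.legend()
```

If the lines should sit at fixed values instead (for example y=2.25 and y=-2.25), pass those numbers to `axhline` directly.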

colouring data points in seaborn plot using vector of RGB values for each datapoint

I have a pandas dataframe with some values. I wanted to use seaborn's stripplot to visualise the spread of my data, although this is the first time I'm using seaborn. I thought it would be interesting to colour the datapoints that were outliers, so I created a column containing RGB tuples for each value. I have used this approach before and I find it very convenient so I would love to find a way to make this work because seaborn is quite nice.
This is how the dataframe might look:
SUBJECT CONDITION(num) hit hit_box_outliers \
0 4.0 1.0 0.807692 0
1 4.0 2.0 0.942308 0
2 4.0 3.0 1.000000 0
3 4.0 4.0 1.000000 0
4 5.0 1.0 0.865385 0
hit_colours
0 (0.38823529411764707, 0.38823529411764707, 0.3...
1 (0.38823529411764707, 0.38823529411764707, 0.3...
2 (0.38823529411764707, 0.38823529411764707, 0.3...
3 (0.38823529411764707, 0.38823529411764707, 0.3...
4 (0.38823529411764707, 0.38823529411764707, 0.3...
Then I try to plot it here:
sns.stripplot(x='CONDITION(num)', y='hit', data=edfg, jitter=True, color=edfg['hit_colours'])
and I am given the following error:
ValueError: Could not generate a palette for <map object at 0x000002265939FB00>
Any ideas for how I can achieve this seemingly easy task?
It seems you want to distinguish between a point being an outlier or not. There are hence two possible cases, which are determined by the column hit_box_outliers.
You may use this column as the hue for the stripplot. To get a custom color for the two events, use a palette (or list of colors).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df= pd.DataFrame({"CONDITION(num)" : np.tile([1,2,3,4],25),
"hit" : np.random.rand(100),
"hit_box_outliers": np.random.randint(2, size=100)})
sns.stripplot(x='CONDITION(num)', y='hit', hue ="hit_box_outliers", data=df, jitter=True,
palette=("limegreen", (0.4,0,0.8)))
plt.show()
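If a separate colour per data point really is needed, one alternative is to skip seaborn and use matplotlib's `scatter`, which accepts a sequence of RGB tuples through `c=`. This is a different technique from the `hue`/`palette` approach above, sketched here with made-up data, illustrative colours and manual jitter to imitate a strip plot.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CONDITION(num)": np.tile([1, 2, 3, 4], 25),
    "hit": rng.random(100),
    "hit_box_outliers": rng.integers(2, size=100),
})

# One RGB tuple per row: grey for regular points, red for outliers
# (both colours are illustrative choices)
grey, red = (0.39, 0.39, 0.39), (0.8, 0.1, 0.1)
df["hit_colours"] = [red if flag else grey for flag in df["hit_box_outliers"]]

# Manual horizontal jitter so points within a condition do not overlap
x = df["CONDITION(num)"] + rng.uniform(-0.15, 0.15, len(df))

fig, ax = plt.subplots()
# c= takes one colour per point
points = ax.scatter(x, df["hit"], c=df["hit_colours"].tolist())
ax.set_xticks([1, 2, 3, 4])
ax.set_xlabel("CONDITION(num)")
ax.set_ylabel("hit")
```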

How do I show the proper count value in seaborn?

CH Gayle 17
YK Pathan 16
AB de Villiers 15
DA Warner 14
SK Raina 13
RG Sharma 13
MEK Hussey 12
AM Rahane 12
MS Dhoni 12
G Gambhir 12
I have a series like this. I want to plot the player on the x axis and their respective value on the y axis. I tried this code:
man_of_match=(matches['player_of_match'].value_counts())
sns.countplot(x=(man_of_match),data=matches,color='B')
sns.plt.show()
But with this code, it plots the frequency of the numeric value, i.e on x axis 12 gets plotted and the count on y axis becomes 4. Similarly for 13 on x axis it shows 2 on y axis.
How do I make the x axis show the name of the player and the y axis the corresponding value for that player?
sns.countplot is meant to do the counting for you. You are counting yourself with value_counts and then plotting the counts of the counts. Pass the uncounted column from matches to sns.countplot instead:
ax = sns.countplot(x='player_of_match', data=matches, color='b')
plt.sca(ax)
plt.xticks(rotation=90);
If you want to limit it to the top 10 players, use value_counts as you did, but plot the result directly with the pandas plotting interface (which uses matplotlib):
ax = matches['player_of_match'].value_counts().head(10).plot.bar(width=.8, color='r')
ax.set_xlabel('player_of_match')
ax.set_ylabel('count')
You can get it to look a lot like the seaborn plot
kws = dict(width=.8, color=sns.color_palette('pastel'))
ax = matches['player_of_match'].value_counts().head(10).plot.bar(**kws)
ax.set_xlabel('player_of_match')
ax.set_ylabel('count')
ax.grid(False, axis='x')
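To make the difference between raw rows and counts concrete, here is a small sketch with a made-up `matches` frame (the real data is not shown in the question). It counts once with `value_counts` and plots those counts as bars, so the x axis shows the player names and the y axis shows how often each one appears.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the real `matches` DataFrame: one row per match
matches = pd.DataFrame({
    "player_of_match": ["CH Gayle"] * 3 + ["YK Pathan"] * 2 + ["DA Warner"],
})

# Count each player once; value_counts sorts in descending order
counts = matches["player_of_match"].value_counts()

fig, ax = plt.subplots()
# Plot the counts themselves, not a count of the counts
counts.plot.bar(ax=ax, width=0.8)
ax.set_xlabel("player_of_match")
ax.set_ylabel("count")
```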

How can I set the x-axis tick locations for a bar plot created from a pandas DataFrame?

I have a simple plot, with x labels of 1, 1.25, 1.5, 1.75, 2 etc. up to 15:
The plot was created from a pandas.DataFrame without specifying the xtick interval:
speed.plot(kind='bar',figsize=(15, 7))
Now I would like the x-interval to be in increments of 1 rather than 0.25, so the labels would read 1,2,3,4,5 etc.
I'm sure this is easy but I cannot for the life of me figure it out.
I've found plt.xticks() which seems like it's the right call but maybe it's set_xticks?
I've changed the x ticks a great amount without doing what I wanted up until this point. Any help would be greatly appreciated.
The way that pandas handles x-ticks for bar plots can be quite confusing if your x-labels have numeric values. Let's take this example:
import pandas as pd
import numpy as np
x = np.linspace(0, 1, 21)
y = np.random.rand(21)
s = pd.Series(y, index=x)
ax = s.plot(kind='bar', figsize=(10, 3))
ax.figure.tight_layout()
You might expect the tick locations to correspond directly to the values in x, i.e. 0, 0.05, 0.1, ..., 1.0. However, this isn't the case:
print(ax.get_xticks())
# [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
Instead pandas sets the tick locations according to the indices of each element in x, but then sets the tick labels according to the values in x:
print(' '.join(label.get_text() for label in ax.get_xticklabels()))
# 0.0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1.0
Because of this, setting the tick positions directly (either by using ax.set_xticks or by passing the xticks= argument to pd.Series.plot()) will not give you the effect you are expecting:
new_ticks = np.linspace(0, 1, 11) # 0.0, 0.1, 0.2, ..., 1.0
ax.set_xticks(new_ticks)
Instead you would need to update the positions and the labels of your x-ticks separately:
# positions of each tick, relative to the indices of the x-values
ax.set_xticks(np.interp(new_ticks, s.index, np.arange(s.size)))
# labels
ax.set_xticklabels(new_ticks)
This behavior actually makes a lot of sense in most cases. For bar plots it is common for the x-labels to be non-numeric (e.g. strings corresponding to categories), in which case it wouldn't be possible to use the values in x to set the tick locations. Without introducing another argument to specify their locations, the most logical choice would be to use their indices instead.
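Putting the pieces together, a self-contained sketch of the fix could look like this. It uses the same toy series as above; the variable name `positions` and the label formatting are illustrative choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 21)
s = pd.Series(np.random.default_rng(0).random(21), index=x)

fig, ax = plt.subplots(figsize=(10, 3))
s.plot(kind="bar", ax=ax)

# Desired labels: 0.0, 0.1, ..., 1.0
new_ticks = np.linspace(0, 1, 11)

# Bars sit at integer positions 0..20, so map each label value
# onto the bar-index scale
positions = np.interp(new_ticks, s.index, np.arange(s.size))

ax.set_xticks(positions)
ax.set_xticklabels([f"{t:.1f}" for t in new_ticks])
```

Here every second bar gets a label, since the labels are spaced 0.1 apart while the bars are spaced 0.05 apart.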
