I'm working on my first big data project for my university. My dataset is this one: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset
In this part I'd like to:
Take only the 20 highest values of a particular column (IMDB Score and Gross)
Plot everything to see the graph.
With this code I can see the graph as shown
Top20 = newmovieDef[['IMDB Score', 'Gross']].sort_values('IMDB Score', ascending=False).nlargest(20, 'IMDB Score')
newmovieDef[['IMDB Score', 'Gross']].sort_values('IMDB Score', ascending=False).nlargest(20, 'IMDB Score')
#visualizing top 20 in plot
plt.figure(figsize=(7,7))
x = Top20["IMDB Score"]
y = Top20["Gross"]
plt.bar(x, y, color="purple")
plt.show()
But if then I write this:
#GROSS-DURATION --- PLOT PROBLEM
Top20 = newmovieDef[['Gross', 'Duration']].sort_values('Gross', ascending=False).nlargest(20, 'Gross')
newmovieDef[['Gross', 'Duration']].sort_values('Gross', ascending=False).nlargest(20, 'Gross')
#visualizing top 20 in plot
plt.figure(figsize=(7,7))
x = Top20["Gross"]
y = Top20["Duration"]
plt.bar(x, y, color="green")
plt.show()
it gives me a blank graph as in
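Gross and Duration are continuous variables, so a bar chart with Gross on the x-axis and Duration on the y-axis is not the right choice for a visualization. The chart looks blank because plt.bar draws each bar 0.8 data units wide by default, and on an x-axis spanning values as large as Gross such bars are too thin to render. To see the relationship between two continuous variables (in this case Gross and Duration), a scatter (X-Y) plot is generally used.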
From this source, "Bar graphs are used to compare things between different groups or to track changes over time." The key word here is groups which means discrete variables (usually represented as strings in Python).
From the same source, "X-Y plots are used to determine relationships between the two different things. The x-axis is used to measure one event (or variable) and the y-axis is used to measure the other."
You can modify your code to show a scatter (X-Y) plot as follows:
plt.figure(figsize=(7,7))
x = Top20["Gross"]
y = Top20["Duration"]
# Scatter plot
plt.scatter(x, y, color="green")
plt.show()
If you really want a bar plot, then I would suggest binning your continuous data. This breaks a continuous variable into discrete groups which can then be shown on a bar graph although this is still not the best choice for the visualization.
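The binning suggestion above could look roughly like the sketch below. It uses a made-up stand-in for the top-20 rows (the `top20` frame and its random values are assumptions, not the real IMDB data), cuts Gross into four equal-width intervals with `pd.cut`, and draws one bar per interval showing the mean Duration. The bin count of 4 is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the top-20 rows of the real dataset.
rng = np.random.default_rng(0)
top20 = pd.DataFrame({
    "Gross": rng.uniform(3e8, 8e8, size=20),
    "Duration": rng.integers(90, 200, size=20),
})

# Split Gross into 4 equal-width intervals (the bin count is arbitrary).
top20["Gross bin"] = pd.cut(top20["Gross"], bins=4)

# One value per bin: the mean duration of the movies that fall into it.
mean_duration = top20.groupby("Gross bin", observed=False)["Duration"].mean()

# Turn each interval into a short label in millions, e.g. "300-425M".
labels = [f"{iv.left / 1e6:.0f}-{iv.right / 1e6:.0f}M" for iv in mean_duration.index]

fig, ax = plt.subplots(figsize=(7, 7))
bars = ax.bar(labels, mean_duration.to_numpy(), color="green")
ax.set_xlabel("Gross (binned)")
ax.set_ylabel("Mean Duration")
fig.savefig("gross_binned.png")
```

Because the x values are now strings (groups), each bar gets a visible width regardless of how large the underlying Gross values are.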
This book is an exceptional (free) resource for data visualization. It's written with the R programming language, but the general principles still apply.
Related
I'm iterating through all columns of my df to plot their densities to see if and how I need to transform/normalize my data. I'm using Seaborn and this code:
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(16,40))
fig.tight_layout() #otherwise the plots overlapped each other and I couldn't see the column names
for i, column in enumerate(df.columns):
    sns.histplot(df[column],ax=axes[i//n_cols,i%n_cols], kde=True, legend=True, fmt='g')
This results in a mostly okay graph, however the scaling of the y axis is waaay too big in some cases:
City 3 and 4 are just fine, however, the highest Count for City 4 is at around 200, yet the plot scales y until 10 000, which makes the data hard to interpret. The x axis also goes way beyond where it should, as the highest cost is at about 1000000, but the plot goes until 25000000. When I plot City 4 separately and force a ylim of 200 and xlim of 1000000 I get a much more understandable plot:
Why is the y axis (and actually, the x axis also) scaled so weirdly, and how can I change my code to scale it down so that I don't get a ylim much higher than the actually displayed data?
Thank you!
Set sharey to False.
This will get the subplots to plot at the respective maximum points of the corresponding data.
Example:
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(16,40), sharey=False)
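A small sketch of what independent y axes look like in practice. The two arrays are invented for illustration, and plain `Axes.hist` is used instead of seaborn to keep the example dependency-free. As far as I know, `sharey=False` is already the default of `plt.subplots`, so if the axes still look oversized it may be worth checking the data itself (for example a few extreme values stretching the x range).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
small = rng.normal(100, 10, size=200)     # hypothetical column with low counts
large = rng.normal(100, 10, size=20000)   # hypothetical column with high counts

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4), sharey=False)
axes[0].hist(small, bins=20)
axes[1].hist(large, bins=20)

# With sharey=False each panel autoscales to its own tallest bar.
ylim_small = axes[0].get_ylim()
ylim_large = axes[1].get_ylim()
fig.savefig("independent_axes.png")
```

If a single panel still needs tighter limits, `axes[i].set_ylim(...)` and `axes[i].set_xlim(...)` can be called on that panel alone.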
I am practicing with Python Pandas plotting functions and I am trying to plot the content of two series extracted from the same dataframe into one plot.
When I plot the two series individually the result is correct. However, when I plot them together, the one that I plot as second appears flat in the picture.
Here is my code:
# dailyFlow and smooth are created in the same way from the same dataframe
dailyFlow = pd.Series(dataFrame...
smooth = pd.Series(dataFrame...
# lower the noise in the signal with standard deviation = 6
smooth = smooth.resample('D').sum().rolling(31, center=True, win_type='gaussian').sum(std=6)
dailyFlow.plot(style ='-b')
plt.legend(loc = 'upper right')
plt.show()
smooth.plot(style ='-r')
plt.legend(loc = 'upper right')
plt.show()
plt.figure(figsize=(12,5))
smooth.plot(style ='-r')
dailyFlow.plot(style ='-b')
plt.legend(loc = 'upper right')
plt.show()
Here is the output of my function:
I already tried using the parameter secondary_y=True in the second plot, but then I lose the information on the second line in the legend and the scaling between the two plots is wrong.
Many sources on the Internet seem to suggest that plotting the two series like I am doing should be correct, but then why is the third plot incorrect?
Thank you very much for your help.
For the data you have, the 3rd plot is correct. Look at the scale of the y axis on your two plots: one goes up to 70,000 and the other to 60,000,000.
I suspect what you actually want is a .rolling(...).mean() which should have a range comparable to your original data.
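To illustrate why the smoothed line dwarfs the original, here is a sketch comparing a rolling sum with a rolling mean. It uses an invented daily series and a plain (unweighted) 31-day window rather than the Gaussian window from the question, because `win_type='gaussian'` needs SciPy; the sum-versus-mean distinction should carry over to the weighted case.

```python
import numpy as np
import pandas as pd

# Hypothetical daily flow values in roughly the range shown in the question.
idx = pd.date_range("2020-01-01", periods=120, freq="D")
rng = np.random.default_rng(2)
daily = pd.Series(rng.uniform(40000, 70000, size=120), index=idx)

window = daily.rolling(31, center=True)
rolled_sum = window.sum()    # adds up 31 days: about 31 times the daily level
rolled_mean = window.mean()  # averages 31 days: stays on the daily scale

# Ratio between the two is exactly the window length wherever both are defined.
ratio = (rolled_sum / rolled_mean).dropna()
```

Plotted on the same axes, `rolled_mean` overlays `daily` sensibly, while `rolled_sum` pushes the y axis so high that `daily` looks flat.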
If you would like to make both plots bigger, you could try something like this:
fig, ax1 = plt.subplots()
ax1.set_ylim([0, 75000])
# plot first graph
ax2 = ax1.twinx() # second axes that shares the same x-axis
ax2.set_ylim([0, 60000000])
#plot the second graph
I'm working on creating a bar chart for a skewed data set using python matplotlib.
While I'm able to generate the graph without any issue, in the generated graph the bar for the skewed value covers most of the chart and makes the other, non-skewed data look relatively small and negligible.
Below is the code used to generate the bar graph.
import numpy as np
import matplotlib.pyplot as plt
x=["A","B","C","D","E","F"]
y=[25,11,46,895,68,5]
fig,ax = plt.subplots()
r1=plt.barh(y=x,
width=y,
height=0.8)
#ht = [x.get_width() for x in r1.get_children()]
r1y = np.asarray([x.get_y() for x in r1.get_children()])
r1h = np.asarray([x.get_height() for x in r1.get_children()])
for i in range(5):
    plt.text(y[i],r1y[i]+r1h[i]/2, '%s'% (y[i]), ha='left', va='center')
plt.xticks([0,10,100,1000])
plt.show()
The above code would create a bar chart with 0,10,100 and 1000 as xtick values and they are placed at a relative distance based on their value.
While this is valid and expected behavior, one single skewed bar is impacting the entire bar chart.
So, is it possible to place these xtick values equidistantly so that the skewed data doesn't occupy the majority of the space in the final output?
In the expected output, the 0-10-100 range should occupy around 66.6% of the space and the 100-1000 range the remaining 33.3%.
Example:
Try to add plt.xscale('log'):
import numpy as np
import matplotlib.pyplot as plt
x=["A","B","C","D","E","F"]
y=[25,11,46,895,68,5]
fig,ax = plt.subplots()
r1=plt.barh(y=x,
            width=y,
            height=0.8)
r1y = np.asarray([x.get_y() for x in r1.get_children()])
r1h = np.asarray([x.get_height() for x in r1.get_children()])
# label every bar with its value
for i in range(len(y)):
    plt.text(y[i],r1y[i]+r1h[i]/2, '%s'% (y[i]), ha='left', va='center')
# logarithmic x axis: each decade (1-10, 10-100, 100-1000) gets the same width
plt.xscale('log')
plt.show()
Output:
I'm trying to plot two datasets (called Height and Temperature) on different y axes.
Both datasets have the same length.
Both datasets are linked together by a third dataset, RH.
I have tried to use matplotlib to plot the data using twinx() but I am struggling to align both datasets together on the same plot.
Here is the plot I want to align.
The horizontal black line on the figure marks the 0°C level, which was found from Height and was used to test whether both datasets would be aligned when plotted. They are not. There is a noticeable difference between the black line and the 0°C tick from Temperature.
Rather than the two y axes changing independently from each other I would like to plot each index from Height and Temperature at the same y position on the plot.
Here is the code that I used to create the plot:
#Define number of subplots sharing y axis
f, ax1 = plt.subplots()
ax1.minorticks_on()
ax1.grid(which='major',axis='both',c='grey')
#Set axis parameters
ax1.set_ylabel('Height $(km)$')
ax1.set_ylim([np.nanmin(Height), np.nanmax(Height)])
#Plot RH
ax1.plot(RH, Height, label='Original', lw=0.5)
ax1.set_xlabel(r'RH $(\%)$')
ax2 = ax1.twinx()
ax2.plot(RH, Temperature, label='Original', lw=0.5, c='black')
ax2.set_ylabel(r'Temperature ($^\circ$C)')
ax2.set_ylim([np.nanmin(Temperature), np.nanmax(Temperature)])
Any help on this would be amazing. Thanks.
Maybe the atmosphere is wrong. :)
It sounds like you are trying to align the two y axes at particular values. Why are you doing this? The relationship of Height vs. Temperature is non-linear, so I think you are setting the stage for a confusing graph. Any particular line you plot can only be interpreted against one vertical axis.
If needed, I think you will be forced to "do some math" on the limits of the y axes. This link may be helpful:
align scales
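The "math on the limits" could be sketched as below. The helper `aligned_limits` is a hypothetical name, not a matplotlib function: given the limits of the first axis and one value on each axis that should share the same vertical position, it returns limits for the second axis that keep that alignment while still covering the second dataset. This only pins one pair of values together; since Height and Temperature are not linearly related, the other ticks will generally not line up.

```python
def aligned_limits(lim1, anchor1, anchor2, data2_min, data2_max):
    """Return (lo2, hi2) so that anchor2 sits at the same relative
    height on axis 2 as anchor1 does on axis 1 (hypothetical helper)."""
    lo1, hi1 = lim1
    # Relative position of anchor1 between the bottom (0) and top (1) of axis 1.
    frac = (anchor1 - lo1) / (hi1 - lo1)
    if not 0 < frac < 1:
        raise ValueError("anchor1 must lie strictly inside lim1")
    # Smallest span that keeps anchor2 at `frac` and still covers the data.
    span = max((anchor2 - data2_min) / frac, (data2_max - anchor2) / (1 - frac))
    if span <= 0:
        raise ValueError("data range for axis 2 is empty")
    return anchor2 - frac * span, anchor2 + (1 - frac) * span


# Made-up numbers: heights from 0 to 10 km, the 0 degree level found at 4 km,
# temperatures between -60 and 20 degrees.
lo2, hi2 = aligned_limits((0.0, 10.0), 4.0, 0.0, -60.0, 20.0)
frac2 = (0.0 - lo2) / (hi2 - lo2)  # relative position of 0 degrees on axis 2
```

With the axes from the question this would be applied as `ax2.set_ylim(*aligned_limits(ax1.get_ylim(), height_at_zero, 0.0, np.nanmin(Temperature), np.nanmax(Temperature)))`, where `height_at_zero` stands for the height of the black line.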
Let's use the famous Titanic dataset found here:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
And read it in as a dataframe: df
I'm interested in visualizing survival rate per passenger segment, with passenger segments defined as a hexbin bucket of fare x age.
Generating the hexbin of those two features is fairly straightforward:
sns.set(font_scale=1.5)
sns.set_style("white")
fig = plt.figure(figsize=(8,8))
fig = sns.jointplot("age", "fare", data=df, kind="hex",
joint_kws={'gridsize':22},
xlim=(-20, 90), ylim=(-20,300), mincnt=0,
stat_func=None, marginal_kws={"bins":10, "color":"k", "rug":True}, color="black"
)
But rather than density (which is shown in the marginal plot anyway), I'd like the color of the chart to represent survival rate (survived is a binary 1 & 0 dataframe feature) for all passengers counted within each bin.
Answers here are somewhat helpful, but scatter plots are problematic for dense datasets, thus my use of a hexbin.
Any help how I might make this work?
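One possible direction, offered as a sketch rather than a tested solution for the Titanic file: matplotlib's `Axes.hexbin` accepts a `C` array together with `reduce_C_function`, so the color of each hexagon can be the mean of `survived` for the points inside it instead of the point count. The data below is randomly generated (the `age`, `fare` and `survived` arrays are stand-ins), and this draws only the joint hexbin, without seaborn's marginal plots. With the real dataframe, rows with missing age or fare would likely need to be dropped first.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for df["age"], df["fare"], df["survived"].
rng = np.random.default_rng(3)
n = 1000
age = rng.uniform(0, 80, size=n)
fare = rng.exponential(40, size=n)
survived = rng.integers(0, 2, size=n)  # binary 0/1 outcome

fig, ax = plt.subplots(figsize=(8, 8))
# C + reduce_C_function: color each hexagon by the mean of `survived` in it.
hb = ax.hexbin(age, fare, C=survived, reduce_C_function=np.mean,
               gridsize=22, cmap="viridis")
fig.colorbar(hb, ax=ax, label="survival rate")
ax.set_xlabel("age")
ax.set_ylabel("fare")
fig.savefig("survival_hexbin.png")

# Per-hexagon survival rates for the non-empty bins.
rates = hb.get_array()
```

Since `survived` is 0 or 1, the mean per hexagon is exactly the survival rate of the passengers counted in that bin.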