Matplotlib both axis values overlapping - python

I have just started using Matplotlib. I imported a CSV file from a URL; it has 190+ entries, one per country, along with the region each country belongs to (for example, India in Asia). I am able to plot all the data, but because there is so much of it, the X axis and Y axis tick labels overlap each other and the plot gets messy.
Code:
country_cols = ['Country', 'Region']
country_data = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv",names=country_cols)
country_list = country_data.Country.tolist()
region_list = country_data.Region.tolist()
plt.plot(region_list,country_list)
And output shows like this
For the sake of learning I am using a simple line chart. I would also like to know which graph type should be used to represent data like this; that would be very helpful.

I think you need fig.autofmt_xdate(), which rotates the x-axis tick labels so they are less likely to overlap.
Try this code:
import pandas as pd
import matplotlib.pyplot as plt
country_cols = ['Country', 'Region']
country_data = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv",names=country_cols)
country_list = country_data.Country.tolist()
region_list = country_data.Region.tolist()
fig = plt.figure()
plt.plot(region_list,country_list)
# rotate and right-align the x tick labels
fig.autofmt_xdate()
plt.show()
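On the question of which chart type suits this kind of data: one option, offered only as a sketch, is to aggregate first and draw one bar per region rather than one tick per country. The small inline frame below is a hypothetical stand-in for the real CSV (same Country and Region column names, made-up rows), and the choice of a horizontal bar chart is an assumption about what is most readable here.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical miniature of the countries.csv layout (Country, Region);
# the real file has 190+ rows.
country_data = pd.DataFrame({
    "Country": ["India", "Japan", "China", "France", "Spain", "Kenya"],
    "Region": ["ASIA", "ASIA", "ASIA", "EUROPE", "EUROPE", "AFRICA"],
})

# Count countries per region so there is one bar per region
# instead of one tick label per country.
counts = country_data["Region"].value_counts().sort_values()

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(counts.index, counts.values)
ax.set_xlabel("Number of countries")
ax.set_ylabel("Region")
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. With only a handful of regions on the axis, the labels no longer collide.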

Related

Using pandas and seaborn to make an age pyramid chart

I am working on mock census data and want to use my data frame to take the values 'Male' and 'Female' from the Gender column and plot them against their ages, which are in a different column. I have tried several different approaches and cannot get this to plot at all.
The data has been cleaned in the dataframe. I have also attempted to split the data using a numpy array, although I know there is a way of doing this just by manipulating the dataframe; I don't know how.
Attempted code for pyramid
pop_age = df.T
pop_age.reset_index(inplace=True)
pop_age.columns = ['Age', 'Female', 'Male']
f, ax = plt.subplots(figsize=(10,20))
age_plot = sns.barplot(x='Male', y='Age', data=pop_age, lw=0)
age_plot = sns.barplot(x='Female', y='Age', data=pop_age, lw=0)
age_plot.set(xlabel='Population Count', ylabel='Age', title='Population Age Pyramid')
Numpy Array splitting the data
men=[]
women=[]
for i in range(len(data2)):
    if data2[i][7] == 'Male':
        a=data2[i]
        men.append(a)
    # membership test; "== 'Female' or 'Fe male'" would always be true
    elif data2[i][7] in ('Female', 'Fe male'):
        b=data2[i]
        women.append(b)
Any help would be appreciated. :)
Your code seems fine.
You just have to specify the color you want for each barplot:
age_plot = sns.barplot(x='Male', y='Age', data=pop_age, lw=0, color = 'thecoloryouwant')
Then you just have to create the legend manually and change the tick labels of the x axis so that they show only positive values.
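The steps above (one color per side, a legend, positive tick labels) can be sketched as follows. This is an illustration under assumptions: it uses plain matplotlib barh instead of sns.barplot, negates the male counts so those bars extend to the left, and the age bands and counts are made up.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Hypothetical counts per age band; real values would come from the census frame.
pop_age = pd.DataFrame({
    "Age": ["0-9", "10-19", "20-29", "30-39"],
    "Male": [50, 60, 55, 40],
    "Female": [48, 62, 58, 45],
})

fig, ax = plt.subplots(figsize=(6, 4))
# Male counts are negated so their bars extend to the left of zero.
ax.barh(pop_age["Age"], -pop_age["Male"], color="steelblue", label="Male")
ax.barh(pop_age["Age"], pop_age["Female"], color="salmon", label="Female")

# Format x ticks as absolute values so the left side does not read as negative.
ax.xaxis.set_major_formatter(FuncFormatter(lambda value, pos: f"{abs(value):g}"))
ax.set(xlabel="Population Count", ylabel="Age", title="Population Age Pyramid")
ax.legend()
```

Call plt.show() to display it. The same idea should carry over to the seaborn version: negate one column before plotting and give each barplot call its own color and label.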

Plotly: How to prepare data visualization for below image using scatter bubble chart?

Here is my dataset after cleaning csv file
Here is output what I want
What I want is to display years on the x axis and the column values on the y axis, with bubbles of different colors and sizes and a play button for animation.
I am new to data science. Can someone help me with how to achieve this?
Judging by your dataset and attached image, what you're asking for is something like this:
But I'm not sure that is what you actually want. You see, with your particular dataset there aren't enough dimensions to justify an animation, or even a bubble plot. This is because you're only looking at one value, so you end up showing the same value through the bubble sizes and on the y axis. And there's really no need to change your dataset, given that your provided screenshot is in fact your desired plot. But we can talk more about that if you'd like.
Since you haven't provided a sample dataset, I've used a dataset that's available through plotly express and reshaped it so that it matches your dataset:
Complete code:
# imports
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
import math
import numpy as np
# color cycle
colors = px.colors.qualitative.Alphabet*10
# sample data with similar structure as OP
df = px.data.gapminder().query("continent=='Americas'")
dfp=df.pivot(index='year', columns='country', values='pop')
dfp=dfp[['United States', 'Mexico', 'Argentina', 'Brazil', 'Colombia']]
dfp=dfp.sort_values(by='United States', ascending = False)
dfp=dfp.T
dfp.columns = [str(yr) for yr in dfp.columns]
dfp = dfp[dfp.columns[::-1]].T
# build figure and add traces
fig=go.Figure()
for col, country in enumerate(dfp):
    vals = dfp[country].values
    yVals = [col]*len(vals)
    fig.add_traces(go.Scatter(
        y=yVals,
        x=dfp.index,
        mode='markers',
        marker=dict(color=colors[col],
                    size=vals,
                    sizemode='area',
                    #sizeref=2.*max(vals)/(40.**2),
                    sizeref=2.*max(dfp.max())/(40.**2),
                    sizemin=4),
        name = country
    ))
# edit y tick layout: one tick per country column of dfp
tickVals = np.arange(0, len(dfp.columns))
fig.update_layout(
    yaxis = dict(tickmode = 'array',
                 tickvals = tickVals,
                 ticktext = dfp.columns.tolist()))
fig.show()

Parsing CSV file using pandas

I have been using matplotlib for quite some time now and it is great. However, I want to switch to pandas, and my first attempt at it didn't go so well.
My data set looks like this:
sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
etc.....
First I want to search the csv for every row with the name mike and put those rows in their own list. With this list I want to be able to do some math, for example add sam[3] + winter[4] or compute sam[1]/10. The last part would be to plot its columns against each other.
Going through this page
http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
The only thing I see is if I have a column header, however, I don't have any headers. I only know the position in a row of the values I want.
So my question is:
How do I create a list for each name in the first column (sam, winter, summer)?
Is this method efficient if my csv has millions of data points?
Could I use matplotlib plotting to plot a pandas dataframe?
i.e.:
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike[1], winter[3], label='Mike vs Winter speed', color = 'red')
You can read a csv without headers:
data=pd.read_csv(filepath, header=None)
Columns will be numbered starting from 0.
Selecting and filtering:
all_summers = data[data[0]=='summer']
If you want to do some operations grouping by the first column, it will look like this:
data.groupby(0).sum()
data.groupby(0).count()
...
Selecting a row after grouping:
sums = data.groupby(0).sum()
sums.loc['sam']
Plotting example:
import matplotlib.pyplot as plt
sums.plot()
plt.show()
For more details about plotting, see: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html
df = pd.read_csv(filepath, header=None)
mike = df[df[0]=='mike'].values.tolist()
winter = df[df[0]=='winter'].values.tolist()
Then you can plot those lists as you wanted to above:
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike, winter, label='Mike vs Winter speed', color = 'red')
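As a small, self-contained illustration of the headerless approach, the sketch below reads the sample rows from the question out of a string (io.StringIO stands in for the real file path) and does the kind of arithmetic the question mentions. The variable names sam_scaled and combined are only illustrative, and the sketch assumes the two selections being added have the same number of rows.

```python
import io
import pandas as pd

# The sample rows from the question, read from a string instead of a file.
raw = """sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
"""
data = pd.read_csv(io.StringIO(raw), header=None)

# Boolean masks select all rows for a given name; columns are labelled 0..4.
sam = data[data[0] == "sam"]
winter = data[data[0] == "winter"]

# Column arithmetic works element-wise on the selected rows.
sam_scaled = sam[1] / 10
# .to_numpy() drops the row labels so the two selections add by position.
combined = sam[3].to_numpy() + winter[4].to_numpy()

print(sam_scaled.tolist())
print(combined.tolist())
```

Without .to_numpy(), pandas would align the two Series on their original row labels, which do not match between sam and winter, and the sum would come out as NaN. The same equal-length requirement applies when passing two selections to ax.plot.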

Timeseries plot with min/max shading using Seaborn

I am trying to create a 3-line time series plot based on the following data, as a Week x Overload graph where each Cluster is a different line.
I have multiple observations for each (Cluster, Week) pair (5 each at the moment, eventually 1000). I would like the points on the line to be the average Overload value for that specific (Cluster, Week) pair, and the band to be its min/max values.
Currently using the following bit of code to plot it, but I'm not getting any lines, as I don't know what unit to specify using the current dataframe:
ax14 = sns.tsplot(data = long_total_cluster_capacity_overload_df, value = "Overload", time = "Week", condition = "Cluster")
GIST Data
I have a feeling I still need to reshape my dataframe, but I have no idea how. I am looking for a final result that looks like this
Based off this incredible answer, I was able to create a monkey patch to beautifully do what you are looking for.
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.timeseries
def _plot_range_band(*args, central_data=None, ci=None, data=None, **kwargs):
    # use the true min/max of the data as the band instead of a confidence interval
    upper = data.max(axis=0)
    lower = data.min(axis=0)
    #import pdb; pdb.set_trace()
    ci = np.asarray((lower, upper))
    kwargs.update({"central_data": central_data, "ci": ci, "data": data})
    seaborn.timeseries._plot_ci_band(*args, **kwargs)
# register the new error style with seaborn
seaborn.timeseries._plot_range_band = _plot_range_band
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['Unit'] = cluster_overload.groupby(['Cluster','Week']).cumcount()
ax = sns.tsplot(time='Week',value="Overload", condition="Cluster", unit="Unit", data=cluster_overload,
                err_style="range_band", n_boot=0)
Output Graph:
Notice that the shaded regions line up with the true maximum and minimums in the line graph!
If you figure out why the unit variable is required, please let me know.
If you do not want them all on the same graph then:
import numpy as np
import pandas as pd
import seaborn as sns
import seaborn.timeseries
def _plot_range_band(*args, central_data=None, ci=None, data=None, **kwargs):
    # use the true min/max of the data as the band instead of a confidence interval
    upper = data.max(axis=0)
    lower = data.min(axis=0)
    #import pdb; pdb.set_trace()
    ci = np.asarray((lower, upper))
    kwargs.update({"central_data": central_data, "ci": ci, "data": data})
    seaborn.timeseries._plot_ci_band(*args, **kwargs)
seaborn.timeseries._plot_range_band = _plot_range_band
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['subindex'] = cluster_overload.groupby(['Cluster','Week']).cumcount()
def customPlot(*args,**kwargs):
    # one facet per cluster: pivot to a (subindex x Week) array for tsplot
    df = kwargs.pop('data')
    pivoted = df.pivot(index='subindex', columns='Week', values='Overload')
    ax = sns.tsplot(pivoted.values, err_style="range_band", n_boot=0, color=kwargs['color'])
g = sns.FacetGrid(cluster_overload, row="Cluster", sharey=False, hue='Cluster', aspect=3)
g = g.map_dataframe(customPlot, 'Week', 'Overload','subindex')
Which produces the following, (you can obviously play with the aspect ratio if you think the proportions are off)
I finally used the good old plot with a design (subplots) that seems (to me) more readable.
import pandas as pd
import seaborn as sns
df = pd.read_csv('TSplot.csv', sep='\t', index_col=0)
# Compute the min, mean and max (could also be other values)
grouped = df.groupby(["Cluster", "Week"]).agg({'Overload': ['min', 'mean', 'max']}).unstack("Cluster")
# Plot with subplots since it is more readable
axes = grouped.loc[:,('Overload', 'mean')].plot(subplots=True)
# Getting the color palette used
palette = sns.color_palette()
# Initializing an index to get each cluster and each color
index = 0
for ax in axes:
    ax.fill_between(grouped.index, grouped.loc[:,('Overload', 'mean', index + 1)],
                    grouped.loc[:,('Overload', 'max', index + 1 )], alpha=.2, color=palette[index])
    ax.fill_between(grouped.index,
                    grouped.loc[:,('Overload', 'min', index + 1)] , grouped.loc[:,('Overload', 'mean', index + 1)], alpha=.2, color=palette[index])
    index +=1
I really thought I would be able to do it with seaborn.tsplot. But it does not quite look right. Here is the result I get with seaborn:
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
cluster_overload['Unit'] = cluster_overload.groupby(['Cluster','Week']).cumcount()
ax = sns.tsplot(time='Week',value="Overload", condition="Cluster", ci=100, unit="Unit", data=cluster_overload)
Outputs:
I am really confused as to why the unit parameter is necessary, since my understanding is that all the data is aggregated based on (time, condition). The Seaborn documentation defines unit as:
Field in the data DataFrame identifying the sampling unit (e.g.
subject, neuron, etc.). The error representation will collapse over
units at each time/condition observation. This has no role when data
is an array.
I am not certain what 'collapse over' means here, especially since under my reading it would not need to be a required variable.
Anyways, here's the output if you want exactly what you discussed, not nearly as pretty. I am not sure how to manually shade in those regions, but please share if you figure it out.
cluster_overload = pd.read_csv("TSplot.csv", delim_whitespace=True)
grouped = cluster_overload.groupby(['Cluster','Week'],as_index=False)
stats = grouped.agg(['min','mean','max']).unstack().T
stats.index = stats.index.droplevel(0)
colors = ['b','g','r']
ax = stats.loc['mean'].plot(color=colors, alpha=0.8, linewidth=3)
stats.loc['max'].plot(ax=ax,color=colors,legend=False, alpha=0.3)
stats.loc['min'].plot(ax=ax,color=colors,legend=False, alpha=0.3)
Outputs:
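On shading the regions manually: one possible approach, sketched here under assumptions, is to aggregate with pandas and draw the band with matplotlib's fill_between, which avoids tsplot entirely (as far as I know, tsplot is no longer available in recent seaborn releases). The inline frame is made-up data with the same Cluster, Week and Overload column names as the question.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical long-format data: several observations per (Cluster, Week) pair.
df = pd.DataFrame({
    "Cluster": [1, 1, 1, 1, 2, 2, 2, 2],
    "Week": [1, 1, 2, 2, 1, 1, 2, 2],
    "Overload": [10.0, 14.0, 20.0, 30.0, 5.0, 7.0, 8.0, 12.0],
})

# One row per (Cluster, Week) with the min, mean and max of Overload.
stats = df.groupby(["Cluster", "Week"])["Overload"].agg(["min", "mean", "max"])

fig, ax = plt.subplots()
for cluster, frame in stats.groupby(level="Cluster"):
    frame = frame.droplevel("Cluster")  # index is now just Week
    (line,) = ax.plot(frame.index, frame["mean"], label=f"Cluster {cluster}")
    # Shade between the observed minimum and maximum in the line's own color.
    ax.fill_between(frame.index, frame["min"], frame["max"],
                    alpha=0.2, color=line.get_color())
ax.set(xlabel="Week", ylabel="Overload")
ax.legend()
```

Call plt.show() to display it. Because the aggregation happens before plotting, no unit column is needed in this version.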

Python: Legend has wrong colors on Pandas MultiIndex plot

I'm trying to plot data from 2 separate MultiIndex frames, with the same data as levels in each.
Currently, this generates two separate plots, and I'm unable to customise the legend by appending a string to distinguish each line on the graph. Any help would be appreciated!
Here is the method so far:
def plot_lead_trail_res(df_ante, df_post, symbols=[]):
    if len(symbols) < 1:
        print("Try again with a symbol list. (Time constraints)")
    else:
        df_ante = df_ante.loc[symbols]
        df_post = df_post.loc[symbols]
        ante_leg = [str(x)+'_ex-ante' for x in df_ante.index.levels[0]]
        post_leg = [str(x)+'_ex-post' for x in df_post.index.levels[0]]
        print("ante_leg", ante_leg)
        ax = df_ante.unstack(0).plot(x='SHIFT', y='MUTUAL_INFORMATION', legend=ante_leg)
        ax = df_post.unstack(0).plot(x='SHIFT', y='MUTUAL_INFORMATION', legend=post_leg)
        ax.set_xlabel('Time-shift of sentiment data (days) with financial data')
        ax.set_ylabel('Mutual Information')
Using this function call:
sentisignal.plot_lead_trail_res(data_nasdaq_top_100_preprocessed_mi_res, data_nasdaq_top_100_preprocessed_mi_res_validate, ['AAL', 'AAPL'])
I obtain the following figure:
Current plots
Ideally, both sets of lines would be on the same graph with the same axes!
Update 2 [Concatenation Solution]
I've solved the issue of plotting from multiple frames by using concatenation; however, the legend does not match the line colors on the graph.
There are no explicit calls to legend, and the label parameter of plot() has not been used.
Code:
df_ante = data_nasdaq_top_100_preprocessed_mi_res
df_post = data_nasdaq_top_100_preprocessed_mi_res_validate
symbols = ['AAL', 'AAPL']
df_ante = df_ante.loc[symbols]
df_post = df_post.loc[symbols]
df_ante.index.set_levels([[str(x)+'_ex-ante' for x in df_ante.index.levels[0]],df_ante.index.levels[1]], inplace=True)
df_post.index.set_levels([[str(x)+'_ex-post' for x in df_post.index.levels[0]],df_post.index.levels[1]], inplace=True)
df_merge = pd.concat([df_ante, df_post])
df_merge['SHIFT'] = abs(df_merge['SHIFT'])
df_merge.unstack(0).plot(x='SHIFT', y='MUTUAL_INFORMATION')
Image:
MultiIndex Plot Image
I think that with
ax = df_ante.unstack(0).plot(x='SHIFT', y='MUTUAL_INFORMATION', legend=ante_leg)
you store the output of plot() in ax, including the lines, and it is then overwritten by the second call. Am I right that the lines which were plotted first are missing?
The usual procedure would be something like
fig = plt.figure(figsize=(5, 5)) # size in inches
ax = fig.add_subplot(111) # if you want only one axes
Now you have an Axes object in ax and can pass it as input to the next plots.
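A minimal sketch of that idea, assuming the pandas plot(ax=...) argument is what is meant by passing the axes along: both frames draw onto the same Axes, and each column is renamed so the legend entries are distinct. The two small frames and their values are hypothetical stand-ins for the unstacked ex-ante and ex-post data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the unstacked ex-ante / ex-post frames.
shift = [0, 1, 2, 3]
df_ante = pd.DataFrame({"SHIFT": shift,
                        "AAL": [0.10, 0.20, 0.30, 0.20],
                        "AAPL": [0.20, 0.10, 0.40, 0.30]})
df_post = pd.DataFrame({"SHIFT": shift,
                        "AAL": [0.15, 0.25, 0.20, 0.10],
                        "AAPL": [0.30, 0.20, 0.10, 0.20]})

fig = plt.figure(figsize=(5, 5))  # size in inches
ax = fig.add_subplot(111)

# Passing ax= makes both calls draw on the same Axes instead of opening a new figure.
# add_suffix renames the columns so each line gets its own legend label.
df_ante.set_index("SHIFT").add_suffix("_ex-ante").plot(ax=ax)
df_post.set_index("SHIFT").add_suffix("_ex-post").plot(ax=ax, linestyle="--")

ax.set_xlabel("Time-shift of sentiment data (days) with financial data")
ax.set_ylabel("Mutual Information")
ax.legend()  # rebuild the legend from the labelled lines on this Axes
```

Call plt.show() to display it. Since the legend is rebuilt from the lines actually present on the Axes, each entry's color matches its line.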
