Using pandas and seaborn to make an age pyramid chart

I am working with mock census data and want to use my data frame to take the 'Male' and 'Female' values from the Gender column and plot them against age, which is a separate column. I have tried several approaches and cannot get this to plot at all.
The data has been cleaned in the dataframe. I have also attempted to split it using a numpy array, although I know this can be done by manipulating the dataframe directly; I just don't know how.
Attempted code for pyramid
pop_age = df.T
pop_age.reset_index(inplace=True)
pop_age.columns = ['Age', 'Female', 'Male']
f, ax = plt.subplots(figsize=(10,20))
age_plot = sns.barplot(x='Male', y='Age', data=pop_age, lw=0)
age_plot = sns.barplot(x='Female', y='Age', data=pop_age, lw=0)
age_plot.set(xlabel='Population Count', ylabel='Age', title='Population Age Pyramid')
Numpy Array splitting the data
men = []
women = []
for i in range(len(data2)):
    # column 7 holds the gender value
    if data2[i][7] == 'Male':
        men.append(data2[i])
    # membership test: `== 'Female' or 'Fe male'` would always be true
    elif data2[i][7] in ('Female', 'Fe male'):
        women.append(data2[i])
Any help would be appreciated. :)

Your code looks fine.
You just have to specify the color you want for each barplot:
age_plot = sns.barplot(x='Male', y='Age', data=pop_age, lw=0, color = 'thecoloryouwant')
Then create the legend manually and change the tick labels of the x axis so that they show only positive values.
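The steps above can be sketched as follows. This is only an illustration under stated assumptions: the counts in pop_age are made up, the colors are arbitrary, and negating the male counts so that their bars extend to the left is an assumption about the layout that the question's code does not show.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical counts per age band (made-up numbers, not from the question).
pop_age = pd.DataFrame({
    "Age": ["0-19", "20-39", "40-59", "60-79", "80+"],
    "Male": [120, 150, 130, 80, 20],
    "Female": [115, 155, 135, 95, 35],
})

# Assumption: male counts are negated so their bars extend to the left.
pop_age["MaleNeg"] = -pop_age["Male"]
age_order = pop_age["Age"].tolist()[::-1]  # oldest band at the top

fig, ax = plt.subplots(figsize=(8, 5))
sns.barplot(x="MaleNeg", y="Age", data=pop_age, order=age_order,
            color="steelblue", lw=0, ax=ax)
sns.barplot(x="Female", y="Age", data=pop_age, order=age_order,
            color="salmon", lw=0, ax=ax)

# Fix the tick positions, then relabel them with absolute values.
ticks = [-150, -100, -50, 0, 50, 100, 150]
ax.set_xticks(ticks)
ax.set_xticklabels([str(abs(t)) for t in ticks])

# Build the legend by hand, one patch per color.
ax.legend(handles=[mpatches.Patch(color="steelblue", label="Male"),
                   mpatches.Patch(color="salmon", label="Female")])
ax.set(xlabel="Population Count", ylabel="Age", title="Population Age Pyramid")
```

With real data, the tick positions would need to be chosen to cover the largest count on either side.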

Related

Couldn't align X axis values with bars on top of them using seaborn barplot with hue [duplicate]

My graph is ending up looking like this:
I took the original titanic dataset and sliced some columns and created a new dataframe via the following code.
Cabin_group = titanic[['Fare', 'Cabin', 'Survived']] #selecting certain columns from dataframe
Cabin_group.Cabin = Cabin_group.Cabin.str[0] #cleaning the Cabin column
Cabin_group = Cabin_group.groupby('Cabin', as_index =False).Survived.mean()
Cabin_group.drop([6,7], inplace = True) #drop Cabin G and T as instances are too low
Cabin_group['Status']= ('Poor', 'Rich', 'Rich', 'Medium', 'Medium', 'Poor') #giving each Cabin a status value.
So my new dataframe `Cabin_group` ends up looking like this:
  Cabin  Survived  Status
0     A  0.454545    Poor
1     B  0.676923    Rich
2     C  0.574468    Rich
3     D  0.652174  Medium
4     E  0.682927  Medium
5     F  0.523810    Poor
Here is how I tried to plot the dataframe
fig, ax = plt.subplots(1,1, figsize = (10,4))
sns.barplot(x ='Cabin', y='Survived', hue ='Status', data = Cabin_group )
plt.show()
A couple of things are off with this graph:
First, the bars for A, D, E and F are shifted away from their respective x-axis labels. Second, the bars themselves appear thinner than in my usual barplots.
I am not sure how to shift the bars to their proper place, or how to control the width of the bars.
Thank you.

This can be fixed by passing dodge=False to sns.barplot. The parameter is supported in newer versions of seaborn.
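A minimal sketch of that suggestion, assuming the summary table printed in the question is rebuilt by hand (the name cabin_group is just a stand-in for the question's Cabin_group):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The summary table shown in the question, rebuilt by hand.
cabin_group = pd.DataFrame({
    "Cabin": list("ABCDEF"),
    "Survived": [0.454545, 0.676923, 0.574468, 0.652174, 0.682927, 0.523810],
    "Status": ["Poor", "Rich", "Rich", "Medium", "Medium", "Poor"],
})

fig, ax = plt.subplots(figsize=(10, 4))
# dodge=False draws each bar centered on its x tick instead of
# reserving one slot per hue level.
sns.barplot(x="Cabin", y="Survived", hue="Status", data=cabin_group,
            dodge=False, ax=ax)
```

Because the bars no longer share each tick with empty hue slots, they are also drawn at full width.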

The bars are not aligned because seaborn expects three bars for each x value (one for each distinct value of Status) and only one is provided. I think one solution is to map a color to each Status. As far as I know there is no easy way to do that, but here is an example. I'm not fully convinced by it, since it seems complicated for simply mapping a color to a category (and the legend is not displayed).
# Creating a color mapping
Cabin_group['Color'] = pd.Series(pd.factorize(Cabin_group['Status'])[0]).map(
    lambda x: sns.color_palette()[x])
g = sns.barplot(x ='Cabin', y='Survived', data=Cabin_group, palette=Cabin_group['Color'])
When I see how simple it is in R ... But unfortunately the ggplot implementation in Python does not allow plotting a geom_bar with stat = 'identity'.
library(tidyverse)
Cabin_group %>% ggplot() +
geom_bar(aes(x = Cabin, y= Survived, fill = Status),
stat = 'identity')

how to make stacked plots for dataframe with multiple index in python?

I have trade export data that is collected weekly. I want to make a stacked bar plot with matplotlib, but I am having difficulty handling a pandas dataframe with a multi-level index. I looked into this post but could not get what I expected. Can anyone suggest a way of doing this in Python? I suspect I aggregated the data incorrectly, and I think I could use a for loop to iterate over the years and make a stacked bar plot on a weekly basis. Does anyone know how to make this easier in matplotlib?
reproducible data and my attempt
import pandas as pd
import matplotlib.pyplot as plt
# load the data
url = 'https://gist.githubusercontent.com/adamFlyn/0eb9d60374c8a0c17449eef4583705d7/raw/edea1777466284f2958ffac6cafb86683e08a65e/mydata.csv'
df = pd.read_csv(url, parse_dates=['weekly'])
df.drop('Unnamed: 0', axis=1, inplace=True)
nn = df.set_index(['year','week'])
nn.drop("weekly", axis=1, inplace=True)
f, a = plt.subplots(3,1)
nn.xs('2018').plot(kind='bar',ax=a[0])
nn.xs('2019').plot(kind='bar',ax=a[1])
nn.xs('2020').plot(kind='bar',ax=a[2])
plt.show()
plt.close()
This attempt didn't work for me. Instead of explicitly selecting years like 2018, 2019, ..., is there a more efficient way to make stacked bar plots for a dataframe with a multi-level index?
desired output
This is the desired stacked bar plot, using the year 2018 as an example.
How can I get this stacked bar plot?

Try this:
nn.groupby(level=0).plot.bar(stacked=True)
or, to keep the year from appearing as part of a tuple on the x axis:
for n, g in nn.groupby(level=0):
    g.loc[n].plot.bar(stacked=True)
Update per request in comments
for n, g in nn.groupby(level=0):
    ax = g.loc[n].plot.bar(stacked=True, title=f'{n} Year', figsize=(8,5))
    ax.legend(loc='lower center')
Change layout position
fig, ax = plt.subplots(1,3)
axi = iter(ax)
for n, g in nn.groupby(level=0):
    axs = next(axi)
    g.loc[n].plot.bar(stacked=True, title=f'{n}', figsize=(15,8), ax=axs)
    axs.legend(loc='lower center')

Try using loc instead of xs:
f, a = plt.subplots(3,1)
for x, ax in zip(nn.index.unique('year'),a.ravel()):
    nn.loc[x].plot.bar(stacked=True, ax=ax)
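For reference, here is a self-contained sketch of the loc-based loop on made-up data. The column names beef and pork and all the numbers are hypothetical stand-ins for the real export columns. The sketch also assumes the year level holds integers; if that is true of the real file, a string key such as '2018' would raise a KeyError, which may be why the xs attempt failed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up weekly export figures; column names are hypothetical stand-ins.
df = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020],
    "week": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "beef": [10, 12, 9, 11, 13, 8, 7, 9, 10],
    "pork": [5, 6, 7, 4, 6, 5, 6, 7, 8],
})
nn = df.set_index(["year", "week"])

years = nn.index.unique("year")
# squeeze=False always returns a 2-D array of axes, even for a single year.
fig, axes = plt.subplots(len(years), 1, figsize=(8, 3 * len(years)), squeeze=False)
for year, ax in zip(years, axes.ravel()):
    # nn.loc[year] drops the year level, leaving week on the x axis.
    nn.loc[year].plot.bar(stacked=True, ax=ax, title=str(year))
fig.tight_layout()
```

The number of subplots follows the number of distinct years, so no year has to be written out by hand.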

Matplotlib both axis values overlapping

I have just started using Matplotlib. I imported a CSV file from a URL; it contains 190+ entries, each a country together with the region it belongs to (for example, India in Asia). I am able to plot all the data, but with this many entries the x-axis and y-axis labels overlap each other and the plot becomes messy.
Code:
country_cols = ['Country', 'Region']
country_data = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv",names=country_cols)
country_list = country_data.Country.tolist()
region_list = country_data.Region.tolist()
plt.plot(region_list,country_list)
The output looks like this:
For the sake of learning I am using a simple line chart. I would also like to know which graph type should be used to represent such data. Any help would be appreciated.

I think you need fig.autofmt_xdate(), which rotates the x-axis tick labels and right-aligns them so they are less likely to overlap.
Try this code:
import pandas as pd
import matplotlib.pyplot as plt
country_cols = ['Country', 'Region']
country_data = pd.read_csv("https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv",names=country_cols)
country_list = country_data.Country.tolist()
region_list = country_data.Region.tolist()
fig = plt.figure()
plt.plot(region_list,country_list)
fig.autofmt_xdate()
plt.show()
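A small offline sketch of the same call, assuming a handful of made-up region and country values in place of the downloaded file:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Made-up stand-ins for the Region and Country columns.
regions = ["ASIA", "EUROPE", "AFRICA", "OCEANIA", "NORTH AMERICA"]
countries = ["India", "France", "Kenya", "Fiji", "Canada"]

fig, ax = plt.subplots()
ax.plot(regions, countries, marker="o", linestyle="")
# Rotates the x tick labels (30 degrees by default) and right-aligns them.
fig.autofmt_xdate()
```

Note that autofmt_xdate only touches the x axis, so with 190+ country names the y-axis labels may still need a taller figure or a smaller font.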

Matplotlib scatter plot gives ValueError(msg.format(c.shape, x.size, y.size)) when specifying colors

I'm stuck on an assignment that has us use data from
https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv
Using matplotlib I need to:
Create a scatterplot of the fare paid against age, with the point color varying by gender.
I am having trouble getting the color to vary by gender.
This is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
titanic = pd.read_csv('https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv')
plt.scatter(titanic['age'],titanic['fare'],alpha=0.5)
plt.show()
When I tried this:
plt.scatter(titanic['age'],titanic['fare'], alpha=0.5,c=titanic['sex'])
plt.show()
it failed with: raise ValueError(msg.format(c.shape, x.size, y.size))

You're nearly there. You cannot pass strings to c unless they're valid colors. You can either pass a list of valid colors, or pass integer codes obtained by factorizing your column. For example:
plt.scatter(titanic['age'], titanic['fare'], alpha=0.5, c=pd.factorize(titanic['sex'])[0])
Or,
titanic = titanic.dropna(subset=['sex'])
mapping = {'male' : 'blue', 'female' : 'red'}
plt.scatter(titanic['age'], titanic['fare'], alpha=0.5, c=titanic['sex'].map(mapping))
plt.show()
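Both options can be tried offline with a few made-up rows standing in for the Titanic columns (the values below are illustrative, not taken from the dataset):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up passengers standing in for the age, fare and sex columns.
titanic = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "sex": ["male", "female", "female", "female", "male"],
})

# Option 1: turn each category into an integer code; matplotlib then
# picks the colors from its colormap.
codes, labels = pd.factorize(titanic["sex"])

# Option 2: map each category to a named color.
mapping = {"male": "blue", "female": "red"}
colors = titanic["sex"].map(mapping).tolist()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(titanic["age"], titanic["fare"], alpha=0.5, c=codes)
ax2.scatter(titanic["age"], titanic["fare"], alpha=0.5, c=colors)
```

The mapping version gives explicit control over which color each gender gets, while factorize needs no knowledge of the category names.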

You will need to remove the NaN row, which is the last row here, then:
url= "https://raw.githubusercontent.com/Geoyi/Cleaning-Titanic-Data/master/titanic_original.csv"
titanic = pd.read_csv(url, skipfooter=1, engine='python')
colors = {'male':'red', 'female':'blue'}
fig2, ax2 = plt.subplots()
ax2.scatter(titanic['age'], titanic['fare'], alpha=0.5, c=titanic['sex'].apply(lambda x: colors[x]))

Parsing CSV file using Panda

I have been using matplotlib for quite some time now and it is great. However, I want to switch to pandas, and my first attempt at it didn't go so well.
My data set looks like this:
sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
etc.....
First, I want to search the CSV for every row with the name mike and put those rows in their own list. With such lists I want to be able to do some math, for example sam[3] + winter[4] or sam[1]/10. The last part would be to plot the columns against each other.
Going through this page
http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
The only examples I see assume a column header; however, my file has no headers. I only know the position within a row of the values I want.
So my questions are:
How do I create a list for each name (sam, winter, summer)?
Is this method efficient if my CSV has millions of data points?
Can I use matplotlib to plot a pandas dataframe?
i.e.:
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike[1], winter[3], label='Mike vs Winter speed', color = 'red')

You can read a csv without headers:
data=pd.read_csv(filepath, header=None)
Columns will be numbered starting from 0.
Selecting and filtering:
all_summers = data[data[0]=='summer']
If you want to do some operations grouping by the first column, it will look like this:
data.groupby(0).sum()
data.groupby(0).count()
...
Selecting a row after grouping:
sums = data.groupby(0).sum()
sums.loc['sam']
Plotting example:
import matplotlib.pyplot as plt
sums.plot()
plt.show()
For more details about plotting, see: http://pandas.pydata.org/pandas-docs/version/0.18.1/visualization.html
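The calls above can be tried on the sample rows from the question. This sketch assumes the rows are held in an in-memory string rather than a file, so that it runs on its own:

```python
import io

import pandas as pd

# The sample rows from the question, held in memory instead of a file.
raw = """sam,123,184,2.6,543
winter,124,284,2.6,541
summer,178,384,2.6,542
summer,165,484,2.6,544
winter,178,584,2.6,545
sam,112,684,2.6,546
zack,145,784,2.6,547
mike,110,984,2.6,548
"""
data = pd.read_csv(io.StringIO(raw), header=None)  # columns are labelled 0..4

all_summers = data[data[0] == "summer"]  # boolean filter on the name column
sums = data.groupby(0).sum()             # one row per name, numeric columns summed

print(sums.loc["sam"])                             # totals for every sam row
print(sums.loc["sam", 3] + sums.loc["winter", 4])  # arithmetic across names
```

Note that sums holds totals over all rows that share a name, which is not the same as picking a single row's value as in the question's sam[3].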

df = pd.read_csv(filepath, header=None)
mike = df[df[0]=='mike'].values.tolist()
winter = df[df[0]=='winter'].values.tolist()
Then you can plot those lists as you wanted to above:
fig1 = plt.figure(figsize= (10,10))
ax = fig1.add_subplot(211)
ax.plot(mike, winter, label='Mike vs Winter speed', color = 'red')
