I'm trying to create a histogram from grouped data in pandas.
So far I have been able to create a standard line plot, but I can't figure out how to get a histogram (bar chart) the same way. I would like two age histograms, one for the people who survived the Titanic sinking and one for those who didn't, to see whether the age distribution differs.
Source data:
https://www.udacity.com/api/nodes/5454512672/supplemental_media/titanic-datacsv/download
So far my code:
import pandas as pn
titanic = pn.read_csv('titanic_data.csv', index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
SurvivedAge= titanic.groupby(['Survived','Age']).size()
SurvivedAge=SurvivedAge.reset_index()
SurvivedAge.columns=['Survived', 'Age', 'Num']
SurvivedAge.index=(SurvivedAge['Survived'])
del SurvivedAge['Survived']
SurvivedAget=SurvivedAge.reset_index().pivot(index='Age', columns='Survived', values='Num')
SurvivedAget.plot()
When I try to plot a histogram from this data set, I get strange results.
SurvivedAget.hist()
I would be grateful for help with that.
You can do:
import pandas as pd
titanic = pd.read_csv('titanic_data.csv')
survival_by_age = titanic.groupby(['Age', 'Survived']).size().unstack('Survived')
survival_by_age.columns = ['No', 'Yes']
survival_by_age.plot.bar(title='Survival by Age')
This gives a bar chart of passenger counts per age, split by survival, which you can tweak further. You could also consolidate the fractional ages so you can use integer indices, or bin the data into, say, 5-year age spans to get more user-friendly output. There is also seaborn, which offers various types of distribution plots.
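The binning idea mentioned above could look roughly like this. It is a minimal sketch on a small made-up sample rather than the real CSV; the 5-year bin edges and the name `survival_by_age_group` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up stand-in for the Titanic data; only 'Age' and 'Survived' matter here.
titanic = pd.DataFrame({
    "Age": [2.0, 4.5, 22.0, 23.5, 24.0, 31.0, 38.0, 54.0, 58.0, None],
    "Survived": [1, 0, 0, 1, 1, 0, 1, 0, 1, 0],
})

# Assign each passenger to a 5-year span: [0, 5), [5, 10), ...
bins = list(range(0, 85, 5))
age_group = pd.cut(titanic["Age"], bins=bins, right=False)

# Count passengers per (age span, survival) pair, one column per outcome.
# Rows with a missing age are dropped by groupby.
survival_by_age_group = (
    titanic.groupby([age_group, "Survived"], observed=False)
    .size()
    .unstack("Survived", fill_value=0)
    .rename(columns={0: "No", 1: "Yes"})
)

# Show only the age spans that contain at least one passenger.
print(survival_by_age_group[survival_by_age_group.sum(axis=1) > 0])
```

Calling `survival_by_age_group.plot.bar(title='Survival by Age')` on the result should then draw one pair of bars per age span instead of one per distinct age.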
Related
I am trying to plot random rows in a dataset, where the data consists of data collated across different dates. I have plotted it in such a way that the x-axis is labelled for the specific dates, and there is no interpolation between dates.
The issue I am having is that the values plotted by matplotlib do not match the values in the dataset. I am unsure what is happening here. Would anyone be able to provide some insight, and possibly explain how to fix it?
I have attached an image of the dataset and the plot, with the code contained below.
The code for generating the x-ticks, is as follows:
In: #creating a flat dates object such that dates are integer objects
flat_Dates_dates = flat_Dates[2:7]
flat_Dates_dates
Out: [20220620, 20220624, 20220627, 20220701, 20220708]
In: #creating datetime object(pandas, not datetime module) to only plot specific dates and remove interpolation of dates
date_obj_pd = pd.to_datetime(flat_Dates_dates, format=("%Y%m%d"))
Out: DatetimeIndex(['2022-06-20', '2022-06-24', '2022-06-27', '2022-07-01',
'2022-07-08'],
dtype='datetime64[ns]', freq=None)
As you can see from the dataset, the plotted trends should not take that form, the data values are wildly different from where they should be on the graph.
Edit: Apologies, I forgot to mention x = date_obj_pd - which is why I added the code, essentially just the array of datetime objects.
y is just the name of the pandas DataFrame (data table) I have included in the image.
You are plotting columns instead of rows. The blue line contains elements 1:7 from the first column.
If you transpose the DataFrame, you should get the desired result:
plt.plot(x, y[1:7].transpose(), 'o--')
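To see why the transpose matters, here is a minimal sketch with a made-up table (the names `series_a` and `series_b` and all values are assumptions, not the data from the question). matplotlib treats a 2-D `y` as one line per column, so the rows have to line up with the x values.

```python
import pandas as pd

x = pd.to_datetime(["20220620", "20220624", "20220627"], format="%Y%m%d")

# Hypothetical layout: each ROW is one series, each COLUMN is one date.
y = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [10.0, 20.0, 30.0]],
    index=["series_a", "series_b"],
    columns=x,
)

as_given = y.to_numpy()
transposed = y.transpose().to_numpy()

print(as_given.shape)    # (2, 3): rows are series, not points along x
print(transposed.shape)  # (3, 2): one row per date, one column per series

# After transposing, each column holds one original row,
# which is what plt.plot(x, ...) draws as a single line.
first_line = transposed[:, 0]
print(first_line)
```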
I have 4 DataFrames for different locations (Indonesia, Singapore, Malaysia and Total), each containing the percentages of the top 5 revenue-generating products. I have plotted them separately.
I want to combine them in one plot where the x-axis shows the different locations and the top revenue-generating products for each location.
I have printed the DataFrames below; as you can see, they contain different products.
print(Ind_top_cat, Sin_top_cat, Mal_top_cat, Tot_top_cat)
Category Amt
M020P 0.144131
MH 0.099439
ML 0.055052
PB 0.050057
PPDR 0.048315
Category Amt
ML 0.480781
M015 0.073034
PPDR 0.035412
M025 0.033418
M020 0.031836
Category Amt
TN 0.343650
PPDR 0.190773
NMCN 0.118425
M015 0.047539
NN 0.038140
Category Amt
M020P 0.158575
MH 0.092012
ML 0.064179
PPDR 0.050803
PB 0.044301
Thanks to joelostblom I was able to construct a plot, however, there are still some issues.
all_countries = pd.concat([Ind_top_cat, Sin_top_cat, Mal_top_cat, Tot_top_cat])
all_countries['Category'] = all_countries.index
sns.barplot(x='Country', y='Amt',hue = 'Category',data=all_countries)
Is there any way to put the legend values on the x-axis (there is no need to colour the categories; I want to colour the countries instead) and to put the data values on top of the bars? Also, the bars are not centred, and I have no idea how to fix that.
You could create a new column in each dataframe with the country name, e.g.
Ind_top_cat['Country'] = 'Indonesia'
Sin_top_cat['Country'] = 'Singapore'
Then you can create one big dataframe by concatenating the country dataframes together:
all_countries = pd.concat([Ind_top_cat, Sin_top_cat])
And finally, you can use a high-level plotting library such as seaborn to assign one column to the x-axis location and one to the color of the bars (via the hue parameter):
import seaborn as sns
sns.barplot(x='Country', y='Amt', hue='Category', data=all_countries)
You can scroll down to the second example on this page to get an idea of what such a plot would look like.
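The tagging and concatenation steps can be sketched end to end as follows. This rebuilds two of the tables from the question by hand (truncated to three rows each) and assumes, as the question's code suggests, that `Category` is the index of each frame; the `frames` dict and the `wide` table are names introduced here for illustration.

```python
import pandas as pd

# Two of the per-country tables from the question, rebuilt by hand.
Ind_top_cat = pd.DataFrame(
    {"Amt": [0.144131, 0.099439, 0.055052]},
    index=pd.Index(["M020P", "MH", "ML"], name="Category"),
)
Sin_top_cat = pd.DataFrame(
    {"Amt": [0.480781, 0.073034, 0.035412]},
    index=pd.Index(["ML", "M015", "PPDR"], name="Category"),
)

# Tag each table with its country, then stack them into one long table.
frames = {"Indonesia": Ind_top_cat, "Singapore": Sin_top_cat}
all_countries = pd.concat(
    [frame.assign(Country=name) for name, frame in frames.items()]
).reset_index()

# One row per country, one column per category; NaN where a category
# is not in that country's top list.
wide = all_countries.pivot(index="Country", columns="Category", values="Amt")
print(wide)
```

The long table `all_countries` is the shape seaborn expects for `sns.barplot(x='Country', y='Amt', hue='Category', data=all_countries)`, while `wide.plot.bar()` would be a pandas-only way to get grouped bars per country.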
I am trying to automate the plotting procedure for a large dataframe matrix. The goal is to plot each column against every other column. Each column represents a variable. See also the image below.
For example: sex vs age, sex vs BMI, sex vs smoke, sex vs type and so on.
For the sake of clarity, I have simplified the problem to the image below:
Initially, I tried to plot each combination by hand, but this is a rather time-consuming exercise and not what I want.
I also tried this (not working):
variables = ["Sex", "Age", "BMI"]
for variable in variables:
    plt.scatter(df.variable, df.variable)
    plt.xlabel('variable')
    plt.ylabel('variable')
    plt.title('variable vs. variable')
    plt.show()
Any help is welcome!
PS: If it would be a simple exercise to incorporate a linear regression on the combination of variables as well, that would also be appreciated.
Greetings,
Nadia
What you coded plots each column against itself. What you described is a nested loop. A simple upgrade is
col_choice = ["Sex", "Age", "BMI"]
for pos, axis1 in enumerate(col_choice):   # Pick a first col
    for axis2 in col_choice[pos+1:]:       # Pick a later col
        plt.scatter(df.loc[:, axis1], df.loc[:, axis2])
        plt.show()                         # One figure per pair
I think this generates a series acceptable to scatter.
Does that help? If you want to be more "Pythonic", then look into itertools.product to generate your column choices.
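As a sketch of the itertools route: the version below uses `itertools.combinations` rather than `itertools.product`, since `combinations` yields each unordered pair of distinct columns exactly once, whereas `product` would also give self-pairs and both orderings. The column names are the ones from the question.

```python
from itertools import combinations

col_choice = ["Sex", "Age", "BMI"]

# Every unordered pair of distinct columns, each pair exactly once.
pairs = list(combinations(col_choice, 2))

# A title per pair, ready to pass to plt.title() in the plotting loop.
titles = [f"{x_col} vs. {y_col}" for x_col, y_col in pairs]

for (x_col, y_col), title in zip(pairs, titles):
    print(x_col, y_col, "->", title)
```

Inside that loop, `plt.scatter(df[x_col], df[y_col])` followed by `plt.title(title)` and `plt.show()` would draw one scatter plot per pair.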
You could do something like this:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create dummy dataframe, or load your own with pd.read_csv()
columns = ["sex", "age", "BMI", "smoke", "type"]
data = pd.DataFrame(np.array([[1,0,0,1,0], [23,16,94,18,24], [32, 26, 28, 23, 19], [0,1,1,1,0], [1,2,2,2,1]]).T, columns=columns)
x_col = "sex"
y_columns = ["age", "BMI", "smoke"]
for y_col in y_columns:
    figure = plt.figure()  # start a new figure for each pair
    ax = plt.gca()
    ax.scatter(data[x_col], data[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))
plt.show()  # display all figures
Basically, if you have your dataset saved as a .csv file, you can load it with pandas using pd.read_csv(), use the column names as keys to access the corresponding columns, and iterate over those (here I created a dummy dataframe just for the sake of the example).
Regarding the linear regression part, you should check out the scikit-learn library. It has a lot of models for many different tasks such as regression, classification and clustering.
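For a single pair of numeric columns, a straight-line fit does not strictly need scikit-learn. The sketch below swaps in `numpy.polyfit` as a lighter alternative to the library suggested above; the `age` and `bmi` arrays are made-up values standing in for two columns of the dataframe.

```python
import numpy as np

# Made-up values standing in for two numeric columns.
age = np.array([16.0, 18.0, 23.0, 24.0, 94.0])
bmi = np.array([26.0, 23.0, 32.0, 19.0, 28.0])

# Least-squares straight line: bmi ~ slope * age + intercept.
slope, intercept = np.polyfit(age, bmi, deg=1)

# Fitted values and residuals, e.g. for drawing the line over a scatter plot.
fitted = slope * age + intercept
residuals = bmi - fitted

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Plotting `fitted` against `age` on the same axes as the scatter plot would overlay the regression line.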
I have a dataset with mostly non-numeric columns. I would like to create a visualization of it, but I am getting an error message.
My data set looks like this
|plant_name|Customer_name|Job site|Delivery.Date|DeliveryQuantity|
|SN13|John|Sweden|01.01.2019|6|
|SN14|Ruth|France|01.04.2018|4|
|SN15|Jane|Serbia|01.01.2019|2|
|SN11|Rome|Denmark|01.04.2018|10|
|SN14|John|Sweden|03.04.2018|5|
|SN15|John|Sweden|04.09.2019|7|
I need to create a lineplot to show how many times John made a purchase using Delivery Date as my timeline (x-axis)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option("display.max_rows", 5)
hr_data = pd.read_excel(r"D:\data\Days_Calculation.xlsx", parse_dates=True)
x = hr_data['DeliveryDate']
y = hr_data ['Customer_name']
sns.lineplot(x,y)
Error: No numeric types to aggregate
My expected result should be a line graph in which John's marker appears on the timeline (Delivery Date) at "01.01.2019", "03.04.2018" and "04.09.2019".
Another instance:
To plot string vs float, for example the total quantity (DeliveryQuantity) vs customer name, how can one approach this?
How does one format the axis spacing of a plot (not the labels)?
Why not make Delivery Date a timestamp object instead of a string?
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"])
Now you have more plotting options.
Working with John's data:
john_data = hr_data[hr_data["Customer_name"]=="John"]
sns.countplot(x=john_data["Delivery.Date"])
Generally speaking, you have to aggregate something when working with categorical data. Whether you are counting names in a column, summing the number of orders, or ranking some categories, the result is still numeric data.
plot_data = hr_data.pivot_table(index='Delivery.Date', columns='Customer_name', values='DeliveryQuantity', aggfunc='sum')
plot_data.plot(legend=False)
plt.xticks(LISTOFVALUESFORXRANGE)  # placeholder: the tick positions you want on the x-axis
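Put together on the sample rows from the question, the aggregation might look like the sketch below. It assumes the column is named `Delivery.Date` as in the table header and that the dates are written day.month.year (an assumption; the sample values are ambiguous), and it leaves out the Job site column for brevity.

```python
import pandas as pd

# The sample rows from the question.
hr_data = pd.DataFrame({
    "plant_name": ["SN13", "SN14", "SN15", "SN11", "SN14", "SN15"],
    "Customer_name": ["John", "Ruth", "Jane", "Rome", "John", "John"],
    "Delivery.Date": ["01.01.2019", "01.04.2018", "01.01.2019",
                      "01.04.2018", "03.04.2018", "04.09.2019"],
    "DeliveryQuantity": [6, 4, 2, 10, 5, 7],
})

# Assumption: day.month.year. Being explicit avoids month-first guessing.
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"], format="%d.%m.%Y")

# Number of purchases John made on each date (a numeric series that can be plotted).
john_data = hr_data[hr_data["Customer_name"] == "John"]
purchases_per_date = john_data.groupby("Delivery.Date").size()

# Total quantity per date and customer: one column per customer.
plot_data = hr_data.pivot_table(
    index="Delivery.Date",
    columns="Customer_name",
    values="DeliveryQuantity",
    aggfunc="sum",
)

print(purchases_per_date)
print(plot_data)
```

Both results have a datetime index, so `purchases_per_date.plot(marker='o')` or `plot_data.plot()` should place the points on a proper time axis.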
I am looking at the famous Titanic dataset from the Kaggle competition found here: http://www.kaggle.com/c/titanic-gettingStarted/data
I have loaded and processed the data using:
# import required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# load the data from the file
df = pd.read_csv('./data/train.csv')
# import the scatter_matrix functionality
from pandas.plotting import scatter_matrix  # pandas.tools.plotting in very old pandas versions
# define colors list, to be used to plot survived either red (=0) or green (=1)
colors=['red','green']
# make a scatter plot
scatter_matrix(df,figsize=[20,20],marker='x',c=df.Survived.apply(lambda x:colors[x]))
df.info()
How can I add the categorical columns like Sex and Embarked to the plot?
You need to transform the categorical variables into numbers to plot them.
Example (assuming that the column 'Sex' holds the gender data, with 'M' for males and 'F' for females):
import numpy as np
df['Sex_int'] = np.nan
df.loc[df['Sex'] == 'M', 'Sex_int'] = 0
df.loc[df['Sex'] == 'F', 'Sex_int'] = 1
Now all males are represented by 0 and females by 1. Unknown genders (if there are any) stay NaN and will be ignored.
The rest of your code should process the updated dataframe nicely.
After googling and remembering something like the .map() function, I fixed it in the following way:
colors=['red','green'] # color codes for survived : 0=red or 1=green
# create mapping Series for gender so it can be plotted
gender = pd.Series([0,1],index=['male','female'])
df['gender']=df.Sex.map(gender)
# create mapping Series for Embarked so it can be plotted
embarked = pd.Series([0,1,2,3],index=df.Embarked.unique())
df['embarked']=df.Embarked.map(embarked)
# add survived also back to the df
df['survived']=target
Now I can plot it again and drop the added columns afterwards.
Thanks, everyone, for responding.
Here is my solution:
# convert string column to category
df.Sex = df.Sex.astype('category')
# create additional column for its codes
df['Sex_code'] = df.Sex.cat.codes
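A small self-contained sketch of the category-codes approach, on a made-up four-row frame (the values are assumptions in the style of the Titanic columns). It also shows how missing values are encoded.

```python
import pandas as pd

# Made-up rows in the style of the Titanic 'Sex' and 'Embarked' columns.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

# Codes follow the sorted category order: female=0, male=1.
df["Sex_code"] = df["Sex"].astype("category").cat.codes

# Sorted categories are C=0, Q=1, S=2; a missing value gets the code -1.
df["Embarked_code"] = df["Embarked"].astype("category").cat.codes

print(df)
```

Because missing values become -1 rather than NaN, it may be worth filtering or replacing them before plotting so they do not show up as an extra category.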