I am looking at the famous Titanic dataset from the Kaggle competition found here: http://www.kaggle.com/c/titanic-gettingStarted/data
I have loaded and processed the data using:
# import required libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# load the data from the file
df = pd.read_csv('./data/train.csv')
# import the scatter_matrix functionality
from pandas.plotting import scatter_matrix  # pandas.tools.plotting in very old pandas versions
# define colors list, to be used to plot survived either red (=0) or green (=1)
colors=['red','green']
# make a scatter plot
scatter_matrix(df,figsize=[20,20],marker='x',c=df.Survived.apply(lambda x:colors[x]))
df.info()
How can I add the categorical columns like Sex and Embarked to the plot?
You need to transform the categorical variables into numbers to plot them.
Example (assuming that the column 'Sex' is holding the gender data, with 'M' for males & 'F' for females)
import numpy as np
df['Sex_int'] = np.nan
df.loc[df['Sex'] == 'M', 'Sex_int'] = 0
df.loc[df['Sex'] == 'F', 'Sex_int'] = 1
Now all males are represented by 0 and females by 1. Unknown genders (if there are any) stay NaN and will be ignored.
The rest of your code should process the updated dataframe nicely.
After some googling, and remembering the .map() function, I fixed it in the following way:
colors=['red','green'] # color codes for survived : 0=red or 1=green
# create mapping Series for gender so it can be plotted
gender = pd.Series([0,1],index=['male','female'])
df['gender']=df.Sex.map(gender)
# create mapping Series for Embarked so it can be plotted
embarked = pd.Series([0,1,2,3],index=df.Embarked.unique())
df['embarked']=df.Embarked.map(embarked)
# add survived also back to the df
df['survived']=target
Now I can plot it again and drop the added columns afterwards.
Thanks, everyone, for responding.
Here is my solution:
# convert string column to category
df.Sex = df.Sex.astype('category')
# create additional column for its codes
df['Sex_code'] = df.Sex.cat.codes
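The category-code approach extends to several columns at once. Here is a minimal sketch, assuming a tiny made-up frame in place of the real train.csv (the values and the `_code` suffix are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Titanic frame (values invented for illustration).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Convert every string column to integer category codes.
# Categories are sorted alphabetically, so 'female' -> 0 and 'male' -> 1.
for col in ["Sex", "Embarked"]:
    df[col + "_code"] = df[col].astype("category").cat.codes

print(df)
```

Note that missing values receive the code -1, so you may want to replace -1 with NaN before passing the frame to scatter_matrix.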
I'm currently struggling with my dataframe in Pandas (new to this).
I have a three-column dataframe: Categorical_data1, Categorical_data2, Output (2400 rows x 3 columns).
Both categorical columns (the inputs) are strings, and the output depends on the inputs.
Categorical_data1 = ['type1','type2', ... , 'type6']
Categorical_data2 = ['rain1','rain2', 'rain3','rain4']
So 24 possible pairs of categorical data.
I want to plot a heatmap (using seaborn, for instance) of the number of 0s in the output for each pair of categories (Cat_data1, Cat_data2). I tried several things using boolean masks.
I tried to figure out how to compute the exact number of 0s for one pair:
count = ((df['Output'] == 0) & (df(['Categorical_Data1'] == 'type1') & (df(['Categorical_Data2'] == 'rain1')))).sum()
but it failed.
The output takes values in [0,1], with a large number of 0s (around 1200 out of 2400). My goal is to have something like this Source by jcdoming (I can't upload images...) with months = Categorical Data1, years = Categorical Data2, and the cell values = number of 0s in the output.
Thank you for your help.
Use a seaborn countplot. It gives counts of categorical data occurrences in a certain feature. Use hue to add in the second feature to the visualization:
import seaborn as sns
sns.countplot(data=dataframe, x='Categorical_Data1', hue='Categorical_Data2')
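If you want the table of zero counts itself (the matrix a heatmap needs), one possible sketch is to filter the rows with Output == 0 and cross-tabulate the two categorical columns. The sample frame below is invented for illustration; the column names follow the question:

```python
import pandas as pd

# Hypothetical sample; the real frame has 2400 rows.
df = pd.DataFrame({
    "Categorical_Data1": ["type1", "type1", "type2", "type2", "type1"],
    "Categorical_Data2": ["rain1", "rain1", "rain1", "rain2", "rain2"],
    "Output": [0, 0, 1, 0, 1],
})

# Keep only the rows whose output is 0, then count them per category pair.
zeros = df[df["Output"] == 0]
zero_counts = pd.crosstab(zeros["Categorical_Data1"], zeros["Categorical_Data2"])

print(zero_counts)
```

The resulting frame can then be passed to seaborn, for example sns.heatmap(zero_counts, annot=True, fmt="d").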
I have a dataframe with categorical columns and a target column with two categories - 0 and 1.
dfx.target.value_counts()
0 8062
1 3919
Name: target, dtype: int64
I tried to create a parallel categories plot for them using plotly. This works fine; I am pasting my target column's output:
fig = px.parallel_categories(dfx)
fig.show()
Then I tried to color-code them. According to the documentation, we can specify a column's name.
fig = px.parallel_categories(dfx, color = 'target')
fig.show()
However, when I specify the color scheme (which can be done using a column's name), I get the wrong distribution in the target column:
An additional category appears in gray, and the counts of 0 and 1 in the target column are wrong.
Note: There are no NA values in the data.
Update: It turns out it was a version issue. After updating the package, I was able to do it.
I generated a sample dataset to plot.
Following your code, specifying color="target" caused errors (invalid colors).
I changed it to use the pandas Series map() method to map each target value to a color.
This required adding the dimensions parameter so that the color series was not added as an additional category in the trace.
The distributions are identical with or without the color parameter.
import pandas as pd
import numpy as np
import plotly.express as px
# build a dataframe for use in plot
V=6
a = [chr(i) for i in range(ord("A"), ord("A")+V)]
R=2000
dfx = pd.DataFrame({c:np.random.choice(a[0:V//(i+1)], R) for i, c in enumerate(["source","interim","target"])})
# the plot - use "target" for colors
px.parallel_categories(
    dfx,
    dimensions=dfx.columns,
    color=dfx["target"].map(
        {
            l: px.colors.qualitative.Light24[i % len(px.colors.qualitative.Light24)]
            for i, l in enumerate(dfx["target"].unique())
        }
    ),
)
I am trying to automate the plotting procedure for a large dataframe. The goal is to plot each column against every other column. Each column represents a variable. See also the image below.
For example: sex vs age, sex vs BMI, sex vs smoke, sex vs type and so on.
For the sake of clarity, I have simplified the problem to the image below:
enter image description here
Initially, I tried to plot each combination by hand. But this is a rather time-consuming exercise and not what I want.
I tried also this (not working):
variables = ["Sex", "Age", "BMI"]
for variable in variables:
    plt.scatter(df.variable, df.variable)
    plt.xlabel('variable')
    plt.ylabel('variable')
    plt.title('variable vs. variable')
    plt.show()
Any help is welcome!
PS: If it would be a simple exercise to incorporate a linear regression on each combination of variables as well, that would also be appreciated.
Greetings,
Nadia
What you coded plots each column against itself. What you described is a nested loop. A simple upgrade is
col_choice = ["Sex", "Age", "BMI"]
for pos, axis1 in enumerate(col_choice):    # Pick a first col
    for axis2 in col_choice[pos+1:]:        # Pick a later col (no enumerate here: we need the name, not a tuple)
        plt.scatter(df.loc[:, axis1], df.loc[:, axis2])
I think this generates a series acceptable to scatter.
Does that help? If you want to be more "Pythonic", look into itertools.combinations to generate your column pairs (itertools.product would also yield each column paired with itself).
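As a rough sketch of the itertools route: itertools.combinations yields each unordered pair of distinct columns exactly once, which matches what the nested loop above produces. Only the column names are needed to show the pairing, so no dataframe is used here:

```python
from itertools import combinations

col_choice = ["Sex", "Age", "BMI"]

# Each unordered pair of distinct columns appears exactly once.
pairs = list(combinations(col_choice, 2))

for axis1, axis2 in pairs:
    # In the real script this is where plt.scatter(df[axis1], df[axis2]) would go.
    print(f"{axis1} vs. {axis2}")
```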
You could do something like this:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Create dummy dataframe, or load your own with pd.read_csv()
columns = ["sex", "age", "BMI", "smoke", "type"]
data = pd.DataFrame(np.array([[1,0,0,1,0], [23,16,94,18,24], [32, 26, 28, 23, 19], [0,1,1,1,0], [1,2,2,2,1]]).T, columns=columns)
x_col = "sex"
y_columns = ["age", "BMI", "smoke"]
for y_col in y_columns:
    fig, ax = plt.subplots()  # create a fresh figure and axes for each pair
    ax.scatter(data[x_col], data[y_col], label=y_col)
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    ax.set_title("{} vs {}".format(x_col, y_col))
    ax.legend()
    plt.show()
Basically, if you have your dataset saved as a .csv file, you can load it with pandas using pd.read_csv(), use the column names as keys to access the corresponding columns, and iterate over those (here I created a dummy dataframe just for the sake of it).
Regarding the linear regression part, you should check out the scikit-learn library. It has many models for tasks such as regression, classification, and clustering.
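For a quick straight-line fit without scikit-learn, a degree-1 numpy.polyfit gives the ordinary least-squares line. This is a swapped-in alternative, not the scikit-learn route mentioned above, and it reuses the dummy "age" and "BMI" values purely as an illustration:

```python
import numpy as np

# Dummy values taken from the "age" and "BMI" columns of the example frame.
age = np.array([23, 16, 94, 18, 24], dtype=float)
bmi = np.array([32, 26, 28, 23, 19], dtype=float)

# A degree-1 polynomial fit is an ordinary least-squares line.
slope, intercept = np.polyfit(age, bmi, 1)

# Fitted BMI value for every observed age.
fitted = slope * age + intercept

print(f"BMI ~ {slope:.3f} * age + {intercept:.3f}")
```

Inside the plotting loop, ax.plot(age, fitted) after ax.scatter(...) would overlay the line on the scatter plot.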
I have a dataset with mostly non-numeric columns. I would love to create a visualization for them, but I am getting an error message.
My data set looks like this
|plant_name|Customer_name|Job site|Delivery.Date|DeliveryQuantity|
|SN13|John|Sweden|01.01.2019|6|
|SN14|Ruth|France|01.04.2018|4|
|SN15|Jane|Serbia|01.01.2019|2|
|SN11|Rome|Denmark|01.04.2018|10|
|SN14|John|Sweden|03.04.2018|5|
|SN15|John|Sweden|04.09.2019|7|
I need to create a lineplot to show how many times John made a purchase using Delivery Date as my timeline (x-axis)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option("display.max_rows", 5)
hr_data = pd.read_excel(r"D:\data\Days_Calculation.xlsx", parse_dates = True)  # raw string so the backslashes are taken literally
x = hr_data['DeliveryDate']
y = hr_data ['Customer_name']
sns.lineplot(x,y)
Error: No numeric types to aggregate
My expected result should be a line graph like this:
John's markers will be present on the timeline (Delivery Date) at "01.01.2019", "03.04.2018" and "04.09.2019".
Another instance:
How can one plot a string against a float, for example the total quantity (DeliveryQuantity) vs. Customer Name?
How does one format the spacing of a plot's axes (not the labels)?
Why not make Delivery Date a timestamp object instead of a string?
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"])
Now you have plotting options.
Working with John's rows:
john_data = hr_data[hr_data["Customer_name"]=="John"]
sns.countplot(x=john_data["Delivery.Date"])  # recent seaborn versions require the keyword x=
Generally speaking, you have to aggregate something when working with categorical data. Whether you are counting names in a column, summing the number of orders, or ranking categories, the result is still numeric data.
plot_data = hr_data.pivot_table(index='DeliveryDate', columns='Customer_name', values='DeliveryQuantity', aggfunc='sum')
plot_data.plot(legend=False)
plt.xticks(LISTOFVALUESFORXRANGE)  # set the tick positions after the plot exists
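To make the aggregation idea concrete, here is a small sketch built from the sample rows in the question. It assumes the dates are written day.month.year (the question does not say), and the variable names are illustrative:

```python
import pandas as pd

# Rows copied from the sample table in the question.
hr_data = pd.DataFrame({
    "Customer_name": ["John", "Ruth", "Jane", "Rome", "John", "John"],
    "Delivery.Date": ["01.01.2019", "01.04.2018", "01.01.2019",
                      "01.04.2018", "03.04.2018", "04.09.2019"],
    "DeliveryQuantity": [6, 4, 2, 10, 5, 7],
})

# Assumption: dates are day.month.year.
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"], format="%d.%m.%Y")

# Number of purchases John made on each date (a numeric series indexed by date).
john = hr_data[hr_data["Customer_name"] == "John"]
purchases_per_date = john.groupby("Delivery.Date").size()

# Total delivered quantity per customer (string category vs. numeric value).
quantity_per_customer = hr_data.groupby("Customer_name")["DeliveryQuantity"].sum()

print(purchases_per_date)
print(quantity_per_customer)
```

Both results are numeric, so purchases_per_date.plot(marker="o") gives the timeline and quantity_per_customer.plot.bar() gives quantity per customer.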
I'm trying to create histogram from grouped data in pandas.
So far I was able to create a standard line plot, but I can't figure out how to get a histogram (bar chart) the same way. I would like to get two age histograms, one for the people who survived the Titanic disaster and one for those who didn't, to see if there is a difference in age distribution.
Source data:
https://www.udacity.com/api/nodes/5454512672/supplemental_media/titanic-datacsv/download
So far my code:
import pandas as pn
titanic = pn.DataFrame.from_csv('titanic_data.csv')
SurvivedAge= titanic.groupby(['Survived','Age']).size()
SurvivedAge=SurvivedAge.reset_index()
SurvivedAge.columns=['Survived', 'Age', 'Num']
SurvivedAge.index=(SurvivedAge['Survived'])
del SurvivedAge['Survived']
SurvivedAget=SurvivedAge.reset_index().pivot('Age', 'Survived','Num')
SurvivedAget.plot()
When I try to plot a histogram from this data set, I get strange results.
SurvivedAget.hist()
I would be grateful for help with that.
You can do:
import pandas as pd
titanic = pd.read_csv('titanic_data.csv')
survival_by_age = titanic.groupby(['Age', 'Survived']).size().unstack('Survived')
survival_by_age.columns = ['No', 'Yes']
survival_by_age.plot.bar(title='Survival by Age')
to get:
which you can further tweak. You could also consolidate the fractional ages so you can use integer indices, or bin the data into, say, 5-year age spans to get more user-friendly output. And then there is seaborn, with its various types of distribution plots.
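The 5-year binning suggestion could look roughly like this, using pd.cut. The sample frame is invented for illustration (the real data comes from titanic_data.csv), and the AgeBand column name is hypothetical:

```python
import pandas as pd

# Hypothetical sample; one passenger has no recorded age.
titanic = pd.DataFrame({
    "Age": [2, 4, 22, 23.5, 24, 38, 40, None],
    "Survived": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Bin edges 0, 5, ..., 80; right=False makes the intervals [0, 5), [5, 10), ...
bins = list(range(0, 85, 5))
titanic["AgeBand"] = pd.cut(titanic["Age"], bins=bins, right=False)

# Count passengers per age band and survival flag; rows without an age are dropped.
survival_by_band = (
    titanic.groupby(["AgeBand", "Survived"], observed=True)
    .size()
    .unstack("Survived", fill_value=0)
)
survival_by_band.columns = ["No", "Yes"]

print(survival_by_band)
```

Calling survival_by_band.plot.bar(title='Survival by Age Band') then draws one pair of bars per age band instead of one per distinct age.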