Pandas + Seaborn: count the number of 0s for each pair of categorical values - python

I'm currently struggling with a dataframe in pandas (I'm new to this).
I have a dataframe with 3 columns: Categorical_Data1, Categorical_Data2 and Output (2400 rows x 3 columns).
Both categorical columns (the inputs) are strings, and the output depends on the inputs.
Categorical_Data1 = ['type1', 'type2', ..., 'type6']
Categorical_Data2 = ['rain1', 'rain2', 'rain3', 'rain4']
So there are 24 possible pairs of categories.
I want to plot a heatmap (using seaborn, for instance) of the number of 0s in Output for each pair (Categorical_Data1, Categorical_Data2). I tried several things using boolean masks.
I tried to work out how to compute the exact number of 0s for one pair:
count = ((df['Output'] == 0) & (df(['Categorical_Data1'] == 'type1') & (df(['Categorical_Data2'] == 'rain1')))).sum()
but it failed.
Output takes values in [0,1], with a large number of 0s (around 1200 out of 2400). My goal is to get something like the heatmap by jcdoming (I can't upload images...), with months = Categorical_Data1, years = Categorical_Data2, and the cell values being the number of 0s in Output.
Thank you for your help.

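Use a seaborn countplot. It shows how many rows fall into each category of a feature, so keep only the rows where Output is 0 before counting. Use hue to add the second feature to the visualization:
import seaborn as sns
# keep only the rows whose Output is 0, then count them per category pair
zeros = dataframe[dataframe['Output'] == 0]
sns.countplot(data=zeros, x='Categorical_Data1', hue='Categorical_Data2')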
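Since the question asks for a heatmap, here is a minimal sketch of one way to build the table of counts behind it. It assumes the column names Categorical_Data1, Categorical_Data2 and Output from the question; the tiny sample frame and the helper names zero_counts and plot_zero_heatmap are made up for illustration.

```python
import pandas as pd


def zero_counts(df):
    """Return a table of how many rows have Output == 0 for each category pair."""
    is_zero = df["Output"].eq(0)  # boolean mask, True where Output is 0
    # Summing booleans counts the True values; pairs that occur in the data
    # but never have Output == 0 get an explicit 0.
    return (
        is_zero.groupby([df["Categorical_Data1"], df["Categorical_Data2"]])
        .sum()
        .unstack(fill_value=0)
    )


def plot_zero_heatmap(counts):
    """Draw the counts as an annotated heatmap (needs seaborn and matplotlib)."""
    import seaborn as sns  # imported here so the counting part runs without seaborn

    return sns.heatmap(counts, annot=True, fmt="d")


# Hypothetical sample data, only to show the shape of the result
sample = pd.DataFrame({
    "Categorical_Data1": ["type1", "type1", "type2", "type2", "type1"],
    "Categorical_Data2": ["rain1", "rain1", "rain1", "rain2", "rain2"],
    "Output": [0, 0, 1, 0, 1],
})
counts = zero_counts(sample)
```

Calling plot_zero_heatmap(counts) followed by plt.show() should then give a grid with Categorical_Data1 on the rows and Categorical_Data2 on the columns. A pair that never occurs in the data at all is also filled with 0 by unstack.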

Related

How to plot values from same column using pandas?

I'm a beginner with pandas and python in general.
I would like to make a scatter plot of the column "vol" such that the x-axis values are the ones where the column "reg" is 1, and the y-axis values are the ones where "reg" is 0.
I'd appreciate your help.
vol            dx   reg
4324.208797    CN   1
3805.078032    CN   1
3820.867115    CN   1
3657.034962    CN   1
3967.540763    CN   1
202822.164817  MCI  0
240965.499488  MCI  0
258301.119915  MCI  0
220183.190232  MCI  0
212202.300552  MCI  0
Try the following:
import matplotlib.pyplot as plt
plt.scatter(df[df.reg==1]['vol'], df[df.reg==0]['vol'])
plt.show()
An alternative is to use pandas for the plotting as well, which saves some of the setup that matplotlib needs. To do that, the dataframe first has to be transformed into wide format, with one column per value of reg. After that, plotting is a single call:
vol = df.groupby('reg').apply(lambda g: g.reset_index(drop=True)).unstack('reg')['vol']
vol.plot(kind='scatter',x=1,y=0, title='scatter plot')
The groupby ... reset_index step simply gives both reg groups a common index 0..4. unstack then moves reg into the columns, which is the wide format that pandas plotting expects. The resulting dataframe is ready for pandas plotting.
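As a rough illustration of the wide format described above, the same reshaping can be sketched with cumcount and pivot instead of groupby/apply. This is a swapped-in technique, not the answer's own code; the helper column name pos is made up, and the data is copied from the question.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "vol": [4324.208797, 3805.078032, 3820.867115, 3657.034962, 3967.540763,
            202822.164817, 240965.499488, 258301.119915, 220183.190232,
            212202.300552],
    "dx": ["CN"] * 5 + ["MCI"] * 5,
    "reg": [1] * 5 + [0] * 5,
})

# Number the rows inside each reg group (0..4) so both groups share an index,
# then spread reg into columns: one column of vol per reg value.
wide = (
    df.assign(pos=df.groupby("reg").cumcount())
      .pivot(index="pos", columns="reg", values="vol")
)
```

Here wide has five rows and two columns labelled 0 and 1, so wide.plot(kind='scatter', x=1, y=0) pairs the n-th value with reg 1 against the n-th value with reg 0. Note that this pairing is purely positional and only makes sense when both groups have the same number of rows.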

Multiple columns visualization with plotly or seaborn

I have data on factories and the error codes they produce during production, such as below:
PlantID A B C D
1 0 1 2 4
1 3 0 2 0
3 0 0 0 1
4 0 1 1 5
Each row represents a production order.
I want to create a graph with the PlantIDs on the x-axis and A, B, C, D on the y-axis as separate bars.
That way I can see in one graph which factory has the most D errors, which has the most A errors, and so on.
I usually use plotly and seaborn, but I couldn't find a solution for this; the y-axis is a single column in every example.
Thanks in advance,
Seaborn likes its data in long or wide-form.
As mentioned above, seaborn will be most powerful when your datasets have a particular organization. This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham in this academic paper. The rules can be simply stated:
Each variable is a column
Each observation is a row
The following code converts the original dataframe to a long-form dataframe by stacking the columns on top of each other, so that every row corresponds to a single record that specifies the column name and the value (the count).
import numpy as np
import pandas as pd
import seaborn as sns
# Generating some data
N = 20
PlantID = np.random.choice(np.arange(1, 4), size=N, replace=True)
data = dict((k, np.random.randint(0, 50, size=N)) for k in ['A', 'B', 'C', 'D'])
df = pd.DataFrame(data, index=PlantID)
df.index = df.index.set_names('PlantID')
# Stacking the columns and resetting the index to create a longformat. (And some renaming)
df = df.stack().reset_index().rename({'level_1' : 'column', 0: 'count'},axis=1)
sns.barplot(x='PlantID', y='count', hue='column', data=df)
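The same long-form conversion can also be sketched with melt on the sample table from the question, where PlantID is an ordinary column rather than the index. The names orders, long_form, totals and error_code are made up for this illustration.

```python
import pandas as pd

# Sample rows copied from the question; each row is one production order
orders = pd.DataFrame({
    "PlantID": [1, 1, 3, 4],
    "A": [0, 3, 0, 0],
    "B": [1, 0, 0, 1],
    "C": [2, 2, 0, 1],
    "D": [4, 0, 1, 5],
})

# One row per (order, error code) pair
long_form = orders.melt(id_vars="PlantID", var_name="error_code", value_name="count")

# Total errors per plant and error code
totals = long_form.groupby(["PlantID", "error_code"], as_index=False)["count"].sum()
```

With these totals, sns.barplot(data=totals, x='PlantID', y='count', hue='error_code') draws one bar per error code for each plant. Summing first matters because barplot aggregates with the mean by default, so passing the raw long-form rows would show the average per order rather than the total.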
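Pandas also has convenient built-in plotting. Assuming df is the original wide dataframe with PlantID as a column, sum the errors per plant and plot one bar per error code:
import matplotlib.pyplot as plt
df.groupby('PlantID').sum().plot(kind='bar')
plt.show()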

A line graph for non-numeric data

I have a dataset that is mostly non-numeric. I would love to create a visualization of it, but I am getting an error message.
My dataset looks like this:
|plant_name|Customer_name|Job site|Delivery.Date|DeliveryQuantity|
|SN13|John|Sweden|01.01.2019|6|
|SN14|Ruth|France|01.04.2018|4|
|SN15|Jane|Serbia|01.01.2019|2|
|SN11|Rome|Denmark|01.04.2018|10|
|SN14|John|Sweden|03.04.2018|5|
|SN15|John|Sweden|04.09.2019|7|
I need to create a line plot that shows how many times John made a purchase, using Delivery.Date as my timeline (x-axis).
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.set_option("display.max_rows", 5)
hr_data = pd.read_excel("D:\data\Days_Calculation.xlsx", parse_dates = True)
x = hr_data['DeliveryDate']
y = hr_data ['Customer_name']
sns.lineplot(x,y)
Error: No numeric types to aggregate
My expected result should be a line graph in which John's markers appear on the timeline (Delivery.Date) at "01.01.2019", "03.04.2018" and "04.09.2019".
Two further questions:
How can one plot a string column against a numeric one, for example total DeliveryQuantity against Customer_name?
How does one set the spacing of the axis ticks of a plot (not the labels)?
Why not make Delivery Date a timestamp object instead of a string?
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"])
Now you have more plotting options.
Working with John's rows:
john_data = hr_data[hr_data["Customer_name"] == "John"]
sns.countplot(x=john_data["Delivery.Date"])
Generally speaking, you have to aggregate something when working with categorical data. Whether you are counting names in a column, summing the number of orders, or ranking categories, the result is numeric data that can be plotted.
plot_data = hr_data.pivot_table(index='Delivery.Date', columns='Customer_name', values='DeliveryQuantity', aggfunc='sum')
plot_data.plot(legend=False)
# placeholder from the original answer: pass your own list of x tick positions
plt.xticks(LISTOFVALUESFORXRANGE)
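To make the aggregation idea concrete, here is a minimal sketch built on the sample rows from the question. It assumes the dates are written day.month.year, which the question does not state, and the names purchases_per_date and quantity_per_customer are made up for illustration.

```python
import pandas as pd

# Sample rows copied from the question
hr_data = pd.DataFrame({
    "plant_name": ["SN13", "SN14", "SN15", "SN11", "SN14", "SN15"],
    "Customer_name": ["John", "Ruth", "Jane", "Rome", "John", "John"],
    "Job site": ["Sweden", "France", "Serbia", "Denmark", "Sweden", "Sweden"],
    "Delivery.Date": ["01.01.2019", "01.04.2018", "01.01.2019",
                      "01.04.2018", "03.04.2018", "04.09.2019"],
    "DeliveryQuantity": [6, 4, 2, 10, 5, 7],
})

# ASSUMPTION: dates are day.month.year; change the format string if not
hr_data["Delivery.Date"] = pd.to_datetime(hr_data["Delivery.Date"], format="%d.%m.%Y")

# Number of purchases John made on each delivery date (index sorted by date)
john = hr_data[hr_data["Customer_name"] == "John"]
purchases_per_date = john.groupby("Delivery.Date").size()

# Total delivered quantity per customer (string column against a number)
quantity_per_customer = hr_data.groupby("Customer_name")["DeliveryQuantity"].sum()
```

Both results are numeric Series, so purchases_per_date.plot(marker='o') should give the timeline with John's markers, and quantity_per_customer.plot(kind='bar') the quantity per customer.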

How to find the correct condition for my matplotlib scatterplot?

I'm trying to correlate two measures (DD and DRE) from a dataset that contains many more columns. I created a dataframe and called it 'Data'.
Within this data, I want to create a scatter plot of DD (x-axis) against DRE (y-axis), including only DD values between 0 and 100.
Please help me with the first line of my code, which should select the rows with DD between 0 and 100.
Also, when I plot the scatter plot, I get dots beyond 100% (the y-axis is DRE in %), even though I don't have any value above 100%.
Data1= Data[ Data['DD']<100]
plt.scatter(Data1.DD,Data1.DRE)
tick_val = [0,10,20,30,40,50,60,70,80,90,100]
tick_lab = ['0%','10%','20%','30%','40%','50%','60%','70%','80%','90%','100']
plt.yticks(tick_val,tick_lab)
plt.show()
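One way to write the condition asked about above, as a minimal sketch: the sample values below are made up and only stand in for the real 'Data' dataframe.

```python
import pandas as pd

# Hypothetical sample standing in for the asker's 'Data' dataframe
Data = pd.DataFrame({
    "DD": [-5, 0, 50, 100, 150],
    "DRE": [10.0, 20.0, 60.0, 95.0, 30.0],
})

# between() is inclusive on both ends by default, so this keeps 0 <= DD <= 100
Data1 = Data[Data["DD"].between(0, 100)]

# Equivalent explicit form: two comparisons combined with &,
# each wrapped in parentheses because & binds tighter than >= and <=
Data1_alt = Data[(Data["DD"] >= 0) & (Data["DD"] <= 100)]
```

Either form can replace the first line of the snippet in the question. The original filter only applied an upper bound, so negative DD values would still have been plotted.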

Turning Pandas DataFrame into Histogram Using Matplotlib

I have a pandas DataFrame with two columns, pageviews and type:
pageviews type
0 48.0 original
1 1.0 licensed
2 181.0 licensed
...
I'm trying to create a histogram each for original and licensed. Each histogram would (ideally) chart the number of occurrences in a given range for that particular type. So the x-axis would be a range of pageviews and the y-axis would be the number of pageviews that fall within that range.
Any recs on how to do this? I feel like it should be straightforward...
Thanks!
Using your current dataframe: df.hist(by='type')
For example:
import numpy as np
import pandas as pd
# Me recreating your dataframe
pageviews = np.random.randint(200, size=100)
types = np.random.choice(['original','licensed'], size=100)
df = pd.DataFrame({'pageviews': pageviews,'type':types})
# Code you need to create faceted histogram by type
df.hist(by='type')
pandas.DataFrame.hist documentation
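To see the numbers behind such histograms, the per-range counts can also be computed directly. This is a rough sketch with made-up sample values and arbitrarily chosen bin edges, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's dataframe
df = pd.DataFrame({
    "pageviews": [48.0, 1.0, 181.0, 10.0, 120.0],
    "type": ["original", "licensed", "licensed", "original", "licensed"],
})

# Arbitrary bin edges chosen for illustration; intervals are (0, 50], (50, 100], ...
bins = [0, 50, 100, 150, 200]
binned = pd.cut(df["pageviews"], bins=bins)

# Rows: type, columns: pageview range, values: number of rows in that range.
# observed=False keeps ranges that happen to be empty.
counts = (
    df.groupby(["type", binned], observed=False)
      .size()
      .unstack(fill_value=0)
)
```

Each row of counts holds the bar heights that a histogram with these bin edges would show for that type.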
