How to plot values from same column using pandas? - python

I'm a beginner with pandas and python in general.
I would like to make a scatter plot of the column "vol" so that the x axis shows the values of "vol" where "reg" is 1, and the y axis shows the values of "vol" where "reg" is 0.
I'd appreciate your help.
vol dx reg
4324.208797 CN 1
3805.078032 CN 1
3820.867115 CN 1
3657.034962 CN 1
3967.540763 CN 1
202822.164817 MCI 0
240965.499488 MCI 0
258301.119915 MCI 0
220183.190232 MCI 0
212202.300552 MCI 0

Try the following:
import matplotlib.pyplot as plt
plt.scatter(df[df.reg==1]['vol'], df[df.reg==0]['vol'])
plt.show()

An alternative is to plot with pandas directly, which needs less boilerplate than matplotlib. For that to work, the dataframe first has to be reshaped into wide format. After that, plotting comes for free:
vol = df.groupby('reg').apply(lambda g: g.reset_index(drop=True)).unstack('reg')['vol']
vol.plot(kind='scatter',x=1,y=0, title='scatter plot')
groupby...reset_index simply gives both reg series a common index 0..4. unstack then turns the reg level into columns (wide format), which is what pandas plotting expects. The resulting dataframe has one column of vol values per reg value and is ready for plotting.
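The same wide-format reshaping can also be sketched without groupby().apply(). This is only one possible variant, not the answerer's code: it rebuilds the sample data from the question and uses a helper column named pos, which is an assumption introduced here for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "vol": [4324.208797, 3805.078032, 3820.867115, 3657.034962, 3967.540763,
            202822.164817, 240965.499488, 258301.119915, 220183.190232, 212202.300552],
    "dx": ["CN"] * 5 + ["MCI"] * 5,
    "reg": [1] * 5 + [0] * 5,
})

# Position of each row within its own reg group (0..4 for both groups).
# "pos" is a hypothetical helper column, not part of the original data.
pos = df.groupby("reg").cumcount()

# One row per position, one column per reg value (0 and 1)
wide = df.assign(pos=pos).pivot(index="pos", columns="reg", values="vol")
print(wide)
```

Calling wide.plot(kind='scatter', x=1, y=0) on this frame should then behave like the plot call in the answer above. Pairing values by position only makes sense when both groups have the same number of rows, as they do here.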

Related

Pandas + Seaborn : compute number of 0 regarding categorical datas

I'm currently struggling with my dataframe in Pandas (new to this).
I have a 3 columns dataframe : Categorical_data1, Categorical_data2,Output. (2400 rows x 3 columns).
Both categorical data (inputs) are strings and output is depending of inputs.
Categorical_data1 = ['type1','type2', ... , 'type6']
Categorical_data2 = ['rain1','rain2', 'rain3','rain4']
So 24 possible pairs of categorical data.
I want to plot a heatmap (using seaborn, for instance) of the number of 0s in Output for each pair (Cat_data1, Cat_data2). I tried several things using boolean masks.
I tried to compute the exact number of 0s for one pair with
count = ((df['Output'] == 0) & (df(['Categorical_Data1'] == 'type1') & (df(['Categorical_Data2'] == 'rain1')))).sum()
but it failed.
The output is either 0 or 1, with a large share of 0s (around 1200 out of 2400). My goal is to have something like this (source: jcdoming; I can't upload images...), with months = Categorical Data1, years = Categorical Data2, and the cell values being the number of 0s in the outputs.
Thank you for your help.
Use a seaborn countplot. It gives counts of categorical data occurrences in a certain feature. Use hue to add in the second feature to the visualization:
import seaborn as sns
sns.countplot(data=dataframe, x='Categorical_Data1', hue='Categorical_Data2')
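Note that the countplot above counts all rows, not only those where Output is 0. A minimal sketch of counting just the zeros per pair follows; it assumes the column names Categorical_Data1, Categorical_Data2 and Output from the question and uses randomly generated stand-in data, since the real dataframe is not shown.

```python
import numpy as np
import pandas as pd

# Stand-in data with the shape described in the question (2400 rows x 3 columns)
rng = np.random.default_rng(0)
types = [f"type{i}" for i in range(1, 7)]
rains = [f"rain{i}" for i in range(1, 5)]
df = pd.DataFrame({
    "Categorical_Data1": rng.choice(types, size=2400),
    "Categorical_Data2": rng.choice(rains, size=2400),
    "Output": rng.integers(0, 2, size=2400),
})

# Count for a single pair: index with df[...] (square brackets), not df(...)
count = ((df["Output"] == 0)
         & (df["Categorical_Data1"] == "type1")
         & (df["Categorical_Data2"] == "rain1")).sum()

# Counts for all pairs at once: keep the zero rows, then cross-tabulate
zeros = df[df["Output"] == 0]
zero_counts = pd.crosstab(zeros["Categorical_Data1"], zeros["Categorical_Data2"])
print(zero_counts)
```

The resulting table has one row per Categorical_Data1 value and one column per Categorical_Data2 value, so it could be passed to sns.heatmap(zero_counts, annot=True, fmt='d') to get the kind of heatmap described in the question.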

Plot average of multiple line plots with different x values

I have multiple dataframes that look similar to this:
x y x y
1 2 0.5 2
2 4 1.5 6
3 6 3 12
where the x columns are my indices. I want to plot the average line for these datasets. My idea was to concatenate the two dataframes so that I have a scatter plot and can fit a best-fit line, but pandas throws the error "Reindexing only valid with uniquely valued Index objects". I've read other questions about this error message and have renamed my index and column names to x_1, x_2, y_1 and y_2, but it still complains, I believe because some of the x values are the same. What am I doing wrong here?
I'm not sure I completely understand what your dataframes look like, but you can concatenate two (or more) dataframes df1, df2, ... by doing:
new_dataframe = pd.DataFrame(np.concatenate([df1,df2]),columns=['x','y'])
where my imports are
import pandas as pd
import numpy as np
Are you just looking for a best fit line for all the points? If so you can concat and use lmplot.
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'x':[1,2,3],'y':[2,4,6]})
df2 = pd.DataFrame({'x':[.5,1.5,3], 'y':[2,6,12]})
out = pd.concat([df,df2])
sns.lmplot(data=out, x='x', y='y', ci=None);
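If what is wanted is the point-by-point average of the lines rather than a best-fit line, one possible approach is to interpolate every dataset onto a shared x grid first. The sketch below assumes linear interpolation is acceptable and restricts the grid to the x range that both datasets cover; the grid size of 5 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})
df2 = pd.DataFrame({"x": [0.5, 1.5, 3], "y": [2, 6, 12]})

# Shared grid over the x range covered by both datasets
lo = max(df1["x"].min(), df2["x"].min())
hi = min(df1["x"].max(), df2["x"].max())
grid = np.linspace(lo, hi, 5)

# np.interp needs increasing x values, which both datasets already have
y1 = np.interp(grid, df1["x"], df1["y"])
y2 = np.interp(grid, df2["x"], df2["y"])

# Average of the two interpolated lines on the shared grid
average = pd.DataFrame({"x": grid, "y": (y1 + y2) / 2})
print(average)
```

Because each dataset is interpolated separately, repeated x values across datasets no longer matter, and average.plot(x='x', y='y') would draw the averaged line.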

Multiple columns visualization with plotly or seaborn

I have data of factories and their error codes during production
such as below;
PlantID A B C D
1 0 1 2 4
1 3 0 2 0
3 0 0 0 1
4 0 1 1 5
Each row represent production order.
I want to create a graph with x-axis=PlantID's and y-axis are A,B,C,D with different bars.
In this way I can see that which factory has the most D error, which has A in one graph
I usually use plotly and seaborn but I couldn't find any solution for that, y-axis is single column in every example
Thanks in advance,
Seaborn likes its data in long or wide-form.
As mentioned above, seaborn will be most powerful when your datasets have a particular organization. This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham in this academic paper. The rules can be simply stated:
Each variable is a column
Each observation is a row
The following code converts the original dataframe to a long-form dataframe by stacking the columns on top of each other, so that every row corresponds to a single record that specifies the column name and the value (the count).
import numpy as np
import pandas as pd
import seaborn as sns
# Generating some data
N = 20
PlantID = np.random.choice(np.arange(1, 4), size=N, replace=True)
data = dict((k, np.random.randint(0, 50, size=N)) for k in ['A', 'B', 'C', 'D'])
df = pd.DataFrame(data, index=PlantID)
df.index = df.index.set_names('PlantID')
# Stacking the columns and resetting the index to create a longformat. (And some renaming)
df = df.stack().reset_index().rename({'level_1' : 'column', 0: 'count'},axis=1)
sns.barplot(x='PlantID', y='count', hue='column', data=df)
Pandas has really clever built-in plotting functionality that works on the wide-form dataframe directly:
import matplotlib.pyplot as plt
df.plot(kind='bar')
plt.show()
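Since each row in the question is one production order, the error counts probably need to be summed per plant before plotting. A minimal sketch of that aggregation step, using the four sample rows from the question (summing is an assumption here; a mean would work the same way):

```python
import pandas as pd

# Sample rows from the question: one row per production order
df = pd.DataFrame({
    "PlantID": [1, 1, 3, 4],
    "A": [0, 3, 0, 0],
    "B": [1, 0, 0, 1],
    "C": [2, 2, 0, 1],
    "D": [4, 0, 1, 5],
})

# Total number of each error code per plant
totals = df.groupby("PlantID")[["A", "B", "C", "D"]].sum()
print(totals)
```

With PlantID as the index and one column per error code, totals.plot(kind='bar') would draw one group of bars per plant with a separate bar for A, B, C and D.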

Compact way of visualizing heat maps of correlated data

I am trying to visualize the correlation of the Result column with every other column.
A_B A_C B_C Result
0 0.318182 0.925311 0.860465 91
1 -0.384030 0.991803 0.996344 12
2 -0.818182 0.411765 0.920000 53
3 0.444444 0.978261 0.944444 64
Here A_B = (A-B)/(A+B), and correspondingly for all the other columns.
This works for a small number of columns, but if I increase the number of columns, the rows in the heatmap keep stacking up. Is there a compact way to represent it?
The following code reproduces the output:
import pandas as pd
import seaborn as sns
data = {'A':[232,243,12,546,67,12,78,11,245],
'B':[120,546,120,210,56,120,56,89,12],
'C':[9,1,5,6,7,43,7,12,64],
'Result':[91,12,53,64,71,436,74,123,641],
}
df = pd.DataFrame(data,columns=['A','B','C','Result'])
#Responsible for (A-B)/(A+B) ,(A-C)/(A+C) and similarly
colnames = df.columns.tolist()[:-1]
for i,c in enumerate(colnames):
    if i!=len(colnames):
        for k in range(i+1,len(colnames)):
            df[c+'_'+colnames[k]]=(df[c]-df[colnames[k]])/(df[c]+df[colnames[k]])
newdf = df[['A_B','A_C','B_C','Result']].copy()
#Plotting A_B,A_C,B_C by ignoring the output of result of itself
plot = pd.DataFrame(newdf.corr().iloc[:-1,-1])
sns.heatmap(plot,annot=True)
A technique I have heard of, but cannot find a source for, represents each correlation factor as a mini-rectangle.
According to it, treating the map as a 3*3 matrix with (0,0) at the bottom left, A_B is placed at (1,1), A_C at (2,1), and B_C at (2,2).
But I don't understand how to do it.
You can plot the correlation of each column against the Result column and other columns as well. Below is one way to do so. Providing the x- and y-ticklabels guides you better for comparing the correlations. You can also annotate the correlation values to be displayed on the heat map.
cor = newdf.corr()
sns.heatmap(cor, xticklabels=cor.columns.values,
yticklabels=cor.columns.values, annot=True)
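The "mini-rectangle" layout from the question can be approximated by arranging the correlations with Result in a triangular grid, with the first feature of each pair as the row and the second as the column. This is only a sketch of one reading of that layout; the names ratios and grid are introduced here for illustration and the cell positions may differ from the technique the asker had in mind.

```python
import itertools
import pandas as pd

data = {'A': [232, 243, 12, 546, 67, 12, 78, 11, 245],
        'B': [120, 546, 120, 210, 56, 120, 56, 89, 12],
        'C': [9, 1, 5, 6, 7, 43, 7, 12, 64],
        'Result': [91, 12, 53, 64, 71, 436, 74, 123, 641]}
df = pd.DataFrame(data)
features = ['A', 'B', 'C']

# Pairwise ratio columns, e.g. A_B = (A-B)/(A+B)
pairs = list(itertools.combinations(features, 2))
ratios = pd.DataFrame({f"{a}_{b}": (df[a] - df[b]) / (df[a] + df[b])
                       for a, b in pairs})

# Triangular grid: row = first feature, column = second feature.
# Cells without a corresponding pair stay NaN.
grid = pd.DataFrame(index=features[:-1], columns=features[1:], dtype=float)
for a, b in pairs:
    grid.loc[a, b] = ratios[f"{a}_{b}"].corr(df['Result'])
print(grid)
```

For n features this gives an (n-1) x (n-1) grid instead of a column with n(n-1)/2 rows. sns.heatmap(grid, annot=True) should leave the unused cell blank, since seaborn masks missing values.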

Turning Pandas DataFrame into Histogram Using Matplotlib

I have a Pandas DataFrame which has two columns, pageviews and type:
pageviews type
0 48.0 original
1 1.0 licensed
2 181.0 licensed
...
I'm trying to create a histogram each for original and licensed. Each histogram would (ideally) chart the number of occurrences in a given range for that particular type. So the x-axis would be a range of pageviews and the y-axis would be the number of pageviews that fall within that range.
Any recs on how to do this? I feel like it should be straightforward...
Thanks!
Using your current dataframe: df.hist(by='type')
For example:
import numpy as np
import pandas as pd
# Recreating your dataframe
pageviews = np.random.randint(200, size=100)
types = np.random.choice(['original','licensed'], size=100)
df = pd.DataFrame({'pageviews': pageviews,'type':types})
# Code you need to create faceted histogram by type
df.hist(by='type')
pandas.DataFrame.hist documentation
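To check the numbers behind such a faceted histogram, the per-type bin counts can also be computed as a table. This is a sketch with random stand-in data; the bin width of 20 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Stand-in data shaped like the dataframe in the question
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pageviews": rng.integers(0, 200, size=100),
    "type": rng.choice(["original", "licensed"], size=100),
})

# Left-closed bins [0, 20), [20, 40), ..., [180, 200)
bins = np.arange(0, 201, 20)
binned = pd.cut(df["pageviews"], bins=bins, right=False)

# Number of rows per pageview range and type
counts = pd.crosstab(binned, df["type"])
print(counts)
```

Passing the same bins to the plotting call, as in df.hist(column='pageviews', by='type', bins=bins), should give histograms whose bar heights match this table.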
