Pandas scatter_matrix plotting - additional arguments - python

I am running Python 3.6 with Pandas version 0.19.2. On the code example below, I have two questions regarding the Pandas plotting function scatter_matrix():
**1.** How can I colour-label the observations in the scatter plots with respect to the Label column?
**2.** How can I specify the number of bins for the histograms on the diagonal? Can I do this individually or just one bin number for all?
import pandas as pd
import numpy as np
N= 1000
df_feat = pd.DataFrame(np.random.randn(N, 4), columns=['A','B','C','D'])
df_label = pd.DataFrame(np.random.choice([0,1], N), columns=['Label'])
df = pd.concat([df_feat, df_label], axis=1)
axes = pd.tools.plotting.scatter_matrix(df, alpha=0.2)
This is linked to this more general one.

To answer your first question: there may be a less 'kludgey' way, but you can pass a list of colours, one per row, through the c keyword:
scatter_matrix(df,c=['r' if i == 1 else 'b' for i in df['Label']])
To answer the second:
scatter_matrix forwards the dictionary given in hist_kwds to the histogram call used for the diagonal plots, so the same bin setting applies to every diagonal histogram:
scatter_matrix(df,hist_kwds={'bins':5})
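Both answers can be combined in one call. The following is a minimal sketch, assuming a newer pandas in which scatter_matrix is imported from pandas.plotting (in 0.19.2 it lives in pd.tools.plotting, as in the question). The colour mapping, the reduced N and the Agg backend line are illustrative assumptions, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

rng = np.random.default_rng(0)
N = 200
features = ["A", "B", "C", "D"]
df = pd.DataFrame(rng.standard_normal((N, 4)), columns=features)
df["Label"] = rng.integers(0, 2, N)

# One colour per row, derived from the Label column (0 -> blue, 1 -> red)
colors = df["Label"].map({0: "b", 1: "r"}).tolist()

# Plot only the feature columns; c is forwarded to the scatter calls,
# hist_kwds to the histograms on the diagonal
axes = scatter_matrix(df[features], c=colors, alpha=0.5, hist_kwds={"bins": 5})
```

Note that the colour list must have one entry per plotted row, so this sketch assumes the feature columns contain no missing values.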

Related

Plotting complex graph in pandas

I have the following dataset
ids count
1 2000210
2 -23123
3 100
4 500
5 102300120
...
1 million 123213
I want a graph where I have group of ids (all unique ids) in the x axis and count in y axis and a distribution chart that looks like the following
How can I achieve this in pandas dataframe in python.
I tried different ways but I am only getting a basic plot and not as complex as the drawing.
What I tried
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
df.plot(x="range", y="count");
But the plots don't make any sense. I am also new to plotting in pandas. I searched the internet for a long time for charts like this and could really use some help with such graphs.
From what I understood from your question and comments here is what you can do:
1) Import the libraries and set the default theme:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
2) Create your dataframe:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
3) Plot your data
3.1) Simple take using only the seaborn library:
sns.kdeplot(data=df, x="count", weights="range")
3.2) More complex take using seaborn and matplotlib libraries:
sns.histplot(x=df["count"], weights=df["range"], discrete=True,
color='darkblue', edgecolor='black',
kde=True, kde_kws={'cut': 2}, line_kws={'linewidth': 4})
plt.ylabel("range")
plt.show()
Personal note: please make sure to check all the solutions, if they
are not enough comment and we will work together in order to find you
a solution
For a distribution plot of ids you can use:
import numpy as np
import pandas as pd
np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])
df['ids'].plot(kind='kde')
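With a million rows, a kernel density estimate can be slow to compute. As a cheaper alternative, a normalised histogram gives a similar picture of the distribution. This is only a sketch: the bin count of 50 and the Agg backend line are arbitrary assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])

# density=True normalises the bar heights so the histogram integrates to 1
ax = df["ids"].plot(kind="hist", bins=50, density=True)
ax.set_xlabel("ids")
```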

Python Pandas Seaborn FacetGrid: use dataframe series' names to set up columns

I am using pandas dataframes to hold some volume calculation results, and trying to configure a seaborn FacetGrid setup to visualize results of 4 different types of volume calculations for a reservoir zone.
I believe I can handle the dataframe part; my problem is with the visualization part:
Each different type of volume calculations is loaded in the dataframe as a series. The series name corresponds to the type of volume calculation. I want to create a number of plots then, aligned so that each column of plot corresponds to one series in my dataframe.
Theory (documentation) says this should do it (example from tutorial at https://seaborn.pydata.org/tutorial/axis_grids.html):
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset("tips")
g=sns.FacetGrid(tips, col = "time")
I cannot find the referenced dataset "tips" for download, but I think that is a minor problem. From the code snippet above and after some testing on my own data, I infer that "time" in that dataset refers to the name of one series in the dataframe and that different times would then be different categories or other types of values in that series.
This is not how my dataset is ordered. I have the different types of volume calculations that I would see as individual plots (in columns) represented as series in my dataframe. How do I provide the series name as input to seaborn FacetGrid col= argument?
g = seaborn.FacetGrid(data=volumes_table, col=?????)
I cannot figure out how I can get col=dataframe.series and I cannot find any documented example of that.
here's a setup with some exciting dummy names and dummy values
import os
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
#provide some input data, using a small dictionary
volumes_categories = {'zone_numbers': [1, 2, 3, 4],
'zone_names': ['corona', 'hiv', 'h5n1', 'measles'],
'grv': [30, 90, 80, 100],
'nv': [20, 60, 20, 50],
'pv': [5, 12, 4, 25],
'hcpv': [4, 6, 1, 20]}
# create the dataframe
volumes_table = pandas.DataFrame(volumes_categories)
# set up for plotting
seaborn.set(style='ticks')
g= seaborn.FacetGrid(data=volumes_table, col='zone_names')
The above setup generates columns ok, but I cannot get the columns to represent series in my dataframe (the columns when visualizing the dataframe as a table....)
What do I need to do?
The main part of the solution is described in BBQuercus's answer: reshaping the nice, human-readable wide-format dataframe/table into a long-format table, which is simpler for seaborn to digest, using pandas.melt()
I implemented this by creating a copy of the original dataframe and melting the copy:
# first copy dataframe
vol_table2 = volumes_table.copy()
#melt it into long format
vol_table2 = pandas.melt(vol_table2, id_vars = ['zone_numbers','zone_names'], value_vars=['grv','nv','pv','hcpv'], var_name = "volume_type", value_name = "volume")
In the end I also decided to scrap the explicit FacetGrid and map setup and use seaborn.catplot (with FacetGrid functionality included).
Thanks for assistance
(PS: it must be a good idea for seaborn to accept series names for Facetgrid setup)
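The catplot call itself is not shown above, so here is a minimal sketch of what it might look like, assuming the volumes_table from the question. The choice of a bar plot, the height value and the Agg backend line are assumptions, not necessarily what was used.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import pandas
import seaborn

volumes_table = pandas.DataFrame({
    'zone_numbers': [1, 2, 3, 4],
    'zone_names': ['corona', 'hiv', 'h5n1', 'measles'],
    'grv': [30, 90, 80, 100],
    'nv': [20, 60, 20, 50],
    'pv': [5, 12, 4, 25],
    'hcpv': [4, 6, 1, 20]})

# Wide to long: one row per (zone, volume type) combination
vol_long = pandas.melt(volumes_table,
                       id_vars=['zone_numbers', 'zone_names'],
                       value_vars=['grv', 'nv', 'pv', 'hcpv'],
                       var_name='volume_type', value_name='volume')

# One subplot column per volume type, one bar per zone
g = seaborn.catplot(data=vol_long, x='zone_names', y='volume',
                    col='volume_type', kind='bar', height=3)
```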
Once we have imported all requirements:
import seaborn as sns
import matplotlib.pyplot as plt
tips = sns.load_dataset('tips')
The FacetGrid essentially just provides a canvas to draw on. You can then use the map function to "project" plotting functions onto the canvas:
# Blueprint
g = sns.FacetGrid(dataframe, col="dataframe.column", row="dataframe.column")
g = g.map(plotting.function, "dataframe.column")
# Example with the tips dataset
g = sns.FacetGrid(tips, col="time", row="smoker")
g = g.map(plt.hist, "total_bill")
plt.show()
In your case, as mentioned above, I would also melt the columns first to get a tidy data format and then plot as usual, changing what is plotted as necessary:
volumes_table = volumes_table.melt(id_vars=['zone_numbers', 'zone_names'])
g = sns.FacetGrid(data=volumes_table, col='variable')
g = g.map(plt.scatter, 'zone_numbers', 'value')
plt.show()

How to remove certain values before plotting data

I'm using python for the first time. I have a csv file with a few columns of data: location, height, density, day etc... I am plotting height (i_h100) v density (i_cd) and have managed to constrain the height to values below 50 with the code below. I now want to constrain the values on the y axis to be within a certain 'day' range say (85-260). I can't work out how to do this.
import pandas
import matplotlib.pyplot as plt
data=pandas.read_csv('data.csv')
data.plot(kind='scatter',x='i_h100',y='i_cd')
plt.xlim(right=50)
Use .loc to subset data going into graph.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Make some dummy data
np.random.seed(42)
df = pd.DataFrame({'a':np.random.randint(0,365,20),
'b':np.random.rand(20),
'c':np.random.rand(20)})
# all data: plot of 'b' vs. 'c'
df.plot(kind='scatter', x='b', y='c')
plt.show()
# use .loc to subset data displayed based on value in 'a'
# can also use .loc to restrict values of 'b' displayed rather than plt.xlim
df.loc[df['a'].between(85,260) & (df['b'] < 0.5)].plot(kind='scatter', x='b', y='c')
plt.show()

Splitting large data set and plotting the average in matplotlib

I have a large data set with over 10,000 rows with values between 0 and 400,000,000. I would like to plot those values vs. the mean of another column in matplotlib where the x axis increments by 50,000,000 but I am unsure how to do so. I can plot it using pandas but would really like to do it using matplotlib but unsure how. This is what I have in pandas:
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
mean_values.plot(kind='line',figsize=(12,5))
I think I figured out what your problem is
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Create some data
df = pd.DataFrame({'budget_adj': np.random.uniform(0, 4000000000, 10000),
'vote_average': np.random.uniform(0, 100000, 10000)})
# Calculate the mean values
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
And this is what I suspect you do
# This won't work because mean_values.index holds intervals, not numbers
plt.plot(mean_values.index, mean_values)
This won't work because your index is a categorical interval index. For plot to work, your x-values have to be numbers. We can convert the intervals in several ways:
# You can pick the left endpoint...
x_values = [i.left for i in mean_values.index]
# the right endpoint...
x_values = [i.right for i in mean_values.index]
# or the center value.
x_values = [i.mid for i in mean_values.index]
# And NOW you will get no error
plt.plot(x_values, mean_values)
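To make the conversion concrete, here is a small end-to-end sketch with made-up numbers chosen so that the bin means are easy to check by hand. The tiny dataset, the bin edges and the Agg backend line are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Tiny illustrative dataset
df = pd.DataFrame({'budget_adj': [10, 20, 60, 70, 120],
                   'vote_average': [1.0, 3.0, 5.0, 7.0, 9.0]})

bins = np.arange(0, 200, 50)  # edges 0, 50, 100, 150
groups = df.groupby(pd.cut(df['budget_adj'], bins), observed=False)
mean_values = groups['vote_average'].mean()

# Use the centre of each interval as a numeric x value
x_values = [interval.mid for interval in mean_values.index]

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(x_values, mean_values.to_numpy(), marker='o')
ax.set_xlabel('budget_adj (bin centre)')
ax.set_ylabel('mean vote_average')
```

Here the bins (0, 50], (50, 100] and (100, 150] have means 2.0, 6.0 and 9.0, plotted at x = 25, 75 and 125.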

Seaborn pairplot and NaN values

I'm trying to understand why this fails, even though the documentation says:
dropna : boolean, optional
Drop missing values from the data before plotting.
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
'a': np.random.normal(size=(100,)),
'b': np.random.lognormal(size=(100,)),
'c': np.random.exponential(size=(100,))})
sns.pairplot(a) # this works as expected
# snip
b = a.copy()
b.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(b) # this fails with error
# "AttributeError: max must be larger than min in range parameter."
# in histogram(a, bins, range, normed, weights, density)"
sns.pairplot(b, dropna=True) # same error as above
When you use the data directly, i.e.
sns.pairplot(b) #Same as sns.pairplot(b, x_vars=['a','b','c'] , y_vars=['a','b','c'],dropna=True)
you are plotting against all the columns in the DataFrame, so make sure the number of rows is the same in all columns.
sns.pairplot(b, x_vars=['a','c'] , y_vars=['a','b','c'],dropna=True)
In this case it works fine, but there will be a minute difference in the graph because the NaN value is removed.
So, if you want to plot the whole data, then
either the null values must be replaced using "fillna()",
or the whole row containing NaN values must be dropped:
b = b.drop(b.index[5])
sns.pairplot(b)
I'm going to post an answer to my own question, even though it doesn't exactly solve the problem in general, but at least it solves my problem.
The problem arises when trying to draw histograms. However, it looks like the kdes are much more robust to missing data. Therefore, this works, despite the NaN in the middle of the dataframe:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
'a': np.random.normal(size=(100,)),
'b': np.random.lognormal(size=(100,)),
'c': np.random.exponential(size=(100,))})
a.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(a, diag_kind='kde')
Something of a necro, but as I cracked the answer to this today I thought it might be worth sharing; I could not find this solution elsewhere on the web. If the Seaborn dropna keyword has not worked for your data and you don't want to drop all rows that have any NaN, this should work for you.
All of this is in Seaborn 0.9 with pandas 0.23.4, assuming a data frame (df) with j rows (samples) that have n columns (attributes).
Seaborn cannot cope with arrays containing NaN being passed to it, which is a particular problem when you want to retain a row because it holds other useful data. The solution is to use a function that intercepts the pair-wise columns before they are plotted on the PairGrid.
Functions can be passed to the grid sectors to carry out an operation per subplot. A simple example of this would be to calculate and annotate RMSE for a column pair (subplot) onto each plot:
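import math
import matplotlib.pyplot as plt
import sklearn.metrics as skm  # skm is assumed to be sklearn.metrics
def rmse(x, y, **kwargs):
    # root-mean-square error between the two columns of this subplot
    value = math.sqrt(skm.mean_squared_error(x, y))
    label = 'RMSE = ' + str(round(value, 2))
    ax = plt.gca()
    ax.annotate(label, xy=(0.1, 0.95), size=20, xycoords=ax.transAxes)
# grid is an existing sns.PairGrid
grid = grid.map_upper(rmse)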
Therefore by writing a function that Seaborn can take as a data plotting argument, which drops NaNs on a column pair basis as the grid.map_ iterates over the main data frame, we can minimize data loss per sample (row). This is because one NaN in a row will not cause the entire row to be lost for all sub-plots. But rather just the sub-plot for that specific column pair will exclude the given row.
The following function carries out the pairwise NaN drop and plots the two resulting series on the current axes with matplotlib's scatter plot:
df = [YOUR DF HERE]
def col_nan_scatter(x, y, **kwargs):
    # drop rows where either column of this pair is NaN
    pair = pd.DataFrame({'x': x[:], 'y': y[:]})
    pair = pair.dropna()
    x = pair['x']
    y = pair['y']
    plt.gca()
    plt.scatter(x, y)
cols = df.columns
grid = sns.PairGrid(data=df, vars=cols, height=4)
grid = grid.map_upper(col_nan_scatter)
The same can be done with seaborn plotting (with for example, just the x value):
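def col_nan_kde_histo(x, **kwargs):
    # drop NaNs from the single column before estimating the density
    col = pd.DataFrame({'x': x[:]})
    col = col.dropna()
    x = col['x']
    plt.gca()
    sns.kdeplot(x)
cols = df.columns
grid = sns.PairGrid(data=df, vars=cols, height=4)
grid = grid.map_upper(col_nan_scatter)
# a single-column function belongs on the diagonal, not the upper triangle
grid = grid.map_diag(col_nan_kde_histo)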
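For anyone who wants to try the idea without their own data, here is a self-contained sketch of the pairwise drop on a PairGrid. The random data, the helper name scatter_pairwise_dropna, the marker size and the Agg backend line are all assumptions for illustration. It shows that a NaN in column 'c' only removes a point from the subplots that involve 'c'.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
data = pd.DataFrame({'a': rng.normal(size=100),
                     'b': rng.lognormal(size=100),
                     'c': rng.exponential(size=100)})
data.iloc[5, 2] = np.nan  # one missing value in column 'c'

def scatter_pairwise_dropna(x, y, **kwargs):
    # Hypothetical helper: drop rows where either value of this pair is NaN
    pair = pd.DataFrame({'x': np.asarray(x), 'y': np.asarray(y)}).dropna()
    # seaborn sets the current axes before calling, so plt.scatter draws there
    plt.scatter(pair['x'], pair['y'], color=kwargs.get('color'), s=10)

grid = sns.PairGrid(data=data, vars=list(data.columns), height=2)
grid = grid.map_upper(scatter_pairwise_dropna)
```

The 'a' versus 'b' subplot keeps all 100 points, while the subplots involving 'c' show 99.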
