Plotting complex graph in pandas - python

I have the following dataset
ids count
1 2000210
2 -23123
3 100
4 500
5 102300120
...
1 million 123213
I want a graph with the ids (all unique) grouped on the x axis and count on the y axis, drawn as a distribution chart that looks like the following
How can I achieve this with a pandas DataFrame in Python?
I tried different approaches, but I only get a basic plot, nothing as complex as the drawing.
What I tried
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
df.plot(x="range", y="count");
But the plots don't make any sense. I am also new to plotting in pandas. I searched the internet for a long time for charts like this and could really use some help with such graphs.

From what I understood from your question and comments, here is what you can do:
1) Import the libraries and set the default theme:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()
2) Create your dataframe:
df = pd.DataFrame(np.random.randn(1000000, 2), columns=["count", "ids"]).cumsum()
df["range"] = pd.Series(list(range(len(df))))
3) Plot your data
3.1) Simple take using only the seaborn library:
sns.kdeplot(data=df, x="count", weights="range")
Output:
3.2) More complex take using seaborn and matplotlib libraries:
sns.histplot(x=df["count"], weights=df["range"], discrete=True,
color='darkblue', edgecolor='black',
kde=True, kde_kws={'cut': 2}, line_kws={'linewidth': 4})
plt.ylabel("range")
plt.show()
Output:
Personal note: please check all the solutions; if they are not enough, leave a comment and we will work together to find you a solution.

For a distribution plot of ids you can use:
import numpy as np
import pandas as pd
np.random.seed(seed=123)
df = pd.DataFrame(np.random.randn(1000000), columns=["ids"])
df['ids'].plot(kind='kde')
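A histogram is another way to show how the count values are distributed, and it may be cheaper to draw than a kernel density estimate on a million rows. The following is only a sketch: the data is synthetic (the column names follow the question, the values are made up), and the bin count of 50 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
# Synthetic stand-in for the question's table: one signed count per unique id.
df = pd.DataFrame({
    "ids": np.arange(1, 10_001),
    "count": rng.normal(loc=0, scale=1_000, size=10_000).round(),
})

# x axis: count values split into 50 bins; y axis: number of ids per bin.
ax = df["count"].plot(kind="hist", bins=50)
ax.set_xlabel("count")
ax.set_ylabel("number of ids")
```

In an interactive session, calling plt.show() (after import matplotlib.pyplot as plt) displays the figure; ax.figure.savefig(...) writes it to a file instead.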

Related

Producing a heatmap from a pandas dataframe with rows of the form (x,y,z), where z is intended to be the heat value

Let's say I have a dataframe with this structure
and I intend to transform it into something like this (done laboriously and quite manually):
Is there a simple call to (say) a seaborn or plotly function that would do this? Something like
heatmap(df, x='Dose', y='Distance', z='Passrate')
or perhaps a simple way of restructuring the dataframe to facilitate using sns.heatmap or plotly's imshow, or similar? It seems strange to me that I cannot find a straightforward way of putting data formatted in this way into a high-level plotting function.
Use df.pivot_table to get your data in the correct shape first.
Setup: create some random data
import pandas as pd
import numpy as np
import seaborn as sns
p_rate = np.arange(0,100)/np.arange(0,100).sum()
data = {'Dose': np.repeat(np.arange(0,3.5,0.5), 10),
'Distance': np.tile(np.arange(0,3.5,0.5), 10),
'Passrate': np.random.choice(np.arange(0,100), size=70,
p=p_rate)}
df = pd.DataFrame(data)
Code: pivot and apply sns.heatmap
df_pivot = df.pivot_table(index='Distance',
columns='Dose',
values='Passrate',
aggfunc='mean').sort_index(ascending=False)
sns.heatmap(df_pivot, annot=True, cmap='coolwarm')
Result:
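To see what the pivot step does on its own, here is a minimal sketch with four made-up rows. The column names follow the question; the values are invented for illustration.

```python
import pandas as pd

# Long format: one (x, y, z) row per cell of the future heatmap.
long_df = pd.DataFrame({
    "Dose":     [0.0, 0.0, 0.5, 0.5],
    "Distance": [0.0, 0.5, 0.0, 0.5],
    "Passrate": [90, 80, 70, 60],
})

# Wide format: Distance becomes the row index, Dose the columns,
# and each cell holds the mean Passrate for that (Distance, Dose) pair.
grid = long_df.pivot_table(index="Distance", columns="Dose",
                           values="Passrate", aggfunc="mean")
print(grid)
```

The resulting grid is the 2-D shape that sns.heatmap expects, which is why the pivot comes before the plotting call.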

Plotting top 10 Values in Big Data

I need help plotting some categorical and numerical values in Python. The code is given below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('train_feature_store.csv')
df.info()
df.head()
df.columns
plt.figure(figsize=(20,6))
sns.countplot(x='Store', data=df)
plt.show()
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
However, the dataset is so large that I'm not able to make a meaningful plot in Python. Basically, I just want to take the top 5 or top 10 values and plot them as shown below:
To plot this, I'm trying to put the result of the code below into a dataframe and plot it, but I'm not able to do so. Can anyone help me out with this?
Size = df[['Size','Store']].groupby(['Store'], as_index=False).sum()
Size.sort_values(by=['Size'],ascending=False).head(10)
Below is a link to a sample dataset. It is only representative; the original dataset I'm doing the EDA on has around 3 thousand unique stores and 60 thousand rows of data. PLEASE HELP! Thanks!
https://drive.google.com/drive/folders/1PdXaKXKiQXX0wrHYT3ZABjfT3QLIYzQ0?usp=sharing
You were pretty close.
import pandas as pd
import seaborn as sns
df = pd.read_csv('train_feature_store.csv')
sns.set(rc={'figure.figsize':(16,9)})
g = df.groupby('Store', as_index=False)['Size'].sum().sort_values(by='Size', ascending=False).head(10)
sns.barplot(data=g, x='Store', y='Size', hue='Store', dodge=False).set(xticklabels=[]);
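The aggregation step can also be written with Series.nlargest, which combines the sort and the head(10). This is a sketch on made-up data, since the CSV is not available here; the Store and Size column names follow the question.

```python
import pandas as pd

# Made-up data: 12 stores, two rows each.
df = pd.DataFrame({
    "Store": list(range(1, 13)) * 2,
    "Size": list(range(24)),
})

# Total Size per store, then keep the 10 largest totals (sorted descending).
top10 = df.groupby("Store")["Size"].sum().nlargest(10)
print(top10)
```

The result is a Series indexed by Store, so top10.plot(kind="bar") draws it directly, or top10.reset_index() turns it back into a dataframe for seaborn.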
First of all, looking at the data, it appears to cover locations from Scotland to Kolkata.
Categorize the data by geography first and then visualize it.
Regards
Maitryee

How to change seaborn violinplot legend labels?

I'm using seaborn to make a violinplot, which uses hues to identify who survived and who didn't. This is given by the column 'DEATH_EVENT', where 0 means the person survived and 1 means they didn't. The only issue I'm having is that I can't figure out how to set labels for this hue legend. As seen below, 'DEATH_EVENT' presents 0 and 1, but I want to change this into 'Survived' and 'Not survived'.
Current code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
sns.set()
plt.style.use('seaborn')
data = pd.read_csv('heart_failure_clinical_records_dataset.csv')
g = sns.violinplot(data=data, x='smoking', y='age', hue='DEATH_EVENT')
g.set_xticklabels(['No smoking', 'Smoking'])
I tried to use: g.legend(labels=['Survived', 'Not survived']), but the legend comes back without the colors, showing a thin and a thick line instead, for some reason.
I'm aware I could just use:
data['DEATH_EVENT'].replace({0:'Survived', 1:'Not survived'}, inplace=True)
but I wanted to see if there was another way. I'm still a rookie, so I'm guessing that there's a reason why the CSV's author made it so that it uses integers to describe plenty of things. Ex: if someone smokes or not, sex, diabetic or not, etc. Maybe it runs faster?
Controlling Seaborn legends is still somewhat tricky (some extensions to matplotlib's API would be helpful). In this case, you could grab the handles from the just-created legend and reuse them for a new legend:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.DataFrame({"smoking": np.random.randint(0, 2, 200),
"survived": np.random.randint(0, 2, 200),
"age": np.random.normal(60, 10, 200),
"DEATH_EVENT": np.random.randint(0, 2, 200)})
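ax = sns.violinplot(data=data, x='smoking', y='age', hue='DEATH_EVENT')
ax.set_xticklabels(['No smoking', 'Smoking'])
# Matplotlib 3.7 renamed Legend.legendHandles to legend_handles; support both
old_legend = ax.legend_
handles = old_legend.legend_handles if hasattr(old_legend, 'legend_handles') else old_legend.legendHandles
ax.legend(handles=handles, labels=['Survived', 'Not survived'])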
Here is an approach to make the change via the dataframe without changing the original dataframe. To avoid accessing ax.legend_ altogether (to remove the legend title), a trick is to rename the column to a blank string (and use that blank string for hue). If the dataframe isn't super long (i.e. not having millions of rows), the speed and memory overhead are quite modest.
names = {0: 'Survived', 1: 'Not survived'}
ax = sns.violinplot(data=data.replace({'DEATH_EVENT': names}).rename(columns={'DEATH_EVENT': ''}),
x='smoking', y='age', hue='')
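A variant of the same idea, sketched here on a tiny made-up frame, is to add a labelled column with assign and map instead of replacing values. The column name outcome is hypothetical, chosen only for this example.

```python
import pandas as pd

# Tiny made-up frame with the same columns as the question's data.
data = pd.DataFrame({
    "smoking": [0, 1, 0, 1],
    "age": [55.0, 63.0, 70.0, 48.0],
    "DEATH_EVENT": [0, 1, 1, 0],
})

names = {0: "Survived", 1: "Not survived"}
# assign() returns a new frame, so `data` keeps its integer codes.
# `outcome` is a hypothetical column name used only for plotting.
labelled = data.assign(outcome=data["DEATH_EVENT"].map(names))
print(labelled["outcome"].tolist())
```

Passing labelled to sns.violinplot with hue='outcome' should then give a legend with the text labels; unlike the blank-column trick, the legend keeps a title (the column name).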

Clustermapping in Python using Seaborn

I am trying to create a heatmap with dendrograms on Python using Seaborn and I have a csv file with about 900 rows. I'm importing the file as a pandas dataframe and attempting to plot that but a large number of the rows are not being represented in the heatmap. What am I doing wrong?
This is the code I have right now. But the heatmap only represents about 49 rows.
Here is an image of the clustermap I've obtained but it is not displaying all of my data.
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
# Data set
df = pd.read_csv('diff_exp_gene.csv', index_col = 0)
# Default plot
sns.clustermap(df, cmap = 'RdBu', row_cluster=True, col_cluster=True)
plt.show()
Thank you.
An alternative approach is to use imshow in matplotlib. I'm not exactly sure what your question is, but here is a way to plot points on a plane from a CSV file. It assumes the file has TYPE, X and Y columns, with X and Y between 0 and 127:
import csv
import numpy as np
import matplotlib.pyplot as plt
types = 'A'  # placeholder: the TYPE value to count (left undefined in the original answer)
# 128 x 128 grid of hit counts, indexed as temp[Y, X]
temp = np.zeros((128, 128), dtype=int)
with open('diff_exp_gene.csv', newline='') as infile:
    for row in csv.DictReader(infile):
        if row['TYPE'] == types:
            temp[int(row['Y']), int(row['X'])] += 1
plt.imshow(temp, cmap='hot', origin='lower')
plt.show()
As far as I know, keywords that apply to seaborn heatmaps also apply to clustermap, because sns.clustermap passes them on to sns.heatmap. In that case, all you need to do in your example is to set yticklabels=True as a keyword argument in sns.clustermap(). That will label all 900 rows; the rows themselves are already drawn, and only their tick labels are thinned out.
By default, it is set as "auto" to avoid overlap. The same applies to the xticklabels. See more here: https://seaborn.pydata.org/generated/seaborn.heatmap.html
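The effect of yticklabels=True can be checked on a small example. The sketch below uses sns.heatmap directly (clustermap would additionally need scipy) with a made-up table of 60 labelled rows; the gene_ names and sample columns are invented.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
# Made-up stand-in for the gene table: 60 labelled rows, 4 sample columns.
df = pd.DataFrame(rng.normal(size=(60, 4)),
                  index=[f"gene_{i}" for i in range(60)],
                  columns=list("ABCD"))

fig, ax = plt.subplots(figsize=(4, 12))
# yticklabels=True asks for one tick label per row instead of the thinned "auto" set.
sns.heatmap(df, cmap="RdBu", yticklabels=True, ax=ax)

n_labels = len(ax.get_yticklabels())
print(n_labels)
```

With several hundred rows the labels will overlap unless the figure is made tall enough or the font size is reduced.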

Seaborn.countplot : order categories by count

I know that seaborn.countplot has the attribute order which can be set to determine the order of the categories. But what I would like to do is have the categories be in order of descending count. I know that I can accomplish this by computing the count manually (using a groupby operation on the original dataframe, etc.) but I am wondering if this functionality exists with seaborn.countplot. Surprisingly, I cannot find an answer to this question anywhere.
This functionality is not built into seaborn.countplot as far as I know - the order parameter only accepts a list of strings for the categories, and leaves the ordering logic to the user.
This is not hard to do with value_counts() provided you have a DataFrame though. For example,
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='darkgrid')
titanic = sns.load_dataset('titanic')
sns.countplot(x = 'class',
data = titanic,
order = titanic['class'].value_counts().index)
plt.show()
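Why this works: value_counts() sorts by count in descending order by default, so its index is already the category order wanted. A minimal check on made-up values (no dataset download needed):

```python
import pandas as pd

# Made-up categories: Third appears 3 times, First twice, Second once.
classes = pd.Series(["Third", "First", "Third", "Second", "Third", "First"],
                    name="class")

# The index of value_counts() lists categories from most to least frequent.
order = classes.value_counts().index.tolist()
print(order)  # ['Third', 'First', 'Second']
```

Passing ascending=True to value_counts() reverses the order if the smallest category should come first.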
Most often, a seaborn countplot is not really necessary. Just plot with pandas bar plot:
import seaborn as sns; sns.set(style='darkgrid')
import matplotlib.pyplot as plt
df = sns.load_dataset('titanic')
df['class'].value_counts().plot(kind="bar")
plt.show()
