Plotting multiple columns of different sizes with Pandas

I'm fairly new to Pandas. Typically, when all columns are the same size, I build np.zeros(count) arrays, populate them in a for loop from a text file (np.genfromtxt()), and then do my graphing and analysis in matplotlib.
However, I am now trying to implement similar analysis with columns of different sizes on the same plot from a CSV file.
For instance:
data.csv:
A B C D E F
1 2 3 4 5 6
2 3 4 5 6 7
3 4 5 6
4 5
df = pandas.read_csv('data.csv')
ax = df.plot(x = 'A', y = 'B')
df.plot(x = 'C', y = 'D', ax = ax)
df.plot(x = 'E', y = 'F', ax = ax)
This code plots the first two columns on the same graph, but the rest of the information is lost (there are many more columns of mismatched sizes, but the x/y columns in each pair I am plotting are the same size).
Is there an easier way to do all of this? Thanks!

Here is how you could generalize your solution:
I edited my answer to add error handling, so it still works if there is an unpaired last column.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
data = {
    'A': [1, 2, 3, 4],
    'B': [2, 3, 4, 5],
    'C': [3, 4, 5, np.nan],
    'D': [4, 5, 6, np.nan],
    'E': [5, 6, np.nan, np.nan],
    'F': [6, 7, np.nan, np.nan]
}
df = pd.DataFrame(data)
def Chris(df):
    ax = df.plot(x='A', y='B')
    df.plot(x='C', y='D', ax=ax)
    df.plot(x='E', y='F', ax=ax)
    plt.show()
def IMCoins(df):
    fig, ax = plt.subplots()
    try:
        for idx in range(0, df.shape[1], 2):
            df.plot(x=df.columns[idx],
                    y=df.columns[idx + 1],
                    ax=ax)
    except IndexError:
        print('Index Error: Log the error.')
    plt.show()
Chris(df)
IMCoins(df)
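A variation on the loop above, as a minimal sketch: instead of catching IndexError, it pairs the columns up front and drops the NaN padding from each pair before plotting. The helper name column_pairs and the rows_plotted dictionary are hypothetical, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def column_pairs(df):
    """Yield (x_name, y_name) for consecutive columns; an unpaired last column is skipped."""
    cols = list(df.columns)
    for i in range(0, len(cols) - 1, 2):
        yield cols[i], cols[i + 1]


df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [2, 3, 4, 5],
    'C': [3, 4, 5, np.nan],
    'D': [4, 5, 6, np.nan],
    'E': [5, 6, np.nan, np.nan],
    'F': [6, 7, np.nan, np.nan],
})

fig, ax = plt.subplots()
rows_plotted = {}
for x_name, y_name in column_pairs(df):
    pair = df[[x_name, y_name]].dropna()  # keep only rows where both values exist
    rows_plotted[(x_name, y_name)] = len(pair)
    pair.plot(x=x_name, y=y_name, ax=ax)
```

With a display available, calling plt.show() afterwards would render all three lines on the same axes.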

Related

How to create a Crosstab Plot?

I would like to create a 'Crosstab' plot like the below using matplotlib or seaborn:
Using the following dataframe:
import pandas as pd
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
df = pd.DataFrame(data = data, columns = ['col', 'row', 'val'])
  col row  val
0   A   C    2
1   A   D    8
2   B   C   25
3   B   D   30
One option in matplotlib is to add Rectangles anchored at the origin via plt.gca and add_patch. The problem is that I did everything manually here, like this:
from matplotlib.patches import Rectangle
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
plt.xlim(-10, 40)
plt.ylim(-40, 40)
plt.rcParams['figure.figsize'] = (10,16)
someX, someY = 0, 0
currentAxis = plt.gca()
currentAxis.add_patch(Rectangle((someX, someY), 30, 30, facecolor="purple"))
ax.text(15, 15, '30')
currentAxis.add_patch(Rectangle((someX, someY), 25, -25, facecolor="blue"))
ax.text(12.5, -12.5, '25')
currentAxis.add_patch(Rectangle((someX, someY), -2, -2, facecolor="red"))
ax.text(-1, -1, '2')
currentAxis.add_patch(Rectangle((someX, someY), -8, 8, facecolor="green"))
ax.text(-4, 4, '8')
Output:
As you can see, the plot doesn't look that nice. So I was wondering if it is possible to somehow automatically create 'Crosstab' plots using matplotlib or seaborn?
I am not sure whether matplotlib or seaborn has a dedicated function for this type of plot, but using plt.bar and plt.bar_label instead of Rectangle and ax.text might help automate things a little (label placement, etc.).
See code below:
import matplotlib.pyplot as plt
data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
# factors per label: A/B set where the bar starts on x, C/D set whether it points down or up
pos = {'A': -1, 'B': 0, 'C': -1, 'D': 1}
fig, ax = plt.subplots(figsize=(10, 10))
for col, row, val in data:
    # each bar starts at x = pos[col] * val, is val wide and pos[row] * val tall
    bars = ax.bar(pos[col] * val, pos[row] * val, width=val, align='edge')
    ax.bar_label(bars, labels=[val], label_type='center', fontsize=18)
ax.set_aspect('equal')
plt.show()
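Before drawing, it can help to look at the numbers as an actual cross table. This is a minimal sketch using DataFrame.pivot on the question's data; the variable name table is just for illustration.

```python
import pandas as pd

data = [['A', 'C', 2], ['A', 'D', 8], ['B', 'C', 25], ['B', 'D', 30]]
df = pd.DataFrame(data, columns=['col', 'row', 'val'])

# one row per 'row' label, one column per 'col' label
table = df.pivot(index='row', columns='col', values='val')
print(table)
```

Each cell of table corresponds to one quadrant of the plot, so table.loc['D', 'B'] is the value of the largest square.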

How to split a grouped plot in Seaborn Python?

I have a data frame like this:
df:
Type  Col-1  Col-2
A     3      8
A     4      7
A     5      9
A     6      6
A     7      7
B     4      8
B     2      7
B     6      6
B     4      9
B     5      7
I have 2 violin plots for Col-1 & Col-2. Now, I want to create a single violin plot with 2 violin images for Type A & B. In the violin plot, I want to split every violin such that the left half of the violin denotes Col-1 & right half of the violin denotes Col-2. I created two separate violin plots for col-1 and col-2 but now I want to make it a single plot and represent 2 columns at a time by splitting. How can I do it?
This is my code for separate plots:
def violin(data):
    for col in data.columns:
        x = data[col].to_frame().reset_index()
        ax = sns.violinplot(data=x, x='type',y=col,inner='quart',split=True)
        plt.show()
violin(df)
This is what my current violin plots look like. I want to combine them into a single plot:
Can anyone help me with this?
Seaborn works most easily with data in "long form", where the value columns are combined into one.
Here is what the code could look like:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'Type': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'Col-1': [4, 3, 5, 6, 7, 4, 2, 6, 4, 5],
                   'Col-2': [7, 8, 9, 6, 7, 8, 7, 6, 9, 7]})
df_long = df.melt(id_vars=['Type'], value_vars=['Col-1', 'Col-2'], var_name='Col', value_name='Value')
plt.figure(figsize=(12, 5))
sns.set()
sns.violinplot(data=df_long, x='Type', y='Value', hue='Col', split=True, palette='spring')
plt.tight_layout()
plt.show()
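To see what "long form" means here, a small sketch that only runs the melt step and prints the result. It uses a shortened, four-row subset of the data (an assumption made for brevity), with the same column names as above.

```python
import pandas as pd

# shortened subset of the question's data, two rows per Type
df = pd.DataFrame({'Type': ['A', 'A', 'B', 'B'],
                   'Col-1': [3, 4, 4, 2],
                   'Col-2': [8, 7, 8, 7]})

# stack Col-1 and Col-2 into one 'Value' column, remembering the origin in 'Col'
df_long = df.melt(id_vars=['Type'], value_vars=['Col-1', 'Col-2'],
                  var_name='Col', value_name='Value')
print(df_long)
```

Every original row now appears twice, once per value column, which is what lets hue='Col' with split=True draw the two halves of each violin.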

Cumulative average in python

I'm working with csv files.
I'd like to create a continuously updated (cumulative) average of a sequence. For example,
I'd like to output the average of all values up to and including each position in a list.
list: [a, b, c, d, e, f]
formula:
(a)/1= ?
(a+b)/2=?
(a+b+c)/3=?
(a+b+c+d)/4=?
(a+b+c+d+e)/5=?
(a+b+c+d+e+f)/6=?
To demonstrate:
If I have the list [1, 4, 7, 4, 19],
my output should be [1, 2.5, 4, 4, 7]
Explained:
(1)/1=1
(1+4)/2=2.5
(1+4+7)/3=4
(1+4+7+4)/4=4
(1+4+7+4+19)/5=7
My Python file is simple:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('somecsvfile.csv')
x = [] #has to be a list of 1 to however many rows are in the "numbers" column, will be a simple [1, 2, 3, 4, 5] etc...
#x will be used to divide the numbers selected in y to give us z
y = df[numbers]
z = #new dataframe derived from the continuous average of y
plt.plot(x, z)
plt.show()
If numpy is needed that is no problem.
pandas.DataFrame.expanding is what you need.
Using it you can just call df.expanding().mean() to get the result you want:
mean = df.expanding().mean()
print(mean)
Out[10]:
0 1.0
1 2.5
2 4.0
3 4.0
4 7.0
If you want to do it just in one column, use pandas.Series.expanding.
Just use the column instead of df:
df['column_name'].expanding().mean()
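As a quick cross-check of what expanding().mean() computes, here is a minimal sketch that builds the same running average in plain Python with itertools.accumulate and compares it with the pandas result. The variable names are illustrative only.

```python
from itertools import accumulate

import pandas as pd

values = [1, 4, 7, 4, 19]

# running sum divided by the number of items seen so far
running_mean = [total / n for n, total in enumerate(accumulate(values), start=1)]

# the same computation via pandas
expanding_mean = pd.Series(values).expanding().mean().tolist()

print(running_mean)  # [1.0, 2.5, 4.0, 4.0, 7.0]
```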
You can use cumsum to get the cumulative sum and then divide by the running count to get the running average.
import numpy as np
x = np.array([1, 4, 7, 4, 19])
z = np.cumsum(x) / np.arange(1, len(x) + 1)
print(z)
output:
[1.  2.5 4.  4.  7. ]
To give a complete answer to your question, filling in the blanks of your code using numpy and plotting:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
#df = pd.read_csv('somecsvfile.csv')
#instead I just create a df with a column named 'numbers'
df = pd.DataFrame([1, 4, 7, 4, 19], columns = ['numbers',])
x = range(1, len(df)+1) #x will be used to divide the numbers selected in y to give us z
y = df['numbers']
z = np.cumsum(y) / np.array(x)
plt.plot(x, z, 'o')
plt.xticks(x)
plt.xlabel('Entry')
plt.ylabel('Cumulative average')
But as pointed out by Augusto, you can also just put the whole thing into a DataFrame. Adding a bit more to his approach:
n = [1, 4, 7, 4, 19]
df = pd.DataFrame(n, columns = ['numbers',])
#augment the index so it starts at 1 like you want
df.index = np.arange(1, len(df)+1)
# create a new column for the cumulative average
df = df.assign(cum_avg = df['numbers'].expanding().mean())
# numbers cum_avg
# 1 1 1.0
# 2 4 2.5
# 3 7 4.0
# 4 4 4.0
# 5 19 7.0
# plot
df['cum_avg'].plot(linestyle = 'none',
                   marker = 'o',
                   xticks = df.index,
                   xlabel = 'Entry',
                   ylabel = 'Cumulative average')

Matplotlib align uneven number of subplots

I want to plot 11 figures using subplots. My idea is to have 2 rows: 6 plots on the first, 5 on the second. I use the following code.
import matplotlib.pyplot as plt
import pandas as pd
fig, axes = plt.subplots(2, 6, figsize=(30, 8))
fig.tight_layout(h_pad=6, w_pad=6)
x = 0
y = 0
for i in range(0, 11):
    data = [[1, i*1], [2, i*2*2], [3, i*3*3]]
    df = pd.DataFrame(data, columns = ['x', 'y'])
    df.plot('x', ['y'], ax=axes[x,y])
    y += 1
    if y > 5:
        y = 0
        x += 1
fig.delaxes(ax=axes[1,5])
This works, but the bottom row is not aligned to the center, which makes the result a bit ugly. I want the figures to all be of the same size, so I cannot extend the last one to make everything even.
My question: how do I align the second row to be centered such that the full picture is symmetrical?
You could use gridspec, dividing each row into 12 partitions and recombining them pairwise. Shifting the second row by one partition leaves one empty partition at each end, which centers it:
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import pandas as pd
fig = plt.figure(figsize=(12, 5))
gs = gridspec.GridSpec(2, 12)
for i in range(0, 11):
    if i < 6:
        ax = plt.subplot(gs[0, 2 * i:2 * i + 2])
    else:
        ax = plt.subplot(gs[1, 2 * i - 11:2 * i + 2 - 11])
    data = [[1, i * 1], [2, i * 2 * 2], [3, i * 3 * 3]]
    df = pd.DataFrame(data, columns=['x', 'y'])
    df.plot('x', 'y', ax=ax)
plt.tight_layout()
plt.show()
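The index arithmetic above is specific to 11 plots in rows of 6. As a sketch of how it might generalize, the hypothetical helper below (centered_slices is not part of matplotlib) computes the grid cells for any number of plots, assuming a grid that is twice as wide as the number of plots per row.

```python
def centered_slices(n_plots, per_row):
    """Return (row, start, stop) grid cells for n_plots axes, each 2 cells wide.

    Assumes a grid with 2 * per_row columns; a short last row is shifted
    right by one cell per missing plot so that it ends up centered.
    """
    cells = []
    for first in range(0, n_plots, per_row):
        in_row = min(per_row, n_plots - first)
        offset = per_row - in_row  # empty cells to the left of this row
        row = first // per_row
        for j in range(in_row):
            start = offset + 2 * j
            cells.append((row, start, start + 2))
    return cells


cells = centered_slices(11, 6)
print(cells[6])  # (1, 1, 3)
```

Each tuple can then be used as gs[row, start:stop] with a GridSpec of 2 rows and 12 columns, which reproduces the slices in the loop above.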

Is there a way to create a stacked bar graph from pandas?

I have an sqlite database setup with some data. I have imported it through sql statements via pandas:
df1 = pd.read_sql_query("Select avg(Duration),keyword,filename from keywords group by keyword,filename order by filename", con)
The data looks as follows:
Based on this I want to construct a stacked bar graph that looks like this:
I've tried various solutions, including matplotlib and pandas.plot, but I'm unable to construct this graph.
Thanks in advance.
This snippet should work:
import pandas as pd
import matplotlib.pyplot as plt
data = [[2, 'A', 'output.xml'], [5, 'B', 'output.xml'],
        [3, 'A', 'output.xml'], [2, 'B', 'output.xml'],
        [5, 'C', 'output2.xml'], [1, 'B', 'output2.xml'],
        [6, 'C', 'output.xml'], [3, 'C', 'output2.xml'],
        [3, 'A', 'output2.xml'], [3, 'B', 'output.xml'],
        [2, 'C', 'output.xml'], [1, 'C', 'output2.xml']
        ]
df = pd.DataFrame(data, columns = ['duration', 'Keyword', 'Filename'])
df2 = df.groupby(['Filename', 'Keyword'])['duration'].sum().unstack('Keyword').fillna(0)
df2[['A','B', 'C']].plot(kind='bar', stacked=True)
It is similar to this question, with the difference that I sum the values of the field concerned instead of counting them.
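To see what the stacked bars are built from, the intermediate table can be printed before plotting. The sketch below uses the same sample data and also builds the table with pivot_table and aggfunc='sum' as an alternative route; the names by_group and by_pivot are illustrative only.

```python
import pandas as pd

data = [[2, 'A', 'output.xml'], [5, 'B', 'output.xml'],
        [3, 'A', 'output.xml'], [2, 'B', 'output.xml'],
        [5, 'C', 'output2.xml'], [1, 'B', 'output2.xml'],
        [6, 'C', 'output.xml'], [3, 'C', 'output2.xml'],
        [3, 'A', 'output2.xml'], [3, 'B', 'output.xml'],
        [2, 'C', 'output.xml'], [1, 'C', 'output2.xml']]
df = pd.DataFrame(data, columns=['duration', 'Keyword', 'Filename'])

# route 1: group, sum, then move the keywords into columns
by_group = (df.groupby(['Filename', 'Keyword'])['duration']
              .sum().unstack('Keyword').fillna(0))

# route 2: the same table in one call
by_pivot = df.pivot_table(index='Filename', columns='Keyword',
                          values='duration', aggfunc='sum', fill_value=0)
print(by_group)
```

Each row of the table becomes one bar and each column one stacked segment. Note that pivot_table defaults to the mean when aggfunc is omitted, so the explicit 'sum' matters here.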
1. You just have to use (with the column names returned by your query):
ax=df1.pivot_table(index='filename',columns='keyword',values='avg(Duration)').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
2. Example
df=pd.DataFrame()
df['avg(duration)']=[7,4,5,9,3,2]
df['keywoard']=['a','b','c','a','b','c']
df['fillname']=['out1','out1','out1','out2','out2','out2']
df
2.1 Output df example:
   avg(duration) keywoard fillname
0              7        a     out1
1              4        b     out1
2              5        c     out1
3              9        a     out2
4              3        b     out2
5              2        c     out2
2.2 Drawing
ax=df.pivot_table(index='fillname',columns='keywoard',values='avg(duration)').plot(kind='bar',stacked=True,figsize=(15,15),fontsize=25)
ax.legend(fontsize=25)
2.3 Output image example:
3. In addition, you can customize the plot using:
#set ylim
plt.ylim(-1, 20)
plt.xlim(-1,4)
#grid on
plt.grid()
# set y=0
ax.axhline(0, color='black', lw=1)
#change size of legend
ax.legend(fontsize=25,loc=(0.9,0.4))
#hiding upper and right axis layout
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
#changing the thickness
ax.spines['bottom'].set_linewidth(3)
ax.spines['left'].set_linewidth(3)
#setlabels
ax.set_xlabel('fillname',fontsize=20,color='r')
ax.set_ylabel('avg(duration)',fontsize=20,color='r')
#rotation
plt.xticks(rotation=0)
