Pandas dataframe as input for matplotlib.pyplot.boxplot - python

I have a pandas dataframe which looks like this:
[('1975801_m', 1 0.203244
10 -0.159756
16 -0.172756
19 -0.089756
20 -0.033756
23 -0.011756
24 0.177244
32 0.138244
35 -0.104756
36 0.157244
40 0.108244
41 0.032244
42 0.063244
45 0.362244
59 -0.093756
62 -0.070756
65 -0.030756
66 -0.100756
73 -0.140756
77 -0.110756
81 -0.100756
84 -0.090756
86 -0.180756
87 0.119244
88 0.709244
102 -0.030756
105 -0.000756
107 -0.010756
109 0.039244
111 0.059244
Name: RTdiff), ('3878418_m', 1637 0.13811
1638 -0.21489
1644 -0.15989
1657 -0.11189
1662 -0.03289
1666 -0.09489
1669 0.03411
1675 -0.00489
1676 0.03511
1677 0.39711
1678 -0.02289
1679 -0.05489
1681 -0.01989
1691 0.14411
1697 -0.10589
1699 0.09411
1705 0.01411
1711 -0.12589
1713 0.04411
1715 0.04411
1716 0.01411
1731 0.06411
1738 -0.25589
1741 -0.21589
1745 0.39411
1746 -0.13589
1747 -0.10589
1748 0.08411
Name: RTdiff)
I would like to use it as input for the matplotlib.pyplot.boxplot function.
The error I get from matplotlib.pyplot.boxplot(mydataframe) is ValueError: cannot set an array element with a sequence
I tried to use list(mydataframe) instead of mydataframe. That fails with the same error.
I also tried matplotlib.pyplot.boxplot(np.fromiter(mydataframe, np.float)) - that fails with ValueError: setting an array element with a sequence.

It's not clear that your data are in a DataFrame. It appears to be a list of Series objects.
Once it's really in a DataFrame, the trick here is to create your figure and axes ahead of time and use the **kwargs that you would normally use with matplotlib.axes.Axes.boxplot. You also need to make sure that your data is a DataFrame and not a Series.
import numpy as np
import matplotlib.pyplot as plt
import pandas
fig, ax = plt.subplots()
df = pandas.DataFrame(np.random.normal(size=(37,5)), columns=list('ABCDE'))
df.boxplot(ax=ax, positions=[2,3,4,6,8], notch=True, bootstrap=5000)
ax.set_xticks(range(10))
ax.set_xticklabels(range(10))
plt.show()
Which gives me a figure with five notched box plots at the x positions 2, 3, 4, 6 and 8.
Failing that, you can take a similar approach, looping through the columns you would like to plot using your ax object directly.
import numpy as np
import matplotlib.pyplot as plt
import pandas
df = pandas.DataFrame(np.random.normal(size=(37,5)), columns=list('ABCDE'))
fig, ax = plt.subplots()
for n, col in enumerate(df.columns):
    ax.boxplot(df[col], positions=[n+1], notch=True)
ax.set_xticks(range(10))
ax.set_xticklabels(range(10))
plt.show()
Which gives one notched box plot per column, at positions 1 through 5.
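If the data really is a list of (name, Series) pairs, as the printout in the question suggests, one possible route is to pass matplotlib a list of plain arrays, one per group. The sketch below is only an illustration: the values are random stand-ins rather than the question's data, the names groups, labels and data are made up for the example, and the Agg backend line is there only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for the question's structure: a list of (name, Series) tuples
# with different lengths and unrelated indexes.
rng = np.random.default_rng(0)
groups = [
    ("1975801_m", pd.Series(rng.normal(size=30), name="RTdiff")),
    ("3878418_m", pd.Series(rng.normal(size=28), index=range(100, 128), name="RTdiff")),
]

labels = [name for name, _ in groups]
data = [series.to_numpy() for _, series in groups]  # list of 1-D arrays

fig, ax = plt.subplots()
result = ax.boxplot(data)  # one box per array; lengths may differ
ax.set_xticks(range(1, len(labels) + 1))  # default box positions are 1..N
ax.set_xticklabels(labels)
ax.set_ylabel("RTdiff")
```

In an interactive session, plt.show() after the last line displays the figure. Passing a list of arrays avoids the ValueError because boxplot then treats each array as its own group instead of trying to build a single rectangular array.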

Related

Is it possible to plot a barchart with upper and lower limits of the bins with Pandas,seaborn or Matplotlib

I would like to know how to plot a bar chart in which the upper and lower limits of the bins are given by the values in the age_classes column of the dataframe shown below, using pandas, seaborn or matplotlib. A sample of the dataframe looks like this:
age_classes total_cases male_cases female_cases
0 0-9 693 381 307
1 10-19 931 475 454
2 20-29 4530 1919 2531
3 30-39 7466 3505 3885
4 40-49 13701 6480 7130
5 50-59 20975 11149 9706
6 60-69 18089 11761 6254
7 70-79 19238 12281 6868
8 80-89 16252 8553 7644
9 >90 4356 1374 2973
10 Unknown 168 84 81
If you want a plain bar chart with one bar per age class, you can make it with sns.barplot, setting age_classes as x and one column (in my case total_cases) as y, like in this code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('data.csv')
fig, ax = plt.subplots()
sns.barplot(ax=ax,
            data=df,
            x='age_classes',
            y='total_cases')
plt.show()
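If the bars should literally span the bin limits on a numeric axis, a possible alternative is to parse the lower and upper bounds out of age_classes and draw the bars with matplotlib directly. This is a sketch under assumptions: it uses only the first four rows of the sample, treats the bins as inclusive integer ranges (so 0-9 is drawn from 0 to 10), and leaves out labels such as >90 and Unknown, which would need separate handling. The Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# First four rows of the sample data from the question
df = pd.DataFrame({
    "age_classes": ["0-9", "10-19", "20-29", "30-39"],
    "total_cases": [693, 931, 4530, 7466],
})

# Split "lower-upper" labels into two integer columns
bounds = df["age_classes"].str.extract(r"^(\d+)-(\d+)$").astype(int)
lower, upper = bounds[0], bounds[1]

fig, ax = plt.subplots()
# align="edge" starts each bar at its lower bound; the width covers the whole bin
ax.bar(lower, df["total_cases"], width=upper - lower + 1,
       align="edge", edgecolor="black")
ax.set_xticks(list(lower) + [upper.iloc[-1] + 1])  # ticks at the bin edges
ax.set_xlabel("Age")
ax.set_ylabel("Total cases")
```

With equal-width bins this looks much like the seaborn chart, but the x axis is now numeric, so bins of unequal width would be drawn to scale.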

Python Matplotlib X-axis label dual axis with dataframe

I've got a dual axis bar and line plot using matplotlib. I read the data in as a dataframe,
[WEEK SIGNUPS APPLICATIONS PRECOURSE_WORK QUALIFIED ENROLLED SPEND
2019-10-07 5674 2938 2220 106 2 77581.67
2019-10-14 4538 2225 2309 567 204 61258.08
2019-10-21 3865 1997 1801 121 39 53700.58
2019-10-28 3559 1886 1641 162 39 53543.28
2019-11-04 3782 1946 1980 190 109 49495.64
2019-11-11 4033 2035 1568 118 109 49952.17
2019-11-18 3999 2009 1537 83 77 58545.72
2019-11-25 6170 3322 1660 110 61 52332.4
2019-12-02 5189 2658 7041 73 30 56727.55
2019-12-09 4631 2497 7904 174 116 60977.49
2019-12-16 4935 2501 3492 108 82 68179.54
2019-12-23 5289 2603 1983 80 38 76956.81
2019-12-30 5843 3037 2150 90 80 76246.14
2020-01-06 4194 1930 1619 74 57 46114.68]
My code works and produces a graph (below)
Here is my code
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick  # needed for mtick.StrMethodFormatter below
from pylab import rcParams
from matplotlib import style
style.use('seaborn-paper')
#print(plt.style.available)
rcParams['figure.figsize'] = 20, 10
#plt.xticks(df[['WEEK']])
ax = df[['SPEND']].plot(kind='bar', color = 'lightblue')
ax.set_ylabel("Spend",color="blue",fontsize=20)
ax.set_xlabel('Weeks',color="blue",fontsize=20)
ax2 = ax.twinx()
ax2.plot(df[['SIGNUPS','APPLICATIONS','ENROLLED']].values, linestyle='-', marker='o', linewidth=4.0)
fmt = '${x:,.0f}'
tick = mtick.StrMethodFormatter(fmt)
ax.yaxis.set_major_formatter(tick)
When I uncomment the line plt.xticks(df[['WEEK']]) I get the following error
ConversionError Failed to convert value(s) to axis unit.
Can anyone help me out?
plt.xticks expects the tick locations and, optionally, the labels. From the docs, the signature is
xticks(ticks, [labels], **kwargs)
So when you do
plt.xticks(df[['WEEK']])
It is trying to interpret the dates in the 'WEEK' column as the locations for the ticks. What you want to do instead is use set_xticklabels, which expects only the labels. Note that it is a method of the Axes object (pyplot has no set_xticklabels function), and that you should pass df['WEEK'] (a Series) rather than df[['WEEK']] (a one-column DataFrame), i.e.
ax.set_xticklabels(df['WEEK'])
# or
ax.set_xticklabels(df['WEEK'].values)
You may also need to manually convert the values to strings, depending on how they are defined.
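Put together, the dual-axis chart with week labels might look like the sketch below. It assumes a trimmed copy of the question's data (three weeks, fewer columns) with WEEK stored as strings, and it calls set_xticklabels on the Axes object returned by the pandas bar plot. The Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd

# Trimmed copy of the question's data
df = pd.DataFrame({
    "WEEK": ["2019-10-07", "2019-10-14", "2019-10-21"],
    "SIGNUPS": [5674, 4538, 3865],
    "ENROLLED": [2, 204, 39],
    "SPEND": [77581.67, 61258.08, 53700.58],
})

# Bars on the left axis; pandas places them at x = 0, 1, 2, ...
ax = df[["SPEND"]].plot(kind="bar", color="lightblue", legend=False)
ax.set_ylabel("Spend")
ax.yaxis.set_major_formatter(mtick.StrMethodFormatter("${x:,.0f}"))

# Lines on a second y axis that shares the same x positions
ax2 = ax.twinx()
ax2.plot(df[["SIGNUPS", "ENROLLED"]].to_numpy(), marker="o")

# Tick locations are the bar positions; the labels come from WEEK
ax.set_xticks(range(len(df)))
ax.set_xticklabels(df["WEEK"].astype(str), rotation=45)
ax.set_xlabel("Weeks")
```

Setting the locations first and the labels second keeps the two in step, which is the distinction plt.xticks(ticks, labels) makes as well.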

How Normalize Data Mining in Python with library

How can I apply MinMax normalization to data from a CSV file in Python 3 using a library?
This is an example of my data:
RT NK NB SU SK P TNI IK IB TARGET
84876 902 1192 2098 3623 169 39 133 1063 94095
79194 902 1050 2109 3606 153 39 133 806 87992
75836 902 1060 1905 3166 161 39 133 785 83987
75571 902 112 1878 3190 158 39 133 635 82618
83797 1156 134 1900 3518 218 39 133 709 91604
91648 1291 127 2225 3596 249 39 133 659 99967
79063 1346 107 1844 3428 247 39 133 591 86798
84357 1018 122 2152 3456 168 39 133 628 92073
90045 954 110 2044 3638 174 39 133 734 97871
83318 885 198 1872 3691 173 39 133 778 91087
93300 1044 181 2077 4014 216 39 133 635 101639
88370 1831 415 2074 4323 301 39 133 502 97988
91560 1955 377 2015 4153 349 39 223 686 101357
85746 1791 314 1931 3878 297 39 215 449 94660
93855 1891 344 2064 3947 287 39 162 869 103458
97403 1946 382 1937 4029 289 39 122 1164 107311
The MinMax formula is
= (data-min)/(max-min)*0.8+0.1
I found some code, but it does not normalize each column separately.
I know how to calculate it by hand, like this:
(first value of RT - min of column RT) / (max of column RT - min of column RT) * 0.8 + 0.1, etc.
and likewise for the next column:
(first value of NK - min of column NK) / (max of column NK - min of column NK) * 0.8 + 0.1
and so on.
Please help me.
This is my code, which I don't fully understand:
from sklearn.preprocessing import Normalizer
from pandas import read_csv
from numpy import set_printoptions
import pandas as pd
#df1=pd.read_csv("dataset.csv")
#print(df1)
namaFile = 'dataset.csv'
nama = ['rt', 'niagak', 'niagab', 'sosum', 'soskhus', 'p', 'tni', 'ik', 'ib', 'TARGET']
dataFrame = read_csv(namaFile, names=nama)
array = dataFrame.values
# split the array
X = array[:,0:10]
Y = array[:,9]
skala = Normalizer().fit(X)
normalisasiX = skala.transform(X)
# resulting data
print('Normalisasi Data')
set_printoptions(precision = 3)
print(normalisasiX[0:5,:])
The results of my manual calculation and of this code are very different.
We can use the pandas library.
import pandas as pd
df = pd.read_csv("filename")
norm = (df - df.min()) / (df.max() - df.min() )*0.8 + 0.1
norm will hold the normalised DataFrame.
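To see what the column-wise formula does, and why Normalizer gives such different numbers, here is a small sketch. It assumes a few values copied from the RT, NK and TNI columns of the question. The row-wise part is written in NumPy to mimic what Normalizer does by default (scaling each row to unit Euclidean length); it is not a call into scikit-learn itself.

```python
import numpy as np
import pandas as pd

# A few values from the question's RT, NK and TNI columns
df = pd.DataFrame({
    "RT": [84876, 79194, 75836, 75571],
    "NK": [902, 902, 902, 1156],
    "TNI": [39, 39, 39, 39],  # constant column: max - min is 0
})

# Column-wise min-max scaling into [0.1, 0.9], as in the question's formula.
# df.min() and df.max() are per-column, so each column is scaled on its own.
norm = (df - df.min()) / (df.max() - df.min()) * 0.8 + 0.1

# Row-wise scaling for contrast: each ROW is divided by its Euclidean length,
# so the result depends on the other values in the row, not on the column range.
values = df.to_numpy(dtype=float)
row_scaled = values / np.linalg.norm(values, axis=1, keepdims=True)

print(norm)
print(row_scaled)
```

Note that a constant column such as TNI produces 0/0 and therefore NaN under the plain formula, so such columns may need to be dropped or handled separately before scaling.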
By using MinMaxScaler from sklearn you can solve your problem.
from pandas import read_csv
from sklearn.preprocessing import MinMaxScaler
df = read_csv("your-csv-file")
data = df.values
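scaler = MinMaxScaler(feature_range=(0.1, 0.9))  # matches the *0.8 + 0.1 in the question's formula; the default range is (0, 1)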
scaler.fit(data)
scaled_data = scaler.transform(data)
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from pandas import DataFrame
from sklearn.preprocessing import MinMaxScaler
data = pd.read_csv('Q4dataset.csv')
#print(data)
df = DataFrame(data,columns=['X','Y'])
scaler = MinMaxScaler()
scaler.fit(df)
#print(scaler.transform(df))
minmaxdf = scaler.transform(df)
kmeans = KMeans(n_clusters=2).fit(minmaxdf)
centroids = kmeans.cluster_centers_
plt.scatter(df['X'], df['Y'], c= kmeans.labels_.astype(float), s=30, alpha=1)
You can use the code I wrote above. I performed min-max normalization on two-dimensional data and then applied the K-means clustering algorithm. Be sure to supply your own data set in .csv format.

Set Xticks frequency to dataframe index

I currently have a dataframe that has as an index the years from 1990 to 2014 (25 rows). I want my plot to have the X axis with all the years showing. I'm using add_subplot as I plan to have 4 plots in this figure (all of them with the same X axis).
To create the dataframe:
import pandas as pd
import numpy as np
index = np.arange(1990,2015,1)
columns = ['Total Population','Urban Population']
pop_plot = pd.DataFrame(index=index, columns=columns)
pop_plot = pop_plot.fillna(0)
pop_plot['Total Population'] = np.arange(150,175,1)
pop_plot['Urban Population'] = np.arange(50,125,3)
Total Population Urban Population
1990 150 50
1991 151 53
1992 152 56
1993 153 59
1994 154 62
1995 155 65
1996 156 68
1997 157 71
1998 158 74
1999 159 77
2000 160 80
2001 161 83
2002 162 86
2003 163 89
2004 164 92
2005 165 95
2006 166 98
2007 167 101
2008 168 104
2009 169 107
2010 170 110
2011 171 113
2012 172 116
2013 173 119
2014 174 122
The code that I currently have:
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(2,2,1, xticklabels=pop_plot.index)
plt.subplot(2, 2, 1)
plt.plot(pop_plot)
legend = plt.legend(pop_plot, bbox_to_anchor=(0.1, 1, 0.8, .45), loc=3, ncol=1, mode='expand')
legend.get_frame().set_alpha(0)
ax1.set_xticks(range(len(pop_plot.index)))
This is the plot that I get:
When I comment the set_xticks I get the following plot:
#ax1.set_xticks(range(len(pop_plot.index)))
I've tried a couple of answers that I found here, but I didn't have much success.
It's not clear what ax1.set_xticks(range(len(pop_plot.index))) should be used for. It will set the ticks to the numbers 0,1,2,3 etc. while your plot should range from 1990 to 2014.
Instead, you want to set the ticks to the numbers of your data:
ax1.set_xticks(pop_plot.index)
Complete corrected example:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
index = np.arange(1990,2015,1)
columns = ['Total Population','Urban Population']
pop_plot = pd.DataFrame(index=index, columns=columns)
pop_plot['Total Population'] = np.arange(150,175,1)
pop_plot['Urban Population'] = np.arange(50,125,3)
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(2,2,1)
ax1.plot(pop_plot)
legend = ax1.legend(pop_plot, bbox_to_anchor=(0.1, 1, 0.8, .45), loc=3, ncol=1, mode='expand')
legend.get_frame().set_alpha(0)
ax1.set_xticks(pop_plot.index)
plt.show()
The easiest option is to use the xticks parameter for pandas.DataFrame.plot
Pass the dataframe index to xticks: xticks=pop_plot.index
# given the dataframe in the OP
ax = pop_plot.plot(xticks=pop_plot.index, figsize=(15, 5))
# move the legend
ax.legend(bbox_to_anchor=(0.1, 1, 0.8, .45), loc=3, ncol=1, mode='expand', frameon=False)
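Since the question mentions four plots with the same X axis, one way to avoid repeating the tick setup is to create the grid with sharex=True. The sketch below rebuilds the question's dataframe and, as a placeholder, draws the same two columns in every panel; the rotation of the labels is an added assumption to keep 25 year labels readable. The Agg backend line is only there so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

years = np.arange(1990, 2015)
pop_plot = pd.DataFrame({
    "Total Population": np.arange(150, 175),
    "Urban Population": np.arange(50, 125, 3),
}, index=years)

# sharex=True links the x axes, so ticks and limits stay identical in all panels
fig, axes = plt.subplots(2, 2, sharex=True, figsize=(14, 7))
for ax in axes.flat:
    # placeholder content: every panel shows the same two columns
    ax.plot(pop_plot.index, pop_plot["Total Population"], label="Total Population")
    ax.plot(pop_plot.index, pop_plot["Urban Population"], label="Urban Population")
    ax.set_xticks(pop_plot.index)  # one tick per year
    ax.tick_params(axis="x", labelrotation=90)
axes[0, 0].legend(frameon=False)
fig.tight_layout()
```

With shared axes, only the bottom row shows tick labels by default, which leaves more room for the plots themselves.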

IPython notebook stops evaluating cells after plt.show()

I am using an IPython notebook to do some coding. When I open the notebook and run some code with SHIFT+ENTER, it runs. But after one or two runs, it stops giving any output. Why is that? I have to shut down the notebook and open it again; then it runs a few times and the same problem occurs again.
Here is the code I have used.
Question 1: Rotational Invariance of PCA
I(1): Importing the data sets and plotting a scatter plot of the two.
In [1]:
# Changing the working directory
import os
os.getcwd()
path="/Users/file/"
os.chdir(path)
pwd=os.getcwd()
print(pwd)
# Importing the libraries
import pandas as pd
import numpy as np
import scipy as sp
# Mentioning the files to be imported
file=["2d-gaussian.csv","2d-gaussian-rotated.csv"]
# Importing the two csv files in pandas dataframes
XI=pd.read_csv(file[0],header=None)
XII=pd.read_csv(file[1],header=None)
#XI
XII
Out[5]:
0 1
0 1.372310 -2.111748
1 -0.397896 1.968246
2 0.336945 1.338646
3 1.983127 -2.462349
4 -0.846672 0.606716
5 0.582438 -0.645748
6 4.346416 -4.645564
7 0.830186 -0.599138
8 -2.460311 2.096945
9 -1.594642 2.828128
10 3.767641 -3.401645
11 0.455917 -0.224665
12 2.878315 -2.243932
13 -1.062223 0.142675
14 -0.698950 1.113589
15 -4.681619 4.289080
16 0.411498 -0.041293
17 0.276973 0.187699
18 1.500835 -0.284463
19 -0.387535 -0.265205
20 3.594708 -2.581400
21 2.263455 -2.660592
22 -1.686090 1.566998
23 1.381510 -0.944383
24 -0.085535 -1.697205
25 1.030609 -1.448967
26 3.647413 -3.322129
27 -3.474906 2.977695
28 -7.930797 8.506523
29 -0.931702 1.440784
... ... ...
70 4.433750 -2.515612
71 1.495646 -0.058674
72 -0.928938 0.605706
73 -0.890883 -0.005911
74 -2.245630 1.333171
75 -0.707405 0.121334
76 0.675536 -0.822801
77 1.975917 -1.757632
78 -1.239322 2.053495
79 -2.360047 1.842387
80 2.436710 -1.445505
81 0.348497 -0.635207
82 -1.423243 -0.017132
83 0.881054 -1.823523
84 0.052809 1.505141
85 -2.466735 2.406453
86 -0.499472 0.970673
87 4.489547 -4.443907
88 -2.000164 4.125330
89 1.833832 -1.611077
90 -0.944030 0.771001
91 -1.677884 1.920365
92 0.372318 -0.474329
93 -2.073669 2.020200
94 -0.131636 -0.844568
95 -1.011576 1.718216
96 -1.017175 -0.005438
97 5.677248 -4.572855
98 2.179323 -1.704361
99 1.029635 -0.420458
100 rows × 2 columns
The two raw csv files have been imported as data frames. Next we will concatenate both the dataframes into one dataframe to plot a combined scatter plot
In [6]:
# Joining two dataframes into one.
df_combined=pd.concat([XI,XII],axis=1,ignore_index=True)
df_combined
Out[6]:
0 1 2 3
0 2.463601 -0.522861 1.372310 -2.111748
1 -1.673115 1.110405 -0.397896 1.968246
2 -0.708310 1.184822 0.336945 1.338646
3 3.143426 -0.338861 1.983127 -2.462349
4 -1.027700 -0.169674 -0.846672 0.606716
5 0.868458 -0.044767 0.582438 -0.645748
6 6.358290 -0.211529 4.346416 -4.645564
7 1.010685 0.163375 0.830186 -0.599138
8 -3.222466 -0.256939 -2.460311 2.096945
9 -3.127371 0.872207 -1.594642 2.828128
10 5.069451 0.258798 3.767641 -3.401645
11 0.481244 0.163520 0.455917 -0.224665
12 3.621976 0.448577 2.878315 -2.243932
13 -0.851991 -0.650218 -1.062223 0.142675
14 -1.281659 0.293194 -0.698950 1.113589
15 -6.343242 -0.277567 -4.681619 4.289080
16 0.320172 0.261774 0.411498 -0.041293
17 0.063126 0.328573 0.276973 0.187699
18 1.262396 0.860105 1.500835 -0.284463
19 -0.086500 -0.461557 -0.387535 -0.265205
20 4.367168 0.716517 3.594708 -2.581400
21 3.481827 -0.280818 2.263455 -2.660592
22 -2.300280 -0.084211 -1.686090 1.566998
23 1.644655 0.309095 1.381510 -0.944383
24 1.139623 -1.260587 -0.085535 -1.697205
25 1.753325 -0.295824 1.030609 -1.448967
26 4.928210 0.230011 3.647413 -3.322129
27 -4.562678 -0.351581 -3.474906 2.977695
28 -11.622940 0.407100 -7.930797 8.506523
29 -1.677601 0.359976 -0.931702 1.440784
... ... ... ... ...
70 4.913941 1.356329 4.433750 -2.515612
71 1.099070 1.016093 1.495646 -0.058674
72 -1.085156 -0.228560 -0.928938 0.605706
73 -0.625769 -0.634129 -0.890883 -0.005911
74 -2.530594 -0.645206 -2.245630 1.333171
75 -0.586007 -0.414415 -0.707405 0.121334
76 1.059484 -0.104132 0.675536 -0.822801
77 2.640018 0.154351 1.975917 -1.757632
78 -2.328373 0.575707 -1.239322 2.053495
79 -2.971570 -0.366041 -2.360047 1.842387
80 2.745141 0.700888 2.436710 -1.445505
81 0.695584 -0.202735 0.348497 -0.635207
82 -0.994271 -1.018499 -1.423243 -0.017132
83 1.912425 -0.666426 0.881054 -1.823523
84 -1.026954 1.101637 0.052809 1.505141
85 -3.445865 -0.042626 -2.466735 2.406453
86 -1.039549 0.333189 -0.499472 0.970673
87 6.316906 0.032272 4.489547 -4.443907
88 -4.331379 1.502719 -2.000164 4.125330
89 2.435918 0.157511 1.833832 -1.611077
90 -1.212710 -0.122350 -0.944030 0.771001
91 -2.544347 0.171460 -1.677884 1.920365
92 0.598670 -0.072133 0.372318 -0.474329
93 -2.894802 -0.037809 -2.073669 2.020200
94 0.504119 -0.690281 -0.131636 -0.844568
95 -1.930254 0.499670 -1.011576 1.718216
96 -0.715406 -0.723096 -1.017175 -0.005438
97 7.247917 0.780923 5.677248 -4.572855
98 2.746180 0.335849 2.179323 -1.704361
99 1.025371 0.430754 1.029635 -0.420458
100 rows × 4 columns
Plotting two separate scatter plot of all the four columns onto one scatter diagram
In [ ]:
import matplotlib.pyplot as plt
# Function for scatter plot
def scatter_plot():
    # plots scatter for first two columns (unrotated Gaussian data)
    # (.iloc replaces the .ix indexer, which was removed in pandas 1.0)
    plt.scatter(df_combined.iloc[:,0], df_combined.iloc[:,1], color='red', marker='+')
    # plots scatter for rotated Gaussian data
    plt.scatter(df_combined.iloc[:,2], df_combined.iloc[:,3], color='green', marker='x')
    legend = plt.legend(loc='upper right')
    # set ranges of x and y axes
    plt.xlim([-12,12])
    plt.ylim([-12,12])
    plt.show()
# Function call
scatter_plot()
In [ ]:
def plot_me1():
    # create figure and axes
    fig = plt.figure()
    # split the page into a 1x1 array of subplots and put me in the first one (111)
    # (as a matter of fact, the only one)
    ax = fig.add_subplot(111)
    # plots scatter for x, y1 (.iloc replaces .ix, which was removed in pandas 1.0)
    ax.scatter(df_combined.iloc[:,0], df_combined.iloc[:,1], color='red', marker='+', s=100)
    # plots scatter for x, y2
    ax.scatter(df_combined.iloc[:,2], df_combined.iloc[:,3], color='green', marker='x', s=100)
    plt.xlim([-12,12])
    plt.ylim([-12,12])
    plt.show()
plot_me1()
In [ ]:
You should not use plt.show() in the notebook. This will open an external window that blocks the evaluation of your cell.
Instead begin your notebooks with %matplotlib inline or the cool new %matplotlib notebook (the latter is only possible with matplotlib >= 1.4.3 and ipython >= 3.0)
After the evaluation of each cell, the (still open) figure object is automatically shown in your notebook.
This minimal code example works in notebook. Note that it does not call plt.show()
%matplotlib inline
import matplotlib.pyplot as plt
x = [1,2,3]
y = [3,2,1]
_ = plt.plot(x,y)
%matplotlib inline simply displays the image.
%matplotlib notebook was added recently and offers many of the cool features (zooming, measuring,...) of the interactive backends.
