Pandas Plotting Y-Axis indexing issue - python

I have this pandas data frame set up:
FY NY_State
0 1986-87 89431973
1 1987-88 95958200
2 1988-89 100664606
3 1989-90 99703990
4 1990-91 95446076
5 1991-92 91487047
6 1992-93 92658482
7 1993-94 88026334
8 1994-95 90845207
9 1995-96 80070860
10 1996-97 77357591
11 1997-98 87040859
12 1998-99 89547598
13 1999-00 93484650
14 2000-01 118696779
15 2001-02 132748185
16 2002-03 111932612
17 2003-04 116911977
18 2004-05 119898693
19 2005-06 149293542
20 2006-07 161647387
21 2007-08 193891526
22 2008-09 170071041
23 2009-10 180069745
24 2010-11 174704520
FWIW:
In [50]: totalData.dtypes
Out[50]:
FY object
NY_State int64
dtype: object
I want to make a bar chart with the FY on the x-axis and the y-axis being the amount in the NY_State column.
I've made some progress with this:
totalData.plot(x=totalData.FY, kind='bar')
but that gives me this:
Then I tried this:
totalData.plot(x=totalData.FY, kind='bar', ylim=(70000000, 240000000))
And that gave me this:
Which is better, but still not what I want. I tried:
totalData.plot(x=totalData.FY, y=totalData.NY_State, kind='bar')
but that gives me an exception of
IndexError: indices are out-of-bounds
...and I don't understand how that's possible.
Would really appreciate help.
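One approach that may help, sketched here under the assumption of a pandas version where the x and y arguments of DataFrame.plot take column labels: pass the names 'FY' and 'NY_State' instead of the Series objects. Only the first four rows of the table above are used, and the Agg backend line is just there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# A few rows copied from the table in the question
totalData = pd.DataFrame({
    "FY": ["1986-87", "1987-88", "1988-89", "1989-90"],
    "NY_State": [89431973, 95958200, 100664606, 99703990],
})

# x and y are given as column labels, not as the Series themselves
ax = totalData.plot(x="FY", y="NY_State", kind="bar", legend=False)
ax.set_ylabel("NY_State")

# The fiscal years end up as the x tick labels, one bar per row
tick_labels = [t.get_text() for t in ax.get_xticklabels()]
```

Passing a Series as y asks pandas to look up columns by the Series values, which is probably where the out-of-bounds error came from.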

Related

Getting array with empty values in flask but get correct values in a python notebook

I am making a basic web application which takes the inputs for a logistic regression model and returns the class in which it lies. Here is the code for the prediction:
test_data = pd.Series([battery_power, blue, clock_speed, dual_sim, fc, four_g,
int_memory, m_dep, mobile_wt, n_cores, pc, px_height,
px_width, ram, sc_h, sc_w, talk_time, three_g,
touch_screen, wifi])
df = pd.read_csv(r"Users\ADMIN\Desktop\project\mobiledata_clean.csv")
df.drop(['Unnamed: 0', 'price_range'], inplace=True, axis=1)
print(df)
print(test_data)
#scaling the values
xpred= np.array((test_data-df.min())/(df.max()-df.min())).reshape(1,-1)
print(xpred)
The test_data is:
0 842
1 0
2 2.2
3 0
4 1
5 0
6 7
7 0.6
8 188
9 2
10 2
11 20
12 756
13 2549
14 9
15 7
16 19
17 0
18 0
19 1
dtype: object
Here's the dataframe in df:
df
I get a (1,40) array of null values in the xpred variable. Can someone tell me why this is happening?
The dtype of test_data is shown as object. Try casting it to a numeric type such as float before doing the arithmetic.
For a Series: s.astype('float64')
For a DataFrame: df.astype({'col1': 'float64'})
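Casting alone may not be enough, because pandas aligns Series arithmetic on index labels. The sketch below uses a hypothetical two-column stand-in for the 20-column dataframe (the values are made up for illustration) to show that a Series labelled 0, 1, ... shares no labels with df.min(), so every result is NaN. That would also be consistent with the (1,40) shape in the question: 20 integer labels plus 20 column names.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column stand-in for the 20-column dataframe in the question
df = pd.DataFrame({"battery_power": [500, 842, 2000],
                   "clock_speed": [0.5, 2.2, 3.0]})

# Built without an index, the Series gets integer labels 0, 1
test_data = pd.Series([842, 2.2])

# Arithmetic aligns on labels: {0, 1} never matches the column names,
# so the result has four entries and all of them are NaN
misaligned = (test_data - df.min()) / (df.max() - df.min())

# Giving the Series the dataframe's column labels lets the values line up
aligned_input = pd.Series([842, 2.2], index=df.columns, dtype="float64")
xpred = np.array((aligned_input - df.min()) / (df.max() - df.min())).reshape(1, -1)
```

With the labels matched, xpred has one entry per column instead of one per label in the union.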

pandas DataFrame.value_counts().plot.bar() and DataFrame.value_counts().cumsum().plot() not using the same axis

I am trying to draw a frequency bar plot and a cumulative "ogive" in the same plot. If I draw them separately, both are shown correctly, but when they are drawn in the same figure, the cumulative line is shifted. The code I used is below.
df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,3,6,9,5,11,5,12,7,7,5,4,6]});
df['Correctas'].value_counts(sort = False).plot.bar();
df['Correctas'].value_counts(sort = False).cumsum().plot();
plt.show()
The cumulative frequency data is:
2 1
3 3
4 7
5 14
6 20
7 24
8 27
9 29
10 30
11 31
12 32
So the cumulative line should start at 2 on the x-axis, but it starts at 4.
image showing the error
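This happens because the bar chart treats the x-axis as categorical: the bars are drawn at positions 0, 1, 2, ... while the line is drawn at the numeric index values. Here is a quick fix: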
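import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'Correctas': [4,6,5,4,7,2,8,3,5,6,9,6,6,7,5,5,8,10,4,8,3,6,9,5,11,5,12,7,7,5,4,6]})
# Count each value and order the counts by value
df_counts = df['Correctas'].value_counts(sort=False).sort_index()
# String labels make the line use the same 0..n-1 positions as the bars
df_counts.index = df_counts.index.astype(str)
df_counts.plot.bar(alpha=.8)
df_counts.cumsum().plot(color='k', kind='line')
plt.show()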
Output:

Xtick frequency in pandas boxplot

I am using pandas groupby to plot wind speed vs. direction as a box-and-whisker plot. However, the x-axis is not readable because so many wind direction values sit close to each other.
I have tried the oc_params ax.set_xticks, but I end up with either an empty x-axis or a modified x-axis with different values.
The head of my dataframe:
Kvit_TIU dir_cat
0 0.064740 14
1 0.057442 15
2 0.056750 15
3 0.069002 17
4 0.068464 17
5 0.067057 17
6 0.071901 12
7 0.050464 5
8 0.066165 1
9 0.073993 27
10 0.090784 34
11 0.121366 33
12 0.087172 34
13 0.066197 30
14 0.073020 17
15 0.071784 16
16 0.081699 17
17 0.088014 14
18 0.076758 14
19 0.078574 14
I grouped by dir_cat to create a box plot:
fig = plt.figure() # create the canvas for plotting
ax1 = plt.subplot(1,1,1)
ax1 = df_KvTr10hz.boxplot(column='Kvit_TIU', by='dir_cat', showfliers=False, showmeans=True)
ax1.set_xticks([30,90, 180,270, 330])
I would like the x-axis ticks to be plotted less frequently so that the plot is readable.
ax1 = df_KvTr10hz.dropna().boxplot(column='Kvit_TIU', by='dir_cat', showfliers=False, showmeans=True)
EDIT: Using OP sample dataframe
However, if we substitute with NaNs the Kvit_TIU values for 'dir_cat'>=30
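A minimal sketch of one way the tick labels could be thinned, using the first ten rows of the sample above. It relies on the fact that matplotlib draws the boxes at positions 1..n (one per sorted group), not at the dir_cat values, which is why ticks at 30, 90, 180, ... fall outside the plotted range. The step value is a hypothetical choice, and the Agg backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# First ten rows of the sample dataframe from the question
df_KvTr10hz = pd.DataFrame({
    "Kvit_TIU": [0.064740, 0.057442, 0.056750, 0.069002, 0.068464,
                 0.067057, 0.071901, 0.050464, 0.066165, 0.073993],
    "dir_cat": [14, 15, 15, 17, 17, 17, 12, 5, 1, 27],
})

fig, ax = plt.subplots()
df_KvTr10hz.boxplot(column="Kvit_TIU", by="dir_cat", ax=ax,
                    showfliers=False, showmeans=True)

# Boxes sit at positions 1..n, one per sorted group
groups = sorted(df_KvTr10hz["dir_cat"].unique())
positions = list(range(1, len(groups) + 1))

step = 2  # hypothetical choice: label every second box
ax.set_xticks(positions[::step])
ax.set_xticklabels([str(g) for g in groups[::step]])

shown = [t.get_text() for t in ax.get_xticklabels()]
```

All boxes are still drawn; only every second one keeps its label.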

Using pandas series date as xtick label

I have this dataframe called 'dfArrivalDate' (with the first 11 rows shown)
arrival_date count
0 2013-06-08 9
1 2013-06-27 8
2 2013-03-06 8
3 2013-06-01 8
4 2013-06-28 6
5 2012-11-28 6
6 2013-06-11 5
7 2013-06-29 5
8 2013-06-09 4
9 2013-06-03 3
10 2013-05-31 3
sortedArrivalDate = transform.sort('arrival_date')
I wanted to plot them in a bar chart to see the count by arrival date. I called
sortedArrivalDate.plot(kind = 'bar')
but I'm getting the index as the tick labels of my bar chart. I figured I need to use 'xticks'.
sortedArrivalDate.plot(kind = 'bar', xticks = sortedArrivalDate.arrival_date)
but I run into the error: TypeError: Cannot compare type 'Timestamp' with type 'float'
I tried a different approach.
fig, ax = plt.subplots()
ax.plot(sortedArrivalDate.arrival_date, sortedArrivalDate.count)
This time the error is ValueError: x and y must have same first dimension
I think this might be an easy fix, but since I don't have much experience with pandas and matplotlib, I might be missing something very simple. Could you guide me in the right direction? Thanks.
IIUC:
df = df.sort_values(by='arrival_date')
df.plot(x='arrival_date', y='count', kind='bar')
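The ValueError in the second attempt probably comes from the column name: attribute access such as sortedArrivalDate.count returns the DataFrame.count method rather than the 'count' column. A small sketch with three rows from the table above, assuming arrival_date holds datetimes:

```python
import pandas as pd

df = pd.DataFrame({
    "arrival_date": pd.to_datetime(["2013-06-08", "2013-06-27", "2013-03-06"]),
    "count": [9, 8, 8],
})

# Attribute access finds the DataFrame.count method first, not the column
is_method = callable(df.count)

# Bracket access always returns the column
df = df.sort_values(by="arrival_date")
ordered = df["count"].tolist()
```

Using column labels, as in the answer above, avoids the clash entirely.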

Line chart in matplotlib with a double axis(strings on the axis)

I am trying to create a chart using Python from data in an Excel sheet. The data looks like this:
Location Values
Trial 1 Edge 12
M-2 13
Center 14
M-4 15
M-5 12
Top 13
Trial 2 Edge 10
N-2 11
Center 11
N-4 12
N-5 13
Top 14
Trial 3 Edge 15
R-2 13
Center 12
R-4 11
R-5 10
Top 3
I want my graph to look like this:
Chart-1
The chart should have the Location column values, i.e. string objects, on the x-axis. This can be done easily (by using or creating Location as an array):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
datalink=('/Users/Maxwell/Desktop/W1.xlsx')
df=pd.read_excel(datalink,skiprows=2)
x1=df.loc[:,['Location']]
x2=df.loc[:,['Values']]
x3=np.linspace(1,len(x2),num=len(x2),endpoint=True)
vals=['Location','Edge','M-2','Center','M-4','M-5','Top','Edge','N-2','Center','N-4','N-5','Top','Edge','R-2']
plt.figure(figsize=(12,8),dpi=300)
plt.subplot(1,1,1)
plt.xticks(x3,vals)
plt.plot(x3,x2)
plt.show()
But I also want to show Trial-1, Trial-2, ... on the x-axis. Up to now I have been using Excel to generate the chart, but I have a lot of similar data and want to use Python to automate the task.
With your Excel sheet laid out as in the question, you can use matplotlib to create the plot you want. It is not straightforward, but it can be done. See below:
EDIT: earlier I suggested factorplot, but it is not applicable because your location values for each trial are not constant.
df = pd.read_excel(r'test_data.xlsx', header = 1, usecols = "D:F",
names = ['Trial', 'Location', 'Values'])
'''
Trial Location Values
0 Trial 1 Edge 12
1 NaN M-2 13
2 NaN Center 14
3 NaN M-4 15
4 NaN M-5 12
5 NaN Top 13
6 Trial 2 Edge 10
7 NaN N-2 11
8 NaN Center 11
9 NaN N-4 12
10 NaN N-5 13
11 NaN Top 14
12 Trial 3 Edge 15
13 NaN R-2 13
14 NaN Center 12
15 NaN R-4 11
16 NaN R-5 10
17 NaN Top 3
'''
# this will replace the nan with corresponding trial number for each set of trials
df = df.ffill()
'''
Trial Location Values
0 Trial 1 Edge 12
1 Trial 1 M-2 13
2 Trial 1 Center 14
3 Trial 1 M-4 15
4 Trial 1 M-5 12
5 Trial 1 Top 13
6 Trial 2 Edge 10
7 Trial 2 N-2 11
8 Trial 2 Center 11
9 Trial 2 N-4 12
10 Trial 2 N-5 13
11 Trial 2 Top 14
12 Trial 3 Edge 15
13 Trial 3 R-2 13
14 Trial 3 Center 12
15 Trial 3 R-4 11
16 Trial 3 R-5 10
17 Trial 3 Top 3
'''
from matplotlib import rcParams
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
rcParams.update({'font.size': 10})
f, ax1 = plt.subplots(1, figsize = (10,3))
ax1.plot(list(df.Location.index), df['Values'],'o-')
ax1.set_xticks(list(df.Location.index))
ax1.set_xticklabels(df.Location, rotation=90 )
ax1.yaxis.set_label_text("Values")
# create a secondary axis
ax2 = ax1.twiny()
# hide all the spines that we dont need
ax2.spines['top'].set_visible(False)
ax2.spines['bottom'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.spines['left'].set_visible(False)
pos1 = ax2.get_position() # get the original position
pos2 = [pos1.x0 + 0, pos1.y0 -0.2, pos1.width , pos1.height ] # create a new position by offsetting it
ax2.xaxis.set_ticks_position('bottom')
ax2.set_position(pos2) # set a new position
trials_ticks = 1.0 * df.Trial.value_counts().cumsum()/ (len(df.Trial)) # tick position at the end of each trial group, as a fraction of the axis
trials_ticks_positions = [0]+list(trials_ticks) # add an additional zero so there is also a tick at the start
trials_labels_offset = 0.5 * df.Trial.value_counts()/ (len(df.Trial)) # half a group width, so each label sits between two ticks
trials_label_positions = trials_ticks - trials_labels_offset # position of each trial label
# set the ticks and ticks labels
ax2.set_xticks(trials_ticks_positions)
ax2.xaxis.set_major_formatter(ticker.NullFormatter())
ax2.xaxis.set_minor_locator(ticker.FixedLocator(list(trials_label_positions)))
ax2.xaxis.set_minor_formatter(ticker.FixedFormatter(list(trials_label_positions.index)))
ax2.tick_params(axis='x', length = 10,width = 1)
plt.show()
results in
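To make the tick arithmetic in the answer concrete, here is a small sketch for three trials of six rows each, as in the sample data. It uses groupby(..., sort=False).size() instead of value_counts() so that the trials stay in order of appearance; that substitution is an assumption made for the illustration, not part of the answer's code.

```python
import pandas as pd

# Trial column after forward-filling: six rows per trial
trial = pd.Series(["Trial 1"] * 6 + ["Trial 2"] * 6 + ["Trial 3"] * 6)

# Rows per trial, kept in order of first appearance
counts = trial.groupby(trial, sort=False).size()

# Right edge of each trial group as a fraction of the axis width
boundaries = counts.cumsum() / len(trial)

# Label position: the middle of each group (edge minus half a group width)
midpoints = boundaries - 0.5 * counts / len(trial)
```

The boundaries come out at 1/3, 2/3 and 1, and the labels at 1/6, 1/2 and 5/6 of the axis width.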
