How can multiple columns be plotted, where the column headers are on the x-axis?
Data frame:
In[] nfl.head()
Out[]:
NfL_BL NfL_V02 NfL_V04 NfL_V06 NfL_V08 NfL_V12
0 5.67 NaN 6.150000 7.940000 9.03 40.200001
5 7.88 6.66 7.100000 8.190000 8.39 8.570000
11 15.50 NaN 17.799999 19.799999 24.50 23.900000
16 6.52 6.38 7.220000 8.980000 8.00 7.350000
22 4.53 NaN 4.960000 5.900000 4.98 4.930000
...
The data is temporal (from BL to V12), and I want the column headers on the x-axis.
Every attempt so far has produced awful plots and histograms.
The simplest way is with pandas.DataFrame.plot
import pandas as pd
import numpy as np
# test data and dataframe
data = {0: {'NfL_BL': 5.67, 'NfL_V02': np.nan, 'NfL_V04': 6.15, 'NfL_V06': 7.94, 'NfL_V08': 9.03, 'NfL_V12': 40.200001},
5: {'NfL_BL': 7.88, 'NfL_V02': 6.66, 'NfL_V04': 7.1, 'NfL_V06': 8.19, 'NfL_V08': 8.39, 'NfL_V12': 8.57},
11: {'NfL_BL': 15.5, 'NfL_V02': np.nan, 'NfL_V04': 17.799999, 'NfL_V06': 19.799999, 'NfL_V08': 24.5, 'NfL_V12': 23.9},
16: {'NfL_BL': 6.52, 'NfL_V02': 6.38, 'NfL_V04': 7.22, 'NfL_V06': 8.98, 'NfL_V08': 8.0, 'NfL_V12': 7.35},
22: {'NfL_BL': 4.53, 'NfL_V02': np.nan, 'NfL_V04': 4.96, 'NfL_V06': 5.9, 'NfL_V08': 4.98, 'NfL_V12': 4.93}}
nfl = pd.DataFrame.from_dict(data, orient='index')
# display(nfl)
NfL_BL NfL_V02 NfL_V04 NfL_V06 NfL_V08 NfL_V12
0 5.67 NaN 6.15 7.94 9.03 40.20
5 7.88 6.66 7.10 8.19 8.39 8.57
11 15.50 NaN 17.80 19.80 24.50 23.90
16 6.52 6.38 7.22 8.98 8.00 7.35
22 4.53 NaN 4.96 5.90 4.98 4.93
# plot dataframe
nfl.plot()
# bar plot
nfl.plot.bar()
.transpose and plot
This will set the column headers as the index, and make it possible to plot them on the x-axis.
# transposed dataframe nfl.T
0 5 11 16 22
NfL_BL 5.67 7.88 15.5 6.52 4.53
NfL_V02 NaN 6.66 NaN 6.38 NaN
NfL_V04 6.15 7.10 17.8 7.22 4.96
NfL_V06 7.94 8.19 19.8 8.98 5.90
NfL_V08 9.03 8.39 24.5 8.00 4.98
NfL_V12 40.20 8.57 23.9 7.35 4.93
# plot a transposed dataframe
nfl.T.plot.bar()
For a histogram of all the values
I recommend using seaborn, which is a high-level API for matplotlib.
It makes working with differently shaped data easier.
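seaborn.histplot (seaborn.distplot has been deprecated since seaborn 0.11)
import seaborn as sns
# collect the values of all columns into one Series and drop the NaN entries
values = nfl.melt()['value'].dropna()
# plot
p = sns.histplot(x=values)
p.set_xlabel('bins')
p.set_ylabel('counts')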
I have this dataframe df:
alpha1 week_day calendar_week
0 2.49 Freitag 2022-04-(01/07)
1 1.32 Samstag 2022-04-(01/07)
2 2.70 Sonntag 2022-04-(01/07)
3 3.81 Montag 2022-04-(01/07)
4 3.58 Dienstag 2022-04-(01/07)
5 3.48 Mittwoch 2022-04-(01/07)
6 1.79 Donnerstag 2022-04-(01/07)
7 2.12 Freitag 2022-04-(08/14)
8 2.41 Samstag 2022-04-(08/14)
9 1.78 Sonntag 2022-04-(08/14)
10 3.19 Montag 2022-04-(08/14)
11 3.33 Dienstag 2022-04-(08/14)
12 2.88 Mittwoch 2022-04-(08/14)
13 2.98 Donnerstag 2022-04-(08/14)
14 3.01 Freitag 2022-04-(15/21)
15 3.04 Samstag 2022-04-(15/21)
16 2.72 Sonntag 2022-04-(15/21)
17 4.11 Montag 2022-04-(15/21)
18 3.90 Dienstag 2022-04-(15/21)
19 3.16 Mittwoch 2022-04-(15/21)
and so on, with ascending calendar weeks.
I performed a pivot table to generate a heatmap.
df_pivot = pd.pivot_table(df, values=['alpha1'], index=['week_day'], columns=['calendar_week'])
What I get is:
alpha1 \
calendar_week 2022-(04-29/05-05) 2022-(05-27/06-02) 2022-(07-29/08-04)
week_day
Dienstag 3.32 2.09 4.04
Donnerstag 3.27 2.21 4.65
Freitag 2.83 3.08 4.19
Mittwoch 3.22 3.14 4.97
Montag 2.83 2.86 4.28
Samstag 2.62 3.62 3.88
Sonntag 2.81 3.25 3.77
\
calendar_week 2022-(08-26/09-01) 2022-04-(01/07) 2022-04-(08/14)
week_day
Dienstag 2.92 3.58 3.33
Donnerstag 3.58 1.79 2.98
Freitag 3.96 2.49 2.12
Mittwoch 3.09 3.48 2.88
Montag 3.85 3.81 3.19
Samstag 3.10 1.32 2.41
Sonntag 3.39 2.70 1.78
As you can see, the sorting of the pivot table is messed up. I need the columns (calendar weeks) in the same order as in the original dataframe.
I have been looking all over but couldn't find how to achieve this.
It would also be very nice if the order of the rows stayed the same.
Any help will be greatly appreciated.
UPDATE
I didn't paste all the data; it would have been too long.
The calendar_week column consists of the following elements:
'2022-04-(01/07)',
'2022-04-(08/14)',
'2022-04-(15/21)',
'2022-04-(22/28)',
'2022-(04-29/05-05)',
'2022-05-(06/12)',
'2022-05-(13/19)',
'2022-05-(20/26)',
'2022-(05-27/06-02)',
'2022-06-(03/09)',
'2022-06-(10/16)',
'2022-06-(17/23)',
'2022-06-(24/30)',
'2022-07-(01/07)',
'2022-07-(08/14)',
'2022-07-(15/21)',
'2022-07-(22/28)',
'2022-(07-29/08-04)',
'2022-08-(05/11)',
etc....
Each occurs 7 times in df. It represents a calendar week.
The sorting is the natural time sorting.
After pivoting the dataframe, the order of the columns gets messed up. I guess it's due to the two different label formats: 2022-(07-29/08-04) and 2022-07-(15/21).
Try running this:
df_pivot.sort_values(by = ['calendar_week'], axis = 1, ascending = True)
I got the following output. Is this what you wanted?
calendar_week 2022-04-(01/07) 2022-04-(08/14) 2022-04-(15/21)
week_day
Dienstag 3.58 3.33 3.90
Donnerstag 1.79 2.98 NaN
Freitag 2.49 2.12 3.01
Mittwoch 3.48 2.88 3.16
Montag 3.81 3.19 4.11
Be sure to replace the NaN values using the fillna() function.
I hope that answers it. :)
You can use an ordered Categorical for your week days and sort the dates after pivoting with sort_index:
# define the desired order of the days
days = ['Montag', 'Dienstag', 'Mittwoch', 'Donnerstag',
'Freitag', 'Samstag', 'Sonntag']
df_pivot = (df
.assign(week_day=pd.Categorical(df['week_day'], categories=days,
ordered=True))
.pivot_table(values='alpha1', index='week_day',
columns='calendar_week')
.sort_index(axis=1)
)
output:
calendar_week 2022-04-(01/07) 2022-04-(08/14) 2022-04-(15/21)
week_day
Montag 3.81 3.19 4.11
Dienstag 3.58 3.33 3.90
Mittwoch 3.48 2.88 3.16
Donnerstag 1.79 2.98 NaN
Freitag 2.49 2.12 3.01
Samstag 1.32 2.41 3.04
Sonntag 2.70 1.78 2.72
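Note that sort_index(axis=1) sorts the labels as strings, so a label such as '2022-(04-29/05-05)' still sorts before '2022-04-(01/07)'. A minimal sketch of one way around this, assuming the rows of df are already in chronological order: take the order of first appearance from the original frame and reindex the pivot with it. The toy rows and the names week_order and day_order are made up for illustration.

```python
import pandas as pd

# Toy frame (illustrative values only), already in chronological order.
df = pd.DataFrame({
    "alpha1": [2.49, 1.32, 3.32, 2.62, 2.12, 2.41],
    "week_day": ["Freitag", "Samstag"] * 3,
    "calendar_week": ["2022-04-(01/07)"] * 2
                     + ["2022-(04-29/05-05)"] * 2
                     + ["2022-05-(06/12)"] * 2,
})

# Order of first appearance in the original data.
week_order = pd.unique(df["calendar_week"])
day_order = pd.unique(df["week_day"])

# pivot_table sorts both axes; reindex restores the original order.
df_pivot = (df
    .pivot_table(values="alpha1", index="week_day", columns="calendar_week")
    .reindex(index=day_order, columns=week_order)
)
print(df_pivot)
```

The same reindex call also keeps the week days in the order in which they occur in df, so no Categorical is needed if that order is the desired one.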
I have this dataframe; please note the last column ("Yr_Mo_Dy") on the right
In[38]: data.head()
Out[38]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL Yr_Mo_Dy
0 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 61-1-1
1 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 61-1-2
2 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 61-1-3
3 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 61-1-4
4 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 61-1-5
The type of the "Yr_Mo_Dy" column is object while the others are float64.
I simply want to change the order of the columns so that "Yr_Mo_Dy" is the first column in the dataframe.
I tried the following but I get TypeError. What's wrong?
In[39]: cols = data.columns.tolist()
In[40]: cols
Out[40]:
['RPT',
'VAL',
'ROS',
'KIL',
'SHA',
'BIR',
'DUB',
'CLA',
'MUL',
'CLO',
'BEL',
'MAL',
'Yr_Mo_Dy']
In[41]: cols = cols[-1] + cols[:-1]
TypeError Traceback (most recent call last)
<ipython-input-59-c0130d1863e8> in <module>()
----> 1 cols = cols[-1] + cols[:-1]
TypeError: must be str, not list
You need to add : to get a one-element list, because you have to concatenate 2 lists:
#string
print (cols[-1])
Yr_Mo_Dy
#one element list
print (cols[-1:])
['Yr_Mo_Dy']
cols = cols[-1:] + cols[:-1]
Or it is possible to wrap the element in [], but that is less readable:
cols = [cols[-1]] + cols[:-1]
print (cols)
['Yr_Mo_Dy', 'RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR',
'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL']
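The reordered list still has to be applied to the dataframe. A minimal sketch, assuming a small stand-in frame with only three of the columns (the values are copied from the first two rows of the question):

```python
import pandas as pd

# Small stand-in for the question's dataframe.
data = pd.DataFrame({
    "RPT": [15.04, 14.71],
    "VAL": [14.96, None],
    "Yr_Mo_Dy": ["61-1-1", "61-1-2"],
})

cols = data.columns.tolist()
# Slice with -1: to get a one-element list, then concatenate the two lists.
cols = cols[-1:] + cols[:-1]

# Selecting with a list of labels returns the columns in that order.
data = data[cols]
print(data.columns.tolist())
```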
Option 1
Use pd.DataFrame.insert and pd.DataFrame.pop to alter the dataframe in place. This is a very generalizable solution as you can swap in any column position for popping or inserting.
c = df.columns[-1]
df.insert(0, c, df.pop(c))
df
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 61-1-1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 61-1-2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 61-1-3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 61-1-4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 61-1-5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
Option 2
pd.DataFrame.reindex and np.roll (reindex_axis was deprecated and later removed from pandas)
df.reindex(columns=np.roll(df.columns, 1))
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 61-1-1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 61-1-2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 61-1-3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 61-1-4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 61-1-5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
data :
118.6
109.7
126.7
107.8
113.9
109.7
109.7
98.2
112.3
153.7
157.8
85
126.7
125.1
155.4
138.5
154.6
189.9
120.6
101.6
128.7
138.5
210.8
124.4
189.8
122.5
161.7
188.6
229.1
168.7
233.7
137.5
126.6
244.4
141.9
227.5
183
177.6
244.4
95.1
116.4
75.9
75.3
109.8
117.1
75.9
109.8
71.2
71.3
89.6
93.3
84.7
85
82.9
145.3
107.7
84.2
96.7
89.8
86.2
85
89.6
67.5
64.9
48.1
54.9
56.1
60.6
51
44.6
64.3
57.6
66.2
69
60
70.2
65.4
60.1
49.4
61.4
62.8
78.8
70.3
82.7
68.6
I want to convert this numeric data into ordinal values.
Example:
if a value falls in 60 to 69.9, it should become 1.
if a value falls in 70 to 79.9, it should become 2.
if a value falls in 80 to 89.9, it should become 3.
if a value falls in 90 to 99.9, it should become 4, and so on.
I know how to binarize data, using binaryX = binarizer.transform(X)
but I don't know how to convert numeric intervals into a single ordinal value.
What about just dividing by 10, and subtracting the offset?
data = ['60', '69.9', '70', '73', '80']
[int((float(a) // 10) - 5) for a in data] # [1, 1, 2, 2, 3]
or, if you are using NumPy
((numpy.array([float(a) for a in data]) // 10) - 5).astype(int) # [1 1 2 2 3]
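If the bin edges should be explicit rather than implied by the division, np.digitize is one alternative. A minimal sketch, assuming edges at 60, 70, 80, 90 and 100; values below the first edge get 0 and values at or above the last edge get 5, so the edges would need to be extended to cover the full data range.

```python
import numpy as np

# Sample values; a few are taken from the data above.
values = np.array([48.1, 60, 69.9, 70, 73, 80, 99.9])

# Left-closed bins: [60, 70) -> 1, [70, 80) -> 2, [80, 90) -> 3, [90, 100) -> 4.
edges = np.arange(60, 101, 10)

# np.digitize returns i such that edges[i-1] <= x < edges[i].
codes = np.digitize(values, edges)
print(codes)  # [0 1 1 2 2 3 4]
```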
I am learning data frames and trying out different graphs. I have a data set of video games and am trying to plot a graph which shows years on x axis, net sales on y axis and the graph has to be per video game genre. I have grouped the data but am facing issues displaying it. Below is what I have tried:
import pandas as pd
%matplotlib inline
from matplotlib.pyplot import hist
df = pd.read_csv('VideoGames.csv')
s = df.groupby(['Genre', 'Year_of_Release'])['Global_Sales'].sum()
print(s)
The data is grouped properly as shown below:
Genre Year_of_Release
Action 1980.0 0.34
1981.0 14.84
1982.0 6.52
1983.0 2.86
1984.0 1.85
1985.0 3.52
1986.0 13.74
1987.0 1.12
1988.0 1.75
1989.0 4.64
1990.0 6.39
1991.0 6.76
1992.0 3.83
1993.0 1.81
1994.0 1.55
1995.0 3.57
1996.0 20.58
1997.0 27.58
1998.0 39.44
1999.0 27.77
2000.0 34.04
2001.0 59.39
2002.0 86.76
2003.0 67.93
2004.0 76.25
2005.0 85.53
2006.0 66.13
2007.0 104.97
2008.0 135.01
2009.0 137.66
...
Sports 2013.0 41.23
2014.0 45.10
2015.0 40.90
2016.0 23.53
Strategy 1991.0 0.94
1992.0 0.37
1993.0 0.81
1994.0 3.56
1995.0 6.51
1996.0 5.61
1997.0 7.71
1998.0 13.46
1999.0 18.45
2000.0 8.52
2001.0 7.55
2002.0 5.56
2003.0 7.99
2004.0 7.16
2005.0 5.31
2006.0 4.22
2007.0 9.26
2008.0 11.55
2009.0 12.36
2010.0 13.77
2011.0 8.84
2012.0 3.27
2013.0 6.09
2014.0 0.99
2015.0 1.84
2016.0 1.15
Name: Global_Sales, dtype: float64
Please advise how I can plot the graphs for all the genres in one diagram. Thank you.
In pandas plot, the index will be plotted as x axis and every column is plotted separately, so you just need to transform the series to a data frame with Genre as columns:
ax = s.unstack('Genre').plot(kind = "line")
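To see what unstack does before plotting, here is a minimal sketch on a four-row Series built by hand with the same two-level index (the values are copied from the grouped output above):

```python
import pandas as pd

# Hand-built stand-in for the grouped Series s.
index = pd.MultiIndex.from_tuples(
    [("Action", 1980.0), ("Action", 1981.0),
     ("Strategy", 1991.0), ("Strategy", 1992.0)],
    names=["Genre", "Year_of_Release"],
)
s = pd.Series([0.34, 14.84, 0.94, 0.37], index=index, name="Global_Sales")

# Move the Genre level into the columns: one column per genre,
# years stay in the index, missing combinations become NaN.
wide = s.unstack("Genre")
print(wide)
```

Calling wide.plot(kind="line") on this frame then draws one line per genre, with the years on the x-axis.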
Before I start: sorry for my English, my limited Python knowledge (I'm a newbie) and a possible duplicate question. I searched a lot but couldn't find a solution to the problem I'm stuck on. Here is the problem:
I have an array named array1 that is loaded with numpy.loadtxt() from a text file with 2 columns of data, x and y. x ranges from 0.4 to 15; the increment is not a problem.
I also have a second array, array2, which contains x' values ranging from 10 to 12.
Note: The increment of x in each array is different. I will use them for linear interpolation of the y values later.
I want to crop the first array using the range of the second array's x' values (10 to 12).
I tried this;
new_array = array1[(array1>=np.amin(array2)) * (array1<= np.amax(array2))]
It crops the first array (array1). But I can only extract x values.
[ 10. 10.1 10.2 10.3 10.4 10.5 10.6 10.7 10.8 10.9 11. 11.1
11.2 11.3 11.4 11.5 11.6 11.7 11.8 11.9 12. 12.1 12.2 12.3
12.4 12.5 12.6 12.7 12.8 12.9]
I want to extract the values of x and y from array1 by a given range of x values from another array.
Edit
array1 = [[ 0.3 0.302 0.304 0.306 0.308 0.31 0.312 0.314 0.316
... 13.4 13.5 13.6 13.7 13.8 13.9 14. 14.1
14.2 14.3 14.4 14.5 14.6 14.7 14.8 14.9 15. ]
[ 8.82 9. 9.18 9.35 9.52 9.69 9.85 10.02
10.18 10.35 10.52 10.67 10.82 10.97 11.12 11.25
11.39 11.52 …................... 2.3044 1.7773 2.271 2.721 ]]
array2 = [[ 10. 10.02 10.03 10.04 10.05 10.06 10.07 10.08 10.09 10.1
10.12 10.13 10.14 10.15 10.16 10.17 10.18 10.19 10.2 10.21
10.22 10.23 10.24 10.25 10.26 10.27 10.28 10.29 10.3 10.31
10.33 10.34 10.35 10.36 10.37 10.38 10.39 10.4 10.41 10.42
10.43 10.44 10.45 10.46 10.47 10.48 10.49 10.5 10.51 10.52
10.53 10.54 10.59 10.64 10.7 10.75 10.8 10.85 10.9 10.95 11.
11.05 11.1 11.15 ...... 12.64 12.65 12.66 12.67 12.68 12.69
12.7 12.71 12.72 12.73 12.74 12.75 12.76 12.77 12.78 12.79
12.8 12.81 12.82 12.83 12.84 12.85 12.86 12.87 12.88 12.89
12.9 ][ 0.0058 0.0073 0.0081 0.0088 0.0096 0.0104 0.0112 0.012 0.0128
0.0136 0.0165 0.0018 0.0195 0.021 0.0226 0.0241 0.0256 0.0272
0.0288 0.0334 …. 0.1092 0.0879 0.0667 0.0458 0.0433 0.0409
0.0385 0.0361 0.0337 0.0314 0.0291 0.0268 0.0245 0.0223 0.0209
0.0195 0.0182 0.0168 0.0155 0.0141 0.0128 0.0115 0.0101 0.0088
0.0085 0.0081 0.0078 0.0074 0.0071 0.0068 0.0064 0.0061 0.0058
0.0054]]
Again, sorry for my English. I hope I succeeded in explaining myself.
Thanks a lot for your help :)
Assuming that the first index corresponds to x, this may work:
indices = (array1[0,...] >= np.min(array2[0,...])) & (array1[0,...] <= np.max(array2[0,...]))
xselected = array1[0,indices]
yselected = array1[1,indices]
Notes: do not use np.amin, but np.min instead. Do not combine the indices together with a *, but use the boolean and: &.
I've indexed the arrays with array[0,...], but I think you can just use array[0] there as well, since the 0 indexes the first dimension.
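A minimal self-contained sketch of the same masking idea, assuming both arrays have shape (2, N) with x in row 0 and y in row 1; the numbers are made up for illustration. Indexing with array1[:, mask] keeps x and y together in one array.

```python
import numpy as np

# Made-up arrays: row 0 holds x, row 1 holds y.
x1 = np.arange(16, dtype=float)          # 0.0, 1.0, ..., 15.0
array1 = np.vstack([x1, x1 ** 2])
array2 = np.array([[10.0, 11.0, 12.0],
                   [0.1, 0.2, 0.3]])

# Boolean mask over the x row of array1, bounded by the x range of array2.
mask = (array1[0] >= array2[0].min()) & (array1[0] <= array2[0].max())

# Apply the mask along the second axis to keep matching x and y columns.
cropped = array1[:, mask]
print(cropped)
```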