I am learning data frames and trying out different graphs. I have a data set of video games and am trying to plot a graph which shows years on x axis, net sales on y axis and the graph has to be per video game genre. I have grouped the data but am facing issues displaying it. Below is what I have tried:
import pandas as pd
%matplotlib inline
from matplotlib.pyplot import hist
df = pd.read_csv('VideoGames.csv')
s = df.groupby(['Genre', 'Year_of_Release'])['Global_Sales'].sum()
print(s)
The data is grouped properly as shown below:
Genre Year_of_Release
Action 1980.0 0.34
1981.0 14.84
1982.0 6.52
1983.0 2.86
1984.0 1.85
1985.0 3.52
1986.0 13.74
1987.0 1.12
1988.0 1.75
1989.0 4.64
1990.0 6.39
1991.0 6.76
1992.0 3.83
1993.0 1.81
1994.0 1.55
1995.0 3.57
1996.0 20.58
1997.0 27.58
1998.0 39.44
1999.0 27.77
2000.0 34.04
2001.0 59.39
2002.0 86.76
2003.0 67.93
2004.0 76.25
2005.0 85.53
2006.0 66.13
2007.0 104.97
2008.0 135.01
2009.0 137.66
...
Sports 2013.0 41.23
2014.0 45.10
2015.0 40.90
2016.0 23.53
Strategy 1991.0 0.94
1992.0 0.37
1993.0 0.81
1994.0 3.56
1995.0 6.51
1996.0 5.61
1997.0 7.71
1998.0 13.46
1999.0 18.45
2000.0 8.52
2001.0 7.55
2002.0 5.56
2003.0 7.99
2004.0 7.16
2005.0 5.31
2006.0 4.22
2007.0 9.26
2008.0 11.55
2009.0 12.36
2010.0 13.77
2011.0 8.84
2012.0 3.27
2013.0 6.09
2014.0 0.99
2015.0 1.84
2016.0 1.15
Name: Global_Sales, dtype: float64
Please advise how I can plot the graphs for all the genres in one diagram. Thank you.
In pandas plot, the index will be plotted as x axis and every column is plotted separately, so you just need to transform the series to a data frame with Genre as columns:
ax = s.unstack('Genre').plot(kind = "line")
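To make the reshaping step concrete, here is a minimal sketch. It assumes a toy frame whose rows reuse a few of the Sports and Strategy totals printed in the question (the real CSV has one row per game), and it selects a non-interactive matplotlib backend only so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import pandas as pd

# Toy stand-in for VideoGames.csv (assumed layout: one row per game).
df = pd.DataFrame({
    "Genre": ["Sports", "Sports", "Sports", "Strategy", "Strategy", "Strategy"],
    "Year_of_Release": [2014.0, 2015.0, 2016.0, 2014.0, 2015.0, 2016.0],
    "Global_Sales": [45.10, 40.90, 23.53, 0.99, 1.84, 1.15],
})

# Series indexed by (Genre, Year_of_Release), as in the question.
s = df.groupby(["Genre", "Year_of_Release"])["Global_Sales"].sum()

# Move the Genre level into the columns: one column per genre, years as index.
wide = s.unstack("Genre")

# Each column becomes its own line; the index (year) is the x axis.
ax = wide.plot(kind="line")
ax.set_xlabel("Year of release")
ax.set_ylabel("Global sales")
```

Genres with no sales in a given year show up as NaN in the wide frame, which the line plot simply leaves as gaps.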
I would like to merge or combine two dataframes of different sizes, df1 and df2, based on a range of dates, for example:
df1:
Date Open High Low
2021-07-01 8.43 8.44 8.22
2021-07-02 8.36 8.4 8.28
2021-07-06 8.22 8.23 8.06
2021-07-07 8.1 8.19 7.98
2021-07-08 8.07 8.1 7.91
2021-07-09 7.97 8.11 7.92
2021-07-12 8 8.2 8
2021-07-13 8.15 8.18 8.06
2021-07-14 8.18 8.27 8.12
2021-07-15 8.21 8.26 8.06
2021-07-16 8.12 8.23 8.07
df2:
Day of month Revenue Earnings
01 45000 4000
07 43500 5000
12 44350 6000
15 39050 7000
results should be something like this:
combination:
Date Open High Low Earnings
2021-07-01 8.43 8.44 8.22 4000
2021-07-02 8.36 8.4 8.28 4000
2021-07-06 8.22 8.23 8.06 4000
2021-07-07 8.1 8.19 7.98 5000
2021-07-08 8.07 8.1 7.91 5000
2021-07-09 7.97 8.11 7.92 5000
2021-07-12 8 8.2 8 6000
2021-07-13 8.15 8.18 8.06 6000
2021-07-14 8.18 8.27 8.12 6000
2021-07-15 8.21 8.26 8.06 7000
2021-07-16 8.12 8.23 8.07 7000
The Earnings column is merged based on a range of dates. How can I do this in Python pandas?
Try merge_asof
# df1['Date'] = pd.to_datetime(df1['Date'])  # needed if Date is still a string column
df1['Day of month'] = df1.Date.dt.day
out = pd.merge_asof(df1, df2, on ='Day of month', direction = 'backward')
out
Out[213]:
Date Open High Low Day of month Revenue Earnings
0 2021-07-01 8.43 8.44 8.22 1 45000 4000
1 2021-07-02 8.36 8.40 8.28 2 45000 4000
2 2021-07-06 8.22 8.23 8.06 6 45000 4000
3 2021-07-07 8.10 8.19 7.98 7 43500 5000
4 2021-07-08 8.07 8.10 7.91 8 43500 5000
5 2021-07-09 7.97 8.11 7.92 9 43500 5000
6 2021-07-12 8.00 8.20 8.00 12 44350 6000
7 2021-07-13 8.15 8.18 8.06 13 44350 6000
8 2021-07-14 8.18 8.27 8.12 14 44350 6000
9 2021-07-15 8.21 8.26 8.06 15 39050 7000
10 2021-07-16 8.12 8.23 8.07 16 39050 7000
A more general approach is the following:
First, introduce a key that both dataframes share.
In this case, that is the day of the month (or, potentially, multiple keys such as the month and the day of the month): df1["day"] = df1["Date"].dt.day
If you left-joined df2 onto df1 now, most rows would find no match, because df2 does not contain every day. To fill the gaps we could interpolate, or use the naïve approach: if we don't know the Revenue / Earnings for a specific day, we take the last known one and apply no further calculation. One way to achieve this is described in "How to replace NaNs by preceding or next values in pandas DataFrame?": expand df2 to one row per day, then call df2.ffill() (df2.fillna(method='ffill') in older pandas versions).
Now we merge on our key. Following the doc https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html , we do it like this: df1.merge(df2, how='left', left_on='day', right_on='Day of month')
Voilà!
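The steps above can be sketched as follows. This is only one possible reading of the approach: it assumes that "Day of month" holds integers, that every month can be treated as having days 1 to 31, and the helper name `daily` is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-07-01", "2021-07-02", "2021-07-06", "2021-07-07",
        "2021-07-08", "2021-07-09", "2021-07-12", "2021-07-13",
        "2021-07-14", "2021-07-15", "2021-07-16",
    ]),
    "Open": [8.43, 8.36, 8.22, 8.10, 8.07, 7.97, 8.00, 8.15, 8.18, 8.21, 8.12],
})
df2 = pd.DataFrame({
    "Day of month": [1, 7, 12, 15],   # assumed to be integers, not "01", "07", ...
    "Earnings": [4000, 5000, 6000, 7000],
})

# Step 1: shared key.
df1["day"] = df1["Date"].dt.day

# Step 2: expand df2 to one row per day and carry the last known value forward.
daily = (
    df2.set_index("Day of month")
       .reindex(range(1, 32))   # days 1..31; missing days become NaN
       .ffill()                 # naive gap filling: last known value
       .rename_axis("day")
       .reset_index()
)

# Step 3: left join on the shared key, keeping df1's row order.
out = df1.merge(daily, how="left", on="day")
print(out)
```

Because reindexing introduces NaN before the forward fill, Earnings comes back as a float column; cast it with `astype(int)` if that matters.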
I am a newbie in data analysis. I wish to know how to boxplot multiple columns (x-axis = Points, Score, Weigh) in a single graph and make the y-axis as a standardized scale for comparison. I have tried and couldn't understand the code (Python+Pandas+Seaborn) for this. Help me out guys. The dataset for the same is as follows:
    Cars                 Points  Score   Weigh
0   Mazda RX4            3.90    2.620   16.46
1   Mazda RX4 Wag        3.90    2.875   17.02
2   Datsun 710           3.85    2.320   18.61
3   Hornet 4 Drive       3.08    3.215   19.44
4   Hornet Sportabout    3.15    3.440   17.02
5   Valiant              2.76    3.460   20.22
6   Duster 360           3.21    3.570   15.84
7   Merc 240D            3.69    3.190   20.00
8   Merc 230             3.92    3.150   22.90
9   Merc 280             3.92    3.440   18.30
10  Merc 280C            3.92    3.440   18.90
11  Merc 450SE           3.07    4.070   17.40
12  Merc 450SL           3.07    3.730   17.60
13  Merc 450SLC          3.07    3.780   18.00
14  Cadillac Fleetwood   2.93    5.250   17.98
15  Lincoln Continental  3.00    5.424   17.82
16  Chrysler Imperial    3.23    5.345   17.42
17  Fiat 128             4.08    2.200   19.47
18  Honda Civic          4.93    1.615   18.52
19  Toyota Corolla       4.22    1.835   19.90
20  Toyota Corona        3.70    2.465   20.01
21  Dodge Challenger     2.76    3.520   16.87
22  AMC Javelin          3.15    3.435   17.30
23  Camaro Z28           3.73    3.840   15.41
24  Pontiac Firebird     3.08    3.845   17.05
25  Fiat X1-9            4.08    1.935   18.90
26  Porsche 914-2        4.43    2.140   16.70
27  Lotus Europa         3.77    1.513   16.90
28  Ford Pantera L       4.22    3.170   14.50
29  Ferrari Dino         3.62    2.770   15.50
30  Maserati Bora        3.54    3.570   14.60
31  Volvo 142E           4.11    2.780   18.60
My output should look something like:
Output Boxplot Graph
With matplotlib:
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv("test_data.txt")
plt.rcParams['figure.figsize'] = (8,4)
data.boxplot(column=['Points', 'Score', 'Weigh'], grid=True, color='blue', fontsize=10, rot=30)
And with seaborn:
import pandas as pd
import seaborn as sns
data = pd.read_csv("test_data.txt")
ax = sns.boxplot(data=data[['Points', 'Score', 'Weigh']], palette="Set2")
boxplot = df.boxplot(column=['Points', 'Score', 'Weigh'])
might work here, where df is the DataFrame holding the data above.
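None of the snippets above puts the three columns on a common scale. One possible reading of "standardized scale" is a z-score per column; the sketch below assumes that interpretation, uses only the first five rows of the table, and selects a non-interactive backend so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import pandas as pd

# First five rows of the dataset from the question.
data = pd.DataFrame({
    "Cars": ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
             "Hornet 4 Drive", "Hornet Sportabout"],
    "Points": [3.90, 3.90, 3.85, 3.08, 3.15],
    "Score": [2.620, 2.875, 2.320, 3.215, 3.440],
    "Weigh": [16.46, 17.02, 18.61, 19.44, 17.02],
})

numeric = data[["Points", "Score", "Weigh"]]

# z-score: subtract each column's mean and divide by its standard deviation,
# so every column ends up with mean 0 and standard deviation 1.
zscores = (numeric - numeric.mean()) / numeric.std()

# One box per column, all on the same unitless y axis.
ax = zscores.boxplot(grid=True, fontsize=10, rot=30)
ax.set_ylabel("z-score")
```

If min-max scaling to [0, 1] is what is meant instead, replace the z-score line with `(numeric - numeric.min()) / (numeric.max() - numeric.min())`.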
I have a dataframe, which has different rates for multiple 'N' currencies over a time period.
dataframe
Dates AUD CAD CHF GBP EUR
20/05/2019 0.11 -0.25 -0.98 0.63 0.96
21/05/2019 0.14 -0.35 -0.92 1.92 0.92
...
02/01/2020 0.135 -0.99 -1.4 0.93 0.83
Firstly, I would like to reshape the dataframe table to look like the below as I would like to join another table which would be in a similar format:
dataframe
Dates Pairs Rates
20/05/2019 AUD 0.11
20/05/2019 CAD -0.25
20/05/2019 CHF -0.98
...
...
02/01/2020 AUD 0.135
02/01/2020 CAD -0.99
02/01/2020 CHF -1.4
Then, for each of the N currencies, I would like to plot a histogram. With the data above, that would be 5 separate histograms, one per currency.
I assume I would need some sort of loop, but I am not sure of the easiest way to approach it.
Thanks
Use DataFrame.melt first:
df['Dates'] = pd.to_datetime(df['Dates'], dayfirst=True)
df = df.melt('Dates', var_name='Pairs', value_name='Rates')
print (df)
Dates Pairs Rates
0 2019-05-20 AUD 0.110
1 2019-05-21 AUD 0.140
2 2020-01-02 AUD 0.135
3 2019-05-20 CAD -0.250
4 2019-05-21 CAD -0.350
5 2020-01-02 CAD -0.990
6 2019-05-20 CHF -0.980
7 2019-05-21 CHF -0.920
8 2020-01-02 CHF -1.400
9 2019-05-20 GBP 0.630
10 2019-05-21 GBP 1.920
11 2020-01-02 GBP 0.930
12 2019-05-20 EUR 0.960
13 2019-05-21 EUR 0.920
14 2020-01-02 EUR 0.830
And then DataFrameGroupBy.hist:
df.groupby('Pairs').hist()
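For reference, here is a self-contained sketch of the two steps using the three dates shown in the question. It uses `DataFrame.hist` with `by=` as an alternative to the groupby call, which draws all currencies as panels of a single figure; the backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in a notebook
import pandas as pd

df = pd.DataFrame({
    "Dates": ["20/05/2019", "21/05/2019", "02/01/2020"],
    "AUD": [0.11, 0.14, 0.135],
    "CAD": [-0.25, -0.35, -0.99],
    "CHF": [-0.98, -0.92, -1.4],
    "GBP": [0.63, 1.92, 0.93],
    "EUR": [0.96, 0.92, 0.83],
})

# Dates are day-first strings, so parse them accordingly.
df["Dates"] = pd.to_datetime(df["Dates"], dayfirst=True)

# Wide -> long: one row per (date, currency) pair.
long_df = df.melt("Dates", var_name="Pairs", value_name="Rates")

# One histogram panel per currency, no explicit loop needed.
axes = long_df.hist(column="Rates", by="Pairs", figsize=(8, 6))
```

The long frame is also the shape needed for joining against another table keyed on Dates and Pairs.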
I need to find cases where "price of y" was less than 3.5 until time 30:00
and where, after that, "price of x" jumped above 3.5.
I made a "Demical time" column to make this easier (before 30:00 means less than 1800 seconds in that column).
I managed to find all the cases where the price of y was under 3.5 (and above 0), but I failed to write code that gives the cases where the price of y was under 3.5 AND the price of x was greater than 3.5 after 30:00.
df1 = df[(df['price_of_Y']<3.5)&(df['price_of_Y']>0)& (df['Demical time']<1800)]
#the cases for price of y under 3.5 before time is 30:00 (Demical time =1800)
df2 = df[(df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)]
#the cases for price of x above 3.5 after time is 30:00 (Demical time =1800)
# the question is how do i combine them to one line?
price_of_X time price_of_Y Demical time
0 3.30 0 4.28 0
1 3.30 0:00 4.28 0
2 3.30 0:00 4.28 0
3 3.30 0:00 4.28 0
4 3.30 0:00 4.28 0
5 3.30 0:00 4.28 0
6 3.30 0:00 4.28 0
7 3.30 0:00 4.28 0
8 3.30 0:00 4.28 0
9 3.30 0:00 4.28 0
10 3.30 0:00 4.28 0
11 3.25 0:26 4.28 26
12 3.40 1:43 4.28 103
13 3.25 3:00 4.28 180
14 3.25 4:16 4.28 256
15 3.40 5:34 4.28 334
16 3.40 6:52 4.28 412
17 3.40 8:09 4.28 489
18 3.40 9:31 4.28 571
19 5.00 10:58 8.57 658
20 5.00 12:13 8.57 733
21 5.00 13:31 7.38 811
22 5.00 14:47 7.82 887
23 5.00 16:01 7.82 961
24 5.00 17:18 7.38 1038
25 5.00 18:33 7.38 1113
26 5.00 19:50 7.38 1190
27 5.00 21:09 7.38 1269
28 5.00 22:22 7.38 1342
29 5.00 23:37 8.13 1417
... ... ... ... ...
18138 7.50 59:03:00 28.61 3543
18139 7.50 60:19:00 28.61 3619
18140 7.50 61:35:00 34.46 3695
18141 8.00 62:48:00 30.16 3768
18142 7.50 64:03:00 34.46 3843
18143 8.00 65:20:00 30.16 3920
18144 7.50 66:34:00 28.61 3994
18145 7.50 67:53:00 30.16 4073
18146 8.00 69:08:00 26.19 4148
18147 7.00 70:23:00 23.10 4223
18148 7.00 71:38:00 23.10 4298
18149 8.00 72:50:00 30.16 4370
18150 7.50 74:09:00 26.19 4449
18151 7.50 75:23:00 25.58 4523
18152 7.00 76:40:00 19.07 4600
18153 7.00 77:53:00 19.07 4673
18154 9.00 79:11:00 31.44 4751
18155 9.00 80:27:00 27.11 4827
18156 10.00 81:41:00 34.52 4901
18157 10.00 82:56:00 34.52 4976
18158 11.00 84:16:00 43.05 5056
18159 10.00 85:35:00 29.42 5135
18160 10.00 86:49:00 29.42 5209
18161 11.00 88:04:00 35.70 5284
18162 13.00 89:19:00 70.38 5359
18163 15.00 90:35:00 70.42 5435
18164 19.00 91:48:00 137.70 5508
18165 23.00 93:01:00 511.06 5581
18166 NaN NaN NaN 0
18167 NaN NaN NaN 0
[18168 rows x 4 columns]
dataframe:
This should solve it.
I have used slightly different data and condition values, but you should get the idea of what I am doing.
import pandas as pd
df = pd.DataFrame({'price_of_X': [3.30,3.25,3.40,3.25,3.25,3.40],
'price_of_Y': [2.28,1.28,4.28,4.28,1.18,3.28],
'Decimal_time': [0,26,103,180,256,334]
})
print(df)
df1 = df.loc[(df['price_of_Y']<3.5)&(df['price_of_X']>3.3)&(df['Decimal_time']>103),:]
print(df1)
output:
df
price_of_X price_of_Y Decimal_time
0 3.30 2.28 0
1 3.25 1.28 26
2 3.40 4.28 103
3 3.25 4.28 180
4 3.25 1.18 256
5 3.40 3.28 334
df1
price_of_X price_of_Y Decimal_time
5 3.4 3.28 334
Similar to what @IMCoins suggested in a comment, use two boolean masks and combine them with |, so that rows matching either condition are selected.
mask1 = (df['price_of_Y'] < 3.5) & (df['price_of_Y'] > 0) & (df['Demical time'] < 1800)
mask2 = (df['price_of_X'] > 3.5) & (df['Demical time'] > 1800)
df[mask1 | mask2]
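A small runnable sketch of the two-mask selection follows. The five rows are made up for illustration (they are not taken from the question's data); the column name "Demical time" is kept as spelled in the question.

```python
import pandas as pd

# Illustrative rows only; values are invented to exercise both masks.
df = pd.DataFrame({
    "price_of_X": [3.30, 3.40, 5.00, 3.00, 7.50],
    "price_of_Y": [4.28, 2.10, 1.50, 2.00, 28.61],
    "Demical time": [0, 600, 1200, 2400, 3543],
})

# Before 30:00 (1800 s): price of Y strictly between 0 and 3.5.
mask1 = (df["price_of_Y"] < 3.5) & (df["price_of_Y"] > 0) & (df["Demical time"] < 1800)

# After 30:00: price of X above 3.5.
mask2 = (df["price_of_X"] > 3.5) & (df["Demical time"] > 1800)

# | keeps a row if either mask is True; each comparison needs its own parentheses.
selected = df[mask1 | mask2]
print(selected)
```

Using & between the two masks instead would return nothing here, because no single row can be both before and after 1800 seconds.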
I have this dataframe; please note the last column ("Yr_Mo_Date") on the right
In[38]: data.head()
Out[38]:
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL Yr_Mo_Dy
0 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 61-1-1
1 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 61-1-2
2 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 61-1-3
3 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 61-1-4
4 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 61-1-5
The type of the "Yr_Mo_Dy" column is object while the others are float64.
I simply want to change the order of the columns so that "Yr_Mo_Dy" is the first column in the dataframe.
I tried the following but I get TypeError. What's wrong?
In[39]: cols = data.columns.tolist()
In[40]: cols
Out[40]:
['RPT',
'VAL',
'ROS',
'KIL',
'SHA',
'BIR',
'DUB',
'CLA',
'MUL',
'CLO',
'BEL',
'MAL',
'Yr_Mo_Dy']
In[41]: cols = cols[-1] + cols[:-1]
TypeError Traceback (most recent call last)
<ipython-input-59-c0130d1863e8> in <module>()
----> 1 cols = cols[-1] + cols[:-1]
TypeError: must be str, not list
You need to add : so that you get a one-element list, because you need to concatenate 2 lists:
#string
print (cols[-1])
Yr_Mo_Dy
#one element list
print (cols[-1:])
['Yr_Mo_Dy']
cols = cols[-1:] + cols[:-1]
It is also possible to add [], but that is less readable:
cols = [cols[-1]] + cols[:-1]
print (cols)
['Yr_Mo_Dy', 'RPT', 'VAL', 'ROS', 'KIL', 'SHA', 'BIR',
'DUB', 'CLA', 'MUL', 'CLO', 'BEL', 'MAL']
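Put together, and applied back to the dataframe, the fix might look like this. The sketch assumes a cut-down version of the question's frame with only three of the station columns.

```python
import pandas as pd

# Cut-down stand-in for the question's dataframe (first two rows, three stations).
data = pd.DataFrame({
    "RPT": [15.04, 14.71],
    "VAL": [14.96, None],
    "MAL": [15.04, 13.83],
    "Yr_Mo_Dy": ["61-1-1", "61-1-2"],
})

cols = data.columns.tolist()

# cols[-1] is the string 'Yr_Mo_Dy'; cols[-1:] is the list ['Yr_Mo_Dy'].
# list + list works, str + list raises the TypeError from the question.
cols = cols[-1:] + cols[:-1]

# Selecting with the reordered list returns the columns in that order.
data = data[cols]
print(data.columns.tolist())
```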
Option 1
Use pd.DataFrame.insert and pd.DataFrame.pop to alter the dataframe in place. This is a very generalizable solution as you can swap in any column position for popping or inserting.
c = df.columns[-1]
df.insert(0, c, df.pop(c))
df
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 61-1-1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 61-1-2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 61-1-3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 61-1-4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 61-1-5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
Option 2
pd.DataFrame.reindex and np.roll (reindex_axis was removed in pandas 1.0)
df.reindex(columns=np.roll(df.columns, 1))
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 61-1-1 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 61-1-2 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 61-1-3 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 61-1-4 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 61-1-5 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
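Both options can be tried on a small frame. This sketch assumes a two-station subset of the data and uses `reindex(columns=...)`, which is available in current pandas releases.

```python
import numpy as np
import pandas as pd

# Two-station subset of the question's data, date column last.
df = pd.DataFrame({
    "RPT": [15.04, 14.71],
    "VAL": [14.96, np.nan],
    "Yr_Mo_Dy": ["61-1-1", "61-1-2"],
})

# Option 2: rotate the column labels by one position and reindex.
# This returns a new frame and leaves df untouched.
rolled = df.reindex(columns=np.roll(df.columns, 1))

# Option 1: pop the last column and insert it at position 0.
# This modifies df in place.
c = df.columns[-1]
df.insert(0, c, df.pop(c))

print(rolled.columns.tolist())
print(df.columns.tolist())
```

The practical difference is that Option 1 mutates the existing frame, while Option 2 produces a reordered copy.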