I have a dataframe, total_year, which contains three columns (year, action, comedy).
How can I plot two columns (action and comedy) on the y-axis?
My code plots only one:
total_year[-15:].plot(x='year', y='action', figsize=(10,5), grid=True)
Several column names may be provided to the y argument of the pandas plotting function. Those should be specified in a list, as follows.
df.plot(x="year", y=["action", "comedy"])
Complete example:
import matplotlib.pyplot as plt
import pandas as pd
df = pd.DataFrame({"year": [1914,1915,1916,1919,1920],
"action" : [2.6,3.4,3.25,2.8,1.75],
"comedy" : [2.5,2.9,3.0,3.3,3.4] })
df.plot(x="year", y=["action", "comedy"])
plt.show()
By default, pandas.DataFrame.plot() uses the index for the x-axis and plots all other numeric columns as y values.
So setting the year column as the index will do the trick:
total_year.set_index('year').plot(figsize=(10,5), grid=True)
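As a quick check of the index-based approach, the following sketch rebuilds a small frame with the column names from the question and lists which columns end up as lines. The values are illustrative, and the `Agg` backend is only an assumption so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import pandas as pd

# illustrative values; only the column names follow the question
total_year = pd.DataFrame({
    "year": [1914, 1915, 1916, 1919, 1920],
    "action": [2.6, 3.4, 3.25, 2.8, 1.75],
    "comedy": [2.5, 2.9, 3.0, 3.3, 3.4],
})

# with 'year' as the index, every remaining numeric column becomes one line
ax = total_year.set_index("year").plot(figsize=(10, 5), grid=True)

# collect the legend labels of the drawn lines to see what was plotted
plotted = [line.get_label() for line in ax.get_lines()]
print(plotted)
```

Inspecting `ax.get_lines()` is just one convenient way to confirm what was drawn; looking at the legend of the saved figure works equally well.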
When using pandas.DataFrame.plot, it's only necessary to specify a column for the x parameter.
The caveat is that all of the remaining columns with numeric values will be used for y.
The following code contains extra columns to demonstrate. Note, 'date' is left as a string. However, if 'date' is converted to a datetime dtype, the plot API will also plot the 'date' column on the y-axis.
If the dataframe includes many columns, some of which should not be plotted, then specify the y parameter as shown in this answer, but if the dataframe contains only columns to be plotted, then specify only the x parameter.
In cases where the index is to be used as the x-axis, then it is not necessary to specify x=.
import pandas as pd
# test data
data = {'year': [1914, 1915, 1916, 1919, 1920],
'action': [2.67, 3.43, 3.26, 2.82, 1.75],
'comedy': [2.53, 2.93, 3.02, 3.37, 3.45],
'test1': ['a', 'b', 'c', 'd', 'e'],
'date': ['1914-01-01', '1915-01-01', '1916-01-01', '1919-01-01', '1920-01-01']}
# create the dataframe
df = pd.DataFrame(data)
# display(df)
year action comedy test1 date
0 1914 2.67 2.53 a 1914-01-01
1 1915 3.43 2.93 b 1915-01-01
2 1916 3.26 3.02 c 1916-01-01
3 1919 2.82 3.37 d 1919-01-01
4 1920 1.75 3.45 e 1920-01-01
# plot the dataframe
df.plot(x='year', figsize=(10, 5), grid=True)
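To make the caveat concrete, here is a minimal sketch that adds a numeric column which should not be plotted and compares the two calls. The `votes` column and its values are hypothetical, and the `Agg` backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import pandas as pd

df = pd.DataFrame({
    "year": [1914, 1915, 1916, 1919, 1920],
    "action": [2.67, 3.43, 3.26, 2.82, 1.75],
    "comedy": [2.53, 2.93, 3.02, 3.37, 3.45],
    "votes": [120, 95, 130, 88, 101],       # hypothetical extra numeric column
    "test1": ["a", "b", "c", "d", "e"],     # non-numeric, skipped by the line plot
})

# only x given: every other numeric column is drawn, including 'votes'
ax_all = df.plot(x="year")
all_labels = [line.get_label() for line in ax_all.get_lines()]

# y given as a list: only the listed columns are drawn
ax_sel = df.plot(x="year", y=["action", "comedy"])
sel_labels = [line.get_label() for line in ax_sel.get_lines()]

print(all_labels)
print(sel_labels)
```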
I have written a program like so:
# Author: Evan Gertis
# Date : 11/09
# program: Linear Regression
# Resource: https://seaborn.pydata.org/generated/seaborn.scatterplot.html
import seaborn as sns
import pandas as pd
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
# Step 1: load the data
grades = pd.read_csv("grades.csv")
logging.info(grades.head())
# Step 2: plot the data
plot = sns.scatterplot(data=grades, x="Hours", y="GPA")
fig = plot.get_figure()
fig.savefig("out.png")
Using the data set
Hours,GPA,Hours,GPA,Hours,GPA
11,2.84,9,2.85,25,1.85
5,3.20,5,3.35,6,3.14
22,2.18,14,2.60,9,2.96
23,2.12,18,2.35,20,2.30
20,2.55,6,3.14,14,2.66
20,2.24,9,3.05,19,2.36
10,2.90,24,2.06,21,2.24
19,2.36,25,2.00,7,3.08
15,2.60,12,2.78,11,2.84
18,2.42,6,2.90,20,2.45
I would like to plot all of the relationships, but at the moment I get just one plot.
Expected: all relationships plotted.
Actual: only one plot.
I wrote a basic program and I was expecting all of the relationships to be plotted.
The origin of the problem is that the column names in your file are repeated, so when pandas reads the file it appends a numeric suffix to the duplicated names in the loaded dataframe:
import seaborn as sns
import pandas as pd
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
grades = pd.read_csv("grades.csv")
print(grades.columns)
>>> Index(['Hours', 'GPA', 'Hours.1', 'GPA.1', 'Hours.2', 'GPA.2'], dtype='object')
Therefore, when you draw the scatter plots, you need to use the column names that pandas assigned:
# in case you want all scatter plots in the same figure
plot = sns.scatterplot(data=grades, x="Hours", y="GPA", label='GPA')
sns.scatterplot(data=grades, x='Hours.1', y='GPA.1', ax=plot, label="GPA.1")
sns.scatterplot(data=grades, x='Hours.2', y='GPA.2', ax=plot, label='GPA.2')
fig = plot.get_figure()
fig.savefig("out.png")
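The renaming can be reproduced without the file on disk. This is a small sketch that reads the first two data rows of the question's CSV from an in-memory string; `io.StringIO` stands in for `grades.csv`.

```python
import io
import pandas as pd

# first rows of the data set from the question, with its repeated header names
csv_text = """Hours,GPA,Hours,GPA,Hours,GPA
11,2.84,9,2.85,25,1.85
5,3.20,5,3.35,6,3.14
"""

grades = pd.read_csv(io.StringIO(csv_text))

# pandas keeps the first occurrence and appends .1, .2, ... to the repeats
column_names = list(grades.columns)
print(column_names)
```

Printing the columns right after loading is a cheap way to spot this kind of silent renaming before plotting.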
There are better options than manually creating a plot for each group of columns
Because the columns in the file have redundant names, pandas automatically renames them.
Imports and DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# read the data from the file
df = pd.read_csv('d:/data/gpa.csv')
# display(df)
Hours GPA Hours.1 GPA.1 Hours.2 GPA.2
0 11 2.84 9 2.85 25 1.85
1 5 3.20 5 3.35 6 3.14
2 22 2.18 14 2.60 9 2.96
3 23 2.12 18 2.35 20 2.30
4 20 2.55 6 3.14 14 2.66
5 20 2.24 9 3.05 19 2.36
6 10 2.90 24 2.06 21 2.24
7 19 2.36 25 2.00 7 3.08
8 15 2.60 12 2.78 11 2.84
9 18 2.42 6 2.90 20 2.45
Option 1: Chunk the column names
This option can be used to plot the data in a loop without manually creating each plot.
The chunker function from the answer to "How to iterate over a list in chunks" creates a list of column name groups:
[Index(['Hours', 'GPA'], dtype='object'), Index(['Hours.1', 'GPA.1'], dtype='object'), Index(['Hours.2', 'GPA.2'], dtype='object')]
# create groups of column names to be plotted together
def chunker(seq, size):
    return [seq[pos:pos + size] for pos in range(0, len(seq), size)]
# function call
col_list = chunker(df.columns, 2)
# iterate through each group of column names to plot
for x, y in chunker(df.columns, 2):
    sns.scatterplot(data=df, x=x, y=y, label=y)
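To see what the chunking step produces on its own, the sketch below runs the same helper on a plain list of the renamed column names. The second call, with an odd number of names, is an added illustration and not part of the original answer.

```python
def chunker(seq, size):
    """Split seq into consecutive slices of at most `size` items."""
    return [seq[pos:pos + size] for pos in range(0, len(seq), size)]

columns = ["Hours", "GPA", "Hours.1", "GPA.1", "Hours.2", "GPA.2"]

# an even number of names gives clean (x, y) pairs
pairs = chunker(columns, 2)
print(pairs)

# an odd number of names leaves a short final chunk
odd = chunker(columns[:5], 2)
print(odd[-1])
```

A short final chunk would make the `for x, y in ...` unpacking fail, so this approach assumes the columns really do come in complete pairs.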
Option 2: Fix the data
# filter each group of columns, melt the result into a long form, and get the value
h = df.filter(like='Hours').melt().value
g = df.filter(like='GPA').melt().value
# get the gpa column names
gpa_cols = df.columns[1::2]
# use numpy to create a list of labels with the appropriate length
labels = np.repeat(gpa_cols, len(df))
# otherwise use a list comprehension to create the labels
# labels = [v for x in gpa_cols for v in [x]*len(df)]
# create a new dataframe
dfl = pd.DataFrame({'hours': h, 'gpa': g, 'label': labels})
# save dfl if desired
dfl.to_csv('gpa_long.csv', index=False)
# plot
sns.scatterplot(data=dfl, x='hours', y='gpa', hue='label')
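An alternative way to reach the same long form, shown here only as a sketch, is to rename each pair of columns to common names and concatenate the pieces with `pd.concat`. This swaps in a different technique from the `filter` and `melt` steps above; the in-memory CSV uses the first two rows of the question's data.

```python
import io
import pandas as pd

csv_text = """Hours,GPA,Hours,GPA,Hours,GPA
11,2.84,9,2.85,25,1.85
5,3.20,5,3.35,6,3.14
"""
wide = pd.read_csv(io.StringIO(csv_text))

# group the column names two at a time: (Hours, GPA), (Hours.1, GPA.1), ...
pairs = [wide.columns[i:i + 2] for i in range(0, len(wide.columns), 2)]

parts = []
for hours_col, gpa_col in pairs:
    part = wide[[hours_col, gpa_col]].copy()
    part.columns = ["hours", "gpa"]   # common names for every group
    part["label"] = gpa_col           # remember which group the rows came from
    parts.append(part)

# stack the groups on top of each other
long_df = pd.concat(parts, ignore_index=True)
print(long_df)
```

The resulting frame has the same three columns as `dfl`, so it can be passed to `sns.scatterplot` with `hue='label'` in the same way.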
Plot Result
I have two dataframes that hold the historical price series of two different stocks. Applying describe(), I noticed that the first stock has 1291 elements while the second has 1275. This difference is due to the fact that the two securities are listed on different stock exchanges and therefore differ on some dates. What I would like to do is keep the two dataframes separate, but delete from the first dataframe all rows whose dates are not present in the second dataframe, so that the two dataframes match exactly for the analysis. I have read that there are functions such as merge() or join(), but I have not been able to understand how to use them (if these are the correct functions). I thank those who will use some of their time to answer my question.
"ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1275 and the array at index 1 has size 1291"
Thank you
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as web
from scipy import stats
import seaborn as sns
pd.options.display.min_rows= None
pd.options.display.max_rows= None
tickers = ['DISW.MI','IXJ','NRJ.PA','SGOL','VDC','VGT']
wts= [0.19,0.18,0.2,0.08,0.09,0.26]
price_data = web.get_data_yahoo(tickers,
start = '2016-01-01',
end = '2021-01-01')
price_data = price_data['Adj Close']
ret_data = price_data.pct_change()[1:]
port_ret = (ret_data * wts).sum(axis = 1)
benchmark_price = web.get_data_yahoo('ACWE.PA',
start = '2016-01-01',
end = '2021-01-01')
benchmark_ret = benchmark_price["Adj Close"].pct_change()[1:].dropna()
#From now i get error
sns.regplot(benchmark_ret.values,
port_ret.values)
plt.xlabel("Benchmark Returns")
plt.ylabel("Portfolio Returns")
plt.title("Portfolio Returns vs Benchmark Returns")
plt.show()
(beta, alpha) = stats.linregress(benchmark_ret.values,
port_ret.values)[0:2]
print("The portfolio beta is", round(beta, 4))
Let's consider a toy example.
df1 consists of 6 days of data and df2 consists of 5 days of data.
From what I understand, you want df1 to also have 5 days of data, with dates matching those in df2.
df1
df1 = pd.DataFrame({
'date':pd.date_range('2021-05-17', periods=6),
'px':np.random.rand(6)
})
df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
5 2021-05-22 0.127086
df2
df2 = pd.DataFrame({
'date':pd.date_range('2021-05-17', periods=5),
'px':np.random.rand(5)
})
df2
date px
0 2021-05-17 0.650976
1 2021-05-18 0.393061
2 2021-05-19 0.985700
3 2021-05-20 0.879786
4 2021-05-21 0.463206
Code
Keep only the rows of df1 whose dates also appear in df2:
df1 = df1[df1.date.isin(df2.date)]
Output df1
date px
0 2021-05-17 0.054907
1 2021-05-18 0.192294
2 2021-05-19 0.214051
3 2021-05-20 0.623223
4 2021-05-21 0.004627
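In the question the returns are Series indexed by date rather than frames with a date column, so the same idea can be applied to the index. The following is a sketch under that assumption, with made-up return values; `ret_a` and `ret_b` are hypothetical stand-ins for `port_ret` and `benchmark_ret`.

```python
import pandas as pd

# made-up returns; ret_a has one extra trading day
ret_a = pd.Series([0.01, -0.02, 0.03, 0.00, 0.01, 0.02],
                  index=pd.date_range("2021-05-17", periods=6))
ret_b = pd.Series([0.02, 0.01, -0.01, 0.03, 0.00],
                  index=pd.date_range("2021-05-17", periods=5))

# dates present in both series
common = ret_a.index.intersection(ret_b.index)

# restrict both series to the shared dates so their lengths match
ret_a_aligned = ret_a.loc[common]
ret_b_aligned = ret_b.loc[common]

print(len(ret_a_aligned), len(ret_b_aligned))
```

Once both series cover the same dates, their `.values` arrays have equal length and can be passed to functions that expect matching inputs.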
I want to merge two plots. This is my dataframe:
df_inc.head()
id date real_exe_time mean mean+30% mean-30%
0 Jan 31 33.14 43.0 23.0
1 Jan 30 33.14 43.0 23.0
2 Jan 33 33.14 43.0 23.0
3 Jan 38 33.14 43.0 23.0
4 Jan 36 33.14 43.0 23.0
My first plot:
df_inc.plot.scatter(x = 'date', y = 'real_exe_time')
Then
My second plot:
df_inc.plot(x='date', y=['mean','mean+30%','mean-30%'])
When I try to merge with:
fig=plt.figure()
ax = df_inc.plot(x='date', y=['mean','mean+30%','mean-30%']);
df_inc.plot.scatter(x = 'date', y = 'real_exe_time', ax=ax)
plt.show()
I got the following:
How can I merge them the right way?
You should not repeat your mean values as an extra column. df.plot() for categorical data will be plotted against the index - hence you will see the original scatter plot (also plotted against the index) squeezed into the left corner.
Instead, you could create an additional aggregation dataframe and then plot it into the same graph:
import matplotlib.pyplot as plt
import pandas as pd
#test data generation
import numpy as np
n=30
np.random.seed(123)
df = pd.DataFrame({"date": np.random.choice(list("ABCDEF"), n), "real_exe_time": np.random.randint(1, 100, n)})
df = df.sort_values(by="date").reindex()
#aggregate data for plotting
df_agg = df.groupby("date")["real_exe_time"].agg(mean="mean").reset_index()
df_agg["mean+30%"] = df_agg["mean"] * 1.3
df_agg["mean-30%"] = df_agg["mean"] * 0.7
#plot both into the same subplot
ax = df.plot.scatter(x = 'date', y = 'real_exe_time')
df_agg.plot(x='date', y=['mean','mean+30%','mean-30%'], ax=ax)
plt.show()
Sample output:
You could also consider using seaborn that has, for instance, pointplots for categorical data aggregation.
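As a variation on the same idea, the sketch below draws both layers with matplotlib directly, so the scatter points and the aggregated lines share one categorical x-axis. The data values are made up, and the `Agg` backend is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt
import pandas as pd

# made-up measurements with a month label per row
df = pd.DataFrame({
    "date": ["Jan", "Jan", "Feb", "Feb", "Mar"],
    "real_exe_time": [31, 30, 33, 38, 36],
})

# one row per month, keeping the months in order of appearance
df_agg = df.groupby("date", sort=False)["real_exe_time"].agg(mean="mean").reset_index()
df_agg["mean+30%"] = df_agg["mean"] * 1.3
df_agg["mean-30%"] = df_agg["mean"] * 0.7

fig, ax = plt.subplots()
# raw points and aggregated lines go onto the same Axes
ax.scatter(df["date"], df["real_exe_time"], label="real_exe_time")
for col in ["mean", "mean+30%", "mean-30%"]:
    ax.plot(df_agg["date"], df_agg[col], label=col)
ax.legend()
fig.savefig("merged.png")
```

Passing the same string labels to both calls lets matplotlib map them to the same positions, which avoids the points being squeezed into one corner.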
I'm guessing that you haven't transformed the date column to a datetime object, so the first thing you should do is this:
#Transform the date to datetime object
df_inc['date']=pd.to_datetime(df_inc['date'],format='%b')
fig=plt.figure()
ax = df_inc.plot(x='date', y=['mean','mean+30%','mean-30%']);
df_inc.plot.scatter(x = 'date', y = 'real_exe_time', ax=ax)
plt.show()
I have the following pandas groupby object, and I'd like to turn the result into a new dataframe.
The following code computes the conditional probability:
bin_probs = data.groupby('season')['bin'].value_counts()/data.groupby('season')['bin'].count()
I've tried the following code, but it returns as follows.
I would like the season to be filled in on each row. How can I do that?
a = pd.DataFrame(data_5.groupby('season')['bin'].value_counts()/data_5.groupby('season')['bin'].count())
a is a DataFrame, but with a 2-level index, so my interpretation is you want a dataframe without a multi-level index.
The index can't be reset when the name in the index and the column are the same.
Use pandas.Series.reset_index, and set name='normalized_bin' to rename the bin column.
This would not work with the implementation in the OP, because that is a dataframe.
This works with the following implementation, because a pandas.Series is created with .groupby.
The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random # for test data
import numpy as np # for test data
# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
bin season
0 2 summer
1 4 winter
2 1 summer
3 5 winter
4 2 spring
# groupby, normalize and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
season bin normalized_bin
0 fall 2 0.15600
1 fall 9 0.11600
2 fall 3 0.10800
3 fall 4 0.10400
4 fall 6 0.10000
5 fall 0 0.09600
6 fall 8 0.09600
7 fall 5 0.08400
8 fall 7 0.08000
9 fall 1 0.06000
10 spring 0 0.11524
11 spring 8 0.11524
12 spring 9 0.11524
13 spring 3 0.11152
14 spring 1 0.10037
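A tiny deterministic frame makes it easier to check what `normalize=True` returns. This is only a sketch with hand-picked values, so the proportions can be verified by eye.

```python
import pandas as pd

# hand-picked data: 4 'fall' rows and 2 'spring' rows
df = pd.DataFrame({
    "season": ["fall", "fall", "fall", "fall", "spring", "spring"],
    "bin": [1, 1, 2, 3, 0, 0],
})

# proportion of each bin within its season, flattened into regular columns
probs = (df.groupby("season")["bin"]
           .value_counts(normalize=True)
           .reset_index(name="normalized_bin"))

# within each season the proportions should add up to 1
totals = probs.groupby("season")["normalized_bin"].sum()

print(probs)
print(totals)
```

Here bin 1 accounts for two of the four fall rows, so its proportion is 0.5, and the season appears on every row of the result.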
Using the OP code for a
As already noted above, use normalize=True to get normalized values
The solution in the OP, creates a DataFrame, because the .groupby is wrapped with the DataFrame constructor, pandas.DataFrame.
To reset the index, you must first pandas.DataFrame.rename the bin column, and then use pandas.DataFrame.reset_index
a = pd.DataFrame(df.groupby('season')['bin'].value_counts()/df.groupby('season')['bin'].count()).rename(columns={'bin': 'normalized_bin'}).reset_index()
Other Resources
See Pandas unable to reset index because name exist to reset by a level.
Plotting
It is easier to plot from the multi-index Series, by using pandas.Series.unstack(), and then use pandas.DataFrame.plot.bar
For side-by-side bars, set stacked=False.
The bars are all equal to 1, because this is normalized data.
import matplotlib.pyplot as plt
s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()
# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')
You are looking for parameter normalize:
bin_probs = data.groupby('season')['bin'].value_counts(normalize=True)
Read more about it in the pandas documentation for value_counts.
After transposing my Python Dataframe, I could not access my column name to plot a graph. I want to choose two columns but failed. It keeps saying no such column names. I am pretty new to Python, dataframe and transpose. Could someone help please?
Below is my input file; I want to transpose the rows to columns. The transpose itself was successful, but I could not select "Canada" and "Cameroon" to plot a graph.
country 1990 1991 1992 1993 1994 1995
0 Cambodia 65.4 65.7 66.2 66.7 67.1 68.4
1 Cameroon 63.9 63.7 64.7 65.6 66.6 67.6
2 Canada 98.6 99.6 99.6 99.8 99.9 99.9
3 Cape Verde 77.7 77.0 76.6 89.0 79.0 78.0
import pandas as pd
import numpy as np
import re
import math
import matplotlib.pyplot as plt
missing_values=["n/a","na","-","-","N/A"]
df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values)
# Transpose
df = df.transpose()
plt.figure(figsize=(12,8))
plt.plot(df['Canada','Cameroon'], linewidth = 0.5)
plt.title("Time Series for Canada")
plt.show()
It produces a long list of error messages but the final message is
KeyError: ('Canada', 'Cameroon')
There are a few things you might need to do when working with the data.
If the csv file has no header, then use df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values, header = None).
When you transpose, you need to name the columns: df.columns = df.iloc[0].
Having done this, you need to drop the first row of your table (because it contains the column names): df = df.reindex(df.index.drop(0)). The label of that row is 0 when the file was read with header = None; if the file was read with its header row, the label is 'country'.
Finally, when accessing the data by columns (in the plt.plot() command), you need to pass a list of columns to df[], i.e. df[['Canada', 'Cameroon']].
EDIT So the code, as it works for me is as follows
# the file shown in the question has a header row, so after transposing the first index label is 'country'
df = pd.read_csv('StackoverflowGap.csv', na_values = missing_values)
df = df.T
df.columns= df.iloc[0]
df = df.reindex(df.index.drop('country'))
df.index.name = 'Year'
plt.figure(figsize=(12,8))
plt.plot(df[['Canada','Cameroon']], linewidth = 0.5)
plt.title("Time Series for Canada")
plt.show()
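A shorter route, offered here as a sketch and not as the answer's own method, is to move the country column into the index before transposing, so the country names become the column labels directly. The in-memory CSV below uses the first rows and years of the question's table in place of `StackoverflowGap.csv`.

```python
import io
import pandas as pd

csv_text = """country,1990,1991,1992
Cambodia,65.4,65.7,66.2
Cameroon,63.9,63.7,64.7
Canada,98.6,99.6,99.6
"""
df = pd.read_csv(io.StringIO(csv_text))

# countries become the index, then the transpose turns them into columns
by_year = df.set_index("country").T
by_year.index.name = "Year"

# a list of names selects several columns at once
selected = by_year[["Canada", "Cameroon"]]

# without the inner list, pandas looks for one column keyed by a tuple
try:
    by_year["Canada", "Cameroon"]
    tuple_lookup_failed = False
except KeyError:
    tuple_lookup_failed = True

print(selected)
print(tuple_lookup_failed)
```

The years are read from the header as strings, so a row is addressed as `by_year.loc['1990']`; `selected` can be passed to `plt.plot` as in the code above.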