Graphing a dataframe line plot with a legend in Matplotlib - python

I'm working with a dataset that has grades and states and need to create line graphs by state showing what percent of each state's students fall into which bins.
My methodology (so far) is as follows:
First I import the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
records = [{'Name':'A', 'Grade':'.15','State':'NJ'},{'Name':'B', 'Grade':'.15','State':'NJ'},{'Name':'C', 'Grade':'.43','State':'CA'},{'Name':'D', 'Grade':'.75','State':'CA'},{'Name':'E', 'Grade':'.17','State':'NJ'},{'Name':'F', 'Grade':'.85','State':'HI'},{'Name':'G', 'Grade':'.89','State':'HI'},{'Name':'H', 'Grade':'.38','State':'CA'},{'Name':'I', 'Grade':'.98','State':'NJ'},{'Name':'J', 'Grade':'.49','State':'NJ'},{'Name':'K', 'Grade':'.17','State':'CA'},{'Name':'K', 'Grade':'.94','State':'HI'},{'Name':'M', 'Grade':'.33','State':'HI'},{'Name':'N', 'Grade':'.22','State':'NJ'},{'Name':'O', 'Grade':'.7','State':'NJ'}]
df = pd.DataFrame(records)
df.Grade = df.Grade.astype(float)
Next I cut each grade into a bin
df['bin'] = pd.cut(df['Grade'],[-np.inf,.05,.1,.15,.2,.25,.3,.35,.4,.45,.5,.55,.6,.65,.7,.75,.8,.85,.9,.95,1],labels=False)/10
Then I create a pivot table giving me the count of people by bin in each state
df2 = pd.pivot_table(df,index=['bin'],columns='State',values=['Name'],aggfunc=pd.Series.nunique,margins=True)
df2 = df2.fillna(0)
Then I convert those n-counts into percentages and remove the margin rows
df3 = df2.div(df2.iloc[-1])
df3 = df3.iloc[:-1,:-1]
Now I want to create a line graph with multiple lines (one for each state) with the bin on the x axis and the percentage on the Y axis. df3.plot() will give me the chart I want but I would like to accomplish the same using matplotlib, because it offers me greater customization of the graph. However, running
plt.plot(df3)
gives me the lines I need, but I can't get the legend to work properly. Any thoughts on how to accomplish this?

It may not be the best way, but I use the pandas plot function to draw df3, then get the legend and get the new label names. Please note that the processing of the legend string is limited to this data.
line = df3.plot(kind='line')
handles, labels = line.get_legend_handles_labels()
# pandas labels each line like "(Name, CA)"; the slice keeps only the state code
label = []
for l in labels:
    label.append(l[7:-1])
plt.legend(handles, label, loc='best')

You can do this:
plt.plot(df3,label="label")
plt.legend()
plt.show()
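A possible pure-matplotlib variant is to plot one column at a time, so that each line carries its own label. This is only a minimal sketch: the small df3 below is a hypothetical stand-in with the same shape as the question's frame (bins as the index, ("Name", state) tuples as columns), and the numbers are invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for df3: bins as the index, ("Name", state) column tuples
columns = pd.MultiIndex.from_product(
    [["Name"], ["CA", "HI", "NJ"]], names=[None, "State"]
)
df3 = pd.DataFrame(
    [[0.25, 0.00, 0.50], [0.50, 0.25, 0.25], [0.25, 0.75, 0.25]],
    index=pd.Index([0.3, 0.8, 1.6], name="bin"),
    columns=columns,
)

fig, ax = plt.subplots()
for col in df3.columns:
    # With MultiIndex columns each key is a tuple; the last element is the state
    state = col[-1] if isinstance(col, tuple) else col
    ax.plot(df3.index, df3[col], label=state)

ax.set_xlabel("bin")
ax.set_ylabel("share of state's students")
ax.legend(title="State")
```

Calling plt.show() afterwards displays the figure. Because every line is drawn by its own ax.plot call, colours, markers and line styles can be customised per state.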

Related

How do you get dates on the start on the specified month? (matplotlib)

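import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
# FEB
# configuring the figure and plot space
fig, lx = plt.subplots(figsize=(30,10))
# converting the Series to float so the data can be plotted
wd = df2['Unnamed: 1']
wd = wd.astype(float)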
# adding the x and y axes' values
lx.plot(list(df2.index.values), wd)
# defining what the labels will be
lx.set(xlabel='Day', ylabel='Weight', title='Daily Weight February 2022')
# defining the date format
date_format = DateFormatter('%m-%d')
lx.xaxis.set_major_formatter(date_format)
lx.xaxis.set_minor_locator(mdates.WeekdayLocator(interval=1))
Values I would like the x-axis to have:
['2/4', '2/5', '2/6', '2/7', '2/8', '2/9', '2/10', '2/11', '2/12', '2/13', '2/14', '2/15', '2/16', '2/17', '2/18', '2/19', '2/20', '2/21', '2/22', '2/23', '2/24', '2/25', '2/26', '2/27']
Values actually shown on the x-axis: (screenshot not included)
It is giving me the right number of values, just not the right labels. I have tried to specify the start and end with xlim=['2/4', '2/27'], however that did not seem to work.
It would be great to see how your df2 actually looks, but from your code snippet, it looks like it has weights recorded but not the corresponding dates.
How about preparing a data frame that has dates in it?
(Also, since this question is tagged with seaborn too, I'm going to use Seaborn, but the same idea should work.)
import pandas as pd
import seaborn as sns
import seaborn.objects as so
from matplotlib.dates import DateFormatter
sns.set_theme()
Create a date index starting on 4 Feb, with one entry for each day on which a weight was recorded.
index = pd.date_range(start="2/4/2022", periods=df2.count().Weight, name="Date")
Then with Seaborn's object interface (v0.12+), we can do:
(
    so.Plot(df2.set_index(index), x="Date", y="Weight")
    .add(so.Line())
    .scale(x=so.Temporal().label(formatter=DateFormatter('%m-%d')))
    .label(title="Daily Weight February 2022")
)
I have solved this problem. It was very simple: I just passed mdates.WeekdayLocator() to set_major_locator (a locator cannot be passed to set_major_formatter). I overlooked this when I was going through the matplotlib docs, but I am happy to have found this solution.
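For readers who want a plain matplotlib version, the sketch below pairs a locator (which decides where the ticks go) with a formatter (which decides how they are written). It assumes real datetime values on the x-axis; the dates and weights are hypothetical, and DayLocator is swapped in for the WeekdayLocator mentioned above because the question asks for one label per day.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Hypothetical data: 24 daily weights starting on 4 Feb 2022
dates = pd.date_range(start="2022-02-04", periods=24, freq="D")
weights = [180.0 - 0.1 * i for i in range(24)]

fig, ax = plt.subplots(figsize=(12, 4))
ax.plot(dates, weights)

# The locator places one major tick per day; the formatter renders it as month-day
ax.xaxis.set_major_locator(mdates.DayLocator(interval=1))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))

ax.set(xlabel="Day", ylabel="Weight", title="Daily Weight February 2022")
fig.autofmt_xdate()  # rotate the labels so 24 of them fit
```

Plotting against the integer index, as in the question, gives the date formatter nothing meaningful to format, which is why the labels did not match the expected days.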

make correlation plot on time series data in python

I want to compute a correlation on a rolling weekly basis in time series data, so that I can see how the rolling correlation moves each year. I tried the built-in pandas.corr() and pandas.rolling_corr() functions to get the rolling correlation and then draw a line plot, but I couldn't get a correct correlation line chart. I don't know how I should aggregate the time series to get a rolling correlation line chart. Does anyone know a way of doing this in Python?
my attempt:
I tried using pandas.corr() to get the correlation, but it did not help me produce a rolling correlation line chart. Here is my new attempt, which is also not working. I assume I need to aggregate the data differently to make a rolling correlation line chart.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/eb784c86c44fd7ed3f2504157a33dc23/raw/79b6aa4f2e0ffd1eb626dffdcb609eb2cb8dae48/corr.csv'
df = pd.read_csv(url)
df['date'] = pd.to_datetime(df['date'])
def get_corr(df, window=4):
    dfs = []
    for key, value in df:
        value["ROLL_CORR"] = pd.rolling_corr(value["prod_A_price"],value["prod_B_price"], window)
        dfs.append(value)
    df_final = pd.concat(dfs)
    return df_final
corr_df = get_corr(df, window=12)
fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
sns.lineplot(x='week', y='ROLL_CORR', hue='year', data=corr_df,alpha=.8)
plt.show()
plt.close()
This approach is not working for me. I want to see how the rolling correlations move each year. Can anyone point me to a way of drawing a rolling correlation line chart from time series data in Python?
desired output
Here is the desired rolling correlation line chart. Note that this plot was generated in MS Excel. Is there a way to produce it in Python, and how should I correct my current attempt to get the desired output?
Using your code and description as a starting point.
pandas' Rolling class has an apply method which can be leveraged (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.apply.html#pandas.core.window.rolling.Rolling.apply)
Two tricks are involved to make the code work:
Accessing the whole row in the applied function (see "Pandas rolling apply using multiple columns")
Calling the rolling function on a single pandas.Series (here df['week']) to avoid running the applied function once per column
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
url = 'https://gist.githubusercontent.com/adamFlyn/eb784c86c44fd7ed3f2504157a33dc23/raw/79b6aa4f2e0ffd1eb626dffdcb609eb2cb8dae48/corr.csv'
df = pd.read_csv(url)
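def get_corr(ser):
    # ser is one rolling window of df['week']; its index selects the full rows
    rolling_df = df.loc[ser.index]
    return rolling_df['prod_A_price'].corr(rolling_df['prod_B_price'])
df['ROLL_CORR'] = df['week'].rolling(4).apply(get_corr)
number_years = 3
avg_rows = []
for week, df_week in df.groupby('week'):
    avg_rows.append({
        'week': week,
        'year': f'{number_years} year avg',
        'ROLL_CORR': df_week.sort_values(by='date').head(number_years)['ROLL_CORR'].mean()
    })
# DataFrame.append was removed in pandas 2.0, so the average rows are concatenated instead
df = pd.concat([df, pd.DataFrame(avg_rows)], ignore_index=True)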
fig, ax = plt.subplots(figsize=(7, 4), dpi=144)
sns.lineplot(x='week', y='ROLL_CORR', hue='year', data=df,alpha=.8)
plt.show()
plt.close()
You'll find here the generated image by seaborn
With the 3 year average
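As a side note, pandas.rolling_corr from the question is not available in current pandas releases; the windowed equivalent is Series.rolling(window).corr(other). The sketch below shows that call on hypothetical random prices. The column names follow the question, but the data and the demo frame are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly prices; only the column names come from the question
rng = np.random.default_rng(0)
n_weeks = 12
demo = pd.DataFrame({
    "week": np.arange(1, n_weeks + 1),
    "prod_A_price": rng.normal(10, 1, n_weeks),
    "prod_B_price": rng.normal(20, 2, n_weeks),
})

window = 4
# Pearson correlation of the two columns over each trailing 4-row window;
# the first window - 1 rows have no full window and are NaN
demo["ROLL_CORR"] = (
    demo["prod_A_price"].rolling(window).corr(demo["prod_B_price"])
)
print(demo[["week", "ROLL_CORR"]])
```

To keep the windows from running across year boundaries, the same call could be applied per group after df.groupby('year'), depending on how the yearly lines are meant to be read.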

Facing difficulty choosing the right plot in python for a large number of categories

I have a data frame with 3 columns: Language, Total Value and Percentage. I am not sure which plot type to use in Python for a better visualization.
Below is the data:
import pandas as pd
data={'Language':['Haitian,Creole','Dutch','French','English','Xhosa','Afrikaans','Lati','Galicia','Quechua','Danish','Western,Frisia','Xhosa,French','French,Xhosa','Spanish','Norwegian,Nynorsk','Norwegia','Germa','Indonesia','Interlingua','Romania','French,English','Interlingue','Czech','Scots','Uzbek','Manx','Luxembourgish','Malagasy','Irish','Slovak','Inupiaq','Morisye','English,French','Finnish','Dutch,Afrikaans','Afar','Corsica','Portuguese','Dutch,English','Sundanese','Kinyarwanda','Malay','Volapük','Afrikaans,Dutch','Wolof','Basque','Estonia','Italia','Lithuania','Scottish,Gaelic','Hungaria','Breto','Kalaallisut','Welsh','Zhuang','Lingala','Occita','Maori','Khasi','Maltese','Seselwa,Creole,French','Vietnamese','Tagalog','Fijia','zzp','Romansh','Bislama','Polish','Swedish','Xhosa,English','English,Dutch','Catala','Hmong','Turkme','Somali','Nyanja','Turkish','Oromo','Ganda','Tswana','Javanese','Southern,Sotho','Samoa','Guarani','Aymara','Naur','Waray','Icelandic','Rundi','Latvia','Shona','Klingo','Tonga','Cebuano','Igbo','Aka','French,Dutch','Hawaiia','Esperanto','Albania','Yoruba','Swahili','Breton,French','Dutch,Danish','Serbia'],'Total_Value':['180455','86394','40609','18355','17882','2508','483','362','259','258','247','209','172','162','156','139','130','71','70','64','45','39','38','33','33','30','29','27','26','24','22','21','20','17','16','14','14','13','13','13','12','11','11','10','9','9','9','8','8','8','7','7','6','6','6','6','6','6','6','5','5','5','5','5','4','4','4','4','4','4','3','3','3','3','3','3','2','2','2','2','2','2','2','2','2','2','2','2','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1','1'],'Percentage':['0.515799403','0.246942305','0.116073802','0.052464592','0.051112604','0.007168684','0.001380572','0.001034714','0.000740307','0.000737448','0.000706007','0.00059739','0.000491632','0.000463049','0.000445899','0.000397307','0.000371583','0.000202941','0.000200083','0.000182933','0.000128625','0.000111475','0.000108616','
0.0000943','0.0000943','0.0000857','0.0000829','0.0000772','0.0000743','0.0000686','0.0000629','0.00006','0.0000572','0.0000486','0.0000457','0.00004','0.00004','0.0000372','0.0000372','0.0000372','0.0000343','0.0000314','0.0000314','0.0000286','0.0000257','0.0000257','0.0000257','0.0000229','0.0000229','0.0000229','0.00002','0.00002','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000171','0.0000143','0.0000143','0.0000143','0.0000143','0.0000143','0.0000114','0.0000114','0.0000114','0.0000114','0.0000114','0.0000114','0.00000857','0.00000857','0.00000857','0.00000857','0.00000857','0.00000857','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000572','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286','0.00000286']}
df = pd.DataFrame(data)
I don't know the best way to visualize these three attributes using matplotlib, seaborn or plotly.
The Language column has 106 categories, each with a corresponding total value and percentage.
I would appreciate help producing a clear, interpretable visualization.
I tried the code below, but I could see only 52 languages on the x axis:
import chart_studio.plotly as py
import plotly.graph_objects as go
fig = go.Figure(data=go.Heatmap(
    z=[code_lang['percentage']],
    x=code_lang['Language'],
    y=code_lang['Total Value'],
    hoverongaps = False))
fig.show()
A better alternative would be helpful.
Here is a way to show the data as a wordcloud.
Some remarks:
the original Total_Value and Percentage columns are text strings; they need to be converted to numeric
the Total_Value and Percentage columns have equivalent information: only one of the two needs to be shown
a lot of the percentages are extremely small, so they become invisible in most types of visualization
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import pandas as pd
# data=...
df = pd.DataFrame(data)
df.Percentage = df.Percentage.astype(float)
df.Total_Value = df.Total_Value.astype(int)
word_dict = {}
for row in df.itertuples(index=False):
    word_dict[row.Language] = row.Percentage
wordcloud = WordCloud(background_color="white", width=1200, height=1000
                      ).generate_from_frequencies(word_dict)
plt.axis('off')
plt.imshow(wordcloud)
plt.show()
In order to have the large values not overwhelm the smaller, the percentages could be brought closer together, e.g. using word_dict[row.Language] = row.Percentage ** .2.
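To see what the power transform does, the sketch below applies it to three values taken from the Percentage column (the largest, a mid-sized one and the smallest) and compares the largest-to-smallest ratio before and after. The variable names are hypothetical, and the exponent 0.2 is the one suggested above.

```python
import pandas as pd

# Three values excerpted from the Percentage column: largest, mid-sized, smallest
shares = pd.Series(
    {"Haitian,Creole": 0.515799403, "Afrikaans": 0.007168684, "Serbia": 0.00000286}
)

# Raising to a power below 1 shrinks the gaps but keeps the ordering
compressed = shares ** 0.2

ratio_before = shares.max() / shares.min()
ratio_after = compressed.max() / compressed.min()
print(f"largest/smallest before: {ratio_before:,.0f}")
print(f"largest/smallest after:  {ratio_after:.1f}")
```

The compressed values can be passed to generate_from_frequencies in place of the raw percentages, for example as compressed.to_dict(). Font sizes then no longer reflect the true proportions, only the ranking.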

Integrating over range of dates, and labeling the xaxis

I am trying to integrate 2 curves as they change through time using pandas. I am loading data from a CSV file like such:
Where the Dates are the X-axis and both the Oil & Water points are the Y-axis. I have learned to use the cross-section option to isolate the "NAME" values, but am having trouble finding a good way to integrate with dates as the X-axis. I eventually would like to be able to take the integrals of both curves and stack them against each other. I am also having trouble with the plot defaulting the x-ticks to arbitrary values, instead of the dates.
I can change the labels/ticks manually, but have a large CSV to process and would like to automate the process. Any help would be greatly appreciated.
NAME,DATE,O,W
A,1/20/2000,12,50
B,1/20/2000,25,28
C,1/20/2000,14,15
A,1/21/2000,34,50
B,1/21/2000,8,3
C,1/21/2000,10,19
A,1/22/2000,47,35
B,1/22/2000,4,27
C,1/22/2000,46,1
A,1/23/2000,19,31
B,1/23/2000,18,10
C,1/23/2000,19,41
Contents of CSV in text form above.
Further to my comment above, here is some sample code (using logic from the example mentioned) to label your xaxis with formatted dates. Hope this helps.
Data Collection / Imports:
Just re-creating your dataset for the example.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
header = ['NAME', 'DATE', 'O', 'W']
data = [['A','1/20/2000',12,50],
        ['B','1/20/2000',25,28],
        ['C','1/20/2000',14,15],
        ['A','1/21/2000',34,50],
        ['B','1/21/2000',8,3],
        ['C','1/21/2000',10,19],
        ['A','1/22/2000',47,35],
        ['B','1/22/2000',4,27],
        ['C','1/22/2000',46,1],
        ['A','1/23/2000',19,31],
        ['B','1/23/2000',18,10],
        ['C','1/23/2000',19,41]]
df = pd.DataFrame(data, columns=header)
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
# Subset to just the 'A' labels.
df_a = df[df['NAME'] == 'A']
Plotting:
# Define the number of ticks you need.
nticks = 4
# Define the date format.
mask = '%m-%d-%Y'
# Create the set of custom date labels.
step = int(df_a.shape[0] / nticks)
xdata = np.arange(df_a.shape[0])
xlabels = df_a['DATE'].dt.strftime(mask).tolist()[::step]
# Create the plot.
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, df_a['O'], label='Oil')
ax.plot(xdata, df_a['W'], label='Water')
ax.set_xticks(np.arange(df_a.shape[0], step=step))
ax.set_xticklabels(xlabels, rotation=45, horizontalalignment='right')
ax.set_title('Test in Naming Labels for the X-Axis')
ax.legend()
Output:
I'd recommend converting the X-axis into integers or floats (seconds, minutes, hours or days since a certain time, depending on the precision you need). You can then use the usual methods to integrate, and the x-axis would no longer default to other values.
See How to convert datetime to integer in python
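A minimal sketch of that idea, using the 'A' rows from the CSV above: the dates are converted to days elapsed since the first sample and each curve is integrated with the trapezoidal rule. The helper trapezoid_area is a hypothetical name written out here for illustration, not a library function.

```python
import numpy as np
import pandas as pd

# The 'A' rows from the question's CSV
df_a = pd.DataFrame({
    "DATE": ["1/20/2000", "1/21/2000", "1/22/2000", "1/23/2000"],
    "O": [12, 34, 47, 19],
    "W": [50, 50, 35, 31],
})
df_a["DATE"] = pd.to_datetime(df_a["DATE"], format="%m/%d/%Y")

# Numeric x-axis: days elapsed since the first sample
days = (df_a["DATE"] - df_a["DATE"].iloc[0]).dt.total_seconds().to_numpy() / 86400.0


def trapezoid_area(x, y):
    """Trapezoidal rule: sum of mean segment heights times segment widths."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))


oil_area = trapezoid_area(days, df_a["O"])
water_area = trapezoid_area(days, df_a["W"])
print(oil_area, water_area)  # 96.5 125.5
```

The result is in units of value times days. Dividing by a different constant (3600 for hours, for example) changes the unit of the integral accordingly, and the same helper can be applied per NAME after a groupby.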

Multiple series in a trace for plotly

I dynamically generate a pandas dataframe where columns are months, index is day-of-month, and values are cumulative revenue. This is fairly easy, b/c it just pivots a dataframe that is month/dom/rev.
But now I want to plot it in plotly. Since every month the columns will expand, I don't want to manually add a trace per month. But I can't seem to have a single trace incorporate multiple columns. I could've sworn this was possible.
revs = Scatter(
    x=df.index,
    y=[df['2016-Aug'], df['2016-Sep']],
    name=['rev', 'revvv'],
    mode='lines'
)
data=[revs]
fig = dict( data=data)
iplot(fig)
This generates an empty graph, no errors. Ideally I'd just pass df[df.columns] to y. Is this possible?
You were probably thinking about cufflinks. You can plot a whole dataframe with Plotly using the iplot function without data replication.
An alternative would be to use pandas.plot to get a matplotlib object, which is then converted via plotly.tools.mpl_to_plotly and plotted. The whole procedure can be shortened to one line:
plotly.plotly.plot_mpl(df.plot().figure)
The output is virtually identical, just the legend needs tweaking.
import plotly
import pandas as pd
import random
import cufflinks as cf
data = plotly.tools.OrderedDict()
for month in ['2016-Aug', '2016-Sep']:
    data[month] = [random.randrange(i * 10, i * 100) for i in range(1, 30)]
#using cufflinks
df = pd.DataFrame(data, index=[i for i in range(1, 30)])
fig = df.iplot(asFigure=True, kind='scatter', filename='df.html')
plot_url = plotly.offline.plot(fig)
print(plot_url)
#using mpl_to_plotly
plot_url = plotly.offline.plot(plotly.tools.mpl_to_plotly(df.plot().figure))
print(plot_url)
