Plotting using multi index values - python

I have an excel file with the following data:
My code so far:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_excel('file.xlsx', header=[0,1], index_col=[0])
Firstly, am I reading in my file correctly to get a multi-index with Main (A,B,C) as the first level and Value (X,Y) as the second level?
Using Pandas and Matplotlib, how do I plot an individual scatter plot for each Main (A,B,C), with each x,y pair as the scatter values (imaged below)? I can do it messily by calling each column in an individual plot function.
Is there a nicer way to do it with multi-indexing or group by?

This should help:
df = df.set_index(['Main1', 'Main2']).value
df.unstack().plot(kind='line', stacked=True)
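For the layout described in the question (two header rows, so the columns form a MultiIndex of Main and Value), one possible approach is to loop over the first column level and draw one scatter plot per Main. This is only a sketch: the DataFrame below is a made-up stand-in for the Excel sheet, and the level names "Main" and "Value" are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel('file.xlsx', header=[0, 1], index_col=[0])
columns = pd.MultiIndex.from_product(
    [["A", "B", "C"], ["X", "Y"]], names=["Main", "Value"]
)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 6)), columns=columns)

# One subplot per entry in the first column level
mains = df.columns.get_level_values("Main").unique()
fig, axes = plt.subplots(1, len(mains), figsize=(4 * len(mains), 3), squeeze=False)

for main, ax in zip(mains, axes.flat):
    sub = df[main]  # selecting on level 0 leaves plain 'X' and 'Y' columns
    ax.scatter(sub["X"], sub["Y"])
    ax.set_title(main)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")

fig.tight_layout()
```

Selecting `df[main]` drops the outer level, which is what makes the inner `X` and `Y` columns addressable by name inside the loop.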

Related

How can I plot a pandas dataframe as a scatter graph? I think I may have messed up the indexing and can't add a new index?

I am trying to plot my steps as a scatter graph and then eventually add a trend line.
I managed to get it to work with df.plot() but it is a line chart.
The following is the code I have tried:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data_file = pd.read_csv('CSV/stepsgyro.csv')
# print(data_file.head())
# put in the correct data types
data_file = data_file.astype({"steps": int})
# assign the result back, otherwise the conversion is discarded
data_file['date'] = pd.to_datetime(data_file['date'])
# makes the date definitely the index at the bottom
data_file.set_index(['date'], inplace=True)
# sorts the data frame by the index
data_file.sort_values(by=['date'], inplace=True, ascending=True)
# data_file.columns.values[1] = 'date'
# plot the raw steps data
# data_file.plot()
plt.scatter(data_file.date, data_file.steps)
plt.title('Daily Steps')
plt.grid(alpha=0.3)
plt.show()
plt.close('all')
# plot the cumulative steps data
data_file = data_file.cumsum()
data_file.plot()
plt.title('Cumulative Daily Steps')
plt.grid(alpha=0.3)
plt.show()
plt.close('all')
And here is a screenshot of what it looks like in my IDE:
Any guidance would be greatly appreciated!
You have set the index to be the "date" column. From that moment on, there is no "date" column anymore, hence data_file.date fails.
Two options:
1. Don't set the index. Sorting doesn't seem to be needed anyway.
2. Plot the index instead: plt.scatter(data_file.index, data_file.steps)
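A minimal sketch of the second option, plotting against the index once "date" is no longer a column. The three rows of data are invented for illustration; only the column names "date" and "steps" come from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for pd.read_csv('CSV/stepsgyro.csv')
data_file = pd.DataFrame(
    {"date": ["2020-01-03", "2020-01-01", "2020-01-02"],
     "steps": [4200, 3500, 5100]}
)
data_file["date"] = pd.to_datetime(data_file["date"])

# After set_index, 'date' lives in the index and is no longer a column
data_file = data_file.set_index("date").sort_index()

fig, ax = plt.subplots()
ax.scatter(data_file.index, data_file["steps"])
ax.set_title("Daily Steps")
ax.grid(alpha=0.3)
fig.autofmt_xdate()  # tilt the date labels so they don't overlap
```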
I can't figure out just by looking at your example why you are getting that error. However, I can offer a quick and easy solution to plotting your data:
data_file.plot(marker='.', linestyle='none')
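You can use df.plot(kind='scatter', x=..., y=...) to avoid the line chart. Note that kind='scatter' requires the x and y arguments to name columns, so the date would have to be a regular column rather than the index.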

Subplot of difference between data points imported with Pandas and conversion of time values

I'm relatively new to Python (in the process of self-teaching) and so this is proving to be quite a learning curve but I'm very happy to get to grips with it. I have a set of data points from an experiment in excel, one column is time (with the format 00:00:00:000) and a second column is the measured parameter.
I'm using pandas to read the excel document in order to produce a graph from it with time along the x-axis and the measured variable along the y-axis. However, when I plot the data, the time column becomes the data point number (i.e. 00:00:00:000 - 00:05:40:454 becomes 0 - 2000) and I'm not sure why. Could anyone please advise how to rectify this?
Secondly, I'd like to produce a subplot that shows the difference between the y-values as a function of time, basically a gradient to show the variation. Is there a way to easily calculate this and display it using pandas?
Here is my code, please do forgive how basic it is!
import pandas as pd
import matplotlib.pyplot as plt
import pylab
df = pd.read_excel('rest.xlsx', 'Sheet1')
df.plot(legend=False, grid=False)
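# save before showing; after plt.show() returns, the figure may already be closed and the saved file blank
plt.savefig('myfig')
plt.show()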
If you just read the Excel file, pandas will create a RangeIndex starting at 0. To use the time information from your Excel file as the index, specify the name of the time column (as a string) with the keyword argument index_col in the read_excel call:
df = pd.read_excel('rest.xlsx', 'Sheet1', index_col='name_of_time_column')
Just replace 'name_of_time_column' with the actual name of the column that contains the time information.
Hopefully pandas will parse the time information into a DatetimeIndex automatically (your format should be fine); the plot will then use the DatetimeIndex on the x-axis.
To get the difference between consecutive data points, use the diff method with argument 1 on your DataFrame:
difference = df.diff(1)
difference.plot(legend=False, grid=False)
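If the time column arrives as text such as 00:05:40:454 rather than as parsed times, one possible conversion is sketched below. It assumes the values are strings with a colon before the milliseconds. The column names "Time" and "Parameter" and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the data read from 'rest.xlsx'
df = pd.DataFrame(
    {"Time": ["00:00:00:000", "00:00:01:500", "00:05:40:454"],
     "Parameter": [1.0, 1.5, 1.25]}
)

# Swap the last colon for a dot so the text reads HH:MM:SS.mmm
as_text = df["Time"].str.replace(r":(\d{3})$", r".\1", regex=True)

# Parse as elapsed time and use it as the index
df["Time"] = pd.to_timedelta(as_text)
df = df.set_index("Time")

# Elapsed seconds as plain floats, handy as a numeric x-axis
elapsed_seconds = df.index.total_seconds()
```

Treating the values as elapsed time (a timedelta) rather than a calendar timestamp is a design choice here, since the question describes times measured from the start of an experiment.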
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('rest.xlsx', 'Sheet1')
X = df['Time'].tolist()  # if the time column is called 'Time'
Y = df['Parameter'].tolist()  # if the parameter column is called 'Parameter'
plt.plot(X,Y)
plt.gcf().autofmt_xdate()
plt.show()
With matplotlib you can create a figure with two axes and name them, for example, ax_df and ax_diff:
import matplotlib.pyplot as plt
fig, [ax_df, ax_diff] = plt.subplots(nrows=2, ncols=1, sharex=True)
sharex=True specifies to use the same x-axis for both subplots.
When calling plot on the DataFrame, you can redirect the output to the axis by specifying the axes with the keyword argument ax:
df.plot(ax=ax_df)
df.diff(1).plot(ax=ax_diff)
plt.show()
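Put together, the two-subplot approach might look like the sketch below. The small DataFrame is invented so the example runs on its own; the index name "time_s" and the column name "Parameter" are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements indexed by elapsed time in seconds
df = pd.DataFrame(
    {"Parameter": [1.0, 1.5, 1.25, 2.0]},
    index=pd.Index([0.0, 0.5, 1.0, 1.5], name="time_s"),
)

# Difference between each value and the one before it (first row is NaN)
difference = df.diff(1)

# Two stacked subplots sharing the same x-axis
fig, (ax_df, ax_diff) = plt.subplots(nrows=2, ncols=1, sharex=True)
df.plot(ax=ax_df, legend=False)
difference.plot(ax=ax_diff, legend=False)
ax_df.set_ylabel("Parameter")
ax_diff.set_ylabel("Difference")
```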

pandas groupby sum area plot

I'm looking to make a stacked area plot over time, based on summary data created by groupby and sum.
The groupby and sum part correctly groups and sums the data I want, but it seems the resultant format is nonsense in terms of plotting it.
I'm not sure where to go from here:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.DataFrame({'invoice':[1,2,3,4,5,6],'year':[2016,2016,2017,2017,2017,2017],'part':['widget','wonka','widget','wonka','wonka','wonka'],'dollars':[10,20,30,10,10,10]})
#drop the invoice number from the data since we don't need it
df=df[['dollars','part','year']]
#group by year and part, and add them up
df=df.groupby(['year','part']).sum()
#plotting this is nonsense:
df.plot.area()
plt.show()
To chart multiple series, it's easiest to have each series organized as a separate column, i.e. replace
df=df.groupby(['year','part']).sum()
with
df=df.groupby(['year', 'part']).sum().unstack(-1)
Then the rest of the code should work. But, I'm not sure if this is what you need because the desired output is not shown.
df.plot.area() then produces a stacked area chart with one area per part.
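As a sketch of what the unstacked frame looks like, the example below reuses the data from the question. It differs from the answer in one respect: it selects the "dollars" column before summing, which is an optional variation that keeps the resulting column labels flat. The name `summary` is introduced here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "invoice": [1, 2, 3, 4, 5, 6],
    "year": [2016, 2016, 2017, 2017, 2017, 2017],
    "part": ["widget", "wonka", "widget", "wonka", "wonka", "wonka"],
    "dollars": [10, 20, 30, 10, 10, 10],
})

# Sum dollars per (year, part), then pivot 'part' into columns:
# rows are years, and each part becomes its own series
summary = df.groupby(["year", "part"])["dollars"].sum().unstack("part")

# Area plots are stacked by default
ax = summary.plot.area()
ax.set_ylabel("dollars")
```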

Plotting Pandas groupby groups using subplots and loop

I am trying to generate a grid of subplots based off of a Pandas groupby object. I would like each plot to be based off of two columns of data for one group of the groupby object. Fake data set:
C1,C2,C3,C4
1,12,125,25
2,13,25,25
3,15,98,25
4,12,77,25
5,15,889,25
6,13,56,25
7,12,256,25
8,12,158,25
9,13,158,25
10,15,1366,25
I have tried the following code:
import pandas as pd
import csv
import matplotlib as mpl
import matplotlib.pyplot as plt
import math
#Path to CSV File
path = "..\\fake_data.csv"
#Read CSV into pandas DataFrame
df = pd.read_csv(path)
#GroupBy C2
grouped = df.groupby('C2')
#Figure out number of rows needed for 2 column grid plot
#Also accounts for odd number of plots
nrows = int(math.ceil(len(grouped)/2.))
#Setup Subplots
fig, axs = plt.subplots(nrows,2)
for ax in axs.flatten():
    for i,j in grouped:
        j.plot(x='C1',y='C3', ax=ax)
plt.savefig("plot.png")
But it generates 4 identical subplots with all of the data plotted on each (see example output below):
I would like to do something like the following to fix this:
for i,j in grouped:
    j.plot(x='C1',y='C3',ax=axs)
    next(axs)
but I get this error
AttributeError: 'numpy.ndarray' object has no attribute 'get_figure'
I will have a dynamic number of groups in the groupby object I want to plot, and many more elements than the fake data I have provided. This is why I need an elegant, dynamic solution and each group data set plotted on a separate subplot.
Sounds like you want to iterate over the groups and the axes in parallel, so rather than having nested for loops (which iterate over all groups for each axis), you want something like this:
for (name, df), ax in zip(grouped, axs.flat):
    df.plot(x='C1',y='C3', ax=ax)
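You have the right idea in your second code snippet, but you're getting an error because axs is an array of axes, while plot expects a single axis. A NumPy array is also not an iterator, so next(axs) won't work on it directly. It should work to create an iterator once before the loop with ax_iter = iter(axs.flat), replace next(axs) with ax = next(ax_iter) at the top of the loop body, and change the argument of plot to ax=ax.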
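A self-contained sketch of the parallel iteration, using the fake data from the question read from an in-memory string instead of a file. Hiding the leftover axis when the number of groups is odd is an addition for illustration, not part of the original answer.

```python
import io
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

csv_text = """C1,C2,C3,C4
1,12,125,25
2,13,25,25
3,15,98,25
4,12,77,25
5,15,889,25
6,13,56,25
7,12,256,25
8,12,158,25
9,13,158,25
10,15,1366,25
"""
df = pd.read_csv(io.StringIO(csv_text))
grouped = df.groupby("C2")

# Two columns; enough rows to hold every group
nrows = math.ceil(grouped.ngroups / 2)
fig, axs = plt.subplots(nrows, 2, squeeze=False)
axes = axs.ravel()  # flat view of the grid, one Axes per slot

# Pair each group with exactly one axis
for (name, group), ax in zip(grouped, axes):
    group.plot(x="C1", y="C3", ax=ax)
    ax.set_title(f"C2 = {name}")

# Hide any grid slots left over when the group count is odd
for ax in axes[grouped.ngroups:]:
    ax.set_visible(False)
```

Because `zip` stops at the shorter input, the loop never runs out of axes as long as the grid has at least as many slots as there are groups.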

Basic Matplotlib Scatter Plot From Pandas DataFrame

How do I make a basic scatter plot of a column in a DataFrame vs. the index of that DataFrame? I'm using Python 2.7.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataframe['Col'].plot()
plt.show()
This shows a line chart of 'Col' plotted against the values in my DataFrame index (dates in this case).
But how do I plot a scatterplot rather than a line chart?
I tried
plt.scatter(dataframe['Col'])
plt.show()
But scatter() requires 2 arguments. So how do I pass the series dataframe['Col'] and my dataframe index into scatter() ?
For this I tried
plt.scatter(dataframe.index.values, dataframe['Col'])
plt.show()
But the chart is blank.
If you just want to change from lines to points (and don't really need matplotlib's scatter), you can simply set the style:
In [6]: df= pd.DataFrame({'Col': np.random.uniform(size=1000)})
In [7]: df['Col'].plot(style='.')
Out[7]: <matplotlib.axes.AxesSubplot at 0x4c3bb10>
See the docs of DataFrame.plot and the general plotting documentation.
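The same style trick with a date index, as in the question, might look like this. The date range and random values are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 50 daily values indexed by date
dataframe = pd.DataFrame(
    {"Col": np.random.uniform(size=50)},
    index=pd.date_range("2020-01-01", periods=50, freq="D"),
)

# style='.' draws point markers and no connecting line
ax = dataframe["Col"].plot(style=".")
ax.set_ylabel("Col")
```

The result is still a line object under the hood, just drawn with markers only, so options that are specific to `scatter` (per-point sizes or colors) are not available this way.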
Strange. That ought to work.
Running this
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dataframe = pd.DataFrame({'Col': np.random.uniform(size=1000)})
plt.scatter(dataframe.index, dataframe['Col'])
spits out something like this
Maybe quit() and fire up a new session?
