I am trying to use the .plot() function in pandas to plot data into a line graph.
The data is sorted by date, with 48 columns of values after each date. Sample below:
1 2 ... 46 47 48
18 2018-02-19 1.317956 1.192840 ... 1.959250 1.782985 1.418093
19 2018-02-20 1.356267 1.192248 ... 2.123432 1.760629 1.569340
20 2018-02-21 1.417181 1.288694 ... 2.086715 1.823581 1.612062
21 2018-02-22 1.431536 1.279514 ... 2.201972 1.878109 1.694159
etc until row 346.
I tried the below but .plot does not seem to take positional arguments:
df.plot(x=df.iloc[0:346,0],y=[0:346,1:49]
How would I go about plotting my rows by date (the 1st column) on a line graph, and can I expand this to include multiple axes?
There are multiple ways to do this, some of which are directly through the pandas dataframe. However, given your sample plotting line, I think the easiest might be to just use matplotlib directly:
import matplotlib.pyplot as plt
plt.plot(df.iloc[0:346,0],df.iloc[0:346,1:49])
For multiple axes you can add a few lines to make subplots. For example:
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = plt.subplot(1,2,1)
ax1.plot(df.iloc[0:346,0],df.iloc[0:346,1:10])
ax2 = plt.subplot(1,2,2)
ax2.plot(df.iloc[0:346,0],df.iloc[0:346,20:30])
You can also do this using the pandas plot() function that you were trying to use: it takes an ax argument, where you can provide the axis on which to plot. If you want to stick to pandas, I think you'd be best off setting the index to be a datetime index (see this link as an example: https://stackoverflow.com/a/27051371/12133280) and then using df.plot(y='column_name', ax=ax1). The x axis will be the index, which you would have set to be the date.
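As a rough sketch of the pandas route described above: the small random DataFrame, the column name 'date' and the integer column labels are stand-ins invented for illustration, not the asker's actual data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: a date column plus six value columns.
dates = pd.date_range("2018-02-19", periods=5, freq="D")
df = pd.DataFrame(np.random.rand(5, 6), columns=[1, 2, 3, 4, 5, 6])
df.insert(0, "date", dates.strftime("%Y-%m-%d"))

# Parse the date strings and make them the index, so they become the x axis.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

# One figure, two axes; each DataFrame.plot call is directed to one of them.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df[[1, 2, 3]].plot(ax=ax1)
df[[4, 5, 6]].plot(ax=ax2)
fig.autofmt_xdate()
```

Selecting a list of columns before calling plot keeps each subplot to a readable number of lines.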
I have the following Pandas Dataframe (linked above) and I'd like to plot a graph with the values 1.0 - 39.0 on the x axis and the y axis would be the dataframe values in the column of these (-0.004640 etc). The rows are other lines I'd like to plot, so at the end there will be a lot of lines.
I've tried to transpose my plot but that doesn't seem to work.
How could I go about doing this?
You could try to use matplotlib:
import matplotlib.pyplot as plt
%matplotlib inline
x=[1.0, 39.0]
plt.plot(x, df[1.0])
plt.plot(x, df[2.0])
...
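Another option, sketched here under the assumption that the column labels are the numbers 1.0 to 39.0 and that every row should become one line, is to transpose the DataFrame itself before plotting. The random data below is made up purely for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in: 4 rows (one line each), columns labelled 1.0 to 39.0.
columns = np.arange(1.0, 40.0)
df = pd.DataFrame(np.random.randn(4, len(columns)) * 0.01, columns=columns)

# After transposing, the old column labels form the index (the x axis)
# and each original row becomes its own column, hence its own line.
ax = df.T.plot(legend=False)
ax.set_xlabel("column label")
ax.set_ylabel("value")
```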
I'm relatively new to Python (in the process of self-teaching) and so this is proving to be quite a learning curve but I'm very happy to get to grips with it. I have a set of data points from an experiment in excel, one column is time (with the format 00:00:00:000) and a second column is the measured parameter.
I'm using pandas to read the excel document in order to produce a graph from it with time along the x-axis and the measured variable along the y-axis. However, when I plot the data, the time column becomes the data point number (i.e. 00:00:00:000 - 00:05:40:454 becomes 0 - 2000) and I'm not sure why. Could anyone please advise how to rectify this?
Secondly, I'd like to produce a subplot that shows the difference between the y-values as a function of time, basically a gradient to show the variation. Is there a way to easily calculate this and display it using pandas?
Here is my code, please do forgive how basic it is!
import pandas as pd
import matplotlib.pyplot as plt
import pylab
df = pd.read_excel('rest.xlsx', 'Sheet1')
df.plot(legend=False, grid=False)
plt.show()
plt.savefig('myfig')
If you just read the Excel file, pandas will create a RangeIndex, starting at 0. To use the time information from your Excel file as the index, specify the name (as a string) of the time column with the keyword argument index_col in the read_excel call:
df = pd.read_excel('rest.xlsx', 'Sheet1', index_col='name_of_time_column')
Just replace 'name_of_time_column' with the actual name of the column that contains the time information.
(Hopefully pandas will automatically parse the time information into a DatetimeIndex; your format should be fine.) The plot will then use the DatetimeIndex on the x-axis.
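If the time column is not parsed automatically and arrives as text in the 'HH:MM:SS:mmm' layout from the question, one possible workaround is to convert it by hand. This is only a sketch: the column names 'Time' and 'Parameter' and the sample values are assumptions.

```python
import pandas as pd

# Hypothetical sample using the 'HH:MM:SS:mmm' layout from the question;
# the column names 'Time' and 'Parameter' are assumptions.
df = pd.DataFrame({
    "Time": ["00:00:00:000", "00:00:00:500", "00:05:40:454"],
    "Parameter": [1.0, 1.5, 0.7],
})

# Turn the last ':' into '.', giving 'HH:MM:SS.mmm', which to_timedelta understands.
fixed = df["Time"].str.replace(r":(\d{3})$", r".\1", regex=True)
df["seconds"] = pd.to_timedelta(fixed).dt.total_seconds()

# Plot the measured value against elapsed seconds instead of the row number.
ax = df.plot(x="seconds", y="Parameter", legend=False)
```

Elapsed seconds are used for the x axis here because they plot as ordinary numbers; keeping the timedelta values would also be possible.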
To get the difference between consecutive data points, use the diff method with argument 1 on your DataFrame:
difference = df.diff(1)
difference.plot(legend=False, grid=False)
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('rest.xlsx', 'Sheet1')
X = df['Time'].tolist()  # assuming the time column is called 'Time'
Y = df['Parameter'].tolist()  # assuming the parameter column is called 'Parameter'
plt.plot(X,Y)
plt.gcf().autofmt_xdate()
plt.show()
With matplotlib you can create a figure with two axes and name them, for example ax_df and ax_diff:
import matplotlib.pyplot as plt
fig, [ax_df, ax_diff] = plt.subplots(nrows=2, ncols=1, sharex=True)
sharex=True specifies to use the same x-axis for both subplots.
When calling plot on the DataFrame, you can redirect the output to the axis by specifying the axes with the keyword argument ax:
df.plot(ax=ax_df)
df.diff(1).plot(ax=ax_diff)
plt.show()
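If the samples are not evenly spaced in time, a plain diff is not quite a gradient. A minimal sketch of one way to get a rate of change, assuming the index holds elapsed seconds and the measured column is called 'Parameter' (both are assumptions, as are the sample values):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: elapsed seconds as the index, one measured column.
df = pd.DataFrame(
    {"Parameter": [1.0, 1.5, 1.2, 2.0]},
    index=pd.Index([0.0, 0.5, 1.5, 2.0], name="seconds"),
)

# Change in value divided by change in time between consecutive samples.
dt = df.index.to_series().diff()
rate = df["Parameter"].diff().div(dt)

# Raw data on top, rate of change below, sharing the same x axis.
fig, (ax_df, ax_rate) = plt.subplots(nrows=2, ncols=1, sharex=True)
df["Parameter"].plot(ax=ax_df)
rate.plot(ax=ax_rate)
ax_rate.set_ylabel("change per second")
```

The first value of rate is NaN because the first sample has no predecessor.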
It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax for the boxplot so that it automatically generates the box plot for Y vs. X device without my having to manipulate the data frame externally:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing this is to draw a line plot using the mean values from the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
You can get the values of the medians by calling the .get_data() method of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
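import numpy as np
import matplotlib.pyplot as plt
bp = plt.boxplot(data)
X = []
Y = []
for m in bp['medians']:
    # each median is a horizontal Line2D with two end points
    [[x0, x1], [y0, y1]] = m.get_data()
    X.append(np.mean((x0, x1)))
    Y.append(np.mean((y0, y1)))
plt.plot(X, Y, c='C1')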
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
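For the means rather than the medians, a pandas and matplotlib only variant might look like the sketch below. It relies on two assumptions: the boxes sit at the default x positions 1 to n, and the sample data (columns 'value' and 'group') is invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical sample data with a 'value' column and a 'group' column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.random(150),
    "group": rng.choice(["A", "B", "C"], size=150),
})

ax = df.boxplot(column="value", by="group", showmeans=True)

# groupby sorts the group labels, matching the left-to-right box order.
means = df.groupby("group")["value"].mean()
# With default settings, matplotlib places the boxes at x = 1, 2, ..., n.
positions = np.arange(1, len(means) + 1)
ax.plot(positions, means.to_numpy(), marker="o", color="C1")
```

Swapping mean() for median() in the groupby line would connect the medians instead.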
I am trying to generate a grid of subplots based off of a Pandas groupby object. I would like each plot to be based off of two columns of data for one group of the groupby object. Fake data set:
C1,C2,C3,C4
1,12,125,25
2,13,25,25
3,15,98,25
4,12,77,25
5,15,889,25
6,13,56,25
7,12,256,25
8,12,158,25
9,13,158,25
10,15,1366,25
I have tried the following code:
import pandas as pd
import csv
import matplotlib as mpl
import matplotlib.pyplot as plt
import math
#Path to CSV File
path = "..\\fake_data.csv"
#Read CSV into pandas DataFrame
df = pd.read_csv(path)
#GroupBy C2
grouped = df.groupby('C2')
#Figure out number of rows needed for 2 column grid plot
#Also accounts for odd number of plots
nrows = int(math.ceil(len(grouped)/2.))
#Setup Subplots
fig, axs = plt.subplots(nrows,2)
for ax in axs.flatten():
    for i,j in grouped:
        j.plot(x='C1',y='C3', ax=ax)
plt.savefig("plot.png")
But it generates 4 identical subplots with all of the data plotted on each (see example output below):
I would like to do something like the following to fix this:
for i,j in grouped:
    j.plot(x='C1',y='C3',ax=axs)
    next(axs)
but I get this error
AttributeError: 'numpy.ndarray' object has no attribute 'get_figure'
I will have a dynamic number of groups in the groupby object I want to plot, and many more elements than the fake data I have provided. This is why I need an elegant, dynamic solution and each group data set plotted on a separate subplot.
Sounds like you want to iterate over the groups and the axes in parallel, so rather than having nested for loops (which iterate over all groups for each axis), you want something like this:
for (name, group), ax in zip(grouped, axs.flat):
    group.plot(x='C1',y='C3', ax=ax)
You have the right idea in your second code snippet, but you're getting an error because axs is an array of axes, whereas plot expects a single axis. It should also work to create an iterator once before the loop with ax_iter = iter(axs.flat), replace next(axs) in your example with ax = next(ax_iter) at the top of the loop body, and change the argument of plot to ax=ax.
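Putting the pieces together, a self-contained sketch using the fake data from the question could look like this. The inlined CSV text, the subplot titles and the step that hides the unused axis are additions for illustration.

```python
import io
import math

import matplotlib.pyplot as plt
import pandas as pd

# The fake data set from the question, inlined so the sketch is self-contained.
csv_text = """C1,C2,C3,C4
1,12,125,25
2,13,25,25
3,15,98,25
4,12,77,25
5,15,889,25
6,13,56,25
7,12,256,25
8,12,158,25
9,13,158,25
10,15,1366,25
"""
df = pd.read_csv(io.StringIO(csv_text))
grouped = df.groupby("C2")

ncols = 2
nrows = math.ceil(grouped.ngroups / ncols)
# squeeze=False keeps axs two-dimensional even when nrows == 1.
fig, axs = plt.subplots(nrows, ncols, squeeze=False)

# Pair each group with its own axis.
for (name, group), ax in zip(grouped, axs.flat):
    group.plot(x="C1", y="C3", ax=ax, title=f"C2 = {name}")

# Hide any axes left over when the number of groups is odd.
for ax in axs.flat[grouped.ngroups:]:
    ax.set_visible(False)
```

Because zip stops at the shorter of its inputs, the loop works for any number of groups as long as the grid has at least that many axes.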
I have parsed a very large dataframe with several columns and values like this:
Name Age Points ...
XYZ 42 32pts ...
ABC 41 32pts ...
DEF 32 35pts
GHI 52 35pts
JHK 72 35pts
MNU 43 42pts
LKT 32 32pts
LKI 42 42pts
JHI 42 35pts
JHP 42 42pts
XXX 42 42pts
XYY 42 35pts
I have imported numpy and matplotlib.
I need to plot a graph of the number of times each value in the column 'Points' occurs. I don't need any bins for the plot; it is more of a plot to see how many times the same score occurs over a large dataset.
So essentially the bar plot (or histogram, if you can call it that) should show that 32pts occurs 3 times, 35pts occurs 5 times and 42pts occurs 4 times. If I can plot the values in sorted order, all the better. I have tried df.hist() but it is not working for me.
Any clues? Thanks.
I would plot the results of the column's value_counts method directly:
import matplotlib.pyplot as plt
import pandas
data = load_my_data()
fig, ax = plt.subplots()
data['Points'].value_counts().plot(ax=ax, kind='bar')
If you want to remove the string 'pts' from all of the elements in your column, you can do something like this:
df['points_int'] = df['Points'].str.replace('pts', '').astype(int)
That assumes they all end with 'pts'. If it varies from line to line, you need to look into regular expressions like this:
Split columns using pandas
And the official docs: http://pandas.pydata.org/pandas-docs/stable/text.html#text-string-methods
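As a sketch of the regular-expression route, str.extract can pull out the digits regardless of the suffix, and sort_index then puts the bars in score order. The DataFrame below just reproduces the 'Points' column from the question's sample.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The 'Points' column from the question's sample.
df = pd.DataFrame({"Points": ["32pts", "32pts", "35pts", "35pts", "35pts", "42pts",
                              "32pts", "42pts", "35pts", "42pts", "42pts", "35pts"]})

# Pull out the leading digits whatever the suffix is, then convert to int.
df["points_int"] = df["Points"].str.extract(r"(\d+)", expand=False).astype(int)

# value_counts sorts by frequency; sort_index orders the bars by score instead.
counts = df["points_int"].value_counts().sort_index()
ax = counts.plot(kind="bar")
```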
The Seaborn package has a countplot function that can be used to make a frequency plot:
import seaborn as sns
ax = sns.countplot(x="Points",data=df)