pandas groupby sum area plot - python

I'm looking to make a stacked area plot over time, based on summary data created by groupby and sum.
The groupby and sum part correctly groups and sums the data I want, but it seems the resultant format is nonsense in terms of plotting it.
I'm not sure where to go from here:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df=pd.DataFrame({'invoice':[1,2,3,4,5,6],'year':[2016,2016,2017,2017,2017,2017],'part':['widget','wonka','widget','wonka','wonka','wonka'],'dollars':[10,20,30,10,10,10]})
#drop the invoice number from the data since we don't need it
df=df[['dollars','part','year']]
#group by year and part, and add them up
df=df.groupby(['year','part']).sum()
#plotting this is nonsense:
df.plot.area()
plt.show()

To chart multiple series, it's easiest to have each series organized as a separate column, i.e. replace
df=df.groupby(['year','part']).sum()
with
df=df.groupby(['year', 'part']).sum().unstack(-1)
Then the rest of the code should work. But, I'm not sure if this is what you need because the desired output is not shown.
df.plot.area() then produces a stacked area chart with one band per part.
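For reference, here is a minimal end-to-end sketch of the reshaping step, using the data from the question. Selecting the 'dollars' column before summing and the variable name wide are my own choices, not part of the answer above, and the Agg backend is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017, 2017, 2017],
    "part": ["widget", "wonka", "widget", "wonka", "wonka", "wonka"],
    "dollars": [10, 20, 30, 10, 10, 10],
})

# Sum dollars per (year, part), then move 'part' from the index into columns
# so that each part becomes its own series.
wide = df.groupby(["year", "part"])["dollars"].sum().unstack("part")

# One stacked band per column, with year on the x-axis.
ax = wide.plot.area()
```

Printing wide should show one row per year and one column per part, which is the layout the area plot expects.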

Related

Plotting using multi index values

I have an excel file with the following data:
My code so far:
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_excel('file.xlsx', header=[0,1], index_col=[0])
Firstly, am I reading in my file correctly to get a multi-index with Main (A,B,C) as the first level and Value (X,Y) as the second level?
Using Pandas and Matplotlib, how do I plot an individual scatter plot for each Main (A,B,C), with its x,y columns as the scatter values (imaged below)? I can do it messily by calling each column in an individual plot function.
Is there a nicer way to do it with multi-indexing or group by?
This should help:
df = df.set_index(['Main1', 'Main2']).value
df.unstack().plot(kind='line', stacked=True)
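If the file really is read with two header rows, another way to get one scatter plot per Main group is to loop over the first column level. This is only a sketch: the frame below is made-up stand-in data with the column layout described in the question (Main A, B, C over Value X, Y), and the level names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel('file.xlsx', header=[0, 1], index_col=[0])
cols = pd.MultiIndex.from_product([["A", "B", "C"], ["X", "Y"]],
                                  names=["Main", "Value"])
data = pd.DataFrame(np.arange(24).reshape(4, 6), columns=cols)

# Unique labels of the first column level, in order of appearance.
mains = data.columns.get_level_values("Main").unique()

fig, axes = plt.subplots(1, len(mains), figsize=(9, 3))
for ax, main in zip(axes, mains):
    sub = data[main]              # selecting a top-level label leaves columns X and Y
    ax.scatter(sub["X"], sub["Y"])
    ax.set_title(main)
```

Selecting data[main] drops the first level, so the X and Y columns can be addressed by name inside the loop.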

Scatter Plot with different color for positive and negative values

Here is my problem
This is a sample of my two DataFrames (I have 30 columns in reality)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame({"Marc":[6,0,8,-30,-15,0,-3],
"Elisa":[0,1,0,-1,0,-2,-4],
"John":[10,12,24,-20,7,-10,-30]})
df1 = pd.DataFrame({"Marc":[8,2,15,-12,-8,0,-35],
"Elisa":[4,5,7,0,0,1,-2],
"John":[20,32,44,-30,15,-10,-50]})
I would like to create a scatter plot with two different colors :
1 color if the scores of df1 are negative and one if they are positive, but I don't really know how to do it.
I already did that by using matplotlib
plt.scatter(df,df1);
I also checked a linked answer, but the problem is that I have two Pandas DataFrames
and not numpy arrays as in that answer. Hence I can't use the c=np.sign(df.y) method.
I would like to keep Pandas DataFrames as I have many columns, but I'm really stuck on that.
If anyone has a solution, you are welcome!
You can pass the color array in, but it seems to work with 1D array only:
import numpy as np
# colors as stated: one color where df1 is negative, another otherwise
colors = np.where(df1<0, 'C0', 'C1')
# stack and ravel to turn into 1D
plt.scatter(df.stack(),df1.stack(), c=colors.ravel())
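A self-contained variant of the same idea, in case it helps: here the colors are computed from the stacked df1 directly, so they are 1D from the start and line up with the stacked points. The names x, y and colors are mine, and the Agg backend is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({"Marc": [6, 0, 8, -30, -15, 0, -3],
                   "Elisa": [0, 1, 0, -1, 0, -2, -4],
                   "John": [10, 12, 24, -20, 7, -10, -30]})
df1 = pd.DataFrame({"Marc": [8, 2, 15, -12, -8, 0, -35],
                    "Elisa": [4, 5, 7, 0, 0, 1, -2],
                    "John": [20, 32, 44, -30, 15, -10, -50]})

# Flatten both frames to 1D Series that share the same (row, column) index.
x = df.stack()
y = df1.stack()

# 'C0' where the df1 score is negative, 'C1' otherwise (zero counts as non-negative).
colors = np.where(y < 0, "C0", "C1")

fig, ax = plt.subplots()
ax.scatter(x, y, c=colors)
```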

How to plot a single column based on its sorted values

Let's assume I have a dataframe with one column.
I sorted it, but when I try to plot the sorted values, the points are plotted according to their original indices, not in sorted order.
How can I get a plot that follows the sorted values?
I want the plot to be a curve declining from top to bottom.
Ex code:
import pandas as pd
import matplotlib.pyplot as plt
a=pd.DataFrame()
a['col']=(4,5,8,10,1,0,15,20)
a_sorted=a.sort_values(by='col',ascending=False)
plt.plot(a_sorted)
I believe you need to restore the default index with reset_index and the drop=True parameter:
a_sorted=a.sort_values(by='col',ascending=False).reset_index(drop=True)
Then DataFrame.plot also works:
a_sorted.plot()
Another solution is plotting the underlying numpy array, obtained with .values:
a_sorted=a.sort_values(by='col',ascending=False)
plt.plot(a_sorted.values)
Or sort in descending order with numpy (this needs import numpy as np):
plt.plot(-np.sort(-a['col']))
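To see why the reset is needed, here is a small sketch using the data from the question. After sorting, each row keeps its original index label, and plotting uses those labels as x positions. The name a_reset is mine.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

a = pd.DataFrame({"col": [4, 5, 8, 10, 1, 0, 15, 20]})
a_sorted = a.sort_values(by="col", ascending=False)

# The values are sorted, but the index still holds the original row labels.
print(a_sorted.index.tolist())

# Replace the shuffled labels with 0..n-1 so x positions follow the sorted order.
a_reset = a_sorted.reset_index(drop=True)
ax = a_reset.plot()
```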

Subplot of difference between data points imported with Pandas and conversion of time values

I'm relatively new to Python (in the process of self-teaching) and so this is proving to be quite a learning curve but I'm very happy to get to grips with it. I have a set of data points from an experiment in excel, one column is time (with the format 00:00:00:000) and a second column is the measured parameter.
I'm using pandas to read the excel document in order to produce a graph from it with time along the x-axis and the measured variable along the y-axis. However, when I plot the data, the time column becomes the data point number (i.e. 00:00:00:000 - 00:05:40:454 becomes 0 - 2000) and I'm not sure why. Could anyone please advise how to rectify this?
Secondly, I'd like to produce a subplot that shows the difference between the y-values as a function of time, basically a gradient to show the variation. Is there a way to easily calculate this and display it using pandas?
Here is my code, please do forgive how basic it is!
import pandas as pd
import matplotlib.pyplot as plt
import pylab
df = pd.read_excel('rest.xlsx', 'Sheet1')
df.plot(legend=False, grid=False)
# save before show(): after the window is closed the saved figure can come out empty
plt.savefig('myfig')
plt.show()
If you just read the Excel file, pandas will create a RangeIndex, starting at 0. To use the time information from your Excel file as the index, you have to specify the name (as a string) of the time column with the keyword argument index_col in the read_excel call:
df = pd.read_excel('rest.xlsx', 'Sheet1', index_col='name_of_time_column')
Just replace 'name_of_time_column' with the actual name of the column that contains the time information.
(Hopefully pandas will automatically parse the time information into a DatetimeIndex; your format should be fine.) The plot will then use the DatetimeIndex on the x-axis.
To get the difference between consecutive data points, use the diff method with argument 1 on your DataFrame:
difference = df.diff(1)
difference.plot(legend=False, grid=False)
Try this:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel('rest.xlsx', 'Sheet1')
X = df['Time'].tolist()  # assuming the time column is called 'Time'
Y = df['Parameter'].tolist()  # assuming the parameter column is called 'Parameter'
plt.plot(X,Y)
plt.gcf().autofmt_xdate()
plt.show()
With matplotlib you can create a figure with two axes and name them, for example, ax_df and ax_diff:
import matplotlib.pyplot as plt
fig, [ax_df, ax_diff] = plt.subplots(nrows=2, ncols=1, sharex=True)
sharex=True specifies to use the same x-axis for both subplots.
When calling plot on the DataFrame, you can redirect the output to the axis by specifying the axes with the keyword argument ax:
df.plot(ax=ax_df)
df.diff(1).plot(ax=ax_diff)
plt.show()
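In case pandas does not parse the 00:00:00:000 format on its own, here is a sketch that converts it by hand and then draws both subplots. The sample values are made up, and the column names 'Time' and 'Parameter' are assumptions carried over from the snippet above. The idea is to turn the last colon into a dot so that pd.to_timedelta accepts the string, then use elapsed seconds as the index.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the sheet read by pd.read_excel
raw = pd.DataFrame({
    "Time": ["00:00:00:000", "00:00:01:500", "00:00:03:000", "00:00:04:250"],
    "Parameter": [1.0, 1.5, 1.2, 2.0],
})

# "00:00:01:500" -> "00:00:01.500", then parse and convert to elapsed seconds.
seconds = pd.to_timedelta(
    raw["Time"].str.replace(r":(\d{3})$", r".\1", regex=True)
).dt.total_seconds()

series = raw.assign(seconds=seconds).set_index("seconds")["Parameter"]
difference = series.diff(1)  # change in y between consecutive rows; first entry is NaN

fig, (ax_df, ax_diff) = plt.subplots(nrows=2, ncols=1, sharex=True)
series.plot(ax=ax_df)
difference.plot(ax=ax_diff)
```

Note that diff gives the change in y per row, not per unit time. Dividing by the difference of the index values would give an actual gradient if the samples are unevenly spaced.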

data points connected in wrong order in line graph

I'm reading a pandas dataframe, and trying to generate a plot from it. In the plot, the data points seem to be getting connected in an order determined by ascending y value, resulting in a weird zig-zagging plot like this:
The code goes something like this:
from pandas import DataFrame as df
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
data = df.from_csv(...)
plt.plot(data['COL1'], data['COL2'])
Any suggestions on how to fix the order in which the dots are connected (i.e. connect them in the sequence in which they appear going from left to right on the plot)? Thanks.
Is the order of values in COL1 different from the csv?
You can sort by COL1 first, add this before plotting:
data.sort_values('COL1', inplace=True)
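A small sketch of the effect, with made-up values standing in for the CSV. plt.plot connects points in row order, so sorting by the x column first makes the line run left to right. The name ordered is mine.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run, as in the question
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows in an order that would zig-zag if plotted directly
data = pd.DataFrame({"COL1": [3, 1, 4, 2], "COL2": [30, 10, 40, 20]})

# Sort rows by the x column; COL2 values move together with their COL1 partner.
ordered = data.sort_values("COL1")

fig, ax = plt.subplots()
ax.plot(ordered["COL1"], ordered["COL2"])
```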
