Multiple lines on line plot/time series with matplotlib - python

How do I plot multiple traces, one per level of a categorical variable, with matplotlib or plot.ly in Python? I am trying to replicate geom_line(aes(x=Date, y=Value, color=Group)) from R.
Is there a way to achieve this in Python without having the groups in separate columns, or do I inevitably have to restructure the data?
Let's say I have the following data:
Date Group Value
1/01/2015 A 50
2/01/2015 A 60
1/01/2015 B 100
2/01/2015 B 120
1/01/2015 C 40
2/01/2015 C 55
1/01/2015 D 36
2/01/2015 D 20
I would like date on the x axis, value on the y axis, and the group categories represented by different coloured lines/traces.
Thanks.

Assuming your data is in a pandas DataFrame df, it would be hard to plot it without the groups being in separate columns, but getting them there takes a single line:
df.pivot(index="Date", columns="Group", values="Value").plot()
Complete example:
u = u"""Date Group Value
1/01/2015 A 50
2/01/2015 A 60
1/01/2015 B 100
2/01/2015 B 120
1/01/2015 C 40
2/01/2015 C 55
1/01/2015 D 36
2/01/2015 D 20"""
import io
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO(u), sep=r"\s+")
df["Date"] = pd.to_datetime(df["Date"])
df.pivot(index="Date", columns="Group", values="Value").plot()
plt.show()
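
If you would rather not reshape the data at all, one possible alternative is to loop over the groups and draw each one on the same axes. This is only a sketch, not part of the original answer. It assumes the dates are day/month/year (hence dayfirst=True); the names raw, fig and ax are illustrative.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

raw = """Date Group Value
1/01/2015 A 50
2/01/2015 A 60
1/01/2015 B 100
2/01/2015 B 120
1/01/2015 C 40
2/01/2015 C 55
1/01/2015 D 36
2/01/2015 D 20"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
# Assumption: dates are day/month/year; drop dayfirst=True if they are month/day/year.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

fig, ax = plt.subplots()
# groupby yields one (name, sub-frame) pair per category; each becomes one line.
for name, group in df.groupby("Group"):
    ax.plot(group["Date"], group["Value"], label=name)
ax.set_xlabel("Date")
ax.set_ylabel("Value")
ax.legend(title="Group")
# plt.show()  # uncomment when running interactively
```

The data stays in long format throughout, which is closer in spirit to the ggplot2 call in the question.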

Related

Python Pandas. Describe() by date

I would like to plot summary statistics over time for panel data. The X axis would be time and the Y axis would be the variable of interest with lines for Mean, min/max, P25, P50, P75 etc.
This would basically loop through and calc the stats for each date over all the individual observations and then plot them.
What I am trying to do is similar to the example below, but the x axis would show dates instead of 1-10.
import numpy as np
import pandas as pd
# Create random data
rd = pd.DataFrame(np.random.randn(100, 10))
rd.describe().T.drop('count', axis=1).plot()
In my dataset, the time series of each individual is stacked on one another.
I tried running the following but I seem to get the descriptive stats of entire dataset and not broken down by date.
rd = rd.groupby('period').count().describe()
print (rd)
rd.show()
Using the dataframe below as the example:
df = pd.DataFrame({'Values':[10,20,30,20,40,60,40,80,120],'period': [1,2,3,1,2,3,1,2,3]})
df
Values period
0 10 1
1 20 2
2 30 3
3 20 1
4 40 2
5 60 3
6 40 1
7 80 2
8 120 3
Now plot the descriptive statistics by date using groupby:
df.groupby('period').describe()['Values'].drop('count', axis = 1).plot()
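
To see what actually gets plotted, it can help to look at the intermediate table first. The sketch below uses the example frame from above and selects the Values column before calling describe(), which is a small variation on the one-liner; the name stats is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Values": [10, 20, 30, 20, 40, 60, 40, 80, 120],
    "period": [1, 2, 3, 1, 2, 3, 1, 2, 3],
})

# Selecting the column before describe() gives flat column names
# (mean, std, min, 25%, 50%, 75%, max) with one row per period.
stats = df.groupby("period")["Values"].describe().drop(columns="count")
print(stats)

# stats is indexed by period, so stats.plot() would draw one line per
# statistic with the period on the x axis.
```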

pandas: modifying values in dataframe from another column

While splitting data into columns, there was some glitch, due to which I have got some noisy data.
site code
--- ---
0 apple_123 45
1 apple_456 xy_33
2 facebook_123 24
3 google_123 NaN
4 google_123 pq_51
I need to clean the data, such that I get the following result:
site code
--- ---
0 apple_123 45
1 apple_456_xy 33
2 facebook_123 24
3 google_123 NaN
4 google_123_pq 51
I have been able to obtain the rows that need to be modified, but am unable to progress further:
import numpy as np
import pandas as pd
site = ['apple_123','apple_456','facebook_123','google_123','google_123']
code = [45,'xy_33',24,np.nan,'pq_51']
df = pd.DataFrame(list(zip(site,code)), columns=['site','code'])
df[(~df.code.astype(str).str.isdigit())&(~df.code.isna())]
Use Series.str.extract to pull the non-numeric and numeric parts of code into a helper DataFrame, then process each column separately. For the first column, remove the _ with Series.str.strip, prepend _ with Series.radd, and convert missing values to empty strings before appending the result to site. For the second column, use Series.fillna to fall back to the original code wherever nothing matched:
df1 = df.code.str.extract(r'(\D+)(\d+)')
df['site'] += df1[0].str.strip('_').radd('_').fillna('')
df['code'] = df1[1].fillna(df['code'])
print (df)
site code
0 apple_123 45
1 apple_456_xy 33
2 facebook_123 24
3 google_123 NaN
4 google_123_pq 51
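
For reference, here is a self-contained sketch of the same idea that also prints the helper frame, so the effect of each step is visible. The names parts and prefix are illustrative, and plain string concatenation is used in place of radd.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "site": ["apple_123", "apple_456", "facebook_123", "google_123", "google_123"],
    "code": [45, "xy_33", 24, np.nan, "pq_51"],
})

# Non-string entries (45, 24, NaN) yield NaN in both captured columns.
parts = df["code"].str.extract(r"(\D+)(\d+)")
print(parts)

# Append "_<prefix>" to site only where a prefix was captured.
prefix = parts[0].str.strip("_")
df["site"] = df["site"] + ("_" + prefix).fillna("")
# Take the extracted digits where present, otherwise keep the original code.
df["code"] = parts[1].fillna(df["code"])
print(df)
```

Note that the extracted codes are strings ("33", "51") while the untouched ones keep their original type, so a further conversion may be needed if a numeric column is required.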

Plot a CDF from a frequency table in Python

I have some frequency data:
Rank Count
A 34
B 1
C 1
D 2
E 1
F 4
G 112
H 1
...
in a dictionary:
d = {"A":34,"B":1,"C":1,"D":2,"E":1,"F":4,"G":112,"H":1,.......}
The letters represent a rank from highest to lowest (A to Z), and the numbers are how many times I observed each rank in the dataset.
How can I plot the cumulative distribution function given that I already have the frequencies of my observations in the dictionary? I want to be able to see the general ranking of the observations. For example: 50% of my observations have a rank lower than E.
I have been searching for info about this but I always find ways to plot the CDF from the raw observations but not from the counts.
Thanks in advance.
Maybe you want to plot a bar plot with the rank on the x axis and the cdf on the y axis?
u = u"""Rank Count
A 34
B 1
C 1
D 2
E 1
F 4
G 112
H 1"""
import io
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO(u), sep=r"\s+")
df["Cum"] = df.Count.cumsum()/df.Count.sum()
df.plot.bar(x="Rank", y="Cum")
plt.show()
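
Since the counts are already in a dictionary, a possible alternative is to skip the text parsing and build the cumulative shares directly. This sketch uses only the eight ranks shown in the question (the real dictionary is longer) and draws a step plot rather than bars; ranks, counts and cdf are illustrative names.

```python
import matplotlib.pyplot as plt
import numpy as np

# Only the eight ranks listed in the question; the original dictionary is longer.
d = {"A": 34, "B": 1, "C": 1, "D": 2, "E": 1, "F": 4, "G": 112, "H": 1}

ranks = sorted(d)  # alphabetical order, i.e. highest to lowest rank
counts = np.array([d[r] for r in ranks])
# Running total divided by the grand total gives the empirical CDF.
cdf = np.cumsum(counts) / counts.sum()

fig, ax = plt.subplots()
ax.step(ranks, cdf, where="post")
ax.set_xlabel("Rank")
ax.set_ylabel("Cumulative share of observations")
# plt.show()  # uncomment when running interactively
```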

pandas groupby: how to calculate percentage of total?

How can I calculate a column showing the % of total in a groupby?
One way to do it is to calculate it manually after the groupby, as in the last line of this example:
import numpy as np
import pandas as pd
df= pd.DataFrame(np.random.randint(5,8,(10,4)), columns=['a','b','c','d'])
g = df.groupby('a').agg({'b':['sum','mean'], 'c':['sum'], 'd':['sum']})
g.columns = g.columns.map('_'.join)
g['b %']=g['b_sum']/g['b_sum'].sum()
However, in my real data I have many more columns, and I'd need the % right after the sum, so with this approach I'd have to manually change the order of the columns.
Is there a more direct way of doing it so that the % is the column right after the sum? Note that I need the agg(), or something equivalent, because in all my groupbys I apply different aggregate functions to different columns (e.g. sum and avg of x, but only the min of y, etc.).
I think you need a lambda function in agg, and then to replace <lambda> with % in the column names:
np.random.seed(78)
df= pd.DataFrame(np.random.randint(5,8,(10,4)), columns=['a','b','c','d'])
g =(df.groupby('a')
.agg({'b':['sum',lambda x: x.sum()/ df['b'].sum(),'mean'],
'c':['sum'],
'd':['sum']}))
g.columns = g.columns.map('_'.join).str.replace('<lambda>','%')
print (g)
d_sum c_sum b_sum b_% b_mean
a
5 25 24 24 0.387097 6
6 11 11 14 0.225806 7
7 22 23 24 0.387097 6
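
If there are many summed columns, another option is to aggregate first and then insert each percentage column at the position right after its sum. The sketch below is not from the original answer. It assumes the flattened names end in _sum, and the variable position is illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(78)
df = pd.DataFrame(np.random.randint(5, 8, (10, 4)), columns=["a", "b", "c", "d"])

g = df.groupby("a").agg({"b": ["sum", "mean"], "c": ["sum"], "d": ["sum"]})
g.columns = g.columns.map("_".join)

# Insert each percentage column directly after its *_sum column.
# The list of sum columns is built up front because g.columns changes in the loop.
for col in [c for c in g.columns if c.endswith("_sum")]:
    position = g.columns.get_loc(col) + 1
    g.insert(position, col.replace("_sum", "_%"), g[col] / g[col].sum())
print(g)
```

DataFrame.insert modifies g in place, so no manual reordering of the columns is needed afterwards.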

Grouped bar chart from two pandas data frames

I have two data frames containing different values but the same structure:
df1 =
0 1 2 3 4
D 0.003073 0.014888 0.155815 0.826224 NaN
E 0.000568 0.000435 0.000967 0.002956 0.067249
df2 =
0 1 2 3 4
D 0.746689 0.185769 0.060107 0.007435 NaN
E 0.764552 0.000000 0.070288 0.101148 0.053499
I want to plot both data frames in a single grouped bar chart. In addition, each row (index) should be a subplot.
This can be easily achieved for one of them using pandas directly:
df1.T.plot(kind="bar", subplots=True, layout=(2,1), width=0.7, figsize=(10,10), sharey=True)
I tried to join them using
pd.concat([df1, df2], axis=1)
which results in a new dataframe:
0 1 2 3 4 0 1 2 3 4
D 0.003073 0.014888 0.155815 0.826224 NaN 0.746689 0.185769 0.060107 0.007435 NaN
E 0.000568 0.000435 0.000967 0.002956 0.067249 0.764552 0.000000 0.070288 0.101148 0.053499
However, plotting the data frame with the above method will not group the bars per column but rather treats them separately. Per subplot this results in a x-axis with duplicated ticks in order of the columns, e.g. 0,1,2,3,4,0,1,2,3,4.
Any ideas?
It is not exactly clear how the data is organized. Pandas and seaborn usually expect tidy datasets. Because you transpose the data prior to plotting, I assume you have two variables (A and B) and four observations (e.g. measurements):
import numpy as np
import pandas as pd
df1 = pd.DataFrame.from_records(np.random.rand(2,4), index = ['A','B'])
df2 = pd.DataFrame.from_records(np.random.rand(2,4), index = ['A','B'])
df1.T
Maybe this is close to what you want:
df4 = pd.concat([df1.T, df2.T], axis=0, ignore_index=False)
df4['col'] = (len(df1.T)*(0,) + len(df2.T)*(1,))
df4.reset_index(inplace=True)
df4
Seaborn's facet grid then allows for convenient plotting (factorplot was renamed to catplot in newer seaborn versions):
import seaborn as sns
sns.catplot(x='index', y='A', hue='col', kind='bar', data=df4)
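
A pandas-only sketch of the layout asked for in the question (one subplot per row, bars grouped by column) might look like the following. It rebuilds the two frames from the values printed in the question; the names cols, pair, fig and axes are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

cols = [0, 1, 2, 3, 4]
df1 = pd.DataFrame([[0.003073, 0.014888, 0.155815, 0.826224, np.nan],
                    [0.000568, 0.000435, 0.000967, 0.002956, 0.067249]],
                   index=["D", "E"], columns=cols)
df2 = pd.DataFrame([[0.746689, 0.185769, 0.060107, 0.007435, np.nan],
                    [0.764552, 0.000000, 0.070288, 0.101148, 0.053499]],
                   index=["D", "E"], columns=cols)

fig, axes = plt.subplots(len(df1.index), 1, figsize=(10, 10), sharey=True)
for ax, row in zip(axes, df1.index):
    # One small frame per row: index = original columns, columns = source frame.
    pair = pd.DataFrame({"df1": df1.loc[row], "df2": df2.loc[row]})
    # A two-column frame plotted as bars gives two side-by-side bars per tick.
    pair.plot(kind="bar", ax=ax, width=0.7, title=row)
# plt.show()  # uncomment when running interactively
```

Each x tick then appears once per subplot, with the df1 and df2 bars next to each other instead of the duplicated 0,1,2,3,4,0,1,2,3,4 axis.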
