Python Pandas. Describe() by date - python

I would like to plot summary statistics over time for panel data. The X axis would be time and the Y axis would be the variable of interest with lines for Mean, min/max, P25, P50, P75 etc.
This would basically loop through and calc the stats for each date over all the individual observations and then plot them.
What I am trying to do is similar to the example below, but the x axis would be dates instead of 0-9.
import numpy as np
import pandas as pd
# Create random data
rd = pd.DataFrame(np.random.randn(100, 10))
rd.describe().T.drop('count', axis=1).plot()
In my dataset, the time series of each individual is stacked on one another.
I tried running the following, but I seem to get the descriptive stats of the entire dataset rather than stats broken down by date.
rd = rd.groupby('period').count().describe()
print (rd)
rd.show()

Using the dataframe below as the example:
df = pd.DataFrame({'Values':[10,20,30,20,40,60,40,80,120],'period': [1,2,3,1,2,3,1,2,3]})
df
   Values  period
0      10       1
1      20       2
2      30       3
3      20       1
4      40       2
5      60       3
6      40       1
7      80       2
8     120       3
Now plot the descriptive statistics by period using groupby:
df.groupby('period').describe()['Values'].drop('count', axis = 1).plot()
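The same pattern should carry over when the grouping column holds real dates. Below is a minimal sketch, assuming a stacked panel with hypothetical columns date, individual and value (none of these names come from the question); it is an illustration, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: 5 individuals observed on 4 dates, stacked row-wise
rng = np.random.default_rng(0)
dates = pd.date_range('2020-01-01', periods=4, freq='D')
panel = pd.DataFrame({
    'date': np.tile(dates.to_numpy(), 5),
    'individual': np.repeat(np.arange(5), 4),
    'value': rng.normal(size=20),
})

# One row of summary statistics per date, computed across individuals
stats = panel.groupby('date')['value'].describe().drop(columns='count')

# Each statistic becomes one line, with the dates on the x axis
ax = stats.plot()
```

In a script, follow this with plt.show() (after import matplotlib.pyplot as plt) to display the figure.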

Related

Pandas: how to GROUPBY by number of not NaNs for each row?

I have a result from check-all-that-apply questions:
A | B | C ...
1 | NaN | 1
NaN | 1 | NaN
Where NaN means the responder did not select that option, and 1 if they selected it.
I want to group by the number of not NaNs in each row. Specifically, this is the kind of output visualization I am trying to do:
I tried using count():
df.count(axis=1).reset_index()
And I get the number of selected boxes per user, but I don't know what's next.
If the dataframe is like this, I included 1 more row so that we get values of 4+ :
import pandas as pd
import numpy as np
from matplotlib.ticker import FuncFormatter
df = pd.DataFrame({'A':[1,np.nan,1,np.nan],
'B':[np.nan,1,np.nan,np.nan],
'C':[np.nan,1,1,np.nan],
'D':[np.nan,np.nan,1,np.nan]})
df.isna().sum(axis=1) gives the number of NaNs per row. To bin these values, you can use pd.cut (to count the selected options instead, replace isna() with notna()):
labels = pd.cut(df.isna().sum(axis=1),[-np.inf,1,3,+np.inf],labels=['0-1','2-3','4+'])
labels
0 2-3
1 2-3
2 0-1
3 4+
And just plot this:
ax = (labels.value_counts(sort=False) / len(labels)).plot.bar()
ax.yaxis.set_major_formatter(FuncFormatter(lambda y, _: '{:.0%}'.format(y)))
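Since the question asks about the number of non-NaN values, here is a minimal sketch of the same idea using notna(). The bin edges and labels ('0', '1-2', '3+') are my own assumption for illustration and should be adjusted to the survey.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 1, np.nan],
                   'B': [np.nan, 1, np.nan, np.nan],
                   'C': [np.nan, 1, 1, np.nan],
                   'D': [np.nan, np.nan, 1, np.nan]})

# Number of selected (non-NaN) options per respondent
selected = df.notna().sum(axis=1)

# Bins are right-inclusive: (-inf, 0], (0, 2], (2, inf)
labels = pd.cut(selected, [-np.inf, 0, 2, np.inf], labels=['0', '1-2', '3+'])

# Share of respondents in each bin, in category order
shares = labels.value_counts(normalize=True).sort_index()
print(shares)
```

shares can then be passed to .plot.bar() exactly as in the answer above.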

How to feed random numbers as indices to pandas data frame?

I'm trying to get a random sample from two pandas frames. If rows (random) 2,5,8 are selected in frame A, then the same 2,5,8 rows must be selected from frame B. I did it by first generating a random sample and now want to use this sample as indices for rows for frame. How can I do it? The code should look like
idx = list(random.sample(range(X_train.shape[0]),5000))
lgstc_reg[i].fit(X_train[idx,:], y_train[idx,:])
However, running the code gives an error.
Use iloc:
indexes = [2,5,8] # in your case this is the randomly generated list
A.iloc[indexes]
B.iloc[indexes]
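As a minimal sketch of generating the positions randomly, assuming A and B have the same number of rows (the frames and column names below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frames with matching row order
A = pd.DataFrame({'x': np.arange(10)})
B = pd.DataFrame({'y': np.arange(10) * 10})

# Draw 3 distinct row positions once, then reuse them for both frames
rng = np.random.default_rng(42)
pos = rng.choice(len(A), size=3, replace=False)

A_s = A.iloc[pos]
B_s = B.iloc[pos]
```

Because iloc selects by position, this works regardless of what the index labels are.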
An alternative consistent sampling methodology would be to set a random seed, and then sample:
random_seed = 42
A.sample(3, random_state=random_seed)
B.sample(3, random_state=random_seed)
The sampled DataFrames will have the same index, provided A and B have the same number of rows and share the same index.
Hope this helps!
>>> df1
value ID
0 3 2
1 4 2
2 7 8
3 8 8
4 11 8
>>> df2
value distance
0 3 0
1 4 0
2 7 1
3 8 0
4 11 0
I have two data frames. I want to select random rows of df1 along with the corresponding rows of df2.
First I create selection_index, the index labels of a random sample of rows of df1, using the built-in pandas method sample. Then I use this index to locate the same rows in df1 and df2 with loc.
>>> selection_index = df1.sample(2).index
>>> selection_index
Int64Index([3, 1], dtype='int64')
>>> df1.loc[selection_index]
value ID
3 8 8
1 4 2
>>> df2.loc[selection_index]
value distance
3 8 0
1 4 0
>>>
In your case, this would become somewhat like
idx = X_train.sample(5000).index
lgstc_reg[i].fit(X_train.loc[idx], y_train.loc[idx])
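A minimal sketch of the loc approach with a non-default index, assuming X_train is a DataFrame and y_train a Series that share the same index labels (the data here is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical training data sharing one set of index labels
row_labels = [101, 205, 309, 412, 518, 623]
X_train = pd.DataFrame({'f1': np.arange(6), 'f2': np.arange(6) ** 2},
                       index=row_labels)
y_train = pd.Series([0, 1, 0, 1, 1, 0], index=row_labels, name='target')

# Sample row labels once, then select the same labels from both objects
idx = X_train.sample(3, random_state=0).index
X_sub = X_train.loc[idx]
y_sub = y_train.loc[idx]
```

Because loc selects by label, the rows stay paired even if the two objects are stored in a different row order, as long as the labels are unique.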

Pandas DataFrame - How to make a stacked area graph stack (matplotlib)

I am trying to convert data in a pandas DataFrame into a stacked area graph but cannot seem to get it to stack.
The data is in the format
index | datetime (yyyy/mm/dd) | name | weight_change
With 6 different people each measured daily.
I want the stacked graph to show the weight_change (y) over the datetime (x) but with weight_change for each of the 6 people stacked on top of each other
The closest I have been able to get to it is with:
df = df.groupby(['datetime', 'name'], as_index=False).agg({'weight_change': 'sum'})
agg = df.groupby('datetime').sum()
agg.plot.area()
This produces the area graph for the aggregate of the weight_change values (the sum of each person's weight_change for each day), but I can't figure out how to split this up for each person like the different values here:
I have tried various things with no luck. Any ideas?
A simplified version of your data:
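import numpy as np
import pandas as pd
# list(range(4)) is needed on Python 3, where range(4)*2 raises a TypeError
df = pd.DataFrame(dict(days=list(range(4))*2,
change=np.random.rand(8)*2.,
name=['John',]*4 + ['Jane',]*4))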
df:
change days name
0 0.238336 0 John
1 0.293901 1 John
2 0.818119 2 John
3 1.567114 3 John
4 1.295725 0 Jane
5 0.592008 1 Jane
6 0.674388 2 Jane
7 1.763043 3 Jane
Now we can simply use pyplot's stackplot:
import matplotlib.pyplot as plt
days = df.days[df.name == 'John']
plt.stackplot(days, df.change[df.name == 'John'],
df.change[df.name == 'Jane'])
This produces the following plot:
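Staying within pandas, a possible alternative is to pivot the long data so each person becomes a column and then call plot.area(). This is a minimal sketch, assuming the column names from the question (datetime, name, weight_change) and made-up non-negative values; pandas refuses to stack a column that mixes positive and negative values, so real weight changes may need stacked=False or separate handling.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: 2 people measured on 4 days
rng = np.random.default_rng(1)
dates = pd.date_range('2021-01-01', periods=4, freq='D')
df = pd.DataFrame({
    'datetime': np.repeat(dates.to_numpy(), 2),
    'name': ['John', 'Jane'] * 4,
    'weight_change': rng.random(8) * 2.0,
})

# One column per person, one row per day
wide = df.pivot_table(index='datetime', columns='name',
                      values='weight_change', aggfunc='sum')

# Columns are stacked on top of each other by default
ax = wide.plot.area()
```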

Multiple lines on line plot/time series with matplotlib

How do I plot multiple traces represented by a categorical variable with matplotlib or plot.ly in Python? I am trying to replicate the geom_line(aes(x=Date,y=Value,color=Group)) function from R.
Is there a way to achieve this on Python without the need to have the groups in separate columns? Do I have to restructure the data inevitably?
Let's say I have the following data:
Date Group Value
1/01/2015 A 50
2/01/2015 A 60
1/01/2015 B 100
2/01/2015 B 120
1/01/2015 C 40
2/01/2015 C 55
1/01/2015 D 36
2/01/2015 D 20
I would like date on the x axis, value on the y axis, and the group categories represented by different coloured lines/traces.
Thanks.
Assuming your data is in a pandas dataframe df, it would be hard to plot it without the groups being in separate columns, but that step is easily done in one line:
df.pivot(index="Date", columns="Group", values="Value").plot()
Complete example:
u = u"""Date Group Value
1/01/2015 A 50
2/01/2015 A 60
1/01/2015 B 100
2/01/2015 B 120
1/01/2015 C 40
2/01/2015 C 55
1/01/2015 D 36
2/01/2015 D 20"""
import io
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO(u), sep=r"\s+")
df["Date"] = pd.to_datetime(df["Date"])
df.pivot(index="Date", columns="Group", values="Value").plot()
plt.show()
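If you would rather keep the data in long format, one option is to loop over groupby and draw one line per group. This is a minimal sketch; it assumes the dates are day-first (1/01/2015 meaning 1 January), which the question does not state.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'Date': ['1/01/2015', '2/01/2015'] * 4,
    'Group': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'],
    'Value': [50, 60, 100, 120, 40, 55, 36, 20],
})
# Assumption: day/month/year ordering
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

fig, ax = plt.subplots()
# One line per group; matplotlib cycles through colours automatically
for group, sub in df.groupby('Group'):
    ax.plot(sub['Date'], sub['Value'], label=group)
ax.legend(title='Group')
```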

Why can't I set the y-axis range on a plot produced from a Pandas Series?

I'm trying to create a bar graph where the y-axis ranges from 0% - 100% using matplotlib and pandas. The range I get is only 0% - 50%. Now, since all of my bars top out at ~10%, this isn't disastrous. It's just frustrating and may interfere with comparisons to other plots with the complete range.
The code I'm using is (roughly) as follows:
from matplotlib import pyplot as plt
import pandas as pd
labels = list(cm.index) #Where cm is a DataFrame
for curr in sorted(labels):
    xa = cm[curr] # Pulls 1 column out of DataFrame to be plotted
    xplt = xa.plot(kind='bar', rot = 0, ylim = (0,1))
    xplt.set_yticklabels(['{:3.0f}%'.format(x*10) for x in range(11)])
plt.show()
Is there anything obviously wrong or missing?
A sample of a plot I get is this:
Oddly, when I omit the set_yticklabels statement, I get this:
I now realize that the first graph is not just oddly scaled, but is also giving incorrect results. The values shown in the 2nd graph are the correct ones. I guess the error is in the set_yticklabels statement, but I have no idea what it could be.
Looks like the keyword ylim works fine for pandas.DataFrame.plot.bar():
df = pd.DataFrame(np.random.randint(low=0, high=10, size=(10, 2)), columns=['low', 'high'])
df.high = df.high * 10
low high
0 3 10
1 2 0
2 7 20
3 3 90
4 7 60
5 0 40
6 1 0
7 3 70
8 1 80
9 6 90
for col in df:
    df[col].plot.bar(ylim=(0, 100))
gives:
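As for the mislabelled axis in the question: set_yticklabels assigns the given strings to whatever ticks already exist, in order, so 11 hand-made labels will not line up unless there are exactly 11 ticks at 0, 0.1, ..., 1. A formatter avoids that. This is a minimal sketch, assuming the values are fractions between 0 and 1 (the Series below is made up):

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Hypothetical column of fractions, standing in for cm[curr]
xa = pd.Series([0.08, 0.10, 0.05], index=['a', 'b', 'c'])

ax = xa.plot(kind='bar', rot=0, ylim=(0, 1))
# Format each tick from its actual position: 1.0 is shown as 100%
ax.yaxis.set_major_formatter(PercentFormatter(xmax=1.0, decimals=0))
```

The tick positions are left to matplotlib; only their text changes, so the labels always match the bars.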
