Plot a CDF from a frequency table in Python

I have some frequency data:
Rank Count
A 34
B 1
C 1
D 2
E 1
F 4
G 112
H 1
...
in a dictionary:
d = {"A":34,"B":1,"C":1,"D":2,"E":1,"F":4,"G":112,"H":1,.......}
The letters represent a rank from highest to lowest (A to Z), and the numbers are how many times I observed each rank in the dataset.
How can I plot the cumulative distribution function given that I already have the frequencies of my observations in the dictionary? I want to be able to see the general ranking of the observations. For example: 50% of my observations have a rank lower than E.
I have been searching for information about this, but I only find ways to plot the CDF from the raw observations, not from the counts.
Thanks in advance.

Maybe you want a bar plot with the rank on the x axis and the CDF on the y axis?
u = u"""Rank Count
A 34
B 1
C 1
D 2
E 1
F 4
G 112
H 1"""
import io
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO(u), delim_whitespace=True)
df["Cum"] = df.Count.cumsum()/df.Count.sum()
df.plot.bar(x="Rank", y="Cum")
plt.show()
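If you then want to read off the rank at which the CDF crosses a given level (the "50% of my observations" example from the question), here is a small follow-up sketch, still assuming the df with the "Cum" column built above (median_rank is just an illustrative name):
# First rank whose cumulative share reaches 0.5
median_rank = df.loc[df["Cum"].ge(0.5).idxmax(), "Rank"]
print(median_rank)  # prints "G" with the sample counts, since G alone holds most observations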

Related

Python Pandas. Describe() by date

I would like to plot summary statistics over time for panel data. The X axis would be time and the Y axis would be the variable of interest with lines for Mean, min/max, P25, P50, P75 etc.
This would basically loop through and calc the stats for each date over all the individual observations and then plot them.
What I am trying to do is similar to the code below, but with dates on the x axis instead of the column numbers.
import numpy as np
import pandas as pd
# Create random data
rd = pd.DataFrame(np.random.randn(100, 10))
rd.describe().T.drop('count', axis=1).plot()
In my dataset, the time series of each individual are stacked on top of one another.
I tried running the following, but I seem to get the descriptive stats of the entire dataset rather than broken down by date.
rd = rd.groupby('period').count().describe()
print(rd)
Using the dataframe below as the example:
df = pd.DataFrame({'Values':[10,20,30,20,40,60,40,80,120],'period': [1,2,3,1,2,3,1,2,3]})
df
Values period
0 10 1
1 20 2
2 30 3
3 20 1
4 40 2
5 60 3
6 40 1
7 80 2
8 120 3
Now, plotting the descriptive statistics by period using groupby:
df.groupby('period').describe()['Values'].drop('count', axis = 1).plot()
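For reference, a self-contained sketch of the same idea using the example data above (the column names 'Values' and 'period' come from that example; swap in your own date column):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Values': [10, 20, 30, 20, 40, 60, 40, 80, 120],
                   'period': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

# One row of summary statistics per period; drop the observation count
stats = df.groupby('period')['Values'].describe().drop('count', axis=1)
stats.plot()   # period on the x axis, one line per statistic (mean, std, min, 25%, ...)
plt.show()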

How to determine which row in dataframe has most even and highest distribution

I would like to sort a pandas dataframe by the rows which have the most even distribution but also high values. For example:
Row Attribute1 Attribute2 Attribute3
a 1 1 108
b 10 2 145
c 50 60 55
d 100 90 120
e 20 25 23
f 1000 30 0
Rows d and c should rank the highest, ideally d followed by c.
I considered using the standard deviation to identify the most even distribution and the mean to get the highest average values, but I'm unsure how to combine the two.
As the notion of "even distribution" you mention is quite subjective, here is a way to implement the coefficient of variation mentioned by @ALollz.
df.std(axis=1) / df.mean(axis=1)
Row 0
a 1.6848130582715446
b 1.535375387727906
c 0.09090909090909091
d 0.14782502241793033
e 0.11102697698927574
f 1.6569547684031352
This metric is the standard deviation expressed as a fraction of the mean. If a row has a mean of 10 and a standard deviation of 1, the ratio is 10%, or 0.1.
In this example, the row that could be considered most 'evenly distributed' is row c: its mean is 55 and its standard deviation is 5, so the ratio is about 9%.
This way, you can have a decent overview of the homogeneity of the distribution.
If you want the ranking, you can apply .sort_values:
(df.std(axis=1) / df.mean(axis=1)).sort_values()
Row 0
c 0.09090909090909091
e 0.11102697698927574
d 0.14782502241793033
b 1.535375387727906
f 1.6569547684031352
a 1.6848130582715446
My last words would be to not be fooled by our brain's perception: it can be easily tricked by statistics.
Now, if you also want to favour rows with higher values, you can divide this coefficient by the mean once more: the higher the mean, the lower the resulting score.
(df.std(axis=1) / df.mean(axis=1)**2).sort_values()
Row 0
d 0.0014305647330767452
c 0.001652892561983471
f 0.004826081849717869
e 0.004898248984820989
b 0.029338383204991835
a 0.045949447043769395
And now we obtain the desired ranking: d first, then c, f, e, b and a.
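For completeness, a runnable sketch of this ranking built from the example data in the question (score is just an illustrative name):
import pandas as pd

df = pd.DataFrame({'Attribute1': [1, 10, 50, 100, 20, 1000],
                   'Attribute2': [1, 2, 60, 90, 25, 30],
                   'Attribute3': [108, 145, 55, 120, 23, 0]},
                  index=list('abcdef'))

# Coefficient of variation divided by the mean once more, so rows that are
# both evenly spread and high-valued come out first
score = df.std(axis=1) / df.mean(axis=1) ** 2
print(score.sort_values())   # d, c, f, e, b, a with this data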

Pandas: Calculating a Z-score to avoid "look ahead" bias

I have time series data in dataframe named "df", and, my code for calculating the z-score is given below:
mean = df.mean()
standard_dev = df.std()
z_score = (df - mean) / standard_dev
I would like to calculate the z-score for each observation using the respective observation and data that was known at the point of recording the observation. i.e. I do not want to use a standard deviation and mean that incorporates data that occurs after a specific point in time. I just want to use data from time t, t-1, t-2....
How do I do this?
Use .expanding(), where col is the column you want to compute the statistics for (drop the [col] if you want to compute it for the whole dataframe).
You might need to sort by the time column first, denoted time_col here, in case it is not sorted already:
df=df.sort_values("time_col", axis=0)
Then:
df[col].sub(df[col].expanding().mean()).div(df[col].expanding().std())
Ref:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.expanding.html
For the sample data:
import pandas as pd
df=pd.DataFrame({"a": list("xyzpqrstuv"), "b": [6,5,7,1,-9,0,3,5,2,8]})
df["c"]=df["b"].sub(df["b"].expanding().mean()).div(df["b"].expanding().std())
Outputs:
a b c
0 x 6 NaN
1 y 5 -0.707107
2 z 7 1.000000
3 p 1 -1.425880
4 q -9 -1.677484
5 r 0 -0.281450
6 s 3 0.210502
7 t 5 0.534207
8 u 2 -0.046142
9 v 8 1.062430
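If you instead wanted the mean and std to be based only on strictly earlier observations (excluding time t itself), a small variation on the same idea is to shift the expanding statistics by one row (the column name "d" here is just illustrative):
# Mean/std over rows 0..t-1 only, so the current value never influences its own score
df["d"] = df["b"].sub(df["b"].expanding().mean().shift()).div(df["b"].expanding().std().shift())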
You could assign two new columns containing the mean and std of all items up to and including the current one. I assume here that your time series data is in the column 'time_series_data':
import numpy as np

len_ = len(df)
# Note: np.mean/np.std are the population statistics (ddof=0), unlike pandas'
# default sample std (ddof=1) used in the .expanding() answer above.
df['mean_past'] = [np.mean(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['std_past'] = [np.std(df['time_series_data'][0:lv+1]) for lv in range(len_)]
df['z_score'] = (df['time_series_data'] - df['mean_past']) / df['std_past']
Edit: if you want to z-score all columns, you could define a function that computes the z-score and apply it to all columns of your dataframe:
def z_score_column(column):
    # Expanding mean and (population) std up to and including each row
    len_ = len(column)
    mean = [np.mean(column[0:lv+1]) for lv in range(len_)]
    std = [np.std(column[0:lv+1]) for lv in range(len_)]
    return [(c - m) / s for c, m, s in zip(column, mean, std)]

df = pd.DataFrame(np.random.rand(10, 5))
df.apply(z_score_column)

Stacked Histograms of Grouped Data In pandas

Say that I have a dataframe (df) with lots of values, including two columns, X and Y. I want to create a stacked histogram where each bin is a categorical value in X (say A and B), and inside each bin are stacks by values in Y (say a, b, c, ...).
I can run df.groupby(["X","Y"]).size() to get output like below, but how can I make the stacked histogram from this?
A a 14
b 41
c 4
d 2
e 2
f 15
g 1
h 3
B a 18
b 37
c 1
d 3
e 1
f 17
g 2
So, I think I figured this out. First, one needs to unstack the data using .unstack(level=-1)
This will turn it into an n by m array-like structure, where n is the number of X entries and m is the number of Y entries. From this form you can follow the outline given here:
http://pandas.pydata.org/pandas-docs/stable/visualization.html
So in total the command will be:
df.groupby(["X","Y"]).size().unstack(level=-1).plot(kind='bar',stacked=True)
Kinda unwieldy looking though!
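For completeness, a self-contained sketch of that one-liner (the sample values here are made up; X and Y are the column names from the question):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'X': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'Y': ['a', 'b', 'a', 'a', 'b', 'c', 'c']})

# Rows: categories of X, columns: categories of Y, values: counts
counts = df.groupby(['X', 'Y']).size().unstack(level=-1)
counts.plot(kind='bar', stacked=True)
plt.show()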

apply pandas qcut function to subgroups

Let us assume we created a dataframe df using the code below. I have created a bin frequency count based on the 'value' column in df. Now how do I get the frequency count of the label=1 samples based on the previously created bins? Obviously, I should not run qcut again on just the label=1 samples, since the bin edges would not be the same as before.
import numpy as np
import pandas as pd
mu, sigma = 0, 0.1
theta = 0.3
s = np.random.normal(mu, sigma, 100)
group = np.random.binomial(1, theta, 100)
df = pd.DataFrame(np.vstack([s,group]).transpose())
df.columns = ['value','label']
factor = pd.qcut(df['value'], 5)
factor_bin_count = pd.value_counts(factor)
Update: I took the solution from Jeff:
df.groupby(['label',factor]).value.count()
If I understand your question: you want to take a grouping factor (e.g. the one you created using qcut to bin the continuous values) and another grouper (e.g. 'label'), then perform an operation, count in this case.
In [36]: df.groupby(['label',factor]).value.count()
Out[36]:
label value
0 [-0.248, -0.0864] 14
(-0.0864, -0.0227] 15
(-0.0227, 0.0208] 15
(0.0208, 0.0718] 17
(0.0718, 0.24] 13
1 [-0.248, -0.0864] 6
(-0.0864, -0.0227] 5
(-0.0227, 0.0208] 5
(0.0208, 0.0718] 3
(0.0718, 0.24] 7
Name: value, dtype: int64
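If you prefer the counts as a label-by-bin table rather than a MultiIndex Series, pd.crosstab gives an equivalent view (a sketch using the df and factor from the setup code above):
# Rows: label (0/1), columns: the qcut bins, values: counts
pd.crosstab(df['label'], factor)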
