Plot several densities on one plot - python

I have a data frame with a MultiIndex (expenditure, groupid):
                          coef     stderr    N
expenditure groupid
TOTEXPCQ    176       3745.124   858.1998   81
            358      -1926.703   1036.636   75
            109        239.3678   639.373  280
            769       6406.512   1823.979   96
            775       2364.655   1392.187  220
I can get the density using df['coef'].plot(kind='density'). I would like to group these densities by the outer level of the MultiIndex (expenditure), and draw the different densities for different levels of expenditure into the same plot.
How would I achieve this? Bonus: label the different expenditure graphs with the 'expenditure' value
Answer
My initial approach was to merge the different KDEs by generating one ax object and passing it along, but the accepted answer inspired me to instead build one df with the group identifiers as columns:
import numpy as np
import pandas as pd

n = 25
df = pd.DataFrame({'expenditure': np.random.choice(['foo', 'bar'], n),
                   'groupid': np.random.choice(['one', 'two'], n),
                   'coef': np.random.randn(n)})

df2 = df[['expenditure', 'coef']].pivot_table(index=df.index, columns='expenditure', values='coef')
df2.plot(kind='kde')
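For reference, here is a minimal sketch of that first ax-passing approach (assuming the same toy frame), in case you prefer to keep the data in long format:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

n = 25
df = pd.DataFrame({'expenditure': np.random.choice(['foo', 'bar'], n),
                   'coef': np.random.randn(n)})

# One shared Axes: each expenditure group draws its KDE onto it with its own label
fig, ax = plt.subplots()
for name, group in df.groupby('expenditure'):
    group['coef'].plot(kind='kde', ax=ax, label=name)
ax.legend()
plt.show()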

Wow, that ended up being much harder than I expected. Seemed easy in concept, but (yet again) concept and practice really differed.
Set up some toy data:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

n = 25
df = pd.DataFrame({'expenditure': np.random.choice(['foo', 'bar'], n),
                   'groupid': np.random.choice(['one', 'two'], n),
                   'coef': np.random.randn(n)})
Then group by expenditure, iterate through each expenditure, pivot the data, and plot the kde:
gExp = df.groupby('expenditure')
for exp in gExp:
    print(exp[0])
    g = exp[1][['groupid', 'coef']].reset_index(drop=True)
    gpt = g.pivot_table(index=g.index, columns='groupid', values='coef')
    gpt.plot(kind='kde').set_title(exp[0])
plt.show()
Results in one KDE figure per expenditure value, each titled with that value.
It took some trial and error to figure out that the data had to be pivoted before plotting.

Related

Cross tab with mean of one column

Using pandas and Matplotlib, how can I make a bar plot from a crosstab of two columns, where one column is reduced to its mean? Here is an example of my data set:
score  lunch  setting
   70      N      Sub
   69      N      Sub
   62      Y      Urb
   78      N        R
   60      Y        R
   58      Y      Urb
   80      N      Sub
   75      N      Urb
   70      N        R
   70      N      Urb
   69      N      Sub
   70      N      Urb
What I would like to do is get
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("my file path")  # df is my data frame
pd.crosstab(df["score"], df["lunch"]).plot(kind="bar", figsize=(8, 2))
plt.show()
with the "score" column being the mean of all scores rather than the individual scores.
After running plt.show() this is the plot that I get:
What I would like is two bars, side by side, with the y-axis showing the mean score for lunch 'N' and the mean score for lunch 'Y'.
I have tried
df_grouped = df.groupby(["lunch"])["score"].mean()
df_grouped.plot(kind="bar", figsize=(7, 2))
This seems to look alright except I would like to be able to get the legend and have the two bars be side by side. Here is what it looks like by grouping first:
I would like to know if I can do this by using crosstab first without having to group? I need to keep the legend and also have the two bars side by side.
My thought would be something that looks like this:
pd.crosstab(df["score"].mean(), df["lunch"]).plot(kind="bar",figsize=(6,3))
Getting the mean of each lunch using crosstab.
Try with to_frame:
df.groupby('lunch')['score'].mean().to_frame().T.plot.bar()
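For context, a self-contained sketch of that idea on the sample data above: .mean() collapses the scores to one value per lunch group, and .to_frame().T turns the groups into columns, so plot.bar() draws them side by side with a legend.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'score': [70, 69, 62, 78, 60, 58, 80, 75, 70, 70, 69, 70],
                   'lunch': ['N', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'N', 'N', 'N', 'N'],
                   'setting': ['Sub', 'Sub', 'Urb', 'R', 'R', 'Urb',
                               'Sub', 'Urb', 'R', 'Urb', 'Sub', 'Urb']})

# One row ('score'), one column per lunch value -> two bars side by side
ax = df.groupby('lunch')['score'].mean().to_frame().T.plot.bar(figsize=(6, 3))
ax.set_ylabel('mean score')
plt.show()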

Is there a way to cut only the first gap from a histogram and take all the remaining values in Python?

I have a data frame with the fields 'unique years' and 'counts'. I plotted this data frame and I am getting the following histogram (histogram - example). I need to define a start-year variable, but if there are empty gaps at the starting point of the histogram I need to skip them and shift the starting year. I was wondering if there is a pythonic way to do this. In the histogram - example plot, there is a non-empty bin at the starting point, but then there is a big gap of empty bins. So I need to find the point from which the non-empty bins are continuous and define that point as the starting year (for the sample below, the starting year should be 1935). The n numpy.ndarray gives me information about which bins are empty, but I need an efficient way to resolve this. Thank you :)
Sample of my data frame:
import pandas as pd

data = {'unique_years': [1907, 1935, 1938, 1939, 1940],
        'counts': [11, 14, 438, 85, 8]}
df = pd.DataFrame(data, columns=['unique_years', 'counts'])
Code for the histogram plot:
import matplotlib.pyplot as plt

(n, bins, patches) = plt.hist(df.unique_years, bins=25, label='hst')
plt.show()
The issue with your question is that 'continuous' is not really well defined here. Do you mean that every year should have a non-empty count (that is fairly easy, as you can filter your data for that prior to building your histogram), or should every consecutive bucket be non-empty? If the latter, this means that you must:
1. Build your histogram.
2. Filter your data on the resulting bins.
3. Either use the filtered histogram or re-bin the remaining data, with bin sizes not guaranteed to stay the same (so it is possible that you have the same issue with the new bins!).
As it is difficult to know exactly what is relevant in your exact case, I think the best answer would be to give you a set of tools that you can use as you see fit for the exact problem that you are encountering:
I want to filter my data starting from a certain date:
filtered = df.unique_years[df.unique_years > 1930]
I want to find the second non-empty bin:
import numpy as np

(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
From there you can:
rebin your filtered data:
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Re-binning on the filtered data (filter on the bin edge x, not the count n)
plt.hist(df.unique_years[df.unique_years >= x[second_nonempty]], bins=25)
Plot your histogram directly on the filtered bins:
(n, x) = np.histogram(df.unique_years, bins=25)
second_nonempty = np.where(n > 0)[0][1]
# Forcing the bins to take the provided values
plt.hist(df.unique_years, bins=x[second_nonempty:])
Now the 'second_nonempty' above can of course be replaced by any estimator of where you want to start, e.g.:
# Last empty bin + 1
all_bins_full_after = np.where(n == 0)[0][-1] + 1
Or anything else, really.
This should work to eliminate all the bins that are not consecutive. I am working mainly on the df; you can then use the filtered frame to plot your histogram.
df = pd.DataFrame(data, columns=['unique_years', 'counts'])
yd = df.unique_years.diff().eq(1)  # True where a year directly follows its predecessor
df[yd | yd.shift(-1)]              # keep rows that start or continue a consecutive run
Running this on the sample frame above should give only the years that sit in a consecutive run:
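   unique_years  counts
2          1938     438
3          1939      85
4          1940       8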

customizing the legend in a plot derived from a pandas dataframe

I'm working on a Python implementation of an agent-based model using the 'mesa' framework (available on GitHub). In the model, each "agent" on a grid plays a Prisoner's Dilemma game against its neighbors. Each agent has a strategy that determines its moves against other agents' moves. Strategies with higher payoffs replace strategies with lower payoffs. In addition, strategies evolve through mutations, so new and longer strategies emerge as the model runs. The app produces a pandas dataframe that gets updated after each step. For example, after 106 steps, the df might look like this:
     step strategy  count  score
0       0       CC     34   2.08
1       0       DD   1143   2.18
2       0       CD   1261   2.24
3       0       DC     62   2.07
4       1       CC      6   1.88
..    ...      ...    ...    ...
485   106     DDCC     56   0.99
486   106       DD    765   1.00
487   106       DC   1665   1.31
488   106     DCDC     23   1.60
489   106     DDDD     47   0.98
Pandas/matplotlib creates a pretty good plot of this data, calling this simple plot function:
import matplotlib.pyplot as plt

def plot_counts(df):
    df1 = df.set_index('step')
    df1.groupby('strategy')['count'].plot()
    plt.ylabel('count')
    plt.xlabel('step')
    plt.title('Count of all strategies by step')
    plt.legend(loc='best')
    plt.show()
I get this plot:
Not bad, but here's what I can't figure out. The automatic legend quickly gets way too long, and the low-frequency strategies are of little interest, so I want the legend to (1) include only the top 4 strategies listed in the above legend and (2) list those strategies in the order they appear in the last step of the model, based on their counts. Looking at the strategies in step 106 in the df, for example, I want the legend to show the top 4 strategies in the order DC, DD, DDCC, and DDDD, but not include DCDC (or any other lower-count strategies that might be active).
I have searched through tons of pandas and matplotlib plotting examples but haven't been able to find a solution to this specific problem. It's clear that these plots are extremely customizable, so I suspect there is a way to do this. Any help would be greatly appreciated.
This post is somewhat similar to what you have asked; I guess you should check the answers on this page: Show only certain items in legend Python Matplotlib. Hope this helps!
Here is an approach. I don't have the complete dataframe, so the test is only with the ones displayed in the question.
The pandas part of the question can be solved by assigning the last step to a variable, then querying for the strategies of that step and then getting the highest counts.
To find the handles, we ask matplotlib for all the handles and labels it generated. Then we search each of the strategies in the list of labels, taking its index to get the corresponding handle.
Please note that 'count' is an annoying name for a column. It is also the name of a pandas function, which prevents its use in the dot notation.
import pandas as pd
from matplotlib import pyplot as plt

df = pd.DataFrame(columns=['step', 'strategy', 'count', 'score'],
                  data=[[0, 'CC', 34, 2.08],
                        [0, 'DD', 1143, 2.18],
                        [0, 'CD', 1261, 2.24],
                        [0, 'DC', 62, 2.07],
                        [1, 'CC', 6, 1.88],
                        [106, 'DDCC', 56, 0.99],
                        [106, 'DD', 765, 1.00],
                        [106, 'DC', 1665, 1.31],
                        [106, 'DCDC', 23, 1.60],
                        [106, 'DDDD', 47, 0.98]])
last_step = df.step.max()
# strategies of the last step, taken in the order of their 4 highest counts
strategies_last_step = df.strategy[df['count'][df.step == last_step].nlargest(4).index]
df1 = df.set_index('step')
df1.groupby('strategy')['count'].plot()
plt.ylabel('count')
plt.xlabel('step')
plt.title('Count of all strategies by step')
handles, labels = plt.gca().get_legend_handles_labels()
selected_handles = [handles[labels.index(strategy)] for strategy in strategies_last_step]
legend = plt.legend(handles=selected_handles, loc='best')
plt.show()
Thank you, JohanC, you really helped me see what was going on under the hood with this problem. (Also, good point about count as a col name. I changed it to ncount.)
I found your statement:
strategies_last_step = df.strategy[df['count'][df.step == last_step].nlargest(4).index]
wasn't working for me (nlargest got confused about dtypes), so I formulated a slightly different approach. I got a list of correctly ordered strategy names this way:
def plot_counts(df):
    # to customize the plot legend, first get the last step in the df
    last_step = df.step.max()
    # next, make df_last_step, reverse-sorted by 'ncount' and limited to 4 items
    df_last_step = df[df['step'] == last_step].sort_values(by='ncount', ascending=False)[0:4]
    # put the selected and reordered strategies in a list
    top_strategies = list(df_last_step.strategy)
Then, after indexing and grouping my original df and adding my other plot parameters ...
dfi = df.set_index('step')
dfi.groupby('strategy')['ncount'].plot()
plt.ylabel('ncount')
plt.xlabel('step')
plt.title('Count of all strategies by step')
I was able to pick out the right handles from the default handles list and reorder them this way:
handles, labels = plt.gca().get_legend_handles_labels()
# get handles for top_strategies, in order, and replace the default handles
selected_handles = []
for strategy in top_strategies:
    # get the index of the labels entry that matches this strategy
    ix = labels.index(strategy)
    # get the matching handle with the same index and append it in the right order
    selected_handles.append(handles[ix])
Then plot with the new selected_handles:
plt.legend(handles=selected_handles, loc='best')
plt.show()
Result is exactly as intended. Here is a plot after 300+ steps. Legend is in the right order and limited to top 4 strategies:

Compact way of visualizing heat maps of correlated data

I am trying to visualize the correlation of the Result column with every other column.
        A_B       A_C       B_C  Result
0  0.318182  0.925311  0.860465      91
1 -0.384030  0.991803  0.996344      12
2 -0.818182  0.411765  0.920000      53
3  0.444444  0.978261  0.944444      64
A_B = (A-B)/(A+B), and correspondingly for all the other column pairs.
This works for a small number of columns, but if I increase the number of columns, the number of rows in the heat map keeps stacking up. Is there any compact way to represent it?
The following code will reproduce the output:
import pandas as pd
import seaborn as sns

data = {'A': [232, 243, 12, 546, 67, 12, 78, 11, 245],
        'B': [120, 546, 120, 210, 56, 120, 56, 89, 12],
        'C': [9, 1, 5, 6, 7, 43, 7, 12, 64],
        'Result': [91, 12, 53, 64, 71, 436, 74, 123, 641],
        }
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'Result'])

# Build (A-B)/(A+B), (A-C)/(A+C), and (B-C)/(B+C) for every column pair
colnames = df.columns.tolist()[:-1]
for i, c in enumerate(colnames):
    for k in range(i + 1, len(colnames)):
        df[c + '_' + colnames[k]] = (df[c] - df[colnames[k]]) / (df[c] + df[colnames[k]])
newdf = df[['A_B', 'A_C', 'B_C', 'Result']].copy()

# Correlation of each ratio column with Result, ignoring Result vs. itself
plot = pd.DataFrame(newdf.corr().iloc[:-1, -1])
sns.heatmap(plot, annot=True)
A technique which I have heard of, but for which I am unable to find any source, is representing each correlation factor in mini-rectangles, like so:
According to it, considering the given map as a 3*3 matrix with (0,0) starting from the bottom left, A_B would be represented at (1,1), A_C at (2,1), and B_C at (2,2).
But I cannot figure out how to do it.
You can plot the correlation of each column against the Result column and other columns as well. Below is one way to do so. Providing the x- and y-ticklabels guides you better for comparing the correlations. You can also annotate the correlation values to be displayed on the heat map.
cor = newdf.corr()
sns.heatmap(cor, xticklabels=cor.columns.values,
            yticklabels=cor.columns.values, annot=True)
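If the full matrix grows too tall as columns are added, a more compact variant (a sketch, assuming the newdf built above) is to keep only the Result row and plot it as a single-row heat map:
import seaborn as sns
import matplotlib.pyplot as plt

# 1 x N row: correlation of Result with each ratio column
cor = newdf.corr()
row = cor.loc[['Result'], ['A_B', 'A_C', 'B_C']]
sns.heatmap(row, annot=True)
plt.show()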

Calculating angle between two points in time-series

I have time-series data and I am trying to calculate the angle (in degrees) between two points. Here is what I have done so far, but it doesn't seem to give the correct solution:
import numpy as np
import pandas as pd

pts = 2  # how many rows back the second point lies
df = pd.read_csv("EURUSD.csv")
df = df.reset_index()
df['A'] = np.rad2deg(np.arctan2(df['Low'] - df['Low'].shift(pts),
                                df['index'] - df['index'].shift(pts)))
df.dropna(inplace=True)
However, sometimes this gives me weird outputs like:
2693 3.141258
2702 -3.141383
2708 -3.141451
2719 -3.141033
2724 -3.140893
2734 3.141550
I have also tried the following code:
df['A'] = ((df['Low']-df['Low'].shift(pts))/(df['index']-df['index'].shift(pts)))
2693 -0.000334
2702 0.000210
2708 0.000142
2719 0.000560
2724 0.000700
2734 -0.000043
What am I doing wrong here?
EDIT:
Here is the screenshot of what I'm trying to do. I'm simply trying to find that -48 degrees in Python. I am not trying to detect these points automatically; I have spotted them manually and just need to do the calculation.
I guess that your question is: how do I calculate the angle between two lines, where each line is defined by a single point and a common origin? You then want to perform this operation for a series of x1, x2 points recorded over time.
Here you can find the arithmetic, and here an example.
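A minimal sketch of that idea (a hypothetical helper, not taken from the linked pages), assuming both lines pass through a common origin and each is defined by a single point:
import numpy as np

# Hypothetical helper: angle between two lines through a common origin,
# each defined by one point (x, y)
def angle_between(x1, y1, x2, y2):
    a1 = np.arctan2(y1, x1)  # angle of the first line vs. the x-axis
    a2 = np.arctan2(y2, x2)  # angle of the second line vs. the x-axis
    return np.rad2deg(a2 - a1)

print(angle_between(1.0, 0.0, 1.0, 1.0))  # 45.0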
To get your line angle between the two points, you'll need the following:
price difference (looks like 1.29250 - 1.29650 = -0.004)
number of bars between the two points (that appears to be 10 bars)
price-to-bar ratio (you'll have to look at the settings for that particular graph)
price_diff = -0.004
bars = 10
price_to_bar = unknown  # chart-specific scale; check the chart settings
X = bars * price_to_bar
Final output:
import numpy as np
round(np.angle(complex(X, price_diff), deg=True), 0)
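For example, with the price difference above and a hypothetical price-to-bar ratio that makes X come out to 0.0036, the formula lands on roughly the -48 degrees from the screenshot:
import numpy as np

price_diff = -0.004   # 1.29250 - 1.29650
X = 0.0036            # hypothetical: 10 bars * an assumed price_to_bar of 0.00036
print(round(np.angle(complex(X, price_diff), deg=True), 0))  # -48.0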
