Grouping values in a clustered pie chart - python

I'm working with a dataset about when certain houses were constructed, and my data stretches from 1873 to 2018 (143 slices). I'm trying to visualise this data as a pie chart, but because of the large number of individual slices the entire chart appears cluttered and messy.
To get around this, I'm trying to group the values into 15-year periods and display those periods on the pie chart instead. I saw a similar post on StackOverflow where the suggested solution was to use a dictionary and define a threshold to group the values, but my own version of that didn't work on my pie chart. How could I tackle this problem?
CODE
testing = df1.groupby("Year Built").size()
testing.plot.pie(autopct="%.2f",figsize=(10,10))
plt.ylabel(None)
plt.show()
Dataframe(testing)
Current Piechart

For the future, always provide a reproducible example of the data you are working with (for example, df.head().to_dict()). One way to solve your problem is DataFrame.resample.
# Data Used
df = pd.DataFrame( {'year':np.arange(1890, 2018), 'built':np.random.randint(1,150, size=(2018-1890))} )
>>> df.head()
year built
0 1890 34
1 1891 70
2 1892 92
3 1893 135
4 1894 16
# First, convert your 'year' values into DateTime values and set it as the index
df['year'] = pd.to_datetime(df['year'], format=('%Y'))
df_to_plot = df.set_index('year', drop=True).resample('15Y').sum()
>>> df_to_plot
built
year
1890-12-31 34
1905-12-31 983
1920-12-31 875
1935-12-31 1336
1950-12-31 1221
1965-12-31 1135
1980-12-31 1207
1995-12-31 1168
2010-12-31 1189
2025-12-31 757
You could also use pd.cut(), which here splits the year range into 15 equal-width bins (sum the 'built' column, not 'year'):
df['group'] = pd.cut(df['year'], 15, precision=0)
df.groupby('group')[['built']].sum().plot(kind='pie', subplots=True, figsize=(10,10), legend=False)
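If you want slices that are exactly 15 years wide with readable labels, a minimal sketch using plain integer arithmetic could look like the following. The data is synthetic (an assumption, mirroring the example above), and the names width and period are only illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption): one row per year with a count of houses
rng = np.random.default_rng(0)
df = pd.DataFrame({"year": np.arange(1890, 2018),
                   "built": rng.integers(1, 150, size=2018 - 1890)})

first_year = df["year"].min()
width = 15  # years per slice

# Start year of the 15-year period each row falls into
period_start = first_year + (df["year"] - first_year) // width * width
# Readable labels such as "1890-1904"
df["period"] = period_start.astype(str) + "-" + (period_start + width - 1).astype(str)

# One total per period, then one pie slice per total
grouped = df.groupby("period")["built"].sum()
ax = grouped.plot.pie(autopct="%.1f", figsize=(8, 8))
ax.set_ylabel("")
```

Call plt.show() afterwards to display the chart. The last label can run past the final year in the data (here "2010-2024"), because the periods are counted from the first year.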

Related

Python change the starting values on the plot

I have data set which looks like this:
Hour_day Profits
7 645
3 354
5 346
11 153
23 478
7 464
12 356
0 346
I created a line plot with the hour on the x-axis and the profit values on the y-axis. My code works, but the x-axis starts at 0 and I want it to start from 5 pm, for example.
hours = df.Hour_day.value_counts().keys()
hours = hours.sort_values()
# Get plot information from actual data
y_values = list()
for hr in hours:
temp = df[df.Hour_day == hr]
y_values.append(temp.Profits.mean())
# Plot comparison
plt.plot(hours, y_values, color='y')
From what I know you have two options:
Create a sub-DataFrame that excludes the rows with an Hour_day value under 5 and proceed with the rest of your code as normal:
df_new = df[df['Hour_day'] >= 5]
or, you might be able to set the x_ticks and the axis limits:
default_x_ticks = range(5, 24)
plt.plot(hours, y_values, color='y')
plt.xticks(default_x_ticks)
plt.xlim(5, 23)
plt.show()
I haven't tested the x_ticks code so you might have to play around with it just a touch, but there are lots of easy to find resources on x_ticks.
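Putting both options together, here is a minimal sketch using the sample data from the question. The cut-off start_hour = 5 is an assumption taken from the answer above; set it to 17 if you really mean 5 pm.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data copied from the question
df = pd.DataFrame({"Hour_day": [7, 3, 5, 11, 23, 7, 12, 0],
                   "Profits": [645, 354, 346, 153, 478, 464, 356, 346]})

start_hour = 5  # assumed cut-off; not fixed by the question

# Mean profit per hour, sorted by hour (replaces the manual loop)
mean_profit = df.groupby("Hour_day")["Profits"].mean().sort_index()
# Keep only the hours at or after the cut-off
mean_profit = mean_profit[mean_profit.index >= start_hour]

fig, ax = plt.subplots()
ax.plot(mean_profit.index, mean_profit.values, color="y")
ax.set_xticks(range(start_hour, 24))  # one tick per hour
ax.set_xlim(start_hour, 23)           # axis starts at the cut-off
ax.set_xlabel("Hour of day")
ax.set_ylabel("Mean profit")
```

Call plt.show() to display the figure.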

Python data visualization: too small value to be visible - how to solve?

Here is the dataset I have:
Source      All Leads  Not Junks  Warms  Hots  Deals  Weighted Sum
web            281316     269490  10252  2508   1602        4376.5
telesales       30458      29732    431   138     85         316.2
networking       4249       4195    763   547    476         539.1
promos           1356       1308     30     1      0          10.8
I visualized it:
df.plot.bar()
And got this output:
Some columns have values so small that their bars are not visible. How can I tackle this problem?
Setting a bigger figure size doesn't help: it makes the chart bigger, but the ratio between the columns stays the same, so nothing changes.
Any ideas how to make it look more sophisticated? Or maybe I should try a different type of chart? Thank you!
You could try df.plot.bar(logy=True), but a log axis makes the chart harder to interpret. A Sankey diagram would probably be a better fit for showing how the data breaks down in each category.
Seaborn comes out a little nicer, but takes some transformation to produce the same type of output:
import seaborn as sns
df2 = df.melt('Source').rename(columns={'variable': 'Category', 'value': 'Values'})
sns.barplot(x='Source', y='Values', data=df2, hue='Category')
plt.show()
Output:
Or with log=True
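For reference, here is a minimal pandas-only sketch of two ways to make the small values visible, using the table from the question: a log-scaled y-axis, and one subplot per column with independent y-axes. Setting Source as the index is an assumption about how the original df was laid out.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data copied from the question's table
df = pd.DataFrame({
    "Source": ["web", "telesales", "networking", "promos"],
    "All Leads": [281316, 30458, 4249, 1356],
    "Not Junks": [269490, 29732, 4195, 1308],
    "Warms": [10252, 431, 763, 30],
    "Hots": [2508, 138, 547, 1],
    "Deals": [1602, 85, 476, 0],
    "Weighted Sum": [4376.5, 316.2, 539.1, 10.8],
})
wide = df.set_index("Source")  # assumed layout: one row per source

# Option 1: a single chart with a logarithmic y-axis
ax = wide.plot.bar(logy=True, figsize=(10, 6), rot=0)
ax.set_ylabel("Value (log scale)")

# Option 2: one small chart per column, each with its own y-axis
axes = wide.plot.bar(subplots=True, layout=(2, 3), sharey=False,
                     legend=False, figsize=(12, 6), rot=0)
```

A zero (Deals for promos) has no height on a log axis, so the second option may be easier to read when exact small values matter. Call plt.show() to display both figures.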

Annotate scatter plot with multiindex

I have constructed a scatter plot using data from a DataFrame with a multiindex. The indexes are country and year
fig,ax=plt.subplots(1,1)
rel_pib=welfare["rel_pib_pc"].loc[:,1960:2010].groupby("country").mean()
rel_lambda=welfare["Lambda"].loc[:,1960:2010].groupby("country").mean()
ax.scatter(rel_pib,rel_lambda)
ax.set_ylim(0,2)
ax.set_ylabel('Bienestar(Lambda)')
ax.set_xlabel('PIBPc')
ax.plot([0,1],'red', linewidth=1)
I would like to annotate each point with the country name (and if possible, the Lambda value). I have the following code
for i, txt in enumerate(welfare.index):
plt.annotate(txt, (welfare["rel_pib_pc"].loc[:,1960:2010].groupby("country").mean()[i], welfare["Lambda"].loc[:,1960:2010].groupby("country").mean()[i]))
I am not sure how to indicate that I want the country names, since all the Lambda and pib_pc values for a given country are reduced to a single value by the .mean() function.
I have tried using .xs(), but none of the combinations I tried work.
I used the following test data:
rel_pib_pc Lambda
country year
Country1 2007 260 1.12
2008 265 1.13
2009 268 1.10
Country2 2007 230 1.05
2008 235 1.07
2009 236 1.04
Country3 2007 200 1.02
2008 203 1.07
2009 208 1.05
Then, to generate a scatter plot, I used the following code:
fig, ax = plt.subplots(1, 1)
ax.scatter(rel_pib,rel_lambda)
ax.set_ylabel('Bienestar(Lambda)')
ax.set_xlabel('PIBPc')
ax.set_xlim(190,280)
annot_dy = 0.005
for i, txt in enumerate(rel_lambda.index):
ax.annotate(txt, (rel_pib.loc[txt], rel_lambda.loc[txt] + annot_dy), ha='center')
plt.show()
and got the following result:
The trick to correctly generate annotations is:
Enumerate the index of one of the already generated Series objects, so that txt contains the country name.
Take values from the already generated Series objects (don't compute these values again).
Locate both coordinates by the current index value.
To put these annotations just above the respective points:
set ha (horizontal alignment) to 'center',
shift the y coordinate up a little (if needed, experiment with other values of annot_dy).
I also added ax.set_xlim(190,280) in order to keep the annotations within the picture rectangle. You may not need it.
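The steps above can be sketched end to end as follows. The test data is rebuilt from the table shown earlier, and the label format (country name plus mean Lambda in parentheses) is just one possible choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Rebuild the test data with a (country, year) MultiIndex
index = pd.MultiIndex.from_product(
    [["Country1", "Country2", "Country3"], [2007, 2008, 2009]],
    names=["country", "year"])
welfare = pd.DataFrame({
    "rel_pib_pc": [260, 265, 268, 230, 235, 236, 200, 203, 208],
    "Lambda": [1.12, 1.13, 1.10, 1.05, 1.07, 1.04, 1.02, 1.07, 1.05],
}, index=index)

# One mean per country; the result is indexed by country name
means = welfare.groupby(level="country").mean()

fig, ax = plt.subplots()
ax.scatter(means["rel_pib_pc"], means["Lambda"])
ax.set_xlabel("PIBPc")
ax.set_ylabel("Bienestar(Lambda)")
ax.set_xlim(190, 280)

annot_dy = 0.005  # small vertical offset so labels sit above the points
for country, row in means.iterrows():
    ax.annotate(f"{country} ({row['Lambda']:.2f})",
                (row["rel_pib_pc"], row["Lambda"] + annot_dy),
                ha="center")
```

Iterating over means.iterrows() gives the country name and both coordinates at once, so nothing is recomputed inside the loop. Call plt.show() to display the figure.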

customizing the legend in a plot derived from a pandas dataframe

I'm working on a Python implementation of an agent-based model using the 'mesa' framework (available on GitHub). In the model, each "agent" on a grid plays a Prisoner's Dilemma game against its neighbors. Each agent has a strategy that determines its move vs. other moves. Strategies with higher payoffs replace strategies with lower payoffs. In addition, strategies evolve through mutations, so new and longer strategies emerge as the model runs. The app produces a pandas dataframe that gets updated after each step. For example, after 106 steps, the df might look like this:
step strategy count score
0 0 CC 34 2.08
1 0 DD 1143 2.18
2 0 CD 1261 2.24
3 0 DC 62 2.07
4 1 CC 6 1.88
.. ... ... ... ...
485 106 DDCC 56 0.99
486 106 DD 765 1.00
487 106 DC 1665 1.31
488 106 DCDC 23 1.60
489 106 DDDD 47 0.98
Pandas/matplotlib creates a pretty good plot of this data, calling this simple plot function:
def plot_counts(df):
df1 = df.set_index('step')
df1.groupby('strategy')['count'].plot()
plt.ylabel('count')
plt.xlabel('step')
plt.title('Count of all strategies by step')
plt.legend(loc='best')
plt.show()
I get this plot:
Not bad, but here's what I can't figure out. The automatic legend quickly gets way too long and the low-frequency strategies are of little interest, so I want the legend to (1) include only the top 4 strategies listed in the above legend and (2) list those strategies in the order they appear in the last step of the model, based on their counts. Looking at the strategies in step 106 in the df, for example, I want the legend to show the top 4 strategies in order DC,DD,DDCC, and DDDD, but not include DCDC (or any other lower-count strategies that might be active).
I have searched through tons of pandas and matplotlib plotting examples but haven't been able to find a solution to this specific problem. It's clear that these plots are extremely customizable, so I suspect there is a way to do this. Any help would be greatly appreciated.
This post is somewhat similar to what you have asked, I guess you should check the answer on this page: Show only certain items in legend Python Matplotlib. Hope this helps!
Here is an approach. I don't have the complete dataframe, so the test is only with the ones displayed in the question.
The pandas part of the question can be solved by assigning the last step to a variable, then querying for the strategies of that step and then getting the highest counts.
To find the handles, we ask matplotlib for all the handles and labels it generated. Then we search each of the strategies in the list of labels, taking its index to get the corresponding handle.
Please note that 'count' is an annoying name for a column. It also is the name of a pandas function, which prevents its use in the dot notation.
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame(columns=['step', 'strategy', 'count', 'score'],
data=[[0, 'CC', 34, 2.08],
[0, 'DD', 1143, 2.18],
[0, 'CD', 1261, 2.24],
[0, 'DC', 62, 2.07],
[1, 'CC', 6, 1.88],
[106, 'DDCC', 56, 0.99],
[106, 'DD', 765, 1.00],
[106, 'DC', 1665, 1.31],
[106, 'DCDC', 23, 1.60],
[106, 'DDDD', 47, 0.98]])
last_step = df.step.max()
strategies_last_step = df.strategy[df['count'][df.step == last_step].nlargest(4).index]
df1 = df.set_index('step')
df1.groupby('strategy')['count'].plot()
plt.ylabel('count')
plt.xlabel('step')
plt.title('Count of all strategies by step')
handles, labels = plt.gca().get_legend_handles_labels()
selected_handles = [handles[labels.index(strategy)] for strategy in strategies_last_step]
legend = plt.legend(handles=selected_handles, loc='best')
plt.show()
Thank you, JohanC, you really helped me see what was going on under the hood with this problem. (Also, good point about count as a col name. I changed it to ncount.)
I found your statement:
strategies_last_step = df.strategy[df['count'][df.step == last_step].nlargest(4).index]
wasn't working for me (nlargest got confused about dtypes) so I formulated a slightly different approach. I got a list of correctly ordered strategy names this way:
def plot_counts(df):
# to customize plot legend, first get the last step in the df
last_step = df.step.max()
# next, make new df_last_step, reverse sorted by 'count' & limited to 4 items
df_last_step = df[df['step'] == last_step].sort_values(by='ncount', ascending=False)[0:4]
# put selected and reordered strategies in a list
top_strategies = list(df_last_step.strategy)
Then, after indexing and grouping my original df and adding my other plot parameters ...
dfi = df.set_index('step')
dfi.groupby('strategy')['ncount'].plot()
plt.ylabel('ncount')
plt.xlabel('step')
plt.title('Count of all strategies by step')
I was able to pick out the right handles from the default handles list and reorder them this way:
handles, labels = plt.gca().get_legend_handles_labels()
# get handles for top_strategies, in order, and replace default handles
selected_handles = []
for i in range(len(top_strategies)):
# get the index of the labels object that matches this strategy
ix = labels.index(top_strategies[i])
# get matching handle w the same index, append it to a new handles list in right order
selected_handles.append(handles[ix])
Then plot with the new selected_handles:
plt.legend(handles=selected_handles, loc='best')
plt.show()
Result is exactly as intended. Here is a plot after 300+ steps. Legend is in the right order and limited to top 4 strategies:

Plots do not appear when calling seaborn's pairplot on a pandas Dataframe

I have a Dataframe that looks like so
Price Mileage Age
4250 71000 8
6500 43100 6
26950 10000 3
1295 78000 17
5999 61600 8
This is assigned to dataset. I simply call sns.pairplot(dataset) and I'm left with just a single graph - the distribution of prices across my dataset. I expected a 3x3 grid of plots.
When I import a pre-configured dataset from seaborn I get the expected multiplot pair plot.
I'm new to seaborn so apologies if this is a silly question, but what am I doing wrong? It seems like a simple task.
From your comment, it seems like you're trying to plot on non-numeric columns. Try coercing them first:
dataset = dataset.apply(lambda x: pd.to_numeric(x, errors='coerce'))
sns.pairplot(dataset)
The errors='coerce' argument will replace non-coercible values (the reason your columns are objects in the first place) with NaN.
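To check whether this is what is happening, you can inspect the dtypes before and after coercing. Here is a minimal pandas-only sketch. The data is hypothetical (numbers stored as strings, plus one unparseable "n/a" entry), since the question does not show how the DataFrame was loaded.

```python
import pandas as pd

# Hypothetical data: numbers stored as strings, with one unparseable entry
dataset = pd.DataFrame({
    "Price": ["4250", "6500", "26950", "1295", "5999"],
    "Mileage": ["71000", "43100", "10000", "n/a", "61600"],
    "Age": [8, 6, 3, 17, 8],
})

# Price and Mileage are reported with a non-numeric dtype here
print(dataset.dtypes)

# Convert every column; values that cannot be parsed become NaN
numeric = dataset.apply(pd.to_numeric, errors="coerce")

print(numeric.dtypes)        # all columns are numeric now
print(numeric.isna().sum())  # how many values were lost per column
```

If the second dtypes listing is all numeric, passing numeric to sns.pairplot should give the full grid; the NaN counts show how much data the coercion discarded.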
