Python data visualization: too small value to be visible - how to solve?

Here is the dataset I have:
Source      All Leads  Not Junks  Warms  Hots  Deals  Weighted Sum
web            281316     269490  10252  2508   1602        4376.5
telesales       30458      29732    431   138     85         316.2
networking       4249       4195    763   547    476         539.1
promos           1356       1308     30     1      0          10.8
I visualized it:
df.plot.bar()
And got this output:
Some columns have values so small that their bars are not visible. How can I tackle this problem?
Setting a bigger figure size doesn't help: it makes the chart bigger, but the ratio between the columns stays the same, so nothing changes.
Any ideas how to make it look more sophisticated? Or maybe I should try a different type of chart? Thank you!

You could try df.plot.bar(logy=True), but a log scale makes the chart harder to interpret. A Sankey diagram would probably be a better fit for showing how the data breaks down in each category.
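As a rough sketch of the log-scale suggestion, the question's table can be rebuilt by hand and plotted with 'Source' as the index. The figure size, axis label and output file name are arbitrary choices, not part of the original post. Note that the zero in 'Deals' for promos cannot be drawn on a log axis, so that bar simply stays empty.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The dataset from the question, typed in by hand.
df = pd.DataFrame({
    "Source": ["web", "telesales", "networking", "promos"],
    "All Leads": [281316, 30458, 4249, 1356],
    "Not Junks": [269490, 29732, 4195, 1308],
    "Warms": [10252, 431, 763, 30],
    "Hots": [2508, 138, 547, 1],
    "Deals": [1602, 85, 476, 0],
    "Weighted Sum": [4376.5, 316.2, 539.1, 10.8],
})

# One group of bars per source, drawn on a logarithmic y-axis.
ax = df.set_index("Source").plot.bar(logy=True, figsize=(10, 5))
ax.set_ylabel("count (log scale)")
plt.tight_layout()
plt.savefig("leads_log.png")  # hypothetical file name; use plt.show() interactively
```

Bar heights on a log axis no longer compare linearly, which is the interpretation problem mentioned above.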

Seaborn comes out a little nicer, but takes some transformation to produce the same type of output:
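import seaborn as sns
import matplotlib.pyplot as plt
# reshape from wide to long so seaborn can group the bars by category
df2 = df.melt('Source').rename(columns={'variable': 'Category', 'value': 'Values'})
sns.barplot(x='Source', y='Values', data=df2, hue='Category')
plt.show()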
Output:
Or with log=True

Related

Show how when values rise in one column, so does the values in another one

I'm working with a covid dataset for some Python exercises I am working through to try to learn. I've loaded it the usual way:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("C:/Users/Desktop/Python Short Course/diagnosis.csv")
In this dataset there are two columns called BodyTemp and SpO2. I want to show how the values in the two columns move together: when the values rise in the BodyTemp column, so do the values in the SpO2 column. I had thought of doing a bar chart like:
plt.xlabel("BodyTemp")
plt.ylabel("SpO2")
plt.bar(x = df["BodyTemp"], height = df["SpO2"])
plt.show()
but all the bars are very close together and it just doesn't look great, so what would be a better way to do this? Or would there be a better approach to visualise the distribution of values?
Edit: to show screenshot of graph
Edit to show data:
BodyTemp  SpO2
37.6      85
38.9      93
38.5      92
37        80
I've added a table showing the first few rows. There are a lot more, but it gives an idea of the data.
You need to change the scale of the y-axis. Try this:
plt.ylim((df['SpO2'].min()-.5, df['SpO2'].max()+.5))
If this doesn't work, it's probably because there are very small values in the SpO2 column. The gaps between the bars may be small values that are distorting the chart. Try removing them from the dataframe.
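A minimal sketch of the y-axis suggestion, using only the four sample rows shown in the question. The bar width and output file name are arbitrary assumptions, and the real dataset has many more rows.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Only the four sample rows from the question.
df = pd.DataFrame({"BodyTemp": [37.6, 38.9, 38.5, 37.0],
                   "SpO2": [85, 93, 92, 80]})

fig, ax = plt.subplots()
ax.bar(df["BodyTemp"], df["SpO2"], width=0.1)  # narrow bars; width is an arbitrary choice
ax.set_xlabel("BodyTemp")
ax.set_ylabel("SpO2")
# Zoom the y-axis to the range the data actually covers.
ax.set_ylim(df["SpO2"].min() - 0.5, df["SpO2"].max() + 0.5)
fig.savefig("spo2_ylim.png")  # hypothetical file name; use plt.show() interactively
```

With the axis no longer starting at zero, the differences between bars become visible, at the cost of exaggerating them visually.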

Plotting data from dataframe column using matplotlib- specific start index and number of datapoints

https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv
I have this data set that shows energy data logged every 10 minutes for 4.5 months in Chievres, Belgium.
I am only interested in displaying the ‘date’, ‘Appliances’, ‘lights’, and ‘T_out’ in a dataframe. The relevant code is below.
df=pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv')
df=df.iloc[:,[0,1,2,21]]
df.head(5)
(I'd show the df but I'm new to SO and don't know how to include output in a question, sorry :) )
I'd like to create a plot using matplotlib that shows only the lights data for 4 days to see if there is a correlation between daytime and nighttime energy usage. I want to start with ‘2016-01-12 06:00:00’ am to have an accurate representation of a day.
I know that the data for one day is equal to 144 data points since each point is recorded every 10 minutes, so for four days it is 576 data points.
fig = plt.figure()
fig.plot(df['lights'])
This is literally the only code I have so far and I know it isn't even remotely correct lol.
How can I graph the relevant data from the 'lights' column in the dataframe and limit the plot to 576 data points?
Update: The following code works although if I am being perfectly honest I am not sure why or how
fig,axes=plt.subplots(1,1)
lights= df[df['date']=='2016-01-12 06:00:00'].index[0]
axes.plot(df.iloc[lights:lights+144*4,0], df.iloc[lights:lights+144*4,2], color='g', alpha=0.5)
xticks=np.arange(0,144*4,36)
I don't really understand the parameters of the df.iloc[] function
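On the df.iloc[] parameters: iloc takes positions, not labels, in the form iloc[rows, columns]. So df.iloc[lights:lights+144*4, 2] means "rows from position lights up to, but not including, lights+576, column at position 2". The sketch below uses a synthetic stand-in for the energy data (the values are made up; only the 10-minute spacing and the column order match the question) to show the same slicing.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the energy data: one row every 10 minutes.
dates = pd.date_range("2016-01-11 17:00:00", periods=2000, freq="10min")
df = pd.DataFrame({
    "date": dates.strftime("%Y-%m-%d %H:%M:%S"),  # plain strings, as read_csv gives by default
    "Appliances": np.arange(2000),
    "lights": np.arange(2000) % 50,
    "T_out": np.zeros(2000),
})

points_per_day = 24 * 6          # 144 readings per day
n_points = 4 * points_per_day    # 576 readings for four days

# Index label of the row at the chosen start time. With the default
# RangeIndex this label is also the row's position, which iloc needs.
start = df.index[df["date"] == "2016-01-12 06:00:00"][0]

# iloc[rows, columns]: rows start..start+575, column 0 (date) or 2 (lights).
window_dates = df.iloc[start:start + n_points, 0]
window_lights = df.iloc[start:start + n_points, 2]

print(len(window_lights), window_dates.iloc[0], window_dates.iloc[-1])
```

Passing window_dates and window_lights to axes.plot() gives the same four-day plot as the update above.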

customizing the legend in a plot derived from a pandas dataframe

I'm working on a python implementation of an agent-based model using the 'mesa' framework (available in Github). In the model, each "agent" on a grid plays a Prisoner's Dilemma game against its neighbors. Each agent has a strategy that determines its move vs. other moves. Strategies with higher payoffs replace strategies with lower payoffs. In addition, strategies evolve through mutations, so new and longer strategies emerge as the model runs. The app produces a pandas dataframe that gets updated after each step. For example, after 106 steps, the df might look like this:
     step strategy  count  score
0       0       CC     34   2.08
1       0       DD   1143   2.18
2       0       CD   1261   2.24
3       0       DC     62   2.07
4       1       CC      6   1.88
..    ...      ...    ...    ...
485   106     DDCC     56   0.99
486   106       DD    765   1.00
487   106       DC   1665   1.31
488   106     DCDC     23   1.60
489   106     DDDD     47   0.98
Pandas/matplotlib creates a pretty good plot of this data, calling this simple plot function:
def plot_counts(df):
    df1 = df.set_index('step')
    df1.groupby('strategy')['count'].plot()
    plt.ylabel('count')
    plt.xlabel('step')
    plt.title('Count of all strategies by step')
    plt.legend(loc='best')
    plt.show()
I get this plot:
Not bad, but here's what I can't figure out. The automatic legend quickly gets way too long and the low-frequency strategies are of little interest, so I want the legend to (1) include only the top 4 strategies and (2) list those strategies in the order of their counts in the last step of the model. Looking at the strategies in step 106 in the df, for example, I want the legend to show the top 4 strategies in the order DC, DD, DDCC, and DDDD, but not include DCDC (or any other lower-count strategies that might be active).
I have searched through tons of pandas and matplotlib plotting examples but haven't been able to find a solution to this specific problem. It's clear that these plots are extremely customizable, so I suspect there is a way to do this. Any help would be greatly appreciated.
This post is somewhat similar to what you have asked, I guess you should check the answer on this page: Show only certain items in legend Python Matplotlib. Hope this helps!
Here is an approach. I don't have the complete dataframe, so the test is only with the ones displayed in the question.
The pandas part of the question can be solved by assigning the last step to a variable, then querying for the strategies of that step and then getting the highest counts.
To find the handles, we ask matplotlib for all the handles and labels it generated. Then we search each of the strategies in the list of labels, taking its index to get the corresponding handle.
Please note that 'count' is an annoying name for a column. It also is the name of a pandas function, which prevents its use in the dot notation.
import pandas as pd
from matplotlib import pyplot as plt
df = pd.DataFrame(columns=['step', 'strategy', 'count', 'score'],
                  data=[[0, 'CC', 34, 2.08],
                        [0, 'DD', 1143, 2.18],
                        [0, 'CD', 1261, 2.24],
                        [0, 'DC', 62, 2.07],
                        [1, 'CC', 6, 1.88],
                        [106, 'DDCC', 56, 0.99],
                        [106, 'DD', 765, 1.00],
                        [106, 'DC', 1665, 1.31],
                        [106, 'DCDC', 23, 1.60],
                        [106, 'DDDD', 47, 0.98]])
last_step = df.step.max()
strategies_last_step = df.strategy[df['count'][df.step == last_step].nlargest(4).index]
df1 = df.set_index('step')
df1.groupby('strategy')['count'].plot()
plt.ylabel('count')
plt.xlabel('step')
plt.title('Count of all strategies by step')
handles, labels = plt.gca().get_legend_handles_labels()
selected_handles = [handles[labels.index(strategy)] for strategy in strategies_last_step]
legend = plt.legend(handles=selected_handles, loc='best')
plt.show()
Thank you, JohanC, you really helped me see what was going on under the hood with this problem. (Also, good point about count as a col name. I changed it to ncount.)
I found your statement:
strategies_last_step = df.strategy[df['count'][df.step == last_step].nlargest(4).index]
wasn't working for me (nlargest got confused about dtypes) so I formulated a slightly different approach. I got a list of correctly ordered strategy names this way:
def plot_counts(df):
    # to customize plot legend, first get the last step in the df
    last_step = df.step.max()
    # next, make new df_last_step, reverse sorted by 'ncount' & limited to 4 items
    df_last_step = df[df['step'] == last_step].sort_values(by='ncount', ascending=False)[0:4]
    # put selected and reordered strategies in a list
    top_strategies = list(df_last_step.strategy)
Then, after indexing and grouping my original df and adding my other plot parameters ...
    dfi = df.set_index('step')
    dfi.groupby('strategy')['ncount'].plot()
    plt.ylabel('ncount')
    plt.xlabel('step')
    plt.title('Count of all strategies by step')
I was able to pick out the right handles from the default handles list and reorder them this way:
    handles, labels = plt.gca().get_legend_handles_labels()
    # get handles for top_strategies, in order, and replace default handles
    selected_handles = []
    for i in range(len(top_strategies)):
        # get the index of the labels object that matches this strategy
        ix = labels.index(top_strategies[i])
        # get matching handle w the same index, append it to a new handles list in right order
        selected_handles.append(handles[ix])
Then plot with the new selected_handles:
    plt.legend(handles=selected_handles, loc='best')
    plt.show()
Result is exactly as intended. Here is a plot after 300+ steps. Legend is in the right order and limited to top 4 strategies:
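On the "nlargest got confused about dtypes" remark: one possible cause is that the count column was stored with object dtype, for example as strings. This is only an assumption, since the full dataframe isn't shown. If so, converting the column with pd.to_numeric first should let nlargest work. A small sketch with the step-106 rows from the question, where the counts are deliberately created as strings:

```python
import pandas as pd

# Step-106 rows from the question. The counts are strings on purpose, to
# mimic an object-dtype column (an assumed cause, not a confirmed one).
df = pd.DataFrame({
    "step": [106] * 5,
    "strategy": ["DDCC", "DD", "DC", "DCDC", "DDDD"],
    "ncount": ["56", "765", "1665", "23", "47"],
})

df["ncount"] = pd.to_numeric(df["ncount"])  # strings -> int64

# Rows of the last step, then the strategies of its four largest counts.
last = df[df["step"] == df["step"].max()]
top_strategies = last.loc[last["ncount"].nlargest(4).index, "strategy"].tolist()
print(top_strategies)
```

The resulting list can be used in place of top_strategies in the handle-selection loop above.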

Plotting different lines for different states on the same chart

I am trying to create a distribution for the number of ___ across a few states.
I want to get all of the states on the same graph, represented by different lines.
Here is an example of what my data looks like: the state (which I want to split the lines by), the number of reviews (x axis), and the number of restaurants that have that many reviews (y axis).
State | num_of_reviews | Count_id
alaska 1 400
alaska 2 388
alaska 3 344
...
Wyoming 57 13
Whenever I try doing a simple line plot in seaborn or matplotlib, it just returns a messy graph.
Does anyone know a string of code where I easily can filter df['State']?
Assuming that you have 50+ states, I wouldn't plot the distribution for each on the same plot as it would get really messy and hard to read. Instead, I would suggest to use a FacetGrid (read more about it here).
Something like this should do.
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.FacetGrid(df, col="State", col_wrap=5, height=1.5)
g = g.map(plt.hist, "num_of_reviews")
You can find other possible solutions and ideas on how to visualize your data here.
If none of these work for you then it might be helpful if you explain a bit better your problem and provide a desired output and a minimal, complete, and verifiable example.
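For a handful of states, the "one line per state on the same graph" idea from the question can be sketched with a plain groupby loop. This is an alternative to the FacetGrid suggestion, not a replacement for it. The alaska rows are from the question; the Wyoming values and the output file name are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Rows in the shape of the question's table (Wyoming values are made up).
df = pd.DataFrame({
    "State": ["alaska"] * 3 + ["Wyoming"] * 3,
    "num_of_reviews": [1, 2, 3, 1, 2, 3],
    "Count_id": [400, 388, 344, 120, 95, 80],
})

fig, ax = plt.subplots()
# groupby yields one sub-frame per state; each becomes its own line.
for state, group in df.groupby("State"):
    group = group.sort_values("num_of_reviews")  # keep the x values in order
    ax.plot(group["num_of_reviews"], group["Count_id"], label=state)
ax.set_xlabel("num_of_reviews")
ax.set_ylabel("Count_id")
ax.legend(title="State")
fig.savefig("reviews_by_state.png")  # hypothetical file name; use plt.show() interactively
```

To plot only some states, filter first, for example df[df["State"].isin(["alaska", "Wyoming"])].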

Folium TopoJSON heatmap does not populate as expected

I am trying to display a folium choropleth heatmap using a custom topoJSON file and a dataframe. The map generates with a uniformly shaded choropleth instead of the expected heatmap.
Here's a snippet of the code I am using (basic imports and creation of the dataframe are omitted):
cols = ['dma', 'values']
center_us_long_lat = [39.50, -98.35]
topo_path = r'../../data/designated_marketing_areas_us_topo.json'
us_map = folium.Map(location=center_us_long_lat, attr='dma_code',
                    tiles='Mapbox Bright', zoom_start=4, min_zoom=4)
us_map.choropleth(geo_path=topo_path, topojson='objects.nielsen_dma',
                  data=df, columns=cols,
                  fill_opacity=0.7,
                  key_on="feature.properties.dma",
                  line_color='white', fill_color='YlOrRd',
                  highlight=True
                  )
The output looks like this:
I've tried adjusting the key_on argument to feature.dma but this results in the same output.
As a reference here's a sample of the df data:
In[1]:
df.head():
Out[1]:
dma values
1 501 16.749
2 740 8.858
3 807 15.790
4 511 15.315
5 798 8.425
The topojson can be found here
What am I doing wrong? Thanks!
I know this might sound silly, but I always find these issues come down to mismatched data types or leading/trailing spaces. Good luck!
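A quick way to test the data-type theory is to compare the keys on both sides before building the map. The sketch below uses a hypothetical stand-in for the TopoJSON geometries, since the real file isn't shown. That the file stores 'dma' as a string is an assumption to verify against the actual data.

```python
import pandas as pd

# Hypothetical stand-in for the TopoJSON geometries; the real file is not
# shown, so 'dma' being a string here is an assumption.
geometries = [
    {"properties": {"dma": "501"}},
    {"properties": {"dma": "740"}},
    {"properties": {"dma": "807"}},
]
df = pd.DataFrame({"dma": [501, 740, 807], "values": [16.749, 8.858, 15.790]})

topo_keys = {g["properties"]["dma"] for g in geometries}

# Integers never equal strings, so no key matches before the conversion.
matched_before = set(df["dma"]) & topo_keys
# Align the type and strip any stray whitespace, then compare again.
df["dma"] = df["dma"].astype(str).str.strip()
matched_after = set(df["dma"]) & topo_keys

print(len(matched_before), len(matched_after))
```

If nothing matches before the conversion and everything matches after it, a type mismatch would explain the uniform shading.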
