Python change the starting values on the plot - python

I have a data set that looks like this:
Hour_day Profits
7 645
3 354
5 346
11 153
23 478
7 464
12 356
0 346
I created a line plot to visualize the hour on the x-axis and the profit values on the y-axis. My code works, but the x-axis starts at 0, and I want it to start from 5 pm, for example.
hours = df.Hour_day.value_counts().keys()
hours = hours.sort_values()
# Get plot information from actual data
y_values = list()
for hr in hours:
    temp = df[df.Hour_day == hr]
    y_values.append(temp.Profits.mean())
# Plot comparison
plt.plot(hours, y_values, color='y')

As far as I know, you have two options:
Create a sub-DataFrame that excludes the rows with an Hour_day value under 5, then proceed with the rest of your code as normal (boolean indexing drops those rows, whereas df.where would keep them as NaN):
df_new = df[df['Hour_day'] >= 5]
Or you might be able to set the x-ticks and the lower axis limit:
default_x_ticks = range(5, 24)
plt.plot(hours, y_values, color='y')
plt.xticks(default_x_ticks)
plt.xlim(left=5)
plt.show()
I haven't tested the x-ticks code, so you might have to play around with it a bit, but there are plenty of easy-to-find resources on xticks.
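The two options above can be combined into one short sketch. It uses the sample rows from the question; the name start_hour is hypothetical, and the value 5 follows the answer (17 would be the value for "5 pm" on a 24-hour clock). This is only one way to do it, not a tested drop-in for the original code.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data copied from the question.
df = pd.DataFrame({
    "Hour_day": [7, 3, 5, 11, 23, 7, 12, 0],
    "Profits": [645, 354, 346, 153, 478, 464, 356, 346],
})

start_hour = 5  # hypothetical cutoff; use 17 if "5 pm" on a 24-hour clock is meant

# Mean profit per hour, keeping only hours at or after the cutoff.
# groupby returns the hours in ascending order.
mean_profits = (
    df[df["Hour_day"] >= start_hour]
    .groupby("Hour_day")["Profits"]
    .mean()
)

fig, ax = plt.subplots()
ax.plot(mean_profits.index, mean_profits.values, color="y")
ax.set_xlim(left=start_hour)  # make the axis itself begin at the cutoff
ax.set_xlabel("Hour of day")
ax.set_ylabel("Mean profit")
# Call plt.show() to display the figure in an interactive session.
```

Using groupby replaces the explicit loop over hours, so the filtering and the averaging happen in one expression.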

Related

Grouping values in a clustered pie chart

I'm working with a dataset about when certain houses were constructed, and my data stretches from 1873 to 2018 (143 slices). I'm trying to visualise this data as a pie chart, but because of the large number of individual slices the entire pie chart appears cluttered and messy.
To get around this, I'm trying to group the values into 15-year periods and display those periods on the pie chart instead. I saw a similar post on StackOverflow where the suggested solution was to use a dictionary and define a threshold to group the values, but implementing a version of that on my own pie chart didn't work, and I was wondering how I could tackle this problem.
CODE
testing = df1.groupby("Year Built").size()
testing.plot.pie(autopct="%.2f",figsize=(10,10))
plt.ylabel(None)
plt.show()
Dataframe(testing)
Current Piechart
For the future, always provide a reproducible example of the data you are working on (maybe use df.head().to_dict()). One solution to your problem is to use the DataFrame's resample method.
# Data Used
df = pd.DataFrame( {'year':np.arange(1890, 2018), 'built':np.random.randint(1,150, size=(2018-1890))} )
>>> df.head()
year built
0 1890 34
1 1891 70
2 1892 92
3 1893 135
4 1894 16
# First, convert your 'year' values into DateTime values and set it as the index
df['year'] = pd.to_datetime(df['year'], format=('%Y'))
df_to_plot = df.set_index('year', drop=True).resample('15Y').sum()
>>> df_to_plot
built
year
1890-12-31 34
1905-12-31 983
1920-12-31 875
1935-12-31 1336
1950-12-31 1221
1965-12-31 1135
1980-12-31 1207
1995-12-31 1168
2010-12-31 1189
2025-12-31 757
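Alternatively, you could use pd.cut(). Note that the second argument here is the number of equal-width bins (15), not the bin width in years, and that the column to sum is 'built':
df['group'] = pd.cut(df['year'], 15, precision=0)
df.groupby('group')[['built']].sum().plot(kind='pie', subplots=True, figsize=(10,10), legend=False)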
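If the goal is periods that are exactly 15 years wide, pd.cut also accepts explicit bin edges. The sketch below assumes an integer year column and random counts shaped like the answer's example; the names edges, labels, period and per_period are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data in the same shape as the answer's example (values are random).
rng = np.random.default_rng(0)
years = np.arange(1890, 2018)
df = pd.DataFrame({"year": years, "built": rng.integers(1, 150, size=years.size)})

# Left-closed 15-year bins: [1890, 1905), [1905, 1920), ...
edges = np.arange(1890, 2018 + 15, 15)
labels = [f"{lo}-{hi - 1}" for lo, hi in zip(edges[:-1], edges[1:])]

df["period"] = pd.cut(df["year"], bins=edges, right=False, labels=labels)

# Total number of houses built in each period.
per_period = df.groupby("period", observed=True)["built"].sum()

ax = per_period.plot.pie(autopct="%.1f", figsize=(8, 8))
ax.set_ylabel("")  # drop the default column-name label
```

With right=False each bin includes its first year and excludes its last edge, so a year such as 1905 falls into "1905-1919" rather than the preceding slice.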

How do I get a Bokeh ColorBar to show the min and max value?

I am making a bubble chart using Bokeh and want the ColorBar to show the min and max value. Given data that looks like this
In[23]: group_counts.head()
cyl yr counts size
0 3 72 1 0.854701
1 3 73 1 0.854701
2 3 77 1 0.854701
3 3 80 1 0.854701
4 4 70 7 5.982906
I am generating a plot using
x_col = 'cyl'
y_col = 'yr'
color_transformer = transform.linear_cmap('counts', Inferno256,
group_counts.counts.min(),
group_counts.counts.max())
color_bar = ColorBar(color_mapper=color_transformer['transform'],
location=(0, 0))
source = ColumnDataSource(data=group_counts)
p = plotting.figure(x_range=np.sort(group_counts[x_col].unique()),
y_range=np.sort(group_counts[y_col].unique()),
plot_width=400, plot_height=300,
x_axis_label=x_col, y_axis_label=y_col)
p.add_layout(color_bar, 'right')
p.scatter(x=x_col, y=y_col, size='size', color=color_transformer,
source=source)
plotting.show(p)
Notice that the min and max values on the colorbar are not labelled. How do I force the colorbar to label these values?
You can do this using the FixedTicker class, located under bokeh.models. It is meant to be used to,
Generate ticks at fixed, explicitly supplied locations.
To provide the min and max data values, specify the desired tick values then pass that object to ColorBar using the ticker keyword argument.
mn = group_counts.counts.min()
mx = group_counts.counts.max()
n_ticks = 5 # how many ticks do you want?
ticks = np.linspace(mn, mx, n_ticks).round(1) # round to desired precision
color_ticks = FixedTicker(ticks=ticks)
color_bar = ColorBar(color_mapper=color_transformer['transform'],
location=(0, 0),
ticker=color_ticks, # <<< pass ticker object
)
If you want something a bit more exotic, there are 14 different tickers currently described in the bokeh.models.tickers documentation (do a word search for Ticker(** to quickly jump between the different options).

python: Adjusting the values in the x axis of a plot

I'm trying to create plots in Python. The first 10 rows of the dataset, named Psmc_dolphin, look like the below. The original file has 57 rows and 5 columns.
0 1 2 3 4
0 0.000000e+00 11.915525 299.807861 0.000621 0.000040
1 4.801704e+03 11.915525 326.288712 0.000675 0.000311
2 1.003041e+04 11.915525 355.090418 0.000735 0.000497
3 1.572443e+04 11.915525 386.413025 0.000800 0.000548
4 2.192481e+04 0.583837 8508.130872 0.017613 0.012147
5 2.867635e+04 0.583837 9092.811889 0.018823 0.014021
6 3.602925e+04 0.466402 12111.617016 0.025073 0.019815
7 4.403533e+04 0.466402 12826.458632 0.026553 0.021989
8 5.275397e+04 0.662226 9587.887034 0.019848 0.017158
9 6.224833e+04 0.662226 10201.024439 0.021118 0.018877
10 7.258698e+04 0.991930 7262.773560 0.015035 0.013876
I'm trying to plot column 0 on the x-axis and column 1 on the y-axis. I get a plot with x-axis values 1000000, 2000000, 3000000, 400000, etc. The code I used is attached below.
I need to adjust the x-axis so that it shows values such as 1e+06, 2e+06, 3e+06, etc. instead of 1000000, 2000000, 3000000, 400000, etc.
# load the dataset
Psmc_dolphin = pd.read_csv('Beluga_mapped_to_dolphin.0.txt', sep="\t",header=None)
plt.plot(Psmc_dolphin[0],Psmc_dolphin[1],color='green')
Any help or suggestion will be appreciated.
Scaling the values might help. Divide the values by 1000000, so that 1000000 becomes 1, 2000000 becomes 2, and so on. Or use a different scale, such as a logarithmic scale. I'm no expert, just a newbie, but I think this might help.
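Another possibility, not part of the answer above, is to leave the data alone and change only how the tick labels are printed. A minimal sketch using matplotlib's FormatStrFormatter follows; the x and y arrays are made-up stand-ins for columns 0 and 1 of Psmc_dolphin.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

# Made-up values standing in for columns 0 and 1 of Psmc_dolphin.
x = np.linspace(0, 4e6, 57)
y = np.linspace(12, 0.5, 57)

fig, ax = plt.subplots()
ax.plot(x, y, color="green")

# Format each x tick with printf-style exponent notation, e.g. 1e+06.
formatter = FormatStrFormatter("%.0e")
ax.xaxis.set_major_formatter(formatter)
```

The "%.0e" format keeps no decimals in the mantissa, so a tick at 1500000 would be rounded in its label; "%.1e" may be a better choice if the ticks do not fall on whole multiples of a power of ten.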

Normalizing huge numeric data to create a valuable line plot

I have the following dataframe:
Year Month Value
2005 9 1127.080000
2016 3 9399.000000
5 3325.000000
6 120.000000
7 40.450000
9 3903.470000
10 2718.670000
12 12108501.620000
2017 1 981879341.949982
2 500474730.739911
3 347482199.470025
4 1381423726.830030
5 726155254.759981
6 750914893.859959
7 299991712.719955
8 133495941.729959
9 27040614303.435833
10 26072052.099796
11 956680303.349909
12 755353561.609832
2018 1 1201358930.319930
2 727311331.659607
3 183254376.299662
4 9096130.550197
5 972474788.569924
6 779912460.479959
7 1062566320.859962
8 293262028544467.687500
9 234792487863.501495
As you can see, I have some huge values grouped by month and year. I want to create a line plot, but the result doesn't make any sense to me:
df.plot(kind = 'line', figsize = (20,10))
The plot doesn't make much sense given that the values fluctuate over the months and years: it shows a flat line for most of the period and a big peak at the end.
I guess the problem may be that the y-axis scale doesn't fit the data. I have tried applying a log transformation to the y-axis, but this doesn't change anything. I have also tried normalizing the data between 0 and 1 as a test, but the plot stays the same. Any ideas about how to get a more accurate representation of my data over the time period? Also, how can I display the name of the month and the year on the x-axis?
EDIT:
This is how I applied the log transform:
df.plot(kind = 'line', figsize = (20,10), logy = True)
And this is the result:
For me this plot is still not really readable. Also, since the plotted values represent income over time, applying a logarithmic transformation to money values doesn't make much sense to me anyway.
Here is how I normalized the data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_scaled.set_index(df.index, inplace = True)
And then I plotted it:
df_scaled.plot(kind = 'line', figsize = (20, 10), logy = True)
As you can see, nothing seems to change with this. I'm a bit lost about how to correctly visualize these data over the given time periods.
The problem is that one value is much, much bigger than the others, causing that spike. Instead, use a semi-log plot:
df.plot(y='Value', logy=True)
outputs
To make it use the date as the x-axis do
df['Day'] = 1 # we need a day
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
df.plot(x='Date', y='Value', logy=True)
which outputs
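If Year and Month are index levels rather than ordinary columns (the printed frame suggests a grouped result, but that is an assumption), they need to be moved back into columns before the date can be assembled. The sketch below uses a handful of rows from the question; the names flat and Date are illustrative, and the month-and-year tick labels come from matplotlib's DateFormatter.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# A few rows from the question, indexed by (Year, Month) as a groupby would produce.
df = pd.DataFrame(
    {"Value": [1127.08, 9399.0, 3325.0, 12108501.62, 981879341.949982]},
    index=pd.MultiIndex.from_tuples(
        [(2005, 9), (2016, 3), (2016, 5), (2016, 12), (2017, 1)],
        names=["Year", "Month"],
    ),
)

# Turn the index levels into columns, then build a date on the 1st of each month.
flat = df.reset_index()
flat["Date"] = pd.to_datetime(flat[["Year", "Month"]].assign(Day=1))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(flat["Date"], flat["Value"], marker="o")
ax.set_yscale("log")  # semi-log plot, as suggested in the answer
ax.xaxis.set_major_formatter(mdates.DateFormatter("%b %Y"))  # e.g. "Sep 2005"
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

Plotting with a marker makes the individual months visible even where the line between them is nearly flat.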

Python plot line linking values seems circle back from last point to the first one

I ran into a Python plotting problem. It looks like the line I plotted circles back from the last data point to the first one, so the line is closed rather than having two open ends. It's difficult to describe, so I uploaded an image here:
Here is the related code:
plt.plot(newx, newy0Normalized, color='red', linewidth=2, marker='1');
plt.plot(newx, newy1Normalized, color='green', linewidth=2, marker='2');
where newx holds integers in the range 50-200 and newy0Normalized holds the corresponding percentages.
How newx and newy0Normalized are generated is a bit long, so I print the data here to show what the data (and its structure) looks like:
for i in range(len(newx)):
    print("%d\t%.2f" % (newx[i], newy0Normalized[i]))
100 7.69
101 14.81
102 9.09
103 8.33
more data here
135 40.00
136 60.00
137 50.00
139 0.00
66 100.00
67 0.00
68 0.00
69 0.00
more data here
97 11.54
98 14.81
99 11.11
This is how matplotlib's line plotting works: it starts with the first data point in the list, then draws a line to the next data point in the list, and so on, until it gets to the last point in the list. It's not a closed loop, though; note the break in the middle of the graph between x = 99 and x = 100. In your case, your data jumps from x = 139 to x = 66 in the middle of the list, so matplotlib will accordingly draw a line from the point at x = 139 to the point at x = 66.
If you don't want this to happen, just sort the data points by their x coordinate before plotting them. Or you can plot them as points without a connecting line, by using the ',' or '.' format specifier. (On scientific grounds, in most cases I would suggest the latter, but which one is correct depends on the interpretation of your data of course.)
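The sorting suggestion can be sketched with NumPy's argsort, which gives the ordering of x and lets the same ordering be applied to y. The arrays below hold a few of the question's points in their original out-of-order sequence; newy, order, x_sorted and y_sorted are illustrative names.

```python
import numpy as np
import matplotlib.pyplot as plt

# A few of the question's points, in the same out-of-order sequence.
newx = np.array([100, 101, 102, 137, 139, 66, 67, 98, 99])
newy = np.array([7.69, 14.81, 9.09, 50.00, 0.00, 100.00, 0.00, 14.81, 11.11])

# Indices that put x in ascending order; apply them to both arrays
# so each y value stays paired with its x value.
order = np.argsort(newx)
x_sorted, y_sorted = newx[order], newy[order]

fig, ax = plt.subplots()
ax.plot(x_sorted, y_sorted, color="red", linewidth=2, marker="1")
```

To plot unconnected points instead, the same data can be passed with a '.' format string, for example ax.plot(newx, newy, '.'), in which case the order no longer matters.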
