Normalizing huge numeric data to create a valuable line plot - python

I have the following dataframe:
Year Month Value
2005 9 1127.080000
2016 3 9399.000000
5 3325.000000
6 120.000000
7 40.450000
9 3903.470000
10 2718.670000
12 12108501.620000
2017 1 981879341.949982
2 500474730.739911
3 347482199.470025
4 1381423726.830030
5 726155254.759981
6 750914893.859959
7 299991712.719955
8 133495941.729959
9 27040614303.435833
10 26072052.099796
11 956680303.349909
12 755353561.609832
2018 1 1201358930.319930
2 727311331.659607
3 183254376.299662
4 9096130.550197
5 972474788.569924
6 779912460.479959
7 1062566320.859962
8 293262028544467.687500
9 234792487863.501495
As you can see, i have some huge values grouped by month and year. My problem is that i want to create a line plot, but when i do it, it doesn't make any sense to me:
df.plot(kind = 'line', figsize = (20,10))
The visual representation of the data doesn't make much sense taking into account that the values fluctuate over the months and years, but a flat line is shown for the most of the period and big peak at the end.
I guess the problem may be in the y axis scale that is not correctly fitting the data. I have tried to apply a log transformation to the y axis, but this don't add any changes, i have also tried to normalize the data between 0 and 1 just for test, but the plot still the same. Any ideas about how to get a more accurate representation of my data over the time period? And also, how can I display the name of the month and year in the x axis?
EDIT:
This is how i applied the log transform:
df.plot(kind = 'line', figsize = (20,10), logy = True)
and this is the result:
for me this plot still not really readable, taking into account that the plotted values represent income over the time, applying a logarithmic transformation to money values doesn't make much sense to me anyway.
Here is how i normalized the data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
df_scaled.set_index(df.index, inplace = True)
And then i plotted it:
df_scaled.plot(kind = 'line', figsize = (20, 10), logy = True)
As you can see, noting seems to change with this, i'm a bit lost about how to correctly visualize these data over the given time periods.

The problem is that one value is much much bigger than the others, causing that spike. Instead, use a semi-log plot
df.plot(y='Value', logy=True)
outputs
To make it use the date as the x-axis do
df['Day'] = 1 # we need a day
df['Date'] = pd.to_datetime(df[['Year', 'Month', 'Day']])
df.plot(x='Date', y='Value', logy=True)
which outputs

Related

Python change the starting values on the plot

I have data set which looks like this:
Hour_day Profits
7 645
3 354
5 346
11 153
23 478
7 464
12 356
0 346
I crated a line plot to visualize the hour on the x-axis and the profit values on y-axis. My code worked good with me but the problem is that on the x-axis it started at 0. but I want to start from 5 pm for example.
hours = df.Hour_day.value_counts().keys()
hours = hours.sort_values()
# Get plot information from actual data
y_values = list()
for hr in hours:
temp = df[df.Hour_day == hr]
y_values.append(temp.Profits.mean())
# Plot comparison
plt.plot(hours, y_values, color='y')
From what I know you have two options:
Create a sub DF that excludes the rows that have an Hour_day value under 5 and proceed with the rest of your code as normal:
df_new = df.where(df['Hour_day'] >= 5)
or, you might be able to set the x_ticks:
default_x_ticks = range(5:23)
plt.plot(hours, y_values, color='y')
plt.xticks(default_x_ticks, hours)
plt.show()
I haven't tested the x_ticks code so you might have to play around with it just a touch, but there are lots of easy to find resources on x_ticks.

python: Adjusting the values in the x axis of a plot

IM trying to create plots in python.the first 10 rows of the dataset named Psmc_dolphin looks like the below. the original file has 57 rows and 5 columns.
0 1 2 3 4
0 0.000000e+00 11.915525 299.807861 0.000621 0.000040
1 4.801704e+03 11.915525 326.288712 0.000675 0.000311
2 1.003041e+04 11.915525 355.090418 0.000735 0.000497
3 1.572443e+04 11.915525 386.413025 0.000800 0.000548
4 2.192481e+04 0.583837 8508.130872 0.017613 0.012147
5 2.867635e+04 0.583837 9092.811889 0.018823 0.014021
6 3.602925e+04 0.466402 12111.617016 0.025073 0.019815
7 4.403533e+04 0.466402 12826.458632 0.026553 0.021989
8 5.275397e+04 0.662226 9587.887034 0.019848 0.017158
9 6.224833e+04 0.662226 10201.024439 0.021118 0.018877
10 7.258698e+04 0.991930 7262.773560 0.015035 0.013876
im trying to plot the column0 in x axis and column1 in y axis i get a plot with xaxis values 1000000,2000000,3000000,400000 etc. andthe codes i used are attached below.
i need to adjust the values in x axis so that the x axis should have values such as 1e+06,2e+06,3e+06 ... etc instead of 1000000,2000000,3000000,400000 etc .
# load the dataset
Psmc_dolphin = pd.read_csv('Beluga_mapped_to_dolphin.0.txt', sep="\t",header=None)
plt.plot(Psmc_dolphin[0],Psmc_dolphin[1],color='green')
Any help or suggstion will be appreciated
Scaling the values might help you. Convert 1000000 to 1,2000000 to 2 and so on . Divide the values by 1000000. Or use some different scale like logarithmic scale. I am no expert just a newbie but i think this might help

How to fit a pandas timeseries to a 24h graph?

I have a pandas timeseries of multiple months and want to count occurences of a feature for different times of day.
I.e. I want to create a graph (using seaborn or matplotlib) with the time of day on the x axis (0 to 24 hours) and the relative number of occurences of a column on the y axis (like this).
I can't figure out how to format the timeseries correctly to make this work.
Edit:
This is a sample of the data I'm dealing with. "Open Data Channel Type" can assume five kinds (Online, Phone, Mobile, Unknown, Other). My goal is to plot every kind into one graph, displaying which kind occurs at which time of day.
You need to prepare the plot data first:
hour = df['Created Date'].dt.hour.rename('Hour')
df_plot = df.groupby(hour).apply(lambda x: x['Open Data Channel Type'].value_counts() / x.shape[0]) \
.rename_axis(index=['Hour', 'Channel Type']) \
.to_frame('Frequency') \
.reset_index()
A sample of df_plot:
Hour Channel Type Frequency
0 0 OTHER 0.223744
1 0 PHONE 0.210046
2 0 MOBILE 0.205479
3 0 UNKNOWN 0.198630
4 0 ONLINE 0.162100
5 1 UNKNOWN 0.206311
6 1 OTHER 0.203883
7 1 PHONE 0.201456
8 1 MOBILE 0.196602
9 1 ONLINE 0.191748
Then you can make the plot (here using Seaborn):
ax = sns.lineplot(data=df_plot, x='Hour', y='Frequency', hue='Channel Type')
ax.figure.set_size_inches(10, 4)
ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))
Result (using random data):

Pandas/pyplot coloring scatter plot points by condition

Good time of day everyone! I'm working on a simple script for quality analysis that compares original and duplicate samples and plots those on a scatter plot.
So far I've been able to create plots that I need:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
'''read file'''
duplicates_file = 'C:/Users/cherp2/Desktop/duplicates.csv'
duplicates = pd.read_csv(
duplicates_file, usecols=['SAMPLE_NUMBER','Duplicate Sample Type'
,'FE', 'P','SIO2','AL2O3'
,'Orig. Sample Type', 'FE.1', 'P.1'
,'SIO2.1','AL2O3.1'])
'''calculate standard deviations for grades'''
grades = ['FE','P','SIO2','AL2O3']
for grade in grades:
grade_std = duplicates[grade].std()
'''create scatter plots for all grades'''
ax = duplicates.plot.scatter(f'{grade}', f'{grade}.1')
ax.set_xlabel('Original sample')
ax.set_ylabel('Duplicate sample')
but now I want to color plot points by a condition: if a grade difference between the original and duplicate sample is less than one standard deviation point should be in green, if it's between 2 and 3 stdev it should be orange and red if more than that.
I've been trying to find solutions online but so far nothing has worked. I have a feeling that I'd need to use some lambda function here, but I'm not sure about the syntax.
You can pass a color argument to the plotting call (via c=) and using pandas.cut to generate the necessary color code for different categories based on std.
In [227]: df
Out[227]:
a b
0 0.991415 -0.627043
1 1.365594 -0.036651
2 -0.376318 -0.536504
3 1.041561 -2.180642
4 1.017692 -0.308826
5 -0.626566 1.613980
6 -1.302070 1.258944
7 -0.453499 0.411277
8 -0.927880 0.439102
9 -0.282031 1.249862
10 0.504829 0.536641
11 -1.528550 1.420456
12 0.774111 -1.086350
13 -1.662715 0.732753
14 -1.038514 -1.987912
15 -0.432515 3.104590
16 1.682876 0.663448
17 0.287642 -1.038507
18 -0.307923 -2.340498
19 -1.024045 -1.948608
In [228]: change = df.a - df.b
In [229]: df.plot(kind='scatter', x='a', y='b',
c=pd.cut(((change - change.mean()) / (change.std())).abs(), [0, 1, 2, 3], labels=['r', 'g', 'b']))

Empty Plot when dealing with a huge number of rows

I attempted to use the code below to plot a graph to show the Speed per hour by days.
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
import glob, os
taxi_df = pd.read_csv('ChicagoTaxi.csv')
taxi_df['trip_start_timestamp'] = pd.to_datetime(taxi_df['trip_start_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
taxi_df['trip_end_timestamp'] = pd.to_datetime(taxi_df['trip_end_timestamp'], format = '%Y-%m-%d %H:%M:%S', errors = 'raise')
#For filtering away any zero values when trip_Seconds or trip_miles = 0
filterZero = taxi_df[(taxi_df.trip_seconds != 0) & (taxi_df.trip_miles != 0)]
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
filterZero['speed'] *= 60
filterZero = filterZero.reset_index(drop=True)
filterZero.groupby(filterZero['trip_start_timestamp'].dt.strftime('%w'))['speed'].mean().plot()
plt.xlabel('Day')
plt.ylabel('Speed(Miles per Minutes)')
plt.title('Mean Miles per Hour By Days')
plt.show() #Not working
Example rows
0 2016-01-13 06:15:00 8.000000
1 2016-01-22 09:30:00 10.500000
Small Dataset : [1250219 rows x 2 columns]
Big Dataset: [15172212 rows x 2 columns]
For a smaller dataset the code works perfectly and the plot is shown. However when I attempted to use a dataset with 15 million rows the plot shown was empty as the values were "inf" despite running mean(). Am i doing something wrong here?
0 inf
1 inf
...
5 inf
6 inf
The speed is "Miles Per Hour" by day! I was trying out all time format so there is a mismatch in the picture sorry.
Image of failed Plotting(Larger Dataset):
Image of successful Plotting(Smaller Dataset):
I can't really be sure because you do not provide a real example of your dataset, but I'm pretty sure your problem comes from the column trip_seconds.
See these two lines:
filterZero['trip_seconds'] = filterZero['trip_seconds']/60
filterZero['trip_seconds'] = filterZero['trip_seconds'].apply(lambda x: round(x,0))
If some of your values in the column trip_seconds are ≤ 30, then this line will round them to 0.0.
filterZero['speed'] = filterZero['trip_miles']/filterZero['trip_seconds']
Therefore this line will be filled with some inf values (as anything / 0.0 = inf). Taking the mean() of an array with inf values will return inf regardless.
Two things to consider:
if your values in the column trip_seconds are actually in seconds, then after dividing your values by 60, they will be in minutes, which will make your speed in miles per minutes, not per hour.
You should try without rounding the times

Categories