Missing Data and Graphing with Pandas and Matplotlib

I want my matplotlib plot to display my df's DatetimeIndex as a consecutive count (in seconds) on the x-axis and my df's Load data on the y-axis. Then I want to overlay it with a scipy.signal find_peaks result (which has an x-axis of consecutive seconds). My real-world data has gaps (some seconds are missing), although it is sampled at one-second resolution.
Code
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal
import numpy as np
# Create Sample Dataset
df = pd.DataFrame([['2020-07-25 09:26:28',2],['2020-07-25 09:26:29',10],['2020-07-25 09:26:32',203],['2020-07-25 09:26:33',30]],
columns = ['Time','Load'])
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index("Time")
print(df)
# Try to solve the problem
rng = pd.date_range(df.index[0], df.index[-1], freq='s')
print(rng)
peaks, _ = signal.find_peaks(df["Load"])
plt.plot(rng, df["Load"])
plt.plot(peaks, df["Load"][peaks], "x")
plt.plot(np.zeros_like(df["Load"]), "--", color="gray")
plt.show()
This code does not work because rng has a length of 6, while the df has a length of 4. I think I might be going about this the wrong way entirely. Thoughts?

You are really close - I think you can get what you want by reindexing your df with your range. For instance:
df = df.reindex(rng).fillna(0)
peaks, _ = signal.find_peaks(df["Load"])
...
Does that do what you expect?
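A minimal end-to-end sketch of the reindexing idea, using the sample data from the question. The names full, seconds and load are illustrative and not from the original post, and the Agg backend line is only there so the sketch also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal

# Sample data from the question: two seconds (09:26:30, 09:26:31) are missing.
df = pd.DataFrame(
    {"Load": [2, 10, 203, 30]},
    index=pd.to_datetime([
        "2020-07-25 09:26:28", "2020-07-25 09:26:29",
        "2020-07-25 09:26:32", "2020-07-25 09:26:33",
    ]),
)

# One row per second between the first and last timestamp; gaps are filled with 0.
rng = pd.date_range(df.index[0], df.index[-1], freq="s")
full = df.reindex(rng).fillna(0)

# Consecutive second count for the x-axis; these are the same positions
# that find_peaks reports.
seconds = np.arange(len(full))
load = full["Load"].to_numpy()
peaks, _ = signal.find_peaks(load)

fig, ax = plt.subplots()
ax.plot(seconds, load)
ax.plot(peaks, load[peaks], "x")
ax.axhline(0, linestyle="--", color="gray")
ax.set_xlabel("Seconds since first sample")
ax.set_ylabel("Load")
```

Call plt.show() at the end to display the figure. Note that filling gaps with 0 treats a missing reading as zero load, which can turn the sample just before a gap into a peak; whether that is acceptable depends on what the gaps mean in your data.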

Related

Python: How to construct a joyplot with values taken from a column in pandas dataframe as y axis

I have a dataframe df in which the column extracted_day consists of dates ranging from 2022-05-08 to 2022-05-12. I have another column named gas_price, which holds the price of gas. I want to construct a joyplot such that, for each date, gas_price is on the y-axis and minutes_elapsed_from_start_of_day is on the x-axis. A ridgeplot or any other plot type is fine if this doesn't work.
This is the code that I have written, but it doesn't serve my purpose.
from joypy import joyplot
import matplotlib.pyplot as plt
df['extracted_day'] = df['extracted_day'].astype(str)
joyplot(df, by = 'extracted_day', column = 'minutes_elapsed_from_start_of_day',figsize=(14,10))
plt.xlabel("Number of minutes elapsed throughout the day")
plt.show()

Create dataframe with mock data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from joypy import joyplot
np.random.seed(111)
df = pd.DataFrame({
'minutes_elapsed_from_start_of_day': np.tile(np.arange(1440), 5),
'extracted_day': np.repeat(['2022-05-08', '2022-05-09', '2022-05-10','2022-05-11', '2022-05-12'], 1440),
'gas_price': abs(np.cumsum(np.random.randn(1440*5)))})
Then create the joyplot. It is important that you set kind='values', since you do not want joyplot to show KDEs (kernel density estimates, joyplot's default) but the raw gas_price values:
joyplot(df, by='extracted_day',
column='gas_price',
kind='values',
x_range=np.arange(1440),
figsize=(7,5))
The resulting joyplot draws one ridge per day, with the fake gas prices represented by the y-values of the lines.
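Since the question allows any other plot type, here is a rough alternative sketch that uses only pandas and matplotlib instead of joypy: one stacked subplot per day on a shared x-axis. This is not what joyplot does internally, and the variable name wide is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Mock data in the same shape as in the answer above.
days = ["2022-05-08", "2022-05-09", "2022-05-10", "2022-05-11", "2022-05-12"]
rng = np.random.default_rng(111)
df = pd.DataFrame({
    "minutes_elapsed_from_start_of_day": np.tile(np.arange(1440), len(days)),
    "extracted_day": np.repeat(days, 1440),
    "gas_price": np.abs(np.cumsum(rng.standard_normal(1440 * len(days)))),
})

# Wide layout: one row per minute, one column per day.
wide = df.pivot(index="minutes_elapsed_from_start_of_day",
                columns="extracted_day", values="gas_price")

# One row of axes per day, sharing both axes so the days are comparable.
fig, axes = plt.subplots(len(wide.columns), 1, sharex=True, sharey=True,
                         figsize=(7, 5))
for ax, day in zip(axes, wide.columns):
    ax.fill_between(wide.index, wide[day], alpha=0.6)
    ax.set_ylabel(day, rotation=0, ha="right", va="center")
axes[-1].set_xlabel("Minutes elapsed since start of day")
```

Call plt.show() to display it. The ridges do not overlap here, which is the main visual difference from a joyplot.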

Pandas dataframe with time index and freq with multiplier

I have a DataFrame with a time series as index.
import pandas as pd
from numpy.random import rand
df = pd.DataFrame(rand(100000), index=None, columns=['a'])
df['time'] = pd.date_range('2020-01-01 12:30:15',
periods=len(df['a']), freq='ms')
df.set_index('time', inplace=True)
df.plot()
When I put a multiplier on the frequency, plotting df2 becomes extremely slow, even though it has fewer elements than df. EDIT 1: It actually crashed my Python kernel, and my laptop almost ran out of RAM.
df2 = pd.DataFrame(rand(50000), index=None, columns=['a'])
df2['time'] = pd.date_range('2020-01-01 12:30:15',
periods=len(df2['a']), freq='2.5ms')
df2.set_index('time', inplace=True)
df2.plot()
Is this behavior normal?
Thanks
EDIT 2: Versions of packages
Python 3.9.3-1
Pandas 1.2.3-1
Numpy 1.20.1-1
Matplotlib 3.4.1-2
EDIT 3: It works when plotting directly with matplotlib
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(figsize=(12,5))
ax1.plot(df2);
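A self-contained version of the workaround from EDIT 3, as a sketch: it passes the index and the column to matplotlib explicitly instead of going through df2.plot(). It does not explain the slowdown, it only avoids the pandas plotting path. The frequency is written as '2500us', which is assumed here to be equivalent to the '2.5ms' used above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

n = 50_000
# 2.5 ms spacing, written in whole microseconds.
index = pd.date_range("2020-01-01 12:30:15", periods=n, freq="2500us")
df2 = pd.DataFrame({"a": np.random.default_rng(0).random(n)}, index=index)

# Plot with matplotlib directly rather than with df2.plot().
fig, ax1 = plt.subplots(figsize=(12, 5))
ax1.plot(df2.index, df2["a"])
ax1.set_xlabel("time")
```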

python plotting multiple bars

I've been trying to do this for several hours and I get an error every time. I want to create 3 bar plots in one graph. The y-axis should run from 0 to 1000.
The end result should be this
That's my code:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv('razemKM.csv')
dfn = pd.read_csv('razemNPM.csv')
print(df)
y=[0,1000]
a=(df["srednia"]-df["odchStand"])
a1=df["srednia"]
a2=(df["srednia"]+df["odchStand"])
plt.bar(y,a,width=0.1,color='r')
plt.bar(y,a1,width=0.1,color='g')
plt.bar(y,a2,width=0.1,color='y')
plt.show()

You can use pandas plot function:
df['Sum'] = df["srednia"]+df["odchStand"]
df['Diff'] = df["srednia"]-df["odchStand"]
df.plot.bar(y=['Diff','srednia', 'Sum'],width=0.1)
plt.show()
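A runnable sketch of the same approach with made-up numbers, since razemKM.csv is not available here. The values in the mock frame are invented for illustration; the colors and the 0 to 1000 y-range follow the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Mock data standing in for razemKM.csv (mean and standard deviation columns).
df = pd.DataFrame({
    "srednia": [400.0, 550.0, 700.0],
    "odchStand": [50.0, 80.0, 120.0],
})
df["Diff"] = df["srednia"] - df["odchStand"]
df["Sum"] = df["srednia"] + df["odchStand"]

# One group of three bars per row: mean - std, mean, mean + std.
ax = df.plot.bar(y=["Diff", "srednia", "Sum"],
                 color=["r", "g", "y"], ylim=(0, 1000))
ax.set_ylabel("value")
```

Call plt.show() to display the chart.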

Splitting large data set and plotting the average in matplotlib

I have a large data set with over 10,000 rows with values between 0 and 400,000,000. I would like to plot those values vs. the mean of another column in matplotlib, with the x-axis incrementing by 50,000,000, but I am unsure how to do so. I can plot it using pandas, but I would really like to do it with matplotlib directly. This is what I have in pandas:
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
mean_values.plot(kind='line',figsize=(12,5))

I think I figured out what your problem is:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# Create some data
df = pd.DataFrame({'budget_adj': np.random.uniform(0, 4000000000, 10000),
'vote_average': np.random.uniform(0, 100000, 10000)})
# Calculate the mean values
mean_values = df.groupby(pd.cut(df['budget_adj'],np.arange(0,4000000000,50000000)))['vote_average'].mean()
And this is what I suspect you are doing:
# This won't work since mean_values.index holds intervals
plt.plot(mean_values.index, mean_values)
This won't work because your index is a categorical index of intervals. For plot to work, your x-values have to be numbers. We can convert the intervals to numbers in several ways:
# You can pick the left endpoint...
x_values = [i.left for i in mean_values.index]
# the right endpoint...
x_values = [i.right for i in mean_values.index]
# or the center value.
x_values = [i.mid for i in mean_values.index]
# And NOW you will get no error
plt.plot(x_values, mean_values)
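Put together as one runnable sketch, using the interval midpoints as x-values. The random data is made up, the bin range here is 0 to 400,000,000 as stated in the question, and observed=False is passed explicitly so that every bin is kept in the result.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data in the value range described in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "budget_adj": rng.uniform(0, 400_000_000, 10_000),
    "vote_average": rng.uniform(0, 10, 10_000),
})

# Bin edges every 50,000,000 from 0 to 400,000,000 (8 bins).
bins = np.arange(0, 400_000_001, 50_000_000)
mean_values = df.groupby(pd.cut(df["budget_adj"], bins),
                         observed=False)["vote_average"].mean()

# Each index entry is a pandas Interval; use its midpoint as a numeric x-value.
x_values = [interval.mid for interval in mean_values.index]

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(x_values, mean_values.to_numpy())
ax.set_xticks(bins)
ax.set_xlabel("budget_adj")
ax.set_ylabel("mean vote_average")
```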

Time-series boxplot in pandas

How can I create a boxplot for a pandas time-series where I have a box for each day?
Sample dataset of hourly data where one box should consist of 24 values:
import numpy as np
import pandas as pd
n = 480
ts = pd.Series(np.random.randn(n),
index=pd.date_range(start="2014-02-01",
periods=n,
freq="H"))
ts.plot()
I am aware that I could make an extra column for the day, but I would like to have proper x-axis labeling and x-limit functionality (like in ts.plot()), so being able to work with the datetime index would be great.
There is a similar question for R/ggplot2 here, if it helps to clarify what I want.

If it's an option for you, I would recommend using Seaborn, which is built on top of Matplotlib. You could do it yourself by looping over the groups from your time series, but that's much more work.
import pandas as pd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
# Pass x and y by keyword; recent seaborn versions no longer accept them positionally
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)
This gives one box per day.
Note that I'm passing the day of year as the grouper to seaborn; if your data spans multiple years, this wouldn't work. You could then consider something like:
ts.index.to_series().apply(lambda x: x.strftime('%Y%m%d'))
Edit: for 3-hourly boxes you could use this as a grouper, but it only works if the timestamps have no minutes or smaller units set:
import datetime
[(dt - datetime.timedelta(hours=int(dt.hour % 3))).strftime('%Y%m%d%H') for dt in ts.index]

(Not enough rep to comment on accepted solution, so adding an answer instead.)
The accepted code has two small errors: (1) the numpy import needs to be added and (2) the x and y parameters in the boxplot statement need to be swapped. The following produces the plot shown.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
n = 480
ts = pd.Series(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
fig, ax = plt.subplots(figsize=(12,5))
seaborn.boxplot(x=ts.index.dayofyear, y=ts, ax=ax)

I have a solution that may be helpful. It only uses native pandas and allows for hierarchical date-time grouping (e.g. spanning years). The key is that if you pass a function to groupby(), it will be called on each element of the dataframe's index. If your index is a DatetimeIndex (or similar), each element is a timestamp, so you can use all of its convenience methods (such as strftime) to build the group keys!
Try this:
import numpy as np
import pandas as pd
n = 480
ts = pd.DataFrame(np.random.randn(n), index=pd.date_range(start="2014-02-01", periods=n, freq="H"))
ts.groupby(lambda x: x.strftime("%Y-%m-%d")).boxplot(subplots=False, figsize=(12,9), rot=90)
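To see what the function passed to groupby() actually produces, here is a small sketch that builds the same daily groups and inspects their sizes. The column name value and the names by_day and sizes are illustrative; the hourly frequency is given as a Timedelta rather than as a string alias.

```python
import numpy as np
import pandas as pd

n = 480
# Hourly index covering 20 full days.
index = pd.date_range("2014-02-01", periods=n, freq=pd.Timedelta(hours=1))
ts = pd.DataFrame({"value": np.random.default_rng(0).standard_normal(n)},
                  index=index)

# The function receives each index element (a Timestamp) and returns its group key.
by_day = ts.groupby(lambda stamp: stamp.strftime("%Y-%m-%d"))

# Each daily group should hold 24 hourly values, i.e. one box per day.
sizes = by_day.size()
print(sizes.head())
```

by_day is the same kind of grouped object on which the answer above calls .boxplot(subplots=False, ...). Because the key includes the year, data spanning several years stays separated.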
