Random data appearing in bar plot at regular intervals - python

I have a dataset of COVID-19 data with columns = ['total_cases', 'new_cases', 'date']. The data increases monotonically, with no sudden spikes in new_cases during January. The dataset can be found here: https://fnvuusdqoptinxntjrmodi.coursera-apps.org/edit/CovidIndiaData.csv. It has many columns, of which I use only ['total_cases', 'new_cases', 'date'].
For the first 10 days, 'new_cases' is 0, as shown in this image:
I use this code to draw a bar plot of 'new_cases' against 'date':
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates  # needed for WeekdayLocator below
import numpy as np
from matplotlib.dates import DateFormatter
df = pd.read_csv("CovidIndiaData.csv", parse_dates=['date'], index_col=['date'])
df = df[['new_cases', 'total_cases']]
df = df.fillna(0)  # fillna returns a new frame, so assign it back
fig = plt.figure()
ax = plt.gca()
ax.bar(df.index.values,
       df['new_cases'],
       color='purple')
ax.set(xlabel="Date",
       ylabel="New Cases",
       title="New Cases per day",
       xlim=["2020-01-01", "2020-07-18"])
date_form = DateFormatter("%m-%d")
ax.xaxis.set_major_formatter(date_form)
ax.xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))
plt.setp(ax.get_xticklabels(), rotation=45)
plt.show()
The final plot looks like this:
The plot shows spikes at 7 January ('01-07' on the plot), where the dataset clearly has new_cases equal to 0. The spikes recur at roughly one-month intervals.
Where does this data come from? How can I plot a correct graph for this data?

Thanks to Davis Herring for pointing out my mistake.
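In case anyone faces a similar issue: the solution is to specify the date format explicitly when the dates are not in a standard format. My dates are day-first (dd-mm-yyyy), so when pandas guessed the format, ambiguous dates were apparently read month-first and rows ended up on the wrong day.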
What I did is:
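from datetime import datetime  # pd.datetime is no longer available in recent pandas
# The dates are day-first, so spell the format out instead of letting pandas guess
mydateparser = lambda x: datetime.strptime(x, "%d-%m-%Y")
df = pd.read_csv("CovidIndiaData.csv", parse_dates=['date'], date_parser=mydateparser, index_col=['date'])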
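A minimal sketch of why the format matters, using a few made-up day-first strings rather than the actual CSV (the sample values are assumptions). Without a format, pandas reads an ambiguous string such as 07-01-2020 month-first; with format="%d-%m-%Y" it is read as 7 January:

```python
import pandas as pd

# Hypothetical day-first date strings (dd-mm-yyyy), not taken from the real file
raw = pd.Series(["07-01-2020", "08-01-2020", "13-01-2020"])

# Default parsing is month-first, so "07-01-2020" becomes 1 July 2020
guessed = pd.to_datetime("07-01-2020")

# An explicit format removes the ambiguity: every value is a January date
parsed = pd.to_datetime(raw, format="%d-%m-%Y")

print(guessed.date())
print(parsed.dt.strftime("%Y-%m-%d").tolist())
```

Newer pandas versions also accept a date_format argument in read_csv, which may be a simpler replacement for date_parser there.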

Related

Changing date format and x-axis tick labels

I have read in a monthly temperature anomalies CSV file using pandas' read_csv() function. Years run from 1881 to 2022 (I excluded the last 3 months of 2022 to avoid -999 values). The date format is yyyy-mm-dd. How can I plot just the year, with only one label per year instead of 12 on the x-axis (i.e., I don't need 12 1851s, 1852s, etc.)?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.dates import YearLocator, MonthLocator, DateFormatter
import matplotlib.dates as mdates
ds = pd.read_csv('path_to_file.csv', header='infer', engine='python', skipfooter=3)
dates = ds['Date']
tAnoms = ds[' Berkeley Earth 2m Air Temperature (degree C) 0N-90N;0E-360E']
fig = plt.figure(figsize=(10,10))
ax = plt.subplot(111)
ax.plot(dates,tAnoms)
ax.plot(dates,tAnoms.rolling(60, center=True).mean())
ax.xaxis.set_major_locator(mdates.YearLocator(month=1)) # EDIT
years_fmt = mdates.DateFormatter('%Y') # EDIT 2
ax.xaxis.set_major_formatter(years_fmt) # EDIT 2
plt.show()
EDIT: adding the line marked # EDIT gives me the 2nd plot.
EDIT 2: adding the lines marked # EDIT 2 gives me yearly values, but only from 1970-1975 (3rd plot).
You could:
Create a new column year from your Date column.
Compute the average temperature for each year (using mean or median): df.groupby(['year']).mean()
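The two steps above can be sketched as follows. The small frame and the column name anomaly are made up for illustration; the real file uses a much longer column name:

```python
import pandas as pd

# Hypothetical monthly anomalies; "anomaly" stands in for the real column name
df = pd.DataFrame({
    "Date": pd.to_datetime(["1881-01-01", "1881-02-01", "1882-01-01", "1882-02-01"]),
    "anomaly": [0.1, 0.3, -0.2, 0.0],
})

# Step 1: derive a year column from the parsed dates
df["year"] = df["Date"].dt.year

# Step 2: one value per year (mean here; median works the same way)
yearly = df.groupby("year")["anomaly"].mean()
print(yearly)
```

Plotting yearly then gives one point, and one natural x label, per year.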
So, I found a good, but maybe not perfect, solution. The first thing I needed to do was use parse_dates and infer_datetime_format when reading in the CSV file, then convert the dates with to_pydatetime(). mdates.AutoDateLocator() was what I needed, along with set_major_formatter. I'm not sure how I could manually change the interval, however (e.g., change to every 10 or 25 years instead of using the default). This works well enough, though.
ds = pd.read_csv('path_to_file.csv', parse_dates=['Date'], infer_datetime_format=True,
                 header='infer', engine='python', skipfooter=3)
dates = ds['Date'].dt.to_pydatetime() # Convert to pydatetime()
tAnoms = ds[' Berkeley Earth 2m Air Temperature (degree C) 0N-90N;0E-360E']
fig = plt.figure(figsize=(10,10))
ax = plt.subplot(111)
# Produce plot
ax.plot(dates,tAnoms.rolling(60, center=True).mean())
# Use AutoDateLocator() from matplotlib.dates (mdates)
# Set date format to years
ax.xaxis.set_major_locator(mdates.AutoDateLocator())
years_fmt = mdates.DateFormatter('%Y')
ax.xaxis.set_major_formatter(years_fmt)
plt.show()
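On the open question of changing the interval manually: one option is mdates.YearLocator, whose base argument places a tick every base years. A minimal sketch with synthetic monthly data standing in for the temperature anomalies (the values and the 10-year spacing are assumptions chosen for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for on-screen plots
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic monthly series covering the same span as the question's data
dates = pd.date_range("1881-01-01", "2022-09-01", freq="MS")
values = np.linspace(-0.5, 1.2, len(dates))

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(dates.to_pydatetime(), values)

# One major tick every 10 years, labelled with the year only
ax.xaxis.set_major_locator(mdates.YearLocator(base=10))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
fig.canvas.draw()

tick_years = [mdates.num2date(t).year for t in ax.get_xticks()]
print(tick_years)
```

Changing base to 25 should give a tick every 25 years in the same way.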

Month name offset in x axis with Matplotlib

I am plotting some time series from .nc files using pandas, xarray and matplotlib. I have two datasets:
Sea Surface Temperature from 1982 to 2019, from which I plot the monthly mean for my area and represent the monthly temperature variation for those 37 years.
Sea Surface Temperature from 2020 to 2021, where I plot the monthly variation for each of the years.
To plot this, I use the following code (please note that, due to memory allocation issues I had while looping through the variables, I wrote very basic code with no loops; sorry for that!):
import xarray as xr
import matplotlib.pyplot as plt
from matplotlib import dates as md
import pandas as pd
import numpy as np
import netCDF4
import seaborn as sns
import marineHeatWaves as mhw
import datetime
sns.set()
ds_original = xr.open_dataset('sst_med_f81_to21_L4.nc')
ds_original_last = xr.open_dataset('sst_med_f20_to21_L4.nc')
extract_date = datetime.datetime.today()
date = extract_date.strftime("%Y-%m-%d")
ds1 = ds_original.sel(time=slice('1982-01-01','2019-12-31'))
ds2 = ds_original_last.sel(time=slice('2020-01-01','2020-12-31'))
ds3 = ds_original_last.sel(time=slice('2021-01-01', date))
# Convert to Pandas Dataframe
df1 = ds1.to_dataframe().reset_index().set_index('time')
df2 = ds2.to_dataframe().reset_index().set_index('time')
df3 = ds3.to_dataframe().reset_index().set_index('time')
#Converting to Celsius
def kelvin_to_celsius(temp_k):
    """
    Receives temperature in K and returns
    temperature in ºC
    """
    temp_c = temp_k - 273.15
    return temp_c
df1['analysed_sst_C'] = kelvin_to_celsius(df1['analysed_sst'])
df2['analysed_sst_C'] = kelvin_to_celsius(df2['analysed_sst'])
df3['analysed_sst_C'] = kelvin_to_celsius(df3['analysed_sst'])
#Indexing by month and yearday
df1['month'] = df1.index.month
df1['yearday'] = df1.index.dayofyear
df2['month'] = df2.index.month
df2['yearday'] = df2.index.dayofyear
df3['month'] = df3.index.month
df3['yearday'] = df3.index.dayofyear
# Calculating the average
daily_sst_82_19 = df1.analysed_sst_C.groupby(df1.yearday).agg(np.mean)
daily_sst_2020 = df2.analysed_sst_C.groupby(df2.yearday).agg(np.mean)
daily_sst_2021 = df3.analysed_sst_C.groupby(df3.yearday).agg(np.mean)
# Quick Plot
sns.set_theme(style="whitegrid")
fig, ax=plt.subplots(1, 1, figsize=(15, 7))
ax.xaxis.set_major_locator(md.MonthLocator())
ax.xaxis.set_major_formatter(md.DateFormatter('%b'))
ax.margins(x=0)
plt.plot(daily_sst_82_19, label='1982-2019')
plt.plot(daily_sst_2020,label='2020')
plt.plot(daily_sst_2021,label='2021', c = 'black')
plt.legend(loc = 'upper left')
I obtain the following plot:
I want my plot to start with Jan and end with Dec, but I cannot figure out where the problem is. I have tried to set the x-axis limits between two specific dates, but this creates a conflict, as one of the time series covers 37 years and the other two cover 1 year only.
Any help would be very appreciated!!
UPDATE
I figured out how to move the months by specifying the following:
ax.xaxis.set_major_locator(md.MonthLocator(bymonthday=2))
So I obtained this:
But I still need to delete that last Jan, and I cannot figure out how to do it.
Okay, so I figured out how to solve the issue.
While fine-tuning the plot parameters, I switched the DateFormatter to %D to see the year as well. To my surprise, the year was set to 1970, and I have no idea why, because my oldest dataset starts in 1981. Once I discovered this, I set the x limits to the ones below and it worked pretty well:
#Add to plot settings:
ax.set_xlim(np.datetime64('1970-01-01'), np.datetime64('1970-12-31'))
ax.xaxis.set_major_locator(md.MonthLocator(bymonthday=1))
ax.xaxis.set_major_formatter(md.DateFormatter('%b'))
Result:
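A likely explanation for the 1970 year: the plotted x values are day-of-year integers (1 to 366), and Matplotlib's date formatters interpret plain numbers as days counted from its date epoch, which defaults to 1970-01-01 in recent versions. A sketch of an alternative, assuming a daily climatology indexed by day of year (the series below is synthetic, and the reference year 2001 is an arbitrary choice), maps each day of year onto real dates so that the axis limits follow from the data:

```python
import numpy as np
import pandas as pd

# Synthetic daily climatology indexed by day of year (1..365),
# standing in for a series like daily_sst_82_19
yearday = np.arange(1, 366)
sst = pd.Series(20 + 5 * np.sin(2 * np.pi * (yearday - 100) / 365), index=yearday)

# Map day-of-year onto dates in an arbitrary non-leap reference year
ref_dates = pd.Timestamp("2001-01-01") + pd.to_timedelta(sst.index - 1, unit="D")
sst_by_date = pd.Series(sst.to_numpy(), index=ref_dates)

print(sst_by_date.index[0].date(), sst_by_date.index[-1].date())
```

If the data contains day 366, a leap reference year such as 2000 would be needed instead. Plotting sst_by_date with MonthLocator and DateFormatter('%b') should then run from Jan to Dec without hard-coding 1970.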

Plotly Express Chart Gaps Even with Index

I am having trouble eliminating datetime gaps in a dataset. I'm trying to create a very simple line chart in Plotly Express, but straight lines on the graph connect data points across gaps in the data (weekends).
The dataframe simply has a datetime index (to the hour) called sale_date and columns called NAME and COST, with approximately 30 days' worth of data.
df['sale_date'] = pd.to_datetime(df['sale_date'])
df = df.set_index('sale_date')
px.line(df, x=df.index, y='COST', color='NAME')
I've seen a few posts regarding this issue and one recommended setting datetime as the index, but it still yields the gap lines.
The data in the example may not be the same as yours, but the point is that you can change the x-axis data to string data instead of date/time data, or change the x-axis type to category, and add a scale and tick text.
import pandas as pd
import plotly.express as px
import numpy as np
np.random.seed(2021)
date_rng = pd.date_range('2021-08-01','2021-08-31', freq='B')
name = ['apple']
df = pd.DataFrame({'sale_date': pd.to_datetime(date_rng),
                   'COST': np.random.randint(100, 3000, (len(date_rng),)),
                   'NAME': np.random.choice(name, size=len(date_rng))})
df = df.set_index('sale_date')
fig= px.line(df, x=[d.strftime('%m/%d') for d in df.index], y='COST', color='NAME')
fig.show()
Alternatively, update the x-axis:
fig= px.line(df, x=df.index, y='COST', color='NAME')
fig.update_xaxes(type='category',
                 tickvals=np.arange(0, len(df)),
                 ticktext=[d.strftime('%m/%d') for d in df.index])

How to plot a very large data set (date,time (x axis) vs values recorded(y-axis) every 15 mins daily for a a year) in Matplotlib

I am supposed to prepare an x vs. y graph in Python. My data set consists of date-time and temperature values recorded at 15-minute intervals over a year. Say I have one month of data and try to plot it in Matplotlib: the graph is not clear, because the x-axis is crowded with date-time labels along its whole length, whereas Excel gives a good plot compared to Matplotlib.
The code I use to open 30 individual daily CSV files and concatenate them into one data frame is as follows:
import pandas as pd
from openpyxl import load_workbook
import tkinter as tk
import datetime
from datetime import datetime
from datetime import time
from tkinter import filedialog
import matplotlib.pyplot as plt
root = tk.Tk()
root.withdraw()
root.call('wm', 'attributes', '.', '-topmost', True)
files = filedialog.askopenfilename(multiple=True)
%gui tk
var = root.tk.splitlist(files)
filePaths = []
for f in var:
    df = pd.read_csv(f, skiprows=8, index_col=None, header=0, parse_dates=True, squeeze=True, encoding='ISO-8859-1', names=['Date', 'Time', 'Temperature', 'Humidty'])
    filePaths.append(df)
df = pd.concat(filePaths, axis=0, join='outer', ignore_index=False, sort=True, verify_integrity=False, levels=None)
df["Time period"] = df["Date"] + df["Time"]
plt.figure()
plt.subplots(figsize=(25,20))
plt.plot('Time period', 'Temperature', data=df, linewidth=2, color='g')
plt.title('Temperature distribution Graph')
plt.xlabel('Time')
plt.grid(True)
Example of data
The output graph looks like this:
As you can see, the x-axis of the output graph is crowded with data point labels and is not readable. Also, Matplotlib gives multiple graphs if I load and concatenate .csv files for a group of days.
The same data set plotted in Excel/Libre gives a smooth graph with orderly arranged dates on the x-axis, and the line graph is also perfect.
I want to rewrite my code to plot a graph similar to the one plotted in Excel/Libre. Please help.
Try this approach:
Use date locators to format the x-axis with the date range you require.
Date locators can be used to define intervals in seconds, minutes, ...:
SecondLocator: Locate seconds
MinuteLocator: Locate minutes
HourLocator: Locate hours
DayLocator: Locate specified days of the month
MonthLocator: Locate months
YearLocator: Locate years
In the example, I use MinuteLocator with a 15-minute interval.
Import matplotlib.dates to work with dates in plots:
import matplotlib.dates as mdates
import pandas as pd
import matplotlib.pyplot as plt
Get your data
# Sample data
# Data
df = pd.DataFrame({
    'Date': ['07/14/2020', '07/14/2020', '07/14/2020', '07/14/2020'],
    'Time': ['12:15:00 AM', '12:30:00 AM', '12:45:00 AM', '01:00:00 AM'],
    'Temperature': [22.5, 22.5, 22.5, 23.0]
})
Convert Time period from String to Date object:
# Convert data to Date and Time
df["Time period"] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
Define min and max interval:
t_min = df['Time period'].min()  # named t_min/t_max to avoid shadowing the built-ins
t_max = df['Time period'].max()
Create your plot:
# Plot
# Create figure and plot space
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot()
Set time interval using locators:
# Set Time Interval
ax.xaxis.set_major_locator(mdates.MinuteLocator(interval=15))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d %H:%M'))
Set your plot options and plot:
# Set labels
ax.set(xlabel="Time",
       ylabel="Temperature",
       title="Temperature distribution Graph", xlim=[t_min, t_max])
# Plot chart
ax.plot('Time period', 'Temperature', data=df, linewidth=2, color='g')
ax.grid(True)
fig.autofmt_xdate()
plt.show()
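A 15-minute tick spacing suits the four-row sample, but for a month or a year of readings it would produce far too many ticks. A sketch of the same approach with a coarser locator, assuming 30 days of synthetic 15-minute data (the values and the 5-day spacing are illustrative choices, not taken from the question's files):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for on-screen plots
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic 15-minute readings for 30 days (stand-in for the concatenated CSV files)
times = pd.date_range("2020-07-01", periods=30 * 96, freq="15min")
temps = 22 + 3 * np.sin(np.arange(len(times)) * 2 * np.pi / 96)

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(times, temps, linewidth=1, color="g")

# One labelled tick every 5 days instead of one per reading
ax.xaxis.set_major_locator(mdates.DayLocator(interval=5))
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
fig.autofmt_xdate()
fig.canvas.draw()

n_ticks = len(ax.get_xticks())
print(n_ticks)
```

For a full year, MonthLocator with a '%b %Y' format would be the analogous choice.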

How do I control the number of x-axis ticks?

I have pulled in a dataset that I want to use, with columns named Date and Adjusted. Adjusted is just the adjusted percentage growth on the base month.
The code I currently have is:
x = data['Date']
y = data['Adjusted']
fig = plt.figure(dpi=128, figsize=(7,3))
plt.plot(x,y)
plt.title("FTSE 100 Growth", fontsize=25)
plt.xlabel("Date", fontsize=14)
plt.ylabel("Adjusted %", fontsize=14)
plt.show()
However, when I run it I get essentially a solid black line across the bottom, where all of the dates cover each other up. It is trying to show every single date, when I only want to show the major ones. The dates are in the format Apr-19, and the data runs from Oct-03 to May-20.
How do I limit the number of date ticks and labels to one per year, or any amount I choose? If you do have a solution, if you could respond with the edits made to the code itself that would be great. I've tried other solutions I've found on here but I haven't been able to get it to work.
The dates module of matplotlib will do the job. You can control the interval by modifying the MonthLocator (it's currently set to 6 months). Here's how:
import pandas as pd
from datetime import date, datetime, timedelta
import matplotlib.pyplot as plt
import matplotlib.dates as md
import numpy as np
import matplotlib.ticker as ticker
x = data['Date']
y = data['Adjusted']
#converts differently formatted date to a datetime object
def convert_date(df):
    return datetime.strptime(df['Date'], '%b-%y')
data['Formatted_Date'] = data.apply(convert_date, axis=1)
# plot
fig, ax = plt.subplots(1, 1)
ax.plot(data['Formatted_Date'], y,'ok')
## Set time format and the interval of ticks (every 6 months)
xformatter = md.DateFormatter('%Y-%m') # format as year, month
xlocator = md.MonthLocator(interval = 6)
## Set xtick labels to appear every 6 months
ax.xaxis.set_major_locator(xlocator)
## Format xtick labels as YYYY-mm
plt.gcf().axes[0].xaxis.set_major_formatter(xformatter)
plt.title("FTSE 100 Growth", fontsize=25)
plt.xlabel("Date", fontsize=14)
plt.ylabel("Adjusted %", fontsize=14)
plt.show()
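Since the question asks for one tick per year, here is a variant sketch under the same assumptions about the data: a data frame with Date strings like Apr-19 and an Adjusted column. The four rows below are invented for illustration. It parses the whole column at once with pd.to_datetime and uses YearLocator instead of MonthLocator:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove for on-screen plots
import matplotlib.dates as md
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the question's `data` frame
data = pd.DataFrame({
    "Date": ["Oct-03", "Apr-10", "Apr-19", "May-20"],
    "Adjusted": [0.0, 12.5, 48.0, 35.2],
})

# Parse 'Mon-yy' strings for the whole column in one call
data["Formatted_Date"] = pd.to_datetime(data["Date"], format="%b-%y")

fig, ax = plt.subplots(dpi=128, figsize=(7, 3))
ax.plot(data["Formatted_Date"], data["Adjusted"])

# One tick per year, on 1 January, labelled with the year only
ax.xaxis.set_major_locator(md.YearLocator())
ax.xaxis.set_major_formatter(md.DateFormatter("%Y"))
fig.autofmt_xdate()
fig.canvas.draw()

tick_dates = [md.num2date(t) for t in ax.get_xticks()]
```

Passing base=2 or base=5 to YearLocator thins the labels further if yearly ticks are still too dense.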
Example output:
