Thank you in advance for the assistance!
I am trying to create a heat map from time-series data. The data begins mid-year, which causes the top row of my heat map to be shifted to the left so that it does not line up with the rest of the plot (shown below). How would I go about shifting just the top row over so that the data lines up with the rest of the plot?
(Code Provided Below)
import pandas as pd
import matplotlib.pyplot as plt
# link to the data
url1 = 'https://raw.githubusercontent.com/the-datadudes/deepSoilTemperature/master/minotDailyAirTemp.csv'
# load the data into a DataFrame, not a Series
# parse the dates, and set them as the index
df1 = pd.read_csv(url1, parse_dates=['Date'], index_col=['Date'])
# groupby year and aggregate Temp into a list
dfg1 = df1.groupby(df1.index.year).agg({'Temp': list})
# create a wide format dataframe with all the temp data expanded
df1_wide = pd.DataFrame(dfg1.Temp.tolist(), index=dfg1.index)
# plotting the data
fig, (ax1) = plt.subplots(ncols=1, figsize=(20, 5))
ax1.matshow(df1_wide, interpolation=None, aspect='auto');
The problem comes from the dates in the dataset: the first record is
`1990-4-24,15.533`
To fix the alignment, pad the data with NaN for 1990-01-01 through 1990-04-23 and drop every 29 February, so that each year has the same number of days.
import numpy as np
# placeholder rows (NaN) for the days before the first record
rng = pd.date_range(start='1990-01-01', end='1990-04-23', freq='D')
df = pd.DataFrame(index=rng)
df['Temp'] = np.nan
frames = [df, df1]
result = pd.concat(frames)
result = result[~((result.index.month == 2) & (result.index.day == 29))]
With the padded data:
dfg1 = result.groupby(result.index.year).agg({'Temp': list})
df1_wide = pd.DataFrame(dfg1['Temp'].tolist(), index=dfg1.index)
# plotting the data
fig, (ax1) = plt.subplots(ncols=1, figsize=(20, 5))
ax1.matshow(df1_wide, interpolation=None, aspect='auto');
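An alternative to padding by hand is to pivot on year and day-of-year, so that every value lands in the column for its calendar day regardless of where the series starts. The following is only a sketch on synthetic data: `demo`, `doy` and `wide` are made-up names, and the leap-year adjustment is one possible way to keep 365 columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the temperature series: starts mid-year, like the real data.
idx = pd.date_range("1990-04-24", "1992-12-31", freq="D")
demo = pd.DataFrame({"Temp": np.arange(len(idx), dtype=float)}, index=idx)

# Drop 29 Feb so every year has 365 days.
demo = demo[~((demo.index.month == 2) & (demo.index.day == 29))]

# Position of each date within a 365-day year (1..365).
# After February in a leap year, dayofyear is one too high, so subtract 1.
after_feb_in_leap = demo.index.is_leap_year & (demo.index.month > 2)
doy = np.asarray(demo.index.dayofyear) - after_feb_in_leap.astype(int)

# Rows are years, columns are days; days without data become NaN automatically.
wide = (
    demo.assign(year=np.asarray(demo.index.year), doy=doy)
        .pivot(index="year", columns="doy", values="Temp")
        .reindex(columns=range(1, 366))
)
```

With this layout, `ax1.matshow(wide, aspect='auto')` should draw the first year starting at day 114 rather than at the left edge.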
The unfilled portions are a consequence of the NaN values in your dataset. One option is to replace the NaN values with the column mean or with the row mean.
Other ways to replace NaN values are also available. For example, to fill with the column mean:
df1_wide = df1_wide.apply(lambda x: x.fillna(x.mean()),axis=0)
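To make the options concrete, here is a small sketch on a made-up 3x3 frame (`wide` stands in for `df1_wide`; the values are invented). It shows the column-mean fill, a row-mean fill, and linear interpolation along each row as a third possibility.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for df1_wide: rows are years, columns are days.
wide = pd.DataFrame(
    [[np.nan, 2.0, 4.0],
     [1.0, np.nan, 5.0],
     [3.0, 6.0, np.nan]],
    index=[1990, 1991, 1992],
)

# Column mean: average of the same day across years (same result as the apply line above).
by_day = wide.fillna(wide.mean(axis=0))

# Row mean: average of the same year across days.
by_year = wide.T.fillna(wide.mean(axis=1)).T

# Linear interpolation along each row; gaps at the start or end of a row stay NaN.
interp = wide.interpolate(axis=1, limit_area="inside")
```

Which fill is appropriate depends on the data: the column mean hides year-to-year differences, while the row mean hides the seasonal cycle.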
I'm playing around with a Kaggle dataset to practice using matplotlib.
I was creating bar graphs one by one, but the figures keep piling up.
When I called plt.show(), about 10 figure windows suddenly showed up.
Is it possible to combine 4 of those figures into 1 window?
These parts all belong to the same segment, "Time Analysis", so I want to combine these 4 figures in 1 window.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dataset = 'accidents_data.csv'
df = pd.read_csv(dataset)
"""Time Analysis :
Analyze the time that accidents happen for various patterns and trends"""
df.Start_Time = pd.to_datetime(df.Start_Time) #convert the start time column to date time format
df['Hour_of_Accident'] = df.Start_Time.dt.hour #extract the hour from the time data
hour_accident = df['Hour_of_Accident'].value_counts()
hour_accident_df = hour_accident.to_frame() #convert the series data to dataframe in order to sort the index columns
hour_accident_df.index.names = ['Hours'] #naming the index column
hour_accident_df.sort_index(ascending=True, inplace=True)
print(hour_accident_df)
# Plotting the hour of accidents data in a bargraph
hour_accident_df.plot(kind='bar',figsize=(8,4),color='blue',title='Hour of Accident')
#plt.show() #Show the bar graph
"""Analyzing the accident frequency per day of the week"""
df['Day_of_the_week'] = df.Start_Time.dt.day_of_week
day_of_accident = df['Day_of_the_week'].value_counts()
day_of_accident_df = day_of_accident.to_frame() #convert the series data to dataframe so that we can sort the index columns
day_of_accident_df.index.names = ['Day'] # Renaming the index column
day_of_accident_df.sort_index(ascending=True, inplace=True)
print(day_of_accident_df)
f, ax = plt.subplots(figsize = (8, 5))
x = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
l = day_of_accident_df.index.values
y = day_of_accident_df.iloc[:, 0]  # the counts; the column name differs between pandas versions
plt.bar(l, y, color='green')
plt.title('Day of the week vs total number of accidents')
plt.ylabel("No. of accidents recorded")
ax.set_xticks(l)
ax.set_xticklabels(x)
#plt.show()
"""Analysis for the months"""
df['Month'] = df.Start_Time.dt.month
accident_month = df['Month'].value_counts()
accident_month_df = accident_month.to_frame() #convert the series data to dataframe so that we can sort the index columns
accident_month_df.index.names = ['Month'] # Renaming the index column
accident_month_df.sort_index(ascending=True, inplace=True)
print(accident_month_df)
#Plotting the Bar Graph
accident_month_df.plot(kind='bar',figsize=(8,5),color='purple',title='Month of Accident')
"""Yearly Analysis"""
df['Year_of_accident'] = df.Start_Time.dt.year
#Check the yearly trend
yearly_count = df['Year_of_accident'].value_counts()
yearly_count_df = pd.DataFrame({'Year':yearly_count.index, 'Accidents':yearly_count.values})
yearly_count_df.sort_values(by='Year', ascending=True, inplace=True)
print(yearly_count_df)
#Creating line plot
yearly_count_df.plot.line(x='Year',color='red',title='Yearly Accident Trend ')
plt.show()
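One common way to get the four plots into a single window is to create one figure with a 2x2 grid of axes and pass each axis to the plotting call via `ax=`. The sketch below uses synthetic timestamps in place of `accidents_data.csv`, and the short day labels are illustrative, so treat it as an outline rather than a drop-in replacement.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the accidents data; only Start_Time is needed here.
rng = np.random.default_rng(0)
hours = rng.integers(0, 4 * 365 * 24, size=500)
df = pd.DataFrame(
    {"Start_Time": pd.Timestamp("2016-01-01") + pd.to_timedelta(hours, unit="h")}
)

# One figure with a 2x2 grid of axes instead of four separate figures.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
start = df["Start_Time"]

# Passing ax=... stops pandas from opening a new figure for each plot.
start.dt.hour.value_counts().sort_index().plot(
    kind="bar", ax=axes[0, 0], color="blue", title="Hour of Accident")

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
by_day = start.dt.dayofweek.value_counts().reindex(range(7), fill_value=0)
by_day.plot(kind="bar", ax=axes[0, 1], color="green",
            title="Day of the week vs total number of accidents")
axes[0, 1].set_xticklabels(days, rotation=0)

start.dt.month.value_counts().sort_index().plot(
    kind="bar", ax=axes[1, 0], color="purple", title="Month of Accident")

start.dt.year.value_counts().sort_index().plot(
    kind="line", ax=axes[1, 1], color="red", title="Yearly Accident Trend")

fig.tight_layout()
```

With an interactive backend (drop the `matplotlib.use("Agg")` line), a single `plt.show()` at the end should then open one window containing all four panels.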
I am trying to plot multiple different lines on the same graph using pandas and matplotlib.
I have a series of 100 synthetic temperature histories that I want to plot, all in grey, against the original real temperature history I used to generate the data.
How can I plot all of these series on the same graph? I know how to do it in MATLAB but am using Python for this project, and pandas has been the easiest way I have found to read in every single column from the output file without having to specify each column individually. The number of columns of data will change from 100 up to 1000, so I need a generic solution. My code plots each of the data series individually fine, but I just need to work out how to add them both to the same figure.
Here is the code so far:
# dirPath is the path to my working directory
outputFile = "output.csv"
original_data = "temperature_data.csv"
# Read in the synthetic temperatures from the output file, time is the index in the first column
data = pd.read_csv(outputFile,header=None, skiprows=1, index_col=0)
# Read in the original temperature data, time is the index in the first column
orig_data = pd.read_csv(dirPath+original_data,header=None, skiprows=1, index_col=0)
# Convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
# Plot all columns of synthetic data in grey
data = data.plot.line(title="ARMA Synthetic Temperature Histories",
xlabel="Time (yrs)",
ylabel=("Synthetic average hourly temperature (C)"),
color="#929591",
legend=None)
# Plot one column of original data in black
orig_data = orig_data.plot.line(color="k",legend="Original temperature data")
# Create and save figure
fig = data.get_figure()
fig = orig_data.get_figure()
fig.savefig("temp_arma.png")
This is some example data for the output data:
And this is the original data:
Plotting each individually gives these graphs - I just want them overlaid!
Your data.plot.line call returns an AxesSubplot instance. You can capture it and pass it to your second command:
# plot 1
ax = data.plot.line(…)
# plot 2
data.plot.line(…, ax=ax)
Try to run this code:
# convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
# Plot all columns of synthetic data in grey
ax = data.plot.line(title="ARMA Synthetic Temperature Histories",
xlabel="Time (yrs)",
ylabel=("Synthetic average hourly temperature (C)"),
color="#929591",
legend=None)
# Plot one column of original data in black
orig_data.plot.line(color="k",legend="Original temperature data", ax=ax)
# Create and save figure
ax.figure.savefig("temp_arma.png")
You should use matplotlib functions directly. They offer more control and are easy to use as well.
Part 1 - Reading files (borrowing your code)
# Read in the synthetic temperatures from the output file, time is the index in the first column
data = pd.read_csv(outputFile,header=None, skiprows=1, index_col=0)
# Read in the original temperature data, time is the index in the first column
orig_data = pd.read_csv(dirPath+original_data,header=None, skiprows=1, index_col=0)
# Convert data to float format
data = data.astype(float)
orig_data = orig_data.astype(float)
Part 2 - Plotting
fig = plt.figure(figsize=(10,8))
ax = plt.gca()
# Plotting all histories:
# time is the index (index_col=0), so every remaining column is a history
for col in data.columns:
    ax.plot(data.index, data[col], color='grey')
# Original data: time is the index, temperature is the first column
ax.plot(orig_data.index, orig_data.iloc[:, 0], color='k')
# axis labels
ax.set_xlabel("Time (yrs)")
ax.set_ylabel("average hourly temperature (C)")
fig.savefig("temp_arma.png")
Try the following:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
data.plot.line(ax=ax,
title="ARMA Synthetic Temperature Histories",
xlabel="Time (yrs)",
ylabel="Synthetic average hourly temperature (C)",
color="#929591",
legend=False)
orig_data.rename(columns={orig_data.columns[0]: "Original temperature data"},
inplace=True)
orig_data.plot.line(ax=ax, color="k")
It's pretty much your original code with the following slight modifications:
Getting the ax object
fig, ax = plt.subplots()
and using it for the plotting
data.plot.line(ax=ax, ...
...
orig_data.plot.line(ax=ax, ...)
Result for some randomly generated sample data:
import random # For sample data only
# Sample data
data = pd.DataFrame({
f'col_{i}': [random.random() for _ in range(25)]
for i in range(1, 50)
})
orig_data = pd.DataFrame({
'col_0': [random.random() for _ in range(25)]
})
Thank you in advance! (Image provided below)
I am trying to have the Y-axis of my heatmap show the year associated with each row of data. Instead, the Y-axis merely counts the years (0, 1, 2, ..., 30) when it should show 1990, 1995, 2000, etc.
How do I update my code (provided below) so that the Y-Axis shows the actual year instead of the year count?
# links to Minot data if you want to pull from the web
##url2 = 'https://raw.githubusercontent.com/the-datadudes/deepSoilTemperature/master/allStationsDailyAirTemp1.csv'
raw_data = pd.read_csv('https://raw.githubusercontent.com/the-datadudes/deepSoilTemperature/master/allStationsDailyAirTemp1.csv', index_col=1, parse_dates=True)
df_all_stations = raw_data.copy()
selected_station = 'Minot'
# load the data into a DataFrame, not a Series
# parse the dates, and set them as the index
df1 = df_all_stations[df_all_stations['Station'] == selected_station]
# groupby year and aggregate Temp into a list
dfg1 = df1.groupby(df1.index.year).agg({'Temp': list})
# create a wide format dataframe with all the temp data expanded
df1_wide = pd.DataFrame(dfg1.Temp.tolist(), index=dfg1.index)
# adding the data between 1990/01/01 -/04/23 and delete the 29th of Feb
rng = pd.date_range(start='1990-01-01', end='1990-04-23', freq='D')
df = pd.DataFrame(index= rng)
df.index = pd.to_datetime(df.index)
df['Temp'] = np.nan
frames = [df, df1]
result = pd.concat(frames)
result = result[~((result.index.month == 2) & (result.index.day == 29))]
dfg1 = result.groupby(result.index.year).agg({'Temp': list})
df1_wide = pd.DataFrame(dfg1['Temp'].tolist(), index=dfg1.index)
# Setting all leftover empty fields to the average of that time in order to fill in the gaps
df1_wide = df1_wide.apply(lambda x: x.fillna(x.mean()),axis=0)
# plotting the data
fig, (ax1) = plt.subplots(ncols=1, figsize=(20, 5))
##ax1.set_title('Average Daily Air Temperature - Minot Station')
ax1.set_xlabel('Day of the year')
ax1.set_ylabel('Years since start of data collection')
# Setting the title so that it changes based off of the selected station
ax1.set_title('Average Air Temp for ' + str(selected_station))
# Creating Colorbar
cbm = ax1.matshow(df1_wide, interpolation=None, aspect='auto');
# Plotting the colorbar
cb = plt.colorbar(cbm, ax=ax1)
cb.set_label('Temp in Celsius')
Add this line at the end of your code:
ax1.set_yticklabels(['']+df1_wide.index.tolist()[::5])
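The line above relies on where matplotlib happens to place its default ticks. A variant that sets the tick positions explicitly and then labels them may be more predictable. The sketch below uses random data and the made-up names `wide` and `rows`, assuming one row per year as in `df1_wide`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in for df1_wide: one row per year, one column per day.
years = list(range(1990, 2021))
values = np.random.default_rng(0).normal(size=(len(years), 365))
wide = pd.DataFrame(values, index=years)

fig, ax = plt.subplots(figsize=(20, 5))
im = ax.matshow(wide, aspect="auto")

# Choose which rows get a tick (every 5th), then label those rows with their year.
rows = np.arange(0, len(wide.index), 5)
ax.set_yticks(rows)
ax.set_yticklabels([str(year) for year in wide.index[rows]])

fig.colorbar(im, ax=ax)
```

Because positions and labels are set together, the labels should stay aligned with the right rows even if the number of years changes.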
I have written code that shows a heatmap of the data in a CSV file.
The code is as follows:
import pandas as pd
import matplotlib.pyplot as plt
data= pd.read_csv("data.csv" , sep=';', header=0,
index_col='Date')
fig=plt.imshow(data, cmap='YlOrBr', interpolation='nearest')
plt.colorbar()
plt.xlabel("Time (UTC)")
plt.ylabel("Date")
plt.show()
The dataset is as follows:
The time range varies from 00:00 till 23:50 with steps of 10 minutes.
I want the x axis to show the time from 00:00 till 23:50 in one-hour steps.
The index is set to the date. The date range is from 29-Oct-2017 till 24-Mar-2018.
I want the y axis to show the date range in monthly steps.
You can stack columns, then groupby month and hour and then unstack it back (I'm taking mean values here when aggregating, but you can change to sum or whatever aggregation should be done there):
import numpy as np
df = pd.DataFrame(np.nan,
columns=pd.date_range('00:00', '23:50', freq='10min'),
index=pd.date_range('2017-10-29', '2018-03-24'))
df[df.columns] = np.random.randint(0, 100, df.shape)
fig, ax = plt.subplots(2, figsize=(10,6))
ax[0].imshow(df, cmap='YlOrBr')
ix = df.stack().index
l1 = ix.get_level_values(0).month
l2 = ix.get_level_values(1).hour
df2 = df.stack().groupby([l1,l2], sort=False).mean().unstack(1)
ax[1].imshow(df2, cmap='YlOrBr')
Output (original DataFrame above, processed below):
Update:
If the goal is just to put monthly and hourly labels on the same plot, please see below:
df = pd.DataFrame(np.nan,
columns=pd.date_range('00:00', '23:50', freq='10min').astype(str),
index=pd.date_range('2017-10-29', '2018-03-24').astype(str))
df[df.columns] = np.random.randn(*(df.shape))
fig, ax = plt.subplots(1, figsize=(10,6))
l1 = pd.to_datetime(df.index).month
l2 = pd.to_datetime(df.columns).hour
x = pd.Series(l2).drop_duplicates()
y = pd.Series(l1).drop_duplicates()
ax.imshow(df, cmap='YlOrBr')
ax.set_xticks(x.index)
ax.set_xticklabels(x)
ax.set_yticks(y.index)
ax.set_yticklabels(y)
Output:
The original dataset contains 4 tables named df1, df2, df3, df4 (all pandas DataFrames).
df1 = pd.read_csv("./df1.csv")
df2 = pd.read_csv("./df2.csv")
df3 = pd.read_csv("./df3.csv")
df4 = pd.read_csv("./df4.csv")
# Collect the four datasets in a list
dataset = [df1, df2, df3, df4]
# Plotting
fig = plt.figure()
bpl = plt.boxplot(dataset, positions=np.arange(len(dataset)) * 2.0 - 0.4,
sym='+', widths=0.5, patch_artist=True)
plt.show()
But the box for the first dataset, df1, was missing from the plot. I checked df1 and found nothing abnormal.
I have uploaded these 4 datasets here in .csv format.
Any advice would be appreciated!
Update
I could make the plot without any problem.