Missing data in Boxplot using matplotlib - python

The original dataset contains four tables named df1, df2, df3 and df4 (all pandas DataFrames):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.read_csv("./df1.csv")
df2 = pd.read_csv("./df2.csv")
df3 = pd.read_csv("./df3.csv")
df4 = pd.read_csv("./df4.csv")
# Collect the four DataFrames in a list
dataset = [df1, df2, df3, df4]
# Plotting (range replaces the Python 2 xrange)
fig = plt.figure()
bpl = plt.boxplot(dataset, positions=np.array(range(len(dataset)))*2.0-0.4,
                  sym='+', widths=0.5, patch_artist=True)
plt.show()
But the box for the first dataset, df1, is missing from the plot. I checked df1 and found nothing abnormal.
I have uploaded these 4 datasets here in .csv format.
Any advice would be appreciated!
Update

I could make the plot without any problem.
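The thread does not say what caused the missing box. As a minimal sketch for narrowing it down, the plot can be reproduced with synthetic data, passing one 1-D array per box. The column name "value" and the random data are assumptions made for illustration, not taken from the uploaded files. Dropping NaN values first is a precaution, since missing values are one possible reason for a box to come out empty.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins for df1..df4; the column name "value" is hypothetical
rng = np.random.default_rng(0)
frames = [pd.DataFrame({"value": rng.normal(loc=i, size=50)}) for i in range(4)]

# One 1-D array per box, with missing values removed
dataset = [frame["value"].dropna().to_numpy() for frame in frames]
positions = np.arange(len(dataset)) * 2.0 - 0.4

fig, ax = plt.subplots()
bpl = ax.boxplot(dataset, positions=positions, sym="+", widths=0.5,
                 patch_artist=True)
```

Checking len(bpl["boxes"]) against len(dataset) is a quick way to confirm that every dataset produced a box.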

Related

Make a Plot Bar from Pandas multilevel dataframe

I'm having trouble making a bar plot from a pandas dataframe, which should be easy, but I can't make it work.
I have a dataframe that looks like this:
           Data A  Data B  Data C
timestamp
06:54:00      0.1     0.2     0.3
But instead of 3 columns with Data, I have 99.
I am trying to make a bar plot with the different Data columns on the x axis and their values on the y axis.
I tried with:
p = data.hvplot.bar(x = 'Data', y = 'Units', rot = 90)
And
p = data.plot(kind='bar', title="Data", figsize=(15, 10), legend=True, fontsize=12)
But neither of them works, and I think the problem comes from the format of my dataframe, because of the 'timestamp' column.
However, I haven't managed to delete it. I tried:
data = data.droplevel('timestamp')
And:
data = data.drop(['timestamp'], axis=1)
But neither of them works. Could someone please give me a hand with that?
I finally managed to solve it.
What I did was:
new_df = data.melt(var_name="Data")
To get a new dataframe without the timestamp.
Then:
titles = new_df['Data'].to_list()
values = new_df['value'].to_list()
To get two lists, one with the titles and another one with the values.
And then I plotted the chart with the following code:
from bokeh.plotting import figure
p = figure(x_range=titles, height=500, width=1500, title="Unit",
           toolbar_location=None, tools="")
p.vbar(x=titles, top=values, width=0.6)
p.xgrid.grid_line_color = None
p.xaxis.major_label_orientation = "vertical"
p.y_range.start = 0
Thank you all.
You can try this:
new_df = df.melt(var_name="Data")
new_df.plot(kind='bar', x='Data', y='value')
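A minimal sketch of the melt approach, using a one-row frame shaped like the one in the question (three columns instead of 99; the values are the sample values shown there). melt discards the timestamp index by default, which is why no separate drop step is needed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# One-row frame with 'timestamp' as the index name, as in the question
data = pd.DataFrame(
    {"Data A": [0.1], "Data B": [0.2], "Data C": [0.3]},
    index=pd.Index(["06:54:00"], name="timestamp"),
)

# Wide to long: one row per original column, index dropped
new_df = data.melt(var_name="Data")

# One bar per original column
ax = new_df.plot(kind="bar", x="Data", y="value", rot=90, legend=False)
```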

Matplotlib Time-Series Heatmap Visualization Row Modification

Thank you in advance for the assistance!
I am trying to create a heat map from time-series data. The data begins mid-year, which causes the top row of my heat map to be shifted to the left so that it does not line up with the rest of the plot (shown below). How would I go about shifting just the top row over so that the visualization of the data syncs up with the rest of the plot?
(Code Provided Below)
import pandas as pd
import matplotlib.pyplot as plt
# link to the data
url1 = 'https://raw.githubusercontent.com/the-datadudes/deepSoilTemperature/master/minotDailyAirTemp.csv'
# load the data into a DataFrame, not a Series
# parse the dates, and set them as the index
df1 = pd.read_csv(url1, parse_dates=['Date'], index_col=['Date'])
# groupby year and aggregate Temp into a list
dfg1 = df1.groupby(df1.index.year).agg({'Temp': list})
# create a wide format dataframe with all the temp data expanded
df1_wide = pd.DataFrame(dfg1.Temp.tolist(), index=dfg1.index)
# plotting the data
fig, (ax1) = plt.subplots(ncols=1, figsize=(20, 5))
ax1.matshow(df1_wide, interpolation=None, aspect='auto');
The problem lies in the dates of the dataset. If you look at the data, it starts on
`1990-4-24,15.533`
To solve this, it is necessary to add rows for the dates from 1990-01-01 to 1990-04-23 and to delete every 29 February.
import numpy as np
rng = pd.date_range(start='1990-01-01', end='1990-04-23', freq='D')
df = pd.DataFrame(index=rng)
df.index = pd.to_datetime(df.index)
df['Temp'] = np.nan  # np.NaN was removed in NumPy 2.0
frames = [df, df1]
result = pd.concat(frames)
result = result[~((result.index.month == 2) & (result.index.day == 29))]
With the padded data:
dfg1 = result.groupby(result.index.year).agg({'Temp': list})
df1_wide = pd.DataFrame(dfg1['Temp'].tolist(), index=dfg1.index)
# plotting the data
fig, (ax1) = plt.subplots(ncols=1, figsize=(20, 5))
ax1.matshow(df1_wide, interpolation=None, aspect='auto');
The unfilled portions are a consequence of the NaN values in your dataset. One option is to replace the NaN values with the column mean or with the row mean.
Other ways of replacing the NaN values are also available. For example, with the column mean:
df1_wide = df1_wide.apply(lambda x: x.fillna(x.mean()),axis=0)
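The difference between the two options can be seen on a tiny frame. This is a sketch with made-up numbers standing in for df1_wide (rows are years, columns are days of the year); only the axis argument changes.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df1_wide: two years, three days, two gaps
wide = pd.DataFrame(
    [[np.nan, 2.0, 4.0],
     [1.0, np.nan, 8.0]],
    index=[1990, 1991],
)

# Column mean: a missing day takes the mean of that day across years
by_column = wide.apply(lambda col: col.fillna(col.mean()), axis=0)

# Row mean: a missing day takes the mean of its own year
by_row = wide.apply(lambda row: row.fillna(row.mean()), axis=1)
```

For temperature data the column mean is usually the closer guess, because it averages the same calendar day, whereas the row mean pulls a winter gap towards the annual average.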

How to MatPlotLib Plot two DataFrames?

I have two DataFrames, north and south. Each has the same rows and columns. I would like to plot the speed columns of both DataFrames in one figure as a bar chart. I am trying this:
ax = south['speed'].plot(kind='bar', color='gray')
north['speed'].plot(kind = 'bar', color='red', ax=ax)
plt.show()
But it plots only the last DataFrame, i.e. only the north DataFrame. Can you help me?
1) If you would like to plot just the 'speed' column, concatenate the DataFrames:
df = pd.concat([north, south])
or, on pandas versions before 2.0 (DataFrame.append was removed in 2.0):
df = north.append(south)
2) If you would like to compare the 'speed' columns of both DataFrames, join them along axis=1:
df = pd.concat([north, south], axis=1, ignore_index=True)
and then call the plot method of df.
For more info: https://pandas.pydata.org/pandas-docs/stable/merging.html
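A minimal sketch of option 2 with made-up data. Because ignore_index=True combined with axis=1 replaces the column labels with integers, this variant passes keys instead, so each bar group keeps a readable legend entry. The index labels and speed values are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

# Made-up frames with identical rows and a 'speed' column
north = pd.DataFrame({"speed": [10, 20, 30]}, index=["a", "b", "c"])
south = pd.DataFrame({"speed": [15, 25, 5]}, index=["a", "b", "c"])

# Side-by-side columns named after their source frame
speeds = pd.concat([north["speed"], south["speed"]], axis=1,
                   keys=["north", "south"])

# Grouped bars: one red and one gray bar per row
ax = speeds.plot(kind="bar", color=["red", "gray"])
```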

Adding Legends in Pandas Plot

I am plotting density graphs using pandas plot, but I am not able to add appropriate legends to each of the graphs. My code and result are below:
for i in tickers:
    df = pd.DataFrame(dic_2[i])
    mean = np.average(dic_2[i])
    std = np.std(dic_2[i])
    maximum = np.max(dic_2[i])
    minimum = np.min(dic_2[i])
    df1 = pd.DataFrame(np.random.normal(loc=mean, scale=std, size=len(dic_2[i])))
    ax = df.plot(kind='density', title='Returns Density Plot for ' + str(i), colormap='Reds_r')
    df1.plot(ax=ax, kind='density', colormap='Blues_r')
As the legend box at the top right of the picture shows, the legend entries come out as 0. How do I add something meaningful there?
print(df.head())
0
0 -0.019043
1 -0.0212065
2 0.0060413
3 0.0229895
4 -0.0189266
I think you may want to restructure the way you've created the graph. An easy way to do this is to create the ax before plotting:
# sample data
df = pd.DataFrame()
df['returns_a'] = [x for x in np.random.randn(100)]
df['returns_b'] = [x for x in np.random.randn(100)]
print(df.head())
returns_a returns_b
0 1.110042 -0.111122
1 -0.045298 -0.140299
2 -0.394844 1.011648
3 0.296254 -0.027588
4 0.603935 1.382290
fig, ax = plt.subplots()
I then created the dataframe using the parameters specified in your variables:
mean=np.average(df.returns_a)
std=np.std(df.returns_a)
maximum=np.max(df.returns_a)
minimum=np.min(df.returns_a)
pd.DataFrame(np.random.normal(loc=mean,scale=std,size=len(df.returns_a))).rename(columns={0: 'std_normal'}).plot(kind='density',colormap='Blues_r', ax=ax)
df.plot('returns_a', kind='density', ax=ax)
This second dataframe you're working with is created by default with column 0. You'll need to rename this.
I figured out a simpler way to do this. Just add column names to the dataframes.
for i in tickers:
    df = pd.DataFrame(dic_2[i], columns=['Empirical PDF'])
    print(df.head())
    mean = np.average(dic_2[i])
    std = np.std(dic_2[i])
    maximum = np.max(dic_2[i])
    minimum = np.min(dic_2[i])
    df1 = pd.DataFrame(np.random.normal(loc=mean, scale=std, size=len(dic_2[i])), columns=['Normal PDF'])
    ax = df.plot(kind='density', title='Returns Density Plot for ' + str(i), colormap='Reds_r')
    df1.plot(ax=ax, kind='density', colormap='Blues_r')
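The column-name rule can be checked without the real data. This is a minimal sketch with synthetic returns (made up for illustration). It uses kind='hist' rather than kind='density' only so that it runs without SciPy; the legend labels come from the column names in both cases.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.02, size=200)  # synthetic returns

# Named columns become the legend labels
df = pd.DataFrame(returns, columns=["Empirical PDF"])
df1 = pd.DataFrame(
    rng.normal(loc=returns.mean(), scale=returns.std(), size=len(returns)),
    columns=["Normal PDF"],
)

ax = df.plot(kind="hist", bins=30, alpha=0.5, title="Returns Distribution")
df1.plot(ax=ax, kind="hist", bins=30, alpha=0.5)

# Read the labels back from the legend pandas attached to the axes
labels = [text.get_text() for text in ax.get_legend().get_texts()]
```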

Put 2 separate .csv files on one graph using Python

I have created a multi-line graph, of water temperatures throughout the year, with python using pandas:
import pandas as pd
filepath = "C:\\Users\\technician\\Desktop\\LowerBD2014.csv"
data = pd.read_csv(filepath, header = 0, index_col = 0)
data.plot(kind = 'line', use_index = True, title="timeseries", figsize=(20,10))
Now, I would like to add another line for air temperature. Unfortunately, the dates and times at which the data was collected don't match. I was thinking that I could work around this by importing 2 separate .csv files into the same graph, but I am unsure how to do that.
Any suggestions would be great. I can also add all of the data to one file; I just worry that the air temperature will not plot correctly without a secondary horizontal axis (I don't know how to do this either).
Here is the graph created using ax=ax for one of the dataset plots:
http://imgur.com/zrht85K
Once your two CSVs are imported as two dataframes, plot the first one and assign the returned matplotlib axes object to a name (ax in the block below), then pass that axes to the second plot call.
import pandas as pd
import numpy as np
# two made-up timeseries with different periods, for demonstration plot below
#air_temp = pd.DataFrame(np.random.randn(12),
# index=pd.date_range('1/1/2016', freq='M', periods=12),
# columns=['air_temp'])
#water_temp = pd.DataFrame(np.random.randn(365),
# index=pd.date_range('1/1/2016', freq='D', periods=365),
# columns=['water_temp'])
# the real data import would look something like this:
water_temp_filepath = "C:\\Users\\technician\\Desktop\\water_temp.csv"
air_temp_filepath = "C:\\Users\\technician\\Desktop\\airtemp.csv"
# infer_datetime_format is deprecated since pandas 2.0 and is no longer needed
water_temp = pd.read_csv(water_temp_filepath, header=0, index_col=0,
                         parse_dates=True)
air_temp = pd.read_csv(air_temp_filepath, header=0, index_col=0,
                       parse_dates=True)
# plot both overlaid
ax = air_temp.plot(figsize=(20,10))
water_temp.plot(ax=ax)
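A runnable sketch of the same idea with made-up readings whose timestamps never coincide. It calls matplotlib's ax.plot directly instead of DataFrame.plot, which is a substitution made here to keep the example independent of how pandas handles mixed sampling frequencies. Because both lines share one datetime x axis, the mismatched timestamps need no alignment.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up readings: water logged daily, air logged every 36 hours
water_temp = pd.DataFrame(
    {"water_temp": rng.normal(12, 2, 60)},
    index=pd.date_range("2016-01-01", periods=60, freq="D"),
)
air_temp = pd.DataFrame(
    {"air_temp": rng.normal(8, 5, 40)},
    index=pd.date_range("2016-01-01 06:00", periods=40,
                        freq=pd.Timedelta(hours=36)),
)

# Both series go on the same axes; each keeps its own timestamps
fig, ax = plt.subplots(figsize=(20, 10))
ax.plot(water_temp.index, water_temp["water_temp"], label="water_temp")
ax.plot(air_temp.index, air_temp["air_temp"], label="air_temp")
ax.legend()
```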
As someone here said, if the columns are the same in both CSV files, you can follow their code.
Alternatively, you can try combining the two CSV files into one and then using that.
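# read the first file as text, header included
with open('first.csv', 'r') as file_a:
    file_a_data = file_a.read()
# read the second file, skipping its header row
# (assumes both files share the same header and first.csv ends with a newline)
with open('second.csv', 'r') as file_b:
    file_b.readline()
    file_b_data = file_b.read()
combined_data = file_a_data + file_b_data
# write the combined text to a new file
with open('test.csv', 'w') as combined_file:
    combined_file.write(combined_data)
data = pd.read_csv(file_path_to_final_csv, ...,...)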
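Concatenating raw file text repeats the header row unless it is stripped, so a variant that lets pandas do the combining may be easier to get right. This is a minimal sketch: the in-memory strings stand in for first.csv and second.csv, and their column names and values are made up.

```python
import io
import pandas as pd

# In-memory stand-ins for first.csv and second.csv (contents are made up)
first_csv = io.StringIO("date,temp\n2016-01-01,4.0\n2016-01-02,4.5\n")
second_csv = io.StringIO("date,temp\n2016-01-03,5.0\n2016-01-04,5.5\n")

# Parse each file separately so every header is consumed as a header
first = pd.read_csv(first_csv, index_col=0, parse_dates=True)
second = pd.read_csv(second_csv, index_col=0, parse_dates=True)

# Stack the rows and order them by date
combined = pd.concat([first, second]).sort_index()
```

With real files, the two io.StringIO objects would be replaced by the file paths.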
