I need some help here since I'm fairly new to this. I was able to import and plot my time series data with Pandas and Matplotlib, respectively. The problem is that the plot is too cramped (due to the amount of data).
Using the same data set, is it possible to ‘divide’ the whole plot into 3 separate subplots?
Here's a sample of what I mean:
What I'm trying to do here is to distribute my plot into 3 subplots (it seems it doesn't have ncol=x).
Initially, my code runs like this;
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
data = pd.read_csv ('all_visuallc.csv')
df = pd.DataFrame(data, columns= ['JD', 'Magnitude'])
print(df) #displays ~37000ish data x 2 columns
colors = ('#696969') #a very nice dim grey heh
area = np.pi*1.7
ax = df.plot.scatter(x="JD", y="Magnitude", s=area, c=colors, alpha=0.2)
ax.set(title='HD 39801', xlabel='Julian Date', ylabel='Visual Magnitude')
ax.invert_yaxis()
ax.xaxis.set_minor_locator(ticker.AutoMinorLocator())
ax.yaxis.set_minor_locator(ticker.AutoMinorLocator())
plt.rcParams['figure.figsize'] = [20, 4]
plt.rcParams['figure.dpi'] = 250
plt.savefig('test_p.jpg')
plt.show()
which shows a very tight plot:
Thanks everyone and I do hope for your help and responses.
P.S. I think iloc[value:value] to slice from a df may work?
First of all, you have to create multiple plots for every part of your data.
For example, if we want to split data into 3 parts, we will create 3 subplots. And then, as you correctly wrote, we can apply iloc (or another type of indexing) to the data.
Here is a toy example, but I hope you will be able to apply your decorations to it.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

y = np.arange(0,20,1)
x = np.arange(20,40,1)
sample = pd.DataFrame(x,y).reset_index().rename(columns={'index':'y',
                                                         0:'x'})
n_plots = 3
figs, axs = plt.subplots(n_plots, figsize=[30,10])
# Suppose we want to split data into equal parts
start_ind = 0
for i in range(n_plots):
    end_ind = start_ind + round(len(sample)/n_plots) #(*)
    part_of_frame = sample.iloc[start_ind:end_ind]
    axs[i].scatter(part_of_frame['x'], part_of_frame['y'])
    start_ind = end_ind
It's also possible to split the data into unequal parts by changing the logic in the line marked (*).
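As a variation on the toy example, here is a minimal sketch that applies the question's decorations to each part. It swaps the manual start/end bookkeeping for np.array_split, which also copes with lengths that do not divide evenly. The data are synthetic stand-ins (the real 'all_visuallc.csv' is not available here), so the values and the row count of 300 are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real file (hypothetical values, not the asker's data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "JD": np.sort(rng.uniform(2.42e6, 2.46e6, size=300)),
    "Magnitude": rng.normal(0.6, 0.3, size=300),
})

n_plots = 3
# Row positions split into n_plots contiguous, nearly equal chunks
chunks = np.array_split(np.arange(len(df)), n_plots)

fig, axs = plt.subplots(n_plots, 1, figsize=(20, 12))
for ax, idx in zip(axs, chunks):
    part = df.iloc[idx]
    ax.scatter(part["JD"], part["Magnitude"], s=np.pi * 1.7, c="#696969", alpha=0.2)
    ax.set(xlabel="Julian Date", ylabel="Visual Magnitude")
    ax.invert_yaxis()  # brighter magnitudes at the top, as in the question
axs[0].set_title("HD 39801")
fig.tight_layout()
```

Call plt.show() or fig.savefig(...) afterwards to view the result. Because the rows are sorted by JD here, each subplot covers a consecutive stretch of the time axis.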
I'm trying to do some data analysis with python and pandas on a power consumption dataset.
However, when I plot the data I get that straight line from 5-1-2007 to 13-1-2007, even though I have no missing values in my dataset. This is weird behavior, as I made sure that my dataset is clean.
Has anyone had a similar issue, or can anyone explain this behavior?
Thank you.
EDIT: Here is what the data looks like in that range
EDIT 2 : Here is the link to the original dataset (before cleaning) if that might help: https://archive.ics.uci.edu/ml/machine-learning-databases/00235/
What does the data between 2007-01-01 and 2007-01-15 look like? (Use df[(df['Date_Time'] >= '2007-01-01') & (df['Date_Time'] <= '2007-01-15')].)
If no data is missing it could be that the dataset has been manipulated and the missing datapoints were interpolated (see Interpolation)
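One way to check for holes directly, rather than by eye, is to look at the time difference between consecutive rows. The sketch below assumes a 'Date_Time' column as in the filter above; the 'power' column, the sample values, and the one-minute expected step are all hypothetical.

```python
import pandas as pd

# Hypothetical minute-level readings with a deliberate hole (illustrative values only)
stamps = pd.to_datetime([
    "2007-01-04 23:58", "2007-01-04 23:59", "2007-01-05 00:00",
    "2007-01-13 00:00", "2007-01-13 00:01",
])
df = pd.DataFrame({"Date_Time": stamps, "power": [1.2, 1.3, 1.1, 0.9, 1.0]})

# Time elapsed since the previous row; the expected step here is one minute
step = df["Date_Time"].diff()

# Keep only the rows that follow a larger-than-expected jump
gaps = df.loc[step > pd.Timedelta(minutes=1), ["Date_Time"]].assign(gap=step)
print(gaps)
```

Any row listed in gaps is the first reading after a stretch with no rows at all, which a line plot would bridge with a straight segment.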
The point is that when the x (datetime) axis covers a span for which there is no y data, the line is still drawn across that span. This is especially noticeable in financial data on weekends and holidays, or wherever there are gaps.
This problem is described here.
Although you say the data is present, try this code anyway; it may be a matter of omissions.
To avoid drawing across spans with no y data, the values are plotted against their row number and 'ticker.FuncFormatter(format_date)' is used to label the ticks with dates.
Below I attach the code, where I deliberately created gaps in the data file, and a picture of how it turned out:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
df = pd.read_csv('custom.csv',
                 index_col='DATE',
                 parse_dates=True,
                 infer_datetime_format=True)
z = df.iloc[:, 3].values
date = df.iloc[:, 0].index.date
fig, axes = plt.subplots(ncols=2)
ax = axes[0]
ax.plot(date, z)
ax.set_title("Default")
fig.autofmt_xdate()
N = len(z)
ind = np.arange(N)
def format_date(x, pos=None):
    # Map a tick position (row number) back to the date stored at that row
    thisind = np.clip(int(x + 0.5), 0, N - 1)
    return date[thisind].strftime('%Y-%m-%d')
ax = axes[1]
ax.plot(ind, z)
ax.xaxis.set_major_formatter(ticker.FuncFormatter(format_date))
ax.set_title("Without empty values")
fig.autofmt_xdate()
plt.show()
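Since 'custom.csv' is not available, here is a self-contained sketch of the same idea on synthetic data. The dates and values are hypothetical; a week is removed on purpose so that the left panel bridges the gap and the right panel collapses it.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np
import pandas as pd

# Hypothetical daily series with a week removed to create a gap
dates = pd.date_range("2007-01-01", periods=20, freq="D")
dates = dates[(dates < "2007-01-05") | (dates > "2007-01-12")]
values = np.arange(len(dates), dtype=float)

def format_date(x, pos=None):
    # Map a tick position (row number) back to the date stored at that row
    i = int(np.clip(round(x), 0, len(dates) - 1))
    return dates[i].strftime("%Y-%m-%d")

fig, (ax_time, ax_rows) = plt.subplots(ncols=2, figsize=(10, 4))

# Real time axis: the missing week is bridged by a straight segment
ax_time.plot(dates, values)
ax_time.set_title("Default")

# Row-number axis: consecutive rows sit next to each other, so the gap vanishes
ax_rows.plot(np.arange(len(dates)), values)
ax_rows.xaxis.set_major_formatter(ticker.FuncFormatter(format_date))
ax_rows.set_title("Without empty values")

fig.autofmt_xdate()
```

The trade-off is that the right-hand x axis is no longer linear in time, so distances along it should not be read as durations.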
I'm a beginner in Python.
In my internship project I am trying to plot boxplots from data contained in a CSV file.
I need to plot boxplots for each of the four variables shown above (AAG, DENS, SRG and RCG). Since each variable has values in columns numbered from [001] to [100], there will be 100 boxplots for each variable, which need to be plotted in a single graph as shown in the image.
This is the graph I need to plot, but for each variable there will be 100 boxplots, as each one has 100 columns of values:
The x-axis is the "Year", which ranges from 2025 to 2030, so I need a graph like the one shown in figure 2 for each year and the y-axis is the sets of values for each variable.
Using Pandas-melt function and seaborn library I was able to plot only the boxplots of a column. But that's not what I need:
import pandas as pd
import seaborn as sns
df = pd.read_csv("2DBM_50x50_Central_Aug21_Sim.cliped.csv")
mdf= df.melt(id_vars=['Year'], value_vars='AAG[001]')
print(mdf)
ax=sns.boxplot(x='Year', y='value',width = 0.2, data=mdf)
Result of the code above:
What can I try to resolve this?
The following code gives you five subplots, where each subplot only contains the data of one variable. Then a boxplot is generated for each year. To change the range of columns used for each variable, change the upper limit in var_range = range(1, 101), and to see the outliers change showfliers to True.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = pd.read_csv("2DBM_50x50_Central_Aug21_Sim.cliped.csv")
variables = ["AAG", "DENS", "SRG", "RCG", "Thick"]
period = range(2025, 2031)
var_range = range(1, 101)
fig, axes = plt.subplots(2, 3)
flattened_axes = fig.axes
flattened_axes[-1].set_visible(False)
for i, var in enumerate(variables):
    # Collect all columns of this variable and reshape them to long format
    var_columns = [f"TB_acc_{var}[{j:05}]" for j in var_range]
    data = df.melt(id_vars=["Period"], value_vars=var_columns, value_name=var)
    ax = flattened_axes[i]
    sns.boxplot(x="Period", y=var, width=0.2, data=data, ax=ax, showfliers=False)
plt.tight_layout()
plt.show()
output:
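To see what the melt step does before it reaches the boxplot call, here is a small sketch on a made-up frame. The column names follow the pattern in the question ('Year', 'AAG[001]', ...), which is an assumption; the answer's code uses 'Period' and a 'TB_acc_' prefix, so adjust to whatever the real file contains.

```python
import pandas as pd

# Tiny hypothetical frame; the real file has 100 columns per variable
df = pd.DataFrame({
    "Year": [2025, 2026],
    "AAG[001]": [1.0, 2.0],
    "AAG[002]": [3.0, 4.0],
    "DENS[001]": [5.0, 6.0],
})

# Select every column belonging to one variable by its prefix
aag_columns = [c for c in df.columns if c.startswith("AAG[")]

# Wide to long: one row per (Year, original column) pair
long_aag = df.melt(id_vars=["Year"], value_vars=aag_columns, value_name="AAG")
print(long_aag)
```

Each year now owns one row per original column, so a boxplot with x="Year" and y="AAG" pools all of a variable's columns into a single box per year. Keeping the 'variable' column around (for example as hue) would instead give one box per column.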
I'm working with a dataset that has grades and states and need to create line graphs by state showing what percent of each state's students fall into which bins.
My methodology (so far) is as follows:
First I import the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
records = [{'Name':'A', 'Grade':'.15','State':'NJ'},{'Name':'B', 'Grade':'.15','State':'NJ'},{'Name':'C', 'Grade':'.43','State':'CA'},{'Name':'D', 'Grade':'.75','State':'CA'},{'Name':'E', 'Grade':'.17','State':'NJ'},{'Name':'F', 'Grade':'.85','State':'HI'},{'Name':'G', 'Grade':'.89','State':'HI'},{'Name':'H', 'Grade':'.38','State':'CA'},{'Name':'I', 'Grade':'.98','State':'NJ'},{'Name':'J', 'Grade':'.49','State':'NJ'},{'Name':'K', 'Grade':'.17','State':'CA'},{'Name':'K', 'Grade':'.94','State':'HI'},{'Name':'M', 'Grade':'.33','State':'HI'},{'Name':'N', 'Grade':'.22','State':'NJ'},{'Name':'O', 'Grade':'.7','State':'NJ'}]
df = pd.DataFrame(records)
df.Grade = df.Grade.astype(float)
Next I cut each grade into a bin
df['bin'] = pd.cut(df['Grade'],[-np.inf,.05,.1,.15,.2,.25,.3,.35,.4,.45,.5,.55,.6,.65,.7,.75,.8,.85,.9,.95,1],labels=False)/10
Then I create a pivot table giving me the count of people by bin in each state
df2 = pd.pivot_table(df,index=['bin'],columns='State',values=['Name'],aggfunc=pd.Series.nunique,margins=True)
df2 = df2.fillna(0)
Then I convert those n-counts into percentages and remove the margin rows
df3 = df2.div(df2.iloc[-1])
df3 = df3.iloc[:-1,:-1]
Now I want to create a line graph with multiple lines (one for each state) with the bin on the x axis and the percentage on the Y axis. df3.plot() will give me the chart I want but I would like to accomplish the same using matplotlib, because it offers me greater customization of the graph. However, running
plt.plot(df3)
gives me the lines I need but I can't get the legend to work properly. Any thoughts on how to accomplish this?
It may not be the best way, but I use the pandas plot function to draw df3, then get the legend and get the new label names. Please note that the processing of the legend string is limited to this data.
line = df3.plot(kind='line')
handles, labels = line.get_legend_handles_labels()
label = []
for l in labels:
    # Strip the leading "(Name, " and the trailing ")" from each label
    label.append(l[7:-1])
plt.legend(handles, label, loc='best')
You can do this:
plt.plot(df3,label="label")
plt.legend()
plt.show()
For more information visit here
And if it helps you to solve your issues then don't forget to mark this as accepted answer.
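For a matplotlib-only route, one option is to draw each state's column separately so that every line carries its own label. The sketch below is not the pivot table from the question: it assumes pd.crosstab as a simpler stand-in (it counts rows rather than unique names), uses ten equal-width bins, and takes only a few of the question's records.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# A few of the question's records, kept short for illustration
records = [
    {"Name": "A", "Grade": 0.15, "State": "NJ"},
    {"Name": "C", "Grade": 0.43, "State": "CA"},
    {"Name": "D", "Grade": 0.75, "State": "CA"},
    {"Name": "F", "Grade": 0.85, "State": "HI"},
    {"Name": "I", "Grade": 0.98, "State": "NJ"},
    {"Name": "M", "Grade": 0.33, "State": "HI"},
]
df = pd.DataFrame(records)

# Integer bin codes for ten equal-width bins on (0, 1]
df["bin"] = pd.cut(df["Grade"], bins=np.linspace(0, 1, 11), labels=False)

# Share of each state's rows per bin; columns are plain state names
shares = pd.crosstab(df["bin"], df["State"], normalize="columns")

fig, ax = plt.subplots()
for state in shares.columns:
    # One plot call per state, so each line gets its own legend entry
    ax.plot(shares.index, shares[state], marker="o", label=state)
ax.set(xlabel="bin", ylabel="share of state")
ax.legend(title="State")
```

With the original df3, whose columns are ('Name', state) tuples, the same loop would work with label=col[1] for each col in df3.columns.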
I am trying to integrate 2 curves as they change through time using pandas. I am loading data from a CSV file like such:
Where the Dates are the X-axis and both the Oil & Water points are the Y-axis. I have learned to use the cross-section option to isolate the "NAME" values, but am having trouble finding a good way to integrate with dates as the X-axis. I eventually would like to be able to take the integrals of both curves and stack them against each other. I am also having trouble with the plot defaulting the x-ticks to arbitrary values, instead of the dates.
I can change the labels/ticks manually, but have a large CSV to process and would like to automate the process. Any help would be greatly appreciated.
NAME,DATE,O,W
A,1/20/2000,12,50
B,1/20/2000,25,28
C,1/20/2000,14,15
A,1/21/2000,34,50
B,1/21/2000,8,3
C,1/21/2000,10,19
A,1/22/2000,47,35
B,1/22/2000,4,27
C,1/22/2000,46,1
A,1/23/2000,19,31
B,1/23/2000,18,10
C,1/23/2000,19,41
Contents of CSV in text form above.
Further to my comment above, here is some sample code (using logic from the example mentioned) to label your xaxis with formatted dates. Hope this helps.
Data Collection / Imports:
Just re-creating your dataset for the example.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
header = ['NAME', 'DATE', 'O', 'W']
data = [['A','1/20/2000',12,50],
        ['B','1/20/2000',25,28],
        ['C','1/20/2000',14,15],
        ['A','1/21/2000',34,50],
        ['B','1/21/2000',8,3],
        ['C','1/21/2000',10,19],
        ['A','1/22/2000',47,35],
        ['B','1/22/2000',4,27],
        ['C','1/22/2000',46,1],
        ['A','1/23/2000',19,31],
        ['B','1/23/2000',18,10],
        ['C','1/23/2000',19,41]]
df = pd.DataFrame(data, columns=header)
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y')
# Subset to just the 'A' labels.
df_a = df[df['NAME'] == 'A']
Plotting:
# Define the number of ticks you need.
nticks = 4
# Define the date format.
mask = '%m-%d-%Y'
# Create the set of custom date labels.
step = int(df_a.shape[0] / nticks)
xdata = np.arange(df_a.shape[0])
xlabels = df_a['DATE'].dt.strftime(mask).tolist()[::step]
# Create the plot.
fig, ax = plt.subplots(1, 1)
ax.plot(xdata, df_a['O'], label='Oil')
ax.plot(xdata, df_a['W'], label='Water')
ax.set_xticks(np.arange(df_a.shape[0], step=step))
ax.set_xticklabels(xlabels, rotation=45, horizontalalignment='right')
ax.set_title('Test in Naming Labels for the X-Axis')
ax.legend()
Output:
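An alternative sketch, assuming the DATE column has already been converted with pd.to_datetime: plot against the dates themselves and let matplotlib.dates place and format the ticks, so no manual label list is needed. The one-tick-per-day locator is a choice that suits this four-day sample; a large CSV would likely want a coarser locator.

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

# Rows for NAME == 'A' from the CSV in the question
df_a = pd.DataFrame({
    "DATE": pd.to_datetime(["1/20/2000", "1/21/2000", "1/22/2000", "1/23/2000"],
                           format="%m/%d/%Y"),
    "O": [12, 34, 47, 19],
    "W": [50, 50, 35, 31],
})

fig, ax = plt.subplots()
ax.plot(df_a["DATE"], df_a["O"], label="Oil")
ax.plot(df_a["DATE"], df_a["W"], label="Water")

# One tick per day, labelled in the same month-day-year style as above
ax.xaxis.set_major_locator(mdates.DayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d-%Y"))
fig.autofmt_xdate(rotation=45)
ax.legend()
```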
I'd recommend modifying the X-axis into some form of integers or floats (Seconds, minutes, hours days since a certain time, based on the precision that you need). You can then use usual methods to integrate and the x-axes would no longer default to some other values.
See How to convert datetime to integer in python
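As a rough sketch of that suggestion, the dates can be turned into days elapsed since the first sample and the curves integrated with the trapezoidal rule. The helper trapezoid_area is a hypothetical name written out by hand here; the data are the 'A' rows from the question's CSV, and days as the unit is an assumption.

```python
import numpy as np
import pandas as pd

# Rows for NAME == 'A' from the CSV in the question
df_a = pd.DataFrame({
    "DATE": pd.to_datetime(["1/20/2000", "1/21/2000", "1/22/2000", "1/23/2000"],
                           format="%m/%d/%Y"),
    "O": [12, 34, 47, 19],
    "W": [50, 50, 35, 31],
})

# Days elapsed since the first sample, as floats
days = (df_a["DATE"] - df_a["DATE"].iloc[0]).dt.total_seconds().to_numpy() / 86400.0

def trapezoid_area(y, x):
    """Area under y(x) by the trapezoidal rule, written out by hand."""
    y = np.asarray(y, dtype=float)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

oil_total = trapezoid_area(df_a["O"], days)
water_total = trapezoid_area(df_a["W"], days)
print(oil_total, water_total)
```

For these four days the sketch gives 96.5 for oil and 125.5 for water, in units of value times days. Running the same function per NAME group would give one pair of totals per well.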
I'm trying to understand why this fails, even though the documentation says:
dropna : boolean, optional
Drop missing values from the data before plotting.
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
'a': np.random.normal(size=(100,)),
'b': np.random.lognormal(size=(100,)),
'c': np.random.exponential(size=(100,))})
sns.pairplot(a) # this works as expected
# snip
b = a.copy()
b.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(b) # this fails with error
# AttributeError: max must be larger than min in range parameter.
# (raised in histogram(a, bins, range, normed, weights, density))
sns.pairplot(b, dropna=True) # same error as above
When you pass the data directly, i.e.
sns.pairplot(b) #Same as sns.pairplot(b, x_vars=['a','b','c'] , y_vars=['a','b','c'],dropna=True)
you are plotting against all the columns in the DataFrame, so make sure every column has the same number of valid (non-NaN) rows.
sns.pairplot(b, x_vars=['a','c'] , y_vars=['a','b','c'],dropna=True)
In this case it works fine, but the graph will differ slightly because the NaN value is removed.
So, if you want to plot the whole DataFrame, then:
either the null values must be replaced using "fillna()",
or the whole row containing NaN values must be dropped:
b = b.drop(b.index[5])
sns.pairplot(b)
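The two options can be compared on the question's example frame without plotting anything. This is only a sketch: filling with the column mean is one assumed choice among many, and the random seed is arbitrary.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question, with a fixed seed for repeatability
rng = np.random.default_rng(0)
b = pd.DataFrame({
    "a": rng.normal(size=100),
    "b": rng.lognormal(size=100),
    "c": rng.exponential(size=100),
})
b.iloc[5, 2] = np.nan  # one missing value in column 'c', as in the question

dropped = b.dropna()         # removes row 5 entirely, for every column
filled = b.fillna(b.mean())  # keeps row 5, replacing the NaN with the column mean

print(len(dropped), int(filled.isna().sum().sum()))
```

Either frame can then be passed to sns.pairplot. Dropping loses the valid 'a' and 'b' values of that row, while filling keeps them at the cost of adding an artificial point to column 'c'.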
I'm going to post an answer to my own question, even though it doesn't exactly solve the problem in general, but at least it solves my problem.
The problem arises when trying to draw histograms. However, it looks like the KDEs are much more robust to missing data. Therefore, this works, despite the NaN in the middle of the DataFrame:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.__version__
# '0.7.dev'
# generate an example DataFrame
a = pd.DataFrame(data={
'a': np.random.normal(size=(100,)),
'b': np.random.lognormal(size=(100,)),
'c': np.random.exponential(size=(100,))})
a.iloc[5,2] = np.nan # replace one value in col 'c' by a NaN
sns.pairplot(a, diag_kind='kde')
Something of a necro, but as I cracked the answer to this today I thought it might be worth sharing. I could not find this solution elsewhere on the web. If Seaborn's dropna keyword has not worked for your data and you don't want to drop every row that contains a NaN, this should work for you.
All of this is in Seaborn 0.9 with pandas 0.23.4, assuming a data frame (df) with j rows (samples) that have n columns (attributes).
Seaborn cannot cope with arrays containing NaN being passed to it, which is a particular problem when you want to retain a row because its other values are useful. The solution is to use a function that intercepts each pair of columns before they are passed to the PairGrid for plotting.
Functions can be passed to the grid sectors to carry out an operation per subplot. A simple example of this would be to calculate and annotate RMSE for a column pair (subplot) onto each plot:
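import math
import matplotlib.pyplot as plt
import sklearn.metrics as skm  # skm is assumed to be sklearn.metrics

def rmse(x, y, **kwargs):
    value = math.sqrt(skm.mean_squared_error(x, y))
    label = 'RMSE = ' + str(round(value, 2))
    ax = plt.gca()
    ax.annotate(label, xy = (0.1, 0.95), size = 20, xycoords = ax.transAxes)

# grid is an existing sns.PairGrid
grid = grid.map_upper(rmse)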
Therefore, by writing a function that Seaborn can take as a data plotting argument, and which drops NaNs on a column-pair basis as grid.map_ iterates over the main data frame, we can minimize data loss per sample (row). One NaN in a row will no longer cause the entire row to be lost for all sub-plots; only the sub-plots for column pairs that include the NaN will exclude the given row.
The following function carries out the pairwise NaN drop and then draws the two remaining series on the current axes with matplotlib's scatter plot:
df = [YOUR DF HERE]
def col_nan_scatter(x,y, **kwargs):
    # Drop rows where either column of this pair is NaN, then plot the rest
    df = pd.DataFrame({'x':x[:],'y':y[:]})
    df = df.dropna()
    x = df['x']
    y = df['y']
    plt.gca()
    plt.scatter(x,y)
cols = df.columns
grid = sns.PairGrid(data= df, vars = cols, height = 4)
grid = grid.map_upper(col_nan_scatter)
The same can be done with seaborn plotting (with for example, just the x value):
def col_nan_kde_histo(x, **kwargs):
    # Drop NaNs from the single column before estimating its density
    df = pd.DataFrame({'x':x[:]})
    df = df.dropna()
    x = df['x']
    plt.gca()
    sns.kdeplot(x)
cols = df.columns
grid = sns.PairGrid(data= df, vars = cols, height = 4)
grid = grid.map_upper(col_nan_scatter)
# The single-column function belongs on the diagonal, where only x is passed
grid = grid.map_diag(col_nan_kde_histo)
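To illustrate why the pairwise drop retains more data than dropping whole rows, here is a small pandas-only sketch. The frame is hypothetical: each row except the first is missing exactly one attribute.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 to 3 each lack one attribute
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# Dropping any row with a NaN leaves only the fully populated rows
rows_after_full_drop = len(df.dropna())

# Rows that survive when NaNs are dropped separately for each column pair
rows_per_pair = {
    (x, y): len(df[[x, y]].dropna()) for x, y in combinations(df.columns, 2)
}
print(rows_after_full_drop, rows_per_pair)
```

Here a full dropna keeps one row, while every column pair keeps two, which is the per-subplot saving the functions above are after.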