I have a DataFrame with a time series as index.
import pandas as pd
from numpy.random import rand
df = pd.DataFrame(rand(100000), index=None, columns=['a'])
df['time'] = pd.date_range('2020-01-01 12:30:15',
periods=len(df['a']), freq='ms')
df.set_index('time', inplace=True)
df.plot()
When I put a multiplier on the frequency, plotting df2 becomes extremely slow, even though it has fewer elements than df. EDIT 1: It actually crashed my Python kernel, and my laptop almost ran out of RAM.
df2 = pd.DataFrame(rand(50000), index=None, columns=['a'])
df2['time'] = pd.date_range('2020-01-01 12:30:15',
periods=len(df2['a']), freq='2.5ms')
df2.set_index('time', inplace=True)
df2.plot()
I was wondering whether this behavior is normal.
Thanks
EDIT 2: Versions of packages
Python 3.9.3-1
Pandas 1.2.3-1
Numpy 1.20.1-1
Matplotlib 3.4.1-2
EDIT 3: It works when plotting directly with matplotlib
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(figsize=(12,5))
ax1.plot(df2);
I am using a dataset that can be found on the Kaggle website (https://www.kaggle.com/claytonmiller/lbnl-automated-fault-detection-for-buildings-data).
I am trying to write code that selects specific rows based on their Timestamp (in this dataset, the times between 10:01 PM and 6:59 AM) and fills all the columns of those rows with zero.
I have tried the following code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
%matplotlib inline
df = pd.read_csv('RTU.csv')
def fill_na(row):
    if dt.time(22, 1) <= pd.to_datetime(row['Timestamp']).time() <= dt.time(6, 59):
        row.fillna(0)
### df = df.apply(fill_na, axis=1) ###
df= df.apply(lambda row : fill_na(row), axis=1)
#### df.fillna(0, inplace=True) ###
df.head(2000)
However, after changing the axis of the dataset, it no longer seems to work as intended.
I don't think you need a function to do that. Just filter the rows using a condition and then fillna.
import datetime as dt
import pandas as pd
df = pd.read_csv('RTU.csv',parse_dates=['Timestamp'])
df.head()
cond = (df.Timestamp.dt.time > dt.time(22,0)) | ((df.Timestamp.dt.time < dt.time(7,0)))
df[cond] = df[cond].fillna(0,axis=1)
The result shows that the NaN values before 7 AM are filled with 0.
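For anyone who wants to check the masking logic without downloading the dataset, here is a minimal sketch on a made-up frame. The `temp` column and the four timestamps are assumptions for illustration only, not values from RTU.csv. The key point is that a window that wraps past midnight needs two comparisons joined with OR; a chained comparison such as `22:01 <= t <= 06:59` can never be true.

```python
import datetime as dt

import numpy as np
import pandas as pd

# Hypothetical stand-in for RTU.csv: a few timestamps with missing readings.
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2020-01-01 05:30", "2020-01-01 12:00",
        "2020-01-01 22:01", "2020-01-01 23:45",
    ]),
    "temp": [np.nan, np.nan, np.nan, 21.5],
})

t = df["Timestamp"].dt.time
# The window 22:01 -> 06:59 wraps past midnight, so join the bounds with OR.
night = (t >= dt.time(22, 1)) | (t <= dt.time(6, 59))

# Fill NaN only in the night rows, and leave the Timestamp column alone.
value_cols = df.columns.drop("Timestamp")
df.loc[night, value_cols] = df.loc[night, value_cols].fillna(0)

print(df)
```

The 12:00 row keeps its NaN because it falls outside the window, while the 05:30 and 22:01 rows are filled with 0.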
I want my matplotlib plot to display my df's DateTimeIndex as consecutive count data (in seconds) on the x-axis and my df's Load data on the y axis. Then I want to overlap it with a scipy.signal find_peaks result (which has an x-axis of consecutive seconds). My data is not consecutive (real world data), though it does have a frequency of seconds.
Code
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal
import numpy as np
# Create Sample Dataset
df = pd.DataFrame([['2020-07-25 09:26:28',2],['2020-07-25 09:26:29',10],['2020-07-25 09:26:32',203],['2020-07-25 09:26:33',30]],
columns = ['Time','Load'])
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index("Time")
print(df)
# Try to solve the problem
rng = pd.date_range(df.index[0], df.index[-1], freq='s')
print(rng)
peaks, _ = signal.find_peaks(df["Load"])
plt.plot(rng, df["Load"])
plt.plot(peaks, df["Load"][peaks], "x")
plt.plot(np.zeros_like(df["Load"]), "--", color="gray")
plt.show()
This code does not work because rng has a length of 6, while the df has a length of 4. I think I might be going about this the wrong way entirely. Thoughts?
You are really close - I think you can get what you want by reindexing your df with your range. For instance:
df = df.reindex(rng).fillna(0)
peaks, _ = signal.find_peaks(df["Load"])
...
Does that do what you expect?
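Spelled out a bit more, a sketch of the reindex approach on the sample data from the question could look like this. The variable names `filled`, `seconds`, `peak_seconds` and `peak_loads` are my own, and plotting is left out so the snippet stays small. One caveat: filling the gaps with 0 can create peaks that exist only because of the fill, so a different fill strategy may suit real data better.

```python
import pandas as pd
from scipy import signal

# Sample data from the question: two seconds (09:26:30 and 09:26:31) are missing.
df = pd.DataFrame(
    {"Load": [2, 10, 203, 30]},
    index=pd.to_datetime([
        "2020-07-25 09:26:28", "2020-07-25 09:26:29",
        "2020-07-25 09:26:32", "2020-07-25 09:26:33",
    ]),
)

# Build a gap-free one-second index and fill the missing rows with 0.
rng = pd.date_range(df.index[0], df.index[-1], freq="s")
filled = df.reindex(rng).fillna(0)

# Elapsed seconds since the first sample, usable as a numeric x-axis.
seconds = (filled.index - filled.index[0]).total_seconds().to_numpy()

# find_peaks returns positions, which now line up with the seconds axis.
load = filled["Load"].to_numpy()
peaks, _ = signal.find_peaks(load)
peak_seconds = seconds[peaks]
peak_loads = load[peaks]
```

With this, `plt.plot(seconds, load)` and `plt.plot(peak_seconds, peak_loads, "x")` share the same x-axis.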
Given the following data from a CSV file, I want to plot a regression plot using Matplotlib for the mean of the 2-bedroom price.
I have managed to use groupby to get the mean. However, after reading the solutions on Stack Overflow and trying them, I mostly end up with other never-ending data-related problems. In general, most of the errors say either that the data must be converted to a string or that something is not an index, etc.
   Bedrooms      Price       Date
0       2.0        NaN   3/9/2016
1             1480000.0  3/12/2016
2       2.0  1035000.0   4/2/2016
3       3.0        NaN   4/2/2016
4       3.0  1465000.0   4/2/2016
#Assume you have the following dataframe df that describes flights
%matplotlib inline
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('justtesting.csv', nrows=50, usecols=['Price','Date','Bedrooms'])
df = df.dropna(axis=0)
df['Date'] = pd.to_datetime(df.Date)
df.sort_values("Date", axis = 0, ascending = True, inplace = True)
df2 = df[df['Bedrooms'] == 2].groupby(["Date"]).agg(['sum'])
df2.head()
df2.info()
sns.set()
g=sns.lmplot(x="Date", y="Price", data=df2, lowess=True)
#Assume you have the following dataframe df that describes flights
%matplotlib inline
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = x.copy()
df = df.dropna(axis=0)
df.sort_values("Date", axis = 0, ascending = True, inplace = True)
df2 = df[df['Bedrooms'] == 2].groupby(["Date", 'Bedrooms'], as_index=False).sum()
df2.head()
df2.info()
sns.set()
g=sns.lmplot(x='Date', y="Price", data=df2, lowess=True)
Groupby makes the grouped-by columns the index by default. Passing as_index=False fixes that. However, seaborn's lmplot requires float values on the x-axis, so a datetime column will not work as-is. More info can be found on this question
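As a rough sketch of both points (keeping Date as a column and giving lmplot something numeric), the frame below uses made-up prices and ISO dates, and `date_num` is a hypothetical helper column. It uses the mean rather than the sum, since the question asks for the mean price.

```python
import pandas as pd

# Made-up rows for illustration; not the values from the question's CSV.
df = pd.DataFrame({
    "Bedrooms": [2.0, 2.0, 3.0, 2.0],
    "Price": [1000000.0, 1035000.0, 1465000.0, 1200000.0],
    "Date": pd.to_datetime(
        ["2016-03-09", "2016-04-02", "2016-04-02", "2016-04-02"]
    ),
})

# as_index=False keeps "Date" as a regular column instead of the index.
two_bed = (
    df[df["Bedrooms"] == 2]
    .groupby("Date", as_index=False)["Price"]
    .mean()
)

# Convert each date to its ordinal day number so the x values are numeric.
two_bed["date_num"] = two_bed["Date"].map(pd.Timestamp.toordinal)

print(two_bed)
```

After that, `sns.lmplot(x="date_num", y="Price", data=two_bed)` gets numeric x values; the tick labels can be converted back to dates afterwards if needed.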
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dts
def use_matplot():
    ax = df.plot(x='year', kind="area" )
    years = dts.YearLocator(20)
    ax.xaxis.set_major_locator(years)
    fig = ax.get_figure()
    fig.savefig('output.pdf')
dates = np.arange(1990,2061, 1)
dates = dates.astype('str').astype('datetime64')
df = pd.DataFrame(np.random.randint(0, dates.size, size=(dates.size,3)), columns=list('ABC'))
df['year'] = dates
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
use_matplot()
In the above code, I get an error, "ValueError: year 0 is out of range" when trying to set the YearLocator so as to ensure the X-Axis has year labels for every 20th year. By default the plot has the years show up every 10 years. What am I doing wrong? Desired outcome is simply a plot with 1990, 2010, 2030, 2050 on the bottom. (Instead of default 1990, 2000, 2010, etc.)
Since the years are simple numbers, you may opt for not using them as dates at all and keeping them as numbers.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dates = np.arange(1990,2061, 1)
df = pd.DataFrame(np.random.randint(0,dates.size,size=(dates.size,3)),columns=list('ABC'))
df['year'] = dates
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
ax = df.plot(x='year', kind="area" )
ax.set_xticks(range(2000,2061,20))
plt.show()
Apart from that, using Matplotlib locators and formatters on date axes created via pandas will most often fail. This is due to pandas using a completely different datetime convention. In order to have more freedom for setting custom tickers for datetime axes, you may use matplotlib. A stackplot can be plotted with plt.stackplot. On such a matplotlib plot, the use of the usual matplotlib tickers is unproblematic.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dts
dates = np.arange(1990,2061, 1)
df = pd.DataFrame(np.random.randint(0,dates.size,size=(dates.size,3)),columns=list('ABC'))
df['year'] = pd.to_datetime(dates.astype(str))
cols = df.columns.tolist()
cols = [cols[-1]] + cols[:-1]
df = df[cols]
plt.stackplot(df["year"].values, df[list('ABC')].values.T)
years = dts.YearLocator(20)
plt.gca().xaxis.set_major_locator(years)
plt.margins(x=0)
plt.show()
Consider using set_xticklabels to specify values of x axis tick marks:
ax.set_xticklabels(sum([[i,''] for i in range(1990, 2060, 20)], []))
# [1990, '', 2010, '', 2030, '', 2050, '']
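One thing to watch with set_xticklabels on its own: it only relabels whatever tick positions the axis already has, so the labels can end up attached to the wrong positions. A safer variant, sketched here with plain numeric years and random data (the variable names are mine), fixes the positions with set_xticks first and then labels exactly those ticks.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

years = np.arange(1990, 2061)
values = np.random.randint(0, years.size, size=(years.size, 3))

fig, ax = plt.subplots()
ax.stackplot(years, values.T, labels=list("ABC"))

# Pin the tick positions first, then label exactly those positions.
ticks = list(range(1990, 2061, 20))  # 1990, 2010, 2030, 2050
ax.set_xticks(ticks)
ax.set_xticklabels([str(t) for t in ticks])
ax.margins(x=0)
```

Because positions and labels come from the same `ticks` list, they cannot drift apart if the axis limits change later.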
I'm trying to use seaborn to make a simple tsplot, but for reasons that aren't clear to me nothing shows up when I run the code. Here's a minimal example:
import numpy as np
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
ax = sns.tsplot(data=df, value='value', time='time')
sns.plt.show()
Usually with tsplot you supply multiple data points for each time point, but does it just not work if you only supply one?
I know matplotlib can be used to do this pretty easily, but I wanted to use seaborn for some of its other functionality.
You are missing individual units. When using a data frame, the idea is that a time series has been recorded for each of several units, and each unit can be individually identified in the data frame. The error is then calculated across the different units.
So for one series only, you can get it working again like this:
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
df['subject'] = 0
sns.tsplot(data=df, value='value', time='time', unit='subject')
Just to see how the error is computed, look at this example:
dfs = []
for i in range(10):
    df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
    df['subject'] = i
    dfs.append(df)
all_dfs = pd.concat(dfs)
sns.tsplot(data=all_dfs, value='value', time='time', unit='subject')
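To make the "error across units" idea concrete without seaborn, here is a pandas-only sketch that computes a per-time-point mean and standard error across subjects. This is an approximation for illustration and not necessarily the exact interval tsplot draws; the names `summary` and `single` are my own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Ten subjects, each with one value per time point.
frames = []
for i in range(10):
    frame = pd.DataFrame({"value": rng.random(31), "time": range(31)})
    frame["subject"] = i
    frames.append(frame)
all_dfs = pd.concat(frames, ignore_index=True)

# Spread across subjects at each time point: mean, standard error, sample count.
summary = all_dfs.groupby("time")["value"].agg(["mean", "sem", "count"])

# With a single subject there is only one observation per time point,
# so the standard error is undefined (NaN) everywhere.
single = all_dfs[all_dfs["subject"] == 0].groupby("time")["value"].sem()

print(summary.head())
print(single.head())
```

The all-NaN result for a single subject is the numeric counterpart of having no spread to draw when only one series is supplied.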
You can use set_index for index from column time and then plot Series:
df = pd.DataFrame({'value': np.random.rand(31), 'time': range(31)})
df = df.set_index('time')['value']
ax = sns.tsplot(data=df)
sns.plt.show()