My data in CSV files is a 15-minute average, and I want an hourly average. When I use the code below, it shows an error: 'how' is an unrecognised argument.
import pandas as pd
df = pd.read_csv("sirifort_with_data.csv",parse_dates=['Time_Stamp'])
df.resample('H', how='mean')
Indeed, pandas.DataFrame.resample no longer accepts a keyword argument named how. That method only groups your time-series data; after calling it, you chain another method to aggregate each sample/group. Since you want the average of each group, use .mean():
data.resample('H').mean()
import pandas as pd
df = pd.read_csv("sirifort_with_data.csv")
df['Time_Stamp'] = pd.to_datetime(df['Time_Stamp'])
df = df.set_index('Time_Stamp').resample('H').mean()
After converting Time_Stamp with pd.to_datetime, it worked fine.
Thanks for the help.
Related
I am trying to resample a time series to get annual maximum values for different time steps (e.g., 3h, 6h, etc.). The original series is at an hourly resolution. I first converted the date column to the pandas datetime format, set that column as the index, and resampled it. The final output should be the years and the corresponding maximum values at the desired time step. However, I am getting a list of NaN. I am not sure how I can incorporate a range in my code. Here is my code so far for a 3H time step:
import pandas as pd
df = pd.read_csv('data.txt', delimiter=";")
df = pd.DataFrame(df[['yyyymmddhh', 'rainfall']])
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%M%d%H")
df.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
stn_n;yyyymmddhh;rainfall
xyz;1980123123;-
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
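For what it's worth, a minimal sketch of a fix: the format string uses %M (minutes) where %m (month) is intended, which misparses the dates; with that corrected, the chained resample works. The snippet below uses an in-memory subset of the sample records instead of data.txt:

```python
import io
import pandas as pd

# A subset of the records above; "-" is read as missing data.
raw = """stn_n;yyyymmddhh;rainfall
xyz;1980123123;-
xyz;1981010100;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
"""
df = pd.read_csv(io.StringIO(raw), delimiter=";", na_values=["-"])

# %m is the month; %M would be minutes and silently misparses the date.
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"].astype(str),
                                  format="%Y%m%d%H")

annual_max = (
    df.set_index("yyyymmddhh")["rainfall"]
      .resample("3H").sum()   # 3-hourly rainfall totals
      .resample("Y").max()    # annual maximum of the 3-hourly totals
)
print(annual_max)  # 1980 -> 0.0, 1981 -> 1.1
```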
I have a CSV of data I've loaded into a dataframe that I'm trying to massage: I want to create a new column that contains the difference from one record to another, grouped by another field.
Here's my code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon.sort_values('county', inplace=True)
# This is not working; I was hoping to find the differences from one day to another on a per-county basis
oregon['delta'] = oregon.groupby(['state','county'])['cases'].shift(1, fill_value=0)
oregon.tail()
Unfortunately, I'm getting results where the delta is always the same as the cases.
I'm new at Pandas and relatively inexperienced with Python, so bonus points if you can point me towards how to best read the documentation.
Let's try:
oregon['delta'] = oregon.groupby(['state','county'])['cases'].diff().fillna(0)
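A small sketch of the difference: .diff() gives the change from the previous row within each group, while .shift() merely copies the previous row's value, which is why the delta came out equal to the cases. The toy data below stands in for the county case counts:

```python
import pandas as pd

# Toy stand-in for per-county cumulative case counts.
df = pd.DataFrame({
    'county': ['A', 'A', 'A', 'B', 'B'],
    'cases':  [1, 3, 6, 2, 5],
})

# diff() subtracts the previous value within each group;
# the first row of each group has no predecessor, hence fillna(0).
df['delta'] = df.groupby('county')['cases'].diff().fillna(0)
print(df['delta'].tolist())  # [0.0, 2.0, 3.0, 0.0, 3.0]
```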
I'm trying to create an empty DataFrame to which I will then constantly append rows, using the time stamp when the data arrives as the index.
This is the code I have so far:
import pandas as pd
import datetime
df = pd.DataFrame(columns=['a','b'],index=pd.DatetimeIndex(freq='s'))
df.loc[event.get_datetime()] = event.get_data()
The problem I'm having is with freq in the DatetimeIndex: the data is not arriving at any predefined interval, only when some event triggers. Also, in the code above I would need to specify a start and end date for the index, which I don't want; I just want to be able to append rows whenever they arrive.
Set up an empty frame with pd.to_datetime:
df = pd.DataFrame(columns=['a','b'], index=pd.to_datetime([]))
Then do this
df.loc[pd.Timestamp('now')] = pd.Series([1, 2], ['a', 'b'])
df
a b
2018-06-10 20:52:52.025426 1 2
The first argument of DatetimeIndex is data. Try setting data to an empty list. If you want to define the start time, end time, or frequency, take a look at the other arguments of DatetimeIndex.
df = pd.DataFrame(columns=['a','b'], index=pd.DatetimeIndex([], name='startime'))
If you're trying to index on time delta values, also consider
df = pd.DataFrame(columns=['a','b'], index=pd.TimedeltaIndex([]))
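For completeness, a small sketch of how such an index grows as events arrive; the timestamps and values here are illustrative stand-ins for event.get_datetime() and event.get_data():

```python
import pandas as pd

# Empty frame with an empty DatetimeIndex; rows are appended per event.
df = pd.DataFrame(columns=['a', 'b'], index=pd.to_datetime([]))

# Illustrative events: (timestamp, [a, b]) pairs.
events = [
    (pd.Timestamp('2018-06-10 20:52:52'), [1, 2]),
    (pd.Timestamp('2018-06-10 20:53:03'), [3, 4]),
]
for ts, values in events:
    df.loc[ts] = values  # enlarges the frame by one row

print(len(df))  # 2
```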
I am trying to plot a pandas DataFrame with a Timestamp index that has a time gap in it. Using pandas.plot() results in linear interpolation between the last Timestamp of the former segment and the first Timestamp of the next. I do not want linear interpolation, nor do I want empty space between the two date segments. Is there a way to do that?
Suppose we have a DataFrame with a Timestamp index:
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> df = pd.DataFrame(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
>>> df = df.cumsum()
Now let's take two time chunks of it and plot them:
>>> df = pd.concat([df['Jan 2000':'Aug 2000'], df['Jan 2001':'Aug 2001']])
>>> df.plot()
>>> plt.show()
The resulting plot has an interpolation line connecting the Timestamps enclosing the gap. I cannot figure out how to upload pictures on this machine, but these pictures from Google Groups show my problem (interpolated.jpg, no-interpolation.jpg and no-gaps.jpg). I can recreate the first as shown above. The second is achievable by replacing all gap values with NaN (see also this question). How can I achieve the third version, where the time gap is omitted?
Try:
df.plot(x=df.index.astype(str))
You may want to customize ticks and tick labels.
EDIT
That works for me using pandas 0.17.1 and numpy 1.10.4.
All you really need is a way to convert the DatetimeIndex to another type which is not datetime-like. In order to get meaningful labels I chose str. If x=df.index.astype(str) does not work with your combination of pandas/numpy/whatever you can try other options:
df.index.to_series().dt.strftime('%Y-%m-%d')
df.index.to_series().apply(lambda x: x.strftime('%Y-%m-%d'))
...
I realized that resetting the index is not necessary so I removed that part.
In my case I had a DatetimeIndex instead of Timestamps, but the following works for me in pandas 0.24.2 to eliminate the time-series gaps after converting the DatetimeIndex entries to strings.
df = pd.read_sql_query(sql, sql_engine)
df.set_index('date', inplace=True)
df.index = df.index.map(str)
I need to filter data to specific hours of the day. The DataFrame method between_time seems to be the proper way to do that; however, it only works on the index of the DataFrame, but I need the data in its original format (e.g. pivot tables will expect the datetime column with its proper name, not as the index).
This means that each filter looks something like this:
df.set_index(keys='my_datetime_field').between_time('8:00','21:00').reset_index()
Which implies that there are two reindexing operations every time such a filter is run.
Is this a good practice or is there a more appropriate way to do the same thing?
Create a DatetimeIndex, but store it in a variable, not the DataFrame.
Then call its indexer_between_time method. This returns an integer array which can then be used to select rows from df using iloc:
import pandas as pd
import numpy as np
N = 100
df = pd.DataFrame(
{'date': pd.date_range('2000-1-1', periods=N, freq='H'),
'value': np.random.random(N)})
index = pd.DatetimeIndex(df['date'])
df.iloc[index.indexer_between_time('8:00','21:00')]
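As a quick sanity check (a sketch, not part of the original answer), both approaches select the same rows, so the iloc route simply skips the set_index/reset_index round-trip:

```python
import numpy as np
import pandas as pd

N = 100
df = pd.DataFrame(
    {'date': pd.date_range('2000-1-1', periods=N, freq='H'),
     'value': np.random.random(N)})

# Index kept in a variable, not attached to the DataFrame.
index = pd.DatetimeIndex(df['date'])
fast = df.iloc[index.indexer_between_time('8:00', '21:00')]

# The double-reindexing variant from the question, for comparison.
slow = df.set_index('date').between_time('8:00', '21:00').reset_index()

print(len(fast))  # 56 rows: 14 hours/day over the four full days
```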