Pandas: Using iterrows() and pd.Series to Append Values to Series

My input data looks like this:
   cat                start    target
0    1  2016-09-01 00:00:00  4.370279
1    1  2016-09-01 00:00:00  1.367778
2    1  2016-09-01 00:00:00  0.385834
I want to build a series using "start" for the start date and "target" for the series values. iterrows() pulls the correct value into "imp" on each pass, but in the appended series, only the first value is carried through to all series points. Why does "data=imp" pull the 0th row every time?
t0 = model_input_test['start'][0]     # t0 = 2016-09-01 00:00:00
num_ts = len(model_input_test.index)  # num_ts = 1348
time_series = []
for i, row in model_input_test.iterrows():
    imp = row.loc['target']
    print(imp)
    index = pd.DatetimeIndex(start=t0, freq='H', periods=num_ts)
    time_series.append(pd.Series(data=imp, index=index))
Series "time_series" should look like this:
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 1.367778
2016-09-01 02:00:00 0.385834
But ends up looking like this:
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 4.370279
2016-09-01 02:00:00 4.370279
I'm using Jupyter (conda_python3) on SageMaker.

When using dataframes, there are usually better ways to go about tasks than iterating through the dataframe row by row. For example, in your case, you can create your series like this:
time_series = df.set_index(pd.date_range(pd.to_datetime(df.start).iloc[0],
                                         periods=len(df), freq='H'))['target']
>>> time_series
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 1.367778
2016-09-01 02:00:00 0.385834
Freq: H, Name: target, dtype: float64
>>> type(time_series)
<class 'pandas.core.series.Series'>
Essentially, this says: "set the index to be a date range incremented hourly from your first date, then take the target column"
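As for the original question of why every point repeats the first value: pd.Series(data=imp, index=index) with a scalar imp broadcasts that single value across the whole index, so each loop iteration appends a constant series of length num_ts, and the first element of the time_series list is 1348 copies of 4.370279. A minimal sketch of the broadcast behaviour:
import pandas as pd

idx = pd.date_range('2016-09-01', periods=3, freq='H')
s = pd.Series(data=4.370279, index=idx)  # scalar data is repeated for every index entry
print(s)
# 2016-09-01 00:00:00    4.370279
# 2016-09-01 01:00:00    4.370279
# 2016-09-01 02:00:00    4.370279
# Freq: H, dtype: float64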

Given a dataframe df with columns start and target, you can simply use set_index:
time_series = df.set_index('start')['target']
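Note that this assumes each row's start holds that row's own timestamp. With the sample data above it does not, which is easy to check:
>>> df['start'].is_unique
False
so set_index('start') would produce a duplicated index there, and the date_range approach from the first answer is the one that matches the desired output.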

Related

Time Series Data Reformat

I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, with the header being [Date, Time, Value]. I want to reformat the dataframe to index by the date and use a header with the time (i.e. 0:00, 1:00, ..., 23:00). The dataframe will be filled in with the value.
Here is the DataFrame I currently have.
Essentially I'd like to move the index to a single day and show the hours through the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns, with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[:, 0:10]
Time        00:00:00  01:00:00  02:00:00  03:00:00  04:00:00  05:00:00  06:00:00  07:00:00  08:00:00  09:00:00
Date
2019-01-01  0.732494  0.087657  0.930405  0.958965  0.531928  0.891228  0.664634  0.432684  0.009653  0.604878
2019-01-02  0.471386  0.575126  0.509707  0.715290  0.337983  0.618632  0.413530  0.849033  0.725556  0.186876
You could also try this: split the time part to get only the hour and prefix it with "hr".
df = pd.DataFrame([['2019-01-01', '00:00:00', -127.57],
                   ['2019-01-01', '01:00:00', -137.57],
                   ['2019-01-02', '00:00:00', -147.57]],
                  columns=['Date', 'Time', 'Totals'])
df['hours'] = df['Time'].apply(lambda x: 'hr'+ str(int(x.split(':')[0])))
print(pd.pivot_table(df, values ='Totals', index=['Date'], columns = 'hours'))
Output
hours           hr0      hr1
Date
2019-01-01  -127.57  -137.57
2019-01-02  -147.57      NaN

Converting columns with hours to datetime type pandas

I am trying to convert my "Time" column, which holds values in the form hh:mm:ss, from object to datetime64 in my pandas frame, as I want to filter by hours.
I tried new['Time'] = pd.to_datetime(new['Time'], format='%H:%M:%S').dt.time which has no effect at all (it is still an object).
I also tried new['Time'] = pd.to_datetime(new['Time'],infer_datetime_format=True)
which gets the error message: TypeError: <class 'datetime.time'> is not convertible to datetime
I want to be able to sort my data frame by hours.
How do I convert the object to the hour?
Can I then filter by hour (for example, everything after 8am), or do I have to enter the exact value with minutes and seconds to filter for it?
Thank you
If you want your df['Time'] to be of type datetime64, just use
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M:%S')
print(df['Time'])
This will result in the following column
0 1900-01-01 00:00:00
1 1900-01-01 00:01:00
2 1900-01-01 00:02:00
3 1900-01-01 00:03:00
4 1900-01-01 00:04:00
...
1435 1900-01-01 23:55:00
1436 1900-01-01 23:56:00
1437 1900-01-01 23:57:00
1438 1900-01-01 23:58:00
1439 1900-01-01 23:59:00
Name: Time, Length: 1440, dtype: datetime64[ns]
If you just want to extract the hour from the timestamp, extend pd.to_datetime(...) with .dt.hour.
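For example (Hour is just an illustrative column name):
df['Hour'] = pd.to_datetime(df['Time'], format='%H:%M:%S').dt.hour
df[df['Hour'] >= 8]  # keep everything from 08:00 onwards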
If you want to group your values on an hourly basis you can also use (after converting the df['Time'] to datetime):
new_df = df.groupby(pd.Grouper(key='Time', freq='H'))['Value'].agg(list)
This will return all values grouped by hour.
IIUC, you already have a time structure from datetime module:
Suppose this dataframe:
import pandas as pd
from datetime import time

df = pd.DataFrame({'Time': [time(10, 39, 23), time(8, 47, 59), time(9, 21, 12)]})
print(df)
# Output:
Time
0 10:39:23
1 08:47:59
2 09:21:12
A few operations:
# Check if you have really `time` instance
>>> df['Time'].iloc[0]
datetime.time(10, 39, 23)
# Sort values by time
>>> df.sort_values('Time')
Time
1 08:47:59
2 09:21:12
0 10:39:23
# Extract rows between 08:00 and 09:00
>>> df[df['Time'].between(time(8), time(9))]
Time
1 08:47:59
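For the "everything after 8am" part of the question: a plain comparison also works element-wise on time values (time(9) is used here so that the sample data actually filters something out):
>>> df[df['Time'] >= time(9)]
       Time
0  10:39:23
2  09:21:12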

Finding the Timedelta through a pandas dataframe, I keep getting NaT

So I am reading in a csv file of a 30 minute timeseries going from "2015-01-01 00:00" up to and including "2020-12-31 23:30". There are five sets of these timeseries, each at a certain location, and there are 105215 rows, one for each 30 minutes. My job is to go through and find the timedelta between each row, for each column. It should be 30 minutes for each one, except sometimes it isn't, and I have to find that.
So far I'm reading in the data fine via
ca_time = np.array(ca.iloc[0:, 1], dtype= "datetime64")
ny_time = np.array(ny.iloc[0:, 1], dtype = "datetime64")
tx_time = np.array(tx.iloc[0:, 1], dtype = "datetime64")
#I'm then passing these to a pandas dataframe for more convenient manipulation
frame_ca = pd.DataFrame(data = ca_time, dtype = "datetime64[s]")
frame_ny = pd.DataFrame(data = ny_time, dtype = "datetime64[s]")
frame_tx = pd.DataFrame(data = tx_time, dtype = "datetime64[s]")
#Then concatenating them into an array with 100k+ rows, and the five columns represent each location
full_array = pd.concat([frame_ca, frame_ny, frame_tx], axis = 1)
I now want to find the timedelta between each cell for each respective location.
Currently I'm trying this as a simple test:
first_row = full_array2.loc[1:1, :1]
second_row = full_array2.loc[2:2, :1]
delta = first_row - second_row
I'm getting back
     0    0    0
1  NaT  NaT  NaT
2  NaT  NaT  NaT
This seems simple enough, but I don't know how I'm getting Not a Time here.
For reference, below are both those rows I'm trying to subtract
                    ca                   ny                   tx                   fl                   az
1  2015-01-01 01:00:00  2015-01-01 01:00:00  2015-01-01 01:00:00  2015-01-01 01:00:00  2015-01-01 01:00:00
2  2015-01-01 01:30:00  2015-01-01 01:30:00  2015-01-01 01:30:00  2015-01-01 01:30:00  2015-01-01 01:30:00
Any help appreciated!
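The NaT values come from index alignment: frame subtraction matches rows by index label, so a row labelled 1 minus a row labelled 2 has no overlapping labels and every cell becomes NaT. For consecutive-row differences, diff() sidesteps alignment entirely; a minimal sketch, assuming the full_array built above:
deltas = full_array.diff()  # row-to-row timedelta, per column; the first row is NaT
bad = deltas.ne(pd.Timedelta(minutes=30)) & deltas.notna()
print(full_array[bad.any(axis=1)])  # rows where some location is not 30 minutes after the previous one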

Copy row to another dataframe

I have 2 dataframes with index type DatetimeIndex, and I would like to copy a column from one to the other. The dataframes are:
variable: df
DateTime
2013-01-01 01:00:00 0.0
2013-01-01 02:00:00 0.0
2013-01-01 03:00:00 0.0
....
Freq: H, Length: 8759, dtype: float64
variable: consumption_year
Potência Ativa ... Costs
Datetime ...
2019-01-01 00:00:00 11.500000 ... 1.08874
2019-01-01 01:00:00 6.500000 ... 0.52016
2019-01-01 02:00:00 5.250000 ... 0.38183
2019-01-01 03:00:00 5.250000 ... 0.38183
[8760 rows x 5 columns]
here is my code:
mc.run_model(tmy_data)
df=round(mc.ac.fillna(0)/1000,3)
consumption_year['PVProduction'] = df.iloc[:,[1]] #1
consumption_year['PVProduction'] = df[:,1] #2
I am trying to copy the second column of df to a new column in the consumption_year dataframe, but neither of those attempts worked. Looking at the index, I see 3 major differences:
year (2013 and 2019)
starting hour: 01:00 and 00:00
length: 8760 and 8759
Do I need to resolve those 3 differences first (making the datetime index of df equal to consumption_year's) before I can copy the column across? If so, could you provide a solution to fix those differences?
These are the errors:
1: consumption_year['PVProduction'] = df.iloc[:,[1]]
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
2: consumption_year['PVProduction'] = df[:,1]
raise ValueError("Can only tuple-index with a MultiIndex")
ValueError: Can only tuple-index with a MultiIndex
You can merge two data frames together.
pd.merge(df, consumption_year, left_index=True, right_index=True, how='outer')
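Because the two indexes cover different years (2013 vs 2019), an index-aligned merge will not actually pair the rows. If the rows are meant to line up by position, a minimal sketch is to discard df's index and reuse consumption_year's (the extra 8760th row then simply ends up NaN, since df has only 8759 values):
consumption_year['PVProduction'] = pd.Series(df.to_numpy(),
                                             index=consumption_year.index[:len(df)])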

Pandas datetime columns problem and I don't know what I am missing

I am a Korean student. Please understand that my English is awkward.
I want to split the datetime column into separate year, month, ..., second columns.
train = pd.read_csv('input/Train.csv')
The datetime column looks like this (this is head(20), and I removed the other columns to make it easier to see):
datetime
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
10 2011-01-01 10:00:00
11 2011-01-01 11:00:00
12 2011-01-01 12:00:00
13 2011-01-01 13:00:00
14 2011-01-01 14:00:00
15 2011-01-01 15:00:00
16 2011-01-01 16:00:00
17 2011-01-01 17:00:00
18 2011-01-01 18:00:00
19 2011-01-01 19:00:00
Then I wrote this code to create each column (year, month, day, hour, minute, second):
train['year'] = train['datetime'].dt.year
train['month'] = train['datetime'].dt.month
train['day'] = train['datetime'].dt.day
train['hour'] = train['datetime'].dt.hour
train['minute'] = train['datetime'].dt.minute
train['second'] = train['datetime'].dt.second
and I get an error like this:
AttributeError: Can only use .dt accessor with datetimelike values
please help me ㅠㅅㅠ
Note that by default read_csv is able to deduce the column type only for numeric and boolean columns. Unless explicitly specified (e.g. by passing the converters or dtype parameters), all other input is left as strings, and the pandasonic type of such columns is object. And just this occurred in your case. So, as this column is of object type, you cannot invoke the dt accessor on it; it works only on columns of datetime type.
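So the usual direct fix is to convert first and use .dt only afterwards; a minimal sketch:
train['datetime'] = pd.to_datetime(train['datetime'])
train['year'] = train['datetime'].dt.year  # likewise month, day, hour, minute, second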
Alternatively, in this case, you can take the following approach:
do not specify any conversion for this column (it will be parsed just as object),
then split the datetime column into its "parts" using str.split (all 6 columns with a single instruction),
set proper column names in the resulting DataFrame,
join it to the original DataFrame (deleting the working frame afterwards),
and only now change the type of the original column.
To do it, you can run:
wrk = df['datetime'].str.split(r'[- :]', expand=True).astype(int)
wrk.columns = ['year', 'month', 'day', 'hour', 'minute', 'second']
df = df.join(wrk)
del wrk
df['datetime'] = pd.to_datetime(df['datetime'])
Note that I added astype(int); otherwise these columns would be left as object (actually string) type.
Or maybe the original column is not needed any more (as you have extracted all the date / time components)? In that case, drop this column instead of converting it.
And one last hint: datetime is commonly used as a type name (with various endings), so it is better to use some other name here, at least differing in character case, e.g. DateTime.
