I have 2 dataframes with index type DatetimeIndex, and I would like to copy a column from one to the other. The dataframes are:
variable: df
DateTime
2013-01-01 01:00:00 0.0
2013-01-01 02:00:00 0.0
2013-01-01 03:00:00 0.0
....
Freq: H, Length: 8759, dtype: float64
variable: consumption_year
Potência Ativa ... Costs
Datetime ...
2019-01-01 00:00:00 11.500000 ... 1.08874
2019-01-01 01:00:00 6.500000 ... 0.52016
2019-01-01 02:00:00 5.250000 ... 0.38183
2019-01-01 03:00:00 5.250000 ... 0.38183
[8760 rows x 5 columns]
here is my code:
mc.run_model(tmy_data)
df = round(mc.ac.fillna(0) / 1000, 3)
consumption_year['PVProduction'] = df.iloc[:,[1]] #1
consumption_year['PVProduction'] = df[:,1] #2
I am trying to copy the second column of df to a new column in the consumption_year dataframe, but none of the previous attempts worked. Looking at the indexes, I see 3 major differences:
year (2013 and 2019)
starting hour: 01:00 and 00:00
length: 8760 and 8759
Do I need to resolve those 3 differences first (making df's DatetimeIndex equal to consumption_year's) before I can copy the column over? If so, could you provide a solution to fix those differences?
These are the errors:
1: consumption_year['PVProduction'] = df.iloc[:,[1]]
raise IndexingError("Too many indexers")
pandas.core.indexing.IndexingError: Too many indexers
2: consumption_year['PVProduction'] = df[:,1]
raise ValueError("Can only tuple-index with a MultiIndex")
ValueError: Can only tuple-index with a MultiIndex
Note that df here is actually a Series, not a DataFrame (its printout ends in dtype: float64), which is why the two-dimensional indexers df.iloc[:,[1]] and df[:,1] fail with those errors. You can merge the two objects on their indexes:
pd.merge(df, consumption_year, left_index=True, right_index=True, how='outer')
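Because the indexes cover different years (2013 vs 2019), the merged values will not line up either. A minimal sketch of aligning the indexes first, assuming you simply want df's 2013 hours mapped onto the matching 2019 hours:
# Sketch, assuming df is the hourly 2013 Series shown above:
# move its index from 2013 to 2019 so both indexes overlap.
df.index = df.index.map(lambda ts: ts.replace(year=2019))
# Align to consumption_year's index; the hour missing from df
# (2019-01-01 00:00, since df starts at 01:00) becomes NaN, filled with 0.
consumption_year['PVProduction'] = df.reindex(consumption_year.index).fillna(0)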
I am working on some code that will rearrange a time series. Currently I have a standard time series with three columns, with the header being [Date, Time, Value]. I want to reformat the dataframe to index by the date and use a header with the time (i.e. 0:00, 1:00, ..., 23:00). The dataframe will be filled in with the values.
Essentially, I'd like to move the index to a single day and show the hours across the columns.
Thanks,
Use pivot:
df = df.pivot(index='Date', columns='Time', values='Total')
Output (first 10 columns and with random values for Total):
>>> df.pivot(index='Date', columns='Time', values='Total').iloc[0:10]
time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00
date
2019-01-01 0.732494 0.087657 0.930405 0.958965 0.531928 0.891228 0.664634 0.432684 0.009653 0.604878
2019-01-02 0.471386 0.575126 0.509707 0.715290 0.337983 0.618632 0.413530 0.849033 0.725556 0.186876
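For a self-contained version (the question's DataFrame isn't reproduced here, so the input below is a made-up stand-in with the same Date/Time/Total layout):
import numpy as np
import pandas as pd

# Hypothetical long-format input: two days of hourly rows.
df = pd.DataFrame({
    'Date': np.repeat(['2019-01-01', '2019-01-02'], 24),
    'Time': np.tile(pd.date_range('2019-01-01', periods=24, freq='H').strftime('%H:%M:%S'), 2),
    'Total': np.random.rand(48),
})
wide = df.pivot(index='Date', columns='Time', values='Total')
print(wide.iloc[:, :10])  # first 10 hourly columns, as in the output above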
You could try this: split the time part to get only the hour, and prefix it with hr.
import pandas as pd

df = pd.DataFrame([['2019-01-01', '00:00:00', -127.57],
                   ['2019-01-01', '01:00:00', -137.57],
                   ['2019-01-02', '00:00:00', -147.57]],
                  columns=['Date', 'Time', 'Totals'])
# Keep only the hour from the time string and prefix it with 'hr'.
df['hours'] = df['Time'].apply(lambda x: 'hr' + str(int(x.split(':')[0])))
print(pd.pivot_table(df, values='Totals', index=['Date'], columns='hours'))
Output
hours hr0 hr1
Date
2019-01-01 -127.57 -137.57
2019-01-02 -147.57 NaN
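One reason to prefer pivot_table here: pivot raises a ValueError when a (Date, hours) pair occurs more than once, while pivot_table aggregates the duplicates (mean by default). A quick sketch using the df defined above:
dup = pd.concat([df, df.iloc[[0]]])  # duplicate the first row
print(pd.pivot_table(dup, values='Totals', index=['Date'], columns='hours'))  # aggregates duplicates
# dup.pivot(index='Date', columns='hours', values='Totals')  # would raise ValueError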
I am a Korean student, so please understand that my English is awkward.
I want to split the datetime column into year, month, ..., second columns.
train = pd.read_csv('input/Train.csv')
The datetime column looks like this (this is head(20); I removed the other columns to make it easier to see):
datetime
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
10 2011-01-01 10:00:00
11 2011-01-01 11:00:00
12 2011-01-01 12:00:00
13 2011-01-01 13:00:00
14 2011-01-01 14:00:00
15 2011-01-01 15:00:00
16 2011-01-01 16:00:00
17 2011-01-01 17:00:00
18 2011-01-01 18:00:00
19 2011-01-01 19:00:00
Then I wrote this code to get each component as a column (year, month, day, hour, minute, second):
train['year'] = train['datetime'].dt.year
train['month'] = train['datetime'].dt.month
train['day'] = train['datetime'].dt.day
train['hour'] = train['datetime'].dt.hour
train['minute'] = train['datetime'].dt.minute
train['second'] = train['datetime'].dt.second
and I get an error like this:
AttributeError: Can only use .dt accessor with datetimelike values
please help me ㅠㅅㅠ
Note that by default read_csv is able to deduce column types only for numeric and boolean columns. Unless explicitly specified (e.g. by passing converters or dtype parameters), all other input is left as strings, and the pandasonic type of such columns is object. That is just what occurred in your case: since this column is of object type, you cannot invoke the dt accessor on it, as it works only on columns of datetime type.
Actually, in this case, you can take the following approach:
do not specify any conversion for this column (it will be parsed just as object),
split the datetime column into its "parts" using str.split (all 6 columns with a single instruction),
set proper column names in the resulting DataFrame,
join it to the original DataFrame (then drop the working DataFrame),
and only now change the type of the original column.
To do it, you can run:
# Split the datetime string on '-', ' ' and ':' into 6 integer columns.
wrk = df['datetime'].str.split(r'[- :]', expand=True).astype(int)
wrk.columns = ['year', 'month', 'day', 'hour', 'minute', 'second']
df = df.join(wrk)
del wrk
# Only now convert the original column to datetime.
df['datetime'] = pd.to_datetime(df['datetime'])
Note that I added astype(int); otherwise these columns would be left as object (actually string) type.
Or maybe the original column is not needed any more (as you have extracted all date / time components)? In that case, drop this column instead of converting it.
And one last hint: datetime is widely used as a type name (with various spellings), so it is better to use some other name for the column, at least differing in character case, e.g. DateTime.
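Alternatively (a different route than the split approach above): convert the column while reading the file, and then the dt accessor from your original code works as intended. A minimal sketch:
# Parse the column as datetime up front, then .dt works.
train = pd.read_csv('input/Train.csv', parse_dates=['datetime'])
for part in ['year', 'month', 'day', 'hour', 'minute', 'second']:
    train[part] = getattr(train['datetime'].dt, part)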
My input data looks like this:
cat start target
0 1 2016-09-01 00:00:00 4.370279
1 1 2016-09-01 00:00:00 1.367778
2 1 2016-09-01 00:00:00 0.385834
I want to build out a series using "start" for the Start Date and "target" for the series values. The iterrows() is pulling the correct values for "imp", but when appending to the time_series, only the first value is carried through to all series points. What's the reason for "data=imp" pulling the 0th row every time?
t0 = model_input_test['start'][0]     # t0 = 2016-09-01 00:00:00
num_ts = len(model_input_test.index)  # num_ts = 1348
time_series = []
for i, row in model_input_test.iterrows():
    imp = row.loc['target']
    print(imp)
    index = pd.DatetimeIndex(start=t0, freq='H', periods=num_ts)
    time_series.append(pd.Series(data=imp, index=index))
Series "time_series" should look like this:
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 1.367778
2016-09-01 02:00:00 0.385834
But ends up looking like this:
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 4.370279
2016-09-01 02:00:00 4.370279
I'm using Jupyter conda_python3 on Sagemaker.
When using dataframes, there are usually better ways to go about tasks than iterating through the dataframe. For example, in your case, you can create your series like this:
time_series = df.set_index(pd.date_range(pd.to_datetime(df.start).iloc[0],
                                         periods=len(df), freq='H'))['target']
>>> time_series
2016-09-01 00:00:00 4.370279
2016-09-01 01:00:00 1.367778
2016-09-01 02:00:00 0.385834
Freq: H, Name: target, dtype: float64
>>> type(time_series)
<class 'pandas.core.series.Series'>
Essentially, this says: "set the index to be a date range incremented hourly from your first date, then take the target column"
Given a dataframe df with columns start and target, you can simply use set_index:
time_series = df.set_index('start')['target']
(Note that this uses the start values themselves as the index; in your sample every row shares the same start, so to get the hourly index shown above you still need the date_range approach from the previous answer.)
I have a dataframe as follows
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.random.randn(50000)},
                  index=pd.date_range('1/1/2000', periods=50000, freq='T'))
df.head(10)
Out[37]:
X
2000-01-01 00:00:00 -0.699565
2000-01-01 00:01:00 -0.646129
2000-01-01 00:02:00 1.339314
2000-01-01 00:03:00 0.559563
2000-01-01 00:04:00 1.529063
2000-01-01 00:05:00 0.131740
2000-01-01 00:06:00 1.282263
2000-01-01 00:07:00 -1.003991
2000-01-01 00:08:00 -1.594918
2000-01-01 00:09:00 -0.775230
I would like to create a variable that contains the sum of X over the last 5 days (not including the current observation), only considering observations that fall at the exact same hour as the current observation.
In other words:
At index 2000-01-01 00:00:00, df['rolling_sum_same_hour'] contains the sum of the values of X observed at 00:00:00 during the last 5 days in the data (not including 2000-01-01, of course).
At index 2000-01-01 00:01:00, df['rolling_sum_same_hour'] contains the sum of X observed at 00:01:00 during the last 5 days, and so on.
The intuitive idea is that intraday prices have intraday seasonality, and I want to get rid of it that way.
I tried to use df['rolling_sum_same_hour']=df.at_time(df.index.minute).rolling(window=5).sum()
with no success.
Any ideas?
Many thanks!
Behold the power of groupby!
df = # as you defined above
df['rolling_sum_by_time'] = df.groupby(df.index.time)['X'].apply(lambda x: x.shift(1).rolling(10).sum())
It's a big pill to swallow, but we are grouping by time (as in Python datetime.time), then selecting the column we care about (otherwise apply would work across columns; now it works on the time-groups), and then applying the function you want!
IIUC, what you want is to perform a rolling sum, but only on the observations grouped by the exact same time of day. This can be done by
df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum())
(Note the window size: the question asks for 5 periods, while the answer above used 10.) For example:
In [43]: df.X.groupby([df.index.hour, df.index.minute]).apply(lambda g: g.rolling(window=5).sum()).tail()
Out[43]:
2000-02-04 17:15:00 -2.135887
2000-02-04 17:16:00 -3.056707
2000-02-04 17:17:00 0.813798
2000-02-04 17:18:00 -1.092548
2000-02-04 17:19:00 -0.997104
Freq: T, Name: X, dtype: float64
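To match the question's exact spec (a window of 5 same-time observations, excluding the current one) with the result aligned to the original index, here is a sketch combining both answers via transform:
df['rolling_sum_same_hour'] = (
    df.groupby([df.index.hour, df.index.minute])['X']
      .transform(lambda g: g.shift(1).rolling(window=5).sum())
)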
(newbie to python and pandas)
I have a data set of 15 to 20 million rows; each row is a time-indexed observation of a time a 'user' was seen, and I need to analyze the visit-per-day patterns of each user, normalized to their first visit. So I'm hoping to plot with an X axis of "days after first visit" and a Y axis of "visits by this user on this day", i.e., I need to get a series indexed by a timedelta, with values of visits in the period ending with that delta [0:1, 3:5, 4:2, 6:8,]. But I'm stuck very early ...
I start with something like this:
import pandas as pd
from pandas import Series, DataFrame

rng = pd.to_datetime(['2000-01-01 08:00', '2000-01-02 08:00',
                      '2000-01-01 08:15', '2000-01-02 18:00',
                      '2000-01-02 17:00', '2000-03-01 08:00',
                      '2000-03-01 08:20', '2000-01-02 18:00'])
uid = Series(['u1', 'u2', 'u1', 'u2', 'u1', 'u2', 'u2', 'u3'])
misc = Series(['', 'x1', 'A123', '1.23', '', '', '', 'u3'])
df = DataFrame({'uid': uid, 'misc': misc, 'ts': rng})
df = df.set_index(df.ts)
grouped = df.groupby('uid')
firstseen = grouped.first()
The ts values are unique within each uid but can be duplicated across users (two uids can be seen at the same time, but any one uid is seen only once at any one timestamp).
The first step is (I think) to add a new column to the DataFrame, showing for each observation what the timedelta is back to the first observation for that user. But, I'm stuck getting that column in the DataFrame. The simplest thing I tried gives me an obscure-to-newbie error message:
df['sinceseen'] = df.ts - firstseen.ts[df.uid]
...
ValueError: cannot reindex from a duplicate axis
So I tried a brute-force method:
def f(row):
    return row.ts - firstseen.ts[row.uid]

df['sinceseen'] = Series([{idx: f(row)} for idx, row in df.iterrows()], dtype=timedelta)
In this attempt, df gets a sinceseen column, but it's all NaN and type(df.sinceseen[0]) shows float - though if I just print the Series (in IPython) it generates a nice list of timedeltas.
I'm working back and forth through "Python for Data Analysis" and it seems like apply() should work, but
def fg(ugroup):
    ugroup['sinceseen'] = ugroup.index - ugroup.index.min()
    return ugroup
df = df.groupby('uid').apply(fg)
gives me a TypeError on "ugroup.index - ugroup.index.min()", even though each of the two operands is a Timestamp.
So, I'm flailing - can someone point me at the "pandas" way to get to the data structure I need?
Does this help you get started?
>>> df = DataFrame({'uid':uid,'misc':misc,'ts':rng})
>>> df = df.sort_values(["uid", "ts"])
>>> df["since_seen"] = df.groupby("uid")["ts"].apply(lambda x: x - x.iloc[0])
>>> df
misc ts uid since_seen
0 2000-01-01 08:00:00 u1 0 days, 00:00:00
2 A123 2000-01-01 08:15:00 u1 0 days, 00:15:00
4 2000-01-02 17:00:00 u1 1 days, 09:00:00
1 x1 2000-01-02 08:00:00 u2 0 days, 00:00:00
3 1.23 2000-01-02 18:00:00 u2 0 days, 10:00:00
5 2000-03-01 08:00:00 u2 59 days, 00:00:00
6 2000-03-01 08:20:00 u2 59 days, 00:20:00
7 u3 2000-01-02 18:00:00 u3 0 days, 00:00:00
[8 rows x 4 columns]
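From here, the visits-per-day structure the question asks for (a count per user for each whole day since first visit) could be sketched as:
# Sketch, building on df["since_seen"] from above:
# count each user's visits per whole day after their first visit.
df["days_after_first"] = df["since_seen"].dt.days
visits_per_day = df.groupby(["uid", "days_after_first"]).size()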