I need to reshape a dataframe that looks like df1 and turn it into df2. There are 2 considerations for this procedure:
I need to be able to set the number of rows to be sliced as a parameter (length).
I need to split date and time from the index, using the dates as the column names in the reshape and keeping the times as the index.
Current df1
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
Desired Output df2 - With the parameter 'length=5'
2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
What I have done:
My approach was to create a multi-index (Date - Time) and then do a pivot table or some sort of reshape to achieve the desired df output.
import numpy as np
import pandas as pd

# First separate time and date
df['TimeStamp'] = df.index
df['Date'] = df.index.date
df['Time'] = df.index.time

# Then mark the first row of each slice with its date, so those dates can be
# used to build a multi-index
df['Num'] = np.arange(len(df))
for index, row in df.iterrows():
    if row['Num'] % 5 == 0:
        df.loc[index, 'EventDate'] = df.loc[index, 'Date']
df.set_index(['EventDate', 'Time'], inplace=True)
del df['Date']
del df['Num']
del df['TimeStamp']
Problem: a NaN appears next to each date in the first level of the multi-index (which makes sense: 'EventDate' is only assigned on every fifth row, so the rows in between get NaN). And even if that worked, I can't find how to do the reshape I need with a multi-index df.
I'm stuck. I appreciate any input.
import numpy as np
import pandas as pd
import io
data = '''\
val
2007-08-07 18:00:00  1
2007-08-08 00:00:00  2
2007-08-08 06:00:00  3
2007-08-08 12:00:00  4
2007-08-08 18:00:00  5
2007-11-02 18:00:00  6
2007-11-03 00:00:00  7
2007-11-03 06:00:00  8
2007-11-03 12:00:00  9
2007-11-03 18:00:00  10'''
df = pd.read_csv(io.StringIO(data), sep=r'\s{2,}', engine='python', parse_dates=True)
chunksize = 5
chunks = len(df)//chunksize
df['Date'] = np.repeat(df.index.date[::chunksize], chunksize)[:len(df)]
index = df.index.time[:chunksize]
df['Time'] = np.tile(np.arange(chunksize), chunks)
df = df.set_index(['Date', 'Time'], append=False)
df = df['val'].unstack('Date')
df.index = index
print(df)
yields
Date 2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
Note that the final DataFrame has an index with non-unique entries. (The
18:00:00 is repeated.) Some DataFrame operations are problematic when the
index has repeated entries, so in general it is better to avoid this if
possible.
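For instance, a small illustration of the sort of failure duplicate labels can cause (a hypothetical follow-up on the df built above):
import datetime
# selecting by a repeated label returns several rows rather than one
print(df.loc[datetime.time(18, 0)])   # two rows, one per chunk
# and label-based reindexing refuses to work on a non-unique index
# df.reindex([datetime.time(0, 0)])   # raises ValueError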
First of all, I'm assuming your datetime column is actually of datetime type; if not, use df['t'] = pd.to_datetime(df['t']) to convert it.
Then set your index using a MultiIndex and unstack:
df.index = pd.MultiIndex.from_tuples(df['t'].apply(lambda x: (x.time(), x.date())))
df['v'].unstack()
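A minimal self-contained run of this approach, assuming columns named t and v:
import pandas as pd
df = pd.DataFrame({
    't': pd.to_datetime(['2007-08-07 18:00:00', '2007-08-08 00:00:00',
                         '2007-11-02 18:00:00', '2007-11-03 00:00:00']),
    'v': [1, 2, 6, 7],
})
df.index = pd.MultiIndex.from_tuples(df['t'].apply(lambda x: (x.time(), x.date())))
print(df['v'].unstack())   # dates become the columns, times stay as the index
Note that times missing for a given date show up as NaN, just like in the pivot below.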
This would be a canonical approach for pandas:
First, setup with imports and data:
import pandas as pd
import io
txt = '''2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10'''
Now read in the DataFrame, and pivot on the correct columns:
df1 = pd.read_csv(io.StringIO(txt), sep=' ',
                  names=['d', 't', 'n'])
print(df1.pivot(index='t', columns='d', values='n'))
prints a pivoted df:
d 2007-08-07 2007-08-08 2007-11-02 2007-11-03
t
00:00:00 NaN 2 NaN 7
06:00:00 NaN 3 NaN 8
12:00:00 NaN 4 NaN 9
18:00:00 1 5 6 10
You won't get a length of 5, though. The following,
2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
is incorrect, because it lists 18:00:00 twice under the same date column, while in the initial data those two values belong to different dates.
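If the positional, length-based layout is really what you want despite the repeated time labels, here is a sketch using integer chunking (reusing df1 from above; the date of each chunk's first row becomes the column header):
import numpy as np
length = 5                                  # the slice-size parameter
pos = np.arange(len(df1))
df2 = (df1.assign(col=df1['d'].values[pos - pos % length],  # leading date per chunk
                  row=pos % length)
           .pivot(index='row', columns='col', values='n'))
df2.index = df1['t'].values[:length]        # restore the time-of-day labels
print(df2)
This reproduces the desired df2, at the cost of the non-unique index discussed above.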
Related
Concatenating this dask DataFrame to this pandas DataFrame and using set_index to sort index does not result in a sorted index. Is this normal?
from dask import dataframe as dd
import pandas as pd
a=list('aabbccddeeffgghhi')
df = pd.DataFrame(dict(a=a),
index = pd.date_range(start='2010/01/01', end='2010/02/01', periods=len(a))).reset_index()
ddf = dd.from_pandas(df, npartitions=5)
a2=list('aabbccddeef')
df2 = pd.DataFrame(dict(a=a2),
index = pd.date_range(start='2020/01/01',end='2020/01/06', periods=len(a2))).reset_index()
ddf2 = dd.concat([ddf, df2]).set_index('index')
ddf2.compute()
a
index
2010-01-01 00:00:00 a
2010-01-02 22:30:00 a
2010-01-04 21:00:00 b
2010-01-06 19:30:00 b
2010-01-08 18:00:00 c
2010-01-10 16:30:00 c
2010-01-12 15:00:00 d
2010-01-14 13:30:00 d
2010-01-16 12:00:00 e
2010-01-18 10:30:00 e
2010-01-20 09:00:00 f
2010-01-22 07:30:00 f
2010-01-24 06:00:00 g
2010-01-26 04:30:00 g
2010-01-28 03:00:00 h
2010-01-30 01:30:00 h
2010-02-01 00:00:00 i
2020-01-01 00:00:00 a
2020-01-01 12:00:00 a
2020-01-02 00:00:00 b
2020-01-02 12:00:00 b
2020-01-03 00:00:00 c
2020-01-03 12:00:00 c
2020-01-04 00:00:00 d
2020-01-04 12:00:00 d
2020-01-05 00:00:00 e
2020-01-05 12:00:00 e
2020-01-06 00:00:00 f
Am I doing something wrong?
Yes, it is completely normal, because most pandas operations do not assume a sorted index (though some do).
With dask DataFrames you must apply
ddf2 = dd.concat([ddf, df2]).set_index('index', sorted=True)
By the way, your data is already properly sorted by index; note the years (2010, 2020).
Try sort_index to sort. set_index just sets the index; it doesn't force a re-sort.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html
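For example, a minimal sketch of forcing the order after materialising the result in pandas:
# compute to a pandas DataFrame first, then sort the index explicitly
result = ddf2.compute().sort_index()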
I have a DataFrame and want to multiply all values in a column a for a certain day with the value of a at 6h00m00 of that day. If there is no 6h00m00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How do I have to correct this code / replace it with any working solution?
import pandas as pd
import numpy as np
start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame({'a': [1, 3, 4, 5, 6, 7, 8, 9, 14]})
df['date']= datetime1
print(df)
def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
The DataFrame looks like this:
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
and the transform call then raises:
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here, since it implicitly passes all the columns of each group as a DataFrame to the custom function.
To read more about the difference between the two methods, consider this answer.
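A tiny illustration of the difference, on a made-up frame df_demo:
import pandas as pd
df_demo = pd.DataFrame({'g': ['x', 'x', 'y'], 'a': [1, 2, 3], 'b': [4, 5, 6]})
# .apply hands each group over as a DataFrame, all columns at once
print(df_demo.groupby('g').apply(lambda grp: type(grp).__name__))   # 'DataFrame' per group
# .transform calls the function once per column, passing a Series
print(df_demo.groupby('g').transform(lambda s: type(s).__name__))   # a frame full of 'Series'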
If you stick to using the between_time(...) function, this would be the way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a DataFrame with the same number of rows as the original (simulating a transform):
def myF(grp):
    time = grp.date.dt.strftime('%T')
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp
df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
See that, for each day, every value with a time other than 06:00:00 is multiplied by the value at 06:00:00. It returns NaN for the 06:00:00 rows themselves, as well as for groups without that time.
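For completeness, a vectorized sketch of the same idea without apply (column names as above; here days without a 06:00:00 row stay unchanged, as the question asks):
day = df.date.dt.floor('D')
at_six = df['a'].where(df.date.dt.time == pd.Timestamp('06:00').time())
factor = at_six.groupby(day).transform('first')   # each day's 06:00:00 value, NaN if absent
df['a_mult'] = df['a'] * factor.fillna(1)         # no 06:00:00 entry -> leave unchanged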
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00 and then proceed in one-hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can I make it start at 00:00:00?
Thanks for the help!
You can create an hour column from the Hora_Retiro column (for a timedelta column, the hour part lives in .dt.components):
df['hour'] = df['Hora_Retiro'].dt.components.hours
And then group by that hour:
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'].astype(str), format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that the Hora_Retiro column in your DataFrame is of
Timedelta type. It is not datetime, since in that case the date part
would be printed as well.
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours":
floor each element in this column to the hour,
then group (just by this floored value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.count()
However, I advise you to decide what you want to count:
rows, or the values in the count_uses column.
In the second case, replace the count function with sum.
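For the sum variant, the sketch becomes:
# sum the pre-counted uses per hour instead of counting rows
hora_pico.groupby(hora_pico.Hora_Retiro.dt.floor('H')).count_uses.sum()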
After importing data from an HDF5 file, the index for my stock data has disappeared.
One of the columns in my DataFrame, "Date", is of datetime64 type. How do I convert this date column to a DatetimeIndex, but without the time part at the end?
So that slicing the DataFrame like data.loc["2016-01-01":"2016-02-06"] works.
IIUC, starting from a sample dataframe such as:
Date x
0 2016-01-01 20:01 1
1 2016-01-02 20:02 2
you can do:
df = df.set_index(pd.DatetimeIndex(df['Date']).date)
which returns your DatetimeIndex only with the date part:
Date x
2016-01-01 2016-01-01 20:01 1
2016-01-02 2016-01-02 20:02 2
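Note that .date produces an index of plain Python date objects. If you would rather keep a true DatetimeIndex, a sketch using normalize(), which zeroes out the time part instead of dropping it:
df = df.set_index(pd.DatetimeIndex(df['Date']).normalize())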
Use set_index to convert the column to an index. You do not need to trim the time part for it to work.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ts': pd.date_range('2015-12-20', periods=10, freq='12h'),
    'stuff': np.random.randn(10)
})
print(df)
stuff ts
0 0.942231 2015-12-20 00:00:00
1 1.229604 2015-12-20 12:00:00
2 -0.162319 2015-12-21 00:00:00
3 -0.142590 2015-12-21 12:00:00
4 1.057184 2015-12-22 00:00:00
5 -0.370927 2015-12-22 12:00:00
6 -0.358605 2015-12-23 00:00:00
7 -0.561857 2015-12-23 12:00:00
8 -0.020714 2015-12-24 00:00:00
9 0.552764 2015-12-24 12:00:00
print(df.set_index('ts').loc['2015-12-21':'2015-12-23'])
stuff
ts
2015-12-21 00:00:00 -0.162319
2015-12-21 12:00:00 -0.142590
2015-12-22 00:00:00 1.057184
2015-12-22 12:00:00 -0.370927
2015-12-23 00:00:00 -0.358605
2015-12-23 12:00:00 -0.561857
Given a df of this kind, where we have DateTime Index:
DateTime A
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
I would like to subset observations using the attributes of the index, like:
First business day of the month
Last business day of the month
First Friday of the month 'WOM-1FRI'
Third Friday of the month 'WOM-3FRI'
I'm specifically interested to know if this can be done using something like:
df.loc[(df['A'] < 5) & (df.index == 'WOM-3FRI'), 'Signal'] = 1
Thanks
You could try...
# first row of each month present in the data (the very first month is missed)
df.loc[df[1:][df.index.month[:-1] != df.index.month[1:]].index]
# last row of each month present in the data (the very last month is missed)
df.loc[df[:-1][df.index.month[:-1] != df.index.month[1:]].index]
# 1st Friday of each month: weekday 4, day of month 1-7
fr1 = df.groupby(df.index.year * 100 + df.index.month).apply(
    lambda x: x[(x.index.day <= 7) & (x.index.weekday == 4)])
# 3rd Friday of each month: weekday 4, day of month 15-21
fr3 = df.groupby(df.index.year * 100 + df.index.month).apply(
    lambda x: x[(x.index.day >= 15) & (x.index.day <= 21) & (x.index.weekday == 4)])
If you want to remove the extra level in the index of fr1 and fr3:
fr1.index=fr1.index.droplevel(0)
fr3.index=fr3.index.droplevel(0)
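Alternatively, a sketch using pandas' built-in date offsets: build the target dates first, then keep only the rows whose (normalized) dates match. third_fridays is a name made up for this example:
# third Fridays within the data's span
third_fridays = pd.date_range(df.index.min(), df.index.max(), freq='WOM-3FRI')
df[df.index.normalize().isin(third_fridays)]
# mirroring the question's intent:
df.loc[(df['A'] < 5) & df.index.normalize().isin(third_fridays), 'Signal'] = 1
The same pattern works for business month starts and ends with freq='BMS' and freq='BM'.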