In Rodeo I created a dataframe with a 'Date' column as a range of dates by entering:
import pandas as pd
import numpy as np

temp_df = pd.DataFrame({
    'Date': pd.date_range('2017-01-01', periods=100, freq='D'),
    'Value': np.random.normal(10, 5, size=100).tolist()
})
However, when I click on the dataframe, it shows
Date
1483228800000
1483315200000
1483401600000
1483488000000
...
in the 'Date' column. Yet, the datetime format works properly when I try:
>>> temp_df.Date.head()
0 2017-01-01
1 2017-01-02
2 2017-01-03
3 2017-01-04
4 2017-01-05
May I know what I am missing in my code? Thanks a lot.
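One likely explanation (a guess based on the values shown, not specific to Rodeo's internals): the grid viewer is rendering the datetime64 column as Unix epoch milliseconds rather than as formatted dates. A quick sketch confirms the numbers decode to the same dates:
import pandas as pd
# 1483228800000 ms since the epoch is 2017-01-01 00:00:00 UTC
pd.to_datetime([1483228800000, 1483315200000], unit='ms')
# DatetimeIndex(['2017-01-01', '2017-01-02'], dtype='datetime64[ns]', freq=None)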
I have the following date column that I would like to transform to a pandas datetime object. Is it possible to do this with weekly data? For example, 1-2018 stands for week 1 in 2018 and so on. I tried the following conversion but I get an error message: Cannot use '%W' or '%U' without day and year
import pandas as pd
df1 = pd.DataFrame(columns=["date"])
df1['date'] = ["1-2018", "1-2018", "2-2018", "2-2018", "3-2018", "4-2018", "4-2018", "4-2018"]
df1["date"] = pd.to_datetime(df1["date"], format = "%W-%Y")
You need to add a day of the week to the datetime format:
df1["date"] = pd.to_datetime('0' + df1["date"], format='%w%W-%Y')
print(df1)
Output
date
0 2018-01-07
1 2018-01-07
2 2018-01-14
3 2018-01-14
4 2018-01-21
5 2018-01-28
6 2018-01-28
7 2018-01-28
As the error message says, you need to specify the day of the week by adding %w:
df1["date"] = pd.to_datetime('0' + df1.date, format='%w%W-%Y')
I have a dataframe with dates and tick-data like below
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.192 160.225
3 20160601 00:00:00.327 160.230
4 20160601 00:00:01.606 160.231
5 20160601 00:00:01.613 160.230
I want to keep only the unique values in the 'Bid' column within set time intervals,
e.g. 2016-06-01 00:00:00 - 00:15:00, 2016-06-01 00:15:00 - 00:30:00, ...
The result will be a new dataframe (keeping the filtered values with its datetime).
Here's the code I have so far:
#Convert Date column to index with seconds as base
df['Date'] = pd.DatetimeIndex(df['Date'])
df['Date'] = df['Date'].astype('datetime64[s]')
df.set_index('Date', inplace=True)
#Create new DataFrame with filtered values
ts = pd.DataFrame(df.loc['2016-06-01'].between_time('00:00', '00:30')['Bid'].unique())
With the method above I lose the dates (datetime) of the filtered values while creating the new DataFrame, plus I have to manually input each date and time interval, which is unrealistic.
Output:
0
0 160.225
1 160.226
2 160.230
3 160.231
4 160.232
5 160.228
6 160.227
Ideally I'm looking for an operation where I can set the time interval as a timedelta and have the whole file (about 8 GB) processed at once, creating a new DataFrame with Date and Bid columns holding the unique values within each interval. Like this:
Date Bid
0 20160601 00:00:00.020 160.225
1 20160601 00:00:00.136 160.226
2 20160601 00:00:00.327 160.230
3 20160601 00:00:01.606 160.231
...
805 20160601 00:15:00.606 159.127
PS: I also tried the pd.rolling() and pd.resample() methods with apply(lambda x: ...) (e.g. df['Bid'].unique()), but I was never able to make it work; maybe someone better at it could attempt it.
Just to clarify: this is not a rolling calculation. You mentioned attempting to solve this using rolling, but from your clarification it seems you want to split the time series into discrete, non-overlapping 15-minute windows.
Setup
df = pd.DataFrame({
'Date': [
'2016-06-01 00:00:00.020', '2016-06-01 00:00:00.136',
'2016-06-01 00:15:00.636', '2016-06-01 00:15:02.836',
],
'Bid': [150, 150, 200, 200]
})
print(df)
Date Bid
0 2016-06-01 00:00:00.020 150
1 2016-06-01 00:00:00.136 150 # Should be dropped
2 2016-06-01 00:15:00.636 200
3 2016-06-01 00:15:02.836 200 # Should be dropped
First, verify that your Date column is datetime:
df.Date = pd.to_datetime(df.Date)
Now use dt.floor to round each value down to the nearest 15 minutes, and use this new column to drop_duplicates per 15-minute window, while still keeping the full precision of your dates.
df.assign(flag=df.Date.dt.floor('15T')).drop_duplicates(['flag', 'Bid']).drop(columns='flag')
Date Bid
0 2016-06-01 00:00:00.020 150
2 2016-06-01 00:15:00.636 200
From my original answer, but I still believe it holds value. If you'd like to access the unique values per group, you can make use of pd.Grouper and unique; learning to leverage pd.Grouper is a powerful skill to have with pandas:
df.groupby(pd.Grouper(key='Date', freq='15T')).Bid.unique()
Date
2016-06-01 00:00:00 [150]
2016-06-01 00:15:00 [200]
Freq: 15T, Name: Bid, dtype: object
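Since the full file is around 8 GB, here is a minimal sketch of applying the same dt.floor/drop_duplicates idea in chunks (the file name ticks.csv and the chunk size are placeholders; adjust them to your data):
import pandas as pd

pieces = []
for chunk in pd.read_csv('ticks.csv', parse_dates=['Date'], chunksize=5_000_000):
    chunk['flag'] = chunk['Date'].dt.floor('15T')
    pieces.append(chunk.drop_duplicates(['flag', 'Bid']))     # dedupe within the chunk

result = (pd.concat(pieces)
            .drop_duplicates(['flag', 'Bid'])                 # dedupe across chunk boundaries
            .drop(columns='flag')
            .reset_index(drop=True))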
I have a large pandas dataframe (40 million rows) with the following format :
ID DATETIME TIMESTAMP
81215545953683710540 2017-01-01 17:39:57 1483243205
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243353
I am trying to assign a value to each row depending on the timestamp, so that I have groups of rows sharing the same value if they fall in the same time interval.
Let's say I have t0 = 1483243205 and I want the value to change once TIMESTAMP reaches t0 + 10. So here my time interval would be 10.
I would like something like that :
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261 5
48126186377367976994 2017-01-01 17:19:29 1483243263 5
23522333658893375671 2017-01-01 12:50:46 1483243266 6
16194691060240380504 2017-01-01 15:59:23 1483243288 8
Here is my code :
df['VALUE']=''
t=1483243205
j=0
for i in range(0, len(df['TIMESTAMP'])):
    while (df.iloc[i][2]) < (t + 10):
        df['VALUE'][i] = j
        i += 1
    t += 10
    j += 1
I get a warning when executing my code (SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame) and I have the following result:
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243288
It is not the first time I have encountered this warning, and I have always overcome it, but I am confused by the fact that I only got a value for the first row.
Does anyone know what I am missing ?
Thanks
I would suggest using pandas' cut method to achieve this, which avoids the need to explicitly loop through your DataFrame.
tmin, tmax = df['TIMESTAMP'].min(), df['TIMESTAMP'].max()
bins = [i for i in range(tmin, tmax+10, 10)]
labels = [i for i in range(len(bins)-1)]
df['VALUE'] = pd.cut(df['TIMESTAMP'], bins=bins, labels=labels, include_lowest=True)
ID DATETIME TIMESTAMP VALUE
0 81215545953683710540 2017-01-01 17:39:57 1483243205 0
1 74994612102903447699 2017-01-01 19:14:12 1483243261 5
2 48126186377367976994 2017-01-01 17:19:29 1483243263 5
3 23522333658893375671 2017-01-01 12:50:46 1483243266 6
4 16194691060240380504 2017-01-01 15:59:23 1483243288 8
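If the buckets always start at the minimum timestamp and the width is fixed, an equivalent vectorized one-liner (a sketch of the same idea, using integer division) is:
df['VALUE'] = (df['TIMESTAMP'] - df['TIMESTAMP'].min()) // 10
Note that cut's bins are right-closed, so a value falling exactly on a bin edge may land one bucket lower than with integer division; for this data the results match the desired output either way.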
I am new to Python and working my way through my crawling project. I have two questions regarding a few pandas methods.
Below is my data table "js"
apple banana
period
2017-01-01 100.00000 22.80130
2017-02-01 94.13681 16.28664
2017-03-01 85.34201 13.68078
2017-04-01 65.79804 9.77198
2017-05-01 43.32247 13.35504
2017-06-01 72.63843 9.44625
2017-07-01 78.82736 9.77198
2017-08-01 84.03908 10.09771
2017-09-01 90.55374 13.35504
2017-10-01 86.64495 9.12052
Below is my code to apply the apple and banana values to a new DataFrame.
import pandas as pd
from datetime import datetime, timedelta
dd = pd.date_range('2017-01-01',datetime.now().date() - timedelta(1))
df = pd.DataFrame.set_index(dd) #this part has error
The first step is to set my df index as a date_range ('2017-01-01' to yesterday, daily). The error message says I am missing 1 required positional argument: 'keys'. Is it possible to set the index to daily dates from '2017-01-01' to yesterday?
After that is solved, I am trying to add my "js" data such as 'apple' and 'banana' as columns, placing each value against the corresponding df index date. This example only shows the 'apple' and 'banana' columns, but in my real data set I have thousands more...
Please let me know an efficient way to solve my problem. Thanks in advance!
------------------EDIT------------------------
The date indexing works perfectly with @COLDSPEED's answer.
dd = pd.date_range('2017-01-01',datetime.now().date() - timedelta(1))
df.index = pd.to_datetime(df.index) # ignore if not needed
df = df.reindex(dd, fill_value=0.0)
One problem is that if I have another dataframe "js2" (below) and want to combine these data into the single df (above), I believe it will not work. Any suggestions?
kiwi mango
period
2017-01-01 9.03614 100.00000
2017-02-01 5.42168 35.54216
2017-03-01 7.83132 50.00000
2017-04-01 10.24096 55.42168
2017-05-01 10.84337 60.84337
2017-06-01 12.04819 65.66265
2017-07-01 17.46987 34.93975
2017-08-01 9.03614 30.72289
2017-09-01 9.63855 56.02409
2017-10-01 12.65060 45.18072
You can use pd.to_datetime and pd.Timedelta -
idx = pd.date_range('2017-01-01', pd.to_datetime('today') - pd.Timedelta(days=1))
idx
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06', '2017-01-07', '2017-01-08',
'2017-01-09', '2017-01-10',
...
'2017-11-18', '2017-11-19', '2017-11-20', '2017-11-21',
'2017-11-22', '2017-11-23', '2017-11-24', '2017-11-25',
'2017-11-26', '2017-11-27'],
dtype='datetime64[ns]', length=331, freq='D')
This, you can then use to reindex your dataframe -
df.index = pd.to_datetime(df.index) # ignore if not needed
df = df.reindex(idx, fill_value=0.0)
If your dates are day-first (day first, followed by month), make sure you specify that when converting your index -
df.index = pd.to_datetime(df.index, dayfirst=True)
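For the follow-up about combining "js" and "js2": since both share the same monthly period index, a minimal sketch (assuming you want all four fruit columns on one daily index) is to concatenate along the columns first and reindex once -
combined = pd.concat([js, js2], axis=1)            # align apple/banana with kiwi/mango on 'period'
combined.index = pd.to_datetime(combined.index)
combined = combined.reindex(idx, fill_value=0.0)   # idx is the daily DatetimeIndex from above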
I am trying to group a Pandas Dataframe into buckets of 2 days. For example, if I do the below:
df = pd.DataFrame()
df['action_date'] = ['2017-01-01', '2017-01-01', '2017-01-03', '2017-01-04', '2017-01-04', '2017-01-05', '2017-01-06']
df['action_date'] = pd.to_datetime(df['action_date'], format="%Y-%m-%d")
df['user_name'] = ['abc', 'wdt', 'sdf', 'dfe', 'dsd', 'erw', 'fds']
df['number_of_apples'] = [1,2,3,4,5,6,2]
df = df.groupby(['action_date', 'number_of_apples']).sum()
I get a dataframe grouped by action_date with number_of_apples per day.
However, if I wanted to look at the dataframe in chunks of 2 days, how could I do so? I would then like to analyze the number_of_apples per date_chunk, either by making new dataframes for the dates 2017-01-01 & 2017-01-03, another for 2017-01-04 & 2017-01-05, and then one last one for 2017-01-06, OR just by regrouping and working within.
EDIT: I ultimately would like to make lists of users based on the number of apples they have for each day chunk, so I do not want the sum or the mean of each day chunk's apples. Sorry for the confusion!
Thank you in advance!
You can use resample:
print (df.resample('2D', on='action_date')['number_of_apples'].sum().reset_index())
action_date number_of_apples
0 2017-01-01 3
1 2017-01-03 12
2 2017-01-05 8
EDIT:
print (df.resample('2D', on='action_date')['user_name'].apply(list).reset_index())
action_date user_name
0 2017-01-01 [abc, wdt]
1 2017-01-03 [sdf, dfe, dsd]
2 2017-01-05 [erw, fds]
Try using a TimeGrouper to group by two days.
>>> df.index = df.action_date
>>> dg = df.groupby(pd.TimeGrouper(freq='2D'))['user_name'].apply(list)  # 2-day frequency
>>> dg.head()
action_date
2017-01-01 [abc, wdt]
2017-01-03 [sdf, dfe, dsd]
2017-01-05 [erw, fds]
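Note that pd.TimeGrouper has since been deprecated (and removed in later pandas versions); the equivalent with pd.Grouper, sketched on the same df without setting the index, is:
dg = df.groupby(pd.Grouper(key='action_date', freq='2D'))['user_name'].apply(list)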