I have a DataFrame whose columns contain datetime information.
As you can see from the table below, the 2019-xx column sits between the 2018 and 2016 columns, so I need to arrange them properly.
I tried .sort_index(axis=1, inplace=True), but in vain
(I don't know why it has no effect at all).
Dataframe:
2017-12-31 2018-12-31 2019-12-31 2016-12-31 2020-06-30
Unnamed: 0
WaterFlow -26084000.0 -257404000.0 -84066000.0 135075000.0 NaN
trailing1HourWaterFlow NaN NaN -84066000.0 NaN 6823000.0
The problem is that:
I don't know how to arrange the column order when the columns are represented as datetime information.
The table above also looks strange: the "Unnamed: 0" row is empty, and there's a gap between the columns and the rows, unlike ordinary dataframes.
I think you need to convert the columns to datetimes, then do the sorting. If Unnamed: 0 is the index name, you can remove it by using DataFrame.rename_axis:
df.columns = pd.to_datetime(df.columns)
df = df.sort_index(axis=1).rename_axis(None)
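For example, a minimal sketch with made-up values (your labels may already be strings, as here, or something else entirely):
import pandas as pd

# column labels arrive as unordered date strings, as in the question
df = pd.DataFrame([[1.0, 2.0, 3.0, 4.0]],
                  columns=['2017-12-31', '2019-12-31', '2016-12-31', '2020-06-30'])
# convert the labels to Timestamps so they sort chronologically,
# then sort along the column axis
df.columns = pd.to_datetime(df.columns)
df = df.sort_index(axis=1)
print(df.columns.year.tolist())  # [2016, 2017, 2019, 2020]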
Related
I'm trying to update df2 with values from df1. However, a date column doesn't seem to update correctly.
For instance, I have df1 and df2:
df1:
SN Date_screened DOB
7983 2017-11-30 00:00:00 2011-02-05 00:00:00
df2:
SN Date_screened DOB
7983 2011-02-05 00:00:00
When I try to update df2 with df1 this is what I get:
df2.update(df1)
df2:
SN Date_screened DOB
7983 2017-11-30 00:00:00 1296864000000000000
I'm really not sure why Date_screened updates correctly while DOB returns a long string of numbers after the update rather than 2011-02-05 00:00:00.
For more context, the DOB values in df1 consist entirely of dates; the dtype is '<M8[ns]'. However, the DOB values in df2 are a mix of dates, blanks, strings, etc., as the column has not been cleaned yet; its dtype is 'O' (object).
Any ideas as to why this is happening (I suspect it might be because of df2's DOB column datatype, but I'm not too sure, as df2's Date_screened column is also read as 'O') and how I might be able to rectify it would be greatly appreciated. Thanks so much.
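A sketch of the likely cause and a possible fix (hypothetical data mirroring the question): 1296864000000000000 is exactly 2011-02-05 expressed in nanoseconds since the epoch, which suggests update is writing df1's datetime64 values into df2's object column as raw integers. Coercing df2's DOB to datetime first should avoid that:
import pandas as pd

df1 = pd.DataFrame({'SN': [7983],
                    'Date_screened': pd.to_datetime(['2017-11-30']),
                    'DOB': pd.to_datetime(['2011-02-05'])}).set_index('SN')
df2 = pd.DataFrame({'SN': [7983],
                    'Date_screened': [''],
                    'DOB': ['2011-02-05 00:00:00']}).set_index('SN')

# errors='coerce' turns blanks and junk strings into NaT instead of raising,
# so both sides share the datetime64 dtype before the update
df2['DOB'] = pd.to_datetime(df2['DOB'], errors='coerce')
df2.update(df1)
print(df2)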
In my import file, one of the columns has a date; when I view the same column in the dataframe, it has been converted into an integer. How do I convert it back to a date format?
In the data file the column looks like 'Oct-17', but in the dataframe it looks like '43009'. How do I change it in Python from integer to date so my data looks like 'Oct-17'?
I appreciate your help.
Use xlrd once you've read the data into pandas:
import xlrd
import pandas as pd

df = pd.DataFrame({'Date_String': [43009, 43000, 42345, 43134, 43917]})
# Excel stores dates as day serials; xldate_as_datetime converts them
# (the second argument, 0, selects the default 1900 date system)
df['Date'] = df['Date_String'].apply(lambda x: xlrd.xldate.xldate_as_datetime(x, 0))
df['MMMYY'] = df['Date'].apply(lambda x: x.strftime('%b-%y'))
print(df)
Date_String Date MMMYY
0 43009 2017-10-01 Oct-17
1 43000 2017-09-22 Sep-17
2 42345 2015-12-07 Dec-15
3 43134 2018-02-03 Feb-18
4 43917 2020-03-27 Mar-20
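If you'd prefer not to depend on xlrd, the same conversion can be done in pandas alone (a sketch; Excel's day serials count from 1899-12-30 in the default 1900 date system):
# pandas-only equivalent of the xlrd conversion above
df['Date'] = pd.to_datetime(df['Date_String'], unit='D', origin='1899-12-30')
df['MMMYY'] = df['Date'].dt.strftime('%b-%y')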
I'm new to the language and have managed to create the dataframe below. It is MultiIndexed, with size (a, b).
The Date is on the rows, and I'm not fully sure how it is all defined.
I want to add a column with the day of the week (1, 2, 3, 4, 5, 6, 7), based on the date stamps in the index on the left.
Can someone show me how to do this, please? I'm just confused about how to pull out the index/date column to do calculations on it.
Thanks
print(df_3.iloc[:,0])
Date
2019-06-01 8573.84
2019-06-02 8565.47
2019-06-03 8741.75
2019-06-04 8210.99
2019-06-05 7704.34
...
2019-09-09 10443.23
2019-09-10 10336.41
2019-09-11 10123.03
2019-09-12 10176.82
2019-09-13 10415.36
Name: (bitcoin, Open), Length: 105, dtype: float64
I've used just your first two columns and three of your records to get a possible solution. It's pretty much what Celius did, but with a conversion of the column using to_datetime.
import pandas as pd

data = [['2019-06-01', 8573.84], ['2019-06-02', 8565.47], ['2019-06-03', 8741.75]]
df = pd.DataFrame(data, columns=['Date', 'Bitcoin'])
# parse the strings as datetimes, then take the weekday number
df['Date'] = pd.to_datetime(df['Date']).dt.dayofweek
The output prints 5 for 2019-06-01, which is a Saturday, 6 for 2019-06-02 (Sunday), and 0 for 2019-06-03 (Monday).
I hope it helps you.
If you are using pandas and your index is interpreted as a datetime object, I would try the following (I assume Date is your index, given the dataframe you provided as an example):
df = df.reset_index(drop=False)  # move the index into a regular column named `Date`
df['day_of_week'] = df['Date'].dt.dayofweek  # create the new column using pandas `dt.dayofweek`
Edit: Also possible duplicate of Create a day of week column in a pandas dataframe
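One caveat for both answers above: dt.dayofweek is 0-based (Monday=0 through Sunday=6), while the question asks for 1-7. Assuming a Date column that still holds datetimes, either shift the result or use ISO numbering:
# dayofweek is 0-6 with Monday=0; add 1 for a 1-7 scheme
df['day_of_week'] = df['Date'].dt.dayofweek + 1
# or, on pandas >= 1.1, use ISO numbering directly (Monday=1 ... Sunday=7)
df['day_of_week'] = df['Date'].dt.isocalendar().day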
I am new to Pandas. I have the following data (stock prices)
id,date,time,price
0,2015-01-01,9:00,21.72
1,2015-01-01,9:00,17.65
2,2015-01-01,9:00,54.24
0,2015-01-01,11:00,21.82
1,2015-01-01,11:00,18.65
2,2015-01-01,11:00,52.24
0,2015-01-02,9:00,21.02
1,2015-01-02,9:00,19.01
2,2015-01-02,9:00,50.21
0,2015-01-02,11:00,20.61
1,2015-01-02,11:00,18.70
2,2015-01-02,11:00,51.21
...
...
I want to sort by date and calculate returns for each id, both across dates and across times within a date. I tried this:
import pandas as pd
import numpy as np
df = pd.read_csv("/path/to/csv", index_col=[0,2,1])
df['returns'] = df['price'].pct_change()
However, the returns are calculated across the ids in the order they appear. Any idea how to do this correctly? I would also like to be able to access the data as
price_0 = df['id'==0]['date'=='2014-01-01'][time=='9:00']['price']
Assuming that those are the columns in your dataframe (and none of them is the index), you want to group by date, time, and id on price. You then unstack the id, which effectively creates a pivot table with dates and times as the rows and ids as the columns. Finally, pct_change achieves your objective.
returns = df.groupby(['date', 'time', 'id']).price.first().unstack().pct_change()
>>> returns
id 0 1 2
date time
1/1/15 11:00 NaN NaN NaN
9:00 -0.004583 -0.053619 0.038285
1/2/15 11:00 -0.051105 0.059490 -0.055863
9:00 0.019893 0.016578 -0.019527
It will probably be better, however, to combine the dates and times into timestamps. Assuming your dates and times are text representations, the following should work:
df['timestamp'] = df.apply(lambda row: pd.Timestamp(row.date + ' ' + row.time), axis=1)
Then, just group on the timestamp and id, and unstack the id.
returns = df.groupby(['timestamp', 'id']).price.first().unstack('id').pct_change()
>>> returns
id 0 1 2
timestamp
2015-01-01 09:00:00 NaN NaN NaN
2015-01-01 11:00:00 0.004604 0.056657 -0.036873
2015-01-02 09:00:00 -0.036664 0.019303 -0.038859
You would index the returns for a given security as follows:
>>> returns.loc['2015-01-02 9:00'].loc[1]
0.0193029490616623
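If you'd rather keep the long format instead of pivoting, here is a sketch of the same computation with a per-id groupby (assuming the timestamp column built above):
# sort so observations are chronological within each id, then compute
# returns id by id without reshaping
df = df.sort_values(['id', 'timestamp'])
df['returns'] = df.groupby('id')['price'].pct_change()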
My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code, idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013.
However, s may only have 25 or 26 days, because no events happened on some dates. I then get an AssertionError, as the sizes don't match, when I try to plot:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Should I remove the dates with no values from idx, or (what I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for 09-04 and 09-05:
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd
idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
'09-03-2013': 10,
'09-06-2013': 5,
'09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
yields
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
...
A quicker workaround is to use .asfreq(). This doesn't require creating a new index to pass to .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
pd.Timestamp('2012-05-04'),
pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
print(s.asfreq('D'))
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
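.asfreq() also accepts a fill_value directly, so the zeros the OP asked for need no separate fillna step:
# fill the newly created rows with 0 instead of NaN
print(s.asfreq('D', fill_value=0))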
One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
'timestamps': pd.to_datetime(
['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
...
ValueError: cannot reindex from a duplicate axis
(by this, pandas means that the existing index contains duplicates, not that the index itself is a duplicate)
Instead, we can use .loc to look up entries for all dates in range:
df.loc[all_days]
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed. (Note that recent pandas versions raise a KeyError when .loc is passed any label missing from the index, so this trick only works on older versions; on newer ones, use the resample or merge approaches below.)
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
val
date
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
val
date
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
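Putting it together, a small runnable sketch of the data above:
import pandas as pd

# the OP's data plus the duplicate 2013-09-03 row
df = pd.DataFrame({'val': [2, 10, 20, 5, 1]},
                  index=pd.to_datetime(['2013-09-02', '2013-09-03', '2013-09-03',
                                        '2013-09-06', '2013-09-07']))
# duplicate dates are averaged, missing days appear as NaN, then become 0
print(df.resample('D').mean().fillna(0))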
Here's a nice method to fill in missing dates in a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:
from datetime import datetime, timedelta
import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc',
                          fill_value=0, days_back=30):
    # index the frame by its date column
    df.set_index(date_col_name, drop=True, inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    # build a daily index covering the last `days_back` days
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq='D')
    # conform to that index, filling holes with `fill_value`
    df = df.reindex(idx, fill_value=fill_value)
    # honor the requested sort order
    df = df.sort_index(ascending=(date_order == 'asc'))
    # restore the dates as a regular column
    df[date_col_name] = pd.DatetimeIndex(df.index)
    return df
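For example, a hypothetical usage with two observations inside the last 30 days; every other day in that window comes back with value 0:
today = datetime.now().date()
df = pd.DataFrame({'date': [today - timedelta(days=5), today - timedelta(days=1)],
                   'value': [10, 5]})
filled = fill_in_missing_dates(df, date_col_name='date', fill_value=0)
print(filled.tail())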
You can always just use DataFrame.merge() with a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
import pandas as pd

# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
'date':pd.to_datetime([
'2022-02-10'
,'2022-02-11'
,'2022-02-14'
,'2022-02-14'
,'2022-02-24'
,'2022-02-16'
])
,'value':[10,20,5,10,15,30]
})
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
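The rows introduced by the left join carry NaN in value; fill them if you want explicit zeros:
# dates with no original row get NaN from the join; make them explicit zeros
new_df['value'] = new_df['value'].fillna(0)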
If interpolated values make more sense than zeros, you can also chain the pieces: upsample to daily, interpolate the gaps, then downsample (here to quarter ends):
s.asfreq('D').interpolate().asfreq('Q')