New column from slices of a column in pandas - python

I am working with time series data in pandas and my data frame looks a little bit like this:
Date Layer
0 2000-01-01 0.408640
1 2000-01-02 0.842065
2 2000-01-03 1.271810
3 2000-01-04 1.699399
4 2000-01-05 2.128098
... ...
7300 2019-12-27 149.323520
7301 2019-12-28 149.744012
7302 2019-12-29 150.155702
7303 2019-12-30 150.562771
7304 2019-12-31 151.003031
I need to make a column for each year, like this:
2000 2001 2002
0 0.408640 0.415863 0.425689
1 0.852653 0.826542 0.863524
... ... ...
364 156.235978 158.564578 152.135689
365 156.685421 158.924556 152.528978
Is there a way I can do that? The resulting data can go in a new data frame.

The approach for this is to create separate year and day-of-year columns, and then build a pivot table:
import pandas as pd

# Convert the Date column to pandas datetime if you haven't already:
df['Date'] = pd.to_datetime(df['Date'])
# Create a year column
df['Year'] = df['Date'].dt.year
# Create a day-of-year column
df['DayOfYear'] = df['Date'].dt.dayofyear
# Create the pivot table in a new dataframe
df2 = pd.pivot_table(df, index='DayOfYear', columns='Year', values='Layer')
This won't look exactly like your desired output because the index will start at 1 rather than 0, run to 366 in leap years (with NaN in that last row for non-leap years), and have a name. If you want it to match, drop and renumber the index (drop=True discards the DayOfYear index instead of turning it into a column):
df2 = df2.reset_index(drop=True)
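pivot_table also leaves the name "Year" on the columns axis; if you want to strip that too, a small purely cosmetic sketch:
# Remove the "Year" label that pivot_table leaves on the columns
df2 = df2.rename_axis(columns=None)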

Related

Pandas groupby keep rows according to ranking

I have this dataframe:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-01 0.603989 S2B-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
I would like to group by the date column, keeping rows according to a ranking/importance of my choosing from the source column. For example, my ranking is L8-SR>S2B-SR>GP6_r, meaning that for all rows with the same date, keep the row where source==L8-SR; if none contain L8-SR, then keep the row where source==S2B-SR, etc. How can I accomplish that with pandas groupby?
Output should look like this:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-11 0.717264 S2B-SR
4 2020-04-02 0.737118 L8-SR
Let's try category dtype and drop_duplicates:
# Every source present in the data must be listed, or it becomes NaN;
# 'S2A-SR' is slotted in here so it isn't lost (adjust its position to taste)
orders = ['L8-SR', 'S2B-SR', 'S2A-SR', 'GP6_r']
df.source = df.source.astype('category')
# set_categories returns a new Series, so assign the result back
df.source = df.source.cat.set_categories(orders, ordered=True)
df.sort_values(['date', 'source']).drop_duplicates(['date'])
Output:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
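If you'd rather not switch to a category dtype, an equivalent sketch maps each source to a numeric rank (the priority helper column is only for illustration) and keeps the best row per date:
priority = {s: i for i, s in enumerate(orders)}  # lower value = higher importance
out = (df.assign(priority=df['source'].map(priority))
         .sort_values(['date', 'priority'])
         .drop_duplicates('date')
         .drop(columns='priority'))
Sources missing from orders map to NaN and sort last, the same caveat as above.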
Try the code below for the group-by operation. For ordering after this operation you can use sort_values:
# Import the pandas library
import pandas as pd
# Declare a dictionary containing the data from the table above
pandasdata_dict = {'date':['2020-02-14', '2020-02-15', '2020-03-01', '2020-03-01', '2020-03-11', '2020-04-02'],
'value':[0.438767, 0.422867, 0.657453, 0.603989, 0.717264, 0.737118],
'source':['L8-SR', 'S2A-SR', 'L8-SR', 'S2B-SR', 'S2B-SR', 'L8-SR']}
# Convert the dictionary to a data frame
df = pd.DataFrame(pandasdata_dict)
# Display the data frame
df
# Convert the date field to datetime
df["date"] = pd.to_datetime(df["date"])
# Once the conversion is done, group the data frame by the date field
df.groupby([df['date'].dt.date])
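Note that the groupby call on its own just returns a lazy GroupBy object; you need to chain an aggregation to see anything. A sketch with an arbitrary aggregation, purely to show the object in use:
# Count rows and take the max value per date (named-aggregation syntax)
df.groupby(df['date'].dt.date).agg(rows=('value', 'size'), max_value=('value', 'max'))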

Create a new column in a dataframe that shows Day of the Week from an already existing dd/mm/yy column? Python

I have a dataframe that contains a column with dates, e.g. 24/07/15 etc.
Is there a way to create a new column in the dataframe that displays the day of the week corresponding to the already existing 'Date' column?
I want the output to appear as:
[Date][DayOfTheWeek]
This might work:
If you want day name:
In [1405]: df
Out[1405]:
dates
0 24/07/15
1 25/07/15
2 26/07/15
In [1406]: df['dates'] = pd.to_datetime(df['dates']) # The format is inferred here, though passing dayfirst=True is safer for dd/mm/yy.
In [1408]: df['dow'] = df['dates'].dt.day_name()
In [1409]: df
Out[1409]:
dates dow
0 2015-07-24 Friday
1 2015-07-25 Saturday
2 2015-07-26 Sunday
If you want the day of the month:
In [1410]: df['dow'] = df['dates'].dt.day
In [1411]: df
Out[1411]:
dates dow
0 2015-07-24 24
1 2015-07-25 25
2 2015-07-26 26
I would try the apply function, so something like this:
def extractDayOfWeek(dateString):
    # for example, parse the dd/mm/yy string and return the weekday name
    return pd.to_datetime(dateString, format='%d/%m/%y').day_name()
df['DayOfWeek'] = df.apply(lambda x: extractDayOfWeek(x['Date']), axis=1)
The idea is that you map over every row, extract the 'Date' column, and then apply your own function to create a new column named 'DayOfWeek'.
Depending on the type of your Date column:
df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%y")
df['weekday'] = df['Date'].dt.dayofweek  # Monday=0 ... Sunday=6

Converting Integer column to Date Column

In my import file, one of the columns has a date, but if I view that column in the dataframe, it has been converted to an integer. How do I convert it back to the date format?
In the data file the column looks like 'Oct-17', but when I view it in the dataframe it looks like '43009'. How do I change it in Python from integer to date so my data looks like 'Oct-17'?
Appreciate your help.
Those are Excel serial date numbers; use xlrd once you've read the data into pandas:
import pandas as pd
import xlrd

df = pd.DataFrame({'Date_String': [43009, 43000, 42345, 43134, 43917]})
# 0 = Excel's 1900-based date system
df['Date'] = df['Date_String'].apply(lambda x: xlrd.xldate.xldate_as_datetime(x, 0))
df['MMMYY'] = df['Date'].apply(lambda x: x.strftime('%b-%y'))
print(df)
Date_String Date MMMYY
0 43009 2017-10-01 Oct-17
1 43000 2017-09-22 Sep-17
2 42345 2015-12-07 Dec-15
3 43134 2018-02-03 Feb-18
4 43917 2020-03-27 Mar-20
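If you'd rather avoid the xlrd dependency, pandas can convert Excel serial numbers directly; a sketch assuming the standard Excel 1900 date system (which counts days from 1899-12-30):
df['Date'] = pd.to_datetime(df['Date_String'], unit='D', origin='1899-12-30')
df['MMMYY'] = df['Date'].dt.strftime('%b-%y')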

Pandas: How to group the non-continuous date column?

I have a column in a dataframe which contains non-continuous dates. I need to group these dates by a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried the following, but as the dates are not continuous I am not getting the desired result.
df.groupby(pd.Grouper(key='l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas, or should I write separate logic?
Once you have the dataframe sorted by l_date, you can create a continuous dummy date column (dum_date) and group by a 2-day frequency on it:
df = df.sort_values(by='l_date')
# pd.datetime has been removed from pandas; use pd.Timestamp instead
df['dum_date'] = pd.date_range(pd.Timestamp.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key='dum_date', freq='2D'))
OR
If you are fine with groupings other than dates, then a generalized way to group n consecutive rows could be:
n = 2  # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [i // n + 1 for i in range(df.shape[0])]
df.groupby('grouping')
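Either way, the result is a lazy GroupBy; a quick sketch to inspect the buckets (assuming l_date is already a datetime column):
for key, grp in df.groupby('grouping'):
    print(key, grp['l_date'].dt.date.tolist())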

Indexing by multiple fields with pandas in python

I am new to Pandas. I have the following data (stock prices)
id,date,time,price
0,2015-01-01,9:00,21.72
1,2015-01-01,9:00,17.65
2,2015-01-01,9:00,54.24
0,2015-01-01,11:00,21.82
1,2015-01-01,11:00,18.65
2,2015-01-01,11:00,52.24
0,2015-01-02,9:00,21.02
1,2015-01-02,9:00,19.01
2,2015-01-02,9:00,50.21
0,2015-01-02,11:00,20.61
1,2015-01-02,11:00,18.70
2,2015-01-02,11:00,51.21
...
...
I want to sort by date and calculate returns for each id, both across dates and across times within a date. I tried this:
import pandas as pd
import numpy as np
df = pd.read_csv("/path/to/csv", index_col=[0,2,1])
df['returns'] = df['price'].pct_change()
However, the returns are calculated across the ids in the order they appear. Any idea how to do this correctly? I would also like to access the data as
price_0 = df['id'==0]['date'=='2014-01-01'][time=='9:00']['price']
Assuming that those are the columns in your dataframe (and none are the index), then you want to group by date, time, and id on price. You then unstack the id, which effectively creates a pivot table with dates and times as the rows and ids as the columns. You then need to use pct_change to achieve your objective.
returns = df.groupby(['date', 'time', 'id']).price.first().unstack().pct_change()
>>> returns
id 0 1 2
date time
1/1/15 11:00 NaN NaN NaN
9:00 -0.004583 -0.053619 0.038285
1/2/15 11:00 -0.051105 0.059490 -0.055863
9:00 0.019893 0.016578 -0.019527
It will probably be better, however, to combine the dates and times into timestamps. Assuming your dates and times are text representations, the following should work:
df['timestamp'] = df.apply(lambda row: pd.Timestamp(row.date + ' ' + row.time), axis=1)
Then, just group on the timestamp and id, and unstack the id.
returns = df.groupby(['timestamp', 'id']).price.first().unstack('id').pct_change()
>>> returns
id 0 1 2
timestamp
2015-01-01 09:00:00 NaN NaN NaN
2015-01-01 11:00:00 0.004604 0.056657 -0.036873
2015-01-02 09:00:00 -0.036664 0.019303 -0.038859
You would index the returns for a given security as follows (using .loc, since .ix has been removed from pandas):
>>> returns.loc['2015-01-02 9:00'].loc[1]
0.0193029490616623
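For reference, a minimal end-to-end sketch of the timestamp approach, runnable on the first six rows of the sample above:
import io
import pandas as pd

raw = """id,date,time,price
0,2015-01-01,9:00,21.72
1,2015-01-01,9:00,17.65
2,2015-01-01,9:00,54.24
0,2015-01-01,11:00,21.82
1,2015-01-01,11:00,18.65
2,2015-01-01,11:00,52.24
"""
df = pd.read_csv(io.StringIO(raw))
# Combine the text date and time columns into one timestamp
df['timestamp'] = pd.to_datetime(df['date'] + ' ' + df['time'])
returns = df.groupby(['timestamp', 'id']).price.first().unstack('id').pct_change()
print(returns)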
