Pandas - Select month and year - python

Trying to subset a dataframe, ultimately want to export a certain month and year (Say November 2020) to a CSV. But I'm stuck at the selection part, the date column is in DD/MM/YYYY format. My attempt -
csv = r"C:\Documents\Transactions.csv"
current_month = 11
current_year = 2020
data =pd.read_csv(csv, sep=',', index_col = None)
df = data[pd.to_datetime(data['Date'],dayfirst=True).dt.month == current_month &(pd.to_datetime(data['Date']).dt.year==current_year)]
print(df)
Result is the rows with the correct year, but includes all months whereas I want it restricted to the current_month variable. Any help appreciated.

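Given that you have a Date column, I would suggest converting it once up front, since you currently convert it twice (and the second conversion omits dayfirst=True). The filter itself fails because & binds more tightly than ==, so the month comparison needs its own parentheses.
With the column converted, compare month and year on the Series:
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)
# each comparison is wrapped in parentheses before combining with &
df = data[(data['Date'].apply(lambda x: x.month) == current_month) &
          (data['Date'].apply(lambda y: y.year) == current_year)]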

Convert the Date column to datetime first, then do the selection as usual.
import pandas as pd
df = pd.read_csv('data-date.txt')
current_month = 11
current_year = 2020
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df[(df['Date'].dt.month == current_month) & (df['Date'].dt.year == current_year)]
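
The same selection can also be written as a single comparison against a monthly period, which avoids combining two masks with &. This is only a minimal sketch: the sample rows and the Amount column are made up to stand in for the real Transactions.csv, and the explicit format string assumes every date really is DD/MM/YYYY.

```python
import pandas as pd

# Hypothetical sample data standing in for Transactions.csv (DD/MM/YYYY dates)
data = pd.DataFrame({
    "Date": ["05/11/2020", "11/05/2020", "30/11/2020", "15/11/2019"],
    "Amount": [10.0, 20.0, 30.0, 40.0],
})

current_month = 11
current_year = 2020

# Parse once, with an explicit format so day and month cannot be swapped
data["Date"] = pd.to_datetime(data["Date"], format="%d/%m/%Y")

# Compare year and month in one step via a monthly period
target = pd.Period(year=current_year, month=current_month, freq="M")
selected = data[data["Date"].dt.to_period("M") == target]

# to_csv without a path returns the CSV text; pass a file name to write to disk
csv_text = selected.to_csv(index=False)
```

Here selected keeps only the two November 2020 rows; the May 2020 and November 2019 rows are dropped.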

Related

Python - Remove lines prior to current month and year

I have a dataframe that contains arrival dates for vessels. I'd like Python to determine the current year and month and remove all entries from before the current month.
I have a column with the date itself in the format '%d/%b/%Y', plus separate columns for month and year if needed.
For instance, if today is 01/01/2022, I'd like to remove everything from Dec/2021 and earlier.
Using pandas periods and boolean indexing:
# set up example
df = pd.DataFrame({'date': ['01/01/2022', '08/02/2022', '09/03/2022'], 'other_col': list('ABC')})
# find dates equal or greater to this month
keep = (pd.to_datetime(df['date'], dayfirst=False)
.dt.to_period('M')
.ge(pd.Timestamp('today').to_period('M'))
)
# filter
out = df[keep]
Output:
date other_col
1 08/02/2022 B
2 09/03/2022 C
from datetime import datetime
import pandas as pd
df = ...
# assuming your date column is named 'date'
t = datetime.utcnow()
# cut off at the first day of the current month, not at today's date
df = df[pd.to_datetime(df.date) >= datetime(t.year, t.month, 1)]
Let us consider this example dataframe:
import pandas as pd
import datetime
df = pd.DataFrame()
data = [['nao victoria', '21/Feb/2012'], ['argo', '6/Jun/2022'], ['kon tiki', '23/Aug/2022']]
df = pd.DataFrame(data, columns=['Vessel', 'Date'])
You can convert your dates to datetimes, by using pandas' to_datetime method; for instance, you may save the output into a new Series (column):
df['Datetime']=pd.to_datetime(df['Date'], format='%d/%b/%Y')
You end up with the following dataframe:
Vessel Date Datetime
0 nao victoria 21/Feb/2012 2012-02-21
1 argo 6/Jun/2022 2022-06-06
2 kon tiki 23/Aug/2022 2022-08-23
You can then reject rows containing datetime values that are smaller than today's date, defined using datetime's now method:
df = df[df.Datetime > datetime.datetime.now()]
This returns:
Vessel Date Datetime
2 kon tiki 23/Aug/2022 2022-08-23
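
Note that comparing against now() drops everything before the current moment, including earlier days of the current month. If the goal is to keep the whole current month, one option is to compare against the first day of that month instead. The following is a rough sketch; the helper name keep_from_current_month and the fixed reference date are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Vessel": ["nao victoria", "argo", "kon tiki"],
    "Date": ["21/Feb/2012", "6/Jun/2022", "23/Aug/2022"],
})
df["Datetime"] = pd.to_datetime(df["Date"], format="%d/%b/%Y")


def keep_from_current_month(frame, today):
    """Keep rows dated on or after the first day of today's month."""
    month_start = pd.Timestamp(today).normalize().replace(day=1)
    return frame[frame["Datetime"] >= month_start]


# A fixed reference date keeps the result reproducible;
# pass pd.Timestamp.now() in real use.
out = keep_from_current_month(df, "2022-06-20")
```

With the reference date in June 2022, both argo (6 June) and kon tiki are kept, while the 2012 entry is removed.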

Get the first and the last day of a month from the df

This is what my dataframe looks like:
datetime open high low close
2006-01-02 4566.95 4601.35 4542.00 4556.25
2006-01-03 4531.45 4605.45 4531.45 4600.25
2006-01-04 4619.55 4707.60 4616.05 4694.14
.
.
.
Need to calculate the Monthly Returns in %
Formula: (Month Closing Price - Month Open Price) / Month Open Price
I can't seem to get the open and closing price of a month, because most months in my df don't have an entry for the 1st of the month, so I'm having trouble calculating it.
Any help would be very much appreciated!
You need to use groupby and agg function in order to get the first and last value of each column in each month:
import pandas as pd
df = pd.read_csv("dt.txt")
df["datetime"] = pd.to_datetime(df["datetime"])
df.set_index("datetime", inplace=True)
resultDf = df.groupby([df.index.year, df.index.month]).agg(["first", "last"])
resultDf["new_column"] = (resultDf[("close", "last")] - resultDf[("open", "first")])/resultDf[("open", "first")]
resultDf.index.rename(["year", "month"], inplace=True)
resultDf.reset_index(inplace=True)
resultDf
The code above produces a dataframe with MultiIndex columns. So, to get, for example, the rows for the year 2010, you can do something like:
resultDf[resultDf["year"] == 2010]
You can create a custom grouper as follows:
import pandas as pd
import numpy as np
from io import StringIO
csvfile = StringIO(
"""datetime\topen\thigh\tlow\tclose
2006-01-02\t4566.95\t4601.35\t4542.00\t4556.25
2006-01-03\t4531.45\t4605.45\t4531.45\t4600.25
2006-01-04\t4619.55\t4707.60\t4616.05\t4694.14""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df.datetime = pd.to_datetime(df.datetime, format = "%Y-%m-%d")
dg = df.groupby(pd.Grouper(key='datetime', axis=0, freq='M'))
Each group of dg then holds one month, and since the datetime column was converted to a pandas datetime type, ordinary comparisons work on it:
def monthly_return(datetime, close_value, open_value):
    # positions of the earliest and latest dates within the group
    index_start = np.argmin(datetime)
    index_end = np.argmax(datetime)
    # .iloc indexes by position, so this also works for groups whose labels do not start at 0
    return (close_value.iloc[index_end] - open_value.iloc[index_start]) / open_value.iloc[index_start]
dg.apply(lambda x : monthly_return(x.datetime, x.close, x.open))
Out[97]:
datetime
2006-01-31 0.02785
Freq: M, dtype: float64
Of course a pure functional approach is possible instead of using monthly_return function
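
For comparison, here is a compact sketch of the same calculation using named aggregation on a monthly period. The January rows come from the question; the two February rows and the output column names (month_open, month_close, return_pct) are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2006-01-02", "2006-01-03", "2006-01-04", "2006-02-01", "2006-02-02"]
    ),
    # the last two values in each column are hypothetical February data
    "open": [4566.95, 4531.45, 4619.55, 4700.00, 4750.00],
    "close": [4556.25, 4600.25, 4694.14, 4720.00, 4794.00],
})

# Sort so that "first" and "last" mean earliest and latest trading day
df = df.sort_values("datetime")

monthly = df.groupby(df["datetime"].dt.to_period("M")).agg(
    month_open=("open", "first"),
    month_close=("close", "last"),
)

# (Month Closing Price - Month Open Price) / Month Open Price, in percent
monthly["return_pct"] = (
    (monthly["month_close"] - monthly["month_open"]) / monthly["month_open"] * 100
)
```

Because the grouping takes the first and last rows that actually exist in each month, it does not matter whether the 1st of the month has an entry.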

Not all dates are captured when filtering by dates. Python Pandas

I am filtering a dataframe by dates to produce two separate versions:
Data from only today's date
Data from the last two years
However, when I try to filter on the date, it seems to miss dates that are within the last two years.
date_format = '%m-%d-%Y' # desired date format
today = dt.now().strftime(date_format) # today's date. Will always result in today's date
today = dt.strptime(today, date_format).date() # converting 'today' into a datetime object
today = today.strftime(date_format)
two_years = today - relativedelta(years=2) # date is today's date minus two years.
two_years = two_years.strftime(date_format)
# normalizing the format of the date column to the desired format
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce').dt.strftime(date_format)
df_today = df_data[df_data['date'] == today]
df_two_year = df_data[df_data['date'] >= two_years]
Which results in:
all dates ['07-17-2020' '07-15-2020' '08-01-2019' '03-25-2015']
today df ['07-17-2020']
two year df ['07-17-2020' '08-01-2019']
The 07-15-2020 date is missing from the two year, even though 08-01-2019 is captured.
You don't need to convert anything to string; simply work with the datetime dtype. Example:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['07-17-2020','07-15-2020','08-01-2019','03-25-2015'])})
today = pd.Timestamp('now')
print(df[df['date'].dt.date == today.date()])
# date
# 0 2020-07-17
print(df[(df['date'].dt.year >= today.year-1) & (df['date'].dt.date != today.date())])
# date
# 1 2020-07-15
# 2 2019-08-01
What you get from the comparison operations (adjust them as needed...) are boolean masks - you can use them nicely to filter the df.
Your datatype conversions are the problem here. You could do this:
today = dt.now() # today's date. Will always result in today's date
two_years = today - relativedelta(years=2) # date is today's date minus two years.
This prints '2018-07-17 18:40:42.704395'. You can then convert it to the date only format.
two_years = two_years.strftime(date_format)
two_years = dt.strptime(two_years, date_format).date()
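
To see why the string version misses 07-15-2020, it helps to compare the two kinds of mask side by side. This is a small sketch using the four dates from the question; the fixed reference date and the use of pd.DateOffset for the two-year cutoff are choices made here for illustration, not part of the original code.

```python
import pandas as pd

dates = ["07-17-2020", "07-15-2020", "08-01-2019", "03-25-2015"]

# String comparison is lexicographic: "07-15-..." sorts before "07-17-...",
# regardless of the year at the end of the string.
as_text = pd.Series(dates)
text_mask = as_text >= "07-17-2018"

# Datetime comparison orders by the actual calendar date.
as_dates = pd.to_datetime(as_text, format="%m-%d-%Y")
today = pd.Timestamp("2020-07-17")  # fixed reference date for reproducibility
cutoff = today - pd.DateOffset(years=2)
date_mask = as_dates >= cutoff
```

The text mask keeps only 07-17-2020 and 08-01-2019, matching the result in the question, while the datetime mask also keeps 07-15-2020 and still excludes the 2015 date.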

Pandas equivalent to sql for month date time

I have a pandas dataframe that I need to filter just like a SQL query for a specific month. Every time I run the code I want it to grab data from the previous month, no matter what the specific day of the current month is.
My SQL code is here but I need pandas equivalent.
WHERE DATEPART(m, logged) = DATEPART(m, DATEADD(m, -1, getdate()))
df = pd.DataFrame({'month': ['1-05-01 00:00:00','1-06-01 00:00:00','1-06-01 00:00:00','1-05-01 00:00:00']})
df['month'] = pd.to_datetime(df['month'])
In this example, I only want the data from June.
Would definitely appreciate the help! Thanks.
Modifying based on the question edit:
df = pd.DataFrame({'month': ['1-05-01 00:00:00','1-06-01 00:00:00','1-06-01 00:00:00','1-05-01 00:00:00']})
df['month'] = pd.to_datetime(df['month'])
## To get it to the right format
import datetime as dt
df['month'] = df['month'].apply(lambda x: dt.datetime.strftime(x, '%Y-%d-%m'))
df['month'] = pd.to_datetime(df['month'])
## Extract the month from this date
df['month_ex'] = df.month.dt.month
## Get current month to get the latest month from the dataframe, which is the previous month of the current month
from datetime import datetime
currentMonth = datetime.now().month
newDf = df[df.month_ex == currentMonth - 1]
Output:
month month_ex
1 2001-06-01 6
2 2001-06-01 6
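
One thing to watch with currentMonth - 1 is January, where it evaluates to 0 and matches nothing. A possible way around this, sketched below, is to step back one monthly period instead of subtracting from the month number. The helper name rows_from_previous_month and the sample dates are assumptions; also note that, unlike the DATEPART comparison in the question, this version matches the year as well as the month.

```python
import pandas as pd


def rows_from_previous_month(frame, column, today):
    """Return rows whose dates fall in the calendar month before `today`."""
    previous = pd.Timestamp(today).to_period("M") - 1
    return frame[frame[column].dt.to_period("M") == previous]


# Hypothetical data using the 'logged' column name from the SQL snippet
df = pd.DataFrame({
    "logged": pd.to_datetime(
        ["2001-05-01", "2001-06-01", "2001-06-01", "2000-12-15"]
    ),
})

# Fixed reference dates for reproducibility; use pd.Timestamp.now() in practice
june = rows_from_previous_month(df, "logged", "2001-07-10")
december = rows_from_previous_month(df, "logged", "2001-01-03")
```

Run from July 2001 this returns the two June rows, and run from January 2001 it correctly rolls back to December 2000.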

Get the monthly observation data from daily dataframe in pandas

I want to get monthly observations from daily data in pandas. That is, I want the data for the 5th day of every month (2011-01-05; 2011-02-05; 2011-03-05...2011-12-05) or for the closest following trading day (e.g. if 03-05 does not exist, it should look for 2011-03-06). How can I do that?
The dataframe looks something like:
Date Close
2011-01-01 100.99
2011-01-02 100.65
......
2011-12-31 76.08
The answer below will solve your problem, with the caveat that each month needs at least one day of data on or after the requested day; a month without one contributes no row.
df['Date'] = pd.to_datetime(df['Date'])
df['day'] = df.Date.dt.day
df['month'] = df.Date.dt.month
df['year'] = df.Date.dt.year
def get_nearest_time_data(df, day):
    newdf = pd.DataFrame()
    for month in range(1, 13):
        # restart the search from the requested day for every month
        d = day
        daydf = df[(df.day == d) & (df.month == month)]
        # stop at day 31 so a month with no later data cannot loop forever
        while daydf.shape[0] == 0 and d < 31:
            d += 1
            daydf = df[(df.day == d) & (df.month == month)]
        newdf = pd.concat([newdf, daydf], ignore_index=True)
    return newdf
get_nearest_time_data(df, 5)
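
A loop-free variant of the same idea, offered as a minimal sketch: keep only rows on or after the requested day, then take the earliest remaining row in each month. The function name first_on_or_after and the sample prices are made up for illustration, and a month with no data on or after that day is simply absent from the result.

```python
import pandas as pd


def first_on_or_after(frame, day, date_col="Date"):
    """For each year-month, return the first row dated on or after `day`."""
    candidates = frame[frame[date_col].dt.day >= day].sort_values(date_col)
    months = candidates[date_col].dt.to_period("M")
    return candidates.groupby(months).head(1).reset_index(drop=True)


# Hypothetical daily data: the 5th is missing in February
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2011-01-04", "2011-01-05", "2011-01-06",
        "2011-02-04", "2011-02-07", "2011-02-08",
        "2011-03-05",
    ]),
    "Close": [100.9, 100.6, 100.2, 98.4, 98.9, 99.1, 97.3],
})

result = first_on_or_after(df, 5)
```

Grouping by a monthly period rather than by month number also keeps the same month in different years apart.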
