Grouping by month-year after using datetime.strptime

I'm fairly new at this. I have a csv that has a string date/time column as shown below. I am trying to average flow values based on month-year.
CSV:
X Flow
6/9/16/ 14:00 15000
Code:
import pandas as pd
from datetime import datetime
#import csv
df = pd.read_csv('monthlyaverage.csv', header=True)
date_object = datetime.strptime('6/9/16 14:00', '%m/%d/%y %H:%M')
df.set_index(pd.DatetimeIndex(df))
df1 = df.groupby(pd.TimeGrouper(freq='%m/%y')).mean()

I think you can use read_csv and set column X as the index with the index_col parameter. Then convert the index to a DatetimeIndex and call to_period('M') on it. Finally, group by the index (level=0) and aggregate with mean:
import pandas as pd
import io
temp=u"""X,Flow
6/9/16/ 14:00,15000
6/9/16/ 14:00,55000
6/9/16/ 14:00,35000"""
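# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), index_col='X')
# pd.to_datetime accepts a format string; this one matches the sample rows exactly as written above
df.index = pd.to_datetime(df.index, format='%m/%d/%y/ %H:%M').to_period('M')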
print (df)
Flow
2016-06 15000
2016-06 55000
2016-06 35000
print (df.groupby(level=0).mean())
Flow
2016-06 35000
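As far as I know, pd.TimeGrouper from the question has been removed from recent pandas releases, and a strftime pattern such as '%m/%y' is not a valid freq in any case. Here is a minimal sketch of the same monthly average that keeps the date as an ordinary column. It assumes the timestamps look like '6/9/16 14:00' (as in the strptime call in the question); the second and third sample rows are invented for illustration.

```python
import io

import pandas as pd

# First row is from the question; the other two rows are made up for the example
csv_text = """X,Flow
6/9/16 14:00,15000
6/20/16 09:00,25000
7/1/16 08:00,40000"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse the string column into real datetimes
df["X"] = pd.to_datetime(df["X"], format="%m/%d/%y %H:%M")

# Group by calendar month (as a monthly Period) and average the flow
monthly_mean = df.groupby(df["X"].dt.to_period("M"))["Flow"].mean()
print(monthly_mean)
```

Grouping on a Period keeps the year and month together, so June 2016 and June 2017 would end up in different groups.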

Related

Python - Remove lines prior to current month and year

I have a dataframe that contains arrival dates for vessels. I'd like Python to detect the current year and month and remove all entries that fall before the current month and year.
I have a column with the date itself in the format '%d/%b/%Y', and separate columns for month and year if needed.
For instance, if today is 01/01/2022, I'd like to remove everything from Dec/2021 and earlier.

Using pandas periods and boolean indexing:
import pandas as pd
# set up example
df = pd.DataFrame({'date': ['01/01/2022', '08/02/2022', '09/03/2022'], 'other_col': list('ABC')})
# keep dates that fall in the current month or later
keep = (pd.to_datetime(df['date'], dayfirst=False)
.dt.to_period('M')
.ge(pd.Timestamp('today').to_period('M'))
)
# filter
out = df[keep]
Output:
date other_col
1 08/02/2022 B
2 09/03/2022 C

from datetime import datetime
import pandas as pd
df = ...
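# assuming your date column is named 'date' and uses the '%d/%b/%Y' format
t = datetime.now()
# compare against the first day of the current month, not today's date
df = df[pd.to_datetime(df.date, format='%d/%b/%Y') >= datetime(t.year, t.month, 1)]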

Let us consider this example dataframe:
import pandas as pd
import datetime
df = pd.DataFrame()
data = [['nao victoria', '21/Feb/2012'], ['argo', '6/Jun/2022'], ['kon tiki', '23/Aug/2022']]
df = pd.DataFrame(data, columns=['Vessel', 'Date'])
You can convert your dates to datetimes, by using pandas' to_datetime method; for instance, you may save the output into a new Series (column):
df['Datetime']=pd.to_datetime(df['Date'], format='%d/%b/%Y')
You end up with the following dataframe:
Vessel Date Datetime
0 nao victoria 21/Feb/2012 2012-02-21
1 argo 6/Jun/2022 2022-06-06
2 kon tiki 23/Aug/2022 2022-08-23
You can then reject rows containing datetime values that are smaller than today's date, defined using datetime's now method:
df = df[df.Datetime > datetime.datetime.now()]
This returns:
Vessel Date Datetime
2 kon tiki 23/Aug/2022 2022-08-23
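Note that comparing against now() drops everything before today, while the question asks to keep the whole current month. Here is a minimal sketch of the month-based cut-off using the same vessel data. The fixed reference date is only a stand-in so the result is reproducible; in real use you would replace it with pd.Timestamp.today().

```python
import pandas as pd

data = [["nao victoria", "21/Feb/2012"], ["argo", "6/Jun/2022"], ["kon tiki", "23/Aug/2022"]]
df = pd.DataFrame(data, columns=["Vessel", "Date"])

# Assumption: a fixed "today" for the example; use pd.Timestamp.today() in practice
reference = pd.Timestamp("2022-06-15")

# First instant of the reference month (2022-06-01 00:00:00 here)
month_start = reference.to_period("M").start_time

# Parse the text dates and keep rows from the current month onwards
parsed = pd.to_datetime(df["Date"], format="%d/%b/%Y")
current_or_later = df[parsed >= month_start]
print(current_or_later)
```

With this reference date, 'argo' (6 June) is kept even though it is earlier than the 15th, because it belongs to the current month.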

join two dataframes after a specific column but showing no index

I am converting my dates into MATLAB datenum values and saving them to a text file. Then I read that file and try to add the values to an existing dataframe after the id column. I can add them at the end or as the first column, but I can't add them after a specific column. The number of rows is equal in both dataframes. It throws the following error:
KeyError: 'datenum'
Here is what I am doing:
import pandas as pd
import datetime
import numpy as np
from datetime import datetime as dt
from datetime import timedelta
import os
df=pd.read_csv(r'C:\example.csv')
def datenum(d):
    return 366 + d.toordinal() + (d - dt.fromordinal(d.toordinal())).total_seconds()/(24*60*60)
d = dt.strptime('2021-01-01 00:15:00','%Y-%m-%d %H:%M:%S')
column = df['date']
new_column = [dt.strptime(i,'%Y-%m-%d %H:%M:%S') for i in column]
end_column = [datenum(i) for i in new_column]
for i in end_column:
    print(i)
df['Datetime'] = pd.to_datetime(df['date'])
df[ df['Datetime'].diff() > pd.Timedelta('15min') ]
np.savetxt('timePminus.txt', end_column,fmt='% 1.5f')
df['Datetime'] = pd.to_datetime(df['date'])
df[ df['Datetime'].diff() > pd.Timedelta('15min') ]
After that, I read this CSV:
import pandas as pd
import datetime
import numpy as np
from datetime import datetime as dt
from datetime import timedelta
import os
df=pd.read_csv(r'example.csv')
df1=pd.read_csv(r'time.csv')
df2= pd.concat([df, df1], axis=1, join='outer')
print(df2)
df2= pd.concat([df, df1], axis=1, join='outer')
print(df2.get('datenum'))
df3=pd.DataFrame().assign(B=df2['B'],A=df2['A'],datenum=df2['datenum'],D=df2['D'])
print(df3)
It throws KeyError: 'datenum'.
Here is the dataframe (CSV):
A,B,C
2021-01-02 00:15:00,"43289,95698800",236985
2021-01-01 00:30:00,"425962,555555",236985
2021-01-01 00:45:00,"2368,56980000",236985
2021-01-01 01:00:00,"2368,56980000",236985
2021-01-15 01:15:00,"2368,56980000",236985
2021-05-01 01:30:00,"2368,56980000",236985
If I do
print(df2.get('datenum'))
the output is None.
Second dataframe:
datenum
738157.01042
738157.02083
738157.03125
738157.04167
738157.05208
738157.06250
Can someone please guide me on what is wrong? I have been trying for hours.
Thanks in advance.

If you just want to rearrange your dataframe columns after the concat, you can do:
column_order = ['id', 'datenum', 'A', 'B', 'C']
df2 = df2[column_order]
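Another option is DataFrame.insert, which places a new column at a given position without rebuilding the column list. The sketch below uses small made-up frames (the 'id', 'A' and 'B' columns and their values are assumptions; only the datenum values come from the question). I can't tell from the post why the KeyError occurs, but one thing worth checking is whether the header read from the file is exactly 'datenum', for example by printing df1.columns.tolist(); stray whitespace in the header would produce this error.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question
df = pd.DataFrame({"id": [1, 2, 3], "A": ["x", "y", "z"], "B": [10, 20, 30]})
df1 = pd.DataFrame({" datenum": [738157.01042, 738157.02083, 738157.03125]})

# Inspect and normalise the header: a leading space would cause KeyError: 'datenum'
print(df1.columns.tolist())
df1.columns = df1.columns.str.strip()

# Position right after the 'id' column
position = df.columns.get_loc("id") + 1

# Insert by position; to_numpy() avoids any index alignment between the frames
df.insert(position, "datenum", df1["datenum"].to_numpy())
print(df)
```

insert modifies df in place, so there is no need to reassign the result.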

Get the first and the last day of a month from the df

This is what my dataframe looks like:
datetime open high low close
2006-01-02 4566.95 4601.35 4542.00 4556.25
2006-01-03 4531.45 4605.45 4531.45 4600.25
2006-01-04 4619.55 4707.60 4616.05 4694.14
.
.
.
I need to calculate the monthly returns in %.
Formula: (Month Closing Price - Month Open Price) / Month Open Price
I can't seem to get the open and closing price of a month, because in my df most months don't have an entry for the 1st of the month, so I'm having trouble calculating it.
Any help would be very much appreciated!

You need to use groupby and the agg function to get the first and last value of each column in each month:
import pandas as pd
df = pd.read_csv("dt.txt")
df["datetime"] = pd.to_datetime(df["datetime"])
df.set_index("datetime", inplace=True)
resultDf = df.groupby([df.index.year, df.index.month]).agg(["first", "last"])
resultDf["new_column"] = (resultDf[("close", "last")] - resultDf[("open", "first")])/resultDf[("open", "first")]
resultDf.index.rename(["year", "month"], inplace=True)
resultDf.reset_index(inplace=True)
resultDf
The code above produces a dataframe with MultiIndex columns. So, if you want to get, for example, the rows for the year 2010, you can do something like:
resultDf[resultDf["year"] == 2010]

You can create a custom grouper as follows:
import pandas as pd
import numpy as np
from io import StringIO
csvfile = StringIO(
"""datetime\topen\thigh\tlow\tclose
2006-01-02\t4566.95\t4601.35\t4542.00\t4556.25
2006-01-03\t4531.45\t4605.45\t4531.45\t4600.25
2006-01-04\t4619.55\t4707.60\t4616.05\t4694.14""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')
df.datetime = pd.to_datetime(df.datetime, format = "%Y-%m-%d")
dg = df.groupby(pd.Grouper(key='datetime', axis=0, freq='M'))
Each group in dg then covers one month, and since the datetime column was converted to pandas datetimes we can do ordinary comparisons and arithmetic on it:
def monthly_return(datetime, close_value, open_value):
    # positions of the earliest and latest dates within the group
    index_start = np.argmin(datetime)
    index_end = np.argmax(datetime)
    # use .iloc because argmin/argmax return positions, not index labels
    return (close_value.iloc[index_end] - open_value.iloc[index_start]) / open_value.iloc[index_start]
dg.apply(lambda x : monthly_return(x.datetime, x.close, x.open))
Output:
datetime
2006-01-31 0.02785
Freq: M, dtype: float64
Of course, a purely functional approach is also possible instead of the monthly_return function.
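For comparison, here is a minimal sketch that combines both ideas: group by a monthly period and use named aggregation to pick the first open and the last close of each month. The January rows are from the question; the February rows and the result column names are invented for illustration.

```python
import pandas as pd

# January rows come from the question; February rows are made up
df = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2006-01-02", "2006-01-03", "2006-01-04", "2006-02-01", "2006-02-02"]
    ),
    "open": [4566.95, 4531.45, 4619.55, 4700.00, 4710.00],
    "close": [4556.25, 4600.25, 4694.14, 4705.00, 4653.00],
})

# Sort so that "first" and "last" mean the earliest and latest trading day
df = df.sort_values("datetime")

# One row per month: open of the first logged day, close of the last logged day
monthly = df.groupby(df["datetime"].dt.to_period("M")).agg(
    month_open=("open", "first"),
    month_close=("close", "last"),
)

# Monthly return in percent
monthly["return_pct"] = (
    (monthly["month_close"] - monthly["month_open"]) / monthly["month_open"] * 100
)
print(monthly)
```

Because "first" and "last" operate on whatever rows exist in the group, it does not matter that a month has no entry for the 1st.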

Is it possible to create a new DataFrame in pandas from a time series, with the daily difference?

Is there a way to create a new DataFrame from a time series with the daily difference?
For example, suppose that on October 5 I had 5321 counts and on October 6 I had 5331 counts. That is a difference of 10, so I want my DataFrame to show 10 for October 6.
Here's my code of the raw dataframe:
import pandas as pd
from datetime import datetime, timedelta
url = 'https://raw.githubusercontent.com/mariorz/covid19-mx-time-series/master/data/covid19_confirmed_mx.csv'
df = pd.read_csv(url, index_col=0)
df = df.loc['Colima','18-03-2020':'06-10-2020']
df = pd.DataFrame(df)
df.index = pd.to_datetime(df.index, format='%d-%m-%Y')
df
This is the raw outcome:
Thank you guys!

There's a built-in diff function for exactly this kind of operation:
df['Diff'] = df.Colima.diff()

Yes, you can use the shift method to access the preceding row's value to calculate the difference.
df['difference'] = df.Colima - df.Colima.shift(1)
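Both answers give the same result. Here is a small self-contained sketch using the two counts from the question (the third value, for October 7, is made up) that shows them side by side:

```python
import pandas as pd

# 5321 and 5331 come from the question; 5340 is an invented extra day
counts = pd.DataFrame(
    {"Colima": [5321, 5331, 5340]},
    index=pd.to_datetime(["05-10-2020", "06-10-2020", "07-10-2020"], format="%d-%m-%Y"),
)

# diff() subtracts the previous row from the current one
counts["Diff"] = counts["Colima"].diff()

# Equivalent manual version using shift()
counts["difference"] = counts["Colima"] - counts["Colima"].shift(1)
print(counts)
```

The first row has no previous day, so its difference is NaN; fillna(0) can be chained if a zero is preferred there.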

Convert '9999-12-31 00:00:00' to 'dd/mm/yyyy' in Pandas

I have a dataframe containing the column 'Date' with value as '9999-12-31 00:00:00'. I need to convert it to 'dd/mm/yyyy'.
import pandas as pd
data = (['9999-12-31 00:00:00'])
df = pd.DataFrame(data, columns=['Date'])

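Default nanosecond timestamps in pandas cannot represent the year 9999 (the maximum is 2262-04-11), so use daily periods instead: remove the time part with split, convert each value to a Period with a custom function, and change the format with strftime: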
df['Date'] = (df['Date'].str.split()
.str[0]
.apply(lambda x: pd.Period(x, freq='D'))
.dt.strftime('%d/%m/%Y'))
print (df)
Date
0 31/12/9999
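If the column only needs to end up as text, the standard library's datetime (which supports years up to 9999) may be enough on its own. This is a minimal sketch of that alternative, assuming every value has the 'YYYY-MM-DD HH:MM:SS' layout shown in the question:

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame(["9999-12-31 00:00:00"], columns=["Date"])

# Parse each string with the standard library, then reformat it as dd/mm/yyyy
df["Date"] = df["Date"].map(
    lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S").strftime("%d/%m/%Y")
)
print(df)
```

The result is a column of plain strings, so it is suitable for display or export but not for further date arithmetic.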
