I am new to scripting and need some help writing this code correctly. I have a CSV file with a date column, and based on the date I need to create a new column named period, which is a combination of year and month.
If the day of the month is between 1 and 25, the month is the current month of the date.
If the day is greater than 25, the month is the next month.
Sample file:
Date
10/21/2021
10/26/2021
01/26/2021
Expected results:
Date        Period (year+month)
10/21/2021  202110
10/26/2021  202111
01/26/2021  202102
Two ways I can think of.
Convert the incoming string into a date object and get the values you need from there. See Converting string into datetime
Use split("/") to split the date string into a list of three values and use those to do your calculations.
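The second approach could look roughly like the sketch below. It assumes the dates are always written as MM/DD/YYYY, as in the sample file; the function name period_from_string is made up for illustration.

```python
def period_from_string(date_string):
    """Return 'YYYYMM' for an 'MM/DD/YYYY' string, rolling to the next month after the 25th."""
    # split "10/26/2021" into its three numeric parts
    month, day, year = (int(part) for part in date_string.split("/"))
    if day > 25:
        if month == 12:
            # December rolls over into January of the following year
            year, month = year + 1, 1
        else:
            month += 1
    # zero-pad so that January becomes "01" rather than "1"
    return f"{year:04d}{month:02d}"


periods = [period_from_string(d) for d in ["10/21/2021", "10/26/2021", "01/26/2021"]]
```

The December branch is the easy one to forget: a date such as 12/26/2021 has to land in 202201.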
Good question.
I've included the code that I wrote to do this, below. The process we will follow is:
Load the data from a csv
Define a function that will calculate the period for each date
Apply the function to our data and store the result as a new column
import pandas as pd
# Step 1
# read in the data from a csv, parsing dates and store the data in a DataFrame
data = pd.read_csv("filepath.csv", parse_dates=["Date"])
# Create day, month and year columns in our DataFrame
data['day'] = data['Date'].dt.day
data['month'] = data['Date'].dt.month
data['year'] = data['Date'].dt.year
# Step 2
# Define a function that will get our periods from a given date
def get_period(date):
    day = date.day
    month = date.month
    year = date.year
    if day > 25:
        if month == 12:  # if December, increment year and change month to January
            year += 1
            month = 1
        else:
            month += 1
    # convert our year and month into zero-padded strings that we can concatenate easily
    year_string = str(year).zfill(4)
    month_string = str(month).zfill(2)
    period = year_string + month_string  # concat the strings together
    return period
# Step 3
# Apply our custom function (get_period) to each value in the Date column
data['period'] = data['Date'].apply(get_period)
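As a possible alternative to calling a Python function per value, the same rule can be sketched with vectorised pandas operations. This assumes a datetime Date column like the one parsed above; the sample frame and the names rolled and shifted are made up for illustration.

```python
import pandas as pd

# hypothetical sample data in the same MM/DD/YYYY layout as the question
data = pd.DataFrame({
    "Date": pd.to_datetime(
        ["10/21/2021", "10/26/2021", "01/26/2021", "12/26/2021"],
        format="%m/%d/%Y",
    )
})

dates = data["Date"]
# MonthBegin(1) moves a date forward to the first day of the following month
rolled = dates + pd.offsets.MonthBegin(1)
# keep the original date up to the 25th, otherwise use the rolled-forward date
shifted = dates.where(dates.dt.day <= 25, rolled)
data["period"] = shifted.dt.strftime("%Y%m")
```

Because the offset works on real dates, the December to January rollover needs no special case here.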
I have a data frame with a column Campaign which consists of the campaign name (start date - end date) format. I need to create 3 new columns by extracting the start and end dates.
start_date, end_date, days_between_start_and_end_date.
The issue is Campaign column value is not in a fixed format, for the below values my code block works well.
1. Season1 hero (18.02. -24.03.2021)
What I am doing in my code snippet is extracting the start date & end date from the campaign column and as you see, start date doesn't have a year. I am adding the year by checking the month value.
import pandas as pd
import re
import datetime
# read csv file
df = pd.read_csv("report.csv")
# extract start and end dates from the 'Campaign' column
dates = df['Campaign'].str.extract(r'(\d+\.\d+)\.\s*-\s*(\d+\.\d+\.\d+)')
df['start_date'] = dates[0]
df['end_date'] = dates[1]
# convert start and end dates to datetime format
df['start_date'] = pd.to_datetime(df['start_date'], format='%d.%m')
df['end_date'] = pd.to_datetime(df['end_date'], format='%d.%m.%Y')
# Add year to start date
for index, row in df.iterrows():
    if pd.isna(row["start_date"]) or pd.isna(row["end_date"]):
        continue
    start_month = row["start_date"].month
    end_month = row["end_date"].month
    year = row["end_date"].year
    if start_month > end_month:
        year = year - 1
    dates_str = str(row["start_date"].strftime("%d.%m")) + "." + str(year)
    df.at[index, "start_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
    dates_str = str(row["end_date"].strftime("%d.%m")) + "." + str(row["end_date"].year)
    df.at[index, "end_date"] = pd.to_datetime(dates_str, format="%d.%m.%Y")
However, there are many other column values where my regex fails and I get NaN values, for example:
1. Sales is on (30.12.21-12.01.2022)
2. Sn 2 Fol CAMPAIGN A (24.03-30.03.2023)
3. M SALE (19.04 - 04.05.2022)
4. NEW SALE (29.12.2022-11.01.2023)
5. Year End (18.12. - 12.01.2023)
6. XMAS 1 - THE TRIBE CELEBRATES XMAS (18.11.-08.12.2021) (gifting communities)
In all of the examples above, the date format is different.
expected output
start date  end date
2021-12-30  2022-01-12
2023-03-24  2023-03-30
2022-04-19  2022-05-04
2022-12-29  2023-01-11
2022-12-18  2023-01-12
2021-11-18  2021-12-08
Can someone please help me here?
Since the datetimes in the data don't have a fixed format (some are dd.mm.yy, some are dd.mm.YYYY), it might be better if we apply a custom parser function that uses try-except. We can certainly do two conversions using pd.to_datetime and choose values using np.where etc. but it might not save any time given we need to do a lot of string manipulations beforehand.
To append the missing years for some rows, since pandas string methods are not optimized and as we'll need a few of them, (str.count(), str.cat() etc.) it's probably better to use Python string methods in a loop implementation instead.
Also, iterrows() is incredibly slow, so it's much faster if you use a python loop instead.
pd.to_datetime converts each element into datetime.datetime objects anyways, so we can use datetime.strptime from the built-in module to perform the conversions.
from datetime import datetime
def datetime_parser(date, end_date=None):
    # remove space around dates
    date = date.strip()
    # if the start date doesn't have year, append it from the end date
    dmy = date.split('.')
    if end_date and len(dmy) == 2:
        date = f"{date}.{end_date.rsplit('.', 1)[1]}"
    elif end_date and not dmy[-1]:
        edmy = end_date.split('.')
        if int(dmy[1]) > int(edmy[1]):
            date = f"{date}{int(edmy[-1])-1}"
        else:
            date = f"{date}{edmy[-1]}"
    try:
        # try 'dd.mm.YYYY' format (e.g. 29.12.2022) first
        return datetime.strptime(date, '%d.%m.%Y')
    except ValueError:
        # try 'dd.mm.yy' format (e.g. 30.12.21) if the above doesn't work out
        return datetime.strptime(date, '%d.%m.%y')
# extract dates into 2 columns (tentatively start and end dates)
splits = df['Campaign'].str.extract(r"\((.*?)-(.*?)\)").values.tolist()
# parse the dates
df[['start_date', 'end_date']] = [[datetime_parser(start, end), datetime_parser(end)] for start, end in splits]
# find difference
df['days_between_start_and_end_date'] = df['end_date'] - df['start_date']
I would do a basic regex with extract and then perform slicing:
ser = df["Campaign"].str.extract(r"\((.*)\)", expand=False)
# the end date is always the last 10 characters (dd.mm.YYYY)
end_date = ser.str.strip().str[-10:]
# or ser.str.strip().str.rsplit("-").str[-1]
# the start date is everything before the dash
start_date = ser.str.strip().str.split(r"\s*-\s*").str[0]
NB: You can assign the Series start_date and end_date to create your two new columns.
Output:
start_date, end_date
(1.0 30.12.21 # <- start_date
2.0 24.03
3.0 19.04
4.0 29.12.2022
Name: Campaign, dtype: object,
1.0 12.01.2022 # <- end_date
2.0 30.03.2023
3.0 04.05.2022
4.0 11.01.2023
Name: Campaign, dtype: object)
I have a 40 year time series in the format stn;yyyymmddhh;rainfall , where yyyy= year, mm = month, dd= day,hh= hour. The series is at an hourly resolution. I extracted the maximum values for each year by the following groupby method:
import pandas as pd
df = pd.read_csv('data.txt', delimiter = ";")
df['yyyy'] = df['yyyymmddhh'].astype(str).str[:4]
df.groupby(['yyyy'])['rainfall'].max().reset_index()
Now I am trying to extract the maximum values for a 3-hour duration in each year. I tried the sliding-maxima approach below, but it is not working. k is the duration I am interested in. In simple words, I need the maximum precipitation sum for multiple durations in every year (e.g. 3h, 6h, etc.).
class AMS:
    def sliding_max(self, k, data):
        tp = data.values
        period = 24*365
        agg_values = []
        start_j = 1
        end_j = k*int(np.floor(period/k))
        for j in range(start_j, end_j + 1):
            start_i = j - 1
            end_i = j + k + 1
            agg_values.append(np.nansum(tp[start_i:end_i]))
        self.sliding_max = max(agg_values)
        return self.sliding_max
Are there any suggestions for improving my code, or is there a way I can implement it with groupby? I am a bit new to the Python environment, so please excuse me if the question isn't put properly.
Stn;yyyymmddhh;rainfall
xyz;1981010100;0.0
xyz;1981010101;0.0
xyz;1981010102;0.0
xyz;1981010103;0.0
xyz;1981010104;0.0
xyz;1981010105;0.0
xyz;1981010106;0.0
xyz;1981010107;0.0
xyz;1981010108;0.0
xyz;1981010109;0.4
xyz;1981010110;0.6
xyz;1981010111;0.1
xyz;1981010112;0.1
xyz;1981010113;0.0
xyz;1981010114;0.1
xyz;1981010115;0.6
xyz;1981010116;0.0
xyz;1981010117;0.0
xyz;1981010118;0.2
xyz;1981010119;0.0
xyz;1981010120;0.0
xyz;1981010121;0.0
xyz;1981010122;0.0
xyz;1981010123;0.0
xyz;1981010200;0.0
You first have to convert your column containing the datetimes to a Series of type datetime. You can do that parsing by providing the format of your datetimes.
df["yyyymmddhh"] = pd.to_datetime(df["yyyymmddhh"], format="%Y%m%d%H")
After having the correct data type you have to set that column as your index and can now use pandas functionality for time series data (resampling in your case).
First you resample the data to 3 hour windows and sum the values. From that you resample to yearly data and take the maximum value of all the 3 hour windows for each year.
df.set_index("yyyymmddhh").resample("3H").sum().resample("Y").max()
# Output
yyyymmddhh rainfall
1981-12-31 1.1
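Note that resampling sums fixed, non-overlapping 3-hour blocks. If overlapping (sliding) windows are wanted, as the question's sliding_max suggests, a rolling sum grouped by year is one possible sketch. It assumes a gap-free hourly series, so that k rows equal k hours; the function name annual_max_sliding_sum is made up for illustration, and the sample values are the first day of the question's data.

```python
import pandas as pd

# the 24 hourly readings for 1981-01-01 from the question's sample
rain = [0.0] * 9 + [0.4, 0.6, 0.1, 0.1, 0.0, 0.1, 0.6, 0.0, 0.0, 0.2] + [0.0] * 5
index = pd.date_range("1981-01-01 00:00", periods=len(rain), freq=pd.Timedelta(hours=1))
series = pd.Series(rain, index=index, name="rainfall")


def annual_max_sliding_sum(hourly, k):
    """Largest sum over any k consecutive hourly values, per calendar year."""
    # each rolling value is the sum of the current row and the k-1 rows before it
    sliding = hourly.rolling(window=k).sum()
    # a window is attributed to the year in which it ends
    return sliding.groupby(sliding.index.year).max()


max_3h = annual_max_sliding_sum(series, 3)
max_6h = annual_max_sliding_sum(series, 6)
```

Calling the function once per duration (3, 6, 12, ...) gives one annual-maximum series for each k.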
I've written this function to get the last Thursday of the month
def last_thurs_date(date):
    month=date.dt.month
    year=date.dt.year
    cal = calendar.monthcalendar(year, month)
    last_thurs_date = cal[4][4]
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date
But it's not working with the lambda function.
datelist['Date'].map(lambda x: last_thurs_date(x))
Where datelist is
datelist = pd.DataFrame(pd.date_range(start = pd.to_datetime('01-01-2014',format='%d-%m-%Y')
, end = pd.to_datetime('06-03-2019',format='%d-%m-%Y'),freq='D').tolist()).rename(columns={0:'Date'})
datelist['Date']=pd.to_datetime(datelist['Date'])
Jpp already added the solution, but just to add a slightly more readable formatted string - see this awesome website.
import calendar
def last_thurs_date(date):
    year, month = date.year, date.month
    cal = calendar.monthcalendar(year, month)
    # each week row starts on Monday, so Thursday is column calendar.THURSDAY (3);
    # days outside the month are 0, so the largest value is the last Thursday
    last_thurs_date = max(week[calendar.THURSDAY] for week in cal)
    return f'{year}-{month:02d}-{last_thurs_date}'
Also added a bit of logic: with cal[4][4] you got 2019-02-0, because the fifth week row of February 2019 has no day in that column.
Scalar datetime objects don't have a dt accessor, series do: see pd.Series.dt. If you remove this, your function works fine. The key is understanding that pd.Series.apply passes scalars to your custom function via a loop, not an entire series.
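def last_thurs_date(date):
    month = date.month
    year = date.year
    cal = calendar.monthcalendar(year, month)
    # weeks are Monday-first, so Thursday is column calendar.THURSDAY (3);
    # days outside the month are 0, so the maximum is the last Thursday
    last_thurs_date = max(week[calendar.THURSDAY] for week in cal)
    if month < 10:
        thurday_date = str(year)+'-0'+ str(month)+'-' + str(last_thurs_date)
    else:
        thurday_date = str(year) + '-' + str(month) + '-' + str(last_thurs_date)
    return thurday_date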
You can rewrite your logic more succinctly via f-strings (Python 3.6+) and a ternary statement:
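def last_thurs_date(date):
    month = date.month
    year = date.year
    cal = calendar.monthcalendar(year, month)
    # Thursday column of every week row; 0 marks days outside the month
    last_thurs_date = max(week[calendar.THURSDAY] for week in cal)
    return f'{year}{"-0" if month < 10 else "-"}{month}-{last_thurs_date}'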
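The point about scalars versus Series can be checked with a small sketch. The sample dates are made up; the sketch only shows that a Series exposes date parts through .dt, while each element handed to map or apply is a Timestamp with the attributes directly on it.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2019-02-14", "2019-03-01"]))

# on the whole Series, date parts are reached through the .dt accessor
months_vectorised = dates.dt.month.tolist()

# map/apply hand over one Timestamp at a time, which has .month itself
months_elementwise = dates.map(lambda ts: ts.month).tolist()

# a single element is a scalar Timestamp and has no .dt accessor
first = dates.iloc[0]
scalar_has_dt = hasattr(first, "dt")
```

Both lists come out the same, so the choice is between the vectorised accessor on the column and plain attributes inside the per-element function.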
I know that a lot of time has passed since the date of this post, but I think it would be worth adding another option if someone came across this thread
Even though I use pandas every day at work, in this case my suggestion would be to just use the dateutil library. The solution is a simple one-liner, without unnecessary combinations.
from dateutil.rrule import rrule, MONTHLY, FR, SA
from datetime import datetime as dt
import pandas as pd
# monthly options expiration dates calculated for 2022
monthly_options = list(rrule(MONTHLY, count=12, byweekday=FR, bysetpos=3, dtstart=dt(2022,1,1)))
# last Saturday of each month
last_saturdays = list(rrule(MONTHLY, count=12, byweekday=SA, bysetpos=-1, dtstart=dt(2022,1,1)))
and then of course:
pd.DataFrame({'LAST_ST':last_saturdays}) #or whatever you need
This answer is based on Calculate Last Friday of Month in Pandas.
It can be modified for another weekday by choosing the matching frequency, here freq='W-FRI'.
I think the easiest way is to create a pandas.DataFrame using pandas.date_range and specifying freq='W-FRI'.
W-FRI is Weekly Fridays
pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')
Creates all the Fridays in the date range between the min and max of the dates in df
Use a .groupby on year and month, and select .last(), to get the last Friday of every month for every year in the date range.
Because this method finds all the Fridays for every month in the range and then chooses .last() for each month, there's not an issue with trying to figure out which week of the month has the last Friday.
With this, use pandas: Boolean Indexing to find values in the Date column of the dataframe that are in last_fridays_in_daterange.
Use the .isin method to determine containment.
pandas: DateOffset objects
import pandas as pd
# test data: given a dataframe with a datetime column
df = pd.DataFrame({'Date': pd.date_range(start=pd.to_datetime('2014-01-01'), end=pd.to_datetime('2020-08-31'), freq='D')})
# create a dataframe with all Fridays in the date range between the min and max of df.Date
fridays = pd.DataFrame({'datetime': pd.date_range(df.Date.min(), df.Date.max(), freq='W-FRI')})
# use groupby and last to get the last Friday of each month into a list
last_fridays_in_daterange = fridays.groupby([fridays.datetime.dt.year, fridays.datetime.dt.month]).last()['datetime'].tolist()
# find the data for the last Friday of the month
df[df.Date.isin(last_fridays_in_daterange)]
I am filtering a dataframe by dates to produce two separate versions:
Data from only today's date
Data from the last two years
However, when I try to filter on the date, it seems to miss dates that are within the last two years.
date_format = '%m-%d-%Y' # desired date format
today = dt.now().strftime(date_format) # today's date. Will always result in today's date
today = dt.strptime(today, date_format).date() # converting 'today' into a datetime object
today = today.strftime(date_format)
two_years = today - relativedelta(years=2) # date is today's date minus two years.
two_years = two_years.strftime(date_format)
# normalizing the format of the date column to the desired format
df_data['date'] = pd.to_datetime(df_data['date'], errors='coerce').dt.strftime(date_format)
df_today = df_data[df_data['date'] == today]
df_two_year = df_data[df_data['date'] >= two_years]
Which results in:
all dates ['07-17-2020' '07-15-2020' '08-01-2019' '03-25-2015']
today df ['07-17-2020']
two year df ['07-17-2020' '08-01-2019']
The 07-15-2020 date is missing from the two year, even though 08-01-2019 is captured.
You don't need to convert anything to string; simply work with the datetime dtype. Ex:
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['07-17-2020','07-15-2020','08-01-2019','03-25-2015'])})
today = pd.Timestamp('now')
print(df[df['date'].dt.date == today.date()])
# date
# 0 2020-07-17
print(df[(df['date'].dt.year >= today.year-1) & (df['date'].dt.date != today.date())])
# date
# 1 2020-07-15
# 2 2019-08-01
What you get from the comparison operations (adjust them as needed...) are boolean masks - you can use them nicely to filter the df.
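If the window should be exactly the last two years rather than a comparison of year numbers, one possible sketch is to subtract a pd.DateOffset from today and compare datetimes directly. Here today is fixed to the date from the question so the result is reproducible, and the names cutoff, df_today and df_two_year are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["07-17-2020", "07-15-2020", "08-01-2019", "03-25-2015"],
        format="%m-%d-%Y",
    )
})

# fixed "today" for a reproducible example; in practice pd.Timestamp("now").normalize()
today = pd.Timestamp("2020-07-17")
# calendar-aware subtraction of two years -> 2018-07-17
cutoff = today - pd.DateOffset(years=2)

# rows dated today (normalize drops any time-of-day part)
df_today = df[df["date"].dt.normalize() == today]
# rows from the cutoff up to and including today
df_two_year = df[(df["date"] >= cutoff) & (df["date"] <= today)]
```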
Your datatype conversions are the problem here. You could do this:
today = dt.now() # today's date. Will always result in today's date
two_years = today - relativedelta(years=2) # date is today's date minus two years.
This prints '2018-07-17 18:40:42.704395'. You can then convert it to the date only format.
two_years = two_years.strftime(date_format)
two_years = dt.strptime(two_years, date_format).date()
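To see why the string version of the filter drops 07-15-2020, here is a minimal sketch using only the standard library. Strings in %m-%d-%Y form are compared character by character, so the month and day decide the result before the year is ever looked at. The cutoff date is the two-years-ago value from the example above.

```python
from datetime import date

date_format = "%m-%d-%Y"
cutoff = date(2018, 7, 17)      # two years before the question's "today"
candidate = date(2020, 7, 15)   # the row that went missing

# '07-15-2020' >= '07-17-2018' is decided at the day digits: '5' < '7'
as_strings = candidate.strftime(date_format) >= cutoff.strftime(date_format)

# real date objects compare chronologically
as_dates = candidate >= cutoff
```

The same effect explains why 08-01-2019 was kept: as a string it starts with '08', which sorts after '07'.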
I am trying to filter out some data and seem to be running into some errors.
Below is a replica of the code I have:
url = "http://elections.huffingtonpost.com/pollster/2012-general-election-romney-vs-obama.csv"
source = requests.get(url).text
s = StringIO(source)
election_data = pd.DataFrame.from_csv(s, index_col=None).convert_objects(
convert_dates="coerce", convert_numeric=True)
election_data.head(n=3)
last_day = max(election_data["Start Date"])
filtered = election_data[((last_day-election_data['Start Date']).days <= 5)]
As you can see, last_day is the max of the Start Date column in election_data. I would like to keep only the rows where the difference between the max and x is less than or equal to 5 days.
I have tried using for - loops, and various combinations of list comprehension.
filtered = election_data[map(lambda x: (last_day - x).days <= 5, election_data["Start Date"]) ]
This line would normally work however, python3 gives me the following error:
<map object at 0x10798a2b0>
Your first attempt has it almost right. The issue is
(last_day - election_data['Start Date']).days
which should instead be
(last_day - election_data['Start Date']).dt.days
Series objects do not have a days attribute; on a Series of timedeltas it is reached through the .dt accessor, whereas a TimedeltaIndex exposes days directly. A fully working example is below.
data = pd.read_csv(url, parse_dates=['Start Date', 'End Date', 'Entry Date/Time (ET)'])
data.loc[(data['Start Date'].max() - data['Start Date']).dt.days <= 5]
Note that I've used Series.max which is more performant than the built-in max. Also, data.loc[mask] is slightly faster than data[mask] since it is less-overloaded (has a more specialized use case).
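The same filter can be tried offline on a small made-up frame. The polls data below is hypothetical and only mirrors the Start Date column name; comparing against pd.Timedelta(days=5) is an equivalent way to write the .dt.days <= 5 condition for whole-day differences.

```python
import pandas as pd

# hypothetical stand-in for the election data
polls = pd.DataFrame({
    "Start Date": pd.to_datetime(["2012-11-04", "2012-11-01", "2012-10-30", "2012-10-20"]),
    "Pollster": ["A", "B", "C", "D"],
})

# subtracting a datetime Series from a Timestamp gives a Series of timedeltas
age = polls["Start Date"].max() - polls["Start Date"]

# option 1: compare the day counts through the .dt accessor
recent_by_days = polls.loc[age.dt.days <= 5]

# option 2: compare the timedeltas against a Timedelta directly
recent_by_delta = polls.loc[age <= pd.Timedelta(days=5)]
```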
If I understand your question correctly, you just want to keep the rows whose Start Date value is <= 5 days away from the last day. This sounds like something pandas indexing could easily handle, using .loc.
If you want an entirely new DataFrame object with the filtered data:
election_data # your frame
last_day = max(election_data["Start Date"])
date = # Your date within 5 days of the last day
new_df = election_data.loc[(last_day-election_data["Start Date"]<=date)]
Or if you just want the Start Date column post-filtering:
last_day = max(election_data["Start Date"])
date = # Your date within 5 days of the last day
filtered_dates = election_data.loc[(last_day-election_data["Start Date"]<=date), "Start Date"]
Note that your date variable needs to be your date in the format required by Start Date (possibly YYYYmmdd format?). If you don't know what this variable should be, then just print(last_day) then count 5 days back.