Fix date strings with days and months interchanged in certain rows - python

I'm trying to upload some data from a CSV file and find that the day and month values get interchanged.
Given below is how the data looks:
id,date
1001,09/10/2018
1002,20/09/2018
1003,09/05/2018
All of the dates are from September, but as you can see they arrive in inconsistent formats. I am using the line below to convert to datetime:
df['date'] = pd.to_datetime(df['date']).dt.strftime('%d/%m/%Y')

I've figured out a neat little trick using str.extract and pd.to_datetime to do this quickly and efficiently:
# if the number that follows "09/" is greater than 31, it must be the year,
# which puts "09" in the month slot and the day first
dayfirst = df.date.str.extract(r'(?:(09)/(\d+))')[1].astype(int) > 31
df['date'] = [
    pd.to_datetime(d, dayfirst=flag) for d, flag in zip(df.date, dayfirst)]
id date
0 1001 2018-09-10
1 1002 2018-09-20
2 1003 2018-09-05

Pandas has no issue with your sample data because it parses cleanly in US notation, apart from the case of '20/09/2018', where 20 cannot possibly be a month; pandas has no problem dealing with that either.
However, the input may contain dates such as '10/09/2018' (as was mentioned in the comments), where it is impossible to tell day and month apart unless you either assume US notation or know beforehand that absolutely all dates are in September.
Since the latter seems to be the case, you can do:
df['date'].map(lambda x: pd.Timestamp(x.year, x.day, x.month)  # swap day and month back
               if x.month != 9 and x.day == 9
               else x)
0 2018-09-10
1 2018-09-20
2 2018-09-05

Related

A simple way of selecting the previous row in a column and performing an operation?

I'm trying to create a forecast which takes the previous day's 'Forecast' total and adds the current day's 'Appt' to it. It's straightforward in Excel, but I'm struggling in pandas. At the moment all I can get with .loc is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
              'Appt': [12, 10, 5, 4, 13],
              'Forecast': [37, 0, 0, 0, 0]})
What I'm looking for it to do is this:
pd.DataFrame({'Date': ['2022-12-01', '2022-12-02', '2022-12-03', '2022-12-04', '2022-12-05'],
              'Appt': [12, 10, 5, 4, 13],
              'Forecast': [37, 47, 52, 56, 69]})
E.g. the 'Forecast' total on 1st December is 37. On 2nd December the value in the 'Appt' column is 10. I want it to take 37 + 10 and put the result in the 'Forecast' column for 2nd December, then iterate down the rest of the column.
I've tried using .loc with the index and experimented with .shift(), but neither seems to do what I'd like. I've also looked into .rolling(), but I think that's not appropriate either.
I'm sure there must be a simple way to do this?
Apologies, the original df has 'Date' as a datetime column.
You can use mask and cumsum: replace the zero placeholders with that day's 'Appt', then take the running total:
df['Forecast'] = df['Forecast'].mask(df['Forecast'].eq(0), df['Appt']).cumsum()
# or, with numpy (requires import numpy as np)
df['Forecast'] = np.where(df['Forecast'].eq(0), df['Appt'], df['Forecast']).cumsum()
Output:
Date Appt Forecast
0 2022-12-01 12 37
1 2022-12-02 10 47
2 2022-12-03 5 52
3 2022-12-04 4 56
4 2022-12-05 13 69
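For intuition, the one-liner is just seeding a running sum of 'Appt' with the first day's 'Forecast'; spelled out explicitly (a sketch assuming the default RangeIndex):
seed = df.loc[0, 'Forecast']                        # 37 on the first day
df['Forecast'] = df['Appt'].where(df.index > 0, seed).cumsum()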
You have to make sure that your column has a datetime/date dtype; then you can filter the df like this:
from datetime import datetime, timedelta  # plus the previous code & imports

yesterday = datetime.now().date() - timedelta(days=1)
df[df["date"] == yesterday]["your_column"].sum()

How to create month and year columns using regex and pandas

Hello Stack Overflow community,
I've got the DataFrame below:
code sum of August
AA 1000
BB 4000
CC 72262
So there are two columns ['code','sum of August']
I have to convert this DataFrame into one with ['month', 'year', 'code', 'sum of August'] columns:
month year code sum of August
8 2020 AA 1000
8 2020 BB 4000
8 2020 CC 72262
The ['sum of August'] column is sometimes named just ['August'] or ['august']. It can also be ['sum of November'], ['November'] or ['november'].
I thought of using regex to extract the month name and convert it to the month number.
Can anyone please help me with this?
Thanks in advance!
You can do the following:
month = {1: 'january',
         2: 'february',
         3: 'march',
         4: 'april',
         5: 'may',
         6: 'june',
         7: 'july',
         8: 'august',
         9: 'september',
         10: 'october',
         11: 'november',
         12: 'december'}
Let's say your data frame is called df. Then you can create the column month automatically using the following:
df['month'] = [i for i, j in month.items() if j in " ".join(df.columns).lower()][0]
code sum of August month
0 AA 1000 8
1 BB 4000 8
2 CC 72262 8
That means: if a month's name appears in the column names in any way, the number of that month is returned.
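Since the question specifically asks for regex, here is a sketch of the same lookup built with re on top of the month dict above (my assumption: exactly one column label contains a month name):
import re

num_of = {name: num for num, name in month.items()}
# one case-insensitive alternation over all twelve month names
pattern = re.compile("|".join(month.values()), re.IGNORECASE)
for col in df.columns:
    match = pattern.search(col)
    if match:
        df['month'] = num_of[match.group(0).lower()]
        break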
It looks like you're trying to convert month names to their numbers, and the column names can be uppercase or lowercase.
This might work:
months = ['january', 'february', 'march', 'april', 'may', 'june',
          'july', 'august', 'september', 'october', 'november', 'december']
monthNum = []  # using a list just to make this example run
sumOfMonths = ['sum of august', 'sum of NovemBer']  # just to show functionality
for sumOfMonth in sumOfMonths:
    for idx, month in enumerate(months):
        if month in sumOfMonth.lower():  # the column name contains a month keyword
            monthNum.append(str(idx + 1))  # month numbers are 1-based, so index + 1
I hope this helps! Of course, this won't be exactly what you do; fill in your own variables and change append() if you're not collecting the results in a list.

Python Pandas: Change value associated with each first day entry in every month

I'd like to change the value associated with the first day in every month for a pandas.Series I have. For example, given something like this:
Date
1984-01-03 0.992701
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 1.009894
1984-02-02 0.996608
1984-02-03 0.996595
...
I'd like to change the values associated with 1984-01-03, 1984-02-01 and so on. I've racked my brain for hours on this one and have looked around Stack Overflow a fair bit. Some solutions have come close. For example, using:
[In]: series.groupby((m_ret.index.year, m_ret.index.month)).first()
[Out]:
Date Date
1984 1 0.992701
2 1.009894
3 1.005963
4 0.997899
5 1.000342
6 0.995429
7 0.994620
8 1.019377
9 0.993209
10 1.000992
11 1.009786
12 0.999069
1985 1 0.981220
2 1.011928
3 0.993042
4 1.015153
...
Is almost there, but I'm struggling to proceed further.
What I'd like to do is set the values associated with the first day present in each month of every year to 1.
series[m_ret.index.is_month_start] = 1 comes close, but the problem here is that is_month_start only selects rows where the day value is 1. As you can see, that isn't always the case in my series; for example, the first day present in January is 1984-01-03.
series.groupby(pd.TimeGrouper('BM')).nth(0) doesn't appear to return the first day either; instead I get the last day:
Date
1984-01-31 0.992701
1984-02-29 1.009894
1984-03-30 1.005963
1984-04-30 0.997899
1984-05-31 1.000342
1984-06-29 0.995429
1984-07-31 0.994620
1984-08-31 1.019377
...
I'm completely stumped. Your help is as always, greatly appreciated! Thank you.
One way would be to use your .groupby((m_ret.index.year, m_ret.index.month)) idea, but apply idxmin to the index itself converted into a Series:
In [74]: s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
Out[74]:
Date Date
1984 1 1984-01-03
2 1984-02-01
Name: Date, dtype: datetime64[ns]
In [75]: start = s.index.to_series().groupby([s.index.year, s.index.month]).idxmin()
In [76]: s.loc[start] = 999
In [77]: s
Out[77]:
Date
1984-01-03 999.000000
1984-01-04 1.003614
1984-01-17 0.994647
1984-01-18 1.007440
1984-01-27 1.006097
1984-01-30 0.991546
1984-01-31 1.002928
1984-02-01 999.000000
1984-02-02 0.996608
1984-02-03 0.996595
dtype: float64
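If you only need the positions rather than the timestamps themselves, a shorter route to the same mask, assuming the index is sorted, is to mark the first row of each monthly period:
first = ~s.index.to_period('M').duplicated()  # True on the first row seen in each month
s[first] = 1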

Pandas monthly rolling operation

I ended up figuring it out while writing out this question so I'll just post anyway and answer my own question in case someone else needs a little help.
Problem
Suppose we have a DataFrame, df, containing this data.
import pandas as pd
from io import StringIO
data = StringIO(
"""\
date spendings category
2014-03-25 10 A
2014-04-05 20 A
2014-04-15 10 A
2014-04-25 10 B
2014-05-05 10 B
2014-05-15 10 A
2014-05-25 10 A
"""
)
df = pd.read_csv(data, sep=r"\s+", parse_dates=True, index_col="date")
Goal
For each row, sum the spendings over every row that is within one month of it, ideally using DataFrame.rolling as it's a very clean syntax.
What I have tried
df = df.rolling("M").sum()
But this throws an exception
ValueError: <MonthEnd> is a non-fixed frequency
version: pandas==0.19.2
Use the "D" offset rather than "M" and specifically use "30D" for 30 days or approximately one month.
df = df.rolling("30D").sum()
Initially, I intuitively jumped to using "M" as I figured it stands for one month, but now it's clear why that doesn't work.
To address why you cannot use offsets like "AS" or "Y": in this case "Y" is not "a year", it actually references YearEnd (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases), and therefore the rolling function does not get a fixed window (e.g. you get a 365-day window if an index entry falls on Jan 1, and a 1-day window if it falls on Dec 31).
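You can check what an alias resolves to yourself (the spellings here are the 0.19-era ones and vary a little across pandas versions):
from pandas.tseries.frequencies import to_offset
to_offset("A")     # <YearEnd: month=12> -- anchored, not a fixed span
to_offset("AS")    # <YearBegin: month=1>
to_offset("30D")   # <30 * Days> -- fixed, which .rolling() accepts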
The proposed solution (offset by "30D") works if you do not need strict calendar months. Alternatively, you can iterate over your date index and slice with an offset, which gives more precise control over the sum.
If you have to do it in one line (separated for readability):
df['Sum'] = [
df.loc[
edt - pd.tseries.offsets.DateOffset(months=1):edt, 'spendings'
].sum() for edt in df.index
]
spendings category Sum
date
2014-03-25 10 A 10
2014-04-05 20 A 30
2014-04-15 10 A 40
2014-04-25 10 B 50
2014-05-05 10 B 50
2014-05-15 10 A 40
2014-05-25 10 A 40

How to count the number of repetitions of dates using pandas.Series.map?

This code counts the frequency of dates in two groups: Monday through Friday together, and Saturday, Sunday together.
How do I change the arguments of the map function to count dates in these two groups instead:
1. 9am to 5pm on weekdays
2. the rest of the hours in the week (5pm to 9am on weekdays, plus weekends).
d = ['10/3/2013 18:36', '10/3/2013 23:40', '10/3/2013 20:56', '10/4/2013 9:35', '11/7/2013 10:02', '11/11/2013 14:45', '12/1/2013 12:04']
df = pd.DataFrame(pd.to_datetime(d), columns=["DATE"])
df["DATE"].dt.weekday.map({0:0,1:0,2:0,3:0,4:0,5:1,6:1}).value_counts()
Pandas has built-in ways of identifying business days; here the .dt accessor does it (weekday is Monday=0 through Sunday=6; the old pd.datetools.isBusinessDay helper has since been removed). There might actually be some built-in functions for business hours too, but I'm not sure, as I don't deal with datetimes that often:
df['in_business_hours'] = (
    (df['DATE'].dt.weekday < 5) &                            # Monday=0 ... Friday=4
    (9 <= df['DATE'].dt.hour) & (df['DATE'].dt.hour <= 16)   # 9:00 up to, not including, 17:00
)
df['in_business_hours'].value_counts()
df['in_business_hours'].value_counts()
Out[14]:
False 4
True 3
dtype: int64
Since .map() can be applied directly on the Series and also takes an arbitrary function, you can use:
df['DATE'].map(lambda dt:
    'Office' if dt.weekday() in {0, 1, 2, 3, 4} and 9 <= dt.hour < 17
    else 'Out of office'
).value_counts()
The result is:
Out of office 4
Office 3
dtype: int64
