How to add a character to a string by index without deleting anything - python

How can I add a / character to a string that looks like this:
012019
so that it looks like this:
01/2019
Optionally, I would also like to add a day, like this:
01/01/2019
The data:
import pandas as pd
df= pd.DataFrame({ "month": ["012019","152019","222019","142019","302019","012020"]})
My code:
df.month = df.month.apply(lambda x: '{:0>2}'.format(x.split('/')[0]))
But it does not work.

If I understood correctly, you just want to add a slash between the 2nd and 3rd characters. Then it's easy:
df['new'] = df.month.str.slice(0, 2) + '/' + df.month.str.slice(2)
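A minimal sketch of the same slicing idea, also covering the optional day from the question. The fixed "01" day prefix and the column names with_slash and with_day are assumptions made for illustration only:

```python
import pandas as pd

df = pd.DataFrame({"month": ["012019", "152019", "222019"]})

# Slicing keeps every original character; only the separator is added
df["with_slash"] = df["month"].str[:2] + "/" + df["month"].str[2:]

# Assumption: the extra leading component is always the fixed string "01"
df["with_day"] = "01/" + df["with_slash"]

print(df)
```

Because this is pure string manipulation, values such as 15 or 30 pass through unchanged without being validated as a day or a month.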

IIUC, just convert to datetime and use dt.strftime:
df['month'] = pd.to_datetime(df['month'],format='%d%Y').dt.strftime('%d/%Y')
output:
print(df)
month
0 01/2019
1 15/2019
2 22/2019
3 14/2019
4 30/2019
5 01/2020
If you want to add a month as well, just append it to the string before parsing (starting again from the original six-digit strings):
month = '01'
df['month'] = pd.to_datetime(df['month'].astype(str) + month,
                             format='%d%Y%m').dt.strftime('%m/%d/%Y')
print(df)
month
0 01/01/2019
1 01/15/2019
2 01/22/2019
3 01/14/2019
4 01/30/2019
5 01/01/2020
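One side effect of going through to_datetime is that the values are validated while being parsed. The sketch below illustrates this with errors="coerce"; the value "992019" is a made-up invalid entry that is not part of the original data:

```python
import pandas as pd

# "992019" is a hypothetical bad value added for illustration
raw = pd.Series(["012019", "152019", "992019"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(raw, format="%d%Y", errors="coerce")
formatted = parsed.dt.strftime("%d/%Y")

print(formatted)
```

Rows that fail to parse end up as missing values, which makes them easy to find with parsed.isna().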

Related

How to convert month number to datetime in pandas

I have followed the instructions from this thread, but have run into issues.
Converting month number to datetime in pandas
I think it may have to do with having an additional variable in my dataframe but I am not sure. Here is my dataframe:
0 Month Temp
1 0 2
2 1 4
3 2 3
What I want is:
0 Month Temp
1 1990-01 2
2 1990-02 4
3 1990-03 3
Here is what I have tried:
df= pd.to_datetime('1990-' + df.Month.astype(int).astype(str) + '-1', format = '%Y-%m')
And I get this error:
ValueError: time data 1990-0-1 doesn't match format specified
IIUC, your months are zero-based, so we can shift them by one, build the date string manually, and then format it as your expected output:
m = df['Month'].add(1).astype(int).astype(str)
df['date'] = pd.to_datetime(
    "1990-" + m, format="%Y-%m"
).dt.strftime("%Y-%m")
print(df)
Month Temp date
0 0 2 1990-01
1 1 4 1990-02
2 2 3 1990-03
Use .dt.strftime() to control how the date is displayed, because datetime values are shown in %Y-%m-%d format by default.
import pandas as pd
df= pd.DataFrame({'month':[1,2,3]})
df['date']=pd.to_datetime(df['month'], format="%m").dt.strftime('%Y-%m')
print(df)
In your case the months range from 0 to 11 rather than 1 to 12, so you have to explicitly add 1 to them:
df=pd.DataFrame({'month':[11,1,2,3,0]})
df['date']=pd.to_datetime(df['month']+1, format='%m').dt.strftime('1990-%m')
Here is my solution for you
import pandas as pd
Data = {
'Month' : [1,2,3],
'Temp' : [2,4,3]
}
data = pd.DataFrame(Data)
data['Month']= pd.to_datetime('1990-' + data.Month.astype(int).astype(str) + '-1', format = '%Y-%m-%d').dt.to_period('M')
Month Temp
0 1990-01 2
1 1990-02 4
2 1990-03 3
If month 0 should mean January, add 1 to the Month column before building the date string.
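As an alternative to string concatenation, pandas can assemble dates from numeric year, month and day columns. The following is a sketch under the assumption that Month is zero-based (0 means January) and that the year is always 1990; the helper frame parts is introduced here only for illustration:

```python
import pandas as pd

df = pd.DataFrame({"Month": [0, 1, 2], "Temp": [2, 4, 3]})

# Assumption: months are zero-based, so shift them into the 1-12 range
parts = pd.DataFrame({"year": 1990, "month": df["Month"] + 1, "day": 1})

# to_datetime assembles timestamps from columns named year, month and day
df["date"] = pd.to_datetime(parts).dt.strftime("%Y-%m")

print(df)
```

This avoids format strings for parsing altogether, so a month value outside 1-12 raises an error rather than being silently misread.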

Change comma separated date format 20,190,927 in a dataframe?

I have a data frame with date columns as 20,190,927 which means: 2019/09/27.
I need to change the format to YYYY/MM/DD or something similar.
I thought of doing it manually like:
x = df_all['CREATION_DATE'].str[:2] + df_all['CREATION_DATE'].str[3:5] + "-" + \
df_all['CREATION_DATE'].str[5] + df_all['CREATION_DATE'].str[7] + "-" + df_all['CREATION_DATE'].str[8:]
print(x)
What's a more creative way of doing this? Could it be done with datetime module?
I believe this is what you want. First replace the , with nothing, so you get a yyyymmdd format, and then change it to datetime with pd.to_datetime by passing the correct format. One liner:
df['dates'] = pd.to_datetime(df['dates'].str.replace(',',''),format='%Y%m%d')
Full explanation:
import pandas as pd
a = {'dates':['20,190,927','20,191,114'],'values':[1,2]}
df = pd.DataFrame(a)
print(df)
Output, showing what the original dataframe looks like:
dates values
0 20,190,927 1
1 20,191,114 2
df['dates'] = df['dates'].str.replace(',','')
df['dates'] = pd.to_datetime(df['dates'],format='%Y%m%d')
print(df)
print(df.info())
Output of the newly formatted dataframe:
dates values
0 2019-09-27 1
1 2019-11-14 2
Printing .info() to ensure we have the correct format:
dates 2 non-null datetime64[ns]
values 2 non-null int64
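Hope this helps:
import pandas as pd
date=['20,190,927','20,190,928','20,190,929']
df3=pd.DataFrame(date,columns=['Date'])
# Strip the commas, then parse the remaining yyyymmdd digits
df3['Date']=df3['Date'].str.replace(',','',regex=False)
df3['Date']=pd.to_datetime(df3['Date'],format='%Y%m%d')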
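Since the question asks for a YYYY/MM/DD display, here is a short sketch that parses the values and then formats them back to strings. The column name formatted is an assumption for illustration:

```python
import pandas as pd

df = pd.DataFrame({"dates": ["20,190,927", "20,191,114"]})

# Remove the thousands separators, then parse the yyyymmdd digits
parsed = pd.to_datetime(df["dates"].str.replace(",", "", regex=False),
                        format="%Y%m%d")

# strftime returns strings; keep `parsed` if real datetime values are needed
df["formatted"] = parsed.dt.strftime("%Y/%m/%d")

print(df)
```

Keeping the parsed datetime column alongside the formatted one is usually more convenient, because sorting and filtering work on the datetime values rather than on text.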

Changing format of date in pandas dataframe

I have a pandas dataframe, in which a column is a string formatted as
yyyymmdd
which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?
OK, so you want to select Monday to Friday. Do that by converting your column to datetime and checking whether dt.dayofweek is lower than 5 (Mon-Fri --> 0-4):
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
'date': [
'20180101',
'20180102',
'20180103',
'20180104',
'20180105',
'20180106',
'20180107'
],
'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
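A variation on the example above, sketched with an explicit format and with the parsed dates stored back in the dataframe so they can be reused. The day column is added only to make the result easy to check:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["20180101", "20180102", "20180103", "20180104",
             "20180105", "20180106", "20180107"],
    "value": range(7),
})

# Parse once with an explicit format and keep the datetime column
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
df["day"] = df["date"].dt.day_name()

# dayofweek: Monday is 0 and Sunday is 6, so working days are 0-4
workdays = df[df["date"].dt.dayofweek < 5]

print(workdays)
```

Note that this only checks the day of the week; public holidays falling on a weekday are still treated as working days.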

Function take values from a dataframe as parameter

I have a function which calculates the Holidays for a given year like this:
holidays = bf.Holidays(year)
The problem is that there is no way to edit the Holidays function, so I need another solution.
I have a dataframe with some years, for example:
year
0 2005
1 2011
2 2015
3 2017
Right now, if I do this:
yearX = year.get_value(0, 0)
and run
holidays = bf.Holidays(yearX)
it only calculates the holidays for the first year in the dataframe (2005).
How can I make the function take every year and append the results?
Using a for loop?
Example of how it works now:
year = df['YEAR']
yearX = year.get_value(0,0)
holidays = bf.Holidays(yearX)
holidays = holidays.get_holiday_list()
print(holidays)
output:
DATE
2005-01-01
2005-03-25
2005-03-27
2005-03-28
2005-05-01
But I want it to calculate the holidays for every dataframe row, not only the first one.
Looks like you're looking for pandas.DataFrame.apply:
holidays = df.apply(bf.Holidays, axis=1)
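It will apply the function bf.Holidays to each row of your df DataFrame. Note that with axis=1 each row is passed as a Series; if bf.Holidays expects a plain year, df['year'].apply(bf.Holidays) passes the scalar values instead.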
For the df from your question:
In [50]: df
Out[50]:
year
0 2010
1 2011
2 2015
3 2017
In [51]: def test(x):
    ...:     return x % 13
    ...:
In [52]: df.apply(test, axis=1)
Out[52]:
year
0 8
1 9
2 0
3 2
I think you can follow this example and just write a little wrapper function to return the dates to their respective columns:
def holiday_mapper(row):
    holidays = bf.Holidays(row['year'], 'HH').get_holiday_list()
    # Store each holiday in its own numbered column: holiday1, holiday2, ...
    for i, holiday in enumerate(holidays, start=1):
        row[f'holiday{i}'] = holiday
    return row
df = df.apply(holiday_mapper, axis=1)
This assumes that get_holiday_list() actually returns a list, and that you want to store the holiday dates in one column per holiday rather than append a single column with all the dates.
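If the goal is one long list of holidays for all years, a loop over the year column combined with pd.concat is another option. Since the bf module is not available here, the sketch below uses fake_holidays, a hypothetical stand-in that returns a one-column DataFrame named DATE like the output shown in the question; in real code it would be replaced by bf.Holidays(year).get_holiday_list():

```python
import pandas as pd

def fake_holidays(year):
    """Hypothetical stand-in for bf.Holidays(year).get_holiday_list()."""
    return pd.DataFrame(
        {"DATE": pd.to_datetime([f"{year}-01-01", f"{year}-05-01"])}
    )

df = pd.DataFrame({"year": [2005, 2011, 2015, 2017]})

# Call the function once per year and stack the results into one frame
all_holidays = pd.concat(
    [fake_holidays(year) for year in df["year"]],
    ignore_index=True,
)

print(all_holidays)
```

This works regardless of how many holidays each year has, because the per-year results are stacked as rows instead of being spread over columns.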

Aggregating unbalanced panel to time series using pandas

I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far, managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assuming df is your second dataframe (the one with weeks), you can try the following:
df.groupby('week').sum()['value']
See the pandas documentation for groupby() and its applications. It is similar to GROUP BY in SQL.
To obtain the second dataframe from the first one, try the following:
Firstly, prepare a function to map the day to week
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Secondly, take the lists out from the first dataframe, and convert days to weeks
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Thirdly, initialize your second dataframe with only columns of 'Group' and 'Week', leaving the 'value' out. Assume now your initialized new dataframe is result, you can now do a join
result = result.join(df, on=['Group', 'Week'])
Last, write a function to fill the NaN values in the 'value' column with the nearby element. The NaN values are what you need to interpolate. Since I am not sure how you want the interpolation to work, I will leave it to you.
Here is how you can change d2w_map to take a date string. Note that weekday() returns the day of the week as an integer, not the week number:
from datetime import datetime
def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()
A returned value of 0 means Monday, 1 means Tuesday and so on.
If you have the package dateutil installed, the function can be more robust:
from dateutil.parser import parse
def d2w_map(day_str):
    return parse(day_str).weekday()
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
# Note: how= and fill_method= belong to the legacy resample API (deprecated in pandas 0.18)
df_temp = (df.set_index('date')
             .groupby('Group')
             .resample('W', how='sum', fill_method='ffill'))
ts = (df_temp.reset_index()
             .groupby('date')
             .sum()['value'])
I used this tab-delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate data file as follows. Play around with it to get exactly what you need.
import pandas as pd
time_format = '%m/%d/%Y'
Y = pd.read_csv('test.txt', sep="\t")
# Parse the date strings and use them as the index
dates_right_format = pd.to_datetime(Y['Date'].tolist(), format=time_format)
X = pd.DataFrame(Y['value'])
X.index = dates_right_format
X = X.sort_index()
print(X)
# Weekly sums; min_count=1 leaves weeks without observations as NaN
print(X.resample('W', closed='right', label='right').sum(min_count=1))
Output of the last print:
value
2000-01-02 5.0
2000-01-09 3.0
2000-01-16 NaN
2000-01-23 37.0
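For reference, the fill rule described in the question (carry a group's value forward between its first and last observation, zero elsewhere) can be sketched in current pandas roughly as follows. The sketch assumes weeks start on Saturday, as in the question's tables; the names offset_days, wide, inside, filled and totals are introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "A", "B", "B", "C"],
    "Date": ["1/1/2000", "1/17/2000", "1/9/2000", "1/23/2000", "1/22/2000"],
    "value": [5, 10, 3, 7, 20],
})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Snap each date back to the Saturday that starts its week (Saturday is dayofweek 5)
offset_days = (df["Date"].dt.dayofweek - 5) % 7
df["week"] = df["Date"] - pd.to_timedelta(offset_days, unit="D")

# One column per group and one row per week, including weeks with no observations
weeks = pd.date_range(df["week"].min(), df["week"].max(), freq="7D")
wide = df.pivot_table(index="week", columns="Group", values="value",
                      aggfunc="sum").reindex(weeks)

# Forward-fill only up to each group's last observation; everything else becomes 0
inside = wide.bfill().notna()
filled = wide.ffill().where(inside).fillna(0)

totals = filled.sum(axis=1)
print(totals)
```

The mask built from bfill() is what stops group A's last value from being carried into the final week, which matches the zero shown for A on 1/22/2000 in the question's intermediate table.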
