I am new to Python/pandas.
I have a DataFrame that looks like this:
x = pd.DataFrame([20210901,20210902, 20210903, 20210904])
[out]:
0
0 20210901
1 20210902
2 20210903
3 20210904
I want to split each row into its year, month, and day. For example:
year = 2021
month = 9
day = 1
or, alternatively, get a list for each row like this:
[2021,9,1]
You can use pd.to_datetime to convert the entire column to datetime type.
>>> import pandas as pd
>>>
>>> df = pd.DataFrame({'Col': [20210917,20210918, 20210919, 20210920]})
>>>
>>> df.Col = pd.to_datetime(df.Col, format='%Y%m%d')
>>> df
Col
0 2021-09-17
1 2021-09-18
2 2021-09-19
3 2021-09-20
>>> df['Year'] = df.Col.dt.year
>>> df['Month'] = df.Col.dt.month
>>> df['Day'] = df.Col.dt.day
>>>
>>> df
Col Year Month Day
0 2021-09-17 2021 9 17
1 2021-09-18 2021 9 18
2 2021-09-19 2021 9 19
3 2021-09-20 2021 9 20
If you want the result as a list of tuples, you can use a list comprehension along with the zip function.
>>> [(year, month, day) for year, month, day in zip(df.Year, df.Month, df.Day)]
[(2021, 9, 17), (2021, 9, 18), (2021, 9, 19), (2021, 9, 20)]
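If plain per-row lists are enough, a shorter equivalent (standard pandas, not part of the answer above) is to take the values directly:
>>> df[['Year', 'Month', 'Day']].values.tolist()
[[2021, 9, 17], [2021, 9, 18], [2021, 9, 19], [2021, 9, 20]]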
Another option: split each date into three new columns (year, month, day) with string slicing.
import pandas as pd
df = pd.DataFrame({"date":[20210901,20210902, 20210903, 20210904]})
# split the date field into three fields: year, month, day
df["year"] = df["date"].apply(lambda x: str(x)[:4])
df["month"] = df["date"].apply(lambda x: str(x)[4:6])
df["day"] = df["date"].apply(lambda x: str(x)[6:])
print(df)
The result is:
date year month day
0 20210901 2021 09 01
1 20210902 2021 09 02
2 20210903 2021 09 03
3 20210904 2021 09 04
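Note that the year, month and day produced by the slicing above are strings ('2021', '09', '01'). If you need integers instead, one option (an extra step, not part of the answer above) is to cast the new columns afterwards:
# cast the string columns to integers
df[["year", "month", "day"]] = df[["year", "month", "day"]].astype(int)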
I created a simple function that returns a zipped list in the format you wanted.
def convert_to_dates(df):
    # collect each raw integer date from the first column as a string
    dates = []
    for i, v in df.iterrows():
        dates.append(str(v.values[0]))
    # slice each 'YYYYMMDD' string into year, month and day
    years = []
    months = []
    days = []
    for d in dates:
        years.append(d[0:4])
        months.append(d[4:6])
        days.append(d[6:8])
    return list(zip(years, months, days))
Call it by using convert_to_dates(x)
Output:
In [4]: convert_to_dates(x)
Out[4]:
[('2021', '09', '01'),
('2021', '09', '02'),
('2021', '09', '03'),
('2021', '09', '04')]
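The same output can also be produced with a single list comprehension over the column, without the helper function:
In [5]: [(str(v)[:4], str(v)[4:6], str(v)[6:8]) for v in x[0]]
Out[5]:
[('2021', '09', '01'),
 ('2021', '09', '02'),
 ('2021', '09', '03'),
 ('2021', '09', '04')]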
Related
I'm trying to get the number of days between two dates, broken down per month.
I found some answers, but I can't figure out how to do it when the dates span two different years.
For example, I have this dataframe:
df = {'Id': ['1','2','3','4','5'],
'Item': ['A','B','C','D','E'],
'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
'EndDate': ['2020-01-30' ,'2020-02-02','2020-03-03','2020-03-03','2020-02-02']
}
df = pd.DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])
And I want a dataframe with the number of days per month for each row (see the output at the end). One approach:
s = (df[["StartDate", "EndDate"]]
.apply(lambda row: pd.date_range(row.StartDate, row.EndDate), axis=1)
.explode())
new = (s.groupby([s.index, s.dt.year, s.dt.month])
.count()
.unstack(level=[1, 2], fill_value=0))
new.columns = new.columns.map(lambda c: f"{c[0]}-{str(c[1]).zfill(2)}")
new = new.sort_index(axis="columns")
get all the dates in between StartDate and EndDate per row, and explode that list of dates to their own rows
group by the row id, year and month & count records
unstack the year & month identifier to be on the columns side as a multiindex
join the year & month values with a hyphen in between (also zero-fill months, e.g., 03)
lastly sort the year-month pairs on columns
to get
>>> new
2019-11 2019-12 2020-01 2020-02 2020-03
0 0 22 30 0 0
1 0 31 31 2 0
2 0 31 31 29 3
3 21 31 31 29 3
4 9 31 31 2 0
I have a column with values like "28 days ago", "4 months ago", and "Oct, 2021" that I want to normalize to a month-year format. I am thinking of writing if conditions, but is there a library or method I don't know about that can solve this?
You can extract the number of months, subtract it from the current month-year period, and assign the result back to the DataFrame. For the day values, convert them to timedeltas and subtract them from the current date with Series.rsub (subtraction from the right side):
print (df)
col
0 28 days ago
1 4 months ago
2 11 months ago
3 Oct, 2021
now = pd.Timestamp('now')
per = now.to_period('m')
date = now.floor('d')
s = df['col'].str.extract(r'(\d+)\s*month', expand=False).astype(float)
s1 = df['col'].str.extract(r'(\d+)\s*day', expand=False).astype(float)
mask, mask1 = s.notna(), s1.notna()
df.loc[mask, 'col'] = s[mask].astype(int).rsub(per).dt.strftime('%b, %Y')
df.loc[mask1, 'col'] = pd.to_timedelta(s1[mask1], unit='d').rsub(date).dt.strftime('%b, %Y')
print (df)
col
0 Sep, 2022
1 Jun, 2022
2 Nov, 2021
3 Oct, 2021
Assuming this input:
col
0 4 months ago
1 Oct, 2021
2 9 months ago
You can use:
# try to get a date:
s = pd.to_datetime(df['col'], errors='coerce')
# extract the month offset
offset = (df['col']
.str.extract(r'(\d+) months? ago', expand=False)
.fillna(0).astype(int)
)
# if the date is NaT, replace it with today - n months
df['date'] = s.fillna(pd.Timestamp('today').normalize()
- offset*pd.DateOffset(months=1))
If you want a Mon, Year format:
df['date2'] = df['col'].where(offset.eq(0),
(pd.Timestamp('today').normalize()
-offset*pd.DateOffset(months=1)
).dt.strftime('%b, %Y')
)
output:
col date date2
0 4 months ago 2022-06-28 Jun, 2022
1 Oct, 2021 2021-10-01 Oct, 2021
2 9 months ago 2022-01-28 Jan, 2022
I have a meteorological data set with daily precipitation values for 120 years. I would like to process it so that I end up with monthly average values for 4 climate periods. Example: average precipitation for January, February, March, ... for the period 1981-2010, average precipitation for January, February, March, ... for the period 2011-2040, and so on.
Data set looks like this (is available as csv file, read in as pandas dataframe):
year month day lon lat value
0 1981 1 1 0 0 0.522592
1 1981 1 2 0 0 2.692495
2 1981 1 3 0 0 0.556698
3 1981 1 4 0 0 0.000000
4 1981 1 5 0 0 0.000000
... ... ... ... ... ... ...
43824 2100 12 27 0 0 0.000000
43825 2100 12 28 0 0 0.185120
43826 2100 12 29 0 0 10.252080
43827 2100 12 30 0 0 13.389290
43828 2100 12 31 0 0 3.523566
Here is my code so far:
csv_path = r'filepath.csv'
df = pd.read_csv(csv_path, delimiter = ';')
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
years = pd.date_range('1981-01-01', periods = 6, freq = '30YS').strftime('%Y')
labels = [f'{a}-{b}' for a, b in zip(years, years[1:])]
(df.assign(period = pd.cut(df['year'], bins = years.astype(int), labels = labels, right = False))
   .groupby(df[['year', 'month']].dt.to_period('M'))
   .agg({'period': 'first', 'value': 'sum'})
   .groupby('period')['value'].mean())
The best way is probably to write a loop that iterates over all months and the 4 30-year periods, but unfortunately I can't get this to work. Does anyone have any tips?
Expected Output:
Month Average
0 January 20
1 February 21
2 March 19
3 April 18
To get the total value per month and then the average per 30-year period, you need a double groupby:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
years = pd.date_range('1981-01-01', periods=6, freq='30YS').strftime('%Y')
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
(df
.assign(period=pd.cut(df['year'], bins=years.astype(int), labels=labels, right=False))
.groupby(df['date'].dt.to_period('M')).agg({'period':'first', 'value': 'sum'})
.groupby('period')['value'].mean()
)
output:
period
1981-2011 3.771785
2011-2041 NaN
2041-2071 NaN
2071-2101 27.350056
2101-2131 NaN
Name: value, dtype: float64
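The expected output in the question asks for the per-month averages within each period (January, February, ...), which the snippet above doesn't produce directly. A sketch along the same lines (the intermediate names monthly and result are mine) could be:
# total per calendar month, tagged with its climate period
monthly = (df
    .assign(period=pd.cut(df['year'], bins=years.astype(int), labels=labels, right=False))
    .groupby(['period', df['date'].dt.to_period('M')], observed=True)['value'].sum()
    .reset_index()
)
# average each month-of-year within its 30-year period
result = monthly.groupby(['period', monthly['date'].dt.month.rename('month')], observed=True)['value'].mean()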
older answer
The expected output is not fully clear, but if you want average precipitation per quarter per year:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['quarter'] = df['date'].dt.to_period('Q')
df.groupby('quarter')['value'].mean()
output:
quarter
1981Q1 0.754357
2100Q4 5.470011
Freq: Q-DEC, Name: value, dtype: float64
or per quarter globally:
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['quarter'] = df['date'].dt.quarter
df.groupby('quarter')['value'].mean()
output:
quarter
1 0.754357
4 5.470011
Name: value, dtype: float64
NB. you can do the same for other periods. For months use to_period('M') / .dt.month
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df['period'] = df['date'].dt.to_period('M')
df.groupby('period')['value'].mean()
output:
period
1981-01 0.754357
2100-12 5.470011
Freq: M, Name: value, dtype: float64
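And the month-of-year variant mentioned in the note above, grouping on .dt.month (equivalent here to the existing month column):
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])
df.groupby(df['date'].dt.month)['value'].mean()
output:
date
1     0.754357
12    5.470011
Name: value, dtype: float64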
I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
df = pd.read_csv("covid.csv", error_bad_lines=False, sep=";")
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to setup df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
With this, a date of 2021-01-01 gets year 2020, while 2021-01-04 goes back to year 2021.
This mirrors how you used dt.isocalendar().week to set up df["Week"]. Since both come from the same (year, week, day) tuple returned by dt.isocalendar(), they will always be in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0   2021-01-01
1   2021-01-02
2   2021-01-03
3   2021-01-04
4   2021-01-05
dtype: datetime64[ns]
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
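Building on that, a compact way to do the weekly grouping from the question directly on the isocalendar fields (a sketch, reusing the question's column names):
iso = df["Date_of_publication"].dt.isocalendar()
weekly = df.groupby([iso["year"], iso["week"], df["Municipality_name"]])["Deceased"].sum()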
You can simply subtract the two dates and then divide the days attribute of the timedelta object by 7.
For example, this is the time elapsed since the start of 2021; dividing its days attribute by 7 gives the current week number:
import datetime as dt

time_delta = dt.datetime.today() - dt.datetime(2021, 1, 1)
The output is a datetime.timedelta object:
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you'd do something like this, subtracting the start of each row's year and integer-dividing the day count by 7:
time_delta = (df["Date_of_publication"] - pd.to_datetime(df["Year"].astype(str), format="%Y")).dt.days // 7
The result is, for each row, the number of whole weeks elapsed in that year up to Date_of_publication.
Suppose I have a start and end dates like so:
start_d = datetime.date(2017, 7, 20)
end_d = datetime.date(2017, 9, 10)
I wish to obtain a Pandas DataFrame that looks like this:
Month NumDays
2017-07 12
2017-08 31
2017-09 10
It shows the number of days in each month that is contained in my range.
So far I can generate the monthly series with pd.date_range(start_d, end_d, freq='MS').
You can first build a date_range with the default daily frequency, create a Series from it, and resample with size. Finally, convert the index to a monthly period with to_period:
import datetime as dt
start_d = dt.date(2017, 7, 20)
end_d = dt.date(2017, 9, 10)
s = pd.Series(index=pd.date_range(start_d, end_d), dtype='float64')
df = s.resample('MS').size().rename_axis('Month').reset_index(name='NumDays')
df['Month'] = df['Month'].dt.to_period('m')
print (df)
Month NumDays
0 2017-07 12
1 2017-08 31
2 2017-09 10
Thanks to Zero for the simplified version:
df = s.resample('MS').size().to_period('m').rename_axis('Month').reset_index(name='NumDays')
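As a side note (not from the answers above), the same counts can also be read straight off the daily date_range by converting it to monthly periods and counting:
counts = pd.date_range(start_d, end_d).to_period('M').value_counts().sort_index()
df = counts.rename_axis('Month').reset_index(name='NumDays')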