Python duplicates date values for week number conversion in leap year - python

I have a dataframe where the date column is in the format '%Y-W%W-%w'. I am converting values like 2018-W01 to an actual date using pd.to_datetime(urldict[key]['date']+'-1', format='%Y-W%W-%w'), but the data appears to be shifted incorrectly for 2020/2021, presumably because of the leap year.
As a result, it creates two entries for 2021-01-04, the first of which should be 2020-W53. The data going back is also misaligned.
I'm not sure how to fix this as I assumed that the datetime library would account for it.
Pre-conversion:
date region total
2020-W51 africa 1
2020-W52 africa 2
2020-W53 africa 3
2021-W01 africa 4
Post-conversion:
date region total
12/21/2020 africa 1
12/28/2020 africa 2
1/4/2021 africa 3
1/4/2021 africa 4

It seems you need the ISO 8601 year/week/weekday directives, so the correct format string would be '%G-W%V-%u' (see the strftime/strptime docs, end of the format-codes section). For
df
date region total
0 2020-W51 africa 1
1 2020-W52 africa 2
2 2020-W53 africa 3
3 2021-W01 africa 4
that would look like
pd.to_datetime(df['date']+'-1', format='%G-W%V-%u')
0 2020-12-14
1 2020-12-21
2 2020-12-28
3 2021-01-04
Name: date, dtype: datetime64[ns]
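For context, a quick sketch of why the non-ISO directives collapse 2020-W53 and 2021-W01 onto the same date (the series below is just the 'date' column from the question):
import pandas as pd

weeks = pd.Series(['2020-W51', '2020-W52', '2020-W53', '2021-W01'])

# %W counts weeks from the first Monday of the calendar year, so
# '2020-W53-1' and '2021-W01-1' both parse to Monday 2021-01-04
print(pd.to_datetime(weeks + '-1', format='%Y-W%W-%w'))

# the ISO directives keep 2020-W53 inside ISO year 2020, so every
# week maps to a distinct Monday
print(pd.to_datetime(weeks + '-1', format='%G-W%V-%u'))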
Related: Python - Get date from day of week, year, and week number


Randomly sample from panel data by 3-month periods

I have a pandas dataframe containing panel data, i.e. data for multiple customers over a timeframe. I want to sample (for bootstrapping) a continuous three-month period (I always want full months) from a random customer, 90 times.
I have googled a bit and found several sampling techniques, but none that covers sampling three continuous months.
I was considering just making a list of all the month names and sampling three consecutive ones (although I'm not sure how to enforce "consecutive"). But how would I then be able to pick, e.g., Nov21-Dec21-Jan22?
Would appreciate the help a lot!
import pandas as pd
# two years of daily dummy data
date_range = pd.date_range("2020-01-01", "2022-01-01")
df = pd.DataFrame({"value": 3}, index=date_range)
# draw 5 random rows from each quarter group (quarters pooled across years)
df.groupby(df.index.quarter).sample(5)
This would output:
Out[12]:
value
2021-01-14 3
2021-02-27 3
2020-01-20 3
2021-02-03 3
2021-02-19 3
2021-04-27 3
2021-06-29 3
2021-04-12 3
2020-06-24 3
2020-06-05 3
2021-07-30 3
2020-08-29 3
2021-07-03 3
2020-07-17 3
2020-09-12 3
2020-12-22 3
2021-12-13 3
2021-11-29 3
2021-12-19 3
2020-10-18 3
It selected 5 sample values from each quarter group.
From here on you can format the date column (the index) so that it displays the month as text.
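The quarter-based sample above doesn't by itself give a random customer with three consecutive full months, though. If that is the requirement (e.g. picking Nov21-Dec21-Jan22), one possible approach is to choose a random start month and slice a 3-month period window; a rough sketch, assuming hypothetical customer_id and date columns and that each customer has at least three months of data:
import numpy as np
import pandas as pd

rng = np.random.default_rng()

def sample_three_months(df):
    # pick a random customer (hypothetical 'customer_id' column)
    customer = rng.choice(df["customer_id"].unique())
    sub = df[df["customer_id"] == customer]

    # that customer's months, as monthly periods (hypothetical datetime 'date' column)
    months = sub["date"].dt.to_period("M").sort_values().unique()

    # random start month that still leaves two following months
    start = months[rng.integers(len(months) - 2)]
    window = pd.period_range(start, periods=3, freq="M")

    # keep only rows whose month falls inside the 3-month window
    return sub[sub["date"].dt.to_period("M").isin(window)]

# repeat 90 times for bootstrapping
samples = [sample_three_months(df) for _ in range(90)]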

Add new column based on two conditions

I have the following table in python:
Country Year Date
Spain 2020 2020-08-10
Germany 2020 2020-08-10
Italy 2019 2020-08-11
Spain 2019 2020-08-20
Spain 2020 2020-06-10
I would like to add a new column that gives 1 if it's the first date of the year in a country and 0 if it's not the first date.
I've tried to write a function, but I'm conscious that it doesn't really make sense:
def first_date(x, country, year):
    if df["date"] == df[(df["country"] == country) & (df["year"] == year)]["date"].min():
        x == 1
    else:
        x == 0
There are many ways to achieve this. Let's create a groupby object to get the index of each country's minimum date so we can do some assignment using .loc.
As an aside, using if with pandas is usually an anti-pattern: there are native pandas functions that achieve the same thing while taking advantage of the vectorised code base under the hood.
Recommend reading: https://pandas.pydata.org/docs/user_guide/10min.html
df.loc[df.groupby(['Country'])['Date'].idxmin(), 'x'] = 1
df['x'] = df['x'].fillna(0)
Country Year Date x
0 Spain 2020 2020-08-10 0.0
1 Germany 2020 2020-08-10 1.0
2 Italy 2019 2020-08-11 1.0
3 Spain 2019 2020-08-20 0.0
4 Spain 2020 2020-06-10 1.0
or using np.where with df.index.isin
import numpy as np
df['x'] = np.where(
    df.index.isin(df.groupby(['Country'])['Date'].transform('idxmin')), 1, 0)
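Note that the question asks for the first date per country and per year; under that reading, the same pattern simply groups on both columns. A minimal sketch with that assumption:
df.loc[df.groupby(['Country', 'Year'])['Date'].idxmin(), 'x'] = 1
df['x'] = df['x'].fillna(0)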

Pandas select rows by multiple conditions on columns

I would like to reduce my code: instead of 2 lines, I would like to select rows by 3 conditions on 2 columns.
My DataFrame contains each country's population between 2000 and 2018, by granularity (Total, Female, Male, Urban, Rural):
Zone Granularity Year Value
0 Afghanistan Total 2000 20779.953
1 Afghanistan Male 2000 10689.508
2 Afghanistan Female 2000 10090.449
3 Afghanistan Rural 2000 15657.474
4 Afghanistan Urban 2000 4436.282
20909 Zimbabwe Total 2018 14438.802
20910 Zimbabwe Male 2018 6879.119
20911 Zimbabwe Female 2018 7559.693
20912 Zimbabwe Rural 2018 11465.748
20913 Zimbabwe Urban 2018 5447.513
I would like all rows of the Year 2017 with granularity Total AND Urban.
I tried the line below, but it doesn't work, even though each condition works fine on its own.
df.loc[(df['Granularity'].isin(['Total', 'Urban'])) & (df['Year'] == '2017')]
Thanks for any tips.
Very likely, you're using the wrong type for the year. I imagine these are integers.
You should try:
df.loc[(df['Granularity'].isin(['Total', 'Urban'])) & df['Year'].eq(2017)]
output (for the Year 2018 as 2017 is missing from the provided data):
Zone Granularity Year Value
20909 Zimbabwe Total 2018 14438.802
20913 Zimbabwe Urban 2018 5447.513
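To confirm the type mismatch before filtering, checking the dtype is a quick sanity test (a minimal sketch; int64 is just the likely case):
print(df['Year'].dtype)   # likely int64, so comparing against the string '2017' matches nothing
mask = df['Granularity'].isin(['Total', 'Urban']) & df['Year'].eq(2017)
df.loc[mask]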

use only year for datetime index - pandas dataframe

I've converted my year column into a datetime index; however, the month and day are inaccurate and unneeded, since my dataset only includes the year. I've used the format parameter to request the year only, but the index still shows a full "%Y-%m-%d" date.
Original data:
index song year artist genre
0 0 ego-remix 2009 beyonce knowles Pop
1 1 shes-tell-me 2009 save Rock
2 2 hello 2009 yta Pop
3 3 the rock 2009 term R&B
4 4 black-culture 2009 hughey Country
I conducted a few more scrubbing steps on the data above. Here is my code, followed by example rows from the resulting dataframe:
clean_df.index = pd.to_datetime(clean_df['year'], format='%Y')
clean_df = clean_df.drop(['index', 'year'], 1)
clean_df.sort_index(inplace=True)
clean_df.head()
year song artist genre
1970-01-01 hey now caravan Rock
1970-01-01 show me abc Rock
1970-01-01 hey now xyz Pop
1970-01-01 tell me foxy R&B
1970-01-01 move up curtis R&B
Is there any other method to be used to set index as annual only?
You were close:
clean_df.index = pd.to_datetime(clean_df['year'], format='%Y-%m-%d').dt.year
It's hard to give the exact format string without your original data, but you just need to convert to a datetime and then take its year attribute (via the .dt accessor on a Series).
I had a similar issue. Solved it this way:
import numpy as np

df['Year'] = df.Year.astype(np.datetime64)
df['Year'] = df.Year.dt.year
df = df.set_index('Year')
Output should only show the year with 4 digits.
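Putting the first suggestion together on a tiny made-up dataframe (a minimal sketch; it assumes the year column holds strings such as '2009'):
import pandas as pd

clean_df = pd.DataFrame({'year': ['2009', '2009', '1970'],
                         'song': ['ego-remix', 'hello', 'hey now']})

# parse the year, then keep only the integer year as the index
clean_df.index = pd.to_datetime(clean_df['year'], format='%Y').dt.year
clean_df = clean_df.drop(columns='year').sort_index()
print(clean_df.index)   # integer index of plain 4-digit years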

AttributeError: 'TimedeltaProperties' object has no attribute 'years' in Pandas

In Pandas, why does a TimedeltaProperties object have no attribute 'years'?
After all, the datetime object has this property.
It seems like a very natural thing for an object that is concerned with time to have. Especially if it already has an hours, seconds, etc attribute.
Is there a workaround so that my column, which is full of values like 10060 days, can be converted to years? Or better yet, just converted to an integer representation of years?
TimedeltaProperties has no year or month attribute because, according to the TimedeltaProperties source code, it is an
"Accessor object for datetimelike properties of the Series values."
and months and years have no constant definition:
1 month can be a different number of days depending on the month itself (January -> 31 days, April -> 30 days, etc.).
1 month can also differ based on the year (February has 29 days in 2004 but 28 days in 2003).
The same applies to years: 2003 has 365 days, while 2004 has 366.
Hence a requirement like "convert 10060 days to years" is not well defined: which years?
As stated above, the exact number of years that a given number of days corresponds to depends on which actual years those days span.
This workaround gets you closer.
round((df["Accident Date"] - df["Iw Date Of Birth"]).dt.days / 365, 1)
I used .astype('timedelta64[Y]') and then .astype('int') to get integer years:
df['age'] = (pd.Timestamp('now') - df.birthDate).astype('timedelta64[Y]').astype('int')
Output:
nflId height weight birthDate collegeName position displayName age
0 2539334 72 190 1990-09-10 Washington CB Desmond Trufant 30
1 2539653 70 186 1988-11-01 Southeastern Louisiana CB Robert Alford 31
2 2543850 69 186 1991-12-18 Purdue SS Ricardo Allen 28
3 2555162 73 227 1994-11-04 Louisiana State MLB Deion Jones 25
4 2555255 75 232 1993-07-01 Minnesota OLB DeVondre Campbell 27
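If an exact whole-year count matters (and .astype('timedelta64[Y]') may no longer be available in recent pandas versions), one alternative is to compare the calendar fields directly; a rough sketch using a made-up birthDate column like the one above:
import pandas as pd

df = pd.DataFrame({"birthDate": pd.to_datetime(["1990-09-10", "1991-12-18"])})
today = pd.Timestamp("now")

# whole years elapsed: subtract one when the birthday hasn't occurred yet this year
before_birthday = (df["birthDate"].dt.month > today.month) | (
    (df["birthDate"].dt.month == today.month) & (df["birthDate"].dt.day > today.day)
)
df["age"] = today.year - df["birthDate"].dt.year - before_birthday.astype(int)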
