a['Year'] = a['Date'].dt.year creates an additional .0 [duplicate]

This question already has answers here:
Convert Pandas column containing NaNs to dtype `int`
(27 answers)
Closed 1 year ago.
I extracted the year from the date column and added it as a new column to the DataFrame.
I need it to be 2001, but it shows up as 2001.0.
Where is the .0 coming from?
This is the output:
Datum LebensverbrauchMIN ... Lastfaktor Jahr
0 2001-01-01 00:00:00 0.001986 ... 0.249508 2001.0
1 2001-01-01 00:01:00 0.000839 ... 0.249847 2001.0
2 2001-01-01 00:02:00 0.000387 ... 0.250186 2001.0
# Read in data
import pandas as pd

InnenTemp = ["LebensverbrauchMIN", "HPT", "Innentemperatur", "Verlustleistung", "SolarEintrag", "Lastfaktor"]
Klima1min = pd.read_csv("Klima_keinPV11.csv", names=InnenTemp,
                        skiprows=0)
Datum = pd.read_csv("Klima_Lufttemp_GLobalstrahlung_Interpoliert_1min.csv", usecols=["Datum"],
                    skiprows=0)
Luft = pd.read_csv("Klima_Lufttemp_GLobalstrahlung_Interpoliert_1min.csv", usecols=["Lufttemperatur"],
                   skiprows=0)
frames = [Datum, Klima1min]
a = pd.concat(frames, axis=1)
a['Datum'] = pd.to_datetime(a['Datum'], format="%Y-%m-%dT%H:%M:%S")
# Note: the result is not assigned, so 'Datum' stays a regular column
a.set_index('Datum')
# Extract year from date (tried both lines)
a['Jahr'] = pd.DatetimeIndex(a['Datum']).year
#a['Jahr'] = a['Datum'].dt.year
print(a)

If a DataFrame column contains a missing value, pandas stores the column as a float dtype, because NaN is itself a float and the default integer dtype cannot hold it. This only affects integer columns; string columns keep their dtype.
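A minimal sketch of the effect and one common workaround, assuming a pandas version that provides the nullable `Int64` dtype (0.24 or later). The sample dates are made up for illustration:

```python
import pandas as pd

# Hypothetical sample: one valid date and one missing value (NaT)
dates = pd.Series(pd.to_datetime(["2001-01-01", None]))

# The missing date becomes NaN, so the whole column is stored as float
years = dates.dt.year
print(years.dtype)

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>
years_int = years.astype("Int64")
print(years_int)
```

If the column should not contain gaps at all, it may be worth checking where the missing dates come from before converting.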

Related

replacing/re-assign pandas value with new value

I wanted to re-assign/replace my new value, from my current
20000123
19850123
19880112
19951201
19850123
20190821
20000512
19850111
19670133
19850123
As you can see, there is a value of 19670133 (YYYYMMDD), which is not a valid date since no month has 33 days. I want to re-assign it to the end of the month. Computing the end of the month works.
But when I try to replace the old value with the new one, I run into a problem.
This is what I've tried:
for x in df_tmp_customer['date']:
    try:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x), axis=1)
    except Exception:
        df_tmp_customer['date'] = df_tmp_customer.apply(pd.to_datetime(x[0:6]+"01")+ pd.offsets.MonthEnd(n=0), axis=1)
This is the part that moves the date to the end of the month:
pd.to_datetime(x[0:6]+"01")+ pd.offsets.MonthEnd(n=0)

This is probably not efficient on a large dataset, but it can be done using pendulum.parse():
import datetime
import pendulum

def parse_dates(x: str) -> datetime.date:
    # Decrement the number until the string parses as a valid date
    i = 0
    while True:
        try:
            return pendulum.parse(str(int(x) - i)).date()
        except ValueError:
            i += 1

df["date"] = df["date"].apply(parse_dates)
print(df)
date
0 2000-01-23
1 1985-01-23
2 1988-01-12
3 1995-12-01
4 1985-01-23
5 2019-08-21
6 2000-05-12
7 1985-01-11
8 1967-01-31
9 1985-01-23

For a vectorized solution, you can use:
# try to convert to YYYYMMDD
date1 = pd.to_datetime(df['date'], format='%Y%m%d', errors='coerce')
# get rows for which conversion failed
m = date1.isna()
# try to get end of month
date2 = pd.to_datetime(df.loc[m, 'date'].str[:6], format='%Y%m', errors='coerce').add(pd.offsets.MonthEnd())
# Combine both
df['date2'] = date1.fillna(date2)
NB: this assumes df['date'] has string dtype. If it has integer dtype instead, use df.loc[m, 'date'].floordiv(100) in place of df.loc[m, 'date'].str[:6].
Output:
date date2
0 20000123 2000-01-23
1 19850123 1985-01-23
2 19880112 1988-01-12
3 19951201 1995-12-01
4 19850123 1985-01-23
5 20190821 2019-08-21
6 20000512 2000-05-12
7 19850111 1985-01-11
8 19670133 1967-01-31 # invalid replaced by end of month
9 19850123 1985-01-23

How to create a dataframe with pandas.date_range for previous years?

I want to create a DataFrame with dates from previous years, for example something like this:
df = pd.DataFrame({'Years': pd.date_range('2021-09-21', periods=-5, freq='Y')})
but a negative periods value is not supported. How can I achieve that?

Use the end parameter of date_range and then add a DateOffset:
d = pd.to_datetime('2021-09-21')
df = pd.DataFrame({'Years': pd.date_range(end=d, periods=5, freq='Y') +
                            pd.DateOffset(day=d.day, month=d.month)})
print(df)
Years
0 2016-09-21
1 2017-09-21
2 2018-09-21
3 2019-09-21
4 2020-09-21
Or, if the current year should also be included as the last value of the column, use YS (year start):
d = pd.to_datetime('2021-09-21')
df = pd.DataFrame({'Years': pd.date_range(end=d, periods=5, freq='YS') +
                            pd.DateOffset(day=d.day, month=d.month)})
print(df)
Years
0 2017-09-21
1 2018-09-21
2 2019-09-21
3 2020-09-21
4 2021-09-21
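As an alternative sketch that avoids the freq alias altogether (newer pandas releases deprecate 'Y' in favor of 'YE'), the dates can be built by subtracting whole years from the anchor date. The variable names are illustrative only:

```python
import pandas as pd

d = pd.Timestamp("2021-09-21")

# Subtract 5, 4, ..., 1 years from the anchor date, oldest first
years = [d - pd.DateOffset(years=i) for i in range(5, 0, -1)]
df = pd.DataFrame({"Years": years})
print(df)
```

Using range(4, -1, -1) instead would include the anchor year itself as the last row.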

Python Timedelta[M] adds incomplete days

I have a table that has a column Months_since_Start_fin_year and a Date column. I need to add the number of months in the first column to the date in the second column.
DateTable['Date']=DateTable['First_month']+DateTable['Months_since_Start_fin_year'].astype("timedelta64[M]")
This works OK for month 0, but month 1 already has a different time and for month 2 onwards has the wrong date.
[Image: output table where early months have the correct date, but month 2, where I would expect June 1st, actually shows May 31st]
It must be adding incomplete months, but I'm not sure how to fix it?
I have also tried
DateTable['Date']=DateTable['First_month']+relativedelta(months=DateTable['Months_since_Start_fin_year'])
but I get a type error that says
TypeError: cannot convert the series to <class 'int'>
My Months_since_Start_fin_year is type int32 and my First_month variable is datetime64[ns]

The problem with adding months as an offset to a date is that not all months are equally long (28-31 days). So you need pd.DateOffset which handles that ambiguity for you. .astype("timedelta64[M]") on the other hand only gives you the average days per month within a year (30 days 10:29:06).
Ex:
import pandas as pd
# a synthetic example since you didn't provide a mre
df = pd.DataFrame({'start_date': 7*['2017-04-01'],
                   'month_offset': range(7)})
# make sure we have datetime dtype
df['start_date'] = pd.to_datetime(df['start_date'])
# add month offset
df['new_date'] = df.apply(lambda row: row['start_date'] +
                                      pd.DateOffset(months=row['month_offset']),
                          axis=1)
which would give you e.g.
df
start_date month_offset new_date
0 2017-04-01 0 2017-04-01
1 2017-04-01 1 2017-05-01
2 2017-04-01 2 2017-06-01
3 2017-04-01 3 2017-07-01
4 2017-04-01 4 2017-08-01
5 2017-04-01 5 2017-09-01
6 2017-04-01 6 2017-10-01
You can find similar examples here on SO, e.g. Add months to a date in Pandas. I only modified the answer there by using an apply to be able to take the months offset from one of the DataFrame's columns.
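To illustrate why a calendar-aware offset matters, here is a small sketch with made-up dates showing how pd.DateOffset handles months of different lengths:

```python
import pandas as pd

start = pd.Timestamp("2017-04-01")
# Two calendar months later, regardless of how long April and May are
two_months = start + pd.DateOffset(months=2)
print(two_months)

# When the target month is shorter, the day is clipped to its last day
end_of_jan = pd.Timestamp("2017-01-31")
clipped = end_of_jan + pd.DateOffset(months=1)
print(clipped)
```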

New column from slices of a column in pandas

I am working with time series data with pandas and my data frame looks a little bit like this
Date Layer
0 2000-01-01 0.408640
1 2000-01-02 0.842065
2 2000-01-03 1.271810
3 2000-01-04 1.699399
4 2000-01-05 2.128098
... ...
7300 2019-12-27 149.323520
7301 2019-12-28 149.744012
7302 2019-12-29 150.155702
7303 2019-12-30 150.562771
7304 2019-12-31 151.003031
I need to make a column for each year, like this:
2000 2001 2002
0 0.408640 0.415863 0.425689
1 0.852653 0.826542 0.863524
... ... ...
364 156.235978 158.564578 152.135689
365 156.685421 158.924556 152.528978
Is there a way I can manage to do that? The resulting data can be in a new data frame

The approach for this will be to create separate year and day of year columns, and then create a pivot table:
#Convert Date column to pandas datetime if you haven't already:
df['Date'] = pd.to_datetime(df['Date'])
#Create year column
df['Year'] = df['Date'].dt.year
#Create day of year column
df['DayOfYear'] = df['Date'].dt.dayofyear
#create pivot table in new dataframe
df2 = pd.pivot_table(df, index = 'DayOfYear', columns = 'Year', values = 'Layer')
This won't look exactly like your desired output because the index will be numbered from 1 (up to 366, since the data includes leap years) and will have a name, rather than starting at 0. If you want it to match exactly, you can add:
df2 = df2.reset_index()
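A self-contained sketch of the same pivot on synthetic data (the Layer values below are invented). It also shows what happens with leap years: day 366 exists only for 2000, so the 2001 column holds NaN in that row.

```python
import numpy as np
import pandas as pd

# Synthetic daily series covering a leap year (2000) and a normal year (2001)
dates = pd.date_range("2000-01-01", "2001-12-31", freq="D")
df = pd.DataFrame({"Date": dates, "Layer": np.arange(len(dates), dtype=float)})

df["Year"] = df["Date"].dt.year
df["DayOfYear"] = df["Date"].dt.dayofyear

# One row per day of year, one column per year
df2 = df.pivot_table(index="DayOfYear", columns="Year", values="Layer")
print(df2.shape)
print(df2.tail(2))
```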

Pandas KeyError when using .loc() [duplicate]

This question already has answers here:
How are iloc and loc different?
(6 answers)
Closed 2 years ago.
I have a pandas DataFrame portfolio whose keys are dates. I'm trying to access multiple rows through
print(portfolio.loc[['2007-02-26','2008-02-06'],:]),
but am getting an error
KeyError: "None of [Index(['2007-02-26', '2008-02-06'], dtype='object', name='Date')] are in the [index]"
However, print(portfolio.loc['2007-02-26',:]) successfully returns
holdings 1094.6124
pos_diff 100.0000
cash 98905.3876
total 100000.0000
returns 0.0000
Name: 2007-02-26 00:00:00, dtype: float64
Isn't this a valid format: df.loc[['key1', 'key2', 'key3'], 'Column1']?

It seems that the issue is with type conversion from strings to timestamps. The solution is, therefore, to explicitly convert the set of labels to DateTime before passing them to loc:
df = pd.DataFrame({"a" : range(5)}, index = pd.date_range("2020-01-01", freq="1D", periods=5))
print(df)
==>
a
2020-01-01 0
2020-01-02 1
2020-01-03 2
2020-01-04 3
2020-01-05 4
try:
    df.loc[["2020-01-01", "2020-01-02"], :]
except Exception as e:
    print(e)
==>
"None of [Index(['2020-01-01', '2020-01-02'], dtype='object')] are in the [index]"
# But - if you convert the labels to datetime before calling loc,
# it works fine.
df.loc[pd.to_datetime(["2020-01-01", "2020-01-02"]), :]
===>
a
2020-01-01 0
2020-01-02 1
