Pandas: Extracting values from a DatetimeIndex

I have a Pandas DataFrame whose row and column indexes are both a DatetimeIndex.
import pandas as pd

data = pd.DataFrame(
    {
        "PERIOD_END_DATE": pd.date_range(start="2018-01", end="2018-04", freq="M"),
        "first": list("abc"),
        "second": list("efg"),
    }
).set_index("PERIOD_END_DATE")
data.columns = pd.date_range(start="2018-01", end="2018-03", freq="M")
data
Unfortunately, I am getting a variety of errors when I try to pull out a value:
data['2018-01', '2018-02'] # InvalidIndexError: ('2018-01', '2018-02')
data['2018-01', ['2018-02']] # InvalidIndexError: ('2018-01', ['2018-02'])
data.loc['2018-01', '2018-02'] # TypeError: only integer scalar arrays can be converted to a scalar index
data.loc['2018-01', ['2018-02']] # KeyError: "None of [Index(['2018-02'], dtype='object')] are in the [columns]"
How do I extract a value from a DataFrame that uses a DatetimeIndex?

There are two issues:
Since you are using a DataFrame with a DatetimeIndex, the correct notation for selecting across rows and columns is either:
a) data.loc[row_index_name, [column_index_name]]
or
b) data.loc[row_index_name, column_index_name]
depending on the type of output you want.
Notation (a) returns a Series, while notation (b) returns a scalar (here, a string).
The index labels cannot be truncated; you must specify the whole string.
As such, your issue is resolved with:
data.loc['2018-01-31', ['2018-01-31']] or data.loc['2018-01-31', '2018-01-31']
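For example, on the data frame from the question (a quick sketch; note that both labels are full timestamps):
data.loc['2018-01-31', '2018-01-31']    # scalar: 'a'
data.loc['2018-01-31', ['2018-01-31']]  # Series with the single entry 'a'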

Once you have set the date as the index, you will not be able to slice or extract parts of it the way you can with a regular column. You can extract the month and day while it is a regular column, not when it is the index. I had this problem before, and that was the solution.
I kept the date as a regular column, extracted the month, day, and year into a separate column each, and then assigned the date column as the index.
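A minimal sketch of that workflow, with hypothetical 'date' and 'val' columns:
df = pd.DataFrame({'date': pd.date_range('2018-01-31', periods=3, freq='M'),
                   'val': list('abc')})
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['year'] = df['date'].dt.year
df = df.set_index('date')  # set the index only after extracting the parts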

You are accessing with a period-style string (YYYY-MM) on columns that hold full dates.
Converting the columns to a PeriodIndex would help in this case:
data.columns = pd.period_range(start="2018-01", end="2018-02", freq='M')
data[['2018-01']]
                2018-01
PERIOD_END_DATE
2018-01-31            a
2018-02-28            b
2018-03-31            c
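With PeriodIndex columns, a single value can then be pulled out directly, e.g. (assuming the same data as above):
data.loc['2018-01-31', '2018-01']  # 'a'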

Timestamp indexes are finicky. Pandas accepts each of the following expressions, but they return different types.
data.loc['2018-01', ['2018-01-31']]     # DataFrame: partial-string row match, list of columns
data.loc['2018-01-31', ['2018-01-31']]  # Series: exact row, list of columns
data.loc['2018-01', '2018-01-31']       # Series: partial-string row match, exact column
data.loc['2018-01-31', '2018-01']       # Series: exact row, partial-string column match
data.loc['2018-01-31', '2018-01-31']    # scalar: exact row and column

Related

Remove a dtype data from pandas dataframe column

I have a dataframe where date and datetime values were added to a column that was expected to contain strings. What would be the best way to filter out all date and datetime values from a pandas dataframe column and replace those values with blanks?
Thank you!
In general, if you provided a minimal working example of your problem, one could help more specifically, but assuming you have the following column:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=(10, 1)), columns=["Mixed"])
df["Mixed"] = "foobar"
df.loc[2, "Mixed"] = pd.to_datetime("2022-08-22")
df.loc[7, "Mixed"] = pd.to_datetime("2022-08-21")
#print("Before Fix", df)
You can use apply(type) on the column to obtain the data type of each cell, and then use a list comprehension, [x != str for x in types], to check whether each cell's datatype is a string or not. After that, just replace the values that are not the desired datatype with a value of your choosing.
types = df["Mixed"].apply(type).values
mask = [x!=str for x in types]
df.loc[mask,"Mixed"] = "" #Or None, or whatever you want to overwrite it with
#print("After Fix", df)

Dataframe sum(axis=1) is returning Nan Values

I'm trying to compute the sum of the second column ('ALL_PPA'), grouping by Numéro_département.
Here's my code:
df.fillna(0,inplace=True)
df = df.loc[:, ('Numéro_département','ALL_PPA')]
df = df.groupby('Numéro_département').sum(axis=1)
print(df)
My DF is full of numbers and I don't have any NaN values, but when I apply df.sum(axis=1), some rows end up with a NaN value.
Here's how my table looks before sum() and after (screenshots omitted).
My question is: how am I supposed to do this? I've tried to use the numpy library, but it doesn't work the way I want it to.
Drop the first row of that dataframe, as it just has the column names in it, and convert the rest to int. Right now it is an object dtype because of the mixed data types:
df2 = df.iloc[1:].astype(int).copy()
Then, apply groupby.sum() and specify the column as well:
df3 = df2.groupby('Numéro_département')['ALL_PPA'].sum()
I think using .dropna() before summing the DF will help remove any rows or columns (depending on the axis you choose) with NaN values. Based on the screenshot provided, also drop the first line of the DF, as it is a string.
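A minimal, runnable sketch of the astype + groupby fix with made-up data (the real values come from the asker's screenshots):
import pandas as pd

df = pd.DataFrame({
    'Numéro_département': ['01', '01', '02'],
    'ALL_PPA': ['10', '20', '30'],  # strings (object dtype), as in the screenshot
})
df['ALL_PPA'] = df['ALL_PPA'].astype(int)  # object -> int so sum() adds numbers, not strings
print(df.groupby('Numéro_département')['ALL_PPA'].sum())
# Numéro_département
# 01    30
# 02    30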

pyspark cast all columns of a certain data type to another

I have a data frame with a certain number of date columns. I want to cast them all to timestamp, without having to worry about the exact names of the columns. So what I want is something along the lines of: "Cast all date columns to timestamp and keep the same column names".
I know that for one column it would be:
df = df.withColumn('DATUM', df['DATUM'].cast('timestamp'))
You can use a loop and detect when the type is date and perform the cast only for those cases.
for col in df.dtypes:
    if col[1] == 'date':
        df = df.withColumn(col[0], df[col[0]].cast('timestamp'))
You can also use a list comprehension inside select and cast to timestamp (note that this version casts every column):
df.select(
    *[df[col_name].cast('timestamp') for col_name in df.columns]
)
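If you want the single-select style but only for the date columns, the two answers can be combined (a sketch, using df.dtypes as above):
df = df.select(
    *[df[c].cast('timestamp') if t == 'date' else df[c] for c, t in df.dtypes]
)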

using 'to_datetime' on str object - in order to change dataframe column names

This is a follow-up question to the one asked here.
I have about 200 column names in a dataframe which need to be converted to datetime format.
My initial thought was to create a list of the column names, iterate through the list converting them as I go, and then rename the columns of the dataframe using this list of converted names. But from the previous question, I am not sure whether I can apply to_datetime to a regular string element, so this method won't work.
Is there anyway to easily convert all columns, or at least, selected columns, with to_datetime?
I do not see an axis to choose in the documentation:
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix')
The to_datetime function works with a Series (a column of a DataFrame), so possible solutions are:
df = df.apply(pd.to_datetime)
#alternative
#df = df.apply(lambda x: pd.to_datetime(x))
Or:
for c in df.columns:
    df[c] = pd.to_datetime(df[c])
And to convert the column names:
df.columns = pd.to_datetime(df.columns)
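If only some columns should be converted, the same idea works on a subset (hypothetical column names):
cols = ['date_a', 'date_b']
df[cols] = df[cols].apply(pd.to_datetime)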

Pandas filtering - between_time on a non-index column

I need to filter out data outside specific hours. The DataFrame method between_time seems to be the proper way to do that; however, it only works on the index of the dataframe, and I need the data in its original format (e.g. pivot tables will expect the datetime column under its proper name, not as the index).
This means that each filter looks something like this:
df.set_index(keys='my_datetime_field').between_time('8:00','21:00').reset_index()
Which implies that there are two reindexing operations every time such a filter is run.
Is this a good practice or is there a more appropriate way to do the same thing?
Create a DatetimeIndex, but store it in a variable, not the DataFrame.
Then call its indexer_between_time method. This returns an integer array which can then be used to select rows from df using iloc:
import pandas as pd
import numpy as np

N = 100
df = pd.DataFrame(
    {'date': pd.date_range('2000-1-1', periods=N, freq='H'),
     'value': np.random.random(N)})

index = pd.DatetimeIndex(df['date'])
df.iloc[index.indexer_between_time('8:00', '21:00')]
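An alternative sketch that avoids building a DatetimeIndex at all: compare the time component of the column directly (both this and indexer_between_time include the endpoints by default).
from datetime import time

mask = df['date'].dt.time.between(time(8, 0), time(21, 0))
df[mask]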
