Getting rid of a hierarchical index in Pandas - python

I have just pivoted a dataframe to create the dataframe below:
date        2012-10-31  2012-11-30
term
red          -4.043862   -0.709225
blue        -18.046630   -8.137812
green        -8.339924   -6.358016
The columns are supposed to be dates, and the left-most column is supposed to contain strings.
I want to run through the rows (using .apply()) and compare the values under each date column. The problem I am having is that I think the df has a hierarchical index.
Is there a way to give the whole df a new flat index (e.g. 0, 1, 2, etc.) without getting rid of the terms in the first column?
EDIT: When I try to use .reset_index() I get the error ending with 'AttributeError: 'str' object has no attribute 'view''.
EDIT 2: this is what the df looks like: (screenshot not reproduced here)
EDIT 3: here is the description of the df:
<class 'pandas.core.frame.DataFrame'>
Index: 14597 entries, 101016j to zymogens
Data columns (total 6 columns):
2012-10-31 00:00:00 14597 non-null values
2012-11-30 00:00:00 14597 non-null values
2012-12-31 00:00:00 14597 non-null values
2013-01-31 00:00:00 14597 non-null values
2013-02-28 00:00:00 14597 non-null values
2013-03-31 00:00:00 14597 non-null values
dtypes: float64(6)
Thanks in advance.

df = df.reset_index()
This will take the current index, make it a column, and then give you a fresh integer index starting from 0.
Adding example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'2012-10-31': [-4, -18, -18], '2012-11-30': [-0.7, -8, -6]}, index=['red', 'blue', 'green'])
df.index.name = 'term'
df
       2012-10-31  2012-11-30
term
red            -4        -0.7
blue          -18        -8.0
green         -18        -6.0
df.reset_index()
term 2012-10-31 2012-11-30
0 red -4 -0.7
1 blue -18 -8.0
2 green -18 -6.0
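To connect this back to the asker's goal, here is a sketch of iterating over the now-flat rows with .apply(); the 'improved' column name and the comparison are my own illustration, not part of the original question:

```python
import pandas as pd

df = pd.DataFrame({'2012-10-31': [-4, -18, -18],
                   '2012-11-30': [-0.7, -8.0, -6.0]},
                  index=['red', 'blue', 'green'])
df.index.name = 'term'
df = df.reset_index()  # 'term' becomes a regular column; index is now 0, 1, 2

# Compare the two date columns row by row: True where the later month is higher
df['improved'] = df.apply(lambda row: row['2012-11-30'] > row['2012-10-31'],
                          axis=1)
print(df)
```

After reset_index() there is no hierarchical index left, so ordinary column access inside the lambda works for every date column.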

EDIT: When I try to use .reset_index() I get the error ending with 'AttributeError: 'str' object has no attribute 'view''.
Try converting your date columns to string columns first.
I think pandas refuses to reset_index() here because you are trying to reset a string index into a frame whose columns consist only of dates. If all the columns are dates, pandas handles them internally as a DatetimeIndex. When reset_index() is called, pandas tries to insert your string index as an additional column next to the date columns and fails somehow. It looks like a bug to me, but I'm not sure.
Example:
import pandas
t = pandas.DataFrame({pandas.to_datetime('2011'): [1, 2], pandas.to_datetime('2012'): [3, 4]}, index=['A', 'B'])
t
2011-01-01 00:00:00 2012-01-01 00:00:00
A 1 3
B 2 4
t.columns
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 00:00:00, 2012-01-01 00:00:00]
Length: 2, Freq: None, Timezone: None
t.reset_index()
...
AttributeError: 'str' object has no attribute 'view'
If you try it with string columns, it will work.
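A sketch of that workaround on the same toy frame, assuming you don't mind the columns becoming plain strings (the error described above was in older pandas; current versions reset the index fine either way, and stringifying the columns is harmless):

```python
import pandas as pd

t = pd.DataFrame({pd.to_datetime('2011'): [1, 2],
                  pd.to_datetime('2012'): [3, 4]},
                 index=['A', 'B'])

# Turn each Timestamp column label into a plain 'YYYY-MM-DD' string
t.columns = [c.strftime('%Y-%m-%d') for c in t.columns]

# With string columns, the string index resets without complaint
flat = t.reset_index()
print(flat)
```

The unnamed string index becomes a column called 'index'; rename it afterwards with flat.rename(columns={'index': 'term'}) if needed.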


How to remove rows in pandas of type datetime64[ns] by date?

I'm a pretty new user and started to use Python for my project.
I have a dataset whose first column has datetime64[ns] type:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5889 entries, 0 to 5888
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 5889 non-null datetime64[ns]
1 title 5889 non-null object
2 stock 5889 non-null object
dtypes: datetime64[ns](1), object(2)
memory usage: 138.1+ KB
and
type(BA['date'])
gives
pandas.core.series.Series
The date column has the format 2020-06-10.
I need to delete all rows before a specific date, for example 2015-09-09.
What I tried:
converting to string - failed
building conditions like .df.year <= %y & .df.month <= %m or comparing against ('%y-%m-%d')
creating a date with the datetime() method
creating a variable in datetime64 format
just copying with .loc() and .copy()
All of this failed with all kinds of errors: it's not an int, it's not a datetime, datetime is immutable, not this, not that.
I can't believe how counterintuitive this pandas format can be; for the first time, writing a CSV parser in C++ feels easier than using a ready-made Python library.
Thank you for understanding.
Toy Example
df = pd.DataFrame({'date':['2021-1-1', '2020-12-6', '2019-02-01', '2020-02-01']})
df.date = pd.to_datetime(df.date)
df
Input df
date
0 2021-01-01
1 2020-12-06
2 2019-02-01
3 2020-02-01
Delete the rows before 2020-01-01:
We select the rows whose dates fall after 2020-01-01 and ignore the older dates.
df.loc[df.date>'2020.01.01']
Output
date
0 2021-01-01
1 2020-12-06
3 2020-02-01
If we want the index to be reset as well
df = df.loc[df.date>'2020.01.01'].reset_index(drop=True)
df
Output
date
0 2021-01-01
1 2020-12-06
2 2020-02-01
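Applied directly to the asker's setup, a minimal sketch might look like this; the BA frame and its contents are invented here to match the column layout shown in the question, and the cutoff is the asker's 2015-09-09:

```python
import pandas as pd

# Invented data matching the question's columns: date, title, stock
BA = pd.DataFrame({
    'date': pd.to_datetime(['2014-01-05', '2015-09-09', '2020-06-10']),
    'title': ['old news', 'cutoff day', 'recent news'],
    'stock': ['BA', 'BA', 'BA'],
})

cutoff = pd.Timestamp('2015-09-09')

# Keep rows on or after the cutoff; everything strictly before it is dropped
BA = BA.loc[BA['date'] >= cutoff].reset_index(drop=True)
print(BA)
```

The key point is that a datetime64[ns] column compares cleanly against a pd.Timestamp (or a parseable date string), so no string conversion is needed.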

Selecting single row as dataframe with DatetimeIndex

I have a time series in a dataframe with DatetimeIndex like that:
import pandas as pd
dates = ["2015-10-01 00:00:00",
         "2015-10-01 01:00:00",
         "2015-10-01 02:00:00",
         "2015-10-01 03:00:00",
         "2015-10-01 04:00:00"]
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0,5)
Out[]:
values
2015-10-01 00:00:00 0
2015-10-01 01:00:00 1
2015-10-01 02:00:00 2
2015-10-01 03:00:00 3
2015-10-01 04:00:00 4
I would like to select a single row, as simply and cleanly as possible, keyed by the date, e.g. "2015-10-01 02:00:00", with a result looking like this:
Out[]:
values
2015-10-01 02:00:00 2
Simply using indexing results in a key error:
df["2015-10-01 02:00:00"]
Out[]:
KeyError: '2015-10-01 02:00:00'
Similarly this:
df.loc[["2015-10-01 02:00:00"]]
Out[]:
KeyError: "None of [['2015-10-01 02:00:00']] are in the [index]"
Both of these, surprisingly (?), result in the same Series:
df.loc["2015-10-01 02:00:00"]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
df.loc["2015-10-01 02:00:00",:]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
print(type(df.loc["2015-10-01 02:00:00"]))
print(type(df.loc["2015-10-01 02:00:00",:]))
print(df.loc["2015-10-01 02:00:00"].shape)
print(df.loc["2015-10-01 02:00:00",:].shape)
Out[]:
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
(1,)
(1,)
I could wrap any of those in DataFrame like that:
slize = pd.DataFrame(df.loc["2015-10-01 02:00:00",:])
Out[]:
2015-10-01 02:00:00
values 2
Of course I could do this to reach my result:
slize.T
Out[]:
values
2015-10-01 02:00:00 2
But at this point I could equally have gotten a column as a Series, so it is hard to test whether it is a row Series or a column Series in order to add the .T automatically.
Did I miss a way of selecting what I want?
I recommend generating your index with pd.date_range for convenience, and then using .loc with a Timestamp or datetime object.
from datetime import datetime
import pandas as pd
start = datetime(2015, 10, 1, 0, 0, 0)
end = datetime(2015, 10, 1, 4, 0, 0)
dates = pd.date_range(start, end, freq='H')
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0,5)
Then you can use .loc with a Timestamp or datetime object.
In [2]: df.loc[[start]]
Out[2]:
values
2015-10-01 0
Further details
Simply using indexing results in a key error:
df["2015-10-01 02:00:00"]
Out[]:
KeyError: '2015-10-01 02:00:00'
The KeyError occurs because df["2015-10-01 02:00:00"] tries to return a view of the DataFrame by looking for a column named "2015-10-01 02:00:00", not an index entry.
Similarly this:
df.loc[["2015-10-01 02:00:00"]]
Out[]:
KeyError: "None of [['2015-10-01 02:00:00']] are in the [index]"
Your second option cannot work with string indexing inside a list; you should use exact indexing with a Timestamp, as mentioned, instead.
These surprisingly (?) result in the same series as follows:
df.loc["2015-10-01 02:00:00"]
Out[]:
values 2
Name: 2015-10-01 02:00:00, dtype: int32
If you use .loc with a single row label you get coercion to the Series type, as you noticed. Hence you would cast to DataFrame and then transpose the result.
You can convert the string to a datetime and use exact indexing:
print(df.loc[[pd.to_datetime("2015-10-01 02:00:00")]])
                     values
2015-10-01 02:00:00       2
Or convert the Series to a DataFrame and transpose:
print(df.loc["2015-10-01 02:00:00"].to_frame().T)
                     values
2015-10-01 02:00:00       2
df[df[time_series_row] == "data_to_match"]
Sorry for the formatting; I'm on my phone and will update when I'm back at a computer.
Edit:
I would generally write it like this:
bitmask = df[time_series_row] == "data_to_match"
row = df[bitmask]
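Adapted to the DatetimeIndex from this question (where the answer's time_series_row would be the index itself, not a column), the mask idea spelled out as a runnable sketch:

```python
import pandas as pd

dates = ["2015-10-01 00:00:00", "2015-10-01 01:00:00",
         "2015-10-01 02:00:00", "2015-10-01 03:00:00",
         "2015-10-01 04:00:00"]
df = pd.DataFrame(index=pd.DatetimeIndex(dates))
df["values"] = range(0, 5)

# Boolean mask against the index: boolean indexing always yields a
# DataFrame, never a Series, so no transpose is needed
mask = df.index == pd.Timestamp("2015-10-01 02:00:00")
row = df[mask]
print(row)
```

This sidesteps the row-versus-column ambiguity the asker worried about, at the cost of a full scan of the index rather than a hash lookup.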

Retrieve date from column index position in pandas and paste in PyQt

I want to retrieve the date of one index position of a Pandas data frame and paste it into the LineEdit of a PyQt Application.
What I have so far is:
purchase = sales [['Total','Date']]
pandas_value = purchase.iloc[-1:]['Date'] # last position of the "Date" column
pyqt_value = str(pandas_value)
# This returns:
67   2016-10-20
Name: Date, dtype: datetime64[ns]
The entire output appears in the LineEdit as: 67 2016-10-20 Name: Date, dtype: datetime64[ns]
I have also tried converting the date, to no avail:
pandas_value.strftime('%Y-%m-%d')
'Series' object has no attribute 'strftime'
Is there a way to retrieve and paste just the date like : 2016-10-20 ?
Or better : Is there a way to retrieve any value as a string from any index position in pandas?
Thanks in advance for any help.
You can do it this way:
In [37]: df
Out[37]:
Date a
0 2016-01-01 0.228208
1 2016-01-02 0.695593
2 2016-01-03 0.493608
3 2016-01-04 0.728678
4 2016-01-05 0.369823
5 2016-01-06 0.336615
6 2016-01-07 0.012200
7 2016-01-08 0.481646
8 2016-01-09 0.773467
9 2016-01-10 0.550114
In [38]: df.iloc[-1, df.columns.get_loc('Date')].strftime('%Y-%m-%d')
Out[38]: '2016-01-10'
pandas returns it as a Series, which is like a list (it normally holds one row or one column of data), so you have to index into it to get the value. Your Series has only one element, so take the first position with .iloc[0] (its label is 67, as your output shows, so label-based access would be pandas_value[67]):
pyqt_value = str(pandas_value.iloc[0])
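A minimal sketch of the scalar extraction; the sales frame here is invented to match the question's columns:

```python
import pandas as pd

# Invented data matching the question's 'Total' and 'Date' columns
sales = pd.DataFrame({
    'Total': [10.0, 20.0],
    'Date': pd.to_datetime(['2016-10-19', '2016-10-20']),
})

# .iloc[-1] on the column gives a scalar Timestamp (not a one-row Series),
# so strftime works directly and no index/Name text leaks into the string
pyqt_value = sales['Date'].iloc[-1].strftime('%Y-%m-%d')
print(pyqt_value)  # 2016-10-20
```

The same pattern answers the general question: column-then-.iloc[position] retrieves any cell as a scalar, which str() or strftime() can format cleanly for the LineEdit.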

get next value in list Pandas

I have a list of unique dates in chronological order.
I have a dataframe with dates in it. I want to use the list of dates in the dataframe to get the NEXT date in the list (find the date in dataframe in the list, return the date to the right of it ( next chronological date).
Any ideas?
It appears that printing the list wouldn't work, and you haven't provided us with any code to work with or an example of what your datetimes look like. My best suggestion is to sort the dataframe first (note that the old DataFrame.sort() was removed; sort_values() is its replacement, and 'date' here stands in for whatever your date column is called):
dataframe.sort_values(by='date')
If you want a specific date to print, you will have to print it by index position once it is sorted. Without knowing how well your machine handles print statements of this size, I suggest writing the sorted frame out to a txt file to ensure that you are getting the proper response.
So for every item in the dataframe there is an exact match for its date in the list of unique dates, and you want to move it to the next date.
You should really use a dictionary for this:
next_date_dictionary = dict(zip(sequential_list_of_dates, sequential_list_of_dates[1:]))
Then you simply look up the next date in the dictionary:
next_date = next_date_dictionary.get(row.date)
Alternatively, if you want to replace the date column, you can use replace:
data_frame.replace({"date": next_date_dictionary})
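Put together as a runnable sketch (the four dates and the frame below are invented for illustration; the last date in the list has no successor, so it maps to NaT):

```python
import pandas as pd

# Sorted list of unique dates, as the question describes
sequential_list_of_dates = pd.to_datetime(
    ['2014-03-02', '2014-03-03', '2014-03-04', '2014-03-05']).tolist()

# Pair each date with the one that follows it chronologically
next_date_dictionary = dict(zip(sequential_list_of_dates,
                                sequential_list_of_dates[1:]))

df = pd.DataFrame({'date': [pd.Timestamp('2014-03-03'),
                            pd.Timestamp('2014-03-05')]})

# Series.map does the per-row dictionary lookup; missing keys become NaT
df['next_date'] = df['date'].map(next_date_dictionary)
print(df)
```

The dict lookup is O(1) per row, which is why this beats scanning the list for each date.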
OK here is one way of doing this:
In [210]:
import datetime as dt
import pandas as pd
# generate some data
df = pd.DataFrame({'dates': pd.date_range(start=dt.datetime(2014, 3, 2), end=dt.datetime(2014, 4, 23))})
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53 entries, 0 to 52
Data columns (total 1 columns):
dates 53 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 848.0 bytes
Now I'd create a df from your date list:
In [219]:
base = dt.datetime(2014,5,3)
date_list = [base - dt.timedelta(days=x) for x in range(0, 70)]
date_df = pd.DataFrame({'dates':date_list})
date_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 70 entries, 0 to 69
Data columns (total 1 columns):
dates 70 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 1.1 KB
Then add a new column to this date_df that shifts the dates column by 1 and then set the index to be the dates:
In [220]:
date_df['date_lookup'] = date_df['dates'].shift(1)
date_df = date_df.set_index('dates')
date_df.head()
Out[220]:
date_lookup
dates
2014-05-03 NaT
2014-05-02 2014-05-03
2014-05-01 2014-05-02
2014-04-30 2014-05-01
2014-04-29 2014-04-30
Then call map on the original df's dates column, passing date_df's date_lookup column; map uses the index to perform the lookup and returns the corresponding next value:
In [221]:
df['date_next'] = df['dates'].map(date_df['date_lookup'])
df.head()
Out[221]:
dates date_next
0 2014-03-02 2014-03-03
1 2014-03-03 2014-03-04
2 2014-03-04 2014-03-05
3 2014-03-05 2014-03-06
4 2014-03-06 2014-03-07

Efficiently handling missing dates when aggregating Pandas Dataframe

Follow up from Summing across rows of Pandas Dataframe and Pandas Dataframe object types fillna exception over different datatypes
I am aggregating using
df.groupby(['stock', 'same1', 'same2'], as_index=False)['positions'].sum()
This method is not very forgiving if there is missing data. If any data in same1, same2, etc. is missing, it pads in totally unrelated values. A workaround is to do a fillna loop over the columns, replacing missing strings with '' and missing numbers with zero; that solves the problem.
I do, however, have one column with missing dates as well. The column's dtype is 'object', with NaN (of type float) in the missing cells and datetime objects in the existing fields. It is important that I know the data is missing, i.e. the missing indicator must survive the groupby transformation.
Dataset outlining the problem:
csv file that I use as input is:
Date,Stock,Position,Expiry,same
2012/12/01,A,100,2013/06/01,AA
2012/12/01,A,200,2013/06/01,AA
2012/12/01,B,300,,BB
2012/6/01,C,400,2013/06/01,CC
2012/6/01,C,500,2013/06/01,CC
I then read in the file:
import datetime
import numpy as np
import pandas as pd

df = pd.read_csv('example', parse_dates=[0])

def convert_date(d):
    '''Converts YYYY/mm/dd to a datetime object'''
    if type(d) != str or len(d) != 10:
        return np.nan
    dd = d[8:]
    mm = d[5:7]
    YYYY = d[:4]
    return datetime.datetime(int(YYYY), int(mm), int(dd))

df['Expiry'] = df.Expiry.map(convert_date)
df
df looks like:
Date Stock Position Expiry same
0 2012-12-01 00:00:00 A 100 2013-06-01 00:00:00 AA
1 2012-12-01 00:00:00 A 200 2013-06-01 00:00:00 AA
2 2012-12-01 00:00:00 B 300 NaN BB
3 2012-06-01 00:00:00 C 400 2013-06-01 00:00:00 CC
4 2012-06-01 00:00:00 C 500 2013-06-01 00:00:00 CC
I can quite easily change the convert_date function to return something else for missing data in the Expiry column.
Then I use:
df.groupby(['Stock', 'Expiry', 'same'], as_index=False)['Position'].sum()
to aggregate the Position column. I get a TypeError: can't compare datetime.datetime to str with any non-date that I plug into the missing date cells. It is important for later functionality to know whether Expiry is missing.
You need to convert your dates to the datetime64[ns] dtype (which manages how datetimes work). An object column is neither efficient nor does it handle datelikes well. datetime64[ns] allows missing values using NaT (not-a-time); see here: http://pandas.pydata.org/pandas-docs/dev/missing_data.html#datetimes
In [6]: df['Expiry'] = pd.to_datetime(df['Expiry'])
# alternative way of reading in the data (in 0.11.1, as ``NaT`` will be set
# for missing values in a datelike column)
In [4]: df = pd.read_csv('example',parse_dates=['Date','Expiry'])
In [9]: df.dtypes
Out[9]:
Date datetime64[ns]
Stock object
Position int64
Expiry datetime64[ns]
same object
dtype: object
In [7]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum()
Out[7]:
Stock Expiry same Position
0 A 2013-06-01 00:00:00 AA 300
1 B NaT BB 300
2 C 2013-06-01 00:00:00 CC 900
In [8]: df.groupby(['Stock', 'Expiry', 'same'] ,as_index=False)['Position'].sum().dtypes
Out[8]:
Stock object
Expiry datetime64[ns]
same object
Position int64
dtype: object
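One caveat worth adding: in recent pandas versions (1.1 and later), groupby drops NaN/NaT group keys by default, so to keep the missing-Expiry indicator through the aggregation you would pass dropna=False. A sketch, assuming a reasonably current pandas and the same data as above:

```python
import pandas as pd

# The question's data after to_datetime conversion; B has a missing Expiry
df = pd.DataFrame({
    'Stock': ['A', 'A', 'B', 'C', 'C'],
    'Position': [100, 200, 300, 400, 500],
    'Expiry': pd.to_datetime(['2013-06-01', '2013-06-01', None,
                              '2013-06-01', '2013-06-01']),
    'same': ['AA', 'AA', 'BB', 'CC', 'CC'],
})

# dropna=False keeps the NaT Expiry group instead of silently discarding it
out = df.groupby(['Stock', 'Expiry', 'same'], as_index=False,
                 dropna=False)['Position'].sum()
print(out)
```

Without dropna=False, the B/NaT/BB row would vanish from the result in modern pandas, defeating the requirement that the missing indicator survive the groupby.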
