Say I have a pandas.DataFrame with a two-level MultiIndex where year is the first level, and I want to keep only particular years. I can do
df = df.loc[yearStart:, :]
If I know the index has only two levels, but not which level year is in, I can hack something dirty like
if df.index.names[0] == 'year':
    df = df.loc[yearStart:, :]
else:
    df = df.loc[pd.IndexSlice[:, yearStart:], :]
What if I know year is in the index, but not at which level, nor how many levels the index has? If year is not in the index but a regular column, I can do
df = df.loc[df.year >= yearStart]
Is there a similarly generic way to do this for the index?
You can use get_level_values to get a column-like view of an index level.
import pandas as pd

df = pd.DataFrame({'a': range(100)},
                  index=pd.MultiIndex.from_product([range(10), range(2010, 2020)],
                                                   names=['idx1', 'year']))
df.head()
Out[41]:
            a
idx1 year    
0    2010   0
     2011   1
     2012   2
     2013   3
     2014   4
df[df.index.get_level_values('year') >= 2015].head()
Out[42]:
            a
idx1 year    
0    2015   5
     2016   6
     2017   7
     2018   8
     2019   9
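If you want a single expression that works whether year is a regular column or an index level at any depth, a small helper along these lines should do it (filter_by_year is just an illustrative name, not a pandas function):

def filter_by_year(df, year_start, col='year'):
    # Hypothetical helper: keep rows where col >= year_start, whether col
    # is a regular column or an index level at any position.
    if col in df.columns:
        return df.loc[df[col] >= year_start]
    return df[df.index.get_level_values(col) >= year_start]

filter_by_year(df, 2015).head()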
I have a resulting table
Year  mycat
2019  A        2
      B        1
2020  A        0
      B        1
In the 3rd row (2020, A) you see zero. I want to get rid of lines like this.
Year  mycat
2019  A        2
      B        1
2020  B        1
How can I do this? Is there a way to let pandas handle that without "hacking" the resulting table after I've done .groupby().size()?
Here is the full code:
>>> import pandas as pd
>>> df = pd.DataFrame({'Year': [2019, 2019, 2019, 2020], 'mycat': list('AABB')})
>>> df.mycat = df.mycat.astype('category')
>>> df
Year mycat
0 2019 A
1 2019 A
2 2019 B
3 2020 B
>>> df.groupby(['Year', 'mycat']).size()
Year  mycat
2019  A        2
      B        1
2020  A        0
      B        1
dtype: int64
Yes, there is a way to eliminate zero-count groupby results even for Categoricals such as the one in your input dataframe:
df.groupby(['Year', 'mycat'], observed=True).size()
In the docs for groupby(), the observed argument is explained as follows:
observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
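For the example above, this gives only the observed combinations (output sketched from the same input frame):

>>> df.groupby(['Year', 'mycat'], observed=True).size()
Year  mycat
2019  A        2
      B        1
2020  B        1
dtype: int64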
For a DataFrame with two MultiIndex levels age and yearref, the goal is to add a new MultiIndex level yearconstr calculated as yearconstr = yearref - age.
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]},
                  index=pd.MultiIndex.from_tuples([(10, 2015), (3, 2015), (2, 2016)],
                                                  names=["age", "yearref"]))
print(df)
# input df:
             value
age yearref       
10  2015         1
3   2015         2
2   2016         3
We could reset the index, calculate a new column and then put the original index back in place plus the newly defined column, but surely there must be a better way.
df = (df.reset_index()
        .assign(yearconstr=lambda df: df.yearref - df.age)
        .set_index(list(df.index.names) + ["yearconstr"]))
print(df)
# expected result:
                        value
age yearref yearconstr       
10  2015    2005            1
3   2015    2012            2
2   2016    2014            3
For a concise and straightforward approach, we can:
- use eval to generate a new Series calculated from the existing MultiIndex; this is easy since eval treats index levels just like columns: df.eval("yearref - age")
- rename the new Series
- use set_index with append=True to append the Series to df.
Putting everything together:
df.set_index(df.eval("yearref - age").rename("yearconstr"), append=True)
# result:
                        value
age yearref yearconstr       
10  2015    2005            1
3   2015    2012            2
2   2016    2014            3
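If you prefer to skip eval, an equivalent sketch (assuming the same level names) builds the new level directly from the existing ones with get_level_values:

new_level = (df.index.get_level_values("yearref")
             - df.index.get_level_values("age")).rename("yearconstr")
df.set_index(new_level, append=True)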
Let's say I have a pandas dataframe where -1 indexing does not work after resetting its index. How do I make it work as it did before, and why doesn't it work in this case?
Try this:
list(data.reset_index()['Date'])[-1]
I don't think reset_index() has anything to do with it. Selecting a single column from a dataframe returns a Series, and we may need to cast it to a list to access elements via a negative position.
This is a small example I tried on a sample dummy df:
'''
year key val
2019 a 3
2019 a 4
2019 b 3
2019 c 5
2020 d 6
2020 e 1
2020 f 2
'''
import pandas as pd
df = pd.read_clipboard()
print(df)
Source df:
year key val
0 2019 a 3
1 2019 a 4
2 2019 b 3
3 2019 c 5
4 2020 d 6
5 2020 e 1
6 2020 f 2
Both of these throw a KeyError, because indexing the Series with an integer is label-based here and -1 is not a label in the default RangeIndex:
mask = df['year'][-1]
print(mask)
or
mask = df.reset_index()['year'][-1]
print(mask)
Output:
KeyError: -1
Both of these work:
mask = list(df.reset_index()['year'])[-1]
or
mask = list(df['year'])[-1]
Output:
2020
You might consider using the df.loc or df.iloc indexer (see the documentation).
In your case that would be
data.reset_index()['Date'].iloc[-1]
Instead of resetting the index, you could also simply write data.index[-1].
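On the dummy df from above, where year is a regular column, the positional equivalents would be:

df['year'].iloc[-1]    # 2020 - positional access always accepts -1
df.iloc[-1]['year']    # 2020 - last row, then pick the column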
Check column 'esn' in df1. Whenever a difference is found between two consecutive rows, produce another dataframe, df2, that contains only the row before the change and the row after the change.
>>> df1 = pd.DataFrame([[2014,1],[2015,1],[2016,1],[2017,2],[2018,2]],columns=['year','esn'])
>>> df1
year esn
0 2014 1
1 2015 1
2 2016 1
3 2017 2
4 2018 2
>>> df2 # new dataframe intended to create
year esn
0 2016 1
1 2017 2
I can't produce the above result in df2. Thanks in advance for your help.
Create a boolean mask by comparing the column with its shifted values using ne (not equal), back-filling the first missing value; similarly compare with shift(-1), forward-filling the last missing value. Chain the two masks with | (bitwise OR) and filter by boolean indexing:
mask = df1['esn'].ne(df1['esn'].shift().bfill()) | df1['esn'].ne(df1['esn'].shift(-1).ffill())
df2 = df1[mask]
print (df2)
year esn
2 2016 1
3 2017 2
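To see what the mask does on this input, here is a breakdown of the intermediate steps (values worked out for the sample df1):

s = df1['esn']               # [1, 1, 1, 2, 2]
s.shift().bfill()            # [1, 1, 1, 1, 2]  previous value, first NaN back-filled
s.ne(s.shift().bfill())      # [F, F, F, T, F]  differs from the previous row
s.shift(-1).ffill()          # [1, 1, 2, 2, 2]  next value, last NaN forward-filled
s.ne(s.shift(-1).ffill())    # [F, F, T, F, F]  differs from the next row
# OR-ing the two masks marks rows 2 and 3, i.e. years 2016 and 2017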
I have two dfs:
df_1 = pd.DataFrame([[5,6]], columns=['Jan15','Feb15'])
Jan15 Feb15
0 5 6
df_2 = pd.DataFrame([[8, 3]], columns=['Jan16','Feb16'])
Jan16 Feb16
0 8 3
Is there a way to sum both frames in order to come out with:
sum =
Index  Jan  Feb
    0   13    9
You'll need concat and then a groupby on the column headers.
pd.concat([df_1, df_2], axis=1).groupby(by=lambda x: x[:3], axis=1).sum()
Feb Jan
0 9 13
This works on the assumption that your column names all have the format MTHxx.
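Note that groupby(..., axis=1) is deprecated in newer pandas releases; a sketch of an equivalent that transposes, groups on the stripped labels, and transposes back (same MTHxx naming assumption):

combined = pd.concat([df_1, df_2], axis=1)
combined.T.groupby(lambda x: x[:3]).sum().T   # same Feb/Jan result as above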
Use add() with the underlying values of the second frame, so alignment is positional rather than by the (different) column names:
df_1.add(df_2.values).reset_index()
   index  Jan15  Feb15
0      0     13      9
To also drop the year suffix from the column names:
pd.DataFrame(df_1.values + df_2.values,
             columns=df_1.columns.str.replace(r"\d", "", regex=True)).reset_index()
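For the two frames above this should produce something like:

   index  Jan  Feb
0      0   13    9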
This is a quick-and-dirty way of doing it; #coldspeed's and #andrew_reece's answers are the better options here:
new1 = df_1.copy()
new1.columns = [c[:-2] for c in df_1.columns]   # strip the year suffix
new2 = df_2.copy()
new2.columns = [c[:-2] for c in df_2.columns]
final_df = new1 + new2                          # columns now align, so this sums element-wise
final_df['Index'] = final_df.index
print(final_df)
Output:
   Jan  Feb  Index
0   13    9      0