What is the difference between swaplevel() and reorder_levels()? - python

While working with hierarchical index levels in pandas, what is the difference between swaplevel() and reorder_levels()?

When there are only two levels, swaplevel and reorder_levels are almost the same, but when your df has three or more levels I personally find reorder_levels the more elegant option.
For example:
idx = pd.MultiIndex.from_arrays([[1, 1, 2], [1, 2, 2], [3, 3, 3],[1,1,1]])
df = pd.DataFrame(columns=idx, index=[1, 2, 3, 4])
If we want to change the level order from [0, 1, 2, 3] to [3, 2, 1, 0]:
With swaplevel, multiple calls are needed:
df.swaplevel(0,3,axis=1).swaplevel(1,2,axis=1)
     1
     3
     1    2
     1    1    2
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
With reorder_levels, only one call:
df.reorder_levels([3,2,1,0],axis=1)
     1
     3
     1    2
     1    1    2
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  NaN  NaN
4  NaN  NaN  NaN
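A quick sketch (reusing the frames above) to confirm that the two-call swaplevel route and the single reorder_levels call land on the same column order:

```python
import pandas as pd

# Rebuild the 4-level example from above.
idx = pd.MultiIndex.from_arrays([[1, 1, 2], [1, 2, 2], [3, 3, 3], [1, 1, 1]])
df = pd.DataFrame(columns=idx, index=[1, 2, 3, 4])

# Two swaps vs. one reorder:
swapped = df.swaplevel(0, 3, axis=1).swaplevel(1, 2, axis=1)
reordered = df.reorder_levels([3, 2, 1, 0], axis=1)

print(swapped.columns.equals(reordered.columns))  # True
```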

Related

How to ignore NaN values for a rolling mean calculation in pandas DataFrame?

I am trying to create a DataFrame containing a rolling mean based on a window of length 5. But my data contains one NaN value, and therefore I only get NaN values for Column3, the column with the NaN. How is it possible to ignore NaN values when using .rolling(5).mean()?
I have this sample data df1:
Column1 Column2 Column3 Column4
0 1 5 -9.0 13
1 1 6 -10.0 15
2 3 7 -5.0 11
3 4 8 NaN 9
4 6 5 -2.0 8
5 2 8 0.0 10
6 3 8 -3.0 12
For convenience:
import numpy as np
import pandas as pd

# create DataFrame with NaN
df1 = pd.DataFrame({
    'Column1': [1, 1, 3, 4, 6, 2, 3],
    'Column2': [5, 6, 7, 8, 5, 8, 8],
    'Column3': [-9, -10, -5, 'NaN', -2, 0, -3],
    'Column4': [13, 15, 11, 9, 8, 10, 12]
})
df1 = df1.replace('NaN', np.nan)
df1
When I create a rolling mean based on a window of 5, I get only NaN values for Column3.
df2 = df1.rolling(5).mean()
Column1 Column2 Column3 Column4
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 3.0 6.2 NaN 11.2
5 3.2 6.8 NaN 10.6
6 3.6 7.2 NaN 10.0
Pandas DataFrame.mean has a skipna flag to tell it to ignore NaNs, see
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
However, Rolling.mean does not accept that flag, so df1.rolling(5).mean(skipna=True) will not work. You can get the same effect by applying np.nanmean to each window:
df2 = df1.rolling(5).apply(np.nanmean)
You can also fill the NaN first, with either 0 or the column mean. The below works:
df1 = df1.fillna(df1.mean())
df2 = df1.rolling(5).mean()
You can use:
df2 = df1[df1['Column3'].notna()].rolling(5).mean()
Here you simply form a new df without the rows containing NaN.
If you don't want to lose the data in the good columns:
df2 = df1.drop("Column3", axis=1).rolling(5).mean()
df2["Column3"] = df1["Column3"].dropna().rolling(5).mean()
You calculate for all the good columns first, then for the one with NaN.
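One more option, not mentioned above: rolling already averages only the non-NaN values in a window; it is the min_periods requirement (which defaults to the window size) that turns the result into NaN. Relaxing it gives, on the question's data:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Column1': [1, 1, 3, 4, 6, 2, 3],
    'Column2': [5, 6, 7, 8, 5, 8, 8],
    'Column3': [-9, -10, -5, np.nan, -2, 0, -3],
    'Column4': [13, 15, 11, 9, 8, 10, 12],
})

# A window containing a NaN now yields the mean of its non-NaN values.
df2 = df1.rolling(5, min_periods=1).mean()
print(df2['Column3'].iloc[4])  # -6.5  (mean of -9, -10, -5, -2)
```

Note that min_periods=1 also produces values for the first four rows of every column, which may or may not be what you want.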

find first non NaN value in shift pandas

I have the following issue. I would like to compute the lag of a column in my df. However, I have a condition that the lagged value cannot be NaN.
See the example below:
import numpy as np
import pandas as pd
d = {'col1': [1, 2, 10, 5, 3, 2], 'col2': [3, 4, np.nan, np.nan, 23, 42]}
df = pd.DataFrame(data=d)
when I try this:
df["col2_lag"] = df["col2"].shift(1)
I got this result:
col1 col2 col2_lag
0 1 3.0 NaN
1 2 4.0 3.0
2 10 NaN 4.0
3 5 NaN NaN
4 3 23.0 NaN
5 2 42.0 23.0
However, desired output is this:
col1 col2 col2_lag
0 1 3.0 NaN
1 2 4.0 3.0
2 10 NaN 4.0
3 5 NaN 4.0 #because we skip NaN and find first non NaN
4 3 23.0 4.0 #because we skip NaN and find first non NaN
5 2 42.0 23.0
Is there an elegant way to do this, ideally without writing my own function? Thanks
Use ffill:
df["col2_lag"] = df["col2"].shift(1).ffill()
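A self-contained check on the data from the question: shifting first leaves holes where the original NaNs were, and forward-filling then carries the last non-NaN lagged value into them.

```python
import numpy as np
import pandas as pd

d = {'col1': [1, 2, 10, 5, 3, 2], 'col2': [3, 4, np.nan, np.nan, 23, 42]}
df = pd.DataFrame(data=d)

# Shift, then forward-fill the gaps left by the original NaNs.
df["col2_lag"] = df["col2"].shift(1).ffill()
print(df["col2_lag"].tolist())  # [nan, 3.0, 4.0, 4.0, 4.0, 23.0]
```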

append specific amount of empty rows to pandas dataframe

I want to append a specific amount of empty rows to this df:
df = pd.DataFrame({'cow': [2, 4, 8],
                   'shark': [2, 0, 0],
                   'pudle': [10, 2, 1]})
With df = df.append(pd.Series(), ignore_index=True) I can append one empty row; how can I append x rows?
You can use df.reindex to achieve this goal.
df.reindex(list(range(0, 10))).reset_index(drop=True)
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
The list you provide to df.reindex becomes the new index, i.e. the total number of rows of the new DataFrame. So if your DataFrame has 3 rows, providing a list of length 10 will add 7 new rows.
I'm not too pandas savvy, but if you can already add one empty row, why not just write a for loop and append x times?
for i in range(x):
    df = df.append(pd.Series(), ignore_index=True)
You could do:
import pandas as pd
df = pd.DataFrame({'cow': [2, 4, 8],
                   'shark': [2, 0, 0],
                   'pudle': [10, 2, 1]})
n = 10
df = df.append([[] for _ in range(n)], ignore_index=True)
print(df)
Output
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
Try with reindex
out = df.reindex(df.index.tolist() + [df.index.max() + 1] * 5)  # .reset_index(drop=True)
Out[93]:
cow shark pudle
0 2.0 2.0 10.0
1 4.0 0.0 2.0
2 8.0 0.0 1.0
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
3 NaN NaN NaN
Create an empty dataframe of the appropriate size and append it:
import numpy as np
n = 10
df = df.append(pd.DataFrame([[np.nan] * df.shape[1]] * n, columns=df.columns),
               ignore_index=True)
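A note for newer pandas: DataFrame.append was removed in pandas 2.0, so on current versions the same idea can be sketched with pd.concat (n = 5 here is an arbitrary choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'cow': [2, 4, 8],
                   'shark': [2, 0, 0],
                   'pudle': [10, 2, 1]})

n = 5  # number of empty rows to append
empty = pd.DataFrame(np.nan, index=range(n), columns=df.columns)
df = pd.concat([df, empty], ignore_index=True)
print(len(df))  # 8
```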

Columns appending is troublesome with Pandas

Here is what I have tried and what error I received:
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
>>> df
A B C D
0 1 5 0 1
1 2 4 0 1
2 3 3 0 1
3 4 2 0 1
4 5 1 0 1
>>> import pandas as pd
>>> df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
>>> first = [2,2,2,2,2,2,2,2,2,2,2,2]
>>> first = pd.DataFrame(first).T
>>> first.index = [2]
>>> df = df.join(first)
>>> df
A B C D 0 1 2 3 4 5 6 7 8 9 10 11
0 1 5 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 4 2 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
>>> second = [3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]
>>> second = pd.DataFrame(second).T
>>> second.index = [1]
>>> df = df.join(second)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python35\lib\site-packages\pandas\core\frame.py", line 6815, in join
rsuffix=rsuffix, sort=sort)
File "C:\Python35\lib\site-packages\pandas\core\frame.py", line 6830, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "C:\Python35\lib\site-packages\pandas\core\reshape\merge.py", line 48, in merge
return op.get_result()
File "C:\Python35\lib\site-packages\pandas\core\reshape\merge.py", line 552, in get_result
rdata.items, rsuf)
File "C:\Python35\lib\site-packages\pandas\core\internals\managers.py", line 1972, in items_overlap_with_suffix
'{rename}'.format(rename=to_rename))
ValueError: columns overlap but no suffix specified: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], dtype='object')
I am trying to create new lists with the extra columns, which I have to add at specific indexes of the main dataframe df.
When I tried the first it worked, as you can see in the output. But when I tried the same way with second, I received the above error.
Kindly let me know what I can do in this situation to achieve the goal I am expecting.
Use DataFrame.combine_first instead of join if you need to assign to the same columns created before; last, use DataFrame.reindex with the list of columns for the expected ordering:
df = pd.DataFrame({"A":[1,2,3,4,5],"B":[5,4,3,2,1],"C":[0,0,0,0,0],"D":[1,1,1,1,1]})
orig = df.columns.tolist()
first = [2,2,2,2,2,2,2,2,2,2,2,2]
first = pd.DataFrame(first).T
first.index = [2]
df = df.combine_first(first)
second = [3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3]
second = pd.DataFrame(second).T
second.index = [1]
df = df.combine_first(second)
df = df.reindex(orig + first.columns.tolist(), axis=1)
print (df)
A B C D 0 1 2 3 4 5 6 7 8 9 10 11
0 1 5 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3 4 2 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Yes, this is expected behaviour, because join works much like an SQL join, meaning that it will join on the provided index and concatenate all the columns together. The problem arises because join will not silently create two columns with the same name. Hence, if the two dataframes share a column name, it will first look for a suffix to add to those columns to avoid name clashes. This is controlled with the lsuffix and rsuffix arguments in the join method.
Conclusion: 2 ways to solve this:
Either provide a suffix so that pandas is able to resolve the name clashes; or
Make sure that you don't have overlapping columns
You have to specify the suffixes since the column names are the same. Assuming you are trying to add the second values as new columns horizontally:
df = df.join(second, lsuffix='first', rsuffix='second')
A B C D 0first 1first 2first 3first 4first 5first ... 10second 11second 12 13 14 15 16 17 18 19
0 1 5 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2 4 0 1 NaN NaN NaN NaN NaN NaN ... 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
2 3 3 0 1 2.0 2.0 2.0 2.0 2.0 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 4 2 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 5 1 0 1 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
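The second option above (no overlapping columns at all) can be sketched by renaming one side before joining; the "s" prefix here is just an arbitrary illustrative choice:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [5, 4, 3, 2, 1],
                   "C": [0, 0, 0, 0, 0], "D": [1, 1, 1, 1, 1]})

first = pd.DataFrame([[2] * 12], index=[2])
df = df.join(first)

# Prefix the second block's columns so they no longer clash with first's.
second = pd.DataFrame([[3] * 20], index=[1]).add_prefix("s")
df = df.join(second)

print(df.loc[1, "s0"], df.loc[2, 0])  # 3.0 2.0
```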

Insert value into column which is named in known column pandas

I'm preparing data for machine learning where data is in pandas DataFrame which looks like this:
Column v1 v2
first 1 2
second 3 4
third 5 6
Now I want to transform it into:
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
first 1 2 1 2 Nan Nan Nan Nan
second 3 4 Nan Nan 3 4 Nan Nan
third 5 6 Nan Nan Nan Nan 5 6
What I've tried is something like this:
# we know how many values there are, but the
# length can change with the list of [1, 2, 3, ...] values
values = ['v1', 'v2']
# data with the description from above is saved in data
for value in values:
    data[str(data['Column'] + '-' + value)] = data[value]
The result is columns with names like:
['first-v1' 'second-v1'..], ['first-v2' 'second-v2'..]
where there are the correct values. What am I doing wrong? Is there a more efficient way to do this, because my data is big?
Thank you for your time!
You can use unstack with swapping and sorting the MultiIndex in columns:
df = (data.set_index('Column', append=True)[values].unstack()
          .swaplevel(0, 1, axis=1).sort_index(axis=1))
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Or stack + unstack:
df = data.set_index('Column', append=True).stack().unstack([1,2])
df.columns = df.columns.map('-'.join)
print (df)
first-v1 first-v2 second-v1 second-v2 third-v1 third-v2
0 1.0 2.0 NaN NaN NaN NaN
1 NaN NaN 3.0 4.0 NaN NaN
2 NaN NaN NaN NaN 5.0 6.0
Last join to original:
df = data.join(df)
print (df)
Column v1 v2 first-v1 first-v2 second-v1 second-v2 third-v1 \
0 first 1 2 1.0 2.0 NaN NaN NaN
1 second 3 4 NaN NaN 3.0 4.0 NaN
2 third 5 6 NaN NaN NaN NaN 5.0
third-v2
0 NaN
1 NaN
2 6.0
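For reference, pivot can reach the same shape without manual index gymnastics; this is a sketch, with the column flattening mirroring the '-'.join naming above:

```python
import pandas as pd

data = pd.DataFrame({'Column': ['first', 'second', 'third'],
                     'v1': [1, 3, 5],
                     'v2': [2, 4, 6]})

# pivot spreads each row's values into per-label columns and returns a
# (value, label) MultiIndex, which we flatten to "label-value" names.
wide = data.pivot(columns='Column', values=['v1', 'v2'])
wide.columns = [f'{label}-{value}' for value, label in wide.columns]
out = data.join(wide.sort_index(axis=1))
print(out.loc[1, 'second-v2'])  # 4.0
```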
