How to add a nested column to a 3D Pandas DataFrame?

I'm new to Pandas and not sure how a 3D (MultiIndex-column) DataFrame works. My dataframe, called 'new', looks like this:
  unique       cat    numerical
       a  b      c  d         e  f
0      0  1      2  3         4  5
1      0  1      2  3         4  5
I want to insert column 'z' so that it ends up like this:
  unique          cat    numerical
       a  b  z      c  d         e  f
0      0  1  9      2  3         4  5
1      0  1  9      2  3         4  5
I successfully made a new column after slicing 'unique' out of my dataframe. Doing this:
new_column = new.loc[:,'unique'].assign(z=pd.Series([9,9]).values)
Gets me this:
   a  b  z
0  0  1  9
1  0  1  9
However, I have no idea how to put it back into the dataframe. I tried:
new['unique'] = new_column
but I've since found out that this just replaces all the values in every row and column under 'unique'. For example:
new['unique'] = 'a'
gets:
  unique       cat    numerical
       a  b      c  d         e  f
0      a  a      2  3         4  5
1      a  a      2  3         4  5
And using .loc gets this instead:
  unique          cat    numerical
       a    b      c  d         e  f
0    NaN  NaN      2  3         4  5
1    NaN  NaN      2  3         4  5
Here's my full code:
import pandas as pd
import numpy as np

data = [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]]
datatypes = np.array(['unique', 'unique', 'cat', 'cat', 'numerical', 'numerical'])
columnnames = np.array(['a', 'b', 'c', 'd', 'e', 'f'])
new = pd.DataFrame(data=data, columns=pd.MultiIndex.from_tuples(zip(datatypes, columnnames)))
print('new:')
print(new)

new_column = new.loc[:, 'unique'].assign(z=pd.Series([9, 9]).values)
print('\nnew column:')
print(new_column)

new.loc[:, 'unique'] = new_column
print('\nattempt 1:')
print(new)

new['unique'] = new_column
print('\nattempt 2:')
print(new)

One way to do this:
# Create your new MultiIndexed column:
new['unique', 'z'] = 9
# Re-order the columns into the desired order:
new = new[['unique', 'cat', 'numerical']]
>>> new
  unique          cat    numerical
       a  b  z      c  d         e  f
0      0  1  9      2  3         4  5
1      0  1  9      2  3         4  5
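As a side note (a sketch, not from the original answer): when the columns are a MultiIndex, DataFrame.insert accepts a tuple as the column key, which places the new column at the desired position in one step and avoids re-listing the groups:
# insert ('unique', 'z') as the third column (position 2) directly
new.insert(2, ('unique', 'z'), 9)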

Related

Pandas - combine columns and put one after another?

I have the following dataframe:
a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6
The desirable output is:
a,b
1,3
2,4
3,5
2,4
3,5
4,6
There are a lot of 'a'- and 'b'-named headers in the dataframe; the maximum is a50 and b50. So I am looking for a way to combine them all into just 'a' and 'b'.
I think it's possible to do with concat, but I have no idea how to combine it all, putting all the values under each other. I'll be grateful for any ideas.
You can use pd.wide_to_long:
pd.wide_to_long(df.reset_index(), ['a','b'], 'index', 'No').reset_index()[['a','b']]
Output:
   a  b
0  1  3
1  2  4
2  3  5
3  2  4
4  3  5
5  4  6
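For clarity, here is the same call written out with keyword arguments (a sketch of what pd.wide_to_long is doing here):
# wide_to_long needs an id column, so reset_index() exposes the row index as 'index'
long_df = pd.wide_to_long(
    df.reset_index(),
    stubnames=['a', 'b'],  # column-name prefixes to gather
    i='index',             # identifier column
    j='No',                # name for the captured numeric suffix (1, 2, ..., 50)
)
result = long_df.reset_index()[['a', 'b']]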
First we read the dataframe:
import pandas as pd
from io import StringIO
s = """a1,a2,b1,b2
1,2,3,4
2,3,4,5
3,4,5,6"""
df = pd.read_csv(StringIO(s), sep=',')
Then we stack the columns and separate the column number from the letter 'a' or 'b':
stacked = df.stack().rename("val").reset_index(1).reset_index()
cols_numbers = pd.DataFrame(
    stacked.level_1
           .str.split(r'(\d)')
           .apply(lambda l: l[:2])
           .tolist(),
    columns=["col", "num"])
x = cols_numbers.join(stacked[['val', 'index']])
print(x)
   col num  val  index
0    a   1    1      0
1    a   2    2      0
2    b   1    3      0
3    b   2    4      0
4    a   1    2      1
5    a   2    3      1
6    b   1    4      1
7    b   2    5      1
8    a   1    3      2
9    a   2    4      2
10   b   1    5      2
11   b   2    6      2
Finally, we group by index and num to get the two columns a and b; within each group we unstack and backfill so that the first row carries both the a and the b value, which gives what was expected:
result = (x
          .set_index("col", append=True)
          .groupby(["index", "num"])
          .val
          .apply(lambda g: g.unstack()
                            .fillna(method="bfill")
                            .head(1))
          .reset_index(-1, drop=True))
print(result)
print(result)
col          a    b
index num
0     1    1.0  3.0
      2    2.0  4.0
1     1    2.0  4.0
      2    3.0  5.0
2     1    3.0  5.0
      2    4.0  6.0
To get rid of the multiindex at the end: result.reset_index(drop=True)
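Since the question mentions concat, here is that idea as a sketch too (assuming the suffixes run 1..N with matching a/b pairs):
import pandas as pd
N = 2  # would be 50 for the full a50/b50 data
parts = [df[[f'a{i}', f'b{i}']].set_axis(['a', 'b'], axis=1)
         for i in range(1, N + 1)]
out = pd.concat(parts, ignore_index=True)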

Remove outliers from aggregated Dataframe (Python)

My original dataframe looks like this (only the first rows):
  categories  id products
0          A   1        a
1          B   1        a
2          C   1        a
3          A   1        b
4          B   1        b
5          A   2        c
6          B   2        c
I aggregated it with the following code:
df2 = df.groupby('id').products.nunique().reset_index().merge(
    pd.crosstab(df.id, df.categories).reset_index())
The dataframe then looks like this (I added an outlier from my DF as well):
   id  products  A  B   C
0   1         2  2  2   1
1   2         1  1  1   0
2   3        50  1  1  30
Now I am trying to remove the outliers in my new DF:
# remove outliers
del df2['id']
df2 = df2.loc[df2['products'] <= 20, [str(i) for i in df2.columns]]
What I then get is:
   products   A   B   C
0         2 NaN NaN NaN
1         1 NaN NaN NaN
It removes the outliers, but why do I now get only NaNs in the category columns?
df2 = df2.loc[df2['products'] <= 20]
Just filter the rows and keep all the columns. The NaNs most likely come from the [str(i) for i in df2.columns] part: if the crosstab's column labels are not plain strings, str() produces labels that do not exist in df2, and older pandas silently returns all-NaN columns for such missing labels instead of raising a KeyError.
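If a hand-picked cutoff like 20 is too brittle, a quantile-based filter is a common variant (a sketch, not from the original answer):
# keep rows whose product count is at or below the 95th percentile
df2 = df2.loc[df2['products'] <= df2['products'].quantile(0.95)]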

Pandas merge on aggregated columns

Let's say I create a DataFrame:
import pandas as pd
df = pd.DataFrame({"a": [1,2,3,13,15], "b": [4,5,6,6,6], "c": ["wish", "you","were", "here", "here"]})
Like so:
    a  b     c
0   1  4  wish
1   2  5   you
2   3  6  were
3  13  6  here
4  15  6  here
... and then group and aggregate by a couple columns ...
gb = df.groupby(['b','c']).agg({"a": lambda x: x.nunique()})
Yielding the following result:
        a
b c
4 wish  1
5 you   1
6 here  2
  were  1
Is it possible to merge df with the newly aggregated table gb such that I create a new column in df, containing the corresponding values from gb? Like this:
    a  b     c  nc
0   1  4  wish   1
1   2  5   you   1
2   3  6  were   1
3  13  6  here   2
4  15  6  here   2
I tried doing the simplest thing:
df.merge(gb, on=['b','c'])
But this gives the error:
KeyError: 'b'
That makes sense, because the grouped table has a MultiIndex and b is not a column. So my question is two-fold:
Can I transform the multi-index of the gb DataFrame back into columns (so that it has the b and c column)?
Can I merge df with gb on the column names?
Whenever you want to add an aggregated column from a groupby operation back to the original df, you should use transform; it produces a Series whose index is aligned with your original df:
In [4]:
df['nc'] = df.groupby(['b','c'])['a'].transform(pd.Series.nunique)
df
Out[4]:
    a  b     c  nc
0   1  4  wish   1
1   2  5   you   1
2   3  6  were   1
3  13  6  here   2
4  15  6  here   2
There is no need to reset the index or perform an additional merge.
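As a side note (not in the original answer), recent pandas also accepts the aggregation by name, which is usually faster than passing the function object:
df['nc'] = df.groupby(['b', 'c'])['a'].transform('nunique')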
There's a simple way of doing this using reset_index().
df.merge(gb.reset_index(), on=['b','c'])
gives you
   a_x  b     c  a_y
0    1  4  wish    1
1    2  5   you    1
2    3  6  were    1
3   13  6  here    2
4   15  6  here    2
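To end up with the nc column name from the question instead of the a_x/a_y suffixes, you could rename the aggregated column before merging (a small sketch building on this answer):
df.merge(gb.reset_index().rename(columns={'a': 'nc'}), on=['b', 'c'])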

Adding a column to pandas data frame fills it with NA

I have this pandas dataframe:
          SourceDomain                           1  2         3
0  www.theguardian.com     profile.theguardian.com  1  Directed
1  www.theguardian.com  membership.theguardian.com  2  Directed
2  www.theguardian.com   subscribe.theguardian.com  3  Directed
3  www.theguardian.com            www.google.co.uk  4  Directed
4  www.theguardian.com        jobs.theguardian.com  5  Directed
I would like to add a new column which is a pandas series created like this:
Weights = Weights.value_counts()
However, when I try to add the new column using edgesFile[4] = Weights, it fills the column with NaN instead of the values:
          SourceDomain                           1  2         3   4
0  www.theguardian.com     profile.theguardian.com  1  Directed NaN
1  www.theguardian.com  membership.theguardian.com  2  Directed NaN
2  www.theguardian.com   subscribe.theguardian.com  3  Directed NaN
3  www.theguardian.com            www.google.co.uk  4  Directed NaN
4  www.theguardian.com        jobs.theguardian.com  5  Directed NaN
How can I add the new column while keeping the values?
Thanks,
Dani
You are getting NaNs because the index of Weights does not match up with the index of edgesFile. If you want Pandas to ignore Weights.index and just paste the values in order then pass the underlying NumPy array instead:
edgesFile[4] = Weights.values
Here is an example which demonstrates the difference:
In [14]: df = pd.DataFrame(np.arange(4)*10, index=list('ABCD'))
In [15]: df
Out[15]:
    0
A   0
B  10
C  20
D  30
In [16]: s = pd.Series(np.arange(4), index=list('CDEF'))
In [17]: s
Out[17]:
C    0
D    1
E    2
F    3
dtype: int64
Here we see Pandas aligning the index:
In [18]: df[4] = s
In [19]: df
Out[19]:
    0   4
A   0 NaN
B  10 NaN
C  20   0
D  30   1
Here, Pandas simply pastes the values in s into the column:
In [20]: df[4] = s.values
In [21]: df
Out[21]:
    0  4
A   0  0
B  10  1
C  20  2
D  30  3
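With recent pandas, another option (a sketch, not from the original answer) is to relabel the Series with the target index instead of dropping down to the NumPy array:
In [22]: df[4] = s.set_axis(df.index)  # same values, now with an index that matches df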
Here is a small example of your question.
You can add a new column with a column name to an existing DataFrame:
>>> import pandas as pd
>>> from pandas import DataFrame, Series
>>> df = DataFrame([[1,2,3],[4,5,6]], columns=['A', 'B', 'C'])
>>> df
   A  B  C
0  1  2  3
1  4  5  6
>>> s = Series([7,8])
>>> s
0    7
1    8
dtype: int64
>>> df['D'] = s
>>> df
   A  B  C  D
0  1  2  3  7
1  4  5  6  8
Or, you can make a DataFrame from the Series and concat them:
>>> df = DataFrame([[1,2,3],[4,5,6]])
>>> df
   0  1  2
0  1  2  3
1  4  5  6
>>> s = DataFrame(Series([7,8]))  # if you don't provide a column name, the default name will be 0
>>> s
   0
0  7
1  8
>>> df = pd.concat([df,s], axis=1)
>>> df
   0  1  2  0
0  1  2  3  7
1  4  5  6  8
Hope this will help

How to extract values of one dataframe with values of other dataframe in pandas?

Suppose that you create the following pandas data frames:
In [1]: print df1.to_string()
   ID value
0   1     a
1   2     b
2   3     c
3   4     d
In [2]: print df2.to_string()
   Id_a  Id_b
0     1     2
1     4     2
2     2     1
3     3     3
4     4     4
5     2     2
How can I create a frame df_ids_to_values with the following values:
In [3]: print df_ids_to_values.to_string()
  value_a value_b
0       a       b
1       d       b
2       b       a
3       c       c
4       d       d
5       b       b
In other words, I would like to replace the IDs in df2 with the corresponding values from df1. I have tried doing this with a for loop, but it is very slow, and I am hoping that there is a pandas function that allows me to do this operation efficiently.
Thanks for your help...
Start by setting an index on df1
df1 = df1.set_index('ID')
then join the two columns
df = df2.join(df1, on='Id_a')
df = df.rename(columns = {'value' : 'value_a'})
df = df.join(df1, on='Id_b')
df = df.rename(columns = {'value' : 'value_b'})
result:
> df
   Id_a  Id_b value_a value_b
0     1     2       a       b
1     4     2       d       b
2     2     1       b       a
3     3     3       c       c
4     4     4       d       d
5     2     2       b       b

[6 rows x 4 columns]
(and you get to your expected output with df[['value_a','value_b']])
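A shorter alternative (a sketch, not part of the original answer) builds a lookup Series from df1 and uses Series.map:
lookup = df1.set_index('ID')['value']
df_ids_to_values = pd.DataFrame({
    'value_a': df2['Id_a'].map(lookup),
    'value_b': df2['Id_b'].map(lookup),
})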
