Add values from columns into a new column using pandas - python

I have a dataframe:
id category value
1 1 abc
2 2 abc
3 1 abc
4 4 abc
5 4 abc
6 3 abc
Category 1 = best, 2 = good, 3 = bad, 4 = ugly.
I want to create a new column, new_col, such that for category 1 the value is cat_1, for category 2 the value is cat_2, and so on.
In new_col2, category 1 should become cat_best, category 2 should become cat_good, and so on.
df['new_col'] = ''
my final df
id category value new_col new_col2
1 1 abc cat_1 cat_best
2 2 abc cat_2 cat_good
3 1 abc cat_1 cat_best
4 4 abc cat_4 cat_ugly
5 4 abc cat_4 cat_ugly
6 3 abc cat_3 cat_bad
I can do it by iterating in a for loop:
for index, row in df.iterrows():
    df.loc[df.id == row.id, 'new_col'] = 'cat_' + str(row['category'])
Is there a better (less time-consuming) way of doing it?

I think you need to concatenate the prefix with the column converted to string for the first column, and with the result of map for the second column:
d = {1:'best', 2: 'good', 3 : 'bad', 4 :'ugly'}
df['new_col'] = 'cat_'+ df['category'].astype(str)
df['new_col2'] = 'cat_'+ df['category'].map(d)
Or:
df = df.assign(new_col='cat_' + df['category'].astype(str),
               new_col2='cat_' + df['category'].map(d))
print (df)
id category value new_col new_col2
0 1 1 abc cat_1 cat_best
1 2 2 abc cat_2 cat_good
2 3 1 abc cat_1 cat_best
3 4 4 abc cat_4 cat_ugly
4 5 4 abc cat_4 cat_ugly
5 6 3 abc cat_3 cat_bad
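One caveat with the map approach is that a category missing from the dictionary maps to NaN. A minimal sketch of how that could be handled, assuming a made-up category 5 and a hypothetical 'unknown' fallback label (neither appears in the question):

```python
import pandas as pd

# Small frame with one category (5) that is deliberately absent from the mapping.
df = pd.DataFrame({"id": [1, 2, 3], "category": [1, 2, 5]})
d = {1: "best", 2: "good", 3: "bad", 4: "ugly"}

# Vectorized prefixing: no Python-level loop over the rows.
df["new_col"] = "cat_" + df["category"].astype(str)

# map() yields NaN for keys not in d; fillna supplies the hypothetical fallback label.
df["new_col2"] = "cat_" + df["category"].map(d).fillna("unknown")

print(df)
```

Whether a fallback label or a NaN is preferable depends on how the new column is used afterwards.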

You can do it by using apply also:
df['new_col']=df['category'].apply(lambda x: "cat_"+str(x))


Pandas replace columns by merging another dataframe

I have a dataframe df1 looks like this:
id A B
0 1 10 5
1 1 11 6
2 2 10 7
3 2 11 8
And another dataframe df2:
id A
0 1 3
1 2 4
Now I want to replace A column in df1 with the value of A in df2 based on id, so the result should look like this:
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
One way is to drop column A from df1 first and then merge df2 into df1 on id, like df1 = df1.drop(['A'], axis=1).merge(df2, how='left', on='id'), but if df2 has something like 10 columns, this gets tedious. Is there a more elegant way to do so?
Here is one way to do it, using DataFrame.update. However, it requires setting id as the index on both frames so that the rows can be matched (df below is the question's df1):
df.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df.update(df2)
df['A'] = df['A'].astype(int) # value by default was of type float
df.reset_index()
id A B
0 1 3 5
1 1 3 6
2 2 4 7
3 2 4 8
Merge just the id column from df to df2, and then combine_first it to the original DataFrame:
df = df[['id']].merge(df2).combine_first(df)
print(df)
Output:
A B id
0 3 5 1
1 3 6 1
2 4 7 2
3 4 8 2
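The drop-and-merge idea from the question can also be generalized so that the overlapping columns do not have to be listed by hand. This is only a sketch, assuming id is the sole key column and that each id occurs once in df2 (otherwise the merge would duplicate rows); the names shared and result are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 1, 2, 2], "A": [10, 11, 10, 11], "B": [5, 6, 7, 8]})
df2 = pd.DataFrame({"id": [1, 2], "A": [3, 4]})

# Every non-key column of df2 is treated as a replacement for the same column in df1.
shared = df2.columns.difference(["id"])

# Drop those columns from df1, pull them in from df2, then restore df1's column order.
result = df1.drop(columns=shared).merge(df2, on="id", how="left")[df1.columns]

print(result)
```

Because the shared columns are computed from df2, the same lines should work unchanged when df2 carries ten columns instead of one.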

How to append a specific string according to each value in a string pandas dataframe column?

Let's take these sample dataframes :
df = pd.DataFrame({'Id':['1','2','3','4','5'], 'Value':[9,8,7,6,5]})
Id Value
0 1 9
1 2 8
2 3 7
3 4 6
4 5 5
df_name = pd.DataFrame({'Id':['1','2','4'], 'Name':['Andrew','Jason','John']})
Id Name
0 1 Andrew
1 2 Jason
2 4 John
I would like to append, in the Id column of df, the Name of the person (looked up in df_name) in brackets, if it exists. I know how to do this with a for loop over the Id column of df, but it is inefficient with large dataframes. Do you know a better way to do this?
Expected output :
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
Use Series.map to match the values, add the parentheses, and replace the non-matching values (NaN) with the original column using Series.fillna:
df['Id'] = ((df['Id'] + ' (' + df['Id'].map(df_name.set_index('Id')['Name']) + ')')
.fillna(df['Id']))
print (df)
Id Value
0 1 (Andrew) 9
1 2 (Jason) 8
2 3 7
3 4 (John) 6
4 5 5
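The same lookup can be written in separate steps, which makes the handling of unmatched ids explicit. This is a variant sketch using Series.where instead of relying on NaN propagating through the string concatenation; the intermediate names (names, labelled) are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"Id": ["1", "2", "3", "4", "5"], "Value": [9, 8, 7, 6, 5]})
df_name = pd.DataFrame({"Id": ["1", "2", "4"], "Name": ["Andrew", "Jason", "John"]})

# Look up each Id in df_name; ids without a match become NaN.
names = df["Id"].map(df_name.set_index("Id")["Name"])

# Build the "Id (Name)" label for every row (unmatched rows get a throwaway "Id ()").
labelled = df["Id"] + " (" + names.fillna("") + ")"

# Keep the original Id where no name was found, otherwise take the label.
df["Id"] = df["Id"].where(names.isna(), labelled)

print(df)
```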

Convert a column with values in a list into separated rows grouped by specific columns

I'm trying to convert a column with values in a list into separate rows grouped by specific columns.
That's the dataframe I have:
id rooms bathrooms facilities
111 1 2 [2, 3, 4]
222 2 3 [4, 5, 6]
333 2 1 [2, 3, 4]
That's the dataframe I need:
id rooms bathrooms facility
111 1 2 2
111 1 2 3
111 1 2 4
222 2 3 4
222 2 3 5
222 2 3 6
333 2 1 2
333 2 1 3
333 2 1 4
I first tried converting the facilities column to a list:
facilities = pd.DataFrame(df.facilities.tolist())
and then joining it to the other columns, following the method from another suggested solution:
df[['id', 'rooms', 'bathrooms']].join(facilities).melt(id_vars=['id', 'rooms', 'bathrooms']).drop('variable', 1)
Unfortunately, it didn't work for me.
Another solution?
Thanks in advance!
You need explode:
df.explode('facilities')
# id rooms bathrooms facilities
#0 111 1 2 2
#0 111 1 2 3
#0 111 1 2 4
#1 222 2 3 4
#1 222 2 3 5
#1 222 2 3 6
#2 333 2 1 2
#2 333 2 1 3
#2 333 2 1 4
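To get exactly the frame asked for (a fresh 0..n-1 index and a column called facility), explode can be chained with a rename and an index reset. A minimal sketch, assuming pandas 0.25 or later, where DataFrame.explode is available:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [111, 222, 333],
    "rooms": [1, 2, 2],
    "bathrooms": [2, 3, 1],
    "facilities": [[2, 3, 4], [4, 5, 6], [2, 3, 4]],
})

out = (
    df.explode("facilities")                      # one row per list element
      .rename(columns={"facilities": "facility"})  # column name used in the question
      .reset_index(drop=True)                      # replace the repeated 0, 0, 0, 1, ... index
)

# explode leaves the column with object dtype, so cast back to integers.
out["facility"] = out["facility"].astype(int)

print(out)
```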
It is a bit awkward to have list as values in a dataframe so one way I can think of to work around this is to unpack the lists and store each in its own column, then use the melt function.
# recreate your data
d = {"id":[111, 222, 333],
"rooms": [1,2,2],
"bathrooms": [2,3,1],
"facilities": [[2, 3, 4],[4, 5, 6],[2, 3, 4]]}
df = pd.DataFrame(d)
# unpack the lists
f0, f1, f2 = [],[],[]
for row in df.itertuples():
    f0.append(row.facilities[0])
    f1.append(row.facilities[1])
    f2.append(row.facilities[2])
df["f0"] = f0
df["f1"] = f1
df["f2"] = f2
# melt the dataframe
df = pd.melt(df, id_vars=['id', 'rooms', 'bathrooms'], value_vars=["f0", "f1", "f2"], value_name="facilities")
# optionally sort the values and remove the "variable" column
df.sort_values(by=['id'], inplace=True)
df = df[['id', 'rooms', 'bathrooms', 'facilities']]
I think that should get you the dataframe you need.
id rooms bathrooms facilities
0 111 1 2 2
3 111 1 2 3
6 111 1 2 4
1 222 2 3 4
4 222 2 3 5
7 222 2 3 6
2 333 2 1 2
5 333 2 1 3
8 333 2 1 4
The following will give the desired output:
def changeDf(x):
    df_m = pd.DataFrame(columns=['id', 'rooms', 'bathrooms', 'facilities'])
    for index, fc in enumerate(x['facilities']):
        df_m.loc[index] = [x['id'], x['rooms'], x['bathrooms'], fc]
    return df_m
df_modified = df.apply(changeDf, axis=1)
df_final = pd.concat([i for i in df_modified])
print(df_final)
Here "df" is the input dataframe and "df_final" is the desired dataframe.
Try this (np is numpy):
import numpy as np
reps = [len(x) for x in df.facilities]
facilities = pd.Series(np.array(df.facilities.tolist()).ravel())
df = df.loc[df.index.repeat(reps)].reset_index(drop=True)
df.facilities = facilities
df
id rooms bathrooms facilities
0 111 1 2 2
1 111 1 2 3
2 111 1 2 4
3 222 2 3 4
4 222 2 3 5
5 222 2 3 6
6 333 2 1 2
7 333 2 1 3
8 333 2 1 4
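The np.array(...).ravel() step flattens correctly only when every list has the same length. For lists of differing lengths, the same repeat idea could be sketched with itertools.chain; the two-row frame below is made up for illustration:

```python
from itertools import chain

import pandas as pd

# Hypothetical data with lists of different lengths.
df = pd.DataFrame({
    "id": [111, 222],
    "rooms": [1, 2],
    "facilities": [[2, 3, 4], [5]],
})

# How many times each original row has to be repeated.
reps = [len(x) for x in df["facilities"]]

# Repeat the rows, then overwrite the list column with the flattened values.
out = df.loc[df.index.repeat(reps)].reset_index(drop=True)
out["facilities"] = list(chain.from_iterable(df["facilities"]))

print(out)
```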

How do I name the column and index in a Pandas dataframe? [duplicate]

How do I get the index column name in python pandas? Here's an example dataframe:
Column 1
Index Title
Apples 1
Oranges 2
Puppies 3
Ducks 4
What I'm trying to do is get/set the dataframe index title. Here is what I tried:
import pandas as pd
data = {'Column 1' : [1., 2., 3., 4.],
'Index Title' : ["Apples", "Oranges", "Puppies", "Ducks"]}
df = pd.DataFrame(data)
df.index = df["Index Title"]
del df["Index Title"]
print(df)
Anyone know how to do this?
You can just get/set the index name via its name property:
In [7]: df.index.name
Out[7]: 'Index Title'
In [8]: df.index.name = 'foo'
In [9]: df.index.name
Out[9]: 'foo'
In [10]: df
Out[10]:
Column 1
foo
Apples 1
Oranges 2
Puppies 3
Ducks 4
You can use rename_axis; to remove a name, set it to None:
d = {'Index Title': ['Apples', 'Oranges', 'Puppies', 'Ducks'],'Column 1': [1.0, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d).set_index('Index Title')
print (df)
Column 1
Index Title
Apples 1.0
Oranges 2.0
Puppies 3.0
Ducks 4.0
print (df.index.name)
Index Title
print (df.columns.name)
None
rename_axis works well in method chains:
df = df.rename_axis('foo')
print (df)
Column 1
foo
Apples 1.0
Oranges 2.0
Puppies 3.0
Ducks 4.0
You can also set the name of the columns axis with the axis parameter:
d = {'Index Title': ['Apples', 'Oranges', 'Puppies', 'Ducks'],'Column 1': [1.0, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d).set_index('Index Title').rename_axis('Col Name', axis=1)
print (df)
Col Name Column 1
Index Title
Apples 1.0
Oranges 2.0
Puppies 3.0
Ducks 4.0
print (df.index.name)
Index Title
print (df.columns.name)
Col Name
print(df.rename_axis('foo').rename_axis("bar", axis="columns"))
bar Column 1
foo
Apples 1.0
Oranges 2.0
Puppies 3.0
Ducks 4.0
print(df.rename_axis('foo').rename_axis("bar", axis=1))
bar Column 1
foo
Apples 1.0
Oranges 2.0
Puppies 3.0
Ducks 4.0
From pandas 0.24.0 onward, it is possible to use the index and columns parameters:
df = df.rename_axis(index='foo', columns="bar")
print (df)
bar Column 1
foo
Apples 1.0
Oranges 2.0
Puppies 3.0
Ducks 4.0
To remove the index and columns names, set them to None:
df = df.rename_axis(index=None, columns=None)
print (df)
Column 1
Apples 1.0
Oranges 2.0
Puppies 3.0
Ducks 4.0
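The single-level cases above can be condensed into one small check. This is a sketch assuming pandas 0.24 or later (for the index= and columns= keywords); the variable names renamed and cleared are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"Column 1": [1.0, 2.0, 3.0, 4.0]},
    index=pd.Index(["Apples", "Oranges", "Puppies", "Ducks"], name="Index Title"),
)

# rename_axis returns a new frame; the original keeps its names.
renamed = df.rename_axis(index="foo", columns="bar")

# Passing None removes the names again.
cleared = renamed.rename_axis(index=None, columns=None)

print(renamed.index.name, renamed.columns.name)
print(cleared.index.name, cleared.columns.name)
```

Because rename_axis does not modify df in place here, it fits into a method chain without side effects on the original frame.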
If only the index is a MultiIndex (the examples below assume import numpy as np):
mux = pd.MultiIndex.from_arrays([['Apples', 'Oranges', 'Puppies', 'Ducks'],
list('abcd')],
names=['index name 1','index name 1'])
df = pd.DataFrame(np.random.randint(10, size=(4,6)),
index=mux,
columns=list('ABCDEF')).rename_axis('col name', axis=1)
print (df)
col name A B C D E F
index name 1 index name 1
Apples a 5 4 0 5 2 2
Oranges b 5 8 2 5 9 9
Puppies c 7 6 0 7 8 3
Ducks d 6 5 0 1 6 0
print (df.index.name)
None
print (df.columns.name)
col name
print (df.index.names)
['index name 1', 'index name 1']
print (df.columns.names)
['col name']
df1 = df.rename_axis(('foo','bar'))
print (df1)
col name A B C D E F
foo bar
Apples a 5 4 0 5 2 2
Oranges b 5 8 2 5 9 9
Puppies c 7 6 0 7 8 3
Ducks d 6 5 0 1 6 0
df2 = df.rename_axis('baz', axis=1)
print (df2)
baz A B C D E F
index name 1 index name 1
Apples a 5 4 0 5 2 2
Oranges b 5 8 2 5 9 9
Puppies c 7 6 0 7 8 3
Ducks d 6 5 0 1 6 0
df2 = df.rename_axis(index=('foo','bar'), columns='baz')
print (df2)
baz A B C D E F
foo bar
Apples a 5 4 0 5 2 2
Oranges b 5 8 2 5 9 9
Puppies c 7 6 0 7 8 3
Ducks d 6 5 0 1 6 0
To remove the index and columns names, set them to None:
df2 = df.rename_axis(index=(None,None), columns=None)
print (df2)
A B C D E F
Apples a 6 9 9 5 4 6
Oranges b 2 6 7 4 3 5
Puppies c 6 3 6 3 5 1
Ducks d 4 9 1 3 0 5
For a MultiIndex in both index and columns, it is necessary to work with .names instead of .name and to set the names with a list or tuple:
mux1 = pd.MultiIndex.from_arrays([['Apples', 'Oranges', 'Puppies', 'Ducks'],
list('abcd')],
names=['index name 1','index name 1'])
mux2 = pd.MultiIndex.from_product([list('ABC'),
list('XY')],
names=['col name 1','col name 2'])
df = pd.DataFrame(np.random.randint(10, size=(4,6)), index=mux1, columns=mux2)
print (df)
col name 1 A B C
col name 2 X Y X Y X Y
index name 1 index name 1
Apples a 2 9 4 7 0 3
Oranges b 9 0 6 0 9 4
Puppies c 2 4 6 1 4 4
Ducks d 6 6 7 1 2 8
The plural form is necessary to check/set the values:
print (df.index.name)
None
print (df.columns.name)
None
print (df.index.names)
['index name 1', 'index name 1']
print (df.columns.names)
['col name 1', 'col name 2']
df1 = df.rename_axis(('foo','bar'))
print (df1)
col name 1 A B C
col name 2 X Y X Y X Y
foo bar
Apples a 2 9 4 7 0 3
Oranges b 9 0 6 0 9 4
Puppies c 2 4 6 1 4 4
Ducks d 6 6 7 1 2 8
df2 = df.rename_axis(('baz','bak'), axis=1)
print (df2)
baz A B C
bak X Y X Y X Y
index name 1 index name 1
Apples a 2 9 4 7 0 3
Oranges b 9 0 6 0 9 4
Puppies c 2 4 6 1 4 4
Ducks d 6 6 7 1 2 8
df2 = df.rename_axis(index=('foo','bar'), columns=('baz','bak'))
print (df2)
baz A B C
bak X Y X Y X Y
foo bar
Apples a 2 9 4 7 0 3
Oranges b 9 0 6 0 9 4
Puppies c 2 4 6 1 4 4
Ducks d 6 6 7 1 2 8
To remove the index and columns names, set them to None:
df2 = df.rename_axis(index=(None,None), columns=(None,None))
print (df2)
A B C
X Y X Y X Y
Apples a 2 0 2 5 2 0
Oranges b 1 7 5 5 4 8
Puppies c 2 4 6 3 6 5
Ducks d 9 6 3 9 7 0
And @Jeff's solution:
df.index.names = ['foo','bar']
df.columns.names = ['baz','bak']
print (df)
baz A B C
bak X Y X Y X Y
foo bar
Apples a 3 4 7 3 3 3
Oranges b 1 2 5 8 1 0
Puppies c 9 6 3 9 6 3
Ducks d 3 2 1 0 1 0
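The difference between .name and .names on a MultiIndex can be checked with fixed data instead of random numbers. A small sketch; the level names fruit, code, upper and lower are made up for illustration:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["Apples", "Oranges"], list("ab")], names=["fruit", "code"]
)
cols = pd.MultiIndex.from_product(
    [list("AB"), list("XY")], names=["upper", "lower"]
)
df = pd.DataFrame(np.arange(8).reshape(2, 4), index=idx, columns=cols)

# The singular attribute is None for a MultiIndex; the plural lists every level name.
print(df.index.name, list(df.index.names))

# rename_axis returns a renamed copy and takes one name per level.
df2 = df.rename_axis(index=["foo", "bar"], columns=["baz", "bak"])

# Assigning to .names changes the names in place.
df.columns.names = ["c1", "c2"]

print(list(df2.index.names), list(df2.columns.names), list(df.columns.names))
```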
df.index.name should do the trick.
Python has a dir function that lets you query object attributes. dir(df.index) was helpful here.
Use df.index.rename('foo', inplace=True) to set the index name.
It seems this API has been available since pandas 0.13.
If you do not want the name to take up a new header row, but simply want to put it in the empty top-left cell, then use:
df.columns.name = 'foo'
Otherwise use:
df.index.name = 'foo'
Setting the index name can also be accomplished at creation:
pd.DataFrame(data={'age': [10,20,30], 'height': [100, 170, 175]}, index=pd.Series(['a', 'b', 'c'], name='Tag'))
df.columns.values also gives us the column names.
The solution for multi-indexes is inside jezrael's cyclopedic answer, but it took me a while to find it so I am posting a new answer:
df.index.names gives the names of a multi-index (as a FrozenList).
To get just the index names, df.index.names works for both a single Index and a MultiIndex as of the most recent version of pandas.
As someone who found this while trying to find the best way to get a list of index names + column names, I would have found this answer useful:
names = list(filter(None, df.index.names + df.columns.values.tolist()))
This works for no index, single column Index, or MultiIndex. It avoids calling reset_index() which has an unnecessary performance hit for such a simple operation. I'm surprised there isn't a built in method for this (that I've come across). I guess I run into needing this more often because I'm shuttling data from databases where the dataframe index maps to a primary/unique key, but is really just another column to me.
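The one-liner above can be wrapped in a small helper. This is a sketch with a hypothetical function name; unlike filter(None, ...), it drops only names that are None, so a legitimate but falsy label such as 0 or an empty string would be kept:

```python
import pandas as pd


def index_and_column_names(df):
    """Return the named index levels followed by the column labels."""
    index_names = [name for name in df.index.names if name is not None]
    return index_names + df.columns.tolist()


plain = pd.DataFrame({"age": [10, 20], "height": [100, 170]})
tagged = plain.rename_axis("Tag")  # same data, but with a named index

print(index_and_column_names(plain))
print(index_and_column_names(tagged))
```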
