Multi-column replacement in Pandas based on row selection

There are a plethora of questions on SO about how to select rows in a DataFrame and replace values in a column in those rows, but one use case is missing. To use the example DataFrame from this question,
In [1]: df
Out[1]:
   apple  banana cherry
0      0       3   good
1      1       4    bad
2      2       5   good
And this works if one wants to change a single column based on another:
df.loc[df.cherry == 'bad', 'apple'] = df.banana * 2
Or this sets the values in two columns:
df.loc[df.cherry == 'bad', ['apple', 'banana']] = np.nan
But this doesn't work:
df.loc[df.cherry == 'bad', ['apple', 'banana']] = [df.banana, df.apple]
because apparently the right side is 3x2, while the left side is 1x2, hence the error message:
ValueError: Must have equal len keys and value when setting with an ndarray
So I understand what the problem is, but what is the solution?

IIUC you can try:
df['a'] = df.apple * 3
df['b'] = df.banana * 2
print(df)
   apple  banana cherry  a   b
0      0       3   good  0   6
1      1       4    bad  3   8
2      2       5   good  6  10
df[['a', 'b']] = df.loc[df.cherry == 'bad', ['apple', 'banana']]
print(df)
   apple  banana cherry    a    b
0      0       3   good  NaN  NaN
1      1       4    bad  1.0  4.0
2      2       5   good  NaN  NaN
Or use the condition with .values:
df['a'] = df.apple * 3
df['b'] = df.banana * 2
df.loc[df.cherry == 'bad', ['apple', 'banana']] = \
    df.loc[df.cherry == 'bad', ['a', 'b']].values
print(df)
   apple  banana cherry  a   b
0      0       3   good  0   6
1      3       8    bad  3   8
2      2       5   good  6  10
Another option, with the original columns (starting again from the original df):
print(df[['apple','banana']].shift() * 2)
   apple  banana
0    NaN     NaN
1    0.0     6.0
2    2.0     8.0
df.loc[df.cherry == 'bad', ['apple', 'banana']] = df[['apple','banana']].shift() * 2
print(df)
   apple  banana cherry
0    0.0     3.0   good
1    0.0     6.0    bad
2    2.0     5.0   good
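For the two-column swap asked about in the question itself, a direct route (a sketch, not part of the answer above) is to select the right-hand side on the same filtered rows and strip its labels with .values, so both sides have the same 1x2 shape:
mask = df.cherry == 'bad'
df.loc[mask, ['apple', 'banana']] = df.loc[mask, ['banana', 'apple']].values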

Related

Search N consecutive rows with same value in one dataframe

I need to write Python code that searches for N (a variable) consecutive rows in a DataFrame column that have the same value (and are not NaN), like this.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea how I can do it?
Fruit   2 matches  5 matches
Apple   No         No
NaN     No         No
Pear    No         No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        Yes
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
Banana  No         No
Banana  Yes        No
Update: testing the solution by @Corralien
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode returns the message 'Frame skipped from debugging during step-in.' when executing the last line, and an exception is raised in the previous for loop.
Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
     Fruit 2 matches 5 matches
0    Apple        No        No
1      NaN        No        No
2     Pear        No        No
3     Pear       Yes        No
4     Pear       Yes        No
5     Pear       Yes        No
6     Pear       Yes       Yes
7      NaN        No        No
8      NaN        No        No
9      NaN        No        No
10     NaN        No        No
11     NaN        No        No
12  Banana        No        No
13  Banana       Yes        No
>>> counts
0     1
1     0
2     1
3     2
4     3
5     4
6     5
7     0
8     0
9     0
10    0
11    0
12    1
13    2
dtype: int64
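Since N is meant to be a variable, the same counter can serve any number of required matches in a loop (a small sketch reusing counts and the question's column naming):
for N in (2, 5):
    df[f'{N} matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})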
Update: if I need to change "Yes" to the fruit name, for example:
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
     Fruit 2 matches
0    Apple        No
1      NaN        No
2     Pear        No
3     Pear      Pear
4     Pear      Pear
5     Pear      Pear
6     Pear      Pear
7      NaN        No
8      NaN        No
9      NaN        No
10     NaN        No
11     NaN        No
12  Banana        No
13  Banana    Banana

Pandas merging/joining tables with multiple key columns and duplicating rows where necessary

I have several tables that contain lab results, with a 'master' table of sample data with things like a description. The results tables are also broken down by specimen (sub-samples). They contain multiple results columns - I'm just showing one here. I want to combine all the results tables into one dataframe, like this:
Table 1:
Location  Sample  Description
1         A       Yellow
1         B       Red
2         A       Blue
2         B       Violet
Table 2:
Location  Sample  Specimen  Result1
1         A       X         5
1         A       Y         6
1         B       X         10
2         A       X         1
Table 3:
Location  Sample  Specimen  Result2
1         A       X         "Heavy"
1         A       Q         "Soft"
1         B       K         "Grey"
2         B       Z         "Bananas"
Desired Output:
Location  Sample  Description  Specimen  Result1  Result2
1         A       Yellow       X         5        "Heavy"
1         A       Yellow       Y         6        nan
1         A       Yellow       Q         nan      "Soft"
1         B       Red          X         10       nan
1         B       Red          K         nan      "Grey"
2         A       Blue         X         1        nan
2         B       Violet       Z         nan      "Bananas"
I currently have a solution for this using iterrows() and df.append(), but these are both slow operations, and when there are thousands of results it takes too long. Is there a better way? I have tried using join() and merge(), but I can't seem to get the result I want.
Quick code to reproduce my dataframes:
dict1 = {'Location': [1,1,2,2], 'Sample': ['A','B','A','B'], 'Description': ['Yellow','Red','Blue','Violet']}
dict2 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','A'], 'Specimen': ['x', 'y','x', 'x'], 'Result1': [5,6,10,1]}
dict3 = {'Location': [1,1,1,2], 'Sample': ['A','A','B','B'], 'Specimen': ['x', 'q','k', 'z'], 'Result2': ["Heavy","Soft","Grey","Bananas"]}
df1 = pd.DataFrame.from_dict(dict1)
df2 = pd.DataFrame.from_dict(dict2)
df3 = pd.DataFrame.from_dict(dict3)
First idea is to join df2 and df3 together with concat, aggregate rows that share the same 'Location','Sample','Specimen' key by sum, and finally merge to df1 (the sum aggregation suits numeric result columns; see the EDIT below for the string data shown here):
df23 = (pd.concat([df2, df3])
          .groupby(['Location','Sample','Specimen'], as_index=False, sort=False)
          .sum(min_count=1))
df = df1.merge(df23, on=['Location','Sample'])
print (df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        x      5.0      4.0
1         1      A      Yellow        y      6.0      NaN
2         1      A      Yellow        q      NaN      6.0
3         1      B         Red        x     10.0      NaN
4         1      B         Red        k      NaN      8.0
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN      5.0
Or, if all rows in df2, df3 are unique per columns ['Location','Sample','Specimen'], the solution is simpler:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']),
                  df3.set_index(['Location','Sample','Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        q      NaN      6.0
1         1      A      Yellow        x      5.0      4.0
2         1      A      Yellow        y      6.0      NaN
3         1      B         Red        k      NaN      8.0
4         1      B         Red        x     10.0      NaN
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN      5.0
EDIT: With the new (string) data the second solution works well:
df23 = pd.concat([df2.set_index(['Location','Sample','Specimen']),
                  df3.set_index(['Location','Sample','Specimen'])], axis=1)
df = df1.merge(df23.reset_index(), on=['Location','Sample'])
print (df)
   Location Sample Description Specimen  Result1  Result2
0         1      A      Yellow        q      NaN     Soft
1         1      A      Yellow        x      5.0    Heavy
2         1      A      Yellow        y      6.0      NaN
3         1      B         Red        k      NaN     Grey
4         1      B         Red        x     10.0      NaN
5         2      A        Blue        x      1.0      NaN
6         2      B      Violet        z      NaN  Bananas
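A plain outer merge of the result tables should give the same result without the set_index step (a sketch, assuming the ['Location','Sample','Specimen'] combinations are unique within each result table):
df23 = df2.merge(df3, on=['Location','Sample','Specimen'], how='outer')
df = df1.merge(df23, on=['Location','Sample'])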

How do I name the column and index in a Pandas dataframe? [duplicate]

How do I get the index column name in python pandas? Here's an example dataframe:
             Column 1
Index Title
Apples              1
Oranges             2
Puppies             3
Ducks               4
What I'm trying to do is get/set the dataframe index title. Here is what I tried:
import pandas as pd

data = {'Column 1': [1., 2., 3., 4.],
        'Index Title': ["Apples", "Oranges", "Puppies", "Ducks"]}
df = pd.DataFrame(data)
df.index = df["Index Title"]
del df["Index Title"]
print(df)
Anyone know how to do this?
You can just get/set the index via its name property
In [7]: df.index.name
Out[7]: 'Index Title'
In [8]: df.index.name = 'foo'
In [9]: df.index.name
Out[9]: 'foo'
In [10]: df
Out[10]:
         Column 1
foo
Apples          1
Oranges         2
Puppies         3
Ducks           4
You can use rename_axis; for removing the name, set it to None:
d = {'Index Title': ['Apples', 'Oranges', 'Puppies', 'Ducks'],'Column 1': [1.0, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d).set_index('Index Title')
print (df)
             Column 1
Index Title
Apples            1.0
Oranges           2.0
Puppies           3.0
Ducks             4.0
print (df.index.name)
Index Title
print (df.columns.name)
None
The new functionality works well in method chains.
df = df.rename_axis('foo')
print (df)
         Column 1
foo
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0
You can also rename column names with the parameter axis:
d = {'Index Title': ['Apples', 'Oranges', 'Puppies', 'Ducks'],'Column 1': [1.0, 2.0, 3.0, 4.0]}
df = pd.DataFrame(d).set_index('Index Title').rename_axis('Col Name', axis=1)
print (df)
Col Name     Column 1
Index Title
Apples            1.0
Oranges           2.0
Puppies           3.0
Ducks             4.0
print (df.index.name)
Index Title
print (df.columns.name)
Col Name
print(df.rename_axis('foo').rename_axis("bar", axis="columns"))
bar      Column 1
foo
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0
print(df.rename_axis('foo').rename_axis("bar", axis=1))
bar      Column 1
foo
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0
From pandas 0.24.0+ it is possible to use the parameters index and columns:
df = df.rename_axis(index='foo', columns="bar")
print (df)
bar      Column 1
foo
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0
Removing the index and columns names means setting them to None:
df = df.rename_axis(index=None, columns=None)
print (df)
         Column 1
Apples        1.0
Oranges       2.0
Puppies       3.0
Ducks         4.0
If there is a MultiIndex in the index only:
import numpy as np

mux = pd.MultiIndex.from_arrays([['Apples', 'Oranges', 'Puppies', 'Ducks'],
                                 list('abcd')],
                                names=['index name 1','index name 1'])
df = pd.DataFrame(np.random.randint(10, size=(4,6)),
                  index=mux,
                  columns=list('ABCDEF')).rename_axis('col name', axis=1)
print (df)
col name                   A  B  C  D  E  F
index name 1 index name 1
Apples       a             5  4  0  5  2  2
Oranges      b             5  8  2  5  9  9
Puppies      c             7  6  0  7  8  3
Ducks        d             6  5  0  1  6  0
print (df.index.name)
None
print (df.columns.name)
col name
print (df.index.names)
['index name 1', 'index name 1']
print (df.columns.names)
['col name']
df1 = df.rename_axis(('foo','bar'))
print (df1)
col name     A  B  C  D  E  F
foo     bar
Apples  a    5  4  0  5  2  2
Oranges b    5  8  2  5  9  9
Puppies c    7  6  0  7  8  3
Ducks   d    6  5  0  1  6  0
df2 = df.rename_axis('baz', axis=1)
print (df2)
baz                        A  B  C  D  E  F
index name 1 index name 1
Apples       a             5  4  0  5  2  2
Oranges      b             5  8  2  5  9  9
Puppies      c             7  6  0  7  8  3
Ducks        d             6  5  0  1  6  0
df2 = df.rename_axis(index=('foo','bar'), columns='baz')
print (df2)
baz          A  B  C  D  E  F
foo     bar
Apples  a    5  4  0  5  2  2
Oranges b    5  8  2  5  9  9
Puppies c    7  6  0  7  8  3
Ducks   d    6  5  0  1  6  0
Removing the index and columns names means setting them to None:
df2 = df.rename_axis(index=(None,None), columns=None)
print (df2)
           A  B  C  D  E  F
Apples  a  6  9  9  5  4  6
Oranges b  2  6  7  4  3  5
Puppies c  6  3  6  3  5  1
Ducks   d  4  9  1  3  0  5
For a MultiIndex in both the index and the columns, it is necessary to work with .names instead of .name, and to set by list or tuple:
mux1 = pd.MultiIndex.from_arrays([['Apples', 'Oranges', 'Puppies', 'Ducks'],
                                  list('abcd')],
                                 names=['index name 1','index name 1'])
mux2 = pd.MultiIndex.from_product([list('ABC'),
                                   list('XY')],
                                  names=['col name 1','col name 2'])
df = pd.DataFrame(np.random.randint(10, size=(4,6)), index=mux1, columns=mux2)
print (df)
col name 1                 A     B     C
col name 2                 X  Y  X  Y  X  Y
index name 1 index name 1
Apples       a             2  9  4  7  0  3
Oranges      b             9  0  6  0  9  4
Puppies      c             2  4  6  1  4  4
Ducks        d             6  6  7  1  2  8
The plural versions are necessary to check/set the values:
print (df.index.name)
None
print (df.columns.name)
None
print (df.index.names)
['index name 1', 'index name 1']
print (df.columns.names)
['col name 1', 'col name 2']
df1 = df.rename_axis(('foo','bar'))
print (df1)
col name 1   A     B     C
col name 2   X  Y  X  Y  X  Y
foo     bar
Apples  a    2  9  4  7  0  3
Oranges b    9  0  6  0  9  4
Puppies c    2  4  6  1  4  4
Ducks   d    6  6  7  1  2  8
df2 = df.rename_axis(('baz','bak'), axis=1)
print (df2)
baz                        A     B     C
bak                        X  Y  X  Y  X  Y
index name 1 index name 1
Apples       a             2  9  4  7  0  3
Oranges      b             9  0  6  0  9  4
Puppies      c             2  4  6  1  4  4
Ducks        d             6  6  7  1  2  8
df2 = df.rename_axis(index=('foo','bar'), columns=('baz','bak'))
print (df2)
baz          A     B     C
bak          X  Y  X  Y  X  Y
foo     bar
Apples  a    2  9  4  7  0  3
Oranges b    9  0  6  0  9  4
Puppies c    2  4  6  1  4  4
Ducks   d    6  6  7  1  2  8
Removing the index and columns names means setting them to None:
df2 = df.rename_axis(index=(None,None), columns=(None,None))
print (df2)
           A     B     C
           X  Y  X  Y  X  Y
Apples  a  2  0  2  5  2  0
Oranges b  1  7  5  5  4  8
Puppies c  2  4  6  3  6  5
Ducks   d  9  6  3  9  7  0
And @Jeff's solution:
df.index.names = ['foo','bar']
df.columns.names = ['baz','bak']
print (df)
baz          A     B     C
bak          X  Y  X  Y  X  Y
foo     bar
Apples  a    3  4  7  3  3  3
Oranges b    1  2  5  8  1  0
Puppies c    9  6  3  9  6  3
Ducks   d    3  2  1  0  1  0
df.index.name should do the trick.
Python has a dir function that lets you query object attributes. dir(df.index) was helpful here.
Use df.index.rename('foo', inplace=True) to set the index name.
Seems this API is available since pandas 0.13.
If you do not want to create a new row but simply put the name in the empty cell, then use:
df.columns.name = 'foo'
Otherwise use:
df.index.name = 'foo'
Setting the index name can also be accomplished at creation:
pd.DataFrame(data={'age': [10,20,30], 'height': [100, 170, 175]}, index=pd.Series(['a', 'b', 'c'], name='Tag'))
df.columns.values also gives us the column names.
The solution for multi-indexes is inside jezrael's cyclopedic answer, but it took me a while to find it so I am posting a new answer:
df.index.names gives the names of a multi-index (as a FrozenList).
To just get the index column names, df.index.names will work for both a single Index and a MultiIndex as of the most recent version of pandas.
As someone who found this while trying to find the best way to get a list of index names + column names, I would have found this answer useful:
names = list(filter(None, df.index.names + df.columns.values.tolist()))
This works for no index, a single-column Index, or a MultiIndex. It avoids calling reset_index(), which has an unnecessary performance hit for such a simple operation. I'm surprised there isn't a built-in method for this (that I've come across). I guess I run into needing this more often because I'm shuttling data from databases where the dataframe index maps to a primary/unique key, but is really just another column to me.

How to expand a df by different dict as columns?

I have a df with different dicts as entries in a column, in my case column "information". I would like to expand the df by all possible dict.keys(), something like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5]),
                   'name': pd.Series(['banana', 'apple', 'orange', 'strawberry', 'toast']),
                   'information': pd.Series([{'shape': 'curve', 'color': 'yellow'},
                                             {'color': 'red'},
                                             {'shape': 'round'},
                                             {'amount': 500},
                                             np.nan]),
                   'cost': pd.Series([1, 2, 2, 10, 4])})
   id        name                            information  cost
0   1      banana  {'shape': 'curve', 'color': 'yellow'}     1
1   2       apple                       {'color': 'red'}     2
2   3      orange                     {'shape': 'round'}     2
3   4  strawberry                        {'amount': 500}    10
4   5       toast                                    NaN     4
Should look like this:
   id        name  shape   color  amount  cost
0   1      banana  curve  yellow     NaN     1
1   2       apple    NaN     red     NaN     2
2   3      orange  round     NaN     NaN     2
3   4  strawberry    NaN     NaN   500.0    10
4   5       toast    NaN     NaN     NaN     4
Another approach would be using pandas.DataFrame.from_records:
import pandas as pd
new = pd.DataFrame.from_records(df.pop('information').apply(lambda x: {} if pd.isna(x) else x))
new = pd.concat([df, new], axis=1)
print(new)
Output:
   cost  id        name  amount   color  shape
0     1   1      banana     NaN  yellow  curve
1     2   2       apple     NaN     red    NaN
2     2   3      orange     NaN     NaN  round
3    10   4  strawberry   500.0     NaN    NaN
4     4   5       toast     NaN     NaN    NaN
You can use:
d = {k: {} if v != v else v for k, v in df.pop('information').items()}  # v != v is True only for NaN
df1 = pd.DataFrame.from_dict(d, orient='index')
df = pd.concat([df, df1], axis=1)
print(df)
   id        name  cost  shape   color  amount
0   1      banana     1  curve  yellow     NaN
1   2       apple     2    NaN     red     NaN
2   3      orange     2  round     NaN     NaN
3   4  strawberry    10    NaN     NaN   500.0
4   5       toast     4    NaN     NaN     NaN
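pandas.json_normalize offers a similar route (a sketch; the NaN entry is mapped to an empty dict first, since json_normalize expects dict records):
info = pd.json_normalize(df.pop('information').apply(lambda x: x if isinstance(x, dict) else {}))
df = df.join(info)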

update pandas groupby group with column value

I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple','Apple','Orange','Orange','Orange','Pears','Pears'],
                   'B': [1,2,9,6,4,3,2,1]})
        A  B
0   Apple  1
1   Apple  2
2   Apple  9
3  Orange  6
4  Orange  4
5  Orange  3
6   Pears  2
7   Pears  1
Now I need to add a new column with the respective % differences in col 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this which looks promising
Pandas Groupby and Sum Only One Column
I need to find the maximum % change in col 'B' per group of col 'A' and insert it into the col 'maxpercchng' (for all rows in the group).
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add to all rows in group the maxpercchng col?
I believe you need transform, which returns a Series of the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
        A  B  maxpercchng
0   Apple  1        800.0
1   Apple  2        800.0
2   Apple  9        800.0
3  Orange  6         50.0
4  Orange  4         50.0
5  Orange  3         50.0
6   Pears  2         50.0
7   Pears  1         50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
        A      B
0   Apple  800.0
1  Orange   50.0
2   Pears   50.0
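The same ratio can also be written as one transform with a lambda (a sketch; a Python lambda per group is typically slower than the three vectorized transforms above):
df['maxpercchng'] = (df.groupby('A')['B']
                       .transform(lambda s: (s.max() - s.min()) / s.iloc[0] * 100))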
