Python pandas: swap column values of a DataFrame slice

I have a DataFrame like:
df = pd.DataFrame({"type":['pet', 'toy', 'toy', 'car'], 'symbol': ['A', 'B', 'C', 'D'], 'desc': [1, 2, 3, 4]})
df
Out[22]:
  type symbol  desc
0  pet      A     1
1  toy      B     2
2  toy      C     3
3  car      D     4
My goal is to swap the values of symbol and desc for the rows whose type is toy:
  type symbol desc
0  pet      A    1
1  toy      2    B   # <-- B and 2 are swapped
2  toy      3    C   # <-- C and 3 are swapped
3  car      D    4
So I tried taking a slice first and then doing the swap on the slice, but it failed. My script, the warning, and the result are:
df[df['type']=='toy'][['symbol', 'desc']] = df[df['type']=='toy'][['desc', 'symbol']]
/anaconda3/lib/python3.8/site-packages/pandas/core/frame.py:3191: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[k1] = value[k2]
df
Out[31]:
  type symbol  desc
0  pet      A     1
1  toy      B     2   # <-- didn't work :(
2  toy      C     3
3  car      D     4
Any advice?

Let us do
m = df.type=='toy'
l = ['symbol','desc']
df.loc[m,l] = df.loc[m,l[::-1]].values
df
Out[89]:
  type symbol desc
0  pet      A    1
1  toy      2    B
2  toy      3    C
3  car      D    4
Or try rename, which swaps the column labels on the toy rows instead of moving the values; sort_index then restores the original row order after the concat:
m = df.type=='toy'
l = ['symbol','desc']
out = pd.concat([df[~m],df[m].rename(columns=dict(zip(l,l[::-1])))]).sort_index()

Let's try something like:
import pandas as pd
df = pd.DataFrame(
    {"type": ['pet', 'toy', 'toy', 'car'],
     'symbol': ['A', 'B', 'C', 'D'],
     'desc': [1, 2, 3, 4]})
m = df['type'] == 'toy'
df.loc[m, ['symbol', 'desc']] = df.loc[m, ['desc', 'symbol']].to_numpy()
print(df)
Output:
  type symbol desc
0  pet      A    1
1  toy      2    B
2  toy      3    C
3  car      D    4
Use to_numpy() / values to prevent the assigned values from being aligned back to their original column labels.
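To see why the to_numpy() matters, here is a minimal sketch of what happens without it: with a DataFrame on the right-hand side, .loc aligns on column labels during assignment, so every value lands back in its original column and nothing changes.
import pandas as pd

df = pd.DataFrame({"type": ['pet', 'toy', 'toy', 'car'],
                   'symbol': ['A', 'B', 'C', 'D'],
                   'desc': [1, 2, 3, 4]})
m = df['type'] == 'toy'

# Without .to_numpy(), pandas aligns the right-hand side on column labels,
# so 'desc' goes back to 'desc' and 'symbol' back to 'symbol': no swap.
df.loc[m, ['symbol', 'desc']] = df.loc[m, ['desc', 'symbol']]
print(df)  # unchanged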

You can also use pandas where:
df[['symbol', 'desc']] = df[['desc', 'symbol']].where(
    df['type'] == 'toy', df[['symbol', 'desc']].values)
Output:
  type symbol desc
0  pet      A    1
1  toy      2    B
2  toy      3    C
3  car      D    4

You can also use list(zip(...)):
m = df['type']=='toy'
df.loc[m, ['symbol', 'desc']] = list(zip(df.loc[m, 'desc'], df.loc[m, 'symbol']))
Or simply use .values:
m = df['type']=='toy'
df.loc[m, ['symbol', 'desc']] = df.loc[m, ['desc', 'symbol']].values
Result:
print(df)
  type symbol desc
0  pet      A    1
1  toy      2    B
2  toy      3    C
3  car      D    4

Related

Remove all rows whose ID matches a row with a duplicated value in another column

1. Input: we have a dataframe
ID name
1 a
1 b
2 a
2 c
3 d
2. Now I remove the later duplicates of 'name', keeping only the first occurrence (the duplicate dropped here is 'a', whose ID is '2'). Output:
ID name
1 a
1 b
2 c
3 d
Code I used:
df.loc[~df.duplicated(keep='first', subset=['name'])]
3. Now I want to remove all rows sharing the ID of the dropped duplicate (the removed 'a' had ID '2', so we remove all rows with ID '2', i.e. [2, c]). Final expected output:
ID name
1 a
1 b
3 d
Code I tried, but it is not working:
dt = df.name.duplicated(keep='first')
df.loc[~df.groupby(['ID','dt']).size().reset_index().drop(columns={0})]
You can use some kind of blacklist for the IDs:
Sample data:
import pandas as pd
d = {'ID':[1, 1, 2, 2, 3], 'name':['a', 'b', 'a', 'c', 'd']}
df = pd.DataFrame(d)
Code:
df[~df['ID'].isin(df[df['name'].duplicated()]['ID'])]
Output:
ID name
0 1 a
1 1 b
4 3 d
Code simplified:
blacklist = df[df['name'].duplicated()]['ID']
mask = ~df['ID'].isin(blacklist)
df[mask]
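The same blacklist logic can also be written with groupby/transform; a minimal sketch (same semantics: drop every ID that owns a non-first duplicate name):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3], 'name': ['a', 'b', 'a', 'c', 'd']})

# Flag non-first occurrences of each name, then drop every ID that
# owns at least one flagged row.
dup = df['name'].duplicated()                 # [False, False, True, False, False]
bad_id = dup.groupby(df['ID']).transform('any')
print(df[~bad_id])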
If the DataFrame is ordered by ID, these two approaches should work:
df = pd.DataFrame(data={'ID': [1, 1, 1, 2, 3], 'name': ['a', 'b', 'a', 'c', 'd']})
df1 = df.loc[~df.duplicated(keep='first', subset=['ID'])]
df2 = df1.loc[~df1.duplicated(keep='first', subset=['name'])]
print(df2)
print(df.drop_duplicates(keep='first', subset=['ID']).drop_duplicates(keep='first', subset=['name']))
ID name
0 1 a
3 2 c
4 3 d
If it is ordered by name, use subset=['name'] first and then subset=['ID'].

How to quickly select dataframe rows by multiple columns in pandas

I want to filter rows by the values of multiple columns.
For example, given the following dataframe,
import pandas as pd
df = pd.DataFrame({"name":["Amy", "Amy", "Amy", "Bob", "Bob",],
"group":[1, 1, 1, 1, 2],
"place":['a', 'a', "a", 'b', 'b'],
"y":[1, 2, 3, 1, 2]
})
print(df)
Original dataframe:
  name  group place  y
0  Amy      1     a  1
1  Amy      1     a  2
2  Amy      1     a  3
3  Bob      1     b  1
4  Bob      2     b  2
I want to select the rows whose [name, group, place] combination appears in selectRow.
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]
Then the expected dataframe is :
  name  group place  y
0  Amy      1     a  1
1  Amy      1     a  2
2  Amy      1     a  3
I have tried it, but my method is not efficient and runs for a long time, especially when there are many rows in the original dataframe.
My simple method:
newdf = pd.DataFrame({})
for item in selectRow:
    print(item)
    tmp = df.loc[(df['name'] == item[0]) & (df['group'] == item[1]) & (df['place'] == item[2])]
    newdf = newdf.append(tmp)
newdf = newdf.reset_index(drop=True)
newdf.tail()
print(newdf)
I hope for a more efficient way to achieve this.
Try using isin:
print(df[df['name'].isin(list(zip(*selectRow))[0])
         & df['group'].isin(list(zip(*selectRow))[1])
         & df['place'].isin(list(zip(*selectRow))[2])])
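One caveat: isin tests each column independently, so a row like ('Amy', 2, 'a') would pass all three checks even though that combination is not in selectRow. A minimal sketch of an exact-combination filter via an inner merge (using the df and selectRow from the question):
import pandas as pd

df = pd.DataFrame({"name": ["Amy", "Amy", "Amy", "Bob", "Bob"],
                   "group": [1, 1, 1, 1, 2],
                   "place": ['a', 'a', 'a', 'b', 'b'],
                   "y": [1, 2, 3, 1, 2]})
selectRow = [["Amy", 1, "a"], ["Amy", 2, "b"]]

# Build a small frame of the wanted key combinations and inner-join on them;
# only exact [name, group, place] matches survive.
keys = pd.DataFrame(selectRow, columns=['name', 'group', 'place'])
print(df.merge(keys, on=['name', 'group', 'place']))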

Add a column with the string length of another column, and its cumsum?

Given the following dataframe:
df = pd.DataFrame({'col1': ["kuku", "pu", "d", "fgf"]})
I want to calculate the length of each string and add a cumsum column.
I am trying to do this with df.str.len("col1") but it throws an error.
Use str.len()
Ex:
import pandas as pd
df = pd.DataFrame({"col1": ["kuku", "pu", "d", "fgf"]})
df["New"] = df["col1"].str.len()
print(df)
print(df["New"].cumsum()) #cumulative sum
Output:
   col1  New
0  kuku    4
1    pu    2
2     d    1
3   fgf    3
0 4
1 6
2 7
3 10
Name: New, dtype: int64
The dataframe initialization code is wrong. Try this.
>>> df = pd.DataFrame({'col1': ["kuku", "pu", "d", "fgf"]})
>>> df
col1
0 kuku
1 pu
2 d
3 fgf
Alternatively, you can use map.
>>> df.col1.map(lambda x: len(x))
0 4
1 2
2 1
3 3
To calculate the length:
>>> df['len'] = df.col1.str.len()
>>> df
col1 len
0 kuku 4
1 pu 2
2 d 1
3 fgf 3
Or
import pandas as pd
df = pd.DataFrame({ "col1" : ["kuku", "pu", "d", "fgf"]})
df['new'] = df.col1.apply(lambda x: len(x))
Your col1 argument is an unknown argument to pd.DataFrame(). Use data as the argument name instead, then add your new column with the lengths:
data = {'col1': ["kuku", "pu", "d", "fgf"]}
df = pd.DataFrame(data=data)
df["col1 lenghts"] = df["col1"].str.len()
print(df)
Here is another alternative I think solved my issue:
df = pd.DataFrame({"col1": ['dilly macaroni recipe salad', 'gazpacho', 'bake crunchy onion potato', 'cool creamy easy pie watermelon', 'beef easy skillet tropical', 'chicken grilled tea thigh', 'cake dump rhubarb strawberry', 'parfaits yogurt', 'bread nut zucchini', 'la salad salmon']})
df["title_len"] = df[1].str.len()
df["cum_len"] = df["title_len"].cumsum()

How can I mark a row when it meets a condition?

If I have a dataframe,
df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'john_01': [1, 2, 3],
    'mary_02': [4, 5, 6],
})
I'd like to prepend a '#' mark to the name whenever the value in the 'name' column appears in a given list (here containing 'A' and 'B'). The result should look like the output below. Does anyone know an elegant way to do this with pandas?
name_list = ['A','B','D'] # But we only have A and B in df.
   john_01  mary_02 name
0        1        4   #A
1        2        5   #B
2        3        6    C
If name_list has the same length as the Series name, then you could try this:
df1['name_list'] = ['A','B','D']
df1.loc[df1.name == df1.name_list, 'name'] = '#' + df1.name
This would only prepend a '#' where the values of name and name_list match at the current index.
In [81]: df1
Out[81]:
john_01 mary_02 name name_list
0 1 4 #A A
1 2 5 #B B
2 3 6 C D
In [82]: df1.drop('name_list', axis=1, inplace=True) # Drop the helper column
If the two are not the same length (and you therefore don't care about the index), then you could try this:
In [84]: name_list = ['A','B','D']
In [87]: df1.loc[df1.name.isin(name_list), 'name'] = '#'+df1.name
In [88]: df1
Out[88]:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
I hope this helps.
Use the df.loc[row_indexer, column_indexer] indexer together with the isin method of a Series object:
df.loc[df.name.isin(name_list), 'name'] = '#'+df.name
print(df)
The output:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
http://pandas.pydata.org/pandas-docs/stable/indexing.html
You can use isin to check whether the name is in the list, and numpy.where to prepend '#':
import numpy as np
df['name'] = np.where(df['name'].isin(name_list), '#', '') + df['name']
df
Out:
john_01 mary_02 name
0 1 4 #A
1 2 5 #B
2 3 6 C
import pandas as pd

def exclude_list(x):
    list_exclude = ['A', 'B']
    if x in list_exclude:
        x = '#' + x
    return x

df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'john_01': [1, 2, 3],
    'mary_02': [4, 5, 6],
})
df['name'] = df['name'].apply(lambda row: exclude_list(row))
print(df)

How can I look up the ID of a pandas data frame in another data frame in Python?

Hello, I have the following DataFrame:
df =
ID Value
a 45
b 3
c 10
And another dataframe with the numeric ID of each value
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd

df1 = pd.DataFrame({
    'ID': ['a', 'b', 'c'],
    'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
    'ID': ['a', 'b', 'c', 'd', 'e'],
    'ID_n': [3, 35, 0, 7, 1],
})
df1.set_index(['ID'], drop=False, inplace=True)
df2.set_index(['ID'], drop=False, inplace=True)
print(pd.merge(df1, df2, on="ID", how='left'))
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You could use join(),
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want the index to be numeric, you could reset_index():
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Set the index to ID on both frames, join them, then reset the index to recover the original dataframe with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an inplace set_index on df1, its structure remains the same (i.e. you don't change its indexing).
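A map-based sketch also works and avoids touching the index at all (assuming df and df1 as defined in the question):
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'b', 'c'], 'Value': [45, 3, 10]})
df1 = pd.DataFrame({'ID': ['a', 'b', 'c', 'd', 'e'], 'ID_n': [3, 35, 0, 7, 1]})

# Build an ID -> ID_n lookup Series from df1 and map it onto df's ID column.
df['ID_n'] = df['ID'].map(df1.set_index('ID')['ID_n'])
print(df)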
