How can I swap specific column values in Dataframe? - python

I have a large dataframe; this is a sample of it.
In some rows the City and Score values are swapped, and I want to swap the Muscat and Shanghai values back into the City column.
df =
City Score
Istanbul 6.0749
2.23607 Muscat
Prague 4.38576
1.85958 Shanghai
Istanbul 6.0749
Singapore 5.17054
Output:
df =
City Score
Istanbul 6.0749
Muscat 2.23607
Prague 4.38576
Shanghai 1.85958
Istanbul 6.0749
Singapore 5.17054
I am confused about how to apply the condition while iterating through the dataframe; is there any other alternative?
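For reference, a minimal construction of the sample data shown above, so the answers below can be run as-is (the misplaced City/Score values are deliberate):
import pandas as pd

df = pd.DataFrame({
    'City':  ['Istanbul', 2.23607, 'Prague', 1.85958, 'Istanbul', 'Singapore'],
    'Score': [6.0749, 'Muscat', 4.38576, 'Shanghai', 6.0749, 5.17054],
})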

Use to_numeric with notna to build a boolean mask, then swap with loc:
m = pd.to_numeric(df['City'], errors='coerce').notna()
#older versions of pandas
#m = pd.to_numeric(df['City'], errors='coerce').notnull()
df.loc[m,['City','Score']] = df.loc[m,['Score','City']].values #.values prevents alignment by column name, so the swap actually happens
print (df)
City Score
0 Istanbul 6.0749
1 Muscat 2.23607
2 Prague 4.38576
3 Shanghai 1.85958
4 Istanbul 6.0749
5 Singapore 5.17054

You can use:
In [39]: mask = pd.to_numeric(df.Score, errors='coerce').isna()
In [40]: s = df.Score.copy()
In [41]: df.loc[mask, 'Score'] = df.City  # .loc avoids chained assignment
In [42]: df.loc[mask, 'City'] = s
In [43]: df
Out[43]:
City Score
0 Istanbul 6.0749
1 Muscat 2.23607
2 Prague 4.38576
3 Shanghai 1.85958
4 Istanbul 6.0749
5 Singapore 5.17054


Conditionals with NaN in python

I have a simple DataFrame like the following:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1996 326
5 San Franciso 49ers 1950 1003
I want to select all values from the 'First Season' column and replace those that are over 1990 by 1. In this example, only Baltimore Ravens would have the 1996 replaced by 1 (keeping the rest of the data intact).
I have used the following:
df.loc[(df['First Season'] > 1990)] = 1
But, it replaces all the values in that row by 1, and not just the values in the 'First Season' column.
How can I replace just the values from that column?
You need to select that column:
In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df
Out[41]:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 1950 1003
So the syntax here is:
df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]
You can check the docs and also '10 Minutes to pandas', which shows the semantics.
EDIT
If you want to generate a boolean indicator, then you can just use the boolean condition to generate a boolean Series and cast the dtype to int; this will convert True and False to 1 and 0 respectively:
In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df
Out[43]:
Team First Season Total Games
0 Dallas Cowboys 0 894
1 Chicago Bears 0 1357
2 Green Bay Packers 0 1339
3 Miami Dolphins 0 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 0 1003
A bit late to the party but still - I prefer using numpy where:
import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])
df.loc[df['First Season'] > 1990, 'First Season'] = 1
Explanation:
df.loc takes two arguments, 'row index' and 'column index'. We check whether each row's value in the 'First Season' column is greater than 1990, and then replace it with 1.
df['First Season'].loc[(df['First Season'] > 1990)] = 1
Strange that nobody has this answer; the only part missing from your code is the ['First Season'] right after df.
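Note that this chained form can raise SettingWithCopyWarning in recent versions of pandas; the single-step df.loc[df['First Season'] > 1990, 'First Season'] = 1 from the accepted answer is the safer equivalent.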
For a single condition, i.e. (df['employrate'] > 70):
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra NaN 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
Therefore the syntax here is:
df.loc[<mask>(here mask is generating the labels to index) , <optional column(s)> ]
For multiple conditions ie. (df['employrate'] <=55) & (df['employrate'] > 50)
use this:
df['employrate'] = np.where(
(df['employrate'] <=55) & (df['employrate'] > 50) , 11, df['employrate']
)
Out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
Therefore the syntax here is:
df['<column_name>'] = np.where((<filter 1>) & (<filter 2>), <new value>, df['<column_name>'])
Another option is to use a list comprehension:
df['First Season'] = [1 if year > 1990 else year for year in df['First Season']]
You can also use mask which replaces the values where the condition is met:
df['First Season'] = df['First Season'].mask(lambda col: col > 1990, 1)
We can update the First Season column in df with the following syntax:
df['First Season'] = expression_for_new_values
To map the values in First Season we can use pandas' .map() method with the below syntax (note that values missing from the dict become NaN):
data_frame['column'].map({'initial_value_1':'updated_value_1','initial_value_2':'updated_value_2'})
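For example, a rough sketch of .map() on the data above, restoring unmapped years with fillna:
df['First Season'] = (
    df['First Season']
    .map({1996: 1})              # only 1996 is remapped
    .fillna(df['First Season'])  # keep all other years
    .astype(int)
)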

Dataframe - filter the values of a particular column with isin()

I have a pandas dataframe in which I have the column "Bio Location", I would like to filter it so that I only have the locations of my list in which there are names of cities. I have made the following code which works except that I have a problem.
For example, if the location is "Paris France" and I have Paris in my list, then it will return the result. However, if it were "France Paris", it would not return "Paris". Do you have a solution? Maybe use regex? Thank you a lot!
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = ['Paris', 'Bruxelles', 'Madrid']
values = df[df['Bio Location'].isin(cities)]
values.to_csv(r'results.csv', index=False)
What you want here is .str.contains():
1. The DF I used to test:
df = {
    'col1': ['Paris France', 'France Paris Test', 'France Paris',
             'Madrid Spain', 'Spain Madrid Test', 'Spain Madrid']
}  # tested with the city once at the start, middle, and end of a string
df = pd.DataFrame(df)
df
Result:
index  col1
0      Paris France
1      France Paris Test
2      France Paris
3      Madrid Spain
4      Spain Madrid Test
5      Spain Madrid
2. Then applying the code below:
reg = ('Paris|Madrid')
df = df[df.col1.str.contains(reg)]
df
Result:
index  col1
0      Paris France
1      France Paris Test
2      France Paris
3      Madrid Spain
4      Spain Madrid Test
5      Spain Madrid
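To build the pattern from your own city list instead of hard-coding it, a sketch along these lines should work on your original 'Bio Location' column (re.escape guards against regex metacharacters in the names; na=False keeps missing values out of the result):
import re

cities = ['Paris', 'Bruxelles', 'Madrid']
reg = '|'.join(re.escape(c) for c in cities)  # 'Paris|Bruxelles|Madrid'
values = df[df['Bio Location'].str.contains(reg, case=False, na=False)]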

How to drop rows from a pandas dataframe based on a pre-made list

I have a big dataset about news reading, and I'm trying to clean it. I created a list of the cities I want to keep (the dataset contains all cities). How can I drop the rows whose city is not on that list? For example, my list contains all the French cities; how can I drop the other cities?
To picture the data frame (I have 1.5m rows btw):
City Age
0 Paris 25-34
1 Lyon 45-54
2 Kiev 35-44
3 Berlin 25-34
4 New York 25-34
5 Paris 65+
6 Toulouse 35-44
7 Nice 55-64
8 Hannover 45-54
9 Lille 35-44
10 Edinburgh 65+
11 Moscow 25-34
You can do this using pandas.DataFrame.isin. This returns boolean values indicating whether each element is in the list x. You can then use these boolean values to take the subset of df with rows that return True, by doing df[df['City'].isin(x)]. Following is my solution:
import pandas as pd
x = ['Paris' , 'Marseille']
df = pd.DataFrame(data={'City':['Paris', 'London', 'New York', 'Marseille'],
'Age':[1, 2, 3, 4]})
print(df)
df = df[df['City'].isin(x)]
print(df)
Output:
City Age
0 Paris 1
1 London 2
2 New York 3
3 Marseille 4
City Age
0 Paris 1
3 Marseille 4
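If you instead need the inverse, i.e. drop the cities in the list and keep the rest, negate the mask with ~; a sketch on the same toy data:
df = df[~df['City'].isin(x)]  # keeps London and New York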

Renaming column in pandas with a multi level index

I would like to rename 'multi level columns' of a pandas dataframe to 'single level columns'. My code so far does not give any errors but does not rename either. Any suggestions for code improvements?
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Gross_national_income'
df = pd.read_html(url)[3][[('Country', 'Country'), ('GDP[10]', 'GDP[10]')]]\
.rename(columns={('Country', 'Country'):'Country', ('GDP[10]', 'GDP[10]'): 'GDP'})
df
I prefer to use the rename method. df.columns = ['Country', 'GDP'] works but is not what I am looking for.
For a rename solution, flatten the MultiIndex values with join, then build a dictionary by zipping the flattened names with the new column names:
url = 'https://en.wikipedia.org/wiki/Gross_national_income'
df = pd.read_html(url)[3]
df.columns = df.columns.map('_'.join)
old = ['No._No.', 'Country_Country', 'GNI (Atlas method)[8]_value (a)',
'GNI (Atlas method)[8]_a - GDP', 'GNI[9]_value (b)', 'GNI[9]_b - GDP',
'GDP[10]_GDP[10]']
new = ['No.','Country','GNI a','GDP a','GNI b', 'GNI b', 'GDP']
df = df.rename(columns=dict(zip(old, new)))
If you want to create the dictionary for rename explicitly:
d = {'No._No.': 'No.', 'Country_Country': 'Country', 'GNI (Atlas method)[8]_value (a)': 'GNI a', 'GNI (Atlas method)[8]_a - GDP': 'GDP a', 'GNI[9]_value (b)': 'GNI b', 'GNI[9]_b - GDP': 'GNI b', 'GDP[10]_GDP[10]': 'GDP'}
df = df.rename(columns=d)
print (df)
No. Country GNI a GDP a GNI b GNI b GDP
0 1 United States 20636317 91974 20837347 293004 20544343
1 2 China 13181372 -426779 13556853 -51298 13608151
2 3 Japan 5226599 255276 5155423 184100 4971323
3 4 Germany 3905321 -42299 4058030 110410 3947620
4 5 United Kingdom 2777405 -77891 2816805 -38491 2855296
5 6 France 2752034 -25501 2840071 62536 2777535
6 7 India 2727893 9161 2691040 -27692 2718732
7 8 Italy 2038376 -45488 2106525 22661 2083864
8 9 Brazil 1902286 16804 1832170 -53312 1885482
9 10 Canada 1665565 -47776 1694054 -19287 1713341
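A possible variant of the flattening step, assuming you want to collapse levels that repeat the same label (as in ('Country', 'Country')) rather than produce names like Country_Country:
df.columns = [a if a == b else f'{a}_{b}' for a, b in df.columns]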
As an alternative to rename, you can use get_level_values(). See below:
df.columns = df.columns.get_level_values(0)
>>> print(df)
Country GDP[10]
0 United States 20544343
1 China 13608151
2 Japan 4971323
3 Germany 3947620
4 United Kingdom 2855296
5 France 2777535
6 India 2718732
7 Italy 2083864
8 Brazil 1885482
9 Canada 1713341
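Note that get_level_values(0) keeps the raw top-level labels, so GDP[10] is unchanged; if needed, a follow-up rename fixes that:
df = df.rename(columns={'GDP[10]': 'GDP'})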

Python split one column into multiple columns and reattach the split columns into original dataframe

I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
'Name':['Ann','Betty','David','George'],
'Gender':['F','F','M','M'],
'Country':['USA','USA','USA','USA'],
'State':['CA','MA','FL','AZ'],
'County':['Los Angeles','Suffolk','Charlotte','None'],
'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
'Name':['David','Frank'],
'Gender':['M','M'],
'Country':['Canada', 'Canada'],
'State':['ON','QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence include USA or not, and attach the split columns from Residence ( USA and nonUSA ) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
There is a unique index in the original data, and it is not changed by the code below for either DataFrame, so you can use concat to join them together and then add them to the original with DataFrame.join, or with concat and axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
#changed order to avoid a length-mismatch error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
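To finish the split the question asks for, a minimal sketch based on the simplified df above (variable names are illustrative; note the row whose Residence is the literal string 'NA' lands in nonUSAdata and may need separate handling):
usa_mask = df['Country'].eq('USA')
USAdata = df[usa_mask].drop(columns=['Residence'])
nonUSAdata = df[~usa_mask].drop(columns=['Residence', 'County', 'City'])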
