Selecting rows by last 3 characters in a column with strings - python

I have this dataframe
name year ...
0 Carlos - xyz 2019
1 Marcos - yws 2031
3 Fran - xxz 2431
4 Matt - yre 1985
...
I want to create a new column called type.
If the name ends with "xyz" or "xxz", type should be "big".
So it should look like this:
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
...
Any suggestions?

Option 1
Use str.contains to generate a mask:
m = df.name.str.contains(r'x[yx]z$')
Or,
sub_str = ['xyz', 'xxz']
m = df.name.str.contains(r'(?:{})$'.format('|'.join(sub_str)))
(The non-capturing group ensures the $ anchor applies to every alternative, not just the last one.)
Now, you may either create your column with np.where,
df['type'] = np.where(m, 'big', '')
Or, use loc in place of np.where:
df['type'] = ''
df.loc[m, 'type'] = 'big'
df
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
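A caveat worth adding (my note, not part of the original answer): if name can contain missing values, str.contains propagates NaN and the mask can no longer be used for boolean indexing directly; passing na=False treats missing names as non-matches:
m = df.name.str.contains(r'x[yx]z$', na=False)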
Option 2
As an alternative, consider str.endswith + np.logical_or.reduce
sub_str = ['xyz', 'xxz']
m = np.logical_or.reduce([df.name.str.endswith(s) for s in sub_str])
df['type'] = ''
df.loc[m, 'type'] = 'big'
df
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
3 Fran - xxz 2431 big
4 Matt - yre 1985
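If your pandas is recent enough (tuple support for str.endswith arrived in newer versions; treat the exact version as an assumption), you can skip the reduce and pass the suffixes as a tuple directly:
m = df.name.str.endswith(('xyz', 'xxz'), na=False)
df['type'] = np.where(m, 'big', '')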

Here is one way using pandas.Series.str.
df = pd.DataFrame([['Carlos - xyz', 2019], ['Marcos - yws', 2031],
['Fran - xxz', 2431], ['Matt - yre', 1985]],
columns=['name', 'year'])
df['type'] = np.where(df['name'].str[-3:].isin({'xyz', 'xxz'}), 'big', '')
Alternatively, you can use the .loc accessor instead of numpy.where:
df['type'] = ''
df.loc[df['name'].str[-3:].isin({'xyz', 'xxz'}), 'type'] = 'big'
Result
name year type
0 Carlos - xyz 2019 big
1 Marcos - yws 2031
2 Fran - xxz 2431 big
3 Matt - yre 1985
Explanation
Extract the last 3 characters using pd.Series.str.
Compare to a specified set of values; set membership gives O(1) lookups.
Use numpy.where to perform the conditional assignment for the new series.

Related

Conditionals with NaN in python [duplicate]

I have a simple DataFrame like the following:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1996 326
5 San Franciso 49ers 1950 1003
I want to select all values from the 'First Season' column and replace those that are over 1990 with 1. In this example, only Baltimore Ravens would have its 1996 replaced by 1 (keeping the rest of the data intact).
I have used the following:
df.loc[(df['First Season'] > 1990)] = 1
But, it replaces all the values in that row by 1, and not just the values in the 'First Season' column.
How can I replace just the values from that column?
You need to select that column:
In [41]:
df.loc[df['First Season'] > 1990, 'First Season'] = 1
df
Out[41]:
Team First Season Total Games
0 Dallas Cowboys 1960 894
1 Chicago Bears 1920 1357
2 Green Bay Packers 1921 1339
3 Miami Dolphins 1966 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 1950 1003
So the syntax here is:
df.loc[<mask>, <optional column(s)>]
where <mask> generates the labels to index. You can check the docs, and also "10 Minutes to pandas", which shows the semantics.
EDIT
If you want to generate a boolean indicator, you can use the boolean condition to generate a boolean Series and cast the dtype to int; this converts True and False to 1 and 0 respectively:
In [43]:
df['First Season'] = (df['First Season'] > 1990).astype(int)
df
Out[43]:
Team First Season Total Games
0 Dallas Cowboys 0 894
1 Chicago Bears 0 1357
2 Green Bay Packers 0 1339
3 Miami Dolphins 0 792
4 Baltimore Ravens 1 326
5 San Franciso 49ers 0 1003
A bit late to the party, but I still prefer using numpy.where:
import numpy as np
df['First Season'] = np.where(df['First Season'] > 1990, 1, df['First Season'])
df.loc[df['First Season'] > 1990, 'First Season'] = 1
Explanation:
df.loc takes two arguments, 'row index' and 'column index'. We check, for each row, whether the value in the "First Season" column is greater than 1990, and then replace it with 1.
df['First Season'].loc[(df['First Season'] > 1990)] = 1
Strange that nobody has this answer; the only part missing from your code is the ['First Season'] right after df, and the extra parentheses inside can be removed. (Be aware this is chained indexing, which can raise SettingWithCopyWarning in newer pandas.)
For a single condition, i.e. (df['employrate'] > 70)
country employrate alcconsumption
0 Afghanistan 55.7000007629394 .03
1 Albania 51.4000015258789 7.29
2 Algeria 50.5 .69
3 Andorra NaN 10.17
4 Angola 75.6999969482422 5.57
use this:
df.loc[df['employrate'] > 70, 'employrate'] = 7
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 51.400002 7.29
2 Algeria 50.500000 .69
3 Andorra nan 10.17
4 Angola 7.000000 5.57
Therefore the syntax here is:
df.loc[<mask>, <optional column(s)>]
where <mask> generates the labels to index.
For multiple conditions, i.e. (df['employrate'] <= 55) & (df['employrate'] > 50),
use this:
df['employrate'] = np.where(
(df['employrate'] <=55) & (df['employrate'] > 50) , 11, df['employrate']
)
Out[108]:
country employrate alcconsumption
0 Afghanistan 55.700001 .03
1 Albania 11.000000 7.29
2 Algeria 11.000000 .69
3 Andorra nan 10.17
4 Angola 75.699997 5.57
Therefore the syntax here is:
df['<column_name>'] = np.where((<filter 1>) & (<filter 2>), <new value>, df['<column_name>'])
Another option is to use a list comprehension:
df['First Season'] = [1 if year > 1990 else year for year in df['First Season']]
You can also use mask, which replaces the values where the condition is met; note that it returns a new Series, so assign it back:
df['First Season'] = df['First Season'].mask(lambda col: col > 1990, 1)
We can update the First Season column in df with the following syntax:
df['First Season'] = expression_for_new_values
To map the values in First Season we can use pandas' .map() method with the following syntax:
data_frame['column'].map({'initial_value_1': 'updated_value_1', 'initial_value_2': 'updated_value_2'})
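As a concrete (hypothetical) example on this data: map 1996 to 1 and keep every other year as-is, since .map() returns NaN for values absent from the dict:
df['First Season'] = df['First Season'].map({1996: 1}).fillna(df['First Season']).astype(int)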

Dataframe - filter the values of a particular column with isin()

I have a pandas dataframe with a column "Bio Location", and I would like to filter it so that I keep only the rows whose location contains one of the city names in my list. I wrote the following code, which works except for one problem.
For example, if the location is "Paris France" and I have Paris in my list, then it returns the result. However, if it were "France Paris", it would not return "Paris". Do you have a solution? Maybe using regex? Thanks a lot!
df = pd.read_csv(path_to_file, encoding='utf-8', sep=',')
cities = ['Paris', 'Bruxelles', 'Madrid']
values = df[df['Bio Location'].isin(cities)]
values.to_csv(r'results.csv', index=False)
What you want here is .str.contains():
1. The DF I used to test:
# tested with the city 1x at the start, 1x in the middle and 1x at the end of a string
data = {
    'col1': ['Paris France', 'France Paris Test', 'France Paris',
             'Madrid Spain', 'Spain Madrid Test', 'Spain Madrid']
}
df = pd.DataFrame(data)
df
Result:
   col1
0  Paris France
1  France Paris Test
2  France Paris
3  Madrid Spain
4  Spain Madrid Test
5  Spain Madrid
2. Then apply the code below:
(Updated following a comment.)
reg = 'Paris|Madrid'
df = df[df.col1.str.contains(reg)]
df
Result:
   col1
0  Paris France
1  France Paris Test
2  France Paris
3  Madrid Spain
4  Spain Madrid Test
5  Spain Madrid
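One practical refinement (my addition, not in the original answer): if the city list can contain regex metacharacters, or "Bio Location" can contain NaN, build the pattern with re.escape and pass na=False:
import re
cities = ['Paris', 'Bruxelles', 'Madrid']
pattern = '|'.join(map(re.escape, cities))
values = df[df['Bio Location'].str.contains(pattern, case=False, na=False)]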

How to do cumulative division in groupby level [0,1] in pandas dataframe based on conditions?

I have a dataframe to which I want to append rows computed with groupby plus some additional conditions. I'm looking for a for loop or any other solution that works.
Or, if it's easier...
first melt the df, then add a new ratio (%) column, then unmelt.
As the calculations are custom, I think a for loop can get there, with or without groupby.
---Lines 6, 7, 8 of the output are my requirement.---
0-14 = child and unemployed
14-50 = young and working
50+ = old and unemployed
# ref line 6,7,8 = showing which rows to (+) and (/)
I want to apply 3 conditions to produce output lines 6, 7, 8:
d = { 'year': [2019,2019,2019,2020,2020,2020],
'age group': ['(0-14)','(14-50)','(50+)','(0-14)','(14-50)','(50+)'],
'con': ['UK','UK','UK','US','US','US'],
'population': [10,20,300,400,1000,2000]}
df = pd.DataFrame(data=d)
df2 = df.copy()
df
year age group con population
0 2019 (0-14) UK 10
1 2019 (14-50) UK 20
2 2019 (50+) UK 300
3 2020 (0-14) US 400
4 2020 (14-50) US 1000
5 2020 (50+) US 2000
output required:
year age group con population
0 2019 (0-14) UK 10.0
1 2019 (14-50) UK 20.0
2 2019 (50+) UK 300.0
3 2020 (0-14) US 400.0
4 2020 (14-50) US 1000.0
5 2020 (50+) US 2000.0
6 2019 young vs child UK-young vs child 2.0 # 20/10
7 2019 old vs young UK-old vs young 15.0 # 300/20
8 2019 unemployed vs working UK-unemployed vs working 15.5 # (300+10)/20
My attempts so far:
df2 = df.copy()
criteria = [df2['con'].str.contains('0-14'),
df2['con'].str.contains('14-50'),
df2['con'].str.contains('50+')]
#conditions should be according to requirements
values = ['young vs child','old vs young', 'unemployed vs working']
df2['con'] = df2['con']+'_'+np.select(criteria, values, 0)
df2['age group'] = df2['age group']+'_'+np.select(criteria, values, 0)
df.groupby(['year','age group','con']).sum().groupby(level=[0,1]).cumdiv()
pd.concat([df,df2])
#----errors. cumdiv() not found and missing conditions criteria-------
also tried:
df['population'].div(df.groupby('con')['population'].shift(1))
#but looking for customisations into this
#so it can first sum rows and then divide
#according to unemployed condition-- row 8 reference.
CLOSEST TRIAL
n_df_2 = df.copy()
con_list = [x for x in df.con]
year_list = [x for x in df.year]
age_list = [x for x in df['age group']]
new_list = ['young vs child','old vs young', 'unemployed vs working']
for country in con_list:
    bev_child = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[0]))]
    bev_work = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[1]))]
    bev_old = n_df_2[(n_df_2['con'].str.contains(country)) & (n_df_2['age group'].str.contains(age_list[2]))]
    bev_child.loc[:, 'population'] = bev_work.loc[:, 'population'].max() / bev_child.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[0]
    bev_child.loc[:, 'age group'] = new_list[0]
    s = n_df_2.append(bev_child, ignore_index=True)
    bev_child.loc[:, 'population'] = bev_child.loc[:, 'population'].max() + bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[2]
    bev_child.loc[:, 'age group'] = new_list[2]
    s = s.append(bev_child, ignore_index=True)
    bev_child.loc[:, 'population'] = bev_old.loc[:, 'population'].max() / bev_work.loc[:, 'population'].max()
    bev_child.loc[:, 'con'] = country + '-' + new_list[1]
    bev_child.loc[:, 'age group'] = new_list[1]
    s = s.append(bev_child, ignore_index=True)
s
year age group con population
0 2019 (0-14) UK 10.0
1 2019 (14-50) UK 20.0
2 2019 (50+) UK 300.0
3 2020 (0-14) US 400.0
4 2020 (14-50) US 1000.0
5 2020 (50+) US 2000.0
6 2020 young vs child US-young vs child 2.5
7 2020 unemployed vs working US-unemployed vs working 4.5
8 2020 old vs young US-old vs young 2.0
Also,
PLEASE suggest the easiest way to solve it... Please...
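For what it's worth, here is one way to produce rows 6, 7 and 8 (a sketch of my own, not from the thread; it assumes every (year, con) group contains exactly the three age bands):
rows = []
for (year, con), g in df.groupby(['year', 'con']):
    pop = g.set_index('age group')['population']
    child, work, old = pop['(0-14)'], pop['(14-50)'], pop['(50+)']
    rows += [{'year': year, 'age group': 'young vs child',
              'con': con + '-young vs child', 'population': work / child},
             {'year': year, 'age group': 'old vs young',
              'con': con + '-old vs young', 'population': old / work},
             {'year': year, 'age group': 'unemployed vs working',
              'con': con + '-unemployed vs working',
              'population': (old + child) / work}]  # e.g. (300+10)/20 = 15.5 for UK
out = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)
out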

Python split one column into multiple columns and reattach the split columns into original dataframe

I want to split one column from my dataframe into multiple columns, then attach those columns back to my original dataframe and divide my original dataframe based on whether the split columns include a specific string.
I have a dataframe that has a column with values separated by semicolons like below.
import pandas as pd
data = {'ID':['1','2','3','4','5','6','7'],
'Residence':['USA;CA;Los Angeles;Los Angeles', 'USA;MA;Suffolk;Boston', 'Canada;ON','USA;FL;Charlotte', 'NA', 'Canada;QC', 'USA;AZ'],
'Name':['Ann','Betty','Carl','David','Emily','Frank', 'George'],
'Gender':['F','F','M','M','F','M','M']}
df = pd.DataFrame(data)
Then I split the column as below, and separated the split column into two based on whether it contains the string USA or not.
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
Now if you run USA and nonUSA, you'll note that there are extra columns in nonUSA, and also a row with no country information. So I got rid of those NA values.
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA.columns = ['Country', 'State']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
Now I want to attach USA and nonUSA to my original dataframe, so that I will get two dataframes that look like below:
USAdata = pd.DataFrame({'ID':['1','2','4','7'],
'Name':['Ann','Betty','David','George'],
'Gender':['F','F','M','M'],
'Country':['USA','USA','USA','USA'],
'State':['CA','MA','FL','AZ'],
'County':['Los Angeles','Suffolk','Charlotte','None'],
'City':['Los Angeles','Boston','None','None']})
nonUSAdata = pd.DataFrame({'ID':['3','6'],
'Name':['David','Frank'],
'Gender':['M','M'],
'Country':['Canada', 'Canada'],
'State':['ON','QC']})
I'm stuck here though. How can I split my original dataframe into people whose Residence include USA or not, and attach the split columns from Residence ( USA and nonUSA ) back to my original dataframe?
(Also, I just uploaded everything I had so far, but I'm curious if there's a cleaner/smarter way to do this.)
The index in the original data is unique and is not changed by the code below for either DataFrame, so you can use concat to join the two pieces back together and then add them to the original with DataFrame.join, or with concat with axis=1:
address = df['Residence'].str.split(';',expand=True)
country = address[0] != 'USA'
USA, nonUSA = address[~country], address[country]
USA.columns = ['Country', 'State', 'County', 'City']
nonUSA = nonUSA.dropna(axis=0, subset=[1])
nonUSA = nonUSA[nonUSA.columns[0:2]]
# changed the order to avoid the error
nonUSA.columns = ['Country', 'State']
df = pd.concat([df, pd.concat([USA, nonUSA])], axis=1)
Or:
df = df.join(pd.concat([USA, nonUSA]))
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NaN NaN
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 NaN NaN
3 Charlotte None
4 NaN NaN
5 NaN NaN
6 None None
But it seems it is possible to simplify:
c = ['Country', 'State', 'County', 'City']
df[c] = df['Residence'].str.split(';',expand=True)
print (df)
ID Residence Name Gender Country State \
0 1 USA;CA;Los Angeles;Los Angeles Ann F USA CA
1 2 USA;MA;Suffolk;Boston Betty F USA MA
2 3 Canada;ON Carl M Canada ON
3 4 USA;FL;Charlotte David M USA FL
4 5 NA Emily F NA None
5 6 Canada;QC Frank M Canada QC
6 7 USA;AZ George M USA AZ
County City
0 Los Angeles Los Angeles
1 Suffolk Boston
2 None None
3 Charlotte None
4 None None
5 None None
6 None None
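To finish with the two frames the question actually asks for, one possible last step (my sketch, assuming the joined result from the first variant above, where the 'NA' row ends up with Country NaN):
usa_mask = df['Country'].eq('USA')
USAdata = df[usa_mask].drop(columns='Residence')
nonUSAdata = (df[~usa_mask].dropna(subset=['Country'])
                           .drop(columns=['Residence', 'County', 'City']))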

How can I swap specific column values in Dataframe?

I have a large dataframe; this is a sample of it.
I want to swap the Muscat and Shanghai values.
df =
City Score
Istanbul 6.0749
2.23607 Muscat
Prague 4.38576
1.85958 Shanghai
Istanbul 6.0749
Singapore 5.17054
Output:
df =
City Score
Istanbul 6.0749
Muscat 2.23607
Prague 4.38576
Shanghai 1.85958
Istanbul 6.0749
Singapore 5.17054
I am confused about how to apply the condition while iterating through the dataframe; also, is there any other alternative?
Use to_numeric with notna to build a boolean mask, then swap with loc:
m = pd.to_numeric(df['City'], errors='coerce').notna()
# older versions of pandas:
# m = pd.to_numeric(df['City'], errors='coerce').notnull()
df.loc[m,['City','Score']] = df.loc[m,['Score','City']].values
print (df)
City Score
0 Istanbul 6.0749
1 Muscat 2.23607
2 Prague 4.38576
3 Shanghai 1.85958
4 Istanbul 6.0749
5 Singapore 5.17054
You can use:
In [39]: mask = pd.to_numeric(df.Score, errors='coerce').isna()
In [40]: s = df.Score.copy()
In [41]: df.Score[mask] = df.City
In [42]: df.City[mask] = s
In [43]: df
Out[43]:
City Score
0 Istanbul 6.0749
1 Muscat 2.23607
2 Prague 4.38576
3 Shanghai 1.85958
4 Istanbul 6.0749
5 Singapore 5.17054
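A side note (mine, not from either answer): df.Score[mask] = ... is chained assignment, which newer pandas versions warn about and which may silently do nothing under copy-on-write. A .loc-based version of the same swap:
m = pd.to_numeric(df['Score'], errors='coerce').isna()
df.loc[m, ['City', 'Score']] = df.loc[m, ['Score', 'City']].to_numpy()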
