How to extract a certain string from text? - python

I have a certain feature "Location" from which I want to extract country.
The feature looks like:
data['Location'].head()
0 stockton, california, usa
1 edmonton, alberta, canada
2 timmins, ontario, canada
3 ottawa, ontario, canada
4 n/a, n/a, n/a
Name: Location, dtype: object
I want:
data['Country'].head(3)
0 usa
1 canada
2 canada
I've tried:
data['Country'] = data.Location.str.extract('(+[a-zA-Z])', expand=False)
data[['Location', 'Country']].sample(10)
which returns:
error: nothing to repeat at position 1
When I use '[a-zA-Z]+' instead, it gives me the city.
Help would be appreciated. Thanks.

You can use a regex pattern with str.split; capture groups are kept in the output, so the country lands in column 2:
df['Country'] = df['Location'].str.split(r'(,\s)(\w+)$', n=1, expand=True)[2]
Output:
df['Country'].head(3)
Out[111]:
0 usa
1 canada
2 canada
Name: Country, dtype: object

data['Country'] = data['Location'].apply(lambda row: str(row).split(',')[-1].strip())
You may do this: apply runs a function on every row of the column, the lambda splits on commas and keeps the last piece (strip() removes the leading space), and the result is saved into a new column.
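For reference, the "nothing to repeat" error in the question comes from placing the + quantifier before the character class instead of after it. Anchoring a corrected pattern at the end of the string extracts the country directly; a minimal sketch, assuming the three-part Location format shown above:

```python
import pandas as pd

# Sample data matching the Location format in the question
data = pd.DataFrame({
    'Location': ['stockton, california, usa',
                 'edmonton, alberta, canada',
                 'timmins, ontario, canada']
})

# '+' must follow the token it repeats; anchoring with '$' grabs the last word
data['Country'] = data['Location'].str.extract(r'([a-zA-Z]+)$', expand=False)
print(data['Country'].tolist())  # ['usa', 'canada', 'canada']
```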


ValueError: Series.replace cannot use dict-value and non-None to_replace when creating a conditional column

given this dataframe named df:
Number City Country
one Milan Italy
two Paris France
three London UK
four Berlin Germany
five Milan Italy
six Oxford UK
I would like to create a new column called 'Classification' based on this condition:
if df['Country'] == "Italy" and df['City'] == "Milan", result = "zero"; else result = df['Number']
The result I want to achieve is this:
Number City Country Classification
one Milan Italy zero
two Paris France two
three London UK three
four Berlin Germany four
five Milan Italy zero
six Oxford UK six
I tried to use this code:
condition = [(df['Country'] == "Italy") & (df['City'] == 'Milan'),]
values = ['zero']
df['Classification'] = np.select(condition, values)
the result of which is this dataframe:
Number City Country Classification
one Milan Italy zero
two Paris France 0
three London UK 0
four Berlin Germany 0
five Milan Italy zero
six Oxford UK 0
now I try to replace the '0' in the 'Classification' column with the values of the column 'Number'
df['Classification'].replace(0, df['Number'])
but the result I get is an error:
ValueError: Series.replace cannot use dict-value and non-None to_replace
I would be very grateful for any suggestion on how to fix this
What you want is np.where:
df['Classification'] = np.where((df['Country'] == "Italy") & (df['City'] == 'Milan'), 'zero', df['Number'])
print(df)
Number City Country Classification
0 one Milan Italy zero
1 two Paris France two
2 three London UK three
3 four Berlin Germany four
4 five Milan Italy zero
5 six Oxford UK six
If you want to use np.select, you need to specify default argument
condition = [(df['Country'] == "Italy") & (df['City'] == 'Milan'),]
values = ['zero']
df['Classification'] = np.select(condition, values, default=df['Number'])
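Putting it together, a self-contained sketch of the np.where approach on the example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Number': ['one', 'two', 'three', 'four', 'five', 'six'],
    'City': ['Milan', 'Paris', 'London', 'Berlin', 'Milan', 'Oxford'],
    'Country': ['Italy', 'France', 'UK', 'Germany', 'Italy', 'UK'],
})

# np.where picks 'zero' where the mask holds, else the row's Number
mask = (df['Country'] == 'Italy') & (df['City'] == 'Milan')
df['Classification'] = np.where(mask, 'zero', df['Number'])
print(df['Classification'].tolist())  # ['zero', 'two', 'three', 'four', 'zero', 'six']
```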

How can I fill cells of a new column based on a substring of the original data using pandas?

There are two dataframes with similar data.
A dataframe
Index Business Address
1 Oils Moskva, Russia
2 Foods Tokyo, Japan
3 IT California, USA
... etc.
B dataframe
Index Country Country Calling Codes
1 USA +1
2 Egypt +20
3 Russia +7
4 Korea +82
5 Japan +81
... etc.
I want to add a column named 'Country Calling Codes' to dataframe A, too.
Each value of the 'Country' column in B should be compared with the 'Address' column. If 'A.Address' contains 'B.Country', the matching 'B.Country Calling Codes' should be inserted into 'A.Country Calling Codes' for that row.
Result is:
Index Business Address Country Calling Codes
1 Oils Moskva, Russia +7
2 Foods Tokyo, Japan +81
3 IT California, USA +1
I don't know how to deal with this issue because I don't have much experience using pandas. I would be very grateful for any help.
Use Series.str.extract to pull the possible country names (built from the Country column) out of Address, then look up the codes with Series.map:
d = B.drop_duplicates('Country').set_index('Country')['Country Calling Codes']
s = A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False)
A['Country Calling Codes'] = s.map(d)
print (A)
Index Business Address Country Calling Codes
0 1 Oils Moskva, Russia +7
1 2 Foods Tokyo, Japan +81
2 3 IT California, USA +1
Detail:
print (A['Address'].str.extract(f'({"|".join(d.keys())})', expand=False))
0 Russia
1 Japan
2 USA
Name: Address, dtype: object
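A runnable sketch of the extract-and-map approach, using the sample frames from the question:

```python
import pandas as pd

A = pd.DataFrame({'Business': ['Oils', 'Foods', 'IT'],
                  'Address': ['Moskva, Russia', 'Tokyo, Japan', 'California, USA']})
B = pd.DataFrame({'Country': ['USA', 'Egypt', 'Russia', 'Korea', 'Japan'],
                  'Country Calling Codes': ['+1', '+20', '+7', '+82', '+81']})

# Country -> code lookup; extract pulls whichever country name appears in Address
# (if country names could contain regex metacharacters, re.escape them first)
d = B.drop_duplicates('Country').set_index('Country')['Country Calling Codes']
s = A['Address'].str.extract(f'({"|".join(d.index)})', expand=False)
A['Country Calling Codes'] = s.map(d)
print(A['Country Calling Codes'].tolist())  # ['+7', '+81', '+1']
```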

How to use .value_counts() for list items in a dataframe

I have a dataframe df which looks like this:
data = [['Alex','Japan'],['Joe','Japan, India']]
df = pd.DataFrame(data,columns=['Name','Countries'])
Name Countries
Alex Japan
Joe Japan, India
So I want to modify df so that when I run df['Countries'].value_counts(), I get
Japan 2
India 1
So I thought that I should convert those strings in df['Countries'] into a list using this:
df['Countries']= df['Countries'].str[0:].str.split(',').tolist()
Name Countries
0 Alex [Japan]
1 Joe [Japan, India]
But now when I run df['Countries'].value_counts(), I get the following error:
TypeError: unhashable type: 'list'
All I wish is that when I run .value_counts() I get 2 for Japan and 1 for India. Please see if you can help me with this. Thank you!
Use Series.str.split with expand=True, then reshape with DataFrame.stack into a Series, so value_counts can be used:
s = df['Countries'].str.split(', ', expand=True).stack().value_counts()
print (s)
Japan 2
India 1
dtype: int64
Another way, using Series.str.get_dummies() (note the ', ' separator, so the leading space is not kept in the country names):
df.Countries.str.get_dummies(', ').sum()
India 1
Japan 2
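On newer pandas (0.25+), Series.explode offers another route: split into lists, explode to one country per row, then count. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'Joe'],
                   'Countries': ['Japan', 'Japan, India']})

# Split on ', ' so no stray spaces survive, then flatten to one country per row
counts = df['Countries'].str.split(', ').explode().value_counts()
print(counts)
```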

Pandas - 'cut' everything after a certain character in a string column and paste it in the beginning of the column

In a pandas dataframe string column, I want to grab everything after a certain character and place it at the beginning of the column while stripping the character. What is the most efficient and clean way to achieve this?
Input Dataframe:
>>> df = pd.DataFrame({'city':['Bristol, City of', 'Newcastle, City of', 'London']})
>>> df
city
0 Bristol, City of
1 Newcastle, City of
2 London
>>>
My desired dataframe output:
city
0 City of Bristol
1 City of Newcastle
2 London
Assuming there are only two pieces to each string at most, you can split, reverse, and join:
df.city.str.split(', ').str[::-1].str.join(' ')
0 City of Bristol
1 City of Newcastle
2 London
Name: city, dtype: object
If there can be more than one comma, split on the first only:
df.city.str.split(', ', n=1).str[::-1].str.join(' ')
0 City of Bristol
1 City of Newcastle
2 London
Name: city, dtype: object
Another option is str.partition:
u = df.city.str.partition(', ')
u.iloc[:,-1] + ' ' + u.iloc[:,0]
0 City of Bristol
1 City of Newcastle
2 London
dtype: object
This always splits on the first comma only.
You can also use a list comprehension, if you need performance:
df.assign(city=[' '.join(s.split(', ', 1)[::-1]) for s in df['city']])
city
0 City of Bristol
1 City of Newcastle
2 London
Why should you care about loopy solutions? For loops are fast when working with string/regex functions (faster than pandas, at least). You can read more at For loops with pandas - When should I care?.
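A regex substitution with str.replace is one more option, assuming at most one relevant comma per value: capture both halves and emit them reversed. Rows without a comma don't match and pass through unchanged.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Bristol, City of', 'Newcastle, City of', 'London']})

# Non-greedy first group stops at the first ', '; non-matching rows are untouched
out = df['city'].str.replace(r'^(.*?), (.*)$', r'\2 \1', regex=True)
print(out.tolist())  # ['City of Bristol', 'City of Newcastle', 'London']
```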

pandas fill missing country values based on city if it exists

I'm trying to fill missing country names in my dataframe based on city/country pairs that do exist. For example, in the dataframe below I want to replace the NaN for the city Bangalore with India, since that city already appears elsewhere with a country.
df1=
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore NaN
I am new to this so any help would be appreciated :).
You can create a series mapping after dropping nulls and duplicates.
Then use fillna with pd.Series.map:
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df)
City Country
0 Bangalore India
1 Delhi India
2 London UK
3 California USA
4 Dubai UAE
5 Abu Dhabi UAE
6 Bangalore India
This solution will also work if NaN occurs first within a group.
I believe
df1.groupby('City')['Country'].fillna(method='ffill')
should resolve your issue by forward filling missing values within each group (note this cannot fill a NaN that is the first row of its group).
One of the ways could be -
non_null_cities = df1.dropna().drop_duplicates(['City']).rename(columns={'Country':'C'})
df1 = df1.merge(non_null_cities, on='City', how='left')
df1.loc[df1['Country'].isnull(), 'Country'] = df1['C']
del df1['C']
Hope this will be helpful!
Here is one hacky way to do it:
first forward fill, then backward fill (to cover a NaN that occurs first within its group):
df = df.groupby('City')[['City','Country']].fillna(method = 'ffill').groupby('City')[['City','Country']].fillna(method = 'bfill')
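For reference, a runnable sketch of the map-based fill from the accepted approach, on the example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'City': ['Bangalore', 'Delhi', 'London', 'California',
             'Dubai', 'Abu Dhabi', 'Bangalore'],
    'Country': ['India', 'India', 'UK', 'USA', 'UAE', 'UAE', np.nan],
})

# Build the City -> Country lookup only from rows where Country is known
g = df.dropna(subset=['Country']).drop_duplicates('City').set_index('City')['Country']
df['Country'] = df['Country'].fillna(df['City'].map(g))
print(df.loc[6, 'Country'])  # India
```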
