Python - str.match for each string in a dataframe - python

I'm trying to use str.match to match a phrase exactly, but for each word in each row's string. I want to return the row's index number for the correct row, which is why I'm using str.match instead of regex.
I want to return the index for the row that contains exactly 'FL', not 'FLORIDA'. The problem with using str.contains though, is that it returns to me the index of the row with 'FLORIDA'.
import pandas as pd
data = [['Alex in FL','ten'],['Bob in FLORIDA','five'],['Will in GA','three']]
df = pd.DataFrame(data,columns=['Name','Age'])
df.index[df['Name'].str.contains('FL')]
df.index[df['Name'].str.match('FL')]
Here's what the dataframe looks like:
             Name    Age
0      Alex in FL    ten
1  Bob in FLORIDA   five
2      Will in GA  three
The output should be returning the index of row 0:
Int64Index([0], dtype='int64')

Use contains with word boundaries:
import pandas as pd
data = [['Alex in FL','ten'],['Bob in FLORIDA','five'],['Will in GA','three']]
df = pd.DataFrame(data,columns=['Name','Age'])
print(df.index[df['Name'].str.contains(r'\bFL\b')])
Output
Int64Index([0], dtype='int64')

Try:
df[df.Name.str.contains(r'\bFL\b', regex=True)]
OR
df[['FL' in words for words in df.Name.str.split()]]
Output:
         Name  Age
0  Alex in FL  ten

The docs say that str.contains treats the pattern ("FL" in your case) as a regular expression and checks whether it occurs anywhere in the string. Since "FLORIDA" does contain that substring, it matches.
One way around this is to match " FL " (padded with spaces) instead, but then you also need to pad each of the values with spaces (for when "FL" is at the start or end of the string).
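A minimal sketch of the padding idea, assuming the words in each value are separated by single spaces. The names `padded`, `mask`, `result` and `starts` are made up for this illustration. It also shows why str.match finds nothing here: it only matches at the start of each string.

```python
import pandas as pd

data = [['Alex in FL', 'ten'], ['Bob in FLORIDA', 'five'], ['Will in GA', 'three']]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Pad every value with one space on each side so 'FL' at the start or end
# of a string is also surrounded by spaces.
padded = ' ' + df['Name'] + ' '

# regex=False makes this a plain substring test for the literal ' FL '.
mask = padded.str.contains(' FL ', regex=False)
result = df.index[mask].tolist()
print(result)  # [0]

# str.match anchors the pattern at the beginning of the string,
# so none of these names match 'FL'.
starts = df['Name'].str.match('FL')
print(starts.any())  # False
```

The word-boundary pattern r'\bFL\b' from the other answers is usually simpler, since it also handles punctuation next to the word.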

Related

Copying substring that starts with "C" and ends with all the values after "D" in one dataframe and putting the substring into a new dataframe

The above photo is data stored in df_input.
I would like extract the "C#D#" part from the 'Visit' column and place it into the column of a new dataframe I created (df_output['VISIT']).
Additionally, there could be up to two numeric values that follow after the "D".
I'm not sure if I am supposed to use '.str.extract' and how I would capture all the numeric values that follow right after the "D"
The output I would like to get is:
C1D1
C1D1
" "
C1D1
Please note df_input[Visit] does not only have "C1D1". It has variations of the C#D# structure so it could be "C1D12" or "C2D9".
You can use a simple regex to recognize your pattern and then you can apply a function to dataframe to apply the recognizer to the whole column:
import pandas as pd
import re

def extract(visit):
    # Look for "C<digit>D<one or two digits>" anywhere in the string
    matches = re.findall(r'C\dD\d{1,2}', visit)
    if matches:
        return matches[0]  # Assuming you only want to retrieve the first occurrence

df_input = pd.DataFrame(data=['C1D1-Pre', 'C1D12-2hr Post', 'test'], columns=['VISIT'])
df_output = pd.DataFrame()
df_output['VISIT'] = df_input['VISIT'].apply(extract)
print(df_output)
The output will be:
   VISIT
0   C1D1
1  C1D12
2   None
If you want empty string instead of None, you have to edit the extract function:
def extract(visit):
    matches = re.findall(r'C\dD\d{1,2}', visit)
    if matches:
        return matches[0]
    return ""
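Since the question asks about .str.extract, here is a rough sketch of the same idea without a helper function. It assumes the same sample data as above and that a single digit follows the "C", as in the pattern used in the answer.

```python
import pandas as pd

df_input = pd.DataFrame(data=['C1D1-Pre', 'C1D12-2hr Post', 'test'], columns=['VISIT'])
df_output = pd.DataFrame()

# The capturing group is what str.extract returns; expand=False gives a Series.
# Rows without a match come back as missing values, replaced here by ''.
df_output['VISIT'] = (
    df_input['VISIT']
    .str.extract(r'(C\dD\d{1,2})', expand=False)
    .fillna('')
)
print(df_output)
```

If the number after "C" can have more than one digit, the pattern would need `\d+` there instead of `\d`.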

Extract the first number from a string number range

I have a dataset with price column as type of string, and some of the values in the form of range (15000-20000).
I want to extract the first number and convert the entire column to integers.
I tried this :
df['price'].apply(lambda x: x.split('-')[0])
The code just returns the original column.
Try one of the following options:
Data
import pandas as pd
data = {'price': ['0','100-200','200-300']}
df = pd.DataFrame(data)
print(df)
     price
0        0   # a str without `-`, to show that this one will be handled too
1  100-200
2  200-300
Option 1
Use Series.str.split with expand=True and select the first column from the result.
Next, chain Series.astype, and assign the result to df['price'] to overwrite the original values.
df['price'] = df.price.str.split('-', expand=True)[0].astype(int)
print(df)
   price
0      0
1    100
2    200
Option 2
Use Series.str.extract with a regex pattern, r'(\d+)-?':
\d matches a digit.
+ matches the digit 1 or more times.
match stops when we hit - (? specifies "if present at all").
data = {'price': ['0','100-200','200-300']}
df = pd.DataFrame(data)
df['price'] = df.price.str.extract(r'(\d+)-?').astype(int)
# same result
Here is one way to do this:
df['price'] = df['price'].str.split('-', expand=True)[0].astype('int')
This stores only the first number of each range. For example, from 15000-20000 only 15000 is kept in the price column.
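Both options above fail on astype(int) if some value contains no number at all. A hedged sketch for that case, assuming such entries should become missing values; the 'n/a' row and the name `first` are invented for this illustration and are not from the question.

```python
import pandas as pd

# 'n/a' is a hypothetical non-numeric entry added for illustration.
df = pd.DataFrame({'price': ['0', '100-200', '200-300', 'n/a']})

# Take the first run of digits; rows without digits become missing values.
first = df['price'].str.extract(r'(\d+)', expand=False)

# to_numeric converts the digit strings; 'Int64' (capital I) is the nullable
# integer dtype, which can hold missing values next to integers.
df['price'] = pd.to_numeric(first, errors='coerce').astype('Int64')
print(df)
```

If every row is guaranteed to contain a number, the plain astype(int) from the answers above is enough.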

Match the column name based on the string in python?

I am new to Python and have an issue with matching the column names of a DataFrame. I have a string s = "8907*890a", where a is the name of a column in the data frame. I want to check whether that name is among the columns of df. I have tried the code below, but the string is treated as a whole. How can I get only the 'a' from the string?
My code:
s = "8907*890a"
df =
    a   b  c
0  rr  12  4
1  rt  45  9
2  ht  78  0
for col in df.columns:
    for i in s.split():
        print(i)
Which gives:
"8907*890a"
Expected out:
a
The split function accepts a delimiter as a parameter. By default it splits on any run of whitespace. So when you call s.split(), Python looks for whitespace in the string, finds none in this case, and returns the whole string as a single item. If you try s.split('*') you will get
8907
890a
as output. In your case it appears that splitting the string is not the best option to extract the column name. I would go with extracting the last character instead. This can be done using s[-1:], which works as long as the column name is a single character.
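If the column name can be longer than one character, one possible approach is to test each column name against the end of the string. This is only a sketch, assuming the name always sits at the end of s; the list name `matching` is made up here.

```python
import pandas as pd

s = "8907*890a"
df = pd.DataFrame({'a': ['rr', 'rt', 'ht'], 'b': [12, 45, 78], 'c': [4, 9, 0]})

# Keep every column name that the string ends with.
matching = [col for col in df.columns if s.endswith(col)]
print(matching)  # ['a']
```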

How to remove numbers and parenthesis at the end of column values like in 'abc23', 'abc(xyz)' in Pandas Dataframe?

I have a pandas dataframe with a column 'Country' that has values like these: 'Switzerland17', 'Bolivia (Plurinational State of)'. I want to convert them to just 'Switzerland', 'Bolivia'. How can I do that?
PS: I am able to solve the question using for loops but that's taking a long time as we have a dataframe here. Is there any pandas dataframe function we can use to solve this question?
If numbers and parentheses are the only things that mark the start of what you want to discard, you can split the string on '(' and keep the first part, then split that on the first digit and again keep the first part.
a = 'Bolivia (Plurinational State of)'
a.split("(")[0]
will give you 'Bolivia ' (with a trailing space, which the rstrip() further down removes).
import re
b = 'Switzerland17'
re.compile('[0-9]').split(b)[0]
will give you Switzerland and discard anything after the first digit.
def mysplit(a):
    b = a.split("(")[0]
    return re.compile('[0-9]').split(b)[0].rstrip()
df['Country'].apply(mysplit)
This will work.
So you have data like:
string = 'Switzerland17'
We can replace the numeric ending using the re module sub function.
import re
no_digits = re.sub(r'\d+$', '', string)
We get:
>>> no_digits
'Switzerland'
Let's say we have an example dataframe df as
         Country
0  Switzerland24
1          USA53
2        Norway3
You can use the filter() function to keep only the alphabetic characters (note that this drops spaces and punctuation as well as digits):
df['Country'] = df['Country'].apply(lambda s : ''.join(filter(lambda x: x.isalpha(), s)))
print(df)
       Country
0  Switzerland
1          USA
2       Norway
or,
def remove_digits(s):
    for x in range(10):
        s = s.replace(str(x), '')
    return s
df['Country'] = df['Country'].apply(remove_digits)
print(df)
       Country
0  Switzerland
1          USA
2       Norway
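The question asks for a pandas function rather than a loop. A possible vectorized version uses Series.str.replace with a regular expression. This is a sketch, assuming the unwanted part is always either a trailing number or a trailing parenthesized note; the sample rows and the name `pattern` are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Switzerland17',
                               'Bolivia (Plurinational State of)',
                               'Norway']})

# Either: optional whitespace, then "(...)" running to the end of the string,
# or: a run of digits at the end of the string.
pattern = r'\s*\(.*\)\s*$|\d+\s*$'
df['Country'] = df['Country'].str.replace(pattern, '', regex=True)
print(df)
```

Unlike the isalpha filter, this leaves spaces inside names such as "United States" alone, because it only touches the end of each string.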

Creating a year column in Pandas

I'm trying to create a year column with the year taken from the title column in my dataframe. This code works, but the column dtype is object. For example, in row 1 the year displays as [2013].
How can I do this, but change the column dtype to float?
year_list = []
for i in range(title_length):
    year = re.findall(r'\d{4}', wine['title'][i])
    year_list.append(year)
wine['year'] = year_list
Here is the head of my dataframe:
country   designation  points province                      title    year
  Italy  Vulkà Bianco      87   Sicily  Nicosia 2013 Vulkà Bianco  [2013]
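re.findall returns a list of results. Use re.search instead, which returns only the first match (this assumes every title contains a four-digit number; otherwise re.search returns None):
wine['year'] = [re.search(r'\d{4}', title)[0] for title in wine['title']]
Better yet, use the pandas extract method (the pattern needs a capturing group):
wine['year'] = wine['title'].str.extract(r'(\d{4})')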
Definition
Series.str.extract(pat, flags=0, expand=True)
For each subject string in the Series, extract groups from the first match of regular expression pat.
Instead of re.findall that returns a list of strings, you may use str.extract():
wine['year'] = wine['title'].str.extract(r'\b(\d{4})\b')
Or, in case you want to only match 1900-2000s years:
wine['year'] = wine['title'].str.extract(r'\b((?:19|20)\d{2})\b')
Note that the pattern in str.extract must contain at least one capturing group; its value is used to populate the new column. Only the first match is considered, so you may need to make the pattern's context more specific if need be.
I suggest using word boundaries \b around the \d{4} pattern to match 4-digit chunks as whole words and avoid partial matches in strings like 1234567890.
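None of the snippets above converts the result to float, which is what the question asked for. A small sketch of one way to do it, assuming titles without a year should end up as NaN; the second title is invented for illustration.

```python
import pandas as pd

# The second title is a hypothetical row without a year.
wine = pd.DataFrame({'title': ['Nicosia 2013 Vulkà Bianco', 'Some Wine NV']})

# expand=False returns a Series for a single capturing group;
# astype(float) turns '2013' into 2013.0 and keeps missing values as NaN.
wine['year'] = (
    wine['title']
    .str.extract(r'\b((?:19|20)\d{2})\b', expand=False)
    .astype(float)
)
print(wine['year'].dtype)  # float64
```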
