Suppose I have a string like str = "The Invisible Man (2020)". In Python, I want to split it into a list containing the title and the year (the year is always in parentheses at the end of the string), like this:
['The Invisible Man', '2020']
How can I achieve this goal using a regular expression in Python?
Here's one way using re.split, which works for this specific string structure:
import re
s = "The Invisible Man (2020)"
re.split(r'\s+\((\d+)\)', s)[:2]
# ['The Invisible Man', '2020']
Here is one way using a regular expression with named groups. Match the longest run of letters and spaces that is followed by a space and an opening parenthesis, and name it name. Then match the four-digit number inside the parentheses and name it year.
Finally, build the list requested in the question.
import re
r = re.compile(r'(?P<name>([a-zA-Z ]*)) \((?P<year>\d\d\d\d)\)')
m = r.match("The Invisible Man (2020)")
l = [m.group('name'), m.group('year')]
You can write a regex for the whole string, then use re.search and the groups() method of the returned match object to get the title and year out of the string:
import re
s = "The Invisible Man (2020)"
regex = r"(.+) \((\d+)\)"
title, year = re.search(regex, s).groups()
print('title = "{}", year = "{}"'.format(title, year))
Output:
title = "The Invisible Man", year = "2020"
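The answers above all assume that the string matches. As a minimal sketch, the same idea can be wrapped in a function that returns None when there is no trailing year. The name split_title_year and the four-digit year are assumptions made for illustration, not something the question specifies:

```python
import re

# Assumed shape: a title, optional whitespace, then a 4-digit year in parentheses.
_TITLE_YEAR = re.compile(r'(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*')

def split_title_year(s):
    """Return [title, year], or None when s does not end in '(YYYY)'."""
    m = _TITLE_YEAR.fullmatch(s)
    if m is None:
        return None
    return [m.group('title'), m.group('year')]

print(split_title_year("The Invisible Man (2020)"))  # ['The Invisible Man', '2020']
print(split_title_year("No year here"))              # None
```

Using fullmatch means a title that itself contains digits, such as "Blade Runner 2049 (2017)", still yields only the parenthesized year.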
I'm using BeautifulSoup to scrape a table on a Wikipedia page and write the extracted data to a file.
The problem is that I want to remove some substrings from the list of column headers generated from the table.
Here is my code:
soup= bs(result.text,'html.parser')
country_names= soup.find('table', class_= 'wikitable sortable').tbody
rows= country_names.find_all('tr')
columns=[v.text.replace('[a][b][13]\n', '') for v in rows[0].find_all('th')]
print(columns)
All I was able to do is remove a single substring from the strings in the list using replace().
The output before the replace() call:
['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area[a][b][13]\n']
The output after the replace() call:
['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area']
I want to remove all such substrings, for example '[8][9][10]\n', '[6][7]\n', '[6][7][8]\n' and '(2018)[11][12]\n', but I couldn't find a solution because I'm still new to Python and BeautifulSoup.
I suggest looking into regular expressions.
For example, \[\d+\] matches one or more digits inside square brackets.
import re
org_string = 'Capital[8][9][10]\n'
pattern = r'\[\d+\]'
mod_string = re.sub(pattern, '', org_string )
# 'Capital\n' (the trailing newline is not part of the pattern, so it stays)
I think this is the solution you are looking for:
import re
columns = [re.sub(r'\[[0-9]+\]', '', v.text).replace('\n', '') for v in rows[0].find_all('th')]
You can use Python's re module for this. Its function re.sub(pattern, repl, string) replaces every substring of string that matches the regular expression pattern with repl.
Something like this:
# make sure to import the re module
import re
columns = [re.sub(r'(\[[a-zA-Z\d]*\])+\n', '', v.text) for v in rows[0].find_all('th')]
Edit: Changed the regular expression pattern
Your desired output requires the use of regular expressions inside a list comprehension:
import re
list_before = ['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n', 'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n', 'Population (2018)[11][12]\n', 'Area[a][b][13]\n']
pattern = r'(\(\d+\))*(\[\w+\])*\n?'
list_after = [re.sub(pattern, "", elem).strip() for elem in list_before]
pattern is the regular expression to substitute in each string of list_before. You may want to dig deeper into regular expressions to fully understand it, but in plain English it matches:
zero or more occurrences of "(" followed by one or more digits (the special sequence \d) followed by ")"
zero or more occurrences of "[" followed by one or more word characters (the special sequence \w: letters, digits, and underscore) followed by "]"
zero or one occurrence of the newline \n
Finally, re.sub() inside the list comprehension replaces every match with "", and strip() removes any remaining leading or trailing whitespace.
The output is:
['Flag', 'Map', 'English short nameandformal name', 'Local short name(s)andformal name(s)', 'Capital', 'Population', 'Area']
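A slightly stricter variant, sketched under the assumption that the noise always sits at the end of a header and always contains at least one [ref] marker, optionally preceded by a parenthesized number. Anchoring at the end of the string keeps the pattern from matching the empty string everywhere. The names _NOISE and clean_header are made up for this sketch:

```python
import re

# Assumed noise: optional "(digits)" plus one or more "[ref]" markers at the very end.
_NOISE = re.compile(r'\s*(?:\(\d+\))?(?:\[\w+\])+\s*$')

def clean_header(text):
    """Strip trailing footnote markers and surrounding whitespace from a header."""
    return _NOISE.sub('', text).strip()

headers = ['Flag\n', 'Map\n', 'English short nameandformal name[6][7][8]\n',
           'Local short name(s)andformal name(s)[6][7]\n', 'Capital[8][9][10]\n',
           'Population (2018)[11][12]\n', 'Area[a][b][13]\n']
print([clean_header(h) for h in headers])
```

With the scraped table, the same function could be applied as [clean_header(v.text) for v in rows[0].find_all('th')].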
I am trying to extract specific parts of a string using a regex or something similar.
For example:
string = "hi i am *hadi* and i have &18& year old"
name = regex.find("query")
age = regex.find("query")
print(name,age)
result:
hadi 18
I need 'hadi' and '18'.
Note: the string is different each time. I need the words or sentence between the * pair and between the & pair.
Try:
import re
string = "hi i am *hadi* and i have &18& year old"
pattern = r'(?:\*|&)(\w+)(?:\*|&)'
print(re.findall(pattern, string))
Outputs:
['hadi', '18']
You can assign re.findall(pattern, string) to a variable to get a Python list and then access the values by index.
Regex demo:
https://regex101.com/r/vIg7lU/1
The \w+ in the regex can be changed to .*? if the delimited text can contain more than letters, digits, and underscores. Example: (?:\*|&)(.*?)(?:\*|&) and demo: https://regex101.com/r/RIqLuI/1
This is how I solved my problem:
import re
string = "hello. my name is *hadi* and i am ^18^ years old."
name = re.findall(r"\*(.+)\*", string)
age = re.findall(r"\^(.+)\^", string)
print(name[0], age[0])
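The patterns above use a greedy .+, which is fine while each delimiter pair occurs once per string. If a delimiter can appear several times, a non-greedy group keeps the matches separate. A minimal sketch, where the helper name between and the sample sentence are illustrative only:

```python
import re

def between(text, delim):
    """Return every substring enclosed by a pair of `delim` characters."""
    d = re.escape(delim)  # '*' and '^' are regex metacharacters, so escape them
    return re.findall(d + r'(.+?)' + d, text)

s = "hi i am *hadi* and i have &18& year old, my friend is *sara*"
print(between(s, '*'))  # ['hadi', 'sara']
print(between(s, '&'))  # ['18']
```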
I am trying to remove a string from a column using regular expressions and replace.
Name
"George # ACkDk02gfe" sold
I want to remove " # ACkDk02gfe"
I have tried several variations of the code below, but I can't seem to remove the part I want.
df['Name'] = df['Name'].str.replace('(\#\D+\"$)','')
The output should be
George sold
The portion "ACkDk02gfe" of the string is entirely random.
Let's try this using a regex with | ("OR") and a group:
df['Name'].str.replace(r'"|(\s#\s\w+)', '', regex=True)
Output:
0 George sold
Name: Name, dtype: object
Updated
df['Name'].str.replace(r'"|(\s#\s\w*[-]?\w+)', '', regex=True)
Where df,
Name
0 "George # ACkDk02gfe" sold
1 "Mike # AisBcIy-rW" sold
Output:
0 George sold
1 Mike sold
Name: Name, dtype: object
Your pattern and syntax are wrong.
import pandas as pd
# set up the df
df = pd.DataFrame.from_dict(({'Name': '"George # ACkDk02gfe" sold'},))
# use a raw string for the pattern
df['Name'] = df['Name'].str.replace(r'^"(\w+)\s#.*?"', '\\1', regex=True)
I'll let someone else post a regex answer, but this could also be done with split. I don't know how consistent the data you are looking at is, but this would work for the provided string:
df['Name'] = df['Name'].str.split(' ').str[0].str[1:] + ' ' + df['Name'].str.split(' ').str[-1]
output:
George sold
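This should work for you.
Split the string on the sequence whitespace, #, whitespace, the token immediately after #, and the whitespace following that token. This yields a list per row; .str.join(' ') then joins the pieces back together with a space. Note that the pattern expects whitespace directly after the token, so it assumes the values do not carry the surrounding double quotes.
df.Name = df.Name.str.split(r'\s#\s\w+\s').str.join(' ')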
0 George sold
To use re.sub() on a column, apply it to each value; re.sub() expects a string, not a Series.
import re
# Name
# "George # ACkDk02gfe" sold
df['Name'] = df['Name'].apply(lambda s: re.sub(r'"|\s#\s[^"]*', '', s))
should work.
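import re
ss = '"George # ACkDk02gfe" sold'
ss = re.sub('"', "", ss)  # (1) remove the literal double quotes
ss = re.sub(r"#\s*\w+", "", ss)  # (2) remove "#" and the token after it
ss = re.sub(r"\s+", " ", ss)  # (3) collapse each run of whitespace
print(ss)  # George sold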
Assuming this is the general format of your data, here is the process step by step: (1) remove the literal " characters; (2) remove whatever matches #\s*\w+, that is, a literal # optionally followed by whitespace and then one or more word characters; (3) replace each run of whitespace with a single space.
You can wrap this process in a function and apply it to the column. Hope it helps!
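Wrapped in a function, the three steps might look like the sketch below. The name clean_name is hypothetical, and the character class [\w-] assumes the random token contains only word characters and hyphens, as in the two sample rows shown earlier:

```python
import re

def clean_name(value):
    """Remove the quotes and the '# token' part, then tidy the whitespace."""
    value = value.replace('"', '')               # (1) drop the double quotes
    value = re.sub(r'\s*#\s*[\w-]+', '', value)  # (2) drop "#" and the token after it
    return re.sub(r'\s+', ' ', value).strip()    # (3) collapse runs of whitespace

print(clean_name('"George # ACkDk02gfe" sold'))  # George sold
print(clean_name('"Mike # AisBcIy-rW" sold'))    # Mike sold
```

On a DataFrame, the function can be applied per value with df['Name'] = df['Name'].map(clean_name).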
I have been scraping data from a site.
This is the list I have scraped so far:
[' ', '*One child under 12 years old stays free using existing bedding.', '24 hour front desk', 'Bar / Lounge', 'Business centre', 'Concierge', 'Dry cleaning / laundry service', ...
More entries (about 20) will be scraped too.
I want to create a column in my table for every entry in the list, using its first 20 characters.
Here is how I currently filter these entries to make a valid MySQL column name:
column_name = column_to_create[:20].replace(" ","_").replace("/","_").replace("*","_").replace("-","_").replace("$","_").replace("&","_").replace(".","_")
I know this does not cover many invalid characters.
How can I filter a string to get a valid column name? Is there a shorter solution, perhaps with a regex?
Use this Regex:
column_name = re.sub(r'[-/*$&.\s]+','_',column_to_create[:20])
Demo:
>>> import re
>>> st = "replace/ these**characters---all$$of&them....with_"
>>> re.sub(r'[-/*$&.\s]+','_',st)
'replace_these_characters_all_of_them_with_'
Also, if there is any other character you want to replace with _, just add it inside the square brackets in the regex. For example, to replace # as well, the call becomes re.sub(r'[-/*$&.\s#]+','_',column_to_create[:20]).
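Instead of listing every character to replace, the logic can be inverted so that only known-safe characters are kept. The sketch below assumes that ASCII letters, digits, and underscores are all you want in a column name; to_column_name and the 'col' fallback for empty results are made-up names for illustration:

```python
import re

def to_column_name(label, max_len=20):
    """Turn a scraped label into a conservative column name."""
    # Replace every run of characters outside [A-Za-z0-9_] with one underscore.
    name = re.sub(r'\W+', '_', label[:max_len], flags=re.ASCII)
    name = name.strip('_')  # drop leading/trailing underscores
    return name or 'col'    # hypothetical fallback for blank labels

for label in [' ', 'Bar / Lounge', '24 hour front desk',
              '*One child under 12 years old stays free using existing bedding.']:
    print(repr(to_column_name(label)))
```

Truncating to 20 characters can make two different labels collapse to the same name, so a uniqueness check may still be needed before creating the columns.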
Python has a translate capability that makes it easy to change one character into another or to delete characters. I use it something like this in Python 2 (after the import, the first three lines are setup and the fourth line actually uses it):
import string
norm = string.maketrans(' _,','---') # space underscore comma to dash
keep = "-#'$%{}[]~#().&^+=/\/:"
toss = string.translate(norm,norm,string.letters+string.digits+keep)
toName = toName.translate(norm,toss)
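string.maketrans and string.letters exist only in Python 2. A rough Python 3 equivalent of the same idea is sketched below; the function name normalize and the sample input are assumptions, while the kept character set mirrors the keep string above:

```python
import string

# Map space, underscore, and comma to a dash.
norm = str.maketrans(' _,', '---')
# Characters allowed to survive: letters, digits, and the punctuation from `keep`.
keep = set(string.ascii_letters + string.digits + "-#'$%{}[]~().&^+=/\\:")

def normalize(name):
    """Map the separators to dashes, then drop every character outside `keep`."""
    return ''.join(ch for ch in name.translate(norm) if ch in keep)

print(normalize('Bar / Lounge, 24_hour *desk*'))  # Bar-/-Lounge--24-hour-desk
```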
I'm trying to search a string for numbers, and when finding them, wrap some chars around them, e.g.
a = "hello, i am 8 years old and have 12 toys"
a = method(a)
print a
"hello, i am \ref{8} years old and have \ref{12} toys"
I've looked at the re (regular expression) library, but cannot seem to find anything helpful... any cool ideas?
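This is pretty basic usage of the .sub method:
numbers = re.compile(r'(\d+)')
a = numbers.sub(r'\\ref{\1}', a)
The parentheses around the \d+ number pattern create a group, and the \1 reference is replaced with the contents of the group. The backslash before ref is doubled because a lone \r in the replacement string would be read as a carriage return.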
>>> import re
>>> a = "hello, i am 8 years old and have 12 toys"
>>> numbers = re.compile(r'(\d+)')
>>> a = numbers.sub(r'\\ref{\1}', a)
>>> print(a)
hello, i am \ref{8} years old and have \ref{12} toys
You need to use the re.sub function along these lines:
re.sub(r"(\d+)", my_sub_func, text)  # matches the numbers (although this only catches integers)
where my_sub_func is defined like this:
def my_sub_func(match_obj):
    text = match_obj.group(0)  # the matched digits
    new_text = "\\ref{" + text + "}"  # build the replacement
    return new_text
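A self-contained sketch of the callable approach that also picks up simple decimals. The pattern assumes numbers look like 12 or 12.5 (no signs, exponents, or thousands separators), and the names NUMBER and wrap_ref are illustrative. When the replacement is a function, its return value is inserted literally, so no extra escaping beyond the normal Python string literal is needed:

```python
import re

# Assumed number format: digits with an optional fractional part.
NUMBER = re.compile(r'\d+(?:\.\d+)?')

def wrap_ref(match):
    """Wrap the matched number in \\ref{...}."""
    return '\\ref{' + match.group(0) + '}'

a = "hello, i am 8 years old and have 12.5 toys"
print(NUMBER.sub(wrap_ref, a))
# hello, i am \ref{8} years old and have \ref{12.5} toys
```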