I am learning how to web scrape with Python using a Wikipedia article. I managed to get the data I needed, the tables, by using the .get_text() method on the table rows (<tr>).
I am cleaning up the data in Pandas, and one of the routines involves getting the date a book or movie was published. The date can appear in several forms, such as:
(1986)
(1986-1989)
(1986-present)
Currently, I am using the code below which works on a test sentence:
# get the first columns of row 19 from the table and get its text
test = data_collector[19].find_all('td')[0]
text = test.get_text()
#create and test the pattern
pattern = re.compile('\(\d\d\d\d\)|\(\d\d\d\d-\d\d\d\d\)|\(\d\d\d\d-[ Ppresent]*\)')
re.findall(pattern, 'This is Agent (1857), the years were (1987-1868), which lasted from (1678- Present)')
I get the expected output on the test sentence.
['(1857)', '(1987-1868)', '(1678- Present)']
However, when I test it on a particular piece of text from the wiki article 'The Adventures of Sherlock Holmes (1891–1892) (series), (1892) (novel), Arthur Conan Doyle\n', I am able to extract (1892), but NOT (1891-1892).
text = test.get_text()
re.findall(pattern, text)
o/p: ['(1892)']
Even as I type this, I can see that the hyphen I am using and the one in the text are different. I am sure that this is the issue, and I was hoping someone could tell me what this particular symbol is called and how I can actually "type" it using my keyboard.
Thank you!
I suggest enhancing the pattern to search for the most common hyphens, -, – and —, and fix the present pattern from a character class to a char sequence (so as not to match sent with [ Ppresent]*):
re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)
Note that the re.I flag makes the regex match in a case-insensitive way.
Details
\( - a (
\d{4} - four digits ({4} is a limiting quantifier that repeats the pattern it modifies four times)
(?:[\s–—-]+(?:\d{4}|present))? - an optional (as there is a ? at the end) non-capturing (due to ?:) group matching 1 or 0 occurrences of
[\s–—-]+ - 1 or more whitespaces, -, — or –
(?:\d{4}|present) - either 4 digits or present
\) - a ) char.
If you plan to match any hyphens use [\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\s]+ instead of [\s–—-]+.
Or, to match any 1+ non-word chars at that location, probably, other than ( and ), use [^\w()]+ instead: re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I).
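As a rough check of the pattern above, the sketch below runs it against the Sherlock Holmes text from the question and also looks up the Unicode names of the three dash characters, which answers the "what is this symbol called" part. The variable names are illustrative, and the dashes are written as \u escapes only so they are easy to tell apart.

```python
import re
import unicodedata

# The dash in the Wikipedia text is U+2013, not the keyboard hyphen U+002D.
wiki_text = (
    "The Adventures of Sherlock Holmes (1891\u20131892) (series), "
    "(1892) (novel), Arthur Conan Doyle\n"
)

# Same pattern as above, with the en dash and em dash spelled as escapes.
year_pattern = re.compile(
    r"\(\d{4}(?:[\s\u2013\u2014-]+(?:\d{4}|present))?\)", re.I
)

matches = year_pattern.findall(wiki_text)
print(matches)

# unicodedata reports the official name of each character.
dash_names = [unicodedata.name(ch) for ch in "-\u2013\u2014"]
print(dash_names)  # ['HYPHEN-MINUS', 'EN DASH', 'EM DASH']
```

Writing the character as "\u2013" in a Python string is one way to use it without having to type it on the keyboard.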
Related
I'm processing some raw texts that may contain several links. The example links that may appear in the text look like this:
\\1.123.42.5\foo\bar\file_name
\\remote_node\shared_path\520.38-nonprod\util\file_name
Links like the above can appear anywhere in the text: at the beginning, in the middle, or at the end. Notice that dots can appear in any chunk of the link but not as the last char. For example:
"Here is an example text and we want to use regex to extract the links. The links we are interested in are \\1.123.42.5\foo\bar\file_name and \\remote_node\shared_path\520.38-nonprod\util\file_name. This text can be continued with even more sentences ..."
By using re.findall(), I'm hoping to get a list, ["\\1.123.42.5\foo\bar\file_name", "\\remote_node\shared_path\520.38-nonprod\util\file_name"]
Notice that the dot following the link is not included in the second link as it's the period of the sentence.
We don't know how many chunks/directories a link consists of (>=2 for sure). We only know that the first chunk allows alphanumeric characters, dots and underscores. The rest of the chunks allow alphanumeric characters, dots, underscores and hyphens, and the link cannot end with a dot.
The regex I have currently is:
first_chunk= r"\w\."
rest_chunk= r"\w\.\-\\"
pattern = re.compile(r"\\\\[%s]+\\[%s]+" % (first_chunk, rest_chunk))
However, this pattern also includes trailing dots (if any) in the links. After seeing end-of-line, I also tried
first_chunk= r"\w\."
rest_chunk= r"\w\.\-\\"
pattern = re.compile(r"\\\\[%s]+\\[%s]+[^\.]$" % (first_chunk, rest_chunk))
or
pattern = re.compile(r"\\\\[%s]+\\[%s]+[^\.]$" % (first_chunk, rest_chunk), flags=re.MULTILINE )
Neither regex can precisely extract the correct links from the text.
I'm wondering how to modify the regex to achieve my goal. Any comments would be extremely appreciated. Thanks!
You could write the pattern so that it matches \ at least 2 times, and in the final part of the pattern omit matching a trailing dot.
\\\\[\w.]+(?:\\[\w.-]+)+\\[\w-]+(?:\.[\w+-]+)*
Explanation
\\\\ Match \\
[\w.]+ Match 1+ times either \w or .
(?:\\[\w.-]+)+ Repeat 1+ times starting with \ and the same character class as before
\\[\w-]+ Match \ again (to have at least 2 occurrences of \) and match 1+ times either \w or -
(?:\.[\w+-]+)* Optionally repeat . followed by 1+ word chars, + or -
Example
import re
s = r"""Here is an example text and we want to use regex to extract the links.
The links we are interested in are \\1.123.42.5\foo\bar\file_name and \\remote_node\shared_path\520.38-nonprod\util\file_name.
This text can be continued with even more sentences ...
\\1.123.42.5\foo\bar\file_name.png"""
pattern = r"\\\\[\w.]+(?:\\[\w.-]+)+\\[\w-]+(?:\.[\w+-]+)*"
print(re.findall(pattern, s))
Output
[
'\\\\1.123.42.5\\foo\\bar\\file_name',
'\\\\remote_node\\shared_path\\520.38-nonprod\\util\\file_name',
'\\\\1.123.42.5\\foo\\bar\\file_name.png'
]
As @Stuart said in the comments, maybe the re module is unnecessary here:
s = r"""Here is an example text and we want to use regex to extract the links.
The links we are interested in are \\1.123.42.5\foo\bar\file_name and \\remote_node\shared_path\520.38-nonprod\util\file_name.
This text can be continued with even more sentences ..."""
for word in s.split():
    if word.startswith("\\"):
        print(word.strip("."))
Prints:
\\1.123.42.5\foo\bar\file_name
\\remote_node\shared_path\520.38-nonprod\util\file_name
I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that match to Tom Thumb.
You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)
In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - is a negative lookahead that fails the match if there are any 0+ characters (depending on the engine including/excluding linebreak symbols) followed with Thumb.
Tom(?!\s+Thumb) is what you search for.
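To make the difference between the suggested patterns concrete, here is a small sketch in Python (the question does not name a regex flavor, so Python is an assumption, and the sample lines are made up for illustration). It compares the lookahead that only rejects an immediately following Thumb with the one that rejects Thumb anywhere later on the line.

```python
import re

# Hypothetical sample lines, not taken from the question.
lines = [
    "Tom Thumb is small",
    "Tom Sawyer paints a fence",
    "Tom met Thumb yesterday",
]

# Rejects Tom only when Thumb is the very next word.
adjacent = re.compile(r"Tom(?!\s+Thumb)")
# Rejects Tom when the whole word Thumb appears anywhere later on the line.
anywhere = re.compile(r"\bTom\b(?!.*\bThumb\b)")

adjacent_hits = [line for line in lines if adjacent.search(line)]
anywhere_hits = [line for line in lines if anywhere.search(line)]

print(adjacent_hits)
print(anywhere_hits)
```

The first pattern keeps "Tom met Thumb yesterday" because Thumb is not directly after Tom, while the second pattern drops it.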
I am trying to clean up text for use in a machine learning application. Basically these are specification documents that are "semi-structured" and I am trying to remove the section number that is messing with NLTK sent_tokenize() function.
Here is a sample of the text I am working with:
and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3
...
(b)
until thirty-five days after the time fixed for receiving this tender,
whichever first occurs.
2.4
AGREEMENT
Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.
I am trying to remove all the section breaks (ex. 2.3.3, 2.4, (b)), but not the date numbers.
Here is the regex I have so far: [0-9]*\.[0-9]|[0-9]\.
Unfortunately, it matches part of the date in the last paragraph (2019. turns into 201), and I really don't know how to fix this, being a non-expert at regex.
Thanks for any help!
You may try replacing the following pattern with an empty string:
((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))
import re
# text is assumed to hold the whole document as a single string
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', text)
print(output)
This pattern works by matching a section number as \d+(?:\.\d+)*, but only if it appears as the start of a line. It also matches letter section headers as \([a-z]+\).
For your specific case, I think \n[\d+\.]+|\n\(\w\) should work. The \n helps to differentiate the section.
The pattern you tried, [0-9]*\.[0-9]|[0-9]\., is not anchored: it matches either 0+ digits, a dot and a single digit, or (after the |) a single digit and a dot.
It does not take the matches between parentheses into account.
Assuming that the section breaks are at the start of the string and perhaps might be preceded with spaces or tabs, you could update your pattern with the alternation to:
^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))
^ Start of string (or start of each line when re.MULTILINE is used)
[\t ]* Match 0+ times a space or tab
(?: Non capturing group
\d+(?:\.\d+)+ Match 1+ digits and repeat 1+ times a dot and 1+ digits to match at least a single dot to match 2.3.3 or 2.4
|
\([a-z]+\) Match 1+ times a-z between parentheses
) Close non capturing group
For example, using re.MULTILINE, where s is your string:
pattern = r"^(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, flags=re.MULTILINE)
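A minimal, self-contained sketch of the anchored approach, run on a shortened version of the sample text from the question (the sample is abbreviated and the variable names are illustrative). It shows that section markers at the start of a line are removed while the date at the end of the last sentence is left alone.

```python
import re

# Abbreviated sample built from lines in the question.
sample = (
    "aforementioned deposit, and the same is forfeited to the Crown.\n"
    "2.3.3\n"
    "(b)\n"
    "until thirty-five days after the time fixed for receiving this tender,\n"
    "2.4\n"
    "AGREEMENT\n"
    "complete the said work on or before October 15, 2019.\n"
)

# Anchored at line start; requires at least one dot in numeric section labels.
section_pattern = re.compile(r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))", re.MULTILINE)

cleaned = section_pattern.sub("", sample)
print(cleaned)
```

The lines that held only a section marker remain as empty lines; whether to drop those as well depends on what the tokenizer expects.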
I have the following code:
import pandas as pd
s = pd.Series(['toy story (1995)', 'the pirates (2014)'])
print(s.str.extract('.*\((.*)\).*',expand = True))
with output:
0
0 1995
1 2014
I understand that the extract function is pulling the values between the parentheses for both series objects. However I do not understand how. What exactly does '.*\((.*)\).*' mean? I think that the asterisks represent wild card characters but beyond that I am quite confused as to what is actually going on with this expression.
.*\( matches everything up to and including a ( (because .* is greedy, this is the last ( that still lets the rest of the pattern match)
\).* matches everything from ) until the end
(.*) returns everything in between the first two matches
.* Match any number of characters
\( Match one opening parenthesis
(.*) Match any number of characters into the first capturing group
\) Match a closing parenthesis
.* Match any number of characters
This notation is called a regular expression, and I guess Pandas uses regexes in the extract function so you can get more precise data. Things inside capturing groups would be returned.
You can learn more about regexes at the Wikipedia page.
Here's a test example using your regex.
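The breakdown above can be tried out with the plain re module, which avoids needing Pandas installed; this is only a sketch, and the second title is a made-up example. It shows what the capturing group returns and how the greedy leading .* behaves when a string contains more than one pair of parentheses.

```python
import re

pattern = re.compile(r".*\((.*)\).*")

# group(1) is the text captured by the (.*) part of the pattern.
single = pattern.match("toy story (1995)").group(1)

# The leading .* is greedy, so with two bracketed parts the last one is captured.
greedy = pattern.match("toy story (1995) (remastered)").group(1)

# A narrower pattern that only accepts four digits between the parentheses.
year = re.search(r"\((\d{4})\)", "toy story (1995) (remastered)").group(1)

print(single, greedy, year)
```

If only years are wanted, a pattern such as \((\d{4})\) is less likely to pick up other bracketed text.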
I’m stumped on a problem. I have a large data frame where two of the columns are like this:
pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'], ['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
What I’m trying to do is leave only the URL including the word “twitter” left in each cell and remove the rest. The pattern is that the URLs I want always include the word “twitter” and ends with “/” + a one-digit number. In the cases where there are two identical URLs in the same cell then only one should remain. Like this:
Test2 = pd.DataFrame([['a', 'https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
Test2
I’m new to Python and after a lot of googling I’ve started to understand that something called regex is the answer, but that is as far as I have gotten. One of the postings here at Stack Overflow led me to regex101.com, and after playing around this is as far as I’ve come, and it doesn't work:
r'^[https]+(:)(//)(.*?)(/)(\d)'
Can anyone tell me how to solve this problem?
Thanks in advance.
Regular expressions are certainly handy for such tasks. Refer to this question and online tools such as regex101 to learn more.
Your current pattern is incorrect because:
^ Matches the following pattern at the start of string.
[https]+ This is a character set, meaning it will match h, s, ps, therefore any combination of one or more letters present in the [] brackets, and not just the strings http and https which is what you are after.
(:) You don't need to put this : in a capturing group here.
(//) In Python, / does not need to be escaped (it does in flavors that use / as the pattern delimiter, where it is written \/). No need for a capturing group here either.
(.*?) The .*? combo is often misused when a negated character set [^] could be used instead.
(/) As discussed above.
(\d) Matches and captures a digit. The capturing group here is also redundant for your task.
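The point about [https]+ being a character set rather than a literal string can be checked with a short sketch like the one below; the candidate strings are made up for illustration.

```python
import re

char_set = re.compile(r"[https]+")  # any run of the letters h, t, p, s
literal = re.compile(r"https?")     # the string http, optionally followed by s

candidates = ["http", "https", "sptth", "hhh"]

set_hits = [c for c in candidates if char_set.fullmatch(c)]
literal_hits = [c for c in candidates if literal.fullmatch(c)]

print(set_hits)
print(literal_hits)
```

The character set accepts all four candidates, including the scrambled ones, while https? accepts only the two real scheme names.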
You may use the following expression:
https?:\/\/twitter\.com[^,]+(?<=\/\d$)
https? Matches literal substrings http or https.
:\/\/twitter\.com Matches literal substring ://twitter.com.
[^,]+ Anything that is not a comma, one or more.
(?<=\/\d$) Positive lookbehind. Assert that a / followed by a digit \d is present at the end of the string $.
Python demo:
import pandas as pd
df = pd.DataFrame([['a', 'https://gofundme.com/ydvmve-surgery-for-jax,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['b','https://twitter.com/dog_rates/status/890971913173991426/photo/1,https://twitter.com/dog_rates/status/890971913173991426/photo/1'],
['c','https://twitter.com/dog_rates/status/890971913173991430/video/1'] ],columns=['ID','URLs'])
df['URLs'] = df['URLs'].str.findall(r"https?:\/\/twitter\.com[^,]+(?<=\/\d$)").str[0]
print(df)
Prints:
ID URLs
0 a https://twitter.com/dog_rates/status/890971913173991426/photo/1
1 b https://twitter.com/dog_rates/status/890971913173991426/photo/1
2 c https://twitter.com/dog_rates/status/890971913173991430/video/1