get text in div after specific character [xpath] - python

I am using XPath to scrape a website. I've been able to access most of the information I need except for the date. The date is text in a div, formatted as shown below.
October 13, 2018 / 1:31 AM / Updated 5 hours ago
I just want to get the date, not the time or anything else. However, with my current code, I am getting the entire text in the div. My code is below.
item['datePublished'] = response.xpath("//div[contains(@class, 'ArticleHeader_date') and substring-before(., '/')]/text()").extract()

As hinted, there are ways to do this in XPath 2.0+. However, this should be done in the host language.
One way is to extract the date using a regex after the value has been retrieved, e.g.
\w+\ \d\d?,\ \d{4}
Code Sample:
import re
regex = r"\w+\ \d\d?,\ \d{4}"
test_str = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
matches = re.search(regex, test_str)
if matches:
    print(matches.group())
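If the timestamp always uses "/" as the separator, an alternative to the regex is to split on the first slash and parse what is left with datetime.strptime, which also validates the value. The following is only a minimal sketch under that assumption; the helper name extract_date is hypothetical and not part of the original answer, and %B expects English month names in the default locale.

```python
from datetime import datetime


def extract_date(raw):
    """Return the leading 'Month D, YYYY' part of the header as a date object."""
    # Keep only the text before the first '/' and trim the surrounding spaces.
    date_text = raw.split("/", 1)[0].strip()
    return datetime.strptime(date_text, "%B %d, %Y").date()


raw = "October 13, 2018 / 1:31 AM / Updated 5 hours ago"
print(extract_date(raw))  # 2018-10-13
```

In a scraper, raw would be the joined result of the XPath .extract() call.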

Related

How to find a date at a specific word using regex

Sentence : "I went to hospital and admitted. Date of admission: 12/08/2019 and surgery of Date of surgery: 15/09/2015. Date of admission: 12/05/2018 is admitted Raju"
keyword: "Date of admission:"
Required solution: 12/08/2019,12/05/2018
Is there a way to get only the dates that follow "Date of admission:"?
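I was unable to reproduce the result in the answer by @Ders. In any case, I think .findall() is more appropriate here because it returns every match:
import re
s = "I went to hospital and admitted. Date of admission: 12/08/2019 and surgery of Date of surgery: 15/09/2015. Date of admission: 12/05/2018 is admitted Raju"
pattern = re.compile(r"Date of admission: (\d{2}/\d{2}/\d{4})")
print(pattern.findall(s))
# ['12/08/2019', '12/05/2018']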
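Use a capturing group. If the regex matches, you can get the contents of the group. Use search() rather than match(), because match() only matches at the start of the string.
import re
s = "I went to hospital and admitted. Date of admission: 12/08/2019 and surgery of Date of surgery: 15/09/2015. Date of admission: 12/05/2018 is admitted Raju"
p = re.compile(r"Date of admission: (\d{2}/\d{2}/\d{4})")
m = p.search(s)
if m:
    date = m.group(1)
    # 12/08/2019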
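The three entry points of the re module behave differently on the sentence from the question. The sketch below runs the same compiled pattern through match(), search() and findall() for comparison; the variable names are illustrative.

```python
import re

s = ("I went to hospital and admitted. Date of admission: 12/08/2019 and "
     "surgery of Date of surgery: 15/09/2015. Date of admission: 12/05/2018 "
     "is admitted Raju")
pattern = re.compile(r"Date of admission: (\d{2}/\d{2}/\d{4})")

# match() only tries at position 0, and the sentence does not start with the keyword.
print(pattern.match(s))            # None
# search() scans forward and stops at the first hit.
print(pattern.search(s).group(1))  # 12/08/2019
# findall() returns the captured group of every hit.
print(pattern.findall(s))          # ['12/08/2019', '12/05/2018']
```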

Skipping XML elements using Regular Expressions in Python 3

I have an XML document where I wish to extract certain text contained in specific tags such as-
<title>Four-minute warning</title>
<categories>
<category>Nuclear warfare</category>
<category>Cold War</category>
<category>Cold War military history of the United Kingdom</category>
<category>disaster preparedness in the United Kingdom</category>
<category>History of the United Kingdom</category>
</categories>
<bdy>
some text
</bdy>
In this toy example, I want to extract all the text contained in <title> tags using the following regular-expression code in Python 3:
# Python 3 code using re
import re
with open("some_xml_file.xml", "r") as file:
    xml_doc = file.read()
title_text = re.findall(r'<title>.+</title>', xml_doc)
if title_text:
    print("\nMatches found!\n")
    for title in title_text:
        print(title)
else:
    print("\nNo matches found!\n\n")
It gives me the text within the XML tags ALONG with the tags. An example of a single output would be-
<title>Four-minute warning</title>
My question is, how should I frame the pattern within the re.findall() or re.search() methods so that the <title> and </title> tags are skipped and all I get is the text between them?
Thanks for your help!
Just use a capture group in your regex (re.findall() takes care of the rest in this case). For example:
import re
s = '<title>Four-minute warning</title>'
title_text = re.findall(r'<title>(.+)</title>', s)
print(title_text[0])
# OUTPUT
# Four-minute warning
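One caveat: .+ is greedy, so the pattern above can overrun when several tags of the same kind share a line. A non-greedy .+? stops at the first closing tag. The sketch below shows the difference on the <category> tags from the question; the sample string is abbreviated and the variable names are illustrative.

```python
import re

xml_doc = (
    "<title>Four-minute warning</title>"
    "<categories>"
    "<category>Nuclear warfare</category><category>Cold War</category>"
    "</categories>"
)

# Non-greedy: each capture ends at the nearest </category>.
categories = re.findall(r"<category>(.+?)</category>", xml_doc)
print(categories)  # ['Nuclear warfare', 'Cold War']

# Greedy: a single capture runs up to the last </category> on the line.
greedy = re.findall(r"<category>(.+)</category>", xml_doc)
print(greedy)  # ['Nuclear warfare</category><category>Cold War']
```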

Extract Numbers from Text File excluding Dates

I have simple code which extracts numbers from a text file. It looks like this:
import re
codata = []
with open(r"filename.txt") as file:
    for line in file:
        for i in re.findall(r'\d+(?:\.\d+)?', line):
            codata.append(i)
The text contains a lot of financial data and also a lot of dates which I don't want. Is there an easy way to modify the code to exclude dates? The dates generally follow these formats (I'm using a specific date as an example for the format but it can be any date):
August 31, 2018
8/31/2018
8/31/18
August 2018
FY2018
CY2018
fiscal year 2018
calendar year 2018
Here is an example. I have a text file with the following text:
"For purposes of the financial analyses described in this section, the term “implied merger consideration” refers to the implied value of the per share consideration provided for in the transaction of $80.38 consisting of the cash portion of the consideration of $20.25 and the implied value of the stock portion of the consideration of 0.275 shares of XXX common stock based on XXX’s closing stock price of $218.67 per share on July 14, 2018."
When I run the code I posted above, I get this output from print(codata):
['80.38', '20.25', '0.275', '218.67', '14', '2018']
I would like to get this output instead:
['80.38', '20.25', '0.275', '218.67']
So I don't want to pick up the numbers 14 and 2018 associated with the date "July 14, 2018". If I know that any numbers related to dates within the text would have the formats that I outlined above, how should I modify my code to get the desired output?
Hard to understand exactly what you want. But if you are just looking for numbers you can do this (and if it has a decimal, use float instead).
import re
codata = []
with open(r"filename.txt") as file:
    for line in file:
        for i in re.findall(r'\d+(?:\.\d+)?', line):
            try:
                codata.append(int(i))
            except ValueError:
                # int() rejects strings such as '80.38'; skip them
                continue
Here's a regex that will match and remove your current set of dates:
import re
p = r"(((january|february|march|april|may|june|july|august|september|october|november|december) +[\d, ]+)|" + \
r"((\d?\d\/){2}(\d\d){1,2})|" + r"((fiscal year|fy|calendar year|cy) *(\d\d){1,2}))"
codata = []
with open(r"filename.txt") as file:
    for line in file:
        codata.append(re.sub(p, "", line, flags=re.IGNORECASE))
print(codata)
Output (assuming input file is the same as your provided date list):
['\n', '\n', '\n', '\n', '\n', '\n', '\n', '\n']
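Removing the dates and extracting the numbers can be combined into one pass: strip the date-like spans first, then run the number regex from the question on what remains. The following is a minimal sketch, assuming dates only appear in the formats listed in the question; DATE_RE, NUMBER_RE and numbers_without_dates are hypothetical names, and the sample sentence is shortened from the question's text.

```python
import re

MONTH = (r"(?:january|february|march|april|may|june|july|august|"
         r"september|october|november|december)")

DATE_RE = re.compile(
    r"\b" + MONTH + r"\s+(?:\d{1,2},\s*)?\d{4}\b"                # August 31, 2018 / August 2018
    r"|\b\d{1,2}/\d{1,2}/(?:\d{4}|\d{2})\b"                      # 8/31/2018 / 8/31/18
    r"|\b(?:fy|cy|fiscal\s+year|calendar\s+year)\s*\d{4}\b",     # FY2018 / fiscal year 2018
    re.IGNORECASE,
)
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def numbers_without_dates(text):
    """Blank out date-like spans, then return the remaining numbers as strings."""
    return NUMBER_RE.findall(DATE_RE.sub(" ", text))


sample = ("consideration of $80.38 consisting of the cash portion of $20.25 "
          "and 0.275 shares based on a closing price of $218.67 per share "
          "on July 14, 2018.")
print(numbers_without_dates(sample))  # ['80.38', '20.25', '0.275', '218.67']
```

Dates are replaced with a space rather than an empty string so that the text on either side of a date cannot fuse into a new number.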
Judging by the text sample, I assume that every price starts with a $ sign. In that case you are probably looking for the following regex:
r"(?<=\$)\d+\.?\d*(?= )"
the result would be:
['80.38', '20.25', '218.67']
Or in case you want the $ sign in your list the regex would be:
r"\$\d+\.?\d*(?= )"
and the result in that case:
['$80.38', '$20.25', '$218.67']
To clarify, (?<=\$) means that the match must be preceded by the $ sign, but the $ sign is not included in the output. (?= ) means that the price must be followed by a space.

Replace word between two substrings (keeping other words)

I'm trying to replace a word (e.g. on) if it falls between two substrings (e.g. <temp> & </temp>) however other words are present which need to be kept.
string = "<temp>The sale happened on February 22nd</temp>"
The desired string after the replace would be:
Result = <temp>The sale happened {replace} February 22nd</temp>
I've tried using regex, but I've only been able to figure out how to replace everything lying between the two <temp> tags (because of the .*?).
result = re.sub('<temp>.*?</temp>', '{replace}', string, flags=re.DOTALL)
However, "on" may also appear later in the string, outside <temp></temp>, and I wouldn't want to replace it there.
re.sub(r'(<temp>.*?) on (.*?</temp>)', lambda x: x.group(1) + " {replace} " + x.group(2), string, flags=re.DOTALL)
Output:
<temp>The sale happened {replace} February 22nd</temp>
Edit:
Changed the regex based on suggestions by Wiktor and HolyDanna.
P.S: Wiktor's comment on the question provides a better solution.
Try lxml:
from lxml import etree
root = etree.fromstring("<temp>The sale happened on February 22nd</temp>")
root.text = root.text.replace(" on ", " {replace} ")
print(etree.tostring(root, pretty_print=True, encoding="unicode"))
Output:
<temp>The sale happened {replace} February 22nd</temp>
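If the word can occur several times inside a <temp> block, or also outside of one, a two-level substitution is another option: an outer re.sub finds each <temp>...</temp> span, and a callback replaces the word only within that span. This is a sketch rather than the approach of either answer above; replace_in_temp and its parameters are hypothetical names.

```python
import re

TEMP_RE = re.compile(r"<temp>.*?</temp>", re.DOTALL)


def replace_in_temp(text, word="on", replacement="{replace}"):
    """Replace whole-word occurrences of `word`, but only inside <temp> blocks."""
    word_re = re.compile(r"\b" + re.escape(word) + r"\b")
    # The inner lambda returns the replacement literally, with no escape processing.
    return TEMP_RE.sub(
        lambda m: word_re.sub(lambda _: replacement, m.group(0)),
        text,
    )


text = "Based on this: <temp>The sale happened on February 22nd</temp> and so on"
print(replace_in_temp(text))
# Based on this: <temp>The sale happened {replace} February 22nd</temp> and so on
```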

Output surrounded by [' and '] - How to stop?

I am pulling information down from an RSS feed. Due to further analysis, I don't particularly want to use the likes of Beautiful Soup or feedparser. The explanation is kind of out of scope for this question.
The output is generated wrapped in [' and ']. For example:
Title:
['The Morning Download: Apple Stumbles but Mobile Soars']
Published:
['Tue, 28 Jan 2014 13:09:04 GMT']
Why is this output like this? How do I stop this?
try:
    # This is the RSS feed that is being scraped
    page = 'http://finance.yahoo.com/rss/headline?s=aapl'
    yahooFeed = opener.open(page).read()
    items = re.findall(r'<item>(.*?)</item>', yahooFeed)
    for item in items:
        # Prints the title
        title = re.findall(r'<title>(.*?)</title>', item)
        print "Title:"
        print title
        # Prints the date / time published
        print "Published:"
        datetime = re.findall(r'<pubDate>(.*?)</pubDate>', item)
        print datetime
        print "\n"
except Exception, e:
    print str(e)
I am grateful for any criticism, advice, and best-practice information.
I'm a Java / Perl programmer so still getting used to Python, so any great resources you know of, are greatly appreciated.
Use re.search instead of re.findall; re.findall always returns a list of all matches.
datetime = re.search(r'<pubDate>(.*?)</pubDate>', item).group(1)
Note that the difference between re.findall and re.search is that the former returns a list (Python's array-like data structure) of all matches, while re.search returns only the first match found.
If there is no match, re.search returns None, so handle that case as well:
match = re.search(r'<pubDate>(.*?)</pubDate>', item)
if match is not None:
    datetime = match.group(1)
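The None check can be wrapped in a small helper so that each field is read with a single call. The following is a minimal Python 3 sketch; first_group is a hypothetical name, and the item string is a shortened stand-in for one <item> element of the feed.

```python
import re


def first_group(pattern, text, default=None):
    """Return group 1 of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else default


item = (
    "<title>The Morning Download: Apple Stumbles but Mobile Soars</title>"
    "<pubDate>Tue, 28 Jan 2014 13:09:04 GMT</pubDate>"
)

print(first_group(r"<title>(.*?)</title>", item))
# The Morning Download: Apple Stumbles but Mobile Soars
print(first_group(r"<pubDate>(.*?)</pubDate>", item))
# Tue, 28 Jan 2014 13:09:04 GMT
print(first_group(r"<author>(.*?)</author>", item, default="(unknown)"))
# (unknown)
```

Because the helper returns a plain string instead of a list, the printed values no longer carry the [' and '] wrapper.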
