How to eliminate certain elements when scraping? - python

So, I am not sure how to proceed here. Here is an example of the page that I am trying to scrape:
http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722
Now I have the XPath selecting the 'article' div class and then the subsequent <p>'s. I can always eliminate the first one, because it is the same stock news text (city, Yonhapnews, reporter, etc.). I am evaluating word densities, so this could be a problem for me :(
The issue comes in towards the end of the article. If you look towards the end, there is a reporter email address and a date and time of publishing...
The problem is that different pages of this site have different numbers of <p> tags towards the end, so I cannot just eliminate the last two; that still messes with my results sometimes.
How would you go about eliminating those certain <p> elements towards the end? Do I just have to scrub my data afterwards?
Here is the code snippet that selects the path and eliminates the first <p> and the last two. How should I change it?
# gets all the text from the listed div and then applies the regex to find all word objects in the Hangul range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(ur'[\uac00-\ud7af]+')
# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]

You can tweak your XPath expression not to include the p tag having class="adrs" (the date of publishing):
//*[@class="article"]/p[not(contains(@class, "adrs"))]//text()

Adding to alecxe's answer, you could exclude the p containing the email address using something that checks for an email address (possibly surrounded by whitespace). How to do that depends on whether you have XPath 2.0 or just 1.0. In 2.0 you could do something like:
//*[@class="article"]/p[not(contains(@class, "adrs")
    or text()[matches(normalize-space(.),
    "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", "i")])]//text()
adapting the regex for email addresses from http://www.regular-expressions.info/email.html. You could change the \.[A-Z]{2,4} to \.kr if you like.
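If you are doing this in scrapy, note that response.xpath() only supports XPath 1.0, so matches() will not be available. A minimal sketch of a fallback, post-filtering the extracted paragraphs in Python instead (the email pattern is the one above; the rest mirrors the question's code):
import re

# Sketch: filter paragraphs in Python since scrapy's XPath is 1.0-only.
email_re = re.compile(ur'^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$', re.IGNORECASE)
hangul_syllables = []
paragraphs = response.xpath('//*[@class="article"]/p[not(contains(@class, "adrs"))]')
for i, p in enumerate(paragraphs):
    text = u''.join(p.xpath('.//text()').extract()).strip()
    if i == 0 or email_re.match(text):
        continue  # skip the stock intro paragraph and the reporter email line
    hangul_syllables.extend(re.findall(ur'[\uac00-\ud7af]+', text))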

Related

Find_all function returning multiple strings that are separated but not delimited by a character

Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z which would allow me to pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page is separated by two indexing methods: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). These two tags have the class="pagination" identifier, but when I scrape based on that criterion, I get two separate strings that are not a delimited list separated by a comma.
Does anyone know how to get the strings as a list? Or individually? I really only care about the second.
from bs4 import BeautifulSoup
import requests
# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')
for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to form a list. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string and the word (current) might be in a different place in the first list.
THANK YOU!!!!
Edit:
Cleaned old, commented-out code from the source and realized I did not show the results of trying to call the second element, despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File, "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or something else that actually does something with the value...
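For the record, a cleaner way to get the page numbers as a real list is to iterate the strings inside the pagination <ul> instead of splitting its text. A minimal sketch, assuming (as on the page above) that the second 'pagination' list is the page-number one:
from bs4 import BeautifulSoup
import requests

source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
soup = BeautifulSoup(requests.get(source).text, 'lxml')

# .stripped_strings yields each text fragment separately, so every
# letter or page number becomes its own list element.
paginations = soup.find_all('ul', class_='pagination')
page_numbers = list(paginations[1].stripped_strings)  # assumes the second <ul> holds the page index
print(page_numbers)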

Scraping text from specific HTML elements, from URL list, combined with Regex, and then concatenating them in a .csv

I need to scrape all li elements containing the word GAP, regardless of case and signs - I need to catch 'Gap', 'gap-', '-gap', 'GAP', etc. I need the whole text inside those li elements, but they have some extra tags inside, which proved to be problematic.
Here is an instance of one of the elements I need to scrape, inside one of the pages:
<li>
The GAP Group,
<span class="MathTeX">$GAP$</span>
<script type="math/tex">GAP</script>
" groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org. "
</li>
Here is how the text looks on the page - and this is the full text I need to scrape, but only parts of it come out because of the extra tags inside the li element:
"The GAP Group, $GAP$ groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org."
Here is what I tried:
import requests
from lxml import html

url_lst = ['some URLs']
for page3 in url_lst:
    page3 = requests.get(page3)
    tree3 = html.fromstring(page3.content)
    targets3 = tree3.xpath('//li[contains(text(), "gap")]')
    for target in targets3:
        print(target.text)
Here is the result displayed:
The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.
The GAP Group, (2008). (http://www.gap-system.org).
The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.
It catches only some of them: only those containing exactly the word I put in, 'gap', and it only partially catches the ones that have extra tags inside the li element; it catches the part up to the first extra tag, as shown at the top.
I know I need to add regex in order to ignore case, but I don't know how - this is the first problem. The bigger problem is catching the whole text in the li element, including the text in the extra tags. Finally, I need to concatenate all results in a .csv file, which I can then load in pandas and continue analysing. I do not mind if the solution uses lxml, XPath, or selenium + bs4, as long as it works fine. Thank you for the help, much appreciated.
The text() filter will only match on the first text() node - in your example, everything before <span class="MathTeX">$GAP$</span>.
To filter on all of the element's content, use . (the string value of the context node) instead.
And since XPath 1.0 does not have any regexp functions, you could use this:
targets3 = tree3.xpath('//li[contains(., "gap") or contains(.,"GAP")]')
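To also catch mixed case like 'Gap', XPath 1.0's translate() can lower-case the three letters before matching, and lxml's text_content() returns an element's full text including what is inside nested tags. A sketch combining both:
# translate() lower-cases G, A and P so one contains() covers all casings;
# text_content() returns the li's full text, nested tags included.
targets3 = tree3.xpath('//li[contains(translate(., "GAP", "gap"), "gap")]')
for target in targets3:
    print(target.text_content())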

Matching the last two HTML 'class' attributes encountered before a variable in Python regex

Regex is definitely giving me a headache. Every time I move one step ahead, I have a feeling that I am stepping two steps back!
I am trying to extract the class attribute of the last tag before the one containing any first name.
I randomly found that website, which I thought would be a good example to practice on. I am trying to write a general rule, nothing specifically applied to that website!
The only assumption is that I know what the first name is and that it is contained in a tag (div, span, h1, ...) with a certain class.
Here are my regex attempts:
re.findall(r'(?:class="(.+)".+){2}.*' + val, source) #'source' is the source code of the page
re.findall(r'(?:class="(.+)".*class=)+' + val, source) #'val' a name that I know is in the page
Any explanations on what is wrong or on what to do to succeed in my task would be highly appreciated !
Thanks a lot and stay safe.
Here is the solution that I found. Assuming you know a keyword, you want to retrieve the text of the preceding elements.
First find the class of the tag containing your keyword:
elt = driver.find_element_by_xpath("//*[contains(text(),'{}')]".format(keyword))
keyword_class = elt.get_attribute('class')
Next you can find the parent or preceding siblings using XPath.
# Find the class of the first name's preceding siblings and access their text
xpath = "//*[@class='{}']//preceding-sibling::*".format(keyword_class)
pre_siblings = driver.find_elements_by_xpath(xpath)
for sibling in pre_siblings:
    print(sibling.text)
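If what you actually need is the class attribute of the last tag before the name, as in the original question, the closest preceding sibling is simply the last element of that list. A small sketch, assuming pre_siblings is non-empty:
# find_elements returns nodes in document order, so the last list item
# is the sibling closest to the element holding the first name.
if pre_siblings:
    print(pre_siblings[-1].get_attribute('class'))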

How to get the job description using scrapy?

I'm new to scrapy and XPath but have been programming in Python for some time. I would like to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ using scrapy. As you see, the email and phone are provided as text inside the <p> tag, which makes them hard to extract.
My idea is to first get the text inside the Job Overview, or at least all the text about this particular job, and use a regex to get the email, the phone number and, if possible, the name of the person.
So, I fired up the scrapy shell using the command: scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and get the response from there.
Now, I try to get all the text from the div job_description, where I actually get nothing. I used
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
It returns [u'\t\t\t\n\t\t ']
How do I get all the text from the page mentioned ? Obviously, the task will come afterwards to get the attributes mentioned before, but, first things first.
Update: This selection only returns [] response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()
You were close with
full_des = response.xpath('//div[@class="job_description"]/text()').extract()
The div-tag actually does not have any text besides what you get.
<div class="job_description" (...)>
"This is the text you are getting"
<p>"This is the text you want"</p>
</div>
As you see, the text you are getting with response.xpath('//div[@class="job_description"]/text()').extract() is the text directly inside the div-tag, not the text inside its child tags. For that you would need:
response.xpath('//div[@class="job_description"]//*/text()').extract()
What this does is select all the child nodes of div[@class="job_description"] and return their text (see here for what the different XPath expressions do).
You will see that this returns a lot of useless text as well, as you are still getting all the \n and such. I suggest you narrow your XPath down to the element you want, instead of taking a broad approach.
For example, the entire job description would be in
response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
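From there, the follow-up step the question mentions, pulling out the email and phone number, can be done with plain re over the joined text. A rough sketch; both patterns are illustrative assumptions, not tuned to this page:
import re

# Join all text fragments, then fish out email- and phone-shaped substrings.
text = u' '.join(response.xpath('//div[@class="job_description"]//text()').extract())
emails = re.findall(r'[\w.+-]+@[\w-]+\.[\w.-]+', text)   # crude email pattern
phones = re.findall(r'\+?\d[\d /().-]{7,}\d', text)      # crude phone pattern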

Python Using wildcard inside of strings

I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However, I am receiving a logical error that I cannot figure out. Essentially I want to take the top 100 movies and write the data to a csv file.
I am currently using html from this site for testing (Other years are the same): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm
There's a lot of code however this is the main part that I am struggling with. The code block looks like this:
def grab_yearly_data(self,page,year):
    # page is the url that was downloaded, year in this case is 2014.
    rank_pattern=r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern=r'(.htm">[A-Z])*?</a></font></b></td>'
    #mov_title_pattern=r'.htm">*?</a></font></b></td>' # Testing
    self.rank= [g for g in re.findall(rank_pattern,page)]
    self.mov_title=[g for g in re.findall(mov_title_pattern,page)]
self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list with 102 elements containing the movie titles. Instead, I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong; I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times, and I either receive nothing or 102 empty strings. Please help, I really want to move forward with my project.
Just don't attempt to parse HTML with regex - it will save you time and, most importantly, hair, and will make your life easier.
Here is a solution using BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
import requests
url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content)
for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:
        continue
    rank = cells[0].text
    title = cells[1].text
    print rank, title
Prints:
1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below
The expression inside the select() call is a CSS selector - a convenient and powerful way of locating elements. But since the elements on this particular page are not conveniently mapped with ids or marked with classes, we have to rely on attributes like colspan or border. The [1:-3] slice is there to eliminate the header and total rows.
For this page, to get to the table you can rely on the chart element and get its next table sibling:
for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
...
mov_title_pattern=r'.htm">([A-Za-z0-9 ]*)</a></font></b></td>'
Try this. It should work for your case. See the demo:
https://www.regex101.com/r/fG5pZ8/6
Your regex does not make much sense. It matches .htm">[A-Z] as few times as possible, which is usually zero, yielding an empty string.
Moreover, with a very general regular expression like that, there is no guarantee that it only matches on the result rows. The generated page contains a lot of other places where you could expect to find .htm"> followed by something.
More generally, I would advocate an approach where you craft a regular expression which precisely identifies each generated result row, and extracts from that all the values you want. In other words, try something like
re.findall('stuff (rank) stuff (title) stuff stuff stuff')
(where I have left it as an exercise to devise a precise regular expression with proper HTML fragments where I have the stuff placeholders)
and extract both the "rank" group and the "title" group out of each matched row.
Granted, scraping is always brittle business. If you make your regex really tight, chances are it will stop working if the site changes some details in its layout. If you make it too relaxed, it will sometimes return the wrong things.
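To make that concrete, here is a sketch of such a tighter pattern, stitched together from the two HTML fragments already quoted in the question and the other answer; treat it as an illustration of the approach rather than something verified against the live page:
# rank cell followed (lazily) by the title link, matched as one row
row_pattern = (r'<td align="center"><font size="2">([0-9,]+)</font>'
               r'.*?\.htm">([^<]+)</a>')
for rank, title in re.findall(row_pattern, page, re.DOTALL):
    print rank, title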
