Python Using wildcard inside of strings - python

I am trying to scrape data from boxofficemojo.com and I have everything set up correctly. However, I am receiving a logical error that I cannot figure out. Essentially I want to take the top 100 movies and write the data to a CSV file.
I am currently using html from this site for testing (Other years are the same): http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm
There's a lot of code, but this is the main part that I am struggling with. The code block looks like this:
def grab_yearly_data(self, page, year):
    # page is the url that was downloaded, year in this case is 2014.
    rank_pattern = r'<td align="center"><font size="2">([0-9,]*?)</font>'
    mov_title_pattern = r'(.htm">[A-Z])*?</a></font></b></td>'
    # mov_title_pattern = r'.htm">*?</a></font></b></td>'  # Testing
    self.rank = [g for g in re.findall(rank_pattern, page)]
    self.mov_title = [g for g in re.findall(mov_title_pattern, page)]
self.rank works perfectly. However, self.mov_title does not store the data correctly. I am supposed to receive a list with 102 elements containing the movie titles; instead I receive 102 empty strings: ''. The rest of the program will be pretty simple once I figure out what I am doing wrong, I just can't find the answer to my question online. I've tried changing mov_title_pattern plenty of times and I either receive nothing or 102 empty strings. Please help, I really want to move forward with my project.

Just don't attempt to parse HTML with regex - it will save you time and, most importantly, hair, and it will make your life easier.
Here is a solution using BeautifulSoup HTML parser:
from bs4 import BeautifulSoup
import requests
url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content)
for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
    cells = row.find_all('td')
    if len(cells) < 2:
        continue
    rank = cells[0].text
    title = cells[1].text
    print rank, title
Prints:
1 Guardians of the Galaxy
2 The Hunger Games: Mockingjay - Part 1
3 Captain America: The Winter Soldier
4 The LEGO Movie
...
98 Transcendence
99 The Theory of Everything
100 As Above/So Below
The expression inside the select() call is a CSS selector - a convenient and powerful way of locating elements. But since the elements on this particular page are not conveniently mapped with ids or marked with classes, we have to rely on attributes like colspan or border. The [1:-3] slice is there to eliminate the header and total rows.
For this page, you can also get to the table by relying on the chart_container element and getting its next table sibling:
for row in soup.find('div', id='chart_container').find_next_sibling('table').find_all('tr')[1:-3]:
...
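Since the end goal in the question is a CSV of the top 100 movies, here is a minimal sketch combining the selector above with the csv module; the output file name and the explicit html.parser argument are just example choices:
import csv
import requests
from bs4 import BeautifulSoup

url = 'http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

with open('top_movies_2014.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['rank', 'title'])
    # same selector and slice as above: skip the header row and the trailing total rows
    for row in soup.select('div#body td[colspan="3"] > table[border="0"] tr')[1:-3]:
        cells = row.find_all('td')
        if len(cells) < 2:
            continue
        writer.writerow([cells[0].text, cells[1].text])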

mov_title_pattern=r'.htm">([A-Za-z0-9 ]*)</a></font></b></td>'
Try this. It should work for your case. See the demo:
https://www.regex101.com/r/fG5pZ8/6
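For reference, a quick check of that pattern against a fragment shaped like the rows described in the question (the fragment itself is made up for illustration):
import re

sample = '<td><b><font size="2"><a href="/movies/?id=guardians.htm">Guardians of the Galaxy</a></font></b></td>'
mov_title_pattern = r'.htm">([A-Za-z0-9 ]*)</a></font></b></td>'
print(re.findall(mov_title_pattern, sample))  # ['Guardians of the Galaxy']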

Your regex does not make much sense. It matches .htm">[A-Z] as few times as possible, which is usually zero, yielding an empty string.
Moreover, with a very general regular expression like that, there is no guarantee that it only matches on the result rows. The generated page contains a lot of other places where you could expect to find .htm"> followed by something.
More generally, I would advocate an approach where you craft a regular expression which precisely identifies each generated result row, and extracts from that all the values you want. In other words, try something like
re.findall('stuff (rank) stuff (title) stuff stuff stuff')
(where I have left it as an exercise to devise a precise regular expression with proper HTML fragments where I have the stuff placeholders)
and extract both the "rank" group and the "title" group out of each matched row.
Granted, scraping is always brittle business. If you make your regex really tight, chances are it will stop working if the site changes some details in its layout. If you make it too relaxed, it will sometimes return the wrong things.
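For what it's worth, a rough sketch of that row-level approach, stitched together from the HTML fragments shown in the question; the real markup may differ, so treat the pattern as illustrative rather than exact:
import re
import requests

page = requests.get('http://boxofficemojo.com/yearly/chart/?yr=2014&p=.htm').text
# one pattern anchored on both the rank cell and the title cell
row_pattern = (r'<td align="center"><font size="2">([0-9,]+)</font>'
               r'.*?\.htm">([^<]+)</a></font></b></td>')
for rank, title in re.findall(row_pattern, page, flags=re.DOTALL):
    print(rank, title)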

Related

Find_all function returning multiple strings that are separated but not delimited by a character

Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z which would allow me to pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page is separated by 2 indexing methods - one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). These two tags have the class="pagination" identifier, but when I scrape based on that criterion I get two separate strings that are not a delimited list separated by a comma.
Does anyone know how to get the strings as a list? or individually? I really only care about the second.
from bs4 import BeautifulSoup
import requests
# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')
for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to form a list. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string and the word (current) might be in a different place in the first list.
THANK YOU!!!!
Edit:
I cleaned the old, commented-out code from the source and realized I did not show the result of trying to call the second element, despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File, "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or another term to actually do something with the value...
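For anyone landing here later, a minimal sketch of one way to get the page numbers individually: grab both pagination blocks and index into the second one. The class name comes from the question; the rest is an assumption about the page layout:
from bs4 import BeautifulSoup
import requests

source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')

pagination_blocks = soup.find_all('ul', class_='pagination')
if len(pagination_blocks) >= 2:
    # each page number sits in its own <li>, so collect them one by one
    page_numbers = [li.get_text(strip=True) for li in pagination_blocks[1].find_all('li')]
    print(page_numbers)  # something like ['«', '1', '2', '3', '»']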

Scraping text from specific HTML elements, from URL list, combined with Regex, and then concatenating them in a .csv

I need to scrape all li elements containing the word GAP, regardless of case and signs - I need to catch 'Gap' 'gap-' '-gap' 'GAP', etc. I need the whole text inside of those li elements and they have some extra tags inside, which proved to be problematic.
Here is an instance of one of the elements I need to scrape, inside one of the pages:
<li>
The GAP Group,
<span class="MathTeX">$GAP$</span>
<script type="math/tex">GAP</script>
" groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org. "
</li>
Here is how the text looks on the page - and this is the full text I need to scrape, but only parts of it come out because of the extra tags inside the li element:
"The GAP Group, $GAP$ groups, algorithms, and programming, version 4.4.12 (2008), http://www.gap-system.org."
Here is what I tried:
import requests
from lxml import html

url_lst = ['some URLs']

for page3 in url_lst:
    page3 = requests.get(page3)
    tree3 = html.fromstring(page3.content)
    targets3 = tree3.xpath('//li[contains(text(), "gap")]')
    for target in targets3:
        print(target.text)
Here is the result displayed:
The GAP Group, GAP – groups, algorithms and programming, version 4.10, Available from http://www.gap-system.org, 2018.
The GAP Group, (2008). (http://www.gap-system.org).
The GAP Group, 2019. GAP – Groups, Algorithms, and Programming, Version 4.10.1; https://www.gap-system.org.
It catches only some of them - only those containing exactly the word I put, 'gap' - and it only partially catches those which have extra tags inside the li element: it stops at the first extra tag, as I showed at the top.
I know I need to add Regex in order to ignore case but I don't know how - this is the first problem. And the bigger problem is to catch the whole text in the li element including the text in the extra tags. And finally I need to concatenate all results in a .csv file which I can then load in pandas and continue analysing. I do not mind if the solution uses lxml, xpath or selenium + bs4 as long as it works fine. Thank you for the help, much appreciated.
The text() filter will only filter on the first text() node - in your example, everything before <span class="MathTeX">$GAP$</span>.
To filter on all of the content, use . (the context node) instead.
And since XPath 1.0 does not have any regexp functions, you could use this:
targets3 = tree3.xpath('//li[contains(., "gap") or contains(.,"GAP")]')
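Putting both pieces together, here is a minimal sketch (sticking with lxml and requests as in the question) that handles the case-insensitivity with XPath 1.0's translate(), grabs the full text including nested tags via text_content(), and writes everything to a CSV; the URL list and output file name are placeholders:
import csv
import requests
from lxml import html

url_lst = ['some URLs']  # placeholder, as in the question

rows = []
for url in url_lst:
    page = requests.get(url)
    tree = html.fromstring(page.content)
    # translate() lower-cases G, A and P in the element's string value,
    # so "GAP", "Gap", "gap-", "-gap" etc. all match
    targets = tree.xpath('//li[contains(translate(., "GAP", "gap"), "gap")]')
    for li in targets:
        # text_content() concatenates the text of the <li> and everything nested inside it
        rows.append([li.text_content().strip()])

with open('gap_references.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)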

Unable to parse some information from a webpage using search keywords

I've created a script to parse some information related to some songs from a website. When I try with this link or this one, I get my script working flawlessly. What I could understand is that when I append my search keyword after this portion https://www.billboard.com/music/, I get the desired page with the information.
However, things go wrong when I try with these keywords 1 Of The Girls or Al B. Sure! or Ashford & Simpson and so on.
I can't figure out how to append the above keywords after the base link https://www.billboard.com/music/ to locate the pages with information.
Script I've tried with:
import requests
from bs4 import BeautifulSoup
LINK = "https://www.billboard.com/music/Adele"
res = requests.get(LINK)
soup = BeautifulSoup(res.text,"lxml")
scores = [item.text for item in soup.select("[class$='-history__stats'] > p > span")]
print(scores)
Result I'm getting (as expected):
['4 No. 1 Hits', '6 Top 10 Hits', '13 Songs']
The information I'm after is located on that page just after the chart history.
How can I fetch some information from a webpage using critical search keywords?
I don't know all the use cases, but the obvious pattern I have seen for the cases mentioned is that special characters are stripped out (without leaving whitespace in their place), words are lower-cased, and spaces are then replaced with "-". The tricky bit may be the definition and handling of special characters.
e.g.
https://www.billboard.com/music/ashford-simpson
https://www.billboard.com/music/al-b-sure
https://www.billboard.com/music/1-of-the-girls
You could start with writing something to perform those string manipulations and then test the response code. Perhaps see if there is any form of validation in js files.
EDIT:
Multiple blanks between words become a single blank before being replaced with "-"?
Answer developed with @Mithu for preparing search terms:
import re

keywords = ["Y?N-Vee", "Ashford & Simpson", "Al B. Sure!", "1 Of The Girls"]
spec_char = ["!", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", ".", "/", ":", ";",
             "<", "=", ">", "?", "@", "[", "]", "^", "_", "`", "{", "|", "}", "~", '"', "\\"]

for elem in keywords:
    refined_keywords = re.sub('-+', '-', ''.join(i.replace(" ", "-") for i in elem.lower() if i not in spec_char))
    print(refined_keywords)
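And a small sketch of the "test the response code" idea from earlier, reusing the same character rules; the slug rules are an observed pattern rather than a documented API, so a non-200 status just means the guess needs adjusting:
import re
import requests

spec_char = "!#$%&'()*+,./:;<=>?@[]^_`{|}~\"\\"

def make_slug(keyword):
    # strip special characters, lower-case, collapse spaces/hyphens into single "-"
    cleaned = ''.join(c for c in keyword.lower() if c not in spec_char)
    return re.sub('-+', '-', cleaned.replace(' ', '-'))

for keyword in ["Ashford & Simpson", "Al B. Sure!", "1 Of The Girls"]:
    url = "https://www.billboard.com/music/" + make_slug(keyword)
    print(url, requests.get(url).status_code)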

How to eliminate certain elements when scraping?

SO I am not sure how to proceed here. I have an example of the page that I am trying to scrape:
http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722
Now I have the XPath selecting the 'article' div class and then the subsequent <p>'s. I can always eliminate the first one because it is the same stock news text (city, Yonhap News, reporter, etc.). I am evaluating word densities, so this could be a problem for me :(
The issue comes in towards the end of the article. If you look towards the end there is a reporter email address and a date and time of publishing...
The problem is that on different pages of this site, there are different numbers of <p> tags towards the end so I cannot just eliminate the last two because it still messes with my results sometimes.
How would you go about eliminating those certain <p> elements towards the end? Do I just have to scrub my data afterwards?
Here is the code snippet that selects the path and eliminates the first <p> and the last two. How should I change it?
# gets all the text from the listed div and then applies the regex to find all word objects in the hangul range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(ur'[\uac00-\ud7af]+')

# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]
You can tweak your XPath expression not to include the p tag having class="adrs" (the date of publishing):
//*[@class="article"]/p[not(contains(@class, "adrs"))]//text()
Adding to alecxe's answer, you could exclude the p containing the email address using something that checks for an email address (possibly surrounded by whitespace). How to do that depends on whether you have XPath 2.0 or just 1.0. In 2.0 you could do something like:
//*[@class="article"]/p[not(contains(@class, "adrs")
    or text()[matches(normalize-space(.),
        "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", "i")])]//text()
adapting the regex for email addresses from http://www.regular-expressions.info/email.html. You could change the \.[A-Z]{2,4} to \.kr if you like.
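Since Scrapy/lxml only supports XPath 1.0, which has no matches(), a hedged alternative is to keep the class-based exclusion in XPath and drop the e-mail paragraph in Python afterwards, inside the same Scrapy callback as the snippet in the question:
import re

# the common e-mail approximation from regular-expressions.info, not an exact validator
email_re = re.compile(r'^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$', re.I)

paragraphs = response.xpath('//*[@class="article"]/p[not(contains(@class, "adrs"))]')
texts = []
for p in paragraphs:
    text = ' '.join(p.xpath('.//text()').extract()).strip()
    if email_re.match(text):
        continue  # skip the paragraph that is just the reporter's e-mail address
    texts.append(text)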

XPATH for Scrapy

So I am using Scrapy to scrape the books off of a website.
I have the crawler working and it crawls fine, but when it comes to cleaning the HTML using select with XPath, it is kinda not working out right. Since it is a book website, I have almost 131 books on each page and their XPaths come out like this
For example getting the title of the books -
1st Book --- > /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span
3rd book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span
The li[] index increases with each book. I am not sure how to get this into a loop so that it catches all the titles. I have to do this for images and author names too, but I think it will be similar. I just need to get this initial one done.
Thanks for your help in advance.
There are different ways to get this.
The best way to select multiple nodes is to select on the basis of ids or classes,
e.g:
sel.xpath("//div[@id='id']")
You can select like this (note that XPath positions start at 1, not 0):
for i in range(1, upto_num_of_divs + 1):
    books = sel.xpath("//div[%s]" % i)
Or you can select the whole range of positions in a single expression:
divs = sel.xpath("//div[position() >= 1 and position() < %s]" % upto_num_of_divs)
Here is an example of how you can parse your example html:
lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    book_el = li.select('a/span/text()')
Often enough you can do something like //div[@class="final-price"]//span to get the list of all the spans in one xpath. The exact expression depends on your html, this is just to give you an idea.
Otherwise the code above should do the trick.
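As a rough sketch of extending that loop to the other fields mentioned in the question (images and author names), note that the relative paths below for the image and the author are assumptions and will need adjusting to the site's actual markup:
lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    title = li.select('a/span/text()').extract()
    image_url = li.select('.//img/@src').extract()                    # assumed location of the cover image
    author = li.select('.//span[@class="author"]/text()').extract()   # assumed class name for the author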
