I've been trying to figure out a simple way to run through a set of URLs that all lead to pages with the same layout. One issue we found is that the URLs in the original list are http but redirect to https; I'm not sure whether that causes a problem when pulling information from the page. I can see the structure of the page when I use the Inspector in Chrome, but when I try to set up the code to grab the relevant links I come up empty (literally). The most general code I have been using is:
from bs4 import BeautifulSoup, SoupStrainer
import urllib2

soup = BeautifulSoup(urllib2.urlopen('https://ngcproject.org/program/algirls').read())
links = SoupStrainer('a')
print links
which yields:
a|{}
Given that I'm new to this, I've been trying anything that I think might work. I also tried:
mail = soup.find(attrs={'class':'tc-connect-details_send-email'}).a['href']
and
spans = soup.find_all('span', {'class' : 'tc-connect-details_send-email'})
lines = [span.get_text() for span in spans]
print lines
but these don't yield anything either.
I'm assuming the issue is with my code and not that the data are hidden from scraping. Ideally I want the data written to a CSV file for each URL I scrape, but right now I need to confirm that the code is actually grabbing the right information. Any suggestions welcome!
If you press CTRL+U in Google Chrome (or right-click > View Source), you'll see that the page is rendered with JavaScript.
urllib2 only fetches the raw HTML, so it will never contain what you're looking for.
You'll have to use an automated browser: Selenium (the most popular option) driven by Google Chrome or Firefox, or a headless browser such as PhantomJS.
You can then pull the rendered page out of Selenium, store it, and manipulate it however you see fit.
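As an aside, the question's code never actually applies the SoupStrainer to anything; it just prints the strainer object, which is why `a|{}` appears. A minimal sketch of the intended usage on a static snippet (inline HTML here for illustration, since the real page only has its links after JavaScript runs, e.g. in Selenium's driver.page_source):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Inline stand-in for the rendered page source.
html = '<div><a href="/program/algirls">Program</a><a href="/contact">Contact</a></div>'

# parse_only tells BeautifulSoup to keep just the <a> tags while parsing
only_links = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_links)
hrefs = [a['href'] for a in soup.find_all('a')]
print(hrefs)  # ['/program/algirls', '/contact']
```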
Related
I asked a question yesterday and got an answer from @QHarr explaining that dynamic websites like Workday (take https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn for example) generate job post links through extra XHR requests. So if I want to extract specific job post links, normal page scraping with an HTML parser or CSS selectors and keywords is not feasible, and the links can't be extracted from the HTML source generated by the Selenium driver either. (Based on WeiZhang2017's GitHub post: https://gist.github.com/Weizhang2017/0029b2ff59e943ca9f024c117fbdf88a)
In my case, websites like Workday use Ajax to load data as needed, so I used Selenium to simulate scrolling down the page and load more data. However, as for getting the JSON response through Selenium, I searched a lot but couldn't find an answer that fits my need.
My idea for extracting specific job post links was, in general, a 3-step process:
Use Selenium to load and scroll down the website
Use a method similar to requests' .get().json() within Selenium to get the JSON response behind the scrolled-down page
Search through the JSON response data with my specific keywords to get the specific posts' links.
However, here come my questions.
Step 1: I did this with a loop to scroll down the pages I want. No problem.
scroll = 3
while scroll:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    scroll = scroll - 1
Step 2: I don't know what kind of method could work; I searched a lot and couldn't find an easy-to-understand answer. (I am new to Python and Selenium, with a limited understanding of scraping dynamic websites.)
Step 3: I think I could handle the search and get what I want (the specific job posts' links) once I have the JSON data (call it log) as shown in Chrome under Inspect > Network > Preview.
links = ['https://wd1.myworkdaysite.com' + x['title']['commonlink'] for x in log['body']['children'][0]['children'][0]['listItems'] if x['instance'][0]['text'] == mySpecificWords]
Appreciate any thoughts on solutions for step 2.
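For what it's worth, step 3 can be sketched on a mocked payload. The nesting and key names below are assumptions taken from the list comprehension in the question, not a verified Workday schema:

```python
# Mocked stand-in for the JSON response captured from the XHR request.
log = {
    'body': {'children': [
        {'children': [
            {'listItems': [
                {'title': {'commonlink': '/job/123'},
                 'instance': [{'text': 'Data Analyst'}]},
                {'title': {'commonlink': '/job/456'},
                 'instance': [{'text': 'Software Engineer'}]},
            ]}
        ]}
    ]}
}

my_specific_words = 'Data Analyst'
# Same filtering logic as the question's comprehension, run on the mock data.
links = ['https://wd1.myworkdaysite.com' + x['title']['commonlink']
         for x in log['body']['children'][0]['children'][0]['listItems']
         if x['instance'][0]['text'] == my_specific_words]
print(links)  # ['https://wd1.myworkdaysite.com/job/123']
```

Once step 2 delivers the real JSON (for example by replaying the XHR with requests), the same comprehension applies unchanged.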
I am trying to scrape some flashcards from this website, but I am having a few problems. Below a snippet of my code:
import requests
from bs4 import BeautifulSoup

# point to the right link and chapter
url_main = r'https://learninglink.oup.com/access/content/neuroscience-sixth-edition-student-resources/neuroscience-6e-chapter-1-flashcards?previousFilter=tag_chapter-'
chapter = '01'
url_main = url_main + chapter
# get source
html = requests.get(url_main).text
bs = BeautifulSoup(html, features="html.parser")
If I inspect the page on Chrome, I can see that the information I am looking for is in class="box1text". So I do:
# get class
text = bs.find(class_="box1text")
However, when I print this 'text' variable I get:
<span aria-live="assertive" class="box1text"></span>
And no mention of the text I am looking for. What am I doing wrong?
Also, I would like to know how to interact with this container and its buttons, but I don't know where to start. My ideal output would be a dictionary containing all the keywords and their associated answers (so, front and back of each flashcard), but to do that I need to be able to interact with this container. Any suggestions on how to do this?
Thanks in advance!
I know this doesn't answer your question, but there is actually a better way to go about this.
If you go into the DevTools sidebar of your browser and examine the network log,
you will see that there is an HTTP request being sent to fetch all the flashcard info.
All you need to do is mimic this HTTP request by copying the request headers and sending it.
Since I don't use Python, I would just use cURL from the Windows command prompt.
You can do this right now by right-clicking on that exact request while the browser is open, clicking 'Copy as cURL (cmd)', and pasting the result into the command prompt; you should get the required text in an easily readable form.
Edit:
Also, the site in your post doesn't require any additional parameters to be sent in the request, so you should be able to get away with just:
curl "https://learninglink.oup.com/protected/files/content/flashcardCsv/1512079199667-Neuroscience6e-ch01_flashcards.csv"
You can copy and paste this into cmd to verify for yourself.
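Back in Python, once you have that CSV (e.g. via requests.get(url).text), building the front/back dictionary the question asks for is a few lines with the stdlib csv module. The two-column layout and sample rows below are an assumption about the file's format, not the actual OUP data:

```python
import csv
import io

# Sample rows standing in for the downloaded flashcard CSV.
csv_text = 'front,back\nglia,"Support cells of the nervous system"\nneuron,"Electrically excitable cell"\n'

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
# One dict entry per flashcard: front of the card -> back of the card.
cards = {front: back for front, back in reader}
print(cards['glia'])  # Support cells of the nervous system
```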
I'm trying to scrape the price of this product
http://www.asos.com/au/fila/fila-vintage-plus-ringer-t-shirt-with-small-logo-in-green/prd/9065343?clr=green&SearchQuery=&cid=7616&gridcolumn=2&gridrow=1&gridsize=4&pge=1&pgesize=72&totalstyles=4699
with the following code, but it returns an empty array:
response.xpath('//*[@id="product-price"]/div/span[2]/text()').extract()
Any help is appreciated, Thanks.
Because the site is dynamic. (This is what I got when I used the view(response) command in the scrapy shell: as you can see, the price info doesn't come out.)
Solutions:
1. Splash
2. Selenium + PhantomJS
It might also help to check this answer: Empty List From Scrapy When Using Xpath to Extract Values
The price is later added by the browser which renders the page using javascript code found in the html. If you disable javascript in your browser, you would notice that the page would look a bit different. Also, take a look at the page source, usually that's unaltered, to see that the tag you're looking for doesn't exist (yet).
Scrapy doesn't execute any javascript code. It receives the plain html and that's what you have to work with.
If you want to extract data from pages that look the same as in the browser, I recommend using a headless browser renderer like Splash (if you're already using Scrapy): https://github.com/scrapinghub/splash
You can programmatically tell it to download your page, render it, and select the data points you're interested in.
The other way is to check for the request made to the Asos API which asks for the product data. In your case, for this product:
http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=9065343&currency=AUD&keyStoreDataversion=0ggz8b-4.1&store=AU
I got this url by taking a look at all the XMLHttpRequest (XHR) requests sent in the Network tab found in Developers Tools (on Google Chrome).
You can try to find JSON inside HTML (using regular expression) and parse it:
json_string = response.xpath('//script[contains(., "function (view) {")]/text()').re_first(r'view\(\'([^\']+)')
data = json.loads(json_string)
price = data["price"]["current"]
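A self-contained sketch of that regex-plus-json approach, run on an inline snippet shaped like a page that hands JSON to a JavaScript view function. The embedded payload is a stand-in, not ASOS's actual data:

```python
import json
import re

# Stand-in for response.text: JSON embedded as a string argument to view('...').
html = """<script>var p = function (view) { }; view('{"price": {"current": 24.0}}');</script>"""

# Capture everything between view(' and the next single quote, then parse it.
match = re.search(r"view\('([^']+)", html)
data = json.loads(match.group(1))
price = data['price']['current']
print(price)  # 24.0
```

This is fragile by nature: any change to the site's inline script breaks the regex, which is why the API-request route above is usually sturdier.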
Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all divs with a particular class. However, the result is always empty even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (by their class name) using BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested to get all the divs with class 'product-list-item'
Try using Selenium to run the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
Check the underlying request URL: you can get all the info by changing its parameters. This link can be found in Chrome DevTools > Network.
The reason you got nothing from that specific URL is simply that the info you need is not there.
So first let me explain a little about how that page is loaded in a browser: when you request the page (http://www.hm.com/sg/products/ladies), the literal content is returned in the very first phase (which is what you got from your urllib2 request). The browser then starts to read/parse that content, which basically tells it where to find everything it needs to render the whole page (e.g. CSS to control layout, additional JavaScript/URLs/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded. The info you want is not in the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all traffic when that page loads (I would recommend Fiddler).
As you can see, lots of things happen when you open that page in a browser! (And that's only part of the whole page-loading process.) By educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module can do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great tool for this kind of job, you can get it here.
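Once one of those api.hm.com responses is captured (with urllib2 or requests), the json module does the rest. The structure below is a made-up stand-in to illustrate the idea, not H&M's actual schema:

```python
import json

# Hypothetical response body from one of the api.hm.com XHR calls.
raw = '{"products": [{"name": "Knit jumper", "link": "/sg/product/1"}, {"name": "Denim jacket", "link": "/sg/product/2"}]}'

payload = json.loads(raw)
# Pull out the product links, which is what the question ultimately wants.
hrefs = [item['link'] for item in payload['products']]
print(hrefs)  # ['/sg/product/1', '/sg/product/2']
```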
Try this one:
from bs4 import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(),'lxml')
scrapdiv = open('scrapdiv.txt','w')
product_lists = soup.findAll("div",{"class":"o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()
I am pretty new to BeautifulSoup. I am trying to print image links from http://www.bing.com/images?q=owl:
import urllib2
from bs4 import BeautifulSoup

redditFile = urllib2.urlopen("http://www.bing.com/images?q=owl")
redditHtml = redditFile.read()
redditFile.close()
soup = BeautifulSoup(redditHtml)
productDivs = soup.findAll('div', attrs={'class' : 'dg_u'})
for div in productDivs:
    print div.find('a')['t1']    # works fine
    print div.find('img')['src'] # raises KeyError: 'src'
But this gives only title, not the image source
Is there anything wrong?
Edit:
I have edited my source, still could not get image url.
Bing is using some techniques to block automated scrapers. I tried to print
div.find('img')
and found that they are sending the source in an attribute named src2, so the following should work:
div.find('img')['src2']
This is working for me. Hope it helps.
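Since Bing may serve either attribute depending on the request, a defensive lookup that falls back from src2 to src avoids the KeyError entirely. A sketch on inline HTML standing in for the real results markup:

```python
from bs4 import BeautifulSoup

# Inline stand-ins for the div.dg_u results: one lazy-loaded (src2), one not.
html = ('<div class="dg_u"><img src2="http://example.com/owl1.jpg"/></div>'
        '<div class="dg_u"><img src="http://example.com/owl2.jpg"/></div>')

soup = BeautifulSoup(html, 'html.parser')
srcs = []
for div in soup.find_all('div', attrs={'class': 'dg_u'}):
    img = div.find('img')
    # .get() returns None instead of raising KeyError on a missing attribute
    srcs.append(img.get('src2') or img.get('src'))
print(srcs)  # ['http://example.com/owl1.jpg', 'http://example.com/owl2.jpg']
```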
If you open up browser develop tools, you'll see that there is an additional async XHR request issued to the http://www.bing.com/images/async endpoint which contains the image search results.
Which leads to the 3 main options you have:
simulate that XHR request in your code. You might want to use something more human-friendly than urllib2; see the requests module. This is the so-called "low-level" approach, going down to bare metal and site-specific implementation details, which makes this option unreliable, difficult, "heavy", error-prone and fragile
automate a real browser using selenium - stay on the high-level. In other words, you don't care how the results are retrieved, what requests are made, what javascript needs to be executed. You just wait for search results to appear and extract them.
use Bing Search API (this should probably be option #1)
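For the first option, simulating the XHR boils down to requesting the same endpoint with the right query string. A stdlib-only sketch of building that URL; the parameter names (q, first, count) are my guess at what the XHR sends, so verify them against the real request in the Network tab:

```python
from urllib.parse import urlencode

# Assumed query parameters -- check the actual XHR in DevTools before relying on these.
params = {'q': 'owl', 'first': 0, 'count': 35}
url = 'http://www.bing.com/images/async?' + urlencode(params)
print(url)  # http://www.bing.com/images/async?q=owl&first=0&count=35
```

From there, fetching `url` and parsing the returned fragment follows the same BeautifulSoup pattern as the original code.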