I'm trying to parse public Facebook pages using BeautifulSoup. I've managed to successfully scrape LinkedIn, but I've spent hours trying to get it to work on Facebook with no luck. The code I'm trying to use looks like this:
import urllib2
from bs4 import BeautifulSoup

for url in my_urls:
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, "html.parser")
        info = soup.find_all("div", class_="fsl fwb fcb")
        # find_all returns a list of divs, so search each one for links
        info2 = [div.find_all('a') for div in info]
    except urllib2.URLError:
        continue
The part that's frustrating me is that I can get the title element out, and I can even get pretty far down the document, but I can't get to the part I actually need.
This line successfully grabs the pageTitle:
info = soup.find_all("title", attrs={"id": "pageTitle"})
This line can get pretty far down the list of elements, but can't go any farther.
info = soup.find_all(id="pagelet_timeline_main_column")
Here's a sample page that I'm trying to parse; I want the current city from it:
https://www.facebook.com/100004210542493
and here's a quick screenshot of what the part I want looks like:
http://prntscr.com/1t8xx6
I feel like I'm really close, but I just can't figure it out. Thanks in advance for any help!
EDIT 2: I should also mention that I can successfully print the whole soup and visually find the part I need, but for whatever reason the parsing just won't work the way it should.
Try looking at the content returned by using curl or wget. What you are seeing in the browser is what has been rendered after the JavaScript has been executed.
wget https://www.facebook.com/100004210542493
You might want to use mechanize or Selenium, since you want to simulate a client browser (instead of handling the raw content).
Another related issue might be that Beautiful Soup cannot find an element by its CSS class if the element has other classes too; see the sketch below for a workaround.
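A minimal sketch of that multi-class issue and a CSS-selector workaround (the class names are simply the ones from the question; Facebook's real markup may differ):

from bs4 import BeautifulSoup

# Toy markup reusing the classes from the question; the real page differs.
html = '<div class="fsl fwb fcb"><a href="#">Current City</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Matching on a single class works even when the element has other classes:
divs = soup.find_all("div", class_="fsl")

# To require all three classes at once, a CSS selector is more reliable
# than an exact match on the full class string:
divs = soup.select("div.fsl.fwb.fcb")

for div in divs:
    for a in div.find_all("a"):
        print(a.get_text())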
I was trying to extract the reviews from booking.com
URL = "https://www.booking.com/hotel/ph/oyo-518-mytown-amsterdam-manila.en-gb.html?label=gen173nr-1DCAEoggI46AdIM1gEaLQBiAEBmAEJuAEHyAEN2AED6AEBiAIBqAIDuALwtZuKBsACAdICJDZiZTlhNTRlLTc5MzktNGIwNC05NmZjLWI4Nzg3NGZhNWQ1MtgCBOACAQ;all_sr_blocks=601616502_275784337_2_0_0;checkin=2021-10-01;checkout=2021-10-02;dest_id=-2437894;dest_type=city;dist=0;group_adults=2;group_children=0;hapos=1;highlighted_blocks=601616502_275784337_2_0_0;hpos=1;no_rooms=1;room1=A%2CA;sb_price_type=total;sr_order=popularity;sr_pri_blocks=601616502_275784337_2_0_0__139646;srepoch=1632033539;srpvid=af482ec1c5c20263;type=total;ucfs=1&#tab-reviews"
and I'm using an XPath which I copied verbatim from my browser. This code gives me nothing.
import requests
import lxml.html
tree = lxml.html.fromstring(requests.get(URL).content)
tree.xpath('//*[@id="review_list_page_container"]/ul/li[1]/div/div[2]/div[2]/div[2]/div/div[1]/p/span[3]')
output: []
Any help would be appreciated. Thank you.
This website is loaded dynamically. Therefore you cannot extract data from this website with lxml.
You will need something like Selenium to extract data from such websites, since it is able to run the JavaScript that loads the actual content you are trying to scrape. Here is a tutorial to get you started.
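For example, a rough Selenium sketch might look like this (the CSS selector is only an assumption based on the XPath you copied, so it may need adjusting, and you need a matching driver such as chromedriver installed):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get(URL)              # the booking.com URL from the question
driver.implicitly_wait(10)   # give the JavaScript time to load the reviews

# Once the page is rendered, the browser-copied XPath (with @id) should
# work again, or you can try a simpler selector like this assumed one:
reviews = driver.find_elements(By.CSS_SELECTOR, "#review_list_page_container li")
for review in reviews:
    print(review.text)

driver.quit()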
I've been trying to scrape text from a website for the past hour and have made no progress, simply because I have very little knowledge on how to actually use BSoup.
import requests
from bs4 import BeautifulSoup

def select_ticker():
    url = "https://www.barchart.com/stocks/performance/gap/gap-up?screener=nasdaq"
    r = requests.get(url)
    html = r.text
    soup = BeautifulSoup(html, "html.parser")
    find = soup.findAll('td, {"data-ng-if:"row.blankRow"}')
    print(find)
I'm going to this website and trying to get the first symbol from the table. Right now that symbol is BFBG
I know this should be extremely easy for someone who actually knows what they're doing with BSoup but I don't understand searching for things and this website doesn't make it easy to search either.
I appreciate your time and thanks for the help!
Actually, you cannot scrape the first symbol from the HTML GET request. You need to fetch the JSON.
import urllib3
import json

# The table is populated from this JSON endpoint (visible in the browser's
# network tab), not from the page's HTML.
http = urllib3.PoolManager()
r = http.request('GET', 'https://core-api.barchart.com/v1/quotes/get?lists=stocks.gaps.up.nasdaq&orderDir=desc&fields=symbol,symbolName,lastPrice,priceChange,gapUp,highPrice,lowPrice,volume,tradeTime,symbolCode,symbolType,hasOptions&orderBy=gapUp&meta=field.shortName,field.type,field.description&hasOptions=true&page=1&limit=100&raw=1')
# The first entry in the "data" array is the top row of the table.
print(json.loads(r.data)['data'][0]['symbol'])
And there you have the first symbol.
From the JSON you can also get every other piece of information you will probably want to scrape.
Here is how you can usually find those JSON endpoints:
Go into the developer console, open the Network tab, select XHR and reload the page. If a lot of resources are fetched, you can also filter by the name of the domain! :)
However, this syntax is wrong:
soup.findAll('td, {"data-ng-if:"row.blankRow"}')
you need to give a dictionary to the find_all method, according to the BS4 docs:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all
soup.find_all('td', {'data-ng-if':'row.blankRow'})
Hope this helps
I have to admit that I don't know much HTML. I am trying to extract all the comments from an online news article using Python. I tried using BeautifulSoup, but it seems the comments are not in the HTML source code, only in the inspect-element view. For instance, you can check here: http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments
My code is here and I am stuck.
import urllib.request as urllib2
from bs4 import BeautifulSoup
url = "http://www.dailymail.co.uk/sciencetech/article-5100519/Elon-Musk-says-Tesla-Roadster-special-option.html#comments"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
I want to do this
name_box = soup.find('p', attrs={'class': 'comment-body comment-text'})
but this info is not there in the source-code.
Any suggestion, how to move forward?
I have not attempted things like this, but my guess is that if you want to get it directly from the page source you'll need something like Selenium to actually render the page, since the page is dynamic.
Alternatively, if you're only interested in the comments, you can use dailymail.co.uk's API to fetch them.
Note the items in the query string: "max=1000", "order", etc. You may also need to use the "offset" parameter alongside "max" to collect all the comments if the API caps the maximum "max" value.
I do not know where the API is documented; you can discover it by viewing the network requests that your browser makes while you browse the webpage.
You can get the comment data for that page in JSON format from http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/5100519?max=1000&order=desc&rcCache=shout. It appears that every article has an ID like "5101863" in its URL; you can swap that number in for each new story whose comments you want.
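A minimal sketch of fetching that endpoint with requests; the exact JSON layout isn't reproduced here, so the snippet just dumps part of the response for you to inspect:

import json
import requests

article_id = "5100519"  # taken from the article's URL
api = ("http://www.dailymail.co.uk/reader-comments/p/asset/readcomments/"
       + article_id + "?max=1000&order=desc&rcCache=shout")

data = requests.get(api).json()
print(json.dumps(data, indent=2)[:2000])  # preview the structure of the response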
Thank you FredMan. I did not know about this API. It seems we only need to supply the article ID and we can get the comments for that article. This was the solution I was looking for.
I've been trying to figure out a simple way to run through a set of URLs that lead to pages that all have the same layout. We figured out that one issue is that in the original list the URLs are http but then they redirect to https. I am not sure if that then causes a problem in trying to pull the information from the page. I can see the structure of the page when I use Inspector in Chrome, but when I try to set up the code to grab relevant links I come up empty (literally). The most general code I have been using is:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer

soup = BeautifulSoup(urllib2.urlopen('https://ngcproject.org/program/algirls').read())
links = SoupStrainer('a')
print links
which yields:
a|{}
Given that I'm new to this I've been trying to work with anything that I think might work. I also tried:
mail = soup.find(attrs={'class':'tc-connect-details_send-email'}).a['href']
and
spans = soup.find_all('span', {'class' : 'tc-connect-details_send-email'})
lines = [span.get_text() for span in spans]
print lines
but these don't yield anything either.
I am assuming that it's an issue with my code and not that the data are hidden from being scraped. Ideally I want the data written to a CSV file for each URL I scrape, but right now I need to confirm that the code is actually grabbing the right information. Any suggestions welcome!
If you press CTRL+U in Google Chrome, or right-click > View Source,
you'll see that the page is rendered using JavaScript or similar.
urllib is not going to be able to download what you're looking for.
You'll have to use an automated browser (Selenium is the most popular), driving Google Chrome / Firefox or a headless browser (PhantomJS).
You can then get the information from Selenium, store it, and manipulate it in any way you see fit.
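A sketch of that Selenium route, reusing the class name from your own attempt (the selector is an assumption about how the rendered page looks):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # requires chromedriver on your PATH
driver.get("https://ngcproject.org/program/algirls")
driver.implicitly_wait(10)   # let the JavaScript render the page

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for span in soup.find_all("span", {"class": "tc-connect-details_send-email"}):
    print(span.get_text())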
Sorry if this is a silly question.
I am trying to use BeautifulSoup and urllib2 in Python to look at a URL and extract all divs with a particular class. However, the result is always empty, even though I can see the divs when I "inspect element" in Chrome's developer tools.
I looked at the page source and those divs were not there, which means they were inserted by a script. So my question is: how can I look for those divs (using their class name) with BeautifulSoup? I eventually want to read and follow the hrefs under those divs.
Thanks.
[Edit]
I am currently looking at the H&M website: http://www.hm.com/sg/products/ladies and I am interested in getting all the divs with class 'product-list-item'.
Try using Selenium to run the JavaScript:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("http://www.python.org")
html = driver.page_source
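Continuing the sketch above, you can then hand the rendered HTML to BeautifulSoup (assuming the 'product-list-item' class from the question is present once the page has rendered):

from bs4 import BeautifulSoup

driver.get("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(driver.page_source, "html.parser")

for div in soup.find_all("div", {"class": "product-list-item"}):
    for a in div.find_all("a", href=True):
        print(a["href"])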
Check the request link (the one visible in Chrome dev tools > Network); you can get all the info by changing that URL.
The reason you got nothing from that specific URL is simply that the info you need is not there.
So first, let me explain a little about how that page is loaded in a browser. When you request the page (http://www.hm.com/sg/products/ladies), only the literal content is returned in the very first response (which is what you got from your urllib2 request). The browser then starts to read/parse that content, which basically tells it where to find everything else needed to render the whole page (e.g. CSS to control the layout, additional JavaScript/URLs/pages to populate certain areas, etc.), and the browser does all of that behind the scenes. When you "inspect element" in Chrome, the page is already fully loaded, and the info you want is not in the original URL, so you need to find out which URL is used to populate those areas and go after that specific URL instead.
So now we need to find out what happens behind the scenes, and a tool is needed to capture all the traffic while that page loads (I would recommend Fiddler).
As you can see, lots of things happen when you open that page in a browser (and that's only part of the whole page-loading process). So, by educated guess, the info you need should be in one of those three "api.hm.com" requests, and the best part is that they are already JSON formatted, which means you might not even need BeautifulSoup; the built-in json module can do the job!
OK, now what? Use urllib2 to simulate those requests and get what you want.
P.S. requests is a great library for this kind of job; you can get it here.
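For example, a minimal sketch with requests (API_URL is a hypothetical placeholder; substitute the actual api.hm.com request you captured, and inspect its JSON before relying on any particular structure):

import requests

# Hypothetical placeholder; copy the real URL from your traffic capture.
API_URL = "https://api.hm.com/..."

response = requests.get(API_URL)
data = response.json()  # these responses are already JSON formatted
print(list(data))       # inspect the top-level keys to find the product data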
Try this one:
from bs4 import BeautifulSoup
import urllib2

page = urllib2.urlopen("http://www.hm.com/sg/products/ladies")
soup = BeautifulSoup(page.read(), 'lxml')
scrapdiv = open('scrapdiv.txt', 'w')

product_lists = soup.findAll("div", {"class": "o-product-list"})
print product_lists
for product_list in product_lists:
    print product_list
    scrapdiv.write(str(product_list))
    scrapdiv.write("\n\n")
scrapdiv.close()