Web scraping using Python and Beautiful Soup for /post-sitemap.xml/ - python

I am trying to scrape a page website/post-sitemap.xml which contains all url's posted for a wordpress website. In the first step, I need to make a list of all the url's present in post-sitemap. When I use requests.get and I check the output, it opens all of the internal urls as well, which is weird. My intention is to make a list of all url's first and then using a loop, I will scrape individual url's in the next function. Below is the code I have done so far. I would need all url's as a list as my final output if python gurus can help.
I have tried using requests.get and openurl but nothing seems to open only the base url for /post-sitemap.xml
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re
class wordpress_ext_url_cleanup(object):
def __init__(self,wp_url):
self.wp_url_raw = wp_url
self.wp_url = wp_url + '/post-sitemap.xml/'
def identify_ext_url(self):
html = requests.get(self.wp_url)
print(self.wp_url)
print(html.text)
soup = BeautifulSoup(html.text,'lxml')
#print(soup.get_text())
raw_data = soup.find_all('tr')
print (raw_data)
#for link in raw_data:
#print(link.get("href"))
def main():
print ("Inside Main Function");
url="http://punefirst dot com" #(knowingly removed the . so it doesnt look spammy)
first_call = wordpress_ext_url_cleanup(url)
first_call.identify_ext_url()
if __name__ == '__main__':
main()
I would need all 548 url's present in the post sitemap as a list which I will use it for the next function for further scraping.

The document that is returned from the server is XML and transformed with XSLT to HTML form (more info here). To parse all links from this XML, you can use this script:
import requests
from bs4 import BeautifulSoup
url = 'http://punefirst.com/post-sitemap.xml/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for loc in soup.select('url > loc'):
print(loc.text)
Prints:
http://punefirst.com
http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/saijyoti-hospital-and-icu-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/niramaya-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/chetna-hospital-chinchwad-pune
http://punefirst.com/hospitals/hadapsar-hospitals/pbmas-h-v-desai-eye-hospital
http://punefirst.com/hospitals/punecentral-hospitals/shree-sai-prasad-hospital
http://punefirst.com/hospitals/punecentral-hospitals/sadhu-vaswani-missions-medical-complex
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/shivneri-hospital
http://punefirst.com/hospitals/punecentral-hospitals/kelkar-nursing-home
http://punefirst.com/hospitals/pcmc-hospitals/shrinam-hospital
http://punefirst.com/hospitals/pcmc-hospitals/dhanwantari-hospital-nigdi
http://punefirst.com/hospitals/punecentral-hospitals/dr-tarabai-limaye-hospital
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/satyanand-hospital-kondhwa-pune
...and so on.

Related

Why wont BeautifulSoup extract all the HTML from a public twitter page?

I am trying to write some code to extract tweets from a public twitter page (Nike store) using the Python BS4 module. When I print the page HTML into the console, only some of the HTML is printed - when I try to search (ctrl +F) the specific class values for a tag from the console output and it returns with zero results. Why is this happening?
Here a code snippet:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
import re
if __name__ == '__main__':
# Read webpage into page_html' and close connection to webpage'
first_page = 'https://twitter.com/nikestore'
url_client = urlopen(first_page)
page_html = url_client.read()
url_client.close()
print(page_html)
I came across the accepted answer in the following link. Answer also suggests using selenium to circumvent the problem.
Problem while scraping twitter using beautiful soup

Loading more links in a page after sending json requests in Python

I am parsing this URL to get links from one of the boxes with infinite scroll. Here is mo code for sending the requests for the website to get next 10 links:
import requests
from bs4 import BeautifulSoup
import urllib2
import urllib
import extraction
import json
from json2html import *
baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
parameters2 = {
'ticker':'XOM',
'countryCode':'US',
'docType':'2007',
'sequence':'6e09aca3-7207-446e-bb8a-db1a4ea6545c',
'messageNumber':'1830',
'count':'10',
'channelName':'',
'topic':' ',
'_':'1479539628362'}
html2 = requests.get(baseUrl, params = parameters2)
html3 = json.loads(html2.text) # array of size 10
In the corresponding HTML , there is an element like:
<li class="loading">Loading more headlines...</li>
that tells there are more items to be loaded by scrolling dowwn , but I don't know how to use json file to write a loop to gets more links.
My first try was to use Beautiful Soup and to write the following code to get links and ids :
url = 'http://www.marketwatch.com/investing/stock/xom'
r = urllib.urlopen(url).read()
soup = BeautifulSoup(r, 'lxml')
pressReleaseBox = soup.find('div', attrs={'id':'prheadlines'})
and then check if there is more link to scrape, get the next json file:
loadingMore = pressReleaseBox.find('li',attrs={'class':'loading'})
while loadingMore != None:
# get the links from json file and load more links
I don't know hot to implement the comment part. do you have any idea about it?
I am not obliged to use BeautifulSoup, and any other working library will be fine.
Here is how you can load more json file:
get last json file, extract value of key UniqueId in last item.
if the value is something looks like e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499
extract e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2 as sequence
extract 8499 as messageNumber
let docId be empty
if the value is something looks like 1222712881
let sequence be empty
let messageNumber be empty
extract 1222712881 as docId
put parameters sequence, messageNumber, docId into your parameters2.
use requests.get(baseUrl, params = parameters2) to get your next json file.

python beautiful soup import urls

I am trying to import a list of urls and grab pn2 and main1. I can run it without importing the file so I know it works but I just have no idea what to do with the import. Here is what I have tried most recent and below it is a small portion of the urls. Thanks in advance.
import urllib
import urllib.request
import csv
from bs4 import BeautifulSoup
csvfile = open("ecco1.csv")
csvfilelist = csvfile.read()
theurl="csvfilelist"
soup = BeautifulSoup(theurl,"html.parser")
for row in csvfilelist:
for pn in soup.findAll('td',{"class":"productText"}):
pn2.append(pn.text)
for main in soup.find_all('div',{"class":"breadcrumb"}):
main1 = main.text
print (main1)
print ('\n'.join(pn2))
Urls:
http://www.eccolink.com/products/productresults.aspx?catId=2458
http://www.eccolink.com/products/productresults.aspx?catId=2464
http://www.eccolink.com/products/productresults.aspx?catId=2435
http://www.eccolink.com/products/productresults.aspx?catId=2446
http://www.eccolink.com/products/productresults.aspx?catId=2463
From what I see, you are opening a CSV file and using BeautifulSoup to parse it.
That should not be the way.
BeautifulSoup parses html files, not CSV.
Looking at your code, it seems correct if you were passing in html code to Bs4.
from bs4 import BeautifulSoup
import requests
links = []
file = open('links.txt')
html = requests.get('http://www.example.com')
soup = BeautifulSoup(html, 'html.parser')
for x in soup.find_all('a',"class":"abc"):
links.append(x)
file.write(x)
file.close()
Above is a very basic implementation of how I could get a target element in the html code and write it to a file/ or append it to a list. Use Requests rather than urllib. It is a better library and more modern.
If you want to input your data as CSV, my best option is to use csv reader as import.
Hope that helps.

Accessing a website in python

I am trying to get all the urls on a website using python. At the moment I am just copying the websites html into the python program and then using code to extract all the urls. Is there a way I could do this straight from the web without having to copy the entire html?
In Python 2, you can use urllib2.urlopen:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
In Python 3, you can use urllib.request.urlopen:
import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
html = response.read()
If you have to perform more complicated tasks like authentication or passing parameters I suggest to have a look at the requests library.
The most straightforward would probably be urllib.urlopen if you're using python2, or urllib.request.urlopen if you're using python3 (you have to do import urllib or import urllib.request first of course). That way you get an file like object from which you can read (ie f.read()) the html document.
Example for python 2:
import urllib
f = urlopen("http://stackoverflow.com")
http_document = f.read()
f.close()
The good news is that you seem to have done the hard part which is analyzing the html document for links.
You might want to use the bs4(BeautifulSoup) library.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
You can download bs4 with the followig command at the cmd line. pip install BeautifulSoup4
import urllib2
import urlparse
from bs4 import BeautifulSoup
url = "http://www.google.com"
response = urllib2.urlopen(url)
content = response.read()
soup = BeautifulSoup(content, "html.parser")
for link in soup.find_all('a', href=True):
print urlparse.urljoin(url, link['href'])
You can simply use the combination of requests and BeautifulSoup.
First make an HTTP request using requests to get the HTML content. You will get it as a Python string, which you can manipulate as you like.
Take the HTML content string and supply it into the BeautifulSoup, which has done all the job to extract the DOM, and get all URLs, i.e. <a> elements.
Here is an example of how to fetch all links from StackOverflow:
import requests
from bs4 import BeautifulSoup, SoupStrainer
response = requests.get('http://stackoverflow.com')
html_str = response.text
bs = BeautifulSoup(html_str, parseOnlyThese=SoupStrainer('a'))
for a_element in bs:
if a_element.has_attr('href'):
print(a_element['href'])
Sample output:
/questions/tagged/facebook-javascript-sdk
/questions/31743507/facebook-app-request-dialog-keep-loading-on-mobile-after-fb-login-called
/users/3545752/user3545752
/questions/31743506/get-nuspec-file-for-existing-nuget-package
/questions/tagged/nuget
...

Scraping Product Names using BeautifulSoup

I'm using BeautifulSoup (BS4) to build a scraper tool that will allow me to pull the product name from any TopShop.com product page, which sits between 'h1' tags. Can't figure out why the code I've written isn't working!
from urllib2 import urlopen
from bs4 import BeautifulSoup
import re
TopShop_URL = raw_input("Enter a TopShop Product URL")
ProductPage = urlopen(TopShop_URL).read()
soup = BeautifulSoup(ProductPage)
ProductNames = soup.find_all('h1')
print ProductNames
I get this working using requests (http://docs.python-requests.org/en/latest/)
from bs4 import BeautifulSoup
import requests
content = requests.get("TOPShop_URL").content
soup = BeautifulSoup(content)
product_names = soup.findAll("h1")
print product_names
Your code is correct, but the problem is that the div which includes the product name is dynamically generated via JavaScript.
In order to be able to successfully parse this element you should mind using Selenium or a similar tool, that will allow you to parse the webpage after all the dom has been fully loaded.

Categories