Extracting a table from HTML using Python

I am trying to extract the table "Pharmacology-and-Biochemistry" from the url https://pubchem.ncbi.nlm.nih.gov/compound/23677941#section=Pharmacology-and-Biochemistry. I have written this code:
from lxml import etree
import urllib.request as ur
url = "https://pubchem.ncbi.nlm.nih.gov/compound /23677941#section=Chemical-and-Physical-Properties"
web = ur.urlopen(url)
s = web.read()
html = etree.HTML(s)
print (html)
nodes = html.xpath('//li[@id="Pharmacology-and-Biochemistry"]/descendant::*')
print(nodes)
but the script is not finding the node specified in the XPath, and the output is an empty list:
[]
I tried several other XPaths but nothing worked.
Please help!

I think the problem is that the table you are searching for does not exist in the HTML this url returns (the section content is filled in by JavaScript after the page loads).
Try running this:
from urllib.request import urlopen
text = urlopen('https://pubchem.ncbi.nlm.nih.gov/compound/23677941#section=Pharmacology-and-Biochemistry').read().decode('utf-8')
print('Pharmacology-and-Biochemistry' in text)
The result is:
False
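Because the section data is not in the static HTML, scraping the rendered page with lxml alone will not work. One alternative is PubChem's PUG View REST service, which serves the same record data as JSON. A minimal sketch, assuming the endpoint path and the Record/Section/TOCHeading key names below (they are from memory, so verify them against the PubChem documentation):

import json
import urllib.request as ur

cid = 23677941
# Assumed PUG View endpoint that returns the full record, including the
# "Pharmacology and Biochemistry" section, as JSON.
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"

with ur.urlopen(url) as resp:
    record = json.loads(resp.read().decode("utf-8"))

# Walk the top-level sections and keep the one the question asks about.
for section in record["Record"]["Section"]:
    if section.get("TOCHeading") == "Pharmacology and Biochemistry":
        print(json.dumps(section, indent=2))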

Related

Unable to open LOCAL HTML page for scraping using BS4 Python

I have written following code to open a local HTML file saved on my Desktop:
However, while running this code I get the following error:
I have no prior experience of handling this in Python or BS4. I tried various solutions online but couldn't solve it.
Code:
import csv
from email import header
from fileinput import filename
from tokenize import Name
import requests
from bs4 import BeautifulSoup
url = "C:\ Users\ ASUS\ Desktop\ payment.html"
page=open(url)
# r=requests.get(url)
# htmlContent = r.content
soup = BeautifulSoup(page.read())
head_tag = soup.head
for child in head_tag.descendants:
print(child)
Need help!
Thank you in advance.
It's a unicode escape error; prefix the path with r (to produce a raw string):
url = r"C:\Users\ASUS\Desktop\payment.html"

Removing empty lines from a list in Python

My goal is to get a simple text output like:
https://widget.reviews.io/rating-snippet/dist.js
But I keep getting output like this:
https://widget.reviews.io/rating-snippet/dist.js
All these empty lines are the problem
--> Before there were [] entries, but I removed them with ''.join.
Now I only have these empty lines.
Here is my code:
import requests
import re
from bs4 import BeautifulSoup
html = requests.get("https://www.nutrimuscle.com")
soup = BeautifulSoup(html.text, "html.parser")
# Find all script tags
for n in soup.find_all('script'):
    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']
    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = ''
    kameleoonRegex = re.compile(r'[\w].*rating-snippet/dist.js')
    # Everything I tried :D
    kameleeonScript = kameleoonRegex.findall(javascript)
    text = ''.join(kameleeonScript)
    print(text)
It's probably not that hard but I've been on this for hours
if kameleeonScript: print(kameleeonScript[0])
did the job :)
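In context, a minimal sketch of the question's loop with that guard added, so only the matching script URL is printed and the non-matching iterations stay silent:

import re
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.nutrimuscle.com")
soup = BeautifulSoup(html.text, "html.parser")
kameleoonRegex = re.compile(r'[\w].*rating-snippet/dist.js')

for n in soup.find_all('script'):
    # Use the src attribute if present, otherwise fall back to an empty string
    javascript = n['src'] if 'src' in n.attrs else ''
    kameleeonScript = kameleoonRegex.findall(javascript)
    if kameleeonScript:  # only print when the regex actually matched
        print(kameleeonScript[0])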

HTML scraping a website with duplicated div class names

I am currently working on HTML scraping from Baka-Updates.
However, the div class names are duplicated.
Since my goal is CSV or JSON output, I would like to use the text in [sCat] as the column name and store the text in [sContent] as the value.
Is there a way to scrape this kind of website?
Thanks,
Sample
https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
I tried, but I couldn't find anything that works.
@Jasper Nichol M Fabella I tried to edit your code and got the following output. Maybe it will help.
from lxml import html
import requests

page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)

# Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]')
# Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')
print('sCat: ', len(sCat))
print('sContent: ', len(sContent))

json_dict = {}
for i in range(0, len(sCat)):
    # print(''.join(i.itertext()))
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text
print(json_dict)
I got the following output
Hope it Helps
You can use XPath expressions to build a path straight to what you want to scrape.
Here is an example with the requests and lxml libraries:
from lxml import html
import requests
r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
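Since the goal is CSV or JSON, the two lists can then be zipped into a dict and written out. A small sketch that continues from the snippet above (it assumes sCat and sContent have the same length and are aligned, which the page layout suggests but you should verify):

import csv
import json

# Pair each category with its content; relies on the two lists lining up.
data = dict(zip(sCat, sContent))

with open('series.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

with open('series.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['sCat', 'sContent'])  # header row
    writer.writerows(data.items())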
What are you using to scrape?
If you are using BeautifulSoup, you can search for all matching content on the page with the find_all method and a class identifier, then iterate through the results. You can use the special class_ keyword argument.
Something like
import bs4
soup = bs4.BeautifulSoup(html.source)
soup.find_all('div', class_='sCat')
# do rest of your logic work here
Edit: I was typing on my mobile from a cached page before you made the edits, so I didn't see the changes. I see you are using the raw lxml library to parse. Yes, that's faster, but I am not too familiar with it, as I've only used the raw lxml library for one project, but I think you can chain two search methods to get something equivalent.
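For what it's worth, a rough BeautifulSoup equivalent of the lxml answer above (a sketch only; like the other answers it assumes the sCat and sContent divs appear in matching order):

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.mangaupdates.com/series.html?id=75363')
soup = BeautifulSoup(page.content, 'html.parser')

# Chain the two searches: grab both div classes, then pair them up.
cats = [div.get_text(strip=True) for div in soup.find_all('div', class_='sCat')]
contents = [div.get_text(strip=True) for div in soup.find_all('div', class_='sContent')]

print(dict(zip(cats, contents)))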

Web scraping using Python and Beautiful Soup for /post-sitemap.xml/

I am trying to scrape a page, website/post-sitemap.xml, which contains all the URLs posted for a WordPress website. In the first step, I need to make a list of all the URLs present in the post sitemap. When I use requests.get and check the output, it renders all of the internal URLs as well, which is weird. My intention is to make a list of all the URLs first, and then, using a loop, I will scrape the individual URLs in the next function. Below is the code I have done so far. I would need all the URLs as a list as my final output, if Python gurus can help.
I have tried using requests.get and urlopen, but nothing seems to return only the base content of /post-sitemap.xml
import pandas as pd
import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

class wordpress_ext_url_cleanup(object):

    def __init__(self, wp_url):
        self.wp_url_raw = wp_url
        self.wp_url = wp_url + '/post-sitemap.xml/'

    def identify_ext_url(self):
        html = requests.get(self.wp_url)
        print(self.wp_url)
        print(html.text)
        soup = BeautifulSoup(html.text, 'lxml')
        #print(soup.get_text())
        raw_data = soup.find_all('tr')
        print(raw_data)
        #for link in raw_data:
        #    print(link.get("href"))

def main():
    print("Inside Main Function")
    url = "http://punefirst dot com"  # (knowingly removed the . so it doesn't look spammy)
    first_call = wordpress_ext_url_cleanup(url)
    first_call.identify_ext_url()

if __name__ == '__main__':
    main()
I would need all 548 URLs present in the post sitemap as a list, which I will use in the next function for further scraping.
The document returned from the server is XML, and the browser transforms it into HTML form with XSLT (more info here). To parse all the links from this XML, you can use this script:
import requests
from bs4 import BeautifulSoup
url = 'http://punefirst.com/post-sitemap.xml/'
soup = BeautifulSoup(requests.get(url).text, 'lxml')
for loc in soup.select('url > loc'):
    print(loc.text)
Prints:
http://punefirst.com
http://punefirst.com/hospitals/pcmc-hospitals/aditya-birla-memorial-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/saijyoti-hospital-and-icu-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/niramaya-hospital-chinchwad-pune
http://punefirst.com/hospitals/pcmc-hospitals/chetna-hospital-chinchwad-pune
http://punefirst.com/hospitals/hadapsar-hospitals/pbmas-h-v-desai-eye-hospital
http://punefirst.com/hospitals/punecentral-hospitals/shree-sai-prasad-hospital
http://punefirst.com/hospitals/punecentral-hospitals/sadhu-vaswani-missions-medical-complex
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/shivneri-hospital
http://punefirst.com/hospitals/punecentral-hospitals/kelkar-nursing-home
http://punefirst.com/hospitals/pcmc-hospitals/shrinam-hospital
http://punefirst.com/hospitals/pcmc-hospitals/dhanwantari-hospital-nigdi
http://punefirst.com/hospitals/punecentral-hospitals/dr-tarabai-limaye-hospital
http://punefirst.com/hospitals/katraj-kondhwa-hospitals/satyanand-hospital-kondhwa-pune
...and so on.
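Since the final output should be a list of all the URLs rather than printed lines, the same selector can feed a list comprehension, continuing from the snippet above:

# Collect every <loc> value into a list (the question mentions 548 of them)
urls = [loc.text for loc in soup.select('url > loc')]
print(len(urls))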

Python: limiting the search to a specific hyperlink on a webpage

I am looking for a way to download .pdf files through hyperlinks on a webpage.
Learned from How can i grab pdf links from website with Python script, the way is:
import lxml.html, urllib2, urlparse
base_url = 'http://www.renderx.com/demos/examples.html'
res = urllib2.urlopen(base_url)
tree = lxml.html.fromstring(res.read())
ns = {'re': 'http://exslt.org/regular-expressions'}
for node in tree.xpath('//a[re:test(@href, "\.pdf$", "i")]', namespaces=ns):
    print urlparse.urljoin(base_url, node.attrib['href'])
The question is, how can I only find the .pdf under a specific hyperlink, instead of listing all the .pdf(s) on the webpage?
A way is, I can limit the print when it contains certain words like:
if 'CA-Personal.pdf' in node:
But what if the .pdf file name is changing? Or I just want to limit the searching on the webpage to the hyperlink of "Applications"? Thanks.
Well, not the best way, but there's no harm in doing:
from bs4 import BeautifulSoup
import urllib2
domain = 'http://www.renderx.com'
url = 'http://www.renderx.com/demos/examples.html'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
app = soup.find_all('a', text = "Applications")
for aa in app:
    print domain + aa['href']
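To actually restrict the PDF search to the "Applications" part of the page rather than matching on file names, one option is to start from that anchor and only look at links inside its surrounding container. A minimal sketch in Python 3 with requests (it assumes the section's PDF links live under the anchor's parent element, which you would need to confirm against the real markup of examples.html):

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = 'http://www.renderx.com/demos/examples.html'
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')

for anchor in soup.find_all('a', string='Applications'):
    section = anchor.parent  # assumed container of that section's links
    for link in section.find_all('a', href=True):
        if link['href'].lower().endswith('.pdf'):
            print(urljoin(base_url, link['href']))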
