Accessing web table using Python - NIST website

I am trying to access a table from the NIST website here:
http://physics.nist.gov/PhysRefData/XrayMassCoef/tab3.html
Assume that I click the element zinc. I would like to retrieve the information for Energy, μ/ρ, and μen/ρ into 3 columns of a table using Python 2.7.
I am beginning to learn BeautifulSoup and Mechanize. However, I am finding it hard to identify a clear pattern in the HTML code relating to the table on this site.
What I am looking for is some way to do something like this:
import mechanize
from bs4 import BeautifulSoup

mech = mechanize.Browser()
page = mech.open("http://physics.nist.gov/PhysRefData/XrayMassCoef/tab3.html")
html = page.read()
soup = BeautifulSoup(html)
My thought was to try:
table = soup.find("table",...)
The ... above would be some identifier. I can't find a clear identifier on the NIST website above.
How would I be able to import this table using python 2.7?
EDIT: Is it possible to put these 3 columns in a table?

If I understood you correctly, try this:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
l = soup.find('table').find('table').find_all('tr')
for i in range(3, len(l)):
    print l[i].get_text()
Edit:
Another way: grab the ASCII column and put the rows into the list l:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
l = soup.find('table').find('pre').get_text()[145:].split("\n")
print l
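Regarding the edit (putting the three columns into a table): here is a minimal sketch building on the second approach above. It assumes the <pre> block keeps its whitespace-separated layout, with Energy, μ/ρ and μen/ρ as the last three numeric fields on each row (some rows carry an absorption-edge label in front), so treat it as a starting point rather than a guaranteed parser:
from bs4 import BeautifulSoup
import requests

respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
rows = soup.find('table').find('pre').get_text()[145:].split("\n")

energy, mu_rho, mu_en_rho = [], [], []
for row in rows:
    fields = row.split()
    if len(fields) < 3:          # skip blank or header-ish lines
        continue
    try:
        e, m, men = [float(x) for x in fields[-3:]]
    except ValueError:           # skip anything that is not three numbers
        continue
    energy.append(e)
    mu_rho.append(m)
    mu_en_rho.append(men)

# the three parallel lists form the table; print them side by side
for e, m, men in zip(energy, mu_rho, mu_en_rho):
    print e, m, men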

Related

Webscraping Python BS4 issue not returning data

I am new here and have read through many of the historic posts, but cannot find exactly what I am looking for.
I am new to webscraping and have successfully scraped data from a handful of sites.
However, I am having an issue with this code: I am trying to extract the product titles using Beautiful Soup, but something in the code is not returning the data. Any help would be appreciated:
from bs4 import BeautifulSoup
import requests
import pandas as pd
webpage = requests.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(webpage.content, 'html.parser')
title = sp.find_all('h3', class_='co-product__title')
print(title)
I assume my issue lies somewhere in the find_all call, but I cannot quite work out how to resolve it.
Regards
Milan
You could try using this link instead; it seems to pull the information you desire:
from bs4 import BeautifulSoup
import requests
webpage = requests.get("https://groceries.asda.com/api/items/iconmetadata?request_origin=gi")
sp = BeautifulSoup(webpage.content, "html.parser")
print(sp)
Let me know if this helps.
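Since that /api/ URL looks like a JSON endpoint, it may be simpler to decode it with requests directly instead of feeding it to BeautifulSoup. A minimal sketch, assuming the response body really is JSON; the payload structure isn't shown here, so inspect the keys before relying on any field names:
import requests

r = requests.get("https://groceries.asda.com/api/items/iconmetadata?request_origin=gi")
data = r.json()  # parse the JSON body instead of treating it as HTML

# the exact structure is undocumented here, so look at it first
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))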
Try this:
from bs4 import BeautifulSoup
import requests
import pandas as pd
webpage = requests.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
sp = BeautifulSoup(webpage.content, 'html.parser')
title = sp.find_all('h3', {'class':'co-product__title'})
print(title[0])
I also prefer
sp = BeautifulSoup(webpage.text, 'lxml')
Also note that this will return a list with all elements of that class. If you want just the first instance, use .find, i.e.:
title = sp.find('h3', {'class':'co-product__title'})
Sorry to rain on this parade, but you won't be able to scrape this data without a webdriver, or you can call the API directly. You should research how to get post-rendered JS in Python.
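For what it's worth, here is a minimal sketch of the webdriver route. It assumes Selenium and a matching ChromeDriver are installed, and the co-product__title class is taken from the question, so it may change:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # needs a ChromeDriver binary on your PATH
driver.get('https://groceries.asda.com/aisle/beer-wine-spirits/spirits/whisky/1215685911554-1215685911575-1215685911576')
time.sleep(5)  # crude wait for the JS-rendered product grid; an explicit WebDriverWait would be cleaner

# parse the post-render page source rather than the raw response body
sp = BeautifulSoup(driver.page_source, 'html.parser')
titles = [h3.get_text(strip=True) for h3 in sp.find_all('h3', class_='co-product__title')]
print(titles)

driver.quit()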

HTML scraping a website with duplicated div class names

I am currently working on HTML scraping of Baka-Updates.
However, the div class names are duplicated.
Since my goal is CSV or JSON output, I would like to use the information in [sCat] as the column names and store the values from [sContent].
Is there a way to scrape this kind of website?
Thanks,
Sample
https://www.mangaupdates.com/series.html?id=75363
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]/text()')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]/text()')
print('sCat: ', sCat)
print('sContent: ', sContent)
I tried, but could not find anything that works.
@Jasper Nichol M Fabella
I tried to edit your code and got the following output. Maybe it will help.
from lxml import html
import requests
page = requests.get('http://www.mangaupdates.com/series.html?id=153558?')
tree = html.fromstring(page.content)
# print(page.content)
#Get the name of the columns.... I hope
sCat = tree.xpath('//div[@class="sCat"]')
#Get the actual data
sContent = tree.xpath('//div[@class="sContent"]')
print('sCat: ', len(sCat))
print('sContent: ', len(sContent))
json_dict={}
for i in range(0, len(sCat)):
    # print(''.join(sCat[i].itertext()))
    sCat_text = ''.join(sCat[i].itertext())
    sContent_text = ''.join(sContent[i].itertext())
    json_dict[sCat_text] = sContent_text
print(json_dict)
I got the following output (not reproduced here).
Hope it helps.
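Since the stated goal is JSON, a short follow-up sketch, continuing directly from the snippet above (it assumes json_dict has been built as shown; the output filename is just an example):
import json

with open('series_153558.json', 'w') as f:  # example filename
    json.dump(json_dict, f, indent=2)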
You can use XPath expressions to build an absolute path to what you want to scrape.
Here is an example with the requests and lxml libraries:
from lxml import html
import requests
r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]
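To get from those two lists to the CSV the question asks about, here is a hedged, self-contained sketch. It assumes sCat and sContent line up one-to-one (which the page layout suggests but is not guaranteed), and the output filename is just an example:
import csv
import requests
from lxml import html

r = requests.get('https://www.mangaupdates.com/series.html?id=75363')
tree = html.fromstring(r.content)
sCat = [i.text_content().strip() for i in tree.xpath('//div[@class="sCat"]')]
sContent = [i.text_content().strip() for i in tree.xpath('//div[@class="sContent"]')]

# pair each category label with its content, skipping empty labels
pairs = [(cat, content) for cat, content in zip(sCat, sContent) if cat]

with open('series.csv', 'w') as f:  # example filename
    writer = csv.writer(f)
    writer.writerow([cat for cat, _ in pairs])          # header row from sCat
    writer.writerow([content for _, content in pairs])  # one data row from sContent
    # note: on Python 2 you may need to encode these strings as UTF-8 first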
What are you using to scrape?
If you are using BeautifulSoup, then you can search for all matching content on the page with the find_all method and a class identifier, and iterate through that. You can use the special class_ keyword argument.
Something like
import bs4

# page_html is the HTML source you fetched (e.g. with requests); the name is just a placeholder
soup = bs4.BeautifulSoup(page_html, 'html.parser')
soup.find_all('div', class_='sCat')
# do the rest of your logic work here
Edit: I was typing on my mobile on a cached page before you made the edits, so I didn't see the changes. I see you are using the raw lxml library to parse. Yes, that's faster, but I am not too familiar with it, as I've only used the raw lxml library for one project; I think you can chain two search methods to distill to something equivalent.

Compiling a list of links off of Pastebin in Python

Currently trying to extract links off of Pastebin using Python. What I have so far:
from bs4 import BeautifulSoup
import re
import requests
from random import randint
import time
from lxml import etree
from time import sleep
import random
a = requests.get('http://pastebin.com/JGM3p9c9')
scrape = BeautifulSoup(a.text, 'lxml')
linkz = scrape.find_all("textarea", {"id":"paste_code"})
rawlinks = str(linkz)
partition1 = rawlinks.partition('\\r')[0]
links = partition1.partition('">')[-1]
I can't seem to get Python to pick up all of the http:// formatted links, only the first one. The regexes I found online didn't work.
End goal: I'm trying to get the links into a list so that I can send requests to every link in the list I compile.
Firstly, you do not have to extract the complete tag and convert it to str. A better way to achieve it is:
# use `find` instead of `find_all`, and `.next` to extract the content within the tag
>>> my_links = scrape.find("textarea", {"id":"paste_code"}).next
where my_links will hold the value:
u'http://www.walmart.com\r\nhttp://www.target.com\r\nhttp://www.lowes.com\r\nhttp://www.sears.com'
In order to convert this string to your desired list of links, you may split the string on \r\n as:
>>> my_links.split('\r\n')
[u'http://www.walmart.com', u'http://www.target.com', u'http://www.lowes.com', u'http://www.sears.com']
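For the stated end goal (sending requests to every link in that list), here is a minimal self-contained sketch along the same lines, assuming every entry in the paste is a plain URL and the sites are reachable:
import requests
from bs4 import BeautifulSoup

a = requests.get('http://pastebin.com/JGM3p9c9')
scrape = BeautifulSoup(a.text, 'lxml')
my_links = scrape.find("textarea", {"id": "paste_code"}).next

for link in my_links.split('\r\n'):
    response = requests.get(link)
    # do whatever you need with each response; here we just report the status code
    print link, response.status_code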
You need to navigate through a couple of layers of HTML, but I had a look at the Pastebin page and I think this code will find what you want (sorry for switching a couple of modules; I just use these ones instead):
from bs4 import BeautifulSoup
import urllib.request
a = urllib.request.urlopen('http://pastebin.com/JGM3p9c9')
scrape = BeautifulSoup(a, 'html.parser')
x1 = scrape.find_all('div', id = 'selectable')
for x2 in x1:
    x3 = x2.find_all('li')
    for x4 in x3:
        x5 = x4.find_all('div')
        for x6 in x5:
            print(x6.string)
Next time you need to scrape a specific thing, I advise looking at the HTML of the website by right-clicking and selecting 'Inspect Element'. Also, you can do:
print(scrape.prettify())
To get a better idea of how the HTML is nested.
Forget using BS to parse the HTML - in this specific case, you can get the raw Pastebin content directly and turn this into a one-liner.
import requests
links = [link.strip() for link in requests.get('http://pastebin.com/raw/JGM3p9c9').text.split('\n')]
You can also split on \r\n
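If you would rather not care whether the paste uses \n or \r\n line endings, splitlines() handles both; a small variation on the one-liner above:
import requests

raw = requests.get('http://pastebin.com/raw/JGM3p9c9').text
links = [link.strip() for link in raw.splitlines() if link.strip()]
print(links)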

urllib keeps freezing while trying to pull HTML data from a website - is my code correct?

I'm trying to build a simple Python script algorithm on Mac OS X that has four parts to it.
1. Go to a defined website and grab all the HTML using urllib
2. Parse the HTML data to find a table of numbers (using BeautifulSoup)
3. With those numbers, do a simple calculation
4. Print out the results in a table in numerical order
I'm having trouble with step 1. I can grab the data with urllib using this code:
import urllib.request
y=urllib.request.urlopen('my target website url')
x=y.read()
print(x)
But it keeps freezing once it has returned the HTML and the Python shell is non-responsive.
Since you mentioned requests, I think it's a great solution.
import requests
from bs4 import BeautifulSoup

r = requests.get('http://example.com')
html = r.content
soup = BeautifulSoup(html, 'html.parser')
table = soup.find("table", {"id": "targettable"})
As suggested by jonrsharpe, if you're concerned about the size of the response returned by that url, you can check the size first before printing or parsing.
With requests:
r = requests.get('http://example.com')
print(r.headers['content-length'])
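If the worry is the request itself hanging rather than the size of what gets printed, requests also accepts a timeout, and stream=True lets you look at the headers before committing to the download. A small sketch; the 10-second and 1 MB values are arbitrary:
import requests

r = requests.get('http://example.com', timeout=10, stream=True)  # give up if no data arrives within 10 s

# with stream=True only the headers have been fetched so far
size = int(r.headers.get('content-length', 0))
print(size)

if size < 1024 * 1024:  # only pull bodies smaller than ~1 MB in this sketch
    html = r.content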

Trying to scrape information from an iterated list of links using Beautiful Soup or ElementTree

I'm attempting to scrape an xml database list of links for these addresses. (The 2nd link is an example page that actually contains some addresses. Many of the links don't.)
I'm able to retrieve the list of initial links I'd like to crawl through, but I can't seem to go one step further and extract the final information I'm looking for (addresses).
I assume there's an error with my syntax, and I've tried scraping it using both beautiful soup and Python's included library, but it doesn't work.
BSoup:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
# find the links to companies
company_menu = bs.find_all('loc')
for company in company_menu:
    data = bs.find("html",{"i"})
    print data
Non 3rd Party:
import requests
import xml.etree.ElementTree as et
req = requests.get('http://www.agenzia-interinale.it/sitemap-5.xml')
root = et.fromstring(req.content)
for i in root:
    print i[0].text
Any input is appreciated! Thanks.
Your syntax is OK. You simply need to follow those links from the first page; here's how it will look for the Milano pages:
from bs4 import BeautifulSoup
import requests
import re
resultsdict = {}
companyname = []
url1 = 'http://www.agenzia-interinale.it/sitemap-5.xml'
html = requests.get(url1).text
bs = BeautifulSoup(html)
company_menu = bs.find_all('loc')
for item in company_menu:
    if 'milano' in item.text:
        subpage = requests.get(item.text)
        subsoup = BeautifulSoup(subpage.text)
        adresses = subsoup.find_all(class_='riquadro_agenzia_off')
        for adress in adresses:
            companyname.append(adress.text)
print companyname
To get all addresses you can simply remove the if 'milano' block in the code. I don't know if they are all formatted according to coherent rules; for Milano, the addresses are under a div with class="riquadro_agenzia_off", and if the other subpages are formatted the same way then it should work. Anyway, this should get you started. Hope it helps.
