How to get the content from a certain <table> using Python?

I have some <tr>s, like this:
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
I want to fetch the content without html tags, like:
yangfanhit
3155
Accepted
344K
219MS
C++
3940B
2012-10-02 16:42:45
Now I'm using the following code to deal with it:
response = urllib2.urlopen('http://poj.org/status', timeout=10)
html = response.read()
response.close()
pattern = re.compile(r'<tr align.*</tr>')
match = pattern.findall(html)
pat = re.compile(r'<td>.*?</td>')
p = re.compile(r'<[/]?.*?>')
for item in match:
    for i in pat.findall(item):
        print p.sub(r'', i)
    print '================================================='
I'm new to regex and also new to python. So could you suggest some better methods to process it?

You could use BeautifulSoup to parse the HTML. To write the content of the table in CSV format:
#!/usr/bin/env python
import csv
import sys
import urllib2
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
soup = BeautifulSoup(urllib2.urlopen('http://poj.org/status'))
writer = csv.writer(sys.stdout)
for tr in soup.find('table', 'a')('tr'):
    writer.writerow([td.get_text() for td in tr('td')])
Output
Run ID,User,Problem,Result,Memory,Time,Language,Code Length,Submit Time
10876151,yangfanhit,3155,Accepted,344K,219MS,C++,3940B,2012-10-02 16:42:45
10876150,BandBandRock,2503,Accepted,16348K,2750MS,G++,840B,2012-10-02 16:42:25

Also take a look at PyQuery. It's very easy to pick up if you're familiar with jQuery. Here's an example that returns the table header and data as a list of dictionaries.
import itertools
from pyquery import PyQuery as pq
# parse html
html = pq(url="http://poj.org/status")
# extract header values from table
header = [header.text for header in html(".a").find(".in").find("td")]
# extract data values from table rows in nested list
detail = [[td.text for td in tr] for tr in html(".a").children().not_(".in")]
# merge header and detail to create list of dictionaries
result = [dict(itertools.izip(header, values)) for values in detail]

You really don't need to work with regexes directly to parse HTML; see the answer here.
Or see Dive into Python Chapter 8 about HTML Processing.
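If you'd rather avoid both regexes and third-party libraries, the standard library's HTML parser can strip the tags too. A minimal sketch (shown with Python 3's html.parser; the cell values come from the sample rows above):

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collects the text content of each <td>, ignoring nested tags like <a>."""
    def __init__(self):
        super().__init__()
        self.cells = []      # one string per <td>
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data

row = ('<tr align=center><td>10876151</td>'
       '<td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td>'
       '<td><a href=problem?id=3155>3155</a></td></tr>')
parser = CellExtractor()
parser.feed(row)
print(parser.cells)  # ['10876151', 'yangfanhit', '3155']
```

html.parser copes with the page's unquoted attribute values, though for real scraping BeautifulSoup is still less work.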

Why are you doing those things when you already have HTML/XML parsers that do the job easily for you?
Use BeautifulSoup. Considering what you want, as mentioned in the question above, it can be done in 2-3 lines of code.
Example:
>>> from bs4 import BeautifulSoup as bs
>>> html = """
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
"""
>>> soup = bs(html)
>>> soup.td
<td>10876151</td>

Related

Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API - parse data within XML parent/child tags

I'm new to working with XML and BeautifulSoup and I am trying to get a dataset of clinical trials using Clinicaltrials.gov's new API that converts a list of trials into an XML dataset. I tried using find_all() like I typically do with HTML, but I'm not having the same luck. I've tried a few other approaches, like converting to a string and splitting (very messy) but I don't want to clutter my code with failed attempts.
Bottom line: I want to extract all NCTIds (I know I can just convert the whole thing into a string and use regex, but I want to learn how to actually parse XML correctly) and official titles for each clinical trial listed in the XML file. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html
url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results
You can filter on attributes like the following:
m1_nctid = soup.findAll("field", {"name" : "NCTId"})
m1_officialtitle = soup.findAll("field", {"name" : "OfficialTitle"})
and then iterate over each result to get its text, for example:
official_titles = [result.text for result in m1_officialtitle]
For more info, you can check the documentation here.
You can search for the field tag in lowercase and pass name as an attribute to attrs. This works with just BeautifulSoup; there's no need to use etree:
import requests
from bs4 import BeautifulSoup
url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
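For comparison, the standard library's xml.etree.ElementTree can parse the same Field elements; unlike BeautifulSoup's lxml mode, it keeps the original tag and attribute case (Field, Name). A sketch against a small inline snippet that imitates the API's shape (the NCT id and title here are made up, and the real response nests these elements deeper):

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the API response; a real script would call
# ET.fromstring(response.content) on the requests response instead.
xml_doc = """<FullStudiesResponse>
  <Field Name="NCTId">NCT04000000</Field>
  <Field Name="OfficialTitle">A Telehealth Peer Support Study</Field>
</FullStudiesResponse>"""

root = ET.fromstring(xml_doc)
# iter() walks all descendants, so nesting depth doesn't matter
nct_ids = [f.text for f in root.iter('Field') if f.get('Name') == 'NCTId']
titles = [f.text for f in root.iter('Field') if f.get('Name') == 'OfficialTitle']
print(nct_ids)   # ['NCT04000000']
print(titles)    # ['A Telehealth Peer Support Study']
```

This is also a reasonable way to "actually parse XML correctly" as the question asks, since ElementTree treats the document as XML rather than lenient HTML.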

How to use BeautifulSoup to extract HTML5 elements such as <section>

I intend to extract the article text from an NYT article. However, I don't know how to extract elements by HTML5 tags such as the section name.
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('https://www.nytimes.com/2019/10/24/opinion/chuck-schumer-electric-car.html?action=click&module=Opinion&pgtype=Homepage')
soup = BeautifulSoup(html)
data = soup.findAll(text=True)
The main text is wrapped in a section named 'articleBody'. What kind of soup.find() syntax can I use to extract that?
The find method searches tags; it doesn't differentiate HTML5 tags from any other (X)HTML tag name:
article = soup.find("section",{"name":"articleBody"})
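Once you have the section, you can pull the paragraph text out of it with get_text(). A self-contained sketch, where the inline HTML is a stand-in for the real NYT markup (which is considerably more complex):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the article page: a <section name="articleBody">
# holding the paragraphs, as described in the question.
html = '''<section name="articleBody">
<p>First paragraph.</p><p>Second paragraph.</p>
</section>'''

soup = BeautifulSoup(html, 'html.parser')
article = soup.find('section', {'name': 'articleBody'})
# join the text of each <p> inside the section
text = ' '.join(p.get_text() for p in article.find_all('p'))
print(text)  # First paragraph. Second paragraph.
```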
You can scrape the pre-loaded data from the script tag and parse it with the json library. The first code block brings back a little more content than you wanted. You can further restrict it by looking up the ids of paragraphs within the body and using those to filter the content, as shown in the bottom block; you then get exactly the article content you describe.
import requests, re, json
r = requests.get('https://www.nytimes.com/2019/10/24/opinion/chuck-schumer-electric-car.html?action=click&module=Opinion&pgtype=Homepage')
p = re.compile(r'window\.__preloadedData = (.*})')
data = json.loads(p.findall(r.text)[0])
for k,v in data['initialState'].items():
    if k.startswith('$Article') and 'formats' in v:
        print(v['text#stripHtml'] if 'text#stripHtml' in v else v['text'])
You can explore the json here: https://jsoneditoronline.org/?id=f9ae1fb774af439d8e9b32247db9d853
The following shows how to use additional logic to limit to just output you want:
ids = []
for k,v in data['initialState'].items():
    if k.startswith('$Article') and v['__typename'] == 'ParagraphBlock' and 'content' in v:
        ids += [v['content'][0]['id']]
for k,v in data['initialState'].items():
    if k in ids:
        print(v['text'])

Accessing web table using Python - NIST website

I am trying to access a table from the NIST website here:
http://physics.nist.gov/PhysRefData/XrayMassCoef/tab3.html
Assume that I click the element zinc. I would like to retrieve the information for Energy, u/p and u[en]/p into 3 columns of a table using python 2.7.
I am beginning to learn BeautifulSoup and Mechanize. However, I am finding it hard to identify a clear pattern in the HTML code relating to the table on this site.
What I am looking for is some way to do something like this:
import mechanize
from bs4 import BeautifulSoup
mech = mechanize.Browser()
page = mech.open("http://physics.nist.gov/PhysRefData/XrayMassCoef/tab3.html")
html = page.read()
soup = BeautifulSoup(html)
My thought was to try:
table = soup.find("table",...)
The ... above would be some identifier. I can't find a clear identifier on the NIST website above.
How would I be able to import this table using python 2.7?
EDIT: Is it possible to put these 3 columns in a table?
If I understood you correctly, try this:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
l = soup.find('table').find('table').find_all('tr')
for i in range(3, len(l)):
    print l[i].get_text()
Edit:
Another way (using the ASCII column), putting the rows into the list l:
from bs4 import BeautifulSoup
import requests
respond = requests.get("http://physics.nist.gov/PhysRefData/XrayMassCoef/ElemTab/z30.html")
soup = BeautifulSoup(respond.text)
l = soup.find('table').find('pre').get_text()[145:].split("\n")
print l
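To get the three columns the question asks for (Energy, u/p, u[en]/p), you can split each row string and transpose with zip. A sketch with illustrative values shaped like the NIST rows (not actual table data):

```python
# Each row is whitespace-separated: energy, u/p, u[en]/p.
# These sample values are made up for illustration.
rows = [
    '1.00000E-03  1.553E+03  1.550E+03',
    '1.01970E-03  1.492E+03  1.488E+03',
]

# split each row into its three fields, then transpose rows -> columns
energy, mu_rho, mu_en_rho = zip(*(r.split() for r in rows))
print(energy)   # ('1.00000E-03', '1.01970E-03')
print(mu_rho)   # ('1.553E+03', '1.492E+03')
```

The same transpose would apply to the list l from either answer above, once the header lines are skipped.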

create list from parsed web page in python

I'm a little new to web parsing in Python. I am using Beautiful Soup. I would like to create a list by parsing strings from a webpage. I've looked around and can't seem to find the right answer. Does anyone know how to create a list of strings from a web page? Any help is appreciated.
My code is something like this:
from BeautifulSoup import BeautifulSoup
import urllib2
url="http://www.any_url.com"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
#The data I need is coming from HTML tag of td
page_find=soup.findAll('td')
for page_data in page_find:
    print page_data.string
#I tried to create my list here
page_List = [page_data.string]
print page_List
It's a little difficult to understand what you are trying to achieve, but if you want all the values of page_data.string in page_List, then your code should look like this:
page_List = []
for page_data in page_find:
    page_List.append(page_data.string)
Or using a list comprehension:
page_List = [page_data.string for page_data in page_find]
The problem with your original code is that you create the list using the text from the last td element only (i.e. outside of the loop which processes each td element).
I'd recommend lxml over BeautifulSoup; when you start scraping a lot of pages, the speed advantage of lxml is hard to ignore.
import requests
import lxml.html
dom = lxml.html.fromstring(requests.get('http://www.any_url.com').content)
page_list = [x for x in dom.xpath('//td/text()')]
print page_list
Here it is modified to read the web page as a string:
import requests
the_web_page_as_a_string = requests.get(some_path).content
from lxml import html
myTree = html.fromstring(the_web_page_as_a_string)
td_list = [ e for e in myTree.iter() if e.tag == 'td']
text_list = []
for td_e in td_list:
    text = td_e.text_content()
    text_list.append(text)

How to remove content in nested tags with BeautifulSoup?

How to remove content in nested tags with BeautifulSoup? These posts showed the reverse to retrieve the content in nested tags: How to get contents of nested tag using BeautifulSoup, and BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?
I have tried .text, but it only removes the tags:
>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
>>> bs(html).find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html).find_all('foo')[0].text
u'Something something blah blah something else'
Desired output:
Something something something else
You can check for bs4.element.NavigableString on children:
from bs4 import BeautifulSoup as bs
import bs4
html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"
def get_only_text(elem):
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item

print ''.join(get_only_text(bs(html).find_all('foo')[0]))
Output:
Something something something else
E.g.:
body = bs(html)
for tag in body.find_all('bar'):
    tag.replace_with('')
Here is my simple method: soup.body.clear() or soup.tag.clear().
Let's say you want to clear the content in <table></table> and add a new pandas DataFrame; later you can use this clear method to easily update your tables in an HTML file for your webpage instead of using Flask/Django:
import pandas as pd
import bs4
# I want to convert a 1.2 million row .csv into a DataFrame, then into an
# HTML table, and then add it to my webpage's HTML syntax. Later I want to
# easily update the data whenever the csv gets updated by simply switching a variable.
bizcsv = pd.read_csv("business.csv")
dframe = pd.DataFrame(bizcsv)
dfhtml = dframe.to_html()  # convert DataFrame to table, HTML format
dfhtml_update = dfhtml.strip('<table border="1" class="dataframe">, </table>')
"""use dfhtml_update later to update your table without the <table> tags,
the <table> is easy for BS to select & clear!"""
#A small function to unescape (&lt; to <) the tags back into HTML format
def unescape(s):
    s = s.replace("&lt;", "<")
    s = s.replace("&gt;", ">")
    # this has to be last:
    s = s.replace("&amp;", "&")
    return s
with open("page.html") as page: #return to here when updating
    txt = page.read()
soup = bs4.BeautifulSoup(txt, features="lxml")
soup.body.append(dfhtml) #adds table to <body>
with open("page.html", "w") as outf:
    outf.write(unescape(str(soup))) #writes to page.html
"""lets say you want to make seamless table updates to your
webpage instead of using flask or django x_x; return to with open function"""
soup.table.clear() #clears everything in <table></table>
soup.table.append(dfhtml_update)
with open("page.html", "w") as outf:
    outf.write(unescape(str(soup)))
I'm a newbie, but after tons of searching I just combined a bunch of fundamental teachings from the documentation. It's kind of bloated, but so is working with literally billions of cells of data. This works for me.
What you are trying to do is kill a tag (bar) along with its content (blah blah). Here is the code, along with an explanation:
from bs4 import BeautifulSoup as bs
html = "<foo>Something something <bar> blah blah</bar> something</foo>"
soup = bs(html) # this is the soup
#lets find all bars and remove it along with content. decompose does it.
for bar in soup.find_all('bar'):
    bar.decompose()
print(soup) # returns " <html><body><foo>Something something something</foo></body></html>"
# now lets extract the text with .text
print(soup.text)
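Related to decompose(): if you need the removed content later, BeautifulSoup's extract() removes the tag from the tree but returns it instead of destroying it. A minimal sketch using the same sample markup as the examples above:

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
soup = BeautifulSoup(html, "html.parser")

# extract() detaches <bar> from the tree and hands it back
removed = soup.bar.extract()
print(removed.get_text())    # the removed "blah blah" text
print(soup.foo.get_text())   # the remaining text, without <bar>'s content
```

Use decompose() when you just want the content gone, and extract() when the removed fragment still matters.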
