Parsing brackets in HTML with Python - python

I am trying to parse some information that's in a var meta window, and I am a little confused about how to grab just the value for the "id".
My code is below:
import requests
from bs4 import BeautifulSoup as bs
from colorama import Fore

url = input("\n\nEnter URL: ")
print(Fore.MAGENTA + "\nSetting link . . .")

def printID():
    print("")
    session = requests.session()
    response = session.get(url)
    soup = bs(response.text, 'html.parser')
    form = soup.find('script', {'id' : 'ProductJson-product-template'})
    scripts = soup.findAll('id')
    #get the id
    '''
    for scripts in form:
        data = soup.find_all()
        print data
    '''
    print(form)

printID()
This prints the following output:
<script id="ProductJson-product-template" type="application/json">
{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}
</script>
Again, I want to print just the value of the ID (463448473639).

You can retrieve all of the tag's attributes using the following syntax:
form.attrs
If you are looking for a specific attribute, it's a dictionary:
form['id']
The full code is below:
from bs4 import BeautifulSoup
html_doc="""<script id="ProductJson-product-template" type="application/json">
{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}
</script>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find("script").attrs)
print(soup.find("script")['id'])
However, if you want the value of "id" from the inner text {"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"},
one way to do it is as below. This works here because the text contains no true, false, or null values; json.loads is the more general choice.
import ast
innerText = soup.find("script").getText()
print(innerText)
print(ast.literal_eval(innerText.strip()).get("id"))

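It looks like you will want to do something like:
import json
product_id = json.loads(form.get_text())['id']
Here form is the script tag found by soup.find in the question's code (scripts would be empty, because findAll('id') looks for tags named id). I haven't tested that, but if you want to get what is in between the script tags, I think that is the way to do it. See the get_text documentation.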
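To make the JSON route concrete, here is a minimal sketch. It assumes the string inner_text is what get_text() returns for the script tag shown in the question; the variable names are only illustrative.

```python
import json

# Assumption: this is the text that get_text() returns for the <script> tag
# from the question, including the surrounding newlines.
inner_text = """
{"id":463448473639,"title":"n/a","handle":"n/a","description":"n/a"}
"""

# json.loads ignores leading and trailing whitespace around the document.
product = json.loads(inner_text)
product_id = product["id"]
print(product_id)
```

The value comes back as an int, so wrap it in str() if you need it as text.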

Related

Indeed scraper bs4, splitting parsed HTML code after grabbing it

import pandas as pd
from bs4 import BeautifulSoup
import requests
import os

url = 'https://fr.indeed.com/jobs?q=data%20anlayst&l=france'

#grabbing page content and parsing it into html
def data_grabber(url):
    page = requests.get(url)
    html = page.text
    soup = BeautifulSoup(html, 'html.parser')
    job_soup = soup.find_all('div', {"class":"job_seen_beacon"})
    return job_soup

def job_title(url):
    titles = data_grabber(url)
    for title in titles:
        t = title.find_all('tbody')
        return t
This is my source code, and I'm testing it in a Jupyter notebook to make sure my functions work correctly, but I've hit a small roadblock. The soup from my first function works perfectly: it grabs all the info from Indeed, in particular the job_seen_beacon class.
My job_title function is wrong because it only outputs the first 'tbody' element it finds (I would attach a screenshot, but I don't have enough reputation points on Stack Overflow),
while data_grabber returns every single job_seen_beacon. If you could scroll through the output, you would easily see the multiple job_seen_beacon elements.
I'm clearly missing something but I can't see it. Any ideas?
What happens?
The moment you return something from a function, you leave that function, and here that happens in the first iteration of the loop.
I'm not sure where you want to end up with your code, but you could do something like this:
def job_title(item):
    title = item.select_one('h2')
    return title.get_text('|',strip=True).split('|')[-1] if title else 'No Title'
Example
from bs4 import BeautifulSoup
import requests

url = 'https://fr.indeed.com/jobs?q=data%20anlayst&l=france'

#grabbing page content and parsing it into html
def data_grabber(url):
    page = requests.get(url)
    html = page.text
    soup = BeautifulSoup(html, 'html.parser')
    job_soup = soup.find_all('div', {"class":"job_seen_beacon"})
    return job_soup

def job_title(item):
    title = item.select_one('h2')
    return title.get_text('|',strip=True).split('|')[-1] if title else 'No Title'

def job_location(item):
    location = item.select_one('div.companyLocation')
    return location.get_text(strip=True) if location else 'No Location'

data = []
for item in data_grabber(url):
    data.append({
        'title':job_title(item),
        'companyLocation':job_location(item)
    })
data
Output
[{'title': 'Chef de Projet Big Data H/F', 'companyLocation': 'Lyon (69)'},{'title': 'Chef de Projet Big Data F/H', 'companyLocation': 'Lyon 9e (69)'}]
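The early-return behaviour described in this answer can be reproduced without any scraping. The sketch below uses plain strings as hypothetical stand-ins for the job cards; the function names are made up for illustration.

```python
def first_only(items):
    # Same pattern as the question: the return sits inside the loop,
    # so the function exits during the first iteration.
    for item in items:
        return item.upper()


def all_items(items):
    # Collect every result, then return once after the loop has finished.
    results = []
    for item in items:
        results.append(item.upper())
    return results


cards = ["analyst", "engineer", "manager"]  # hypothetical stand-ins for job cards
print(first_only(cards))
print(all_items(cards))
```

Note that first_only also returns None for an empty list, because the loop body never runs.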

how to use python to parse a html that is in txt format?

I am trying to parse a txt file; an example is at the link below.
The txt, however, is in the form of HTML. I am trying to get "COMPANY CONFORMED NAME", which is located at the top of the file, and my function should return "Monocle Acquisition Corp".
https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt
I have tried the following:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
However, "soup" does not contain "COMPANY CONFORMED NAME" at all.
Can someone point me in the right direction?
The data you are looking for is not in an HTML structure, so Beautiful Soup is not the best tool. A simple and fast way of searching for this data is a regular expression like this:
import re
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()
name_re = re.compile("COMPANY CONFORMED NAME:[\\t]*(.+)\n")
match = name_re.search(text_string).group(1)
print(match)
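Here is an offline sketch of the same regular-expression idea, in case you want to try it without hitting the SEC server. The sample_header text is a shortened, hand-written stand-in for the top of the filing, and the company_name helper is hypothetical. It also guards against the label being missing, since search() returns None in that case.

```python
import re

# Assumption: a shortened, hand-written stand-in for the top of the filing.
sample_header = (
    "<SEC-HEADER>\n"
    "ACCESSION NUMBER:\t\t0001571049-19-000004\n"
    "\tCOMPANY DATA:\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tMonocle Acquisition Corp\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t0001754170\n"
    "</SEC-HEADER>\n"
)

# [ \t]* skips the padding after the colon; (.+) captures the rest of the line,
# because "." does not match a newline by default.
name_re = re.compile(r"COMPANY CONFORMED NAME:[ \t]*(.+)")


def company_name(text):
    match = name_re.search(text)
    # search() returns None when the label is absent, so check before group().
    return match.group(1).strip() if match else None


print(company_name(sample_header))
```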
The part you are looking for is inside a large <SEC-HEADER> tag.
You can get the whole section with soup.find('sec-header'),
but you will need to parse the section manually. Something like this works, though it's a bit of a dirty job:
(view it on Replit: https://repl.it/#gui3/stackoverflow-parsing-html)
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
header = soup.find('sec-header').text
company_name = None
for line in header.split('\n'):
    split = line.split(':')
    if len(split) > 1 :
        key = split[0]
        value = split[1]
        if key.strip() == 'COMPANY CONFORMED NAME':
            company_name = value.strip()
            break
print(company_name)
There may be some library able to parse this data better than this code
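If you need more than one field, the manual line-by-line parsing could be generalised into a dictionary. This is only a sketch: parse_header is a hypothetical helper, and sample is a hand-written stand-in for the text of the header section. It uses str.partition so that values containing a colon are not cut off.

```python
def parse_header(header_text):
    """Turn 'KEY: value' lines into a dict, skipping lines without a value."""
    fields = {}
    for line in header_text.splitlines():
        # partition splits on the first colon only.
        key, sep, value = line.partition(":")
        if sep and value.strip():
            # setdefault keeps the first occurrence if a key repeats.
            fields.setdefault(key.strip(), value.strip())
    return fields


# Assumption: hand-written stand-in for the text inside <SEC-HEADER>.
sample = (
    "ACCESSION NUMBER:\t\t0001571049-19-000004\n"
    "\tCOMPANY DATA:\n"
    "\t\tCOMPANY CONFORMED NAME:\t\t\tMonocle Acquisition Corp\n"
    "\t\tCENTRAL INDEX KEY:\t\t\t0001754170\n"
)

fields = parse_header(sample)
print(fields["COMPANY CONFORMED NAME"])
```

Section labels such as COMPANY DATA have nothing after the colon, so they are left out of the result.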

BeautifulSoup: Extracting string between tags does not seem to work

I am currently parsing this URL, which is passed as the argument to the parse function.
def parse(sitemap):
    req = urllib.request.urlopen(sitemap)
    soup = BeautifulSoup(req, 'lxml')
    soup.prettify()
    inventory_url = []
    inventory_url_set = set()
    for item in soup.find_all('url'):
        print(item.find('lastmod'))
        # print(item.find('lastmod').text)
        inventory_url_set.add(item.find('loc').text)
However, item.find('lastmod').text raises an AttributeError, whereas printing the whole tag with item.find('lastmod') works fine.
I'd like to obtain only the text inside the 'lastmod' tag within each 'item'.
Thanks
Not all of the url entries contain a lastmod. For those entries find returns None, and None has no .text attribute, so you need to test for that. If you use a dictionary, you could store the lastmod as values and still benefit from having unique URLs as follows:
from bs4 import BeautifulSoup
import urllib.request

def parse(sitemap):
    req = urllib.request.urlopen(sitemap)
    soup = BeautifulSoup(req, 'lxml')
    inventory_urls = {}
    for url in soup.find_all('url'):
        if url.lastmod:
            lastmod = url.lastmod.text
        else:
            lastmod = None
        inventory_urls[url.loc.text] = lastmod
    for url, lastmod in inventory_urls.items():
        print(lastmod, url)

parse("https://www.kith.com/sitemap_products_1.xml")
This would give you a list starting as follows:
2017-02-12T03:55:25Z https://kith.com/products/adidas-originals-stan-smith-wool-pk-grey-white
2017-03-13T18:55:24Z https://kith.com/products/norse-projects-niels-pocket-boucle-tee-black
2017-03-15T17:20:47Z https://kith.com/products/ronnie-fieg-x-fracap-rf120-rust
2017-03-17T01:30:25Z https://kith.com/products/new-balance-696-birch
2017-01-23T08:43:56Z https://kith.com/products/ronnie-fieg-x-diamond-supply-co-x-asics-gel-lyte-v-1
2017-03-17T00:41:03Z https://kith.com/products/off-white-diagonal-ferns-hoodie-black
2017-03-16T15:01:55Z https://kith.com/products/norse-projects-skagen-bubble-crewneck-charcoal
2017-02-21T15:57:56Z https://kith.com/products/vasque-eriksson-gtx-brown-black
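The same "check for a missing lastmod" idea can be tried offline. Note that this sketch swaps BeautifulSoup for the standard library's xml.etree.ElementTree, purely so it runs without extra packages. SITEMAP is a tiny hand-written example and parse_sitemap is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Assumption: a tiny hand-written sitemap; the second entry has no <lastmod>.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/products/a</loc>
    <lastmod>2017-02-12T03:55:25Z</lastmod>
  </url>
  <url>
    <loc>https://example.com/products/b</loc>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups need the prefix map.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    inventory = {}
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", NS):
        loc = url.find("sm:loc", NS)
        lastmod = url.find("sm:lastmod", NS)
        if loc is None:
            continue
        # find() returns None for a missing child, so test before using .text.
        inventory[loc.text] = lastmod.text if lastmod is not None else None
    return inventory


for address, modified in parse_sitemap(SITEMAP).items():
    print(modified, address)
```

With ElementTree the check has to be "is not None" rather than a plain truth test, because an element without children evaluates as false.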

python passing argument containing quote

I'm learning to scrape text from the web. I've written the following function:
from bs4 import BeautifulSoup
import requests

def get_url(source_url):
    r = requests.get(source_url)
    data = r.text
    #extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    #get H3 tags with class ...
    h3list = soup.findAll("h3", { "class" : "entry-title td-module-title" })
    #create data structure to store links in
    ulist = []
    #pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist
I am calling this from a separate file...
from print1 import get_url
ulist = get_url("http://www.startupsmart.com.au/")
print(ulist[3])
The problem is that the CSS selector I am using is quite specific to the site I am parsing, so the function is a bit brittle. I want to pass the selector as an argument to the function.
If I add a parameter to the function definition
def get_url(source_url, css_tag):
and try to pass "h3", { "class" : "entry-title td-module-title" }
it fails with
TypeError: get_url() takes exactly 1 argument (2 given)
I tried escaping all the quotes but it still doesn't work.
I'd really appreciate some help. I can't find a previous answer to this one.
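Here's a version that works. The tag name and the attribute dictionary are two separate values, so the function takes them as two separate parameters: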
from bs4 import BeautifulSoup
import requests

def get_url(source_url, tag_name, attrs):
    r = requests.get(source_url)
    data = r.text
    # extract HTML for parsing
    soup = BeautifulSoup(data, 'html.parser')
    # get H3 tags with class ...
    h3list = soup.findAll(tag_name, attrs)
    # create data structure to store links in
    ulist = []
    # pull links from each article heading
    for href in h3list:
        ulist.append(href.a['href'])
    return ulist

ulist = get_url("http://www.startupsmart.com.au/", "h3", {"class": "entry-title td-module-title"})
print(ulist[3])
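If you would rather keep the selector together as one value, Python can unpack it at the call site. The sketch below is only an illustration of the argument passing: find_args is a hypothetical stand-in that returns what would be handed on to findAll, and no page is fetched.

```python
def find_args(tag_name, attrs=None):
    """Hypothetical stand-in: return the (name, attrs) pair meant for findAll()."""
    return tag_name, dict(attrs or {})


# Option 1: pass the tag name and the attribute dict as two arguments.
name, attrs = find_args("h3", {"class": "entry-title td-module-title"})

# Option 2: keep the selector in one tuple and unpack it with * at the call.
selector = ("h3", {"class": "entry-title td-module-title"})
name2, attrs2 = find_args(*selector)

print(name, attrs)
```

Either way the function receives two positional values, which is why its definition needs two parameters for them.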

Can print but not return html table: "TypeError: ResultSet object is not an iterator"

Python newbie here. Python 2.7 with beautifulsoup 3.2.1.
I'm trying to scrape a table from a simple page. I can easily get it to print, but I can't get it to return to my view function.
The following works:
@app.route('/process')
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    print table
    return 'All good'
I can also return html successfully. But when I try to return table instead of return 'All good' I get the following error:
TypeError: ResultSet object is not an iterator
I also tried:
br.open(queryURL)
html = br.response().read()
soup = BeautifulSoup(html)
table = soup.find("table")
out = []
for row in table.findAll('tr'):
    colvals = [col.text for col in row.findAll('td')]
    out.append('\t'.join(colvals))
return table
With no success. Any suggestions?
You're trying to return an object rather than the text of that object, so return table.text should be what you are looking for. Full modified code:
def process():
    queryURL = 'http://example.com'
    br.open(queryURL)
    html = br.response().read()
    soup = BeautifulSoup(html)
    table = soup.find("table")
    return table.text
EDIT:
Since I understand now that you want the HTML code that forms the site instead of the values, you can do something like this example I made:
import urllib

url = urllib.urlopen('http://www.xpn.org/events/concert-calendar')
htmldata = url.readlines()
url.close()
for tag in htmldata:
    if '<th' in tag:
        print tag
    if '<tr' in tag:
        print tag
    if '<thead' in tag:
        print tag
    if '<tbody' in tag:
        print tag
    if '<td' in tag:
        print tag
As far as I know, BeautifulSoup is more for parsing HTML or printing it in a nice-looking manner, so I don't think you can do this with it directly (although the str(table) approach mentioned at the end of this thread suggests otherwise). You can do what I did instead: loop through the HTML line by line and print each line that contains one of the tags.
If you want to store the output in a list to use later, you would do something like:
htmlCodeList = []
for tag in htmldata:
    if '<th' in tag:
        htmlCodeList.append(tag)
    if '<tr' in tag:
        htmlCodeList.append(tag)
    if '<thead' in tag:
        htmlCodeList.append(tag)
    if '<tbody' in tag:
        htmlCodeList.append(tag)
    if '<td' in tag:
        htmlCodeList.append(tag)
This saves each matching HTML line as a new element of the list, so the first matching line (for example a <td> line) would be at index 0, the next at index 1, and so on.
After @Heinst pointed out that I was trying to return an object and not a string, I also found a more elegant solution: convert the BeautifulSoup object into a string and return that.
return str(table)
