Good morning, dear SO community. I recently had a small problem when trying to parse HTML. I always use the bs4 module and this was always fine until now. I mostly need hidden inputs when scraping, and I could easily find the value by searching for the name. But now I found a page where the input also has an id, like this:
<input type="hidden" value="985207" name="order[ship_address_attributes]
[id]" id="order_ship_address_attributes_id">
I want to find the value when the rest is known.
I tried leaving the id part out and searching with the name only, as I am used to, but this didn't go well and I didn't find the value.
My code:
soup = bs(r.text, 'lxml')
vle = soup.find('input', {'name': 'ship_address_attributes'})['value']
I hope to find the value in a way similar to how I tried. Is there a method to add the id just like the name? I would be very grateful for any help. Thanks a lot, and happy holidays to the whole community.
Why not select it by id?
vle = soup.find('input',{'id':'order_ship_address_attributes_id'})['value']
If the name value had no space or newline in it, you could select it with:
vle = soup.find('input', {'name':'order[ship_address_attributes][id]'})['value']
And this will select every input with type=hidden that has both a name and an id attribute:
hiddenInputs = soup.select('input[type=hidden]')
for inp in hiddenInputs:
    if inp.get('name') and inp.get('id'):
        print(inp['value'])
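If you do want to match on the name despite the embedded newline, bs4 also accepts a callable as an attribute filter. A minimal sketch, using the snippet from the question and normalizing whitespace before comparing:

```python
from bs4 import BeautifulSoup

# The input from the question; the name attribute contains a literal newline.
html = '''<input type="hidden" value="985207" name="order[ship_address_attributes]
[id]" id="order_ship_address_attributes_id">'''

soup = BeautifulSoup(html, 'html.parser')

def name_matches(name):
    # Strip all whitespace from the attribute before comparing,
    # so the newline inside the name no longer matters.
    return name is not None and ''.join(name.split()) == 'order[ship_address_attributes][id]'

value = soup.find('input', attrs={'name': name_matches})['value']
print(value)  # 985207
```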
You can use regex along with BeautifulSoup to find the right tag.
For example:
import re
from bs4 import BeautifulSoup as bs
a = '''<input type="hidden" value="985207" name="order[ship_address_attributes]
[id]" id="order_ship_address_attributes_id">'''
# Or:
# soup = bs(a, 'lxml')
soup = bs(a, 'html.parser')
data = soup.find('input', {'name': re.compile(r'order\[\w+\]\s+\[\w+\]')})
print(data['value']) # 985207
Or if you want to find the tag with the exact regex match, you can do:
data = soup.find('input', {'name': re.compile(r'order\[ship_address_attributes\]\s+\[id\]')})
print(data['value']) # 985207
I am trying to extract data from this website - https://www.airtasker.com/users/brad-n-11346775/.
So far, I have managed to extract everything except the license number. The problem I'm facing is bizarre, because the license number is plain text just like everything else. I was able to extract everything else, like the Name, Address, etc. For example, to extract the Name, I just did this:
name.append(pro.find('div', class_= 'name').text)
And it works just fine.
This is what I have tried, but the output I get is None:
license_number.append(pro.find('div', class_= 'sub-text'))
When I do:
license_number.append(pro.find('div', class_= 'sub-text').text)
It gives me the following error:
AttributeError: 'NoneType' object has no attribute 'text'
That means it does not recognise the license number as text, even though it is text.
Can someone please give me a working solution and tell me what I am doing wrong?
The badge with the license number is added to the HTML dynamically from a bootstrap JSON that sits in one of the <script> tags.
You can find the tag with bs4 and scoop out the data with regex and parse it with json.
Here's how:
import ast
import json
import re
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.airtasker.com/users/brad-n-11346775/").text
script = BeautifulSoup(page, "lxml").find_all("script")[-4]
bootstrap_JSON = json.loads(
    ast.literal_eval(re.search(r"parse\((.*)\)", script.string).group(1))
)
print(bootstrap_JSON["profile"]["badges"]["electrical_vic"]["reference_code"])
Output:
Licence No. 28661
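The same approach works on any page that bootstraps its data through a JSON.parse(...) call inside a <script> tag; only the regex and the key path change per site. A minimal self-contained sketch (the HTML and key names below are made up for illustration):

```python
import ast
import json
import re
from bs4 import BeautifulSoup

# Made-up page for illustration: the data sits inside a JSON.parse('...') call.
html = """<html><body>
<script>var BOOTSTRAP = JSON.parse('{"profile": {"badges": {"licence": "No. 12345"}}}');</script>
</body></html>"""

script = BeautifulSoup(html, "html.parser").find("script")

# The regex scoops out the argument of JSON.parse(...); ast.literal_eval
# turns the quoted JS string literal into a plain Python string.
raw = re.search(r"JSON\.parse\((.*)\)", script.string).group(1)
data = json.loads(ast.literal_eval(raw))
print(data["profile"]["badges"]["licence"])  # No. 12345
```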
My title may not be the most precise, but I had some trouble coming up with a better one, and considering it's work hours, I'll go with this.
What I am trying to do is get the links from this specific page, then use a RE to find the links that are job ads containing certain keywords.
Currently I find 2 ads, but I haven't been able to get all the ads that match my keyword (in this case "säljare", Swedish for salesperson).
I would appreciate it if anyone could look at my RE and hint at how to fix it. Thank you! :)
import urllib, urllib.request
import re
from bs4 import BeautifulSoup
url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
reKey = re.compile('^<a.*?href=\"(.*?)\".*?>(.*säljare.*)</a>')
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')
for link in dataSoup.find_all('a'):
    linkMatch = re.match(reKey, str(link))
    if linkMatch:
        print(linkMatch)
        print(linkMatch.group(1), linkMatch.group(2))
If I understand your question correctly, you do not need a regex at all. Just check whether the title attribute containing the job title is present in the link, and then check it against a list of keywords (I added truckförare as a second keyword).
import urllib, urllib.request
from bs4 import BeautifulSoup

url = "https://se.indeed.com/jobb?l=V%C3%A4stra+G%C3%B6talands+L%C3%A4n&start=10&pp=AAoAAAFd6hHqiAAAAAEX-kSOAQABQVlE682pK5mDD9vTZGjJhZBXQGaw6Nf2QaY"
keywords = ['säljare', 'truckförare']
data = urllib.request.urlopen(url)
dataSoup = BeautifulSoup(data, 'html.parser')

for link in dataSoup.find_all('a'):
    # if we do have a title attribute, check for all keywords
    # if at least one of them is present,
    # then print the title and the href attribute
    if 'title' in link.attrs:
        title = link.attrs['title'].lower()
        for kw in keywords:
            if kw in title:
                print(title, link.attrs['href'])
While I personally like regexes (yes, I'm that kind of person), most of the time you can get away with a little parsing in Python, which IMHO makes the code more readable.
Instead of using re, you can try the in keyword:
for link in dataSoup.find_all('a'):
    if keyword in link.text:
        print(link)
A working solution:
<a[^>]+href=\"([^\"]+)\"[^>]+title=\"((?=[^\"]*säljare[^\"]*)[^\"]+)\"
<a // literal
[^>]+ // 1 or more not '>'
href=\"([^\"]+)\" // href literal then 1 or more not '"' grouped
[^>]+ // 1 or more not '>'
title=\" // literal
( // start of group
(?=[^\"]*säljare[^\"]*) // look ahead and match literal enclosed by 0 or more not '"'
[^\"]+ // 1 or more not '"'
)\" // end of group
Flags: global, case insensitive
Assumes: title after href
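For completeness, the same pattern translated into Python (the anchor tags below are made up for illustration):

```python
import re

# The regex from the answer as a Python raw string.
pattern = re.compile(
    r'<a[^>]+href="([^"]+)"[^>]+title="((?=[^"]*säljare[^"]*)[^"]+)"',
    re.IGNORECASE,
)

# Made-up anchor tags for illustration; only the first title contains the keyword.
html = '''<a rel="nofollow" href="/jobb/1" title="Säljare till butik">x</a>
<a rel="nofollow" href="/jobb/2" title="Truckförare">y</a>'''

for href, title in pattern.findall(html):
    print(href, title)  # /jobb/1 Säljare till butik
```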
I am trying to grab all the text inside a tag that has a specific class name. I believe I am very close to getting it right, so I think all it'll take is a simple fix.
In the website these are the tags I'm trying to retrieve data from. I want 'SNP'.
<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>
From what I have currently:
from lxml import html
import requests

def main():
    url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
    page = html.fromstring(requests.get(url_link).text)
    for span_tag in page.xpath("//span"):
        class_name = span_tag.get("class")
        if class_name is not None:
            if "rtq_exch" == class_name:
                print(url_link, span_tag.text)

if __name__ == "__main__":
    main()
I get this:
http://finance.yahoo.com/q?s=^GSPC&d=t None
To show that it works, when I change this line:
if "rtq_dash" == class_name:
I get this (please note the '-' which is the same content between the tags):
http://finance.yahoo.com/q?s=^GSPC&d=t -
What I think is happening is that it sees the child tag and stops grabbing the data, but I'm not sure why.
I would be happy with receiving
<span class="rtq_dash">-</span>SNP
as a string for span_tag.text, as I can easily chop off what I don't want.
A higher description, I'm trying to get the stock symbol from the page.
Here is the documentation for requests, and here is the documentation for lxml (xpath).
I want to use xpath instead of BeautifulSoup for several reasons, so please don't suggest changing to use that library instead, not that it'd be any easier anyway.
There are a few possible ways. You can find the outer span and return its direct-child text node:
>>> url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
>>> page = html.fromstring(requests.get(url_link).text)
>>> for span_text in page.xpath("//span[@class='rtq_exch']/text()"):
... print(span_text)
...
SNP
or find the inner span and get its tail:
>>> for span_tag in page.xpath("//span[@class='rtq_dash']"):
... print(span_tag.tail)
...
SNP
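If you would rather get the whole visible text of the outer span in one call, lxml elements also have a text_content() method, which concatenates the text of all descendants together with their tails. A small sketch using the snippet from the question:

```python
from lxml import html

# The markup from the question, parsed on its own for illustration.
fragment = html.fromstring(
    '<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>'
)

# text_content() yields the inner span's text ('-') plus its tail ('SNP ').
print(fragment.text_content())  # prints '-SNP ' (note the trailing space)
```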
Use BeautifulSoup:
import bs4

html = """<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>"""
soup = bs4.BeautifulSoup(html, "html.parser")
snp = list(soup.findAll("span", class_="rtq_exch")[0].strings)[1]
This is an easy one, I am sure. I am parsing a website and I am trying to get the specific text in between tags. The text will be one of [revoked, Active, Default]. I am using Python. I have been able to print out all the inner text results, but I have not been able to find a good solution on the web for matching specific text. Here is my code:
from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = urllib2.urlopen("Some URL")
content = url.read()
soup = BeautifulSoup(content)

for tag in soup.findAll(re.compile("^a")):
    print(tag.text)
I'm still not sure I understand what you are trying to do, but I'll try to help.
soup.find_all('a', text=['revoked', 'Active', 'Default'])
This will select only those <a …> tags whose text is exactly one of the given strings (note the match is case-sensitive, so use the capitalization from the page).
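A self-contained sketch of that filter; the markup below is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = '''<a href="/1">Active</a>
<a href="/2">revoked</a>
<a href="/3">something else</a>'''

soup = BeautifulSoup(html, 'html.parser')

# text= keeps only the <a> tags whose string equals one of the list entries.
tags = soup.find_all('a', text=['revoked', 'Active', 'Default'])
print([t.text for t in tags])  # ['Active', 'revoked']
```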
I've used the snippet below on a similar occasion. See if it works for your goal:
table = soup.find(id="Table3")
for i in table.stripped_strings:
    print(i)
Using BeautifulSoup to parse source code for scraping:
tempSite = preSite + '/contact_us/'
print tempSite
theTempSite = urlopen(tempSite).read()
currentTempSite = BeautifulSoup(theTempSite)
lightwaveEmail = currentTempSite('input')[7]
# <input type="Hidden" name="bb_recipient" value="comm2342@gmail.com" />
How can I re.compile lightwaveEmail so that only comm2342@gmail.com is printed?
You're kinda going about it the wrong way. The reason it's the wrong way is that you're using numbered indexes to find the tag you want; BeautifulSoup will find tags for you based on their name or attributes, which makes it a lot simpler.
You want something like
tempSite = preSite+'/contact_us/'
print tempSite
theTempSite = urlopen(tempSite).read()
soup = BeautifulSoup(theTempSite)
tag = soup.find("input", { "name" : "bb_recipient" })
print tag['value']
If the question is how to get the value attribute from the tag object, then you can use it as a dictionary:
lightwaveEmail['value']
You can find more information about this in the BeautifulSoup documentation.
If the question is how to find in the soup all input tags with such a value, then you can look for them as follows:
soup.findAll('input', value=re.compile(r'comm2342@gmail\.com'))
You can find a similar example also in the BeautifulSoup documentation.
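Both lookups can be combined in one runnable sketch, using the hidden input from the question as a standalone document:

```python
import re
from bs4 import BeautifulSoup

# The hidden input from the question, parsed on its own for illustration.
html = '<input type="Hidden" name="bb_recipient" value="comm2342@gmail.com" />'
soup = BeautifulSoup(html, 'html.parser')

# Look the tag up by its name attribute, then read value like a dict key.
tag = soup.find('input', {'name': 'bb_recipient'})
print(tag['value'])  # comm2342@gmail.com

# Or search by the value itself with a regex.
matches = soup.find_all('input', value=re.compile(r'@gmail\.com$'))
print(len(matches))  # 1
```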