Python Regular Expression string exclusion

Using BeautifulSoup to parse source code for scraping:
tempSite = preSite+'/contact_us/'
print tempSite
theTempSite = urlopen(tempSite).read()
currentTempSite = BeautifulSoup(theTempSite)
lightwaveEmail = currentTempSite('input')[7]
#<input type="Hidden" name="bb_recipient" value="comm2342#gmail.com" />
How can I re.compile lightwaveEmail so that only comm2342#gmail.com is printed?

You're kinda going about it the wrong way. The reason it's the wrong way is that you're using a numbered index to find the tag you want; BeautifulSoup will find tags for you based on their name or attributes, which makes it a lot simpler.
You want something like
tempSite = preSite+'/contact_us/'
print tempSite
theTempSite = urlopen(tempSite).read()
soup = BeautifulSoup(theTempSite)
tag = soup.find("input", { "name" : "bb_recipient" })
print tag['value']

If the question is how to get the value attribute from the tag object, then you can use it as a dictionary:
lightwaveEmail['value']
You can find more information about this in the BeautifulSoup documentation.
If the question is how to find in the soup all input tags with such a value, then you can look for them as follows:
soup.findAll('input', value=re.compile(r'comm2342#gmail.com'))
You can find a similar example also in the BeautifulSoup documentation.
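Putting those two answers together, a minimal sketch (assuming Python 3, bs4, and that preSite is already defined as in the question) could look like the following; the re step is only there because the question asks for re.compile, since tag['value'] already holds the address:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
tempSite = preSite + '/contact_us/'
soup = BeautifulSoup(urlopen(tempSite).read(), 'html.parser')
tag = soup.find('input', {'name': 'bb_recipient'})
# '#' instead of '@' because that is how the page writes the address
pattern = re.compile(r'[\w.+-]+#[\w.-]+')
match = pattern.search(tag['value'])
if match:
    print(match.group(0))  # comm2342#gmail.com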

Get HTML class attribute by the content with bs4 web scraping

So I'm currently trying to get a certain attribute just by the content of the HTML element.
I know how to get an attribute by another attribute in the same HTML section. But this time I need the attribute by the content of the section.
"https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744" this is the link I try to scrape.
So I'm trying to get the "data-id" just by the " US 12"
What I tried was to get it in a similar way to how I'd get an attribute by another attribute.
This is my code:
def carting():
    a = session.get(producturl, headers=headers, proxies=proxy)
    soup = BeautifulSoup(a.text, "html.parser")
    product_id = soup.find("div", {"class": "product-grid"})["data-product-id"]
    option_id = soup.find("option", {"option": " US 12"})["data-id"]
    print(option_id)
carting()
This is what I get:
'NoneType' object is not subscriptable
I know the code is wrong and doesn't work as I wrote it, but I can't figure out how else I'm supposed to do it.
I'd appreciate any help, and of course if you need more information, just ask.
Kind regards
Try:
import requests
from bs4 import BeautifulSoup
url = "https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
sizes = soup.select_one("#product-size-chooser")
print(sizes.select_one('option:-soup-contains("US 12")')["data-id"])
Print:
16
I suggest filtering the text using a regex, since there is whitespace around it:
soup.find("option", text=re.compile("US 12"))["data-id"]
There are a lot of ways to achieve this:
1st:
You can extract all the options and pick only the one you want with a loop:
# find all the option tags that have the attribute "data-id"
for option in soup.find_all("option", attrs={'data-id': True}):
    if option.text.strip() == 'US 12':
        print(option.text.strip(), '/', option['data-id'])
        break
2nd:
You can use a regular expression (regex):
import re
# get the option that has "US 12" in the string
option = soup.find('option', string=re.compile('US 12'))
3rd:
Using CSS selectors:
# get the option that has the attribute "data-id" and "US 12" in the string
option = soup.select_one('#product-size-chooser > option[data-id]:-soup-contains-own("US 12")')
I recommend you learn more about CSS selectors.
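For completeness, a minimal end-to-end sketch of the regex approach (assuming requests and bs4 are installed and the page still serves the same markup) might be:
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.skatedeluxe.ch/de/adidas-skateboarding-busenitz-vulc-ii-schuh-white-collegiate-navy-bluebird_p155979?cPath=216&value[55][]=744"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# string= accepts a compiled pattern, so the surrounding whitespace doesn't matter
option = soup.find("option", string=re.compile("US 12"))
if option is not None:
    print(option["data-id"])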

How to find hidden input value with id and name - Python, bs4

Good morning, dear SO community. I've had a small problem lately when trying to parse HTML. I always use the bs4 module and this has always been fine until now. I mostly need hidden inputs when scraping, and I could easily find the value when I searched for them by name. But now I've found a page where the input also has an id, like this:
<input type="hidden" value="985207" name="order[ship_address_attributes]
[id]" id="order_ship_address_attributes_id">
I want to find the value if the rest is known.
I tried just leaving the id part out and searching by the name only, like I'm used to, but that didn't go well and I didn't find the value.
my code:
soup=bs(r.text, 'lxml')
vle=soup.find('input',{'name':'ship_address_attributes'})['value']
I hope to find a way to get the value, similar to what I tried. Is there a way to add the id just like the name? I'd be very happy about any help. Thanks a lot, and I wish the whole community happy holidays.
Why not select it by id?
vle = soup.find('input',{'id':'order_ship_address_attributes_id'})['value']
If the name value has no space or newline, select it with
vle = soup.find('input', {'name':'order[ship_address_attributes][id]'})['value']
And this will select every input with type=hidden that has both a name and an id attribute:
hiddenInputs = soup.select('input[type=hidden]')
for input in hiddenInputs:
    if input.get('name') and input.get('id'):
        print(input['value'])
You can use regex along with BeautifulSoup to find the right tag.
For example:
import re
from bs4 import BeautifulSoup as bs
a = '''<input type="hidden" value="985207" name="order[ship_address_attributes]
[id]" id="order_ship_address_attributes_id">'''
# Or:
# soup = bs(a, 'lxml')
soup = bs(a, 'html.parser')
data = soup.find('input', {'name': re.compile(r'order\[\w+\]\s+\[\w+\]')})
print(data['value']) # 985207
Or if you want to find the tag with the exact regex match, you can do:
data = soup.find('input', {'name': re.compile(r'order\[ship_address_attributes\]\s+\[id\]')})
print(data['value']) # 985207

Beautiful Soup Nested Tag Search

I am trying to write a python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try to find such a tag using the page.findAll() method (page is the Beautiful Soup object containing the whole page), it simply doesn't find any, although they are there. Is there any simple method or another way to do it?
I'm guessing that what you are trying to do is first look in a specific div tag and then search all p tags in it and count them, or do whatever you want. For example:
soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
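If you do want the comprehension version hinted at above, a sketch (with 'xyz' and 'abc' still standing in for your real tag names) could be:
# Same result as the loop above, built in one expression
data = [tag
        for nested_soup in soup.find_all('xyz')
        for tag in nested_soup.find_all('abc')]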
UPDATE: I noticed that text does not always return the expected result; at the same time, I realized there is a built-in way to get the text. Sure enough, reading the docs, we see that there is a method called get_text(). Use it as:
from bs4 import BeautifulSoup
fd = open('index.html', 'r')
website = fd.read()
fd.close()
soup = BeautifulSoup(website)
contents = soup.get_text(separator=" ")
print "number of words %d" % len(contents.split(" "))
INCORRECT, please read above. Supposing that you have your HTML file locally in index.html, you can:
from bs4 import BeautifulSoup
import re
BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored
fd = open('index.html', 'r')
website = fd.read()
soup = BeautifulSoup(website)
tags = soup.find_all(True)  # find everything
print "there are %d" % len(tags)
count = 0
matcher = re.compile("(\s|\n|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # Split using tokens such as \s and \n
    temp = filter(None, temp)  # remove empty elements in the list
    count += len(temp)
print "number of words in the document %d" % count
fd.close()
Please note that it may not be accurate, perhaps because of errors in formatting, false positives (it detects any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
You can find all <p> tags using regular expressions (re module).
Note that r.content is a string which contains the whole html of the site.
For example:
r = requests.get(url,headers=headers)
p_tags = re.findall(r'<p>.*?</p>',r.content)
This should get you all the <p> tags, irrespective of whether they are nested or not. And if you specifically want the <a> tags inside the <p> tags, you can pass the whole matched tag as a string in the second argument instead of r.content, as sketched below.
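As a rough sketch of that last point (tag names here are just examples, and with requests it is safer to feed re a str such as r.text rather than the bytes in r.content):
# For each matched <p> block, look for <a> tags nested inside it
for p_block in re.findall(r'<p>.*?</p>', r.text, re.S):
    a_tags = re.findall(r'<a\b.*?</a>', p_block, re.S)
    print(a_tags)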
Alternatively, if you want just the text, you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the HTML from the site, and you can then proceed with the parsing.
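From there, a rough word count over the simplified HTML (a sketch reusing the get_text() idea from the earlier answer) might look like:
from bs4 import BeautifulSoup
text = BeautifulSoup(simplified_html, "html.parser").get_text(separator=" ")
print("number of words %d" % len(text.split()))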

Python - Retrieve specific "object" from URL with BeautifulSoup

I'm trying to parse a specific "item" on a site, but I don't know if it's a class, object, id or something else.
my code:
soup = BeautifulSoup(urllib2.urlopen(myURL))
divdata = soup.find('div')
print(divdata)
And it returns:
<div data-store='{"Auth":{"cookie":null,"user":null,"timestamp":1485297666762},"Blocked":{},"Broadcast":
{"forceUpdate":false,"failed":[],"pending":[],"error":
{"isNotFound":false,"isServerError":false,"isUnavailable":false}},"BroadcastCache":{"broadcasts":{"ID1":{"broadcast":
{"data":{"class_name":"Broadcast","id":"ID1","state":"running,
....(more)....
So I want to retrieve the "running", or whatever is in "state".
I tried
statedata = soup.find('div', {"class":"state"})
But it returns nothing. What is the correct way to retrieve it?
import json
div_tag = soup.find('div', {'data-store':True})
data_string = div_tag['data-store'] # get data string
json.loads(data_string)['BroadcastCache']['broadcasts']['ID1']['broadcast']['data']['state'] # convert data string to python dict and get state
out:
'running'
The correct syntax is soup.find_all('div', class_='state').
Note the trailing underscore in class_.
It's unlikely to work in your case without modification, though, since the div doesn't appear to have a class at all: 'data-store' is an attribute, and the rest is just a string (the attribute's value), not the content of a tag. You could just use string.find('\"state\"') on that one.
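A minimal sketch of that string-search idea, reusing div_tag from the answer above (json.loads is still the more robust route, since a plain find breaks if the key order or formatting changes):
data_string = div_tag['data-store']
start = data_string.find('"state":"') + len('"state":"')
end = data_string.find('"', start)
print(data_string[start:end])  # running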

Match HTML tags in two strings using regex in Python

I want to verify that the HTML tags present in a source string are also present in a target string.
For example:
>> source = '<em>Hello</em><label>What\'s your name</label>'
>> verify_target('<em>Hi</em><label>My name is Jim</label>')
True
>> verify_target('<label>My name is Jim</label><em>Hi</em>')
True
>> verify_target('<em>Hi<label>My name is Jim</label></em>')
False
I would get rid of Regex and look at Beautiful Soup.
findAll(True) lists all the tags found in your source.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(source)
allTags = soup.findAll(True)
[tag.name for tag in allTags ]
[u'em', u'label']
Then you just need to remove possible duplicates and compare your tag lists.
This snippet verifies that ALL of source's tags are present in target's tags.
from BeautifulSoup import BeautifulSoup
def get_tags_set(source):
    soup = BeautifulSoup(source)
    all_tags = soup.findAll(True)
    return set([tag.name for tag in all_tags])
def verify(tags_source_orig, tags_source_to_verify):
    return tags_source_orig == set.intersection(tags_source_orig, tags_source_to_verify)
source = '<label>What\'s your name</label><label>What\'s your name</label><em>Hello</em>'
source_to_verify = '<em>Hello</em><label>What\'s your name</label><label>What\'s your name</label>'
print verify(get_tags_set(source), get_tags_set(source_to_verify))
I don't think regex is the right way here, basically because HTML is not just a flat string; it's a bit more complex, with nested tags.
I suggest you use HTMLParser: create a class that parses the original source and builds a structure from it, then verify that the same data structure holds for the targets to be verified.
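A rough sketch of that idea with the standard-library html.parser, here taking both strings explicitly and recording, for every tag, its chain of ancestor tag names, so that reordering siblings still verifies but changing the nesting does not (this assumes reasonably well-formed HTML with matching end tags):
from collections import Counter
from html.parser import HTMLParser
class TagStructure(HTMLParser):
    """Collect, for every start tag, the tuple of its ancestor tag names."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()
    def handle_starttag(self, tag, attrs):
        self.paths[tuple(self.stack) + (tag,)] += 1
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
def tag_structure(html):
    parser = TagStructure()
    parser.feed(html)
    return parser.paths
def verify_target(source, target):
    # True when both strings contain the same tags with the same nesting
    return tag_structure(source) == tag_structure(target)
source = "<em>Hello</em><label>What's your name</label>"
print(verify_target(source, "<label>My name is Jim</label><em>Hi</em>"))   # True
print(verify_target(source, "<em>Hi<label>My name is Jim</label></em>"))   # False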
