I'm new to scraping, and I'm working on a scraping project where I'm trying to get a value from the HTML below:
<div class="buttons_zoom"><div class="full_prod"><a title="פרטים נוספים" href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>
I want to get this value:
379104
which is located in the onclick attribute. I'm using BeautifulSoup.
The code:
for i in page_content.find_all('div', attrs={'class': 'prodPrice'}):
    temp = i.parent.parent.contents[0]
temp comes back as a list of objects, and temp equals the HTML above.
Can someone help me extract this id?
Thanks!!
Edit******
Wow guys, thanks for the amazing explanations!!!!! But I have two issues. 1. The retry mechanism isn't working: I set timeout=1 in order to make it fail, but once it fails it returns:
requests.exceptions.RetryError: HTTPSConnectionPool(host='www.XXXXX.il', port=443): Max retries exceeded with url: /default.asp?catid=%7B2234C62C-BD68-4641-ABF4-3C225D7E3D81%7D (Caused by ResponseError('too many redirects',))
Can you please help me with the retry mechanism (code below)? 2. Performance: without the retry mechanism, with timeout=6, scraping 8000 items takes 15 minutes. How can I improve this code's performance? Code below:
def get_items(self, dict):
    itemdict = {}
    for k, v in dict.items():
        boolean = True
        # here, we fetch the content from the url, using the requests library
        while boolean:
            try:
                a = requests.Session()
                retries = Retry(total=3, backoff_factor=0.1,
                                status_forcelist=[301, 500, 502, 503, 504])
                a.mount('https://', HTTPAdapter(max_retries=retries))
                page_response = a.get('https://www.XXXXXXX.il' + v, timeout=1)
            except requests.exceptions.Timeout:
                print("Timeout occurred")
                logging.basicConfig(level=logging.DEBUG)
            else:
                boolean = False
        # we use the html parser to parse the url content and store it in a variable.
        page_content = BeautifulSoup(page_response.content, "html.parser")
        for i in page_content.find_all('div', attrs={'class': 'prodPrice'}):
            parent = i.parent.parent.contents[0]
            getparentfunc = parent.find("a", attrs={"href": "javascript:void(0)"})
            itemid = re.search(r".*'(\d+)'.*", getparentfunc.attrs['onclick']).groups()[0]
            itemName = re.sub(r'\W+', ' ', i.parent.contents[0].text)
            priceitem = re.sub(r'[\D.]+ ', ' ', i.text)
            itemdict[itemid] = [itemName, priceitem]
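A minimal sketch of how the fetching could be restructured to address both issues (hedged and untested; fetch, paths and max_workers=16 are illustrative choices, not from the original code). It builds the Session and the Retry policy once instead of once per item, drops 301 from status_forcelist since retrying on redirect statuses is a plausible cause of the "too many redirects" RetryError above, and overlaps the requests with a thread pool:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# one session + retry policy, created once and reused for every request
session = requests.Session()
retries = Retry(total=3, backoff_factor=0.1,
                status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))

def fetch(path):
    # 'https://www.XXXXXXX.il' is the placeholder host from the question
    return session.get('https://www.XXXXXXX.il' + path, timeout=6).content

paths = []  # hypothetical: the dict values (v) from get_items
with ThreadPoolExecutor(max_workers=16) as pool:
    for page in pool.map(fetch, paths):
        page_content = BeautifulSoup(page, "html.parser")
        # ... parse the prodPrice divs exactly as in the loop above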
from bs4 import BeautifulSoup as bs
import re

txt = """<div class="buttons_zoom"><div class="full_prod"><a title="פרטים נוספים" href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>"""
soup = bs(txt, 'html.parser')
a = soup.find("a", attrs={"href": "javascript:void(0)"})
data = a.attrs["onclick"]
r = re.search(r".*'(\d+)'.*", data).groups()[0]
print(r)  # will print '379104'
Edit
Replaced ".*\}.*,.*'(\d+)'\).*" with ".*'(\d+)'.*". They produce the same result but the latter is much cleaner.
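A quick check of that claim (a sketch; data is the onclick string shown further down in this answer):

import re

data = "js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"
print(re.search(r".*\}.*,.*'(\d+)'\).*", data).groups()[0])  # 379104
print(re.search(r".*'(\d+)'.*", data).groups()[0])           # 379104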
Explanation : Soup
Find the (first) a element whose "href" attribute has "javascript:void(0)" as its value (see the Beautiful Soup documentation on keyword arguments).
a = soup.find("a", attrs={"href":"javascript:void(0)"})
This is equivalent to
a = soup.find("a", href="javascript:void(0)")
In older versions of Beautiful Soup, which don’t have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for. -- see beautiful soup documentation about "attrs"
a points to an element of type <class 'bs4.element.Tag'>. We can access the tag's attributes like we would a dictionary, via the property a.attrs (see the Beautiful Soup documentation on attributes). That's what we do in the following statement.
a_tag_attributes = a.attrs # that's the dictionary of attributes in question...
The dictionary keys are named after the tag's attributes. Here we have the following keys/attribute names: 'title', 'href' and 'onclick'.
We can check that out for ourselves by printing them.
print(a_tag_attributes.keys()) # equivalent to print(a.attrs.keys())
This will output
dict_keys(['title', 'href', 'onclick']) # those are the attribute names (the keys to our dictionary)
From here, we need to get the data we are interested in. The key to our data is "onclick" (it's named after the html attribute where the data we seek lies).
data = a_tag_attributes["onclick"] # equivalent to data = a.attrs["onclick"]
data now holds the following string.
"js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"
Explanation : Regex
Now that we have isolated the piece that contains the data we want, we're going to extract just the portion we need.
We'll do so by using a regular expression (this site is an excellent resource if you want to know more about Regex, good stuff).
To use regular expressions in Python we must import the regex module re (more about the re module in the Python docs; good stuff).
import re
Regex lets us search a string that matches a pattern.
Here the string is our data, and the pattern is ".*'(\d+)'.*" (which is also a string as you can tell by the use of the double quotes).
You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is ^.*\.txt$.
Best you read about regular expressions to further understand what it is about. Here's a quick start, good good good stuff.
Here we search the string. We describe it as having zero or more characters, followed by at least one digit enclosed in single quotes, followed by some more characters.
The parentheses are used to extract a group (that's called capturing in regex); we capture just the part that's a number.
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternations to part of the regex.
Only parentheses can be used for grouping. Square brackets define a character class, and curly braces are used by a quantifier with specific limits. -- Use Parentheses for Grouping and Capturing
r = re.search(r".*'(\d+)'.*", data)
Defining the symbols :
.* matches any character (except for line terminators), * means there can be none or infinite amount
' matches the character '
\d+ matches at least one digit (equivalent to [0-9]); that's the part we capture
(\d+) Capturing group; this means capture the part of the string where a digit appears at least once
() are used for capturing; the part that matches the pattern within the parentheses is saved.
The captured part (if any) can later be accessed by calling r.groups() on the result of re.search (r refers to that result; if nothing matched, re.search returns None instead).
In our case the first (and only) item of the tuple is the digits...
captured_groups = r.groups() # that's the tuple containing our captured data
We can now access our data, which is at the first index of the tuple (we only captured one group).
print(captured_groups[0]) # this will print out '379104'
Both solutions below assume a regular/consistent structure in the onclick attribute.
If there can only be one match, then something like the following:
from bs4 import BeautifulSoup as bs
html = '''
<div class="buttons_zoom"><div class="full_prod"><a title="פרטים נוספים" href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>
'''
soup = bs(html, 'lxml')
element = soup.select_one('[onclick^="js:getProdID"]')
print(element['onclick'].split(',')[2].strip("')"))
If there can be more than one match:
from bs4 import BeautifulSoup as bs
html = '''
<div class="buttons_zoom"><div class="full_prod"><a title="פרטים נוספים" href="javascript:void(0)" onclick="js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></a></div></div>
'''
soup = bs(html, 'lxml')
elements = soup.select('[onclick^="js:getProdID"]')
for element in elements:
    print(element['onclick'].split(',')[2].strip("')"))
Related
I am trying to use BeautifulSoup to scrape a particular download URL from a web page, based on a partial text match. There are many links on the page, and it changes frequently. The html I'm scraping is full of sections that look something like this:
<section class="onecol habonecol">
<a href="https://longGibberishDownloadURL" title="Download">
<img src="\azure_storage_blob\includes\download_for_windows.png"/>
</a>
sentinel-3.2022335.1201.1507_1608C.ab.L3.FL3.v951T202211_1_3.CIcyano.LakeOkee.tif
</section>
The second to last line (sentinel-3.2022335...LakeOkee.tif) is the part I need to search using a partial string to pull out the correct download url. The code I have attempted so far looks something like this:
import requests, re
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}, string=re.compile(?))
I've been searching StackOverflow a long time now and while there are similar questions and answers, none of the proposed solutions have worked for me so far (re.compile, lambdas, etc.). I am able to pull up a section if I remove the string argument, but when I try to include a partial matching string I get None for my result. I'm unsure what to put for the string argument (? above) to find a match based on partial text, say if I wanted to find the filename that has "CIcyano" somewhere in it (see second to last line of html example at top).
I've tried multiple methods using re.compile and lambdas, but I don't quite understand how either of those functions really work. I was able to pull up other sections from the html using these solutions, but something about this filename string with all the periods seems to be preventing it from working. Or maybe it's the way it is positioned within the section? Perhaps I'm going about this the wrong way entirely.
Is this perhaps considered part of the section id, and so the string argument can't find it?? An example of a section on the page that I AM able to find has html like the one below, and I'm easily able to find it using the string argument and re.compile using "Name", "^N", etc.
<section class="onecol habonecol">
<h3>
Name
</h3>
</section>
Appreciate any advice on how to go about this! Once I get the correct section, I know how to pull out the URL via the a tag.
Here is the full html of the page I'm scraping, if that helps clarify the structure I'm working against.
I believe you are overthinking. Just remove the regular expression part, take the text and you will be fine.
import requests
from bs4 import BeautifulSoup
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
result = soup.find('section', attrs={'class':'onecol habonecol'}).text
print(result)
You can query inside every section for the string you want. Like so:
soup.find('section', attrs={'class': 'onecol habonecol'}).find(string=re.compile(r'.sentinel.*'))
This regular expression matches any text that contains sentinel. Be careful: you may have to match some leading characters such as whitespace, which is why there is a . at the beginning of the regex. You might want a more robust regex, which you can test here:
https://regex101.com/
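One way to go from a matching text node back up to its section and the download link (a sketch; it assumes soup was parsed from the page as in the question):

import re

text_node = soup.find(string=re.compile('CIcyano'))
if text_node is not None:
    section = text_node.find_parent('section', class_='onecol habonecol')
    dwn_url = section.find('a').get('href')
    print(dwn_url)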
I ended up finding another method not using the string argument in find(), instead using something like the code below, which pulls the first instance of a section that contains a partial text match.
sections = soup.find_all('section', attrs={'class': 'onecol habonecol'})
for s in sections:
    text = s.text
    if 'CIcyano' in text:
        print(s)
        break

links = s.find('a')
dwn_url = links.get('href')
This works for my purposes and fetches the first instance of the matching filename, and grabs the URL.
Hey, I just started to learn Python and want to code a web scraper. On the website I'm scraping, I only care about the name and price, but all the information is written in one div which contains three other sub-divs. The HTML looks like:
<div class="Product">
<div class="product-image-and-name-container"></div>
<div class="prices"></div>
<div class="buy-now-button"></div>
</div>
I tried to use this line to get all the information from "Product":
root_pattern = '<div class="Product">([\s\S]*?)</div>'
But it only gives the information from the first div ("product-image-and-name-container") and then stops; it gets nothing from the other divs.
Here is all my code:
from urllib.request import Request, urlopen
import re

class Shopping_Spider():
    url = 'http://www....com/Shop-Online/587'
    root_pattern = '<div class="Product">([\s\S]*?)</div>'
    name_pattern = '<div class="product-name">([\s\S]*?)</div>'
    price_pattern = '<span class="Price">([\s\S]*?)</span>'

    def __fetch_content(self):
        # page = urllib.urlopen(Shopping_Spider.url)
        r = Request(Shopping_Spider.url, headers={'User-Agent': 'Mozilla/5.0'})
        html_s = urlopen(r).read()
        html_s = str(html_s, encoding='utf-8')
        return html_s

    def __analysis(self, html_s):
        root_html = re.findall(Shopping_Spider.root_pattern, html_s)
        anchors = []
        for html in root_html:
            name = re.findall(Shopping_Spider.name_pattern, html)
            price = re.findall(Shopping_Spider.price_pattern, html)
            anchor = {'name': name, 'price': price}
            anchors.append(anchor)
        return anchors

    def go(self):
        html_s = self.__fetch_content()
        self.__analysis(html_s)

shopping_spider = Shopping_Spider()
shopping_spider.go()
Thanks in advance. I think my regular expression is wrong, but I don't know how to rewrite it. I know it may be easier to use BeautifulSoup for this, but I just want to know whether it's possible to get what I want using only regular expressions! Big thanks.
You can extract the inner content of the outer divs with a regex like
root_pattern = r'(?:<div class="Product">)(.*)(?:</div>)'
The pattern above defines three groups, but the ones starting with ?: are non-capturing, so only the middle part is returned.
You also have to set the DOTALL flag so that the dot (.) matches all characters, including the \n char, so later in your code use
root_html = re.findall(Shopping_Spider.root_pattern, html_s, re.DOTALL)
Then you can adapt the remaining patterns with the same principles.
Edit (important):
Jimmy, unless an expert in regexes shows up with a solution, go with BeautifulSoup.
Unless your target html pages are that simple (they never are), this will not work in practice (despite working with the sample). Alex's comment goes right to the point. Also feel free to un-accept my answer and give him the credit, because I tend to believe the better advice will always be to go with BS (despite you asking for a regex alternative). You may always upvote this if you think it was somehow useful anyway.
The problem is that div tags can be arbitrarily nested in a document. The regex I presented captures everything from the start of the product div to the last div in the document (which will not work in practice), because * is "greedy". You can avoid that with a ? following the *, but that won't solve anything either, as then it will capture up to the first closing div. I also see no way of matching a closing div with its opening tag: closing divs are all identical, and there is both arbitrary nesting and the possibility of structure changes, such as more nested divs inside the product div.
Not without starting to write code to somehow parse the html, which is exactly what BS is for.
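A quick demonstration of that failure mode, using a minimal made-up snippet:

import re

html = '<div class="Product"><div class="prices">1</div><div class="buy-now-button">2</div></div>'
print(re.findall(r'<div class="Product">(.*?)</div>', html, re.DOTALL))
# prints ['<div class="prices">1']: the lazy match stops at the first
# closing </div>, so the rest of the product markup is lost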
I am successfully able to get the url using my technique, but the point is that I need to change the url slightly, to this: "http://www.example.com/static/p/no-name-0330-227404-1.jpg", whereas in the img tag I get this link: "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
HTML CODE:
<div class="swiper-wrapper"><img data-error-placeholder="PlaceholderPDP.jpg" class="swiper-lazy swiper-lazy-loaded" src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"></div>
Python Code:
imagesList = []
imagesList.append([re.findall(re.compile(r'http.*?\.jpg'), etree.tostring(imagesList).decode("utf-8")) for imagesList in productTree.xpath('//*[@class="swiper-wrapper"]/img')])
print (imagesList)
output:
[['http://www.example.com/static/p/no-name-8143-225244-1-product.jpg']]
NOTE: I need to remove "-product" from the url, and I have no idea why the url is inside two square brackets.
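As an aside on the double brackets: the outer pair comes from appending the entire list comprehension (whose items are themselves re.findall result lists) to imagesList. A flattened sketch, reusing the question's names (productTree and etree are assumed to exist as in the original code):

import re

urls = [url
        for img in productTree.xpath('//*[@class="swiper-wrapper"]/img')
        for url in re.findall(r'http.*?\.jpg',
                              etree.tostring(img).decode('utf-8'))]
print([u.replace('-product', '') for u in urls])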
If you intend to remove just the product keyword then you can simply use the .replace() method. Otherwise you can construct regular expressions to manipulate the string. Below is example code for the replace approach.
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myURL = myURL.replace("-product", "") # gives u "http://www.example.com/static/p/no-name-0330-227404-1.jpg"
print(myURL)
Regular expression version (probably not a clean solution, as in it is difficult to understand), however it is better than the first approach because it dynamically discards the last set of -words (e.g. -product).
What I have done is capture 3 parts of the URL but omit the middle part, because that is the -product bit, and combine parts 1 and 3 together to form your URL.
import re
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myPattern = r"(.*)(-.*)(\.jpg)$"
pattern = re.compile(myPattern)
match = re.search(pattern, myURL)
print (match.group(1) + match.group(3))
Same output as above:
http://www.example.com/static/p/no-name-0330-227404-1.jpg
If all the images have the word "product", could you just do a simple string replace and remove that word? Whatever you are trying to do (including renaming files), I see that as the simplest solution.
I'm having a problem trying to figure out how to grab the specific tag I need.
<div class="meaning"><span class="hinshi">[名]</span><span class="hinshi">(スル)</span></div>, <div class="meaning"><b>1</b> 今まで経験してきた仕事・身分・地位・学業などの事柄。履歴。「―を偽る」</div>,
Right now I have it so it finds all the "meaning" classes, but I need to narrow it down even further to get what I want. Above is an example. I need to grab just the
"<div class="meaning"><b>"
and ignore all the "hinshi" classes.
edit: It seems to be showing the number, which I guess is expected, but I need the text next to it. Any ideas?
You can find elements with a specific attribute by using keyword arguments to the find method. In your case, you'll want to match on the class_ keyword. See the documentation regarding the class_ keyword.
Assuming that you want to filter the elements that don't contain any children with the "hinshi" class, you could try something like this:
soup = BeautifulSoup(data)
potential_matches = soup.find_all(class_="meaning")
matches = []
for match in potential_matches:
    bad_children = match.find_all(class_="hinshi")
    if not bad_children:
        matches.append(match)
return matches
If you'd like, you could make it a little shorter, for example:
matches = soup.find_all(class_="meaning")
return [x for x in matches if not x.find_all(class_="hinshi")]
Or, depending on your Python version, i.e. 2.x:
matches = soup.find_all(class_="meaning")
return filter(lambda x: not x.find_all(class_="hinshi"), matches)
EDIT: If you want to find the foreign characters next to the number in your example, you should first remove the b element, then use the get_text method. For example
# Assuming `element` is one of the matches from above
element.find('b').extract()
print(element.get_text())
You could try using the .select function, which takes a CSS selector:
soup.select('.meaning b')
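For instance, a small usage sketch (assuming soup is built from the question's HTML as in the other answers):

for b in soup.select('.meaning b'):
    print(b.get_text())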
You can also do it like this:
for s in soup.find_all("div", class_="meaning"):
    for b in s.find_all("b"):
        print(b.get_text())
Adjust the line inside the loop to pull out whatever part of the result you need.
I have a webpage's content and need to get some data from it. It looks like:
<div class="deg">DATA</div>
As I understand it, I have to use a regex, but I can't choose one.
I tried the code below but got no results at all. Please correct me:
regexHandler = re.compile('(<div class="deg">(?P<div class="deg">.*?)</div>)')
result = regexHandler.search( pageData )
I suggest using a good HTML parser (such as BeautifulSoup -- but for your purposes, i.e. with well-formed HTML as input, the ones that come with the Python standard library, such as HTMLParser, should also work well) rather than raw REs to parse HTML.
If you want to persist with the raw RE approach, the pattern:
r'<div class="deg">([^<]*)</div>'
looks like the simplest way to get the string 'DATA' out of the string '<div class="deg">DATA</div>' -- assuming that's what you're after. You may need to add one or more \s* in spots where you need to tolerate optional whitespace.
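For example, a minimal sketch of that pattern with the optional-whitespace tolerance added (pageData stands in for the fetched page source):

import re

pageData = '<div class="deg"> DATA </div>'  # stand-in for the real page
m = re.search(r'<div class="deg">\s*([^<]*?)\s*</div>', pageData)
if m:
    print(m.group(1))  # prints: DATA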
If you want the div tags included in the matched item:
regexpHandler = re.compile('(<div class="deg">.*?</div>)')
If you don't want the div tags included, only the DATA portion:
regexpHandler = re.compile('<div class="deg">(.*?)</div>')
Then to run the match and get the result:
result = regexpHandler.search(pageData)
matchedText = result.groups()[0]
You can use simple string functions in Python; no need for a regex.
mystr = """<div class="deg">DATA</div>"""
if "div" in mystr and "class" in mystr and "deg" in mystr:
    s = mystr.split(">")
    for n, item in enumerate(s):
        if "deg" in item:
            print(s[n+1][:s[n+1].index("<")])
My approach: get something to split on. E.g., above, I split on ">". Then go through the split items, check for "deg", and take the item after it, since "deg" appears before the data you want. Of course, this is not the only approach.
While it is ok to use regex for quick and dirty html processing, a much better and cleaner way is to use an html parser like lxml.html and to query the parsed tree with XPath or CSS Selectors.
html = """<html><body><div class="deg">DATA1</div><div class="deg">DATA2</div></body></html>"""
import lxml.html
page = lxml.html.fromstring(html)
#page = lxml.html.parse(url)
for element in page.findall('.//div[@class="deg"]'):
    print(element.text)
#using css selectors
from lxml.cssselect import CSSSelector
sel = CSSSelector("div.deg")
for element in sel(page):
    print(element.text)