Exclude certain keyword from URL - python

I can successfully extract the URL with my current approach, but I need to change it slightly, to this: "http://www.example.com/static/p/no-name-0330-227404-1.jpg". The img tag, however, gives me this link: "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
HTML CODE:
<div class="swiper-wrapper"><img data-error-placeholder="PlaceholderPDP.jpg" class="swiper-lazy swiper-lazy-loaded" src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"></div>
Python Code:
imagesList = []
imagesList.append([re.findall(re.compile(u'http.*?\.jpg'), etree.tostring(imagesList).decode("utf-8")) for imagesList in productTree.xpath('//*[@class="swiper-wrapper"]/img')])
print (imagesList)
output:
[['http://www.example.com/static/p/no-name-8143-225244-1-product.jpg']]
NOTE: I need to remove "-product" from the URL, and I have no idea why this URL is inside two pairs of square brackets.

If you only intend to remove the "-product" keyword, you can simply use str.replace(). Otherwise, you can construct a regular expression to manipulate the string. Below is an example using replace().
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myURL = myURL.replace("-product", "") # gives "http://www.example.com/static/p/no-name-0330-227404-1.jpg"
print(myURL)
Regular expression version (probably not as clean a solution, since it is harder to understand). However, it is more flexible than the first approach because it discards the last "-word" group (e.g. -product), whatever that word happens to be.
I capture three parts of the URL, omit the middle part because that is the -product bit, and combine parts 1 and 3 to form your URL.
import re
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myPattern = r"(.*)(-.*)(\.jpg)$"  # raw string, so \. reaches the regex engine unchanged
pattern = re.compile(myPattern)
match = re.search(pattern, myURL)
print (match.group(1) + match.group(3))
Same output as above:
http://www.example.com/static/p/no-name-0330-227404-1.jpg
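On the double square brackets: re.findall already returns a list, so collecting its result inside another list (and then appending that to imagesList) adds extra nesting. Here is a minimal sketch of one way to end up with a flat list of cleaned URLs. It assumes the HTML is available as a plain string and uses only the standard library; the names html, image_urls and cleaned are made up for this example.

```python
import re

# Sample markup modelled on the question's HTML (shortened).
html = '<div class="swiper-wrapper"><img class="swiper-lazy" src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"></div>'

# re.findall returns a list, so extend() keeps image_urls flat;
# append() would store the whole list as a single nested element.
image_urls = []
image_urls.extend(re.findall(r'http[^"]*?\.jpg', html))

# Remove "-product" only when it sits directly before the ".jpg" extension.
cleaned = [re.sub(r'-product(?=\.jpg$)', '', url) for url in image_urls]
print(cleaned)
```

The lookahead keeps the replacement from touching a "-product" that appears elsewhere in the path, which a plain replace() would also remove.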

If all the images have the word "product", could you just do a simple string replace and remove that word? Whatever you are trying to do (including renaming files), I see that as the simplest solution.

Related

How to substring with specific start and end positions where a set of characters appear?

I am trying to clean data that I scraped from a set of links. I have over 100 links in a CSV file that I'm trying to clean.
This is what a link looks like in the CSV:
"https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
I've observed that scraping this link for HTML data doesn't go well, so I have to get the URL embedded inside it.
I want to get the substring that starts after &url= and ends at &ct, as that's where the real URL resides.
I've read posts like this but couldn't find one that handles an ending string too. I've tried an approach from this using the substring package, but it doesn't work for more than one character.
How do I do this? Preferably without using third party packages?
I don't understand the problem.
If you have a string, you can use string functions like .find() and slicing with [start:end]
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
start = text.find('url=') + len('url=')
end = text.find('&ct=')
text[start:end]
But url= and ct= may appear in a different order, so it is better to search for the first & after url=
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
start = text.find('url=') + len('url=')
end = text.find('&', start)
text[start:end]
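One caveat with the slicing above: .find() returns -1 when the marker is missing, so if url= happens to be the last parameter, text[start:end] silently drops the final character. A small sketch that guards against both cases follows; the function name between and its default markers are only illustrative.

```python
def between(text, start_marker="url=", end_marker="&"):
    """Return the text after start_marker, up to the next end_marker (or the end)."""
    start = text.find(start_marker)
    if start == -1:
        return None  # start marker not present at all
    start += len(start_marker)
    end = text.find(end_marker, start)
    # No end marker after the start: take everything that is left.
    return text[start:] if end == -1 else text[start:end]

print(between("https://www.google.com/url?sa=t&url=https://a.example/x&ct=ga"))
print(between("https://www.google.com/url?sa=t&url=https://a.example/x"))
```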
EDIT:
There is also the standard module urllib.parse for working with URLs, e.g. to split or join them.
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
import urllib.parse
url, query = urllib.parse.splitquery(text)
data = urllib.parse.parse_qs(query)
data['url'][0]
In data you have a dictionary:
{'cd': ['SldisGkopisopiasenjA6Y28Ug'],
'ct': ['ga'],
'rct': ['j'],
'sa': ['t'],
'url': ['https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428'],
'usg': ['AFQjaskdfYJkasKugowe896fsdgfsweF']}
EDIT:
Python shows a warning that splitquery() is deprecated as of 3.8 and that code should use urlparse() instead:
text = "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF"
import urllib.parse
parts = urllib.parse.urlparse(text)
data = urllib.parse.parse_qs(parts.query)
data['url'][0]
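Since the question mentions more than 100 links, the urlparse() version can be wrapped in a small helper and applied to each of them. This is only a sketch: the helper name extract_target and the second sample link are invented for illustration, and links without a url parameter yield None instead of raising KeyError.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(link, param="url"):
    """Return the first value of `param` in the link's query string, or None."""
    values = parse_qs(urlparse(link).query).get(param)
    return values[0] if values else None

links = [
    "https://www.google.com/url?rct=j&sa=t&url=https://www.somenewswebsite.com/news/society/new-covid-variant-decline-across-the-nation/014465428&ct=ga&cd=SldisGkopisopiasenjA6Y28Ug&usg=AFQjaskdfYJkasKugowe896fsdgfsweF",
    "https://www.example.com/no-redirect-here",  # hypothetical link without a url= parameter
]
cleaned = [extract_target(link) for link in links]
print(cleaned)
```

In a real script the links list would come from the CSV, for example via the standard csv module.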

How to do regular expression to extract string from HTML file

I still cannot figure out how to extract information from links like this:
http: example.com/AA-HDCM-300B.pdf
I want to extract the product part number "AA-HDCM-300B", which begins with "AA-".
Does anyone know what the extraction code would be?
import re
url = 'dview.com/IDVIEW/Products/Cameras/Covert/assets/IV-PC229XP.pdf'
result = re.findall(r'(IV.*)\.', url)
print(result[0])  # findall returns a list; take the first match
Output:
IV-PC229XP
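The same idea applied to the "AA-" part number from the question could look like the sketch below. It assumes the link ends in ".pdf" and that the part number contains no slash; the URL is written in what is presumably its intended form.

```python
import re

url = "http://example.com/AA-HDCM-300B.pdf"  # assumed form of the link in the question

# Capture "AA-" plus everything up to the ".pdf" extension, without crossing a "/".
match = re.search(r"(AA-[^/]+)\.pdf$", url)
part_number = match.group(1) if match else None
print(part_number)
```

Using re.search with an explicit None check avoids an IndexError when a link does not contain a part number.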

get Specify value from html with python beautifulsoup

I'm new to scraping,
and I'm working on a scraping project where I'm trying to get a value from the HTML below:
<div class="buttons_zoom"><div class="full_prod"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></div></div>
I want to get this value:
379104
which is located in the onclick attribute.
I'm using BeautifulSoup.
The code:
for i in page_content.find_all('div', attrs={'class':'prodPrice'}):
    temp = i.parent.parent.contents[0]
temp returns a list of objects, and temp is equal to the HTML above.
Can someone help me extract this ID?
Thanks!
Edit:
Thanks for the amazing explanation! I still have two issues, though.
1. The retry mechanism is not working. I set timeout=1 to force a failure, but when it fails it raises:
requests.exceptions.RetryError: HTTPSConnectionPool(host='www.XXXXX.il', port=443): Max retries exceeded with url: /default.asp?catid=%7B2234C62C-BD68-4641-ABF4-3C225D7E3D81%7D (Caused by ResponseError('too many redirects',))
Can you please help me with the retry mechanism in the code below?
2. Performance: without the retry mechanism and with timeout=6, scraping 8000 items takes 15 minutes. How can I improve the performance of this code?
Code below:
def get_items(self, dict):
    itemdict = {}
    for k, v in dict.items():
        boolean = True
        # here, we fetch the content from the url, using the requests library
        while (boolean):
            try:
                a =requests.Session()
                retries = Retry(total=3, backoff_factor=0.1, status_forcelist=[301,500, 502, 503, 504])
                a.mount(('https://'), HTTPAdapter(max_retries=retries))
                page_response = a.get('https://www.XXXXXXX.il' + v, timeout=1)
            except requests.exceptions.Timeout:
                print ("Timeout occurred")
                logging.basicConfig(level=logging.DEBUG)
            else:
                boolean = False
        # we use the html parser to parse the url content and store it in a variable.
        page_content = BeautifulSoup(page_response.content, "html.parser")
        for i in page_content.find_all('div', attrs={'class':'prodPrice'}):
            parent = i.parent.parent.contents[0]
            getparentfunc= parent.find("a", attrs={"href": "javascript:void(0)"})
            itemid = re.search(".*'(\d+)'.*", getparentfunc.attrs['onclick']).groups()[0]
            itemName = re.sub(r'\W+', ' ', i.parent.contents[0].text)
            priceitem = re.sub(r'[\D.]+ ', ' ', i.text)
            itemdict[itemid] = [itemName, priceitem]
from bs4 import BeautifulSoup as bs
import re
txt = """<div class="buttons_zoom"><div class="full_prod"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></div></div>"""
soup = bs(txt,'html.parser')
a = soup.find("a", attrs={"href":"javascript:void(0)"})
data = a.attrs["onclick"] # the onclick string that holds the id
r = re.search(r".*'(\d+)'.*", data).groups()[0]
print(r) # will print '379104'
Edit
Replaced ".*\}.*,.*'(\d+)'\).*" with ".*'(\d+)'.*". They produce the same result but the latter is much cleaner.
Explanation : Soup
Find the first a tag whose "href" attribute has "javascript:void(0)" as its value. More about beautiful soup keyword arguments here.
a = soup.find("a", attrs={"href":"javascript:void(0)"})
This is equivalent to
a = soup.find("a", href="javascript:void(0)")
In older versions of Beautiful Soup, which don’t have the class_ shortcut, you can use the attrs trick mentioned above. Create a dictionary whose value for “class” is the string (or regular expression, or whatever) you want to search for. -- see beautiful soup documentation about "attrs"
a points to an element of type <class 'bs4.element.Tag'>. We can access the tag attributes like we would do for a dictionary via the property a.attrs (more about that at beautiful soup attributes). That's what we do in the following statement.
a_tag_attributes = a.attrs # that's the dictionary of attributes in question...
The dictionary keys are named after the tag's attributes. Here we have the following keys/attribute names: 'title', 'href' and 'onclick'.
We can check that for ourselves by printing them.
print(a_tag_attributes.keys()) # equivalent to print(a.attrs.keys())
This will output
dict_keys(['title', 'href', 'onclick']) # those are the attributes names (the keys to our dictionary)
From here, we need to get the data we are interested in. The key to our data is "onclick" (it's named after the html attribute where the data we seek lies).
data = a_tag_attributes["onclick"] # equivalent to data = a.attrs["onclick"]
data now holds the following string.
"js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"
Explanation : Regex
Now that we have isolated the piece that contains the data we want, we're going to extract just the portion we need.
We'll do so by using a regular expression (this site is an excellent resource if you want to know more about Regex, good stuff).
To use regular expressions in Python, we must import the regex module re. More about the "re" module here, good good stuff.
import re
Regex lets us search a string that matches a pattern.
Here the string is our data, and the pattern is ".*'(\d+)'.*" (which is also a string as you can tell by the use of the double quotes).
You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager. The regex equivalent is ^.*\.txt$.
Best you read about regular expressions to further understand what it is about. Here's a quick start, good good good stuff.
Here we search for a string. We describe the string as starting with zero or more characters of any kind. Those characters are followed by some digits (at least one) enclosed in single quotes. Then we have some more characters.
The parentheses are used to extract a group (that's called capturing in regex); we capture just the part that's a number.
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a quantifier to the entire group or to restrict alternations to part of the regex.
Only parentheses can be used for grouping. Square brackets define a character class, and curly braces are used by a quantifier with specific limits. -- Use Parentheses for Grouping and Capturing
r = re.search(r".*'(\d+)'.*", data)
Defining the symbols:
.* matches any character (except line terminators); the * means zero or more of them
' matches the character '
\d+ matches at least one digit (\d is equivalent to [0-9]); that's the part we capture
(\d+) is a capturing group; it captures the part of the string where a digit is repeated at least once
() are used for capturing; the part that matches the pattern within the parentheses is saved.
The captured part (if any) can later be accessed with a call to r.groups() on the result of re.search.
This returns a tuple containing what was captured (r refers to the result of the re.search call, which is None if nothing matched).
In our case the first (and only) item of the tuple is the digits...
captured_group = r.groups() # that's the tuple containing our data (we captured...)
We can now access our data which is at the first index of the tuple (we only captured one group)
print(captured_group[0]) # this will print out '379104'
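If the onclick string could ever contain another quoted number (say, inside the URL argument), a slightly stricter pattern may be safer. The sketch below assumes the id is always the last argument of getProdID, which is what the sample string suggests; the variable names are only illustrative.

```python
import re

onclick = "js:getProdID('https://www.XXXXXXX.co.il','{31F93B1D-449F-4AD7-BFB0-97A0A8E068F6}','379104')"

# Anchor on the closing parenthesis so only the last quoted argument can match.
match = re.search(r"'(\d+)'\)\s*$", onclick)
product_id = match.group(1) if match else None
print(product_id)
```

Checking match before calling group() also avoids an AttributeError on elements whose onclick does not follow this shape.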
Both solutions below assume regular/consistent structure to the onclick attribute
If there can only be one match then something like the following.
from bs4 import BeautifulSoup as bs
html ='''
<div class="buttons_zoom"><div class="full_prod"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></div></div>
'''
soup = bs(html, 'lxml')
element = soup.select_one('[onclick^="js:getProdID"]')
print(element['onclick'].split(',')[2].strip("')"))  # strip the closing parenthesis and the quotes
If more than one match
from bs4 import BeautifulSoup as bs
html ='''
<div class="buttons_zoom"><div class="full_prod"><img alt="פרטים נוספים" border="0" src="template/images/new_site/icon-view-prod-cartpage.png"/></div></div>
'''
soup = bs(html, 'lxml')
elements = soup.select('[onclick^="js:getProdID"]')
for element in elements:
    print(element['onclick'].split(',')[2].strip("')"))  # strip the closing parenthesis and the quotes

How to reliably extract URLs contained in URLs with Python?

Many search engines track clicked URLs by adding the result's URL to the query string which can take a format like: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask
In the above example the result URL is part of the query string but in some cases it takes the form http://www.example.com/http://www.stackoverflow.com/questions/ask or URL encoding is used.
The approach I tried first is to split searchengineurl.split("http://"). Some obvious problems with this:
it would return all parts of the query string that follow the result URL and not just the result URL. This would be a problem with a URL like this: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None
it does not distinguish between any additional parts of the search engine tracking URL's query string and the result URL's query string. This would be a problem with a URL like this: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
it fails if the "http://" is omitted in the result URL
What is the most reliable, general and non-hacky way in Python to extract URLs contained in other URLs?
I would try using urlparse.urlparse (urllib.parse.urlparse in Python 3); it will probably get you most of the way there, and a little extra work on your end will get you what you want.
This works for me.
from urllib.parse import urlparse, unquote

urls = ["http://www.example.com/http://www.stackoverflow.com/questions/ask",
        "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None",
        "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask?showauthor=False&display=None",
        "http://www.example.com/result?track=http%3A//www.stackoverflow.com/questions/ask%3Fshowauthor%3DFalse%26display%3DNone"]

def clean(url):
    # Case 1: the result URL is embedded in the path.
    path = urlparse(url).path
    index = path.find("http")
    if index != -1:
        return path[index:]
    # Case 2: the result URL is embedded in the query string.
    query = urlparse(url).query
    query = query[query.index("http"):]
    index_questionmark = query.find("?")
    index_ampersand = query.find("&")
    # An "&" that appears before any "?" belongs to the tracking URL, so cut there.
    if index_ampersand != -1 and (index_questionmark == -1 or index_questionmark > index_ampersand):
        return unquote(query[:index_ampersand])
    return unquote(query)

for url in urls:
    print(clean(url))
> http://www.stackoverflow.com/questions/ask
> http://www.stackoverflow.com/questions/ask
> http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
> http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
I don't know about Python specifically, but I would use a regular expression to get the parts (key=value) of the query string, with something like...
(?:\?|&)[^=]+=([^&]*)
That captures the "value" parts. I would then decode those and check them against another pattern (probably another regex) to see which one looks like a URL. I would just check the first part, then take the whole value. That way your pattern doesn't have to account for every possible type of URL (and presumably they didn't combine the URL with something else within a single value field). This should work with or without the protocol being specified (it's up to your pattern to determine what looks like a URL).
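In Python, the query-string approach described above might be sketched as follows. Note that it swaps the hand-written key=value regex for urllib.parse.parse_qsl, which splits on "&" and percent-decodes the values. The URL_LIKE pattern is a deliberately loose, made-up heuristic (optional scheme, then a dotted host name), not a full URL validator, and the function name embedded_urls is hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical "looks like a URL" check: optional http(s) scheme, then a dotted host name.
URL_LIKE = re.compile(r"^(?:https?://)?[\w-]+(?:\.[\w-]+)+(?:[/?#].*)?$", re.IGNORECASE)

def embedded_urls(tracking_url):
    """Return the decoded query values that look like URLs."""
    query = urlsplit(tracking_url).query
    # parse_qsl yields (key, decoded value) pairs; keep the values that pass the check.
    return [value for _, value in parse_qsl(query) if URL_LIKE.match(value)]

print(embedded_urls("http://www.example.com/result?track=http%3A//www.stackoverflow.com/questions/ask%3Fshowauthor%3DFalse&display=None"))
```

Because the scheme is optional in the pattern, values such as www.stackoverflow.com/questions/ask are accepted too, at the cost of also accepting anything else that merely looks like a dotted host name.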
As for the second type of URL... I don't think there is a non-hacky way to parse that. You could URL-decode the entire URL, then look for the second instance of http:// (or https://, and/or any other protocols you might run across). You would have to decide whether any query strings are part of "your" URL or the tracker URL. You could also not decode the URL and attempt to match on the encoded values. Either way will be messy, and if they don't include the protocol it will be even worse! If you're working with a set of specific formats, you could work out good rules for them... but if you just have to handle whatever they happen to throw at you... I don't think there's a reliable way to handle the second type of embedding.

Extract information from a webpage in a particular format

I am trying to make a simple Python script to extract certain links from a webpage. I can extract the links successfully, but now I want to extract some more information, such as the bitrate, size and duration given on that webpage.
I am using the XPath below to extract the above-mentioned info:
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now, for a particular link, I need the info as a tuple like (bitrate, size, duration).
The XPath above returns the required info, but it is badly formatted: I have not been able to get it into my required format with any logic.
So, is there any way to get the output in my format?
I think BeautifulSoup will do the job, it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
Parsing is quite easy with BeautifulSoup, for example:
import bs4
import urllib.request
soup = bs4.BeautifulSoup(urllib.request.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read(), 'html.parser')
print(soup.find_all('a'))
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can actually strip everything out with XPath:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So for your additional question, either:
info[0:len(info)]
for everything altogether, or:
info.rfind(" ")
since translate() leaves a space character, though you could replace that with whatever you wanted.
Addl info found here
How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
To get the data out of the array, re.match(regex, info[n]) should suffice. As for the triple, Python's tuple syntax takes care of it: simply match members of your info array with re.match.
import re
matching_re = '.*' # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re, info[1])
# etc. for incoming_value_2 and incoming_value_3
triple = (incoming_value_1, incoming_value_2, incoming_value_3)
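To make that a bit more concrete, here is a rough sketch that classifies the stripped text nodes by what they look like and builds the (bitrate, size, duration) tuple. It assumes you pass in the text nodes belonging to one song, and that values look like "192 kbps", "3.71 mb" and "2:41" as in the question's sample output; the names PATTERNS and song_info are invented for this example.

```python
import re

# Hypothetical patterns, based on the sample values shown in the question.
PATTERNS = {
    "bitrate": re.compile(r"^\d+\s*kbps$", re.IGNORECASE),
    "size": re.compile(r"^\d+(?:\.\d+)?\s*mb$", re.IGNORECASE),
    "duration": re.compile(r"^\d+:\d{2}(?::\d{2})?$"),
}

def song_info(text_nodes):
    """Build a (bitrate, size, duration) tuple from one song's text nodes."""
    found = dict.fromkeys(PATTERNS)  # every field starts out as None
    for node in text_nodes:
        value = node.strip()  # drop the surrounding \n and \t characters
        for name, pattern in PATTERNS.items():
            if found[name] is None and pattern.match(value):
                found[name] = value
    return (found["bitrate"], found["size"], found["duration"])

print(song_info(['\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41', '\n\t\t\t\t3.71 mb\t\t\t']))
```

Fields that are missing for a song come back as None, so the tuple always has three positions.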
