I'm trying to build a list of URLs, visit_urls, to visit.
At first I specify the first URL to visit manually with self.br.get(url), then I check how many pages the site has; say it has 40 pages, which I can read from the "count" element. I then just want to switch the end of the URL to &page=2, &page=3, and so on up to &page=40, collecting the results in a list.
Here is the loop part of my code. I only need a way to add all the pages to the visit_urls list:
visit_urls = []
self.br.get(url)
count = self.br.find_element_by_xpath("//*[@class='count']").get_attribute("innerHTML")
# for (each page number, up to count):
#     self.visit_urls.append(url + ...)  # need to append "&page=2" ... "&page=<count>" to the end of the first url
This code comes after a lot of research; I'm stuck, so any help would be great!
Try something like this:
visit_urls=[]
self.br.get(url)
# find_element (singular) returns one WebElement; the XPath attribute syntax is @class
count = int(self.br.find_element_by_xpath("//*[@class='count']").get_attribute("innerHTML"))
for page_number in xrange(1, count + 1):
    page_url = '{url}&page={page_number}'.format(url=url, page_number=page_number)
    visit_urls.append(page_url)
This builds each page URL in a separate page_url variable. If you reassigned url itself on every iteration instead, each pass would build on the previous result and you would end up with URLs like http://www.mysite.com&page=1&page=2&page=3.
Make sure url is always defined appropriately before the loop.
I'm assuming everything works and the issue you're having is in generating an array of all URLs based on your findings in "count".
The easiest case would be if you already know the URL and it's in the proper format, such as:
url = 'http://www.thisisapage.com/somethinghere/dosomething.php?page=1'
If that's the case, do something to strip the 1, getting a 'baseurl' to act on (exactly how to do this depends on what the URLs are and how they are formed):
baseurl = 'http://www.thisisapage.com/somethinghere/dosomething.php?page='
After that, just loop from 1 to count, appending the current iteration to the baseurl, as in the sketch below.
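A minimal sketch of that loop, assuming count has already been scraped from the page and converted to an int (the baseurl is the made-up example from above):

baseurl = 'http://www.thisisapage.com/somethinghere/dosomething.php?page='
count = 40  # assumed: already read from the page's "count" element

visit_urls = []
for page_number in range(1, count + 1):
    # append the page number to the fixed prefix to form each page URL
    visit_urls.append(baseurl + str(page_number))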
Often, it's a lot easier to use a regular expression for this if you're ever going to have complex URLs, or dynamic URLs that may include security tokens and the like.
For that, you could use something like:
import re

m = re.match(r'^(.*)(page=\d+&?)(.*)$', url)
if m:  # only proceed if the URL actually contained a page= parameter
    for i in range(2, count + 1):  # count + 1 so the last page is included
        self.visit_urls.append(m.group(1) + 'page=%i' % i + m.group(3))
Of course, since a URL can take so many forms, it will fall on you to make sure that the regular expression catches everything it needs to. Mine is very simple, based on the information you provided.
A very basic web crawler in Python 2:
import re, urllib

print "Enter the URL you wish to crawl.."
print 'Usage - "http://example.com/"'
myurl = raw_input("#> ")  # raw_input, not input, on Python 2

# find every href on the start page, then print every href one level deeper
for i in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I):
    # note: relative links will fail here; urlparse.urljoin(myurl, i) would fix that
    for ee in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(i).read(), re.I):
        print ee
Background: I am pretty new to Python and decided to practice by making a web scraper for https://www.marketwatch.com/tools/markets/stocks/a-z, which would allow me to pull the company name, ticker, origin, and sector. I could then use this in another scraper to combine it with more complex information. The page is separated by two indexing methods: one to select the first letter of the company name (at the top of the page) and another for the number of pages within that letter index (at the bottom of the page). These two tags both have the class="pagination" identifier, but when I scrape based on that criterion, I get two separate strings rather than a comma-delimited list.
Does anyone know how to get the strings as a list, or individually? I really only care about the second one.
from bs4 import BeautifulSoup
import requests
# open the source code of the website as text
source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')
for tags in soup.find_all('ul', class_='pagination'):
    tags_text = tags.text
    print(tags_text)
Which returns:
0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther
«123»
When I try to split on /n:
tags_text = tags.text.split('/n')
print(tags_text)
The return is:
['\n0-9ABCDEFGHIJKLMNOPQRSTUVWX (current)YZOther']
['«123»']
Neither seems to form a list. I have found many ways to get the first string, but I really only need the second.
Also, please note that I am using the X index as my current tab. If you build it from scratch, you might have more numbers in the second string, and the word (current) might be in a different place in the first string.
THANK YOU!!!!
Edit:
I cleaned old, commented-out code from the source and realized I did not show the result of trying to index the second element, despite the lack of a comma in the split example:
tags_text = tags.text.split('/n')[1]
print(tags_text)
Returns:
File, "C:\.....", line 22, in <module>
tags_text = tags.text.split('/n')[1]
IndexError: list index out of range
Never mind, I was using print() when I should have been using return, .append(), or another way to actually do something with the value...
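For anyone who lands here later, a minimal sketch of pulling out just the second pagination block as a proper list (this assumes the page still renders exactly two ul elements with class "pagination"; the index and the sample output are illustrative, not verified against the live site):

from bs4 import BeautifulSoup
import requests

source = 'https://www.marketwatch.com/tools/markets/stocks/a-z/x'
page = requests.get(source).text
soup = BeautifulSoup(page, 'lxml')

# find_all returns a list of tags, so [1] is the second pagination block
pagination = soup.find_all('ul', class_='pagination')[1]

# read each list item's text individually instead of one merged string
page_numbers = [li.get_text(strip=True) for li in pagination.find_all('li')]
print(page_numbers)  # e.g. ['«', '1', '2', '3', '»']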
I'm using urllib to parse a URL, but I want it to take input from a text box so I can feed in multiple URLs whenever I need to, instead of changing the code to parse just one URL. I tried using tkinter, but I couldn't figure out how to get urllib to grab the input from it.
You haven't provided much information on your use case, but let's pretend you have multiple URLs already and that part is working.
def retrieve_input(list_of_urls):
    for url in list_of_urls:
        # do parsing as needed
        pass
Now if you wanted to have a way to get more than one URL and put them in a list, maybe you would do something like:
list_of_urls = []
while True:
    url = input('What is your URL? ')
    if url != 'Stop':
        list_of_urls.append(url)
    else:
        break
With that example you would probably want to validate the input more, but it gives you the idea. If you are expecting to get help with the tkinter portion, you'll need to provide more information and examples of what you have tried, your expected input (and method), and expected output; that said, a rough sketch follows.
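Purely as a starting point, a minimal sketch of collecting URLs from a tkinter text box into the same list (the widget layout and button labels are illustrative assumptions, not taken from your code):

import tkinter as tk

list_of_urls = []

def add_url():
    # read whatever is currently typed in the text box and store it
    url = entry.get()
    if url:
        list_of_urls.append(url)
        entry.delete(0, tk.END)  # clear the box for the next URL

root = tk.Tk()
entry = tk.Entry(root, width=50)
entry.pack()
tk.Button(root, text="Add URL", command=add_url).pack()
tk.Button(root, text="Done", command=root.destroy).pack()
root.mainloop()

# after the window closes, hand the collected URLs to your parsing code
# retrieve_input(list_of_urls)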
I can successfully get the URL with my technique, but the point is that I need to change it slightly, to "http://www.example.com/static/p/no-name-0330-227404-1.jpg", whereas in the img tag I get this link: "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg".
HTML CODE:
<div class="swiper-wrapper"><img data-error-placeholder="PlaceholderPDP.jpg" class="swiper-lazy swiper-lazy-loaded" src="http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"></div>
Python Code:
imagesList = []
imagesList.append([re.findall(re.compile(u'http.*?\.jpg'), etree.tostring(img).decode("utf-8")) for img in productTree.xpath('//*[@class="swiper-wrapper"]/img')])
print(imagesList)
output:
[['http://www.example.com/static/p/no-name-8143-225244-1-product.jpg']]
NOTE: I need to remove "-product" from the URL, and I have no idea why this URL is inside two sets of square brackets.
If you intend to remove just the "-product" part, you can simply use the .replace() API; otherwise you can construct a regular expression to manipulate the string. (As for the double square brackets: re.findall returns a list, and you append the whole list comprehension, which is itself a list, to imagesList, so the result is nested twice.) Below is an example of the replace approach.
myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myURL = myURL.replace("-product", "")  # gives you "http://www.example.com/static/p/no-name-0330-227404-1.jpg"
print(myURL)
Regular expression version (probably not as clean a solution, in that it is harder to understand). However, it is better than the first approach because it dynamically discards the last -word (e.g. -product).
What it does is capture three parts of the URL, omit the middle part (the -product bit), and combine parts 1 and 3 to form your URL:
import re

myURL = "http://www.example.com/static/p/no-name-0330-227404-1-product.jpg"
myPattern = r"(.*)(-.*)(\.jpg)$"  # group 1 is greedy, so group 2 is only the last "-word"
pattern = re.compile(myPattern)
match = re.search(pattern, myURL)
print(match.group(1) + match.group(3))
Same output as above:
http://www.example.com/static/p/no-name-0330-227404-1.jpg
If all the images have the word "product" in them, could you just do a simple string replace and remove just that word? Whatever you are trying to do (including renaming files), I see that as the simplest solution.
I have the following URL: 'http://www.alriyadh.com/file/278?&page=1'
I would like to write a regex that matches the URLs from page=2 through page=12.
For example, this URL is needed, 'http://www.alriyadh.com/file/278?&page=4', but not page=14.
I reckon what will work is a function that iterates over the 11 specified pages to access all the URLs within them. I have tried this regex, but it does not work:
'.*?=[2-9]'
My aim is to get the content from those urls using newspaper package. I simply want this data for my research
Thanks in advance
This does not require a regex; a simple preset loop will do.
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.alriyadh.com/file/278?&page='
for page in range(2, 13):  # pages 2 through 12
    html = requests.get(url + str(page)).text
    soup = bs(html, 'html.parser')  # extract whatever you need from soup here
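Since the stated aim is to pull the content with the newspaper package, here is a minimal sketch of feeding it each page URL (this assumes newspaper is installed and that its extractor copes with these pages; Article, download(), parse(), title and text are its standard interface):

from newspaper import Article

base = 'http://www.alriyadh.com/file/278?&page='
for page in range(2, 13):
    article = Article(base + str(page))
    article.download()  # fetch the HTML
    article.parse()     # extract the title and body text
    print(article.title)
    print(article.text[:200])  # first 200 characters of the content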
Here's a regex to access the proper range (i.e. 2-12):
([2-9]|1[012])
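On its own, though, that pattern will also match single digits inside larger numbers, so in practice you would anchor it to the page parameter. A sketch of filtering candidate URLs that way (the sample list is made up for illustration):

import re

candidates = ['http://www.alriyadh.com/file/278?&page=4',
              'http://www.alriyadh.com/file/278?&page=12',
              'http://www.alriyadh.com/file/278?&page=14']

# anchor with page= and $ so "page=14" is not matched via its leading 1
pattern = re.compile(r'page=([2-9]|1[012])$')
wanted = [u for u in candidates if pattern.search(u)]
print(wanted)  # the page=14 URL is filtered out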
Judging by what you have now, I am unsure that your regex will work as you intend it to. Perhaps I am misinterpreting it altogether, but is the '?=' intended to be a lookahead? Or are you actually searching for a '?' immediately followed by an '=' immediately followed by any digit 2-9?
How familiar are you with regexes in general? This particular one seems dangerously vague to produce a meaningful match.
Many search engines track clicked URLs by adding the result's URL to the query string which can take a format like: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask
In the above example the result URL is part of the query string but in some cases it takes the form http://www.example.com/http://www.stackoverflow.com/questions/ask or URL encoding is used.
The approach I tried first is to split on the protocol: searchengineurl.split("http://"). Some obvious problems with this:
it would return all parts of the query string that follow the result URL and not just the result URL. This would be a problem with an URL like this: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None
it does not distinguish between any additional parts of the search engine tracking URL's query string and the result URL's query string. This would be a problem with an URL like this: http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
it fails if the "http://" is omitted from the result URL
What is the most reliable, general and non-hacky way in Python to extract URLs contained in other URLs?
I would try using urlparse.urlparse; it will probably get you most of the way there, and a little extra work on your end will get what you want.
This works for me.
from urlparse import urlparse
from urllib import unquote

urls = ["http://www.example.com/http://www.stackoverflow.com/questions/ask",
        "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask&showauthor=False&display=None",
        "http://www.example.com/result?track=http://www.stackoverflow.com/questions/ask?showauthor=False&display=None",
        "http://www.example.com/result?track=http%3A//www.stackoverflow.com/questions/ask%3Fshowauthor%3DFalse%26display%3DNone"]

def clean(url):
    # case 1: the result URL is embedded directly in the path
    path = urlparse(url).path
    index = path.find("http")
    if index != -1:
        return path[index:]
    # case 2: the result URL is inside the query string
    query = urlparse(url).query
    index = query.index("http")
    query = query[index:]
    index_questionmark = query.find("?")
    index_ampersand = query.find("&")
    if index_ampersand == -1:
        return unquote(query)  # no trailing parameters at all
    elif index_questionmark == -1 or index_questionmark > index_ampersand:
        return unquote(query[:index_ampersand])  # the "&..." belongs to the tracker
    else:
        return unquote(query)  # the result URL carries its own query string

for url in urls:
    print clean(url)
> http://www.stackoverflow.com/questions/ask
> http://www.stackoverflow.com/questions/ask
> http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
> http://www.stackoverflow.com/questions/ask?showauthor=False&display=None
I don't know about Python specifically, but I would use a regular expression to get the (key=value) parts of the query string, with something like...
(?:\?|&)[^=]+=([^&]*)
That captures the "value" parts. I would then decode those and check them against another pattern (probably another regex) to see which one looks like a URL. I would just check the first part, then take the whole value. That way your pattern doesn't have to account for every possible type of URL (and presumably they didn't combine the URL with something else within a single value field). This should work with or without the protocol being specified (it's up to your pattern to determine what looks like a URL).
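A rough Python 2 sketch of that idea (the "looks like a URL" test here is an illustrative assumption; tighten it for your data):

import re
from urllib import unquote  # on Python 3: from urllib.parse import unquote

def extract_result_url(tracker_url):
    # pull every value out of the query string
    values = re.findall(r'(?:\?|&)[^=]+=([^&]*)', tracker_url)
    for value in values:
        decoded = unquote(value)
        # crude check on the first part of the value only
        if re.match(r'(?:https?://|www\.)', decoded):
            return decoded
    return None

print extract_result_url("http://www.example.com/result?track=http%3A//www.stackoverflow.com/questions/ask")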
As for the second type of URL... I don't think there is a non-hacky way to parse that. You could URL-decode the entire URL, then look for the second instance of http:// (or https://, and/or any other protocols you might run across). You would have to decide whether any query strings are part of "your" URL or the tracker URL. You could also not decode the URL and attempt to match on the encoded values. Either way will be messy, and if they don't include the protocol it will be even worse! If you're working with a set of specific formats, you could work out good rules for them... but if you just have to handle whatever they happen to throw at you... I don't think there's a reliable way to handle the second type of embedding.