I was wondering how I can search the data loaded from a page with .read(), and then reuse a value captured by that search as a variable inside a while loop. Here is what I wrote:
match2 = re.search(r"=profile\.php\?id=" + str(link) + ">(.+?)</a>, <a href=profile\.php\?id=(.+?)>", home)
The page basically just lists all the user profiles, and I'm trying to make the script read each one and view that profile; pretty simple, except that I can't get the user id set as a variable from link = match2.group(2) to feed back into the search.
This isn't a direct answer, but I strongly suggest you look at Beautiful Soup. It's an HTML parser that lets you search for values in a much more structured manner. In this case, you could loop across all the items in the list of user profiles, and extract the information you want out of each one in turn.
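For example, a minimal sketch of that approach, assuming the bs4 package is installed, that home holds the HTML returned by .read(), and that the profile links look like the ones in your regex:

import re
from bs4 import BeautifulSoup

home = '<a href=profile.php?id=101>Alice</a>, <a href=profile.php?id=102>Bob</a>'  # placeholder for the page you read()

soup = BeautifulSoup(home, "html.parser")
# Find every anchor whose href points at a profile page
for anchor in soup.find_all("a", href=re.compile(r"profile\.php\?id=")):
    user_id = anchor["href"].split("id=")[1]  # the id after "id="
    name = anchor.get_text()                  # the link text (the user's name)
    print(user_id, name)

From there you can fetch each profile page in turn using the extracted user_id.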
I would like to scrape multiple URLs, but they are of a different nature, such as different company websites with different HTML back ends. Is there a way to do it without coming up with customised code for each URL?
I understand that I can put multiple URLs into a list and loop over them.
I fear not, but I am not an expert :-)
I could imagine that it depends on the complexity of the structures. If you want to find the text "Test" on every website, I could imagine that soup.body.findAll(text='Test') would return all occurrences of "Test" on the website.
I assume you're aware of how to loop through a list here, so that you'd loop through the list of URLs and for each one check whether the searched string occurs (maybe you are looking for something else, e.g. an "apply" button or "login"?).
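A rough sketch of that idea, assuming the requests and bs4 packages and placeholder URLs and search text:

import requests
from bs4 import BeautifulSoup

urls = ["https://example.com/page1", "https://example.org/page2"]  # placeholder list of URLs

for url in urls:
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    # find_all with a text argument returns every occurrence of that exact string
    hits = soup.find_all(text="Test")
    print(url, "->", len(hits), "occurrence(s) of 'Test'")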
all the best,
I'm writing a Python script to scrape an online shopping website.
Every item on this website has an item number, and after inserting an item number into the search box I'm redirected to the item page.
When I looked at the URL of this page there was no clue about the item number in it that I could replace with another number, which would have let me go directly to an item page without going through the website portal first.
Any clue how to construct this URL?
Is this a general case, or does it depend on the website?
Say my website is eBay. To reach this page searching for cisco 262 on eBay, there are two ways:
open eBay and then insert cisco 262 into the search box
use this URL: cisco 262 search result on ebay
As we can see from the URL, we can replace "cisco++262" with whatever we want to search for, so we can go directly to the search results without first going to eBay's main page and typing the query into the search box.
My question is that it's not always clear in the URL where you can put what you want to search for so that you can go directly to its page,
so is there any way to work out how to construct the URL when it's not obvious?
Update
here is the base URL of the website I want to scrape
and here is the page URL after inserting the value "CHG2020324" into its search box
another URL after inserting "CHG2022230" into the search box
So, as you can see, there is no clue where to put the item number so that we can reconstruct the URL... any help with inspecting or constructing the URL?
If I understand your question correctly, you want to go straight to the search result pages using a series of search strings. If so, then - at least in the case of eBay (it will likely be different for each site) - you can use f-strings together with a base URL to achieve that:
base_url = 'https://www.ebay.com/sch/i.html?_from=R40&_nkw='
search_strings = ['ryzen', 'cisco 262']  # examples of single-word and phrase search strings

for srchstr in search_strings:
    src = srchstr.replace(' ', '+')  # note that eBay represents a phrase as "word1+word2"; other sites will do it differently
    print(f'{base_url}"{src}"')
Output:
https://www.ebay.com/sch/i.html?_from=R40&_nkw="ryzen"
https://www.ebay.com/sch/i.html?_from=R40&_nkw="cisco+262"
And these take you to the search result pages for the respective search strings.
"minus" sign doesn't fit because the list consists of ~2000 entries.
I'm just beginner in python, so, please explain as to 5-year old, if possible
Thank you much in advance!
Presumably you are fetching the Google search results from a Python program. So you can exclude the web pages in your list in your Python program as you read the results, instead of trying to make Google do it for you. You can use a functional programming technique like calling filter for this.
Ideally you would do this by comparing the URLs of the links, but if you were willing to sacrifice accuracy you could do it by comparing the titles of the links instead, if you only had titles in your list and not URLs. But URLs are definitely better for this purpose.
So you could parse the Google search results using a library like Beautiful Soup, extract the URLs of the links, and filter out (using filter) the ones that were equal to any of the URLs on your list (you could define a function using def, for checking whether a given URL is on your list). You'll have to be careful though because sometimes Google search result links go via a Google website which redirects to the real URL, for ranking purposes.
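As a rough sketch of that idea - the exclusion list, the results HTML, and the assumption that result links are plain <a href="..."> tags are all placeholders here, and Google's real markup changes often:

from bs4 import BeautifulSoup

# Placeholders: in reality the set comes from your ~2000-entry list
# and the HTML comes from the fetched results page
excluded_urls = {"https://example.com/skip-me", "https://example.org/also-skip"}
results_html = '<a href="https://example.com/skip-me">x</a> <a href="https://example.net/keep">y</a>'

def keep(url):
    # True if the URL is NOT on the exclusion list
    return url not in excluded_urls

soup = BeautifulSoup(results_html, "html.parser")
all_links = [a["href"] for a in soup.find_all("a", href=True)]
wanted = list(filter(keep, all_links))
print(wanted)  # ['https://example.net/keep']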
I'm new to software development, and I'm not sure how to go about this. I want to visit every page of a website and grab a specific bit of data from each one. My problem is, I don't know how to iterate through all of the existing pages without knowing the individual urls ahead of time. For example, I want to visit every page whose url starts with
"http://stackoverflow.com/questions/"
Is there a way to compile a list and then iterate through that, or is it possible to do this without creating a giant list of urls?
Try Scrapy.
It handles all of the crawling for you and lets you focus on processing the data, not extracting it. Instead of copy-pasting the code already in the tutorial, I'll leave it to you to read it.
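Just for a flavour of what that looks like, here is a minimal sketch of a Scrapy spider - the selectors and the "/questions/" prefix are placeholders, and the tutorial covers the real details:

import scrapy

class QuestionsSpider(scrapy.Spider):
    name = "questions"
    start_urls = ["http://stackoverflow.com/questions/"]

    def parse(self, response):
        # Extract whatever bit of data you care about from the current page
        yield {"url": response.url, "title": response.css("title::text").get()}

        # Follow links to further question pages; Scrapy de-duplicates visited URLs for you
        for href in response.css("a::attr(href)").getall():
            if href.startswith("/questions/"):
                yield response.follow(href, callback=self.parse)

Saved as spider.py, this can be run with scrapy runspider spider.py -o items.json.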
To grab a specific bit of data from a web site you could use a web scraping tool, e.g., Scrapy.
If the required data is generated by JavaScript, then you might need a browser-like tool such as Selenium WebDriver and to implement the crawling of the links by hand.
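A rough sketch with Selenium WebDriver - the URL is a placeholder, and you need a matching browser driver installed:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # or webdriver.Chrome(); the driver binary must be on your PATH
try:
    driver.get("http://stackoverflow.com/questions/")
    # Collect the rendered links after any JavaScript has run
    for link in driver.find_elements(By.CSS_SELECTOR, "a"):
        href = link.get_attribute("href")
        if href and "/questions/" in href:
            print(href)
finally:
    driver.quit()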
For example, you can make a simple for loop, like this:
def webIterate():
    base_link = "http://stackoverflow.com/questions/"
    for i in range(24):
        print("%s%d" % (base_link, i))
The output will be:
http://stackoverflow.com/questions/0
http://stackoverflow.com/questions/1
http://stackoverflow.com/questions/2
...
http://stackoverflow.com/questions/23
It's just an example. You can pass in the question numbers and do whatever you want with them.
Of course an HTML page can be parsed using any number of python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc.
I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of.
Try the Beautiful Soup library for Python. It has very simple methods to extract information from an html file.
Trying to generically extract data from webpages would require people to write their pages in a similar way... but there's an almost infinite number of ways to convey a page that looks identical, let alone all the combinations you can have to convey the same information.
Was there a particular type of information you were trying to extract or some other end goal?
You could try extracting any content in 'div' and 'p' markers and comparing the relative sizes of all the information in the page. The problem then is that people probably group information into collections of 'div's and 'p's (or at least they do if they're writing well-formed HTML!).
Maybe if you formed a tree of how the information is related (nodes would be the 'p' or 'div' or whatever, and each node would contain the associated text), you could do some sort of analysis to identify the smallest 'p' or 'div' that encompasses what appears to be the majority of the information?
[EDIT] Maybe if you can get it into the tree structure I suggested, you could then use a similar points system to spam assassin. Define some rules that attempt to classify the information. Some examples:
+1 points for every 100 words
+1 points for every child element that has > 100 words
-1 points if the section name contains the word 'nav'
-2 points if the section name contains the word 'advert'
If you have lots of low-scoring rules which add up when you find more relevant-looking sections, I think that could evolve into a fairly powerful and robust technique.
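A very rough sketch of that scoring idea, using bs4 for the tree and only the example rules above (the thresholds and the tiny sample HTML are just placeholders):

from bs4 import BeautifulSoup

def score(element):
    # Apply the example rules above to one element of the parse tree
    points = 0
    words = len(element.get_text().split())
    points += words // 100                              # +1 point for every 100 words
    for child in element.find_all(["div", "p"], recursive=False):
        if len(child.get_text().split()) > 100:
            points += 1                                 # +1 per child element with > 100 words
    section_name = " ".join(element.get("class", [])) + " " + (element.get("id") or "")
    if "nav" in section_name.lower():
        points -= 1                                     # -1 if the section name contains 'nav'
    if "advert" in section_name.lower():
        points -= 2                                     # -2 if the section name contains 'advert'
    return points

html = "<div id='content'><p>" + "word " * 250 + "</p></div><div class='nav'>menu</div>"
soup = BeautifulSoup(html, "html.parser")
best = max(soup.find_all("div"), key=score)             # keep the highest-scoring block
print(best.get("id"))                                   # 'content'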
[EDIT2] Looking at readability, it seems to be doing pretty much exactly what I just suggested! Maybe it could be improved to try and understand tables better?
Have a look at templatemaker: http://www.holovaty.com/writing/templatemaker/
It's written by one of the founders of Django. Basically you feed it a few example html files and it will generate a "template" that you can then use to extract just the bits that are different (which is usually the meaningful content).
Here's an example from the google code page:
# Import the Template class.
>>> from templatemaker import Template
# Create a Template instance.
>>> t = Template()
# Learn a Sample String.
>>> t.learn('<b>this and that</b>')
# Output the template so far, using the "!" character to mark holes.
# We've only learned a single string, so the template has no holes.
>>> t.as_text('!')
'<b>this and that</b>'
# Learn another string. The True return value means the template gained
# at least one hole.
>>> t.learn('<b>alex and sue</b>')
True
# Sure enough, the template now has some holes.
>>> t.as_text('!')
'<b>! and !</b>'
You might use the boilerpipe Web application to fetch and extract content on the fly.
(This is not specific to Python, as you only need to issue an HTTP GET request to a page on Google AppEngine.)
Cheers,
Christian
What is meaningful and what is not depends on the semantics of the page. If the semantics are crappy, your code won't "guess" what is meaningful. I use readability, which you linked in the comment, and I see that on many pages I try to read it does not provide any result, let alone a decent one.
If someone puts the content in a table, you're doomed. Try readability on a phpBB forum and you'll see what I mean.
If you want to do it, go with a regexp on <p></p>, or parse the DOM.
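For the regexp route, a quick-and-dirty sketch (brittle on nested or malformed markup, which is exactly why parsing the DOM is usually the better option):

import re

html = "<div>nav stuff</div><p>First paragraph.</p><p>Second paragraph.</p>"
# Grab everything between <p ...> and </p>, non-greedy, across newlines
paragraphs = re.findall(r"<p[^>]*>(.*?)</p>", html, re.DOTALL | re.IGNORECASE)
print(paragraphs)  # ['First paragraph.', 'Second paragraph.']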
Goose is just the library for this task. To quote their README:
Goose will try to extract the following information:
Main text of an article
Main image of article
Any Youtube/Vimeo movies embedded in article
Meta Description
Meta tags
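Usage is short; a sketch along these lines, assuming the python-goose package and a placeholder article URL (check the README for the exact import path of the release you install):

from goose import Goose  # python-goose; newer forks use `from goose3 import Goose`

g = Goose()
article = g.extract(url="http://example.com/some-news-story")  # placeholder URL

print(article.title)            # headline
print(article.cleaned_text)     # main text of the article
print(article.meta_description) # meta description, if present
print(article.top_image)        # main image object, if one was found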