Is it possible to automatically scrape articles from websites - Python & Beautiful Soup

I'm trying to make a script to scrape one or two articles (article URLs only) from different websites. I was able to make a Python script that uses BeautifulSoup to get a website's HTML, find the site's navbar menu via its class name, and loop through each website section. The problem is that each website has a different class name or XPath for the navbar menu and its sections.
Is there a way to make the script work for multiple websites with as little human intervention as possible?
Any suggestions are more than welcome.
Thanks

Did it: I only needed Python and Selenium, one XPath for the navbar elements of each website, and another XPath for all the types of articles on the different website pages. I saved everything to a database, and the rest is just customized to our specific needs. It wasn't that complicated in the end, thanks for the help <3
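A minimal sketch of that setup, in case it helps someone else. The site entries, URLs, and XPaths below are placeholders, not the real ones; each website needs its own pair of XPaths.

    # Sketch only: SITES, its URLs, and the XPaths are placeholders.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    SITES = [
        {
            "url": "https://example.com",            # placeholder site
            "navbar_xpath": "//nav//a",              # per-site navbar XPath
            "article_xpath": "//article//a[@href]",  # per-site article XPath
        },
    ]

    driver = webdriver.Firefox()
    for site in SITES:
        driver.get(site["url"])
        sections = [a.get_attribute("href")
                    for a in driver.find_elements(By.XPATH, site["navbar_xpath"])]
        for section in sections:
            driver.get(section)
            for link in driver.find_elements(By.XPATH, site["article_xpath"]):
                print(link.get_attribute("href"))  # save to the database instead
    driver.quit()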

Related

Web scraper in Python where I provide a webpage that has a list of links which the scraper then visits individually

I am a beginner in programming and I am trying to make a scraper. Right now I'm using the requests library and BeautifulSoup. I provide the program a link, and I am able to extract any information I want from that single web page. What I am trying to accomplish is as follows: I want to provide the program a web page that is a search result with a list of links that could be clicked, have the program get the links of those search results, and then scrape some information from each of those specific pages.
If anyone can give me some sort of guidance on how I could achieve this, I would appreciate it greatly! Are there other libraries I should be using? Is there some reading material you could refer me to, maybe a video?
You can put all the URLs in a list and have your request-sending function loop through it. Use the requests or urllib package for this.
For the search logic, you would want to look for <a> tags with an href attribute.
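A rough sketch of that flow, assuming a made-up search URL and grabbing every href on the results page (a real page will usually need a narrower selector):

    # Sketch: collect links from one results page, then fetch each linked page.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    search_url = "https://example.com/search?q=python"  # the page you provide
    soup = BeautifulSoup(requests.get(search_url, timeout=10).text, "html.parser")

    links = [urljoin(search_url, a["href"]) for a in soup.find_all("a", href=True)]

    for link in links:
        page = BeautifulSoup(requests.get(link, timeout=10).text, "html.parser")
        title = page.title.string if page.title else ""
        print(link, title)  # replace with whatever you want to extract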

Python webscraping how to get only the body html

Hey, I am trying to implement a program that can get URLs from the HTML of a website, but I only want the URLs from the body. Basically, I want to avoid ads and menus on the website and only get links that are embedded in the actual article. Does anyone know a good way of isolating the body HTML from the rest of the HTML without hardcoding how the body is designated for each website?
It is a simple process to scrape only specific parts of the HTML; for the most part, you can choose which elements of the page you want. Let's say you only want <div id="example">example</div>: you can tell your scraper to pick up only that div. Please check out this example:
https://realpython.com/beautiful-soup-web-scraper-python/
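Applied to this question, a hedged sketch: the <article> tag and the fallback div id below are guesses, so inspect each page to find the element that actually wraps the body text.

    # Sketch: restrict the scrape to one container so menu/ad links are skipped.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/some-article", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    body = soup.find("article") or soup.find("div", id="content")  # guessed selectors
    if body:
        for a in body.find_all("a", href=True):  # links inside the article only
            print(a["href"])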

Is there any way to scrape a search box using BeautifulSoup/requests and then search and refresh?

I'm trying to make a program that can make a search request on most websites, such as YouTube, ESPN, my university course timetable, etc.
I have looked online for various solutions, but many of them point to simply adding your search query to the end of the URL you are "getting". That doesn't seem to work with all websites: some don't update their URL when you manually make a search, while many others might give each and every URL a unique ID. Would it be possible to scrape a search bar from any website, then specify a search query and enter it? Is there a function for that?
You need to use a Selenium instance to do that. You cannot achieve it using BeautifulSoup or requests.
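A minimal Selenium sketch; "search_query" is the name YouTube has used for its search box, and other sites will use different names, so treat both as assumptions to adjust per site.

    # Sketch: type into a site's search box and submit, then read the result URL.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Firefox()
    driver.get("https://www.youtube.com")
    box = driver.find_element(By.NAME, "search_query")  # assumed field name
    box.send_keys("python tutorials")
    box.send_keys(Keys.RETURN)  # submit like pressing Enter
    print(driver.current_url)   # results page; hand it to a parser
    driver.quit()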
It's possible to use a text-based web browser and automate the search with a script. Then you can download the site you get from this browser and scrape it with BeautifulSoup or something else.

How to use Python to make a web crawler to full-text RSS

I want to write a web crawler in Python that prints only the contents of, for example, a news article.
I tried to do it using BeautifulSoup, to print the content inside a <div> with a specific id, but every website has a different id for the 'entry' div.
I found fivefilters.org and tried to make a crawler for it, but I have two problems:
1. I don't know if it's a good idea to make a crawler for this website, because it might stop working one day.
2. I tried to print only the text from the website, but it prints the HTML markup as well. Could someone please show me how to print just the text of the page?
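One common fix, sketched below with a placeholder div id, is BeautifulSoup's get_text(), which returns an element's text with the markup stripped:

    # Sketch: print the text of the 'entry' div without the HTML tags.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/news-article", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    entry = soup.find("div", id="entry")  # placeholder id; it varies per site
    if entry:
        print(entry.get_text(separator="\n", strip=True))  # text only, no markup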

Browse links recursively using selenium

I'd like to know if it is possible to browse all links in a site (including parent links and sublinks) using Python and Selenium (example: yahoo.com):
fetch all links on the homepage,
open each one of them,
open all the links in the sublinks, down to three or four levels.
I'm using Selenium with Python.
Thanks
Ala'a
You want "web-scraping" software like Scrapy and possibly Beautifulsoup4 - the first is used to build a program called a "spider" which "crawls" through web pages, extracting structured data from them, and following certain (or all) links in them. BS4 is also for extracting data from web pages, and combined with libraries like requests can be used to build your own spider, though at this point something like Scrapy is probably more relevant to what you need.
There are numerous tutorials and examples out there to help you; just start by searching for those terms.
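To give a taste of Scrapy, here is a bare-bones spider along those lines; the start URL and the follow-every-link rule are placeholders to adapt. Save it as spider.py and run it with scrapy runspider spider.py.

    # Sketch: a spider that records each page's title and follows every link.
    import scrapy

    class SiteSpider(scrapy.Spider):
        name = "site"
        start_urls = ["https://example.com"]  # placeholder start page

        def parse(self, response):
            yield {"url": response.url, "title": response.css("title::text").get()}
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)  # keep crawling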
Sure it is possible, but you have to instruct Selenium to enter these links one by one, as you are working within one browser.
If the pages do not have their links rendered by JavaScript in the browser, it would be much more efficient to fetch them with direct HTTP requests and process them that way. In that case I would recommend using requests. However, even with requests it is up to your code to locate all the URLs on a page and follow up by fetching those pages.
There might also be other Python packages that specialize in this kind of task, but I cannot speak from real experience here.
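A sketch of the plain-requests variant described above: a breadth-first crawl limited to a fixed depth, which works only when the links are present in the raw HTML.

    # Sketch: follow links breadth-first up to max_depth, skipping seen URLs.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def crawl(start_url, max_depth=3):
        seen, frontier = {start_url}, [(start_url, 0)]
        while frontier:
            url, depth = frontier.pop(0)
            print(url)
            if depth == max_depth:
                continue
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))

    crawl("https://example.com", max_depth=3)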
