How to search for links in a given page with Bash, Python, or any other popular scripting language [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Want to improve this question? Add details and clarify the problem by editing this post.
Closed 5 years ago.
Given an HTTP/HTTPS page, I would like to search for some links on that page. Does anyone know how to achieve this with Bash, Python, or any other popular scripting language?

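Try this in Python. It prints every <a> tag that has an href attribute:
import requests
from bs4 import BeautifulSoup as soup
# Fetch the page and parse it; 'Your link' is a placeholder for the page URL.
page = soup(requests.get('Your link').content, 'html.parser')
print(page.find_all('a', href=True))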

You should use Beautiful Soup. It's an HTML parser library for Python. You'll look for <a> tags and read the href attribute of each one.
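If installing Beautiful Soup is not an option, the same idea (walk the <a> tags and read each href) can be sketched with only the standard library. This is an illustration rather than a replacement for the answers above: LinkCollector, extract_links, and the sample page are hypothetical names and data, not part of the original answers.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in the given HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical sample page, used only for illustration.
sample_html = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
  <a name="top">No link here</a>
</body></html>
"""
links = extract_links(sample_html, "https://example.com/index.html")
print(links)
```

For a live page, the HTML string would come from an HTTP request (for example requests.get(url).text) instead of sample_html.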

Related

Trying to webscrape but my code always returns None or [] [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 days ago.
I'm trying to scrape data from Glassdoor, but Beautiful Soup can't read the data at all.
So far I've tried this:
import requests
from bs4 import BeautifulSoup
html_text=requests.get('https://www.glassdoor.co.in/Job/data-analyst-jobs-SRCH_KO0,12.htm?fromAge=7').text
soup1=BeautifulSoup(html_text,'lxml')
soup2=soup1.prettify()
jobs=soup1.find_all('li',class_='react-job-listing css-7x0jr eigr9kq3')
print(jobs)
I've seen solutions using Selenium, but is there any other way to get the actual data? I've tried this for the 'ul' class, the 'li' class, and so on, but nothing seems to work.
There is no li tag with the class react-job-listing css-7x0jr eigr9kq3 in the HTML of that URL.
Look at the HTML of the page to find what you need to scrape.
For example, you can try an li with the attribute react-job-listing css-7x0jr eigr9kq3, which is present in the HTML page.
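One way to act on this advice is to check what the server actually returned before writing a selector. The sketch below counts tags that carry a given class token, using only the standard library. ClassCounter, count_tags_with_class, and the sample markup (including the class css-abc123) are hypothetical and chosen for illustration; they are not taken from the Glassdoor page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Record the class tokens of every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.class_lists = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            classes = dict(attrs).get("class") or ""
            # A class attribute is a space-separated list of tokens.
            self.class_lists.append(classes.split())


def count_tags_with_class(html, tag, class_name):
    """Count <tag> elements whose class list contains class_name."""
    counter = ClassCounter(tag)
    counter.feed(html)
    return sum(class_name in classes for classes in counter.class_lists)


# Hypothetical markup standing in for the downloaded page.
sample_html = """
<ul>
  <li class="react-job-listing css-abc123">Data Analyst</li>
  <li class="other">Advertisement</li>
</ul>
"""
found = count_tags_with_class(sample_html, "li", "react-job-listing")
missing = count_tags_with_class(sample_html, "li", "css-7x0jr")
print(found, missing)
```

Running the same check on html_text from the question would show whether any li with that class was served at all. A count of zero means the selector cannot match, no matter how it is written.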

HTML - How to scrape not visible elements using python? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 1 year ago.
I'm using Beautiful Soup to scrape a webpage.
I am trying to scrape data from https://painel-covid19.saude.ma.gov.br/vacinas, but the tags in my output are empty. I can see the data in Inspect Element, but not in the page source. You can see the code is hidden in . How can I retrieve it using Python? Can someone help me?
The issue isn't "not visible". The issue is that the data is being filled in by JavaScript code. You won't see the data unless you execute the JavaScript on the page. You can do that with the selenium package, which drives a real browser such as Chrome to do the rendering.
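A quick way to confirm this diagnosis is to test whether a value you can see in the browser appears in the text of the raw HTML at all. The sketch below extracts the text outside script and style tags with the standard library. VisibleText, static_text, and the sample page are hypothetical and only mimic the situation described, they are not the real page.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is not inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text a parser sees without running any JavaScript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the number only appears once the script has run.
sample_html = """
<html><body>
  <h1>Vaccines</h1>
  <div id="total"></div>
  <script>document.getElementById("total").textContent = "12345";</script>
</body></html>
"""
text = static_text(sample_html)
print(text)
print("12345" in text)
```

If the value is missing from the static text, a parser like Beautiful Soup cannot find it either, and a tool that executes JavaScript is needed.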

Using Python for web scraping [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I need to use a specific website (it translates English to my language) from my Python code. I don't want to use googletrans, because it loads a huge amount of data, and I need this to be fast. Are there any references, topics, or Python docs I can read about this?
Thanks
You might want to consider using selenium or BeautifulSoup for interacting with a website or web scraping, but if you simply want to open a website you could use the webbrowser module.
import webbrowser
Google = 'https://www.google.com/?safe=active&safe=active'
webbrowser.open(Google)
Here are some links to selenium and BeautifulSoup
https://pythonspot.com/selenium-webdriver/
https://realpython.com/beautiful-soup-web-scraper-python/
Hope this helps.

How to wait for the page to load before scraping it? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I want to extract the HTML from a webpage:
import urllib2
req = urllib2.Request('https://www.example.com')
response = urllib2.urlopen(req)
fullhtml = response.read()
I tried with "urllib2", but since the page is built dynamically, the HTML content is empty.
Is there a way to wait for the JavaScript to load?
Take a look at http://phantomjs.org/. Many websites build their content with JavaScript, which PHP or Python cannot execute on their own. I think this library will be the best you can get.
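Whichever tool renders the page, "waiting for the page to load" usually comes down to polling until some condition holds or a timeout expires. The sketch below shows that loop in plain Python. wait_until and fake_page_ready are hypothetical names, and the fake page stands in for a real browser session, so treat this as an illustration of the idea rather than the API of any particular library.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for a dynamic page: content is only available on the third check.
attempts = {"count": 0}


def fake_page_ready():
    attempts["count"] += 1
    return "<html>loaded</html>" if attempts["count"] >= 3 else None


html = wait_until(fake_page_ready, timeout=2.0, poll_interval=0.01)
print(html)
```

In practice the condition would ask the rendering tool whether a specific element exists yet, which is more reliable than sleeping for a fixed number of seconds.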

Extract the main article text from a Wikipedia page using Python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I've been searching for hours on how to extract the main text of a Wikipedia article, without all the links and references. I've tried wikitools, mwlib, BeautifulSoup and more. But I haven't really managed to.
Is there any easy and fast way for me to take the clear text (the actual article), and put it in a Python variable?
SOLUTION: Omid Raha solved it :)
You can use the wikipedia package, which is a Python wrapper for the Wikipedia API.
Here is a quick start.
First install it:
pip install wikipedia
Example:
import wikipedia
p = wikipedia.page("Python programming language")
print(p.url)
print(p.title)
content = p.content # Content of page.
Output:
http://en.wikipedia.org/wiki/Python_(programming_language)
Python (programming language)
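Once the plain text is in a variable, section headings may still be mixed into it. The sketch below assumes the text marks headings as lines such as "== History ==", which is how the content returned by this package typically looks; verify that against your own output. lead_section, strip_headings, and the sample text are hypothetical and only for illustration.

```python
import re

# A heading line: the same run of two or more '=' on both sides of a title.
HEADING = re.compile(r"^[ \t]*(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def lead_section(content):
    """Return the text before the first section heading."""
    match = HEADING.search(content)
    lead = content if match is None else content[:match.start()]
    return lead.strip()


def strip_headings(content):
    """Remove heading lines and collapse the blank lines they leave behind."""
    without_headings = HEADING.sub("", content)
    return re.sub(r"\n{3,}", "\n\n", without_headings).strip()


# Hypothetical sample in the assumed format.
sample = (
    "Python is a programming language.\n\n\n"
    "== History ==\n"
    "Python was conceived in the late 1980s.\n\n\n"
    "=== Versions ===\n"
    "Python 3.0 was released in 2008.\n"
)
lead = lead_section(sample)
body = strip_headings(sample)
print(lead)
print(body)
```

With the package from the answer, the input would be p.content instead of sample.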
