This question already has answers here:
Parsing HTML using Python
(7 answers)
Closed 2 years ago.
I'm trying to do a little bit of HTML parsing in Python, which I'm horrible at, to be quite honest. I've been up googling ways to do this but can't get anything to work. Here is my situation. I have a web page that has a BUNCH of links to downloads. What I want to do is specify a search string, and if the string I am searching for is there, download the file. But it needs to get the entire file name. For example, if I am searching for game-1 and the name of the actual game is game-1-something-else, I want it to download game-1-something-else. I have already used the following code to obtain the source of the page:
import urllib2
file = urllib2.urlopen('http://www.example.com/my/example/dir')
dload = file.read()
This grabs the entire source code of the webpage, which is just a directory listing by itself. For example, I have tons of tags: <a href tags, <td> tags, etc. I want to strip the tags so all I have is a list of the files in the directory on the web page, then I want to use a regular expression or something similar to search for what I am searching for, take the entire file name, and download it.
Once you have the HTML data, parse it and then you can make selections of nodes within the page:
import lxml.html
tree = lxml.html.fromstring(dload)
for node in tree.xpath('//a'):
    print node.get('href')
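From there, a regex over the listing can recover the full file names that contain the search string. A minimal sketch (the href-extraction regex and the sample markup are assumptions about what the directory page looks like), after which you would urlopen the base URL plus each matched name to download it:

```python
import re

def matching_files(listing_html, search):
    # Pull every href out of the directory listing and keep the ones
    # containing the search string, preserving the full file name.
    return [name for name in re.findall(r'href="([^"]+)"', listing_html)
            if search in name]

# Sample directory listing; in the question this would be `dload`.
listing = ('<a href="game-1-something-else.zip">game</a>'
           '<a href="other-game.zip">other</a>')
print(matching_files(listing, 'game-1'))
```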
I am quite new to python and webscraping and I am trying to pull the following text ($1.74), and all the other relevant odds on the page from a website:
HTML text that I am trying to pull
For similar situations previously I have been successful by using a for loop inside another for loop, but on those occasions I was searching by 'class'. I cannot search by class here as there are a lot of other 'td's that have the same class type, and not the odds that I want. Here I would like to (and I am not sure if it is possible) search via 'data-bettype'. The reason I am trying to search via that, and not 'data compid data-bettype', is that when I print out the full HTML in python, it looks like so:
HTML printed to Python
The relevant part of my code here is:
soup_playup = BeautifulSoup(source_playup, 'lxml')
#print(soup_playup.prettify())
for odds_a in soup_playup.find_all('td',{'data-bettype','Awin'}):
for odds in odds_a.find_all('div'):
print(odds.text)
I am not receiving any errors when I run this code, but it seems as though it just will not find the text.
The correct format for looking up attributes is a dictionary of key-value pairs like so:
soup_playup.find_all('td',attrs={'data-bettype':'Awin'})
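A small self-contained reproduction of the fix (the markup here is made up to resemble the page; only the cell whose data-bettype attribute equals 'Awin' should match):

```python
from bs4 import BeautifulSoup

# Made-up markup resembling the page in the question.
html = ('<td data-bettype="Awin"><div>$1.74</div></td>'
        '<td data-bettype="Bwin"><div>$2.10</div></td>')
soup = BeautifulSoup(html, 'html.parser')

odds_values = []
# attrs takes a dict of attribute -> value, not a set.
for odds_a in soup.find_all('td', attrs={'data-bettype': 'Awin'}):
    for odds in odds_a.find_all('div'):
        odds_values.append(odds.text)

print(odds_values)  # only the Awin cell matches
```

Passing `{'data-bettype', 'Awin'}` (with a comma) builds a set, which BeautifulSoup does not interpret as an attribute filter, so nothing matched.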
This question already has answers here:
Using python Requests with javascript pages
(6 answers)
Closed 3 years ago.
I am attempting to download many dot-bracket notations of RNA sequences from a url link with Python.
This is one of the links I am using: https://rnacentral.org/rna/URS00003F07BD/9606. To navigate to what I want, you have to click on the '2D structure' button, and only then does the thing I am looking for (right below the occurrence of this tag)
<h4>Dot-bracket notation</h4>
appear in the Inspect Element tab.
When I use the get function from the requests package, the text and content fields do not contain that tag. Does anyone know how I can get the bracket notation item?
Here is my current code:
import requests
url = 'http://rnacentral.org/rna/URS00003F07BD/9606'
response = requests.get(url)
print(response.text)
The requests library does not render JavaScript. You need a web browser-based solution like selenium. I have listed pseudo-code below.
Use selenium to load the page.
Then click the '2D structure' button using selenium.
Wait for some time by adding a time.sleep().
Then read the page source using selenium.
You should get what you want.
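A sketch of those steps (the button locator and the markup around the notation are assumptions; inspect the real page to confirm them):

```python
import re
import time

def extract_dot_bracket(page_source):
    # Grab the text following the <h4>Dot-bracket notation</h4> heading.
    # Assumes the notation sits in a <pre> right after the heading.
    match = re.search(
        r'<h4>Dot-bracket notation</h4>\s*<pre[^>]*>([^<]+)</pre>',
        page_source)
    return match.group(1).strip() if match else None

def fetch_rendered_source(url):
    # selenium and a browser driver are required here; imported lazily
    # so extract_dot_bracket() is usable without them.
    from selenium import webdriver
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # The 'link text' locator is an assumption about the button's markup.
        driver.find_element('link text', '2D structure').click()
        time.sleep(5)  # crude wait for the JavaScript to render
        return driver.page_source
    finally:
        driver.quit()

if __name__ == '__main__':
    html = fetch_rendered_source('https://rnacentral.org/rna/URS00003F07BD/9606')
    print(extract_dot_bracket(html))
```

An explicit WebDriverWait on the target element is more reliable than a fixed time.sleep(), but the sleep keeps the sketch short.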
This question already has answers here:
Strip HTML from strings in Python
(28 answers)
Closed 8 years ago.
I'm using the jinja2 templating engine to create both HTML emails and their plaintext alternative that I then send out using Sendgrid. Unfortunately for my lazy self, this entails me writing and maintaining two separate templates with essentially the same content, the .html file and the .txt file. The .txt file is identical to the HTML file other than containing no HTML tags.
Is there any way to simply have the HTML template and then somehow dynamically generate the txt version, essentially just stripping the HTML tags? I know a regex could achieve this, but I also know that implementing a regex to deal with HTML tags is notoriously gotcha-ridden.
I used this trick (Python 2; htmllib was removed in Python 3) to get text out of HTML even if the HTML is broken:
text = get_some_html()
import StringIO, htmllib, formatter
io = StringIO.StringIO()
htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter(io))).feed("<pre>"+text+"</pre>")
text = io.getvalue()
If you are sure your HTML is well-formed, you don't need those <pre> tags.
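On Python 3, where htmllib and formatter are gone, the same tag-stripping can be done with the standard library's html.parser. A small sketch:

```python
from html.parser import HTMLParser
from io import StringIO

class TagStripper(HTMLParser):
    """Collects only the text nodes of the fed HTML, dropping all tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.buf = StringIO()

    def handle_data(self, data):
        self.buf.write(data)

    def get_text(self):
        return self.buf.getvalue()

def strip_tags(html):
    stripper = TagStripper()
    stripper.feed(html)
    return stripper.get_text()

print(strip_tags("<p>Hello, <b>world</b>!</p>"))  # Hello, world!
```

Like the htmllib trick above, this tolerates somewhat broken HTML, since html.parser does not require well-formed input.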
I am making a download manager, and I want it to check the md5 hash of a file after downloading it from a URL. The hash is published on the page. It needs to compute the md5 of the file (this is done), then search the whole contents of the HTML page for a matching hash string.
My question is: how do I make Python return the whole contents of the HTML and find a match for my "md5 string"?
The requests lib is what you want to use; it will save you lots of trouble.
Import urllib and use urllib.urlopen to get the contents of the HTML. Import re to search for the hash code using a regex. You could also use the string's find method instead of a regex.
If you encounter problems, then you can ask more specific questions. Your question is too general.
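Putting those pieces together, a minimal sketch (the URL and file name are hypothetical; on Python 3, urllib2/urllib.urlopen becomes urllib.request.urlopen):

```python
import hashlib
import re
import urllib.request  # Python 3; the answer's urllib.urlopen is Python 2

def md5_of_file(path):
    # Hash the downloaded file in chunks so large files fit in memory.
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(8192), b''):
            digest.update(chunk)
    return digest.hexdigest()

def page_contains_hash(page_html, md5_hex):
    # An md5 is a 32-character hex string; just look for it anywhere
    # in the page text, case-insensitively.
    return re.search(re.escape(md5_hex), page_html, re.IGNORECASE) is not None

if __name__ == '__main__':
    # Hypothetical URL and local file name.
    html = urllib.request.urlopen('http://example.com/downloads.html') \
        .read().decode('utf-8', 'replace')
    print(page_contains_hash(html, md5_of_file('downloaded.bin')))
```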
How would I look for all URLs on a web page and then save them to individual variables with urllib2 in Python?
Parse the HTML with an HTML parser, find all <a> tags (e.g. using Beautiful Soup's findAll() method) and check their href attributes.
If, however, you want to find all URLs in the page even if they aren't hyperlinks, then you can use a regular expression which could be anything from simple to ridiculously insane.
You don't do it with urllib2 alone. What you are looking for is parsing URLs in a web page.
You get your first page using urllib2, read its contents, and then pass it through a parser like BeautifulSoup, or, as the other poster explained, you can use a regex to search the contents of the page too.
You could simply download the raw HTML with urllib2, then simply search through it. There might be easier ways, but you could do this:
1: Download the source code.
2: Split it into a list using string methods.
3: Check the first 7 characters of each section.
4: If the first 7 characters are http://, write that to a variable.
Why do you need separate variables though? Wouldn't it be easier to save them all to a list, using list.append(URL_YOU_JUST_FOUND), every time you find another URL?
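The steps above can be collapsed into one regex pass that collects the matches into a list. A rough sketch (the pattern is deliberately naive and will not handle every URL form):

```python
import re

def find_urls(html):
    # Collect substrings that look like http(s) URLs: stop at whitespace,
    # quotes, or angle brackets. Fine for a quick script, not robust parsing.
    return re.findall(r"https?://[^\s\"'<>]+", html)

sample = '<a href="http://example.com/a">a</a> and http://example.org/b here'
print(find_urls(sample))  # ['http://example.com/a', 'http://example.org/b']
```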