My goal here is for this Python script to open a block number page on blockchain.info and grab the correct table. That table is then searched for a range of values, and any matches are printed.
This one works on https://blockchain.info/search?search=422407, finding the associated "0.02269362":
import numpy as np
from bs4 import BeautifulSoup
import urllib2
#create a list of numbers that will be used in search
list = np.arange(0.9749999312,0.9793780897,0.000001)
#open webpage, get table
web = urllib2.urlopen("https://blockchain.info/search?search=422407").read()
#whole page
soup = BeautifulSoup(web, "lxml")
table = soup.findAll("table", {'class':'table table-striped'}) #Correct table
#Go through all numbers created; check if found; print found
for i in list:
    j = str(round((i * 0.023223),8))
    for line in table: #if you see a line in the table
        if line.get_text().find(j) > -1: #and you find the specific string
            print(line.prettify().encode("utf-8")) #print it
            print j
I'm having difficulties doing this for other blocks. The script below is supposed to go to block 422245 and find "0.02972821", but it does not print anything. Ideally it would print anything that matches [x.xxxx]yz and so on.
import numpy as np
from bs4 import BeautifulSoup
import urllib2
#create a list of numbers that will be used in search
list = np.arange(0.9749999312,0.9793780897,0.000001)
#open webpage, get table
web = urllib2.urlopen("https://blockchain.info/search?search=422245").read() #whole page
soup = BeautifulSoup(web, "lxml")
table = soup.findAll("table", {'class':'table table-striped'}) #Correct table
#Go through all numbers created; check if found; print found
for i in list:
    j = str(round((i * 0.03044589),8))
    for line in table: #if you see a line in the table
        if line.get_text().find(j) > -1: #and you find the specific string
            print(line.prettify().encode("utf-8")) #print it
            print j
When I tried testing just the finding part of the script with the code below, it also does not work. But if you go to https://blockchain.info/search?search=422245 and use your browser's find-on-page for "0.02972821", the value is there. I am confused as to why this is not working.
import numpy as np
from bs4 import BeautifulSoup
import urllib2
#create a list of numbers that will be used in search
list = np.arange(0.9749999312,0.9793780897,0.000001)
#open webpage, get table
web = urllib2.urlopen("https://blockchain.info/search?search=422245").read() #whole page
soup = BeautifulSoup(web, "lxml")
table = soup.findAll("table", {'class':'table table-striped'}) #Correct table
#Go through all numbers created; check if found; print found
j = "0.02972821"
for line in table: #if you see a line in the table
    if line.get_text().find(j) > -1: #and you find the specific string
        print(line.prettify().encode("utf-8")) #print it
        print j
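One quick check that narrows this down: test whether the string occurs in the raw HTML at all, before any table parsing. A minimal sketch (same Python 2 setup as above):

import urllib2

web = urllib2.urlopen("https://blockchain.info/search?search=422245").read()

# If this prints False, the value never reaches the HTML the server sends,
# so no amount of table searching will find it (the page may build the
# table with JavaScript, paginate it, or format the number differently).
print "0.02972821" in web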
Related
I wanted to scrape something as my first program, just to learn the basics really, but I'm having trouble showing more than one result.
The premise is to go to a forum (http://blackhatworld.com), scrape all thread titles, and compare each with a string: if it contains the word "free" it will print, otherwise it won't.
Here's the current code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
n=0
for x in range(len(threadtitles)):
    test = list(threadtitles)[n]
    test2 = list(test)[0]
    if test2.find('free') == -1:
        n=n+1
    else:
        print(test2)
        n=n+1
This is the result of running the program:
https://i.gyazo.com/6cf1e135b16b04f0807963ce21b2b9be.png
As you can see it's checking for the word "free" and it works, but it only shows the first result while there are several more on the page.
By default, string comparison is case-sensitive (FREE != free). To solve your problem, first you need to put test2 in lowercase:
test2 = list(test)[0].lower()
To solve your problem and simplify your code try this:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.blackhatworld.com/')
content = BeautifulSoup(page.content, 'html.parser')
threadtitles = content.find_all('a', class_='PreviewTooltip')
count = 0
for title in threadtitles:
if "free" in title.get_text().lower():
print(title.get_text())
else:
count += 1
print(count)
Bonus: Print value of href:
for title in threadtitles:
print(title["href"])
#importing the libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup
#getting the page url
quote_page="https://www.quora.com/What-is-the-best-advice-you-can-give-to-a-junior-programmer"
page=urllib2.urlopen(quote_page)
#parsing the html
soup = BeautifulSoup(page,"html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})
#finding all the tags in the page
ans=name_box.find_all("div", attrs={"class": "u-serif-font-main--large"},recursive=True)
#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100]
#extracting all the answers and putting into a list
finalans=[]
l=0
for i in chunk:
    stri=chunk[l]
    finalans.append(stri.text)
    l+=1
    continue
final_string = '\n'.join(finalans)
#final output
print(final_string)
I am not able to get more than 20 entries into this list. What is wrong with this code? (I am a beginner and I have used some references to write this program)
Edit: I have added the URL I want to scrape.
You try to break ans into smaller chunks, but notice that each iteration of this loop discards the previous content of chunk, so you lose all but the last chunk of data.
#separating the answers into lists
for i in range(0, len(ans), 100):
    chunk = ans[i:i+100] # overwrites previous chunk
This is why you only get 20 items in the list... it's only the final chunk. Since you want final_string to hold all of the text nodes, there is no need to chunk, so I just removed it.
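(If chunking were ever actually needed, the fix would be to accumulate the chunks instead of rebinding the same name each pass; a minimal sketch:)

# collect every chunk instead of overwriting the previous one
chunks = []
for i in range(0, len(ans), 100):
    chunks.append(ans[i:i+100])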
Next, and this is just tightening up the code: you don't need to both iterate the values of the list and track an index just to get the same value you are indexing. Working on ans, because we are no longer chunking,
finalans=[]
l=0
for i in ans:
    stri=ans[l]
    finalans.append(stri.text)
    l+=1
    continue
becomes
finalans=[]
for item in ans:
    finalans.append(item.text)
or, more succinctly,
finalans = [item.text for item in ans]
So the program is
#importing the libraries
import urllib.request as urllib2
from bs4 import BeautifulSoup
#getting the page url
quote_page="https:abcdef.com"
page=urllib2.urlopen(quote_page)
#parsing the html
soup = BeautifulSoup(page,"html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("div", attrs={"class": "AnswerListDiv"})
#finding all the tags in the page
ans=name_box.find_all("div", attrs={"class": "u-serif-font-main--large"},recursive=True)
#extracting all the answers and putting into a list
finalans = [item.text for item in ans]
final_string = '\n'.join(finalans)
#final output
print(final_string)
I am able to extract table values from this website with the following code.
from pyquery import PyQuery as pq
import requests
url = "https://finviz.com/screener.ashx"
content = requests.get(url).content
doc = pq(content)
Tickers = doc(".screener-link-primary").text()
print(Tickers)
But I am able to extract only the first 20 values. There is a 'next' button at the end of the page which links to the next set of values.
How can I extract this link automatically, fetch the new page and extract the new set of values and append to my existing list?
You can iterate through all pages like:
import requests

counter = 1
while True:
    url = "https://finviz.com/screener.ashx?v=111&r=%d" % counter
    content = requests.get(url).content
    counter += 20
Note that for the first page the r parameter (which I guess stands for the starting entry index) will be 1, for the second 21, for the third 41... so I used a +20 increment for counter.
You should also add a break for the moment the last page is reached. Usually one checks whether new data to scrape is available and, if not, breaks; a sketch of that is below.
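Putting both points together, here is a sketch of the complete loop, reusing the .screener-link-primary selector from the question. The stop condition (an empty or repeated page) is an assumption about how the site behaves past the last page:

import requests
from pyquery import PyQuery as pq

tickers = []
prev_page = None
counter = 1
while True:
    url = "https://finviz.com/screener.ashx?v=111&r=%d" % counter
    content = requests.get(url).content
    # .text() joins all matched cells with spaces, so split() recovers a list
    page_tickers = pq(content)(".screener-link-primary").text().split()
    # stop when a page is empty, or when the site keeps serving the last page
    if not page_tickers or page_tickers == prev_page:
        break
    tickers.extend(page_tickers)
    prev_page = page_tickers
    counter += 20

print(tickers)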
I want to extract some information off websites with URLs of the form:
http://www.pedigreequery.com/american+pharoah
where "american+pharoah" is the extension for one of many horse names.
I have a list of the horse names I'm searching for, I just need to figure out how to plug the names in after "http://www.pedigreequery.com/"
This is what I currently have:
import csv
allhorses = csv.reader(open('HORSES.csv') )
rows=list(allhorses)
import requests
from bs4 import BeautifulSoup
for i in rows: # Number of pages plus one
url = "http://www.pedigreequery.com/".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content)
letters = soup.find_all("a", class_="horseName")
print(letters)
When I print out the url it doesn't have the horse's name at the end, just the URL in quotes. The letters print statement at the end is just to check that it's actually going to the website.
This is how I've seen it done for looping URLs that change by numbers at the end- I haven't found advice on URLs that change by characters.
Thanks!
You are missing the placeholder in your format, so change the format to:
url = "http://www.pedigreequery.com/{}".format(i)
                                    ^^
                                    # add placeholder
Also, you are getting a list of lists at best from rows=list(allhorses), so you would be passing a list, not a string/horse name. If you have one horse per line, just open the file normally and iterate over the file object, stripping the newline.
Presuming one horse name per line, the whole working code would be:
import requests
from bs4 import BeautifulSoup
with open("HORSES.csv") as f:
    for horse in map(str.strip, f):
        url = "http://www.pedigreequery.com/{}".format(horse)
        r = requests.get(url)
        soup = BeautifulSoup(r.content)
        letters = soup.find_all("a", class_="horseName")
        print(letters)
If you have multiple horses per line you can use the csv lib but you will need an inner loop:
with open("HORSES.csv") as f:
for row in csv.reader(f):
# Number of pages plus one
for horse in row:
url = "http://www.pedigreequery.com/{}".format(horse)
r = requests.get(url)
soup = BeautifulSoup(r.content)
letters = soup.find_all("a", class_="horseName")
print(letters)
Lastly, if you don't have the names stored correctly, you have a few options, the simplest of which is to split the name and create the query manually:
url = "http://www.pedigreequery.com/{}".format("+".join(horse.split()))
I am a Python novice and have no experience with BeautifulSoup and urllib
I've tried to frankenstein my own code from other questions to no avail, so I will try to detail what I’m trying to achieve from the pseudocode and description below:
import urllib2
from bs4 import BeautifulSoup
for eachurl in "urllist.txt":
    urllib read first (or 2nd or 3rd) url in list
    find.all("<form")
    if number of "<form" > 0:
        result = True
    if number of "<form" == 0:
        result = False
    write result to csv/excel/html
    table col 1 = url in urllist
    table col 2 = result
So basically, I have a txt file with a list of URLs. I would like urllib to open each URL one by one and check whether the HTML contains a form tag, writing to a new file the URL string in the left column and a y or n in the right, depending on whether find_all on the form tag returned more than 0 results, and then of course stop once the URLs in the txt file have been exhausted.
Use requests instead of urllib2.
Try this:
import requests
from bs4 import BeautifulSoup
with open('data.txt', 'r') as data:
    for line in data:
        res = requests.get(line.strip()).content
        soup = BeautifulSoup(res, 'html.parser')
        with open('result.txt', 'a') as result_file:
            if soup.find_all('form'):
                result_file.write('{} y\n'.format(line.strip()))
            else:
                result_file.write('{} n\n'.format(line.strip()))
data.txt
http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains
http://blank.org/
result.txt
http://stackoverflow.com/questions/34263219/urllib2-and-beautifulsoup-loop-through-urls-and-return-whether-html-contains y
http://blank.org/ n
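As a side note, the code above reopens result.txt in append mode for every URL, so results also pile up across runs. A variant with the same logic that opens both files once (just a sketch of the tidier structure):

import requests
from bs4 import BeautifulSoup

with open('data.txt') as data, open('result.txt', 'w') as result_file:
    for line in data:
        url = line.strip()
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        # 'y' if the page contains at least one <form> tag, 'n' otherwise
        flag = 'y' if soup.find_all('form') else 'n'
        result_file.write('{} {}\n'.format(url, flag))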