I'm using BeautifulSoup to parse multiple URLs and want to convert each one to a PDF, but I don't know how.
My code looks like this:
soup = BeautifulSoup(open('1.html'), "lxml")
for i in soup.find_all('a', class_='black'):
    pdfkit.from_url(['http://mycode.com' + i['href']], '1.pdf')
If there are two URLs in the HTML, I want two PDF files, but I only get one file, "1.pdf". How can I solve this? Thanks!
You always write to the same PDF file: 1.pdf. You need to tell pdfkit to write each URL to a different file. E.g.:
for file_id, i in enumerate(soup.find_all('a', class_='black')):
    pdfkit.from_url(['http://mycode.com' + i['href']], str(file_id) + '.pdf')
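Putting it together, a minimal sketch, assuming pdfkit and wkhtmltopdf are installed and that 1.html and the http://mycode.com base URL are as in the question:

import pdfkit
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('1.html'), "lxml")

# one output file per link: 0.pdf, 1.pdf, ...
for file_id, i in enumerate(soup.find_all('a', class_='black')):
    pdfkit.from_url(['http://mycode.com' + i['href']], str(file_id) + '.pdf')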
I'm a complete beginner in Python, but I know intermediate JavaScript. I have a project to complete that is basically a scraper; I want to automate some work for myself.
1) I have an Excel sheet with more than 1000 rows of data, including URLs. I want Python to visit every URL from that Excel sheet and search the first page for some predefined search texts (a list of texts).
2) If my code finds any of the texts on that web page, it should return true, otherwise false.
I'd appreciate any idea or logic for how to do this kind of process. Any help will make my head ache less 😅
It is heavy work that isn't a good fit for JavaScript, which is why I want to do it in Python.
An easy way to do this would be to get the requests module. Then learn how to use the csv module, which can read CSV files (export your Excel sheet to CSV first). Then here is what you want to do:
import csv
import requests

URLS = []

def GetUrlFromCSVFile():
    global URLS
    # Figure out how to read the links from the csv file, then append them to URLS
    pass

for url in URLS:
    r = requests.get(url, headers={})  # you should probably set some headers here
    if whatever_keyword_u_looking_for in r.text:
        print("Found")
    else:
        print("Not here")
I suggest the following (a small sketch combining all three follows this list):
Read about the csv library - to read the contents of a CSV export of the Excel file.
Read about the requests library - to get a page's content from its URL.
Read about regular expressions in the re library.
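A minimal sketch combining those three, assuming the Excel sheet has been exported to urls.csv with the URL in the first column, and that search_terms holds your predefined texts (both names are placeholders, not from the question):

import csv
import re
import requests

search_terms = ["term one", "term two"]   # your predefined search texts
pattern = re.compile("|".join(re.escape(t) for t in search_terms), re.IGNORECASE)

with open("urls.csv", newline="") as f:
    for row in csv.reader(f):
        url = row[0]                      # assuming the URL is in the first column
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            print(url, False)             # unreachable pages count as not found
            continue
        print(url, bool(pattern.search(page)))   # True if any term appears on the page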
I have a folder that contains thousands of files of raw HTML code. I would like to extract all the href values from each page. What would be the fastest way to do that?
href="what_i_need_here"
import re
with open('file', 'r') as f:
    print(re.findall(r'href="(.+?)"', f.read()))
This is what I'd guess might work, but there's no way to tell for sure since you didn't provide any sample input. The regex used is href="(.+?)", which captures whatever sits between the double quotes after href=. I read the whole file with f.read() and run findall over it in one pass. See if it works, or add examples of the text.
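Since the folder contains thousands of files, here is a sketch of the same idea applied to every file, assuming Python 3 and a placeholder directory path html_dir:

import glob
import os
import re

html_dir = '/path/to/folder'              # placeholder path
href_re = re.compile(r'href="(.+?)"')

all_hrefs = []
for path in glob.glob(os.path.join(html_dir, '*.htm*')):
    with open(path, 'r', errors='ignore') as f:   # ignore odd encodings in raw HTML
        all_hrefs.extend(href_re.findall(f.read()))

print(len(all_hrefs), 'hrefs found')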
I have to make a program that takes a user-input web address and parses the HTML to find links, then stores all the links in another HTML file in a certain format. I only have access to built-in Python modules (Python 3). I'm able to get the HTML code from the link using urllib.request and put it into a string. How would I actually go about extracting links from this string and putting them into a string array? Also, would it be possible to identify link types (such as an image link or an mp3 link) so I can put them into different arrays? Then I could categorize them when creating the output file.
You can use the re module to parse the HTML text for links. In particular, the findall method returns every match.
As far as sorting by file type, that depends on whether the URL actually contains the extension (i.e. .mp3, .js, .jpeg, etc.).
You could do a simple for loop like this:
import re

html = getHTMLText()   # however you already fetched the page's HTML into a string
mp3s = []
other = []

for match in re.findall('<reexpression>', html):   # substitute your link-matching regex here
    if match.endswith('.mp3'):
        mp3s.append(match)
    else:
        other.append(match)
Try to use the built-in html.parser module (the HTMLParser class) or the re module; they will help you do that. If you go with a regex, something like this can match URLs:
r'http[s]?://[^\s<>"]+|www\.[^\s<>"]+'
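Since only built-in modules are allowed, here is a rough sketch using html.parser instead of a regex, splitting image and mp3 links from the rest; the extension lists and the example.com URL are illustrative assumptions:

from html.parser import HTMLParser
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images, self.mp3s, self.other = [], [], []

    def handle_starttag(self, tag, attrs):
        # links live in <a href=...> and <img src=...>
        for name, value in attrs:
            if (tag == 'a' and name == 'href') or (tag == 'img' and name == 'src'):
                if value.lower().endswith(('.png', '.jpg', '.jpeg', '.gif')):
                    self.images.append(value)
                elif value.lower().endswith('.mp3'):
                    self.mp3s.append(value)
                else:
                    self.other.append(value)

html = urlopen('http://example.com').read().decode('utf-8', errors='ignore')
collector = LinkCollector()
collector.feed(html)
print(collector.images, collector.mp3s, collector.other)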
Hello, I am trying to write a Python function that saves a list of URLs to a .txt file.
Example: visit http://forum.domain.com/ and save every URL containing viewtopic.php?t= to the .txt file:
http://forum.domain.com/viewtopic.php?t=1333
http://forum.domain.com/viewtopic.php?t=2333
I use this code but it does not save anything.
I am very new to Python; can someone help me create this?
web_obj = opener.open('http://forum.domain.com/')
data = web_obj.read()
fl_url_list = open('urllist.txt', 'r')
url_arr = fl_url_list.readlines()
fl_url_list.close()
This is far from trivial and has quite a few corner cases (I suppose the page you're referring to is a web page).
To give you a few pointers, you need to:
download the web page: you're already doing it (in data)
extract the URLs: this is the hard part; most probably you'll want to use an HTML parser, extract the <a> tags, fetch the href attribute and put that into a list, then filter that list to keep only the URLs formatted the way you want (say, those containing viewtopic). Let's say you get them into urlList
then open a file for writing text (thus 'wt', not 'r')
write the content: f.write('\n'.join(urlList))
close the file
I advise you to try to follow these steps and ask specific questions when you're stuck on a particular issue.
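A sketch of those steps using only the standard library (the viewtopic.php?t= filter and the urllist.txt output name come from the question; the parser class and the use of urllib.request are assumptions):

from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.parse import urljoin

base = 'http://forum.domain.com/'

class TopicLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.urlList = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href', '')
            if 'viewtopic.php?t=' in href:
                self.urlList.append(urljoin(base, href))   # make the link absolute

data = urlopen(base).read().decode('utf-8', errors='ignore')
parser = TopicLinkParser()
parser.feed(data)

with open('urllist.txt', 'wt') as f:   # 'wt' = write text, as described above
    f.write('\n'.join(parser.urlList))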
I am trying to scrape rows off of over 1200 .htm files that are on my hard drive. On my computer they are here 'file:///home/phi/Data/NHL/pl07-08/PL020001.HTM'. These .htm files are sequential from *20001.htm until *21230.htm. My plan is to eventually toss my data in MySQL or SQLite via a spreadsheet app or just straight in if I can get a clean .csv file out of this process.
This is my first attempt at code (Python) and scraping, and I just installed Ubuntu 9.04 on my crappy Pentium IV. Needless to say, I am a newb and have hit some roadblocks.
How do I get mechanize to go through all the files in the directory in order? Can mechanize even do this? Can mechanize/Python/BeautifulSoup read a 'file:///' style URL, or is there another way to point it at /home/phi/Data/NHL/pl07-08/PL020001.HTM? Is it smart to do this in 100 or 250 file increments, or just send all 1230?
I just need the rows that start with "<tr class="evenColor">" and end with "</tr>". Ideally I only want the rows that contain "SHOT"|"MISS"|"GOAL" within them, but I want the whole row (every column). Note that "GOAL" is in bold, so do I have to specify this? There are 3 tables per .htm file.
Also, I would like the name of the parent file (pl020001.htm) to be included in the rows I scrape so I can ID them in their own column in the final database. I don't even know where to begin with that. This is what I have so far:
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup
import re
from mechanize import Browser

mech = Browser()
url = "file:///home/phi/Data/NHL/pl07-08/PL020001.HTM"
## but how do I do multiple urls/files? PL02*.HTM?
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
## this confuses me and seems redundant
pl = open("input_file.html", "r")
chances = open("chancesforsql.csv", "w")
table = soup.find("table", border=0)
for row in table.findAll("tr", {"class": "evenColor"}):
    # should I do this instead of the above?
    outfile = open("shooting.csv", "w")
## how do I end it?
Should I be using IDLE or something like it? just Terminal in Ubuntu 9.04?
You won't need mechanize. Since I do not exactly know the HTML content, I'd try to see what matches, first. Like this:
import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/Data/*.htm'):
    soup = BeautifulSoup(open(filename, "r").read())  # assuming some HTML
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        print a_tr
Then pick the stuff you want and write it to stdout with commas (and redirect it with > to a file). Or write the CSV via Python, as in the sketch below.
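A rough sketch of the "write the csv via Python" route, adding the source filename as its own column as asked; the SHOT/MISS/GOAL filter and the shooting.csv name come from the question, while the cell extraction is an assumption about the table layout:

import csv
import glob
import os
from BeautifulSoup import BeautifulSoup

out = csv.writer(open("shooting.csv", "wb"))   # 'b' because Python 2's csv module wants binary mode

for filename in glob.glob('/home/phi/Data/NHL/pl07-08/*.HTM'):
    soup = BeautifulSoup(open(filename, "r").read())
    for a_tr in soup.findAll("tr", attrs={"class": "evenColor"}):
        cells = ["".join(td.findAll(text=True)) for td in a_tr.findAll("td")]
        if any(word in " ".join(cells) for word in ("SHOT", "MISS", "GOAL")):
            # first column identifies the parent file, as requested
            out.writerow([os.path.basename(filename)] + cells)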
MYYN's answer looks like a great start to me. One thing I'd point out that I've had luck with is:
import glob

for file_name in glob.glob('/home/phi/Data/*.htm'):
    # read the file and then parse it with BeautifulSoup
    pass
I've found both the os and glob imports to be really useful for running through files in a directory.
Also, once you're using a for loop in this way, you have the file_name which you can modify for use in the output file, so that the output filenames will match the input filenames.
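For example, one way to derive a matching output name from file_name (a sketch; the .csv extension is just an assumption about the desired output format):

import glob
import os

for file_name in glob.glob('/home/phi/Data/*.htm'):
    base = os.path.splitext(os.path.basename(file_name))[0]   # e.g. 'PL020001'
    out_name = base + '.csv'                                  # e.g. 'PL020001.csv'
    # read and parse file_name, then write its rows to out_name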