I'm trying to create a script that takes a .txt file with multiple lines of YouTube usernames, appends each one to the YouTube user homepage URL, and crawls through to get profile data.
The code below gives me the info I want for one user, but I have no idea where to start for importing and iterating through multiple URLs.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
import urllib2
# download the page
response = urllib2.urlopen("http://youtube.com/user/alxlvt")
html = response.read()
# create a beautiful soup object
soup = BeautifulSoup(html)
# find the profile info & display it
profileinfo = soup.findAll("div", { "class" : "user-profile-item" })
for info in profileinfo:
    print info.get_text()
Does anyone have any recommendations?
Eg., if I had a .txt file that read:
username1
username2
username3
etc.
How could I go about iterating through those, appending them to http://youtube.com/user/%s, and creating a loop to pull all the info?
If you don't want to use an actual scraping module (like scrapy, mechanize, selenium, etc), you can just keep iterating on what you've written.
A few things. Use iteration on file objects to read line by line: file objects call readline() as their iterator, so you can just do for line in file_obj to go through a document line by line.
Concatenate the URLs. I originally used +, but string formatting (as in the code below) is clearer.
Make a list of URLs - this lets you stagger your requests, so you can do compassionate (polite) screen scraping.
# Goal: make a list of urls
url_list = []
# use a try-finally to make sure you close your file.
try:
    f = open('pathtofile.txt', 'rb')
    for line in f:
        # strip the trailing newline before building the URL
        url_list.append('http://youtube.com/user/%s' % line.strip())
    # do something with url_list (like call a scraper, or fetch each URL with urllib2)
finally:
    f.close()
EDIT: Andrew G's string format is clearer. :)
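For example, a rough sketch of that "do something with the url list" step, spacing out the requests with time.sleep (the 2-second delay is an arbitrary choice):
import time
import urllib2

for url in url_list:
    html = urllib2.urlopen(url).read()
    # ... parse html with BeautifulSoup exactly as in the question ...
    time.sleep(2)  # pause between requests so you don't hammer the server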
You'll need to open the file (preferably with the with open('/path/to/file', 'r') as f: syntax) and then do f.readline() in a loop. Assign the results of readline() to a string like "username" and then run your current code inside the loop, starting with response = urllib2.urlopen("http://youtube.com/user/%s" % username).
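Put together, that might look something like the sketch below (usernames.txt stands in for your actual file path):
import urllib2
from bs4 import BeautifulSoup

with open('usernames.txt', 'r') as f:
    for line in f:
        username = line.strip()  # drop the trailing newline
        if not username:
            continue  # skip blank lines
        response = urllib2.urlopen("http://youtube.com/user/%s" % username)
        soup = BeautifulSoup(response.read())
        # same profile extraction as in the question
        for info in soup.findAll("div", {"class": "user-profile-item"}):
            print info.get_text()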
I am using the Python webbrowser module to try to open an HTML file. I added a short routine to grab the code from a website so I can store a web page in case I ever need to view it without Wi-Fi, for instance a news article or something else.
The code itself is fairly short so far, so here it is:
import requests as req
from bs4 import BeautifulSoup as bs
import webbrowser
import re
webcheck = re.compile('^(https?:\/\/)?(www.)?([a-z0-9]+\.[a-z]+)([\/a-zA-Z0-9#\-_]+\/?)*$')
#Valid URL Check
while True:
    url = input('URL (MUST HAVE HTTP://): ')
    check = webcheck.search(url)
    if check is not None:
        groups = list(check.groups())  # only read the groups once we know there was a match
        for group in groups:
            if group == 'https://':
                groups.remove(group)
            elif group.count('/') > 0:
                groups.append(group.replace('/', '--'))
                groups.remove(group)
        filename = ''.join(groups) + '.html'
        break
#Getting Website Data
reply = req.get(url)
soup = bs(reply.text, 'html.parser')
#Writing Website
with open(filename, 'w') as file:
    file.write(reply.text)
#Open Website
webbrowser.open(filename)
webbrowser.open('https://www.youtube.com')
I added webbrowser.open('https://www.youtube.com') so that I knew the module was working, which it was, as it did open YouTube.
However, webbrowser.open(filename) doesn't do anything, yet it returns True if I assign the call to a variable and print it.
The HTML file's name contains a period, but I don't think that should matter, as I have also tried a filename without one and it still won't open.
Does webbrowser need special permissions to work?
I'm not sure what to do as I've removed characters from the filename and even showed that the module is working by opening youtube.
What can I do to fix this?
From the webbrowser documentation:
Note that on some platforms, trying to open a filename using this function, may work and start the operating system’s associated program. However, this is neither supported nor portable.
So it seems that webbrowser can't do what you want. Why did you expect that it would?
Adding file:// plus the full path to the filename does the trick, for anyone wondering.
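For instance, using pathlib to build the file:// URL (filename here is the variable from the question's script):
import pathlib
import webbrowser

# turn the saved page's path into an absolute file:// URL and open it
webbrowser.open(pathlib.Path(filename).resolve().as_uri())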
So I started with Python some days ago and now tried to make a function that gives me all subpages of a website. I know it may not be the most elegant function, but I had been pretty proud to see it working. For some reason unknown to me, though, my function does not work anymore. I could've sworn I haven't changed it since the last time it worked, but after hours of attempts to debug I am slowly doubting myself. Can you maybe take a look at why my function does not output to a .txt file anymore? I just get handed an empty text file, though if I delete it, at least a new (empty) one gets created.
I tried to move the part that saves the strings out of the try block, which didn't work. I also tried all_urls.flush() to maybe save everything. I restarted the PC in the hope that something in the background was accessing the file and preventing me from writing to it. I also renamed the file it's supposed to save to, so as to generate something truly fresh. Still the same problem. I also checked that the link from the loop is passed as a string, so that shouldn't be a problem. I also tried:
print(link, file=all_urls, end='\n')
as a replacement to
all_urls.write(link)
all_urls.write('\n')
with no result.
My full function:
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links) > 0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                print(type(link))
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the .txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                for sublink in soup.findAll('a'):
                    templinks.append(sublink.get('href'))
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we still have the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    if templink.find(url) == 0 and templink not in links:
                        links.append(templink)
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR .txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links
I can't reproduce this, but I've had inexplicable [to me at least] errors with file handling that were resolved when I wrote from inside a with.
[Just make sure to first remove the lines involving all_urls from your current code, just in case - or try this with a different filename while checking whether it works.]
Since you're appending all the urls to tested_links anyway, you could just write it all at once after the while loop
with open('all_urls.txt', 'w') as f:
    f.write('\n'.join(tested_links) + '\n')
or, if you have to write link by link, you can append by opening with mode='a':
# before the while, if you're not sure the file exists
# [and/or to clear previous data from file]
# with open('all_urls.txt', 'w') as f: f.write('')
# and inside the try block:
with open('all_urls.txt', 'a') as f:
    f.write(f'{link}\n')
Not a direct answer, but in my early days this happened to me. The requests module of Python sends requests with headers that indicate Python, which websites can quickly detect; your IP can get blocked and you get unusual responses. That could be why your previously working function is not working now.
Solution:
Use natural request headers; see the code below:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
r = requests.get(URL, headers=headers)
Use a proxy in case your IP got blocked; it is highly recommended.
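For example, a sketch of passing both the headers and a proxy to requests (the proxy address below is a placeholder; substitute one you actually have access to):
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
proxies = {
    'http': 'http://203.0.113.10:8080',   # placeholder proxy address
    'https': 'http://203.0.113.10:8080',
}
r = requests.get('https://example.com', headers=headers, proxies=proxies, timeout=10)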
Here is your slightly changed script with the changes marked (*****************):
def get_subpages(url):
    # gets all subpage links from a website that start with the given url
    from urllib.request import urlopen, Request
    from bs4 import BeautifulSoup
    links = [url]
    # ******************* added sublinks_list variable *******************
    sublinks_list = []
    tested_links = []
    to_test_links = links
    # open a .txt file to save results into
    all_urls = open('all_urls.txt', 'w')
    problematic_pages = open('problematic_pages.txt', 'w')
    while len(to_test_links) > 0:
        for link in to_test_links:
            print('the link we are testing right now:', link)
            # add the current link to the tested list
            tested_links.append(link)
            try:
                all_urls.write(link)
                all_urls.write('\n')
                # Save it to the .txt file and make an abstract
                # get the link ready to be accessed
                req = Request(link)
                html_page = urlopen(req)
                soup = BeautifulSoup(html_page, features="html.parser")
                # empty previous temporary links
                templinks = []
                # move the links on the subpage link to templinks
                sublinks = soup.findAll('a')
                for sublink in sublinks:
                    # templinks.append(sublink.get('href'))  ***************** changed the line with next row *****************
                    templinks.append(sublink['href'])
                # clean off accidental 'None' values
                templinks = list(filter(lambda item: item is not None, templinks))
                for templink in templinks:
                    # make sure we still have the correct website and don't accidentally crawl instagram etc.
                    # also avoid duplicates
                    # if templink.find(url) == 0 and templink not in links:  ******************* changed the line with next row *****************
                    if templink not in sublinks_list:
                        # links.append(templink)  ******************* changed the line with next row *****************
                        sublinks_list.append(templink)
                        all_urls.write(templink + '\n')  # ******************* added this line *****************
                # and lastly refresh the to_test_links list with the newly found links before going back into the loop
                to_test_links = (list(set(links) ^ set(tested_links)))
            except:
                # Save it to the ERROR .txt file and make an abstract
                problematic_pages.write(link)
                problematic_pages.write('\n')
                print('ERROR: All links on', link, 'not retrieved. If need be check for new subpages manually.')
    all_urls.close()
    problematic_pages.close()
    return links

lnks = get_subpages('https://www.jhanley.com/blog/pyscript-creating-installable-offline-applications/')  # ******************* url used for testing *****************
It works and there are over 180 links in the file. Please test it yourself. There are still some misfits and questionable syntax, so you should test your code thoroughly again - but the part that writes links into a file works.
Regards...
I am working on a web scraping project and have to get links for 19,062 facilities. If I use a for loop, it will take almost 3 hours to complete. I tried making a generator but failed to work out the logic, and I am not sure it can be done using a generator at all. So, is there any Python expert who has an idea how to get what I want faster? In my code, I execute it for just 20 ids. Thanks
import requests, json
from bs4 import BeautifulSoup as bs

url = 'https://hilfe.diakonie.de/hilfe-vor-ort/marker-json.php?ersteller=&kategorie=0&text=&n=55.0815&e=15.0418321&s=47.270127&w=5.8662579&zoom=20000'
res = requests.get(url).json()
url_1 = 'https://hilfe.diakonie.de/hilfe-vor-ort/info-window-html.php?id='

# extracting all the id= from .json res object
id = []
for item in res['items'][0]["elements"]:
    id.append(item["id"])

# opening a .json file and making a dict for links
file = open('links.json', 'a')
links = {'links': []}

def link_parser(url, id):
    resp = requests.get(url + id).content
    soup = bs(resp, "html.parser")
    link = soup.select_one('p > a').attrs['href']
    links['links'].append(link)

# dumping the dict into links.json file
for item in id[:20]:
    link_parser(url_1, item)

json.dump(links, file)
file.close()
In web scraping, speed is not a good idea! You will be hitting the server numerous times a second and will most likely get blocked if you use a for loop. A generator will not make this quicker. Ideally, you want to hit the server once and process the data locally.
If it were me, I would use a framework like Scrapy, which encourages good practice and provides various Spider classes to support standard techniques.
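For what it's worth, here is a rough, untested sketch of what that could look like as a Scrapy spider. It reuses the two endpoints from the question, adds a small download delay to stay polite, and mirrors the p > a selector from the original code; treat it as a starting point, not a drop-in solution.
import json
import scrapy

class DiakonieLinksSpider(scrapy.Spider):
    name = 'diakonie_links'
    custom_settings = {'DOWNLOAD_DELAY': 0.5}  # pause between requests to stay polite

    def start_requests(self):
        marker_url = ('https://hilfe.diakonie.de/hilfe-vor-ort/marker-json.php'
                      '?ersteller=&kategorie=0&text=&n=55.0815&e=15.0418321'
                      '&s=47.270127&w=5.8662579&zoom=20000')
        yield scrapy.Request(marker_url, callback=self.parse_markers)

    def parse_markers(self, response):
        data = json.loads(response.text)
        for element in data['items'][0]['elements']:
            info_url = ('https://hilfe.diakonie.de/hilfe-vor-ort/'
                        'info-window-html.php?id=' + str(element['id']))
            yield scrapy.Request(info_url, callback=self.parse_facility)

    def parse_facility(self, response):
        # same idea as soup.select_one('p > a') in the original code
        yield {'link': response.css('p > a::attr(href)').get()}
Saving this in a file and running something like scrapy runspider diakonie_spider.py -o links.json would write the collected links to a JSON file.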
I am trying to create a web scraping program that goes to a specific website, collects the Tor nodes, and then compares them to a list that I have. If an IP address matches, it is a Tor node; if it doesn't, it isn't.
I am having a hard time getting the "text" from the Inspect Element view of the website: [Inspect element of website][1]
[1]: https://i.stack.imgur.com/16zWw.png
Any help is appreciated, I'm stuck right now and don't know how to get the "text" from the first picture to show up on my program. Thanks in advance.
Here is the code to my program so far:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.dan.me.uk/tornodes')
soup = BeautifulSoup(page.content, 'html.parser')
search = soup.find(id='content')
#137.74.19.201 is practice tor node
items = search.find_all(class_='article box')
Why bother with BeautifulSoup?! The guy states clearly that there are some markers in the page... just take the whole page as a string, split it by those markers, and go from there, for example:
import requests

page = requests.get('https://www.dan.me.uk/tornodes')
# page.text contains the source code of the page as a string
if "<!--__BEGIN_TOR_NODE_LIST__-->" not in page.text:
    print("list not ready")
else:
    list_text = page.text.split("<!--__BEGIN_TOR_NODE_LIST__-->")[1]  # take everything after this
    list_text = list_text.split("<!--__END_TOR_NODE_LIST__-->")[0]  # take everything before this
    for line in list_text.split("<br>"):
        line_ip = line.strip().split("|")[0]
        # now do what you want with it, e.g. check against your own list
        if line_ip in my_known_ip_list:
            print("This is good %s" % line_ip)
import urllib.request # the lib that handles the url stuff
target_url = 'https://www.dan.me.uk/torlist/'
my_ips = ['1.161.11.204', '1.161.11.205']
confirmed_ips = []
for line in urllib.request.urlopen(target_url):
    ip = line.decode().strip()  # lines arrive as bytes with a trailing newline
    if ip in my_ips:
        print(ip)
        confirmed_ips.append(ip)
print(confirmed_ips)
# ATTENTION:
# Umm... You can only fetch the data every 30 minutes - sorry. It's pointless any faster as I only update every 30 minutes anyway.
# If you keep trying to download this list too often, you may get blocked from accessing it completely.
# (this is due to some people trying to download this list every minute!)
Since there's this 30-minute limitation (otherwise you will receive ERROR 403), you can read the lines and save them to a file, then compare your list against the downloaded one.
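One way to respect that limit, sketched below, is to cache the downloaded list in a local file and only re-fetch it when the copy on disk is older than 30 minutes (the cache filename is an arbitrary choice):
import os
import time
import urllib.request

target_url = 'https://www.dan.me.uk/torlist/'
cache_file = 'torlist_cache.txt'
max_age = 30 * 60  # seconds

def get_tor_ips():
    # re-download only if the cache is missing or stale
    if not os.path.exists(cache_file) or time.time() - os.path.getmtime(cache_file) > max_age:
        data = urllib.request.urlopen(target_url).read().decode()
        with open(cache_file, 'w') as f:
            f.write(data)
    with open(cache_file) as f:
        return {line.strip() for line in f if line.strip()}

my_ips = ['1.161.11.204', '1.161.11.205']
confirmed_ips = [ip for ip in my_ips if ip in get_tor_ips()]
print(confirmed_ips)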
Using the webbrowser module, I want to open a specific page on last.fm.
It picks a line from a text file then prints it. I want it to add that line at the end of:
webbrowser.open('http://www.last.fm/music/')
So if, for example, random.choice picks 'example artist', I want 'example artist' to be added to the end of the URL correctly.
Any help is appreciated.
Use the urlparse.urljoin function to build up the full destination URL:
import urlparse
import webbrowser
artist_name = 'virt'
url = urlparse.urljoin('http://www.last.fm/music/', artist_name)
# Will open http://www.last.fm/music/virt in your browser.
webbrowser.open(url)
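To tie this back to the question, here is a small sketch of picking a random artist from a text file and opening the page (Python 2, to match the urlparse import above; artists.txt is a placeholder filename, and urllib.quote handles artist names that contain spaces):
import random
import urllib
import urlparse
import webbrowser

with open('artists.txt') as f:
    artists = [line.strip() for line in f if line.strip()]

artist_name = random.choice(artists)
url = urlparse.urljoin('http://www.last.fm/music/', urllib.quote(artist_name))
# e.g. 'example artist' becomes http://www.last.fm/music/example%20artist
webbrowser.open(url)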