Open links from txt file in python - python

I would like to ask for help with an RSS program. What I'm doing is collecting sites that contain relevant information for my project and then checking whether they have RSS feeds.
The links are stored in a txt file (one link on each line).
So I have a txt file full of base URLs that need to be checked for RSS.
I have found this code, which would make my job much easier.
import requests
from bs4 import BeautifulSoup

def get_rss_feed(website_url):
    if website_url is None:
        print("URL should not be null")
    else:
        source_code = requests.get(website_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.find_all("link", {"type": "application/rss+xml"}):
            href = link.get('href')
            print("RSS feed for " + website_url + "is -->" + str(href))

get_rss_feed("http://www.extremetech.com/")
But I would like to open my collected URLs from the txt file, rather than typing them in one by one.
So I have tried to extend the program with this:
from bs4 import BeautifulSoup, SoupStrainer

with open('test.txt', 'r') as f:
    for link in BeautifulSoup(f.read(), parse_only=SoupStrainer('a')):
        if link.has_attr('http'):
            print(link['http'])
But this returns an error saying that BeautifulSoup is not an HTTP client.
I have also tried extending it with this:
def open():
    f = open("file.txt")
    lines = f.readlines()
    return lines
But this gave me a list separated with ",".
I would be really thankful if someone could help me.

Typically you'd do something like this:
with open('links.txt', 'r') as f:
    for line in f:
        get_rss_feed(line)
Also, it's a bad idea to define a function with the name open unless you intend to replace the builtin function open.
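Putting the two pieces together, a minimal end-to-end sketch might look like this (assuming the file is called links.txt with one URL per line; the strip() call is an addition that removes the trailing newline, which would otherwise end up in the URL passed to requests):
import requests
from bs4 import BeautifulSoup

def get_rss_feed(website_url):
    response = requests.get(website_url)
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.find_all("link", {"type": "application/rss+xml"}):
        print("RSS feed for " + website_url + " is --> " + str(link.get('href')))

with open('links.txt', 'r') as f:
    for line in f:
        url = line.strip()  # drop the trailing newline
        if url:             # skip blank lines
            get_rss_feed(url)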

I guess you can do it using urllib:
import urllib

f = open('test.txt', 'r')
# considering each url is on a new line...
while True:
    URL = f.readline()
    if not URL:
        break
    mycontent = urllib.urlopen(URL).read()
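Note that urllib.urlopen is the Python 2 API; on Python 3 the equivalent lives in urllib.request. A rough Python 3 sketch of the same loop (still assuming one URL per line in test.txt, with the newline stripped before opening):
import urllib.request

with open('test.txt', 'r') as f:
    for line in f:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        mycontent = urllib.request.urlopen(url).read()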

Related

Downloading links from a txt file

I am very new to Python. I want to do a simple exercise where I download a bunch of files from links listed in a txt file. The files are all annual reports in txt format too. I also want to preserve the name of each link as the file name, with '/' replaced with '_'. I have tried the following so far. I do not know how to open a txt file with a URL on each line, which is why I am using a list of URLs, but I want to do it properly. I know that the following code is nowhere near what I want, but I just wanted to give it a try. Can anyone please help with this? Thanks a million!
import requests

urllist = ["https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt",
           "https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt",
           ]

for url in urllist:
    r = requests.get(url)
    with open('filename.txt', 'w') as file:
        file.write(r.text)
You can try using:
import requests

urllist = ["https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt",
           "https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt"]  # links are the same

for url in urllist:
    r = requests.get(url)
    if r.status_code == 200:
        fn = url.replace("/", "_").replace(":", "_")  # on Windows, : is not allowed in filenames
        with open(fn, 'w') as file:
            file.write(r.text)
Output:
https___www.sec.gov_Archives_edgar_data_100240_0000950144-94-000787.txt
Only one file was generated because the two links are identical.
If your links are in a file, let's say urls.txt, with each link on a different line, then you can use this:
import urllib.request

with open('urls.txt') as f:
    for url in f:
        url = url.replace('\n', '')
        urllib.request.urlretrieve(url, url.replace('/', '_').replace(':', '_'))
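For what it's worth, the two ideas can also be combined: read the URLs from urls.txt and download them with requests, keeping the status-code check from the first snippet. A minimal sketch, assuming the same file layout:
import requests

with open('urls.txt') as f:
    for line in f:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        r = requests.get(url)
        if r.status_code == 200:
            fn = url.replace("/", "_").replace(":", "_")
            with open(fn, 'w') as outfile:
                outfile.write(r.text)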

scrapy reading urls from a txt file fail

This is what the txt file looks like; I opened it from a Jupyter notebook. Notice that I changed the links in the result for obvious reasons.
input-----------------------------
with open('...\j.txt', 'r') as f:
    data = f.readlines()
print(data[0])
print(type(data))
output---------------------------------
['https://www.example.com/191186976.html', 'https://www.example.com/191187171.html']
Now I wrote this in my scrapy script, but it didn't go for the links when I ran it. Instead it shows: ERROR: Error while obtaining start requests.
class abc(scrapy.Spider):
    name = "abc_article"

    with open('j.txt', 'r') as f4:
        url_c = f4.readlines()
    u = url_c[0]
    start_urls = u
And if I write u = ['example.html', 'example.html'] and start_urls = u, then it works perfectly fine. I'm new to scrapy, so I'd like to ask what the problem is here. Is it the reading method, or something else I didn't notice? Thanks.
Something like this should get you going in the right direction.
import csv
from urllib.request import urlopen
# import urllib2
from bs4 import BeautifulSoup

contents = []
with open('C:\\your_path_here\\test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

for url in contents:  # Parse through each url in the list.
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "html.parser")
    print(soup)
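That fetches the URLs outside of scrapy. If the goal is to feed the file into the spider itself, the underlying issue is that start_urls must be a list of URL strings, while url_c[0] in the question is a single string, so scrapy ends up with malformed start requests. A minimal sketch of the spider, assuming j.txt really holds one URL per line (if it instead contains a Python-style list literal, it would need to be parsed first, e.g. with ast.literal_eval):
import scrapy

class AbcSpider(scrapy.Spider):
    name = "abc_article"

    # Build a clean list of URLs for start_urls, stripping newlines and blank lines.
    with open('j.txt', 'r') as f4:
        start_urls = [line.strip() for line in f4 if line.strip()]

    def parse(self, response):
        # Replace this with the real parsing logic.
        self.logger.info("Visited %s", response.url)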

How do I download from a text file that has a list with links all in 1 run?

I've tried using wget in Python to download links from a txt file.
What should I use to help me do this?
I've been using the wget Python module.
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'html.parser')
body = soup.body
s = "https://google.com/"

for url in soup.find_all('a'):
    f = open("output.txt", "a")
    print(str(s), file=f, end='')
    print(url.get('href'), file=f)
    f.close()
So far I've only been able to create the text file and then use wget.exe in the command prompt. I'd like to be able to do all of this in one step.
Since you're already using the third party requests library, just use that:
import requests
from os.path import basename

with open('output.txt') as urls:
    for url in urls:
        url = url.strip()  # drop the trailing newline
        response = requests.get(url)
        filename = basename(url)
        with open(filename, 'wb') as output:
            output.write(response.content)
This code makes many assumptions:
The end of the url must be a unique name as we use basename to create the name of the downloaded file. e.g. basename('https://i.imgur.com/7ljexwX.gifv') gives '7ljexwX.gifv'
The content is assumed to be binary not text and we open the output file as 'wb' meaning 'write binary'.
The response isn't checked to make sure there were no errors
If the content is large this will be loaded into memory and then written to the output file. This may not be very efficient. There are other questions on this site which address that; a streaming sketch is also shown after this list.
I also haven't actually tried running this code.
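Regarding the last assumption: requests can stream the response to disk in chunks instead of holding it all in memory. A rough sketch of that variant, under the same output.txt assumptions:
import requests
from os.path import basename

with open('output.txt') as urls:
    for url in urls:
        url = url.strip()
        if not url:
            continue
        # stream=True avoids loading the whole body into memory at once
        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            with open(basename(url), 'wb') as output:
                for chunk in response.iter_content(chunk_size=8192):
                    output.write(chunk)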

Opening webpage and returning a dict of all the links and their text

I'm trying to open a webpage and return all the links as a dictionary that would look like this.
{"http://my.computer.com/some/file.html" : "link text"}
So the link would be after the href= and the text would be between the > and the </a>
I'm using https://www.yahoo.com/ as my test website
I keep getting this error:
'href=' in line:
TypeError: a bytes-like object is required, not 'str'
Here's my code:
def urlDict(myUrl):
    url = myUrl
    page = urllib.request.urlopen(url)
    pageText = page.readlines()
    urlList = {}
    for line in pageText:
        if '<a href=' in line:
            try:
                url = line.split('<a href="')[-1].split('">')[0]
                txt = line.split('<a href="')[-1].split('">')[-1].split('< /a>')[0]
                urlList[url] = txt
            except:
                pass
    return urlList
What am I doing wrong? I've looked around and people have mostly suggested this BeautifulSoup parser thing. I'd use it, but I don't think that would fly with my teacher.
The issue is that you're attempting to compare a byte string to a regular string. If you add print(line) as the first command in your for loop, you'll see that it prints a line of HTML, but with a b' at the beginning, indicating it's a bytes object that hasn't been decoded yet. This makes things difficult. The proper way to use urllib here is the following:
def url_dict(myUrl):
    with urllib.request.urlopen(myUrl) as f:
        s = f.read().decode('utf-8')
This will leave the s variable holding the entire text of the page. You can then use a regular expression to parse out the link text and the link targets. Here is an example which pulls just the link targets, without the HTML.
import urllib.request
import re

def url_dict():
    # url = myUrl
    with urllib.request.urlopen('http://www.yahoo.com') as f:
        s = f.read().decode('utf-8')
    r = re.compile('(?<=href=").*?(?=")')
    print(r.findall(s))

url_dict()
Using regex to get both the link text and the link target in a dictionary is outside the scope of where you are in your class, so I would absolutely not recommend submitting it for the assignment, although I would recommend learning it for later use.
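For the record, a rough sketch of that regex version might look like the following. Regex-based HTML parsing is brittle, and the pattern below is an assumption about how the anchors are written, not a general solution:
import re
import urllib.request

def url_dict(my_url):
    with urllib.request.urlopen(my_url) as f:
        s = f.read().decode('utf-8')
    # Capture the href value and the text between <a ...> and </a>.
    link_re = re.compile(r'<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
    return {href: text for href, text in link_re.findall(s)}

print(url_dict('https://www.yahoo.com/'))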
You'll want to use BeautifulSoup as suggested, as it makes this entire thing extremely easy. There is an example in the docs that you can cut and paste to extract the URLs.
For what it's worth, here is a BeautifulSoup and requests approach.
Feel free to replace requests with urllib, but BeautifulSoup doesn't really have a nice replacement.
import requests
from bs4 import BeautifulSoup

def get_links(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    # href=True skips anchors that have no href attribute
    return {a_tag['href']: a_tag.text for a_tag in soup.find_all('a', href=True)}

for link, text in get_links('https://www.yahoo.com/').items():
    print(text.strip(), link)

Scrape websites and export only the visible text to a text document Python 3 (Beautiful Soup)

Problem: I am trying to scrape multiple websites with BeautifulSoup for only the visible text and then export all of the data to a single text file.
This file will be used as a corpus for finding collocations with NLTK. I'm working with something like this so far, but any help would be much appreciated!
import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "http://en.wikipedia.org/wiki/Golf"]

for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]

with open('thisisanew.txt', 'w') as file:
    for item in text:
        print(file, item)
Unfortunately, there are two issues with this: when I try to write the output to the .txt file, it is completely blank.
Any ideas?
print(file, item) should be print(item, file=file).
But don't name your file object file, as this shadows a built-in name; something like this is better:
with open('thisisanew.txt', 'w') as outfile:
    for item in text:
        print(item, file=outfile)
To solve the next problem, overwriting the data from the first URL, you can move the file writing code into your loop, and open the file once before entering the loop:
import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "http://en.wikipedia.org/wiki/Golf"]

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)
There is another problem: you are collecting the text only from the last URL, because the text variable is reassigned over and over.
Define text as an empty list before the loop and add new data to it inside:
text = []
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text += [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
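Since the goal is "only the visible text", note that findAll('p') only collects paragraph tags. A common alternative (not what the code above does) is to drop script and style tags and call get_text(); a rough sketch, assuming the same urls list:
import requests
from bs4 import BeautifulSoup

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart",
        "http://en.wikipedia.org/wiki/Golf"]

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        for tag in soup(["script", "style"]):
            tag.decompose()  # remove non-visible content
        print(soup.get_text(separator=' '), file=outfile)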
