Downloading links from a txt file - python

I am very new to Python. As a simple exercise, I want to download a bunch of links listed in a txt file. The files are all annual reports, also in txt format. I also want to use each link as the file name, with '/' replaced by '_'. I have tried the following so far. I do not know how to open a txt file with one URL per line, which is why I am using a list of URLs, but I want to do it properly. I know the following code is nowhere near what I want, but I wanted to give it a try. Can anyone please help with this? Thanks a million!
import requests
urllist = ["https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt",
           "https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt",
           ]
for url in urllist:
    r = requests.get(url)
    with open('filename.txt', 'w') as file:
        file.write(r.text)

You can try using:
import requests
urllist = ["https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt",
           "https://www.sec.gov/Archives/edgar/data/100240/0000950144-94-000787.txt"]  # links are the same
for url in urllist:
    r = requests.get(url)
    if r.status_code == 200:
        fn = url.replace("/", "_").replace(":", "_")  # on Windows, ':' is not allowed in filenames
        with open(fn, 'w') as file:
            file.write(r.text)
Output:
https___www.sec.gov_Archives_edgar_data_100240_0000950144-94-000787.txt
Only one file was generated because links are repeated

If your links are in a file, say urls.txt, with each link on its own line, then you can use this:
import urllib.request
with open('urls.txt') as f:
    for url in f:
        url = url.strip()  # drop the trailing newline
        urllib.request.urlretrieve(url, url.replace('/', '_').replace(':', '_'))
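The two steps above (read one URL per line, then derive a file name from the URL) can be sketched as small helper functions. This is a minimal sketch, not the answerer's code: the names read_urls, url_to_filename and download_all are hypothetical, and it assumes urls.txt holds one URL per line, possibly with blank lines in between.

```python
import urllib.request


def read_urls(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def url_to_filename(url):
    """Turn a URL into a flat file name by replacing '/' and ':' with '_'."""
    return url.replace('/', '_').replace(':', '_')


def download_all(path):
    """Download every URL listed in `path`, fetching repeated URLs only once."""
    # dict.fromkeys keeps the original order and drops duplicates
    for url in dict.fromkeys(read_urls(path)):
        urllib.request.urlretrieve(url, url_to_filename(url))
```

Calling download_all('urls.txt') would then save each report under its flattened URL. Dropping duplicates is a choice made here because repeated links would only overwrite the same file anyway.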

Related

in python - how to save multi HTML Source code to one single text file

I have a list of links (stored in a links.txt file).
This code can save the result of one link,
but I do not know how to make it download the source code of all the links in links.txt and save them as one single text file for the next step of processing.
import urllib.request
urllib.request.urlretrieve("https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn=1", "result.txt")
Example links from links.txt
https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn=1
https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn=2
https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn=3
https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn=4
https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn=5
https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn=6
https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn=7
....
urllib
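import urllib.request
with open('links.txt', 'r') as f:
    # one URL per line; strip newlines and skip blank lines
    links = [line.strip() for line in f if line.strip()]
for link in links:
    with urllib.request.urlopen(link) as response:
        # get html text
        html = response.read().decode('utf-8')
    # append html to file ('a' appends; 'w+' would overwrite on every pass)
    with open('result.txt', 'a') as out:
        out.write(html)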
requests
You could also use the requests library, which I find much more readable.
pip install requests
import requests
with open('links.txt', 'r') as f:
    # one URL per line; strip newlines and skip blank lines
    links = [line.strip() for line in f if line.strip()]
for link in links:
    response = requests.get(link)
    html = response.text
    # append html to file ('a' appends; 'w+' would overwrite on every pass)
    with open('result.txt', 'a') as out:
        out.write(html)
Use loop for page navigation
Use a for loop to generate the page links, since the only thing that changes is the page number.
links = [
    f'https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn={n}'
    for n in range(1, 10)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
]
or as you go along
for n in range(1, 10):
    link = f'https://www.ebay.com/sch/i.html?_from=R40&_nkw=abc&_sacat=0&_pgn={n}'
    [...]
Actually, it's usually better to use the requests lib, so you should start by installing it:
pip install requests
Then I'd propose reading links.txt line by line, downloading all the data you need and storing it in the file output.txt:
import requests
data = []
# collect all the data from all links in the file
with open('links.txt', 'r') as links:
    for link in links:
        response = requests.get(link.strip())  # strip the trailing newline
        data.append(response.text)
# put everything collected into a single file
with open('output.txt', 'w+') as output:
    for chunk in data:
        print(chunk, file=output)
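The answers above come down to the same loop: fetch each page, then add it to one output file. A hedged sketch of that loop follows. The names combine_sources and fetch_html are hypothetical, UTF-8 is an assumed encoding, and the fetch parameter is an assumption added here so the loop can be tried without network access. The output file is opened once, before the loop, so every page ends up in the same file.

```python
import urllib.request


def fetch_html(url):
    """Download one page and decode it as UTF-8 (assumed encoding)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8')


def combine_sources(links, out_path, fetch=fetch_html):
    """Write the source of every link into one file, one page after another."""
    with open(out_path, 'w', encoding='utf-8') as out:
        for link in links:
            link = link.strip()
            if not link:  # skip blank lines
                continue
            out.write(fetch(link))
            out.write('\n')  # keep pages separated by a newline
```

With a links.txt like the one in the question, this could be used as `with open('links.txt') as f: combine_sources(f, 'result.txt')`, since iterating over a file yields its lines.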

scrapy reading urls from a txt file fail

This is what the txt file looks like; I opened it in a Jupyter notebook. Note that I changed the names of the links in the result, for obvious reasons.
input-----------------------------
with open('...\j.txt', 'r') as f:
    data = f.readlines()
print(data[0])
print(type(data))
output---------------------------------
['https://www.example.com/191186976.html', 'https://www.example.com/191187171.html']
Now I wrote this in my scrapy script, but it didn't visit the links when I ran it. Instead it shows: ERROR: Error while obtaining start requests.
class abc(scrapy.Spider):
    name = "abc_article"
    with open('j.txt', 'r') as f4:
        url_c = f4.readlines()
    u = url_c[0]
    start_urls = u
And if I write u = ['example.html', 'example.html'] starting_url = u then it works perfectly fine. I'm new to scrapy, so I'd like to ask what the problem is here. Is it the reading method or something else I didn't notice? Thanks.
Something like this should get you going in the right direction.
import csv
from urllib.request import urlopen
#import urllib2
from bs4 import BeautifulSoup
contents = []
with open('C:\\your_path_here\\test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents
for url in contents:  # Parse through each url in the list.
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "html.parser")
    print(soup)
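Judging from the printed output in the question, the first line of j.txt appears to be a Python-style list literal rather than one URL per line, so url_c[0] is a single string, while start_urls is meant to be a list of URLs. Below is a minimal sketch of one way to turn that line into a real list, assuming the file really has that shape. The helper name load_start_urls is hypothetical.

```python
import ast


def load_start_urls(path):
    """Parse the first line of `path` as a Python list literal of URL strings."""
    with open(path) as f:
        first_line = f.readline().strip()
    urls = ast.literal_eval(first_line)  # only evaluates literals, not arbitrary code
    if not isinstance(urls, list):
        raise ValueError('expected a list of URLs on the first line')
    return urls
```

In the spider, the class attribute could then be set with `start_urls = load_start_urls('j.txt')`. If the file instead holds one URL per line, a plain list of stripped lines would do and no parsing is needed.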

How do I download from a text file that has a list with links all in 1 run?

I've tried using wget in Python to download links from a txt file.
What should I use to help me do this?
I'm using the wget Python module.
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'html.parser')
body = soup.body
s = "https://google.com/"
for url in soup.find_all('a'):
    f = open("output.txt", "a")
    print(str(s), file=f, end = '')
    print(url.get('href'), file=f)
    f.close()
So far I've only been able to create the text file then use wget.exe in the command prompt. I'd like to be able to do all this in 1 step.
Since you're already using the third-party requests library, just use that:
import requests
from os.path import basename
with open('output.txt') as urls:
    for url in urls:
        url = url.strip()  # remove the trailing newline
        response = requests.get(url)
        filename = basename(url)
        with open(filename, 'wb') as output:
            output.write(response.content)
This code makes many assumptions:
The end of the url must be a unique name as we use basename to create the name of the downloaded file. e.g. basename('https://i.imgur.com/7ljexwX.gifv') gives '7ljexwX.gifv'
The content is assumed to be binary not text and we open the output file as 'wb' meaning 'write binary'.
The response isn't checked to make sure there were no errors
If the content is large this will be loaded into memory and then written to the output file. This may not be very efficient. There are other questions on this site which address that.
I also haven't actually tried running this code.
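Two of the assumptions listed above, the unchecked response and the whole body being held in memory, can be relaxed. The sketch below uses only the standard library instead of requests, which is a swap made here for illustration. The names filename_from_url and download_to_file are hypothetical. It relies on urlopen raising urllib.error.HTTPError for error statuses and on shutil.copyfileobj copying in chunks.

```python
import posixpath
import shutil
import urllib.request
from urllib.parse import urlsplit


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    name = posixpath.basename(urlsplit(url).path)
    if not name:
        raise ValueError(f'no file name in URL: {url!r}')
    return name


def download_to_file(url):
    """Stream one URL to disk; HTTP errors raise urllib.error.HTTPError."""
    filename = filename_from_url(url)
    with urllib.request.urlopen(url) as response, open(filename, 'wb') as output:
        shutil.copyfileobj(response, output)  # copies chunk by chunk, not all at once
    return filename
```

Taking the name from the parsed path rather than the raw string means a query string after the file name is ignored. With requests, a comparable approach would be requests.get(url, stream=True) followed by raise_for_status() and iter_content().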

How can i call URL's from text file one by one

I want to parse one website with several URLs, and I created a text file that has all the links I want to parse. How can I call these URLs from the text file one by one in a Python program?
import json
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("https://www.example.com").content, "html.parser")
for d in soup.select("div[data-selenium=itemDetail]"):
    url = d.select_one("h3[data-selenium] a")["href"]
    upc = BeautifulSoup(requests.get(url).content, "html.parser").select_one("span.upcNum")
    if upc:
        data = json.loads(d["data-itemdata"])
        text = (upc.text.strip())
        print(upc.text)
        outFile = open('/Users/Burak/Documents/new_urllist.txt', 'a')
        outFile.write(str(data))
        outFile.write(",")
        outFile.write(str(text))
        outFile.write("\n")
        outFile.close()
urllist.txt
https://www.example.com/category/1
category/2
category/3
category/4
Thanks in advance
Use a context manager:
with open("/file/path") as f:
    urls = [u.strip('\n') for u in f.readlines()]
You get a list of all the URLs in your file and can then call them as you like.
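The sample urllist.txt in the question mixes a full URL with shorter entries such as category/2. If those are meant as paths on the same site (an assumption; the question does not say), urllib.parse.urljoin can expand them before they are requested. BASE_URL and load_urls are hypothetical names used only for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.example.com/'  # assumed site root


def load_urls(path, base=BASE_URL):
    """Read one URL per line and resolve relative entries against `base`."""
    with open(path) as f:
        return [urljoin(base, line.strip()) for line in f if line.strip()]
```

Entries that are already absolute pass through urljoin unchanged, so the same call handles both kinds of line. The resulting list can be looped over with `for url in load_urls('urllist.txt'):`.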

Open links from txt file in python

I would like to ask for help with an RSS program. What I'm doing is collecting sites that contain relevant information for my project and then checking whether they have RSS feeds.
The links are stored in a txt file (one link on each line).
So I have a txt file full of base URLs that need to be checked for RSS.
I have found this code, which would make my job much easier.
import requests
from bs4 import BeautifulSoup
def get_rss_feed(website_url):
    if website_url is None:
        print("URL should not be null")
    else:
        source_code = requests.get(website_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.find_all("link", {"type" : "application/rss+xml"}):
            href = link.get('href')
            print("RSS feed for " + website_url + "is -->" + str(href))
get_rss_feed("http://www.extremetech.com/")
But I would like to open my collected urls from the txt file, rather than typing each, one by one.
So I have tried to extend the program with this:
from bs4 import BeautifulSoup, SoupStrainer
with open('test.txt','r') as f:
    for link in BeautifulSoup(f.read(), parse_only=SoupStrainer('a')):
        if link.has_attr('http'):
            print(link['http'])
But this returns an error saying that BeautifulSoup is not an HTTP client.
I have also extended with this:
def open():
    f = open("file.txt")
    lines = f.readlines()
    return lines
But this gave me a list separated with ","
I would be really thankful if someone could help me.
Typically you'd do something like this:
with open('links.txt', 'r') as f:
    for line in f:
        get_rss_feed(line.strip())  # strip the trailing newline from each URL
Also, it's a bad idea to define a function with the name open unless you intend to replace the builtin function open.
I guess you can do it using urllib (Python 3 syntax shown):
import urllib.request
f = open('test.txt','r')
#considering each url in a new line...
while True:
    URL = f.readline()
    if not URL:
        break
    mycontent = urllib.request.urlopen(URL.strip()).read()
f.close()
