I want to parse one website with some URLs, and I created a text file that has all the links I want to parse. How can I call these URLs from the text file one by one in a Python program?
from bs4 import BeautifulSoup
import requests
import json

soup = BeautifulSoup(requests.get("https://www.example.com").content, "html.parser")
for d in soup.select("div[data-selenium=itemDetail]"):
    url = d.select_one("h3[data-selenium] a")["href"]
    upc = BeautifulSoup(requests.get(url).content, "html.parser").select_one("span.upcNum")
    if upc:
        data = json.loads(d["data-itemdata"])
        text = upc.text.strip()
        print(upc.text)
        outFile = open('/Users/Burak/Documents/new_urllist.txt', 'a')
        outFile.write(str(data))
        outFile.write(",")
        outFile.write(str(text))
        outFile.write("\n")
        outFile.close()
urllist.txt
https://www.example.com/category/1
category/2
category/3
category/4
Thanks in advance
Use a context manager:
with open("/file/path") as f:
    urls = [u.strip('\n') for u in f.readlines()]
This gives you a list of all the URLs in your file, which you can then request however you like.
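Combining that with the scraping loop from the question might look something like this rough sketch; the selectors and output path are the question's, and the input file is assumed to be the urllist.txt shown above:

from bs4 import BeautifulSoup
import requests
import json

# read every URL from the text file, one per line
with open("urllist.txt") as f:
    urls = [u.strip() for u in f if u.strip()]

for page_url in urls:
    soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    for d in soup.select("div[data-selenium=itemDetail]"):
        url = d.select_one("h3[data-selenium] a")["href"]
        upc = BeautifulSoup(requests.get(url).content, "html.parser").select_one("span.upcNum")
        if upc:
            data = json.loads(d["data-itemdata"])
            with open('/Users/Burak/Documents/new_urllist.txt', 'a') as outFile:
                outFile.write(str(data) + "," + upc.text.strip() + "\n")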
Related
I want to extract some information from multiple pages which have similar page structures.
All the URLs of the pages are saved in one file.txt (one URL per line).
I have already created the code to scrape all the data from one link (it works).
But I don't know how to create a loop that goes through the whole list of URLs from the txt file and scrapes all the data.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from bs4 import Comment
import re
import rispy  # Writing an ris file

with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
Just work with each url:page inside the loop!
for line in f:
    url = line.strip()
    html = requests.get(url).text  # is .content better?
    soup = BeautifulSoup(html, "html.parser")
    # work with soup here!
Creating more functions may help keep your program easy to read if you find yourself packing a lot into one block (a short sketch follows below).
See Cyclomatic Complexity (which is practically a count of control statements like if and for).
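For instance, a minimal sketch of pulling the per-URL work into its own function; scrape_page is a hypothetical name, and its body is just the logic from the question:

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    # fetch one page and return its parsed soup
    html = requests.get(url).text
    return BeautifulSoup(html, "html.parser")

with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        soup = scrape_page(line.strip())
        # work with soup here!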
Additionally, if you want to collect all the values before doing further processing (though this is frequently better accomplished with more esoteric logic, like a generator or asyncio, to fetch many pages in parallel), you might consider creating some collection before the loop to store the results:
collected_results = []  # a new list
...
for line in fh:
    result = ...  # process the line via whatever logic
    collected_results.append(result)
# now collected_results has the result from each line
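And if you prefer the generator route mentioned above, here is a minimal sketch, assuming the same file path as the question (soups_from_file is a hypothetical helper name):

import requests
from bs4 import BeautifulSoup

def soups_from_file(path):
    # lazily yield one parsed page per non-empty line in the file
    with open(path) as fh:
        for line in fh:
            url = line.strip()
            if url:
                yield BeautifulSoup(requests.get(url).text, "html.parser")

for soup in soups_from_file('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt'):
    # work with soup here!
    ...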
You are making a big mistake by writing:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
because that will store the HTML data of only the last URL from the txt file in the html variable.
After the for loop finishes, the last line of the txt file is what remains in the url variable, which means you will only ever fetch the last URL in the file.
The code should be:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
This is what the txt file looks like; I opened it from a Jupyter notebook. Note that I changed the names of the links in the result for obvious reasons.
input-----------------------------
with open('...\j.txt', 'r') as f:
    data = f.readlines()
print(data[0])
print(type(data))
output---------------------------------
['https://www.example.com/191186976.html', 'https://www.example.com/191187171.html']
Now I put this in my Scrapy script, but it didn't go to the links when I ran it. Instead it shows: ERROR: Error while obtaining start requests.
class abc(scrapy.Spider):
    name = "abc_article"

    with open('j.txt', 'r') as f4:
        url_c = f4.readlines()
    u = url_c[0]
    start_urls = u
And if I write u = ['example.html', 'example.html'] with starting_url = u, then it works perfectly fine. I'm new to Scrapy, so I'd like to ask what the problem is here. Is it the reading method, or something else I didn't notice? Thanks.
Something like this should get you going in the right direction.
import csv
from urllib.request import urlopen
#import urllib2
from bs4 import BeautifulSoup

contents = []
with open('C:\\your_path_here\\test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

for url in contents:  # Parse through each url in the list.
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "html.parser")
    print(soup)
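As for the Scrapy error in the question: start_urls needs to be a list of URL strings, but url_c[0] is a single string, and the output shown above suggests that string itself looks like a Python list literal. Here is a minimal sketch under that assumption, parsing the line with ast.literal_eval (if the file actually holds one plain URL per line, readlines() plus strip() on each line is enough instead):

import ast
import scrapy

class abc(scrapy.Spider):
    name = "abc_article"

    with open('j.txt', 'r') as f4:
        first_line = f4.readline().strip()

    # turn the list-literal-looking line into a real Python list of URLs
    start_urls = ast.literal_eval(first_line)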
It's an easy and basic question, I guess, but I didn't manage to find a clear and simple answer.
Here is my problem:
I have a .txt file with URLs on each line (around 300). I got these URLs from a Python script.
I would like to open these URLs one by one and run this script on each of them to get some information I am interested in:
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.aerodromes.fr/aeroport-de-saint-martin-grand-case-ttfg-a413.html")
soup = BeautifulSoup(page, "html.parser")

info_tag = soup.find_all('b')
info_nom = info_tag[2].string
info_pos = info_tag[4].next_sibling
info_alt = info_tag[5].next_sibling
info_pis = info_tag[6].next_sibling
info_vil = info_tag[7].next_sibling
print(info_nom + "," + info_pos + "," + info_alt + "," + info_pis + "," + info_vil)
aero-url.txt:
http://www.aerodromes.fr/aeroport-de-la-reunion-roland-garros-fmee-a416.html,
http://www.aerodromes.fr/aeroport-de-saint-pierre---pierrefonds-fmep-a417.html,
http://www.aerodromes.fr/base-aerienne-de-moussoulens-lf34-a433.html,
http://www.aerodromes.fr/aerodrome-d-yvetot-lf7622-a469.html,
http://www.aerodromes.fr/aerodrome-de-dieppe---saint-aubin-lfab-a1.html,
http://www.aerodromes.fr/aeroport-de-calais---dunkerque-lfac-a2.html,
http://www.aerodromes.fr/aerodrome-de-compiegne---margny-lfad-a3.html,
http://www.aerodromes.fr/aerodrome-d-eu---mers---le-treport-lfae-a4.html,
http://www.aerodromes.fr/aerodrome-de-laon---chambry-lfaf-a5.html,
http://www.aerodromes.fr/aeroport-de-peronne---saint-quentin-lfag-a6.html,
http://www.aerodromes.fr/aeroport-de-nangis-les-loges-lfai-a7.html,
...
I think I have to use a loop with something like this:
import urllib2
from bs4 import BeautifulSoup
# Open the file for reading
infile = open("aero-url.txt", 'r')
# Read every single line of the file into an array of lines
lines = infile.readline().rstrip('\n\r')
for line in infile
page = urllib2.urlopen(lines)
soup = BeautifulSoup(page, "html.parser")
#find the places of each info
info_tag = soup.find_all('b')
info_nom =info_tag[2].string
info_pos =info_tag[4].next_sibling
info_alt =info_tag[5].next_sibling
info_pis =info_tag[6].next_sibling
info_vil =info_tag[7].next_sibling
#Print them on the terminal.
print(info_nom +","+ info_pos+","+ info_alt +","+ info_pis +","+info_vil)
I will write these results to a txt file afterwards. But my problem here is how to apply my parsing script to my text file of URLs.
Use line instead of lines in urlopen:
page = urllib2.urlopen(line)
Since you are using infile in the loop, you do not need the lines line:
lines = infile.readline().rstrip('\n\r')
Also, the indentation of the loop body is wrong.
Correcting these, your code should look like the following.
import urllib2
from bs4 import BeautifulSoup

# Open the file for reading
infile = open("aero-url.txt", 'r')

for line in infile:
    page = urllib2.urlopen(line)
    soup = BeautifulSoup(page, "html.parser")

    # find the places of each info
    info_tag = soup.find_all('b')
    info_nom = info_tag[2].string
    info_pos = info_tag[4].next_sibling
    info_alt = info_tag[5].next_sibling
    info_pis = info_tag[6].next_sibling
    info_vil = info_tag[7].next_sibling

    # Print them on the terminal.
    print(info_nom + "," + info_pos + "," + info_alt + "," + info_pis + "," + info_vil)
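One possible refinement, based on the sample aero-url.txt shown above where each line ends with a comma: strip the newline and the trailing comma before calling urlopen, otherwise they become part of the requested address. A small sketch:

import urllib2
from bs4 import BeautifulSoup

infile = open("aero-url.txt", 'r')

for line in infile:
    url = line.strip().rstrip(',')  # drop the newline and the trailing comma
    if not url:
        continue  # skip blank lines
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    # ... same info_tag extraction as above ...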
I would like to ask for help with an RSS program. What I'm doing is collecting sites which contain information relevant to my project and then checking whether they have RSS feeds.
The links are stored in a txt file (one link on each line).
So I have a txt file full of base URLs which need to be checked for RSS.
I have found this code, which would make my job much easier.
import requests
from bs4 import BeautifulSoup

def get_rss_feed(website_url):
    if website_url is None:
        print("URL should not be null")
    else:
        source_code = requests.get(website_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.find_all("link", {"type": "application/rss+xml"}):
            href = link.get('href')
            print("RSS feed for " + website_url + "is -->" + str(href))

get_rss_feed("http://www.extremetech.com/")
But I would like to open my collected URLs from the txt file, rather than typing each one in by hand.
So I have tried to extend the program with this:
from bs4 import BeautifulSoup, SoupStrainer

with open('test.txt', 'r') as f:
    for link in BeautifulSoup(f.read(), parse_only=SoupStrainer('a')):
        if link.has_attr('http'):
            print(link['http'])
But this returns an error saying that BeautifulSoup is not an HTTP client.
I have also tried extending it with this:
def open():
    f = open("file.txt")
    lines = f.readlines()
    return lines
But this gave me a list separated with ",".
I would be really thankful if someone could help me.
Typically you'd do something like this:
with open('links.txt', 'r') as f:
    for line in f:
        get_rss_feed(line)
Also, it's a bad idea to define a function with the name open unless you intend to replace the builtin function open.
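One small caveat with the loop above: each line yielded by the file iterator keeps its trailing newline, so stripping it before passing it to get_rss_feed is safer. A minimal sketch:

with open('links.txt', 'r') as f:
    for line in f:
        url = line.strip()
        if url:  # skip blank lines
            get_rss_feed(url)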
I guess you can do it by using urllib:
import urllib

f = open('test.txt', 'r')
# considering each url in a new line...
while True:
    URL = f.readline()
    if not URL:
        break
    mycontent = urllib.urlopen(URL).read()
Problem: I am trying to scrape multiple websites using beautifulsoup for only the visible text and then export all of the data to a single text file.
This file will be used as a corpus for finding collocations using NLTK. I'm working with something like this so far but any help would be much appreciated!
import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "http://en.wikipedia.org/wiki/Golf"]

for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]

with open('thisisanew.txt', 'w') as file:
    for item in text:
        print(file, item)
Unfortunately, there are two issues with this: when I try to export the data to a .txt file, it comes out completely blank.
Any ideas?
print(file, item) should be print(item, file=file).
But don't name your file handle file, as this shadows the file builtin; something like this is better:
with open('thisisanew.txt', 'w') as outfile:
    for item in text:
        print(item, file=outfile)
To solve the next problem, overwriting the data from the first URL, you can move the file writing code into your loop, and open the file once before entering the loop:
import requests
from bs4 import BeautifulSoup
from collections import Counter

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "http://en.wikipedia.org/wiki/Golf"]

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for url in urls:
        website = requests.get(url)
        soup = BeautifulSoup(website.content)
        text = [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
        for item in text:
            print(item, file=outfile)
There is another problem: you are collecting the text from only the last URL, because you reassign the text variable over and over.
Define text as an empty list before the loop and add new data to it inside:
text = []
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    text += [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
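Putting the two fixes together, a sketch of the whole script might look like this; it uses the same URLs and output file as above, and accumulates everything first before writing once at the end (writing inside the loop, as shown earlier, works just as well):

import requests
from bs4 import BeautifulSoup

urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart", "http://en.wikipedia.org/wiki/Golf"]

text = []
for url in urls:
    website = requests.get(url)
    # an explicit parser avoids the "no parser specified" warning
    soup = BeautifulSoup(website.content, "html.parser")
    text += [''.join(s.findAll(text=True)) for s in soup.findAll('p')]

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
    for item in text:
        print(item, file=outfile)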