It's an easy and basic question, I guess, but I didn't manage to find a clear and simple answer.
Here is my problem:
I have a .txt file with a URL on each line (around 300 of them). I generated these URLs with a Python script.
I would like to open these URLs one by one and run this script on each of them to extract the information I'm interested in:
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("http://www.aerodromes.fr/aeroport-de-saint-martin-grand-case-ttfg-a413.html")
soup = BeautifulSoup(page, "html.parser")

info_tag = soup.find_all('b')
info_nom = info_tag[2].string
info_pos = info_tag[4].next_sibling
info_alt = info_tag[5].next_sibling
info_pis = info_tag[6].next_sibling
info_vil = info_tag[7].next_sibling

print(info_nom + "," + info_pos + "," + info_alt + "," + info_pis + "," + info_vil)
aero-url.txt:
http://www.aerodromes.fr/aeroport-de-la-reunion-roland-garros-fmee-a416.html,
http://www.aerodromes.fr/aeroport-de-saint-pierre---pierrefonds-fmep-a417.html,
http://www.aerodromes.fr/base-aerienne-de-moussoulens-lf34-a433.html,
http://www.aerodromes.fr/aerodrome-d-yvetot-lf7622-a469.html,
http://www.aerodromes.fr/aerodrome-de-dieppe---saint-aubin-lfab-a1.html,
http://www.aerodromes.fr/aeroport-de-calais---dunkerque-lfac-a2.html,
http://www.aerodromes.fr/aerodrome-de-compiegne---margny-lfad-a3.html,
http://www.aerodromes.fr/aerodrome-d-eu---mers---le-treport-lfae-a4.html,
http://www.aerodromes.fr/aerodrome-de-laon---chambry-lfaf-a5.html,
http://www.aerodromes.fr/aeroport-de-peronne---saint-quentin-lfag-a6.html,
http://www.aerodromes.fr/aeroport-de-nangis-les-loges-lfai-a7.html,
...
I think I have to use a loop, something like this:
import urllib2
from bs4 import BeautifulSoup
# Open the file for reading
infile = open("aero-url.txt", 'r')
# Read every single line of the file into an array of lines
lines = infile.readline().rstrip('\n\r')
for line in infile
page = urllib2.urlopen(lines)
soup = BeautifulSoup(page, "html.parser")
#find the places of each info
info_tag = soup.find_all('b')
info_nom =info_tag[2].string
info_pos =info_tag[4].next_sibling
info_alt =info_tag[5].next_sibling
info_pis =info_tag[6].next_sibling
info_vil =info_tag[7].next_sibling
#Print them on the terminal.
print(info_nom +","+ info_pos+","+ info_alt +","+ info_pis +","+info_vil)
I will write these results to a txt file afterwards. But my problem here is how to apply my parsing script to my text file of URLs.
Use line instead of lines in urlopen:
page = urllib2.urlopen(line)
Since you are iterating over infile in the loop, you do not need the lines line:
lines = infile.readline().rstrip('\n\r')
Also, the indentation of the loop body is wrong (and the for line is missing its colon).
Correcting these, your code should look like the following:
import urllib2
from bs4 import BeautifulSoup

# Open the file for reading
infile = open("aero-url.txt", 'r')

for line in infile:
    page = urllib2.urlopen(line)
    soup = BeautifulSoup(page, "html.parser")

    # find the places of each info
    info_tag = soup.find_all('b')
    info_nom = info_tag[2].string
    info_pos = info_tag[4].next_sibling
    info_alt = info_tag[5].next_sibling
    info_pis = info_tag[6].next_sibling
    info_vil = info_tag[7].next_sibling

    # Print them on the terminal.
    print(info_nom + "," + info_pos + "," + info_alt + "," + info_pis + "," + info_vil)
Related
I'm trying to build a web crawler that generates a text file for multiple different websites. After it crawls a website, it is supposed to collect all the links on that site. However, I've run into a problem while crawling Wikipedia. The Python script gives me this error:

Traceback (most recent call last):
  File "/home/banana/Desktop/Search engine/data/crawler?.py", line 22, in <module>
    urlwaitinglist.write(link.get('href'))
TypeError: write() argument must be str, not None

I looked deeper into it by printing the discovered links, and the output has "None" at the top. I'm wondering if there is a way to check whether the variable actually has a value.
Here is the code I have written so far:
from bs4 import BeautifulSoup
import os
import requests
import random
import re

toscan = "https://en.wikipedia.org/wiki/Wikipedia:Contents"

url = toscan
source_code = requests.get(url)
plain_text = source_code.text

removal_list = ["http://", "https://", "/"]
for word in removal_list:
    toscan = toscan.replace(word, "")

soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
    urlwaitinglist = open("/home/banana/Desktop/Search engine/data/toscan", "a")
    urlwaitinglist.write('\n')
    urlwaitinglist.write(link.get('href'))
    urlwaitinglist.close()

print(soup.get_text())

directory = "/home/banana/Desktop/Search engine/data/Crawled Data/"
results = soup.get_text()
results = results.strip()

f = open("/home/banana/Desktop/Search engine/data/Crawled Data/" + toscan + ".txt", "w")
f.write(url)
f.write('\n')
f.write(results)
f.close()
It looks like not every <a> tag you are grabbing returns a value for href. I would suggest converting each link you grab to a string and checking that it is not None. It is also bad practice to open a file without using the with statement. I've added an example below that grabs every http or https link and writes it to a file, reusing some of your code:
from bs4 import BeautifulSoup
import os
import requests
import random
import re

file_directory = './'  # your specified directory location
filename = 'urls.txt'  # your specified filename

url = "https://en.wikipedia.org/wiki/Wikipedia:Contents"
res = requests.get(url)
html = res.text

soup = BeautifulSoup(html, 'html.parser')

links = []
for link in soup.find_all('a'):
    link = link.get('href')
    print(link)
    match = re.search('^(http|https)://', str(link))
    if match:
        links.append(str(link))

with open(file_directory + filename, 'w') as file:
    for link in links:
        file.write(link + '\n')
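One more thing to be aware of: on Wikipedia most href values are relative (for example /wiki/...), so the http/https filter above also drops those. If you want to keep them, a possible variant (a sketch reusing the soup and url variables from the code above) is to resolve relative links with urllib.parse.urljoin and only skip the tags whose href is None:

from urllib.parse import urljoin

absolute_links = []
for link in soup.find_all('a'):
    href = link.get('href')
    if href is None:                     # anchors without an href at all
        continue
    absolute_links.append(urljoin(url, href))   # relative hrefs become absolute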
I want to extract some information from multiple pages which share a similar page structure.
All the URLs of the pages are saved in one file.txt (one URL per line).
I have already written the code to scrape all the data from one link (it works).
But I don't know how to create a loop that goes through the whole list of URLs from the txt file and scrapes the data from each one.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from bs4 import Comment
import re
import rispy # Writing an ris file
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
Just work with each url:page inside the loop!
for line in f:
    url = line.strip()
    html = requests.get(url).text  # is .content better?
    soup = BeautifulSoup(html, "html.parser")
    # work with soup here!
Creating more functions can also make your program easier to read if you find yourself packing a lot of logic into one block.
See cyclomatic complexity (which is essentially a count of control statements such as if and for).
Additionally, if you want to collect all the values before doing further processing, you might create a collection before the loop to store the results (though this is frequently better accomplished with more esoteric logic, such as a generator or asyncio, to fetch many pages in parallel):
collected_results = []  # a new list
...
for line in fh:
    result = ...  # process the line via whatever logic
    collected_results.append(result)

# now collected_results has the result from each line
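Putting those two suggestions together, a minimal sketch might look like the following (scrape_page is a hypothetical helper name, and the returned title is just a stand-in for whatever fields you actually extract):

import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    # hypothetical helper: fetch one page and pull out what you need
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None

collected_results = []
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
        if url:
            collected_results.append(scrape_page(url))
# collected_results now holds one entry per URL in the file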
You are making a big mistake by writing:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
because that leaves only the HTML of the last URL from the txt file in the html variable.
After the for loop finishes, the last line of the txt file is what remains in url, which means you will only ever scrape the last URL in the file.
The code should be:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
This is what the txt file looks like; I opened it from a Jupyter notebook. Note that I changed the names of the links in the output, for obvious reasons.
input-----------------------------
with open('...\j.txt', 'r') as f:
    data = f.readlines()

print(data[0])
print(type(data))
output---------------------------------
['https://www.example.com/191186976.html', 'https://www.example.com/191187171.html']
Now, when I used these in my scrapy script, it didn't go to the links when I ran it. Instead it shows: ERROR: Error while obtaining start requests.
class abc(scrapy.Spider):
    name = "abc_article"

    with open('j.txt', 'r') as f4:
        url_c = f4.readlines()

    u = url_c[0]
    start_urls = u
And if I write u = ['example.html', 'example.html'] and start_urls = u, then it works perfectly fine. I'm new to scrapy, so I'd like to ask what the problem is here. Is it the reading method, or something else I didn't notice? Thanks.
Something like this should get you going in the right direction.
import csv
from urllib.request import urlopen
#import urllib2
from bs4 import BeautifulSoup

contents = []
with open('C:\\your_path_here\\test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

for url in contents:  # Parse through each url in the list.
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "html.parser")
    print(soup)
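If you would rather keep everything inside Scrapy instead of switching to BeautifulSoup, the usual pattern is to make start_urls a plain list of URL strings, one per line of the file. A minimal sketch, assuming j.txt really does contain one bare URL per line (if a line is actually a Python-list literal like the output shown above, it would need ast.literal_eval rather than a simple strip):

import scrapy

class abc(scrapy.Spider):
    name = "abc_article"

    # build the list of start URLs from the file, one URL per line
    with open('j.txt', 'r') as f4:
        start_urls = [line.strip() for line in f4 if line.strip()]

    def parse(self, response):
        # placeholder: your existing parsing logic goes here
        self.logger.info("Visited %s", response.url)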
I want to parse one website using several URLs, and I created a text file that holds all the links I want to parse. How can I call these URLs from the text file, one by one, in my Python program?
from bs4 import BeautifulSoup
import requests
import json

soup = BeautifulSoup(requests.get("https://www.example.com").content, "html.parser")
for d in soup.select("div[data-selenium=itemDetail]"):
    url = d.select_one("h3[data-selenium] a")["href"]
    upc = BeautifulSoup(requests.get(url).content, "html.parser").select_one("span.upcNum")
    if upc:
        data = json.loads(d["data-itemdata"])
        text = upc.text.strip()
        print(upc.text)
        outFile = open('/Users/Burak/Documents/new_urllist.txt', 'a')
        outFile.write(str(data))
        outFile.write(",")
        outFile.write(str(text))
        outFile.write("\n")
        outFile.close()
urllist.txt
https://www.example.com/category/1
category/2
category/3
category/4
Thanks in advance
Use a context manager:

with open("/file/path") as f:
    urls = [u.strip('\n') for u in f.readlines()]
That gives you a list of all the URLs in your file, which you can then call however you like.
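Since the urllist.txt shown above mixes one absolute URL with relative paths such as category/2, it may also help to normalize each line with urllib.parse.urljoin before requesting it. A sketch, assuming the file sits in the working directory and https://www.example.com/ is the right base for the relative entries:

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

base = "https://www.example.com/"        # assumed base for the relative lines

with open("urllist.txt") as f:
    urls = [u.strip() for u in f if u.strip()]

for u in urls:
    full_url = urljoin(base, u)          # absolute URLs pass through unchanged
    soup = BeautifulSoup(requests.get(full_url).content, "html.parser")
    # ... reuse the div[data-selenium=itemDetail] parsing from above here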
I would like to ask for help with an RSS program. What I'm doing is collecting sites that contain information relevant to my project and then checking whether they have RSS feeds.
The links are stored in a txt file (one link on each line).
So I have a txt file full of base URLs that need to be checked for RSS.
I found this code, which would make my job much easier:
import requests
from bs4 import BeautifulSoup

def get_rss_feed(website_url):
    if website_url is None:
        print("URL should not be null")
    else:
        source_code = requests.get(website_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.find_all("link", {"type": "application/rss+xml"}):
            href = link.get('href')
            print("RSS feed for " + website_url + " is --> " + str(href))

get_rss_feed("http://www.extremetech.com/")
But I would like to open my collected URLs from the txt file, rather than typing each one in by hand.
So I have tried to extend the program with this:
from bs4 import BeautifulSoup, SoupStrainer

with open('test.txt', 'r') as f:
    for link in BeautifulSoup(f.read(), parse_only=SoupStrainer('a')):
        if link.has_attr('http'):
            print(link['http'])
But this returns an error saying that BeautifulSoup is not an HTTP client.
I have also tried extending it with this:

def open():
    f = open("file.txt")
    lines = f.readlines()
    return lines

But this gave me a list separated with ",".
I would be really thankful if someone could help me.
Typically you'd do something like this:
with open('links.txt', 'r') as f:
    for line in f:
        get_rss_feed(line)
Also, it's a bad idea to define a function with the name open unless you intend to replace the builtin function open.
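One small detail worth flagging: each line read from the file keeps its trailing newline, which then ends up inside the URL passed to requests. A sketch of the same loop with the whitespace stripped and blank lines skipped:

with open('links.txt', 'r') as f:
    for line in f:
        url = line.strip()
        if url:                  # skip blank lines
            get_rss_feed(url)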
I guess you can do it using urllib:
import urllib

f = open('test.txt', 'r')

# considering each url is on a new line...
while True:
    URL = f.readline()
    if not URL:
        break
    mycontent = urllib.urlopen(URL).read()
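Note that urllib.urlopen only exists on Python 2; on Python 3 the equivalent call is urllib.request.urlopen, so the same idea would look roughly like this (again stripping the newline that each line keeps):

from urllib.request import urlopen

with open('test.txt', 'r') as f:
    for line in f:
        url = line.strip()
        if url:
            mycontent = urlopen(url).read()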