Parse URL with BeautifulSoup - Python

import requests
import csv
import re
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")
links = soup.findAll("a")

with open('aaa.csv', 'wb') as myfile:
    for link in soup.find_all("a", href=re.compile("(?<=/url\?q=)(htt.*://.*)")):
        a = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
        wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
        wr.writerow(a)
This code gives me a CSV file with 28 saved URLs, but the URLs are not correct. For example, this is a wrong URL:
http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A
Instead it should be:
http://www.imdb.com/title/tt0317219/
How can I remove the second part of each URL, i.e. everything starting from "&sa=", so that all URLs are saved like the second one?
I am using Python 2.7 and Ubuntu 16.04.

If the redundant part of the URL always starts with &, you can apply split() to each URL:
url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
url = url.split('&')[0]
print(url)
output:
http://www.imdb.com/title/tt0317219/
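Alternatively, you can parse the Google redirect href instead of splitting strings: the real target lives in the q query parameter, and parse_qs also decodes any percent-encoding on the way out. A minimal sketch (Python 3 imports shown; on Python 2.7 the same functions live in the urlparse module), not part of the original answer:
from urllib.parse import urlparse, parse_qs  # Python 2.7: from urlparse import urlparse, parse_qs

href = "/url?q=http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A"
params = parse_qs(urlparse(href).query)  # {'q': ['http://www.imdb.com/title/tt0317219/'], 'sa': ['U'], ...}
print(params["q"][0])  # http://www.imdb.com/title/tt0317219/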

Not the best way, but you could split one more time by adding one more line after a is assigned:
a=[a[0].split("&")[0]]
print(a)
Result:
['https://de.wikipedia.org/wiki/Cars_(Film)']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:I2SHYtLktRcJ']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Handlung']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Synchronisation']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Soundtrack']
['https://de.wikipedia.org/wiki/Cars_(Film)%23Kritik']
['https://www.mytoys.de/disney-cars/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:9Ohx4TRS8KAJ']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['https://www.youtube.com/watch%3Fv%3DkLAnVd5y7M4']
['http://cars.disney.com/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:1BoR6M9fXwcJ']
['http://cars.disney.com/']
['http://cars.disney.com/']
['https://www.whichcar.com.au/car-style/12-cartoon-cars']
['https://www.youtube.com/watch%3Fv%3D6JSMAbeUS-4']
['http://filme.disney.de/cars-3-evolution']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:fO7ypFFDGk0J']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.4players.de/4players.php/spielinfonews/Allgemein/36859/2169193/Project_CARS_2-Zehn_Ferraris_erweitern_den_virtuellen_Fuhrpark.html']
['http://www.play3.de/2017/08/02/project-cars-2-6/']
['http://www.imdb.com/title/tt0317219/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:-xdXy-yX2fMJ']
['http://www.carmagazine.co.uk/']
['http://webcache.googleusercontent.com/search%3Fq%3Dcache:PRPbHf_kD9AJ']
['http://google.com/search%3Ftbm%3Disch%26q%3DCars']
['http://www.imdb.com/title/tt0317219/']
['https://de.wikipedia.org/wiki/Cars_(Film)']
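Note that several of the results above still contain percent-encoded characters (%3F is ?, %3D is =) because Google encodes the target URL inside the redirect. A minimal decoding sketch, if you want the readable form (Python 3 shown; on Python 2.7 the same function is from urllib import unquote):
from urllib.parse import unquote  # Python 2.7: from urllib import unquote

encoded = "https://www.youtube.com/watch%3Fv%3DtNmo09Q3F8s"
print(unquote(encoded))  # https://www.youtube.com/watch?v=tNmo09Q3F8s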

Related

How to create a loop to go through a list of URLs and scrape all data, when the URLs of the similar pages are saved in one file.txt (one URL per line)?

I want to extract some information from multiple pages which have similar page structures.
All the page URLs are saved in one file.txt (one URL per line).
I have already created the code to scrape all the data from one link (it works).
But I don't know how to create a loop that goes through the whole list of URLs from the txt file and scrapes all the data.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from bs4 import Comment
import re
import rispy # Writing an ris file
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
Just work with each url:page inside the loop!
for line in f:
    url = line.strip()
    html = requests.get(url).text  # is .content better?
    soup = BeautifulSoup(html, "html.parser")
    # work with soup here!
Creating more functions may help your program be easier to read if you find yourself packing a lot into one block.
See Cyclomatic Complexity (which is practically a count of the control statements like if and for).
Additionally, if you want to collect all the values before doing further processing (though this is frequently better accomplished with more esoteric logic, such as a generator or asyncio, to collect many pages in parallel), you might consider creating some collection before the loop to store the results:
collected_results = []  # a new list
...
for line in fh:
    result = ...  # process the line via whatever logic
    collected_results.append(result)
# now collected_results has the result from each line
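Putting the two suggestions together, here is a minimal, self-contained sketch; the scrape_page helper and the plain Thesis2.txt filename are illustrative assumptions, not code from the question:
import requests
from bs4 import BeautifulSoup

def scrape_page(url):
    """Fetch one page and return its parsed soup (hypothetical helper)."""
    html = requests.get(url).text
    return BeautifulSoup(html, "html.parser")

collected_results = []
with open('Thesis2.txt', 'r') as fh:  # assumed: one URL per line
    for line in fh:
        url = line.strip()
        if not url:  # skip blank lines
            continue
        soup = scrape_page(url)
        collected_results.append(soup.title)  # e.g. keep each page's <title>

print(len(collected_results), "pages scraped")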
You are making a big mistake by writing:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()

html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")
because that way only the HTML of the last URL in the txt file ends up in the html variable: after the for loop finishes, the variable url holds only the last line of the txt file, so you scrape only that last URL.
The code should be:
with open('F:\Python\Python-FilePy-Thesis-DownLoad/Thesis2.txt', 'r') as f:
    for line in f:
        url = line.strip()
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")

BeautifulSoup: save each iteration of loop's resulting HTML

I have written the following code to obtain the html of some pages, according to some id which I can input in a URL. I would like to then save each html as a .txt file in a desired path. This is the code that I have written for that purpose:
import urllib3
import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    html = print(soup)
    return html

id = ['11111','22222']
for id in id:
    path = f'D://MyPath//{id}.txt'
    a = open(path, 'w')
    a.write(get_html(id))
    a.close()
Although generating the HTML pages is quite simple, this loop is not working properly. I am getting the following message: TypeError: write() argument must be str, not None, which means the function is somehow failing to return a string that can be saved as a text file.
I would like to add that in the original data I have around 9k ids, so you can also let me know if, instead of several .txt files, you would recommend one big CSV to store all the results. Thanks!
The problem is that print() returns None. Use str() instead:
def get_html(id):
    url = f'https://www.myurl&id={id}'
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    #html=print(soup)  <-- print() returns None
    return str(soup)  # <--- convert soup to string
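As a usage sketch (the ids and the D:// path are just the placeholders from the question), the writing loop can then use a with block so each file is closed even if a later request fails:
ids = ['11111', '22222']
for page_id in ids:
    path = f'D://MyPath//{page_id}.txt'
    with open(path, 'w', encoding='utf8') as out:
        out.write(get_html(page_id))  # get_html() now returns a str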

Write multiple files inside for-loop

I am trying to crawl several links, extract the text found in <p> HTML tags, and write the output to different files. Each link should have its own output file. So far:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re
import csv
import pyperclip
import pprint
import requests

urls = ['https://link1',
        'https://link2']
url_list = list(urls)

#scrape elements
for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find_all('p')
    page = soup.getText()

for line in urls:
    with open('filename{}.txt'.format(line), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
I am getting OSError: [Errno 22] Invalid argument: filenamehttps://link1
If I change my code to this:
for index, line in enumerate(urls):
    with open('filename{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
The script runs but I have a semantic error; both output files contain the text extracted from link2. I guess the second for-loop does this.
I've researched S/O for similar answers but I can't figure it out.
I'm guessing you're on some sort of *nix system, as the error has to do with / being interpreted as part of the path.
So you have to name your files correctly, or create the path you want to save the output in.
Having said that, using the URL as a file name is not a great idea, because of the above error.
You could either replace the / with, say, _ or just name your files differently.
Also, this:
urls = ['https://link1',
        'https://link2']
is already a list, so there is no need for this:
url_list = list(urls)
And there's no need for two for loops. You can write to a file as you scrape the URLs from the list.
Here's the working code with a dummy website:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

urls = ['https://lipsum.com/', 'https://de.lipsum.com/']

for url in urls:
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find("div", {"id": "Panes"}).find("p").getText()
    with open('filename_{}.txt'.format(url.replace("/", "_")), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
You could also use your approach with enumerate():
import requests
from bs4 import BeautifulSoup

urls = ['https://lipsum.com/', 'https://de.lipsum.com/']

for index, url in enumerate(urls, start=1):
    response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(response.content, "html.parser")
    page = soup.find("div", {"id": "Panes"}).find("p").getText()
    with open('filename_{}.txt'.format(index), 'w', encoding="utf8") as outfile:
        outfile.write('\n'.join([i for i in page.split('\n') if len(i) > 0]))
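Another option, not from the original answer, is to derive the filename from the parsed hostname rather than the raw URL, which avoids the slash problem without needing an index. A minimal sketch:
from urllib.parse import urlparse

urls = ['https://lipsum.com/', 'https://de.lipsum.com/']
for url in urls:
    host = urlparse(url).netloc  # 'lipsum.com', then 'de.lipsum.com'
    filename = 'filename_{}.txt'.format(host)  # filename_lipsum.com.txt, filename_de.lipsum.com.txt
    print(filename)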

requests.get(url) in python behaving differently when used in loop

I'm new to Python programming and I am trying to scrape every link available in my Urls.txt file.
The code I wrote is:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

user_agent = UserAgent()
fp = open("Urls.txt", "r")
values = fp.readlines()
fin = open("soup.html", "a")

for link in values:
    print(link)
    page = requests.get(link, headers={"user-agent": user_agent.chrome})
    html = page.content
    soup = BeautifulSoup(html, "html.parser")
    fin.write(str(soup))
The code works absolutely fine when the links are provided directly as strings instead of through the variable, but when run as-is the output differs.
Maybe the string you read from the file has a trailing line break.
To remove it, use link.strip("\n").
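A minimal sketch of that fix applied to the loop from the question (values, fin and user_agent as defined above):
for link in values:
    link = link.strip("\n")  # drop the trailing newline read from the file
    print(link)
    page = requests.get(link, headers={"user-agent": user_agent.chrome})
    soup = BeautifulSoup(page.content, "html.parser")
    fin.write(str(soup))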

Scrapy reading URLs from a txt file fails

This is what the txt file looks like; I opened it from a Jupyter notebook. Notice that I changed the names of the links in the result for obvious reasons.
input-----------------------------
with open('...\j.txt', 'r') as f:
    data = f.readlines()
    print(data[0])
    print(type(data))
output---------------------------------
['https://www.example.com/191186976.html', 'https://www.example.com/191187171.html']
Now I put these in my Scrapy script, but it didn't go to the links when I ran it. Instead it shows: ERROR: Error while obtaining start requests.
class abc(scrapy.Spider):
    name = "abc_article"

    with open('j.txt', 'r') as f4:
        url_c = f4.readlines()
    u = url_c[0]
    start_urls = u
And if I write u = ['example.html', 'example.html'] and start_urls = u, then it works perfectly fine. I'm new to Scrapy, so I'd like to ask what the problem is here. Is it the reading method, or something else I didn't notice? Thanks.
Something like this should get you going in the right direction.
import csv
from urllib.request import urlopen
#import urllib2
from bs4 import BeautifulSoup

contents = []
with open('C:\\your_path_here\\test.csv', 'r') as csvf:  # Open file in read mode
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)  # Add each url to list contents

for url in contents:  # Parse through each url in the list.
    page = urlopen(url[0]).read()
    soup = BeautifulSoup(page, "html.parser")
    print(soup)
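As a note on the original Scrapy error: start_urls must be a list of URL strings, while url_c[0] is a single string, so Scrapy ends up iterating over its characters. A minimal sketch of the Scrapy-side fix, assuming j.txt really contains one URL per line (if the first line is actually a list-like string, as the printed output above suggests, the file itself needs cleaning first):
import scrapy

class abc(scrapy.Spider):
    name = "abc_article"

    with open('j.txt', 'r') as f4:
        start_urls = [line.strip() for line in f4 if line.strip()]

    def parse(self, response):
        # handle each page here
        pass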
