I have scraped a list of pdf links that I want from this website https://www.gmcameetings.co.uk
It is all of the minutes from the local council's committee meetings.
I now need to save all my results into a file so I can then download and read all the pdfs.
How do I go about saving them?
This is my code:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs
url = "https://www.gmcameetings.co.uk/"
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
folder_location = r'E:\Internship\WORK'
meeting_links = soup.find_all('a', href=True)
for link in meeting_links:
    if link['href'].find('/meetings/')>1:
        r2 = requests.get(link['href'])
        print(link['href'])
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        date_links = soup2.find_all('a', href=True)
        for dlink in date_links:
            if dlink['href'].find('/meetings/')>1:
                r3 = requests.get(dlink['href'])
                print(dlink['href'])
                page3 = r3.text
                soup3 = bs(page3, 'lxml')
                pdf_links = soup3.find_all('a', href=True)
                for plink in pdf_links:
                    if plink['href'].find('minutes')>1:
                        print("Minutes!")
I need a file that has all the links, which I can then read the pdfs from. Sorry I'm new to coding completely so a bit lost.
import requests
from bs4 import BeautifulSoup as bs
url = "https://www.gmcameetings.co.uk/"
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
f= open(r"E:\Internship\WORK\links.txt","w+")
n = 0
meeting_links = soup.find_all('a', href=True)
for link in meeting_links:
    if link['href'].find('/meetings/')>1:
        r2 = requests.get(link['href'])
        print(link['href'])
        page2 = r2.text
        soup2 = bs(page2, 'lxml')
        date_links = soup2.find_all('a', href=True)
        for dlink in date_links:
            if dlink['href'].find('/meetings/')>1:
                r3 = requests.get(dlink['href'])
                print(dlink['href'])
                page3 = r3.text
                soup3 = bs(page3, 'lxml')
                pdf_links = soup3.find_all('a', href=True)
                for plink in pdf_links:
                    if plink['href'].find('minutes')>1:
                        n += 1
                        print("Minutes!")
                        f.write("Link " + str(n) + ": " + str(plink['href']) +"\n")
f.close()
Just use a regular text file like this, and write whatever output you need to it:
with open('Test.txt', 'w') as file:
    file.write('Testing output')
Open the file in write mode before the for loop, then write each link followed by a newline on every iteration.
with open('Linkfile.txt', 'w') as f:
    for link in meeting_links:
        if link['href'].find('/meetings/')>1:
            r2 = requests.get(link['href'])
            print("link1")
            page2 = r2.text
            soup2 = bs(page2, 'lxml')
            date_links = soup2.find_all('a', href=True)
            for dlink in date_links:
                if dlink['href'].find('/meetings/')>1:
                    r3 = requests.get(dlink['href'])
                    print("link2")
                    page3 = r3.text
                    soup3 = bs(page3, 'lxml')
                    pdf_links = soup3.find_all('a', href=True)
                    for plink in pdf_links:
                        if plink['href'].find('minutes')>1:
                            print(plink['href'])
                            f.write(plink['href'])
                            f.write('\n')
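# Open the file once in append mode, then write one href per line.
with open('filename.txt', 'a') as fp:
    for link in meeting_links:
        # link is a bs4 Tag, so write its href string rather than the Tag itself
        fp.write(link['href'] + '\n')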
We can use a Python context manager (the with statement), which opens the file (allocates resources) and closes it again (releases resources) once the block has finished.
with open('links.txt', 'w') as file:
    file.write('required content')
We can also specify file type extension as required like links.txt, links.csv etc.
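To tie the answers above together, here is a minimal sketch of the full round trip: write one link per line, then read the file back into a list that can be looped over when downloading. The helper names save_links and load_links and the example.com URLs are made up for illustration; they are not taken from the question's site.

```python
def save_links(links, path):
    """Write one link per line; the with-block closes the file for us."""
    with open(path, "w", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")


def load_links(path):
    """Read the links back as a list, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


if __name__ == "__main__":
    # Hypothetical links standing in for the scraped PDF URLs.
    sample = [
        "https://example.com/minutes/january.pdf",
        "https://example.com/minutes/february.pdf",
    ]
    save_links(sample, "links.txt")
    for link in load_links("links.txt"):
        # The last path segment makes a reasonable local file name.
        print(link.rsplit("/", 1)[-1], "<-", link)
```

Each loaded link could then be fetched with requests.get and its response.content written to a file opened in 'wb' mode, assuming the links are absolute URLs.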
Related
I'm learning Python and web scraping. It is very cool, but I am not able to get what I want.
I'm trying to save product links in a text file so I can scrape the data afterwards.
Here is my script, which works (almost) correctly in the PyCharm console:
import bs4 as bs4
from bs4 import BeautifulSoup
import requests
suffixeUrl = '_puis_nblignes_est_200.html'
for i in range(15):
    url = 'https://www.topachat.com/pages/produits_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_page_est_' + str(i) + suffixeUrl
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    if response.ok:
        print('Page: ' + str(i))
        for data in soup.find_all('div', class_='price'):
            for a in data.find_all('a'):
                link = (a.get('href'))
                links = ('https://www.topachat.com/' + link)
                print(links) #for getting link
My goal is to save the result of the links variable, line by line in a text file.
I tried this, but something is wrong and I can't get each url :
for link in links:
    with open("urls.txt", "a") as f:
        f.write(links+"\n")
Could someone please help me?
You can try this way.
Just open the file once and write the complete data to it. Opening and closing files inside a loop is not a good thing to do.
import bs4 as bs4
from bs4 import BeautifulSoup
import requests
suffixeUrl = '_puis_nblignes_est_200.html'
with open('text.txt', 'w') as f:
    for i in range(15):
        url = 'https://www.topachat.com/pages/produits_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_page_est_' + str(i) + suffixeUrl
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        if response.ok:
            print('Page: ' + str(i))
            for data in soup.find_all('div', class_='price'):
                for a in data.find_all('a'):
                    link = 'https://www.topachat.com/' + a.get('href')
                    f.write(link+'\n')
Sample output from text.txt
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in11020650.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in10119254.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20005046.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20002036.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20002591.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20004309.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in20002592.html
https://www.topachat.com/pages/detail2_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_ref_est_in10089390.html
.
.
.
Your problem is in for link in links line:
link = (a.get('href'))
links = ('https://www.topachat.com/' + link)
print(links)
for link in links:
    with open("urls.txt", "a") as f:
        f.write(links+"\n")
The type of links is str, so your for loop iterates over it character by character. That is why you see a single character on each line of your txt file. You can just remove the for loop and the code will work:
import bs4  # needed because the code below calls bs4.BeautifulSoup
from bs4 import BeautifulSoup
import requests
suffixeUrl = '_puis_nblignes_est_200.html'
for i in range(15):
    url = 'https://www.topachat.com/pages/produits_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_page_est_' + str(i) + suffixeUrl
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    if response.ok:
        print('Page: ' + str(i))
        for data in soup.find_all('div', class_='price'):
            for a in data.find_all('a'):
                link = (a.get('href'))
                links = ('https://www.topachat.com/' + link)
                print(links) #for getting link
                with open("urls.txt", "a") as f:
                    f.write(links+"\n")
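To see the character-by-character behaviour in isolation, here is a small sketch that needs no network access. The URL is a shortened placeholder, not a real product page.

```python
url = "https://www.topachat.com/a.html"  # placeholder URL for illustration

# Iterating over a string yields one character at a time...
chars = [c for c in url]
print(chars[:5])  # ['h', 't', 't', 'p', 's']

# ...whereas iterating over a list of strings yields whole URLs.
urls = [url]
items = [u for u in urls]
print(items)  # ['https://www.topachat.com/a.html']
```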
You can do it like this:
import bs4 as bs4
from bs4 import BeautifulSoup
import requests
suffixeUrl = '_puis_nblignes_est_200.html'
url_list = set()
for i in range(15):
    url = 'https://www.topachat.com/pages/produits_cat_est_micro_puis_rubrique_est_w_boi_sa_puis_page_est_' + str(i) + suffixeUrl
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    if response.ok:
        print('Page: ' + str(i))
        for data in soup.find_all('div', class_='price'):
            for a in data.find_all('a'):
                link = (a.get('href'))
                links = ('https://www.topachat.com/' + link)
                print(links) #for getting link
                url_list.add(links)
with open("urls.txt", "a") as f:
    for link in url_list:
        f.write(link+"\n")
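One thing to keep in mind with a set is that it removes duplicates but does not keep the order in which the links were found. If the order matters, a possible alternative is dict.fromkeys, which keeps the first occurrence of each key in insertion order (Python 3.7+). The helper name unique_in_order and the example.com URLs below are made up for this sketch.

```python
def unique_in_order(links):
    """Drop duplicates while keeping the first occurrence of each link."""
    return list(dict.fromkeys(links))


if __name__ == "__main__":
    # Hypothetical scraped links with one duplicate.
    found = [
        "https://example.com/p1",
        "https://example.com/p2",
        "https://example.com/p1",
    ]
    unique = unique_in_order(found)
    print(unique)  # ['https://example.com/p1', 'https://example.com/p2']
    # "w" starts a fresh file on each run; "a" would keep appending to old results.
    with open("urls.txt", "w", encoding="utf-8") as f:
        for link in unique:
            f.write(link + "\n")
```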
I am able to get all the links on a particular web page but am having trouble with the pagination.
I am doing the following:
import requests, bs4, re
from bs4 import BeautifulSoup
from urllib.parse import urljoin
r = requests.get(start_url)
soup = BeautifulSoup(r.text,'html.parser')
a_tags = soup.find_all('a')
print(a_tags)
links = [urljoin(start_url, a['href'])for a in a_tags]
print(links)
As a toy example, I am using the following website:
start_url = 'https://www.opencodez.com/page/1'
I am able to get all the links this way. However, I am trying to automate it more by going to the next page and doing the same thing, and outputting all the links to a csv file.
I tried the following but get no outputs:
start_url = 'https://www.opencodez.com/'
with open('names.csv', mode='w') as csv_file:
    fieldnames = ['Name']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
article_link = []
def scraping(webpage, page_number):
    next_page = webpage + str(page_number)
    r = requests.get(str(next_page))
    soup = BeautifulSoup(r.text,'html.parser')
    a_tags = soup.find_all('a')
    print(a_tags)
    links = [urljoin(start_url, a['href'])for a in a_tags]
    print(links)
    for x in range(len(soup)):
        article_link.append(links)
    if page_number < 16:
        page_number = page_number + 1
        scraping(webpage, page_number)
scraping('https://www.opencodez.com/page/', 1)
#creating the data frame and populating its data into the csv file
data = { 'Name': article_link}
df = DataFrame(data, columns = ['Article_Link'])
df.to_csv(r'C:\Users\xxxxx\names.csv')
Could you please help me determine where I am going wrong?
I do not mind getting the links in either the output console or printed in a csv file
There were issues here and there with your code but this worked for me:
import requests, bs4, re
from bs4 import BeautifulSoup
from urllib.parse import urljoin
start_url = 'https://www.opencodez.com/'
r = requests.get(start_url) # first page scraping
soup = BeautifulSoup(r.text,'html.parser')
a_tags = soup.find_all('a')
article_link = []
links = [urljoin(start_url, a['href'])for a in a_tags]
article_link.append(links)
for page in range(2,19): # for every page after 1
    links = [] # resetting lists on every page just in case
    a_tags = []
    url = 'https://www.opencodez.com/page/'+str(page)
    r = requests.get(url) # fetch this page's URL, not start_url again
    soup = BeautifulSoup(r.text,'html.parser')
    a_tags = soup.find_all('a')
    links = [urljoin(start_url, a['href'])for a in a_tags]
    article_link.append(links)
print(article_link)
I basically just changed how you append to the list article_link. At the moment this variable is a list of length 18 (one entry per page), and each list within article_link held 136 links when I ran it.
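Since article_link ends up as a list of lists, it probably needs flattening before it goes into a CSV file with one link per row. The following is a minimal sketch using only the standard library; the helper names flatten and write_links_csv are made up here, and the Article_Link header is borrowed from the question's DataFrame call.

```python
import csv
from itertools import chain


def flatten(pages):
    """Turn a list of per-page link lists into one flat list."""
    return list(chain.from_iterable(pages))


def write_links_csv(links, path):
    """Write a header row followed by one link per row."""
    # newline="" is what the csv module expects when opening files for writing.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Article_Link"])
        writer.writerows([link] for link in links)


if __name__ == "__main__":
    # Hypothetical stand-in for the scraped article_link list of lists.
    pages = [["https://example.com/a", "https://example.com/b"],
             ["https://example.com/c"]]
    write_links_csv(flatten(pages), "names.csv")
```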
I am new to Python. I have successfully extracted the HTML links (a tags) and written them to a CSV file.
However, I am only getting an output of 2, 3 or 22 links, depending on what I try.
The website has 244 links and over half are duplicates. The correct number of links that are not duplicates is 117.
This is what I have so far:
import requests
from bs4 import BeautifulSoup, SoupStrainer
import bs4, csv
search_link = "https://www.census.gov/programs-surveys/popest.html"
r = requests.get(search_link)
raw_html = r.text
soup = BeautifulSoup(raw_html, 'html.parser')
all_links = soup.find_all("a")
r.content
rem_dup = set()
for link in all_links:
    hrefs = str(link.get("href"))
    if hrefs.startswith('#http'):
        rem_dup.add(hrefs[1:])
    elif hrefs.endswith('.gov'):
        rem_dup.add(hrefs + '/')
    elif hrefs.startswith('/'):
        rem_dup.add('https://www.census.gov' + hrefs)
    else:
        rem_dup.add(hrefs)
filename = "Page_Links.csv"
f = open(filename, "w+")
f.write("LINKS\n")
f.write('https://www.census.gov')
f.close()
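Note that the code above writes only the header and one hard-coded URL, so the contents of rem_dup never reach the file. A possible way to finish it is sketched below: resolve each href against the page URL with urljoin, drop duplicates, and write every link as its own row. The helper names absolute_links and write_csv are made up for this sketch, and whether the result comes to exactly 117 links depends on the page and on which normalisation rules you want.

```python
import csv
from urllib.parse import urljoin

BASE = "https://www.census.gov/programs-surveys/popest.html"


def absolute_links(hrefs, base=BASE):
    """Resolve hrefs against the page URL, dropping duplicates and missing values."""
    seen = {}  # dict keys keep insertion order and are unique
    for href in hrefs:
        if not href:
            continue  # link.get("href") returns None for tags without an href
        if href.startswith("#http"):
            href = href[1:]  # same rule as the question: strip the leading '#'
        seen[urljoin(base, href)] = None
    return list(seen)


def write_csv(links, filename):
    """Write the LINKS header followed by one link per row."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["LINKS"])
        for link in links:
            writer.writerow([link])
```

With the question's variables this might be called as write_csv(absolute_links(link.get("href") for link in all_links), "Page_Links.csv").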
I have a crawler that extracts links from a page only if the link text includes a given text, and I write the output to an HTML file. It works, but I would like to add the whole link text next to each link, like this: "Junior Java developer - https://www.jobs.cz/junior-developer/". How can I do this?
Thanks
import requests
from bs4 import BeautifulSoup
import re
def jobs_crawler(max_pages):
    page = 1
    file_name = 'links.html'
    while page < max_pages:
        url = 'https://www.jobs.cz/prace/praha/?field%5B%5D=200900011&field%5B%5D=200900012&field%5B%5D=200900013&page=' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        page += 1
        file = open(file_name,'w')
        for link in soup.find_all('a', {'class': 'search-list__main-info__title__link'}, text=re.compile('IT', re.IGNORECASE)):
            href = link.get('href') + '\n'
            file.write(''+ 'LINK TEXT HERE' + '' + '<br />')
            print(href)
        file.close()
        print('Saved to %s' % file_name)
jobs_crawler(5)
This should help.
file.write('''{1} - {0}<br />'''.format(link.get('href'), link.text ))
Try this:
href = link.get('href') + '\n'
txt = link.get_text() # gives you the link text
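As a rough sketch of how the two values could be combined into the "text - URL" line the question asks for, the helper below builds one output line from a link's text and href. The name link_line is made up for illustration; html.escape is used so that characters such as & in a job title do not break the HTML file.

```python
from html import escape


def link_line(text, href):
    """Build one HTML line of the form 'text - href' followed by a line break."""
    return "{} - {}<br />\n".format(escape(text.strip()), escape(href))


print(link_line("Junior Java developer", "https://www.jobs.cz/junior-developer/"))
# Junior Java developer - https://www.jobs.cz/junior-developer/<br />
```

Inside the crawler's loop this might be used as file.write(link_line(link.get_text(), link.get('href'))).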
I'm trying to clean up my code, but I can't merge the result lists: each link ends up in its own list. Any thoughts on how to fix it?
Here's the code example:
import requests
from bs4 import BeautifulSoup
def Download_Image_from_Web(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('img'):
        image_links = link.get('src')
        if '.jpg' in image_links:
            for i in image_links.split("\\n"):
                links_list = list(i.split())
                print(links_list)
Download_Image_from_Web("https://pixabay.com")
Result:
['https://cdn.pixabay.com/photo/2017/04/24/00/16/auto-2255161__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/24/13/28/photographer-2256456__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/08/10/23/surfer-2212948__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/10/08/08/church-window-2217785__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/01/05/54/cow-2193018__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/12/19/06/lighthouse-2225445__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/10/19/46/ice-2219574__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/11/22/49/hand-2223109__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/07/18/23/landscape-2211587__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/02/19/57/horse-2196755__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/06/17/43/water-2208931__340.jpg']
['https://cdn.pixabay.com/photo/2017/03/13/17/27/man-2140606__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/08/22/26/meditation-2214532__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/12/16/56/chocolate-2224998__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/14/16/46/red-fox-2230731__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/06/19/37/sculpture-2209152__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/08/09/59/mushrooms-2212899__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/09/16/46/lamb-2216160__340.jpg']
['https://cdn.pixabay.com/photo/2017/04/07/21/22/wave-2211925__340.jpg']
I want it to be like this:
['https://cdn.pixabay.com/photo/2017/04/24/00/16/auto-2255161__340.jpg','https://cdn.pixabay.com/photo/2017/04/24/13/28/photographer-2256456__340.jpg','https://cdn.pixabay.com/photo/2017/04/08/10/23/surfer-2212948__340.jpg','https://cdn.pixabay.com/photo/2017/04/10/08/08/church-window-2217785__340.jpg','https://cdn.pixabay.com/photo/2017/04/01/05/54/cow-2193018__340.jpg']
and so on and so forth
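You just need to create an empty list before the loop and extend it with the strings from i.split():
import requests
from bs4 import BeautifulSoup
def Download_Image_from_Web(url):
    source_code = requests.get(url)
    links_list = []  # one flat list shared by every image
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('img'):
        image_links = link.get('src')
        # src can be missing, so guard against None before testing it
        if image_links and '.jpg' in image_links:
            for i in image_links.split("\\n"):
                # extend adds the strings themselves; append would nest a list
                links_list.extend(i.split())
    print(links_list)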
hope this helps
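The difference between append and extend is what decides whether you get one flat list or a list of lists. A small sketch with made-up src values:

```python
nested, flat = [], []
for src in ["a.jpg b.jpg", "c.jpg"]:  # made-up src values for illustration
    nested.append(src.split())  # adds the whole list as a single element
    flat.extend(src.split())    # adds each string on its own

print(nested)  # [['a.jpg', 'b.jpg'], ['c.jpg']]
print(flat)    # ['a.jpg', 'b.jpg', 'c.jpg']
```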