I'm trying to scrape the text from articles on my website. I have a 'for' loop, but it runs very slowly. Are there any faster ways to do this? I've read about the Pandas built-in loop, vectorization, and NumPy vectorization, but failed to apply them to my code.
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_text(df):
    pd.options.mode.chained_assignment = None
    session = requests.Session()
    for j in range(0, len(df)):
        try:
            url = df['url'][j]  # takes the URL of an article from the 'url' column
            req = session.get(url)
            soup = BeautifulSoup(req.text, 'lxml')
        except Exception as e:
            print(e)
        tags = soup.find_all('p')
        if tags == []:
            tags = soup.find_all('p', itemprop='articleBody')
        # Putting together all text from the HTML p tags
        article = ''
        for p in tags:
            article = article + ' ' + p.get_text()
        article = " ".join(article.split())
        df['article_text'][j] = article  # put the collected text into the corresponding cell
    return df
You have two for loops; the innermost loop is usually the best place to start.
The plus operator is inefficient for string concatenation. str.join is a better choice, and it also accepts a generator as input:
article = " ".join(p.get_text() for p in tags)
article = " ".join(article.split())
I have a project where I'm scraping data on Trulia.com, and I want to get the max page number (the last number) for a specific location (photo below) so I can loop through the pages and get all the hrefs.
To get that last number, I have code that runs as planned and should return an integer, but it doesn't always return the same number. I added the print of the comprehension list to understand what's wrong. Here are the code and the output below. The return is commented out but should return the last number of the output list as an int.
city_link = "https://www.trulia.com/for_rent/San_Francisco,CA/"

def bsoup(url):
    resp = r.get(url, headers=req_headers)
    soup = bs(resp.content, 'html.parser')
    return soup

def max_page(link):
    soup = bsoup(link)
    page_num = soup.find_all(attrs={"data-testid": "pagination-page-link"})
    print([x.get_text() for x in page_num])
    # return int(page_num[-1].get_text())

for x in range(10):
    max_page(city_link)
I have no clue why it sometimes returns something wrong. The photo above corresponds to the link.
Okay, if I understand what you want, you are trying to see how many pages of links there are for a given rental location. If we can assume the given link is the only one required, this code:
import requests
import bs4

url = "https://www.trulia.com/for_rent/San_Francisco,CA/"
req = requests.get(url)
soup = bs4.BeautifulSoup(req.content, features='lxml')

def get_number_of_pages(soup):
    # the caption below the pagination, e.g. "1-40 of 3,115 Results"
    caption_tag = soup.find('div', class_="Text__TextBase-sc-1cait9d-0-div Text__TextContainerBase-sc-1cait9d-1 RBSGf")
    pagination = caption_tag.text
    words = pagination.split(" ")
    values = []
    for word in words:
        if not word.isalpha():
            values.append(word)
    links_per_page = values[0].split('-')[1]
    total_links = values[1].replace(',', '')
    # round(x + 0.5) always rounds up, i.e. a manual ceiling
    no_of_pages = round(int(total_links) / int(links_per_page) + 0.5)
    return no_of_pages

for i in range(10):
    print(get_number_of_pages(soup))
achieves what you're looking for, and is repeatable because it doesn't interact with JavaScript, only with the pagination caption at the bottom of the page.
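As an aside, the round(x + 0.5) trick is a hand-rolled ceiling; math.ceil from the standard library expresses the same intent directly (sample values below are mine, for illustration only):

import math

total_links, links_per_page = 3115, 40  # sample values for illustration
# equivalent to round(total_links / links_per_page + 0.5)
no_of_pages = math.ceil(total_links / links_per_page)
print(no_of_pages)  # 78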
I am working on a program that crawls Internet articles. The program is started by entering the start and end pages of the website.
This program works in the following order:
1. Web-crawl article information (title, sort, time, contents)
2. Remove special characters
3. Extract only nouns
The problem seems to lie in extracting nouns while cleaning the content of the article. The program works up to the stage just before noun extraction.
The error message is as follows
ValueError: Length of passed values is 4, index implies 5
To solve this problem, I tried appending the rows with DataFrame.append.
But it doesn't solve the problem.
I use the KoNLPy library (a Korean morpheme analyzer).
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
from konlpy.tag import Okt
from pandas import Series

i = input('Start page? : ')
k = input('End page? : ')
startpage = int(i)
lastpage = int(k)
count = int(i)

# Definition of the text cleaning function
def text_cleaning(text):
    hangul = re.compile('[^ㄱ-ㅣ가-힣]+')  # matches everything except Hangul
    result = hangul.sub(' ', text)
    return result

# Definition of the noun extraction function
def get_nouns(x):
    nouns_tagger = Okt()
    nouns = nouns_tagger.nouns(x)
    nouns = [noun for noun in nouns if len(noun) > 1]
    nouns = [noun for noun in nouns if noun not in stopwords]
    return nouns

# DataFrame formation
columns = ['Title', 'Sort', 'Datetime', 'Article']
news_info = pd.DataFrame(columns=columns)
idx = 0

# Website page loop
while startpage < lastpage + 1:
    url = f'http://www.koscaj.com/news/articleList.html?page={startpage}&total=72698&box_idxno=&sc_section_code=S1N2&view_type=sm'
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all(class_='list-titles')
    print(f'-----{count}page result-----')
    # Article loop within the page
    for link in links:
        news_url = "http://www.koscaj.com" + link.find('a')['href']
        news_link = urllib.request.urlopen(news_url).read()
        soup2 = BeautifulSoup(news_link, 'html.parser')
        # the article's title
        title = soup2.find('div', {'class': 'article-head-title'})
        if title:
            title = soup2.find('div', {'class': 'article-head-title'}).text
        else:
            title = ''
        # the article's sort
        sorts = soup2.find('nav', {'class': 'article-head-nav auto-marbtm-10'})
        try:
            sorts2 = sorts.find_all('a')
            sort = sorts2[2].text
        except:
            sort = ''
        # the article's time
        date = soup2.find('div', {'class': 'info-text'})
        try:
            datetime = date.find('i', {'class': 'fa fa-clock-o fa-fw'}).parent.text.strip()
            datetime = datetime.replace("승인", "")
        except:
            datetime = ''
        # the article's content
        article = soup2.find('div', {'id': 'article-view-content-div'})
        if article:
            article = soup2.find('div', {'id': 'article-view-content-div'}).text
            article = article.replace("\n", "")
            article = article.replace("\r", "")
            article = article.replace("\t", "")
            article = article.replace("[전문건설신문] koscaj#kosca.or.kr", "")
            article = article.replace("저작권자 © 대한전문건설신문 무단전재 및 재배포 금지", "")
            article = article.replace("전문건설신문", "")
            article = article.replace("다른기사 보기", "")
        else:
            article = ''
        # Remove special characters
        news_info['Title'] = news_info['Title'].apply(lambda x: text_cleaning(x))
        news_info['Sort'] = news_info['Sort'].apply(lambda x: text_cleaning(x))
        news_info['Article'] = news_info['Article'].apply(lambda x: text_cleaning(x))
Up to this point, the program works without any problems. But as the error message shows, the operation fails because the length of the passed values differs from the length of the index.
Text data cleaning for noun extraction:
        # DataFrame row for storing each article after crawling
        row = [title, sort, datetime, article]
        series = pd.Series(row, index=news_info.columns)
        news_info = news_info.append(series, ignore_index=True)
        # Load the Korean stopword dictionary file
        path = "C:/Users/이바울/Desktop/이바울/코딩파일/stopwords-ko.txt"
        with open(path, encoding='utf-8') as f:
            stopwords = f.readlines()
        stopwords = [x.strip() for x in stopwords]
        news_info['Nouns'] = news_info['Article'].apply(lambda x: get_nouns(x))
    startpage += 1
    count += 1

news_info.to_excel(f'processing{lastpage-int(1)}-{startpage-int(1)}.xlsx')
print('Complete')
After defining the existing 4 columns in the pandas DataFrame, the column with the extracted nouns was added as a 5th column. I know this method adds a column regardless of the index name. And if you look at the image link at the bottom: the first article is crawled and shows results, but from the next article onward it stops working and the error occurs.
(Image: program error result)
(Link: Korean stopwords dictionary)
I solved the problem.
It came down to where the code sits within the for loop.
I was able to fix it by repositioning the problematic lines while leaving the previously working code untouched.
Concretely, I dedented the line below by two levels (pressing backspace twice, which moves it out of the page and article loops):
news_info['Nouns'] = news_info['Article'].apply(lambda x: get_nouns(x))
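To see why that dedent matters, here is a minimal self-contained reproduction with dummy values (not the poster's data), using the pre-2.0 pandas .append style from the question: as soon as the 'Nouns' column exists, news_info.columns has five labels, so the next four-value Series no longer fits.

import pandas as pd

news_info = pd.DataFrame(columns=['Title', 'Sort', 'Datetime', 'Article'])

# the first row appends fine: 4 values against 4 columns
series = pd.Series(['t', 's', 'd', 'a'], index=news_info.columns)
news_info = news_info.append(series, ignore_index=True)

# inside the loop the 5th column is created...
news_info['Nouns'] = news_info['Article'].apply(lambda x: [x])

# ...so the next iteration raises:
# ValueError: Length of passed values is 4, index implies 5
pd.Series(['t', 's', 'd', 'a'], index=news_info.columns)

Moving the 'Nouns' assignment after both loops means every appended row is built against four columns, and the fifth column is added once at the end.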
My problem is related to this answer.
I have the following code:
import urllib.request
from bs4 import BeautifulSoup

time = 0
html = urllib.request.urlopen("https://www.kramerav.com/de/Product/VM-2N").read()
html2 = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()

try:
    div = str(BeautifulSoup(html).select("div.large-image")[0])
    if str(BeautifulSoup(html).select("div.large-image")[1]) != "":
        div += str(BeautifulSoup(html).select("div.large-image")[1])
        time = time + 1
except IndexError:
    div = ""
    time = time + 1
finally:
    print(str(time) + div)
The site behind the variable html has 2 divs with the class "large-image". The site behind html2 has only 1.
With html the program works as intended. But if I switch to html2, the variable div ends up completely empty.
I would like to save the one div rather than saving nothing. How could I achieve this?
the variable div is going to be completely empty.
That's because your error handler assigned it the empty string.
Please don't use subscripts, conditionals, and handlers in that way. It would be more natural to iterate over the results of select() with for, building up a result list (or string).
Also, you should create soup = BeautifulSoup(html) just once, as that can be a fairly expensive operation, since it carefully parses a potentially long web page. With that, you could build up a list of HTML fragments with:
images = [str(image) for image in soup.select('div.large-image')]  # str() so the fragments can be joined later
Or, if for some reason you're not fond of list comprehensions, you could equivalently write:
images = []
for image in soup.select('div.large-image'):
    images.append(str(image))
and then get the required html with div = '\n'.join(images).
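Putting those pieces together, a minimal self-contained version might look like this (the explicit 'html.parser' argument is my addition; without it BeautifulSoup guesses a parser and emits a warning):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()
soup = BeautifulSoup(html, 'html.parser')  # parse the page once

# collects however many large-image divs exist: 0, 1, or 2
images = [str(image) for image in soup.select('div.large-image')]
div = '\n'.join(images)
time = len(images)
print(time, div)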
You can concatenate all the items inside a for loop:
all_divs = soup.select("div.large-image")

for item in all_divs:
    div += str(item)
    time += 1
or using join()
time = len(all_divs)
div = ''.join(str(item) for item in all_divs)
You can also write to a file directly inside the for loop, getting one row per item:
for item in all_divs:
    csv_writer.writerow([str(item).strip()])
    time += 1
Working example
import urllib.request
from bs4 import BeautifulSoup
import csv

div = ""
time = 0

f = open('output.csv', 'w')
csv_writer = csv.writer(f)

all_urls = [
    "https://www.kramerav.com/de/Product/VM-2N",
    "https://www.kramerav.com/de/Product/SDIA-IN2-F16",
]

for url in all_urls:
    print('url:', url)
    html = urllib.request.urlopen(url).read()
    try:
        soup = BeautifulSoup(html, 'html.parser')
        all_divs = soup.select("div.large-image")

        for item in all_divs:
            div += str(item)
            time += 1

        # or
        time = len(all_divs)
        div = ''.join(str(item) for item in all_divs)

        # or
        for item in all_divs:
            #div += str(item)
            #time += 1
            csv_writer.writerow([time, str(item).strip()])
    except IndexError as ex:
        print('Error:', ex)
        time += 1
    finally:
        print(time, div)

f.close()
I want to automatically load a code from a website.
I have a list with some names and want to go through every item: take the item, make a request, open the website, copy the code/number from the HTML (text in a span), save this result in a dictionary, and so on for all items.
I read all lines from a CSV and save them into a list.
After this I make a request to load the HTML from the website, search for the company, and read the numbers from the spans.
My code:
with open(test_f, 'r') as file:
    rows = csv.reader(file,
                      delimiter=',',
                      quotechar='"')
    data = [data for data in rows]
    print(data)

url_part1 = "http://www.monetas.ch/htm/651/de/Firmen-Suchresultate.htm?Firmensuche="
url_enter_company = [data for data in rows]
url_last_part = "&CompanySearchSubmit=1"

firma_noga = []

for data in firma_noga:
    search_noga = url_part1 + url_enter_company + url_last_part
    r = requests.get(search_noga)
    soup = BeautifulSoup(r.content, 'html.parser')
    lii = soup.find_all("span")
    # print all numbers that are in a span
    numbers = [d.text for d in lii]
    print("NOGA Codes: ")
I want to get the result in a dictionary, where the key is the company name (an item in the list) and the value is the number read from the span:
dict = {"firma1": "620100", "firma2": "262000, 465101"}
Can someone help me? I am new to web scraping and Python, and don't know what I am doing wrong.
Split your string with a regex and handle each part depending on whether it is a number or not:
import re

for partial in re.split('([0-9]+)', myString):
    try:
        print(int(partial))
    except:
        print(partial + ' is not a number')
EDIT:
Well, myString is somewhat expected to be a string.
To get the text content of your spans as a string, you should be able to use .text, something like this:
spans = soup.find_all('span')
for span in spans:
    myString = span.text
    for partial in re.split('([0-9]+)', myString):
        try:
            print(int(partial))
        except:
            print(partial + ' is not a number')
Abstracting from my remarks in the comments, I think something like this should work for you:
firma_noga = ['firma1', 'firma2', 'firma3']  # NOT EMPTY as in your code!

res_dict = {}

for data in firma_noga:
    search_noga = url_part1 + data + url_last_part  # build the URL from the loop item, not the list
    r = requests.get(search_noga)
    soup = BeautifulSoup(r.content, 'html.parser')
    lii = soup.find_all("span")
    for l in lii:
        if data not in res_dict:
            res_dict[data] = [l]
        else:
            res_dict[data].append(l)
Obviously this will only work if firma_noga is not empty, unlike in your code; all the rest of your parsing logic should be valid as well.
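Combining the two answers, here is a sketch of how you might arrive at the exact dictionary shape you described. The digits-only filter and the joining of multiple codes are my assumptions about the page layout, not verified against monetas.ch:

import re
import requests
from bs4 import BeautifulSoup

url_part1 = "http://www.monetas.ch/htm/651/de/Firmen-Suchresultate.htm?Firmensuche="
url_last_part = "&CompanySearchSubmit=1"

firma_noga = ['firma1', 'firma2']  # the company names read from your CSV

res_dict = {}
for name in firma_noga:
    r = requests.get(url_part1 + name + url_last_part)
    soup = BeautifulSoup(r.content, 'html.parser')
    # keep only span texts that are purely digits (assumed NOGA code format)
    codes = [s.text for s in soup.find_all('span') if re.fullmatch(r'[0-9]+', s.text)]
    res_dict[name] = ", ".join(codes)

print(res_dict)  # hoped-for shape: {"firma1": "620100", "firma2": "262000, 465101"}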
I'm trying to scrape text from a series of hyperlinks on a main page and then store the results as a list of string objects. The code I've written works when I perform it on an individual link, but it breaks down when I try to loop through all the links.
FYI, my base url looks like this:
base_url = "http://www.achpr.org"
And my hyperlinks look like this:
hyperlinks = ['/sessions/58th',
              '/sessions/58th/resolutions/337/',
              '/sessions/58th/resolutions/338/',
              '/sessions/58th/resolutions/339/', ...]
So this works fine:
r = requests.get('http://www.achpr.org' + "/sessions/19th-eo/resolutions/328/")
soup = BeautifulSoup(r.text, "lxml")
soup.find('b').span.string
text = soup.findAll('span')

y = []
for i in text:
    x = i.strings  # returns the strings within the tags
    y.extend(x)

y = "".join(y)
y = y.replace("\n", " ")
y = y.replace("\xa0*", " ")
print(y)
But when I try to turn this into a loop:
output = []
for item in hyperlinks:
    r = requests.get('http://www.achpr.org' + link)
    soup = BeautifulSoup(r.text, "lxml")
    soup.find('b').span.string
    text = soup.findAll('span')
    y = []
    for i in text:
        x = i.strings  # returns the strings within the tags (so no tags)
        y.extend(x)
    y = "".join(y)
    y = y.replace("\n", " ")
    y = y.replace("\xa0*", " ")
    output.extend(y)
I get the following error:
(Image: error traceback)
It feels like I'm making a really simple looping error (putting indents in the wrong place), but I've been staring at this too long and I'd like a fresh pair of eyes. Can anyone spot what I'm doing wrong?
It's not an indentation error, I suppose.
for item in hyperlinks:
    r = requests.get('http://www.achpr.org' + item)  # note: 'item', not the undefined 'link'
    soup = BeautifulSoup(r.text, "lxml")
    if soup.find('b').span is None:
        continue
    soup.find('b').span.string
    text = soup.findAll('span')
Add an if test before soup.find('b').span.string.
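For completeness, a sketch of the whole corrected loop under two assumptions: the original error comes from using link instead of item, and some pages lack the expected b/span markup. Note output.append rather than output.extend, since extend on a string adds it one character at a time:

import requests
from bs4 import BeautifulSoup

base_url = "http://www.achpr.org"
hyperlinks = ['/sessions/58th/resolutions/337/',
              '/sessions/58th/resolutions/338/']  # shortened example list

output = []
for item in hyperlinks:
    r = requests.get(base_url + item)
    soup = BeautifulSoup(r.text, "lxml")
    b = soup.find('b')
    if b is None or b.span is None:  # skip pages without the expected markup
        continue
    y = "".join("".join(span.strings) for span in soup.findAll('span'))
    y = y.replace("\n", " ").replace("\xa0*", " ")
    output.append(y)  # one string per page

print(len(output))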