I am trying to fetch data from 7,000 URLs and save the scraped info into a CSV. Rather than going through all 7,000 URLs in one go, how can I break the output into, say, 1,000 URLs per CSV?
Below is an example of my current code. For this example I have changed the total from 7,000 URLs to 10, with 2 URLs per CSV.
import os
import time
from random import randint

import pandas as pd
import requests
from bs4 import BeautifulSoup

# path is defined elsewhere (output folder for the CSV files)

urls = ['www.1.com', 'www.2.com', 'www.3.com', 'www.4.com', 'www.5.com', 'www.6.com', 'www.7.com', 'www.8.com', 'www.9.com', 'www.10.com']

ranks = []
names = []
prices = []
count = 0
rows_count = 0
total_index = 10
i = 1

while i < total_index:
    for url in urls[rows_count:rows_count + 2]:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        count += 1
        print('Loop', count, f'started for {url}')
        rank = []
        name = []
        price = []
        # loop for watchlist
        for item in soup.find('div', class_='sc-16r8icm-0 bILTHz'):
            item = item.text
            rank.append(item)
        ranks.append(rank)
        # loop for ticker name
        for ticker in soup.find('h2', class_='sc-1q9q90x-0 jCInrl h1'):
            ticker = ticker.text
            name.append(ticker)
        names.append(name)
        # loop for price
        for price_tag in soup.find('div', class_='sc-16r8icm-0 kjciSH priceTitle'):
            price_tag = price_tag.text
            price.append(price_tag)
        prices.append(price)
        sleep_interval = randint(1, 2)
        print('Sleep interval ', sleep_interval)
        time.sleep(sleep_interval)
    rows_count += 2
    df = pd.DataFrame(ranks)
    df2 = pd.DataFrame(names)
    df3 = pd.DataFrame(prices)
    final_table = pd.concat([df, df2, df3], axis=1)
    final_table.columns = ['rank', 'type', 'watchlist', 'name', 'symbol', 'price', 'changes']
    final_table.to_csv(os.path.join(path, fr'summary_{rows_count}.csv'))
    i += 2
I'm seeking some senior assistance with my problem.
Or is there another way to do it?
As I understand it you are getting one row of data from scraping each URL. A generic solution for scraping in chunks and writing to CSVs would look something like this:
def scrape_in_chunks(urls, scrape, chunk_size, filename_template):
    """ Apply a scraping function to a list of URLs and save as a series of CSVs with data from
    one URL on each row and chunk_size urls in each CSV file.
    """
    for i in range(0, len(urls), chunk_size):
        df = pd.DataFrame([scrape(url) for url in urls[i:i+chunk_size]])
        df.to_csv(filename_template.format(start=i, end=i+chunk_size-1))

def my_scraper(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    print(f'Scrape started for {url}')
    keys = ['rank', 'type', 'watchlist', 'name', 'symbol', 'price', 'changes']
    data = ([item.text for item in soup.find('div', class_='sc-16r8icm-0 bILTHz')] +
            [item.text for item in soup.find('h2', class_='sc-1q9q90x-0 jCInrl h1')] +
            [item.text for item in soup.find('div', class_='sc-16r8icm-0 kjciSH priceTitle')])
    return dict(zip(keys, data))  # You could alternatively return a dataframe or series here but dict seems simpler

scrape_in_chunks(urls, my_scraper, 1000, os.path.join(path, "summary {start}-{end}.csv"))
I'm doing this project to scrape how many links a series of webpages have.
My idea is to add the link count for each page to a column of a Pandas DataFrame. The idea is to have something like this:
   title  count links
0  page1            2
1  page2            3
2  page3            0
I did this code:
links_bs4 = ['page1', 'page2']

article_title = []
links = []

for item in links_bs4:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('title')
    article_title.append(title.string)
    body_text = soup.find('div', class_='article-body')
    for link in body_text.find_all('a'):
        links.append(link.get('href'))
        count_of_links = len(links)

s1 = pd.Series(article_title, name='title')
s2 = pd.Series(count_of_links, name='count links')
df = pd.concat([s1, s2], axis=1)
It partly works. The count_of_links = len(links) generates a count of all links of all pages combined.
I want the count for each page, not the total as is happening now. How can I do this? My for loop is adding up the count for the whole list. Should I create a new list for each URL I scrape, or use something else in Python?
I'm clearly missing some part of the logic.
You can treat count_of_links the same way as article_title. Below is based on your code, but with my changes.
links_bs4 = ['page1', 'page2']

article_title = []
count_of_links = []  # <------ added
links = []

for item in links_bs4:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('title')
    article_title.append(title.string)
    body_text = soup.find('div', class_='article-body')
    count = 0  # <------- added
    for link in body_text.find_all('a'):
        links.append(link.get('href'))
        # count_of_links = len(links)  # <------- commented out
        count += 1  # <------- added
    count_of_links.append(count)  # <------- added

s1 = pd.Series(article_title, name='title')
s2 = pd.Series(count_of_links, name='count links')
df = pd.concat([s1, s2], axis=1)
Or you may code it this way. Then you won't need to create one variable for each new column; instead you only need to expand the dictionary.
links_bs4 = ['page1', 'page2']

data = []
links = []

for item in links_bs4:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, 'html.parser')
    title = soup.find('title')
    body_text = soup.find('div', class_='article-body')
    link_temp = [link.get('href') for link in body_text.find_all('a')]
    data.append({'title': title.string, 'count links': len(link_temp)})
    links.extend(link_temp)

df = pd.DataFrame(data)
I am working on scraping the two tables from the webpage: https://www.transfermarkt.com/premier-league/legionaereeinsaetze/wettbewerb/GB1/plus/?option=spiele&saison_id=2017&altersklasse=alle
I am trying to get many countries and years of data and have set up lists including country URLs.
Here is my code:
for l in range(0, len(league_urls)):
    time.sleep(0.5)
    # The second loop is for each year we want to scrape
    for n in range(2007, 2020):
        time.sleep(0.5)
        df_soccer1 = None

        url = league_urls[l] + str(n) + str('&altersklasse=alle')
        headers = {"User-Agent": "Mozilla/5.0"}
        response = requests.get(url, headers=headers, verify=False)
        time.sleep(0.5)
        soup = BeautifulSoup(response.text, 'html.parser')

        # Table 1 with information about the value
        table = soup.find("table", {"class": "items"})

        team = []
        players_used = []
        minutes_nonforeign = []
        minutes_foreign = []

        for row in table.find_all('tr')[1:]:
            try:
                col = row.find_all('td')
                team_ = col[1].text
                players_used_ = col[2].text
                minutes_nonforeign_ = col[3].text
                minutes_foreign_ = col[4].text

                team.append(team_)
                players_used.append(players_used_)
                minutes_nonforeign.append(minutes_nonforeign_)
                minutes_foreign.append(minutes_foreign_)
            except:
                team.append('')
                players_used.append('')
                minutes_nonforeign.append('')
                minutes_foreign.append('')

        team = [elem.replace('\n', '').replace('\xa0', '').strip() for elem in team]

        # Table 2 with information about placement, goals and points
        df_soccer2 = None
        table2 = soup.find("div", {"class": "box tab-print"})

        team2 = []
        place = []
        matches = []
        difference = []
        pts = []

        for row in table2.find_all('tr'):
            try:
                col = row.findAll('td')
                team2_ = col[2].text
                place_ = col[0].text
                matches_ = col[3].text
                difference_ = col[4].text
                pts_ = col[5].text

                team2.append(team2_)
                place.append(place_)
                matches.append(matches_)
                difference.append(difference_)
                pts.append(pts_)
            except:
                team2.append('')
                place.append('')
                matches.append('')
                difference.append('')
                pts.append('')

        team2 = [elem.replace('\n', '').replace('\xa0', '').strip() for elem in team2]

        df_soccer1 = pd.DataFrame({'Team': team[1:], 'Season': [n]*(len(team)-1), 'Players used': players_used[1:],
                                   'Minutes nonforeign': minutes_nonforeign[1:], 'Minutes foreign': minutes_foreign[1:]})
        df_soccer2 = pd.DataFrame({'Team': team2, 'Place': place, 'Matches': matches, 'Difference': difference,
                                   'Points': pts})
I am getting this issue when scraping the first table:
AttributeError Traceback (most recent call last)
<ipython-input-46-b4cd681f68e8> in <module>
21 minutes_foreign = []
22
---> 23 for row in table.find_all("tr")[1:]:
24 try:
25 col = row.find_all('td')
AttributeError: 'NoneType' object has no attribute 'find_all'
To note, league_urls is a long list of URLs.
I have used a similar code on another portion of the site and it works great. I just can't seem to figure out why it is not working on this one.
In addition, when I run the code using just a single URL, it works great. Is it possible there is some problem since I am looping across 12 years for 55 different URLs?
Test if table is None, e.g.:
import requests
from bs4 import BeautifulSoup

url = 'https://www.transfermarkt.com/remier-liga/legionaereeinsaetze/wettbewerb/RU1/plus/?option=spiele&saison_id=2011&altersklasse=alle'
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers, verify=False)
#time.sleep(0.5)
soup = BeautifulSoup(response.text, 'html.parser')

# Table 1 with information about the value
table = soup.find("table", {"class": "items"})

team = []
players_used = []
minutes_nonforeign = []
minutes_foreign = []

if table is not None:
    for row in table.find_all('tr')[1:]:
        try:
            col = row.find_all('td')
            team_ = col[1].text
            players_used_ = col[2].text
            minutes_nonforeign_ = col[3].text
            minutes_foreign_ = col[4].text

            team.append(team_)
            players_used.append(players_used_)
            minutes_nonforeign.append(minutes_nonforeign_)
            minutes_foreign.append(minutes_foreign_)
        except:
            team.append('')
            players_used.append('')
            minutes_nonforeign.append('')
            minutes_foreign.append('')
else:
    team.append('')
    players_used.append('')
    minutes_nonforeign.append('')
    minutes_foreign.append('')
import csv
import multiprocessing
import os
import time

import requests
from bs4 import BeautifulSoup

url = []  # shared list of page URLs, filled by get_urls()


class Crawler():
    def __init__(self):
        self.pag = 1
        i = 0

    def get_urls(self, main_url):
        self.url = 'https://www.test.ro/search/' + main_url + '/p1'
        self.filename = main_url
        r = requests.get(self.url)
        soup = BeautifulSoup(r.text, 'html.parser')
        number_pages = soup.find(class_='row')
        last_page = number_pages.find_all('a')[len(number_pages.find_all('a'))-2].get("data-page")
        for i in range(1, int(last_page)+1):
            url.append('https://www.test.ro/search/' + main_url + '/p' + str(i))

    def print_urls(self):
        for urls in url:
            print(urls)

    def scrape(self, url):
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        product_list = soup.find(class_='page-container')
        product_list_name = product_list.find_all('h2')
        product_list_oldprice = product_list.find_all(class_='product-old-price')
        product_list_newprice = product_list.find_all(class_='product-new-price')
        for i in range(0, len(product_list_name)):
            name = product_list_name[i].get_text().strip()
            link = product_list_name[i].find('a').get('href')
            #print(name)
            #print(len(name))
            try:
                price = product_list_oldprice[i].contents[0].get_text()
                price = price[:-6]
                #print(price)
            except IndexError:
                price = ''  # no old price
                #print(product_list_newprice[i].contents[0])
            with open(self.filename+'.csv', 'a', encoding='utf-8', newline='') as csv_file:
                file_is_empty = os.stat(self.filename+'.csv').st_size == 0
                fieldname = ['name', 'link', 'price_old', 'price_actualy']
                writer = csv.DictWriter(csv_file, fieldnames=fieldname)
                if file_is_empty:
                    writer.writeheader()
                writer.writerow({'name': name, 'link': link, 'price_old': price, 'price_actualy': product_list_newprice[i].contents[0]})


if __name__ == '__main__':
    print("Search for product: ")
    urlsearch = input()
    starttime = time.time()
    scraper = Crawler()
    scraper.get_urls(urlsearch)
    scraper.print_urls()
    #scraper.scrape(url[0])
    pool = multiprocessing.Pool()
    pool.map(scraper.scrape, url)
    pool.close()
    print('That took {} seconds'.format(time.time() - starttime))
So I have this scraper; it works perfectly on the website, but only on the product listing pages.
I built it for a specific website, but how could I go into each product page, take the data from the product, come back, and do it all over again?
Is such a thing possible?
Right now I take the data from the products page, i.e. name, link and price.
There are divs on the product pages too. Could the href links help me?
In this case you need to create a category scraper that saves all product URLs first. Scrape all the category pages, go through all the categories and, for example, save the product URLs to a CSV first. Then you can take all the product URLs from that CSV and loop through them.
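A minimal sketch of that two-phase approach is below. It reuses the page-container / h2 selectors from the crawler above; the function names, the product_urls.csv filename, and the assumption that each listing h2 wraps the product link are placeholders for illustration, so adapt them to the real site's markup.

import csv
import requests
from bs4 import BeautifulSoup

def collect_product_urls(category_url, out_csv='product_urls.csv'):
    # Phase 1: scrape one category/listing page and append every product link to a CSV
    r = requests.get(category_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    product_list = soup.find(class_='page-container')  # same container the scraper above uses
    links = [h2.find('a').get('href') for h2 in product_list.find_all('h2') if h2.find('a')]
    with open(out_csv, 'a', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for link in links:
            writer.writerow([link])

def scrape_saved_products(in_csv='product_urls.csv'):
    # Phase 2: read the saved product URLs back and visit each product page in turn
    with open(in_csv, encoding='utf-8') as f:
        product_urls = [row[0] for row in csv.reader(f) if row]
    for product_url in product_urls:
        r = requests.get(product_url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # ... pull name, price, etc. from the individual product page here ...

You would call collect_product_urls once per category page (e.g. once per page of the pagination loop), and only afterwards run scrape_saved_products over the accumulated CSV.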
I have coded a web scraper for Auto Trader, but for some reason, when iterating through URLs, I can only ever get a maximum length of 1300 for my dataframe. There are 13 results per page, so is there some significance to a limit of 100 pages, or am I just doing something wrong? Any help would be greatly appreciated :)
I've attached my code below
# Import required libraries
from requests import get
from bs4 import BeautifulSoup
import pandas as pd

# List of urls
path = 'https://www.autotrader.co.uk/car-search?advertClassification=standard&postcode=RH104JJ&make=&price-from=500&price-to=100000&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&advertising-location=at_cars&is-quick-search=TRUE&page='

urls = []
for i in range(1, 500):
    url = path + str(i)
    urls.append(url)

# Lists to store the scraped data in
makes = []
prices = []
ratings = []
dates = []
types = []
miles = []
litres = []
bhps = []
transmissions = []
fuels = []
owners = []

attributes = [makes, ratings, dates, types, miles, litres, bhps, transmissions, fuels, owners]

# Iterate through urls
sum = 0
for url in urls:
    sum += 1
    if sum % 10 == 0:
        print(sum)

    # Attempt to connect to the url
    try:
        response = get(url)
    except:
        print('oops')

    html_soup = BeautifulSoup(response.text, 'html.parser')

    # Get a list of individual cars and iterate through it
    car_containers = html_soup.find_all('li', class_='search-page__result')

    for container in car_containers:
        try:
            container.find("div", {"class": "js-tooltip"}).find("div", {"class": "pi-indicator js-tooltip-trigger"}).text
            rating = container.find("div", {"class": "js-tooltip"}).find("div", {"class": "pi-indicator js-tooltip-trigger"}).text.strip()
        except:
            rating = ''
        ratings.append(rating)

        make = container.h2.text.strip().title().split(' ')[0]
        makes.append(make)

        price = container.find("div", {"class": "vehicle-price"}).text[1:]
        prices.append(price)

        specs = container.find("ul", {"class": "listing-key-specs"}).find_all("li", recursive=True)
        for spec in specs:
            if spec.text.split(' ')[0].isdigit() and len(spec.text.split(' ')[0]) == 4:
                date = spec.text.split(' ')[0]
                dates.append(date)

            if 'mile' in str(spec):
                mile = spec.text.split(' ')[0]
                miles.append(mile)

            if 'l' in str(spec).lower() and str(spec.text)[:-1].replace('.', '').isnumeric() and not spec.text.split(' ')[0].isdigit():
                litre = spec.text[:-1]
                litres.append(litre)

            if any(x in str(spec).lower() for x in ['automatic', 'manual']):
                transmission = spec.text
                transmissions.append(transmission)

            if any(x in str(spec).lower() for x in ['bhp', 'ps']):
                bhp = spec.text
                bhps.append(bhp)

            if any(x in str(spec).lower() for x in ['petrol', 'diesel']):
                fuel = spec.text
                fuels.append(fuel)

            if 'owner' in str(spec):
                owner = spec.text
                owners.append(owner.split(' ')[0])

            typelist = ['hatchback', 'saloon', 'convertible', 'coupe', 'suv', 'mpv', 'estate', 'limousine',
                        'pickup']
            if any(x in str(spec).lower() for x in typelist):
                typ = spec.text
                types.append(typ)

        # Filling in empty spaces
        for attribute in attributes:
            if len(attribute) < len(prices):
                attribute.append('')

# Creating a dataframe from the lists
df = ({'makes': makes,
       'Price': prices,
       'Rating': ratings,
       'Year': dates,
       'Type': types,
       'Miles': miles,
       'Litres': litres,
       'BHP': bhps,
       'Transmission': transmissions,
       'Fuel': fuels,
       'Owners': owners
       })

df = pd.DataFrame(df)
Maybe just use a URL shortener if the length of the URL is too long.
I am having trouble dealing with multiple tags/attributes in one loop and appending them to the DataFrame. More specifically, it concerns the Place loop:
for car_item in soup2.findAll('ul', {'class': 'seller-info-links'}):
    place = car_item.find('h3', {'class': 'heading'}).text.strip()
    places.append(place)
Appending it to the DataFrame yields only 1 result out of the expected 30.
Thank you in advance.
import requests
import bs4
import pandas as pd

frames = []

for pagenumber in range(0, 2):
    url = 'https://www.marktplaats.nl/l/auto-s/p/'
    txt = requests.get(url + str(pagenumber))
    soup = bs4.BeautifulSoup(txt.text, 'html.parser')

    soup_table = soup.find('ul', 'mp-Listings mp-Listings--list-view')

    for car in soup_table.findAll('li'):
        link = car.find('a')
        sub_url = 'https://www.marktplaats.nl/' + link.get('href')
        sub_soup = requests.get(sub_url)
        sub_soup_txt = bs4.BeautifulSoup(sub_soup.text, 'html.parser')

        soup1 = sub_soup_txt.find('div', {'id': 'car-attributes'})
        soup2 = sub_soup_txt.find('div', {'id': 'vip-seller'})

        tmp = []
        places = []

        for car_item in soup1.findAll('div', {'class': 'spec-table-item'}):
            key = car_item.find('span', {'class': 'key'}).text
            value = car_item.find('span', {'class': 'value'}).text
            tmp.append([key, value])

        for car_item in soup2.findAll('ul', {'class': 'seller-info-links'}):
            place = car_item.find('h3', {'class': 'heading'}).text.strip()
            places.append(place)

        frames.append(pd.DataFrame(tmp).set_index(0))

df_final = pd.concat((tmp_df for tmp_df in frames), axis=1, join='outer').reset_index()
df_final = df_final.T
df_final.columns = df_final.loc["index"].values
df_final.drop("index", inplace=True)
df_final.reset_index(inplace=True, drop=True)
df_final['Places'] = pd.Series(places)
df_final.to_csv('auto_database.csv')
As you are adding places to the final df only at the end, this line (currently sitting inside the for pagenumber in ... for car in ... loops):
places = []
should move all the way up, out of the main for loop, like this:
frames = []
places = []
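To make the difference concrete, here is a tiny self-contained sketch of the pattern with dummy data in place of the requests calls (the pages list and the 'seller of ...' strings are just placeholders, not the real scraped values):

import pandas as pd

pages = [['car1', 'car2'], ['car3']]  # dummy stand-in for the scraped listing pages

frames = []
places = []  # created once, outside the page loop

for page in pages:
    for car in page:
        # if places were re-created here, results from earlier cars/pages would be lost
        places.append(f'seller of {car}')
        frames.append(pd.DataFrame({'attr': [car]}))

df_final = pd.concat(frames, ignore_index=True)
df_final['Places'] = pd.Series(places)  # one place per car now, not just the last ones
print(df_final)

With places initialised once, the list keeps accumulating across all cars and pages, so the Places column is filled for every row instead of only the last car's.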