I'm having a bit of trouble automatically scraping data in a table from a Wikipedia article. First I was getting an encoding error. I specified UTF-8 and the error went away, but the scraped data doesn't display a lot of the characters correctly. You will be able to tell from the code that I am a complete newbie:
from bs4 import BeautifulSoup
import urllib2
wiki = "http://en.wikipedia.org/wiki/Anderson_Silva"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
Result = ""
Record = ""
Opponent = ""
Method = ""
Event = ""
Date = ""
Round = ""
Time = ""
Location = ""
Notes = ""
table = soup.find("table", { "class" : "wikitable sortable" })
f = open('output.csv', 'w')
for row in table.findAll("tr"):
    cells = row.findAll("td")
    #For each "tr", assign each "td" to a variable.
    if len(cells) == 10:
        Result = cells[0].find(text=True)
        Record = cells[1].find(text=True)
        Opponent = cells[2].find(text=True)
        Method = cells[3].find(text=True)
        Event = cells[4].find(text=True)
        Date = cells[5].find(text=True)
        Round = cells[6].find(text=True)
        Time = cells[7].find(text=True)
        Location = cells[8].find(text=True)
        Notes = cells[9].find(text=True)
        write_to_file = Result + "," + Record + "," + Opponent + "," + Method + "," + Event + "," + Date + "," + Round + "," + Time + "," + Location + "\n"
        write_to_unicode = write_to_file.encode('utf-8')
        print write_to_unicode
        f.write(write_to_unicode)
f.close()
As pswaminathan pointed out, using the csv module will help greatly. Here is how I do it:
import csv  # needed for csv.writer below
table = soup.find('table', {'class': 'wikitable sortable'})
with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in table.findAll('tr'):
        # one UTF-8 encoded string per td cell
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 10:
            csvwriter.writerow(cells)
Discussion
Using the csv module, I created a csvwriter object connected to my output file.
By using the with statement, I don't need to worry about closing the output file when I'm done: it is closed automatically at the end of the with block.
In my code, cells is a list of UTF8-encoded text extracted from the td tags within a tr tag.
I used the construct c.text, which is more concise than c.find(text=True).
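For readers on Python 3, the same idea can be sketched without any manual encoding. This is only a sketch under assumptions, not the answerer's code: the helper name `write_rows`, the temporary output path, and the sample rows are made up for illustration, and the cell text is assumed to have been extracted already (for example with c.text). Opening the file with encoding='utf-8' takes care of the non-ASCII characters, and csv.writer quotes any field that contains a comma.

```python
import csv
import os
import tempfile

def write_rows(path, rows, width):
    """Write only the rows that have exactly `width` cells (hypothetical helper)."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for cells in rows:
            if len(cells) == width:
                writer.writerow(cells)

# Made-up sample data: a non-ASCII name, an embedded comma, and a short row.
sample_rows = [
    ['Win', 'José Example', 'Decision (unanimous)'],
    ['Loss', 'A. Fighter', 'TKO (punches, knees)'],
    ['header-only row'],
]

path = os.path.join(tempfile.mkdtemp(), 'out2.csv')
write_rows(path, sample_rows, width=3)

# Read the file back to inspect exactly what was written.
with open(path, encoding='utf-8', newline='') as f:
    content = f.read()
print(content)
```

The short row is skipped, just like rows that fail the len(cells) == 10 check in the answer, and the field with the comma is wrapped in quotes instead of being split into two columns.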
Related
I am trying to save results to a .csv file, but I receive the following message and I have no idea how to fix it:
f.write(linha_csv)
ValueError: I/O operation on closed file.
Code below:
import requests
from bs4 import BeautifulSoup
import csv
from csv import reader, writer
url_base = "https://lista.mercadolivre.com.br/"
soup = BeautifulSoup(requests.get(url_base + produto_nome).content, "html.parser")
produtos = soup.findAll('div', attrs={'class': 'andes-card andes-card--flat andes-card--default ui-search-result ui-search-result--core andes-card--padding-default'})
with open(r'Lista_Precos_MercadoLivre.csv','a',encoding='utf8',newline='') as f:
fieldnames = ['Produto','Link do Produto','Preco']
dw = csv.DictWriter(f,delimiter=';',fieldnames=fieldnames)
dw.writeheader()
i = 1
while True:
for tag in soup:
titulo = soup.find('h2', attrs={'class': 'ui-search-item__title'})
print(i, tag.text)
print(i,'Título do Produto:', titulo.text)
print(i,'Link do Produto:', link['href'])
next_link = soup.select_one("a.andes-pagination__link:-soup-contains(Seguinte)")
if not next_link: break
linha_csv = titulo.text + ';' + link['href'] + ';' + "R$" + real.text + "," + centavos.text + '\n'
f.write(linha_csv)
The indentation in your question is not set correctly, and that is probably the cause: the write runs after the with block has already closed the file. Moving the writing part into your for loop, inside the with block, should fix the issue:
for tag in soup.select('li.ui-search-layout__item'):
    linha_csv = tag.h2.text + ';' + tag.a['href'] + ';' + tag.select_one('.price-tag-amount').text + '\n'
    f.write(linha_csv)
Example
import requests, csv
from bs4 import BeautifulSoup
url_base = "https://lista.mercadolivre.com.br/"
query = "vinho"
soup = BeautifulSoup(requests.get(url_base + query).content, "html.parser")
with open(r'Lista_Precos_MercadoLivre.csv','a',encoding='utf8',newline='') as f:
    fieldnames = ['Produto','Link do Produto','Preco']
    dw = csv.DictWriter(f,delimiter=';',fieldnames=fieldnames)
    dw.writeheader()
    while True:
        for tag in soup.select('li.ui-search-layout__item'):
            linha_csv = tag.h2.text + ';' + tag.a['href'] + ';' + tag.select_one('.price-tag-amount').text + '\n'
            f.write(linha_csv)
        next_link = soup.select_one( "a.andes-pagination__link:-soup-contains(Seguinte)")
        if not next_link:
            break
        soup = BeautifulSoup(requests.get(next_link["href"]).content, "html.parser")
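The error itself can be reproduced without any scraping. The sketch below is only an illustration: the temporary file name and the sample lines are made up. It shows that writes inside the with block succeed, while a write after the block raises the ValueError from the question, because leaving the block closes the file.

```python
import os
import tempfile

# Hypothetical output path in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')

with open(path, 'w', encoding='utf8', newline='') as f:
    for line in ['vinho tinto;R$10\n', 'vinho branco;R$12\n']:
        f.write(line)          # inside the with block: the file is still open

# Leaving the with block closed the file, so this write fails.
try:
    f.write('too late;R$0\n')
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(f.closed, error_message)
```

So any f.write call has to be indented far enough to sit inside the with block, as in the example above.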
A newbie scraper here !
I am currently stuck with a tedious and boring task: I have to copy and paste certain contents from Angel List and save them in Excel. I have used scrapers before to automate such tasks, but this one is quite tough and I am unable to find a way to automate it. Please find the website link below:
https://angel.co/people/all
Kindly apply filters Location-> USA, and Market-> Online Dating. There will be around 550 results (please note that the URL doesn't change when you apply the filters)
I have successfully scraped the URLs of all the profiles once filters are applied. Therefore, I have an excel file with 550 URLs of these profiles.
Now the next step is to go to individual profiles and scrape certain information. I am looking for these fields currently:
Name
Description Information
Investments
Founder
Advisor
Locations
Markets
What I'm looking for
Now I have tried a lot of solutions but none have worked so far. Import.io, data miner, data scraper tools are not helping me much.
Is there any VBA code, Python code, or tool that can help me automate this scraping task?
COMPLETE CODE FOR SOLUTION:
Here is the final code with comments. If someone still has problems, please comment below and I will try to help you out.
from bs4 import BeautifulSoup
import urllib2
import json
import csv
def fetch_page(url):
    opener = urllib2.build_opener()
    # changing the user agent as the default one is banned
    opener.addheaders = [('User-Agent', 'Mozilla/43.0.1')]
    return opener.open(url).read()
#Create a CSV File.
f = open('angle_profiles.csv', 'w')
# Row Headers
f.write("URL" + "," + "Name" + "," + "Founder" + "," + "Advisor" + "," + "Employee" + "," + "Board Member" + ","
+ "Customer" + "," + "Locations" + "," + "Markets" + "," + "Investments" + "," + "What_iam_looking_for" + "\n")
# URLs to iterate over has been saved in file: 'profiles_links.csv' . I will extract the URLs individually...
index = 1;
with open("profiles_links.csv") as f2:
    for row in map(str.strip,f2):
        url = format(row)
        print "# Index: ", index
        index += 1;
        # Check if URL has 404 error. if yes, skip and continue with the rest of URLs.
        try:
            html = fetch_page(url)
            page = urllib2.urlopen(url)
        except Exception, e:
            print "Error 404 #: " , url
            continue
        bs = BeautifulSoup(html, "html.parser")
        #Extract info from page with these tags..
        name = bs.select(".profile-text h1")[0].get_text().strip()
        #description = bs.select('div[data-field="bio"]')[0]['data-value']
        founder = map(lambda link: link.get_text().strip(), bs.select('.role_founder a'))
        advisor = map(lambda link: link.get_text().strip(), bs.select('.role_advisor a'))
        employee = map(lambda link: link.get_text().strip(), bs.select('.role_employee a'))
        board_member = map(lambda link: link.get_text().strip(), bs.select('.role_board_member a'))
        customer = map(lambda link: link.get_text().strip(), bs.select('.role_customer a'))
        class_wrapper = bs.body.find('div', attrs={'data-field' : 'tags_interested_locations'})
        count = 1
        locations = {}
        if class_wrapper is not None:
            for span in class_wrapper.find_all('span'):
                locations[count] = span.text
                count +=1
        class_wrapper = bs.body.find('div', attrs={'data-field' : 'tags_interested_markets'})
        count = 1
        markets = {}
        if class_wrapper is not None:
            for span in class_wrapper.find_all('span'):
                markets[count] = span.text
                count +=1
        what_iam_looking_for = ' '.join(map(lambda p: p.get_text().strip(), bs.select('div.criteria p')))
        user_id = bs.select('.profiles-show .profiles-show')[0]['data-user_id']
        # investments are loaded using separate request and response is in JSON format
        json_data = fetch_page("https://angel.co/startup_roles/investments?user_id=%s" % user_id)
        investment_records = json.loads(json_data)
        investments = map(lambda x: x['company']['company_name'], investment_records)
        # Make sure that every variable is in string
        name2 = str(name); founder2 = str(founder); advisor2 = str (advisor); employee2 = str(employee)
        board_member2 = str(board_member); customer2 = str(customer); locations2 = str(locations); markets2 = str (markets);
        what_iam_looking_for2 = str(what_iam_looking_for); investments2 = str(investments);
        # Replace any , found with - so that csv doesn't confuse it as col separator...
        name = name2.replace(",", " -")
        founder = founder2.replace(",", " -")
        advisor = advisor2.replace(",", " -")
        employee = employee2.replace(",", " -")
        board_member = board_member2.replace(",", " -")
        customer = customer2.replace(",", " -")
        locations = locations2.replace(",", " -")
        markets = markets2.replace(",", " -")
        what_iam_looking_for = what_iam_looking_for2.replace(","," -")
        investments = investments2.replace(","," -")
        # Replace u' with nothing
        name = name.replace("u'", "")
        founder = founder.replace("u'", "")
        advisor = advisor.replace("u'", "")
        employee = employee.replace("u'", "")
        board_member = board_member.replace("u'", "")
        customer = customer.replace("u'", "")
        locations = locations.replace("u'", "")
        markets = markets.replace("u'", "")
        what_iam_looking_for = what_iam_looking_for.replace("u'", "")
        investments = investments.replace("u'", "")
        # Write the information back to the file... Note \n is used to jump one row ahead...
        f.write(url + "," + name + "," + founder + "," + advisor + "," + employee + "," + board_member + ","
                + customer + "," + locations + "," + markets + "," + investments + "," + what_iam_looking_for + "\n")
Feel free to test the above code with any of the following links:
https://angel.co/idg-ventures?utm_source=people
https://angel.co/douglas-feirstein?utm_source=people
https://angel.co/andrew-heckler?utm_source=people
https://angel.co/mvklein?utm_source=people
https://angel.co/rajs1?utm_source=people
HAPPY CODING :)
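A side note on the two replace passes in the code above: they are only needed because each row is built by string concatenation from str() of a list. A possible alternative, sketched here for Python 3 with made-up data, is to join each list into one cell and let csv.writer do the quoting. The helper name `profile_to_row`, the "; " separator, and the example.com URL are assumptions for illustration and are not part of the original solution.

```python
import csv
import io

def profile_to_row(url, name, founder, markets):
    """Flatten one profile into a list of strings (hypothetical helper)."""
    # Each list becomes a single "; "-separated cell instead of str(list),
    # so there is no u'...' or bracket clutter to strip afterwards.
    return [url, name, '; '.join(founder), '; '.join(markets)]

out = io.StringIO(newline='')   # in-memory stand-in for the output file
writer = csv.writer(out)
writer.writerow(['URL', 'Name', 'Founder', 'Markets'])
writer.writerow(profile_to_row(
    'https://example.com/jane',   # placeholder URL, not a real profile
    'Doe, Jane',                  # the comma is kept; csv.writer quotes the cell
    ['Acme', 'Widgets, Inc.'],
    ['Online Dating'],
))
result = out.getvalue()
print(result)

# Reading it back shows that the commas did not create extra columns.
parsed = list(csv.reader(io.StringIO(result, newline='')))
```

With this approach the commas in names and company names survive intact, and the file still opens as four columns.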
For my recipe you will need to install BeautifulSoup using pip or easy_install
from bs4 import BeautifulSoup
import urllib2
import json
def fetch_page(url):
    opener = urllib2.build_opener()
    # changing the user agent as the default one is banned
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    return opener.open(url).read()
html = fetch_page("https://angel.co/davidtisch")
# or load from local file
#html = open('page.html', 'r').read()
bs = BeautifulSoup(html, "html.parser")
name = bs.select(".profile-text h1")[0].get_text().strip()
description = bs.select('div[data-field="bio"]')[0]['data-value']
founder = map(lambda link: link.get_text().strip(), bs.select('.role_founder a'))
advisor = map(lambda link: link.get_text().strip(), bs.select('.role_advisor a'))
locations = map(lambda link: link.get_text().strip(), bs.select('div[data-field="tags_interested_locations"] a'))
markets = map(lambda link: link.get_text().strip(), bs.select('div[data-field="tags_interested_markets"] a'))
what_iam_looking_for = ' '.join(map(lambda p: p.get_text().strip(), bs.select('div.criteria p')))
user_id = bs.select('.profiles-show .profiles-show')[0]['data-user_id']
# investments are loaded using separate request and response is in JSON format
json_data = fetch_page("https://angel.co/startup_roles/investments?user_id=%s" % user_id)
investment_records = json.loads(json_data)
investments = map(lambda x: x['company']['company_name'], investment_records)
Take a look at https://scrapy.org/
It lets you write a parser very quickly. Here's my example parser for a site similar to angel.co: https://gist.github.com/lisitsky/c4aac52edcb7abfd5975be067face1bb
Unfortunately, angel.co is not available to me right now. A good starting point:
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://angel.co']
    def parse(self, response):
        # here's selector to extract interesting elements
        for title in response.css('h2.entry-title'):
            # write down here values you'd like to extract from the element
            yield {'title': title.css('a ::text').extract_first()}
        # how to find next page
        next_page = response.css('div.prev-post > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
EOF
$ scrapy runspider myspider.py
Fill in the CSS selectors you are interested in and run the spider.
I would like to know how to export my crawling results into multiple CSV files, one for each city that I have crawled. I keep running into walls and can't find a proper way to sort it out.
That is my code:
import requests
from bs4 import BeautifulSoup
import csv
user_agent = {'User-agent': 'Chrome/43.0.2357.124'}
output_file= open("TA.csv", "w", newline='')
RegionIDArray = [187147,187323,186338]
dict = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()
for reg in RegionIDArray:
    for page in range(1,700,30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)
        g_data = soup.find_all("div", {"class": "element_wrap"})
        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)
                print("POI: " + str(item) + " | " + "Location: " + str(dict[reg]))
                writer = csv.writer(output_file)
                csv_fields = ['POI', 'Locaton']
                if g_data:
                    writer.writerow([str(item), str(dict[reg])])
My goal is to get three separate CSV files for Paris, Berlin and London instead of getting all the results in one big CSV file.
Could you guys help me out? Thanks for your feedback:)
I made some minor modifications to your code. To create a file for each city, I moved the opening of the output file inside the loop, so the file name is built from the region ID.
Note that, because I don't have time right now, the try/except at the very end is a hack to ignore Unicode errors: it just skips any line that contains a non-ASCII character. That isn't good. Maybe someone can fix that part?
import requests
from bs4 import BeautifulSoup
import csv
user_agent = {'User-agent': 'Chrome/43.0.2357.124'}
RegionIDArray = {187147: 'Paris', 187323: 'Berlin', 186338: 'London'}
already_printed = set()
for reg in RegionIDArray:
    output_file= open("TA" + str(reg) + ".csv", "w")
    for page in range(1,700,30):
        r = requests.get("https://www.tripadvisor.de/Attractions-c47-g" + str(reg) + "-oa" + str(page) + ".html")
        soup = BeautifulSoup(r.content)
        g_data = soup.find_all("div", {"class": "element_wrap"})
        for item in g_data:
            header = item.find_all("div", {"class": "property_title"})
            item = (header[0].text.strip())
            if item not in already_printed:
                already_printed.add(item)
                # print("POI: " + str(item) + " | " + "Location: " + str(RegionIDArray[reg]))
                writer = csv.writer(output_file)
                csv_fields = ['POI', 'Locaton']
                if g_data:
                    try:
                        writer.writerow([str(item), str(RegionIDArray[reg])])
                    except:
                        pass
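One way the Unicode hack could be avoided, assuming Python 3, is to open each city's file with an explicit UTF-8 encoding so that names with accents or umlauts are written instead of skipped. The sketch below leaves out the scraping and uses made-up results; the helper name `write_city_files`, the file name pattern, and the sample data are assumptions for illustration only.

```python
import csv
import os
import tempfile

def write_city_files(results, out_dir):
    """Write one CSV per city; `results` maps city name -> list of POI names."""
    paths = {}
    for city, pois in results.items():
        path = os.path.join(out_dir, 'TA_' + city + '.csv')
        # One file per city, opened once; utf-8 handles non-ASCII names.
        with open(path, 'w', encoding='utf-8', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['POI', 'Location'])
            seen = set()              # de-duplicate within the city
            for poi in pois:
                if poi not in seen:
                    seen.add(poi)
                    writer.writerow([poi, city])
        paths[city] = path
    return paths

# Made-up sample results standing in for the scraped data.
sample = {
    'Paris': ["Musée d'Orsay", 'Eiffelturm', 'Eiffelturm'],
    'Berlin': ['Brandenburger Tor'],
}
paths = write_city_files(sample, tempfile.mkdtemp())

with open(paths['Paris'], encoding='utf-8') as f:
    paris_lines = f.read().splitlines()
print(paris_lines)
```

Creating the writer once per file, rather than once per item, also keeps the header row at the top of each file.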
I'm trying to use Python to scrape the play-by-play table from this basketball-reference example into a CSV file.
When I run this code, the table is cut short and many cells are missing. I'm a programming n00b and any help would be appreciated.
from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv
bref = "http://www.basketball-reference.com"
print "Enter game code:"
game = raw_input("> ")
def make_soup(url):
    return BeautifulSoup(urlopen(url), "lxml")
def get_pbp(pbp):
    soup = make_soup(bref + "/boxscores/pbp/" + game + ".html")
    table = soup.find("table", "no_highlight stats_table")
    rows = [row.find_all("td") for row in table.find_all("tr")]
    data = []
    for row in rows:
        values = []
        for value in row:
            if value.string is None:
                values.append(u"")
            else:
                values.append(value.string.replace(u"\xa0", u""))
        data.append(values)
    return data
if __name__ == '__main__':
    print "Writing data for game " + game
    with open(game + '.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(get_pbp(game))
    print game + " has been successfully scraped."
You need to skip empty cells:
table = soup.find("table", class_="no_highlight stats_table")
rows = [[cell.text.replace(u"\xa0", u"").strip() for cell in row.find_all("td") if cell.text.strip()]
        for row in table.find_all("tr")[2:]]
with open(game + '.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
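The cleaning step can be tried on its own, without fetching the page. The sketch below assumes Python 3 and uses plain strings in place of the td cells; the function name `clean_rows` and the sample data are made up. Unlike the snippet above, it also drops rows that end up completely empty, which is an extra choice made here and not part of the answer.

```python
def clean_rows(raw_rows):
    """Strip non-breaking spaces and drop empty cells (hypothetical helper)."""
    cleaned = []
    for cells in raw_rows:
        # Remove \xa0 (the character behind &nbsp;) and surrounding whitespace.
        values = [c.replace('\xa0', '').strip() for c in cells]
        # Keep only the cells that still contain text.
        values = [v for v in values if v]
        if values:                 # extra choice: skip rows with nothing left
            cleaned.append(values)
    return cleaned

# Made-up rows standing in for the text of each td cell.
raw = [
    ['12:00.0', '\xa0', 'Jump ball', ''],
    ['\xa0', ''],
    ['11:42.0', 'Makes 2-pt shot', '2-0', '\xa0'],
]
rows = clean_rows(raw)
print(rows)
```

Keep in mind that dropping empty cells shifts the remaining values to the left, so the columns of different rows may no longer line up. If column position matters, keeping an empty string for each empty cell may be the better choice.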
I've almost finished a web crawler that grabs a table, but it outputs only the first row of the table. Can anyone help identify why it does not return all the rows? Please ignore the while loop; it will eventually become a looped section.
import urllib
from bs4 import BeautifulSoup
#file_name = "/user/joe/uspc-cpc.txt
#file = open(file_name,"w")
i=125
while i==125:
    url = "http://www.uspto.gov/web/patents/classification/cpc/html/us" + str(i) + "tocpc.html"
    print url + '\n'
    i += 1
    data = urllib.urlopen(url).read()
    print data
    #get the table data from dump
    #append to csv file
    soup = BeautifulSoup(data)
    table = soup.find("table", width='80%')
    for tr in table.findAll('tr')[2:]:
        col = row.findAll('td')
        uspc = col[0].get_text().encode('ascii','ignore')
        cpc1 = col[1].get_text().encode('ascii','ignore')
        cpc2 = col[2].get_text().encode('ascii','ignore')
        cpc3 = col[3].get_text().encode('ascii','ignore')
        print uspc + ',' + cpc1 + ',' + cpc2 + ',' + cpc3 + '\n'
        #file.write(record)
#file.close()
CODE I'm running:
import urllib
from bs4 import BeautifulSoup
#file_name = "/users/ripple/uspc-cpc.txt"
#file = open(file_name,"w")
i=125
while i==125:
    url = "http://www.uspto.gov/web/patents/classification/cpc/html/us" + str(i) + "tocpc.html"
    print 'Grabbing from: ' + url + '\n'
    i += 1
    #get the table data from the page
    data = urllib.urlopen(url).read()
    #send to beautiful soup
    soup = BeautifulSoup(data)
    table = soup.find("table", width='80%')
    for tr in table.findAll('tr')[2:]:
        col = tr.findAll('td')
        uspc = col[0].get_text().encode('ascii','ignore').replace(" ","")
        cpc1 = col[1].get_text().encode('ascii','ignore').replace(" ","")
        cpc2 = col[2].get_text().encode('ascii','ignore').replace(" ","")
        cpc3 = col[3].get_text().encode('ascii','ignore').replace(" ","").replace("more...", "")
        record = uspc + ',' + cpc1 + ',' + cpc2 + ',' + cpc3 + '\n'
        print record
        #file.write(record)
#file.close()
You are using tr as the loop variable, but you refer to row inside the loop. If row was defined somewhere earlier, every iteration reads that same old row, which produces confusing results such as the same line being printed over and over.
for tr in table.findAll('tr')[2:]:
    col = tr.findAll('td')
works for me:
125/1,B 28D 1/00,B 28D 1/221,E 01C 23/081,B 28D 1/005,B 28D 1/06more...
125/2,B 23Q 35/10,B 22C 9/18,B 23B 5/162,B 23D 63/18,B 24B 53/07more...
125/3,B 28D 1/18,B 28D 1/003,B 28D 1/048,B 28D 1/181,B 24B 7/22more...
etc.
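The effect of the mismatched variable name is easy to see with plain lists. In this sketch the nested lists stand in for the tr and td tags, and the data and the leftover `row` variable are made up for illustration.

```python
# Made-up table rows; each inner list plays the role of one tr's cells.
rows = [
    ['125/1', 'B 28D 1/00'],
    ['125/2', 'B 23Q 35/10'],
    ['125/3', 'B 28D 1/18'],
]

row = rows[0]   # a stale variable left over from earlier code

buggy = []
for tr in rows:
    buggy.append(row[0])   # bug: reads the stale `row`, not the loop variable

fixed = []
for tr in rows:
    fixed.append(tr[0])    # fix: use the loop variable itself

print(buggy)
print(fixed)
```

The buggy loop runs three times but yields the first row's value each time, which matches the symptom in the question. If `row` had never been defined, the loop would raise a NameError instead.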