I want to gather information about many people. The information is on the website www.wats4u.com. I have the first name and last name of these people in an Excel document.
For the moment I have this code:
import urllib
page = urllib.urlopen('https://www.wats4u.com/annuaire-alumni?lastname=algan&firstname=michel&scholl=All&class=All&=rechercher')
strpage = page.read()
page.close()
print strpage
And I would like code more like this:
page = urllib.urlopen('https://www.wats4u.com/annuaire-alumni?lastname=' + name + '&firstname=' + firstname + '&scholl=All&class=All&=rechercher')
I have the last names and first names in an Excel document "test.xlsx" (approximately 5,000 people).
What do I need to change or add in my code?
Look into str.format:
url = 'https://www.wats4u.com/annuaire-alumni?lastname={}&firstname={}&scholl=All&class=All&=rechercher'
lastname = 'algan'
firstname = 'michel'
page = urllib.urlopen(url.format(lastname, firstname))
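To run this for all ~5,000 people, you can also build each query string with urllib.parse.urlencode (Python 3), which escapes spaces and accented characters in names for you. A minimal sketch; the name pairs are hard-coded here for illustration, and in practice you would read them from test.xlsx with a library such as openpyxl:

```python
from urllib.parse import urlencode

# Hard-coded sample rows; in practice, read (lastname, firstname) pairs
# from test.xlsx, e.g. with openpyxl.
people = [('algan', 'michel'), ('dupont', 'marie')]

base = 'https://www.wats4u.com/annuaire-alumni'
for lastname, firstname in people:
    # urlencode percent-encodes each value, so any name yields a valid URL
    query = urlencode({'lastname': lastname, 'firstname': firstname,
                       'scholl': 'All', 'class': 'All', '': 'rechercher'})
    print(base + '?' + query)
```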
I have a problem with the web scraping code below. The code works, but if the entered product is more than a single word, for example "Playstation 4", it fails. The problem seems to be in this line: if product in str(product_name):
I tried many different variations like product_name.text or product_name.string, but it won't correctly check whether the string product is in the converted object product_name when it is more than one word.
If I use print(product_name.text) I get exactly the result I would expect, so why can't I use the if-in statement with product_name.text or str(product_name)?
import requests
from bs4 import BeautifulSoup

product = input("Please enter product: ")
URL = "http://www.somewebsite.com/search?sSearch=" + product
website = requests.get(URL)
html = BeautifulSoup(website.text, 'html.parser')
product_info = html.find_all('div', class_="product--main")
product_array = []
for product_details in product_info:
    product_name = product_details.find('a', class_="product--title product--title-desktop")
    if product in str(product_name):
        product_array.append(product_name.text.replace('\n', '') + '; ')
        discounted_price = product_details.find('span', class_="price--default is--discount")
        if discounted_price:
            product_array.append(discounted_price.text.replace('\n', '').replace('\xa0€*', '').replace('from', '') + ';\n')
        else:
            regular_price = product_details.find('span', class_="price--default")
            product_array.append(regular_price.text.replace('\n', '').replace('\xa0€*', '').replace('from', '') + ';\n' if regular_price else 'N/A;\n')
with open("data.csv", "w") as text_file:
    text_file.write("product; price;\n")
    for object in product_array:
        text_file.write(object)
Why should I use urlencode?
I tried many different variations like product_name.text or product_name.string,
but it won't correctly check if the string product is in the converted object product_name if it is...
not just one word.
URL = "http://www.somewebsite.com/search?sSearch=" + product
Please look at what happens to the query string when you use plain concatenation:
So please consider updating your code as below:
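The root cause: when product contains a space (as in "Playstation 4"), concatenating it raw into the URL produces an invalid query string, so the site never sees the full search term. urllib.parse.quote_plus percent-encodes the value before it goes into the URL (the somewebsite.com address below is the placeholder from the question):

```python
from urllib.parse import quote_plus

product = "Playstation 4"

# Raw concatenation leaves the space inside the query string:
print("http://www.somewebsite.com/search?sSearch=" + product)
# -> http://www.somewebsite.com/search?sSearch=Playstation 4

# quote_plus encodes reserved characters; spaces become '+':
print("http://www.somewebsite.com/search?sSearch=" + quote_plus(product))
# -> http://www.somewebsite.com/search?sSearch=Playstation+4
```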
I'm looking to build a scraper from a list of URLs I have saved in CSV or JSON format. It is not for spam; I have a large list which I don't want to visit one by one!
My goal is a script which works through the URL list, connects to each website, and scrapes it for email addresses. The end goal is to save the URL and email in an xls file, with a row for each business's data such as:
| business 1 url | business 1 email contact |
| business 2 url | business 2 email contact |
business 1 url = b1url
Ideally the script will:
look at b1url and search for an email address; if it is not on the page, look on the contact-us page.
It can do this either by searching for @website.com on the page or by searching the HTML for [href*=mailto].
move to b2url after b1url and b1url's contact-us page have been searched; rinse and repeat for the list.
If the script could also record the webpage name as another column, that would be very helpful, but it is not necessary.
import pandas as pd
import requests
import bs4
import re

src_df = pd.read_csv('C:/src_file.csv')

def get_email(soup):
    try:
        email = re.findall(r'([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', response.text)[-1]
        return email
    except:
        pass
    try:
        email = soup.select("a[href*=mailto]")[-1].text
    except:
        print('Email not found')
        email = ''
    return email

for i, row in src_df.iterrows():
    url = 'http://www.' + row['website']
    try:
        response = requests.get(url)
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
    except:
        print('Unsuccessful: ' + url)
        continue
    email = get_email(soup)
    src_df.loc[i, 'Email'] = email
    print('website: %s\nemail: %s\n' % (url, email))

src_df.to_csv('output.csv', index=False)
If there is no email on the home page, I want the script to search the 'contact us', 'contact', 'about', or 'about us' page. If it could do this after the rest of the URL list has been searched, that would be good (e.g. if email has no value, then search 'b1url.com/contact'), to reduce the chance of being blocked. The code below contains the 'try' logic which I want to integrate into the code above.
try:
    # Check if there is any email address in the homepage.
    emails = soup.find_all(text=re.compile('.*@' + domain[1] + '.' + domain[2].replace("/", "")))
    emails.sort(key=len)
    print(emails[0].replace("\n", ""))
    final_result = emails[0]
except:
    # Search for the Contact Us page's URL.
    try:
        flag = 0
        for link in links:
            if "contact" in link.get("href") or "Contact" in link.get("href") or "CONTACT" in link.get("href") or 'contact' in link.text or 'Contact' in link.text or 'CONTACT' in link.text:
                if len(link.get("href")) > 2 and flag < 2:
                    flag = flag + 1
                    contact_link = link.get("href")
    except:
        pass
    domain = domain[0] + "." + domain[1] + "." + domain[2]
    if len(contact_link) < len(domain):
        domain = domain + contact_link.replace("/", "")
    else:
        domain = contact_link
    try:
        # Check if there is any email address in the Contact Us page.
        res = requests.get(domain)
        soup = BeautifulSoup(res.text, "lxml")
        emails = soup.find_all(text=re.compile('.*@' + mailaddr[7:].replace("/", "")))
        emails.sort(key=len)
        try:
            print(emails[0].replace("\n", ""))
            final_result = emails[0]
            return final_result
        except:
            pass
    except Exception as e:
        pass
    return ""
I'm trying to web-scrape https://old.reddit.com/r/all/ and get the entries on the first page.
When I run my code it works, but for post_text it only copies the last post on the Reddit page 25 times. I know this is because the first loop finishes with the last entry in post_text, which the second loop then writes on every iteration.
import requests
import urllib.request
from bs4 import BeautifulSoup as soup

my_url = 'https://old.reddit.com/r/all/'
request = urllib.request.Request(my_url, headers={'User-Agent': 'your bot 0.1'})
response = urllib.request.urlopen(request)
page_html = response.read()
page_soup = soup(page_html, "html.parser")
posts = page_soup.findAll("div", {"class": "top-matter"})
post = posts[0]
authors = page_soup.findAll("p", {"class": "tagline"})
author = authors[0]
filename = "redditAll.csv"
f = open(filename, "w")
headers = "Title of the post, Author of the post\n"
f.write(headers)
for post in posts:
    post_text = post.p.a.text.replace(",", " -")
for author in authors:
    username = author.a.text
    f.write(post_text + "," + username + "\n")
f.close()
Changed this
for post in posts:
    post_text = post.p.a.text.replace(",", " -")
for author in authors:
    username = author.a.text
To that
for post, author in zip(posts, authors):
    post_text = post.p.a.text.replace(",", " -")
    username = author.a.text
LTheriault is correct, but I'd consider this more idiomatic.
for post, author in zip(posts, authors):
    post_text = post.p.a.text.replace(",", " -")
    username = author.a.text
    f.write(post_text + "," + username + "\n")
You're doing the two loops separately. In your code below, you loop through each post and assign a string to post_text, but do nothing else with it. When that loop is done, post_text holds whatever was assigned last; the authors loop then writes a line for each author using that one stored post_text.
for post in posts:
    post_text = post.p.a.text.replace(",", " -")
for author in authors:
    username = author.a.text
    f.write(post_text + "," + username + "\n")
Assuming that there are an equal number of elements in posts and authors, you should be able to fix it with the following:
for i in range(len(posts)):
    post_text = posts[i].p.a.text.replace(",", " -")
    username = authors[i].a.text
    f.write(post_text + "," + username + "\n")
The problem here is that you're writing to the file object within the scope of the second for loop, for author in authors, so you will indeed write the last value of post_text multiple times.
If you want to combine posts and authors you might zip them and then iterate over the pairs (assuming they are the same length):
for post, author in zip(posts, authors):
    f.write(f'author: {author}, post: {post}\n')
I would also recommend writing to the file using a context manager, e.g.:
with open('filename.txt', 'w') as f:
    f.write('stuff')
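Since post titles can themselves contain commas, it is also worth noting that the standard csv module quotes fields automatically, which removes the need for the replace(",", " -") workaround. A sketch with made-up rows standing in for the scraped data:

```python
import csv

# Hypothetical (title, author) pairs standing in for the scraped posts.
rows = [("Hello, world", "alice"), ("Second post", "bob")]

with open("redditAll.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Title of the post", "Author of the post"])
    # Fields containing commas are quoted automatically by csv.writer.
    writer.writerows(rows)
```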
I am trying to scrape information about every firm on this website: www.canadianlawlist.com
I have finished most of it, but I am running into a small problem.
I am trying to get the results to display in the following order:
-Firm Name and Information
*Employees from the firm
But instead I am getting very random results. It will scrape information about two firms and then scrape the information of employees, like this:
-Firm name and information
-Firm name and information
*Employee from Firm 1
-Firm name and information
*Employee from Firm 2
And so on. I am not sure what I am missing in my code:
def parse_after_submit(self, response):
    basicurl = "canadianlawlist.com/"
    products = response.xpath('//*[@class="searchresult_item_regular"]/a/@href').extract()
    for p in products:
        url = "http://canadianlawlist.com" + p
        yield scrapy.Request(url, callback=self.parse_firm_info)
    # process next page
    #for x in range(2, 6):
    #    next_page_url = "https://www.canadianlawlist.com/searchresult?searchtype=firms&city=montreal&page=" + str(x)

def parse_firm_info(self, response):
    name = response.xpath('//div[@class="listingdetail_companyname"]/h1/span/text()').extract_first()
    print name
    for info in response.xpath('//*[@class="listingdetail_contactinfo"]'):
        street_address = info.xpath('//div[@class="listingdetail_contactinfo"]/div[1]/span/div/text()').extract_first()
        city = info.xpath('//*[@itemprop="addressLocality"]/text()').extract_first()
        province = info.xpath('//*[@itemprop="addressRegion"]/text()').extract_first()
        postal_code = info.xpath('//*[@itemprop="postalCode"]/text()').extract_first()
        telephone = info.xpath('//*[@itemprop="telephone"]/text()').extract_first()
        fax_number = info.xpath('//*[@itemprop="faxNumber"]/text()').extract_first()
        email = info.xpath('//*[@itemprop="email"]/text()').extract_first()
        print street_address
        print city
        print province
        print postal_code
        print telephone
        print fax_number
        print email
    for people in response.xpath('//div[@id="main_block"]/div[1]/div[2]/div[2]'):
        pname = people.xpath('//*[@class="listingdetail_individual_item"]/h3/a/text()').extract()
        print pname
    basicurl = "canadianlawlist.com/"
    employees = response.xpath('//*[@class="listingdetail_individual_item"]/h3/a/@href').extract()
    for e in employees:
        url2 = "http://canadianlawlist.com" + e
        yield scrapy.Request(url2, callback=self.parse_employe_info)

def parse_employe_info(self, response):
    ename = response.xpath('//*[@class="listingdetail_individualname"]/h1/span/text()').extract_first()
    job_title = response.xpath('//*[@class="listingdetail_individualmaininfo"]/div/i/span/text()').extract_first()
    print ename
    print job_title
You cannot rely on the order of Python's print output when it comes to concurrent programming. If you care about standard-output order you need to use the logging module.
Scrapy has a shortcut function for that in the Spider class:
import scrapy
import logging

class MySpider(scrapy.Spider):
    def parse(self, response):
        self.log("first message", level=logging.INFO)
        self.log("second message", level=logging.INFO)
Scrapy runs multiple requests at the same time, so the content displayed on the console can correspond to any of the requests running concurrently.
You can go to settings.py and set
CONCURRENT_REQUESTS = 1
Now only one request will be launched at a time, so your console will show data in a meaningful order, but this will make the scraping slower.
A newbie scraper here!
I am currently stuck with a tedious and boring task where I have to copy/paste certain contents from AngelList and save them in Excel. I have previously used scrapers to automate such boring tasks, but this one is quite tough and I have been unable to find a way to automate it. Please find the website link below:
https://angel.co/people/all
Kindly apply the filters Location -> USA and Market -> Online Dating. There will be around 550 results (please note that the URL doesn't change when you apply the filters).
I have successfully scraped the URLs of all the profiles once filters are applied. Therefore, I have an excel file with 550 URLs of these profiles.
Now the next step is to go to individual profiles and scrape certain information. I am looking for these fields currently:
Name
Description Information
Investments
Founder
Advisor
Locations
Markets
What I'm looking for
Now, I have tried a lot of solutions but none have worked so far. Import.io and similar data-miner/data-scraper tools are not helping much.
Please suggest any VBA code, Python code, or tool that can help me automate this scraping task.
COMPLETE CODE FOR SOLUTION:
Here is the final code with comments. If someone still has problems, please comment below and I will try to help you out.
from bs4 import BeautifulSoup
import urllib2
import json
import csv

def fetch_page(url):
    opener = urllib2.build_opener()
    # changing the user agent as the default one is banned
    opener.addheaders = [('User-Agent', 'Mozilla/43.0.1')]
    return opener.open(url).read()

# Create a CSV file.
f = open('angle_profiles.csv', 'w')
# Row headers
f.write("URL" + "," + "Name" + "," + "Founder" + "," + "Advisor" + "," + "Employee" + "," + "Board Member" + ","
        + "Customer" + "," + "Locations" + "," + "Markets" + "," + "Investments" + "," + "What_iam_looking_for" + "\n")

# The URLs to iterate over have been saved in the file 'profiles_links.csv'. I will extract the URLs individually...
index = 1
with open("profiles_links.csv") as f2:
    for row in map(str.strip, f2):
        url = format(row)
        print "# Index: ", index
        index += 1
        # Check if the URL has a 404 error. If yes, skip it and continue with the rest of the URLs.
        try:
            html = fetch_page(url)
            page = urllib2.urlopen(url)
        except Exception, e:
            print "Error 404 #: ", url
            continue
        bs = BeautifulSoup(html, "html.parser")
        # Extract info from the page with these tags...
        name = bs.select(".profile-text h1")[0].get_text().strip()
        #description = bs.select('div[data-field="bio"]')[0]['data-value']
        founder = map(lambda link: link.get_text().strip(), bs.select('.role_founder a'))
        advisor = map(lambda link: link.get_text().strip(), bs.select('.role_advisor a'))
        employee = map(lambda link: link.get_text().strip(), bs.select('.role_employee a'))
        board_member = map(lambda link: link.get_text().strip(), bs.select('.role_board_member a'))
        customer = map(lambda link: link.get_text().strip(), bs.select('.role_customer a'))
        class_wrapper = bs.body.find('div', attrs={'data-field': 'tags_interested_locations'})
        count = 1
        locations = {}
        if class_wrapper is not None:
            for span in class_wrapper.find_all('span'):
                locations[count] = span.text
                count += 1
        class_wrapper = bs.body.find('div', attrs={'data-field': 'tags_interested_markets'})
        count = 1
        markets = {}
        if class_wrapper is not None:
            for span in class_wrapper.find_all('span'):
                markets[count] = span.text
                count += 1
        what_iam_looking_for = ' '.join(map(lambda p: p.get_text().strip(), bs.select('div.criteria p')))
        user_id = bs.select('.profiles-show .profiles-show')[0]['data-user_id']
        # Investments are loaded using a separate request; the response is in JSON format.
        json_data = fetch_page("https://angel.co/startup_roles/investments?user_id=%s" % user_id)
        investment_records = json.loads(json_data)
        investments = map(lambda x: x['company']['company_name'], investment_records)
        # Make sure that every variable is a string.
        name2 = str(name); founder2 = str(founder); advisor2 = str(advisor); employee2 = str(employee)
        board_member2 = str(board_member); customer2 = str(customer); locations2 = str(locations); markets2 = str(markets)
        what_iam_looking_for2 = str(what_iam_looking_for); investments2 = str(investments)
        # Replace any comma found with - so that the CSV doesn't confuse it with a column separator...
        name = name2.replace(",", " -")
        founder = founder2.replace(",", " -")
        advisor = advisor2.replace(",", " -")
        employee = employee2.replace(",", " -")
        board_member = board_member2.replace(",", " -")
        customer = customer2.replace(",", " -")
        locations = locations2.replace(",", " -")
        markets = markets2.replace(",", " -")
        what_iam_looking_for = what_iam_looking_for2.replace(",", " -")
        investments = investments2.replace(",", " -")
        # Replace u' with nothing
        name = name.replace("u'", "")
        founder = founder.replace("u'", "")
        advisor = advisor.replace("u'", "")
        employee = employee.replace("u'", "")
        board_member = board_member.replace("u'", "")
        customer = customer.replace("u'", "")
        locations = locations.replace("u'", "")
        markets = markets.replace("u'", "")
        what_iam_looking_for = what_iam_looking_for.replace("u'", "")
        investments = investments.replace("u'", "")
        # Write the information back to the file... Note: \n jumps one row ahead...
        f.write(url + "," + name + "," + founder + "," + advisor + "," + employee + "," + board_member + ","
                + customer + "," + locations + "," + markets + "," + investments + "," + what_iam_looking_for + "\n")
Feel free to test the above code with any of the following links:
https://angel.co/idg-ventures?utm_source=people
https://angel.co/douglas-feirstein?utm_source=people
https://angel.co/andrew-heckler?utm_source=people
https://angel.co/mvklein?utm_source=people
https://angel.co/rajs1?utm_source=people
HAPPY CODING :)
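A side note on the u' and comma cleanup in the code above: it is only needed because str() is applied to whole lists, which bakes the list repr (brackets, quotes, and on Python 2 the u'...' prefixes) into the text. Joining the items produces clean text directly; a sketch with hypothetical names:

```python
founder = ['Alice Example', 'Bob Example']  # hypothetical scraped names

# str() keeps the list repr, which then has to be scrubbed by hand:
print(str(founder))
# -> ['Alice Example', 'Bob Example']

# ' - '.join() yields clean, CSV-friendly text in one step:
print(' - '.join(founder))
# -> Alice Example - Bob Example
```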
For my recipe you will need to install BeautifulSoup using pip or easy_install
from bs4 import BeautifulSoup
import urllib2
import json

def fetch_page(url):
    opener = urllib2.build_opener()
    # changing the user agent as the default one is banned
    opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
    return opener.open(url).read()

html = fetch_page("https://angel.co/davidtisch")
# or load from a local file
#html = open('page.html', 'r').read()
bs = BeautifulSoup(html, "html.parser")
name = bs.select(".profile-text h1")[0].get_text().strip()
description = bs.select('div[data-field="bio"]')[0]['data-value']
founder = map(lambda link: link.get_text().strip(), bs.select('.role_founder a'))
advisor = map(lambda link: link.get_text().strip(), bs.select('.role_advisor a'))
locations = map(lambda link: link.get_text().strip(), bs.select('div[data-field="tags_interested_locations"] a'))
markets = map(lambda link: link.get_text().strip(), bs.select('div[data-field="tags_interested_markets"] a'))
what_iam_looking_for = ' '.join(map(lambda p: p.get_text().strip(), bs.select('div.criteria p')))
user_id = bs.select('.profiles-show .profiles-show')[0]['data-user_id']
# investments are loaded using a separate request; the response is in JSON format
json_data = fetch_page("https://angel.co/startup_roles/investments?user_id=%s" % user_id)
investment_records = json.loads(json_data)
investments = map(lambda x: x['company']['company_name'], investment_records)
Take a look at https://scrapy.org/
It lets you write a parser very quickly. Here's my example parser for one site similar to angel.co: https://gist.github.com/lisitsky/c4aac52edcb7abfd5975be067face1bb
Unfortunately, angel.co is not available to me right now. A good starting point:
$ pip install scrapy
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://angel.co']

    def parse(self, response):
        # here's a selector to extract the interesting elements
        for title in response.css('h2.entry-title'):
            # write down here the values you'd like to extract from each element
            yield {'title': title.css('a ::text').extract_first()}
        # how to find the next page
        next_page = response.css('div.prev-post > a ::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
EOF
$ scrapy runspider myspider.py
Fill in the CSS selectors you're interested in and run the spider.