Web scraping - scraping data for multiple URLs gives None - Python

1) I am trying to scrape data for multiple URLs stored in a CSV file, but the result gives None.
2) I want to store the fetched data row by row in a dataframe named df, but it only stores one row.
Here's my code (pasted from the point where the data extraction starts):
import csv

df = pd.DataFrame()
with open('test1.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for line in reader:
        link = line[0]
        print(type(link))
        print(link)
        driver.get(link)
        height = driver.execute_script("return document.body.scrollHeight")
        for scrol in range(100, height, 100):
            driver.execute_script(f"window.scrollTo(0,{scrol})")
            time.sleep(0.2)
        src = driver.page_source
        soup = BeautifulSoup(src, 'lxml')
        name_div = soup.find('div', {'class': 'flex-1 mr5'})
        name_loc = name_div.find_all('ul')
        name = name_loc[0].find('li').get_text().strip()
        loc = name_loc[1].find('li').get_text().strip()
        connection = name_loc[1].find_all('li')
        connection = connection[1].get_text().strip()
        exp_section = soup.find('section', {'id': 'experience-section'})
        exp_section = exp_section.find('ul')
        div_tag = exp_section.find('div')
        a_tag = div_tag.find('a')
        job_title = a_tag.find('h3').get_text().strip()
        company_name = a_tag.find_all('p')[1].get_text().strip()
        joining_date = a_tag.find_all('h4')[0].find_all('span')[1].get_text().strip()
        exp = a_tag.find_all('h4')[1].find_all('span')[1].get_text().strip()
        df['name'] = [name]
        df['location'] = [loc]
        df['connection'] = [connection]
        df['company_name'] = [company_name]
        df['job_title'] = [job_title]
        df['joining_date'] = [joining_date]
        df['tenure'] = [exp]

df
Output -
   name location connection company_name job_title joining_date tenure
0  None     None       None         None      None         None   None
I am not sure whether the for loop goes wrong or what the exact problem is, but for a single URL it works fine.
I am using Beautiful Soup for the first time, so I don't have much experience with it. Please help me make the desired changes. Thanks.

I don't think the end of your code is appending new rows to the dataframe; assigning df['name'] = [name] inside the loop just overwrites the same single-row columns on every iteration.
Try replacing df['name'] = [name] and the other assignment lines with the following:
new_line = {
    "name": [name],
    "location": [loc],
    "connection": [connection],
    "company_name": [company_name],
    "job_title": [job_title],
    "joining_date": [joining_date],
    "tenure": [exp],
}
temp_df = pd.DataFrame.from_dict(new_line)
df = df.append(temp_df, ignore_index=True)
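Note that DataFrame.append returns a new frame rather than modifying df in place (and it is deprecated in recent pandas versions). An alternative, shown here as a minimal sketch with placeholder values, is to collect one dict per scraped profile in a plain list and build the dataframe once after the loop:

import pandas as pd

rows = []
# Inside the question's "for line in reader:" loop, after scraping the fields,
# collect them instead of assigning columns on df directly:
#     rows.append({'name': name, 'location': loc, 'connection': connection,
#                  'company_name': company_name, 'job_title': job_title,
#                  'joining_date': joining_date, 'tenure': exp})
# Self-contained demo of the same pattern with dummy values:
for i in range(3):
    rows.append({'name': f'person {i}', 'location': 'somewhere', 'tenure': f'{i} yrs'})

df = pd.DataFrame(rows)   # builds the frame once, one row per dict
print(df)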

Related

Fill in internet form using a pandas dataframe and mechanicalsoup

I am using MechanicalSoup and I need to automatically fill an internet form with information from a dataframe.
The dataframe is called checkdataframe.
br = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')
for row in checkdataframe.codenumber:
    if row in '105701':
        url = "https://www.internetform.com"
        br.open(url)
        for column in checkdataframe[['ID', 'name', 'email']]:
            br.select_form('form[action="/internetform.com/info.do"]')
            br.form['companycode'] = checkdataframe['ID']      # THIS INFORMATION SHOULD COME FROM THE DATAFRAME
            br.form['username'] = checkdataframe['name']       # THIS INFORMATION SHOULD COME FROM THE DATAFRAME
            br.form['emailaddress'] = checkdataframe['email']  # THIS INFORMATION SHOULD COME FROM THE DATAFRAME
            response = br.submit_selected()
            soup = BeautifulSoup(response.text, 'lxml')
            table = soup.find('div', attrs={'class': 'row'})
            for row in table.findAll('div', attrs={'class': 'col-md-4 col-4'}):
                scrapeinfo = {}
                scrapeinfo['STATUS'] = row.div
                scrapeinfo['NAMEOFITEM'] = row.label
                scrapeinfo['PRICE'] = row.div
                checkdataframe.append(scrapeinfo)
    else:
        break
How can I make br.form['companycode'] = checkdataframe['ID'] work, instead of hard-coding the values like this:
br.form['companycode'] = '105701'
br.form['username'] = 'myusername'
br.form['emailaddress'] = 'myusername@gmail.com'
I also need to append the information that is scraped into checkdataframe.
I need help, please.
Use selenium for this activity
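If you do want to stay with MechanicalSoup, the underlying problem is that checkdataframe['ID'] is an entire column (a pandas Series), not a single value. A minimal sketch that walks the dataframe row by row and fills the form with that row's scalar values; the URL, form action and field names are copied from the question, the dataframe contents are dummy values, and the browser["field"] item-assignment is the pattern from the MechanicalSoup tutorial:

import mechanicalsoup
import pandas as pd

# Dummy stand-in for the question's checkdataframe
checkdataframe = pd.DataFrame({
    'codenumber': ['105701'],
    'ID': ['105701'],
    'name': ['myusername'],
    'email': ['myusername@gmail.com'],
})

br = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup')

for _, row in checkdataframe.iterrows():   # one submission per dataframe row
    br.open("https://www.internetform.com")
    br.select_form('form[action="/internetform.com/info.do"]')
    # row['ID'], row['name'], row['email'] are scalars for this row
    br['companycode'] = str(row['ID'])
    br['username'] = str(row['name'])
    br['emailaddress'] = str(row['email'])
    response = br.submit_selected()
    # ... parse response.text with BeautifulSoup and collect the results
    # into a list of dicts, then merge them back into checkdataframe ...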

Why does my CSV file come out blank when I use this Python code? Cannot get data from Indeed

I am scraping a job site, but I have run into a problem: when I scrape data from the Indeed website and save it to a CSV file, the file comes out blank. Can someone correct me?
The whole code is at this link:
https://github.com/Ram-95/Indeed_Job_Scraper/blob/master/Indeed_Job_Scraper.py
# Scraping the web
soup = BeautifulSoup(html, 'lxml')
base_url = 'https://in.indeed.com/viewjob?jk='
d = soup.find('div', attrs={'id': 'mosaic-provider-jobcards'})
jobs = soup.find_all('a', class_='tapItem')
for job in jobs:
    job_id = job['id'].split('_')[-1]
    job_title = job.find('span', title=True).text.strip()
    company = job.find('span', class_='companyName').text.strip()
    location = job.find('div', class_='companyLocation').text.strip()
    posted = job.find('span', class_='date').text.strip()
    job_link = base_url + job_id
    #print([job_title, company, location, posted, job_link])
    # Writing to CSV file
    writer.writerow(
        [job_title, company, location.title(), posted, job_link])

print(f'Jobs data written to <{file_name}> successfully.')
Your problem is not with the csv module failing to write the file; it is with your scraping script failing to scrape anything. There are a number of errors in your code. First, in line 45 of the linked script, you assign the result of a soup.find to a variable d, then proceed to completely ignore that variable. I'm assuming you wanted to use it in the next line:
jobs = d.find_all('div', class_='tapItem')
Secondly, since you are using the lxml parser, you won't get anything for job['id']. Try this instead:
job_id = job.attrs['class'][4]
So your code should essentially look like this:
soup = BeautifulSoup(html, 'lxml')
base_url = 'https://in.indeed.com/viewjob?jk='
d = soup.find('div', attrs={'id': 'mosaic-provider-jobcards'})
jobs = d.find_all('div', class_='tapItem')
for job in jobs:
    job_id = job.attrs['class'][4]
    job_title = job.find('span', title=True).text.strip()
    company = job.find('span', class_='companyName').text.strip()
    location = job.find('div', class_='companyLocation').text.strip()
    posted = job.find('span', class_='date').text.strip()
    job_link = base_url + job_id
    #print([job_title, company, location, posted, job_link])
    # Writing to CSV file
    writer.writerow(
        [job_title, company, location.title(), posted, job_link])
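One extra guard worth adding (not part of the original answer): if the downloaded page does not contain the mosaic-provider-jobcards container, for example because the request was blocked or the markup changed, soup.find returns None and d.find_all raises an AttributeError, which also leaves the CSV empty. A small hedged check:

d = soup.find('div', attrs={'id': 'mosaic-provider-jobcards'})
if d is None:
    # No job-cards container in the HTML; nothing to write to the CSV.
    print('No job cards found in the page source.')
    jobs = []
else:
    jobs = d.find_all('div', class_='tapItem')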

How to get more data from the server in each POST call by selenium?

I'm trying to get historical data from a public purchasing website.
The website has a table that shows pages of 20 records, but my code stops responding once it has iterated through more than 2 million records across the pages.
path_proceso = 'https://www.compraspublicas.gob.ec/ProcesoContratacion/compras/IC/buscarInfima.cpe'
headers_df = ['Nro', 'cod_Factura', 'FechaEmisiónFactura', 'CPC', 'Descripcion_CPC', 'Razon_Social',
              'Objeto_Compra', 'Cantidad', 'Costo_Unidad', 'Valor', 'Justificativo', 'Tipo_Compra',
              'Responsable_Administrativo']

driverInstance = webdriver.Chrome()
driverInstance.get(path_proceso)
driverInstance.maximize_window()
driverInstance.execute_script('document.getElementsByName("f_inicio")[0].removeAttribute("readonly")')
driverInstance.execute_script('document.getElementsByName("f_fin")[0].removeAttribute("readonly")')
datepicker_from_ini = driverInstance.find_element_by_xpath("//input[@id='f_inicio']").clear()
datepicker_from_end = driverInstance.find_element_by_xpath("//input[@id='f_fin']").clear()
driverInstance.find_element_by_id("f_inicio").send_keys("2018-06-30")
driverInstance.find_element_by_id("f_fin").send_keys("2018-12-31")
driverInstance.find_elements_by_name("btnBuscar")[1].click()

soup_html = BeautifulSoup(driverInstance.page_source, 'lxml')
table_rows = soup_html.find("div", {"id": "divProcesos"}).contents[0].find_all('tr')
datalist = []
while True:
    for tr in table_rows:
        test_df = driverInstance.find_element_by_xpath("//div[@id='divProcesos']/table").get_attribute('outerHTML')
        df_data = pd.read_html(test_df)
        elem = driverInstance.find_element_by_xpath('//a[text()="Siguiente"]')
        driverInstance.find_element_by_xpath('//a[text()="Siguiente"]').click()
        table_rows = table_rows
        datalist.append(df_data)
driverInstance.quit()

result = pd.concat([pd.DataFrame(datalist[i]) for i in range(len(datalist))], ignore_index=True)
json_records = result.to_json(orient='records')
Do you know any way I could make this work?
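There is no accepted answer here, but one clear inefficiency in the snippet above is that the inner "for tr in table_rows" loop re-fetches and re-parses the entire 20-row table once per row. A hedged restructuring that parses each page exactly once before clicking "Siguiente"; it reuses driverInstance and the locators from the question, and the stop condition assumes the "Siguiente" link disappears on the last page (the site may instead disable it, so adjust as needed):

import pandas as pd
from selenium.common.exceptions import NoSuchElementException

datalist = []
while True:
    # Parse the currently visible 20-row page exactly once.
    table_html = driverInstance.find_element_by_xpath(
        "//div[@id='divProcesos']/table").get_attribute('outerHTML')
    datalist.extend(pd.read_html(table_html))   # read_html returns a list of DataFrames

    # Advance to the next page, or stop when the "Siguiente" link is gone.
    try:
        driverInstance.find_element_by_xpath('//a[text()="Siguiente"]').click()
    except NoSuchElementException:
        break

driverInstance.quit()
result = pd.concat(datalist, ignore_index=True)
json_records = result.to_json(orient='records')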

Scrape site with multiple links without "next" button using beautiful soup

I am very new to Python (three days in) and I have stumbled into a problem I can't solve with Google/YouTube. I want to scrape the National Governors Association website for background data on all US governors and save this into a CSV file.
I have managed to scrape a list of all governors, but to get more details I need to enter each governor's page individually and save the data. I have found code suggestions online which utilise a "next" button or the URL structure to loop over several pages. This website, however, does not have a next button and the URLs do not follow a loopable structure. So I am stuck.
I would very much appreciate any help I can get. I want to extract the info above the main text (Office Dates, School(s), etc. in the "address" tag) on each governor's page, for example in this one.
This is what I have got so far:
import bs4 as bs
import urllib.request
import pandas as pd

url = 'https://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0=10&endcac77e09-db17-41cb-9de0-687b843338d0=9999&pagesizecac77e09-db17-41cb-9de0-687b843338d0=10&militaryService=&higherOfficesServed=&religion=&lastName=&sex=Any&honors=&submit=Search&college=&firstName=&party=&inOffice=Any&biography=&warsServed=&'
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, "html.parser")

#dl list of all govs
dfs = pd.read_html(url, header=0)
for df in dfs:
    df.to_csv('governors.csv')

#dl links to each gov
table = soup.find('table', 'table table-striped table-striped')
links = table.findAll('a')
with open('governors_links.csv', 'w') as r:
    for link in links:
        r.write(link['href'])
        r.write('\n')
r.close()

#enter each gov page and extract data in the "address" tag(s)
#save this in a csv file
I'm assuming that you've got all the links in a list named links.
You can do this to get the data you want for all the Governors, one by one:
for link in links:
    r = urllib.request.urlopen(link).read()
    soup = bs.BeautifulSoup(r, 'html.parser')
    print(soup.find('h2').text)  # Name of Governor
    for p in soup.find('div', {'class': 'col-md-3'}).findAll('p'):
        print(p.text.strip())  # Office dates, address, phone, ...
    for p in soup.find('div', {'class': 'col-md-7'}).findAll('p'):
        print(p.text.strip())  # Family, school, birth state, ...
Edit:
Change your links list to
links = ['https://www.nga.org' + x.get('href') for x in table.findAll('a')]
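To go from printing to the CSV file the question asks for, one possible sketch that builds on this answer (the col-md-3 / col-md-7 selectors and the links list come from the answer above; the output file name and row layout are just illustrative):

import csv

rows = []
for link in links:                       # absolute links, built as in the Edit above
    sauce = urllib.request.urlopen(link).read()
    soup = bs.BeautifulSoup(sauce, 'html.parser')
    name = soup.find('h2').text.strip()
    left = [p.text.strip() for p in soup.find('div', {'class': 'col-md-3'}).findAll('p')]
    right = [p.text.strip() for p in soup.find('div', {'class': 'col-md-7'}).findAll('p')]
    rows.append([name] + left + right)   # rows can differ in length between governors

with open('governors_details.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)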
This may work. I haven't tested it out to full completion since I'm at work but it should be a starting point for you.
import bs4 as bs
import requests
import re

def is_number(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def main():
    url = 'https://www.nga.org/cms/FormerGovBios?inOffice=Any&state=Any&party=&lastName=&firstName=&nbrterms=Any&biography=&sex=Any&religion=&race=Any&college=&higherOfficesServed=&militaryService=&warsServed=&honors=&birthState=Any&submit=Search'
    sauce = requests.get(url).text
    soup = bs.BeautifulSoup(sauce, "html.parser")
    finished = False
    csv_data = open('Govs.csv', 'a')
    csv_data.write('Name,Address,OfficeDates,Success,Address,Phone,Fax,Born,BirthState,Party,Schooling,Email\n')
    try:
        while not finished:
            #dl links to each gov
            table = soup.find('table', 'table table-striped table-striped')
            links = table.findAll('a')
            for link in links:
                info_array = []
                gov = {}
                name = link.string
                gov_sauce = requests.get(r'https://nga.org' + link.get('href')).text
                gov_soup = bs.BeautifulSoup(gov_sauce, "html.parser")
                #print(gov_soup)
                office_and_stuff_info = gov_soup.findAll('address')
                for address in office_and_stuff_info:
                    infos = address.findAll('p')
                    for info in infos:
                        tex = re.sub(r'[^a-zA-Z\d:]', '', info.text)
                        tex = re.sub(r'\s+', ' ', info.text)
                        tex = tex.strip()
                        if tex:
                            info_array.append(tex)
                info_array = list(set(info_array))
                gov['Name'] = name
                secondarry_address = ''
                gov['Address'] = ''
                for line in info_array:
                    if 'OfficeDates:' in line:
                        gov['OfficeDates'] = line.replace('OfficeDates:', '').replace('-', '')
                    elif 'Succ' in line or 'Fail' in line:
                        gov['Success'] = line
                    elif 'Address' in line:
                        gov['Address'] = line.replace('Address:', '')
                    elif 'Phone:' in line or 'Phone ' in line:
                        gov['Phone'] = line.replace('Phone ', '').replace('Phone: ', '')
                    elif 'Fax:' in line:
                        gov['Fax'] = line.replace('Fax:', '')
                    elif 'Born:' in line:
                        gov['Born'] = line.replace('Born:', '')
                    elif 'Birth State:' in line:
                        gov['BirthState'] = line.replace('Birth State:', '')
                    elif 'Party:' in line:
                        gov['Party'] = line.replace('Party:', '')
                    elif 'School(s)' in line:
                        gov['Schooling'] = line.replace('School(s):', '').replace('School(s) ', '')
                    elif 'Email:' in line:
                        gov['Email'] = line.replace('Email:', '')
                    else:
                        secondarry_address = line
                gov['Address'] = gov['Address'] + secondarry_address
                data_line = gov['Name'] + ',' + gov['Address'] + ',' + gov['OfficeDates'] + ',' + gov['Success'] + ',' + gov['Address'] + ',' + gov['Phone'] + ',' + gov['Fax'] + ',' + gov['Born'] + ',' + gov['BirthState'] + ',' + gov['Party'] + ',' + gov['Schooling'] + ',' + gov['Email']
                csv_data.write(data_line + '\n')
            next_page_link = soup.find('ul', 'pagination center-blockdefault').find('a', {'aria-label': 'Next'})
            if 'disabled' in (next_page_link.parent.get('class') or []):
                finished = True
            else:
                url = r'https://nga.org' + next_page_link.get('href')
                sauce = requests.get(url).text
                soup = bs.BeautifulSoup(sauce, 'html.parser')
    except:
        print('Code failed.')
    finally:
        csv_data.close()

if __name__ == '__main__':
    main()

Scrape multiple pages with Beautiful soup

I am trying to scrape multiple pages of a URL, but I am only able to scrape the first page. Is there a way to get all the pages?
Here is my code.
from bs4 import BeautifulSoup as Soup
import urllib, requests, re, pandas as pd

pd.set_option('max_colwidth', 500)  # to remove column limit (Otherwise, we'll lose some info)
df = pd.DataFrame()
Comp_urls = ['https://www.indeed.com/jobs?q=Dell&rbc=DELL&jcid=0918a251e6902f97', 'https://www.indeed.com/jobs?q=Harman&rbc=Harman&jcid=4faf342d2307e9ed', 'https://www.indeed.com/jobs?q=johnson+%26+johnson&rbc=Johnson+%26+Johnson+Family+of+Companies&jcid=08849387e791ebc6', 'https://www.indeed.com/jobs?q=nova&rbc=Nova+Biomedical&jcid=051380d3bdd5b915']

for url in Comp_urls:
    target = Soup(urllib.request.urlopen(url), "lxml")
    targetElements = target.findAll('div', class_=' row result')

for elem in targetElements:
    comp_name = elem.find('span', attrs={'class': 'company'}).getText().strip()
    job_title = elem.find('a', attrs={'class': 'turnstileLink'}).attrs['title']
    home_url = "http://www.indeed.com"
    job_link = "%s%s" % (home_url, elem.find('a').get('href'))
    job_addr = elem.find('span', attrs={'class': 'location'}).getText()
    date_posted = elem.find('span', attrs={'class': 'date'}).getText()
    description = elem.find('span', attrs={'class': 'summary'}).getText().strip()
    comp_link_overall = elem.find('span', attrs={'class': 'company'}).find('a')
    if comp_link_overall != None:
        comp_link_overall = "%s%s" % (home_url, comp_link_overall.attrs['href'])
    else:
        comp_link_overall = None
    df = df.append({'comp_name': comp_name, 'job_title': job_title,
                    'job_link': job_link, 'date_posted': date_posted,
                    'overall_link': comp_link_overall, 'job_location': job_addr, 'description': description
                    }, ignore_index=True)

df
df.to_csv('path\\web_scrape_Indeed.csv', sep=',', encoding='utf-8')
Please suggest if there is any way.
Case 1: The code presented here is exactly what you have
Comp_urls = ['https://www.indeed.com/jobs?q=Dell&rbc=DELL&jcid=0918a251e6902f97', 'https://www.indeed.com/jobs?q=Harman&rbc=Harman&jcid=4faf342d2307e9ed', 'https://www.indeed.com/jobs?q=johnson+%26+johnson&rbc=Johnson+%26+Johnson+Family+of+Companies&jcid=08849387e791ebc6', 'https://www.indeed.com/jobs?q=nova&rbc=Nova+Biomedical&jcid=051380d3bdd5b915']

for url in Comp_urls:
    target = Soup(urllib.request.urlopen(url), "lxml")
    targetElements = target.findAll('div', class_=' row result')

for elem in targetElements:
The problem here is that targetElements is overwritten on every iteration of the first for loop, so by the time the second loop runs it only holds the results from the last URL.
To avoid this, indent the second for loop inside the first like so:
for url in Comp_urls:
    target = Soup(urllib.request.urlopen(url), "lxml")
    targetElements = target.findAll('div', class_=' row result')

    for elem in targetElements:
Case 2: Your bug is not a result of improper indentation (i.e. not like what is in your original post)
If your code is properly indented, then it may be the case that targetElements is an empty list. This means target.findAll('div', class_=' row result') does not return anything. In that case, visit the sites, inspect the DOM, and then modify your scraping program.
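For completeness, a hedged sketch of the question's loop with the Case 1 indentation fix applied; the selectors and column names come from the question (Indeed's markup may well have changed since), and note that DataFrame.append is deprecated in recent pandas:

import urllib.request
import pandas as pd
from bs4 import BeautifulSoup as Soup

df = pd.DataFrame()
for url in Comp_urls:                                  # Comp_urls as defined in the question
    target = Soup(urllib.request.urlopen(url), "lxml")
    targetElements = target.findAll('div', class_=' row result')

    for elem in targetElements:                        # now runs once per URL's results
        comp_name = elem.find('span', attrs={'class': 'company'}).getText().strip()
        job_title = elem.find('a', attrs={'class': 'turnstileLink'}).attrs['title']
        df = df.append({'comp_name': comp_name, 'job_title': job_title},
                       ignore_index=True)

df.to_csv('web_scrape_Indeed.csv', sep=',', encoding='utf-8')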
