Pandas: Write all re.search results to csv from BeautifulSoup

I have the beginnings of a Python pandas script that searches Google for each value and grabs any PDF links it can find on the first page of results.
I have two questions, listed below.
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
import re
df = pd.DataFrame(["Shakespeare", "Beowulf"], columns=["Search"])
print "Searching for PDFs ..."
hdr = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
"Accept-Encoding": "none",
"Accept-Language": "en-US,en;q=0.8",
"Connection": "keep-alive"}
def crawl(search):
    google = "http://www.google.com/search?q="
    url = google + search + "+" + "PDF"
    req = urllib2.Request(url, headers=hdr)
    pdf_links = None
    placeholder = None #just a column placeholder
    try:
        page = urllib2.urlopen(req).read()
        soup = BeautifulSoup(page)
        cite = soup.find_all("cite", attrs={"class":"_Rm"})
        for link in cite:
            all_links = re.search(r".+", link.text).group().encode("utf-8")
            if all_links.endswith(".pdf"):
                pdf_links = re.search(r"(.+)pdf$", all_links).group()
            print pdf_links
    except urllib2.HTTPError, e:
        print e.fp.read()
    return pd.Series([pdf_links, placeholder])
df[["PDF links", "Placeholder"]] = df["Search"].apply(crawl)
df.to_csv(FileName, index=False, delimiter=",")
The results from print pdf_links will be:
davidlucking.com/documents/Shakespeare-Complete%20Works.pdf
sparks.eserver.org/books/shakespeare-tempest.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
calhoun.k12.il.us/teachers/wdeffenbaugh/.../Shakespeare%20Sonnets.pdf
www.yorku.ca/inpar/Beowulf_Child.pdf
www.yorku.ca/inpar/Beowulf_Child.pdf
https://is.muni.cz/el/1441/.../2._Beowulf.pdf
https://is.muni.cz/el/1441/.../2._Beowulf.pdf
https://is.muni.cz/el/1441/.../2._Beowulf.pdf
https://is.muni.cz/el/1441/.../2._Beowulf.pdf
www.penguin.com/static/pdf/.../beowulf.pdf
www.neshaminy.org/cms/lib6/.../380/text.pdf
www.neshaminy.org/cms/lib6/.../380/text.pdf
sparks.eserver.org/books/beowulf.pdf
And the csv output will look like:
Search PDF Links
Shakespeare calhoun.k12.il.us/teachers/wdeffenbaugh/.../Shakespeare%20Sonnets.pdf
Beowulf sparks.eserver.org/books/beowulf.pdf
Questions:
1. Is there a way to write all of the results as rows to the csv instead of just the bottom one? And if possible, include the value in Search for each row that corresponds to "Shakespeare" or "Beowulf"?
2. How can I write out the full pdf links without long links automatically abbreviating with "..."?

This will get you all the proper pdf links using soup.find_all("a",href=True) and save them in a Dataframe and to a csv:
hdr = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
"Accept-Encoding": "none",
"Accept-Language": "en-US,en;q=0.8",
"Connection": "keep-alive"}
def crawl(columns=None, *search):
    df = pd.DataFrame(columns= columns)
    for term in search:
        google = "http://www.google.com/search?q="
        url = google + term + "+" + "PDF"
        req = urllib2.Request(url, headers=hdr)
        try:
            page = urllib2.urlopen(req).read()
            soup = BeautifulSoup(page)
            pdfs = []
            links = soup.find_all("a",href=True)
            for link in links:
                lk = link["href"]
                if lk.endswith(".pdf"):
                    pdfs.append((term, lk))
            df2 = pd.DataFrame(pdfs, columns=columns)
            df = df.append(df2, ignore_index=True)
        except urllib2.HTTPError, e:
            print e.fp.read()
    return df
df = crawl(["Search", "PDF link"],"Shakespeare","Beowulf")
df.to_csv("out.csv",index=False)
out.csv:
Search,PDF link
Shakespeare,http://davidlucking.com/documents/Shakespeare-Complete%20Works.pdf
Shakespeare,http://www.w3.org/People/maxf/XSLideMaker/hamlet.pdf
Shakespeare,http://sparks.eserver.org/books/shakespeare-tempest.pdf
Shakespeare,https://phillipkay.files.wordpress.com/2011/07/william-shakespeare-plays.pdf
Shakespeare,http://www.artsvivants.ca/pdf/eth/activities/shakespeare_overview.pdf
Shakespeare,http://triggs.djvu.org/djvu-editions.com/SHAKESPEARE/SONNETS/Download.pdf
Beowulf,http://www.yorku.ca/inpar/Beowulf_Child.pdf
Beowulf,https://is.muni.cz/el/1441/podzim2013/AJ2RC_STAL/2._Beowulf.pdf
Beowulf,http://teacherweb.com/IL/Steinmetz/MottramM/Beowulf---Seamus-Heaney.pdf
Beowulf,http://www.penguin.com/static/pdf/teachersguides/beowulf.pdf
Beowulf,http://www.neshaminy.org/cms/lib6/PA01000466/Centricity/Domain/380/text.pdf
Beowulf,http://www.sparknotes.com/free-pdfs/uscellular/download/beowulf.pdf
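For Python 3 and current pandas, the same idea can be sketched roughly as below. DataFrame.append was removed in pandas 2.0, so this version collects plain rows in a list and builds the frame once. Reading the href attribute, as the answer above does, also gives the full URL rather than the abbreviated display text. SAMPLE_PAGES and collect_pdf_rows are hypothetical names, and the inline HTML is a made-up stand-in for fetched result pages so the snippet runs without a network call.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for fetched result pages, keyed by search term.
SAMPLE_PAGES = {
    "Shakespeare": (
        '<a href="http://example.com/hamlet.pdf">Hamlet</a>'
        '<a href="http://example.com/about">About</a>'
    ),
    "Beowulf": '<a href="http://example.com/beowulf.pdf">Beowulf</a>',
}


def collect_pdf_rows(pages):
    """Return one row per PDF link, tagged with the search term that found it."""
    rows = []
    for term, html in pages.items():
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            # href holds the full URL, unlike the shortened text shown on the page
            if a["href"].lower().endswith(".pdf"):
                rows.append({"Search": term, "PDF link": a["href"]})
    # Build the frame once instead of appending inside the loop
    return pd.DataFrame(rows, columns=["Search", "PDF link"])


df = collect_pdf_rows(SAMPLE_PAGES)
# With no path, to_csv returns the CSV text; pass a path such as "out.csv" to write a file.
print(df.to_csv(index=False))
```

In a real run, the values of the dictionary would be the HTML returned by the HTTP request for each term.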

To get PDF links, you're looking for these selectors:
for result in soup.select('.tF2Cxc'):
    # check if PDF is present via according CSS class OR use try/except instead
    if result.select_one('.ZGwO7'):
        pdf_file = result.select_one('.yuRUbf a')['href']
See the CSS selectors reference. The SelectorGadget Chrome extension lets you grab CSS selectors by clicking on the desired element in your browser.
To save them to CSV, you're looking for this:
# store all links from a for loop
pdfs = []
# create PDF Link column and append PDF links from a pdfs list()
df = pd.DataFrame({'PDF Link': pdfs})
# save to csv and delete default pandas index column. Done!
df.to_csv('PDFs.csv', index=False)
Code and example in the online IDE (also shows how to save locally):
from bs4 import BeautifulSoup
import requests, lxml
import pandas as pd
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "best lasagna recipe:pdf"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
pdfs = []
for result in soup.select('.tF2Cxc'):
    # check if PDF is present via according CSS class
    if result.select_one('.ZGwO7'):
        pdf_file = result.select_one('.yuRUbf a')['href']
        pdfs.append(pdf_file)
# creates PDF Link column and appends PDF links from a pdfs list()
df = pd.DataFrame({'PDF Link': pdfs})
df.to_csv('Bs4_PDFs.csv', index=False)
-----------
# from CSV
'''
PDF Link
http://www.bakersedge.com/PDF/Lasagna.pdf
http://greatgreens.ca/recipes/Recipe%20-%20Worlds%20Best%20Lasagna.pdf
https://liparifoods.com/wp-content/uploads/2015/10/lipari-foods-holiday-recipes.pdf
...
'''
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that rather than creating everything from scratch, figuring out why certain things don't work as expected, and then maintaining it over time, all you need to do is iterate over structured JSON and get the data you want. The code may also be more readable, making it quicker to understand what is going on.
Code to integrate with your example:
from serpapi import GoogleSearch
import os
import pandas as pd
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "best lasagna recipe:pdf",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
pdfs = []
# iterate over organic results and check if .pdf file type exists in link
for result in results['organic_results']:
    if '.pdf' in result['link']:
        pdf_file = result['link']
        pdfs.append(pdf_file)
df = pd.DataFrame({'PDF Link': pdfs})
df.to_csv('SerpApi_PDFs.csv', index=False)
-----------
# from CSV
'''
PDF Link
http://www.bakersedge.com/PDF/Lasagna.pdf
http://greatgreens.ca/recipes/Recipe%20-%20Worlds%20Best%20Lasagna.pdf
https://liparifoods.com/wp-content/uploads/2015/10/lipari-foods-holiday-recipes.pdf
...
'''
Disclaimer, I work for SerpApi.

Related

getting an empty list when trying to extract urls from google with beautifulsoup

I am trying to extract the first 100 URLs returned by a location search on Google.
However, I am getting an empty list every time ("No search results found.").
import requests
from bs4 import BeautifulSoup
def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all("div", class_="r")
    websites = []
    if results:
        counter = 0
        for result in results:
            websites.append(result.find("a")["href"])
            counter += 1
            if counter == 100:
                break
    else:
        print("No search results found.")
    return websites
location = "Athens"
print(get_location_info(location))
No search results found.
[]
I have also tried this approach :
import requests
from bs4 import BeautifulSoup
def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    results = soup.find_all("div", class_="r")
    websites = [result.find("a")["href"] for result in results][:10]
    return websites
location = "sifnos"
print(get_location_info(location))
and I get an empty list. I think I am doing everything suggested in similar posts, but I still get nothing.

First of all, always take a look at your soup to see if all the expected ingredients are in place.
Select your elements more specifically, in this case for example with a CSS selector:
[a.get('href') for a in soup.select('a:has(>h3)')]
To avoid the consent banner, also send a cookie:
cookies={'CONSENT':'YES+'}
Example
import requests
from bs4 import BeautifulSoup
def get_location_info(location):
    query = location + " information"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
    }
    url = "https://www.google.com/search?q=" + query
    response = requests.get(url, headers=headers, cookies={'CONSENT':'YES+'})
    soup = BeautifulSoup(response.text, 'html.parser')
    websites = [a.get('href') for a in soup.select('a:has(>h3)')]
    return websites
location = "sifnos"
print(get_location_info(location))
Output
['https://www.griechenland.de/sifnos/', 'http://de.sifnos-greece.com/plan-trip-to-sifnos/travel-information.php', 'https://www.sifnosisland.gr/', 'https://www.visitgreece.gr/islands/cyclades/sifnos/', 'http://www.griechenland-insel.de/Hauptseiten/sifnos.htm', 'https://worldonabudget.de/sifnos-griechenland/', 'https://goodmorningworld.de/sifnos-griechenland/', 'https://de.wikipedia.org/wiki/Sifnos', 'https://sifnos.gr/en/sifnos/', 'https://www.discovergreece.com/de/cyclades/sifnos']
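To see why the original div.r lookup can come back empty while the a:has(> h3) selector still matches, here is a small offline sketch. The HTML is an invented stand-in for a result page (Google's real markup changes often), and extract_result_links is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented markup: result links wrap an <h3>, and no <div class="r"> exists.
SAMPLE_HTML = """
<div class="g"><a href="https://example.com/one"><h3>First result</h3></a></div>
<div class="g"><a href="https://example.com/two"><h3>Second result</h3></a></div>
<a href="https://example.com/footer">Footer link without a heading</a>
"""


def extract_result_links(html, limit=100):
    """Return hrefs of anchors that have an <h3> as a direct child."""
    soup = BeautifulSoup(html, "html.parser")
    links = [a.get("href") for a in soup.select("a:has(> h3)")]
    return links[:limit]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# The class-based lookup finds nothing because the class is absent from this markup.
print(len(soup.find_all("div", class_="r")))
# The structural selector still finds the two result links and skips the footer link.
print(extract_result_links(SAMPLE_HTML))
```

Selecting on structure (a link that wraps a heading) tends to survive markup changes better than selecting on generated class names, though neither is guaranteed to keep working.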

Scrape data with <Script type="text/javascript" using beautifulsoup

I'm building a web scraper to pull product data from a website. This particular company hides the price behind a "login for Price" banner, but the price is present in the HTML inside a <Script type="text/javascript" tag, and I'm unable to pull it out. The specific link that I'm testing is https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/
My current code is below; the last line is the one I'm using to pull the text out.
```
import requests
from bs4 import BeautifulSoup
import pandas as pd
baseurl="https://www.chadwellsupply.com/"
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
productlinks = []
for x in range (1,3):
    response = requests.get(f'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/?q=&filter=&clearedfilter=undefined&orderby=19&pagesize=24&viewmode=list&currenttab=products&pagenumber={x}&articlepage=')
    soup = BeautifulSoup(response.content,'html.parser')
    productlist = soup.find_all('div', class_="product-header")
    for item in productlist:
        for link in item.find_all('a', href = True):
            productlinks.append(link['href'])
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers = headers)
soup = BeautifulSoup(response.content,'html.parser')
print(soup.find('div',class_="product-title").text.strip())
print(soup.find('p',class_="status").text.strip())
print(soup.find('meta',{'property':'og:url'}))
print(soup.find('div',class_="tab-pane fade show active").text.strip())
print(soup.find('div',class_="Chadwell-Shared-Breadcrumbs").text.strip())
print(soup.find('script',{'type':'text/javascript'}).text.strip())
```
Below is the chunk of script from the website that I'm expecting it to pull (I tried to paste it directly here, but it wouldn't format correctly). What it gives me instead is
"window.dataLayer = window.dataLayer || [];"
HTML From website
Ideally I'd like to just pull the price out, but if I can at least get the whole chunk of data out, I can extract the price manually.

You can use re/json module to search/parse the HTML data (obviously, beautifulsoup cannot parse JavaScript - another option is to use selenium).
import re
import json
import requests
url = "https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/"
html_doc = requests.get(url).text
data = re.search(r"ga\('ec:addProduct', (.*?)\);", html_doc).group(1)
data = json.loads(data)
print(data)
Prints:
{
"id": "301078",
"name": 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE',
"category": "Stove/ Ranges",
"brand": "Hotpoint",
"price": "759",
}
Then for price you can do:
print(data["price"])
Prints:
759
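The same regex-then-JSON approach can be tried offline with a guard for the case where the pattern is not found. This is only a sketch: SAMPLE_HTML is a shortened, hypothetical stand-in modeled on the output printed above, and extract_product is an illustrative helper name.

```python
import json
import re

# Hypothetical page fragment; the real page may format this call differently.
SAMPLE_HTML = r"""
<script type="text/javascript">
window.dataLayer = window.dataLayer || [];
ga('ec:addProduct', {"id": "301078", "name": "HOTPOINT 24\" SPACESAVER ELECTRIC RANGE - WHITE", "brand": "Hotpoint", "price": "759"});
</script>
"""


def extract_product(html):
    """Return the object passed to ga('ec:addProduct', ...) as a dict, or None."""
    match = re.search(r"ga\('ec:addProduct', (.*?)\);", html)
    if match is None:
        # re.search returns None when the pattern is absent, so check before .group()
        return None
    return json.loads(match.group(1))


product = extract_product(SAMPLE_HTML)
print(product["price"] if product else "no product data found")
```

The None check matters because calling .group(1) directly raises AttributeError as soon as the site changes its script.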

A hacky alternative to regex is to select for a function in the scripts. In your case, the script contains function(i,s,o,g,r,a,m).
from bs4 import BeautifulSoup
import requests
import json
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
testlink = 'https://www.chadwellsupply.com/categories/appliances/Stove-Ranges/hotpoint-24-spacesaver-electric-range---white/'
response = requests.get(testlink, headers = headers)
soup = BeautifulSoup(response.content,'html.parser')
for el in soup.find_all("script"):
    if "function(i,s,o,g,r,a,m)" in el.text:
        scripttext = el.text
You can then select the data.
extracted = scripttext.split("{")[-1].split("}")[0]
my_json = json.loads("{%s}" % extracted)
print(my_json)
#{'id': '301078', 'name': 'HOTPOINT® 24" SPACESAVER ELECTRIC RANGE - WHITE', 'category': 'Stove/ Ranges', 'brand': 'Hotpoint', 'price': '759'}
Then get the price.
print(my_json["price"])
#759

How do I scrap all movie title, date and reviews on the website below? https://www.nollywoodreinvented.com/list-of-all-reviews

I have tried the code below, but it only fetches the first page and does not fully load the reviews for the movies. I am interested in getting all the movie titles, movie dates, and reviews.
from bs4 import BeautifulSoup
import requests
url = 'https://www.nollywoodreinvented.com/list-of-all-reviews'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.text, 'lxml')
movie_div = soup.find_all('div', class_='article-panel')
title=[]
for div in movie_div:
    images= div.find_all('div', class_='article-image-wrapper')
    for image in images:
        image = image.find_all('div', class_='article-image')
        for img in image:
            title.append(img.a.img['title'])
date =[]
for div in movie_div:
    date.append(div.find('div', class_='authorship type-date').text.strip())
info =[]
for div in movie_div:
    info.append(div.find('div', class_='excerpt-text').text.strip())
import pandas as pd
movie = pd.DataFrame({'title':title, 'date':date, 'info':info}, index=None)
movie.head()

There is a backend API that serves the HTML you are scraping. You can see it in action if you open your browser's Developer Tools, go to the Network tab, filter by Fetch/XHR, and click the link to the 2nd or 3rd page. We can recreate the POST request in Python like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
pages = 3
results_per_page = 500 #max 500 I think
headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
url = 'https://www.nollywoodreinvented.com/wp-admin/admin-ajax.php'
output = []
for page in range(1,pages+1):
    payload = {
        'action':'itajax-sort',
        'view':'grid',
        'loop':'main loop',
        'location':'',
        'thumbnail':'1',
        'rating':'1',
        'meta':'1',
        'award':'1',
        'badge':'1',
        'authorship':'1',
        'icon':'1',
        'excerpt':'1',
        'sorter':'recent',
        'columns':'4',
        'layout':'full',
        'numarticles':str(results_per_page),
        'largefirst':'',
        'paginated':str(page),
        # a dict keeps only the last value of a repeated key, so both category IDs go in one list
        'currentquery[category__in][]':['2648', '2649']
    }
    resp = requests.post(url,headers=headers,data=payload).json()
    print(f'Scraping page: {page} - results: {results_per_page}')
    soup = BeautifulSoup(resp['content'],'html.parser')
    for film in soup.find_all('div',class_='article-panel'):
        try:
            title = film.find('h3').text.strip()
        except AttributeError:
            continue
        date = datetime.strptime(film.find('span',class_='date').text.strip(),"%B %d, %Y").strftime('%Y-%m-%d')
        likes = film.find('span',class_='numcount').text.strip()
        if not likes:
            likes = 0
        full_stars = [1 for _ in film.find_all('span',class_='theme-icon-star-full')]
        half_stars = [0.5 for _ in film.find_all('span',class_='theme-icon-star-half')]
        stars = (sum(full_stars)+ sum(half_stars))/2.0
        item = {
            'title':title,
            'date':date,
            'likes':likes,
            'stars':stars
        }
        output.append(item)
df= pd.DataFrame(output)
df.to_csv('nollywood_data.csv',index=False)
print('Saved to nollywood_data.csv')
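When a form field such as currentquery[category__in][] has to be sent more than once, note that a dict literal keeps only the last value for a repeated key. A minimal sketch of the difference, using only the standard library to show the encoded form body (requests also accepts a list value in data= and sends it as repeated fields); the two-field payload here is a cut-down illustration, not the full request:

```python
from urllib.parse import urlencode

# A repeated key in a dict literal is silently collapsed to its last value.
collapsed = {
    "currentquery[category__in][]": "2648",
    "currentquery[category__in][]": "2649",
}
print(collapsed)

# Giving the key a list value keeps both; doseq=True expands it into repeated fields.
payload = {"paginated": "1", "currentquery[category__in][]": ["2648", "2649"]}
print(urlencode(payload, doseq=True))
```

A list of (key, value) tuples is another way to express repeated fields if the order of the fields matters.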

How to write seperate functions in seperate py files and execute it using main.py without using concept of class

I am new to Python and have yet to learn OOP and classes. I thought I understood functions, but I am running into problems when calling functions from a different .py file.
The code below shows all of my functions as defined in main.py.
I want to split main.py into two additional files, data extraction.py and data processing.py.
I understand that this can be done using classes, but can it be done without classes as well?
I divided the code into two other files, but I am getting an error (please see the attached screenshot).
Please explain what I can do here!
main.py
import pandas as pd
import requests
from bs4 import BeautifulSoup
from configparser import ConfigParser
import logging
import data_extraction
config = ConfigParser()
config.read('config.ini')
logging.basicConfig(filename='logfile.log', level=logging.DEBUG,
format='%(asctime)s:%(lineno)d:%(name)s:%(levelname)s:%(message)s')
baseurl = config['configData']['baseurl']
sub_url = config['configData']['sub_url']
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
"Upgrade-Insecure-Requests": "1", "DNT": "1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate"
}
r = requests.get(baseurl, headers=headers)
status = r.status_code
soup = BeautifulSoup(r.content, 'html.parser')
model_links = []
all_keys = ['Model', 'Platform', 'Product Family', 'Product Line', '# of CPU Cores',
'# of Threads', 'Max. Boost Clock', 'Base Clock', 'Total L2 Cache', 'Total L3 Cache',
'Default TDP', 'Processor Technology for CPU Cores', 'Unlocked for Overclocking', 'CPU Socket',
'Thermal Solution (PIB)', 'Max. Operating Temperature (Tjmax)', 'Launch Date', '*OS Support']
# function to get the model links in one list from soup object(1st page extraction)
def get_links_in_list():
    for model_list in soup.find_all('td', headers='view-name-table-column'):
        # model_list = model_list.a.text - to get the model names
        model_list = model_list.a.get('href')
        # print(model_list)
        model_list = sub_url + model_list
        # print(model_list)
        one_link = model_list.split(" ")[0]
        model_links.append(one_link)
    return model_links
model_links = get_links_in_list()
logging.debug(model_links)
each_link_data = data_extraction()
print(each_link_data)
#all_link_data = data_processing()
#write_to_csv(all_keys)
data_extraction.py
import requests
from bs4 import BeautifulSoup
from main import baseurl
from main import all_keys
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36',
"Upgrade-Insecure-Requests": "1", "DNT": "1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5",
"Accept-Encoding": "gzip, deflate"
}
r = requests.get(baseurl, headers=headers)
status = r.status_code
soup = BeautifulSoup(r.content, 'html.parser')
model_links = []
# function to get data for each link from the website(2nd page extraction)
def data_extraction(model_links):
    each_link_data = []
    try:
        for link in model_links:
            r = requests.get(link, headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            specification = {}
            for key in all_keys:
                spec = soup.select_one(
                    f'.field__label:-soup-contains("{key}") + .field__item, .field__label:-soup-contains("{key}") + .field__items .field__item')
                # print(spec)
                if spec is None:
                    specification[key] = ''
                    if key == 'Model':
                        specification[key] = [i.text for i in soup.select_one('.page-title')]
                        specification[key] = specification[key][0:1:1]
                        # print(specification[key])
                else:
                    if key == '*OS Support':
                        specification[key] = [i.text for i in spec.parent.select('.field__item')]
                    else:
                        specification[key] = spec.text
            specification['link'] = link
            each_link_data.append(specification)
    except:
        print('Error occurred')
    return each_link_data
# print(each_link_data)
data processing.py
# function for data processing : converting the each link object into dataframe
def data_processing():
    all_link_data = []
    for each_linkdata_obj in each_link_data:
        # make the nested dictionary to normal dict
        norm_dict = dict()
        for key in each_linkdata_obj:
            if isinstance(each_linkdata_obj[key], list):
                norm_dict[key] = ','.join(each_linkdata_obj[key])
            else:
                norm_dict[key] = each_linkdata_obj[key]
        all_link_data.append(norm_dict)
    return all_link_data
# print(all_link_data)
all_link_data = data_processing()
# function to write dataframe data into csv
def write_to_csv(all_keys):
    all_link_df = pd.DataFrame.from_dict(all_link_data)
    all_link_df2 = all_link_df.drop_duplicates()
    all_link_df3 = all_link_df2.reset_index()
    # print(all_link_df3)
    all_keys = all_keys + ['link']
    all_link_df4 = all_link_df3[all_keys]
    # print(all_link_df4)
    all_link_df4.to_csv('final_data.csv')
write_to_csv(all_keys)

Move the existing functions (e.g. write_to_csv) to a different file, for example utility_functions.py, and import them in main.py using from utility_functions import write_to_csv. You can then use the function write_to_csv in main.py as
write_to_csv(all_keys)
Edit
In the main.py file,
use from data_extraction import data_extraction instead of import data_extraction. With a plain import data_extraction, the name refers to the module, so data_extraction() tries to call the module itself.
In the data_extraction.py file,
remove the lines
from main import baseurl
from main import all_keys
because main.py and data_extraction.py would otherwise import each other. Removing them will raise an undefined-variable error, which you can fix by passing the values as arguments in the function call.
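As a rough sketch of the "pass the values as arguments" idea, the helper below takes its input as a parameter instead of importing anything from main.py. It is written as one runnable file, with comments marking where each part would live after the split; the sample record is invented for illustration.

```python
# --- data_processing.py (hypothetical layout) ---
def data_processing(each_link_data):
    """Flatten list values so every record maps keys to plain strings."""
    all_link_data = []
    for record in each_link_data:
        flat = {}
        for key, value in record.items():
            flat[key] = ",".join(value) if isinstance(value, list) else value
        all_link_data.append(flat)
    return all_link_data


# --- main.py ---
# After splitting the files, this import would be active:
# from data_processing import data_processing
def main():
    # Invented stand-in for the list that data_extraction(model_links) would return.
    each_link_data = [
        {
            "Model": ["Example CPU"],
            "*OS Support": ["Windows", "Linux"],
            "link": "https://example.com/cpu",
        },
    ]
    return data_processing(each_link_data)


if __name__ == "__main__":
    print(main())
```

Because the helper module never imports from main.py, the import only goes in one direction and the circular-import problem does not arise.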

Beautiful soup returns empty array

I'm using beautiful soup to find the first hit from a google search.
Looking for "Stack Overflow" it should find https://www.stackoverflow.com
The code is mainly taken from here. However, it suddenly stopped working, with results[0] being index out of range.
print results[0] IndexError: list index out of range
I suspect it's a cache problem as it was working fine and then stopped without changing the code. I've also rebooted and cleared the cache but still no results.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import webbrowser # for webrowser, duh!
import re
#------------------------------------------------
def write_it(s, f):
    # w for over write
    file = open(f, "w")
    file.write(s)
    file.close()
#------------------------------------------------
def URL_encode_space(s):
    return re.sub(r"\s", "%20", s)
#------------------------------------------------
def URL_decode_space(s):
    return re.sub(r"%20", " ", s)
#------------------------------------------------
urlBase = "https://google.com"
searchRequest = "Stack Overflow"
print searchRequest
searchRequest = URL_encode_space(searchRequest)
# String literal for HTML quote
q = "%22" # is a "
numOfResults = 10
myURL = urlBase + "/search?q=" + q + searchRequest + q + "&num={" + str(numOfResults) + "}"
page = requests.get(myURL)
soup = BeautifulSoup(page.text, "html.parser")
links = soup.findAll("a")
results = []
for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and not "webcache" in link_href:
        print (link.get('href').split("?q=")[1].split("&sa=U")[0])
        results.append(link.get('href').split("?q=")[1].split("&sa=U")[0])
print results[0]
# open web browser?
webbrowser.open(myURL)
I can obviously check the 'len(results)' to remove the error but that doesn't explain why it no longer works.

As others have said above, it isn't clear what is causing the problem.
Make sure you're sending a user agent.
I took this code from my other answer (scraping headings, summaries, and links from Google search results).
Code and full example:
from bs4 import BeautifulSoup
import requests
import json
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')
summary = []
for container in soup.findAll('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']
    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })
print(json.dumps(summary, indent=2, ensure_ascii=False))
Alternatively, you can use Google Organic Results API from SerpApi to get these results.
It's a paid API with a free trial.
Part of JSON:
{
"position": 1,
"title": "Java | Oracle",
"link": "https://www.java.com/",
"displayed_link": "https://www.java.com",
"snippet": "Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ..."
}
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "stackoverflow",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(f"Link: {result['link']}")
Output:
Link: https://stackoverflow.com/
Link: https://en.wikipedia.org/wiki/Stack_Overflow
Link: https://stackoverflow.blog/
Link: https://stackoverflow.blog/podcast/
Link: https://www.linkedin.com/company/stack-overflow
Link: https://www.crunchbase.com/organization/stack-overflow
Disclaimer, I work for SerpApi.
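Independently of the user agent, the URL assembled in the question deserves a second look: "&num={" + str(numOfResults) + "}" yields &num={10}, braces included, which is probably not what was intended. A minimal sketch of building the query string with the standard library instead; build_search_url is a hypothetical helper name, and whether Google honours the num parameter is not something this snippet can confirm.

```python
from urllib.parse import urlencode


def build_search_url(query, num_results=10):
    """Build a search URL with the query wrapped in double quotes and URL-encoded."""
    params = {"q": f'"{query}"', "num": num_results}
    # urlencode escapes the quotes as %22 and the space as +
    return "https://www.google.com/search?" + urlencode(params)


print(build_search_url("Stack Overflow"))
```

This replaces the hand-written %22 and %20 substitutions, so other special characters in the search text are escaped as well.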
