Saving output URLs to a file, one per line - Python

I am trying to extract data from a website and have the following code, which extracts all URLs from the main category and its subcategory links.
I am now stuck on saving the extracted output with a line separator (each URL on its own line) to a file, Medical.tsv.
Need help on this.
Code is given below:
from bs4 import BeautifulSoup
import requests
import time
import random

def write_to_file(file, mode, data, newline=None, with_tab=None):
    with open(file, mode, encoding='utf-8') as l:
        if with_tab == True:
            data = ''.join(data)
        if newline == True:
            data = data + '\n'
        l.write(data)

def get_soup(url):
    return BeautifulSoup(requests.get(url).content, "lxml")

url = 'http://www.medicalexpo.com/'
soup = get_soup(url)

raw_categories = soup.select('div.univers-main li.category-group-item a')
category_links = {}
for cat in raw_categories:
    t0 = time.time()
    response_delay = time.time() - t0  # wait 10x longer than it took them to respond
    time.sleep(10 * response_delay)    # so if the site gets overwhelmed and slows down, the code backs off automatically
    time.sleep(random.randint(2, 5))   # random pause of 2-5 secs so the crawl looks more like a human than a bot
    soup = get_soup(cat['href'])
    links = soup.select('#category-group li a')
    category_links[cat.text] = [link['href'] for link in links]
print(category_links)

You wrote the write_to_file function, but you never call it. mode has to be 'w' or 'w+' (if you want to overwrite when the file already exists), or 'a' if you want to append.
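For example, something along these lines would append one URL per line to Medical.tsv (a sketch reusing your own function; 'a' mode appends, so delete or truncate the file first if you rerun it):

# after the scraping loop has filled category_links
for category, links in category_links.items():
    for link in links:
        # with_tab=True joins the pieces into one string, newline=True adds the line separator
        write_to_file('Medical.tsv', 'a', [category, '\t', link], newline=True, with_tab=True)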

Related

Scraping large amount of data with beautifulsoup, process being killed?

There are exactly 100 items per page. I'm assuming it is some type of memory limit that's causing it to be killed. Also I have a feeling appending the items to a list variable is most likely not best practice when it comes to memory efficiency. Would opening a text file and writing to it be better? I've done a test with 10 pages and it creates the list successfully taking about 12 seconds to do so. When I try with 9500 pages however, the process gets automatically killed in about an hour.
import requests
from bs4 import BeautifulSoup
import timeit

def lol_scrape():
    start = timeit.default_timer()
    summoners_listed = []
    for i in range(9500):
        URL = "https://www.op.gg/leaderboards/tier?region=na&page=" + str(i+1)
        user_agent = {"user-agent": "..."}  # actual user-agent string elided in the post
        page = requests.get(URL, headers=user_agent)
        soup = BeautifulSoup(page.content, "html.parser")
        results = soup.find('tbody')
        summoners = results.find_all('tr')
        for i in range(len(summoners)):
            name = summoners[i].find('strong')
            summoners_listed.append(name.string)
    stop = timeit.default_timer()
    print('Time: ', stop - start)
    return summoners_listed
Credit to #1extralime
All I did was make a csv for every page instead of continually appending to one super long list.
import requests
from bs4 import BeautifulSoup
import timeit
import pandas as pd

def lol_scrape():
    start = timeit.default_timer()
    for i in range(6500):
        # Moved variable inside loop to reset it every iteration
        summoners_listed = []
        URL = "https://www.op.gg/leaderboards/tier?region=na&page=" + str(i+1)
        user_agent = {"user-agent": "..."}  # actual user-agent string elided in the post
        page = requests.get(URL, headers=user_agent)
        soup = BeautifulSoup(page.content, "html.parser")
        results = soup.find('tbody')
        summoners = results.find_all('tr')
        for x in range(len(summoners)):
            name = summoners[x].find('strong')
            summoners_listed.append(name.string)
        # Make a new df with the list values then save to a new csv
        df = pd.DataFrame(summoners_listed)
        df.to_csv('all_summoners/summoners_page' + str(i+1))
    stop = timeit.default_timer()
    print('Time: ', stop - start)
Also, as a note to my future self or anyone else reading: this method is far superior, because had the process failed at any time, I would still have all the successfully written CSVs and could restart where it left off.
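If you do need to restart, a small guard at the top of the loop is enough to skip pages that already have a CSV (a sketch; it assumes the same all_summoners/ file names as above):

import os

for i in range(6500):
    out_path = 'all_summoners/summoners_page' + str(i + 1)
    if os.path.exists(out_path):
        continue  # this page was already scraped on a previous run
    # ... scrape the page and write the CSV as in the function above ...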

Creating multiple text files with unique file names from scraped data

I took an introductory course in Python this semester and am now trying to do a project. However, I don't really know what code I should write to create multiple .txt files, each with a different title.
I scraped all the terms and definitions from the website http://www.hogwartsishere.com/library/book/99/. The title of the .txt file should, for example, be 'Aconite.txt', and the content of the file should be the term and its definition. Every term with its definition can be found in a separate p-tag, and the term itself is a b-tag within the p-tag. Can I use this to write my code?
I suppose I will need to use a for-loop for this, but I don't really know where to start. I searched StackOverflow and found several solutions, but all of them contain code I am not familiar with and/or relate to another issue.
This is what I have so far:
#!/usr/bin/env python
import os
import requests
import bs4

def download(url):
    r = requests.get(url)
    html = r.text
    soup = bs4.BeautifulSoup(html, 'html.parser')
    terms_definition = []
    #for item in soup.find_all('p'): # define this better
    items = soup.find_all("div", {"class": "font-size-16 roboto"})
    for item in items:
        terms = item.find_all("p")
        for term in terms:
            #print(term)
            if term.text is not 'None':
                #print(term.text)
                #print("\n")
                term_split = term.text.split()
                print(term_split)
            if term.text != None and len(term.text) > 1:
                if '-' in term.text.split():
                    print(term.text)
                    print('\n')
        if item.find('p'):
            terms_definition.append(item['p'])
    print(terms_definition)
    return terms_definition

def create_url(start, end):
    list_url = []
    base_url = 'http://www.hogwartsishere.com/library/book/99/chapter/'
    for x in range(start, end):
        list_url.append(base_url + str(x))
    return list_url

def search_all_url(list_url):
    for url in list_url:
        download(url)

#write data into separate text files. The word in front of the dash should be the title of the document,
#the term and definition should be the content of the text file
#all terms and definitions are in separate p-tags, the title is a b-tag within the p-tag
def name_term():
    pass  # not implemented yet

def text_files(name_term):
    path_write = os.path.join('data', name_term + '.txt')  # 'name_term' should be the scraped term
    with open(path_write, 'w') as f:
        f.write()  # content still to be filled in
#for loop? in front of dash = title / everything before and after dash = text (file content) / empty line = new file

if __name__ == '__main__':
    download('http://www.hogwartsishere.com/library/book/99/chapter/1')
    #list_url = create_url(1, 27)
    #search_all_url(list_url)
Thanks in advance!
You can iterate over all pages (1-27) to get their content, parse each page with bs4, and then save the results to files:
import requests
import bs4
import re

for i in range(1, 28):  # chapters 1-27 (range's end is exclusive)
    r = requests.get('http://www.hogwartsishere.com/library/book/99/chapter/{}/'.format(i)).text
    soup = bs4.BeautifulSoup(r, 'html.parser')
    items = soup.find_all("div", {"class": "font-size-16 roboto"})
    for item in items:
        terms = item.find_all("p")
        for term in terms:
            title = re.match('^(.*) -', term.text).group(1).replace('/', '-')
            with open(title + '.txt', 'w', encoding='utf-8') as f:
                f.write(term.text)
Output files: one .txt file per term, e.g. Aconite.txt.
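If you would rather keep the files in their own folder than in the working directory, a small variation works (a sketch; out_dir is a name I made up, not something from the original answer):

import os

out_dir = 'terms'
os.makedirs(out_dir, exist_ok=True)  # create the folder once; no error if it already exists

# inside the inner loop, write to out_dir instead of the working directory
path = os.path.join(out_dir, title + '.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(term.text)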

Building sequential urls with Beautifulsoup

I have a text file that has one number in it. I'm trying to have the script go to the website with that number appended to the URL, grab the info, then move on to the next URL in sequence, pull the info, and so on. If a number brings up a blank page, it should end the sequence and email out the information it gathered. I'm not getting any errors; it completes its run, but I'm not getting anything back or seeing any change in the number in the text file. I'm curious whether what I've got for this part of the program is correct, or if I'm missing something.
Here's what I've got
import requests
from bs4 import BeautifulSoup as bs

# URL_FORMAT and LICENSE_NUMBER_FILE are defined elsewhere in the program (not shown here)

# loads LIC# url
def get_page(license_number):
    url = URL_FORMAT.format(license_number)
    r = requests.get(url)
    return bs(r.text, 'lxml')

# looks for info that is missing when there is no license
def license_exists(soup):
    if soup.find('td', class_='style3'):
        return True
    else:
        return False

# pulls lic# from the text file license_number.txt
def get_current_license_number():
    with open(LICENSE_NUMBER_FILE, 'r') as f:
        return int(f.read())

# adds lic# to urls
def get_new_license_pages(curr_license_num):
    new_pages = []
    more = True
    curr_license_num += 1
    return new_pages
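For what it's worth, one way to finish that last function so it keeps going until a blank page could look like this (only a sketch; it assumes get_page() and license_exists() behave as in the snippet above, and it leaves the e-mail part out):

def get_new_license_pages(curr_license_num):
    new_pages = []
    more = True
    while more:
        curr_license_num += 1
        soup = get_page(curr_license_num)
        if license_exists(soup):
            new_pages.append(soup)   # keep the page for later processing
        else:
            more = False             # blank page: end of the sequence
    return new_pages

You would also need to write the last good number back to LICENSE_NUMBER_FILE afterwards if you expect the number in the text file to change between runs.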

How to parse Etherscan?

How can I parse every single page of ETH addresses from https://etherscan.io/token/generic-tokenholders2?a=0x6425c6be902d692ae2db752b3c268afadb099d3b&s=0&p=1 and then add them to a .txt file?
Okay, possibly off-topic, but I had a play around with this. (Mainly because I thought I might need something similar in the future to grab data that Etherscan's APIs don't return.)
The following Python 2 code will grab what you're after. There's a hacky sleep in there to get around what I think is either something to do with how quickly the pages load, or some rate limiting imposed by Etherscan. I'm not sure.
The data gets written to a .csv file; a plain text file wouldn't be much fun.
#!/usr/bin/env python
from __future__ import print_function

import os
import requests
from bs4 import BeautifulSoup
import csv
import time

RESULTS = "results.csv"
URL = "https://etherscan.io/token/generic-tokenholders2?a=0x6425c6be902d692ae2db752b3c268afadb099d3b&s=0&p="

def getData(sess, page):
    url = URL + page
    print("Retrieving page", page)
    return BeautifulSoup(sess.get(url).text, 'html.parser')

def getPage(sess, page):
    table = getData(sess, str(int(page))).find('table')
    return [[X.text.strip() for X in row.find_all('td')] for row in table.find_all('tr')]

def main():
    resp = requests.get(URL)  # (not used afterwards)
    sess = requests.Session()
    with open(RESULTS, 'wb') as f:
        wr = csv.writer(f, quoting=csv.QUOTE_ALL)
        wr.writerow(map(str, "Rank Address Quantity Percentage".split()))
        page = 0
        while True:
            page += 1
            data = getPage(sess, page)
            # Even pages that don't contain the data we're
            # after still contain a table.
            if len(data) < 4:
                break
            else:
                for row in data:
                    wr.writerow(row)
            time.sleep(1)

if __name__ == "__main__":
    main()
I'm sure it's not the best Python in the world.
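If you ever run it under Python 3, the only part that should need changing is the line that opens the CSV, since the csv module wants a text-mode file there (a sketch, untested):

# Python 3: open in text mode and let the csv module handle line endings
with open(RESULTS, 'w', newline='') as f:
    wr = csv.writer(f, quoting=csv.QUOTE_ALL)
    wr.writerow("Rank Address Quantity Percentage".split())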

How to extract data from all urls, not just the first

This script is generating a csv with data from only one of the urls fed into it. There are meant to be 98 sets of results; however, the for loop isn't getting past the first url.
I've been working on this for 12+ hours today. What am I missing in order to get the correct results?
import requests
import re
from bs4 import BeautifulSoup
import csv
#Read csv
csvfile = open("gyms4.csv")
csvfilelist = csvfile.read()
def get_page_data():
    for url in csvfilelist.splitlines():
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        return soup
        print r.text

pages = get_page_data()

name = pages.find("span",{"class":"wlt_shortcode_TITLE"}).text
address = pages.find("span",{"class":"wlt_shortcode_map_location"}).text
phoneNum = pages.find("span",{"class":"wlt_shortcode_phoneNum"}).text
email = pages.find("span",{"class":"wlt_shortcode_EMAIL"}).text

th = pages.find('b',text="Category")
td = th.findNext()
for link in td.findAll('a',href=True):
    match = re.search(r'http://(\w+).(\w+).(\w+)', link.text)
    if match:
        web_address = link.text

gyms = [name,address,phoneNum,email,web_address]
gyms.append(gyms)

#Saving specific listing data to csv
with open ("xgyms.csv", "w") as file:
    writer = csv.writer(file)
    for row in gyms:
        writer.writerow([row])
You have 3 for-loops in your code and do not specify which one causes the problem. I assume it is the one in the get_page_data() function.
You leave the loop in the very first run because of the return statement. That is why you never get to the second url.
There are at least two possible solutions:
Append every parsed url to a list and return that list.
Move your processing code into the loop and append the parsed data to gyms there.
As Alex.S said, get_page_data() returns on the first iteration, hence subsequent URLs are never accessed. Furthermore, the code that extracts data from the page needs to be executed for each page downloaded, so it needs to be in a loop too. You could turn get_page_data() into a generator and then iterate over the pages like this:
def get_page_data(urls):
    for url in urls:
        r = requests.get(url.strip())
        soup = BeautifulSoup(r.text, 'html.parser')
        yield soup # N.B. use yield instead of return

with open("gyms4.csv") as url_file:
    for page in get_page_data(url_file):
        name = page.find("span",{"class":"wlt_shortcode_TITLE"}).text
        address = page.find("span",{"class":"wlt_shortcode_map_location"}).text
        phoneNum = page.find("span",{"class":"wlt_shortcode_phoneNum"}).text
        email = page.find("span",{"class":"wlt_shortcode_EMAIL"}).text
        # etc. etc.
You can write the data to the CSV file as each page is downloaded and processed, or you can accumulate the data into a list and write it in one go with csv.writer.writerows().
Also you should pass the URL list to get_page_data() rather than accessing it from a global variable.
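Writing as you go could look roughly like this (a sketch built on the generator above; on Python 2 open the output file with 'wb' instead):

import csv

with open("gyms4.csv") as url_file, open("xgyms.csv", "w") as out_file:
    writer = csv.writer(out_file)
    for page in get_page_data(url_file):
        name = page.find("span", {"class": "wlt_shortcode_TITLE"}).text
        address = page.find("span", {"class": "wlt_shortcode_map_location"}).text
        phoneNum = page.find("span", {"class": "wlt_shortcode_phoneNum"}).text
        email = page.find("span", {"class": "wlt_shortcode_EMAIL"}).text
        writer.writerow([name, address, phoneNum, email])  # one row per gym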
