Here is the code:
import urllib.request
import re
import functools
import operator as op
from Stacks import Stack
from nary_tree import *

# Matches URLs under the activeingredients domain, optionally followed by
# /work, /about or /contact and a further path segment.
str_regex = r'(https?:\/\/)?([a-z]+\d\.)?([a-z]+\.)?activeingredients\.[a-z]+(/?(work|about|contact)?/?([a-zA-Z-]+)*)?/?'

url = 'http://www.activeingredients.com/'
s = set()        # URLs already visited
List = []        # collected page texts
url_list = []

def f_go(List, s, url):
    try:
        if url in s:
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
        # print(url)
        h = html.decode("utf-8")
        # Parse the page into an n-ary tree, then flatten it back into a string.
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst2 = nary_tree_tolist(ntr)
        lst3 = functools.reduce(op.add, lst2, [])
        str2 = ''.join(lst3)
        List.append(str2)
        # Extract in-domain links and recurse into each of them.
        f1 = re.finditer(str_regex, h)
        l1 = []
        for tok in f1:
            ind1 = tok.span()
            l1.append(h[ind1[0]:ind1[1]])
        for exp in l1:
            # Skip image links (.jpg / .png), follow everything else.
            if exp.endswith('jpg') or exp.endswith('png'):
                pass
            else:
                f_go(List, s, exp)
    except:
        return
Basically, it opens URLs recursively with urllib.request.urlopen, staying within a single domain (in this case activeingredients.com); link extraction from a page is done with a regular expression. After opening a page it parses it and appends the result to a list as a string. So what this is supposed to do is walk through a given domain, extract information (meaningful text, in this case), and add it to a list. The try/except block simply returns on any HTTP error (and every other error too, but that part is tested and working).
It works for this small page, for example, but for bigger sites it is extremely slow and eats memory.
The parsing and page preparation more or less do the right job, I believe.
The question is: is there an efficient way to do this? How do web search engines crawl through the network so fast?
First: I don't think Google's web crawler runs on one laptop or one PC, so don't worry if you can't get results like the big companies do.
Points to consider:
You could start with a big list of words that you can download from many websites. That sorts out some useless URL combinations. After that you could crawl with plain letter combinations to get the uselessly-named sites into your index as well.
You could start with a list of all domains registered on DNS servers, e.g. something like this: http://www.registered-domains-list.com
Use multiple threads (see the sketch below)
Have plenty of bandwidth
Consider buying Google's data center
These points are just ideas to give you a basic idea of how you could improve your crawler.
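To illustrate the multiple-threads point, here is a minimal, hedged sketch of a domain-restricted crawler built on concurrent.futures. It replaces the question's recursion with a breadth-first frontier handled in the main thread (so the visited set needs no lock); the link regex is deliberately simplified and the text-parsing step is left out.

import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor

START_URL = 'http://www.activeingredients.com/'
DOMAIN = 'activeingredients.com'
LINK_RE = re.compile(r'href="(https?://[^"]+)"')  # deliberately simplified

def fetch(url):
    # Download one page; return (url, html), or (url, "") on any error.
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, response.read().decode("utf-8", errors="replace")
    except Exception:
        return url, ""

def crawl(start_url, max_workers=8, max_pages=200):
    visited = {start_url}
    frontier = [start_url]
    pages = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier and len(pages) < max_pages:
            # Fetch the whole frontier in parallel, one level at a time.
            results = list(pool.map(fetch, frontier))
            frontier = []
            for url, html in results:
                if not html:
                    continue
                pages.append(html)
                for link in LINK_RE.findall(html):
                    # Naive in-domain check; skip images and already-seen URLs.
                    if DOMAIN in link and link not in visited \
                            and not link.endswith(('.jpg', '.png')):
                        visited.add(link)
                        frontier.append(link)
    return pages

pages = crawl(START_URL)
print(len(pages), "pages fetched")

Bandwidth and politeness (rate limiting, robots.txt) matter just as much as thread count; the max_pages cap here is only there to keep the sketch from running away.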
I was trying to break the code down to its simplest form before adding more variables and such. I'm stuck.
I want it so that when I use itertools, the first element is a permutation of tricks and the second element depends on that trick's landings() and is a permutation of the trick's corresponding landings. Later I want to add additional variables that branch further off landings() and so on.
The simplest form should print a list that looks like:
Backflip Complete
Backflip Hyper
180 Round Complete
180 Round Mega
Gumbi Complete
My Code:
from re import I
import pandas as pd
import numpy as np
import itertools
from io import StringIO

backflip = "Backflip"
one80round = "180 Round"
gumbi = "Gumbi"

tricks = [backflip, one80round, gumbi]

complete = "Complete"
hyper = "Hyper"
mega = "Mega"

backflip_landing = [complete, hyper]
one80round_landing = [complete, mega]
gumbi_landing = [complete]

def landings(tricks):
    if tricks == backflip:
        landing = backflip_landing
    elif tricks == one80round:
        landing = one80round_landing
    elif tricks == gumbi:
        landing = gumbi_landing
    return landing

for trik, land in itertools.product(tricks, landings(tricks)):
    trick_and_landing = (trik, land)
    result = (' '.join(trick_and_landing))
    tal = StringIO(result)
    tl = (pd.DataFrame((tal)))
    print(tl)
I get the error:
UnboundLocalError: local variable 'landing' referenced before assignment
Add a landing = "" after def landings(tricks): to get rid of the error.
But the if checks in your function are wrong. You check whether tricks, which is a list, is equal to backflip, etc., which are all strings. That's why none of the ifs are true and landing never gets a value assigned.
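As a minimal sketch of how to get the expected output once that is fixed: look the landings up per trick and loop over each trick individually, so the lookup always receives a single string. The dict used here simply stands in for the if/elif chain and is just one possible variant.

tricks = ["Backflip", "180 Round", "Gumbi"]

# Map each trick to its possible landings (replaces the if/elif chain).
landing_options = {
    "Backflip": ["Complete", "Hyper"],
    "180 Round": ["Complete", "Mega"],
    "Gumbi": ["Complete"],
}

for trick in tricks:
    for land in landing_options[trick]:
        print(trick, land)

This prints the five "trick landing" lines listed in the question.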
That question was also about permutations in Python. Maybe it helps.
I am writing some code to get a list of certain counties in Florida for a database. These counties are listed on a website, but each one is on an individual webpage. To make the collection process less tedious, I am writing a web scraper. I have gotten the links to all of the county websites. I have written code that inspects a website and finds the line that says "COUNTY:"; I then want to get that line's position so I can grab the actual county name on the next line. The only problem is that when I ask for the position, it says it can't be found. I know the line is in there, because when I ask my code to find it and return the line itself (not its placement), it doesn't come back empty. I will give some of the code for reference and an image of the problem.
Broken code:
links = ['https://www.ghosttowns.com/states/fl/acron.html', 'https://www.ghosttowns.com/states/fl/acton.html']
import requests
r = requests.get(links[1])
r = str(r.text)
r = r.split("\n")
county_before_location = [x for x in r if 'COUNTY' in x]
print(county_before_location)
print(r.index(county_before_location))
Returns:
[' </font><b><font color="#80ff80">COUNTY:</font><font color="#ffffff">'] is not in list
Code that shows the item:
links = ['https://www.ghosttowns.com/states/fl/acron.html', 'https://www.ghosttowns.com/states/fl/acton.html']
import requests
r = requests.get(links[1])
r = str(r.text)
r = r.split("\n")
county_before_location = [x for x in r if 'COUNTY' in x]
print(county_before_location)
Returns:
[' </font><b><font color="#80ff80">COUNTY:</font><font color="#ffffff">']
Photo
county_before_location is a list, and you are asking r for the index of that list, which is not in r. Instead, you need to ask for r.index(county_before_location[0]).
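For example, a small sketch of that fix, which then grabs the county name the question expects on the following line (same links and line splitting as above):

import requests

links = ['https://www.ghosttowns.com/states/fl/acron.html',
         'https://www.ghosttowns.com/states/fl/acton.html']

r = requests.get(links[1])
lines = str(r.text).split("\n")

county_before_location = [x for x in lines if 'COUNTY' in x]
if county_before_location:
    idx = lines.index(county_before_location[0])  # index of the first matching string, not of the list
    print(lines[idx + 1])                         # the line expected to hold the county name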
I'm working with an API that gives me 61 items, which I include in a Discord embed in a for loop.
As all of this is planned to go into a Discord bot using pagination from DiscordUtils, I need to make it build an embed for every 10 entries, to avoid a too-long message / 2000-character message.
Currently what I use to do my loop is here: https://api.nepmia.fr/spc/ (I recommend using a JSON-formatting extension for your browser, or it will be a bit hard to read)
But what I want to create is something that will look like this: https://api.nepmia.fr/spc/formated/
So I can iterate over each range in a different embed and then use pagination.
I use TinyDB to generate the JSON files shown above, with this script:
import urllib.request, json
from shutil import copyfile
from termcolor import colored
from tinydb import TinyDB, Query

db = TinyDB("/home/nepmia/Myazu/db/db.json")

def api_get():
    print(colored("[Myazu]", "cyan"), colored("Fetching WynncraftAPI...", "white"))
    try:
        with urllib.request.urlopen("https://api.wynncraft.com/public_api.php?action=guildStats&command=Spectral%20Cabbage") as u1:
            api_1 = json.loads(u1.read().decode())
            count = 0
            if members := api_1.get("members"):
                print(colored("[Myazu]", "cyan"),
                      colored("Got expected answer, starting saving process.", "white"))
                for member in members:
                    nick = member.get("name")
                    ur2 = f"https://api.wynncraft.com/v2/player/{nick}/stats"
                    u2 = urllib.request.urlopen(ur2)
                    api_2 = json.loads(u2.read().decode())
                    data = api_2.get("data")
                    for item in data:
                        meta = item.get("meta")
                        playtime = meta.get("playtime")
                        print(colored("[Myazu]", "cyan"),
                              colored("Saving playtime for player", "white"),
                              colored(f"{nick}...", "green"))
                        db.insert({"username": nick, "playtime": playtime})
                        count += 1
            else:
                print(colored("[Myazu]", "cyan"),
                      colored("Unexpected answer from WynncraftAPI [ERROR 1]", "white"))
    except:
        print(colored("[Myazu]", "cyan"),
              colored("Unhandled error in saving process [ERROR 2]", "white"))
    finally:
        print(colored("[Myazu]", "cyan"),
              colored("Finished saving data for", "white"),
              colored(f"{count}", "green"),
              colored("players.", "white"))
But this will only create a flat list like this: https://api.nepmia.fr/spc/
What I would like is something like this: https://api.nepmia.fr/spc/formated/
Thanks for your help!
PS: Sorry for your eyes, I'm still new to Python, so I know I don't do things really properly :s
To follow up from the comments, you shouldn't store items in your database in a format that is specific to how you want to return results from the database to a different API, as it will make it more difficult to query in other contexts, among other reasons.
If you want to paginate items from a database it's better to do that when you query it.
According to the docs, you can iterate over all documents in a TinyDB database just by iterating directly over the DB like:
for doc in db:
...
For any iterable you can use the enumerate function to associate an index to each item like:
for idx, doc in enumerate(db):
...
If you want the indices to start with 1 as in your examples you would just use idx + 1.
Finally, to paginate the results, you need some function that can return items from an iterable in fixed-sized batches, such as one of the many solutions on this question or elsewhere. E.g. given a function chunked(iter, size) you could do:
pages = enumerate(chunked(enumerate(db), 10))
Then list(pages) gives a list of lists of tuples like [(page_num, [(player_num, player), ...]), ...].
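As a point of reference, a chunked helper of that shape could be sketched like this (only one of many possible implementations; the linked question has plenty of alternatives):

from itertools import islice

def chunked(iterable, size):
    # Yield successive lists of at most `size` items from any iterable.
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch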
The only difference between a list of lists and what you want is that you seem to want a dictionary structure like
{'range1': {'1': {...}, '2': {...}, ...}, 'range2': {'11': {...}, ...}}
This is no different from a list of lists; the only difference is that you're using dictionary keys to give numerical indices to each item in a collection, rather than the indices being implicit in the list structure. There are many ways to go from a list of lists to this. The easiest, I think, is a (nested) dict comprehension:
{f'range{page_num + 1}': {str(player_num + 1): player for player_num, player in page}
for page_num, page in pages}
This will give output in exactly the format you want.
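Putting the pieces together, here is a hedged end-to-end sketch that reuses the chunked helper sketched above and substitutes a plain list of dicts for the TinyDB database, just to show the shape of the result:

# Stand-in for iterating over the TinyDB database (21 players, for illustration only).
docs = [{"username": f"player{i}", "playtime": i * 10} for i in range(1, 22)]

pages = enumerate(chunked(enumerate(docs), 10))
formatted = {
    f'range{page_num + 1}': {str(player_num + 1): player for player_num, player in page}
    for page_num, page in pages
}
print(formatted)  # {'range1': {'1': ..., '10': ...}, 'range2': {'11': ..., '20': ...}, 'range3': {'21': ...}}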
Thanks @Iguananaut for your precious help.
In the end I made something similar to your solution, using a generator.
import discord          # discord.py, needed for discord.Embed
from math import ceil   # needed for the page counter in the footer

def chunker(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def embed_creator(embeds):
    pages = []
    current_page = None
    for i, chunk in enumerate(chunker(embeds, 10)):
        current_page = discord.Embed(
            title=f'**SPC** Last week online time',
            color=3903947)
        for elt in chunk:
            current_page.add_field(
                name=elt.get("username"),
                value=elt.get("play_output"),
                inline=False)
        current_page.set_footer(
            icon_url="https://cdn.discordapp.com/icons/513160124219523086/a_3dc65aae06b2cf7bddcb3c33d7a5ecef.gif?size=128",
            text=f"{i + 1} / {ceil(len(embeds) / 10)}"
        )
        pages.append(current_page)
        current_page = None
    return pages
Using embed_creator I generate a list named pages that I can simply use with DiscordUtils paginator.
I'm building a scraper that needs to perform pretty fast over a large number of webpages. The result of the code below will be a csv file with a list of links (and other things).
Basically, I create a list of webpages that contain several links, and for each of these pages I collect those links.
Implementing multiprocessing leads to some weird results that I wasn't able to explain.
If I run this code setting the value of the pool to 1 (hence, without multithreading), I get a final result in which I have 0.5% of duplicated links (which is fair enough).
As soon as I speed it up by setting the value to 8, 12 or 24, I get around 25% duplicate links in the final results.
I suspect my mistake is in the way I write the results to the csv file, or in the way I use the imap() function (the same happens with imap_unordered, map, etc.), which somehow leads the threads to access the same elements of the iterable passed in. Any suggestions?
#!/usr/bin/env python
# coding: utf8
import sys
import requests, re, time
from bs4 import BeautifulSoup
from lxml import etree
from lxml import html
import random
import unicodecsv as csv
import progressbar
import multiprocessing
from multiprocessing.pool import ThreadPool

keyword = "keyword"

def openup():
    global crawl_list
    try:
        ### Generate list URLS based on the number of results for the keyword, each of these contains other links. The list is subsequently randomized
        startpage = 1
        ## Get endpage
        url0 = myurl0
        r0 = requests.get(url0)
        print "First request: " + str(r0.status_code)
        tree = html.fromstring(r0.content)
        endpage = tree.xpath("//*[@id='habillagepub']/div[5]/div/div[1]/section/div/ul/li[@class='adroite']/a/text()")
        print str(endpage[0]) + " pages found"
        ### Generate random sequence for crawling
        crawl_list = random.sample(range(1, int(endpage[0]) + 1), int(endpage[0]))
        return crawl_list
    except Exception as e:
        ### Catches openup error and return an empty crawl list, then breaks
        print e
        crawl_list = []
        return crawl_list

def worker_crawl(x):
    ### Open page
    url_base = myurlbase
    r = requests.get(url_base)
    print "Connecting to page " + str(x) + " ..." + str(r.status_code)
    while True:
        if r.status_code == 200:
            tree = html.fromstring(r.content)
            ### Get data
            titles = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/text()')
            links = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/h3/a/@href')
            abstracts = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/p/text()')
            footers = tree.xpath('//*[@id="habillagepub"]/div[5]/div/div[1]/section/article/div/div/span/text()')
            dates = []
            pagenums = []
            for f in footers:
                pagenums.append(x)
                match = re.search(r'\| .+$', f)
                if match:
                    date = match.group()
                    dates.append(date)
            pageindex = zip(titles, links, abstracts, footers, dates, pagenums)  # what if there is a missing value?
            return pageindex
        else:
            pageindex = [[str(r.status_code), "", "", "", "", str(x)]]
            return pageindex
        continue

def mp_handler():
    ### Write down:
    with open(keyword + '_results.csv', 'wb') as outcsv:
        wr = csv.DictWriter(outcsv, fieldnames=["title", "link", "abstract", "footer", "date", "pagenum"])
        wr.writeheader()
        results = p.imap(worker_crawl, crawl_list)
        for result in results:
            for x in result:
                wr.writerow({
                    # "keyword": str(keyword),
                    "title": x[0],
                    "link": x[1],
                    "abstract": x[2],
                    "footer": x[3],
                    "date": x[4],
                    "pagenum": x[5],
                })

if __name__ == '__main__':
    p = ThreadPool(4)
    openup()
    mp_handler()
    p.terminate()
    p.join()
Are you sure the page responds with the correct content in a fast sequence of requests? I have been in situations where the scraped site responded differently when the requests were fast versus when they were spaced out in time. Meaning, everything went perfectly while debugging, but as soon as the requests were fast and in sequence, the website decided to give me a different response.
Besides this, I would ask whether the fact that you are writing in a non-thread-safe environment might have an impact. To minimize interactions on the final CSV output and issues with the data, you might:
use wr.writerows with a chunk of rows to write
use a threading.Lock like here: Multiple threads writing to the same CSV in Python (a sketch follows below)
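A minimal sketch of those two suggestions combined, in Python 3 syntax with the standard csv module (the question's code uses Python 2 and unicodecsv, so treat this as the locking pattern only, not a drop-in fix):

import csv
import threading

csv_lock = threading.Lock()   # one lock shared by every thread that writes

def safe_writerows(writer, rows):
    # Only one thread at a time may touch the writer / underlying file handle.
    with csv_lock:
        writer.writerows(rows)

# Hypothetical usage: each worker collects its rows, then writes them in one call.
with open('results.csv', 'w', newline='') as outcsv:
    writer = csv.DictWriter(outcsv, fieldnames=["title", "link", "pagenum"])
    writer.writeheader()
    rows = [{"title": "t", "link": "l", "pagenum": 1}]   # placeholder data
    safe_writerows(writer, rows)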
I have a sitemap with about 21 URLs on it, and each of those URLs contains about 2000 more URLs. I'm trying to write something that will let me parse each of the original 21 URLs, grab the 2000 URLs each one contains, and append them to a list.
I've been bashing my head against a wall for a few days now trying to get this to work, but it keeps returning a list of 'None'. I've only been working with Python for about 3 weeks now, so I might be missing something really obvious. Any help would be great!
# Imports assumed from context; they were not shown in the original snippet.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

storage = []
storage1 = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)

storage2 = [parser(x) for x in storage]
I also tried using a while loop with a counter, but it always stopped after the first 2000 urls.
parser() never returns anything, so it defaults to returning None, which is why storage2 contains a list of Nones. Perhaps you want to look at what's in storage1?
If you don't declare a return for a function in Python, it automatically returns None. Inside parser you're adding elements to storage1 but aren't returning anything. I would give this a shot instead.
storage = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    storage1 = []
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)
    return storage1

storage2 = [parser(x) for x in storage]
EDIT: As Amber said, you should also see that all your elements were actually being stored in storage1.
If I understand your problem correctly, you have two stages in your program:
You generate the initial list of the 21 URLs.
You fetch the page at each of those URLs, and extract additional URLs from the page.
Your first step could look like this:
initial_urls = [('http://...%s...' % x) for x in range(21)]
Then, to populate the large list of URLs from the pages, you could do something like this:
big_list = []

def extract_urls(source):
    tree = ET.parse(urlopen(source))
    for link in get_links(tree):
        big_list.append(link.attrib['href'])

def get_links(tree):
    ...  # define the logic for link extraction here

for url in initial_urls:
    extract_urls(url)

print big_list
Note that you'll have to write the procedure that extracts the links from the document yourself.
Hope this helps!
You have to return storage1 in the parser function
def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)
    return storage1
I think this is what you want.