I'm trying to figure out how to make the objects I create inside a function appear in the same namespace as my if __name__ == '__main__': block, so I can use them in other functions. I don't think I can use a simple return statement in my def, because my function is designed to create n objects of a particular class.
def book_to_object(subfolders):
    # converts my dictionary of book titles and volumes into individual book objects
    for key, val in subfolders.items():
        title = replace_spaces_with_underscores(key)
        # check to see if book already exists in the registry
        if title in [book.name for book in Book._registry]:
            print("Book already exists")
        else:
            exec(title + " = Book(subfolders, key, Fpath)")
I'm calling this function from my if __name__ == '__main__': block:
print("Done")
#converts my dictionary of book titles and volumes into individual book objects
book_to_object(subfolders)
Originally I had this code in the if __name__ == '__main__': area, but I wanted to reduce the clutter and move it into a function to give me more flexibility later.
The whole program is below.
import os

#get date modified time of a file to determine which file is the newest
def get_date_modified(file):
    return os.path.getmtime(file)

#replace spaces with underscores in a string
def replace_spaces_with_underscores(string):
    aa = string.replace(" ", "_")
    bb = aa.replace('-', '_')
    cc = bb.replace("__", "_")
    title = cc.replace("__", "_")
    return title

# create a list of volumes that have not been exported
def get_unexported_volumes(self):
    unexported_volumes = []
    for volume in self.volumes:
        if volume not in self.last_exported:
            unexported_volumes.append(volume)
    return unexported_volumes

def book_to_object(subfolders):
    # converts my dictionary of book titles and volumes into individual book objects
    for key, val in subfolders.items():
        title = replace_spaces_with_underscores(key)
        # check to see if book already exists in the registry
        if title in [book.name for book in Book._registry]:
            print("Book already exists")
        else:
            exec(title + " = Book(subfolders, key, Fpath)")

class Book(object):
    #__metaclass__ = IterBook
    _registry = []

    #constructor
    def __init__(self, my_dict, key, Fpath):
        self._registry.append(self)
        self.name = key
        self.volumes = my_dict[key]
        self.Filepath = Fpath + "/" + self.name
        #self.last_exported = self.volumes[0]
        self.last_exported = ""
        self.newest = self.volumes[-1]
        self.last_downloaded = self.volumes[-1]
        self.last_converted = self.volumes[-1]
        self.last_exported_date = get_date_modified(self.Filepath + "/" + self.last_exported)
        self.newest_date = get_date_modified(self.Filepath + "/" + self.newest)
        self.last_downloaded_date = get_date_modified(self.Filepath + "/" + self.last_downloaded)
        self.last_converted_date = get_date_modified(self.Filepath + "/" + self.last_converted)
        self.last_exported_volume = self.last_exported
        self.newest_volume = self.newest
        self.last_downloaded_volume = self.last_downloaded
        self.last_converted_volume = self.last_converted
        self.last_exported_volume_date = self.last_exported_date
        self.newest_volume_date = self.newest_date
        self.last_downloaded_volume_date = self.last_downloaded_date
        self.last_converted_volume_date = self.last_converted_date
        self.last_exported_volume_name = self.last_exported
        self.newest_volume_name = self.newest
        self.unexported_volumes = get_unexported_volumes(self)

if __name__ == '__main__':
    print("Starting")
    # File paths for debugging purposes
    Fpath = '~/'
    F_Temp_path = '~/temp'
    #parses directory for book based on folder names
    subfolders = {'Encyclopedia': ['Vol.001', 'Vol.002', 'Vol.003', 'Vol.004', 'Vol.005'], 'Enginnering_Encyclopedia': ['Avionics', 'Civil', 'Electrical', 'Materials', 'Mechanical']}
    print("Done")
    #converts my dictionary of book titles and volumes into individual book objects
    book_to_object(subfolders)
Notes:
In case anyone is wondering why I'm using objects instead of keeping everything in dictionaries: I need the flexibility that objects give.
Variables such as file_paths and subfolders are dynamic, based on the directory the user gives and the files in it; for the purpose of asking this question they have been hard-coded, so someone can copy and paste my code and reproduce the problem.
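For reference, one way to avoid exec entirely here is to have book_to_object build and return a dictionary of Book objects keyed by title; the returned dictionary lives in the caller's namespace and can be passed to other functions. A minimal sketch, assuming the same Book class as in the program above (Fpath is passed in explicitly here rather than read as a global):

def book_to_object(subfolders, Fpath):
    # converts my dictionary of book titles and volumes into individual book objects,
    # collecting them in a dict instead of exec'ing dynamically named variables
    books = {}
    for key, val in subfolders.items():
        title = replace_spaces_with_underscores(key)
        # check to see if book already exists in the registry
        if title in [book.name for book in Book._registry]:
            print("Book already exists")
        else:
            books[title] = Book(subfolders, key, Fpath)
    return books

Called from the main block as books = book_to_object(subfolders, Fpath), after which an individual book is reachable as, e.g., books['Encyclopedia'].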
Related
I am currently having an issue where I am trying to store data in a list (using dataclasses). When I print the data inside the list from within the function (PullIncursionData()), it shows a set of numbers (never the same twice, which is expected given the data's nature). But when I store the function's return value in a variable and print from that, it somehow prints only the same number.
I cannot share the numbers, as they update with EVE Online's API, so the only way is to run it locally and read the first list yourself.
The repository is Here: https://github.com/AtherActive/EVEAPI-Demo
Heads up! Inside main.py (the file with the issue; a snippet of code is down below) there are more functions. All functions from line 90 onward are important; the rest can be ignored for this question, as they do not interact with the other functions.
def PullIncursionData():
    # Pulls data from URL and converts it into JSON
    url = 'https://esi.evetech.net/latest/incursions/?datasource=tranquility'
    data = rq.get(url)
    jsData = data.json()
    # Init var to store incursions
    incursions = []
    # Set length for loop. yay
    length = len(jsData)
    # Every loop incursion data will be read by __parseIncursionData(). It then gets added to var incursions.
    for i in range(length):
        # Add data to var incursions.
        incursions.append(__parseIncursionData(jsData, i))
        # If dev mode, print some debug. Can be toggled in settings.py
        if settings.developerMode == 1:
            print(incursions[i].constellation_id)
    return incursions

# Basically parses the input data in a decent manner. No comments needed really.
def __parseIncursionData(jsData, i):
    icstruct = stru.Incursion
    icstruct.constellation_id = jsData[i]['constellation_id']
    icstruct.constellation_name = 'none'
    icstruct.staging = jsData[i]['staging_solar_system_id']
    icstruct.region_name = ResolveSystemNames(icstruct.constellation_id, 'con-reg')
    icstruct.status = jsData[i]['state']
    icstruct.systems_id = jsData[i]['infested_solar_systems']
    icstruct.systems_names = ResolveSystemNames(jsData[i]['infested_solar_systems'], 'system')
    return icstruct

# Resolves names for systems, regions and constellations. Still WIP.
def ResolveSystemNames(id, mode='constellation'):
    # init value
    output_name = 'none'
    # If constellation, pull data and find region name.
    if mode == 'con-reg':
        url = 'https://www.fuzzwork.co.uk/api/mapdata.php?constellationid={}&format=json'.format(id)
        data = rq.get(url)
        jsData = data.json()
        output_name = jsData[0]['regionname']
    # Pulls system name from Fuzzwork.co.uk.
    elif mode == 'system':
        # Convert output to a list.
        output_name = []
        length = len(id)
        # Pulls system name from Fuzzwork. Not that hard.
        for i in range(length):
            url = 'https://www.fuzzwork.co.uk/api/mapdata.php?solarsystemid={}&format=json'.format(id[i])
            data = rq.get(url)
            jsData = data.json()
            output_name.append(jsData[i]['solarsystemname'])
    return output_name

icdata = PullIncursionData()
print('external data check:')
length = len(icdata)
for i in range(length):
    print(icdata[i].constellation_id)
structures.py (custom file)
#dataclass
class Incursion:
    constellation_id = int
    constellation_name = str
    staging = int
    staging_name = str
    systems_id = list
    systems_names = list
    region_name = str
    status = str

    def ___init___(self):
        self.constellation_id = -1
        self.constellation_name = 'undefined'
        self.staging = -1
        self.staging_name = 'undefined'
        self.systems_id = []
        self.systems_names = []
        self.region_name = 'undefined'
        self.status = 'unknown'
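For comparison, a minimal sketch of how this structure would look as an actual dataclass: @dataclass is a decorator rather than a comment, and __init__ takes two underscores on each side (___init___ with three is never called by Python, so the assignments above land on shared class attributes). Note also that icstruct = stru.Incursion in main.py binds the class itself instead of creating an instance, which would make every list entry share the same values; icstruct = stru.Incursion() creates a fresh object each time.

from dataclasses import dataclass, field

@dataclass
class Incursion:
    constellation_id: int = -1
    constellation_name: str = 'undefined'
    staging: int = -1
    staging_name: str = 'undefined'
    systems_id: list = field(default_factory=list)   # mutable defaults need default_factory
    systems_names: list = field(default_factory=list)
    region_name: str = 'undefined'
    status: str = 'unknown'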
I have a function which searches for JSON files in a directory, parses each file, and writes the data to a database. My problem is writing to the database: it takes around 30 minutes. Any idea how I can speed up the database writes? I have a few quite big files to parse, but parsing is not the problem; it takes around 3 minutes. Currently I am using SQLite, but in the future I will change to PostgreSQL.
Here is my function:
def create_database():
    with transaction.atomic():
        directory = os.fsencode('data/web_files/unzip')
        for file in os.listdir(directory):
            filename = os.fsdecode(file)
            with open('data/web_files/unzip/{}'.format(filename.strip()), encoding="utf8") as f:
                data = json.load(f)
                cve_items = data['CVE_Items']
                for i in range(len(cve_items)):
                    database_object = DataNist()
                    try:
                        impact = cve_items[i]['impact']['baseMetricV2']
                        database_object.severity = impact['severity']
                        database_object.exp_score = impact['exploitabilityScore']
                        database_object.impact_score = impact['impactScore']
                        database_object.cvss_score = impact['cvssV2']['baseScore']
                    except KeyError:
                        database_object.severity = ''
                        database_object.exp_score = ''
                        database_object.impact_score = ''
                        database_object.cvss_score = ''
                    for vendor_data in cve_items[i]['cve']['affects']['vendor']['vendor_data']:
                        database_object.vendor_name = vendor_data['vendor_name']
                        for description_data in cve_items[i]['cve']['description']['description_data']:
                            database_object.description = description_data['value']
                        for product_data in vendor_data['product']['product_data']:
                            database_object.product_name = product_data['product_name']
                            database_object.save()
                            for version_data in product_data['version']['version_data']:
                                if version_data['version_value'] != '-':
                                    database_object.versions_set.create(version=version_data['version_value'])
My models.py:
class DataNist(models.Model):
    vendor_name = models.CharField(max_length=100)
    product_name = models.CharField(max_length=100)
    description = models.TextField()
    date = models.DateTimeField(default=timezone.now)
    severity = models.CharField(max_length=10)
    exp_score = models.IntegerField()
    impact_score = models.IntegerField()
    cvss_score = models.IntegerField()

    def __str__(self):
        return self.vendor_name + "-" + self.product_name

class Versions(models.Model):
    data = models.ForeignKey(DataNist, on_delete=models.CASCADE)
    version = models.CharField(max_length=50)

    def __str__(self):
        return self.version
I would appreciate any advice on how I can improve my code.
Okay, given the structure of the data, something like this might work for you.
This is standalone code aside from that .objects.bulk_create() call; as commented in the code, the two classes defined would actually be models within your Django app.
(By the way, you probably want to save the CVE ID as a unique field too.)
Your original code had the mistaken assumption that every "leaf entry" in the affected version data would have the same vendor, which may not be true. That's why the model structure here has a separate product-version model with vendor, product and version fields. (If you wanted to optimize things a little, you might deduplicate the AffectedProductVersions even across DataNists, which, as an aside, is not a perfect name for a model.)
And of course, as you had already done in your original code, the importing should be run within a transaction (transaction.atomic()).
Hope this helps.
import json
import os
import types

class DataNist(types.SimpleNamespace):  # this would actually be a model
    severity = ""
    exp_score = ""
    impact_score = ""
    cvss_score = ""

    def save(self):
        pass

class AffectedProductVersion(types.SimpleNamespace):  # this too
    # (foreign key to DataNist here)
    vendor_name = ""
    product_name = ""
    version_value = ""

def import_item(item):
    database_object = DataNist()
    try:
        impact = item["impact"]["baseMetricV2"]
    except KeyError:  # no impact object available
        pass
    else:
        database_object.severity = impact.get("severity", "")
        database_object.exp_score = impact.get("exploitabilityScore", "")
        database_object.impact_score = impact.get("impactScore", "")
        if "cvssV2" in impact:
            database_object.cvss_score = impact["cvssV2"]["baseScore"]
    for description_data in item["cve"]["description"]["description_data"]:
        database_object.description = description_data["value"]
        break  # only grab the first description
    database_object.save()  # save the base object
    affected_versions = []
    for vendor_data in item["cve"]["affects"]["vendor"]["vendor_data"]:
        for product_data in vendor_data["product"]["product_data"]:
            for version_data in product_data["version"]["version_data"]:
                affected_versions.append(
                    AffectedProductVersion(
                        data_nist=database_object,
                        vendor_name=vendor_data["vendor_name"],
                        product_name=product_data["product_name"],
                        version_value=version_data["version_value"],
                    )
                )
    AffectedProductVersion.objects.bulk_create(
        affected_versions
    )  # save all the version information
    return database_object  # in case the caller needs it

with open("nvdcve-1.0-2019.json") as infp:
    data = json.load(infp)
for item in data["CVE_Items"]:
    import_item(item)
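As a usage sketch of that last point about transactions (assuming this runs inside a configured Django app), the import loop at the bottom could be wrapped like so:

from django.db import transaction

with open("nvdcve-1.0-2019.json") as infp:
    data = json.load(infp)

with transaction.atomic():  # commit all imported rows in a single transaction
    for item in data["CVE_Items"]:
        import_item(item)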
I am making a script to postprocess some database.
I want my script to return the path and the name of a file when I enter some details (version, option, etc.), and it looks like this...
def file_info(version, something, option):
    ####### This part is the DB
    ## Version-path
    PATH_ver1 = '/ver1'
    PATH_something1 = '/something1'
    ## Name of files, there are a bunch of these
    DEF_ver2_something2 = '/name of file'
    DEF_ver2_something1_option4 = '/name of file'
    ####### Now the postprocessing starts
    ## path setting - other variables also follow this
    if version == 'ver1':
        PATH_VER = PATH_ver1
    elif version == 'ver2':
        PATH_VER = PATH_ver2
    ## Concatenating the paths
    PATH_BIN = PATH_TOP + PATH_VER + PATH_TYP + PATH_OPT
    ## Setting the file name
    BIN_FILE = 'DEF_' + version + '_' + something + '_' + option
    return PATH_BIN, BIN_FILE

def main():
    version = input("version here")
    something = input("something here")
    option = input("option here")
    output = file_info(version, something, option)
When I enter the details, I can get the path of the file correctly, but the file name I get is the name of the variable (the 'DEF_...' string) instead of its value, '/name of file'.
Also, since each item maps to two values rather than one, I didn't think I could use a dictionary: each item has one key (DEF_***) and two corresponding values (PATH_BIN and BIN_FILE). How can I solve this?
It sounds like what you need are nested dictionaries:
#!python3.6

####### This part is the DB
PATH_TOP = '/TOP/PATH'

## Version-path
PATH = {'ver1': '/ver1',
        'ver2': '/ver2',
        'something1': '/something1',
        'something2': '/something2',
        '': '',
        'opt1': '/opt1',
        'opt2': '/opt2',
        'opt3': '/opt3',
        'opt4': '/Totally/Different/PATH'
        }

## Name of files
DEF = {'ver1': {'something1': {'': 'v1s1',
                               'opt1': 'v1s1o1',
                               'opt2': 'v1s1o2'
                               },
                'something2': {'opt2': 'v1s2o2',
                               'opt3': 'v1s2o3'
                               }
                },
       'ver2': {'something1': {'opt1': 'v2s1o1',
                               'opt2': 'v2s1o2',
                               'opt4': 'v2s1o4'
                               },
                'something2': {'': 'v2s2'
                               }
                }
       }

def file_info(version, something, option):
    PATH_BIN = PATH_TOP + PATH[version] + PATH[something] + PATH[option]
    BIN_FILE = DEF[version][something][option]
    return PATH_BIN, BIN_FILE

def prompt(item, values):
    lst = "'" + "','".join(values) + "'"
    while True:
        selection = input(f'{item}({lst})? ')
        if selection in values:
            break
        print('not a choice')
    return selection

def main():
    version = prompt('Version', DEF)
    something = prompt('Something', DEF[version])
    option = prompt('Option', DEF[version][something])
    output = file_info(version, something, option)
    print(output)

if __name__ == '__main__':
    main()
Output:
C:\>test.py
Version('ver1','ver2')? ver1
Something('something1','something2')? some
not a choice
Something('something1','something2')? something2
Option('opt2','opt3')? opt2
('/TOP/PATH/ver1/something2/opt2', 'v1s2o2')
C:\>test.py
Version('ver1','ver2')? ver1
Something('something1','something2')? something1
Option('','opt1','opt2')?
('/TOP/PATH/ver1/something1', 'v1s1')
I am trying to extract raw data from a text file and, after processing the raw data, export it to another text file. Below is the Python code I have written for this process. I am using the petl package in Python 3. 'locations.txt' is the raw data file.
import glob, os
from petl import *

class ETL():
    def __init__(self, input):
        self.list = input

    def parse_P(self):
        personids = None
        for term in self.list:
            if term.startswith('P'):
                personids = term[1:]
        personid = personids.split(',')
        return personid

    def return_location(self):
        location = None
        for term in self.list:
            if term.startswith('L'):
                location = term[1:]
        return location

    def return_location_id(self, location):
        location = self.return_location()
        locationid = None

    def return_country_id(self):
        countryid = None
        for term in self.list:
            if term.startswith('C'):
                countryid = term[1:]
        return countryid

    def return_region_id(self):
        regionid = None
        for term in self.list:
            if term.startswith('R'):
                regionid = term[1:]
        return regionid

    def return_city_id(self):
        cityid = None
        for term in self.list:
            if term.startswith('I'):
                cityid = term[1:]
        return cityid

print(os.getcwd())
os.chdir("D:\ETL-IntroductionProject")
print(os.getcwd())

final_location = [['L', 'P', 'C', 'R', 'I']]
new_location = fromtext('locations.txt', encoding='Latin-1')

stored_list = []
for identifier in new_location:
    if identifier[0].startswith('L'):
        identifier = identifier[0]
        info_list = identifier.split('_')
        stored_list.append(info_list)

for lst in stored_list:
    tabling = ETL(lst)
    location = tabling.return_location()
    country = tabling.return_country_id()
    city = tabling.return_city_id()
    region = tabling.return_region_id()
    person_list = tabling.parse_P()
    for person in person_list:
        table_new = [location, person, country, region, city]
        final_location.append(table_new)

totext(final_location, 'l1.txt')
However, when I use the totext function of petl, it throws an AssertionError:
AssertionError: template is required
I am unable to understand what the fault is. Can someone please explain the problem I am facing and what I should be doing?
The template parameter to the totext function is not optional; there is no default format for how the rows are written, so you must provide a template. Check the docs for totext for an example: https://petl.readthedocs.io/en/latest/io.html#text-files
The template describes the format of each row that it writes out, using the field headers to name the fields; you can optionally pass in a prologue to write a header line too. A basic template in your case would be (note the trailing \n so each row ends with a newline):
table_new_template = "{L} {P} {C} {R} {I}\n"
totext(final_location, 'l1.txt', template=table_new_template)
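For instance, to also write a header line, petl's prologue parameter can be used; a sketch (the field names match the final_location header row above):

totext(final_location, 'l1.txt',
       template="{L} {P} {C} {R} {I}\n",
       prologue="L P C R I\n")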
I've written a quick little program to scrape book data off of a UNESCO website which contains information about book translations. The code is doing what I want it to, but by the time it's processed about 20 countries, it's using ~6GB of RAM. Since there are around 200 I need to process, this isn't going to work for me.
I'm not sure where all the RAM usage is coming from, so I'm not sure how to reduce it. I'm assuming that it's the dictionary that's holding all the book information, but I'm not positive. I'm not sure if I should simply make the program run once for each country, rather than processing the lot of them? Or if there's a better way to do it?
This is the first time I've written anything like this, and I'm a fairly novice, self-taught programmer, so please point out any significant flaws in the code, or any improvement tips you have that may not directly relate to the question at hand.
This is my code, thanks in advance for any assistance.
from __future__ import print_function
import urllib2, os
from bs4 import BeautifulSoup, SoupStrainer

''' Set list of countries and their code for niceness in explaining what
is actually going on as the program runs. '''
countries = {"AFG": "Afghanistan", "ALA": "Aland Islands", "DZA": "Algeria"}

'''List of country codes since dictionaries aren't sorted in any
way, this makes processing easier to deal with if it fails at
some point, mid run.'''
country_code_list = ["AFG", "ALA", "DZA"]

base_url = "http://www.unesco.org/xtrans/bsresult.aspx?lg=0&c="
destination_directory = "/Users/robbie/Test/"

only_restable = SoupStrainer(class_="restable")

class Book(object):
    def set_author(self, book):
        '''Parse the webpage to find author names. Finds last name, then
        first name of original author(s) and sets the Book object's
        Author attribute to the resulting string.'''
        authors = ""
        author_last_names = book.find_all('span', class_="sn_auth_name")
        author_first_names = book.find_all('span', attrs={
            'class': "sn_auth_first_name"})
        if author_last_names == []: self.Author = [" "]
        for author in author_last_names:
            try:
                first_name = author_first_names.pop()
                authors = authors + author.getText() + ', ' + \
                    first_name.getText()
            except IndexError:
                authors = authors + (author.getText())
        self.author = authors

    def set_quality(self, book):
        ''' Check to see if book page is using Quality, then set it if
        so.'''
        quality = book.find_all('span', class_="sn_auth_quality")
        if len(quality) == 0: self.quality = " "
        else: self.quality = quality[0].contents[0]

    def set_target_title(self, book):
        target_title = book.find_all('span', class_="sn_target_title")
        if len(target_title) == 0: self.target_title = " "
        else: self.target_title = target_title[0].contents[0]

    def set_target_language(self, book):
        target_language = book.find_all('span', class_="sn_target_lang")
        if len(target_language) == 0: self.target_language = " "
        else: self.target_language = target_language[0].contents[0]

    def set_translator_name(self, book):
        translators = ""
        translator_last_names = book.find_all('span', class_="sn_transl_name")
        translator_first_names = book.find_all('span', \
            class_="sn_transl_first_name")
        if translator_first_names == [] and translator_last_names == []:
            self.translators = " "
            return None
        for translator in translator_last_names:
            try:
                first_name = translator_first_names.pop()
                translators = translators + \
                    (translator.getText() + ',' \
                    + first_name.getText())
            except IndexError:
                translators = translators + \
                    (translator.getText())
        self.translators = translators

    def set_published_city(self, book):
        published_city = book.find_all('span', class_="place")
        if len(published_city) == 0:
            self.published_city = " "
        else: self.published_city = published_city[0].contents[0]

    def set_publisher(self, book):
        publisher = book.find_all('span', class_="place")
        if len(publisher) == 0:
            self.publisher = " "
        else: self.publisher = publisher[0].contents[0]

    def set_published_country(self, book):
        published_country = book.find_all('span', \
            class_="sn_country")
        if len(published_country) == 0:
            self.published_country = " "
        else: self.published_country = published_country[0].contents[0]

    def set_year(self, book):
        year = book.find_all('span', class_="sn_year")
        if len(year) == 0:
            self.year = " "
        else: self.year = year[0].contents[0]

    def set_pages(self, book):
        pages = book.find_all('span', class_="sn_pagination")
        if len(pages) == 0:
            self.pages = " "
        else: self.pages = pages[0].contents[0]

    def set_edition(self, book):
        edition = book.find_all('span', class_="sn_editionstat")
        if len(edition) == 0:
            self.edition = " "
        else: self.edition = edition[0].contents[0]

    def set_original_title(self, book):
        original_title = book.find_all('span', class_="sn_orig_title")
        if len(original_title) == 0:
            self.original_title = " "
        else: self.original_title = original_title[0].contents[0]

    def set_original_language(self, book):
        languages = ''
        original_languages = book.find_all('span', \
            class_="sn_orig_lang")
        for language in original_languages:
            languages = languages + language.getText() + ', '
        self.original_languages = languages

    def export(self, country):
        ''' Function to allow us to easily pull the text from the
        contents of the Book object's attributes and write them to the
        CSV file for the country in which the book was published.'''
        file_name = os.path.join(destination_directory + country + ".csv")
        with open(file_name, "a") as by_country_csv:
            print(self.author.encode('UTF-8') + " & " + \
                self.quality.encode('UTF-8') + " & " + \
                self.target_title.encode('UTF-8') + " & " + \
                self.target_language.encode('UTF-8') + " & " + \
                self.translators.encode('UTF-8') + " & " + \
                self.published_city.encode('UTF-8') + " & " + \
                self.publisher.encode('UTF-8') + " & " + \
                self.published_country.encode('UTF-8') + " & " + \
                self.year.encode('UTF-8') + " & " + \
                self.pages.encode('UTF-8') + " & " + \
                self.edition.encode('UTF-8') + " & " + \
                self.original_title.encode('UTF-8') + " & " + \
                self.original_languages.encode('UTF-8'), file=by_country_csv)
        by_country_csv.close()

    def __init__(self, book, country):
        ''' Initialize the Book object by feeding it the HTML for its
        row'''
        self.set_author(book)
        self.set_quality(book)
        self.set_target_title(book)
        self.set_target_language(book)
        self.set_translator_name(book)
        self.set_published_city(book)
        self.set_publisher(book)
        self.set_published_country(book)
        self.set_year(book)
        self.set_pages(book)
        self.set_edition(book)
        self.set_original_title(book)
        self.set_original_language(book)

def get_all_pages(country, base_url):
    ''' Create a list of URLs to be crawled by adding the ISO_3166-1_alpha-3
    country code to the URL and then iterating through the results every 10
    pages. Returns a string.'''
    base_page = urllib2.urlopen(base_url + country)
    page = BeautifulSoup(base_page, parse_only=only_restable)
    result_number = page.find_all('td', class_="res1", limit=1)
    if not result_number:
        return 0
    str_result_number = str(result_number[0].getText())
    results_total = int(str_result_number.split('/')[1])
    page.decompose()
    return results_total

def build_list(country_code_list, countries):
    ''' Build the list of all the books, and return a list of Book objects
    in case you want to do something with them in something else, ever.'''
    for country in country_code_list:
        print("Processing %s now..." % countries[country])
        results_total = get_all_pages(country, base_url)
        for url in range(results_total):
            if url % 10 == 0:
                all_books = []
                target_page = urllib2.urlopen(base_url + country \
                    + "&fr=" + str(url))
                page = BeautifulSoup(target_page, parse_only=only_restable)
                books = page.find_all('td', class_="res2")
                for book in books:
                    all_books.append(Book(book, country))
                page.decompose()
                for title in all_books:
                    title.export(country)
    return

if __name__ == "__main__":
    build_list(country_code_list, countries)
    print("Completed.")
I guess I'll just list off some of the problems or possible improvements in no particular order:
Follow PEP 8.
Right now, you've got lots of variables and functions named using camel case, like setAuthor. That's not the conventional style for Python; Python would typically name that set_author (and published_country rather than PublishedCountry, etc.). You can even change the names of some of the things you're calling: for one, BeautifulSoup supports findAll for compatibility, but find_all is recommended.
Besides naming, PEP 8 also specifies a few other things; for example, you'd want to rewrite this:
if len(resultNumber) == 0 : return 0
as this:
if len(result_number) == 0:
    return 0
or even taking into account the fact that empty lists are falsy:
if not result_number:
    return 0
Pass a SoupStrainer to BeautifulSoup.
The information you're looking for is probably in only part of the document; you don't need to parse the whole thing into a tree. Pass a SoupStrainer as the parse_only argument to BeautifulSoup. This should reduce memory usage by discarding unnecessary parts early.
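For example, using the restable class that the question's code already targets (html here stands for the raw page source):

only_restable = SoupStrainer(class_="restable")  # match only elements with class "restable"
page = BeautifulSoup(html, parse_only=only_restable)  # everything else is discarded during parsing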
decompose the soup when you're done with it.
Python primarily uses reference counting, so removing all circular references (as decompose does) should let that mechanism free up a lot of memory. Python also has a semi-traditional garbage collector to deal with circular references, but reference counting is much faster.
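In this script the pattern looks like the following (a sketch using names from the question's code):

page = BeautifulSoup(target_page, parse_only=only_restable)
books = page.find_all('td', class_="res2")
# ... build Book objects from books ...
page.decompose()  # break the tree's internal references so refcounting can reclaim it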
Don't make Book.__init__ write things to disk.
In most cases, I wouldn't expect just creating an instance of a class to write something to disk. Remove the call to export; let the user call export if they want it to be put on the disk.
Stop holding on to so much data in memory.
You're accumulating all this data into a dictionary just to export it afterwards. The obvious thing to do to reduce memory is to dump it to disk as soon as possible. Your comment indicates that you're putting it in a dictionary to be flexible; but that doesn't mean you have to collect it all in a list: use a generator, yielding items as you scrape them. Then the user can iterate over it just like a list:
for book in scrape_books():
    book.export()
…but with the advantage that at most one book will be kept in memory at a time.
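A minimal sketch of what such a generator could look like here, reusing the helpers from the question (the name scrape_books and the 10-results-per-page assumption are taken from the original loop; it is parameterized by country):

def scrape_books(country, base_url):
    '''Yield Book objects one page of results at a time.'''
    results_total = get_all_pages(country, base_url)
    for offset in range(0, results_total, 10):  # results arrive 10 per page
        target_page = urllib2.urlopen(base_url + country + "&fr=" + str(offset))
        page = BeautifulSoup(target_page, parse_only=only_restable)
        for book in page.find_all('td', class_="res2"):
            yield Book(book, country)
        page.decompose()  # free the parse tree before fetching the next page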
Use the functions in os.path rather than munging paths yourself.
Your code right now is rather fragile when it comes to path names. If I accidentally removed the trailing slash from destinationDirectory, something unintended happens. Using os.path.join prevents that from happening and deals with cross-platform differences:
>>> os.path.join("/Users/robbie/Test/", "USA")
'/Users/robbie/Test/USA'
>>> os.path.join("/Users/robbie/Test", "USA") # still works!
'/Users/robbie/Test/USA'
>>> # or say we were on Windows:
>>> os.path.join(r"C:\Documents and Settings\robbie\Test", "USA")
'C:\\Documents and Settings\\robbie\\Test\\USA'
Abbreviate attrs={"class":...} to class_=....
BeautifulSoup 4.1.2 introduces searching with class_, which removes the need for the verbose attrs={"class":...}.
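For example, with one of the spans in this scraper:

# verbose form
book.find_all('span', attrs={'class': 'sn_auth_first_name'})
# equivalent since BeautifulSoup 4.1.2
book.find_all('span', class_='sn_auth_first_name')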
I imagine there are even more things you can change, but that's quite a few to start with.
What do you want the book list for, in the end? You should export each book at the end of the "for url in range" block (inside it), and do without the all_books list. If you really need a list, define exactly which pieces of information you need, rather than keeping full Book objects.