Django version 1.9, DB backend: sqlite3.
I am having a hard time figuring out how to handle this error. I am importing the master bird species list (available here) into a set of Django models. The import was going well, but it crashes when I try to save the value Rüppell's Vulture into the model. The target field is defined like this:
species_english = models.CharField(max_length=100, default=None, blank=True, null=True)
Here is the error:
ProgrammingError: You must not use 8-bit bytestrings unless you use a
text_factory that can interpret 8-bit bytestrings (like text_factory =
str). It is highly recommended that you instead just switch your
application to Unicode strings.
I was reading through Django's documentation about unicode strings. Which starts off beautifully like this:
Django natively supports Unicode data everywhere. Providing your
database can somehow store the data, you can safely pass around
Unicode strings to templates, models and the database.
Also, looking up information about this character, ü: it has representations in both Unicode and UTF-8.
The method for saving this string to the DB is very straightforward; I am simply parsing the CSV file using csv.reader:
new_species = Species(genus=new_genus, species=row[4], species_english=row[7])
Where the error-throwing string is contained in row[7]. What am I missing about why the database will not allow this character?
UPDATE
Here is the content of the whole script importing the data:
import csv
from birds.models import SpeciesFile, Order, Family, Genus, Species, Subspecies

csv_file = str(SpeciesFile.objects.all()[0].species_list)

# COLUMNS
# 0 - Order
# 1 - Family Scientific
# 2 - Family (English)
# 3 - Genus
# 4 - Species
# 5 - SubSpecies

with open("birds/media/" + csv_file.split('/')[1], 'rU') as c:
    Order.objects.all().delete()
    Family.objects.all().delete()
    Genus.objects.all().delete()
    Species.objects.all().delete()
    Subspecies.objects.all().delete()

    reader = csv.reader(c, delimiter=';', quotechar='"')
    ini_rows = 4
    for row in reader:
        if ini_rows > 0:
            ini_rows -= 1
            continue
        if row[0]:
            new_order = Order(order=row[0])
            new_order.save()
        elif row[1]:
            new_fam = Family(order=new_order, family_scientific=row[1], family_english=row[2])
            new_fam.save()
        elif row[3]:
            new_genus = Genus(family=new_fam, genus=row[3])
            new_genus.save()
        elif row[4]:
            print row[4]
            new_species = Species(genus=new_genus, species=row[4], species_english=row[7])
            new_species.save()
        elif row[5]:
            print row[5]
            new_subspecies = Subspecies(species=new_species, subspecies=row[5])
            new_subspecies.save()
And here are the models.py file definitions:
from __future__ import unicode_literals

from django.db import models


class SpeciesFile(models.Model):
    species_list = models.FileField()


class Order(models.Model):
    order = models.CharField(max_length=100)

    def __str__(self):
        return self.order


class Family(models.Model):
    order = models.ForeignKey(Order)
    family_scientific = models.CharField(max_length=100)
    family_english = models.CharField(max_length=100)

    def __str__(self):
        return self.family_english + " " + self.family_scientific


class Genus(models.Model):
    family = models.ForeignKey(Family)
    genus = models.CharField(max_length=100)

    def __str__(self):
        return self.genus


class Species(models.Model):
    genus = models.ForeignKey(Genus, default=None)
    species = models.CharField(max_length=100, default=None)
    species_english = models.CharField(max_length=100, default=None, blank=True, null=True)

    def __str__(self):
        return self.species + " " + self.species_english


class Subspecies(models.Model):
    species = models.ForeignKey(Species)
    subspecies = models.CharField(max_length=100)

    def __str__(self):
        return self.subspecies
Django CharField is a character-oriented format. You need to pass it Unicode strings.
CSV is a byte-oriented format. When you read data out of a CSV file you get byte strings.
To get from bytes to characters you have to know what encoding was used when the original characters were turned into bytes as the CSV file was exported. Ideally that would be UTF-8, but if the file has come out of Excel it probably won't be. Maybe it's Windows-1252 (‘ANSI’ code page for Western European installations). Maybe it's something else.
(Django/Python 2 lets you get away with writing byte strings to Unicode properties when they contain only ASCII bytes (0–127), because those have the same mapping in a lot of encodings. ASCII is a ‘best guess’ at Do What I Mean, but it's not reliable, and Python 3 prefers to raise errors if you try.)
So:
new_order = Order(order=row[0].decode('windows-1252'))
or, to decode the whole row at once:
row = [s.decode('windows-1252') for s in row]
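If you would rather not sprinkle .decode() calls through the import loop, you can wrap the reader so every row comes out already decoded. A minimal sketch, assuming the export encoding really is Windows-1252 (swap in whatever the file actually uses):

def decoded_rows(reader, encoding='windows-1252'):
    # Decode every cell of every row before it reaches the import loop.
    for row in reader:
        yield [cell.decode(encoding) for cell in row]

reader = decoded_rows(csv.reader(c, delimiter=';', quotechar='"'))

The rest of the loop can then stay exactly as it is, since every row[n] is already a unicode string.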
Related
I am doing a course project based on Python and I am curious whether there is a way to write something similar to this (written in C++) in Python. I am struggling to write it in Python (transferring information from a text file into the setters/getters of a class I have already created).
while (file >> Code >> Name >> Description >> Price >> Quantity >> color >>
       Size >> BasketballRate) {
    Basketball* object3 = new Basketball();
    object3->SetName(Name);
    object3->SetCode(Code);
    object3->SetDescript(Description);
    object3->SetPrice(Price);
    object3->SetQuantity(Quantity);
    object3->setColor(color);
    object3->setSize(Size);
    object3->setBasketballRate(BasketballRate);
    basketball.push_back(object3);
}
file.close();
Getters and setters typically aren't used in Python, since they're seldom needed and can effectively be added later (without breaking existing code) if there turns out to be some unusual reason to have one or more of them.
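For example, a plain attribute can later become a property without changing any call sites. A minimal sketch (the price validation is hypothetical, purely for illustration):

class Basketball:
    def __init__(self):
        self._price = 0.0

    @property
    def price(self):
        return self._price

    @price.setter
    def price(self, value):
        # Added later, without breaking code that already does `bb.price = ...`.
        if float(value) < 0:
            raise ValueError('price cannot be negative')
        self._price = float(value)

bb = Basketball()
bb.price = 19.99  # same syntax as when price was a plain attribute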
Here's an example of reading the data from a text file and using it to create instances of the Basketball class.
class Basketball:
    fields = ('name', 'code', 'description', 'price', 'quantity', 'color', 'size',
              'basketball_rate')

    def __init__(self):
        for field in type(self).fields:
            setattr(self, field, None)

    def __str__(self):
        args = []
        for field in type(self).fields:
            args.append(f'{field}={getattr(self, field)}')
        return f'{type(self).__name__}(' + ', '.join(args) + ')'


basketballs = []
with open('bb_info.txt') as file:
    while True:
        basketball = Basketball()
        try:
            for field in Basketball.fields:
                setattr(basketball, field, next(file).rstrip())
        except StopIteration:
            break  # End of file.
        basketballs.append(basketball)

for basketball in basketballs:
    print(basketball)
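Note that next(file).rstrip() leaves every field as a string; if some fields should be numeric, they can be converted after reading. A small sketch, assuming price is a float and quantity an int:

for basketball in basketballs:
    basketball.price = float(basketball.price)
    basketball.quantity = int(basketball.quantity)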
Suppose I have a .bib file containing BibTeX-formatted entries. I want to extract the "title" field from an entry and then format it as a readable Unicode string.
For example, if the entry was:
@article{mypaper,
    author = {myself},
    title = {A very nice {title} with annoying {symbols} like {\^{a}}}
}
what I want to extract is the string:
A very nice title with annoying symbols like â
I am currently trying to use the pybtex package, but I cannot figure out how to do it. The command-line utility pybtex-format does a good job in converting full .bib files, but I need to do this inside a script and for single title entries.
Figured it out:
def load_bib(filename):
    from pybtex.database.input.bibtex import Parser
    parser = Parser()
    DB = parser.parse_file(filename)
    return DB

def get_title(entry):
    from pybtex.plugin import find_plugin
    style = find_plugin('pybtex.style.formatting', 'plain')()
    backend = find_plugin('pybtex.backends', 'plaintext')()
    sentence = style.format_title(entry, 'title')
    data = {'entry': entry,
            'style': style,
            'bib_data': None}
    T = sentence.f(sentence.children, data)
    title = T.render(backend)
    return title

DB = load_bib("bibliography.bib")
print(get_title(DB.entries["entry_label"]))
where entry_label must match the label you use in LaTeX to cite the bibliography entry.
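So for the example entry above, the call would look like this (assuming bibliography.bib contains that entry):

DB = load_bib("bibliography.bib")
print(get_title(DB.entries["mypaper"]))
# -> A very nice title with annoying symbols like â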
Building upon the answer by Daniele, I wrote this function that lets one render fields without having to use a file.
from io import StringIO
from pybtex.database.input.bibtex import Parser
from pybtex.plugin import find_plugin

def render_fields(author="", title=""):
    """The arguments are in bibtex format. For example, they may contain
    things like \'{i}. The output is a dictionary with these fields
    rendered in plain text.

    If you run tests by defining a string in Python, use r'''string''' to
    avoid issues with escape characters.
    """
    parser = Parser()
    istr = r'''
        @article{foo,
            Author = {''' + author + r'''},
            Title = {''' + title + '''},
        }
    '''
    bib_data = parser.parse_stream(StringIO(istr))
    style = find_plugin('pybtex.style.formatting', 'plain')()
    backend = find_plugin('pybtex.backends', 'plaintext')()
    entry = bib_data.entries["foo"]
    data = {'entry': entry, 'style': style, 'bib_data': None}
    sentence = style.format_author_or_editor(entry)
    T = sentence.f(sentence.children, data)
    rendered_author = T.render(backend)[0:-1]  # exclude period
    sentence = style.format_title(entry, 'title')
    T = sentence.f(sentence.children, data)
    rendered_title = T.render(backend)[0:-1]  # exclude period
    return {'title': rendered_title, 'author': rendered_author}
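A quick usage example (raw strings keep the TeX escapes intact; the exact author formatting depends on the 'plain' style):

fields = render_fields(author=r"Smith, John and Doe, Jane",
                       title=r"A very nice {title} with {\^{a}}")
print(fields['title'])   # plain-text title, e.g. "A very nice title with â"
print(fields['author'])  # plain-text author list as rendered by the style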
I have a function which searches for JSON files in a directory, parses each file, and writes the data to the database. My problem is the database writes, which take around 30 minutes. Any idea how I can speed them up? I have a few quite big files to parse, but parsing is not the problem; it takes around 3 minutes. Currently I am using SQLite, but in the future I will change to PostgreSQL.
Here is my function:
def create_database():
    with transaction.atomic():
        directory = os.fsencode('data/web_files/unzip')
        for file in os.listdir(directory):
            filename = os.fsdecode(file)
            with open('data/web_files/unzip/{}'.format(filename.strip()), encoding="utf8") as f:
                data = json.load(f)
                cve_items = data['CVE_Items']
                for i in range(len(cve_items)):
                    database_object = DataNist()
                    try:
                        impact = cve_items[i]['impact']['baseMetricV2']
                        database_object.severity = impact['severity']
                        database_object.exp_score = impact['exploitabilityScore']
                        database_object.impact_score = impact['impactScore']
                        database_object.cvss_score = impact['cvssV2']['baseScore']
                    except KeyError:
                        database_object.severity = ''
                        database_object.exp_score = ''
                        database_object.impact_score = ''
                        database_object.cvss_score = ''
                    for vendor_data in cve_items[i]['cve']['affects']['vendor']['vendor_data']:
                        database_object.vendor_name = vendor_data['vendor_name']
                    for description_data in cve_items[i]['cve']['description']['description_data']:
                        database_object.description = description_data['value']
                    for product_data in vendor_data['product']['product_data']:
                        database_object.product_name = product_data['product_name']
                        database_object.save()
                        for version_data in product_data['version']['version_data']:
                            if version_data['version_value'] != '-':
                                database_object.versions_set.create(version=version_data['version_value'])
My models.py:
class DataNist(models.Model):
    vendor_name = models.CharField(max_length=100)
    product_name = models.CharField(max_length=100)
    description = models.TextField()
    date = models.DateTimeField(default=timezone.now)
    severity = models.CharField(max_length=10)
    exp_score = models.IntegerField()
    impact_score = models.IntegerField()
    cvss_score = models.IntegerField()

    def __str__(self):
        return self.vendor_name + "-" + self.product_name


class Versions(models.Model):
    data = models.ForeignKey(DataNist, on_delete=models.CASCADE)
    version = models.CharField(max_length=50)

    def __str__(self):
        return self.version
I would appreciate any advice on how I can improve my code.
Okay, given the structure of the data, something like this might work for you.
This is standalone code aside from that .objects.bulk_create() call; as commented in the code, the two classes defined would actually be models within your Django app.
(By the way, you probably want to save the CVE ID as an unique field too.)
Your original code assumed that every "leaf entry" in the affected version data would have the same vendor, which may not be true. That's why the model structure here has a separate product-version model with vendor, product, and version fields. (If you wanted to optimize things a little, you might deduplicate the AffectedProductVersions even across DataNists (which, as an aside, is not a perfect name for a model).)
And of course, as you had already done in your original code, the importing should be run within a transaction (transaction.atomic()).
Hope this helps.
import json
import os
import types


class DataNist(types.SimpleNamespace):  # this would actually be a model
    severity = ""
    exp_score = ""
    impact_score = ""
    cvss_score = ""

    def save(self):
        pass


class AffectedProductVersion(types.SimpleNamespace):  # this too
    # (foreign key to DataNist here)
    vendor_name = ""
    product_name = ""
    version_value = ""


def import_item(item):
    database_object = DataNist()
    try:
        impact = item["impact"]["baseMetricV2"]
    except KeyError:  # no impact object available
        pass
    else:
        database_object.severity = impact.get("severity", "")
        database_object.exp_score = impact.get("exploitabilityScore", "")
        database_object.impact_score = impact.get("impactScore", "")
        if "cvssV2" in impact:
            database_object.cvss_score = impact["cvssV2"]["baseScore"]

    for description_data in item["cve"]["description"]["description_data"]:
        database_object.description = description_data["value"]
        break  # only grab the first description

    database_object.save()  # save the base object

    affected_versions = []
    for vendor_data in item["cve"]["affects"]["vendor"]["vendor_data"]:
        for product_data in vendor_data["product"]["product_data"]:
            for version_data in product_data["version"]["version_data"]:
                affected_versions.append(
                    AffectedProductVersion(
                        data_nist=database_object,
                        vendor_name=vendor_data["vendor_name"],
                        product_name=product_data["product_name"],
                        version_value=version_data["version_value"],
                    )
                )

    AffectedProductVersion.objects.bulk_create(
        affected_versions
    )  # save all the version information

    return database_object  # in case the caller needs it


with open("nvdcve-1.0-2019.json") as infp:
    data = json.load(infp)

for item in data["CVE_Items"]:
    import_item(item)
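And, as noted above, the import should run inside a single transaction so SQLite commits once for the whole batch instead of once per row. In the real Django app the call site might look like this (a sketch, not the exact code):

from django.db import transaction

with transaction.atomic():  # one commit for the entire import
    for item in data["CVE_Items"]:
        import_item(item)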
I have a problem with Django: somehow the Django ORM seems to treat commas as delimiters. Example code is below.
print sub_categorys.description # is printed as "drum class and drums feature"
print sub_categorys.image_url # is printed as ", bongo class no.jpg"
but the real database row is description = "drum class and drums feature, bongo class" and image_url = "no.jpg".
Please help me out here! Thanks!
Additional explanation below, in code.
** models.py **
class SubCategory(models.Model):
    name = models.TextField(unique=True)
    description = models.TextField(null=True)
    image_url = models.URLField(null=True)
** views.py > code used to insert data into the model **
with open('./classes/resource/model/csv/sub_category_model.csv', 'rb') as f:
    reader = csv.reader(f)
    is_first = True
    for row in reader:
        if is_first:
            is_first = False
            continue
        sub_category = SubCategory(name=unicode(row[0], 'euc-kr'),
                                   description=unicode(row[3], 'euc-kr'),
                                   image_url=unicode(row[4], 'euc-kr'))
        try:
            sub_category.save()
        except Exception, e:
            logger.error(e)
It's not the ORM that's using the comma as a delimiter; it's csv.reader. If you want to import strings that contain commas, you'll have to wrap them in quotation marks, so make sure the CSV file contains the proper quoting. Given your code above, your CSV rows should read something like:
foo,bar,baz,"drum class and drums feature, bongo class",no.jpg
If that's a problem for some reason, you can choose other delimiters, e.g.:
reader = csv.reader(csvfile, delimiter='|')
would take as input:
foo|bar|baz|drum class and drums feature, bongo class|no.jpg
More examples are available in the CSV module documentation
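The csv module applies the same quoting rules when writing, so a correctly quoted file can be produced with csv.writer. A small sketch (the field values are just the ones from this example):

import csv

with open('sub_category_model.csv', 'wb') as f:  # Python 2: csv wants binary mode
    writer = csv.writer(f)  # QUOTE_MINIMAL by default: only fields containing the delimiter get quoted
    writer.writerow(['foo', 'bar', 'baz',
                     'drum class and drums feature, bongo class', 'no.jpg'])
# produces: foo,bar,baz,"drum class and drums feature, bongo class",no.jpg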
I know I'm having a problem with a conversion from Unicode but I'm not sure where it's happening.
I'm extracting data about a recent European trip from a directory of HTML files. Some of the location names have non-ASCII characters (such as é, ô, ü). I'm getting the data from a string representation of the file using regex.
If I print the locations as I find them, they print with the characters, so the encoding must be OK:
Le Pré-Saint-Gervais, France
Hôtel-de-Ville, France
I'm storing the data in a SQLite table using SQLAlchemy:
Base = declarative_base()

class Point(Base):
    __tablename__ = 'points'

    id = Column(Integer, primary_key=True)
    pdate = Column(Date)
    ptime = Column(Time)
    location = Column(Unicode(32))
    weather = Column(String(16))
    high = Column(Float)
    low = Column(Float)
    lat = Column(String(16))
    lon = Column(String(16))
    image = Column(String(64))
    caption = Column(String(64))

    def __init__(self, filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption):
        self.filename = filename
        self.pdate = pdate
        self.ptime = ptime
        self.location = location
        self.weather = weather
        self.high = high
        self.low = low
        self.lat = lat
        self.lon = lon
        self.image = image
        self.caption = caption

    def __repr__(self):
        return "<Point('%s','%s','%s')>" % (self.filename, self.pdate, self.ptime)


engine = create_engine('sqlite:///:memory:', echo=False)
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
I loop through the files and insert the data from each one into the database:
for filename in filelist:
    # open the file and extract the information using regex such as:
    location_re = re.compile("<h2>(.*)</h2>", re.M)
    # extract other data
    newpoint = Point(filename, pdate, ptime, location, weather, high, low, lat, lon, image, caption)
    session.add(newpoint)

session.commit()
I see the following warning on each insert:
/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/default.py:230: SAWarning: Unicode type received non-unicode bind param value 'Spitalfields, United Kingdom'
param.append(processors[key](compiled_params[key]))
And when I try to do anything with the table such as:
session.query(Point).all()
I get:
Traceback (most recent call last):
  File "./extract_trips.py", line 131, in <module>
    session.query(Point).all()
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1193, in all
    return list(self)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/orm/query.py", line 1341, in instances
    fetch = cursor.fetchall()
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 1642, in fetchall
    self.connection._handle_dbapi_exception(e, None, None, self.cursor, self.context)
  File "/usr/lib/python2.5/site-packages/SQLAlchemy-0.5.4p2-py2.5.egg/sqlalchemy/engine/base.py", line 931, in _handle_dbapi_exception
    raise exc.DBAPIError.instance(statement, parameters, e, connection_invalidated=is_disconnect)
sqlalchemy.exc.OperationalError: (OperationalError) Could not decode to UTF-8 column 'points_location' with text 'Le Pré-Saint-Gervais, France' None None
I would like to be able to correctly store and then return the location names with the original characters intact. Any help would be much appreciated.
I found this article that helped explain my troubles somewhat:
http://www.amk.ca/python/howto/unicode#reading-and-writing-unicode-data
I was able to get the desired results by using the 'codecs' module and then changing my program as follows:
When opening the file:
infile = codecs.open(filename, 'r', encoding='iso-8859-1')
When printing the location:
print location.encode('ISO-8859-1')
I can now query and manipulate the data from the table without the error from before. I just have to specify the encoding when I output the text.
(I still don't entirely understand how this is working so I guess it's time to learn more about Python's unicode handling...)
From sqlalchemy.org, see section 0.4.2:

added new flag to String and create_engine(): assert_unicode=(True|False|'warn'|None). Defaults to False or None on create_engine() and String, 'warn' on the Unicode type. When True, results in all unicode conversion operations raising an exception when a non-unicode bytestring is passed as a bind parameter. 'warn' results in a warning. It is strongly advised that all unicode-aware applications make proper use of Python unicode objects (i.e. u'hello' and not 'hello') so that data round trips accurately.
I think you are trying to input a non-unicode bytestring. Perhaps this might lead you onto the right track? Some form of conversion is needed; compare 'hello' and u'hello'.
Cheers
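In Python 2 the difference is easy to see at the interpreter (a minimal illustration):

>>> type('hello')            # byte string: raw bytes, no declared encoding
<type 'str'>
>>> type(u'hello')           # unicode string: what the Unicode column expects
<type 'unicode'>
>>> 'Pr\xe9'.decode('iso-8859-1')  # bytes -> unicode, given the right encoding
u'Pr\xe9'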
Try using a column type of Unicode rather than String for the unicode columns:
Base = declarative_base()

class Point(Base):
    __tablename__ = 'points'

    id = Column(Integer, primary_key=True)
    pdate = Column(Date)
    ptime = Column(Time)
    location = Column(Unicode(32))
    weather = Column(String(16))
    high = Column(Float)
    low = Column(Float)
    lat = Column(String(16))
    lon = Column(String(16))
    image = Column(String(64))
    caption = Column(String(64))
Edit: Response to comment:
If you're getting warnings about unicode encodings then there are two things you can try:
First, convert your location to unicode. This would mean creating your Point like this:
newpoint = Point(filename, pdate, ptime, unicode(location), weather, high, low, lat, lon, image, caption)
The unicode conversion will produce a unicode string when passed either a string or a unicode string, so you don't need to worry about what you pass in.
Second, if that doesn't solve the encoding issues, try calling encode on your unicode objects. That would mean using code like:
newpoint = Point(filename, pdate, ptime, unicode(location).encode('utf-8'), weather, high, low, lat, lon, image, caption)
This step probably won't be necessary, but what it essentially does is convert a unicode object from Unicode code points to a specific byte representation (in this case, UTF-8). I'd expect SQLAlchemy to do this for you when you pass in unicode objects, but it may not.