I have a QueryDict object in Django as follows:
{'ratingname': ['Beginner', 'Professional'], 'sportname': ['2', '3']}
where the mapping is as follows:
2 Beginner
3 Professional
and 2, 3 are the primary key values of the sport table in models.py:
class Sport(models.Model):
    name = models.CharField(unique=True, max_length=255)

class SomeTable(models.Model):
    sport = models.ForeignKey(Sport)
    rating = models.CharField(max_length=255, null=True)
My question here is, how do I iterate through ratingname such that I can save it as
st = SomeTable(sport=sportValue, rating=ratingValue)
st.save()
I have tried the following:
ratings = dict['ratingname']
sports = dict['sportname']
for s, i in enumerate(sports):
    sport = Sport.objects.get(pk=sports[int(s[1])])
    rate = SomeTable(sport=sport, rating=ratings[int(s)])
    rate.save()
However, this creates a wrong entry in the tables. For example, with the above given values it creates the following object in my table:
id: 1
sport: 2
rating: 'g'
How do I solve this issue or is there a better way to do something?
There are a couple of problems here. The main one is that QueryDicts return only the last value when accessed with ['sportname'] or the like. To get the list of values, use getlist('sportname'), as documented here:
https://docs.djangoproject.com/en/1.7/ref/request-response/#django.http.QueryDict.getlist
Your enumerate is off, too - enumerate yields the index first, which your code assigns to s. So s[1] will throw an exception. There's a better way to iterate through two sequences in step, though - zip.
ratings = query_dict.getlist('ratingname')  # don't reuse built-in names like dict
sports = query_dict.getlist('sportname')
for rating, sport_pk in zip(ratings, sports):
    sport = Sport.objects.get(pk=int(sport_pk))
    rate = SomeTable(sport=sport, rating=rating)
    rate.save()
You could also look into using a ModelForm based on your SomeTable model.
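For instance, a minimal sketch of that ModelForm (the class name is hypothetical; the fields come from the models in the question, and for multiple rows you would instantiate one form per (sport, rating) pair, or use a formset):

from django import forms

class SomeTableForm(forms.ModelForm):
    class Meta:
        model = SomeTable
        fields = ["sport", "rating"]

for rating, sport_pk in zip(ratings, sports):
    # A ModelForm validates the foreign key from its raw pk for us
    form = SomeTableForm({"sport": sport_pk, "rating": rating})
    if form.is_valid():
        form.save()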
You may use zip:
ratings = query_dict.getlist('ratingname')  # a QueryDict needs getlist() here as well
sports = query_dict.getlist('sportname')
for rating, sport_id in zip(ratings, sports):
    sport = Sport.objects.get(pk=int(sport_id))
    rate = SomeTable(sport=sport, rating=rating)
    rate.save()
Related
I have two dataframes:
One with a single column of business names that I call 'bus_names_2' with a column name of 'BUSINESS_NAME'
One with an array of records and fields pulled from an RSS feed, which I call 'df_newsfeed'. The important field is the 'Description_2' field, which holds the RSS feed contents after scrubbing stopwords and symbols. The same scrubbing was applied to the 'bus_names_2' dataframe.
I am trying to look through each record in the 'df_newsfeed' 'Description_2' field to see if its words contain a business name from the 'bus_names_2' dataframe. This is easily done using the following:
def IdentityResolution_demo(bus_names, df, col='Description_2', upper=True):
    n_rows = df.shape[0]
    description_col = df.columns.get_loc(col)
    df['Company'] = ''
    company_col = df.columns.get_loc('Company')
    if upper:
        df.loc[:, col] = df.loc[:, col].str.upper()
    for ind in range(n_rows):
        businesses = []
        description = df.iloc[ind, description_col]
        for bus_name in bus_names:
            if bus_name in description:
                businesses.append(bus_name)
        if len(businesses) > 0:
            company = '|'.join(businesses)
            df.iloc[ind, company_col] = company
    df = df[['Source', 'RSS', 'Company', 'Title', 'PublishedDate', 'Description', 'Link']].drop_duplicates()
    return df
bus_names_3 = list(set(bus_names_2['BUSINESS_NAME'].tolist()))
test = IdentityResolution_demo(bus_names_3, df_newsfeed.iloc[:10])
test[test['Company']!='']
The issue with this, aside from how long it takes, is that it brings back everything on a substring-contains basis. I only want full-word matches. Meaning, if I have a company in my 'bus_names_2' dataframe called 'Bank of A', it should only be brought into the Company column when the full words 'Bank of A' exist in the 'Description_2' column of the 'df_newsfeed' dataframe, and not when 'Bank of America' shows up.
Essentially, I need something like this ingrained in my function to produce the proper output for the 'Company' column, but I don't know how to implement it. The code below gets the point across.
import re

Description_2 = 'GUARDFORCE AI CO LIMITED AI GFAIW RIVERSOFT INC PEAKWORK COMPANY GFAIS CONCIERGE GUARDFORCE AI RIVERSOFT ROBOT TRAVEL AGENCY'
bus_name_2 = ['GUARDFORCE AI CO']
for i in bus_name_2:
    bus_name = re.compile(fr'\b{i}\b')
    print(f"{i if bus_name.match(Description_2) else ''}")
This would produce an output of 'GUARDFORCE AI CO' but if I change the bus_name_2 to:
bus_name_2 = ['GUARDFORCE AI C']
It would produce a null output.
This function is written in the way it is because comparing two dataframes turned into a very long query and so optimization required a non-dataframe format.
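One way to fold that check into the function (a sketch, assuming the same column names as above; re.escape guards names that contain regex metacharacters, and search() finds the name anywhere in the description rather than only at the start):

import re

def IdentityResolution_wholeword(bus_names, df, col='Description_2', upper=True):
    # Precompile one word-boundary pattern per business name
    patterns = [(name, re.compile(rf'\b{re.escape(name)}\b')) for name in bus_names]
    if upper:
        df.loc[:, col] = df.loc[:, col].str.upper()
    def find_companies(description):
        # Keep only names that appear as full words in the description
        return '|'.join(name for name, pat in patterns if pat.search(description))
    df['Company'] = df[col].apply(find_companies)
    return df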
Currently I'm looking for an efficient way to build a matrix of ratings for a recommendation system in Python.
The matrix should look like this:
4|0|0|
5|2|0|
5|0|0|
4|0|0|
4|0|0|
4|0|0|
4|4|0|
2|0|0|
0|4|0|
0|3|0|
0|0|3|
0|0|5|
0|0|4|
Specifically, the columns are business_id and the rows are user_id
|bus-1|bus-2|
user-1|stars|stars|
user-2|stars|stars|
Currently I'm using this Yelp review data set stored in MongoDB:
_id: "----X0BIDP9tA49U3RvdSQ"
user_id: "gVmUR8rqUFdbSeZbsg6z_w"
business_id: "Ue6-WhXvI-_1xUIuapl0zQ"
stars: 4
useful: 1
funny: 0
cool: 0
text: "Red, white and bleu salad was super yum and a great addition to the me..."
date: "2014-02-17 16:48:49"
My approach is to build a list of unique business_id and user_id values from the review table, and then query the review table again for each pair.
I've included my code here; as you can see, because of the brute-force approach it took a long time just to build a small matrix like the one above.
Here's a snippet of my code:
def makeBisnisArray(cityNameParam):
    arrayBisnis = []
    # Append business ids filtered by cityNameParam to the bisnis array
    bisnisInCity = colBisnis.find({"city": cityNameParam})
    for bisnis in bisnisInCity:
        # if the business id is not in the array, append it
        if bisnis["_id"] not in arrayBisnis:
            arrayBisnis.append(bisnis["_id"])
    return arrayBisnis

def makeUserArray(bisnisName):
    global arrayUser
    # find reviews filtered by bisnisName
    hslReview = colReview.find({"business_id": bisnisName})
    for review in hslReview:
        # if the user id is not already in the array, append it
        if review['user_id'] not in arrayUser:
            arrayUser.append(review['user_id'])

def writeRatingMatrix(arrayBisnis, arrayUser):
    f = open("file.txt", "w")
    for user in arrayUser:
        for bisnis in arrayBisnis:
            # find one review by business_id and user_id
            x = colReview.find_one({"business_id": bisnis, "user_id": user})
            # if there is none, write the rating as 0
            if x is None:
                f.write('0|')
            # if found, write the star value
            else:
                f.write(str(x['stars']) + "|")
        f.write('\n')

def buildCityTable(cityName):
    arrayBisnis = makeBisnisArray(cityName)
    global arrayUser
    for bisnis in arrayBisnis:
        makeUserArray(bisnis)
    writeRatingMatrix(arrayBisnis, arrayUser)

arrayUser = []
cityNameVar = 'Pointe-Aux-Trembles'
buildCityTable(cityNameVar)
Can anyone suggest a more efficient way to build the rating matrix for me?
There are several general approaches you can take to speed this up.
Use sets or dictionaries to establish a unique set of businesses and users respectively; Set/Dict lookups are much faster than list searches.
Process the yelp file one entry at a time, once
Use something like numpy or pandas to build your matrix
Something like this
users = {}
businesses = {}
ratings = {}
for entry in yelp_entries:
if entry['user_id'] not in users:
users[entry['user_id']] = len(users)
if entry['business_id'] not in businesses:
businesses[entry['business_id']] = len(businesses)
ratings.append((
users[[entry['user_id']],
businesses[entry['business_id']],
entry['stars']
))
matrix = numpy.tile(0, (len(users), len(businesses))
for r in ratings:
matrix[r[0]][r[1]] = r[2]
I modified @sirlark's code to match my needs, but for some reason I couldn't use append on ratings and iterate over it with for r in ratings, so I had to change the code like this:
import numpy

users = {}
businesses = {}
ratings = {}
# Store each review's (user index, business index, stars) keyed by insertion order
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    ratings[len(ratings)] = (users[entry['user_id']],
                             businesses[entry['business_id']],
                             int(entry['stars']))
matrix = numpy.tile(0, (len(users), len(businesses)))
for ind in range(0, len(ratings)):
    matrix[ratings[ind][0]][ratings[ind][1]] = ratings[ind][2]
Later I found out that, other than using the tile method,
we can also use a SciPy coo_matrix, which is slightly faster than the method above, but we need to modify the code a bit:
from scipy.sparse import coo_matrix

users = {}
businesses = {}
row = []
col = []
data = []
for entry in yelp_entries:
    if entry['business_id'] not in businesses:
        businesses[entry['business_id']] = len(businesses)
    if entry['user_id'] not in users:
        users[entry['user_id']] = len(users)
    col.append(businesses[entry['business_id']])
    row.append(users[entry['user_id']])
    data.append(int(entry['stars']))
matrix = coo_matrix((data, (row, col))).toarray()
note: Later I found out that the reason I couldn't .append() or .add() to the ratings variable is that
ratings = {}
declares a dict. To declare a set you should use this instead:
ratings = set()
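A quick illustration of the difference between the three literals:

ratings = {}      # an empty dict: ratings.append(...) raises AttributeError
ratings = set()   # an empty set: add items with ratings.add(item)
ratings = []      # an empty list: add items with ratings.append(item)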
I have a large address database of commercial properties (about 5 million rows), of which 200,000 have missing floor areas. The properties are classified by industry, and I know the rent for each.
My approach for interpolating the missing floor areas was to filter for similarly-classified properties within a specified radius of the property with unknown floor area, and then calculate the floor area from the median of the cost/m2 of the nearby properties.
Originally, I approached this using pandas, but that has become problematic as the dataset has grown larger (even using groupby). It often exceeds available memory and stops. When it works, it takes about 3 hours to complete.
I am testing to see whether I can do the same task in the database. The function I've written for radial fill is as follows:
def _radial_fill(self):
    # Initial query selecting all latest locations, and excluding null rental valuations
    q = Location.objects.order_by("locode", "-update_cycle") \
                        .distinct("locode")
    # Chained Q objects to use in filter
    f = Q(rental_valuation__isnull=False) & \
        Q(use_category__grouped_by__isnull=False) & \
        Q(pc__isnull=False)
    # All property categories at subgroup level
    for c in LocationCategory.objects.filter(use_category="SGP").all():
        # Start looking for appropriate interpolation locations
        fc = f & Q(use_category__grouped_by=c)
        for l in q.filter(fc & Q(floor_area__isnull=True)).all():
            r_degree = 0
            while True:
                # Default Distance is metres, so multiply accordingly
                r = (constants.BOUNDS**r_degree)*1000  # metres
                ql = q.annotate(distance=Distance("pc__point", l.pc.point)) \
                      .filter(fc & Q(floor_area__isnull=False) & Q(distance__lte=r)) \
                      .values("rental_valuation", "floor_area")
                if len(ql) < constants.LOWER_RANGE:
                    if r > constants.UPPER_RADIUS*1000:
                        # Further than the longest possible distance
                        break
                    r_degree += 1
                else:
                    m = median([x["rental_valuation"]/x["floor_area"]
                                for x in ql if x["floor_area"] > 0.0])
                    l.floor_area = l.rental_valuation / m
                    l.save()
                    break
My problem is that this function takes 6 days to run. There has to be a faster way, right? I'm sure I'm doing something terribly wrong...
The models are as follows:
class LocationCategory(models.Model):
    # Category types
    GRP = "GRP"
    SGP = "SGP"
    UST = "UST"
    CATEGORIES = (
        (GRP, "Group"),
        (SGP, "Sub-group"),
        (UST, "Use type"),
    )
    slug = models.CharField(max_length=24, primary_key=True, unique=True)
    usecode = models.CharField(max_length=14, db_index=True)
    use_category = models.CharField(max_length=3, choices=CATEGORIES,
                                    db_index=True, default=UST)
    grouped_by = models.ForeignKey("self", null=True, blank=True,
                                   on_delete=models.SET_NULL,
                                   related_name="category_by_group")

class Location(models.Model):
    # Hereditament identity and location
    slug = models.CharField(max_length=24, db_index=True)
    locode = models.CharField(max_length=14, db_index=True)
    pc = models.ForeignKey(Postcode, null=True, blank=True,
                           on_delete=models.SET_NULL,
                           related_name="locations_by_pc")
    use_category = models.ForeignKey(LocationCategory, null=True, blank=True,
                                     on_delete=models.SET_NULL,
                                     related_name="locations_by_category")
    # History fields
    update_cycle = models.CharField(max_length=14, db_index=True)
    # Location-specific econometric data
    floor_area = models.FloatField(blank=True, null=True)
    rental_valuation = models.FloatField(blank=True, null=True)

class Postcode(models.Model):
    pc = models.CharField(max_length=7, primary_key=True, unique=True)  # Postcode excl space
    pcs = models.CharField(max_length=8, unique=True)  # Postcode incl space
    # http://spatialreference.org/ref/epsg/osgb-1936-british-national-grid/
    point = models.PointField(srid=4326)
Using Django 2.0 and PostgreSQL 10.
UPDATE
I've achieved a 35% improvement in runtime with the following code change:
# Initial query selecting all latest locations, and excluding null rental valuations
q = Location.objects.order_by("slug", "-update_cycle") \
                    .distinct("slug")
# Chained Q objects to use in filter
f = Q(rental_valuation__isnull=False) & \
    Q(pc__isnull=False) & \
    Q(use_category__grouped_by_id=category_id)
# All property categories at subgroup level
# Start looking for appropriate interpolation locations
for l in q.filter(f & Q(floor_area__isnull=True)).all().iterator():
    r = q.filter(f & Q(floor_area__isnull=False) & ~Q(floor_area=0.0))
    rl = Location.objects.filter(id__in=r).annotate(distance=D("pc__point", l.pc.point)) \
                 .order_by("distance")[:constants.LOWER_RANGE] \
                 .annotate(floor_ratio=F("rental_valuation") /
                                       F("floor_area")) \
                 .values("floor_ratio")
    if len(rl) == constants.LOWER_RANGE:
        m = median([h["floor_ratio"] for h in rl])
        l.floor_area = l.rental_valuation / m
        l.save()
The id__in=r is inefficient, but it seems the only way to maintain the distinct queryset when adding and sorting on a new annotation. Given that some 100,000 rows can be returned in the r query, any annotations applied there, with subsequent sorting by distance, can take a hellishly long time.
However ... I run into numerous problems when trying to implement the Subquery functionality: AttributeError: 'ResolvedOuterRef' object has no attribute '_output_field_or_none', which I think has something to do with the annotations, but I can't find much on it.
The relevant restructured code is:
rl = Location.objects.filter(id__in=r).annotate(distance=D("pc__point", OuterRef('pc__point'))) \
             .order_by("distance")[:constants.LOWER_RANGE] \
             .annotate(floor_ratio=F("rental_valuation") /
                                   F("floor_area")) \
             .distinct("floor_ratio")
and:
l.update(floor_area= F("rental_valuation") / CustomAVG(Subquery(locs),0))
I can see that this approach should be tremendously efficient, but getting it right seems somewhat far beyond my skill level.
You can simplify your method using (mostly) the built-in query methods of Django, which are optimized. More specifically, we will use:
Subquery and OuterRef expressions (for versions >= 1.11).
An annotation and AVG from Django aggregation.
The dwithin lookup.
F() expressions (a detailed use case for F() can be found in my QA-style example: How to execute arithmetic operations between Model fields in django).
We will create a custom Aggregate class to apply our AVG function (method inspired by this excellent answer: Django 1.11 Annotating a Subquery Aggregate):
class CustomAVG(Subquery):
    # PostgreSQL requires an alias on a subquery in FROM, hence the trailing "sub"
    template = "(SELECT AVG(area_value) FROM (%(subquery)s) sub)"
    output_field = models.FloatField()
and we will use it to calculate the following average:
for location in Location.objects.filter(rental_valuation__isnull=True):
    # update() is a queryset method, so re-select the row by pk
    Location.objects.filter(pk=location.pk).update(
        rental_valuation=CustomAVG(
            Subquery(
                Location.objects.filter(
                    pc__point__dwithin=(OuterRef('pc__point'), D(m=1000)),
                    rental_valuation__isnull=False
                ).annotate(area_value=F('rental_valuation')/F('floor_area'))
                .distinct('area_value')
            )
        )
    )
Breakdown of the above:
We collect all the Location objects without a rental_valuation and iterate through the list.
Subquery: We select the Location objects that are within a circle of radius=1000m (change that as you like) of our current location's point, and we annotate onto them the cost/m2 calculation (using F() to take the value of the columns rental_valuation and floor_area of each object) as a column named area_value. For more accurate results, we select only the distinct values of this column.
We apply our CustomAVG to the Subquery and we update our current location's rental_valuation.
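Since update() works at the queryset level, the per-row loop above could in principle collapse into a single bulk UPDATE (a sketch, untested at this scale; same assumptions as the code above):

Location.objects.filter(rental_valuation__isnull=True).update(
    rental_valuation=CustomAVG(
        Subquery(
            Location.objects.filter(
                pc__point__dwithin=(OuterRef('pc__point'), D(m=1000)),
                rental_valuation__isnull=False
            ).annotate(area_value=F('rental_valuation')/F('floor_area'))
            .distinct('area_value')
        )
    )
)

Here OuterRef('pc__point') resolves against each row being updated, so the database computes one neighbourhood average per row in a single statement.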
I have 3 tables: Continent, Country and Story.
Country has ForeignKey(Continent) and Story has ManyToManyField(Country, blank=True) field.
What I need is to get a list of countries which have at least one story belonging to them, and I need these countries grouped by continent.
How can I achieve that?
One way to do it is:
countries = {}
country_list = Country.objects.all()
for c in country_list:
    # check the stories that have this country
    stories = Story.objects.filter(country_set__name__exact=c.name)
    # only if the country has stories
    if stories.count() > 0:
        # initialize the list for this continent if needed
        if c.continent.name not in countries:
            countries[c.continent.name] = []
        # finally we add the country
        countries[c.continent.name].append(c)
That will do the job.
Bye
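For what it's worth, the same grouping can be done with a single query (a sketch; it assumes the reverse name for the ManyToManyField on Story is the default story, so adjust it if the field defines a related_name):

from collections import defaultdict

countries = defaultdict(list)
qs = (Country.objects
      .filter(story__isnull=False)      # keep only countries with at least one story
      .select_related('continent')
      .distinct())
for c in qs:
    countries[c.continent.name].append(c)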
Hopefully this result set is explanatory enough:
title text total_score already_voted
------------- ------------ ----------- -------------
BP Oil spi... Recently i... 5 0
J-Lo back ... Celebrity ... 7 1
Don't Stop... If there w... 9 0
Australian... The electi... 2 1
My models file describes article (author, text, title) and vote (caster, date, score). I can get the first three columns just fine with the following:
articles = Article.objects.all().annotate(total_score=Sum('vote__score'))
but calculating the 4th column, which is a boolean value describing whether the currently logged-in user has placed any of the votes in column 3, is a bit beyond me at the moment! Hopefully there's something that doesn't require raw SQL for this one.
Cheers,
Dave
--Trindaz on Fedang #django
I cannot think of a way to include the boolean condition. Perhaps others can answer that better.
How about thinking a bit differently? If you don't mind executing two queries you can filter your articles based on whether the currently logged in user has voted on them or not. Something like this:
all_articles = Article.objects.all()
articles_user_has_voted_on = all_articles.filter(vote__caster=request.user) \
                                         .annotate(total_score=Sum('vote__score'))
other_articles = all_articles.exclude(vote__caster=request.user) \
                             .annotate(total_score=Sum('vote__score'))
Update
After some experiments I was able to figure out how to add a boolean condition for a column in the same model (Article in this case) but not for a column in another table (Vote.caster).
If Article had a caster column:
Article.objects.all().extra(select={'already_voted': "caster_id = %s" % request.user.id})
In the present state this can be applied to the Vote model:
Vote.objects.all().extra(select={'already_voted': "caster_id = %s" % request.user.id})
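On newer Django (1.11+), the boolean column can be annotated directly with Exists (a sketch; it assumes Vote has foreign keys named article and caster, per the models described above):

from django.db.models import Exists, OuterRef, Sum

user_votes = Vote.objects.filter(article=OuterRef('pk'), caster=request.user)
articles = (Article.objects
            .annotate(total_score=Sum('vote__score'))
            .annotate(already_voted=Exists(user_votes)))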