Sorting the catalog results by multiple fields - python

I need to sort the catalog results by multiple fields.
In my case, I want to sort first by year, then by month. The year and month fields are included in my custom content type (item_publication_year and item_publication_month respectively).
However, I'm not getting the results that I want. The year and month are not ordered at all. They should appear in descending order, i.e. 2006, 2005, 2004 etc.
Below is my code:
def queryItemRepository(self):
    """
    Perform a search returning items matching the criteria
    """
    query = {}
    portal_catalog = getToolByName(self, 'portal_catalog')
    folder_path = '/'.join( self.context.getPhysicalPath() )
    query['portal_type'] = "MyContentType"
    query['path'] = {'query' : folder_path, 'depth' : 2 }
    results = portal_catalog.searchResults(query)
    # convert the results to a python list so we can use the sort function
    results = list(results)
    results.sort(lambda x, y : cmp((y['item_publication_year'], y['item_publication_year']),
                                   (x['item_publication_month'], x['item_publication_month'])
                                   ))
    return results
Anyone care to help?

A better bet is to use the key parameter for sorting:
results.sort(key=lambda b: (b.item_publication_year, b.item_publication_month))
You can also use the sorted() built-in function instead of using list(); it returns a sorted list for you. Calling list on the results and then sorting is the same amount of work for Python as just calling sorted:
results = portal_catalog.searchResults(query)
results = sorted(results, key=lambda b: (b.item_publication_year, b.item_publication_month))
Naturally, both item_publication_year and item_publication_month need to be present in the catalog metadata.
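Since the question asks for the newest entries first, the same key can be combined with reverse=True. Here is a minimal sketch; the Brain namedtuple and the sample values are made up to stand in for catalog results, and it assumes both fields are stored as integers (month names stored as strings would not sort chronologically):

```python
from collections import namedtuple

# Hypothetical stand-in for catalog brains, which expose metadata
# columns as attributes in a similar way.
Brain = namedtuple("Brain", ["item_publication_year", "item_publication_month"])

results = [Brain(2005, 3), Brain(2006, 1), Brain(2005, 11), Brain(2006, 7)]

# Tuples compare element by element, so the year decides first and the
# month breaks ties; reverse=True puts the newest entry first.
newest_first = sorted(
    results,
    key=lambda b: (b.item_publication_year, b.item_publication_month),
    reverse=True,
)

for brain in newest_first:
    print(brain.item_publication_year, brain.item_publication_month)
```

If only one of the two fields should be descending, negating that numeric field inside the key tuple is a common alternative to reverse=True.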

You can get multi-field sorting straight from the catalog search using advanced query; see also its official docs.

Related

Confusion over the ability to re-order a Dict or OrderedDict

I am here because I have been struggling to re-order a Dict that I have created. In short I pass a list of tweets returned from Twitter API through a reducer to group them by hashtag, naturally I want to use Dict for this because it gives a convenient shape:
{
"covid19": [tweet1,tweet2],
"COVID": [tweet1,tweet2, Tweet3]
}
But unfortunately, after I sort my Dict to put the hashtags with the most tweets first, the ordering does not change when I send it to the client; it's actually alphabetically ordered for some reason.
# method called by reducer below to group posts by their first hashtag
def group_posts_by_hashtag(hashtag_group, post):
    first_hashtag_of_post = post['entities']['hashtags'][0]['text']
    hashtag_group[first_hashtag_of_post] = hashtag_group.get(first_hashtag_of_post, [])
    hashtag_group[first_hashtag_of_post].append(post)
    return hashtag_group
tweets_categorised_by_hashtag = reduce(group_posts_by_hashtag, tweets_with_hashtags, defaultdict(list))
# sort by most tweets per hashtag
sorted_hashtags = OrderedDict(sorted(tweets_categorised_by_hashtag.items(), key=lambda x: len(x[1]), reverse=True))
# limit by 10 top hashtags
if len(sorted_hashtags) > 10:
    limit_ten_hashtags = dict(itertools.islice(sorted_hashtags.items(), 10))
    # limit_ten_hashtags
    return json.dumps(limit_ten_hashtags)
return json.dumps(sorted_hashtags)
My question is: is it even possible to change the order, store it in a variable, and maintain that order? This video pretty much says it's not possible to change the order of a dictionary: https://www.youtube.com/watch?v=MGD_b2w_GU4. And some of the answers I have seen only print the desired change but do not maintain that order.
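For what it's worth, on Python 3.7 and later a plain dict keeps insertion order, and json.dumps writes keys in that order unless sort_keys=True is passed. The sketch below uses made-up sample data to show this. Whether the order survives beyond that point depends on what serialises and parses the JSON, so sending a list of pairs is the safer shape when order matters:

```python
import itertools
import json

# Made-up sample data standing in for the grouped tweets.
grouped = {
    "covid19": ["tweet1", "tweet2"],
    "COVID": ["tweet1", "tweet2", "tweet3"],
    "lockdown": ["tweet1"],
}

# sorted() returns (hashtag, tweets) pairs, largest group first.
ranked = sorted(grouped.items(), key=lambda item: len(item[1]), reverse=True)

# Keep the top two; dict() preserves the order of the pairs (Python 3.7+).
top = dict(itertools.islice(ranked, 2))

as_object = json.dumps(top)               # keys appear in insertion order
as_pairs = json.dumps(list(top.items()))  # order is explicit in the array
print(as_object)
print(as_pairs)
```

If the output still arrives alphabetically ordered, it is worth checking whether something between this code and the client re-sorts the keys.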

Get max value in Python in case of tiebreaker

I would like to know of a way to get the reverse alphabetical order item in the case of a tiebreaker using max(lst,key=lst.count) or if there is a better way.
For example if I had the following list:
mylist = ['fire','water','water','fire']
Even though both water and fire occur twice, I would like it to return 'water' since it comes first in the reverse alphabetical order instead of it returning the first available value.

I created a function that gets the max value based on two keys, being one primary and one secondary.
Here it is:
def max_2_keys(__iter: iter, primary, secondary):
    srtd = sorted(__iter, key=primary, reverse=True)
    filtered = filter(lambda x: primary(x) == primary(srtd[0]), srtd)
    return sorted(filtered, key=secondary, reverse=True)[0]
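In your case, you can execute the following lines:
from string import printable
# printable lists the lowercase letters in alphabetical order, so a later
# first letter gets a higher index and wins the tie ('water' here)
max_2_keys(mylist, primary=mylist.count, secondary=lambda x: printable.index(x[0]))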
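A shorter route, offered as an alternative sketch rather than a correction of the function above, is to give max a tuple key: the count is compared first and the string itself breaks ties. Counter is used here so each distinct item is counted once:

```python
from collections import Counter

mylist = ['fire', 'water', 'water', 'fire']

# Count each distinct item once instead of calling list.count repeatedly.
counts = Counter(mylist)

# Tuples compare element by element: the count decides first, and the
# string itself breaks ties, so the alphabetically later word wins.
winner = max(counts, key=lambda word: (counts[word], word))
print(winner)  # water
```

Unlike a first-letter comparison, this compares the whole string, so ties between words that share a first letter are resolved as well.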

Custom function with pandas assign to get the match in a dictionary

Imagine I have a table with one column, product_line. I want to create another column with the product_line_name, based on the product_line number. The information is taken from a dictionary.
I have declared a dictionary containing product lines and descriptions:
# values are assumed to be lists of product-line numbers, as the "in" test in get_cat requires
categories = {
    'Apparel': [99],
    'Bikes': [32]}
# function that returns the category from a gl number
def get_cat(glnum, dic=categories):
    for cat, List in dic.items():
        if glnum in List:
            return cat
    return 0
data['category'] = data['product_line'].apply(lambda x: get_cat(x))
This works.
However I cannot get it to work using method chaining:
tt = (tt
    .assign(category = lambda d: get_cat(d.gl_product_line)))
The error appears to be related to the Series, but I am unsure why it doesn't work, since I expected the lambda to call get_cat once for each row of the dataframe, which apparently does not happen.
Any ideas of how I could do this using the .assign in method chaining?
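A likely explanation: a callable passed to .assign is called once with the whole DataFrame, so get_cat receives a Series rather than a single number. Below is a minimal sketch of one way around it, applying the function per element inside the lambda. The sample frame and the list-valued categories dictionary are assumptions made for illustration:

```python
import pandas as pd

# Assumed shape: each category maps to a list of product-line numbers.
categories = {"Apparel": [99], "Bikes": [32]}

def get_cat(glnum, dic=categories):
    """Return the category whose list contains glnum, or 0 if none does."""
    for cat, numbers in dic.items():
        if glnum in numbers:
            return cat
    return 0

# Made-up sample data.
tt = pd.DataFrame({"gl_product_line": [99, 32, 7]})

# The lambda given to assign receives the whole DataFrame once, so the
# per-row call has to be made explicitly with Series.apply.
tt = tt.assign(category=lambda d: d["gl_product_line"].apply(get_cat))
print(tt)
```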

Python Max function - Finding highest value in a dictionary

My question is about finding highest value in a dictionary using max function.
I have a created dictionary that looks like this:
cc_GDP = {'af': 1243738953, 'as': 343435646, etc}
I would like to be able to simply find and print the highest GDP value for each country.
My best attempt, having read through similar questions, is as follows (I'm currently working through the Python Crash Course book, from which the base of this code is taken; note that the get_country_code function simply provides 2-letter abbreviations for the countries in the GDP_data json file):
import json
# Load the data into a list
filename = 'gdp_data.json'
with open(filename) as f:
    gdp_data = json.load(f)
cc_GDP = {}
for gdp_dict in gdp_data:
    if gdp_dict['Year'] == 2016:
        country_name = gdp_dict['Country Name']
        GDP_total = int(gdp_dict['Value'])
        code = get_country_code(country_name)
        if code:
            cc_GDP[code] = int(GDP_total)
print(max(cc_GDP, key=lambda key: cc_GDP[key][1]))
This produces the following error: 'TypeError: 'int' object is not subscriptable'.
Note that leaving out the [1] in the print call does return the key with the highest value, but not the highest value itself, which is what I wish to achieve.
Any help would be appreciated.

So you currently extract the key of the country that has the highest value with this line:
country_w_highest_val = max(cc_GDP, key=lambda key: cc_GDP[key])
You can of course just look that up in the dictionary again:
highest_val = cc_GDP[country_w_highest_val]
But simpler: disregard the keys completely, and just find the highest of all values in the dictionary:
highest_val = max(cc_GDP.values())

How about something like this:
print(max(cc_GDP.values()))
That will give you the highest value but not the key.

The error is caused by the [1]: cc_GDP[key] is already an int, so it cannot be indexed. Remove the [1] and then use the following line:
print(cc_GDP[max(cc_GDP, key=lambda key: cc_GDP[key])])
Your code currently just returns the dictionary key. You need to plug this key back into the dictionary to get the GDP.

You could use the .items() method of dict to get key-value pairs (tuples) and process them in the following way:
cc_GDP = {'af': 1243738953, 'as': 343435646}
m = max(cc_GDP.items(), key=lambda x: x[1])
print(m)  # prints ('af', 1243738953)
The output m in this case is a 2-tuple; you can access the key 'af' via m[0] and the value 1243738953 via m[1].
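Another common idiom, shown here as a small sketch on the same sample dictionary, passes dict.get as the key function, which avoids writing the lambda:

```python
cc_GDP = {'af': 1243738953, 'as': 343435646}

# dict.get used as the key function ranks the keys by their values.
top_code = max(cc_GDP, key=cc_GDP.get)
top_value = cc_GDP[top_code]
print(top_code, top_value)  # af 1243738953
```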

Simplifying a list into categories

I am a new Python developer and was wondering if someone can help me with this. I have a dataset that has one column that describes a company type. I noticed that the column has, for example, surgical and surgery listed. It has eyewear, eyeglasses and optometry listed. So instead of having a huge list in this column, I want to simplify the categories: if a description contains "eye," "glasses" or "opto", then just change it to "eyewear." My initial code looks like this:
def map_company(row):
    company = row['SIC_Desc']
    if company in 'Surgical':
        return 'Surgical'
    elif company in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']:
        return 'Eyewear'
    elif company in ['Cotton', 'Bandages', 'gauze', 'tape']:
        return 'First Aid'
    elif company in ['Dental', 'Denture']:
        return 'Dental'
    elif company in ['Wheelchairs', 'Walkers', 'braces', 'crutches', 'ortho']:
        return 'Mobility equipments'
    else:
        return 'Other'
df['SIC_Desc'] = df.apply(map_company,axis=1)
This is not correct though because it is changing every item into "Other," so clearly my syntax is wrong. Can someone please help me simplify this column that I am trying to relabel?
Thank you

It is hard to answer without having the exact content of your data set, but I can see one mistake. According to your description, the test is the wrong way round: you want one of the words to be in your company description, so it should look like this:
if any(test in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
However, you might have a case issue here, so I would recommend:
company = row['SIC_Desc'].lower()
if any(test.lower() in company for test in ['Eye', 'glasses', 'opthal', 'spectacles', 'optometers']):
    return 'Eyewear'
You will also need to make sure company is a string and 'SIC_Desc' is a correct column name.
In the end your function will look like this:
def is_match(company,names):
    return any(name in company for name in names)
def map_company(row):
    company = row['SIC_Desc'].lower()
    if 'surgical' in company:
        return 'Surgical'
    elif is_match(company,['eye','glasses','opthal','spectacles','optometers']):
        return 'Eyewear'
    elif is_match(company,['cotton', 'bandages', 'gauze', 'tape']):
        return 'First Aid'
    else:
        return 'Other'

Here is an option using a reversed dictionary.
Code
import pandas as pd
# Sample DataFrame
s = pd.Series(["gauze", "opthal", "tape", "surgical", "eye", "spectacles",
               "glasses", "optometers", "bandages", "cotton", "glue"])
df = pd.DataFrame({"SIC_Desc": s})
df
LOOKUP = {
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
    "Surgical": ["surgical"],
    "Dental": ["dental", "denture"],
    "Mobility": ["wheelchairs", "walkers", "braces", "crutches", "ortho"],
}
REVERSE_LOOKUP = {v:k for k, lst in LOOKUP.items() for v in lst}
def map_company(row):
    company = row["SIC_Desc"].lower()
    return REVERSE_LOOKUP.get(company, "Other")
df["SIC_Desc"] = df.apply(map_company, axis=1)
df
Details
We define a LOOKUP dictionary with (key, value) pairs of expected output and associated words, respectively. Note, the values are lowercase to simplify searching. Then we use a reversed dictionary to automatically invert the key value pairs and improve the search performance, e.g.:
>>> REVERSE_LOOKUP
{'bandages': 'First Aid',
 'cotton': 'First Aid',
 'eye': 'Eyewear',
 'gauze': 'First Aid',
 ...}
Notice these reference dictionaries are created outside the mapping function to avoid rebuilding dictionaries for every call to map_company(). Finally the mapping function quickly returns the desired output using the reversed dictionary by calling .get(), a method that returns the default argument "Other" if no entry is found.
See @Flynsee's insightful answer for an explanation of what is happening in your code. The code here is cleaner compared to a bevy of conditional statements.
Benefits
Since we have used dictionaries, the lookup should be relatively fast: O(1) on average, compared to O(n) for an in test against a list. Moreover, the main LOOKUP dictionary is easy to extend, so new entries do not require manually writing more conditional statements.
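One caveat: REVERSE_LOOKUP.get() only matches when the whole description equals a keyword, while the question asks about descriptions that contain one. Below is a sketch of a substring-based variant built on the same LOOKUP idea. The function name categorize, the extra "surgery" keyword and the sample descriptions are made up for illustration, and the first matching category in dictionary order wins:

```python
LOOKUP = {
    "Surgical": ["surgical", "surgery"],
    "Eyewear": ["eye", "glasses", "opthal", "spectacles", "optometers"],
    "First Aid": ["cotton", "bandages", "gauze", "tape"],
}

def categorize(description, lookup=LOOKUP, default="Other"):
    """Return the first category with a keyword contained in description."""
    text = str(description).lower()
    for category, keywords in lookup.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default

# Hypothetical descriptions, not taken from the original data set.
samples = ["Surgical Instruments", "Eyeglasses and lenses", "Adhesive Tape", "Glue"]
labels = [categorize(s) for s in samples]
print(labels)
```

With pandas this could be applied as df["SIC_Desc"].apply(categorize). The trade-off is that each row now scans the keywords instead of doing a single dictionary lookup.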
