Here's the question,
A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word. Complete the function below to help her filter her list of articles.
Your function should meet the following criteria:
Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”
Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
Here's my answer (I want to solve this using only loops and ifs):
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword.
    Returns list of the index values into the original list for all documents
    containing the keyword.
    Example:
    doc_list = ['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?']
    word_search(doc_list, 'casino')
    >>> [0]
    """
    #non-course provided and my own code starts here.
    k=0
    print(doc_list,keyword)
    for string in doc_list:
        print(string)
        for char in string:
            if char.upper()==keyword[0] or char.lower()==keyword[0]:
                print(char,string[string.index(char)-1])
                if (string[string.index(char)-1]==" " or string[string.index(char)-1]=="" or string[string.index(char)-1]==".") and (string[string.index(char)+len(keyword)]==" " or string[string.index(char)+len(keyword)]=="" or string[string.index(char)+len(keyword)]=="."):
                    print(string[string.index(char)-1])
                    for k in range(len(keyword)):
                        print(k)
                        if string[string.index(char)+k].upper()==keyword[k] or string[string.index(char)+k].lower()==keyword[k]:
                            c=c+k
                    if len(c)==len(keyword):
                        x=[doc_list.index(string)]
                        return x
But after running the check code:
q2.check() #returns,
Incorrect: Got a return value of None given doc_list=['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?'], keyword='casino', but expected a value of type list. (Did you forget a return statement?)
Here's what gets printed out after executing the code:
['The Learn Python Challenge Casino', 'They bought a car, and a horse',
'Casinoville?'] casino
The Learn Python Challenge Casino
C
C
They bought a car, and a horse
c
Casinoville?
C ?
The code runs without syntax errors or other explicit errors, but after struggling for more than 5 hours I still can't find the bug that produces the wrong answer. Please help!
If I remember correctly, Kaggle courses also provide a solution, and that is the one you should understand and use going forward. Your code has many conditions, so it is tough to determine which of them is implemented incorrectly. You might as well check Kaggle's solution, because this version is not something you can build on. Also, your solution uses nested for-loops that check each letter one by one, which is inefficient. Nice beginner's attempt, though :)
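Here is a solution using regex:
import re
def word_search(documents, keyword):
    res = []
    # \b marks a word boundary; re.escape keeps any special characters in the keyword literal.
    pattern = r'\b' + re.escape(keyword) + r'\b'
    for i, doc in enumerate(documents):
        if re.search(pattern, doc, flags=re.IGNORECASE):
            res.append(i)
    return res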
As the checker's message says, your function should return a list. You are returning None instead, because on some paths through your nested ifs execution reaches the end of the function, where no return is specified. When a function ends without executing a return statement, Python returns None by default.
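To see the implicit None in isolation, here is a minimal sketch. The two helper functions are made up for illustration and are not part of the exercise; they only show that a return placed inside an if runs when that branch is reached, and that otherwise the function falls off the end.

```python
def first_even_index(numbers):
    # Hypothetical helper: the only return sits inside the if.
    for i, n in enumerate(numbers):
        if n % 2 == 0:
            return [i]
    # No return here, so Python returns None when nothing matched.


def first_even_index_fixed(numbers):
    result = []
    for i, n in enumerate(numbers):
        if n % 2 == 0:
            result.append(i)
            break
    return result  # Always a list, even when it is empty.


print(first_even_index([1, 3, 5]))        # None
print(first_even_index_fixed([1, 3, 5]))  # []
```

Building the result list first and returning it once at the end is usually the easiest way to guarantee the return type.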
By the way, Python offers many built-in utilities, for example the str.index() method, which returns the position of a substring in the original string if it is found and raises ValueError if it is not.
This I think is a better development of your solution:
def word_search(doc_list, keyword):
    """
    Takes a list of documents (each document is a string) and a keyword.
    Returns list of the index values into the original list for all documents
    containing the keyword.
    Example:
    doc_list = ['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?']
    word_search(doc_list, 'casino')
    >>> [0]
    """
    my_list = []
    for doc in doc_list:
        curr_doc = doc.lower()
        try:
            # Position of the keyword inside this document.
            curr_index = curr_doc.index(keyword.lower())
            my_list.append(curr_index)
        except ValueError:
            # str.index raises ValueError when the keyword is absent.
            my_list.append(None)
    return my_list
print(word_search(['The Learn Python Challenge Casino', 'They bought a car, and a horse', 'Casinoville?'], 'casino'))
output: [27, None, 0]
As you can see, my code returns a list at the end of the function definition, as the problem requests.
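Note that the listing above returns character positions rather than document indices, and it also matches 'Casinoville?'. A minimal loops-and-ifs sketch that follows the three criteria from the question might look like the following. It assumes words are separated by whitespace and that periods and commas are the only punctuation to ignore, as the exercise states.

```python
def word_search(doc_list, keyword):
    """Return indices of documents containing keyword as a whole word."""
    indices = []
    target = keyword.lower()
    for i, doc in enumerate(doc_list):
        for token in doc.split():
            # Drop leading/trailing periods and commas, and ignore case.
            if token.strip('.,').lower() == target:
                indices.append(i)
                break  # one match per document is enough
    return indices


docs = ['The Learn Python Challenge Casino',
        'They bought a car, and a horse',
        'Casinoville?']
print(word_search(docs, 'casino'))  # [0]
```

Comparing whole tokens instead of single characters avoids the index arithmetic that made the original attempt hard to debug.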
A better approach to solve this would be to use the in operator, which tests whether one string contains another (Python strings have no contains() method).
So the algorithm would become:
def word_search(doc_list, keyword):
    list_to_return = []
    counter = 0
    for item in doc_list:
        # Substring test, so 'closed' also matches 'enclosed'.
        if keyword in item:
            list_to_return.append(counter)
        counter += 1
    return list_to_return
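def word_search(doc_list, keyword):
    res = []
    count = 0  # renamed from sum to avoid shadowing the built-in
    for i in range(len(doc_list)):  # len(doc_list)-1 would skip the last document
        if doc_list[i] == keyword:
            count = count + 1
            res.append(doc_list[i])
    return count, res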
Given a stringified phone number of non-zero length, write a function that returns all mnemonics for this phone number in any order.
def phoneNumberMnemonics(phoneNumber, Mnemonics=[''], idx=0):
    number_lookup={'0':['0'], '1':['1'], '2':['a','b','c'], '3':['d','e','f'], '4':['g','h','i'], '5':['j','k','l'], '6':['m','n','o'], '7':['p','q','r','s'], '8':['t','u','v'], '9':['w','x','y','z']}
    if idx==len(phoneNumber):
        return Mnemonics
    else:
        new_Mnemonics=[]
        for letter in number_lookup[phoneNumber[idx]]:
            for mnemonic in Mnemonics:
                new_Mnemonics.append(mnemonic+letter)
        phoneNumberMnemonics(phoneNumber, new_Mnemonics, idx+1)
If I use the input "1905", my function outputs null. Using a print statement right before the return statement, I can see that the list Mnemonics is
['1w0j', '1x0j', '1y0j', '1z0j', '1w0k', '1x0k', '1y0k', '1z0k', '1w0l', '1x0l', '1y0l', '1z0l']
Which is the correct answer. Why is null being returned?
I am not very good at implementing recursion (yet?), your help is appreciated.
There are different recursive expressions of this problem, but the simplest to think about when you are starting out is a "pure functional" one. This means you never mutate recursively determined values. Rather compute fresh new ones: lists, etc. (Python does not give you a choice regarding strings; they're always immutable.) In this manner you can think about values only, not how they're stored and what's changing them, which is extremely error prone.
A pure-functional way to think about this problem is this:
If the phone number is the empty string, then the return value is just a list containing the empty string.
Else break the number into its first character and the rest. Recursively get all the mnemonics R of the rest. Then find all the letters corresponding to the first character and prepend each of these to each member of R to make a new string. (This is called a Cartesian product, which comes up often in recursion.) Return all of those strings.
In this expression, the pure function has the form
M(n: str) -> list[str]:
It's accepting a string of digits and returning a list of mnemonics.
Putting this thought into python is fairly simple:
LETTERS_BY_DIGIT = {
'0':['0'],
'1':['1'],
'2':['a','b','c'],
'3':['d','e','f'],
'4':['g','h','i'],
'5':['j','k','l'],
'6':['m','n','o'],
'7':['p','q','r','s'],
'8':['t','u','v'],
'9':['w','x','y','z'],
}
def mneumonics(n: str):
    if len(n) == 0:
        return ['']
    rest = mneumonics(n[1:])
    first = LETTERS_BY_DIGIT[n[0]]
    rtn = []  # A fresh list to return.
    for f in first:  # Cartesian cross:
        for r in rest:  # first X rest
            rtn.append(f + r)  # Fresh string
    return rtn
print(mneumonics('1905'))
Note that this code does not mutate the recursive return values rest at all. It makes a new list of new strings.
When you've mastered all the Python idioms, you'll see a slicker way to code the same thing:
def mneumonics(n: str):
    return [''] if len(n) == 0 else [
        c + r for c in LETTERS_BY_DIGIT[n[0]] for r in mneumonics(n[1:])]
Is this the most efficient code to solve this problem? Absolutely not. But this isn't a very practical thing to do anyway. It's better to go for a simple, correct solution that's easy to understand rather than worry about efficiency before you have a solid grasp of this way of thinking.
As others have said, using recursion at all on this problem is not a great choice if this were a production requirement.
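For comparison, a non-recursive sketch using itertools.product from the standard library is shown here. The function name mnemonics_iterative is made up for illustration, and the lookup table stores each digit's letters as a string instead of a list, which is an assumption of this sketch and not taken from the answers above.

```python
from itertools import product

LETTERS_BY_DIGIT = {
    '0': '0', '1': '1', '2': 'abc', '3': 'def', '4': 'ghi',
    '5': 'jkl', '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz',
}


def mnemonics_iterative(number: str) -> list:
    # One group of candidate letters per digit, in order.
    groups = [LETTERS_BY_DIGIT[d] for d in number]
    # product() yields every combination that takes one letter per group.
    return [''.join(combo) for combo in product(*groups)]


print(mnemonics_iterative('1905'))
```

The order of the results differs from the recursive versions, which is fine because the problem accepts any order.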
The correct list (Mnemonics) was generated in the deepest call of the recursion. However, it was not passed back to the previous calls.
To fix this, Mnemonics not only needs to be returned from the "else" block, it also needs to be assigned the output of the recursive call to phoneNumberMnemonics.
def phoneNumberMnemonics(phoneNumber, Mnemonics=[''], idx=0):
    number_lookup={'0':['0'], '1':['1'], '2':['a','b','c'], '3':['d','e','f'], '4':['g','h','i'], '5':['j','k','l'], '6':['m','n','o'], '7':['p','q','r','s'], '8':['t','u','v'], '9':['w','x','y','z']}
    print(idx, len(phoneNumber))
    if idx==len(phoneNumber):
        pass
    else:
        new_Mnemonics=[]
        for letter in number_lookup[phoneNumber[idx]]:
            for mnemonic in Mnemonics:
                new_Mnemonics.append(mnemonic+letter)
        Mnemonics=phoneNumberMnemonics(phoneNumber, new_Mnemonics, idx+1)
    return Mnemonics
I still feel that I'm lacking sophistication in my understanding of recursion. Advice, feedback, and clarifications are welcome.
I am able to convert Hindi text written in Latin (English) letters back to Devanagari:
import codecs,string
from indic_transliteration import sanscript
from indic_transliteration.sanscript import SchemeMap, SCHEMES, transliterate
def is_hindi(character):
    maxchar = max(character)
    if u'\u0900' <= maxchar <= u'\u097f':
        return character
    else:
        print(transliterate(character, sanscript.ITRANS, sanscript.DEVANAGARI))
character = 'bakrya'
is_hindi(character)
Output:
बक्र्य
But if I try something like this, nothing gets converted:
character = 'Bakrya विकणे आहे'
is_hindi(character)
Output:
Bakrya विकणे आहे
Expected Output:
बक्र्य विकणे आहे
I also tried the library Polyglot but I am getting similar results with it.
Preface: I know nothing of devanagari, so you will have to bear with me.
First, consider your function. It can return two things, character or None (print just outputs something, it doesn't actually return a value). That makes your first output example originate from the print function, not Python evaluating your last statement.
Then, when you consider your second test string, it will see that there's some Devanagari text and just return the string back. What you have to do, if this transliteration works as I think it does, is to apply this function to every word in your text.
I modified your function to:
def is_hindi(character):
    maxchar = max(character)
    if u'\u0900' <= maxchar <= u'\u097f':
        return character
    else:
        return transliterate(character, sanscript.ITRANS, sanscript.DEVANAGARI)
and modified your call to
' '.join(map(is_hindi, character.split()))
Let me explain, from right to left. First, I split your test string into the separate words with .split(). Then, I map (i.e., apply the function to every element) the new is_hindi function to this new list. Last, I join the separate words with a space to return your converted string.
Output:
'बक्र्य विकणे आहे'
If I may suggest, I would place this splitting/mapping functionality into another function, to make things easier to apply.
Edit: I had to modify your test string from 'Bakrya विकणे आहे' to 'bakrya विकणे आहे' because B wasn't being converted. This can be fixed in a generic text with character.lower().
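The suggestion to move the splitting and mapping into its own function could be sketched as follows. The names is_devanagari and convert_text are hypothetical, and the word-level converter is passed in as a parameter so the sketch runs without the transliteration library; str.upper is only a stand-in for the demo. Lowercasing each word follows the Edit above, though some transliteration schemes give capital letters their own meaning, so treat that step as an assumption.

```python
def is_devanagari(word):
    # Same range check as is_hindi: U+0900..U+097F is the Devanagari block.
    return '\u0900' <= max(word) <= '\u097f'


def convert_text(text, convert_word):
    """Apply convert_word to every word that is not already Devanagari."""
    converted = []
    for word in text.split():
        if is_devanagari(word):
            converted.append(word)
        else:
            # Lowercase first, since the capital B was not being converted.
            converted.append(convert_word(word.lower()))
    return ' '.join(converted)


# Stand-in converter for the demo; it is not a real transliteration.
print(convert_text('Bakrya विकणे आहे', str.upper))  # BAKRYA विकणे आहे
```

With the library from the question, the second argument would be something like lambda w: transliterate(w, sanscript.ITRANS, sanscript.DEVANAGARI).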
I have lists of strings. Some are hashtags, like #rabbitsarecool; others are short pieces of prose, like "My rabbits name is fred."
I have written a program to separate them:
def seperate_hashtags_from_prose(*strs):
    props = []
    hashtags = []
    for x in strs:
        if x[0]=="#" and x.find(' ')==-1:
            hashtags += x
        else:
            prose += x
    return hashtags, prose
seperate_hashtags_from_prose(["I like cats","#cats","Rabbits are the best","#Rabbits"])
This program does not work. When I debug the above example, it tells me that on the first loop iteration:
x=["I like cats","#cats","Rabbits are the best","#Rabbits"]
This is not what I would have expected. My intuition is that something about the way the loop over optional arguments is constructed is causing the error, but I can't see why.
There are several issues.
The most obvious is switching between props and prose. The code you posted does not run.
As others have commented, if you use the * in the function call, you should not make the call with a list. You could use seperate_hashtags_from_prose("I like cats","#cats","Rabbits are the best","#Rabbits") instead.
The line hashtags += x does not do what you think it does. Using += on a list extends it with the elements of the right-hand iterable, and the elements of a string are its individual characters, so each string gets split into characters. You probably meant hashtags.append(x) instead.
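A short sketch of the difference between += and append, followed by one possible corrected version of the function. The corrected spelling separate_hashtags_from_prose and the use of startswith are choices of this sketch and not part of the original post.

```python
hashtags = []
hashtags += "#cats"        # += extends the list with each character
print(hashtags)            # ['#', 'c', 'a', 't', 's']

hashtags = []
hashtags.append("#cats")   # append adds the string as one element
print(hashtags)            # ['#cats']


def separate_hashtags_from_prose(*strs):
    hashtags = []
    prose = []
    for x in strs:
        # A hashtag starts with '#' and contains no spaces.
        if x.startswith("#") and " " not in x:
            hashtags.append(x)
        else:
            prose.append(x)
    return hashtags, prose


# Pass the strings as separate arguments, or unpack a list with *my_list.
print(separate_hashtags_from_prose("I like cats", "#cats",
                                   "Rabbits are the best", "#Rabbits"))
# (['#cats', '#Rabbits'], ['I like cats', 'Rabbits are the best'])
```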
I have an inverted index (as a dictionary) and I want to take a boolean search query as an input to process it and produce a result.
The inverted index is like this:
{
Test : { FileName1: [213, 1889, 27564], FileName2: [133, 9992866, 27272781, 78676818], FileName3: [9211] },
Try : { FileName4 ...
.....
}
Now, given a boolean search query, I have to return the result.
Examples:
Boolean Search Query: test AND try
The result should be all documents that have the words test and try.
Boolean Search Query: test OR try
The result should be all documents that have either test or try.
Boolean Search Query: test AND NOT try
The result should be all documents that have test but not try.
How can I build this search engine to process the given boolean search query?
Thanks in advance!
EDIT: I am retaining the first part of my answer, because if this WASN'T a school assignment, this would be in my opinion still a better way to go about the task. I replace the second part of the answer with update matching OP's question.
What you appear to want to do is to create a query string parser, which would read the query string and translate it into a series of AND/OR/NOT combos to return the correct keys.
There are 2 approaches to this.
According to what you wrote that you need, by far the simplest solution would be to load the data into any SQL database (SQLite, for example, which does not require a full-blown running SQL server), load dictionary keys as a separate field (the rest of your data may all be in a single another field, if you don't care about normal forms &c), and translate incoming queries to SQL, approximately like this:
SQL table has at least this:
CREATE TABLE my_data(
dictkey text,
data text);
python_query="foo OR bar AND NOT gazonk"
sql_keywords=["AND","NOT","OR"]
sql_query=[]
for word in python_query.split(" "):
    if word in sql_keywords:
        sql_query+=[ word ]
    else:
        sql_query+=["dictkey='%s'" % word]
real_sql_query=" ".join(sql_query)
This needs some escaping and checking against SQL injection and special characters, but in general it would just translate your query into SQL, which, when run against the SQL database, would return the keys (and possibly data) for further processing.
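One way to handle the escaping is to let the database driver bind the words as parameters. The sketch below uses the standard-library sqlite3 module with an in-memory database; the helper name build_query and the sample rows are made up for illustration. A malformed query, such as two search words in a row, would still produce invalid SQL, so the control checking mentioned above is still needed.

```python
import sqlite3

SQL_KEYWORDS = {"AND", "OR", "NOT"}


def build_query(python_query):
    """Translate 'foo OR bar AND NOT gazonk' into SQL text plus parameters."""
    clauses, params = [], []
    for word in python_query.split():
        if word in SQL_KEYWORDS:
            clauses.append(word)
        else:
            clauses.append("dictkey = ?")  # placeholder, never the raw word
            params.append(word)
    sql = "SELECT dictkey, data FROM my_data WHERE " + " ".join(clauses)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data(dictkey text, data text)")
conn.executemany("INSERT INTO my_data VALUES (?, ?)",
                 [("foo", "a"), ("bar", "b"), ("gazonk", "c")])

sql, params = build_query("foo OR bar AND NOT gazonk")
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Because the search words never become part of the SQL text, a word containing a quote character cannot change the structure of the statement.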
Now for the pure Python version.
What you need to do is to analyze the string you get and apply the logic to your existing Python data.
Analyzing the string to reduce it to specific components (and their interactions) is parsing. If you actually wanted to build your own fully fledged parser, there would be Python modules for that, however, for a school assignment, I expect you are tasked to build your own.
From your description, the query can be expressed in quasi BNF form as:
(<[NOT] word> <AND|OR>)...
Since you say that operator priority is not relevant at all, you can do it the easy way and parse word by word.
Then you have to match the keywords to the filenames, which, as mentioned in another answer, is easiest to do with sets.
So, it could go approximately like this:
import re
query="foo OR bar AND NOT gazonk"
result_set=set()
operation=None
for word in re.split(" +(AND|OR) +",query):
    #word will be in ['foo', 'OR', 'bar', 'AND', 'NOT gazonk']
    inverted=False # for "NOT word" operations
    if word in ['AND','OR']:
        operation=word
        continue
    if word.find('NOT ') == 0:
        if operation == 'OR':
            # generally "OR NOT" operation does not make sense, but if it does in your case, you
            # should update this if() accordingly
            continue
        inverted=True
        # the word is inverted!
        realword=word[4:]
    else:
        realword=word
    if operation is not None:
        # now we need to match the key and the filenames it contains:
        current_set=set(inverted_index[realword].keys())
        if operation == 'AND':
            if inverted is True:
                result_set -= current_set
            else:
                result_set &= current_set
        elif operation == 'OR':
            result_set |= current_set
    operation=None
print(result_set)
Note that this is not a complete solution (for example it does not include dealing with the first term of the query, and it requires the boolean operators to be in uppercase), and is not tested. However, it should serve the primary purpose of showing you how to go about it. Doing more would be writing your course work for you, which would be bad for you. Because you are expected to learn how to do it so you can understand it. Feel free to ask for clarifications.
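For readers who are not doing this as coursework, a fuller sketch of the same left-to-right idea is given here, including the first term of the query. It assumes the index keys are lowercase words mapping to dicts keyed by file name, that operators are uppercase, and that NOT means "every indexed file except these"; the function name evaluate and the sample index are hypothetical.

```python
def evaluate(query, inverted_index):
    """Evaluate 'a AND b OR NOT c' strictly left to right, no precedence."""
    # Every file name known to the index, needed to complement a NOT term.
    all_files = set()
    for postings in inverted_index.values():
        all_files.update(postings)

    result = None
    operation = None
    negate = False
    for token in query.split():
        if token in ("AND", "OR"):
            operation = token
        elif token == "NOT":
            negate = True
        else:
            files = set(inverted_index.get(token.lower(), {}))
            if negate:
                files = all_files - files
                negate = False
            if result is None:          # first term of the query
                result = files
            elif operation == "AND":
                result &= files
            elif operation == "OR":
                result |= files
    return result if result is not None else set()


index = {
    "test": {"FileName1": [213], "FileName2": [133]},
    "try": {"FileName2": [9211], "FileName3": [7]},
}
print(sorted(evaluate("test AND try", index)))      # ['FileName2']
print(sorted(evaluate("test OR try", index)))       # ['FileName1', 'FileName2', 'FileName3']
print(sorted(evaluate("test AND NOT try", index)))  # ['FileName1']
```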
Another approach could be an in-memory intersection of the posting lists (for your AND cases, you can enhance this for OR, NOT, etc).
Below is a simple merge algorithm to be performed on the posting lists, assuming that the lists are sorted (in increasing doc_id order, which is easy to achieve if we index our terms correctly). This improves the time complexity to O(n1+n2), because we perform a linear merge on sorted lists and may stop early.
Now assume our positional inverted index looks like this (similar to yours, but with posting lists stored as lists rather than dicts, which allows compression in future uses): it maps each string to a term, where each term consists of (tf, posting list [P1, P2, ...]) and each posting has (docid, df, position list). We can then perform a simple AND over all of our posting lists iteratively:
def search(self, sq: BoolQuery) -> list:
    # Performs a search from a given query in boolean retrieval model,
    # Supports AND queries only and returns sorted document ID's as result:
    if sq.is_empty():
        return super().search(sq)
    terms = [self.index[term] for term in sq.get_terms() if term in self.index]
    if not terms:
        return []
    # Iterate over posting lists and intersect:
    result, terms = terms[0].pst_list, terms[1:]
    while terms and result:
        result = self.intersect(result, terms[0].pst_list)
        terms = terms[1:]
    return [p.id for p in result]
Now let's look at the intersection:
def intersect(p1: list, p2: list) -> list:
    # Performs linear merge of 2x sorted lists of postings,
    # Returns the intersection between them (== matched documents):
    res, i, j = list(), 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i].id == p2[j].id:
            res.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i].id < p2[j].id:
            i += 1
        else:
            j += 1
    return res
This simple algorithm can be later expanded when performing phrase search (edit the intersection to calculate slop distance, e.g: |pos1-pos2| < slop)
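The same linear merge carries over to the OR case mentioned at the start of this answer. The sketch below is an assumption-laden simplification: it works on plain sorted lists of integer document IDs instead of posting objects with an id attribute, and the name union is made up here.

```python
def union(p1, p2):
    """Linear merge of two sorted lists of doc IDs, keeping each ID once."""
    res, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            res.append(p1[i])
            i, j = i + 1, j + 1
        elif p1[i] < p2[j]:
            res.append(p1[i])
            i += 1
        else:
            res.append(p2[j])
            j += 1
    # One list is exhausted; the remainder of the other is already sorted.
    res.extend(p1[i:])
    res.extend(p2[j:])
    return res


print(union([1, 4, 7], [2, 4, 9, 10]))  # [1, 2, 4, 7, 9, 10]
```

Unlike the intersection, the union cannot stop early, but it still runs in O(n1+n2).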
Taking into account you have that inverted index and that is a dict with test and try as keys you can define the following functions and play with them:
def intersection(list1, list2):
    return list(set(list1).intersection(list2))
def union(list1, list2):
    return list(set(list1).union(list2))
def notin(list1, list2):
    # Elements of list2 that do not appear in list1.
    return [x for x in list2 if x not in list1]
intersection(inverted_index['people'].keys(), intersection(inverted_index['test'].keys(), inverted_index['try'].keys()))
I have a class I've called Earthquake, and it has a location as a string, and a few other parts that aren't important to this question (I don't think).
I've written a function (filter_by_place) that iterates through a list of Earthquakes that I've passed it, and looks for a given word in each Earthquake location string. If the word is found in the Earthquake's location, then it adds that Earthquake to a list. My problem is that it cannot be case sensitive, and I'm trying to make it that way by looking for an all lowercase word in an all lowercase version of the location string.
def filter_by_place(quakes, word):
    lst = []
    for quake in quakes:
        if word.lower in (quake.place).lower:
            lst.append(quake)
    return lst
I get an error saying "TypeError: argument of type 'builtin_function_or_method' is not iterable"
So, my question is: How do I get that string within the class to become lowercase just for this function so I can search for the word without worrying about case sensitivity?
I've already tried adding
if word.lower or word.upper in quake.place:
inside the for loop, but that didn't work, and I can understand why. Help?
You're getting the error because you're not actually calling the lower string method. I am guessing you're coming from Ruby, where the parentheses wouldn't be required.
Try:
def filter_by_place(quakes, word):
    lst = []
    for quake in quakes:
        if word.lower() in quake.place.lower():
            lst.append(quake)
    return lst
You need to do word.lower(). You're missing the parentheses. Happens all the time :)
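A small sketch of what goes wrong without the parentheses. The sample strings are made up for illustration.

```python
word = "Alaska"
place = "10km N of Anchorage, ALASKA"

print(word.lower)      # a bound method object, not a string
print(word.lower())    # alaska

try:
    word.lower in place.lower   # both sides are methods, so `in` fails
except TypeError as exc:
    print("TypeError:", exc)

print(word.lower() in place.lower())  # True
```

Without the call, the in operator is asked to search inside a method object, which is what the TypeError in the question is complaining about.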
Try this:
def filter_by_place(quakes, word):
    lst = []
    for quake in quakes:
        if word.lower() in quake.place.lower():
            lst.append(quake)
    return lst
Because you're using Python, you might want to take a look at list comprehensions. Your code would then look something like this:
def filter_by_place(quakes, place):
    # Case-insensitive substring test on each quake's location.
    return [quake for quake in quakes if place.lower() in quake.place.lower()]