Converting URL titles to standard titles - python

Suppose I have the following list:
[('2015-2016-regular', '2016-playoff'), ('2016-2017-regular', '2017-playoff'), ('2017-2018-regular',)]
which represents the two previous complete NHL seasons and the current one.
I would like to convert it so that it gives me:
[('Regular Season 2015-2016 ', 'Playoff 2016'), ('Regular Season 2016-2017', 'Playoff 2017'), ('Regular Season 2017-2018 ',)]
My English is not good, and these strings will be used as titles. Are there any errors in the last list?
How could I write a function that does this conversion while respecting the 80-character line limit?

This is a little hacky, but it's an odd question and use case so oh well. Since you have a really limited set of replacements, you can just use a dict to define them and then use a list comprehension with string formatting:
repl_dict = {
    '-regular': 'Regular Season ',
    '-playoff': 'Playoff '
}
new_list = [
    tuple(
        '{}{}'.format(repl_dict[name[name.rfind('-'):]], name[:name.rfind('-')])
        for name in tup
    )
    for tup in url_list
]
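Since the question asks for a function, here is a self-contained sketch of the same lookup idea wrapped in one. The names `LABELS`, `to_title` and `convert` are made up for this illustration, and it assumes every URL name ends in `-regular` or `-playoff`.

```python
# Hypothetical helper names; assumes each name ends in '-regular' or '-playoff'.
LABELS = {
    'regular': 'Regular Season',
    'playoff': 'Playoff',
}


def to_title(name):
    """Turn '2015-2016-regular' into 'Regular Season 2015-2016'."""
    years, _, kind = name.rpartition('-')  # split on the last hyphen only
    return '{} {}'.format(LABELS[kind], years)


def convert(url_list):
    """Apply to_title to every name, keeping the tuple structure."""
    return [tuple(to_title(name) for name in tup) for tup in url_list]


url_list = [
    ('2015-2016-regular', '2016-playoff'),
    ('2016-2017-regular', '2017-playoff'),
    ('2017-2018-regular',),
]
print(convert(url_list))
```

Keeping the lookup table separate from the two short functions also keeps every line well under 80 characters.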

I tried this: I unpack each tuple, split every name on '-', and join the parts in the new order. The capitalize() function makes the first letter uppercase. I also need to handle whether the tuple has one or two elements.
l = [('2015-2016-regular', '2016-playoff'), ('2016-2017-regular', '2017-playoff'), ('2017-2018-regular',)]
ans = []
for i in l:
    if len(i) == 2:
        fir = i[0].split('-')
        sec = i[1].split('-')
        ans.append((fir[2].capitalize() + " " + fir[0] + '-' + fir[1],
                    sec[1].capitalize() + " " + sec[0]))
    else:
        fir = i[0].split('-')
        ans.append((fir[2].capitalize() + " " + fir[0] + '-' + fir[1],))
print(ans)
Output:
[('Regular 2015-2016', 'Playoff 2016'), ('Regular 2016-2017', 'Playoff 2017'), ('Regular 2017-2018',)]

Remove combination of string in dataset in Python

I have a dataset in Python where I want to remove certain combinations of words from columnX and store the result in a new columnY.
Example of 2 rows of columnX:
what is good: the weather what needs improvement: the house
what is good: everything what needs improvement: nothing
I want to delete the following combinations of words: "what is good" & "what needs improvement".
In the end the following text should remain in the columnY:
the weather the house
everything nothing
I have the following script:
stoplist={'what is good', 'what needs improvement'}
dataset['columnY']=dataset['columnX'].apply(lambda x: ''.join([item in x.split() if item not in stoplist]))
But it doesn't work. What am I doing wrong here?
In your case nothing is replaced because the condition if item not in stoplist (in item in x.split() if item not in stoplist) checks whether a single word matches a whole phrase in the stoplist, which it never does.
Instead, combine your stop phrases into a regex pattern and use it for the replacement, as shown below:
df['columnY'] = df.columnX.replace(rf"({'|'.join(f'({i})' for i in stoplist)}): ", "", regex=True)
                                             columnX                columnY
0  what is good: the weather what needs improveme...  the weather the house
1  what is good: everything what needs improvemen...     everything nothing
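To try the pattern without pandas, the following standard-library sketch applies the same kind of alternation to the two sample rows from the question. The use of `re.escape` and a non-capturing group is an addition for this illustration, not part of the answer above.

```python
import re

# Sample rows copied from the question; no pandas needed for this illustration.
rows = [
    'what is good: the weather what needs improvement: the house',
    'what is good: everything what needs improvement: nothing',
]
stoplist = ['what is good', 'what needs improvement']

# re.escape guards against phrases that contain regex metacharacters.
pattern = re.compile('(?:' + '|'.join(re.escape(p) for p in stoplist) + '): ')

cleaned = [pattern.sub('', row) for row in rows]
print(cleaned)
```

The pattern also consumes the trailing colon and space, so no stray ": " is left behind in the result.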
Maybe you can operate on the column itself.
df["Y"] = df["X"]
df.Y = df.Y.str.replace("what is good", "")
So you would have to do this for every item in your stop list. But I am not sure how many items you have.
So, for example,
replacement_map = {"what needs improvement": "", "what is good": ""}
for old, new in replacement_map.items():
    df.Y = df.Y.str.replace(old, new)
if you need to specify different replacements, or
items_to_replace = ["what needs improvement", "what is good"]
for item_to_replace in items_to_replace:
    df.Y = df.Y.str.replace(item_to_replace, "")
if the item should always be deleted.
Or you can skip the loop if you express it as a regex:
items_to_replace = ["what needs improvement", "what is good"]
replace_regex = "|".join(items_to_replace)
df.Y = df.Y.str.replace(replace_regex, "", regex=True)  # regex=True is required in pandas 2.0+
(Credits: @MatBailie & @romanperekhrest)
Another way, without using a regex and still using apply, is a simple function:
def func(s):
    for item in stoplist:
        s = s.replace(item, '')
    return s
df['columnY'] = df['columnX'].apply(func)

How to select sub-strings based on the presence of word pairs? Python

I have a large number of sentences, from which I want to extract sub-sentences that start with certain word combinations. For example, I want to extract sentence segments that begin with "what does" or "what is", etc. (essentially eliminating the words from the sentence that appear before the word pairs). Both the sentences and the word pairs are stored in a DataFrame:
   'Sentence'                                   'First2'
0  If this is a string what does it say?       can I
1  And this is a string, should it say more?   should it
2  This is yet another string.                 what does
3  etc. etc.                                   etc. etc
The result I want from the above example would be:
0 what does it say?
1 should it say more?
2
The most obvious solution (at least to me) below does not work. It only uses the first word-pair b to go over all the sentences r, but not the other b's.
a = df['Sentence']
b = df['First2']
# The function seems to loop over all r's but only over the first b:
def func(r):
    for x in b:
        if x in r:
            s = r[r.index(x):]
            return s
        else:
            return ''
df['Segments'] = a.apply(func)
It seems that looping over two DataFrames simultaneously in this way does not work. Is there a more efficient and effective way to do this?
I believe there is a bug in your code.
        else:
            return ''
This means that if the first comparison is not a match, func returns immediately without trying the remaining word pairs. That might be why the code does not return the expected matches.
Sample working code is below:
# Try every word pair; return '' only after none of them matched.
def func(sentence, first_twos=b):
    for first_two in first_twos:
        if first_two in sentence:
            s = sentence[sentence.index(first_two):]
            return s
    return ''
df['Segments'] = a.apply(func)
And the output:
df:
{
'First2': ['can I', 'should it', 'what does'],
'Segments': ['what does it say? ', 'should it say more?', ''],
'Sentence': ['If this is a string what does it say? ', 'And this is a string, should it say more?', 'This is yet another string. ' ]
}
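The fix can be checked without a DataFrame. The sketch below uses plain lists holding the sample data from the question; the names `first_segment`, `sentences` and `prefixes` are chosen here for illustration only.

```python
def first_segment(sentence, prefixes):
    """Return sentence from the first matching prefix (in list order), else ''."""
    for prefix in prefixes:
        if prefix in sentence:
            return sentence[sentence.index(prefix):]
    # Reached only after every prefix has been tried.
    return ''


sentences = [
    'If this is a string what does it say?',
    'And this is a string, should it say more?',
    'This is yet another string.',
]
prefixes = ['can I', 'should it', 'what does']

segments = [first_segment(s, prefixes) for s in sentences]
print(segments)
```

The only structural difference from the failing version is that the fallback `return ''` sits after the loop rather than in an `else` branch inside it.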
You can loop over two things in parallel easily via zip(iterator, iterator_foo).
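As a small illustration of what zip does (the variable names are made up here): it pairs items by position, so each sentence is compared only with the word pair in the same row. That may or may not be what the question needs, since the asker wants every word pair tried against every sentence.

```python
sentences = ['If this is a string what does it say?',
             'And this is a string, should it say more?']
first2 = ['can I', 'should it']

# zip pairs items by position: sentence 0 with prefix 0, sentence 1 with prefix 1.
pairs = list(zip(sentences, first2))
matches = [prefix in sentence for sentence, prefix in pairs]
print(matches)
```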
My question was answered by the following code:
def func(r):
    for i in b:
        if i in r:
            q = r[r.index(i):]
            return q
    return ''
df['Segments'] = a.apply(func)
The solution was pointed out here by Daming Lu (only the last line is different from his). The problem was in the last two lines of the original code:
        else:
            return ''
This caused the function to return too early. Daming Lu's answer was better than the answer to the possible duplicate question python for-loop only executes once? which created other problems, as explained in my response to wii. (So I am not sure mine really is a duplicate.)

How to split the elements of strings in a list (Python)

I am very new to Python, and hope you can help me.
I have a list of strings called reviewerdetails that contains information on reviewers on Hostelworld. In each string, there are three elements: the country, the gender and the agegroup of the reviewer. For example, the first case looks like this:
'\n Belgium, Female, 18-24 '
I want to create three separate lists for these three elements, but I am not sure how to select elements within a string within a list? I have tried the .split function, but I get the error
AttributeError: 'list' object has no attribute 'split'.
I found this question: split elements of a list in python that sort of tries to do what I want to do, but I do not know how to apply the answer to my problem.
Unfortunately we can't use assignments in list comprehensions, so this needs to be done in an explicit for loop (if we don't want to call .split and iterate 3 times)
li = ['\n Belgium, Female, 18-24 ',
      '\n Belgium, Male, 18-24 ']
li = [elem.split() for elem in li]
print(li)
# [['Belgium,', 'Female,', '18-24'], ['Belgium,', 'Male,', '18-24']]
countries, genders, ages = [], [], []
for elem in li:
    countries.append(elem[0])
    genders.append(elem[1])
    ages.append(elem[2])
print(countries)
print(genders)
print(ages)
# ['Belgium,', 'Belgium,']
# ['Female,', 'Male,']
# ['18-24', '18-24']
Something like this for a single string (here the first one), using split and filtering out empty strings.
mylist = [x.strip() for x in reviewerdetails[0].split(" ") if len(x.strip()) > 0]
Try using a list comprehension:
output = [s.split(",") for s in reviewerdetails]  # pass whatever separator you need to split()
The split function is a method of strings, not lists, so you need to apply it to each element of the list, not to the list itself.
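Putting that together, one possible sketch for getting the three separate lists is to split each string on commas and then transpose the result with `zip(*rows)`. The sample data is modeled on the question, and the code assumes every string has exactly three comma-separated parts.

```python
# Sample data modeled on the question; the real list would come from the scraper.
reviewerdetails = [
    '\n Belgium, Female, 18-24 ',
    '\n Belgium, Male, 18-24 ',
]

# Split each string (not the list) on commas and trim the whitespace.
rows = [[part.strip() for part in detail.split(',')] for detail in reviewerdetails]

# zip(*rows) transposes rows into columns; assumes every row has three parts.
countries, genders, ages = (list(column) for column in zip(*rows))

print(countries)
print(genders)
print(ages)
```

Splitting on the comma rather than on whitespace avoids the trailing commas and keeps multi-word country names in one piece.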
Hope I understood correctly. I think this is what you are trying to do:
main_list = [
    '\n Belgium, Female, 18-24 ',
    '\n Belgium, Female, 18-24 '
]
for s in main_list:
    # create a list split by comma
    sub_list = s.split(",")
    # clean up any whitespace from each list item
    sub_list = [x.strip() for x in sub_list]
    print(sub_list)
I am also new to Python, so this might be inefficient. The way I would do it would be to have three lists of known values, one for the countries, one for the genders, and one for the age ranges, plus three empty result lists. Then I would do for loops, so if the countries list has [...,'Belgium',...] in it, it would know. So for each reviewer, I would say
# known_countries, known_genders, known_ageranges: lists of possible values (names made up here)
# countries, genders, ageranges: empty result lists
for reviewer in reviewerdetails:
    for country in known_countries:
        if country in reviewer:
            countries.append(country)
            break
    for gender in known_genders:
        if gender in reviewer:
            genders.append(gender)
            break
    for agerange in known_ageranges:
        if agerange in reviewer:
            ageranges.append(agerange)
            break
So that way you have a list with all the countries of the reviewers, genders, and age ranges in order. Again, this is probably extremely inefficient and there are most likely much easier ways of doing it.

How to convert this function/for loop into a list comprehension or higher order function using python?

Hello all, I wrote the following simple translator program using a function and for loops, but I am trying to understand list comprehensions and higher-order functions better. I have a very basic grasp of functions such as map and of list comprehensions, but don't know how to work with them when the loop requires a placeholder value such as place_holder in the code below. Also, any suggestions on what I can do better would be greatly appreciated. Thanks in advance, you guys rock!
P.S how do you get that fancy formatting where my posted code looks like it does in notepad++?
sweedish = {'merry': 'god', 'christmas': 'jul', 'and': 'och', 'happy':'nytt','year':'ar'}
english =('merry christmas and happy new year')
def translate(s):
    new = s.split()  # split the string into a list
    place_holder = []  # empty list to hold the translated words
    for item in new:  # loop through each item in new
        if item in sweedish:
            place_holder.append(sweedish[item])  # if the item is found in sweedish, add the corresponding value to place_holder
    for item in place_holder:  # only way I know how to print a list out with no brackets, ' or other such items. Do you know a better way?
        print(item, end=' ')
translate(english)
Edit to show chepner's answer and chisaku's formatting tips:
sweedish = {'merry': 'god', 'christmas': 'jul', 'and': 'och', 'happy':'nytt','year':'ar'}
english =('merry christmas and happy new year')
new = english.split()
print(' '.join([sweedish[item] for item in new if item in sweedish] ))
A list comprehension simply builds a list all at once, rather than individually calling append to add items to the end inside a for loop.
place_holder = [ sweedish[item] for item in new if item in sweedish ]
The variable itself is unnecessary, since you can put the list comprehension directly in the for loop:
for item in [ sweedish[item] for item in new if item in sweedish ]:
    print(item, end=' ')
As @chepner says, you can use a list comprehension to concisely build your new list of words translated from English to Swedish.
To access the dictionary, you might want to use swedish.get(word, 'null_value_placeholder'), so you don't get a KeyError if your English word isn't in the dictionary.
In my example, 'None' is the placeholder for English words without a translation in the dictionary. You could just use '' as a placeholder acknowledging that the gaps in the dictionary only provide an approximate translation.
swedish = {'merry': 'god', 'christmas': 'jul', 'and': 'och', 'happy':'nytt','year':'ar'}
english = 'merry christmas and happy new year'
def translate(s):
    words = s.split()
    translation = [swedish.get(word, 'None') for word in words]
    print(' '.join(translation))
translate(english)
>>>
god jul och nytt None ar
Alternatively, you can put an if clause in your list comprehension so that it only attempts to translate words that show up in the dictionary.
def translate(s):
    words = s.split()
    translation = [swedish[word] for word in words if word in swedish]
    print(' '.join(translation))
translate(english)
>>>
god jul och nytt ar
The ' '.join(translation) call converts your list of words to a single string separated by ' '.
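Since the question also asks about higher-order functions, here is a rough equivalent of the filtered list comprehension written with `filter` and `map`. It is only a sketch for comparison; returning the string instead of printing it is a choice made for this example.

```python
swedish = {'merry': 'god', 'christmas': 'jul', 'and': 'och',
           'happy': 'nytt', 'year': 'ar'}
english = 'merry christmas and happy new year'


def translate(sentence):
    """Translate word by word, dropping words that have no dictionary entry."""
    # filter keeps only the words that are keys of the dictionary...
    known_words = filter(lambda word: word in swedish, sentence.split())
    # ...and map looks each remaining word up.
    return ' '.join(map(swedish.get, known_words))


print(translate(english))
```

Most Python code would favor the list comprehension here, because the filter and the lookup read in one place.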

Identifying strings which can't be spelt in a list item

I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings which can't be spelt in the list item and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk
spell_dict = enchant.Dict('en_US')  # or whatever language supported
def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w)/5 + 2  # just for example, allowing around 1 typo per 5 chars.
def check_word(word):
    if spell_dict.check(word):
        return True  # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add case normalisation and a filter for digits and you'll get a pretty good heuristic.
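The case normalisation and digit filter could look roughly like the sketch below, which uses only the standard library. The function names are invented for this example, and an all-letter random ID would still pass this step, so it is meant as a pre-filter before a dictionary check rather than a replacement for it.

```python
def normalise(token):
    """Lower-case a token so that 'Alex' and 'alex' are looked up the same way."""
    return token.lower()


def is_candidate(token):
    """Reject tokens containing digits or punctuation before any dictionary lookup."""
    return token.isalpha()


def clean_item(item):
    """Keep only the candidate words of one list item, normalised."""
    return ' '.join(normalise(t) for t in item.split() if is_candidate(t))


items = ['mPXSz0qd6j0 youtube ', 'sachin 47427243 ', 'search OpHQOO-DwlQ ',
         'alex smith ']
print([clean_item(item) for item in items])
```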
It is entirely possible to compare your list members to words that you don't believe to be valid for your input.
This can be done in many ways, partially depending on your definition of "properly spelled" and what you end up using for a comparison list. If you decide that numbers preclude an entry from being valid, or underscores, or mixed case, you could test for regex matching.
Post regex, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc' ('ad' is an abbreviation, 'hoc' is not a word))? Is it hyphens (this will break on hyphenated last names)?
With the above criteria decided, it is just a matter of choosing which word, proper-name, and common-slang lists to use, plus a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.
