Python word match

I have a list of URLs and I'm trying to filter them using specific keywords, say word1 and word2, and a list of stop words, say [stop1, stop2, stop3]. Is there a way to filter the links without using many if conditions? I got the proper output when I used an if condition for each stop word, but that doesn't look like a scalable approach. The following is the brute-force method:
for link in url:
    if word1 in link or word2 in link:
        if stop1 not in link:
            if stop2 not in link:
                if stop3 not in link:
                    links.append(link)

Here are a couple of options I would consider if I were in your situation.
You can use a list comprehension with the built-in any and all functions to filter the unwanted URLs out of your list:
urls = ['http://somewebsite.tld/word',
        'http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word2',
        'http://somewebsite.tld/word2/stop2',
        'http://somewebsite.tld/word3',
        'http://somewebsite.tld/stop3/word1',
        'http://somewebsite.tld/stop4/word1']
includes = ['word1', 'word2']
excludes = ['stop1', 'stop2', 'stop3']
filtered_url_list = [url for url in urls
                     if any(include in url for include in includes)
                     if all(exclude not in url for exclude in excludes)]
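With the sample list above this evaluates to:
print(filtered_url_list)
# ['http://somewebsite.tld/word1',
#  'http://somewebsite.tld/word2',
#  'http://somewebsite.tld/stop4/word1']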
Or you can write a function that takes one URL as an argument and returns True for URLs you want to keep and False for ones you don't, then pass that function along with the unfiltered list of URLs to the built-in filter function:
def urlfilter(url):
    includes = ['word1', 'word2']
    excludes = ['stop1', 'stop2', 'stop3']
    for include in includes:
        if include in url:
            for exclude in excludes:
                if exclude in url:
                    return False
            else:
                # for/else: the loop finished without finding an exclude
                return True

urls = ['http://somewebsite.tld/word',
        'http://somewebsite.tld/word1',
        'http://somewebsite.tld/word1/stop3',
        'http://somewebsite.tld/word2',
        'http://somewebsite.tld/word2/stop2',
        'http://somewebsite.tld/word3',
        'http://somewebsite.tld/stop3/word1',
        'http://somewebsite.tld/stop4/word1']

# In Python 3, filter returns an iterator, so wrap it in list()
filtered_url_list = list(filter(urlfilter, urls))
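This keeps the same three URLs as the comprehension above. Since Python 3's filter is lazy, you can also consume it directly without building a list:
for link in filter(urlfilter, urls):
    print(link)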

It would be helpful if you could cite an example. If we take an example with URLs like these:
def urlSearch():
    end_words = ['gmail', 'finance']
    Key_word = ['google']
    urlList = ['google.com//d/gmail', 'google.com/finance',
               'google.com/sports', 'google.com/search']
    for i in urlList:
        main_part = i.split('/')
        if main_part[-1] in end_words:  # the last path segment is an end word
            word = []
            for k in main_part[:-1]:
                for j in k.split('.'):
                    word.append(j)
            print(word)
            for p in Key_word:
                if p in word:
                    print("Url is: " + i)

urlSearch()
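For the sample list this prints (the empty string comes from the double slash in the first URL):
['google', 'com', '', 'd']
Url is: google.com//d/gmail
['google', 'com']
Url is: google.com/finance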

I would use sets and a list comprehension:
must_in = {word1, word2}
musnt_in = {stop1, stop2, stop3}

# set(x) on a string would give its individual characters, so split each
# URL into path segments before intersecting with the keyword sets
links = [x for x in url
         if must_in & set(x.split('/')) and not musnt_in & set(x.split('/'))]
print(links)
The code above works with any number of words and stops; it is not limited to two words (word1, word2) and three stops (stop1, stop2, stop3). Note that it matches whole path segments rather than substrings.
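A quick check with hypothetical concrete values (assumed purely for illustration):
word1, word2 = 'word1', 'word2'
stop1, stop2, stop3 = 'stop1', 'stop2', 'stop3'
url = ['http://somewebsite.tld/word1',
       'http://somewebsite.tld/word1/stop3',
       'http://somewebsite.tld/stop4/word1']
# -> ['http://somewebsite.tld/word1', 'http://somewebsite.tld/stop4/word1']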

Related

How to filter similar URLS located in a list of values nested in a dictionary?

How can I alter the below code to filter out similar URLs?
Examples of duplicates:
http://www.example.com/sunshine
https://example.com/sunshine
https://www.example.com/sunshine_
result = {}
for key, value in student_data.items():
    if value not in result.values():
        result[key] = value
print(result)
You'd need to come up with a precise definition of what you mean by 'similar'. Looking at your examples, it looks like you want to do the following:
Ignore the part before '//'
Ignore presence or absence of 'www.'
Ignore any trailing underscores
You can define a function to standardize URLs by removing these parts. You can edit this function later if you want to add any conditions to your definition.
Once you've done so, you can standardize every item in the list and then look for only unique items to see how many 'unique' URLs you have after standardizing. Something like the following:
urls = [
    'http://www.example.com/sunshine',
    'https://example.com/sunshine',
    'https://www.example.com/sunshine_'
]

def standardize(url):
    # Discard everything before '//'
    url = url.partition('//')[2]
    # Remove a leading 'www.' prefix (note: url.strip('www.') would strip
    # the characters {'w', '.'} from both ends, not the prefix string)
    if url.startswith('www.'):
        url = url[len('www.'):]
    # Strip '_' from beginning or end
    url = url.strip('_')
    return url

# Standardize every url in the list
standardized = [standardize(url) for url in urls]
# Remove duplicates
unique = list(set(standardized))
for u in unique:
    print(u)
Output:
example.com/sunshine
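If you later need a more robust notion of 'similar', here is a minimal sketch using the standard library's urllib.parse that applies the same three rules to parsed components (the helper standardize2 is my assumption, not part of the original answer):
from urllib.parse import urlsplit

def standardize2(url):
    # Normalize host (drop 'www.') and path (drop trailing '_') explicitly
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith('www.'):
        host = host[len('www.'):]
    return host + parts.path.rstrip('_')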

Filtering strings when multiple keywords match

I am new to Python, so it is challenging for me to extract lines from a file when certain words match in that line.
I have an array of HTML links, for example:
My Input
['https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/trackr-rebrands-to-adero-pivots-to-finding-whereabouts-of-groups-of-items/&ct=ga&cd=caeyacotntqymzy0njawnzu2mji3otq0mziazjhmndaxowrjnmviywm4otpjb206zw46vvm&usg=afqjcnebtnj9ybuywkwcp33xlzvkdtqndq', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/will-uber-gobble-up-lime-or-fly-off-with-bird/&ct=ga&cd=caeyasotntqymzy0njawnzu2mji3otq0mziazjhmndaxowrjnmviywm4otpjb206zw46vvm&usg=afqjcnf4upl3v1gzd5a1xr0pgpvc1zedya', 'https://www.google.com/alerts/remove?source=alertsmail&hl=en&gl=us&msgid=ntqymzy0njawnzu2mji3otq0mw&s=ab2xq4hy_egw7prfejiq3uhjazt-7cjtjoilna0', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=ntqymzy0njawnzu2mji3otq0mw', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=ntqymzy0njawnzu2mji3otq0mw', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/will-uber-gobble-up-lime-or-fly-off-with-bird/&ct=ga&cd=caeyacoumtq0mjmwnzuwmtg3odi4ndq5mtmygjbimdy5nmi3nmjkmwuymdq6y29tomvuolvt&usg=afqjcnf4upl3v1gzd5a1xr0pgpvc1zedya', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/hulu-to-top-23-million-subscribers-by-year-end/&ct=ga&cd=caeyasoumtq0mjmwnzuwmtg3odi4ndq5mtmygjbimdy5nmi3nmjkmwuymdq6y29tomvuolvt&usg=afqjcnfyn98cfz1e8oyay72qwdchsg_f_q', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/fintech-investors-and-founders-to-judge-startup-battlefield-africa/&ct=ga&cd=caeyaioumtq0mjmwnzuwmtg3odi4ndq5mtmygjbimdy5nmi3nmjkmwuymdq6y29tomvuolvt&usg=afqjcnfxredq8rapscoupmhdzbf-husqyw', 'https://www.google.com/alerts/remove?source=alertsmail&hl=en&gl=us&msgid=mtq0mjmwnzuwmtg3odi4ndq5mtm&s=ab2xq4hjxw0sqeep2yq6odjmq700btmzyqs3svy', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=mtq0mjmwnzuwmtg3odi4ndq5mtm', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=mtq0mjmwnzuwmtg3odi4ndq5mtm', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/podcast-series-c/&ct=ga&cd=caeyacoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcngnd6o3mwrmbj-uc-1a84mlixp26w', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/agricool-raises-another-28-million-to-grow-fruits-in-containers/&ct=ga&cd=caeyasoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcnecc5wtp2klzwob021zzcxodrkstg', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/fivetran-announces-15m-series-a-to-build-automated-data-pipelines/&ct=ga&cd=caeyaioumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcnfvplce8-juoffflxwe8-ttrqaz_g', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/bios-health/&ct=ga&cd=caeyayoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcngjpzu9t9hyjfjkaf1sefloujvjhq', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/freeletics-raises-45m-for-its-ai-powered-fitness-coach/&ct=ga&cd=caeybcoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcneh-xmavwlbin0hfswkmrbniousnw', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/mixcloud-select/&ct=ga&cd=caeybsoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcneh_kjqkido1dz30dgax2cv1-6g6w', 
'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/atomicos-fourth-state-of-the-european-tech-report-highlights-lots-of-rosy-numbers-but-also-a-discrimination-problem/&ct=ga&cd=caeybioumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcnfcgwt_rwsya4ulxw6im7mcy0a74q', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/fortressiq-raises-12m-to-bring-new-ai-twist-to-process-automation/&ct=ga&cd=caeybyoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcngn2mdhvdzsxhjpy9wako015rsd9w', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/birth-control-delivery-startup-nurx-now-offers-an-at-home-hpv-testing-kit/&ct=ga&cd=caeyccoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcnhufwynz2xvx8h7y5njesu5umrqbw', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/faraday-future-furloughs-more-employees-as-cash-woes-continue/&ct=ga&cd=caeycsoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcng7ma6lr8xqakdvwbcr9kmgkvhvnw', 'https://www.google.com/alerts/remove?source=alertsmail&hl=en&gl=us&msgid=mtm0mjyxmzezntg2oti0nju0odg&s=ab2xq4j8dtcluvhhgyayaorwyeut2bkvyp4mrac', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=mtm0mjyxmzezntg2oti0nju0odg', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=mtm0mjyxmzezntg2oti0nju0odg']
My expected output should be
['https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/agricool-raises-another-28-million-to-grow-fruits-in-containers/&ct=ga&cd=caeyasoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcnecc5wtp2klzwob021zzcxodrkstg','https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/freeletics-raises-45m-for-its-ai-powered-fitness-coach/&ct=ga&cd=caeybcoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcneh-xmavwlbin0hfswkmrbniousnw','https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/04/fortressiq-raises-12m-to-bring-new-ai-twist-to-process-automation/&ct=ga&cd=caeybyoumtm0mjyxmzezntg2oti0nju0odgygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcngn2mdhvdzsxhjpy9wako015rsd9w']
I want to filter links which contain specific keywords like "any number 0-9", "millions", "raises", "funding", "valuations", etc.
I have gone through many links on Stack Overflow but could not find what I was looking for. Any help will be much appreciated.
You can try:
candidates = ['https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/corporate-food-catering-startup-chewse-raises-19-million/&ct=ga&cd=caeyacoumtqxotu0mdi1mjkxndk4otc1mteygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcngcalj2l2089xqyzdr5clovuuvafq', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/sleep-tracking-ring-oura-raises-20-million-from-michael-dell-lance-armstrong-and-others/&ct=ga&cd=caeyasoumtqxotu0mdi1mjkxndk4otc1mteygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcng8wdz35c5krnjcnypbw21b0pihfg', 'https://www.google.com/alerts/remove?source=alertsmail&hl=en&gl=us&msgid=mtqxotu0mdi1mjkxndk4otc1mte&s=ab2xq4j8dtcluvhhgyayaorwyeut2bkvyp4mrac', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=mtqxotu0mdi1mjkxndk4otc1mte', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=mtqxotu0mdi1mjkxndk4otc1mte']
def interesting(text):
    text = text.lower()
    if any(word in text
           for word in ['billions', 'funding', 'valuations'] + ['%dm' % i for i in range(10)]):
        return True
    # Add other conditions
    return False

result = list(filter(interesting, candidates))
print(result)
The output for this example is:
['https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/corporate-food-catering-startup-chewse-raises-19-million/&ct=ga&cd=caeyacoumtqxotu0mdi1mjkxndk4otc1mteygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcngcalj2l2089xqyzdr5clovuuvafq', 'https://www.google.com/url?rct=j&sa=t&url=https://techcrunch.com/2018/12/03/sleep-tracking-ring-oura-raises-20-million-from-michael-dell-lance-armstrong-and-others/&ct=ga&cd=caeyasoumtqxotu0mdi1mjkxndk4otc1mteygjg0n2i3ogq3nmi1owu1yjk6y29tomvuolvt&usg=afqjcng8wdz35c5krnjcnypbw21b0pihfg', 'https://www.google.com/alerts/remove?source=alertsmail&hl=en&gl=us&msgid=mtqxotu0mdi1mjkxndk4otc1mte&s=ab2xq4j8dtcluvhhgyayaorwyeut2bkvyp4mrac', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=mtqxotu0mdi1mjkxndk4otc1mte', 'https://www.google.com/alerts?source=alertsmail&hl=en&gl=us&msgid=mtqxotu0mdi1mjkxndk4otc1mte']
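If the keyword list grows, a single compiled regular expression is an alternative. This is only a sketch with assumed keywords, so extend the pattern to match your actual terms:
import re

# Assumed keywords: funding-related words, or a digit followed by 'm'
KEYWORDS = re.compile(r'raises|million|funding|valuation|\dm', re.IGNORECASE)

result = [url for url in candidates if KEYWORDS.search(url)]
print(result)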

Simple wikipedia library that returns link list is not behaving as expected

I'm using the wikipedia library (pip install wikipedia) to make a simple "Philosophy step counter" and can't get the result I'm looking for. No matter what page I enter, the code never finds a link that matches any of the words in the first few sentences of the article. Code is below:
import os
import string
import wikipedia

wikipedia.set_lang("en")

print("Please type the name of a wikipedia article: ")
page = input()
print("Searching wikipedia for: " + page)
wikiPage = wikipedia.page(page)
print("Using top result: " + wikiPage.title)

# currentPage = wikipedia.page(wikiPage.links[0])
# List of links (sorted alphabetically, makes our job much harder)
links = wikiPage.links

# Split the beginning of the article into words
words = wikipedia.summary(wikiPage, sentences=3).split()
words = [''.join(c for c in s if c not in string.punctuation)
         for s in words]  # Sanitize list of words to remove punctuation

# comparisons = [a == b for (a, b) in itertools.product(words, links)]
x = 0
while words[x] not in links:
    print(words[x])
    x = x + 1
newPage = wikipedia.page(words[x])
Is this a fault of the library I'm using or of my code? The link list appears to be ordered alphabetically, if that makes any difference (hence why I'm doing all this in the first place).

Tuple trouble when trying to count elements in a list?

I am trying to count the number of contractions used by politicians in certain speeches. I have lots of speeches, but here are some of the URLs as a sample:
every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
                   'http://www.millercenter.org/president/obama/speeches/speech-4424',
                   'http://www.millercenter.org/president/obama/speeches/speech-4453',
                   'http://www.millercenter.org/president/obama/speeches/speech-4612',
                   'http://www.millercenter.org/president/obama/speeches/speech-5502']
I have a pretty rough counter right now - it only counts the total number of contractions used in all of those links. For example, the following code returns 79,101,101,182,224 for the five links above. However, I want to link up filename, a variable I create below, so I would have something like (speech_1, 79),(speech_2, 22),(speech_3,0),(speech_4,81),(speech_5,42). That way, I can track the number of contractions used in each individual speech. I'm getting the following error with my code: AttributeError: 'tuple' object has no attribute 'split'
Here's my code:
import urllib2, sys, os
from bs4 import BeautifulSoup, NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests

reload(sys)

url = 'http://www.millercenter.org/president/speeches'
url2 = 'http://www.millercenter.org'
conn = urllib2.urlopen(url)
html = conn.read()
miller_center_soup = BeautifulSoup(html)
links = miller_center_soup.find_all('a')
linklist = [tag.get('href') for tag in links if tag.get('href') is not None]

# remove all items in list that don't contain 'speeches'
linkslist = [_ for _ in linklist if re.search('speeches', _)]
del linkslist[0:2]

# concatenate 'http://www.millercenter.org' with each speech's URL ending
every_link_dups = [url2 + end_link for end_link in linkslist]

# remove duplicates
seen = set()
every_link = []  # no duplicates array
for l in every_link_dups:
    if l not in seen:
        every_link.append(l)
        seen.add(l)

def processURL_short_2(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return item_str, filename

every_link_test = every_link[0:5]
print every_link_test

count = 0
for l in every_link_test:
    content_1 = processURL_short_2(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    print count, filename
As the error message explains, you cannot use split the way you are using it: split is for strings, but processURL_short_2 returns a tuple.
So you will need to change this:
for word in content_1.split():
to this:
for word in content_1[0].split():
I chose [0] by running your code; the first element of the returned tuple is the chunk of text you are looking to search through.
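For illustration, with a hypothetical return value for processURL_short_2:
content_1 = ('four score and seven years ago ...', 'obama_4427')  # hypothetical values
words = content_1[0].split()  # splits the speech text into words
# content_1.split() would raise AttributeError: 'tuple' object has no attribute 'split'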
#TigerhawkT3 has a good suggestion you should follow in their answer too:
https://stackoverflow.com/a/32981533/1832539
Instead of print count, filename, you should save this data to a data structure like a dictionary. Since processURL_short_2 returns a tuple, you'll need to unpack it, and the count has to be reset for each speech:
data = {}  # initialize a dictionary
for l in every_link_test:
    content_1, filename = processURL_short_2(l)  # unpack the content and filename
    count = 0  # reset the count for each speech
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    data[filename] = count  # add this to the dictionary as filename:count
This would give you a dictionary like {'obama_4427': 79, 'obama_4424': 22, ...}, allowing you to easily store and access your parsed data.

Matching regex to list items in Python

I am attempting to write a Python script that shows the URL flow on my installation of nginx. I currently have the script opening my 'rewrites' file, which contains a list of regexes and locations like so:
rewritei ^/ungrad/info.cfm$ /ungrad/info/ permanent;
Python then reads the file and trims the first and last word off (rewritei and permanent;), which just leaves a list like so:
[
    ['^/ungrad/info.cfm$', '/ungrad/info'],
    ['^/admiss/testing.cfm$', '/admiss/testing'],
    ['^/ungrad/testing/$', '/ungrad/info.cfm']
]
This results in the first element being the URL watched, and the second being the URL redirected to. What I would like to do now is take each of the first elements, run the regex over the entire list, and check whether it matches any of the second elements.
With the example above, [0][0] would match [2][1].
However, I am having trouble thinking of a good and efficient way to do this.
import re

a = [
    ['^/ungrad/info.cfm$', '/ungrad/info'],
    ['^/admiss/testing.cfm$', '/admiss/testing'],
    ['^/ungrad/testing/$', '/ungrad/info.cfm']
]

def matchingfun(b):
    # Collect every redirect target (second item) that the compiled
    # pattern b matches
    matchedurl = []
    for reglist in a:
        if b.match(reglist[1]):
            matchedurl.append(reglist[1])
    return matchedurl

result1 = []
for reglist in a:
    b = re.compile(reglist[0])  # compile the watched-URL pattern
    result1.extend(matchingfun(b))

bs = list(set(result1))
print("matched url is", bs)
This is a bit inefficient, I guess, but it does the job to some extent. Hope this answers your query: the snippet above prints the redirect targets (second items) that are matched by any of the watched-URL patterns in the list.
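The same matching rule also fits in a set comprehension, shown here as a compact sketch over the list a from above:
patterns = [re.compile(pattern) for pattern, _ in a]
targets = [target for _, target in a]
matches = {t for p in patterns for t in targets if p.match(t)}
print("matched url is", sorted(matches))  # -> ['/ungrad/info.cfm']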
