Removing string from a list by using a rule - python

I am trying to write the list below to a txt file in Python.
However, I'm also trying to remove the entries that contain 'xxx' from the list, preferably by using some sort of if condition: if a URL contains 'xxx', remove it from the list.
Any idea how to approach this?
TTF = ('abc.com/648','xxx.com/246','def.com/566','ghi.com/624','xxx.com/123')

TTF = ('abc.com/648','xxx.com/246','def.com/566','ghi.com/624','xxx.com/123')
filtered = tuple(filter(lambda e: "xxx" not in e, TTF))
print(filtered)
Similar to Green Cloak Guy's answer, but using filter instead.

A simple filtered comprehension does this. Strings support in for substring matching, so you can check whether a string contains xxx with 'xxx' in string.
The result:
TTF_without_xxx = tuple(s for s in TTF if 'xxx' not in s)
# ('abc.com/648', 'def.com/566', 'ghi.com/624')
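Neither answer covers the file-writing half of the question, so here is a minimal sketch of how the filtered result might be written out, one URL per line. The function name write_filtered and the output file name urls.txt are assumptions made for illustration, and the file goes to the temp directory only so the example does not touch the working directory.

```python
import os
import tempfile

TTF = ('abc.com/648', 'xxx.com/246', 'def.com/566', 'ghi.com/624', 'xxx.com/123')

def write_filtered(urls, path, banned='xxx'):
    """Write every URL that does not contain `banned` to `path`, one per line."""
    kept = [u for u in urls if banned not in u]
    with open(path, 'w', encoding='utf-8') as fh:
        fh.write('\n'.join(kept) + '\n')
    return kept

# Hypothetical output location; replace with your own path.
out_path = os.path.join(tempfile.gettempdir(), 'urls.txt')
kept = write_filtered(TTF, out_path)
print(kept)
```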

Related

SpaCy: How do you check if two specific entities occur in a sentence?

I need to extract from a list of sentences (strings) all sentences that contain two specific entities and store them in a new list. The code I tried looks like this, but unfortunately it doesn't work. I'm using Python and SpaCy.
sents_required = []
for s in sentences:
    if token.ent_type_=='SPECIES' in s and token.ent_type_=='KEYWORD' in s:
        sents_required.append(s)
I am grateful for any help.
The way you're declaring the condition is SQL-like, but that doesn't work in Python - you need to iterate over the list and access the data yourself. There are many ways to do this but here's one.
for s in sentences:
    etypes = [tok.ent_type_ for tok in s]
    if "SPECIES" in etypes and "KEYWORD" in etypes:
        sents_required.append(s)
This code works for me. Thanks for help!
sents_required = []
for s in sentences:
    token_types = [token.ent_type_ for token in s]
    if ('SPECIES' in token_types) and ('KEYWORD' in token_types):
        sents_required.append(s)
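The same check can be phrased as a set comparison, which scales more easily to more than two labels. The sketch below is an illustration only: Ent and Sent are hypothetical stand-ins so the example runs without spaCy or a trained model, and has_all_labels is an invented helper name. With real spaCy sentence spans, the labels could be collected from the spans' entities instead of these stand-ins.

```python
from collections import namedtuple

# Hypothetical stand-ins for spaCy objects, so the sketch runs without a model.
Ent = namedtuple('Ent', 'label_')
Sent = namedtuple('Sent', 'text ents')

def has_all_labels(sent, required):
    """Return True when every label in `required` occurs among the sentence's entities."""
    labels = {ent.label_ for ent in sent.ents}
    return set(required) <= labels

sentences = [
    Sent('Wolves hunt at night.', [Ent('SPECIES'), Ent('KEYWORD')]),
    Sent('Wolves are mammals.', [Ent('SPECIES')]),
]

sents_required = [s for s in sentences if has_all_labels(s, ('SPECIES', 'KEYWORD'))]
print([s.text for s in sents_required])
```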

How to check for empty document.paragraph values extracted from doc.table?

I am using the win32com Python module to target a Word .doc table and then extract all Sentences/ListParagraphs from it.
I am able to get all my content using doc.Paragraphs. I then try to run:
EDITED:
doc = word.Documents.Open(path)
list = doc.Paragraphs
for x in list:
    if str(x.Style) == "Normal" and x != "":
        # do stuff
This does not detect empty or whitespace-only lists and paragraphs. I also tried using
x.isspace() to check for whitespace, but it always returned False.
I have had a run-in with \r\n\t\x07\x0b characters before, which seem to come along when text is extracted from COM class objects. They cause all sorts of weird issues when converting them to strings. Could it be something similar?
Thanks
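One possible explanation, offered as a hedged sketch rather than a confirmed diagnosis: str.isspace() returns False as soon as any character is not whitespace, and \x07 (the bell character) is not whitespace. Assuming the paragraph text has already been obtained as a Python string, a helper like the hypothetical is_blank below strips the characters listed in the question before testing for emptiness. The sample value "\r\x07" is an assumption about what an "empty" entry might contain.

```python
# Characters the question reports seeing in text extracted through COM, plus a space.
WORD_DEBRIS = "\r\n\t\x07\x0b "

def is_blank(text):
    """Return True when text holds nothing but whitespace and the control marks above."""
    return text.strip(WORD_DEBRIS) == ""

cell_text = "\r\x07"  # assumed example of an 'empty' value

print(cell_text.isspace())  # False, because \x07 is not whitespace
print(is_blank(cell_text))  # True
```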

Pythonic way to find if a string contains multiple values?

I am trying to find all the Excel, txt, or csv files in a list of files and append them to a list.
goodAttachments = [i for i in attachments if str(i).split('.')[1].find(['xlsx','csv','txt'])
This is obviously not working because find() needs a string and not a list. Should I try a list comprehension inside of a list comprehension?
There's no need to split or use double list comprehension. You can use str.endswith which takes a tuple of strings to check as an argument:
goodAttachments = [i for i in attachments if str(i).endswith(('.xlsx', '.csv', '.txt'))]
If you really want to split:
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ('xlsx', 'csv', 'txt')]
The first way is better as it accounts for files with no extension.
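To make the difference concrete, here is a small comparison of the two approaches. The file names are invented for illustration; the entry named csv has no extension at all.

```python
# Invented sample names; 'csv' is a file with no extension.
attachments = ['report.xlsx', 'notes.txt', 'csv', 'archive.tar.gz', 'data.csv']

# endswith with a tuple only matches names that really end in '.ext'.
by_suffix = [a for a in attachments if a.endswith(('.xlsx', '.csv', '.txt'))]

# split('.')[-1] returns the whole name when there is no dot, so 'csv' slips through.
by_split = [a for a in attachments if a.split('.')[-1] in ('xlsx', 'csv', 'txt')]

print(by_suffix)  # ['report.xlsx', 'notes.txt', 'data.csv']
print(by_split)   # ['report.xlsx', 'notes.txt', 'csv', 'data.csv']
```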
You could try something like this:
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ['xlsx', 'csv', 'txt']]
This will check if the extension after the last '.' matches one of 'xlsx', 'csv', or 'txt' exactly.
[i for i in attachments if any([e in str(i).split('.')[1] for e in ['xlsx','csv','txt']])]
Like you said, nested list comprehension.
Edit: This will work without splitting, I was trying to replicate the logic in find.
You can check that everything after the last dot is present in a second list. Using [-1] instead of [1] ensures that a file named like.this.txt yields the last piece, txt, and not this.
goodAttachments = [i for i in attachments if str(i).split('.')[-1] in ['xlsx','csv','txt']]
I would suggest adding a few more lines rather than trying to create a one-liner with nested list comprehensions. Though that would work, I think the code is more readable with the comprehensions split onto separate lines.
import os
attachments = ['sadf.asdf', 'asd/asd/asd.xlsx']
whitelist = {'.xlsx', '.csv'}
# os.path.splitext returns (root, ext); ext keeps its leading dot
extensions = (os.path.splitext(fp)[1] for fp in attachments)
good_attachments = [fp for fp, ext in zip(attachments, extensions) if ext in whitelist]
I've also used os.path.splitext over str.split, as the file name may contain multiple dots and splitext is designed for exactly this job.

Find value matching value in a list of dicts

I have a list of dicts that looks like this:
serv=[{'scheme': 'urn:x-esri:specification:ServiceType:DAP',
'url': 'http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/air.mon.anom.nobs.nc'},
{'scheme': 'urn:x-esri:specification:ServiceType:WMS',
'url': 'http://www.esrl.noaa.gov/psd/thredds/wms/Datasets/air.mon.anom.nobs.nc?service=WMS&version=1.3.0&request=GetCapabilities'},
{'scheme': 'urn:x-esri:specification:ServiceType:WCS',
'url': 'http://ferret.pmel.noaa.gov/geoide/wcs/Datasets/air.mon.anom.nobs.nc?service=WCS&version=1.0.0&request=GetCapabilities'}]
and I want to find the URL corresponding to the ServiceType:WMS which means finding the value of url key in the dictionary from this list where the scheme key has value urn:x-esri:specification:ServiceType:WMS.
So I've got this that works:
for d in serv:
    if d['scheme']=='urn:x-esri:specification:ServiceType:WMS':
        url=d['url']
print(url)
which produces
http://www.esrl.noaa.gov/psd/thredds/wms/Datasets/air.mon.anom.nobs.nc?service=WMS&version=1.3.0&request=GetCapabilities
but I've just watched Raymond Hettinger's PyCon talk, and at the end he says that if you can say it as a sentence, it should be expressed in one line of Python.
So is there a more beautiful, idiomatic way of achieving the same result, perhaps with one line of Python?
Thanks,
Rich
The serv array you listed looks like a dictionary mapping schemes to URLs, but it's not represented as such. You can easily convert it to a dict using list comprehensions, though, and then use normal dictionary lookups:
url = dict([(d['scheme'],d['url']) for d in serv])['urn:x-esri:specification:ServiceType:WMS']
You can, of course, save the dictionary version for future use (at the cost of using two lines):
servdict = dict([(d['scheme'],d['url']) for d in serv])
url = servdict['urn:x-esri:specification:ServiceType:WMS']
If you're only interested in one URL, then you can build a generator over serv and use next with a default value for the cases where a match isn't found, eg:
url = next((dct['url'] for dct in serv if dct['scheme'] == 'urn:x-esri:specification:ServiceType:WMS'), 'default URL / not found')
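As a small illustration of how the default behaves, the sketch below wraps the same next() call in a helper and runs it for a scheme that is present and one that is not. The helper name url_for, the shortened example.com URLs, and the SOS scheme are assumptions made for the example, not part of the original data.

```python
# Shortened stand-in data; the URLs are placeholders, not the real endpoints.
serv = [
    {'scheme': 'urn:x-esri:specification:ServiceType:DAP', 'url': 'http://example.com/dap'},
    {'scheme': 'urn:x-esri:specification:ServiceType:WMS', 'url': 'http://example.com/wms'},
]

def url_for(scheme, services, default=None):
    """Return the url of the first dict whose scheme matches, or `default` if none does."""
    return next((d['url'] for d in services if d['scheme'] == scheme), default)

wms_url = url_for('urn:x-esri:specification:ServiceType:WMS', serv)
missing = url_for('urn:x-esri:specification:ServiceType:SOS', serv)  # hypothetical scheme

print(wms_url)
print(missing)
```

Because the generator is lazy, the search stops at the first match instead of scanning the whole list.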
I would split this into two lines, to separate the target from the url retrieval. This is because your target may change in time, so this should not be hardwired. The single line of code follows.
I would use in instead of == as we want to search for all schemes that are of this type. This adds more flexibility, and readability, assuming this will not also catch other schemes not wanted. But from the description, this is the functionality desired.
target = "ServiceType:WMS"
url = [d['url'] for d in serv if target in d['scheme']]
Also, note, this returns a list in all cases, in case there is more than one match, so you will have to loop over url in the code that uses this.
How about this?
urls = [d['url'] for d in serv if d['scheme'] == 'urn:x-esri:specification:ServiceType:WMS']
print(urls) # ['http://www.esrl.noaa.gov/psd/thredds/wms/Datasets/air.mon.anom.nobs.nc?service=WMS&version=1.3.0&request=GetCapabilities']
It's doing the same thing your code is doing: d['url'] is appended to the list urls whenever the scheme matches the WMS service type.
The trailing if filter of a comprehension cannot take an else clause, but you can match on the suffix instead:
urls = [i['url'] for i in serv if i['scheme'].endswith('WMS')]
I've been trying to work in more functional programming into my own work, so here is a pretty simple functional way:
needle='urn:x-esri:specification:ServiceType:WMS'
url = list(filter(lambda d: d['scheme'] == needle, serv))[0]['url']
filter takes as arguments a function that returns a boolean and a list to be filtered. It yields the elements for which the boolean-returning function (in this case a lambda I defined on the fly) returns True. In Python 2 the result is already a list; in Python 3 it is a lazy iterator, hence the list() call. So, to finally get the url, we take the zeroth element of that list. Since that is the dict containing our desired url, we can tag ['url'] on the end of the whole expression to get the corresponding dictionary entry.

Pattern matching Twitter Streaming API

I'm trying to insert into a dictionary certain values from the Streaming API. One of these is the value of the term that is used in the filter method using track=keyword. I've written some code but on the print statement I get an "Encountered Exception: 'term'" error. This is my partial code:
for term in setTerms:
    a = re.compile(term, re.IGNORECASE)
    if re.search(a, status.text):
        message['term'] = term
    else:
        pass
print message['text'], message['term']
This is the filter code:
setTerms = ['BBC','XFactor','Obama']
streamer.filter(track = setTerms)
It matches the string, but I also need to be able to match all instances, e.g. BBC should also match with #BBC, #BBC or BBC1 etc.
So my question is: how would I get a term in setTerms, e.g. BBC, to match all these instances in if re.search(term, status.text)?
Thanks
Have you tried putting all your search terms into a single expression?
i.e. setTerms = '(BBC)|(XFactor)|(Obama)', then seeing if it matches the whole string (not just an individual word)?
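A minimal sketch of that suggestion, under some assumptions: the terms are joined into one case-insensitive alternation, and because re.search looks for the pattern anywhere in the text, BBC is also found inside #BBC or BBC1. The helper name first_term and the sample tweets are invented for illustration. Returning None when nothing matches also avoids the KeyError on 'term' that arises when the key is never set.

```python
import re

set_terms = ['BBC', 'XFactor', 'Obama']

# One alternation over all terms; re.escape guards against regex metacharacters.
pattern = re.compile('|'.join(re.escape(t) for t in set_terms), re.IGNORECASE)

# Map the lowercased match back to the term as it was originally spelled.
canonical = {t.lower(): t for t in set_terms}

def first_term(text):
    """Return the first tracked term found anywhere in text, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return canonical[match.group(0).lower()]

print(first_term('Watching #BBC1 tonight'))
print(first_term('obama speaks at noon'))
print(first_term('nothing relevant here'))
```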
