I've been working on this for a week. A PDF (ERAS medical program details) was automatically converted to XML, and the result is very large and imperfect. Many of the things I tried on it returned errors, and it seems that regex doesn't work on lists, at least not on this one. I only need the emails. I could get them by keeping anything with "#" in it, or by removing anything with "|" in it.
How can I do this? Converting the list to a string doesn't seem to work, but I could be wrong.
import xml.etree.ElementTree as ET
tree = ET.parse(r'C:\Users\Iainc\Downloads\ERAS application 2022 emails.xml')
root = tree.getroot()
import re
email = ['']
for x in root.iter():
    email.append(x.text)
editedemail = ['']
search_term = 'Email:'
for i in range(len(email)-1):
    if email[i] == search_term:
        editedemail.append(email[i+1])
for i in range(len(email)-1):
    if email[i] == search_term:
        editedemail.append(email[i-1])
phonelesseditedemail = list(filter(lambda a: a != 'Phone:', editedemail))
The only things left to remove are entries like:
'Emergency Medicine | NRMP Program Code:********* | Categorical',
The rest are email addresses I can use. Next I want to write a program to automate sending custom emails, but for now I just need to remove the entries mentioned above.
Use a list comprehension:
pipelesseditedemail = [x for x in editedemail if '|' not in x]
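A slightly more defensive variant may help here, since ElementTree gives `None` as the `.text` of elements without text, and `'|' not in None` raises a TypeError. This is only a sketch: the sample list below is made up to stand in for your extracted data, and `emails_only` is a name I chose.

```python
# Hypothetical sample standing in for the texts pulled out of the XML.
editedemail = [
    '',
    'a#example.com',
    None,  # elements without text yield None
    'Emergency Medicine | NRMP Program Code:********* | Categorical',
    'Phone:',
    'b#example.org',
]

# Keep only real strings that contain '#' and do not contain '|'.
emails_only = [
    x for x in editedemail
    if isinstance(x, str) and '#' in x and '|' not in x
]

print(emails_only)
```

Requiring '#' as well as excluding '|' also drops the empty string and the 'Phone:' label in the same pass.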
You could do all of this in one loop:
editedemails = []
for i in range(len(email)):
    item1 = item2 = ''
    if i > 0:
        item1 = email[i-1]
    if i < len(email)-1:
        item2 = email[i+1]
    for item in [item1, item2]:
        # skip None/empty entries, keep only strings with '#' and without '|'
        if item and '#' in item and '|' not in item:
            editedemails.append(item)
How to return all values containing a specific text/string from a list as a comma-separated value?
I have a list of emails like this:
emails = ['email#example.com',
'email1#example.com',
'email2#example.com',
'emaila#emailexample.com',
'emailb#emailexample.com',
'email33#examplex.com',
'emailas44#exampley.com',
'emailoi45#exampley.com',
'emailgh#exampley.com']
What I want to do is get all emails from the same domain, like this:
Website = 'example.com'
Email = 'email#example.com','email1#example.com','email2#example.com'
and so on.
I tried this so far but cannot figure out how to achieve it. It would be great if anyone could help me. Thanks in advance.
def Email(values, search):
    for i in values:
        if search in i:
            return i
    return None
data = Email(emails, 'example.com')
print(data)
You never needed a regex. Use a list comprehension that takes advantage of str.endswith() to keep only the strings that end with the given domain:
emails = ['email#example.com',
'email1#example.com',
'email2#example.com',
'emaila#emailexample.com',
'emailb#emailexample.com',
'email33#examplex.com',
'emailas44#exampley.com',
'emailoi45#exampley.com',
'emailgh#exampley.com']
Website = 'example.com'
print([email for email in emails if email.endswith(f'#{Website}')])
# ['email#example.com', 'email1#example.com', 'email2#example.com']
You are returning a value on the very first iteration, which is why you don't get the full result. Instead, store the matching emails in a list and then return them as comma-separated values.
Modifying your approach:
def Email(values, search):
    x = list()
    for i in values:
        if i.endswith("#" + search):
            x.append(i)
    return ", ".join(x) # Returning list as a comma separated value
emails = ["email#example.com","email1#example.com","email2#example.com","emaila#emailexample.com","emailb#emailexample.com","email33#examplex.com","emailas44#exampley.com","emailoi45#exampley.com","emailgh#exampley.com"]
website = 'example.com'
data = Email(emails, website)
print("Website = " + website)
print("Email = " + data)
Hope this answers your question!!!
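Since the question says "and so on", you may want this for every domain at once rather than one search at a time. Here is a rough sketch that groups the list by whatever follows the last '#'. The name `by_domain` is mine, and it assumes every address has exactly the `local#domain` shape shown in the question.

```python
from collections import defaultdict

emails = ['email#example.com',
          'email1#example.com',
          'email2#example.com',
          'emaila#emailexample.com',
          'emailb#emailexample.com',
          'email33#examplex.com',
          'emailas44#exampley.com',
          'emailoi45#exampley.com',
          'emailgh#exampley.com']

# Map each domain to the list of addresses that belong to it.
by_domain = defaultdict(list)
for address in emails:
    # rpartition splits on the last '#'; sep is '' when there is no '#'
    _, sep, domain = address.rpartition('#')
    if sep:
        by_domain[domain].append(address)

for domain, addresses in by_domain.items():
    print("Website = " + domain)
    print("Email = " + ", ".join(addresses))
```

Comparing the whole domain avoids the substring problem where 'example.com' would also match 'emailexample.com'.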
I am pulling data out of a file that looks like this:
"LIC_ARP11|104100000X|33"
I collect the taxonomy number (taxonomies) out of the second field and translate it using another file (IDVtaxo) that looks like this:
"104100000X Behavioral Health & Social Service Providers Social Worker"
If the taxonomy number is not in IDVtaxo I want to append "Not Found"
if taxofile.startswith('IDV'):
    for nums in taxonomies:
        IDVfile = open(os.path.join(taxodir,IDVtaxo))
        for line in IDVfile:
            text = line.rstrip('\n')
            text = text.split("\t")
            if nums in line:
                data = text[2:]
                final.append(data)
            else:
                final.append('Not Found')
Then I print the original data along with the translated taxonomy. Currently I get:
"LIC_ARP11|104100000X|33| Not Found"
I want:
"LIC_ARP11|104100000X|33 | Social Worker"
The issue seems to be that the "else" appends "Not Found" for each line instead of just when the taxonomy isn't found in IDVtaxo.
taxonomies = ['152W00000X', '156FX1800X', '200000000X', '261QD0000X', '3336C0003X', '333600000X', '261QD0000X']
translations = {'261QD0000X': 'Clinic/Center Dental', '3336C0003X': 'Pharmacy Community/Retail Pharmacy', '333600000X': 'Pharmacy'}
a = 0
final = []
for nums in taxonomies:
    final.append(translations.get(nums, 'Not Found'))
for nums in taxonomies:
    print nums, "|", final[a]
    a = a + 1
The equality operator in Python is ==:
>>> if data == 'Not Found':
...     final.append(data)
For "not equal":
>>> if data != 'Not Found':
...     final.append(data)
It appears you are testing for the presence of nums in each line and appending 'Not Found' every time a line does not contain nums.
Instead, try maintaining a variable (e.g. job_title) that starts out as the string 'Not Found'. If nums is found, reassign job_title to the correct value, and append job_title to final once, after the inner loop over the lines has finished.
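Something along these lines, as a sketch only. I'm using an in-memory list (`idv_lines`, made up) instead of the real IDVtaxo file, and assuming each line has three tab-separated columns with the code first and the title third.

```python
# Hypothetical stand-in for the lines of IDVtaxo (tab-separated columns).
idv_lines = [
    "104100000X\tBehavioral Health & Social Service Providers\tSocial Worker",
    "333600000X\tSuppliers\tPharmacy",
]
taxonomies = ["104100000X", "999999999X"]

final = []
for nums in taxonomies:
    job_title = "Not Found"          # default until a match is seen
    for line in idv_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == nums:
            job_title = fields[2]    # found: remember the title
            break                    # no need to scan the rest
    final.append(job_title)          # exactly one append per taxonomy

print(final)
```

The key point is that the append happens once per taxonomy number, not once per line of the lookup file.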
I believe you can get a more efficient solution if you load IDVtaxo into a dictionary structure! https://docs.python.org/2/tutorial/datastructures.html#dictionaries
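A minimal sketch of the dictionary idea, under the same assumptions about the file layout (three tab-separated columns, code first, title third). The `idv_lines` list is a made-up stand-in for reading the real file, and `result` is just an illustrative name.

```python
# Hypothetical stand-in for the lines of IDVtaxo (tab-separated columns).
idv_lines = [
    "104100000X\tBehavioral Health & Social Service Providers\tSocial Worker",
    "333600000X\tSuppliers\tPharmacy",
]

# Build the lookup table once instead of rescanning the file per code.
translations = {}
for line in idv_lines:
    fields = line.rstrip("\n").split("\t")
    translations[fields[0]] = fields[2]

record = "LIC_ARP11|104100000X|33"
code = record.split("|")[1]          # taxonomy number is the second field

# dict.get returns the default when the code is missing.
result = record + " | " + translations.get(code, "Not Found")
print(result)
```

This reads the lookup file a single time, and each translation afterwards is one dictionary lookup.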
I'm trying to build a list of domain names from an Enom API call. I get back a lot of information and need to locate the domain name related lines, and then join them together.
The string that comes back from Enom looks somewhat like this:
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1
I'd like to build a list from that which looks like this:
[domain1.com, domain2.org, domain3.co.uk, domain4.net]
To find the different domain name components I've tried the following (where "enom" is the string above) but have only been able to get the SLD and TLD matches.
re.findall("^.*(SLD|TLD).*$", enom, re.M)
Every time I see a question asking for a regular expression solution, I have this bizarre urge to try to solve it without regular expressions. Most of the time that is more efficient than using a regex, but I encourage the OP to test which of the solutions is the most efficient.
Here is the naive approach:
a = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
b = a.split("\n")
c = [x.split("=")[1] for x in b if x != 'TLDOverride=1']
for x in range(0,len(c),2):
    print ".".join(c[x:x+2])
>> domain1.com
>> domain2.org
>> domain3.co.uk
>> domain4.net
You have a capturing group in your expression. re.findall documentation says:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
That's why only the content of the capturing group is returned.
Try:
re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)
This would return a list of tuples:
[('SLD1', 'domain1'), ('TLD1', 'com'), ('SLD2', 'domain2'), ('TLD2', 'org'), ('SLD3', 'domain3'), ('TLD4', 'co.uk'), ('SLD5', 'domain4'), ('TLD5', 'net')]
Combining SLDs and TLDs is then up to you.
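For what it's worth, here is one possible way to do the combining, as a sketch rather than the only option. It pairs each SLD with the next TLD in order of appearance instead of by number, because the numbers in the sample do not always line up (SLD3 is followed by TLD4). I also tightened the pattern slightly (`\d+`, no leading `.*`), which is my own change.

```python
import re

enom = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

# Two groups, so findall returns (key, value) tuples; TLDOverride has no
# digits after "TLD" and therefore does not match.
pairs = re.findall(r"^((?:SLD|TLD)\d+)=(.*)$", enom, re.M)

domains = []
sld = None
for key, value in pairs:
    if key.startswith("SLD"):
        sld = value                        # remember until its TLD shows up
    elif sld is not None:
        domains.append(sld + "." + value)  # pair with the preceding SLD
        sld = None

print(domains)
```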
This works for your example:
>>> sld_list = re.findall("^.*SLD[0-9]*?=(.*?)$", enom, re.M)
>>> tld_list = re.findall("^.*TLD[0-9]*?=(.*?)$", enom, re.M)
>>> map(lambda x: x[0] + '.' + x[1], zip(sld_list, tld_list))
['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
I'm not sure why you are talking about regular expressions. I mean, why don't you just run a for loop?
A famous quote seems to be appropriate here:
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
domains = []
components = []
for line in enom.split('\n'):
    k,v = line.split('=')
    if k == 'TLDOverride':
        continue
    components.append(v)
    if k.startswith('TLD'):
        domains.append('.'.join(components))
        components = []
P.S. I'm not sure what this TLDOverride is, so the code just ignores it.
Here's one way:
import re
print map('.'.join, zip(*[iter(re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M))]*2))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
Just for fun, map -> filter -> map:
input = """
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
"""
splited = map(lambda x: x.split("="), input.split())
slds = filter(lambda x: x[1][0].startswith('SLD'), enumerate(splited))
print map(lambda x: '.'.join([x[1][1], splited[x[0] + 1][1], ]), slds)
>>> ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
This appears to do what you want:
domains = re.findall('SLD\d+=(.+)', re.sub(r'\nTLD\d+=', '.', enom))
It assumes that the lines are sorted and that each SLD always comes right before its TLD. If that may not be the case, try this slightly more verbose code without regexes, which instead assumes that each SLD and its TLD share the same number:
d = dict(x.split('=') for x in enom.strip().splitlines())
domains = [
    d[key] + '.' + d.get('T' + key[1:], '')
    for key in d if key.startswith('SLD')
]
You need to use multiline regex for this. This is similar to this post.
data = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
# [\w.]+ lets the TLD group match dotted suffixes such as co.uk
domain_seq = re.compile(r"SLD\d=(\w+)\nTLD\d=([\w.]+)", re.M)
for item in domain_seq.finditer(data):
    domain, tld = item.group(1), item.group(2)
    print "%s.%s" % (domain,tld)
As some other answers already said, there's no need to use a regular expression here. A simple split and some filtering will do nicely:
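lines = data.split("\n") #assuming data contains your input string
# x[3:4].isdigit() skips the TLDOverride lines, which would otherwise misalign the two lists
sld, tld = [[x.split("=")[1] for x in lines if x[:3] == t and x[3:4].isdigit()] for t in ("SLD", "TLD")]
result = [x + "." + y for x, y in zip(sld, tld)]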
def handler_users_answ(coze, res, type, source):
    if res:
        if res.getType() == 'result':
            aa=res.getQueryChildren()
            if aa:
                print 'workz1'
                for x in aa:
                    m=x.getAttr('jid')
                    if m:
                        print m
This code prints values like this:
roomname#domain.com/nickname1
roomname#domain.com/nickname2
and so on, but I want it to print only the part after the '/',
like:
nickname1
nickname2
Thanks in advance.
You can use rpartition to get the part after the last / in the string.
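A small sketch of how that could look (written for Python 3; the variable names are only illustrative):

```python
jid = 'roomname#domain.com/nickname1'

# rpartition returns (head, separator, tail); the separator is '' when
# no '/' is present, which makes the "not found" case easy to detect.
_, sep, nickname = jid.rpartition('/')

if sep:
    print(nickname)
else:
    print("No username was found")
```

Note that rpartition splits at the last slash. If a nickname could itself contain a slash, str.partition, which splits at the first one, may be closer to what you want.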
a = 'roomname#domain.com/nickname1'
b = a.split('/')
c = b[1]
You can use rsplit, which does the splitting from the right:
a = 'roomname#domain.com/nickname1'
try:
    print a.rsplit('/', 1)[1]
except IndexError:
    print "No username was found"
I think that this is efficient and readable. If you really need it to be fast you can use rfind:
a = 'roomname#domain.com/nickname1'
index = a.rfind('/')
if index != -1:
    print a[index+1:]
else:
    print "No username was found"
To fully parse and validate the JID correctly, see this answer. There's a bunch of odd little edge cases that you might not expect.