Extracting author names from XML tags using ElementTree - Python
Following is the link to access the XML document:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%2726161999%27&retmode=xml
I'm trying to extract each author's name (ForeName + LastName) and build a single string containing only the author names. So far I have only been able to extract the two parts separately.
This is the code I have tried:
import xml.etree.ElementTree as et
import requests

r = requests.get(
    'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=%2726161999%27&retmode=xml')
root = et.fromstring(r.content)
authordata = []
for elem in root.findall(".//ForeName"):
    elem_ = elem.text
    auth_name = list(elem_.split(" "))
    authordata.append(auth_name)
# flatten the nested list: each inner list is joined back into one string
val = [item if isinstance(item, str) else " ".join(item) for item in authordata]
# drop duplicates while keeping the original order
seen = set()
val = [x for x in val if x not in seen and not seen.add(x)]
author = ' '.join(val)
print(author)
The output obtained from the above code is:
Elisa Riccardo Mirco Laura Valentina Antonio Sara Carla Borri Barbara
The expected output is a combination of firstname + Lastname:
Elisa Oppici Riccardo Montioli Mirco Dindo Laura Maccari Valentina Porcari Antonio Lorenzetto Chellini Sara Carla Borri Voltattorni Barbara Cellini
From your question I understand that you want a concatenation of ForeName and LastName for each author. You can achieve that by querying directly for those fields on each Author element in the tree and concatenating the corresponding text fields:
import xml.etree.ElementTree as et
import requests

r = requests.get(
    'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="26161999"&retmode=xml'
)
root = et.fromstring(r.content)
author_names = []
for author in root.findall(".//Author"):
    fore_name = author.find('ForeName').text
    last_name = author.find('LastName').text
    author_names.append(fore_name + ' ' + last_name)
print(author_names)
# or to get your exact output format:
print(' '.join(author_names))
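One caveat with the loop above: `author.find('ForeName')` returns `None` when an Author element has no such child, and `.text` then raises an AttributeError. The following is a minimal sketch of a more defensive variant using `findtext` with a default. It runs on a small hand-written XML sample (an assumption, not real API output); the `CollectiveName` entry and its value are only there to illustrate an author without ForeName/LastName.

```python
import xml.etree.ElementTree as et

# Hand-written sample shaped like a PubMed AuthorList (illustrative only).
sample_xml = """
<AuthorList>
  <Author><LastName>Oppici</LastName><ForeName>Elisa</ForeName></Author>
  <Author><LastName>Montioli</LastName><ForeName>Riccardo</ForeName></Author>
  <Author><CollectiveName>Example Consortium</CollectiveName></Author>
</AuthorList>
"""

root = et.fromstring(sample_xml)
author_names = []
for author in root.findall(".//Author"):
    # findtext returns the default instead of failing when the child is missing
    fore_name = author.findtext("ForeName", default="")
    last_name = author.findtext("LastName", default="")
    full_name = f"{fore_name} {last_name}".strip()
    if full_name:  # skip authors that have neither field
        author_names.append(full_name)

print(author_names)
print(" ".join(author_names))
```

Whether skipping such entries is the right choice depends on your data; you could also fall back to another field instead.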
Related
Storing keyvalue as header and value text as rows using data frame in python using beautiful soup
for imo in imos:
    ...
    ...
    keys_div = soup.find_all("div", {"class","col-4 keytext"})
    values_div = soup.find_all("div",{"class","col-8 valuetext"})
    for key, value in zip(keys_div, values_div):
        print(key.text + ": " + value.text)
    ......

Output:
Ship Name: MAERSK ADRIATIC
Shiptype: Chemical/Products Tanker
IMO/LR No.: 9636632
Gross: 23,297
Call Sign: 9V3388
Deadweight: 37,538
MMSI No.: 566429000
Year of Build: 2012
Flag: Singapore
Status: In Service/Commission
Operator: Handytankers K/S
Shipbuilder: Hyundai Mipo Dockyard Co Ltd
ShipType: Chemical/Products Tanker
Built: 2012
GT: 23,297
Deadweight: 37,538
Length Overall: 184.000
Length (BP): 176.000
Length (Reg): 177.460
Bulbous Bow: Yes
Breadth Extreme: 27.430
Breadth Moulded: 27.400
Draught: 11.500
Depth: 17.200
Keel To Mast Height: 46.900
Displacement: 46565
T/CM: 45.0

This is the output for one IMO. I want to store this output in a dataframe and write it to a CSV file, with the key text as the header and the value text as the rows, one row per IMO. How can I do that?
All you have to do is collect the results in a list and then turn that list into a dataframe.

import pandas as pd

filepath = r"C:\users\test\test_file.csv"
output_data = []
for imo in imos:
    # ... build soup for this imo exactly as in your existing code ...
    keys_div = [i.text for i in soup.find_all("div", {"class": "col-4 keytext"})]
    values_div = [i.text for i in soup.find_all("div", {"class": "col-8 valuetext"})]
    # one dict per ship: key text -> value text
    dict1 = dict(zip(keys_div, values_div))
    output_data.append(dict1)

# each dict becomes a row, the dict keys become the column headers
df = pd.DataFrame(output_data)
df.to_csv(filepath, index=False)
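The same "list of dicts to CSV" idea can be sketched without pandas, using `csv.DictWriter` from the standard library. This is a swapped-in technique, not the answer's method, and the records below are hypothetical (the second ship is made up to show a missing key). The header is built from the union of all keys so that ships with different fields still line up.

```python
import csv
import io

# Hypothetical per-ship records, shaped like dict(zip(keys, values)).
records = [
    {"Ship Name": "MAERSK ADRIATIC", "Flag": "Singapore", "Gross": "23,297"},
    {"Ship Name": "EXAMPLE SHIP", "Flag": "Panama"},  # made-up row, "Gross" missing
]

# Collect every key in first-seen order so the header covers all records.
fieldnames = list(dict.fromkeys(key for rec in records for key in rec))

buffer = io.StringIO()  # stands in for open(filepath, "w", newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(records)  # missing keys are filled with restval

csv_text = buffer.getvalue()
print(csv_text)
```

Note that a dict keeps only one value per key, so if a page repeats a label (the sample output lists Deadweight twice), the later value wins in both this sketch and the pandas version.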
How to retrieve information in the first section of the raw data only by regular expressions?
Below is a sample of the raw data which my code will process with regular expressions:

raw_data = '''
name : John
age : 26
gender : male
occupation : teacher
Father
---------------------
name : Bill
age : 52
gender : male
Mother
---------------------
name : Mary
age : 48
gender : female
'''

I want to retrieve the following part of the information from the raw data and store it in a dictionary:

dict(name = 'John', age = 26, gender = 'male', occupation = 'teacher')

However, when I run my code as follows, it does not work as I expect:

import re

p = re.compile('[^-]*?^([^:\-]+?):([^\r\n]*?)$', re.M)
rets = p.findall(raw_data)
infoAboutJohnAsDict = {}
if rets != []:
    for ret in rets:
        infoAboutJohnAsDict[ret[0]] = ret[1]
else:
    print("Not match.")
print(f'rets = {rets}')
print(f'infoAboutJohnAsDict = {infoAboutJohnAsDict}')

Can anyone suggest how I should modify my code to achieve what I intend to do?
Here is one approach using regular expressions. We can first trim off the latter portion of the input, which you don't want, using re.sub. Then use re.findall to find all key-value pairs for John, and convert them to a dictionary.

import re

# remove everything from the first "<label>" + dashed rule onwards
raw_data = re.sub(r'\s+\w+\s+-+.*', '', raw_data, flags=re.S)
# what is left is John's section: collect its "key : value" pairs
matches = re.findall(r'(\w+)\s*:\s*(\w+)', raw_data)
d = dict()
for m in matches:
    d[m[0]] = m[1]
print(d)
# {'gender': 'male', 'age': '26', 'name': 'John', 'occupation': 'teacher'}
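The approach above returns every value as a string, while the question asks for `age = 26` as a number. Here is a small sketch of a variant that splits at the first dashed heading and converts purely numeric values to `int`. The shortened `raw_data`, the name `first_section`, and the "digits only means int" rule are assumptions made for illustration.

```python
import re

raw_data = '''
name : John
age : 26
gender : male
occupation : teacher
Father
---------------------
name : Bill
age : 52
gender : male
'''

# Keep only the text before the first "<label>" line followed by a dashed rule.
first_section = re.split(r'^\w+\s*\n-+\s*$', raw_data, maxsplit=1, flags=re.M)[0]

info = {}
for key, value in re.findall(r'^(\w+)\s*:\s*(.+?)\s*$', first_section, flags=re.M):
    # Assumption: values made only of digits should become integers.
    info[key] = int(value) if value.isdigit() else value

print(info)
```

Using `.+?` for the value (instead of `\w+`) also keeps values that contain spaces or punctuation intact.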
Search in List; Display names based on search input
I have looked through different posts here about searching data in a list, but nothing seems to work or fit what I am supposed to implement. I have a pre-created module with over 500 records of names, cities, emails, etc. (it is one string, but it is split into a list when the module is loaded; see the code below). The following is just a chunk of it.

empRecords="""Jovita,Oles,8 S Haven St,Daytona Beach,Volusia,FL,6/14/1965,32114,386-248-4118,386-208-6976,joles#gmail.com,http://www.paganophilipgesq.com,;
Alesia,Hixenbaugh,9 Front St,Washington,District of Columbia,DC,3/3/2000,20001,202-646-7516,202-276-6826,alesia_hixenbaugh#hixenbaugh.org,http://www.kwikprint.com,;
Lai,Harabedian,1933 Packer Ave #2,Novato,Marin,CA,1/5/2000,94945,415-423-3294,415-926-6089,lai#gmail.com,http://www.buergimaddenscale.com,;
Brittni,Gillaspie,67 Rv Cent,Boise,Ada,ID,11/28/1974,83709,208-709-1235,208-206-9848,bgillaspie#gillaspie.com,http://www.innerlabel.com,;
Raylene,Kampa,2 Sw Nyberg Rd,Elkhart,Elkhart,IN,12/19/2001,46514,574-499-1454,574-330-1884,rkampa#kampa.org,http://www.hermarinc.com,;
Flo,Bookamer,89992 E 15th St,Alliance,Box Butte,NE,12/19/1957,69301,308-726-2182,308-250-6987,flo.bookamer#cox.net,http://www.simontonhoweschneiderpc.com,;
Jani,Biddy,61556 W 20th Ave,Seattle,King,WA,8/7/1966,98104,206-711-6498,206-395-6284,jbiddy#yahoo.com,http://www.warehouseofficepaperprod.com,;
Chauncey,Motley,63 E Aurora Dr,Orlando,Orange,FL,3/1/2000,32804,407-413-4842,407-557-8857,chauncey_motley#aol.com,http://www.affiliatedwithtravelodge.com
"""
a = empRecords.strip().split(";")

And I have the following code for searching:

import empData as x

def seecity():
    empCitylist = list()
    for ct in x.a:
        empCt = ct.strip().split(",")
        empCitylist.append(empCt)
    t = sorted(empCitylist, key=lambda x: x[3])
    for c in t:
        city = (c[3])
        print(city)
    live_city = input("Enter city: ")
    for cy in city:
        if live_city in cy:
            print(c[1])
            # print("Name: "+ c[1] + ",", c[0], "| Current City: " + c[3])
Forgive my idiotic approach, as I am new to Python. What I am trying to do is this: the user enters a city, and the results should show the last name and first name of every employee living in that city. By the way, the code above doesn't return any results. It just loops back to the input prompt. Thank you for helping.
PS: the format of the empData is: first name, last name, address, city, country, birthday, zip, phone, and email
You can use the csv module to easily read a file with comma-separated values:

import csv

with open('test.csv', newline='') as csvfile:
    records = list(csv.reader(csvfile))

def search(data, elem, index):
    out = list()
    for row in data:
        if row[index] == elem:
            out.append(row)
    return out

# test
print(search(records, 'Orlando', 3))
Based on your original code, you can do it like this:

# Make list of list records, sorted by city
t = sorted((ct.strip().split(",") for ct in x.a), key=lambda x: x[3])

# List cities
print("Cities in DB:")
for c in t:
    city = (c[3])
    print("-", city)

# Define search function
def seecity():
    live_city = input("Enter city: ")
    for c in t:
        if live_city == c[3]:
            print("Name: "+ c[1] + ",", c[0], "| Current City: " + c[3])

seecity()

Then, after you understand what's going on, do as @Hoxha Alban suggested, and use the csv module.
The beauty of Python lies in list comprehensions.

empRecords="""Jovita,Oles,8 S Haven St,Daytona Beach,Volusia,FL,6/14/1965,32114,386-248-4118,386-208-6976,joles#gmail.com,http://www.paganophilipgesq.com,;
Alesia,Hixenbaugh,9 Front St,Washington,District of Columbia,DC,3/3/2000,20001,202-646-7516,202-276-6826,alesia_hixenbaugh#hixenbaugh.org,http://www.kwikprint.com,;
Lai,Harabedian,1933 Packer Ave #2,Novato,Marin,CA,1/5/2000,94945,415-423-3294,415-926-6089,lai#gmail.com,http://www.buergimaddenscale.com,;
Brittni,Gillaspie,67 Rv Cent,Boise,Ada,ID,11/28/1974,83709,208-709-1235,208-206-9848,bgillaspie#gillaspie.com,http://www.innerlabel.com,;
Raylene,Kampa,2 Sw Nyberg Rd,Elkhart,Elkhart,IN,12/19/2001,46514,574-499-1454,574-330-1884,rkampa#kampa.org,http://www.hermarinc.com,;
Flo,Bookamer,89992 E 15th St,Alliance,Box Butte,NE,12/19/1957,69301,308-726-2182,308-250-6987,flo.bookamer#cox.net,http://www.simontonhoweschneiderpc.com,;
Jani,Biddy,61556 W 20th Ave,Seattle,King,WA,8/7/1966,98104,206-711-6498,206-395-6284,jbiddy#yahoo.com,http://www.warehouseofficepaperprod.com,;
Chauncey,Motley,63 E Aurora Dr,Orlando,Orange,FL,3/1/2000,32804,407-413-4842,407-557-8857,chauncey_motley#aol.com,http://www.affiliatedwithtravelodge.com
"""
rows = empRecords.strip().split(";")
data = [ r.strip().split(",") for r in rows ]

Then you can use any condition to filter the list, like:

print ( [ "Name: " + emp[1] + "," + emp[0] + "| Current City: " + emp[3] for emp in data if emp[3] == "Washington" ] )

['Name: Hixenbaugh,Alesia| Current City: Washington']
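The answers above can be combined into one small function. The sketch below feeds the semicolon-separated records through `csv.reader` and matches the city case-insensitively, so "orlando" finds "Orlando". The function name `find_by_city`, the shortened two-record sample, and the case-insensitive matching are assumptions for illustration, not part of the original code.

```python
import csv

# Two shortened records in the same column order as the question's empRecords.
emp_records = """Jovita,Oles,8 S Haven St,Daytona Beach,Volusia,FL,6/14/1965,32114;
Chauncey,Motley,63 E Aurora Dr,Orlando,Orange,FL,3/1/2000,32804"""

def find_by_city(records, city):
    """Return 'Last, First' for every record whose city (column 3) matches."""
    # csv.reader accepts any iterable of strings, one record per item
    rows = csv.reader(r.strip() for r in records.split(";") if r.strip())
    wanted = city.strip().lower()
    return [f"{row[1]}, {row[0]}" for row in rows if row[3].lower() == wanted]

print(find_by_city(emp_records, "orlando"))
```

Returning a list instead of printing inside the loop keeps the search reusable; the caller decides how to display the matches or what to do when the list is empty.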
Grab one or two words IF capitalised after a pattern and match the result with another list
I need to extract unique names with titles such as Lord|Baroness|Lady|Baron from text and match them with another list. I am struggling to get the right result and hope the community can help me. Thanks!

import re

def get_names(text):
    # find noble titles and grab each one with the following name
    match = re.compile(r'(Lord|Baroness|Lady|Baron) ([A-Z][a-z]+) ([A-Z][a-z]+)')
    names = list(set(match.findall(text)))
    # remove duplicates based on the index in tuples
    names_ = list(dict((v[1],v) for v in sorted(names, key=lambda names: names[0])).values())
    names_lst = list(set([' '.join(map(str, name)) for name in names_]))
    return names_lst

text = 'Baroness Firstname Surname and Baroness who is also known as Lady Anothername and Lady Surname or Lady Firstname.'
names_lst = get_names(text)
print(names_lst)

Which now yields: ['Baroness Firstname Surname']
Desired output: ['Baroness Firstname Surname', 'Lady Anothername'] but NOT Lady Surname or Lady Firstname

Then I need to match the result with this list:

other_names = ['Firstname Surname', 'James', 'Simon Smith']

and drop the element 'Firstname Surname' from it, because it matches the first name and surname of the Baroness in the desired output.
I suggest the following solution:

import re

def get_names(text):
    # find noble titles and grab each one with the following name
    match = re.compile(r'(Lord|Baroness|Lady|Baron) ([A-Z][a-z]+)[ ]?([A-Z][a-z]+)?')
    names = list(match.findall(text))
    # keep only the first name encountered for each title
    d = {}
    for name in names:
        if name[0] not in d:
            d[name[0]] = ' '.join(name[1:3]).strip()
    return d

text = 'Baroness Firstname Surname and Baroness who is also known as Lady Anothername and Lady Surname or Lady Firstname.'
other_names = ['Firstname Surname', 'James', 'Simon Smith']

names_dict = get_names(text)
print(names_dict)
# {'Baroness': 'Firstname Surname', 'Lady': 'Anothername'}
print([' '.join([k,v]) for k,v in names_dict.items()])
# ['Baroness Firstname Surname', 'Lady Anothername']

other_names_dropped = [name for name in other_names if name not in names_dict.values()]
print(other_names_dropped)
# ['James', 'Simon Smith']
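Keeping one entry per title means a second, different Lady or Lord in the text would be dropped. If the intent is instead to skip a titled name only when one of its name parts has already appeared (which is one possible reading of the desired output, and an assumption here), a sketch could look like this. `get_unique_names` and `seen_parts` are hypothetical names.

```python
import re

# Title, one capitalised word, and optionally a second capitalised word.
TITLE_RE = re.compile(r'\b(Lord|Baroness|Lady|Baron) ([A-Z][a-z]+)(?: ([A-Z][a-z]+))?')

def get_unique_names(text):
    seen_parts = set()  # every first name or surname already used
    result = []
    for title, first, second in TITLE_RE.findall(text):
        parts = [p for p in (first, second) if p]
        if any(p in seen_parts for p in parts):
            continue  # e.g. "Lady Surname" after "Baroness Firstname Surname"
        seen_parts.update(parts)
        result.append(' '.join([title, *parts]))
    return result

text = ('Baroness Firstname Surname and Baroness who is also known as '
        'Lady Anothername and Lady Surname or Lady Firstname.')
other_names = ['Firstname Surname', 'James', 'Simon Smith']

titled = get_unique_names(text)
# Strip the title to compare against the plain names in other_names.
untitled = {name.split(' ', 1)[1] for name in titled}
remaining = [name for name in other_names if name not in untitled]

print(titled)
print(remaining)
```

With this reading, two distinct people sharing a title are both kept, while repeated mentions of the same surname or first name under another title are ignored.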
Error when creating dictionaries from text files
I've been working on a function which will update two dictionaries (similar authors, and awards they've won) from an open text file. The text file looks something like this:

Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur

Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas

And so on. The first line is an author's name (last name first, first name last), followed by awards they may have won, and then authors who are similar to them. This is what I've got so far:

def load_author_dicts(text_file, similar_authors, awards_authors):
    name_of_author = True
    awards = False
    similar = False
    for line in text_file:
        if name_of_author:
            author = line.split(', ')
            nameA = author[1].strip() + ' ' + author[0].strip()
            name_of_author = False
            awards = True
            continue
        if awards:
            if ',' in line:
                awards = False
                similar = True
            else:
                if nameA in awards_authors:
                    listawards = awards_authors[nameA]
                    listawards.append(line.strip())
                else:
                    listawards = []
                    listawards.append(line.strip())
                    awards_authors[nameA] = listawards
        if similar:
            if line == '\n':
                similar = False
                name_of_author = True
            else:
                sim_author = line.split(', ')
                nameS = sim_author[1].strip() + ' ' + sim_author[0].strip()
                if nameA in similar_authors:
                    similar_list = similar_authors[nameA]
                    similar_list.append(nameS)
                else:
                    similar_list = []
                    similar_list.append(nameS)
                    similar_authors[nameA] = similar_list
                continue

This works great! However, if the text file contains an entry with just a name (i.e. no awards and no similar authors), it breaks the whole thing, generating an IndexError: list index out of range at this part:

Zname = sim_author[1].strip()+" "+sim_author[0].strip()

How can I fix this? Maybe with a try/except block in that area? Also, I wouldn't mind getting rid of those continue statements; I wasn't sure how else to keep it going. I'm still pretty new to this, so any help would be much appreciated!
I keep trying stuff and it changes another section I didn't want changed, so I figured I'd ask the experts.
How about doing it this way, just to get the data in; then you can manipulate the dictionary any way you want.

test.txt contains your data:

Brabudy, Ray
Hugo Award
Nebula Award
Saturn Award
Ellison, Harlan
Heinlein, Robert
Asimov, Isaac
Clarke, Arthur

Ellison, Harlan
Nebula Award
Hugo Award
Locus Award
Stephenson, Neil
Vonnegut, Kurt
Morgan, Richard
Adams, Douglas

And my code to parse it, award_parse.py:

data = {}
name = ""
awards = []
with open("test.txt") as f:
    for l in f:
        # make sure the line is not blank; don't process blank lines
        if not l.strip() == "":
            # if this is a name and we're not already working on an author then set the author
            # otherwise treat this as a new author and set the existing author to a key in the dictionary
            if "," in l and len(name) == 0:
                name = l.strip()
            elif "," in l and len(name) > 0:
                # check to see if recipient is already in list, add to end of existing list if he/she already
                # exists.
                if not name.strip() in data:
                    data[name] = awards
                else:
                    data[name].extend(awards)
                name = l.strip()
                awards = []
            # process any lines that are not blank, and do not have a ,
            else:
                awards.append(l.strip())
# store the last author too; the loop only saves an author when the next name appears
if name:
    data.setdefault(name, []).extend(awards)
for k, v in data.items():
    print("%s got the following awards: %s" % (k,v))
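For the original question (two dictionaries, and no IndexError on a name-only entry), here is a rough sketch that works block by block instead of tracking state flags. It assumes entries are separated by blank lines, as the `line == '\n'` check in the question suggests, and that any line containing a comma after the first line is a similar author. The helper `flip` and the shortened sample text are made up for illustration.

```python
import io
import re

def flip(name):
    """Turn 'Last, First' into 'First Last'; leave comma-less text unchanged."""
    last, _, first = name.partition(',')
    return f"{first.strip()} {last.strip()}".strip()

def load_author_dicts(text_file, similar_authors, awards_authors):
    # Assumption: entries are separated by one or more blank lines.
    for block in re.split(r'\n\s*\n', text_file.read()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        author = flip(lines[0])
        for line in lines[1:]:  # empty for a name-only entry, so no IndexError
            if ',' in line:
                similar_authors.setdefault(author, []).append(flip(line))
            else:
                awards_authors.setdefault(author, []).append(line)

# Shortened sample; the middle entry has a name only.
sample = io.StringIO(
    "Brabudy, Ray\nHugo Award\nEllison, Harlan\n\n"
    "Clarke, Arthur\n\n"
    "Ellison, Harlan\nNebula Award\nAdams, Douglas\n"
)
similar, awards = {}, {}
load_author_dicts(sample, similar, awards)
print(awards)
print(similar)
```

Because each block is handled on its own, there is no need for `continue` or for flags that have to be reset, and `setdefault` replaces the "does the key exist yet" branches.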