Parsing through a deep-nested XML File in Python

Parsing through a deep-nested XML File in Python - python

I am looking at an xml file similar to the below:
<pinnacle_line_feed>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
<lastContest>28962804</lastContest>
<lastGame>162995589</lastGame>
<events>
<event>
<event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
<gamenumber>422739932</gamenumber>
<sporttype>Alpine Skiing</sporttype>
<league>DH 145</league>
<IsLive>No</IsLive>
<participants>
<participant>
<participant_name>Kjetil Jansrud (NOR)</participant_name>
<contestantnum>2001</contestantnum>
<rotnum>2001</rotnum>
<visiting_home_draw>Visiting</visiting_home_draw>
</participant>
<participant>
<participant_name>The Field</participant_name>
<contestantnum>2002</contestantnum>
<rotnum>2002</rotnum>
<visiting_home_draw>Home</visiting_home_draw>
</participant>
</participants>
<periods>
<period>
<period_number>0</period_number>
<period_description>Matchups</period_description>
<periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
<period_status>I</period_status>
<period_update>open</period_update>
<spread_maximum>200</spread_maximum>
<moneyline_maximum>100</moneyline_maximum>
<total_maximum>200</total_maximum>
<moneyline>
<moneyline_visiting>116</moneyline_visiting>
<moneyline_home>-136</moneyline_home>
</moneyline>
</period>
</periods>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
</event>
</events>
</pinnacle_line_feed>
I have parsed the file with the code below:
pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'
tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()
list = []
for event in root.iter('event'):
event_datetimeGMT = event.find('event_datetimeGMT').text
gamenumber = event.find('gamenumber').text
sporttype = event.find('sporttype').text
league = event.find('league').text
IsLive = event.find('IsLive').text
for participants in event.iter('participants'):
for participant in participants.iter('participant'):
p1_name = participant.find('participant_name').text
contestantnum = participant.find('contestantnum').text
rotnum = participant.find('rotnum').text
vhd = participant.find('visiting_home_draw').text
for periods in event.iter('periods'):
for period in periods.iter('period'):
period_number = period.find('period_number').text
desc = period.find('period_description').text
pdatetime = period.find('periodcutoff_datetimeGMT')
status = period.find('period_status').text
update = period.find('period_update').text
max = period.find('spread_maximum').text
mlmax = period.find('moneyline_maximum').text
tot_max = period.find('total_maximum').text
for moneyline in period.iter('moneyline'):
ml_vis = moneyline.find('moneyline_visiting').text
ml_home = moneyline.find('moneyline_home').text
However, I am hoping to get the nodes separated by event similar to a 2D table (as in a pandas dataframe). However, the full xml file has multiple "event" children, some events that do not share the same nodes as above. I am struggling quite mightily with being able to take each event node and simply create a 2d table with the tag and that value where the tag acts as the column name and the text acts as the value.
Up to this point, I have done the above to gauge how I might put that information into a dictionary and subsequently put a number of dictionaries into a list from which I can create a dataframe using pandas, but that has not worked out, as all attempts have required me to find and replace text to create the dxcictionaries and python has not responded well to that when attempting to subsequently create a dataframe. I have also used a simple:
for elt in tree.iter():
list.append("'%s': '%s'") % (elt.tag, elt.text.strip()))
which worked quite well in simple pulling out every single tag and the corresponding text, but I was unable to make anything of that because any attempts at finding and replacing the text to create dictionaries was no good.
Any assistance would be greatly appreciated.
Thank you.

Here's an easy way to get your XML into a pandas dataframe. This utilizes the awesome requests library (which you can switch for urllib if you'd like, as well as the always helpful xmltodict library available in pypi. (NOTE: a reverse library is also available, knows as dicttoxml)
import json
import pandas
import requests
import xmltodict
web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
# Make that unweidly XML doc look like a native Dictionary!
result = xmltodict.parse(web_request.text)
# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
# visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))
# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
To get some idea of what this is starting to look like, here's the first couple of lines of the CSV:
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo) # Output the df to a CSV file
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball
Notice that the participants and periods columns are still their native Python dictionaries. You'll either need to remove them from the columns list, or do some additional mangling to get them to flatten out:
# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball

Related

xml elements in elements to python dataframe

i'm trying to convert the xml data into pandas dataframe.
what i'm struggling is that i cannot get the elements in the element.
here is the example of my xml file.
i'm trying to extract the information of
-orth :"decrease"
-cre_date:2013/12/07
-morph_grp -> var type :"decease"
-subsense - eg: "abcdabcdabcd."
<superEntry>
<orth>decrease</orth>
<entry n="1" pos="vk">
<mnt_grp>
<cre>
<cre_date>2013/12/07</cre_date>
<cre_writer>james</cre_writer>
<cre_writer>jen</cre_writer>
</cre>
<mod>
<mod_date>2007/04/14</mod_date>
<mod_writer>kim</mod_writer>
<mod_note>edited ver</mod_note>
</mod>
<mod>
<mod_date>2009/11/01</mod_date>
<mod_writer>kim</mod_writer>
<mod_note>edited</mod_note>
</mod>
</mnt_grp>
<morph_grp>
<var type="spr">decease</var>
<cntr opt="opt" type="oi"/>
<org lg="si">decrease_</org>
<infl type="reg"/>
</morph_grp>
<sense n="01">
<sem_grp>
<sem_class>active solution</sem_class>
<trans>be added and subtracted to</trans>
</sem_grp>
<frame_grp type="FIN">
<frame>X=N0-i Y=N1-e V</frame>
<subsense>
<sel_rst arg="X" tht="THM">countable</sel_rst>
<sel_rst arg="Y" tht="GOL">countable</sel_rst>
<eg>abcdabcdabcd.</eg>
<eg>abcdabcdabcd.</eg>
</subsense>
and i'm using the code
df_cols=["orth","cre_Date","var type","eg"]
row=[]
for node in xroot:
a=node.attrib.get("sense")
b=node.attrib.get("orth").text if node is not None else None
c=node.attrib.get("var type").text if node is not None else None
d=node.attrib.get("eg").text if node is not None else None
rows.append({"orth":a, "entry":b,
"morph_grp":c, "eg" : d})
out_df= pd.DataFrame(rows,colums=df_cols)
and i'm stuck with getting the element inside the element
any good solution for this?
thank you so much in advance

Making some assumptions about what you want, here is an approach using XPath.
I'm assuming you will be iterating over multiple XML files that each have one superEntry root node in order to generate a DataFrame with more than one record.
Or, perhaps your actual XML doc has a higher-level root/parent element above superEntry, and you will be iterating over multiple superEntry elements within that.
You will need to modify the below accordingly to add your loop.
Also, the provided example XML had two of the "eg" elements with same value. Not sure how you want to handle that. The below will just get the first one. If you need to deal with both, then you can use the findall() method instead of find().
I was a little confused about what you wanted from the "var" element. You indicated "var type", but that you wanted the value to be "deceased", which is the text in the "var" element, whereas "type" is an attribute with a value of "spr". I assumed you wanted the text instead of the attribute value.
import pandas as pd
import xml.etree.ElementTree as ET
df_cols = ["orth","cre_Date","var","eg"]
data = []
xmlDocPath = "example.xml"
tree = ET.parse(xmlDocPath)
superEntry = tree.getroot()
#Below XPaths will just get the first occurence of these elements:
orth = superEntry.find("./orth").text
cre_Date = superEntry.find("./entry/mnt_grp/cre/cre_date").text
var = superEntry.find("./entry/morph_grp/var").text
eg = superEntry.find("./entry/sense/frame_grp/subsense/eg").text
data.append({"orth":orth, "cre_Date":cre_Date, "var":var, "eg":eg})
#After exiting Loop, create DataFrame:
df = pd.DataFrame(data, columns=df_cols)
df.head()
Output:
orth cre_Date var eg
0 decrease 2013/12/07 decease abcdabcdabcd.
Here is a link to the ElementTree documentation for XPath usage: https://docs.python.org/3/library/xml.etree.elementtree.html#xpath-support

Converting molecule name to SMILES?

I was just wondering, is there any way to convert IUPAC or common molecular names to SMILES? I want to do this without having to manually convert every single one utilizing online systems. Any input would be much appreciated!
For background, I am currently working with python and RDkit, so I wasn't sure if RDkit could do this and I was just unaware. My current data is in the csv format.
Thank you!

RDKit cant convert names to SMILES.
Chemical Identifier Resolver can convert names and other identifiers (like CAS No) and has an API so you can convert with a script.
from urllib.request import urlopen
from urllib.parse import quote
def CIRconvert(ids):
try:
url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ids) + '/smiles'
ans = urlopen(url).read().decode('utf8')
return ans
except:
return 'Did not work'
identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']
for ids in identifiers :
print(ids, CIRconvert(ids))
Output
3-Methylheptane CCCCC(C)CC
Aspirin CC(=O)Oc1ccccc1C(O)=O
Diethylsulfate CCO[S](=O)(=O)OCC
Diethyl sulfate CCO[S](=O)(=O)OCC
50-78-2 CC(=O)Oc1ccccc1C(O)=O
Adamant Did not work

OPSIN (https://opsin.ch.cam.ac.uk/) is another solution for name2structure conversion.
It can be used by installing the cli, or via https://github.com/gorgitko/molminer
(OPSIN is used by the RDKit KNIME nodes also)

PubChemPy has some great features that can be used for this purpose. It supports IUPAC systematic names, trade names and all known synonyms for a given Compound as documented in PubChem database:
https://pubchempy.readthedocs.io/en/latest/
>>> import pubchempy as pcp
>>> results = pcp.get_compounds('Glucose', 'name')
>>> print results
[Compound(79025), Compound(5793), Compound(64689), Compound(206)]
The first argument is the identifier, and the second argument is the identifier type, which must be one of name, smiles, sdf, inchi, inchikey or formula. It looks like there are 4 compounds in the PubChem Database that have the name Glucose associated with them. Let’s take a look at them in more detail:
>>> for compound in results:
>>> print compound.isomeric_smiles
C([C##H]1[C#H]([C##H]([C#H]([C#H](O1)O)O)O)O)O
C([C##H]1[C#H]([C##H]([C#H](C(O1)O)O)O)O)O
C([C##H]1[C#H]([C##H]([C#H]([C##H](O1)O)O)O)O)O
C(C1C(C(C(C(O1)O)O)O)O)O
It looks like they all have different stereochemistry information !

The accepted answer uses the Chemical Identifier Resolver but for some reason the website seems to be buggy for me and the API seems to be messed up.
So another way to connvert smiles to IUPAC name is with the the PubChem python API, which can work if your smiles is in their database
e.g.
#!/usr/bin/env python
import sys
import pubchempy as pcp
smiles = str(sys.argv[1])
print(smiles)
s= pcp.get_compounds(smiles,'smiles')
print(s[0].iupac_name)

You can use batch query of pubchem:
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi
https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange-help.html

You can use the pubchem API (PUG REST) for this
(https://pubchemdocs.ncbi.nlm.nih.gov/pug-rest-tutorial)
Basically, the url you are calling will take the compound as a "name", you then give the name, then you specify that you want the "property" of "CanonicalSMILES", as text
identifiers = ['3-Methylheptane', 'Aspirin', 'Diethylsulfate', 'Diethyl sulfate', '50-78-2', 'Adamant']
smiles_df = pd.DataFrame(columns = ['Name', 'Smiles'])
for x in identifiers :
try:
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/' + x + '/property/CanonicalSMILES/TXT'
# remove new line character with rstrip
smiles = requests.get(url).text.rstrip()
if('NotFound' in smiles):
print(x, " not found")
else:
smiles_df = smiles_df.append({'Name' : x, 'Smiles' : smiles}, ignore_index = True)
except:
print("boo ", x)
print(smiles_df)

Loop function across multiple XML files in directory so each XML becomes a row in a CSV

I've figured out how to get data from a single XML file into a row on a CSV. I'd like to iterate this across a number of files in a directory so that the data from each XML file is extracted to a new row on the CSV. I've done some searching and I get the gist of having to create a loop (perhaps using the OS module) but the specifics are lost on me.
This script does the extraction for a single XML file.
import xml.etree.ElementTree as ET
import csv
tree = ET.parse("[PATH/FILE.xml]")
root = tree.getroot()
test_file = open('PATH','w',newline='')
csvwriter = csv.writer(test_file)
header = []
count = 0
for trial in root.iter('[XML_ROOT]'):
item_info = []
if count == 0:
item_ID = trial.find('itemid').tag
header.append(item_ID)
data_1 = trial.find('data1').tag
header.append(data_1)
csvwriter.writerow(header)
count = count + 1
item_ID = trial.find('itemid').text
item_info.append(item_ID)
data_1 = trial.find('data1').text
trial_info.append(data_1)
csvwriter.writerow(item_info)
test_file.close()
Now I need to figure out what to do to it to iterate.
Edit:
Here is an example of an XML file i'm using. Just for testing i'm pulling out actrnumber as item_id and stage as data_1. Eventually I'll need to figure out the most sensible way to create arrays for the nested data. For instance in the outcomes node, nesting the data, probably in an array for primaryOutcome and all secondaryOutcome instances.
<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="1">
<stage>Registered</stage>
<submitdate>6/07/2005</submitdate>
<approvaldate>7/07/2005</approvaldate>
<actrnumber>ACTRN12605000001695</actrnumber>
<trial_identification>
<studytitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas</studytitle>
<scientifictitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas with the primary objective tumour response</scientifictitle>
<utrn />
<trialacronym>ABC trial</trialacronym>
<secondaryid>National Clinical Trials Registry: NCTR570</secondaryid>
</trial_identification>
<conditions>
<healthcondition>Adenocarcinoma of the gallbladder or intra/extrahepatic bile ducts</healthcondition>
<conditioncode>
<conditioncode1>Cancer</conditioncode1>
<conditioncode2>Biliary tree (gall bladder and bile duct)</conditioncode2>
</conditioncode>
</conditions>
<interventions>
<interventions>Gemcitabine delivered as fixed dose-rate infusion with cisplatin</interventions>
<comparator>Single arm trial</comparator>
<control>Uncontrolled</control>
<interventioncode>Treatment: drugs</interventioncode>
</interventions>
<outcomes>
<primaryOutcome>
<outcome>Objective tumour response.</outcome>
<timepoint>Measured every 6 weeks during study treatment, and post treatment.</timepoint>
</primaryOutcome>
<secondaryOutcome>
<outcome>Tolerability and safety of treatment</outcome>
<timepoint>Prior to each cycle of treatment, and at end of treatment</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Duration of response</outcome>
<timepoint>Prior to starting every second treatment cycle, then 6 monthly for 12 months, then as clinically indicated</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Time to treatment failure</outcome>
<timepoint>Assessed at end of treatment</timepoint>
</secondaryOutcome>
...
</ANZCTR_Trial>

Simply generalize your process in a method and iterate across files with os.listdir assuming all XML files reside in same folder. And be sure to use context manager using with to better manage the open/close file process.
Also, your header parsing is redundant since you name the very tags that you extract: itemid and data1. Node names likely stay the same so can be hard-coded while text values differ, requiring parsing. Below uses list comprehension for a more streamlined collection of data within XML files and across XML files. This also separates the XML parsing and CSV writing.
# GENERALIZED METHOD
def proc_xml(xml_path):
full_path = os.path.join('/path/to/xml/folder', xml_path)
print(full_path)
tree = ET.parse(full_path)
root = tree.getroot()
item_info = [[trial.find('itemid').text, trial.find('data1').text] \
for trial in root.iter('[XML_ROOT]')][0]
return item_info
# NESTED LIST OF XML DATA PER FILE
xml_data_lst = [proc_xml(f) for f in os.listdir('/path/to/xml/folder') \
if f.endswith('.xml')]
# WRITE TO CSV FILE
with open('/path/to/final.csv', 'w', newline='') as test_file:
csvwriter = csv.writer(test_file)
# HEADERS
csvwriter.writerow(['itemid', 'data1'])
# DATA ROWS
for i in xml_data_lst:
csvwriter.writerow(i)

While .find gets you the next match, .findall should return a list of all of them. So you could do something like this:
extracted_IDs = []
item_IDs = trial.findall('itemid')
for id_tags in item_IDs:
extracted_IDs.append(id_tag.text)
Or, to do the same thing in one line:
extracted_IDs = [item.text for item in trial.findall('itemid')]
Likewise, try:
extracted_data = [item.text for item in trial.findall('data1')]
If you have an equal number of both, and if the row you want to write each time is in the form of [<itemid>,<data1>] paired sets, then you can just make a combined set like this:
combined_pairs = [(extracted_IDs[i], extracted_data[i]) for i in range(len(extracted_IDs))]

Python: Joining and writing (XML.etrees) trees stored in a list

I'm looping over some XML files and producing trees that I would like to store in a defaultdict(list) type. With each loop and the next child found will be stored in a separate part of the dictionary.
d = defaultdict(list)
counter = 0
for child in root.findall(something):
tree = ET.ElementTree(something)
d[int(x)].append(tree)
counter += 1
So then repeating this for several files would result in nicely indexed results; a set of trees that were in position 1 across different parsed files and so on. The question is, how do I then join all of d, and write the trees (as a cumulative tree) to a file?
I can loop through the dict to get each tree:
for x in d:
for y in d[x]:
print (y)
This gives a complete list of trees that were in my dict. Now, how do I produce one massive tree from this?
Sample input file 1
Sample input file 2
Required results from 1&2
Given the apparent difficulty in doing this, I'm happy to accept more general answers that show how I can otherwise get the result I am looking for from two or more files.

Use Spyne:
from spyne.model.primitive import *
from spyne.model.complex import *
class GpsInfo(ComplexModel):
UTC = DateTime
Latitude = Double
Longitude = Double
DopplerTime = Double
Quality = Unicode
HDOP = Unicode
Altitude = Double
Speed = Double
Heading = Double
Estimated = Boolean
class Header(ComplexModel):
Name = Unicode
Time = DateTime
SeqNo = Integer
class CTrailData(ComplexModel):
index = UnsignedInteger
gpsInfo = GpsInfo
Header = Header
class CTrail(ComplexModel):
LastError = AnyXml
MaxTrial = Integer
Trail = Array(CTrailData)
from lxml import etree
from spyne.util.xml import *
file_1 = get_xml_as_object(etree.fromstring(open('file1').read()), CTrail)
file_2 = get_xml_as_object(etree.fromstring(open('file2').read()), CTrail)
file_1.Trail.extend(file_2.Trail)
file_1.Trail.sort(key=lambda x: x.index)
elt = get_object_as_xml(file_1, no_namespace=True)
print etree.tostring(elt, pretty_print=True)
While doing this, Spyne also converts the data fields from string to their native Python formats as well, so it'll be much easier for you to work with the data from this xml document.
Also, if you don't mind using the latest version from git, you can do e.g.:
class GpsInfo(ComplexModel):
# (...)
doppler_time = Double(sub_name="DopplerTime")
# (...)
so that you can get data from the CamelCased tags without having to violate PEP8.

Use lxml.objectify:
from lxml import etree, objectify
obj_1 = objectify.fromstring(open('file1').read())
obj_2 = objectify.fromstring(open('file2').read())
obj_1.Trail.CTrailData.extend(obj_2.Trail.CTrailData)
# .sort() won't work as objectify's lists are not regular python lists.
obj_1.Trail.CTrailData = sorted(obj_1.Trail.CTrailData, key=lambda x: x.index)
print etree.tostring(obj_1, pretty_print=True)
It doesn't do the additional conversion work that the Spyne variant does, but for your use case, that might be enough.

Filter large file using python, using contents of another

I have a ~1GB text file of data entries and another list of names that I would like to use to filter them. Running through every name for each entry will be terribly slow. What's the most efficient way of doing this in python? Is it possible to use a hash table if the name is embedded in the entry? Can I use make of the fact that the name part is consistently placed?
Example files:
Entries file -- each part of the entry is separated by a tab, until the names
246 lalala name="Jack";surname="Smith"
1357 dedada name="Mary";surname="White"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"
Names file -- each on a newline
Jack
Dan
Ryan
Desired output -- only entries with a name in the names file
246 lalala name="Jack";surname="Smith"
123456 lala name="Dan";surname="Brown"
555555 lalala name="Jack";surname="Joe"

You can use the set data structure to store the names — it offers efficient lookup but if the names list is very large then you may run into memory troubles.
The general idea is to iterate through all the names, adding them to a set, then checking if each name from each line from the data file is contained in the set. As the format of the entries doesn't vary, you should be able to extract the names with a simple regular expression.
If you run into troubles with the size of the names set, you can read n lines from the names file and repeat the process for each set of names, unless you require sorting.

My first instinct was to make a dictionary of with names as keys, assuming that it was most efficient to look up the names using the hash of keys in the dictionary.
Given the answer, by #rfw, using a set of names, I edited the code as below and tested it against the two methods, using a dict of names and a set.
I built a dummy dataset of over 40 M records and over 5400 names. Using this dataset, the set method consistently had the edge on my machine.
import re
from collections import Counter
import time
# names file downloaded from http://www.tucows.com/preview/520007
# the set contains over 5400 names
f = open('./names.txt', 'r')
names = [ name.rstrip() for name in f.read().split(',') ]
name_set = set(names) # set of unique names
names_dict = Counter(names) # Counter ~= dict of names with counts
# Expect: 246 lalala name="Jack";surname="Smith"
pattern = re.compile(r'.*\sname="([^"]*)"')
def select_rows_set():
f = open('./data.txt', 'r')
out_f = open('./data_out_set.txt', 'a')
for record in f.readlines():
name = pattern.match(record).groups()[0]
if name in name_set:
out_f.write(record)
out_f.close()
f.close()
def select_rows_dict():
f = open('./data.txt', 'r')
out_f = open('./data_out_dict.txt', 'a')
for record in f.readlines():
name = pattern.match(record).groups()[0]
if name in names_dict:
out_f.write(record)
out_f.close()
f.close()
if __name__ == '__main__':
# One round to time the use of name_set
t0 = time.time()
select_rows_set()
t1 = time.time()
time_for_set = t1-t0
print 'Total set: ', time_for_set
# One round to time the use of names_dict
t0 = time.time()
select_rows_dict()
t1 = time.time()
time_for_dict = t1-t0
print 'Total dict: ', time_for_dict
I assumed that a Counter, being at heart a dictionary, and easier to build from the dataset, does not add any overhead to the access time. Happy to be corrected if I am missing something.

Your data is clearly structured as a table so this may be applicable.
Data structure for maintaining tabular data in memory?

You could create a custom data structure with its own "search by name" function. That'd be a list of dictionaries of some sort. This should take less memory than the size of your text file as it'll remove duplicate information you have on each line such as "name" and "surname", which would be dictionary keys. If you know a bit of SQL (very little is required here) then go with Filter large file using python, using contents of another

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing through a deep-nested XML File in Python - python

Related

xml elements in elements to python dataframe

Converting molecule name to SMILES?

Loop function across multiple XML files in directory so each XML becomes a row in a CSV

Python: Joining and writing (XML.etrees) trees stored in a list

Filter large file using python, using contents of another

Categories

Resources