Struggling to ingest XML elements into Pandas dataframe - python

What I am trying to do is learn Pandas dataframes so I can work with and analyse data coming in XML format. In particular I want to be confident in ingesting nested XML. Most of the tutorials I have read or watched on YouTube use flat XML documents with no nesting, which does not represent real-world data, so I am trying something a bit more challenging.
I have knocked up some code in Python with a view to generating Pandas dataframes I can start practising querying with the Pandas framework.
I am using an open source music resource, 'Discogs', because they provide access to large XML files with lots of data I can play with.
There are a couple of challenges with the source data. The first is that there is no standardised schema for the tables, so the structure of the data is not consistent throughout an XML table (an issue I feel mimics the real data I'll eventually be working with). The second is that the source files are huge, the smallest being 1.5GB.
The first step I took was to split the files into smaller 200MB chunks. I then looked at the structure with a text editor so I had a good understanding of the tags and elements I needed to work with. Right now I am working with a table called 'Masters'. I am hard coding the elements I am trying to pull into a dataframe to keep the exercise simple and contained for now.
I am using xml.etree to parse an XML document and interact with each element it contains.
I have created a static dataframe with 8 columns for the data to go into. Again, keeping it simple for now.
I am then searching for specific elements within the parsed XML data and extracting the text from each into a variable, one per element of interest.
The data is broken down within this xml as a set of rows, each wrapped in a tag called master. So I use this tag as my root anchor to loop around.
If I run the above as a print to the console, everything works fine up to this point, and I get a stream of nicely flattened and well-formed data (excluding some elements which randomly have None values and therefore throw an error).
The last step was to then parse the strings from each collected element into a row appended to the data-frame.
This is where I hit a problem. The code to append to the data-frame seems straightforward, but when I add it to my for loop, I get an endless loop which I have to force to end.
I am obviously missing something here; advice is greatly appreciated. The code I am working with is below:
import xml.etree.ElementTree as et
import re
import pandas as pd

tree = et.parse('/media/linux/Data1/TestData/masters/200mb/masters-01.xml')
root = tree.getroot()

masters_df_cols = ["MasterID", "MainRelese", "Title", "Year",
                   "Genre", "ArtistID", "ArtistName"]
masters_df = pd.DataFrame(columns=masters_df_cols)

for elem in root.iter('master'):
    if elem is not None:
        masterID = str(elem.get('id'))
        mainRelease = str(elem.find('main_release').text)
        year = str(elem.find('year').text)
        title = str(elem.find('title').text)
        genre = str(elem.find('./genres/genre').text)
        #style = str(elem.find('./styles/style').text)
        artistID = str(elem.find('./artists/artist/id').text)
        artistName = str(elem.find('./artists/artist/name').text)

        print(masterID, ':', mainRelease, ':', year, ':', title,
              ':', genre, ':', artistID, ':', artistName)

        masters_df = masters_df.append(
            pd.DataFrame([masterID, mainRelease, year, title, genre,
                          artistID, artistName],
                         index=masters_df_cols),
            ignore_index=True)

print("Dataframe exported.")
The goal is to eventually take this exercise and replicate the knowledge I gain from it across different types of XML, giving me the skill to search dynamically through XML files for the tags and elements I want to draw out into a dataframe. Then use the dataframes to generate meaningful stats about the data content. For now I am just trying to create simple flat dataframes with hard-coded element values.

There are a couple of issues in the above code.
First and foremost, it is important to be vigilant with the whitespace and indentation in Python. The copy/paste of the above code had a combination of space and tab whitespace.
The primary semantic problem is that Pandas is not good at incrementally building DataFrames. Typically, the Pandas API is handed an entire data structure at once so the framework can do its heavy lifting. There are methods to append DataFrames, but here a DataFrame was instantiated and then, for each iteration over the XML, a new DataFrame was instantiated and appended to the first. This is almost certainly not what was desired and would cause memory headaches, especially in the face of the voluminous XML, which already requires a lot of host memory.
There were also minor issues in constructing the row-level DataFrame. Here is a working MCVE illustrating how to parse the Discogs XML, including sample in-line data.
import xml.etree.ElementTree as et
import re
import pandas as pd
parser = et.XMLParser()
discogs_masters = """<masters>
<master id="18500"><main_release>155102</main_release><images><image height="588" type="primary" uri="" uri150="" width="600" /></images><artists><artist><id>212070</id><name>Samuel L Session</name><anv>Samuel L</anv><join /><role /><tracks /></artist></artists><genres><genre>Electronic</genre></genres><styles><style>Techno</style></styles><year>2001</year><title>New Soil</title><data_quality>Correct</data_quality><videos><video duration="489" embed="true" src="http://www.youtube.com/watch?v=f05Ai921itM"><title>Samuel L - Velvet</title><description>Samuel L - Velvet</description></video><video duration="292" embed="true" src="http://www.youtube.com/watch?v=iOQsBOJLbwg"><title>Samuel L. - Danshes D'afrique</title><description>Samuel L. - Danshes D'afrique</description></video><video duration="348" embed="true" src="http://www.youtube.com/watch?v=v23rSPG_StA"><title>Samuel L - Danses D'Afrique</title><description>Samuel L - Danses D'Afrique</description></video><video duration="288" embed="true" src="http://www.youtube.com/watch?v=tHo82ha6p40"><title>Samuel L - Body N' Soul</title><description>Samuel L - Body N' Soul</description></video><video duration="331" embed="true" src="http://www.youtube.com/watch?v=KDcqzHca5dk"><title>Samuel L - Into The Groove</title><description>Samuel L - Into The Groove</description></video><video duration="334" embed="true" src="http://www.youtube.com/watch?v=3DIYjJFl8Dk"><title>Samuel L - Soul Syndrome</title><description>Samuel L - Soul Syndrome</description></video><video duration="325" embed="true" src="http://www.youtube.com/watch?v=_o8yZMPqvNg"><title>Samuel L - Lush</title><description>Samuel L - Lush</description></video><video duration="346" embed="true" src="http://www.youtube.com/watch?v=JPwwJSc_-30"><title>Samuel L - Velvet ( Direct Me )</title><description>Samuel L - Velvet ( Direct Me )</description></video></videos></master>
<master id="18512"><main_release>33699</main_release><images><image height="150" type="primary" uri="" uri150="" width="150" /><image height="592" type="secondary" uri="" uri150="" width="600" /><image height="592" type="secondary" uri="" uri150="" width="600" /></images><artists><artist><id>212070</id><name>Samuel L Session</name><anv /><join /><role /><tracks /></artist></artists><genres><genre>Electronic</genre></genres><styles><style>Tribal</style><style>Techno</style></styles><year>2002</year><title>Psyche EP</title><data_quality>Correct</data_quality><videos><video duration="376" embed="true" src="http://www.youtube.com/watch?v=c_AfLqTdncI"><title>Samuel L. Session - Psyche Part 1</title><description>Samuel L. Session - Psyche Part 1</description></video><video duration="419" embed="true" src="http://www.youtube.com/watch?v=0nxvR8Zl9wY"><title>Samuel L. Session - Psyche Part 2</title><description>Samuel L. Session - Psyche Part 2</description></video><video duration="118" embed="true" src="http://www.youtube.com/watch?v=QYf4j0Pd2FU"><title>Samuel L. Session - Arrival</title><description>Samuel L. Session - Arrival</description></video></videos></master>
</masters>
"""
parser.feed(discogs_masters)
root = parser.close()
masters_df_cols = ["MasterID", "MainRelese", "Title", "Year",
"Genre", "ArtistID", "ArtistName"]
masters_rows = []
for elem in root.iter('master'):
    if elem is not None:
        masterID = str(elem.get('id'))
        mainRelease = str(elem.find('main_release').text)
        year = str(elem.find('year').text)
        title = str(elem.find('title').text)
        genre = str(elem.find('./genres/genre').text)
        artistID = str(elem.find('./artists/artist/id').text)
        artistName = str(elem.find('./artists/artist/name').text)
        masters_rows.append([masterID, mainRelease, title, year,
                             genre, artistID, artistName])

masters_df = pd.DataFrame(masters_rows, columns=masters_df_cols)
print(masters_df)
This produces the following output:
  MasterID MainRelese      Title  Year       Genre ArtistID        ArtistName
0    18500     155102   New Soil  2001  Electronic   212070  Samuel L Session
1    18512      33699  Psyche EP  2002  Electronic   212070  Samuel L Session
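One more note, since the question mentions elements that sometimes come back as None: find() returns None when a path has no match, so calling .text on the result raises an AttributeError. A small helper (a sketch, not part of the original answer) keeps the loop from crashing on those rows:
def find_text(elem, path, default=None):
    # Return the text of the first match for path, or default when the element is absent
    node = elem.find(path)
    return node.text if node is not None else default

# e.g. inside the loop above:
# genre = find_text(elem, './genres/genre')
# artistName = find_text(elem, './artists/artist/name')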


Loop function across multiple XML files in directory so each XML becomes a row in a CSV

I've figured out how to get data from a single XML file into a row on a CSV. I'd like to iterate this across a number of files in a directory so that the data from each XML file is extracted to a new row on the CSV. I've done some searching and I get the gist of having to create a loop (perhaps using the OS module) but the specifics are lost on me.
This script does the extraction for a single XML file.
import xml.etree.ElementTree as ET
import csv

tree = ET.parse("[PATH/FILE.xml]")
root = tree.getroot()

test_file = open('PATH', 'w', newline='')
csvwriter = csv.writer(test_file)

header = []
count = 0
for trial in root.iter('[XML_ROOT]'):
    item_info = []
    if count == 0:
        item_ID = trial.find('itemid').tag
        header.append(item_ID)
        data_1 = trial.find('data1').tag
        header.append(data_1)
        csvwriter.writerow(header)
        count = count + 1
    item_ID = trial.find('itemid').text
    item_info.append(item_ID)
    data_1 = trial.find('data1').text
    item_info.append(data_1)
    csvwriter.writerow(item_info)
test_file.close()
Now I need to figure out how to make it iterate over a directory of files.
Edit:
Here is an example of an XML file I'm using. Just for testing I'm pulling out actrnumber as item_id and stage as data_1. Eventually I'll need to figure out the most sensible way to create arrays for the nested data, for instance the outcomes node, probably as an array holding the primaryOutcome and all secondaryOutcome instances.
<?xml-stylesheet type='text/xsl' href='anzctrTransform.xsl'?>
<ANZCTR_Trial requestNumber="1">
<stage>Registered</stage>
<submitdate>6/07/2005</submitdate>
<approvaldate>7/07/2005</approvaldate>
<actrnumber>ACTRN12605000001695</actrnumber>
<trial_identification>
<studytitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas</studytitle>
<scientifictitle>A phase II trial of gemcitabine in a fixed dose rate infusion combined with cisplatin in patients with operable biliary tract carcinomas with the primary objective tumour response</scientifictitle>
<utrn />
<trialacronym>ABC trial</trialacronym>
<secondaryid>National Clinical Trials Registry: NCTR570</secondaryid>
</trial_identification>
<conditions>
<healthcondition>Adenocarcinoma of the gallbladder or intra/extrahepatic bile ducts</healthcondition>
<conditioncode>
<conditioncode1>Cancer</conditioncode1>
<conditioncode2>Biliary tree (gall bladder and bile duct)</conditioncode2>
</conditioncode>
</conditions>
<interventions>
<interventions>Gemcitabine delivered as fixed dose-rate infusion with cisplatin</interventions>
<comparator>Single arm trial</comparator>
<control>Uncontrolled</control>
<interventioncode>Treatment: drugs</interventioncode>
</interventions>
<outcomes>
<primaryOutcome>
<outcome>Objective tumour response.</outcome>
<timepoint>Measured every 6 weeks during study treatment, and post treatment.</timepoint>
</primaryOutcome>
<secondaryOutcome>
<outcome>Tolerability and safety of treatment</outcome>
<timepoint>Prior to each cycle of treatment, and at end of treatment</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Duration of response</outcome>
<timepoint>Prior to starting every second treatment cycle, then 6 monthly for 12 months, then as clinically indicated</timepoint>
</secondaryOutcome>
<secondaryOutcome>
<outcome>Time to treatment failure</outcome>
<timepoint>Assessed at end of treatment</timepoint>
</secondaryOutcome>
...
</ANZCTR_Trial>
Simply generalize your process into a method and iterate across files with os.listdir, assuming all XML files reside in the same folder. Also be sure to use a context manager (with) to better manage opening and closing the file.
Your header parsing is also redundant, since you name the very tags you extract: itemid and data1. Node names likely stay the same and so can be hard-coded, while text values differ and require parsing. The code below uses list comprehensions for a more streamlined collection of data within and across XML files, and it separates the XML parsing from the CSV writing.
import os
import csv
import xml.etree.ElementTree as ET

# GENERALIZED METHOD
def proc_xml(xml_path):
    full_path = os.path.join('/path/to/xml/folder', xml_path)
    print(full_path)
    tree = ET.parse(full_path)
    root = tree.getroot()
    item_info = [[trial.find('itemid').text, trial.find('data1').text]
                 for trial in root.iter('[XML_ROOT]')][0]
    return item_info

# NESTED LIST OF XML DATA PER FILE
xml_data_lst = [proc_xml(f) for f in os.listdir('/path/to/xml/folder')
                if f.endswith('.xml')]

# WRITE TO CSV FILE
with open('/path/to/final.csv', 'w', newline='') as test_file:
    csvwriter = csv.writer(test_file)
    # HEADERS
    csvwriter.writerow(['itemid', 'data1'])
    # DATA ROWS
    for i in xml_data_lst:
        csvwriter.writerow(i)
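For the sample ANZCTR file shown in the question, the placeholders would map roughly as follows (a hedged sketch; the tag names come from the sample above, where the document root is ANZCTR_Trial, actrnumber is the item_id, and stage is data_1):
import os
import xml.etree.ElementTree as ET

def proc_xml(xml_path):
    full_path = os.path.join('/path/to/xml/folder', xml_path)
    tree = ET.parse(full_path)
    root = tree.getroot()
    # the root element itself is <ANZCTR_Trial>, and both tags are direct children of it
    return [root.find('actrnumber').text, root.find('stage').text]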
While .find gets you the first match, .findall returns a list of all of them. So you could do something like this:
extracted_IDs = []
item_IDs = trial.findall('itemid')
for id_tag in item_IDs:
    extracted_IDs.append(id_tag.text)
Or, to do the same thing in one line:
extracted_IDs = [item.text for item in trial.findall('itemid')]
Likewise, try:
extracted_data = [item.text for item in trial.findall('data1')]
If you have an equal number of both, and if the row you want to write each time is in the form of [<itemid>,<data1>] paired sets, then you can just make a combined set like this:
combined_pairs = [(extracted_IDs[i], extracted_data[i]) for i in range(len(extracted_IDs))]
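For what it's worth, the same pairing can be written with zip, which also quietly stops at the shorter list if the counts ever differ:
# Equivalent pairing, assuming (as above) that IDs and data values line up by position
combined_pairs = list(zip(extracted_IDs, extracted_data))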

Sentiment analysis - arousal

I am really a beginner in programming, and I have run into a problem. I am making a comparative analysis between fake news and real news. I have a text corpus with approx. 3,000 real and 3,000 fake news articles. I need to figure out whether fake or real news evokes more high-arousal emotions. I want to do that by using the Warriner et al. word list: http://crr.ugent.be/archives/1003
I have imported the word list to my script:
warriner = pd.read_csv('warriner.csv', sep = '\t', encoding = 'utf-8')
print warriner.head()
I (think I) want to find the Arousal Mean Sum, which in the word list is called A.Mean.Sum. But I can't make it work; Spyder just says: 'DataFrame' object has no attribute 'A'.
Can anyone help? I have already calculated the sentiment scores using LabMT, as seen below, but I can't make the Warriner et al. list work.
text_scored = []
for text in df['text']:
    sent_score = tm.labMT_sent(text)
    text_scored.append(sent_score)
df['abs_sent'] = text_scored  # adding the scored text to the df

# relative sentiment score
text_scored = []
for text in df['text']:
    sent_score = tm.labMT_sent(text, rel=True)
    text_scored.append(sent_score)
df['rel_sent'] = text_scored  # adding the scored text to the df

# overall mean
df['abs_sent'].mean()
df['abs_sent'].loc[df['label'] == 'FAKE'].mean()  # 'fake' mean = -22.1
df['abs_sent'].loc[df['label'] == 'REAL'].mean()  # 'real' mean = -41.95

# relative score mean calculations
df['rel_sent'].mean()  # overall mean
df['rel_sent'].loc[df['label'] == 'FAKE'].mean()  # 'fake' mean = -0.02
df['rel_sent'].loc[df['label'] == 'REAL'].mean()  # 'real' mean = -0.05
The example code you provided is hard for me to read. You're reporting the problem as having to do with A.Mean.Sum, but there's no code relating to that. There are also references to Spyder and DataFrame without explanation, code, or tags. Finally, the title should tell the potential answerer something about the problem itself, not the general field the code is working with. The current one expects the reader to find what they're supposed to do from within the report.
I'll readily admit I'm a novice here, but I suggest reading the intro How-to-ask and clarifying your question with it.
I'm also guessing this is a pandas related question, so its docs page might help you.
I hope this was of some help!
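One concrete thing worth checking, offered as a sketch rather than a definitive diagnosis: pandas attribute access (warriner.A) only works for simple column names, so a column literally named A.Mean.Sum has to be read with bracket indexing. Assuming the CSV really is tab-separated and contains that column:
import pandas as pd

warriner = pd.read_csv('warriner.csv', sep='\t', encoding='utf-8')

# warriner.A.Mean.Sum fails with "'DataFrame' object has no attribute 'A'"
# because pandas looks for a column called "A"; bracket indexing treats the
# whole string as a single column label instead
arousal = warriner['A.Mean.Sum']
print(arousal.mean())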

Using search terms with Biopython to return accession numbers

I am trying to use Biopython (Entrez) with search terms that will return the accession number (and not the GI*).
Here is a tiny excerpt of my code:
from Bio import Entrez
Entrez.email = 'myemailaddress'
search_phrase = 'Escherichia coli[organism]) AND (complete genome[keyword])'
handle = Entrez.esearch(db='nuccore', term=search_phrase, retmax=100, rettype='acc', retmode='text')
result = Entrez.read(handle)
handle.close()
gi_numbers = result['IdList']
print(gi_numbers)
['745369752', '910228862', '187736741', '802098270', '802098269',
 '802098267', '387610477', '544579032', '544574430', '215485161',
 '749295052', '387823261', '387605479', '641687520', '641682562',
 '594009615', '557270520', '313848522', '309700213', '284919779',
 '215263233', '544345556', '544340954', '144661', '51773702',
 '202957457', '202957451', '172051323']
I am sure I can convert from GI to accession, but it would be nice to avoid the additional step. What slice of magic am I missing?
Thank you in advance.
*especially since NCBI is phasing out GI numbers
Looking through the docs for esearch on NCBI's website, there are only two rettypes available: uilist, the default XML format you're getting currently (it's parsed into a dict by Entrez.read()), and count, which just displays the Count value (look at the complete contents of result; it's there). I'm unclear on its exact meaning, as it doesn't represent the total number of items in IdList...
At any rate, Entrez.esearch() will take any value of rettype and retmode you like, but it only returns the uilist or count in xml or json mode - no accession IDs, no nothin'.
Entrez.efetch() will pass you back all sorts of cool stuff, depending on which DB you're querying. The downside, of course, is that you need to query by one or more IDs, not by a search string, so in order to get your accession IDs you'd need to run two queries:
search_phrase = "Escherichia coli[organism]) AND (complete genome[keyword])"
handle = Entrez.esearch(db="nuccore", term=search_phrase, retmax=100)
result = Entrez.read(handle)
handle.close()
fetch_handle = Entrez.efetch(db="nuccore", id=result["IdList"], rettype="acc", retmode="text")
acc_ids = [id.strip() for id in fetch_handle]
fetch_handle.close()
print(acc_ids)
gives
['HF572917.2', 'NZ_HF572917.1', 'NC_010558.1', 'NZ_HG941720.1', 'NZ_HG941719.1', 'NZ_HG941718.1', 'NC_017633.1', 'NC_022371.1', 'NC_022370.1', 'NC_011601.1', 'NZ_HG738867.1', 'NC_012892.2', 'NC_017626.1', 'HG941719.1', 'HG941718.1', 'HG941720.1', 'HG738867.1', 'AM946981.2', 'FN649414.1', 'FN554766.1', 'FM180568.1', 'HG428756.1', 'HG428755.1', 'M37402.1', 'AJ304858.2', 'FM206294.1', 'FM206293.1', 'AM886293.1']
So, I'm not terribly sure if I answered your question satisfactorily, but unfortunately I think the answer is "There is no magic."
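One possible shortcut worth checking, as an assumption to verify against NCBI's current E-utilities documentation rather than something the answer above relies on: for sequence databases, newer E-utilities accept an idtype argument on esearch, and Bio.Entrez passes extra keyword arguments straight through to the request, so something like this may return accession.version identifiers in a single query:
from Bio import Entrez

Entrez.email = 'myemailaddress'
search_phrase = 'Escherichia coli[organism]) AND (complete genome[keyword])'

# idtype='acc' was introduced around the GI phase-out; confirm it is honored before relying on it
handle = Entrez.esearch(db='nuccore', term=search_phrase, retmax=100, idtype='acc')
result = Entrez.read(handle)
handle.close()
print(result['IdList'])  # accession.version strings if the option is supported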

Parsing through a deep-nested XML File in Python

I am looking at an xml file similar to the below:
<pinnacle_line_feed>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
<lastContest>28962804</lastContest>
<lastGame>162995589</lastGame>
<events>
<event>
<event_datetimeGMT>2014-12-19 11:15</event_datetimeGMT>
<gamenumber>422739932</gamenumber>
<sporttype>Alpine Skiing</sporttype>
<league>DH 145</league>
<IsLive>No</IsLive>
<participants>
<participant>
<participant_name>Kjetil Jansrud (NOR)</participant_name>
<contestantnum>2001</contestantnum>
<rotnum>2001</rotnum>
<visiting_home_draw>Visiting</visiting_home_draw>
</participant>
<participant>
<participant_name>The Field</participant_name>
<contestantnum>2002</contestantnum>
<rotnum>2002</rotnum>
<visiting_home_draw>Home</visiting_home_draw>
</participant>
</participants>
<periods>
<period>
<period_number>0</period_number>
<period_description>Matchups</period_description>
<periodcutoff_datetimeGMT>2014-12-19 11:15</periodcutoff_datetimeGMT>
<period_status>I</period_status>
<period_update>open</period_update>
<spread_maximum>200</spread_maximum>
<moneyline_maximum>100</moneyline_maximum>
<total_maximum>200</total_maximum>
<moneyline>
<moneyline_visiting>116</moneyline_visiting>
<moneyline_home>-136</moneyline_home>
</moneyline>
</period>
</periods>
<PinnacleFeedTime>1418929691920</PinnacleFeedTime>
</event>
</events>
</pinnacle_line_feed>
I have parsed the file with the code below:
import urllib
import xml.etree.ElementTree as ET

pinny_url = 'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball'
tree = ET.parse(urllib.urlopen(pinny_url))
root = tree.getroot()

list = []
for event in root.iter('event'):
    event_datetimeGMT = event.find('event_datetimeGMT').text
    gamenumber = event.find('gamenumber').text
    sporttype = event.find('sporttype').text
    league = event.find('league').text
    IsLive = event.find('IsLive').text
    for participants in event.iter('participants'):
        for participant in participants.iter('participant'):
            p1_name = participant.find('participant_name').text
            contestantnum = participant.find('contestantnum').text
            rotnum = participant.find('rotnum').text
            vhd = participant.find('visiting_home_draw').text
    for periods in event.iter('periods'):
        for period in periods.iter('period'):
            period_number = period.find('period_number').text
            desc = period.find('period_description').text
            pdatetime = period.find('periodcutoff_datetimeGMT')
            status = period.find('period_status').text
            update = period.find('period_update').text
            max = period.find('spread_maximum').text
            mlmax = period.find('moneyline_maximum').text
            tot_max = period.find('total_maximum').text
            for moneyline in period.iter('moneyline'):
                ml_vis = moneyline.find('moneyline_visiting').text
                ml_home = moneyline.find('moneyline_home').text
However, I am hoping to get the nodes separated by event, similar to a 2D table (as in a pandas dataframe). The full XML file has multiple "event" children, and some events do not share the same nodes as above. I am struggling mightily to take each event node and simply create a 2D table where the tag acts as the column name and the text acts as the value.
Up to this point, I have done the above to gauge how I might put that information into a dictionary and subsequently put a number of dictionaries into a list from which I can create a dataframe using pandas. That has not worked out, as all attempts have required me to find and replace text to create the dictionaries, and Python has not responded well to that when attempting to subsequently create a dataframe. I have also used a simple:
for elt in tree.iter():
    list.append("'%s': '%s'" % (elt.tag, elt.text.strip()))
which worked quite well at simply pulling out every single tag and the corresponding text, but I was unable to make anything of that because my attempts at finding and replacing the text to create dictionaries were no good.
Any assistance would be greatly appreciated.
Thank you.
Here's an easy way to get your XML into a pandas dataframe. This utilizes the awesome requests library (which you can swap for urllib if you'd like), as well as the always helpful xmltodict library available on PyPI. (NOTE: a reverse library is also available, known as dicttoxml.)
import json
import pandas
import requests
import xmltodict
web_request = requests.get(u'http://xml.pinnaclesports.com/pinnacleFeed.aspx?sportType=Basketball')
# Make that unwieldy XML doc look like a native dictionary!
result = xmltodict.parse(web_request.text)
# Next, convert the nested OrderedDict to a real dict, which isn't strictly necessary, but helps you
# visualize what the structure of the data looks like
normal_dict = json.loads(json.dumps(result.get('pinnacle_line_feed', {}).get(u'events', {}).get(u'event', [])))
# Now, make that dictionary into a dataframe
df = pandas.DataFrame.from_dict(normal_dict)
To get some idea of what this is starting to look like, here's the first couple of lines of the CSV:
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo) # Output the df to a CSV file
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,participants,periods,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,"{u'participant': [{u'contestantnum': u'1071', u'rotnum': u'1071', u'visiting_home_draw': u'Home', u'participant_name': u'Obras Sanitarias'}, {u'contestantnum': u'1072', u'rotnum': u'1072', u'visiting_home_draw': u'Visiting', u'participant_name': u'Libertad'}]}",,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,"{u'participant': [{u'contestantnum': u'1079', u'rotnum': u'1079', u'visiting_home_draw': u'Home', u'participant_name': u'Boca Juniors'}, {u'contestantnum': u'1080', u'rotnum': u'1080', u'visiting_home_draw': u'Visiting', u'participant_name': u'Penarol'}]}","{u'period': {u'total_maximum': u'450', u'total': {u'total_points': u'152.5', u'under_adjust': u'-107', u'over_adjust': u'-103'}, u'spread_maximum': u'450', u'period_description': u'Game', u'moneyline_maximum': u'450', u'period_number': u'0', u'period_status': u'I', u'spread': {u'spread_visiting': u'3', u'spread_adjust_visiting': u'-102', u'spread_home': u'-3', u'spread_adjust_home': u'-108'}, u'periodcutoff_datetimeGMT': u'2015-01-06 23:00', u'moneyline': {u'moneyline_visiting': u'136', u'moneyline_home': u'-150'}, u'period_update': u'open'}}",Basketball
Notice that the participants and periods columns are still their native Python dictionaries. You'll either need to remove them from the columns list, or do some additional mangling to get them to flatten out:
# Remove the offending columns in this example by selecting particular columns to show
>>> from StringIO import StringIO
>>> foo = StringIO() # A fake file to write to
>>> df.to_csv(foo, cols=['IsLive', 'event_datetimeGMT', 'gamenumber', 'league', 'sporttype'])
>>> foo.seek(0) # And rewind the file to the beginning
>>> print ''.join(foo.readlines()[:3])
,IsLive,event_datetimeGMT,gamenumber,league,sporttype
0,No,2015-01-10 23:00,426688683,Argentinian,Basketball
1,No,2015-01-06 23:00,426686588,Argentinian,Basketball
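If you do want the nested participants flattened rather than dropped, pandas.json_normalize can expand that column into one row per participant. A rough sketch, assuming a pandas version that exposes json_normalize at the top level and that every event actually carries a participants/participant list:
import pandas as pd

# normal_dict is the list of event dicts built above
participants_df = pd.json_normalize(
    normal_dict,
    record_path=['participants', 'participant'],   # walk into the nested list of participants
    meta=['gamenumber', 'league', 'sporttype'],    # carry event-level fields onto each row
    errors='ignore',                               # tolerate events missing one of the meta fields
)
print(participants_df.head())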

Data analysis for inconsistent string formatting

I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of Excel files that are formatted strangely (and not consistently), and I need to extract certain fields for each entry. An example data set is shown in a screenshot (not reproduced here).
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years

#import the data csv
import sys
import re
import csv

def cleancommas(x):
    toggle=False
    for i,j in enumerate(x):
        if j=="\"":
            toggle=not toggle
        if toggle==True:
            if j==",":
                x=x[:i]+" "+x[i+1:]
    return x

def districtatize(x):
    #list indexes of entries starting with "for" or "to" of length >5
    indices=[1]
    for i,j in enumerate(x):
        if len(j)>2:
            if j[:2]=="to":
                indices.append(i)
        if len(j)>3:
            if j[:3]==" to" or j[:3]=="for":
                indices.append(i)
        if len(j)>5:
            if j[:5]==" \"for" or j[:5]==" \'for":
                indices.append(i)
        if len(j)>4:
            if j[:4]==" \"to" or j[:4]==" \'to" or j[:4]==" for":
                indices.append(i)
    if len(indices)==1:
        return [x[0],x[1:len(x)-1]]
    new=[x[0],x[1:indices[1]+1]]
    z=1
    while z<len(indices)-1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z+=1
    return new
    #should return a list of lists. First entry will be county
    #each successive element in list will be list by district

def splitforstos(string):
    for itemind,item in enumerate(string):             # take all exception cases that didn't get processed
        splitfor=re.split('(?<=\d)\s\s(?=for)',item)   # correctly and split them up so that the for begins
        splitto=re.split('(?<=\d)\s\s(?=to)',item)     # a cell
        if len(splitfor)>1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind,splitfor[0])
            string.insert(itemind+1,splitfor[1])
        elif len(splitto)>1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind,splitto[0])
            string.insert(itemind+1,splitto[1])

def analyze(x):
    #input should be a string of content
    #target values are nomills,levytype,term,yearcom,yeardue
    clean=cleancommas(x)
    countylist=clean.split(',')
    emptystrip=filter(lambda a: a != '',countylist)
    empt2strip=filter(lambda a: a != ' ', emptystrip)
    singstrip=filter(lambda a: a != '\' \'',empt2strip)
    quotestrip=filter(lambda a: a !='\" \"',singstrip)
    splitforstos(quotestrip)
    distd=districtatize(quotestrip)
    print '\n\ndistrictized\n\n',distd
    county = distd[0]
    for x in distd[1:]:
        if len(x)>8:
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            numyears=x[6]
            yearcom=x[8]
            yeardue=x[10]
            reason=x[11]
            data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data",data
        else:
            print "x\n\n",x
            district=x[0]
            vote1=x[1]
            votemil=x[2]
            spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
            vote2=votemil[:spaceindex]
            mills=votemil[spaceindex+1:]
            votetype=x[4]
            special=x[5]
            splitspec=special.split(' ')
            try:
                forind=[i for i,j in enumerate(splitspec) if j=='for'][0]
                numyears=splitspec[forind+1]
                yearcom=splitspec[forind+6]
            except:
                forind=[i for i,j in enumerate(splitspec) if j=='commencing'][0]
                numyears=None
                yearcom=splitspec[forind+2]
            yeardue=str(x[6])[-4:]
            reason=x[7]
            data = [filename,county,district,vote1,vote2,mills,votetype,numyears,yearcom,yeardue,reason]
            print "data other", data
        openfile=csv.writer(open('out.csv','a'),delimiter=',', quotechar='|',quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)

# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1] #the file is the first argument
f=open(filename,'r')
contents=f.read() #entire csv as string

#find index of every instance of the word county
separators=[m.start() for m in re.finditer('\w+\sCOUNTY',contents)] #alternative implementation in regex

# split contents into sections by county
# analyze each section and append to out.csv
for x,y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district=x[0],
    vote1=x[1], ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules. The idea is that you'll examine a county's data, step through a bunch of functions until one matches the pattern; this will also create the named tuple. You want something like this.
for p in ( some, list, of, functions ):
    match = p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
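A minimal sketch of that dispatch idea, with hypothetical field names rather than the real layout of the tax rows:
from collections import namedtuple

# Hypothetical record; the real fields live in analyze() above
DistrictRecord = namedtuple('DistrictRecord', 'district vote1 votemil numyears')

def match_long_row(x):
    # Matches the longer row shape, where a year count is present at index 6
    if len(x) > 8:
        return DistrictRecord(district=x[0], vote1=x[1], votemil=x[2], numyears=x[6])
    return None

def match_short_row(x):
    # Matches the shorter row shape, which has no explicit year count
    if len(x) > 5:
        return DistrictRecord(district=x[0], vote1=x[1], votemil=x[2], numyears=None)
    return None

def parse_row(x, matchers=(match_long_row, match_short_row)):
    # Try each matcher in turn; the first one that likes the row wins
    for p in matchers:
        match = p(x)
        if match:
            return match
    return None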
