Converting complex XML file to Pandas dataframe/CSV - Python

I'm currently in the middle of converting a complex XML file to csv or pandas df.
I have zero experience with xml data format and all the code suggestions I found online are just not working for me. Can anyone kindly help me with this?
There are lots of elements in the data that I do not need so I won't include those here.
For privacy reasons I won't be uploading the original data here but I'll be sharing what the structure looks like.
<RefData>
<Attributes>
<Id>1011</Id>
<FullName>xxxx</FullName>
<ShortName>xx</ShortName>
<Country>UK</Country>
<Currency>GBP</Currency>
</Attributes>
<PolicyID>000</PolicyID>
<TradeDetails>
<UniqueTradeId>000</UniqueTradeId>
<Booking>UK</Booking>
<Date>12/2/2019</Date>
</TradeDetails>
</RefData>
<RefData>
<Attributes>
<Id>1012</Id>
<FullName>xxx2</FullName>
<ShortName>x2</ShortName>
<Country>UK</Country>
<Currency>GBP</Currency>
</Attributes>
<PolicyID>002</PolicyID>
<TradeDetails>
<UniqueTradeId>0022</UniqueTradeId>
<Booking>UK</Booking>
<Date>12/3/2019</Date>
</TradeDetails>
</RefData>
I need everything inside each RefData tag.
Ideally I want the headers and output to look like this:
I would sincerely appreciate any help I can get on this. Thanks a mil.

One correction concerning your input XML file: it must contain a single main (root) element, of any name, with your RefData elements inside it.
So the input file actually contains:
<Main>
<RefData>
...
</RefData>
<RefData>
...
</RefData>
</Main>
To process the input XML file you can use the lxml package, so start by importing it:
from lxml import etree as et
Then I noticed that you actually don't need the whole parsed XML tree,
so the usually applied scheme is to:
read the content of each element as soon as it has been parsed,
save the content (text) of any child elements in any intermediate
data structure (I chose a list of dictionaries),
drop the source XML element (not needed any more),
after the reading loop, create the result DataFrame from the above
intermediate data structure.
So my code looks like below:
import pandas as pd
from lxml import etree as et

rows = []
for _, elem in et.iterparse('RefData.xml', tag='RefData'):
    rows.append({'id': elem.findtext('Attributes/Id'),
                 'fullname': elem.findtext('Attributes/FullName'),
                 'shortname': elem.findtext('Attributes/ShortName'),
                 'country': elem.findtext('Attributes/Country'),
                 'currency': elem.findtext('Attributes/Currency'),
                 'Policy ID': elem.findtext('PolicyID'),
                 'UniqueTradeId': elem.findtext('TradeDetails/UniqueTradeId'),
                 'Booking': elem.findtext('TradeDetails/Booking'),
                 'Date': elem.findtext('TradeDetails/Date')
                 })
    # The element has been read, so drop it to keep memory usage low
    elem.clear()
    elem.getparent().remove(elem)

df = pd.DataFrame(rows)
To fully understand the details, see the lxml documentation for each method used.
For your sample data the result is:
id fullname shortname country currency Policy ID UniqueTradeId Booking Date
0 1011 xxxx xx UK GBP 000 000 UK 12/2/2019
1 1012 xxx2 x2 UK GBP 002 0022 UK 12/3/2019
Probably the last step to perform is to save the above DataFrame in a CSV
file, but I suppose you know how to do it.
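For completeness, a minimal sketch of that last step (the output filename here is just an example):
df.to_csv('RefData.csv', index=False)  # index=False leaves out the numeric row index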

Another way to do it, using lxml and xpath:
from lxml import etree
import pandas as pd

dat = """[your FIXED xml]"""
doc = etree.fromstring(dat)

columns = []
rows = []
to_delete = ["TradeDetails", "Attributes"]

body = doc.xpath('.//RefData')
# Build the column list from the first RefData element
for el in body[0].xpath('.//*'):
    columns.append(el.tag)

for b in body:
    items = b.xpath('.//*')
    row = []
    for item in items:
        if item.tag not in to_delete:
            row.append(item.text)
    rows.append(row)

# The container tags carry no values of their own, so drop them from the columns too
for col in to_delete:
    if col in columns:
        columns.remove(col)

pd.DataFrame(rows, columns=columns)
Output is the dataframe indicated in your question.

Related

Parsing XML column in Pyspark dataframe

I'm relatively new to PySpark and trying to solve a data problem. I have a pyspark DF, created with data extracted from MS SQL Server, having 2 columns: ID (Integer) and XMLMsg (String). The 2nd column, XMLMsg contains data in XML format.
The goal is to parse the XMLMsg column and create additional columns in the same DF with the extracted columns from the XML.
Following is a sample structure of the pyspark DF:
ID XMLMsg
101 ...<a><b>name1</b><c>loc1</c></a>...<d>dept1</d>...
102 ...<a><b>name2</b><c>loc2</c></a>...<d>dept2</d>...
103 ...<a><b>name3</b><c>loc3</c></a>...<d>dept3</d>...
Expected output is:
ID XMLMsg b c d
101 ...<a><b>name1</b><c>loc1</c></a>...<d>dept1</d>... name1 loc1 dept1
102 ...<a><b>name2</b><c>loc2</c></a>...<d>dept2</d>... name2 loc2 dept2
103 ...<a><b>name3</b><c>loc3</c></a>...<d>dept3</d>... name3 loc3 dept3
I tried a few suggestions based on my search on SO; however, I could not achieve the expected result. Hence, I'm reaching out for some help and directions. Thanks for your time.
I finally solved this using a lambda and a UDF, since I had to get the text from 4 nodes in a huge XML file. Since the XML is already in a column of the PySpark DataFrame, I didn't want to write it out as files and parse the whole XML again. I also wanted to avoid using an XSD schema.
The actual XML has multiple namespaces and also some nodes with specific conditions.
Example:
<ap:applicationproduct xmlns:xsi="http://www.example.com/2005/XMLSchema-instance" xmlns:ap="http://example.com/productinfo/1_6" xmlns:ct="http://example.com/commontypes/1_0" xmlns:dc="http://example.com/datacontent/1_0" xmlns:tp="http://aexample.com/prmvalue/1_0" ....." schemaVersion="..">
<ap:ParameterInfo>
<ap:Header>
<ct:Version>1.0</ct:Version>
<ct:Sender>ABC</ct:Sender>
<ct:SenderVersion />
<ct:SendTime>...</ct:SendTime>
</ap:Header>
<ap:ProductID>
<ct:Model>
<ct:Series>34AP</ct:Series>
<ct:ModelNo>013780</ct:ModelNo>
..............
..............
<ap:Object>
<ap:Parameter schemaVersion="2.5" Code="DDA">
<dc:Value>
<tp:Blob>mbQAEAgBTgKQEBAX4KJJYABAIASL0AA==</tp:Blob>
</dc:Value>
</ap:Parameter>
.........
........
Here I need to extract the values from ct:ModelNo and tp:Blob
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import xml.etree.ElementTree as ET

# Dictionary of namespaces to be used:
ns = {'ap': 'http://example.com/productinfo/1_6',
      'ct': 'http://example.com/commontypes/1_0',
      'dc': 'http://example.com/datacontent/1_0',
      'tp': 'http://aexample.com/prmvalue/1_0'}

# Parse the XML string and return the ct:ModelNo text
parsed_model = (lambda x: ET.fromstring(x)
                .find('ap:ParameterInfo/ap:ProductID/ct:Model/ct:ModelNo', ns).text)
udf_model = udf(parsed_model)

parsed_model_df = df.withColumn('ModelNo', udf_model('XMLMsg'))
A similar function can be written for the node with the blob value; the path to that node would be:
'ap:ParameterInfo/ap:Object/ap:Parameter[@Code="DDA"]/dc:Value/tp:Blob'
This worked for me and I was able to add the required values to the PySpark DF. Any suggestions to make it better are welcome, though. Thank you!
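For reference, a minimal sketch of what that second UDF could look like, assuming the same ns dictionary and DataFrame as above (the 'Blob' column name is just an example):
# Extract the tp:Blob text for the ap:Parameter whose Code attribute is "DDA"
parsed_blob = (lambda x: ET.fromstring(x)
               .find('ap:ParameterInfo/ap:Object/ap:Parameter[@Code="DDA"]/dc:Value/tp:Blob', ns).text)
udf_blob = udf(parsed_blob)

parsed_blob_df = parsed_model_df.withColumn('Blob', udf_blob('XMLMsg'))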

How to Automatically select data on webpage and download the resulting xls file using Python

I am new to Python. I am trying to scrape the data on the page:
For example:
Category: grains
Organic: No
Commodity: Coarse
SubCommodity: Corn
Publications: Daily
Location: All
Refine Commodity: All
Dates: 07/31/2018 - 08/01/2019
Is there a way in which Python can select these options on the webpage, click Run, and then click Download as Excel and store the Excel file?
Is it possible? I am new to coding and need some guidance here.
Currently what I have done is enter the data manually, and on the resulting page I used Beautiful Soup to scrape the table. However, it takes a lot of time since the table is spread over more than 200 pages.
Using the query you defined as an example, I input the query manually and found the following URL for the Excel (really HTML) format:
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=excel'
In the URL are parameters we can set in Python, and we could easily make a loop to change them. For now, let me just get into the example of actually getting this data. I use pandas.read_html to read this HTML result and populate a DataFrame, which can be thought of as a table with columns and rows.
import pandas as pd
# use URL defined earlier
# url = '...'
df_lst = pd.read_html(url, header=1)
Now df_lst is a list of DataFrame objects containing the desired data. For your particular example, this results in 30674 rows and 11 columns:
>>> df_lst[0].columns
Index([u'Report Date', u'Location', u'Class', u'Variety', u'Grade Description',
u'Units', u'Transmode', u'Low', u'High', u'Pricing Point',
u'Delivery Period'],
dtype='object')
>>> df_lst[0].head()
Report Date Location Class Variety Grade Description Units Transmode Low High Pricing Point Delivery Period
0 07/31/2018 Blytheville, AR YELLOW NaN US NO 2 Bushel Truck 3.84 3.84 Country Elevators Cash
1 07/31/2018 Helena, AR YELLOW NaN US NO 2 Bushel Truck 3.76 3.76 Country Elevators Cash
2 07/31/2018 Little Rock, AR YELLOW NaN US NO 2 Bushel Truck 3.74 3.74 Mills and Processors Cash
3 07/31/2018 Pine Bluff, AR YELLOW NaN US NO 2 Bushel Truck 3.67 3.67 Country Elevators Cash
4 07/31/2018 Denver, CO YELLOW NaN US NO 2 Bushel Truck-Rail 3.72 3.72 Terminal Elevators Cash
>>> df_lst[0].shape
(30674, 11)
Now, back to the point I made about the URL parameters--using Python, we can run through lists and format the URL string to our liking. For instance, iterating through 20 years of the given query can be done by modifying the URL to have numbers corresponding to positional arguments in Python's str.format() method. Here's a full example below:
import datetime
import pandas as pd
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    #print(df_lst[0].head())  # uncomment to see first five rows
    #print(df_lst[0].shape)   # uncomment to see DataFrame shape
Be careful with pd.read_html. I've modified my answer with a header keyword argument to pd.read_html() because the multi-indexing made it a pain to get results. By giving a single row index as the header, it's no longer a multi-index, and data indexing is easy. For instance, I can get corn class using this:
df_lst[0]['Class']
Compiling all the reports into one large file is also easy with Pandas. Since we have a DataFrame, we can use the pandas.to_csv function to export our data as a CSV (or any other file type you want, but I chose CSV for this example). Here's a modified version with the additional capability of outputting a CSV:
import datetime
import pandas as pd
# URL
url = 'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={1}&commDetail=All&repMonth=1&endDateWeekly=&repType=Daily&rtype=&fsize=&_use=1&use=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={0}&runReport=true&grade=&regionsDesc=&subprimals=&mscore=&endYear=2019&repDateWeekly=&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&organic=NO&category=Grain&_mscore=1&subComm=Corn&commodity=Coarse&_commDetail=1&_subprimals=1&cut=&endMonth=1&repDate={0}&endDate={1}&format=excel'
# CSV output file and flag
csv_out = 'myreports.csv'
flag = True
# Start and end dates
start = [datetime.date(2018-i, 7, 31) for i in range(20)]
end = [datetime.date(2019-i, 8, 1) for i in range(20)]
# Iterate through dates and get report from URL
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    df_lst = pd.read_html(url_get, header=1)
    print(df_lst[0].head())  # comment out to hide the first five rows
    print(df_lst[0].shape)   # comment out to hide the DataFrame shape
    # Save to big CSV
    if flag is True:
        # 0th iteration, so write header and overwrite existing file
        df_lst[0].to_csv(csv_out, header=True, mode='w')  # change mode to 'wb' if Python 2.7
        flag = False
    else:
        # Subsequent iterations should append to file and not add new header
        df_lst[0].to_csv(csv_out, header=False, mode='a')  # change mode to 'ab' if Python 2.7
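An alternative sketch, if you would rather build one DataFrame in memory and write the CSV only once at the end (same url, start, and end as above; pd.concat is the only new piece):
frames = []
for s, e in zip(start, end):
    url_get = url.format(s.strftime('%m/%d/%Y'), e.strftime('%m/%d/%Y'))
    frames.append(pd.read_html(url_get, header=1)[0])

# Stack all the per-period reports and write a single CSV
pd.concat(frames, ignore_index=True).to_csv('myreports.csv', index=False)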
Your particular query generates at least 1227 pages of data - so I just trimmed it down to one location - Arkansas (from 07/31/2018 - 08/01/2019) - which now generates 47 pages of data. The XML size was about 500 KB.
You can semi automate like this:
>>> end_day='01'
>>> start_day='31'
>>> start_month='07'
>>> end_month='08'
>>> start_year='2018'
>>> end_year='2019'
>>> link = f"https://marketnews.usda.gov/mnp/ls-report?&endDateGrain={end_month}%2F{end_day}%2F{end_year}&commDetail=All&endDateWeekly={end_month}%2F{end_day}%2F{end_year}&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain={start_month}%2F{start_day}%2F{start_year}+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear={end_year}&repDateWeekly={start_month}%2F{start_day}%2F{start_year}&_wrange=1&endDateWeeklyGrain=&repYear={end_year}&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate={start_month}%2F{start_day}%2F{start_year}&endDate={end_month}%2F{end_day}%2F{end_year}&format=xml"
>>> link
'https://marketnews.usda.gov/mnp/ls-report?&endDateGrain=08%2F01%2F2019&commDetail=All&endDateWeekly=08%2F01%2F2019&repMonth=1&repType=Daily&rtype=&use=&_use=1&fsize=&_fsize=1&byproducts=&run=Run&pricutvalue=&repDateGrain=07%2F31%2F2018+&runReport=true&grade=&regionsDesc=All+AR&subprimals=&mscore=&endYear=2019&repDateWeekly=07%2F31%2F2018&_wrange=1&endDateWeeklyGrain=&repYear=2019&loc=All+AR&_loc=1&wrange=&_grade=1&repDateWeeklyGrain=&_byproducts=1&category=Grain&organic=NO&commodity=Coarse&subComm=Corn&_mscore=0&_subprimals=1&_commDetail=1&cut=&endMonth=1&repDate=07%2F31%2F2018&endDate=08%2F01%2F2019&format=xml'
>>> import urllib.request
>>> with urllib.request.urlopen(link) as response:
...     html = response.read()
...
Loading the HTML could take a hot minute with large queries.
If you for some reason wish to process the entire data set, you can repeat this process - but you may want to look into techniques specifically suited to big data, perhaps a solution involving pandas and numexpr (for fast, parallelized evaluation).
You can find the data used in this answer here - which you can download as an xml.
First import your xml:
>>> import xml.etree.ElementTree as ET
you can either parse the content you downloaded from the website in Python
>>> report = ET.fromstring(html)
or manually
>>> tree = ET.parse('report.xml')
>>> report = tree.getroot()
you can then do stuff like this:
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> for el in report[0]:
...     print(el)
...
<Element 'reportDate' at 0x7f902adcf368>
<Element 'location' at 0x7f902ac814f8>
<Element 'classStr' at 0x7f902ac81548>
<Element 'variety' at 0x7f902ac81b88>
<Element 'grade' at 0x7f902ac29cc8>
<Element 'units' at 0x7f902ac29d18>
<Element 'transMode' at 0x7f902ac29d68>
<Element 'bidLevel' at 0x7f902ac29db8>
<Element 'deliveryPoint' at 0x7f902ac29ea8>
<Element 'deliveryPeriod' at 0x7f902ac29ef8>
More info on parsing xml is here.
You're going to want to learn some python - but hopefully you can make sense of the following. Luckily - there are many free python tutorials online - here is a quick snippet to get you started.
#lets find the lowest bid on a certain day
>>> report[0][0]
<Element 'reportDate' at 0x7f902adcf368>
>>> report[0][0].text
'07/31/2018'
>>> report[0][7]
<Element 'bidLevel' at 0x7f902ac29db8>
>>> report[0][7][0]
<Element 'lowPrice' at 0x7f902ac29e08>
>>> report[0][7][0].text
'3.84'
#how many low bids are there?
>>> len(report)
1216
#get an average of the lowest bids...
>>> low_bid_list = [float(bid[7][0].text) for bid in report]
>>> low_bid_list
[3.84, 3.76, 3.74, 3.67, 3.65, 3.7, 3.5, 3.7, 3.61,...]
>>> sum = 0
>>> for el in low_bid_list:
...     sum = sum + el
...
>>> sum
4602.599999999992
>>> sum/len(report)
3.7850328947368355
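As an aside, the built-in sum does the same in one line; note that the snippet above shadows the built-in by assigning to the name sum, so run this before that assignment (or after del sum):
>>> sum(low_bid_list) / len(report)
3.7850328947368355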

Change json formatting for pandas to_json(orient="records") method

I'm trying to change the format of my JSON file as shown below - is this possible through pandas? I've tried some regex operations, but when I use to_json(orient='records') together with .replace(regex=True) I get some very funky outputs (the [] turn into '[\"\"]'). Are there any alternatives? Thanks so much for your help. I've included one line from the million or so, with the personal information removed.
Some background info: The below data was scraped from my algolia database, read into pandas, and saved as a json file.
My actual json file (there are around a million of these kinds of rows)
[{"Unnamed: 0":37427,"email":null,"industry":"['']","category":"['help', 'motivation']","phone":null,"tags":"['U.S.']","twitter_bio":"I'm the freshest kid on the block."}]
My actual output
Unnamed: 0 category email industry phone tags twitter_bio
37427 ['help', 'motivation'] NaN [''] NaN ['U.S.'] I'm the freshest kid on the block.
Desired json file
[{"Unnamed: 0":37427,"email":null,"industry":[""],"category":["help", "motivation"],"phone":null,"tags":["U.S."],"twitter_bio":"I'm the freshest kid on the block."}]
Desired output
Unnamed: 0 category email industry phone tags twitter_bio
37427 [help, motivation] NaN [] NaN [U.S.] I'm the freshest kid on the block.
I'm assuming that what you are trying to do is convert your lists (which are originally just strings) into actual lists.
You can do some string manipulation to achieve that:
import json
import re
from pandas.io.json import json_normalize

json_file = 'C:/test.json'
jsonStr = open(json_file).read()

# Turn the quoted brackets into real JSON arrays
jsonStr = jsonStr.replace('"[', '[')
jsonStr = jsonStr.replace(']"', ']')
# Inside the brackets, swap single quotes for double quotes
jsonStr = re.sub(r"\[[^]]*\]", lambda x: x.group(0).replace("'", '"'), jsonStr)

jsonObj = json.loads(jsonStr)
df = json_normalize(jsonObj[0])
Output:
print (df.to_string())
Unnamed: 0 category email industry phone tags twitter_bio
0 37427 [help, motivation] None [] None [U.S.] I'm the freshest kid on the block.
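If you would rather avoid editing the raw JSON text, another sketch (assuming the list-like columns hold valid Python literals, which the single quotes suggest) is to load the file with pandas and convert those columns with ast.literal_eval:
import ast
import pandas as pd

df = pd.read_json('C:/test.json', orient='records')  # same example path as above
for col in ['industry', 'category', 'tags']:
    # Parse the string representation of each list into an actual list
    df[col] = df[col].apply(ast.literal_eval)

# to_json now emits real JSON arrays for those columns
print(df.to_json(orient='records'))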

Iterate over particular child nodes in XML and save to CSV using Python

I've looked through a bunch of similar questions and didn't see an answer that solved this specifically. I haven't worked with XML files in python before and am on a time crunch, so I'm probably just overlooking the obvious. I have a bunch of XML files that I need to just grab two values from, for each provider record in the file. I need to save those in a csv.
I have some code that's pulling more than I'm expecting ...
import xml.etree.ElementTree as ET
import csv
tree = ET.parse('xml/HSP-FullOutOfAreaSite03-DEC.xml')
root = tree.getroot()
for PROVIDER in root.iter('PROVIDER'):
    for PROV_IDENTIFIER in PROVIDER:
        print(PROV_IDENTIFIER.text)
    for TAXONOMY_CODE in PROVIDER:
        print(TAXONOMY_CODE.text)
The XML has a bunch of repeating PROVIDER_GROUPs, and for each provider in all the provider groups I need the provider's PROV_IDENTIFIER and TAXONOMY_CODE.
<PROVIDER_GROUP>
<MASTER_GROUP_CODE>345093845</MASTER_GROUP_CODE>
<TAX_ID>3095</TAX_ID>
<GROUPNUMBER>16</GROUPNUMBER>
<SITECOUNT>1</SITECOUNT>
<CONTRACTS>
<CONTRACT>
<EFF_DATE>2002-01-01</EFF_DATE>
</CONTRACT>
</CONTRACTS>
<PROVIDER_SITES>
<PROVIDER_SITE>
<PROV_MASTER_ID>18583783745</PROV_MASTER_ID>
<MASTER_GROUP_CODE>584293845</MASTER_GROUP_CODE>
<PROVIDERS>
<PROVIDER>
<PROVNO>123456</PROVNO>
<NAME_FIRST>John</NAME_FIRST>
<NAME_LAST>Doe</NAME_LAST>
<NAME_CREDENTIAL>DDD</NAME_CREDENTIAL>
<GENDER>M</GENDER>
<PROV_IDENTIFIER>3459832385</PROV_IDENTIFIER>
<TAXONOMIES>
<TAXONOMY>
<TAXONOMY_CODE>23498R98239X</TAXONOMY_CODE>
</TAXONOMY>
</TAXONOMIES>
<HOSPRELATIONS>
<HOSP>
<NPI>1366896300</NPI>
</HOSP>
</HOSPRELATIONS>
</PROVIDER>
<PROVIDER>
<PROVNO>123454</PROVNO>
<NAME_FIRST>Jane</NAME_FIRST>
<NAME_LAST>Doe</NAME_LAST>
<NAME_CREDENTIAL>DDD</NAME_CREDENTIAL>
<GENDER>F</GENDER>
<PROV_IDENTIFIER>3945092358</PROV_IDENTIFIER>
<TAXONOMIES>
<TAXONOMY>
<TAXONOMY_CODE>55598R98239X</TAXONOMY_CODE>
</TAXONOMY>
</TAXONOMIES>
<HOSPRELATIONS>
<HOSP>
<NPI>34598345030</NPI>
</HOSP>
</HOSPRELATIONS>
</PROVIDER>
</PROVIDERS>
</PROVIDER_SITE>
</PROVIDER_SITES>
</PROVIDER_GROUP>
<PROVIDER_GROUP>
<PROVIDER_SITES>
<PROVIDER_SITE>
<PROVIDERS>
<!-- MORE PROVIDERS -->
</PROVIDERS>
</PROVIDER_SITE>
</PROVIDER_SITES>
</PROVIDER_GROUP>
And I need a CSV that just looks like:
PROV_IDENTIFIER | TAXONOMY_CODE
---------------------------------
210985345098 | 234R345359X
310495345091 | 456R345359X
534581039568 | 567R345359X
802869458327 | 234R345359X
You can put the XML into bs4 and get them like this:
from bs4 import BeautifulSoup
import pandas as pd
with open('xml/HSP-FullOutOfAreaSite03-DEC.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

# Get the data you want
df = pd.DataFrame(list(zip(
    [el.text for el in soup.find_all('prov_identifier')],
    [el.text for el in soup.find_all('taxonomy_code')]
)), columns=['PROV_IDENTIFIER', 'TAXONOMY_CODE'])
# Dump to csv
df.to_csv('out.csv', index=False)
Here's a simple example so you can get an idea on how to proceed:
from xml.etree import ElementTree as ET
tree = ET.parse('xml/HSP-FullOutOfAreaSite03-DEC.xml')
providers = tree.findall(".//PROVIDERS/PROVIDER")
agg = [
    (p.find('./PROV_IDENTIFIER').text,
     [t.text for t in p.findall(".//TAXONOMY_CODE")])
    for p in providers]
If you run this against your XML sample you will get
[('3459832385', ['23498R98239X']), ('3945092358', ['55598R98239X'])]
The first element in the tuple will have the PROV_IDENTIFIER; the second element will be a list of all the nested TAXONOMY_CODE elements.
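Since the goal is a CSV, here is a minimal sketch of one way to flatten agg and write it out (one row per taxonomy code; the output filename is just an example):
import csv

with open('providers.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['PROV_IDENTIFIER', 'TAXONOMY_CODE'])
    for prov_id, taxonomy_codes in agg:
        for code in taxonomy_codes:
            writer.writerow([prov_id, code])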

From 10-K -- extract SIC, CIK, create metadata table

I am working with 10-Ks from Edgar. To assist in file management and data analysis, I would like to create a table containing the path to each file, the CIK number for the company filed (this is a unique ID issued by SEC), and the SIC industry code which it belongs to. Below is an image visually representing what I want to do.
The two things I want to extract are listed at the top of each document. The CIK # will always be a number which is listed after the phrase "CENTRAL INDEX KEY:". The SIC # will always be a number enclosed in brackets after "STANDARD INDUSTRIAL CLASSIFICATION" and then a description of that particular industry.
This is consistent across all filings.
To do's:
Loop through files: extract file path, CIK and SIC numbers -- with attention that I just get one return per document, and each result is in order, so my records between fields align.
Merge these fields together -- I am guessing the best way to do this is to extract each field into their own separate lists and then merge, maybe into a Pandas dataframe?
Ultimately I will be using this table to help me subset the data between SIC industries.
Thank you for taking a look. Please let me know if I can provide additional documentation.
Here is some code I just wrote for doing something similar. You can output the results to a CSV file. As the first step, you need to walk through the folder, get a list of all the 10-Ks, and iterate over it (see the sketch after the snippet below).
year_end = ""
sic = ""
with open(txtfile, 'r', encoding='utf-8', errors='replace') as rawfile:
for cnt, line in enumerate(rawfile):
#print(line)
if "CONFORMED PERIOD OF REPORT" in line:
year_end = line[-9:-1]
#print(year_end)
if "STANDARD INDUSTRIAL CLASSIFICATION" in line:
match = re.search(r"\d{4}", line)
if match:
sic = match.group(0)
#print(sic)
#print(sic)
if (year_end and sic) or cnt > 100:
#print(year_end, sic)
break
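The folder walk itself isn't shown above, so here is a hedged sketch of that outer loop building the metadata table the question asks for; the folder name, the output filename, and the CIK pattern (digits after "CENTRAL INDEX KEY:") are assumptions based on the question:
import os
import re
import pandas as pd

rows = []
for dirpath, _, filenames in os.walk('10-K'):  # folder containing the filings (assumed name)
    for name in filenames:
        path = os.path.join(dirpath, name)
        cik, sic = "", ""
        with open(path, 'r', encoding='utf-8', errors='replace') as rawfile:
            for cnt, line in enumerate(rawfile):
                if "CENTRAL INDEX KEY:" in line:
                    match = re.search(r"\d+", line)
                    if match:
                        cik = match.group(0)
                if "STANDARD INDUSTRIAL CLASSIFICATION" in line:
                    match = re.search(r"\d{4}", line)
                    if match:
                        sic = match.group(0)
                if (cik and sic) or cnt > 100:  # header fields appear near the top of each filing
                    break
        rows.append({'path': path, 'cik': cik, 'sic': sic})

df = pd.DataFrame(rows)
df.to_csv('metadata.csv', index=False)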
