Parsing XML column in Pyspark dataframe - python
I'm relatively new to PySpark and trying to solve a data problem. I have a PySpark DF, created from data extracted from MS SQL Server, with 2 columns: ID (Integer) and XMLMsg (String). The second column, XMLMsg, contains data in XML format.
The goal is to parse the XMLMsg column and create additional columns in the same DF with the extracted columns from the XML.
Following is a sample structure of the pyspark DF:
ID XMLMsg
101 ...<a><b>name1</b><c>loc1</c></a>...<d>dept1</d>...
102 ...<a><b>name2</b><c>loc2</c></a>...<d>dept2</d>...
103 ...<a><b>name3</b><c>loc3</c></a>...<d>dept3</d>...
Expected output is:
ID XMLMsg b c d
101 ...<a><b>name1</b><c>loc1</c></a>...<d>dept1</d>... name1 loc1 dept1
102 ...<a><b>name2</b><c>loc2</c></a>...<d>dept2</d>... name2 loc2 dept2
103 ...<a><b>name3</b><c>loc3</c></a>...<d>dept3</d>... name3 loc3 dept3
I tried a few suggestions based on my search on SO; however, I could not achieve the expected result. Hence, I'm reaching out for some help and direction. Thanks for your time.
I finally solved this using a lambda and a UDF, considering I had to get the text from 4 nodes in a huge XML file. Since the XML documents are already in a column of the PySpark DataFrame, I didn't want to write them out as files and parse the whole XML again. I also wanted to avoid using an XSD schema.
The actual XML has multiple namespaces and also some nodes with specific conditions.
Example:
<ap:applicationproduct xmlns:xsi="http://www.example.com/2005/XMLSchema-instance"
    xmlns:ap="http://example.com/productinfo/1_6"
    xmlns:ct="http://example.com/commontypes/1_0"
    xmlns:dc="http://example.com/datacontent/1_0"
    xmlns:tp="http://aexample.com/prmvalue/1_0"
    ... schemaVersion="..">
  <ap:ParameterInfo>
    <ap:Header>
      <ct:Version>1.0</ct:Version>
      <ct:Sender>ABC</ct:Sender>
      <ct:SenderVersion />
      <ct:SendTime>...</ct:SendTime>
    </ap:Header>
    <ap:ProductID>
      <ct:Model>
        <ct:Series>34AP</ct:Series>
        <ct:ModelNo>013780</ct:ModelNo>
        ..............
        ..............
    <ap:Object>
      <ap:Parameter schemaVersion="2.5" Code="DDA">
        <dc:Value>
          <tp:Blob>mbQAEAgBTgKQEBAX4KJJYABAIASL0AA==</tp:Blob>
        </dc:Value>
      </ap:Parameter>
      .........
      ........
Here I need to extract the values from ct:ModelNo and tp:Blob
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import xml.etree.ElementTree as ET

# Dictionary of namespaces used in the XPath expressions:
ns = {'ap': 'http://example.com/productinfo/1_6',
      'ct': 'http://example.com/commontypes/1_0',
      'dc': 'http://example.com/datacontent/1_0',
      'tp': 'http://aexample.com/prmvalue/1_0'
      }

# Parse the XML string and return the text of the ct:ModelNo node
parsed_model = (lambda x: ET.fromstring(x)
                .find('ap:ParameterInfo/ap:ProductID/ct:Model/ct:ModelNo', ns)
                .text)
udf_model = udf(parsed_model)
parsed_model_df = df.withColumn('ModelNo', udf_model('XMLMsg'))
A similar function can be written for the node with the blob value, but the path to that node would be:
'ap:ParameterInfo/ap:Object/ap:Parameter[@Code="DDA"]/dc:Value/tp:Blob'
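For the blob, a minimal sketch along the same lines (assuming the same df, ns dictionary, and imports as above; the variable and column names are just examples):

# Sketch: same pattern as ModelNo, with the attribute-filtered path
parsed_blob = (lambda x: ET.fromstring(x)
               .find('ap:ParameterInfo/ap:Object/ap:Parameter[@Code="DDA"]/dc:Value/tp:Blob', ns)
               .text)
udf_blob = udf(parsed_blob)
parsed_blob_df = parsed_model_df.withColumn('Blob', udf_blob('XMLMsg'))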
This worked for me, and I was able to add the required values to the PySpark DF. Any suggestions to make it better are welcome, though. Thank you!
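On the "make it better" point, one hedged suggestion: find() returns None when a node is missing, so a null-safe extractor with an explicit return type keeps the UDF from raising on malformed or incomplete rows. A sketch only, assuming the same ET import and ns dictionary as above:

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def extract_text(path):
    # Return a null-safe extractor for the text of the node at `path`
    def _extract(xml):
        if xml is None:
            return None
        node = ET.fromstring(xml).find(path, ns)
        return node.text if node is not None else None
    return _extract

udf_model_safe = udf(extract_text('ap:ParameterInfo/ap:ProductID/ct:Model/ct:ModelNo'),
                     StringType())
df_safe = df.withColumn('ModelNo', udf_model_safe('XMLMsg'))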
Related
Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?
newbie python learner here! I have 20 participant csv files (P01.csv to P20.csv) with dataframes in them that contain stroop test data. The important columns for each are the condition column, which has a random mix of incongruent and congruent conditions, the reaction time column for each condition, and the column for whether the response was correct, true or false. Here is an example of the dataframe for P01 (I'm not sure if this counts as a code snippet?):

trialnum,colourtext,colourname,condition,response,rt,correct
1,blue,red,incongruent,red,0.767041,True
2,yellow,yellow,congruent,yellow,0.647259,True
3,green,blue,incongruent,blue,0.990185,True
4,green,green,congruent,green,0.720116,True
5,yellow,yellow,congruent,yellow,0.562909,True
6,yellow,yellow,congruent,yellow,0.538918,True
7,green,yellow,incongruent,yellow,0.693017,True
8,yellow,red,incongruent,red,0.679368,True
9,yellow,blue,incongruent,blue,0.951432,True
10,blue,blue,congruent,blue,0.633367,True
11,blue,green,incongruent,green,1.289047,True
12,green,green,congruent,green,0.668142,True
13,blue,red,incongruent,red,0.647722,True
14,red,blue,incongruent,blue,0.858307,True
15,red,red,congruent,red,1.820112,True
16,blue,green,incongruent,green,1.118404,True
17,red,red,congruent,red,0.798532,True
18,red,red,congruent,red,0.470939,True
19,red,blue,incongruent,blue,1.142712,True
20,red,red,congruent,red,0.656328,True
21,red,yellow,incongruent,yellow,0.978830,True
22,green,red,incongruent,red,1.316182,True
23,yellow,yellow,congruent,green,0.964292,False
24,green,green,congruent,green,0.683949,True
25,yellow,green,incongruent,green,0.583939,True
26,green,blue,incongruent,blue,1.474140,True
27,green,blue,incongruent,blue,0.569109,True
28,green,green,congruent,blue,1.196470,False
29,red,red,congruent,red,4.027546,True
30,blue,blue,congruent,blue,0.833177,True
31,red,red,congruent,red,1.019672,True
32,green,blue,incongruent,blue,0.879507,True
33,red,red,congruent,red,0.579254,True
34,red,blue,incongruent,blue,1.070518,True
35,blue,yellow,incongruent,yellow,0.723852,True
36,yellow,green,incongruent,green,0.978838,True
37,blue,blue,congruent,blue,1.038232,True
38,yellow,green,incongruent,yellow,1.366425,False
39,green,red,incongruent,red,1.066038,True
40,blue,red,incongruent,red,0.693698,True
41,red,blue,incongruent,blue,1.751062,True
42,blue,blue,congruent,blue,0.449651,True
43,green,red,incongruent,red,1.082267,True
44,blue,blue,congruent,blue,0.551023,True
45,red,blue,incongruent,blue,1.012258,True
46,yellow,green,incongruent,yellow,0.801443,False
47,blue,blue,congruent,blue,0.664119,True
48,red,green,incongruent,yellow,0.716189,False
49,green,green,congruent,yellow,0.630552,False
50,green,yellow,incongruent,yellow,0.721917,True
51,red,red,congruent,red,1.153943,True
52,blue,red,incongruent,red,0.571019,True
53,yellow,yellow,congruent,yellow,0.651611,True
54,blue,blue,congruent,blue,1.321344,True
55,green,green,congruent,green,1.159240,True
56,blue,blue,congruent,blue,0.861646,True
57,yellow,red,incongruent,red,0.793069,True
58,yellow,yellow,congruent,yellow,0.673190,True
59,yellow,red,incongruent,red,1.049320,True
60,red,yellow,incongruent,yellow,0.773447,True
61,red,yellow,incongruent,yellow,0.693554,True
62,red,red,congruent,red,0.933901,True
63,blue,blue,congruent,blue,0.726794,True
64,green,green,congruent,green,1.046116,True
65,blue,blue,congruent,blue,0.713565,True
66,blue,blue,congruent,blue,0.494177,True
67,green,green,congruent,green,0.626399,True
68,blue,blue,congruent,blue,0.711896,True
69,blue,blue,congruent,blue,0.460420,True
70,green,green,congruent,yellow,1.711978,False
71,blue,blue,congruent,blue,0.634218,True
72,yellow,blue,incongruent,yellow,0.632482,False
73,yellow,yellow,congruent,yellow,0.653813,True
74,green,green,congruent,green,0.808987,True
75,blue,blue,congruent,blue,0.647117,True
76,green,red,incongruent,red,1.791693,True
77,red,yellow,incongruent,yellow,1.482570,True
78,red,red,congruent,red,0.693132,True
79,red,yellow,incongruent,yellow,0.815830,True
80,green,green,congruent,green,0.614441,True
81,yellow,red,incongruent,red,1.080385,True
82,red,green,incongruent,green,1.198548,True
83,blue,green,incongruent,green,0.845769,True
84,yellow,blue,incongruent,blue,1.007089,True
85,green,blue,incongruent,blue,0.488701,True
86,green,green,congruent,yellow,1.858272,False
87,yellow,yellow,congruent,yellow,0.893149,True
88,yellow,yellow,congruent,yellow,0.569597,True
89,yellow,yellow,congruent,yellow,0.483542,True
90,yellow,red,incongruent,red,1.669842,True
91,blue,green,incongruent,green,1.158416,True
92,blue,red,incongruent,red,1.853055,True
93,green,yellow,incongruent,yellow,1.023785,True
94,yellow,blue,incongruent,blue,0.955395,True
95,yellow,yellow,congruent,yellow,1.303260,True
96,blue,yellow,incongruent,yellow,0.737741,True
97,yellow,green,incongruent,green,0.730972,True
98,green,red,incongruent,red,1.564596,True
99,yellow,yellow,congruent,yellow,0.978911,True
100,blue,yellow,incongruent,yellow,0.508151,True
101,red,green,incongruent,green,1.821969,True
102,red,red,congruent,red,0.818726,True
103,yellow,yellow,congruent,yellow,1.268222,True
104,yellow,yellow,congruent,yellow,0.585495,True
105,green,green,congruent,green,0.673404,True
106,blue,yellow,incongruent,yellow,1.407036,True
107,red,red,congruent,red,0.701050,True
108,red,green,incongruent,red,0.402334,False
109,red,green,incongruent,green,1.537681,True
110,green,yellow,incongruent,yellow,0.675118,True
111,green,green,congruent,green,1.004550,True
112,yellow,blue,incongruent,blue,0.627439,True
113,yellow,yellow,congruent,yellow,1.150248,True
114,blue,yellow,incongruent,yellow,0.774452,True
115,red,red,congruent,red,0.860966,True
116,red,red,congruent,red,0.499595,True
117,green,green,congruent,green,1.059725,True
118,red,red,congruent,red,0.593180,True
119,green,yellow,incongruent,yellow,0.855915,True
120,blue,green,incongruent,green,1.335018,True

But I am only interested in the 'condition', 'rt', and 'correct' columns. I need to create a table that gives the mean reaction time for the congruent conditions and the incongruent conditions, and the percentage correct for each condition. But I want to create an overall table of these results for each participant. I am aiming to get something like this as an output table:

Participant  Stimulus Type  Mean Reaction Time  Percentage Correct
01           Congruent      0.560966            80
01           Incongruent    0.890556            64
02           Congruent      0.460576            89
02           Incongruent    0.956556            55

Etc. for all 20 participants. This was just an example of my ideal output, because later I'd like to plot a graph of the means from each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice! I want to use pandas but don't know where to begin finding the rt means for each condition when there are two different conditions in the same column in each dataframe. And I'm assuming I need to do it in some kind of loop that can run over each participant csv file, and then concatenate the results in a table for all the participants?
Initially, after struggling to figure out the loop I would need and looking on the web, I ran this code, which worked to concatenate all of the dataframes of the participants. I hoped this would help me to do the same analysis on all of them at once, but the problem is that it doesn't identify the individual participants for each of the rows from each participant csv file (there are 120 rows for each participant, like the example I give above) that I had put into one table:

import os
import glob
import pandas as pd

# set working directory
os.chdir('data')

# find all csv files in the folder
# use glob pattern matching -> extension = 'csv'
# save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)

# combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])

# export to csv
combined_csv.to_csv("combined_csv.csv", index=False, encoding='utf-8-sig')

Perhaps I could do something to add a participant column to identify each participant's data set in the concatenated table, and then perform the mean and percentage-correct analysis on the two conditions for each participant in that big concatenated table? Or would it be better to do the analysis and then loop it over all of the individual participant csv files of dataframes? I'm sorry if this is a really obvious process; I'm new to python and trying to learn to analyse my data more efficiently. I have been scouring the Internet and pandas tutorials but I'm stuck. Any help is welcome! I've also never used Stackoverflow before, so sorry if I haven't formatted things correctly here, but thanks for the feedback about including examples of the input data, code I've tried, and desired output data. I really appreciate the help.
Try this:

from pathlib import Path
import pandas as pd

# Use the Path class to represent a path. It offers more
# functionality when performing operations on paths
path = Path("./data").resolve()

# Create a dictionary whose keys are the participant IDs
# (the `01` in `P01.csv`, etc.) and whose values are the
# data frames initialized from the CSV
data = {p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")}

# Create a master data frame by combining the individual
# data frames from each CSV file
df = pd.concat(data, keys=data.keys(), names=["participant", None])

# Calculate the statistics
result = (
    df.groupby(["participant", "condition"]).agg(**{
        "Mean Reaction Time": ("rt", "mean"),
        "correct": ("correct", "sum"),
        "size": ("trialnum", "size")
    }).assign(**{
        "Percentage Correct": lambda x: x["correct"] / x["size"]
    }).drop(columns=["correct", "size"])
    .reset_index()
)
Convert heavily nested json file into R/Python dataframe
I have found numerous similar questions on Stack Overflow; however, one issue remains unsolved for me. I have a heavily nested ".json" file that I need to import and convert into an R or Python data.frame to work with. The JSON file contains lists inside (usually empty, but sometimes containing data).

[screenshot of the JSON's structure omitted]

I use R's jsonlite library and Python's pandas:

# R
jsonlite::fromJSON(json_file, flatten = TRUE)
# or
jsonlite::read_json(json_file, simplifyVector = T)

# Python
with open("json_file.json", encoding="utf-8") as f:
    data = json.load(f)
pd.json_normalize(data)

Generally, in both cases it works. The output looks like a normal data.frame; however, the problem is that some columns of the new data.frame contain embedded lists (I am not sure whether "embedded lists" is correct and clear). It seems that both pandas and jsonlite combined each list into a single column, which is clearly seen in the screenshots below.

[R and Python output screenshots omitted]

As you might see, some columns, such as wymagania.wymaganiaKonieczne.wyksztalcenia, are nothing but a vector containing a combined/embedded list, i.e. the content of a list has been combined into a single column. As the desired output I want to split each element of such lists into its own column of the data.frame. In other words, I want to obtain a normal, "tidy" data.frame without any nested data.frames or lists. Both R and Python code are appreciated.

Minimum reproducible example:

[
  {
    "warunkiPracyIPlacy": {"miejscePracy": "abc", "rodzajObowiazkow": "abc", "zakresObowiazkow": "abc", "rodzajZatrudnienia": "abc", "kodRodzajuZatrudnienia": "abc", "zmianowosc": "abc"},
    "wymagania": {
      "wymaganiaKonieczne": {
        "zawody": [],
        "wyksztalcenia": ["abc"],
        "wyksztalceniaSzczegoly": [{"kodPoziomuWyksztalcenia": "RPs002|WY", "kodTypuWyksztalcenia": "abc"}],
        "jezyki": [],
        "jezykiSzczegoly": [],
        "uprawnienia": []
      },
      "wymaganiaPozadane": {
        "zawody": [],
        "zawodySzczegoly": [],
        "staze": []
      },
      "wymaganiaDodatkowe": {"zawody": [], "zawodySzczegoly": []},
      "inneWymagania": "abc"
    },
    "danePracodawcy": {"pracodawca": "abc", "nip": "abc", "regon": "abc", "branza": null},
    "pozostaleDane": {"identyfikatorOferty": "abc", "ofertaZgloszonaPrzez": "abc", "ofertaZgloszonaPrzezKodJednostki": "abc"},
    "typOferty": "abc",
    "typOfertyNaglowek": "abc",
    "rodzajOferty": ["DLA_ZAREJESTROWANYCH"],
    "staz": false,
    "link": false
  }
]
This is an answer for Python. It is not very elegant, but I think it will do for your purpose. I have called your example file nested_json.json.

import json
import pandas as pd

json_file = "nested_json.json"
with open(json_file, encoding="utf-8") as f:
    data = json.load(f)

df = pd.json_normalize(data)
df_exploded = df.apply(lambda x: x.explode()).reset_index(drop=True)

# check, based on the first row, which columns hold dicts
columns_dict = df_exploded.columns[df_exploded.apply(lambda x: isinstance(x[0], dict))]

# append the split-up dicts to the dataframe
for col in columns_dict:
    df_splitted_dict = df_exploded[col].apply(pd.Series)
    df_exploded = pd.concat([df_exploded, df_splitted_dict], axis=1)

This leads to a rectangular dataframe:

>>> df_exploded.T
                                                                                                  0
typOferty                                                                                       abc
typOfertyNaglowek                                                                               abc
rodzajOferty                                                               DLA_ZAREJESTROWANYCH
staz                                                                                          False
link                                                                                          False
warunkiPracyIPlacy.miejscePracy                                                                 abc
warunkiPracyIPlacy.rodzajObowiazkow                                                             abc
warunkiPracyIPlacy.zakresObowiazkow                                                             abc
warunkiPracyIPlacy.rodzajZatrudnienia                                                           abc
warunkiPracyIPlacy.kodRodzajuZatrudnienia                                                       abc
warunkiPracyIPlacy.zmianowosc                                                                   abc
wymagania.wymaganiaKonieczne.zawody                                                             NaN
wymagania.wymaganiaKonieczne.wyksztalcenia                                                      abc
wymagania.wymaganiaKonieczne.wyksztalceniaSzcze...  {'kodPoziomuWyksztalcenia': 'RPs002|WY', 'kodT...
wymagania.wymaganiaKonieczne.jezyki                                                             NaN
wymagania.wymaganiaKonieczne.jezykiSzczegoly                                                    NaN
wymagania.wymaganiaKonieczne.uprawnienia                                                        NaN
wymagania.wymaganiaPozadane.zawody                                                              NaN
wymagania.wymaganiaPozadane.zawodySzczegoly                                                     NaN
wymagania.wymaganiaPozadane.staze                                                               NaN
wymagania.wymaganiaDodatkowe.zawody                                                             NaN
wymagania.wymaganiaDodatkowe.zawodySzczegoly                                                    NaN
wymagania.inneWymagania                                                                         abc
danePracodawcy.pracodawca                                                                       abc
danePracodawcy.nip                                                                              abc
danePracodawcy.regon                                                                            abc
danePracodawcy.branza                                                                          None
pozostaleDane.identyfikatorOferty                                                               abc
pozostaleDane.ofertaZgloszonaPrzez                                                              abc
pozostaleDane.ofertaZgloszonaPrzezKodJednostki                                                  abc
kodPoziomuWyksztalcenia                                                                   RPs002|WY
kodTypuWyksztalcenia                                                                            abc
Python JSON to a dataframe
I am using a Yahoo finance Python library to grab accounting financial data to do some basic analysis. All of the financial statement data comes in JSON format. I want the data to be in a tabular format, as I typically see in a Python dataframe. There are several wrappers around the data, and I'm not sure how to remove those so that I can get my data into a simple rows-and-columns dataframe. Here is what the data looks like in Python:

{
   "incomeStatementHistory":{
      "F":[
         {
            "2019-12-31":{
               "researchDevelopment":"None",
               "effectOfAccountingCharges":"None",
               "incomeBeforeTax":-640000000,
               "minorityInterest":45000000,
               "netIncome":47000000,
               "sellingGeneralAdministrative":10218000000,
               "grossProfit":12876000000,
               "ebit":2658000000,
               "operatingIncome":2658000000,
               "otherOperatingExpenses":"None",
               "interestExpense":-1049000000,
               "extraordinaryItems":"None",
You don't have the full response, so it's difficult to tell if this will be what you want:

import pandas as pd

d = {
    "incomeStatementHistory": {
        "F": [
            {
                "2019-12-31": {
                    "researchDevelopment": "None",
                    "effectOfAccountingCharges": "None",
                    "incomeBeforeTax": -640000000,
                    "minorityInterest": 45000000,
                    "netIncome": 47000000,
                    "sellingGeneralAdministrative": 10218000000,
                    "grossProfit": 12876000000,
                    "ebit": 2658000000,
                    "operatingIncome": 2658000000,
                    "otherOperatingExpenses": "None",
                    "interestExpense": -1049000000,
                    "extraordinaryItems": "None",}}]}}

pd.json_normalize(d['incomeStatementHistory']['F'])

Output:

  2019-12-31.researchDevelopment 2019-12-31.effectOfAccountingCharges  2019-12-31.incomeBeforeTax  ...  2019-12-31.otherOperatingExpenses  2019-12-31.interestExpense 2019-12-31.extraordinaryItems
0                           None                                 None                  -640000000  ...                               None                 -1049000000                          None

[1 rows x 12 columns]
You should use pandas. Here is a tutorial on how to do that with pandas. Also, you could check this question.
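As a minimal sketch of the pandas route (assuming the truncated d dictionary from the earlier answer), one way to unpack the wrappers is to take the first statement period and turn its line items into rows:

import pandas as pd

# Sketch only: `d` is assumed to be the (truncated) dict from the answer above.
statement = d['incomeStatementHistory']['F'][0]   # {'2019-12-31': {...}}
date, items = next(iter(statement.items()))       # unwrap the single date key
df = pd.DataFrame(list(items.items()), columns=['lineItem', date])
print(df.head())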
Converting complex XML file to Pandas dataframe/CSV - Python
I'm currently in the middle of converting a complex XML file to csv or a pandas df. I have zero experience with the XML data format, and all the code suggestions I found online are just not working for me. Can anyone kindly help me with this? There are lots of elements in the data that I do not need, so I won't include those here. For privacy reasons I won't be uploading the original data here, but I'll be sharing what the structure looks like:

<RefData>
  <Attributes>
    <Id>1011</Id>
    <FullName>xxxx</FullName>
    <ShortName>xx</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>000</PolicyID>
  <TradeDetails>
    <UniqueTradeId>000</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/2/2019</Date>
  </TradeDetails>
</RefData>
<RefData>
  <Attributes>
    <Id>1012</Id>
    <FullName>xxx2</FullName>
    <ShortName>x2</ShortName>
    <Country>UK</Country>
    <Currency>GBP</Currency>
  </Attributes>
  <PolicyID>002</PolicyID>
  <TradeDetails>
    <UniqueTradeId>0022</UniqueTradeId>
    <Booking>UK</Booking>
    <Date>12/3/2019</Date>
  </TradeDetails>
</RefData>

I would be needing everything in the RefData tag. Ideally I want the headers and output to look like this:

[table screenshot omitted]

I would sincerely appreciate any help I can get on this. Thanks a mil.
One correction concerning your input XML file: it has to contain a single main element (of any name) and, within it, your RefData elements. So the input file actually contains:

<Main>
  <RefData>
    ...
  </RefData>
  <RefData>
    ...
  </RefData>
</Main>

To process the input XML file, you can use the lxml package, so to import it start from:

from lxml import etree as et
import pandas as pd

Then I noticed that you actually don't need the whole parsed XML tree, so the usually applied scheme is to:

- read the content of each element as soon as it has been parsed,
- save the content (text) of its child elements in an intermediate data structure (I chose a list of dictionaries),
- drop the source XML element (not needed any more),
- after the reading loop, create the result DataFrame from the above intermediate data structure.

So my code looks like below:

rows = []
for _, elem in et.iterparse('RefData.xml', tag='RefData'):
    rows.append({'id': elem.findtext('Attributes/Id'),
                 'fullname': elem.findtext('Attributes/FullName'),
                 'shortname': elem.findtext('Attributes/ShortName'),
                 'country': elem.findtext('Attributes/Country'),
                 'currency': elem.findtext('Attributes/Currency'),
                 'Policy ID': elem.findtext('PolicyID'),
                 'UniqueTradeId': elem.findtext('TradeDetails/UniqueTradeId'),
                 'Booking': elem.findtext('TradeDetails/Booking'),
                 'Date': elem.findtext('TradeDetails/Date')
                })
    elem.clear()
    elem.getparent().remove(elem)

df = pd.DataFrame(rows)

To fully comprehend the details, search the Web for a description of lxml and of each method used.

For your sample data the result is:

     id fullname shortname country currency Policy ID UniqueTradeId Booking       Date
0  1011     xxxx        xx      UK      GBP       000           000      UK  12/2/2019
1  1012     xxx2        x2      UK      GBP       002          0022      UK  12/3/2019

Probably the last step to perform is to save the above DataFrame in a CSV file, but I suppose you know how to do it.
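For completeness, that last step could be a single call; the filename here is just an example:

# Save the result; the output path is hypothetical
df.to_csv('RefData.csv', index=False)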
Another way to do it, using lxml and XPath:

from lxml import etree
import pandas as pd

dat = """[your FIXED xml]"""
doc = etree.fromstring(dat)

columns = []
rows = []
to_delete = ["TradeDetails", "Attributes"]
body = doc.xpath('.//RefData')

# collect the column names from the first RefData element
for el in body[0].xpath('.//*'):
    columns.append(el.tag)

# collect one row of leaf-element texts per RefData element
for b in body:
    items = b.xpath('.//*')
    row = []
    for item in items:
        if item.tag not in to_delete:
            row.append(item.text)
    rows.append(row)

# drop the container tags from the column list
for col in to_delete:
    if col in columns:
        columns.remove(col)

pd.DataFrame(rows, columns=columns)

Output is the dataframe indicated in your question.
convert comment (list) to dataframe, pandas
I have a big list of names that I want to keep in my interpreter, so I would like not to use csv files. The only way I can store it in my interpreter as a variable, using copy-paste from my original file, is as a 'comment', so my input looks like this:

temp='''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''

My goal is to convert this 'comment' inside my interpreter to a dataframe. I tried df=pd.DataFrame([temp]) and also converting to a series using only one column in the comment, but without success. Any idea? My real data has hundreds of lines.
Use:

import pandas as pd
from io import StringIO

temp = u'''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''

df = pd.read_csv(StringIO(temp))
print(df)

      A        B       C
0  adam  dorothy     ben
1  luis   cristy  hoover