Parsing XML column in Pyspark dataframe - python

I'm relatively new to PySpark and trying to solve a data problem. I have a PySpark DF, created with data extracted from MS SQL Server, having 2 columns: ID (Integer) and XMLMsg (String). The second column, XMLMsg, contains data in XML format.
The goal is to parse the XMLMsg column and create additional columns in the same DF with the extracted columns from the XML.
Following is a sample structure of the pyspark DF:
ID XMLMsg
101 ...<a><b>name1</b><c>loc1</c></a>...<d>dept1</d>...
102 ...<a><b>name2</b><c>loc2</c></a>...<d>dept2</d>...
103 ...<a><b>name3</b><c>loc3</c></a>...<d>dept3</d>...
Expected output is:
ID XMLMsg b c d
101 ...<a><b>name1</b><c>loc1</c></a>...<d>dept1</d>... name1 loc1 dept1
102 ...<a><b>name2</b><c>loc2</c></a>...<d>dept2</d>... name2 loc2 dept2
103 ...<a><b>name3</b><c>loc3</c></a>...<d>dept3</d>... name3 loc3 dept3
I tried a few suggestions based on my search on SO; however, I could not achieve the expected result. Hence, reaching out for some help and directions. Thanks for your time.

I finally solved this using a lambda and a UDF, considering I had to get the text from 4 nodes in a huge XML file. Since the XML is already in a column of the PySpark DataFrame, I didn't want to write it out as files and parse the whole XML again. I also wanted to avoid using an XSD schema.
The actual XML has multiple namespaces and also some nodes with specific conditions.
Example:
<ap:applicationproduct xmlns:xsi="http://www.example.com/2005/XMLSchema-instance" xmlns:ap="http://example.com/productinfo/1_6" xmlns:ct="http://example.com/commontypes/1_0" xmlns:dc="http://example.com/datacontent/1_0" xmlns:tp="http://aexample.com/prmvalue/1_0" ....." schemaVersion="..">
<ap:ParameterInfo>
<ap:Header>
<ct:Version>1.0</ct:Version>
<ct:Sender>ABC</ct:Sender>
<ct:SenderVersion />
<ct:SendTime>...</ct:SendTime>
</ap:Header>
<ap:ProductID>
<ct:Model>
<ct:Series>34AP</ct:Series>
<ct:ModelNo>013780</ct:ModelNo>
..............
..............
<ap:Object>
<ap:Parameter schemaVersion="2.5" Code="DDA">
<dc:Value>
<tp:Blob>mbQAEAgBTgKQEBAX4KJJYABAIASL0AA==</tp:Blob>
</dc:Value>
</ap:Parameter>
.........
........
Here I need to extract the values from ct:ModelNo and tp:Blob.
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import xml.etree.ElementTree as ET

# Namespace prefixes used in the XPath expressions:
ns = {'ap': 'http://example.com/productinfo/1_6',
      'ct': 'http://example.com/commontypes/1_0',
      'dc': 'http://example.com/datacontent/1_0',
      'tp': 'http://aexample.com/prmvalue/1_0'}

# Parse the XML string and pull out the text of ct:ModelNo
parsed_model = lambda x: ET.fromstring(x).find(
    'ap:ParameterInfo/ap:ProductID/ct:Model/ct:ModelNo', ns).text
udf_model = udf(parsed_model)
parsed_model_df = df.withColumn('ModelNo', udf_model('XMLMsg'))
A similar function can be written for the node with the blob value; the path to that node would be (note the @ attribute predicate):
'ap:ParameterInfo/ap:Object/ap:Parameter[@Code="DDA"]/dc:Value/tp:Blob'
This worked for me and I was able to add the required values in the pyspark DF. Any suggestions are welcome to make it better though. Thank you!
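For reference, the blob extraction can be sketched the same way in plain Python (a sketch only; the namespace URIs are the ones from the example above, and the function would be wrapped with udf(...) just like parsed_model):

```python
import xml.etree.ElementTree as ET

ns = {'ap': 'http://example.com/productinfo/1_6',
      'ct': 'http://example.com/commontypes/1_0',
      'dc': 'http://example.com/datacontent/1_0',
      'tp': 'http://aexample.com/prmvalue/1_0'}

def parse_blob(xml_string):
    # Locate the ap:Parameter node whose Code attribute is "DDA",
    # then descend to the tp:Blob text.
    root = ET.fromstring(xml_string)
    node = root.find(
        'ap:ParameterInfo/ap:Object/ap:Parameter[@Code="DDA"]/dc:Value/tp:Blob',
        ns)
    return node.text if node is not None else None

# On the DataFrame, same pattern as for ModelNo:
# udf_blob = udf(parse_blob)
# parsed_df = df.withColumn('Blob', udf_blob('XMLMsg'))
```

Returning None when the node is missing keeps the UDF from failing on rows whose XML lacks the DDA parameter.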

Related

Can I loop the same analysis across multiple csv dataframes then concatenate results from each into one table?

newbie python learner here!
I have 20 participant CSV files (P01.csv to P20.csv) containing Stroop test data. The important columns in each are the condition column, which has a random mix of incongruent and congruent conditions, the reaction time column for each condition, and the column for whether the response was correct (True or False).
Here is an example of the dataframe for P01 (I'm not sure if this counts as a code snippet):
trialnum,colourtext,colourname,condition,response,rt,correct
1,blue,red,incongruent,red,0.767041,True
2,yellow,yellow,congruent,yellow,0.647259,True
3,green,blue,incongruent,blue,0.990185,True
4,green,green,congruent,green,0.720116,True
5,yellow,yellow,congruent,yellow,0.562909,True
6,yellow,yellow,congruent,yellow,0.538918,True
7,green,yellow,incongruent,yellow,0.693017,True
8,yellow,red,incongruent,red,0.679368,True
9,yellow,blue,incongruent,blue,0.951432,True
10,blue,blue,congruent,blue,0.633367,True
11,blue,green,incongruent,green,1.289047,True
12,green,green,congruent,green,0.668142,True
13,blue,red,incongruent,red,0.647722,True
14,red,blue,incongruent,blue,0.858307,True
15,red,red,congruent,red,1.820112,True
16,blue,green,incongruent,green,1.118404,True
17,red,red,congruent,red,0.798532,True
18,red,red,congruent,red,0.470939,True
19,red,blue,incongruent,blue,1.142712,True
20,red,red,congruent,red,0.656328,True
21,red,yellow,incongruent,yellow,0.978830,True
22,green,red,incongruent,red,1.316182,True
23,yellow,yellow,congruent,green,0.964292,False
24,green,green,congruent,green,0.683949,True
25,yellow,green,incongruent,green,0.583939,True
26,green,blue,incongruent,blue,1.474140,True
27,green,blue,incongruent,blue,0.569109,True
28,green,green,congruent,blue,1.196470,False
29,red,red,congruent,red,4.027546,True
30,blue,blue,congruent,blue,0.833177,True
31,red,red,congruent,red,1.019672,True
32,green,blue,incongruent,blue,0.879507,True
33,red,red,congruent,red,0.579254,True
34,red,blue,incongruent,blue,1.070518,True
35,blue,yellow,incongruent,yellow,0.723852,True
36,yellow,green,incongruent,green,0.978838,True
37,blue,blue,congruent,blue,1.038232,True
38,yellow,green,incongruent,yellow,1.366425,False
39,green,red,incongruent,red,1.066038,True
40,blue,red,incongruent,red,0.693698,True
41,red,blue,incongruent,blue,1.751062,True
42,blue,blue,congruent,blue,0.449651,True
43,green,red,incongruent,red,1.082267,True
44,blue,blue,congruent,blue,0.551023,True
45,red,blue,incongruent,blue,1.012258,True
46,yellow,green,incongruent,yellow,0.801443,False
47,blue,blue,congruent,blue,0.664119,True
48,red,green,incongruent,yellow,0.716189,False
49,green,green,congruent,yellow,0.630552,False
50,green,yellow,incongruent,yellow,0.721917,True
51,red,red,congruent,red,1.153943,True
52,blue,red,incongruent,red,0.571019,True
53,yellow,yellow,congruent,yellow,0.651611,True
54,blue,blue,congruent,blue,1.321344,True
55,green,green,congruent,green,1.159240,True
56,blue,blue,congruent,blue,0.861646,True
57,yellow,red,incongruent,red,0.793069,True
58,yellow,yellow,congruent,yellow,0.673190,True
59,yellow,red,incongruent,red,1.049320,True
60,red,yellow,incongruent,yellow,0.773447,True
61,red,yellow,incongruent,yellow,0.693554,True
62,red,red,congruent,red,0.933901,True
63,blue,blue,congruent,blue,0.726794,True
64,green,green,congruent,green,1.046116,True
65,blue,blue,congruent,blue,0.713565,True
66,blue,blue,congruent,blue,0.494177,True
67,green,green,congruent,green,0.626399,True
68,blue,blue,congruent,blue,0.711896,True
69,blue,blue,congruent,blue,0.460420,True
70,green,green,congruent,yellow,1.711978,False
71,blue,blue,congruent,blue,0.634218,True
72,yellow,blue,incongruent,yellow,0.632482,False
73,yellow,yellow,congruent,yellow,0.653813,True
74,green,green,congruent,green,0.808987,True
75,blue,blue,congruent,blue,0.647117,True
76,green,red,incongruent,red,1.791693,True
77,red,yellow,incongruent,yellow,1.482570,True
78,red,red,congruent,red,0.693132,True
79,red,yellow,incongruent,yellow,0.815830,True
80,green,green,congruent,green,0.614441,True
81,yellow,red,incongruent,red,1.080385,True
82,red,green,incongruent,green,1.198548,True
83,blue,green,incongruent,green,0.845769,True
84,yellow,blue,incongruent,blue,1.007089,True
85,green,blue,incongruent,blue,0.488701,True
86,green,green,congruent,yellow,1.858272,False
87,yellow,yellow,congruent,yellow,0.893149,True
88,yellow,yellow,congruent,yellow,0.569597,True
89,yellow,yellow,congruent,yellow,0.483542,True
90,yellow,red,incongruent,red,1.669842,True
91,blue,green,incongruent,green,1.158416,True
92,blue,red,incongruent,red,1.853055,True
93,green,yellow,incongruent,yellow,1.023785,True
94,yellow,blue,incongruent,blue,0.955395,True
95,yellow,yellow,congruent,yellow,1.303260,True
96,blue,yellow,incongruent,yellow,0.737741,True
97,yellow,green,incongruent,green,0.730972,True
98,green,red,incongruent,red,1.564596,True
99,yellow,yellow,congruent,yellow,0.978911,True
100,blue,yellow,incongruent,yellow,0.508151,True
101,red,green,incongruent,green,1.821969,True
102,red,red,congruent,red,0.818726,True
103,yellow,yellow,congruent,yellow,1.268222,True
104,yellow,yellow,congruent,yellow,0.585495,True
105,green,green,congruent,green,0.673404,True
106,blue,yellow,incongruent,yellow,1.407036,True
107,red,red,congruent,red,0.701050,True
108,red,green,incongruent,red,0.402334,False
109,red,green,incongruent,green,1.537681,True
110,green,yellow,incongruent,yellow,0.675118,True
111,green,green,congruent,green,1.004550,True
112,yellow,blue,incongruent,blue,0.627439,True
113,yellow,yellow,congruent,yellow,1.150248,True
114,blue,yellow,incongruent,yellow,0.774452,True
115,red,red,congruent,red,0.860966,True
116,red,red,congruent,red,0.499595,True
117,green,green,congruent,green,1.059725,True
118,red,red,congruent,red,0.593180,True
119,green,yellow,incongruent,yellow,0.855915,True
120,blue,green,incongruent,green,1.335018,True
But I am only interested in the 'condition', 'rt', and 'correct' columns.
I need to create a table that gives the mean reaction time for the congruent and the incongruent conditions, and the percentage correct for each condition, as an overall table of these results for each participant. I am aiming for an output table like this:
Participant  Stimulus Type  Mean Reaction Time  Percentage Correct
01           Congruent      0.560966            80
01           Incongruent    0.890556            64
02           Congruent      0.460576            89
02           Incongruent    0.956556            55
Etc. for all 20 participants. This was just an example of my ideal output because later I'd like to plot a graph of the means from each condition across the participants. But if anyone thinks that table does not make sense or is inefficient, I'm open to any advice!
I want to use pandas but don't know where to begin with finding the rt means for each condition when there are two different conditions in the same column of each dataframe. And I'm assuming I need some kind of loop that can run over each participant CSV file and then concatenate the results into one table for all the participants?
Initially, after struggling to figure out the loop I would need and looking on the web, I ran this code, which worked to concatenate all of the participants' dataframes. I hoped it would let me do the same analysis on all of them at once, but the problem is that it doesn't identify the individual participant for each row (there are 120 rows per participant, like the example above) in the combined table:
import os
import glob
import pandas as pd
#set working directory
os.chdir('data')
#find all csv files in the folder
#use glob pattern matching -> extension = 'csv'
#save result in list -> all_filenames
extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
#print(all_filenames)
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_csv.csv", index=False, encoding='utf-8-sig')
Perhaps I could do something to add a participant column to identify each participant's data set in the concatenated table and then perform the mean and percentage correct analysis on the two conditions for each participant in that big concatenated table?
Or would it be better to do the analysis and then loop it over all of the individual participant csv files of dataframes?
I'm sorry if this is a really obvious process; I'm new to Python and trying to learn to analyse my data more efficiently, and I've been scouring the internet and pandas tutorials but I'm stuck. Any help is welcome! I've also never used Stack Overflow before, so sorry if I haven't formatted things correctly here, but thanks for the feedback about including examples of the input data, the code I've tried, and the desired output data. I really appreciate the help.
Try this:
import pandas as pd
from pathlib import Path

# Use the Path class to represent a path. It offers more
# functionality when performing operations on paths
path = Path("./data").resolve()

# Create a dictionary whose keys are the participant IDs
# (the `01` in `P01.csv`, etc), and whose values are
# the data frames initialized from the CSV files
data = {
    p.stem[1:]: pd.read_csv(p) for p in path.glob("*.csv")
}

# Create a master data frame by combining the individual
# data frames from each CSV file
df = pd.concat(data, names=["participant", None])

# Calculate the statistics
result = (
    df.groupby(["participant", "condition"]).agg(**{
        "Mean Reaction Time": ("rt", "mean"),
        "correct": ("correct", "sum"),
        "size": ("trialnum", "size"),
    }).assign(**{
        # correct answers / group size, as a percentage
        "Percentage Correct": lambda x: 100 * x["correct"] / x["size"]
    }).drop(columns=["correct", "size"])
    .reset_index()
)
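For completeness, the "add a participant column" idea from the question would also work; a minimal sketch, assuming the P01.csv to P20.csv naming convention (the participant ID is sliced from the filename):

```python
import glob
import pandas as pd

def summarise_participants(pattern='P*.csv'):
    frames = []
    for f in sorted(glob.glob(pattern)):
        df = pd.read_csv(f)
        # "P01.csv" -> "01": participant ID taken from the filename
        df['participant'] = f[1:3]
        frames.append(df)
    combined = pd.concat(frames, ignore_index=True)
    summary = (combined.groupby(['participant', 'condition'])
               .agg(mean_rt=('rt', 'mean'),
                    pct_correct=('correct', 'mean')))
    # `correct` is boolean, so its mean is the fraction correct
    summary['pct_correct'] *= 100
    return summary.reset_index()
```

Both approaches give the same per-participant, per-condition table; the concat-with-keys version above just avoids the explicit loop.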

Convert heavily nested json file into R/Python dataframe

I have found numerous similar questions on Stack Overflow; however, one issue remains unsolved for me. I have a heavily nested .json file that I need to import and convert into an R or Python data.frame to work with. The JSON contains lists (usually empty, but sometimes containing data). Example of the JSON's structure:
I use R's library jsonlite and Python's pandas.
# R
jsonlite::fromJSON(json_file, flatten = TRUE)
# or
jsonlite::read_json(json_file, simplifyVector = TRUE)

# Python
import json
import pandas as pd

with open("json_file.json", encoding="utf-8") as f:
    data = json.load(f)
pd.json_normalize(data)
Generally, in both cases it works. The output looks like a normal data.frame; however, the problem is that some columns of the new data.frame contain embedded lists (I am not sure whether "embedded lists" is the correct and clear term). It seems that both pandas and jsonlite combined each list into a single column, which is clearly seen in the screenshots below.
(screenshots of the R and Python outputs)
As you can see, a column such as wymagania.wymaganiaKonieczne.wyksztalcenia is nothing but a vector containing a combined/embedded list, i.e. the content of a list has been combined into a single column.
As the desired output, I want each element of such lists split out into its own column of the data.frame. In other words, I want a normal, tidy data.frame without any nested data.frames or lists. Both R and Python code are appreciated.
Minimum reproducible example:
[
{
"warunkiPracyIPlacy":{"miejscePracy":"abc","rodzajObowiazkow":"abc","zakresObowiazkow":"abc","rodzajZatrudnienia":"abc","kodRodzajuZatrudnienia":"abc","zmianowosc":"abc"},
"wymagania":{
"wymaganiaKonieczne":{
"zawody":[],
"wyksztalcenia":["abc"],
"wyksztalceniaSzczegoly":[{"kodPoziomuWyksztalcenia":"RPs002|WY","kodTypuWyksztalcenia":"abc"}],
"jezyki":[],
"jezykiSzczegoly":[],
"uprawnienia":[]},
"wymaganiaPozadane":{
"zawody":[],
"zawodySzczegoly":[],
"staze":[]},
"wymaganiaDodatkowe":{"zawody":[],"zawodySzczegoly":[]},
"inneWymagania":"abc"
},
"danePracodawcy":{"pracodawca":"abc","nip":"abc","regon":"abc","branza":null},
"pozostaleDane":{"identyfikatorOferty":"abc","ofertaZgloszonaPrzez":"abc","ofertaZgloszonaPrzezKodJednostki":"abc"},
"typOferty":"abc",
"typOfertyNaglowek":"abc",
"rodzajOferty":["DLA_ZAREJESTROWANYCH"],"staz":false,"link":false}
]
This is an answer for Python. It is not very elegant, but I think it will do for your purpose.
I have called your example file nested_json.json
import json
import pandas as pd

json_file = "nested_json.json"
with open(json_file, encoding="utf-8") as f:
    data = json.load(f)

df = pd.json_normalize(data)
# Explode list-valued columns into one row per list element
df_exploded = df.apply(lambda x: x.explode()).reset_index(drop=True)

# Check, based on the first row, which columns hold dicts
columns_dict = df_exploded.columns[
    df_exploded.apply(lambda x: isinstance(x.iloc[0], dict))]

# Split each dict column and append the result to the dataframe
for col in columns_dict:
    df_split_dict = df_exploded[col].apply(pd.Series)
    df_exploded = pd.concat([df_exploded, df_split_dict], axis=1)
This leads to a rectangular dataframe
>>> df_exploded.T
0
typOferty abc
typOfertyNaglowek abc
rodzajOferty DLA_ZAREJESTROWANYCH
staz False
link False
warunkiPracyIPlacy.miejscePracy abc
warunkiPracyIPlacy.rodzajObowiazkow abc
warunkiPracyIPlacy.zakresObowiazkow abc
warunkiPracyIPlacy.rodzajZatrudnienia abc
warunkiPracyIPlacy.kodRodzajuZatrudnienia abc
warunkiPracyIPlacy.zmianowosc abc
wymagania.wymaganiaKonieczne.zawody NaN
wymagania.wymaganiaKonieczne.wyksztalcenia abc
wymagania.wymaganiaKonieczne.wyksztalceniaSzcze... {'kodPoziomuWyksztalcenia': 'RPs002|WY', 'kodT...
wymagania.wymaganiaKonieczne.jezyki NaN
wymagania.wymaganiaKonieczne.jezykiSzczegoly NaN
wymagania.wymaganiaKonieczne.uprawnienia NaN
wymagania.wymaganiaPozadane.zawody NaN
wymagania.wymaganiaPozadane.zawodySzczegoly NaN
wymagania.wymaganiaPozadane.staze NaN
wymagania.wymaganiaDodatkowe.zawody NaN
wymagania.wymaganiaDodatkowe.zawodySzczegoly NaN
wymagania.inneWymagania abc
danePracodawcy.pracodawca abc
danePracodawcy.nip abc
danePracodawcy.regon abc
danePracodawcy.branza None
pozostaleDane.identyfikatorOferty abc
pozostaleDane.ofertaZgloszonaPrzez abc
pozostaleDane.ofertaZgloszonaPrzezKodJednostki abc
kodPoziomuWyksztalcenia RPs002|WY
kodTypuWyksztalcenia abc

Python JSON to a dataframe

I am using a Yahoo Finance Python library to grab accounting financial data to do some basic analysis. All of the financial statement data comes in JSON format. I want the data in a tabular format, as I typically see in a Python dataframe. However, there are several wrappers around the data, and I'm not sure how to remove them so that I can get my data into a simple rows-and-columns dataframe. Here is what the JSON looks like:
{
"incomeStatementHistory":{
"F":[
{
"2019-12-31":{
"researchDevelopment":"None",
"effectOfAccountingCharges":"None",
"incomeBeforeTax":-640000000,
"minorityInterest":45000000,
"netIncome":47000000,
"sellingGeneralAdministrative":10218000000,
"grossProfit":12876000000,
"ebit":2658000000,
"operatingIncome":2658000000,
"otherOperatingExpenses":"None",
"interestExpense":-1049000000,
"extraordinaryItems":"None",
You don't have the full response, so it's difficult to tell if this will be what you want:
import pandas as pd

d = {
"incomeStatementHistory":{
"F":[
{
"2019-12-31":{
"researchDevelopment":"None",
"effectOfAccountingCharges":"None",
"incomeBeforeTax":-640000000,
"minorityInterest":45000000,
"netIncome":47000000,
"sellingGeneralAdministrative":10218000000,
"grossProfit":12876000000,
"ebit":2658000000,
"operatingIncome":2658000000,
"otherOperatingExpenses":"None",
"interestExpense":-1049000000,
"extraordinaryItems":"None",}}]}}
pd.json_normalize(d['incomeStatementHistory']['F'])
Output:
2019-12-31.researchDevelopment 2019-12-31.effectOfAccountingCharges 2019-12-31.incomeBeforeTax ... 2019-12-31.otherOperatingExpenses 2019-12-31.interestExpense 2019-12-31.extraordinaryItems
0 None None -640000000 ... None -1049000000 None
[1 rows x 12 columns]
You should use pandas. Here is a tutorial on how to do that with pandas, and you could also check this related question.

Converting complex XML file to Pandas dataframe/CSV - Python

I'm currently in the middle of converting a complex XML file to csv or pandas df.
I have zero experience with the XML data format, and all the code suggestions I found online are just not working for me. Can anyone kindly help me with this?
There are lots of elements in the data that I do not need so I won't include those here.
For privacy reasons I won't be uploading the original data here but I'll be sharing what the structure looks like.
<RefData>
<Attributes>
<Id>1011</Id>
<FullName>xxxx</FullName>
<ShortName>xx</ShortName>
<Country>UK</Country>
<Currency>GBP</Currency>
</Attributes>
<PolicyID>000</PolicyID>
<TradeDetails>
<UniqueTradeId>000</UniqueTradeId>
<Booking>UK</Booking>
<Date>12/2/2019</Date>
</TradeDetails>
</RefData>
<RefData>
<Attributes>
<Id>1012</Id>
<FullName>xxx2</FullName>
<ShortName>x2</ShortName>
<Country>UK</Country>
<Currency>GBP</Currency>
</Attributes>
<PolicyID>002</PolicyID>
<TradeDetails>
<UniqueTradeId>0022</UniqueTradeId>
<Booking>UK</Booking>
<Date>12/3/2019</Date>
</TradeDetails>
</RefData>
I would be needing everything in the RefData tag.
Ideally I want the headers and output to look like this:
I would sincerely appreciate any help I can get on this. Thanks a mil.
One correction concerning your input XML file: it has to contain a single main element (of any name) with your RefData elements inside it.
So the input file actually contains:
<Main>
<RefData>
...
</RefData>
<RefData>
...
</RefData>
</Main>
To process the input XML file, you can use lxml package, so to import
it start from:
from lxml import etree as et
Then I noticed that you actually don't need the whole parsed XML tree,
so the usual scheme is to:
read the content of each element as soon as it has been parsed,
save the content (text) of its child elements in an intermediate
data structure (I chose a list of dictionaries),
drop the source XML element (not needed any more),
and after the reading loop, create the result DataFrame from the
intermediate data structure.
So my code looks like below:
import pandas as pd

rows = []
for _, elem in et.iterparse('RefData.xml', tag='RefData'):
rows.append({'id': elem.findtext('Attributes/Id'),
'fullname': elem.findtext('Attributes/FullName'),
'shortname': elem.findtext('Attributes/ShortName'),
'country': elem.findtext('Attributes/Country'),
'currency': elem.findtext('Attributes/Currency'),
'Policy ID': elem.findtext('PolicyID'),
'UniqueTradeId': elem.findtext('TradeDetails/UniqueTradeId'),
'Booking': elem.findtext('TradeDetails/Booking'),
'Date': elem.findtext('TradeDetails/Date')
})
elem.clear()
elem.getparent().remove(elem)
df = pd.DataFrame(rows)
To fully understand the details, search the web for descriptions of lxml and each method used.
For your sample data the result is:
id fullname shortname country currency Policy ID UniqueTradeId Booking Date
0 1011 xxxx xx UK GBP 000 000 UK 12/2/2019
1 1012 xxx2 x2 UK GBP 002 0022 UK 12/3/2019
Probably the last step to perform is to save the above DataFrame in a CSV
file, but I suppose you know how to do it.
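For completeness, that last step might look like this (a sketch; the DataFrame literal stands in for the parsed rows, and RefData.csv is an arbitrary output name):

```python
import pandas as pd

# Stand-in for the DataFrame built above from the parsed rows
df = pd.DataFrame([{'id': '1011', 'fullname': 'xxxx', 'country': 'UK'}])

# index=False drops the numeric row index from the CSV
df.to_csv('RefData.csv', index=False)
```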
Another way to do it, using lxml and xpath:
import pandas as pd
from lxml import etree

dat = """[your FIXED xml]"""
doc = etree.fromstring(dat)
columns = []
rows = []
to_delete = ["TradeDetails",'Attributes']
body = doc.xpath('.//RefData')
for el in body[0].xpath('.//*'):
columns.append(el.tag)
for b in body:
items = b.xpath('.//*')
row = []
for item in items:
if item.tag not in to_delete:
row.append(item.text)
rows.append(row)
for col in to_delete:
if col in columns:
columns.remove(col)
pd.DataFrame(rows,columns=columns)
Output is the dataframe indicated in your question.

convert comment (list) to dataframe ,pandas

I have a big list of names that I want to keep in my interpreter, so I would prefer not to use CSV files.
The only way I can store it in my interpreter as a variable, using copy-paste from my original file, is as a "comment" (a triple-quoted string), so my input looks like this:
temp='''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''
My goal is to convert this "comment" inside my interpreter to a dataframe.
I tried df=pd.DataFrame([temp]) and also converting to a Series using a comment with only one column, but without success. Any ideas?
My real data has hundreds of lines.
Use:
import pandas as pd
from io import StringIO
temp=u'''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''
df = pd.read_csv(StringIO(temp))
print (df)
A B C
0 adam dorothy ben
1 luis cristy hoover
