Convert nested XML content into CSV using xml tree in python - python

I'm very new to Python, so please bear with me. When I tried to convert the XML content into a list of dictionaries, I got output, but not what I expected, even after a lot of experimenting.
XML Content
<project>
<data>
<row>
<respondent>m0wxo5f6w42h3fot34m7s6xij</respondent>
<timestamp>10-06-16 11:30</timestamp>
<product>1</product>
<replica>1</replica>
<seqnr>1</seqnr>
<session>1</session>
<column>
<question>Q1</question>
<answer>a1</answer>
</column>
<column>
<question>Q2</question>
<answer>a2</answer>
</column>
</row>
<row>
<respondent>w42h3fot34m7s6x</respondent>
<timestamp>10-06-16 11:30</timestamp>
<product>1</product>
<replica>1</replica>
<seqnr>1</seqnr>
<session>1</session>
<column>
<question>Q3</question>
<answer>a3</answer>
</column>
<column>
<question>Q4</question>
<answer>a4</answer>
</column>
<column>
<question>Q5</question>
<answer>a5</answer>
</column>
</row>
</data>
</project>
Code I have used:
import csv
import xml.etree.ElementTree as ET

tree = ET.parse('xml_file.xml') # import xml from
root = tree.getroot()
data_list = []
for item in root.find('./data'): # find all projects node
    data = {} # dictionary to store content of each projects
    for child in item:
        data[child.tag] = child.text # add item to dictionary
        #-----------------for loop with subchild is not working as expected in my case
        for subchild in child:
            data[subchild.tag] = subchild.text
    data_list.append(data)
print(data_list)
headers = {k for d in data_list for k in d.keys()} # headers for csv
with open(csv_file, 'w') as f:
    writer = csv.DictWriter(f, fieldnames=headers) # creating a DictWriter object
    writer.writeheader() # write headers to csv
    writer.writerows(data_list)
The resulting data_list only contains the last question of each row.
I guess the issue is in the subchild for loop, but I don't understand how to append one dictionary per column to the list.
[{
'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'column': '\n ,
'question': 'Q2',
'answer': 'a2'
},
{
'respondent': 'w42h3fot34m7s6x',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'column': '\n ,
'question': 'Q2',
'answer': 'a2'
}.......
]
I expect the output below. I tried a lot but was unable to loop over the column tag.
[{
'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q1',
'answer': 'a1'
},
{
'respondent': 'anonymous_m0wxo5f6w42h3fot34m7s6xij',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q2',
'answer': 'a2'
},
{
'respondent': 'w42h3fot34m7s6x',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q3',
'answer': 'a3'
},
{
'respondent': 'w42h3fot34m7s6x',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q4',
'answer': 'a4'
},
{
'respondent': 'w42h3fot34m7s6x',
'timestamp': '10-06-16 11:30',
'product': '1',
'replica': '1',
'seqnr': '1',
'session': '1',
'question': 'Q5',
'answer': 'a5'
}
]
I have referred to many Stack Overflow questions on XML trees, but none of them helped.
Any help or suggestion is appreciated.

I had a problem understanding what this code is supposed to do because it uses abstract variable names like item, child, subchild and this makes it hard to reason about the code. I'm not as clever as that, so I renamed the variables to row, tag, and column to make it easier for me to see what the code is doing. (In my book, even row and column are a bit abstract, but I suppose the opacity of the XML input is hardly your fault.)
You have 2 rows but you want 5 dictionaries, because you have 5 <column> tags and you want each <column>'s data in a separate dictionary. But you want the other tags in the <row> to be repeated along with each <column>'s data.
That means you need to build a dictionary for every <row>, then, for each <column>, add that column's data to the dictionary, then output it before going on to the next column.
This code makes the simplifying assumption that all of your <column>s have the same structure, with exactly one <question> and exactly one <answer> and nothing else. If this assumption does not hold, then a <column> may get reported with stale data it inherited from the previous <column> in the same row. It will also produce no output at all for any <row> that does not have at least one <column>.
The code has to loop through the tags twice, once for the non-<column>s and once for the <column>s. Otherwise it can't be sure it has seen all the non-<column> tags before it starts outputting the <column>s.
There are other (no doubt more elegant) ways to do this, but I kept the code structure as close to your original as I could, other than making the variable names less opaque.
for row in root.find('./data'): # find all projects node
    data = {} # dictionary to store content of each projects
    for tag in row:
        if tag.tag != "column":
            data[tag.tag] = tag.text # add row to dictionary
    # Now the dictionary data is built for the row level
    for tag in row:
        if tag.tag == "column":
            for column in tag:
                data[column.tag] = column.text
            # Now we have added the column level data for one column tag
            data_list.append(data.copy())
The output is below. The keys appear in alphabetical order rather than insertion order because I used pprint.pprint for convenience, and it sorts dictionary keys by default.
[{'answer': 'a1',
'product': '1',
'question': 'Q1',
'replica': '1',
'respondent': 'm0wxo5f6w42h3fot34m7s6xij',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'},
{'answer': 'a2',
'product': '1',
'question': 'Q2',
'replica': '1',
'respondent': 'm0wxo5f6w42h3fot34m7s6xij',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'},
{'answer': 'a3',
'product': '1',
'question': 'Q3',
'replica': '1',
'respondent': 'w42h3fot34m7s6x',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'},
{'answer': 'a4',
'product': '1',
'question': 'Q4',
'replica': '1',
'respondent': 'w42h3fot34m7s6x',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'},
{'answer': 'a5',
'product': '1',
'question': 'Q5',
'replica': '1',
'respondent': 'w42h3fot34m7s6x',
'seqnr': '1',
'session': '1',
'timestamp': '10-06-16 11:30'}]
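For completeness, here is a minimal self-contained sketch of the whole XML-to-CSV path, under a few assumptions: the XML is a shortened, made-up sample with the same shape as the one in the question, and the helper names flatten_rows and to_csv are hypothetical. Unlike the loop above, it builds a fresh dictionary for every <column>, so a column with a missing field would not inherit stale values from the previous one.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened, made-up sample with the same shape as the XML in the question.
XML_TEXT = """<project><data>
<row>
<respondent>r1</respondent><session>1</session>
<column><question>Q1</question><answer>a1</answer></column>
<column><question>Q2</question><answer>a2</answer></column>
</row>
<row>
<respondent>r2</respondent><session>1</session>
<column><question>Q3</question><answer>a3</answer></column>
</row>
</data></project>"""


def flatten_rows(root):
    """Return one dict per <column>, repeating the row-level fields."""
    records = []
    for row in root.find("data"):
        # Row-level fields: every direct child that is not a <column>.
        base = {child.tag: child.text for child in row if child.tag != "column"}
        for column in row.findall("column"):
            record = dict(base)  # fresh copy so columns do not share state
            record.update({field.tag: field.text for field in column})
            records.append(record)
    return records


def to_csv(records):
    """Serialise the records to CSV text, keeping first-seen column order."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = flatten_rows(ET.fromstring(XML_TEXT))
csv_text = to_csv(records)
print(csv_text)
```

To write a real file instead of a string, the same DictWriter calls should work on a file opened with open(csv_file, 'w', newline='').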

Related

Mastodon API - /api/v2/search - Not getting expected results

My objective is to perform a search of Mastodon statuses and return the content (i.e. the text) of any status that matches. The docs suggest I can do this. Can I actually do this?
My python code is
import requests
url = 'https://<server>/api/v2/search'
auth = {'Authorization': 'Bearer <token>'}
params = {'q': '<keyword>', 'type':'statuses'}
response = requests.get(url, data=params, headers=auth)
First issue: I get the following response (no matter what keyword I choose and even when my keyword clearly appears in a recent status):
{'accounts': [], 'statuses': [], 'hashtags': []}
Second issue: If I don't restrict the search to statuses, I get results! But they are not what I expect. There's no content key :( While the example results do contain a content key, which was my goal.
{'accounts': [], 'statuses': [], 'hashtags': [{'name': 'hiring', 'url': 'https://data-folks.masto.host/tags/hiring', 'history': [{'day': '1675814400', 'accounts': '5', 'uses': '5'}, {'day': '1675728000', 'accounts': '10', 'uses': '13'}, {'day': '1675641600', 'accounts': '7', 'uses': '7'}, {'day': '1675555200', 'accounts': '6', 'uses': '6'}, {'day': '1675468800', 'accounts': '3', 'uses': '3'}, {'day': '1675382400', 'accounts': '5', 'uses': '6'}, {'day': '1675296000', 'accounts': '9', 'uses': '9'}], 'following': False}]}
Thanks for any help! I'm a beginner and truly appreciate it.
This question was answered for me by the user trwnh over on Mastodon's GitHub discussions, so I am copying it here:
"Full-text search across all statuses is not supported. If your server
has configured the optional Elasticsearch backend, then you can
perform limited full-text search against your own posts, favourites,
and bookmarks -- basically, only posts relevant to you. To obtain
content based on a keyword, you must use hashtags."
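Following that advice, one possible route is the hashtag timeline rather than /api/v2/search. The sketch below is only an illustration under assumptions: it assumes the server exposes the hashtag timeline at /api/v1/timelines/tag/<hashtag> as described in the Mastodon API docs, the server name and the sample payload are made up, and the helper names are hypothetical. It only builds the URL and strips the HTML from a status's content field, so it runs without network access.

```python
import re
from html import unescape
from urllib.parse import quote, urlencode


def tag_timeline_url(server, hashtag, limit=20):
    """Build the URL for a server's hashtag timeline (assumed endpoint)."""
    query = urlencode({"limit": limit})
    return f"https://{server}/api/v1/timelines/tag/{quote(hashtag)}?{query}"


def status_texts(statuses):
    """Extract plain text from the HTML `content` field of each status."""
    texts = []
    for status in statuses:
        no_tags = re.sub(r"<[^>]+>", "", status.get("content", ""))
        texts.append(unescape(no_tags))
    return texts


# Made-up server name and response payload, for illustration only.
url = tag_timeline_url("mastodon.example", "hiring", limit=5)
sample_response = [{"content": "<p>We are <b>hiring</b> &amp; growing</p>"}]
print(url)
print(status_texts(sample_response))
```

With requests, the actual call would look roughly like requests.get(url, headers=auth).json(). Note also that requests sends GET query parameters through params=, whereas data= puts them in the request body.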

Compare values in list of dicts in Python

I'm a newbie in Python. I have a list of members and a list of meetings (containing the member id):
memberList = [{'id': '1', 'name': 'Joe'},
{'id': '2', 'name': 'Jason'},
{'id': '3', 'name': 'Billy'}]
meetingList = [{'meetingId': '20', 'hostId' : '1'},
{'meetingId': '21', 'hostId' : '1'},
{'meetingId': '22', 'hostId' : '2'},
{'meetingId': '23', 'hostId' : '2'}]
The id of a member and the hostId of a meeting refer to the same value.
Desired result: a list of the ids of members who have no meetings, ['3'], or the list of dicts [{'id': '3', 'name': 'Billy'}]
What's the best and most readable way to do it?
You could build a set of hosts and then use a list comprehension to filter out the members:
member_list = [{'id': '1', 'name': 'Joe'},
{'id': '2', 'name': 'Jason'},
{'id': '3', 'name': 'Billy'}]
meeting_list = [{'meetingId': '20', 'hostId': '1'},
{'meetingId': '21', 'hostId': '1'},
{'meetingId': '22', 'hostId': '2'},
{'meetingId': '23', 'hostId': '2'}]
# create a set of hosts
hosts = set(meeting['hostId'] for meeting in meeting_list) # or { meeting['hostId'] for meeting in meeting_list }
# filter out the members that are in hosts
res = [member for member in member_list if member['id'] not in hosts]
print(res)
Output
[{'id': '3', 'name': 'Billy'}]
For the id only output, do:
res = [member['id'] for member in member_list if member['id'] not in hosts]
print(res)
Output
['3']
I'd extract out the id's from both lists of dictionaries and compare them directly.
First I'm just rewriting your list variables to assign them with =.
Using : won't save the variable.
memberList = [{'id': '1', 'name': 'Joe'},
{'id': '2', 'name': 'Jason'},
{'id': '3', 'name': 'Billy'}]
meetingList = [{'meetingId': '20', 'hostId' : '1'},
{'meetingId': '21', 'hostId' : '1'},
{'meetingId': '22', 'hostId' : '2'},
{'meetingId': '23', 'hostId' : '2'}]
Then use list comprehension to extract out the id's from each list of dicts.
member_id_list = [i["id"] for i in memberList]
meeting_hostid_list = [i["hostId"] for i in meetingList]
You could also use list comprehension here but if you aren't familiar with it, this for loop and if logic will print out any member id who isn't a host.
for i in member_id_list:
    if i not in meeting_hostid_list:
        print(i)
>> 3

How to get values in script using python

I'm creating a crawler with Python + Beautiful Soup.
I have to access the <script> tag to get some data in the dataLayer.
I did a search with BeautifulSoup and managed to return the tag that I need, but I cannot turn it into JSON to access the information.
This is the code that I wrote to get the tag:
page = get_html('URL')
dataLayer = page.findAll('script')[NUMBER OF SCRIPT]
And this is my return:
<script type="text/javascript">
dataLayer = [{
'site': {
'isMobile': false
},
'page': {
'pageType': 'ad_detail',
'detail': {
'parent_category_id': '2000',
'category_id': '2020',
'state_id': '2',
'region_id': '31',
'ad_id': '293231982',
'list_id': '250941507',
'city_id': '9208',
'zipcode':'34710620',
},
'adDetail': {
'adID': '293231982',
'listID': '250941507',
'sellerName': 'Marr',
'adDate': '2016-11-30 20:52:11',
},
},
'session': {
'user': {
'userID': '',
'loginType': ''
}
},
'pageType': 'Ad_detail',
'abtestingEnable' : '1',
// Listing information
'listingCategory': '2020',
// Ad information
'adId': '293231982',
'state': '2',
'region': '31',
'category': '2020',
'pictures': '8',
'listId': '250941507',
//Account Information
'loggedUser':'0',
'referrer': '',
//User Information
}];
</script>
I would like to get values such as adDate and zipcode.
import json
import re

s = soup.script.text.replace('\'', '"') # replace ' with "
s = re.search(r'\{.+\}', s, re.DOTALL).group() # get json data
s = re.sub(r'//.+\n', '', s) # remove comments
s = re.sub(r',\s*}', '}', s) # get rid of the trailing , before each }
json.loads(s)
out:
{'abtestingEnable': '1',
'adId': '293231982',
'category': '2020',
'listId': '250941507',
'listingCategory': '2020',
'loggedUser': '0',
'page': {'adDetail': {'adDate': '2016-11-30 20:52:11',
'adID': '293231982',
'listID': '250941507',
'sellerName': 'Marr'},
'detail': {'ad_id': '293231982',
'category_id': '2020',
'city_id': '9208',
'list_id': '250941507',
'parent_category_id': '2000',
'region_id': '31',
'state_id': '2',
'zipcode': '34710620'},
'pageType': 'ad_detail'},
'pageType': 'Ad_detail',
'pictures': '8',
'referrer': '',
'region': '31',
'session': {'user': {'loginType': '', 'userID': ''}},
'site': {'isMobile': False},
'state': '2'}
Your JSON uses single quotes instead of double quotes.
You should replace all single quotes with double quotes to make your dataLayer variable JSON compliant.
A simple .replace("'", '"') should do the trick.
Note: you also have to remove the comment lines with a second regex.
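To show the steps end to end, including how to read adDate and zipcode afterwards, here is a minimal sketch that works on an abbreviated copy of the script text from the question. It is only a sketch under assumptions: the helper name parse_data_layer is hypothetical, and the quote replacement assumes that no value contains an apostrophe and that every comment sits on its own line. It leaves whitespace alone, so the space inside adDate is preserved.

```python
import json
import re

# Abbreviated copy of the <script> body from the question.
SCRIPT_TEXT = """
dataLayer = [{
'site': {
'isMobile': false
},
'page': {
'pageType': 'ad_detail',
'detail': {
'city_id': '9208',
'zipcode':'34710620',
},
'adDetail': {
'sellerName': 'Marr',
'adDate': '2016-11-30 20:52:11',
},
},
// Listing information
'listingCategory': '2020',
//User Information
}];
"""


def parse_data_layer(script_text):
    """Convert the JavaScript object literal into a Python dict."""
    # JSON requires double quotes around keys and strings.
    text = script_text.replace("'", '"')
    # Keep everything from the first { to the last }.
    text = re.search(r"\{.+\}", text, re.DOTALL).group()
    # Drop lines that consist only of a // comment.
    text = re.sub(r"^[ \t]*//.*$", "", text, flags=re.MULTILINE)
    # Drop trailing commas, which JSON does not allow.
    text = re.sub(r",\s*}", "}", text)
    return json.loads(text)


data = parse_data_layer(SCRIPT_TEXT)
print(data["page"]["adDetail"]["adDate"])
print(data["page"]["detail"]["zipcode"])
```

In the crawler, SCRIPT_TEXT would be replaced by the text of the tag found with BeautifulSoup.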

Scraping information from hoverbox

As background, I am scraping a webpage in Python and using BeautifulSoup.
Some of the information that I need to access is in a little box about user profiles that pops up when the mouse hovers over the user's profile picture. The problem is that this information is not available in the HTML; instead, I get the following:
""div class="username mo"
span class="expand_inline scrname mbrName_1586A02614A388AEE215B4A3139A2C18" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">Sapphire-Ed
""
(I have deleted some of the >s so that the html will show up in the question, sorry!)
Can anyone tell me how to do this? Thank you for the help!!
Here is the webpage if that is helpful:
view-source:http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html
The information I am trying to access is the review distribution.
Below is the complete working code that outputs a dictionary where the keys are usernames and the values are review distributions. To understand how the code works, here are the key things to take into account:
- the information in the overlay that appears on mouseover is loaded dynamically with an HTTP GET request that takes a number of user-specific parameters; the most important are uid and src
- the uid and src values can be extracted with a regular expression from the id attribute of every user profile element
- the response to this GET request is HTML, which you also need to parse with BeautifulSoup
- you should maintain the web-scraping session with requests.Session
The code:
import re
from pprint import pprint

import requests
from bs4 import BeautifulSoup

data = {}

# this pattern would help us to extract uid and src needed to make a GET request
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")

# making a web-scraping session
with requests.Session() as session:
    # use the session for the first request too, so cookies are shared
    response = session.get("http://www.tripadvisor.com/Attraction_Review-g143010-d108269-Reviews-Cadillac_Mountain-Acadia_National_Park_Mount_Desert_Island_Maine.html")
    soup = BeautifulSoup(response.content, "lxml")

    # iterating over usernames on the page
    for member in soup.select("div.member_info div.memberOverlayLink"):
        # extracting uid and src from the `id` attribute
        match = pattern.search(member['id'])
        if match:
            username = member.find("div", class_="username").text.strip()
            uid, src = match.groups()

            # making a GET request for the overlay information
            response = session.get("http://www.tripadvisor.com/MemberOverlay", params={
                "uid": uid,
                "src": src,
                "c": "",
                "fus": "false",
                "partner": "false",
                "LsoId": ""
            })

            # getting the grades dictionary
            soup_overlay = BeautifulSoup(response.content, "lxml")
            data[username] = {grade_type: soup_overlay.find("span", text=grade_type).find_next_sibling("span", class_="numbersText").text.strip(" ()")
                              for grade_type in ["Excellent", "Very good", "Average", "Poor", "Terrible"]}

pprint(data)
Prints:
{'Anna T': {'Average': '2',
'Excellent': '0',
'Poor': '0',
'Terrible': '0',
'Very good': '2'},
'Arlyss T': {'Average': '0',
'Excellent': '6',
'Poor': '0',
'Terrible': '0',
'Very good': '1'},
'Bf B': {'Average': '1',
'Excellent': '22',
'Poor': '0',
'Terrible': '0',
'Very good': '17'},
'Charmingnl': {'Average': '15',
'Excellent': '109',
'Poor': '4',
'Terrible': '4',
'Very good': '45'},
'Jackie M': {'Average': '2',
'Excellent': '10',
'Poor': '0',
'Terrible': '0',
'Very good': '4'},
'Jonathan K': {'Average': '69',
'Excellent': '90',
'Poor': '6',
'Terrible': '0',
'Very good': '154'},
'Sapphire-Ed': {'Average': '8',
'Excellent': '47',
'Poor': '2',
'Terrible': '0',
'Very good': '49'},
'TundraJayco': {'Average': '14',
'Excellent': '59',
'Poor': '0',
'Terrible': '1',
'Very good': '49'},
'Versrii': {'Average': '2',
'Excellent': '8',
'Poor': '0',
'Terrible': '0',
'Very good': '10'},
'tripavisor83': {'Average': '12',
'Excellent': '9',
'Poor': '1',
'Terrible': '0',
'Very good': '20'}}
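The uid and src extraction step can be tried in isolation, without any network access. The snippet below is a small sketch: it reuses the regular expression from the code above, while the id strings and the helper name extract_uid_src are made up for illustration and are not taken from the real page.

```python
import re

# Same pattern as in the answer: group 1 is the uid, group 2 is the src.
pattern = re.compile(r"UID_(\w+)-SRC_(\w+)")


def extract_uid_src(element_id):
    """Return (uid, src) from an id attribute, or None if it does not match."""
    match = pattern.search(element_id)
    return match.groups() if match else None


# Made-up id values, for illustration only.
print(extract_uid_src("UID_ABC123DEF-SRC_42"))
print(extract_uid_src("no-overlay-here"))
```

Because \w does not match a hyphen, the first group stops at the "-" that precedes SRC_, which is what splits the two values apart.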

How to update dict items inside a list of dicts

I have this list of dicts that I'm maintaining as a master list:
orig_list = [
    { 'cpu': '4', 'mem': '4', 'name': 'server1', 'drives': '4', 'nics': '1' },
    { 'cpu': '1', 'mem': '2', 'name': 'server2', 'drives': '2', 'nics': '2' },
    { 'cpu': '2', 'mem': '8', 'name': 'server3', 'drives': '1', 'nics': '1' }
]
However, I need to perform actions on things inside this list of dicts, like:
def modifyVM(local_list):
    local_temp_list = []
    for item in local_list:
        '''
        Tons of VM processy things happen here.
        '''
        item['cpu'] = 4
        item['notes'] = 'updated cpu'
        local_temp_list.append(item)
    return local_temp_list

temp_list = []
for item in orig_list:
    if int(item['cpu']) < 4:  # cpu values are stored as strings
        temp_list.append(item)
result_list = modifyVM(temp_list)
At this point, result_list contains:
result_list = [
    { 'cpu': '4', 'mem': '2', 'name': 'server2', 'drives': '2', 'nics': '2' },
    { 'cpu': '4', 'mem': '8', 'name': 'server3', 'drives': '1', 'nics': '1' }
]
So my questions are:
1) What is the most efficient way to update orig_list with the results of result_list? I'm hoping to end up with:
orig_list = [
    { 'cpu': '4', 'mem': '4', 'name': 'server1', 'drives': '4', 'nics': '1' },
    { 'cpu': '4', 'mem': '2', 'name': 'server2', 'drives': '2', 'nics': '2', 'notes': 'updated cpu' },
    { 'cpu': '4', 'mem': '8', 'name': 'server3', 'drives': '1', 'nics': '1', 'notes': 'updated cpu' }
]
2) Is there a way to update orig_list without ever creating secondary lists?
Thank you in advance.
Collections store references to objects.
So the code you posted is already modifying the items in "orig_list" as well, because all the lists store references to the same original dictionaries.
As for the second part of your question, you don't need to create a new list. You can modify the objects directly, and the next time you iterate over the list you'll see the updated values.
For example:
orig_list = [
    { 'cpu': 4, 'mem': '4', 'name': 'server1', 'drives': '4', 'nics': '1' },
    { 'cpu': 1, 'mem': '2', 'name': 'server2', 'drives': '2', 'nics': '2' },
    { 'cpu': 2, 'mem': '8', 'name': 'server3', 'drives': '1', 'nics': '1' }
]
print(orig_list)
for item in orig_list:
    if item['cpu'] < 4:
        item['cpu'] = 4
print(orig_list)
Output of first print:
[{'mem': '4', 'nics': '1', 'drives': '4', 'cpu': 4, 'name': 'server1'},
{'mem': '2', 'nics': '2', 'drives': '2', 'cpu': 1, 'name': 'server2'},
{'mem': '8', 'nics': '1', 'drives': '1', 'cpu': 2, 'name': 'server3'}]
And second print:
[{'mem': '4', 'nics': '1', 'drives': '4', 'cpu': 4, 'name': 'server1'},
{'mem': '2', 'nics': '2', 'drives': '2', 'cpu': 4, 'name': 'server2'},
{'mem': '8', 'nics': '1', 'drives': '1', 'cpu': 4, 'name': 'server3'}]
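The point about references can be checked directly. The following is a small sketch with trimmed-down dictionaries (only cpu and name are kept, as an assumption for brevity): the filtered list is a new list, but its elements are the very same dict objects, so changes made through it show up in orig_list.

```python
orig_list = [
    {"cpu": 4, "name": "server1"},
    {"cpu": 1, "name": "server2"},
    {"cpu": 2, "name": "server3"},
]

# Filtering builds a new list, but its elements are the same dict objects.
temp_list = [item for item in orig_list if item["cpu"] < 4]
shared = all(any(t is o for o in orig_list) for t in temp_list)

# Updating through temp_list therefore updates orig_list as well.
for item in temp_list:
    item["cpu"] = 4
    item["notes"] = "updated cpu"

print(shared)
print(orig_list)
```

If independent copies were wanted instead, copy.deepcopy would be the usual tool.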
No, you don't need to create a separate list; a plain for loop over the original list is enough.
Just iterate through the list and check whether the value of the cpu key is less than 4. If it is, set cpu to 4 and add an extra key notes with the value 'updated cpu'. The value of orig_list after the loop finishes is the desired result.
>>> orig_list = [{'cpu': 4, 'drives': '4', 'mem': '4', 'name': 'server1', 'nics': '1'},
...              {'cpu': 1, 'drives': '2', 'mem': '2', 'name': 'server2', 'nics': '2'},
...              {'cpu': 2, 'drives': '1', 'mem': '8', 'name': 'server3', 'nics': '1'}]
>>> for item in orig_list:
...     if item['cpu'] < 4:
...         item['cpu'] = 4
...         item['notes'] = 'updated cpu'
...
>>> orig_list
[{'cpu': 4, 'drives': '4', 'mem': '4', 'name': 'server1', 'nics': '1'},
 {'cpu': 4, 'drives': '2', 'mem': '2', 'name': 'server2', 'nics': '2', 'notes': 'updated cpu'},
 {'cpu': 4, 'drives': '1', 'mem': '8', 'name': 'server3', 'nics': '1', 'notes': 'updated cpu'}]
Thank you for all the input! I flagged eugenioy's post as the answer because he posted first. Both the answer from him and from Rahul Gupta are very efficient ways to update a list of dictionaries.
However, I kept trying other ways because these answers, as efficient as they are, still do one other thing I've always been told is taboo: modifying the list you're iterating over.
Keep in mind that I'm still learning Python. So if some of my "revelations" here are mundane, they are new and "wow" to me. To that effect, I'm adding the answer that I actually ended up implementing.
Here's the finished code:
def modifyVM(local_list, l_orig_list):
    for item in local_list[:]:
        l_orig_list.remove(item)
        '''
        Tons of VM processy things happen here.
        '''
        item['cpu'] = 4
        item['notes'] = 'updated cpu'
        l_orig_list.append(item)

temp_list = []
for item in orig_list[:]:
    if int(item['cpu']) < 4:  # int() also handles cpu values stored as strings
        temp_list.append(item)
modifyVM(temp_list, orig_list)
I changed this line:
def modifyVM(local_list):
To this line:
def modifyVM(local_list, l_orig_list):
In this way, I'm passing in both the list I want to use as well as the list I want to update.
Next I changed:
for item in local_list :
To this line:
for item in local_list[:] :
This causes "item" to iterate through a slice (copy) of "local_list" that contains everything.
I also added:
l_orig_list.remove(item)
And:
l_orig_list.append(item)
This solved several problems for me.
1) This avoids the potential of modifying any list that's being iterated through.
2) This allows "orig_list" to be updated as processes are happening, which cuts down on the "secondary lists" that are created and maintained.
3) The "orig_list" that's passed into the function and "l_orig_list" are linked variables until a hard assignment (i.e. l_orig_list = 'anything') is made. (Again, thank you to everyone that answered! This was some great "secret sauce" learning for me, and y'all pointed it out.) So, avoiding the "=", I'm able to update "l_orig_list" and have "orig_list" updated as well.
4) This also allows the movement of items from one list to another if needed (i.e. list items that generate errors can be removed from "orig_list" and placed in any other list, like "bad_list" for example).
In closing, I'd like to give recognition to Steven Rumbalski. When I read your comment, I was like, "Of course!!!" However, I spent 2 days on it before realizing that dictionaries cannot be sorted. I had to narrow down the technical problem I was facing to ask a question here. Sorting was an unstated requirement for other parts of the script. So GREAT suggestion, and I'll probably use that for another script.
