BeautifulSoup: extracting attribute for various items - python

Let's say we have HTML like this (sorry, I don't know how to copy and paste page info and this is on an intranet):
And I want to get the highlighted portion for all of the questions (this is like a Stack Overflow page). EDIT: to be clearer, what I am interested in is getting a list that has:
['question-summary-39968',
'question-summary-40219',
'question-summary-42899',
'question-summary-34348',
'question-summary-32497',
'question-summary-35308',
...]
Now I know that a working solution is a list comprehension where I could do:
[item["id"] for item in html_df.find_all(class_="question-summary")]
But this is not exactly what I want. How can I directly access question-summary-41823 for the first item?
Also, what is the difference between soup.select and soup.get?

I thought I would post my answer here if it helps others.
What I am trying to do is access the id attribute within the question-summary class.
Now you can do something like this and obtain it for only the first item (object?):
html_df.find(class_="question-summary")["id"]
But you want it for all of them. So you could do this to get the class data:
html_df.select('.question-summary')
But you can't just do
html_df.select('.question-summary')["id"]
Because select returns a list of bs4.element.Tag objects, not a single tag, and a list cannot be indexed with a string. So you need to iterate over the list and pull out just the piece that you want. You could write a for loop, but a more elegant way is to use a list comprehension:
[item["id"] for item in html_df.find_all(class_="question-summary")]
Breaking down what this does:
It first creates a list of all the question-summary elements from the soup
It iterates over each element in the list, which we've named item
It extracts the id attribute and adds it to the resulting list
Alternatively you can use select:
[item["id"] for item in html_df.select(".question-summary")]
I prefer the first version because it's more explicit, but either one results in:
['question-summary-43960',
'question-summary-43953',
'question-summary-43959',
'question-summary-43947',
'question-summary-43952',
'question-summary-43945',
...]
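For anyone who wants to try this without access to the original page, here is a minimal sketch. The HTML snippet is made up for illustration (only the class name and the id pattern come from the question), and html_df is simply the name used above for the parsed soup. It also touches on the select versus get question: select takes a CSS selector and returns a list of matching tags, while get is called on a single tag and returns one attribute value, or None if the attribute is missing.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the intranet page; only the class name
# and the id pattern are taken from the question.
html = """
<div class="question-summary" id="question-summary-39968"></div>
<div class="question-summary" id="question-summary-40219"></div>
<div class="other"></div>
"""

html_df = BeautifulSoup(html, "html.parser")

# find() returns only the first matching tag
first_id = html_df.find(class_="question-summary")["id"]

# find_all() and select() both return every matching tag
ids_find_all = [item["id"] for item in html_df.find_all(class_="question-summary")]
ids_select = [item["id"] for item in html_df.select(".question-summary")]

# Tag.get() reads one attribute and returns None when it is absent
missing = html_df.find(class_="other").get("id")

print(first_id)
print(ids_find_all)
print(ids_select)
print(missing)
```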

Related

Understanding selenium web-elements list

Ok so I had a list of web items created by Selenium's WebDriver.find_elements_by_xpath method, and I had trouble using the data.
Ultimately, the code I needed to get what I wanted was this:
menu_items = driver.find_elements_by_xpath('//div[@role="menuitem"]')[-2]
I was only ever able to get any meaningful data here by using a negative index. If I used any positive indices, the menu_items would return nothing.
However, when I had left menu_items as follows:
menu_items = driver.find_elements_by_xpath('//div[@role="menuitem"]')
I could iterate through the list and access the web elements properly, meaning if I had "for i in menu_items" I could call something like i.text and get the desired result. But again, I could not do menu_items[2]. I am new to Selenium, so if someone could explain what is going on here, that would be very helpful.
This line of code...
menu_items = driver.find_elements_by_xpath('//div[@role="menuitem"]')[-2]
...indicates you are considering the second element counting from the right instead of the left, as list[-1] refers to the last element within the list and list[-2] refers to the second-to-last element.
A bit more about your use case would have helped us to construct a canonical answer. The number of visible/interactable elements at any given point in time, and the sequence in which the elements become visible/interactable, may vary based on the type of elements present in the DOM tree. In case the HTML DOM contains elements rendered by JavaScript, Angular, or ReactJS, even the position of the elements may differ.
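To illustrate the indexing part of this answer without a browser, here is a small sketch that uses a plain Python list as a stand-in for the list returned by find_elements_by_xpath. The labels are invented for illustration.

```python
# A plain list standing in for the list of web elements (labels are invented)
menu_items = ["Open", "Save", "Export", "Settings", "Quit"]

last_item = menu_items[-1]        # last element
second_to_last = menu_items[-2]   # second element counting from the right
third_from_left = menu_items[2]   # positive indices count from the left, starting at 0

print(last_item, second_to_last, third_from_left)
```

Selenium's text property returns only visible text, so one possible explanation (an assumption, since the page is not shown) is that the elements at the positive indices were present in the list but not displayed.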

Using nested for loops to iterate through JSON file of tweets in Python

So I am new to Python, but I know what I am trying to accomplish. Basically, I have the output of tweets from twitter in a JSON file loaded into Python. What I need to do is iterate through the tweets to access the "text" key, that has the text of each tweet, because that's what I'm going to use to do topic modeling. So, I have discovered that "text" is triple nested in this data structure, and it's been really difficult to find the correct way to write the for loop code in order to iterate through the dataset and pull the "text" from every tweet.
Here is a look at what the JSON structure is like: https://pastebin.com/fUH5MTMx
So, I have figured out that the "text" key that I want is within [hits][hits][_source]. What I can't figure out is the appropriate for loop to iterate through _source and pull those texts. Here is my code so far (again, I'm a real beginner, so sorry if my code is way off):
for hits in tweets["hits"]["hits"]:
    for _source in hits:
        for text in _source:
            for item in text:
                print(item)
I also tried this:
for item in tweets['hits']["hits"]["_source"]:
    print(item['text'])
But I keep getting either syntax errors for the first one, or "TypeError: list indices must be integers or slices, not str" for the second one. I understand that I need to specify some way that I am trying to access this list, and that I'm missing something to show that it's a list and that I am not looking for integers as an output from the iterations... (I am using the JSON module in Python for this, and using a Mac with Python 3 in Spyder)
Any insight would be greatly appreciated! This multiple nesting is confusing me a lot.
['hits']["hits"] is not a dictionary with ["_source"], but a list with one or many items, each of which has ["_source"]. That means:
tweets['hits']["hits"][0]["_source"]
tweets['hits']["hits"][1]["_source"]
tweets['hits']["hits"][2]["_source"]
So this should work
for item in tweets['hits']["hits"]:
    print(item["_source"]['text'])
Not sure if you realize it, but this JSON is loaded as a Python dictionary, not a list. Anyway, let's get into this nest.
tweets['hits'] will give you another dict.
tweets['hits']['hits'] will give you a list (notice the brackets)
This is a list of dictionaries, and in this case (not sure if it will always be) the first one has the "_source" key you are looking for, so:
tweets['hits']['hits'][0] will give you the dict you want. Then, finally:
tweets['hits']['hits'][0]['_source']['text'] should give you the text.
The value of the second "hits" is a list.
Try:
for hit in tweets["hits"]["hits"]:
    print(hit["_source"]["text"])
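Here is a self-contained sketch of the same loop. The JSON below is an assumed, cut-down shape based only on the keys named in the question; the real file has more fields per tweet.

```python
import json

# Assumed shape, loosely modelled on the description in the question;
# the tweet texts and ids are invented.
raw = """
{
  "hits": {
    "total": 2,
    "hits": [
      {"_id": "1", "_source": {"text": "first tweet"}},
      {"_id": "2", "_source": {"text": "second tweet"}}
    ]
  }
}
"""

tweets = json.loads(raw)

# tweets["hits"] is a dict, tweets["hits"]["hits"] is a list of dicts,
# so the loop goes over the list and indexes each dict by key
texts = [hit["_source"]["text"] for hit in tweets["hits"]["hits"]]

print(texts)
```

Collecting the texts into a list rather than printing them one by one may be more convenient if they are going to be fed into topic modeling afterwards.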

Python list.remove items present in second list

I've searched around and most of the errors I see are when people are trying to iterate over a list and modify it at the same time. In my case, I am trying to take one list, and remove items from that list that are present in a second list.
import pymysql
schemaOnly = ["table1", "table2", "table6", "table9"]
db = pymysql.connect(my connection stuff)
tables = db.cursor()
tables.execute("SHOW TABLES")
tablesTuple = tables.fetchall()
tablesList = []
# I do this because there is no way to remove items from a tuple
# which is what I get back from tables.fetchall
for item in tablesTuple:
    tablesList.append(item)
for schemaTable in schemaOnly:
    tablesList.remove(schemaTable)
When I put various print statements in the code, everything looks proper and like it is going to work. But when it gets to the actual tablesList.remove(schemaTable) I get the dreaded ValueError: list.remove(x): x not in list.
If there is a better way to do this I am open to ideas. It just seemed logical to me to iterate through the list and remove items.
Thanks in advance!
** Edit **
Everyone in the comments and the first answer is correct. The reason this is failing is because the conversion from a Tuple to a list is creating a very badly formatted list. Hence there is nothing that matches when trying to remove items in the next loop. The solution to this issue was to take the first item from each Tuple and put those into a list like so: tablesList = [x[0] for x in tablesTuple] . Once I did this the second loop worked and the table names were correctly removed.
Thanks for pointing me in the right direction!
I assume that fetchall returns tuples, one for each database row matched.
Now the problem is that the elements in tablesList are tuples, whereas schemaTable contains strings. Python does not consider these to be equal.
Thus when you attempt to call remove on tablesList with a string from schemaTable, Python cannot find any such value.
You need to inspect the values in tablesList and find a way to convert them to strings. I suspect it would be by simply taking the first element out of each tuple, but I do not have a MySQL database at hand so I cannot test that.
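The mismatch can be reproduced without a database. In this sketch the nested tuple is a stand-in for what fetchall is assumed to return (one 1-tuple per row), and the table names are invented.

```python
# Stand-in for the result of tables.fetchall(): one 1-tuple per row
# (the table names here are invented)
tablesTuple = (("table1",), ("table2",), ("table3",), ("table6",))
schemaOnly = ["table1", "table2", "table6"]

# A 1-tuple never compares equal to the string it contains
print(("table1",) == "table1")

# Take the first column of each row, then remove the unwanted names
tablesList = [row[0] for row in tablesTuple]
for schemaTable in schemaOnly:
    tablesList.remove(schemaTable)

print(tablesList)
```

Note that remove still raises ValueError if a name in schemaOnly is not present in tablesList at all, so this sketch assumes every listed table exists.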
Regarding your question, if there is a better way to do this: Yes.
Instead of adding items to the list, and then removing them, you can append only the items that you want. For example:
for item in tablesTuple:
    # each row is a 1-tuple, so compare and keep its first element
    if item[0] not in schemaOnly:
        tablesList.append(item[0])
Also, schemaOnly can be written as a set, to improve search complexity from O(n) to O(1):
schemaOnly = {"table1", "table2", "table6", "table9"}
This will only be meaningful with big lists, but in my experience it's useful semantically.
And finally, you can write the whole thing in one list comprehension:
tablesList = [item[0] for item in tablesTuple if item[0] not in schemaOnly]
And if you don't need to keep repetitions (or if there aren't any in the first place), you can also do this:
tablesSet = {item[0] for item in tablesTuple} - schemaOnly
This also has the best big-O complexity of all these variations.

Filter List of Strings By Keys

My project has required this enough times that I'm hoping someone on here can give me an elegant way to write it.
I have a list of strings, and would like to filter out duplicates using a key/key-like functionality (like I can do with sorted(foo, key=bar)).
Most recently, I'm dealing with links.
Currently I have to create an empty list, and add in values if their name is not already in it.
Note: name is the name of the file the link links to (just a regex match)
parsed_links = ["http://www.host.com/3y979gusval3/name_of_file_1",
                "http://www.host.com/6oo8wha55crb/name_of_file_2",
                "http://www.host.com/6gaundjr4cab/name_of_file_3",
                "http://www.host.com/udzfiap79ld/name_of_file_6",
                "http://www.host.com/2bibqho4mtox/name_of_file_5",
                "http://www.host.com/4a31wozeljsp/name_of_file_4"]
links = []
[links.append(link) for link in parsed_links if not name(link) in
    [name(lnk) for lnk in links]]
I want the final list to have the full links (so I can't just get rid of everything but the filenames and use set); but I'd like to be able to do this without creating an empty list every time.
Also, my current method seems inefficient (which is significant as it is often dealing with hundreds of links).
Any suggestions?
Why not just use a dictionary?
links = dict((name(link), link) for link in parsed_links)
If I understand your question correctly, your performance problems may come from the list comprehension that is repeatedly evaluated in a tight loop.
Try caching the result by putting the list comprehension outside of the loop, then use another comprehension instead of append() on an empty list:
linkNames = [name(lnk) for lnk in links]
links = [link for link in parsed_links if name(link) not in linkNames]
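A list of names built before the loop cannot see names that are added during the loop, so here is a minimal sketch of another way to do it: track the keys seen so far in a set. The unique_by helper is hypothetical, and name here is a simple stand-in for the asker's regex (it just takes the last path segment). The third link is an invented duplicate added to show the effect.

```python
def name(link):
    # Stand-in for the asker's regex: take the last path segment
    return link.rsplit("/", 1)[-1]


def unique_by(items, key):
    """Keep the first item for each distinct key, preserving order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result


parsed_links = [
    "http://www.host.com/3y979gusval3/name_of_file_1",
    "http://www.host.com/6oo8wha55crb/name_of_file_2",
    "http://www.host.com/aaaaaaaaaaaa/name_of_file_1",  # invented duplicate name
]

links = unique_by(parsed_links, key=name)
print(links)
```

Each membership test against the set takes constant time on average, so this scales to hundreds of links without rescanning a list. The dictionary approach above is shorter, but it keeps the last link seen for each name rather than the first.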

Referring to objects inside a list without using references or indices

I'm using python for my shopping cart class which has a list of items. When a customer wants to edit an item, I need to pass the JavaScript front-end some way to refer to the item so that it can call AJAX methods to manipulate it.
Basically, I need a simple way to point to a particular item that isn't its index, and isn't a reference to the object itself.
I can't use an index, because another item in the list might be added or removed while the identifier is "held" by the front end. If I were to pass the index forward, if an item got deleted from the list then that index wouldn't point to the right object.
One solution seems to be to use UUIDs, but that seems particularly heavyweight for a very small list. What's the simplest/best way to do this?
Instead of using a list, why not use a dictionary and use small integers as the keys? Adding and removing items from the dictionary will not change the indices into the dictionary. You will want to keep one value in the dictionary that lets you know what the next assigned index will be.
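A minimal sketch of that idea follows. The Cart class and its method names are hypothetical, and itertools.count stands in for the "next assigned index" value so that keys are never reused.

```python
import itertools


class Cart:
    """Hypothetical cart that hands out small integer keys for its items."""

    def __init__(self):
        self._items = {}
        self._next_key = itertools.count(1)  # keys are never reused

    def add(self, item):
        key = next(self._next_key)
        self._items[key] = item
        return key

    def remove(self, key):
        del self._items[key]

    def get(self, key):
        return self._items[key]


cart = Cart()
apple = cart.add("apple")
pear = cart.add("pear")
plum = cart.add("plum")

cart.remove(apple)       # removing one item does not shift the others
print(cart.get(pear))

fig = cart.add("fig")    # the next key continues from the counter
print(fig)
```

The key returned by add is what would be passed to the front end; it stays valid for that item no matter what else is added or removed.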
A UUID seems perfect for this. Why don't you want to do that?
Do the items have any sort of product_id? Can the shopping cart have more than one of the same product_id, or does it store a quantity? What I'm getting at is: if the product_ids in the cart are unique, you can just use that.
