Get columns from generator object - python

I'm using Scrapetube to get videos from a channel, and it returns a generator object. From the very simple documentation, I know it includes the parameter "videoId", but how can I know what other parameters I can get from there? Can I transform a generator object into, say, a dataframe?

Generators allow you to efficiently iterate over (potentially infinite) sequences.
In your case, you probably want to first convert the generator into a list to expose all items in the sequence.
Then you can inspect what the returned elements look like and extract the information you need.
You can then create a dataframe for instance from a list of dictionaries:
result_gen = scrapetube.xxx()
result_list = list(result_gen)
# Inspect first element
print(result_list[0])
# Inspect attributes of the first element
print(dir(result_list[0]))
# Convert/extract information of interest into a dictionary
def to_dict(element):
    ...
result_df = pd.DataFrame([to_dict(element) for element in result_list])
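As a concrete sketch of the pattern above (the generator and its fields here are stand-ins, not Scrapetube's actual output, which yields dictionaries of video metadata):

```python
# Hypothetical stand-in for a scrapetube generator; the real one yields
# dictionaries of video metadata (field names here are illustrative).
def fake_videos():
    yield {"videoId": "abc123", "title": "First video"}
    yield {"videoId": "def456", "title": "Second video"}

result_list = list(fake_videos())      # materialize the generator
print(result_list[0])                  # inspect the first element
print(sorted(result_list[0].keys()))   # dict elements: keys() is more useful than dir()

# A list of dicts feeds straight into pandas:
# import pandas as pd; result_df = pd.DataFrame(result_list)
```

One caveat: a generator is exhausted once consumed, so after `list()` you cannot iterate it again; call the scrapetube function again if you need a fresh generator.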


BeautifulSoup: extracting attribute for various items

Let's say we have HTML like this (sorry, I don't know how to copy and paste page info and this is on an intranet):
And I want to get the highlighted portion for all of the questions (this is like a Stack Overflow page). EDIT: to be clearer, what I am interested in is getting a list that has:
['question-summary-39968',
'question-summary-40219',
'question-summary-42899',
'question-summary-34348',
'question-summary-32497',
'question-summary-35308',
...]
Now I know that a working solution is a list comprehension where I could do:
[item["id"] for item in html_df.find_all(class_="question-summary")]
But this is not exactly what I want. How can I directly access question-summary-41823 for the first item?
Also, what is the difference between soup.select and soup.get?
I thought I would post my answer here if it helps others.
What I am trying to do is access the id attribute within the question-summary class.
Now you can do something like this and obtain it for only the first item (object?):
html_df.find(class_="question-summary")["id"]
But you want it for all of them. So you could do this to get the class data:
html_df.select('.question-summary')
But you can't just do
html_df.select('.question-summary')["id"]
Because you have a list filled with bs4.elements. So you need to iterate over the list and select just the piece that you want. You could do a for loop but a more elegant way is to just use list comprehension:
[item["id"] for item in html_df.find_all(class_="question-summary")]
Breaking down what this does:
It first creates a list of all the question-summary objects from the soup
Then it iterates over each element in the list, which we've named item
Finally, it extracts the id attribute and adds it to the resulting list
Alternatively you can use select:
[item["id"] for item in html_df.select('.question-summary')]
I prefer the first version because it's more explicit, but either one results in:
['question-summary-43960',
'question-summary-43953',
'question-summary-43959',
'question-summary-43947',
'question-summary-43952',
'question-summary-43945',
...]
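The same extraction pattern works for any sequence of mapping-like elements; in this sketch plain dicts stand in for the bs4 Tag objects (the real code would use html_df.find_all or html_df.select as above):

```python
# Plain dicts standing in for bs4 Tag elements, which also support
# item["id"]-style attribute access.
elements = [
    {"id": "question-summary-39968", "class": "question-summary"},
    {"id": "question-summary-40219", "class": "question-summary"},
]

# Same list comprehension as in the answer: pick one attribute per element
ids = [item["id"] for item in elements]
print(ids)
```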

Mutability of Python Generator Expressions versus List and Dictionary Comprehension: Nested Dictionary Weirdness

I am using Python 3.5 to create a set of generators to parse a set of opened files in order to cherry pick data from those files to construct an object I plan to export later. I was originally parsing through the entirety of each file and creating a list of dictionary objects before doing any analysis, but this process would take up to 30 seconds sometimes, and since I only need to work with each line of each file once, I figure it's a great opportunity to use a generator. However, I feel that I am missing something conceptually with generators, and perhaps the mutability of objects within a generator.
My original code that makes a list of dictionaries goes as follows:
parsers = {}
# iterate over files in the file_name file to get their attributes
for dataset, data_file in files.items():
    # Store each dataset as a list of dictionaries with keys that
    # correspond to the attributes of that dataset
    parsers[dataset] = [{attributes[dataset][i]: value.strip('~')
                         for i, value in enumerate(line.strip().split('^'))}
                        for line in data_file]
And I access the list by calling:
>>>parsers['definitions']
And it works as expected returning a list of dictionaries. However when I convert this list into a generator, all sorts of weirdness happens.
parsers = {}
# iterate over files in the file_name file to get their attributes
for dataset, data_file in files.items():
    # Store each dataset as a list of dictionaries with keys that
    # correspond to the attributes of that dataset
    parsers[dataset] = ({attributes[dataset][i]: value.strip('~')
                         for i, value in enumerate(line.strip().split('^'))}
                        for line in data_file)
And I call it by using:
>>> next(parsers['definitions'])
Running this code raises an IndexError (list index out of range).
The main difference I can see between the two code segments is that in the list comprehension version, python constructs the list from the file and moves on without needing to store the comprehensions variables for later use.
Conversely, in the generator expression the variables defined within the generator need to be stored with the generator, as they affect each successive call of the generator later in my code. I am thinking that perhaps the variables inside the generator are sharing a namespace with the other generators my code creates, and so each generator has erratic behavior based on whatever generator expression was run last, and therefore set the values of the variables last.
I appreciate any thoughts as to the reason for this issue!
I assume that the problem is when you're building the dictionaries.
attributes[dataset][i]
Note that with the list version, dataset is whatever dataset was at that particular turn of the for loop. However, with the generator, that expression isn't evaluated until after the for loop has completed, so dataset will have the value of the last dataset from the files.items() loop...
Here's a super simple demo that hopefully elaborates on the problem:
results = []
for a in [1, 2, 3]:
    results.append(a for _ in range(3))

for r in results:
    print(list(r))
Note that we always get [3, 3, 3] because when we take the values from the generator, the value of a is 3.
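One common way around this late binding is to freeze the loop variable by passing it into a function, so each generator keeps its own copy; a minimal sketch:

```python
# Late-binding version: 'a' is looked up only when each generator runs
results = []
for a in [1, 2, 3]:
    results.append(a for _ in range(3))

def make_gen(a):
    # 'a' is now a parameter, bound at call time, so each generator
    # closes over its own value instead of the shared loop variable.
    return (a for _ in range(3))

fixed = [make_gen(a) for a in [1, 2, 3]]

results_values = [list(g) for g in results]
fixed_values = [list(g) for g in fixed]
print(results_values)  # [[3, 3, 3], [3, 3, 3], [3, 3, 3]]
print(fixed_values)    # [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
```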

Create new list using other list to look up values in dictionary - python

Consider the below situation. I have a list:
feature_dict = vectorizer.get_feature_names()
Which just have some strings, all of which are a kind of internal identifiers, completely meaningless. I also have a dictionary (it is filled in different part of code):
phoneDict = dict()
This dictionary has mentioned identifiers as keys, and values assigned to them are, well, good values which mean something.
I want to create a new list preserving the order of original list (this is crucial) but replacing each element with the value from dictionary. So I thought about creating new list by applying a function to each element of list but with no luck.
I tried to create a fuction:
def fastMap(x):
    return phoneDict[x]
And then map it:
map(fastMap, feature_dict)
It just returns:
<map object at 0x0000000017DFBD30>
and nothing else.
Anyone tried to solve similar problem?
Just convert the result to list:
list(map(fastMap, feature_dict))
Why? map() returns an iterator, see https://docs.python.org/3/library/functions.html#map:
map(function, iterable, ...)
Return an iterator that applies function to every item of iterable, yielding the results. If additional iterable arguments are passed, function must take that many arguments and is applied to the items from all iterables in parallel. With multiple iterables, the iterator stops when the shortest iterable is exhausted. For cases where the function inputs are already arranged into argument tuples, see itertools.starmap().
which you can convert to a list with list()
Note: in Python 2, map() returns a list, but this was changed in Python 3 to return an iterator
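Putting it together with the names from the question (the sample identifiers and values here are made up; in the question they come from vectorizer.get_feature_names() and a dict filled elsewhere):

```python
# Hypothetical sample data standing in for the asker's real inputs
feature_dict = ["id_b", "id_a", "id_c"]
phoneDict = {"id_a": "Alice", "id_b": "Bob", "id_c": "Carol"}

def fastMap(x):
    return phoneDict[x]

mapped = map(fastMap, feature_dict)   # lazy iterator in Python 3
result = list(mapped)                 # order of feature_dict is preserved
print(result)  # ['Bob', 'Alice', 'Carol']

# Equivalent list comprehension, no helper function needed:
same = [phoneDict[x] for x in feature_dict]
```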

BeautifulSoup4 - python: how to merge two bs4.element.ResultSet and get one single list?

I have two bs4.element.ResultSet objects. Let's call them rs1 and rs2.
I want one result set (let's call it rs) with all the results from both.
I also need to figure out:
whether an array or list or dictionary is better to navigate through a (potentially) big list of results given that each element will consist of an object which is composed of 7 properties of different types
how to convert the merged resultset to the array/list/dictionary
A bs4.element.ResultSet object is a straight-up subclass of list. You can use ResultSet.extend() to extend one result set with the other:
rs1.extend(rs2)
or simply concatenate the two result sets:
newlist = rs1 + rs2
The latter creates a list object with the contents of the two result sets, which means you'll lose the .source attribute. Not a great loss, really, seeing as nothing in BeautifulSoup itself uses that attribute.
There are ways to create just the one result set to begin with, rather than concatenate the two. Searches that can find either result type would lead to the results being returned in document source order, rather than back to back. You can use list arguments to the find_all() method, for example:
soup.find_all(['a', 'link'], href=True)
would find all a and link elements with a href attribute, for example.
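Because ResultSet subclasses list, plain lists can illustrate the merging and conversion steps; the dict elements below are stand-ins for the real tags with their 7 properties:

```python
# Plain lists standing in for two bs4 ResultSet objects
# (ResultSet subclasses list, so the operations are identical).
rs1 = [{"tag": "a", "href": "/one"}]
rs2 = [{"tag": "link", "href": "/two"}]

rs = rs1 + rs2                 # merged copy; rs1 and rs2 untouched
rs1_extended = list(rs1)
rs1_extended.extend(rs2)       # in-place alternative

# For a big result set with several properties per element, a list of
# dicts keeps document order while allowing access by property name:
rows = [{"tag": el["tag"], "href": el["href"]} for el in rs]
print(len(rs), rows[1]["href"])
```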

Joining a list of object values into a string using list comprehensions in python 2.7

I have a list of objects. In a single line I would like to create a string that contains a specific variable of each object in the list, separated by commas.
Right now I'm able to achieve this using a combination of list comprehensions and map like so:
','.join(map(str, [instance.public_dns_name for instance in instances]))
or using lambda:
','.join(map(str, [(lambda(i): i.public_dns_name)(instance) for instance in instances]))
Each instance object has a "public_dns_name" variable that returns the host name. This returns a string like this:
host1,host2,host3,host4
Is it possible to achieve the same thing using only the list comprehension?
You can't use just a list comprehension, you'll still need to use join.
It's more efficient to use a generator expression
','.join(str(instance.public_dns_name) for instance in instances)
A list comprehension would look like this:
','.join([str(instance.public_dns_name) for instance in instances])
The difference is that it creates the entire list in memory before joining, whereas the generator expression will create the components as they are joined
I'm not sure what you mean by "only the list comprehension", you should still use join ultimately, but the process can be far less convoluted:
','.join(str(instance.public_dns_name) for instance in instances)
No need to create a lambda function here. And remember that join takes any iterable, so you don't have to create a list just to pass it to join.
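A self-contained sketch of both forms (the Instance class is a stand-in for the asker's objects, which expose a public_dns_name attribute):

```python
class Instance:
    # Stand-in for the asker's objects
    def __init__(self, name):
        self.public_dns_name = name

instances = [Instance("host1"), Instance("host2"), Instance("host3")]

# Generator expression: items are produced as join consumes them
gen_version = ','.join(str(i.public_dns_name) for i in instances)

# List comprehension: the whole list exists in memory before join runs
list_version = ','.join([str(i.public_dns_name) for i in instances])

print(gen_version)  # host1,host2,host3
```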
