Python dataframe issue

I have the following dataframe. It is from IMDb. What I need to do is extract movies with a score lower than 5 that received more than 100,000 votes. My problem is that I don't understand what the last code lines about the voting really do.
# two lists, one for movie data, the other for vote data
movie_data = []
vote_data = []
# this will do some reformatting to get the right unicode escape:
# converts XML/HTML hex entities into unicode strings
hexentityMassage = [(re.compile('&#x([^;]+);'), lambda m: '&#%d;' % int(m.group(1), 16))]
for i in range(20):
    next_url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=%d&title_type=feature&year=1950,2012' % (i*50+1)
    r = requests.get(next_url)
    bs = BeautifulSoup(r.text, convertEntities=BeautifulSoup.HTML_ENTITIES, markupMassage=hexentityMassage)
    # movie info is found in the table cell called 'title'
    for movie in bs.findAll('td', 'title'):
        title = movie.find('a').contents[0].replace('&amp;', '&')  # get '&' as in 'Batman & Robin'
        genres = movie.find('span', 'genre').findAll('a')
        genres = [g.contents[0] for g in genres]
        year = int(movie.find('span', 'year_type').contents[0].strip('()'))
        runtime = movie.find('span', 'runtime').contents[0]
        rating = float(movie.find('span', 'value').contents[0])
        movie_data.append([title, genres, runtime, rating, year])
    # vote info is found in a separate cell called 'sort_col'
    for voting in bs.findAll('td', 'sort_col'):
        vote_data.append(int(voting.contents[0].replace(',', '')))

Your problem is this snippet:
for voting in bs.findAll('td', 'sort_col'):
    vote_data.append(int(voting.contents[0].replace(',','')))
Here you are looping over all the td tags whose class is sort_col (i.e. class="sort_col").
In the second line you are:
- replacing the ',' with '' (an empty string) in the first element of the list returned by voting.contents,
- casting the result to int,
- and then appending it to vote_data.
If I break this up, it looks like this:
for voting in bs.findAll('td', 'sort_col'):
    # voting.contents returns a list like [u'377,936']
    str_vote = voting.contents[0]
    # str_vote is now u'377,936'
    int_vote = int(str_vote.replace(',', ''))
    # int_vote is now 377936
    vote_data.append(int_vote)
Print the values inside the loop to get a better understanding. Also, tagging your question correctly helps you get a good answer faster.
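To come back to the original filtering question: once movie_data and vote_data are filled, pandas does the rest. A minimal sketch with made-up sample rows (the titles, ratings, and vote counts here are invented for illustration, and the column names are my own choice):

```python
import pandas as pd

# hypothetical sample rows in the same shape the scraping loop builds
movie_data = [['Batman & Robin', ['Action'], '125 mins.', 3.7, 1997],
              ['The Godfather', ['Crime'], '175 mins.', 9.2, 1972]]
vote_data = [230000, 1700000]

df = pd.DataFrame(movie_data, columns=['title', 'genres', 'runtime', 'rating', 'year'])
df['votes'] = vote_data

# movies with a score lower than 5 that received more than 100000 votes
bad_popular = df[(df['rating'] < 5) & (df['votes'] > 100000)]
print(bad_popular['title'].tolist())
```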

Related

How to filter a collection by multiple conditions

I have a csv file named film.csv. Here are the header line and a few data lines to use as an example:
Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards;*Image
1990;111;Tie Me Up! Tie Me Down!;Comedy;Banderas, Antonio;Abril, Victoria;Almodóvar, Pedro;68;No;NicholasCage.png
1991;113;High Heels;Comedy;Bosé, Miguel;Abril, Victoria;Almodóvar, Pedro;68;No;NicholasCage.png
1983;104;Dead Zone, The;Horror;Walken, Christopher;Adams, Brooke;Cronenberg, David;79;No;NicholasCage.png
1979;122;Cuba;Action;Connery, Sean;Adams, Brooke;Lester, Richard;6;No;seanConnery.png
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;No;NicholasCage.png
1983;140;Octopussy;Action;Moore, Roger;Adams, Maud;Glen, John;68;No;NicholasCage.png
I am trying to filter, and need to display the movie titles matching all of these criteria: first name contains "Richard", Year < 1985, Awards == "Y".
I am able to filter on the award, but not the rest. Can you help?
file_name = "film.csv"
lines = (line for line in open(file_name, encoding='cp1252'))  # generator to capture lines
lists = (s.rstrip().split(";") for s in lines)  # generator to capture lists containing values from lines
# browse lists and index them per header values, then filter all movies that have been awarded,
# using a new generator object
cols = next(lists)  # obtains only the header
print(cols)
collections = (dict(zip(cols, data)) for data in lists)
filtered = (col["Title"] for col in collections if col["Awards"][0] == "Y")
for item in filtered:
    print(item)
    # input()
This works for the award, but I don't know how to add additional filters. Also, when I try to filter with if col["Year"] < 1985 I get an error because a string and an int can't be compared. How do I make the years numeric?
I believe for the first name I can filter like this:
if col["Actor"].split(", ")[-1] == "Richard"
You know how to add one filter, and there is no such thing as an "additional" filter: just add your conditions to the current condition. Since you want all of the conditions to be True to select a record, you'd combine them with the boolean and operator. For example:
filtered = (
    col["Title"]
    for col in collections
    if col["Awards"][0] == "Y"
    and col["Actor"].split(", ")[-1] == "Richard"
    and int(col["Year"]) < 1985
)
Notice I added an int() around the col["Year"] to convert it to an integer.
You've actually gone and reinvented csv.DictReader in the setup to this problem! Instead of
file_name = "film.csv"
lines = (line for line in open(file_name,encoding='cp1252')) #generator to capture lines
lists = (s.rstrip().split(";") for s in lines) #generators to capture lists containing values from lines
#browse lists and index them per header values, then filter all movies that have been awarded
#using a new generator object
cols=next(lists) #obtains only the header
print(cols)
collections = (dict(zip(cols,data)) for data in lists)
filtered = ...
You could have just done:
import csv
file_name = "film.csv"
with open(file_name, encoding='cp1252') as f:
    collections = csv.DictReader(f, delimiter=";")
    filtered = ...
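Putting the DictReader version together with the combined filter from the first answer, here is a runnable sketch using an inline sample instead of the real file (note: I flipped one Awards field to "Yes" so the filter has something to match):

```python
import csv
import io

sample = """Year;Length;Title;Subject;Actor;Actress;Director;Popularity;Awards;*Image
1983;140;Octopussy;Action;Moore, Roger;Adams, Maud;Glen, John;68;No;NicholasCage.png
1978;94;Days of Heaven;Drama;Gere, Richard;Adams, Brooke;Malick, Terrence;14;Yes;NicholasCage.png
"""

collections = csv.DictReader(io.StringIO(sample), delimiter=";")
filtered = [
    row["Title"]
    for row in collections
    if row["Awards"][0] == "Y"
    and row["Actor"].split(", ")[-1] == "Richard"
    and int(row["Year"]) < 1985
]
print(filtered)
```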

Generate DF from attributes of tags in list

I have a list of revisions from a Wikipedia article that I queried like this:
import urllib.request
import re

def getRevisions(wikititle):
    url = "https://en.wikipedia.org/w/api.php?action=query&format=xml&prop=revisions&rvlimit=500&titles=" + wikititle
    revisions = []  # list of all accumulated revisions
    next = ''  # continuation parameter for the next request
    while True:
        response = urllib.request.urlopen(url + next).read()  # web request
        response = str(response)
        revisions += re.findall('<rev [^>]*>', response)  # adds all revisions from the current request to the list
        cont = re.search('<continue rvcontinue="([^"]+)"', response)
        if not cont:  # break the loop if the 'continue' element is missing
            break
        next = "&rvcontinue=" + cont.group(1)  # the revision id from which to start the next request
    return revisions
Which results in a list with each element being a rev Tag as a string:
['<rev revid="343143654" parentid="6546465" minor="" user="name" timestamp="2021-12-12T08:26:38Z" comment="abc" />',...]
How can I generate a DataFrame from this list?
An "easy" way without using regex would be splitting the string and then parsing:
import pandas as pd

for rev_string in revisions:
    rev_dict = {}
    # Skip the first and last pieces; they are the tag itself ('<rev' and '/>').
    attributes = rev_string.split(' ')[1:-1]
    # Split on '=' and use each pair as key and value, stripping the surrounding '"'
    for attribute in attributes:
        key, value = attribute.split("=")
        rev_dict[key] = value.strip('"')
    df = pd.DataFrame([rev_dict])
This sample creates one dataframe per revision. If you would like to gather multiple revisions in one dataframe, you need to handle unique attributes (I don't know whether these change depending on the wiki document) and, after gathering all attributes in a list of dictionaries, convert that list to a DataFrame.
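Note that attribute values can contain spaces (comments usually do), so splitting on ' ' can mis-parse them. A sketch of the same idea using the standard library's XML parser instead, which handles that case:

```python
import xml.etree.ElementTree as ET

# one hypothetical rev tag in the format the question shows
revisions = ['<rev revid="343143654" parentid="6546465" minor="" user="name" '
             'timestamp="2021-12-12T08:26:38Z" comment="a b c" />']

# Element.attrib is already a dict of the tag's attributes
rows = [ET.fromstring(r).attrib for r in revisions]
print(rows[0]['revid'])    # 343143654
print(rows[0]['comment'])  # a b c  -- spaces survive intact
# pd.DataFrame(rows) then gives one row per revision
```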
Use the output format json instead; then you can easily create a data frame from the JSON.
Example URL for JSON output
For converting the json to a dataframe, check out this stackoverflow query.
Any other solution if I have multiple revisions, like ''''[, ]''''?

Should I append or join a list of dicts, and how, in Python?

Using this code I was able to cycle through several instances of attributes and extract the first and last name if they matched the criteria. The results are a list of dicts. How would I make all of these results that match the criteria print as a full name, each on its own line, as text?
my_snapshot = cfm.child('teamMap').get()
for players in my_snapshot:
    if players['age'] != 27:
        print({players['firstName'], players['lastName']})
Results of Print Statement
{'Chandon', 'Sullivan'}
{'Urban', 'Brent'}
Are you looking for this:
print(players['firstName'], players['lastName'])
This would output:
Chandon Sullivan
Urban Brent
Your original attempt just put the items into a set ({}), and then printed the set, for no apparent reason.
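For reference, here is what a set literal does, which explains both the braces and the arbitrary order in the printout:

```python
pair = {'Urban', 'Brent'}   # a set literal: braces, no guaranteed order
print(sorted(pair))         # ['Brent', 'Urban']
print({'Sam', 'Sam'})       # duplicates collapse, so this is just {'Sam'}
```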
Edit:
You can also, for example, join firstName and lastName into one string and then append the combos to a list. Then you can do whatever you need with the list:
names = []
my_snapshot = cfm.child('teamMap').get()
for players in my_snapshot:
    if players['age'] != 27:
        names.append(f"{players['firstName']} {players['lastName']}")
If you're using a Python version lower than 3.6 and can't use f-strings, you can write the last line for example like this:
names.append("{} {}".format(players['firstName'], players['lastName']))
Or if you prefer:
names.append(players['firstName'] + ' ' + players['lastName'])
Ok, I figured it out by appending the first and last name and creating a list for the matching entries. I then converted the list to a string to display it on the device.
full_list = []
my_snapshot = cfm.child('teamMap').get()
for players in my_snapshot:
    if players['age'] != 27:
        full_list.append(players['firstName'] + " " + players['lastName'])
send_message('\n'.join(full_list))

How to web scrape all of the batters' names?

I would like to scrape all of the MLB batters stats for 2018. Here is my code so far:
# import modules
from urllib.request import urlopen
from lxml import html

# fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring(content)

# parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring(comment_html)

for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
When I scraped all of the batters, there was 0.01 attached to each name. I tried to remove the attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data:
    if string in x:
        substring = x.replace(string, '')
        if substring != "":
            result.append(substring)
    else:
        result.append(x)
print(result)
This code removed the number; however, only the last batter's name was printed:
Output:
['Zunino, Mike']
Also, there are brackets and quotation marks around the name, and the name is in reverse order.
1) How can I print all of the batters' names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
You can do the same thing in different ways. Here is one approach that doesn't require post-processing; you get the names in the format you wanted:
from urllib.request import urlopen
from lxml.html import fromstring

url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = str(urlopen(url).read())
comment = content.replace("-->", "").replace("<!--", "")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    csk = batter_row.xpath('.//td[@data-stat="player"]/a')[0].text
    print(csk)
Output you may get like:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk on each iteration of your first loop. Initialize the empty list bat_data first and then append each batter to it.
bat_data = []
for batter_row in blah:
    csk = blah
    bat_data.append(csk)
This will give you a list of all batters: ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list. You don't have to check whether string is in the name; just do x.replace('0.01', '') and then check that the result isn't empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited, assuming you got bat_data correctly:
for x in bat_data:
    substring = x.replace(string, '')
    if substring != "":
        substring = substring.split(',')
        substring.reverse()
        substring = ' '.join(substring)
        result.append(substring)
for x in result:
    print(x)
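Run on a single sample value (hypothetical, in the 'Last,First0.01' format the scrape produces), the cleanup steps look like this:

```python
x = 'Zunino,Mike0.01'              # hypothetical entry from bat_data
substring = x.replace('0.01', '')  # strip the trailing number
parts = substring.split(',')       # ['Zunino', 'Mike']
parts.reverse()                    # ['Mike', 'Zunino']
full_name = ' '.join(parts)
print(full_name)                   # Mike Zunino
```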
1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets are there because result is a list. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try:
name = " ".join(result[0].split(", ")[::-1])
Note that list.reverse() reverses in place and returns None, so it can't be chained; the [::-1] slice returns the reversed copy directly.

RegEx string works when directly assigned in Python, but not from a PostgreSQL database

I have a working routine to determine the categories a news item belongs to. The routine works when assigning values in Python for the title, category, subcategory, and the search words as RegExp.
But when retrieving these values from PostgreSQL as strings, I do not get any errors, or any results, from the same routine.
I checked the datatypes, both are Python strings.
What can be done to fix this?
import re

# set the text to be analyzed
title = "next week there will be a presentation. The location will be aat"
# these could be the categories
category = "presentation"
subcategory = "scientific"
# these are the regular expressions
main_category_search_words = r'\bpresentation\b'
sub_category_search_words = r'\basm microbe\b | \basco\b | \baat\b'
category_final = ''
subcategory_final = ''
# identify main category
r = re.compile(main_category_search_words, flags=re.I | re.X)
result = r.findall(title)
if len(result) == 1:
    category_final = category
# identify sub category
r2 = re.compile(sub_category_search_words, flags=re.I | re.X)
result2 = r2.findall(title)
if len(result2) > 0:
    subcategory_final = subcategory
print("analysis result:", category_final, subcategory_final)
I'm pretty sure that what you get back from PostgreSQL is not a raw string literal, hence your RegEx is invalid. You will have to escape the backslashes in your pattern explicitly in the DB.
print(r"\basm\b")
print("\basm\b")
print("\\basm\\b")
# output
\basm\b
asm   # the \b pairs here are backspace control characters, so the terminal mangles this line
\basm\b
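You can see the difference directly by comparing the lengths and the match results of the two spellings:

```python
import re

pattern_raw = r'\basm\b'  # 7 chars: backslash-b is two characters here
pattern_bad = '\basm\b'   # 5 chars: \b became a single backspace character
print(len(pattern_raw), len(pattern_bad))            # 7 5
print(bool(re.search(pattern_raw, 'the asm code')))  # True: word boundary
print(bool(re.search(pattern_bad, 'the asm code')))  # False: literal backspaces
```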
