I am trying to count the number of contractions used by politicians in certain speeches. I have lots of speeches, but here are some of the URLs as a sample:
every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
'http://www.millercenter.org/president/obama/speeches/speech-4424',
'http://www.millercenter.org/president/obama/speeches/speech-4453',
'http://www.millercenter.org/president/obama/speeches/speech-4612',
'http://www.millercenter.org/president/obama/speeches/speech-5502']
I have a pretty rough counter right now - it only counts the total number of contractions used across all of those links. For example, the code below returns the running totals 79, 101, 101, 182, 224 for the five links above. However, I want to link each count up with filename, a variable I create below, so I would have something like (speech_1, 79), (speech_2, 22), (speech_3, 0), (speech_4, 81), (speech_5, 42). That way, I can track the number of contractions used in each individual speech. I'm getting the following error with my code: AttributeError: 'tuple' object has no attribute 'split'
Here's my code:
import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
url = 'http://www.millercenter.org/president/speeches'
url2 = 'http://www.millercenter.org'
conn = urllib2.urlopen(url)
html = conn.read()
miller_center_soup = BeautifulSoup(html)
links = miller_center_soup.find_all('a')
linklist = [tag.get('href') for tag in links if tag.get('href') is not None]
# remove all items in list that don't contain 'speeches'
linkslist = [_ for _ in linklist if re.search('speeches',_)]
del linkslist[0:2]
# concatenate 'http://www.millercenter.org' with each speech's URL ending
every_link_dups = [url2 + end_link for end_link in linkslist]
# remove duplicates
seen = set()
every_link = [] # no duplicates array
for l in every_link_dups:
    if l not in seen:
        every_link.append(l)
        seen.add(l)
def processURL_short_2(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id':'transcript'},{'class':'displaytext'})
    item_str = item_div.text.lower()
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return item_str, filename
every_link_test = every_link[0:5]
print every_link_test
count = 0
for l in every_link_test:
    content_1 = processURL_short_2(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
print count, filename
As the error message explains, you cannot use split the way you are using it: split is a string method, and processURL_short_2 returns a tuple.
So you will need to change this:
for word in content_1.split():
to this:
for word in content_1[0].split():
Indexing with [0] gives you the chunk of text you are looking to search through (the first element of the returned tuple); you still need .split() so that you iterate over words rather than single characters.
@TigerhawkT3 has a good suggestion you should follow in their answer too:
https://stackoverflow.com/a/32981533/1832539
Instead of print count, filename, you should save these data to a data structure, like a dictionary. Since processURL_short_2 returns a tuple, you'll need to unpack it.
data = {} # initialize a dictionary
for l in every_link_test:
    count = 0 # reset the count for each speech
    content_1, filename = processURL_short_2(l) # unpack the content and filename
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    data[filename] = count # add this to the dictionary as filename:count
This would give you a dictionary like {'obama_4424':79, 'obama_4453':101,...}, allowing you to easily store and access your parsed data.
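If you then want the per-speech counts in a predictable order, a minimal follow-up sketch (assuming the data dictionary built above):
for filename, count in sorted(data.items()):
    print filename, count
This prints one filename and count pair per line, sorted by filename.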
I would like to scrape all of the MLB batters' stats for 2018. Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml import html
#fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring( content )
#parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring( comment_html )
for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
When I scraped all of the batters, there was 0.01 attached to each name. I tried to remove the attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data:
    if string in x:
        substring = x.replace(string, '')
        if substring != "":
            result.append(substring)
    else:
        result.append(x)
print(result)
This code removed the number; however, only the last name was printed:
Output:
['Zunino, Mike']
Also, there are brackets and quotation marks around the name, and the name is in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
You can do this in different ways. Here is one approach which doesn't require any post-processing; you get the names in exactly the format you wanted. (The site serves this table inside an HTML comment, which is why the comment markers are stripped before parsing.)
from urllib.request import urlopen
from lxml.html import fromstring
url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    csk = batter_row.xpath('.//td[@data-stat="player"]/a')[0].text
    print(csk)
The output you get should look like:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data = []
for batter_row in blah:
    csk = blah
    bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list. You don't have to check if string is in the name; just do x.replace('0.01', '') and then check if the resulting string is empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited assuming you got bat_data correctly:
for x in bat_data:
    substring = x.replace(string, '')
    if substring != "":
        substring = substring.split(',')
        substring.reverse()
        substring = ' '.join(substring)
        result.append(substring)
for x in result:
    print(x)
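Given the sample list ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01'] from above, this loop should print (my own trace of the logic, not output from an actual run):
Jose Abreu
Ronald Acuna
Jason Adam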
1) Print all batter names
print(result)
This will print everything in the result object. If it's not printing what you expect, then something else is going wrong.
2) Remove quotations
The brackets appear because result is a list. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try:
name = " ".join(result[0].split(", ")[::-1])  # 'Zunino, Mike' -> 'Mike Zunino'
Note that list.reverse() reverses in place and returns None, so it cannot be chained; the [::-1] slice does the reversal instead.
Currently I am crawling a webpage for newspaper articles using Python's BeautifulSoup library. These articles are stored in the object "details".
Then I have a couple of names of various streets that are stored in the object "lines". Now I want to search the articles for the street names that are contained in "lines".
If one of the street names is part of one of the articles, I want to save the name of the street in an array.
If there is no match for an article (the selected article does not contain any of the street names), then there should be an empty element in the array.
So for example, let's assume the object "lines" would consist of ("Abbey Road", "St-John's Bridge", "West Lane", "Sunpoint", "East End").
The object "details" consists of 4 articles, of which 2 contain "Abbey Road" and "West Lane" (e.g. as in "Car accident on Abbey Road, three people hurt"). The other 2 articles don't contain any of names from "lines".
Then after matching the result should be an array like this:
[]["Abbey Road"][]["West Lane"]
I was also told to use vectorization for this, as my original data sample is quite big. However, I'm not familiar with using vectorization for string operations. Has anyone worked with this already?
My Code currently looks like this, however this only returns "-1" as elements of my resulting array:
from bs4 import BeautifulSoup
import requests
import io
import re
import string
import numpy as np
my_list = []
for y in range(0, 2):
    y *= 27
    i = str(y)
    my_list.append('http://www.presseportal.de/blaulicht/suche.htx?q=' + 'einbruch' + '&start=' + i)

for link in my_list:
    # print(link)
    r = requests.get(link)
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.content, 'html.parser')

with open('a4.txt', encoding='utf8') as f:
    lines = f.readlines()
lines = [w.replace('\n', '') for w in lines]

details = soup.find_all(class_='news-bodycopy')
for class_element in details:
    details = class_element.get_text()

sdetails = ''.join(details)
slines = ''.join(lines)
i = str.find(sdetails, slines[1:38506])
print(i)
If someone wants to reproduce my experiment, the website URL is in the code above; the crawling and storing of articles in the object "details" works properly, so the code can just be copied.
The .txt-file for my original Data for the object "lines" can be accessed in this Dropbox-Folder:
https://www.dropbox.com/s/o0cjk1o2ej8nogq/a4.txt?dl=0
Thanks a lot for any hints on how I can make this work, preferably via vectorization.
You could try something like this:
my_list = []
for y in range(0, 2):
    i = str(y)
    my_list.append('http://www.presseportal.de/blaulicht/suche.htx?q=einbruch&start=' + i)

for link in my_list:
    r = requests.get(link)
    soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'), 'html.parser')
    details = soup.find_all(class_='news-bodycopy')

f = open('a4.txt')
lines = [line.rstrip('\r\n') for line in f]

result = []
for i in range(len(details)):
    found_in_line = 0
    for j in range(len(lines)):
        try:
            # index() raises ValueError when the street name is not in the article
            if details[i].get_text().index(lines[j].decode('utf-8', 'ignore')) is not None:
                result.append(lines[j])
                found_in_line = found_in_line + 1
        except ValueError:
            # after the last street name, record an empty slot if nothing matched
            if (j == len(lines) - 1) and (found_in_line == 0):
                result.append(" ")
print result
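Since the question asks for a vectorized approach, here is a minimal sketch using pandas (my own suggestion, not part of the answer above), assuming details (the list of article tags) and lines (the street names) are populated as in the code above:
import re
import pandas as pd

# one Series entry per article
articles = pd.Series([d.get_text() for d in details])
# build a single regex alternation over all street names, escaped literally
pattern = '(' + '|'.join(re.escape(line) for line in lines) + ')'
# extract the first matching street name per article; no match -> ''
matches = articles.str.extract(pattern, expand=False).fillna('')
print(matches.tolist())  # e.g. ['', 'Abbey Road', '', 'West Lane']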
I have a list of urls and I'm trying to filter them using specific key words, say word1 and word2, and a list of stop words, say [stop1, stop2, stop3]. Is there a way to filter the links without using many if conditions? I got the proper output when I used an if condition for each stop word, but that doesn't look like a feasible option. The following is the brute-force method:
for link in url:
    if word1 in link or word2 in link:
        if stop1 not in link:
            if stop2 not in link:
                if stop3 not in link:
                    links.append(link)
Here are a couple of options I would consider if I were in your situation.
You can use a list comprehension with the built in any and all functions to filter out the unwanted urls from your list:
urls = ['http://somewebsite.tld/word',
'http://somewebsite.tld/word1',
'http://somewebsite.tld/word1/stop3',
'http://somewebsite.tld/word2',
'http://somewebsite.tld/word2/stop2',
'http://somewebsite.tld/word3',
'http://somewebsite.tld/stop3/word1',
'http://somewebsite.tld/stop4/word1']
includes = ['word1', 'word2']
excludes = ['stop1', 'stop2', 'stop3']
filtered_url_list = [url for url in urls if any(include in url for include in includes) if all(exclude not in url for exclude in excludes)]
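For the sample urls above, tracing the logic by hand, this keeps:
print(filtered_url_list)
# ['http://somewebsite.tld/word1',
#  'http://somewebsite.tld/word2',
#  'http://somewebsite.tld/stop4/word1']
Note that the last entry survives because 'stop4' is not one of the excluded substrings.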
Or you can make a function which takes one url as an argument and returns True for urls you want to keep and False for ones you don't, then pass that function along with the unfiltered list of urls to the built-in filter function:
def urlfilter(url):
    includes = ['word1', 'word2']
    excludes = ['stop1', 'stop2', 'stop3']
    for include in includes:
        if include in url:
            for exclude in excludes:
                if exclude in url:
                    return False
            else:
                # the for-else runs when the loop finds no exclude
                return True
    # urls without any include fall through and return None,
    # which filter() treats as false
urls = ['http://somewebsite.tld/word',
'http://somewebsite.tld/word1',
'http://somewebsite.tld/word1/stop3',
'http://somewebsite.tld/word2',
'http://somewebsite.tld/word2/stop2',
'http://somewebsite.tld/word3',
'http://somewebsite.tld/stop3/word1',
'http://somewebsite.tld/stop4/word1']
filtered_url_list = filter(urlfilter, urls)
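One caveat: in Python 3, filter returns a lazy iterator rather than a list, so wrap it in list() if you need a list:
filtered_url_list = list(filter(urlfilter, urls))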
It helps to work from a concrete example. If we take a sample set of urls like these:
def urlSearch():
    word = []
    end_words = ['gmail', 'finance']
    Key_word = ['google']
    urlList = ['google.com//d/gmail', 'google.com/finance', 'google.com/sports', 'google.com/search']
    for i in urlList:
        main_part = i.split('/', i.count('/'))
        if main_part[len(main_part) - 1] in end_words:
            word = []
            for k in main_part[:-1]:
                for j in k.split('.'):
                    word.append(j)
            print(word)
            for p in Key_word:
                if p in word:
                    print("Url is: " + i)

urlSearch()
I would use sets and list comprehension:
import re

must_in = set([word1, word2])
musnt_in = set([stop1, stop2, stop3])
# split each url into tokens first: set() on a raw string would give a set
# of single characters, which can't match whole words
links = [x for x in url
         if must_in & set(re.split(r'\W+', x))
         and not musnt_in & set(re.split(r'\W+', x))]
print links
The code above can be used with any number of words and stops, not limited to two words (word1, word2) and three stops (stop1, stop2, stop3).
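For example, with hypothetical values word1 = 'word1', word2 = 'word2', stop1 = 'stop1', and so on, and
url = ['http://somewebsite.tld/word1', 'http://somewebsite.tld/word1/stop3', 'http://somewebsite.tld/stop4/word1']
the comprehension keeps the first and last urls and drops the middle one, whose tokens include stop3.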
I want to extract some data from JSON, but I don't know what happened. It responds with "TypeError: list indices must be integers, not str".
Here is my code, thanks:
import urllib
import json
url = 'http://python-data.dr-chuck.net/comments_304658.json'
data = urllib.urlopen(url).read()
info = json.loads(data)
#print json.dumps(info,indent=4)
lst = list()
for item in info:
    count = item['comments']['count']
    count = int(count)
    lst.append(count)
print sum(lst)
You seem to be confused by the structure of the returned data. Your code assumes the structure is a list of two-level dictionaries. If this were the case, then you could find an individual count like so:
info[7]['comments']['count']
It is actually a dictionary, one item of which is a list of dictionaries. To find a single item, the expression looks like:
info['comments'][7]['count']
So, if we want to iterate over the list, we iterate over info['comments'].
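Concretely, the returned JSON has roughly this shape (values abbreviated here for illustration):
{
  "note": "...",
  "comments": [
    {"name": "...", "count": 97},
    {"name": "...", "count": 90}
  ]
}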
Try this:
import urllib
import json
url = 'http://python-data.dr-chuck.net/comments_304658.json'
data = urllib.urlopen(url).read()
info = json.loads(data)
#print json.dumps(info,indent=4)
lst = list()
for item in info['comments']:
    count = item['count']
    count = int(count)
    lst.append(count)
print sum(lst)
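For reference, a minimal Python 3 equivalent of the fixed code would look like this (same URL, urllib.request instead of urllib):
from urllib.request import urlopen
import json

url = 'http://python-data.dr-chuck.net/comments_304658.json'
info = json.loads(urlopen(url).read())
print(sum(int(item['count']) for item in info['comments']))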
I'm reading an MP3 radio stream using Python and it prints out something like this
INXS~Disappear~RADIOSTATION~~MUSIC~~~360000~~~
I'd like to separate the words and place them in their own table / field.
The fields above are artist, song name, radio station, MUSIC, and some number which I can't identify and which never changes.
I've found something called word split, but I'm unsure if that will work.
I'm also unsure if a space in the song or artist name will cause any problems. The space isn't an underscore or anything clever, it is literally a space.
#!/usr/bin/env python
import urllib2
import datetime
import requests

stream_url = 'http://stream....'
request = urllib2.Request(stream_url)
try:
    request.add_header('Icy-MetaData', 1)
    response = urllib2.urlopen(request)
    icy_metaint_header = response.headers.get('icy-metaint')
    if icy_metaint_header is not None:
        metaint = int(icy_metaint_header)
        read_buffer = metaint + 512
        content = response.read(read_buffer)
        title = content[metaint:].split("'")[1]
        print title
        # post_data = {'artist':'////', 'songname':'/////'}
        # post_response = requests.post(url='http:///////.co.uk', data=post_data)
        print datetime.datetime.now()
        import json
except:
    print 'null'
    # print 'Error'
    # print datetime.datetime.now()
You want to use the string's split method:
stream = "INXS~Disappear~RADIOSTATION~~MUSIC~~~360000~~~"
parts = stream.split("~")
With Python it's even possible to directly assign the list elements returned by the split method to specific variables:
artist, songname, radiostation, music, number = [x for x in stream.split("~") if x]
I used a simple list comprehension to get rid of the empty elements in the list.
Instead of using a list comprehension you could use the filter built-in function to remove the empty elements:
artist, songname, radiostation, music, number = filter(len, stream.split("~"))
The split function will accomplish this for you. More information about string methods is in the Python docs:
https://docs.python.org/2/library/stdtypes.html#str.split
Note that the repeated ~ delimiters produce empty fields, which have to be dropped before unpacking into exactly five variables:
artist, song_name, radio_station, music, misc_number = [f for f in string.split("~") if f]
import re
from pprint import pprint as pp

st = 'INXS~Disappear~RADIOSTATION~~MUSIC~~~360000~~~'
names = ('artist', 'songname', 'radiostation', 'music', 'number')
pp(list(zip(names, re.split(r'~+', st))))
[('artist', 'INXS'),
 ('songname', 'Disappear'),
 ('radiostation', 'RADIOSTATION'),
 ('music', 'MUSIC'),
 ('number', '360000')]