Split words into variables - python

I'm reading an MP3 radio stream using Python and it prints out something like this
INXS~Disappear~RADIOSTATION~~MUSIC~~~360000~~~
I'd like to separate the words and place them in their own table / field.
The fields above are Artist, SongName, RadioStation, MUSIC, and a number whose meaning I don't know and which never changes.
I've found something called word split but unsure if that will work.
I'm also unsure if a space in the song or artist name will cause any problems. The space isn't an underscore or anything clever, it is literally a space.
#!/usr/bin/env python
import urllib2
import datetime
import requests
stream_url = 'http://stream....'
request = urllib2.Request(stream_url)
try:
    request.add_header('Icy-MetaData', 1)
    response = urllib2.urlopen(request)
    icy_metaint_header = response.headers.get('icy-metaint')
    if icy_metaint_header is not None:
        metaint = int(icy_metaint_header)
        read_buffer = metaint+512
        content = response.read(read_buffer)
        title = content[metaint:].split("'")[1]
        print title
        # post_data = {'artist':'////', 'songname':'/////'}
        # post_response = requests.post(url='http:///////.co.uk', data=post_data)
        print datetime.datetime.now()
        import json
except:
    print 'null'
    # print 'Error'
    # print datetime.datetime.now()

You want to use the string's split method:
stream = "INXS~Disappear~RADIOSTATION~~MUSIC~~~360000~~~"
parts = stream.split("~")
With Python, you can also assign the list elements returned by split directly to variables:
artist, songname, radiostation, music, number = [x for x in stream.split("~") if x]
I used a simple list comprehension to get rid of the empty elements in the list.
Instead of using a list comprehension you could use the filter built-in function to remove the empty elements:
artist, songname, radiostation, music, number = filter(len, stream.split("~"))
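If named fields are more convenient than loose variables, the non-empty parts can be paired with field names in a dictionary. This is only a sketch: the helper name parse_stream_title, the field names, and the sample title are made up for illustration, and it assumes the stream always sends five non-empty fields in this order. Spaces inside an artist or song name are harmless, because only ~ is treated as a separator.

```python
# Hypothetical field names; adjust to whatever the stream actually sends.
FIELDS = ("artist", "songname", "radiostation", "category", "number")

def parse_stream_title(raw, sep="~"):
    """Split a '~'-separated title and map the non-empty parts onto FIELDS."""
    parts = [p for p in raw.split(sep) if p]  # drop the empty strings
    if len(parts) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(parts)))
    return dict(zip(FIELDS, parts))

# Made-up title containing spaces, in the same layout as the question's example.
info = parse_stream_title("Some Band~A Song Title~RADIOSTATION~~MUSIC~~~360000~~~")
print(info["artist"])
print(info["songname"])
```

Raising on an unexpected field count is a design choice; it makes a changed stream layout visible instead of silently mislabelling fields.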

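The split method will do this for you. The Python docs have more on string methods:
https://docs.python.org/2/library/stdtypes.html#str.split
Because of the doubled separators, a plain split("~") also returns empty strings, so drop them before unpacking:
artist, song_name, radio_station, music, misc_number = [part for part in string.split("~") if part]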

import re
from pprint import pprint as pp
st = 'INXS~Disappear~RADIOSTATION~~MUSIC~~~360000~~~'
names = ('artist', 'songname', 'radiostation', 'music', 'number')
pp(list(zip(names, re.split(r'~+', st))))
[('artist', 'INXS'),
 ('songname', 'Disappear'),
 ('radiostation', 'RADIOSTATION'),
 ('music', 'MUSIC'),
 ('number', '360000')]

Related

How to fix : TypeError: normalize() argument 2 must be str, not list

I'm making an api call that pulls the desired endpoints from ...url/articles.json and transforms it into a csv file. My problem here is that the ['labels_name'] endpoint is a string with multiple values.(an article might have multiple labels)
How can I pull multiple values of a string without getting this error?
File "articles_labels.py", line 40, in <module>
    decode_3 = unicodedata.normalize('NFKD', article_label)
TypeError: normalize() argument 2 must be str, not list
import requests
import csv
import unicodedata
import getpass
url = 'https://......./articles.json'
user = ' '
pwd = ' '
csvfile = 'articles_labels.csv'
output_1 = []
output_1.append("id")
output_2 = []
output_2.append("title")
output_3 = []
output_3.append("label_names")
output_4 = []
output_4.append("link")
while url:
    response = requests.get(url, auth=(user, pwd))
    data = response.json()
    for article in data['articles']:
        article_id = article['id']
        decode_1 = int(article_id)
        output_1.append(decode_1)
    for article in data['articles']:
        title = article['title']
        decode_2 = unicodedata.normalize('NFKD', title)
        output_2.append(decode_2)
    for article in data['articles']:
        article_label = article['label_names']
        decode_3 = unicodedata.normalize('NFKD', article_label)
        output_3.append(decode_3)
    for article in data['articles']:
        article_url = article['html_url']
        decode_3 = unicodedata.normalize('NFKD', article_url)
        output_3.append(decode_3)
    print(data['next_page'])
    url = data['next_page']
print("Number of articles:")
print(len(output_1))
with open(csvfile, 'w') as fp:
    writer = csv.writer(fp,dialect = 'excel')
    writer.writerows([output_1])
    writer.writerows([output_2])
    writer.writerows([output_3])
    writer.writerows([output_4])
My problem here is that the ['labels_name'] endpoint is a string with multiple values.(an article might have multiple labels) How can I pull multiple values of a string
It's a list not a string, so you don't have "a string with multiple values" you have a list of multiple strings, already, as-is.
The question is what you want to do with them, CSV certainly isn't going to handle that, so you must decide on a way to serialise a list of strings to a single string e.g. by joining them together (with some separator like space or comma) or by just picking the first one (beware to handle the case where there is none), … either way the issue is not really technical.
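One way to serialise the labels, sketched under the assumption that joining them with a comma and a space is acceptable in your CSV. The helper name labels_to_cell and the sample labels are made up for illustration; each label is normalized on its own, because unicodedata.normalize only accepts a single string.

```python
import unicodedata

def labels_to_cell(labels, sep=", "):
    """Normalize each label, then join them into one string for a CSV cell."""
    # "labels or []" also covers an article whose label list is missing (None).
    return sep.join(unicodedata.normalize("NFKD", label) for label in (labels or []))

print(labels_to_cell(["billing", "how-to"]))
print(repr(labels_to_cell([])))
```

An article without labels simply yields an empty cell, which keeps the label column aligned with the id and title columns.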
unicodedata.normalize takes a unicode string, not a list, as the error says. The correct way to use unicodedata.normalize is (example taken from "How does unicodedata.normalize(form, unistr) work?"):
from unicodedata import normalize
print(normalize('NFD', u'\u00C7'))
print(normalize('NFC', u'C\u0327'))
#Ç
#Ç
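Hence you need to make sure that the second argument to unicodedata.normalize is a unicode string, as title is in unicodedata.normalize('NFKD', title). article['label_names'] is a list, so normalize each string in it rather than the list itself.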

How to web scrape all of the batters names?

I would like to scrape all of the MLB batters stats for 2018. Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml import html
#fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring( content )
#parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring( comment_html )
for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
When I scraped all of the batters there is 0.01 attached to each name. I tried to remove attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data:
    if string in x:
        substring = x.replace(string,'')
        if substring != "":
            result.append(substring)
    else:
        result.append(x)
print(result)
This code removed the number, however, only the last name was printed:
Output:
['Zunino, Mike']
Also, there is a bracket and quotations around the name. The name is also in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
There are several ways to do this. Here is one approach that needs no post-processing and gives you the names in the form you want:
from urllib.request import urlopen
from lxml.html import fromstring
url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = str(urlopen(url).read())
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    csk = batter_row.xpath('.//td[@data-stat="player"]/a')[0].text
    print(csk)
The output should look like this:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data = []
for batter_row in blah:
    csk = blah
    bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list but you don't have to check if string is in the name. Just do x.replace('0.01', '') and then check if the string is empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited assuming you got bat_data correctly:
for x in bat_data:
    substring = x.replace(string,'')
    if substring != "":
        substring = substring.split(',')
        substring.reverse()
        substring = ' '.join(substring)
        result.append(substring)
for x in result:
    print(x)
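The same clean-up can be wrapped in a small helper. This is a sketch only: the name clean_name is made up, and the sample values assume the scraped strings look like 'Last,First0.01' or 'Last, First0.01'. Stripping each part takes care of any space after the comma.

```python
def clean_name(raw, suffix="0.01"):
    """Turn 'Last, First0.01' into 'First Last'."""
    name = raw.replace(suffix, "")                       # drop the trailing number
    parts = [part.strip() for part in name.split(",")]   # ['Last', 'First']
    return " ".join(reversed(parts))                     # 'First Last'

# Sample values in the shape described above (assumed, not scraped here).
bat_data = ["Abreu,Jose0.01", "Zunino, Mike0.01"]
for x in bat_data:
    print(clean_name(x))
```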
1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets and quotation marks appear because you are printing a list object. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try
name = " ".join(result[0].split(", ")[::-1])

How would I get rid of certain characters then output a cleaned up string In python?

In this snippet of code I am trying to obtain the links to images posted in a groupchat by a certain user:
import groupy
from groupy import Bot, Group, Member
prog_group = Group.list().first
prog_members = prog_group.members()
prog_messages = prog_group.messages()
rojer = str(prog_members[4])
rojer_messages = ['none']
rojer_pics = []
links = open('rojer_pics.txt', 'w')
print(prog_group)
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            links.write(str(message) + '\n')
links.close()
The issue is that in the links file it prints the entire message: ("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>"
What I am wanting to do, is to get rid of characters that aren't part of the URL so it is written like so:
"https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12"
Are there any methods in Python that can manipulate a string like this?
I just used str.split() to split the message into 3 parts at the single quotes and took the middle part:
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            link = str(message).split("'")
            rojer_pics.append(link[1])
            links.write(str(link[1]) + '\n')
This can be done using string indices and the string method .find():
>>> url = "(\"Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')"
>>> url = url[url.find('+')+1:-2]
>>> url
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
>>>
>>> string = '("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12\')>"'
>>> string.split('+')[1][:-4]
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
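Slicing with fixed offsets breaks as soon as the text around the link changes. A regular expression that looks for the URL itself is a bit more forgiving. This is a sketch: the pattern and the helper name extract_url are my own, and the pattern assumes the link contains no whitespace, quotes, angle brackets or parentheses.

```python
import re

# Match http(s):// followed by everything up to whitespace, a quote or a bracket.
URL_RE = re.compile(r"https?://[^\s'\"<>()]+")

def extract_url(text):
    """Return the first URL found in text, or None if there is none."""
    match = URL_RE.search(text)
    return match.group(0) if match else None

message = "(\"Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>\""
print(extract_url(message))
```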

From .txt file to dictionary

I'm using Python 2.7 and need some help with an algorithm.
The function needs to read some data. The data model looks like this:
# some_album *song_name::writer::duration::song_lyrics
This pattern repeats throughout the txt file. I need to get at every field, such as the album name and the song name, using split().
I have some questions:
How can I use split() between two characters? For example, to get an album name, split between # and *.
I want to turn the whole txt file into a dictionary where the albums are the keys and each value is another dictionary whose keys are the song names and whose values are lists of all the lyrics in the song. My question is how to do this with a loop (or any other idea), because I want to process the whole txt file, not just part of it.
This is what I have so far:
data_file = open("<someplace>","r")
data = data_file.readlines()
data = str(data)
i=0
for i in data:
    albums= {data.split('#','*')[0] : data.split("::")[0]}
This is meant to print just the album and the name of the first song. I don't understand how to do it with a loop.
For your first question, I would recommend the regular expression module, re:
>>> import re
>>> s = 'py=th;on'
>>> lst = re.split("=|;", s)
>>> lst[1]
'th'
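For the second question, a loop over the lines can build the nested dictionary. The sketch below makes assumptions the question does not confirm: each album starts on a line beginning with #, each song is on its own line beginning with *, and the lyrics are stored as a list of words. The function name parse_albums and the sample lines are made up for illustration.

```python
def parse_albums(lines):
    """Build {album: {song: [lyric words]}} from '#album' and '*song::...' lines."""
    albums = {}
    current = None  # name of the album the following songs belong to
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            current = line[1:].strip()
            albums[current] = {}
        elif line.startswith("*") and current is not None:
            # At most 3 splits, so a '::' inside the lyrics stays in the lyrics.
            name, writer, duration, lyrics = line[1:].split("::", 3)
            albums[current][name.strip()] = lyrics.split()
    return albums

# Made-up sample in the assumed layout.
sample = [
    "#First Album",
    "*Song A::Writer One::3:05::la la la",
    "*Song B::Writer Two::2:40::na na",
]
print(parse_albums(sample))
```

To run it on a real file, pass the open file object itself, which iterates line by line: parse_albums(open(path)).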

Tuple trouble when trying to count elements in a list?

I am trying to count the number of contractions used by politicians in certain speeches. I have lots of speeches, but here are some of the URLs as a sample:
every_link_test = ['http://www.millercenter.org/president/obama/speeches/speech-4427',
'http://www.millercenter.org/president/obama/speeches/speech-4424',
'http://www.millercenter.org/president/obama/speeches/speech-4453',
'http://www.millercenter.org/president/obama/speeches/speech-4612',
'http://www.millercenter.org/president/obama/speeches/speech-5502']
I have a pretty rough counter right now - it only counts the total number of contractions used in all of those links. For example, the following code returns 79,101,101,182,224 for the five links above. However, I want to link up filename, a variable I create below, so I would have something like (speech_1, 79),(speech_2, 22),(speech_3,0),(speech_4,81),(speech_5,42). That way, I can track the number of contractions used in each individual speech. I'm getting the following error with my code: AttributeError: 'tuple' object has no attribute 'split'
Here's my code:
import urllib2,sys,os
from bs4 import BeautifulSoup,NavigableString
from string import punctuation as p
from multiprocessing import Pool
import re, nltk
import requests
reload(sys)
url = 'http://www.millercenter.org/president/speeches'
url2 = 'http://www.millercenter.org'
conn = urllib2.urlopen(url)
html = conn.read()
miller_center_soup = BeautifulSoup(html)
links = miller_center_soup.find_all('a')
linklist = [tag.get('href') for tag in links if tag.get('href') is not None]
# remove all items in list that don't contain 'speeches'
linkslist = [_ for _ in linklist if re.search('speeches',_)]
del linkslist[0:2]
# concatenate 'http://www.millercenter.org' with each speech's URL ending
every_link_dups = [url2 + end_link for end_link in linkslist]
# remove duplicates
seen = set()
every_link = [] # no duplicates array
for l in every_link_dups:
    if l not in seen:
        every_link.append(l)
        seen.add(l)
def processURL_short_2(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div',{'id':'transcript'},{'class':'displaytext'})
    item_str = item_div.text.lower()
    splitlink = l.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1]
    filename = "{0}_{1}".format(president, speech_num)
    return item_str, filename
every_link_test = every_link[0:5]
print every_link_test
count = 0
for l in every_link_test:
    content_1 = processURL_short_2(l)
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    print count, filename
As the error message explains, you cannot use split the way you are using it. split is a string method, but processURL_short_2 returns a tuple.
So you will need to change this:
for word in content_1.split():
to this:
for word in content_1[0].split():
I chose [0] because the first element of the returned tuple is item_str, the text you want to search through.
@TigerhawkT3 has a good suggestion in their answer that you should follow too:
https://stackoverflow.com/a/32981533/1832539
Instead of print count, filename, you should save these data to a data structure, like a dictionary. Since processURL_short_2 has been modified to return a tuple, you'll need to unpack it. Reset count for each speech so that the totals don't accumulate across speeches.
data = {} # initialize a dictionary
for l in every_link_test:
    content_1, filename = processURL_short_2(l) # unpack the content and filename
    count = 0 # start a fresh count for this speech
    for word in content_1.split():
        word = word.strip(p)
        if word in contractions:
            count = count + 1
    data[filename] = count # add this to the dictionary as filename:count
This would give you a dictionary like {'obama_speech-4427': 79, 'obama_speech-4424': 22, ...}, allowing you to easily store and access your parsed data.
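The counting step can also be pulled into a function of its own, which makes it easy to check without fetching any pages. This is a sketch: the function name count_contractions, the small contractions set, and the sample texts are all made up for illustration.

```python
from string import punctuation

def count_contractions(text, contractions):
    """Count the words in text that, stripped of outer punctuation, are contractions."""
    return sum(1 for word in text.lower().split()
               if word.strip(punctuation) in contractions)

# Tiny made-up inputs; real code would use the scraped speech text instead.
contractions = {"don't", "can't", "it's"}
speeches = {
    "speech_1": "We can't stop. Don't stop!",
    "speech_2": "No contractions here.",
}
counts = {name: count_contractions(text, contractions)
          for name, text in speeches.items()}
print(counts)
```

Because the count is computed inside the function, each speech starts from zero and the totals cannot leak from one speech into the next.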
