extract value information from python string - python

I have a string output and I would like to extract the value of str_data from it. I'm currently using the code below, but I think it can be improved. It does not work well with str_data=[''] and str_data=['L'm'], where it raises a list index out of range error. str_data contains language information, so it could be empty or contain words like "it's". Is there any way to improve this? Thanks
right = result.split("str_data=['")[1]
final = right.split("'], extra='")[0]
Example 1:
result = TensorSet(tensors={'result': Tensor(shape=['5'], str_data=['ขอคุยด้วยหน่อย'], extra={})}, extra={}
Example 2:
result = TensorSet(tensors={'result': Tensor(shape=['102'], str_data=[''], extra={})}, extra={}
Example 3:
result = TensorSet(tensors={'result': Tensor(shape=[], str_data=['L'm'], extra={})}, extra={}
I would like to extract out:
example_1_result = 'ขอคุยด้วยหน่อย'
example_2_result = ''
example_3_result = 'L'm'

Assuming TensorSet(...) is a string that will always be formatted with the same arguments, something like this will work:
final = result.split(",")[1].split("str_data=")[1].replace("[","").replace("]","")
Here's a breakdown:
Example input:
result = "TensorSet(tensors={'result': Tensor(shape=['5'], str_data=['ขอคุยด้วยหน่อย'], extra={})}, extra={})"
>>> result.split(",")[1]
" str_data=['ขอคุยด้วยหน่อย']"
>>> data = result.split(",")[1]
>>> data.split("str_data=")[1]
"['ขอคุยด้วยหน่อย']"
>>> content = data.split("str_data=")[1]
>>> content.replace("[","").replace("]","")
"'ขอคุยด้วยหน่อย'"
>>> final = content.replace("[","").replace("]","")
>>> final
"'ขอคุยด้วยหน่อย'"
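The comma-based split above keeps the surrounding quotes and depends on the text containing no commas or brackets. As a hedged alternative, here is a minimal regex sketch that captures whatever sits between str_data=[' and '], extra=, which also covers the empty and apostrophe cases from the question. The function name extract_str_data and the pattern name are made up for this illustration, and the sketch assumes the output always has the "'], extra=" suffix right after str_data.

```python
import re

# Capture everything between "str_data=['" and the closing "'], extra=".
# The non-greedy group may be empty and may contain apostrophes.
STR_DATA_PATTERN = re.compile(r"str_data=\['(.*?)'\], extra=", re.DOTALL)


def extract_str_data(result):
    """Return the text inside str_data=['...'], or None if it is absent."""
    match = STR_DATA_PATTERN.search(result)
    return match.group(1) if match else None


examples = [
    "TensorSet(tensors={'result': Tensor(shape=['5'], str_data=['ขอคุยด้วยหน่อย'], extra={})}, extra={})",
    "TensorSet(tensors={'result': Tensor(shape=['102'], str_data=[''], extra={})}, extra={})",
    "TensorSet(tensors={'result': Tensor(shape=[], str_data=['L'm'], extra={})}, extra={})",
]
for text in examples:
    print(repr(extract_str_data(text)))
```

If the text itself could contain the sequence '], extra= the match would stop early, so this is only as reliable as the assumption about the format.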

Related

python string split slice and into a list

I have a string for example "streemlocalbbv"
and I have my_function that takes this string and a string that I want to find ("loc") in the original string. What I want returned is this:
my_function("streemlocalbbv", "loc")
output = ["streem","loc","albbv"]
what I did so far is
def find_split(string, find_word):
    length = len(string)
    find_word_start_index = string.find(find_word)
    find_word_end_index = find_word_start_index + len(find_word)
    string[find_word_start_index:find_word_end_index]
    a = string[0:find_word_start_index]
    b = string[find_word_start_index:find_word_end_index]
    c = string[find_word_end_index:length]
    return [a, b, c]
Trying to find the index of the string I am looking for in the original string, and then split the original string. But from here I am not sure how should I do it.
You can use str.partition which does exactly what you want:
>>> "streemlocalbbv".partition("loc")
('streem', 'loc', 'albbv')
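Since the question asks for a list rather than a tuple, a small wrapper around str.partition could look like the sketch below. It reuses the find_split name from the question; the behavior when the word is missing (two empty strings at the end) comes from str.partition itself.

```python
def find_split(string, find_word):
    """Split string around the first occurrence of find_word."""
    # str.partition returns a 3-tuple: (before, separator, after).
    # If find_word is absent, it returns (string, '', '').
    before, sep, after = string.partition(find_word)
    return [before, sep, after]


print(find_split("streemlocalbbv", "loc"))
```

Only the first occurrence is used as the split point, so any later "loc" stays in the last element.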
Use the split function:
def find_split(string, find_word):
    # maxsplit=1 keeps everything after the first match in one piece
    ends = string.split(find_word, 1)
    return [ends[0], find_word, ends[1]]
Use the split, index and insert function to solve this
def my_function(word, split_by):
    l = word.split(split_by)
    # put the separator back right after the part that precedes it
    l.insert(l.index(word[:word.find(split_by)]) + 1, split_by)
    return l
print(my_function("streemlocalbbv", "loc"))
# ['streem', 'loc', 'albbv']

How to web scrape all of the batters names?

I would like to scrape all of the MLB batters stats for 2018. Here is my code so far:
#import modules
from urllib.request import urlopen
from lxml import html
#fetch url/html
response = urlopen("https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml")
content = response.read()
tree = html.fromstring( content )
#parse data
comment_html = tree.xpath('//comment()[contains(., "players_standard_batting")]')[0]
comment_html = str(comment_html).replace("-->", "")
comment_html = comment_html.replace("<!--", "")
tree = html.fromstring( comment_html )
for batter_row in tree.xpath('//table[@id="players_standard_batting"]/tbody/tr[contains(@class, "full_table")]'):
    csk = batter_row.xpath('./td[@data-stat="player"]/@csk')[0]
When I scraped all of the batters there is 0.01 attached to each name. I tried to remove attached numbers using the following code:
bat_data = [csk]
string = '0.01'
result = []
for x in bat_data:
    if string in x:
        substring = x.replace(string, '')
        if substring != "":
            result.append(substring)
    else:
        result.append(x)
print(result)
This code removed the number, however, only the last name was printed:
Output:
['Zunino, Mike']
Also, there is a bracket and quotations around the name. The name is also in reverse order.
1) How can I print all of the batters names?
2) How can I remove the quotation marks and brackets?
3) Can I reverse the order of the names so the first name gets printed and then the last name?
The final output I am hoping for would be all of the batters names like so: Mike Zunino.
I am new to this site... I am also new to scraping/coding and will greatly appreciate any help I can get! =)
There are several ways to do this. Here is one approach that doesn't require post-processing; you get the names in the form you wanted:
from urllib.request import urlopen
from lxml.html import fromstring
url = "https://www.baseball-reference.com/leagues/MLB/2018-standard-batting.shtml"
content = urlopen(url).read().decode("utf-8")  # decode the bytes instead of calling str() on them
comment = content.replace("-->","").replace("<!--","")
tree = fromstring(comment)
for batter_row in tree.xpath('//table[contains(@class,"stats_table")]//tr[contains(@class,"full_table")]'):
    csk = batter_row.xpath('.//td[@data-stat="player"]/a')[0].text
    print(csk)
The output looks like this:
Jose Abreu
Ronald Acuna
Jason Adam
Willy Adames
Austin L. Adams
You get only the last batter because you are overwriting the value of csk each time in your first loop. Initialize the empty list bat_data first and then add each batter to it.
bat_data = []
for batter_row in blah:
    csk = blah
    bat_data.append(csk)
This will give you a list of all batters, ['Abreu,Jose0.01', 'Acuna,Ronald0.01', 'Adam,Jason0.01', ...]
Then loop through this list but you don't have to check if string is in the name. Just do x.replace('0.01', '') and then check if the string is empty.
To reverse the order of the names
substring = substring.split(',')
substring.reverse()
nn = " ".join(substring)
Then append nn to the result.
You are getting the quotes and the brackets because you are printing the list. Instead iterate through the list and print each item.
Your code edited assuming you got bat_data correctly:
for x in bat_data:
    substring = x.replace(string, '')
    if substring != "":
        substring = substring.split(',')
        substring.reverse()
        substring = ' '.join(substring)
        result.append(substring)
for x in result:
    print(x)
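The clean-up steps described in this answer (drop the 0.01 suffix, split on the comma, reverse, join) can be collected into one small helper. This is only a sketch: tidy_name is a hypothetical name, and the sample list below is made-up input in the shape the answer describes, not scraped data. Stripping each part is an added assumption so that both "Last,First" and "Last, First" give the same result.

```python
def tidy_name(raw, suffix="0.01"):
    """Turn 'Last,First0.01' into 'First Last'."""
    cleaned = raw.replace(suffix, "")
    # Strip each piece so an optional space after the comma does not leak through.
    parts = [part.strip() for part in cleaned.split(",")]
    return " ".join(reversed(parts))


# Sample values in the shape described above; not real scraped output.
bat_data = ["Abreu,Jose0.01", "Acuna,Ronald0.01", "Zunino, Mike"]
result = [tidy_name(raw) for raw in bat_data]
for name in result:
    print(name)
```

Printing each item separately, rather than the list, is what removes the brackets and quotes.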
1) Print all batter names
print(result)
This will print everything in the result object. If it’s not printing what you expect then there’s something else wrong going on.
2) Remove quotations
The brackets and quotes appear because you are printing a list object. Try this...
print(result[0])
This will tell the interpreter to print result at the 0 index.
3) Reverse order of names
Try
name = " ".join(result[0].split(", ")[::-1])

x.split has no effect

For some reason x.split(':', 1)[-1] doesn't do anything. Could someone explain and maybe help me?
I'm trying to remove the data before : (including ":") but it keeps that data anyway
Code
data = { 'state': 1, 'endTime': 1518852709307, 'fileSize': 000000 }
data = data.strip('{}')
data = data.split(',')
for x in data:
    x.split(':', 1)[-1]
    print(x)
Output
"state":1
"endTime":1518852709307
"fileSize":16777216
It's a dictionary, not a list of strings.
I think this is what you're looking for:
data = str({"state":1,"endTime":1518852709307,"fileSize":000000})  # add a str() here
data = data.strip('{}')
data = data.split(',')
for x in data:
    x = x.split(':')[-1].strip()  # assign the result back to x and drop the leading space
    print(x)
The script above prints:
1
1518852709307
0
Here is a one-liner version (applied to the original dictionary rather than to the string):
print(list(data.values()))
Prints out:
[1, 1518852709307, 0]
Which is a list of integers.
Seems like you just want the values in the dictionary
data = {"state":1,"endTime":1518852709307,"fileSize":000000}
for x in data:
    print(data[x])
I'm not sure, but I think it's because iterating over a dictionary yields only its keys, so x is the string "state" rather than "state":1. Splitting "state" on a colon changes nothing, because it contains none.
You could make the entire dictionary into a string by putting:
data = str({ Your Dictionary Here })
then split it on commas and print what is left inside the for loop, like so:
for x in data.strip('{}').split(','):
    b = x.split(':', 1)[-1].strip()  # creating a new string
    print(b)
data in your code is a dictionary, so you can access its values directly, like data['state'], which evaluates to 1.
If you get this data as a string like:
data = "{'state':1, 'endTime':1518852709307, 'fileSize':000000}"
You could use json.loads to convert it into a dictionary and access the data like explained above.
import json
data = '{"state":1, "endTime":1518852709307, "fileSize":0}'
data = json.loads(data)
for _, v in data.items():
    print(v)
If you want to parse the string yourself this should work:
data = '{"state":1,"endTime":1518852709307,"fileSize":000000}'
data = data.strip('{}')
data = data.split(',')
for x in data:
    x = x.split(':')[-1]
    print(x)
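json.loads only accepts double-quoted keys, so it would reject a string that looks like Python's own dictionary output with single quotes. For that case, one option from the standard library is ast.literal_eval. The sketch below assumes the string contains nothing but a plain dictionary literal; the variable names are illustrative.

```python
import ast

# A dictionary written out with single quotes, as str(dict) would produce it.
text = "{'state': 1, 'endTime': 1518852709307, 'fileSize': 0}"

# literal_eval parses Python literals only; it does not run arbitrary code.
data = ast.literal_eval(text)
values = list(data.values())
print(values)
```

Once the string is back to a real dictionary, the values can be read directly instead of being cut out with split.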

How would I get rid of certain characters then output a cleaned up string In python?

In this snippet of code I am trying to obtain the links to images posted in a groupchat by a certain user:
import groupy
from groupy import Bot, Group, Member
prog_group = Group.list().first
prog_members = prog_group.members()
prog_messages = prog_group.messages()
rojer = str(prog_members[4])
rojer_messages = ['none']
rojer_pics = []
links = open('rojer_pics.txt', 'w')
print(prog_group)
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            links.write(str(message) + '\n')
links.close()
The issue is that in the links file it prints the entire message: ("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>"
What I want to do is get rid of the characters that aren't part of the URL, so it is written like so:
"https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12"
Are there any methods in Python that can manipulate a string like this?
I just used string.split() and split it into 3 parts on the single quotes:
for message in prog_messages:
    if message.name == rojer:
        rojer_messages.append(message)
        if message.attachments:
            link = str(message).split("'")
            rojer_pics.append(link[1])
            links.write(str(link[1]) + '\n')
This can be done using string indices and the string method .find():
>>> url = "(\"Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')"
>>> url = url[url.find('+')+1:-2]
>>> url
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
>>>
>>> string = '("Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12\')>"'
>>> string.split('+')[1][:-4]
'https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12'
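Both slicing approaches above depend on the exact number of trailing characters. A hedged alternative is to search for the URL itself with a regular expression. The pattern below is a deliberately simple assumption (a URL runs until whitespace, a quote, a closing parenthesis or ">"), and first_url is a name invented for this sketch.

```python
import re

# Simplified URL pattern: scheme, then anything up to whitespace, a quote, ")" or ">".
URL_PATTERN = re.compile(r"""https?://[^\s'")>]+""")


def first_url(text):
    """Return the first URL-looking substring, or None if there is none."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


message = "(\"Rojer Doewns: Heres a special one +https://i.groupme.com/406x1199.png.7679b4f1ee964656bde93448ff9cee12')>\""
print(first_url(message))
```

This keeps working if the text before or after the link changes length, which the fixed slices would not.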

Python read .txt File -> list

I have a .txt File and I want to get the values in a list.
The format of the txt file should be:
value0,timestamp0
value1,timestamp1
...
...
...
In the end I want to get a list with
[[value0,timestamp0],[value1,timestamp1],.....]
I know it's easy to get these values by
direction = []
for line in open(filename):
    # use separate names so the list is not overwritten by the float
    value, t = line.strip().split(',')
    value = float(value)
    t = long(t)
    direction.append([value, t])
return direction
But I have a big problem: when creating the data I forgot to insert a "\n" after each row.
That's why I have this format:
value0,timestamp0value1,timestamp1value2,timestamp2value3.....
Every timestamp has exactly 13 characters.
Is there a way to get this data into a list the way I want? It would be a lot of work to collect the data again.
Thanks
Max
import re
text = "value0,0123456789012value1,0123456789012value2,0123456789012value3"
# each match is (whole pair, value, 13-character timestamp)
for (line, value, timestamp) in re.findall("(([^,]+),(.{13}))", text):
    print(value, timestamp)
You will have to strip the trailing comma, but you can insert a comma after every 13 characters that follow a comma:
import re
s = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531-0.46467665,1466615025561-0.5533287,1466615025591-0.63311553,1466615025621-0.7049236,1466615025652-0.7695509,1466615025681-1.7158673,1466615025711-1.6896278,1466615025741-1.65375,1466615025772-1.6092329,1466615025801"
print(re.sub("(?<=,)(.{13})",r"\1"+",", s))
Which will give you:
-0.1351197,1466615025472,-0.25672746,1466615025501,-0.3661744,1466615025531,-0.46467665,1466615025561,-0.5533287,1466615025591,-0.63311553,1466615025621,-0.7049236,1466615025652,-0.7695509,1466615025681,-1.7158673,1466615025711,-1.6896278,1466615025741,-1.65375,1466615025772,-1.6092329,1466615025801,
I coded a quickie using your example, and not using 13 but len("timestamp") so you can adapt
instr = "value,timestampvalue2,timestampvalue3,timestampvalue4,timestamp"
previous_i = 0
for i, c in enumerate(instr):
    if c == ",":
        next_i = i + len("timestamp") + 1
        print(instr[previous_i:next_i])
        previous_i = next_i
The output is unscrambled:
value,timestamp
value2,timestamp
value3,timestamp
value4,timestamp
I think you could do something like this:
direction = []
for line in open(filename):
    parts = line.split(',')  # avoid naming this "list", which shadows the builtin
    v = parts[0]
    for s in parts[1:]:
        t = s[:13]
        direction.append([float(v), long(t)])
        v = s[13:]
If you're using python 3.X, then the long function no longer exists -- use int.
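For Python 3, the same idea can be written in a few lines by letting a regular expression pick out each value and its 13-digit timestamp. This is a sketch under the question's stated assumption that every timestamp has exactly 13 characters; it further assumes they are all digits. parse_pairs and the sample string are illustrative names and data.

```python
import re

# A value (anything up to the comma) followed by exactly 13 timestamp digits.
PAIR_PATTERN = re.compile(r"([^,]+),(\d{13})")


def parse_pairs(text):
    """Return [[value, timestamp], ...] from text with no line breaks."""
    return [[float(value), int(stamp)] for value, stamp in PAIR_PATTERN.findall(text)]


sample = "-0.1351197,1466615025472-0.25672746,1466615025501-0.3661744,1466615025531"
print(parse_pairs(sample))
```

For a file, the same function could be applied to the full file contents read as one string.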
