Python - Split string with " "

I know this is a basic question, but I'm having difficulty parsing some text.
To show how the system should work, let's take the following example:
> set title "Hello world"
I should therefore get:
["set title", "Hello world"]
The problem is, I need to split the string so that when I enter, for example:
> plot("data.txt");
it should give me:
["plot", "data.txt"]
I have tried the following:
while True:
    command = raw_input(">");
    parse = command.split("' '");
    if(parse[0] == "set title"):
        title = parse[1];
But this does not work and will not even recognise that I am entering "set title"
Any ideas?

You don't need split. You need re:
import re

def parse(command):
    regex = r'(.*) "(.*)"'
    items = list(re.match(regex, command).groups())
    return items

if __name__ == '__main__':
    command = 'set title "Hello world"'
    print parse(command)
returns
['set title', 'Hello world']
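If you also need to handle the question's second form, plot("data.txt");, one option (a sketch, not part of the original answer; the exact handling of the plot(...) syntax is an assumption) is to try a second pattern when the first one does not match:

import re

def parse(command):
    # 'set title "Hello world"'  ->  ['set title', 'Hello world']
    m = re.match(r'(.*) "(.*)"', command)
    if m:
        return list(m.groups())
    # 'plot("data.txt");'  ->  ['plot', 'data.txt']
    m = re.match(r'(\w+)\("(.*)"\);?', command)
    if m:
        return list(m.groups())
    return None

print parse('set title "Hello world"')   # ['set title', 'Hello world']
print parse('plot("data.txt");')         # ['plot', 'data.txt']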

split("' '")
will split on the literal three-character sequence single quote, space, single quote, which doesn't appear in your command strings.
I think you will need to approach this more like:
command, content = command.split(" ", 1)
if command == "plot":
    plot(content[1:-1])
elif command == "set":
    item, content = content.split(" ", 1)
    if item == "title":
        title = content[1:-1]
...
Note the use of the second argument, which tells split the maximum number of splits to perform; 'set title "foo"'.split(" ", 1) == ['set', 'title "foo"']. Precisely how you implement this will depend on the range of things you want to be able to parse.

To split the string by blanks you need to use
parse = command.split(' ')
For the input "set title" you will get a parse array looking like this
['set', 'title']
where parse[0] == 'set' and parse[1] == 'title'
If you want to test whether your string starts with "set title", either check the command string itself or check the first two indexes of parse.
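For example (a small sketch in the question's Python 2 style, not code from the original answers):

command = raw_input(">")             # e.g. set title "Hello world"
parse = command.split(' ')
if command.startswith("set title"):  # or: parse[0] == "set" and parse[1] == "title"
    title = command.split('"')[1]    # the text between the first pair of quotes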

Related

pyparsing how to SkipTo end of indented block?

I am trying to parse a structure like this with pyparsing:
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
I need something like:
import pyparsing as pp
colon = pp.Suppress(':')
term = pp.Word(pp.alphanums + "_")
description = pp.SkipTo(next_identifier)
definition = term + colon + description
grammar = pp.OneOrMore(definition)
But I am struggling to define the next_identifier of the SkipTo clause since the identifiers may appear freely in the description text.
It seems that I need to include the indentation in the grammar, so that I can SkipTo the next non-indented line.
I tried:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) +
    pp.indentedBlock(
        pp.ZeroOrMore(
            pp.SkipTo(pp.LineEnd())
        ),
        indent_stack
    )
)
But I get the error:
ParseException: not a subentry (at char 55), (line:2, col:1)
Char 55 is at the very beginning of the run-on line:
...will wrap\n on to the next line...
^
Which seems a bit odd, because that char position is clearly followed by the whitespace which makes it an indented subentry.
My traceback in ipdb looks like:
   5311     def checkSubIndent(s,l,t):
   5312         curCol = col(l,s)
   5313         if curCol > indentStack[-1]:
   5314             indentStack.append( curCol )
   5315         else:
-> 5316             raise ParseException(s,l,"not a subentry")
   5317
ipdb> indentStack
[1]
ipdb> curCol
1
I should add that the whole structure above that I'm matching may also be indented (by an unknown amount), so a solution like:
description = pp.Combine(
    pp.SkipTo(pp.LineEnd()) + pp.LineEnd() +
    pp.ZeroOrMore(
        pp.White(' ') + pp.SkipTo(pp.LineEnd()) + pp.LineEnd()
    )
)
...which works for the example as presented, will not work in my case, as it will consume the subsequent definitions.
When you use indentedBlock, the argument you pass in is the expression for each line in the block, so it shouldn't be indentedBlock(ZeroOrMore(line_expression), stack), just indentedBlock(line_expression, stack). Pyparsing includes a built-in expression for "everything from here to the end of the line", called restOfLine, so we will just use that as the expression for each line in the indented block:
import pyparsing as pp

NL = pp.LineEnd().suppress()
label = pp.ungroup(pp.Word(pp.alphas, pp.alphanums+'_') + pp.Suppress(":"))
indent_stack = [1]

# see corrected version below
#description = pp.Group((pp.Empty()
#                        + pp.restOfLine + NL
#                        + pp.ungroup(pp.indentedBlock(pp.restOfLine, indent_stack))))

description = pp.Group(pp.restOfLine + NL
                       + pp.Optional(pp.ungroup(~pp.StringEnd()
                                                + pp.indentedBlock(pp.restOfLine,
                                                                   indent_stack))))

labeled_text = pp.Group(label("label") + pp.Empty() + description("description"))
We use ungroup to remove the extra level of nesting created by indentedBlock but we also need to remove the per-line nesting that is created internally in indentedBlock. We do this with a parse action:
def combine_parts(tokens):
    # recombine description parts into a single list
    tt = tokens[0]
    new_desc = [tt.description[0]]
    new_desc.extend(t[0] for t in tt.description[1:])
    # reassign the rebuilt description into the parsed token structure
    tt['description'] = new_desc
    tt[1][:] = new_desc

labeled_text.addParseAction(combine_parts)
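For testing, the sample text from the question can be kept in a plain string; the parseString call below assumes it is stored in a variable named sample:

sample = """\
identifier: some description text here which will wrap
    on to the next line. the follow-on text should be
    indented. it may contain identifier: and any text
    at all is allowed
next_identifier: more description, short this time
last_identifier: blah blah
"""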
At this point, we are pretty much done. Here is your sample text parsed and dumped:
parsed_data = (pp.OneOrMore(labeled_text)).parseString(sample)
print(parsed_data[0].dump())
['identifier', ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']]
- description: ['some description text here which will wrap', 'on to the next line. the follow-on text should be', 'indented. it may contain identifier: and any text', 'at all is allowed']
- label: 'identifier'
Or this code to pull out the label and description fields:
for item in parsed_data:
    print(item.label)
    print('..' + '\n..'.join(item.description))
    print()
identifier
..some description text here which will wrap
..on to the next line. the follow-on text should be
..indented. it may contain identifier: and any text
..at all is allowed
next_identifier
..more description, short this time
last_identifier
..blah blah

replace value between pattern [duplicate]

I have the string below, and I am able to grab the 'text' that I wanted (the text is wrapped between a pattern). The code is given below:
val1 = '[{"vmdId":"Text1","vmdVersion":"text2","vmId":"text3"},{"vmId":"text4","vmVersion":"text5","vmId":"text6"}]'
import re

temp = val1.split(',')
list_len = len(temp)
for i in range(0, list_len):
    var = temp[i]
    found = re.findall(r':"([^(]*)\&quot\;', var)
    print ''.join(found)
I would like to replace the values (Text1, text2, text3, etc.) with new values provided by the user, or read from another XML. (Text1, text2, ... are totally random alphanumeric data.) Some details below:
Text1 = somename
text2 = alphanumeric value
text3 = somename
Text4 = somename
text5 = alphanumeric value
text6 = somename
anstring =
[{"vmdId":"newText1","vmdVersion":"newtext2","vmId":"newtext3"},{"vmId":"newtext4","vmVersion":"newtext5","vmId":"newtext6"}]
I decided to go with replace() but later realized the data is not constant, hence I'm seeking help again. Any help would be appreciated. Also, let me know if I can improve the way I am grabbing the value right now, as I am new to regex.
You can do this by using backreferences in combination with re.sub:
import re
val1 = '[{"vmdId":"Text1","vmdVersion":"text2","vmId":"text3"},{"vmId":"text4","vmVersion":"text5","vmId":"text6"}]'
ansstring = re.sub(r'(?<=:")([^(]*)', r'new\g<1>' , val1)
print ansstring
\g<1> refers to the text matched by the first group, i.e. the first pair of parentheses.
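As a small stand-alone illustration of the backreference (an assumed example; note it uses the character class [^"]* so each capture stops at the value's closing quote and every value gets the prefix):

import re
s = '"vmdId":"Text1","vmdVersion":"text2"'
print re.sub(r'(?<=:")([^"]*)', r'new\g<1>', s)
# "vmdId":"newText1","vmdVersion":"newtext2"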
EDIT
Maybe a better approach would be to decode the string, change the data, and encode it again. This should allow you to access the values more easily.
import sys

# python2 version
if sys.version_info[0] < 3:
    import HTMLParser
    html = HTMLParser.HTMLParser()
    html_escape_table = {
        "&": "&amp;",
        '"': "&quot;",
        "'": "&apos;",
        ">": "&gt;",
        "<": "&lt;",
    }

    def html_escape(text):
        """Produce entities within text."""
        return "".join(html_escape_table.get(c, c) for c in text)

    html.escape = html_escape
else:
    import html

import json

val1 = '[{"vmdId":"Text1","vmdVersion":"text2","vmId":"text3"},{"vmId":"text4","vmVersion":"text5","vmId":"text6"}]'
print(val1)

unescaped = html.unescape(val1)
json_data = json.loads(unescaped)

for d in json_data:
    d['vmId'] = 'new value'

new_unescaped = json.dumps(json_data)
new_val = html.escape(new_unescaped)
print(new_val)
I hope this helps.

Parsing with placeholders

I am trying to scrape all the different variations of this webpage. For instance, the code that scrapes this webpage
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849
should be the same as the code I use to scrape this webpage
http://www.virginiaequestrian.com/main.cfm?action=greenpages&sub=view&ID=11849
import re
import requests
from bs4 import BeautifulSoup, NavigableString, Tag

def extract_contact(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    tbl = soup.findAll('table')[2]
    list = []
    Contact = tbl.findAll('p')[0]
    for br in Contact.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = re.sub(r'[\n\r\t\xa0]', '', next).replace('Phone:', '').strip()
            list.append(text)
    print list
    #Street=list.pop(0)
    #CityStateZip=list.pop(0)
    #Phone=list.pop(0)
    #City,StateZip= CityStateZip.split(',')
    #State,Zip= StateZip.split(' ')
    #ContactName = Contact.findAll('b')[1]
    #ContactEmail = Contact.findAll('a')[1]
    #Body=tbl.findAll('p')[1]
    #Website = Contact.findAll('a')[2]
    #Email = ContactEmail.text.strip()
    #ContactName = ContactName.text.strip()
    #Website = Website.text.strip()
    #Body = Body.text
    #Body = re.sub(r'[\n\r\t\xa0]','',Body).strip()
    #list.extend([Street,City,State,Zip,ContactName,Phone,Email,Website,Body])
    return list
The way I believe I will need to write the code in order for it to work is to set it up so that print list returns the same number of values, ordered identically. Currently, the above script returns these values:
[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
[u'Alexandria,VA 22305']
Accounting for missing values, in order to be able to parse this page consistently, I need the print list command to return something similar to this:
[u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
['', u'Alexandria,VA 22305', '']
This way I will be able to manipulate values by position (as they will be in a consistent order). The problem is that I don't know how to accomplish this, as I am still very new to parsing. If anybody has any insight as to how to solve this problem I would be highly appreciative.
def extract_contact(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    tbl = soup.findAll('table')[2]
    list = []
    Contact = tbl.findAll('p')[0]
    for br in Contact.findAll('br'):
        next = br.nextSibling
        if not (next and isinstance(next, NavigableString)):
            continue
        next2 = next.nextSibling
        if next2 and isinstance(next2, Tag) and next2.name == 'br':
            text = re.sub(r'[\n\r\t\xa0]', '', next).replace('Phone:', '').strip()
            list.append(text)
    Street = [s for s in list if ',' not in s and '-' not in s]
    CityStateZip = [s for s in list if ',' in s]
    Phone = [s for s in list if '-' in s]
    if not Street:
        Street = ''
    else:
        Street = Street[0]
    if not CityStateZip:
        CityStateZip = ''
    else:
        City, StateZip = CityStateZip[0].split(',')
        State, Zip = StateZip.split(' ')
    if not Phone:
        Phone = ''
    else:
        Phone = Phone[0]
    list = []
I figured out an alternative solution using substrings and if statements. Since there are only three values max in the list, each with distinguishing characteristics, I realized that I could identify each value by looking for special characters rather than by the position of the record.
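A minimal sketch of that idea (an illustration, not the exact code from above), assuming as before that only the city/state/zip value contains a comma and only the phone number contains a dash; it returns a fixed-length list with '' placeholders so the values always land in the same positions:

def normalize(values):
    # values is the raw list scraped from the contact paragraph
    street = next((s for s in values if ',' not in s and '-' not in s), '')
    city_state_zip = next((s for s in values if ',' in s), '')
    phone = next((s for s in values if '-' in s), '')
    return [street, city_state_zip, phone]

print normalize([u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150'])
# [u'2133 Craigs Store Road', u'Afton,VA 22920', u'434-882-3150']
print normalize([u'Alexandria,VA 22305'])
# ['', u'Alexandria,VA 22305', '']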

Python - Delete Conditional Lines of Chat Log File

I am trying to delete my conversation from a chat log file and only analyse the other person's data. When I load the file into Python like this:
with open(chatFile) as f:
    chatLog = f.read().splitlines()
The data is loaded like this (much longer than the example):
'My Name',
'08:39 Chat data....!',
'Other person's name',
'08:39 Chat Data....',
'08:40 Chat data...',
'08:40 Chat data...?',
I would like it to look like this:
'Other person's name',
'08:39 Chat Data....',
'08:40 Chat data...',
'08:40 Chat data...?',
I was thinking of using an if statement with regular expressions:
name = 'My Name'
for x in chatLog:
    if x == name:
        "delete all data below until you reach the other person's name"
I could not get this code to work properly, any ideas?
I think you misunderstand what "regular expressions" means... It doesn't mean you can just write English-language instructions and the Python interpreter will understand them. Either that, or you were using pseudocode, which makes it impossible to debug.
If you don't have the other person's name, we can probably assume it doesn't begin with a number. Assuming all of the non-name lines do begin with a number, as in your example:
name = 'My Name'
skipLines = False
results = []
for x in chatLog:
    if x == name:
        skipLines = True
    elif not x[0].isdigit():
        skipLines = False
    if not skipLines:
        results.append(x)
The same filtering can be written more compactly:
others = []
on = True
for line in chatLog:
    if not line[0].isdigit():
        on = line != name
    if on:
        others.append(line)
You can delete all of your messages using re.sub with an empty string as the second argument, which is your replacement string.
Assuming each chat message starts on a new line beginning with a time stamp, and that nobody's name can begin with a digit, the regular expression pattern re.escape(yourname) + r',\n(?:\d.*?\n)*' should match all of your messages, and then those matches can be replaced with the empty string.
import re

with open(chatfile) as f:
    chatlog = f.read()
yourname = 'My Name'
pattern = re.escape(yourname) + r',\n(?:\d.*?\n)*'
others_messages = re.sub(pattern, '', chatlog)
print(others_messages)
This will work to delete the messages of any user from any chat log where an arbitrary number of users are chatting.

how to get the info which contains 'sss', not "="

this is my code:
class Marker_latlng(db.Model):
geo_pt = db.GeoPtProperty()
class Marker_info(db.Model):
info = db.StringProperty()
marker_latlng =db.ReferenceProperty(Marker_latlng)
q = Marker_info.all()
q.filter("info =", "sss")
But how do I get the info which contains 'sss', not one that is exactly equal to it? Is there a method like "contains", so that I could write:
q = Marker_info.all()
q.filter("info contains", "sss")
Instead of using a StringProperty, you could use a StringListProperty. Before saving the info string, split it into a list of strings, containing each word.
Then, when you use q.filter("info =", "sss"), it will match any item whose word list contains a word equal to "sss".
For something more general, you could look into App Engine's full text search.
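A minimal sketch of that suggestion (an illustration using the classic google.appengine.ext.db API, with info switched to a StringListProperty and the text split into words before saving; the sample text is made up):

from google.appengine.ext import db

class Marker_latlng(db.Model):
    geo_pt = db.GeoPtProperty()

class Marker_info(db.Model):
    # previously db.StringProperty(); now holds the individual words of the info text
    info = db.StringListProperty()
    marker_latlng = db.ReferenceProperty(Marker_latlng)

# split the text into words before saving
text = "some info that mentions sss somewhere"
Marker_info(info=text.split()).put()

# on a list property, "info =" matches any entity whose list contains the value
q = Marker_info.all()
q.filter("info =", "sss")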
