Python shlex.split(), ignore single quotes - python

How, in Python, can I use shlex.split() or similar to split strings, preserving only double quotes? For example, if the input is "hello, world" is what 'i say' then the output would be ["hello, world", "is", "what", "'i", "say'"].

import shlex
def newSplit(value):
    lex = shlex.shlex(value)
    lex.quotes = '"'
    lex.whitespace_split = True
    lex.commenters = ''
    return list(lex)
print(newSplit('''This string has "some double quotes" and 'some single quotes'.'''))

You can set the quotes attribute of a shlex.shlex instance to control which characters are treated as string quotes. You also need to extend wordchars so that the ' stays attached to the i and the say.
import shlex
input = '"hello, world" is what \'i say\''
lexer = shlex.shlex(input)
lexer.quotes = '"'
lexer.wordchars += '\''
output = list(lexer)
# ['"hello, world"', 'is', 'what', "'i", "say'"]
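If the enclosing double quotes should also be stripped from the tokens, as in the output the question asks for, one option is to run the lexer in POSIX mode. This is only a minimal sketch: the function name split_double_quotes_only is made up for illustration, and POSIX mode also treats backslashes as escape characters, which may or may not be what you want.

```python
import shlex

def split_double_quotes_only(text):
    """Split on whitespace, treating only double quotes as quoting characters."""
    lexer = shlex.shlex(text, posix=True)  # POSIX mode drops the enclosing quotes
    lexer.quotes = '"'                     # single quotes become ordinary characters
    lexer.whitespace_split = True          # split on whitespace only
    lexer.commenters = ''                  # do not treat '#' as a comment start
    return list(lexer)

result = split_double_quotes_only('"hello, world" is what \'i say\'')
print(result)
```

With whitespace_split enabled there is no need to touch wordchars, because every non-whitespace character outside double quotes is kept in the current token.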

Related

nextLine().split("\\s+") converted to python

Can someone explain to me what
nextLine().split("\\s+")
does, and how I would convert it to Python?
I wanted to use it, but it's in Java.
Thanks.
split takes a string, interprets it as a regular expression, and uses it as the delimiter. Here the regex is simply \s+ (the extra backslash escapes the backslash inside the Java string literal), where \s denotes any whitespace character and + means "one or more". So for the string "Hello world ! ." the output is ["Hello", "world", "!", "."].
In Python, you need to use the re library for this functionality:
re.split(r"\s+", input_str)
Or, just for this specific case (as @Kurt pointed out), input_str.split() will do the trick.
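The two Python options are not quite interchangeable when the line starts or ends with whitespace. A small sketch of the difference (the sample line is made up for illustration):

```python
import re

line = "  hello   and welcome  "

# re.split keeps the empty strings produced by leading/trailing whitespace.
with_regex = re.split(r"\s+", line)

# str.split() with no argument splits on runs of whitespace and drops
# the empty strings at both ends.
with_split = line.split()

# Stripping first makes the regex version agree with str.split() here.
with_regex_stripped = re.split(r"\s+", line.strip())

print(with_regex)
print(with_split)
```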
nextLine() reads one line of user input, and split("\\s+") splits it into elements on a delimiter, which in this case is the regex \s+.
The equivalent in Python, using the re module:
import re
s = input()
sub_s = re.split(r"\s+", s)
# hello and welcome everyone
# ['hello', 'and', 'welcome', 'everyone']
Code in Java:
import java.util.*;
public class MyClass {
    public static void main(String args[]) {
        String s = "Hello my Wonderful\nWorld!";
        // nextLine()
        Scanner scanner = new Scanner(s);
        System.out.println("'" + scanner.nextLine() + "'");
        System.out.println("'" + scanner.nextLine() + "'");
        scanner.close();
        // nextLine().split("\\s+")
        scanner = new Scanner(s);
        String str[] = scanner.nextLine().split("\\s+");
        System.out.println("*" + str[2] + "*");
        scanner.close();
    }
}
Python:
s = "Hello my Wonderful\nWorld!"
o = s.split("\n")
print("'" + o[0] + "'")
print("'" + o[1] + "'")
'''
Alternatively, using find:
i = s.find('\n')
print(s[:i])
print(s[i+1:])
or, e.g., a generator:
def get_lines(text):
    start = 0
    end = 0
    sub = '\n'
    while True:
        end = text.find(sub, start)
        if end == -1:
            yield text[start:]
            return
        else:
            yield text[start:end]
            start = end + 1
i = iter(get_lines(s))
print("'" + next(i) + "'")
print("'" + next(i) + "'")
'''
o = s.split()
print("*" + o[2] + "*")
output
'Hello my Wonderful'
'World!'
*Wonderful*
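For reading line by line, the standard library already covers what the hand-written generator does. A brief sketch, assuming the same sample string; io.StringIO is used here only as a loose stand-in for the Java Scanner, not as an exact equivalent:

```python
import io

s = "Hello my Wonderful\nWorld!"

# str.splitlines() returns the lines without their line terminators.
lines = s.splitlines()

# io.StringIO gives a file-like object that can be read one line at a time,
# loosely similar to calling nextLine() repeatedly on a Scanner.
reader = io.StringIO(s)
first = reader.readline().rstrip("\n")
second = reader.readline().rstrip("\n")

# Equivalent of nextLine().split("\\s+")[2] for the first line.
third_word = lines[0].split()[2]

print(lines, first, second, third_word)
```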

Python raw HTML contains "\n" characters that I cannot remove with the replace command

I am getting HTML data with a Python get(url) call, which returns raw HTML data that contains "\n" characters. When I run replace("\n", "") against it, they are not removed. Could someone explain how to remove them, either at the "simple_get" stage or at the "raw_htmlB" stage? Code below.
# Main script
from CodeB import simple_get
htmlPath = "https://en.wikipedia.org/wiki/Terminalia_nigrovenulosa"
raw_html = simple_get(htmlPath)
if raw_html is None:
    print("not found")
else:
    tmpHtml = str(raw_html)
    tmpHtmlB = tmpHtml.replace("\n","")
    print("tmpHtmlB:=", tmpHtmlB)
# CodeB.py
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup
def simple_get(url):
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None
def is_good_response(resp):
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)
def log_error(e):
    print(e)
I think simply adding a space between your double quotes should do it.
Use a raw string, r'\n', or remember that \n stands for a newline, so you need to escape the backslash: .replace('\\n', '')
I believe you need to add another backslash "\" before \n in order to search for the literal string \n; the extra backslash escapes the first one.
Quick example:
string = '\\n foo'
print(string.replace('\n', ''))
Returns:
\n foo
While:
print(string.replace('\\n', ''))
Returns just:
foo
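One likely reason the literal backslash-n shows up at all, assuming Python 3: resp.content is a bytes object, and calling str() on bytes produces its printable representation, in which each newline appears as the two characters backslash and n. A minimal sketch with a made-up payload (decoding as UTF-8 is an assumption about the page's encoding):

```python
raw_html = b"<p>line one</p>\n<p>line two</p>\n"  # stand-in for resp.content

as_repr = str(raw_html)             # "b'<p>line one</p>\\n...'", no real newlines
as_text = raw_html.decode("utf-8")  # real text with real newline characters

unchanged = as_repr.replace("\n", "")  # nothing to replace
cleaned = as_text.replace("\n", "")    # newlines removed

print(as_repr)
print(cleaned)
```

Decoding first avoids having to search for the escaped form at all, and also gets rid of the leading b' and trailing ' that str() adds.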
It should be pretty straightforward: use rstrip to chop off the \n character from tmpHtmlB.
>>> tmpHtmlB = "my string\n"
>>> tmpHtmlB.rstrip()
'my string'
In your case it should be :
tmpHtmlB = tmpHtml.rstrip()
Even if there are multiple newline characters, the same approach works, because the canonical way to strip end-of-line (EOL) characters is the string rstrip() method, which removes any trailing \r or \n.
\r\n - on a Windows computer
\r - on classic Mac OS (before OS X)
\n - on Linux
>>> tmpHtmlB = "Test String\n\n\n"
>>> tmpHtmlB.rstrip("\r\n")
'Test String'
OR
>>> tmpHtmlB.rstrip("\n")
'Test String'

String Operation on captured group in re Python

I have a string:
str1 = "abc = def"
I want to convert it to:
str2 = "abc = #Abc#"
I am trying this:
re.sub("(\w+) = (\w+)",r"\1 = %s" % ("#"+str(r"\1").title()+"#"),str1)
but it returns: (without the string operation done)
"abc = #abc#"
What is the possible reason .title() is not working?
How can I apply a string operation to the captured group in Python?
You can see what's going on with the help of a little function:
import re
str1 = "abc = def"
def fun(m):
    print("In fun(): " + m)
    return m
str2 = re.sub(r"(\w+) = (\w+)",
              r"\1 = %s" % ("#" + fun(r"\1") + "#"),
              #                   ^^^^^^^^^^
              str1)
Which yields
In fun(): \1
So what you are actually doing is calling .title() on the literal string \1 (not on the captured text), which obviously leaves it as \1. The \1 is replaced with the captured content only later, after your call to str.title() has already run.
Go with a lambda function as proposed by @Rakesh.
Try using a lambda.
Example:
import re
str1 = "abc = def"
print(re.sub(r"(?P<one>\w+) = (\w+)",
             lambda match: "{0} = #{1}#".format(match.group("one"), match.group("one").title()),
             str1))
Output:
abc = #Abc#
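The same idea can be written with a named function instead of a lambda, which some find easier to read. A minimal sketch; the name title_case_key is made up for illustration:

```python
import re

def title_case_key(match):
    """Build the replacement from the matched text, not from the template."""
    key = match.group(1)
    return "{0} = #{1}#".format(key, key.title())

str1 = "abc = def"
str2 = re.sub(r"(\w+) = (\w+)", title_case_key, str1)
print(str2)
```

re.sub calls the function once per match and passes the match object, so .title() runs on the captured text rather than on the \1 placeholder.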

Double quote string manipulation

I have some input data from ASCII files which uses double quote to encapsulate string as well as still use double quote inside those strings, for example:
"Reliable" "Africa" 567.87 "Bob" "" "" "" "S 05`56'21.844"" "No Shift"
Notice the double quote used in the coordinate.
So I have been using:
valList = shlex.split(line)
But shlex gets confused by the double quote used as the seconds mark in the coordinate.
I've been doing a find and replace on '\"\"' to '\\\"\"'. This of course turns empty strings into \"" as well, so I then do a second find and replace (this time with spaces) on ' \\\"\" ' to ' \"\"" '. Not exactly the most efficient way of doing it!
Any suggestions on handling this double quote in the coordinate?
I would do it this way:
I would treat this line of text as a CSV record. According to RFC 4180:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
Then all you need to do is add another " to your coordinates, so that it looks like "S 05`56'21.844""" (note the extra quote). Then you can use the standard csv module to break it apart and extract the necessary information.
>>> from io import StringIO
>>> import csv
>>>
>>> test = '''"Reliable" "Africa" 567.87 "Bob" "" "" "" "S 05`56'21.844""" "No Shift"'''
>>> test_obj = StringIO(test)
>>> reader = csv.reader(test_obj, delimiter=' ', quotechar='"', quoting=csv.QUOTE_ALL)
>>> for i in reader:
...     print(i)
...
The output would be :
['Reliable', 'Africa', '567.87', 'Bob', '', '', '', 'S 05`56\'21.844"', 'No Shift']
I'm not good with regexes, but this non-regex suggestion might help ...
INPUT = ('"Reliable" "Africa" 567.87 "Bob" "" "" "" "S 05`56'
         "'"
         '21.844"" "No Shift"')
def main(line):
    output = line
    surrounding_quote_symbol = '<!>'
    if line.startswith('"'):
        output = '%s%s' % (surrounding_quote_symbol, output[1:])
    if line.endswith('"'):
        output = '%s%s' % (output[:-1], surrounding_quote_symbol)
    output = output.replace('" ', '%s ' % surrounding_quote_symbol)
    output = output.replace(' "', ' %s' % surrounding_quote_symbol)
    print("Stage 1:", output)
    # Escape the remaining (interior) quotes, then restore the surrounding ones.
    output = output.replace('"', '\\"')
    output = output.replace(surrounding_quote_symbol, '"')
    return output
if __name__ == "__main__":
    output = main(INPUT)
    print("End results:", output)
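For comparison, a regex-based sketch of the same idea: escape only the double quotes that have a non-space character on both sides, then hand the line to shlex.split. This assumes that fields are always separated by a single space and that an interior quote is never directly followed by a space, which holds for the sample line but may not hold for all data.

```python
import re
import shlex

line = '"Reliable" "Africa" 567.87 "Bob" "" "" "" "S 05`56\'21.844"" "No Shift"'

# Escape every double quote that has a non-space character on both sides.
# Opening quotes follow a space (or start the line) and closing quotes
# precede a space (or end the line), so neither of those is touched.
escaped = re.sub(r'(?<=\S)"(?=\S)', r'\\"', line)

fields = shlex.split(escaped)
print(fields)
```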

Python/Pyparsing - Multiline quotes

I'm trying to use pyparsing to match a multiline string that can continue in a similar fashion to those of python:
Test = "This is a long " \
"string"
I can't find a way to make pyparsing recognize this. Here is what I've tried so far:
import pyparsing as pp
src1 = '''
Test("This is a long string")
'''
src2 = '''
Test("This is a long " \
"string")
'''
_lp = pp.Suppress('(')
_rp = pp.Suppress(')')
_str = pp.QuotedString('"', multiline=True, unquoteResults=False)
func = pp.Word(pp.alphas)
function = func + _lp + _str + _rp
print(src1)
print(function.parseString(src1))
print('-------------------------')
print(src2)
print(function.parseString(src2))
The problem is that having a multi-line quoted string doesn't do what you think. A multiline quoted string is literally that -- a string with newlines inside:
import pyparsing as pp
src0 = '''
"Hello
 World
 Goodbye and go"
'''
pat = pp.QuotedString('"', multiline=True)
print(pat.parseString(src0))
The output of parsing this string would be ['Hello\n World\n Goodbye and go'].
As far as I know, if you want a string that's similar to how Python's strings behave, you have to define it yourself:
import pyparsing as pp
src1 = '''
Test("This is a long string")
'''
src2 = '''
Test("This is a long"
"string")
'''
src3 = '''
Test("This is a long" \\
"string")
'''
_lp = pp.Suppress('(')
_rp = pp.Suppress(')')
_str = pp.QuotedString('"')
_slash = pp.Suppress(pp.Optional("\\"))
_multiline_str = pp.Combine(pp.OneOrMore(_str + _slash), adjacent=False)
func = pp.Word(pp.alphas)
function = func + _lp + _multiline_str + _rp
print(src1)
print(function.parseString(src1))
print('-------------------------')
print(src2)
print(function.parseString(src2))
print('-------------------------')
print(src3)
print(function.parseString(src3))
This produces the following output:
Test("This is a long string")
['Test', 'This is a long string']
-------------------------
Test("This is a long"
"string")
['Test', 'This is a longstring']
-------------------------
Test("This is a long" \
"string")
['Test', 'This is a longstring']
Note: The Combine class merges the various quoted strings into a single unit so that they appear as a single string in the output list. The backslash is suppressed so that it isn't combined into the output string.
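A side note on why src3 spells the continuation as \\ while the question's src2 uses a single \: in a non-raw Python string literal, a backslash directly followed by a newline is a line continuation, and both characters are dropped before any parser sees the text. A small sketch using only the standard library (the variable names are illustrative):

```python
# Backslash + newline inside a normal string literal: both characters vanish.
plain = '''"a" \
"b"'''

# A doubled backslash keeps one literal backslash, followed by the newline.
escaped = '''"a" \\
"b"'''

# A raw string also keeps the backslash and the newline as written.
raw = r'''"a" \
"b"'''

print(repr(plain))
print(repr(escaped))
print(repr(raw))
```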
