How to convert this code to Dart?
Python:
querystr.strip('[]').strip('()').rstrip(',').strip("'")
Python strip(): Definition and Usage
The strip() method removes leading (at the beginning of the string) and trailing (at the end of the string) characters; whitespace is the default set of characters to remove.
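For reference, this is what the individual Python calls from the question do (a quick demo with made-up inputs):
print("  hello  ".strip())     # hello
print("[hello]".strip('[]'))   # hello
print("(hello)".strip('()'))   # hello
print("hello,,".rstrip(','))   # hello  (rstrip only removes from the end)
print("'hello'".strip("'"))    # hello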
In Dart you can check the leading and trailing characters with .startsWith() and .endsWith() and take a substring, like this:
void main() {
  print(strip(strip(strip(strip(strip(" [(,'Sample String',)] ", " "), "[]"), "()"), ","), "''"));
  // Output: "Sample String"
}

String strip(String string, String char) {
  if (string.startsWith(char[0]) && string.endsWith(char[char.length - 1])) {
    string = string.substring(1, string.length - 1);
  }
  return string;
}
Or you can use a regular expression, like this:
void main() {
  print(strip(strip(strip(strip(strip(" [(,'Sample String',)] ", " "), "[]"), "()"), ","), "''"));
  // Output: "Sample String"
}

String strip(String string, String char) {
  char = char.length == 1 ? "$char$char" : char; // Duplicate a single character
  String lc = char[0];   // Leading character
  String tc = char[1];   // Trailing character
  String l = "^\\$lc+";  // Regex for the leading character <lc>
  String t = "\\$tc+\$"; // Regex for the trailing character <tc>
  // Replace the leading and trailing runs with an empty string
  return string.replaceAll(new RegExp(l), "").replaceAll(new RegExp(t), "");
}
For Dart you can also use the trim() method of the String class, which removes leading and trailing whitespace. See the documentation: https://api.dartlang.org/stable/2.7.0/dart-core/String/trim.html
And remember, in Dart, as in many other languages, strings are immutable. You cannot modify a string in place, but you can perform an operation and assign the result to a new string.
Good luck
As an alternative, you can also write it as an extension on String:
extension StringExtension on String {
  String strip(List<Pattern> patterns, String replacement) {
    // Apply each replacement in turn to the running result
    // (not just the last pattern to the original string).
    var result = this;
    for (int i = 0; i < patterns.length; i++) {
      result = result.replaceAll(patterns[i], replacement);
    }
    return result;
  }
}
Related
I'm trying to convert a hexadecimal value to a string. It works fine in Python, but in C# it comes out as a bunch of garbled characters.
Hex: "293A4D48E43D5D1FBBFC8993DD93949F"
Python
>>> bytearray.fromhex('293A4D48E43D5D1FBBFC8993DD93949F');
bytearray(b'):MH\xe4=]\x1f\xbb\xfc\x89\x93\xdd\x93\x94\x9f')
C#
public static string HextoString(string InputText)
{
    byte[] bb = Enumerable.Range(0, InputText.Length)
                          .Where(x => x % 2 == 0)
                          .Select(x => Convert.ToByte(InputText.Substring(x, 2), 16))
                          .ToArray();
    return System.Text.Encoding.Unicode.GetString(bb);
    // or System.Text.Encoding.UTF7.GetString
    // or System.Text.Encoding.UTF8.GetString
    // or System.Text.Encoding.Unicode.GetString
    // or etc.
}

HextoString("293A4D48E43D5D1FBBFC8993DD93949F");
// "):MH?=]▼????????"
Python and C# decide what to do with unprintable characters in different ways. Python prints their escape sequences (e.g. \xe4), while C# prints the unprintable characters as question marks. You may want to convert the string to an escaped string literal before you print it.
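To see the Python side of that explicitly (the escapes come from repr(), which is what the interactive prompt displays):
data = bytearray.fromhex('293A4D48E43D5D1FBBFC8993DD93949F')
print(repr(data))  # bytearray(b'):MH\xe4=]\x1f\xbb\xfc\x89\x93\xdd\x93\x94\x9f')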
Hi my friend, you can choose one of these ways:
1. Using bytes.fromhex:
print(bytes.fromhex('68656c6c6f').decode('utf-8'))
2. Using codecs.decode:
import codecs
my_string = "68656c6c6f"
my_string_bytes = bytes(my_string, encoding='utf-8')
binary_string = codecs.decode(my_string_bytes, "hex")
print(str(binary_string, 'utf-8'))
3. Appending a decoded hex value to a string:
def hex_to_string(hex):
    if hex[:2] == '0x':
        hex = hex[2:]
    string_value = bytes.fromhex(hex).decode('utf-8')
    return string_value

hex_value = '0x737472696e67'
string = 'This is just a ' + hex_to_string(hex_value)
print(string)
Good luck (Arman Golbidi)
Banging my head here...
I am trying to parse the HTML source for the entire contents of the JavaScript variable 'ListData' with a regex; the value starts with the declaration var ListData = and ends with };.
I found a solution which is similar:
Fetch data of variables inside script tag in Python or Content added from js
But I am unable to get the regex to match the entire object.
Code:
import re

# Need the ListData object
pat = re.compile('var ListData = (.*?);')
string = """QuickLaunchMenu == null) QuickLaunchMenu = $create(UI.AspMenu,
null, null, null, $get('QuickLaunchMenu')); } ExecuteOrDelayUntilScriptLoaded(QuickLaunchMenu, 'Core.js');
var ListData = { "Row" :
[{
"ID": "159",
"PermMask": "0x1b03cc312ef",
"FSObjType": "0",
"ContentType": "Item"
};
moretext;
moretext"""
#Returns NoneType instead of match object
print(type(pat.search(string)))
Not sure what is going wrong here. Any help would be appreciated.
In your regex, the (.*?); part matches any 0+ chars other than line break chars, up to the first ;. If there is no ; on that same line, there is no match.
Based on the fact that your expected match ends with the first }; at the end of a line, you can use
'(?sm)var ListData = (.*?)};$'
Here,
(?sm) - enables re.S (which makes . match any char, including line breaks) and re.M (which makes $ match at the end of each line, not just the end of the whole string, and ^ match at line-start positions) modes
var ListData = - matches this literal text
(.*?) - Group 1: any 0+ chars, as few as possible, up to the first...
};$ - }; at the end of a line
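Putting it together against the string from the question (reusing the pat and string names from your snippet):
import re

pat = re.compile(r'(?sm)var ListData = (.*?)};$')
m = pat.search(string)  # 'string' is the HTML source from the question
if m:
    print(m.group(1))   # everything between 'var ListData = ' and the first '};' that ends a line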
Here is an interesting oddity about Python's repr:
The tab character \x09 is represented as \t. However, this convention does not apply to the null character.
Why is \x00 represented as \x00, rather than \0?
Sample code:
# Some facts to make sure we are on the same page
>>> '\x31' == '1'
True
>>> '\x09' == '\t'
True
>>> '\x00' == '\0'
True
>>> x = '\x31'
>>> y = '\x09'
>>> z = '\x00'
>>> x
'1' # As Expected
>>> y
'\t' # Okay
>>> z
'\x00' # Inconsistent - why is this not \0
The short answer: because \0 is not one of the escapes repr() uses. String representations only use the single-character escapes \\, \n, \r, \t (plus \' when both " and ' characters are present), because there are explicit tests for those.
The rest is either considered printable and included as-is, or included using a longer escape sequence (depending on the Python version and string type, \xhh, \uhhhh and \Uhhhhhhhh, always using the shortest of the 3 options that'll fit the value).
Moreover, when generating the repr() output for a string consisting of a null byte followed by a digit from '1' through '7' (so bytes([0x00, 0x31]) through bytes([0x00, 0x37])), you can't just use \0 in the output without then also having to escape the following digit: '\01' is a single octal escape sequence, and not the same value as '\x001', which is two characters. While forcing the output to always use three octal digits (e.g. '\0001') could be a work-around, it is simpler to stick to one standardised escape sequence format. Scanning ahead to see whether the next character is an octal digit and switching output styles would just produce confusing output (imagine the question on SO: What is the difference between '\x001' and '\001'?).
The output is always consistent. Apart from the single quote (which can appear either with ' or \', depending on the presence of " characters), Python will always use same escape sequence style for a given codepoint.
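A quick interactive illustration of the ambiguity that \0 would introduce:
>>> '\x00' + '1'   # NUL followed by the digit '1': two characters
'\x001'
>>> '\01'          # a single octal escape: one character, equal to '\x01'
'\x01'
>>> len('\x001'), len('\01')
(2, 1)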
If you want to study the code that produces the output, you can find the Python 3 str.__repr__ implementation in the Objects/unicodeobject.c unicode_repr() function, which uses
/* Escape quotes and backslashes */
if ((ch == quote) || (ch == '\\')) {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, ch);
    continue;
}
/* Map special whitespace to '\t', '\n', '\r' */
if (ch == '\t') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 't');
}
else if (ch == '\n') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'n');
}
else if (ch == '\r') {
    PyUnicode_WRITE(okind, odata, o++, '\\');
    PyUnicode_WRITE(okind, odata, o++, 'r');
}
for the single-character escapes, followed by additional checks for the longer escapes further down. For Python 2, a similar but shorter PyString_Repr() function does much the same thing.
If repr() tried to use \0, it would have to special-case the output whenever a digit immediately followed, to prevent the digit from being read as part of an octal escape. Always using \x00 is simpler and always correct.
I'm a beginner C++ programmer working on a small C++ project for which I have to process a number of relatively large XML files and remove the XML tags from them. I've succeeded in doing so using the C++0x regex library. However, I'm running into some performance issues. Just reading in the files and executing regex_replace over their contents takes around 6 seconds on my PC. I can bring this down to 2 seconds by adding some compiler optimization flags. Using Python, however, I can get it done in less than 100 milliseconds. Obviously, I'm doing something very inefficient in my C++ code. What can I do to speed this up a bit?
My C++ code:
std::regex xml_tags_regex("<[^>]*>");

for (std::vector<std::string>::iterator it = _files.begin(); it != _files.end(); it++) {
    std::ifstream file(*it);
    file.seekg(0, std::ios::end);
    size_t size = file.tellg();
    std::string buffer(size, ' ');
    file.seekg(0);
    file.read(&buffer[0], size);
    buffer = regex_replace(buffer, xml_tags_regex, "");
    file.close();
}
My Python code:
import re

regex = re.compile('<[^>]*>')
for filename in filenames:
    with open(filename) as f:
        content = f.read()
    content = regex.sub('', content)
P.S. I don't really care about processing the complete file at once. I just found that reading a file line by line, word by word or character by character slowed it down considerably.
C++11 regex replace is indeed rather slow, as of yet at least. PCRE performs much better in terms of pattern matching speed; however, PCRECPP provides very limited means for regular-expression-based substitution. Citing the man page:
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert text matching corresponding parenthesized group from the
pattern. \0 in "rewrite" refers to the entire matching text.
This is really poor, compared to Perl's 's' command. That is why I wrote my own C++ wrapper around PCRE that handles regular expression based substitution in a fashion that is close to Perl's 's', and also supports 16- and 32-bit character strings: PCRSCPP:
Command string syntax
Command syntax follows Perl s/pattern/substitute/[options]
convention. Any character (except the backslash \) can be used as a
delimiter, not just /, but make sure that delimiter is escaped with
a backslash (\) if used in pattern, substitute or options
substrings, e.g.:
s/\\/\//g to replace all backslashes with forward ones
Remember to double backslashes in C++ code, unless using raw string
literal (see string literal):
pcrscpp::replace rx("s/\\\\/\\//g");
Pattern string syntax
Pattern string is passed directly to pcre*_compile, and thus has to
follow PCRE syntax as described in PCRE documentation.
Substitute string syntax
Substitute string backreferencing syntax is similar to Perl's:
$1 ... $n: nth capturing subpattern matched.
$& and $0: the whole match
${label} : labeled subpattern matched. label is up to 32 alphanumerical +
underscore characters ('A'-'Z','a'-'z','0'-'9','_'),
first character must be alphabetical
$` and $' (backtick and tick) refer to the areas of the subject before
and after the match, respectively. As in Perl, the unmodified
subject is used, even if a global substitution previously matched.
Also, following escape sequences get recognized:
\n: newline
\r: carriage return
\t: horizontal tab
\f: form feed
\b: backspace
\a: alarm, bell
\e: escape
\0: binary zero
Any other escape sequence \<char> is interpreted as <char>,
meaning that you have to escape backslashes too
Options string syntax
In a Perl-like manner, the options string is a sequence of allowed modifier
letters. PCRSCPP recognizes the following modifiers:
Perl-compatible flags
g: global replace, not just the first match
i: case insensitive match
(PCRE_CASELESS)
m: multi-line mode: ^ and $ additionally match positions
after and before newlines, respectively
(PCRE_MULTILINE)
s: let the scope of the . metacharacter include newlines
(treat newlines as ordinary characters)
(PCRE_DOTALL)
x: allow extended regular expression syntax,
enabling whitespace and comments in complex patterns
(PCRE_EXTENDED)
PHP-compatible flags
A: "anchor" pattern: look only for "anchored" matches: ones that
start with zero offset. In single-line mode it is identical to
prefixing all pattern alternative branches with ^
(PCRE_ANCHORED)
D: treat dollar $ as subject end assertion only, overriding the default:
end, or immediately before a newline at the end.
Ignored in multi-line mode
(PCRE_DOLLAR_ENDONLY)
U: invert * and + greediness logic: make ungreedy by default,
? switches back to greedy. (?U) and (?-U) in-pattern switches
remain unaffected
(PCRE_UNGREEDY)
u: Unicode mode. Treat pattern and subject as UTF8/UTF16/UTF32 string.
Unlike in PHP, also affects newlines, \R, \d, \w, etc. matching
((PCRE_UTF8/PCRE_UTF16/PCRE_UTF32) | PCRE_NEWLINE_ANY
| PCRE_BSR_UNICODE | PCRE_UCP)
PCRSCPP own flags:
N: skip empty matches
(PCRE_NOTEMPTY)
T: treat substitute as a trivial string, i.e., do not interpret
backreferences or escape sequences in it
n: discard non-matching portions of the string to replace
Note: PCRSCPP does not automatically add newlines;
the replacement result is a plain concatenation of matches.
Be especially aware of this in multi-line mode
I wrote a simple speed test, which builds a string holding 10 copies of the file "move.sh" and tests regex performance on the resulting string:
#include <pcrscpp.h>
#include <string>
#include <iostream>
#include <fstream>
#include <regex>
#include <chrono>
int main (int argc, char *argv[]) {
    const std::string file_name("move.sh");
    pcrscpp::replace pcrscpp_rx(R"del(s/(?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n)/$1\n$2\n/Dgn)del");
    std::regex std_rx (R"del((?:^|\n)mv[ \t]+(?:-f)?[ \t]+"([^\n]+)"[ \t]+"([^\n]+)"(?:$|\n))del");

    std::ifstream file (file_name);
    if (!file.is_open ()) {
        std::cerr << "Unable to open file " << file_name << std::endl;
        return 1;
    }
    std::string buffer;
    {
        file.seekg(0, std::ios::end);
        size_t size = file.tellg();
        file.seekg(0);
        if (size > 0) {
            buffer.resize(size);
            file.read(&buffer[0], size);
            buffer.resize(size - 1); // strip '\0'
        }
    }
    file.close();

    std::string bigstring;
    bigstring.reserve(10*buffer.size());
    for (std::string::size_type i = 0; i < 10; i++)
        bigstring.append(buffer);

    int n = 10;
    std::cout << "Running tests " << n << " times: be patient..." << std::endl;

    std::chrono::high_resolution_clock::duration std_regex_duration, pcrscpp_duration;
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::string result1, result2;

    for (int i = 0; i < n; i++) {
        // clear result
        std::string().swap(result1);
        t1 = std::chrono::high_resolution_clock::now();
        result1 = std::regex_replace (bigstring, std_rx, "$1\\n$2", std::regex_constants::format_no_copy);
        t2 = std::chrono::high_resolution_clock::now();
        std_regex_duration = (std_regex_duration*i + (t2 - t1)) / (i + 1);

        // clear result
        std::string().swap(result2);
        t1 = std::chrono::high_resolution_clock::now();
        result2 = pcrscpp_rx.replace_copy (bigstring);
        t2 = std::chrono::high_resolution_clock::now();
        pcrscpp_duration = (pcrscpp_duration*i + (t2 - t1)) / (i + 1);
    }

    std::cout << "Time taken by std::regex_replace: "
              << std_regex_duration.count()
              << " ms" << std::endl
              << "Result size: " << result1.size() << std::endl;
    std::cout << "Time taken by pcrscpp::replace: "
              << pcrscpp_duration.count()
              << " ms" << std::endl
              << "Result size: " << result2.size() << std::endl;
    return 0;
}
(note that the std and pcrscpp regular expressions do the same thing here; the trailing newline in the pcrscpp expression is there because std::regex_replace does not strip newlines, despite std::regex_constants::format_no_copy)
and launched it on a large (20.9 MB) shell move script:
Running tests 10 times: be patient...
Time taken by std::regex_replace: 12090771487 ms
Result size: 101087330
Time taken by pcrscpp::replace: 5910315642 ms
Result size: 101087330
As you can see, PCRSCPP is more than 2x faster here, and I expect the gap to grow as pattern complexity increases, since PCRE deals with complicated patterns much better. I originally wrote the wrapper for myself, but I think it can be useful for others too.
Regards,
Alex
I don't think you're doing anything "wrong" per se; the C++ regex library just isn't as fast as the Python one (for this use case, at this time, at least). That isn't too surprising: the Python regex code is all C/C++ under the hood as well, and it has been tuned over the years to be pretty fast, since that's a fairly important feature in Python.
But there are other options in C++ for getting things faster if you need. I've used PCRE ( http://pcre.org/ ) in the past with great results, though I'm sure there are other good ones out there these days as well.
For this case in particular, however, you can also achieve what you're after without regexes, which in my quick tests yielded a 10x performance improvement. For example, the following code scans your input string, copying everything to a new buffer; when it hits a < it skips over characters until it sees the closing >:
std::string buffer(size, ' ');
std::string outbuffer(size, ' ');

// ... read in buffer from your file

size_t outbuffer_len = 0;
for (size_t i = 0; i < buffer.size(); ++i) {
    if (buffer[i] == '<') {
        // Skip until the closing '>', checking bounds before dereferencing
        while (i < buffer.size() && buffer[i] != '>') {
            ++i;
        }
    } else {
        outbuffer[outbuffer_len] = buffer[i];
        ++outbuffer_len;
    }
}
outbuffer.resize(outbuffer_len);
Trying to do this in Python 2.7:
>>> s = u"some\u2028text"
>>> s
u'some\u2028text'
>>> l = s.splitlines(True)
>>> l
[u'some\u2028', u'text']
\u2028 is the Left-To-Right Embedding character, not \r or \n, so that line should not be split. Is this a bug, or just my misunderstanding?
\u2028 is LINE SEPARATOR, left-to-right embedding is \u202A:
>>> import unicodedata
>>> unicodedata.name(u'\u2028')
'LINE SEPARATOR'
>>> unicodedata.name(u'\u202A')
'LEFT-TO-RIGHT EMBEDDING'
The list of codepoints considered line breaks is easy to see in the Python source (though not that easy to find). This is from Python 2.7, with the comments added by me:
/* Returns 1 for Unicode characters having the line break
* property 'BK', 'CR', 'LF' or 'NL' or having bidirectional
* type 'B', 0 otherwise.
*/
int _PyUnicode_IsLinebreak(register const Py_UNICODE ch)
{
    switch (ch) {
    // Basic Latin
    case 0x000A: // LINE FEED
    case 0x000B: // VERTICAL TABULATION
    case 0x000C: // FORM FEED
    case 0x000D: // CARRIAGE RETURN
    case 0x001C: // FILE SEPARATOR
    case 0x001D: // GROUP SEPARATOR
    case 0x001E: // RECORD SEPARATOR
    // Latin-1 Supplement
    case 0x0085: // NEXT LINE
    // General punctuation
    case 0x2028: // LINE SEPARATOR
    case 0x2029: // PARAGRAPH SEPARATOR
        return 1;
    }
    return 0;
}
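You can check that list from Python directly; all of those codepoints split, e.g.:
>>> u"one\u2028two\u2029three\x85four".splitlines()
[u'one', u'two', u'three', u'four']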
U+2028 is LINE SEPARATOR. Both U+2028 and U+2029 (PARAGRAPH SEPARATOR) should be treated as newlines, so Python is doing the right thing.
Of course it is sometimes perfectly reasonable to want to split on a non-standard list of newline characters. But you can't do that with splitlines. You will have to use split, and if you need the additional features of splitlines, you'll have to implement them yourself. For example:
return [line.rstrip(sep) for line in s.split(sep)]
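If you also need something closer to splitlines() (for example keepends support) for a custom separator set, a small sketch along these lines would work; the helper name split_on and its exact behavior are just illustrative here, not a standard API:
import re

def split_on(s, seps, keepends=False):
    # Split only on the characters in 'seps' (a string of separator characters),
    # mimicking str.splitlines() but with a caller-chosen separator set.
    parts = re.split(u'([%s])' % re.escape(seps), s)
    lines = []
    for i in range(0, len(parts), 2):
        text = parts[i]
        sep = parts[i + 1] if i + 1 < len(parts) else u''
        if text or sep:
            lines.append(text + sep if keepends else text)
    return lines

print(split_on(u"some\u2028text", u"\r\n"))  # [u'some\u2028text'] - U+2028 no longer splits
print(split_on(u"a\nb\nc\n", u"\r\n"))       # [u'a', u'b', u'c']
Like splitlines(), this drops the empty trailing element when the string ends with a separator.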