I have a string, part of which I want to keep and use for plot headings; the rest can be discarded. I have a method of doing it (below), but I am sure there is a more Pythonic way of doing this than the clumsy find method.
Code
filename = "Merged_2001_1234567_ZZ00_device_name_Row1_Ch480_1V_3A_4800OHMS_13Oct2022.csv"
filename = os.path.splitext(filename)[0]
b = filename.find("Row1")
print("b : ",b, '\n'*3)
print(filename[b:], '\n'*3)
Returns
b : 37
Row1_Ch480_1V_3A_4800OHMS_13Oct2022
It is returning what I am after, but I couldn't find a better way to do it without losing the word I was splitting at ("Row1").
Has anyone got a more robust method?
Doing it with string methods and slicing is fine if you only have to do a couple. For this you could replace os.path.splitext with the native string method filename = filename.split(".")[0].
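If you want to keep the word you are splitting at without computing an index, a minimal sketch using str.partition (which returns the separator itself) might look like this:

filename = "Merged_2001_1234567_ZZ00_device_name_Row1_Ch480_1V_3A_4800OHMS_13Oct2022.csv"
stem = filename.split(".")[0]          # drop the extension
_, sep, rest = stem.partition("Row1")  # partition keeps the separator
print(sep + rest)                      # Row1_Ch480_1V_3A_4800OHMS_13Oct2022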
If you find yourself doing a lot of these, compiling a regex might be beneficial
import re
filename = "Merged_2001_1234567_ZZ00_device_name_Row1_Ch480_1V_3A_4800OHMS_13Oct2022.csv"
exp = re.compile(r"Row1_.+?(?=\.)")
ret = re.search(exp, filename).group()
print(ret)
>>> Row1_Ch480_1V_3A_4800OHMS_13Oct2022
Or, plainly, if you are not worried about doing the same search repeatedly:
ret = re.search(r"Row1_.+?(?=\.)", filename).group()
Some reference: How can I match "anything up until this sequence of characters" in a regular expression?
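For reference, a minimal sketch of that lookahead technique applied here; the second pattern is an equivalent alternative without a lookahead, shown as an aside:

import re

filename = "Merged_2001_1234567_ZZ00_device_name_Row1_Ch480_1V_3A_4800OHMS_13Oct2022.csv"
print(re.search(r"Row1_.+?(?=\.)", filename).group())  # lookahead: stop before the dot
print(re.search(r"Row1_[^.]+", filename).group())      # same result without a lookahead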
I want to test if certain characters are in a line of text. The condition is simple, but there are many characters to test for.
Currently I am using \ line continuations for easy viewing, but it feels clumsy. What's the way to make the lines look nicer?
text = "Tel+971-2526-821 Fax:+971-2526-821"
if "971" in text or \
"(84)" in text or \
"+66" in text or \
"(452)" in text or \
"19 " in text:
print "foreign"
Why not extract the phone numbers from the string and run your tests on those?
text = "Tel:+971-2526-821 Fax:+971-2526-821"
tel, fax = text.split()
tel_prefix, *_ = tel.split(':')[-1].split('-')
fax_prefix, *_ = fax.split(':')[-1].split('-')
if tel_prefix in ("971", "(84)"):
print("Foreigner")
For Python 2.x:
tel_prefix = tel.split(':')[-1].split('-')[0]
fax_prefix = fax.split(':')[-1].split('-')[0]
As @Patrick Haugh pointed out in the comments, we can also do:
text = "Tel+971-2526-821 Fax:+971-2526-821"
if any(x in text for x in ("971", "(84)", "+66", "(452)", "19 ")):
print "foreign"
You can use the any() builtin function to check whether any one of the tokens exists in the text. If you would like to check that all of the tokens exist in the string, replace any below with the all() function. Cheers!
text = 'Hello your number is 19 '
tokens = ('971', '(84)', '+66', '(452)', '19 ')

if any(token in text for token in tokens):
    print('Foreign')
Output:
Foreign
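For completeness, here is a minimal sketch of the all() variant mentioned above (the sample text and tokens are assumptions):

text = 'Hello your number is 19 and code is (84)'
tokens = ('19 ', '(84)')

if all(token in text for token in tokens):
    print('Both tokens present')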
Existing comments mention that you can't really have multiple or statements like you intend, but using generators/comprehensions and the any() function you are able to come up with a serviceable option, such as the snippet if any(x in text for x in ('971', '(84)', '+66', '(452)', '19 ')): that @Patrick Haugh recommended.
I would recommend using regular expressions instead as a more versatile and efficient way of solving the problem. You could either generate the pattern dynamically, or for the purpose of this problem, the following snippet would work (don't forget to escape parentheses):
import re

text = 'Tel:+971-2526-821 Fax:+971-2526-821'
pattern = u'(971|\(84\)|66|\(452\)|19)'
prog = re.compile(pattern)

if prog.search(text):
    print 'foreign'
If you are searching many lines of text or large bodies of text for multiple possible substrings, this approach will be faster and more reusable. You only have to compile prog once, and then you can use it as often as you'd like.
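For example, a minimal sketch of reusing the compiled pattern across several lines (the sample lines are assumptions):

import re

prog = re.compile(r'(971|\(84\)|66|\(452\)|19)')  # compiled once

lines = [
    'Tel:+971-2526-821 Fax:+971-2526-821',  # assumed sample line
    'Tel:+1-555-0100 Fax:+1-555-0100',      # assumed sample line
]
for line in lines:
    if prog.search(line):  # reuse the same compiled pattern
        print('foreign: ' + line)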
As far as dynamic generation of a pattern is concerned, a naive implementation might do something like this:
match_list = ['971', '(84)', '66', '(452)', '19']
pattern = '|'.join(map(lambda s: s.replace('(', '\(').replace(')', '\)'), match_list)).join(['(', ')'])
The variable match_list could then be updated and modified as needed. There is a slight inefficiency in running two passes of replace(), and @Andrew Clark has a good trick for fixing that here, but I don't want this answer to be too long and cumbersome.
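As a side note, re.escape can handle the escaping in a single pass; a minimal sketch:

import re

match_list = ['971', '(84)', '66', '(452)', '19']
pattern = '(' + '|'.join(re.escape(s) for s in match_list) + ')'
# pattern is now (971|\(84\)|66|\(452\)|19)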
You can construct a lambda function that checks if a value is in the text, and then map this function to all of the values:
text = "Tel:+971-2526-821 Fax:+971-2526-821"
print any(map((lambda x: x in text), ["971", "(84)", "+66", "(452)", "19 "]))
The result is True, which means at least one of the values is in text.
I have an output string like this:
read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec
And I want to just extract one of the numerical values for computation, say iops. I'm processing it like this:
if 'read ' in key:
    my_read_iops = value.split(",")[2].split("=")[1]
    result['test_details']['read'] = my_read_iops
But there are slight inconsistencies with some of the strings I'm reading in and my code is getting super complicated and verbose. So instead of manually counting the number of commas vs "=" chars, what's a better way to handle this?
You can use the regular expression \s*, which matches zero or more whitespace characters, to handle the inconsistent spacing:
import re

s = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'

for m in re.finditer(r'\s*(?P<name>\w*)\s*=\s*(?P<value>[\w/]*)\s*', s):
    print(m.group('name'), m.group('value'))

# io 131220KB
# bw 14016KB/s
# iops 3504
# runt 9362msec
Using named groups, you can construct the pattern string from a list of column names and do it like this:
names = ['io', 'bw', 'iops', 'runt']
name_val_pat = r'\s*{name}\s*=\s*(?P<{group_name}>[\w/]*)\s*'
pattern = ','.join([name_val_pat.format(name=name, group_name=name) for name in names])
# '\s*io\s*=\s*(?P<io>[\w/]*)\s*,\s*bw\s*=\s*(?P<bw>[\w/]*)\s*,\s*iops\s*=\s*(?P<iops>[\w/]*)\s*,\s*runt\s*=\s*(?P<runt>[\w/]*)\s*'
match = re.search(pattern, s)
data_dict = {name: match.group(name) for name in names}
print(data_dict)
# {'io': '131220KB', 'bw': '14016KB/s', 'runt': '9362msec', 'iops': '3504'}
In this way, you only need to change names and keep the order correct.
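If the order of the fields might vary, a simpler sketch (assuming every field follows the key=value shape shown above) collects them all into a dict in one pass:

import re

s = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
data_dict = dict(re.findall(r'(\w+)\s*=\s*([\w/]+)', s))
print(data_dict['iops'])  # 3504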
If I were you, I'd use a regex (regular expression) as the first choice.
import re
s= "read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec"
re.search(r"iops=(\d+)",s).group(1)
This code finds the string pattern that starts with 'iops=' and continues with at least one digit. The target string (3504) is extracted using the round brackets (a capturing group).
You can find more information about regex at https://docs.python.org/3.6/library/re.html#module-re. Regex is a powerful language for complex pattern matching with simple syntax.
from re import match
string = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
iops = match(r'.+(iops=)([0-9]+)', string).group(2)
iops
'3504'
What are the most efficient ways to extract text from a string? Are there some available functions or regex expressions, or some other way?
For example, my string is below and I want to extract the IDs as well as the ScreenNames, separately.
[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]
Thank you!
Edit: These are the text strings that I want to pull. I want them to be in a list.
Target_IDs = 1234567890, 233323490, 4459284
Target_ScreenNames = RandomNameHere, AnotherRandomName, YetAnotherName
import re

# use s rather than str so the builtin str is not shadowed
s = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'

print 'Target IDs = ' + ','.join(re.findall(r'ID=(\d+)', s))
print 'Target ScreenNames = ' + ','.join(re.findall(r' ScreenName=(\w+)', s))
Output :
Target IDs = 1234567890,233323490,4459284
Target ScreenNames = RandomNameHere,AnotherRandomName,YetAnotherName
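Since the question asks for the values in lists, note that re.findall already returns a list, so a minimal sketch would be:

import re

s = '[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]'
target_ids = re.findall(r'ID=(\d+)', s)                   # ['1234567890', '233323490', '4459284']
target_screen_names = re.findall(r'ScreenName=(\w+)', s)  # ['RandomNameHere', 'AnotherRandomName', 'YetAnotherName']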
It depends. Assuming that all your text comes in the form of
TagName = TagValue1, TagValue2, ...
You need just two calls to split.
tag, value_string = string.split('=')
values = value_string.split(',')
Remove the excess space (probably a couple of rstrip()/lstrip() calls will suffice) and you are done. Or you can take regex. They are slightly more powerful, but in this case I think it's a matter of personal taste.
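For instance, a minimal sketch of the split-and-strip approach, using one of the target lines from the question as the assumed input shape:

string = "Target_IDs = 1234567890, 233323490, 4459284"
tag, value_string = string.split('=')
values = [v.strip() for v in value_string.split(',')]
print(tag.strip(), values)  # Target_IDs ['1234567890', '233323490', '4459284']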
If you want more complex syntax with nonterminals, terminals and all that, you'll need lex/yacc, which will require some background in parsers. A rather interesting thing to play with, but not something you'll want to use for storing program options and such.
The regex I'd use would be:
(?:ID=|ScreenName=)+(\d+|[\w\d]+)
However, this assumes that ID is only digits (\d) and usernames are only letters or numbers ([\w\d]).
This regex (when combined with re.findall) would return a list of matches that could be iterated through and sorted in some fashion like so:
import re

s = "[User(ID=1234567890, ScreenName=RandomNameHere), User(ID=233323490, ScreenName=AnotherRandomName), User(ID=4459284, ScreenName=YetAnotherName)]"
pattern = re.compile(r'(?:ID=|ScreenName=)+(\d+|[\w\d]+)')

ids = []
names = []

for p in re.findall(pattern, s):
    if p.isnumeric():
        ids.append(p)
    else:
        names.append(p)

print(ids, names)
Say I define a string in Python like the following:
my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
I would like to parse that string in Python in a way that allows me to index the different structures of the language.
For example, the output could be a dictionary parsing_result that allows me to index the different elements in a structured manner.
For example, the following:
parsing_result['names']
would hold a list of strings: ['name1', 'name2']
whereas parsing_result['options'] would hold a dictionary so that:
parsing_result['something']['options']['opt2'] holds the string "text"
parsing_result['something_else']['options']['opt1'] holds the string "58"
My first question is: How do I approach this problem in Python? Are there any libraries that simplify this task?
For a working example, I am not necessarily interested in a solution that parses the exact syntax I defined above (although that would be fantastic), but anything close to it would be great.
Update
It looks like the general right solution is using a parser and a lexer such as ply (thank you @Joran), but the documentation is a bit intimidating. Is there an easier way of getting this done when the syntax is lightweight?
I found this thread where the following regular expression is provided to partition a string around outer commas:
r = re.compile(r'(?:[^,(]|\([^)]*\))+')
r.findall(s)
But this assumes that the grouping characters are () (and not {}). I am trying to adapt it, but it doesn't look easy.
I highly recommend pyparsing:
The pyparsing module is an alternative approach to creating and
executing simple grammars, vs. the traditional lex/yacc approach, or
the use of regular expressions.
The Python representation of the grammar is quite
readable, owing to the self-explanatory class names, and the use of
'+', '|' and '^' operator definitions. The parsed results returned from parseString() can be accessed as a nested list, a dictionary, or an object with named attributes.
Sample code (Hello world from the pyparsing docs):
from pyparsing import Word, alphas
greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
hello = "Hello, World!"
print (hello, "->", greet.parseString( hello ))
Output:
Hello, World! -> ['Hello', ',', 'World', '!']
Edit: Here's a solution to your sample language:
from pyparsing import *
import json
identifier = Word(alphas + nums + "_")
expression = identifier("lhs") + Suppress("=") + identifier("rhs")
struct_vals = delimitedList(Group(expression | identifier))
structure = Group(identifier + nestedExpr(opener="{", closer="}", content=struct_vals("vals")))
grammar = delimitedList(structure)
my_string = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
parse_result = grammar.parseString(my_string)
result_list = parse_result.asList()
def list_to_dict(l):
    d = {}
    for struct in l:
        d[struct[0]] = {}
        for ident in struct[1]:
            if len(ident) == 2:
                d[struct[0]][ident[0]] = ident[1]
            elif len(ident) == 1:
                d[struct[0]][ident[0]] = None
    return d

print json.dumps(list_to_dict(result_list), indent=2)
Output: (pretty printed as JSON)
{
  "something_else": {
    "opt1": "58",
    "name3": null
  },
  "something": {
    "opt1": "2",
    "opt2": "text",
    "name2": null,
    "name1": null
  }
}
Use the pyparsing API as your guide to exploring the functionality of pyparsing and understanding the nuances of my solution. I've found that the quickest way to master this library is trying it out on some simple languages you think up yourself.
As stated by @Joran Beasley, you'd really want to use a parser and a lexer. They are not easy to wrap your head around at first, so you'd want to start off with a very simple tutorial on them.
If you are really trying to write a light weight language, then you're going to want to go with parser/lexer, and learn about context-free grammars.
If you are really just trying to write a program to strip data out of some text, then regular expressions would be the way to go.
If this is not a programming exercise, and you are just trying to get structured data in text format into python, check out JSON.
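For instance, a minimal sketch of what the same data might look like if it were stored as JSON instead of the custom syntax (the exact structure chosen here is an assumption):

import json

raw = '{"something": {"names": ["name1", "name2"], "options": {"opt1": "2", "opt2": "text"}}}'
parsing_result = json.loads(raw)
print(parsing_result["something"]["options"]["opt2"])  # text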
Here is a test of the regular expression modified to react to {} instead of ():
import re
s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"
r = re.compile(r'(?:[^,{]|{[^}]*})+')
print r.findall(s)
You'll get a list of separate 'named blocks' as a result:
['something{name1, name2, opt1=2, opt2=text}', ' something_else{name3, opt1=58}']
I've written better code that can parse your simple example; in real use you should, for example, catch exceptions to detect syntax errors and restrict what counts as valid block names and parameter names:
import re

s = "something{name1, name2, opt1=2, opt2=text}, something_else{name3, opt1=58}"

r = re.compile(r'(?:[^,{]|{[^}]*})+')
rblock = re.compile(r'\s*(\w+)\s*{(.*)}\s*')
rparam = re.compile(r'\s*([^=\s]+)\s*(=\s*([^,]+))?')

blocks = r.findall(s)
for block in blocks:
    resb = rblock.match(block)
    blockname = resb.group(1)
    blockargs = resb.group(2)
    print "block name=", blockname
    print "args:"
    for arg in re.split(",", blockargs):
        resp = rparam.match(arg)
        paramname = resp.group(1)
        paramval = resp.group(3)
        if paramval is None:
            print "param name =\"{0}\" no value".format(paramname)
        else:
            print "param name =\"{0}\" value=\"{1}\"".format(paramname, str(paramval))