I have the following string s = "~ VERSION 11 11 11.1 222 22 22.222"
I want to extract the following into these variables:
string Variable1 = "11 11 11.1"
string Variable2 = "222 22 22.222"
How do I extract this with a regular expression? Or is there a better alternative? (Note: there may be variable spacing between the tokens I want to extract, and the leading character may be something other than a ~, but it will definitely be a symbol.)
For example, it could be:
~ VERSION 11 11 11.1 222 22 22.222
$ VERSION 11 11 11.1 222 22 22.222
# VERSION 11 11 11.1 222 22 22.222
If a regular expression does not make sense for this, or if there is a better way, please recommend it.
How do I perform the extraction into those two variables in Python?
Try this:
import re
test_lines = """
~ VERSION 11 11 11.1 222 22 22.222
$ VERSION 11 11 11.1 222 22 22.222
# VERSION 11 11 11.1 222 22 22.222
"""
version_pattern = re.compile(r"""
[~!@#$%^&*()] # Starting symbol
\s+ # Some amount of whitespace
VERSION # the specific word "VERSION"
\s+ # Some amount of whitespace
(\d+\s+\d+\s+\d+\.\d+) # First capture group
\s+ # Some amount of whitespace
(\d+\s+\d+\s+\d+\.\d+) # Second capture group
""", re.VERBOSE)
lines = test_lines.split('\n')
for line in lines:
    m = version_pattern.match(line)
    if m:
        print(line)
        print(m.groups())
which gives output:
~ VERSION 11 11 11.1 222 22 22.222
('11 11 11.1', '222 22 22.222')
$ VERSION 11 11 11.1 222 22 22.222
('11 11 11.1', '222 22 22.222')
# VERSION 11 11 11.1 222 22 22.222
('11 11 11.1', '222 22 22.222')
Note the use of verbose regular expressions with comments.
To convert the extracted version numbers to their numeric representation (i.e. int, float), use the regexp in @Preet Kukreti's answer and convert using int() or float() as suggested.
You can use the split method of str.
v1 = "~ VERSION 11 11 11.1 222 22 22.222"
res_arr = v1.split() # gives ['~', 'VERSION', '11', '11', '11.1', '222', '22', '22.222']; with no argument, split also copes with variable spacing
Then use elements 2-4 and 5-7 as you want.
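As a minimal sketch of that idea, assuming the line always holds exactly eight whitespace-separated tokens (symbol, the word VERSION, then two groups of three numbers). The variable names are illustrative only.

```python
s = "~   VERSION  11 11   11.1 222  22 22.222"

# split() with no argument splits on any run of whitespace
tokens = s.split()

# tokens[0] is the leading symbol and tokens[1] is "VERSION"
variable1 = " ".join(tokens[2:5])
variable2 = " ".join(tokens[5:8])

print(variable1)  # 11 11 11.1
print(variable2)  # 222 22 22.222
```

Joining with a single space also normalizes any irregular spacing inside each group.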
import re
pattern_string = r"(\d+)\s+(\d+)\s+([\d\.]+)" #is the regex you are probably after
m = re.match(pattern_string, "222 22 22.222")
groups = None
if m:
    groups = m.groups()
# groups is ('222', '22', '22.222')
After this you can use int() and float() to convert to primitive numeric types if needed. For performant code you might want to precompile the regex beforehand with re.compile(...), and then call match(...) or search(...) on the resulting compiled regex object.
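A small sketch of the precompile-and-convert step, assuming each piece of text holds exactly one "int int float" triple. The helper name parse_triple is hypothetical and not part of the answer above.

```python
import re

# Compile once and reuse; this is the same pattern as in the answer above
TRIPLE = re.compile(r"(\d+)\s+(\d+)\s+([\d\.]+)")

def parse_triple(text):
    """Return (int, int, float) for text like '222 22 22.222', or None."""
    m = TRIPLE.match(text)
    if m is None:
        return None
    a, b, c = m.groups()
    return int(a), int(b), float(c)

print(parse_triple("222 22 22.222"))  # (222, 22, 22.222)
```

Note that [\d\.]+ would also accept something like 1.2.3, which float() rejects, so a stricter pattern such as \d+\.\d+ may be preferable for untrusted input.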
It is definitely easy with a regular expression. Here is one way to do it:
>>> st="~ VERSION 11 11 11.1 222 22 22.222 333 33 33.3333"
>>> re.findall(r"(\d+[ ]+\d+[ ]+\d+\.\d+)",st)
['11 11 11.1', '222 22 22.222', '333 33 33.3333']
Once you get the result(s) in a list you can index and get the individual strings.
apb = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
for i in range(26):
    s = apb[i:26] + apb[0:i]
    print("{:2d} {} ".format(i, s))
Supposed to output this
Sorry, I just started learning Python and this may seem like a dumb question. I tried googling, but the results keep telling me it has something to do with 2D arrays, and I definitely know that's not the answer I am looking for.
I understand everything until the last line.
What does: print("{:2d} {} ".format(i, s)) do?
The format method replaces each {} placeholder with the corresponding argument. {:2d} is similar to the %2d printf format specifier in C: it formats an integer right-aligned in a field at least 2 characters wide. For example, '{:2d}'.format(2) produces ' 2'. You could use plain {} instead, which prints the same values, but the single-digit rows would no longer line up with the two-digit ones. With {:2d}:
 0 ABCDEFGHIJKLMNOPQRSTUVWXYZ
 1 BCDEFGHIJKLMNOPQRSTUVWXYZA
 2 CDEFGHIJKLMNOPQRSTUVWXYZAB
 3 DEFGHIJKLMNOPQRSTUVWXYZABC
 4 EFGHIJKLMNOPQRSTUVWXYZABCD
 5 FGHIJKLMNOPQRSTUVWXYZABCDE
 6 GHIJKLMNOPQRSTUVWXYZABCDEF
 7 HIJKLMNOPQRSTUVWXYZABCDEFG
 8 IJKLMNOPQRSTUVWXYZABCDEFGH
 9 JKLMNOPQRSTUVWXYZABCDEFGHI
10 KLMNOPQRSTUVWXYZABCDEFGHIJ
11 LMNOPQRSTUVWXYZABCDEFGHIJK
12 MNOPQRSTUVWXYZABCDEFGHIJKL
13 NOPQRSTUVWXYZABCDEFGHIJKLM
14 OPQRSTUVWXYZABCDEFGHIJKLMN
15 PQRSTUVWXYZABCDEFGHIJKLMNO
16 QRSTUVWXYZABCDEFGHIJKLMNOP
17 RSTUVWXYZABCDEFGHIJKLMNOPQ
18 STUVWXYZABCDEFGHIJKLMNOPQR
19 TUVWXYZABCDEFGHIJKLMNOPQRS
20 UVWXYZABCDEFGHIJKLMNOPQRST
21 VWXYZABCDEFGHIJKLMNOPQRSTU
22 WXYZABCDEFGHIJKLMNOPQRSTUV
23 XYZABCDEFGHIJKLMNOPQRSTUVW
24 YZABCDEFGHIJKLMNOPQRSTUVWX
25 ZABCDEFGHIJKLMNOPQRSTUVWXY
With {}:
0 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1 BCDEFGHIJKLMNOPQRSTUVWXYZA
2 CDEFGHIJKLMNOPQRSTUVWXYZAB
3 DEFGHIJKLMNOPQRSTUVWXYZABC
4 EFGHIJKLMNOPQRSTUVWXYZABCD
5 FGHIJKLMNOPQRSTUVWXYZABCDE
6 GHIJKLMNOPQRSTUVWXYZABCDEF
7 HIJKLMNOPQRSTUVWXYZABCDEFG
8 IJKLMNOPQRSTUVWXYZABCDEFGH
9 JKLMNOPQRSTUVWXYZABCDEFGHI
10 KLMNOPQRSTUVWXYZABCDEFGHIJ
11 LMNOPQRSTUVWXYZABCDEFGHIJK
12 MNOPQRSTUVWXYZABCDEFGHIJKL
13 NOPQRSTUVWXYZABCDEFGHIJKLM
14 OPQRSTUVWXYZABCDEFGHIJKLMN
15 PQRSTUVWXYZABCDEFGHIJKLMNO
16 QRSTUVWXYZABCDEFGHIJKLMNOP
17 RSTUVWXYZABCDEFGHIJKLMNOPQ
18 STUVWXYZABCDEFGHIJKLMNOPQR
19 TUVWXYZABCDEFGHIJKLMNOPQRS
20 UVWXYZABCDEFGHIJKLMNOPQRST
21 VWXYZABCDEFGHIJKLMNOPQRSTU
22 WXYZABCDEFGHIJKLMNOPQRSTUV
23 XYZABCDEFGHIJKLMNOPQRSTUVW
24 YZABCDEFGHIJKLMNOPQRSTUVWX
25 ZABCDEFGHIJKLMNOPQRSTUVWXY
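To see the difference between the two placeholders in isolation, here is a short sketch. The variable names are illustrative, and repr() is used only to make the padding space visible.

```python
padded = "{:2d}".format(2)   # minimum width 2, right-aligned
plain = "{}".format(2)       # no minimum width
print(repr(padded), repr(plain))  # ' 2' '2'

# f-strings (Python 3.6+) accept the same format specification
i = 2
same = f"{i:2d}"
print(same == padded)  # True
```

Two-digit values such as 10 already fill the field, so both placeholders print them identically.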
I am trying to split the following into table columns, but I am running into issues because the name has 2 spaces in it; otherwise "[\s]{2,}" would work. I also can't ignore whitespace between letters, since the 1st column ends with a letter and the 2nd column starts with a letter.
I would like to skip any whitespace in between letters after the 1st occurrence.
String:
> TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00
TESTID
DR5 777777 0
50000
TEST NAME
23.40
600000.00
1000000 20 5
09 05 18
09 07 18
3876.00
I will try to solve your stated problem (vs the regex thing), because I don't fully understand the regex question.
If I were going to make that string into a list, I would do it like this:
my_str = "TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00"
my_list = [section for section in my_str.split(" ") if section != ""]
This uses a list comprehension to filter out the empty strings that split produces when delimiters are adjacent.
You can also use a regular expression as the separator.
import re
my_str = "TESTID DR5 777777 0 50000 TEST NAME 23.40 600000.00 1000000 20 5 09 05 18 09 07 18 3876.00"
my_list = re.split(r'\s{2,}', my_str) # no space inside {2,}; with a space the braces are matched literally
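A minimal sketch of the regex separator on a hypothetical row. It assumes columns are separated by two or more spaces and that any space inside a column is a single space, which may not hold for the real data in the question.

```python
import re

# Hypothetical row: 2+ spaces between columns, single spaces inside a column
row = "TESTID  DR5 777777 0   50000  TEST NAME  23.40"

columns = re.split(r"\s{2,}", row.strip())
print(columns)
# ['TESTID', 'DR5 777777 0', '50000', 'TEST NAME', '23.40']
```

If a column value can itself contain a double space, this approach cannot tell it apart from a column boundary, and fixed column positions would be needed instead.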
I have an input file as following:
75647485 10 20 13 12 14 17 13 16
63338495 15 20 11 17 18 20 17 20
00453621 3 10 4 10 20 18 15 10
90812341 18 18 16 20 8 20 7 15
I need to find the mean of each row starting from the second element till the end [1:8] and give the output as:
ID Mean Lowest number Highest number
75647485 14.37 10 20
90812341 ... ... ...
I am new to Python, so can someone please help? I don't need to write the output to a file; just displaying it on the console would work.
thank you
array = [ [int(s) for s in line.split()] for line in open('file') ]
for line in array:
    print('%08i %3.1f %3i %3i' % (line[0], sum(line[1:])/len(line[1:]), min(line[1:]), max(line[1:])))
This produces the output:
75647485 14.4 10 20
63338495 17.2 11 20
00453621 11.2 3 20
90812341 15.2 7 20
Alternate Version
To ensure that the file handle is properly closed, this version uses with. Also, string formatting is done with the more modern format method:
with open('file') as f:
    array = [ [int(s) for s in line.split()] for line in f ]
for line in array:
    print('{:08d} {:3.1f} {:3d} {:3d}'.format(line[0], sum(line[1:])/len(line[1:]), min(line[1:]), max(line[1:])))
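The per-row calculation can also be pulled into a small function so it is easy to check on its own. This is only a sketch: row_stats and the column widths are hypothetical choices, and the sample text stands in for the input file.

```python
def row_stats(line):
    """Return (id, mean, lowest, highest) for one whitespace-separated row."""
    fields = line.split()
    row_id = fields[0]                     # keep as text so leading zeros survive
    scores = [int(x) for x in fields[1:]]  # everything after the ID
    return row_id, sum(scores) / len(scores), min(scores), max(scores)

sample = """75647485 10 20 13 12 14 17 13 16
00453621 3 10 4 10 20 18 15 10"""

print("{:<10}{:>6}{:>8}{:>9}".format("ID", "Mean", "Lowest", "Highest"))
for line in sample.splitlines():
    row_id, mean, low, high = row_stats(line)
    print("{:<10}{:>6.2f}{:>8d}{:>9d}".format(row_id, mean, low, high))
```

Keeping the ID as a string avoids the zero-padding format needed when it is parsed as an integer.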
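You can do this by using numpy, where mylist is one row already parsed into numbers:
import numpy
numpy.mean(mylist[1:]) # skip the ID at index 0; [1:8] would drop the last value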
from collections import namedtuple
# one field for the ID plus one per number in the row
FileRecord = namedtuple('FileRecord', 'record_id, num1, num2, num3, num4, num5, num6, num7, num8')
with open("file.txt") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        record = FileRecord._make(line.split())
        values = [int(v) for v in record[1:]]
        lowest = min(values)
        highest = max(values)
        mean = sum(values) / len(values)
        print(record.record_id, mean, lowest, highest)
I would recommend using pandas, which scales better and offers many more features. It is also built on numpy.
import pandas as pd
x='''75647485 10 20 13 12 14 17 13 16
63338495 15 20 11 17 18 20 17 20
00453621 3 10 4 10 20 18 15 10
90812341 18 18 16 20 8 20 7 15'''
from io import StringIO # Python 3; on Python 2.7 use: from cStringIO import StringIO
df = pd.read_csv(StringIO(x), sep=r'\s+', header=None, index_col=0)
print(df.T.max())
#75647485 20
#63338495 20
#453621 20
#90812341 20
print(df.T.min())
#75647485 10
#63338495 11
#453621 3
#90812341 7
print(df.T.mean())
#75647485 14.375
#63338495 17.250
#453621 11.250
#90812341 15.250
I'm trying to use the timeit module in Python (EDIT: We are using Python 3) to decide between a couple of different code flows. In our code, we have a series of if-statements that test for the existence of a character code in a string, and if it's there replace it like this:
if "<substring>" in str_var:
    str_var = str_var.replace("<substring>", "<new_substring>")
We do this a number of times for different substrings. We're debating between that and using just the replace like this:
str_var = str_var.replace("<substring>", "<new_substring>")
We tried to use timeit to determine which one was faster. If the first code-block above is "stmt1" and the second is "stmt2", and our setup string looks like
str_var = '<string><substring><more_string>',
our timeit statements will look like this:
timeit.timeit(stmt=stmt1, setup=setup)
and
timeit.timeit(stmt=stmt2, setup=setup)
Now, running it just like that, on 2 of our laptops (same hardware, similar processing load) stmt1 (the statement with the if-statement) runs faster even after multiple runs (3-4 hundredths of a second vs. about a quarter of a second for stmt2).
However, if we define functions to do both things (including the setup creating the variable) like so:
def foo():
    str_var = '<string><substring><more_string>'
    if "<substring>" in str_var:
        str_var = str_var.replace("<substring>", "<new_substring>")
and
def foo2():
    str_var = '<string><substring><more_string>'
    str_var = str_var.replace("<substring>", "<new_substring>")
and run timeit like:
timeit.timeit("foo()", setup="from __main__ import foo")
timeit.timeit("foo2()", setup="from __main__ import foo2")
the version without the if-statement (foo2) runs faster, contradicting the results we got without functions.
Are we missing something about how timeit works? Or how Python handles a case like this?
Edit: here is our actual code:
>>> def foo():
        s = "hi 1 2 3"
        s = s.replace('1','5')
>>> def foo2():
        s = "hi 1 2 3"
        if '1' in s:
            s = s.replace('1','5')
>>> timeit.timeit(foo, "from __main__ import foo")
0.4094226634183542
>>> timeit.timeit(foo2, "from __main__ import foo2")
0.4815539780738618
vs this code:
>>> timeit.timeit("""s = s.replace("1","5")""", setup="s = 'hi 1 2 3'")
0.18738432400277816
>>> timeit.timeit("""if '1' in s: s = s.replace('1','5')""", setup="s = 'hi 1 2 3'")
0.02985000199987553
I think I've got it.
Look at this code:
timeit.timeit("""if '1' in s: s = s.replace('1','5')""", setup="s = 'hi 1 2 3'")
In this code, setup is run exactly once, so s behaves like a global that persists across iterations. As a result, it gets modified to hi 5 2 3 in the first iteration, and the in test returns False for all successive iterations.
See this code:
timeit.timeit("""if '1' in s: s = s.replace('1','5'); print(s)""", setup="s = 'hi 1 2 3'")
This will print out hi 5 2 3 a single time because the print is part of the if statement. Contrast this, which will fill up your screen with a ton of hi 5 2 3s:
timeit.timeit("""s = s.replace("1","5"); print(s)""", setup="s = 'hi 1 2 3'")
So the problem here is that the non-function test with the if is flawed and is giving you misleading timings, unless repeated calls on an already processed string are what you were trying to test. (If that is what you were trying to test, your function versions are flawed.) The reason the function with the if doesn't fare better is that it runs the replace on a fresh copy of the string for each iteration.
The following test does what I believe you intended since it doesn't re-assign the result of the replace back to s, leaving it unmodified for each iteration:
>>> timeit.timeit("""if '1' in s: s.replace('1','5')""", setup="s = 'hi 1 2 3'")
0.3221409016812231
>>> timeit.timeit("""s.replace('1','5')""", setup="s = 'hi 1 2 3'")
0.28558505721252914
This change adds a lot of time to the if test and a little bit of time to the non-if test for me, but I'm using Python 2.7. If the Python 3 results are consistent, though, these results suggest that the in check saves a lot of time when the strings rarely need any replacing. If they usually do require replacement, the in check appears to cost a little bit of time.
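The effect of the persistent s can be checked without relying on timings. The sketch below counts how often replace actually runs; the hits and fresh_hits lists are hypothetical helpers passed in through timeit's globals argument (available in Python 3.5+).

```python
import timeit

hits = []  # one entry is appended every time replace() actually runs
timeit.timeit(
    "if '1' in s: s = s.replace('1', '5'); hits.append(1)",
    setup="s = 'hi 1 2 3'",
    number=1000,
    globals={"hits": hits},
)
print(len(hits))  # 1: after the first pass s no longer contains '1'

fresh_hits = []
timeit.timeit(
    # rebuilding s inside the statement gives every iteration a fresh string
    "s = 'hi 1 2 3'\nif '1' in s: s = s.replace('1', '5'); fresh_hits.append(1)",
    number=1000,
    globals={"fresh_hits": fresh_hits},
)
print(len(fresh_hits))  # 1000
```

So in the original non-function test, 999 out of every 1000 iterations measured only the failed in check.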
This is made even weirder by looking at the disassembled code. The second block below is the if version (which clocks in faster for me using timeit, just as in the OP's example).
Yet, looking at the op codes, it simply appears to have 7 extra op codes, starting with the first BUILD_MAP and also involving one extra POP_JUMP_IF_TRUE (presumably for the if statement check itself). Before and after that, all codes are the same.
This would suggest that building and performing the check in the if statement somehow reduces the computation time for then checking within the call to replace. How can we see specific timing information for the different op codes?
In [55]: dis.disassemble_string("s='HI 1 2 3'; s = s.replace('1','4')")
0 POP_JUMP_IF_TRUE 10045
3 PRINT_NEWLINE
4 PRINT_ITEM_TO
5 SLICE+2
6 <49>
7 SLICE+2
8 DELETE_SLICE+0
9 SLICE+2
10 DELETE_SLICE+1
11 <39>
12 INPLACE_MODULO
13 SLICE+2
14 POP_JUMP_IF_TRUE 15648
17 SLICE+2
18 POP_JUMP_IF_TRUE 29230
21 LOAD_NAME 27760 (27760)
24 STORE_GLOBAL 25955 (25955)
27 STORE_SLICE+0
28 <39>
29 <49>
30 <39>
31 <44>
32 <39>
33 DELETE_SLICE+2
34 <39>
35 STORE_SLICE+1
In [56]: dis.disassemble_string("s='HI 1 2 3'; if '1' in s: s = s.replace('1','4')")
0 POP_JUMP_IF_TRUE 10045
3 PRINT_NEWLINE
4 PRINT_ITEM_TO
5 SLICE+2
6 <49>
7 SLICE+2
8 DELETE_SLICE+0
9 SLICE+2
10 DELETE_SLICE+1
11 <39>
12 INPLACE_MODULO
13 SLICE+2
14 BUILD_MAP 8294
17 <39>
18 <49>
19 <39>
20 SLICE+2
21 BUILD_MAP 8302
24 POP_JUMP_IF_TRUE 8250
27 POP_JUMP_IF_TRUE 15648
30 SLICE+2
31 POP_JUMP_IF_TRUE 29230
34 LOAD_NAME 27760 (27760)
37 STORE_GLOBAL 25955 (25955)
40 STORE_SLICE+0
41 <39>
42 <49>
43 <39>
44 <44>
45 <39>
46 DELETE_SLICE+2
47 <39>
48 STORE_SLICE+1
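One caveat on the listings above: dis.disassemble_string expects raw bytecode, so it appears the source text itself was decoded as if it were opcodes (the <39> and <49> entries are the character codes of ' and 1). A sketch of disassembling the two snippets after compiling them first, assuming Python 3.4+ for dis.get_instructions; the helper name opnames is hypothetical, and exact opcode names vary between Python versions.

```python
import dis

plain_src = "s = 'HI 1 2 3'\ns = s.replace('1', '4')"
if_src = "s = 'HI 1 2 3'\nif '1' in s:\n    s = s.replace('1', '4')"

def opnames(source):
    """Compile source text and return the names of its bytecode instructions."""
    code = compile(source, "<example>", "exec")
    return [ins.opname for ins in dis.get_instructions(code)]

plain_ops = opnames(plain_src)
if_ops = opnames(if_src)
print(len(plain_ops), len(if_ops))

# Human-readable listing of the if version
dis.dis(compile(if_src, "<example>", "exec"))
```

The if version still has more instructions (the membership test and a conditional jump), which is consistent with it being slower whenever the replacement actually happens.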
How can I remove special characters and letters from a line read from a text file while preserving the whitespaces? Let's say we have the following contents in a file:
16 ` C38# 26535 2010 4 14 2 7 7 3 8^#1 2
15 100 140 30 $ 14^]
(2003 2 ! -6 �021 0 � 14 ! 2 3! 1 0 35454
0$ ^#0 0 0 "0 "63 194 (56 188 26 27" 24 0 0 10� 994! 8 58
0 0 " � 0 0 32�47 32767 32767 ! 1
The output basically should be:
16 38 26535 2010 4 14 2 7 7 3 8 1 2
15 100 140 30 14
2003 2 -6 021 0 14 2 3 1 0 35454
0 0 0 0 0 63 194 56 188 26 27 24 0 0 10 994 8 58
0 0 0 0 32 47 32767 32767 1
What's the most straightforward way to do this?
import re
output_string = re.sub(r'[^\d\s-]', '', input_string)
The pattern [^\d\s-] will match anything that's not a digit, dash, or whitespace - thus, replacing any match with an empty string will remove everything except the numbers (including minus signs) and whitespace.
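Applied to one of the sample lines, this looks as follows. It is a sketch only; repr() is used to make the leftover spacing visible.

```python
import re

line = "15 100 140 30 $ 14^]"
cleaned = re.sub(r"[^\d\s-]", "", line)
print(repr(cleaned))  # '15 100 140 30  14'
```

The spaces on both sides of the removed $ are preserved, so two consecutive spaces remain between 30 and 14.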
If you want to keep just digits, plus and minus signs, and all whitespace, simplest might be
import re
...
line = re.sub(r'[^\d\s+-]+', '', line)
which reads "replace each sequence of one or more characters that are not digits, whitespace, plus or minus signs with nothing".
Faster would be the translate method of strings, but it is quite a bit less simple to set up, so, since you ask for "straightforward", I suggest the re approach (now brace for the sure-to-come screeches of the re-haters...;-).
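For comparison, a sketch of what the translate setup might look like in Python 3. The KeepOnly class is a hypothetical helper: it relies on str.translate deleting any character whose table lookup returns None.

```python
import string

class KeepOnly(dict):
    """Translation table that deletes every character not listed as a key."""
    def __missing__(self, key):
        return None  # None tells str.translate to drop the character

allowed = string.digits + string.whitespace + "+-"
table = KeepOnly((ord(ch), ch) for ch in allowed)

line = "(2003 2 ! -6"
print(repr(line.translate(table)))  # '2003 2  -6'
```

The table only needs to be built once and can then be reused for every line of the file.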
import string
''.join(x for x in s if x in string.digits+string.whitespace)
or if what you really want is a list of the numbers:
import re
re.findall(r'\d+', s)
LOL @Alex's regex comment... hopefully there aren't too many haters. With that said, however, although they're faster because they're executed in C, regexes aren't my first choice... perhaps I've been biased by the famous jwz quote: '''Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.'''
I will say that solving this homework exercise is tricky because solutions are fraught with errors, as seen in the existing solutions so far. Perhaps this is serendipity because it requires the OP to debug and correct those suggestions instead of just cutting-and-pasting them verbatim into their assignment solution.
As far as the problems go, they include but are not limited to:
leaving successive spaces
removing negative signs, and
merging multiple numbers together
Bottom line... which solutions do I like best? I would start with one of the following and debug from there:
For regex, I'll pick:
@Alex's solution, or @Matt's if I want just the data instead of the "golden" string
For string processing, I'll modify @Matt's solution to:
import string
keep = set(string.whitespace+string.digits+'+-')
line = ''.join(x for x in line if x in keep)
Finally, @Greg has a good point. Without a clear spec, these are just partial solutions.
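One possible way around the three problems listed above, sketched under the assumption that the desired output is the one shown in the question (each run of junk acts as a separator, and numbers end up separated by single spaces). The function name clean_line is hypothetical.

```python
import re

def clean_line(line):
    """Keep digits and signs; turn each run of other characters into one space."""
    spaced = re.sub(r"[^\d\s+-]+", " ", line)  # junk becomes a separator
    return " ".join(spaced.split())            # collapse successive whitespace

print(clean_line("16 ` C38# 26535 2010 4 14 2 7 7 3 8^#1 2"))
# 16 38 26535 2010 4 14 2 7 7 3 8 1 2
```

Replacing junk with a space rather than with nothing is what keeps 8^#1 from merging into 81. Note that this collapses the original column spacing, so it does not apply if the exact whitespace has to be preserved.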