Extracting float numbers from a file using Python

I have a .txt file which looks like:
[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]
[ 4.34562117e-01 -1.04469577e-01 2.83633101e-01 1.00452355e-01 -7.12572469e-01 -4.99234705e-01 -1.93152897e-01 1.80787567e-02]
I need to extract all the floats from it and put them into a list/array.
What I've done is this:
A = []
for line in open("general.txt", "r").read().split(" "):
    for unit in line.split("]", 3):
        A.append(list(map(lambda x: str(x), unit.replace("[", "").replace("]", "").split(" "))))
But A contains elements like [''] or, even worse, ['3.20973096e-02\n']. These are all strings, but I need floats. How can I do that?

Why not use a regular expression?
>>> import re
>>> e = r'(\d+\.\d+e?(?:\+|-)\d{2}?)'
>>> results = re.findall(e, your_string)
>>> results
['5.44339373e+00',
'2.77404404e-01',
'1.26122094e-01',
'9.83589873e-01',
'1.95201179e-01',
'4.49866890e-01',
'2.06423297e-01',
'1.04780491e+00',
'4.34562117e-01',
'1.04469577e-01',
'2.83633101e-01',
'1.00452355e-01',
'7.12572469e-01',
'4.99234705e-01',
'1.93152897e-01',
'1.80787567e-02']
Now, these are the matched strings, but you can easily convert them to floats:
>>> list(map(float, re.findall(e, your_string)))
[5.44339373,
0.277404404,
0.126122094,
0.983589873,
0.195201179,
0.44986689,
0.206423297,
1.04780491,
0.434562117,
0.104469577,
0.283633101,
0.100452355,
0.712572469,
0.499234705,
0.193152897,
0.0180787567]
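Note that the regular expression needs some tweaking, but it's a good start: as written it does not capture a leading minus sign, so negative values in the input come back as positive numbers.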
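One possible tweak, as a minimal sketch rather than a definitive pattern: allow an optional sign in front of the digits and in the exponent. The names FLOAT_RE, extract_floats and sample are made up for this illustration, and the sample text is just the first bracket group from the question.

```python
import re

# Hypothetical sample mirroring the first bracket group in the question.
sample = """[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]"""

# Optional sign, digits, optional fraction, optional signed exponent.
FLOAT_RE = re.compile(r'[-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?')

def extract_floats(text):
    """Return every float-looking token in text, keeping its sign."""
    return [float(token) for token in FLOAT_RE.findall(text)]

values = extract_floats(sample)
print(values)
```

This pattern assumes every number starts with a digit; inputs such as ".5", "nan" or "inf" would need a wider pattern.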

A more precise approach is to use a regex to split the text:
>>> s="""[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
... 1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]
... [ 4.34562117e-01 -1.04469577e-01 2.83633101e-01 1.00452355e-01 -7.12572469e-01 -4.99234705e-01 -1.93152897e-01 1.80787567e-02] """
>>> print(re.split(r'[\s\[\]]+', s))
['', '-5.44339373e+00', '-2.77404404e-01', '1.26122094e-01', '9.83589873e-01', '1.95201179e-01', '-4.49866890e-01', '-2.06423297e-01', '1.04780491e+00', '4.34562117e-01', '-1.04469577e-01', '2.83633101e-01', '1.00452355e-01', '-7.12572469e-01', '-4.99234705e-01', '-1.93152897e-01', '1.80787567e-02', '']
If the data is in a file, you can do:
import re
print(re.split(r'[\s\[\]]+', open("general.txt", "r").read()))
If you want to get rid of the leading and trailing empty strings, you can use a list comprehension:
>>> print([i for i in re.split(r'[\s\[\]]+', s) if i])
['-5.44339373e+00', '-2.77404404e-01', '1.26122094e-01', '9.83589873e-01', '1.95201179e-01', '-4.49866890e-01', '-2.06423297e-01', '1.04780491e+00', '4.34562117e-01', '-1.04469577e-01', '2.83633101e-01', '1.00452355e-01', '-7.12572469e-01', '-4.99234705e-01', '-1.93152897e-01', '1.80787567e-02']
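The split still yields strings, while the question asks for floats. A short sketch of the remaining step, assuming the same split pattern; the text sample and the names tokens and floats are invented for this example:

```python
import re

# Hypothetical two-group sample in the same layout as the question's file.
text = "[ -5.44339373e+00 1.26122094e-01\n 1.04780491e+00]\n[ 4.34562117e-01 -1.04469577e-01]"

# Split on runs of whitespace and brackets, dropping the empty edge pieces.
tokens = [tok for tok in re.split(r'[\s\[\]]+', text) if tok]

# Convert each remaining token to a float.
floats = [float(tok) for tok in tokens]
print(floats)
```

float() raises ValueError on anything that is not a number, so stray text in the file surfaces as an error instead of being skipped silently.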

Let's slurp the file:
content = open('data.txt').read()
Split on ']':
logical_lines = content.split(']')
Strip the '[' and the surrounding whitespace:
logical_lines = [ll.lstrip(' \n[') for ll in logical_lines]
Convert to floats (in Python 3 map returns an iterator, so wrap it in list; the ll.strip() check drops the empty piece left after the final ']'):
lol = [list(map(float, ll.split())) for ll in logical_lines if ll.strip()]
Sticking it all in a one-liner:
lol = [list(map(float, l.lstrip(' \n[').split())) for l in open('data.txt').read().split(']') if l.strip()]
I've tested it on the exemplar data we were given and it works...

Related

Extract numeric values from a string for python

I have a string which contains numeric values inside quotes. I need to extract the numeric values from it and remove the [ and ].
sample string: texts = ['13007807', '13007779']
texts = ['13007807', '13007779']
texts.replace("'", "")
texts.strip("'")
print(texts)
# this will return ['13007807', '13007779']
So what I need to extract from the string is:
13007807
13007779
If your texts variable is a string as I understood from your reply, then you can use Regular expressions:
import re
text = "['13007807', '13007779']"
regex=r"\['(\d+)', '(\d+)'\]"
values=re.search(regex, text)
if values:
    value1 = int(values.group(1))
    value2 = int(values.group(2))
output:
value1=13007807
value2=13007779
You can use the * unpacking operator:
texts = ['13007807', '13007779']
print (*texts)
output:
13007807 13007779
If you have:
data = "['13007807', '13007779']"
print (*eval(data))
output:
13007807 13007779
The easiest way is to use map and wrap the result in list:
list(map(int,texts))
Output
[13007807, 13007779]
If your input data is of format data = "['13007807', '13007779']" then
import re
data = "['13007807', '13007779']"
list(map(int, re.findall(r'(\d+)', data)))
or
list(map(int, eval(data)))
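Since eval executes whatever expression the string contains, a more cautious variant is ast.literal_eval from the standard library, which only accepts Python literals. A minimal sketch, assuming data holds the list literal as text:

```python
import ast

# The list literal arrives as a string, as in the question.
data = "['13007807', '13007779']"

# literal_eval parses literals only (strings, numbers, lists, ...),
# so arbitrary code embedded in the string is rejected, not run.
values = [int(item) for item in ast.literal_eval(data)]
print(*values)
```

If the string is not a valid literal, literal_eval raises ValueError or SyntaxError, which the caller can catch.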

Python Regex: find all lines that start with '{' and end with '}'

I am receiving data over a socket, a bunch of JSON strings. However, I receive a set amount of bytes, so sometimes the last of my JSON strings is cut-off. I will typically get the following:
{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}
{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}
{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}
{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}
{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}
{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}
{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}
{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}
{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}
{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}
{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}
{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}
{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}
{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}
{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}
{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}
{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}
{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}
{"pitch":-30.816765,"yaw":-125
With Python, I would like to create a string array of the first 18 complete { data... } strings.
Here is what I have tried: cleanData = re.search('{.*}', data) but it seems like this is only giving me the very first { data... } entry. How can I get the full string array of complete { } sets?
To get all, you can use re.finditer or re.findall.
>>> re.findall(r'{.*}', s)
['{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}', '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}', '{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}', '{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}', '{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}', '{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}', '{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}', '{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}', '{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}', '{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}', '{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}', '{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}', '{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}', '{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}', '{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}', '{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}', '{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}', '{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}']
>>>
OR
>>> [x.group() for x in re.finditer(r'{.*}', s)]
['{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}', '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}', '{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}', '{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}', '{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}', '{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}', '{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}', '{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}', '{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}', '{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}', '{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}', '{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}', '{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}', '{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}', '{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}', '{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}', '{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}', '{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}']
>>>
You need re.findall() (or re.finditer)
>>> import re
>>> for r in re.findall(r'{.*}', data)[:18]:
...     print(r)
...
{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}
{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}
{"pitch":-31.574106,"yaw":-124.65623,"roll":-7.911794}
{"pitch":-30.479567,"yaw":-124.24301,"roll":-8.730827}
{"pitch":-29.30239,"yaw":-123.97949,"roll":-8.134723}
{"pitch":-29.84712,"yaw":-124.584465,"roll":-8.588374}
{"pitch":-31.072054,"yaw":-124.707466,"roll":-8.877062}
{"pitch":-31.493435,"yaw":-124.75457,"roll":-9.019922}
{"pitch":-29.591925,"yaw":-124.960815,"roll":-9.379437}
{"pitch":-29.37105,"yaw":-125.14427,"roll":-9.642341}
{"pitch":-29.483717,"yaw":-125.16528,"roll":-9.687177}
{"pitch":-30.903332,"yaw":-124.603935,"roll":-9.423098}
{"pitch":-30.211857,"yaw":-124.471664,"roll":-9.116135}
{"pitch":-30.837414,"yaw":-125.18984,"roll":-9.824204}
{"pitch":-30.526165,"yaw":-124.85788,"roll":-9.158611}
{"pitch":-30.333513,"yaw":-123.68705,"roll":-7.9481263}
{"pitch":-30.903502,"yaw":-123.78847,"roll":-8.209373}
{"pitch":-31.194769,"yaw":-124.79708,"roll":-8.709783}
Extracting lines that start and end with a specific character can be done without any regex. Use the str.startswith and str.endswith methods while iterating through the lines in a file:
results = []
with open(filepath, 'r') as f:
    for line in f:
        if line.startswith('{') and line.rstrip('\n').endswith('}'):
            results.append(line.rstrip('\n'))
Note the .rstrip('\n') is used before .endswith to make sure the final newline does not interfere with the } check at the end of the string.
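Because the complete lines are JSON, they can then be decoded with the standard json module. The following is only a sketch: the shortened data sample and the names complete and records are assumptions for illustration, and the pattern is anchored per line with re.M so the cut-off last entry is skipped.

```python
import json
import re

# Hypothetical shortened buffer: two complete objects and one cut-off entry.
data = (
    '{"pitch":-30.778193,"yaw":-124.63285,"roll":-8.977466}\n'
    '{"pitch":-30.856342,"yaw":-124.57556,"roll":-7.7220345}\n'
    '{"pitch":-30.816765,"yaw":-125'
)

# Keep only lines that start with '{' and end with '}'.
complete = re.findall(r'^\{.*\}$', data, re.M)

# Decode each complete line into a dict.
records = [json.loads(line) for line in complete]
print(records)
```

A line can start with '{' and end with '}' and still be malformed, in which case json.loads raises json.JSONDecodeError.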

Regular Expression in Python

I'm trying to build a list of domain names from an Enom API call. I get back a lot of information and need to locate the domain name related lines, and then join them together.
The string that comes back from Enom looks somewhat like this:
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1
I'd like to build a list from that which looks like this:
[domain1.com, domain2.org, domain3.co.uk, domain4.net]
To find the different domain name components I've tried the following (where "enom" is the string above) but have only been able to get the SLD and TLD matches.
re.findall("^.*(SLD|TLD).*$", enom, re.M)
Edit:
Every time I see a question asking for a regular expression solution, I have this bizarre urge to try to solve it without regular expressions. Most of the time that is more efficient than using a regex; I encourage the OP to test which of the solutions is most efficient.
Here is the naive approach:
a = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
b = a.split("\n")
c = [x.split("=")[1] for x in b if x != 'TLDOverride=1']
for x in range(0, len(c), 2):
    print(".".join(c[x:x+2]))
>> domain1.com
>> domain2.org
>> domain3.co.uk
>> domain4.net
You have a capturing group in your expression. re.findall documentation says:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
That's why only the content of the capturing group is returned.
Try:
re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)
This would return a list of tuples:
[('SLD1', 'domain1'), ('TLD1', 'com'), ('SLD2', 'domain2'), ('TLD2', 'org'), ('SLD3', 'domain3'), ('TLD4', 'co.uk'), ('SLD5', 'domain4'), ('TLD5', 'net')]
Combining SLDs and TLDs is then up to you.
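One way to do that combining step, sketched under the assumption that each SLD line is followed by its TLD line (the numbers in the sample do not always match, so pairing here goes by order, not by index). The function name build_domains is made up for this example.

```python
import re

enom = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""

def build_domains(text):
    """Join each SLD value with the next TLD value that follows it."""
    domains = []
    pending_sld = None
    # \d+ after the prefix skips lines such as TLDOverride=1.
    for kind, value in re.findall(r'^(SLD|TLD)\d+=(.*)$', text, re.M):
        if kind == 'SLD':
            pending_sld = value
        elif pending_sld is not None:
            domains.append(pending_sld + '.' + value)
            pending_sld = None
    return domains

print(build_domains(enom))
```

A TLD line with no preceding SLD is ignored in this sketch; whether that is the right behaviour depends on what the API can actually return.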
This works for your example:
>>> sld_list = re.findall("^.*SLD[0-9]*?=(.*?)$", enom, re.M)
>>> tld_list = re.findall("^.*TLD[0-9]*?=(.*?)$", enom, re.M)
>>> list(map(lambda x: x[0] + '.' + x[1], zip(sld_list, tld_list)))
['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
I'm not sure why you are talking about regular expressions. I mean, why don't you just run a for loop?
A famous quote seems to be appropriate here:
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
domains = []
components = []
for line in enom.split('\n'):
    k, v = line.split('=')
    if k == 'TLDOverride':
        continue
    components.append(v)
    if k.startswith('TLD'):
        domains.append('.'.join(components))
        components = []
P.S. I'm not sure what this TLDOverride is, so the code just ignores it.
Here's one way:
import re
print(list(map('.'.join, zip(*[iter(re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M))]*2))))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
Just for fun, map -> filter -> map:
input = """
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
"""
splited = list(map(lambda x: x.split("="), input.split()))
slds = filter(lambda x: x[1][0].startswith('SLD'), enumerate(splited))
print(list(map(lambda x: '.'.join([x[1][1], splited[x[0] + 1][1]]), slds)))
>>> ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
This appears to do what you want:
domains = re.findall(r'SLD\d+=(.+)', re.sub(r'\nTLD\d+=', '.', enom))
It assumes that the lines are sorted and that each SLD always comes before its TLD. If that may not be the case, try this slightly more verbose code without regexes:
d = dict(x.split('=') for x in enom.strip().splitlines())
domains = [
    d[key] + '.' + d.get('T' + key[1:], '')
    for key in d if key.startswith('SLD')
]
You need to use multiline regex for this. This is similar to this post.
data = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
domain_seq = re.compile(r"SLD\d=([\w.]+)\nTLD\d=([\w.]+)", re.M)  # [\w.]+ so that "co.uk" is captured whole
for item in domain_seq.finditer(data):
    domain, tld = item.group(1), item.group(2)
    print("%s.%s" % (domain, tld))
As some other answers already said, there's no need to use a regular expression here. A simple split and some filtering will do nicely:
lines = data.split("\n")  # assuming data contains your input string
pairs = [x.split("=") for x in lines if not x.startswith("TLDOverride")]  # skip the override flags
sld, tld = [[v for k, v in pairs if k[:3] == t] for t in ("SLD", "TLD")]
result = [x + "." + y for x, y in zip(sld, tld)]

Split stock quote to tokens in Python

I need to read a text file of stock quotes and do some processing with each stock data (i.e. a line in the file).
The stock data looks like this :
[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']
How do I split the line into tokens where each token is what is enclosed in the square brackets, i.e. for the above line, the tokens should be "class, 'STOCK'", "symbol, 'AAII'", etc.?
print(re.findall(r"\[(.*?)\]", inputline))
Or perhaps without regex:
print(inputline[1:-1].split("],["))
Try this code:
#!/usr/bin/env python3
import re
quote = "[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']"
quote = re.sub(r'^\[', '', quote)  # drop the leading bracket
quote = re.sub(r'\]$', '', quote)  # drop the trailing bracket
array = quote.split("],[")
for line in array:
    print(line)
Start with:
re.findall("[^,]+,[^,]+", a)
This would give you a list of:
[class,'STOCK'], [symbol,'AAII'] and such, then you could cut the brackets.
If you want a functional one-liner, use:
list(map(lambda x: x[1:-1], re.findall("[^,]+,[^,]+", a)))
The findall call splits at every second comma; the map (which applies the lambda function to each item in the list) cuts the brackets.
import re
s = "[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']"
m = re.findall(r"\[([^,\]]+),([^\]]+)\]", s)  # key, then everything up to the closing bracket
d = {x[0]: x[1] for x in m}
print(d)
you can run the snippet here : http://liveworkspace.org/code/EZGav$35
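If the values should end up typed instead of as raw strings, a conversion step can follow the tokenizing. This is a sketch under assumptions not stated in the question: quoted values are treated as strings, 'true'/'false' as booleans, and unquoted values as numbers. The helper names parse_value and parse_quote and the shortened sample line are invented for illustration.

```python
import re

# Hypothetical shortened quote line in the question's format.
line = "[class,'STOCK'],[symbol,'AAII'],[open,2.60],[volume,458500],[date,'21-Dec-04'],[closeEqualsLow,'false']"

def parse_value(raw):
    """Convert one raw token value to bool, str, int or float."""
    if raw.startswith("'") and raw.endswith("'"):
        text = raw[1:-1]
        if text in ('true', 'false'):
            return text == 'true'
        return text
    try:
        return int(raw)
    except ValueError:
        return float(raw)

def parse_quote(quote_line):
    """Build a dict from the [key,value] tokens of one quote line."""
    tokens = re.findall(r"\[([^,\]]+),([^\]]+)\]", quote_line)
    return {key: parse_value(raw) for key, raw in tokens}

quote = parse_quote(line)
print(quote)
```

Values containing a ']' would break the pattern; the sample data has none, so the sketch does not handle that case.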

python regular expression. Extract text between patterns

How can I get all the values between 'uniprotkb:' and '(gene name)' in the 'str' below:
str = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
The result should be:
HIST1H3D
HIST1H3A
HIST1H3B
HIST1H3C
HIST1H3E
HIST1H3F
HIST1H3G
HIST1H3H
HIST1H3I
HIST1H3J
Using re.findall(), you can get all parts of a string that match a regular expression:
>>> import re
>>> sstr = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
>>> re.findall(r'uniprotkb:([^(]*)\(gene name\)', sstr)
['HIST1H3D', 'HIST1H3A', 'HIST1H3B', 'HIST1H3C', 'HIST1H3E', 'HIST1H3F', 'HIST1H3G', 'HIST1H3H', 'HIST1H3I', 'HIST1H3J']
Here is a one-liner:
astr = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
[pt.split('(')[0] for pt in astr.strip().split('uniprotkb:')][1:]
Gives:
['HIST1H3D',
'HIST1H3A',
'HIST1H3B',
'HIST1H3C',
'HIST1H3E',
'HIST1H3F',
'HIST1H3G',
'HIST1H3H',
'HIST1H3I',
'HIST1H3J']
I don't recommend regexp solutions if runtime matters.
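Whether the regex is actually slower depends on the input and the Python version, so it is worth measuring. A rough sketch using the standard timeit module; the function names and the repetition count are arbitrary choices for this example, and no particular timing result is implied.

```python
import re
import timeit

# Rebuild the question's string from its ten gene-name suffixes.
s = '|'.join('uniprotkb:HIST1H3%s(gene name)' % c for c in 'DABCEFGHIJ')

GENE_RE = re.compile(r'uniprotkb:([^(]*)\(gene name\)')

def with_regex():
    return GENE_RE.findall(s)

def with_partition():
    return [part.partition('uniprotkb:')[2].partition('(gene name)')[0]
            for part in s.split('|')]

# Both approaches should agree before their timings are compared.
assert with_regex() == with_partition()

regex_seconds = timeit.timeit(with_regex, number=10000)
partition_seconds = timeit.timeit(with_partition, number=10000)
print('regex: %.4fs, partition: %.4fs' % (regex_seconds, partition_seconds))
```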
I wouldn't bother with a regular expression:
s = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)' # etc
gene_names = []
for substring in s.split('|'):
    removed_first = substring.partition('uniprotkb:')[2]  # remove the first part of the substring
    removed_second = removed_first.partition('(gene name)')[0]  # remove the second part
    gene_names.append(removed_second)  # put it on the list
should do the trick. You could even one-liner it - the above is equivalent to:
gene_names = [substring.partition('uniprotkb:')[2].partition('(gene name)')[0] for substring in s.split('|')]
