Extract numeric values from a string for python - python

I have a string with contains numeric values which are inside quotes. I need to remove numeric values from these and also the [ and ]
sample string: texts = ['13007807', '13007779']
texts = ['13007807', '13007779']
texts.replace("'", "")
texts..strip("'")
print texts
# this will return ['13007807', '13007779']
So what i need to extract from string is:
13007807
13007779

If your texts variable is a string as I understood from your reply, then you can use Regular expressions:
import re
text = "['13007807', '13007779']"
regex=r"\['(\d+)', '(\d+)'\]"
values=re.search(regex, text)
if values:
value1=int(values.group(1))
value2=int(values.group(2))
output:
value1=13007807
value2=13007779

You can use * unpack operator:
texts = ['13007807', '13007779']
print (*texts)
output:
13007807 13007779
if you have :
data = "['13007807', '13007779']"
print (*eval(data))
output:
13007807 13007779

The easiest way is to use map and wrap around in list
list(map(int,texts))
Output
[13007807, 13007779]
If your input data is of format data = "['13007807', '13007779']" then
import re
data = "['13007807', '13007779']"
list(map(int, re.findall('(\d+)',data)))
or
list(map(int, eval(data)))

Related

Can't get rid of hex characters

This program makes an array of verbs which come from a text file.
file = open("Verbs.txt", "r")
data = str(file.read())
table = eval(data)
num_table = len(table)
new_table = []
for x in range(0, num_table):
newstr = table[x].replace(")", "")
split = newstr.rsplit("(")
numx = len(split)
for y in range(0, numx):
split[y] = split[y].split(",", 1)[0]
new_table.append(split[y])
num_new_table = len(new_table)
for z in range(0, num_new_table):
print(new_table[z])
However the text itself contains hex characters such as in
('a\\xc4\\x9fr\\xc4\\xb1[Verb]+[Pos]+[Imp]+[A2sg]', ':', 17.6044921875)('A\\xc4\\x9fr\\xc4\\xb1[Noun]+[Prop]+[A3sg]+[Pnon]+[Nom]', ':', 11.5615234375)
I'm trying to get rid of those. How am supposed to do that?
I've looked up pretty much everywhere and decode() returns an error (even after importing codecs).
You could use parse, a python module that allows you to search inside a string for regularly-formatted components, and, from the components returned, you could extract the corresponding integers, replacing them from the original string.
For example (untested alert!):
import parse
# Parse all hex-like items
list_of_findings = parse.findall("\\x{:w}", your_string)
# For each item
for hex_item in list_of_findings:
# Replace the item in the string
your_string = your_string.replace(
# Retrieve the value from the Parse Data Format
hex_item[0],
# Convert the value parsed to a normal hex string,
# then to int, then to string again
str(int("0x"+hex_item[0]))
)
Obs: instead of "int", you could convert the found hex-like values to characters, using chr, as in:
chr(hex_item[0])

python find specific pattern at end of string

I need to find a timestamp pattern at the end of a string. For example, blahblahblah_tsYYYMMDDHHMMSS. I need to extract the time string and replace it with current timestamp.
What is the regex for such pattern?
You might try making a function.
def tsExtractor(SomeString): #your String
LenSomeString = len(SomeString) #get length of String
TargetString = SomeString[(LenSomeString - 14):LenSomeString] #Get placeholder of numeric data
try:
int(TargetString) #attempt to change to integer
except ValueError:
TargetString = 99998877665544 #not integer, put in some dummy data
return str(TargetString) #change integer back to string
#examples
exampleString = 'blahblahblahyeahblahblahts20160616140003'
badString = 'Blahblahblahaldskfjlakdflaksdflkjasl500566'
exampleString = tsExtractor(exampleString)
print(exampleString)
exampleString = tsExtractor(badString)
print(exampleString)
Thanks #rock321987, i did this:
match = re.compile("_ts\d{14}$").match(pathstring)
if match:
do something
It works pretty well.

Extracting float numbers from file using python

I have .txt file which looks like:
[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]
[ 4.34562117e-01 -1.04469577e-01 2.83633101e-01 1.00452355e-01 -7.12572469e-01 -4.99234705e-01 -1.93152897e-01 1.80787567e-02]
I need to extract all floats from it and put them to list/array
What I've done is this:
A = []
for line in open("general.txt", "r").read().split(" "):
for unit in line.split("]", 3):
A.append(list(map(lambda x: str(x), unit.replace("[", "").replace("]", "").split(" "))))
but A contains elements like [''] or even worse ['3.20973096e-02\n']. These are all strings, but I need floats. How to do that?
Why not use a regular expression?
>>> import re
>>> e = r'(\d+\.\d+e?(?:\+|-)\d{2}?)'
>>> results = re.findall(e, your_string)
['5.44339373e+00',
'2.77404404e-01',
'1.26122094e-01',
'9.83589873e-01',
'1.95201179e-01',
'4.49866890e-01',
'2.06423297e-01',
'1.04780491e+00',
'4.34562117e-01',
'1.04469577e-01',
'2.83633101e-01',
'1.00452355e-01',
'7.12572469e-01',
'4.99234705e-01',
'1.93152897e-01',
'1.80787567e-02']
Now, these are the matched strings, but you can easily convert them to floats:
>>> map(float, re.findall(e, your_string))
[5.44339373,
0.277404404,
0.126122094,
0.983589873,
0.195201179,
0.44986689,
0.206423297,
1.04780491,
0.434562117,
0.104469577,
0.283633101,
0.100452355,
0.712572469,
0.499234705,
0.193152897,
0.0180787567]
Note, the regular expression might need some tweaking, but its a good start.
As a more precise way you can use regex for split the lines :
>>> s="""[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
... 1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]
... [ 4.34562117e-01 -1.04469577e-01 2.83633101e-01 1.00452355e-01 -7.12572469e-01 -4.99234705e-01 -1.93152897e-01 1.80787567e-02] """
>>> print re.split(r'[\s\[\]]+',s)
['', '-5.44339373e+00', '-2.77404404e-01', '1.26122094e-01', '9.83589873e-01', '1.95201179e-01', '-4.49866890e-01', '-2.06423297e-01', '1.04780491e+00', '4.34562117e-01', '-1.04469577e-01', '2.83633101e-01', '1.00452355e-01', '-7.12572469e-01', '-4.99234705e-01', '-1.93152897e-01', '1.80787567e-02', '']
And in this case that you have the data in file you can do :
import re
print re.split(r'[\s\[\]]+',open("general.txt", "r").read())
If you want to get ride of the empty strings in leading and trailing you can just use a list comprehension :
>>> print [i for i in re.split(r'[\s\[\]]*',s) if i]
['-5.44339373e+00', '-2.77404404e-01', '1.26122094e-01', '9.83589873e-01', '1.95201179e-01', '-4.49866890e-01', '-2.06423297e-01', '1.04780491e+00', '4.34562117e-01', '-1.04469577e-01', '2.83633101e-01', '1.00452355e-01', '-7.12572469e-01', '-4.99234705e-01', '-1.93152897e-01', '1.80787567e-02']
let's slurp the file
content = open('data.txt').read()
split on ']'
logical_lines = content.split(']')
strip the '[' and the other stuff
logical_lines = [ll.lstrip(' \n[') for ll in logical_lines]
convert to floats
lol = [map(float,ll.split()) for ll in logical_lines]
Sticking it all in a one-liner
lol=[map(float,l.lstrip(' \n[').split()) for l in open('data.txt').read().split(']')]
I've tested it on the exemplar data we were given and it works...

Split stock quote to tokens in Python

I need to read a text file of stock quotes and do some processing with each stock data (i.e. a line in the file).
The stock data looks like this :
[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']
How do I split the line into tokens where each token is what is enclosed in the square braces, .i.e. for the above line, the tokens should be "class, 'STOCK'" , "symbol, 'AAII'" etc.
print(re.findall("\[(.*?)\]", inputline))
Or perhaps without regex:
print(inputline[1:-1].split("],["))
Try this code:
#!/usr/bin/env python3
import re
str="[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']"
str = re.sub('^\[','',str)
str = re.sub('\]$','',str)
array = str.split("],[")
for line in array:
print(line)
Start with:
re.findall("[^,]+,[^,]+", a)
This would give you a list of:
[class,'STOCK'], [symbol,'AAII'] and such, then you could cut the brackets.
If you want a functional one liner, use:
map(lambda x: x[1:-1], re.findall("[^,]+,[^,]+", a))
The first part splits every second ,, the map (for each item in the list, use the lambda function..) cuts the brackets.
import re
s = "[class,'STOCK'],[symbol,'AAII'],[open,2.60],[high,2.70],[low,2.53],[close,2.60],[volume,458500],[date,'21-Dec-04'],[openClosePDiff,0.0],[highLowPDiff,0.067],[closeEqualsLow,'false'],[closeEqualsHigh,'false']"
m = re.findall(r"([a-zA-Z0-9]+),([a-zA-Z0-9']+)", s)
d= { x[0]:x[1] for x in m }
print d
you can run the snippet here : http://liveworkspace.org/code/EZGav$35

python regular expression. Extract text between patterns

How to get all the values in between 'uniprotkb:' and '(gene name)' in the 'str' below:
str = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
The result is:
HIST1H3D
HIST1H3A
HIST1H3B
HIST1H3C
HIST1H3E
HIST1H3F
HIST1H3G
HIST1H3H
HIST1H3I
HIST1H3J
Using re.findall(), you can get all parts of a string that match a regular expression:
>>> import re
>>> sstr = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
>>> re.findall(r'uniprotkb:([^(]*)\(gene name\)', sstr)
['HIST1H3D', 'HIST1H3A', 'HIST1H3B', 'HIST1H3C', 'HIST1H3E', 'HIST1H3F', 'HIST1H3G', 'HIST1H3H', 'HIST1H3I', 'HIST1H3J']
Here is a oneliner:
astr = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)|uniprotkb:HIST1H3B(gene name)|uniprotkb:HIST1H3C(gene name)|uniprotkb:HIST1H3E(gene name)|uniprotkb:HIST1H3F(gene name)|uniprotkb:HIST1H3G(gene name)|uniprotkb:HIST1H3H(gene name)|uniprotkb:HIST1H3I(gene name)|uniprotkb:HIST1H3J(gene name)'
[pt.split('(')[0] for pt in astr.strip().split('uniprotkb:')][1:]
Gives:
['HIST1H3D',
'HIST1H3A',
'HIST1H3B',
'HIST1H3C',
'HIST1H3E',
'HIST1H3F',
'HIST1H3G',
'HIST1H3H',
'HIST1H3I',
'HIST1H3J']
I don't recommend regexp solutions, if runtime matters.
I wouldn't bother with a regular expression:
s = 'uniprotkb:HIST1H3D(gene name)|uniprotkb:HIST1H3A(gene name)' # etc
gene_names = []
for substring in s.split('|'):
removed_first = substring.partition('uniprotkb:')[2] # remove the first part of the substring
removed_second = removed_first.partition('(gene name)')[0] # remove the second part
gene_names.append(removed_second) # put it on the list
should do the trick. You could even one-liner it - the above is equivalent to:
gene_names = [substring.partition('uniprotkb:')[2].partition('(gene name)')[0] for substring in s.split('|')]

Categories