isolate data from long string [duplicate]

isolate data from long string [duplicate] - python

Suppose I had a string
string1 = "498results should get"
Now I need to get only integer values from the string like 498. Here I don't want to use list slicing because the integer values may increase like these examples:
string2 = "49867results should get"
string3 = "497543results should get"
So I want to get only integer values out from the string exactly in the same order. I mean like 498,49867,497543 from string1,string2,string3 respectively.
Can anyone let me know how to do this in a one or two lines?

>>> import re
>>> string1 = "498results should get"
>>> int(re.search(r'\d+', string1).group())
498
If there are multiple integers in the string:
>>> map(int, re.findall(r'\d+', string1))
[498]

An answer taken from ChristopheD here: https://stackoverflow.com/a/2500023/1225603
r = "456results string789"
s = ''.join(x for x in r if x.isdigit())
print int(s)
456789

Here's your one-liner, without using any regular expressions, which can get expensive at times:
>>> ''.join(filter(str.isdigit, "1234GAgade5312djdl0"))
returns:
'123453120'

if you have multiple sets of numbers then this is another option
>>> import re
>>> print(re.findall('\d+', 'xyz123abc456def789'))
['123', '456', '789']
its no good for floating point number strings though.

Iterator version
>>> import re
>>> string1 = "498results should get"
>>> [int(x.group()) for x in re.finditer(r'\d+', string1)]
[498]

>>> import itertools
>>> int(''.join(itertools.takewhile(lambda s: s.isdigit(), string1)))

With python 3.6, these two lines return a list (may be empty)
>>[int(x) for x in re.findall('\d+', your_string)]
Similar to
>>list(map(int, re.findall('\d+', your_string))

this approach uses list comprehension, just pass the string as argument to the function and it will return a list of integers in that string.
def getIntegers(string):
numbers = [int(x) for x in string.split() if x.isnumeric()]
return numbers
Like this
print(getIntegers('this text contains some numbers like 3 5 and 7'))
Output
[3, 5, 7]

def function(string):
final = ''
for i in string:
try:
final += str(int(i))
except ValueError:
return int(final)
print(function("4983results should get"))

Another option is to remove the trailing the letters using rstrip and string.ascii_lowercase (to get the letters):
import string
out = [int(s.replace(' ','').rstrip(string.ascii_lowercase)) for s in strings]
Output:
[498, 49867, 497543]

integerstring=""
string1 = "498results should get"
for i in string1:
if i.isdigit()==True
integerstring=integerstring+i
print(integerstring)

Related

Can i include multiple statements when creating a one-line for loop?

I have an array I want to iterate through. The array consists of strings consisting of numbers and signs.
like this: €110.5M
I want to loop over it and remove all Euro sign and also the M and return that array with the strings as ints.
How would I do this knowing that the array is a column in a table?

You could just strip the characters,
>>> x = '€110.5M'
>>> x.strip('€M')
'110.5'

def sanitize_string(ss):
ss = ss.replace('$', '').replace('€', '').lower()
if 'm' in ss:
res = float(ss.replace('m', '')) * 1000000
elif 'k' in ss:
res = float(ss.replace('k', '')) * 1000
return int(res)
This can be applied to a list as follows:
>>> ls = [sanitize_string(x) for x in ["€3.5M", "€15.7M" , "€167M"]]
>>> ls
[3500000, 15700000, 167000000]
If you want to apply it to the column of a table instead:
dataFrame = dataFrame.price.apply(sanitize_string) # Assuming you're using DataFrames and the column is called 'price'

You can use a string comprehension:
numbers = [float(p.replace('€','').replace('M','')) for p in a]
which gives:
[110.5, 210.5, 310.5]

You can use a list comprehension to construct one list from another:
foo = ["€13.5M", "€15M" , "€167M"]
foo_cleaned = [value.translate(None, "€M")]
str.translate replaces all occurrences of characters in the latter string with the first argument None.

Try this
arr = ["€110.5M","€110.5M","€110.5M","€110.5M","€110.5M","€110.5M","€110.5M"]
f = [x.replace("€","").replace("M","") for x in arr]

You can call .replace() on a string as often as you like. An initial solution could be something like this:
my_array = ['€110.5M', '€111.5M', '€112.5M']
my_cleaned_array = []
for elem in my_array:
my_cleaned_array.append(elem.replace('€', '').replace('M', ''))
At this point, you still have strings in your array. If you want to return them as ints, you can write int(elem.replace('€', '').replace('M', '')) instead. But be aware that you will then lose everything after the floating point, i.e. you will end up with [110, 111, 112].

You can use Regex to do that.
import re
str = "€110.5M"
x = re.findall("\-?\d+\.\d+", str )
print(x)
I didn't quite understand the second part of the question.

How can I extract words before some string?

I have several strings like this:
mylist = ['pearsapple','grapevinesapple','sinkandapple'...]
I want to parse the parts before apple and then append to a new list:
new = ['pears','grapevines','sinkand']
Is there a way other than finding starting points of 'apple' in each string and then appending before the starting point?

By using slicing in combination with the index method of strings.
>>> [x[:x.index('apple')] for x in mylist]
['pears', 'grapevines', 'sinkand']
You could also use a regular expression
>>> import re
>>> [re.match('(.*?)apple', x).group(1) for x in mylist]
['pears', 'grapevines', 'sinkand']
I don't see why though.

I hope the word apple will be fix (fixed length word), then we can use:
second_list = [item[:-5] for item in mylist]

If some elements in the list don't contain 'apple' at the end of the string, this regex leaves the string untouched:
>>> import re
>>> mylist = ['pearsapple','grapevinesapple','sinkandapple', 'test', 'grappled']
>>> [re.sub('apple$', '', word) for word in mylist]
['pears', 'grapevines', 'sinkand', 'test', 'grappled']

By also using string split and list comprehension
new = [x.split('apple')[0] for x in mylist]
['pears', 'grapevines', 'sinkand']

One way to do it would be to iterate through every string in the list and then use the split() string function.
for word in mylist:
word = word.split("apple")[0]

Find a string in a list of list in python

I have a nested list as below:
[['asgy200;f','ssll100',' time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100',' time is: 00h:00m:05s','ooo']]
'***' is my delimiter. I want to separate all of seconds in the list in python.
First of all with regular expression I want to separate the line that has time is: string but it doesn't work!
I don't know what should I do.
Thanks

import re
x=[['asgy200;f','ssll100','time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
s=str(x)
print re.findall(r"(?<=time is)\s*:\s*[^']*:(\d+)",s)
Output:['12', '05']
You can try this.

You can use a look-ahead regex (r'(?<=time is\:).*') :
>>> [i.group(0).split(':')[2] for i in [re.search(r'(?<=time is\:).*',i) for i in l[0]] if i is not None]
['12s', '05s']
and you can convert them to int :
>>> [int(j.replace('s','')) for j in sec]
[12, 5]
if you want the string of seconds don't convert them to int after replace :
>>> [j.replace('s','') for j in sec]
['12', '05']

You could use capturing groups also. It won't print the seconds if the seconds is exactly equal to 00
>>> lst = [['asgy200;f','ssll100','time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
>>> [i for i in re.findall(r'time\s+is:\s+\d{2}h:\d{2}m:(\d{2})', ' '.join(lst[0])) if int(i) != 00]
['12', '05']
>>> lst = [['asgy200;f','ssll100','time is: 10h:00m:00s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
>>> [i for i in re.findall(r'time\s+is:\s+\d{2}h:\d{2}m:(\d{2})', ' '.join(lst[0])) if int(i) != 00]
['05']

Taking into account your last comment to your Q,
>>> x = [['asgy200;f','ssll100','time is: 10h:00m:12s','xxxxxxx','***','','asgy200;f','frl5100','time is: 00h:00m:05s','ooo']]
>>> print all([w[-3:-1]!='00' for r in x for w in r if w.startswith('time is: ')])
True
>>>
all and any are two useful builtins...
The thing operates like this, the slower loop is on the sublists (rows) of x, the fastest loop on the items (words)in each row, we pick up only the words that startswith a specific string, and our iterable is made of booleans where we have true if the 3rd last and 2nd last character of the picked word are different from'00'. Finally the all consumes the iterable and returns True if all the second fields are different from '00'.
HTH,
Addendum
Do we want to break out early?
all_secs_differ_from_0 = True
for row in x:
for word in row:
if word.startswith('time is: ') and word[-3:-1] == '00':
all_secs_differ_from_0 = False
break
if not all_secs_differ_from_0: break

Remove strings from a list that contains numbers in python [duplicate]

This question already has answers here:
How to remove all integer values from a list in python
(8 answers)
Closed 6 years ago.
Is there a short way to remove all strings in a list that contains numbers?
For example
my_list = [ 'hello' , 'hi', '4tim', '342' ]
would return
my_list = [ 'hello' , 'hi']

Without regex:
[x for x in my_list if not any(c.isdigit() for c in x)]

I find using isalpha() the most elegant, but it will also remove items that contain other non-alphabetic characters:
Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”
my_list = [item for item in my_list if item.isalpha()]

I'd use a regex:
import re
my_list = [s for s in my_list if not re.search(r'\d',s)]
In terms of timing, using a regex is significantly faster on your sample data than the isdigit solution. Admittedly, it's slower than isalpha, but the behavior is slightly different with punctuation, whitespace, etc. Since the problem doesn't specify what should happen with those strings, it's not clear which is the best solution.
import re
my_list = [ 'hello' , 'hi', '4tim', '342' 'adn322' ]
def isalpha(mylist):
return [item for item in mylist if item.isalpha()]
def fisalpha(mylist):
return filter(str.isalpha,mylist)
def regex(mylist,myregex = re.compile(r'\d')):
return [s for s in mylist if not myregex.search(s)]
def isdigit(mylist):
return [x for x in mylist if not any(c.isdigit() for c in x)]
import timeit
for func in ('isalpha','fisalpha','regex','isdigit'):
print func,timeit.timeit(func+'(my_list)','from __main__ import my_list,'+func)
Here are my results:
isalpha 1.80665302277
fisalpha 2.09064006805
regex 2.98224401474
isdigit 8.0824341774

Try:
import re
my_list = [x for x in my_list if re.match("^[A-Za-z_-]*$", x)]

Sure, use the string builtin for digits, and test the existence of them.
We'll get a little fancy and just test for truthiness in the list comprehension; if it's returned anything there's digits in the string.
So:
out_list = []
for item in my_list:
if not [ char for char in item if char in string.digits ]:
out_list.append(item)

And yet another slight variation:
>>> import re
>>> filter(re.compile('(?i)[a-z]').match, my_list)
['hello', 'hi']
And put the characters that are valid in your re (such as spaces/punctuation/other)

Python string split decimals from end of string

I use nlst on a ftp server which returns directories in the form of lists. The format of the returned list is as follows:
[xyz123,abcde345,pqrst678].
I have to separate each element of the list into two parts such that part1 = xyz and part2 = 123 i.e split the string at the beginning of the integer part. Any help on this will be appreciated!

>>> re.findall(r'\d+|[a-z]+', 'xyz123')
['xyz', '123']

For example, using the re module:
>>> import re
>>> a = ['xyz123','ABCDE345','pqRst678']
>>> regex = '(\D+)(\d+)'
>>> for item in a:
... m = re.match(regex, item)
... (a, b) = m.groups()
... print a, b
xyz 123
ABCDE 345
pqRst 678

Use the regular expression module re:
import re
def splitEntry(entry):
firstDecMatch = re.match(r"\d+$", entry)
alpha, numeric = "",""
if firstDecMatch:
pos = firstDecMatch.start(0)
alpha, numeric = entry[:pos], entry[pos:]
else # no decimals found at end of string
alpha = entry
return (alpha, numeric)
Note that the regular expression is `\d+$', which should match all decimals at the end of the string. If the string has decimals in the first part, it will not count those, e.g: xy3zzz134 -> "xy3zzz","134". I opted for that because you say you are expecting filenames, and filenames can include numbers. Of course it's still a problem if the filename ends with numbers.

Another non-re answer:
>>> [''.join(x[1]) for x in itertools.groupby('xyz123', lambda x: x.isalpha())]
['xyz', '123']

If you don't want to use regex, then you can do something like this. Note that I have not tested this so there could be a bug or typo somewhere.
list = ["xyz123", "abcde345", "pqrst678"]
newlist = []
for item in list:
for char in range(0, len(item)):
if item[char].isnumeric():
newlist.append([item[:char], item[char:]])
break

>>> import re
>>> [re.findall(r'(.*?)(\d+$)',x)[0] for x in ['xyz123','ABCDE345','pqRst678']]
[('xyz', '123'), ('ABCDE', '345'), ('pqRst', '678')]

I don't think its that difficult without re
>>> s="xyz123"
>>> for n,i in enumerate(s):
... if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['xyz', '123']
>>> s="abcde345"
>>> for n,i in enumerate(s):
... if i.isdigit(): x=n ; break
...
>>> [ s[:x], s[x:] ]
['abcde', '345']

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

isolate data from long string [duplicate] - python

>>> import re >>> string1 = "498results should get" >>> int(re.search(r'\d+', string1).group()) 498 If there are multiple integers in the string: >>> map(int, re.findall(r'\d+', string1)) [498]

An answer taken from ChristopheD here: https://stackoverflow.com/a/2500023/1225603 r = "456results string789" s = ''.join(x for x in r if x.isdigit()) print int(s) 456789

Here's your one-liner, without using any regular expressions, which can get expensive at times: >>> ''.join(filter(str.isdigit, "1234GAgade5312djdl0")) returns: '123453120'

if you have multiple sets of numbers then this is another option >>> import re >>> print(re.findall('\d+', 'xyz123abc456def789')) ['123', '456', '789'] its no good for floating point number strings though.

Iterator version >>> import re >>> string1 = "498results should get" >>> [int(x.group()) for x in re.finditer(r'\d+', string1)] [498]

>>> import itertools >>> int(''.join(itertools.takewhile(lambda s: s.isdigit(), string1)))

With python 3.6, these two lines return a list (may be empty) >>[int(x) for x in re.findall('\d+', your_string)] Similar to >>list(map(int, re.findall('\d+', your_string))

def function(string): final = '' for i in string: try: final += str(int(i)) except ValueError: return int(final) print(function("4983results should get"))

Another option is to remove the trailing the letters using rstrip and string.ascii_lowercase (to get the letters): import string out = [int(s.replace(' ','').rstrip(string.ascii_lowercase)) for s in strings] Output: [498, 49867, 497543]

integerstring="" string1 = "498results should get" for i in string1: if i.isdigit()==True integerstring=integerstring+i print(integerstring)

Related

Can i include multiple statements when creating a one-line for loop?

How can I extract words before some string?

Find a string in a list of list in python

Remove strings from a list that contains numbers in python [duplicate]

Python string split decimals from end of string

Categories

Resources