I have a dataset with over 1 million rows, and each row contains a combination of lowercase and uppercase letters, symbols and numbers. I want to clean this data and keep only the last instance where a lowercase letter and a number are beside each other. For speed, my current plan is to hold the data as an array of strings and then use the .findall operation to keep the letter/number combination I'm looking for.
Here is something along the lines of what I am trying to do:
Input
list = Array(["Nd4","0-0","Nxe4","e8+","e4g2"])
newList = list.findall('[a-z]\d')[len(list.findall('[a-z]\d')-1]
Expected Output from newList
newList = ("d4","","e4","e8","g2")
It is not recommended to use "list" as a variable name, since it shadows the built-in list type.
import re
import numpy as np
lists = np.array(["Nd4","0-0","Nxe4","e8+","e4g2"])
def findall(i, pattern=r'[a-z1-9]+'):
    matches = re.findall(pattern, i)
    return matches[0] if matches else ""
newList = [findall(i) for i in lists]
# OR if you want to return an array
newList = np.array(list(map(findall,lists)))
# >>> ['d4', '', 'xe4', 'e8', 'e4g2']
This may not be the prettiest way, but I think it gets the job done!
import re
import numpy as np
lists = np.array(["Nd4","0-0","Nxe4","e8+","e4g2"])
def function(i):
    try:
        # take the last match of a lowercase letter followed by a digit
        return re.findall(r'[a-z]\d', i)[-1]
    except IndexError:
        # no match in this string
        return ""
newList = [function(i) for i in lists]
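A compact variant of the same idea, as a minimal sketch: precompile the pattern once (which may help a little on a million rows) and take the last element of the match list, falling back to an empty string. The names PATTERN, last_pair and moves are illustrative and do not come from the answers above.

```python
import re

# Lowercase letter immediately followed by a digit.
PATTERN = re.compile(r"[a-z]\d")

def last_pair(text):
    """Return the last letter+digit pair in text, or "" if there is none."""
    matches = PATTERN.findall(text)
    return matches[-1] if matches else ""

moves = ["Nd4", "0-0", "Nxe4", "e8+", "e4g2"]
cleaned = [last_pair(m) for m in moves]
print(cleaned)  # ['d4', '', 'e4', 'e8', 'g2']
```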
I know how to create a new list based on the values of an existing list, eg casting
numspec = [float(x) for x in textspec]
Now I have a list of numbers where I need to subtract a value based on the index of a list. I have calculated an a and b value and ended up doing
peakadj = []
for i in range(len(peakvalues)):
    val=peakvalues[i]-(i*a+b)
    peakadj.append(val)
This works, but I don't like the feel of it, is there any more pythonic way of doing this?
Use the builtin enumerate function and a list comprehension.
peakadj = [val-(i*a+b) for i, val in enumerate(peakvalues)]
Perhaps faster:
from itertools import count
peakadj = [val-iab for val, iab in zip(peakvalues, count(b, a))]
Or:
from itertools import count
from operator import sub
peakadj = [*map(sub, peakvalues, count(b, a))]
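As a quick sanity check, the three variants can be compared on a small input. This is only a sketch: the sample values for peakvalues, a and b are made up for illustration. With integer coefficients the results are identical; with floats, count(b, a) accumulates by repeated addition, so the last digits may differ slightly from i*a+b.

```python
from itertools import count
from operator import sub

# Hypothetical sample values; a and b stand in for the asker's coefficients.
peakvalues = [10, 20, 30, 40]
a, b = 2, 1

with_enumerate = [val - (i * a + b) for i, val in enumerate(peakvalues)]
with_zip = [val - iab for val, iab in zip(peakvalues, count(b, a))]
with_map = list(map(sub, peakvalues, count(b, a)))

print(with_enumerate)  # [9, 17, 25, 33]
print(with_enumerate == with_zip == with_map)  # True
```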
I am trying to implement an inplace algorithm to remove duplicates from a string in Python.
str1 = "geeksforgeeks"
for i in range(len(str1)):
    for j in range(i+1,len(str1)-1):
        if str1[i] == str1[j]: # Error Line
            str1 = str1[0:j]+""+str1[j+1:]
print(str1)
In the above code, I am trying to replace each duplicate character with an empty string. But I get IndexError: string index out of range at if str1[i] == str1[j]. Am I missing something, or is this not the right way to do it?
My expected output is: geksfor
You can do all of this with just a set and a comprehension. No need to complicate things.
str1 = "geeksforgeeks"
seen = set()
seen_add = seen.add
print(''.join(s for s in str1 if not (s in seen or seen_add(s))))
#geksfor
"Simple is better than complex."
~ See PEP20
Edit
The above is simpler than your approach and is a performant way of removing duplicates from a collection. An even simpler solution is to use:
from collections import OrderedDict
print("".join(OrderedDict.fromkeys(str1)))
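On Python 3.7 and later, plain dicts also preserve insertion order, so the same trick should work without the import. A minimal sketch, assuming a 3.7+ interpreter (the name deduped is illustrative):

```python
str1 = "geeksforgeeks"

# dict.fromkeys keeps the first occurrence of each key, in insertion order.
deduped = "".join(dict.fromkeys(str1))
print(deduped)  # geksfor
```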
It is impossible to modify strings in-place in Python, the same way that it's impossible to modify numbers in-place in Python.
a = "something"
b = 3
b += 1 # allocates a new integer, 4, and assigns it to b
a += " else" # allocates a new string, " else", concatenates it to `a` to produce "something else"
# then assigns it to a
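This is probably also why the loop in the question fails: each slice-and-concatenate step rebinds str1 to a new, shorter string, while the range objects were built from the earlier, longer length. A small sketch of that effect (the names indices, shorter and failed are illustrative):

```python
str1 = "geeksforgeeks"
indices = range(len(str1))        # evaluated once: 0..12

shorter = str1[:2] + str1[3:]     # builds a new 12-character string
print(len(str1), len(shorter))    # 13 12

try:
    shorter[indices[-1]]          # index 12 does not exist in the shorter string
    failed = False
except IndexError:
    failed = True
print(failed)  # True
```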
As already pointed out, str is immutable, so an in-place requirement makes no sense.
To get the desired output, I would do it the following way:
str1 = 'geeksforgeeks'
out = ''.join([i for inx,i in enumerate(str1) if str1.index(i)==inx])
print(out) #prints: geksfor
Here I used the enumerate function to get each letter together with its index (inx), and the fact that the .index method of str returns the lowest index of an element. Therefore str1.index('e') for the given string is 1, not 2, 9 or 10.
Here is a simplified version of unique_everseen from itertools recipes.
from itertools import filterfalse
def unique_everseen(iterable):
    seen = set()
    seen_add = seen.add
    for element in filterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element
You can then use this generator with str.join to get the expected output.
str1 = "geeksforgeeks"
new_str1 = ''.join(unique_everseen(str1)) # 'geksfor'
I have an array I want to iterate through. The array consists of strings consisting of numbers and signs.
like this: €110.5M
I want to loop over it and remove all Euro sign and also the M and return that array with the strings as ints.
How would I do this knowing that the array is a column in a table?
You could just strip the characters:
>>> x = '€110.5M'
>>> x.strip('€M')
'110.5'
def sanitize_string(ss):
    ss = ss.replace('$', '').replace('€', '').lower()
    if 'm' in ss:
        res = float(ss.replace('m', '')) * 1000000
    elif 'k' in ss:
        res = float(ss.replace('k', '')) * 1000
    else:
        # no suffix: take the number as-is, so `res` is always defined
        res = float(ss)
    return int(res)
This can be applied to a list as follows:
>>> ls = [sanitize_string(x) for x in ["€3.5M", "€15.7M" , "€167M"]]
>>> ls
[3500000, 15700000, 167000000]
If you want to apply it to the column of a table instead:
dataFrame['price'] = dataFrame['price'].apply(sanitize_string) # Assuming you're using a pandas DataFrame and the column is called 'price'
You can use a list comprehension:
numbers = [float(p.replace('€','').replace('M','')) for p in a]
which gives:
[110.5, 210.5, 310.5]
You can use a list comprehension to construct one list from another:
foo = ["€13.5M", "€15M" , "€167M"]
foo_cleaned = [value.translate(str.maketrans("", "", "€M")) for value in foo]
str.maketrans("", "", "€M") builds a table that maps each of the characters € and M to None, and str.translate then deletes every occurrence of them.
Try this
arr = ["€110.5M","€110.5M","€110.5M","€110.5M","€110.5M","€110.5M","€110.5M"]
f = [x.replace("€","").replace("M","") for x in arr]
You can call .replace() on a string as often as you like. An initial solution could be something like this:
my_array = ['€110.5M', '€111.5M', '€112.5M']
my_cleaned_array = []
for elem in my_array:
    my_cleaned_array.append(elem.replace('€', '').replace('M', ''))
At this point, you still have strings in your array. If you want to return them as ints, note that int('110.5') raises a ValueError, so you need to go through float first: int(float(elem.replace('€', '').replace('M', ''))). But be aware that you will then lose everything after the decimal point, i.e. you will end up with [110, 111, 112].
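A short sketch of the conversion step, assuming the values keep the form shown above. The variable names are illustrative; the point is that int() does not accept a string with a decimal point, while int(float(...)) truncates toward zero.

```python
raw = ['€110.5M', '€111.5M', '€112.5M']

stripped = [s.replace('€', '').replace('M', '') for s in raw]
as_floats = [float(s) for s in stripped]
as_ints = [int(f) for f in as_floats]   # truncates toward zero

try:
    int(stripped[0])                    # int('110.5') is rejected
    direct_int_ok = True
except ValueError:
    direct_int_ok = False

print(as_floats)      # [110.5, 111.5, 112.5]
print(as_ints)        # [110, 111, 112]
print(direct_int_ok)  # False
```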
You can use Regex to do that.
import re
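text = "€110.5M"  # avoid naming a variable `str`, which shadows the built-in
x = re.findall(r"-?\d+(?:\.\d+)?", text)  # the decimal part is optional, so "€167M" also matches
print(x)  # ['110.5']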
I didn't quite understand the second part of the question.
Say I have some list with files of the form *.1243.*, and I wish to obtain everything before these 4 digits. How do I do this efficiently?
An ugly, inefficient example of working code is:
names = []
for file in file_list:
    words = file.split('.')
    for i, word in enumerate(words):
        if word.isdigit():
            if int(word)>999 and int(word)<10000:
                names.append(' '.join(words[:i]))
                break
print(names)
Obviously though, this is far from ideal and I was wondering about better ways to do this.
You may want to use regular expressions for this.
import re
name = []
for file in file_list:
    m = re.match(r'^(.+?)\.\d{4}\.', file)
    if m:
        name.append(m.groups()[0])
Using a regular expression, this would become simpler
import re
names = ['hello.1235.sas','test.5678.hai']
for fn in names:
    myreg = r'(.*)\.(?:\d{4})\..*'
    output = re.findall(myreg,fn)
    print(output)
output:
['hello']
['test']
If you know that all entries have the same format, here is a list comprehension approach:
[item[0] for item in filter(lambda parts: len(parts[1]) == 4, (item.split('.') for item in file_list))]
To be fair, I also like the solution provided by @James. Note that the downside of this list comprehension is that it chains three iteration steps:
1. Splitting every item.
2. Filtering the items that match.
3. Building the result.
With a regular for loop it could be more efficient:
output = []
for item in file_list:
    beginning, digits, end = item.split('.')
    if len(digits) == 4:
        output.append(beginning)
It does only one loop, which is much better.
You can use Positive Lookahead (?=(\.\d{4}))
import re
pattern=r'(.*)(?=(\.\d{4}))'
text=['*hello.1243.*','*.1243.*','hello.1235.sas','test.5678.hai','a.9999']
print(list(map(lambda x:re.search(pattern,x).group(0),text)))
output:
['*hello', '*', 'hello', 'test', 'a']
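Note that re.search returns None when nothing matches, so calling .group on the result would raise an AttributeError for such names. A hedged sketch that skips non-matching names instead, using a precompiled pattern (the names PREFIX, prefixes and files are illustrative, and the pattern assumes the four digits are followed by a dot or the end of the name):

```python
import re

# Everything before the first ".dddd" group that is followed by "." or the end.
PREFIX = re.compile(r"^(.*?)\.\d{4}(?:\.|$)")

def prefixes(file_list):
    """Return the part before the 4-digit field for every matching name."""
    out = []
    for name in file_list:
        m = PREFIX.match(name)
        if m:                      # skip names without a 4-digit field
            out.append(m.group(1))
    return out

files = ["hello.1235.sas", "test.5678.hai", "a.9999", "no_digits.txt", "x.12345.y"]
print(prefixes(files))  # ['hello', 'test', 'a']
```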
I have a list of strings that need to be sorted in numerical order, using two substrings as integer keys.
Using the sort() function orders my strings alphabetically, so I get 1, 10, 2..., which is not what I'm looking for.
Searching around, I found that a key parameter can be passed to the sort() function, and that sort(key=int) should do the trick. But since my key is a substring and not the whole string, this would lead to a cast error.
Supposing my strings are something like:
test1txtfgf10
test1txtfgg2
test2txffdt3
test2txtsdsd1
I want my list to be ordered in numeric order on the basis of the first integer and then on the second, so I would have:
test1txtfgg2
test1txtfgf10
test2txtsdsd1
test2txffdt3
I think I could extract the integer values, sort only them keeping track of what string they belong to and then ordering the strings, but I was wondering if there's a way to do this thing in a more efficient and elegant way.
Thanks in advance
Try the following
In [26]: import re
In [27]: f = lambda x: [int(x) for x in re.findall(r'\d+', x)]
In [28]: sorted(strings, key=f)
Out[28]: ['test1txtfgg2', 'test1txtfgf10', 'test2txtsdsd1', 'test2txffdt3']
This uses regex (the re module) to find all integers in each string, then compares the resulting lists. For example, f('test1txtfgg2') returns [1, 2], which is then compared against other lists.
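To make the comparison step concrete, here is a small sketch that prints the key computed for each string. The helper name numeric_key is illustrative; it returns a tuple rather than a list, which compares the same way, element by element.

```python
import re

def numeric_key(s):
    """All runs of digits in s, as a tuple of ints."""
    return tuple(int(n) for n in re.findall(r"\d+", s))

strings = ["test1txtfgf10", "test1txtfgg2", "test2txffdt3", "test2txtsdsd1"]

keys = [numeric_key(s) for s in strings]
print(keys)  # [(1, 10), (1, 2), (2, 3), (2, 1)]

# Tuples compare element by element, so (1, 2) < (1, 10) < (2, 1) < (2, 3).
ordered = sorted(strings, key=numeric_key)
print(ordered)
```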
Extract the numeric parts and sort using them
import re
d = """test1txtfgf10
test1txtfgg2
test2txffdt3
test2txtsdsd1"""
lines = d.split("\n")
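re_numeric = re.compile(r"^[^\d]+(\d+)[^\d]+(\d+)$")
def key(line):
    """Returns a tuple (n1, n2) of the numeric parts of line."""
    m = re_numeric.match(line)
    if m:
        # group(n) returns the n-th captured substring; groups() would return a tuple
        return (int(m.group(1)), int(m.group(2)))
    else:
        # non-matching lines sort last (None cannot be compared with tuples in Python 3)
        return (float("inf"), float("inf"))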
lines.sort(key=key)
Now lines are
['test1txtfgg2', 'test1txtfgf10', 'test2txtsdsd1', 'test2txffdt3']
import re
k = [
"test1txtfgf10",
"test1txtfgg2",
"test2txffdt3",
"test2txtsdsd1"
]
tmp = [([int(e) for e in re.split("[a-z]", el) if e], el) for el in k]  # ([1, 10], "test1txtfgf10"), ...
tmp = sorted(tmp, key=lambda pair: pair[0])  # sort on the numeric parts
tmp = [res for cm, res in tmp]