I have two lists:
wrong_chars = [
    ['أ','إ','ٱ','ٲ','ٳ','ٵ'],
    ['ٮ','ݕ','ݖ','ﭒ','ﭓ','ﭔ'],
    ['ڀ','ݐ','ݔ','ﭖ','ﭗ','ﭘ'],
    ['ٹ','ٺ','ٻ','ټ','ݓ','ﭞ'],
]
true_chars = [
    ['ا'],
    ['ب'],
    ['پ'],
    ['ت'],
]
For a given string I want to replace the entries in wrong_chars with those in true_chars. Is there a clean way to do that in python?
string module to the rescue!
There's a really handy string method called translate that does exactly what you're looking for, though you'll have to pass it a translation mapping. In Python 2 the maketrans helper lives in the string module; in Python 3 it's the str.maketrans built-in. See the str.translate documentation for details.
An example based on a tutorial from tutorialspoint is shown below:
>>> from string import maketrans
>>> trantab = maketrans("aeiou", "12345")
>>> "this is string example....wow!!!".translate(trantab)
th3s 3s str3ng 2x1mpl2....w4w!!!
It looks like you're working with unicode here though, which works slightly differently: for unicode strings, translate takes a dict mapping code points (ord values) to replacement characters. Here's an example that should work for you more specifically:
translation_dict = {}
for i, char_list in enumerate(wrong_chars):
    for char in char_list:
        # true_chars[i] is a one-element list, so take the character itself
        translation_dict[ord(char)] = true_chars[i][0]

corrected = example.translate(translation_dict)  # translate returns a new string
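Putting it together, a minimal sketch (the sample input string is just an assumption for illustration):
wrong_chars = [['أ','إ','ٱ','ٲ','ٳ','ٵ'], ['ٮ','ݕ','ݖ','ﭒ','ﭓ','ﭔ']]
true_chars = [['ا'], ['ب']]

translation_dict = {}
for i, char_list in enumerate(wrong_chars):
    for char in char_list:
        translation_dict[ord(char)] = true_chars[i][0]

example = 'أﭒ'  # hypothetical input containing two wrong characters
print(example.translate(translation_dict))  # -> 'اب'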
I merged your wrong and true chars into a single list of dictionaries, each pairing the wrong characters with their replacement. So here you are:
Link to a working sample: http://ideone.com/mz7E0R
And the code itself:
given_string = "ayznobcyn"
correction_list = [
    {"wrongs": ['x','y','z'], "true": 'x'},
    {"wrongs": ['m','n','o'], "true": 'm'},
    {"wrongs": ['q','r','s','t'], "true": 'q'}
]
processed_string = ""
true_char = ""
for s in given_string:
    true_char = s
    for correction in correction_list:
        if s in correction['wrongs']:
            true_char = correction['true']
            break
    processed_string += true_char

print given_string
print processed_string
This code can be optimized further, and of course it doesn't address any unicode issues; since you appear to be using Farsi, you should take care of that yourself.
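A minimal sketch of the same idea built directly from the question's lists (assuming each true_chars entry is a one-element list, as in the question):
correction_list = [
    {"wrongs": wrongs, "true": trues[0]}
    for wrongs, trues in zip(wrong_chars, true_chars)
]
processed_string = "".join(
    next((c["true"] for c in correction_list if ch in c["wrongs"]), ch)
    for ch in given_string
)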
#!/usr/bin/env python
from __future__ import unicode_literals
wrong_chars = [
    ['1', '2', '3'],
    ['4', '5', '6'],
    ['7'],
]
true_chars = 'abc'
table = {}
for keys, value in zip(wrong_chars, true_chars):
    table.update(dict.fromkeys(map(ord, keys), value))
print("123456789".translate(table))
Output
aaabbbc89
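The same recipe applied to the question's own lists would be (a sketch; the one-element true_chars lists are flattened to bare characters first):
true_chars_flat = [chars[0] for chars in true_chars]
table = {}
for keys, value in zip(wrong_chars, true_chars_flat):
    table.update(dict.fromkeys(map(ord, keys), value))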
In my idea you can make just one list that contains the true characters too, like this:
new_chars = [
    ["ا", "أ", "إ", "آ"],  # the true character first, then the wrong ones
    ["ب", "ٮ", "ﭒ"],
]
# add the true character to the front of each list, then:
def correct(ch):
    for chars in new_chars:
        if ch in chars:
            return chars[0]
    return ch
Related
I have a large list of names which is in this format
list1 = ["apple", "orange", "banana", "pine-apple"]
And I want it in this format
list1 = ["'apple'", "'orange'", "'banana'", "'pine-apple'"]
Basically, I want to add punctuation marks to every single word in the list
but since the list is too large, I can't do it manually.
So is there any python function or way to do this task. Thank You.
The names in Python are already strings enclosed in quotes, as you have shown. I am supposing you want to wrap each string in a specific quote character so it looks like '"apple"' or "'apple'". To do so, you can use the following snippet:
q = "'" # this will be wrapped around the string
list1 = ['apple','orange','banana','pine-apple']
list1 = [q+x+q for x in list1]
For reference, the syntax used in the last line is known as a list comprehension.
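An f-string variant of the same comprehension, for a quick check of the result (assuming Python 3.6+):
q = "'"
list1 = ['apple', 'orange', 'banana', 'pine-apple']
print([f"{q}{x}{q}" for x in list1])
# ["'apple'", "'orange'", "'banana'", "'pine-apple'"]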
According to the latest comment posted by @xdhmoore:
If you are using vim/nano (Linux/macOS) or Notepad (Windows), I would rather suggest you use IDLE (shipped with the Python installer).
str is the built-in function to convert a value into a string. You can run this code:
for i in range(len(list1)):
    # replace in place; removing and re-appending while iterating would scramble the order
    list1[i] = str(list1[i])
(Note that str of an existing string is unchanged; repr(list1[i]) is what produces a quoted form like "'apple'".)
Using a for loop to process each line, there are two ways to go:
text = "list1 = [apple,orange,banana,pine-apple]"
start = text.find('[')+1
stop = text.find(']')
lst = text[start:stop].split(',') # ['apple', 'orange', 'banana', 'pine-apple']
new_lst = [f'"{item}"' for item in lst] # ['"apple"', '"orange"', '"banana"', '"pine-apple"']
new_text1 = text[:start]+','.join(new_lst)+text[stop:] # 'list1 = ["apple","orange","banana","pine-apple"]'
text = "list1 = [apple,orange,banana,pine-apple]"
new_text2 = text.replace('[', '["').replace(']', '"]').replace(',', '","')
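Both approaches produce the same text; a quick sanity check:
assert new_text1 == new_text2
print(new_text2)  # list1 = ["apple","orange","banana","pine-apple"]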
Consider the line below read in from a txt file:
EDIT: The text file has thousands of lines just like the one below: TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055 ...
In the line there would be some data that corresponds to TAG1 and lots of data that have &TAG2 at their start.
I want to make a dictionary that has further dictionaries within it, like
{
    {'TAG1': 1494947148,1,d,ble,0,2,0,0}
    {'TAG2':
        {'1': 0, '2': 229109531800552}
        {'1': 0, '2': 22910953180055}
    }
    .
    .
}
How do I split the string starting at TAG1 and stopping just before the ampersand before TAG2? Does python allow some way to check if a certain character(s) has been encountered and stop/start there?
I would turn them into a dictionary of string keys and lists of values. It doesn't matter whether a tag has one item or several; uniform lists make parsing them simple. You can further process the result dictionary if you find that necessary.
The code discards the [] in tag names, since everything is turned into a list anyway.
from itertools import groupby
from operator import itemgetter
import re
s = "TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055"
splitted = map(re.compile(r"(?:\[\])?=").split, s.split("&"))
tag_values = groupby(sorted(splitted, key=itemgetter(0)), key=itemgetter(0))
result = {t: [c[1].split(',') for c in v] for t, v in tag_values}
And when you print the result, you get:
print(result)
{'TAG2': [['0', '229109531800552'], ['0', '22910953180055']], 'TAG1': [['1494947148', '1', 'd', 'ble', '0', '2', '0', '0']]}
How it works
splitted = map(re.compile(r"(?:\[\])?=").split, s.split("&"))
First you split the line on &. That turns the line into little chunks like "TAG2[]=0,229109531800552"; then map splits each chunk into two parts, removing the = or []= between them.
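Materialized with list for inspection, the intermediate value looks like this (a sketch):
>>> list(map(re.compile(r"(?:\[\])?=").split, s.split("&")))
[['TAG1', '1494947148,1,d,ble,0,2,0,0'], ['TAG2', '0,229109531800552'], ['TAG2', '0,22910953180055']]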
tag_values = groupby(sorted(splitted, key=itemgetter(0)), key=itemgetter(0))
Because of the map call, splitted is now an iterable that yields two-item lists when consumed. We then sort and group them by the tag (the string on the left of the =). Now tag_values pairs each tag with all of its matching entries (tag included). It is still lazy, though, which means none of the work described so far has actually happened yet, except for s.split("&").
result = {t: [c[1].split(',') for c in v] for t, v in tag_values}
The last line uses both a list and a dictionary comprehension. We want to turn the result into a dict mapping each tag to a list of values. The curly brackets are the dictionary comprehension; the variables t and v are unpacked from tag_values, where t is the tag and v is the group of matching entries (again, tag included). At the start of the curly brackets, t: means use t as a dictionary key, and what follows the colon is that key's value.
We want each dictionary value to be a list of lists. The square brackets are a list comprehension that consumes the iterable v and turns it into a list. The variable c represents each item in v, and since c has two parts, the tag and the value string, c[1].split(',') takes the value part and splits it straight into a list. And there is your result.
Further Reading
You really ought to get familiar with list/dict comprehensions and generator expressions, and take a look at yield if you want to get more done with Python; learn itertools, functools, and operator along the way. This is basically functional-programming material. Python is not a pure functional language, but these are powerful idioms you can use, and reading up on a functional language like Haskell will also improve your Python skills.
I think this might be what you need:
import json
data = "TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552&TAG2[]=0,22910953180055"
items = data.split("&")
res ={}
for item in items:
    key, value = item.split("=")
    key = key.replace("[]", "")
    values = value.split(",")
    if key in res:
        res[key].append(values)
    else:
        res[key] = [values]
print(res)
print(json.dumps(res))
The results:
{'TAG1': [['1494947148', '1', 'd', 'ble', '0', '2', '0', '0']],
'TAG2': [['0', '229109531800552'], ['0', '22910953180055']]}
{"TAG1": [["1494947148", "1", "d", "ble", "0", "2", "0", "0"]],
"TAG2": [["0", "229109531800552"], ["0", "22910953180055"]]}
This may help you:
string = 'TAG1=1494947148,1,d,ble,0,2,0,0&TAG2[]=0,229109531800552'
data = map(str,string.split('&'))
print data
in_data_dic= {}
for i in data:
    in_data = map(str, i.split('='))
    in_data_dic[in_data[0]] = in_data[1]
print in_data_dic
output
{'TAG2[]': '0,229109531800552', 'TAG1': '1494947148,1,d,ble,0,2,0,0'}
I used the csv module to create lists from a data file. It looks something like this now:
['unitig_5\t.\tregion\t401\t500\t0.00\t+\t.\tcov2=3.000', '0.000;gaps=0',
'0;cov=3', '3', '3;cQv=20', '20', '20;del=0;ins=0;sub=0']
['unitig_5\t.\tregion\t2201\t2300\t0.00\t+\t.\tcov2=10.860',
'1.217;gaps=0', '0;cov=8', '11', '13;cQv=20', '20', '20;del=0;ins=0;sub=0']
I need to pull out the lists and write them to a new file if their cov2 value (part of the first column above) is greater than some specified integer (say 140), so in that case the two lists above wouldn't be accepted.
How would I set it up to check which lists meet this qualification and put those lists to a new file?
You can use a regex:
>>> l=['unitig_5\t.\tregion\t401\t500\t0.00\t+\t.\tcov2=3.000', '0.000;gaps=0',
... '0;cov=3', '3', '3;cQv=20', '20', '20;del=0;ins=0;sub=0']
>>> import re
>>> float(re.search(r'cov2=([\d.]+)',l[0]).group(1))
3.0
The pattern r'cov2=([\d.]+)' will match any combination of digits (\d) and dots of length 1 or more. Then you can convert the result to float and compare:
>>> var=float(re.search(r'cov2=([\d.]+)',l[0]).group(1))
>>> var>140
False
Also, since it's possible that the regex doesn't match, you can use a try/except to handle the exception:
try:
    var = float(re.search(r'cov2=([\d.]+)', l[0]).group(1))
    print var > 140
except AttributeError:
    pass  # no cov2 in this row; print an error message here if you like
I would first split the first string by tabs ("\t"), which seem to separate the fields.
Then, if cov2 is always the last field, further parsing is easy: cut off the "cov2=" prefix, convert the remainder to float, and compare (see the sketch below).
If it's not necessarily the last field, a simple search for the prefix should be sufficient.
Of course, complexity can be increased indefinitely if error checking or a more tolerant search is required.
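A minimal sketch of that tab-splitting idea (assuming cov2= really is the last field of the first column, as in the sample rows; row stands for one of the lists from the question):
first_col = row[0]                          # 'unitig_5\t.\tregion\t...\tcov2=3.000'
last_field = first_col.split('\t')[-1]      # 'cov2=3.000'
value = float(last_field[len('cov2='):])    # 3.0
accept = value > 140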
lst = [
    ['unitig_5\t.\tregion\t401\t500\t0.00\t+\t.\tcov2=3.000', '0.000;gaps=0',
     '0;cov=3', '3', '3;cQv=20', '20', '20;del=0;ins=0;sub=0'],
    ['unitig_5\t.\tregion\t2201\t2300\t0.00\t+\t.\tcov2=10.860',
     '1.217;gaps=0', '0;cov=8', '11', '13;cQv=20', '20', '20;del=0;ins=0;sub=0'],
]
filtered_list = [l for l in lst if re.match(r'.*cov2=([\d.]+)$', l[0])]
You could extract the float value using rsplit if all the first elements contain the substring:
for row in list_of_rows:
    if float(row[0].rsplit("=", 1)[1]) > 140:
        pass  # write the row here
If you don't actually need every row you should do it when you first read the file writing as you go.
with open("input.csv") as f, open("output.csv", "w") as out:
r = csv.reader(f)
wr = csv.writer(out)
for row in r:
if float(row[0].rsplit("=", 1)[1]) > 140:
wr.writerows(row)
I have .txt file which looks like:
[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]
[ 4.34562117e-01 -1.04469577e-01 2.83633101e-01 1.00452355e-01 -7.12572469e-01 -4.99234705e-01 -1.93152897e-01 1.80787567e-02]
I need to extract all the floats from it and put them into a list/array.
What I've done is this:
A = []
for line in open("general.txt", "r").read().split(" "):
    for unit in line.split("]", 3):
        A.append(list(map(lambda x: str(x), unit.replace("[", "").replace("]", "").split(" "))))
but A contains elements like [''] or even worse ['3.20973096e-02\n']. These are all strings, but I need floats. How to do that?
Why not use a regular expression?
>>> import re
>>> e = r'(\d+\.\d+e?(?:\+|-)\d{2}?)'
>>> results = re.findall(e, your_string)
>>> results
['5.44339373e+00',
'2.77404404e-01',
'1.26122094e-01',
'9.83589873e-01',
'1.95201179e-01',
'4.49866890e-01',
'2.06423297e-01',
'1.04780491e+00',
'4.34562117e-01',
'1.04469577e-01',
'2.83633101e-01',
'1.00452355e-01',
'7.12572469e-01',
'4.99234705e-01',
'1.93152897e-01',
'1.80787567e-02']
Now, these are the matched strings, but you can easily convert them to floats:
>>> map(float, re.findall(e, your_string))
[5.44339373,
0.277404404,
0.126122094,
0.983589873,
0.195201179,
0.44986689,
0.206423297,
1.04780491,
0.434562117,
0.104469577,
0.283633101,
0.100452355,
0.712572469,
0.499234705,
0.193152897,
0.0180787567]
Note, the regular expression might need some tweaking (for example, as written it drops the minus signs), but it's a good start.
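A sketch of one such tweak that keeps the signs (assuming every value uses the d.ddde±dd notation shown in the question):
>>> e = r'([-+]?\d+\.\d+e[-+]\d{2})'
>>> map(float, re.findall(e, your_string))  # now yields -5.44339373, -0.277404404, ...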
As a more precise way, you can use a regex to split the lines:
>>> s="""[ -5.44339373e+00 -2.77404404e-01 1.26122094e-01 9.83589873e-01
... 1.95201179e-01 -4.49866890e-01 -2.06423297e-01 1.04780491e+00]
... [ 4.34562117e-01 -1.04469577e-01 2.83633101e-01 1.00452355e-01 -7.12572469e-01 -4.99234705e-01 -1.93152897e-01 1.80787567e-02] """
>>> print re.split(r'[\s\[\]]+',s)
['', '-5.44339373e+00', '-2.77404404e-01', '1.26122094e-01', '9.83589873e-01', '1.95201179e-01', '-4.49866890e-01', '-2.06423297e-01', '1.04780491e+00', '4.34562117e-01', '-1.04469577e-01', '2.83633101e-01', '1.00452355e-01', '-7.12572469e-01', '-4.99234705e-01', '-1.93152897e-01', '1.80787567e-02', '']
And in this case, since you have the data in a file, you can do:
import re
print re.split(r'[\s\[\]]+',open("general.txt", "r").read())
If you want to get rid of the leading and trailing empty strings, you can use a list comprehension:
>>> print [i for i in re.split(r'[\s\[\]]*',s) if i]
['-5.44339373e+00', '-2.77404404e-01', '1.26122094e-01', '9.83589873e-01', '1.95201179e-01', '-4.49866890e-01', '-2.06423297e-01', '1.04780491e+00', '4.34562117e-01', '-1.04469577e-01', '2.83633101e-01', '1.00452355e-01', '-7.12572469e-01', '-4.99234705e-01', '-1.93152897e-01', '1.80787567e-02']
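And since the goal is floats, a final conversion is just one more step (a sketch):
>>> [float(i) for i in re.split(r'[\s\[\]]+', s) if i]
[-5.44339373, -0.277404404, 0.126122094, 0.983589873, ...]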
let's slurp the file
content = open('data.txt').read()
split on ']'
logical_lines = content.split(']')
strip the '[' and the other stuff
logical_lines = [ll.lstrip(' \n[') for ll in logical_lines]
convert to floats
lol = [map(float,ll.split()) for ll in logical_lines]
Sticking it all in a one-liner
lol=[map(float,l.lstrip(' \n[').split()) for l in open('data.txt').read().split(']')]
I've tested it on the exemplar data we were given and it works...
I'm trying to build a list of domain names from an Enom API call. I get back a lot of information and need to locate the domain name related lines, and then join them together.
The string that comes back from Enom looks somewhat like this:
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1
I'd like to build a list from that which looks like this:
[domain1.com, domain2.org, domain3.co.uk, domain4.net]
To find the different domain name components I've tried the following (where "enom" is the string above) but have only been able to get the SLD and TLD matches.
re.findall("^.*(SLD|TLD).*$", enom, re.M)
Edit:
Every time I see a question asking for a regular-expression solution, I have this bizarre urge to try to solve it without regular expressions. Most of the time that is also more efficient than using a regex; I encourage the OP to test which of the solutions is fastest.
Here is the naive approach:
a = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
b = a.split("\n")
c = [x.split("=")[1] for x in b if x != 'TLDOverride=1']
for x in range(0, len(c), 2):
    print ".".join(c[x:x+2])
>> domain1.com
>> domain2.org
>> domain3.co.uk
>> domain4.net
You have a capturing group in your expression. re.findall documentation says:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
That's why only the content of the capturing group is returned.
Try:
re.findall("^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)
This would return a list of tuples:
[('SLD1', 'domain1'), ('TLD1', 'com'), ('SLD2', 'domain2'), ('TLD2', 'org'), ('SLD3', 'domain3'), ('TLD4', 'co.uk'), ('SLD5', 'domain4'), ('TLD5', 'net')]
Combining SLDs and TLDs is then up to you.
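For instance, one sketch of that combining step (assuming the SLD/TLD entries alternate in pairs, as they do in the sample):
pairs = re.findall(r"^.*((?:SLD|TLD)\d*)=(.*)$", enom, re.M)
domains = [sld + '.' + tld
           for (_, sld), (_, tld) in zip(pairs[::2], pairs[1::2])]
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']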
This works for your example:
>>> sld_list = re.findall("^.*SLD[0-9]*?=(.*?)$", enom, re.M)
>>> tld_list = re.findall("^.*TLD[0-9]*?=(.*?)$", enom, re.M)
>>> map(lambda x: x[0] + '.' + x[1], zip(sld_list, tld_list))
['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
I'm not sure why you're talking about regular expressions. I mean, why not just run a for loop?
A famous quote seems to be appropriate here:
Some people, when confronted with a problem, think “I know, I'll use
regular expressions.” Now they have two problems.
domains = []
components = []
for line in enom.split('\n'):
    k, v = line.split('=')
    if k == 'TLDOverride':
        continue
    components.append(v)
    if k.startswith('TLD'):
        domains.append('.'.join(components))
        components = []
P.S. I'm not sure what this TLDOverride is, so the code just ignores it.
Here's one way:
import re
print map('.'.join, zip(*[iter(re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M))]*2))
# ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
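De-sugared, the zip(*[iter(...)]*2) idiom pairs up consecutive matches by pulling both elements of each pair from one shared iterator (a sketch):
matches = re.findall(r'^(?:S|T)LD\d+=(.*)$', text, re.M)
it = iter(matches)
pairs = zip(it, it)              # both arguments share one iterator, so items pair off
domains = map('.'.join, pairs)   # ['domain1.com', 'domain2.org', ...]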
Just for fun, map -> filter -> map:
input = """
SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
"""
splited = map(lambda x: x.split("="), input.split())
slds = filter(lambda x: x[1][0].startswith('SLD'), enumerate(splited))
print map(lambda x: '.'.join([x[1][1], splited[x[0] + 1][1], ]), slds)
>>> ['domain1.com', 'domain2.org', 'domain3.co.uk', 'domain4.net']
This appears to do what you want:
domains = re.findall(r'SLD\d+=(.+)', re.sub(r'\nTLD\d+=', '.', enom))
It assumes that the lines are sorted and an SLD always comes directly before its TLD. If that might not be the case, try this slightly more verbose code without regexes:
d = dict(x.split('=') for x in enom.strip().splitlines())
domains = [
    d[key] + '.' + d.get('T' + key[1:], '')
    for key in d if key.startswith('SLD')
]
You need to use a multiline regex for this.
data = """SLD1=domain1
TLD1=com
SLD2=domain2
TLD2=org
TLDOverride=1
SLD3=domain3
TLD4=co.uk
SLD5=domain4
TLD5=net
TLDOverride=1"""
domain_seq = re.compile(r"SLD\d=(\w+)\nTLD\d=([\w.]+)", re.M)  # [\w.] lets TLDs like co.uk match
for item in domain_seq.finditer(data):
    domain, tld = item.group(1), item.group(2)
    print "%s.%s" % (domain, tld)
As some other answers already said, there's no need to use a regular expression here. A simple split and some filtering will do nicely:
lines = data.split("\n")  # assuming data contains your input string
sld, tld = [[x.split("=")[1] for x in lines
             if x[:3] == t and not x.startswith("TLDOverride")]
            for t in ("SLD", "TLD")]
result = [x + "." + y for x, y in zip(sld, tld)]