How to remove duplicate dialing codes in Python?

I'm trying to standardize a dataframe column of international phone numbers. I've managed to get rid of everything else except for duplicate dialing codes.
For instance, some German numbers are in the following format "00 49 49 7 123 456 789", i.e. they contain two consecutive dialing codes (49). I was wondering if there's an easy fix to get rid of the duplicate and leave the number as "00 49 7 123 456 789".
I have tried some regex and itertools.groupby solutions, but with no success, as the variations in the different dialing codes cause issues.
I would appreciate any help, thank you.

This is a very data-driven problem, so the solution may change a lot depending on the actual data you are dealing with. Anyway, this will do what you want to achieve:
number = "00 49 49 7 123 456 789"
# Split the number to a list of number parts
num_parts = number.split()
# Define the latest position to look for a dial number
dial_code_max_pos = 4
# Iterator of tuples in the form of (number_part, next_number_part)
first_parts_with_sibling = zip(
    num_parts[:dial_code_max_pos],
    num_parts[1:dial_code_max_pos]
)
# Re-build the start of the number but removing the parts
# that have an identical right-sibling
first_parts_with_no_duplicates = [
    num[0] for num in first_parts_with_sibling
    if len(set(num)) > 1  # This is the actual filter
]
# Compose back the number
number = ' '.join(first_parts_with_no_duplicates + num_parts[dial_code_max_pos - 1:])
Again, this kind of normalization is risky in production: you could end up losing valuable data to an algorithm that does not cover every possible kind of input.
As @Clément said in his comment, be sure to run a few checks on the original number (e.g. its length) before applying any transformation.
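If the duplicated dialing code always appears as two identical, whitespace-separated tokens, a regex backreference is another possible approach. This is only a sketch under that assumption; it would also collapse a legitimately repeated token elsewhere in the number:
import re

number = "00 49 49 7 123 456 789"
# Collapse an immediately repeated token such as "49 49" into a single "49".
number = re.sub(r'\b(\d+) \1\b', r'\1', number)
print(number)  # 00 49 7 123 456 789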

Related

Remove empty rows within a dataframe and check similarity

I am having some difficulty selecting non-empty fields using regex (findall) within my dataframe, looking for words contained in a text source:
text = "Be careful otherwise police will capture you quickly."
I need to look for words that end with ful in my text string, then look for words that end with ful in my dataset.
Author DF_Text
31 Better the devil you know than the one you don't
53 Beware the door with too many keys.
563 Be careful what you tolerate. You are teaching people how to treat you.
41 Fear the Greeks bearing gifts.
539 NaN
51 The honey is sweet but the bee has a sting.
21 Be careful what you ask for; you may get it.
(from csv/txt file).
I need to extract the words ending with ful in text, then find the DF_Text rows (and thus their Author) that contain words ending with ful, appending the results to a list.
n = 0
for i in df['DF_Text']:
    print(re.findall(r"\w+ful", i))
    n = n + 1
print(n)
My question is: how can I remove empty rows ([]) from the analysis (i.e. the NaN rows) and report the Author values (e.g. 563, 21) the matches relate to?
I will be happy to provide further information in case anything is unclear.
Use str.findall instead of looping with re.findall:
df["found"] = df["DF_Text"].str.findall(r"(\w+ful)")
df.loc[df["found"].str.len().eq(0),"found"] = df["Author"]
print (df)
Author DF_Text found
0 31 Better the devil you know than the one you don't 31
1 53 Beware the door with too many keys. 53
2 563 Be careful what you tolerate. You are teaching... [careful]
3 41 Fear the Greeks bearing gifts. 41
4 539 NaN NaN
5 51 The honey is sweet but the bee has a sting. 51
6 21 Be careful what you ask for; you may get it. [careful]
I would use the .notna() function of pandas to get rid of that row in your dataframe.
Something like this:
df = df[df['DF_Text'].notna()]
Note that the line references the dataframe twice before overwriting it; this is intentional.
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notna.html
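Putting the two answers together, a minimal sketch (using the Author/DF_Text columns from the question; the sample rows here are just a subset of the data shown above) might look like:
import pandas as pd

df = pd.DataFrame({
    "Author": [563, 539, 21],
    "DF_Text": ["Be careful what you tolerate.", None, "Be careful what you ask for."],
})

df = df[df["DF_Text"].notna()]                      # drop the NaN rows first
df["found"] = df["DF_Text"].str.findall(r"\w+ful")  # words ending in "ful"
hits = df[df["found"].str.len() > 0]                # keep rows with at least one match
print(hits[["Author", "found"]])                    # Authors 563 and 21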

How to generate a set of specific string values with a maximum of 6 digits?

I am working on a project in Python 3 where I need to create a sequence of zero-padded numbers with a fixed number of digits. The numbers should be strings saved in a set, since it's faster than a list and they're all unique values.
I.e., I need something like:
Output
000001
000002
000003
...
000010
000011
...
000100
//and so on
Code
def build_sequence():
    seq = set()
    # logic here
    return seq
I have no idea how to solve this issue. It would be great if someone could put me in the right direction.
There are several ways to do this.
The first step is to get the lower and upper limits of the desired sequence; from the question, let's assume 1 to 100. A simple for loop should do the trick:
for i in range(1, 101):
    print(i)
This should print 1 to 100 as
1 2 3 .... 100
But we want fixed-width, zero-padded strings, and there are many Pythonic approaches to this.
A simple one is format:
format(1, "06")
'000001'
Put this in a loop.
In Python 3.6+, we have f-strings:
temp_var = 19
f'{temp_var:06}'
There are lots of other methods as well; Google and the Python docs are a great start.
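Putting that together with the build_sequence() skeleton from the question (a sketch; the upper limit of 1,000,000 is an assumption, adjust as needed):
def build_sequence(limit=1_000_000):
    # Set of zero-padded, 6-digit strings: "000000" ... "999999"
    return {f"{i:06}" for i in range(limit)}

seq = build_sequence()
print(len(seq))         # 1000000
print("000042" in seq)  # True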
Your question is kind of vague, but from what I can guess you want a function to generate from 000000 to 999999.
def build_sequence():
    for i in range(1000000):
        yield "%.6d" % i
You can then iterate through each string using
for i in build_sequence():
    print(i)
where i is the string

How to use a regex to search for contiguous incrementing sequences

I would like to use regex to increase the speed of my searches for specific records within a large binary image. It seems like regex searches always outperform my own search methods, so that's why I'm looking into this. I have already implemented the following, which works, but is not very fast.
My binary image is loaded into a Numpy memmap as words.
I_FILE = np.memmap(opts.image_file, dtype='uint32', mode='r')
And here is start of my search loop currently (which works):
for i in range(0, FILESIZE - 19):
    if (((I_FILE[i] + 1 == I_FILE[i + 19]) or (I_FILE[i - 19] + 1 == I_FILE[i])) and I_FILE[i] < 60):
        # ...do stuff...
This seeks out records that are 19 words (76 bytes) long and start with a decimal sequence number between 0 and 59. It validates a record by looking for an incrementing sequence number in either the record before or the record after the current search location.
I've seen a few examples where folks have built variables into a pattern string using re.escape (like this: How to use a variable inside a regular expression?), but I can't seem to figure out how to search for a changing value sequence.
I managed to make it work with regex, but it was a bit more complicated than I expected. The regex expressions look for two values between 0 and 59 that are separated by 72 bytes (18 words). I used two regex searches to ensure that I wouldn't miss records at the end of a sequence:
# First search uses the lookahead assertion to not consume large amounts of data.
SearchPattern1 = re.compile(b'[\0-\x3B]\0\0\0(?=.{72}[\1-\x3B]\0\0\0)', re.DOTALL)
# Again using the positive lookbehind assertion (?<= ... ) to grab the ending entries.
SearchPattern2 = re.compile(b'(?<=[\0-\x3B]\0\0\0.{72})[\1-\x3B]\0\0\0', re.DOTALL)
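A quick way to convince yourself that the leading byte class covers exactly the sequence numbers 0-59 (0x00-0x3B) is to match it against hand-built words; this sketch assumes the file stores little-endian uint32 words, as the memmap implies on a little-endian machine:
import re

pat = re.compile(b'[\0-\x3B]\0\0\0', re.DOTALL)
assert pat.match((59).to_bytes(4, 'little'))          # 59 == 0x3B, still inside the class
assert pat.match((60).to_bytes(4, 'little')) is None  # 60 == 0x3C, outside the class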
Next, perform both searches and combine the results.
HitList1 = [m.start(0) for m in SearchPattern1.finditer(I_FILE)]
HitList2 = [m.start(0) for m in SearchPattern2.finditer(I_FILE)]
AllHitList = list(set(HitList1 + HitList2))
SortedHitList = sorted(AllHitList)
Now I run a search that has the same conditions as my original solution, but it runs on a much smaller set of data!
for i in range(0, len(SortedHitList)):
    TestLoc = SortedHitList[i]
    if (I_FILE[TestLoc] + 1 == I_FILE[TestLoc + 19]) or (I_FILE[TestLoc - 19] + 1 == I_FILE[TestLoc]):
        # ... do stuff ...
The result was very successful! The original solution took 58 seconds to run on a 300 MB binary file, while the new regex solution took only 2 seconds!!

How to refine a python script for a bioinformatics query

I am quite new to Python and would be grateful for some assistance if possible. I am comparing the genomes of two closely related organisms [E_C & E_F] and trying to identify some basic insertions and deletions. I have run a FASTA pairwise alignment (glsearch36) using sequences from both organisms.
Below is a section of my Python script where I have been able to identify a 7-nucleotide stretch (heptamer) in one sequence (database) that corresponds to a gap in the other sequence (query). This is an example of what I have:
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9
GAGGAAG
Assume the gap is at position 9. I am trying to refine the script to select gaps that are 20 nucleotides or more apart on both sequences and only if the surrounding nucleotides also match
ATGCACAAGTAAGGTTACCG-ACCTGTATGTGAACTCAACA
||| |||
GTGCTCGGGTCACCTTACCGGACCGCCCAGGGCGGCCCAAG
21
CCGGACC
This is the relevant section of my script (the top half deals with opening different files); it also prints a dictionary with the count of each sequence at the end.
list_of_positions = []
for match in re.finditer(r'(?=(%s))' % re.escape("-"), dict_seqs[E_C]):
    list_of_positions.append(match.start())
set_of_positions = set(list_of_positions)

for position in list_of_positions:
    list_no_indels = []
    for number in range(position - 20, position):
        list_no_indels.append(number)
    for number in range(position + 1, position + 21):
        list_no_indels.append(number)
    set_no_indels = set(list_no_indels)
    if len(set_no_indels.intersection(set_of_positions)) > 0: continue
    if len(set_no_indels.intersection(set_of_positions_EF)) > 0: continue
    print position
    #print match.start()
    print dict_seqs[E_F][position - 3:position + 3]
    key = dict_seqs[E_F][position - 3:position + 3]
    if nt_dict.has_key(key):
        nt_dict[key] += 1
    else:
        nt_dict[key] = 1
print nt_dict
Essentially, I was trying to edit the results of pairwise alignments to try and identify the nucleotides opposite the gaps in both the query and database sequences in order to conduct some basic Insertion/Deletion analysis.
I was able to solve one of my earlier issues by increasing the distance between gaps "-" to 20 nt's in an attempt to reduce noise, this has improved my results. Script edited above.
This is an example of my results, and at the end I have a dictionary which counts the occurrences of each sequence.
ATGCACAA-ACCTGTATG # query
ATGCAGAGGAAGAGCAAG # database
9 (position on the sequence)
GAGGAA (hexamer)
ATGCACAAGACCTGTATG # query
ATGCAGAG-AAGAGCAAG # database
9 (position)
CAAGAC (hexamer)
However, I am still trying to fix the script so that the nucleotides around the gap must match exactly, as in this example, where the | just marks matching nt's on each sequence:
GGTTACCG-ACCTGTATGTGAACTCAACA # query
||| ||
CCTTACCGGACCGCCCAGGGCGGCCCAAG # database
9
ACCGAC
Any help with this would be gratefully appreciated!
I think I understand what you are trying to do but, as @alko has said, comments in your code will definitely help a lot.
As to finding an exact match around the gap you could run a string comparison:
Something along the lines of:
if query[position - 3:position] == database[position - 3:position] and query[position + 1:position + 3] == database[position + 1:position + 3]:
    # Do something
You will need to replace "query" and "database" with what you have called your strings that you want to compare.
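For instance, applied to the first example in the question (the variable names are placeholders, and Python's 0-based index 8 corresponds to "position 9" in the post), the check rejects the gap because the flanks differ:
query    = "ATGCACAA-ACCTGTATG"
database = "ATGCAGAGGAAGAGCAAG"
position = query.index("-")  # 8

if (query[position - 3:position] == database[position - 3:position]
        and query[position + 1:position + 3] == database[position + 1:position + 3]):
    print("flanking nucleotides match")   # keep this gap
else:
    print("flanking nucleotides differ")  # skip it, as happens here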

More efficient solution to loop nesting required

I am trying to compare two files. I will list the contents of the two files:
File 1            File 2
"d.complex.1"     "d.complex.1"
1                 4
5                 5
48                47
65                21
d.complex.10      d.complex.10
46                5
21                46
109               121
192               192
There are 2000 d.complex entries in total in each file. I am trying to compare both files, but the values listed under d.complex.1 in the first file have to be checked against all 2000 d.complex entries in the second file, and any values that do not match are to be printed out. For example, in the files above, number 48 under d.complex.1 in file1 is not present under d.complex.1 in file2, so that number has to be stored in a list (to print out later). Then the same d.complex.1 has to be compared with d.complex.10 of file2, and since 1, 48 and 65 are not there, they have to be appended to the list.
The method I chose to achieve this was to use sets and then take an intersection. The code I wrote was:
import string

first_complex = open("file1.txt", "r")
first_complex_lines = first_complex.readlines()
first_complex_lines = map(string.strip, first_complex_lines)
first_complex.close()
second_complex = open("file2.txt", "r")
second_complex_lines = second_complex.readlines()
second_complex_lines = map(string.strip, second_complex_lines)
second_complex.close()
list_1 = []
list_2 = []
res_1 = []
for line in first_complex_lines:
    if line.startswith("d.complex"):
        res_1.append([])
    res_1[-1].append(line)
res_2 = []
for line in second_complex_lines:
    if line.startswith("d.complex"):
        res_2.append([])
    res_2[-1].append(line)
h = len(res_1)
k = len(res_2)
for i in res_1:
    for j in res_2:
        print i[0]
        print j[0]
        target_set = set(i)
        target_set_1 = set(j)
        for s in target_set:
            if s not in target_set_1:
                print s
The above code is giving an output like this (just an example):
1
48
65
d.complex.1.dssp
d.complex.1.dssp
46
21
109
d.complex.1.dssp
d.complex.1.dssp
d.complex.10.dssp
Though the above output is correct, I want a more efficient way of doing this; can anyone help me? Also, two d.complex.1.dssp lines are printed instead of one, which is also not good.
What I would like to have is:
d.complex.1
d.complex.1 (name from file2)
1
48
65
d.complex.1
d.complex.10 (name from file2)
1
48
65
I am quite new to Python, so my approach above might be flawed. Also, I have never used sets before :(. Can someone give me a hand here?
Pointers:
Use list comprehensions or generator expressions to simplify data processing; they are more readable.
Just generate the sets once.
Use functions to not repeat yourself, especially doing the same task twice.
I've made a few assumptions about your input data; you might want to try something like this:
def parsefile(filename):
    ret = {}
    cur = None
    for line in (x.strip() for x in open(filename, 'r')):
        if line.startswith('d.complex'):
            cur = set()
            ret[line] = cur
        if cur is None or not line.isdigit():
            continue
        cur.add(int(line))
    return ret

def compareStructures(first, second):
    # Iterate through key,value pairs in first
    for firstcmplx, firstmembers in first.iteritems():
        # Iterate through key,value pairs in second
        for secondcmplx, secondmembers in second.iteritems():
            notinsecond = firstmembers - secondmembers
            if notinsecond:
                # There are items in first that aren't in second
                print firstcmplx
                print secondcmplx
                print "\n".join([str(x) for x in notinsecond])

first = parsefile("myFirstFile.txt")
second = parsefile("mySecondFile.txt")
compareStructures(first, second)
Edited for fixes.. shows how much I rely on running the code to test it :) Thanks Alex
There's already a good answer, by @MattH, focused on the Python details of your problem, and while it can be improved in several details the improvements would only gain you some percentage points in efficiency -- worthwhile but not great.
The only hope for a huge boost in efficiency (as opposed to "kaizen" incremental improvement) is a drastic change in the algorithm -- which may or may not be possible depending on characteristics of your data that you do not reveal, and some details about your precise requirements.
The crucial part is: roughly, what range of numbers will be present in the file, and roughly, how many numbers per "d.complex.N" stanza? You already told us there are going to be about 2000 stanzas per file (and that's also crucial of course) and the impression is that in each file they're going to be ordered by contiguous increasing N -- 1, 2, 3, and so on (is that so?).
Your algorithm builds two maps stanza->numbers (not with top efficiency, but that's what @MattH's answer focuses on enhancing), so it then inevitably needs N squared stanza-to-stanza checks -- as N is 2,000, it needs 4 million such checks.
Consider building reversed maps, number->stanzas -- if the range of numbers and the typical size of (amount of numbers in) a stanza are both reasonably limited, those will be more compact. For example, if the numbers are between 1 and 200, and there are about 4 numbers per stanza, a given number will typically appear in (2000 * 4) / 200 -> 40 stanzas, so such a mapping would have 200 entries of about 40 stanzas each. It only needs 200 squared (40,000) checks, rather than 4 million, to obtain the joint information for each number. Then, depending on the exact output format you need, formatting that information may require very substantial effort again -- if you absolutely require 4 million stanza-pair sections as the final output, then of course there is no way to avoid 4 million output operations, which will inevitably be very costly.
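A rough sketch of that reversed mapping, reusing the parsefile() dictionaries from the answer above (the function name invert is just illustrative):
from collections import defaultdict

def invert(stanzas):
    # stanzas: {stanza_name: set_of_numbers}, as returned by parsefile()
    by_number = defaultdict(set)
    for name, numbers in stanzas.items():
        for n in numbers:
            by_number[n].add(name)
    return by_number

first_by_number = invert(parsefile("myFirstFile.txt"))
# first_by_number[48] now lists every stanza in file 1 that contains 48,
# so comparisons can be driven per number instead of per stanza pair.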
But all of this depends on those numbers that you're not telling us -- average stanza population, and range of numbers in the files, as well as details on what constraints you must absolutely respect for output format (if the numbers are reasonable, the output format constraints are going to be the key constraint on the big-O performance you can get out of any program).
Remember, to quote Fred Brooks:
Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.
Brooks was writing in the '60s (though his collection of essays, "The Mythical Man-Month", was published later, in the '70s), whence the quaint use of "flowcharts" (where we'd say code or algorithms) and "tables" (where we'd say data or data structures) -- but the general concept is still perfectly valid: the organization and nature of your data, in all kinds of programs focused on data processing (such as yours), can be even more important than the organization of the code, especially since it constrains the latter;-).