Not Parsing Through - python

I am trying to parse a text file and find the index of the first character whose four preceding characters are all different. For example:
wxrgh
Here h would be the marker, since it follows four different characters, and its index would be 4. I find the index by converting the text into a list. It works for the test input but not for the actual input. Does anyone know what is wrong?
def Repeat(x):
    # Return the elements of x that occur more than once
    _size = len(x)
    repeated = []
    for i in range(_size):
        k = i + 1
        for j in range(k, _size):
            if x[i] == x[j] and x[i] not in repeated:
                repeated.append(x[i])
    return repeated
with open("input4.txt") as f:
    text = f.read()
test_array = []
split_array = list(text)
woah = ""
for i in split_array:
    first = split_array[split_array.index(i)]
    second = split_array[split_array.index(i) + 1]
    third = split_array[split_array.index(i) + 2]
    fourth = split_array[split_array.index(i) + 3]
    test_array.append(first)
    test_array.append(second)
    test_array.append(third)
    test_array.append(fourth)
    print(test_array)
    if Repeat(test_array) != []:
        test_array = []
    else:
        woah = split_array.index(i)
        print(woah)
print(woah)
I tried a test document and unit tests, but that still does not work.

You can utilise a set to help you with this.
Read the entire file into a list (buffer). Iterate over the buffer starting at offset 4. Create a set of the 4 characters that precede the current position. If the length of the set is 4 (i.e., they're all different) and the character at the current position is not in the set then you've found the index you're interested in.
W = 4
with open('input4.txt') as data:
    buffer = data.read()
for i in range(W, len(buffer)):
    if len(s := set(buffer[i-W:i])) == W and buffer[i] not in s:
        print(i)
Note:
If the input data are split over multiple lines you may want to remove newline characters.
You will need to be using Python 3.8+ to take advantage of the assignment expression (walrus operator)
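The same idea can be wrapped in a function that works without the walrus operator and strips newlines first. This is only a sketch: the name find_marker and its width parameter are made up for illustration, and it stops at the first match instead of printing every one.

```python
def find_marker(buffer, width=4):
    """Return the index of the first character that follows `width`
    mutually different characters and differs from all of them, or -1."""
    buffer = buffer.replace("\n", "")  # ignore line breaks in the input
    for i in range(width, len(buffer)):
        window = set(buffer[i - width:i])
        if len(window) == width and buffer[i] not in window:
            return i
    return -1


print(find_marker("wxrgh"))
```

As for the code in the question, note that split_array.index(i) always returns the position of the first occurrence of a character, so every repeated character is mapped back to where it first appeared. Iterating over indices, as above, avoids that.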

Related

why I am getting empty list when I use split()?

I have a textfile as:
-- Generated ]
FILEUNIT
METRIC /
Hello
-- timestep: Jan 01,2017 00:00:00
3*2 344.0392 343.4564 343.7741
343.9302 343.3884 343.7685 0.0000 341.0843
342.2441 342.5899 343.0728 343.4850 342.8882
342.0056 342.0564 341.9619 341.8840 342.0447 /
I have written code that reads the file, removes the words, characters and empty lines, does some other processing, and finally returns the numbers in the last four lines. I cannot work out how to put all the numbers from the text file into a single list. Right now new_line holds a string built from those numeric lines.
import string
def expand(chunk):
    l = chunk.split("*")
    chunk = [str(float(l[1]))] * int(l[0])
    return chunk
with open('old_textfile.txt', 'r') as infile1:
    for line in infile1:
        if set(string.ascii_letters.replace("e","")) & set(line):
            continue
        chunks = line.split(" ")
        #Get rid of newlines
        chunks = list(map(lambda chunk: chunk.strip(), chunks))
        if "/" in chunks:
            chunks.remove("/")
        new_chunks = []
        for i in range(len(chunks)):
            if '*' in chunks[i]:
                new_chunks += expand(chunks[i])
            else:
                new_chunks.append(chunks[i])
        new_chunks[len(new_chunks)-1] = new_chunks[len(new_chunks)-1]+"\n"
        new_line = " ".join(new_chunks)
When I use
A = new_line.split()
B = list(map(float, A))
it returns an empty list. Do you have any idea how I can put all these numbers in a single list?
Currently, I am writing new_line to a text file and reading it back, but that increases my runtime, which is not good.
f = open('new_textfile.txt').read()
A = f.split()
B = list(map(float, A))
list_1.extend(B)
Another solution used a regex, but it drops 3*2. I want that to be processed as 2 2 2.
import re
with open('old_textfile.txt', 'r') as infile1:
    lines = infile1.read()
nums = re.findall(r'\d+\.\d+', lines)
print(nums)
I'm not quite sure if I entirely understand what you are trying to do, but as I understand it you want to extract all numbers which either are in a decimal form \d+\.\d+ or an integer which is multiplied by another integer using an asterisk, so \d+\*\d+. You want the results all in a list of floats where the decimals are in the list directly and for the integers the second is repeated by the first.
One way to do this would be:
import re
lines = """
-- Generated ]
FILEUNIT
METRIC /
Hello
-- timestep: Jan 01,2017 00:00:00
3*2 344.0392 343.4564 343.7741
343.9302 343.3884 343.7685 0.0000 341.0843
342.2441 342.5899 343.0728 343.4850 342.8882
342.0056 342.0564 341.9619 341.8840 342.0447 /
"""
nums = []
for n in re.findall(r'(\d+\.\d+|\d+\*\d+)', lines):
    split_by_ast = n.split("*")
    if len(split_by_ast) == 1:
        # Plain decimal: append it directly
        nums += [float(split_by_ast[0])]
    else:
        # count*value: repeat the value count times
        nums += [float(split_by_ast[1])] * int(split_by_ast[0])
print(nums)
Which returns:
[2.0, 2.0, 2.0, 344.0392, 343.4564, 343.7741, 343.9302, 343.3884, 343.7685, 0.0, 341.0843, 342.2441, 342.5899, 343.0728, 343.485, 342.8882, 342.0056, 342.0564, 341.9619, 341.884, 342.0447]
The regular expression searches for numbers matching one of the formats (decimal or int*int). Then in case of a decimal it is directly appended to the list, in case of int*int it is parsed to a smaller list repeating the second int by first int times, then the lists are concatenated.
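If you would rather keep the line-by-line approach from the question, a rough sketch is to collect the numbers in one list that lives outside the loop, so nothing has to be written to a file and read back. In the question's code, new_line is reassigned on every pass through the loop, so after the loop it only holds the last processed line, which may explain the empty result if that line is blank. The names collect_numbers and the float-returning expand below are made up for illustration, and the code assumes every token on a letter-free line is either a number or count*value.

```python
def expand(token):
    """Turn 'count*value' into a list of floats; a plain number becomes a one-item list."""
    if "*" in token:
        count, value = token.split("*")
        return [float(value)] * int(count)
    return [float(token)]


def collect_numbers(lines):
    numbers = []  # one list for the whole file, created once
    for line in lines:
        # Skip header lines: anything containing a letter other than e/E
        if any(ch.isalpha() and ch not in "eE" for ch in line):
            continue
        # Treat the trailing "/" as whitespace, then split on any whitespace
        for token in line.replace("/", " ").split():
            numbers.extend(expand(token))
    return numbers


sample = [
    "-- Generated ]",
    "Hello",
    "3*2 344.0392 343.4564 343.7741",
    "342.0056 342.0564 /",
]
print(collect_numbers(sample))
```

With a real file, the same function can be called as collect_numbers(infile1) inside the with block, since a file object iterates over its lines.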

How to find the most amount of shared characters in two strings? (Python)

yamxxopd
yndfyamxx
Output: 5
I am not sure how to find the length of the longest run of characters shared by two strings. For the strings above, the longest shared run is "yamxx", which is 5 characters long.
"xx" would not be a solution because it is not the longest shared run. In this case the longest is yamxx, which is 5 characters long, so the output would be 5.
I am quite new to Python and Stack Overflow, so any help would be much appreciated!
Note: the characters should appear in the same order in both strings.
Here is a simple, efficient solution using dynamic programming.
def longest_substring(X, Y):
    m, n = len(X), len(Y)
    # LCSuff[i][j] = length of the longest common suffix of X[:i] and Y[:j]
    LCSuff = [[0 for k in range(n+1)] for l in range(m+1)]
    result = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if (i == 0 or j == 0):
                LCSuff[i][j] = 0
            elif (X[i-1] == Y[j-1]):
                LCSuff[i][j] = LCSuff[i-1][j-1] + 1
                result = max(result, LCSuff[i][j])
            else:
                LCSuff[i][j] = 0
    print(result)
longest_substring("abcd", "arcd") # prints 2
longest_substring("yammxdj", "nhjdyammx") # prints 5
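The table-based approach above can also return the shared run itself, not just its length. The following is a sketch of that variation, keeping only the previous row of the table; the function name longest_common_substring is an illustrative choice, not part of the answer above.

```python
def longest_common_substring(x, y):
    """Return the longest run of characters that appears contiguously in both strings."""
    best_len, best_end = 0, 0          # length of the best run and where it ends in x
    prev = [0] * (len(y) + 1)          # previous row of the suffix-length table
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common suffix that ended at x[i-2], y[j-2]
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return x[best_end - best_len:best_end]


print(longest_common_substring("yamxxopd", "yndfyamxx"))  # yamxx
```

Taking len() of the result gives the number asked for in the question.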
This solution starts with sub-strings of longest possible lengths. If, for a certain length, there are no matching sub-strings of that length, it moves on to the next lower length. This way, it can stop at the first successful match.
s_1 = "yamxxopd"
s_2 = "yndfyamxx"
l_1, l_2 = len(s_1), len(s_2)
found = False
sub_length = l_1                                   # Let's start with the longest possible sub-string
while (not found) and sub_length:                  # Loop, over decreasing lengths of sub-string
    for start in range(l_1 - sub_length + 1):      # Loop, over all start-positions of sub-string
        sub_str = s_1[start:(start+sub_length)]    # Get the sub-string at that start-position
        if sub_str in s_2:                         # If found a match for the sub-string, in s_2
            found = True                           # Stop trying with smaller lengths of sub-string
            break                                  # Stop trying with this length of sub-string
    else:                                          # If no matches found for this length of sub-string
        sub_length -= 1                            # Let's try a smaller length for the sub-strings
print(f"Answer is {sub_length}" if found else "No common sub-string")
Output:
Answer is 5
s1 = "yamxxopd"
s2 = "yndfyamxx"
# initializing counter
counter = 0
# build a copy of s1 without repeated characters
s = ""
for x in s1:
    if x not in s:
        s = s + x
# count the distinct characters of s1 that also occur in s2
for x in s:
    if x in s2:
        counter = counter + 1
# note: this counts shared characters regardless of order or adjacency
print(counter) # displays 5
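Counting distinct shared characters is not the same thing as measuring the longest shared run, even though both give 5 for the strings in the question. The sketch below puts the two side by side; the function names are illustrative, and the second one uses difflib.SequenceMatcher from the standard library as an independent cross-check rather than any of the methods above.

```python
from difflib import SequenceMatcher


def shared_distinct(s1, s2):
    """Number of distinct characters that occur in both strings (order ignored)."""
    return len(set(s1) & set(s2))


def longest_common_run(s1, s2):
    """Length of the longest block of characters contiguous in both strings."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    return matcher.find_longest_match(0, len(s1), 0, len(s2)).size


print(shared_distinct("yamxxopd", "yndfyamxx"), longest_common_run("yamxxopd", "yndfyamxx"))
print(shared_distinct("abc", "cba"), longest_common_run("abc", "cba"))
```

For "abc" and "cba" the two measures disagree, which shows why the order requirement in the question matters.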

Is there a way to remove specific strings from indexes using a for loop?

I am trying to reimplement the built-in bin(x) for better understanding. I have that part working; now the issue is how to dynamically remove the 0s when they are not necessary.
I have tried using replace(), but it seems to remove every matching "0". I am unsure how to select only the zeroes up to the first index at which there is a "1".
For example, if I have 0b00010010:
  ___
0b00010010
     ^
I would like to select the digits after the 0b and erase the 0s up to the first "1".
def bin(x):
    if x>0:
        binary = ""
        i = 0
        while x>0 and i<=16:
            string = str(int(x%2))
            binary = binary+string
            x/=2
            i = i+1
        d = binary[::-1]
        ret = f"0b{d}"
        return ret.replace("00","")
    else:
        x = abs(x)
        binary = ""
        i = 0
        while x > 0 and i <=16:
            string = str(int(x % 2))
            binary = binary + string
            x /= 2
            i = i + 1
        nd = binary[::-1]
        ret = f"-0b{nd}"
        return ret.replace("00","")
print(bin(8314))
0b00010000001111010 this is the current output
0b10000001111010 this is what I want
It might be better to simplify things by not generating those extra zeroes in the first place:
def bin(x):
    prefix = ("-" if x < 0 else "")
    x = abs(x)
    bits = []
    while x:
        x, bit = divmod(x, 2)  # division and remainder in one operation
        bits.append(str(bit))
    # Flip the bits so the LSB is on the right, then join as string
    bit_string = ''.join(bits[::-1])
    # Form the final string
    return f"{prefix}0b{bit_string}"
print(bin(8314))
prints
0b10000001111010
You should take a look at lstrip():
>>> b = "00010000001111010"
>>> b.lstrip("0")
'10000001111010'
Of course, make sure to prefix the binary with "0b" after calling lstrip().
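Put together, the lstrip() suggestion might look like the sketch below. The helper name strip_leading_zeros is hypothetical; it assumes the input is a string with an optional "-" sign and an optional "0b" prefix, and it keeps a single "0" when the value is zero.

```python
def strip_leading_zeros(binary):
    """Remove unnecessary leading zeros from a '0b...' style string."""
    sign = "-" if binary.startswith("-") else ""
    digits = binary[len(sign):]
    if digits.startswith("0b"):
        digits = digits[2:]                # drop the prefix before stripping
    digits = digits.lstrip("0") or "0"     # keep one zero if nothing is left
    return f"{sign}0b{digits}"


print(strip_leading_zeros("0b00010000001111010"))  # 0b10000001111010
```

Stripping before re-adding the prefix matters: calling lstrip("0") on the full string would also remove the "0" of "0b".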
Scott Hunter brought up a nice solution to your problem. However, if you want to use a for loop, consider trying the following:
binary = "0b00010000001111010"
start_index = binary.find("b")
# Scan from just after the "0b" prefix until the first non-zero digit
for index in range(start_index + 1, len(binary)):
    if binary[index] != "0":
        break
# Keep the prefix and everything from that digit onwards
binary = binary[:start_index + 1] + binary[index:]

python intelligent hexadecimal numbers generator

I want to generate 12-character hexadecimal strings, but with no digit repeated more than twice in a row: 00 is allowed, 000 is not.
I know how to generate ALL possibilities, from 00000000000 to FFFFFFFFFFF, but I know that I won't use all of those values. Because the file generated with ALL possibilities is many GB in size, I want to reduce it by not generating the strings that are of no use to me.
So my goal is to have results like 00A300BF8911 and not like 000300BF8911
Could you please help me to do so?
Many thanks in advance!
If you picked the same one twice, remove it from the choices for a round:
import random
hex_digits = '0123456789ABCDEF'
result = ""
for digit in range(12):
    # If the last two digits are identical, exclude that digit for this round
    if len(result) >= 2 and result[-1] == result[-2]:
        pick_from = hex_digits.replace(result[-1], "")
    else:
        pick_from = hex_digits
    cur_digit = random.choice(pick_from)
    result += cur_digit
print(result)
Since the title mentions generators, here is the above as a generator:
import random
hex_digits = '0123456789ABCDEF'
def hexGen():
    while True:
        result = ""
        for digit in range(12):
            # If the last two digits are identical, exclude that digit for this round
            if len(result) >= 2 and result[-1] == result[-2]:
                pick_from = hex_digits.replace(result[-1], "")
            else:
                pick_from = hex_digits
            cur_digit = random.choice(pick_from)
            result += cur_digit
        yield result
my_hex_gen = hexGen()
counter = 0
for result in my_hex_gen:
    print(result)
    counter += 1
    if counter > 10:
        break
Results:
1ECC6A83EB14
D0897DE15E81
9C3E9028B0DE
CE74A2674AF0
9ECBD32C003D
0DF2E5DAC0FB
31C48E691C96
F33AAC2C2052
CD4CEDADD54D
40A329FF6E25
5F5D71F823A4
You could also change the while True loop to only produce a certain number of these, based on a number passed into the function.
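A bounded version along those lines could look like this sketch. The name hex_gen and the parameters count and length are assumptions made for illustration, not part of the answer above.

```python
import random

HEX_DIGITS = "0123456789ABCDEF"


def hex_gen(count, length=12):
    """Yield `count` random hex strings with no digit three times in a row."""
    for _ in range(count):
        result = ""
        for _ in range(length):
            if len(result) >= 2 and result[-1] == result[-2]:
                # A third identical digit is not allowed: drop it from the choices
                choices = HEX_DIGITS.replace(result[-1], "")
            else:
                choices = HEX_DIGITS
            result += random.choice(choices)
        yield result


for value in hex_gen(3):
    print(value)
```

Because the generator stops on its own, the caller no longer needs a counter and a break.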
I interpret this question as, "I want to construct a rainbow table by iterating through all strings that have the following qualities. The string has a length of 12, contains only the characters 0-9 and A-F, and it never has the same character appearing three times in a row."
def iter_all_strings_without_triplicates(size, last_two_digits = (None, None)):
    a,b = last_two_digits
    if size == 0:
        yield ""
    else:
        for c in "0123456789ABCDEF":
            if a == b == c:
                continue
            else:
                for rest in iter_all_strings_without_triplicates(size-1, (b,c)):
                    yield c + rest
for s in iter_all_strings_without_triplicates(12):
    print(s)
Result:
001001001001
001001001002
001001001003
001001001004
001001001005
001001001006
001001001007
001001001008
001001001009
00100100100A
00100100100B
00100100100C
00100100100D
00100100100E
00100100100F
001001001010
001001001011
...
Note that there will be several hundred terabytes' worth of values outputted, so you aren't saving much room compared to just saving every single string, triplicates or not.
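To see how little the restriction removes, the strings can be counted without generating them. The sketch below tracks how many valid strings end in a run of one identical character and how many end in a run of two; the function name and the alphabet_size parameter are illustrative assumptions.

```python
def count_without_triplicates(size, alphabet_size=16):
    """Count strings of the given length with no character three times in a row."""
    if size == 0:
        return 1
    run1, run2 = alphabet_size, 0  # strings ending in a run of length 1 / length 2
    for _ in range(size - 1):
        # A different character starts a new run; the same character extends
        # a run of 1 to a run of 2 (a run of 2 cannot be extended).
        run1, run2 = (alphabet_size - 1) * (run1 + run2), run1
    return run1 + run2


allowed = count_without_triplicates(12)
print(allowed, 16 ** 12, allowed / 16 ** 12)
```

The last printed value is the fraction of all 12-character strings that remain after filtering.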
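import string, random
source = string.hexdigits[:16]
result = ''
while len(result) < 12:
    idx = random.randrange(len(source))  # randint(0, len(source)) could index past the end
    # Append unless it would make three identical characters in a row
    if len(result) < 2 or result[-1] != result[-2] or result[-1] != source[idx]:
        result += source[idx]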
You could extract a random sequence from a list containing each hexadecimal digit twice:
import random
digits = list('1234567890ABCDEF') * 2
random.shuffle(digits)
hex_number = ''.join(digits[:12])
If you wanted to allow shorter sequences, you could randomize that too, and left fill the blanks with zeros.
import random
digits = list('1234567890ABCDEF') * 2
random.shuffle(digits)
num_digits = random.randrange(3, 13)
hex_number = ''.join(['0'] * (12-num_digits)) + ''.join(digits[:num_digits])
print(hex_number)
You could use a generator that slides a window over the strings your current implementation yields, something like (hex_str[i:i + window_size] for i in range(len(hex_str) - window_size + 1)) with window_size = 3. Using len and set you could count the number of different characters in each slice, although in your example it might be easier to just compare all 3 characters.
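A minimal sketch of that sliding-window check, assuming a helper named has_triplicate (a made-up name) that is used to filter whatever strings the existing implementation produces:

```python
def has_triplicate(hex_str, window_size=3):
    """Return True if any window of `window_size` characters is all one character."""
    windows = (hex_str[i:i + window_size]
               for i in range(len(hex_str) - window_size + 1))
    # A window made of a single repeated character has exactly one distinct element
    return any(len(set(window)) == 1 for window in windows)


print(has_triplicate("000300BF8911"))  # True
print(has_triplicate("00A300BF8911"))  # False
```

Strings for which has_triplicate returns True would simply be skipped before they are written out.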
You can create a list of the values 0 to 255 and use random.sample on it to draw the values you need.

extract substring pattern

I have a long file with about 1200 sequences, like this:
>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA
LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL
I want to extract every pattern that has a cysteine (C) in the middle, preceded by five characters and followed by five more, such as xxxxxCxxxxx.
The output should look like this:
QDIQLCGMGIL
ILPEHCIIDIT
TISDNCVVIFS
FSKTSCSYCTM
This program only gives the positions of C; it does not do what I want:
pos=[]
def find(ch,string1):
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
            return pos
z=find('C','AWERQRTCWERTYCTAAAACTTCTTT')
print(z)
You need to return outside the loop, you are returning on the first match so you only ever get a single character in your list:
def find(ch,string1):
    pos = []
    for i in range(len(string1)):
        if ch == string1[i]:
            pos.append(i)
    return pos # outside
You can also use enumerate with a list comp in place of your range logic:
def indexes(ch, s1):
    return [index for index, char in enumerate(s1) if char == ch and 5 <= index <= len(s1) - 6]
Each index in the list comp is the character index and each char is the actual character so we keep each index where char is equal to ch.
If you want the five chars that are both sides:
In [24]: s="CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"
In [25]: inds = indexes("C",s)
In [26]: [s[i-5:i+6] for i in inds]
Out[26]: ['QDIQLCGMGIL', 'ILPEHCIIDIT']
I added checking the index as we obviously cannot get five chars before C if the index is < 5 and the same from the end.
You can do it all in a single function, yielding a slice when you find a match:
def find(ch, s):
    ln = len(s)
    for i, char in enumerate(s):
        if ch == char and 5 <= i <= ln - 6:
            yield s[i - 5:i + 6]
Presuming the data in your question is actually two lines from your file, like:
s = """>3fm8|A|A0JLQ2CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
>2ht9|A|A0JLT0LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDALYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCY"""
Running:
for line in s.splitlines():
    print(list(find("C", line)))
would output:
['0JLQ2CFLVNL', 'QDIQLCGMGIL', 'ILPEHCIIDIT']
['TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
Which gives six matches, not four as your expected output suggests, so I presume you did not include all possible matches.
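You can also speed up the code using str.find, starting at the last match index + 1 for each subsequent match:
def find(ch, s):
    ln, i = len(s) - 6, s.find(ch)
    # Keep searching until str.find reports no further match (-1)
    while i != -1:
        # Only yield when five characters exist on both sides
        if 5 <= i <= ln:
            yield s[i - 5:i + 6]
        i = s.find(ch, i + 1)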
Which will give the same output. Of course if the strings cannot overlap you can start looking for the next match much further in the string each time.
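If each record in the file really is split over several lines, as it appears in the question, one option is to join the sequence lines of a record before searching, so that a window spanning a line break is not lost. The sketch below assumes a simple FASTA-like layout (a header line starting with ">" followed by sequence lines); read_fasta and windows are hypothetical helper names.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining the sequence lines of each record."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def windows(seq, ch="C", flank=5):
    """Return every slice with `ch` in the middle and `flank` characters on each side."""
    return [seq[i - flank:i + flank + 1]
            for i, c in enumerate(seq)
            if c == ch and flank <= i <= len(seq) - flank - 1]


sample = """>3fm8|A|A0JLQ2
CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTP
QKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP"""
for header, seq in read_fasta(sample.splitlines()):
    print(header, windows(seq))
```

With a real file, read_fasta can be given the open file object directly, since it only needs an iterable of lines.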
My solution is based on regex, and shows all possible solutions using regex and while loop. Thanks to #Smac89 for improving it by transforming it into a generator:
import re
string = """CFLVNLNADPALNELLVYYLKEHTLIGSANSQDIQLCGMGILPEHCIIDITSEGQVMLTPQKNTRTFVNGSSVSSPIQLHHGDRILWGNNHFFRLNLP
LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKKLFHDMNVNYKVVELDLLEYGNQFQDA LYKMTGERTVPRIFVNGTFIGGATDTHRLHKEGKLLPLVHQCYL"""
# Generator
def find_cysteine2(string):
    # Create a loop that will utilize regex multiple times
    # in order to capture matches within groups
    while True:
        # Find a match
        data = re.search(r'(\w{5}C\w{5})', string)
        # If match exists, let's collect the data
        if data:
            # Collect the string
            yield data.group(1)
            # Shrink the string to not include
            # the previous result
            location = data.start() + 1
            string = string[location:]
        # If there are no matches, stop the loop
        else:
            break
print([x for x in find_cysteine2(string)])
# ['QDIQLCGMGIL', 'ILPEHCIIDIT', 'TISDNCVVIFS', 'FSKTSCSYCTM', 'TSCSYCTMAKK']
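The repeated search-and-shrink loop can also be replaced by a single pass with a lookahead assertion, which lets the regex engine report overlapping matches. This is a sketch of that alternative, not the method used above; find_cysteine_windows and its flank parameter are made-up names.

```python
import re


def find_cysteine_windows(sequence, flank=5):
    """Return all (possibly overlapping) windows with C in the middle."""
    # The lookahead (?=...) matches without consuming characters, so the
    # scan can find a new window starting at every position.
    pattern = re.compile(r"(?=(\w{%d}C\w{%d}))" % (flank, flank))
    return [match.group(1) for match in pattern.finditer(sequence)]


print(find_cysteine_windows("LATAPVNQIQETISDNCVVIFSKTSCSYCTMAKK"))
```

As in the original pattern, \w does not match spaces or newlines, so windows never span a line break or a gap in the input.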
