In Python, I have numerous lines of data containing both strings and floats (sample below), which I have tokenized. How can I get the first float value in each line if its position is not constant? I eventually want to use it as a reference point for later tokens. Thanks in advance.
F + FR > FR* + F + E 11.60 0 2 FR > FR*
F + FR > FR*** + F 11.60 0 2382 FR > FR***
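You can use the re module. Import it and search for digits with a . between them. For example,
import re
def findFloat(s):
    # the dot is escaped so that it matches a literal "." and not any character
    return float(re.search(r'[0-9]+\.[0-9]+', s).group())
This finds the first occurrence of a group of digits separated by a . (and raises AttributeError if the line contains no such number, because re.search then returns None).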
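Since the goal is to use the first float as a reference point for later tokens, its position in the token list may be more useful than its value. A minimal sketch, assuming the lines are tokenized by splitting on whitespace and that a float always contains a decimal point; `first_float_index` is a hypothetical helper name, not something from the question:

```python
import re

# A token counts as a float only if it is digits, a literal dot, then digits.
FLOAT_TOKEN = re.compile(r'[0-9]+\.[0-9]+')

def first_float_index(tokens):
    """Return the index of the first float-like token, or None if there is none."""
    for i, tok in enumerate(tokens):
        if FLOAT_TOKEN.fullmatch(tok):
            return i
    return None

line = "F + FR > FR* + F + E 11.60 0 2 FR > FR*"
tokens = line.split()
idx = first_float_index(tokens)
value = float(tokens[idx])   # the first float on the line
after = tokens[idx + 1:]     # every token that follows it
```

Because the function returns an index rather than a value, later tokens can be addressed relative to it (`tokens[idx + 1]`, `tokens[idx + 2]`, and so on).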
I have a file with the following strings:
input complex_data_BITWIDTH;
output complex_data_(2*BITWIDTH+1);
Let's say BITWIDTH = 8.
I want the following output
input complex_data_8;
output complex_data_17;
How can I achieve this in Python with find and replace plus some mathematical operation?
I would recommend looking into the re regular-expression library for string search and replacement, and the eval() function for performing mathematical operations on strings. Note that eval() executes arbitrary Python code, so only use it on input you trust.
Example (assuming that there are always parentheses around what you want to evaluate) :
import re
BITWIDTH_VAL = 8
string_initial = "something_(BITWIDTH+3)"
string_with_replacement = re.sub("BITWIDTH", str(BITWIDTH_VAL), string_initial)
# note: string_with_replacement is "something_(8+3)"
expression = re.search(r"(\(.*\))", string_with_replacement).group(1)
# note: expression is "(8+3)"
string_evaluated = string_with_replacement.replace(expression, str(eval(expression)))
# note: string_evaluated is "something_11"
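If the file contents are not fully trusted, the same idea can be sketched without eval() by parsing the expression with the ast module and allowing only integer arithmetic. This is an illustrative alternative, not the method above; `substitute` and `_calc` are hypothetical names, and the sketch assumes the parenthesized groups are not nested:

```python
import ast
import operator
import re

# Only these arithmetic operators are allowed inside the parentheses.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.FloorDiv: operator.floordiv}

def _calc(node):
    """Evaluate a parsed expression made of integers and the operators above."""
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_calc(node.left), _calc(node.right))
    raise ValueError("unsupported expression")

def substitute(line, name, value):
    """Replace `name` with `value`, then collapse each (...) group to its result."""
    line = line.replace(name, str(value))
    return re.sub(r'\(([^()]*)\)',
                  lambda m: str(_calc(ast.parse(m.group(1), mode='eval').body)),
                  line)

for text in ["input complex_data_BITWIDTH;", "output complex_data_(2*BITWIDTH+1);"]:
    print(substitute(text, "BITWIDTH", 8))
```

A plain str.replace is used for the name because BITWIDTH is preceded by an underscore in the sample, so a regex word boundary (\b) would not match there.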
If you know the value to change, you can use two variables: one for the value to search for and another for the new value.
BITWIDTH = 8
NEW_BITWIDTH = 2 * BITWIDTH + 1
string_input = 'complex_data_8;'
string_output = string_input.replace(str(BITWIDTH), str(NEW_BITWIDTH))
If you don't know the value, you need to extract it first and then operate on it:
string_input = 'complex_data_8;'
bitwidth = string_input.split('_')[-1].replace(';', '')
new_bitwidth = 2 * int(bitwidth) + 1
string_output = string_input.replace(bitwidth, str(new_bitwidth))
My Directory looks like this:
P1_SAMPLE.csv
P2_SAMPLE.csv
P3_SAMPLE.csv
P11_SAMPLE.csv
P12_SAMPLE.csv
P13_SAMPLE.csv
My code looks like this:
from pathlib import Path
file_path = r'C:\Users\HP\Desktop\My Directory'
for fle in Path(file_path).glob('P*_SAMPLE.csv'):
    number = fle.name[1]
    print(number)
This gives output:
1
2
3
1
1
1
How do I make the code output the actual full digits for each file, like this:
1
2
3
11
12
13
Would prefer to use fle.name[] if possible. Many thanks in advance!
Use a regular expression:
import re
for fle in Path(file_path).glob('P*_SAMPLE.csv'):
    # \. matches a literal dot; (\d+) captures one or more digits
    m = re.search(r'P(\d+)_SAMPLE\.csv', fle.name)
    print(m.group(1))
You can even simplify this to:
m = re.search(r'(\d+)', fle.name)
Since a number only appears in one place within the filename.
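Since the question prefers indexing into fle.name, a slice also works: take everything between the leading P and the first underscore. A minimal sketch, assuming every name has the form P<digits>_SAMPLE.csv; `file_number` is a hypothetical helper, and the names are listed inline here instead of being read from the directory:

```python
def file_number(name):
    """Return the digits between the leading 'P' and the first underscore."""
    return int(name[1:name.index('_')])

names = ['P1_SAMPLE.csv', 'P2_SAMPLE.csv', 'P3_SAMPLE.csv',
         'P11_SAMPLE.csv', 'P12_SAMPLE.csv', 'P13_SAMPLE.csv']
numbers = [file_number(n) for n in names]
print(numbers)
```

Inside the original loop the call would be file_number(fle.name). Converting to int also makes it possible to sort the files numerically, so that P2 comes before P11.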
string = "probability is 0.05"
How can I extract the float value 0.05 into a variable? There are many such strings in the file, and I need to find the average probability, so I used a for loop.
My code:
fname = input("enter file name: ")
fh = open(fname)
count = 0
val = 0
for lx in fh:
    if lx.startswith("probability"):
        count = count + 1
        val = val + # here I need to get only the float value that is in the string
print(val)
import re
string='probability is 1'
string2='probability is 1.03'
def FindProb(string):
    pattern = re.compile('[0-9]')
    result = pattern.search(string)
    result = result.span()[0]
    prob = string[result:]
    return prob
print(FindProb(string2))
Ok, so.
This uses the regular expression (aka regex, aka re) library.
It sets up a pattern and then searches for it in a string.
The function takes a string, finds the first digit in it, and returns prob, the substring from that digit to the end of the string. The result is still a string, so wrap it in float() before doing arithmetic with it.
If you need to find the probability multiple times then this might do it:
import re
string='probability is 1'
string2='probability is 1.03 blah blah bllah probablity is 0.2 ugggggggggggggggg probablity is 1.0'
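def FindProb(string):
    # one candidate per period; assumes every number is written with a decimal point
    amount = string.count('.')
    pattern = re.compile('[0-9]+[.][0-9]+')
    prob = 0
    for i in range(amount):
        result = pattern.search(string)
        if result is None:  # a period that is not part of a number
            break
        start, end = result.span()
        prob += float(string[start:end])
        string = string[end:]  # continue searching after this match
    return prob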
print(FindProb(string2))
The caveat is that every number has to contain a period, so 1 would have to be written as 1.0, but that shouldn't be too much of a problem. If it is, let me know and I will try to find a way.
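One way around that caveat, and a way to get the average the question asks for, is to make the fractional part optional in the pattern. A minimal sketch, assuming each relevant line starts with "probability" and that the first number on the line is the value wanted; `average_probability` is a hypothetical name:

```python
import re

# Digits, optionally followed by a dot and more digits: matches 1, 1.03, 0.2.
NUMBER = re.compile(r'[0-9]+(?:\.[0-9]+)?')

def average_probability(lines):
    """Average the first number on each line that starts with 'probability'."""
    values = []
    for line in lines:
        if line.startswith('probability'):
            match = NUMBER.search(line)
            if match:
                values.append(float(match.group()))
    # None signals that no probability line was found.
    return sum(values) / len(values) if values else None

sample = ['probability is 0.05\n', 'something else 9\n', 'probability is 1\n']
avg = average_probability(sample)
print(avg)
```

An open file handle can be passed in place of the list, because iterating over a file yields its lines.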
I have a file with lines of DNA in a file called 'DNASeq.txt'. I need a code to read each line and split each line at random places (inserting spaces) throughout the line. Each line needs to be split at different places.
EX: I have:
AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ
And I need something like this:
AAA ADSF DFAFDDSAF ADF ADSF AFD AFAD
I have tried (!!!very new to python!!):
import random
for x in range(10):
    print(random.randint(50,250))
but that only prints random numbers. Is there some way to store a generated random number in a variable?
You can read a file line by line, write each line character by character to a new file, and insert spaces randomly:
Create demo file without spaces:
with open("t.txt","w") as f:
    f.write("""ASDFSFDGHJEQWRJIJG
ASDFJSDGFIJ
SADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFG
SDFJGIKDSFGOROHPTLPASDMKFGDOKRAMGO""")
Read and rewrite demo file:
import random
max_no_space = 9 # max sequence length without a space
no_space = 0
with open("t.txt","r") as f, open("n.txt","w") as w:
    for line in f:
        for c in line:
            w.write(c)
            if random.randint(1,6) == 1 or no_space >= max_no_space:
                w.write(" ")
                no_space = 0
            else:
                no_space += 1
with open("n.txt") as k:
    print(k.read())
Output:
ASDF SFD GHJEQWRJIJG
A SDFJ SDG FIJ
SADFJSD FJ JDSFJIDFJG I JSRGJSDJ FIDJFG
The pattern of spaces is random. You can influence it by setting max_no_space, or remove the randomness to split after max_no_space characters every time.
Edit:
Writing 1 character at a time is not very economical if you need to read 200+ characters en bloc. You can get longer unbroken runs with the same code like so:
with open("t.txt","w") as f:
    f.write("""ASDFSFDGHJEQWRJIJSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGG
ASDFJSDGFIJSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGSADFJSDFJJDSFJIDFJGIJK
SADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFGSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJF
SDFJGIKDSFGOROHPTLPASDMKFGDOKRAMGSADFJSDFJJDSFJIDFJGIJSRGJSDJFIDJFG""")
import random
min_no_space = 10 # min sequence length before a space may be inserted
max_no_space = 20 # max sequence length without a space
no_space = 0
with open("t.txt","r") as f, open("n.txt","w") as w:
    for line in f:
        for c in line:
            w.write(c)
            # only consider a space once the current run exceeds min_no_space
            if no_space > min_no_space and (random.randint(1,6) == 1 or no_space >= max_no_space):
                w.write(" ")
                no_space = 0
            else:
                no_space += 1
with open("n.txt") as k:
    print(k.read())
Output:
ASDFSFDGHJEQ WRJIJSADFJSDF JJDSFJIDFJGIJ SRGJSDJFIDJFGG
ASDFJSDGFIJSA DFJSDFJJDSFJIDF JGIJSRGJSDJFIDJ FGSADFJSDFJJ DSFJIDFJGIJK
SADFJ SDFJJDSFJIDFJG IJSRGJSDJFIDJ FGSADFJSDFJJDS FJIDFJGIJSRG JSDJFIDJF
SDFJG IKDSFGOROHPTLPASDMKFGD OKRAMGSADFJSDF JJDSFJIDFJGI JSRGJSDJFIDJFG
If you want to split your DNA a fixed number of times (10 in my example), here's what you could try:
import random
DNA = 'AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ'
splitted_DNA = ''
for split_idx in sorted(random.sample(range(len(DNA)), 10)):
    # len(splitted_DNA) minus the spaces added so far = number of DNA characters already copied
    splitted_DNA += DNA[len(splitted_DNA)-splitted_DNA.count(' '):split_idx] + ' '
splitted_DNA += DNA[split_idx:]
print(splitted_DNA) # e.g. AAACCCHT HTH D AF HD SA F JANFAJDSNFA DK FAFJ
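The same fixed-number-of-splits idea can be written as a reusable function that slices between sorted cut positions. A minimal sketch, assuming the number of splits is smaller than the sequence length; `split_at_random` and its `rng` parameter are hypothetical names introduced here:

```python
import random

def split_at_random(seq, n_splits, rng=random):
    """Insert n_splits spaces at distinct random interior positions of seq."""
    # Cut positions lie strictly inside the string, so no piece is empty.
    cuts = sorted(rng.sample(range(1, len(seq)), n_splits))
    pieces = [seq[a:b] for a, b in zip([0] + cuts, cuts + [len(seq)])]
    return ' '.join(pieces)

dna = 'AAACCCHTHTHDAFHDSAFJANFAJDSNFADKFAFJ'
result = split_at_random(dna, 10)
print(result)
```

Passing `rng=random.Random(0)` (or any other seeded instance) makes the split reproducible, which is handy when testing. To process DNASeq.txt, call the function once per stripped line, so that each line gets its own cut positions.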
import random
with open('source', 'r') as in_file:
    with open('dest', 'w') as out_file:
        for line in in_file:
            newLine = ''.join(map(lambda x:x+' '*random.randint(0,1), line)).strip() + '\n'
            out_file.write(newLine)
Since you mentioned being new, I'll try to explain:
I'm writing the new sequences to another file as a precaution. It's not safe to write to the file you are reading from.
The with statement is used so that you don't need to explicitly close the file you opened.
Files can be read line by line using a for loop.
''.join() joins a sequence of strings into a single string.
map() applies a function to every element of an iterable (here, every character of the line) and yields the results.
lambda is how you define a function without naming it. lambda x: 2*x doubles the number you feed it.
x + ' ' * 3 adds 3 spaces after x. random.randint(0, 1) returns either 0 or 1, so I'm randomly choosing whether to add a space after each character. If random.randint() returns 0, 0 spaces are added.
You can toss a coin after each character to decide whether to add a space there or not.
This function takes a string as input and returns it with spaces inserted at random places.
from random import randint
def insert_random_spaces(text):
    # randint(0,1)*" " is either "" or " ", so each character is followed by a space about half the time
    output_string = "".join([x+randint(0,1)*" " for x in text])
    return output_string
I have a very complex parsing problem. Any thoughts would be appreciated here. I have a test.dat file. The file to be parsed looks like this:
* Number = 40
Time = 0
1 10.13 10 10.11 12 13
.
.
Time = n
1 10 10 10 12.50 13
.
.
There are N time blocks and each block has 40 lines like those shown above. What I would like to do is add e.g. the 1st line of the first block, then the 1st line of block #2, and so on, to a new file, test_1.dat. Similarly, the 2nd line of every block goes to test_2.dat, and so on. The lines in the block should be written as is to the new _n.dat file. Is there any way to do this? The number I have assumed here is 40, so if * Number = 40 there will be 40 lines under each time block.
regards,
Ris
You can read the file in as a list of strings (call it fileList), where each string is a different line:
f = open('filename')
fileList = f.readlines()
Then, remove the "header" part of your file with
fileList.pop(0)
fileList.pop(0)
Then, do
outFileContents = {} # This will be a dict, where number -> content of test_number.dat
numBlocks = (len(fileList) + 1) // 42 # number of time blocks, assuming 42 lines each (Time = line, 40 rows, blank row)
for outFileName in range(1,41): # outFileName will be the number going after the _ in your filename
    outFileContents[outFileName] = []
    for n in range(numBlocks): # Counting through the time blocks
        currentRowIndex = (42 * n) + outFileName # 42 to account for the Time = and blank row
        outFileContents[outFileName].append(fileList[currentRowIndex])
Finally you can loop through outFileContents and write the contents of each value to separate files.
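That last step could be sketched as follows, assuming the dict maps each row number to a list of lines that still end in a newline (as readlines() returns them). `write_rows`, `out_dir` and `stem` are hypothetical names chosen for illustration:

```python
import os

def write_rows(out_file_contents, out_dir='.', stem='test'):
    """Write each list of lines to <stem>_<number>.dat inside out_dir."""
    paths = []
    for number, rows in sorted(out_file_contents.items()):
        path = os.path.join(out_dir, '%s_%d.dat' % (stem, number))
        with open(path, 'w') as out:
            out.writelines(rows)  # rows still carry their newline characters
        paths.append(path)
    return paths
```

Calling write_rows(outFileContents) would then create test_1.dat through test_40.dat in the current directory, each holding the same row from every time block in order.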