I have a file like this:
301 my name is joe
303 whatsup
306 how are you doing today
308 what happened?
308 going home
309 let's go
I want to convert the labels 301, 303, 306, 308, 308, 309 to 1, 2, 3, 4, 4, 5
How can I rename these labels in order so that identical ones get the same number?
Use a dictionary to store the mapping from original label to new label; for labels that have not been seen yet, use the current length of the dictionary plus one as the value, via setdefault().
>>> labels = 301, 303, 306, 308, 308, 309
>>> names = {}
>>> for l in labels:
...     names.setdefault(l, len(names)+1)
...
>>> names
{301: 1, 303: 2, 306: 3, 308: 4, 309: 5}
More complete example:
text = """301 my name is joe
303 whatsup
306 how are you doing today
308 what happened?
308 going home
309 let's go""".splitlines()
import re
names = {}
replacer = lambda x: str(names.setdefault(x.group(), len(names) + 1))
for line in text:
    replaced = re.sub(r'^\d+', replacer, line)
    print(replaced)
Output:
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
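If the lines are read from a file rather than a string, the same replacer can be applied while copying to a new file; a minimal sketch, where "input.txt" and "output.txt" are placeholder names:
import re

names = {}
replacer = lambda m: str(names.setdefault(m.group(), len(names) + 1))

# "input.txt" and "output.txt" are placeholders for your actual file names
with open("input.txt") as src, open("output.txt", "w") as dst:
    for line in src:
        dst.write(re.sub(r'^\d+', replacer, line))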
You could use an index which only increments when the label is different from the last one:
data = ["301 my name is joe", "303 whatsup", "306 how are you doing today", "308 what happened?", "308 going home", "309 let's go"]
idx = 0
last_index = ""
for i in range(len(data)):
    if last_index != data[i].split(" ")[0]:
        idx += 1
    print(str(idx) + " " + ' '.join(data[i].split(" ")[1:]))
    last_index = data[i].split(" ")[0]
Result:
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
Use a dict to collect the prefixes and a counter.
data = """301 my name is joe
303 whatsup
306 how are you doing today
308 what happened?
308 going home
309 let's go"""
prefixes = {}
i = 1
for line in data.split("\n"):
    prefix, rest = line.split(" ", 1)
    pr = int(prefix)
    if pr not in prefixes:
        prefixes[pr] = i
        i = i + 1
    newPrefix = prefixes[pr]
    print("{} {}".format(newPrefix, rest))
Output:
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
def update_text(data):
    labels = sorted(set(line.split()[0] for line in data.splitlines()))
    for line in data.splitlines():
        yield str(labels.index(line.split()[0]) + 1) + ' ' + ' '.join(line.split()[1:])
data = '''301 my name is joe
303 whatsup
306 how are you doing today
308 what happened?
308 going home
309 let's go'''
print('\n'.join(update_text(data)))
Output:
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
Another simple solution:
>>> keys = sorted(set([line.split()[0] for line in data.splitlines()]))
>>> for k, v in enumerate(keys):
...     data = data.replace(v, str(k + 1))
...
>>> print(data)
1 my name is joe
2 whatsup
3 how are you doing today
4 what happened?
4 going home
5 let's go
I'm really a beginner at Python, and I'm taking a course at my university. If you have tips and advice for this question, I'd much appreciate it.
I'm having trouble writing the code to count the frequency of the first digit of every number in a CSV file.
No imports are allowed.
For example, given the values below from the CSV,
I have to figure out how many times each of 1, 2, 3, 4, 5, 6, 7, 8, 9, 0 appears as the first digit of a number.
E.g. from 5.385686, 3665, 6942, 4053, 7726, 4601, 7302 there is one 3 as a first digit, two 4s as a first digit, and so on.
I deleted everything other than digits and '.' from the file (by filtering characters with the ASCII table).
I tried to put all the data into a list first and it returned '5.385686', but I have no idea what to do next.
expected output:
[[26, 22, 28, 22, 16, 20, 31, 22, 13, 0]]
I'm showing only some part from CSV.
5.385686 3665 6942 4053 7726 4601 7302
11754.41657 7859 7002 1502 8754 449 472
800.1759341 2161 4958 3738 5105 1472 2487
1055.19226 7473 3713 4302 3174 6415 9094
1747.798453 2685 5343 3207 2137 1934 1101
2551.157404 3200 4655 2673 4270 821 330
480.7713868 1172 847 3683 9486 2258 6323
19018.97818 3678 5628 1171 7270 8333 2534
505.5652756 7222 4105 6529 169 307 3142
3759.276869 9649 1445 5944 8892 371 8307
4753 6737 906 5057 4401 8698 533
2790 5239 6392 8637 8785 1331 6848
3328 639 3519 7829 6796 3935 2893
6331 2986 6076 1085 7715 8241 5688
This is what I got so far:
def filename():
    file = open("sample_accounts.csv", "r")
    filecsv = file.read()
    filecsv = filecsv.lower()
    a = []
    b = []
    chlist = list(range(128))
    del chlist[48:58]
    del chlist[46]
    for c in chlist:
        filecsv = filecsv.replace(chr(c), " ")
    a.append(chlist)
    ftlist = filecsv.split()
    greet = ftlist
    a.append(ftlist)
    for i in greet:
        return greet[0]
    # for i in greet:
    #     return greet[i]
    #
    # dic = {}
    #
    # for word in ftlist:
    #     dic[word] = dic.get(word, 0) + 1
    #
    # # for item in dic:
    # #     print(item, dic[item])
    # return greet

d = filename()
You can do that by storing the count of each digit in a dictionary:
count = {}
with open('path to your file') as file:
    for line in file.readlines():
        for number in line.split(' '):
            number = number.strip()
            if len(number) < 1:
                continue
            digit = number[0]
            if digit.isdigit():
                digit = int(digit)
                if digit in count:
                    count[digit] = count[digit] + 1
                else:
                    count[digit] = 1
print(count.values())
Output:
[14, 11, 16, 12, 10, 11, 9, 11, 4]
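If the result needs to match the question's expected format (a single list ordered by first digit 1 through 9 and then 0), one possible follow-up using a count dict like the one built above (shown here with counts from just the first sample row):
count = {5: 1, 3: 1, 6: 1, 4: 2, 7: 2}  # first-digit counts of the first sample row
# Arrange the per-digit counts in the order 1..9 then 0; missing digits default to 0
ordered = [count.get(d, 0) for d in list(range(1, 10)) + [0]]
print([ordered])  # [[0, 0, 1, 2, 1, 1, 2, 0, 0, 0]]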
Based exclusively on the csv snippet in the question, you can do something like this:
csv_dat = """[your csv snippet]"""
csv_lst = csv_dat.split(' ')  # need to create a list from your snippet; you may already have it in your code
fd_lst = []  # initialize a list for the first digit of each entry
for item in csv_lst:
    fd_lst.append(item.strip()[0])  # select the first character of each entry
print('digit frequency')
for x in set(fd_lst):  # count only unique characters
    print(x, '\t', fd_lst.count(x))
Output:
digit frequency
8 10
6 10
9 4
7 9
3 14
1 10
5 9
2 9
4 10
The text file looks like this:
421 2 1 8 34 27
421 0 0 8 37 27
435 0 1 9 8 44
435 4 0 9 10 50
for row in file_content[0:]:
    id, place, inout, hour, min, sec = row.split(" ")
    print(id)
In this code I wanted to separate the rows: the first column contains the ids of persons, the second the ids of places, the third whether the person goes in or out (0/1), and the last three are the time (hour:min:sec).
Could someone help me correct this code so I can keep practicing for my exam? (I'm a beginner.)
with open("Text.txt", "r") as f:
    id, place, inout, hour, min, sec = zip(*map(str.split, f))
    print(id)
    # [OUT] ('421', '421', '435', '435')
zip(*rows) transposes the split rows into columns, so each name on the left receives a tuple of that column's values.
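For illustration (not part of the original answer), a minimal sketch of what that transposition does, with made-up rows:
# zip(*rows) turns a list of rows into one tuple per column
rows = [["421", "2", "1"], ["421", "0", "0"]]
columns = list(zip(*rows))
print(columns)  # [('421', '421'), ('2', '0'), ('1', '0')]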
>>> filecontent = open("test.txt", 'r')
>>> for row in filecontent:
...     id, place, inout, hour, min, sec = row.split(" ")
...     print("id is", id)
...
id is 421
id is 421
id is 435
id is 435
I have a question that includes various steps.
I am parsing a file that looks like this:
9
123
0 987
3 890 234 111
1 0 1 90 1 34 1 09 1 67
1 684321
2 352 69
1 1 1 243 1 198 1 678 1 11
2 098765
1 143
1 2 1 23 1 63 1 978 1 379
3 784658
1 43
1 3 1 546 1 789 1 12 1 098
I want to make these lines in the file the keys of a dictionary (ignoring the first number and taking just the second one, because the first only indicates which key it is):
0 987
1 684321
2 098765
3 784658
And these lines the values of those keys (again ignoring the first number, because it only indicates how many elements there are):
3 890 234 111
2 352 69
1 143
1 43
So at the end it has to look like this:
d = {987 : [890, 234, 111], 684321 : [352, 69],
098765 : [143], 784658 : [43]}
So far I have this:
findkeys = re.findall(r"\d\t(\d+)\n", line)
findelements = re.findall(r"\d\t(\d+)", line)
listss.append("".join(findelements))
d = {findkeys: listss}
The regular expressions need more work: the one for the keys also matches other lines that I don't want as keys but that happen to have just one number after the first, too. In the example file, the number 43 shows up as a key.
And the regular expression for the elements gives me back all the lines.
I don't know whether it would be easier to make the code ignore the lines I don't need information from, but I don't know how to do that either.
I want to keep it as simple as possible.
Thanks!
with open('filename.txt') as f:
    lines = f.readlines()

lines = [x.strip() for x in lines]
lines = lines[2:]
keys = lines[::3]
values = lines[1::3]
output lines:
['0 987',
'3 890 234 111',
'1 0 1 90 1 34 1 09 1 67',
'1 684321',
'2 352 69',
'1 1 1 243 1 198 1 678 1 11',
'2 098765',
'1 143',
'1 2 1 23 1 63 1 978 1 379',
'3 784658',
'1 43',
'1 3 1 546 1 789 1 12 1 098']
output keys:
['0 987', '1 684321', '2 098765', '3 784658']
output values:
['3 890 234 111', '2 352 69', '1 143', '1 43']
Now you just have to put it together: iterate over keys and values in parallel, for example as sketched below.
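One way to do that, reusing the keys and values lists printed above (the keys are kept as strings so the leading zero in 098765 survives):
keys = ['0 987', '1 684321', '2 098765', '3 784658']
values = ['3 890 234 111', '2 352 69', '1 143', '1 43']

# Drop the leading index on each key line and the leading count on each value line
d = {k.split()[1]: [int(x) for x in v.split()[1:]] for k, v in zip(keys, values)}
print(d)  # {'987': [890, 234, 111], '684321': [352, 69], '098765': [143], '784658': [43]}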
Once you have the lines in a list (the lines variable), you can simply use re to isolate the numbers and a dictionary/list comprehension to build the desired data structure.
Based on your example data, every 3rd line is a key line with its values on the following line. This means you only need to stride by 3 through the list.
findall() will give you the list of numbers (as text) on each line, and you can ignore the first one with a simple subscript.
import re
value = re.compile(r"(\d+)")
numbers = [ [int(v) for v in value.findall(line)] for line in lines]
intDict = { key[1]:values[1:] for key,values in zip(numbers[2::3],numbers[3::3]) }
You could also do it using split(); the empty-string check below is just a safeguard, since split() with no separator already drops the empty fields that repeated spaces would otherwise create:
numbers = [ [int(v) for v in line.split() if v != ""] for line in lines]
intDict = { key[1]:values[1:] for key,values in zip(numbers[2::3],numbers[3::3]) }
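Either way, printing intDict should give something like the following, assuming lines holds all of the file's stripped lines including the two header lines; note that 098765 ends up as the int 98765, so use string keys if the leading zero matters:
print(intDict)
# {987: [890, 234, 111], 684321: [352, 69], 98765: [143], 784658: [43]}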
You could build yourself a parser with e.g. parsimonious:
from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar
data = """
9
123
0 987
3 890 234 111
1 0 1 90 1 34 1 09 1 67
1 684321
2 352 69
1 1 1 243 1 198 1 678 1 11
2 098765
1 143
1 2 1 23 1 63 1 978 1 379
3 784658
1 43
1 3 1 546 1 789 1 12 1 098
"""
grammar = Grammar(
r"""
data = (important / garbage)+
important = keyline newline valueline
garbage = ~".*" newline?
keyline = ws number ws number
valueline = (ws number)+
newline = ~"[\n\r]"
number = ~"\d+"
ws = ~"[ \t]+"
"""
)
tree = grammar.parse(data)
class DataVisitor(NodeVisitor):
    output = {}
    current = None

    def generic_visit(self, node, visited_children):
        return node.text or visited_children

    def visit_keyline(self, node, children):
        key = node.text.split()[-1]
        self.current = key

    def visit_valueline(self, node, children):
        values = node.text.split()
        self.output[self.current] = [int(x) for x in values[1:]]

dv = DataVisitor()
dv.visit(tree)
print(dv.output)
This yields
{'987': [890, 234, 111], '684321': [352, 69], '098765': [143], '784658': [43]}
The idea here is that every "keyline" is composed of just two numbers, the second of which becomes the key. The line that follows it is the valueline.
Input file (test.sam):
SN398:470:C8RD3ACXX:7:1111:19077:53994 16 chrI 65374 255 51M * 0 0 TGAGAAATTCTTGAACATTCGTCTGTATTGATAAATAAAACTAGTATACAG IJJJJJJJJJJJJJIJJJIJJJJJJHJJJJJJJJJJJJHHHHHFFFFDB#B AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1
genes.bed file is the reference:
chrI 130798 131983 YAL012W 0 + 130798 131983 0 1 1185, 0,
chrI 334 649 YAL069W 0 + 334 649 0 1 315, 0,
chrI 537 792 YAL068W-A 0 + 537 792 0 1 255, 0,
chrI 1806 2169 YAL068C 0 - 1806 2169 0 1 363, 0,
chrI 2479 2707 YAL067W-A 0 + 2479 2707 0 1 228, 0,
chrI 7234 9016 YAL067C 0 - 7234 9016 0 1 1782, 0,
chrI 10090 10399 YAL066W 0 + 10090 10399 0 1 309, 0,
chrI 11564 11951 YAL065C 0 - 11564 11951 0 1 387, 0,
chrI 12045 12426 YAL064W-B 0 + 12045 12426 0 1 381, 0,
The script is the following: it checks whether the "chr" field matches between the two files, and if the fourth column of test.sam (called genomic_location) falls between the second and third columns of the genes.bed file, it takes the fourth column of genes.bed and counts it as "1".
#!/usr/bin/env python
import sys

samfile = open('test.sam')    # sorted sam file
bedfile = open('genes.bed')   # reference genome
sys.stdout = open('merged.txt', 'w')

lookup = {}
for line in bedfile:
    fields = line.strip().split()
    chrm = fields[0]
    st = int(fields[1])
    end = int(fields[2])
    name = fields[3]
    if chrm not in lookup:
        lookup[chrm] = {}
    for i in range(st, end):
        if i not in lookup[chrm]:
            lookup[chrm][i] = [name]
        else:
            lookup[chrm][i].append(name)

gene_counts = {}
for line in samfile:
    reads = line.split()
    qname = reads[0]
    flag = reads[1]  # can be 0 or 16
    rname = reads[2]
    genomic_location = int(reads[3])
    mapq = int(reads[4])
    if rname in lookup:
        if genomic_location in lookup[rname]:
            for gene in lookup[rname][genomic_location]:
                if gene not in gene_counts:
                    gene_counts[gene] = 0
                else:
                    gene_counts[gene] += 1

print(gene_counts)
I need to change it so that when the flag (second column of test.sam) is 16, the script subtracts 51 from the fourth column of test.sam and then checks whether that adjusted integer falls within st and end from the genes.bed file.
What do you think is the best way to do this? I need to implement this within the script rather than creating a new input file (test.sam) whose fourth column is already changed whenever the second is 16.
I would like to do this in Python. Thank you for your help, and please let me know if something is unclear.
Maybe there's some hidden complexity that I'm missing here, but the most obvious python implementation of "when flag (second column in input file test.sam) is 16, then subtract 51 from the fourth column in inputfile" is:
if flag == "16":  # reads[1] is still a string after split(), so compare against "16"
    genomic_location = int(reads[3]) - 51
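For completeness, a small self-contained sketch of that adjustment; the helper name adjusted_location is made up, and the sample values come from the test.sam line in the question:
def adjusted_location(reads, shift=51):
    """Return the mapping position, shifted left by `shift` for flag-16 reads.

    `reads` is one split SAM line as in the question's script: the flag is
    column 2 and the position is column 4 (both still strings after split()).
    """
    location = int(reads[3])
    if int(reads[1]) == 16:
        location -= shift
    return location

# The read in test.sam has flag 16 and position 65374
print(adjusted_location(["SN398:470:C8RD3ACXX:7:1111:19077:53994", "16", "chrI", "65374"]))  # 65323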
I have read other similar posts, but they don't seem to work in my case, so I'm posting a new question here.
I have a text file with varying row and column sizes. I am interested in the rows of values that have a specific parameter. E.g. in the sample text file below, I want the last two values of each line that has the number '1' in the second position. That is, I want the values '1, 101', '101, 2', '2, 102' and '102, 3' from the lines starting with 101 to 104, because they have the number '1' in the second position.
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
425
.
.
$EndNodes
$Elements
630
.
97 15 2 0 193 97
98 15 2 0 195 98
99 15 2 0 197 99
100 15 2 0 199 100
101 1 2 0 201 1 101
102 1 2 0 201 101 2
103 1 2 0 202 2 102
104 1 2 0 202 102 3
301 2 2 0 303 178 78 250
302 2 2 0 303 250 79 178
303 2 2 0 303 198 98 249
304 2 2 0 303 249 99 198
.
.
.
$EndElements
The problem is that, with the code I have come up with below, it starts from '101' but then keeps reading values from the other lines up to '304' or more. What am I doing wrong, or does someone have a better way to tackle this?
# Here, (additional_lines + anz_knoten_gmsh - 2) are additional lines that need to be skipped
# at the beginning of the .txt file. Initially I find out where the range
# of the lines lies which I need.
# The two_noded_elem_start is the first line having the '1' at the second position
# and four_noded_elem_start is the first line number having '2' in the second position.
# So, basically I'm reading between these two parameters.
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"))
output_file = open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w")

for i, line in enumerate(input_file):
    if i == (additional_lines + anz_knoten_gmsh + two_noded_elem_start - 2):
        break

for i, line in enumerate(input_file):
    if i == additional_lines + anz_knoten_gmsh + four_noded_elem_start - 2:
        break
    elem_list = line.strip().split()
    del elem_list[:5]
    writer = csv.writer(output_file)
    writer.writerow(elem_list)

input_file.close()
output_file.close()
*EDIT: The piece of code used to find the parameters like two_noded_elem_start is as follows:
# anz_elemente_ueberg_gmsh is another parameter that is found out
# from a previous piece of code and '$EndElements' is what
# is at the end of the text file "mesh_outer_region.msh".
input_file = open(os.path.join(gmsh_path, "mesh_outer_region.msh"), "r")

for i, line in enumerate(input_file):
    if line.strip() == anz_elemente_ueberg_gmsh:
        break

for i, line in enumerate(input_file):
    if line.strip() == '$EndElements':
        break
    element_list = line.strip().split()
    if element_list[1] == '1':
        two_noded_elem_start = element_list[0]
        two_noded_elem_start = int(two_noded_elem_start)
        break

input_file.close()
>>> with open('filename') as fh:                 # Open the file
...     for line in fh:                          # For each line in the file
...         values = line.split()                # Split the values into a list
...         if values[1] == '1':                 # Compare the second value
...             print(values[-2], values[-1])    # Print the 2nd-from-last and last values
1 101
101 2
2 102
102 3
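To write those values to a file instead of printing them, the same filter can be combined with the csv writer from the question. A sketch (Python 3), with a guard for the short header lines; gmsh_path is a placeholder here for the path used in the question:
import csv
import os

gmsh_path = "."  # placeholder; use the gmsh_path from the question's script

with open(os.path.join(gmsh_path, "mesh_outer_region.msh")) as input_file, \
        open(os.path.join(gmsh_path, "mesh_skip_nodes.txt"), "w", newline="") as output_file:
    writer = csv.writer(output_file)
    for line in input_file:
        values = line.split()
        # Skip header and short lines, then keep only rows with '1' in the second position
        if len(values) > 1 and values[1] == '1':
            writer.writerow(values[-2:])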