Correcting a wrong split in txt, python - python

I have several txt files which consist of different values, e.g.:
TFF,BAP,VAP,DNAAF5,CDKN2B,PDE2D,SLC22A19,RBPJ,STAT1,TAP2,HLA-
I have probably done a wrong split somewhere in the middle of the code, and it split on '-', so when I double-click one value, the whole line up to the '-' is selected. This mistake did not affect the functions up to this step. Now I need to count the occurrences of each value with "Counter", and the count is wrong.
My code:
import glob
import os
from collections import Counter
gene_calc = r'C:\Users\MrD\Top'
new_dir = r'C:\\Users\\MrD\\Br_Count\\Frequency\\'
for files in gene_calc:
    if not os.path.exists(new_dir):
        os.mkdir(new_dir)
    else:
        break
os.chdir(gene_calc)
for files in glob.glob(os.path.join('*.txt*')):
    #print(files) # iterating over files to check if prints
    with open(files) as f:
        content = (line for line in f.read().splitlines())
        list = Counter(Vol for Vol in content).most_common()
    with open(new_dir + files, "w") as output:
        output.write(str(list))
The gene_calc folder contains files with values as shown in the example above.
I couldn't re-split it (I tried "if ',' in gene_list:" and reversing with .reverse(), but it's already a list of tuples).

At the moment you are counting lines:
content = (line for line in f.read().splitlines())
To count items you need a second split on ',':
content = (item for line in f for item in line.strip().split(','))
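To see the difference on a small input, here is a minimal sketch of the item-level count. The helper name count_items and the sample lines are made up for illustration; only Counter and the split on ',' come from the answer above.

```python
from collections import Counter


def count_items(lines):
    """Count comma-separated items across an iterable of text lines."""
    items = (
        item.strip()
        for line in lines
        for item in line.strip().split(',')
        if item.strip()  # skip empty fields, e.g. from a trailing comma
    )
    return Counter(items).most_common()


# Hypothetical sample data in the same shape as the question's files.
sample = ["TFF,BAP,VAP,TFF\n", "BAP,TFF,HLA-\n"]
print(count_items(sample))
# [('TFF', 3), ('BAP', 2), ('VAP', 1), ('HLA-', 1)]
```

Passing an open file object instead of the sample list should work the same way, since iterating over a file yields its lines.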

Related

Split txt file by Year and ID and rename each new txt file as "Year_ID.txt"

I have a bunch of txt files (comma separated) and I want to split each file into separate text files, using common group identifiers from Column 1 (Year) and Column 3 (ID). I would also like to save the new files as "Column1_Column3.txt". I do not want to keep any header in these files.
I have tried many scripts/suggestions from other questions, but nothing seems to work.
I am new to Python and any suggestions would be very helpful. Thank you very much.
file format:
1.0,9.0,0.0,0.0,5.0,13.2,143.2,993.8529934630001,18.005554199200002,92.5999984741,0.0,0.0,159.882055791
1.0,9.0,0.0,1.0,5.0,13.3,142.8,992.4,19.0,91.5013544438,0.0,0.0,202.645072402
1.0,9.0,0.0,2.0,5.0,13.4,142.5,989.0,21.2,90.4027104135,0.0,0.0,235.39787781
1.0,9.0,0.0,3.0,5.0,13.5,142.2,986.5,22.7,89.3040663832,0.0,0.0,268.74681081200004
1.0,11.0,1.0,1.0,5.0,11.5,175.6,995.6,18.7,18.5200004578,0.0,0.0,680.61138846
1.0,11.0,1.0,5.0,5.0,12.2,174.1,988.9,23.4,18.5200004578,0.0,0.0,645.040646961
1.0,11.0,1.0,6.0,5.0,12.4,173.9,986.5,24.9,18.5200004578,0.0,0.0,654.7981628169999
1.0,9.0,2.0,4.0,5.0,10.7,146.8,986.0,23.2,68.3182237413,0.0,0.0,364.724300756
1.0,9.0,2.0,5.0,5.0,10.8,146.2,982.9,25.0,66.8777792189,0.0,0.0,317.156397048
So my output should be:
File1:
1.0,9.0,0.0,0.0,5.0,13.2,143.2,993.8529934630001,18.005554199200002,92.5999984741,0.0,0.0,159.882055791
1.0,9.0,0.0,1.0,5.0,13.3,142.8,992.4,19.0,91.5013544438,0.0,0.0,202.645072402
1.0,9.0,0.0,2.0,5.0,13.4,142.5,989.0,21.2,90.4027104135,0.0,0.0,235.39787781
File2:
1.0,11.0,1.0,1.0,5.0,11.5,175.6,995.6,18.7,18.5200004578,0.0,0.0,680.61138846
1.0,11.0,1.0,5.0,5.0,12.2,174.1,988.9,23.4,18.5200004578,0.0,0.0,645.040646961
1.0,11.0,1.0,6.0,5.0,12.4,173.9,986.5,24.9,18.5200004578,0.0,0.0,654.7981628169999
File3:
1.0,9.0,2.0,4.0,5.0,10.7,146.8,986.0,23.2,68.3182237413,0.0,0.0,364.724300756
1.0,9.0,2.0,5.0,5.0,10.8,146.2,982.9,25.0,66.8777792189,0.0,0.0,317.156397048
Assumptions:
All entries are uniform
Entries are housed in a 2d list
All entries have at least length 3 (to include both delimiting fields)
Slight concern:
In File1, is the second entry supposed to have '2055791 ' in front of it? That would mean the list entries are not as uniform as you need. If this is the case, I suggest scrubbing the data beforehand, or adding to this code so that it ignores such entries.
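#grab the full list (each entry is one row, already split into its fields)
full_list = []
#grab every distinct value of column 1
col_one_list = sorted(set(a[0] for a in full_list))
#grab every distinct value of column 3
col_three_list = sorted(set(b[2] for b in full_list))
#sort by them
for i in col_one_list:
    for j in col_three_list:
        separate_list = []
        for entry in full_list:
            if (entry[0] == i and entry[2] == j):
                separate_list.append(entry)
        #skip combinations that never occur, so no empty files are created
        if not separate_list:
            continue
        with open(str(i) + "_" + str(j) + ".txt", "w") as file:
            for item in separate_list:
                #join the fields back into one comma-separated line
                file.write(",".join(str(field) for field in item) + "\n")
This should be sufficient.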
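The same grouping can also be done in a single pass over the input file. The following is only a sketch under assumptions: the function name split_by_year_and_id and its two path parameters are hypothetical, the file is read with the standard csv module, and the values are used in file names exactly as they appear in the data (so a group would be saved as, for example, "1.0_0.0.txt").

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_by_year_and_id(src_path, out_dir):
    """Group rows by (column 1, column 3) and write one file per group."""
    groups = defaultdict(list)
    with open(src_path, newline="") as src:
        for row in csv.reader(src):
            if len(row) < 3:
                continue  # too short to contain both grouping fields
            groups[(row[0], row[2])].append(row)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for (year, id_), rows in groups.items():
        target = out_dir / f"{year}_{id_}.txt"
        with open(target, "w", newline="") as dst:
            # No header is written; rows are copied through unchanged.
            csv.writer(dst, lineterminator="\n").writerows(rows)
        written.append(target.name)
    return written
```

Calling split_by_year_and_id("data.txt", "out") (both names being placeholders) would return the list of file names it created. Each row is visited once, rather than once per combination of column values.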

I am struggling with reading specific words and lines from a text file in python

I want my code to be able to find what the user has asked for and print the 5 following lines. For example, if the user entered "james" into the system, I want it to find that name in the text file and read the 5 lines below it. Is this even possible? All I have found while looking through the internet is how to read specific lines.
So, you want to read a .txt file and you want to read, let's say the word James and the 5 lines after it.
Our example text file is as follows:
Hello, this is line one
The word James is on this line
Hopefully, this line will be found later,
and this line,
and so on...
are we at 5 lines yet?
ah, here we are, the 5th line away from the word James
Hopefully, this should not be found
Let's think through what we have to do.
What We Have to Do
Open the text file
Find the line where the word 'James' is
Find the next 5 lines
Save it to a variable
Print it
Solution
Let's just call our text file info.txt. You can call it whatever you want.
To start, we must open the file and save it to a variable:
file = open('info.txt', 'r') # The 'r' allows us to read it
Then we must save its data to another variable; we shall store it as a list:
file_data = file.readlines()
Now, we iterate (loop through) the lines with a for loop; we must save the index of the line that 'James' is on to another variable:
index = 'Not set yet'
for x in range(len(file_data)):
    if 'James' in file_data[x]:
        index = x
        break
if index == 'Not set yet':
    print('The word "James" is not in the text file.')
As you can see, it iterates through the list, and checks for the word 'James'. If it finds it, it breaks the loop. If the index variable still is equal to what it was originally set as, it obviously has not found the word 'James'.
Next, we should find the next five lines and save them to another variable:
five_lines = [file_data[index]]
for x in range(5):
    try:
        five_lines.append(file_data[index + x + 1])
    except IndexError:
        # x lines after the match were stored before the file ran out
        print(f'There are not five full lines after the word James. {x} have been recorded.')
        break
Finally, we shall print all of these:
for i in five_lines:
    print(i, end='')
Done!
Final Code
file = open('info.txt', 'r') # The 'r' allows us to read it
file_data = file.readlines()
file.close() # All the data is in file_data now, so the file can be closed
index = 'Not set yet'
for x in range(len(file_data)):
    if 'James' in file_data[x]:
        index = x
        break
if index == 'Not set yet':
    print('The word "James" is not in the text file.')
else:
    # Only collect and print lines when the word was actually found
    five_lines = [file_data[index]]
    for x in range(5):
        try:
            five_lines.append(file_data[index + x + 1])
        except IndexError:
            # x lines after the match were stored before the file ran out
            print(f'There are not five full lines after the word James. {x} have been recorded.')
            break
    for i in five_lines:
        print(i, end='')
I hope that I have been helpful.
Yeah, sure. Say the keyword you're searching for ("james") is keywrd and Xlines is the number of lines after a match you want to return, plus one:
def readTextandSearch1(txtFile, keywrd, Xlines):
    with open(txtFile, 'r') as f: #Note, txtFile is a full path & filename
        allLines = f.readlines() #Send all the lines into a list
        #with automatically closes txt file at exit
    temp = [] #Dim it here so you've "something" to return in event of no match
    for iLine in range(0, len(allLines)):
        if keywrd in allLines[iLine]:
            #Found keyword in this line, want the next X lines for returning
            maxScan = min(len(allLines) - iLine, Xlines+1) #Use this to avoid trying to address beyond end of text file.
            for iiLine in range(1, maxScan):
                temp.append(allLines[iLine+iiLine])
            break #On assumption of only one entry of keywrd in the file, can break out of "for iLine" loop
    return temp
Then by calling readTextandSearch1() with appropriate parameters, you'll get a list back that you can print at your leisure. I'd take the return as follows:
rtn1 = readTextandSearch1("C:\\Docs\\Test.txt", "Jimmy", 6)
if rtn1: #This checks was Jimmy within Test.txt
    #Jimmy was in Test.txt
    print(rtn1)
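Since the question asks for a name typed by the user (for example "james" in lower case), a case-insensitive variant may be useful. This is a minimal sketch, not either answerer's code: the function name lines_after_match is hypothetical, and it relies on the fact that list slicing never raises when the slice runs past the end of the list, so no bounds check is needed.

```python
def lines_after_match(lines, keyword, count=5):
    """Return the first line containing keyword plus up to `count` lines after it.

    The match ignores case. An empty list is returned when nothing matches.
    """
    needle = keyword.lower()
    for i, line in enumerate(lines):
        if needle in line.lower():
            # Slicing past the end of the list is safe and simply returns fewer lines.
            return lines[i:i + 1 + count]
    return []


# Hypothetical in-memory sample; with a real file, pass f.readlines() instead.
sample = [
    "Hello, this is line one\n",
    "The word James is on this line\n",
    "Hopefully, this line will be found later,\n",
    "and this line,\n",
]
print("".join(lines_after_match(sample, "james")), end="")
```

The keyword could come from input(), which would cover the "user entered a name" part of the question.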

Remove lines from a file that contain strings from a list

I want to remove lines from a .txt file.
I want to keep the strings I want to remove in a list, but my code writes each line as many times
as there are strings in the list. How can I avoid that?
file1 = open("base.txt", encoding="utf-8", errors="ignore")
Lines = file1.readlines()
file1.close()
not_needed = ['asd', '123', 'xyz']
row = 0
result = open("result.txt", "w", encoding="utf-8")
for line in Lines:
    for item in not_needed:
        if item not in line:
            row += 1
            result.write(str(row) + ": " + line)
So if a line contains a string from the list, it should be deleted.
After checking every string, the file should be written without those lines.
How can I do it?
Look at the logic in your for loop... What it's doing is: take each line in Lines, then for every item in not_needed check the line and write it if the condition holds. But the condition holds every time an item is not found, so the line is written once for each item it does not contain.
Try thinking about doing the inverse:
check whether the line contains any string from not_needed.
if it does, do nothing
otherwise write it
Expanded answer:
Here's what I think you are looking for:
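for line in Lines:
    # write the line only if none of the unwanted strings occurs in it
    if not any(item in line for item in not_needed):
        row += 1
        result.write(str(row) + ": " + line)
result.close()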
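The "check first, write once" logic can be tried without touching any files. The sketch below is an illustration only: the helper name filter_lines and the sample lines are invented, while the not_needed list is taken from the question.

```python
def filter_lines(lines, not_needed):
    """Keep lines containing none of the unwanted strings, numbered from 1."""
    kept = [
        line for line in lines
        if not any(item in line for item in not_needed)
    ]
    return [f"{row}: {line}" for row, line in enumerate(kept, start=1)]


not_needed = ['asd', '123', 'xyz']
sample = ["hello\n", "asd here\n", "world 123\n", "keep me\n"]
print(filter_lines(sample, not_needed))
# ['1: hello\n', '2: keep me\n']
```

Each surviving line appears exactly once, because the decision to write is made per line rather than per (line, item) pair.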

Python - How to split a list into two separate lists dynamically

I am using Python-3 and I am reading a text file which can have multiple paragraphs separated by '\n'. I want to split all those paragraphs into a separate list. There can be n number of paragraphs in the input file.
So this split and output list creation should happen dynamically thereby allowing me to view a particular paragraph by just entering the paragraph number as list[2] or list[3], etc....
So far I have tried the below process :
input = open("input.txt", "r") #Reading the input file
lines = input.readlines() #Creating a List with separate sentences
str = '' #Declaring a empty string
for i in range(len(lines)):
    if len(lines[i]) > 2: #If the length of a line is < 2, It means it can be a new paragraph
        str += lines[i]
This method will not store paragraphs into a new list (as I am not sure how to do it). It will just remove the line with '\n' and stores all the input lines into str variable. When I tried to display the contents of str, it is showing the output as words. But I need them as sentences.
And my code should store all the sentences until the first occurrence of '\n' into a separate list, and so on.
Any ideas on this ?
UPDATE
I found a way to print all the lines that are present until '\n'. But when I try to store them into the list, it is getting stored as letters, not as whole sentences. Below is the code snippet for reference
input = open("input.txt", "r")
lines = input.readlines()
input_ = []
for i in range(len(lines)):
    if len(lines[i]) <= 2:
        for j in range(i):
            input_.append(lines[j]) #This line is storing as letters.
Even "input_ += lines" is storing as letters, not as sentences.
Any idea how to modify this code to get the desired output?
Don't forget to call input.close() when you are done with the file, so that the file handle is released.
Alternatively you can use with.
#Using "with" closes the file automatically, so you don't need to write file.close()
with open("input.txt","r") as file:
    file_ = file.read().split("\n")
file_ is now a list with each paragraph as a separate item, provided every paragraph sits on a single line.
It's as simple as 2 lines.
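If the paragraphs span several lines and are separated by blank lines (which is how the question's len(lines[i]) <= 2 check reads), splitting on a single "\n" gives one item per line instead. A minimal sketch under that assumption follows; the function name split_paragraphs is hypothetical.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines; each paragraph becomes a list of its lines."""
    blocks = re.split(r'\n\s*\n', text.strip())
    return [block.splitlines() for block in blocks if block.strip()]


# Hypothetical sample; with a real file, pass the result of file.read().
sample = "First sentence.\nSecond sentence.\n\nNext paragraph.\n"
paragraphs = split_paragraphs(sample)
print(paragraphs[0])  # ['First sentence.', 'Second sentence.']
print(paragraphs[1])  # ['Next paragraph.']
```

A particular paragraph can then be viewed by index, for example paragraphs[1], with every sentence kept as a whole string rather than as letters.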

Comparing multiple text files and grouping similar files in one group

I have 100-200 text files with different names in a folder. I want to compare the text in each file with the others and keep similar files in a group.
Note:
1. The files are not identical. They are similar in that 2-3 lines of a paragraph are the same as in another file.
2. One file may be kept in multiple groups.
Can anyone help me with this, as I am a beginner in Python?
I have tried the code below, but it doesn't work for me.
file1=open("F1.txt","r")
file2=open("F2.txt","r")
file3=open("F3.txt","r")
file4=open("F4.txt","r")
file5=open("F5.txt","r")
list1=file1.readlines()
list2=file2.readlines()
list3=file3.readlines()
list4=file4.readlines()
list5=file5.readlines()
for line1 in list1:
for line2 in list2:
for line3 in list3:
for line3 in list4:
for line4 in list5:
if line1.strip() in line2.strip() in line3.strip() in line4.strip() in line5.strip():
print line1
file3.write(line1)
You can use this code to check similar lines in between files:
import glob
_contents = dict()
for filename in glob.glob('*.txt'):  # the question's files are .txt
    # "with" closes each file once its lines have been read
    with open(filename, 'r') as file:
        frd = file.readlines()
    _contents[filename] = frd
for key in _contents:
    for other_key in _contents:
        if key == other_key:
            pass
        else:
            print("Comparing in between files {0} and {1}".format(key, other_key))
            non_identical_contents = set(_contents[key]) - set(_contents[other_key])
            # what remains are the lines the two files have in common
            print(list(set(_contents[key])-non_identical_contents))
If I understood your purpose correctly, you should iterate over all of the text files in the directory and compare each one with every other (in all possible combinations). The code should look something like this:
import glob, os
nl = [] #Name list (containing the names of all files in the directory)
fl = [] #File list (containing the content of all files in the directory, each element in this list is a list of strings - the list of lines in a file)
os.chdir("/libwithtextfiles")
for filename in glob.glob("*.txt"): #Using glob to get all the files ending with '.txt'
    nl.append(filename) #Appending all the filenames in the directory to 'nl'
    f = open(filename, 'r')
    fl.append(f.readlines()) #Appending all of the lists of line to 'fl'
    f.close()
for fname1 in nl:
    l1 = fl[nl.index(fname1)]
    if nl.index(fname1) == len(nl) - 1: #We reached the last file
        break
    for fname2 in nl[nl.index(fname1) + 1:]:
        l2 = fl[nl.index(fname2)]
        #Here compare the amount of lines identical, use a counter
        #then print it, or output to a file or do whatever you want
        #with it
        #e.g (according to what I understood from your code)
        for f1line in l1:
            for f2line in l2:
                if f1line == f2line: #Why 'in' and not '=='?
                    """
                    have some counter increase right here, a suggestion is having
                    a list of lists, where the first element is
                    a list that contains integers
                    the first integer is the number of lines found identical
                    between the file (index in list_of_lists is corresponding to the name in that index in 'nl')
                    and the one following it (index in list_of_lists + 1)
                    the next integer is the number of lines identical between the same file
                    and the one following the one following it (+2 this time), etc.
                    Long story short: list_of_lists[i][j] is the number of lines identical
                    between the 'i'th file and the 'i+j'th one.
                    """
                    pass
Note that your code doesn't use loops where it should; you could have had a single list called l instead of line1 - line5.
Aside from that, your code is not clear at all. I assume the missing indentation (for line2 in list2: should be indented, including everything after it) and the for line3 in list3: for line3 in list4: (using line3 twice) are accidental and happened while copying the code to this site. Are you comparing every line with every line in the other files?
As my comment in the code suggests, you should have a counter that counts in how many files a line repeats. Do that with a for-loop that has another loop nested inside, iterating over the lines and comparing just two files at a time rather than all five. With 5 files of 10 lines each, your approach iterates 100,000 times (10**5), whereas my method needs only 1,000 iterations in that case, which is 100 times more efficient.
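The counter idea can be sketched with sets and itertools.combinations, which visits each pair of files once. This is only an illustration under assumptions that are not in the answers above: the function names and the min_shared threshold are hypothetical, lines are compared after stripping whitespace, and two files count as "similar" when they share at least min_shared distinct lines.

```python
from itertools import combinations


def shared_line_counts(contents):
    """Map each pair of file names to the number of distinct lines they share.

    `contents` maps a file name to its list of lines (e.g. from readlines()).
    """
    line_sets = {
        name: {ln.strip() for ln in lines if ln.strip()}
        for name, lines in contents.items()
    }
    return {
        (a, b): len(line_sets[a] & line_sets[b])
        for a, b in combinations(sorted(line_sets), 2)
    }


def group_similar(contents, min_shared=2):
    """For each file, list the other files sharing at least min_shared lines."""
    groups = {name: [] for name in contents}
    for (a, b), shared in shared_line_counts(contents).items():
        if shared >= min_shared:
            groups[a].append(b)
            groups[b].append(a)
    return groups


# Hypothetical in-memory sample standing in for files read from disk.
sample = {
    "F1.txt": ["a\n", "b\n", "c\n"],
    "F2.txt": ["b\n", "c\n", "d\n"],
    "F3.txt": ["x\n"],
}
print(group_similar(sample))
```

A file can appear in several groups this way, which matches the second note in the question. The threshold of 2 shared lines is a guess based on the "2-3 lines" mentioned there and would need tuning on real data.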
