How to extract text from several .txt files with Python?

I'm relatively new to programming and using Python, and I couldn't find anything on here that quite answered my question. Basically what I'm looking to do is extract a certain section of about 150 different .txt files and collect each of these pieces into a single .txt file.
Each of the .txt files contains DNA sequence alignment data, and each file basically reads out several dozen different possible sequences. I'm only interested in one of the sequences in each file, and I want to be able to use a script to excise that sequence from all of the files and combine them into a single file that I can then feed into a program that translates the sequences into protein code. Really what I'm trying to avoid is having to go one by one through each of the 150 files and copy/paste the desired sequence into the software.
Does anyone have any idea how I might do this? Thanks!
Edit: I tried to post an image of one of the text files, but apparently I don't have enough "reputation."
Edit2: Hi y'all, I'm sorry I didn't get back to this sooner. I've uploaded the image, here's a link to the upload: http://imgur.com/k3zBTu8

I'm assuming you have 150 FASTA files and that each file contains the ID of the sequence you want. You could use the Biopython module to do this. Put all 150 files in a folder such as "C:\seq_folder" (the folder should not contain any other files, and the .txt files should not be open).
import os
from Bio import SeqIO
os.chdir('C:\\seq_folder')  # change the working directory so Python can find the txt files easily
seq_id = 'x'  # placeholder: replace 'x' with the ID of the sequence you want
txt_list = os.listdir('C:\\seq_folder')
result = open('result.fa', 'w')
for item in txt_list:
    with open(item, 'r') as file:
        for record in SeqIO.parse(file, 'fasta'):
            # keep only the record whose ID matches
            if record.id == seq_id:
                result.write('>' + record.id + '\n')
                result.write(str(record.seq) + '\n')
result.close()
This code produces a FASTA file, 'result.fa', containing the sequence with your desired ID from each of the files. You can also translate the sequences into protein using the Biopython module.
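If installing Biopython is not an option, the same extraction can be sketched with the standard library alone. This is a minimal sketch, assuming the files really are plain FASTA text (a ">" header line followed by sequence lines), which the question does not confirm. The function names read_fasta and collect_sequence are hypothetical, and prefixing each header with the source file name is my own choice so that the 150 records stay distinguishable.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (record_id, sequence) pairs from one FASTA-formatted text file."""
    record_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if record_id is not None:
                    yield record_id, "".join(chunks)
                words = line[1:].split()
                # The ID is taken to be the first word after '>'.
                record_id, chunks = (words[0] if words else ""), []
            elif line and record_id is not None:
                chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def collect_sequence(folder, seq_id, output_path):
    """Write every record with ID seq_id, from all .txt files in folder, to output_path.

    Returns the number of records written.
    """
    count = 0
    with open(output_path, "w") as out:
        for path in sorted(Path(folder).glob("*.txt")):
            for record_id, sequence in read_fasta(path):
                if record_id == seq_id:
                    # Assumption: the file name is prepended to keep records apart.
                    out.write(f">{path.stem}_{record_id}\n{sequence}\n")
                    count += 1
    return count
```

Give the output file an extension other than .txt (for example result.fa), or write it outside the folder, so that it is not picked up as an input file.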

Related

How to read Json files in a directory separately with a for loop and performing a calculation

Update: Sorry, it seems my question wasn't asked properly. I am analyzing a transportation network consisting of more than 5000 links. All the data is included in one big CSV file. I have several JSON files, each of which contains a subset of this network. I am trying to loop through all the JSON files INDIVIDUALLY (i.e. not trying to concatenate them or anything like that), read each JSON file, extract the matching information from the CSV file, perform a calculation, and save the result along with the name of the file in a new dataframe.
This is the code I wrote, but not sure if it's efficient enough.
import glob
import json
import os
import pandas as pd
name = []
percent_of_truck = []
path_to_json = r'\\directory'
z = glob.glob(os.path.join(path_to_json, '*.json'))
for i in z:
    with open(i, 'r') as myfile:
        l = json.load(myfile)
    name.append(i)
    d_2019 = final.loc[final['LINK_ID'].isin(l)]  # retrieve data from the main CSV file (final is the dataframe loaded from it)
    avg_m = (d_2019['AADTT16'] / d_2019['AADT16'] * d_2019['Length']).sum() / d_2019['Length'].sum()  # calculation
    percent_of_truck.append(avg_m)
f = pd.DataFrame()
f['Name'] = name
f['% of truck'] = percent_of_truck

I'm assuming here that you just want a dictionary of all the JSON. If so, use the json library (import json); this code may be of use:
import json
def importSomeJSONFile(f):
    # open the file, parse it as JSON, and close it again
    with open(f) as handle:
        return json.load(handle)
# make sure the file exists in the same directory
example = importSomeJSONFile("example.json")
print(example)
# access a value within it, replacing "name" with the key you want
key = "name"
print(example[key])

Since you haven't added any schema or other specific requirements, you can follow this approach to solve your problem, in any language you prefer:
1. Get the directory of the JSON files that need to be read.
2. Get a list of all files present in the directory.
3. For each file name returned in Step 2:
   a. Read the file.
   b. Parse the JSON from the string.
   c. Perform the required calculation.
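The steps above can be sketched in Python with the standard library only. This is a rough sketch under stated assumptions, not the asker's actual setup: each JSON file is assumed to hold a plain list of link IDs, and the CSV is assumed to have the columns LINK_ID, AADTT16, AADT16 and Length that appear in the question's code. The function names load_links, truck_share and summarize are hypothetical.

```python
import csv
import json
from pathlib import Path


def load_links(csv_path):
    """Index the rows of the main CSV file by LINK_ID (kept as a string)."""
    with open(csv_path, newline="") as handle:
        return {row["LINK_ID"]: row for row in csv.DictReader(handle)}


def truck_share(rows):
    """Length-weighted average of AADTT16 / AADT16 over the given rows."""
    total_length = sum(float(r["Length"]) for r in rows)
    weighted = sum(
        float(r["AADTT16"]) / float(r["AADT16"]) * float(r["Length"]) for r in rows
    )
    return weighted / total_length


def summarize(json_dir, csv_path):
    """Return a list of (file name, truck share), one entry per JSON file."""
    links = load_links(csv_path)
    results = []
    # Steps 1 and 2: list the JSON files in the directory.
    for json_path in sorted(Path(json_dir).glob("*.json")):
        # Step 3: read and parse one file at a time.
        with open(json_path) as handle:
            link_ids = json.load(handle)
        # Assumption: IDs may be numbers in JSON but are strings in the CSV.
        rows = [links[str(i)] for i in link_ids if str(i) in links]
        if rows:  # files with no matching links are skipped
            results.append((json_path.name, truck_share(rows)))
    return results
```

A row with AADT16 equal to 0 would raise ZeroDivisionError here, so real data may need a filter before the calculation.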

Test a ZIP file if data has been added at the end of the file?

I am looking for a way to test ZIP files in more detail than Python's ZipFile.testzip() does.
Specifically, I want to identify ZIP files that have been modified by appending additional data after the end of the ZIP file, or to be precise, after the end of the End of central directory record (EOCD).
Common ZIP testing tools (Python's ZipFile.testzip(), unzip, 7zip, WinRAR, ...) only test the file up to the EOCD and ignore any additional data after it. However, I need to know whether additional data is present after the end of the EOCD.
Is there a simple way to do this in Python? The simplest way would be to read the real "ZIP file size" (the offset of the last byte of the EOCD + 1). But how can this be done in Python?
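One possible approach is to locate the EOCD record by its signature and compare where it ends with the size of the file. The sketch below relies on the documented EOCD layout: a 4-byte signature "PK\x05\x06", a fixed part of 22 bytes, and a 2-byte comment length at offset 20. It assumes the appended data (and the archive comment) do not themselves contain the EOCD signature; if they do, the result can be wrong. The function name trailing_byte_count is hypothetical.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_FIXED_SIZE = 22  # size of the EOCD record without its trailing comment


def trailing_byte_count(path):
    """Return the number of bytes found after the end of the EOCD record."""
    with open(path, "rb") as handle:
        data = handle.read()
    # Start from the last occurrence of the signature and walk backwards
    # until a candidate is found whose record fits inside the file.
    pos = data.rfind(EOCD_SIGNATURE)
    while pos != -1:
        if pos + EOCD_FIXED_SIZE <= len(data):
            # The comment length is a little-endian 16-bit value at offset 20.
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            end = pos + EOCD_FIXED_SIZE + comment_len
            if end <= len(data):
                return len(data) - end
        pos = data.rfind(EOCD_SIGNATURE, 0, pos)
    raise ValueError("no End of central directory record found")
```

A return value of 0 means the file ends exactly at the EOCD, and the "ZIP file size" asked about is the file size minus the returned count. The whole file is read into memory, which keeps the sketch short but may not suit very large archives.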

Loading data from text file from only a part of the file name

I have a lot of data in different text files. Each file name contains a word that is chosen by me, but they also include a lot of "gibberish". So for example I have a text file called datapoints-(my chosen name)-12iu8w9e8v09wr-140-ad92-dw9
So the datapoints string is in all text files, the (my chosen name) is what I define and know how to extract in my code, but the last bit is random. And I don't want to go and delete that part in every text file I have, that would be a bit time consuming.
I just want to load these text files, but I'm unsure of how to target each file without using the "gibberish" in the end. I just want to say something like: "load file that includes (my chosen name)" and then not worry about the rest.

This returns a list of all your files, using the glob module:
import glob
your_words = ['word1', 'word2']
files = []
# find files matching 'datapoints-<your word>-*.txt'
for word in your_words:
    # The * is a wildcard; your words are filled into the {} one by one
    files.extend(glob.glob('datapoints-{}-*.txt'.format(word)))
print(files)
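To go one step further and actually load the file for a single chosen name, something like the following might work. It is only a sketch: the function name load_by_name is hypothetical, the chosen name is assumed to contain no wildcard characters (*, ?, [), and the pattern deliberately leaves the extension open because the example file name in the question shows none.

```python
from pathlib import Path


def load_by_name(directory, chosen_name):
    """Return the text of the one file named datapoints-<chosen_name>-<anything>."""
    pattern = f"datapoints-{chosen_name}-*"
    matches = sorted(Path(directory).glob(pattern))
    if not matches:
        raise FileNotFoundError(f"no file matches {pattern!r}")
    if len(matches) > 1:
        # Refuse to guess when the chosen name is not unique.
        raise ValueError(f"{len(matches)} files match {pattern!r}")
    return matches[0].read_text()
```

Raising an error on several matches is a design choice; returning all matches as a list would be just as reasonable if one name can legitimately map to several files.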

Duplicating PDF file by variable

I'm working on a project and I've run into a brick wall.
We have a text file in which each line has a numeric value and a PDF file name. If the value is 6, we need to print 6 copies of the PDF. I was thinking of creating X copies of the PDF per line and then combining them afterwards. I'm not sure this is the most efficient way to do it, and was wondering whether anyone has another idea.
DATA
1,PDF1
2,PDF5
7,PDF2
923,PDF33

You should use Python's csv module to read your data into two variables, numCopies and filePath: https://docs.python.org/2/library/csv.html
You can then just use
for i in range(numCopies):
    shutil.copyfile(filePath, newFilePath)  # newFilePath must be a different name on each pass
or something along those lines.
If you want to work with the contents of the PDF files themselves, I'd recommend the pyPdf module.
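Put together, the csv and shutil suggestion could look roughly like this. It is a sketch under assumptions the question does not state: the names in the data file (PDF1, PDF5, ...) are taken to refer to files called PDF1.pdf, PDF5.pdf, ... in one folder, and the copy naming scheme is made up. The function name duplicate_pdfs and its parameters are hypothetical.

```python
import csv
import shutil
from pathlib import Path


def duplicate_pdfs(list_path, pdf_dir, out_dir):
    """Create the requested number of copies of each PDF listed in list_path.

    Each line of list_path looks like "6,PDF1". Returns the paths created.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    with open(list_path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:  # skip blank lines
                continue
            count, name = row
            # Assumption: the data file omits the .pdf extension.
            source = Path(pdf_dir) / f"{name}.pdf"
            for i in range(1, int(count) + 1):
                target = out_dir / f"{name}_copy{i}.pdf"
                shutil.copyfile(source, target)
                created.append(target)
    return created
```

For a count such as 923, writing that many files to disk may be wasteful; sending the same file to the printer repeatedly, or merging pages with a PDF library, would avoid the intermediate copies.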

Python - Moving entire text between two .doc files

I have been stuck on this for a while and cannot figure out how to start doing it with Python. My OS is Windows XP Pro. I need a script that moves the entire text (100% of it) from one .doc file to another. But it's not as easy as it sounds. There can be many target .doc files. All the target .doc files are always in the same folder (same path), but they don't all have the same name. There is only one source .doc file, the one I want to move the text FROM; it is always in the same folder (same path) and always has the same file name.
The names of the target files are similar but, as I said, not the same. Here is the point of the whole script:
Target .doc files have the names:
HD1.doc HD2.doc HD3.doc HD4.doc
and so on
What I would like is to move the entire text (really all of it, 100%) into the .doc file with the highest (!) number. The target .doc files will always start with "HD" and will always be named like the examples above.
It is possible that there is only one target file, HD1.doc. In that case "1" is the maximum number and the text is moved into this file.
Sometimes the target file is empty, but usually it won't be. If it isn't empty, the text should be added at the end of the existing text, on the first new line (no empty lines in between).
So for example in the target file which has the maximum number in its name is the following text:
a
b
c
In the file from which I want to move the text is:
d
This means I need in the target file this:
a
b
c
d
But no empty lines anywhere.
I have found (showing three different codes):
http://paste.pocoo.org/show/169309/
But none of them makes any sense to me. I know I would need to begin by finding the correct target file (the HDX file where X is the highest number; again, all HD files are and always will be in the same folder), but I have no idea how to do this.
I mean Microsoft Office Word .doc files. They contain "pure text", by which I mean that I can also view them in Notepad (like .txt), but I need to work with the .doc extension. I want Python because I need this as an automated system, so that I wouldn't even need to open any file. Why exactly Python and not another programming language? I recently started learning Python and need this script for my work; Python is the "only" programming language I'm interested in, and that's why I would like to write this script in it.
By "really 100%" I mean that the entire text (everything in the source file, every single line, whether there are 2 or several thousand) should be moved to the correct target file (which one is correct is described above). I cannot move the whole file: the source file will always be the same file, but the content of its text will always be different, and I need that text in the correct .doc file with the correct name, together with (that is, inside the same file as) any text that already exists in the target file. It is also possible that the correct target file is empty.
If someone could suggest anything, I would really appreciate it.
Thank you, best wishes.
I have tried asking on the OpenOffice forum, but they don't answer. From what I have seen, the code could be something like this:
from time import sleep
import win32com.client
wordApp = win32com.client.Dispatch('Word.Application')
wordApp.Visible = False
HD1 = wordApp.Documents.Open('C:\\test.doc')  # HD1 is the Word document as an object
sleep(5)
HD1.Content.Copy()  # copies the entire document content to the clipboard
But I have no idea what that means. Also, I cannot refer to the .doc file like that, because I never know the correct filename in advance (HDX.doc, where X is the maximum integer; all HD files are in the same directory), so I cannot hard-code its name: the script should find the correct file. Also, ''filename'' = wordApp.Documents.Open... would surely give me a syntax error. :-(

OpenOffice ships with full Python scripting support; have a look: http://wiki.services.openoffice.org/wiki/Python
It might be easier than trying to mess around with MS Word and the COM APIs.

So you want to take the text from one .doc file and append it to the end of the text in another .doc file. The problem here is that these are MS Word files. It's a proprietary format, and as far as I know there is no module to access them from Python.
If you are on Windows, you can access them via the COM API, but that's pretty complicated. Still, look into that. Otherwise I recommend that you not use MS Word files. What you describe sounds like some sort of logging facility, and using Word files for that sounds like a bad idea; it's too fragile.
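The part of the task that does not depend on the Word format, finding the HD file with the highest number, can be sketched with the standard library. The second function below only applies if the files really are plain text saved with a .doc extension, as the asker says they are; for genuine Word documents it would corrupt the file, and COM or OpenOffice scripting would be needed instead. The names highest_hd_file and move_text are hypothetical.

```python
import re
from pathlib import Path

# Matches HD1.doc, HD2.doc, HD10.doc, ... and captures the number.
HD_NAME = re.compile(r"^HD(\d+)\.doc$", re.IGNORECASE)


def highest_hd_file(folder):
    """Return the path of the HD<number>.doc file with the largest number."""
    best_number, best_path = -1, None
    for path in Path(folder).iterdir():
        match = HD_NAME.match(path.name)
        # Compare as integers so that HD10 ranks above HD9.
        if match and int(match.group(1)) > best_number:
            best_number, best_path = int(match.group(1)), path
    if best_path is None:
        raise FileNotFoundError(f"no HD<number>.doc file in {folder}")
    return best_path


def move_text(source, folder):
    """Append all text from source to the highest HD file, then empty source.

    Assumption: both files are plain text despite the .doc extension.
    """
    source = Path(source)
    target = highest_hd_file(folder)
    # Blank lines are dropped so that no empty lines end up in the target.
    old_lines = [ln for ln in target.read_text().splitlines() if ln.strip()]
    new_lines = [ln for ln in source.read_text().splitlines() if ln.strip()]
    target.write_text("\n".join(old_lines + new_lines))
    source.write_text("")  # "moved" means the source no longer holds the text
    return target
```

With HD1.doc, HD2.doc and HD10.doc in the folder, highest_hd_file picks HD10.doc, which a plain alphabetical sort of the names would not.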
