I'm stuck on how to accomplish this. What I have is a folder of, say, 10 text files, and each text file contains an article written in spintax, for example: {This is|Here is} {My|Your|Their} {Articles|Post}
Each text file contains a full article with paragraphs in spintax. What I want to do is randomly pick one article from that folder, spin it (resolve the spintax), and then save the result as a new text file or append it to an existing file.
I've tried to find some examples of how to do this but have had no success.
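For what it's worth, here is a minimal sketch of one way this could be done, assuming the spintax only uses {option|option} groups; the folder path and output file name below are made up:

import glob
import random
import re

def spin(text):
    # repeatedly resolve the innermost {a|b|c} group with a random choice
    pattern = re.compile(r'\{([^{}]*)\}')
    while True:
        text, count = pattern.subn(lambda m: random.choice(m.group(1).split('|')), text)
        if count == 0:
            return text

# pick one spintax article at random from the folder (path is an assumption)
source = random.choice(glob.glob('articles/*.txt'))
with open(source) as f:
    spun = spin(f.read())

# save the spun article as a new text file
with open('spun_article.txt', 'w') as out:
    out.write(spun)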
I need to compare two XML files that have the same structure and find any extra lines. The example below provides some details:
File 1:
<Log1> Some text here </Log1>
<Log2>
<Id1> Some text here </Id1>
<Id2> Some text here </Id2>
</Log2>
File 2:
<Log1> Some text here </Log1>
<Log2>
<Id1> Some text here </Id1>
<Id2> Some text here </Id2>
</Log2>
<Log3> Some text here </Log3>
I need the diff to identify that there is one extra tag in File 2. Is there an efficient way to do this in Python?
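One simple approach, just a sketch, is to treat the files as text and let difflib surface the extra lines (the file names here are placeholders):

import difflib

# read both files as lists of lines (file names are placeholders)
with open('file1.xml') as f1, open('file2.xml') as f2:
    lines1 = f1.read().splitlines()
    lines2 = f2.read().splitlines()

# lines prefixed with '+' exist only in File 2, lines prefixed with '-' only in File 1
for line in difflib.unified_diff(lines1, lines2, fromfile='File 1', tofile='File 2', lineterm=''):
    print(line)

For the example above, this would report the <Log3> line as an addition in File 2.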
I have some PDF files that have two columns per page. I want to extract the text from those files programmatically. The content of the PDF files is Chinese. I tried the pdfminer3k library for Python 3 and Ghostscript, but neither result was very good.
In the end I used the open source GitHub project textract (deanmalmgren/textract).
But textract cannot detect that each page contains two columns. I used the following code:
import textract

# process() returns the extracted text of the whole document
text = textract.process("/home/name/Downloads/textract-master/test.pdf")
print(text)
The PDF file link is https://pan.baidu.com/s/1nvLQnLf
The output shows that the extraction treats the two columns as a single column. How can I extract the text from two-column PDF files correctly?
This is the output produced by the extraction program.
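One possible workaround, not tested against this particular file, is to crop each page into a left and a right half and extract each half separately, for example with the pdfplumber library; the 50/50 split is an assumption and may need adjusting to the real layout:

import pdfplumber

# split each page down the middle and extract the halves in reading order
with pdfplumber.open("test.pdf") as pdf:  # path is a placeholder
    parts = []
    for page in pdf.pages:
        left = page.crop((0, 0, page.width / 2, page.height))
        right = page.crop((page.width / 2, 0, page.width, page.height))
        parts.append(left.extract_text() or "")
        parts.append(right.extract_text() or "")

print("\n".join(parts))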
I have a lot of data in different text files. Each file name contains a word chosen by me, but the names also include a lot of "gibberish". For example, I have a text file called datapoints-(my chosen name)-12iu8w9e8v09wr-140-ad92-dw9
The datapoints prefix is in all the file names, the (my chosen name) part is what I define and know how to extract in my code, but the last part is random. I don't want to go and delete that part from every file name I have; that would be time consuming.
I just want to load these text files, but I'm unsure how to target each file without using the "gibberish" at the end. I want to be able to say something like "load the file that includes (my chosen name)" and not worry about the rest.
This returns a list of all your matching files using the glob module:
import glob

your_words = ['word1', 'word2']
files = []

# find files matching 'datapoints-<your word>-*.txt'
for word in your_words:
    # the * is a wildcard; each of your words is filled into the {} one by one
    files.extend(glob.glob('datapoints-{}-*.txt'.format(word)))

print(files)
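From there, loading the matched files could look something like this (just a sketch; adjust the encoding if your files need it):

# read the content of every matched file
for path in files:
    with open(path) as f:
        data = f.read()
        # ... process data here ...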
I'm relatively new to programming and using Python, and I couldn't find anything on here that quite answered my question. Basically what I'm looking to do is extract a certain section of about 150 different .txt files and collect each of these pieces into a single .txt file.
Each of the .txt files contains DNA sequence alignment data, and each file basically reads out several dozen different possible sequences. I'm only interested in one of the sequences in each file, and I want to be able to use a script to excise that sequence from all of the files and combine them into a single file that I can then feed into a program that translates the sequences into protein code. Really what I'm trying to avoid is having to go one by one through each of the 150 files and copy/paste the desired sequence into the software.
Does anyone have any idea how I might do this? Thanks!
Edit: I tried to post an image of one of the text files, but apparently I don't have enough "reputation."
Edit2: Hi y'all, I'm sorry I didn't get back to this sooner. I've uploaded the image, here's a link to the upload: http://imgur.com/k3zBTu8
I'm assuming you have 150 FASTA files and that each file contains the sequence ID whose sequence you want. You could use the Biopython module to do this. Put all 150 files in a folder such as "C:\seq_folder" (the folder should not contain any other files, and the txt files should not be open).
import os
from Bio import SeqIO

os.chdir('C:\\seq_folder')  # change the working directory so Python can find the txt files

seq_id = 'your_sequence_id'  # the ID of the sequence you want
txt_list = os.listdir('C:\\seq_folder')

result = open('result.fa', 'w')

for item in txt_list:
    with open(item, 'r') as handle:
        for record in SeqIO.parse(handle, 'fasta'):
            if record.id == seq_id:
                result.write('>' + record.id + '\n')
                result.write(str(record.seq) + '\n')

result.close()
This code will produce a FASTA file, 'result.fa', containing the sequence for your desired ID collected from all the files. You can also translate the sequences into protein with Biopython, as sketched below.
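A minimal sketch of the translation step using Biopython's Seq.translate(); the example sequence here is made up:

from Bio.Seq import Seq

dna = Seq("ATGGCCATTGTA")  # placeholder sequence; use record.seq from the loop above
protein = dna.translate()  # uses the standard codon table by default
print(protein)             # -> MAIV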
I have been stuck on this for a while and cannot figure out how to start doing it with Python. My OS is Windows XP Pro. I need a script that moves the entire text (100% of the text) from one .doc file to another, but it's not as easy as it sounds. There is not just one target .doc file; there can be many of them. All the target .doc files are always in the same folder (same path), but they do not all have the same name. The .doc file FROM which I want to move the entire text is only one, always in the same folder (same path) and always with the same file name.
The names of the target files are only similar but, as I said before, not the same. Here is the point of the whole script:
Target .doc files have the names:
HD1.doc HD2.doc HD3.doc HD4.doc
and so on
What I would like is the entire text (really all of the text, it must be 100%) moved into the .doc file with the highest ( ! ) number. The target .doc files will always start with "HD" and always look like the examples above.
It is possible that there is only one target .doc file, so only HD1.doc. In that case "1" is the maximum number and the text is moved into this file.
Sometimes the target file is empty, but usually it won't be. If it isn't empty, the text should be appended at the end of the existing text, starting on the first new line (no empty lines in between).
So for example in the target file which has the maximum number in its name is the following text:
a
b
c
In the file from which I want to move the text is:
d
This means I need in the target file this:
a
b
c
d
But no empty lines anywhere.
I have found this (showing three different code samples):
http://paste.pocoo.org/show/169309/
But none of them make any sense to me. I know I would need to begin by finding the correct target file (the correct HDX file where X is the highest number; again, all HD files are and will be in the same folder), but I have no idea how to do this.
I meant Microsoft Office Word .doc files. They contain "pure text"; what I mean by pure text is that I am also able to view them in Notepad (.txt), but I need to work with the .doc extension. Python is because I need this as an automated system, so I wouldn't even need to open any file. Why exactly Python and not another programming language? Because I recently started learning Python and need this script for my work; Python is the "only" programming language I'm interested in, and that's why I would like to write this script with it.
By "really 100%" I meant that the entire text (everything in the source file, every single line, whether there are 2 or several thousand) would be moved to the correct target file (which one is correct is described in my first post). I cannot move the whole file, because I need to move the entire text (everything gathered; the source file will always be the same, but the text content will always be different) into the correct .doc file with the correct name, together (by "together" I mean inside the same file) with any already existing text, IF there is anything already in the target file. It is also possible that the correct target file is empty.
If someone could suggest me anything, I would really appreciate it.
Thank you, best wishes.
I have tried asking on the OpenOffice forum, but they don't answer. I've seen that the code could be something like this:
from time import sleep
import win32com.client

wordApp = win32com.client.Dispatch('Word.Application')
wordApp.Visible = False
sleep(5)
HD1 = wordApp.Documents.Open('C:\\test.doc')  # the Word document as an object
HD1.Content.Copy()  # copies the entire document content to the clipboard
But I have no idea what that means. Also, I cannot use the .doc file like that, because I never know the correct filename in advance (HDX.doc where X is the maximum integer; all HD files are in the same directory path), so I cannot hard-code its name - the script should find the correct file. Also, ''filename'' = wordApp.Documents.Open(...) would for sure give me a syntax error. :-(
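A minimal sketch of how the highest-numbered HD file could be found with glob and a regular expression; the folder path here is an assumption:

import glob
import os
import re

folder = r'C:\docs'  # assumed folder that holds the HD*.doc files

def hd_number(path):
    # pull the integer out of a name like HD12.doc
    match = re.search(r'HD(\d+)\.doc$', os.path.basename(path), re.IGNORECASE)
    return int(match.group(1)) if match else -1

candidates = glob.glob(os.path.join(folder, 'HD*.doc'))
target_path = max(candidates, key=hd_number)  # the HD file with the highest number
print(target_path)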
OpenOffice ships with full Python scripting support; have a look: http://wiki.services.openoffice.org/wiki/Python
It might be easier than trying to mess around with MS Word and the COM APIs.
So you want to take the text from a .doc file and append it to the end of the text in another .doc file. The problem here is that these are MS Word files. It's a proprietary format, and as far as I know there is no module to access them directly from Python.
If you are on Windows, you can access them via the COM API, but that's pretty complicated. Still, look into that. Otherwise I recommend that you not use MS Word files. The above sounds like some sort of logging facility, and it seems like a bad idea to use Word files for this; it's too fragile.
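If the COM route is taken anyway, here is a rough sketch of the whole operation (Windows with Word installed; the paths are assumptions, and target_path would come from the file-finding snippet above). Clearing the source document afterwards, to make it a true move, is left out:

import win32com.client

source_path = r'C:\docs\source.doc'  # the fixed source file (assumed path)
target_path = r'C:\docs\HD3.doc'     # the highest-numbered HD file, found as above

word = win32com.client.Dispatch('Word.Application')
word.Visible = False
try:
    source = word.Documents.Open(source_path)
    target = word.Documents.Open(target_path)

    text = source.Content.Text.strip()
    # start on a new line only if the target already has text (avoids empty lines)
    prefix = '\r' if target.Content.Text.strip() else ''
    target.Content.InsertAfter(prefix + text)

    target.Save()
    source.Close()
    target.Close()
finally:
    word.Quit()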