I am saving all the words from a file like so:

sentence = ""
fileName = sys.argv[1]
for line in open(fileName):
    for word in line.split(" "):
        sentence += word

Everything works okay when I output it, except the formatting. I am moving source code; is there any way I can preserve the indentation?
Since you state that you want to move source code files, why not just copy/move them?
import shutil
shutil.move(src, dest)
If you read the source file whole,

fh = open("yourfilename", "r")
content = fh.read()

it should load your file as it is (with indentation), shouldn't it?
When you invoke line.split(" ") and concatenate the words, you throw away all the spaces, including the leading indentation.
What's wrong with just reading the file into a single string?
textWithIndentation = open(sys.argv[1], "r").read()
Split removes all spaces:
>>> a = "   a b   c"
>>> a.split(" ")
['', '', '', 'a', 'b', '', '', 'c']
As you can see, the resulting list no longer contains any spaces, but it does contain these strange empty strings (''). They denote that there was a space. To revert the effect of split, use " ".join(...):
>>> l = a.split(" ")
>>> " ".join(l)
'   a b   c'
or in your code:
sentence += " " + word
Or you can use a regular expression to get all spaces at the start of the line:
>>> import re
>>> re.match(r'^\s*', "   a b   c").group(0)
'   '
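Putting the pieces above together, here is a minimal sketch (the sample line is hypothetical) that captures a line's indentation with the regex and re-joins the words afterwards:

```python
import re

line = "    def foo():"  # hypothetical source line
indent = re.match(r'^\s*', line).group(0)  # the leading whitespace
words = line.split()       # split() with no argument drops empty strings
rebuilt = indent + " ".join(words)
print(rebuilt)
# rebuilt == "    def foo():"
```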
I am preprocessing data for an NLP task and need to structure the data in the following way:
[tokenized_sentence] tab [tags_corresponding_to_tokens]
I have a text file with thousands of lines in this format, where the two lists are separated by a tab. Here is an example
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']
and the piece of code I used to get this is
with open('data.txt', 'w') as foo:
    for i, j in zip(range(len(text)), range(len(tags))):
        foo.write(str([item for item in text[i].split()]) + '\t' + str([tag for tag in tags[j]]) + '\n')
where text is a list of sentences (i.e. each sentence is a string) and tags is a list of tag lists (i.e. the tags corresponding to the words/tokens of each sentence).
I need to get the string elements in the lists to have double quotes instead of single quotes while maintaining this structure. The expected output should look like this
["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."] ["I-ORG", "O", "I-MISC", "O", "O", "O", "I-MISC", "O", "O"]
I've tried using json.dump() and json.dumps() from Python's json module, but I didn't get the expected output as required; instead, I get the two lists as strings. My best effort was to manually add the double quotes like this (for the tags):
for i in range(len(tags)):
    for token in tags[i]:
        tkn = "\"%s\"" % token
        print(tkn)
which gives the output
"I-ORG"
"O"
"I-MISC"
"O"
"O"
"O"
"I-MISC"
"O"
"O"
"I-PER"
"I-PER"
.
.
.
however, this seems too inefficient. I have seen these related questions
Convert single-quoted string to double-quoted string
Converting a Text file to JSON format using Python
but they didn't address this directly.
I'm using Python 3.8
I'm pretty sure there is no way to force Python to render strings with double quotes when you write a list with str(); the default repr uses single quotes. As @deadshot commented, you can either replace the ' with " after you write the whole string to the file, or add the double quotes manually as you write each word. The answers to this post show many different ways to do it, the simplest being f'"{your_string_here}"'. You would need to write each string separately, though, since writing a list automatically adds ' around every item, and that would get very spaghetti.
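For example, a sketch of the manual approach with a hypothetical token list, building the bracketed list as a plain string:

```python
tokens = ['EU', 'rejects', 'German']  # hypothetical example tokens
# wrap each token in double quotes and join them by hand
line = "[" + ", ".join(f'"{t}"' for t in tokens) + "]"
print(line)
# ["EU", "rejects", "German"]
```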
Just do find and replace ' with " after you write the string to the file.
You can even do it with Python:
# after the string is written in 'data.txt'
with open('data.txt', "r") as f:
    text = f.read()

text = text.replace("'", '"')

with open('data.txt', "w") as f:
    f.write(text)
Edit according to OP's comment below
Do this instead of the above; it should fix most of the problems, since it only replaces patterns like "', '" which, hopefully, appear only at the end of one string and the start of the next:
with open('data.txt', "r") as f:
    text = f.read()

# replace ' at the start of the list
text = text.replace("['", '["')
# replace ' at the end of the list
text = text.replace("']", '"]')
# replace ' between items inside the list
text = text.replace("', '", '", "')

with open('data.txt', "w") as f:
    f.write(text)
(Edit by OP) New edit based on my latest comment
Running this solves the problem I described in the comment and returns the expected solution.
with open('data.txt', "r") as f:
    text = f.read()

# replace ' at the start of the list
text = text.replace("['", '["')
# replace ' at the end of the list
text = text.replace("']", '"]')
# replace ' between items inside the list
text = text.replace("', '", '", "')
text = text.replace("', ", '", ')
text = text.replace(", '", ', "')

with open('data.txt', "w") as f:
    f.write(text)
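As an aside, if the tokens and tags are still Python lists at write time (rather than already-formatted strings), json.dumps produces the double-quoted form directly and skips the replacement step entirely; a sketch with hypothetical data:

```python
import json

tokens = ['EU', 'rejects', 'German']   # hypothetical tokens
tags = ['I-ORG', 'O', 'I-MISC']        # hypothetical tags
# json.dumps renders Python lists of strings with double quotes
line = json.dumps(tokens) + '\t' + json.dumps(tags)
print(line)
```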
What my program does is
- Take a sentence
- Make a dictionary and put it in an external .txt file
- Make a list of numbers which indicates which words are in which positions
- Recreate the original sentence using the numbers and the dictionary, and put it in an external .txt file
However, I get this error message when recreating the sentence:
line 22, in <module>
newoutput = (wordDictionary[int(numbers)]) + " "
ValueError: invalid literal for int() with base 10: ''
and here is my code
sentence = input("What is your sentence? ")
splitWords = sentence.split()
wordPositions = ""
wordDictionary = {}

for positions, words in enumerate(splitWords):
    if words not in wordDictionary:
        wordDictionary[words] = positions + 1
    wordPositions = wordPositions + str(wordDictionary[words]) + " "

fileName = input("What would you like to call your dictionary .txt file? ")
file = open(fileName + ".txt", "w")
for words in wordDictionary:
    output = words + "\t" + str(wordDictionary[words]) + "\n"
    file.write(output)
file.close()

numberList = wordPositions.split(" ")
wordDictionary = {y: x for x, y in wordDictionary.items()}

fileName2 = input("What would you like to call your sentence .txt file? ")
file = open(fileName2 + ".txt", "w")
for numbers in numberList:
    newoutput = (wordDictionary[int(numbers)]) + " "
    file.write(newoutput)
file.close()
How can I fix this error message?
Thanks for your help :)
It looks like you have an empty string in numberList. The reason is how the text is being split.
See the following example:
>>> 'xx  yy'.split()
['xx', 'yy']
>>> 'xx  yy'.split(' ')
['xx', '', 'yy']
If you use a delimiter you will always get a result, even if you split an empty string.
>>> ''.split(' ')
['']
To cite the documentation for split.
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
So at the moment you use split(" "), but you should use split() without a parameter.
You are passing an empty string to the int() method; the error says it doesn't know how to parse an empty string. Echo-printing what your program hands to the method would help you out the most here: numbers in this case is an empty string.
As indicated by your traceback, you are passing an empty string ('') to int(). You could either test for empty values before

newoutput = (wordDictionary[int(numbers)]) + " "

or wrap the statement in a try/except. However, the best solution would probably be to prevent empty strings from getting into your numberList in the first place.
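A sketch of that last suggestion: collect the positions as a list of ints from the start, so no empty strings can sneak in (the sample sentence is made up):

```python
sentence = "the cat sat on the mat"  # hypothetical input
splitWords = sentence.split()
wordDictionary = {}
numberList = []  # ints from the start, so int() is never needed later
for position, word in enumerate(splitWords):
    if word not in wordDictionary:
        wordDictionary[word] = position + 1
    numberList.append(wordDictionary[word])
print(numberList)
# [1, 2, 3, 4, 1, 6]
```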
import string

remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))

with open('data10.txt', 'r') as f:
    for line in f:
        for word in line.split():
            w = f.read().translate(remove)
            print(word.lower())
I have this code here and for some reason, the translate(remove) is leaving a good amount of punctuation in the parsed file.
Why are you reading the whole file within the for loop?
Try this:
import string

remove = dict.fromkeys(map(ord, '\n ' + string.punctuation))

with open('data10.txt', 'r') as f:
    for line in f:
        for word in line.split():
            word = word.translate(remove)
            print(word.lower())

This will print out the lower-cased and stripped words, one per line. Not really sure if that's what you want.
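As a side note, the same translation table can also be built with str.maketrans, which may read a bit more clearly than dict.fromkeys(map(ord, ...)); a small sketch with a made-up input word:

```python
import string

# the third argument of str.maketrans lists characters to delete
remove = str.maketrans('', '', '\n ' + string.punctuation)
word = "Hello, world!"  # hypothetical input
cleaned = word.translate(remove).lower()
print(cleaned)
# helloworld
```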
I have output in which each line contains one list, and each list contains one word of a sentence after hyphenation.
It looks something like this:
['I']
['am']
['a']
['man.']
['I']
['would']
['like']
['to']
['find']
['a']
['so','lu','tion.']
(let's say it's hyphenated like this, I'm not a native English speaker)
etc.
Now, what I'd like to do is write this output to a new .txt file, but each sentence (a sentence ends when an item in a list contains a period) has to be written on a new line. I'd like to have the following result written to this .txt file:
I am a man.
I would like to find a so,lu,tion.
etc.
The code that precedes all this is the following:

with open('file.txt', 'r') as f:
    for line in f:
        for word in line.split():
            if h_en.syllables(word) != []:
                h_en.syllables(word)
            else:
                print([word])
The result I want is a file which contains one sentence per line, where each word of a sentence is represented by its hyphenated version.
Any suggestions?
Thank you a lot.
Something basic like this seems to answer your need:
def write_sentences(filename, *word_lists):
    with open(filename, "w") as f:
        sentence = []
        for word_list in word_lists:
            word = ",".join(word_list)  # last edit
            sentence.append(word)
            if word.endswith("."):
                f.write(" ".join(sentence))
                f.write("\n")
                sentence = []
Feed the write_sentences function the output filename, then each of your word lists as arguments. If you have a list of word lists (e.g. [['I'], ['am'], ...]), you can use * when calling the function to unpack everything.
EDIT: changed to make it work with the latest edit of the answer (with multiple words in the word lists)
This short regex does what you want when it is compiled in MULTILINE mode:
>>> regex = re.compile(r"\[([a-zA-Z\s]*\.?)\]$", re.MULTILINE)
>>> a = regex.findall(string)
>>> a
[u'I', u'am', u'a man.', u'I', u'would like', u'to find', u'a solution.']
Now you just manipulate the list until you get your wanted result. An example follows, but there are more ways to do it:
>>> b = ' '.join(a)
>>> b
'I am a man. I would like to find a solution.'
>>> c = re.sub('\.', '.\n', b)
>>> print(c)
I am a man.
 I would like to find a solution.
>>> with open("result.txt", "wt") as f:
...     f.write(c)
words = [['I'], ['am'], ['a'], ['man.'], ['I'], ['would'], ['like'], ['to'], ['find'], ['a'], ['so', 'lu', 'tion.']]

text = "".join(
    ",".join(item) + ("\n" if item[-1].endswith(".") else " ")
    for item in words)

with open("out.txt", "wt") as f:
    f.write(text)
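Tracing that approach with the word lists from the question (using a comma join so the hyphen pieces stay visible, as in the expected output):

```python
words = [['I'], ['am'], ['a'], ['man.'],
         ['I'], ['would'], ['like'], ['to'], ['find'], ['a'],
         ['so', 'lu', 'tion.']]

# comma-join the pieces of each word; end the line after any word
# that finishes with a period
text = "".join(
    ",".join(item) + ("\n" if item[-1].endswith(".") else " ")
    for item in words)

print(text)
# I am a man.
# I would like to find a so,lu,tion.
```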
I am writing a program that analyzes a large directory text file line by line. In doing so, I am trying to extract different parts of the file and categorize them as 'Name', 'Address', etc. However, due to the format of the file, I am running into a problem. Some of the text I have is split across two lines, such as:
'123 ABCDEF ST
APT 456'
How can I make it so that even through line-by-line analysis, Python returns this as a single-line string in the form of
'123 ABCDEF ST APT 456'?
If you want to remove the newlines:

" ".join(my_string.splitlines())
Assuming you are using Windows, if you print the file to your screen you will see

'123 ABCDEF ST\nAPT 456\n'

where the \n represent the line breaks.

There are a number of ways to get rid of the newlines in the file. One easy way is to split the string on the newline characters and then rejoin the resulting list:

myList = myFile.split('\n')
newString = ' '.join(myList)
To replace the newlines with a space:
address = '123 ABCDEF ST\nAPT 456\n'
address.replace("\n", " ")
import re

def mergeline(c, l):
    if c:
        return c.rstrip() + " " + l
    else:
        return l

def getline(fname):
    qstart = re.compile(r"^'[^']*$")
    qend = re.compile(r".*'$")
    with open(fname) as f:
        linecache, halfline = ("", False)
        for line in f:
            if not halfline:
                linecache = ""
            linecache = mergeline(linecache, line)
            if halfline:
                halfline = not re.match(qend, line)
            else:
                halfline = re.match(qstart, line)
            if not halfline:
                yield linecache
        if halfline:
            yield linecache

for line in getline('input'):
    print(line.rstrip())
Assuming you're iterating through your file with something like this:

with open('myfile.txt') as fh:
    for line in fh:
        # Code here

And also assuming strings in your text file are delimited with single quotes, I would do this:

while not line.endswith("'"):
    line += next(fh)
That's a lot of assuming though.
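To make that idea concrete, here is a self-contained sketch that stands in an in-memory list for the file handle and joins continuation lines with a space (the space is an assumption about the desired output):

```python
lines = ["'123 ABCDEF ST", "APT 456'"]  # hypothetical file contents
it = iter(lines)

records = []
for line in it:
    # keep pulling lines until the closing single quote appears
    while not line.endswith("'"):
        line += " " + next(it)
    records.append(line)

print(records)
# ["'123 ABCDEF ST APT 456'"]
```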
I think I might have found an easy solution: just call .replace('\n', " ") on whatever string you want to convert. For example, if you have

my_string = "hi i am a programmer\nand i like to code in python"

and you want to convert it, you can just do

my_string = my_string.replace('\n', " ")

(note that replace returns a new string, so you have to assign the result). Hope it helps.