I am reading a file in my Python script which looks like this:
#im a useless comment
this is important
I wrote a script to read and split the "this is important" part and ignore the comment lines that start with #.
I only need the first and the last word (in my case "this" and "important").
Is there a way to tell Python that I don't need certain parts of a split?
In my example I have what I want and it works.
However, if the string is longer and I have some 10 unused variables, I guess that is not how programmers would do it.
Here is my code:
#!/usr/bin/python3
import re

filehandle = open("file")
for line in filehandle:
    if re.search("#", line):
        continue  # skip comment lines
    else:
        a, b, c = line.split(" ")
        print(a)  # first word
        print(c)  # last word
filehandle.close()
Another possibility would be:
a, *_, b = line.split()
print(a, b)
# <a> <b>
Note that starred unpacking comes from PEP 3132 (extended iterable unpacking) and was added in Python 3.0, so it will not work on Python 2.
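To make the behaviour concrete, here is a quick sketch on a made-up longer line:
first, *middle, last = "this line has quite a few words".split()
print(first, last)   # this words
print(middle)        # ['line', 'has', 'quite', 'a', 'few']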
In your loop, instead of
a, b, c = line.split(" ")
use:
splitLines = line.split(" ")
a, b, c = splitLines[0], splitLines[1:-1], splitLines[-1]
Here a and c are the first and last words, and b is a list holding everything in between. Negative indexing in Python counts from the end of the sequence.
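A quick interactive illustration of negative indexing:
>>> words = "this is important".split()
>>> words[-1]
'important'
>>> words[-2]
'is'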
I think Python's negative indexing can solve your problem:
import re

filehandle = open("file")
for line in filehandle:
    if re.search("#", line):
        continue  # skip comment lines
    else:
        split_word = line.split()
        print(split_word[0])   # first word
        print(split_word[-1])  # last word
filehandle.close()
Read more about negative indexing in Python.
You can save the result to a list, and get the first and last elements:
res = line.split(" ")
# res[0] and res[-1]
If you want every third element, you can use:
res[::3]
Otherwise, if you don't have a specific pattern, you'll need to manually extract elements by their index.
See the split documentation for more details.
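For example, on a made-up longer line:
res = "this is a longer example line with words".split()
print(res[0], res[-1])  # this words
print(res[::3])         # ['this', 'longer', 'with']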
If I've understood your question, you can try this:
s = "this is a very very very veeeery foo bar bazzed looong string"
splitted = s.split() # splitted is a list
splitted[0] # first element
splitted[-1] # last element
str.split() returns a list of the words in the string, using sep as the delimiter string. ... If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
In that way you can get the first and the last words of your string.
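The difference matters when words are separated by runs of spaces; compare:
>>> "this  is  important".split(" ")
['this', '', 'is', '', 'important']
>>> "this  is  important".split()
['this', 'is', 'important']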
For multiline text (with re.search() function):
import re

with open('yourfile.txt', 'r') as f:
    result = re.search(r'^(\w+).+?(\w+)$', f.read(), re.M)

a, b = result.group(1), result.group(2)
print(a, b)
The output:
this important
I am trying to remove all parenthetical comments that are in a text file. Here is a very brief example called "sample.txt":
Sentence one (comment 1). Second sentence (second comment).
I would like it to look like this instead:
Sentence one . Second sentence .
I have tried re.sub in the form below, but can only get it to work for strings, not text files. Here is one of the many things I've tried:
intext = 'C:\\Users\\Sarah\\PycharmProjects\\pythonProject1\\sample.txt'
outtext = 'C:\\Users\\Sarah\\PycharmProjects\\pythonProject1\\EDITEDsample.txt'
with open(intext, 'r') as f, open(outtext, 'w') as fo:
    for line in f:
        fo.write(re.sub(r'\([^()]*\)', ''))
This doesn't get me an error message but it also doesn't do anything to the text.
with open(intext, 'r') as f, open(outtext, 'w') as fo:
    for line in f:
        fo.write(line.replace('(', " ").replace(')', " "))
This successfully removes the parentheses, but since .replace doesn't handle regex, I don't see how I can use it to also remove the text between the parentheses.
I also tried
with open(intext, 'r') as f, open(outtext, 'w') as fo:
    for line in f:
        re.sub(r'\([^()]*\)', '', outtext)
but I get an error saying I'm missing a string, which is expected since re.sub requires strings. What can I use to remove/replace parenthetical comments from a TEXT file?
re.sub takes three parameters (pattern, replacement, string), and the result must be assigned back to a variable:
import re

# Read entire file as a string into a variable
with open('input.txt') as f:
    data = f.read()

# Replace all parenthesized items.
# Make sure to use a non-greedy match or it will replace everything
# from the very first parenthesis to the very last parenthesis.
# Note that this will NOT handle nested parentheses correctly,
# e.g. "a (commented (nested)) sentence" -> "a ) sentence".
# Regular expressions don't handle nesting well. Use a parser for that.
data = re.sub(r'\(.*?\)', '', data)

# Write the file back out
with open('output.txt', 'w') as f:
    f.write(data)
Input file:
sentence (comment) sentence (comment)
sentence (comment) sentence (comment)
sentence (comment) sentence (comment)
Output file:
sentence sentence
sentence sentence
sentence sentence
The thing with regular expressions is that they can't handle nested parentheses; something like "(())" is bound to fail. So, assuming each '(' is eventually followed by a ')', it would be better to handle it with a counter:
infile = ["Sentence one (comment 2). Second sentence (second comment).\n",
          "Sentence one (comment 2). Second sentence (second (comment)((((((())))))))."]

open_parenthese_counter = 0
for line in infile:
    for char in line:
        if char == '(':
            open_parenthese_counter += 1
        elif char == ')':
            open_parenthese_counter -= 1  # or: max(open_parenthese_counter - 1, 0)
        elif open_parenthese_counter == 0:
            print(char, end='')  # write into the output file
and then adapt it as you see fit.
OUTPUT:
Sentence one . Second sentence .
Sentence one . Second sentence .
You may use the newer regex module with
\((?:[^()]+|(?R))+\)
See a demo on regex101.com.
In Python this could be
import regex as re
rx = re.compile(r'\((?:[^()]+|(?R))+\)')
data = """
Sentence one (comment 1). Second sentence (second comment).
Or even ((nested) ones).
"""
data = rx.sub('', data)
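Printing the result for the sample above gives:
print(data)
# Sentence one . Second sentence .
# Or even .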
Hello, I am writing a Python program that reads through a given .txt file and looks for keywords. Once I have found my keyword (for example, 'data'), I would like to print out the entire sentence the word is associated with.
I have read in my input file and used the split() method to get rid of spaces, tabs and newlines and put all the words into a list.
Here is the code I have thus far.
text_file = open("file.txt", "r")
lines = text_file.read().split()

keyword = 'data'
for token in lines:
    if token == keyword:
        # I have found my keyword; what methods can I use to
        # print out the words before and after the keyword?
        # I have a feeling I want to use '.' as a marker for sentences
        print(sentence)  # should print the entire sentence
file.txt reads as follows:
Welcome to SOF! This website securely stores data for the user.
desired output:
This website securely stores data for the user.
We can just split the text on the characters that mark a sentence ending and then loop through the resulting lines, printing those that contain our keyword.
To split text on multiple characters (a sentence ending can be marked with !, ? or .) we can use a regex:
import re

keyword = "data"
line_end_chars = "!", "?", "."
example = "Welcome to SOF! This website securely stores data for the user?"

regexPattern = '|'.join(map(re.escape, line_end_chars))
line_list = re.split(regexPattern, example)
# line_list looks like this:
# ['Welcome to SOF', ' This website securely stores data for the user', '']

# Now we just need to see which lines have our keyword
for line in line_list:
    if keyword in line:
        print(line)
But keep in mind that if keyword in line: matches a sequence of characters, not necessarily a whole word; for example, 'data' in 'datamine' is True. If you only want to match whole words, you ought to use regular expressions.
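For instance, \b in a pattern anchors at a word boundary:
import re

# \b matches at a word boundary, so 'data' inside 'datamine' won't match
print(bool(re.search(r'\bdata\b', 'datamine')))     # False
print(bool(re.search(r'\bdata\b', 'stores data')))  # True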
My approach is similar to Alberto Poljak's but a little more explicit.
The motivation is to realise that splitting on words is unnecessary: Python's in operator will happily find a word in a sentence. What is necessary is splitting into sentences. Unfortunately, sentences can end with ., ? or !, and Python's str.split does not allow multiple separators, so we have to get a little complicated and use re.
re requires us to put a | between the delimiters and to escape some of them, because both . and ? have special meanings by default. Alberto's solution uses re itself to do all this, which is definitely the way to go. But if you're new to re, my hard-coded version might be clearer.
The other addition I made was to put each sentence's trailing delimiter back on the sentence it belongs to. To do this I wrapped the delimiters in (), which makes re.split keep them in the output. I then used zip to pair each sentence with the delimiter that followed it: the 0::2 slice takes every even index (the sentences) and the 1::2 slice takes every odd index (the delimiters). Uncomment the print statement to see what's happening.
import re

lines = "Welcome to SOF! This website securely stores data for the user. Another sentence."
keyword = "data"

sentences = re.split(r'(\.|!|\?)', lines)
sentences_terminated = [a + b for a, b in zip(sentences[0::2], sentences[1::2])]
# print(sentences_terminated)

for sentence in sentences_terminated:
    if keyword in sentence:
        print(sentence)
        break
Output:
This website securely stores data for the user.
This solution uses a fairly simple regex to find your keyword in a sentence, with words that may or may not appear before and after it, up to a final period. It copes with the surrounding spaces, and it's only one execution of re.search().
import re

text_file = open("file.txt", "r")
text = text_file.read()

keyword = 'data'
match = re.search(r"\s?(\w+\s)*" + keyword + r"\s?(\w+\s?)*\.", text)
print(match.group().strip())
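As a small defensive addition: re.search() returns None when the keyword isn't in the text, so real code should check the result before calling .group():
if match:
    print(match.group().strip())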
Another solution:
def check_for_stop_punctuation(token):
    stop_punctuation = ['.', '?', '!']
    for p in stop_punctuation:
        if token.find(p) > -1:
            return True
    return False

text_file = open("file.txt", "r")
lines = text_file.read().split()

keyword = 'data'
sentence = []

i = 0
while i < len(lines):
    token = lines[i]
    sentence.append(token)
    if token == keyword:
        # collect tokens until one carries sentence-ending punctuation
        found_stop_punctuation = check_for_stop_punctuation(token)
        while not found_stop_punctuation:
            i += 1
            token = lines[i]
            sentence.append(token)
            found_stop_punctuation = check_for_stop_punctuation(token)
        print(sentence)
        sentence = []
    elif check_for_stop_punctuation(token):
        # sentence ended without the keyword; start collecting afresh
        sentence = []
    i += 1
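Note that this prints sentence as a Python list; if you want plain text, join the tokens when printing:
print(' '.join(sentence))
# This website securely stores data for the user.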
I have a list containing the lines of a file.
list1[0] = "this is the first line"
list1[1] = "this is the second line"
I also have a string.
example="TTTTTTTaaaaaaaaaabcccddeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeefffff"
I want to replace list1[0] with the string (example). However I want to keep the word lengths, so the new list1[0] should be "TTTT TT Taa aaaaa aaab". The only solution I could come up with was to turn the string example into a list and use a for loop to read letter by letter from the string list into the original list.
for line in open(input, 'r'):
    list1[i] = listString[i]
    i = i + 1
However, from what I understand, this does not work because Python strings are immutable? What's a good way for a beginner to approach this problem?
I'd probably do something like:
orig = "this is the first line"
repl = "TTTTTTTaaaaaaaaaabcccddeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeefffff"

def replace(orig, repl):
    r = iter(repl)
    # keep spaces; every other character is drawn from repl in order
    result = ''.join([' ' if ch.isspace() else next(r) for ch in orig])
    return result
If repl could be shorter than orig, consider r = itertools.cycle(repl)
This works by creating an iterator out of the replacement string, then iterating over the original string, keeping the spaces, but using the next character from the replacement string instead of any non-space characters.
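For the strings above, calling it gives (by my count of the word lengths):
print(replace(orig, repl))
# TTTT TT Taa aaaaa aaab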
The other approach you could take would be to note the indexes of the spaces in one pass through orig, then insert spaces at those indexes in a pass over repl and return a slice of the result:
def replace(orig, repl):
    spaces = [idx for idx, ch in enumerate(orig) if ch.isspace()]
    repl = list(repl)
    for idx in spaces:
        repl.insert(idx, " ")  # add a space at that index
    return ''.join(repl[:len(orig)])
However, I can't imagine the second approach being any faster, it is certain to be less memory-efficient, and I don't find it easier to read (in fact I find it harder to read!). It also doesn't have a simple workaround if repl is shorter than orig (I guess you could do repl *= 2, but that's uglier than sin and still doesn't guarantee it'll work).
I have a text file test.txt which has in it 'a 2hello 3fox 2hen 1dog'.
I want to read the file, add all the items to a list, and then strip the digits so the list ends up as 'a hello fox hen dog'.
I tried this, but my code is not working; the result is ['a 2hello 3fox 2hen 1dog']. Thanks.
newList = []
filename = input("Enter a file to read: ")
openfile = open(filename, 'r')

for word in openfile:
    newList.append(word)

for item in newList:
    item.strip("1")
    item.strip("2")
    item.strip("3")

print(newList)
openfile.close()
From the Python docs:
str.strip([chars])Return a copy of the string with the leading and
trailing characters removed. The chars argument is a string specifying
the set of characters to be removed. If omitted or None, the chars
argument defaults to removing whitespace. The chars argument is not a
prefix or suffix; rather, all combinations of its values are stripped:
strip won't modify the string; it returns a copy of the string with the given characters removed.
>>> text = '132abcd13232111'
>>> text.strip('123')
'abcd'
>>> text
'132abcd13232111'
You can try:
out_put = []
for item in newList:
    out_put.append(item.strip("123"))
If you want to remove all 123 then use regular expression re.sub
import re
newList = [re.sub('[123]', '', word) for word in openfile]
Note: this will remove all 1, 2 and 3 characters from each line.
Pointers:
strip returns a new string, so you need to assign that to something. (better yet, just use a list comprehension)
Iterating over a file object gives you lines, not words;
so instead you can read the whole thing then split on spaces.
The with statement saves you from having to call close manually.
strip accepts multiple characters, so you don't need to call it three times.
Code:
filename = input("Enter a file to read: ")
with open(filename, 'r') as openfile:
    new_list = [word.strip('123') for word in openfile.read().split()]
print(new_list)
This will give you a list that looks like ['a', 'hello', 'fox', 'hen', 'dog']
If you want to turn it back into a string, you can use ' '.join(new_list)
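For the sample file that gives:
print(' '.join(new_list))
# a hello fox hen dog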
There are several strip variants in Python (strip, lstrip, rstrip); they remove the specified characters from the ends of a string. In your case you could use lstrip or just strip:
s = 'a 2hello 3fox 2hen 1dog'
' '.join([word.strip('0123456789') for word in s.split()])
Output:
'a hello fox hen dog'
A function in Python is called this way:
result = function(arguments...)
This calls function with the arguments and stores the result in result.
If you discard the function call's result, as you do in your case, it is lost.
Another way to use it:
l = []
for x in range(5):
    l.append(" something ")
l = [s.strip() for s in l]
This will remove the surrounding spaces from each string in the list.
I'm trying to start a new line every time I find a word beginning with a capital letter. Here is my code:
import re

def new_line(name):
    fr = open(name, 'r')
    string = fr.read()
    new_list = []
    fw = open('output', 'w')
    c = 0
    m = re.findall('\s+[A-Z]\w+', string, re.MULTILINE)
    for i in m:
        j = str(i)
        l = re.sub('[A-Z]\w+', '\n' + str(m[c]), string, re.MULTILINE)
        c = c + 1
        print("These are the list items:" + j + "\n")
    print("STRINGY STRING BELOW!!!")
    print(string)
    print('/////////////////////////////////////////////')
    print("Output :\n" + l)
    print(m)

new_line('task.txt')
Desired output should be something like this:
These are the list items: Miss
These are the list items: Catherine
.
.
.
These are the list items: Heathcliff
And then the text with the new lines added. But instead of every match being replaced with a \n plus the match itself, each one is replaced with only the last item from the list m, like this:
Output :
I got
Heathcliff
Heathcliff and myself to
Heathcliff
Heathcliff; and, to my agreeable disappointment, she behaved infinitely better than I dared to expect.
Heathcliff seemed almost over-fond of
Heathcliff.
Heathcliff; and even to his sister she showed plenty of affection.
I didn't post the original input text as it's too long.
You could try this. It just prefixes each word that starts with a capital letter with \n.
>>> re.sub(r'\s+([A-Z])','\n\g<1>', "Heathcliff and myself to Heathcliff; to my")
'Heathcliff and myself to\nHeathcliff; to my'
Here is my approach: use re.sub to search for whitespace followed by a capital letter, and replace the match with a newline plus the captured letter.
with open(name) as infile, open('output', 'w') as outfile:
    contents = infile.read()
    new_contents = re.sub(r'\s+([A-Z])', r'\n\1', contents)
    outfile.write(new_contents)
Notes:
The parentheses in the pattern tell re to remember the text they match.
The \1 in the replacement text refers to that remembered text.
Since m contains every capitalized word from the document, and each loop iteration replaces all such words in the full text with m[c], only the last iteration's result survives in l; that is why everything ends up as the last name in the list.
Try stopping the loop at c = 1, c = 2, etc., and you will find all the names become the one at that position in the list.
re.sub() replaces all non-overlapping occurrences of your pattern.
What does that mean? See the following example:
import re

test_str = 'spam spam spam'
print(re.sub('spam', 'beans', test_str))
will print
beans beans beans
What this means is that your code is replacing all occurrences of capitalized words in the string with your last match. That is why you're seeing 'Heathcliff' everywhere: it was the last capitalized word in your text.