I have a txt file containing one sentence per line, and there are lines containing numbers attached to letters. For instance:
The boy3 was strolling on the beach while four seagulls appeared flying.
There were 3 women sunbathing as well.
All children were playing happily.
I would like to remove lines like the first one (i.e. those with numbers stuck to words) but not lines like the second, which are properly written.
Has anybody got an idea?
You can use a simple regex pattern. We start with [0-9]+, which matches one or more digits, so 6, 56, or 56790 all match. To detect sentences that have a number attached to a word, you can use something like this: ([a-zA-Z][0-9]+)|([0-9]+[a-zA-Z]). This matches a letter immediately before or after a run of digits. You can search strings using:
import re

lines = [
    'The boy3 was strolling on the beach while 4 seagulls appeared flying.',
    'There were 3 women sunbathing as well.',
]

for line in lines:
    res = re.search("([a-zA-Z][0-9]+)|([0-9]+[a-zA-Z])", line)
    if res is None:
        print(line)  # no match: keep the line; matching lines are dropped
You can add more characters to the allowed letter class if your sentences include accented or other special characters.
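For example, to also treat accented letters as letters (a sketch; the À-ÿ range below is only an illustration and should be adjusted to your language):

```python
import re

# extend the letter class to cover accented Latin-1 letters as well;
# À-ÿ is a rough illustrative range, not an exact letter class
pattern = r"([a-zA-ZÀ-ÿ][0-9]+)|([0-9]+[a-zA-ZÀ-ÿ])"

print(bool(re.search(pattern, "Le garçon3 marchait.")))  # True: digit stuck to a word
print(bool(re.search(pattern, "Il y avait 3 femmes.")))  # False: digit stands alone
```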
Suppose your input text is stored in a file in.txt; you can use the following code:
import re

with open("in.txt", "r") as f:
    for line in f:
        if not re.search(r'(?!\d)[\w]\d|\d(?!\d)[\w]', line, flags=re.UNICODE):
            print(line, end="")
The pattern (?!\d)[\w] looks for word characters (\w) excluding digits. The idea is borrowed from https://stackoverflow.com/a/12349464/2740367
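A quick check of that pattern against the example sentences (a sketch):

```python
import re

# word char (not a digit) followed by a digit, or a digit followed by
# a word char (not a digit)
pat = r'(?!\d)[\w]\d|\d(?!\d)[\w]'

print(bool(re.search(pat, "The boy3 was strolling on the beach.")))    # True
print(bool(re.search(pat, "There were 3 women sunbathing as well.")))  # False
```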
Related
I'm trying to find the biggest sentence in a text file. I'm using the dot (.) to define the beginning and end of sentences. The text file doesn't have special punctuation (like ?! etc.).
My code currently only returns the first letter of my text file. I'm not sure why.
def recherche(source):
    "find the biggest sentence"
    fs = open(source, "r")
    while 1:
        txt = fs.readline()
        if txt == "":
            break
        else:
            grande_phrase = max(txt, key=len)
            print(grande_phrase)
    fs.close()

recherche("for92.txt")
Your current code reads each line, and finds the max of that line. Since a string is just a collection of characters, your expression max(txt, key=len) gives you the character in txt that has the maximum length. Since all characters have a length of 1, you just get the first character of the line.
You want to create a list of all sentences, and then use max on that list. There seems to be no guarantee that your input file will have one sentence per line. Since you use a period to define where a sentence ends, you're going to have to split the entire file at . to get your list of sentences. Keep in mind that this is not a foolproof strategy to split any text into sentences, since you risk splitting at other occurrences of ., such as a decimal point or an abbreviation.
def recherche(source):
    "find the biggest sentence"
    with open(source, "r") as fs:
        sentences = fs.read().split(".")
        grande_phrase = max(sentences, key=len)
        print(grande_phrase)
With an input file that looks like so:
It was the best of times. It was the worst of times. It was the age of wisdom. It was the age of foolishness. It was the epoch of belief. It was the epoch of incredulity. It was the season of light. It was the season of darkness. It was the spring of hope. It was the winter of despair.
we get the output:
It was the epoch of incredulity
Try it online! Note: I replaced the file with an io.StringIO to make it work on tio.run.
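As mentioned above, splitting on every . breaks numbers like 3.14; a lookaround-based split is one possible refinement (a sketch that still fails on abbreviations like "Mr."):

```python
import re

text = "Pi is roughly 3.14. It was the age of wisdom. The end."

# split on a period only when it is not sandwiched between two digits,
# so 3.14 stays intact (abbreviations would still cause a split)
sentences = [s.strip() for s in re.split(r"(?<!\d)\.|\.(?!\d)", text) if s.strip()]
print(sentences)
# ['Pi is roughly 3.14', 'It was the age of wisdom', 'The end']
```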
I have tried the code below; any suggestions and help are appreciated. To be more specific, I want to create a Python program that can identify and count the acronyms in a text file. The output should display every acronym present in the specified text file and how many times each one occurred.
Note: the code below is not giving the desired output.
Link to the text file, so you can have a look: https://drive.google.com/file/d/1zlqsmJKqGIdD7qKicVmF0W6OgF5-g7Qk/view?usp=sharing
This text file contains various acronyms. I basically want to write a Python script to identify those acronyms and count how many times each occurs. The acronyms are of various types: they can be 2 or more letters, and can be lowercase or uppercase. For further reference about the acronyms, please have a look at the text file on Google Drive.
Any updated code is also appreciated.
acronyms = 0  # number of acronyms

# open the file in read mode
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())

import re
print(re.sub("([a-zA-Z]\.*){2,}s?", "", text))

for line in text:  # for every line in file
    for word in line.split(' '):  # for every word in line
        if word.isupper():  # if word is all uppercase letters
            acronyms += 1

print("Number of acronyms:", acronyms)  # print number of acronyms
In building a small text file and then trying out your code, I came up with a couple of tweaks to your code to simplify it and still acquire the count of words within the text file that are all uppercase letters.
acronyms = 0  # number of acronyms

# open the file in read mode
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())

for word in text.split(' '):  # for every word in the text
    if word.isupper() and word.isalpha():  # word is all uppercase letters
        acronyms += 1

print("Number of words that are all uppercase:", acronyms)
First off, a simple loop runs through the words split out of the read text, and the program checks that each word is alphabetic and that all of its letters are uppercase.
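A quick illustration of what that check accepts and rejects (note that trailing punctuation makes a word fail isalpha()):

```python
print("NASA".isupper() and "NASA".isalpha())  # True: all letters, all uppercase
print("JPL.".isupper() and "JPL.".isalpha())  # False: the period fails isalpha()
print("and".isupper() and "and".isalpha())    # False: lowercase letters
```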
To test, I built a small text file with some words all in uppercase.
NASA and UCLA have teamed up with the FBI and JPL.
also UNICEF and the WWE have teamed up.
With that, there should be five words that are all uppercase.
And, when run, this was the output on the terminal.
#Una:~/Python_Programs/Acronyms$ python3 Acronym.py
Number of words that are all uppercase: 5
You will note that I am being a bit pedantic here referring to the count of "uppercase" words and not calling them acronyms. I am not sure if you are attempting to actually derive true acronyms, but if you are, this link might help:
Acronyms
Give that a try to see if it meets the spirit of your project.
Answer to the question:
acronyms = 0       # number of acronyms
acronym_word = []  # list of acronyms found

# open the file in read mode
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())

for word in text.split(' '):  # for every word in the text
    if word.isupper() and word.isalpha():  # word is all uppercase letters
        acronyms += 1
        if len(word) == 1:
            pass  # ignore single characters found in the file; they are not acronyms
        else:
            acronym_word.append(word)  # store every acronym found in a list

uniqWords = sorted(set(acronym_word))  # remove duplicates and sort the list of acronyms
for word in uniqWords:
    print(word, ":", acronym_word.count(word))
From your comments, it sounds like every acronym appears at least once as an all-uppercase word, then can appear several more times in lowercase.
I suggest making two passes on the text: a first time to collect all uppercase words, and a second pass to search for every occurrence, case-insensitive, of the words you collected on the first pass.
You can use collections.Counter to quickly count words.
You can use ''.join(filter(str.isalpha, word.lower())) to strip a word of its non-alphabetical characters and disregard its case.
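A quick demonstration of that stripping trick:

```python
# lowercase the word, keep only its alphabetic characters
word = "C.R.,"
cleaned = ''.join(filter(str.isalpha, word.lower()))
print(cleaned)  # cr
```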
In the code snippet below, I used io.StringIO to emulate opening a text file.
from io import StringIO
from collections import Counter

text = '''First we have CR and PU as uppercase words. A word which first
appeared as uppercase can also appear as lowercase.
For instance, cr and pu appear in lowercase, and pu appears again.
And again: here is a new occurrence of pu.
An acronym might or might not have punctuation or numbers in it: CR-1,
C.R., cr.
A word that contains only a single letter will look like an acronym
if it ever appears as the first word of a sentence.'''

# with open('path/to/file.txt', 'r') as f:
with StringIO(text) as f:
    counts = Counter(''.join(filter(str.isalpha, word.lower()))
                     for line in f for word in line.split())
    f.seek(0)
    uppercase_words = set(''.join(filter(str.isalpha, word.lower()))
                          for line in f
                          for word in line.split() if word.isupper())

acronyms = Counter({w: c for w, c in counts.items() if w in uppercase_words})
print(acronyms)
# Counter({'cr': 5, 'a': 5, 'pu': 4})
Very new to Python.
Problem: I have a csv file that contains rows with alpha-numeric text, and I want to remove all English words. For example, an input is: "Steam traps on Steam to 56X-233 Butane Vaporizer"
and the desired output is just: "56X-233"
Is the answer like removing stop words with NLTK?
Thank you.
If you don't care about matching actual words you can use a regex to match any word with no numbers in it:
import re

def remove_words(line):
    # Remove words containing only letters
    line = re.sub(r"\b[A-Za-z]+\b", "", line)
    # Remove remaining extra spaces
    return re.sub(" +", " ", line).strip()

print(remove_words("Steam traps on Steam to 56X-233 Butane Vaporizer"))
To do this to an entire file, you just need to grab each line of the file and run the above code on it:
with open("my_file.txt") as f:
    for line in f.readlines():
        print(remove_words(line))
I've got the text file for the Dracula novel and I want to count the number of lower case letters contained within it. The code I've got executes without a problem but prints out 4297. I'm not sure where I went wrong and hoped you guys could point out my issue here. Thank you!
Indentation isn't necessarily reflective of what I see on my text editor
def main():
    book_file = open('dracula.txt', 'r')
    lower_case = sum(map(str.islower, book_file))
    print(lower_case)
    book_file.close()

main()
expected: 621607
results: 4297
When you iterate over a file, you get a line as a value on each iteration. Your current code would be correct if it was running on characters, not lines. When you call islower on a longer string (like a line from a book), it only returns True if all the letters in the string are lowercase.
In your copy of Dracula, there are apparently 4297 lines that contain no capital letters, so that's the result you're getting. The much larger number is the count of characters.
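To see the difference concretely:

```python
line = "The vampire slept.\n"
print(line.islower())               # False: 'T' is uppercase
print(sum(map(str.islower, line)))  # 14 -- lowercase characters counted one by one
```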
You can fix your code by adding an extra step to read the file as a single large string, then iterating on that.
def main():
    with open('dracula.txt', 'r') as book_file:
        text = book_file.read()
        lower_case = sum(map(str.islower, text))
        print(lower_case)

main()
I also modified your code slightly by using a with statement to handle closing the file. This is nice because it will always close the file when it exits the intended block, even if something has gone wrong and an exception has been raised.
You can use regex to count the lower-case and upper-case characters
import re

text = "sdfsdfdTTsdHSksdsklUHD"

lowercase = len(re.findall("[a-z]", text))
uppercase = len(re.findall("[A-Z]", text))

print(lowercase)
print(uppercase)
Outputs:
15
7
And you will need to change how you read the file to
text = open("dracula.txt").read()
with open('dracula.txt', 'r') as book_file:
    count = 0
    # for each line in the file, count the lowercase letters
    # and add them to the variable "count"
    for line in book_file:
        count += sum(map(str.islower, line))
    print("number of lower case letters = " + str(count))
Here is a version that uses a list comprehension rather than map()
It iterates over the characters in the text and creates a list of all lowercase characters. The length of this list is the number of lowercase letters in the text.
with open('dracula.txt') as f:
    text = f.read()

lowers = [char for char in text if char.islower()]
print(len(lowers))
I am fairly new to files in Python and want to find the words in a file that have, say, 8 letters in them, print them, and keep a running total of how many there actually are. Can you look through a file as if it were one very large string, or is there a specific way it has to be done?
You could use Python's Counter for doing this:
from collections import Counter
import re

with open('input.txt') as f_input:
    text = f_input.read().lower()

words = re.findall(r'\b(\w+)\b', text)
word_counts = Counter(w for w in words if len(w) == 8)

for word, count in word_counts.items():
    print(word, count)
This works as follows:
It reads in a file called input.txt, as one very long string.
It then converts it all to lowercase to make sure the same words with different case are counted as the same word.
It uses a regular expression to split all of the text into a list of words.
It uses a generator expression to feed any word that has a length of 8 characters into a Counter.
It displays all of the matching entries along with the counts.
Try this code, where "eight_l_words" is an array of all the eight letter words and where "number_of_8lwords" is the number of eight letter words:
# defines text to be used
your_file = open("file_location", "r+")
text = your_file.read()
your_file.close()

# divides the text into lines and defines some arrays
lines = text.split("\n")
words = []
eight_l_words = []

# iterate through "lines", adding each separate word to the "words" array
for each in lines:
    words += each.split(" ")

# check whether each word in the "words" array is 8 chars long, and if so
# append that word to the "eight_l_words" array
for each in words:
    if len(each) == 8:
        eight_l_words.append(each)

# find the number of eight-letter words
number_of_8lwords = len(eight_l_words)

# display results
print(eight_l_words)
print("There are " + str(number_of_8lwords) + " eight letter words")
Running the code with
text = "boomhead shot\nshamwow slapchop"
Yields the results:
['boomhead', 'slapchop']
There are 2 eight letter words
There's a useful post from 2 years ago called "How to split a text file to its words in python?"
How to split a text file to its words in python?
It describes splitting the line by whitespace. If you got punctuation such as commas and fullstops in there then you'll have to be a bit more sophisticated. There's help here: "Python - Split Strings with Multiple Delimiters" Split Strings with Multiple Delimiters?
You can use the function len() to get the length of each individual word.
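Putting those pieces together, here is a minimal sketch (using a sample string in place of the file contents read with open(...).read()):

```python
import re

text = "boomhead shot.\nshamwow, slapchop"  # sample text standing in for the file

# pull out runs of letters so commas and full stops don't stick to words
words = re.findall(r"[A-Za-z]+", text)

eight_letter = [w for w in words if len(w) == 8]
print(eight_letter)       # ['boomhead', 'slapchop']
print(len(eight_letter))  # 2
```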