My code is not very good but this is a really interesting problem. When looking for a forward slash in a string all are found except for if the forward slash is in the last word in the file. Here is my code.
#!/usr/bin/python
import sys
if len(sys.argv)!=2:
print "usage: %s filename\n" % (sys.argv[0]);
exit(0);
f = open(sys.argv[1]);
lines = [i for i in f.readlines()]
finals = [];
for line in lines:
words = line.split(",");
for word in words:
if word.find("/") != -1:
datefixes = word.split("/")
if datefixes[2].__len__() == 4:
temp = datefixes[2]
word = datefixes[0] + "-" + datefixes[1] + "-" + temp[-2:]
finals += "," + word;
tempstring = ''.join(finals)
finallist = tempstring.split("\r\n")
finalstring = ""
for tmpstrpart in finallist:
if tmpstrpart != "" or tmpstrpart !="\r\n":
finalstring += tmpstrpart[1:] + "\r\n"
print finalstring
and here is a sample input
ACPVBF,1930-729,Z729,12/16/2014,6/10/2008,1/5/2003,44-48-46,39-43-41,35-39-37,29-33-31
ACPVGT,1930-729,Z729,25-29-27,19-23-21,14-18-16,7/11/2009,2/6/2004,48-2-0,42-46-44
ACPUQH,1930-729,Z729,32-40-19,26-34-13,21-29-8,14-22-1,9/17/1946,5/13/1942,49-7-36
ACPVOU,1930-729,Z729,42-0-29,36-44-23,31-39-18,24-32-11,19-27-6,15-23-2,9/17/1946
in the code these lines are split by commas. if the word at the end contains a / the forward slash is not found. but only if it is at the end. the rest work fine.
edit: The output I am currently getting on these lines is:
ACPVBF,1930-729,Z729,12-16-14,6-10-08,1-5-03,44-48-46,39-43-41,35-39-37,29-33-31
ACPVGT,1930-729,Z729,25-29-27,19-23-21,14-18-16,7-11-09,2-6-04,48-2-0,42-46-44
ACPUQH,1930-729,Z729,32-40-19,26-34-13,21-29-8,14-22-1,9-17-46,5-13-42,49-7-36
ACPVOU,1930-729,Z729,42-0-29,36-44-23,31-39-18,24-32-11,19-27-6,15-23-2,9/17/1946
the output that I am trying to get from these lines is:
ACPVBF,1930-729,Z729,12-16-14,6-10-08,1-5-03,44-48-46,39-43-41,35-39-37,29-33-31
ACPVGT,1930-729,Z729,25-29-27,19-23-21,14-18-16,7-11-09,2-6-04,48-2-0,42-46-44
ACPUQH,1930-729,Z729,32-40-19,26-34-13,21-29-8,14-22-1,9-17-46,5-13-42,49-7-36
ACPVOU,1930-729,Z729,42-0-29,36-44-23,31-39-18,24-32-11,19-27-6,15-23-2,[9-17-46]
I want the one with the brackets around it to change also.
final working code based on BrenBarn's answer:
#!/usr/bin/python
import sys
import re
if len(sys.argv)!=2:
print "usage: %s filename\n" % (sys.argv[0]);
exit(0);
f = open(sys.argv[1]);
x = f.read()
f.close()
filename = sys.argv[1]
filename = filename[:-4] + " finished.csv"
f = open(filename, 'w')
f.write(re.sub(r'(\d{1,2})/(\d{1,2})/\d{2}(\d{2})', r'\1-\2-\3', x))
f.close()
Thanks for all the help. Sorry I can't upvote yet.
I'm not sure what the problem is, but I think what you're trying to do can be accomplished way more easily by just using a regular expression.
>>> print x
ACPVBF,1930-729,Z729,12/16/2014,6/10/2008,1/5/2003,44-48-46,39-43-41,35-39-37,29-33-31
ACPVGT,1930-729,Z729,25-29-27,19-23-21,14-18-16,7/11/2009,2/6/2004,48-2-0,42-46-44
ACPUQH,1930-729,Z729,32-40-19,26-34-13,21-29-8,14-22-1,9/17/1946,5/13/1942,49-7-36
ACPVOU,1930-729,Z729,42-0-29,36-44-23,31-39-18,24-32-11,19-27-6,15-23-2,9/17/1946
>>> print re.sub(r'(\d{1,2})/(\d{1,2})/\d{2}(\d{2})', r'\1-\2-\3', x)
ACPVBF,1930-729,Z729,12-16-14,6-10-08,1-5-03,44-48-46,39-43-41,35-39-37,29-33-31
ACPVGT,1930-729,Z729,25-29-27,19-23-21,14-18-16,7-11-09,2-6-04,48-2-0,42-46-44
ACPUQH,1930-729,Z729,32-40-19,26-34-13,21-29-8,14-22-1,9-17-46,5-13-42,49-7-36
ACPVOU,1930-729,Z729,42-0-29,36-44-23,31-39-18,24-32-11,19-27-6,15-23-2,9-17-46
I think the issue with your code is the trailing new line. My guess is it would fail for last word in each line. I would suggest you do a line.strip() before splitting
Related
New here!
I am searching for the following or the next word for the word "I". Ex "I am new here" -> the next word is "am".
import re
word = 'i'
with open('tedtalk.txt', 'r') as words:
pat = re.compile(r'\b{}\b \b(\w+)\b'.format(word))
print(pat.findall(words))
with open('tedtalk.txt','r') as f:
for line in f:
phrase = 'I'
if phrase in line:
next(f)
These are the codes i have developed so far, but i am kind of stuck already. Thanks in advance!
you have 2 options.
first, with split
with open('tedtalk.txt','r') as f:
data = f.read()
search_word = "I"
list_of_words = data.split()
next_word = list_of_words[list_of_words.index(search_word) + 1]
second, with regex:
import re
regex = re.compile(r"\bI\b\s*?\b(\w+)\b")
with open('tedtalk.txt','r') as f:
data = f.readlines()
result = regex.findall(data)
In your first piece of code, words is a file object, and there will be problems with line-by-line verification. For example, in the following case, am2 may not be found.
tedtalk.txt
I am1 new here, I
am2 new here, I am3 new here
So I modified the program and read 4096 bytes multiple times to prevent the file from being too large and causing the memory to explode.
In order to prevent the data being truncated causing it to be missed, the I will be looked for from the end of the data for a single read, and if found, the data following it will be truncated and put in front of the next read.
import re
regex = re.compile(r"\bI\b\s*?\b(\w+)\b")
def find_index(data, target_value="I"):
"""Look for spaces from the back, the intention is to find the value between two space blocks and check if it is equal to `target_value`"""
index = once_read_data.rfind(" ")
if index != -1:
index2 = index
while True:
index2 = once_read_data.rfind(" ", 0, index2)
if index2 == -1:
break
t = index - index2
# two adjacent spaces
if t == 1:
continue
elif t == 2 and once_read_data[index2 + 1: index] == target_value:
return index2
result = []
with open('tedtalk.txt', 'r') as f:
# Save data that might have been truncated last time.
prev_data = ""
while True:
once_read_data = prev_data + f.read(4096)
if not once_read_data:
break
index = find_index(once_read_data)
if index is not None:
# Slicing based on the found index.
prev_data = once_read_data[index:]
once_read_data = once_read_data[:index]
else:
prev_data = ""
result += regex.findall(once_read_data)
print(result)
Output:
['am1', 'am2', 'am3']
search_word = 'I'
prev_data = ""
result = []
with open('tedtalk.txt', 'r') as f:
while True:
data = prev_data + f.readline()
if data == prev_data: # reached eof
break
list_of_words = data.split()
for word_pos, word in enumerate(list_of_words[:-1]):
if word == search_word:
result.append(list_of_words[word_pos+1])
prev_data = list_of_words[-1] + ' '
print(result)
I modified the code to read the text file by line, this should handle unlimited large file. The code also addresses the case where the search word is the last word on a line by taking as next word the first word of the next line.
If you rather treat each line independently and ignore the search word if it is the last word in the line, the code can be simplified as follows:
search_word = 'I'
result = []
with open('tedtalk.txt', 'r') as f:
while True:
data = f.readline()
if not data: # reached eof
break
list_of_words = data.split()
for word_pos, word in enumerate(list_of_words[:-1]):
if word == search_word:
result.append(list_of_words[word_pos+1])
print(result)
Just started working with python and got stuck with a find and match.
Code is as below
for line in f:
if c in line:
catch = line.split(' ', 1)[1]
return catch
f.close()
so if my line has "TRA trace" and when I input TR, it returns the vale for TRA and not TR. Is there anything that can be done in the if condition to mach the exact input string. Thank you.
You can do:
for line in f:
catch = line.split(' ', 1)
if c in catch: # checks if one of the tokens is c
return catch
# Or
if c == catch[0]: # checks if the first token is c
return catch
And for the Regex you can go for this.
import re
m=re.search('('+c+')',line)
if m.groups()
return m.group(1)
You can use regex to match the exact input string
EX:
import re
def checkStr(word, line):
result = re.search(r'\b'+ word + r'\b', line)
if result:
print "Found!"
print result.group()
else:
print "No Match!!!"
word = 'TR'
line = "TRA trace"
checkStr(word, line) #No Match!!!
word = 'TR'
line = "TR trace"
checkStr(word, line) #Found! TR
def count_spaces(filename):
input_file = open(filename,'r')
file_contents = input_file.read()
space = 0
tabs = 0
newline = 0
for line in file_contents == " ":
space +=1
return space
for line in file_contents == '\t':
tabs += 1
return tabs
for line in file_contents == '\n':
newline += 1
return newline
input_file.close()
I'm trying to write a function which takes a filename as a parameter and returns the total number of all spaces, newlines and also tab characters in the file. I want to try use a basic for loop and if statement but I'm struggling at the moment :/ any help would be great thanks.
Your current code doesn't work because you're combining loop syntax (for x in y) with a conditional test (x == y) in a single muddled statement. You need to separate those.
You also need to use just a single return statement, as otherwise the first one you reach will stop the function from running and the other values will never be returned.
Try:
for character in file_contents:
if character == " ":
space +=1
elif character == '\t':
tabs += 1
elif character == '\n':
newline += 1
return space, tabs, newline
The code in Joran Beasley's answer is a more Pythonic approach to the problem. Rather than having separate conditions for each kind of character, you can use the collections.Counter class to count the occurrences of all characters in the file, and just extract the counts of the whitespace characters at the end. A Counter works much like a dictionary.
from collections import Counter
def count_spaces(filename):
with open(filename) as in_f:
text = in_f.read()
count = Counter(text)
return count[" "], count["\t"], count["\n"]
To support large files, you could read a fixed number of bytes at a time:
#!/usr/bin/env python
from collections import namedtuple
Count = namedtuple('Count', 'nspaces ntabs nnewlines')
def count_spaces(filename, chunk_size=1 << 13):
"""Count number of spaces, tabs, and newlines in the file."""
nspaces = ntabs = nnewlines = 0
# assume ascii-based encoding and b'\n' newline
with open(filename, 'rb') as file:
chunk = file.read(chunk_size)
while chunk:
nspaces += chunk.count(b' ')
ntabs += chunk.count(b'\t')
nnewlines += chunk.count(b'\n')
chunk = file.read(chunk_size)
return Count(nspaces, ntabs, nnewlines)
if __name__ == "__main__":
print(count_spaces(__file__))
Output
Count(nspaces=150, ntabs=0, nnewlines=20)
mmap allows you to treat a file as a bytestring without actually loading the whole file into memory e.g., you could search for a regex pattern in it:
#!/usr/bin/env python3
import mmap
import re
from collections import Counter, namedtuple
Count = namedtuple('Count', 'nspaces ntabs nnewlines')
def count_spaces(filename, chunk_size=1 << 13):
"""Count number of spaces, tabs, and newlines in the file."""
nspaces = ntabs = nnewlines = 0
# assume ascii-based encoding and b'\n' newline
with open(filename, 'rb', 0) as file, \
mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
c = Counter(m.group() for m in re.finditer(br'[ \t\n]', s))
return Count(c[b' '], c[b'\t'], c[b'\n'])
if __name__ == "__main__":
print(count_spaces(__file__))
Output
Count(nspaces=107, ntabs=0, nnewlines=18)
C=Counter(open(afile).read())
C[' ']
In my case tab(\t) is converted to " "(four spaces). So i have modified the
logic a bit to take care of that.
def count_spaces(filename):
with open(filename,"r") as f1:
contents=f1.readlines()
total_tab=0
total_space=0
for line in contents:
total_tab += line.count(" ")
total_tab += line.count("\t")
total_space += line.count(" ")
print("Space count = ",total_space)
print("Tab count = ",total_tab)
print("New line count = ",len(contents))
return total_space,total_tab,len(contents)
I'm relatively new to python and coding and I'm trying to write a code that counts the number of times each different character comes out in a text file while disregarding the case of the characters.
What I have so far is
letters = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o',
'p','q','r','s','t','u','v','w','x','y','z']
prompt = "Enter filename: "
titles = "char count\n---- -----"
itemfmt = "{0:5s}{1:10d}"
totalfmt = "total{0:10d}"
whiteSpace = {' ':'space', '\t':'tab', '\n':'nline', '\r':'crtn'}
filename = input(prompt)
fname = filename
numberCharacters = 0
fname = open(filename, 'r')
for line in fname:
linecount +=1
word = line.split()
word += words
for word in words:
for char in word:
numberCharacters += 1
return numberCharacters
Somethings seems wrong about this. Is there a more efficient way to do my desired task?
Thanks!
from collections import Counter
frequency_per_character = Counter(open(filename).read().lower())
Then you can display them as you wish.
A better way would be to use the str methods such as isAlpha
chars = {}
for l in open('filename', 'rU'):
for c in l:
if not c.isalpha(): continue
chars[c] = chars.get(c, 0) + 1
And then use the chars dict to write the final histogram.
You over-complicating it, you can just convert your file content to a set in order to eliminate duplicated characters:
number_diff_value = len(set(open("file_path").read()))
Firt thing I'd like to say is this place has helped me more than I could ever repay. I'd like to say thanks to all that have helped me in the past :).
I am trying to devide up some text from a specific style message. It is formated like this:
DATA|1|TEXT1|STUFF: some random text|||||
DATA|2|TEXT1|THINGS: some random text and|||||
DATA|3|TEXT1|some more random text and stuff|||||
DATA|4|TEXT1|JUNK: crazy randomness|||||
DATA|5|TEXT1|CRAP: such random stuff I cant believe how random|||||
I have code shown below that combines the text adding a space between words and adds it to a string named "TEXT" so it looks like this:
STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random
I need it formated like this:
DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|CRAP: |||||
DATA|9|NEWTEXT|such random stuff I cant believe how random|||||
The line numbers are easy, I have that done as well as the carraige returns. What I need is to grab "CRAP" and change the part that says "TEXT1" to "NEWTEXT".
My code scans the string looking for keywords then adds them to their own line then adds text below them followed by the next keyword on its own line etc. Here is my code I have so far:
#this combines all text to one line and adds to a string
while current_segment.move_next('DATA')
TEXT = TEXT + " " + current_segment.field(4).value
KEYWORD_LIST = [STUFF:', THINGS:', JUNK:']
KEYWORD_LIST1 = [CRAP:']
#this splits the words up to search through
TEXT_list = TEXT.split(' ')
#this searches for the first few keywords then stops at the unwanted one
for word in TEXT_list:
if word in KEYWORD_LIST:
my_output = my_output + word
elif word in KEYWORD_LIST1:
break
else:
my_output = my_output + ' ' + word
#this searches for the unwanted keywords leaving the output blank until it reaches the wanted keyword
for word1 in TEXT_list:
if word1 in KEYWORD_LIST:
my_output1 = ''
elif word1 in KEYWORD_LIST1:
my_output1 = my_output1 + word1 + '\n'
else:
my_output1 = my_output1 + ' ' + word1
#my_output is formatted back the way I want deviding up the text into 65 or less character lines
MAX_LENGTH = 65
my_wrapped_output = wrap(my_output,MAX_LENGTH)
my_wrapped_output1 = wrap(my_output1,MAX_LENGTH)
my_output_list = my_wrapped_output.split('\n')
my_output_list1 = my_wrapped_output1.split('\n')
for phrase in my_output_list:
if phrase == "":
SetID +=1
output = output + "DATA|" + str(SetID) + "|TEXT| |||||"
else:
SetID +=1
output = output + "DATA|" + str(SetID) + "|TEXT|" + phrase + "|||||"
for phrase2 in my_output_list1:
if phrase2 == "":
SetID +=1
output = output + "DATA|" + str(SetID) + "|NEWTEXT| |||||"
else:
SetID +=1
output = output + "DATA|" + str(SetID) + "|NEWTEXT|" + phrase + "|||||"
#this populates the fields I need
value = output
Then I format the "my_output" and "my_output1" adding the word "NEWTEXT" where it goes. This code runs through each line looking for the keyword then puts that keyword and a carraige return in. Once it gets the other "KEYWORD_LIST1" it stops and drops the rest of the text then starts the next loop. My problem is the above code gives my this:
DATA|1|TEXT1|STUFF: |||||
DATA|2|TEXT1|some random text|||||
DATA|3|TEXT1|THINGS: |||||
DATA|4|TEXT1|some random text and|||||
DATA|5|TEXT1|some more random text and stuff|||||
DATA|6|TEXT1|JUNK: |||||
DATA|7|TEXT1|crazy randomness|||||
DATA|8|NEWTEXT|crazy randomness|||||
DATA|9|NEWTEXT|CRAP: |||||
DATA|10|NEWTEXT|such random stuff I cant believe how random|||||
It is grabbing the text from before "KEYWORD_LIST1" and adding it into the NEWTEXT section. I know there is a way to make groups from the keyword and text after it but I am unclear on how to impliment it. Any help would be much appreciated.
Thanks.
This is what I had to do to get it to work for me:
KEYWORD_LIST = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']
def text_to_message(text):
result=[]
for word in text.split():
if word in KEYWORD_LIST or word in KEYWORD_LIST1:
if result:
yield ' '.join(result)
result=[]
yield word
else:
result.append(word)
if result:
yield ' '.join(result)
def format_messages(messages):
title='TEXT1'
for message in messages:
if message in KEYWORD_LIST:
title='TEXT1'
elif message in KEYWORD_LIST1:
title='NEWTEXT'
my_wrapped_output = wrap(message,MAX_LENGTH)
my_output_list = my_wrapped_output.split('\n')
for line in my_output_list:
if line = '':
yield title + '|'
else:
yield title + '|' + line
for line in format_messages(text_to_message(TEXT)):
if line = '':
SetID +=1
output = "DATA|" + str(SetID) + "|"
else:
SetID +=1
output = "DATA|" + str(SetID) + "|" + line
#this is needed instead of print(line)
value = output
General tip: Don't try to build up strings accretively like this:
my_output = my_output + ' ' + word
instead, make my_output a list, append word to the list, and
then, at the very end, do a single join: my_output = '
'.join(my_output). (See text_to_message code below for an example.)
Using join is the right way to build strings. Delaying the creation of the string is useful because processing lists of substrings is more pleasant than splitting and unsplitting strings, and having to add spaces and carriage returns here and there.
Study generators. They are easy to understand, and can help you a lot when processing text like this.
import textwrap
KEYWORD_LIST = ['STUFF:', 'THINGS:', 'JUNK:']
KEYWORD_LIST1 = ['CRAP:']
def text_to_message(text):
result=[]
for word in text.split():
if word in KEYWORD_LIST or word in KEYWORD_LIST1:
if result:
yield ' '.join(result)
result=[]
yield word
else:
result.append(word)
if result:
yield ' '.join(result)
def format_messages(messages):
title='TEXT1'
num=1
for message in messages:
if message in KEYWORD_LIST:
title='TEXT1'
elif message in KEYWORD_LIST1:
title='NEWTEXT'
for line in textwrap.wrap(message,width=65):
yield 'DATA|{n}|{t}|{l}'.format(n=num,t=title,l=line)
num+=1
TEXT='''STUFF: some random text THINGS: some random text and some more random text and stuff JUNK: crazy randomness CRAP: such random stuff I cant believe how random'''
for line in format_messages(text_to_message(TEXT)):
print(line)