Dividing up input txt wrong - python

I'm trying to write a program that inputs two txt files as stated by the user, takes the keywords file and splits it into words and values and then takes the tweets file, splits it into a location and a tweet/time.
Example of keywords file (single spaced .txt file):
love,10
like,5
best,10
hate,1
lol,10
better,10
Example of tweets file (note this shows only four, there are actually several hundred lines in the actual .txt file):
[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC
[33.702900329999999, -117.95095704000001] 6 2011-08-28 19:03:13 Today is going to be the greatest day of my life. Hired to take pictures at my best friend's gparents 50th anniversary. 60 old people. Woo.
[38.809954939999997, -77.125144050000003] 6 2011-08-28 19:07:05 I just put my life in like 5 suitcases
[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:08:02 #Miss_mariiix3 is the love of my life
So far my program looks like:
#prompt the user for the file name of keywords file
keywordsinputfile = input("Please input file name: ")
tweetsinputfile = input ("Please input tweets file name: ")
#try to open given input file
try:
    k=open(keywordsinputfile, "r")
except IOError:
    print ("{} file not found".format(keywordsinputfile))
try:
    t=open(tweetsinputfile, "r")
except IOError:
    print ("{} file not found".format(tweetsinputfile))
    exit()
def main (): #main function
    kinputfile = open(keywordsinputfile, "r") #Opens File for keywords
    tinputfile = open(tweetsinputfile, "r") #Opens file for tweets
    HappyWords = {}
    HappyValues = {}
    for line in kinputfile: #splits keywords
        entries = line.split(",")
        hvwords = str(entries[0])
        hvalues = int(entries[1])
        HappyWords["keywords"] = hvwords #stores Happiness keywords
        HappyValues["values"] = hvalues #stores Happiness Values
    for line in tinputfile:
        twoparts = line.split("]") #splits tweet file by ] creating a location and tweet parts, tweets are ignored for now
        startlocation = (twoparts[0]) #takes the first part (the locations)
        def testing(startlocation):
            for line in startlocation:
                intlocation = line.split("[") #then gets rid of the "[" at the beginning of the locations
                print (intlocation)
    testing(startlocation)
main()
What I am hoping to get out of this is (for an infinite number of lines, the actual file contains way more than the four shown above)
41.298669629999999, -81.915329330000006
33.702900329999999, -117.95095704000001
38.809954939999997, -77.125144050000003
27.994195699999999, -82.569434900000005
And what I am getting is:
['', '']
['2']
['7']
['.']
['9']
['9']
['4']
['1']
['9']
['5']
['6']
['9']
['9']
['9']
['9']
['9']
['9']
['9']
['9']
[',']
[' ']
['-']
['8']
['2']
['.']
['5']
['6']
['9']
['4']
['3']
['4']
['9']
['0']
['0']
['0']
['0']
['0']
['0']
['0']
['5']
So in other words it's only processing the final line of the txt file and splitting it up individually as well.
After this I have to store them in such a way that I can split them again, with the first part in one list and the second part in another list, for example:
for line in locations:
    entries = line.split(",")
    latitude = float(entries[0])
    longitude = float(entries[1])
Thanks in advance!

You just need to stick in some tracing print statements to show what's going on. I did it this way:
    for line in tinputfile:
        twoparts = line.split("]") #splits tweet file by ] creating a location and tweet parts, tweets are ignored for now
        startlocation = (twoparts[0]) #takes the first part (the locations)
        print ("-----------")
        print ("twoparts", twoparts)
        print ("startlocation", startlocation)
        def testing(startlocation):
            for line in startlocation:
                print ("line", line)
                intlocation = line.split("[") #then gets rid of the "[" at the beginning of the locations
                print ("intlocation", intlocation)
    testing(startlocation)
... and got a trace beginning with:
-----------
twoparts ['[41.298669629999999, -81.915329330000006', " 6 2011-08-28 19:02:36 Work needs to fly by ... I'm so excited to see Spy Kids 4 with then love of my life ... ARREIC\n"]
startlocation [41.298669629999999, -81.915329330000006
-----------
twoparts ['[33.702900329999999, -117.95095704000001', " 6 2011-08-28 19:03:13 Today is going to be the greatest day of my life. Hired to take pictures at my best friend's gparents 50th anniversary. 60 old people. Woo.\n"]
startlocation [33.702900329999999, -117.95095704000001
-----------
twoparts ['[38.809954939999997, -77.125144050000003', ' 6 2011-08-28 19:07:05 I just put my life in like 5 suitcases\n']
startlocation [38.809954939999997, -77.125144050000003
-----------
twoparts ['[27.994195699999999, -82.569434900000005', ' 6 2011-08-28 19:08:02 #Miss_mariiix3 is the love of my life\n']
startlocation [27.994195699999999, -82.569434900000005
line [
intlocation ['', '']
line 2
intlocation ['2']
line 7
Analysis:
There are two basic problems:
1. Your processing statement, testing(startlocation), is outside the loop, so it uses only the last input line.
2. As you can see in the output of "twoparts", your desired coordinates are still a single string, not a list of floats. You need to strip off the bracket, split the string at the comma, and then convert each piece to float. In the current form, when you iterate through startlocation, you iterate through the characters of a string, not through two floats.
Also: Why do you define a function inside a loop? This redefines the function on every iteration. Move it before the main program; this is where well-behaved functions hang out. :-)
Added information on point 2:
Let's step through your code, using the last line of sample input.
Start at the top of the loop for line in tinputfile
twoparts = line.split("]")
twoparts is now a pair of elements, both strings:
['[27.994195699999999, -82.569434900000005',
' 6 2011-08-28 19:08:02 #Miss_mariiix3 is the love of my life\n']
You then set startlocation to the first element:
'[27.994195699999999, -82.569434900000005'
Then comes the redundant re-definition of function testing, which produces no change. The next statement calls testing; we enter the routine.
testing(startlocation)
for line in startlocation:
The important part here is that startlocation is a string:
'[27.994195699999999, -82.569434900000005'
... so when you execute that loop, you iterate through the string, one character at a time.
Correction:
To be honest, I don't know what testing is supposed to do.
It looks like all you need to do is strip off that leading bracket:
intlocation = startlocation.split('[')[1]
... or simply
intlocation = startlocation[1:]
If instead you want the float values as a two-element list, knock off the bracket as above, split the elements at the comma, and convert each one to float:
intlocation = [ float(x) for x in startlocation[1:].split(',') ]
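Putting both fixes together, here is a minimal sketch. It assumes every line starts with a bracketed "[lat, lon]" pair as in the sample; parse_location and sample_lines are hypothetical names, and the list stands in for the real tweets file.

```python
def parse_location(line):
    """Return (latitude, longitude) as floats from one tweet line."""
    # Everything before the first ']' is the location, e.g. '[41.29, -81.91'
    location_part = line.split("]", 1)[0]
    # Drop the leading '[' and split the two numbers at the comma
    lat_text, lon_text = location_part.lstrip("[").split(",")
    # float() tolerates the space left in front of the longitude
    return float(lat_text), float(lon_text)


# Two sample lines standing in for the tweets file (assumption: same layout)
sample_lines = [
    "[41.298669629999999, -81.915329330000006] 6 2011-08-28 19:02:36 Work needs to fly by",
    "[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:08:02 #Miss_mariiix3 is the love of my life",
]

latitudes = []
longitudes = []
for line in sample_lines:
    # The parsing happens inside the loop, so every line is processed
    lat, lon = parse_location(line)
    latitudes.append(lat)
    longitudes.append(lon)

print(latitudes)
print(longitudes)
```

With a real file, the loop would iterate over the open file object instead of sample_lines.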

It looks like much of what this really needs is ast.literal_eval.
import ast
for line in tinputfile:
    twoparts = line.split("]")
    startlocation = ast.literal_eval(twoparts[0] + ']') # add the ']' back in
    # startlocation is now a list of two coordinates.
But you might be better off still using re.
> import re
> example = '[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:02:36 text text text text'
> fmt = re.split(r'\[(-?[0-9.]+),\s?(-?[0-9.]+).\s*\d\s*(\d{4}-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2})',example)
> fmt
['', '27.994195699999999', '-82.569434900000005', '2011-08-28 19:02:36', ' text text text text']
> location = (float(fmt[1]), float(fmt[2]))
> time = fmt[3]
> text = fmt[4]
So, what's going on?
Each of those (...) in the regular expression (the re module), tells re.split "Make this piece its own index".
The first and second are -?[0-9.]+. That means match an optional minus sign followed by one or more digits or decimal points (we could be stricter, but you don't really need to).
The next set of () match any date: \d{4} means "four digits". \d{1,2} means "one or two digits".
Or, you could use both together:
> fmt = re.split(r'\[(-?[0-9.]+,\s?-?[0-9.]+).\s*\d\s*(\d{4}-\d{1,2}-\d{1,2}\s+\d{2}:\d{2}:\d{2})',example)
> fmt # watch what happens when I change the grouping.
['', '27.994195699999999, -82.569434900000005', '2011-08-28 19:02:36', ' text text text text']
> import ast
> location = ast.literal_eval('(' + fmt[1] + ')')
> time = fmt[2]
> text = fmt[3]
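As a variation on the same idea, re.match with named groups avoids counting list indexes. This is only a sketch: the pattern and the group names (lat, lon, time, text) are assumptions based on the sample line, not a general tweet parser.

```python
import re

# Assumed layout: "[lat, lon] <number> YYYY-MM-DD HH:MM:SS <tweet text>"
TWEET_LINE = re.compile(
    r"\[(?P<lat>-?[0-9.]+),\s*(?P<lon>-?[0-9.]+)\]\s*"      # [lat, lon]
    r"\d+\s+"                                                # the number after the bracket
    r"(?P<time>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s*"    # date and time
    r"(?P<text>.*)"                                          # the tweet itself
)

example = '[27.994195699999999, -82.569434900000005] 6 2011-08-28 19:02:36 text text text text'
match = TWEET_LINE.match(example)

location = (float(match.group("lat")), float(match.group("lon")))
time = match.group("time")
text = match.group("text")
print(location, time, text)
```

match is None when a line does not fit the pattern, so real code should check for that before calling group().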


Formatted strings, decimals and commas question

I have a .txt file that I read in and wish to create formatted strings using these values. Columns 3 and 4 need commas and the last column needs a percent sign and 2 decimal places. The formatted string will say something like "The overall attendance at Bulls was 894659, average attendance was 21,820 and the capacity was 104.30%".
the shortened .txt file has these lines:
1 Bulls 894659 21820 104.3
2 Cavaliers 843042 20562 100
3 Mavericks 825901 20143 104.9
4 Raptors 812863 19825 100.1
5 NY_Knicks 812292 19812 100
So far my code looks like this and it's mostly working, minus the commas and decimal places.
file_1 = open ('basketball.txt', 'r')
count = 0
list_1 = [ ]
for line in file_1:
    count += 1
    textline = line.strip()
    items = textline.split()
    list_1.append(items)
print('Number of teams: ', count)
for line in list_1:
    print ('Line: ', line)
file_1.close()
for line in list_1: #iterate over the lines of the file and print the lines with formatted strings
    a, b, c, d, e = line
    print (f'The overall attendance at the {b} game was {c}, average attendance was {d}, and the capacity was {e}%.')
Any help with how to format the code to show the numbers with commas (21820 -> 21,820) and the last column with 2 decimals and a percent sign (104.3 -> 104.30%) is greatly appreciated.
You've got some options for how to tackle this.
Option 1: Using f strings (Python 3.6+ only)
Since your provided code already uses f strings, this solution should work for you. For others reading here, this will only work if you are using Python 3.6 or later.
You can do string formatting within f strings, signified by putting a colon : after the variable name within the curly brackets {}, after which you can use all of the usual python string formatting options.
Thus, you could just change one of your lines of code to get this done. Your print line would look like:
print(f'The overall attendance at the {b} game was {int(c):,}, average attendance was {int(d):,}, and the capacity was {float(e):.2f}%.')
The variables are getting interpreted as:
The {b} just prints the string b.
The {int(c):,} and {int(d):,} print the integer versions of c and d, respectively, with commas (indicated by the :,).
The {float(e):.2f} prints the float version of e with two decimal places (indicated by the :.2f).
Option 2: Using str.format()
For others here who are looking for a Python 2 friendly solution, you can change the print line to the following:
print("The overall attendance at the {} game was {:,}, average attendance was {:,}, and the capacity was {:.2f}%.".format(b, int(c), int(d), float(e)))
Note that both options use the same formatting syntax, just the f string option has the benefit of having you write your variable name right where it will appear in the resulting printed string.
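To see the format specifiers on their own, here is a small sketch using the first row of the sample file. The variable names are made up for illustration; the values start as strings because that is what split() returns.

```python
# Values as they come out of line.split(): all strings
attendance_text, average_text, capacity_text = "894659", "21820", "104.3"

# Convert first; the ',' and '.2f' specifiers only work on numbers
attendance = int(attendance_text)
average = int(average_text)
capacity = float(capacity_text)

with_commas = format(attendance, ",")       # thousands separator
average_with_commas = f"{average:,}"        # same spec inside an f string
capacity_percent = "{:.2f}%".format(capacity)  # two decimals plus a literal %

print(with_commas, average_with_commas, capacity_percent)
# prints: 894,659 21,820 104.30%
```

Applying ',' or '.2f' directly to the unconverted strings raises a ValueError, which is why both options above wrap the values in int() and float().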
This is how I ended up doing it, very similar to the response from Bibit.
file_1 = open ('something.txt', 'r')
count = 0
list_1 = [ ]
for line in file_1:
    count += 1
    textline = line.strip()
    items = textline.split()
    items[2] = int(items[2])
    items[3] = int(items[3])
    items[4] = float(items[4])
    list_1.append(items)
print('Number of teams/rows: ', count)
for line in list_1:
    print ('Line: ', line)
file_1.close()
for line in list_1:
    print ('The overall attendance at the {:s} games was {:,}, average attendance was {:,}, and the capacity was {:.2f}%.'.format(line[1], line[2], line[3], line[4]))

How to calculate the average score of class members in a text file? [duplicate]

For my beginners course python I got the following assignment:
In the input file, grades are listed for the geography tests of group 2b. There have been three tests of which the grades will be included in the half-yearly report that is given to the students before the Christmas break.
On each line of the input you can find the name of the student, followed by one or more underscores ('_'). These are succeeded by the grades for the tests, for example:
Anne Adema____________6.5 5.5 4.5
Bea de Bruin__________6.7 7.2 7.7
Chris Cohen___________6.8 7.8 7.3
Dirk Dirksen__________1.0 5.0 7.7
The lowest grade possible is a 1, the highest a 10. If somebody missed a test, the grade in the list is a 1.
Your assignment is to make the report for the geography course of group 2b, which should look like this:
Report for group 2b
Anne Adema has an average grade of 5.5
Bea de Bruin has an average grade of 7.2
Chris Cohen has an average grade of 7.3
Dirk Dirksen has an average grade of 4.6
End of report
This is my python code so far:
NUMBER_OF_GRADES = 3
file =open('grades1.in.txt').read().split('\n')
for scores in file:
    name_numbers = (scores.split('_'))
def averages ():
    for numbers in file:
        sum=0
        numbers = split("\n")
        for num in numbers:
            sum = sum + int(num)
        averages = sum/NUMBER_OF_GRADES
        print ('% has an average grade of %.1') %(name, averages)
Where does it go wrong? What am I missing? Am I not splitting the right way?
There are a few things wrong with your code. This maybe should be migrated to Code Review, but I'll write my answer here for now.
I will keep this as close to your version as I can so that you see how to get from where you are to a working example without needing a ton of stuff you might not have learned yet. First let's look at yours one part at a time.
file =open('grades1.in.txt').read().split('\n')
Your file is going to be a list of strings, where each string is a line in your input file. Note that if you have empty lines in your input, some of the lines in this list will be empty strings. This is important later.
for scores in file:
name_numbers = (scores.split('_'))
Works fine to split the name part of the line from the scores part of the line, so we'll keep it, but what are you doing with it here? Right now you are overwriting name_numbers with each new line in the file and never doing anything with it, so we are going to move this into your function and make better use of it.
def averages():
No arguments? We'll work on that.
for numbers in file:
Keep in mind your file is a list, where each entry in the list is one line of your input file, so this numbers in file doesn't really make sense. I think this is where you first go wrong. You want to look at the numbers in each line after you split it with scores.split('_'), and for that we need to index the result of the split. When you split your first line, you get something like:
split_line = ['Anne Adema', '', '', '', '', '', '', '', '', '', '', '', '6.5 5.5 4.5']
The first element (split_line[0]) is the name, and the last element (split_line[-1]) are the numbers, but you still have to split those out too! To get a list of numbers, you actually have to split it and then interpret each string as a number. You can do this pretty easily with a list comprehension (best way to loop in Python) like this:
numbers = [float(n) for n in split_line[-1].split(' ')]
This reads something like: first split the last element of the line at spaces to get ['6.5', '5.5', '4.5'] (note they're all strings), and then convert each value in that list into a floating-point number, and finally save this list of floats as numbers. OK, moving on:
sum=0
numbers = split("\n") # already talked about this
for num in numbers:
sum = sum + int(num)
averages = sum/NUMBER_OF_GRADES
sum is a built-in function in Python, and we don't want to shadow a built-in by assigning to its name, so something's wrong here. Also, the grades are decimals, so int(num) would fail on a value like '6.5'; float is the right conversion. You can actually just call sum(my_list) on any list (actually any iterable) my_list to get the sum of all of the values in the list. To take the average, you just want to divide this sum by the length of the list, which you can get with len(my_list).
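As a small illustration of that last point, using the grades from the first line of the input (the variable names are only for illustration):

```python
grades_text = "6.5 5.5 4.5"

# split() gives strings, so convert each one to a float first
grades = [float(g) for g in grades_text.split()]

# The built-in sum() and len() do the work; no loop or counter needed
average = sum(grades) / len(grades)

print("{:.1f}".format(average))
# prints: 5.5
```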
print ('% has an average grade of %.1') %(name, averages)
There are some cool newer ways to print formatted text, one of which I will show in the following, but if you are supposed to use this way then I say stick with it. That said, I couldn't get this line to work for me, so I went with something I know better.
Rewriting it into something that works:
def averages(line):
    if not line.strip():
        return # skips blank lines!
    name_numbers = line.split('_') # now we split our line
    name = name_numbers[0]
    numbers = [float(n) for n in name_numbers[-1].split()]
    average = sum(numbers) / len(numbers)
    print('{} has an average grade of {:.1f}'.format(name, average))
And running it on your data:
file =open('grades1.in.txt').read().split('\n')
for line in file:
    averages(line) # call the function on each line
# Anne Adema has an average grade of 5.5
# Bea de Bruin has an average grade of 7.2
# Chris Cohen has an average grade of 7.3
# Dirk Dirksen has an average grade of 4.6
With the result shown in the comments below the function call. One more note, you never close your file, and in fact you never save off the file handle to close it. You can get rid of any headaches around this by using the context manager syntax in Python:
with open('grades1.in.txt', 'r') as a_file:
    for line in a_file:
        averages(line)
This automatically handles closing the file, and will even make sure to do so if you run into an error in the middle of the block of code that executes within the context manager. You can loop through a_file because it basically acts as an iterable that returns the next line in the file each time it is accessed.
First, you are not calling your function averages at all. You should add that as the last line in your code, but you do not really need the function definition at all.
Use the with-statement to open and automatically close the file.
Next, you can use the re package to split the lines at the underscores.
You have to split the list of grades (assuming that they are separated by spaces)
You can use the built-in functions sum and len to calculate the sum and the number of grades.
In the end, it could look something like this
import re
with open('grades1.in.txt') as grades_file:
    for line in grades_file:
        name, grades = re.split("_+", line)
        grades = [float(k) for k in grades.split()]
        avg = sum(grades)/len(grades)
        print("{} has an average grade of {:.2f}.".format(name, avg))
GoTN already answered your question.
To make it clear and improve a little bit you can try:
def averages(line, number_of_grades):
    line_parsed = line.split('_')
    numbers = [float(x) for x in line_parsed[-1].split(' ')]
    name = line_parsed[0]
    # you can use number_of_grades or len(numbers)
    # Although, if the student only has 2 grades in the file,
    # but it was supposed to be three, the average will be wrong.
    avg = sum(numbers)/number_of_grades
    print("{} has an average grade of {:.2f}.".format(name, avg))
    # or print(f'{name} has an average grade of {avg:.2f}')
NUMBER_OF_GRADES = 3
files =open('grades1.in.txt').read().splitlines()
for line in files:
    if len(line) > 0:
        averages(line, NUMBER_OF_GRADES)
Splitting the lines up is somewhat involved. The code below does it by first replacing all the "_" characters with spaces, then splitting the result. Since there can be a variable number of parts making up the full name, the results of this splitting are "sliced" using negative indexing, which counts backwards from the end of the sequence of values.
That works by taking advantage of the fact that we know the last three items must be test scores, therefore everything before them must be parts comprising the name. This is the line doing that:
names, test_scores = line[:-NUMBER_OF_GRADES], line[-NUMBER_OF_GRADES:]
Here's the full code:
NUMBER_OF_GRADES = 3
with open('grades1.in.txt') as file:
    for line in file:
        line = line.replace('_', ' ').split()
        names, test_scores = line[:-NUMBER_OF_GRADES], line[-NUMBER_OF_GRADES:]
        name = ' '.join(names) # Join names together.
        test_scores = [float(score) for score in test_scores] # Make numeric.
        average_score = sum(test_scores) / len(test_scores)
        print('%s has an average grade of %.1f' % (name, average_score))
Output:
Anne Adema has an average grade of 5.5
Bea de Bruin has an average grade of 7.2
Chris Cohen has an average grade of 7.3
Dirk Dirksen has an average grade of 4.6
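To see the slicing step in isolation, here is a short sketch on one sample line (the three-part name shows why a fixed index for the name would not work; parts is a name introduced here for illustration):

```python
NUMBER_OF_GRADES = 3
line = "Bea de Bruin__________6.7 7.2 7.7"

# Underscores become spaces, then split() breaks on any run of whitespace
parts = line.replace("_", " ").split()

# The last three items are grades; whatever comes before them is the name
names, test_scores = parts[:-NUMBER_OF_GRADES], parts[-NUMBER_OF_GRADES:]

print(names)        # ['Bea', 'de', 'Bruin']
print(test_scores)  # ['6.7', '7.2', '7.7']
```

This relies on every line really having exactly NUMBER_OF_GRADES grades; a line with fewer would pull part of the name into test_scores.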

Looking at a list of numbers and getting that number from another file

I don't really know how to word the question, but I have this file with a number and a decimal next to it, like so (the file name is num.txt):
33 0.239
78 0.298
85 1.993
96 0.985
107 1.323
108 1.000
I have this list of numbers that I want to look up in the file; for each one found, I want to take the decimal number next to it and append it to a list:
['78','85','108']
Here is my code so far:
chosen_number = ['78','85','108']
memory_list = []
for line in open('path/to/num.txt'):
    checker = line[0:2]
    if not checker in chosen_number: continue
    dec = line.split()[-1]
    memory_list.append(float(dec))
The error they give to me is that it is not in a list and they only account for the 3 digit numbers. I don't really understand why this is happening and would like some tips to know how to fix it. Thanks.
As for the error, there is no actual error. The only problem is that they ignore the two digit numbers and only get the three digit numbers. I want them to get both the 2 and 3 digit numbers. For example, the script would pass 78 and 85, going to the line with '108'.
Your checker is undefined. The below code works.
N.B. I have used startswith because, the number might appear elsewhere in the line.
chosen_number = ['78','85','108']
memory_list = []
with open('path/to/num.txt') as f:
    for line in f:
        if any(line.startswith(i) for i in chosen_number):
            memory_list.append(float(line.split()[1]))
print(memory_list)
Output:
[0.298, 1.993, 1.0]
The following should work:
chosen_number = ['78','85','108']
memory_list = []
with open('num.txt') as f_input:
    for line in f_input:
        v1, v2 = line.split()
        if v1 in chosen_number:
            memory_list.append(float(v2))
print(memory_list)
Giving you:
[0.298, 1.993, 1.0]
Also, it is better to use a with statement when dealing with files so that the file is automatically closed afterwards.
Try to use this code:
chosen_number = ['78 ', '85 ', '108 ']
memory_list = []
for line in open("num.txt"):
    for num in chosen_number:
        if num in line:
            dec = line.split()[-1]
            memory_list.append(float(dec))
In chosen_number, I declared the numbers with a trailing space, e.g. '85 '. Otherwise the if condition would also be true for the line containing 0.985, because the check is a substring test on strings. I hope that's clear enough.
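The difference between the matching strategies in these answers can be checked directly. A small sketch with hand-picked lines; the line "1080 0.5" is a made-up example that is not in the original file:

```python
line = "96 0.985"

# Substring test: "85" occurs inside "0.985", so this is a false positive
substring_hit = "85" in line

# Prefix test: a hypothetical key "1080" also starts with "108"
prefix_hit = "1080 0.5".startswith("108")

# Comparing the whole first field avoids both problems
first_field = line.split()[0]
exact_hit = first_field == "85"

print(substring_hit, prefix_hit, exact_hit)
# prints: True True False
```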

I need to make a simple file compression system for my GCSE Computer Science

I'm not that experienced with code and have a question pertaining to my GCSE Computer Science controlled assessment. I have got pretty far, it's just this last hurdle is holding me up.
This task requires me to use a previously made simple file compression system, and to "Develop a program that builds [upon it] to compress a text file with several sentences, including punctation. The program should be able to compress a file into a list of words and list of positions to recreate the original file. It should also be able to take a compressed file and recreate the full text, including punctuation and capitalisation, of the original file".
So far, I have made it possible to store everything as a text file with my first program:
sentence = input("Enter a sentence: ")
sentence = sentence.split()
uniquewords = []
for word in sentence:
    if word not in uniquewords:
        uniquewords.append(word)
positions = [uniquewords.index(word) for word in sentence]
recreated = " ".join([uniquewords[i] for i in positions])
print (uniquewords)
print (recreated)
positions=str(positions)
uniquewords=str(uniquewords)
positionlist= open("H:\Python\ControlledAssessment3\PositionList.txt","w")
positionlist.write(positions)
positionlist.close()
wordlist=open("H:\Python\ControlledAssessment3\WordList.txt","w",)
wordlist.write(uniquewords)
wordlist.close()
This makes everything into lists, and converts them into a string so that it is possible to write into a text document. Now, program number 2 is where the issue lies:
uniquewords=open("H:\Python\ControlledAssessment3\WordList.txt","r")
uniquewords= uniquewords.read()
positions=open("H:\Python\ControlledAssessment3\PositionList.txt","r")
positions=positions.read()
positions= [int(i) for i in positions]
print(uniquewords)
print (positions)
recreated = " ".join([uniquewords[i] for i in positions])
FinalSentence=open("H:\Python\ControlledAssessment3\ReconstructedSentence.txt","w")
FinalSentence.write(recreated)
FinalSentence.write('\n')
FinalSentence.close()
When I try and run this code, this error appears:
Traceback (most recent call last):
File "H:\Python\Task 3 Test 1.py", line 7, in <module>
positions= [int(i) for i in positions]
File "H:\Python\Task 3 Test 1.py", line 7, in <listcomp>
positions= [int(i) for i in positions]
ValueError: invalid literal for int() with base 10: '['
So, how do you suppose I get the second program to recompile the text into the sentence? Thanks, and I'm sorry if this was a lengthy post, I've spent forever trying to get this working.
I'm assuming this is something to do with the list that has been converted into a string including brackets, commas, and spaces etc. so is there a way to revert both strings back into their original state so I can recreate the sentence? Thanks.
So firstly, it is a bit strange to save positions as a literal string; you should save each element (same with uniquewords). With this in mind, something like:
program1.py:
sentence = input("Type sentence: ")
# this is a test this is a test this is a hello goodbye yes 1 2 3 123
sentence = sentence.split()
uniquewords = []
for word in sentence:
    if word not in uniquewords:
        uniquewords.append(word)
positions = [uniquewords.index(word) for word in sentence]
with open("PositionList.txt","w") as f:
    for i in positions:
        f.write(str(i)+' ')
with open("WordList.txt","w") as f:
    for i in uniquewords:
        f.write(str(i)+' ')
program2.py:
with open("PositionList.txt","r") as f:
    data = f.read().split(' ')
    positions = [int(i) for i in data if i!='']
with open("WordList.txt","r") as f:
    uniquewords = f.read().split(' ')
sentence = " ".join([uniquewords[i] for i in positions])
print(sentence)
PositionList.txt
0 1 2 3 0 1 2 3 0 1 2 4 5 6 7 8 9 10
WordList.txt
this is a test hello goodbye yes 1 2 3 123
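If the files have already been written with str(list), as in the original first program, one way to turn that text back into real lists is ast.literal_eval from the standard library. A minimal sketch, with in-memory strings standing in for the contents of the two files (the sample words and positions are made up):

```python
import ast

# What program 1 would have produced before writing to disk
positions = [0, 1, 2, 0]
uniquewords = ["Hello,", "world!", "again"]
positions_text = str(positions)    # "[0, 1, 2, 0]"
words_text = str(uniquewords)      # "['Hello,', 'world!', 'again']"

# What program 2 can do after reading the files back in:
# literal_eval safely parses Python literals such as lists of ints or strings
restored_positions = ast.literal_eval(positions_text)
restored_words = ast.literal_eval(words_text)

recreated = " ".join(restored_words[i] for i in restored_positions)
print(recreated)
# prints: Hello, world! again Hello,
```

Punctuation and capitalisation survive because split() keeps them attached to each word, though runs of several spaces or line breaks in the original text would be collapsed to single spaces.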

Python: Read large file in chunks

Hey there, I have a rather large file that I want to process using Python and I'm kind of stuck as to how to do it.
The format of my file is like this:
0 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
1 xxx xxxx xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
So I basically want to read in the chunk up from 0-1, do my processing on it, then move on to the chunk between 1 and 2.
So far I've tried using a regex to match the number and then keep iterating, but I'm sure there has to be a better way of going about this. Any suggestion/info would be greatly appreciated.
If they are all within the same line, that is there are no line breaks between "1." and "2." then you can iterate over the lines of the file like this:
for line in open("myfile.txt"):
    pass  # do stuff with line
The line will be disposed of and overwritten at each iteration meaning you can handle large file sizes with ease. If they're not on the same line:
import re
parsed_line = ""
for line in open("myfile.txt"):
    if re.match(r"\d+ ", line):  # regex to match start of new chunk
        parsed_line = line
    else:
        parsed_line += line
and the rest of your code.
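One way to package the line-by-line idea is a generator that yields each finished chunk. This is a sketch under the assumption that a chunk always begins with a line starting with digits followed by a space; iter_chunks is a hypothetical name, and io.StringIO stands in for the real file.

```python
import io
import re


def iter_chunks(lines):
    """Yield one chunk (a single string) per numbered record."""
    header = re.compile(r"\d+ ")  # assumed record header: digits, then a space
    chunk = []
    for line in lines:
        # A header line ends the previous chunk, if there is one
        if header.match(line) and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    # Whatever is left after the last header is the final chunk
    if chunk:
        yield "".join(chunk)


sample = io.StringIO("0 aaa\nbbb\n1 ccc\nddd\n2 eee\n")
chunks = list(iter_chunks(sample))
print(chunks)
```

Only one chunk is held in memory at a time when the generator is consumed in a for loop, so the approach should also suit large files; list() is used here only to display the result.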
Why don't you just read the file char by char using file.read(1)?
Then, you could - in each iteration - check whether you arrived at the char 1. Then you have to make sure that storing the string is fast.
If the "N " can only start a line, then why not use the "simple" solution? (It sounds like this is already being done, I am trying to reinforce/support it ;-))
That is, just read a line at a time, and build up the data representing the current N object. After say N=0, and N=1 are loaded, process them together, then move onto the next pair (N=2, N=3). The only thing that is even remotely tricky is making sure not to throw out a read line. (The line read that determined the end condition -- e.g. "N " -- also contains the data for the next N).
Unless seeking is required (or IO caching is disabled or there is an absurd amount of data per item), there is really no reason not to use readline AFAIK.
Happy coding.
Here is some off-the-cuff code, which likely contains multiple errors. In any case, it shows the general idea using a minimized side-effect approach.
# given an input and previous item data, return either
# [item_number, data, next_overflow] if another item is read
# or None if there are no more items
import re

def read_item (inp, overflow):
    data = overflow or ""
    # this can be replaced with any method to "read the header"
    # the regex is just "the easiest". the contract is just:
    # given "N ....", return N. given anything else, return None
    def get_num(d):
        m = re.match(r"(\d+) ", d)
        return int(m.group(1)) if m else None
    for line in inp:
        if data and get_num(line) is not None:
            # already in an item (have data); current line "overflows".
            # item number is still at start of current data
            return [get_num(data), data, line]
        # not in item, or new item not found yet
        data += line
    # at end of input, with data. only returns above
    # if a "new" item was encountered; this covers case of
    # no more items (or no items at all)
    if data:
        return [get_num(data), data, None]
    else:
        return None
And usage might be akin to the following, where f represents an open file:
# check for error conditions (e.g. None returned)
# note feed-through of "overflow"
num1, data1, overflow = read_item(f, None)
num2, data2, overflow = read_item(f, overflow)
If the format is fixed, why not just read 3 lines at a time with readline()?
If the file is small, you could read the whole file in and split() on number digits (might want to use strip() to get rid of whitespace and newlines), then fold over the list to process each string in the list. You'll probably have to check that the resultant string you are processing on is not initially empty in case two digits were next to each other.
If the file's content can be loaded in memory, and that's what you answered, then the following code (needs to have filename defined) may be a solution.
import re
regx = re.compile(r'^((\d+).*?)(?=^\d|\Z)',re.DOTALL|re.MULTILINE)
with open(filename) as f:
    text = f.read()
def treat(inp,regx=regx):
    m1 = regx.search(inp)
    numb,chunk = m1.group(2,1)
    li = [chunk]
    for mat in regx.finditer(inp,m1.end()):
        n,ch = mat.group(2,1)
        if int(n) == int(numb) + 1:
            yield ''.join(li)
            numb = n
            li = []
        li.append(ch)
        chunk = ch
    yield ''.join(li)
for y in treat(text):
    print(repr(y))
This code, run on a file containing :
1 mountain
orange 2
apple
produce
2 gas
solemn
enlightment
protectorate
3 grimace
song
4 snow
wheat
51 guludururu
kelemekinonoto
52asabi dabada
5 yellow
6 pink
music
air
7 guitar
blank 8
8 Canada
9 Rimini
produces:
'1 mountain\norange 2\napple\nproduce\n'
'2 gas\nsolemn\nenlightment\nprotectorate\n'
'3 grimace\nsong\n'
'4 snow\nwheat\n51 guludururu\nkelemekinonoto\n52asabi dabada\n'
'5 yellow\n'
'6 pink \nmusic\nair\n'
'7 guitar\nblank 8\n'
'8 Canada\n'
'9 Rimini'
