Python: Read in Data from File - python

I have to read data from a text file from the command line. It is not too difficult to read in each line, but I need a way to separate each part of the line.
The file contains the following in order for several hundred lines:
String (Sometimes more than 1 word)
Integer
String (Sometimes more than 1 word)
Integer
So for example the input could have:
Hello 5 Sample String 10
The current implementation I have for reading in each line is as follows... how can I modify it to separate it into what I want? I have tried splitting the line, but I always end up getting only one character of the first string this way with no integers or any part of the second string.
with open(sys.argv[1],"r") as f:
for line in f:
print(line)
The desired output would be:
Hello
5
Sample String
10
and so on for each line in the file. There could be thousands of lines in the file. I just need to separate each part so I can work with them separately.

The program can't magically split lines the way you want. You will need to read in one line at a time and parse it yourself based on the format.
Since there are two integers and an indeterminate number of (what I assume are) space-delimited words, you may be able to use a regular expression to find the integers then use them as delimiters to split up the line.

Related

python: replace every 18th character of every nth line in a text file

Because of the way the dataset I have is formatted, each hourly timestamp is written as 18 0 instead of 1800 (for example), and the extra space instead of a zero is messing up the way that Excel is converting the dataset from a TAB file to a CSV. There are > 600,000 lines, and this happens every 4th line.
see snapshot of dataset
I'm reading the text file in, reading each line, and then trying to replace the 18th character of every 4th line (the wretched space) with a 0
I think I am incorrectly understanding how to make each line a string, and also not sure how to correct the line and then re-save it into the file that will be ready to convert to a CSV
Python strings are immutable, and so they do not support item or slice assigment. You'll have to build a new string using i.e. someString[:18] + 'a' + someString[19:] or some other suitable approach, then storing it in the file again!.

Return the exact lines of a Huge file after pattern matching without using FOR in Python3

I am new to Python. My problem here is that:
I want to match a pattern against a large file and return matching lines(not just the matched string) from it. I DO NOT want a FOR loop for this as my file is huge. I am using mmap for reading the file.
in the above file, if I search for bhuvi, I should get 2 rows, bhuvi and bhuvi Kumar
I used re.findall() for this, but it just returns the substrings, not the whole lines.
Can someone please suggest what I can do here?
If your input file is huge, you cannot use readlines, but nothing
prevents you from reading one line in a loop.
As the file object is iterable, you can write the loop as:
for line in fh:
and process the content of the input line inside the loop.
The file size is not important, as you do not attempt to read all lines at once.
To check for presence of your string (bhuvi) in the line use
re.search, not re.findall.
Actually you don't need any list of matches, it is enough to find
a single match (it works quicker).
Below you have an example program (Python 3.7), writing the lines contaning your
string, along with the line number:
import re
cnt = 0
with open('input.txt') as fh:
for line in fh:
line = line.rstrip()
cnt += 1
if re.search('bhuvi', line):
print(f'{cnt}: {line}')
Note that I used rstrip() to remove the trailing newline, if any.
Edit after your comment:
You wrote that the file to check is huge. So there is a risk that
if you try to read it whole into the computer memory, the program
runs out of memory.
In such a case you would have to read the file chunk by chunk and
perform search in each chunk separately.
There is also a risk that a row with the text you are looking for will be
partially read in one chunk and the rest in the next,
so you have to take some measure to avoid this in your program.
On the other hand, if there is no other way but using mmap,
try something like re.finditer(r'[^\n]*bhuvi[^\n]*', map), i.e. create
an iterator looking for:
A sequence of chars other than \n.
Your string.
Another sequence of chars other than \n.
This way the match object returned by the iterator will match the
whole line, not your string alone.

How to save data to a file on separate items instead of one long string?

I am having trouble simply saving items into a file for later reading. When I save the file, instead of listing the items as single items, it appends the data together as one long string. According to my Google searches, this should not be appending the items.
What am I doing wrong?
Code:
with open('Ped.dta','w+') as p:
p.write(str(recnum)) # Add record number to top of file
for x in range(recnum):
p.write(dte[x]) # Write date
p.write(str(stp[x])) # Write Steps number
Since you do not show your data or your output I cannot be sure. But it seems you are trying to use the write method like the print function, but there are important differences.
Most important, write does not follow its written characters with any separator (like space by default for print) or end (like \n by default for print).
Therefore there is no space between your data and steps number or between the lines because you did not write them and Python did not add them.
So add those. Try the lines
p.write(dte[x]) # Write date
p.write(' ') # space separator
p.write(str(stp[x])) # Write Steps number
p.write('\n') # line terminator
Note that I do not know the format of your "date" that is written, so you may need to convert that to text before writing it.
Now that I have the time, I'll implement #abarnert's suggestion (in a comment) and show you how to get the advantages of the print function and still write to a file. Just use the file= parameter in Python 3, or in Python 2 after executing the statement
from __future__ import print_function
Using print you can do my four lines above in one line, since print automatically adds the space separator and newline end:
print(dte[x], str(stp[x]), file=p)
This does assume that your date datum dte[x] is to be printed as text.
Try adding a newline ('\n') character at the end of your lines as you see in docs. This should solve the problem of 'listing the items as single items', but the file you create may not be greatly structured nonetheless.
For further of your google searches you may want to check serialization, as well as json and csv formats, covered in python standard library.
You question would have befited if you gave very small example of recnum variable + original f.close() is not necessary as you have a with statement, see here at SO.

Python: How to read text between two empty lines into a string

I'm a beginner at programming and Python, and I'm writing a script to do stuff with .srt subtitle files. My problem is that I don't know how to: read through a file, and analyze text first between the beginning of the text and the first empty line and then between that empty line and the next empty line till the end of the file ("analyze" by e.g. calculate the length of a part of it, convert another part to numbers etc.).
You can read about the .srt format specification and see an example here (type: Plain); there's an empty line at the end of the file. I want to compare the display time/duration of each subtitle against the number of characters in it. Starting from the beginning of the file, each subtitle (with its number, duration info and text) is separated from the next one by an empty line (a "\n", I can find them with sth like if "\n" in line and len(line) == 2:). The time codes always contain a "-->" and always end in three digits, so if I have that in a string, I can figure out where it is. The problem is, I need to somehow do the following:
Read the subtitle text, which can be 1-3 lines with line breaks, calculate its character length.
Read the duration, convert to duration in seconds.
Read the line number (to be able to output it somewhere with my results, e.g. "duration of line 44 is 4.54 s").
I can do the second easily, but I'm not sure how to go over the whole file and tell Python: find the end of each subtitle's text, calculate the length of characters in each line, add that, read the duration, divide these, output this with the line number, and do the same with the next subtitle until you reach the end of the file. If it was one subtitle, I could do it easily, but I'm not sure how to do that check on a single one and then seek the next one. I've been looking for 2 hours for this and can't find anything like that.
Regular Expressions can be a powerful tool to help solve this type of processing.
You can use a regular expression to match or parse a single record or against the entire file.
If you don't know about Regex in python, I highly recommend you do some tutorials on the topic... and that should give you plenty of ideas how it can be applied to your problem.
There are many great references on the topic, but here is just one: http://www.diveintopython.net/regular_expressions/

Removing selected characters from text file

I have long a text file where each line looks something like /MM0001 (Table(12,)) or /MM0015 (Table(11,)). I want to keep only the four-digit number next to /MM. If it weren't for the "table(12,)" part I could just strip all the non-numeric characters, but I don't know how to extract the four-digit numbers only. Any advice on getting started?
If it's exactly that format, you could just print out line[3:7]
You could parse text line by line and then use 4th to 7th char of every line.
ln[3:7]
import re
R=re.compile(r'/MM(\d+)')
for line in file:
L=R.match(line)
if L:
print L.group(1)
or, more succinctly...
lines=[R.match(line).group(1) for line in file] #works if the lines are guaranteed to start with \MM
This should give you only the integers following a /MM and should work no matter how long the strings of integers are. If they're guaranteed to be a certain length, then you're better off with one of the other examples (which don't use regex).
if each line starts with /MM then just go through the file and print out line[3:7] e.g.
for line in file:
print line[3:7]

Categories