Using Python to parse a text file without delimiters

I have searched thoroughly, possibly with incorrect search terms, for a way to use Python to parse a text file WITHOUT the use of delimiters. All prior discussion found assumes the use of the CSV library (with comma delimited text) but since the input file does not use a comma-delimited format, csv does not seem to be the correct library to use.
For example, I would like to parse the 18th to 29th text character of each line regardless of context. The input file is general text, say, each line is 132 characters in length.
I could post an example input but don't see the point in it if the input is general text and is to be parsed without the use of any patterns to delimit.
Ideas?

The struct module can be used to parse fixed-length format files. Simply construct a format string using the appropriate length modifier for the s format character.
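For instance, a minimal sketch of that struct approach, assuming 132-character lines, a placeholder file name, and the same field positions as the slicing answers below (line[18:30]):
import struct
# 18 pad bytes, a 12-byte field, then 102 pad bytes: 18 + 12 + 102 = 132
record = struct.Struct("18x12s102x")
with open("data.txt", "rb") as f:  # binary mode, since struct works on bytes
    for raw in f:
        raw = raw.rstrip(b"\r\n")  # drop the line ending before unpacking
        if len(raw) != record.size:  # skip lines that are not exactly 132 bytes
            continue
        (field,) = record.unpack(raw)
        print(field.decode("ascii"))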

with open(filename, 'r') as f:
    for line in f:
        print(line[18:30])

You can simply use something like this:
Res = []
fo = open(filename)  # open your file for reading ('r' by default)
for line in fo:  # parse the file line by line
    Res.append(line[18:30])  # extract the desired text from the current line
fo.close()
print(Res)  # exploit the extracted data

If you want the 18th to 29th characters of every line...
f = open(<path>, 'r')
results = [line[18:30] for line in f.readlines() if len(line) > 29]
f.close()
for r in results:
    print(r)

Related

I am looking for a function which helps to read a string from a file after a specific special character in Python

I am a beginner in Python programming and am looking for a function that helps me read out each line of a file after a specific character, for example:
Here is the format of the text file:
<ABC>
language \sometext.com xyz
The text file is full of these sample sentences, and I need only the string between '\' and '.' ("sometext" in the above example).
Here is my code, but I could not get the correct output:
f = open("test.txt", "r")
for x in f:
if "\\" in x:
x = x.rstrip('\\')
print(x)
With the above code, I am just getting the output of the first line, like:
output:
language sometext.com xyz
You are calling readline twice, overwriting the line variable with the second line in the text file. The second and third lines in your code effectively do nothing.
EDIT: The original question was edited, the problem is now slightly different. My advice about using regex still stands.
I would use a regex, with Python's built-in re module:
import re
regex = re.compile(r"\\(.+)\.")  # Pattern matching anything between \ and .
with open("test.txt", "r") as file:
    results = regex.findall(file.read())
print(results)
# Returns a list of every sub-string between \ and . in the text file.
If you want to do it line by line:
file = open("test.txt", "r")
line = file.readline()
result = regex.search(line).group(1) # ".group(1)" makes sure the \ and . are not included
print(result)
# then you can continue with the next line
line = file.readline()
result = regex.search(line).group(1)
print(result)
# etc
# You can do this in a loop
# or with file.readlines() which returns a list of all the lines in the file
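For completeness, a minimal sketch of the loop version hinted at in the comments above, reusing test.txt and the compiled regex from the earlier snippet:
with open("test.txt", "r") as file:
    for line in file:
        match = regex.search(line)
        if match:  # skip lines that contain no backslash-to-dot pattern
            print(match.group(1))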
If you want more info on regex (regular expressions) in python, check out this good introduction: https://automatetheboringstuff.com/2e/chapter7/
or the official documentation:
https://docs.python.org/3/library/re.html

Python CSV remove new lines denoted by &#x0D

I have a BCP file that contains lots of &#x0D; carriage return symbols. They are not meant to be there, and I have no control over the original output, so I am left with trying to parse the file to remove them.
A sample of the data looks like....
"test1","apples","this is
some sample","3877"
"test66","bananas","this represents more
wrong data","378"
I am trying to end up with...
"test1","apples","this is some sample","3877"
"test66","bananas","this represents more wrong data","378"
Is there a simple way to do this, preferably using Python's csv module?
You can try:
import re
with open("old.csv") as f, open("new.csv", "w") as w:
for line in f:
line = re.sub(r"
\s*", "", line)
w.write(line)
"test1","apples","this is some sample","3877"
"test66","bananas","this represents more wrong data","378"
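Since the stray line breaks sit inside quoted fields, Python's csv module (which the question asks about) can also reassemble the records. A sketch, with old.csv and new.csv as placeholder names:
import csv
with open("old.csv", newline="") as f, open("new.csv", "w", newline="") as w:
    reader = csv.reader(f)
    writer = csv.writer(w, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        # collapse embedded CR/LF (and surrounding spaces) inside each field
        cleaned = [" ".join(field.split()) for field in row]
        writer.writerow(cleaned)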

read the header and replace a column value with another one in Python

I am a newbie to Python and I am trying to read in a file with the below format:
ORDER_NUMBER!Speed_Status!Days!
10!YES!100!
10!NO!100!
10!TRUE!100!
And the output to be written to the same file is
ORDER_NUMBER!STATUS!Days!
10!YES!100!
10!NO!100!
10!TRUE!100!
So far I have tried:
# a file named "repo" will be opened with the reading mode
file = open('repo.dat', 'r+')
# This will print every line one by one in the file
for line in file:
    if line.startswith('ORDER_NUMBER'):
        words = [w.replace('Speed_Status', 'STATUS') for w in line.partition('!')]
        file.write(words)
input()
But somehow it's not working. What am I missing?
Read file ⇒ replace content ⇒ write to file:
with open('repo.dat', 'r') as f:
    data = f.read()
data = data.replace('Speed_Status', 'STATUS')
with open('repo.dat', 'w') as f:
    f.write(data)
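If you do want to stay with update mode ('r+') as in your attempt, a minimal sketch of the usual read / seek / truncate pattern on the same repo.dat:
with open('repo.dat', 'r+') as f:
    data = f.read().replace('Speed_Status', 'STATUS')
    f.seek(0)  # rewind to the start before rewriting
    f.write(data)
    f.truncate()  # drop leftover bytes in case the new text is shorter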
The ideal way would be to use the fileinput module to replace the file contents in-place, instead of opening the file in update mode r+:
from __future__ import print_function
import fileinput
for line in fileinput.input("repo.dat", inplace=True):
    if line.startswith('ORDER_NUMBER'):
        print(line.replace("Speed_Status", "STATUS"), end="")
    else:
        print(line, end="")
As for why your attempt didn't work: the logic used to build words is incorrect. When you partition the line on '!', what you get back is the list ['ORDER_NUMBER', '!', 'STATUS!Days!\n'], with the newline still embedded. Also, write() will not accept a list; you need to join it back into a single string before writing it.
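A minimal sketch of that fix, using the header line from the question: join the partitioned pieces back into one string before writing.
line = "ORDER_NUMBER!Speed_Status!Days!\n"
words = [w.replace('Speed_Status', 'STATUS') for w in line.partition('!')]
fixed = ''.join(words)  # 'ORDER_NUMBER!STATUS!Days!\n' -- a str that write() accepts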

FileInput as line versus fileinput as string

I have a list of files that I want to iterate over with RegEx replacements, some on individual lines, some that require multiline matches.
I am able to iterate over lines in a list of files and write to disk using this method.
import fileinput, re
ListFiles = ['in/spam.txt', 'in/eggs.txt', 'in/spam2.txt', 'in/eggs2.txt',
             'in/spam3.txt', 'in/eggs3.txt', 'in/spam4.txt', 'in/eggs4.txt',
             'in/spam5.txt', 'in/eggs5.txt']
with fileinput.input(files=ListFiles, inplace=True, backup='.bak') as f:
    for line in f:
        line = re.sub(r'this', 'that', line)
        print(line, end='')
Now I want to gather the output lines in f as a string, for which I can run multiline RegEx routines.
I tried with open(), which I have been able to use with regex on a single file, but it does not take a list as an argument, only a file name.
with open("spam.txt", "w") as f: # sample other use, list not allowed here.
data = f.read()
data = re.sub(r'sample', r'sample2', data)
print(data, file=f)
And I tried to gather f as a string into a new variable data, as follows:
data = f(str)
data = re.sub(r'\\sc\{(.*?)\}', r'<hi rend="small_caps">\1</hi>', data)  # Ignore that this is not a multiline regex; it is for sample purposes only
print(data)
But that produces an error: FileInput is not callable.
Is there a way to iterate and apply regex to the files line by line, and also to the same files as whole strings, within the same with statement?
If it is OK to read each file into memory as a whole, then to perform multiline replacements on a list of files you could process one file at a time:
for filename in ListFiles:
    with open(filename) as file:
        text = file.read()  # read the whole file into memory
    text = text.replace('sample\n1', 'sample2')  # make replacements
    with open(filename, 'w') as file:
        file.write(text)  # rewrite the file
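For the multiline regex case specifically, a sketch under the question's own patterns (it assumes ListFiles from above; flags=re.DOTALL lets .*? span line breaks), applying a simple substitution and a multiline-capable one in the same pass:
import re
for filename in ListFiles:
    with open(filename) as file:
        text = file.read()
    text = re.sub(r'this', 'that', text)  # plain, line-level style replacement
    text = re.sub(r'\\sc\{(.*?)\}', r'<hi rend="small_caps">\1</hi>',
                  text, flags=re.DOTALL)  # replacement whose match may span lines
    with open(filename, 'w') as file:
        file.write(text)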

Converting all integers in a file to zero

I am new to Python and I am trying to scan through a file and convert any integer I find to a value of 1.
Is there a regex I could use, or some kind of function?
def numbers_with_zero(file_):
    import re
    # Note: the current regex will convert floats and ints to 0;
    # if one wants to convert just ints, and convert them to 1,
    # use line = re.sub(r'[-+]?\d+', r'1', line)
    file_contents = []
    # read file, change lines
    with open(file_, 'r') as f:
        for line in f:
            # this regex should take care of ints, floats, and signs, if any
            line = re.sub(r'[-+]?\d*\.\d+|\d+', r'0', line)
            file_contents.append(line)
    # reopen file and write changed lines back
    with open(file_, 'w') as f:
        for line in file_contents:
            f.write(line)
Olle Muronde, you can find information about how to rewrite lines in a file in the post. Each line of a file can be treated as a string, and the simplest way to replace some symbols with others is the re.sub function from the re module. I strongly recommend learning the Python documentation and using Google or Stack Overflow search more often, because plenty of good answers have already been posted.
