Converting all integers in a file to zero - python

I am new to Python and I am trying to scan through a file and convert any integer I find to a value of zero.
Is there a regex I could use, or some kind of function?

import re

def numbers_with_zero(file_):
    # Note: the current regex converts both floats and ints to 0.
    # To convert just ints (and to 1 instead of 0), use:
    # line = re.sub(r'[-+]?\d+', r'1', line)
    file_contents = []
    # read file, change lines
    with open(file_, 'r') as f:
        for line in f:
            # this regex takes care of ints, floats, and signs, if any
            # (the sign must be grouped with both alternatives, otherwise
            # a leading - or + is left behind for plain ints)
            line = re.sub(r'[-+]?(?:\d*\.\d+|\d+)', r'0', line)
            file_contents.append(line)
    # reopen file and write changed lines back
    with open(file_, 'w') as f:
        for line in file_contents:
            f.write(line)

Olle Muronde, you can find information about how to rewrite lines in a file in the post above. Each line of the file can be treated as a string, and the simplest way to replace some symbols with others is the re.sub function from the re module. I strongly recommend you read the Python documentation and use Google or Stack Overflow search more often, because plenty of good answers have already been posted.
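For example, a minimal demonstration of re.sub on a single made-up line:
import re

line = "x = 12, y = -3.5, z = 7"
print(re.sub(r'[-+]?(?:\d*\.\d+|\d+)', r'0', line))
# x = 0, y = 0, z = 0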

Related

How to prevent loss/miscount of data using File I/O with buffer/size hint Python

I am new to Python and writing several methods to process large log files (bigger than 5 GB). Through the research I did, I saw a lot of people using "with open" and specifying a size hint/buffer like so:
with open(filename, 'rb', buffering=102400) as f:
    time_data_count = 0
    logbinset = set()
    search_pattern = regex.compile(rb'\d+\((.)+\)\s+\d+\((.)+\)')
    for line in f:
        if search_pattern.search(line):
            x = search_pattern.search(line)
            print(x.group(1).decode())
            print(x.group(2).decode())
Another method (this one always returns None for some reason; I could use some help finding out why):
with open(filename, 'rb') as f:
    while True:
        memcap = f.read(102400)
        if not memcap:
            break
        text = re.search(rb'\d+\(.+\)\s+\d+\(.+\)', memcap)
        if text is None:
            print("none")
        else:
            print(text.group())
In these methods, I am trying to extract regex patterns from a 6 GB log file. My question is: I am worried that using buffers to chop the file into chunks could result in situations where the line containing the pattern is chopped in half, which would cause some data to go missing.
How do I make sure line integrity is kept? How do I make sure it only breaks up my file at the end of a line? How do I make sure I don't lose data between chunks? Or do the "with open" and read(102400) methods ensure lines are not split in half when breaking the file into chunks?
First of all, don't use 'rb'; use just 'r', which is for text. 'rb' is for binary data.
The read method reads as many characters as you specify, so you would end up with chopped lines. Use readline instead.
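A minimal sketch of that suggestion (filename is assumed to be defined; readline returns an empty string at end of file; the groups are written (.+) rather than the question's (.)+, so each captures the whole parenthesized text instead of only its last character):
import re

pattern = re.compile(r'\d+\((.+)\)\s+\d+\((.+)\)')

with open(filename, 'r') as f:
    while True:
        line = f.readline()
        if not line:
            break
        # each iteration sees one complete line, so a match
        # can never be split across two reads
        match = pattern.search(line)
        if match:
            print(match.group(1), match.group(2))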
The first variant is the correct one: set a buffer size when opening the file to get fewer read operations without losing matches that span read blocks.
If you are concerned with runtime, it would be a good idea to search just once in each line, rather than one search to determine whether there is a match and then the exact same search again to get at the values:
regex = re.compile(rb"\d+\((.)+\)\s+\d+\((.)+\)")
with open(filename, "rb", buffering=102400) as lines:
    for line in lines:
        match = regex.search(line)
        if match:
            print(match.group(1).decode())
            print(match.group(2).decode())
The for loop and the filtering can be moved into functions that are implemented in C (in CPython):
regex = re.compile(rb"\d+\((.)+\)\s+\d+\((.)+\)")
with open(filename, "rb", buffering=102400) as lines:
    for match in filter(bool, map(regex.search, lines)):
        print(match.group(1).decode())
        print(match.group(2).decode())
On a 64-bit Python you could also try the mmap module to map the file into memory and apply the regular expression to the whole content.
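A minimal sketch of the mmap approach, assuming the same filename and pattern as above (the re module can search a memory-mapped file directly because mmap objects support the buffer protocol):
import mmap
import re

regex = re.compile(rb"\d+\((.)+\)\s+\d+\((.)+\)")

with open(filename, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # the OS pages the file in lazily, so this also works for
        # files much larger than RAM on a 64-bit build
        for match in regex.finditer(mm):
            print(match.group(1).decode())
            print(match.group(2).decode())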

Use Python Regex to take string starting with a certain word all the way till end of line [duplicate]

This question already has answers here:
TypeError: expected string or buffer (5 answers)
Closed 3 years ago.
Trying to take specific lines from a file that is read and make them a usable variable that is returned.
For some information about the data in the file, the syntax goes like this:
A line of text I do not need
New domain: www.example.com
Another line that I do not need
New domain: www.example2.com
Etc.
It reads the file; I've tried a bunch of variations of the example regex pattern and know I'm close. Other than that it's rather straightforward.
data = []
with open('test.txt', 'r') as file:
    data = (re.findall(r"(?<=New domain:).+$", open('test.txt'), re.M))
    return data
Happy Path:
The function reads from the test.txt file, looks only at the lines that start with New domain:, takes just the URL all the way to the end of the line, and puts it into a list.
Errors:
It just tells me that the pattern syntax is wrong.
Your regex pattern is fine, but you can't pass a file object to findall. Try this instead:
data = re.findall(r"(?<=New domain:).+$", file.read(), re.M)
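In context, a complete sketch might look like this (note the lookbehind keeps the space after the colon, so the matches are stripped here):
import re

with open('test.txt', 'r') as file:
    data = re.findall(r"(?<=New domain:).+$", file.read(), re.M)
data = [d.strip() for d in data]  # drop the leading space after the colon
# ['www.example.com', 'www.example2.com']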
You need to read the file first before passing it to the re.findall() method. You can also simplify the regex.
import re

def find_domains():
    with open('test.txt', 'r') as file:  # 'r' is enough; the file is only read
        file_text = file.read()
    data = re.findall("New domain: (.*)", file_text)
    return data
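With the sample file from the question, calling the function would give:
print(find_domains())
# ['www.example.com', 'www.example2.com']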

How to read a very large integer (1000+ digits) that is saved in multiple lines in a text file using Python?

Suppose I have a very large integer (say, of 1000+ digits) that I've saved into a text file named 'large_number.txt'. But the problem is the integer has been divided into multiple lines in the file, i.e. like the following:
47451445736001306439091167216856844588711603153276
70386486105843025439939619828917593665686757934951
62176457141856560629502157223196586755079324193331
64906352462741904929101432445813822663347944758178
92575867718337217661963751590579239728245598838407
58203565325359399008402633568948830189458628227828
80181199384826282014278194139940567587151170094390
35398664372827112653829987240784473053190104293586
86515506006295864861532075273371959191420517255829
71693888707715466499115593487603532921714970056938
54370070576826684624621495650076471787294438377604
Now, I want to read this number from the file and use it as a regular integer in my program. I tried the following, but it doesn't give me what I need.
My try (Python):
with open('large_number.txt') as f:
    data = f.read().splitlines()
Is there any way to do this properly in Python 3.6? Or what is the best that can be done in this situation?
Just replace the newlines with nothing, then parse:
with open('large_number.txt') as f:
    data = int(f.read().replace('\n', ''))
If you might have arbitrary (ASCII) whitespace and you want to discard all of it, switch to:
import string

killwhitespace = str.maketrans(dict.fromkeys(string.whitespace))

with open('large_number.txt') as f:
    data = int(f.read().translate(killwhitespace))
Either way, that's significantly more efficient than processing line by line in this case, both in memory and runtime; because you need all the lines before you can parse anything, any line-by-line solution would be ugly.
You can use this code:
with open('large_number.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')
number = int(data)
You can use str.rstrip to remove the trailing newline characters and use str.join to join the lines into one string:
with open('large_number.txt') as file:
    data = int(''.join(line.rstrip() for line in file))
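A quick sanity check for any of these variants, assuming large_number.txt holds the eleven 50-digit sample lines above:
with open('large_number.txt') as f:
    n = int(f.read().replace('\n', ''))

print(len(str(n)))  # 550: eleven lines of 50 digits each
print(n % 2 == 0)   # True here; ordinary integer arithmetic just works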

Using python to parse a text file without delimiters

I have searched thoroughly, possibly with incorrect search terms, for a way to use Python to parse a text file WITHOUT the use of delimiters. All the prior discussion I found assumes the use of the csv library (with comma-delimited text), but since the input file does not use a comma-delimited format, csv does not seem to be the correct library to use.
For example, I would like to parse the 18th to 29th text character of each line regardless of context. The input file is general text, say, with each line 132 characters in length.
I could post an example input, but I don't see the point if the input is general text and is to be parsed without the use of any delimiting patterns.
Ideas?
The struct module can be used to parse fixed-length format files. Simply construct a format string using the appropriate length modifier for the s format character.
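For instance, a sketch under the question's assumption of fixed 132-character lines (filename is a placeholder; the x pad bytes skip positions 0-17 and 30-131, and 12s grabs characters 18-29; unpack raises struct.error if a line is not exactly 132 characters):
import struct

record = struct.Struct('18x 12s 102x')  # 18 + 12 + 102 = 132 bytes

with open(filename, 'rb') as f:
    for line in f:
        field, = record.unpack(line.rstrip(b'\r\n'))
        print(field.decode())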
Alternatively, plain slicing does the job without struct:
with open(filename, 'r') as f:
    for line in f:
        print(line[18:30])
You can simply use something like this:
res = []
fo = open(filename)  # open your file for reading ('r' is the default)
for line in fo:  # parse the file line by line
    res.append(line[18:30])  # extract the desired text from the current line
fo.close()
print(res)  # use the extracted data
If you want the 18th to 29th characters of every line...
f = open(<path>, 'r')
results = [line[18:30] for line in f.readlines() if len(line) > 29]
f.close()
for r in results:
    print(r)

For each line in a file, replace multiple-whitespace substring of variable length with line break

Using Python 2.7.1, I read in a file:
input = open(file, "rU")
tmp = input.readlines()
which looks like this:
>name -----meoidoad
>longname -lksowkdkfg
>nm --kdmknskoeoe---
>nmee dowdbnufignwwwwcds--
That is, each line contains a short run of whitespace, but the length of this run varies by line.
I would like to write a script that edits my tmp object such that when I write tmp to file, the result is
>name
-----meoidoad
>longname
-lksowkdkfg
>nm
--kdmknskoeoe---
>nmee
dowdbnufignwwwwcds--
I.e., I would like to break each line into two lines at that run of whitespace (and get rid of the spaces in the process).
The starting position of the string after the whitespace is always the same within a file, but may vary among the large batch of files I am working with. So I need a solution that does not rely on positions.
I've seen many similar questions on here, with many well-liked answers that use short regex scripts to do this, so it is possible I am duplicating a previous question. However, none of what I've seen so far has worked for me.
import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        outfile.write(re.sub(r'\s\s+', '\n', line))
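A quick check on one of the sample lines from the question (the run of spaces stands in for the file's whitespace):
import re

sample = '>name      -----meoidoad'
print(re.sub(r'\s\s+', '\n', sample))
# >name
# -----meoidoad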
If the file isn't huge (i.e. hundreds of MB), you can do this concisely with split() and join():
with open(file, 'rU') as f, open(outfilename, 'w') as o:
    o.write('\n'.join(f.read().split()))
I would also recommend against naming anything input, as that will mask the built-in.
