String handling with regex - Python

Hi friends, I hope you are all well!
I need help using regex in Python.
I need to normalize some numbers so that they all follow the same pattern. Let me explain.
Numbers (example):
0206071240004013000
04073240304015000
0001304-45.2034.4.01.2326
I need the script to read the numbers and change them so that they all have the following pattern:
20 numeric characters (numbers with fewer digits must be padded with "0" on the left):
04073240304015000 = 00004073240304015000
Then insert "-" and "." like this:
0000407-32.4030.4.01.5000
I was writing the code as follows: first I remove the non-numeric characters, then I check whether the result has 20 digits and pad it if it does not. Now I need to insert the punctuation, but I'm having difficulties.
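import re

with open("num.txt", "r") as arquivo:
    dados = arquivo.readlines()

for num in dados:
    # Drop everything that is not a digit (this also removes the newline).
    non_numeric = re.sub("[^0-9]", "", num)
    # Left-pad with zeros to 20 characters.
    characters = f'{non_numeric:0>20}'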

Try:
import re
txt = '''\
0206071240004013000
04073240304015000
0001304-45.2034.4.01.2326'''
for n in re.findall(r'[\d.-]+', txt):
    n = '{:0>20}'.format(n.replace('.', '').replace('-', ''))
    print('{}-{}.{}.{}.{}.{}'.format(n[:7], n[7:9], n[9:13], n[13:14], n[14:16], n[16:]))
Prints:
0020607-12.4000.4.01.3000
0000407-32.4030.4.01.5000
0001304-45.2034.4.01.2326
EDIT: To read the text from file and write to a new file you can do:
import re
with open('in.txt', 'r') as f_in, open('out.txt', 'w') as f_out:
    for n in re.findall(r'[\d\.\-]+', f_in.read()):
        n = '{:0>20}'.format(n.replace('.', '').replace('-', ''))
        print('{}-{}.{}.{}.{}.{}'.format(n[:7], n[7:9], n[9:13], n[13:14], n[14:16], n[16:]), file=f_out)
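As a variation on the answer above, the padding and the punctuation can also be done with str.zfill and a single re.sub call that uses capture groups. This is only a sketch: the name format_number is made up for illustration, and rejecting inputs with more than 20 digits is an assumption, since the question does not say what should happen to them.

```python
import re

# Six groups matching the target layout NNNNNNN-NN.NNNN.N.NN.NNNN (20 digits).
PATTERN = re.compile(r'(\d{7})(\d{2})(\d{4})(\d)(\d{2})(\d{4})')


def format_number(raw):
    """Hypothetical helper: normalize one number to the 20-digit pattern."""
    digits = re.sub(r'\D', '', raw)   # keep digits only
    digits = digits.zfill(20)         # left-pad with zeros up to 20 digits
    if len(digits) != 20:
        # Assumption: inputs with more than 20 digits are treated as invalid.
        raise ValueError(f'too many digits: {raw!r}')
    return PATTERN.sub(r'\1-\2.\3.\4.\5.\6', digits)


for raw in ['0206071240004013000', '04073240304015000', '0001304-45.2034.4.01.2326']:
    print(format_number(raw))
```

Keeping the layout in one pattern means the group widths are written once, instead of being spread over six slice expressions.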

Related

Changing list to string to remove characters

I have a file that I am trying to do a word frequency list on, but I'm having trouble with the list and string aspects. I changed my file to a string to remove numbers from the file, but that ends up messing up the tokenization. The expected output is a word count of the file I am opening excluding numbers, but what I get is the following:
Counter({'<_io.TextIOWrapper': 1, "name='german/test/polarity/negative/neg_word_list.txt'": 1, "mode='r'": 1, "encoding='cp'>": 1})
done
Here's the code:
import re
from collections import Counter
def word_freq(file_tokens):
    global count
    for word in file_tokens:
        count = Counter(file_tokens)
    return count
f = open("german/test/polarity/negative/neg_word_list.txt")
clean = re.sub(r'[0-9]', '', str(f))
file_tokens = clean.split()
print(word_freq(file_tokens))
print("done")
f.close()
This ended up working, thanks to Rakesh:
import re
from collections import Counter
def word_freq(file_tokens):
    global count
    for word in file_tokens:
        count = Counter(file_tokens)
    return count
f = open("german/test/polarity/negative/neg_word_list.txt")
clean = re.sub(r'[0-9]', '', f.read())
file_tokens = clean.split()
print(word_freq(file_tokens))
print("done")
f.close()
Reading further, I noticed that you didn't read the file, you only opened it.
If you print the object that open() returns:
f = open("german/test/polarity/negative/neg_word_list.txt")
print(f)
You'll notice it will tell you what the object is, "io.TextIOWrapper". So you need to read it:
f_path = open("german/test/polarity/negative/neg_word_list.txt")
f = f_path.read()
f_path.close()  # don't forget this, it releases the file handle
print(f)
# >>> what's really inside the file
Or, another way to do this without calling close():
# adjust your encoding
with open("german/test/polarity/negative/neg_word_list.txt", encoding="utf-8") as r:
    f = r.read()
Note that read() returns the whole file as a single string rather than a list, so if you want a list you could iterate over the lines instead:
list_of_lines = []
# adjust your encoding
with open("german/test/polarity/negative/neg_word_list.txt", encoding="utf-8") as r:
    # read each line and append to list
    for line in r:
        list_of_lines.append(line)
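Putting the pieces together, the counting itself needs no loop and no global, because Counter accepts the token list directly. The following is a minimal sketch that works on a string; the sample text is made up and stands in for whatever f.read() returns.

```python
import re
from collections import Counter


def word_freq(text):
    """Count whitespace-separated tokens after removing digits."""
    clean = re.sub(r'[0-9]', '', text)
    return Counter(clean.split())


# Hypothetical sample text standing in for the file contents.
sample = "gut 1 schlecht 22 gut\nschlecht3 gut"
print(word_freq(sample))
```

To use it on the real file, pass the result of read() from inside a with block, for example word_freq(r.read()).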

How can I order the contents of a text file

I have a text file with about 2000 numbers, which are written to the file in random order. How can I sort them from within Python? Any help is appreciated.
file = open('file.txt', 'w', newline='')
s = (f'{item["Num"]}')
file.write(s + '\n')
file.close()
read = open('file.txt', 'a')
sorted(read)
You need to:
read the contents of the file: open('file.txt', 'r').read().
split the content using a separator: separator.split(contents)
convert each item to a number, otherwise, you won't be able to sort numerically: int(item)
sort the numbers: sorted(list_of_numbers)
Here is a code example, assuming the file is space separated and that the numbers are integers:
import re
file_contents = open("file.txt", "r").read()  # read the contents
separator = re.compile(r'\s+', re.MULTILINE)  # create a regex separator
numbers = []
for i in separator.split(file_contents):  # use the separator
    try:
        numbers.append(int(i))  # convert to integers and append
    except ValueError:  # if the item is not an integer, skip it
        pass
sorted_numbers = sorted(numbers)
You can now append the sorted content to another file:
with open("toappend.txt", "a") as appendable:
    # join() needs strings, so convert each number first
    appendable.write(" ".join(str(n) for n in sorted_numbers))
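The same steps can be sketched without a regex, since str.split() with no argument already splits on any run of whitespace, newlines included. The function name and the sample text below are made up for illustration; the sample stands in for the contents of file.txt.

```python
def sort_number_text(text):
    """Return the integers found in whitespace-separated text, sorted ascending."""
    numbers = []
    for token in text.split():      # no argument: split on any whitespace
        try:
            numbers.append(int(token))
        except ValueError:
            pass                    # skip tokens that are not integers
    return sorted(numbers)


# Hypothetical sample standing in for the contents of file.txt.
sample = "42\n7\nabc\n1000\n-3\n"
print("\n".join(str(n) for n in sort_number_text(sample)))
```

Converting with int() before sorting matters: sorting the raw strings would put "1000" before "42".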

Python: replace a string in a CSV file

I am a beginner and I have an issue with a short piece of code. I want to replace a string in a CSV file with another string and write the result to a new CSV file with a new name. The strings are separated by commas.
My code is a catastrophe:
import csv
f = open('C:\\User\\Desktop\\Replace_Test\\Testreplace.csv')
csv_f = csv.reader(f)
g = open('C:\\Users\\Desktop\\Replace_Test\\Testreplace.csv')
csv_g = csv.writer(g)
findlist = ['The String, that should replaced']
replacelist = ['The string that should replace the old striong']
#the function ?:
def findReplace(find,replace):
    s = f.read()
    for item, replacement in zip(findlist,replacelist):
        s = s.replace(item,replacement)
    g.write(s)
for row in csv_f:
    print(row)
f.close()
g.close()
You can do this with the regex module re. Also, if you use with, you don't have to remember to close your files, which helps me.
EDIT: Keep in mind that this matches the exact string, meaning it's also case-sensitive. If you don't want that, then you probably need to use an actual regex to find the strings that need replacing. You would do this by replacing re.escape(find_str) in the re.sub() call with r'your_regex_here'.
import re
# open your csv and read as a text string
# (my_csv_path is the path to your input file)
with open(my_csv_path, 'r') as f:
    my_csv_text = f.read()
find_str = 'The String, that should replaced'
replace_str = 'The string that should replace the old striong'
# substitute; re.escape() makes any regex metacharacters in find_str match literally
new_csv_str = re.sub(re.escape(find_str), replace_str, my_csv_text)
# open new file and save
new_csv_path = './my_new_csv.csv' # or whatever path and name you want
with open(new_csv_path, 'w') as f:
    f.write(new_csv_str)
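For a purely literal replacement, str.replace is enough and avoids regex escaping altogether; a regex is only needed for something like case-insensitive matching. The helper below is a hypothetical sketch of both paths (its name, its parameters, and the sample data are not from the original answer).

```python
import re


def replace_literal(text, find, replacement, ignore_case=False):
    """Hypothetical helper: replace every literal occurrence of `find` in `text`."""
    if not ignore_case:
        return text.replace(find, replacement)  # plain string replace, no regex
    pattern = re.compile(re.escape(find), re.IGNORECASE)
    # A function replacement keeps backslashes in `replacement` from being
    # interpreted as group references.
    return pattern.sub(lambda m: replacement, text)


sample = "id,price\n1,$5.00\n2,$5.00 (sale)\n"
print(replace_literal(sample, "$5.00", "$4.50"))
```

Without re.escape, characters such as "$" and "." in the search string would be treated as regex syntax and the pattern would not match the literal text.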

Stripping numbers and dates from the end of a string until the first letter is found

I am looking for an efficient way to strip numbers, dates, or any other characters from the end of a string until the first letter is found.
string - '12.abd23yahoo 04/44 231'
Output - '12.abd23yahoo'
line_inp = "12.abd23yahoo 04/44 231"
line_out = line_inp.rstrip('0123456789./')
This rstrip() call doesn't seem to work as expected; I get '12.abd23yahoo 04/44 ' instead.
I am trying the code below, and it doesn't seem to work either.
for fname in filenames:
    with open(fname) as infile:
        for line in infile:
            outfile.write(line.rstrip('0123456789./ '))
You need to strip spaces too:
line_out = line_inp.rstrip('0123456789./ ')
Demo:
>>> line_inp = "12.abd23yahoo 04/44 231"
>>> line_inp.rstrip('0123456789./ ')
'12.abd23yahoo'
You need to strip the newlines and add them again before you write:
for fname in filenames:
    with open(fname) as infile:
        outfile.writelines(line.rstrip('0123456789./ \n') + "\n"
                           for line in infile)
If the format is always the same you can just split:
with open(fname) as infile:
    outfile.writelines(line.split(None, 1)[0] + "\n"
                       for line in infile)
Here's a solution using a regular expression:
import re
line_inp = "12.abd23yahoo 04/44 231"
r = re.compile('^(.*[a-zA-Z])')
m = re.match(r, line_inp)
line_out = m.group(0) # 12.abd23yahoo
The regular expression greedily matches arbitrary characters from the start of the string up to and including the last letter.
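An alternative that does not depend on listing every character to strip is to remove the trailing run of non-letters with re.sub. This is a sketch that assumes "letter" means ASCII A-Z and a-z; the function name is made up for illustration.

```python
import re

# One or more non-letters anchored at the end of the string.
TRAILING_NON_LETTERS = re.compile(r'[^A-Za-z]+$')


def strip_to_last_letter(line):
    """Remove everything after the last ASCII letter, including a newline."""
    return TRAILING_NON_LETTERS.sub('', line)


print(strip_to_last_letter("12.abd23yahoo 04/44 231"))
```

Because a newline is also a non-letter, lines read from a file lose their line ending as well, so add "\n" back before writing them out.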

Need to remove line breaks from a text file with certain conditions

I have a text file running into 20,000 lines. A block of meaningful data for me would consist of name, address, city, state,zip, phone. My file has each of these on a new line, so a file would go like:
StoreName1
, Address
, City
,State
,Zip
, Phone
StoreName2
, Address
, City
,State
,Zip
, Phone
I need to create a CSV file and will need the above information for each store in 1 single line :
StoreName1, Address, City,State,Zip, Phone
StoreName2, Address, City,State,Zip, Phone
So essentially, I am trying to remove \r\n only at the appropriate points. How do I do this with Python's re module? Examples would be very helpful, as I am a newbie at this.
Thanks.
s/[\r\n]+,/,/g
Globally substitute 'linebreak(s),' with ','
Edit:
If you want to reduce it further with a single linebreak between records:
s/[\r\n]+(,|[\r\n])/$1/g
Globally substitute 'linebreak(s) followed by a comma or a linebreak' with capture group 1.
Edit:
And, if it really gets out of whack, this might cure it:
s/[\r\n]+\s*(,|[\r\n])\s*/$1/g
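The substitutions above are written in Perl/sed notation. In Python the same idea goes through re.sub, with the capture group written as \1 instead of $1. Below is a sketch of the second substitution; the function name is made up, and the sample assumes one field per line with a blank line between stores.

```python
import re


def join_records(text):
    """Python version of the second substitution above."""
    # Linebreak(s) followed by a comma or a linebreak are replaced by
    # whatever the group captured (the comma, or a single linebreak).
    return re.sub(r'[\r\n]+(,|[\r\n])', r'\1', text)


# Assumed layout: one field per line, a blank line between stores.
sample = (
    "StoreName1\n, Address\n, City\n,State\n,Zip\n, Phone\n"
    "\n"
    "StoreName2\n, Address\n, City\n,State\n,Zip\n, Phone\n"
)
print(join_records(sample))
```

This reads the whole text into memory, which is usually fine for a file of about 20,000 lines.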
This iterator/generator version doesn't require reading the entire file into memory at once:
from itertools import groupby
with open("inputfile.txt") as f:
    groups = groupby(f, key=str.isspace)
    for row in ("".join(map(str.strip,x[1])) for x in groups if not x[0]):
        ...
Assuming the data is "normal" - see my comment - I'd approach the problem this way:
with open('data.txt') as fhi, open('newdata.txt', 'w') as fho:
    # Iterate over the input file.
    for store in fhi:
        # Read in the rest of the pertinent data
        fields = [next(fhi).rstrip() for _ in range(5)]
        # Generate a list of all fields for this store.
        row = [store.rstrip()] + fields
        # Output to the new data file.
        fho.write('%s\n' % ''.join(row))
        # Consume a blank line in the input file, if there is one left.
        next(fhi, None)
First mind-numbing solution
import re
ch = ('StoreName1\r\n'
', Address\r\n'
', City\r\n'
',State\r\n'
',Zip\r\n'
', Phone\r\n'
'\r\n'
'StoreName2\r\n'
', Address\r\n'
', City\r\n'
',State\r\n'
',Zip\r\n'
', Phone')
regx = re.compile(r'(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  r'(.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,.+?)\r\n(,[^\r\n]+)')
# text mode with newline='' writes the explicit \r\n unchanged (Python 3)
with open('csvoutput.txt', 'w', newline='') as f:
    f.writelines(''.join(mat.groups())+'\r\n' for mat in regx.finditer(ch))
ch mimics the content of a file on a Windows platform (newlines == \r\n)
Second mind-numbing solution
regx = re.compile(r'(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  r'.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,.+?\r\n,[^\r\n]+')
with open('csvoutput.txt', 'w', newline='') as f:
    f.writelines(mat.group().replace('\r\n','')+'\r\n' for mat in regx.finditer(ch))
Third mind-numbing solution, if you want to create a CSV file with delimiters other than commas:
regx = re.compile(r'(?:(?<=\r\n\r\n)|(?<=\A)|(?<=\A\r\n))'
                  r'(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,(.+?)\r\n,([^\r\n]+)')
import csv
with open('csvtry3.txt', 'w', newline='') as f:
    csvw = csv.writer(f,delimiter='#')
    for mat in regx.finditer(ch):
        csvw.writerow(mat.groups())
EDIT 1
You are right, tchrist, the following solution is far simpler:
regx = re.compile('(?<!\r\n)\r\n')
with open('csvtry.txt', 'w', newline='') as f:
    f.write(regx.sub('',ch))
EDIT 2
A regex isn't required:
with open('csvtry.txt', 'w', newline='') as f:
    f.writelines(x.replace('\r\n','')+'\r\n' for x in ch.split('\r\n\r\n'))
EDIT 3
Working on a file instead of ch:
An "à la gnibbler" solution, for cases where the file is too big to be read into memory all at once:
from itertools import groupby
with open('csvinput.txt','r') as f,open('csvoutput.txt','w') as g:
    groups = groupby(f,key= lambda v: not str.isspace(v))
    g.writelines(''.join(x).replace('\n','')+'\n' for k,x in groups if k)
I have another solution with regex:
import re
regx = re.compile(r'^((?:.+?\n)+?)(?=\n|\Z)',re.MULTILINE)
with open('input.txt','r') as f,open('csvoutput.txt','w') as g:
    g.writelines(mat.group().replace('\n','')+'\n' for mat in regx.finditer(f.read()))
I find it similar to the gnibbler-like solution.
f = open(infilepath, 'r')
s = ''.join([line for line in f])
s = s.replace('\n\n', '\\n')
s = s.replace('\n', '')
s = s.replace("\\n", "\n")
f.close()
f = open(infilepath, 'w')  # reopen for writing; write() fails on a file opened with 'r'
f.write(s)
f.close()
That should do it. It will overwrite your input file with the new format.
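Most of the answers above rely on a blank line between stores. If that cannot be guaranteed, another option is to key on the leading comma instead. The generator below is a sketch under that assumption: a line starting with a comma continues the current record, and any other non-blank line starts a new one. The function name and the shortened sample are made up for illustration.

```python
def merge_continuations(lines):
    """Yield one joined record per store.

    Assumption: a line whose first character (after stripping) is a comma
    continues the current record; any other non-blank line starts a new one.
    """
    record = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank separator lines
        if line.startswith(',') or not record:
            record.append(line)
        else:
            yield ''.join(record)         # previous record is complete
            record = [line]
    if record:
        yield ''.join(record)             # flush the last record


# Shortened hypothetical sample; a file object can be passed the same way.
sample = ["StoreName1\n", ", Address\n", ",Zip\n", "\n",
          "StoreName2\n", ", Address\n", ",Zip\n"]
for row in merge_continuations(sample):
    print(row)
```

Since it consumes lines one at a time, it can be fed an open file object directly and never holds more than one record in memory.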
