Python script to remove certain things from a string - python

I have a file with many lines like this:
>6_KA-RFNB-1505/2021-EPI_ISL_8285588-2021-12-02
I need to convert it to
>6_KA_2021-1202
All of the lines that require this change start with a >.
The 6_KA and the 2021-12-02 are different for all lines.
I also need to add an empty line before every line that I change in this manner.

UPDATE: You changed the requirements after I originally answered your post, but the code below does what you are looking for. The principle remains the same: use a regex to identify the parts of the string you are looking to replace. Then, as you go through each line of the file, create a new string from the values the regex parsed out.
import re

# Capture the leading ID (e.g. 6_KA) and the trailing date (e.g. 2021-12-02).
regex = re.compile(r'>(?P<first>[0-9a-zA-Z]{1,3}_[0-9a-zA-Z]{1,3}).*(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})\s*$')

def convert_file(input_file):
    with open(input_file, 'r') as infile, open('Output.txt', 'w') as outfile:
        for line in infile:
            match = regex.match(line)
            if match:
                # Empty line first, then the shortened header with its leading '>'.
                outfile.write("\n>" + match.group("first") + '_' + match.group("year") + "-" + match.group("month") + match.group("day") + "\n")
            else:
                outfile.write(line)

convert_file('data.txt')
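To try the pattern without touching any files, the same idea can be wrapped in a small function and run against the sample header from the question. This is only a sketch: the names HEADER and convert_header are made up for illustration, and it assumes the date is always the last thing on the line.

```python
import re

# Hypothetical helper, not part of the original answer.
HEADER = re.compile(
    r">(?P<first>[0-9a-zA-Z]{1,3}_[0-9a-zA-Z]{1,3})"  # leading ID, e.g. 6_KA
    r".*"                                             # everything in between
    r"(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})\s*$"
)


def convert_header(line):
    """Return the shortened header, or None if the line is not a header."""
    m = HEADER.match(line)
    if m is None:
        return None
    return ">{first}_{year}-{month}{day}".format(**m.groupdict())


sample = ">6_KA-RFNB-1505/2021-EPI_ISL_8285588-2021-12-02\n"
print(convert_header(sample))
print(convert_header("ACGTACGT\n"))
```

Because `.*` is greedy, the date groups bind to the last date-like run on the line, which is what the sample needs.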

Related

Reformatting a txt file with characters at index positions using python

Very newbie programmer asking a question here. I have searched all over the forums but can't find something to solve this issue I thought there would be a simple function for. Is there a way to do this?
I am trying to reformat a text file so I can use it with the pandas function but this requires my data to be in a specific format.
Currently my data is in the following format of a txt file with over 1000 lines of data:
["01/09/21","00:28",7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0]
["01/09/21","00:58",7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0
]["01/09/21","01:28",6.9,74,2.6,4.1,4.1,248,0.0,0.0,1025.8,81.9,17.0,44,4.1,4.1,6.9,0,0,0.00,0.00,2.5,0,0.0,248,0.0,0.0
I need it as
["01/09/21","00:28",7.1,75,3.0,3.7,3.7,292,0.0,0.0,1025.8,81.9,17.1,44,3.7,4.6,7.1,0,0,0.00,0.00,3.0,0,0.0,292,0.0,0.0]
["01/09/21","00:58",7.0,75,2.9,5.1,5.1,248,0.0,0.0,1025.9,81.9,17.0,44,5.1,3.8,7.0,0,0,0.00,0.00,1.9,0,0.0,248,0.0,0.0]
This requires adding [" at the start of the line and a " after the date, before the comma, then another " after the comma and another " at the end of the time section. I also need to add a ] at the end of the line.
I thought something like this would work, but the closing bracket appears after the line break (\n). Is there any way to avoid this?
infile=open(infile)
outfile=open(outfile, 'w')

def format_line(line):
    elements = line.split(',') # break the comma-separated data up
    for k in range(2):
        elements[k] = '"' + elements[k] + '"' # put quotes around the first two elements
        print(elements[k])
    new_line = ','.join(elements) # put them back together
    return '[' + new_line + ']' # add the brackets

for line in infile:
    outfile.write(format_line(line))
outfile.close()
You are referring to a function before it is defined.
Move the definition of format_line before it is called in the for loop.
When I rearranged your code it seems to work.
New code:
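infile = open("inputfile")  # input file name assumed here
outfile = open("outputfile", "w")

def format_line(line):
    line = line.rstrip('\n')  # drop the line break so ']' is not written after it
    elements = line.split(',') # break the comma-separated data up
    for k in range(2):
        elements[k] = '"' + elements[k] + '"' # put quotes around the first two elements
    new_line = ','.join(elements) # put them back together
    return '[' + new_line + ']' # add the brackets

for line in infile:
    outfile.write(format_line(line) + '\n')

infile.close()
outfile.close()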

python open csv search for pattern and strip everything else

I have a CSV file 'svclist.csv' which contains a single-column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip everything from each line except the PL5 directory and the two digits in the last directory,
so the result should look like this:
PL5,00
PL5,01
I started the code as follows:
clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2})
For an explanation, copy the regex and paste it into https://regexr.com. The site shows how each part of the regex works, so you can make any changes you need.
import re

test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
                    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []
for test_string in test_string_list:
    matchArray = regex.findall(test_string)  # list of (group 1, group 2) tuples
    result.append(matchArray[0])
with open('outfile.txt', 'w') as f:
    for row in result:
        f.write(','.join(row) + '\n')  # ('PL5', '00') -> PL5,00
In the above code, I create an empty list to hold the tuples returned by findall, then write them to the file. Each tuple, such as ('PL5', '00'), is joined with a comma, which gives PL5,00 without the parentheses and quotes that str(row) would include.
The result is written to 'outfile.txt', one line per input string.
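For readers who would rather see the pattern spelled out than paste it into a regex tester, here is a sketch that writes the same regex with re.VERBOSE so each piece carries a comment. The names PROFILE and sid_and_number are invented for this example.

```python
import re

# Same pattern as above, split into commented pieces with re.VERBOSE.
PROFILE = re.compile(
    r"""
    (PL5)         # group 1: the literal system ID
    [^/]          # one character that is not '/', so 'PL5/' is skipped
    .{0,}         # anything, greedily (equivalent to .*)
    ([0-9]{2,2})  # group 2: the last two adjacent digits (same as [0-9]{2})
    """,
    re.VERBOSE,
)


def sid_and_number(path):
    """Return 'PL5,NN' for a profile path, or None if nothing matches."""
    m = PROFILE.search(path)
    return ",".join(m.groups()) if m else None


print(sid_and_number("pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"))
print(sid_and_number("pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs"))
```

Using search instead of findall returns a single match object, which avoids indexing into a list when only the first hit matters.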
You can use regex for this, (in general, when trying to extract a pattern this might be a good option)
import re

pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks whether "profile" is in the line (the same test as __contains__), then extracts the last two adjacent digits according to the pattern.
I assumed that the number always sits between the two underscores. You could run something similar to this inside your for loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_") # splits the string at the underscores
output = test_list[1].strip(
    "abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz")) # strips letters from both ends
try:
    int(output) # raises ValueError if anything other than digits is left
    print(f"PL5,{output}")
except ValueError:
    print(f'Something went wrong! Output is PL5,{output}')

Minor bug in code written to format a text file (incorrect spacing) (Python 3)

New to coding so sorry if this is a silly question.
I have some text that I'm attempting to format to make it more pleasant to read, so I tried my hand at writing a short program in Python to do it for me. I initially removed extra paragraph breaks in MS-Word using the find-and-replace option. The input text looks something like this:
This is a sentence. So is this one. And this.
(empty line)
This is the next line
(empty line)
and some lines are like this.
I want to eliminate all empty lines, so that there is no spacing between lines, and ensure no sentences are left hanging mid-way like in the bit above. All new lines should begin with 2 (two) empty spaces, represented by the $ symbol below. So after formatting it should look something like this:
$$This is a sentence. So is this one. And this.
$$This is the next line and some lines are like this.
I wrote the following script:
import os

directory = "C:/Users/DELL/Desktop/"
filename = "test.txt"
path = os.path.join(directory, filename)
with open(path,"r") as f_in, open(directory+"output.txt","w+") as f_out:
    temp = "  "
    for line in f_in:
        curr_line = line.strip()
        temp += curr_line
        #print("Current line:\n%s\n\ntemp line: %s" % (curr_line, temp))
        if curr_line:
            if temp[-1]==".": #check if sentence is complete
                f_out.write(temp)
                temp = "\n  " #two blank spaces here
It eliminates all blank lines, indents new lines by two spaces, and joins hanging sentences, but it doesn't insert the necessary blank space, so the output currently looks like this (note the missing space between the words "line" and "and"):
$$This is a sentence. So is this one. And this.
$$This is the next lineand some lines are like this.
I tried to fix this by changing the following lines of code to read as follows:
temp += " " + curr_line
temp = "\n " #one space instead of two
and that doesn't work, and I'm not entirely sure why. It might be an issue with the text itself but I'll check on that.
Any advice would be appreciated, and if there is a better way to do what I want than this convoluted mess that I wrote, then I would like to know that as well.
EDIT: I seem to have fixed it. In my text (very long so I didn't notice it at first) there were two lines separated by 2 (two) empty lines, and so my attempt at fixing it didn't work. I moved one line a bit further below to give the following loop, which seems to have fixed it:
for line in f_in:
    curr_line = line.strip()
    #print("Current line:\n%s\n\ntemp line: %s" % (curr_line, temp))
    if curr_line:
        temp += " " + curr_line
        if temp[-1]==".": #check if sentence is complete
            f_out.write(temp)
            temp = "\n "
I also saw that an answer below initially had a bit of Regex in it, I'll have to learn that at some point in the future I suppose.
Thanks for the help everyone.
This should work. It's effectively the same as yours but a bit more efficient: instead of repeated string concatenation with + and += (which can be slow on long texts), it collects the fragments of an incomplete sentence in a list. Once a fragment ends with a full stop, it writes two spaces, the fragments joined by single spaces, and then a newline. Writing only when a sentence is complete keeps the logic simple.
temp = []
with open(path_in, "r") as f_in, open(path_out, "w") as f_out:
    for line in f_in:
        curr_line = line.strip()
        if curr_line:
            temp.append(curr_line)
            if curr_line.endswith('.'): # write our line
                f_out.write('  ')
                f_out.write(' '.join(temp))
                f_out.write('\n')
                temp.clear() # reset temp
outputs
  This is a sentence. So is this one. And this.
  This is the next line and some lines are like this.
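One thing the loop above does not handle is text at the very end of the file that never reaches a full stop: it stays in temp and is never written. Below is a sketch of the same approach as a generator over any iterable of lines, with a final flush added. The function name join_sentences and the indent parameter are invented here, and keeping the leftover fragment is an assumption about what the asker would want.

```python
def join_sentences(lines, indent="  "):
    """Yield indented output lines, merging fragments until one ends in '.'."""
    parts = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip empty lines entirely
        parts.append(text)
        if text.endswith("."):
            yield indent + " ".join(parts)
            parts.clear()
    if parts:
        # Assumption: leftover text without a final '.' should still be kept.
        yield indent + " ".join(parts)


sample = [
    "This is a sentence. So is this one. And this.\n",
    "\n",
    "This is the next line\n",
    "\n",
    "and some lines are like this.\n",
    "A trailing fragment\n",
]
for out in join_sentences(sample):
    print(out)
```

Since the function accepts any iterable of strings, an open file object can be passed in directly in place of the sample list.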

Find a string and insert text after it in Python

I am still learning Python. I have not been able to find a specific string in a file and insert multiple strings after it. I want to search for the line in the file and insert the content passed to the write function right after it.
I have tried the following, which inserts at the end of the file instead.
line = '<abc hij kdkd>'
dataFile = open('C:\\Users\\Malik\\Desktop\\release_0.5\\release_0.5\\5075442.xml', 'a')
dataFile.write('<!--Delivery Date: 02/15/2013-->\n<!--XML Script: 1.0.0.1-->\n')
dataFile.close()
You can use fileinput to modify the same file in place and re to search for a particular pattern:
import fileinput, re, sys

def modify_file(file_name, pattern, value=""):
    fh = fileinput.input(file_name, inplace=True)
    for line in fh:
        replacement = value + line
        line = re.sub(pattern, replacement, line)
        # With inplace=True, whatever goes to stdout is written back to the file.
        sys.stdout.write(line)
    fh.close()
You can call this function something like this:
modify_file("C:\\Users\\Malik\\Desktop\\release_0.5\\release_0.5\\5075442.xml",
"abc..",
"!--Delivery Date:")
Python strings are immutable, which means that you wouldn't actually modify the input string. Instead, you would create a new one which has the first part of the input string, then the text you want to insert, then the rest of the input string.
You can use the find method on Python strings to locate the text you're looking for:
def insertAfter(haystack, needle, newText):
    """ Inserts 'newText' into 'haystack' right after 'needle'. """
    i = haystack.find(needle)
    if i == -1:  # needle not found: return the input unchanged
        return haystack
    return haystack[:i + len(needle)] + newText + haystack[i + len(needle):]
You could use it like this:
print(insertAfter("Hello World", "lo", " beautiful")) # prints 'Hello beautiful World'
Here is a suggestion for dealing with files. I assume the pattern you are searching for is a whole line (there is nothing else on the line, and the pattern fits on one line).
line = ... # What to match
input_filepath = ... # input full path
output_filepath = ... # output full path (must be different than input)
encoding = "utf-8" # assumed; use whatever encoding your file has
all_data_to_insert = ... # text to write after the matching line

with open(input_filepath, "r", encoding=encoding) as fin, \
     open(output_filepath, "w", encoding=encoding) as fout:
    pattern_found = False
    for theline in fin:
        # Write input to output unmodified
        fout.write(theline)
        # if you want to get rid of spaces
        theline = theline.strip()
        # Find the matching pattern
        if pattern_found is False and theline == line:
            # Insert extra data in output file
            fout.write(all_data_to_insert)
            pattern_found = True
    # Final check
    if pattern_found is False:
        raise RuntimeError("No data was inserted because line was not found")
This code is for Python 3; some modifications may be needed for Python 2, especially for the with statement (see contextlib.nested). If your pattern fits on one line but is not the entire line, you can use "line in theline" instead of "theline == line". If your pattern can spread over more than one line, you need a stronger algorithm. :)
To write to the same file, you can write to another file and then move the output file over the input file. I didn't plan to release this code, but I was in the same situation some days ago. So here is a class that inserts content in a file between two tags and supports writing to the input file: https://gist.github.com/Cilyan/8053594
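The "write to another file, then move it over the input" step can be sketched with tempfile and os.replace. This is an illustration under assumptions, not the code from the gist: the function name rewrite_in_place and its transform callback are hypothetical, and the demo runs on a throwaway file rather than a real XML document.

```python
import os
import tempfile


def rewrite_in_place(path, transform, encoding="utf-8"):
    """Stream path through transform(line) -> str, then replace the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so os.replace stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, text=True)
    try:
        with os.fdopen(fd, "w", encoding=encoding) as fout, \
                open(path, "r", encoding=encoding) as fin:
            for line in fin:
                fout.write(transform(line))
        os.replace(tmp_path, path)  # swap the new file in
    except BaseException:
        os.remove(tmp_path)  # do not leave a half-written temp file behind
        raise


# Demo on a throwaway file: add a comment line after the marker line.
with tempfile.TemporaryDirectory() as d:
    demo = os.path.join(d, "demo.xml")
    with open(demo, "w", encoding="utf-8") as f:
        f.write("<abc hij kdkd>\n<rest/>\n")
    rewrite_in_place(
        demo,
        lambda line: line + "<!--XML Script: 1.0.0.1-->\n"
        if line.strip() == "<abc hij kdkd>" else line,
    )
    with open(demo, encoding="utf-8") as f:
        result = f.read()
print(result)
```

The original file is only replaced after the new content has been written completely, so a failure midway leaves the input untouched.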
Frerich Raabe...it worked perfectly for me...good one...thanks!!!
import shutil

def insertAfter(haystack, needle, newText):
    """ Inserts 'newText' into 'haystack' right after 'needle'. """
    i = haystack.find(needle)
    if i == -1:  # needle not on this line: leave the line unchanged
        return haystack
    return haystack[:i + len(needle)] + newText + haystack[i + len(needle):]

# sddraft holds the path of the input file
with open(sddraft) as f1:
    tf = open("<path to your file>", 'a+')
    # Read Lines in the file and replace the required content
    for line in f1.readlines():
        build = insertAfter(line, "<string to find in your file>", "<new value to be inserted after the string is found in your file>") # inserts value
        tf.write(build)
    tf.close()
shutil.copy("<path to the source file --> tf>", "<path to the destination where tf needs to be copied with the file name>")
Hope this helps someone:)

Splitting lines in python based on some character

Input:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
The problem I am getting is that the output looks like this:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated...!!!
My code:
file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter= counter -1
        counter= counter +1
    file_open2.write(i)
You can try something like this:
with open("abc.txt") as f:
    data = f.read().replace("\n", "")  # remove the line breaks (Python 3 text mode already turns "\r\n" into "\n")
    ans = filter(None, data.split("!"))  # split the data at '!', then filter out empty strings
    for x in ans:
        print("!" + x)  # or write to some other file
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?
lines = file_read.split('!')
Now lines is a list which holds the split data. This is almost the lines you want to write -- The only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting -- e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:
file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output.
A few other points: when opening files, it's nice to use a context manager. This makes sure that the file is closed properly:
with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile','w') as fout:
    # "if line" skips the empty string that split() produces before the first '!'
    fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines if line)
Another option, using replace instead of split, since you know the starting and ending characters of each line:
In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')
In [15]: print(data.replace('+0013!', "+0013\n!"))
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:
import re

outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()
It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation & what it matches can be found here: http://regex101.com/r/aK6aV4
After we have a match, we strip out the new lines from the match, and write it to the file.
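To see what each part of the pattern contributes, here is a small in-memory sketch. The sample string is a shortened copy of the wrapped input from the question, and the names RECORD, wrapped and records are only for illustration.

```python
import re

# "!"      : every record starts with a literal exclamation mark
# ".+?"    : then as few characters as possible (re.DOTALL lets "." span line breaks)
# "(?=!|$)": stop right before the next "!" or at the end of the text,
#            without consuming it
RECORD = re.compile(r"!.+?(?=!|$)", re.DOTALL)

wrapped = (
    "!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1\n"
    "2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.\n"
    "0,0,37N22."
)

# Remove the hard line breaks inside each match to get one record per item.
records = [m.replace("\n", "") for m in RECORD.findall(wrapped)]
for r in records:
    print(r)
```

Because the lookahead does not consume the next "!", that character is still available as the start of the following match.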
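Let's add a \n before every "!" (after removing the existing line breaks), then let Python's splitlines do the work :-) Note that the first element is empty when the data starts with "!":
file_read.replace("\n", "").replace("!", "\n!").splitlines()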
I would actually implement this as a generator so that you can work on the data stream rather than the entire content of the file. This is quite memory friendly when working with huge files.
>>> def split_on_stream(it,sep="!"):
        prev = ""
        for line in it:
            line = (prev + line.strip()).split(sep)
            for parts in line[:-1]:
                yield parts
            prev = line[-1]
        yield prev
>>> with open("test.txt") as fin:
        for parts in split_on_stream(fin):
            print(parts)
,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
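The pieces printed above have lost their leading "!" because split removes the separator. A possible variant that puts it back and skips the empty piece before the first "!" is sketched below. The name records_from_stream is hypothetical, and the sketch assumes the data always begins with the separator.

```python
import io


def records_from_stream(lines, sep="!"):
    """Yield complete records, each starting with sep, from wrapped lines."""
    pending = ""
    for line in lines:
        pieces = (pending + line.rstrip("\r\n")).split(sep)
        # Everything but the last piece is complete; the last may continue
        # on the next input line.
        for piece in pieces[:-1]:
            if piece:  # skip the empty piece before the very first sep
                yield sep + piece
        pending = pieces[-1]
    if pending:
        yield sep + pending


# io.StringIO stands in for an open file so the sketch needs no real file.
wrapped = io.StringIO(
    "!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1\n"
    "2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.\n"
    "0,0,37N22.\n"
)
records = list(records_from_stream(wrapped))
for r in records:
    print(r)
```

Only one input line plus the unfinished tail is held in memory at a time, which is the point of the streaming approach.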
