Python: open a CSV, search for a pattern, and strip everything else
I have a CSV file 'svclist.csv' which contains a single-column list as follows:
pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1
pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs
I need to strip each line of everything except the PL5 directory and the two digits in the last directory, so the output should look like this:
PL5,00
PL5,01
I started the code as follows:

clean_data = []
with open('svclist.csv', 'rt') as f:
    for line in f:
        if line.__contains__('profile'):
            print(line, end='')
and I'm stuck here.
Thanks in advance for the help.
You can use the regular expression (PL5)[^/].{0,}([0-9]{2,2}).
For an explanation, just copy the regex and paste it at 'https://regexr.com'. This will explain how the regex works, and you can make the required changes.
import re

test_string_list = ['pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1',
                    'pf=/usr/sap/PL5/SYS/profile/PL5_ASCS01_s4prdascs']
regex = re.compile("(PL5)[^/].{0,}([0-9]{2,2})")
result = []

for test_string in test_string_list:
    matchArray = regex.findall(test_string)
    result.append(matchArray[0])

with open('outfile.txt', 'w') as f:
    for row in result:
        f.write(f'{",".join(row)}\n')
In the above code, I've created one empty list to hold the tuples returned by findall. Each row is a tuple like ('PL5', '00'), so ",".join(row) turns it into the desired 'PL5,00' form. Then I'm using a formatted string to write the content into 'outfile.txt'.
You can use a regex for this (in general, when trying to extract a pattern, this is a good option):
import re

pattern = re.compile(r"pf=/usr/sap/PL5/SYS/profile/PL5_.*(\d{2})")
with open('svclist.csv', 'rt') as f:
    for line in f:
        if 'profile' in line:
            last_two_numbers = pattern.findall(line)[0]
            print(f'PL5,{last_two_numbers}')
This code goes over each line, checks if "profile" is in the line (this is the same as your __contains__ call), then extracts the last two digits according to the pattern.
I made the assumption that the number is always between the two underscores. You could run something similar to this within your for-loop.
test_str = "pf=/usr/sap/PL5/SYS/profile/PL5_D00_s4prd1"
test_list = test_str.split("_")  # split the string at the underscores
output = test_list[1].strip(
    "abcdefghijklmnopqrstuvwxyz" + str.swapcase("abcdefghijklmnopqrstuvwxyz"))  # strip any letters

try:
    int(output)  # check that only digits are left
    print(f"PL5,{output}")
except ValueError:
    print(f"Something went wrong! Output is PL5,{output}")
Related
Extract chunks of text from document and write them to new text file
I have a large text file that I want to read several lines of, and write those lines out as one line to a text file. For instance, I want to start reading in lines at a certain start word, and end on a lone parenthesis. So if my start word is 'CAR', I would want to start reading until a lone parenthesis with a line break is read. The start and end words are to be kept as well. What is the best way to achieve this? I have tried pattern matching and avoiding regex, but I don't think that is possible. Code:

array = []
f = open('text.txt', 'r') as infile
w = open(r'temp2.txt', 'w') as outfile
for line in f:
    data = f.read()
    x = re.findall(r'CAR(.*?)\)(?:\\n|$)', data, re.DOTALL)
    array.append(x)
    outfile.write(x)
return array

What the text may look like:

(
CAR:
*random info*
*random info*
- could be many lines of this
)
Using regular expressions is totally fine for these types of problems. You cannot use them when your pattern contains recursion, like getting the content of nested parentheses: ((text1)(text2)). Here you can use the following regular expression: (CAR[\s\S]*?(?=\))). It matches lazily from 'CAR' up to, but not including, the next closing parenthesis. You can paste it into an online tool such as regexr.com to visualize how it works.
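A minimal sketch of that regex in action (the sample text is invented to mirror the question; the single-line flattening step is an addition, not part of the answer above):

```python
import re

# Hypothetical sample in the shape the question describes.
text = """(
CAR:
*random info*
*random info*
- could be many lines of this
)"""

# Lazily match from 'CAR' up to (but not including) the next ')'.
matches = re.findall(r'CAR[\s\S]*?(?=\))', text)

# Collapse each multi-line match onto a single line.
flattened = [" ".join(m.split()) for m in matches]
print(flattened[0])  # CAR: *random info* *random info* - could be many lines of this
```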
We can match the text you're interested in using the regex pattern (CAR.*)\) with flags gms. Then we just have to remove the newline characters from the resulting matches and write them to a file.

import re

with open("text.txt", 'r') as f:
    matches = re.findall(r"(CAR.*)\)", f.read(), re.DOTALL)

with open("output.txt", 'w') as f:
    for match in matches:
        f.write(" ".join(match.split('\n')))
        f.write('\n')

The output file looks like this:

CAR: *random info* *random info* - could be many lines of this

EDIT: updated the code to put a newline between matches in the output file.
Having problems with strings and arrays
I want to read a text file and copy the text that is in between '~~~~~~~~~~~~~' into an array. However, I'm new to Python and this is as far as I got:

with open("textfile.txt", "r", encoding='utf8') as f:
    searchlines = f.readlines()
a = [0]
b = 0
for i, line in enumerate(searchlines):
    if '~~~~~~~~~~~~~' in line:
        b = b + 1
    if '~~~~~~~~~~~~~' not in line:
        if 's1mb4d' in line:
            break
        a.insert(b, line)

This is what I envisioned: first I read all the lines of the text file, then I declare 'a' as an array in which text should be added, then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' is not even, and that's why I use 'b': so I can put lines of text into one array index until a new '~~~~~~~~~~~~~' is found. I check for '~~~~~~~~~~~~~'; if found, I increase 'b' so I can start adding lines of text into a new array index. The text file ends with 's1mb4d', so once it's found, the program ends. And if '~~~~~~~~~~~~~' is not found in the line, I add text to the array. But things didn't go well: only 1 line of the entire text between those '~~~~~~~~~~~~~' is being copied to each array index. Here is an example of the text file:

~~~~~~~~~~~~~
Text123asdasd asdasdjfjfjf
~~~~~~~~~~~~~
123abc 321bca gjjgfkk
~~~~~~~~~~~~~
You could use a regular expression; give this a try:

import re

input_text = ['Text123asdasd asdasdjfjfjf', '~~~~~~~~~~~~~', '123abc 321bca gjjgfkk', '~~~~~~~~~~~~~']
a = []
for line in input_text:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)

What it does is read line by line and look for all characters except '~'; if a line consists only of '~' it is ignored, and every line with text is appended to your a list. And just because we can, here is a one-liner (excluding the import and the source list, of course):

import re

lines = ['Text123asdasd asdasdjfjfjf', '~~~~~~~~~~~~~', '123abc 321bca gjjgfkk', '~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]
In Python, the solution to a large part of problems is often to find the right function from the standard library that does the job. Here you should try using split instead; it should be way easier. If I understand your goal correctly, you can do it like this:

joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~~~~')

The first line joins your list of lines into a single string, and the second one cuts that big string every time it encounters the '~~~~~~~~~~~~~' separator sequence.
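A minimal sketch of that split-based idea, assuming the separator is the question's 13-tilde line (the sample strings are made up to mirror the question; the empty-chunk filtering is an extra step, not part of the answer above):

```python
sep = "~" * 13  # the question's separator line
text = "\n".join([sep, "Text123asdasd asdasdjfjfjf", sep, "123abc 321bca gjjgfkk", sep])

# Split on the separator, then drop empty or whitespace-only chunks.
chunks = [c.strip() for c in text.split(sep) if c.strip()]
print(chunks)  # ['Text123asdasd asdasdjfjfjf', '123abc 321bca gjjgfkk']
```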
I tried to clean it up to the best of my knowledge; try this and let me know if it works. We can work together on this! :)

with open("textfile.txt", "r", encoding='utf8') as f:
    searchlines = f.readlines()

a = []
currentline = ''
for i, line in enumerate(searchlines):
    if '~~~~~~~~~~~~~' in line:
        if currentline:
            a.append(currentline)
        currentline = ''  # start collecting a new chunk
    elif 's1mb4d' in line:
        break
    else:
        currentline += line

Some notes: you can use elif for your break; append adds the collected chunk to the end of the array; and currentline keeps accumulating text on each line as long as the line doesn't contain 's1mb4d' or the '~~~~~~~~~~~~~' separator, which I think is what you want.
import re

s = ['']
with open('path\\to\\sample.txt') as f:
    for l in f:
        a = l.strip().split("\n")
        s += a

a = []
for line in s:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)

print(a)
# [['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]
If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ('~' * 13) to be specific), then you could accomplish this for relatively normal sized files using just:

#!/usr/bin/env python
separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
    results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.

Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times, and 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code, but I would consider constructing the string as shown, to make it clear that the length is part of the string's specification: this is part of your file format requirements, and any other number of ~ characters won't be sufficient. If you really want any line of any number of ~ characters to serve as a separator, then you'll want to use the split() function from the regular expressions module (re.split) rather than the .split() method provided by the built-in string objects.

Note that this snippet of code will return all of the text between your separator lines, including any newlines they include. There are other snippets of code which can filter those out. For example, given our previous results:

# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]

(You could also use the .replace() string method, but I prefer the join/split combination.)

In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results (which we're arbitrarily naming each), performing our transformation on it, and the resulting list is being bound back to the name results. I highly recommend learning and getting comfortable with list comprehensions if you're going to learn Python; they're commonly used and can be a bit exotic compared to the syntax of many other programming and scripting languages. This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python 3.6 installed under MacOS X using Homebrew, but just be forewarned.)
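For the variant where any line of ~ characters should count as a separator, a minimal sketch using re.split (the sample strings are invented for illustration; note the middle separator is deliberately a different length):

```python
import re

sep = "~" * 13
text = "\n".join([sep, "Text123asdasd asdasdjfjfjf", "~" * 20, "123abc 321bca gjjgfkk", sep])

# (?m)^~+$ matches any line made up solely of '~' characters, however many.
chunks = [c.strip() for c in re.split(r'(?m)^~+$', text) if c.strip()]
print(chunks)  # ['Text123asdasd asdasdjfjfjf', '123abc 321bca gjjgfkk']
```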
Iterating through a file and replacing strings, leaving the number of characters intact
I'm attempting to anonymize a file so that all the content except certain keywords is replaced with gibberish, but the format is kept the same (including punctuation, length of string, and capitalization). For example:

I am testing this, check it out! This is a keyword: long
Wow, another line.

should turn into:

T ad ehistmg ptrs, erovj qo giw! Tgds ar o qpyeogf: long
Yeg, rmbjthe yadn.

I am attempting to do this in Python, but I'm having no luck finding a solution. I have tried replacing via tokenization and writing to another file, but without much success.
Initially, let's disregard the fact that we have to preserve some keywords; we will fix that later. The easiest way to perform this kind of 1-to-1 mapping is to use the method str.translate. The string module also contains constants holding all ASCII lowercase and uppercase characters, and random.shuffle can be used to obtain a random permutation.

import string
import random

random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)
random.shuffle(random_caps)
random.shuffle(random_lows)

all_random_chars = ''.join(random_lows + random_caps)
translation_table = str.maketrans(string.ascii_letters, all_random_chars)

with open('the-file-i-want.txt', 'r') as f:
    contents = f.read()

translated_contents = contents.translate(translation_table)

with open('the-file-i-want.txt', 'w') as f:
    f.write(translated_contents)

In Python 2, str.maketrans is a function in the string module instead of a static method of str. The translation_table is a mapping from characters to characters, so it will map every single ASCII letter to another one. The translate method simply applies this table to each character in the string. Important note: the above method is actually reversible, because each letter is mapped to a unique other letter. This means that using a simple analysis of the frequency of the symbols, it's possible to reverse it.
If you want to make this harder or impossible, you could re-create the translation_table for every line:

import string
import random

random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)

with open('the-file-i-want.txt', 'r') as f:
    translated_lines = []
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        translated_lines.append(line.translate(translation_table))

with open('the-file-i-want.txt', 'w') as f:
    f.writelines(translated_lines)

Also note that you could translate and save the file line by line:

with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        o.write(line.translate(translation_table))

This means you can translate huge files with this code, as long as the lines themselves are not insanely long. The code above scrambles all characters, without taking the keywords into account.
The simplest way to handle the requirement is to check for each line whether one of the keywords occurs, and "reinsert" it there:

import re
import string
import random

random_caps = list(string.ascii_uppercase)
random_lows = list(string.ascii_lowercase)

keywords = ['long']  # add all the possible keywords to this list
keyword_regex = re.compile('|'.join(re.escape(word) for word in keywords))

with open('the-file-i-want.txt', 'r') as f, open('output.txt', 'w') as o:
    for line in f:
        random.shuffle(random_lows)
        random.shuffle(random_caps)
        all_random_chars = ''.join(random_lows + random_caps)
        translation_table = str.maketrans(string.ascii_letters, all_random_chars)
        matches = keyword_regex.finditer(line)
        translated_line = list(line.translate(translation_table))
        for match in matches:
            translated_line[match.start():match.end()] = match.group()
        o.write(''.join(translated_line))

Sample usage (using the version that preserves keywords):

$ echo 'I am testing this, check it out! This is a keyword: long
Wow, another line.' > the-file-i-want.txt
$ python3 trans.py
$ cat output.txt
M vy hoahitc hfia, ufoum ih pzh! Hfia ia v modjpel: long
Ltj, fstkwzb hdsz.

Note how long is preserved.
Replace part of a matched string in python
I have the following matched strings:

punctacros="Tasla"_TONTA
punctacros="Tasla"_SONTA
punctacros="Tasla"_JONTA
punctacros="Tasla"_BONTA

I want to replace only a part (before the underscore) of the matched strings, and the rest of each string should remain the same. The result should look like this:

TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA
Edit: This should work:

from re import sub

with open("/path/to/file") as myfile:
    lines = []
    for line in myfile:
        line = sub('punctacros="Tasla"(_.*)', r'TROGA\1', line)
        lines.append(line)

with open("/path/to/file", "w") as myfile:
    myfile.writelines(lines)

Result:

TROGA_TONTA
TROGA_SONTA
TROGA_JONTA
TROGA_BONTA

Note however, if your file is exactly like the sample given, you can replace the re.sub line with this:

line = "TROGA_" + line.split("_", 1)[1]

eliminating the need for regex altogether. I didn't do this, though, because you seem to want a regex solution.
mystring.replace('punctacros="Tasla"', 'TROGA')

where mystring is the string with those four lines. It will return a string with the replaced values.
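For example, a minimal sketch using the question's sample strings:

```python
# The question's four input lines joined into one string.
mystring = '\n'.join(['punctacros="Tasla"_TONTA',
                      'punctacros="Tasla"_SONTA',
                      'punctacros="Tasla"_JONTA',
                      'punctacros="Tasla"_BONTA'])

# Replace the part before the underscore in every line at once.
result = mystring.replace('punctacros="Tasla"', 'TROGA')
print(result.splitlines()[0])  # TROGA_TONTA
```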
If you want to replace everything before the first underscore, try this:

#!/usr/bin/python3
data = ['punctacros="Tasla"_TONTA',
        'punctacros="Tasla"_SONTA',
        'punctacros="Tasla"_JONTA',
        'punctacros="Tasla"_BONTA',
        'somethingelse!="Tucku"_CONTA']

for s in data:
    print('TROGA' + s[s.find('_'):])
Splitting lines in python based on some character
Input:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Output:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

'!' is the starting character and +0013 should be the ending of each line (if present).

The problem I am getting is that the output looks like this:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W

Any help would be highly appreciated!

My code:

file_open = open('sample.txt', 'r')
file_read = file_open.read()
file_open2 = open('output.txt', 'w+')
counter = 0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter = counter - 1
        counter = counter + 1
    file_open2.write(i)
You can try something like this:

with open("abc.txt") as f:
    data = f.read().replace("\r\n", "")  # replace the newlines with ""
    # the newline can be "\n" on your system instead of "\r\n"

ans = filter(None, data.split("!"))  # split the data at '!', then filter out empty strings
for x in ans:
    print "!" + x  # or write to some other file

This prints:

!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Could you just use str.split?

lines = file_read.split('!')

Now lines is a list which holds the split data. This is almost the set of lines you want to write; the only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting, e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:

file_open2.writelines('!{0}\n'.format(line) for line in lines)

You might need:

file_open2.writelines('!{0}\n'.format(line.replace('\n', '')) for line in lines)

if you find that you're getting more newlines than you wanted in the output. One other point: when opening files, it's nice to use a context manager, which makes sure that the file is closed properly:

with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile', 'w') as fout:
    fout.writelines('!{0}\n'.format(line.replace('\n', '')) for line in lines)
Another option, using replace instead of split, since you know the starting and ending characters of each line:

In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')

In [15]: print data.replace('+0013!', "+0013\n!")
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Just for some variance, here is a regular expression answer:

import re

outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()

It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation & what it matches can be found here: http://regex101.com/r/aK6aV4. After we have a match, we strip out the newlines from the match and write it to the file.
Let's try to add a \n before every "!", then let Python splitlines :-) :

file_read.replace("!", "\n!").splitlines()
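A quick sketch of that one-liner on a shortened, made-up record string (splitting at the leading '!' leaves an empty first element, filtered out here; the filtering is an addition to the one-liner above):

```python
file_read = "!,A,56281,12/12/19,19:34:12,+0013!,A,56281,12/12/19,19:34:13,+0013"

# Insert a newline before every '!', split, and drop the empty leading piece.
lines = [l for l in file_read.replace("!", "\n!").splitlines() if l]
print(lines[0])  # !,A,56281,12/12/19,19:34:12,+0013
```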
I would actually implement this as a generator so that you can work on the data stream rather than the entire content of the file. This is quite memory friendly when working with huge files.

>>> def split_on_stream(it, sep="!"):
...     prev = ""
...     for line in it:
...         line = (prev + line.strip()).split(sep)
...         for parts in line[:-1]:
...             yield parts
...         prev = line[-1]
...     yield prev
>>> with open("test.txt") as fin:
...     for parts in split_on_stream(fin):
...         print parts

,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.