Can't extract substring from the string using regex in Python

I want to extract the substring "login attempt [b'admin'/b'admin']" from the string:
2021-05-06T00:00:15.921179Z [HoneyPotSSHTransport,1127,5.188.87.53] login attempt [b'admin'/b'admin'] succeeded.
But python returns the whole string. My code is:
import re
hand = open('cowrie.log')
outF = open("Usernames.txt", "w")
for line in hand:
    if re.findall(r'login\sattempt\s\[[a-zA-z0-9]\'[a-zA-z0-9]+\'/[a-zA-z0-9]+\'[a-zA-z0-9]+\'\]', line):
        print(line)
        outF.write(line)
        outF.write("\n")
outF.close()
Thanks in advance. This is the LINK which contains the data from which I want to extract.

Your code states: if re.findall returns something, print the whole line. But you should print the return from re.findall and write that as a string.
Or use re.search if you expect a single match.
Note that [A-z] matches more than [A-Za-z].
import re
hand = open('cowrie.log')
outF = open("Usernames.txt", "w")
for line in hand:
    res = re.search(r"login\sattempt\s\[[a-zA-Z0-9]'[a-zA-Z0-9]+'/[a-zA-Z0-9]+'[a-zA-Z0-9]+']", line)
    if res:
        outF.write(res.group())
        outF.write("\n")
outF.close()
Usernames.txt now contains:
login attempt [b'admin'/b'admin']
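
To see both points in isolation, here is a minimal sketch that runs the pattern against the sample log line from the question and then checks which of the six ASCII characters between Z and a are accepted by [A-z]. The variable names (sample_line, extracted, accepted_by_A_z and so on) are made up for illustration and are not part of the original code.

```python
import re

# Sample line copied from the question.
sample_line = (
    "2021-05-06T00:00:15.921179Z [HoneyPotSSHTransport,1127,5.188.87.53] "
    "login attempt [b'admin'/b'admin'] succeeded."
)

pattern = re.compile(
    r"login\sattempt\s\[[a-zA-Z0-9]'[a-zA-Z0-9]+'/[a-zA-Z0-9]+'[a-zA-Z0-9]+'\]"
)

# re.search returns a match object (or None); group() is the matched text only.
match = pattern.search(sample_line)
extracted = match.group() if match else None
print(extracted)

# The range A-z covers ASCII 65..122, which includes these six non-letters.
between_Z_and_a = "[\\]^_`"
accepted_by_A_z = [c for c in between_Z_and_a if re.fullmatch(r"[A-z]", c)]
accepted_by_A_Za_z = [c for c in between_Z_and_a if re.fullmatch(r"[A-Za-z]", c)]
print(accepted_by_A_z, accepted_by_A_Za_z)
```

Writing match.group() rather than line is the whole fix; the character-class check only illustrates why [a-zA-Z] is the safer spelling.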

Related

How to go to the next line if a match is found and count a word in that line

I am trying to count words: find a line that matches an ID and, if a match is found, go to the next line and count the occurrences of a word in that line.
id = open('id.txt','r')
ids = id.readlines()
for i in range(0, len(ids) - 1, 1):
    actual_id = ids[i]
    print(actual_id)
    with open('sample2.txt', 'r') as f:
        for line in f:
            if re.search(r'{actual_id}|RQ', line):
                next_line = line.next()
                if next_line == 'RQ':
                    print(line)
                    with open('output.txt', 'a') as f:
                        f.write('\n' + line)
Sample.txt text file:
[07-12-2022 13:27:45.728|Info|0189B31C|RQ]
<ServiceRQ><SaleInfo><CityCode Solution=1>BLQ</CityCode><CountryCode Solution=2>NL</CountryCode><CurrencyCode>EUR</CurrencyCode><Channel>ICI</Channel></ServiceRQ>
[07-12-2022 13:27:45.744|Info|0189B31D|RQ]
<ServiceRQ><SaleInfo><CityCode Solution=1>BLQ</CityCode><CountryCode>NL</CountryCode><CurrencyCode>EUR</CurrencyCode><Channel>ICI</Channel></ServiceRQ>
0189B31C
0189B31D
These are the unique IDs, which are stored in a different text file. I am trying to read the first ID from that file, match it in Sample.txt and, if a match is found, go to the next line, count the number of Solution words and print the count.
Please can someone help me with the code? I am a little confused.

I have no experience with the "requests" module. But since no one has answered your question yet, I thought maybe this would suit you. The code should work fine if the number of lines is even. I mean, the code will put strings in the "payload" and do the rest only if there is an entire pair consisting of an odd and an even string.
with open('Sample.txt', 'r') as f:
    while True:
        try:
            odd_line=next(f)
            even_line=next(f)
        except StopIteration:
            break
        #payload=...
        #headers=...
        #response=...
        #print(response.text)

You can use the flag re.DOTALL with the regex {idf}\|RQ.*?</ServiceRQ>. With this flag the dot matches any character including a newline, and the non-greedy part (.*?) makes sure that as few characters as possible are matched until the string </ServiceRQ> is found. Then you can use findall to obtain the number of Solution words in the matched string.
import re
with open('sample2.txt', 'r') as sample_file:
    sample2 = sample_file.read()
id_dict = {}
with open('id.txt', 'r') as id_file:
    for idf in id_file.read().split():
        id_found = re.findall(fr'{idf}\|RQ.*?</ServiceRQ>', sample2, re.DOTALL)
        if id_found:
            solution_found = re.findall('Solution', id_found[0])
            id_dict[idf] = len(solution_found)
print(id_dict)
Output from id_dict
{
'0189B31C': 2,
'0189B31D': 1
}
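
If you would rather walk the file line by line, as the question describes, one possible sketch is to advance the same iterator with next() once a header line matches. The function name count_solutions and the in-memory sample built with io.StringIO are assumptions made for illustration; with a real file you would pass the open file object instead.

```python
import io
import re

def count_solutions(lines, ids):
    """Map each ID to the number of 'Solution' words on the line after its header."""
    counts = {}
    iterator = iter(lines)
    for line in iterator:
        for wanted in ids:
            # Header lines look like [date time|Info|<ID>|RQ]
            if f"|{wanted}|RQ]" in line:
                # Consume the following line from the same iterator.
                payload = next(iterator, "")
                counts[wanted] = len(re.findall(r"\bSolution\b", payload))
                break
    return counts

# Shortened copy of the sample data from the question (hypothetical stand-in for the file).
sample = io.StringIO(
    "[07-12-2022 13:27:45.728|Info|0189B31C|RQ]\n"
    "<ServiceRQ><SaleInfo><CityCode Solution=1>BLQ</CityCode>"
    "<CountryCode Solution=2>NL</CountryCode></ServiceRQ>\n"
    "[07-12-2022 13:27:45.744|Info|0189B31D|RQ]\n"
    "<ServiceRQ><SaleInfo><CityCode Solution=1>BLQ</CityCode>"
    "<CountryCode>NL</CountryCode></ServiceRQ>\n"
)
result = count_solutions(sample, ["0189B31C", "0189B31D"])
print(result)
```

This avoids reading the whole file into memory, at the cost of assuming that the payload always sits on the single line directly after its header.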

Print line if line starts with any letter of the alphabet

I'm trying to print all of my reptile subspecies in my python program. I have a text file with a bunch of subspecies and their DNA sequence IDs. I just want to create a dictionary of subspecies (keys) and their respective DNA sequence IDs (values). But to do that I need to first learn how to separate the two.
So I want to print all of the subspecies names only, and to ignore the sequence IDs.
So far I have
import re
file = open('repCleanSubs2.txt')
for line in file:
    if line.startswith('[a-zA-Z]'):
        print(line)
I believe the interpreter takes '[a-zA-Z]' as a string literal rather than as a search for any letter of the alphabet regardless of case, which is what I want.
Is there some syntax that I'm missing in my if statement?
Thanks!

startswith does not interpret regular expressions. Use the re module you have imported to check whether the string matches:
if re.match('^[a-zA-Z]+', line) is not None:
    print(line)
starts with: ^
one or more matching characters: +
http://www.fon.hum.uva.nl/praat/manual/Regular_expressions_1__Special_characters.html

import re
file = open('repCleanSubs2.txt')
for line in file:
    match = re.findall('^[a-zA-Z]+', line)
    if match:
        print(line, match)
The ^ sign means match from the beginning of the line; [a-zA-Z] matches the letters a-z and A-Z.
+ means that one or more characters in [a-zA-Z] must be found.
re.findall returns a list of all the matches it finds in the string you supply to it.

Try the following lines instead of the startswith.
if re.match("^[a-zA-Z]", line):
    print(line)

Try this, it's working for me:
import re
file = open('repCleanSubs2.txt')
for line in file:
    if (re.match('[a-zA-Z]',line)):
        print(line)

Without using re:
import string
with open('repCleanSubs2.txt') as c_file:
    for line in c_file:
        if any([line.startswith(c) for c in string.ascii_letters]):
            print(line)

Try this:
file = open("abc.xyz")
file_content = file.read()
line = file_content.splitlines()
output_data = []
for i in line:
    if i[:1].isalpha():  # True only when the first character exists and is a letter
        output_data.append(i)
        print(i)

It can be done without a regular expression:
data = open('repCleanSubs2.txt').read().splitlines() ## Read file and extract data as list
print([i for i in data if i[:1].isalpha()])  ## i[:1] also copes with empty lines
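
For comparison, here is a small sketch that applies three of the approaches above to the same input. The sample lines are invented for illustration (the real contents of repCleanSubs2.txt are not shown in the question), and the list names are hypothetical.

```python
import re
import string

# Made-up sample lines standing in for the file contents.
lines = ["Anolis carolinensis\n", "12345\n", "\n", "python regius\n", " 9abc\n"]

# re.match only matches at the start of the string, so no ^ is needed.
by_regex = [l for l in lines if re.match(r"[a-zA-Z]", l)]

# str.startswith accepts a tuple of prefixes, so no loop over letters is needed.
by_tuple = [l for l in lines if l.startswith(tuple(string.ascii_letters))]

# Slicing with [:1] returns '' for an empty string instead of raising IndexError.
by_isalpha = [l for l in lines if l[:1].isalpha()]

print(by_regex)
```

On ASCII-only input the three filters agree. Note that str.isalpha also accepts non-ASCII letters, so the results can differ if the file contains accented names.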

Retrieve matching strings from text file

I have the following text file and I want to retrieve the numbers in brackets
ID&number:Track_number(12930)_
ID&number:Track_number(394839)_
ID&number:Track_number(958236)_
So I've tried this
import re
file = open("text.txt", "r")
text = file.read()
file.close()
pattern = re.compile(r'Track_number(.*)_', re.UNICODE)
string = pattern.search(text).group(1)
print(string)
But it only displays the first result: (12930).
I was wondering if it was possible to have a list of all the matching results.
Thanks

You can use re.findall, for example:
>>> re.findall(r'\((\d+)\)', text)
['12930', '394839', '958236']

All you have to do is replace that search with findall. This will produce a list of all the matches.
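
As a rough illustration of the difference between the two suggestions, the sketch below runs findall with the question's original pattern and with a stricter one that captures only the digits. The text is an inline copy of the sample file, and the names with_parens and digits_only are made up for this example.

```python
import re

# Inline copy of the sample file from the question.
text = (
    "ID&number:Track_number(12930)_\n"
    "ID&number:Track_number(394839)_\n"
    "ID&number:Track_number(958236)_\n"
)

# The question's pattern: the group captures the parentheses as well.
with_parens = re.findall(r"Track_number(.*)_", text)

# Escaped parentheses outside the group leave only the digits in the result.
digits_only = re.findall(r"Track_number\((\d+)\)_", text)

print(with_parens)
print(digits_only)
```

When a pattern contains exactly one group, findall returns the text of that group for every match, which is why no call to group(1) is needed here.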

Take part of a regex search result in Python

I want to read in a header file and output each of the variables that has the form x = 1.0; as double = x;
At the moment I've got this, which just outputs the whole line:
import re
input = open("file_with_vars.hpp", 'r')
out = open("output.txt", 'w')
for line in input:
    if re.match(r"(.*) = (\d)", line):
        print(line, file=out)
But I can't work out how to take part of the line and output the variable name and the double string to file.
EDIT:
So now I have
for line in cell:
    m = re.search(r'(.*)\s*=\s*(\d+\.\d+)', line)
    print(m.group())
But I get the error: AttributeError: 'NoneType' object has no attribute 'group'

Use search instead of match.
The regex is .*\s*=\s*\d+\.\d+
Test:
import re
y = "x=1.0"
m = re.search(r'(.*)\s*=\s*(\d+\.\d+)', y)
The group function can be used to extract the matched strings:
>>> m.group()
'x=1.0'
>>> m.group(1)
'x'
>>> m.group(2)
'1.0'
EDIT
How to search lines within a file:
for line in cell:
    try:
        m = re.search(r'(.*)\s*=\s*(\d+\.\d+)', line)
        print(m.group())
    except AttributeError:
        pass
The NoneType error occurs because some lines in the file do not match the regex, so the search method returns None.
The try/except takes care of the exception.
pass is a null statement in Python.
For an input file
x=10.2
y=15.3
z=12.4
w=48
this creates the output
x=10.2
y=15.3
z=12.4
Here w=48 does not match the regex, so search returns None, which is safely handled by the try block.
OR
As Jerry pointed out, an if makes this simpler:
for line in cell:
    m = re.search(r'\S*\s*=\s*(\d+\.\d+)', line)
    if m:
        print(m.group())

You are printing the whole line after the match. Since a line may contain more than one match, you can use re.findall(). You also need [\d\.]+ instead of \d:
for line in input:
    if re.match(r"(.*) = [\d\.]+", line):
        print(re.findall(r"(.*) = [\d\.]+", line))
As for the spaces before and after =, you need to be sure they are always there. If matches like var=num are possible, put ? after each space in your regex pattern: (.*) ?= ?[\d\.]+

import re
input = open("file_with_vars.hpp", 'r')
out = open("output.txt", 'w')
for line in input:
    if re.findall(r"(.*?)\s*=\s*(\d+(?:\.\d*)?)", line):
        print(line, file=out)
Try this. This should work.
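
None of the snippets above actually writes the reformatted line, so here is a hedged sketch of that last step. It assumes the desired output is literally "double = <name>;" as the question phrases it, and that each assignment sits at the start of a line and ends with a semicolon. The names ASSIGNMENT, extract_doubles and the sample header lines are hypothetical.

```python
import re

# Name at the start of the line, '=', then a number with a decimal point, then ';'.
ASSIGNMENT = re.compile(r"^\s*(\w+)\s*=\s*(\d+\.\d+)\s*;")

def extract_doubles(lines):
    """Return (name, value) pairs for lines that look like 'x = 1.0;'."""
    pairs = []
    for line in lines:
        m = ASSIGNMENT.search(line)
        if m:  # search returns None for non-matching lines
            pairs.append((m.group(1), m.group(2)))
    return pairs

# Made-up header content for illustration.
header = ["x = 1.0;\n", "int n = 3;\n", "y=2.5;\n", "// comment\n"]
pairs = extract_doubles(header)
output_lines = [f"double = {name};" for name, _value in pairs]
print(output_lines)
```

Checking the match object with if before calling group() is what prevents the AttributeError from the question's edit.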

Splitting lines in python based on some character

Input:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.
'!' is the starting character and +0013 should be the ending of each line (if present).
Problem which I am getting:
Output is like :
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
Any help would be highly appreciated...!!!
My code:
file_open= open('sample.txt','r')
file_read= file_open.read()
file_open2= open('output.txt','w+')
counter =0
for i in file_read:
    if '!' in i:
        if counter == 1:
            file_open2.write('\n')
            counter= counter -1
        counter= counter +1
    file_open2.write(i)

You can try something like this:
with open("abc.txt") as f:
    data = f.read().replace("\n", "")  # remove the newlines
    # depending on how the file is opened, the newline can be "\r\n" instead of "\n"
    ans = filter(None, data.split("!"))  # split the data at '!', then filter out empty strings
    for x in ans:
        print("!" + x)  # or write to some other file
Output:
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Could you just use str.split?
lines = file_read.split('!')
Now lines is a list which holds the split data. This is almost the lines you want to write -- The only difference is that they don't have trailing newlines and they don't have '!' at the start. We can put those in easily with string formatting -- e.g. '!{0}\n'.format(line). Then we can put that whole thing in a generator expression which we'll pass to file.writelines to put the data in a new file:
file_open2.writelines('!{0}\n'.format(line) for line in lines)
You might need:
file_open2.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines)
if you find that you're getting more newlines than you wanted in the output.
A few other points, when opening files, it's nice to use a context manager -- This makes sure that the file is closed properly:
with open('inputfile') as fin:
    lines = fin.read().split('!')
with open('outputfile','w') as fout:
    # skip the empty string that split() produces before the first '!'
    fout.writelines('!{0}\n'.format(line.replace('\n','')) for line in lines if line)

Another option, using replace instead of split, since you know the starting and ending characters of each line:
In [14]: data = """!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/1
2/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:14,000.
0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W
55.576,+0013!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013!,A,56
281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34
:18,000.0,0,37N22.714,121W55.576,+0013!,A,56281,12/12/19,19:34:19,000.0,0,37N22.""".replace('\n', '')
In [15]: print(data.replace('+0013!', "+0013\n!"))
!,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
!,A,56281,12/12/19,19:34:19,000.0,0,37N22.

Just for some variance, here is a regular expression answer:
import re
outputFile = open('output.txt', 'w+')
with open('sample.txt', 'r') as f:
    for line in re.findall("!.+?(?=!|$)", f.read(), re.DOTALL):
        outputFile.write(line.replace("\n", "") + '\n')
outputFile.close()
It will open the output file, get the contents of the input file, and loop through all the matches using the regular expression !.+?(?=!|$) with the re.DOTALL flag. The regular expression explanation & what it matches can be found here: http://regex101.com/r/aK6aV4
After we have a match, we strip out the new lines from the match, and write it to the file.
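
A small in-memory sketch of how that pattern behaves, using a shortened, made-up sample in place of sample.txt (the names raw, RECORD and records are illustrative only):

```python
import re

# Shortened stand-in for the wrapped input: records start with '!' and
# line breaks fall at arbitrary positions inside them.
raw = "!,A,1,+0013!,A,2,12/1\n2/19,+0013!,A,3,37N22.\n"

# '!' starts a record; '.+?' grows lazily until the lookahead sees the next '!'
# or the end of the text. re.DOTALL lets '.' cross the embedded newlines.
RECORD = re.compile(r"!.+?(?=!|$)", re.DOTALL)

# Remove the embedded newlines from every match.
records = [chunk.replace("\n", "") for chunk in RECORD.findall(raw)]
print(records)
```

Because the lookahead does not consume the following "!", each record keeps its own leading "!" and nothing has to be added back afterwards.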

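Let's remove the existing newlines, add a \n before every "!", and then let Python splitlines :-) (the first element is an empty string when the text starts with "!"):
file_read.replace("\n", "").replace("!", "\n!").splitlines()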

I would actually implement this as a generator, so that you can work on the data stream rather than on the entire content of the file. This is quite memory-friendly when working with huge files.
>>> def split_on_stream(it, sep="!"):
...     prev = ""
...     for line in it:
...         line = (prev + line.strip()).split(sep)
...         for parts in line[:-1]:
...             yield parts
...         prev = line[-1]
...     yield prev
...
>>> with open("test.txt") as fin:
...     for parts in split_on_stream(fin):
...         print(parts)
...
,A,56281,12/12/19,19:34:12,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:13,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:14,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:15,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:16,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:17,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:18,000.0,0,37N22.714,121W55.576,+0013
,A,56281,12/12/19,19:34:19,000.0,0,37N22.
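
The output above has lost the leading "!" of each record. A possible variant that puts it back and skips empty pieces is sketched below; it assumes the input starts with the separator, and the names records, pending and the io.StringIO sample are hypothetical.

```python
import io

def records(stream, sep="!"):
    """Yield complete records from a wrapped stream, one physical line at a time."""
    pending = ""  # tail of the previous line that may continue on the next one
    for raw in stream:
        pieces = (pending + raw.rstrip("\r\n")).split(sep)
        for piece in pieces[:-1]:
            if piece:  # skip the empty piece before the very first separator
                yield sep + piece
        pending = pieces[-1]
    if pending:
        yield sep + pending

# Shortened, made-up input with line breaks inside records.
sample = io.StringIO("!,A,1,+0013!,A,2,12/1\n2/19,+0013!,A,3,000.\n0,+0013!,A,4,37N22.\n")
result = list(records(sample))
print(result)
```

Only one physical line plus the pending tail is held in memory at any time, which keeps the memory-friendly property of the generator approach.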
