Regex search term used in Notepad++ does not work with python - python

I'm working with a large .json file filled with Twitter bios and would like to extract the screen_names. To keep the search from also returning users mentioned in the bio section, it is important to extract only the first match of each line.
When I open the file in Notepad++ I can use the following regex to do exactly that:
(^.*?)\K"screen_name": "(\w+)"
Using the same as part of an re.findall or re.search in python does not result in any matches.
I'm totally new to both Python and regex so I'm fairly certain I'm not fully aware of all the necessary coding.
Many thanks in advance!

As noted by other users, Python and Notepad++ use different regex flavours (Python's re module does not support \K, for example), so to get the result I wanted I used the following code:
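import re
# Capture the name that follows the "screen_name" key.
regex = re.compile(r'"screen_name":\s*"(\w+)"')
# Open both files once instead of reopening the output for every line.
with open("followers.json", "r") as f, open("followers.txt", "a") as outp:
    for line in f:
        # search() returns the first match in the line, or None if there is none.
        output = regex.search(line)
        if output:
            outp.write(output.group(1) + "\n")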
This will analyse your specified .json file, read it line by line, and save every first match of each line in the file "followers.txt".
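To see why the first match matters, here is a minimal sketch on a single made-up line (the sample data and the helper name `first_screen_name` are illustrative and not taken from the original file). The capture group stands in for `\K`, and `search` stops at the first hit where `findall` would return every hit on the line.

```python
import re

# Same pattern as in the code above.
SCREEN_NAME = re.compile(r'"screen_name":\s*"(\w+)"')

def first_screen_name(line):
    """Return the first screen name found in the line, or None."""
    match = SCREEN_NAME.search(line)
    return match.group(1) if match else None

# Hypothetical line in which a second screen_name appears later on.
sample = '{"screen_name": "alice", "mentions": [{"screen_name": "bob"}]}'

print(first_screen_name(sample))    # first match only
print(SCREEN_NAME.findall(sample))  # every match on the line
```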

Related

Read in CSV file in Python differ from Uipath: Line-by-line loop needed?

I am new to Python. I used to use Uipath to read in CSV files and extract certain information using regular expressions.
However, when I tried this in Python I got the error "NoneType object has no attribute 'group'", which means that no match was found.
But my regex works fine in Uipath.
Python Code:
text_lines=f.readlines()
model_regex=re.compile(r'(?<=model choice :).*')
model=model_regex.search('text_lines')
print(model.group())
Do I have to place my variable (text_lines) in quotation marks here?
model=model_regex.search('text_lines')
I basically open a CSV file and read all the text using
text_lines=f.readlines()
But will text_lines look exactly like the contents of the CSV file? I noticed that when I use print(text_lines), all the lines look jumbled up.
Is there a way to read in text_lines so that it looks exactly like the CSV file and my regex works?
Or, in Python, do I need to loop over text_lines line by line and apply the regex to each line?
Thank you
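A hedged sketch of what is probably going on, using made-up file content in place of the real CSV: `search('text_lines')` searches the literal nine-letter string rather than the variable, and `readlines()` returns a list of strings (which is why the printed output looks jumbled), whereas `search` expects a single string such as the result of `f.read()`. The helper name `find_model` and the sample text are assumptions for illustration.

```python
import re

model_regex = re.compile(r"(?<=model choice :).*")

def find_model(text):
    """Return whatever follows 'model choice :' on its line, or None."""
    match = model_regex.search(text)
    return match.group().strip() if match else None

# Made-up content standing in for what f.read() would return.
sample_text = "date : 2020-01-01\nmodel choice : ABC-123\nprice : 10\n"

print(find_model(sample_text))   # the variable, without quotes
print(find_model("text_lines"))  # the literal string: no match, so None
```

Looping over the lines and calling `model_regex.search(line)` on each one would work as well; checking for `None` before calling `group()` avoids the error in either case.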

Python extract all hrefs from a .txt file

I have a folder that contains thousands of raw HTML files. I would like to extract all the hrefs from each page. What would be the fastest way to do that?
href="what_i_need_here"
import re
with open('file', 'r') as f:
    print(re.findall(r"href=\"(.+?)\"\n", ''.join(f.readlines())))
This is my guess at what might work, but there's no way to tell since you didn't provide any sample input. The regex used is href="(.+?)"\n. I read the content using f.readlines(), then combined it into a single string to search using ''.join. See if it works, or add examples of the text.
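One thing to watch: with the trailing `\n`, the pattern only matches when the closing quote is the last character on its line. A variant without it, shown here as a sketch on made-up HTML (the sample string and the name `extract_hrefs` are assumptions), picks up every double-quoted href:

```python
import re

# No trailing \n, so the href may appear anywhere on a line.
HREF = re.compile(r'href="(.+?)"')

def extract_hrefs(html):
    """Return every double-quoted href value found in the text."""
    return HREF.findall(html)

# Made-up sample standing in for the contents of one page.
sample_html = '<a href="a.html">A</a> <a href="b.html">B</a>\n<link href="style.css">\n'

print(extract_hrefs(sample_html))
print(re.findall(r'href="(.+?)"\n', sample_html))  # original pattern on the same text
```

A regex like this is only a rough tool for HTML; it will miss single-quoted or unquoted attribute values.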

How to extract word from text file to use as a variable?

I have a simple script that writes the network name to a log file. It displays the following.
All User Profile : NET_NAME
What I would like to do is open that text file, extract just the NET_NAME part, use it as a variable, and also save the text file with the changes.
I have tried using split function, it kind of works when using the text, but when trying to read from the file it doesn't work. I have searched regex but do not know the syntax to achieve what I want.
split can indeed be used to achieve this. In case you're using Python3 and the content is in text.txt file, the snippet below should be able to do the trick:
with open("text.txt", "rb") as f:
    content = f.read().decode("utf-8")
name = content.split(":")[1].strip()
print(name)
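Since the question also mentions regex, here is a rough equivalent as a sketch. The sample string stands in for the file content, and the helper name `extract_profile_name` is made up for illustration.

```python
import re

def extract_profile_name(text):
    """Return the value after 'All User Profile :', or None if absent."""
    match = re.search(r"All User Profile\s*:\s*(.+)", text)
    return match.group(1).strip() if match else None

# Stand-in for the content read from the log file.
sample = "All User Profile : NET_NAME\n"

net_name = extract_profile_name(sample)
print(net_name)
```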

Webbrowser() reading through a text file for URLS

I am trying to write a script to automate browsing to my most commonly visited websites. I have put the websites into a list and am trying to open them using the webbrowser module in Python. My code looks like the following at the moment:
import webbrowser
f = open("URLs", "r")
list = f.readline()
for line in list:
    webbrowser.open_new_tab(list)
This only reads the first line from my file "URLs" and opens it in the browser. Could anyone please help me understand how I can read through the entire file and open each URL in a separate tab?
Other options that achieve the same result are also welcome.
You have two main problems.
The first problem you have is that you are using readline and not readlines. readline will give you the first line in the file, while readlines gives you a list of your file contents.
Take this file as an example:
# urls.txt
http://www.google.com
http://www.imdb.com
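On that file, the two calls behave as follows. This is a small sketch that uses an in-memory `io.StringIO` object as a stand-in for urls.txt (an assumption, so that it runs without a real file); `readline` and `readlines` behave the same way on a file opened with `open`.

```python
import io

# In-memory stand-in for urls.txt.
fake_file = io.StringIO("http://www.google.com\nhttp://www.imdb.com\n")

first = fake_file.readline()        # one string: the first line only
fake_file.seek(0)                   # rewind before reading again
everything = fake_file.readlines()  # a list with one string per line

# Iterating over a single string yields characters, not URLs.
chars = list(first.strip())[:4]

print(repr(first))
print(everything)
print(chars)
```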
Also, get into the habit of using a context manager, as this will close the file for you once you have finished reading from it. Right now you are leaving your file open, even though there is no real danger in that for what you are doing.
Here is the information from the documentation on files. There is a mention about best practices with handling files and using with.
The second problem in your code is that, when you are iterating over list (which you should not use as a variable name, since it shadows the builtin list), you are passing list into your webbrowser call. This is definitely not what you are trying to do. You want to pass the loop variable.
So, taking all this into account, your final solution will be:
import webbrowser
with open("urls.txt") as f:
    for url in f:
        webbrowser.open_new_tab(url.strip())
Note the strip that is called in order to ensure that newline characters are removed.
You're not reading the file properly. You're only reading the first line. Also, assuming you were reading the file properly, you're still trying to open list, which is incorrect. You should be trying to open line.
This should work for you:
import webbrowser
with open('file name goes here') as f:
    all_urls = f.read().split('\n')
for each_url in all_urls:
    webbrowser.open_new_tab(each_url)
My answer assumes that you have the URLs one per line in the text file. If they are separated by spaces, simply change the line to all_urls = f.read().split(' '). If they're separated in another way, change the split argument accordingly.
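One caveat with `split('\n')`: if the file ends with a newline, the resulting list ends with an empty string, which would then be passed to the browser as well. A short sketch on a made-up string shows the difference and two ways around it:

```python
# Stand-in for f.read() on a file that ends with a newline.
text = "http://www.google.com\nhttp://www.imdb.com\n"

by_newline = text.split("\n")                    # trailing empty string
by_lines = text.splitlines()                     # no trailing empty string
non_empty = [url for url in by_newline if url]   # or filter the empties out

print(by_newline)
print(by_lines)
print(non_empty)
```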

Trying to create a Python Script to extract data from .log files

I'm trying to create a Python Script but I'm a bit stuck and can't find what I'm looking for on a Google search as it's quite specific.
I need to run a script on two .log files (auth.log and access.log) to view the following information:
Find how many attempts were made with the bin account
So how many attempts the bin account made to try and get into the server.
The logs come from a server that was hacked, and I need to identify how it happened and who is responsible.
Would anyone be able to give me some help in how I go about doing this? I can provide more information if needed.
Thanks in advance.
Edit:
I've managed to print all the times 'bin' appears in the log which is one way of doing it. Does anyone know if I can count how many times 'bin' appears as well?
with open("auth.log") as f:
    for line in f:
        if "bin" in line:
            print(line)
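Counting can be sketched along these lines. The log lines below are made up, and the `for bin` wording is an assumption about what the entries in auth.log look like. Note that a plain substring test also counts lines such as paths containing `/usr/sbin`, so a word-boundary regex may give a more honest number.

```python
import re

def count_matching_lines(lines, pattern):
    """Count the lines in which the regex pattern occurs at least once."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))

# Made-up entries standing in for the lines of auth.log.
sample_log = [
    "Failed password for bin from 10.0.0.5 port 22 ssh2",
    "Failed password for root from 10.0.0.5 port 22 ssh2",
    "CRON[123]: (root) CMD (/usr/sbin/logrotate)",
    "Failed password for bin from 10.0.0.7 port 22 ssh2",
]

# Plain substring test: also counts the /usr/sbin line.
plain_count = sum(1 for line in sample_log if "bin" in line)
# Word-boundary test: counts only lines that mention the bin user.
user_count = count_matching_lines(sample_log, r"\bfor bin\b")

print(plain_count, user_count)
```

With a real file, passing the open file object in place of `sample_log` should work the same way, since both are iterated line by line.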
Given that you work with system logs and their format is known and stable, my approach would be something like:
identify a set of keywords (either common, or one per log)
for each log, iterate line by line
once keywords match, add the relevant information from each line in e.g. a dictionary
You could use shell tools (like grep, cut and/or awk) to pre-process the log and extract relevant lines from the log (I assume you only need e.g. error entries).
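The steps above could be sketched roughly like this. The keywords and log lines are invented for illustration, and `collect_by_keyword` is a hypothetical helper name rather than part of any library.

```python
from collections import defaultdict

def collect_by_keyword(lines, keywords):
    """Group lines under each keyword they contain."""
    hits = defaultdict(list)
    for line in lines:
        for keyword in keywords:
            if keyword in line:
                hits[keyword].append(line.rstrip("\n"))
    return dict(hits)

# Made-up entries standing in for one log file.
sample_log = [
    "Failed password for bin from 10.0.0.5\n",
    "Accepted password for alice from 10.0.0.9\n",
    "Failed password for root from 10.0.0.5\n",
]
keywords = ["Failed password", "Accepted password"]

hits = collect_by_keyword(sample_log, keywords)
for keyword, matched in hits.items():
    print(keyword, len(matched))
```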
You can use something like this as a starting point.
If you want to use an existing tool, you can use ELK (Elasticsearch, Logstash and Kibana).
If not, you have to read each log file yourself and then apply a regex that fits your requirements.
In case you might be interested in extracting some data and save it to a .txt file, the following sample code might be helpful:
import re
import sys
import os.path
expDate = '2018-11-27'
expTime = '11-21-09'
infile = r"/home/xenial/Datasets/CIVIT/Nov_27/rover/NMND17420010S_"+expDate+"_"+expTime+".LOG"
keep_phrases = ["FINESTEERING"]
# Read the whole log into a list of lines.
with open(infile) as f:
    lines = f.readlines()
with open('/home/xenial/Datasets/CIVIT/Nov_27/rover/GPS_'+expDate+'_'+expTime+'.txt', 'w') as file:
    file.write("gpsWeek,gpsSOW\n")
    for line in lines:
        for phrase in keep_phrases:
            if phrase in line:
                # Both fields in one pass (kept from the original, not used below).
                resFind = re.findall(r'\.*?FINESTEERING,(\d+).*?,(\d+\.\d*)', line)[0]
                # GPS week: the integer right after the keyword.
                gpsWeek = re.findall(r'\.*?FINESTEERING,(\d+)', line)[0]
                gpsWeekStr = str(gpsWeek)
                # Seconds of week: the decimal number right after the week.
                gpsSOW = re.findall(r'\.*?FINESTEERING,' + gpsWeekStr + r',(\d+\.\d*)', line)[0]
                gpsSOWStr = str(gpsSOW)
                file.write(gpsWeekStr+','+gpsSOWStr+'\n')
                # One output row per line is enough; skip the remaining phrases.
                break
print("------------------------------------")
In my case, FINESTEERING was an interesting keyword in my .log file to extract numbers, including GPS_Week and GPS_Seconds_of_Weeks. You may modify this code to suit your own application.
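If only those two fields are needed, a single pattern with two groups may be enough. This is a sketch under the assumption that the week and the seconds always follow the keyword directly, separated by commas; the sample line is made up and `parse_gps_time` is a hypothetical helper name.

```python
import re

# Week (integer) and seconds of week (decimal) right after the keyword.
GPS_FIELDS = re.compile(r"FINESTEERING,(\d+),(\d+\.\d*)")

def parse_gps_time(line):
    """Return (gps_week, gps_seconds_of_week), or None if the line has no match."""
    match = GPS_FIELDS.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

# Made-up line in the shape the patterns above expect.
sample_line = "#RANGEA,COM1,0,55.0,FINESTEERING,2029,213687.000,02000000"

print(parse_gps_time(sample_line))
```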
