Extract data using regex between specified strings - python

Question 1: I want to extract the data between "Target Information" and the line before "Group Information" and store it in a variable.
Question 2: Next, I want to extract the data from "Group Information" to the end of the file and store it in a variable.
Question 3: With the data from both cases above, I want to extract the line immediately after the line that starts with "Name".
With the code below I was able to get the information between "Target Information" and "Group Information", captured in the required_lines variable.
Next, I am trying to get the line after the line "Name", but this fails. Also, can the logic be implemented with a regex call?
# Extract the lines between "Target Information" and "Group Information"
with open('showrcopy.txt', 'r') as f:
    file = f.readlines()

required_lines1 = []
required_lines = []
inRecordingMode = False
for line in file:
    if not inRecordingMode:
        if line.startswith('Target Information'):
            inRecordingMode = True
    elif line.startswith('Group Information'):
        inRecordingMode = False
    else:
        required_lines.append(line.strip())
print(required_lines)

# Extract the line after the line "Name"
def gen():
    for x in required_lines:
        yield x

for line in gen():
    if "Name" in line:
        print(next(gen()))
showrcopy.txt
root#gnodee184119:/home/usr/redsuren# date; showrcopy -qw
Tue Aug 24 00:20:38 PDT 2021
Remote Copy System Information
Status: Started, Normal
Target Information
Name ID Type Status Policy QW-Server QW-Ver Q-Status Q-Status-Qual ATF-Timeout
s2976 4 IP ready mirror_config https://10.157.35.148:8443 4.0.007 Re-starting Quorum not stable 10
Link Information
Target Node Address Status Options
s2976 0:9:1 192.168.20.21 Up -
s2976 1:9:1 192.168.20.22 Up -
receive 0:9:1 192.168.10.21 Up -
receive 1:9:1 192.168.10.22 Up -
Group Information
Name Target Status Role Mode Options
SG_hpux_vgcgloack.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active
LocalVV ID RemoteVV ID SyncStatus LastSyncTime
vgcglock_SG_cluster 13496 vgcglock_SG_cluster 28505 Synced NA
Name Target Status Role Mode Options
aix_rcg1_AA.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active
LocalVV ID RemoteVV ID SyncStatus LastSyncTime
tpvvA_aix_r.2 20149 tpvvA_aix.2 41097 Synced NA
tpvvA_aix_r.3 20150 tpvvA_aix.3 41098 Synced NA
tpvvA_aix_r.4 20151 tpvvA_aix.4 41099 Synced NA
tpvvA_aix_r.5 20152 tpvvA_aix.5 41100 Synced NA
tpvvA_aix_r.6 20153 tpvvA_aix.6 41101 Synced NA
tpvvA_aix_r.7 20154 tpvvA_aix.7 41102 Synced NA
tpvvA_aix_r.8 20155 tpvvA_aix.8 41103 Synced NA
tpvvA_aix_r.9 20156 tpvvA_aix.9 41104 Synced NA
tpvvA_aix_r.10 20157 tpvvA_aix.10 41105 Synced NA

Here's a regex solution to pull the target info and group info:
import re

with open("./showrcopy.txt", "r") as f:
    text = f.read()

# [\s\S] matches any character, including newlines
target_info_pattern = re.compile(r"Target Information([\s\S]*)Group Information")
group_info_pattern = re.compile(r"Group Information([\s\S]*)")

target_info = target_info_pattern.findall(text)[0].strip().split("\n")
group_info = group_info_pattern.findall(text)[0].strip().split("\n")

target_info_line_after_name = target_info[1]
group_info_line_after_name = group_info[1]
And the lines you're interested in:
>>> target_info_line_after_name
's2976 4 IP ready mirror_config https://10.157.35.148:8443 4.0.007 Re-starting Quorum not stable 10'
>>> group_info_line_after_name
'SG_hpux_vgcgloack.r518634 s2976 Started Primary Sync auto_recover,auto_failover,path_management,auto_synchronize,active_active'
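The same regex approach can also cover the third requirement, grabbing the line immediately after each line that starts with "Name". A minimal sketch, using a small inline sample rather than the full showrcopy.txt output:

```python
import re

section = """Name ID Type Status
s2976 4 IP ready
Link Information
Name Target Status
SG_group s2976 Started"""

# With re.MULTILINE, ^ anchors at each line start; the group captures
# the whole line that follows any line beginning with "Name"
after_name = re.findall(r"^Name[^\n]*\n([^\n]*)", section, flags=re.MULTILINE)
print(after_name)  # one entry per "Name" header line
```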


RuntimeError: dictionary changed size during iteration

I am getting the following error when I run my code. It started happening with apparently no changes to the code. The code basically tries to render the device config using a Jinja2 template.
Traceback (most recent call last):
  File "vdn_ler_generate_config.py", line 380, in <module>
    lerConfig = lerTemplate.render(config=copy.copy(config.lerDevList[tLer]))
  File "/Users/nileshkhambal/Documents/myansible/lib/python3.8/site-packages/jinja2/environment.py", line 1304, in render
    self.environment.handle_exception()
  File "/Users/nileshkhambal/Documents/myansible/lib/python3.8/site-packages/jinja2/environment.py", line 925, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "./templates/jinja2/JuniperLerConfigTemplate", line 71, in top-level template code
    {%- for tBgpGrp in config.vrfs[tvrf].bgpProto %}
RuntimeError: dictionary changed size during iteration
The code snippet below inside the "try" and "except" block shows the problem. There is exactly the same code for a second vendor and it works fine. Using print statements in the except block I see that an extra group object (named 'default') is added to the dictionary during iteration. According to my YAML config there should be only one such group object.
try:
    if tVendor == 'VENDOR1':
        lerTemplate = ENV.get_template('Vendor1LerConfigTemplate')
        lerConfig = lerTemplate.render(config=config.lerDevList[tLer])
        pOutputFile = './configs/' + str(tLer) + '.cfg'
        with open(pOutputFile, "w+") as opHandle:
            #print('Writing configuration for {}'.format(tLer))
            opHandle.write(lerConfig)
            opHandle.write('\n')
except:
    for aKey in config.lerDevList[tLer].vrfs.keys():
        #print(config.lerDevList[tLer].vrfs[aKey].bgpProto)
        print('What VRF: {} How many BGP Groups: {}'.format(aKey, len(config.lerDevList[tLer].vrfs[aKey].bgpProto.keys())))
        for agrp in config.lerDevList[tLer].vrfs[aKey].bgpProto.keys():
            print(config.lerDevList[tLer].vrfs[aKey].bgpProto[agrp])
    continue

if tVendor == 'VENDOR2':
    for aKey in config.lerDevList[tLer].vrfs.keys():
        #print(config.lerDevList[tLer].vrfs[aKey].bgpProto)
        for agrp in config.lerDevList[tLer].vrfs[aKey].bgpProto.keys():
            print(config.lerDevList[tLer].vrfs[aKey].bgpProto[agrp])
    lerTemplate = ENV.get_template('Vendor2LerConfigTemplate')
    lerConfig = lerTemplate.render(config=config.lerDevList[tLer])
    pOutputFile = './configs/' + str(tLer) + '.cfg'
    with open(pOutputFile, "w+") as opHandle:
        #print('Writing configuration for {}'.format(tLer))
        opHandle.write(lerConfig)
        opHandle.write('\n')
Using some print() statements I can see that the code creating the group from the base class object adds one group, but the iteration code seems to add or see an extra group named 'default'. 'default' is the group name used in the base class; once the object is instantiated it is assigned a proper group name.
Creating vrf bgp group object for xxx4-bb-pe1 BLUE: AS65YYY-BLUE-V4-PEER
Double checking bgp groups: 1
Creating vrf bgp group object for yyy6-bb-pe1 RED: AS65YYY-RED-V4-PEER
Double checking bgp groups: 1
Creating vrf bgp group object for zzz2-bb-pe1 BLUE: AS4200XXXXXX-BLUE-V4-PEER
Double checking bgp groups: 1
Creating vrf bgp group object for zzz2-bb-pe2 RED: AS4200XXXXXX-RED-V4-PEER
Double checking bgp groups: 1
Creating vrf bgp group object for xyxy2-bb-gw1 BLUE: AS4200XXXXXX-BLUE-V4-PEER
Double checking bgp groups: 1
Creating vrf bgp group object for xyxy2-bb-gw2 RED: AS4200XXXXXX-RED-V4-PEER
Double checking bgp groups: 1
Writing configuration for xxx4-bb-pe1
AS65YYY-BLUE-V4-PEER
Writing configuration for yyy6-bb-pe1
AS65YYY-RED-V4-PEER
Writing configuration for zzz2-bb-pe1
AS4200XXXXXX-BLUE-V4-PEER
Writing configuration for zzz2-bb-pe2
AS4200XXXXXX-RED-V4-PEER
Writing configuration for xyxy2-bb-gw1
What VRF: BLUE How many BGP Groups: 2 <<< extra group
AS4200XXXXXX-BLUE-V4-PEER
default << extra group
Writing configuration for xyxy2-bb-gw2
What VRF: RED How many BGP Groups: 2 <<< extra group
AS4200XXXXXX-RED-V4-PEER
default <<<< extra group
Here is the default group class definition:
from collections import defaultdict

class bgpGroup():
    def __init__(self):
        self.vrf = ''
        self.grpName = 'default'
        self.grpType = 'internal'
        self.grpDescr = ''
        self.grpLocalAddr = '0.0.0.0'
        self.clusterid = ''
        self.gr_disable = False
        self.remove_private = False
        self.as_override = False
        self.peer_as = ''
        self.local_as = ''
        self.local_as_private = False
        self.local_as_noprepend = False
        self.holdtime = ''
        self.grpFamily = []
        self.grpImport = ''
        self.grpExport = ''
        self.grpNbrList = defaultdict(bgpNbr)
        self.grpLoopDetect = False
        self.grpMltHopTtl = 0
        self.grpInetAsPathPrependReceive = False
        self.grpLabelInet6AsPathPrependReceive = False

    def __repr__(self):
        return self.grpName
I really don't know what the bug is, but have you tried:
for aKey in list(config.lerDevList[tLer].vrfs.keys()):
This grabs all the keys from the dictionary at the start, so if another entry is added during iteration, it won't matter.
Note also that you don't need .keys(), though it doesn't hurt: iterating over a dictionary iterates over its keys.
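As a minimal illustration of why snapshotting the keys helps (the dictionary here is hypothetical, not the poster's config object):

```python
groups = {"AS65YYY-BLUE-V4-PEER": 1}

# Iterating over a snapshot of the keys is safe even if the dict is
# mutated inside the loop; iterating over `groups` directly while
# inserting would raise "dictionary changed size during iteration".
for name in list(groups):
    groups.setdefault("default", 0)  # mutation during the loop

print(sorted(groups))
```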
The issue was in the Jinja2 template code: a wrong variable was referenced inside a statement in a for loop. Fixing it fixed the error; sorry for the noise. The error pointed to the Jinja2 line where the loop starts, not to the actual line containing the mistake. All good now.

How to read lines with different formats (exceptions) in a Lambda function (Python) that reads from S3 to ELK?

Example Format of my Log file
-------------------------------------------------------------Start of Log file Format----------------------------------------
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1047 mili sec
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1099 mili sec
2020-08-14 05:35:48.762 - [ERROR] - from [Class:com.webservices.helper.THelper Method:lambda$0] in 20 - Parsing Response from server Failed
org.json.JSONException: No value for dataModel
at org.json.JSONObject.get(JSONObject.java:355) ~[android-json-0.0.20131108.vaadin1.jar:0.0.20131108.vaadin1]
-----------------------------------------------------------------------End of Log file format ------------------------------
I am currently handling (admittedly with bad coding practice) the lines that start with a date, like the first three lines of the log above.
The problem starts from the 4th line onward, where the log contains exception lines with no date. The previous line has the date, and the exception lines are its continuation.
I do not know how to handle those lines, as the format changes. I want to reuse the previous line's date for the exception lines, either by keeping it in a temporary variable and using it when the format changes, or in some other way.
The date is mandatory, because ELK takes the log's timestamp from it.
Also, suggestions for cleaner code are welcome; I need the Lambda to run faster.
Currently this is the logic in my Python code (I am a complete beginner in Python):
import boto3
import botocore
import re
import json
import requests
...
...

def handler(event, context):
    print("Starting handler function")
    for record in event['Records']:
        # Get the bucket name and key for the new file
        bucket = record['s3']['bucket']['name']
        print("Bucket name - " + bucket)
        print(date_time)
        key = record['s3']['object']['key']
        print("key name - " + key)
        # Get, read, and split the file into lines
        obj = s3.get_object(Bucket=bucket, Key=key)
        body = obj['Body'].read()
        lines = body.splitlines()
        # Match the regular expressions to each line and index the JSON
        for line in lines:
            document = {}
            try:
                if line[0] == '2':
                    listOfData = line.split(" - ")
                    date = datetime.strptime(listOfData[0], "%Y-%m-%d %H:%M:%S.%f")
                    timestamp = date.isoformat()
                    level = listOfData[1]
                    classname = listOfData[2]
                    message = listOfData[-1]
                    document = {"timestamps": timestamp, "level": level, "classnames": classname, "messages": message}
                    print(document)
                else:
                    document = {"messages": line}
            except ClientError as e:
                raise e
            r = requests.post(url, auth=awsauth, json=document, headers=headers)
            print(r)
Additional Info:
As suggested in Ben's answer, when I print this_line I get all the lines in the log file printed properly as separate entries, but there is a problem with the exception lines in between. Below is what is printed:
{'line content': ['2020-08-14 05:35:48.762 - [ERROR] - from [Class:com.webevics.helpr.Helpr Method:lamda] in 2 - Parsng Respns from servr Failed', 'org.jsn.JSONExcepion: No vlue for dataModel'], 'date string': '2020-08-14 05:35:47.655'}
{'line content': ['2020-08-14 05:35:48.762 - [ERROR] - from [Class:com.webserics.helpr.Helpr Method:lambda] in 2 - Parsing Respnse from servr Faied', 'org.jsn.JSONException: No vlue for dataModel', 'at org.json.JSONObject.get(JSONObject.java:355) ~[android-json-0.0.20131108.vaadin1.jar:0.0.20131108.vaadin1]'], 'date string': '2020-08-14 05:35:47.655'}
Here two entries are printed, where the first entry is incomplete and the second one is correct. Can we make it so that the first entry is not emitted and only the second one ends up in this_line?
I started by placing your sample data in a file:
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1047 mili sec
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1099 mili sec
2020-08-14 05:35:48.762 - [ERROR] - from [Class:com.webservices.helper.THelper Method:lambda$0] in 20 - Parsing Response from server Failed
org.json.JSONException: No value for dataModel
at org.json.JSONObject.get(JSONObject.java:355) ~[android-json-0.0.20131108.vaadin1.jar:0.0.20131108.vaadin1]
2020-08-14 05:35:48.752 - [INFO] - from [Class:com.webservices.services.impl.DataImpl Method:bData] in 20 - Data Single completed in 1099 mili sec
I've added a fourth entry just to verify my approach works. The Python below detects the start of an entry using a regular expression tailored to the date format:
import re

with open('example.txt', 'r') as file_handle:
    file_content = file_handle.read().split("\n")

list_of_lines = []
this_line = {}
for index, line in enumerate(file_content):
    if len(line.strip()) > 0:
        reslt = re.match(r'\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}', line)
        if reslt:  # line starts with a date
            if len(this_line.keys()) > 0:
                list_of_lines.append(this_line)
            date_string = reslt.group(0)
            this_line = {'date string': date_string, 'line content': []}
            this_line['line content'].append(line)
        else:
            this_line['line content'].append(line)
# Don't forget the final entry still held in this_line
if this_line:
    list_of_lines.append(this_line)
The data structure produced, list_of_lines, is a list of dictionaries. Each dictionary is one log entry, which may span one or more lines. The two keys in each dictionary are date string and line content.
To review the output data structure, try running
for line in list_of_lines[2]['line content']:
    print(line)
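If each entry should eventually become one flat record (e.g. for posting to ELK), the accumulated structure can be collapsed afterwards. A sketch assuming list_of_lines has the shape produced above (the sample entry is invented):

```python
# Hypothetical entry in the shape the accumulation loop produces
list_of_lines = [
    {'date string': '2020-08-14 05:35:48.762',
     'line content': ['2020-08-14 05:35:48.762 - [ERROR] - Parsing failed',
                      'org.json.JSONException: No value for dataModel']},
]

# Collapse each multi-line entry into one document with its timestamp
documents = [{'timestamp': entry['date string'],
              'message': ' '.join(entry['line content'])}
             for entry in list_of_lines]
print(documents[0]['timestamp'])  # '2020-08-14 05:35:48.762'
```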

Get the latest FTP folder name in Python

I am trying to write a script to get the latest file from the latest subdirectory of an FTP server in Python. My problem is that I cannot figure out which subdirectory is the latest. There are two options available: the subdirectories have a ctime, and each directory name contains the date on which it was created. But I do not know how to get the name of the latest directory. I have done it the following way, which only works if the first entry happens to be the latest directory (i.e. hoping the server returns the listing sorted by ctime, descending):
import ftplib
import os
import time

ftp = ftplib.FTP('test.rebex.net', 'demo', 'password')
ftp.cwd(str((ftp.nlst())[0]))  # works only if the listing is sorted by date, descending
But is there any way to find the exact directory, either by ctime or by the date in the directory name?
Thanks a lot guys.
If your FTP server supports the MLSD command, the solution is easy:
If you want to base the decision on a modification timestamp:
entries = list(ftp.mlsd())
# Only interested in directories
entries = [entry for entry in entries if entry[1]["type"] == "dir"]
# Sort by timestamp
entries.sort(key = lambda entry: entry[1]['modify'], reverse = True)
# Pick the first one
latest_name = entries[0][0]
print(latest_name)
If you want to use a file name:
# Sort by filename
entries.sort(key = lambda entry: entry[0], reverse = True)
If you need to rely on the obsolete LIST command, you have to parse the proprietary listing it returns.
A common *nix listing is like:
drw-r--r-- 1 user group 4096 Mar 26 2018 folder1-20180326
drw-r--r-- 1 user group 4096 Jun 18 11:21 folder2-20180618
-rw-r--r-- 1 user group 4467 Mar 27 2018 file-20180327.zip
-rw-r--r-- 1 user group 124529 Jun 18 15:31 file-20180618.zip
With a listing like this, this code will do:
If you want to base the decision on a modification timestamp:
from dateutil import parser  # needed for parser.parse below

lines = []
ftp.dir("", lines.append)

latest_time = None
latest_name = None
for line in lines:
    tokens = line.split(maxsplit=9)
    # Only interested in directories
    if tokens[0][0] == "d":
        time_str = tokens[5] + " " + tokens[6] + " " + tokens[7]
        time = parser.parse(time_str)
        if (latest_time is None) or (time > latest_time):
            latest_name = tokens[8]
            latest_time = time

print(latest_name)
If you want to use a file name:
lines = []
ftp.dir("", lines.append)

latest_name = None
for line in lines:
    tokens = line.split(maxsplit=9)
    # Only interested in directories
    if tokens[0][0] == "d":
        name = tokens[8]
        if (latest_name is None) or (name > latest_name):
            latest_name = name

print(latest_name)
Some FTP servers may return . and .. entries in LIST results. You may need to filter those.
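A sketch of that filtering step, assuming the entries came from ftp.mlsd() as (name, facts) tuples (the listing below is invented for illustration):

```python
# Hypothetical listing as returned by ftp.mlsd(): (name, facts) tuples
entries = [
    (".", {"type": "cdir"}),
    ("..", {"type": "pdir"}),
    ("folder1-20180326", {"type": "dir"}),
]

# Drop the "." and ".." pseudo-entries before sorting or comparing
entries = [e for e in entries if e[0] not in (".", "..")]
print([name for name, facts in entries])  # ['folder1-20180326']
```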
Partially based on: Python FTP get the most recent file by date.
If the folder does not contain any files, only subfolders, there are other easier options.
If you want to base the decision on a modification timestamp and the server supports the non-standard -t switch, you can use:
lines = ftp.nlst("-t")
latest_name = lines[-1]
See How to get files in FTP folder sorted by modification time
If you want to use a file name:
lines = ftp.nlst()
latest_name = max(lines)

How to parse multiple line catalina log in python - regex

I have catalina log:
oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated
WARNING: HTTP Session created without LoggedInSessionBean
oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException
SEVERE: Mapped exception to response: 500 (Internal Server Error)
javax.ws.rs.WebApplicationException
at ais.api.rest.rdss.Resource.lookAT(Resource.java:22)
at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
I am trying to parse it in Python. My problem is that I don't know how many lines each log entry has; the minimum is two lines. I read lines from the file, and when a line starts with j, m, s, o, etc., it means it is the first line of an entry, because these are the first letters of the month names. But I don't know how to continue. When do I stop reading lines? When the next line starts with one of these letters? But how do I do that?
import datetime
import re
import MySQLdb  # import was missing from the original snippet

SPACE = r'\s'
TIME = r'(?P<time>.*?M)'
PATH = r'(?P<path>.*?\S)'
METHOD = r'(?P<method>.*?\S)'
REQUEST = r'(?P<request>.*)'
TYPE = r'(?P<type>.*?\:)'
REGEX = TIME + SPACE + PATH + SPACE + METHOD + SPACE + TYPE + SPACE + REQUEST

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('time'),
            match.group('path'),
            match.group('method'),
            match.group('type'),
            match.group('request'))

db = MySQLdb.connect(host="localhost", user="myuser", passwd="mypsswd", db="Database")
with db:
    cursor = db.cursor()
    with open("Mylog.log", "r") as f:  # "rw" is not a valid mode; read-only suffices
        for line in f:
            # Month abbreviations start with one of these letters
            if line.startswith(('j', 'f', 'm', 'a', 's', 'o', 'n', 'd')):
                logLine = line
                result = parser(logLine)
                sql = ("INSERT INTO ..... ")
                data = (result[0])
                cursor.execute(sql, data)
db.close()
The best idea I have is to read just two lines at a time, but that would discard the rest of the data. There must be a better way.
I want to read entries like this:
1.line - oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
2.line - oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at ais.api.rest.rdss.Resource.lookAT(Resource.java:22) at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl java:43)
3.line - oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
So I want to start reading when a line starts with a datetime (that is no problem). The problem is that I want to stop reading when the next line starts with a datetime.
This may be what you want.
I read lines from the log inside a generator so that I can classify them as datetime lines or continuation lines. Importantly, the generator also flags when end-of-file has been reached.
In the main loop of the program I start accumulating lines in a list whenever I get a datetime line. Each time I see a datetime line I print the previously accumulated entry, if there is one. Since a complete entry will still be accumulated when end-of-file occurs, the EOF marker makes the loop print that last entry too.
import re

a_date, other, EOF = 0, 1, 2

def One_line():
    with open('caroline.txt') as caroline:
        for line in caroline:
            line = line.strip()
            m = re.match(r'[a-z]{3}\s+[0-9]{1,2},\s+[0-9]{4}\s+[0-9]{1,2}:[0-9]{2}:[0-9]{2}\s+[AP]M', line, re.I)
            if m:
                yield a_date, line
            else:
                yield other, line
    yield EOF, ''

complete_line = []
for kind, content in One_line():
    if kind in [a_date, EOF]:
        if complete_line:
            print(' '.join(complete_line))
        complete_line = [content]
    else:
        complete_line.append(content)
Output:
oct 21, 2016 12:32:13 AM org.wso2.carbon.identity.sso.agent.saml.SSOAgentHttpSessionListener sessionCreated WARNING: HTTP Session created without LoggedInSessionBean
oct 21, 2016 3:03:20 AM com.sun.jersey.spi.container.ContainerResponse logException SEVERE: Mapped exception to response: 500 (Internal Server Error) javax.ws.rs.WebApplicationException at ais.api.rest.rdss.Resource.lookAT(Resource.java:22) at sun.reflect.GeneratedMethodAccessor3019.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

Python: converting a tab-delimited file into csv

I basically want to convert the tab-delimited text file http://www.linux-usb.org/usb.ids into a CSV file.
I tried importing it using Excel, but the result is not optimal; it turns out like:
8087 Intel Corp.
0020 Integrated Rate Matching Hub
0024 Integrated Rate Matching Hub
What I want, for easy searching, is:
8087 Intel Corp. 0020 Integrated Rate Matching Hub
8087 Intel Corp. 0024 Integrated Rate Matching Hub
Is there any way I can do this in Python?
$ListDirectory = "C:\USB_List.csv"
Invoke-WebRequest 'http://www.linux-usb.org/usb.ids' -OutFile $ListDirectory
$pageContents = Get-Content $ListDirectory | Select-Object -Skip 22
"vendor`tvendor_name`tproduct`tproduct_name`r" > $ListDirectory

#Variables and Flags
$currentVid
$currentVName
$currentPid
$currentPName
$vendorDone = $TRUE
$interfaceFlag = $FALSE
$nextline
$tab = "`t"

foreach($line in $pageContents){
    if($line.StartsWith("`#")){
        continue
    }
    elseif($line.length -eq 0){
        exit
    }
    if(!($line.StartsWith($tab)) -and ($vendorDone -eq $TRUE)){
        $vendorDone = $FALSE
    }
    if(!($line.StartsWith($tab)) -and ($vendorDone -eq $FALSE)){
        $pos = $line.IndexOf(" ")
        $currentVid = $line.Substring(0, $pos)
        $currentVName = $line.Substring($pos+2)
        "$currentVid`t$currentVName`t`t`r" >> $ListDirectory
        $vendorDone = $TRUE
    }
    elseif ($line.StartsWith($tab)){
        if ($interfaceFlag -eq $TRUE){
            $interfaceFlag = $FALSE
        }
        $nextline = $line.TrimStart()
        if ($nextline.StartsWith($tab)){
            $interfaceFlag = $TRUE
        }
        if ($interfaceFlag -eq $FALSE){
            $pos = $nextline.IndexOf(" ")
            $currentPid = $nextline.Substring(0, $pos)
            $currentPName = $nextline.Substring($pos+2)
            "$currentVid`t$currentVName`t$currentPid`t$currentPName`r" >> $ListDirectory
            Write-Host "$currentVid`t$currentVName`t$currentPid`t$currentPName`r"
            $interfaceFlag = $FALSE
        }
    }
}
I know the ask is for Python, but I built this PowerShell script to do the job. It takes no parameters; just run it as admin from the directory where you want to store the output. The script collects everything from the http://www.linux-usb.org/usb.ids page, parses the data and writes it to a tab-delimited file. You can then open the file in Excel as a tab-delimited file. Ensure the columns are read as "text" and not "general" and you're good to go. :)
Parsing this page is tricky because the script has to be contextually aware of every VID-Vendor line preceding a series of PID-Product lines. I also made the script ignore the commented description section, the interface-interface_name lines, the random comments inserted throughout the USB list (sigh), and everything after and including "# List of known device classes, subclasses and protocols", which is out of scope for this request.
I hope this helps!
You just need to write a small program that scans the data one line at a time. It should check whether the first character is a tab ('\t'). If not, that value should be stored. If the line does start with a tab, print the previously stored value followed by the current line. The result will be the list in the format you want.
Something like this would work:
import csv

lines = []
with open("usb.ids.txt") as f:
    reader = csv.reader(f, delimiter="\t")
    device = ""
    for line in reader:
        # Ignore empty lines and comments
        if len(line) == 0 or (len(line[0]) > 0 and line[0][0] == "#"):
            continue
        if line[0] != "":
            device = line[0]
        elif line[1] != "":
            lines.append((device, line[1]))
print(lines)
You basically need to loop through each line and, if it's a device line, remember it for the following lines. This only handles two levels of nesting, and you would then need to write the pairs to a CSV file, but that's easy enough.
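The final write step could look like this; a sketch assuming lines holds (vendor, product) tuples as built by the snippet above, with an invented output filename usb_flat.csv:

```python
import csv

# Hypothetical result of the parsing step above
lines = [("8087  Intel Corp.", "0020  Integrated Rate Matching Hub"),
         ("8087  Intel Corp.", "0024  Integrated Rate Matching Hub")]

# newline="" prevents blank rows on Windows when using the csv module
with open("usb_flat.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["vendor", "product"])  # header row
    writer.writerows(lines)                 # one row per (vendor, product) pair
```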
