How to modify this script to search for multiple keywords? - python

I am trying to modify a script, and it is proving difficult for me, so I came here for help. This script is supposed to extract data from some .out files and then write it to a .txt file. The problem is that I have two different keywords to look for. Below I provide the script, the parts I am not able to modify, and two examples of input files.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#~ Data analysis
import glob, subprocess, shutil, os, math
from funciones import *

for namefile in glob.glob("*.mol2"):
    lstmol2 = []
    lstG = []
    os.chdir("some_directory")
    searchprocess = "grep -i -H 'CURRENT VALUE OF HEAT OF FORMATION =' *.out | sort -k 4 > firstfile.txt"
    #~ I also need to look for 'CURRENT BEST VALUE OF HEAT OF FORMATION ='
    os.system(searchprocess)
    fileout = open("results.txt", "w")
    filein = open("firstfile.txt", "r")
    #~ write data in results.txt
    fileout.write('\t %s \n' % (" HOF"))
    for line in filein:
        linediv = line.split()
        HOF = float(linediv[8])
        #~ or [10] (for the keyword I need to add), but in both cases I need the float. I need the data for both keywords included in this file.
        lstG.append(HOF)
    fileout.close()
    filein.close()
Input data, type 1:
foofoofooofoofoofoofoofoo
foofoofooofoofoofoofoofoov
foofoofooofoofoofoofoofoo
CURRENT VALUE OF HEAT OF FORMATION = 1928
foofoofooofoofoofoofoofoo
foofoofooofoofoofoofoofoov
Input data, type 2:
foofoofooofoofoofoofoofoo
foofoofooofoofoofoofoofoov
foofoofooofoofoofoofoofoo
CURRENT BEST VALUE OF HEAT OF FORMATION = 1930
foofoofooofoofoofoofoofoo
foofoofooofoofoofoofoofoov

You should update your grep command to look for the optional word with the ? operator. Use the -E flag to enable extended regular expressions so you don't have to escape the regex operators, and always use single quotes around your pattern:
searchprocess="grep -E -i -H 'CURRENT( BEST)? VALUE OF HEAT OF FORMATION =' *.out | sort -k 4 > firstfile.txt"
@PrestonHager is correct that you should change linediv[8] to linediv[-1]: in the cases where BEST is present, the number will be at linediv[9], but in both cases linediv[-1] will give you the desired result.
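Putting both changes together, a minimal sketch of the parsing loop (the with-statement and writing each value to results.txt are my additions; adapt it to the rest of your script as needed):

import os

searchprocess = "grep -E -i -H 'CURRENT( BEST)? VALUE OF HEAT OF FORMATION =' *.out | sort -k 4 > firstfile.txt"
os.system(searchprocess)

lstG = []
with open("firstfile.txt") as filein, open("results.txt", "w") as fileout:
    fileout.write('\t %s \n' % (" HOF"))
    for line in filein:
        linediv = line.split()
        HOF = float(linediv[-1])  # the value is always the last field, BEST or not
        lstG.append(HOF)
        fileout.write('\t %s \n' % HOF)

Since both keyword variants end with '= <number>', indexing from the end sidesteps the column-shift problem entirely.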


how to pipe stdin directly into python and parse like grep?

I'm trying to perform a sed/awk style regex substitution with python3's re module.
You can see it works fine here with a hardcoded test string:
#!/usr/bin/env python3
import re
regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
line = ("21:21:54.165651 stat64 this/ 0.000012 THiNG1.12471\n"
"21:21:54.165652 stat64 /that 0.000012 2thIng.12472\n"
"21:21:54.165653 stat64 /and/the other thing.xml 0.000012 With S paces.12473\n"
"21:21:54.165654 stat64 /and/the_other_thing.xml 0.000012 without_em_.12474\n"
"21:59:57.774616 fstat64 F=90 0.000002 tmux.4129\n")
result = re.sub(regex, subst, line, 0, re.MULTILINE)
if result:
    print(result)
But I'm having some trouble getting it to work the same way with stdin:
#!/usr/bin/env python3
import sys, re
regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
for line in str(sys.stdin):
    #sys.stdout.write(line)
    result = re.sub(regex, subst, line, 0, re.MULTILINE)
    if result:
        print(result, end='')
I'd like to be able to pipe input straight into it from another utility, like is common with grep and similar CLI utilities.
Any idea what the issue is here?
Addendum
I tried to keep the question simple and generalized in the hope that answers might be useful in similar but different situations, and useful to more people. However, the details might shed some more light on the problem, so here are the exact details of my current scenario:
The desired input to my script is actually the output stream from a utility called fs_usage. It's similar to utilities like ps, but provides a constant stream of system calls and filesystem operations. It tells you which files are being read from, written to, etc. in real time.
From the manual:
NAME
fs_usage -- report system calls and page faults related to filesystem activity in real-time
DESCRIPTION
The fs_usage utility presents an ongoing display of system call usage information pertaining to filesystem activity. It requires root privileges due to the kernel tracing facility it uses to operate.
By default, the activity monitored includes all system processes except for:
fs_usage, Terminal.app, telnetd, telnet, sshd, rlogind, tcsh, csh, sh, zsh. These defaults can be overridden such that output is limited to include or exclude (-e) a list of processes specified by the user.
The output presented by fs_usage is formatted according to the size of your window.
A narrow window will display fewer columns. Use a wide window for maximum data display.
You may override the formatting restrictions by forcing a wide display with the -w option.
In this case, the data displayed will wrap when the window is not wide enough.
I hacked together a crude little bash script to rip the process names from the stream and dump them to a temporary log file. You can think of it as a filter or an extractor. Here it is as a function that will dump straight to stdout (remove the comment on the last line to dump to a file instead).
proc_enum ()
{
    while true; do
        sudo fs_usage -w -e 'grep' 'awk' |
            grep -E -o '(?:\d\.\d{6})\s{3}\S+\.\d+' |
            awk '{print $2}' |
            awk -F '.' '{print $1}' \
            #>/tmp/proc_names.logx
    done
}
Useful Links
Regular Expressions 101
Stack Overflow - How to pipe input to python line by line from linux program?
The problem is str(sys.stdin). What Python will do in the for loop is this:

i = iter(str(sys.stdin))
# then in every iteration
next(i)

Here you are converting the file object to str; the result on my computer is:

str(sys.stdin) == "<_io.TextIOWrapper name='<stdin>' mode='r' encoding='cp1256'>"

So you are not looping over the lines received on stdin, you are looping over the characters of the string representation of the file object.

There is another problem: in the first example you apply re.sub to the entire text, but here you apply it to each line separately, so you should either concatenate the result of each line or concatenate the lines into a single text before applying re.sub.
import sys, re

regex = r"^(?P<time>\d+\:\d+\:\d+\.\d+)(?:\s+)(?P<func>\S+)(?:\s+)(?P<path>\S+(?: +\S+)*?)(?:\s+)(?P<span>\d+\.\d+)(?:\s+)(?P<npid>(?P<name>\S+(?: +\S+)*?)\.(?P<pid>\d+))\n"
subst = "'\\g<name>', "
result = ''
for line in sys.stdin:
    # here you should convert the input, but I think it's optional
    line = str(line)
    result += re.sub(regex, subst, line, 0, re.MULTILINE)
if result:
    print(result, end='')

File parsing using Unix Shell Scripting

I am trying to do some transformation and I'm stuck. Here goes the problem description.
Below is the pipe delimited file. I have masked data!
AccountancyNumber|AccountancyNumberExtra|Amount|ApprovedBy|BranchCurrency|BranchGuid|BranchId|BranchName|CalculatedCurrency|CalculatedCurrencyAmount|CalculatedCurrencyVatAmount|ControllerBy|Country|Currency|CustomFieldEnabled|CustomFieldGuid|CustomFieldName|CustomFieldRequired|CustomFieldValue|Date|DateApproved|DateControlled|Email|EnterpriseNumber|ExpenseAccountGuid|ExpenseAccountName|ExpenseAccountStatus|ExpenseGuid|ExpenseReason|ExpenseStatus|ExternalId|GroupGuid|GroupId|GroupName|IBAN|Image|IsInvoice|MatchStatus|Merchant|MerchantEnterpriseNumber|Note|OwnerShip|PaymentMethod|PaymentMethodGuid|PaymentMethodName|ProjectGuid|ProjectId|ProjectName|Reimbursable|TravellerId|UserGUID|VatAmount|VatPercentage|XpdReference|VatCode|FileName|CreateTstamp
61470003||30.00|null|EUR|168fcea9-17d4-45a1-8b6f-bfb249cdbea6|BEL|BEL|USD,INR,EUR|35.20,2420.11,30.00|null,null,null|null|BE|EUR|true|0d4b767b-0988-47e8-9144-05e607169284|careertitle|false|FE|2018-07-24T00:00:00|null|null|abc_def#xyz.com||c32f03c6-31df-4fd8-8cc2-1c5f3a580aad|Meals - In Office|true|781d10d2-2f3b-43bc-866e-a653fefacbbe||Approved|70926|40ac7117-c7e2-42ea-b34f-96330c9380b6|BEL-FSP-Users|BEL-FSP-Users|||false|None|in office meal #1|||Personal|Cash|1ee44666-f4c7-44b3-acd3-8ecd7127480a|Cash|2cb4ccb7-634d-4386-af43-b4572ec72098|00AA06|00AA06|true||6c5a835f-5152-46db-923a-3ebd08c7dad3|null|null|XPD012245802||1820711.xml|2018-08-07 05:42:10.46
In this file, we have a CalculatedCurrency field with multiple values delimited by commas. The file also has a CalculatedCurrencyAmount field, which likewise has multiple comma-delimited values. But I need to pick up only the currency value from the CalculatedCurrency field that matches BranchCurrency (another field in the file), and of course the corresponding CalculatedCurrencyAmount for that currency.
Required output : -
AccountancyNumber|AccountancyNumberExtra|Amount|ApprovedBy|BranchCurrency|BranchGuid|BranchId|BranchName|CalculatedCurrency|CalculatedCurrencyAmount|CalculatedCurrencyVatAmount|ControllerBy|Country|Currency|CustomFieldEnabled|CustomFieldGuid|CustomFieldName|CustomFieldRequired|CustomFieldValue|Date|DateApproved|DateControlled|Email|EnterpriseNumber|ExpenseAccountGuid|ExpenseAccountName|ExpenseAccountStatus|ExpenseGuid|ExpenseReason|ExpenseStatus|ExternalId|GroupGuid|GroupId|GroupName|IBAN|Image|IsInvoice|MatchStatus|Merchant|MerchantEnterpriseNumber|Note|OwnerShip|PaymentMethod|PaymentMethodGuid|PaymentMethodName|ProjectGuid|ProjectId|ProjectName|Reimbursable|TravellerId|UserGUID|VatAmount|VatPercentage|XpdReference|VatCode|FileName|CreateTstamp|ActualCurrency|ActualAmount
61470003||30.00|null|EUR|168fcea9-17d4-45a1-8b6f-bfb249cdbea6|BEL|BEL|USD,INR,EUR|35.20,2420.11,30.00|null,null,null|null|BE|EUR|true|0d4b767b-0988-47e8-9144-05e607169284|careertitle|false|FE|2018-07-24T00:00:00|null|null|abc_def#xyz.com||c32f03c6-31df-4fd8-8cc2-1c5f3a580aad|Meals - In Office|true|781d10d2-2f3b-43bc-866e-a653fefacbbe||Approved|70926|40ac7117-c7e2-42ea-b34f-96330c9380b6|BEL-FSP-Users|BEL-FSP-Users|||false|None|in office meal #1|||Personal|Cash|1ee44666-f4c7-44b3-acd3-8ecd7127480a|Cash|2cb4ccb7-634d-4386-af43-b4572ec72098|00AA06|00AA06|true||6c5a835f-5152-46db-923a-3ebd08c7dad3|null|null|XPD012245802||1820711.xml|2018-08-07 05:42:10.46|EUR|30.00
Please help.
Snaplogic Python Script
from com.snaplogic.scripting.language import ScriptHook
from com.snaplogic.scripting.language.ScriptHook import *
import csv

class TransformScript(ScriptHook):
    def __init__(self, input, output, error, log):
        self.input = input
        self.output = output
        self.error = error
        self.log = log

    def execute(self):
        self.log.info("Executing Transform script")
        while self.input.hasNext():
            data = self.input.next()
            branch_currency = data['BranchCurrency']
            calc_currency = data['CalculatedCurrency'].split(',')
            calc_currency_amount = data['CalculatedCurrencyAmount'].split(',')
            result = None
            result1 = None
            for i, name in enumerate(calc_currency):
                result = calc_currency_amount[i] if name == branch_currency else result
                result1 = calc_currency[i] if name == branch_currency else result1
            data["CalculatedCurrencyAmount"] = result
            data["CalculatedCurrency"] = result1
            try:
                data["mathTryCatch"] = data["counter2"].longValue() + 33
                self.output.write(data)
            except Exception as e:
                data["errorMessage"] = e.message
                self.error.write(data)
        self.log.info("Finished executing the Transform script")

hook = TransformScript(input, output, error, log)
Using bash with some arrays:

arr_find() {
    echo $(( $(printf "%s\0" "${@:2}" | grep -Fnxz "$1" | cut -d: -f1) - 1 ))
}

IFS='|' read -r -a headers

while IFS='|' read -r "${headers[@]}"; do
    IFS=',' read -r -a CalculatedCurrency <<<"$CalculatedCurrency"
    IFS=',' read -r -a CalculatedCurrencyAmount <<<"$CalculatedCurrencyAmount"
    idx=$(arr_find "$BranchCurrency" "${CalculatedCurrency[@]}")
    echo "BranchCurrency is $BranchCurrency. Hence CalculatedCurrency will be ${CalculatedCurrency[$idx]} and CalculatedCurrencyAmount will have to be ${CalculatedCurrencyAmount[$idx]}."
done
First I read all the header names. Then I read all the values into variables named after those headers. Then I re-read the CalculatedCurrency* fields, because they are separated by ','. Then I find the index of the element inside CalculatedCurrency that equals BranchCurrency. Having that index and the arrays, I can just print the output.
I know the OP asked for unix shell, but as an alternative I'll show some code that does it in Python. (Obviously this code can be heavily improved as well.) The great advantage is readability: you can address your data by name, and the code as a whole is much more readable than doing this with awk et al.
Save your data in data.psv and write the following script into a file main.py. I've tested it using python3 and python2; both work. Run the script using python main.py.
Update: I've extended the script to parse all lines. In the example data, I've set BranchCurrency to EUR in the first line and USD in the second line as a dummy test.
File: main.py
import csv

def parse_line(row):
    branch_currency = row['BranchCurrency']
    calc_currency = row['CalculatedCurrency'].split(',')
    calc_currency_amount = row['CalculatedCurrencyAmount'].split(',')
    result = None
    for i, name in enumerate(calc_currency):
        result = calc_currency_amount[i] if name == branch_currency else result
    return result

def main():
    with open('data.psv') as f:
        reader = csv.DictReader(f, delimiter='|')
        for row in reader:
            print(parse_line(row))

if __name__ == '__main__':
    main()
Example Data:
[:~] $ cat data.psv
AccountancyNumber|AccountancyNumberExtra|Amount|ApprovedBy|BranchCurrency|BranchGuid|BranchId|BranchName|CalculatedCurrency|CalculatedCurrencyAmount|CalculatedCurrencyVatAmount|ControllerBy|Country|Currency|CustomFieldEnabled|CustomFieldGuid|CustomFieldName|CustomFieldRequired|CustomFieldValue|Date|DateApproved|DateControlled|Email|EnterpriseNumber|ExpenseAccountGuid|ExpenseAccountName|ExpenseAccountStatus|ExpenseGuid|ExpenseReason|ExpenseStatus|ExternalId|GroupGuid|GroupId|GroupName|IBAN|Image|IsInvoice|MatchStatus|Merchant|MerchantEnterpriseNumber|Note|OwnerShip|PaymentMethod|PaymentMethodGuid|PaymentMethodName|ProjectGuid|ProjectId|ProjectName|Reimbursable|TravellerId|UserGUID|VatAmount|VatPercentage|XpdReference|VatCode|FileName|CreateTstamp
61470003||35.00|null|EUR|168fcea9-17d4-45a1-8b6f-bfb249cdbea6|BEL|BEL|USD,INR,EUR|35.20,2420.11,30.00|null,null,null|null|BE|EUR|true|0d4b767b-0988-47e8-9144-05e607169284|careertitle|false|FE|2018-07-24T00:00:00|null|null|abc_def#xyz.com||c32f03c6-31df-4fd8-8cc2-1c5f3a580aad|Meals - In Office|true|781d10d2-2f3b-43bc-866e-a653fefacbbe||Approved|70926|40ac7117-c7e2-42ea-b34f-96330c9380b6|BEL-FSP-Users|BEL-FSP-Users|||false|None|in office meal #1|||Personal|Cash|1ee44666-f4c7-44b3-acd3-8ecd7127480a|Cash|2cb4ccb7-634d-4386-af43-b4572ec72098|00AA06|00AA06|true||6c5a835f-5152-46db-923a-3ebd08c7dad3|null|null|XPD012245802||1820711.xml|2018-08-07 05:42:10.46
61470003||35.00|null|USD|168fcea9-17d4-45a1-8b6f-bfb249cdbea6|BEL|BEL|USD,INR,EUR|35.20,2420.11,30.00|null,null,null|null|BE|EUR|true|0d4b767b-0988-47e8-9144-05e607169284|careertitle|false|FE|2018-07-24T00:00:00|null|null|abc_def#xyz.com||c32f03c6-31df-4fd8-8cc2-1c5f3a580aad|Meals - In Office|true|781d10d2-2f3b-43bc-866e-a653fefacbbe||Approved|70926|40ac7117-c7e2-42ea-b34f-96330c9380b6|BEL-FSP-Users|BEL-FSP-Users|||false|None|in office meal #1|||Personal|Cash|1ee44666-f4c7-44b3-acd3-8ecd7127480a|Cash|2cb4ccb7-634d-4386-af43-b4572ec72098|00AA06|00AA06|true||6c5a835f-5152-46db-923a-3ebd08c7dad3|null|null|XPD012245802||1820711.xml|2018-08-07 05:42:10.46
Example Run:
[:~] $ python main.py
30.00
35.20
Using awk:
awk 'BEGIN{FS=OFS="|"}
NR==1{print $0,"ActualCurrency","ActualAmount";next}
{n=split($9,a,",");split($10,b,",");for(i=1;i<=n;i++) if(a[i]==$5) print $0,$5,b[i]}' file
BEGIN{FS=OFS="|"} sets the input and output delimiters to |.
The NR==1 statement takes care of the header by appending the two new column names.
The 9th and 10th fields are split on the , separator, and the values are stored in the arrays a and b.
The for loop looks for the element of the a array that corresponds to the 5th field; if found, the corresponding value of b is printed.
You don't need to use the script snap here at all. Writing scripts for transformations all the time hampers the performance and defeats the purpose of an IPaaS tool altogether. The mapper should suffice.
I created the following test pipeline for this problem.
I saved the data provided in this question in a file and uploaded it to SnapLogic for the test. In the pipeline, I parsed it using a CSV parser.
Following is the parsed result.
Then I used a mapper for doing the required transformation.
Following is the expression for getting the actual amount.
$CalculatedCurrency.split(',').indexOf($BranchCurrency) >= 0 ? $CalculatedCurrencyAmount.split(',')[$CalculatedCurrency.split(',').indexOf($BranchCurrency)] : null
Following is the result.
Avoid writing scripts for problems that can be solved using mappers.

Read a python variable in a shell script?

my python file has these 2 variables:
week_date = "01/03/16-01/09/16"
cust_id = "12345"
How can I read these into a shell script that takes in these 2 variables?
My current shell script requires manual editing of "dt" and "id". I want to read the Python variables into the shell script so I can just edit my Python parameter file and not so many files.
shell file:
#!/bin/sh
dt="01/03/16-01/09/16"
cust_id="12345"
In a new python file i could just import the parameter python file.
Consider something akin to the following:
#!/bin/bash
# ^^^^ NOT /bin/sh, which doesn't have process substitution available.
python_script='
import sys
d = {}                                       # create a context for variables
exec(open(sys.argv[1], "r").read()) in d     # execute the Python code in that context
for k in sys.argv[2:]:
    print "%s\0" % str(d[k]).split("\0")[0]  # ...and extract your strings NUL-delimited
'

read_python_vars() {
    local python_file=$1; shift
    local varname
    for varname; do
        IFS= read -r -d '' "${varname#*:}"
    done < <(python -c "$python_script" "$python_file" "${@%%:*}")
}
You might then use this as:
read_python_vars config.py week_date:dt cust_id:id
echo "Customer id is $id; date range is $dt"
...or, if you didn't want to rename the variables as they were read, simply:
read_python_vars config.py week_date cust_id
echo "Customer id is $cust_id; date range is $week_date"
Advantages:
Unlike a naive regex-based solution (which would have trouble with some of the details of Python parsing -- try teaching sed to handle both raw and regular strings, and both single and triple quotes without making it into a hairball!) or a similar approach that used newline-delimited output from the Python subprocess, this will correctly handle any object for which str() gives a representation with no NUL characters that your shell script can use.
Running content through the Python interpreter also means you can determine values programmatically -- for instance, you could have some Python code that asks your version control system for the last-change-date of relevant content.
Think about scenarios such as this one:
start_date = '01/03/16'
end_date = '01/09/16'
week_date = '%s-%s' % (start_date, end_date)
...using a Python interpreter to parse Python means you aren't restricting how people can update/modify your Python config file in the future.
Now, let's talk caveats:
If your Python code has side effects, those side effects will obviously take effect (just as they would if you chose to import the file as a module in Python). Don't use this to extract configuration from a file whose contents you don't trust.
Python strings are Pascal-style: They can contain literal NULs. Strings in shell languages are C-style: They're terminated by the first NUL character. Thus, some variables can exist in Python than cannot be represented in shell without nonliteral escaping. To prevent an object whose str() representation contains NULs from spilling forward into other assignments, this code terminates strings at their first NUL.
Now, let's talk about implementation details.
${@%%:*} is an expansion of $@ which trims all content after and including the first : in each argument, thus passing only the Python variable names to the interpreter. Similarly, ${varname#*:} is an expansion which trims everything up to and including the first : from the variable name passed to read. See the bash-hackers page on parameter expansion.
Using <(python ...) is process substitution syntax: The <(...) expression evaluates to a filename which, when read, will provide output of that command. Using < <(...) redirects output from that file, and thus that command (the first < is a redirection, whereas the second is part of the <( token that starts a process substitution). Using this form to get output into a while read loop avoids the bug mentioned in BashFAQ #24 ("I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?").
The IFS= read -r -d '' construct has a series of components, each of which makes the behavior of read more true to the original content:
Clearing IFS for the duration of the command prevents whitespace from being trimmed from the end of the variable's content.
Using -r prevents literal backslashes from being consumed by read itself rather than represented in the output.
Using -d '' sets the first character of the empty string '' to be the record delimiter. Since C strings are NUL-terminated and the shell uses C strings, that character is a NUL. This ensures that variables' content can contain any non-NUL value, including literal newlines.
See BashFAQ #001 ("How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?") for more on the process of reading record-oriented data from a string in bash.
Other answers give a way to do exactly what you ask for, but I think the idea is a bit crazy. There's a simpler way to satisfy both scripts - move those variables into a config file. You can even preserve the simple assignment format.
Create the config itself: (ini-style)
dt="01/03/16-01/09/16"
cust_id="12345"
In python:
config_vars = {}
with open('the/file/path', 'r') as f:
    for line in f:
        if '=' in line:
            k, v = line.split('=', 1)
            config_vars[k] = v.strip().strip('"')  # drop the newline and the surrounding quotes

week_date = config_vars['dt']
cust_id = config_vars['cust_id']
In bash:
source "the/file/path"
And you don't need to do crazy source parsing anymore. Alternatively you can just use json for the config file and then use json module in python and jq in shell for parsing.
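For instance, a minimal sketch of the JSON variant (the file name config.json and its keys are assumptions):

import json

# config.json would contain: {"week_date": "01/03/16-01/09/16", "cust_id": "12345"}
with open('config.json') as f:
    cfg = json.load(f)

week_date = cfg['week_date']
cust_id = cfg['cust_id']

On the shell side, the same values could then be pulled out with jq, e.g. dt=$(jq -r '.week_date' config.json).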
I would do something like this. You may want to modify it a little to include/exclude the quotes, as I haven't really tested it for your scenario:
#!/bin/sh
exec <$python_filename
while read line
do
    match=`echo $line|grep "week_date ="`
    if [ $? -eq 0 ]; then
        dt=`echo $line|cut -d '"' -f 2`
    fi
    match=`echo $line|grep "cust_id ="`
    if [ $? -eq 0 ]; then
        cust_id=`echo $line|cut -d '"' -f 2`
    fi
done

Creating comma-separated list of paths (allowing for spaces in path), and passing result as a variable to use again

I am trying to use output from rtcontrol (part of Pyroscope), which is used to control rtorrent from the command line. I am having issues formatting the output from one call so it can be used as input to another.
I'd like to be able to choose only the torrents that satisfy the criteria but DO NOT share a path with another torrent.
The process is as follows
PATHS=$(rtcontrol ratio=+2 completed=+5d -qopath)
echo $PATHS
# Output
# /home/user/path/name1
# /home/user/path/name2
# /home/user/path/name 3
# /home/user/path/name 3
# /home/user/path/name 4
# /home/user/path/name5
# Remove duplicate paths and convert $PATHS to a comma-delimited variable PATHS.
#
# UNSURE HERE....
#
# PATHS="/home/user/path/name1","/home/user/path/name2","/home/user/path/name 3","/home/user/path/name 4","/home/user/path/name5"
#Pass PATHS to rtcontrol again to get torrents in one of the paths.
PATHS_2=$(rtcontrol path=$PATHS -qopath)
echo $PATHS_2
#output
#/home/user/path/name1
#/home/user/path/name1
#/home/user/path/name2
#/home/user/path/name 3
#/home/user/path/name 3
#/home/user/path/name 4
#/home/user/path/name5
# Remove duplicates and convert $PATHS_2 to comma delimited variable.
#
# UNSURE HERE....
#
# PATHS_2="/home/user/path/name2","/home/user/path/name 3","/home/user/path/name 4","/home/user/path/name5"
#Pass to rtcontrol to perform action
rtcontrol path=$PATHS_2 --cull
The reason for this is that it is possible for a torrent to NOT satisfy the conditions ratio=+2 completed=+5d but still share a path with one that DOES. That is the reason for the second call, rtcontrol path=$PATHS -qopath.
I have tried different combinations of uniq, sed and awk, as well as using pipes to pass along output. It should be noted that rtcontrol output can be piped, e.g. rtcontrol name="*Test*" -qoname | uniq -u.
This can be handled in pure bash or with Python. There are also Python libraries to interface with the torrent program that can perform similar functions, and even more advanced things that I am investigating.
#!/bin/python
import sys

paths = {}
for line in sys.stdin:
    path = line.strip()
    paths[path] = path

print(','.join(sorted(paths.keys())))
sys.exit(0)
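Since the desired PATHS value in the question wraps each path in double quotes, a small variant that quotes each unique path (a sketch; the script name dedupe_paths.py is hypothetical):

#!/bin/python
import sys

# Collect unique paths from stdin
paths = set()
for line in sys.stdin:
    path = line.strip()
    if path:
        paths.add(path)

# Wrap each path in quotes so embedded spaces survive, then join with commas
print(','.join('"%s"' % p for p in sorted(paths)))

Used in the pipeline from the question, that would look like PATHS=$(rtcontrol ratio=+2 completed=+5d -qopath | python dedupe_paths.py).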

Script to compare a string in two different files

I am brand new to Stack Overflow and to scripting. I was looking for help getting started on a script, not necessarily looking for someone to write it for me.
Here's what I have:
File1.csv - contains some information, I am only interested in MAC addresses.
File2.csv - has some different information, but also contains MAC address.
I need a script that parses the MAC addresses from file1.csv and logs a report if any MAC address shows up in file2.csv.
The questions:
Any tips on the language I use, preferably perl, python or bash?
Can anyone suggest some structure for the logic needed (even if just in pseudo-code)?
update
Using @Adam Wagner's approach, I am really close!
import csv

# Need to strip out NUL values from the .csv file to make python happy
class FilteredFile(file):
    def next(self):
        return file.next(self).replace('\x00', '').replace('\xff\xfe', '')

reader = csv.reader(FilteredFile('wifi_clients.csv', 'rb'), delimiter=',', quotechar='|')
s1 = set(rec[0] for rec in reader)

inventory = csv.reader(FilteredFile('inventory.csv', 'rb'), delimiter=',')
s2 = set(rec[6] for rec in inventory)

shared_items = s1.intersection(s2)
print shared_items
This always outputs the following (even if I doctor the .csv files to have matching MAC addresses):
set([])
Contents of the csv files
wifi_clients.csv
macNames, First time seen, Last time seen,Power, # packets, BSSID, Probed ESSIDs
inventory.csv
Name,Manufacturer,Device Type,Model,Serial Number,IP Address,MAC Address,...
Here's the approach I'd take:
Iterate over each csv file (python has a handy csv module for accomplishing this), capturing the mac-address and placing it in a set (one per file). Once again, python has a great builtin set type. Here's a good example of using the csv module, and of course the docs.
Next, you can get the intersection of set1 (file1) and set2 (file2). This will show you mac-addresses that exist in both files one and two.
Example (in python):
s1 = set([1,2,3]) # You can add things incrementally with "s1.add(value)"
s2 = set([2,3,4])
shared_items = s1.intersection(s2)
print shared_items
Which outputs:
set([2, 3])
Logging these shared items could be done with anything from printing (then redirecting output to a file), to using the logging module, to saving directly to a file.
I'm not sure how in-depth of an answer you were looking for, but this should get you started.
Update: CSV/Set usage example
Assuming you have a file "foo.csv", that looks something like this:
bob,123,127.0.0.1,mac-address-1
fred,124,127.0.0.1,mac-address-2
The simplest way to build the set, would be something like this:
import csv

set1 = set()
for record in csv.reader(open('foo.csv', 'rb')):
    user, machine_id, ip_address, mac_address = record
    set1.add(mac_address)
    # or simply "set1.add(record[3])", if you don't need the other fields.
Obviously, you'd need something like this for each file, so you may want to put this in a function to make life easier.
Finally, if you want to go the less-verbose-but-cooler-python-way, you could also build the set like this:
csvfile = csv.reader(open('foo.csv', 'rb'))
set1 = set(rec[3] for rec in csvfile) # Assuming mac-address is the 4th column.
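Putting the pieces together for your two files, a minimal sketch (assuming, per your update, that the mac address is column 0 of wifi_clients.csv and column 6 of inventory.csv; the .strip() calls are my addition, since stray whitespace is a common reason an intersection comes back empty):

import csv

# Build one set of mac addresses per file, skipping short/empty rows
s1 = set(rec[0].strip() for rec in csv.reader(open('wifi_clients.csv')) if rec)
s2 = set(rec[6].strip() for rec in csv.reader(open('inventory.csv')) if len(rec) > 6)

# Log every mac address that appears in both files
with open('shared_macs.log', 'w') as log:
    for mac in sorted(s1 & s2):
        log.write(mac + '\n')

If this still prints nothing, compare one raw row from each reader by eye; differing case or formatting (colons vs dashes) in the mac addresses is the usual culprit.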
I strongly recommend python to do this.
Since you didn't give the structure of the csv files, I can only show a framework:
def get_MAC_from_file1():
    # ... parse the file to get the MACs
    return a_MAC_list

def get_MAC_from_file2():
    # ... parse the file to get the MACs
    return a_MAC_list

def log_MACs():
    MAC_list1, MAC_list2 = get_MAC_from_file1(), get_MAC_from_file2()
    for a_MAC in MAC_list1:
        if a_MAC in MAC_list2:
            # ... write your logs
            pass
If the data set is large, use a dict or set instead of a list and use the intersection operation. But as these are MAC addresses, I guess your dataset is not that large, so keeping the script easy to read is the most important thing.
Awk is perfect for this
{
    mac = $1                                # assuming the mac addresses are in the first column
    do_grep = "grep " mac " otherfilename"  # we'll use grep to check if the mac address is in the other file
    do_grep | getline mac_in_other_file     # pipe the output of the grep command into a new variable
    close(do_grep)                          # close the pipe
    if (mac_in_other_file != "") {          # if grep found the mac address in the other file
        print mac > "naughty_macs.log"      # append the mac address to the log file
    }
}
Then you'd run that on the first file:
awk -f logging_script.awk mac_list.txt
(this code is untested and I'm not the greatest awk hacker, but it should give the general idea)
For the purpose of the example, generate 2 files that look like yours.
File1:
for i in `seq 100`; do
    echo -e "user$i\tmachine$i\t192.168.0.$i\tmac$i";
done > file1.csv
File2 (contains random entries of "mac addresses" numbered from 1-200):
for j in `seq 100`; do
    i=$(($RANDOM % 200));
    echo -e "mac$i\tmachine$i\tuser$i";
done > file2.csv
The simplest approach would be to use the join command and join on the appropriate field. This approach has the advantage that fields from both files are available in the output.
Based on the example files above, the command would look like this:
join -1 4 -2 1 <(sort -k4 file1.csv) <(sort -k1 file2.csv)
join needs the input to be sorted by the field you are matching on; that's why the sort is there (-k tells it which column to use).
The command above matches rows from file1.csv with rows from file2.csv if column 4 in the first file is equal with column 1 from the second file.
If you only need specific fields, you can specify the output format to the join command:
join -1 4 -2 1 -o 1.4,1.2 <(sort -k4 file1.csv) <(sort -k1 file2.csv)
This would print only the mac address and the machine field from the first file.
If you only need a list of matching mac addresses, you can use uniq or sort -u. Since the join output will be sorted by mac, uniq is faster. But if you need a unique list of another field, sort -u is better.
If you only need the mac addresses that match, grep can accept patterns from a file, and you can use cut to extract only the fourth field.
fgrep -f<(cut -f4 file1.csv) file2.csv
The above would list all the lines in file2.csv that contain a mac address from file1
Note that I'm using fgrep, which doesn't do pattern matching. Also, if file1 is big, this may be slower than the first approach. And it assumes that the mac appears only in field 1 of file2 and that the other fields don't contain mac addresses.
If you only need the mac, you can use the -o option of fgrep (though some grep variants don't have it), or you can pipe the output through cut and then sort -u:
fgrep -f<(cut -f4 file1.csv) file2.csv | cut -f1 | sort -u
This would be the bash way.
Python and awk hints have been shown above, I will take a stab at perl:
#!/usr/bin/perl -w
use strict;

open F1, $ARGV[0];
my %searched_mac_addresses = map { chomp; (split /\t/)[3] => 1 } <F1>;
close F1;

open F2, $ARGV[1];
while (<F2>) {
    print if $searched_mac_addresses{(split "\t")[0]};
}
close F2;
First you create a dictionary containing all the mac addresses from the first file:
my %searched_mac_addresses = map {chomp; (split /\t/)[3] => 1 } <F1>;
<F1> reads all the lines from file1
chomp removes the end-of-line character
split splits the line on tabs; you can use a more complex regexp if needed
the () around split force an array context
[3] selects the fourth field
map runs a piece of code for all elements of the array
=> generates a dictionary (hash in perl's terminology) element instead of an array
Then you read line by line the second file, and check if the mac exists in the above dictionary:
while (<F2>) {
    print if $searched_mac_addresses{(split "\t")[0]};
}
while (<F2>) will read the file F2 line by line, putting each line in the $_ variable
print without any parameters prints the default variable $_
if can postfix an instruction
dictionary elements can be accessed via {}
split by default splits the $_ default variable
