I have an export from a DB table that has the following columns:
name|value|age|external_atributes
The external_atribute is on Json format. So the export looks like this:
George|10|30|{"label1":1,"label2":2,"label3":3,"label4":4,"label5":5,"label6":"6","label7":"7","label8":"8"}
Which is the most efficient way (since the export has more than 1m lines) to keep only the name and the values from label2, label5 and label6. For example from the above export I would like to keep only:
George|2|5|6
Edit: I am not sure for the sequence of the fields/variables on the JSON part. Data could be also for example:
George|10|30|{"label2":2,"label1":1,"label4":4,"label3":3,"label6":6,"label8":"8","label7":"7","label5":"5"}
Also the fact that some of the values are double quoted, while some are not, is intentional (This is how they appear also on the export).
My understanding until now is that I have to use something that has a JSON parser like Python or jq.
This is what i created on Python and seems that is working as expected:
from __future__ import print_function
import sys,json
with open(sys.argv[1], 'r') as file:
for line in file:
fields = line.split('|')
print (fields[0], json.loads(fields[3])['label2'], json.loads(fields[3])['label5'], json.loads(fields[3])['label6'], sep='|')
output:
George|2|5|6
Since I am looking for the most efficient way to do this, any comment is more than welcome.
Even if the data are easy to parse, I advise to use a json parser like jq to extract your json data:
<file jq -rR '
split("|")|[ .[0], (.[3]|fromjson|(.label2,.label5,.label6)|tostring)]|join("|")'
Both options -R and -r allows jq to accept and display a string as input and output (instead of json data).
The split function enable getting all fields into an array that can be indexed with number .[0] and .[3].
The third field is then parsed as json data with the function fromjson such that the wanted labels are extracted.
All wanted fields are put into an array and join together with the | delimiter.
You could split with multiple delimiters using a character class.
The following prints the desired result:
awk 'BEGIN { FS = "[|,:]";OFS="|"} {gsub(/"/,"",$15)}{print $1,$7,$13,$15}'
The above solution assumes that the input data is structured.
Since it is about record-based text edits, awk is most probably the best tool to accomplish the task. However, here it is a sed solution:
sed 's/\([^|]*\).*label2[^0-9]*\([0-9]*\).*label5[^0-9]*\([0-9]*\).*label6[^0-9]*\([0-9]*\).*/\1|\2|\3|\4/' inputFile
Related
I have a Dockerfile that contains lines like:
LABEL "com.datadoghq.ad.logs"='[{"source": "mysource", "name": "myservicename"}]'
I have a python script that I want to read grab this LABEL value(everything after "LABEL" and add that to yaml file. Here's my script:
#!/usr/bin/python3
import yaml
def _extract_dockerfile_labels():
labels = []
with open("testing/dockerfile") as dockerfile:
for line in dockerfile.readlines():
if line.startswith('LABEL'):
label_val = line.replace("LABEL ","", 1)
labels.append(label_val.rstrip()). # rstrip() gets rid of the newline
with open("output.yml", "w+") as out_file:
yaml.dump(labels, out_file, default_flow_style=False)
_extract_dockerfile_labels()
The outfile looks like this:
- '"com.datadoghq.ad.logs"=''[{"source": "mysource", "name": "myservicename"}]'''
What I need for that to look like is:
- "com.datadoghq.ad.logs"='[{"source": "mysource", "name": "myservicename"}]'
How can I make this work without adding extra quotes?
I guess the issue is that you are using yaml.dump() function for output, which outputs object as YAML document. It YAML notation strings have quotes.
For standart python output try using out_file.write(labels), closing file afterwards.
Note, that your string is not parsed as a dictionary currently, and so is interpretated as a plain string. So if you intended parsing it to object, you should do it yourself first.
Your encoding format doesn't look familiar to me, maybe there is a parsing lib for it available. To convert it to YAML format you should consider using something like this:
labels = labels.replace("='[{", ":\n- ").replace(", ", "\n- ").replace("}]'", "")
I want to convert a pcap file into the csv using the python code; but the problem is I should specify precisely what fields to be exported using the tshark library. I want to export all fields. when I dont specify the fields, blank file is exported. the sample code is presented below:
tshark -r /root to the file/test1.pcap -T fields -e ip.src > test1.csv
I want to remove the special fields to export ALL fields; then accessing the fields in python (using a library like pandas in dictionary format like df["Source"])
Any help appreciated!
I want to remove the special fields to export ALL fields;
This is not possible with the CSV (fields) format
$tshark -r trace.pcap -T fields
tshark: "-Tfields" was specified, but no fields were specified with "-e".
An alternative solution is to use one of the JSON formats (-T ek|json|jsonraw) or the XML format (-T pdml).
then accessing the fields in python (using a library like pandas in dictionary format like df["Source"])
In python you could parse the JSON using json.lodas() and get a dictionary. See https://www.w3schools.com/python/python_json.asp
I pass data from SAS to Python using CSV format. Have a problem with a quoting format SAS uses. Strings like "480 КЖИ" ОАО aren't quoted, but Python csv module thinks they're.
dat = ['18CA4,"480 КЖИ" ОАО', '1142F,"""Росдорлизинг"" Российская дор,лизинг,компания"" ОАО"']
for i in csv.reader(dat):
print(i)
>>['18CA4', '480 КЖИ ОАО']
>>['1142F', '"Росдорлизинг" Российская дор,лизинг,компания" ОАО']
The 2nd string is fine, but I need 480 КЖИ ОАО string to be "480 КЖИ" ОАО. Don't find such an option in csv module. Maybe it's possible to force proc export to quote all " chars?
UPD: Here's a similar problem Python CSV : field containing quotation mark at the beginning
UPD2: #Quentin have asked for details. Here they're: I have SAS8.2 connected to 9.1 server. I download custom format data from server side with proc format cntlout=..; proc download... So i get a dictionary-like dataset <key>, <value>. Then i pass this dataset in CSV format using proc export via DDE interface to Python. But proc export quotes only strings which include delimiter (comma) as i understand. So i think, i need SAS to quote quotation marks too or Python to unquote only those strings which include commas.
UPDATE: switching from proc export via DDE to direct reading of dataset with a modified SAS7BDAT Python module hugely improved performance. And i got rid of the problem above.
SAS will add extra quotes if the value has quotes in it already.
data _null_;
file log dsd ;
string='"480 КЖИ" ОАО';
put string;
run;
Generates this result:
"""480 КЖИ"" ОАО"
Perhaps the quotes are being removed at some other point in the flow from SAS to Python? Try saving the CSV file to a disk and having Python read from the disk file.
I have a 300 mb CSV with 3 million rows worth of city information from Geonames.org. I am trying to convert this CSV into JSON to import into MongoDB with mongoimport. The reason I want JSON is that it allows me to specify the "loc" field as an array and not a string for use with the geospatial index. The CSV is encoded in UTF-8.
A snippet of my CSV looks like this:
"geonameid","name","asciiname","alternatenames","loc","feature_class","feature_code","country_code","cc2","admin1_code","admin2_code","admin3_code","admin4_code"
3,"Zamīn Sūkhteh","Zamin Sukhteh","Zamin Sukhteh,Zamīn Sūkhteh","[48.91667,32.48333]","P","PPL","IR",,"15",,,
5,"Yekāhī","Yekahi","Yekahi,Yekāhī","[48.9,32.5]","P","PPL","IR",,"15",,,
7,"Tarvīḩ ‘Adāī","Tarvih `Adai","Tarvih `Adai,Tarvīḩ ‘Adāī","[48.2,32.1]","P","PPL","IR",,"15",,,
The desired JSON output (except the charset) that works with mongoimport is below:
{"geonameid":3,"name":"Zamin Sukhteh","asciiname":"Zamin Sukhteh","alternatenames":"Zamin Sukhteh,Zamin Sukhteh","loc":[48.91667,32.48333] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":5,"name":"Yekahi","asciiname":"Yekahi","alternatenames":"Yekahi,Yekahi","loc":[48.9,32.5] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
{"geonameid":7,"name":"Tarvi? ‘Adai","asciiname":"Tarvih `Adai","alternatenames":"Tarvih `Adai,Tarvi? ‘Adai","loc":[48.2,32.1] ,"feature_class":"P","feature_code":"PPL","country_code":"IR","cc2":null,"admin1_code":15,"admin2_code":null,"admin3_code":null,"admin4_code":null}
I have tried all available online CSV-JSON converters and they do not work because of the file size. The closest I got was with Mr Data Converter (the one pictured above) which would import to MongoDb after removing the start and end bracket and commas between documents. Unfortunately that tool doesn't work with a 300 mb file.
The JSON above is set to be encoded in UTF-8 but still has charset problems, most likely due to a conversion error?
I spent the last three days learning Python, trying to use Python CSVKIT, trying all the CSV-JSON scripts on stackoverflow, importing CSV to MongoDB and changing "loc" string to array (this retains the quotation marks unfortunately) and even trying to manually copy and paste 30,000 records at a time. A lot of reverse engineering, trial and error and so forth.
Does anyone have a clue how to achieve the JSON above while keeping the encoding proper like in the CSV above? I am at a complete standstill.
Python standard library (plus simplejson for decimal encoding support) has all you need:
import csv, simplejson, decimal, codecs
data = open("in.csv")
reader = csv.DictReader(data, delimiter=",", quotechar='"')
with codecs.open("out.json", "w", encoding="utf-8") as out:
for r in reader:
for k, v in r.items():
# make sure nulls are generated
if not v:
r[k] = None
# parse and generate decimal arrays
elif k == "loc":
r[k] = [decimal.Decimal(n) for n in v.strip("[]").split(",")]
# generate a number
elif k == "geonameid":
r[k] = int(v)
out.write(simplejson.dumps(r, ensure_ascii=False, use_decimal=True)+"\n")
Where "in.csv" contains your big csv file. The above code is tested as working on Python 2.6 & 2.7, with about 100MB csv file, producing a properly encoded UTF-8 file. Without surrounding brackets, array quoting nor comma delimiters, as requested.
It is also worth noting that passing both the ensure_ascii and use_decimal parameters is required for the encoding to work properly (in this case).
Finally, being based on simplejson, the python stdlib json package will also gain decimal encoding support sooner or later. So only the stdlib will ultimately be needed.
Maybe you could try importing the csv directly into mongodb using
mongoimport -d <dB> -c <collection> --type csv --file location.csv --headerline
I am brand new to stackoverflow and to scripting. I was looking for help to get started in a script, not necessarily looking for someone to write it.
Here's what I have:
File1.csv - contains some information, I am only interested in MAC addresses.
File2.csv - has some different information, but also contains MAC address.
I need a script that parses the MAC addresses from file1.csv and logs a report if any MAC address shows up in file2.csv.
The questions:
Any tips on the language I use, preferably perl, python or bash?
Can anyone suggest some structure for the logic needed (even if just in psuedo-code)?
update
Using #Adam Wagner's approach, I am really close!
import csv
#Need to strip out NUL values from .csv file to make python happy
class FilteredFile(file):
def next(self):
return file.next(self).replace('\x00','').replace('\xff\xfe','')
reader = csv.reader(FilteredFile('wifi_clients.csv', 'rb'), delimiter=',', quotechar='|')
s1 = set(rec[0] for rec in reader)
inventory = csv.reader(FilteredFile('inventory.csv','rb'),delimiter=',')
s2 = set(rec[6] for rec in inventory)
shared_items = s1.intersection(s2)
print shared_items
This always outputs:(even if I doctor the .csv files to have matching MAC addresses)
set([])
Contents of the csv files
wifi_clients.csv
macNames, First time seen, Last time seen,Power, # packets, BSSID, Probed ESSIDs
inventory.csv
Name,Manufacturer,Device Type,Model,Serial Number,IP Address,MAC Address,...
Here's the approach I'd take:
Iterate over each csv file (python has a handy csv module for accomplishing this), capturing the mac-address and placing it in a set (one per file). And once again, python has a great builtin set type. Here's a good example of using the csv module and of-course, the docs.
Next, you can get the intersection of set1 (file1) and set2 (file2). This will show you mac-addresses that exist in both files one and two.
Example (in python):
s1 = set([1,2,3]) # You can add things incrementally with "s1.add(value)"
s2 = set([2,3,4])
shared_items = s1.intersection(s2)
print shared_items
Which outputs:
set([2, 3])
Logging these shared items could be done with anything from printing (then redirecting output to a file), to using the logging module, to saving directly to a file.
I'm not sure how in-depth of an answer you were looking for, but this should get you started.
Update: CSV/Set usage example
Assuming you have a file "foo.csv", that looks something like this:
bob,123,127.0.0.1,mac-address-1
fred,124,127.0.0.1,mac-address-2
The simplest way to build the set, would be something like this:
import csv
set1 = set()
for record in csv.reader(open('foo.csv', 'rb')):
user, machine_id, ip_address, mac_address = record
set1.add(mac_address)
# or simply "set1.add(record[3])", if you don't need the other fields.
Obviously, you'd need something like this for each file, so you may want to put this in a function to make life easier.
Finally, if you want to go the less-verbose-but-cooler-python-way, you could also build the set like this:
csvfile = csv.reader(open('foo.csv', 'rb'))
set1 = set(rec[3] for rec in csvfile) # Assuming mac-address is the 4th column.
I strongly recommend python to do this.
'Cause you didn't give the structure of the csv file, I can only show a framework:
def get_MAC_from_file1():
... parse the file to get MAC
return a_MAC_list
def get_MAC_from_file2():
... parse the file to get MAC
return a_MAC_list
def log_MACs():
MAC_list1, MAC_list2 = get_MAC_from_file1(), get_MAC_from_file2()
for a_MAC in MAC_list1:
if a_MAC in MAC_list2:
...write your logs
if the data set is large, use a dict or set instead of the list and the intersect operation. But as it's MAC address, I guess your dataset is not that large. So keeping the script easy to read is the most important thing.
Awk is perfect for this
{
mac = $1 # assuming the mac addresses are in the first column
do_grep = "grep " mac " otherfilename" # we'll use grep to check if the mac address is in the other file
do_grep | getline mac_in_other_file # pipe the output of the grep command into a new variable
close(do_grep) # close the pipe
if(mac_in_other_file != ""){ # if grep found the mac address in the other file
print mac > "naughty_macs.log" # append the mac address to the log file
}
}
Then you'd run that on the first file:
awk -f logging_script.awk mac_list.txt
(this code is untested and I'm not the greatest awk hacker, but it should give the general idea)
For the example purpose generate 2 files that that look like yours.
File1:
for i in `seq 100`; do
echo -e "user$i\tmachine$i\t192.168.0.$i\tmac$i";
done > file1.csv
File2 (contains random entries of "mac addresses" numbered from 1-200)
for j in `seq 100`; do
i=$(($RANDOM % 200)) ;
echo -e "mac$i\tmachine$i\tuser$i";
done > file2.csv
Simplest approach would be to use join command and do a join on the appropriate field. This approach has the advantage that fields from both files would be available in the output.
Based on the example files above, the command would look like this:
join -1 4 -2 1 <(sort -k4 file1.csv) <(sort -k1 file2.csv)
join needs the input to be sorted by the field you are matching, that's why the sort is there (-k tells which column to use)
The command above matches rows from file1.csv with rows from file2.csv if column 4 in the first file is equal with column 1 from the second file.
If you only need specific fields, you can specify the output format to the join command:
join -1 4 -2 1 -o1.4 1.2 <(sort -k4 file1.csv) <(sort -k1 file2.csv)
This would print only the mac address and the machine field from the first file.
If you only need a list of matching mac addresses, you can use uniq or sort -u. Since the join output will be sorted by mac, uniq is faster. But if you need a unique list of another field, sort -u is better.
If you only need the mac addresses that match, grep can accept patterns from a file, and you can use cut to extract only the forth field.
fgrep -f<(cut -f4 file1.csv) file2.csv
The above would list all the lines in file2.csv that contain a mac address from file1
Note that I'm using fgrep which doesn't do pattern matching. Also, if file1 is big, this may be slower than the first approach. Also, it assumes that the mac is present only in the field1 of file2 and the other fields don't contain mac addresses.
If you only need the mac, you can either use -o option on fgrep but there are grep variants that don't have it, or you can pipe the output trough cut and then sort -u
fgrep -f<(cut -f4 file1.csv) file2.csv | cut -f1 | sort -u
This would be the bash way.
Python and awk hints have been shown above, I will take a stab at perl:
#!/usr/bin/perl -w
use strict;
open F1, $ARGV[0];
my %searched_mac_addresses = map {chomp; (split /\t/)[3] => 1 } <F1>;
close F1;
open F2, $ARGV[1];
while (<F2>) {
print if $searched_mac_addresses{(split "\t")[0]}
}
close F2
First you create a dictionary containing all the mac addresses from the first file:
my %searched_mac_addresses = map {chomp; (split /\t/)[3] => 1 } <F1>;
reads all the lines from the file1
chomp removes the end of line
split splits the line based on tab, you can use a more complex regexp if needed
() around split force an array context
[3] selects the forth field
map runs a piece of code for all elements of the array
=> generates a dictionary (hash in perl's terminology) element instead of an array
Then you read line by line the second file, and check if the mac exists in the above dictionary:
while (<F2>) {
print if $searched_mac_addresses{(split "\t")[0]}
}
while () will read the file F2, and put each line in the $_ variable
print without any parameters prints the default variable $_
if can postfix a instruction
dictionary elements can be accessed via {}
split by default splits the $_ default variable