I have an output that looks like this, where the first number corresponds to the count of the type below (e.g. 72 for Type 4, etc)
72
Type
4
51
Type
5
66
Type
6
78
Type
7
..etc
Is there a way to organize this data to look something like this:
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
etc..
Essentially, the question is how to take a single column of data and sort/organize it into something more readable using bash, awk, python, etc. (Ideally in bash, but I am interested to know how to do it in Python.)
Thank you.
Use paste to join 3 consecutive lines from stdin, then just rearrange the fields.
paste - - - < file | awk '{print $2, $3, "=", $1, "times"}'
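For reference, paste - - - reads stdin three lines at a time and joins each group with tabs, so the intermediate output (before the awk rearrangement) should look like this:
72	Type	4
51	Type	5
66	Type	6
78	Type	7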
It's simple enough with Python to read three lines of data at a time:
def perthree(iterable):
    return zip(*[iter(iterable)] * 3)

with open(inputfile) as infile:
    for count, type_, type_num in perthree(infile):
        print('{} {} = {} times'.format(type_.strip(), type_num.strip(), count.strip()))
The .strip() calls remove any extra whitespace, including the newline at the end of each line of input text.
Demo:
>>> with open(inputfile) as infile:
...     for count, type_, type_num in perthree(infile):
...         print('{} {} = {} times'.format(type_.strip(), type_num.strip(), count.strip()))
...
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
In Bash:
#!/bin/bash
A=() I=0
while read -r LINE; do
    if (( (M = ++I % 3) )); then
        A[M]=$LINE
    else
        printf "%s %s = %s times\n" "${A[2]}" "$LINE" "${A[1]}"
    fi
done
Running bash script.sh < file creates:
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
Note: with the default IFS ($' \t\n'), read strips leading and trailing whitespace from each line.
Try this awk one liner:
$ awk 'NR%3==1{n=$1}NR%3==2{t=$1}NR%3==0{print t,$1,"=",n,"times"}' file
Type 4 = 72 times
Type 5 = 51 times
Type 6 = 66 times
Type 7 = 78 times
How does it work?
awk '
NR%3==1{                       # if we are on lines 1, 4, 7, etc. (NR is the record number, i.e. the line number)
    n=$1                       # set the variable n to the first (and only) word
}
NR%3==2{                       # if we are on lines 2, 5, 8, etc.
    t=$1                       # set the variable t to the first (and only) word
}
NR%3==0{                       # if we are on lines 3, 6, 9, etc.
    print t,$1,"=",n,"times"   # print the desired output
}' file
Related
I want to make function doing exactly this:
#This is my input number
MyNumberDec = 138
MyNumberHex = hex(MyNumberDec)
print (MyNumberDec)
print (MyNumberHex)
#Output looks exactly like this:
#138
#0x8a
HexFirstDigitCharacter = MagicFunction(MyNumberHex)
HexSecondDigitCharacter = MagicFunction(MyNumberHex)
#print (HexFirstDigitCharacter )
#print (HexSecondDigitCharacter )
#I want to see this in output
#8
#A
What is that function?
Why do I need this?
For calculating a checksum when sending messages to some industrial equipment.
For example command R8:
N | HEX | ASC
1 52 R
2 38 8
3 38 8
4 41 A
Bytes 1 and 2 are the command, bytes 3 and 4 are the checksum.
Way of calculating the checksum: 0x52 + 0x38 = 0x8A
I have to send ASCII 8 as the third byte and ASCII A as the fourth byte.
Maybe I don't need my MagicFunction but some other solution?
You can convert an integer to a hex string without the preceding '0x' by using the string formatter (%02X gives two uppercase hex digits):
MyNumberDec = 138
MyNumberHex = '%02X' % MyNumberDec
print(MyNumberHex[0])
print(MyNumberHex[1])
This outputs:
8
A
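Applied to the checksum example from the question (command R8, i.e. bytes 0x52 and 0x38), a minimal sketch could look like this; the & 0xFF mask is an assumption about how the protocol handles sums larger than one byte:
command = "R8"

# Sum the ASCII codes of the command bytes: 0x52 ('R') + 0x38 ('8') = 0x8A
checksum = sum(ord(ch) for ch in command) & 0xFF  # assumption: the checksum is a single byte

# Format as two uppercase hex digits and send each digit as an ASCII character
checksum_hex = '%02X' % checksum
print(checksum_hex[0])  # 8
print(checksum_hex[1])  # A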
Suppose I have a huge text file like below:
19990231
blabla
sssssssssssss
hhhhhhhhhhhhhh
ggggggggggggggg
20090812
blbclg
hhhhhhhhhhhhhh
ggggggggggggggg
hhhhhhhhhhhhhhh
20010221
fgghgg
sssssssssssss
hhhhhhhhhhhhhhh
ggggggggggggggg
<etc>
How can I randomly remove 100 blocks that start with numeric characters and end with a blank line? Eg:
20090812
blbclg
hhhhhhhhhhhhhh
ggggggggggggggg
hhhhhhhhhhhhhhh
<blank line>
This is not that difficult. The trick is to define the records first, and this can be done with the record separator:
RS: The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
So the number of records is given by:
$ NR=$(awk 'BEGIN{RS=""}END{print NR}' <file>)
You can then use shuf to get a hundred random numbers between 1 and NR:
$ shuf -i 1-$NR -n 100
You then feed this output back into awk to select the records:
$ awk -v n=100 '(NR==n){RS="";ORS="\n\n"}  # after the nth random number, reset RS for reading <file>
(NR==FNR){a[$1];next}                      # load the 100 random record numbers in memory
!(FNR in a){ print }                       # print the records that are not in that list
' <(shuf -i 1-$NR -n 100) <file>
We can also do this in one go, using the Knuth shuffle and doing a double pass of the file:
awk -v n=100 '
    # Create n random numbers between 1 and m (stored as keys of the array a)
    function shuffle(m,n,  b,i,j,t) {
        for (i = m; i > 0; i--) b[i] = i
        for (i = m; i > 1; i--) {
            # j = random integer from 1 to i
            j = int(i * rand()) + 1
            # swap b[i], b[j]
            t = b[i]; b[i] = b[j]; b[j] = t
        }
        for (i = n; i > 0; i--) a[b[i]]
    }
    BEGIN       { RS=""; ORS="\n\n"; srand() }   # paragraph mode; keep blank lines between output blocks
    (NR==FNR)   { next }                         # first pass: just count the records
    (FNR==1)    { shuffle(NR-1,n) }              # second pass: pick n random record numbers
    !(FNR in a) { print }                        # print the records that were not picked
' <file> <file>
Using awk and shuf to delete 4 blocks out of 6 blocks where each block is 3 lines long:
$ cat tst.awk
BEGIN { RS=""; ORS="\n\n" }
NR==FNR { next }
FNR==1 {
    cmd = sprintf("shuf -i 1-%d -n %d", NR-FNR, numToDel)
    oRS=RS; RS="\n"
    while ( (cmd | getline line) > 0 ) {
        badNrs[line]
    }
    RS=oRS
    close(cmd)
}
!(FNR in badNrs)
$ awk -v numToDel=4 -f tst.awk file file
1
2
3
10
11
12
Just change numToDel=4 to numToDel=100 for your real input.
The input file used to test against above was generated by:
$ seq 18 | awk '1; !(NR%3){print ""}' > file
which produced:
$ cat file
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Here is a solution without shuf:
$ awk -v RS= -v ORS='\n\n' -v n=100 '
BEGIN {srand()}
NR==FNR{next}
FNR==1 {r[0];
while(length(r)<=n) r[int(rand()*NR)]}
!(FNR in r)' file{,}
This is a double-pass algorithm: the first pass counts the number of records, then a list of random index numbers is built up to the required count, and the records whose index is not in the list are printed. Note that if the number of records to delete is close to the total number of records, performance will degrade (the probability of drawing a new, unused index becomes low). For your case of 100 out of 600 this will not be a problem. In that other situation it would be easier to pick the records to be printed instead of the records to be deleted.
Since shuf is very fast I don't think this will buy you any performance gains, but it is perhaps simpler this way.
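For completeness, if you ever do need to delete most of the records, here is a rough Python sketch of the opposite approach (picking the blocks to keep rather than the blocks to delete); it assumes blocks are separated by blank lines, and n_keep is a placeholder:
import random
import sys

n_keep = 500  # placeholder: number of blocks to keep (total blocks minus the blocks to delete)

with open(sys.argv[1]) as f:
    # split on blank lines to get the blocks
    blocks = [b for b in f.read().split('\n\n') if b.strip()]

# choose which blocks to keep, preserving their original order
keep = sorted(random.sample(range(len(blocks)), min(n_keep, len(blocks))))

print('\n\n'.join(blocks[i].strip('\n') for i in keep))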
I have a fixed-width text file that I must convert to a .csv where all numbers have to be converted to integers (no commas, dollar signs, quotes, etc.). I have currently parsed the text file using plain Python, but I seem to be at an impasse when it comes to using the right package.
With csv, I can use writer.writerows in place of my print statement to write the output into my csv file, but the problem is that I have more columns (such as the date and time) that I must add after these rows, and I cannot seem to do that with csv. I also cannot seem to find a way to translate the blank columns in my text document into blank columns in the output; csv seems to write the fields strictly in order.
I was reading the documentation on xlsxwriter and I see how you can write to individual columns with a set formatting, but I am unsure whether it would work with my .csv requirement.
My input text has a series of random groupings throughout the 50k-line document but follows the format below:
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
1--------------------
1ANTECR09 CHEK DPCK_R_009
TRANSIT EXTRACT SUB-SYSTEM
CURRENT DATE = 08/03/2017 JOURNAL REPORT PAGE 1
PROCESS DATE =
ID = 022000046-MNT
FILE HEADER = H080320171115
+____________________________________________________________________________________________________________________________________
R T SEQUENCE CR BT A RSN ITEM ITEM CHN USER REASO
NBR NBR OR PIC NBR DB NBR NBR COD AMOUNT SERIAL IND .......FIELD.. DESCR
5,556 01 7450282689 C 538196640 9835177743 15 $9,064.81 00 CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431 DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896 DR CR
5,559 01 7450282692 D 071108834 176885 38 $6,688.00 1454 DR CR
5,560 01 7450282693 D 031309123 1390001566241 38 $293.42 6878 DR CR
My code currently parses this document, pulls the date, time, and prints only the lines where the sequence number starts with 42 and the CR is "C"
lines = []
a = 'PRINT DATE:'
b = 'ARCHIVE'
c = 'PRINT TIME:'

with open(r'textfile.txt') as in_file:
    for line in in_file:
        values = line.split()
        if 'PRINT DATE:' in line:
            dtevalue = line.split(a,1)[-1].split(b)[0]
            lines.append(dtevalue)
        elif 'PRINT TIME:' in line:
            timevalue = line.split(c,1)[-1].split(b)[0]
            lines.append(timevalue)
        elif (len(values) >= 4 and values[3] == 'C'
              and len(values[2]) >= 2 and values[2][:2] == '41'):
            print(line)

print (lines[0])
print (lines[1])
What would be the cleanest way to achieve this result, and am I headed in the right direction by writing out the parsing first or should I have just done everything within a package first?
Any help is appreciated
Edit:
The header block (between 1---------- and +___________) is repeated throughout the document, as are different-sized groupings separated by --------------------:
--------------------
34,615 207 4100223726 C 538196620 9866597322 10 $645.49 00 CREDIT
34,616 207 4100223727 D 022000046 8891636675 31 $645.49 111583 DR ON-
--------------------
34,617 208 4100223728 C 538196620 11701364 10 $756.19 00 CREDIT
34,618 208 4100223729 D 071923828 00 54 $305.31 11384597 BAD AC
34,619 208 4100223730 D 071923828 35110011 30 $450.88 10913052 6 DR SEL
--------------------
I would recommend slicing fixed width blocks to parse through the fixed width fields. Something like the following (incomplete) code:
data = """ 5,556 01 4250282689 C 538196640 9835177743 15 $9,064.81 00
CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431
DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896
DR CR
"""
# list of data layout tuples (start_index, stop_index, field name)
# TODO add missing data layout tuples
data_layout = [(0, 12, 'r_nbr'), (12, 22, 't_nbr'), (22, 39, 'seq'), (39, 42, 'cr_db')]
for line in data.splitlines():
    # skip "separator" lines
    # NOTE this may be an iterative process to identify these
    if line.startswith('-----'):
        continue
    record = {}
    for start_index, stop_index, name in data_layout:
        record[name] = line[start_index:stop_index].strip()
    # your conditional (seems inconsistent with text)
    if record['seq'].startswith('42') and record['cr_db'] == 'C':
        # perform any special handling for each column
        record['r_nbr'] = record['r_nbr'].replace(',', '')
        # TODO other special handling (like $)
        print('{r_nbr},{t_nbr},{seq},{cr_db},...'.format(**record))
Output is:
5556,01,4250282689,C,...
Update based on seemingly spurious values in undefined columns
Based on the new information provided about the "spurious" columns/fields (appear only rarely), this will likely be an iterative process.
My recommendation would be to narrow (appropriately!) the width of the desired fields. For example, if spurious data is in line[12:14] above, you could change the tuple for (12, 22, 't_nbr') to (14, 22, 't_nbr') to "skip" the spurious field.
An alternative is to add a "garbage" field in the list of tuples to handle those types of lines. Wherever the "spurious" fields appear, the "garbage" field would simply consume it.
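For example (the 12-14 positions below are purely hypothetical), the layout with a throwaway field might look like:
# hypothetical positions: the 'garbage' field just consumes the spurious columns
data_layout = [(0, 12, 'r_nbr'), (12, 14, 'garbage'), (14, 22, 't_nbr'),
               (22, 39, 'seq'), (39, 42, 'cr_db')]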
If you need these fields, the same general approach to the "garbage" field approach still applies, but you save the data.
Update based on random separators
If they are relatively consistent, I'd simply add some logic (as I did above) to "detect" the separators and skip over them.
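As for the csv part of the question: extra columns such as the parsed date and time can simply be appended to each row before it is passed to writer.writerow, and an empty string produces a blank column. A rough sketch (the file name and values are placeholders, not your actual data):
import csv

# placeholders for the values parsed earlier from the report header
print_date = '08/03/2017'
print_time = '11:15'

# each parsed record as a list of already-cleaned fields
rows = [['5556', '01', '4250282689', 'C', '9064.81']]

with open('output.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for row in rows:
        # '' writes a blank column; extra columns are appended at the end of the row
        writer.writerow(row + ['', print_date, print_time])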
This question is not the same as How to print only the unique lines in BASH?: that one is about keeping a single copy of each line, while this one is about eliminating every copy of any duplicated line, i.e. turning 1, 2, 3, 3 into 1, 2 rather than into 1, 2, 3.
This question is hard to phrase precisely, but the example should make it clear. If I have a file like this:
1
2
2
3
4
After parsing the file and erasing the duplicated lines, it should become:
1
3
4
I know some Python, so here is a Python script I wrote to do it. Create a file called clean_duplicates.py containing:
import sys

#
# To run it use:
#     python clean_duplicates.py < input.txt > clean.txt
#
def main():
    lines = sys.stdin.readlines()
    # print( lines )
    clean_duplicates( lines )

#
# It only removes adjacent duplicated lines, so you need to sort them
# (case-sensitively) before running it.
#
def clean_duplicates( lines ):
    lastLine    = lines[ 0 ]
    nextLine    = None
    currentLine = None
    linesCount  = len( lines )

    # If it is a one-line file, print it and stop the algorithm
    if linesCount == 1:
        sys.stdout.write( lines[ linesCount - 1 ] )
        sys.exit()

    # Print the first line
    if linesCount > 1 and lines[ 0 ] != lines[ 1 ]:
        sys.stdout.write( lines[ 0 ] )

    # Print the middle lines; range( 1, linesCount - 1 ) skips the first and last indexes
    for index in range( 1, linesCount - 1 ):
        currentLine = lines[ index ]
        nextLine    = lines[ index + 1 ]

        if currentLine == lastLine:
            continue

        lastLine = lines[ index ]

        if currentLine == nextLine:
            continue

        sys.stdout.write( currentLine )

    # Print the last line
    if linesCount > 2 and lines[ linesCount - 2 ] != lines[ linesCount - 1 ]:
        sys.stdout.write( lines[ linesCount - 1 ] )

if __name__ == "__main__":
    main()
However, for removing duplicate lines it seems easier to use tools such as grep, sort, sed, or uniq:
How to remove duplicate lines inside a text file?
removing line from list using sort, grep LINUX
Find duplicate lines in a file and count how many time each line was duplicated?
Remove duplicate entries in a Bash script
How to delete duplicate lines in a file without sorting it in Unix?
How to delete duplicate lines in a file...AWK, SED, UNIQ not working on my file
You may use uniq with the -u/--unique option. As per the uniq man page:
-u / --unique
Don't output lines that are repeated in the input.
Print only lines that are unique in the INPUT.
For example:
cat /tmp/uniques.txt | uniq -u
Or, to avoid a UUOC (useless use of cat), a better way is:
uniq -u /tmp/uniques.txt
Both of these commands will return:
1
3
4
where /tmp/uniques.txt holds the numbers from the question, i.e.
1
2
2
3
4
Note: uniq only compares adjacent lines, so the duplicates must be next to each other; this is usually achieved by sorting the file first. As mentioned in the docs:
By default, uniq prints the unique lines in a sorted file; it discards all but one of identical successive input lines, so that the output contains unique lines.
If the file is not sorted, sort the content first and then use uniq on the sorted content:
sort /tmp/uniques.txt | uniq -u
No sorting required and output order will be the same as input order:
$ awk 'NR==FNR{c[$0]++;next} c[$0]==1' file file
1
3
4
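The same idea translated to Python, for comparison: count every line in a first pass, then print only the lines that occur exactly once, preserving the input order (a sketch; the file name is a placeholder):
from collections import Counter
import sys

filename = 'file'  # placeholder

# first pass: count how many times each line occurs
with open(filename) as f:
    counts = Counter(f)

# second pass: print only the lines that occur exactly once, in input order
with open(filename) as f:
    for line in f:
        if counts[line] == 1:
            sys.stdout.write(line)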
Europe Finland Office Supplies Online H 5/21/2015 193508565 7/3/2015 2339 651.21 524.96 1523180.19 1227881.44 295298.75
Europe Greece Household Online L 9/11/2015 895509612 9/26/2015 49 668.27 502.54 32745.23 24624.46 8120.77
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
Europe Hungary Beverages Online C 8/21/2012 722931563 8/25/2012 370 47.45 31.79 17556.50 11762.30 5794.20
If you have lines like these, you can use this command:
[isuru#192 ~]$ sort duplines.txt | sed 's/\ /\-/g' | uniq | sed 's/\-/\ /g'
But keep special characters in mind: if there are dashes in your lines, make sure to use a different replacement symbol. Here I keep a space between the backslash and the forward slash.
You can also use the sort command with the -u argument to list the distinct values of any command's output. Note that this keeps one copy of each duplicated line rather than dropping duplicated lines entirely, so it does not quite match the output requested in the question:
sort -u file_name
1
2
3
4
I am now facing a file-trimming problem. I would like to trim rows in a tab-delimited file.
The rule is: for rows with the same values in the first two columns, preserve only the row with the largest value in the third column. There may be different numbers of such redundant rows for a given pair of values. If there is a tie for the largest value in the third column, preserve the first one (after ordering the file).
(1) My file looks like (tab-delimited, with several millions of rows):
1 100 25 T
1 101 26 A
1 101 27 G
1 101 30 A
1 102 40 A
1 102 40 T
(2) The output I want:
1 100 25 T
1 101 30 A
1 102 40 T
This problem comes from my actual research, not homework. I would appreciate your help, as my programming skills are limited. I would prefer a computationally efficient approach, because there are so many rows in my data file. Your help will be very valuable to me.
Here's a solution that relies on the input file already being sorted appropriately. It scans line by line for lines with a similar start (i.e. the first two columns identical), checks the third-column value, and preserves the line with the highest value, or the line that came first in the file. When a new start is found, it prints the old line and begins checking again.
At the end of the input file, the max line in memory is printed out.
use warnings;
use strict;

my ($max_line, $start, $max) = parse_line(scalar <DATA>);

while (<DATA>) {
    my ($line, $nl_start, $nl_max) = parse_line($_);
    if ($nl_start eq $start) {
        if ($nl_max > $max) {
            $max_line = $line;
            $max      = $nl_max;
        }
    } else {
        print $max_line;
        $start    = $nl_start;
        $max      = $nl_max;
        $max_line = $line;
    }
}
print $max_line;

sub parse_line {
    my $line = shift;
    my ($start, $max) = $line =~ /^([^\t]+\t[^\t]+\t)(\d+)/;
    return ($line, $start, $max);
}
__DATA__
1 100 25 T
1 101 26 A
1 101 27 G
1 101 30 A
1 102 40 A
1 102 40 T
The output is:
1 100 25 T
1 101 30 A
1 102 40 A
You stated
If there is a tie for the largest
value in the third column, preserve
the first one (after ordering the
file).
which is rather cryptic. Then you asked for output that seemed to contradict this, where the last value was printed instead of the first.
I am assuming that what you meant is "preserve the first value". If you indeed meant "preserve the last value", then simply change the > sign in if ($nl_max > $max) to >=. This will effectively preserve the last of the equal values instead of the first.
If you however implied some kind of sort, which "after ordering the file" seems to imply, then I do not have enough information to know what you meant.
In Python you can try the following code:
res = {}
for line in (line.split() for line in open('c:\\inpt.txt','r') if line):
    line = tuple(line)
    if not line[:2] in res:
        res[line[:2]] = line[2:]
        continue
    elif int(res[line[:2]][0]) <= int(line[2]):   # compare the third column numerically
        res[line[:2]] = line[2:]

f = open('c:\\tst.txt','w')
f.writelines('\t'.join(k+v)+'\n' for k,v in res.items())
f.close()
Here's one approach
use strict;
use warnings;
use constant
    { LINENO => 0
    , LINE   => 1
    , SCORE  => 2
    };
use English qw<$INPUT_LINE_NUMBER>;

my %hash;
while ( <> ) {
    # split the line to get the fields
    my @fields = split /\t/;
    # Assemble a key from everything except the "score"
    my $key = join( '-', @fields[0,1] );
    # locally cache the score
    my $score = $fields[SCORE];
    # if we already have a score for this key and the current one is not greater, skip it
    next if $hash{ $key } and $score <= $hash{ $key }[SCORE];
    # store the line number, line text, and score
    $hash{ $key } = [ $INPUT_LINE_NUMBER, $_, $score ];
}

# sort by line number and print out the text of the line stored.
foreach my $struct ( sort { $a->[LINENO] <=> $b->[LINENO] } values %hash ) {
    print $struct->[LINE];
}
In Python too, but cleaner, IMO:
import csv

spamReader = csv.reader(open('eggs'), delimiter='\t')
select = {}
for row in spamReader:
    first_two, three = (row[0], row[1]), row[2]
    if first_two in select:
        if int(select[first_two][2]) > int(three):   # compare numerically; keep the larger value
            continue
    select[first_two] = row

spamWriter = csv.writer(open('ham', 'w'), delimiter='\t')
for line in select:
    spamWriter.writerow(select[line])