How can I print a grep result with the matched word? - python

I have two files, "fileA":
Adygei
Albanian
Armenia_C
Armenia_Caucasus
Armenia_EBA
Armenia_LBA
Armenia_MBA
Armenian.DG
Austria_EN_HG_LBK
Austria_EN_LBK
And "fileB":
HG01880.SG Aygei_o1.SG
HG01988.SG Adygei_o2.SG
HG02419.SG Albanian_o2.SG
HG01879.SG Albanian.SG
HG01882.SG Armenia_C.SG
HG01883.SG Armenia_C.SG
HG01885.SG Armenia_EBA.SG
HG01886.SG Armenia_EBA.SG
HG01889.SG Armenia_LBA.SG
HG01890.SG Armenia_MBA.SG
What I want at the end is to create a new column (the position of the column doesn't matter) containing the word that matched. Like this:
HG01880.SG Aygei_o1.SG Adygei
HG01988.SG Adygei_o2.SG Adygei
HG02419.SG Albanian_o2.SG Albanian
HG01879.SG Albanian.SG Albanian
HG01882.SG Armenia_C.SG Armenia_C
HG01883.SG Armenia_C.SG Armenia_C
HG01885.SG Armenia_EBA.SG Armenia_EBA
HG01886.SG Armenia_EBA.SG Armenia_EBA
HG01889.SG Armenia_LBA.SG Armenia_LBA
HG01890.SG Armenia_MBA.SG Armenia_MBA
What I used to match both files in bash is grep -wFf fileA fileB > newfileA_B.txt. The solution can be in either Python or bash.

You can try something like this:
while IFS= read -r line
do
    # Append the matched word to every line of fileB that contains it
    grep -wF "$line" fileB.txt | sed "s/$/ $line/"
done < fileA.txt

Here is an example (probably inefficient) algorithm in Python (using strings instead of files): https://colab.research.google.com/drive/1bUnFXJg0m6FvXRkPybUqWux_reaJRt1c?usp=sharing
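For reference, a minimal sketch of that kind of approach reading straight from the files (the names fileA.txt, fileB.txt and newfileA_B.txt are assumptions):
import re

# Read the patterns, one per line, from fileA.txt
with open("fileA.txt") as fa:
    patterns = [line.strip() for line in fa if line.strip()]

# For every line of fileB.txt, append the first pattern that matches
# as a whole word (roughly what grep -wF does)
with open("fileB.txt") as fb, open("newfileA_B.txt", "w") as out:
    for line in fb:
        line = line.rstrip("\n")
        for pat in patterns:
            if re.search(r"\b" + re.escape(pat) + r"\b", line):
                out.write(line + " " + pat + "\n")
                break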

Perform the grep search once more, but this time add the -o flag, which lists only the matching words. Then use paste to add them as a column (define the delimiter using the -d flag).
paste -d ' ' <(grep -wFf fileA fileB) <(grep -woFf fileA fileB)

Related

vim and wc give different line counts

I have two csv files that give different results when I use wc -l (gives 65 lines for the first, 66 for the second) and when I use vim file.csv and then :$ to go to the bottom of the file (66 lines for both). I have tried viewing newline characters in vim using :set list and they look identical.
The second file (which shows one extra line with wc) was created from the first using pandas in Python and to_csv.
Is there anything within pandas that might generate new lines or other bash/vim tools I can use to verify the differences?
If the last character of the file is not a newline, wc won't count the last line:
$ printf 'a\nb\nc' | wc -l
2
In fact, that's how wc -l is documented to work: from man wc
-l, --lines
print the newline counts
^^^^^^^^^^^^^
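To verify which file lacks the trailing newline, here is a minimal Python check (the file name is just an example):
# Prints True if the file ends with a newline, False otherwise
with open("file.csv", "rb") as f:
    print(f.read().endswith(b"\n"))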

feed a command a comma separated list of file names in a directory, extract a variable motif from file names for labels

I have a directory containing files that look like this:
1_reads.fastq
2_reads.fastq
89_reads.fastq
42_reads.fastq
I would like to feed a comma separated list of these file names to a command from a python program, so the input to the python command would look like this:
program.py -i 1_reads.fastq,2_reads.fastq,89_reads.fastq,42_reads.fastq
Furthermore, I'd like to use the numbers in the file names for a labeling function within the python command such that the input would look like this:
program.py -i 1_reads.fastq,2_reads.fastq,89_reads.fastq,42_reads.fastq -t s1,s2,s89,s42
It's important that the file names and the label IDs are in the same order.
First: This is a very poorly-thought-out calling convention. Don't use it.
However, if you're using software someone else wrote that already has that convention baked in...
#!/bin/bash
IFS=, # use comma as separator
files=( [[:digit:]]*_* )
[[ -e $files || -L $files ]] || { echo "ERROR: No files matching glob exist" >&2; exit 1; }
prefixes=( )
for file in "${files[@]}"; do
    prefixes+=( "s${file%%_*}" )
done
# "exec" only if this is the last command in the script; remove otherwise
exec program.py -i "${files[*]}" -t "${prefixes[*]}"
How this works:
IFS=, causes ${array[*]} to put a comma between each expanded element. Thus, expanding ${files[*]} and ${prefixes[*]} creates comma-separated strings with the contents of each array.
${file%%_*} removes everything after the first _ in a filename, allowing the numbers alone to be extracted.
[[ -e $files || -L $files ]] actually only tests whether the first element in that array exists (as a symlink or otherwise); however, this will always be true if the glob being expanded to form the array matched any files (unless files have been deleted between the two lines' invocation).
Try this:
program.py $(cd DIR && var=$(ls) && echo $var | tr ' ' ',')
That will pass to program.py the string returned by the command line inside the $(..).
That command line will: enter your directory, run ls and store the output in a variable (the unquoted expansion then collapses the newlines into single spaces, without adding a trailing space), and echo that variable to tr, which translates spaces to commas.
It can be done easily in pure Bash. Make sure you run from within the directory that contains the files.
#!/bin/bash
shopt -s extglob nullglob
# Create an array of files
f=( +([[:digit:]])_reads.fastq )
# Check that there are some files...
if ((${#f[@]}==0)); then
    echo "No files found. Exiting."
    exit
fi
# Create an array of labels, directly from the array f:
# Remove trailing _reads.fastq
l=( "${f[@]%_reads.fastq}" )
# And prepend the letter s
l=( "${l[@]/#/s}" )
# Now the arrays f and l are good: check them:
declare -p f l
# To join the arrays, we'll use eval. Safe because the code is single-quoted!
IFS=, eval 'program.py -i "${f[*]}" -t "${l[*]}"'
Note. The use of eval here is perfectly safe as we're passing a constant string (and it's actually an idiomatic way to join an array without using a subshell or a loop). Don't modify the command, in particular the single quotes.
Thanks to Charles Duffy who convinced me to add healthy comments about the use of eval
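Since the question mentions calling the command from a Python program, here is a hedged sketch of the same idea in Python (the glob pattern is from the question; invoking program.py this way assumes it is executable and on PATH):
import glob
import subprocess

# Collect the fastq files; sorting keeps files and labels in matching order
files = sorted(glob.glob("*_reads.fastq"))
if not files:
    raise SystemExit("No files matching *_reads.fastq")

# Derive the labels from the leading digits, e.g. 1_reads.fastq -> s1
labels = ["s" + f.split("_", 1)[0] for f in files]

subprocess.run(["program.py", "-i", ",".join(files), "-t", ",".join(labels)])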

Pattern Matching in a csv file and appending to matched lines

I want to extract those lines of a csv file which match a pattern, and then append the same pattern to the end of each extracted line as a newly added column of the csv file.
file.csv
/var/log/0,33,New file,0
/var/log/0,34,Size increased,2345
/abc/Repli,11,New file,0
/abc/Repli,87,Size Increase,11
In the above file.csv, I executed
sed -n -i"" '/Repli/ s/$/,Repli/p' file.csv
This deletes the remaining (non-matching) lines, which I do not want.
Extracting only the lines that match a pattern and modifying them
To select only lines containing pattern and then add pattern as a new column at the end of the line:
awk '/pattern/ {print $0 ",pattern"}' file.csv >tmp$$ && mv tmp$$ file.csv
Or,
sed -b -n -i"" '/pattern/ s/$/,pattern/p' file.csv
Keeping all lines but modifying those that match a pattern
awk '/pattern/ {$0=$0 ",pattern"} 1'
Or,
sed -b -i"" '/pattern/ s/$/,pattern/' file.csv
Remove Windows line endings while keeping all lines and modifying those that match a pattern
sed -i"" 's/\r//; /pattern/ s/$/,pattern/' file.csv
Remove Windows line endings while keeping all lines and modifying those that match a pattern containing slashes
Suppose that the pattern contains slashes like /var/log/abc/file/0/. Then:
sed -i"" 's/\r//; \|pattern| s|$|,pattern|' file.csv
For example:
sed -i"" 's/\r//; \|/var/log/abc/file/0/| s|$|,/var/log/abc/file/0/|' file.csv
I found a solution to match paths using sed. I did it by escaping the slashes, and it worked.
Pattern="\/var\/log\/Model\/1\/"
Module=BE
sudo sed -i"" "s/\r//; /$Pattern/ s/$/,$Module/" resultFile.csv
Worked Fine!!
The snippet below appends the pattern if a match is found; if no match is found, it just prints the line:
awk '{if($0 ~ /pattern/) print $0",pattern"; else print $0;}' file.csv
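For completeness, a minimal Python sketch of the same "keep all lines, modify the matching ones" behaviour (the file names and the pattern are placeholders):
pattern = "Repli"

with open("file.csv") as fin, open("file_out.csv", "w") as fout:
    for line in fin:
        line = line.rstrip("\r\n")  # also strips Windows line endings
        if pattern in line:
            line += "," + pattern
        fout.write(line + "\n")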

Extracting columns from text file using Perl one-liner: similar to Unix cut

I'm using Windows, and I would like to extract certain columns from a text file using a Perl, Python, batch etc. one-liner.
On Unix I could do this:
cut -d " " -f 1-3 <my file>
How can I do this on Windows?
Here is a Perl one-liner to print the first 3 whitespace-delimited columns of a file. This can be run on Windows (or Unix). Refer to perlrun.
perl -ane "print qq(@F[0..2]\n)" file.txt
You can download the GNU utilities for Windows (e.g. GnuWin32) and use your normal cut/awk etc.
Or natively, you can use VBScript:
Set objFS = CreateObject("Scripting.FileSystemObject")
Set objArgs = WScript.Arguments
strFile = objArgs(0)
Set objFile = objFS.OpenTextFile(strFile)
Do Until objFile.AtEndOfStream
    strLine = objFile.ReadLine
    sp = Split(strLine, " ")
    s = ""
    For i = 0 To 2
        s = s & " " & sp(i)
    Next
    WScript.Echo s
Loop
save the above as mysplit.vbs and on command line
c:\test> cscript //nologo mysplit.vbs file
Or just simple batch
@echo off
for /f "tokens=1,2,3 delims= " %%a in (file) do (echo %%a %%b %%c)
If you want a Python one-liner (it prints the first three whitespace-delimited columns of each line):
c:\test> type file | python -c "import sys; sys.stdout.writelines(' '.join(l.split()[:3]) + '\n' for l in sys.stdin)"
That's a rather simple Python script:
for line in open("my file"):
    parts = line.split(" ")
    print(" ".join(parts[0:3]))
The easiest way to do it would be to install Cygwin and use the Unix cut command.
If you are dealing with a text file that has very long lines and you are only interested in the first 3 columns, then splitting a fixed number of times yourself will be a lot faster than using the -a option:
perl -ne "@F = split /\s/, $_, 4; print qq(@F[0..2]\n)" file.txt
rather than
perl -ane "print qq(@F[0..2]\n)" file.txt
This is because the -a option will split on every whitespace in a line, which potentially can lead to a lot of extra splitting.
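The same trick carries over to Python, whose str.split also accepts a maximum number of splits; a small sketch reading from standard input:
import sys

for line in sys.stdin:
    # Split at most 3 times: the 4th field soaks up the rest of a long line
    parts = line.split(None, 3)
    print(" ".join(parts[:3]))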

Bash or python for changing spacing in files

I have a set of 10000 files. In all of them, the second line, looks like:
AAA 3.429 3.84
so there is just one space (requirement) between AAA and the two other columns. The rest of lines on each file are completely different and correspond to 10 columns of numbers.
Randomly, in around 20% of the files, and due to some errors, one gets
BBB  3.429 3.84
so now there are two spaces between the first and second column.
This is a big error so I need to fix it, changing from 2 to 1 space in the files where the error takes place.
The first approach I thought of was to write a bash script that for each file reads the 3 values of the second line and then prints them with just one space, doing it for all the files.
I wonder what you think about this approach and whether you could suggest something better: bash, python or some other approach.
Thanks
Performing line-based changes to text files is often simplest to do in sed.
sed -e '2s/  */ /g' infile.txt
will replace any runs of multiple spaces with a single space. This may be changing more than you want, though.
sed -e '2s/^\([^ ]*\)  /\1 /' infile.txt
should just replace instances of two spaces after the first block of space-free text with a single space (though I have not tested this).
(edit: inserted 2 before s in each instance to tie the edit to the second line, specifically.)
Use sed.
for file in *
do
    sed -i '' '2s/  / /' "$file"
done
The -i '' flag means to edit in-place without a backup.
Or use ed!
for file in *
do
    printf "2s/  / /\nwq\n" | ed -s "$file"
done
If the error always occurs at the 2nd line:
for file in file*
do
    awk 'NR==2{$1=$1}1' "$file" > temp
    mv temp "$file"
done
(Reassigning $1 forces awk to rebuild the line with single-space separators.)
or sed
sed -i.bak '2s/  */ /' file* # fix the 2nd line only
Or just pure bash scripting:
i=1
while read -r line
do
    if [ "$i" -eq 2 ]; then
        echo $line    # unquoted on purpose: word splitting squeezes the spaces
    else
        echo "$line"
    fi
    ((i++))
done < "file"
Since it seems every column is separated by one space, another approach not yet mentioned is to use tr to squeeze all runs of multiple spaces into single spaces:
tr -s " " < infile > outfile
I am going to be different and go with AWK:
awk '{print $1,$2,$3}' file.txt > file1.txt
This will handle any number of spaces between fields, and replace them with one space.
To handle a specific line you can add line addresses:
awk 'NR==2{print $1,$2,$3} NR!=2{print $0}' file.txt > file1.txt
i.e. rewrite line 2, but leave unchanged the other lines.
A line address can be a regular expression as well:
awk '/regexp/{print $1,$2,$3} !/regexp/{print}' file.txt > file1.txt
This answer assumes you don't want to mess with any except the second line.
#!/usr/bin/env python
import sys, os

for fname in sys.argv[1:]:
    with open(fname, "r") as fin:
        line1 = fin.readline()
        line2 = fin.readline()
        fixedLine2 = " ".join(line2.split()) + '\n'
        if fixedLine2 == line2:
            continue
        with open(fname + ".fixed", "w") as fout:
            fout.write(line1)
            fout.write(fixedLine2)
            for line in fin:
                fout.write(line)
    # Enable these lines if you want the old files replaced with the new ones.
    #os.remove(fname)
    #os.rename(fname + ".fixed", fname)
I don't quite understand, but yes, sed is an option. I don't think any POSIX compliant version of sed has an in-place option (-i), so a fully POSIX compliant solution would be:
sed -e 's/^BBB  /BBB /' <file> > <newfile>
Use sed:
sed -e 's/[[:space:]][[:space:]]/ /g' yourfile.txt >> newfile.txt
This will replace any two adjacent whitespace characters with one space. The use of [[:space:]] just makes it a little bit clearer.
sed -i -e '2s/  / /g' input.txt
-i: edit files in place
