Bash or python for changing spacing in files

I have a set of 10000 files. In all of them, the second line looks like:
AAA 3.429 3.84
so there is just one space (a requirement) between AAA and the two other columns. The rest of the lines in each file are completely different and correspond to 10 columns of numbers.
Randomly, in around 20% of the files, and due to some errors, one gets
BBB  3.429 3.84
so now there are two spaces between the first and second columns.
This is a big error, so I need to fix it, changing the 2 spaces to 1 in the files where the error occurs.
The first approach I thought of was to write a bash script that, for each file, reads the 3 values of the second line and then prints them with just one space, doing this for all the files.
I wonder what you think of this approach and whether you could suggest something better, in bash, Python or some other approach.
Thanks!

Performing line-based changes to text files is often simplest to do in sed.
sed -e '2s/  */ /g' infile.txt
will replace any runs of multiple spaces with a single space. This may be changing more than you want, though.
sed -e '2s/^\([^ ]*\)  /\1 /' infile.txt
should just replace instances of two spaces after the first block of space-free text with a single space (though I have not tested this).
(edit: inserted 2 before s in each instance to tie the edit to the second line, specifically.)
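Since the question allows Python as well, the same second-line substitution can be sketched there (a minimal sketch; `fix_second_line` is my own illustration, not code from the question):

```python
import re

def fix_second_line(text):
    # Collapse the run of spaces after the first token on line 2 only,
    # leaving every other line untouched.
    lines = text.split('\n')
    if len(lines) >= 2:
        lines[1] = re.sub(r'^(\S+) +', r'\1 ', lines[1])
    return '\n'.join(lines)
```

This mirrors the second sed command: only the spaces immediately after the first block of space-free text are squeezed.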

Use sed.
for file in *
do
sed -i '' '2s/  / /' "$file"
done
The -i '' flag means to edit in-place without a backup.
Or use ed!
for file in *
do
printf "2s/  / /\nwq\n" | ed -s "$file"
done

If the error always occurs on the 2nd line:
for file in file*
do
awk 'NR==2{$1=$1}1' "$file" > temp
mv temp "$file"
done
or sed
sed -i.bak '2s/  */ /' file*   # fix the 2nd line only
Or just pure bash scripting
i=1
while read -r line
do
    if [ "$i" -eq 2 ]; then
        # unquoted on purpose: word splitting squeezes runs of spaces
        echo $line
    else
        echo "$line"
    fi
    ((i++))
done < "file"

Since it seems every column is separated by one space, another approach not yet mentioned is to use tr to squeeze all runs of spaces into single spaces:
tr -s " " < infile > outfile

I am going to be different and go with AWK:
awk '{print $1,$2,$3}' file.txt > file1.txt
This will handle any number of spaces between fields and replace them with one space.
To handle a specific line you can add line addresses:
awk 'NR==2{print $1,$2,$3} NR!=2{print $0}' file.txt > file1.txt
i.e. rewrite line 2, but leave unchanged the other lines.
A line address can be a regular expression as well:
awk '/regexp/{print $1,$2,$3} !/regexp/{print}' file.txt > file1.txt

This answer assumes you don't want to mess with any except the second line.
#!/usr/bin/env python
import sys, os

for fname in sys.argv[1:]:
    with open(fname, "r") as fin:
        line1 = fin.readline()
        line2 = fin.readline()
        fixedLine2 = " ".join(line2.split()) + '\n'
        if fixedLine2 == line2:
            continue
        with open(fname + ".fixed", "w") as fout:
            fout.write(line1)
            fout.write(fixedLine2)
            for line in fin:
                fout.write(line)
    # Enable these lines if you want the old files replaced with the new ones.
    #os.remove(fname)
    #os.rename(fname + ".fixed", fname)

I don't quite understand, but yes, sed is an option. I don't think any POSIX-compliant version of sed has an in-place option (-i), so a fully POSIX-compliant solution would be:
sed -e 's/^BBB  /BBB /' <file> > <newfile>

Use sed:
sed -e 's/[[:space:]][[:space:]]/ /g' yourfile.txt >> newfile.txt
This will replace any two adjacent whitespace characters with one space. The use of [[:space:]] just makes it a little bit clearer.

sed -i -e '2s/  / /g' input.txt
-i: edit files in place

Related

How can I print a grep result with the matched word?

I have two files, "file A":
Adygei
Albanian
Armenia_C
Armenia_Caucasus
Armenia_EBA
Armenia_LBA
Armenia_MBA
Armenian.DG
Austria_EN_HG_LBK
Austria_EN_LBK
And "fileB":
HG01880.SG Aygei_o1.SG
HG01988.SG Adygei_o2.SG
HG02419.SG Albanian_o2.SG
HG01879.SG Albanian.SG
HG01882.SG Armenia_C.SG
HG01883.SG Armenia_C.SG
HG01885.SG Armenia_EBA.SG
HG01886.SG Armenia_EBA.SG
HG01889.SG Armenia_LBA.SG
HG01890.SG Armenia_MBA.SG
What I want in the end is to create a new column (the position of the column doesn't matter) containing the word that matched. Like this:
HG01880.SG Aygei_o1.SG Adygei
HG01988.SG Adygei_o2.SG Adygei
HG02419.SG Albanian_o2.SG Albanian
HG01879.SG Albanian.SG Albanian
HG01882.SG Armenia_C.SG Armenia_C
HG01883.SG Armenia_C.SG Armenia_C
HG01885.SG Armenia_EBA.SG Armenia_EBA
HG01886.SG Armenia_EBA.SG Armenia_EBA
HG01889.SG Armenia_LBA.SG Armenia_LBA
HG01890.SG Armenia_MBA.SG Armenia_MBA
What I used to match both files in bash is grep -wFf fileA fileB > newfileA_B.txt. The answer can be in either Python or bash.
You can try something like this:
for line in $(cat fileA.txt)
do
    echo "$line $(grep "$line" fileB.txt)"
done
Here is an example (probably inefficient) algorithm in Python (using strings instead of files) https://colab.research.google.com/drive/1bUnFXJg0m6FvXRkPybUqWux_reaJRt1c?usp=sharing
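Notebook aside, the core idea can also be sketched inline (a rough sketch of my own, not the notebook's code; it assumes substring matching is what you want):

```python
def append_matches(labels, lines):
    # Append to each fileB line the fileA label it contains, if any.
    # Longer labels are tried first so e.g. Armenia_Caucasus wins
    # over its prefix Armenia_C.
    ordered = sorted(labels, key=len, reverse=True)
    out = []
    for line in lines:
        match = next((lab for lab in ordered if lab in line), '')
        out.append((line + ' ' + match).rstrip())
    return out
```

Lines with no match are left unchanged (no column is appended).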
Perform the grep search once more, but this time add the -o flag, which prints only the matching words. Then use paste to add the result as a column (define the delimiter using the -d flag).
paste -d ' ' <(grep -wFf fileA fileB) <(grep -woFf fileA fileB)

vim and wc give different line counts

I have two csv files that give different results when I use wc -l (65 lines for the first, 66 for the second) and when I use vim file.csv and then :$ to go to the bottom of the file (66 lines for both). I have tried viewing newline characters in vim using :set list, and they look identical.
I created the second file (the one that shows one extra line with wc) from the first using pandas in Python and to_csv.
Is there anything within pandas that might generate new lines, or other bash/vim tools I can use to verify the differences?
If the last character of the file is not a newline, wc won't count the last line:
$ printf 'a\nb\nc' | wc -l
2
In fact, that's how wc -l is documented to work: from man wc
-l, --lines
print the newline counts
^^^^^^^^^^^^^
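A quick way to confirm this directly is to check the file's last byte (a minimal sketch; the function name is my own):

```python
def ends_with_newline(path):
    # wc -l counts '\n' characters, so a file whose last byte is not
    # a newline reports one line fewer than editors such as vim show.
    with open(path, 'rb') as f:
        data = f.read()
    return data.endswith(b'\n')
```

If this returns False for the first csv, that explains the 65 vs 66 discrepancy.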

Split Command - Choose Output Name

I have a text file named myfile.txt. The file contains 50,000 lines and I would like to split it into 50 text files. I know that this is easy with the split command:
split myfile.txt
This will output 50 files of 1000 lines each: xaa, xab, xac, and so on.
My question: how do I run split on my text file so that it names the output files:
1.txt
2.txt
3.txt
...
50.txt
Seeking answers in python or bash please. Thank you!
Here is a potential solution using itertools.islice to get the chunks and string formatting for the different file names:
from itertools import islice

with open('myfile.txt') as in_file:
    for i in range(1, 51):
        with open('{0}.txt'.format(i), 'w') as out_file:
            lines = islice(in_file, 1000)
            out_file.writelines(lines)
It's not exactly what you are looking for, but running
split -d myfile.txt
will output
x00
x01
x02
...
To generate test data in empty directory, you can use
seq 50000 | split -d
To rename in the way that you want, you can use
ls x* | awk '{print $0, (substr($0,2)+1) ".txt"}' | xargs -n2 mv
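The same rename step can be sketched in Python instead of awk/xargs (a sketch of my own, assuming the default x00, x01, ... names that split -d produces):

```python
import os

def rename_chunks(directory='.'):
    # Map split -d's x00, x01, ... chunk names to 1.txt, 2.txt, ...
    for name in sorted(os.listdir(directory)):
        if name.startswith('x') and name[1:].isdigit():
            target = str(int(name[1:]) + 1) + '.txt'
            os.rename(os.path.join(directory, name),
                      os.path.join(directory, target))
```

Run it from the directory containing the chunks, or pass the directory path.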
Here's a funny one: if your split command supports the --filter option, you can use it!
If you call
split --filter=./banana myfile.txt
then the command ./banana will be executed with the environment variable FILE set to the name split would have used for the chunk it's processing. The command receives the chunk on its standard input, and if it returns a non-zero status code, split interrupts its operation.
Together with the -d option, that's exactly what you want: with -d, the names split chooses are x00, x01, etc.
Make a script:
#!/bin/bash
# remove the leading x from FILE
n=${FILE#x}
# check that n is a number
[[ $n = +([[:digit:]]) ]] || exit 1
# remove the leading zeroes from n
n=$((10#$n))
# send stdin to file
cat > "$n.txt"
Call this script banana, chmod +x it and let's go:
split -d --filter=./banana myfile.txt
This --filter option is really funny.
Here's an example of how you could split this file in bash:
split -l 1000 -d --additional-suffix=.txt myfile.txt
The -l argument determines the number of lines included in each split file (1000 in this case, for 50 total files), the -d argument uses numbers instead of letters for the suffixes, and the value we pass to the --additional-suffix argument here gives each file a .txt file extension.
This will create
x00.txt
x01.txt
x02.txt
etc.
If you want to change the 'x' portion of the output filenames, add a prefix after the input file (e.g. adding f would create f00.txt, f01.txt, etc.)
Note that without --additional-suffix, your files will all lack filename extensions.
I've looked to see if there's a way to split a file and name them with only the suffix, but I haven't found anything.
A simple approach:
f = open('your_file')
count_line, file_num = 0, 1
for x in f:
    count_line += 1
    if count_line % 1000 == 1:
        f1 = open(str(file_num) + '.txt', 'w')
        f1.write(x)
        file_num += 1
    elif count_line % 1000 == 0:
        f1.write(x)
        f1.close()
    else:
        f1.write(x)
f.close()

Compare 2 files and remove any lines in file2 when they match values found in file1

I have two files. I am trying to remove any lines in file2 that match values found in file1. One file has a listing like so:
File1
ZNI008
ZNI009
ZNI010
ZNI011
ZNI012
... over 19463 lines
The second file includes lines that match the items listed in the first:
File2
copy /Y \\server\foldername\version\20050001_ZNI008_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI010_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI012_162635.xml \\server\foldername\version\folder\
copy /Y \\server\foldername\version\20050001_ZNI009_162635.xml \\server\foldername\version\folder\
... continues listing until line 51360
What I've tried so far:
grep -v -i -f file1.txt file2.txt > f3.txt
does not produce any output to f3.txt or remove any lines. I verified by running
wc -l file2.txt
and the result is
51360 file2.txt
I believe the reason is that there are no exact matches. When I run the following, it shows nothing:
comm -1 -2 file1.txt file2.txt
Running
( tr '\0' '\n' < file1.txt; tr '\0' '\n' < file2.txt ) | sort | uniq -c | egrep -v '^ +1'
shows only one match, even though I can clearly see there is more than one match.
Alternatively, putting all the data into one file and running the following:
grep -Ev "$(cat file1.txt)" 1>LinesRemoved.log
says the argument has too many lines to process.
I need to remove lines matching the items in file1 from file2.
I am also trying this in Python:
#!/usr/bin/python
s = set()
# load each line of file1 into memory as elements of a set, 's'
f1 = open("file1.txt", "r")
for line in f1:
    s.add(line.strip())
f1.close()
# open file2 and split each line on "_" separator,
# second field contains the value ZNIxxx
f2 = open("file2.txt", "r")
for line in f2:
    if line[0:4] == "copy":
        fields = line.split("_")
        # check if the field exists in the set 's'
        if fields[1] not in s:
            match = line
        else:
            match = 0
    else:
        if match:
            print match, line,
It is not working well, as I'm getting:
Traceback (most recent call last):
  File "./test.py", line 14, in ?
    if fields[1] not in s:
IndexError: list index out of range
What about:
grep -F -v -f file1 file2 > file3
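If you do want to stay in Python, the set-based filtering from your script can be sketched like this (my own sketch; it guards against lines without an underscore, which is what triggered the IndexError in the question):

```python
def filter_lines(keys, lines):
    # Drop any line whose ZNIxxx token (the second '_'-separated
    # field) appears among the keys from file1; lines without an
    # underscore are kept rather than indexed out of range.
    wanted = set(k.strip() for k in keys)
    kept = []
    for line in lines:
        parts = line.split('_')
        if len(parts) > 1 and parts[1] in wanted:
            continue
        kept.append(line)
    return kept
```

Feed it the lines of file1 as keys and the lines of file2, and write out whatever it returns.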
I like the grep solution from byrondrossos better, but here's another option:
sed $(awk '{printf("-e /%s/d ", $1)}' file1) file2 > file3
This is using Bash and GNU sed, because of the -i switch:
cp file2 file3
while read -r; do
sed -i "/$REPLY/d" file3
done < file1
There is surely a better way, but here's a hack around -i :D
cp file2 file3
while read -r; do
(rm file3; sed "/$REPLY/d" > file3) < file3
done < file1
This exploits the shell's evaluation order.
Alright, I guess the correct way with this idea is using ed. This should be POSIX too:
cp file2 file3
while read -r line; do
ed file3 <<EOF
/$line/d
wq
EOF
done < file1
In any case, grep seems to be the right tool for the job;
byrondrossos's answer should work well for you ;)
This is admittedly ugly, but it does work. However, the path must be the same for all of them (except, of course, the ZNI### portion). Everything but the ZNI### part is stripped so that grep -vf can run correctly on the sorted files.
First, convert "testfile2" to "testfileconverted", reducing each line to just the ZNI###:
cat /testfile2 | sed 's:^.*_ZNI:ZNI:g' | sed 's:_.*::g' > /testfileconverted
Second, use an inverse grep of the converted file against "testfile1" and write the reformatted output to "testfile3":
bash -c 'grep -vf <(sort /testfileconverted) <(sort /testfile1)' | sed "s:^:\copy /Y \\\|server\\\foldername\\\version\\\20050001_:g" | sed "s:$:_162635\.xml \\\|server\\\foldername\\\version\\\folder\\\:g" | sed "s:|:\\\:g" > /testfile3

Extracting columns from text file using Perl one-liner: similar to Unix cut

I'm using Windows, and I would like to extract certain columns from a text file using a Perl, Python, batch etc. one-liner.
On Unix I could do this:
cut -d " " -f 1-3 <my file>
How can I do this on Windows?
Here is a Perl one-liner to print the first 3 whitespace-delimited columns of a file. It can be run on Windows (or Unix). Refer to perlrun.
perl -ane "print qq(@F[0..2]\n)" file.txt
You can download the GNU utilities for Windows and use your normal cut/awk etc.
Or, natively, you can use VBScript:
Set objFS = CreateObject("Scripting.FileSystemObject")
Set objArgs = WScript.Arguments
strFile = objArgs(0)
Set objFile = objFS.OpenTextFile(strFile)
Do Until objFile.AtEndOfStream
    strLine = objFile.ReadLine
    sp = Split(strLine, " ")
    s = ""
    For i = 0 To 2
        s = s & " " & sp(i)
    Next
    WScript.Echo s
Loop
Save the above as mysplit.vbs and run it on the command line:
c:\test> cscript //nologo mysplit.vbs file
Or just a simple batch file:
@echo off
for /f "tokens=1,2,3 delims= " %%a in (file) do (echo %%a %%b %%c)
If you want a Python one-liner:
c:\test> type file|python -c "import sys; print '\n'.join(' '.join(i.split()[:3]) for i in sys.stdin)"
That's a rather simple Python script:
for line in open("my file"):
    parts = line.split(" ")
    print " ".join(parts[0:3])
The easiest way to do it would be to install Cygwin and use the Unix cut command.
If you are dealing with a text file that has very long lines and you are only interested in the first 3 columns, then splitting a fixed number of times yourself will be a lot faster than using the -a option:
perl -ne "@F = split /\s/, $_, 4; print qq(@F[0..2]\n)" file.txt
rather than
perl -ane "print qq(@F[0..2]\n)" file.txt
This is because the -a option splits on every whitespace in a line, which can lead to a lot of unnecessary splitting.
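The same trick applies in Python: capping the number of splits avoids tokenizing the whole line (a minimal sketch of my own, not from the answers above):

```python
def first_three_columns(line):
    # maxsplit=3 stops splitting after the third separator, so the
    # (possibly very long) remainder of the line is never tokenized.
    return ' '.join(line.split(None, 3)[:3])
```

split(None, ...) splits on runs of any whitespace, matching Perl's autosplit behavior.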
