vim and wc give different line counts - python

I have two csv files that give different results: wc -l reports 65 lines for the first and 66 for the second, while opening each file with vim file.csv and using :$ to go to the bottom shows 66 lines for both. I have tried viewing newline characters in vim using :set list and they look identical.
The second file (the one that shows one extra line with wc) was created from the first using pandas in Python and to_csv.
Is there anything within pandas that might generate new lines, or are there other bash/vim tools I can use to verify the differences?

If the last character of the file is not a newline, wc won't count the last line:
$ printf 'a\nb\nc' | wc -l
2
In fact, that's how wc -l is documented to work. From man wc:
-l, --lines
print the newline counts
^^^^^^^^^^^^^
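This also answers the pandas part of the question: to_csv terminates every row it writes, including the last one, so the generated file ends with a newline even when the source file did not. A minimal sketch for checking both files from Python (the file names are placeholders for yours):

# Count '\n' bytes, which is exactly what wc -l reports, and check
# whether the file ends with a newline.
for path in ("first.csv", "second.csv"):  # hypothetical file names
    with open(path, "rb") as f:
        data = f.read()
    print(path, "newlines:", data.count(b"\n"),
          "ends with newline:", data.endswith(b"\n"))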


How can I print a grep result with the matched word?

I have two files, "file A":
Adygei
Albanian
Armenia_C
Armenia_Caucasus
Armenia_EBA
Armenia_LBA
Armenia_MBA
Armenian.DG
Austria_EN_HG_LBK
Austria_EN_LBK
And "fileB":
HG01880.SG Aygei_o1.SG
HG01988.SG Adygei_o2.SG
HG02419.SG Albanian_o2.SG
HG01879.SG Albanian.SG
HG01882.SG Armenia_C.SG
HG01883.SG Armenia_C.SG
HG01885.SG Armenia_EBA.SG
HG01886.SG Armenia_EBA.SG
HG01889.SG Armenia_LBA.SG
HG01890.SG Armenia_MBA.SG
What I want at the end is to create a new column (the position of the column doesn't matter) containing the word that matched. Like this:
HG01880.SG Aygei_o1.SG Adygei
HG01988.SG Adygei_o2.SG Adygei
HG02419.SG Albanian_o2.SG Albanian
HG01879.SG Albanian.SG Albanian
HG01882.SG Armenia_C.SG Armenia_C
HG01883.SG Armenia_C.SG Armenia_C
HG01885.SG Armenia_EBA.SG Armenia_EBA
HG01886.SG Armenia_EBA.SG Armenia_EBA
HG01889.SG Armenia_LBA.SG Armenia_LBA
HG01890.SG Armenia_MBA.SG Armenia_MBA
What I used to match both files in bash is grep -wFf fileA fileB > newfileA_B.txt. The solution can be in either python or bash.
You can try something like this (a while read loop avoids the word splitting that for line in $(cat fileA.txt) would do):
while read -r line
do
    echo "$line $(grep "$line" fileB.txt)"
done < fileA.txt
Here is an example (probably inefficient) algorithm in Python (using strings instead of files) https://colab.research.google.com/drive/1bUnFXJg0m6FvXRkPybUqWux_reaJRt1c?usp=sharing
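For reference, here is a self-contained sketch of the same idea without the notebook. It assumes a name from fileA counts as a match when it is a prefix of the second column of fileB, and it keeps the longest such name so that Armenia_C does not win over Armenia_Caucasus; that is a hedged reading of the grep -wF match, not an exact reimplementation:

# Read the population names from fileA, then tag each line of fileB
# with the longest name that prefixes its second column.
with open("fileA") as fa:
    names = [line.strip() for line in fa if line.strip()]

with open("fileB") as fb:
    for line in fb:
        fields = line.split()
        matches = [n for n in names if fields[1].startswith(n)]
        if matches:
            print(line.rstrip(), max(matches, key=len))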
Perform the grep search once more, but this time add the flag -o, which lists only the matching words. Then use paste to add them as a column (define the delimiter using the -d flag).
paste -d ' ' <(grep -wFf fileA fileB) <(grep -woFf fileA fileB)

load the data with numpy.loadtxt and replace a new line in a file with command

This is my first question here.
I tried to load a data file with Python.
The file demo.txt is similar to the one below.
12,23,34.5,56,
78,29,33,
44,55,66,78,59,100
(The number of lines differs between files, and the number of columns in each line may also differ. I need to work on many data files.)
numpy.loadtxt("demo.txt",delimiter=",")
gives the error message "could not convert string to float:".
To fix this problem, I tried to use the command
sed -i -e 's/,\n/,/g' demo.txt
to remove the line breaks at the end of each line and combine all lines into a single line. But it failed (sed reads its input line by line, so the trailing \n never appears in the pattern space for the substitution to match).
However, in the VIM, it is OK to use ":s/,\n/,/g" to remove the line breaks.
Thus, my questions are
is it possible to load the data file in python without modifying the files?
if not, how can I use a command like "sed" to remove the line breaks at the end of each line and combine all lines into one single line? (I need to put this command into my script to handle a bunch of files, so a shell command like "sed" is necessary.) Without the line breaks, I can read the data with numpy.loadtxt easily.
Best regards,
Yiping
Remove all newlines from a file with tr -d '\n':
$ echo -e "some\nfile\nwith\n\newlines" > file_with_newlines
$ cat file_with_newlines
some
file
with
ewlines
$ tr -d '\n' < file_with_newlines > file_without_newlines
$ cat file_without_newlines
somefilewithewlines$
I don't know if this will actually help you with your numpy problem, but it will remove all the (UNIX) newlines from a file.
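On the first question (loading without modifying the files): one option is to do the joining in Python and hand numpy a flat list of floats. A minimal sketch, assuming all values should end up in a single 1-D array and that the trailing commas just produce empty fields to drop:

import numpy as np

# Join all lines and drop the empty fields left by trailing commas,
# so the ragged rows no longer matter.
with open("demo.txt") as f:
    text = f.read().replace("\n", "")
data = np.array([float(tok) for tok in text.split(",") if tok.strip()])
print(data)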

delete specific line numbers from history

The command grep -n blink ~/.bash_history outputs all lines that contain blink. I need a command that outputs only the line numbers and then runs history -d on each of them.
In python:
import os

# list generated from the grep command
linenumbers = [1, 2, 3, 4, 5]
# delete in reverse so earlier offsets are not shifted by each deletion
for count in reversed(linenumbers):
    os.system("history -d {}".format(count))
How do I do this?
In bash:
for offset in $(history | awk '/blink/ {print $1}' | tac)
do
    history -d $offset
done
You can get the offsets directly from the history command; there is no need to generate line numbers with grep. Also, you need to delete the lines in reverse (hence the use of tac), because the offsets of the commands following the one being deleted are shifted down.
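Note that history is a shell builtin, so calling history -d from Python via os.system only affects a throwaway child shell. If you want to do this from Python anyway, a hedged alternative is to rewrite the history file itself and filter out the offending lines (this assumes a plain ~/.bash_history with no timestamp lines, and the current session would still need history -r to reload it):

import os

hist = os.path.expanduser("~/.bash_history")
with open(hist) as f:
    lines = f.readlines()
# keep only the lines that do not mention "blink"
with open(hist, "w") as f:
    f.writelines(line for line in lines if "blink" not in line)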

Python code for number of lines returning much higher number than Linux `wc -l`

When I do wc -l on a file in Linux (a CSV file of a couple million rows), it reports a number of lines that is lower by over a thousand than what this Python code shows (it simply iterates over the lines in the file). What would be the reason for that?
with open(csv) as csv_lines:
    num_lines = 0
    for line in csv_lines:
        num_lines += 1
print(num_lines)
I've had cases where wc reports one less than the above, which makes sense when the file has no terminating newline character: wc counts complete lines (including the terminating newline) while this code counts any lines. But what would be the cause of a difference of over a thousand lines?
I don't know much about line endings and things like that, so maybe I've misunderstood how wc and this Python code count lines; maybe someone could clarify. In linux lines counting not working with python code it says that wc works by counting the number of \n characters in the file. But then what is this Python code doing exactly?
Is there a way to reconcile the difference in numbers to figure out exactly what is causing it, like a way to count lines from Python in the same way that wc does?
The file was possibly generated on a different platform than Linux; I am not sure if that might be related.
Since you are using print(num_lines) I'm assuming you are using Python 3.x; I've used Python 3.4.2 as an example.
The reason for the different line counts comes from the fact that a file opened with open(<name>) treats \r and \n on their own as line endings, as well as the \r\n combination (docs, the universal newlines part). This leads to the following:
>>> with open('test', 'w') as f:
...     f.write('\r\r\r\r')
>>> with open('test') as f:
...     print(sum(1 for _ in f))
4
whilst wc -l gives:
$ wc -l test
0 test
The \r character was used as the newline on, for example, old Macintosh systems.
If you would like to split only on \n characters, use the newline keyword argument to open:
>>> with open('test', 'w') as f:
...     f.write('\r\r\r\r')
>>> with open('test', newline='\n') as f:
...     print(sum(1 for _ in f))
1
The 1 comes from the fact you've already mentioned: there is not a single \n character in the file, so wc -l returns 0, while Python counts the remaining partial line as a single line.
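If you want a count from Python that matches wc -l exactly (the reconciliation asked about), a minimal sketch is to open the file in binary mode, so no newline translation happens, and count the \n bytes directly; file.csv stands in for your file:

# Count b'\n' bytes in fixed-size chunks; this is what wc -l reports.
with open("file.csv", "rb") as f:
    wc_count = sum(chunk.count(b"\n")
                   for chunk in iter(lambda: f.read(1 << 20), b""))
print(wc_count)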
Try taking a part of the file and repeating the line counting. For example:
# take first 10000 lines
head -10000 file.csv > file_head.csv
# take last 10000 lines
tail -10000 file.csv > file_tail.csv
# take first 100MB
dd if=file.csv of=file_100M.csv bs=1M count=100

Bash or python for changing spacing in files

I have a set of 10000 files. In all of them, the second line looks like:
AAA 3.429 3.84
so there is just one space (a requirement) between AAA and the two other columns. The rest of the lines in each file are completely different and correspond to 10 columns of numbers.
Randomly, in around 20% of the files, and due to some errors, one gets
BBB  3.429 3.84
so now there are two spaces between the first and second column.
This is a big error so I need to fix it, changing from 2 to 1 space in the files where the error takes place.
The first approach I thought of was to write a bash script that, for each file, reads the 3 values of the second line and then prints them with just one space between them.
I wonder what you think about this approach and whether you could suggest something better, in bash, python or some other approach.
Thanks
Performing line-based changes to text files is often simplest to do in sed.
sed -e '2s/  */ /g' infile.txt
will replace any runs of multiple spaces with a single space. This may be changing more than you want, though.
sed -e '2s/^\([^ ]*\)  /\1 /' infile.txt
should just replace instances of two spaces after the first block of space-free text with a single space (though I have not tested this).
(edit: inserted 2 before s in each instance to tie the edit to the second line, specifically.)
Use sed.
for file in *
do
    sed -i '' '2s/  / /' "$file"
done
The -i '' flag means to edit in place without a backup. (This is BSD/macOS sed syntax; with GNU sed, use plain -i.)
Or use ed!
for file in *
do
    printf "2s/  / /\nwq\n" | ed -s "$file"
done
If the error always occurs at the 2nd line:
for file in file*
do
    awk 'NR==2{$1=$1}1' "$file" > temp
    mv temp "$file"
done
or sed:
sed -i.bak '2s/  */ /' file*   # fix the 2nd line only
Or just pure bash scripting:
i=1
while read -r line
do
    if [ "$i" -eq 2 ]; then
        echo $line    # deliberately unquoted: word splitting squeezes the extra spaces
    else
        echo "$line"
    fi
    ((i++))
done < "file"
Since it seems every column is separated by one space, another approach not yet mentioned is to use tr to squeeze all runs of multiple spaces into single spaces:
tr -s " " < infile > outfile
I am going to be different and go with AWK:
awk '{print $1,$2,$3}' file.txt > file1.txt
This will handle any number of spaces between fields and replace them with one space.
To handle a specific line you can add line addresses:
awk 'NR==2{print $1,$2,$3} NR!=2{print $0}' file.txt > file1.txt
i.e. rewrite line 2, but leave unchanged the other lines.
A line address can be a regular expression as well:
awk '/regexp/{print $1,$2,$3} !/regexp/{print}' file.txt > file1.txt
This answer assumes you don't want to mess with any line except the second.
#!/usr/bin/env python
import sys, os

for fname in sys.argv[1:]:
    with open(fname, "r") as fin:
        line1 = fin.readline()
        line2 = fin.readline()
        # collapse any runs of whitespace in line 2 to single spaces
        fixedLine2 = " ".join(line2.split()) + '\n'
        if fixedLine2 == line2:
            continue
        with open(fname + ".fixed", "w") as fout:
            fout.write(line1)
            fout.write(fixedLine2)
            for line in fin:
                fout.write(line)
    # Enable these lines if you want the old files replaced with the new ones.
    #os.remove(fname)
    #os.rename(fname + ".fixed", fname)
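Assuming the script is saved as fixspacing.py (a hypothetical name), running python fixspacing.py * would then produce a .fixed copy of every file whose second line actually needed the change.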
I don't quite understand, but yes, sed is an option. I don't think any POSIX-compliant version of sed has an in-place option (-i), so a fully POSIX-compliant solution would be:
sed -e 's/^BBB  /BBB /' <file> > <newfile>
Use sed:
sed -e 's/[[:space:]][[:space:]]/ /g' yourfile.txt >> newfile.txt
This will replace any two adjacent whitespace characters with one space. The use of [[:space:]] just makes it a little bit clearer.
sed -i -e '2s/  / /g' input.txt
-i: edit files in place