Split Command - Choose Output Name - python

I have a text file named myfile.txt. The file contains 50,000 lines and I would like to split it into 50 text files. I know that this is easy with the split command:
split myfile.txt
This will output 50 1000-line files: xaa, xab, xac, and so on.
My question is: how do I run split on my text file so that it names the output files:
1.txt
2.txt
3.txt
...
50.txt
Seeking answers in python or bash please. Thank you!

Here is a potential solution using itertools.islice to get the chunks and string formatting for the different file names:
from itertools import islice
with open('myfile.txt') as in_file:
    for i in range(1, 51):
        with open('{0}.txt'.format(i), 'w') as out_file:
            lines = islice(in_file, 1000)
            out_file.writelines(lines)

It's not exactly what you are looking for, but running
split -d myfile.txt
will output
x00
x01
x02
...

To generate test data in an empty directory, you can use
seq 50000 | split -d
To rename in the way that you want, you can use
ls x* | awk '{print $0, (substr($0,2)+1) ".txt"}' | xargs -n2 mv

Here's a funny one: if your split command supports the --filter option, you can use it!
If you call
split --filter=./banana myfile.txt
then the command ./banana will be executed with the environment variable FILE set to the name split would have chosen for the chunk it's processing. The command receives the chunk being processed on its standard input. If the command returns a non-zero status code, split will interrupt its operations.
Together with the -d option, that's exactly what you want: with -d, the names split chooses are x00, x01, x02, etc.
Make a script:
#!/bin/bash
# remove the leading x from FILE
n=${FILE#x}
# check that n is a number
[[ $n = +([[:digit:]]) ]] || exit 1
# remove the leading zeroes from n
n=$((10#$n))
# send stdin to file
cat > "$n.txt"
Call this script banana, chmod +x it and let's go:
split -d --filter=./banana myfile.txt
This --filter option is really funny.

Here's an example of how you could split this file in bash:
split -l 1000 -d --additional-suffix=.txt myfile.txt
The -l argument determines the number of lines included in each split file (1000 in this case, for 50 total files), the -d argument uses numbers instead of letters for the suffixes, and the value we pass to the --additional-suffix argument here gives each file a .txt file extension.
This will create
x00.txt
x01.txt
x02.txt
etc.
If you wanted to change the 'x' portion of the output files, you'd add a prefix after the input file (e.g. myfile.txt f would create f00.txt, f01.txt, etc.).
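For example, combining everything (a sketch based on the same myfile.txt from the question):
split -l 1000 -d --additional-suffix=.txt myfile.txt f
This should produce f00.txt, f01.txt, ..., f49.txt.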
Note that without --additional-suffix, your files will all lack filename extensions.
I've looked to see if there's a way to split a file and name them with only the suffix, but I haven't found anything.

A simple approach:
f = open('your_file')
count_line, file_num = 0, 1
for x in f:
    count_line += 1
    if count_line % 1000 == 1:
        # every 1000th line starts a new numbered output file
        f1 = open(str(file_num) + '.txt', 'w')
        f1.write(x)
        file_num += 1
    elif count_line % 1000 == 0:
        f1.write(x)
        f1.close()
    else:
        f1.write(x)
f.close()

Related

How to create a script to read a file line by line and concatenate them into a string? (Python or Bash)

I am trying to create a script that automates a command multiple times. I have a text file containing links to directories/files, one per line. An example would be:
mv (X) /home/me
The X variable would change for every line in the directory/file text document. The script would execute the same command but change X each time. How would I go about doing this? Can someone point me in the right direction?
I appreciate the help!
Thanks a bunch!
That's a job for xargs:
xargs -d '\n' -I{} mv {} /path < file
xargs will read standard input and, for each element delimited by a newline, substitute the {} part with the element it read and execute mv.
import os
command = "mv {path} /home/me"  # your command example; {path} will be replaced with each path

with open("path_to_file_list.txt", "r") as file:
    paths = [s.strip() for s in file.readlines()]  # assuming each line in the file is a path to a target file; .strip() clears the newlines

for path in paths:
    os.system(command.format(path=path))  # run the command for each file path from the text file
cat file.txt | while read -r x; do
    mv "$x" /home/me/
done

feed a command a comma separated list of file names in a directory, extract a variable motif from file names for labels

I have a directory containing files that look like this:
1_reads.fastq
2_reads.fastq
89_reads.fastq
42_reads.fastq
I would like to feed a comma separated list of these file names to a command from a python program, so the input to the python command would look like this:
program.py -i 1_reads.fastq,2_reads.fastq,89_reads.fastq,42_reads.fastq
Furthermore, I'd like to use the numbers in the file names for a labeling function within the python command such that the input would look like this:
program.py -i 1_reads.fastq,2_reads.fastq,89_reads.fastq,42_reads.fastq -t s1,s2,s89,s42
It's important that the file names and the label IDs are in the same order.
First: This is a very poorly-thought-out calling convention. Don't use it.
However, if you're using software someone else wrote that already has that convention baked in...
#!/bin/bash
IFS=, # use comma as separator
files=( [[:digit:]]*_* )
[[ -e $files || -L $files ]] || { echo "ERROR: No files matching glob exist" >&2; exit 1; }
prefixes=( )
for file in "${files[@]}"; do
    prefixes+=( "s${file%%_*}" )
done
# "exec" only if this is the last command in the script; remove otherwise
exec program.py -i "${files[*]}" -t "${prefixes[*]}"
How this works:
IFS=, causes ${array[*]} to put a comma between each expanded element. Thus, expanding ${files[*]} and ${prefixes[*]} creates comma-separated strings with the contents of each array.
${file%%_*} removes everything after the first _ in a filename, allowing the numbers alone to be extracted.
[[ -e $files || -L $files ]] actually only tests whether the first element in that array exists (as a symlink or otherwise); however, this will always be true if the glob being expanded to form the array matched any files (unless files have been deleted between the two lines' invocation).
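For example, a quick way to see those two expansions in action (illustrative values only):
file=42_reads.fastq
echo "s${file%%_*}"    # prints: s42
IFS=,
files=( 1_reads.fastq 2_reads.fastq 89_reads.fastq )
echo "${files[*]}"     # prints: 1_reads.fastq,2_reads.fastq,89_reads.fastq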
Try this:
program.py $(cd DIR && var=$(ls) && echo $var | tr ' ' ',')
That will pass to program.py the string returned by the command line inside the $(..).
That command line will enter your directory, run ls and store the output in a variable; the unquoted echo then replaces the newlines with spaces (without adding a trailing space) and pipes the result to 'tr', which translates the spaces to commas.
It can be done easily in pure Bash. Make sure you run from within the directory that contains the files.
#!/bin/bash
shopt -s extglob nullglob
# Create an array of files
f=( +([[:digit:]])_reads.fastq )
# Check that there are some files...
if ((${#f[@]}==0)); then
    echo "No files found. Exiting."
    exit
fi
# Create an array of labels, directly from the array f:
# Remove trailing _reads.fastq
l=( "${f[@]%_reads.fastq}" )
# And prepend the letter s
l=( "${l[@]/#/s}" )
# Now the arrays f and l are good: check them:
declare -p f l
# To join the arrays, we'll use eval. Safe because the code is single-quoted!
IFS=, eval 'program.py -i "${f[*]}" -t "${l[*]}"'
Note. The use of eval here is perfectly safe as we're passing a constant string (and it's actually an idiomatic way to join an array without using a subshell or a loop). Don't modify the command, in particular the single quotes.
Thanks to Charles Duffy, who convinced me to add healthy comments about the use of eval.

Pythonic way to send contents of a file to a pipe and count # lines in a single step

Given the > 4 GB file myfile.gz, I need to zcat it into a pipe for consumption by Teradata's fastload. I also need to count the number of lines in the file, and ideally I want to make only a single pass through it. I use awk to output the entire line ($0) to stdout and, through awk's END clause, write the number of rows (awk's NR variable) to another file (outfile).
I've managed to do this using awk but I'd like to know if a more pythonic way exists.
#!/usr/bin/env python
from subprocess import Popen, PIPE
from os import path
the_file = "/path/to/file/myfile.gz"
outfile = "/tmp/%s.count" % path.basename(the_file)
# with shell=True, the whole pipeline is passed as a single string
cmd = 'zcat %s | awk \'{print $0} END {print NR > "%s"} \' ' % (the_file, outfile)
zcat_proc = Popen(cmd, stdout=PIPE, shell=True)
The pipe is later consumed by a call to teradata's fastload, which reads from
"/dev/fd/" + str(zcat_proc.stdout.fileno())
This works, but I'd like to know if it's possible to skip awk and take better advantage of python. I'm also open to other methods. I have multiple large files that I need to process in this manner.
There's no need for either zcat or awk. Counting the lines in a gzipped file can be done with
import gzip
nlines = sum(1 for ln in gzip.open("/path/to/file/myfile.gz"))
If you want to do something else with the lines, such as pass them to a different process, do
nlines = 0
for ln in gzip.open("/path/to/file/myfile.gz"):
    nlines += 1
    # pass the line to the other process
Counting lines and unzipping gzip-compressed files can be easily done with Python and its standard library. You can do everything in a single pass:
import gzip, subprocess, os

fifo_path = "path/to/fastload-fifo"
os.mkfifo(fifo_path)
# Start fastload first so there is a reader on the FIFO; otherwise
# opening the FIFO for writing would block.
fastload = subprocess.Popen(["fastload", "--read-from", fifo_path])
fastload_fifo = open(fifo_path, "w")
with gzip.open("/path/to/file/myfile.gz") as f:
    for i, line in enumerate(f):
        fastload_fifo.write(line)
fastload_fifo.close()
print "Number of lines", i + 1
os.unlink(fifo_path)
I don't know how to invoke Fastload -- substitute the correct parameters in the invocation.
This can be done in one simple line of bash:
zcat myfile.gz | tee >(wc -l >&2) | fastload
This will print the line count on stderr. If you want it somewhere else you can redirect the wc output however you like.
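For example, if you'd rather capture the count in a file (hypothetical filename), you could redirect the wc output there instead:
zcat myfile.gz | tee >(wc -l > line_count.txt) | fastload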
Actually, it should not be possible to pipe the data to Fastload at all, so it would be great if somebody could post an exact example here.
From the Teradata documentation on the Fastload configuration: http://www.info.teradata.com/htmlpubs/DB_TTU_14_00/index.html#page/Load_and_Unload_Utilities/B035_2411_071A/2411Ch03.026.028.html#ww1938556
FILE=filename
Keyword phrase specifying the name of the data source that contains the input data. fileid must refer to a regular file. Specifically, pipes are not supported.

Script to compare a string in two different files

I am brand new to stackoverflow and to scripting. I was looking for help to get started in a script, not necessarily looking for someone to write it.
Here's what I have:
File1.csv - contains some information, I am only interested in MAC addresses.
File2.csv - has some different information, but also contains MAC address.
I need a script that parses the MAC addresses from file1.csv and logs a report if any MAC address shows up in file2.csv.
The questions:
Any tips on the language I use, preferably perl, python or bash?
Can anyone suggest some structure for the logic needed (even if just in pseudo-code)?
update
Using @Adam Wagner's approach, I am really close!
import csv
#Need to strip out NUL values from .csv file to make python happy
class FilteredFile(file):
    def next(self):
        return file.next(self).replace('\x00','').replace('\xff\xfe','')
reader = csv.reader(FilteredFile('wifi_clients.csv', 'rb'), delimiter=',', quotechar='|')
s1 = set(rec[0] for rec in reader)
inventory = csv.reader(FilteredFile('inventory.csv','rb'),delimiter=',')
s2 = set(rec[6] for rec in inventory)
shared_items = s1.intersection(s2)
print shared_items
This always outputs the following (even if I doctor the .csv files to have matching MAC addresses):
set([])
Contents of the csv files
wifi_clients.csv
macNames, First time seen, Last time seen,Power, # packets, BSSID, Probed ESSIDs
inventory.csv
Name,Manufacturer,Device Type,Model,Serial Number,IP Address,MAC Address,...
Here's the approach I'd take:
Iterate over each csv file (python has a handy csv module for accomplishing this), capturing the mac-address and placing it in a set (one per file). And once again, python has a great builtin set type. Here's a good example of using the csv module and, of course, the docs.
Next, you can get the intersection of set1 (file1) and set2 (file2). This will show you mac-addresses that exist in both files one and two.
Example (in python):
s1 = set([1,2,3]) # You can add things incrementally with "s1.add(value)"
s2 = set([2,3,4])
shared_items = s1.intersection(s2)
print shared_items
Which outputs:
set([2, 3])
Logging these shared items could be done with anything from printing (then redirecting output to a file), to using the logging module, to saving directly to a file.
I'm not sure how in-depth of an answer you were looking for, but this should get you started.
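For example, a minimal sketch that writes the shared items to a file (the log filename is just a placeholder):
with open('shared_macs.log', 'w') as log:
    for mac in shared_items:
        log.write(mac + '\n')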
Update: CSV/Set usage example
Assuming you have a file "foo.csv", that looks something like this:
bob,123,127.0.0.1,mac-address-1
fred,124,127.0.0.1,mac-address-2
The simplest way to build the set, would be something like this:
import csv
set1 = set()
for record in csv.reader(open('foo.csv', 'rb')):
    user, machine_id, ip_address, mac_address = record
    set1.add(mac_address)
    # or simply "set1.add(record[3])", if you don't need the other fields.
Obviously, you'd need something like this for each file, so you may want to put this in a function to make life easier.
Finally, if you want to go the less-verbose-but-cooler-python-way, you could also build the set like this:
csvfile = csv.reader(open('foo.csv', 'rb'))
set1 = set(rec[3] for rec in csvfile) # Assuming mac-address is the 4th column.
I strongly recommend python to do this.
'Cause you didn't give the structure of the csv file, I can only show a framework:
def get_MAC_from_file1():
    # ... parse the file to get MAC
    return a_MAC_list

def get_MAC_from_file2():
    # ... parse the file to get MAC
    return a_MAC_list

def log_MACs():
    MAC_list1, MAC_list2 = get_MAC_from_file1(), get_MAC_from_file2()
    for a_MAC in MAC_list1:
        if a_MAC in MAC_list2:
            # ... write your logs
            pass
If the data set is large, use a dict or set instead of the list, together with the intersection operation. But as these are MAC addresses, I guess your dataset is not that large, so keeping the script easy to read is the most important thing.
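For example, a set-based sketch of the same framework (assuming the functions above return lists of MAC strings; the log filename is just a placeholder):
def log_MACs_with_sets():
    macs1 = set(get_MAC_from_file1())
    macs2 = set(get_MAC_from_file2())
    with open('mac_log.txt', 'w') as log:
        for a_MAC in macs1 & macs2:  # set intersection
            log.write(a_MAC + '\n')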
Awk is perfect for this
{
mac = $1 # assuming the mac addresses are in the first column
do_grep = "grep " mac " otherfilename" # we'll use grep to check if the mac address is in the other file
do_grep | getline mac_in_other_file # pipe the output of the grep command into a new variable
close(do_grep) # close the pipe
if(mac_in_other_file != ""){ # if grep found the mac address in the other file
print mac > "naughty_macs.log" # append the mac address to the log file
}
}
Then you'd run that on the first file:
awk -f logging_script.awk mac_list.txt
(this code is untested and I'm not the greatest awk hacker, but it should give the general idea)
For the purpose of the example, generate 2 files that look like yours.
File1:
for i in `seq 100`; do
echo -e "user$i\tmachine$i\t192.168.0.$i\tmac$i";
done > file1.csv
File2 (contains random entries of "mac addresses" numbered from 1-200)
for j in `seq 100`; do
i=$(($RANDOM % 200)) ;
echo -e "mac$i\tmachine$i\tuser$i";
done > file2.csv
The simplest approach would be to use the join command and do a join on the appropriate field. This approach has the advantage that fields from both files are available in the output.
Based on the example files above, the command would look like this:
join -1 4 -2 1 <(sort -k4 file1.csv) <(sort -k1 file2.csv)
join needs the input to be sorted by the field you are matching; that's why the sort is there (-k tells it which column to use).
The command above matches rows from file1.csv with rows from file2.csv if column 4 in the first file is equal with column 1 from the second file.
If you only need specific fields, you can specify the output format to the join command:
join -1 4 -2 1 -o1.4 1.2 <(sort -k4 file1.csv) <(sort -k1 file2.csv)
This would print only the mac address and the machine field from the first file.
If you only need a list of matching mac addresses, you can use uniq or sort -u. Since the join output will be sorted by mac, uniq is faster. But if you need a unique list of another field, sort -u is better.
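For example, building on the join command above, a unique list of the matching MAC addresses could be produced like this (a sketch, assuming the same example files):
join -1 4 -2 1 -o 1.4 <(sort -k4 file1.csv) <(sort -k1 file2.csv) | uniq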
If you only need the mac addresses that match, grep can accept patterns from a file, and you can use cut to extract only the fourth field.
fgrep -f<(cut -f4 file1.csv) file2.csv
The above would list all the lines in file2.csv that contain a mac address from file1.
Note that I'm using fgrep, which doesn't do regex pattern matching. Also, if file1 is big, this may be slower than the first approach, and it assumes that the mac is present only in field 1 of file2 and that the other fields don't contain mac addresses.
If you only need the mac, you can either use the -o option of fgrep (though there are grep variants that don't have it), or pipe the output through cut and then sort -u:
fgrep -f<(cut -f4 file1.csv) file2.csv | cut -f1 | sort -u
This would be the bash way.
Python and awk hints have been shown above; I will take a stab at perl:
#!/usr/bin/perl -w
use strict;
open F1, $ARGV[0];
my %searched_mac_addresses = map {chomp; (split /\t/)[3] => 1 } <F1>;
close F1;
open F2, $ARGV[1];
while (<F2>) {
    print if $searched_mac_addresses{(split "\t")[0]};
}
close F2;
First you create a dictionary containing all the mac addresses from the first file:
my %searched_mac_addresses = map {chomp; (split /\t/)[3] => 1 } <F1>;
<F1> reads all the lines from file1
chomp removes the end of line
split splits the line based on tab, you can use a more complex regexp if needed
() around split force an array context
[3] selects the fourth field
map runs a piece of code for all elements of the array
=> generates a dictionary (hash in perl's terminology) element instead of an array
Then you read line by line the second file, and check if the mac exists in the above dictionary:
while (<F2>) {
print if $searched_mac_addresses{(split "\t")[0]}
}
while (<F2>) will read the file F2, and put each line in the $_ variable
print without any parameters prints the default variable $_
if can be used in postfix form after an instruction
dictionary elements can be accessed via {}
split by default splits the $_ default variable

Bash or python for changing spacing in files

I have a set of 10000 files. In all of them, the second line, looks like:
AAA 3.429 3.84
so there is just one space (requirement) between AAA and the two other columns. The rest of lines on each file are completely different and correspond to 10 columns of numbers.
Randomly, in around 20% of the files, and due to some errors, one gets
BBB  3.429 3.84
so now there are two spaces between the first and second column.
This is a big error so I need to fix it, changing from 2 to 1 space in the files where the error takes place.
The first approach I thought of was to write a bash script that for each file reads the 3 values of the second line and then prints them with just one space, doing it for all the files.
I wonder what you think about this approach and whether you could suggest something better: bash, python or some other approach.
Thanks
Performing line-based changes to text files is often simplest to do in sed.
sed -e '2s/  */ /g' infile.txt
will replace any runs of multiple spaces with a single space. This may be changing more than you want, though.
sed -e '2s/^\([^ ]*\)  /\1 /' infile.txt
should just replace instances of two spaces after the first block of space-free text with a single space (though I have not tested this).
(edit: inserted 2 before s in each instance to tie the edit to the second line, specifically.)
Use sed.
for file in *
do
    sed -i '' '2s/  / /' "$file"
done
The -i '' flag means to edit in-place without a backup.
Or use ed!
for file in *
do
    printf "2s/  / /\nwq\n" | ed -s "$file"
done
If the error always occurs at the 2nd line,
for file in file*
do
    awk 'NR==2{$1=$1}1' "$file" > temp
    mv temp "$file"
done
or sed
sed -i.bak '2s/  */ /' file*  # do 2nd line
Or just pure bash scripting
i=1
while read -r line
do
    if [ "$i" -eq 2 ]; then
        echo $line      # unquoted on purpose: word splitting squeezes the extra space
    else
        echo "$line"
    fi
    ((i++))
done < "file"
Since it seems every column is separated by one space, another approach not yet mentioned is to use tr to squeeze all multi spaces into single spaces:
tr -s " " < infile > outfile
I am going to be different and go with AWK:
awk '{print $1,$2,$3}' file.txt > file1.txt
This will handle any number of spaces between fields, and replace them with one space
To handle a specific line you can add line addresses:
awk 'NR==2{print $1,$2,$3} NR!=2{print $0}' file.txt > file1.txt
i.e. rewrite line 2, but leave the other lines unchanged.
A line address can be a regular expression as well:
awk '/regexp/{print $1,$2,$3} !/regexp/{print}' file.txt > file1.txt
This answer assumes you don't want to mess with any except the second line.
#!/usr/bin/env python
import sys, os
for fname in sys.argv[1:]:
    with open(fname, "r") as fin:
        line1 = fin.readline()
        line2 = fin.readline()
        fixedLine2 = " ".join(line2.split()) + '\n'
        if fixedLine2 == line2:
            continue
        with open(fname + ".fixed", "w") as fout:
            fout.write(line1)
            fout.write(fixedLine2)
            for line in fin:
                fout.write(line)
    # Enable these lines if you want the old files replaced with the new ones.
    #os.remove(fname)
    #os.rename(fname + ".fixed", fname)
I don't quite understand, but yes, sed is an option. I don't think any POSIX-compliant version of sed has an in-place option (-i), so a fully POSIX-compliant solution would be:
sed -e 's/^BBB  /BBB /' <file> > <newfile>
Use sed:
sed -e 's/[[:space:]][[:space:]]/ /g' yourfile.txt >> newfile.txt
This will replace any two adjacent spaces with one. The use of [[:space:]] just makes it a little bit clearer.
sed -i -e '2s/  / /g' input.txt
-i: edit files in place
