I have a huge data frame with column names:
A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,...,GT_n,N_n,E_n
Using unix/bash or python, I want to produce n individual files with the following columns:
A,B,C,D,F,G,H,GT_a,N_a_,E_a
A,B,C,D,F,G,H,GT_b,N_b_,E_b
A,B,C,D,F,G,H,GT_c,N_c_,E_c
....
A,B,C,D,F,G,H,GT_n,N_n_,E_n
Each file should be called: a.txt, b.txt, c.txt,...,n.txt
Here are a couple of solutions with bash tools.
1. bash
Using cut inside a bash loop. This will spawn n processes and parse the file n times.
Update for the case where the _ids in the column names are not just a sequence of letters but arbitrary string ids, repeated every 3 columns after the first 7 columns. We first have to read the header of the file and extract them; a quick solution is to use awk to print every 8th, 11th, etc. column into a bash array.
#!/bin/bash
first=7
#ids=( {a..n} )
ids=( $( head -1 "$1" | awk -F"_" -v RS="," -v f="$first" 'NR>f && (NR+1)%3==0{print $2}' ) )
for i in "${!ids[@]}"; do
    cols="1-$first,$((first+1+3*i)),$((first+2+3*i)),$((first+3+3*i))"
    cut -d, -f"$cols" "$1" > "${ids[i]}.txt"
done
Usage: bash test.sh file
2. awk
Or you can use awk. Here I customize only the number of outputs, but the column ids could also be extracted from the header, as in the first solution.
BEGIN { FS=OFS=","; times=14 }
{
    for (i=1; i<=times; i++) {
        print $1,$2,$3,$4,$5,$6,$7,$(5+3*i),$(6+3*i),$(7+3*i) > sprintf("%c.txt", i+96)
    }
}
Usage: awk -f test.awk file.
This solution should be fast, as it parses the file only once. But it shouldn't be used as-is for a very large number of output files: awk keeps each redirection open, so it could throw a "too many open files" error unless you close() files as you go. For the range of letters here, it should be fine.
This should write out the different files, with a different header for each file. You'll have to change COL_NAMES_TO_WRITE to the column names you want.
It uses only the standard library, so no pandas. It won't write out more than 26 different files, but the filename generator could be changed to allow more.
If I'm interpreting this question correctly, you want to split this into 14 files (a..n).
You'll have to copy this code below into a file, splitter.py
And then run this command:
python3.8 splitter.py -fn largefile.txt -n 14
Where largefile.txt is your huge file that you need to split.
import argparse
import csv
import string

COL_NAMES_TO_WRITE = "A,B,C,D,F,G,H,GT_{letter},N_{letter},E_{letter}"
WRITTEN_HEADERS = set()  # place to keep track of whether headers have been written

def output_file_generator(num):
    if num > 26:
        raise ValueError(f"Can only print out 26 different files, not {num}")
    i = 0
    while True:
        prefix = string.ascii_lowercase[i]
        i = (i + 1) % num  # increment modulo number of files we want
        yield f"{prefix}.txt"

def col_name_generator(num):
    i = 0
    while True:
        col_suffix = string.ascii_lowercase[i]
        i = (i + 1) % num  # increment modulo number of files we want
        print(COL_NAMES_TO_WRITE.format(letter=col_suffix).split(','))
        yield COL_NAMES_TO_WRITE.format(letter=col_suffix).split(',')

def main(filename, num_files=4):
    """Split a file into multiple files

    Args:
        filename (str): large filename that needs to be split into multiple files
        num_files (int): number of files to split filename into
    """
    print(filename)
    with open(filename, 'r') as large_file_fp:
        reader = csv.DictReader(large_file_fp)
        output_files = output_file_generator(num_files)
        col_names = col_name_generator(num_files)
        for line in reader:
            print(line)
            filename_for_this_file = next(output_files)
            print("filename ", filename_for_this_file)
            column_names_for_this_file = next(col_names)
            print("col names:", column_names_for_this_file)
            with open(filename_for_this_file, 'a') as output_fp:
                writer = csv.DictWriter(output_fp, fieldnames=column_names_for_this_file)
                if filename_for_this_file not in WRITTEN_HEADERS:
                    writer.writeheader()
                    WRITTEN_HEADERS.add(filename_for_this_file)
                just_these_fields = {k: v for k, v in line.items() if k in column_names_for_this_file}
                writer.writerow(just_these_fields)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-fn", "--filename", required=True, default='large_file.txt', help="filename of large file to be split")
    parser.add_argument("-n", "--num_files", required=False, default=4, help="number of separate files to split large_file into")
    args = parser.parse_args()
    main(args.filename, int(args.num_files))
import pandas as pd
import numpy as np
c = "A,B,C,D,F,G,H,GT_a,N_a_,E_a,GT_b,N_b_,E_b,GT_c,N_c_,E_c,GT_d,N_d_,E_d,GT_e,N_e_,E_e".split(',')
df = pd.DataFrame(np.full((30, 22), c), columns=c)
c = None
c = list(df.columns)
default = c[:7]
var = np.matrix(c[7:])
var = pd.DataFrame(var.reshape(var.shape[1]//3, 3))
def dump(row):
    cols = default + list(row)
    magic = cols[-1][-1]
    df[cols].to_csv(magic + '.txt')

var.apply(dump, axis=1)
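The frame above is just dummy data built from the header string so the example is self-contained; with the real file you would presumably load it first, along these lines (the filename is an assumption):

import pandas as pd

# Hypothetical: read the actual comma-separated file instead of building a dummy frame
df = pd.read_csv('largefile.txt', sep=',')

Note that df[cols].to_csv(...) as written also includes the row index in each output file; passing index=False to to_csv would drop it.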
I'm trying to make a program using Python.
I want to be able to pipe the program's output through another program:
# EXAMPLE: ./my_python | another_program
Here is the code I have so far.
This code saves the output to a file:
#!/usr/bin/env python
import os, random, string
# This is not my own code
''' As far as I know, it belongs to NullUserException. Was found on stackoverflow.com '''
length = 8
chars = string.ascii_letters.upper() + string.digits
random.seed(os.urandom(1024))
# my code
file_out = open('newRa.txt', 'w')  # Create a 'FILE' to save generated passwords
list1 = []
while len(list1) < 100000:
    list1.append(''.join(random.choice(chars) for i in range(length)))
for item in list1:
    file_out.write('%s\n' % item)
file_out.close()
file_out1 = open('test.txt', 'w')
for x in list1:
    file_out1.write('%s\n' % x[::-1])
This is the code where I'm trying to pipe it through another program:
#!/usr/bin/env python
import os, string, random, sys
length = 8
chars = string.ascii_letters.upper() + string.digits
random.seed(os.urandom(1024))
keep = []
keep1 = []
while len(keep) < 1000:
    keep.append(''.join(random.choice(chars) for i in range(length)))
    print '\n', keep[::-1]
for x in keep:
    keep1.append(x[::-1])
while len(keep1) < 1000:
    print keep1
I have tried chmod and running the script as an executable.
OK, sorry for my lack of Google searching.
sys.stdout is the answer:
#!/usr/bin/env python
import os, string, random, sys
length = 8
chars = string.ascii_letters.upper() + string.digits
random.seed(os.urandom(1024))
keep = []
while len(keep) < 1000:
    keep = ''.join(random.choice(chars) for i in range(length))
    print sys.stdout.write(keep)
    sys.stdout.flush()
I stripped my code down (as it makes it a lot faster), but I'm getting this when I execute my code:
P5DBLF4KNone
DVFV3JQVNone
CIMKZFP0None
UZ1QA3HTNone
How do I get rid of the 'None' at the end?
What have I done to cause this?
Should this be a separate question?
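For what it's worth, the trailing None almost certainly comes from print sys.stdout.write(keep): in Python 2, sys.stdout.write() returns None, and the print statement then prints that return value. A minimal sketch of the loop without the extra print (same idea as above; count and word are new, illustrative names):

#!/usr/bin/env python
import os, string, random, sys

length = 8
chars = string.ascii_letters.upper() + string.digits
random.seed(os.urandom(1024))

count = 0
while count < 1000:
    word = ''.join(random.choice(chars) for i in range(length))
    sys.stdout.write(word + '\n')  # write straight to stdout so it can be piped
    sys.stdout.flush()
    count += 1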
I just want to automate the extraction of small samples from two tsv files. The rows where the samples are taken don't have to be precise; every sample just needs to be evenly spaced. When the cutting happens, the bash shell outputs 'tail: stdout: Broken pipe', although the program still seems to run okay at first. I don't particularly like that my program prints the word 'Broken', but I don't really care. The problem is that each subsequent 'chopping' takes longer, and I can't figure out why. Do I have a memory leak? Is there something I should close? I also don't like having the try/except statement, but I'm not sure of a good way around it.
import os
import sys
import subprocess
import commands
import csv as tsv

def main(scorebreaks, positives, negatives):
    # just to isolate the attributeId
    newpositives = os.path.basename(positives)
    attributeid = newpositives.rstrip('-positive.tsv')
    # create output folder if it doesn't exist
    path_to_script_dir = os.path.dirname(os.path.abspath(positives))
    newpath = path_to_script_dir + '/ezcut_output'
    if not os.path.exists(newpath): os.makedirs(newpath)
    with open(scorebreaks, 'rb') as tsvfile:
        tsvreader = tsv.reader(tsvfile, delimiter='\t')
        scorebreakslist = zip(*(line.strip().split('\t') for line in tsvfile))
    #print scorebreakslist[0][1] #would give line number at .99
    #print scorebreakslist[1][1] #would give .99
    whatiteration = input('What iteration? ')
    chunksize = input('Chunk size? ')
    numberofchunks = int(input('Number of chunks? ')) - 1
    scorejumpamt = 1.0/numberofchunks #number of chunks is 20? score jump amt == .05
    #print scorejumpamt
    scorei = 1.0
    choparray = [100]
    while True: #cause i needed a do-while loop
        scorei = float(scorei) - float(scorejumpamt)
        scorei = '%.2f'%(scorei)
        #print scorei
        if float(scorei) < 0.00: break
        try:
            arraynum = scorebreakslist[1].index(str(scorei))
        except ValueError:
            break
        #print scorebreakslist[1]
        #add the linenumber to an array for use in cutting
        choparray.append(scorebreakslist[0][arraynum])
    #print len(choparray)
    #the actual file manipulation section of code
    index = 0
    for number in choparray:
        indexkinda = 1 - float(scorejumpamt)*float(index)
        indexkinda = '%.2f'%(indexkinda)
        #print indexkinda
        if indexkinda < 0: break
        if float(indexkinda) > 0.50:
            #print indexkinda
            cmd = 'tail -n+%s %s | head -n%s > %s/%s-%s-%s.tsv' % (number, positives, chunksize, newpath, indexkinda, attributeid, whatiteration)
            subprocess.call(cmd, shell=True)
            #subprocess.call(cmd, shell=True)
            index += 1
        else: #maybe make this not get anything below 0.1 for speed
            #print indexkinda
            cmd = 'tail -n+%s %s | head -n%s > %s/%s-%s-%s.tsv' % (number, negatives, chunksize, newpath, indexkinda, attributeid, whatiteration)
            subprocess.call(cmd, shell=True)
            index += 1

main(sys.argv[1], sys.argv[2], sys.argv[3])
You're doing this all wrong. You shouldn't be spawning subprocesses at all, and especially you shouldn't tell them to repeatedly reread the same parts of the file.
Just use the python file class to iterate over lines. Please.
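As a minimal sketch of what that might look like (the function and its arguments are illustrative, not from the answer; it assumes the starting line numbers don't produce overlapping chunks), the whole extraction can be done with one pass over each input file:

def extract_chunks(source_path, starts, chunksize, out_path_for):
    """Read source_path once, writing chunksize lines starting at each
    1-based line number in starts (chunks are assumed not to overlap)."""
    targets = iter(sorted(int(s) for s in starts))
    current = next(targets, None)
    out = None
    remaining = 0
    with open(source_path) as src:
        for lineno, line in enumerate(src, start=1):
            # open the next output file once we reach its starting line
            if out is None and current is not None and lineno >= current:
                out = open(out_path_for(current), 'w')
                remaining = chunksize
                current = next(targets, None)
            if out is not None:
                out.write(line)
                remaining -= 1
                if remaining == 0:
                    out.close()
                    out = None
    if out is not None:
        out.close()

Something like extract_chunks(positives, choparray, int(chunksize), lambda n: '%s/%s.tsv' % (newpath, n)) would then replace the tail | head subprocess calls; the lambda is just a placeholder for whatever output naming you want.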
Here is some text
here is line two of text
I visually select from the first is to the second is in Vim (brackets represent the visual selection [ ]):
Here [is some text
here is] line two of text
Using Python, I can obtain the range tuples of the selection:
function! GetRange()
python << EOF
import vim
buf = vim.current.buffer # the buffer
start = buf.mark('<') # start selection tuple: (1,5)
end = buf.mark('>') # end selection tuple: (2,7)
EOF
endfunction
I source this file with :so %, select the text visually, run :'<,'>call GetRange(), and now I have (1,5) and (2,7). In Python, how can I build the following string:
is some text\nhere is
Would be nice to:
Obtain this string for future manipulation
then replace this selected range with the updated/manipulated string
Try this:
fun! GetRange()
python << EOF
import vim
buf = vim.current.buffer
(lnum1, col1) = buf.mark('<')
(lnum2, col2) = buf.mark('>')
lines = vim.eval('getline({}, {})'.format(lnum1, lnum2))
lines[0] = lines[0][col1:]
lines[-1] = lines[-1][:col2]
print "\n".join(lines)
EOF
endfun
You can use vim.eval to get python values of vim functions and variables.
This would probably work if you used pure vimscript
function! GetRange()
    let @" = substitute(@", '\n', '\\n', 'g')
endfunction
vnoremap ,r y:call GetRange()<CR>gvp
This will convert all newlines into \n in the visual selection and replace the selection with that string.
This mapping yanks the selection into the " register, calls the function (which isn't really necessary since it's only one command), then uses gv to reselect the visual selection and pastes the quote register back onto the selected region.
Note: in vimscript all user-defined functions must start with an uppercase letter.
Here's another version based on Conner's answer. I took qed's suggestion and also added a fix for when the selection is entirely within one line.
import vim

def GetRange():
    buf = vim.current.buffer
    (lnum1, col1) = buf.mark('<')
    (lnum2, col2) = buf.mark('>')
    lines = vim.eval('getline({}, {})'.format(lnum1, lnum2))
    if len(lines) == 1:
        lines[0] = lines[0][col1:col2 + 1]
    else:
        lines[0] = lines[0][col1:]
        lines[-1] = lines[-1][:col2 + 1]
    return "\n".join(lines)
I'm making a small Python script which will create random files in all shapes and sizes, but I can't get it to create large files. I want to be able to create files up to around 8GB in size; I know this would take a long time, but I'm not concerned about that.
The problem is that Python 2.7 will not handle the large numbers I am throwing at it in order to create the random text that will fill my files.
The aim of my code is to create files with random names and extensions, fill the files with a random amount of junk text and save the files. It will keep repeating this until I close the command line window.
import os
import string
import random

ext = ['.zip', '.exe', '.txt', '.pdf', '.msi', '.rar', '.jpg', '.png', '.html', '.iso']

min = raw_input("Enter a minimum file size eg: 112 (meaning 112 bytes): ")
minInt = int(min)
max = raw_input("Enter a maximum file size: ")
maxInt = int(max)

def name_generator(chars=string.ascii_letters + string.digits):
    return ''.join(random.choice(chars) for x in range(random.randint(1, 10)))

def text_generator(chars=string.printable + string.whitespace):
    return ''.join(random.choice(chars) for x in range(random.randint(minInt, maxInt)))

def main():
    fileName = name_generator()
    extension = random.choice(ext)
    file = fileName + extension
    print 'Creating ==> ' + file
    fileHandle = open(file, 'w')
    fileHandle.write(text_generator())
    fileHandle.close()
    print file + ' ==> Was born!'

while 1:
    main()
Any help will be much appreciated!
Make it lazy, as per the following:
import string
import random
from itertools import islice
chars = string.printable + string.whitespace
# make infinite generator of random chars
random_chars = iter(lambda: random.choice(chars), '')
with open('output_file', 'w', buffering=102400) as fout:
    fout.writelines(islice(random_chars, 1000000))  # write 'n' many
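To get to the multi-gigabyte sizes mentioned in the question, the same lazy generator can simply be drained in repeated chunks; a sketch under the assumption that a character is one byte on disk (total_size and chunk are illustrative names):

import random
import string
from itertools import islice

chars = string.printable + string.whitespace
random_chars = iter(lambda: random.choice(chars), '')  # same infinite generator as above

total_size = 8 * 1024 ** 3   # roughly 8GB of characters
chunk = 1024 * 1024          # write about 1MB per iteration
with open('output_file', 'w', buffering=102400) as fout:
    written = 0
    while written < total_size:
        fout.writelines(islice(random_chars, chunk))
        written += chunk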
The problem is not that Python cannot handle large numbers. It can.
However, you try to put the whole file contents in memory at once; you might not have enough RAM for that, and you don't want to do it anyway.
The solution is to use a generator and write the data in chunks:
def text_generator(chars=string.printable + string.whitespace):
    return (random.choice(chars) for x in range(random.randint(minInt, maxInt)))

for char in text_generator():
    fileHandle.write(char)
This is still horribly inefficient though - you want to write your data in blocks of e.g. 10kb instead of single bytes.
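A sketch of what that block-wise writing could look like on top of the generator above (write_in_blocks and the 10KB block size are illustrative, not part of the original answer):

from itertools import islice

def write_in_blocks(file_handle, char_gen, block_size=10 * 1024):
    # drain the character generator in ~10KB blocks instead of one character at a time
    while True:
        block = ''.join(islice(char_gen, block_size))
        if not block:
            break
        file_handle.write(block)

Calling write_in_blocks(fileHandle, text_generator()) would then replace the character-by-character loop.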
A comment about performance: you could improve it by using os.urandom() to generate random bytes and str.translate() to translate them into the range of input characters:
import os
import string

def generate_text(size, chars=string.printable + string.whitespace):
    # make translation table from 0..255 to chars[0..len(chars)-1]
    all_chars = string.maketrans('', '')
    assert 0 < len(chars) <= len(all_chars)
    result_chars = ''.join(chars[b % len(chars)] for b in range(len(all_chars)))
    # generate `size` random bytes and translate them into given `chars`
    return os.urandom(size).translate(string.maketrans(all_chars, result_chars))
Example:
with open('output.txt', 'wb') as outfile:  # use binary mode
    chunksize = 1 << 20   # 1MB
    N = 8 * (1 << 10)     # (N * chunksize) == 8GB
    for _ in xrange(N):
        outfile.write(generate_text(chunksize))
Note: to avoid skewing the random distribution, bytes larger than k*len(chars)-1 returned by os.urandom() should be discarded, where k*len(chars) <= 256 < (k+1)*len(chars).
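A sketch of how that discarding could be done with the same Python 2 maketrans/translate machinery (generate_text_unbiased is an illustrative name, not from the answer):

import os
import string

def generate_text_unbiased(size, chars=string.printable + string.whitespace):
    assert 0 < len(chars) <= 256
    # largest multiple of len(chars) that still fits in the 0..255 byte range
    k = 256 // len(chars)
    limit = k * len(chars)
    all_chars = string.maketrans('', '')
    # bytes below `limit` map onto chars; bytes >= `limit` are deleted outright
    table = string.maketrans(all_chars[:limit],
                             ''.join(chars[b % len(chars)] for b in range(limit)))
    delete = all_chars[limit:]
    result = ''
    while len(result) < size:
        result += os.urandom(size).translate(table, delete)
    return result[:size]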