import os, random

def randomizer(input, sample_size, output='NM_controls.xls'):
    # Input file with query
    query = open(input, 'r').read().split('\n')
    dir, file = os.path.split(input)
    temp = os.path.join(dir, output)
    out_file = open(temp, 'w')
    name_list = 833  #### got this from input
    output_amount = name_list / sample_size  #### = 3.6 but I want the floor value, so it's fine
I'm writing a function right now, and the number of outputs depends on the input. My function takes in a file, scans it, and partitions the names and other data.
The next part takes mutually exclusive random samples from this list. I want it to generate a certain number of files, but that depends on how many names are in the list.
Is there any way to use the 'os' module to create a certain number of lists that I did not specify in variables?
In this case it would be 3 outputs: ['output_1.txt', 'output_2.txt', 'output_3.txt']
Sorry for the confusion! I'm using the os module because I want to create non-existent files in the same directory that the input file is located in; that is the only way I know how to do it.
If you want to "create a certain number of lists" (where that number isn't known until runtime), that's easy:
my_lists = []
for _ in range(certain_number):
    my_lists.append(make_another_list())
Or, even more simply:
my_lists = [make_another_list() for _ in range(certain_number)]
Or, if you have a dynamic number of input lists, and want to create one output list for each, you don't even need to calculate the number:
my_lists = [transform_a_list(input_list) for input_list in input_lists]
And obviously, there is nothing list-specific here. Other than the word list appearing in the names of the make_another_list and transform_a_list functions, I'm not making any use of the fact that the things you're creating a certain number of are lists. You can create a certain number of strings, numbers, arrays, threads, files, databases, etc. the same way.
If you want to write to output_amount files (as indicated by the out_file variable), then the following should work.
for n in range(output_amount):
    outfile = open('%s_%d.txt' % (temp, n), 'w')
    # Now write the data to output file n
If you want to create a list with n items, see abarnert's answer.
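Putting the two answers together, a minimal sketch of the whole function might look like the following. The random.shuffle call (shuffle once, then slice, so the samples never overlap) is an assumption about how the mutually exclusive sampling should work, and the output_N.txt naming follows the question:

import os
import random

def randomizer(input_path, sample_size):
    # Read the names, dropping empty lines
    with open(input_path) as f:
        names = [line for line in f.read().split('\n') if line]

    random.shuffle(names)  # shuffle once, then slice: the samples never overlap
    output_amount = len(names) // sample_size  # floor division

    directory = os.path.dirname(input_path)
    for n in range(output_amount):
        chunk = names[n * sample_size:(n + 1) * sample_size]
        out_path = os.path.join(directory, 'output_%d.txt' % (n + 1))
        with open(out_path, 'w') as out:
            out.write('\n'.join(chunk))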
Related
This is in relation to web scraping, specifically Scrapy. I want to be able to iterate an expression to create my items. As an example, let's say I import the item class as "item". In order to then store an item, I would have to code something like:
item['item_name'] = response.xpath('xpath')
My xpath actually comes from a function, so it looks something like:
item['item_name'] = eval(xpath_function(n))
This works perfectly. However, how can I iterate this to create multiple items with different names without having to manually name each one? The code below does not work at all (and I didn't expect it to), but should give you an idea of what I am trying to accomplish:
for n in range(1, 10):
    f"item['item_name{n}'] = eval(xpath_function(n))"
Basically I'm trying to create 10 different items named item_name1 - item_name10. Hope that makes sense, and I appreciate any help.
If you are just creating keys for your dictionary based on the value of n you could try something like:
for n in range(10):
    item['item_name' + str(n+1)] = eval(xpath_function(n+1))
If you need to format the number (e.g. include leading zeros), you could use an f-string rather than concatenating the strings as I did.
[NB your for loop as written will only run from 1 to 9, so I have changed this in my answer.]
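For example, a zero-padded version might look like this (a sketch; xpath_function and eval are carried over from the question's code):

item = {}  # or your scrapy Item instance
for n in range(10):
    # keys become item_name01, item_name02, ..., item_name10
    item[f'item_name{n + 1:02d}'] = eval(xpath_function(n + 1))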
I have a program where I would like to randomly pull a line from a song and string it together with lines from other songs. Looking into it, I saw that the dateutil library might be able to help me parse multiple variables from a string, but it doesn't do quite what I want.
I have multiple strings like this, only much longer.
"This is the day\nOf the expanding man\nThat shape is my shade\nThere where I used to stand\nIt seems like only yesterday\nI gazed through the glass\n..."
I want to randomly pull one line from this string (up to the line break) and save it as a variable, but iterate this over multiple strings. Any help would be much appreciated.
Assuming you want to pull one line at random from the string, you can use choice from the random module.
random.choice(seq): Return a random element from the non-empty sequence seq. If seq is empty, raises IndexError.
from random import choice
data = "This is the day\nOf the expanding man\nThat shape is my shade\nThere where I used to stand\nIt seems like only yesterday\nI gazed through the glass\n..."
my_lines = data.splitlines()
my_line = choice(my_lines)
print(my_line)
OUTPUT
That shape is my shade
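To cover the "iterate this over multiple strings" part of the question, you can wrap the same call in a list comprehension. Here song_one, song_two and song_three are hypothetical placeholders for your longer strings:

from random import choice

songs = [song_one, song_two, song_three]  # your full lyric strings
random_lines = [choice(song.splitlines()) for song in songs]  # one random line per song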
The following is code I have written that tries to open individual files, which are long strips of data, and read them into an array. Essentially I have files that run over 15 times (24 hours to 360 hours), and each file has an iteration of 50, hence the two loops. I then try to open the files into an array. When I try to print a specific element in the array, I get the error "'file' object has no attribute '__getitem__'". Any ideas what the problem is? Thanks.
#!/usr/bin/python
############################################
#
import csv
import sys
import numpy as np
import scipy as sp
#
#############################################
level = input("Enter a level: ")
LEVEL = str(level)
MODEL = raw_input("Enter a model: ")
NX = 360
NY = 181
date = 201409060000
DATE = str(date)
#############################################
FileList = []
data = []
for j in range(1,51,1):
    J = str(j)
    for i in range(24,384,24):
        I = str(i)
        fileName = '/Users/alexg/ECMWF_DATA/DAT_FILES/'+MODEL+'_'+LEVEL+'_v_'+J+'_FT0'+I+'_'+DATE+'.dat'
        FileList.append(fileName)
        fo = open(fileName,"rb")
        data.append(fo)
        fo.close()
print data[1][1]
print FileList
EDITED TO ADD:
Below, find the CORRECT array that the python script should be producing (sorry, it won't let me post this inline yet):
http://i.stack.imgur.com/ItSxd.png
The problem I now run into is that the first three values in the first row of the output matrix are:
-7.090874
-7.004936
-6.920952
These values are actually the first three values of the 11th row in the array below, which is how it should look (performed in MATLAB). The next three values the python script outputs (as what it believes to be the second row) are:
-5.255577
-5.159874
-5.064171
These values should be found in the 22nd row. In other words, Python is placing the 11th row of values in the first position, the 22nd in the second, and so on. I don't have a clue as to why, or where in the code I'm specifying that it do this.
You're appending the file objects themselves to data, not their contents:
fo = open(fileName,"rb");
data.append(fo);
So, when you try to print data[1][1], data[1] is a file object (a closed file object, to boot, but it would be just as broken if still open), so data[1][1] tries to treat that file object as if it were a sequence, and file objects aren't sequences.
It's not clear what format your data are in, or how you want to split it up.
If "long strips of data" just means "a bunch of lines", then you probably wanted this:
data.append(list(fo))
A file object is an iterable of lines, it's just not a sequence. You can copy any iterable into a sequence with the list function. So now, data[1][1] will be the second line in the second file.
(The difference between "iterable" and "sequence" probably isn't obvious to a newcomer to Python. The tutorial section on Iterators explains it briefly, the Glossary gives some more information, and the ABCs in the collections module define exactly what you can do with each kind of thing. But briefly: An iterable is anything you can loop over. Some iterables are sequences, like list, which means they're indexable collections that you can access like spam[0]. Others are not, like file, which just reads one line at a time into memory as you loop over it.)
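A quick way to see the difference for yourself (some_file.txt is a hypothetical text file with at least two lines):

fo = open('some_file.txt')
try:
    print(fo[1])             # raises TypeError: a file is iterable, but not a sequence
except TypeError:
    fo.seek(0)               # rewind, then copy the iterable into a real sequence
    print(list(fo)[1])       # indexing the list works: this is the second line
fo.close()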
If, on the other hand, you actually imported csv for a reason, you more likely wanted something like this:
reader = csv.reader(fo)
data.append(list(reader))
Now, data[1][1] will be a list of the columns from the second row of the second file.
Or maybe you just wanted to treat it as a sequence of characters:
data.append(fo.read())
Now, data[1][1] will be the second character of the second file.
There are plenty of other things you could just as easily mean, and easy ways to write each one of them… but until you know which one you want, you can't write it.
Suppose you have a data file which includes several data sets separated by the string "--" in the following format:
--
<x0_val> <y0_val>
<x1_val> <y1_val>
<x2_val> <y2_val>
--
<x0_val> <y0_val>
<x1_val> <y1_val>
<x2_val> <y2_val>
...
How can you read the whole file into an array of arrays so that you can plot all data sets afterwards into the same figure, with a for loop looping over the outer array?
genfromtxt('data.dat', delimiter=("--"))
gives lots of
Line #1550 (got 1 columns instead of 2)
I will update ...
I would first split the file into multiple files, which can reside in memory as objects or on the filesystem as new files.
You can locate the separator string -- with the re module.
Then you can use the link I posted above.
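A minimal sketch of that idea, assuming the separators sit on their own lines as in the question (Python 3; on Python 2 the StringIO import differs):

import re
from io import StringIO

import numpy as np

with open('data.dat') as f:
    text = f.read()

# split on lines consisting solely of "--" and drop empty chunks
chunks = [c for c in re.split(r'(?m)^--\s*$', text) if c.strip()]
datasets = [np.genfromtxt(StringIO(c.strip())) for c in chunks]

Each element of datasets is then a two-column array, so a single for loop over datasets can plot every set into the same figure.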
If you're 100% certain that you have no negative values in your file, you can try a quick:
np.genfromtxt(your_file, comments="-")
The comments="-" argument will force genfromtxt to ignore all the characters after -, which of course will give weird results if you have negative values. Moreover, the result will be just a lump of your dataset in a single array.
Otherwise, the safest route is to iterate on your file and store the lines that do not match -- in one list per block, something along the lines:
blocks = []
current = []
for line in your_file:
    if line.startswith("--"):   # a separator line, not a negative number
        blocks.append(np.array(current, dtype=float))
        current = []
    else:
        current.append(line.split())
blocks.append(np.array(current, dtype=float))   # don't forget the last block
You may have to get rid of the first block if it is empty (it will be, if your file starts with --).
You could also check an mmap-based solution already posted.
I have to check for the presence of millions of elements (strings of 20-30 letters) in a list containing 10-100k of those elements. Is there a faster way of doing that in Python than set()?
import sys
# load ids
ids = set(x.strip() for x in open(idfile))
for line in sys.stdin:
    id = line.strip()
    if id in ids:
        # print fastq
        print id
        # update ids
        ids.remove(id)
set is as fast as it gets.
However, if you rewrite your code to create the set once, and not change it, you can use the frozenset built-in type. It's exactly the same, except immutable.
If you're still having speed problems, you need to speed your program up in other ways, such as by using PyPy instead of CPython.
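For example, the set from the question could be built once as a frozenset (idfile as in the question; note that the ids.remove(id) line would then have to go, since a frozenset cannot be modified):

ids = frozenset(x.strip() for x in open(idfile))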
As I noted in my comment, what's probably slowing you down is that you're sequentially checking each line from sys.stdin for membership of your 'master' set. This is going to be really, really slow, and doesn't allow you to make use of the speed of set operations. As an example:
#!/usr/bin/env python
import random
# create two million-element sets of random numbers
a = set(random.sample(xrange(10000000),1000000))
b = set(random.sample(xrange(10000000),1000000))
# a intersection b
c = a & b
# a difference c
d = list(a - c)
print "set d is all remaining elements in a not common to a intersection b"
print "length of d is %s" % len(d)
The above runs in ~6 wallclock seconds on my five-year-old machine, and it's testing for membership in larger sets than you require (unless I've misunderstood you). Most of that time is actually taken up creating the sets, so you won't even have that overhead. The fact that the strings you refer to are long isn't relevant here; creating a set creates a hash table, as agf explained. I suspect (though again, it's not clear from your question) that if you can get all your input data into a set before you do any membership testing, it'll be a lot faster than reading it in one item at a time and then checking for set membership.
You should try to split your data to make the search faster. A tree structure would allow you to find very quickly whether the data is present or not.
For example, start with a simple map that links the first letter to all the keys starting with that letter; then you don't have to search all the keys, only a smaller part of them.
This would look like:
ids = {}
for id in open(idfile):
    id = id.strip()
    ids.setdefault(id[0], set()).add(id)

for line in sys.stdin:
    id = line.strip()
    if id in ids.get(id[0], set()):
        # print fastq
        print id
        # update ids
        ids[id[0]].remove(id)
Creation will be a bit slower, but search should be much faster (I would expect around 20 times faster, if the first character of your keys is well distributed and not always the same).
This is a first step; you could do the same thing with the second character and so on. Search would then just be walking the tree with each letter...
As mentioned by urschrei, you should "vectorize" the check.
It is faster to check for the presence of a million elements once (as that is done in C) than to do the check for one element a million times.
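As a sketch, the question's loop could be replaced with a single set intersection, assuming the order of the output and duplicate lines on stdin don't matter:

import sys

ids = set(x.strip() for x in open(idfile))
queries = set(line.strip() for line in sys.stdin)

# one C-level intersection instead of a Python-level membership test per line
for hit in queries & ids:
    print(hit)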