What I am essentially looking for is the Unix `paste` command, but in Python 2. Suppose I have a CSV file:
a1,b1,c1,d1
a2,b2,c2,d2
a3,b3,c3,d3
And another such:
e1,f1
e2,f2
e3,f3
I want to pull them together into this:
a1,b1,c1,d1,e1,f1
a2,b2,c2,d2,e2,f2
a3,b3,c3,d3,e3,f3
This is the simplest case, where the number of files is known and there are only two. What if I wanted to do this with an arbitrary number of files, without knowing how many I have?
I am thinking along the lines of using zip with a list of csv.reader iterables. There will be some unpacking involved, but it seems this much Python-foo is above my IQ level at the moment. Can someone suggest how to implement this idea, or something completely different?
I suspect this should be doable with a short snippet. Thanks.
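For what it's worth, the zip-with-csv.reader idea from the question can be sketched as below (Python 2; the file names are placeholders, and zip stops at the shortest file):

import csv

filenames = ['file1.csv', 'file2.csv']  # any number of files
handles = [open(fn, 'rb') for fn in filenames]  # 'rb' for csv in Python 2
readers = [csv.reader(h) for h in handles]

# zip(*readers) yields one tuple per output line, holding one row per file;
# it stops at the shortest file, silently truncating longer ones.
for rows in zip(*readers):
    merged = [field for row in rows for field in row]
    print(','.join(merged))

for h in handles:
    h.close()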
file1 = open("file1.csv", "r")
file2 = open("file2.csv", "r")
for line in file1:
    # strip the newline (and any trailing comma) before joining the rows
    print(line.strip().strip(",") + "," + file2.readline().strip())
This is extendable to as many files as you wish: just keep adding to the print statement. Instead of print you can also append to a list, or whatever you wish. You may have to worry about files of different lengths; I did not, as you did not specify.
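If the files can differ in length, one possible refinement (a sketch, assuming empty fields are an acceptable pad) is itertools.izip_longest (zip_longest in Python 3), which pads the shorter file instead of stopping early:

import itertools

# fillvalue='' pads missing lines from the shorter file; whether that is
# the right behaviour for your data is an assumption.
with open("file1.csv") as f1, open("file2.csv") as f2:
    for l1, l2 in itertools.izip_longest(f1, f2, fillvalue=''):
        print(l1.strip() + "," + l2.strip())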
Assuming the number of files is unknown, and that all the files are properly formatted as CSV and have the same number of lines:
files = ['csv1', 'csv2', 'csv3']
fs = map(open, files)
done = False
while not done:
    chunks = []
    for f in fs:
        try:
            l = next(f).strip()
            chunks.append(l)
        except StopIteration:
            done = True
            break
    if not done:
        print ','.join(chunks)
for f in fs:
    f.close()
There seems to be no easy way of using context managers with a variable list of files, at least in Python 2 (see a comment in the accepted answer here), so the files have to be closed manually, as above.
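If Python 3.3+ is an option, contextlib.ExitStack handles a variable number of context managers; a minimal sketch:

from contextlib import ExitStack  # Python 3.3+

files = ['csv1', 'csv2', 'csv3']
with ExitStack() as stack:
    # every file entered here is closed automatically when the block exits
    fs = [stack.enter_context(open(fname)) for fname in files]
    for lines in zip(*fs):
        print(','.join(line.strip() for line in lines))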
You could try pandas.
In your case, the groups [a,b,c,d] and [e,f] can each be treated as a DataFrame, and joining them is easy because pandas has a function called concat.
import pandas as pd
# define group [a-d] as df1
df1 = pd.read_csv('1.csv')
# define group [e-f] as df2
df2 = pd.read_csv('2.csv')
pd.concat([df1, df2], axis=1)
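Note that the sample files in the question have no header row, so a sketch that reads them headerless and writes the merged result might look like this (the file names are assumptions):

import pandas as pd

# header=None because the sample files have no header row
df1 = pd.read_csv('1.csv', header=None)
df2 = pd.read_csv('2.csv', header=None)

merged = pd.concat([df1, df2], axis=1)
merged.to_csv('merged.csv', header=False, index=False)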
Hey guys, I'm a rookie in Python and need some help.
My problem is that I have a folder full of text files (with lists in them), where the files belong together in pairs and need to be read and compared.
Folder with many files: File1_in.xlo, File1_out.xlo, File2_in.xlo, File2_out.xlo, ...
--> so File1_in.xlo and File1_out.xlo belong together and need to be compared.
I can already append the lists of the 'in' files (or 'out' files) and then compare them, but since there are many files the lists become really long (thousands and thousands of entries), so the idea is to compare the files, or rather the lists, pairwise.
My first try looks like:
import os

for filename in sorted(os.listdir('path')):
    if filename.endswith('in.xlo'):
        with open(os.path.join('path', filename)) as inn:
            lines = inn.readlines()
            for x in lines:
                temperatureIn = x.split()[4]
    if filename.endswith('out.xlo'):
        with open(os.path.join('path', filename)) as outt:
            lines = outt.readlines()
            for x in lines:
                temperatureOut = x.split()[4]  # 5th column in the list
So the problem is, as you can see, that the temperatureIn values are always overwritten before I can compare them with the temperatureOut values. I think/hope there must be a way to open both files at once and compare the list entries.
I hope you can understand my problem and someone can help me.
Thanks
Use zip to access the in-files and out-files in pairs:
files = sorted(os.listdir('path'))
in_files = [fname for fname in files if fname.endswith('in.xlo')]
out_files = [fname for fname in files if fname.endswith('out.xlo')]

for in_file, out_file in zip(in_files, out_files):
    with open(os.path.join('path', in_file)) as inn, \
         open(os.path.join('path', out_file)) as outt:
        # Do whatever you want
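Putting this together with the temperature extraction from the question, a sketch might look like the following (assuming, as in the question, that the temperature is the fifth whitespace-separated column):

import os

files = sorted(os.listdir('path'))
in_files = [f for f in files if f.endswith('in.xlo')]
out_files = [f for f in files if f.endswith('out.xlo')]

for in_file, out_file in zip(in_files, out_files):
    with open(os.path.join('path', in_file)) as inn, \
         open(os.path.join('path', out_file)) as outt:
        temps_in = [line.split()[4] for line in inn]
        temps_out = [line.split()[4] for line in outt]
    # compare the pair, e.g. report any mismatching entries
    for t_in, t_out in zip(temps_in, temps_out):
        if t_in != t_out:
            print('%s / %s: %s != %s' % (in_file, out_file, t_in, t_out))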
Add them to a list created just before your for loop:

temps_in = []
for x in lines:
    temperatureIn = x.split()[4]
    temps_in.append(temperatureIn)
Do the same thing for the temperatures out, then compare your two lists.
I have been working with some code that exports individual layers, each filled with important data, into a folder as CSV files. The next thing I want to do is bring each of those layers into a different program so that I can combine them and run some tests. The current way I know how to do it is by importing them one by one (as seen below).
import numpy as np

fn0 = 'layer0'
f0 = np.genfromtxt(fn0 + '.csv', delimiter=",")
fn1 = 'layer1'
f1 = np.genfromtxt(fn1 + '.csv', delimiter=",")
The issue with continuing this way is that I may have to deal with up to 100 layers at a time, and it would be very inconvenient to have to import each layer individually.
Is there a way to change my code to do this iteratively, with something like the following pseudocode?
N = 100
for i in range(N)
fn(i) = 'layer(i)'
f(i) = np.genfromtxt(fn(i) + '.csv', delimiter=",")
Please let me know if you know of any ways!
You can use string formatting as follows:
N = 100
f = []  # create an empty list
for i in range(N):
    fn_i = 'layer%d' % i  # note: no parentheses in the name!
    f.append(np.genfromtxt(fn_i + '.csv', delimiter=","))  # add to f
What I mean by "no parentheses!" is that they are 'important' characters in Python: they indicate function calls and tuples, so you shouldn't use them in variable names (ever!), as in the question's fn(i).
The answer of Mohammad Athar is correct. However, you should not use %-formatting any longer; according to PEP 3101 (https://www.python.org/dev/peps/pep-3101/) it is supposed to be replaced by str.format(). Moreover, as you have more than 100 files, a zero-padded name like layer_007.csv is probably appreciated.
Try something like:
dataDict = dict()
for counter in range(214):
    fileName = 'layer_{number:03d}.csv'.format(number=counter)
    dataDict[fileName] = np.genfromtxt(fileName, delimiter=",")
When using a dictionary, as here, you can directly access your data later by the file name; it is unordered, though, so you might prefer the list version of Mohammad Athar.
I have data that looks like this:
print(data['ra'][0], data['dec'][0])
308.3194375 89.9638467
and I very simply (!!) want to write out to a file:
f = open('output.dat', 'w')
for ii in range(0, 10):
    f.write(long(ii), data['ra'][ii], data['dec'][ii])
f.close()
This fails with: TypeError: write() takes exactly one argument (3 given)
Why is this so hard to do?!?!?
You are passing three arguments to write(), which is wrong; write() takes a single string. Format the values into one string first:
f = open('output.dat', 'w')
for ii in range(0, 10):
    f.write("%d %f %f\n" % (long(ii), data['ra'][ii], data['dec'][ii]))
f.close()
@Nilesh is correct about this specific error, but the broader question seems to be "how do I write data to a CSV?". You don't specify what data structure your data is stored in, but it looks a lot like a pandas DataFrame, in which case data.to_csv() will make this process much simpler.
If it's not already a dataframe, then you can convert it to one:
import pandas as pd
df = pd.DataFrame({'ra': data['ra'], 'dec': data['dec']})
df.to_csv('output.dat', sep=' ')
If you are using another kind of structure (such as a numpy record array or an AstroPy table), you can find similar functionality specific to those data structures.
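For instance, if data is a NumPy structured array, np.savetxt can do the same job; a sketch under that assumption:

import numpy as np

# assumes data is a structured array with 'ra' and 'dec' float fields
rows = np.column_stack([np.arange(10), data['ra'][:10], data['dec'][:10]])
np.savetxt('output.dat', rows, fmt=['%d', '%.7f', '%.7f'])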
Just adding another pair of parentheses around the values turns them into a single tuple, but write() still expects a string, so you would also have to convert it with str(); the explicit formatting shown above is cleaner.
I use Python and there's a list of file names of different file types. Text files may look like these:
01.txt
02.txt
03.txt
...
Let's assume the text files are all numbered in this manner. Now I want to get all the text files with numbers ranging from 1 to 25. So I would like to provide a format string like %02i.txt via the GUI in order to identify all the matching file names.
My solution so far is a nested for loop. The outer loop iterates over the whole list and the inner loop counts from 1 to 25 for every file:
fmt = '%02i.txt'
for f in files:
    for i in range(1, 25+1):
        if f == fmt % i:
            # do stuff
This nested loop doesn't look very pretty and the complexity is O(n²). So it could take a while on very long lists. Is there a smarter/pythonic way of doing this?
Well, yes, I could use a regular expression like ^\d{2}\.txt$, but a format string with % is way easier to type.
You can use a set:
fmt = '%02i.txt'
targets = {fmt % i for i in range(1, 25+1)}
then
for f in files:
    if f in targets:
        # do stuff
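Building the set costs only the 25 format calls, and each membership test is O(1), so the whole scan is O(n). The test can also be folded into a comprehension or a set intersection:

matching = [f for f in files if f in targets]
# or, if the order of the matches doesn't matter:
matching = set(files) & targets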
A more Pythonic way to iterate through the files is with the glob module:
>>> import glob
>>> for f in glob.iglob('[0-9][0-9].txt'):
...     print f
01.txt
02.txt
03.txt
Related to a previous question, I'm trying to do replacements over a number of large CSV files.
The column order (and contents) change between files, but for each file there are about 10 columns that I want, which I can identify by their header names. I also have one or two dictionaries for each column I want. So for the columns I want, I want to apply only the correct dictionaries, and to apply them sequentially.
An example of how I've tried to solve this:
# -*- coding: utf-8 -*-
import re

# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A', u'X']
Line2 = [u'B', u'Y']
fileLines = [Line1, Line2]

# dicts to translate lines
D1a = {u'A': u'a'}
D1b = {u'B': u'b'}
D2 = {u'X': u'x', u'Y': u'y'}

# dict to correspond header names with the correct dictionary.
# i would like the dictionaries to be read sequentially in col1.
refD = {u'col1': [D1a, D1b], u'col2': [D2]}

# clunky replace function
def freplace(str, dict):
    rc = re.compile('|'.join(re.escape(k) for k in dict))
    def trans(m):
        return dict[m.group(0)]
    return rc.sub(trans, str)

# get correspondence between dictionary and column
C = []
for i in range(len(Header)):
    if Header[i] in refD:
        C.append([refD[Header[i]], i])

# loop through lines and make replacements
for line in fileLines:
    for i in range(len(line)):
        for j in range(len(C)):
            if C[j][1] == i:
                for dict in C[j][0]:
                    line[i] = freplace(line[i], dict)
My problem is that this code is quite slow, and I can't figure out how to speed it up. I'm a beginner, and my guess is that the freplace function is largely what is slowing things down, because it has to compile a regex for each column in each row. I would like to take the line rc = re.compile('|'.join(re.escape(k) for k in dict)) out of that function, but I don't know how to do that and still preserve what the rest of my code is doing.
There are a ton of things you can do to speed this up:
First, use the csv module. It provides efficient and bug-free methods for reading and writing CSV files. The DictReader object in particular is what you're interested in: it will present every row it reads from the file as a dictionary keyed by its column name.
Second, compile your regexes once, not every time you use them. Save the compiled regexes in a dictionary keyed by the column that you're going to apply them to.
Third, consider that if you apply a hundred regexes to a long string, you're going to be scanning the string from start to finish a hundred times. That may not be the best approach to your problem; you might be better off investing some time in an approach that lets you read the string from start to end once.
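A sketch of the first two suggestions combined, reusing refD from the question (the input file name and the per-row handling are assumptions):

import csv
import re

def make_trans(d):
    # bind the dict at definition time so each regex keeps its own table
    return lambda m: d[m.group(0)]

# compile each (regex, translator) pair once, keyed by column name
compiled = {}
for col, dicts in refD.items():
    compiled[col] = [(re.compile('|'.join(re.escape(k) for k in d)),
                      make_trans(d)) for d in dicts]

with open('input.csv', 'rb') as f:  # 'rb' for csv in Python 2
    for row in csv.DictReader(f):
        for col, patterns in compiled.items():
            if col in row:
                for rc, trans in patterns:
                    row[col] = rc.sub(trans, row[col])
        # row is now translated; write it out or collect it here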
You don't need re:
# -*- coding: utf-8 -*-

# imaginary csv file. pretend that we do not know the column order.
Header = [u'col1', u'col2']
Line1 = [u'A', u'X']
Line2 = [u'B', u'Y']
fileLines = [Line1, Line2]

# dicts to translate lines
D1a = {u'A': u'a'}
D1b = {u'B': u'b'}
D2 = {u'X': u'x', u'Y': u'y'}

# dict to correspond header names with the correct dictionary
refD = {u'col1': [D1a, D1b], u'col2': [D2]}

# now let's have some fun...
for line in fileLines:
    for i, (param, word) in enumerate(zip(Header, line)):
        for minitranslator in refD[param]:
            if word in minitranslator:
                line[i] = minitranslator[word]
returns:
[[u'a', u'x'], [u'b', u'y']]
So if that's the case, and all 10 columns have the same names each time but appear out of order (I'm not sure if this is what you're doing up there, but here goes): keep one array for the heading names, and one for each line's data split into elements (there should be 10 items per line). Then dispatch on the header name with a case/select combo: compare the element number in your header array, and inside each case reference the data array at the same offset. Since the name gets you to the right case, you can reuse the same 10 regexes repeatedly and never have to recompile a new one.
I hope that makes sense. I'm sorry I don't know the Python syntax to help you out, but I hope my idea is what you're looking for.
EDIT: i.e., initialize all regexes before starting your loops; then, after you read a line (and after the header line):

select array[n]
    case "column1"
        regex(data[0]);
    case "column2"
        regex(data[1]);
    ...
end select
This should call the right regex for the right columns.
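In Python, the select/case idea maps naturally onto a dictionary of precompiled regexes keyed by column name; a sketch with placeholder patterns and a placeholder lowercasing replacement:

import re

# one precompiled regex per column name; these patterns are placeholders
column_regexes = {
    'column1': re.compile(r'A|B'),
    'column2': re.compile(r'X|Y'),
}

header = ['column1', 'column2']
line = ['A', 'X']

for i, name in enumerate(header):
    rc = column_regexes.get(name)
    if rc is not None:
        line[i] = rc.sub(lambda m: m.group(0).lower(), line[i])

print(line)  # ['a', 'x']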