Splitting Text File Into Columns and Rows in Python

I have a newbie question. I need help separating a text file into columns and rows. Let's say I have a file like this:
1 2 3 4
2 3 4 5
and I want to put it into a 2D list called values = [[]].
I can get it to give me the rows, and this code works:
values = map(int, line.split(','))
I just don't know how to do the same thing for the columns, and the documentation doesn't make any sense to me.
Cheers

f = open(filename,'rt')
a = [[int(token) for token in line.split()] for line in f.readlines()[::2]]
In your sample file above, you have an empty line between each data row - I took this into account, but you can drop the ::2 subscript if you didn't mean to have this extra line in your data.
Edit: added conversion to int - you can use map as well, but mixing list comprehensions and map seems ugly to me.

import csv
import itertools
values = []
with open('text.file') as file_object:
    for line in csv.reader(file_object, delimiter=' '):
        values.append(map(int, line))
print "rows:", values
print "columns:"
for column in itertools.izip(*values):
    print column
Output is:
rows: [[1, 2, 3, 4], [2, 3, 4, 5]]
columns:
(1, 2)
(2, 3)
(3, 4)
(4, 5)

Get the data into your program by some method. Here's one:
f = open(textfile, 'r')
buffer = f.read()
f.close()
Parse the buffer into a table (note: strip() is used to clear any trailing whitespace):
table = [map(int, row.split()) for row in buffer.strip().split("\n")]
>>> print table
[[1, 2, 3, 4], [2, 3, 4, 5]]
Maybe it's ordered pairs you want instead, then transpose the table:
transpose = zip(*table)
>>> print transpose
[(1, 2), (2, 3), (3, 4), (4, 5)]
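For readers on Python 3, where map and zip return lazy iterators rather than lists, the same parse-and-transpose steps might look like the sketch below. The sample text is held in an in-memory string (an assumption, so the snippet runs without a file); with a real file you would iterate over the open file object instead.

```python
import io

# Stand-in for the text file from the question (assumed contents).
sample = io.StringIO("1 2 3 4\n2 3 4 5\n")

# One inner list per non-empty line; split() with no arguments
# splits on any run of whitespace.
table = [[int(tok) for tok in line.split()] for line in sample if line.strip()]

# zip(*table) groups the i-th item of every row, i.e. the columns.
columns = [list(col) for col in zip(*table)]

print(table)    # [[1, 2, 3, 4], [2, 3, 4, 5]]
print(columns)  # [[1, 2], [2, 3], [3, 4], [4, 5]]
```

The `if line.strip()` filter skips blank lines, so it does not matter whether the data has an empty line between rows.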

You could try to use the CSV-module. You can specify custom delimiters, so it might work.

If columns are separated by blanks
import re
A,B,C,D = [],[],[],[]
pat = re.compile(r'([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+)')
with open('try.txt') as f:
    for line in f:
        a,b,c,d = pat.match(line.strip()).groups()
        A.append(int(a));B.append(int(b));C.append(int(c));D.append(int(d))
or with csv module
EDIT
A,B,C,D = [],[],[],[]
with open('try.txt') as f:
    for line in f:
        a,b,c,d = line.split()
        A.append(int(a));B.append(int(b));C.append(int(c));D.append(int(d))
Note that split() with no arguments splits on any run of whitespace, so this also works when more than one blank separates the values. Only split(' ') would fail in that case.
EDIT 2
Because the regex solution has been described as extremely hard to understand, it can be simplified as follows:
import re
A,B,C,D = [],[],[],[]
pat = re.compile(r'\s+')
with open('try.txt') as f:
    for line in f:
        a,b,c,d = pat.split(line.strip())
        A.append(int(a));B.append(int(b));C.append(int(c));D.append(int(d))
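To see how the splitting approaches behave when values are separated by more than one blank, here is a small check. The sample line with irregular spacing is made up for illustration.

```python
import re

line = "1   2 3\t4\n"  # irregular spacing, assumed for illustration

by_default = line.split()                     # any run of whitespace
by_single_space = line.strip().split(" ")     # exactly one space
by_regex = re.split(r"\s+", line.strip())     # same idea as pat.split above

print(by_default)       # ['1', '2', '3', '4']
print(by_single_space)  # ['1', '', '', '2', '3\t4']
print(by_regex)         # ['1', '2', '3', '4']
```

Only the explicit single-space separator produces empty strings, which is what would break the four-way unpacking.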

Related

Iterate through a file reading first N values

I am reading 3 lines at a time from a file which has numbers 1,2,3...100
I want the output to look something like this
1
2
3
2
3
4
3
4
5
However with the following code, it is printing continuous numbers
with open("/home/osboxes/num", "r+") as f:
    for line in f:
        print(line)
        line2 = f.__next__()
        print(line2)
        line3 = f.__next__()
        print(line3)
Is there a way to step back in the iteration, so that each group of three starts one line after the previous group and the output looks like the one shown above?

Let's assume that instead of your file object we have an iterator such as iter(range(100)). To produce the expected result with next, you can copy the iterator as many times as you need with itertools.tee, advance each copy by a different offset, and then zip the copies together:
In [3]: r = iter(range(100))
In [4]: from itertools import tee
In [5]: r, n, m = tee(r, 3) # copy the iterator 3 times
In [6]: next(n) # consume the first item of n
Out[6]: 0
In [7]: next(m);next(m) # consume the first 2 items of m
Out[7]: 1
In [8]: list(zip(r, n, m))
#Out[8]:
#[(0, 1, 2),
# (1, 2, 3),
# (2, 3, 4),
# (3, 4, 5),
# (4, 5, 6),
# (5, 6, 7),
# ...
Now you can do the same thing with file object:
from itertools import tee
with open("/home/osboxes/num", "r+") as f:
    f, n, m = tee(f, 3)
    next(n);next(m);next(m)
    for i, j, k in zip(f, n, m):
        print(i, j, k) # or do something else with i,j,k
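If you want to try the tee approach without the actual file, the sketch below runs it against an in-memory stand-in. The five-line sample and the names a, b, c are assumptions for illustration.

```python
import io
from itertools import tee

f = io.StringIO("1\n2\n3\n4\n5\n")  # stand-in for the numbers file

a, b, c = tee(f, 3)   # three independent iterators over the same lines
next(b, None)         # b starts one line ahead of a
next(c, None)         # c starts two lines ahead of a
next(c, None)

windows = [(x.strip(), y.strip(), z.strip()) for x, y, z in zip(a, b, c)]
print(windows)  # [('1', '2', '3'), ('2', '3', '4'), ('3', '4', '5')]
```

zip stops as soon as the shortest iterator (here c) runs out, so no incomplete window is produced at the end.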

If it's a small file, as you mentioned, you can use the following code. If it's much bigger, prefer the seek() method:
with open("abc.txt", "r+") as f:
    data = f.readlines()
    for i in range(2, len(data)):
        print("%s %s %s" % (data[i-2].rstrip(), data[i-1].rstrip(), data[i].rstrip()), end = " ")
Output:
1 2 3 2 3 4 3 4 5

If storing the whole file in a variable isn't a problem, an easy solution would be:
with open("num", "r+") as f:
    lines = f.read().splitlines()
    for i in range(len(lines) - 2):
        print(lines[i])
        print(lines[i + 1])
        print(lines[i + 2])
For a more efficient solution, see @Kasramvd's solution using iterators.
As an alternative without iterators, you can store the last 2 values:
with open("num", "r+") as f:
    prev1, prev2 = None, None
    for line in f:
        if prev1 is not None and prev2 is not None:
            print(prev1)
            print(prev2)
            print(line)
        prev1, prev2 = prev2, line
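The "remember the last values" idea can also be written with collections.deque, which is a different technique from the two-variable version but generalises to any window size. A minimal sketch, assuming a small in-memory sample in place of the file:

```python
import io
from collections import deque

f = io.StringIO("1\n2\n3\n4\n5\n")  # stand-in for the numbers file

window = deque(maxlen=3)  # the oldest line drops off automatically
triples = []
for line in f:
    window.append(line.strip())
    if len(window) == 3:  # only emit once the window is full
        triples.append(tuple(window))

print(triples)  # [('1', '2', '3'), ('2', '3', '4'), ('3', '4', '5')]
```

Only three lines are held in memory at any time, so this works for large files as well.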

Zip two file contents having related timestamp column to create a list in python

I have two files containing timestamp column with 1000+ rows. Row in file f1 is related to the row in file f2. I wanted a Python script to do [f1 nth row,f2 nth row] for all corresponding rows in the best way possible. Thanks!
f1:
05:43:44
05:59:32
f2:
05:43:51
05:59:39
e.g. [05:43:44,05:43:51], [05:59:32,05:59:39] ....

You may use zip() function. https://docs.python.org/3/library/functions.html#zip
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> zipped = zip(x, y)
>>> list(zipped)
[(1, 4), (2, 5), (3, 6)]

You can do something like the following:
f1_as_list = open(f1).readlines() # get each line as a list element
f2_as_list = open(f2).readlines()
zipped_files = zip(f1_as_list, f2_as_list) # zip the two lists together
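Lines read from a file keep their trailing newline, so you will probably want to strip them while zipping. A small sketch using the timestamps from the question, with in-memory strings standing in for the two files (an assumption so it runs as-is):

```python
import io

# Stand-ins for f1 and f2 from the question.
f1 = io.StringIO("05:43:44\n05:59:32\n")
f2 = io.StringIO("05:43:51\n05:59:39\n")

# zip walks both files in step and pairs the n-th line of each.
pairs = [[a.strip(), b.strip()] for a, b in zip(f1, f2)]

print(pairs)  # [['05:43:44', '05:43:51'], ['05:59:32', '05:59:39']]
```

If the files can differ in length, zip silently stops at the shorter one.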

Something like this is probably the most intuitive approach.
#!/usr/bin/python3
with open("f1.txt") as f1:
    with open("f2.txt") as f2:
        for row1, row2 in zip(f1, f2):
            print("%s %s" % (row1.strip(), row2.strip()))
Some might prefer a list comprehension, but non-Pythonistas may not consider it intuitive.
with open("f1.txt") as f1:
    with open("f2.txt") as f2:
        print("\n".join([
            "%s %s" % (row1.strip(), row2.strip())
            for row1, row2 in zip(f1, f2)
        ]))

Write multiple rows from dict using csv

Update: I do not want to use pandas because I have a list of dict's and want to write each one to disk as they come in (part of webscraping workflow).
I have a dict that I'd like to write to a csv file. I've come up with a solution, but I'd like to know if there's a more pythonic solution available. Here's what I envisioned (but doesn't work):
import csv
test_dict = {"review_id": [1, 2, 3, 4],
             "text": [5, 6, 7, 8]}
with open('test.csv', 'w') as csvfile:
    fieldnames = ["review_id", "text"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(test_dict)
Which would ideally result in:
review_id text
1 5
2 6
3 7
4 8
The code above doesn't work the way I'd expect it to and throws a ValueError. So I've turned to the following solution (which does work, but seems verbose).
with open('test.csv', 'w') as csvfile:
    fieldnames = ["review_id", "text"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    response = test_dict
    cells = [{x: {key: val}} for key, vals in response.items()
             for x, val in enumerate(vals)]
    rows = {}
    for d in cells:
        for key, val in d.items():
            if key in rows:
                rows[key].update(d.get(key, None))
            else:
                rows[key] = d.get(key, None)
    for row in [val for _, val in rows.items()]:
        writer.writerow(row)
Again, to reiterate what I'm looking for: the block of code directly above works (i.e., produces the desired result mentioned early in the post), but seems verbose. So, is there a more pythonic solution?
Thanks!

Your first example will work with minor edits. DictWriter expects a list of dicts rather than a dict of lists. Assuming you can't change the format of the test_dict:
import csv
test_dict = {"review_id": [1, 2, 3, 4],
             "text": [5, 6, 7, 8]}
def convert_dict(mydict, numentries):
    # Turn a dict of lists into a list of dicts, one dict per row.
    data = []
    for i in range(numentries):
        row = {}
        for k, l in mydict.items():
            row[k] = l[i]
        data.append(row)
    return data
with open('test.csv', 'w') as csvfile:
    fieldnames = ["review_id", "text"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(convert_dict(test_dict, 4))
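The dict-of-lists to list-of-dicts conversion can also be done with zip, which avoids passing the number of entries by hand. The following is a sketch in Python 3, not the answerer's code. It writes to an in-memory buffer (an assumption, so the result can be inspected without touching the disk), and lineterminator is set only to keep the output easy to compare.

```python
import csv
import io

test_dict = {"review_id": [1, 2, 3, 4],
             "text": [5, 6, 7, 8]}

# zip(*values) yields one tuple per row; zipping it with the keys
# gives the per-row dict that DictWriter expects.
rows = [dict(zip(test_dict, values)) for values in zip(*test_dict.values())]

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(test_dict), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)

csv_text = out.getvalue()
print(csv_text)
```

To write to disk instead, replace the buffer with a file opened as open('test.csv', 'w', newline='').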

Try using pandas of python..
Here is a simple example
import pandas as pd
test_dict = {"review_id": [1, 2, 3, 4],
"text": [5, 6, 7, 8]}
d1 = pd.DataFrame(test_dict)
d1.to_csv("output.csv")
Cheers

The built-in zip function can join together different iterables into tuples which can be passed to writerows. Try this as the last line:
writer.writerows(zip(test_dict["review_id"], test_dict["text"]))
You can see what it's doing by making a list:
>>> list(zip(test_dict["review_id"], test_dict["text"]))
[(1, 5), (2, 6), (3, 7), (4, 8)]
Edit: In this particular case, you probably want a regular csv.writer rather than a DictWriter, since what you now have is effectively a list of row tuples.

If you don't mind using a 3rd-party package, you could do it with pandas.
import pandas as pd
pd.DataFrame(test_dict).to_csv('test.csv', index=False)
update
So, you have several dictionaries and all of them seem to come from a scraping routine.
import pandas as pd
test_dict = {"review_id": [1, 2, 3, 4],
             "text": [5, 6, 7, 8]}
pd.DataFrame(test_dict).to_csv('test.csv', index=False)
list_of_dicts = [test_dict, test_dict]
for d in list_of_dicts:
    pd.DataFrame(d).to_csv('test.csv', index=False, mode='a', header=False)
This time, you are appending to the file, without the header.
The output is:
review_id,text
1,5
2,6
3,7
4,8
1,5
2,6
3,7
4,8
1,5
2,6
3,7
4,8

The problem is that with DictWriter.writerows() you are forced to have a dict for each row. Instead you can simply add the values changing your csv creation:
with open('test.csv', 'w') as csvfile:
    fieldnames = test_dict.keys()
    fieldvalues = zip(*test_dict.values())
    writer = csv.writer(csvfile)
    writer.writerow(fieldnames)
    writer.writerows(fieldvalues)

You have two different problems in your question:
The first is to create a csv file from a dictionary where the values are containers and not primitives.
For the first problem, the solution is generally to transform the container type into a primitive type. The most common method is creating a json-string. So for example:
>>> import json
>>> x = [2, 4, 6, 8, 10]
>>> json_string = json.dumps(x)
>>> json_string
'[2, 4, 6, 8, 10]'
So your data conversion might look like:
import csv
import json
def convert(datadict):
    '''Generator which converts a dictionary of containers into a dictionary of json-strings.
    args:
        datadict(dict): dictionary which needs conversion
    yield:
        tuple: key and string
    '''
    for key, value in datadict.items():
        yield key, json.dumps(value)
def dump_to_csv_using_dict(datadict, fields=None, filepath=None, delimiter=None):
    '''Dumps a list of dictionaries into csv
    args:
        datadict(list): list of dictionaries to dump
        fields(list): field sequence to use from the dictionaries [default: sorted keys of the first dictionary]
        filepath(str): filepath to save to [default: 'tmp.csv']
        delimiter(str): delimiter to use in csv [default: '|']
    '''
    # datadict is a list, so the default field names come from its first entry
    fieldnames = sorted(datadict[0].keys()) if fields is None else fields
    filepath = 'tmp.csv' if filepath is None else filepath
    delimiter = '|' if not delimiter else delimiter
    with open(filepath, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='ignore', delimiter=delimiter)
        writer.writeheader()
        for each_dict in datadict:
            writer.writerow(each_dict)
So the naive conversion looks like this:
# Conversion code
test_data = {
    "review_id": [1, 2, 3, 4],
    "text": [5, 6, 7, 8]
}
converted_data = dict(convert(test_data))
data_list = [converted_data]
dump_to_csv_using_dict(data_list)
The second problem is to create a final value that is actually some sort of merging of two disparate data sets.
To do this, you need to find a way to combine data from different keys. This is not an easy problem to generically solve.
That said, it's easy to combine two lists with zip.
>>> x = [2, 4, 6]
>>> y = [1, 3, 5]
>>> zip(y, x)
[(1, 2), (3, 4), (5, 6)]
In addition, in the event that your lists are not the same size, Python's itertools module provides a function, izip_longest, which yields the full zip even if one list is shorter than another. Note that izip_longest returns a generator.
>>> from itertools import izip_longest
>>> x = [2, 4]
>>> y = [1, 3, 5]
>>> z = izip_longest(y, x, fillvalue=None) # default fillvalue is None
>>> list(z) # z is a generator
[(1, 2), (3, 4), (5, None)]
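The session above is Python 2. In Python 3 the function is named itertools.zip_longest, and the built-in zip is itself lazy, so both need list() to be displayed. A short sketch of the difference between the two, using the same lists:

```python
from itertools import zip_longest

x = [2, 4]
y = [1, 3, 5]

# zip stops at the shorter input; zip_longest pads the shorter one.
truncated = list(zip(y, x))
padded = list(zip_longest(y, x, fillvalue=None))

print(truncated)  # [(1, 2), (3, 4)]
print(padded)     # [(1, 2), (3, 4), (5, None)]
```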
So we could add another function here:
from itertools import izip_longest
def combine(data, fields=None, default=None):
    '''Combines fields within data
    args:
        data(dict): a dictionary with lists as values
        fields(list): a list of keys to combine [default: all fields in arbitrary order]
        default: default fill value [default: None]
    yields:
        tuple: columns combined into rows
    '''
    fields = data.keys() if fields is None else fields
    columns = [data.get(field) for field in fields]
    for values in izip_longest(*columns, fillvalue=default):
        yield values
And now we can use this to update our original conversion.
def dump_to_csv(data, filepath=None, delimiter=None):
    '''Dumps rows into csv
    args:
        data(iterable): rows of values to dump
        filepath(str): filepath to save to [default: 'tmp.csv']
        delimiter(str): delimiter to use in csv [default: '|']
    '''
    filepath = 'tmp.csv' if filepath is None else filepath
    delimiter = '|' if not delimiter else delimiter
    with open(filepath, 'w') as csvfile:
        writer = csv.writer(csvfile, delimiter=delimiter)
        for each_row in data:
            writer.writerow(each_row)
# Conversion code
test_data = {
    "review_id": [1, 2, 3, 4],
    "text": [5, 6, 7, 8]
}
# combine() already yields one tuple per row, so pass it straight through
combined_data = combine(test_data)
dump_to_csv(combined_data)

How to read the first row of an array in Python

I am new to learning Python, here is my current code:
#!/usr/bin/python
l = []
with open('datad.dat', 'r') as f:
    for line in f:
        line = line.strip()
        if len(line) > 0:
            l.append(map(float, line.split()))
print l[:,1]
I attempted to do this but made the mistake of using FORTRAN syntax, and received the following error:
File "r1.py", line 9, in <module>
print l[:,1]
TypeError: list indices must be integers, not tuple
How would I go about getting the first row or column of an array?

To get the first row, use l[0]. To get columns you need to transpose with zip, for example print(list(zip(*l))[0]).
In [14]: l = [[1,2,3],[4,5,6],[7,8,9]]
In [15]: l[0] # first row
Out[15]: [1, 2, 3]
In [16]: l[1] # second row
Out[16]: [4, 5, 6]
In [17]: l[2] # third row
Out[17]: [7, 8, 9]
In [18]: t = list(zip(*l))
In [19]: t[0] # first column
Out[19]: (1, 4, 7)
In [20]: t[1] # second column
Out[20]: (2, 5, 8)
In [21]: t[2] # third column
Out[21]: (3, 6, 9)
The csv module may also be useful:
import csv
with open('datad.dat', 'r') as f:
    reader = csv.reader(f)
    l = [map(float, row) for row in reader]
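If you only need one column, a list comprehension that picks the same index from every row avoids transposing the whole table. The sketch below is Python 3 and uses a small hard-coded table (an assumption, standing in for the contents of datad.dat).

```python
# Rows as they would look after parsing each line with float().
l = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

first_row = l[0]
first_col = [row[0] for row in l]
second_col = [row[1] for row in l]  # what l[:, 1] would mean for a NumPy array

print(first_row)   # [1.0, 2.0, 3.0]
print(first_col)   # [1.0, 4.0]
print(second_col)  # [2.0, 5.0]
```

Note that in Python 3 map returns an iterator, so rows built with map(float, ...) need to be wrapped in list() before they can be indexed like this.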

Python - Parsing Columns and Rows

I am running into some trouble with parsing the contents of a text file into a 2D array/list. I cannot use built-in libraries, so have taken a different approach. This is what my text file looks like, followed by my code
1,0,4,3,6,7,4,8,3,2,1,0
2,3,6,3,2,1,7,4,3,1,1,0
5,2,1,3,4,6,4,8,9,5,2,1
def twoDArray():
    network = [[]]
    filename = open('twoDArray.txt', 'r')
    for line in filename.readlines():
        col = line.split(line, ',')
        row = line.split(',')
    network.append(col,row)
    print "Network = "
    print network
if __name__ == "__main__":
    twoDArray()
I ran this code but got this error:
Traceback (most recent call last):
File "2dArray.py", line 22, in <module>
twoDArray()
File "2dArray.py", line 8, in twoDArray
col = line.split(line, ',')
TypeError: an integer is required
I am using the comma to separate both row and column as I am not sure how I would differentiate between the two - I am confused about why it is telling me that an integer is required when the file is made up of integers

Well, I can explain the error. You're using str.split() and its usage pattern is:
str.split(separator, maxsplit)
You're using str.split(string, separator) and that isn't a valid call to split. Here is a direct link to the Python docs for this:
http://docs.python.org/library/stdtypes.html#str.split
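A quick way to see the difference is to try the calls on one line of the data. This is a small sketch in Python 3; the variable names are made up for illustration.

```python
line = "1,0,4,3"

all_fields = line.split(",")    # separator only: split everywhere
first_split = line.split(",", 1)  # separator plus maxsplit=1: split once

# Passing the string itself as the separator and "," as maxsplit
# is what triggers the TypeError from the question.
try:
    line.split(line, ",")
    raised = False
except TypeError:
    raised = True

print(all_fields)   # ['1', '0', '4', '3']
print(first_split)  # ['1', '0,4,3']
print(raised)       # True
```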

To directly answer your question, there is a problem with the following line:
col = line.split(line, ',')
If you check the documentation for str.split, you'll find the description to be as follows:
str.split([sep[, maxsplit]])
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most
maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified, then there is no limit on the number of splits (all possible splits are made).
This is not what you want. You are not trying to specify the number of splits you want to make.
Consider replacing your for loop and network.append with this:
for line in filename.readlines():
    # line is a string representing the values for this row
    row = line.split(',')
    # row is the list of number strings for this row, such as ['1', '0', '4', ...]
    cols = [int(x) for x in row]
    # cols is the list of numbers for this row, such as [1, 0, 4, ...]
    network.append(cols)
    # Put this row into network, such that network is [[1, 0, 4, ...], [...], ...]

"""I cannot use built-in libraries""" -- do you really mean "cannot" as in you have tried to use the csv module and failed? If so, say so. Do you mean that "may not" as in you are forbidden to use a built-in module by the terms of your homework assignment? If so, say so.
Here is an answer that works. It doesn't leave a newline attached to the end of the last item in each row. It converts the numbers to int so that you can use them for whatever purpose you have. It fixes other errors that nobody else has mentioned.
def twoDArray():
    network = []
    # filename = open('twoDArray.txt', 'r')
    # "filename" is a very weird name for a file HANDLE
    f = open('twoDArray.txt', 'r')
    # for line in filename.readlines():
    # readlines reads the whole file into memory at once.
    # That is quite unnecessary.
    for line in f: # just iterate over the file handle
        line = line.rstrip('\n') # remove the newline, if any
        # col = line.split(line, ',')
        # wrong args, as others have said.
        # In any case, only 1 split call is necessary
        row = line.split(',')
        # now convert string to integer
        irow = [int(item) for item in row]
        # network.append(col,row)
        # list.append expects only ONE arg
        # indentation was wrong; you need to do this once per line
        network.append(irow)
    print "Network = "
    print network
if __name__ == "__main__":
    twoDArray()

Omg...
network = []
filename = open('twoDArray.txt', 'r')
for line in filename.readlines():
    network.append(line.split(','))
This gives you the following (each value is still a string):
[
[1,0,4,3,6,7,4,8,3,2,1,0],
[2,3,6,3,2,1,7,4,3,1,1,0],
[5,2,1,3,4,6,4,8,9,5,2,1]
]
Or do you need some other structure as output? Please add what you need as output.

class TwoDArray(object):
    @classmethod
    def fromFile(cls, fname, *args, **kwargs):
        splitOn = kwargs.pop('splitOn', None)
        mode = kwargs.pop('mode', 'r')
        with open(fname, mode) as inf:
            return cls([line.strip('\r\n').split(splitOn) for line in inf], *args, **kwargs)
    def __init__(self, data=[[]], *args, **kwargs):
        dataType = kwargs.pop('dataType', lambda x:x)
        super(TwoDArray,self).__init__()
        self.data = [[dataType(i) for i in line] for line in data]
    def __str__(self, fmt=str, endrow='\n', endcol='\t'):
        return endrow.join(
            endcol.join(fmt(i) for i in row) for row in self.data
        )
def main():
    network = TwoDArray.fromFile('twodarray.txt', splitOn=',', dataType=int)
    print("Network =")
    print(network)
if __name__ == "__main__":
    main()

The input format is simple, so the solution should be simple too:
network = [map(int, line.split(',')) for line in open(filename)]
print network
csv module doesn't provide an advantage in this case:
import csv
print [map(int, row) for row in csv.reader(open(filename, 'rb'))]
If you need float instead of int:
print list(csv.reader(open(filename, 'rb'), quoting=csv.QUOTE_NONNUMERIC))
If you are working with numpy arrays:
import numpy
print numpy.loadtxt(filename, dtype='i', delimiter=',')
See Why NumPy instead of Python lists?
All examples produce arrays equal to:
[[1 0 4 3 6 7 4 8 3 2 1 0]
[2 3 6 3 2 1 7 4 3 1 1 0]
[5 2 1 3 4 6 4 8 9 5 2 1]]

Read the data from the file. Here's one way:
f = open('twoDArray.txt', 'r')
buffer = f.read()
f.close()
Parse the data into a table
table = [map(int, row.split(',')) for row in buffer.strip().split("\n")]
>>> print table
[[1, 0, 4, 3, 6, 7, 4, 8, 3, 2, 1, 0], [2, 3, 6, 3, 2, 1, 7, 4, 3, 1, 1, 0], [5, 2, 1, 3, 4, 6, 4, 8, 9, 5, 2, 1]]
Perhaps you want the transpose instead:
transpose = zip(*table)
>>> print transpose
[(1, 2, 5), (0, 3, 2), (4, 6, 1), (3, 3, 3), (6, 2, 4), (7, 1, 6), (4, 7, 4), (8, 4, 8), (3, 3, 9), (2, 1, 5), (1, 1, 2), (0, 0, 1)]
