Create list of tuples (in a more elegant way) - python

I am writing a Python script to read a file which consists of three columns separated by commas, create a tuple from each line, and make a list of these tuples. With the following script I achieve what I want; I was just wondering whether there is an easier / more elegant approach than writing each of the following steps on a separate line.
import sys
fin=open(sys.argv[1],'r')
list = []
for line1 in fin:
    line2 = line1[:-1]
    line3 = line2.split(',')
    line4 = tuple(line3)
    list.append(line4)
print(list)
Thank you for your answers.

Using a list comprehension:
lst = [tuple(line.rstrip().split(',')) for line in fin]
(Don't name your variables list; it shadows the built-in and can lead to unexpected bugs.)

Python comes with batteries included! If you need to read csv files, just use the csv module:
import sys, csv
with open(sys.argv[1]) as f:
    lst = list(csv.reader(f))
Note that this creates a list of lists. If you want tuples for some reason, then:
with open(sys.argv[1]) as f:
    lst = [tuple(row) for row in csv.reader(f)]
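For anyone who wants to try this without a file on disk, here is a minimal sketch of the same idea. It assumes an in-memory io.StringIO object as a stand-in for the real input file, and the sample rows are made up for illustration.

```python
import csv
import io

# Hypothetical sample data standing in for the three-column input file.
sample = io.StringIO("a,b,c\n1,2,3\nx,y,z\n")

# csv.reader takes care of line endings and splitting; tuple() converts each row.
rows = [tuple(row) for row in csv.reader(sample)]
print(rows)
```

Every field comes back as a string, so any numeric conversion would still have to be done separately.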


A simpler way to create a dictionary with counts from a 43 million row text file?

Context: I have a file with ~44 million rows. Each is an individual with US address, so there's a "ZIP Code" field. File is txt, pipe-delimited.
Due to size, I cannot (at least on my machine) use Pandas to analyze. So a basic question I have is: How many records (rows) are there for each distinct ZIP code? I took the following steps, but I wonder if there's a faster, more Pythonic way to do this (seems like there is, I just don't know).
Step 1: Create a set for ZIP values from file:
output = set()
with open(filename) as f:
    for line in f:
        output.add(line.split('|')[8])  # 9th item in the split string is "ZIP" value
zip_list = list(output) # List is length of 45,292
Step 2: Created a "0" list, same length as first list:
zero_zip = [0]*len(zip_list)
Step 3: Created a dictionary (with all zeroes) from those two lists:
zip_dict = dict(zip(zip_list, zero_zip))
Step 4: Lastly I ran through the file again, this time updating the dict I just created:
with open(filename) as f:
    next(f) # skip first line, which contains headers
    for line in f:
        zip_dict[line.split('|')[8]] +=1
I got the end result but wondering if there's a simpler way. Thanks all.
Creating the zip_dict can be replaced with a defaultdict. If you can run through every line in the file, you don't need to do it twice, you can just keep a running count.
from collections import defaultdict
d = defaultdict(int)
with open(filename) as f:
    for line in f:
        parts = line.split('|')
        d[parts[8]] += 1
This is simple using the built-in Counter class.
from collections import Counter
with open(filename) as f:
    c = Counter(line.split('|')[8] for line in f)
print(c)
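One thing to watch: the question's second pass skips the header line, while the one-pass snippets count it as if it were a ZIP code. Below is a small sketch of a single pass that skips the header first. The sample data, the ZIP_FIELD name and the column position (index 2 rather than the question's index 8) are assumptions made only to keep the example short.

```python
import io
from collections import Counter

# Hypothetical pipe-delimited sample; the ZIP code is the third field here
# (index 2), whereas the question's file has it at index 8.
sample = io.StringIO(
    "name|state|zip\n"
    "Ann|NY|10001\n"
    "Bob|CA|94105\n"
    "Cid|NY|10001\n"
)
ZIP_FIELD = 2

next(sample)  # skip the header line before counting
counts = Counter(line.rstrip("\n").split("|")[ZIP_FIELD] for line in sample)
print(counts.most_common(1))
```

The generator expression feeds Counter one value at a time, so the whole file never has to be held in memory.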

Python - Read in Comma Separated File, Create Two lists

New to Python here and I'm trying to learn/figure out the basics. I'm trying to read in a file in Python that has comma separated values, one to a line. Once read in, these values should be separated into two lists, one list containing the value before the "," on each line, and the other containing the value after it.
I've played around with it for quite a while, but I just can't seem to get it.
Here's what I have so far...
with open("mid.dat") as myfile:
    data = myfile.read().replace('\n',' ')
print(data)
list1 = [x.strip() for x in data.split(',')]
print(list1)
list2 = ?
List 1 creates a list, but it's not correct. List 2, I'm not even sure how to tackle.
PS - I have searched other similar threads on here, but none of them seem to address this properly. The file in question is not a CSV file, and needs to stay as a .dat file.
Here's a sample of the data in the .dat file:
113.64,889987.226
119.64,440987774.55
330.43,446.21
Thanks.
Use str.split() on each line:
list1 = []
list2 = []
with open("mid.dat") as myfile:
    for line in myfile:
        # strip the trailing newline first, then split on the comma
        line = line.rstrip().split(",")
        list1.append(line[0])
        list2.append(line[1])
Python's rstrip() method strips all kinds of trailing whitespace by default, so it removes the newline character "\n" too.
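A quick illustration of why the order of the two calls matters, using one line from the sample data. The variable names are only for this sketch.

```python
line = "330.43,446.21\n"

# With no argument, rstrip() removes any trailing whitespace, including "\n".
fields = line.rstrip().split(",")

# Splitting without stripping leaves the newline attached to the last field.
unstripped = line.split(",")

print(fields)
print(unstripped)
```

Note that split() returns a list, and lists have no rstrip() method, so the stripping has to happen on the string before it is split.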
If you want to use only the standard library, you can use csv.
import csv
with open("mid.dat") as myfile:
    csv_records = csv.reader(myfile)
    list1 = []
    list2 = []
    for row in csv_records:
        list1.append(row[0])
        list2.append(row[1])
You could try this, which creates lists of floats rather than strings:
from ast import literal_eval
with open("mid.dat") as f:
    list1, list2 = map(list, (zip(*map(literal_eval, f.readlines()))))
This can be simplified if you don't mind list1 and list2 being tuples.
The list(zip(*my_2d_list)) pattern is a pretty common way of transposing 2D lists using only built-in functions. It's useful in this scenario because it's easy to obtain a list (call this result) of tuples, one per line in the file (where result[0] would be the first tuple, and result[n] the tuple at index n), and then transpose result (call this resultT) such that resultT[0] holds all the 'left values' and resultT[1] all the 'right values'.
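To make the transposition concrete, here is a small sketch using the three sample rows from the question, kept as strings. The names rows, left and right are just for illustration.

```python
rows = [
    ("113.64", "889987.226"),
    ("119.64", "440987774.55"),
    ("330.43", "446.21"),
]

# zip(*rows) groups the first items of every row, then the second items.
left, right = zip(*rows)               # two tuples
list1, list2 = map(list, zip(*rows))   # the same data, converted to lists

print(list1)
print(list2)
```

This relies on every row having exactly two items. A malformed line would make the unpacking fail, or make zip silently truncate, so it may be worth validating the rows first.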
Let's keep it very simple.
list1 = []
list2 = []
with open("mid.dat") as myfile:
    for line in myfile:
        x1,x2 = map(float,line.split(','))
        list1.append(x1)
        list2.append(x2)
print(list1)
print(list2)
You could do this with pandas.
import pandas as pd
df = pd.read_csv('data.csv', names=['List 1','List 2'])
read_csv reads any delimited text file, not only files with a .csv extension, so the same call works for a .dat file. Pandas is a very powerful tool for data such as yours.
After doing so you can split your data into two independent columns (each one a pandas Series).
list1 = df['List 1']
list2 = df['List 2']
I would stick to a dataframe because data manipulation and analysis is much easier within the pandas framework.
Here is my suggestion to be short and readable, without any additional packages to install:
with open("mid.dat") as myfile:
    listOfLines = [line.rstrip().split(',') for line in myfile]
list1 = [line[0] for line in listOfLines]
list2 = [line[1] for line in listOfLines]
Note: I used rstrip() to remove the end-of-line character.
Following is a solution obtained by correcting your own attempt:
with open("test.csv", "r") as myfile:
    datastr = myfile.read().replace("\n",",")
datalist = datastr.split(",")
list1 = []; list2=[]
for i in range(len(datalist)-1): # ignore empty last item of list
    if i%2 ==0:
        list1.append(datalist[i])
    else:
        list2.append(datalist[i])
print(list1)
print(list2)
Output:
['113.64', '119.64', '330.43']
['889987.226', '440987774.55', '446.21']

List of lists (not just list) in Python

I want to make a list of lists in python.
My code is below.
import csv
f = open('agGDPpct.csv','r')
inputfile = csv.DictReader(f)
list = []
next(f) ##Skip first line (column headers)
for line in f:
    array = line.rstrip().split(",")
    list.append(array[1])
    list.append(array[0])
    list.append(array[53])
    list.append(array[54])
    list.append(array[55])
    list.append(array[56])
    list.append(array[57])
print list
I'm pulling only select columns from every row. My code pops this all into one list, as such:
['ABW', 'Aruba', '0.506252445', '0.498384331', '0.512418427', '', '', 'AND', 'Andorra', '', '', '', '', '', 'AFG', 'Afghanistan', '30.20560247', '27.09154001', '24.50744042', '24.60324707', '23.96716227'...]
But what I want is a list in which each row is its own list: [[a,b,c],[d,e,f],[g,h,i],...] Any tips?
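You are almost there. Make all your desired inputs into a list before appending. Note that csv.DictReader yields dictionaries keyed by column name, so indexing by position needs csv.reader instead. Try this:
import csv
with open('agGDPpct.csv','r') as f:
    inputfile = csv.reader(f)
    next(inputfile)  # skip the header row
    list = []
    for line in inputfile:
        list.append([line[1], line[0], line[53], line[54], line[55], line[56], line[57]])
print list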
To end up with a list of lists, you have to make the inner lists with the columns from each row that you want, and then append that list to the outer one. Something like:
for line in f:
    array = line.rstrip().split(",")
    inner = []
    inner.append(array[1])
    # ...
    inner.append(array[57])
    list.append(inner)
Note that it's also not a good practice to use the name of the type ("list") as a variable name -- this is called "shadowing", and it means that if you later try to call list(...) to convert something to a list, you'll get an error because you're trying to call a particular instance of a list, not the list built-in.
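A short sketch of what the shadowing problem looks like in practice. The demo function is hypothetical and exists only to keep the shadowed name local, so the built-in stays usable everywhere else.

```python
def demo():
    list = ["a", "b"]      # this local name hides the built-in inside demo()
    try:
        list("xyz")        # tries to call a list instance, which is not callable
    except TypeError as exc:
        return str(exc)
    return None


error = demo()
print(error)

# Outside the function the built-in is untouched and works as usual.
converted = list("xyz")
print(converted)
```

Picking a descriptive name such as rows avoids the issue altogether.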
To build on csv module capabilities, I'll do
import csv
f = csv.reader(open('your.csv'))
next(f)
list_of_lists = [items[1::-1]+items[53:58] for items in f]
Note that
items is a list of items, thanks to the intervention of a csv.reader() object;
using slice addressing returns sublists taken from items, so that the + operator in this context means concatenation of lists;
the first slice expression 1::-1 means from 1 go to the beginning moving backwards, or [items[1], items[0]].
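The slicing can be checked on a made-up row. The 58-element items list below is an assumption, built only so that indexes 53 to 57 exist.

```python
# Hypothetical row with 58 columns named col0 .. col57.
items = ["col%d" % i for i in range(58)]

head = items[1::-1]      # start at index 1 and step backwards to the start
tail = items[53:58]      # indexes 53, 54, 55, 56, 57 (the end index is exclusive)
selected = head + tail   # + concatenates the two lists

print(selected)
```

The end index of a slice is exclusive, which is why 53:58 is needed to include column 57.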
Referring to https://docs.python.org/2/library/csv.html#csv.DictReader
Instead of
for line in f:
Write
for line in inputfile:
Note that csv.DictReader yields dictionaries keyed by the header names and consumes the header row itself, so to index columns by position use csv.reader, which yields each row as a list.
And also use list.append([line[1],line[0],line[53],..]) to append a list to a list.
One more thing, referring to https://docs.python.org/2/library/stdtypes.html#iterator.next , use inputfile.next() instead of next(f).
After these changes, you get:
import csv
f = open('agGDPpct.csv','r')
inputfile = csv.reader(f)
list = []
inputfile.next() ##Skip first line (column headers)
for line in inputfile:
    list.append([line[1],line[0],line[53],line[54],line[55],line[56],line[57]])
print list
In addition to that, it is not a good practice to use list as a variable name, as it shadows the built-in type of the same name (it is not a reserved word, so Python will not stop you). Rename that too.
You can further improve the above code using with . I will leave that to you.
Try and see if it works.

writing text file with line breaks

I want to write a text file where many lines are created, so I want to know how to put each value on a new line.
this is my code:
import itertools
from itertools import permutations , combinations
lista=[]
splits=itertools.permutations('0123456789', 5)
for x in splits:
    lista.append(x)
f=open('lala.txt', 'w')
for i in lista:
    f.write(str(i))
In this part I need to put the line break: f.write(str(i))
I have tried with: f.write(str(i)\n) but it gives me an error
You can use:
f.write(str(i) + '\n')
Since your lines are already in a list, you can use writelines():
import itertools
lista = [",".join(i)+'\n' for i in itertools.permutations('0123456789',5)]
with open('lala.txt', 'w') as f:
    f.writelines(lista)
I've used the with statement which will automatically close the file for you; and used a list comprehension to create your initial list of permutations.
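If holding every line in memory is a concern, writelines() also accepts a generator. The sketch below is a variation on the answer above, not a replacement for it. It writes to a temporary directory (an assumption, so that no real lala.txt is overwritten) and then reads the file back to check the result.

```python
import itertools
import os
import tempfile

# A temporary path is used so the sketch does not overwrite a real file.
path = os.path.join(tempfile.mkdtemp(), "lala.txt")

# The generator expression yields one line at a time instead of building a list.
lines = (",".join(p) + "\n" for p in itertools.permutations("0123456789", 5))
with open(path, "w") as f:
    f.writelines(lines)

# Read the file back: the first line, then a count of all lines.
with open(path) as f:
    first = f.readline()
    count = 1 + sum(1 for _ in f)

print(repr(first), count)
```

writelines() does not add line breaks itself, which is why each string already ends in '\n'.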
Editing the list itself might be undesirable. In such cases, you could use the csv module:
import itertools, csv
lista = list(itertools.permutations('0123456789',5))
with open('lala.txt', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONE)
    writer.writerows(lista)
Opening the file with newline='' lets the csv writer control the line endings itself, as the csv documentation recommends. The quoting=csv.QUOTE_NONE setting tells the writer never to add " marks around the items.

python beginner - how to read contents of several files into unique lists?

I'd like to read the contents from several files into unique lists that I can call later - ultimately, I want to convert these lists to sets and perform intersections and subtraction on them. This must be an incredibly naive question, but after poring over the iterators and loops sections of Lutz's "Learning Python," I can't seem to wrap my head around how to approach this. Here's what I've written:
#!/usr/bin/env python
import sys
OutFileName = 'test.txt'
OutFile = open(OutFileName, 'w')
FileList = sys.argv[1: ]
Len = len(FileList)
print Len
for i in range(Len):
    sys.stderr.write("Processing file %s\n" % (i))
    FileNum = i
for InFileName in FileList:
    InFile = open(InFileName, 'r')
    PathwayList = InFile.readlines()
    print PathwayList
    InFile.close()
With a couple of simple test files, I get output like this:
Processing file 0
Processing file 1
['alg1\n', 'alg2\n', 'alg3\n', 'alg4\n', 'alg5\n', 'alg6']
['csr1\n', 'csr2\n', 'csr3\n', 'csr4\n', 'csr5\n', 'csr6\n', 'csr7\n', 'alg2\n', 'alg6']
These lists are correct, but how do I assign each one to a unique variable so that I can call them later (for example, by including the index # from range in the variable name)?
Thanks so much for pointing a complete programming beginner in the right direction!
#!/usr/bin/env python
import sys
FileList = sys.argv[1: ]
PathwayList = []
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList.append(InFile.readlines())
    InFile.close()
Assuming you read in two files, the following will do a line by line comparison (it won't pick up any extra lines in the longer file, but then they'd not be the same if one had more lines than the other ;)
for i, s in enumerate(zip(PathwayList[0], PathwayList[1]), 1):
    if s[0] == s[1]:
        print i, 'match', s[0]
    else:
        print i, 'non-match', s[0], '!=', s[1]
For what you're wanting to do, you might want to take a look at the difflib module in Python. For sorting, look at Mutable Sequence Types: someListVar.sort() will sort the contents of someListVar in place.
You could do it like this if you don't need to remember where the contents come from:
PathwayList = []
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList.append(InFile.readlines())
    InFile.close()
for contents in PathwayList:
    # do something with contents which is a list of strings
    print contents
Or, if you want to keep track of the file names, you could use a dictionary:
PathwayList = {}
for InFileName in FileList:
    sys.stderr.write("Processing file %s\n" % InFileName)
    InFile = open(InFileName, 'r')
    PathwayList[InFileName] = InFile.readlines()  # key by file name, not by the file object
    InFile.close()
for filename, contents in PathwayList.items():
    # do something with contents which is a list of strings
    print filename, contents
You might want to check out Python's fileinput module, which is a part of the standard library and allows you to process multiple files at once.
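Here is a rough sketch of how fileinput could be used for this, written for Python 3. The two input files are hypothetical and are created in a temporary directory only so the example can run. In the real script the paths would come from sys.argv[1:].

```python
import fileinput
import os
import tempfile

# Two hypothetical input files, created here only so the sketch is runnable.
tmpdir = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "alg1\nalg2\n"), ("b.txt", "csr1\nalg2\n")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as out:
        out.write(text)
    paths.append(path)

lines_by_file = {}
with fileinput.input(files=paths) as stream:
    for line in stream:
        # stream.filename() reports which file the current line came from.
        key = os.path.basename(stream.filename())
        lines_by_file.setdefault(key, []).append(line.rstrip("\n"))

print(lines_by_file)
```

fileinput treats all the files as one continuous stream of lines, so grouping by filename() is what keeps the contents of each file apart.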
Essentially, you have a list of files and you want to change to list of lines of these files...
Several ways:
result = [ list(open(n)) for n in sys.argv[1:] ]
This would get you a result like -> [ ['alg1\n', 'alg2\n', 'alg3\n'], ['csr1\n', 'csr2\n'...]] (each line keeps its trailing newline). Accessing would be like 'result[0]', which would result in ['alg1\n', 'alg2\n', 'alg3\n']...
Somewhat better might be dictionary:
result = dict( (n, list(open(n))) for n in sys.argv[1:] )
If you want to just concatenate, you would just need to chain it:
import itertools
result = list(itertools.chain.from_iterable(open(n) for n in sys.argv[1:]))
# -> ['alg1\n', 'alg2\n', 'alg3\n', 'csr1\n', 'csr2\n'...
Not one-liners for a beginner... however, it would now be a good exercise to try to comprehend what's going on :)
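Since the stated goal is set intersection and subtraction, the dictionary approach could be taken one step further by storing a set per file. This is only a sketch, written for Python 3. The file names and contents are made up and written to a temporary directory so it can run on its own.

```python
import os
import tempfile

# Hypothetical test files standing in for the ones passed on the command line.
tmpdir = tempfile.mkdtemp()
samples = {"first.txt": "alg1\nalg2\nalg6", "second.txt": "csr1\nalg2\nalg6\n"}
paths = []
for name, text in samples.items():
    path = os.path.join(tmpdir, name)
    with open(path, "w") as out:
        out.write(text)
    paths.append(path)

# One set of stripped lines per file, keyed by the file's base name.
sets_by_file = {}
for path in paths:
    with open(path) as f:
        sets_by_file[os.path.basename(path)] = {line.strip() for line in f}

common = sets_by_file["first.txt"] & sets_by_file["second.txt"]      # intersection
only_first = sets_by_file["first.txt"] - sets_by_file["second.txt"]  # difference

print(common, only_first)
```

Stripping each line matters here: without it, 'alg6' (last line, no newline) and 'alg6\n' would count as different items.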
You need to dynamically create the variable name for each file 'number' that you're reading. (I'm being deliberately vague: knowing how to build variables like this is quite valuable and more readily remembered if you discover it yourself.)
something like this will give you a start
You need a list which holds your PathwayList lists, that is a list of lists.
One remark: it is quite uncommon to use capitalized variable names. There is no strict rule for that, but by convention most people only use capitalized names for classes.
