Goal: Sort the text file alphabetically based on the characters that appear AFTER the final slash. Note that there are random numbers right before the final slash.
Contents of the text file:
https://www.website.com/1939332/delta.html
https://www.website.com/2237243/alpha.html
https://www.website.com/1242174/zeta.html
https://www.website.com/1839352/charlie.html
Desired output:
https://www.website.com/2237243/alpha.html
https://www.website.com/1839352/charlie.html
https://www.website.com/1939332/delta.html
https://www.website.com/1242174/zeta.html
Code Attempt:
i = 0
for line in open("test.txt").readlines():  # reading text file
    List = line.rsplit('/', 1)  # splits by final slash and gives me 4 lists
    dct = {list[i]: list[i+1]}  # tried to use a dictionary
    sorted_dict = sorted(dct.items())  # sort the dictionary
textfile = open("test.txt", "w")
for element in sorted_dict:
    textfile.write(element + "\n")
textfile.close()
Code does not work.
I would pass a different key function to the sorted function. For example:
with open('test.txt', 'r') as f:
    lines = f.readlines()

lines = sorted(lines, key=lambda line: line.split('/')[-1])

with open('test.txt', 'w') as f:
    f.writelines(lines)
See the documentation on key functions for a more detailed explanation.
Before you run this, I am assuming your test.txt ends with a newline; that is what fixes the issue of the second and third lines being combined.
If you really want to use a dictionary:
dct = {}
i = 0
with open("test.txt") as textfile:
    for line in textfile.readlines():
        mylist = line.rsplit('/', 1)
        dct[mylist[i]] = mylist[i+1]
sorted_dict = sorted(dct.items(), key=lambda item: item[1])
with open("test.txt", "w") as textfile:
    for element in sorted_dict:
        textfile.write(element[i] + '/' + element[i+1])
What you did wrong
In the first line, you name your variable List, and in the second you access it using list.
List = line.rsplit('/', 1)
dct = {list[i]:list[i+1]}
Variable names are case sensitive, so you need to use the same capitalisation each time. Furthermore, Python already has a built-in list class. It can be overridden, but I would not recommend naming your variables list, dict, etc.
(In Python 3.9+, list[i] will actually just create a types.GenericAlias object, which is a type hint, something completely different from a list, and not what you want at all.)
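As a quick illustration (a minimal sketch, assuming Python 3.9+):
# Subscripting the built-in list class does not index anything; it builds a typing alias
alias = list[0]
print(type(alias))  # <class 'types.GenericAlias'>
print(alias)        # list[0]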
You also wrote
dct = {list[i]:list[i+1]}
which repeatedly creates a new dictionary in each loop iteration, overwriting whatever was stored in dct previously. You should instead create an empty dictionary before the loop, and assign values to its keys every time you want to update it, as I have done.
You're calling sorted in each iteration of the loop; you should only call it once, after the loop is done. After all, you only want to sort your dictionary once.
You also open the file twice, and although you close it at the end, I would suggest using a context manager and the with statement as I have done, so that file closing is automatically handled.
My code
sorted(dct.items(), key=lambda item: item[1])
means that the sorted() function uses the second element of each item tuple (the dictionary value) as the 'metric' by which to sort.
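For instance (a minimal illustration with made-up data):
# sort dict items by value: the key function picks item[1] from each (key, value) pair
d = {'x': 3, 'y': 1, 'z': 2}
print(sorted(d.items(), key=lambda item: item[1]))  # [('y', 1), ('z', 2), ('x', 3)]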
textfile.write(element[i] + '/' + element[i+1])
is necessary because rsplit('/', 1) removed the final / from each line; you need to add it back and reconstruct the string from the element tuple before you write it.
You don't need + \n in textfile.write since readlines() preserves the \n. That's why you should end text files with a newline: so that you don't have to treat the last line differently.
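To make the reconstruction concrete (a small sketch using one of the example lines):
# rsplit('/', 1) drops the final '/', so it has to be added back when rebuilding the line
line = 'https://www.website.com/2237243/alpha.html\n'
prefix, name = line.rsplit('/', 1)
print(repr(prefix))  # 'https://www.website.com/2237243'
print(repr(name))    # 'alpha.html\n'  (the newline from readlines() is preserved)
print(prefix + '/' + name, end='')  # reconstructs the original line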
def sortFiles(item):
    return item.split("/")[-1]

FILENAME = "test.txt"

contents = [line for line in open(FILENAME, "r").readlines() if line.strip()]
contents.sort(key=sortFiles)

with open(FILENAME, "w") as outfile:
    outfile.writelines(contents)
I am processing a text file whose columns are separated by tabs. I want to get all the unique values of the first column.
Text input, e.g.:
"a\t\xxx\t..\zzz\n
a\t\xxx\t....\n
b\t\xxx\t.....\n
b\t\xxx\t.....\n
c\t\xxx\t.....\n"
So in this case I would like to get an array: uniques = ["a", "b", "c"]
Code:
def getData(fin):
    input = open(fin, 'r', encoding='utf-16')
    headers = input.readline().split()
    lines = input.readlines()[1:]
    uniques = [(lambda line: itertools.takewhile(lambda char: char != '\t', line)) for line in lines]
Instead of the desired values i get a list of :
<function getData.<locals>.<listcomp>.<lambda> at 0x000000000C46DB70>
I have already read this article, Python: Lambda function in List Comprehensions, and I understood that you have to use parentheses to ensure the right execution order. Still, I get the same result.
You can just use split():
def getData(fin):
    input = open(fin, 'r', encoding='utf-16')
    headers = input.readline().split()
    lines = input.readlines()[1:]
    uniques = [line.split('\t')[0] for line in lines]
Note that this will not produce unique values, it will produce every line's value. To make this unique, do:
uniques = list(set(uniques))
Maybe csv can simplify your problem:
>>> import csv
>>> with open(fin, 'rb') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter='\t')
...     list(set(row[0] for row in spamreader))
...
['a', 'c', 'b']
When looking for unique elements, set() is a good solution:
def getData(fin):
    with open(fin, 'r') as input:
        first_cols = list(set([line.split("\\")[0] for line in input.readlines()]))
You can use regex:
import re
s = """
a\txxx\t..\zzz\n
a\txxx\t....\n
b\txxx\t.....\n
b\txxx\t.....\n
c\txxx\t.....\n"
"""
new_data = re.findall('(?<=\n\s\s\s)[a-zA-Z]', s)
uniques = [a for i, a in enumerate(new_data) if a not in new_data[:i]]
Output:
['a', 'b', 'c']
Try Pandas
import pandas as pd
df = pd.read_csv(filename, sep='\t')
uniques = df[df.columns[0]].unique()
After
lines = input.readlines()[1:]  # reads all the lines after the header you already read, and skips the first of them
add:
uniques = list(set(x.split('\t')[0] for x in lines))
Caveat: This might reorder your uniques
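If the original first-seen order matters, a hedged alternative is dict.fromkeys, which deduplicates while keeping insertion order (Python 3.7+); a minimal sketch with made-up lines:
lines = ["a\txxx\t..\n", "a\txxx\t....\n", "b\txxx\t.....\n", "c\txxx\t.....\n"]
# dict keys are unique and keep first-seen order, so this deduplicates without reordering
uniques = list(dict.fromkeys(x.split('\t')[0] for x in lines))
print(uniques)  # ['a', 'b', 'c']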
Your list comprehension's output expression needs to be the value you want, not a lambda. Currently your code just creates a list of lambda functions that are never called (the outermost parentheses merely wrap the lambda; they don't call it). You could fix it like this:
import itertools

def getData(fin):
    input = open(fin, 'r', encoding='utf-16')
    headers = input.readline().split()
    lines = input.readlines()[1:]
    uniques = [itertools.takewhile(lambda char: char != '\t', line) for line in lines]
There are still a couple of bugs in this code: (1) by the time you get to readlines(), the first row will already have been removed from the input buffer, so you should probably drop the [1:]. (2) your uniques variable will have all the entries from the first column, including duplicates.
You could fix these bugs and streamline the code a little more like this:
with open(fin, 'r', encoding='utf-16') as input:
    headers = next(input).split('\t')
    uniques = set(line.split('\t')[0] for line in input)
uniques = list(uniques)
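To see why the original comprehension printed function objects rather than values, here is a minimal reproduction with made-up data:
# the comprehension's output expression is a lambda, so the list holds functions, not strings
funcs = [(lambda line: line.split('\t')[0]) for line in ['a\tx', 'b\ty']]
print(funcs)             # [<function <lambda> at 0x...>, <function <lambda> at 0x...>]
print(funcs[0]('c\tz'))  # 'c' -- each element only yields a value once it is called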
If order doesn't matter, then try this approach:
Open the file and, since (as you said) the first column is always what you want, just take that from each line and ignore the rest.
with open('file.txt', 'r') as f:
    print(set([list(line)[0] for line in f]))
output:
{'b', 'a', 'c'}
I am having some trouble trying to read a particular column in a csv file into a list in Python. Below is an example of my csv file:
Col 1 Col 2
1,000,000 1
500,000 2
250,000 3
Basically I want to add column 1 to a list as integer values, and I am having a lot of trouble doing so. I have tried:
for row in csv.reader(csvfile):
    list = [int(row.split(',')[0]) for row in csvfile]
However, I get a ValueError that says "invalid literal for int() with base 10: '"1'
I then tried:
for row in csv.reader(csvfile):
    list = [(row.split(',')[0]) for row in csvfile]
This time I don't get an error however, I get the list:
['"1', '"500', '"250']
I have also tried changing the delimiter:
for row in csv.reader(csvfile):
    list = [(row.split(' ')[0]) for row in csvfile]
This almost gives me the desired list; however, the list includes the second column as well as "\n" after each value:
['"1,000,000", 1\n', etc...]
If anyone could help me fix this it would be greatly appreciated!
Cheers
You should choose your delimiter wisely: if your numbers use . as the decimal mark, you can use , as the delimiter; if your numbers themselves contain , characters, use ; as the delimiter instead.
Moreover, as described in the docs for csv.reader, you can use the delimiter= argument to define your delimiter, like so:
import csv

with open('myfile.csv', 'r') as csvfile:
    mylist = []
    for row in csv.reader(csvfile, delimiter=';'):
        mylist.append(row[0])  # careful here with [0]
or short version:
with open('myfile.csv', 'r') as csvfile:
    mylist = [row[0] for row in csv.reader(csvfile, delimiter=';')]
To parse your number (which contains thousands separators) to a float, you will have to strip the commas first:
float(row[0].replace(',', ''))
You can open the file and split at the space using regular expressions:
import re
file_data = [re.split(r'\s+', i.strip('\n')) for i in open('filename.csv')]
final_data = [int(i[0].replace(',', '')) for i in file_data[1:]]
First of all, you must parse your data correctly, because it's not, in fact, CSV (Comma-Separated Values) but rather TSV (Tab-Separated Values), which you need to tell the csv reader about (I'm assuming it's tab, but you can theoretically use any whitespace with a few tweaks):
for row in csv.reader(csvfile, delimiter="\t"):
Second of all, you should strip your integer values of any commas as they don't add new information. After that, they can be easily parsed with int():
int(row[0].replace(',', ''))
Third of all, you really, really should not iterate over the same data twice (once in the for loop and again inside the comprehension). Either use a list comprehension or a normal for loop, not both at the same time over the same variable. For example, with a list comprehension:
import csv
from io import StringIO

csvfile = StringIO("Col 1\tCol 2\n1,000,000\t1\n500,000\t2\n250,000\t3\n")
reader = csv.reader(csvfile, delimiter="\t")
next(reader, None) # skip the header
lst = [int(row[0].replace(',', '')) for row in reader]
Or with normal iteration:
csvfile = StringIO("Col 1\tCol 2\n1,000,000\t1\n500,000\t2\n250,000\t3\n")
reader = csv.reader(csvfile, delimiter="\t")
lst = []
for i, row in enumerate(reader):
    if i == 0:
        continue  # your custom header-handling code here
    lst.append(int(row[0].replace(',', '')))
In both cases, lst is set to [1000000, 500000, 250000] as it should. Enjoy.
By the way, shadowing the built-in name list with your own variable is an extremely bad idea.
UPDATE. There's one more option that I find interesting. Instead of setting the delimiter explicitly you can use csv.Sniffer to detect it e.g.:
csvdata = "Col 1\tCol 2\n1,000,000\t1\n500,000\t2\n250,000\t3\n"
csvfile = StringIO(csvdata)
dialect = csv.Sniffer().sniff(csvdata)
reader = csv.reader(csvfile, dialect=dialect)
and then just like the snippets above. This will continue working even if you replace tabs with semicolons or commas (would require quotes around your weird integers) or, possibly, something else.
I want to make a list of lists in python.
My code is below.
import csv

f = open('agGDPpct.csv', 'r')
inputfile = csv.DictReader(f)
list = []
next(f)  ## Skip first line (column headers)
for line in f:
    array = line.rstrip().split(",")
    list.append(array[1])
    list.append(array[0])
    list.append(array[53])
    list.append(array[54])
    list.append(array[55])
    list.append(array[56])
    list.append(array[57])
print list
I'm pulling only select columns from every row. My code pops this all into one list, as such:
['ABW', 'Aruba', '0.506252445', '0.498384331', '0.512418427', '', '', 'AND', 'Andorra', '', '', '', '', '', 'AFG', 'Afghanistan', '30.20560247', '27.09154001', '24.50744042', '24.60324707', '23.96716227'...]
But what I want is a list in which each row is its own list: [[a,b,c], [d,e,f], [g,h,i], ...] Any tips?
You are almost there. Make all your desired inputs into a list before appending. Try this:
import csv

with open('agGDPpct.csv', 'r') as f:
    reader = csv.reader(f)  # csv.reader yields lists, so columns can be picked by index
    next(reader)  # skip the header row
    rows = []
    for line in reader:
        rows.append([line[1], line[0], line[53], line[54], line[55], line[56], line[57]])
print(rows)
To end up with a list of lists, you have to make the inner lists with the columns from each row that you want, and then append that list to the outer one. Something like:
for line in f:
    array = line.rstrip().split(",")
    inner = []
    inner.append(array[1])
    # ...
    inner.append(array[57])
    list.append(inner)
Note that it's also not a good practice to use the name of the type ("list") as a variable name -- this is called "shadowing", and it means that if you later try to call list(...) to convert something to a list, you'll get an error because you're trying to call a particular instance of a list, not the list built-in.
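A tiny illustration of that shadowing problem (a hypothetical snippet):
list = [1, 2, 3]      # shadows the built-in list type
try:
    list("abc")       # now calls the list instance, not the built-in constructor
except TypeError as err:
    print(err)        # 'list' object is not callable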
To build on the csv module's capabilities, I'd do:
import csv
f = csv.reader(open('your.csv'))
next(f)
list_of_lists = [items[1::-1]+items[53:58] for items in f]
Note that
items is a list of fields, thanks to the csv.reader() object;
slicing returns sublists of items, so the + operator in this context means concatenation of lists;
the first slice expression, 1::-1, means start at index 1 and go backwards to the beginning, i.e. [items[1], items[0]].
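A quick demo of that slicing with a made-up row:
items = ['ABW', 'Aruba', '0.51', '0.49']
print(items[1::-1])               # ['Aruba', 'ABW'] -- index 1, then backwards to the start
print(items[1::-1] + items[2:4])  # ['Aruba', 'ABW', '0.51', '0.49'] -- list concatenation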
Referring to https://docs.python.org/2/library/csv.html#csv.DictReader
Instead of
for line in f:
Write
for line in inputfile:
And also use list.append([array[1],array[0],array[53],..]) to append a list to a list.
One more thing: referring to https://docs.python.org/2/library/stdtypes.html#iterator.next, use inputfile.next() instead of next(f).
After these changes, you get:
import csv

f = open('agGDPpct.csv', 'r')
inputfile = csv.reader(f)  # a plain reader yields lists, so columns can be picked by index
list = []
inputfile.next()  ## Skip first line (column headers)
for line in inputfile:
    list.append([line[1], line[0], line[53], line[54], line[55], line[56], line[57]])
print list
In addition to that, it is not good practice to use list as a variable name, since it shadows the built-in data structure of the same name. Rename that too.
You can further improve the above code using with; I will leave that to you.
Try it and see if it works.
I am saving a list to a csv using the writerow function from the csv module. Something went wrong when I opened the final file in MS Office Excel.
Before I encountered this issue, the main problem I was trying to deal with was getting the list saved one record per row; it was saving each line into a cell in row 1. I made some small changes, and now this happened. I am certainly very confused as a novice Python guy.
import csv

inputfile = open('small.csv', 'r')
header_list = []
header = inputfile.readline()
header_list.append(header)
input_lines = []
for line in inputfile:
    input_lines.append(line)
inputfile.close()

AA_list = []
for i in range(0, len(input_lines)):
    if (input_lines[i].split(',')[4]) == 'AA':  # column 4 has different names, including 'AA'
        AA_list.append(input_lines[i])

full_list = header_list + AA_list

resultFile = open("AA2013.csv", 'w+')
wr = csv.writer(resultFile, delimiter=',')
wr.writerow(full_list)
Thanks!
UPDATE:
The full_list looks like this: ['1,2,3,"MEM",...]
UPDATE2(APR.22nd):
Now I get three cells of data (the header in A1 and the rest in A2 and A3 respectively), all in the same column. Apparently, the newline characters are not working for three items in one big list. I think the more specific question now is: how do I save a list of records, each ending with '\n', to a csv?
Importing the csv module is not enough; you need to use it as well. Right now, you're appending each line to your list as an entire string instead of as a list of fields.
Start with
with open('small.csv', 'rb') as inputfile:
    reader = csv.reader(inputfile, delimiter=",")
    header_list = next(reader)
    input_lines = list(reader)
Now header_list contains all the headers, and input_lines contains a nested list of all the rows, each one split into columns.
I think the rest should be pretty straightforward.
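For completeness, here is a sketch of how the rest might look under the same assumptions (the filenames, the 'AA' condition, and the fifth-column index come from the question, and the 'rb'/'wb' modes follow the Python 2 style of the snippet above; this is one possible continuation, not the only one):
import csv

with open('small.csv', 'rb') as inputfile:
    reader = csv.reader(inputfile, delimiter=",")
    header_list = next(reader)
    input_lines = list(reader)

# keep only the rows whose fifth column equals 'AA'
AA_list = [row for row in input_lines if row[4] == 'AA']

# write the header plus the filtered rows, one csv row each
with open('AA2013.csv', 'wb') as resultFile:
    wr = csv.writer(resultFile, delimiter=',')
    wr.writerow(header_list)
    wr.writerows(AA_list)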
append() adds a single object to the end of a list. So when you write header_list.append(header), the whole header line is appended as one long string rather than as a list of fields. You should write
headers = header.split(',')
header_list.append(headers)
This splits the header row on commas, so headers is a list of the header fields, which is then appended to header_list.
The same thing goes for AA_list.append(input_lines[i]).
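A small illustration of the difference, with a hypothetical header line:
header = 'Col A,Col B,Col C\n'
rows = []
rows.append(header)             # adds one element: the whole string 'Col A,Col B,Col C\n'
rows.append(header.split(','))  # adds one element: the list ['Col A', 'Col B', 'Col C\n']
print(rows)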
I figured it out.
The difference between [val], val, and val.split(",") inside the writerow() call was:
[val]: the whole string goes into a single cell, so everything stays in the first column of Excel (the header in A1, "2013, 1, 2,..." in A2, and so on).
val: each character (letter, comma, or space) takes its own cell in Excel, because the string is treated as a sequence of characters.
val.split(","): splits the string on commas and puts each comma-separated piece into its own Excel cell.
Here is what I found out: 1. the right way to write the flat list out line by line is to use the with syntax; 2. split each string when writing the row, e.g. csvwriter.writerow(JD.split()).
full_list = header_list + AA_list

with open("AA2013.csv", 'w+') as resultFile:
    wr = csv.writer(resultFile, delimiter=",", lineterminator='\n')
    for val in full_list:
        wr.writerow(val.split(','))
This produces the wanted output.
Please correct any terms or syntax I have misused! Thanks.