I have been struggling with this for a long time. Is there an easy way, using NumPy or pandas or by fixing my code, to get the unique values within each column of a row where the values are separated by "|"?
I.e., the data:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212","qa|admin,co|master|NULL|NULL"
"2","john","doe","htw","2000","dev"
Output should be:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft","2003|207|212","qa|admin,co|master|NULL"
"2","john","doe","htw","2000","dev"
My broken code:
import csv
import pprint
your_list = csv.reader(open('out.csv'))
your_list = list(your_list)
#pprint.pprint(your_list)
string = "|"
cols_no=6
for line in your_list:
    i=0
    for col in line:
        if i==cols_no:
            print "\n"
            i=0
        if string in col:
            values = col.split("|")
            myset = set(values)
            items = list()
            for item in myset:
                items.append(item)
            print items
        else:
            print col+",",
        i=i+1
It outputs:
id, fname, lname, education, gradyear, attributes, 1, john, smith, ['harvard', 'ft', 'mit']
['2003', '212', '207']
['qa', 'admin,co', 'NULL', 'master']
2, john, doe, htw, 2000, dev,
Thanks in advance!
NumPy/pandas is a bit overkill for what you can achieve with csv.DictReader and csv.DictWriter plus a collections.OrderedDict, e.g.:
import csv
from collections import OrderedDict
# On Python 2.x, use `open('output.csv', 'wb')` instead
with open('input.csv') as fin, open('output.csv', 'w') as fout:
    csvin = csv.DictReader(fin)
    csvout = csv.DictWriter(fout, fieldnames=csvin.fieldnames, quoting=csv.QUOTE_ALL)
    csvout.writeheader()
    for row in csvin:
        for k, v in row.items():
            # OrderedDict.fromkeys drops duplicates but keeps first-seen order
            row[k] = '|'.join(OrderedDict.fromkeys(v.split('|')))
        csvout.writerow(row)
Gives you:
"id","fname","lname","education","gradyear","attributes"
"1","john","smith","mit|harvard|ft","2003|207|212","qa|admin,co|master|NULL"
"2","john","doe","htw","2000","dev"
If you don't care about the order when you have many items separated with |, this will work:
lst = ["id","fname","lname","education","gradyear","attributes",
"1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212","qa|admin,co|master|NULL|NULL",
"2","john","doe","htw","2000","dev"]
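def no_duplicate(string):
    return "|".join(set(string.split("|")))
result = list(map(no_duplicate, lst))  # list() so this also works on Python 3
print(result)
Result (the order within each field can vary between runs, since sets are unordered):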
['id', 'fname', 'lname', 'education', 'gradyear', 'attributes', '1', 'john', 'smith', 'ft|harvard|mit', '2003|207|212', 'NULL|admin,co|master|qa', '2', 'john', 'doe', 'htw', '2000', 'dev']
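If the order does matter, the split, deduplicate and join step can be tried without touching any files. The sketch below is not taken verbatim from either answer: it assumes Python 3.7+, where a plain dict keeps insertion order, and it uses io.StringIO in place of real files. The names dedupe_field and dedupe_csv are hypothetical helpers introduced only for illustration.

```python
import csv
import io


def dedupe_field(value, sep="|"):
    """Drop repeated tokens from a sep-delimited string, keeping first-seen order."""
    # Assumption: Python 3.7+, where dict preserves insertion order.
    return sep.join(dict.fromkeys(value.split(sep)))


def dedupe_csv(text):
    """Return CSV text with every field deduplicated (hypothetical helper)."""
    fin = io.StringIO(text)
    fout = io.StringIO()
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                            quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({k: dedupe_field(v) for k, v in row.items()})
    return fout.getvalue()


sample = (
    '"id","fname","lname","education","gradyear","attributes"\n'
    '"1","john","smith","mit|harvard|ft|ft|ft","2003|207|212|212|212",'
    '"qa|admin,co|master|NULL|NULL"\n'
    '"2","john","doe","htw","2000","dev"\n'
)
print(dedupe_csv(sample), end="")
```

Fields without a "|" pass through unchanged, because splitting and rejoining a single token gives the same string back.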
This is my current code:
import csv
data = {'name' : ['Dave', 'Dennis', 'Peter', 'Jess'],
        'language': ['Python', 'C', 'Java', 'Python']}
new_data = []
for row in data:
    new_row = {}
    for item in row:
        new_row[item['name']] = item['name']
    new_data.append(new_row)
print(new_data)
header = new_data[0].keys()
print(header)
with open('output.csv', 'w') as fh:
    csv_writer = csv.DictWriter(fh, header)
    csv_writer.writeheader()
    csv_writer.writerows(new_data)
What I am trying to achieve is that the dictionary keys are turned into the csv headers and the values turned into the rows.
But when running the code I get a TypeError: 'string indices must be integers' in line 21.
Problem
The issue here is for row in data. This is actually iterating over the keys of your data dictionary, and then you're iterating over the characters of the dictionary keys:
In [2]: data = {'name' : ['Dave', 'Dennis', 'Peter', 'Jess'],
   ...:         'language': ['Python', 'C', 'Java', 'Python']}
   ...:
   ...: new_data = []
   ...: for row in data:
   ...:     for item in row:
   ...:         print(item)
   ...:
n
a
m
e
l
a
n
g
u
a
g
e
Approach
What you actually need to do is use zip to capture both the name and favorite language of each person at the same time:
In [43]: for row in zip(*data.values()):
    ...:     print(row)
    ...:
('Dave', 'Python')
('Dennis', 'C')
('Peter', 'Java')
('Jess', 'Python')
Now, you need to zip those tuples with the keys from data:
In [44]: header = data.keys()
    ...: for row in zip(*data.values()):
    ...:     print(list(zip(header, row)))
    ...:
[('name', 'Dave'), ('language', 'Python')]
[('name', 'Dennis'), ('language', 'C')]
[('name', 'Peter'), ('language', 'Java')]
[('name', 'Jess'), ('language', 'Python')]
Solution
Now you can pass these tuples to the dict constructor to create your rowdicts which csv_writer.writerows requires:
header = data.keys()
new_data = []
for row in zip(*data.values()):
    new_data.append(dict(zip(header, row)))
with open("output.csv", "w+", newline="") as f_out:
    csv_writer = csv.DictWriter(f_out, header)
    csv_writer.writeheader()
    csv_writer.writerows(new_data)
Output in output.csv:
name,language
Dave,Python
Dennis,C
Peter,Java
Jess,Python
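To check the zip-based conversion without writing to disk, here is a minimal sketch that writes into an in-memory buffer instead. It assumes every list in data has the same length (zip silently stops at the shortest one), and columns_to_rows is a hypothetical helper name, not part of the csv module.

```python
import csv
import io

data = {'name': ['Dave', 'Dennis', 'Peter', 'Jess'],
        'language': ['Python', 'C', 'Java', 'Python']}


def columns_to_rows(columns):
    """Turn {header: [values...]} into a list of one dict per row."""
    header = list(columns)
    # zip(*values) transposes the columns into row tuples
    return [dict(zip(header, values)) for values in zip(*columns.values())]


rows = columns_to_rows(data)

# io.StringIO stands in for the output file in this sketch
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(data), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue(), end="")
```

If the intermediate dicts are not needed, csv.writer with writerow(header) followed by writerows(zip(*data.values())) would write the same file.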
I am trying to convert the lines of a file into the format below (a list of tuples), but I am still struggling.
Sample lines from the file:
USER1.TEST1,SCHEMA2.TEST2
USER5.TEST,USER3.TEST1,RATE=100
SCHEMA5.TEST5,CORE12.TEST3,RATE=500
Expected output
[('USER1','TEST1','SCHEMA2','TEST2'),('USER5','TEST','USER3','TEST1','RATE=100'),('SCHEMA5','TEST5','CORE12','TEST3','RATE=500')]
Code I am trying:
o_list = []
with open (i_list,'rb') as f:
    if not 'tab' in i_list:
        r = csv.reader(f)
    else:
        for line in f.readlines():
            f, s = line.strip().split('.')
            s = s.split(',')
            o_list.append((f,) + tuple(s))
return o_list
Try using a one-liner list comprehension with a re.split:
import re
with open('filename.txt', 'r') as f:
    # raw string so that \. is passed to the regex engine unchanged
    print([re.split(r'\.|,', i.rstrip()) for i in f])
Output:
[['USER1', 'TEST1', 'SCHEMA2', 'TEST2'], ['USER5', 'TEST', 'USER3', 'TEST1', 'RATE=100'], ['SCHEMA5', 'TEST5', 'CORE12', 'TEST3', 'RATE=500']]
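Since the question asks for tuples rather than lists, a small variation may be closer to the expected output. This is only a sketch: it uses an in-memory list of lines instead of a file, and SEPARATORS is a name introduced here for illustration.

```python
import re

# Stand-in for the lines of the input file
lines = [
    "USER1.TEST1,SCHEMA2.TEST2\n",
    "USER5.TEST,USER3.TEST1,RATE=100\n",
    "SCHEMA5.TEST5,CORE12.TEST3,RATE=500\n",
]

# A character class matches either separator: '.' or ','
SEPARATORS = re.compile(r"[.,]")

o_list = [tuple(SEPARATORS.split(line.rstrip())) for line in lines]
print(o_list)
```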
It seems like as long as you are using the csv module, you should leverage the fact that it will split on a delimiter of your choice. So you can split on , and then on each line split the fields if you need to. You can use itertools.chain to flatten the sublists into tuples:
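import csv
from itertools import chain
i_list = 'test.csv'  # assumption: i_list is the input file name from the question
with open(i_list) as f:
    if not 'tab' in i_list: # not sure what i_list is
        r = csv.reader(f)
    else:
        r = csv.reader(f, delimiter=',') # split on `,` first
    # build the list inside the with block, while the file is still open
    o_list = [tuple(chain.from_iterable(s.split('.') for s in line)) for line in r]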
o_list:
[('USER1', 'TEST1', 'SCHEMA2', 'TEST2'),
('USER5', 'TEST', 'USER3', 'TEST1', 'RATE=100'),
('SCHEMA5', 'TEST5', 'CORE12', 'TEST3', 'RATE=500')]
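The same csv plus itertools.chain idea can be exercised on the sample lines directly. The sketch below assumes comma-delimited input and substitutes io.StringIO for the real file; split_rows is a hypothetical helper name.

```python
import csv
import io
from itertools import chain

sample = (
    "USER1.TEST1,SCHEMA2.TEST2\n"
    "USER5.TEST,USER3.TEST1,RATE=100\n"
    "SCHEMA5.TEST5,CORE12.TEST3,RATE=500\n"
)


def split_rows(lines):
    """Split each row on ',' via csv.reader, then each field on '.', and flatten."""
    reader = csv.reader(lines)
    return [tuple(chain.from_iterable(field.split('.') for field in row))
            for row in reader]


o_list = split_rows(io.StringIO(sample))
print(o_list)
```

Fields without a dot, such as RATE=100, survive unchanged because split('.') returns a one-element list for them.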
I have the following tab-delimited text file:
1 John 27 doctor Chicago
2 Nick 33 engineer Washington
I am trying to convert it into a python dictionary where the key is the NAME and the age, career and address are the values. I would like to exclude the rankings (1, 2).
Code:
myfile = open ("filename", "r")
d = { }
for line in myfile:
    x = line.strip().split("\t")
    key, values = int(x[0]), x[1:]
    d.setdefault(key, []).extend(values)
print(d)
You can convert it to a dict indexed by name, with the values stored in tuples instead:
d = {}
with open('filename', 'r') as myfile:
    for line in myfile:
        _, name, *values = line.strip().split("\t")
        d[name] = tuple(values)  # starred unpacking yields a list, so convert it
print(d)
With your sample input, this will output:
{'John': ('27', 'doctor', 'Chicago'), 'Nick': ('33', 'engineer', 'Washington')}
You don't explain what difficulties you are facing.
However, judging from that sample of tab-delimited text, I assume you want a dict like:
{'John': ['27', 'doctor', 'Chicago'], 'Nick': ['33', 'engineer', 'Washington']}
If that's the output you want, here is your code with a small modification:
myfile = open ("filename", "r")
d = { }
for line in myfile:
    x = line.strip().split("\t")
    key, values = x[1], x[2:]  # use the name as the key and skip the ranking
    d.setdefault(key, []).extend(values)
print(d)
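Both answers can be checked against the sample data without creating a file. The following is a minimal sketch, assuming the fields really are separated by tab characters and that each name appears only once; load_people is a hypothetical helper name, and io.StringIO stands in for the opened file.

```python
import io

sample = "1\tJohn\t27\tdoctor\tChicago\n2\tNick\t33\tengineer\tWashington\n"


def load_people(lines):
    """Map each name to the list of fields that follow it, ignoring the ranking."""
    people = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _rank, name, *values = line.split("\t")
        people[name] = values
    return people


d = load_people(io.StringIO(sample))
print(d)
```

If a name can occur on several lines, the setdefault(...).extend(...) form from the question would accumulate the values instead of overwriting them.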
I've been trying to have a program print out a sorted list depending on the requested item. When I read the list from the CSV file, I'm not sure how to convert only 2 of the 4 values to integers; as it stands, the numbers are treated as strings, so the list doesn't sort properly.
Eg:
['Jess', 'F', '2009', '6302']
['Kat', 'F', '1999', '6000']
['Alexander', 'M', '1982', '50']
['Bill', 'M', '2006', '2000']
['Jack', 'M', '1998', '1500']
def sortD(choice):
    clear()
    csv1 = csv.reader(open('TestUnsorted.csv', 'r'), delimiter=',')
    sort = sorted(csv1, key=operator.itemgetter(choice))
    for eachline in sort:
        print (eachline)
    open('TestUnsorted.csv', 'r').close()
    #From here up is where I'm having difficulty
    with open('TestSorted.csv', 'w') as csvfile:
        fieldnames = ['Name', 'Gender', 'Year','Count']
        csv2 = csv.DictWriter(csvfile, fieldnames=fieldnames,
                              extrasaction='ignore', delimiter = ';')
        csv2.writeheader()
        for eachline in sort:
            csv2.writerow({'Name': eachline[0] ,'Gender': eachline[1],'Year':eachline[2],'Count':eachline[3]})
            List1.insert(0, eachline)
    open('TestSorted.csv', 'w').close
Here's what my TestUnsorted file looks like:
Jack,M,1998,1500
Bill,M,2006,2000
Kat,F,1999,6000
Jess,F,2009,6302
Alexander,M,1982,50
sort = sorted(csv1, key=lambda ch: (ch[0], ch[1], int(ch[2]), int(ch[3])))
That will sort the last two values as integers.
EDIT:
Upon further reading the question, I realize choice is the index of the list that you want to sort on. You could do this instead:
if choice < 2: # or however you want to determine whether to cast to int
    sort = sorted(csv1, key=operator.itemgetter(choice))
else:
    sort = sorted(csv1, key=lambda ch: int(ch[choice]))
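The branching above can also be folded into a single key function. This is a sketch under the assumption that columns 2 and 3 (Year and Count) are the numeric ones; NUMERIC_COLUMNS and sort_rows are names introduced here, and io.StringIO replaces TestUnsorted.csv.

```python
import csv
import io

sample = (
    "Jack,M,1998,1500\n"
    "Bill,M,2006,2000\n"
    "Kat,F,1999,6000\n"
    "Jess,F,2009,6302\n"
    "Alexander,M,1982,50\n"
)

# Assumption: these column indexes hold integers (Year and Count)
NUMERIC_COLUMNS = {2, 3}


def sort_rows(rows, choice):
    """Sort rows by column `choice`, comparing numeric columns as integers."""
    def key(row):
        value = row[choice]
        return int(value) if choice in NUMERIC_COLUMNS else value
    return sorted(rows, key=key)


rows = list(csv.reader(io.StringIO(sample)))
by_count = sort_rows(rows, 3)
by_name = sort_rows(rows, 0)
for row in by_count:
    print(row)
```

Only the sort key is converted, so the rows themselves still contain strings and can be written back out unchanged.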
I have a CSV file which I am opening through this code:
open(file,"r")
When I read the file I get the output:
['hello', 'hi', 'bye']
['jelly', 'belly', 'heli']
['red', 'black', 'blue']
I want the output to look something like this:
{'hello':['jelly','red'], 'hi':['belly','black'], 'bye':['heli','blue']}
but I have no idea how to do it.
You can use collections.defaultdict and csv.DictReader:
>>> import csv
>>> from collections import defaultdict
>>> with open('abc.csv') as f:
...     reader = csv.DictReader(f)
...     d = defaultdict(list)
...     for row in reader:
...         for k, v in row.items():
...             d[k].append(v)
...
>>> d
defaultdict(<type 'list'>,
{'hi': ['belly', 'black'],
'bye': ['heli', 'blue'],
'hello': ['jelly', 'red']})
csv = [
    ['hello', 'hi', 'bye'],
    ['jelly', 'belly', 'heli'],
    ['red', 'black', 'blue'],
]
csv = zip(*csv)
result = {}
for row in csv:
    result[row[0]] = row[1:]
yourHash = {}
with open(yourFile, 'r') as inFile:
    for line in inFile:
        line = line.rstrip().split(',')
        yourHash[line[0]] = line[1:]
This assumes that each key is unique to one line. If not, this would have to be modified to:
yourHash = {}
with open(yourFile, 'r') as inFile:
    for line in inFile:
        line = line.rstrip().split(',')
        if line[0] in yourHash:
            yourHash[line[0]] += line[1:]
        else:
            yourHash[line[0]] = line[1:]
Of course, you can use csv, but I figured that someone would definitely post that, so I gave an alternative way to do it. Good luck!
You can use csv, read the first line to get the header, create the number of lists corresponding to the header and then create the dict:
import csv
with open(ur_csv) as fin:
    reader=csv.reader(fin, quotechar="'", skipinitialspace=True)
    header=[[head] for head in next(reader)]
    for row in reader:
        for i, e in enumerate(row):
            header[i].append(e)
data={l[0]:l[1:] for l in header}
print(data)
# {'hi': ['belly', 'black'], 'bye': ['heli', 'blue'], 'hello': ['jelly', 'red']}
If you want something more terse, you can use Jon Clements' excellent solution:
with open(ur_csv) as fin:
    csvin = csv.reader(fin, quotechar="'", skipinitialspace=True)
    header = next(csvin, [])
    data=dict(zip(header, zip(*csvin)))
# {'bye': ('heli', 'blue'), 'hello': ('jelly', 'red'), 'hi': ('belly', 'black')}
But that will produce a dictionary of tuples, if that matters...
And if your csv file is huge, you may want to rewrite this to generate a dictionary row by row (similar to DictReader):
import csv
def key_gen(fn):
    with open(fn) as fin:
        reader=csv.reader(fin, quotechar="'", skipinitialspace=True)
        header=next(reader, [])
        for row in reader:
            yield dict(zip(header, row))
for e in key_gen(ur_csv):
    print(e)
# {'hi': 'belly', 'bye': 'heli', 'hello': 'jelly'}
# {'hi': 'black', 'bye': 'blue', 'hello': 'red'} etc...
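For a quick check of the transposing approach on the data from the question, here is a self-contained sketch. It assumes plain comma-separated input with no quoting quirks and reads from io.StringIO instead of a file; columns_by_header is a hypothetical helper name. Unlike the terse version above, it converts each column to a list.

```python
import csv
import io

sample = "hello,hi,bye\njelly,belly,heli\nred,black,blue\n"


def columns_by_header(lines):
    """Map each header name to the list of values found below it."""
    reader = csv.reader(lines)
    header = next(reader, [])  # an empty input yields an empty header
    # zip(*reader) transposes the remaining rows into columns
    return {name: list(column) for name, column in zip(header, zip(*reader))}


data = columns_by_header(io.StringIO(sample))
print(data)
```

Note that transposing with zip reads every row into memory, and a file containing only the header line produces an empty dict, so the row-by-row generator is the better fit for very large files.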