I have a CSV file:
id,name,surname,age
"1, Johny, Black, 25"
"2, Armando, White, 18"
"3, Jack, Brown, ''"
"4, Ronn, Davidson, ''"
"5, Bill, Loney, 35"
The first row is parsed as a list of fields, but each of the other rows is a single quoted string.
How can I convert this CSV into a dictionary that I can later filter and sort?
import csv
dicts = list()
with open("test.csv", "r", encoding="utf-8") as file:
    csv_reader = csv.reader(file)
    field_list = list()
    record_list = list()
    line_counter = 0
    for row in csv_reader:
        if line_counter == 0:
            field_list = row
            line_counter += 1
        else:
            records = row[0].split(',')
            record_list.append(records)
counter = 0
full = dict()
for record in record_list:
    for field in field_list:
        try:
            if field in full.keys():
                full[field].append(record[counter])
                counter += 1
            else:
                full[field] = [record[counter]]
            if counter == len(record):
                break
        except Exception as e:
            pass
print(full)
My code converts only 2 rows. I tried splitting the rows, but that didn't help.
The csv library documentation didn't help me either. Maybe someone knows a solution?
You never reset your counter to zero. The first time through the nested for loop, the code creates the dictionary keys from the first row in record_list while the counter stays at 0, so every key receives only that row's first value. On the second pass the counter increments up to 4. On every following pass the counter is out of range for the record, so an IndexError is raised and the except block silently swallows it.
I think the second half of your code should look like this:
full = dict()
for record in record_list:
    counter = 0
    for field in field_list:
        try:
            if field in full.keys():
                full[field].append(record[counter])
            else:
                full[field] = [record[counter]]
            counter += 1
        except Exception as e:
            pass
print(full)
The csv module already provides a reader that converts each row into a dictionary:
https://docs.python.org/3/library/csv.html#csv.DictReader
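Note that in the file from the question every data row is wrapped in quotes, so the csv module sees it as a single field and the row still has to be split by hand. Here is a minimal sketch of one way to get one dictionary per row and then filter and sort. The helper name read_quoted_rows is made up for illustration, and the sample text is copied from the question:

```python
import csv
import io

# Sample data copied from the question; each data row is one quoted field.
SAMPLE = '''id,name,surname,age
"1, Johny, Black, 25"
"2, Armando, White, 18"
"3, Jack, Brown, ''"
'''

def read_quoted_rows(file_obj):
    """Return a list of dicts, one per data row (hypothetical helper)."""
    reader = csv.reader(file_obj)
    header = next(reader)  # ['id', 'name', 'surname', 'age']
    rows = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        # The whole record sits in row[0], so split it manually.
        values = [value.strip() for value in row[0].split(",")]
        rows.append(dict(zip(header, values)))
    return rows

people = read_quoted_rows(io.StringIO(SAMPLE))

# Filter: keep rows with a numeric age of at least 21.
adults = [p for p in people if p["age"].isdigit() and int(p["age"]) >= 21]

# Sort: order rows by name.
by_name = sorted(people, key=lambda p: p["name"])
```

A list of dictionaries (one per row) tends to be easier to filter and sort than one dictionary of columns, because each record stays together.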
I am trying to open a csv file with csv.DictReader, read in just the first 5 rows of data, perform the primary process of my script, then read in the next 5 rows and do the same for them. Rinse and repeat.
I believe I have a method that works, however I am having issues with the last lines of the data not processing. I know I need to modify my if statement so that it also checks for if I'm at the end of the file, but am having trouble finding a way to do that. I've found methods online, but they involve reading in the whole file to get a row count but doing so would defeat the purpose of this script as I'm dealing with memory issues.
Here is what I have so far:
import csv
count = 0
data = []
with open('test.csv') as file:
    reader = csv.DictReader(file)
    for row in reader:
        count +=1
        data.append(row)
        if count % 5 == 0 or #something to check for the end of the file:
            #do stuff
            data = []
Thank you for the help!
You can pass the chunksize argument to pandas.read_csv. The reader then yields the file piece by piece, as DataFrames of at most that many rows:
import pandas as pd
reader = pd.read_csv('test.csv', chunksize=5)
for df in reader:
    # do stuff
You can handle the remaining lines after the for loop body. You can also use the more pythonic enumerate.
import csv
data = []
with open('test.csv') as file:
    reader = csv.DictReader(file)
    for count, row in enumerate(reader, 1):
        data.append(row)
        if count % 5 == 0:
            # do stuff
            data = []
print('handling remaining lines at end of file')
print(data)
considering the file
a,b
1,1
2,2
3,3
4,4
5,5
6,6
7,7
outputs
handling remaining lines at end of file
[OrderedDict([('a', '6'), ('b', '6')]), OrderedDict([('a', '7'), ('b', '7')])]
This is one approach using the iterator
Ex:
import csv
with open('test.csv') as file:
    reader = csv.DictReader(file)
    value = True
    while value:
        data = []
        for _ in range(5):  # Get 5 rows
            value = next(reader, False)
            if value:
                data.append(value)
        print(data)  # List of up to 5 elements (empty if the file ended on a multiple of 5)
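The same iterator idea can be written with itertools.islice, which takes up to n items from an iterator and simply returns fewer at the end of the file. This is only a sketch: iter_chunks is a made-up helper name, and the sample data is generated in memory instead of being read from test.csv.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, size):
    """Yield lists of up to `size` items from any iterable (hypothetical helper)."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return  # iterator exhausted, no empty chunk is yielded
        yield chunk

# Seven data rows under an "a,b" header, built in memory for the demo.
sample = "a,b\n" + "".join(f"{i},{i}\n" for i in range(1, 8))
reader = csv.DictReader(io.StringIO(sample))

chunks = list(iter_chunks(reader, 5))
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)
```

Only one chunk is held in memory at a time when you loop over iter_chunks directly instead of wrapping it in list(), and the last, shorter chunk needs no special end-of-file check.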
Staying along the lines of what you wrote and not including any other imports:
import csv
data = []
with open('test.csv') as file:
    reader = csv.DictReader(file)
    for row in reader:
        data.append(row)
        if len(data) > 5:
            del data[0]
        if len(data) == 5:
            # Do something with the 5 elements
            print(data)
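The if statements let the list fill up to 5 elements before processing begins. Note that from then on every new row triggers processing of the most recent 5 rows, so this is a sliding window rather than separate batches of 5.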
class ZeroItterNumberException(Exception):
    pass
class ItterN:
    def __init__(self, itterator, n):
        if n<1:
            raise ZeroItterNumberException("{} is not a valid number of rows.".format(n))
        self.itterator = itterator
        self.n = n
        self.cache = []
    def __iter__(self):
        return self
    def __next__(self):
        self.cache.append(next(self.itterator))
        if len(self.cache) < self.n:
            return self.__next__()
        if len(self.cache) > self.n:
            del self.cache[0]
        if len(self.cache) == self.n:  # window is full: return the last n items
            return self.cache
My problem:
I am trying to compare two elements from two different arrays but the operator is not working.
Code Snippet in question:
for i in range(row_length):
    print(f"ss_record: {ss_record[i]}")
    print(f"row: {row[i + 1]}")
    #THIS IF STATEMENT IS NOT WORKING
    if ss_record[i] == row[i + 1]:
        count += 1
#print()
#print(f"row length: {row_length}")
#print(f"count: {count}")
if count == row_length:
    print(row[0])
    exit(0)
What I have done: I printed the values of ss_record and row before the if statement runs, but even when they match, count doesn't increase. I also tried storing the values of row in a new array, but that misbehaves: it stores only the array length and the first 2 values of row, and repeats those values on every subsequent iteration.
What I think the issue is: row is read from a CSV file and is not converted to integers. As a result the values look the same, but one is an integer while the other is a string.
Entire Code:
import csv
import sys
import re
from cs50 import get_string
from sys import argv
def main():
    line_count = 0
    if len(argv) != 3:
        print("missing command-line argument")
        exit(1)
    with open(sys.argv[1], 'r') as database:
        sequence = open(sys.argv[2], 'r')
        string = sequence.read()
        reader = csv.reader(database, delimiter = ',')
        for row in reader:
            if line_count == 0:
                row_length = len(row) - 1
                ss_record = [row_length]
                for i in range(row_length):
                    ss_record.append(ss_count(string, row[i + 1], len(row[i + 1])))
                ss_record.pop(0)
                line_count = 1
            else:
                count = 0
                for i in range(row_length):
                    print(f"ss_record: {ss_record[i]}")
                    print(f"row: {row[i + 1]}")
                    #THIS IF STATEMENT IS NOT WORKING
                    if ss_record[i] == row[i + 1]:
                        count += 1
                if count == row_length:
                    print(row[0])
                    exit(0)
#ss_count mean the # of times the substring appear in the string
def ss_count(string, substring, length):
    count = 1
    record = 0
    pos_array = []
    for m in re.finditer(substring, string):
        pos_array.append(m.start())
    for i in range(len(pos_array) - 1):
        if pos_array[i + 1] - pos_array[i] == length:
            count += 1
        else:
            if count > record:
                record = count
            count = 1
    if count > record:
        record = count
    return record
main()
Values to use to reproduce issue:
sequence (this is a text file) = AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
substring (this is a csv file) =
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
Gist of the CSV file:
The numbers beside each name (e.g. Alice) say how many times in a row each substring (STR, Short Tandem Repeat) appears in that person's string (DNA sequence). In this string, AGATC appears 4 times in a row, AATG appears 1 time in a row, and TATC appears 5 times in a row. This DNA sequence therefore matches Bob, and he is output as the answer.
You were right: the comparison ss_record[i] == row[i + 1] has a type problem. The numbers in ss_record are integers, while the numbers in row are strings. You can confirm this by printing both ss_record and row:
print("ss_record: {}".format(ss_record)) -> ss_record: [4, 1, 5]
print("row: {}".format(row)) -> row: ['Alice', '2', '8', '3']
In order for the snippet to work you just need to change the comparison to
ss_record[i] == int(row[i + 1])
That said, I feel the code is quite complex for the task. The string class implements a count method that returns the number of non-overlapping occurrences of a given substring. Be aware that count returns the total number of occurrences rather than the longest consecutive run, so the two can differ when repeats are scattered through the sequence. Also, since the code works item by item and relies heavily on index manipulation, the iteration logic is hard to follow (IMO). Here's my approach to the problem:
import csv
def match_user(dna_file, user_csv):
    with open(dna_file, 'r') as r:
        dna_seq = r.readline()
    with open(user_csv, 'r') as r:
        reader = csv.reader(r)
        rows = list(reader)
    target_substrings = rows[0][1:]
    users = rows[1:]
    num_matches = [dna_seq.count(target) for target in target_substrings]
    for user in users:
        user_matches = [int(x) for x in user[1:]]
        if user_matches == num_matches:
            return user[0]
    return "Not found"
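If the longest consecutive run is what matters, as the question describes ("times in a row"), a small helper can measure that directly. This is only a sketch: longest_run and the sample string are made up for illustration and are not part of the original code.

```python
def longest_run(sequence, unit):
    """Return the longest number of back-to-back repeats of `unit` (hypothetical helper)."""
    if not unit:
        return 0  # an empty unit would never advance the position
    best = 0
    step = len(unit)
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Count how many copies of `unit` follow each other from `start`.
        while sequence.startswith(unit, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best

# Made-up sequence: TATC occurs twice, but never twice in a row.
sample = "TATC" + "GG" + "AGATC" * 3 + "TT" + "TATC"

print(longest_run(sample, "AGATC"))  # longest consecutive run
print(longest_run(sample, "TATC"))   # longest consecutive run
print(sample.count("TATC"))          # total occurrences, for comparison
```

In match_user, replacing dna_seq.count(target) with longest_run(dna_seq, target) would make the comparison follow the "in a row" definition.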
Happy Coding!
I am using python 2.7, and I have a text file that looks like this:
id value
--- ----
1 x
2 a
1 z
1 y
2 b
I am trying to get an output that looks like this:
id value
--- ----
1 x,z,y
2 a,b
Much appreciated!
The simplest solution would be to use collections.defaultdict and collections.OrderedDict. If you don't care about order you could also use sets instead of OrderedDict.
from collections import defaultdict, OrderedDict
# Keeps all unique values for each id
dd = defaultdict(OrderedDict)
# Keeps the unique ids in order of appearance
ids = OrderedDict()
with open(yourfilename) as f:
    f = iter(f)
    # skip first two lines
    next(f), next(f)
    for line in f:
        id_, value = list(filter(bool, line.split())) # split at whitespace and remove empty ones
        dd[id_][value] = None # dicts need a value, but here it doesn't matter which one...
        ids[id_] = None
print('id value')
print('--- ----')
for id_ in ids:
    print('{} {}'.format(id_, ','.join(dd[id_])))
Result:
id value
--- ----
1 x,z,y
2 a,b
In case you want to write it to another file just concatenate what I printed with \n and write it to a file.
I think this could also work, although the other answer seems more sophisticated:
input =['1,x',
        '2,a',
        '1,z',
        '1,y',
        '2,b',
        '2,a', #added extra values to show duplicates won't be added
        '1,z',
        '1,y']
output = {}
for row in input:
    parts = row.split(",")
    id_ = parts[0]
    value = parts[1]
    if id_ not in output:
        output[id_] = value
    else:
        # split on commas so multi-character values are compared whole
        a_List = output[id_].split(",")
        if value not in a_List:
            output[id_] += "," + value
        else:
            pass
You end up with a dictionary similar to what you requested.
#read
fp=open('','r')
d=fp.read().split("\n")
fp.close()
x=len(d)
for i in range(len(d)):
    n= d[i].split()
    d.append(n)
d=d[x:]
m={}
for i in d:
    if i[0] not in m:
        m[i[0]]=[i[1]]
    else:
        if i[1] not in m[i[0]]:
            m[i[0]].append(i[1])
for i in m:
    print i,",".join(m[i])
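For anyone on Python 3, the same grouping can be sketched with dict.setdefault. The function name group_values is made up, the sample text is the table from the question held in a string, and the output order relies on dictionaries keeping insertion order (Python 3.7+):

```python
# The table from the question, kept in a string instead of a file for the demo.
TEXT = """id value
--- ----
1 x
2 a
1 z
1 y
2 b
"""

def group_values(lines):
    """Map each id to its unique values in order of appearance (hypothetical helper)."""
    groups = {}
    for line in lines[2:]:  # skip the two header lines
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        id_, value = parts
        values = groups.setdefault(id_, [])
        if value not in values:
            values.append(value)
    return groups

groups = group_values(TEXT.splitlines())

print("id value")
print("--- ----")
for id_, values in groups.items():
    print(id_, ",".join(values))
```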
I have a list of lists that I want to make into a dictionary. Basically it's a list of births based on date (year/month/day/day of week/births). I want to tally the total births for each day to see in total how many births on each day of the week.
List example:
[2000,12,3,2,12000],[2000,12,4,3,34000]...
days_counts = {1: 23000, 2: 43000, ..., 7: 11943}
Here's the code so far:
f = open('births.csv', 'r')
text = f.read()
text = text.split("\n")
header = text[0]
data = text[1:]
for d in data:
    split_data = d.split(",")
    print(split_data)
So basically I want to iterate over each day and add the births from duplicate days into the same key (obviously).
EDIT: I have to do this with an if statement that looks for the day of week as a key in the dict. If it's found, add the corresponding births to its value. If it's not in the dict, add the key and value. I can't import anything or use lambda functions.
Use a collections.Counter() object to track the counts per day-of-the-week. You also want to use the csv module to handle the file parsing:
import csv
from collections import Counter
per_dow = Counter()
with open('births.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        dow, births = map(int, row[-2:])
        per_dow[dow] += births
I've used a with statement to manage the file object; Python auto-closes the file for you when the with block ends.
Now that you have a Counter object (which is a dictionary with some extra powers), you can now find the day of the week with the most births; the following loop prints out days of the week in order from most to least:
for day, births in per_dow.most_common():
    print(day, births)
Without using external libraries or if statements, you can use exception handling:
birth_dict = {}
birth_list = [[2000,12,3,2,12000],[2000,12,4,3,34000]]
for birth in birth_list:
    try:
        birth_dict[birth[3]]+=birth[4]
    except KeyError:
        birth_dict[birth[3]]=birth[4]
print(birth_dict)
Ok, after playing around with the code and using print statements where I need them for tests, I finally did it without using any external libraries. A very special thanks to Tobey and the others.
Here's the code with tests:
f = open('births.csv', 'r')
text = f.read()
text = text.split("\n")
header = text[0]
data = text[1:-1]
days_counts = {}
for d in data:
    r = d.split(",")
    print(r) #<--- used to test
    k = r[3]
    print(k)#<--- used to test
    v = int(r[4])
    print(v)#<--- used to test
    if k in days_counts:
        days_counts[k] += v
        print("If : " , days_counts)#<--- used to test
    else:
        days_counts[k] = v
        print("Else : ", days_counts)#<--- used to test
print(days_counts)
Code without tests:
f = open('births.csv', 'r')
text = f.read()
text = text.split("\n")
header = text[0]
data = text[1:-1]
days_counts = {}
for d in data:
    r = d.split(",")
    k = r[3]
    v = int(r[4])
    if k in days_counts:
        days_counts[k] += v
    else:
        days_counts[k] = v
print(days_counts)
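For reference, the if/else can also be folded into a single line with dict.get, still without any imports. This is just a sketch: tally_by_day is a made-up name, the first two rows come from the question, and the third row is invented so that one day appears twice.

```python
rows = [
    [2000, 12, 3, 2, 12000],
    [2000, 12, 4, 3, 34000],
    [2000, 12, 10, 2, 11000],  # invented row so that day 2 occurs twice
]

def tally_by_day(rows):
    """Sum births per day of week (hypothetical helper, no imports needed)."""
    days_counts = {}
    for row in rows:
        day, births = row[3], row[4]
        # get() returns 0 when the day has not been seen yet
        days_counts[day] = days_counts.get(day, 0) + births
    return days_counts

days_counts = tally_by_day(rows)
print(days_counts)
```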
I'm trying to step through a csv and assign date and time values to their own point in a 2d dictionary. This would be in a form such that an instance of:
'11/02/16' and '23:24' in their respective columns in a row would add '1' to the value in the position marked by 'X' in the dictionary 'Dates{11/01/16{23:X}}'.
Unfortunately I get a KeyError for the following code.
import csv
import sys
from sys import argv
from collections import defaultdict
script, ReadFile = argv
f = open(ReadFile,'r')
l = f.readlines()
f.close()
file_list = [row.replace('\n','').split(',') for row in l]
header = file_list[0]
Total = 0
Dates = defaultdict(dict)
print Dates
index_variable = header.index('Time')
index_variable2 = header.index('# Timestamp')
for row in file_list[1:]:
    t = row[index_variable][:2]
    d = row[index_variable2][:10]
    if row[index_variable2][:10] in Dates:
        Dates[d][t] = 1
        Total += 1
        print "true"
    else:
        Dates[d] = {}
        Dates[d][t] = 1
        Total =+ 1
        print "false"
print Dates
If I replace the local variable 't' with "'Test'" the code works, but obviously the results are not what I'm after.
Thanks in advance!
Update: If I replace 'd' with 'Test' and keep 't' as it is, the program works completely fine. It's only when the Dictionary is specifically called as 'Dates[d][t]' that the program returns a KeyError.
Update 2: I've updated the code above to show my work. Currently the script will work /as long as no numbers are added/.
Dates[d][t] = 1 #If I change this...
Dates[d][t] += 1 #To this...
A KeyError occurs.
Update 3:
I changed a portion of my code...
for row in file_list[1:]:
    t = row[index_variable][:2]
    d = row[index_variable2][:10]
    if d in Dates and t in Dates[d]:
        Dates[d][t] += 1
        print "true"
    else:
        Dates[d][t] = 1
        print "false"
And now the script works perfectly fine. I suppose that this means the KeyError was because I was not being specific enough (???).
Assuming that what we see above is just bad formatting of the if by the machine...
I think the problem is in the else:
Dates is a dict keyed by date.
d is the first 10 characters of the '# Timestamp' field in your input.
t is the first 2 characters of the 'Time' field, i.e. the hour.
You want to count how many times each hour got hit on a specific date.
Dates[d] therefore has to be a dictionary too, keyed by t.
A reference to Dates[d][t] implies that Dates[d] already exists and holds something subscriptable, and Dates[d][t] += 1 additionally requires that the key t already exists in it.
I tried this on my system
import csv
import sys
from sys import argv
from collections import defaultdict
#script, ReadFile = argv
#f = open(ReadFile,'r')
#l = f.readlines()
#f.close()
#file_list = [row.replace('\n','').split(',') for row in l]
#header = file_list[0]
file_list = [['Date','Time','Otherstuff'],
             ['2016-02-01','23:12:00','Sillyme1'],
             ['2016-02-01','23:12:04','Sillyme2'],
             ['2016-02-02','22:10:00','Sillyme3']]
header = file_list[0]
Dates = defaultdict(dict)
print(Dates)
index_variable = header.index('Time')
index_variable2 = header.index('Date')
for row in file_list[1:]:
    t = row[index_variable][:2]
    d = row[index_variable2][:10]
    if d in Dates.keys():
        Dates[d][t] +=1
        print("true")
    else:
        Dates[d] = {} #Now Dates[d] contains a dictionary
        Dates[d][t] = 1 ##Now we put the first counter in the Dates[d] dictionary with key t.
print(Dates)
Return was:
defaultdict(<class 'dict'>, {})
true
defaultdict(<class 'dict'>, {'2016-02-01': {'23': 2}, '2016-02-02': {'22': 1}})
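One way to avoid checking for both keys by hand is to let the inner dictionary supply a default of 0 as well, for example with defaultdict(Counter). This is a sketch, not the original poster's code: count_hours is a made-up name, and the rows are modeled on the test data above with one invented row added so that a date gets a second hour.

```python
from collections import Counter, defaultdict

# (date, time) pairs modeled on the test data, plus one invented 07:30 row.
rows = [
    ["2016-02-01", "23:12:00"],
    ["2016-02-01", "23:12:04"],
    ["2016-02-01", "07:30:00"],
    ["2016-02-02", "22:10:00"],
]

def count_hours(rows):
    """Count hits per date and hour (hypothetical helper)."""
    dates = defaultdict(Counter)  # a missing date gets a fresh Counter
    for date, time in rows:
        hour = time[:2]
        dates[date][hour] += 1    # a missing hour starts at 0
    return dates

dates = count_hours(rows)
total = sum(sum(hours.values()) for hours in dates.values())

for date, hours in dates.items():
    print(date, dict(hours))
print(total)
```

With this shape, += 1 never raises a KeyError, because both the outer and the inner lookup create their default on first access.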