Python Code Not Listing First Line in CSV File - python

I was working with an IMDb movie list, just to list movie names, links and my ratings. Here is the code:
import csv
r_list = open('ratings.csv')
rd = csv.reader(r_list, delimiter=',', quotechar='"')
movies = {}
for row in rd:
    movies[row[5]] = [row[0], row[8]]
print(len(movies))
The output is 500, but the actual number of lines is 501. It seems not to be counting the first line. But when I do the same thing for a list that contains 6 lines in total, it counts the first line and returns 6.
Why?

Because you are using a dictionary, and at least one value of row[5] occurs more than once. A dictionary cannot hold two identical keys, so Python silently overwrites the value stored under the existing key with the new value. Every repeated key therefore shortens the result by one entry.
For example:
data = [('rambo', 1995), ('lethal weapon', 1992), ('rambo', 1980)]
movies = {}
for row in data:
    # the second 'rambo' overwrites the first one
    movies[row[0]] = row[1]
print(len(data))        # -> 3
print(len(movies))      # -> 2
print(movies['rambo'])  # -> 1980
The solution is not to use a dictionary keyed on that column if you don't want rows with duplicate keys to replace each other.
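If the rows still need to be looked up by title, one option is to store a list of values under each key instead of a single value. The following is a minimal sketch using the same toy data as above; the names rows, by_title, total and duplicates are made up for illustration and are not from the original question.

```python
from collections import defaultdict

rows = [("rambo", 1995), ("lethal weapon", 1992), ("rambo", 1980)]

# Each key maps to a list, so a repeated title appends instead of overwriting.
by_title = defaultdict(list)
for title, year in rows:
    by_title[title].append(year)

# Total number of rows kept, counting every duplicate.
total = sum(len(years) for years in by_title.values())

# Titles that appeared more than once, with all of their values.
duplicates = {title: years for title, years in by_title.items() if len(years) > 1}

print(total)       # 3
print(duplicates)  # {'rambo': [1995, 1980]}
```

Printing the duplicates this way also shows which key caused the count to come up short.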

Related

Creating a nested dictionary inside a dictionary using CSV file and comparing the values with other dictionary

for row in db:
    key = row.pop('name')
    print(key)
    if key in row:
        pass
    database[key] = row
print(database)
If I use database[key] = {db.fieldnames[i]: row[i] for i in range(1, len(db.fieldnames))} instead of database[key] = row, I get 'KeyError 0'.
I have a CSV file whose first column is name, followed by a few other header columns. I want to create a dictionary with the name as the key and another, nested dictionary as the value. The keys of the nested dictionary must be the headers, and its values the data from the corresponding columns of that row.
I believe I have achieved that with the program above. Can someone confirm whether it is right? I also want to take the values of the main dictionary (i.e. the nested dictionaries themselves) and compare them with a dictionary I created in another part of the program. If one matches, the program should output its key (i.e. the name).
Example:
dict = {'alice':{'age':32,'marks':86},'ron':{'age':25,'marks':75}}
other = {'age':25,'marks':75}
By comparing other with dict.values(), I should get ron as the result.
Sample CSV file (the original file has more rows and columns):
name age marks
alice 32 86
ron 25 75
Someone please guide me how to proceed.
import csv

def database(file):
    d = {}
    f = csv.reader(file, delimiter=",")
    header = next(f)
    n = len(header)
    for row in f:
        # first column is the key, the remaining columns form the nested dict
        d[row[0]] = {header[i]: row[i] for i in range(1, n)}
    return d
Thank you @Avishka Dambawinna.
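The second half of the question, finding the name whose nested dictionary equals another dictionary, can be sketched as follows. This is only an illustration under some assumptions: the file is comma-delimited with a name column, and the helper names load_database and find_names are hypothetical. Note that the csv module returns every value as a string, so the sketch converts the values of other to strings before comparing.

```python
import csv
import io

def load_database(file):
    """Build {name: {header: value, ...}} from a CSV file object."""
    database = {}
    for row in csv.DictReader(file):
        name = row.pop("name")  # remaining columns form the nested dict
        database[name] = row
    return database

def find_names(database, other):
    """Return every name whose nested dict equals `other`."""
    # csv values are strings, so compare against stringified values
    wanted = {key: str(value) for key, value in other.items()}
    return [name for name, record in database.items() if record == wanted]

# In-memory stand-in for the sample file from the question.
sample = io.StringIO("name,age,marks\nalice,32,86\nron,25,75\n")
database = load_database(sample)
matches = find_names(database, {"age": 25, "marks": 75})
print(matches)  # ['ron']
```

Returning a list rather than a single name covers the case where several rows hold identical data.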

How to access an item in a list within a listed dictionary.

I currently have a lot of customer data in which each row is an individual interaction by a customer. Some customers have had multiple interactions, so many customers have multiple rows. Some variables in a multiple-interaction customer's rows are the same while others differ (e.g. the age may be the same, but the stores differ).
I have attempted to create a dictionary where the customer id is the key and the row data is attached to the id, so each key maps to a list of lists. I am trying to access a single item (one variable) from the first interaction of each unique customer.
import sys
import re
import csv
from collections import defaultdict

def extract_data(filename):
    customer_list = {}
    count = 0
    counter = 1
    file = open(filename, 'r')
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        if row[2] not in customer_list:
            customer_list[row[2]] = [row]
            count += 1
        else:
            customer_list[row[2]].append(row)
    print 'total number of customers: ', len(customer_list.keys())
    zipcodes = []
    numzips = 0
    for customer in customer_list:
        for item in customer.value():
            if item[1[7]] not in zipcodes:
                zipcodes.append(item[1[7]])
                numzips += 1
    print zipcodes
    print numzips
Note: I'm pretty sure I can't use item[1[7]] to reference the first list and then the 7th item in that list, but I also do not want to iterate over every inner list for each item. I have gotten a range of different errors and really do not know how to proceed.
Any help or advice would be much appreciated.
Assuming your dictionary looks something like this:
customer_dict = {
    "cust_id_1": [[item1], [item2, item3], [item4, item5]],
    "cust_id_2": [[item7], [item8], [item9, item10, item11]]
}
To access item4, you can use customer_dict["cust_id_1"][2][0]: the key selects the customer, the first index selects the inner list, and the second index selects the item within it.
Hope I got the intended dictionary right.
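Applied to the question, chained indexing gives the zip code of each customer's first interaction without looping over every inner list. The sketch below uses made-up sample rows and assumes, as in the question's code, that index 2 holds the customer id; treating index 7 as the zip code is also an assumption. The names customers, first_zips and unique_zips are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: index 2 is the customer id, index 7 the zip code.
rows = [
    ["2015-01-03", "storeA", "cust1", "34", "", "", "", "10001"],
    ["2015-01-09", "storeB", "cust2", "51", "", "", "", "94105"],
    ["2015-02-11", "storeC", "cust1", "34", "", "", "", "10002"],
    ["2015-03-20", "storeA", "cust3", "27", "", "", "", "94105"],
]

# Group the rows by customer id, keeping them in file order.
customers = defaultdict(list)
for row in rows:
    customers[row[2]].append(row)

# interactions[0] is the first row for that customer, [7] its zip code.
first_zips = {cid: interactions[0][7] for cid, interactions in customers.items()}

# A set removes repeated zip codes without a manual membership test.
unique_zips = set(first_zips.values())

print(len(customers))       # 3
print(sorted(unique_zips))  # ['10001', '94105']
```

Iterating with .items() yields both the key and its list of rows, which avoids the customer.value() call on a plain string key that fails in the original code.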

Python delete items from list to be added as column to list of dicts

I have two sets of data. Both have a little over 13000 rows, and one of them (the one I open as a CSV in the main function) has two columns that I need to match up to the other file (opened as a text file and put into a list of dictionaries in the example_05() function).
They are from the same source, and I need to make sure the data stays aligned when I add the last two parameters to each row in the list of dicts. The .csv file has about 20 more rows than the list of dicts, so it must contain extra or null data.
To delete these anomalous rows, I'm trying to compare the list of Q* values from the .csv file, index by index, with the 'Q*' value of each dictionary in the list of dictionaries (each dictionary is a row). They should be the same, so on a mismatch I delete the item from mass_list before adding it to the list of dictionaries, as I do at the end of the example_05() function.
When I try to compare them I get an 'IndexError: list index out of range' error at this line:
if row10['Q*'] != Q_list_2[check_index]:
Can anybody tell me why? Here's example_05() and the main function:
def example_05(filename):
    with open(filename,'r') as file : data = file.readlines()
    header, data = data[0].split(), data[1:]
    #...... convert each line to a dict, using header words keys
    global kept
    kept = []
    for line in data :
        line = [to_float(term) for term in line.split()]
        kept.append( dict( zip(header, line) ) )
    del mass_list[0]
    mass_list_2 = [to_float(j) for j in mass_list]
    del Q_list[0]
    Q_list_2 = [to_float(k) for k in Q_list]
    print "Number in Q_list_2 list = "
    print len(Q_list_2)
    check_index = 0
    delete_index = 0
    for row10 in kept:
        if row10['Q*'] != Q_list_2[check_index]:
            del mass_list_2[delete_index]
            del Q_list_2[delete_index]
            check_index+=1
            delete_index+=1
        else:
            check_index+=1
            delete_index+=1
            continue
    k_index=0
    for d in kept:
        d['log_10_m'] = mass_list_2[k_index]
        k_index+=1
    print "Number in mass_list_2 list = "
    print len(mass_list_2)

if __name__ == '__main__' :
    f = open('MagandMass20150401.csv')
    csv_f = csv.reader(f)
    mag_list = []
    mass_list = []
    Q_list = []
    for row in csv_f:
        mag_list.append(row[17])
        mass_list.append(row[18])
        Q_list.append(row[15])
    del csv_f
    f.close()
    example_05('summ20150401.txt')
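A likely cause of the IndexError, offered as a guess since the data files are not shown: each del shortens Q_list_2 while check_index keeps growing, and after a deletion the element that slid into the freed slot is skipped. The two sequences then drift apart, more rows get deleted, and the index eventually passes the end of the shortened list. A minimal sketch of one alternative follows. It assumes the text-file rows appear in the CSV in the same order, with the extra CSV rows scattered in between; align and its parameter names are hypothetical and not part of the original code.

```python
def align(kept_q, csv_q, csv_mass):
    """Return the csv_mass entries whose Q* value lines up with kept_q, in order.

    kept_q   -- Q* values from the text file (the reference sequence)
    csv_q    -- Q* values from the CSV, which may contain extra entries
    csv_mass -- mass values from the CSV, parallel to csv_q
    """
    aligned = []
    j = 0  # current position in the CSV lists
    for q in kept_q:
        # Skip CSV entries that have no counterpart in the text file.
        while j < len(csv_q) and csv_q[j] != q:
            j += 1
        if j == len(csv_q):
            raise ValueError("no CSV row matches Q* = %r" % (q,))
        aligned.append(csv_mass[j])
        j += 1
    return aligned

result = align([1.0, 2.0, 3.0],
               [1.0, 9.0, 2.0, 3.0, 8.0],
               ["a", "x", "b", "c", "y"])
print(result)  # ['a', 'b', 'c']
```

Building a new list instead of deleting from the lists being indexed keeps the positions stable while the loop runs.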

Write last three entries per name in a file

I have the following data in a file:
Sarah,10
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
I would like to keep the last three rows for each person. The output would be:
John,5
Sarah,7
Sarah,8
John,4
Sarah,2
In the example, the first row for Sarah was removed since there were three later rows. The rows in the output also maintain the same order as the rows in the input. How can I do this?
Additional Information
You are all amazing - Thank you so much. Final code which seems to have been deleted from this post is -
import collections
with open("Class2.txt", mode="r",encoding="utf-8") as fp:
    count = collections.defaultdict(int)
    rev = reversed(fp.readlines())
    rev_out = []
    for line in rev:
        name, value = line.split(',')
        if count[name] >= 3:
            continue
        count[name] += 1
        rev_out.append((name, value))
    out = list(reversed(rev_out))
    print (out)
Since this looks like CSV data, use the csv module to read and write it. As you read each line, store the rows grouped by the first column. Store the line number along with the row so that the rows can be written out in the same order as the input. Use a bounded deque (maxlen=3) to keep only the last three rows for each name. Finally, sort the rows by line number and write them out.
import csv
from collections import defaultdict, deque

# each name maps to a deque that keeps only its last three (line number, row) pairs
by_name = defaultdict(lambda: deque(maxlen=3))
with open('my_data.csv') as f_in:
    for i, row in enumerate(csv.reader(f_in)):
        by_name[row[0]].append((i, row))

# merge the kept pairs, sort them by line number, then discard the number
kept = sorted((item for value in by_name.values() for item in value),
              key=lambda item: item[0])
rows = [row for i, row in kept]

with open('out_data.csv', 'w', newline='') as f_out:
    csv.writer(f_out).writerows(rows)

Parsing CSV / tab-delimited txt file with Python

I currently have a CSV file which, when opened in Excel, has a total of 5 columns. Only columns A and C are of any significance to me and the data in the remaining columns is irrelevant.
Starting on line 8 and then working in multiples of 7 (ie. lines 8, 15, 22, 29, 36 etc...), I am looking to create a dictionary with Python 2.7 with the information from these fields. The data in column A will be the key (a 6-digit integer) and the data in column C being the respective value for the key. I've tried to highlight this below but the formatting isn't the best:-
A B C D
1 CDCDCDCD
2 VDDBDDB
3
4
5
6
7 DDEFEEF FEFEFEFE
8 123456 JONES
9
10
11
12
13
14
15 293849 SMITH
As per the above, I am looking to extract the value from A7 (DDEFEEF) as a key in my dictionary, with "FEFEFEFE" as its value, and then add another entry to my dictionary by jumping to line 15, with "293849" as the key and "SMITH" as its value.
Any suggestions? The source file is a .txt file with entries being tab-delimited.
Thanks
Clarification:
Just to clarify, so far, I have tried the below:-
import csv
mydict = {:}
f = open("myfile", 'rt')
reader = csv.reader(f)
for row in reader:
    print row
The above simply prints out all of the content, one row at a time. I did try "for row(7) in reader" but this returned an error. I then researched it and had a go at the below, but it didn't work either:
import csv
from itertools import islice
entries = csv.reader(open("myfile", 'rb'))
mydict = {'key' : 'value'}
for i in xrange(6):
    mydict['i(0)] = 'I(2) # integers representing columns
    range = islice(entries,6)
    for entry in range:
        mydict[entries(0) = entries(2)] # integers representing columns
Start by turning the text into a list of lists. That will take care of the parsing part:
lol = list(csv.reader(open('text.txt', 'rb'), delimiter='\t'))
The rest can be done with indexed lookups:
d = dict()
key = lol[6][0] # cell A7
value = lol[6][3] # cell D7
d[key] = value # add the entry to the dictionary
...
Although there is nothing wrong with the other solutions presented, you could simplify your solution and make it scale much better by using Python's excellent pandas library.
Pandas is a library for handling data in Python, preferred by many data scientists.
Pandas has a simple CSV interface for reading and parsing files, which can be used to return a list of dictionaries, each containing a single line of the file. The keys will be the column names, and the values will be the contents of each cell.
In your case:
import pandas

def create_dictionary(filename):
    # read_csv replaces DataFrame.from_csv, which newer pandas versions no longer provide
    my_data = pandas.read_csv(filename, sep='\t', index_col=False)
    # Here you can delete the dataframe columns you don't want!
    del my_data['B']
    del my_data['D']
    # ...
    # Now you transform the DataFrame to a list of dictionaries, one per row
    list_of_dicts = my_data.to_dict('records')
    return list_of_dicts

# Usage:
x = create_dictionary("myfile.csv")
If the file is large, you may not want to load it entirely into memory at once. This approach avoids that by reading one line at a time. (Of course, making a dict out of it will still take up some RAM, but it holds only every seventh row and only two of the columns.)
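my_dict = {}
with open("myfile") as f:
    # number lines from 1 so they match the line numbers in the question
    for i, line in enumerate(f, start=1):
        # keep only lines 8, 15, 22, ...
        if i < 8 or (i - 8) % 7:
            continue
        # columns A and C are the tab-separated fields at index 0 and 2
        k, v = line.rstrip("\n").split("\t")[:3:2]
        my_dict[k] = v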
Edit: Not sure where I got extend from before. I meant update
