Parsing CSV / tab-delimited txt file with Python - python

I currently have a CSV file which, when opened in Excel, has a total of 5 columns. Only columns A and C are of any significance to me and the data in the remaining columns is irrelevant.
Starting on line 8 and then working in multiples of 7 (i.e. lines 8, 15, 22, 29, 36, etc.), I am looking to create a dictionary in Python 2.7 from these fields. The data in column A will be the key (a 6-digit integer) and the data in column C will be the respective value for that key. I've tried to illustrate this below, but the formatting isn't the best:-
A B C D
1 CDCDCDCD
2 VDDBDDB
3
4
5
6
7 DDEFEEF FEFEFEFE
8 123456 JONES
9
10
11
12
13
14
15 293849 SMITH
As per the above, I am looking to extract the value from A7 ("DDEFEEF") as a key in my dictionary, with "FEFEFEFE" being the respective data, and then add another entry to my dictionary, jumping to line 15, with "293849" as my key and "SMITH" as the respective value.
Any suggestions? The source file is a .txt file with entries being tab-delimited.
Thanks
Clarification:
Just to clarify, so far, I have tried the below:-
import csv
mydict = {}
f = open("myfile", 'rt')
reader = csv.reader(f)
for row in reader:
    print row
The above simply prints out all of the content, a row at a time. I did try "for row(7) in reader" but this returned an error. I then researched it and had a go at the below, but it didn't work either:
import csv
from itertools import islice
entries = csv.reader(open("myfile", 'rb'))
mydict = {'key' : 'value'}
for i in xrange(6):
    mydict['i(0)] = 'I(2) # integers representing columns
range = islice(entries, 6)
for entry in range:
    mydict[entries(0) = entries(2)] # integers representing columns

Start by turning the text into a list of lists. That will take care of the parsing part:
lol = list(csv.reader(open('text.txt', 'rb'), delimiter='\t'))
The rest can be done with indexed lookups:
d = dict()
key = lol[6][0]    # cell A7
value = lol[6][2]  # cell C7
d[key] = value # add the entry to the dictionary
...
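For completeness, the stepping over lines 8, 15, 22, ... can be sketched like this. This is a minimal, self-contained example: an in-memory sample (my own invention) stands in for the real file, since csv.reader accepts any iterable of strings, not just a file object.

```python
import csv

# Sample standing in for the real tab-delimited file: only lines 8 and 15
# (list indices 7 and 14) carry data in columns A and C.
sample = ["\t\t\t\t"] * 15
sample[7] = "123456\t\tJONES\t\t"     # line 8
sample[14] = "293849\t\tSMITH\t\t"    # line 15

lol = list(csv.reader(sample, delimiter="\t"))

d = {}
for i in range(7, len(lol), 7):       # indices 7, 14, 21, ... = lines 8, 15, 22, ...
    row = lol[i]
    if row and row[0]:                # skip rows with no data in column A
        d[int(row[0])] = row[2]       # column A -> key, column C -> value

print(d)  # {123456: 'JONES', 293849: 'SMITH'}
```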

Although there is nothing wrong with the other solutions presented, you could simplify and greatly streamline your solution by using Python's excellent pandas library.
Pandas is a library for handling data in Python, preferred by many Data Scientists.
Pandas has a simplified CSV interface to read and parse files, that can be used to return a list of dictionaries, each containing a single line of the file. The keys will be the column names, and the values will be the ones in each cell.
In your case:
import pandas

def create_dictionary(filename):
    # read_csv is the supported API (DataFrame.from_csv has been removed)
    my_data = pandas.read_csv(filename, sep='\t', index_col=False)
    # Here you can delete the DataFrame columns you don't want!
    del my_data['B']
    del my_data['D']
    # ...
    # Now transform the DataFrame into a list of dictionaries
    list_of_dicts = my_data.to_dict(orient='records')
    return list_of_dicts
# Usage:
x = create_dictionary("myfile.csv")

If the file is large, you may not want to load it entirely into memory at once. This approach avoids that. (Of course, making a dict out of it could still take up some RAM, but it's guaranteed to be smaller than the original file.)
my_dict = {}
for i, line in enumerate(file, start=1):
    # lines are numbered from 1, so line 8 has i == 8
    if i < 8 or (i - 8) % 7:
        continue
    k, v = line.rstrip("\n").split("\t")[:3:2]  # columns A and C
    my_dict[k] = v
Edit: Not sure where I got extend from before. I meant update.
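Here is a self-contained check of this approach on an in-memory file (the sample data is my own). enumerate is started at 1 so that i matches the 1-based line numbers 8, 15, 22, ..., and the slice [:3:2] picks columns A and C:

```python
import io

# 15 lines of tab-separated data; only lines 8 and 15 matter.
text = "\n".join(["x\tx\tx"] * 7 + ["123456\t\tJONES"] +
                 ["x\tx\tx"] * 6 + ["293849\t\tSMITH"]) + "\n"

my_dict = {}
for i, line in enumerate(io.StringIO(text), start=1):
    if i < 8 or (i - 8) % 7:          # keep only lines 8, 15, 22, ...
        continue
    k, v = line.rstrip("\n").split("\t")[:3:2]  # columns A and C
    my_dict[k] = v

print(my_dict)  # {'123456': 'JONES', '293849': 'SMITH'}
```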

Related

Creating a nested dictionary inside a dictionary using CSV file and comparing the values with other dictionary

for row in db:
    key = row.pop('name')
    print(key)
    if key in row:
        pass
    database[key] = row
print(database)
database[key] = {db.fieldnames[i]: row[i] for i in range(1, len(db.fieldnames))}
If I try this instead of **database[key] = row**, I'm getting 'KeyError: 0'.
I have a CSV file whose first column is a name, followed by a few header columns. I wanted to create a dictionary with the name as the key and, as the value, a nested dictionary whose keys are the headers and whose values are that row's column data.
So far I believe I was able to achieve that using the above program. Can someone confirm whether it's right or not? I also want to take the values of the main dictionary (i.e. the nested dictionaries themselves) and compare them with a dictionary I created in another part of the program. If one matches, it should give back the key (i.e. the name).
Example:
dict = {'alice':{'age':32,'marks':86},'ron':{'age':25,'marks':75}}
other = {'age':25,'marks':75}
By comparing the other and dict.values, I should get ron as a result.
sample csv file(original file has multiple rows and columns)
name age marks
alice 32 86
ron 25 75
Someone please guide me how to proceed.
import csv

def database(file):
    d = {}
    f = csv.reader(file, delimiter=",")
    header = next(f)
    n = len(header)
    for row in f:
        d[row[0]] = {header[i]: row[i] for i in range(1, n)}
    return d
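For the second part of the question (returning the name whose nested dictionary matches `other`), a comprehension over the items works. Note that csv.reader yields strings, so the dictionary you compare against must also use strings. A small sketch with hard-coded data:

```python
# Nested dict in the shape produced by the database() function above
# (all values are strings, because that is what csv.reader yields).
d = {'alice': {'age': '32', 'marks': '86'},
     'ron': {'age': '25', 'marks': '75'}}
other = {'age': '25', 'marks': '75'}

# Collect every name whose nested dictionary equals `other`.
matches = [name for name, info in d.items() if info == other]
print(matches)  # ['ron']
```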
Thank you @Avishka Dambawinna.

Unable to generate correct hash table for columns in CSV file

I have a CSV file with the following columns,
ZoneMaterialName1,ZoneThickness1
Copper,2.5
Copper,2.5
Aluminium,3
Zinc,
Zinc,
Zinc,6
Aluminium,4
As can be seen, some values repeat multiple times, and the thickness can occasionally be blank or a period.
I would like a hash table with unique values only, like
ZoneMaterialName1,ZoneThickness1
Copper:[2.5]
Aluminium:[3,4]
Zinc:[6]
Here is the code I came up with; the output is missing float numbers like 2.5, and it lets whitespace and periods through as well.
import csv
from collections import defaultdict

afile = open('/mnt/c/python_test/Book2.csv', 'r+')
csvReader1 = csv.reader(afile)
reader = csv.DictReader(open('/mnt/c/python_test/Book2.csv'))
nodes = defaultdict(type(''))
for row in reader:
    if (row['ZoneThickness1'] != ' ' and row['ZoneThickness1'] != '.'):
        nodes[row['ZoneMaterialName1']] += (row['ZoneThickness1'])
new_dict = {a: list(set(b)) for a, b in nodes.items()}
print new_dict
Approach: I create a dictionary first and then convert its values to a set.
I suggest you try to cast the second column to float and add only those values that are valid floating-point numbers.
Also, you can use a set to avoid duplicate values for the same material.
This could be done like this (I used Python 3.x since you tagged this question for both Python versions):
import collections
import csv

result = collections.defaultdict(set)
with open('test.txt', 'r') as f:
    csv_r = csv.DictReader(f)
    for row in csv_r:
        try:
            v = float(row['ZoneThickness1'])
        except ValueError:
            # skip this line, because it is not a valid float
            continue
        # this will add the material if it doesn't exist yet and
        # will also add the value if it doesn't exist yet for this material
        result[row['ZoneMaterialName1']].add(v)

for k, v in result.items():
    print(k, v)
This gives the following output:
Copper {2.5}
Aluminium {3.0, 4.0}
Zinc {6.0}
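If you specifically want the list-valued form shown in the question (e.g. Aluminium:[3,4]), you can convert each set to a sorted list afterwards. A small follow-on sketch with the sets hard-coded:

```python
# Sets in the shape produced by the loop above, converted to sorted lists.
result = {'Copper': {2.5}, 'Aluminium': {3.0, 4.0}, 'Zinc': {6.0}}
as_lists = {k: sorted(v) for k, v in result.items()}
print(as_lists)  # {'Copper': [2.5], 'Aluminium': [3.0, 4.0], 'Zinc': [6.0]}
```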

Dynamically create lists of lists from CSV file data

I have a CSV file that I am reading as a configuration file, to create a list of lists to store data in.
The format of my CSV file is:
list_name, search_criteria
channel1, c1
channel2, c2
channel3, c3
I want to read in the CSV file and dynamically create the list of lists from the list_name data, as it could grow and shrink over time and I always want whatever is defined in the CSV file.
The list_name in the CSV file is a "prefix" to the list name I want to create dynamically. For example, read in "channel1", "channel2", "channel3" from the CSV file and create a list of lists where "mainList[]" is the core list and contains 3 lists within it named "channel1_channel_list", "channel2_channel_list", "channel3_channel_list".
I realize my naming conventions could be simplified, so please disregard. I'll rename once I have a working solution. I will be using the search criteria to populate the lists within mainList[].
Here is my incomplete code:
import csv

mainList = []
with open('list_config.csv') as input_file:
    dictReader = csv.DictReader(input_file)
    for row in dictReader:
        listName = row['list_name'] + '_channel_list'
Here's how to read your data and create a dict from it.
As well as reading data from files, the csv readers can read their data from a list of strings, which is handy for example code like this. With your data you need to specify skipinitialspace=True to skip over the spaces after the commas.
import csv

data = '''\
list_name, search_criteria
channel1, c1
channel2, c2
channel3, c3
'''.splitlines()

dictReader = csv.DictReader(data, skipinitialspace=True)
main_dict = {}
for row in dictReader:
    name = row['list_name'] + '_channel_list'
    criteria = row['search_criteria']
    main_dict[name] = criteria

# Print data sorted by list_name
keys = sorted(main_dict)
for k in keys:
    print(k, main_dict[k])
output
channel1_channel_list c1
channel2_channel_list c2
channel3_channel_list c3
This code is a little more complicated than Joe Iddon's version, but it's a little easier to adapt if you have more than two entries per row. OTOH, if you do only have two entries per row then you should probably use Joe's simpler approach.
Load in the file in a with statement, then use csv.reader which returns a reader object. This can then be converted to a dictionary by passing it into dict():
with open('list_config.csv') as input_file:
    dictionary = dict(csv.reader(input_file))
Now, the contents of dictionary is:
{'channel3': ' c3', 'channel1': ' c1', 'channel2': ' c2'}
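If the leading spaces in the values bother you, csv.reader also accepts skipinitialspace=True. A sketch on in-memory data (csv readers accept any iterable of strings):

```python
import csv

data = ["channel1, c1", "channel2, c2", "channel3, c3"]
# skipinitialspace=True drops the space after each comma.
dictionary = dict(csv.reader(data, skipinitialspace=True))
print(dictionary)  # {'channel1': 'c1', 'channel2': 'c2', 'channel3': 'c3'}
```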

Iterate over a column containing keys from a dict. Return matched keys from second dict keeping order of keys from first dict

I have been stuck on a problem for a couple of days with Python (2.7). I have 2 data sets, A and B, from 2 different populations, containing ordered positions along the chromosomes (defined by a name, e.g. rs4957684) and their corresponding frequencies in the 2 populations. Most of the positions in B match those in A. I need to get the frequencies in A and B of only those positions that match between A and B, and in the corresponding order along the chromosomes.
I created a csv file (df.csv) with 4 columns: keys from A (c1), values from A (c2), keys from B (c3), values from B (c4).
First I created 2 dicts, dA and dB, with keys and values (positions and frequencies respectively) from A and B, and looked for the keys that match between A and B. From the matched keys I generated 2 new dicts for A and B (dA2 and dB2).
The problem is that, since they are dicts, I cannot get the order of the matched positions in the chromosomes so I figured out another strategy:
Iterate along c1 and see whether any key from c3 matches the ordered keys in c1. If yes, return an ordered list with the values (of A and B) of the matched keys.
I wrote this code:
import csv
from collections import OrderedDict

with open('df.csv', mode='r') as infile:  # input file
    # to open the file in universal-newline mode
    reader = csv.reader(open('df.csv', 'rU'), quotechar='"', delimiter=',')
    dA = dict((rows[1], rows[2]) for rows in reader)
    dB = dict((rows[3], rows[4]) for rows in reader)

import sys
sys.stdout = open("df2.csv", "w")
for key, value in dB:
    if rows[3] in dA.key():
        print rows[2], rows[4]
Here the script seems to run, but I get no output.
# I also tried this:
for row in reader:
    if row[3] in dA.key():
        print row[4]
...and I have the same problem.
As I see, you imported OrderedDict but didn't use it. You should build an OrderedDict to preserve the key order. Note also that a csv reader can only be consumed once, so read the rows into a list before building both dictionaries, then iterate dict_a in order and look each key up in dict_b to print the two frequencies for every matched position:
rows = list(reader)
dict_a = OrderedDict((row[1], row[2]) for row in rows)
dict_b = dict((row[3], row[4]) for row in rows)
for key, value in dict_a.iteritems():
    if key in dict_b:
        print value, dict_b[key]
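A self-contained sketch of the whole matching step, with hypothetical rows standing in for df.csv (here the four columns are simply: key A, value A, key B, value B). Iterating the OrderedDict preserves A's chromosome order:

```python
from collections import OrderedDict

# Invented sample rows: (position A, frequency A, position B, frequency B).
rows = [
    ('rs100', '0.12', 'rs200', '0.30'),
    ('rs101', '0.40', 'rs100', '0.15'),
    ('rs102', '0.22', 'rs101', '0.41'),
]

dict_a = OrderedDict((r[0], r[1]) for r in rows)
dict_b = dict((r[2], r[3]) for r in rows)

# Iterating dict_a preserves the chromosome order of A's positions.
matched = [(k, v, dict_b[k]) for k, v in dict_a.items() if k in dict_b]
print(matched)  # [('rs100', '0.12', '0.15'), ('rs101', '0.40', '0.41')]
```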

Python Code Not Listing First Line in CSV File

I was working on IMDB movie list just to list movie names, links and my ratings. Here is the code:
import csv
r_list = open('ratings.csv')
rd = csv.reader(r_list, delimiter=',', quotechar='"')
movies = {}
for row in rd:
    movies[row[5]] = [row[0], row[8]]
print(len(movies))
The output is 500, but the actual number is 501; it is not showing the first line. But when I do the same thing for a list that contains 6 lines in total, it counts the first line and returns 6.
Why?
Because you are using a dictionary, and row[5] contains duplicates; each duplicate replaces the previous entry, thereby shortening your result by the number of duplicates (minus one). You cannot have two identical keys in a dictionary. Python handles this silently by overwriting the value stored under the existing key with the new value.
e.g.
data = [('rambo', 1995), ('lethal weapon', 1992), ('rambo', 1980)]
movies = {}
for row in data:
    movies[row[0]] = row[1]
print len(data)        # -> 3
print len(movies)      # -> 2
print movies['rambo']  # -> 1980
The solution is not to use a dictionary if you don't want duplicate keys to replace each other.
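If you do want to keep all the values for a repeated key, one common pattern (a sketch, not part of the original answer) is to collect them in a list per key:

```python
from collections import defaultdict

data = [('rambo', 1995), ('lethal weapon', 1992), ('rambo', 1980)]

# Every year for a title is kept, instead of the last one winning.
movies = defaultdict(list)
for title, year in data:
    movies[title].append(year)

print(dict(movies))  # {'rambo': [1995, 1980], 'lethal weapon': [1992]}
```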
