Print lines with constraint in Python

Suppose we have the following text file with column a and column b:
D000001 T109
D000001 T195
D000002 T115
D000002 T131
D000003 T073
D000004 T170
I wonder how to produce the following structure:
D000001 T109 T195
D000002 T115 T131
D000003 T073
D000004 T170
Pasted below is my initial skeleton in Python:
from __future__ import print_function

with open('descr2semtype_short.txt') as f:
    for line in f:
        line = line.rstrip()
        a, b = line.split()
        print(a + ' ' + b)

You can use itertools.groupby:
import itertools, operator

with open('descr2semtype_short.txt') as f:
    for key, items in itertools.groupby(
            (line.rstrip().split(None, 1) for line in f),
            operator.itemgetter(0)):
        print(key, ' '.join(item[1] for item in items))
which gives the desired output:
D000001 T109 T195
D000002 T115 T131
D000003 T073
D000004 T170
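Note that itertools.groupby only groups consecutive items, so this relies on the file already being sorted by the first column. If it were not, you could sort first, roughly like this:
import itertools, operator

with open('descr2semtype_short.txt') as f:
    # read everything, then sort by the first column before grouping
    rows = sorted((line.rstrip().split(None, 1) for line in f),
                  key=operator.itemgetter(0))

for key, items in itertools.groupby(rows, operator.itemgetter(0)):
    print(key, ' '.join(item[1] for item in items))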

Instead of printing them there, you can keep a dictionary of the lines, with the first element of the line as the key and the second element as the value (stored in a list, so that if another element comes in for the same key you can append to it), and then print everything at the end.
Example -
from __future__ import print_function

d = {}
with open('descr2semtype_short.txt') as f:
    for line in f:
        line = line.rstrip()
        a, b = line.split()
        if a not in d:
            d[a] = []
        d[a].append(b)

for k, v in d.iteritems():
    print(k + ' ' + ' '.join(v))
If the order of the lines is important, then from Python 2.7 onwards we can use an OrderedDict instead of a plain dictionary.
Example -
from __future__ import print_function
from collections import OrderedDict

d = OrderedDict()
with open('descr2semtype_short.txt') as f:
    for line in f:
        line = line.rstrip()
        a, b = line.split()
        if a not in d:
            d[a] = []
        d[a].append(b)

for k, v in d.items():
    print(k + ' ' + ' '.join(v))
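A closely related variant: collections.defaultdict(list) drops the membership check entirely, and from Python 3.7 onwards a plain dict preserves insertion order as well:
from __future__ import print_function
from collections import defaultdict

d = defaultdict(list)
with open('descr2semtype_short.txt') as f:
    for line in f:
        a, b = line.rstrip().split()
        d[a].append(b)          # a missing key gets an empty list automatically

for k, v in d.items():
    print(k + ' ' + ' '.join(v))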

I would do it with OrderedDict, this way:
from collections import OrderedDict

d = OrderedDict()
with open('1.txt', 'r') as f:
    for line in f:
        a, b = line.strip().split()
        print a, b
        if a not in d:
            d[a] = [b]
        else:
            d[a].append(b)

print d
Output:
OrderedDict([('D000001', ['T109', 'T109', 'T195']), ('D000002', ['T115', 'T115', 'T131']), ('D000003', ['T073', 'T073']), ('D000004', ['T170', 'T170', 'T175', 'T180'])])

Related

Group certain columns and summing up another column from a CSV

I have data in a csv that needs to be parsed. It looks like:
Date,Tag,Amount
13/06/2018,ABC,6750000
13/06/2018,ABC,159800
24/05/2018,ABC,-1848920
16/05/2018,AB,-1829700
16/05/2018,AB,3600000
28/06/2018,A,15938000
16/05/2018,AB,3748998
28/06/2018,A,1035000
28/06/2018,A,1035000
14/06/2018,ABC,2122717
You can see each date has a tag and an amount next to it.
What I am trying to achieve is to group by the date and tag (using both together as the key) and sum up the amount.
Expected result:
Date,Tag,Amount
13/06/2018,ABC,5220680
16/05/2018,AB,5519298
28/06/2018,A,18008000
14/06/2018,ABC,2122717
The code I am using now is below, but it is not working.
from collections import defaultdict
import csv

d = defaultdict(int)
with open("file.csv") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            date = int(tokens[0])
            tag = int(tokens[1])
            amount = int(tokens[2])
        except ValueError:
            continue
        d[date] += amount
print d
Could someone show me how to achieve this without using pandas?
You should definitely use pandas. Unless you are required to code this yourself, you can just install the pandas module, import it (import pandas as pd) and solve this problem with two simple and intuitive lines of code:
>>> df = pd.read_csv('file.csv')
>>> df.groupby(['Date', 'Tag']).Amount.sum()
Date Tag
13/06/2018 ABC 6909800
14/06/2018 ABC 2122717
16/05/2018 AB 5519298
24/05/2018 ABC -1848920
28/06/2018 A 18008000
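If the grouped result then needs to be written back out as CSV in the same Date,Tag,Amount shape, something like this should do it (sort=False keeps the groups in first-appearance order; the output filename is just an example):
out = df.groupby(['Date', 'Tag'], sort=False).Amount.sum().reset_index()
out.to_csv('grouped.csv', index=False)   # Date,Tag,Amount header, no row index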
If you really need to code this yourself, you can use a nested defaultdict so you get two levels of grouping. Also, there is no reason to cast the date and the tag to int; just remove those casts.
from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))
with open("file.csv") as f:
    for line in f:
        tokens = [t.strip() for t in line.split(",")]
        try:
            date = tokens[0]
            tag = tokens[1]
            amount = int(tokens[2])
        except ValueError:
            continue          # skips the header line, where int() fails
        d[date][tag] += amount
The output is:
13/06/2018 ABC 6909800
24/05/2018 ABC -1848920
16/05/2018 AB 5519298
28/06/2018 A 18008000
14/06/2018 ABC 2122717
To output the result above, just iterate through the items:
for k, v in d.items():
    for k2, v2 in v.items():
        print(k, k2, v2)
To make your code even better, read the header line separately and then iterate from the second line to the end. That way, the try/except can be removed and you get simpler, cleaner code. But you can pick up from here, right? ;)
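A rough sketch of that cleanup (assuming the same file.csv with a Date,Tag,Amount header line):
from collections import defaultdict

d = defaultdict(lambda: defaultdict(int))
with open("file.csv") as f:
    next(f)                                   # skip the header line
    for line in f:
        if not line.strip():
            continue                          # skip blank lines
        date, tag, amount = (t.strip() for t in line.split(","))
        d[date][tag] += int(amount)

for date, tags in d.items():
    for tag, total in tags.items():
        print(date, tag, total)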
To write to a csv, simply
s = '\n'.join(['{0} {1} {2}'.format(k, k2, v2) for k, v in d.items() for k2, v2 in v.items()])
with open('output.txt', 'w') as f:
    f.write(s)
This is one approach using a simple iteration.
Ex:
from collections import defaultdict
import csv

result = defaultdict(int)
with open(filename) as infile:
    reader = csv.reader(infile)
    header = next(reader)
    for line in reader:
        result[tuple(line[:2])] += int(line[2])

print(header)
for k, v in result.items():
    print(k[0], k[1], v)
Output:
14/06/2018 ABC 2122717
13/06/2018 ABC 6909800
28/06/2018 A 18008000
16/05/2018 AB 5519298
24/05/2018 ABC -1848920
To CSV
with open(filename, "wb") as outfile:
writer = csv.writer(outfile)
writer.writerow(header)
for k, v in result.items():
writer.writerow([k[0], k[1], v])
You can use itertools.groupby:
from itertools import groupby
import csv
header, *data = csv.reader(open('filename.csv'))
new_data = [[a, list(b)] for a, b in groupby(sorted(data, key=lambda x:x[:2]), key=lambda x:x[:2])]
results = [[*a, sum(int(c) for *_, c in b)] for a, b in new_data]
with open('calc_results.csv', 'w', newline='') as f:   # newline='' as recommended for csv.writer
    write = csv.writer(f)
    write.writerows([header, *results])
Output:
Date,Tag,Amount
13/06/2018,ABC,6909800
14/06/2018,ABC,2122717
16/05/2018,AB,5519298
24/05/2018,ABC,-1848920
28/06/2018,A,18008000

How to count the characters from the csv?

My CSV has the data below:
['value']
['abcd']
['def abc']
I want to count each character in descending order of frequency; value is the header in the CSV file. I have written one script below. Is there a better script than this?
from csv import DictReader
with open("name.csv") as f:
    a1 = [row["value"] for row in DictReader(f)]
#a1
from collections import Counter
counts = Counter()
for line in a1:
    counts.update(list(line))
x = dict(counts)
from collections import defaultdict
d = defaultdict(int)
for w in sorted(x, key=x.get, reverse=True):
    print(w, x[w])
from collections import defaultdict

path = "name.csv"
d_list = defaultdict(int)
with open(path, 'r') as fl:
    for word in fl:
        for ch in word:
            #if word[0] == ch:
            d_list[ch] += 1          # count every character (this also counts the header line)

del d_list['\n']
del d_list[' ']
#print (d_list)

dd = sorted(d_list.items(), key=lambda v: v[1], reverse=True)
#dd_lex = sorted(dd, key=lambda k: (-k[1], k[0]))
for el in dd:
    print(el[0] + ' ' + str(el[1]))
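A shorter alternative is to let collections.Counter do both the counting and the descending sort, for example (reading the same 'value' column and skipping spaces like the script above does):
from csv import DictReader
from collections import Counter

with open("name.csv") as f:
    counts = Counter(ch for row in DictReader(f) for ch in row["value"] if ch != ' ')

for ch, n in counts.most_common():        # already sorted by descending count
    print(ch, n)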

editing a text file in python and making a new one

I have a text file like this:
>ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQ
>ENST00000506822.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370662.1|RNF14-004|GAPDH|132
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKE
>ENST00000513019.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370663.1|RNF14-005|ACTB|99
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLS
>ENST00000356143.1|ENSG00000013561.13|OTTHUMG00000129660.5|-|RNF14-202|HELLE|474
MSSEDREAQEDELLALASIYDGDEFRKAESVQGGETRIYLDLPQNFKIFVSGNSNECLQNSGFEYTICFLPPLVLNFELPPDYPSSSPPSFTLSGKWLSPTQLSALCKHLDNLWEEHRGSVVLFAWMQFLKEETLAYLNIVSPFELKIGSQKKVQRRTAQASPNTELDFGGAAGSDVDQEEIVDERAVQDVESLSNLIQEILDFDQAQQIKCFNSKLFLCSICFCEKLGSECMYFLECRHVYCKACLKDYFEIQIRDGQVQCLNCPEPKCPSVATPGQVKELVEAELFARYDRLLLQSSLDLMADVVYCPRPCCQLPVMQEPGCTMGICSSCNFAFCTLCRLTYHGVSPCKVTAEKLMDLRNEYLQADEANKRLLDQRYGKRVIQKAL
I want to make a Python list of the 6th |-separated element of the lines that start with ">".
To do so, I first build a dictionary in Python and then extract the list I want from its keys, like this:
from itertools import groupby

with open('infile.txt') as f:
    groups = groupby(f, key=lambda x: not x.startswith(">"))
    d = {}
    for k, v in groups:
        if not k:
            key, val = list(v)[0].rstrip(), "".join(map(str.rstrip, next(groups)[1], ""))
            d[key] = val
k = d.keys()
res = [el[5:] for s in k for el in s.split("|")]
But it returns pieces of every element in the lines that start with ">".
Do you know how to fix it?
Here is the expected output:
["RNF14", "GAPDH", "ACTB", "HELLE"]
This should help. It uses a simple iteration, str.startswith and str.split.
Demo:
res = []
with open(filename, "r") as infile:
    for line in infile:
        if line.startswith(">"):
            val = line.split("|")
            res.append(val[5])
print(res)
Output:
['RNF14', 'GAPDH', 'ACTB', 'HELLE']
In your code, the comprehension iterates over every |-separated field and slices it, instead of picking the sixth field. Replace
res = [el[5:] for s in k for el in s.split("|")]
with
res = [s.split("|")[5] for s in k]   # should work
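For reference, splitting the first header line from the question on "|" shows that index 5 is indeed the gene name:
line = ">ENST00000511961.1|ENSG00000013561.13|OTTHUMG00000129660.5|OTTHUMT00000370661.3|RNF14-003|RNF14|278"
print(line.split("|")[5])   # RNF14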
A solution close to yours, using filter and map instead of groupby:
with open('infile.txt') as f:
    lines = f.readlines()
groups = filter(lambda x: x.startswith(">"), lines)
res = list(map(lambda x: x.split('|')[5], groups))

Sort word frequencies by descending order of frequencies

I have a text file that has word frequencies in the format:
word<space>freq
where freq is a number. I want to sort the file such that the frequencies are in descending order. For that, I have tried the following:
Read the file into a dictionary:
kvp = {}
d = {}
with open("/home/melvyn/word_freq.txt") as myfile:
    for line in myfile:
        word, freq = line.partition(" ")[::2]
        kvp[word.strip()] = int(freq)
Sort the dictionary by values:
d = sorted(kvp.items(), key=lambda x:x[1])
Write the sorted dictionary into another text file:
with open('/home/melvyn/word_freq_sorted.txt', 'w') as f:
    json.dump(d, f)
I have the following questions:
1. Sorting is not happening. Why?
2. How can I add a new line between every key-value pair while doing a json.dump? Is there a cleaner way to write the dictionary contents into the text file?
Instead of json.dump, try writing to the file with file.write, formatting the strings as needed.
import json

kvp = {}
d = {}
with open("a.txt", "r") as f:
    for line in f:
        word, freq = line.partition(" ")[::2]
        kvp[word.strip()] = int(freq)

d = sorted(kvp.items(), key=lambda x: x[1], reverse=True)   # reverse=True sorts by descending frequency

with open("b.txt", "w") as f:
    for i, v in d:
        f.write(str(i) + " " + str(v) + "\n")
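Regarding the second question: if JSON output is still wanted, json.dump can break the output across lines with its indent argument, for example (the filename here is just an example):
with open("word_freq_sorted.json", "w") as f:
    json.dump(d, f, indent=2)   # pretty-printed with line breaks instead of one long line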

Converting text file to dictionary in python

So let's say I want to convert the following to a dictionary where the 1st column is the keys and the 2nd column is the values.
http://pastebin.com/29bXkYhd
The following code works for this (assume romEdges.txt is the name of the file):
f = open('romEdges.txt')
dic = {}
for l in f:
    k, v = l.split()
    if k in dic:
        dic[k].extend(v)
    else:
        dic[k] = [v]
f.close()
OK
But why doesn't the code work for this file?
http://pastebin.com/Za0McsAM
If anyone can tell me the correct code so that it works for the 2nd text file as well, I would appreciate it.
Thanks in advance.
You should use append instead of extend: extend with a string argument adds each character of the string as a separate element, while append adds the whole string as one element.
from collections import defaultdict

d = defaultdict(list)
with open("romEdges.txt") as fin:
    for line in fin:
        k, v = line.strip().split()
        d[k].append(v)
print d
or using sets to prevent duplicates
d = defaultdict(set)
with open("romEdges.txt") as fin:
    for line in fin:
        k, v = line.strip().split()
        d[k].add(v)
print d
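To see the difference between the two methods concretely:
lst = []
lst.extend("0100360")   # lst is now ['0', '1', '0', '0', '3', '6', '0']

lst = []
lst.append("0100360")   # lst is now ['0100360']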
If you want to add the data to a dictionary, you can use update in Python. Please use the following code:
f = open('your file name')
dic = {}
for l in f:
    k, v = l.split()
    if k in dic:
        dic.update({k: v})   # replaces the stored value for an existing key
    else:
        dic[k] = [v]
print dic
f.close()
output:
{'0100464': ['0100360'], '0100317': ['0100039'], '0100405': ['0100181'], '0100545': ['0100212'], '0100008': ['0000459'], '0100073': ['0100072'], '0100044': ['0100426'], '0100062': ['0100033'], '0100061': ['0000461'], '0100066': ['0100067'], '0100067': ['0100164'], '0100064': ['0100353'], '0100080': ['0100468'], '0100566': ['0100356'], '0100048': ['0100066'], '0100005': ['0100448'], '0100007': ['0100008'], '0100318': ['0100319'], '0100045': ['0100046'], '0100238': ['0100150'], '0100040': ['0100244'], '0100024': ['0100394'], '0100025': ['0100026'], '0100022': ['0100419'], '0100009': ['0100010'], '0100020': ['0100021'], '0100313': ['0100350'], '0100297': ['0100381'], '0100490': ['0100484'], '0100049': ['0100336'], '0100075': ['0100076'], '0100074': ['0100075'], '0100077': ['0000195'], '0100071': ['0100072'], '0100265': ['0000202'], '0100266': ['0000201'], '0100035': ['0100226'], '0100079': ['0100348'], '0100050': ['0100058'], '0100017': ['0100369'], '0100030': ['0100465'], '0100033': ['0100322'], '0100058': ['0100056'], '0100013': ['0100326'], '0100036': ['0100463'], '0100321': ['0100320'], '0100323': ['0100503'], '0100003': ['0100004'], '0100056': ['0100489'], '0100055': ['0100033'], '0100053': ['0100495'], '0100286': ['0100461'], '0100285': ['0100196'], '0100482': ['0100483']}
