count occurrences of a word - python

I checked similar topics, but the results are poor.
I have a file like this:
S1_22 45317082 31 0 9 22 1543
S1_23 3859606 40 3 3 34 2111
S1_24 48088383 49 6 1 42 2400
S1_25 43387855 39 1 7 31 2425
S1_26 39016907 39 2 7 30 1977
S1_27 57612149 23 0 0 23 1843
S1_28 42505824 23 1 1 21 1092
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
S1_29 54856684 18 0 2 16 1018
I want to count the occurrences of the words in the first column and, based on that, write an output file with an additional field stating uniq if count == 1 and multi if count > 1.
I produced the code:
import csv
import collections

infile = 'Results'
names = collections.Counter()
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        names[row[0]] += 1
print names[row[0]], row[0]
but it doesn't work properly.
I can't put everything into a list, since the file is too big.

If you want this code to work, you should indent your print statement:
        names[row[0]] += 1
        print names[row[0]], row[0]
But what you actually want is:
import csv
import collections

infile = 'Result'
names = collections.Counter()
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        names[row[0]] += 1

for name, count in names.iteritems():
    print name, count
Edit: To show the rest of the row, you can use a second dict, as in:
names = collections.Counter()
rows = {}
with open(infile) as input_file:
    for row in csv.reader(input_file, delimiter='\t'):
        rows[row[0]] = row
        names[row[0]] += 1

for name, count in names.iteritems():
    print rows[name], count

The print statement at the end does not look like what you want. Because of its indentation it is only executed once. It will print S1_29, since that is the value of row[0] in the last iteration of the loop.
You're on the right track. Instead of that print statement, just iterate through the keys and values of the counter and check whether each count equals 1 (uniq) or is greater than 1 (multi).
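Putting the pieces together, a minimal two-pass sketch of the whole task might look like this (the output filename and the sample rows are assumptions based on the question; counting in a first pass means the file never has to fit in memory):

```python
import collections
import csv

# build a small sample 'Results' file like the one in the question
sample = ["S1_28\t42505824\t23", "S1_29\t54856684\t18", "S1_29\t54856684\t18"]
with open('Results', 'w') as f:
    f.write('\n'.join(sample) + '\n')

# first pass: count occurrences of the first column
names = collections.Counter()
with open('Results') as f:
    for row in csv.reader(f, delimiter='\t'):
        names[row[0]] += 1

# second pass: append 'uniq' or 'multi' to every row
with open('Results') as f, open('Results.annotated', 'w') as out:
    for row in csv.reader(f, delimiter='\t'):
        flag = 'uniq' if names[row[0]] == 1 else 'multi'
        out.write('\t'.join(row + [flag]) + '\n')
```

Only the counter is held in memory; both passes stream the file line by line.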


Extract data from alternate rows with python

I want to extract the number corresponding to O2H from the following file format (The delimiter used here is space):
# Timestep No_Moles No_Specs SH2 S2H4 S4H6 S2H2 H2 S2H3 OSH2 Mo1250O3736S57H111 OSH S3H6 OH2 S3H4 O2S SH OS2H3
144500 3802 15 3639 113 1 10 18 2 7 1 3 2 1 2 1 1 1
# Timestep No_Moles No_Specs SH2 S2H4 S2H2 H2 S2H3 OSH2 Mo1250O3733S61H115 OS2H2 OSH S3H6 OS O2S2H2 OH2 S3H4 SH
149000 3801 15 3634 114 11 18 2 7 1 1 2 2 1 1 4 2 1
# Timestep No_Moles No_Specs SH2 OS2H3 S3H Mo1250O3375S605H1526 OS S2H4 O3S3H3 OSH2 OSH S2H2 H2 OH2 OS2H2 S2H O2S3H3 SH O4S4H4 OH O2S2H O6S5H3 O6S5H5 O3S4H4 O2S3H2 O3S4H3 OS3H3 O3S2H2 O4S3H4 O3S3H O6S4H5 OS4H3 O3S2H O5S4H4 OS2H O2SH2 S2H3 O4S3H3 O3S3H4 O O5S3H4 O5S3H3 OS3H4 O2S4H4 O4S4H3 O2SH O2S2H2 O5S4H5 O3S3H2 S3H6
589000 3269 48 2900 11 1 1 47 11 1 81 74 26 25 21 17 1 3 5 2 3 3 1 1 2 2 1 2 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Timestep No_Moles No_Specs SH2 Mo1250O3034S578H1742 OH2 OSH2 O3S3H5 OS2H2 OS OSH O2S3H2 OH O3S2H2 O6S6H4 SH O2S2H2 S2H2 OS2H H2 OS2H3 O5S4H2 O7S6H5 S3H2 O2SH2 OSH3 O7S6H4 O2S2H3 O6S5H3 O2SH O4S4H O3S2H3 S2 O2S2H S5H3 O7S4H4 O3S3H OS3H OS4H O5S3H3 S3H O17S12H9 O3S3H2 O7S5H4 O4SH3 O3S2H O7S8H4 O3S3H3 O11S9H6 OS3H2 S4H2 O10S8H6 O4S3H2 O5S5H4 O6S8H4 OS2 OS3H6 S3H3
959500 3254 55 2597 1 83 119 1 46 59 172 4 3 4 1 27 7 38 6 23 3 1 2 3 5 3 1 2 1 2 1 1 6 3 1 1 2 1 1 1 1 1 3 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1
That is, all the alternate rows contain the corresponding data of its previous row.
And I want the output to look like this
1
4
21
83
How it should work:
1 (14th number on 2nd row which corresponds to 14th word of 1st row i.e. O2H)
4 (16th number on 4th row which corresponds to 16th word of 3rd row i.e. O2H)
21 (15th number on 6th row which corresponds to 15th word of 5th row i.e. O2H)
83 (6th number on 8th row which corresponds to 6th word of 7th row i.e. O2H)
I was trying to extract it using regex but could not do it. Can anyone please help me extract the data?
You can easily parse this into a dataframe and select the desired column to fetch the values.
Assuming your data looks like the sample you've provided, you can try the following:
import pandas as pd

with open("data.txt") as f:
    lines = [line.strip() for line in f.readlines()]

header = max(lines, key=len).replace("#", "").split()
df = pd.DataFrame([line.split() for line in lines[1::2]], columns=header)
print(df["OH2"])
df.to_csv("parsed_data.csv", index=False)
Output:
0 1
1 11
2 1
3 83
Name: OH2, dtype: object
Dumping this to a .csv would yield:
I think you want OH2 and not O2H; it's a typo. Assuming this:
(1) Iterate over every single line.
(2) Take into account only every other line ( if (line_counter % 2) == 0: continue ).
(3) Splitting on the spaces and using a counter variable, find the index of OH2 in that line (assume it is 14 in the first line).
(4) Access the following line ( +1 index ), split it on spaces, and take the element at the index found in point (3).
Since you haven't posted any code, I assumed your problem was more about finding an approach than about coding, so I wrote out the algorithm.
Thank you, everyone, for the help. I figured out the solution:
i = 0
j = 1
with open('input.txt', 'r') as fin:
    with open('output.txt', 'w') as fout:
        for lines in fin:  # iterating over each line
            lists = lines.split()  # splits each line into a list of words
            try:
                if i % 2 == 0:  # odd (header) lines
                    index_of_OH2 = lists.index('OH2')
                    #print(index_of_OH2)
                i = i + 1
                if j % 2 == 0:  # even (data) lines
                    # -1 because data lines have no leading '#' token
                    number_of_OH2 = lists[index_of_OH2 - 1]
                    print(number_of_OH2 + '\n')
                    fout.write(number_of_OH2 + '\n')
                j = j + 1
            except:
                pass
Output:
1
4
21
83
try:/except: pass is added so that if OH2 is not found in a line, the loop moves on without an error.
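For what it's worth, the same header/data pairing can be written more compactly by zipping alternate lines; the tiny sample below is made up for illustration (replace the StringIO with open('input.txt')):

```python
from io import StringIO

# made-up sample: two header/data pairs with OH2 in different positions
f = StringIO(
    "# Timestep No_Moles OH2 SH\n"
    "100 3802 1 5\n"
    "# Timestep No_Moles SH OH2\n"
    "200 3801 7 4\n"
)

lines = [line.split() for line in f]
results = []
# pair each header line (even index) with the data line that follows it
for header, data in zip(lines[0::2], lines[1::2]):
    if 'OH2' in header:
        # -1 because data rows have no leading '#' token
        results.append(data[header.index('OH2') - 1])
print(results)  # prints ['1', '4']
```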

Group by a range of numbers Python

I have a list of numbers in a Python data frame and want to group these numbers by a specific range and count them. The numbers range from 0 to 20, but there might not be any number 6; in that case I want it to show 0.
dataframe column looks like
|points|
5
1
7
3
2
2
1
18
15
4
5
I want it to look like the below
range | count
1 2
2 2
3 1
4 1
5 2
6 0
7 ...
8
9...
I would iterate through the input lines and fill up a dict with the values.
All you have to do then is count...
import collections

# read your input and store the numbers in a list
lines = []
with open('input.txt') as f:
    lines = [int(line.rstrip()) for line in f]

# pre-fill the dictionary with 0s from 0 to the highest occurring number
values = {}
for i in range(max(lines) + 1):
    values[i] = 0

# increment the occurrence count by 1 for every value found
for val in lines:
    values[val] += 1

# order the dict
values = collections.OrderedDict(sorted(values.items()))

print("range\t|\tcount")
for k in values:
    print(str(k) + "\t\t\t" + str(values[k]))
repl: https://repl.it/repls/DesertedDeafeningCgibin
Edit:
a slightly more elegant version using a dict comprehension:
# read input as in the first example
values = {i: 0 for i in range(max(lines) + 1)}
for val in lines:
    values[val] += 1
# order and print as in the first example
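Since the question mentions a data frame, a pandas version is also worth sketching (assuming the numbers live in a column called points, as in the question); reindex does the pre-filling with zeros:

```python
import pandas as pd

# sample data from the question's points column
df = pd.DataFrame({"points": [5, 1, 7, 3, 2, 2, 1, 18, 15, 4, 5]})

# count each value, then reindex over the full 0-20 range so that
# missing values (e.g. 6) show up with a count of 0
counts = df["points"].value_counts().reindex(range(0, 21), fill_value=0)
print(counts)
```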
# order and print as in the first example

Read text file into list based on specific criterion

I have a text file with the following content:
str1 str2 str3 str4
0 1 12 34
0 2 4 6
0 3 5 22
0 56 2 18
0 3 99 12
0 8 5 7
1 66 78 9
I want to read the above text file into a list such that the program starts reading from the row where the first column has a value greater than zero.
How do I do this in Python 3.5?
I tried genfromtxt(), but I can only skip a fixed number of lines from the top. Since I will be reading different files, I need something else.
This is one way with the csv module.
import csv
from io import StringIO

mystr = StringIO("""\
str1 str2 str3 str4
0 1 12 34
0 2 4 6
0 3 5 22
0 56 2 18
0 3 99 12
0 8 5 7
1 66 78 9
2 50 45 4
""")

res = []
# replace mystr with open('file.csv', 'r')
with mystr as f:
    reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
    next(reader)  # skip header
    for line in reader:
        row = list(map(int, filter(None, line)))  # convert to integers
        if row[0] > 0:  # apply condition
            res.append(row)

print(res)
[[1, 66, 78, 9], [2, 50, 45, 4]]
lst = []
flag = 0
with open('a.txt') as f:
    for line in f:
        try:
            if float(line.split()[0].strip('.')) > 0:
                flag = 1
            if flag == 1:
                lst += [float(i.strip('.')) for i in line.split()]
        except:
            pass
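Another way to express "start keeping rows once the first column is non-zero" is itertools.dropwhile; a sketch, using StringIO in place of the real file:

```python
from io import StringIO
from itertools import dropwhile

# sample data from the question (replace StringIO with open('file.txt'))
f = StringIO("""str1 str2 str3 str4
0 1 12 34
0 2 4 6
0 3 5 22
0 56 2 18
0 3 99 12
0 8 5 7
1 66 78 9
""")

next(f)  # skip the header row
# dropwhile skips rows while the first column is zero,
# then keeps everything from the first non-zero row onward
rows = [list(map(int, line.split()))
        for line in dropwhile(lambda l: int(l.split()[0]) == 0, f)]
print(rows)  # prints [[1, 66, 78, 9]]
```

This works for any number of leading zero rows, so no fixed skip count is needed.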

python; merge dictionaries with each dictionary in a new column of the output csv file

With the following script, I parse 3 files into one dictionary each in Python. The dictionaries do not all have the same keys, and I want the values of each dictionary in a new column in my output csv file. So the keys must all be in one column, followed by columns each containing the values of the different dictionaries.
The problem with my script is that it only appends values if they exist, and as a result the values of the different dictionaries are placed in the wrong columns of the output csv file.
My script is as follows:
import os
import csv
from collections import defaultdict
from itertools import chain

def get_file_values(find_files, output_name):
    for root, dirs, files in os.walk(os.getcwd()):
        if all(x in files for x in find_files):
            outputs = []
            for f in find_files:
                d = {}
                with open(os.path.join(root, f), 'r') as f1:
                    for line in f1:
                        ta = line.split()
                        d[ta[1]] = int(ta[0])
                outputs.append(d)
            d3 = defaultdict(list)
            for k, v in chain(*(d.items() for d in outputs)):
                d3[k].append(v)
            with open(os.path.join(root, output_name), 'w+', newline='') as fnew:
                writer = csv.writer(fnew)
                writer.writerow(["genome", "contig", "genes", "SCM", "plasmidgenes"])
                for k, v in d3.items():
                    fnew.write(os.path.basename(root) + ',')
                    writer.writerow([k] + v)
            print(d3)

get_file_values(['genes.faa.genespercontig.csv', 'hmmer.analyze.txt.results.txt', 'genes.fna.blast_dbplasmid.out'], 'output_contigs_SCMgenes.csv')
My output now is:
genome contig genes SCM plasmidgenes
Linda 9 359 295 42
Linda 42 1 2
Linda 73 29 5
Linda 43 17 6
Linda 74 4
Linda 48 11
Linda 66 27
And I want to have it like;
genome contig genes SCM plasmidgenes
Linda 9 359 295 42
Linda 42 1 2 0
Linda 73 0 29 5
Linda 43 17 0 6
Linda 74 0 0 4
Linda 48 0 11 0
Linda 66 27 0 0
Easiest fix: check if the value exists; if it does, append it, otherwise append a 0 to your data array.
Probably a more complicated fix: use a different data structure, such as a Pandas dataframe or a two-dimensional array that resembles your data.
Example with a two-dimensional array:
You would first loop through the files and fill the d3 array with d3[lineNumber][key], e.g. d3[0]['genome'] would be your first row's first column.
Then you should be able to output the file with the following block:
with open(os.path.join(root, output_name), 'w+', newline='') as fnew:
    writer = csv.writer(fnew)
    # write header row (writerow expects a list of fields, not a joined string)
    writer.writerow(list(d3[0].keys()))
    # write data rows
    for key, row in d3.items():
        writer.writerow([os.path.basename(root)] + list(row.values()))
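The "append 0 when a key is missing" fix itself can be sketched with dict.get; the three dicts below are hypothetical stand-ins for the parsed files:

```python
from itertools import chain

# hypothetical per-file dicts (contig -> count), standing in for the parsed files
outputs = [
    {'9': 359, '42': 1, '43': 17, '66': 27},   # genes
    {'9': 295, '42': 2, '73': 29, '48': 11},   # SCM
    {'9': 42, '73': 5, '43': 6, '74': 4},      # plasmidgenes
]

# collect every key, then pull a value from each dict, defaulting to 0
# so every row ends up with the same number of columns
all_keys = sorted(set(chain.from_iterable(outputs)), key=int)
rows = [[k] + [d.get(k, 0) for d in outputs] for k in all_keys]
print(rows[0])   # prints ['9', 359, 295, 42]
```

Each row now always has one value per input file, so the columns line up in the CSV.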

How to remove a specific string common in multiple lines in a CSV file using python script?

I have a csv file which contains 65000 lines (approximately 28 MB). Each line begins with a certain path, e.g. "c:\abc\bcd\def\123\456". Now let's say the path "c:\abc\bcd\" is common to all the lines and the rest of the content differs. I have to remove the common part (in this case "c:\abc\bcd\") from all the lines using a Python script. For example, the content of the CSV file is as follows:
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert 87 116 69
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0 0 0
C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0
In the above example I need the output as below
FILE0.frag 0 0 0
FILE0.vert 0 0 0
FILE0.link-link-0.frag 17 25 2
FILE0.link-link-0.vert 85 111 68
FILE0.link-link-0.vert.bin 77 97 60
FILE0.link-link-0 0 0
FILE0.link 0 0 0
Can any of you please help me out with this?
^\S+/
You can simply apply this regex to each line and replace the match with an empty string. See the demo:
https://regex101.com/r/cK4iV0/17
import re
p = re.compile(ur'^\S+/', re.MULTILINE)
test_str = u"C:/Abc/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert 87 116 69\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.vert.bin 75 95 61\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0 0 0\nC:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0 "
subst = u" "
result = re.sub(p, subst, test_str)
What about something like this:
import csv

with open("file.csv", 'rb') as f:
    sl = []
    csvread = csv.reader(f, delimiter=' ')
    for line in csvread:
        # csv.reader yields lists, so strip the prefix from the first field
        line[0] = line[0].replace("C:/Abc/bcd/Def/Test/temp/test/GLNext/", "")
        sl.append(line)
To write the list sl out to a new file, use:
with open('filenew.csv', 'wb') as f:
    csvwrite = csv.writer(f, delimiter=' ')
    for line in sl:
        csvwrite.writerow(line)
You can automatically detect the common prefix without the need to hardcode it. You don't really need regex for this; os.path.commonprefix can be used instead:
import csv
import os

with open('data.csv', 'rb') as csvfile:
    reader = csv.reader(csvfile)
    paths = []  # stores all paths
    rows = []   # stores all lines
    for row in reader:
        paths.append(row[0].split("/"))  # split path by "/"
        rows.append(row)

commonprefix = os.path.commonprefix(paths)  # finds the prefix common to all paths
for row in rows:
    row[0] = row[0].replace('/'.join(commonprefix) + '/', "")  # remove prefix
rows now holds a list of lists which you can write to a file:
with open('data2.csv', 'wb') as csvfile:
    writer = csv.writer(csvfile)
    for row in rows:
        writer.writerow(row)
The following Python script will read your file in (assuming it looks like your example) and will create a version removing the common folders:
import os.path, csv

finput = open("d:\\input.csv", "r")
csv_input = csv.reader(finput, delimiter=" ", skipinitialspace=True)
csv_output = csv.writer(open("d:\\output.csv", "wb"), delimiter=" ")

# Create a set of unique folder names
set_folders = set()
for input_row in csv_input:
    set_folders.add(os.path.split(input_row[0])[0])

# Determine the common prefix
base_folder = os.path.split(os.path.commonprefix(set_folders))[0]
nprefix = len(base_folder) + 1

# Go back to the start of the input CSV
finput.seek(0)
for input_row in csv_input:
    csv_output.writerow([input_row[0][nprefix:]] + input_row[1:])
Using the following as input:
C:/Abc/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag 16 24 3
C:/Abc/Def/Test/temp/test/GLNext2/FILE0.link-link-0.vert 87 116 69
C:/Abc/Def/Test/temp/test/GLNext5/FILE0.link-link-0.vert.bin 75 95 61
C:/Abc/Def/Test/temp/test/GLNext7/FILE0.link-link-0 0 0
C:/Abc/Def/Test/temp/test/GLNext/FILE0.link-link-6 0 0 0
The output is as follows:
GLNext/FILE0.frag 0 0 0
GLNext/FILE0.vert 0 0 0
GLNext/FILE0.link-link-0.frag 16 24 3
GLNext2/FILE0.link-link-0.vert 87 116 69
GLNext5/FILE0.link-link-0.vert.bin 75 95 61
GLNext7/FILE0.link-link-0 0 0
GLNext/FILE0.link-link-6 0 0 0
With one space between each column, although this could easily be changed.
So I tried something like this:
import os
import fileinput

for dirName, subdirList, fileList in os.walk(Directory):
    for fname in fileList:
        if fname.endswith('.csv'):
            for line in fileinput.input(os.path.join(dirName, fname), inplace=1):
                location = line.find(r'GLNext')
                if location > 0:
                    location += len('GLNext')
                    print line.replace(line[:location], ".")
                else:
                    print line
You can use the pandas library for this and leverage pandas' handling of big CSV files (even in the hundreds of MB).
Code:
import pandas as pd
csv_file = 'test_csv.csv'
df = pd.read_csv(csv_file, header=None)
print df
print "-------------------------------------------"
path = "C:/Abc/bcd/Def/Test/temp/test/GLNext/"
df[0] = df[0].replace({path:""}, regex=True)
print df
# df.to_csv("truncated.csv") # Export to new file.
Result:
0 1 2 3
0 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag 0 0 0
1 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert 0 0 0
2 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 16 24 3
3 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 87 116 69
4 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 75 95 61
5 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 0 0 NaN
6 C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.lin... 0 0 0
-------------------------------------------
0 1 2 3
0 FILE0.frag 0 0 0
1 FILE0.vert 0 0 0
2 FILE0.link-link-0.frag 16 24 3
3 FILE0.link-link-0.vert 87 116 69
4 FILE0.link-link-0.vert.bin 75 95 61
5 FILE0.link-link-0 0 0 NaN
6 FILE0.link-link-6 0 0 0
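For completeness, in Python 3 the automatic prefix detection and stripping can be combined into a few lines; the rows below are a cut-down stand-in for the real parsed file:

```python
import os

# a cut-down stand-in for the parsed CSV rows
rows = [
    ["C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.frag", "0", "0", "0"],
    ["C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.vert", "0", "0", "0"],
    ["C:/Abc/bcd/Def/Test/temp/test/GLNext/FILE0.link-link-0.frag", "16", "24", "3"],
]

# commonprefix compares character by character, so cut back to the last '/'
# to avoid splitting in the middle of a path component
prefix = os.path.commonprefix([r[0] for r in rows])
prefix = prefix[:prefix.rfind('/') + 1]

stripped = [[r[0][len(prefix):]] + r[1:] for r in rows]
print(stripped[0])   # prints ['FILE0.frag', '0', '0', '0']
```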
