how to get a sequence after a word with whitespace - python

For school I have to extract the sequence that follows a keyword in a file, and the lines contain a lot of whitespace. I just can't get it to work. The file is in GenBank format.
So for example:
BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//
What I have tried is this.
if line.startswith("BLA"):
    start = line.find("BLA")
    end = line.find("//")
    line = line[:end]
    s_string = ""
    string = list()
    if s_string:
        string.append(line)
    else:
        line = line.strip()
        my_seq += line
But what I get is:
**output**
BLA
That is the only thing it prints. I want the output to look like this:
**output**
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
I have tried to produce that last output, but without success. My teacher told me to do it like this: once BLA has been seen (a flag is True), start collecting the lines, and stop when you see "//". When I tried it with that True flag, I got nothing.
I searched online and found that this can be done with Biopython's SeqIO, but the teacher said we can't use that.

Here is my solution:
lines = """BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//"""
lines = lines.strip().split("//")
lines = lines[0].split("BLA")
lines = [i.strip() for i in lines]
print("BLA", lines[1])
Output:
BLA 1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
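The teacher's suggestion (start collecting once BLA is seen, stop at "//") can also be sketched line by line with a boolean flag. This is only a minimal sketch: the function name extract_block and its parameters are made up for illustration, and it assumes the keyword sits at the start of its line, as in the example.

```python
def extract_block(lines, start_word="BLA", end_marker="//"):
    """Collect lines from the start word up to, but not including, the end marker."""
    collected = []
    inside = False  # becomes True once the start word has been seen
    for line in lines:
        stripped = line.strip()
        if not inside:
            if stripped.startswith(start_word):
                inside = True
                collected.append(stripped)
        elif stripped.startswith(end_marker):
            break  # end of the block, stop reading
        else:
            collected.append(stripped)
    return collected


sample = """BLA
1 sjafhkashfjhsjfhkjsfkjakshfkjsjkf
2 isfshkdfhjksfkhksfhjkshkfhkjsakjfhk
3 kahsfkjshakjfhksjhfkskjfkaskfksj
//"""

block = extract_block(sample.splitlines())
print("\n".join(block))
```

With a real file, the same function should work on an open file object, for example extract_block(open("myfile.gb")), where the file name is a placeholder.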

Regex in Python returns nothing (search parameters keywords for search for when using regex)

I'm not so sure about how regex works, but I'm working on a project that analyzes a mark scheme PDF and then does something with the useful data. I haven't set up the rest yet; I'm working on the PDF indexing side first, with a test PDF.
The issue is that when I apply my regex to the text of the PDF, it returns nothing. I'm trying to go through each row that begins with 1 - 2 digits (Question column) followed by A-D (Answer column), using re.compile(r'\d{1} [A-D]') in the following code:
import re
import requests
import pdfplumber
import pandas as pd
def download_file(url):
    local_filename = url.split('/')[-1]
    with requests.get(url) as r:
        with open(local_filename, 'wb') as f:
            f.write(r.content)
    return local_filename
ap_url = 'https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf'
ap = download_file(ap_url)
with pdfplumber.open(ap) as pdf:
    page = pdf.pages[1]
    text = page.extract_text()
    #print(text)
new_vend_re = re.compile(r'\d{1} [A-D]')
for line in text.split('\n'):
    if new_vend_re.match(line):
        print(line)
When I run the code, I do not get anything in return. Printing the text though will print the whole page.
Here is the PDF I'm trying to work with: https://papers.gceguide.com/A%20Levels/Biology%20(9700)/2019/9700_m19_ms_12.pdf
You're matching exactly one space between the question number and the answer letter, but if you look at the contents of text, there is more than one space between the columns.
'9700/12 Cambridge International AS/A Level – Mark Scheme March 2019\nPUBLISHED \n \nQuestion Answer Marks \n1 A 1\n2 C 1\n3 C 1\n4 A 1\n5 A 1\n6 C 1\n7 A 1\n8 D 1\n9 A 1\n10 C 1\n11 B 1\n12 D 1\n13 B 1\n...
Change your regex to the following to match not only one, but one or more spaces:
new_vend_re = re.compile(r'\d{1}\s+[A-D]')
See the answer by alexpdev to get to know the difference of new_vend_re.match() and new_vend_re.search(). If you run this within your code, you will get the following output:
1 A 1
2 C 1
3 C 1
4 A 1
5 A 1
6 C 1
7 A 1
8 D 1
9 A 1
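(You can also see here that there are always two spaces instead of one. Only questions 1 to 9 are printed, because \d{1} matches a single digit and match() anchors at the start of the line; \d{1,2} would also pick up the two-digit question numbers.)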
//Edit: Fixed typo in regex
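To see the difference without downloading the PDF, here is a small sketch on a hand-written sample. The sample text and the two-space column gap are assumptions modelled on the extracted page, and the names single_space and flexible are made up for illustration.

```python
import re

# Hand-written sample; the two spaces between columns are an assumption.
text = "Question  Answer  Marks \n1  A  1\n2  C  1\n10  C  1\n11  B  1"

single_space = re.compile(r'\d{1} [A-D]')   # the pattern from the question
flexible = re.compile(r'\d{1,2}\s+[A-D]')   # one or two digits, any run of whitespace

single_hits = [line for line in text.split('\n') if single_space.match(line)]
flexible_hits = [line for line in text.split('\n') if flexible.match(line)]

print(single_hits)    # nothing matches when the gap is wider than one space
print(flexible_hits)  # all four answer rows, including questions 10 and 11
```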

How to count word frequencies from an input file? [duplicate]

I'm trying to have my program read a single line formed by words separated by commas. For example if we have:
hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy
in the input file, the program would need to separate each word on a single line and ditch the commas. After that the program would count frequencies of the words in the input file.
f = open('input1.csv') # create file object
userInput = f.read()
seperated = userInput.split(',')
for word in seperated:
    freq = seperated.count(word)
    print(word, freq)
The problem with this code is that a word is printed again, with its count, every time it reappears in the input. The output for this program would be:
hello 1
cat 2
man 2
hey 2
dog 2
boy 1
Hello 1
man 2
cat 2
woman 1
dog 2
Cat 1
hey 2
boy
1
The correct output would be:
hello 1
cat 2
man 2
hey 2
dog 2
boy 2
Hello 1
woman 1
Cat 1
The question is: how do I make my output more polished, so that each word is printed only once with its final count?
This is a common pattern and a core programming skill. You should try collecting and counting words in a dictionary each time you encounter them. I'll give you the idea, but it's best you practise the exact implementation yourself. Happy hacking!
(I also recommend the "pretty print" module, pprint, from the Python standard library.)
import pprint
from collections import defaultdict
word_dict = defaultdict(int)  # unseen words start at a count of 0
for word in seperated:  # the list of words produced by your split(',')
    word_dict[word] += 1
pprint.pprint(dict(word_dict))
A couple of extra tips: you may want to f.close() your file when you're finished. (E: I misread, so disregard the rest...) It also looks like you may want to convert your words to lower case so that different capitalisations aren't counted separately. There are built-in Python string methods for this, which you can find by searching.
Try using a dictionary:
f = open('input1.csv') # create file object
userInput = f.read()
seperated = userInput.split(',')
wordsDict = {}
for word in seperated:
    if word not in wordsDict:
        wordsDict[word] = 1
    else:
        wordsDict[word] += 1
for i in wordsDict:
    print(i, wordsDict[i])
Create a new dictionary, and add each word as a key with its count as the value:
count_dict={}
for w in seperated:
    count_dict[w]=seperated.count(w)
for key,value in count_dict.items():
    print(key,value)
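The dictionary approaches above can also be written with collections.Counter from the standard library. This is a minimal sketch that uses the sample line from the question as an in-memory string instead of reading input1.csv. The strip() call is an addition: without it the last word keeps its trailing newline and is counted separately, which is why the question's output ends with a lone "boy".

```python
from collections import Counter

# Sample line from the question, including the newline a file read would return.
line = "hello,cat,man,hey,dog,boy,Hello,man,cat,woman,dog,Cat,hey,boy\n"

words = line.strip().split(',')  # strip() removes the trailing newline
counts = Counter(words)          # keeps words in first-seen order

for word, freq in counts.items():
    print(word, freq)
```

To treat "hello" and "Hello" as the same word, the words could be lower-cased first, for example with line.strip().lower().split(',').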

How do I remove blank lines from a string in Python?

Let's say I have a variable whose data contains blank lines. How would I remove them without turning everything into one long line?
How would I turn this:
1

2

3
Into this:
1
2
3
Without turning it into this:
123
import os
text = os.linesep.join([s for s in text.splitlines() if s])
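You can do this with replace(), like data.replace('\n\n', '\n'). Note that a single pass only collapses pairs of newlines, so a run of three or more consecutive newlines still leaves a blank line behind.
See this example: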
data = '1\n\n2\n\n3\n\n'
print(data)
data = data.replace('\n\n', '\n')
print(data)
Output
1
2
3
1
2
3
import re
text = re.sub(r"\n{2,}", "\n", text)
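The approaches above behave differently when there are several blank lines in a row. Here is a short comparison on a made-up sample string (the variable names are only for illustration):

```python
import re

text = "1\n\n2\n\n\n\n3\n"  # one blank line after 1, three blank lines after 2

# Keep only non-empty lines; strip() also drops whitespace-only lines.
by_lines = "\n".join(s for s in text.splitlines() if s.strip())

# Collapse every run of two or more newlines into a single newline.
by_regex = re.sub(r"\n{2,}", "\n", text)

# A single replace() pass only halves each run of newlines.
by_replace = text.replace("\n\n", "\n")

print(repr(by_lines))
print(repr(by_regex))
print(repr(by_replace))
```

Only the first two remove every blank line here; the replace() version leaves one blank line between 2 and 3.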

I need to make a simple file compression system for my GCSE Computer Science

I'm not that experienced with code and have a question pertaining to my GCSE Computer Science controlled assessment. I have got pretty far, it's just this last hurdle is holding me up.
This task requires me to use a previously made simple file compression system, and to "Develop a program that builds [upon it] to compress a text file with several sentences, including punctuation. The program should be able to compress a file into a list of words and list of positions to recreate the original file. It should also be able to take a compressed file and recreate the full text, including punctuation and capitalisation, of the original file".
So far, I have made it possible to store everything as a text file with my first program:
sentence = input("Enter a sentence: ")
sentence = sentence.split()
uniquewords = []
for word in sentence:
    if word not in uniquewords:
        uniquewords.append(word)
positions = [uniquewords.index(word) for word in sentence]
recreated = " ".join([uniquewords[i] for i in positions])
print (uniquewords)
print (recreated)
positions=str(positions)
uniquewords=str(uniquewords)
positionlist= open("H:\Python\ControlledAssessment3\PositionList.txt","w")
positionlist.write(positions)
positionlist.close()
wordlist=open("H:\Python\ControlledAssessment3\WordList.txt","w")
wordlist.write(uniquewords)
wordlist.close()
This makes everything into lists, and converts them into a string so that it is possible to write into a text document. Now, program number 2 is where the issue lies:
uniquewords=open("H:\Python\ControlledAssessment3\WordList.txt","r")
uniquewords= uniquewords.read()
positions=open("H:\Python\ControlledAssessment3\PositionList.txt","r")
positions=positions.read()
positions= [int(i) for i in positions]
print(uniquewords)
print (positions)
recreated = " ".join([uniquewords[i] for i in positions])
FinalSentence=open("H:\Python\ControlledAssessment3\ReconstructedSentence.txt","w")
FinalSentence.write(recreated)
FinalSentence.write('\n')
FinalSentence.close()
When I try and run this code, this error appears:
Traceback (most recent call last):
File "H:\Python\Task 3 Test 1.py", line 7, in <module>
positions= [int(i) for i in positions]
File "H:\Python\Task 3 Test 1.py", line 7, in <listcomp>
positions= [int(i) for i in positions]
ValueError: invalid literal for int() with base 10: '['
So, how do you suppose I get the second program to recompile the text into the sentence? Thanks, and I'm sorry if this was a lengthy post, I've spent forever trying to get this working.
I'm assuming this is something to do with the list that has been converted into a string including brackets, commas, and spaces etc. so is there a way to revert both strings back into their original state so I can recreate the sentence? Thanks.
Firstly, it is a bit strange to save positions as the string form of a list; you should save each element separately (same with uniquewords). With this in mind, something like:
program1.py:
sentence = input("Type sentence: ")
# this is a test this is a test this is a hello goodbye yes 1 2 3 123
sentence = sentence.split()
uniquewords = []
for word in sentence:
    if word not in uniquewords:
        uniquewords.append(word)
positions = [uniquewords.index(word) for word in sentence]
with open("PositionList.txt","w") as f:
    for i in positions:
        f.write(str(i)+' ')
with open("WordList.txt","w") as f:
    for i in uniquewords:
        f.write(str(i)+' ')
program2.py:
with open("PositionList.txt","r") as f:
    data = f.read().split(' ')
    positions = [int(i) for i in data if i!='']
with open("WordList.txt","r") as f:
    uniquewords = f.read().split(' ')
sentence = " ".join([uniquewords[i] for i in positions])
print(sentence)
PositionList.txt
0 1 2 3 0 1 2 3 0 1 2 4 5 6 7 8 9 10
WordList.txt
this is a test hello goodbye yes 1 2 3 123
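On the question of reverting the saved strings back into lists: two standard-library options are ast.literal_eval, which parses the str() form of a list, and the json module, which saves and loads lists directly. The sketch below works on in-memory strings rather than the files from the question, and the sample data is made up for illustration.

```python
import ast
import json

positions = [0, 1, 2, 3, 0, 1]
uniquewords = ["this", "is", "a", "test"]

# Option 1: parse the text that str(positions) produced back into a list.
saved_positions = str(positions)              # "[0, 1, 2, 3, 0, 1]"
restored = ast.literal_eval(saved_positions)

# Option 2: store both lists as JSON and load them again.
saved = json.dumps({"words": uniquewords, "positions": positions})
loaded = json.loads(saved)

sentence = " ".join(loaded["words"][i] for i in loaded["positions"])
print(restored)
print(sentence)
```

The original error came from iterating over the string character by character, so int() was first called on '['. Parsing the whole string in one go avoids that.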

Read file and get certain value from each line of file

I'm stuck on a particular point on something, I'm hoping you guys could perhaps suggest a better method.
For each line of file I'm reading I want to get the nth word of the line, store this and print them on a single line.
I have the following code:
import os
p = './output.txt'
word_line = ' '
myfile = open(p, 'r')
for words in myfile.readlines()[1:]: # I remove the first line because I don't want it
    current_word = words.strip().split(' ')[4]
    word_line += current_word
print(word_line)
myfile.close()
The file it reads looks like this:
1 abc-abc.abc (1235456) [AS100] bla 123 etc
2 abc-abc.abc (1235456) [AS10] bla 123 etc
3 abc-abc.abc (1235456) [AS1] bla 123 etc
4 abc-abc.abc (1235456) [AS56] bla 123 etc
5 abc-abc.abc (1235456) [AS8] bla 123 etc
6 abc-abc.abc (1235456) [AS200] bla 123 etc
etc
My current code outputs the following:
[AS100][AS10][AS1][AS56][AS8][AS200]
Only problem is, it is not always fixed as the 4th value of the line, as sometimes it appears as 5th, etc or not at all.
I'm currently trying out:
if re.match("[AS", words):
    f_word = re.match(".*[(.*)",words)
This isn't working. I'm trying to check whether the current line contains an opening "[" and, if it does, to display the content before the closing "]", then move on to the next line and keep doing this.
Eventually have the following desired output:
AS100 AS10 AS1 AS56 AS8 AS200
I could really use some advice on this. Thanks
EDIT:
m = re.search(r'\[AS(.*?)]', words)
if m:
    f_word += ' ' + m.group(1)
Thanks
[ is a special character in regular expressions and denotes the start of a character class. Escape it.
m = re.search(r'\[AS(.*?)]', words)
if m:
    f_word = m.group(1)
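Note that with the group placed after AS, m.group(1) holds only the digits (for example 100). To get the desired AS100 AS10 ... output, the AS can be moved inside the group. Here is a minimal sketch on made-up sample lines; the assumption that each tag is AS followed by digits is taken from the example file, and the names tag_re and found are only for illustration.

```python
import re

# Made-up sample: the tag moves between columns and is missing on one line.
sample = """header line
1 abc-abc.abc (1235456) [AS100] bla 123 etc
2 abc-abc.abc [AS10] bla 123 etc
3 abc-abc.abc (1235456) bla 123 etc
4 abc-abc.abc (1235456) extra [AS56] bla 123 etc"""

tag_re = re.compile(r'\[(AS\d+)\]')  # escaped brackets, AS kept inside the group

found = []
for line in sample.splitlines()[1:]:  # skip the first line, as in the question
    m = tag_re.search(line)
    if m:                             # lines without a tag are skipped
        found.append(m.group(1))

word_line = " ".join(found)
print(word_line)
```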
