fastText use (getting it up to compare word vectors) - Python

I am a little ashamed that I have to ask this question because I feel like I should know this. I haven't been programming long, but I am trying to apply what I learn to a project I'm working on, and that is how I got to this question. fastText provides a library of words and their associated vectors: https://fasttext.cc/docs/en/english-vectors.html . It is used to find the vector of a word. I just want to look up a word or two and see the result, to decide whether it is useful for my project. They provide the list of vectors and then a small chunk of code, and I cannot make heads or tails of it. Some of it I get, but I do not see a print function; is it returning the data to a different part of your own code? I am also not sure where the chunk of code opens the data file. Usually fname is a handle, right? Or are they expecting you to type your file's path there? I am also not familiar with io; I googled it but didn't find anything useful. Is this something I need to download, or is it already part of Python? I know I might be a little out of my league, but I learn best by doing, so please don't hate on me.
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data

Try the following. (To answer two of your questions: `io` is part of Python's standard library, so there is nothing to download, and `fname` is just the function's parameter; you pass your file's path in when you call `load_vectors`. The function doesn't print anything itself, it returns the data for your own code to use.)
my_file_name = 'C:/path/to/file.txt'  # Use the path to your downloaded word-vector file
my_data = load_vectors(my_file_name)  # Function returns a dict mapping each word to its vector
print(my_data)  # To see the output
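Printing the whole dictionary will be overwhelming for a million-word file, so here is a minimal sketch for inspecting a single word instead, assuming you downloaded e.g. wiki-news-300d-1M.vec from the fastText page and that the word you look up is actually in the file:
vectors = load_vectors('wiki-news-300d-1M.vec')  # path to the .vec file you downloaded
vector = list(vectors['king'])                   # the stored value is a map object; list() materializes the floats
print(len(vector), vector[:5])                   # dimensionality and the first few components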

Related

Modifying letters in a file

I'm new to programming, so I'm pretty lost. I'm currently learning Python and I need to open a text file and change every letter to the next one in the alphabet (e.g. a -> b, b -> c, etc.). How would I go about writing code like this?
This sounds like a neat problem to work on for a beginner.
Things you may want to look at:
The open() function, which allows you to open files and read from or write to them: https://docs.python.org/3/library/functions.html#open
For example:
with open('test.out', 'r+') as fi:
    all_lines = fi.readlines()  # Read all lines from the file
    fi.write('this string will be written to the file')
# The file is closed at this point in the code; `with` is a context manager, look that up
The os.replace() function, which lets you overwrite one file with another. You might try reading the input file, writing to a new output file, then overwriting the input file with the new output file; os.replace() will let you do that last step.
https://docs.python.org/3/library/os.html#os.replace
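A minimal sketch of that read-transform-swap pattern (the file names and the .upper() transformation are placeholders, not the answer to your letter-shifting task):
import os

with open("input.txt", "r") as src, open("input.txt.tmp", "w") as dst:
    for line in src:
        dst.write(line.upper())  # stand-in for whatever per-character change you need

os.replace("input.txt.tmp", "input.txt")  # overwrite the original with the new file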
Replacing a character with the next character in sequence is an interesting twist, as it's not something a lot of Python programmers have to deal with. Here's one way to increment a character:
x = 'c'
print(chr(ord(x) + 1)) # will print 'd'
Without just giving away the answer, this should give you the pieces that you need to get started, feel free to ask more questions.
I think that this will work very well. The code can probably be shortened, but I'm still not sure how; I'm not an expert with `with open` statements.
with open("(your text file path)", "r") as f:
data = f.readline()
new_data = ""
for x in range(len(data)):
i = ord(data[x][0])
i += 1
x = chr(i)
new_data += x
print(new_data)
with open("(your text file path)", "w") as f:
f.write(new_data)
You must convert each letter to its numeric code with ord() so you can increment it by one, and then convert it back to a letter with chr(). This should work.
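Since you mention the code could be shortened, here is one possible tightening of the same idea (just a sketch, using a generator expression with str.join instead of concatenating in a loop):
with open("(your text file path)", "r") as f:
    data = f.read()

# Build the shifted string in one pass
new_data = "".join(chr(ord(ch) + 1) for ch in data)

with open("(your text file path)", "w") as f:
    f.write(new_data)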

Getting data from fastq by generator

I have a task in a training exercise where I have to read big fastq files and keep only the 'good' reads. Each record contains a header, a DNA string, a + sign, and some symbols (the quality of each base of the DNA string). For example:
#hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIIIII
I down-sampled the data, got the code working, and saved the results in a Python dictionary. But it turns out the original files are huge, so I rewrote the code to use a generator, and that did work for the down-sampled file. Still, I am wondering whether pulling all the data out and filtering it in a dictionary is a good idea. Does anybody have a better idea?
I am asking because I am doing this on my own. I started learning Python a few months ago and I am still learning, but alone; that is why I am asking for tips and help here, and sorry if I sometimes ask silly questions.
Thanks in advance.
Paulo
I got some ideas from code on Biostar:
import sys
import gzip

filename = sys.argv[1]

def parsing_fastq_files(filename):
    with gzip.open(filename, "rb") as infile:
        count_lines = 0
        for line in infile:
            line = line.decode()
            if count_lines % 4 == 0:      # first line of every record is the header
                ids = line[1:].strip()
                yield ids
            if count_lines % 4 == 1:      # second line of every record is the read itself
                reads = line.rstrip()
                yield reads
            count_lines += 1

total_reads = parsing_fastq_files(filename)
print(next(total_reads))
print(next(total_reads))
I now need to figure out how to filter the data, for example with `if value.endswith('expression'):`. I could use a dict for this, but that is exactly my doubt, because of the sheer number of keys and values.
Since this training exercise forces you to code this manually, and you already have code that reads the fastq as a generator, you can now apply whatever metric you have (Phred score, maybe?) for determining the quality of a read. You can append each "good" read to a new file, so you don't keep much in working memory even if almost all reads turn out to be good.
Writing to a file is a slow operation, so you could wait until you have, say, 50000 good sequences and then write them to the file in one go.
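A minimal sketch of that streaming approach, assuming a gzipped fastq and a placeholder is_good() check that you would replace with your own quality metric:
import gzip

def is_good(quality):
    # Placeholder metric: keep reads whose mean Phred score (ASCII offset 33) is at least 30
    return sum(ord(c) - 33 for c in quality) / len(quality) >= 30

def filter_fastq(in_path, out_path, batch_size=50000):
    batch = []
    with gzip.open(in_path, "rt") as infile, open(out_path, "w") as outfile:
        while True:
            record = [infile.readline() for _ in range(4)]  # header, sequence, +, quality
            if not record[0]:                               # end of file reached
                break
            if is_good(record[3].rstrip()):
                batch.extend(record)                        # keep the whole 4-line record
            if len(batch) >= batch_size * 4:                # flush in large chunks
                outfile.writelines(batch)
                batch = []
        outfile.writelines(batch)                           # write whatever is left over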
Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.

Searching for a string in a file and saving the results

I have a few quite large text files with data in them. I need to find a string that repeats in the data; the string will always have an id number after it, and I need to save that number.
I've done some simple scripting with Python, but I am unsure where to start with this, or whether Python is even a good fit for the problem. Any help is appreciated.
I will post more information next time (my bad), but I managed to get something working that should do it for me.
import re

with open("test.txt", "r") as opened:
    text = opened.read()

output = re.findall(r"\bdata........", text)
out_str = ",".join(output)
print(out_str)

# with open("output.txt", "w") as outp:
#     outp.write(out_str)
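If the text after the marker really is a number, a capture group can pull out just the id instead of a fixed-width slice; the literal "data" marker and the digit pattern below are assumptions, so adjust them to your files:
import re

with open("test.txt", "r") as f:
    ids = re.findall(r"\bdata(\d+)", f.read())  # capture only the digits that follow "data"

with open("output.txt", "w") as outp:
    outp.write(",".join(ids))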

Python list of lists no loops

So, full disclosure, this is homework, but I am having a lot of difficulty figuring it out. My professor set a rather particular challenge in this one portion of the assignment that I can't quite seem to crack. Basically I'm trying to read a very, very large file and put it into a list of lists that represents a movie recommendation matrix. She says that this can be done without a for loop and suggests using the readlines() method.
I've been running this code:
movMat = []
with open(u"movie-matrix.txt", 'r', encoding="ISO-8859-1") as f:
    movMat.append(f.readlines())
But when I run diff on the output, it is not equivalent to the original file. Any suggestions for how I should go about this?
Update: upon further analysis, I think my code is correct. I added this to pair each line with its index as tuples.
with open(u"movie-matrix.txt", 'r', encoding="ISO-8859-1") as f:
    movMat = list(enumerate(f.readlines()))
Update 2: Since people seem to want the file I'm reading from, allow me to explain. This is a ranking system from 1-5. If a person has not ranked a movie, the missing ranking is denoted by ';'. This is the second line of the file:
"3;;;;3;;;;;;;;3;;;;;;;;;2;;;;;;;;3;;;;;;;;;;;;5;;;;;;;1;;;;;;;;;;;;;;;3;;;;;;;;3;;;;;;;;;;;4;;;;4;;;;;3;;;2;;;;;;;2;;;;;;;;3;;;;;;;;;;;;;;;;;;;;4;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;4;;;;;;;;;;;;;;;3;;;;3;;;4;2;;;;;;3;;;;;;4;;;;3;;;;;3;;;;;;;;;;;;2;;;;;;;;;;;;;;;3;4;;;;;;5;;;;;;;;;;;3;2;;;1;;;;;4;;;4;3;;;;;;;;;;;;4;3;;;;;;;;2;;3;;2;;;;;;;;;;;;;;;4;;;;;1;;2;;;;;;;;;;;;;;;;;;;5;;;;;;;;;;;;;;;;;4;;;;;;;;;;4;4;;;;2;3;;;;;;3;;4;;;;;;4;;;;;3;3;;;;;;1;;4;;;;;;;;;4;;;;;;;;;2;;;;3;;;;;;4;;;;;;;3;;;;;;;;4;;;;;4;;;;;;;;;;;1;;;;;;5;;;;;;;;;;;;4;;;3;;;;;;;;2;;1;;;;;;;;;4;;;;;;;;;;;;;;;3;;;;;;;;;;;5;;;;4;;;;;;;3;;;;;;;;2;;;;;;;;;;3;;;;;5;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;3;;;;;;;;;;;;;;;;;;2;;;3;4;;;;;3;;;;;4;;;;;;;;4;;4;3;;;;;4;;3;;;1;;3;;;;;2;;;;;;;;;;;4;;;;;;;;;;;3;;;;3;;;;;;;;;;;;;;;;;;;3;;;;4;;;;;;3;;;;;;;;;;;;4;;;;;;;;;;;3;;;;;;;;3;;;4;;4;;;;;;3;;;;;;;3;;;;;;;;;3;1;;;;;;;;;;;;;;;;3;;;;;3;5;;4;;;;;;4;;3;4;;;;;;;;3;;;;;;;;;;;3;;;;3;;;;;;;;;;;;;;4;;5;;;;;;;;;;;;;;;;;;4;;;;2;;2;;;;;;;;;;3;;;;;;4;;;3;;;4;;;;3;;;3;;;;;;;;;;;;;;;;;3;;;;;;;;3;;;;;;;;;;4;;;;;;;;;5"
I can't think of any case where f.readlines() would be better than just using f as an iterable. That is, for example:
with open('movie-matrix.txt', 'r', encoding="ISO-8859-1") as f:
    movMat = list(f)
(There is no reason to use the u'...' notation in Python 3, which you must be using if the built-in open accepts encoding=....)
Yes, f.readlines() would be equivalent to list(f), but it's more verbose and less obvious, so what's the point?!
Assuming you have to output this to another file, since you mention "running diff on the output", that would be:
with open('other.txt', 'w', encoding="ISO-8859-1") as f:
    f.writelines(movMat)
There are no non-for-loop alternatives there :-).
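If the goal really is a list of lists (one inner list of rankings per person), one possible sketch that still avoids an explicit for loop is to split each line with map; the ';' separator is taken from your Update 2, and empty strings mark missing rankings:
with open('movie-matrix.txt', 'r', encoding="ISO-8859-1") as f:
    movMat = list(map(lambda line: line.rstrip('\n').split(';'), f))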

Extract certain offsets in a file with python

I am trying to learn Python as I go, but I have hit a brick wall.
I am just trying to extract certain offsets from a .bin file.
I have a bin file with a length of "00FFFFF0".
Let's say I want to extract from offset "0x3F000" with a block size of "0x800" and then put that in a file; how would I go about it? I don't have any code yet and am hoping to get some good input. I am a beginner with Python (been doing it for a few months) and would like to learn how to do this, really just for educational purposes.
The point is that I want to be able to extract a specific (offset, block size) pair.
I hope you understand what I mean, and I very much appreciate any help I am given. Thanks.
It's pretty self-explanatory, actually:
# Use the with statement to open a file so it will later be closed automatically
with open("in.bin", "rb") as infile:  # rb = read binary
    infile.seek(0x3F000, 0)  # 0 = start of file, optional in this case
    data = infile.read(0x800)

with open("out.bin", "wb") as outfile:
    outfile.write(data)
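Since you want to reuse this for arbitrary (offset, block size) pairs, one way to generalize the snippet is a small helper function; the names and the example call below are just illustrative:
def extract_block(in_path, out_path, offset, size):
    with open(in_path, "rb") as infile:
        infile.seek(offset)        # jump to the requested offset
        data = infile.read(size)   # read exactly `size` bytes
    with open(out_path, "wb") as outfile:
        outfile.write(data)

extract_block("in.bin", "out.bin", 0x3F000, 0x800)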
