Getting data from FASTQ files by generator - Python

I have a training task where I have to read big FASTQ files and filter out the 'good' reads. Each record contains a header, a DNA string, a + sign, and a line of symbols (the quality score of each base in the DNA string). Example:
#hhhhhhhh
ATGCGTAGGGG
+
IIIIIIIIIIIII
I down-sampled the files, got the code working, and saved the results in a Python dictionary. But it turns out the original files are huge, so I rewrote the code as a generator, and that did work on the down-sampled data. Still, I was wondering whether it is a good idea to pull all the data out and do the filtering in a dictionary. Does anybody here have a better idea?
I am asking because I am doing this by myself. I started learning Python a few months ago and I am still learning, but I am learning alone. That is why I am asking for tips and help here, and sorry if I sometimes ask silly questions.
Thanks in advance.
Paulo
I got some ideas from code on Biostars:
import sys
import gzip
filename = sys.argv[1]
def parsing_fastq_files(filename):
    with gzip.open(filename, "rb") as infile:
        count_lines = 0
        for line in infile:
            line = line.decode()
            if count_lines % 4 == 0:    # header line of each record
                ids = line[1:].strip()
                yield ids
            if count_lines % 4 == 1:    # sequence line of each record
                reads = line.rstrip()
                yield reads
            count_lines += 1
total_reads = parsing_fastq_files(filename)
print(next(total_reads))
print(next(total_reads))
I now need to figure out how to filter the data, using something like "if value.endswith('expression'):". I could collect everything in a dict, for example, but that is my doubt, because of the sheer number of keys and values that would involve.

Since this training forces you to code this manually, and you already have code that reads the FASTQ as a generator, you can now apply whatever metric you have (Phred score, maybe?) for determining the quality of a read. You can append each "good" read to a new file, so you don't keep much in working memory even if almost all reads turn out to be good.
Writing to file is a slow operation, so you could wait until you have, say, 50000 good sequences and then write them to file.
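For example, here is a minimal sketch of that idea; it assumes Phred+33 quality encoding and a made-up mean-quality cutoff, and the function name, file handling and threshold are only placeholders, not the "right" metric:

import gzip

def filter_fastq(in_path, out_path, min_mean_quality=30, batch_size=50000):
    """Stream a gzipped FASTQ file and keep records whose mean quality passes the cutoff."""
    batch = []
    with gzip.open(in_path, "rt") as infile, open(out_path, "w") as outfile:
        while True:
            record = [infile.readline() for _ in range(4)]  # header, sequence, '+', quality
            if not record[0]:
                break  # end of file
            quality = record[3].rstrip()
            # Phred+33 encoding: score = ord(char) - 33; use the mean score over the read
            mean_q = sum(ord(c) - 33 for c in quality) / max(len(quality), 1)
            if mean_q >= min_mean_quality:
                batch.extend(record)
            if len(batch) >= 4 * batch_size:  # write in batches rather than record by record
                outfile.writelines(batch)
                batch = []
        outfile.writelines(batch)  # flush whatever is left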
Check out https://bioinformatics.stackexchange.com/ if you do a lot of bioinformatics programming.

Related

How to speed up reading/scanning mass files for an exact keyword match

So I've looked on Google, and the only results I get are about "reading large files", not much about how to speed up reading multiple files.
I have a sound-alias keyword. This keyword needs to be scanned for in up to 128 files, and I could have up to 1,600 keywords to scan for in those files.
So as you can see, that's a lot of opening and reading, and the loading time is very slow. I can't have it be this slow for my program; I need to reduce the load time roughly tenfold.
So I have this code snippet, which reads the files line by line; if the keyword is mentioned in a line, it then does an exact-match check.
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        for m in weapon_file:
            if sAlias in m:
                t = re.compile(fr'\b{sAlias}\b')
                result = t.search(m)
                if result:
                    called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
I then thought I'd try to see if I could speed things up by turning each file into a string: do a basic scan to see if there's any mention and, if so, then do an exact-match search. But this approach didn't speed things up by anything worth caring about.
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        w = weapon_file.read()
        if sAlias in w:
            t = re.compile(fr'\b{sAlias}\b')
            result = t.search(w)
            if result:
                called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
I then thought I'd just open each file, turn it into a string, append all the file-strings together, check for any mention, and then do an exact-match search. This did actually reduce the loading time, but then I realised I can't use that approach, because the whole point of scanning these files for an exact keyword match is to then store the matched file's directory in a list. This approach removes any chance of doing that.
weaponString = ""
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        e = weapon_file.read()
        weaponString += e

if sAlias in weaponString:
    t = re.compile(fr'\b{sAlias}\b')
    result = t.search(weaponString)
    if result:
        called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
This is what the files look like.
It may also be worth mentioning that these files have no file extension, but I don't think that's an issue, as Python can still read them just fine.
WEAPONFILE\displayName\WEAPON_30CAL\modeName\\playerAnimType\smg\altWeapon\\AIOverlayDescription\WEAPON_SUBMACHINEGUNNER\weaponType\bullet\weaponClass\mg\penetrateType\large\impactType\bullet_large\inventoryType\primary\fireType\Full Auto\clipType\lmg\twoHanded\1\rifleBullet\0\armorPiercing\0\boltAction\0\aimDownSight\1\rechamberWhileAds\1\noADSAutoReload\0\noPartialReload\0\segmentedReload\0\adsFire\0\noAdsWhenMagEmpty\0\avoidDropCleanup\0\enhanced\0\bayonet\0\cancelAutoHolsterWhenEmpty\0\suppressAmmoReserveDisplay\0\laserSightDuringNightvision\0\blocksProne\0\silenced\0\mountableWeapon\0\autoAimRange\1200\aimAssistRange\3200\aimAssistRangeAds\3200\enemyCrosshairRange\720\crosshairColorChange\1\moveSpeedScale\0.75\adsMoveSpeedScale\0.75\sprintDurationScale\0.75\gunMaxPitch\6\gunMaxYaw\6\lowAmmoWarningThreshold\0.33\ammoName\30cal\maxAmmo\500\startAmmo\500\clipName\30cal\clipSize\125\shotCount\1\dropAmmoMin\200\dropAmmoMax\250\reloadAmmoAdd\0\reloadStartAdd\0\damage\130\minDamage\90\meleeDamage\150\maxDamageRange\1024\minDamageRange\2400\playerDamage\70\locNone\1\locHelmet\3\locHead\3\locNeck\1\locTorsoUpper\1\locTorsoLower\1\locRightArmUpper\1\locRightArmLower\1\locRightHand\1\locLeftArmUpper\1\locLeftArmLower\1\locLeftHand\1\locRightLegUpper\1\locRightLegLower\1\locRightFoot\1\locLeftLegUpper\1\locLeftLegLower\1\locLeftFoot\1\locGun\0\fireTime\0.096\fireDelay\0\meleeTime\0.5\meleeChargeTime\1\meleeDelay\0.05\meleeChargeDelay\0.15\reloadTime\7\reloadEmptyTime\6\reloadEmptyAddTime\0\reloadStartTime\0\reloadEndTime\0\reloadAddTime\4.75\reloadStartAddTime\0\rechamberTime\0.1\rechamberBoltTime\0\dropTime\0.83\raiseTime\0.9\altDropTime\0.7\altRaiseTime\0\quickDropTime\0.25\quickRaiseTime\0.25\firstRaiseTime\1.5\emptyDropTime\0.5\emptyRaiseTime\0.5\sprintInTime\0.5\sprintLoopTime\0.8\sprintOutTime\0.2\deployTime\0.5\breakdownTime\0.5\nightVisionWearTime\0.5\nightVisionWearTimeFadeOutEnd\0\nightVisionWearTimePowerUp\0\nightVisionRemoveTime\0.5\nightVisionRemoveTimePowerDown\0\nightVisionRemoveTimeFadeInStart\0\standMoveF\0\standMoveR\0\standMoveU\-2\standRotP\0\standRotY\0\standRotR\0\standMoveMinSpeed\0\standRotMinSpeed\0\posMoveRate\8\posRotRate\8\sprintOfsF\1\sprintOfsR\-2\sprintOfsU\-1\sprintRotP\10\sprintRotY\45\sprintRotR\-20\sprintBobH\8\sprintBobV\6\sprintScale\0.9\duckedSprintOfsF\2\duckedSprintOfsR\-1\duckedSprintOfsU\0\duckedSprintRotP\10\duckedSprintRotY\25\duckedSprintRotR\-20\duckedSprintBobH\2\duckedSprintBobV\3\duckedSprintScale\0.8\duckedMoveF\0\duckedMoveR\0\duckedMoveU\-1.5\duckedRotP\0\duckedRotY\0\duckedRotR\0\duckedOfsF\-0.5\duckedOfsR\0.25\duckedOfsU\-0.6\duckedMoveMinSpeed\0\duckedRotMinSpeed\0\proneMoveF\-160\proneMoveR\-75\proneMoveU\-120\proneRotP\0\proneRotY\300\proneRotR\-300\proneOfsF\0\proneOfsR\0.5\proneOfsU\-1\posProneMoveRate\10\posProneRotRate\10\proneMoveMinSpeed\0\proneRotMinSpeed\0\hipIdleAmount\30\adsIdleAmount\28\hipIdleSpeed\1\adsIdleSpeed\0.9\idleCrouchFactor\0.75\idleProneFactor\0.4\adsSpread\0\adsAimPitch\0\adsTransInTime\0.22\adsTransOutTime\0.4\adsTransBlendTime\0.1\adsReloadTransTime\0.3\adsCrosshairInFrac\1\adsCrosshairOutFrac\0.2\adsZoomFov\50\adsZoomInFrac\0.7\adsZoomOutFrac\0.4\adsBobFactor\0\adsViewBobMult\0.25\adsViewErrorMin\0\adsViewErrorMax\0\hipSpreadStandMin\4\hipSpreadDuckedMin\3.5\hipSpreadProneMin\3\hipSpreadMax\10\hipSpreadDuckedMax\8\hipSpreadProneMax\6\hipSpreadFireAdd\0.6\hipSpreadTurnAdd\0\hipSpreadMoveAdd\5\hipSpreadDecayRate\4\hipSpreadDuckedDecay\1.05\hipSpreadProneDecay\1.1\hipGunKickReducedKickBullets\0\hipGunKickReducedKickPe
rcent\0\hipGunKickPitchMin\5\hipGunKickPitchMax\-15\hipGunKickYawMin\5\hipGunKickYawMax\-5\hipGunKickAccel\800\hipGunKickSpeedMax\2000\hipGunKickSpeedDecay\16\hipGunKickStaticDecay\20\adsGunKickReducedKickBullets\0\adsGunKickReducedKickPercent\75\adsGunKickPitchMin\5\adsGunKickPitchMax\15\adsGunKickYawMin\-5\adsGunKickYawMax\10\adsGunKickAccel\800\adsGunKickSpeedMax\2000\adsGunKickSpeedDecay\32\adsGunKickStaticDecay\40\hipViewKickPitchMin\70\hipViewKickPitchMax\80\hipViewKickYawMin\-30\hipViewKickYawMax\-60\hipViewKickCenterSpeed\1500\adsViewKickPitchMin\45\adsViewKickPitchMax\55\adsViewKickYawMin\-70\adsViewKickYawMax\70\adsViewKickCenterSpeed\1800\swayMaxAngle\4\swayLerpSpeed\6\swayPitchScale\0.1\swayYawScale\0.1\swayHorizScale\0.2\swayVertScale\0.2\swayShellShockScale\5\adsSwayMaxAngle\4\adsSwayLerpSpeed\6\adsSwayPitchScale\0.1\adsSwayYawScale\0\adsSwayHorizScale\0.08\adsSwayVertScale\0.1\fightDist\720\maxDist\340\aiVsAiAccuracyGraph\thompson.accu\aiVsPlayerAccuracyGraph\light_machine_gun.accu\reticleCenter\\reticleSide\reticle_side_small\reticleCenterSize\4\reticleSideSize\8\reticleMinOfs\0\hipReticleSidePos\0\adsOverlayShader\\adsOverlayShaderLowRes\\adsOverlayReticle\none\adsOverlayWidth\220\adsOverlayHeight\220\gunModel\viewmodel_usa_30cal_lmg\gunModel2\\gunModel3\\gunModel4\\gunModel5\\gunModel6\\gunModel7\\gunModel8\\gunModel9\\gunModel10\\gunModel11\\gunModel12\\gunModel13\\gunModel14\\gunModel15\\gunModel16\\handModel\viewmodel_hands_no_model\worldModel\weapon_usa_30cal_lmg\worldModel2\\worldModel3\\worldModel4\\worldModel5\\worldModel6\\worldModel7\\worldModel8\\worldModel9\\worldModel10\\worldModel11\\worldModel12\\worldModel13\\worldModel14\\worldModel15\\worldModel16\\worldClipModel\\knifeModel\viewmodel_usa_kbar_knife\worldKnifeModel\weapon_usa_kbar_knife\idleAnim\viewmodel_30cal_idle\emptyIdleAnim\viewmodel_30cal_empty_idle\fireAnim\viewmodel_30cal_fire\lastShotAnim\viewmodel_30cal_lastshot\rechamberAnim\\meleeAnim\viewmodel_knife_slash\meleeChargeAnim\viewmodel_knife_stick\reloadAnim\viewmodel_30cal_partial_reload\reloadEmptyAnim\viewmodel_30cal_reload\reloadStartAnim\\reloadEndAnim\\raiseAnim\viewmodel_30cal_pullout\dropAnim\viewmodel_30cal_putaway\firstRaiseAnim\viewmodel_30cal_first_raise\altRaiseAnim\\altDropAnim\\quickRaiseAnim\viewmodel_30cal_pullout_fast\quickDropAnim\viewmodel_30cal_putaway_fast\emptyRaiseAnim\viewmodel_30cal_pullout_empty\emptyDropAnim\viewmodel_30cal_putaway_empty\sprintInAnim\\sprintLoopAnim\\sprintOutAnim\\nightVisionWearAnim\\nightVisionRemoveAnim\\adsFireAnim\viewmodel_30cal_ADS_fire\adsLastShotAnim\viewmodel_30cal_ADS_lastshot\adsRechamberAnim\\adsUpAnim\viewmodel_30cal_ADS_up\adsDownAnim\viewmodel_30cal_ADS_down\deployAnim\\breakdownAnim\\viewFlashEffect\weapon/muzzleflashes/fx_30cal_bulletweap_view\worldFlashEffect\weapon/muzzleflashes/fx_30cal_bulletweap\viewShellEjectEffect\weapon/shellejects/fx_heavy_link_view\worldShellEjectEffect\weapon/shellejects/fx_heavy\viewLastShotEjectEffect\\worldLastShotEjectEffect\\worldClipDropEffect\\pickupSound\weap_pickup\pickupSoundPlayer\weap_pickup_plr\ammoPickupSound\ammo_pickup\ammoPickupSoundPlayer\ammo_pickup_plr\breakdownSound\\breakdownSoundPlayer\\deploySound\\deploySoundPlayer\\finishDeploySound\\finishDeploySoundPlayer\\fireSound\weap_30cal_fire\fireSoundPlayer\weap_30cal_fire_plr\lastShotSound\weap_30cal_fire\lastShotSoundPlayer\weap_30cal_fire_plr\emptyFireSound\dryfire_rifle\emptyFireSoundPlayer\dryfire_rifle_plr\crackSound\\whizbySound\\meleeSwipeSound\melee_swing\meleeSwipeSoundPlayer\mel
ee_swing_plr\meleeHitSound\melee_hit\meleeMissSound\\rechamberSound\\rechamberSoundPlayer\\reloadSound\gr_30cal_3p_full\reloadSoundPlayer\\reloadEmptySound\gr_30cal_3p_full\reloadEmptySoundPlayer\\reloadStartSound\\reloadStartSoundPlayer\\reloadEndSound\\reloadEndSoundPlayer\\altSwitchSound\\altSwitchSoundPlayer\\raiseSound\weap_raise\raiseSoundPlayer\weap_raise_plr\firstRaiseSound\weap_raise\firstRaiseSoundPlayer\weap_raise_plr\putawaySound\weap_putaway\putawaySoundPlayer\weap_putaway_plr\nightVisionWearSound\\nightVisionWearSoundPlayer\\nightVisionRemoveSound\\nightVisionRemoveSoundPlayer\\standMountedWeapdef\\crouchMountedWeapdef\\proneMountedWeapdef\\mountedModel\\hudIcon\hud_icon_30cal\killIcon\hud_icon_30cal\dpadIcon\\ammoCounterIcon\\hudIconRatio\4:1\killIconRatio\4:1\dpadIconRatio\4:1\ammoCounterIconRatio\4:1\ammoCounterClip\Beltfed\flipKillIcon\1\fireRumble\defaultweapon_fire\meleeImpactRumble\defaultweapon_melee\adsDofStart\0\adsDofEnd\7.5\hideTags\\notetrackSoundMap\gr_30cal_start_plr gr_30cal_start_plr
gr_30cal_open_plr gr_30cal_open_plr
gr_30cal_grab_belt_plr gr_30cal_grab_belt_plr
gr_30cal_belt_remove_plr gr_30cal_belt_remove_plr
gr_30cal_belt_raise_plr gr_30cal_belt_raise_plr
gr_30cal_belt_contact_plr gr_30cal_belt_contact_plr
gr_30cal_belt_press_plr gr_30cal_belt_press_plr
gr_30cal_close_plr gr_30cal_close_plr
gr_30cal_charge_plr gr_30cal_charge_plr
gr_30cal_ammo_toss_plr gr_30cal_ammo_toss_plr
gr_30cal_charge_release_plr gr_30cal_charge_release_plr
gr_30cal_lid_bonk_plr gr_30cal_lid_bonk_plr
knife_stab_plr knife_stab_plr
knife_pull_plr knife_pull_plr
Knife_slash_plr Knife_slash_plr
gr_mg_deploy_start gr_mg_deploy_start
gr_mg_deploy_end gr_mg_deploy_end
gr_mg_break_down gr_mg_break_down
gr_30cal_tap_plr gr_30cal_tap_plr
Any help is appreciated.
Instead of searching line by line, you can search the entire file at once. I have included a code example below, which searches one file for multiple keywords and prints the keywords found.
keywords = ["gr_30cal_open_plr", "gr_mg_deploy_end", "wontfindthis"]

with open("test.txt") as f:
    contents = f.read()

# Search the file for each keyword.
keywords_found = {keyword for keyword in keywords if keyword in contents}

if keywords_found:
    print("This file contains the following keywords:")
    print(keywords_found)
else:
    print("This file did not contain any keywords.")
I'll explain the code. f.read() will read the file contents. Then I use a set comprehension to get all of the keywords found in the file. I use a set because that will keep only the unique keywords -- I assume you don't need to know how many times a keyword appears in the file. (A set comprehension is similar to a list comprehension, but it creates a set.) Testing whether the keyword is in the file is as easy as keyword in contents.
I used your sample file contents and duplicated it multiple times so the file contained 45,252,362 lines (1.8 GB). And my code above took less than 1 second.
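Since you also need to know which files matched (that is what the root... paths in your question are for), the same whole-file idea can be wrapped in a small per-file function. This is only a sketch reusing the names from your question (weapon_files, sAlias, WAW_ROOT_DIR); re.escape is added in case the alias contains regex metacharacters:

import re

def files_containing(alias, weapon_files, root_dir):
    """Return the root-relative path of every file whose contents contain an exact match of alias."""
    pattern = re.compile(fr'\b{re.escape(alias)}\b')  # compile once per alias, not once per line
    matched = []
    for path in weapon_files:
        with open(path, 'r', encoding="ISO-8859-1") as fh:
            contents = fh.read()
        # Cheap substring test first; run the word-boundary search only if that passes.
        if alias in contents and pattern.search(contents):
            matched.append(f"root{path[len(root_dir):]}")
    return matched

# called_file_ = files_containing(sAlias, weapon_files, WAW_ROOT_DIR)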
Well, you can use multiprocessing to speed up your work. I don't think it is the best way, but I am sharing the code so you can try it for yourself and see whether it works for you or not.
import multiprocessing
import re

def process(file):
    matches = []
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        for m in weapon_file:
            if sAlias in m:
                t = re.compile(fr'\b{sAlias}\b')
                result = t.search(m)
                if result:
                    matches.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
    # Return the matches: a child process cannot append to a list owned by the parent process.
    return matches

if __name__ == '__main__':
    p = multiprocessing.Pool()
    # Launch a task for each file; the pool runs roughly one process per available CPU core.
    async_results = [p.apply_async(process, [file]) for file in weapon_files]
    p.close()
    p.join()  # Wait for all child processes to finish.
    for r in async_results:
        called_file_.extend(r.get())
Hope it helps.

fastText use (getting it up to compare word vectors)

I am a little ashamed that I have to ask this question, because I feel like I should know this. I haven't been programming long, but I am trying to apply what I learn to a project I'm working on, and that is how I got to this question. fastText has a library of words and their associated vectors: https://fasttext.cc/docs/en/english-vectors.html . It is used to find the vector of a word. I just want to look a word or two up and see what the result is, in order to see if it is useful for my project. They have provided a list of vectors and then a small code chunk. I cannot make heads or tails of it. Some of it I get, but I do not see a print function - is it returning the data to a different part of your own code? I am also not sure where the chunk of code opens the data file; usually fname is a handle, right? Or are they expecting you to type your file's path there? I am also not familiar with io; I googled the word but didn't find anything useful. Is this something I need to download, or is it already part of Python? I know I might be a little out of my league, but I learn best by doing, so please don't hate on me.
import io
def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data
Try the following:
my_file_name = 'C:/path/to/file.txt' # Use the path to your file of rows of sentences
my_data = load_vectors(my_file_name) # Function will return data
print(my_data) # To see the output
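If you only want to look up a word or two, note that the values in the returned dict are lazy map objects in Python 3, so wrap one in list() to actually see the numbers. A small sketch (the file name and the word 'king' are only examples):

my_data = load_vectors('wiki-news-300d-1M.vec')  # use the path of the vector file you downloaded

vector = list(my_data['king'])  # materialise the lazy map object into a list of floats
print(len(vector))              # the dimensionality of the vectors, e.g. 300
print(vector[:5])               # the first few components, just to eyeball the output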

Tips for working with a large quantity of .txt files (and overall large size) - Python?

I'm working on a script to parse txt files and store them into a pandas dataframe that I can export to a CSV.
My script worked fine when I was using <100 of my files, but now, trying to run it on the full sample, I'm running into a lot of issues.
I'm dealing with ~8000 .txt files with an average size of 300 KB, so in total about 2.5 GB.
I was wondering if I could get tips on how to make my code more efficient.
For opening and reading files, I use:
filenames = os.listdir('.')
dict = {}
for file in filenames:
    with open(file) as f:
        contents = f.read()
        dict[file.replace(".txt", "")] = contents
Doing print(dict) crashes my Python (or at least it seems to).
Is there a better way to handle this?
Additionally, I also convert all the values in my dict to lowercase, using:
def lower_dict(d):
    lcase_dict = dict((k, v.lower()) for k, v in d.items())
    return lcase_dict

lower = lower_dict(dict)
I haven't tried this yet (I can't get past the opening/reading stage), but I was wondering if this would cause problems?
Now, before I am marked as duplicate, I did read this: How can I read large text files in Python, line by line, without loading it into memory?
however, that user seemed to be working with 1 very large file which was 5GB, whereas I am working with multiple small files totalling 2.5GB (and actually my ENTIRE sample is something like 50GB and 60,000 files). So I was wondering if my approach would need to be different.
Sorry if this is a dumb question; unfortunately, I am not well versed in the field of RAM and computer processing methods.
Any help is very much appreciated.
Thanks.
I believe the thing slowing your code down the most is the .replace() method you are using. I believe this is because the built-in replace method is iterative and, as a result, very inefficient. Try using the re module in your for loops. Here is an example of how I recently used the module to replace the characters "T", ":" and "-" with "", which in this case removed them from the file:
import re

for line in lines:
    line = re.sub('[T:-]', '', line)
Let me know if this helps!
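As a rough sketch of how that might look applied to the dictionary from the question, with the pattern compiled once outside the loop (the character class is just the example above, and the function name is mine):

import re

pattern = re.compile('[T:-]')  # compiled once; the characters are only the example from above

def clean_and_lower(d):
    """Return a new dict with the example characters removed and the values lowercased."""
    return {k: pattern.sub('', v).lower() for k, v in d.items()}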

Read specific lines of csv file

Hello guys,
so I have a huge CSV file (500K lines), and I want to process the file simultaneously with 4 processes (so each one will read approx. 100K lines).
What is the best way to do it using multiprocessing?
What I have up till now:
from csv import DictReader
import multiprocessing

def csv_handler(path, processes=5):
    test_arr = []
    with open(path) as fd:
        reader = DictReader(fd)
        for row in reader:
            test_arr.append(row)
    current_line = 0
    equal_length = len(test_arr) // 5   # integer division, so the slice bounds are ints
    for i in range(5):
        process1 = multiprocessing.Process(target=get_data, args=(test_arr[current_line: current_line + equal_length],))
        current_line = current_line + equal_length
I know it's a bad idea to read everything in one go like that, but I haven't found another option.
I would be happy to get some ideas on how to do it in a better way!
CSV is a pretty tricky format to split the reads up with, and other file formats may be more ideal.
The basic problem is that, as lines may be different lengths, you can't easily know where a particular line starts in order to seek to it. You would have to scan through the file counting newlines, which is, basically, reading it.
But you can get pretty close, which sounds like it is enough for your needs. Say, for two parts: take the file size and divide it by 2.
For the first part, you start at zero and stop after completing the record at file_size / 2.
For the second part, you seek to file_size / 2, look for the next newline, and start there.
This way, while the Python processes won't all get exactly the same amount, it will be pretty close, and it avoids too much inter-process message passing or multi-threading, and, with CPython, probably the global interpreter lock.
Of course, all the normal things for optimising either file I/O or Python code still apply (depending on where your bottleneck lies; you need to measure this).
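A minimal sketch of that offset idea, with hypothetical helper names; a real version would also need to handle the CSV header row if each worker uses DictReader:

import os

def chunk_offsets(path, n_chunks):
    """Split the file into roughly equal byte ranges, moving each boundary forward to the start of the next full line."""
    file_size = os.path.getsize(path)
    boundaries = [0]
    with open(path, 'rb') as f:
        for i in range(1, n_chunks):
            f.seek(file_size * i // n_chunks)
            f.readline()  # skip the partial line we landed in the middle of
            boundaries.append(f.tell())
    boundaries.append(file_size)
    return list(zip(boundaries[:-1], boundaries[1:]))

def process_chunk(path, start, end):
    """Worker: seek to the start offset and read lines until passing the end offset."""
    with open(path, 'rb') as f:
        f.seek(start)
        while f.tell() < end:
            line = f.readline()
            # ...parse and handle the CSV line here...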

How do I write a simple Python parsing script?

Most of what I do involves writing simple parsing scripts that read search terms from one file and search another file line by line. Once a search term is found, the line and sometimes the following line are written to another output file. The code I use is rudimentary and likely crude.
#!/usr/bin/env python
data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()

db = open("db.txt", "r")
output = open("output.txt", "w")

for term in search_terms:
    for line in db:
        if line.find(term) > -1:
            next_line = db.next()
            output.write(">" + head + "\n" + next_line)
            print("Found %s" % term)
There are a few problems here. First, I don't think searching line by line is the most efficient or fastest approach, but I'm not exactly sure about that. Second, I often run into issues with cursor placement: the cursor doesn't reset to the beginning of the file when the search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever the script iterates through the entire db and can't find a term. I've tried adding a snippet that counts the number of lines of the db, so that if the find() function gets to the last line and the term isn't found, it outputs to another "not found" file, but I haven't been able to get my elif and else branches right.
Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.
Thanks.
Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().
with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()
Now search_terms is a handle to a list that has each line from data.txt as a string (but with the newline characters removed). And data.txt is closed thanks to with.
In fact, I would do that with the db.txt file, also.
with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
I would suggest putting the biggest object on the outside of your loop, which I'm guessing is the db.txt contents. The outermost sequence usually only gets iterated over once, so you might as well put the biggest thing there.
results = []

for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            # Use something not likely to appear in your line as a separator
            # for these "second lines". I used three pipe characters, but
            # you could just as easily use something even more random
            results.append('{}|||{}'.format(line, lines[i+1]))

if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass
The nice thing about this approach is that each line in db.txt is checked once for each search term in search_terms. However, the downside is that any line will be recorded once for each search term it contains; i.e., if it has three search terms in it, that line will appear in your output.txt three times.
And all the files are magically closed.
Context managers are cool.
Good luck!
search_terms keeps the whole of data.txt in memory. That's not good in general, but in this case it's not too bad.
Searching line by line is not the most efficient approach, but if the case is simple and the files are not too big, it's not a big deal. If you want more efficiency, you should sort the data.txt file and put it into some tree-like structure; it depends on the data inside.
You have to use seek to move the file pointer back after using next.
Probably the easiest way here is to generate two lists of lines and search using in, like:
db = open('db.txt').readlines()
db_words = [x.split() for x in db]
data = [x.strip() for x in open('data.txt').readlines()]

print('Lines in db {}'.format(len(db)))

for item in data:
    for words in db_words:
        if item in words:
            print("Found {}".format(item))
Your key issue is that you may be looping in the wrong order -- in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop db will be at its end, with no more lines to read, and no other term will ever be found.
Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).
So, for example, something like:
with open("data.txt", "r") as data:
search_terms = data.read().splitlines()
missing_terms = set(search_terms)
with open("db.txt", "r") as db, open("output.txt", "w") as output:
for line in db:
for term in search_terms:
if term in line:
missing_terms.discard(term)
next_line = db.next()
output.write(">" + head + "\n" + next_line)
print("Found {}".format(term))
break
if missing_terms:
diagnose_not_found(missing_terms)
where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
There are assumptions embedded here, such as the assumption that you don't care whether some other search term is present in a line where you've already found a previous one, or in the very next line; these might take substantial work to fix if they don't apply, and that would require you to edit your question with a very complete and unambiguous list of specifications.
If your db is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would allow easier accommodation of more demanding specs (in that case you can easily go back and forth, while iterating over a file means you can only go forward one line at a time). So if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether you need this script to process potentially humongous db files (say, gigabyte-plus sizes, so as not to "comfortably fit in memory", depending on your platform of course).
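If db.txt does fit comfortably in memory, a sketch of that slurped variant might look like the following (it writes the matched line itself before the following line, since head in the posted code is undefined):

with open("data.txt") as data_file:
    search_terms = data_file.read().splitlines()

with open("db.txt") as db_file:
    db_lines = db_file.read().splitlines()

missing_terms = set(search_terms)
with open("output.txt", "w") as output:
    for i, line in enumerate(db_lines):
        for term in search_terms:
            if term in line:
                missing_terms.discard(term)
                next_line = db_lines[i + 1] if i + 1 < len(db_lines) else ""
                output.write(">" + line + "\n" + next_line + "\n")
                break

if missing_terms:
    diagnose_not_found(missing_terms)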
