I have several 1+ GB text files of URLs. I'm trying to use Python to find and replace in order to quickly strip down the URLs.
Because these files are big, I don't want to load them into memory.
My code works on small test files of 50 lines, but when I use it on a big text file, it actually makes the file larger.
import re
import sys

def ProcessLargeTextFile():
    with open("C:\\Users\\Combined files\\test2.txt", "r") as r, open("C:\\Users\\Combined files\\output.txt", "w") as w:
        for line in r:
            line = line.replace('https://twitter.com/', '')
            w.write(line)
    return

ProcessLargeTextFile()
print("Finished")
Small files I tested my code with result in just the Twitter username (as desired):
username_1
username_2
username_3
while large files result in:
https://twitter.com/username_1ഀ
https://twitter.com/username_2ഀ
https://twitter.com/username_3ഀ
It's a problem with the encoding of the file; this works:
import re

def main():
    inputfile = open("1-10_no_dups_split_2.txt", "r", encoding="UTF-16")
    outputfile = open("output.txt", "a", encoding="UTF-8")
    for line in inputfile:
        line = re.sub("^https://twitter.com/", "", line)
        outputfile.write(line)
    outputfile.close()

main()
The trick is to specify UTF-16 when reading the file, then write it out as UTF-8. And voilà, the weird stuff goes away :) I do a lot of work moving text files around with Python. There are many settings you can play with to automatically replace certain characters and whatnot; just read up on the "open" command if you get into a weird spot, or post back here :).
Taking a quick look at the results, you'll probably want a few regexes so you can catch https://mobile.twitter.com/ and other stuff, but that's another story. Good luck!
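For example, a minimal sketch of handling both prefixes in one pass (the file names here are placeholders, and the pattern only covers the two variants mentioned above):

import re

# Strip either the desktop or the mobile Twitter prefix from the start of each line.
prefix = re.compile(r"^https://(?:mobile\.)?twitter\.com/")

with open("input.txt", "r", encoding="UTF-16") as inputfile, \
        open("output.txt", "w", encoding="UTF-8") as outputfile:
    for line in inputfile:
        outputfile.write(prefix.sub("", line))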
You can use the open() function's buffering parameter.
Here is the code for it:
import re
import sys

def ProcessLargeTextFile():
    with open("C:\\Users\\Combined files\\test2.txt", "r", buffering=200000000) as r, open("C:\\Users\\Combined files\\output.txt", "w") as w:
        for line in r:
            line = line.replace('https://twitter.com/', '')
            w.write(line)
    return

ProcessLargeTextFile()
print("Finished")
So I am reading roughly 200 MB of data into memory at a time (buffering is given in bytes).
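If you want the buffer size to be readable at a glance, one small variation (just a sketch; the exact size is your choice) is to build it from a megabyte constant:

MB = 1024 * 1024  # bytes per mebibyte

with open("C:\\Users\\Combined files\\test2.txt", "r", buffering=200 * MB) as r, \
        open("C:\\Users\\Combined files\\output.txt", "w") as w:
    for line in r:
        w.write(line.replace('https://twitter.com/', ''))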
So I've looked on Google and the only results I get are about "reading large files", not much about how to speed up reading multiple files.
I have a sound-alias keyword. This keyword will need to be scanned for in up to 128 files,
and I could have up to 1,600 keywords to scan for in said files.
So as you can see, that's a lot of opening/reading, and its loading time is very slow. I can't have it be this slow for my program; I need to reduce the load time tenfold.
So I have this code snippet, which reads files line by line; if there is a mention of the keyword in a line, it then does an exact-match check.
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        for m in weapon_file:
            if sAlias in m:
                t = re.compile(fr'\b{sAlias}\b')
                result = t.search(m)
                if result:
                    called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
I then thought I'd try and see if I could speed things up by turning the file into a string: do a basic scan to see if there's any mention and, if so, do an exact-match search. But this approach didn't speed things up by anything worth caring about.
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        w = weapon_file.read()
        if sAlias in w:
            t = re.compile(fr'\b{sAlias}\b')
            result = t.search(w)
            if result:
                called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
I then thought I'd just open each file, turn it into a string, append all the file-strings together, check for any mention, then do an exact-match search. This did actually reduce the loading time, but then I realised I can't use that approach: the whole point of scanning these files for an exact keyword match is to store the matched file's directory in a list, and this approach removes any chance of that.
weaponString = ""
for file in weapon_files:
with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
e = weapon_file.read()
weaponString += e
if sAlias in weaponString:
t = re.compile(fr'\b{sAlias}\b')
result = t.search(weaponString)
if result:
called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
This is what the files look like.
It may also be worth mentioning that these files have no extension, but I don't think that's an issue, as Python can still read them just fine.
WEAPONFILE\displayName\WEAPON_30CAL\modeName\\playerAnimType\smg\altWeapon\\AIOverlayDescription\WEAPON_SUBMACHINEGUNNER\weaponType\bullet\weaponClass\mg\penetrateType\large\impactType\bullet_large\inventoryType\primary\fireType\Full Auto\clipType\lmg\twoHanded\1\rifleBullet\0\armorPiercing\0\boltAction\0\aimDownSight\1\rechamberWhileAds\1\noADSAutoReload\0\noPartialReload\0\segmentedReload\0\adsFire\0\noAdsWhenMagEmpty\0\avoidDropCleanup\0\enhanced\0\bayonet\0\cancelAutoHolsterWhenEmpty\0\suppressAmmoReserveDisplay\0\laserSightDuringNightvision\0\blocksProne\0\silenced\0\mountableWeapon\0\autoAimRange\1200\aimAssistRange\3200\aimAssistRangeAds\3200\enemyCrosshairRange\720\crosshairColorChange\1\moveSpeedScale\0.75\adsMoveSpeedScale\0.75\sprintDurationScale\0.75\gunMaxPitch\6\gunMaxYaw\6\lowAmmoWarningThreshold\0.33\ammoName\30cal\maxAmmo\500\startAmmo\500\clipName\30cal\clipSize\125\shotCount\1\dropAmmoMin\200\dropAmmoMax\250\reloadAmmoAdd\0\reloadStartAdd\0\damage\130\minDamage\90\meleeDamage\150\maxDamageRange\1024\minDamageRange\2400\playerDamage\70\locNone\1\locHelmet\3\locHead\3\locNeck\1\locTorsoUpper\1\locTorsoLower\1\locRightArmUpper\1\locRightArmLower\1\locRightHand\1\locLeftArmUpper\1\locLeftArmLower\1\locLeftHand\1\locRightLegUpper\1\locRightLegLower\1\locRightFoot\1\locLeftLegUpper\1\locLeftLegLower\1\locLeftFoot\1\locGun\0\fireTime\0.096\fireDelay\0\meleeTime\0.5\meleeChargeTime\1\meleeDelay\0.05\meleeChargeDelay\0.15\reloadTime\7\reloadEmptyTime\6\reloadEmptyAddTime\0\reloadStartTime\0\reloadEndTime\0\reloadAddTime\4.75\reloadStartAddTime\0\rechamberTime\0.1\rechamberBoltTime\0\dropTime\0.83\raiseTime\0.9\altDropTime\0.7\altRaiseTime\0\quickDropTime\0.25\quickRaiseTime\0.25\firstRaiseTime\1.5\emptyDropTime\0.5\emptyRaiseTime\0.5\sprintInTime\0.5\sprintLoopTime\0.8\sprintOutTime\0.2\deployTime\0.5\breakdownTime\0.5\nightVisionWearTime\0.5\nightVisionWearTimeFadeOutEnd\0\nightVisionWearTimePowerUp\0\nightVisionRemoveTime\0.5\nightVisionRemoveTimePowerDown\0\nightVisionRemoveTimeFadeInStart\0\standMoveF\0\standMoveR\0\standMoveU\-2\standRotP\0\standRotY\0\standRotR\0\standMoveMinSpeed\0\standRotMinSpeed\0\posMoveRate\8\posRotRate\8\sprintOfsF\1\sprintOfsR\-2\sprintOfsU\-1\sprintRotP\10\sprintRotY\45\sprintRotR\-20\sprintBobH\8\sprintBobV\6\sprintScale\0.9\duckedSprintOfsF\2\duckedSprintOfsR\-1\duckedSprintOfsU\0\duckedSprintRotP\10\duckedSprintRotY\25\duckedSprintRotR\-20\duckedSprintBobH\2\duckedSprintBobV\3\duckedSprintScale\0.8\duckedMoveF\0\duckedMoveR\0\duckedMoveU\-1.5\duckedRotP\0\duckedRotY\0\duckedRotR\0\duckedOfsF\-0.5\duckedOfsR\0.25\duckedOfsU\-0.6\duckedMoveMinSpeed\0\duckedRotMinSpeed\0\proneMoveF\-160\proneMoveR\-75\proneMoveU\-120\proneRotP\0\proneRotY\300\proneRotR\-300\proneOfsF\0\proneOfsR\0.5\proneOfsU\-1\posProneMoveRate\10\posProneRotRate\10\proneMoveMinSpeed\0\proneRotMinSpeed\0\hipIdleAmount\30\adsIdleAmount\28\hipIdleSpeed\1\adsIdleSpeed\0.9\idleCrouchFactor\0.75\idleProneFactor\0.4\adsSpread\0\adsAimPitch\0\adsTransInTime\0.22\adsTransOutTime\0.4\adsTransBlendTime\0.1\adsReloadTransTime\0.3\adsCrosshairInFrac\1\adsCrosshairOutFrac\0.2\adsZoomFov\50\adsZoomInFrac\0.7\adsZoomOutFrac\0.4\adsBobFactor\0\adsViewBobMult\0.25\adsViewErrorMin\0\adsViewErrorMax\0\hipSpreadStandMin\4\hipSpreadDuckedMin\3.5\hipSpreadProneMin\3\hipSpreadMax\10\hipSpreadDuckedMax\8\hipSpreadProneMax\6\hipSpreadFireAdd\0.6\hipSpreadTurnAdd\0\hipSpreadMoveAdd\5\hipSpreadDecayRate\4\hipSpreadDuckedDecay\1.05\hipSpreadProneDecay\1.1\hipGunKickReducedKickBullets\0\hipGunKickReducedKickPe
rcent\0\hipGunKickPitchMin\5\hipGunKickPitchMax\-15\hipGunKickYawMin\5\hipGunKickYawMax\-5\hipGunKickAccel\800\hipGunKickSpeedMax\2000\hipGunKickSpeedDecay\16\hipGunKickStaticDecay\20\adsGunKickReducedKickBullets\0\adsGunKickReducedKickPercent\75\adsGunKickPitchMin\5\adsGunKickPitchMax\15\adsGunKickYawMin\-5\adsGunKickYawMax\10\adsGunKickAccel\800\adsGunKickSpeedMax\2000\adsGunKickSpeedDecay\32\adsGunKickStaticDecay\40\hipViewKickPitchMin\70\hipViewKickPitchMax\80\hipViewKickYawMin\-30\hipViewKickYawMax\-60\hipViewKickCenterSpeed\1500\adsViewKickPitchMin\45\adsViewKickPitchMax\55\adsViewKickYawMin\-70\adsViewKickYawMax\70\adsViewKickCenterSpeed\1800\swayMaxAngle\4\swayLerpSpeed\6\swayPitchScale\0.1\swayYawScale\0.1\swayHorizScale\0.2\swayVertScale\0.2\swayShellShockScale\5\adsSwayMaxAngle\4\adsSwayLerpSpeed\6\adsSwayPitchScale\0.1\adsSwayYawScale\0\adsSwayHorizScale\0.08\adsSwayVertScale\0.1\fightDist\720\maxDist\340\aiVsAiAccuracyGraph\thompson.accu\aiVsPlayerAccuracyGraph\light_machine_gun.accu\reticleCenter\\reticleSide\reticle_side_small\reticleCenterSize\4\reticleSideSize\8\reticleMinOfs\0\hipReticleSidePos\0\adsOverlayShader\\adsOverlayShaderLowRes\\adsOverlayReticle\none\adsOverlayWidth\220\adsOverlayHeight\220\gunModel\viewmodel_usa_30cal_lmg\gunModel2\\gunModel3\\gunModel4\\gunModel5\\gunModel6\\gunModel7\\gunModel8\\gunModel9\\gunModel10\\gunModel11\\gunModel12\\gunModel13\\gunModel14\\gunModel15\\gunModel16\\handModel\viewmodel_hands_no_model\worldModel\weapon_usa_30cal_lmg\worldModel2\\worldModel3\\worldModel4\\worldModel5\\worldModel6\\worldModel7\\worldModel8\\worldModel9\\worldModel10\\worldModel11\\worldModel12\\worldModel13\\worldModel14\\worldModel15\\worldModel16\\worldClipModel\\knifeModel\viewmodel_usa_kbar_knife\worldKnifeModel\weapon_usa_kbar_knife\idleAnim\viewmodel_30cal_idle\emptyIdleAnim\viewmodel_30cal_empty_idle\fireAnim\viewmodel_30cal_fire\lastShotAnim\viewmodel_30cal_lastshot\rechamberAnim\\meleeAnim\viewmodel_knife_slash\meleeChargeAnim\viewmodel_knife_stick\reloadAnim\viewmodel_30cal_partial_reload\reloadEmptyAnim\viewmodel_30cal_reload\reloadStartAnim\\reloadEndAnim\\raiseAnim\viewmodel_30cal_pullout\dropAnim\viewmodel_30cal_putaway\firstRaiseAnim\viewmodel_30cal_first_raise\altRaiseAnim\\altDropAnim\\quickRaiseAnim\viewmodel_30cal_pullout_fast\quickDropAnim\viewmodel_30cal_putaway_fast\emptyRaiseAnim\viewmodel_30cal_pullout_empty\emptyDropAnim\viewmodel_30cal_putaway_empty\sprintInAnim\\sprintLoopAnim\\sprintOutAnim\\nightVisionWearAnim\\nightVisionRemoveAnim\\adsFireAnim\viewmodel_30cal_ADS_fire\adsLastShotAnim\viewmodel_30cal_ADS_lastshot\adsRechamberAnim\\adsUpAnim\viewmodel_30cal_ADS_up\adsDownAnim\viewmodel_30cal_ADS_down\deployAnim\\breakdownAnim\\viewFlashEffect\weapon/muzzleflashes/fx_30cal_bulletweap_view\worldFlashEffect\weapon/muzzleflashes/fx_30cal_bulletweap\viewShellEjectEffect\weapon/shellejects/fx_heavy_link_view\worldShellEjectEffect\weapon/shellejects/fx_heavy\viewLastShotEjectEffect\\worldLastShotEjectEffect\\worldClipDropEffect\\pickupSound\weap_pickup\pickupSoundPlayer\weap_pickup_plr\ammoPickupSound\ammo_pickup\ammoPickupSoundPlayer\ammo_pickup_plr\breakdownSound\\breakdownSoundPlayer\\deploySound\\deploySoundPlayer\\finishDeploySound\\finishDeploySoundPlayer\\fireSound\weap_30cal_fire\fireSoundPlayer\weap_30cal_fire_plr\lastShotSound\weap_30cal_fire\lastShotSoundPlayer\weap_30cal_fire_plr\emptyFireSound\dryfire_rifle\emptyFireSoundPlayer\dryfire_rifle_plr\crackSound\\whizbySound\\meleeSwipeSound\melee_swing\meleeSwipeSoundPlayer\mel
ee_swing_plr\meleeHitSound\melee_hit\meleeMissSound\\rechamberSound\\rechamberSoundPlayer\\reloadSound\gr_30cal_3p_full\reloadSoundPlayer\\reloadEmptySound\gr_30cal_3p_full\reloadEmptySoundPlayer\\reloadStartSound\\reloadStartSoundPlayer\\reloadEndSound\\reloadEndSoundPlayer\\altSwitchSound\\altSwitchSoundPlayer\\raiseSound\weap_raise\raiseSoundPlayer\weap_raise_plr\firstRaiseSound\weap_raise\firstRaiseSoundPlayer\weap_raise_plr\putawaySound\weap_putaway\putawaySoundPlayer\weap_putaway_plr\nightVisionWearSound\\nightVisionWearSoundPlayer\\nightVisionRemoveSound\\nightVisionRemoveSoundPlayer\\standMountedWeapdef\\crouchMountedWeapdef\\proneMountedWeapdef\\mountedModel\\hudIcon\hud_icon_30cal\killIcon\hud_icon_30cal\dpadIcon\\ammoCounterIcon\\hudIconRatio\4:1\killIconRatio\4:1\dpadIconRatio\4:1\ammoCounterIconRatio\4:1\ammoCounterClip\Beltfed\flipKillIcon\1\fireRumble\defaultweapon_fire\meleeImpactRumble\defaultweapon_melee\adsDofStart\0\adsDofEnd\7.5\hideTags\\notetrackSoundMap\gr_30cal_start_plr gr_30cal_start_plr
gr_30cal_open_plr gr_30cal_open_plr
gr_30cal_grab_belt_plr gr_30cal_grab_belt_plr
gr_30cal_belt_remove_plr gr_30cal_belt_remove_plr
gr_30cal_belt_raise_plr gr_30cal_belt_raise_plr
gr_30cal_belt_contact_plr gr_30cal_belt_contact_plr
gr_30cal_belt_press_plr gr_30cal_belt_press_plr
gr_30cal_close_plr gr_30cal_close_plr
gr_30cal_charge_plr gr_30cal_charge_plr
gr_30cal_ammo_toss_plr gr_30cal_ammo_toss_plr
gr_30cal_charge_release_plr gr_30cal_charge_release_plr
gr_30cal_lid_bonk_plr gr_30cal_lid_bonk_plr
knife_stab_plr knife_stab_plr
knife_pull_plr knife_pull_plr
Knife_slash_plr Knife_slash_plr
gr_mg_deploy_start gr_mg_deploy_start
gr_mg_deploy_end gr_mg_deploy_end
gr_mg_break_down gr_mg_break_down
gr_30cal_tap_plr gr_30cal_tap_plr
Any help is appreciated.
Instead of searching line by line, you can search the entire file at once. I have included a code example below, which searches one file for multiple keywords and prints the keywords found.
keywords = ["gr_30cal_open_plr", "gr_mg_deploy_end", "wontfindthis"]

with open("test.txt") as f:
    contents = f.read()

# Search the file for each keyword.
keywords_found = {keyword for keyword in keywords if keyword in contents}

if keywords_found:
    print("This file contains the following keywords:")
    print(keywords_found)
else:
    print("This file did not contain any keywords.")
I'll explain the code. f.read() will read the file contents. Then I use a set comprehension to get all of the keywords found in the file. I use a set because that will keep only the unique keywords -- I assume you don't need to know how many times a keyword appears in the file. (A set comprehension is similar to a list comprehension, but it creates a set.) Testing whether the keyword is in the file is as easy as keyword in contents.
I used your sample file contents and duplicated it multiple times so the file contained 45,252,362 lines (1.8 GB). And my code above took less than 1 second.
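To adapt the same whole-file read to the original goal of recording which files contain the keyword, one possible sketch (reusing the names weapon_files, sAlias, WAW_ROOT_DIR and called_file_ from the question, so those must already be defined) would be:

import re

# Compile the whole-word pattern once, outside the file loop.
word_pat = re.compile(fr'\b{sAlias}\b')

for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        contents = weapon_file.read()
    # Cheap substring test first; run the exact whole-word match only if it hits.
    if sAlias in contents and word_pat.search(contents):
        called_file_.append(f"root{file[len(WAW_ROOT_DIR):]}")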
Well, you can use multiprocessing to speed up your work. I don't think it is the best way, but I am sharing the code so you can try it for yourself and see whether it works for you or not.
import multiprocessing
import re

def process(file):
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        for m in weapon_file:
            if sAlias in m:
                t = re.compile(fr'\b{sAlias}\b')
                result = t.search(m)
                if result:
                    called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))

p = multiprocessing.Pool()
for file in weapon_files:
    # now launch a process for each file.
    # The result will be approximately one process per CPU core available.
    p.apply_async(process, [file])
p.close()
p.join()  # Wait for all child processes to be closed.
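One caveat: appends made to called_file_ inside worker processes are not visible in the parent process, because each worker operates on its own copy of the list. A minimal sketch of one way around that, again assuming the question's names (sAlias, WAW_ROOT_DIR, weapon_files, called_file_) are defined at module level, is to have each worker return its matches and collect them in the parent:

import multiprocessing
import re

def process(file):
    # Return the matches instead of appending to a shared list.
    matches = []
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        contents = weapon_file.read()
    if re.search(fr'\b{sAlias}\b', contents):
        matches.append(f"root{file[len(WAW_ROOT_DIR):]}")
    return matches

if __name__ == '__main__':
    with multiprocessing.Pool() as p:
        for result in p.map(process, weapon_files):
            called_file_.extend(result)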
Hope it helps.
I have seen several similar questions on SO (copying trigger lines or chunks of definite sizes), but they don't quite fit what I'm trying to do. I have a very large text file (output from Valgrind) that I'd like to cut down to only the parts I need.
The structure of the file is as follows: they are blocks of lines that start with a title line containing the string 'in loss record'. I want to trigger only on those title lines that also contain the string 'definitely lost', then copy all the lines below until another title line is reached (at which point the decision process is repeated).
How can I implement such a select-and-copy script in Python?
Here's what I've tried so far. It works, but I don't think it is the most efficient (or Pythonic) way of doing it, so I'd like to see faster approaches, as the files I'm working with are usually quite large. (This method takes 1.8 s for a 290 MB file.)
with open("in_file.txt","r") as fin:
with open("out_file.txt","w") as fout:
lines = fin.read().split("\n")
i=0
while i<len(lines):
if "blocks are definitely lost in loss record" in lines[i]:
fout.write(lines[i].rstrip()+"\n")
i+=1
while i<len(lines) and "loss record" not in lines[i]:
fout.write(lines[i].rstrip()+"\n")
i+=1
i+=1
You might try a regex together with mmap.
Something similar to:
import re, mmap

# create a regex that will define each block of text you want here
# (a bytes pattern, since mmap exposes the file as bytes):
pat = re.compile(rb'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)

with open(fn, 'r+b') as f:
    mm = mmap.mmap(f.fileno(), 0)
    for i, m in enumerate(pat.finditer(mm)):
        # m is a block that you want.
        print(m.group(1))
Since there is no input example, that regex certainly does not work as written -- but you get the idea.
With mmap the entire file is treated as a string but not necessarily all in memory so large files can be searched and blocks of it selected in this way.
If your file comfortably fits in memory, you can just read the file and use a regex directly (pseudo Python):
with open(fn) as fo:
    pat = re.compile(r'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
    for i, m in enumerate(pat.finditer(fo.read())):
        pass  # deal with each block in m.group(1)
If you want a line-by-line, non-regex approach, read the file line by line (assuming it is a \n-delimited text file):
with open(fn) as fo:
    for line in fo:
        # deal with each line here
        # DON'T do something like string = fo.read() and
        # then iterate over the lines of the string please...
        # unless you need random access to the lines out of order
        pass
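For completeness, here is a possible line-by-line sketch of the actual select-and-copy task from the question (the marker strings are taken from the question; the file names are placeholders):

def copy_definitely_lost_blocks(in_path, out_path):
    with open(in_path) as fin, open(out_path, "w") as fout:
        copying = False
        for line in fin:
            if "in loss record" in line:
                # A new title line: decide whether to copy the block that follows.
                copying = "definitely lost" in line
            if copying:
                fout.write(line)

copy_definitely_lost_blocks("in_file.txt", "out_file.txt")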
Another way to do this is to use groupby to identify header lines and set functions that will either write or ignore following lines. Then you can iterate the file line by line and reduce your memory footprint.
import itertools

def megs(val):
    return val * (2**20)

def ignorelines(lines):
    for line in lines:
        pass

# assuming ascii or utf-8 you save a small amount of processing by avoiding decode/encode
# and a few fewer trips to the disk with larger buffers
with open('test.log', 'rb', buffering=megs(4)) as infile, \
        open('out.log', 'wb', buffering=megs(4)) as outfile:
    dump_fctn = ignorelines  # ignore lines til we see a good header
    # group by header or contained lines
    for is_hdr, block in itertools.groupby(infile, lambda x: b'in loss record' in x):
        if is_hdr:
            for hdr in block:
                if b'definitely lost' in hdr:
                    outfile.write(hdr)
                    dump_fctn = outfile.writelines
                else:
                    dump_fctn = ignorelines
        else:
            # either writelines or ignorelines, depending on last header seen
            dump_fctn(block)

print(open('out.log').read())
I have multiple .txt files in a source folder, whose path I have given in "src". I want to search for strings that look like "abcd.aiq" and print them to a file that I named "fi".
I have written the following code and it doesn't print anything inside the file, although it doesn't give any error.
import glob
import re
import os
src = (C:\Auto_TEST\Testing\Automation")
file_array= glob.glob(os.path.join(src,".txt"))
fi= open("aiq_hits.txt","w")
for input_file in file_array:
    fo=open(input_file,"r")
    line=fo.readline()
    for line in fo:
        line=r.strip()
        x= re.findall('\S*.aiq\S*',line)
        line= fo.readline()
        for item in x:
            fi.write("%s\n" %item)
fo.close()
fi.close()
I suppose this is what you are trying to do:
import glob
import re
import os.path

src = 'C:/Auto_TEST/Testing/Automation'
file_array = glob.glob(os.path.join(src, '*.txt'))

with open("aiq_hits.txt", "w") as out_file:
    for input_filename in file_array:
        with open(input_filename) as in_file:
            for line in in_file:
                match = re.findall(r'\S*.aiq\S*', line)
                for item in match:
                    out_file.write("%s\n" % item)
Let me quickly describe the changes I've made:
Opening files directly is not always a good idea. If the script crashes, the opened file object isn't being closed again, which can lead to data loss.
Since PEP 343, Python has had the with statement, which is generally agreed to be a better solution when handling files.
Calling f.readline() multiple times results in the script skipping these lines, because for line in f: reads lines on its own.
Finally, after every matching item you found you've been closing both the input file and the output file, so further reading or writing isn't possible anymore.
Edit: If you might need to tweak your regex, this might be a useful resource.
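For instance, if the intention is to match a literal ".aiq" rather than "any character followed by aiq", one small tweak (just a sketch, not necessarily the exact pattern you need) is to escape the dot:

match = re.findall(r'\S*\.aiq\S*', line)  # \. matches a literal dot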
Forgive me if this is asked and answered. If so, chalk it up to my being new to programming and not knowing enough to search properly.
I have a need to read in a file containing a series of several hundred phrases, such as names or email addresses, one per line, to be used as part of a compiled search term - pattern = re.search(name). The 'pattern' variable will be used to search another file of over 5 million lines to identify and extract select fields from relevant lines.
The text of the name file being read in would be in the format of:
John\n
Bill\n
Harry#helpme.com\n
Sally\n
So far I have the code below, which does not error out but also never finishes and exits. If I pass the names manually using slightly different code with sys.argv[1], everything works fine. The area I am having problems with starts at the line lines = open("Filewithnames.txt", 'r').
import sys
import re
import csv
import os

searchdata = open("reallybigfile", "r")
Certfile = csv.writer(open('Certfile.csv', 'ab'), delimiter=',')
lines = open("Filewithnames.txt", 'r')
while True:
    for line in lines:
        line.rstrip('\n')
    lines.seek(0)
    for nam in lines:
        pat = re.compile(nam)
        for f in searchdata.readlines():
            if pat.search(f):
                fields = f.strip().split(',')
                Certfile.writerow([nam, fields[3], fields[4]])
lines.close()
The code at the bottom (starting "for f in searchdata.readlines():") locates, extracts and writes the fields fine. I have been unable to find a way to read in the Filewithnames.txt file and have it use each line. It either hangs, as with this code, or it reads all lines of the file to the last line and returns data only for the last line, e.g. 'Sally'.
Thanks in advance.
while True is an infinite loop, and there is no way to break out of it that I can see. That will definitely cause the program to continue to run forever and not throw an error.
Remove the while True line and de-indent that loop's code, and see what happens.
EDIT:
I have resolved a few issues, as commented, but I will leave you to figure out the precise regex you need to accomplish your goal.
import sys
import re
import csv
import os

searchdata = open("c:\\dev\\in\\1.txt", "r")
# Certfile = csv.writer(open('c:\\dev\\Certfile.csv', 'ab'), delimiter=',')  # moved to later to ensure the file will be closed
lines = open("c:\\dev\\in\\2.txt", 'r')

pats = []  # An array of patterns
for line in lines:
    line.rstrip()
lines.seek(0)

# Add additional conditioning/escaping of input here.
for nam in lines:
    pats.append(re.compile(nam))

with open('c:\\dev\\Certfile.csv', 'ab') as outfile:  # This line opens the file
    Certfile = csv.writer(outfile, delimiter=',')  # This line interprets the output into CSV
    for f in searchdata.readlines():
        for pat in pats:  # A loop for processing all of the patterns
            if pat.search(f) is not None:
                fields = f.strip().split(',')
                Certfile.writerow([pat.pattern, fields[3], fields[4]])

lines.close()
searchdata.close()
First of all, make sure to close all the files, including your output file.
As stated before, the while True loop was causing you to run infinitely.
You need a regex or a set of regexes to cover all of your possible "names". The code is simpler with a set of regexes, so that is what I have done here; it may not be the most efficient. This includes a loop for processing all of the patterns.
I believe you need additional parsing of the input file to give you clean regular expressions. I have left some space for you to do that.
Hope that helps!
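As one possible form of that conditioning, if the names should be matched literally rather than interpreted as regex syntax, the pattern-building loop above could be written like this (a sketch using re.escape):

for nam in lines:
    nam = nam.strip()
    if nam:  # skip blank lines
        pats.append(re.compile(re.escape(nam)))  # re.escape treats the name as literal text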
The last line of my file is:
29-dez,40,
How can I modify that line so that it reads:
29-Dez,40,90,100,50
Note: I don't want to write a new line. I want to take the same line and put new values after 29-Dez,40,
I'm new at Python. I'm having a lot of trouble manipulating files, and every example I look at seems difficult to me.
Unless the file is huge, you'll probably find it easier to read the entire file into a data structure (which might just be a list of lines), and then modify the data structure in memory, and finally write it back to the file.
On the other hand maybe your file is really huge - multiple GBs at least. In which case: the last line is probably terminated with a new line character, if you seek to that position you can overwrite it with the new text at the end of the last line.
So perhaps:
f = open("foo.file", "wb")
f.seek(-len(os.linesep), os.SEEK_END)
f.write("new text at end of last line" + os.linesep)
f.close()
(Modulo line endings on different platforms)
To expand on what Doug said, in order to read the file contents into a data structure you can use the readlines() method of the file object.
The below code sample reads the file into a list of "lines", edits the last line, then writes it back out to the file:
#!/usr/bin/python
MYFILE="file.txt"
# read the file into a list of lines
lines = open(MYFILE, 'r').readlines()
# now edit the last line of the list of lines
new_last_line = (lines[-1].rstrip() + ",90,100,50")
lines[-1] = new_last_line
# now write the modified list back out to the file
open(MYFILE, 'w').writelines(lines)
If the file is very large then this approach will not work well, because this reads all the file lines into memory each time and writes them back out to the file, which is very inefficient. For a small file however this will work fine.
Don't work with files directly; make a data structure that fits your needs, in the form of a class, and give it read-from-file/write-to-file methods.
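A minimal sketch of that idea, using a hypothetical LineFile class (the file name and method names are made up for illustration):

class LineFile:
    """A tiny wrapper that holds a whole text file in memory as a list of lines."""
    def __init__(self, path):
        self.path = path
        with open(path) as f:
            self.lines = f.read().splitlines()

    def append_to_last_line(self, text):
        self.lines[-1] += text

    def save(self):
        with open(self.path, "w") as f:
            f.write("\n".join(self.lines) + "\n")

f = LineFile("file.txt")
f.append_to_last_line("90,100,50")
f.save()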
I recently wrote a script to do something very similar to this. It would traverse a project, find all module dependencies and add any missing import statements. I won't clutter this post up with the entire script, but I'll show how I went about modifying my files.
import os
from mmap import mmap

def insert_import(filename, text):
    if len(text) < 1:
        return
    f = open(filename, 'r+')
    m = mmap(f.fileno(), os.path.getsize(filename))
    origSize = m.size()
    m.resize(origSize + len(text))
    pos = 0
    while True:
        l = m.readline()
        if l.startswith(('import', 'from')):
            continue
        else:
            pos = m.tell() - len(l)
            break
    m[pos+len(text):] = m[pos:origSize]
    m[pos:pos+len(text)] = text
    m.close()
    f.close()
Summary: This snippet takes a filename and a blob of text to insert. It finds the last import statement already present, and sticks the text in at that location.
The part I suggest paying most attention to is the use of mmap. It lets you work with files in the same manner you may work with a string. Very handy.
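A hypothetical call would look like this (the file name and import line are placeholders; note that the snippet above was written for the Python 2-era mmap API, where plain strings rather than bytes are used):

insert_import("some_module.py", "import os\n")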