Python list of lists no loops

So full disclosure, this is hw, but I am having a lot of difficulty figuring this out. My professor has a rather particular challenge in this one portion of the assignment that I can't quite seem to figure out. Basically I'm trying to read a very very large file and put it into a list of lists that's represented as a movie recommendation matrix. She says that this can be done without a for loop and suggests using the readlines() method.
I've been running this code:
movMat = []
with open(u"movie-matrix.txt", 'r', encoding="ISO-8859-1") as f:
    movMat.append(f.readlines())
But when I run diff on the output, it is not equivalent to the original file. Any suggestions for how I should go about this?
Update: upon further analysis, I think my code is correct. I added this to make it a list of (index, line) tuples.
with open(u"movie-matrix.txt", 'r', encoding="ISO-8859-1") as f:
    movMat = list(enumerate(f.readlines()))
Update 2: Since people seem to want the file I'm reading from, allow me to explain. This is a ranking system from 1-5. If a person has not ranked a movie, the entry is left empty between the ';' separators. This is the second line of the file.
"3;;;;3;;;;;;;;3;;;;;;;;;2;;;;;;;;3;;;;;;;;;;;;5;;;;;;;1;;;;;;;;;;;;;;;3;;;;;;;;3;;;;;;;;;;;4;;;;4;;;;;3;;;2;;;;;;;2;;;;;;;;3;;;;;;;;;;;;;;;;;;;;4;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;4;;;;;;;;;;;;;;;3;;;;3;;;4;2;;;;;;3;;;;;;4;;;;3;;;;;3;;;;;;;;;;;;2;;;;;;;;;;;;;;;3;4;;;;;;5;;;;;;;;;;;3;2;;;1;;;;;4;;;4;3;;;;;;;;;;;;4;3;;;;;;;;2;;3;;2;;;;;;;;;;;;;;;4;;;;;1;;2;;;;;;;;;;;;;;;;;;;5;;;;;;;;;;;;;;;;;4;;;;;;;;;;4;4;;;;2;3;;;;;;3;;4;;;;;;4;;;;;3;3;;;;;;1;;4;;;;;;;;;4;;;;;;;;;2;;;;3;;;;;;4;;;;;;;3;;;;;;;;4;;;;;4;;;;;;;;;;;1;;;;;;5;;;;;;;;;;;;4;;;3;;;;;;;;2;;1;;;;;;;;;4;;;;;;;;;;;;;;;3;;;;;;;;;;;5;;;;4;;;;;;;3;;;;;;;;2;;;;;;;;;;3;;;;;5;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;3;;;;;;;;;;;;;;;;;;2;;;3;4;;;;;3;;;;;4;;;;;;;;4;;4;3;;;;;4;;3;;;1;;3;;;;;2;;;;;;;;;;;4;;;;;;;;;;;3;;;;3;;;;;;;;;;;;;;;;;;;3;;;;4;;;;;;3;;;;;;;;;;;;4;;;;;;;;;;;3;;;;;;;;3;;;4;;4;;;;;;3;;;;;;;3;;;;;;;;;3;1;;;;;;;;;;;;;;;;3;;;;;3;5;;4;;;;;;4;;3;4;;;;;;;;3;;;;;;;;;;;3;;;;3;;;;;;;;;;;;;;4;;5;;;;;;;;;;;;;;;;;;4;;;;2;;2;;;;;;;;;;3;;;;;;4;;;3;;;4;;;;3;;;3;;;;;;;;;;;;;;;;;3;;;;;;;;3;;;;;;;;;;4;;;;;;;;;5"

I can't think of any case where f.readlines() would be better than just using f as an iterable. That is, for example,
with open('movie-matrix.txt', 'r', encoding="ISO-8859-1") as f:
    movMat = list(f)
(There's no reason to use the u'...' notation in Python 3 -- which you must be using, since the built-in open accepts encoding=...!)
Yes, f.readlines() would be equivalent to list(f) -- but it's more verbose and less obvious, so, what's the point?!
Assuming you have to output this to another file, since you mention "running diff on the output", that would be
with open('other.txt', 'w', encoding="ISO-8859-1") as f:
    f.writelines(movMat)
-- again, no for loop needed there :-).
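If what you ultimately need is a genuine list of lists (one row of ratings per person), map keeps things loop-free too. A minimal sketch, assuming each line uses ';' as the separator as in your sample:
with open('movie-matrix.txt', 'r', encoding="ISO-8859-1") as f:
    # strip the newline, then split on ';' -- empty strings mark unrated movies
    movMat = list(map(lambda line: line.rstrip('\n').split(';'), f))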

Related

How to speed up reading/scanning mass files for an exact keyword match

So I've looked on Google, and the only results I get are about "reading large files", not much about how to speed up reading multiple files.
I have a sound-alias keyword. This keyword will need to be scanned for in up to 128 files, and I could have up to 1,600 keywords to scan for in said files.
So as you can see, that's a lot of opening/reading, and the loading time is very slow. I can't have it be this slow for my program; I need to reduce the load time tenfold.
So I have this code snippet, which reads files line by line; if a mention of the keyword is in a line, it then does an exact-match check.
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        for m in weapon_file:
            if sAlias in m:
                t = re.compile(fr'\b{sAlias}\b')
                result = t.search(m)
                if result:
                    called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
I then thought I'd try and see if I could speed things up by turning each file into a single string: do a basic scan to see if there's any mention and, if so, then do an exact-match search. But this approach didn't speed it up by anything worth caring about.
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        w = weapon_file.read()
        if sAlias in w:
            t = re.compile(fr'\b{sAlias}\b')
            result = t.search(w)
            if result:
                called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
I then thought I'd just open each file, turn it into a string, and append all the file-strings together, check for any mention, then do an exact-match search. That did actually reduce the loading time, but then I realised I can't use this approach: the whole point of scanning these files for an exact keyword match is to store the matched file's directory in a list, and this approach removes any chance of that.
weaponString = ""
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        e = weapon_file.read()
        weaponString += e
if sAlias in weaponString:
    t = re.compile(fr'\b{sAlias}\b')
    result = t.search(weaponString)
    if result:
        called_file_.append(''.join(f"root{file[len(WAW_ROOT_DIR):]}"))
This is what the files look like.
It may also be worth mentioning that these files have no .extension, but I don't think that's an issue, as Python can still read them just fine.
WEAPONFILE\displayName\WEAPON_30CAL\modeName\\playerAnimType\smg\altWeapon\\AIOverlayDescription\WEAPON_SUBMACHINEGUNNER\weaponType\bullet\weaponClass\mg\penetrateType\large\impactType\bullet_large\inventoryType\primary\fireType\Full Auto\clipType\lmg\twoHanded\1\rifleBullet\0\armorPiercing\0\boltAction\0\aimDownSight\1\rechamberWhileAds\1\noADSAutoReload\0\noPartialReload\0\segmentedReload\0\adsFire\0\noAdsWhenMagEmpty\0\avoidDropCleanup\0\enhanced\0\bayonet\0\cancelAutoHolsterWhenEmpty\0\suppressAmmoReserveDisplay\0\laserSightDuringNightvision\0\blocksProne\0\silenced\0\mountableWeapon\0\autoAimRange\1200\aimAssistRange\3200\aimAssistRangeAds\3200\enemyCrosshairRange\720\crosshairColorChange\1\moveSpeedScale\0.75\adsMoveSpeedScale\0.75\sprintDurationScale\0.75\gunMaxPitch\6\gunMaxYaw\6\lowAmmoWarningThreshold\0.33\ammoName\30cal\maxAmmo\500\startAmmo\500\clipName\30cal\clipSize\125\shotCount\1\dropAmmoMin\200\dropAmmoMax\250\reloadAmmoAdd\0\reloadStartAdd\0\damage\130\minDamage\90\meleeDamage\150\maxDamageRange\1024\minDamageRange\2400\playerDamage\70\locNone\1\locHelmet\3\locHead\3\locNeck\1\locTorsoUpper\1\locTorsoLower\1\locRightArmUpper\1\locRightArmLower\1\locRightHand\1\locLeftArmUpper\1\locLeftArmLower\1\locLeftHand\1\locRightLegUpper\1\locRightLegLower\1\locRightFoot\1\locLeftLegUpper\1\locLeftLegLower\1\locLeftFoot\1\locGun\0\fireTime\0.096\fireDelay\0\meleeTime\0.5\meleeChargeTime\1\meleeDelay\0.05\meleeChargeDelay\0.15\reloadTime\7\reloadEmptyTime\6\reloadEmptyAddTime\0\reloadStartTime\0\reloadEndTime\0\reloadAddTime\4.75\reloadStartAddTime\0\rechamberTime\0.1\rechamberBoltTime\0\dropTime\0.83\raiseTime\0.9\altDropTime\0.7\altRaiseTime\0\quickDropTime\0.25\quickRaiseTime\0.25\firstRaiseTime\1.5\emptyDropTime\0.5\emptyRaiseTime\0.5\sprintInTime\0.5\sprintLoopTime\0.8\sprintOutTime\0.2\deployTime\0.5\breakdownTime\0.5\nightVisionWearTime\0.5\nightVisionWearTimeFadeOutEnd\0\nightVisionWearTimePowerUp\0\nightVisionRemoveTime\0.5\nightVisionRemoveTimePowerDown\0\nightVisionRemoveTimeFadeInStart\0\standMoveF\0\standMoveR\0\standMoveU\-2\standRotP\0\standRotY\0\standRotR\0\standMoveMinSpeed\0\standRotMinSpeed\0\posMoveRate\8\posRotRate\8\sprintOfsF\1\sprintOfsR\-2\sprintOfsU\-1\sprintRotP\10\sprintRotY\45\sprintRotR\-20\sprintBobH\8\sprintBobV\6\sprintScale\0.9\duckedSprintOfsF\2\duckedSprintOfsR\-1\duckedSprintOfsU\0\duckedSprintRotP\10\duckedSprintRotY\25\duckedSprintRotR\-20\duckedSprintBobH\2\duckedSprintBobV\3\duckedSprintScale\0.8\duckedMoveF\0\duckedMoveR\0\duckedMoveU\-1.5\duckedRotP\0\duckedRotY\0\duckedRotR\0\duckedOfsF\-0.5\duckedOfsR\0.25\duckedOfsU\-0.6\duckedMoveMinSpeed\0\duckedRotMinSpeed\0\proneMoveF\-160\proneMoveR\-75\proneMoveU\-120\proneRotP\0\proneRotY\300\proneRotR\-300\proneOfsF\0\proneOfsR\0.5\proneOfsU\-1\posProneMoveRate\10\posProneRotRate\10\proneMoveMinSpeed\0\proneRotMinSpeed\0\hipIdleAmount\30\adsIdleAmount\28\hipIdleSpeed\1\adsIdleSpeed\0.9\idleCrouchFactor\0.75\idleProneFactor\0.4\adsSpread\0\adsAimPitch\0\adsTransInTime\0.22\adsTransOutTime\0.4\adsTransBlendTime\0.1\adsReloadTransTime\0.3\adsCrosshairInFrac\1\adsCrosshairOutFrac\0.2\adsZoomFov\50\adsZoomInFrac\0.7\adsZoomOutFrac\0.4\adsBobFactor\0\adsViewBobMult\0.25\adsViewErrorMin\0\adsViewErrorMax\0\hipSpreadStandMin\4\hipSpreadDuckedMin\3.5\hipSpreadProneMin\3\hipSpreadMax\10\hipSpreadDuckedMax\8\hipSpreadProneMax\6\hipSpreadFireAdd\0.6\hipSpreadTurnAdd\0\hipSpreadMoveAdd\5\hipSpreadDecayRate\4\hipSpreadDuckedDecay\1.05\hipSpreadProneDecay\1.1\hipGunKickReducedKickBullets\0\hipGunKickReducedKickPercent\0\hipGunKickPitchMin\5\hipGunKickPitchMax\-15\hipGunKickYawMin\5\hipGunKickYawMax\-5\hipGunKickAccel\800\hipGunKickSpeedMax\2000\hipGunKickSpeedDecay\16\hipGunKickStaticDecay\20\adsGunKickReducedKickBullets\0\adsGunKickReducedKickPercent\75\adsGunKickPitchMin\5\adsGunKickPitchMax\15\adsGunKickYawMin\-5\adsGunKickYawMax\10\adsGunKickAccel\800\adsGunKickSpeedMax\2000\adsGunKickSpeedDecay\32\adsGunKickStaticDecay\40\hipViewKickPitchMin\70\hipViewKickPitchMax\80\hipViewKickYawMin\-30\hipViewKickYawMax\-60\hipViewKickCenterSpeed\1500\adsViewKickPitchMin\45\adsViewKickPitchMax\55\adsViewKickYawMin\-70\adsViewKickYawMax\70\adsViewKickCenterSpeed\1800\swayMaxAngle\4\swayLerpSpeed\6\swayPitchScale\0.1\swayYawScale\0.1\swayHorizScale\0.2\swayVertScale\0.2\swayShellShockScale\5\adsSwayMaxAngle\4\adsSwayLerpSpeed\6\adsSwayPitchScale\0.1\adsSwayYawScale\0\adsSwayHorizScale\0.08\adsSwayVertScale\0.1\fightDist\720\maxDist\340\aiVsAiAccuracyGraph\thompson.accu\aiVsPlayerAccuracyGraph\light_machine_gun.accu\reticleCenter\\reticleSide\reticle_side_small\reticleCenterSize\4\reticleSideSize\8\reticleMinOfs\0\hipReticleSidePos\0\adsOverlayShader\\adsOverlayShaderLowRes\\adsOverlayReticle\none\adsOverlayWidth\220\adsOverlayHeight\220\gunModel\viewmodel_usa_30cal_lmg\gunModel2\\gunModel3\\gunModel4\\gunModel5\\gunModel6\\gunModel7\\gunModel8\\gunModel9\\gunModel10\\gunModel11\\gunModel12\\gunModel13\\gunModel14\\gunModel15\\gunModel16\\handModel\viewmodel_hands_no_model\worldModel\weapon_usa_30cal_lmg\worldModel2\\worldModel3\\worldModel4\\worldModel5\\worldModel6\\worldModel7\\worldModel8\\worldModel9\\worldModel10\\worldModel11\\worldModel12\\worldModel13\\worldModel14\\worldModel15\\worldModel16\\worldClipModel\\knifeModel\viewmodel_usa_kbar_knife\worldKnifeModel\weapon_usa_kbar_knife\idleAnim\viewmodel_30cal_idle\emptyIdleAnim\viewmodel_30cal_empty_idle\fireAnim\viewmodel_30cal_fire\lastShotAnim\viewmodel_30cal_lastshot\rechamberAnim\\meleeAnim\viewmodel_knife_slash\meleeChargeAnim\viewmodel_knife_stick\reloadAnim\viewmodel_30cal_partial_reload\reloadEmptyAnim\viewmodel_30cal_reload\reloadStartAnim\\reloadEndAnim\\raiseAnim\viewmodel_30cal_pullout\dropAnim\viewmodel_30cal_putaway\firstRaiseAnim\viewmodel_30cal_first_raise\altRaiseAnim\\altDropAnim\\quickRaiseAnim\viewmodel_30cal_pullout_fast\quickDropAnim\viewmodel_30cal_putaway_fast\emptyRaiseAnim\viewmodel_30cal_pullout_empty\emptyDropAnim\viewmodel_30cal_putaway_empty\sprintInAnim\\sprintLoopAnim\\sprintOutAnim\\nightVisionWearAnim\\nightVisionRemoveAnim\\adsFireAnim\viewmodel_30cal_ADS_fire\adsLastShotAnim\viewmodel_30cal_ADS_lastshot\adsRechamberAnim\\adsUpAnim\viewmodel_30cal_ADS_up\adsDownAnim\viewmodel_30cal_ADS_down\deployAnim\\breakdownAnim\\viewFlashEffect\weapon/muzzleflashes/fx_30cal_bulletweap_view\worldFlashEffect\weapon/muzzleflashes/fx_30cal_bulletweap\viewShellEjectEffect\weapon/shellejects/fx_heavy_link_view\worldShellEjectEffect\weapon/shellejects/fx_heavy\viewLastShotEjectEffect\\worldLastShotEjectEffect\\worldClipDropEffect\\pickupSound\weap_pickup\pickupSoundPlayer\weap_pickup_plr\ammoPickupSound\ammo_pickup\ammoPickupSoundPlayer\ammo_pickup_plr\breakdownSound\\breakdownSoundPlayer\\deploySound\\deploySoundPlayer\\finishDeploySound\\finishDeploySoundPlayer\\fireSound\weap_30cal_fire\fireSoundPlayer\weap_30cal_fire_plr\lastShotSound\weap_30cal_fire\lastShotSoundPlayer\weap_30cal_fire_plr\emptyFireSound\dryfire_rifle\emptyFireSoundPlayer\dryfire_rifle_plr\crackSound\\whizbySound\\meleeSwipeSound\melee_swing\meleeSwipeSoundPlayer\melee_swing_plr\meleeHitSound\melee_hit\meleeMissSound\\rechamberSound\\rechamberSoundPlayer\\reloadSound\gr_30cal_3p_full\reloadSoundPlayer\\reloadEmptySound\gr_30cal_3p_full\reloadEmptySoundPlayer\\reloadStartSound\\reloadStartSoundPlayer\\reloadEndSound\\reloadEndSoundPlayer\\altSwitchSound\\altSwitchSoundPlayer\\raiseSound\weap_raise\raiseSoundPlayer\weap_raise_plr\firstRaiseSound\weap_raise\firstRaiseSoundPlayer\weap_raise_plr\putawaySound\weap_putaway\putawaySoundPlayer\weap_putaway_plr\nightVisionWearSound\\nightVisionWearSoundPlayer\\nightVisionRemoveSound\\nightVisionRemoveSoundPlayer\\standMountedWeapdef\\crouchMountedWeapdef\\proneMountedWeapdef\\mountedModel\\hudIcon\hud_icon_30cal\killIcon\hud_icon_30cal\dpadIcon\\ammoCounterIcon\\hudIconRatio\4:1\killIconRatio\4:1\dpadIconRatio\4:1\ammoCounterIconRatio\4:1\ammoCounterClip\Beltfed\flipKillIcon\1\fireRumble\defaultweapon_fire\meleeImpactRumble\defaultweapon_melee\adsDofStart\0\adsDofEnd\7.5\hideTags\\notetrackSoundMap\gr_30cal_start_plr gr_30cal_start_plr
gr_30cal_open_plr gr_30cal_open_plr
gr_30cal_grab_belt_plr gr_30cal_grab_belt_plr
gr_30cal_belt_remove_plr gr_30cal_belt_remove_plr
gr_30cal_belt_raise_plr gr_30cal_belt_raise_plr
gr_30cal_belt_contact_plr gr_30cal_belt_contact_plr
gr_30cal_belt_press_plr gr_30cal_belt_press_plr
gr_30cal_close_plr gr_30cal_close_plr
gr_30cal_charge_plr gr_30cal_charge_plr
gr_30cal_ammo_toss_plr gr_30cal_ammo_toss_plr
gr_30cal_charge_release_plr gr_30cal_charge_release_plr
gr_30cal_lid_bonk_plr gr_30cal_lid_bonk_plr
knife_stab_plr knife_stab_plr
knife_pull_plr knife_pull_plr
Knife_slash_plr Knife_slash_plr
gr_mg_deploy_start gr_mg_deploy_start
gr_mg_deploy_end gr_mg_deploy_end
gr_mg_break_down gr_mg_break_down
gr_30cal_tap_plr gr_30cal_tap_plr
Any help is appreciated.
Instead of searching line by line, you can search the entire file at once. I have included a code example below, which searches one file for multiple keywords and prints the keywords found.
keywords = ["gr_30cal_open_plr", "gr_mg_deploy_end", "wontfindthis"]
with open("test.txt") as f:
    contents = f.read()
# Search the file for each keyword.
keywords_found = {keyword for keyword in keywords if keyword in contents}
if keywords_found:
    print("This file contains the following keywords:")
    print(keywords_found)
else:
    print("This file did not contain any keywords.")
I'll explain the code. f.read() will read the file contents. Then I use a set comprehension to get all of the keywords found in the file. I use a set because that will keep only the unique keywords -- I assume you don't need to know how many times a keyword appears in the file. (A set comprehension is similar to a list comprehension, but it creates a set.) Testing whether the keyword is in the file is as easy as keyword in contents.
I used your sample file contents and duplicated it multiple times so the file contained 45,252,362 lines (1.8 GB). And my code above took less than 1 second.
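For your use case -- where you also need to know which file matched -- the same whole-file read can be combined with compiling the word-boundary regex once, outside the loop, instead of once per matching line. A sketch, assuming sAlias, weapon_files, WAW_ROOT_DIR and called_file_ are defined as in your code:
import re

pattern = re.compile(fr'\b{sAlias}\b')  # compile once, reuse for every file
for file in weapon_files:
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        w = weapon_file.read()
    # cheap substring test first; exact word-boundary match only if it hits
    if sAlias in w and pattern.search(w):
        called_file_.append(f"root{file[len(WAW_ROOT_DIR):]}")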
Well, you can use multiprocessing to speed up your work. I don't think it is the best way, but I am sharing the code so you can try it for yourself and see if it works for you or not.
import multiprocessing
import re

def process(file):
    # Appending to a shared list from inside a worker would be lost --
    # each worker runs in its own process -- so return the match instead.
    with open(file, 'r', encoding="ISO-8859-1") as weapon_file:
        for m in weapon_file:
            if sAlias in m:
                t = re.compile(fr'\b{sAlias}\b')
                if t.search(m):
                    return f"root{file[len(WAW_ROOT_DIR):]}"
    return None

if __name__ == '__main__':
    p = multiprocessing.Pool()
    # now launch a process for each file.
    # The pool runs approximately one worker per available CPU core.
    async_results = [p.apply_async(process, [file]) for file in weapon_files]
    p.close()
    p.join()  # Wait for all child processes to finish.
    results = [r.get() for r in async_results]
    called_file_ = [path for path in results if path is not None]
hope it helps.

Fast text use (getting it up to compare word vectors)

I am a little ashamed that I have to ask this question, because I feel like I should know this. I haven't been programming long, but I am trying to apply what I learn to a project I'm working on, and that is how I got to this question. fastText has a library of words and associated vectors (https://fasttext.cc/docs/en/english-vectors.html). It is used to find the vector of a word. I just want to look a word or two up and see what the result is, in order to see if it is useful for my project. They have provided a list of vectors and then a small code chunk, and I cannot make heads or tails of it. Some of it I get, but I do not see a print function -- is it returning the data to a different part of your own code? I am also not sure where the chunk of code opens the data file; usually fname is a handle, right? Or are they expecting you to type your file's path there? I am also not familiar with io; I googled the word but didn't find anything useful. Is this something I need to download, or is it already a part of Python? I know I might be a little out of my league, but I learn best by doing, so please don't hate on me.
import io

def load_vectors(fname):
    fin = io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore')
    n, d = map(int, fin.readline().split())
    data = {}
    for line in fin:
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = map(float, tokens[1:])
    return data
Try the following:
my_file_name = 'C:/path/to/file.txt' # Use the path to your file of rows of sentences
my_data = load_vectors(my_file_name) # Function will return data
print(my_data) # To see the output
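Two notes on the parts you asked about: io is part of the Python standard library, so there is nothing to download (in Python 3, io.open is the same function as the built-in open). Also, because load_vectors stores each vector as a lazy map object, print(my_data) will show entries like <map object at 0x...>; wrap a value in list(...) to see the numbers, e.g. (with 'word' standing in for any word that appears in your file):
print(list(my_data['word']))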

Labelling and Grouping Postcodes using Python

I'm fairly new to Python and I am attempting to group various postcodes together under predefined labels. For example, "SA31" would be labelled a "HywelDDAPostcode".
I have some code where I read lots of postcodes from a single-column file into a list and compare them with postcodes that are in predefined lists. However, when I output my postcode labels, only the label "UKPostcodes" is output for every postcode in my original file. It would appear that the first two conditions in my code always evaluate to false, no matter what. Am I doing the right thing using "in"? Or perhaps it's a file-reading issue? I'm not sure.
The input file is simply a single-column CSV containing a list of postcodes (in reality it has thousands of rows).
Here is my code:
import csv
with open('postcodes.csv', newline='') as f:
    reader = csv.reader(f)
    your_list = list(reader)
my_list = []
HywelDDAPostcodes=["SA46","SY23","SY24","SA18","SA16","SA43","SA31","SA65","SA61","SA62","SA17","SA48","SA40","SA19","SA20","SA44","SA15","SA14","SA73","SA32","SA67","SA45",
"SA38","SA42","SA41","SA72","SA71","SA69","SA68","SA33","SA70","SY25","SA34","LL40","LL42","LL36","SY18","SY17","SY20","SY16","LD6"]
NationalPostcodes=["LL58","LL59","LL60","LL61","LL62","LL63","LL64","LL65","LL66","LL67","LL68","LL69","LL70","LL71","LL72","LL73","LL74","LL75","LL76","LL77","LL78",
"NP1","NP2","NP23","NP3","CF31","CF32","CF33","CF34","CF35","CF36","CF3","CF46","CF81","CF82","CF83","SA35","SA39","SA4","SA47","LL16","LL18","LL21","LL22","LL24","LL25","LL26","LL27","LL28","LL29","LL30","LL31","LL32","LL33","LL34","LL57","CH7","LL11","LL15","LL16","LL17","LL18","LL19","LL20","LL21","LL22","CH1","CH4","CH5","CH6","CH7","LL12","CF1","CF32","CF35","CF5","CF61","CF62","CF63","CF64","CF71","LL23","LL37","LL38","LL39","LL41","LL43","LL44","LL45","LL46","LL47","LL48","LL49","LL51","LL52","LL53","LL54","LL55","LL56","LL57","CF46","CF47","CF48","NP4","NP5","NP6","NP7","SA10","SA11","SA12","SA13","SA8","CF3","NP10","NP19","NP20","NP9","SA36","SA37","SA63","SA64","SA66","CF44","CF48","HR3","HR5","LD1","LD2","LD3","LD4","LD5","LD7","LD8","NP8","SY10","SY15","SY19","SY21","SY22","SY5","CF37","CF38","CF39","CF4","CF40","CF41","CF42","CF43","CF45","CF72","SA1","SA2","SA3","SA4","SA5","SA6","SA7","SA1","NP4","NP44","NP6","LL13","LL14","SY13","SY14"]
NationalPostcodes2 = list(dict.fromkeys(NationalPostcodes))
labels = ["HywelDDA","NationalPostcodes","UKPostcodes"]
for postcode in your_list:
    #print(postcode)
    if postcode in HywelDDAPostcodes:
        my_list.append(labels[0])
    if postcode in NationalPostcodes2:
        my_list.append(labels[1])
    else:
        my_list.append(labels[2])
with open('DiscretisedPostcodes.csv', 'w') as result_file:
    wr = csv.writer(result_file, dialect='excel')
    for item in my_list:
        wr.writerow([item])
If anyone has any advice as to what could be causing the issue or just any advice surrounding Python, in general, I would very much appreciate it. Thank you!
The reason your comparison block isn't working is that when you use csv.reader to read your file, each line is added to your_list as a list. So you are making a list of lists, and when you compare those things they don't match:
['LL58'] == 'LL58' # fails
So, inspect your_list and see what I mean. You should create an empty your_list before you read the file and append each reading to it as a plain string. Then inspect that to make sure it looks good. It would also behoove you to use the strip() method to remove whitespace from each item; I can't recall whether csv.reader does that automatically.
Also... a better structure for testing membership is a set instead of a list. in will work for lists, but it is MUCH faster for sets, so I would put your comparison items into sets.
Lastly, it isn't clear what you are trying to do with NationalPostcodes2. Just use your NationalPostcodes, but put them in a set with {}.
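A minimal sketch of those suggestions, using the lists already defined in your code (flatten the reader's rows into plain strings, then test membership against sets):
import csv

with open('postcodes.csv', newline='') as f:
    # take the first cell of each row and strip stray whitespace
    your_list = [row[0].strip() for row in csv.reader(f) if row]

hywel_dda = set(HywelDDAPostcodes)
national = set(NationalPostcodes)  # a set removes duplicates by itself, so NationalPostcodes2 is unnecessary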
@Jeff H's answer is correct, but for what it's worth, here's how I might write this code (untested):
# Note: Since, as you wrote, these are only single-column files I did not use the csv
# module, as it will just add additional unnecessary overhead.
# Read the known data from files--this will always be more flexible and maintainable than
# hard-coding them in your code. This is just one possible scheme for doing this; e.g.
# you could also put all of them into a single JSON file
standard_postcode_files = {
    'HywelDDA': 'hyweldda.csv',
    'NationalPostcodes': 'nationalpostcodes.csv',
    'UKPostcodes': 'ukpostcodes.csv'
}

def read_postcode_file(filename):
    with open(filename) as f:
        # exclude blank lines and strip additional whitespace
        return [line.strip() for line in f if line.strip()]

standard_postcodes = {}
for key, filename in standard_postcode_files.items():
    standard_postcodes[key] = set(read_postcode_file(filename))

# Assuming all post codes are unique to a set, map postcodes to the set they belong to
postcodes_reversed = {v: k for k, s in standard_postcodes.items() for v in s}
your_postcodes = read_postcode_file('postcodes.csv')
labels = [postcodes_reversed[code] for code in your_postcodes]
with open('DiscretisedPostCodes.csv', 'w') as f:
    for label in labels:
        f.write(label + '\n')
I would probably do other things like not make the input filename hard-coded. If you need to work with multiple columns using the csv module would also be fine with minimal additional changes, but since you're just writing one item per line I figured it was unnecessary.

How do I write a simple, Python parsing script?

Most of what I do involves writing simple parsing scripts that reads search terms from one file and searches, line by line, another file. Once the search term is found, the line and sometimes the following line are written to another output file. The code I use is rudimentary and likely crude.
#!/usr/bin/env python
data = open("data.txt", "r")
search_terms = ids.read().splitlines()
data.close()
db = open("db.txt", "r")
output = open("output.txt", "w")
for term in search_terms:
    for line in db:
        if line.find(term) > -1:
            next_line = db.next()
            output.write(">" + head + "\n" + next_line)
            print("Found %s" % term)
There are a few problems here. First, I don't think searching line by line is the most efficient or fastest way, but I'm not exactly sure about that. Second, I often run into issues with cursor placement: the cursor doesn't reset to the beginning of the file when the search term is found. Third, while I am usually confident that all of the terms can be found in the db, there are rare times when I can't be sure, so I would like to write to another file whenever the script iterates through the entire db and can't find a term. I've tried adding a snippet that counts the number of lines of the db, so that if the find() function gets to the last line and the term isn't found it outputs to another "not found" file, but I haven't been able to get my elif and else branches right.
Overall, I'd just like any hints or corrections that could make this sort of script more efficient and robust.
Thanks.
Unless it's a really big file, why not iterate line by line? If the input file's size is some significant portion of your machine's available resources (memory), then you might want to look into buffered input and other, more low-level abstractions of what the computer is doing. But if you're talking about a few hundred MB or less on a relatively modern machine, let the computer do the computing ;)
Off the bat you might want to get into the habit of using the built-in context manager with. For instance, in your snippet, you don't have a call to output.close().
with open('data.txt', 'r') as f_in:
    search_terms = f_in.read().splitlines()
Now search_terms is a list that has each line from data.txt as a string (but with the newline characters removed). And data.txt is closed thanks to with.
In fact, I would do that with the db.txt file, also.
with open('db.txt', 'r') as f_in:
    lines = f_in.read().splitlines()
Context managers are cool.
As a side note, you could open your destination file now, and do your parsing and results-tracking with it open the whole time, but I like leaving as many files closed as possible for as long as possible.
I would suggest putting the biggest object on the outside of your loop, which I'm guessing is the db.txt contents. The outermost loop usually only gets iterated once, so you might as well put the biggest thing there.
results = []
for i, line in enumerate(lines):
    for term in search_terms:
        if term in line:
            # Use something not likely to appear in your line as a separator
            # for these "second lines". I used three pipe characters, but
            # you could just as easily use something even more random.
            # (Note: this assumes a matching line is never the very last
            # line, or lines[i+1] would raise an IndexError.)
            results.append('{}|||{}'.format(line, lines[i+1]))
if results:
    with open('output.txt', 'w') as f_out:
        for result in results:
            # Don't forget to replace your custom field separator
            f_out.write('> {}\n'.format(result.replace('|||', '\n')))
else:
    with open('no_results.txt', 'w') as f_out:
        # This will write an empty file to disk
        pass
The nice thing about this approach is that each line in db.txt is checked once for each search_term in search_terms. However, the downside is that any line will be recorded once for each search term it contains, i.e., if it has three search terms in it, that line will appear in your output.txt three times.
And all the files are magically closed.
Context managers are cool.
Good luck!
search_terms keeps the whole of data.txt in memory. That's not good in general, but in this case it's not too bad.
Searching line by line is not efficient, but if the case is simple and the files are not too big, it's not a big deal. If you want more efficiency, you should sort the data.txt file and put it into some tree-like structure; it depends on the data inside.
You have to use seek to move the pointer back after reading ahead with next.
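For instance, a tiny illustration (assuming db = open('db.txt'); note that you should read ahead with readline() here, since mixing next() with tell() is disallowed on text files):
pos = db.tell()       # remember the current offset
peek = db.readline()  # read ahead one line
db.seek(pos)          # jump back; the next readline() returns the same line again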
Probably the easiest way here is to generate two lists of lines and search using in, like:
db = open('db.txt').readlines()
db_words = [x.split() for x in db]
data = open('data.txt').read().splitlines()
print('Lines in db {}'.format(len(db)))
for item in data:
    for words in db_words:
        if item in words:
            print("Found {}".format(item))
Your key issue is that you may be looping in the wrong order -- in your code as posted, you'll always exhaust the db looking for the first term, so after the first pass of the outer for loop db will be at end, no more lines to read, no other term will ever be found.
Other improvements include using the with statement to guarantee file closure, and a set to track which search terms were not found. (There are also typos in your posted code, such as opening a file as data but then reading it as ids).
So, for example, something like:
with open("data.txt", "r") as data:
search_terms = data.read().splitlines()
missing_terms = set(search_terms)
with open("db.txt", "r") as db, open("output.txt", "w") as output:
for line in db:
for term in search_terms:
if term in line:
missing_terms.discard(term)
next_line = db.next()
output.write(">" + head + "\n" + next_line)
print("Found {}".format(term))
break
if missing_terms:
    diagnose_not_found(missing_terms)
where the diagnose_not_found function does whatever you need to do to warn the user about missing terms.
There are assumptions embedded here, such as the fact that you don't care if some other search term is present in a line where you've found a previous one, or in the very next line; they might take substantial work to fix if not applicable, and that would require you to edit your Q with a very complete and unambiguous list of specifications.
If your db is actually small enough to comfortably fit in memory, slurping it all in as a list of lines once and for all would allow easier accommodation for more demanding specs (in that case you can easily go back and forth, while iterating on a file means you can only go forward one line at a time). So, if your specs are indeed more demanding, please also clarify whether this crucial condition holds, or whether you need this script to process potentially humongous db files (say gigabyte-plus sizes, so as to not "comfortably fit in memory", depending on your platform, of course).
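For example, a minimal sketch of that slurp-it-all variant (assuming search_terms and missing_terms are built as above):
with open("db.txt") as db:
    db_lines = db.read().splitlines()
with open("output.txt", "w") as output:
    for i, line in enumerate(db_lines):
        for term in search_terms:
            if term in line:
                missing_terms.discard(term)
                # the "following line", when there is one
                next_line = db_lines[i + 1] if i + 1 < len(db_lines) else ""
                output.write(">" + line + "\n" + next_line + "\n")
                break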

Update strings in a text file at a specific location

I would like to find a better solution to achieve the following three steps:
read strings at a given row
update strings
write the updated strings back
Below is my code, which works, but I am wondering whether there are any better (simpler) solutions.
new = '99999'
f = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+')
lines = f.readlines()
# the row number we want to update is given, so just load the content
x = lines[95]
print(x)
f.close()
# replace
f1 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP')
con = f1.read()
print(con)
con1 = con.replace(x[2:8], new)  # only certain columns in this row need to be updated
print(con1)
f1.close()
# write
f2 = open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'w')
f2.write(con1)
f2.close()
Thanks!
UPDATE: I got an idea from jtmoulia; this time it becomes easier:
def replace_line(file_name, line_num, col_s, col_e, text):
    lines = open(file_name, 'r').readlines()
    temp = lines[line_num]
    temp = temp.replace(temp[col_s:col_e], text)
    lines[line_num] = temp
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
The problem with textual data, even when tabulated, is that the byte offsets are not predictable. For example, when representing numbers with strings you have one byte per digit, whereas when using binary (e.g. two's complement) you always need four or eight bytes, whether the integer is small or large.
Nevertheless, if your text format is strict enough that you can get along by replacing bytes without changing the size of the file, you can try the standard mmap module. With it, you'll be able to treat the file as a mutable byte string, modify parts of it in place, and let the kernel do the file saving for you.
Otherwise, any of the other answers is much better suited to the problem.
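A minimal sketch of the mmap approach, assuming fixed-width lines so that byte offsets are predictable (LINE_WIDTH and the row/column numbers are stand-ins for your actual format):
import mmap

LINE_WIDTH = 80  # assumed fixed record width in bytes, newline included
with open('MS1Ctt-P-temp.INP', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        offset = 95 * LINE_WIDTH + 2      # row 95, starting at column 2
        mm[offset:offset + 5] = b'99999'  # replacement must be exactly the same length
        mm.flush()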
Well, to begin with you don't need to keep reopening and reading from the file every time. The r+ mode allows you to read and write to the given file.
Perhaps something like
with open('C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP', 'r+') as f:
    lines = f.readlines()
    # ... perform whatever replacement you'd like on lines
    f.seek(0)
    f.writelines(lines)
Also see: Editing specific line in text file in python
When I had to do something similar (for a Webmin customization), I did it entirely in PERL because that's what the Webmin framework used, and I found it quite easy. I assume (but don't know for sure) there are equivalent things in Python. First read the entire file into memory all at once (the PERL way to do this is probably called "slurp"). (This idea of holding the entire file in memory rather than just one line used to make little sense {or even be impossible}. But these days RAM is so large it's the only way to go.) Then use the split operator to divide the file into lines and put each line in a different element of a giant array. You can then use the desired line number as an index into the array (remember array indices usually start with 0). Finally, use "regular expression" processing to change the text of the line. Then change another line, and another, and another (or make another change to the same line). When you're all done, use join to put all the lines in the array back together into one giant string. Then write the whole modified file out.
While I don't have the complete code handy, here's an approximate fragment of some of the PERL code so you can see what I mean:
our @filelines = ();
our $lineno = 43;
our $oldstring = 'foobar';
our $newstring = 'fee fie fo fum';
$filelines[$lineno-1] =~ s/$oldstring/$newstring/ig;
# "ig" modifiers for case-insensitivity and possible multiple occurrences in the line
# use different modifiers at the end of the s/// construct as needed
FILENAME = 'C:/Users/th/Dropbox/com/MS1Ctt-P-temp.INP'
lines = list(open(FILENAME))
# strings are immutable, so rebuild the line rather than assigning to a slice
lines[95] = lines[95][:2] + '99999' + lines[95][8:]
open(FILENAME, 'w').write(''.join(lines))
