Random (well, apparently) "IndexError: list index out of range" - python

I am trying to write a program that parses all the XML files in a single directory. The code seems to work, but the behavior is inconsistent: sometimes a file is parsed without any trouble (when it is alone, or when it is the first one to be parsed), and sometimes parsing the same file raises "IndexError: list index out of range".
    from xml.dom.minidom import parse, parseString
    import os
    liste=open('oup_list_hybrid.txt','a')
    for r,d,f in os.walk('C:/Users/bober/Documents/Analyse_citation_crossref/'):
        for files in f:
            if files.endswith(".xml"):
                print files
                dom=parse(files)
                for element in dom.getElementsByTagName('record'):
                    rights = element.getElementsByTagName('dc:rights')
                    doi = element.getElementsByTagName('dc:identifier')
                    date= element.getElementsByTagName('dc:date')
                    try:
                        valeurrights=rights[0].firstChild.nodeValue
                        valeurdoi=doi[1].firstChild.nodeValue
                        valeurdate=date[0].firstChild.nodeValue
                        resultat=valeurrights+';'+valeurdoi+';'+valeurdate+'\n'
                        liste.write(resultat)
                    except IndexError:
                        print 'pb avec'+files
                        continue
        break
    liste.close()
What am I doing wrong here?
Thanks in advance for any help!

Are you sure that rights, doi and date actually contain anything? If getElementsByTagName doesn't find any matching elements, these lists will be empty.
doi may also contain only one element, while you're trying to access the second one with doi[1].
Long story short: check that your lists actually contain enough elements before indexing them, or wrap the access in a try/except.
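A minimal sketch of the "check before indexing" advice, written for Python 3. The sample XML, the namespace URI and the helper names text_at and extract_rows are made up for illustration; only the tag names come from the question.

```python
from xml.dom.minidom import parseString

# Hypothetical sample: the second record has no dc:rights and only one dc:identifier.
SAMPLE = """<records xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record>
    <dc:rights>open</dc:rights>
    <dc:identifier>id-1</dc:identifier>
    <dc:identifier>10.1000/example.1</dc:identifier>
    <dc:date>2012-01-01</dc:date>
  </record>
  <record>
    <dc:identifier>id-2</dc:identifier>
    <dc:date>2012-02-02</dc:date>
  </record>
</records>"""


def text_at(nodes, index):
    """Return the text of nodes[index], or None if it is missing or empty."""
    if len(nodes) <= index or nodes[index].firstChild is None:
        return None
    return nodes[index].firstChild.nodeValue


def extract_rows(xml_text):
    """Return (rows, skipped): one 'rights;doi;date' string per complete record."""
    rows, skipped = [], 0
    dom = parseString(xml_text)
    for record in dom.getElementsByTagName('record'):
        values = (
            text_at(record.getElementsByTagName('dc:rights'), 0),
            text_at(record.getElementsByTagName('dc:identifier'), 1),
            text_at(record.getElementsByTagName('dc:date'), 0),
        )
        if None in values:
            # Incomplete record: count it instead of raising IndexError.
            skipped += 1
            continue
        rows.append(';'.join(values))
    return rows, skipped


rows, skipped = extract_rows(SAMPLE)
print(rows, skipped)
```

Counting the skipped records makes it easy to see whether the "random" failures are really just records that lack one of the three fields.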

Related

Writing a script to find certain lines/string in multiple documents

I have a folder with multiple files (.doc and .docx). For the sake of this question I want to deal primarily with the .doc files, unless both file types can be accounted for in the code.
I'm writing code to read the folder and identify the .doc files. The objective is to output paragraphs 3, 4, and 7. I'm not sure why, but Python reads each paragraph from a different spot in each file. I suspect there are spacing/formatting inconsistencies that I wasn't aware of initially. To work around the formatting issue, I was thinking I could specify the strings I want output instead, but I'm not sure how to do that. I tried putting a string in the code, but that didn't work.
How can I modify my code to be able to account for finding the strings that I want?
Original code
    import glob
    import docx
    doc = ''
    for file in glob.glob(r'folderpathway*.docx'):
        doc = docx.Document(file)
        print (doc.paragraphs[3].text)
        print (doc.paragraphs[4].text)
        print (doc.paragraphs[7].text)
Code to account for the formatting issues
    doc = ''
    for file in glob.glob(r'folderpathway*.docx'):
        doc = docx.Document(file)
        print (doc.paragraphs["Substance Number"].text)
This fails with:
    TypeError: list indices must be integers or slices, not str
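The TypeError arises because doc.paragraphs is a list, and a list can only be indexed by position. One possible workaround, sketched here under assumptions, is to loop over the paragraph texts and keep those that contain the wanted string. The helper name paragraphs_containing and the sample texts are hypothetical; the sample list stands in for [p.text for p in doc.paragraphs].

```python
def paragraphs_containing(paragraph_texts, needle):
    """Return (index, text) pairs for the paragraphs that contain `needle`."""
    return [
        (index, text)
        for index, text in enumerate(paragraph_texts)
        if needle in text
    ]


# Stand-in for [p.text for p in doc.paragraphs]; the contents are made up.
sample_texts = [
    "Report",
    "",
    "Substance Number: 12345",
    "Prepared by: J. Doe",
]

matches = paragraphs_containing(sample_texts, "Substance Number")
print(matches)
```

Because the match is by content rather than position, extra blank paragraphs or formatting differences between files no longer shift the result.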

How would I extract only specific text from this webpage?

I am looking for ways to take this line of code:
{"id":"76561198170104957","names":[{"name":"Mountain Dew"},{"name":"Sugardust"}],"kills":2394,"deaths":2617,"ff_kills":89,"ff_deaths":110,"playtime":"P5DT3H45M18S"}
and extract ONLY the "kills", "deaths", "ff_kills", and "ff_deaths" keys and their associated numbers into a list. The line varies in length depending on the user, so I don't think a static index will work. The data is read from a webpage, if that opens up any possibilities. Thanks.
That format is called JSON. You can easily parse it with Python. Example:
    import json
    line = r'{"id":"76561198170104957","names":[{"name":"Mountain Dew"},{"name":"Sugardust"}],"kills":2394,"deaths":2617,"ff_kills":89,"ff_deaths":110,"playtime":"P5DT3H45M18S"}'
    j = json.loads(line)
    print(j['kills'])
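Building on that, a short sketch of collecting only the four requested fields into a list. The names wanted and stats are chosen for illustration, and keys missing from a given user's record are simply skipped.

```python
import json

line = '{"id":"76561198170104957","names":[{"name":"Mountain Dew"},{"name":"Sugardust"}],"kills":2394,"deaths":2617,"ff_kills":89,"ff_deaths":110,"playtime":"P5DT3H45M18S"}'
data = json.loads(line)

# Look the fields up by key, so the length of the "names" list does not matter.
wanted = ["kills", "deaths", "ff_kills", "ff_deaths"]
stats = [(key, data[key]) for key in wanted if key in data]
print(stats)
```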

Access last string after split function to create new list

I am a beginner in Python and I have been working on code that accesses two types of files (dcd and inp files), combines them and creates a new list with the matching strings.
I got stuck somewhere at the beginning. I want to get all the dcd files. They all have the .dcd extension, but the first part of the name differs, so I was wondering whether there is a way to access them after I have split the string.
    #collect all dcd files into a list
    list1 = []
    for filename1 in glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'):
        filename1 = filename1.split('/')
        filename1.sort()
        list1.append(filename1)
I want to get only the names with the .dcd extension, which are at index [5], and either create a new list or mutate this one, but I am not sure how to do that.
P.S. I have only posted the first part of the code.
Thank you!
[Screenshots omitted: the oddly sorted output; a better-looking output; and the desired result, which should be sorted and without the eq* files.]
Just use sorted with a sort key: os.path.basename extracts only the base name of each file, so the sort is performed on that:
    import os, glob
    list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)
So this worked. I just added del filename1[:5] to get rid of the other, unnecessary parts of the string:
    import os, glob
    list1 = sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'), key = os.path.basename)
    for filename1 in sorted(glob.glob('*/FEP_SYAF014*/FEP1/298/*/*.dcd'),key = os.path.basename):
        filename1 = filename1.split('/')
        filename1.sort()
        list1.append(filename1)
        del filename1[:5]
        print filename1
Your sort is applied to the parts of each file name. That is not what you want: if I understand correctly, you want to sort the list of file names, not the parts of each name.
The code given by Jean François is great, but I guess you'd like to get your own code working.
You need to extract the file name by keeping only the last part of the split.
A split returns a list of strings; each item is one part of the original.
    filename = filename.split('/')[-1]
This line gets you the last part of the split.
Then you can add that part to your list.
And after all that, you can sort your list.
Hope this helps!
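The steps above can be sketched as follows, in Python 3. The paths are hypothetical stand-ins for what glob.glob would return, and skipping names that start with "eq" is an assumption based on the asker's wish to leave out the eq* files.

```python
# Hypothetical paths standing in for the glob.glob(...) results.
paths = [
    "run1/FEP_SYAF014_a/FEP1/298/0.2/md_0002.dcd",
    "run1/FEP_SYAF014_a/FEP1/298/0.1/eq1.dcd",
    "run1/FEP_SYAF014_a/FEP1/298/0.1/md_0001.dcd",
]

names = []
for path in paths:
    name = path.split('/')[-1]    # keep only the last part of the split
    if name.startswith('eq'):     # assumption: eq* files are not wanted
        continue
    names.append(name)

names.sort()                      # sort the list of names, not the parts of one name
print(names)
```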

Python: Using str.split and getting list index out of range

I just started using Python and am trying to convert some of my R code into Python. The task is relatively simple: I have many csv files, each with variable names (in this case cell lines) and values (IC50s). I need to pull out all the variables, and their values, that are shared among all the files. Some of these files share the same variables but format them differently. For example, in some files a variable is just "Cell_line" and in others it is "MEL:Cell_line". So, first things first: to make a direct string comparison I need to format them the same way, and hence I am trying to use str.split() to do so. There is probably a much better way to do this, but for now I am using the following code:
    import csv
    import os
    # Change working directory
    os.chdir("/Users/joshuamannheimer/downloads")
    file_name="NCI60_Bleomycin.csv"
    with open(file_name) as csvfile:
        NCI_data=csv.reader(csvfile, delimiter=',')
        alldata={}
        for row in NCI_data:
            name_str=row[0]
            splt=name_str.split(':')
            n_name=splt[1]
            alldata[n_name]=row
name_str.split returns a list of length 2. Since the portion I want is after the ":", I want the second element, which should be splt[1], as splt[0] is the first element in Python. However, when I run the code I get the error message "IndexError: list index out of range".
I'm trying to get the second element of a list of length 2, so I have no idea why it is out of range. Any help or suggestions would be appreciated.
I am pretty sure that there are some rows where name_str does not contain a ":". From your own example, if name_str is Cell_line, the split returns a one-element list and splt[1] fails.
If you are sure that name_str contains at most one ":", or if there can be several and you want the part after the last one, use splt[-1] instead of splt[1]. Index -1 takes the last element of the list (unless the list is empty).
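A small sketch of the splt[-1] suggestion. The function name cell_line_name is made up for illustration; the two inputs are the formats mentioned in the question.

```python
def cell_line_name(name_str):
    """Return the part after the last ':' (or the whole string if there is none)."""
    return name_str.split(':')[-1]


print(cell_line_name("MEL:Cell_line"))
print(cell_line_name("Cell_line"))
```

Both calls give the same name, so the two formats can be compared directly.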
The simple answer is that the data sometimes doesn't follow the specification you assumed when you wrote this code (i.e. that there is a colon and two fields).
The easiest way to deal with this is to add an if block, if len(splt)==2:, and do the subsequent lines within that block.
Optionally, add an else: and print the lines that are not to spec, or save them somewhere so you can diagnose the problem.
Like this:
    import csv
    import os
    # Change working directory
    os.chdir("/Users/joshuamannheimer/downloads")
    file_name="NCI60_Bleomycin.csv"
    with open(file_name) as csvfile:
        NCI_data=csv.reader(csvfile, delimiter=',')
        alldata={}
        for row in NCI_data:
            name_str=row[0]
            splt=name_str.split(':')
            if len(splt)==2:
                n_name=splt[1]
                alldata[n_name]=row
            else:
                print "invalid name: "+name_str
Alternatively, you can use try/except, which in this case is a bit more robust: a single exception handler catches an IndexError raised anywhere, whether in row[0] or in splt[1], and we don't have to require that the ":" split yields exactly 2 fields.
In addition, we can explicitly check that there actually is a ":" before splitting, and assign the name accordingly.
    import csv
    import os
    # Change working directory
    os.chdir("/Users/joshuamannheimer/downloads")
    file_name="NCI60_Bleomycin.csv"
    with open(file_name) as csvfile:
        NCI_data=csv.reader(csvfile, delimiter=',')
        alldata={}
        for row in NCI_data:
            try:
                name_str=row[0]
                if ':' in name_str:
                    splt=name_str.split(':')
                    n_name=splt[1]
                else:
                    n_name = name_str
                alldata[n_name]=row
            except IndexError:
                print "bad row:"+str(row)

Random.choice pulling entire list, instead of single items

I am (continually) working on a project with gmail and imaplib. I am searching gmail for emails that contain a specific word, and receiving a list of unique ids (most of my code is based off of/is Doug Hellman's, from his excellent imaplib tutorial). Those unique ids are saved to a list. I am trying to pull a single id from the list, but random.choice keeps pulling the entire list. Here is the code:
    import imaplib
    import imaplib_connect
    import random
    from imaplib_list_parse import parse_list_response
    c = imaplib_connect.open_connection()
    msg_ids = []
    c.select('[Gmail]/Chats', readonly=True)
    typ, msg_ids = c.search(None, '(BODY "friend")')
    random_id = random.choice(msg_ids)
    print random_id
I've messed around in the interpreter and msg_ids is definitely a list. I've also attempted to pull specific elements of the list (e.g. msg_ids[1]), but it says "IndexError: list index out of range", which I understand to mean "the thing you are looking for isn't there". That is confusing, because it is there.
Is there any time when a list is not a list? Or something? I'm confused.
As always, I appreciate any feedback the wonderful people of Stack Overflow could give :)
I think msg_ids is a list with a single element, something like ['1 2 3 5']: imaplib's search returns all the matching ids as one space-separated string inside a list. So when you ask for msg_ids[1], it raises IndexError, and since there is only one element in the list, random.choice() always returns that element. You can try:
    print random.choice(msg_ids[0].split())
To debug this kind of thing, print msg_ids, or step through your code interactively in IPython to find the problem.
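A minimal sketch of that debugging step in Python 3, without a mail server. The search result is simulated and its values are made up; it assumes the shape described in the imaplib documentation, a one-element list whose single item holds all the ids separated by spaces (bytes in Python 3).

```python
import random

# Simulated result of c.search(...): a status string plus a one-element list.
typ, msg_ids = 'OK', [b'1 2 3 5']

print(len(msg_ids))                  # the outer list has a single element
all_ids = msg_ids[0].split()         # split that element into individual ids
random_id = random.choice(all_ids)   # now a single id is picked
print(all_ids, random_id)
```

Checking len(msg_ids) first shows why msg_ids[1] is out of range even though the ids are visibly "there".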
