Reading and Grouping a List of Data in Python

I have been struggling with managing some data. I have turned the data into a list of lists; each basic sublist has a structure like the following:
<1x>begins
<2x>value-1
<3x>value-2
<4x>value-3
some indeterminate number of other values
<1y>next observation begins
<2y>value-1
<3y>value-2
<4y>value-3
some indeterminate number of other values
this continues for an indeterminate number of times in each sublist
EDIT: I need to get all the occurrences of <2, <3 and <4 separated out and grouped together. I am creating a new list of lists: [[<2x>value-1, <3x>value-2, <4x>value-3], [<2y>value-1, <3y>value-2, <4y>value-3]]
EDIT: All of the lines that follow <4x> and <4y> (and, for that matter, <4anyalpha>) have the same type of coding, and I don't know a priori how high the numbers can go. Just think of these as SGML tags that are not closed; I used numbers because my fingers were hurting from all the coding I have been doing today.
The solution I have finally come up with is not very pretty:
listINeed=[]
for sublist in biglist:
    for line in sublist:
        if '<2' in line:
            var2=line
        if '<3' in line:
            var3=line
        if '<4' in line:
            var4=line
            # a <4 line completes one observation
            templist=[]
            templist.append(var2)
            templist.append(var3)
            templist.append(var4)
            listINeed.append(templist)
            templist=[]
            var4=var2=var3=''
I have looked at ways to clean this up but have not been successful. This works fine; I just saw it as another opportunity to learn more about Python, because I would think this should be processable by a one-line function.

itertools.groupby() can get you by.
itertools.groupby(biglist, operator.itemgetter(2))

If you want to pick out the second, third, and fourth elements of each sublist, this should work:
listINeed = [sublist[1:4] for sublist in biglist]
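A fixed slice only works if the <2, <3 and <4 lines always sit at the same positions. Here is a minimal sketch that instead opens a new group at each <1 line and keeps only the wanted tags. It assumes every tag has the shape "<digits + one letter>"; the helper name group_tagged_lines and its wanted parameter are made up for illustration and do not come from the question.

```python
import re

# Assumed tag shape: "<", one or more digits, one letter, ">" (e.g. "<2x>").
TAG = re.compile(r'^<(\d+)[a-zA-Z]>')

def group_tagged_lines(lines, wanted=(2, 3, 4)):
    """Return one list per observation, keeping only lines whose tag number is in `wanted`."""
    groups = []
    current = None
    for line in lines:
        m = TAG.match(line)
        if m is None:
            continue  # untagged line: ignore it
        number = int(m.group(1))
        if number == 1:
            current = []            # a <1...> line opens a new observation
            groups.append(current)
        elif number in wanted and current is not None:
            current.append(line)
    return groups

sublist = [
    "<1x>begins", "<2x>value-1", "<3x>value-2", "<4x>value-3", "<5x>other",
    "<1y>next observation begins", "<2y>value-1", "<3y>value-2", "<4y>value-3",
]
print(group_tagged_lines(sublist))
```

For a list of lists, something like [g for sub in biglist for g in group_tagged_lines(sub)] should flatten the per-sublist groups into one result.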

You're off to a good start by noticing that your original solution may work but lacks elegance.
You should parse the string in a loop, creating a new variable for each line.
Here's some sample code:
import re
s = """<1x>begins
<2x>value-1
<3x>value-2
<4x>value-3
some indeterminate number of other values
<1y>next observation begins
<2y>value-1
<3y>value-2
<4y>value-3"""
firstMatch = re.compile(r'^<1[a-zA-Z]')  # any <1...> tag starts a new observation
numMatch = re.compile(r'^<(\d+)')
listIneed = []
templist = None
for line in s.splitlines():  # split on line breaks, not on every space
    if firstMatch.match(line):
        if templist is not None:
            listIneed.append(templist)
        templist = [line]
    elif numMatch.match(line):
        #print('The matching number is %s' % numMatch.match(line).group(1))
        templist.append(line)
if templist is not None:
    listIneed.append(templist)
print(listIneed)

If I've understood your question correctly:
import re
def getlines(ori):
    matches = re.finditer(r'(<([1-4])[a-zA-Z]>.*)', ori)
    mainlist = []
    sublist = []
    for sr in matches:
        if int(sr.groups()[1]) == 1:
            if sublist != []:
                mainlist.append(sublist)
            sublist = []
        else:
            sublist.append(sr.groups()[0])
    else:
        mainlist.append(sublist)
    return mainlist
...would do the job for you, if you felt like using regular expressions.
The version below breaks all of the data down into sublists (not just the <2, <3 and <4 lines in each grouping), which might be more useful depending on what else you need to do with the data. You can then slice each sublist, as in David's listINeed = [sublist[1:4] for sublist in biglist], to pick out the results needed for the specific task above; adjust the slice bounds to wherever the <2 line sits in your sublists.
import re
def getlines(ori):
    # \d+ (not \d*) so that a tag without a number can never reach int()
    matches = re.finditer(r'(<(\d+)[a-zA-Z]>.*)', ori)
    mainlist = []
    sublist = []
    for sr in matches:
        if int(sr.groups()[1]) == 1:
            print("1 found!")
            if sublist != []:
                mainlist.append(sublist)
            sublist = []
        else:
            sublist.append(sr.groups()[0])
    else:
        mainlist.append(sublist)
    return mainlist

Related

Python list slicing error

I am using this code: https://pastebin.com/mQkpxdeV
wordlist[overticker] = thesentence[0:spaces]
in this function:
def mediumparser(inpdat3):
    spaceswitch = 0
    overticker = 0
    thesentence = "this sentence is to count the spaces"
    wordlist = []
    while spaceswitch == 0:
        spaces = thesentence.find(' ')
        wordlist[overticker] = thesentence[0:spaces] # this is where we save the words at
        thesentence = thesentence[spaces:len(thesentence)] # this is where we change the sentence for the
                                                           # next run-through
        print('here2')
        print(wordlist)
I can't figure out why it just keeps saying list index out of range.
The program seems to work but it gives an error, what am I doing wrong? I have looked through this book by Mark Lutz for an answer and I can't find one.

The "list index out of range" problem is never caused by list slicing, as this simple test shows:
>>> l = []
>>> l[0:1200]
[]
>>> l[-400:1200]
[]
so the problem is with your left hand assignment wordlist[overticker] which uses a list access, not slicing, and which is subject to "list index out of range".
Just those 4 lines of your code are enough to find the issue
wordlist = []
while spaceswitch == 0:
    spaces = thesentence.find(' ')
    wordlist[overticker] = ...
wordlist is just empty. You have to extend/append the list (or use a dictionary if you want to dynamically create items according to a key)

Instead of assigning to wordlist[overticker] while wordlist is an empty list, you need to use append, since indexing into an empty list doesn't make sense.
wordlist.append(thesentence[0:spaces])
Alternatively, you can pre-initialize the list with 20 empty strings.
wordlist = [""]*20
wordlist[overticker] = thesentence[0:spaces]
P.S.
wordlist[overticker] is called indexing, wordlist[1:10] is called slicing.
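Putting the append suggestion together, here is a minimal sketch of the question's find-and-slice loop that terminates and never indexes into an empty list. The function name split_words is made up for illustration, and the sketch assumes words are separated by single spaces.

```python
def split_words(sentence):
    """Split on single spaces using find() and append(), as in the question's loop."""
    wordlist = []
    while True:
        spaces = sentence.find(' ')
        if spaces == -1:                  # no space left: the remainder is the last word
            wordlist.append(sentence)
            break
        wordlist.append(sentence[:spaces])
        sentence = sentence[spaces + 1:]  # skip past the space itself
    return wordlist

print(split_words("this sentence is to count the spaces"))
```

In everyday code, "this sentence is to count the spaces".split() does the same job; the loop is only spelled out here to mirror the original approach.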

Intro to Python - Lists questions

we've started doing Lists in our class and I'm a bit confused thus coming here since previous questions/answers have helped me in the past.
The first question was to sum up all negative numbers in a list, I think I got it right but just want to double check.
import random
def sumNegative(lst):
    sum = 0
    for e in lst:
        if e < 0:
            sum = sum + e
    return sum
lst = []
for i in range(100):
    lst.append(random.randrange(-1000, 1000))
print(sumNegative(lst))
For the 2nd question, I'm a bit stuck on how to write it. The question was:
Count how many words occur in a list up to and including the first occurrence of the word “sap”. I'm assuming it's a random list but wasn't given much info so just going off that.
I know the ending would be similar but no idea how the initial part would be since it's string opposed to numbers.
I wrote a code for a in-class problem which was to count how many odd numbers are on a list(It was random list here, so assuming it's random for that question as well) and got:
import random
def countOdd(lst):
    odd = 0
    for e in lst:
        if e % 2 = 0:
            odd = odd + 1
    return odd
lst = []
for i in range(100):
    lst.append(random.randint(0, 1000))
print(countOdd(lst))
How exactly would I change this to fit the criteria for the 2nd question? I'm just confused on that part. Thanks.

The code to sum -ve numbers looks fine! I might suggest testing it on a list that you can manually check, such as:
print(sumNegative([1, -1, -2]))
The same logic would apply to your random list.
A note about your countOdd function, it appears that you are missing an = (== checks for equality, = is for assignment) and the code seems to count even numbers, not odd. The code should be:
def countOdd(lst):
    odd = 0
    for e in lst:
        if e%2 == 1: # Odd%2 == 1
            odd = odd + 1
    return odd
As for your second question, you can use a very similar function:
def countWordsBeforeSap(inputList):
    numWords = 0
    for word in inputList:
        if word.lower() != "sap":
            numWords = numWords + 1
        else:
            return numWords
    return numWords  # "sap" never appeared, so every word was counted
inputList = ["trees", "produce", "sap"]
print(countWordsBeforeSap(inputList))
To explain the above, the countWordsBeforeSap function:
Starts iterating through the words.
If the word is anything other than "sap" it increments the counter and continues
If the word IS "sap" then it returns early from the function
The function could be more general by passing in the word that you wanted to check for:
def countWordsBefore(inputList, wordToCheckFor):
    numWords = 0
    for word in inputList:
        if word.lower() != wordToCheckFor:
            numWords = numWords + 1
        else:
            return numWords
    return numWords  # the word never appeared, so every word was counted
inputList = ["trees", "produce", "sap"]
print(countWordsBefore(inputList, "sap"))
If the words that you are checking come from a single string then you would initially need to split the string into individual words like so:
inputString = "Trees produce sap"
inputList = inputString.split(" ")
Which splits the initial string into words that are separated by spaces.
Hope this helps!
Tom

def count_words(lst, end="sap"):
    """Note that I added an extra input parameter.
    This input parameter has a default value of "sap" which is the actual question.
    However you can change this input parameter to any other word if you want to by
    just doing count_words(lst, "another_word").
    """
    words = []
    # First we need to loop through each item in the list.
    for item in lst:
        # We append the item to our "words" list first thing in this loop,
        # as this will make sure we will count up to and INCLUDING.
        words.append(item)
        # Now check if we have reached the 'end' word.
        if item == end:
            # Break out of the loop prematurely, as we have reached the end.
            break
    # Our 'words' list now has all the words up to and including the 'end' variable.
    # 'len' will return how many items there are in the list.
    return len(words)
lst = ["something", "another", "woo", "sap", "this_wont_be_counted"]
print(count_words(lst))
Hope this helps you understand lists better!

You can make effective use of list/generator comprehensions. The snippets below are fast and memory efficient.
1. Sum of negatives:
print(sum(i for i in lst if i < 0))
2. Count of words before sap: Like your sample list, this assumes the list contains no numbers. Add 1 to the result to include "sap" itself.
print(lst.index('sap'))
If it's a random list, keep only the strings first, then find the index of sap:
l = ['a','b',1,2,'sap',3,'d']
l = [x for x in l if isinstance(x, str)]
print(l.index('sap'))
3. Count of odd numbers:
print(sum(i%2 != 0 for i in lst))
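As a quick check on the three tasks, here is a minimal sketch that wraps each one in a small function and runs it on hand-checkable lists. The function names are made up for illustration, and count_up_to assumes that "up to and including" should fall back to the full list length when the word never appears.

```python
def sum_negative(numbers):
    """Add up only the values below zero."""
    return sum(n for n in numbers if n < 0)

def count_odd(numbers):
    """Count values that are not divisible by 2."""
    return sum(1 for n in numbers if n % 2 != 0)

def count_up_to(words, end="sap"):
    """Count items up to and including the first `end`; the whole list if `end` is absent."""
    for position, word in enumerate(words, start=1):
        if word == end:
            return position
    return len(words)

print(sum_negative([1, -1, -2]))
print(count_odd([1, 2, 3, 4, 5]))
print(count_up_to(["trees", "produce", "sap", "slowly"]))
```

Testing on short fixed lists like these, rather than random ones, makes it easy to confirm each result by hand.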

Nested for loops with large data set

I have a list of sublists each of which consists of one or more strings. I am comparing each string in one sublist to every other string in the other sublists. This consists of writing two for loops. However, my data set is ~5000 sublists, which means my program keeps running forever unless I run the code in increments of 500 sublists. How do I change the flow of this program so I can still look at all j values corresponding to each i, and yet be able to run the program for ~5000 sublists. (wn is Wordnet library)
Here's part of my code:
for i in range(len(somelist)):
    if i == len(somelist)-1: #if the last sublist, do not compare
        break
    title_former = somelist[i]
    for word in title_former:
        singular = wn.morphy(word) #convert to singular
        if singular == None:
            pass
        elif singular != None:
            newWordSyn = getNewWordSyn(word,singular)
            if not newWordSyn:
                uncounted_words.append(word)
            else:
                for j in range(i+1,len(somelist)):
                    title_latter = somelist[j]
                    for word1 in title_latter:
                        singular1 = wn.morphy(word1)
                        if singular1 == None:
                            uncounted_words.append(word1)
                        elif singular1 != None:
                            newWordSyn1 = getNewWordSyn(word1,singular1)
                            tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)
Example:
Input = [['space', 'invaders'], ['draw']]
Output= {('space','draw'):0.5,('invaders','draw'):0.2}
The output is a dictionary with corresponding string pair tuple and their similarity value. The above code snippet is not complete.

How about doing a bit of preprocessing instead of doing a bunch of operations over and over? I did not test this, but you get the idea; you need to take anything you can out of the loop.
# Preprocessing:
unencountered_words = []
preprocessed_somelist = []
for sublist in somelist:
    new_sublist = []
    preprocessed_somelist.append(new_sublist)
    for word in sublist:
        temp = wn.morphy(word)
        if temp:
            new_sublist.append(temp)
        else:
            unencountered_words.append(word)
# Nested loops:
for i in range(len(preprocessed_somelist) - 1): #equivalent to your logic
    for word in preprocessed_somelist[i]:
        for j in range(i+1, len(preprocessed_somelist)):
            for word1 in preprocessed_somelist[j]:
                tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)

you could try something like this but I doubt it will be faster (and you will probably need to change the distance function)
import itertools
def dist(s1,s2):
    return sum([i!=j for i,j in zip(s1,s2)]) + abs(len(s1)-len(s2))
dict([((k,v),dist(k,v)) for k,v in itertools.product(Input1,Input2)])

This is always going to have scaling issues, because you're doing n^2 string comparisons. Julius' optimization is certainly a good starting point.
The next thing you can do is store similarity results and reuse them, so you don't have to compare the same words repeatedly.
key = (newWordSyn, newWordSyn1)
if key in prevCompared:
    tempSimilarity = prevCompared[key]
else:
    tempSimilarity = newWordSyn.wup_similarity(newWordSyn1)
    prevCompared[key] = tempSimilarity
    prevCompared[(newWordSyn1, newWordSyn)] = tempSimilarity
This only helps if you see a lot of the same word combinations, but I think wup_similarity is quite expensive.
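To see how much a cache can save, here is a minimal self-contained sketch of the same idea. It does not use WordNet: toy_similarity is a made-up stand-in for an expensive call such as wup_similarity, and make_cached_similarity is a hypothetical helper name. It assumes the similarity is symmetric, so (a, b) and (b, a) can share one cache entry.

```python
def make_cached_similarity(similarity):
    """Wrap a symmetric two-argument function so each unordered pair is computed once."""
    cache = {}

    def cached(a, b):
        key = frozenset((a, b))   # same key for (a, b) and (b, a)
        if key not in cache:
            cache[key] = similarity(a, b)
        return cache[key]

    return cached

calls = []  # records every real (uncached) computation

def toy_similarity(a, b):
    # Stand-in for an expensive call; character overlap, not a real semantic metric.
    calls.append((a, b))
    return len(set(a) & set(b)) / len(set(a) | set(b))

sim = make_cached_similarity(toy_similarity)
titles = [['space', 'invaders'], ['draw'], ['space']]
results = {}
for i in range(len(titles) - 1):
    for word in titles[i]:
        for j in range(i + 1, len(titles)):
            for word1 in titles[j]:
                results[(word, word1)] = sim(word, word1)

print(len(results), "pairs stored,", len(calls), "similarity calls made")
```

The cache only pays off when the same pair recurs across sublists, which is the caveat noted above; the loop is still quadratic in the number of words.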

Append a list from a search function

This is my current code but I am not entirely sure how to append the list with the results of the search. If anyone could provide any help it would be appreciated.
import sys
import re
with open('text.log') as f:
    z=[]
    count = 0
    match = re.compile(r'^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(.?)$')
    for l in f:
        if match.search(l):
            z = l.strip().split("\t")[5:8]
            z.pop(1)
            print(z[1]) # Now print them
            print('\n')
            count += 1
            if count == 20:
                print("\n\n\n\n\n-----NEW GROUPING OF 20 RESULTS-----\n\n\n\n\n\n")
                count = 0
        else:
            print('wrong')
            sys.exit()

A few thoughts:
1) Whenever you have questions about regular expressions, you can use tools like the Python Regex Tool Site to confirm your REs are doing what you think they're doing.
2) In your comment, you said that you don't want every element in ip to be printed. The any() function will return True if any of the elements in the iterable are True, hence the function name.
if any(match.search(s) for s in ip):
    # this if statement will be true if ANY of the elements of ip match the
    # regex, and all of the statements under it will be executed
    print(ip) # now you're printing the whole list, so even the ones that
              # didn't match will be printed
3) If you want the tryme() function to return just matching elements of the ip list, try this:
def tryme(ip):
    # raw string so that \. reaches the regex engine unchanged
    match = re.compile(r"^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$")
    return [element for element in ip if match.search(element)]
Don't forget to modify the main body of your code to catch the returned list and print it out.
4) You have a useless assignment:
with open('text.log') as f:
    listofip=[]
    count = 0
    for l in f:
        listofip = l.strip().split("\t")[5:8] # Grab the elements you want...
The second assignment to listofip overwrites the empty list you created, so you should either get rid of the listofip = [] or use a different name when you slice up your input line. Based on your original question title, I think something like this may be more appropriate for you:
import sys
import re
import operator
# using my definition of tryme() from section 3 of my answer
with open('text.log') as f:
    list_of_ips = []
    count = 0
    retriever = operator.itemgetter(5, 7, 8)
    for l in f:
        matched = tryme(retriever(l.strip().split("\t")))
        list_of_ips.extend(matched)  # extend, so the list stays flat
        count += len(matched) # add however many ips matched to current count
        if count >= 20:
            print(list_of_ips)
            print("\n\n\n\n\n-----NEW GROUPING OF RESULTS-----\n\n\n\n\n\n")
            count = 0 # reset count
            list_of_ips = [] # empty the list of ips
This will iterate over the file, grab the elements that match your RE, append them to a list, and print them out once there are over 20 in the list. Note that it may print groups larger than 20. I have also added operator.itemgetter() to simplify your slicing and popping.
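The difference between points 2 and 3 above (any() versus a filtering list comprehension) can be seen in a minimal sketch. The field values are invented sample data, and the pattern is a deliberately simplified dotted-quad check that does not range-check each octet the way the regex in the question does.

```python
import re

# Simplified pattern for illustration only: four groups of 1-3 digits.
IP = re.compile(r'^\d{1,3}(\.\d{1,3}){3}$')

# Invented sample fields, standing in for one split log line.
fields = ["10.0.0.1", "GET", "192.168.1.20", "-"]

has_ip = any(IP.search(s) for s in fields)       # True if at least one field matches
only_ips = [s for s in fields if IP.search(s)]   # keeps just the matching fields

print(has_ip)
print(only_ips)
```

any() answers a yes/no question about the whole list, so it suits a guard condition; the comprehension is the one to use when the matching elements themselves are needed.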

How to reference the next item in a list in Python?

I'm fairly new to Python, and am trying to put together a Markov chain generator. The bit that's giving me problems is focused on adding each word in a list to a dictionary, associated with the word immediately following.
def trainMarkovChain():
    """Trains the Markov chain on the list of words, returning a dictionary."""
    words = wordList()
    Markov_dict = dict()
    for i in words:
        if i in Markov_dict:
            Markov_dict[i].append(words.index(i+1))
        else:
            Markov_dict[i] = [words.index(i+1)]
    print Markov_dict
wordList() is a previous function that turns a text file into a list of words. Just what it sounds like. I'm getting an error saying that I can't concatenate strings and integers, referring to words.index(i+1), but if that's not how to refer to the next item then how is it done?

You can also do it as:
for a,b in zip(words, words[1:]):
This will assign a as an element in the list and b as the next element.
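Applied to the Markov chain in the question, the zip pairing might look like the minimal sketch below. The function name train_markov_chain and the sample sentence are made up for illustration; the function takes the word list as a parameter instead of calling the question's wordList().

```python
def train_markov_chain(words):
    """Map each word to the list of words that immediately follow it."""
    chain = {}
    # zip stops at the shorter sequence, so the last word is never paired.
    for current, following in zip(words, words[1:]):
        chain.setdefault(current, []).append(following)
    return chain

words = "the cat sat on the mat".split()
chain = train_markov_chain(words)
print(chain)
```

Because zip stops when words[1:] runs out, no special case is needed for the final word; it simply never becomes a key unless it also appears earlier in the list.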

The following code, simplified a bit, should produce what you require. I'll elaborate more if something needs explaining.
words = 'Trains the Markov chain on the list of words, returning a dictionary'.split()
chain = {}
for i, word in enumerate(words):
    # ensure there's a record
    next_words = chain.setdefault(word, [])
    # break on the last word
    if i + 1 == len(words):
        break
    # append the next word
    next_words.append(words[i + 1])
print(words)
print(chain)
assert len(chain) == 11
assert chain['the'] == ['Markov', 'list']
assert chain['dictionary'] == []

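def markov_chain(words):
    markov = {}
    for index, word in enumerate(words):
        if index < len(words) - 1:
            # a repeated word keeps only its most recent follower
            markov[word] = words[index + 1]
    return markov
The code above takes a list as input and returns the corresponding Markov chain as a dictionary that maps each word to the word that last followed it.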

You can use loops to get that, but it's actually a waste to have to put the rest of your code in a loop when you only need the next element.
There are two nice options to avoid this:
Option 1 - if you know the next index, just call it:
my_list[my_index]
Most of the time you won't know the index, but you may still want to avoid the for loop.
Option 2 - use iterators:
my_iterator = iter(my_list)
next(my_iterator) # no loop required
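A slightly fuller sketch of the iterator option follows, using only the built-ins iter() and next(). The list contents are made up; the second argument to next() is a default that is returned instead of raising StopIteration once the iterator is exhausted.

```python
my_list = ["alpha", "beta", "gamma"]
my_iterator = iter(my_list)

first = next(my_iterator)             # "alpha"
second = next(my_iterator)            # "beta"
third = next(my_iterator)             # "gamma"
after_end = next(my_iterator, None)   # default avoids StopIteration at the end

print(first, second, third, after_end)
```

Without the default, a fourth plain next(my_iterator) call would raise StopIteration, so the default form is the safer choice when the list length is not known in advance.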
