I have a list of lists like this.
documents = [['Human machine interface for lab abc computer applications','4'],
['A survey of user opinion of computer system response time','3'],
['The EPS user interface management system','2']]
Now I need to iterate through the above list and output a list of strings, as shown below (without the numbers in the original list):
documents = ['Human machine interface for lab abc computer applications',
'A survey of user opinion of computer system response time',
'The EPS user interface management system']
The simplest solution for doing exactly what you specified is:
documents = [sub_list[0] for sub_list in documents]
This is basically equivalent to the iterative version:
temp = []
for sub_list in documents:
    temp.append(sub_list[0])
documents = temp
This is, however, not really a general way of iterating through a multidimensional list with an arbitrary number of dimensions, since nested list comprehensions / nested for loops can get ugly; you should be safe doing it for 2- or 3-d lists, though.
If you do decide you need to flatten more than 3 dimensions, I'd recommend implementing a recursive traversal function which flattens all non-flat layers, along the lines of the sketch below.
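A minimal sketch of such a recursive flatten might look like this (the function name and behavior are my own, not part of the answer above):
def flatten(nested):
    # Recursively flatten arbitrarily nested lists into one flat list
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))  # recurse into sub-lists
        else:
            flat.append(item)           # leaf element: keep as-is
    return flat

# flatten([[1, [2, 3]], [4], 5]) returns [1, 2, 3, 4, 5]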
If you want to simply iterate over the list and do things with the elements (rather than produce the specific result requested in the question), you can use a basic for loop:
for row in documents:
    # do stuff with the row
    print(row)
    for column in row:
        # do stuff with the columns for a particular row
        print(column)
    if int(row[1]) > 10:  # row[1] is a string here, so convert before comparing
        print('The value is much too large!!')
This is a language feature known as "flow control".
Note that if you only want the result given in the question, a list comprehension like the one machine yearning provided is the best way to do it:
documents = [doc[0] for doc in documents]
Note that it discards your original documents list (you are overwriting the original variable) so use the following if you want to have a copy of the first column as well as a copy of your original list:
document_first_row = [doc[0] for doc in documents]
As explained at http://docs.python.org/library/operator.html#operator.itemgetter, you can also try:
from operator import itemgetter
documents = list(map(itemgetter(0), documents))  # wrap in list() on Python 3, where map is lazy
that should be faster than using an explicit loop.
**Edit: thanks DSM. This is wrong, as it just flattens the lists; I didn't notice the extra data inside each sub-list after the text that the OP wants to ignore.
Ok I'll make it really easy for you!
import itertools
itertools.chain.from_iterable(documents)
As others have said, it depends on what final behavior you need. If you need something more complex than that, use a recursive traversal or, if you are like me, an iterative traversal. I can help you with that if you need it.
The question is old, but knowing one more way doesn't hurt:
documents = [['Human machine interface for lab abc computer applications','4'],
['A survey of user opinion of computer system response time','3'],
['The EPS user interface management system','2']]
document = []
for first, *remaining in documents:
    document.append(first)
print(document)
['Human machine interface for lab abc computer applications',
'A survey of user opinion of computer system response time',
'The EPS user interface management system'
]
You can also use zip with argument unpacking to transform a list of "rows" into a list of columns:
rows = [[1, 'a', 'foo'],
        [2, 'b', 'bar'],
        [3, 'c', 'baz']]

columns = list(zip(*rows))  # list() so the result can be indexed on Python 3
print(columns)
# [(1, 2, 3),
#  ('a', 'b', 'c'),
#  ('foo', 'bar', 'baz')]
print(columns[0])
# (1, 2, 3)
the * operator passes all the rows in as separate arguments to zip
zip(*rows) == zip(row1,row2,row3,...)
zip takes all the rows and assembles columns with one item from each list
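Applied to the question's original documents list of lists, the same trick pulls out the text column in one go (a small sketch; the variable names here are mine):
texts, numbers = zip(*documents)
documents_texts = list(texts)   # just the strings, without the numbers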
You could use a NumPy array, for instance:
import numpy as np

document = [['the quick brown fox', '2'], ['jumped over the lazy fox ', '3']]
document = np.array(document)
document = document[:, 0]   # keep only the first column (the strings)
Related
I have worked with C# and Java, and every time I needed a List I wanted to store objects of the same type, like:
List<Customer> customers;
//or
List<Invoice> invoices;
The official Python-Doc states:
Lists might contain items of different types, but usually the items all have the same type.
Therefore this is possible:
list = [1, 2, 3, "cat", some_object, "another cat", 4, "and a cat again"]
There are hundreds of examples of mixed lists and explanations of why it is possible.
I understand why it is possible in Python to have lists with mixed types, but I cannot think of a real-life example where it would be useful.
I have been wondering for quite some time whether there is a real advantage to having mixed lists.
Or is it just: "we don't need it, but we can!"
Am I missing something very important about mixed lists which would make me a better python-programmer?
Edit:
I thought the answer would be that Python can't do otherwise, so let me rephrase that:
Or is it just: "we don't need it, but we can't do otherwise!"
Python lists store items of the same type -- references to other objects.
Or is it just: "we don't need it, but we can!"
No, it's not. Python doesn't have static type declarations for containers, so it's rather "because we can't do otherwise".
First, as referenced in the above-linked cross-site question, remember that a language like Java will let you create an array of Object anyway, which is pretty similar to Python's default list.
Second, there is an extremely common use case for data structures containing elements of different types. Here's an example:
def get_row():
    response = input('Enter your name and score, separated by a space: ').split()
    return response[0], int(response[1])
Now you're returning a tuple that contains a string and an integer.
Here's another example:
names = ['Alice', 'Bob', 'Charlie']
scores = [8, 12, 9]
results = dict(zip(names, scores))
Without being able to create tuples with a string and an integer, we might have to do something like results = {name:scores[i] for i, name in enumerate(names)}. Could we? Sure... but it's nice that we can choose the one that looks better, without even having to think about whether our data structure holds str or int or object, and without having to cast objects before the compiler will allow us to work with them.
It has applications because of polymorphism. Imagine you have multiple sport objects, like soccer, basketball and football; each one is played differently, but all of them can be played, so you can say:
soccer = Soccer()
basketball = Basketball()
football = Football()

sportsILike = [soccer, basketball, football]
for sport in sportsILike:
    sport.play()
The interesting thing is that those sports are different classes!
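To make that snippet self-contained, here is a minimal sketch of what those classes might look like (the class and method bodies are assumptions of mine, matching only the names used above):
class Soccer:
    def play(self):
        print('Kicking the ball around the pitch')

class Basketball:
    def play(self):
        print('Dribbling and shooting hoops')

class Football:
    def play(self):
        print('Throwing the ball downfield')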
lst = [(u'course', u'session'), (u'instructor', u'session'), (u'session', u'trainee'), (u'person', u'trainee'), (u'person', u'instructor'), (u'course', u'instructor')]
I have the above list of tuples, and I need to sort it with the following logic:
Each tuple's 2nd element depends on its 1st element, e.g. in (course, session), session depends on course, and so on.
I want a sorted list based on the priority of their dependencies: objects with fewer (or no) dependencies come first, so the output should look like below:
lst = [course, person, instructor, session, trainee]
You're looking for what's called a topological sort. The wikipedia page shows the classic Kahn and depth-first-search algorithms for it; Python examples are here (a bit dated, but should still run fine), on pypi (stable and reusable -- you can also read the code online here) and here (Tarjan's algorithm, that kind-of also deals with cycles in the dependencies specified), just to name a few.
Conceptually, what you need to do is create a directed acyclic graph with edges determined by the contents of your list, and then do a topological sort on the graph. The algorithm to do this doesn't exist in Python's standard library (at least, not that I can think of off the top of my head), but you can find plenty of third-party implementations online, such as http://www.bitformation.com/art/python_toposort.html
The function at that website takes a list of all the strings, items, and another list of the pairs between strings, partial_order. Your lst should be passed as the second argument. To generate the first argument, you can use itertools.chain.from_iterable(lst), so the overall function call would be
import itertools
lst = ...
ordering = topological_sort(itertools.chain.from_iterable(lst), lst)
Or you could modify the function from the website to only take one argument, and to create the nodes in the graph directly from the values in your lst.
EDIT: Using the topsort module Alex Martelli linked to, you could just pass lst directly.
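If you'd rather not pull in an external module, a minimal Kahn-style topological sort over the pairs is short enough to write directly (a sketch; topo_sort_pairs is my own name, not a function from any of the linked libraries):
from collections import defaultdict, deque

def topo_sort_pairs(pairs):
    # Sketch of Kahn's algorithm over (prerequisite, dependent) pairs
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for before, after in pairs:
        successors[before].append(after)
        indegree[after] += 1
        nodes.update((before, after))
    # Start from nodes that depend on nothing
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        raise ValueError('dependency cycle detected')
    return order

# topo_sort_pairs(lst) -> ['course', 'person', 'instructor', 'session', 'trainee']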
I'm working through some tutorials on Python and am at a position where I am trying to decide what data type/structure to use in a certain situation.
I'm not clear on the differences between arrays, lists, dictionaries and tuples.
How do you decide which one is appropriate - my current understanding doesn't let me distinguish between them at all - they seem to be the same thing.
What are the benefits/typical use cases for each one?
How do you decide which data type to use? Easy:
You look at which are available and choose the one that does what you want. And if there isn't one, you make one.
In this case a dict is a pretty obvious solution.
Tuples first. These are list-like things that cannot be modified. Because the contents of a tuple cannot change, you can use a tuple as a key in a dictionary. That's the most useful place for them in my opinion. For instance if you have a list like item = ["Ford pickup", 1993, 9995] and you want to make a little in-memory database with the prices you might try something like:
ikey = (item[0], item[1])  # note: tuple() takes a single iterable, so build the tuple directly
idata = item[2]
db[ikey] = idata
Lists seem to be like arrays or vectors in other programming languages and are usually used for the same kinds of things in Python. However, they are more flexible in that you can put different types of things into the same list. Generally, they are the most flexible data structure, since you can put a whole list into a single list element of another list, but for real data crunching they may not be efficient enough.
a = [1,"fred",7.3]
b = []
b.append(1)
b[0] = "fred"
b.append(a) # now the second element of b is the whole list a
Dictionaries are often used a lot like lists, but now you can use any immutable thing as the index to the dictionary. However, unlike lists, dictionaries don't have a natural order and can't be sorted in place. Of course you can create your own class that incorporates a sorted list and a dictionary in order to make a dict behave like an Ordered Dictionary. There are examples on the Python Cookbook site.
c = {}
d = ("ford pickup",1993)
c[d] = 9995
Arrays are getting closer to the bit level for when you are doing heavy duty data crunching and you don't want the frills of lists or dictionaries. They are not often used outside of scientific applications. Leave these until you know for sure that you need them.
Lists and Dicts are the real workhorses of Python data storage.
The best type for counting elements like this is usually defaultdict:
from collections import defaultdict

s = 'asdhbaklfbdkabhvsdybvailybvdaklybdfklabhdvhba'
d = defaultdict(int)
for c in s:
    d[c] += 1

print(d['a'])  # prints 7
Do you really require speed/efficiency? Then go with a pure and simple dict.
Personal:
I mostly work with lists and dictionaries.
It seems that this satisfies most cases.
Sometimes:
Tuples can be helpful if you want to pair/match elements. Besides that, I don't really use them.
However:
I write high-level scripts that don't need to drill down into the core "efficiency" where every byte and every memory/nanosecond matters. I don't believe most people need to drill this deep.
Alright, so I am making a command-line based implementation of a website search feature. The website has a list of all the links I need, in alphabetical order.
Usage would be something like
./find.py LinkThatStartsWithB
So it would navigate to the webpage associated with the letter B.
My question is: what is the most efficient/smartest way to use the user's input to navigate to the right webpage?
What I was thinking at first was something along the lines of using a list and then getting the first letter of the word and using the numeric identifier to tell where to go in list index.
(A = 1, B = 2...)
Example code:
#Use base url as starting point then add extension on end.
Base_URL = "http://www.website.com/"
#Use list index as representation of letter
Alphabetic_Urls = [
"/extensionA.html",
"/extensionB.html",
"/extensionC.html",
]
Or would Dictionary be a better bet?
Thanks
How are you getting this list of URLs?
If your command-line app is crawling the website for links, and you are only looking for a single item, building a dictionary is pointless. It will take at least as long to build the dict as it would to just check as you go! E.g., just search as:
for link in mysite.getallLinks():
    if link[0] == firstletter:
        print(link)
If you are going to be doing multiple searches (rather than just a single commandline parameter), then it might be worth building a dictionary using something like:
import collections

d = collections.defaultdict(list)
for link in mysite.getallLinks():
    d[link[0]].append(link)  # dict of first letter -> list of links

# Print all links starting with firstletter
for link in d[firstletter]:
    print(link)
Though given that there are just 26 buckets, it's not going to make that much of a difference.
The smartest way here will be whatever makes the code simplest to read. When you've only got 26 items in a list, who cares what algorithm it uses to look through it? You'd have to use something really, really stupid to make it have an impact on performance.
If you're really interested in the performance though, you'd need to benchmark different options. Looking at just the complexity doesn't tell the whole story, because it hides the factors involved. For instance, a dictionary lookup will involve computing the hash of the key, looking that up in tables, then checking equality. For short lists, a simple linear search can sometimes be more efficient, depending on how costly the hashing algorithm is.
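As a rough illustration of how such a benchmark might look, timeit can compare a linear scan of a 26-item list against a dict lookup (a sketch; the data below is made up for the comparison):
import timeit
import string

# Hypothetical data: 26 letter -> URL pairs
pairs = [(c, '/extension%s.html' % c.upper()) for c in string.ascii_lowercase]
as_list = pairs
as_dict = dict(pairs)

def scan_list(letter):
    # Linear search through the list of pairs
    for key, url in as_list:
        if key == letter:
            return url

print(timeit.timeit(lambda: scan_list('q'), number=100000))
print(timeit.timeit(lambda: as_dict['q'], number=100000))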
If your example is really accurate though, can't you just take the first letter of the input string and predict the URL from that? ("/extension" + letter + ".html")
Dictionary!
O(1)
Dictionary would be a good choice if you have (and will always have) a small number of items. If the list of URLs is going to expand in the future, you will probably actually want to sort the URLs by their letter and then match the input against that instead of hard-coding the dictionary for each one.
Since it sounds like you're only talking about 26 total items, you probably don't have to worry too much about efficiency. Anything you come up with should be fast enough.
In general, I recommend trying to use the data structure that is the best approximation of your problem domain. For example, it sounds like you are trying to map letters to URLs. E.g., this is the "A" url and this is the "B" url. In that case, a mapping data structure like a dict sounds appropriate:
html_files = {
'a': '/extensionA.html',
'b': '/extensionB.html',
'c': '/extensionC.html',
}
Although in this exact example you could actually cheat it and skip the data structure altogether -- '/extension%s.html' % letter.upper() :)
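For completeness, looking up and building the final URL from that dict might look like this (a sketch; Base_URL is the variable from the question, and letter is just a stand-in for the first letter of the user's query):
# Hypothetical usage: 'letter' stands in for the first letter of the user's query
letter = 'b'
url = Base_URL.rstrip('/') + html_files[letter]
print(url)  # http://www.website.com/extensionB.html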
The following code that I wrote takes a set of 68,000 items and tries to find similar items based on text location in the strings. The process takes a bit on this i3 4130 I'm temporarily using to code on - is there any way to speed this up? I'm making a type of 'did you mean?' function, so I need to sort on the spot of what the user enters.
I'm not trying to compare by similarity in a dictionary that's already created using keywords, I'm trying to compare the similar between the user's input on the fly and all existing keys. The user may mistype a key, so that's why it would say "did you mean?", like Google search does.
Sorting does not affect the time, according to averaged tests.
import time

def similar_movies(movie):
    start = time.clock()  # time.perf_counter() on Python 3.8+, where clock() was removed
    movie = capitalize(movie)
    similarmovies = {}
    allmovies = all_movies()  # returns set of all 68000 movies
    for item in allmovies:
        '''if similar(movie.lower(), item.lower()) > .5 or movie in item:  # older algorithm
            similarmovies[item] = similar(movie.lower(), item.lower())'''
        if movie in item:  # newer algorithm
            similarmovies[item] = 1.0
            print(item)
        else:
            similarmovies[item] = similar(movie.lower(), item.lower())
    similarmovieshigh = sorted(similarmovies, key=similarmovies.get, reverse=True)[:10]
    print(time.clock() - start)
    return similarmovieshigh
Other functions used:
from difflib import SequenceMatcher
def similar(a, b):
    output = SequenceMatcher(None, a, b).ratio()
    return output
def all_movies():  # returns set of all keys in sub-dicts (movies)
    people = list(ratings.keys())
    allmovies = []
    for item in people:
        for i in ratings[item]:
            allmovies.append(i)
    allmovies = set(allmovies)
    return allmovies
The dictionary is in this format, except with thousands of names:
ratings={'Shane': {'Avatar': 4.2, '127 Hours': 4.7}, 'Joe': {'Into The Wild': 4.5, 'Unstoppable': 3.0}}
Your algorithm is going to be O(n²), since within every title, the in operator has to check every sub-string of the title to determine if the entered text is within it. So yeah, I can understand why you would want this to run faster.
An i3 doesn't provide much compute power, so pre-computing as much as possible is the only solution, and running extra software such as a database is probably going to provide poor results, again due to the capability.
You might consider using a dictionary of title words (possibly with pre-computed phonetic changes to eliminate most common misspellings - the Porter Stemmer algorithm should provide some helpful reduction rules, e.g. to allow "unstop" to match "unstoppable").
So, for example, one key in your dictionary would be "wild" (or a phonetic adjustment), and the value associated with that key would be a list of all titles that contain "wild"; you would have the same for "the", "into", "avatar", "hours", "127", and all other words in your list of 68,000 titles. Just as an example, your dictionary's "wild" entry might look like:
"wild": ["Into The Wild", "Wild Wild West", "Wild Things"]
(Yes, I searched for "wild" on IMDB just so this list could have more entries - probably not the best choice, but not many titles have "avatar", "unstoppable", or "hours" in them).
Common words such as "the" might have enough entries that you would want to exclude them, so a persistent copy of the dictionary might be helpful to allow you to make specific adjustments, although it isn't necessary, and the compute time should be relatively quick at start-up.
When the user types in some text, you split the text into words, apply any phonetic reductions if you choose to use them, and then concatenate all of the title lists for all of the words from the user, including duplicates.
Then, count the duplicates and sort by how many times a title was matched. If a user types "The Wild", you'd have two matches on "Into The Wild" ("the" and "wild"), so it should sort higher than titles with only "the" or "wild" but not both in them.
Your list of ratings can be searched after the final sorted list is built, with ratings appended to each entry; this operation should be quick, since your ratings are already within a dictionary, keyed by name.
This turns an O(n²) search into an O(log(n)) search for each word entered, which should make a big difference in performance, if it suits your needs.
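A rough sketch of that word-index idea, assuming the set of titles is already available (build_word_index and suggest are my own names, not the poster's code):
from collections import defaultdict, Counter

def build_word_index(titles):
    # Map each lower-cased word to the set of titles containing it
    index = defaultdict(set)
    for title in titles:
        for word in title.lower().split():
            index[word].add(title)
    return index

def suggest(query, index, limit=10):
    # Count how many query words each title matches and rank by that count
    counts = Counter()
    for word in query.lower().split():
        counts.update(index.get(word, ()))
    return [title for title, _ in counts.most_common(limit)]

# index = build_word_index(all_movies())
# suggest('The Wild', index)  # "Into The Wild" ranks above titles matching only one word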
In all_movies(): instead of appending to a list you could add to a set and not cast keys() to a list:
def all_movies():
    allmovies = set()
    for item in ratings.keys():
        for i in ratings[item]:
            allmovies.add(i)
    return allmovies
EDIT: or only using one for-loop:
def all_movies():
    result = []
    for rating_dict in ratings.values():
        result += rating_dict.keys()
    return result
Nothing I could spot in similar_movies.
Also have a look at celery: http://docs.celeryproject.org/en/latest/ for multi-processing,
especially the chunks concept: http://docs.celeryproject.org/en/latest/userguide/canvas.html#chunks
If you're developing for a production system, I'd suggest using a full-text search engine like Whoosh (Python), Elasticsearch (Java), or Apache Solr (Java). A full-text search engine is a server that builds an index to implement full-text search, including fuzzy or proximity searches, efficiently. Many popular database systems also feature full-text search, like PostgreSQL FTS and MySQL FTS, which may be an acceptable alternative if you are already using these database engines.
If this code is developed mostly for self-learning and you want to learn how to implement fuzzy searches, you may want to look at normalizing the movie titles in the index as well as the search terms. There are methods like Soundex and Metaphone that normalize search terms based on how they likely sound in English, and this normalized term can be used to build the search index. PostgreSQL has implementations of these algorithms. Note that these algorithms are very basic building blocks; a proper full-text search engine will take into account misspellings, synonyms, stop words, language-specific quirks, and optimizations like parallel/distributed processing, etc.
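To give a feel for what that normalization does, here is a very small Soundex-style sketch (my own simplified version, not a library call; a real implementation handles more edge cases):
def soundex(word):
    # Simplified Soundex: similar-sounding words map to the same 4-character code
    codes = {}
    for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                           ('l', '4'), ('mn', '5'), ('r', '6')):
        for letter in letters:
            codes[letter] = digit
    word = word.lower()
    if not word:
        return ''
    encoded = word[0].upper()
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        digit = codes.get(ch, '')
        if digit and digit != prev:
            encoded += digit
        prev = digit
    return (encoded + '000')[:4]

# soundex('unstoppable') == soundex('unstopable') == 'U523', so a misspelled
# query can still land in the same index bucket as the correct title word.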