I write some code to read a long text file. it has 10000 English words in txt file.then I want to use split() to get all word to train them, the code is like this:
with open('/train.txt', 'r') as fin
text=fin.read()
len(text)#result is 10000
len(text.split() #result is 2800
IT only get 2800 words of the text when using split(),but I think it should be the whole text and the both results of len() should be the same 10000.
why? due to my computer limited? or my text has problem?
the result of text command is as follow:
'My mother drove me to the airport with the windows rolled down It was seventy-five degrees in Phoenix the sky a perfect cloudless blue I was wearing my favorite shirtsleeve less white eyelet lace I was wearing it as a farewell gesture My carry-on item was a parka In the Olympic Peninsula of northwest Washington State a small town named Forks exists under a near-constant cover of clouds It rains on this inconsequential town more than any other place in the United States of America It was from this town and its gloomy omnipresent shade that my mother escaped with me when I was only a few months old It was in this town that I would been compelled to spend a month every summer until I was fourteen That was the year I finally put my foot down these past three summers my dad Charlie vacationed with me in California for two weeks instead It was to Forks that I now exiled myself an action that I took with great horror I detested Forks I loved Phoenix I loved the sun and the blistering heat I loved the vigorous sprawling cityBella my mom said to me the last of a thousand times before I got on the plane You don have to do this My mom looks like me except with short hair and laugh lines I felt a spasm of panic as I stared at her wide childlike eyes How could I leave my loving erratic hare-brained mother to fend for herself Of course she had Phil now so the bills would probably get paid there would be food in the refrigerator gas in her car and someone to call when she got lost but still I want to go I lied I would always been a bad liar but I would been saying this lie so frequently lately that it sounded almost convincing now Tell Charlie I said hi I willI ll see you soon she insisted You can come home whenever you want I ll come right back as soon as you need me But I could see the sacrifice in her eyes behind the promise Don worry about me I urged It ll be great I love you MomShe hugged my tightly for a minute and then I got on the plane and she was gone It a four-hour flight from Phoenix to Seattle another hour in a small plane up to Port Angeles and then an hour drive back down to Forks Flying did bother me the hour in the car with Charlie though I was a little worried about Charlie had really been fairly nice about the whole thing He seemed genuinely please that I was combing to live with him for the first time with any degree of permanence He would already gotten me registered for high school and was going to help me get a car But it was sure to be awkward with Charlie Neither of us was what anyone would call verbose and I did know what there was to say regardless I knew he was more than a little confused by my decision like my mother before me I had made a secret of my distaste for Forks When I landed in Port Angeles it was raining I did see it as an omen just unavoidable I would already said my goodbyes to the sun Charlie was waiting for me with the cruiser This I was expecting too Charlie is Police Chief Swan to the good people of Forks My primary motivation behind buying a car despite the scarcity of my funds was that I refused to be driven around town in a car with red and blue lights on top Nothing slows traffic down like a cop Charlie gave me an awkward one-armed hug when I stumbled my way off the plane It good to see you Bells he said smiling as he automatically caught and steadied me You haven changed much How Renee Mom fine It good to see you re too Dad I was allowed to call him Charlie to his face I had only a few bags Most of my Arizona clothes were too permeable for Washington My mom and I had pooled our resources to supplement my winter wardrobe but it was still scanty It all fit easily into the trunk of the cruiser I found a good car for you really cheap he announced when we were strapped in What kind of car I was suspicious of the way he said good car for you as opposed to just good car Well it a truck actually a Chevy Where did you find it Do you remember Billy Black from La Push La Push is the tiny Indian reservation on the coastNo He used to go fishing with us during the summer Charlie promptedThat would explain why I did remember him I do a good job of blocking painful unnecessary things from my memory He in a wheelchair now Charlie continued when I did respond so he can drive anymore and he offered to sell me his truck cheap What year is it I could see from his change of expression that this was the question he was hoping I would ask Well Billy done a lot of work on the engine it only a few years old really I hope he did think so little of me as to believe I would give up that easily When did he buy it He bought it in I think Did he buy it new Well no I think it was new in the early sixties or late fifties at the earliest he admitted sheepishly Dad I don really know anything about cars I would be able to fix it if anything went wrong and I could not afford a mechanic Really Bella the thing runs great They don build them like that anymore The thing I at the very least How cheap is cheap After all that was the part I could not compromise on Well honey I kind of already bought it for you As a homecoming gift Charlie peeked sideways at me with a hopeful expression Wow Free You did need to do that Dad I was going to buy myself a car I don mind I want you to be happy here He was looking ahead at the road when he said this Charlie was comfortable with expressing his emotions out loud I inherited that from him So I was looking straight ahead as I responded That really nice Dad Thanks I really appreciate it No need to add that my being happy in Forks is He did need to suffer along with me And I never looked a free truck in the mouth or engine Well now you re welcome he mumbled embarrassed by my thanksWe exchanged a few more comments on the weather which was wet and that was pretty much it for conversation We stared out the windows in silence It was better because it was raining yet though the clouds were dense and opaque It was easier because I knew what to expect of my day Mike came to sit by me in English and walked me to my next class with ChessClub Eric glaring at him all the while that was nattering People did look at me quite as much as they had yesterday I sat with a big group at lunch that included Mike Eric Jessica and several other people whose names and faces I now remembered I began to feel like I was treading water instead of drowning in it It was worse because I was tired I still could not sleep with the wind echoing around the house It was worse because Mr Varner called on me inTrig when my hand was raised and I had the wrong answer It was miserable because I had to play volleyball and the one time I did cringe out of the way of the ball I hit my teammate in the head with itAnd it was worse because Edward Cullen was in school at all All morning I was dreading lunch fearing his bizarre glares Part of mew anted to confront him and demand to know what his problem was While I was lying sleepless in my bed I even imagined what I would say But I knew myself too well to think I would really have the guts to do it I made the Cowardly Lion look like the terminator But when I walked into the cafeteria with Jessica trying to keep my eyes from sweeping the place for him and failing entirely I saw that his four siblings of sorts were sitting together at the same table and he was not with them Mike intercepted8 us and steered9 us to his table Jessica seemed elated by the attention and her friends quickly joined us But as I tried to listen to their easy chatter I was terribly uncomfortable waiting nervously for the moment he would arrive I hoped that he would simply ignore me when he came and prove my suspicions false He did come and as time passed I grew more and more tense I walked to Biology with more confidence when by the end of lunch he still had showed Mike who was taking on the qualities of a golden retriever walked faithfully by my side to class I held my breath at the door but Edward Cullen was there either I exhaled and went to my seat Mike followed talking about an upcoming trip to the beach He lingered by my desk till the bell rang Then he smiled at me wistfully and went to sit by a girl with braces and a bad perm It looked like I was going to have to do something about Mike and it would be easy Ina town like this where everyone lived on top of everyone else diplomacy was essential I had never been enormously tactful I had no practice dealing with overly friendly boys I was relieved that I had the desk to myself that Edward was absent I told myself that repeatedly But I could get rid of the nagging suspicion that I was the reason he was there It was ridiculous and egotistical to think that I could affect anyone that strongly It was impossible And yet I could not stop worrying that it was true When the school day was finally done and the blush was fading out of my cheeks from the volleyball incident I changed quickly back into my jeans and navy blue sweater I hurried from the girls locker room pleased to find that I had successfully evaded my retriever friend for the moment I walked swiftly out to the parking lot It was crowded now with fleeing students I got in my truck and dug through my bag to make sure I had what I needed Last night I would discovered that Charlie could not cook much besides fried eggs and bacon So I requested that I be assigned kitchen detail for the duration of my stay He was willing enough to hand over the keys to the banquet hall I also found out that he had no food in the house So I had my shopping list and the cash from the jar in the cupboard labeled FOOD\\\\x0cMONEY and I was on my way to the Thrift way I gunned my deafening engine to life ignoring the heads that turned in my direction and backed carefully into a place in the line of cars that were waiting to exit the parking lot As I waited trying to pretend that-the earsplitting rumble was coming from someone else car I saw the twoCullens and the Hale twins getting into their car It was the shiny newVolvo Of course I had noticed their clothes before I would been too mesmerized by their faces Now that I looked it was obvious that they were all dressed exceptionally well simply but in clothes that subtly hinted at designer origins With their remarkable good looks the style with which they carried themselves they could have worn dishrags and pulled it off It seemed excessive for them to have both looks and money But as far as I could tell life worked that way most of the time It did look as if it bought them any acceptance here No I did fully believe that The isolation must be their desire I could not imagine any door that would be opened by that degree of beauty They looked at my noisy truck as I passed them just like everyone else I kept my eyes straight forward and was relieved when I finally was free of the school grounds The Thrift way was not far from the school just a few streets south off the highway It was nice to be inside the supermarket it felt normal Idid the shopping at home and I fell into the pattern of the familiar task gladly The store was big enough inside that I could not hear the tapping of the rain on the roof to remind me where I was When I got home I unloaded all the groceries stuffing them in whereverI could find an open space I hoped Charlie would mind I wrapped potatoes in foil and stuck them in the oven to bake covered a steak in marinade and balanced it on top of a carton of eggs in the fridge When I was finished with that I took my book bag upstairs Before starting my homework I changed into a pair of dry sweats pulled my damp hair up into a pony-tail and checked my e-mail for the first time I had three messages Bella my mom wrote…Write me as soon as you get in Tell me how your flight was Is it raining I miss you already I am almost finished packing for Florida butI can find my pink blouse Do you know where I put it Phil says hi Mom I sighed and went to the next It was sent eight hours after the first Bella she wrote…Why haven you e-mailed me yet What are you waiting for Mom The last was from this morning Isabella If I haven heard from you by pm today I am calling Charlie I checked the clock I still had an hour but my mom was well known for the gun MomCalm down I am writing right now Don do anything rash Bella I sent that and began again MomEverything is great Of course it raining I was waiting for something to write about School isn bad just a little repetitive I met some nice kids who sit by me at lunch Your blouse is at the dry cleaners you were supposed to pick it upFriday Charlie bought me a truck can you believe it I love it It old but really sturdy which is good you know for me I miss you too I ll write again soon but I am not going to check my e-mail every five minutes Relax breathe I love you Bella I had decided to read Withering Heights the novel we were currently studying in English yet again for the fun of it and that what I was doing when Charlie came home I would lost track of the time and I hurried downstairs to take the potatoes out and put the steak in to broil Bella my father called out when he heard me on the stairs Who else I thought to myself Hey Dad welcome home Thanks He hung up his gun belt and stepped out of his boots as I bustled about the kitchen As far as I was aware he would never shot the gun-on the job But he kept it ready When I came here as a child he would always remove the bullets as soon as he walked in the door I guess he considered me old enough now not to shoot myself by accident and not depressed enough to shoot myself on purpose What for dinner" he asked warily My mother was an imaginative cook and her experiments were always edible I was surprised and sad that he seemed to remember that far back Steak and potatoes I answered and he looked relieved He seemed to feel awkward standing in the kitchen doing nothing he lumbered into the living room to watch TV while I worked We were both more comfortable that way I made a salad while the steaks cooked and set the table I called him in when dinner was ready and he sniffed appreciatively as he walked into the room'
The result of text.split() command is as follow:
['My',
'mother',
'drove',
'me',
'to',
'the',
'airport',
'with',
'the',
'windows',
'rolled',
'down',
'It',
'was',
'seventy-five',
'degrees',
'in',
'Phoenix',
'the',
'sky',
'a',
'perfect',
'cloudless',
'blue',
'I',
'was',
'wearing',
'my',
'favorite',
'shirtsleeve',
'less',
'white',
'eyelet',
'lace',
'I',
'was',
'wearing',
'it',
'as',
'a',
'farewell',
'gesture',
'My',
'carry-on',
'item',
'was',
'a',
'parka',
'In',
'the',
'Olympic',
'Peninsula',
'of',
'northwest',
'Washington',
'State',
'a',
'small',
'town',
'named',
'Forks',
'exists',
'under',
'a',
'near-constant',
'cover',
'of',
'clouds',
'It',
'rains',
'on',
'this',
'inconsequential',
'town',
'more',
'than',
'any',
'other',
'place',
'in',
'the',
'United',
'States',
'of',
'America',
'It',
'was',
'from',
'this',
'town',
'and',
'its',
'gloomy',
'omnipresent',
'shade',
'that',
'my',
'mother',
'escaped',
'with',
'me',
'when',
'I',
'was',
'only',
'a',
'few',
'months',
'old',
'It',
'was',
'in',
'this',
'town',
'that',
'I',
'would',
'been',
'compelled',
'to',
'spend',
'a',
'month',
'every',
'summer',
'until',
'I',
'was',
'fourteen',
'That',
'was',
'the',
'year',
'I',
'finally',
'put',
'my',
'foot',
'down',
'these',
'past',
'three',
'summers',
'my',
'dad',
'Charlie',
'vacationed',
'with',
'me',
'in',
'California',
'for',
'two',
'weeks',
'instead',
'It',
'was',
'to',
'Forks',
'that',
'I',
'now',
'exiled',
'myself',
'an',
'action',
'that',
'I',
'took',
'with',
'great',
'horror',
'I',
'detested',
'Forks',
'I',
'loved',
'Phoenix',
'I',
'loved',
'the',
'sun',
'and',
'the',
'blistering',
'heat',
'I',
'loved',
'the',
'vigorous',
'sprawling',
'cityBella',
'my',
'mom',
'said',
'to',
'me',
'the',
'last',
'of',
'a',
'thousand',
'times',
'before',
'I',
'got',
'on',
'the',
'plane',
'You',
'don',
'have',
'to',
'do',
'this',
'My',
'mom',
'looks',
'like',
'me',
'except',
'with',
'short',
'hair',
'and',
'laugh',
'lines',
'I',
'felt',
'a',
'spasm',
'of',
'panic',
'as',
'I',
'stared',
'at',
'her',
'wide',
'childlike',
'eyes',
'How',
'could',
'I',
'leave',
'my',
'loving',
'erratic',
'hare-brained',
'mother',
'to',
'fend',
'for',
'herself',
'Of',
'course',
'she',
'had',
'Phil',
'now',
'so',
'the',
'bills',
'would',
'probably',
'get',
'paid',
'there',
'would',
'be',
'food',
'in',
'the',
'refrigerator',
'gas',
'in',
'her',
'car',
'and',
'someone',
'to',
'call',
'when',
'she',
'got',
'lost',
'but',
'still',
'I',
'want',
'to',
'go',
'I',
'lied',
'I',
'would',
'always',
'been',
'a',
'bad',
'liar',
'but',
'I',
'would',
'been',
'saying',
'this',
'lie',
'so',
'frequently',
'lately',
'that',
'it',
'sounded',
'almost',
'convincing',
'now',
'Tell',
'Charlie',
'I',
'said',
'hi',
'I',
'willI',
'll',
'see',
'you',
'soon',
'she',
'insisted',
'You',
'can',
'come',
'home',
'whenever',
'you',
'want',
'I',
'll',
'come',
'right',
'back',
'as',
'soon',
'as',
'you',
'need',
'me',
'But',
'I',
'could',
'see',
'the',
'sacrifice',
'in',
'her',
'eyes',
'behind',
'the',
'promise',
'Don',
'worry',
'about',
'me',
'I',
'urged',
'It',
'll',
'be',
'great',
'I',
'love',
'you',
'MomShe',
'hugged',
'my',
'tightly',
'for',
'a',
'minute',
'and',
'then',
'I',
'got',
'on',
'the',
'plane',
'and',
'she',
'was',
'gone',
'It',
'a',
'four-hour',
'flight',
'from',
'Phoenix',
'to',
'Seattle',
'another',
'hour',
'in',
'a',
'small',
'plane',
'up',
'to',
'Port',
'Angeles',
'and',
'then',
'an',
'hour',
'drive',
'back',
'down',
'to',
'Forks',
'Flying',
'did',
'bother',
'me',
'the',
'hour',
'in',
'the',
'car',
'with',
'Charlie',
'though',
'I',
'was',
'a',
'little',
'worried',
'about',
'Charlie',
'had',
'really',
'been',
'fairly',
'nice',
'about',
'the',
'whole',
'thing',
'He',
'seemed',
'genuinely',
'please',
'that',
'I',
'was',
'combing',
'to',
'live',
'with',
'him',
'for',
'the',
'first',
'time',
'with',
'any',
'degree',
'of',
'permanence',
'He',
'would',
'already',
'gotten',
'me',
'registered',
'for',
'high',
'school',
'and',
'was',
'going',
'to',
'help',
'me',
'get',
'a',
'car',
'But',
'it',
'was',
'sure',
'to',
'be',
'awkward',
'with',
'Charlie',
'Neither',
'of',
'us',
'was',
'what',
'anyone',
'would',
'call',
'verbose',
'and',
'I',
'did',
'know',
'what',
'there',
'was',
'to',
'say',
'regardless',
'I',
'knew',
'he',
'was',
'more',
'than',
'a',
'little',
'confused',
'by',
'my',
'decision',
'like',
'my',
'mother',
'before',
'me',
'I',
'had',
'made',
'a',
'secret',
'of',
'my',
'distaste',
'for',
'Forks',
'When',
'I',
'landed',
'in',
'Port',
'Angeles',
'it',
'was',
'raining',
'I',
'did',
'see',
'it',
'as',
'an',
'omen',
'just',
'unavoidable',
'I',
'would',
'already',
'said',
'my',
'goodbyes',
'to',
'the',
'sun',
'Charlie',
'was',
'waiting',
'for',
'me',
'with',
'the',
'cruiser',
'This',
'I',
'was',
'expecting',
'too',
'Charlie',
'is',
'Police',
'Chief',
'Swan',
'to',
'the',
'good',
'people',
'of',
'Forks',
'My',
'primary',
'motivation',
'behind',
'buying',
'a',
'car',
'despite',
'the',
'scarcity',
'of',
'my',
'funds',
'was',
'that',
'I',
'refused',
'to',
'be',
'driven',
'around',
'town',
'in',
'a',
'car',
'with',
'red',
'and',
'blue',
'lights',
'on',
'top',
'Nothing',
'slows',
'traffic',
'down',
'like',
'a',
'cop',
'Charlie',
'gave',
'me',
'an',
'awkward',
'one-armed',
'hug',
'when',
'I',
'stumbled',
'my',
'way',
'off',
'the',
'plane',
'It',
'good',
'to',
'see',
'you',
'Bells',
'he',
'said',
'smiling',
'as',
'he',
'automatically',
'caught',
'and',
'steadied',
'me',
'You',
'haven',
'changed',
'much',
'How',
'Renee',
'Mom',
'fine',
'It',
'good',
'to',
'see',
'you',
're',
'too',
'Dad',
'I',
'was',
'allowed',
'to',
'call',
'him',
'Charlie',
'to',
'his',
'face',
'I',
'had',
'only',
'a',
'few',
'bags',
'Most',
'of',
'my',
'Arizona',
'clothes',
'were',
'too',
'permeable',
'for',
'Washington',
'My',
'mom',
'and',
'I',
'had',
'pooled',
'our',
'resources',
'to',
'supplement',
'my',
'winter',
'wardrobe',
'but',
'it',
'was',
'still',
'scanty',
'It',
'all',
'fit',
'easily',
'into',
'the',
'trunk',
'of',
'the',
'cruiser',
'I',
'found',
'a',
'good',
'car',
'for',
'you',
'really',
'cheap',
'he',
'announced',
'when',
'we',
'were',
'strapped',
'in',
'What',
'kind',
'of',
'car',
'I',
'was',
'suspicious',
'of',
'the',
'way',
'he',
'said',
'good',
'car',
'for',
'you',
'as',
'opposed',
'to',
'just',
'good',
'car',
'Well',
'it',
'a',
'truck',
'actually',
'a',
'Chevy',
'Where',
'did',
'you',
'find',
'it',
'Do',
'you',
'remember',
'Billy',
'Black',
'from',
'La',
'Push',
'La',
'Push',
'is',
'the',
'tiny',
'Indian',
'reservation',
'on',
'the',
'coastNo',
'He',
'used',
'to',
'go',
'fishing',
'with',
'us',
'during',
'the',
'summer',
'Charlie',
'promptedThat',
'would',
'explain',
'why',
'I',
'did',
'remember',
'him',
'I',
'do',
'a',
'good',
'job',
'of',
'blocking',
'painful',
'unnecessary',
'things',
'from',
'my',
'memory',
'He',
'in',
'a',
'wheelchair',
'now',
'Charlie',
'continued',
'when',
'I',
'did',
'respond',
'so',
'he',
'can',
'drive',
'anymore',
'and',
'he',
'offered',
'to',
'sell',
'me',
'his',
'truck',
'cheap',
'What',
'year',
'is',
'it',
'I',
'could',
'see',
'from',
'his',
'change',
'of',
'expression',
'that',
'this',
'was',
'the',
'question',
'he',
'was',
'hoping',
'I',
'would',
'ask',
'Well',
'Billy',
'done',
'a',
'lot',
'of',
'work',
'on',
'the',
'engine',
'it',
'only',
'a',
'few',
'years',
'old',
'really',
'I',
'hope',
'he',
'did',
'think',
'so',
'little',
'of',
'me',
'as',
'to',
'believe',
'I',
'would',
'give',
'up',
'that',
'easily',
'When',
'did',
'he',
'buy',
'it',
'He',
'bought',
'it',
'in',
'I',
'think',
'Did',
'he',
'buy',
'it',
'new',
'Well',
'no',
'I',
'think',
'it',
'was',
'new',
'in',
'the',
'early',
'sixties',
'or',
'late',
'fifties',
'at',
'the',
'earliest',
'he',
'admitted',
'sheepishly',
'Dad',
'I',
'don',
'really',
'know',
'anything',
'about',
'cars',
'I',
'would',
'be',
'able',
'to',
'fix',
'it',
'if',
'anything',
'went',
'wrong',
'and',
'I',
'could',
'not',
'afford',
'a',
'mechanic',
'Really',
'Bella',
'the',
'thing',
'runs',
'great',
'They',
'don',
'build',
'them',
'like',
'that',
'anymore',
'The',
'thing',
'I',
'at',
'the',
'very',
'least',
'How',
'cheap',
'is',
'cheap',
...]
len(text) is the total number of characters in the file 'train.txt' (assuming ASCII text, this will be the same as your file-size).
len(text.split(...) is the total number of tokens in the file (as determined by your delimiter).
Sidenote: Assuming your delimiter is \n you can cross verify this on unix with cat train.txt | wc -l.
this question is asked here before
What is a good strategy to group similar words?
but no clear answer is given on how to "group" items. The solution based on difflib is basically search, for given item, difflib can return the most similar word out of a list. But how can this be used for grouping?
I would like to reduce
['ape', 'appel', 'apple', 'peach', 'puppy']
to
['ape', 'appel', 'peach', 'puppy']
or
['ape', 'apple', 'peach', 'puppy']
One idea I tried was, for each item, iterate through the list, if get_close_matches returns more than one match, use it, if not keep the word as is. This partly worked, but it can suggest apple for appel, then appel for apple, these words would simply switch places and nothing would change.
I would appreciate any pointers, names of libraries, etc.
Note: also in terms of performance, we have a list of 300,000 items, and get_close_matches seems a bit slow. Does anyone know of a C/++ based solution out there?
Thanks,
Note: Further investigation revealed kmedoid is the right algorithm (as well as hierarchical clustering), since kmedoid does not require "centers", it takes / uses data points themselves as centers (these points are called medoids, hence the name). In word grouping case, the medoid would be the representative element of that group / cluster.
You need to normalize the groups. In each group, pick one word or coding that represents the group. Then group the words by their representative.
Some possible ways:
Pick the first encountered word.
Pick the lexicographic first word.
Derive a pattern for all the words.
Pick an unique index.
Use the soundex as pattern.
Grouping the words could be difficult, though. If A is similar to B, and B is similar to C, A and C is not necessarily similar to each other. If B is the representative, both A and C could be included in the group. But if A or C is the representative, the other could not be included.
Going by the first alternative (first encountered word):
class Seeder:
def __init__(self):
self.seeds = set()
self.cache = dict()
def get_seed(self, word):
LIMIT = 2
seed = self.cache.get(word,None)
if seed is not None:
return seed
for seed in self.seeds:
if self.distance(seed, word) <= LIMIT:
self.cache[word] = seed
return seed
self.seeds.add(word)
self.cache[word] = word
return word
def distance(self, s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
for zz in xrange(0,l2):
for sz in xrange(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
return matrix[l2][l1]
import itertools
def group_similar(words):
seeder = Seeder()
words = sorted(words, key=seeder.get_seed)
groups = itertools.groupby(words, key=seeder.get_seed)
return [list(v) for k,v in groups]
Example:
import pprint
print pprint.pprint(group_similar([
'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'
]), width=120)
Output:
[['after'],
['also'],
['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
['back'],
['because'],
['but', 'about', 'get', 'just'],
['first'],
['from'],
['good', 'look'],
['have', 'make', 'give'],
['his', 'her', 'if', 'him', 'its', 'how', 'us'],
['into'],
['know', 'new'],
['like', 'time', 'take'],
['most'],
['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
['only'],
['over', 'our', 'even'],
['people'],
['say', 'she', 'way', 'day'],
['some', 'see', 'come'],
['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
['think'],
['well'],
['what', 'who', 'when', 'than'],
['with', 'will', 'which'],
['work'],
['would', 'could'],
['year', 'your']]
You have to decide in closed matches words, which words you want to use. May be get the first element from the list which get_close_matches is returning, or just use random function on that list and get one element from closed matches.
There must be some sort of rule, for it..
In [19]: import difflib
In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy']
In [21]: a = ['appel', 'apple', 'peach', 'puppy']
In [22]: b = difflib.get_close_matches('ape',a)
In [23]: b
Out[23]: ['apple', 'appel']
In [24]: import random
In [25]: c = random.choice(b)
In [26]: c
Out[26]: 'apple'
In [27]:
Now remove c from the initial list, thats it...
For c++, you can use Levenshtein_distance
Here is another version using Affinity Propagation algorithm.
import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import itertools
words = np.array(
['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'])
print "calculating distances..."
(dim,) = words.shape
f = lambda (x,y): -leven.distance(x,y)
res=np.fromiter(itertools.imap(f, itertools.product(words, words)), dtype=np.uint8)
A = np.reshape(res,(dim,dim))
af = AffinityPropagation().fit(A)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
unique_labels = set(labels)
for i in unique_labels:
print words[labels==i]
Distances had to be converted to similarities, I did that by taking the negative of distance. The output is
['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
['it' 'with' 'at' 'if' 'get' 'its' 'first']
['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
['this' 'his' 'which' 'him']
['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
['would' 'could']
['but' 'out' 'about' 'our' 'most']
['make' 'like' 'time' 'take' 'back']
['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
'even' 'these']
['not' 'no' 'know' 'now' 'how' 'new']
['will' 'people' 'year' 'well']
['say' 'what' 'way' 'want' 'day']
['because']
['as' 'up' 'just' 'use' 'us']
Another method could be using matrix factorization, using SVD. First we create word distance matrix, for 100 words this would be 100 x 100 matrix representating the distance from each word to all other words. Then, SVD is ran on this matrix, the u in the resulting u,s,v can be seen as membership strength to each cluster.
Code
import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import itertools
words = np.array(
['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
'people', 'into', 'year', 'your', 'good', 'some', 'could',
'them', 'see', 'other', 'than', 'then', 'now', 'look',
'only', 'come', 'its', 'over', 'think', 'also', 'back',
'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
'way', 'even', 'new', 'want', 'because', 'any', 'these',
'give', 'day', 'most', 'us'])
print "calculating distances..."
(dim,) = words.shape
f = lambda (x,y): leven.distance(x,y)
res=np.fromiter(itertools.imap(f, itertools.product(words, words)),
dtype=np.uint8)
A = np.reshape(res,(dim,dim))
print "svd..."
u,s,v = lin.svd(A, full_matrices=False)
print u.shape
print s.shape
print s
print v.shape
data = u[:,0:10]
k=KMeans(init='k-means++', k=25, n_init=10)
k.fit(data)
centroids = k.cluster_centers_
labels = k.labels_
print labels
for i in range(np.max(labels)):
print words[labels==i]
def dist(x,y):
return np.sqrt(np.sum((x-y)**2, axis=1))
print "centroid points.."
for i,c in enumerate(centroids):
idx = np.argmin(dist(c,data[labels==i]))
print words[labels==i][idx]
print words[labels==i]
plt.plot(centroids[:,0],centroids[:,1],'x')
plt.hold(True)
plt.plot(u[:,0], u[:,1], '.')
plt.show()
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.plot(u[:,0], u[:,1], u[:,2],'.', zs=0,
zdir='z', label='zs=0, zdir=z')
plt.show()
The result
any
['and' 'an' 'can' 'any']
do
['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how']
when
['who' 'when' 'well']
my
['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use']
your
['for' 'or' 'out' 'about' 'your' 'our']
its
['it' 'his' 'if' 'him' 'its']
could
['would' 'people' 'could']
this
['this' 'think' 'these']
she
['the' 'he' 'she' 'see']
back
['all' 'back' 'want']
one
['of' 'on' 'one' 'only' 'even' 'new']
just
['but' 'just' 'first' 'most']
come
['some' 'come']
that
['that' 'than']
way
['say' 'what' 'way' 'day']
like
['like' 'time' 'give']
in
['in' 'into']
get
['her' 'get' 'year']
because
['because']
will
['with' 'will' 'which']
over
['other' 'over' 'after']
as
['a' 'as' 'at' 'also' 'us']
them
['they' 'there' 'their' 'them' 'then']
good
['not' 'from' 'know' 'good' 'now' 'look' 'work']
have
['have' 'make' 'take']
The selection of k for number of clusters is important, k=25 gives much better results than k=20 for instance.
The code also selects a representative word for each cluster by picking the word whose u[..] coordinate is closest to the cluster centroid.
Here is an approach based on medoids. First install MlPy. On Ubuntu
sudo apt-get install python-mlpy
Then
import numpy as np
import mlpy
class distance:
def compute(self, s1, s2):
l1 = len(s1)
l2 = len(s2)
matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
for zz in xrange(0,l2):
for sz in xrange(0,l1):
if s1[sz] == s2[zz]:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
else:
matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
return matrix[l2][l1]
x = np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])
km = mlpy.Kmedoids(k=3, dist=distance())
medoids,clusters,a,b = km.compute(x)
print medoids
print clusters
print a
print x[medoids]
for i,c in enumerate(x[medoids]):
print "medoid", c
print x[clusters[a==i]]
The output is
[4 3 1]
[0 2]
[2 2]
['puppy' 'peach' 'appel']
medoid puppy
[]
medoid peach
[]
medoid appel
['ape' 'apple']
The bigger word list and using k=10
medoid he
['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
'after' 'get' 'what' 'I']
medoid out
['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
'about']
medoid to
['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
medoid now
['new' 'how' 'know' 'not']
medoid time
['like' 'take' 'come' 'some' 'give']
medoid because
[]
medoid an
['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
'any']
medoid look
['work' 'good']
medoid will
['with' 'well' 'which']
medoid then
['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
'than' 'them']
This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Checking if a string's characters are ascending alphabetically and its ascent is evenly spaced python
I have a list of strings/words:
mylist = ['twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe', '"beware', 'the', 'jabberwock', 'my', 'son', 'the', 'jaws', 'that', 'bite', 'the', 'claws', 'that', 'catch', 'beware', 'the', 'jubjub', 'bird', 'and', 'shun', 'the', 'frumious', 'bandersnatch', 'he', 'took', 'his', 'vorpal', 'sword', 'in', 'hand', 'long', 'time', 'the', 'manxome', 'foe', 'he', 'sought', 'so', 'rested', 'he', 'by', 'the', 'tumtum', 'tree', 'and', 'stood', 'awhile', 'in', 'thought', 'and', 'as', 'in', 'uffish', 'thought', 'he', 'stood', 'the', 'jabberwock', 'with', 'eyes', 'of', 'flame', 'came', 'whiffling', 'through', 'the', 'tulgey', 'wood', 'and', 'burbled', 'as', 'it', 'came', 'one', 'two', 'one', 'two', 'and', 'through', 'and', 'through', 'the', 'vorpal', 'blade', 'went', 'snicker-snack', 'he', 'left', 'it', 'dead', 'and', 'with', 'its', 'head', 'he', 'went', 'galumphing', 'back', '"and', 'has', 'thou', 'slain', 'the', 'jabberwock', 'come', 'to', 'my', 'arms', 'my', 'beamish', 'boy', 'o', 'frabjous', 'day', 'callooh', 'callay', 'he', 'chortled', 'in', 'his', 'joy', '`twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe']
firstly i need to only get the words which have 3 or more characters in them - i assume a for loop for that or something.
then i need to get a list of words which contain only words that increase from left to right alphabetically and are a fixed number apart. (e.g. ('ace', 2) or ('ceg', 2) does not have to be 2) the list also has to be sorted in alphabetical order and each element should be a tuple consisting of the word and character difference.
I think i have to use a for loop but im not sure how to use it in this case and am not sure how to do the second part.
for the list above the answer i should get is:
([])
I do not have the newest version of python.
Any help is greatly appreciated.
You should probably start by learning how to use a for loop. A for loop will get things out of a collection and assign them to a variable:
for letter in "strings are collections":
print letter
Or..
for thing in ['this is a list', 'of', 4, 'things']:
if thing == 4:
print '4 is in the list'
If you're able to do more than this, then try something, figure out where you get stuck, and ask as more specifically what you need help with.
Take this problem in steps
To filter words with length >= 3
[w for w in mylist if len(w) >= 3]
To see if the words are increasing in regular interval? Calculate the difference, between consecutive letters, create a set and check if the length == 1
diff =lambda word:len({ord(n)-ord(c) for n,c in zip(word[1:],word)}) == 1
Now use this new function to filter the remaining words
[w for w in mylist if len(w) >= 3 and diff(w)]