I have a set of words such as this:
mike dc car dc george dc jerry dc
The words (mike, dc, george, dc, ...) are separated by single spaces. How can I group them into two-word sets and separate the sets with tabs? I would like to print the result to standard output (stdout).
EDIT
I tried using this:
print '\t'.join(hypoth), but it doesn't really cut it: every word ends up tab-delimited. Ideally, the two words in each set would be separated by a space, and the two-word sets would be tab-delimited.
Assuming you have
two_word_sets = ["mike dc", "car dc", "george dc", "jerry dc"]
use
print "\t".join(two_word_sets)
or, for Python 3:
print("\t".join(two_word_sets))
to print the tab-separated list to stdout.
If you only have
mystr = "mike dc car dc george dc jerry dc"
you can calculate two_word_sets as follows:
words = mystr.split()
two_word_sets = [" ".join(tup) for tup in zip(words[::2], words[1::2])]
This might look a bit complicated, but note that zip(words[::2], words[1::2]) is just [('mike', 'dc'), ('car', 'dc'), ('george', 'dc'), ('jerry', 'dc')]. The rest of the list comprehension joins each pair with a space.
Note that for very long lists/input strings in Python 2 you would use izip from itertools, because zip actually creates a list of tuples whereas izip returns an iterator. In Python 3, zip is already lazy and izip no longer exists.
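For Python 3, the same pairing could be sketched roughly as below. The function name pair_words is made up for illustration, and the sketch assumes an even number of words (zip silently drops a trailing unpaired word).

```python
def pair_words(text):
    # Hypothetical helper, not part of the original answer.
    words = text.split()
    # zip stops at the shorter slice, so an odd trailing word is dropped.
    return [" ".join(pair) for pair in zip(words[::2], words[1::2])]

two_word_sets = pair_words("mike dc car dc george dc jerry dc")
# Words inside a set are space-separated, sets are tab-separated.
print("\t".join(two_word_sets))
```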
You can do this in 1-2 lines, but it is easiest to read if you break it up:
words = "mike dc car dc george dc jerry dc"
wlist = words.split()
mystr = ""
for i in range(0, len(wlist), 2):
    mystr = "%s%s %s\t" % (mystr, wlist[i], wlist[i+1])
print mystr
I'm attempting to capitalize all words in a section of text that only appear once. I have the bit that finds which words only appear once down, but when I go to replace the original word with the .upper version, a bunch of other stuff gets capitalized too. It's a small program, so here's the code.
from collections import Counter
from string import punctuation
path = input("Path to file: ")
with open(path) as f:
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.replace(")", " ").replace("(", " ")
                          .replace(":", " ").replace("", " ").split())
wordlist = open(path).read().replace("\n", " ").replace(")", " ").replace("(", " ").replace("", " ")
unique = [word for word, count in word_counts.items() if count == 1]
for word in unique:
    print(word)
    wordlist = wordlist.replace(word, str(word.upper()))
print(wordlist)
The output should be 'Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.', as sojournings is the first word that only appears once. Instead, it outputs 'GenesIs 37:1 Jacob lIved In the land of hIs FATher's SOJOURNINGS, In the land of Canaan.' Because some of the unique words also appear inside other words, it capitalizes those parts as well.
Any ideas?
I rewrote the code pretty significantly since some of the chained replace calls might prove to be unreliable.
import string
# The sentence.
sentence = "Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan."
rm_punc = sentence.translate(None, string.punctuation) # remove punctuation (Python 2 form of str.translate)
words = rm_punc.split(' ') # split spaces to get a list of words
# Find all unique word occurrences.
single_occurrences = []
for word in words:
    # if word only occurs 1 time, append it to the list
    if words.count(word) == 1:
        single_occurrences.append(word)
# For each unique word, find its index and capitalize the letter at that index
# in the initial string (the letter at that index is also the first letter of
# the word). Note that strings are immutable, so we are actually creating a new
# string on each iteration. Also, sometimes small words occur inside of other
# words, e.g. 'an' inside of 'land'. In order to make sure that our call to
# `index()` doesn't find these small words, we keep track of `start` which
# makes sure we only ever search from the end of the previously found word.
start = 0
for word in single_occurrences:
    try:
        word_idx = start + sentence[start:].index(word)
    except ValueError:
        # Could not find word in sentence. Skip it.
        pass
    else:
        # Update counter.
        start = word_idx + len(word)
        # Rebuild sentence with capitalization.
        first_letter = sentence[word_idx].upper()
        sentence = sentence[:word_idx] + first_letter + sentence[word_idx+1:]
print(sentence)
Text replacement by patterns calls for regex.
Your text is a bit tricky; you have to
remove digits
remove punctuation
split into words
care about capitalisation: 'It's' vs 'it's'
only replace full matches: 'remote' vs 'mote' when replacing 'mote'
etc.
This should do this - see comments inside for explanations:
bible.txt is from your link
from collections import Counter
from string import punctuation , digits
import re
from collections import defaultdict
with open(r"SO\AllThingsPython\P4\bible.txt") as f:
    s = f.read()
# get a set of unwanted characters and clean the text
ps = set(punctuation + digits)
s2 = ''.join( c for c in s if c not in ps)
# split into words
s3 = s2.split()
# create a set of all capitalizations of each word
repl = defaultdict(set)
for word in s3:
    repl[word.upper()].add(word) # e.g. {..., 'IN': {'In', 'in'}, 'THE': {'The', 'the'}, ...}
# count all words _upper case_ and use those that only occur once
single_occurence_upper_words = [w for w,n in Counter( (w.upper() for w in s3) ).most_common() if n == 1]
text = s
# now the replace part - for all upper single words
for upp in single_occurence_upper_words:
    # for all occurring capitalizations in the text
    for orig in repl[upp]:
        # use regex replace to find the original word from our repl dict with
        # space/punctuation before/after it and replace it with the uppercase word
        text = re.sub(f"(?<=[{punctuation} ])({orig})(?=[{punctuation} ])",upp, text)
print(text)
Output (shortened):
Genesis 37:1 Jacob lived in the land of his father's SOJOURNINGS, in the land of Canaan.
2 These are the GENERATIONS of Jacob.
Joseph, being seventeen years old, was pasturing the flock with his brothers. He was a boy with the sons of Bilhah and Zilpah, his father's wives. And Joseph brought a BAD report of them to their father. 3 Now Israel loved Joseph more than any other of his sons, because he was the son of his old age. And he made him a robe of many colors. [a] 4 But when his brothers saw that their father loved him more than all his brothers, they hated him
and could not speak PEACEFULLY to him.
<snipp>
The regex uses lookahead '(?=...)' and lookbehind '(?<=...)' syntax to make sure we replace only full words; see regex syntax.
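A shorter variant of the whole-word idea, as a rough sketch only: instead of lookarounds it passes a callback to re.sub, so each matched word is looked up in a Counter and uppercased only if it occurs once. This swaps in a different technique from the answer above. The function name upper_single_words and the pattern [A-Za-z']+ (letters and apostrophes count as one word) are assumptions made for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z']+")  # assumed definition of a "word"

def upper_single_words(text):
    # Count words case-insensitively so 'In' and 'in' are the same word.
    counts = Counter(w.lower() for w in WORD.findall(text))

    def repl(match):
        word = match.group(0)
        # Only whole matched words are ever replaced, never substrings.
        return word.upper() if counts[word.lower()] == 1 else word

    return WORD.sub(repl, text)

sample = ("Genesis 37:1 Jacob lived in the land of his father's "
          "sojournings, in the land of Canaan.")
print(upper_single_words(sample))
```

On a single sentence almost every word is unique, so most of it gets uppercased; on the full text only the genuinely rare words would change.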
I have four speakers like this:
Team_A=[Fred,Bob]
Team_B=[John,Jake]
They are having a conversation and it is all represented by a string, ie. convo=
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine
How do I disassemble and reassemble the string so I can split it into 2 strings: 1 string of what Team_A said, and 1 string of what Team_B said?
output: team_A_said="hello how is it going?", team_B_said="hi we are doing fine"
The lines don't matter.
I have this awful find... then slice code that is not scalable. Can someone suggest something else? Any libraries to help with this?
I didn't find anything in nltk library
This code assumes that the contents of convo strictly conform to the
name\nstuff they said\n\n
pattern. The only tricky code it uses is zip(*[iter(lines)]*3), which creates a list of triplets of strings from the lines list. For a discussion on this technique and alternate techniques, please see How do you split a list into evenly sized chunks in Python?.
#!/usr/bin/env python
team_ids = ('A', 'B')
team_names = (
('Fred', 'Bob'),
('John', 'Jake'),
)
#Build a dict to get team ID from person name
teams = {}
for team_id, names in zip(team_ids, team_names):
    for name in names:
        teams[name] = team_id

#Each block in convo MUST consist of <name>\n<one line of text>\n\n
#Do NOT omit the final blank line at the end
convo = '''Fred
hello

John
hi

Bob
how is it going?

Jake
we are doing fine

'''

lines = convo.splitlines()

#Group lines into <name><text><empty> chunks
#and append the text into the appropriate list in `said`
said = {'A': [], 'B': []}
for name, text, _ in zip(*[iter(lines)]*3):
    team_id = teams[name]
    said[team_id].append(text)

for team_id in team_ids:
    print 'Team %s said: %r' % (team_id, ' '.join(said[team_id]))
output
Team A said: 'hello how is it going?'
Team B said: 'hi we are doing fine'
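To see why zip(*[iter(lines)]*3) groups the lines in threes, here is the idiom in isolation, written for Python 3. The sample list is a shortened, made-up version of the conversation.

```python
lines = ["Fred", "hello", "", "John", "hi", ""]

# One iterator, referenced three times: every tuple that zip builds
# pulls three consecutive items from that same iterator.
it = iter(lines)
triplets = list(zip(it, it, it))  # same effect as zip(*[iter(lines)]*3)

for name, text, _blank in triplets:
    print(name, "said", repr(text))
```

If the number of lines is not a multiple of three, zip drops the incomplete final chunk, which is why the trailing blank line matters.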
You could use a regular expression to split up each entry. itertools.ifilter can then be used to extract the required entries for each conversation.
import itertools
import re
def get_team_conversation(entries, team):
    return [e for e in itertools.ifilter(lambda x: x.split('\n')[0] in team, entries)]
Team_A = ['Fred', 'Bob']
Team_B = ['John', 'Jake']
convo = """
Fred
hello
John
hi
Bob
how is it going?
Jake
we are doing fine"""
find_teams = '^(' + '|'.join(Team_A + Team_B) + r')$'
entries = [e[0].strip() for e in re.findall('(' + find_teams + '.*?)' + '(?=' + find_teams + r'|\Z)', convo, re.S+re.M)]
print 'Team-A', get_team_conversation(entries, Team_A)
print 'Team-B', get_team_conversation(entries, Team_B)
Giving the following output:
Team-A ['Fred\nhello', 'Bob\nhow is it going?']
Team-B ['John\nhi', 'Jake\nwe are doing fine']
It is a problem of language parsing.
Answer is a Work in progress
Finite state machine
A conversation transcript can be understood by imagining it parsed by an automaton with the following states:
[start] ---> [Name]----> [Text]-+----->[end]
              ^                 |
              |                 | (whitespaces)
              +-----------------+
You can parse your conversation by making it follow that state machine. If your parsing succeeds (ie. follows the states to end of text) you can browse your "conversation tree" to derive meaning.
Tokenizing your conversation (lexer)
You need functions to recognize the name state. This is straightforward
name = (Team_A | Team_B) + '\n'
Conversation alternation
In this answer, I did not assume that a conversation involves alternating between the people speaking, like this conversation would :
Fred # author 1
hello
John # author 2
hi
Bob # author 3
how is it going ?
Bob # ERROR : author 3 again !
are we still on for saturday, Fred ?
This might be problematic if your transcript concatenates answers from the same author.
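The state machine described above could be sketched in Python 3 roughly as follows. This is only an illustration under stated assumptions: the names TEAM_A, TEAM_B and split_by_team are made up here, a line counts as a [Name] only if it exactly matches a team member, and every other non-blank line is treated as [Text] belonging to the most recent speaker.

```python
TEAM_A = {"Fred", "Bob"}
TEAM_B = {"John", "Jake"}

def split_by_team(convo):
    # Hypothetical helper: walk the transcript one line at a time.
    said = {"A": [], "B": []}
    current = None  # team of the speaker whose text is being read
    for raw in convo.splitlines():
        line = raw.strip()
        if not line:
            continue                    # whitespace: state is unchanged
        if line in TEAM_A:
            current = "A"               # [Name] state
        elif line in TEAM_B:
            current = "B"               # [Name] state
        elif current is None:
            raise ValueError("text before any speaker name: %r" % line)
        else:
            said[current].append(line)  # [Text] state
    return {team: " ".join(parts) for team, parts in said.items()}

convo = "Fred\nhello\nJohn\nhi\nBob\nhow is it going?\nJake\nwe are doing fine\n"
print(split_by_team(convo))
```

Because the speaker only changes when a name line is seen, consecutive blocks from the same author and multi-line texts are handled as well. A text line that happens to consist solely of a speaker's name would be misread as a name.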
I was assigned to group anagrams together lexicographically.
Below is one of the test cases:
Input:
eat tea tan ate nat bat
Output:
ate eat tea
bat
nat tan
However, I keep getting the following output, where the anagrams are wrapped in a list and the order in which the lines get printed varies every time:
['ate', 'eat', 'tea']
['nat', 'tan']
['bat']
or
['nat', 'tan']
['bat']
['ate', 'eat', 'tea']
or
['ate', 'eat', 'tea']
['bat']
['nat', 'tan']
How do I fix this so that it prints each group without the surrounding list, and ideally in the right order?
This is what I have done so far:
import sys
from collections import *
def ComputeAnagrams(string):
    d = defaultdict(list)
    for word in string:
        key = ''.join(sorted(word))
        d[key].append(word)
    return d
def main():
    for string in sys.stdin:
        stringList = string.split()
        if len(stringList) == 0:
            break
        d = ComputeAnagrams(stringList)
        for key,anagrams in d.items():
            if len(anagrams) >=1:
                print(sorted(anagrams))
        print ('')
main()
Note: the machine that runs this program reads input from stdin/keyboard and prints the output to the console (stdout).
I believe the issue is -
print(sorted(''.join(anagrams)))
You are calling sorted after joining the list into a string; in that case, sorted returns a list of characters in sorted order (I guess that is the output you are currently getting).
If you want the elements in sorted order, sorted should be applied to the anagrams list, not to the string produced by joining. Example -
print(', '.join(sorted(anagrams)))
I am also using ', ' to join the strings, so that ', ' acts as the separator; otherwise the output would be all the strings run together without any spaces in between. You can use any other separator you want.
Demo -
After above change -
Input -
eat tea tan ate nat bat
Output -
bat
ate, eat, tea
nat, tan
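To also get the groups themselves in a fixed order, separated by spaces as in the expected output from the question, one possible sketch is to sort each group and then sort the list of groups. The function name group_anagrams is made up for this example.

```python
from collections import defaultdict

def group_anagrams(words):
    # Hypothetical helper: words with the same sorted letters share a key.
    groups = defaultdict(list)
    for word in words:
        groups["".join(sorted(word))].append(word)
    # Sort the words inside each group, then sort the groups themselves.
    return sorted(sorted(group) for group in groups.values())

for group in group_anagrams("eat tea tan ate nat bat".split()):
    print(" ".join(group))
```

Sorting the list of groups compares them by their first word, which is what makes the line order stable between runs.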
I am following a tutorial to identify and print the words in between a particular string;
f is the string Mango grapes Lemon Ginger Pineapple
def findFruit(f):
    global fruit
    found = [re.search(r'(.*?) (Lemon) (.*?)$', word) for word in f]
    for i in found:
        if i is not None:
            fruit = i.group(1)
            fruit = i.group(3)
grapes and Ginger will be output when I print fruit. However, I want the output to look like "grapes" # "Ginger" (note the "" and # sign).
You can use string formatting here with the use of the str.format() function:
def findFruit(f):
    found = re.search(r'.*? (.*?) Lemon (.*?) .*?$', f)
    if found is not None:
        print '"{}" # "{}"'.format(found.group(1), found.group(2))
Or, a lovely solution Kimvais posted in the comments:
print '"{0}" # "{1}"'.format(*found.groups())
I've made some edits. Firstly, a for-loop isn't needed here (nor is a list comprehension). You're iterating through each letter of the string instead of each word. Even then, you don't want to iterate through each word.
I also changed your regular expression (Do note that I'm not that great in regex, so there probably is a better solution).
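A Python 3 sketch of the same idea, returning the formatted string instead of printing it. The function name format_neighbours and the keyword parameter are made up for illustration, and the pattern uses \S+ (a run of non-space characters) for the neighbouring words, which is a slightly different regex from the one above.

```python
import re

def format_neighbours(text, keyword="Lemon"):
    # Hypothetical helper: capture the word before and the word after keyword.
    found = re.search(r"(\S+) " + re.escape(keyword) + r" (\S+)", text)
    if found is None:
        return None  # keyword missing, or it has no neighbour on one side
    return '"{0}" # "{1}"'.format(*found.groups())

print(format_neighbours("Mango grapes Lemon Ginger Pineapple"))
```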
I have the following text file:
"""[' Hoffa remains Allen Iverson Bill Cosby WWE Payback results Juneteenth shooting Miss Utah flub Octopus pants Magna Carta Holy Grail China supercomputer Sibling bullying ']"""
I would like to create a list from it and apply a function to each name
this is my code so far:
listing = open(fileName, 'r')
lines = listing.read().split(',')
for line in lines:
    #Function
Strip out characters like """['] first from the start and end of the string using str.strip, then split the resulting string at six spaces (' '*6). Splitting returns a list, but some items still have trailing and leading whitespace; you can remove it using str.strip again.
with open(fileName) as f:
    lis = [x.strip() for x in f.read().strip('\'"[]').split(' '*6)]
print lis
...
['Hoffa remains', 'Allen Iverson', 'Bill Cosby', 'WWE Payback results', 'Juneteenth shooting', 'Miss Utah flub', 'Octopus pants', 'Magna Carta Holy Grail', 'China supercomputer', 'Sibling bullying']
Applying function to the above list:
List comprehension:
[func(x) for x in lis]
map:
map(func, lis)
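Both forms side by side, as a small Python 3 sketch. The function shout is a made-up stand-in for whatever you want to apply to each name, and the list is shortened.

```python
def shout(name):
    # Hypothetical stand-in for the function applied to each name.
    return name.upper()

lis = ['Hoffa remains', 'Allen Iverson', 'Bill Cosby']

via_comprehension = [shout(x) for x in lis]
# In Python 3, map returns a lazy iterator, so wrap it in list() to get a list.
via_map = list(map(shout, lis))

print(via_comprehension)
print(via_map)
```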
I would first refer you to some other similar posts: similar post
Also, you can't use a comma here: there is no comma between the items you want to separate. split breaks the string into substrings at the delimiter you give it, which here is a comma ','.