Location String Munging in Python - python

(Python 2.7, using datasets from http://www.policemisconduct.net/databases/, 2009, 2010)
[[YOU CAN SKIP DOWN TO ____ IF YOU DON'T CARE ABOUT NATURE OF MY DATA]]
I'm fairly new to Python and programming in general - I'd like someone to explain the results I'm getting from my loop.
I'm trying to loop through the 'location' column of a police misconduct dataset. Its format is as follows:
city, state, USA
(I'm aware the URL above has the data broken into separate 2009 and 2010 files, where the location is already split into two separate columns, as well as a Google Fusion Table to which I am referring. This question is specifically about how to make A look like B, as well as the errors I'm throwing and why.)
Allow me to present a simplified version of my question. Consider the following five locations in test.csv:
Tallahassee, Florida, USA
Denver, Colorado, USA
Watertown, New York, USA
Kalamazoo, Michigan, USA
Toronto, Ontario, Canada
I run the following script:
import pandas as pd

def censor(text, word):
    texts = str(text)
    words = texts.split()  # creates the list
    x = "" * len(word)  # creates the stars with correct length
    for i in range(len(words)):
        if words[i] == word:
            words[i] = x  # replace
    return "".join(words)

places = pd.read_csv("test.csv")  # the 5-place list above
censor(places, "USA")
And I get the following output:
'Tallahassee,Florida,0Denver,Colorado,1Watertown,NewYork,2Kalamazoo,Michigan,3Toronto,Ontario,Canada'
Obviously, the numbers shouldn't be there, and it's all one big long string (but using an array [] instead of a "" string throws errors when trying to use the .split method...). Even the spaces I want were dropped.
Adding an alpha character to the join string in the return line as I tinkered made me even more confused about the loop I had written... (so that the return line now reads return "a".join(words)), and I get:
'Tallahassee,aFlorida,aa0aDenver,aColorado,aa1aWatertown,aNewaYork,aa2aKalamazoo,aMichigan,aa3aToronto,aOntario,Canada'
...and the only thing that does well is make me sound like Luigi when I read it.
How can I make a) two separate n x 1 arrays for City and State, where n is the number of observations, and b) one n x 2 array with analogous columns?
Thanks! (And sorry for the n3wb question.)

I suggest this.
def censor_word(word, word_to_censor):
    word = word.strip()
    if word.lower() == word_to_censor.lower():
        return '*' * len(word)
    else:
        return word

def censor(line, word_to_censor):
    words = str(line).split(',')  # creates the list
    words = [censor_word(w, word_to_censor) for w in words]
    return ", ".join(words)

with open("test.csv", "rt") as f:
    for line in f:
        print(censor(line, "USA"))
Sorry, I have to run out the door. Usually I explain the code but cannot right now. If you have questions, I will answer them later.
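For reference, run against the five sample lines above, this should print something like (the non-USA row is left untouched):

Tallahassee, Florida, ***
Denver, Colorado, ***
Watertown, New York, ***
Kalamazoo, Michigan, ***
Toronto, Ontario, Canada

As for the stray 0/1/2/3 in your original output: I suspect they are the DataFrame's row index. str(places) renders the whole frame, index included, and your whitespace split plus "".join then mashes every token together and drops the spaces you wanted to keep.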

If you want to process this as a string, there's a much easier way to do your "censorship". You can use the replace method of a string to remove all instances of "USA" (or any other substring for that matter).
f = open('places.csv')
text = str(f.read())
f.close()
places = text.replace(', USA','')
It's then very simple to recreate your dataframe using string operations:
t1 = places.split('\n')
t2 = [p.replace(' ','').strip() for p in t1]
final_places = [p.split(',') for p in t2]
This gives you the result as a list of lists:
[['Tallahassee', 'Florida'], ['Denver', 'Colorado'], ['Watertown', 'NewYork'], ['Kalamazoo', 'Michigan'], ['Toronto', 'Ontario', 'Canada']]
To get cities/states:
cities = [p[0] for p in final_places]
states = [p[1] for p in final_places]
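If you'd rather stay in pandas (the question already imports it as pd), a minimal sketch along these lines should give you both the separate city/state columns and the n x 2 frame. The column names here ("city", "state", "country") are just placeholders I'm assuming, not anything in the original dataset:

import pandas as pd

# Assumes test.csv has no header row and three comma-separated fields per line.
places = pd.read_csv("test.csv", header=None,
                     names=["city", "state", "country"],
                     skipinitialspace=True)

cities = places["city"]                 # n x 1
states = places["state"]                # n x 1
city_state = places[["city", "state"]]  # n x 2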

Related

Deciphering script in Python issue

Cheers, I am looking for help with my small Python project. The problem says that the program has to be able to decipher a monoalphabetic substitution cipher, given a complete database of words that will definitely appear (at least once) in the ciphered text.
I have tried to create such a database of known words:
lst_sample = []
n = int(input('Number of words in database: '))
for i in range(n):
    x = input()
    lst_sample.append(x)
The way I am trying to "decipher" is to look at each word's structure: I assign numbers to the distinct letters in the order they appear in the word (e.g. feed = 0112 and hood = 0112 are the same, because each is a combination of three different letters in the same arrangement). I am using the function pattern() for this:
def pattern(word):
    nextNum = 0
    letterNums = {}
    wordPattern = []
    for letter in word:
        if letter not in letterNums:
            letterNums[letter] = str(nextNum)
            nextNum += 1
        wordPattern.append(letterNums[letter])
    return ''.join(wordPattern)
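For example, the function maps the words mentioned in this question like this (note that jane and word collapse to the same pattern, which is exactly the problem described below):

>>> pattern("feed")
'0112'
>>> pattern("hood")
'0112'
>>> pattern("jane"), pattern("word")
('0123', '0123')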
Right after that, I build the database of ciphered words:
lst_en = []
q = input('Insert ciphered words: ')
if q == '':
    print(lst_en)
else:
    lst_en.append(q)
With both databases in place, I can finally run the deciphering process:
for i in lst_en:
    for q in lst_sample:
        x = q
        word = i
        if pattern(x) == pattern(word):
            print(x)
            print(word)
            print()
If the words in lst_sample have different lengths (e.g. food, car, yellow), there is no problem matching them to the decrypted words; even when they have the same length, I can still tell them apart by their different structures (e.g. puff, sort).
The main problem, which I am not able to solve, comes when two words have the same length and the same structure (e.g. jane, word).
I have no idea how to solve this while keeping the script architecture described above. Is there any way it could be solved with another if statement or something similar? Can I use the information that the words in lst_sample will definitely be in the ciphered text?
Thanks for all help!

Deleting empty records from a file starting with specific characters

I have a file containing the DBLP dataset, which consists of bibliographic data in computer science. I want to delete some of the records with missing information. For example, I want to delete records with a missing venue. In this dataset, the venue follows the '#c' marker.
In my code, I split the file into documents at the manuscript title marker ("#*"). Now I am trying to delete records without a venue name.
Input Data:
#*Toward Connectionist Parsing.
##Steven L. Small,Garrison W. Cottrell,Lokendra Shastri
#t1982
#c
#index14997
#*A Framework for Reinforcement Learning on Real Robots.
##William D. Smart,Leslie Pack Kaelbling
#t1998
#cAAAI/IAAI
#index14998
#*Efficient Goal-Directed Exploration.
##Yury V. Smirnov,Sven Koenig,Manuela M. Veloso,Reid G. Simmons
#t1996
#cAAAI/IAAI, Vol. 1
#index14999
My code:
inFile = open('lorem.txt', 'r')
Data = inFile.read()
data = Data.split("#*")
ouFile = open('testdata.txt', 'w')
for idx, word in enumerate(data):
    print("i = ", idx)
    if not ('#!' in data[idx]):
        del data[idx]
        idx = idx - 1
    else:
        ouFile.write("#*" + data[idx])
ouFile.close()
inFile.close()
Expected Output:
#*A Framework for Reinforcement Learning on Real Robots.
##William D. Smart,Leslie Pack Kaelbling
#t1998
#cAAAI/IAAI
#index14998
#*Efficient Goal-Directed Exploration.
##Yury V. Smirnov,Sven Koenig,Manuela M. Veloso,Reid G. Simmons
#t1996
#cAAAI/IAAI, Vol. 1
#index14999
Actual Output:
An empty output file
str.find will give you the index of a sub-string, or -1 if the sub-string does not exist.
DOCUMENT_SEP = '#*'

with open('lorem.txt') as in_file:
    documents = in_file.read().split(DOCUMENT_SEP)

with open('testdata.txt', 'w') as out_file:
    for document in documents:
        i = document.find('#c')
        if i < 0:  # no "#c"
            continue
        # "#c" exists, but no trailing venue information
        if not document[i+2:i+3].strip():
            continue
        out_file.write(DOCUMENT_SEP)
        out_file.write(document)
Instead of closing manually, I used a with statement.
No need to use an index; deleting an item in the middle of a loop makes the index bookkeeping complicated.
Using a regular expression like #c[A-Z] would make the code even simpler (a sketch follows).
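That regex idea could look something like this (just a sketch, assuming a real venue never starts with whitespace):

import re

DOCUMENT_SEP = '#*'
# A "#c" immediately followed by a non-whitespace character means the venue is present.
venue_re = re.compile(r'^#c\S', re.MULTILINE)

with open('lorem.txt') as in_file:
    documents = in_file.read().split(DOCUMENT_SEP)

with open('testdata.txt', 'w') as out_file:
    for document in documents:
        if venue_re.search(document):
            out_file.write(DOCUMENT_SEP + document)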
The reason your code wasn't working is that there's no #! in any of your entries.
If you want to exclude entries with empty #c fields, you can try this:
inFile = open('lorem.txt', 'r')
Data = inFile.read()
data = Data.split("#*")
ouFile = open('testdata.txt', 'w')
for idx, word in enumerate(data):
    print("i = ", idx)
    if not '#c\n' in data[idx] and len(word) > 0:
        ouFile.write("#*" + data[idx])
ouFile.close()
inFile.close()
In general, try not to delete elements of a list you're looping through. It can cause a lot of unexpected drama.
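If you do need to drop entries, it is usually safer to build a new list containing only the records you want to keep, for example (using the same condition as above):

data = [d for d in data if '#c\n' not in d and len(d) > 0]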

Python - counting words without utilizing built in functions or importing libraries

I'm having trouble with this assignment... I'm attempting to count multiple occurrences of words in a text file.
# most common word
fh = open("romeo.txt")
master_list = fh.read().split()
print(len(master_list))
compare_list = []
count_list = []
for word in master_list:
    if word not in compare_list:
        compare_list.append(word)
        count_list.append(1)
    else:
        for rw in range(len(compare_list)):
            for r in master_list:
                if compare_list[rw] == r:
                    count_list[rw] += 1
print(len(count_list))
print(count_list)
This is the data from the text file romeo.txt:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
You could try using a dictionary instead of a list. As you see each word, check whether it's already in the dictionary; if not, add it with a count of 1, otherwise increment its value.
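A minimal sketch of that idea, reusing master_list from the question:

word_counts = {}
for word in master_list:
    if word not in word_counts:
        word_counts[word] = 1
    else:
        word_counts[word] += 1
print(word_counts)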
I think your problem is in this block
if word not in compare_list:
    compare_list.append(word)
    count_list.append(1)
else:
    for rw in range(len(compare_list)):
        for r in master_list:
            if compare_list[rw] == r:
                count_list[rw] += 1
The else clause is executed when word is found in compare_list. So the behaviour should be: find its index, then increment the corresponding index in count_list. You start by iterating through the indexes of compare_list, which is good, but then instead of comparing each entry to the current word, you start another iteration through master_list. Take out the for r in master_list: loop and just compare compare_list[rw] to word, and I think you'll get there.
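In other words, something like this (a sketch of just that one change, leaving the rest of your code as-is):

if word not in compare_list:
    compare_list.append(word)
    count_list.append(1)
else:
    for rw in range(len(compare_list)):
        if compare_list[rw] == word:
            count_list[rw] += 1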
As others have noted, a dict would be a better structure for storing results - but it's hard to tell from your question if this would be "against the rules".

Split txt file into multiple new files with regex

I am calling on the collective wisdom of Stack Overflow because I am at my wits' end trying to figure out how to do this, and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The letters are all formatted in relatively the same way:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word. (for example the answers posted here how to split single txt file into multiple txt files by Python and here Python read through file until match, read until next pattern). It all seems to not work when I have to adjust it to accept my regex of all capital words.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re

thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")
with open(thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)
totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)", "", thefile)
while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
            if pattern.search(line):
                doIHaveToCopyTheLine = True
                ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string

def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
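To run it, a call like the following should be all you need (the filename here is just a placeholder for your actual file):

split_letters('letters_to_the_editor.txt')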
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re

smallfile = None
buf = ""
with open('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)', line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)', line)
            smallfile_name = '{}.txt'.format(match[0])
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
if smallfile:
    smallfile.close()
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?

Index error in loop to separate body of text by speaker in python

I've got a corpus of text which takes the following form:
JOHN: Thanks for coming, everyone!
(EVERYONE GRUMBLES)
ROGER: They're really glad to see you, huh?
DAVIS: They're glad to see the both of you.
In order to analyze the text, I want to divide it into chunks by speaker. I want to retain John and Roger, but not Davis. I also want to find the number of times certain phrases like (EVERYONE GRUMBLES) occur during each person's speech.
My first thought was to use NLTK, so I imported it and used the following code to remove all the punctuation and tokenize the text, so that each word within the corpus becomes an individual token:
import nltk
from nltk.tokenize import RegexpTokenizer

f = open("text.txt")
raw_t = f.read()
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(raw_t.decode('utf-8'))
text = nltk.Text(tokens)
Then, I thought that I could create a global list, within which I would include all of the instances of John and Roger speaking.
I figured that I'd first check whether each word in the text corpus was upper case and in the list of acceptable names, and if it was, I'd examine every subsequent word until the next occurrence of a term that was both upper case and found in the list of acceptable names. I'd then take all the words from the initial instance of a speaker's name through to one word before the next speaker's name, and add this series of tokens/words to my global list.
I've written:
k = 0
i = 0
j = 1
names = ["JOHN", "ROGER"]
global_list = []
for i in range(len(text)):
    if (text[i].isupper() and text[i] in names):
        for j in range(len(text)-i):
            if (text[i+j].isupper() and text[i+j] in names):
                global_list[k] = text[i:(j-1)]
                k += 1
            else:
                j += 1
    else:
        i += 1
Unfortunately, this doesn't work, and I get the following index error:
IndexError Traceback (most recent call last)
<ipython-input-49-97de0c68b674> in <module>()
6 for j in range(len(text)-i):
7 if (text[i+j].isupper() and text[i+j] in names):
----> 8 list_speeches[k] = text[i:(j-1)]
9 k+=1
10 else: j+=1
IndexError: list assignment index out of range
I feel like I'm screwing up something really basic here, but am not exactly sure why I'm getting this index error. Can anyone shed some light on this?
Break the text into paragraphs with re.split(r"\n\s*\n", text), then examine the first word of each paragraph to see who is speaking. And don't worry about nltk: you haven't used it yet, and you don't need to.
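A sketch of that approach (assuming the real corpus separates speaker turns with blank lines, and keeping only the speakers you care about):

import re

names = ["JOHN", "ROGER"]

with open("text.txt") as f:
    raw_t = f.read()

# Collect (speaker, chunk) pairs for the speakers we want to keep.
speeches = []
for chunk in re.split(r"\n\s*\n", raw_t):
    chunk = chunk.strip()
    if not chunk:
        continue
    speaker = chunk.split()[0].rstrip(":")
    if speaker in names:
        speeches.append((speaker, chunk))

# Count how often a given phrase occurs in each retained chunk.
grumble_counts = [chunk.count("(EVERYONE GRUMBLES)") for speaker, chunk in speeches]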
Ok, figured this out after a bit of digging around. The initial loop mentioned in the question had a whole bunch of extraneous content, so I simplified it to:
names = ["JOHN", "ROGER"]
global_list = []
i = 0
for i in range(len(text)):
    if (text[i].isupper()) and (text[i] in names):
        j = 0
        while (text[i+j].islower()) and (text[i+j] not in names):
            j += 1
        global_list.append(text[i:(j-1)])
This generated a list, although, problematically, each item in this list was made up of words starting from the name through to the end of the document. Because each item began at the appropriate name while ending at the last word of the text corpus, it was easy to derive the length of each segment by subtracting the length of the following segment from it:
x = 1
new_list = range(len(global_list)-1)
for x in range(len(global_list)):
    if x == len(global_list):
        new_list[x-1] = global_list[x]
    else:
        new_list[x-1] = global_list[x][:(len(global_list[x])-len(global_list[x+1]))]
(x was set to 1 because the original code gave me the first speaker's content twice).
This wasn't in the least bit pretty, but it ended up working. If anyone's got a prettier way of doing it — and I'm sure it exists, since I think I've messed up the initial loop — I'd love to see it.
