How do I separate words from a text file into single letters?
I'm given a text for which I have to calculate the frequency of the letters. However, I can't seem to figure out how to separate the words into single letters so that I can count the unique elements and, from there, determine their frequency.
I apologize for not having the text in a text file, but the following is the text I'm given:
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book,' thought alice without pictures or conversation?'
so she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy- chain would be worth the trouble of getting up and picking the daisies, when suddenly a white rabbit with pink eyes ran close by her.
there was nothing so very remarkable in that; nor did alice think it so very much out of the way to hear the rabbit say to itself, `oh dear! oh dear! i shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the rabbit actually took a watch out of its waistcoat- pocket, and looked at it, and then hurried on, alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
in another moment down went alice after it, never once considering how in the world she was to get out again.
the rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that alice had not a moment to think about stopping herself before she found herself falling down a very deep well.
I'm supposed to separate the text into 26 variables, a-z, and then determine their frequency, which is given as the following:
This is the code I have tried so far:
# Check where the current file you are working in, is saved.
import os
os.getcwd()
#print(os.getcwd())
# 1. Change the current working directory to the place where you have saved the file.
os.chdir('C:/Users/Annik/Desktop/DTU/02633 Introduction to programming/Datafiles')
os.getcwd()
#print(os.chdir('C:/Users/Annik/Desktop/DTU/02633 Introduction to programming/Datafiles'))
# 2. Listing the content of current working directory type
os.listdir(os.getcwd())
#print(os.listdir(os.getcwd()))
#importing the file
filein = open("small_text.txt", "r") #opens the file for reading
lines = filein.readlines() #reads all lines into an array
smalltxt = "".join(lines) #Joins the lines into one big string.
import numpy as np
def letterFrequency(filename):
    # counts the frequency of letters in a text
    unique_elems, counts = np.unique(separate_words, return_counts=True)
    return unique_elems
I just don't know how to separate the letters in the text, so I can count the unique elements.
You can use collections.Counter to get your frequencies directly from the text.
Then just select the 26 keys you are interested in, because the counter will also include whitespace and other characters.
from collections import Counter
[...]
with open("small_text.txt", "r") as file:
text = file.read()
keys = "abcdefghijklmnopqrstuvwxyz"
c = Counter(text.lower())
# initialize occurrence with zeros to have all keys present.
occurrence = dict.fromkeys(keys, 0)
occurrence.update({k:v for k,v in c.items() if k in keys})
total = sum(occurrence.values())
frequency = {k:v/total for k,v in occurrence.items()}
[...]
To handle upper case, str.lower is applied in the snippet above so that capital letters are counted together with their lower-case forms.
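For example, to see which letters dominate the text, you could sort the frequency dict built above (a small illustrative follow-up, not part of the original snippet):
letters_by_freq = sorted(frequency.items(), key=lambda kv: kv[1], reverse=True)
for letter, freq in letters_by_freq[:5]:
    print(letter, round(freq, 4))  # the five most frequent letters and their shares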
"how I separate the words into single letters" since you want to calculate the count of the characters you can implement python counter in collections.
For example
import collections
import pprint
...
...
file_input = input('File_Name: ')
with open(file_input, 'r') as info:
    count = collections.Counter(info.read().upper()) # reading file
value = pprint.pformat(count)
print(value)
...
...
This reads your file and outputs the count of the characters present in it.
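If you only want the 26 letters (the raw Counter above also counts spaces, digits and punctuation), you could filter the text before counting; a small sketch along the same lines:
import collections

with open(file_input, 'r') as info:
    text = info.read().upper()
# keep only alphabetic characters before counting
letter_count = collections.Counter(ch for ch in text if ch.isalpha())
print(letter_count.most_common(5))  # the five most common letters with counts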
I'm working on a project at work using purely Python 3:
If I take my scanner (I work in inventory), anything I scan goes into a text doc. I scan the location "117", and then I scan any devices in any other location (the subsequent lines in the text doc, e.g. "100203"). When I run the script, it plugs '117' into the search on our database and changes each of the devices (whether they were assigned to that location or not) into that location, validating that those devices are in location '117'.
My main question is about the third objective in the list below, the one that doesn't have "Done" after it.
Objective:
Pull strings from a text document, convert it into a dictionary. = (Text_Dictionary) **Done**
Assign the first var in the dictionary to a separate var. = (First_Line) **Done**
All subsequent vars after the first var in the dictionary should be passed into a function individually. = (Proceeding_Lines)
Side note: The code should loop in a fashion that pops (.pop) the var from the dictionary/list, but I'm open to other alternatives. (Not mandatory)
What I already have is:
Project.py:
1 import re
2 import os
3 import time
4 import sys
5
6 with open(r"C:\Users\...\text_dictionary.txt") as f:
7 Text_Dictionary = [line.rstrip('\n') for line in
8 open(r"C:\Users\...\text_dictionary.txt")]
9
10 Text_Dict = (Text_Dictionary)
11 First_Line = (Text_Dictionary[0])
12
13 print("The first line is: ", First_Line)
14
15 end = (len(Text_Dictionary) + 1)
16 i = (len(Text_Dictionary))
17
What I have isn't much on the surface, but I have another "*.py" file full of code that I am going to copy in for the action that I wish to perform on each of the vars in Text_Dictionary.txt. Lines 15-16 were me messing with what I thought might solve this.
In the imported text document, the vars look very close to this (same length, all digits):
Text_Dictionary.txt:
117
23000
53455
23454
34534
...
Note: These values will change each time the code is run, meaning someone will type/scan in these lines of digits each time.
Explained concept:
Ideally, I would like the first line to point towards a direction, and the rest of the digits would follow; however, each value (for example '53455') needs to be run separately, one after the other, and (for example) '117' would be where '53455' goes. You could say the first line is static throughout the code, unless otherwise changed in Text_Dictionary.txt. '117' is run in conjunction with each iteration that follows it.
Background:
This is for inventory management in my office. I am in no way paid for doing this, but it would make my job a heck of a lot easier. Also, I know basic Python to get myself around, but this one kind of stumped me. Thank you to whoever answers!
I've no clue what you're asking, but I'm going to take a guess. Before I do so, your code was annoying me:
with open("file.txt") as f:
product_ids = [line.strip() for line in f if not line.isspace()]
There. That's all you need; written this way, it protects against blank lines in the file and stray invisible whitespace too. I decided to leave the data as strings because it probably represents an inventory ID, and in the future that might be upgraded to "53455-b.42454#62dkMlwee".
I'm going to hazard a guess that you want to run different code depending on the number at the top. If so, you can use a dictionary containing functions. You said that you wanted to run code from another file, so this is another_file.py:
__all__ = ["dispatch_whatever"]
dispatch_whatever = {}
def descriptive_name_for_117(product_id):
    pass

dispatch_whatever["117"] = descriptive_name_for_117
And back in main_program.py, which is stored in the same directory:
from another_file import dispatch_whatever
for product_id in product_ids[1:]:
    dispatch_whatever[product_ids[0]](product_id)
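If the first scanned line might not have a handler registered yet, a defensive lookup avoids a raw KeyError (my own addition, not part of the snippet above):
handler = dispatch_whatever.get(product_ids[0])
if handler is None:
    raise SystemExit("No handler registered for location " + product_ids[0])
for product_id in product_ids[1:]:
    handler(product_id)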
I'm pretty new to Python! I recently started coding a program with which I want to write to and read from text files, while compressing/decompressing sentences (sort of).
However, I've run into a couple of problems which I can't seem to fix. I've managed to code the compressing section, but when I go to read the contents of the text file, I'm not sure how to recreate the original sentence from the positions and unique words.
###This section will compress the sentence(s)###
txt_file = open("User_sentences.txt","wt")
user_sntnce = input(str("\nPlease enter your sentences you would like compressed."))
user_sntnce_list = user_sntnce.split(" ")
print(user_sntnce_list)

for word in user_sntnce_list:
    if word not in uq_words:
        uq_words.append(word)

txt_file.write(str(uq_words) + "\n")

for i in user_sntnce_list:
    positions = int(uq_words.index(i) + 1)
    index.append(positions)
    print(positions)
    print(i)
    txt_file.write(str(positions))

txt_file.close()
###This section will DECOMPRESS the sentence(s)###
if GuideChoice == "2":
    txt_file = open("User_sentences.txt","r")
    contents = txt_file.readline()
    words = eval(contents)
    print(words)
    txt_file.close()
This is my code so far. It seems to work; however, as I've said, I'm really stuck and don't know how to move on and recreate the original sentence from the text file.
From what I understand, you want to substitute each word in a text file with a word of your choice (a shorter one if you want to "compress"). Meanwhile you keep a "dictionary" (not in the Python sense), uq_words, where you associate each distinct word with an index.
So a sentence "today I like pizza, today is like yesterday" will become:
"12341536".
I tried your code after removing if GuideChoice == "2": and defining uq_words = [] and index = [].
If that's what you intend to do then:
I imagine you are calling this compression from time to time, from inside a function. What you do in the second line is open a NEW file with the same name as the previous one, meaning you will only ever keep the last sentence compressed, losing the previous ones.
Try reading the existing lines each time, rewriting them all, and then adding the new one (similar to what you did with contents = txt_file.readline()).
You are writing both the compressed translation (like "2345") AND the array whose components are the words of the split sentence. I do not think that is the "compressed" document you are aiming for. Just the "2345" part, right?
Since, I believe, you want to keep a dictionary, but this code is inside a function, you will lose the dictionary every time the function ends. So write two documents: one with the compressed text (refreshed each time, not rewritten from scratch!) and another file with two columns, where you write the dictionary. You pass the dictionary file name as a string to the function, so you can update it in case new words are added, and you read it back as an N x 2 array (N being the number of words).
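As for recreating the original sentence, here is a minimal sketch of the decompression step (my own illustration; it assumes the positions are written space-separated, so indices above 9 stay unambiguous, rather than concatenated as in the current code):
with open("User_sentences.txt", "r") as txt_file:
    uq_words = eval(txt_file.readline())      # first line: the list of unique words
    positions = txt_file.readline().split()   # second line: e.g. "1 2 3 4 1 5 3 6"

# map each stored position back to its word (positions were saved 1-based)
original = " ".join(uq_words[int(p) - 1] for p in positions)
print(original)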
I have a huge number of names from different sources.
I need to extract all the groups (parts of the names) which repeat from one name to another.
In the example below, the program should locate: Post, Office, Post Office.
I need to get popularity count.
So I want to extract a sorted by popularity list of phrases.
Here is an example of names:
Post Office - High Littleton
Post Office Pilton Outreach Services
Town Street Post Office
post office St Thomas
Basically I need to find an algorithm, or better a library, to get results like these:
Post Office: 16999
Post: 17934
Office: 16999
Tesco: 7300
...
Here is the full example of names.
I wrote some code which is fine for single words, but not for phrases:
from textblob import TextBlob
import operator
title_file = open("names.txt", 'r')
blob = TextBlob(title_file.read())
list = sorted(blob.word_counts.items(), key=operator.itemgetter(1))
print list
You are not looking for clustering (and that is probably why "all of them suck" for #andrewmatte).
What you are looking for is word counting (or, more precisely, n-gram counting), which is actually a much easier problem. That is why you won't be finding a dedicated library for it...
Well, actually you have some libraries. In Python, for example, the collections module has the Counter class, which covers much of the reusable code.
Some untested, very basic code:
from collections import Counter

counter = Counter()
for s in sentences:
    words = s.split(" ")
    for i in range(len(words)):
        counter[words[i]] += 1                    # count single words
        if i > 0:
            counter[(words[i-1], words[i])] += 1  # count adjacent word pairs
You can get the most frequent entries from counter. If you want words and word pairs kept separate, feel free to use two counters. If you need longer phrases, add an inner loop. You may also want to clean the sentences (e.g. lowercase them) and use a regexp for splitting.
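For instance, to pull out the most popular entries afterwards (an illustrative follow-up under the same assumptions as the sketch above):
# print the ten most common entries, single words and word pairs mixed
for item, count in counter.most_common(10):
    print(item, count)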
Are you looking for something like this?
workspace = {}

with open('names.txt', 'r') as f:
    for name in f:
        name = name.strip()
        if name:  # makes sure the line isn't empty
            if name in workspace:
                workspace[name] += 1
            else:
                workspace[name] = 1

for name in workspace:
    print "{}: {}".format(name, workspace[name])
I've been learning Python for the last couple of days in order to complete a data extraction task. I'm not getting anywhere and hope one of you lovely people can advise.
I need to extract data that follows: RESP, CRESP, RTTime and RT.
Here's a snippet as an example of the mess I have to deal with.
Thoughts?
Level: 4
*** LogFrame Start ***
Procedure: ActProcScenarios
No: 1
Line1: It is almost time for your town's spring festival. A friend of yours is
Line2: on the committee and asks if you would be prepared to help out with the
Line3: barbecue in the park. There is a large barn for use if it rains.
Line4: You hope that on that day it will be
pfrag: s-n-y
pword: sunny
pletter: u
Quest: Does the town have an autumn festival?
Correct: {LEFTARROW}
ScenarioListPract: 1
Topic: practice
Subtheme: practice
ActPracScenarios: 1
Running: ActPracScenarios
ActPracScenarios.Cycle: 1
ActPracScenarios.Sample: 1
DisplayFragInstr.OnsetDelay: 17
DisplayFragInstr.OnsetTime: 98031
DisplayFragInstr.DurationError: -999999
DisplayFragInstr.RTTime: 103886
DisplayFragInstr.ACC: 0
DisplayFragInstr.RT: 5855
DisplayFragInstr.RESP: {DOWNARROW}
DisplayFragInstr.CRESP:
FragInput.OnsetDelay: 13
FragInput.OnsetTime: 103899
FragInput.DurationError: -999999
FragInput.RTTime: 104998
I think regular expressions would be the right tool here because the \b word boundary anchors allow you to make sure that RESP only matches a whole word RESP and not just part of a longer word (like CRESP).
Something like this should get you started:
>>> import re
>>> for line in myfile:
... match = re.search(r"\b(RT|RTTime|RESP|CRESP): (.*)", line)
... if match:
... print("Matched {0} with value {1}".format(match.group(1),
... match.group(2)))
Output:
Matched RTTime with value 103886
Matched RT with value 5855
Matched RESP with value {DOWNARROW}
Matched CRESP with value
Matched RTTime with value 104998
Transform it into a dict first, then just get the items from the dict as you wish:
d = {k.strip(): v.strip() for (k, v) in
     [line.split(':') for line in s.split('\n') if line.find(':') != -1]}
print (d['DisplayFragInstr.RESP'], d['DisplayFragInstr.CRESP'],
       d['DisplayFragInstr.RTTime'], d['DisplayFragInstr.RT'])
# output: ('{DOWNARROW}', '', '103886', '5855')
I think you may be making things harder for yourself than needed. E-Prime has a file format called .edat that is designed for the purpose you are describing. An .edat file is another format that contains the same information as the .txt file, but in a way that makes extracting variables easier. I personally only use the type of text file you have posted here as a form of data storage redundancy.
If you are doing things this way because you do not have a software key, it might help to know that the E-Merge and E-DataAid programs for E-Prime don't require a key. You only need the key for editing build files. Whoever provided you with the .txt files should probably have an install disk for these programs. If not, they are available on the PST website (I believe you need a serial code to create an account, but I'm not certain).
E-Prime generally creates an .edat file that matches the content of the text file you have posted an example of. Sometimes, though, if E-Prime crashes you don't get the .edat file and only have the .txt. Luckily you can generate the .edat file from the .txt file.
Here's how I would approach this issue: If you do not have the edat files available first use E-DataAid to recover the files.
Then, presuming you have multiple participants, you can use E-Merge to merge all of the .edat files together for all participants who completed this task.
Open the merged file. It might look a little chaotic depending on how much you have in the file. Go to Tools -> Arrange Columns. This will show a list of all your variables. Adjust it so that only the desired variables are in the right-hand box, then hit OK.
Looking at the file you posted, it says Level: 4 at the top, so I'm guessing there are a lot of procedures in this experiment. If you have many procedures in the program, you might at this point have lines that just have startup info and NULL in the locations where your variables of interest are. You can fix this by going to Tools -> Filter and creating a filter to eliminate those lines. Sometimes, depending on file structure, you might also end up with duplicate lines of the same data. You can also fix this with filtering.
You can then export this file as a CSV.
import re
import pprint
def parse_logs(file_name):
    with open(file_name, "r") as f:
        lines = [line.strip() for line in f.readlines()]

    # \b keeps e.g. the RESP pattern from also matching CRESP lines
    base_regex = r'^.*\b{0}: (.*)$'
    match_terms = ["RESP", "CRESP", "RTTime", "RT"]
    regexes = {term: base_regex.format(term) for term in match_terms}

    output_list = []
    for line in lines:
        for key, regex in regexes.items():
            match = re.match(regex, line)
            if match:
                match_tuple = (key, match.groups()[0])
                output_list.append(match_tuple)
    return output_list

pprint.pprint(parse_logs("respregex"))
Edit: Tim and Guy's answers are both better. I was in a hurry to write something and missed two much more elegant solutions.
This is my first post, so I'm sorry if I do anything wrong. That said, I searched for the question and found something similar that was never answered due to the OP not giving sufficient information. This is also homework, so I'm just looking for a hint. I really want to get this on my own.
I need to read in a debate file (.txt), and pull and store all of the lines that one candidate says to put in a word cloud. The file format is supposed to help, but I'm blanking on how to do this. The hint is that each time a new person speaks, their name followed by a colon is the first word in the first line. However, candidates' data can span multiple lines. I am supposed to store each person's lines separately. Here is a sample of the file:
LEHRER: This debate and the next three -- two presidential, one vice
presidential -- are sponsored by the Commission on Presidential
Debates. Tonight's 90 minutes will be about domestic issues and will
follow a format designed by the commission. There will be six roughly
15-minute segments with two-minute answers for the first question,
then open discussion for the remainder of each segment.
Gentlemen, welcome to you both. Let's start the economy, segment one,
and let's begin with jobs. What are the major differences between the
two of you about how you would go about creating new jobs?
LEHRER: You have two minutes. Each of you have two minutes to start. A
coin toss has determined, Mr. President, you go first.
OBAMA: Well, thank you very much, Jim, for this opportunity. I want to
thank Governor Romney and the University of Denver for your
hospitality.
There are a lot of points I want to make tonight, but the most
important one is that 20 years ago I became the luckiest man on Earth
because Michelle Obama agreed to marry me.
This is what I have for a function so far:
def getCandidate(myFile):
    file = open(myFile, "r")
    obama = []
    romney = []
    lehrer = []
    file = file.readlines()
I'm just not sure how to iterate through the data so that it separates each person's words correctly. I created a dummy file to create the word cloud, and I'm able to do that fine, so all I am wondering is how to extract the information I need.
Thank you! If there is more information I can offer please let me know. This is a beginning Python course.
EDIT: New code added from a response. This works to an extent, but only grabs the first line of each candidate's response, not their entire response. I need to write code that continues to store each line under that candidate until a new name is at the start of a line.
def getCandidate(myFile, candidate):
    file = open(myFile, "r")
    OBAMA = []
    ROMNEY = []
    LEHRER = []
    file = file.readlines()
    for line in file:
        if line.startswith("OBAMA:"):
            OBAMA.append(line)
        if line.startswith("ROMNEY:"):
            ROMNEY.append(line)
        if line.startswith("LEHRER:"):
            LEHRER.append(line)
    if candidate == "OBAMA":
        return OBAMA
    if candidate == "ROMNEY":
        return ROMNEY
EDIT: I now have a new question. How can I generalize the file so that I can open any debate file between two people and a moderator? I am having a lot of trouble with this one.
I've been given a hint to look at the beginning of each line and check whether its first word ends in ":", but I'm still not sure how to do this. I tried splitting each line on spaces and then looking at the first item in the line, but that's as far as I've gotten.
The hint is this: after you split your lines, iterate over them and check with the string function startswith for each candidate, then append.
The iteration over a file is very simple:
for row in file:
    do_something_with_row
EDIT:
To keep appending lines until you find a new candidate, you have to keep track of the last candidate seen with a variable; if you don't find any match at the beginning of the line, you stick with the same candidate as before.
if line.startswith('OBAMA'):
    last_seen = OBAMA
    OBAMA.append(line)
elif line.startswith('ROMNEY'):
    last_seen = ROMNEY
    ROMNEY.append(line)
elif line.startswith('LEHRER'):
    last_seen = LEHRER
    LEHRER.append(line)
else:
    last_seen.append(line)
By the way, I would change the definition of the function: instead of taking the name of the candidate and returning only his lines, it would be better to return a dictionary with the candidate names as keys and their lines as values, so you wouldn't need to parse the file more than once. When you work with bigger files this could be a lifesaver.
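A minimal sketch of that dictionary-based version (my own illustration of the idea, using the "first word ends in a colon" hint, not tested against a real debate file):
def getSpeakers(myFile):
    speakers = {}          # speaker name -> list of spoken lines
    last_seen = None
    with open(myFile, "r") as f:
        for line in f:
            words = line.split()
            # a new speaker's turn starts when the first word ends in ":"
            if words and words[0].endswith(":"):
                last_seen = words[0].rstrip(":")
                speakers.setdefault(last_seen, [])
            if last_seen is not None:
                speakers[last_seen].append(line)
    return speakers
This way getSpeakers("debate.txt")["OBAMA"] would give you every line Obama speaks, however many speakers the file contains.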