Making a program which will read and write to text files - python

I'm pretty new to Python! I recently started writing a program that reads and writes text files while compressing/decompressing sentences (sort of).
However, I've run into a couple of problems I can't seem to fix. I've managed to code the compressing section, but when I go to read the contents of the text file back, I'm not sure how to recreate the original sentence from the positions and unique words.
###This section will compress the sentence(s)###
txt_file = open("User_sentences.txt","wt")
user_sntnce = input(str("\nPlease enter your sentences you would like compressed."))
user_sntnce_list = user_sntnce.split(" ")
print(user_sntnce_list)
for word in user_sntnce_list:
    if word not in uq_words:
        uq_words.append(word)
txt_file.write(str(uq_words) + "\n")
for i in user_sntnce_list:
    positions = int(uq_words.index(i) + 1)
    index.append(positions)
    print(positions)
    print(i)
    txt_file.write(str(positions))
txt_file.close()
###This section will DECOMPRESS the sentence(s)###
if GuideChoice == "2":
    txt_file = open("User_sentences.txt","r")
    contents = txt_file.readline()
    words = eval(contents)
    print(words)
    txt_file.close()
This is my code so far. It seems to work, but as I've said, I'm stuck and don't know how to move on and recreate the original sentence from the text file.

From what I understand, you want to substitute each word in a text file with a word of your choice (a shorter one if you want to "compress"). Meanwhile you keep a "dictionary" (not in the Python sense), uq_words, where you associate each distinct word with an index.
So the sentence "today I like pizza, today is like yesterday" will become:
"12341536".
I tried your code after removing if GuideChoice == "2": and defining uq_words = [] and index = [].
If that's what you intend to do, then:
I imagine you are calling this compression from time to time, so it sits in a function. Opening the file the way you do in the second line creates a NEW file with the same name as the previous one, meaning you will only ever keep the last sentence compressed, losing the previous ones.
Instead, read the existing lines each time, rewrite them all, and then add the new one (roughly what you did with contents = txt_file.readline()).
You are writing out both the compressed translation (like "2345") AND the list whose elements are the words of the split sentence. I don't think that is the "compressed" document you are aiming for; you just want the "2345" part, right?
Since, I believe, you want to keep a dictionary, but this code is inside a function, you will lose the dictionary every time the function ends. So write two files: one with the compressed text (refreshed each time, not overwritten!) and another with two columns, where you write the dictionary. Pass the dictionary file name as a string to the function so you can update it when new words are added, and read it back as an N x 2 array (N being the number of words).
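As for the original question (rebuilding the sentence from the positions and unique words), here is a minimal sketch of the decompression side. It assumes the file layout your current code produces: the str() of the unique-word list on line 1 and the concatenated position digits on line 2, so it only works while a sentence has fewer than 10 unique words (each position is a single digit). ast.literal_eval is used instead of eval because it is safer for data like this.
import ast

with open("User_sentences.txt", "r") as txt_file:
    uq_words = ast.literal_eval(txt_file.readline())  # the unique-word list from line 1
    digits = txt_file.readline().strip()              # e.g. "12341536"

# each digit is a 1-based index into uq_words
original = " ".join(uq_words[int(d) - 1] for d in digits)
print(original)
If a sentence can have ten or more unique words, write the positions separated by spaces (or commas) instead, and split on that separator when reading back.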

Related

How to separate words to single letters from text file python

How do I separate words from a text file into single letters?
I'm given a text and have to calculate the frequency of the letters in it. However, I can't figure out how to separate the words into single letters so I can count the unique elements and, from there, determine their frequency.
I apologize for not attaching the text as a file, but here is the text I'm given:
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book,' thought alice without pictures or conversation?'
so she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy- chain would be worth the trouble of getting up and picking the daisies, when suddenly a white rabbit with pink eyes ran close by her.
there was nothing so very remarkable in that; nor did alice think it so very much out of the way to hear the rabbit say to itself, `oh dear! oh dear! i shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the rabbit actually took a watch out of its waistcoat- pocket, and looked at it, and then hurried on, alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
in another moment down went alice after it, never once considering how in the world she was to get out again.
the rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that alice had not a moment to think about stopping herself before she found herself falling down a very deep well.
I'm supposed to separate the text into the 26 letters a-z and then determine each letter's frequency.
I tried making the following code so far:
# Check where the current file you are working in, is saved.
import os
os.getcwd()
#print(os.getcwd())
# 1. Change the current working directory to the place where you have saved the file.
os.chdir('C:/Users/Annik/Desktop/DTU/02633 Introduction to programming/Datafiles')
os.getcwd()
#print(os.chdir('C:/Users/Annik/Desktop/DTU/02633 Introduction to programming/Datafiles'))
# 2. List the contents of the current working directory.
os.listdir(os.getcwd())
#print(os.listdir(os.getcwd()))
#importing the file
filein = open("small_text.txt", "r") #opens the file for reading
lines = filein.readlines() #reads all lines into an array
smalltxt = "".join(lines) #Joins the lines into one big string.
import numpy as np
def letterFrequency(filename):
    #counts the frequency of letters in a text
    unique_elems, counts = np.unique(separate_words, return_counts=True)
    return unique_elems
I just don't know how to separate the letters in the text, so I can count the unique elements.
You can use collections.Counter to get your frequencies directly from the text.
Then just select the 26 keys you are interested in, because the counter will also include whitespace and other characters.
from collections import Counter
[...]
with open("small_text.txt", "r") as file:
    text = file.read()
keys = "abcdefghijklmnopqrstuvwxyz"
c = Counter(text.lower())
# initialize occurrence with zeros to have all keys present.
occurrence = dict.fromkeys(keys, 0)
occurrence.update({k:v for k,v in c.items() if k in keys})
total = sum(occurrence.values())
frequency = {k:v/total for k,v in occurrence.items()}
[...]
To handle upper case, str.lower might be useful as well.
"How do I separate the words into single letters?" Since you want to count the characters, you can use the Counter from Python's collections module.
For example
import collections
import pprint
...
...
file_input = input('File_Name: ')
with open(file_input, 'r') as info:
    count = collections.Counter(info.read().upper()) # reading file
value = pprint.pformat(count)
print(value)
...
...
This reads your file and outputs the count of each character present.
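If you then want relative frequencies for just the 26 letters, as the question asks, a possible follow-up on the count built above might be (assuming the text is not empty):
letters = "abcdefghijklmnopqrstuvwxyz"
letter_counts = {ch: count.get(ch.upper(), 0) for ch in letters}  # count was built from upper-cased text
total = sum(letter_counts.values())
frequencies = {ch: n / total for ch, n in letter_counts.items()}
print(frequencies)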

How to parse a complex text file using Python string methods or regex and export into tabular form

As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table (the columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.).
I think regex is what I need, but my class did not cover it, so I am confused about how to parse the file in order to extract and output the correct data into an organized table...
I am supposed to turn my text file from this
https://pastebin.com/ZM8EPu0p
and export it into a more readable format like this (example output is below).
Here is what I have so far.
def readFile(court):
    csv_rows = []
    # read and split txt file into pages & chunks of data by paragraph
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")
    for chunk in data_chunks:
        chunk = chunk.strip # .strip removes useless spaces
        if str(data_chunks[:4]).isnumeric(): # if first 4 characters are digits
            entry = None # initialize an empty dictionary
        elif (
            str(data_chunks).isspace() and entry
        ): # if we're on an empty line and the entry dict is not empty
            csv_rows.DictWriter(dialect="excel") # turn csv_rows into needed output
            entry = {}
        else:
            # parse here?
            print(data_chunks)
    return csv_rows

readFile("/Users/mia/Desktop/School/programming/court.txt")
It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks.
First, your input looks like a text file, so you could parse it line by line, using readlines: https://www.w3schools.com/python/ref_file_readlines.asp
Then, I noticed that your data can be split into pages. You will need to prepare a lot of regular expressions, but you can start with one for identifying where each page starts. You may want to read this, as your expression might get quite complicated: https://www.w3schools.com/python/python_regex.asp
The goal of this step is to collect all the lines of a page in some container (a list, a dict, whatever you find suitable).
Afterwards, write some code that parses the information page by page. For simplicity, I suggest starting with something easy, like the columns for "no, file number and defendant".
And once you can extract some data reliably, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
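A minimal skeleton of that plan might look like the sketch below. Everything specific to the court file is an assumption here: the page-start pattern, the row pattern, the column names and the file names are hypothetical placeholders that would need to be adjusted to the actual text (and DataFrame.to_excel needs an Excel writer such as openpyxl installed).
import re
import pandas as pd

PAGE_START = re.compile(r"^\s*RUN DATE:")       # hypothetical marker for the top of a page
ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+(.+)$")  # hypothetical: no, file number, defendant

def parse_court_file(path):
    with open(path, "r") as f:
        lines = f.readlines()

    # 1. group the lines into pages
    pages, current = [], []
    for line in lines:
        if PAGE_START.match(line) and current:
            pages.append(current)
            current = []
        current.append(line)
    if current:
        pages.append(current)

    # 2. parse each page into rows
    rows = []
    for page in pages:
        for line in page:
            m = ROW.match(line)
            if m:
                rows.append({"No": m.group(1),
                             "File Number": m.group(2),
                             "Defendant": m.group(3).strip()})
    return rows

# 3. export the rows in tabular form
df = pd.DataFrame(parse_court_file("court.txt"))
df.to_excel("court.xlsx", index=False)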

How to assign a single variable to a specific line in python?

I was not clear enough in my last question, and so I'll explain my question more this time.
I am creating 2 separate programs, where the first one will create a text file with 2 generated numbers, one on line 1 and the second on line 2.
Basically I saved it like this:
In this example I'm not generating numbers, just assigning them quickly.
a = 15
b = 16
saving = open('filename.txt', "w")
saving.write(str(a) + "\n")  # write() expects a string, so convert the number first
saving.write(str(b) + "\n")
saving.close()
Then I opened it on the next one:
opening = open('filename.txt', "w")
a = opening.read()
opening.close()
print(a) # This will print the whole document, but I need each line to be different
Now I've got the whole file loaded into a, but I need it split up, and I haven't got a clue how to do that. I don't believe creating a list will help, as I need each number (variables a and b from program 1) to be a separate variable in program 2. The reason I need them as two separate variables is that I need to divide each one by a different number. If I do need a list, please say so. I tried finding an answer for about an hour in total, but couldn't find anything.
The reason I can't post the whole program is that I don't have access to it from here. And no, this is not cheating, as we are free to research and ask questions outside the classroom, in case anyone wonders about that after looking at my previous question.
If you need more info please put it in a comment and I'll respond ASAP.
opening = open('filename.txt') # "w" is not necessary since you're opening it read-only
a = [b.strip() for b in opening.readlines()] # list of lines with the trailing "\n" stripped
opening.close()
print(a[0]) # print first line
print(a[1]) # print second line
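Since each element of a comes back as a string, you still need to convert before dividing; with made-up divisors just for illustration:
first = int(a[0]) / 3   # hypothetical divisor for the first number
second = int(a[1]) / 4  # hypothetical divisor for the second number
print(first, second)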

Split words and creating new files with different names(python)

I need to write a program like this:
Write a program that reads a file .picasa.ini and copies pictures into new files whose names are the identification numbers of the people in those pictures (e.g. 8ff985a43603dbf8.jpg). If there are several people in a picture, it makes several copies. If a person appears in several pictures, later copies override earlier ones; so if a person 8ff985a43603dbf8 appears in more than one picture, only one file with that name will exist. You may assume we have a simple .picasa.ini file.
I have an .ini that contains:
[img_8538.jpg]
faces=rect64(4ac022d1820c8624),**d5a2d2f6f0d7ccbc**
backuphash=46512
[img_8551.jpg]
faces=rect64(acb64583d1eb84cb),**2623af3d8cb8e040**;rect64(58bf441388df9592),**d85d127e5c45cdc2**
backuphash=8108
...
Is this a good way to start this program?
for line in open('C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini'):
    if line.startswith('faces'):
        line.split() # what must I do here to split the bolded words?
Is there a better way to do this? Remember the .jpg file must be created with a new name, so I think I should link the current .jpg file with the bolded one.
Consider using ConfigParser. Then you will have to split each value by hand, as you describe.
import ConfigParser
import string
config = ConfigParser.ConfigParser()
config.read('C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini')
imgs = []
for item in config.sections():
    imgs.append(config.get(item, 'faces'))
This is still a work in progress. I just want to ask if it's correct.
Edit:
I still don't know how to split the bolded words out of there. This split function is really a pain for me.
Suggestions:
Your lines don't start with 'faces', so your second line won't work the way you want it to. Depending on how the rest of the file looks, you might only need to check whether the line is empty or not at that point.
To get the information you need, first split at ',' and work from there.
Attempt at a solution: the elements you need always seem to have a ',' before them, so you can start by splitting at ',' and taking everything from index 1 onwards ([1::]). Then, if I'm reading it correctly, you split each of those elements twice more: once at ";", taking the 0-index element, and once at " ", again taking the 0-index element.
for line in open('thingy.ini'):
    if line != "\n":
        personelements = line.split(",")[1::]
        for person in personelements:
            personstring = person.split(";")[0].split(" ")[0]
            print personstring
works for me to get:
d5a2d2f6f0d7ccbc
2623af3d8cb8e040
d85d127e5c45cdc2
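To tie this back to the actual assignment (copying each picture to a file named after the person ID, e.g. 8ff985a43603dbf8.jpg), a sketch along these lines might work. It builds on the ConfigParser suggestion above, written here with the Python 3 configparser spelling; the splitting follows the faces layout shown in the sample .ini, and the directory handling is an assumption (the pictures are taken to live next to the .ini file).
import configparser
import os
import shutil

ini_path = r'C:\Users\Admin\Desktop\podatki-picasa\.picasa.ini'
pictures_dir = os.path.dirname(ini_path)

config = configparser.ConfigParser()
config.read(ini_path)

for section in config.sections():      # each section name is a picture file, e.g. img_8538.jpg
    if not config.has_option(section, 'faces'):
        continue
    faces = config.get(section, 'faces')
    # faces looks like "rect64(...),ID;rect64(...),ID" -> one ID per person in the picture
    person_ids = [part.split(',')[1] for part in faces.split(';')]
    for person_id in person_ids:
        src = os.path.join(pictures_dir, section)
        dst = os.path.join(pictures_dir, person_id + '.jpg')
        shutil.copyfile(src, dst)       # later pictures overwrite earlier copies, as required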

Python program to search for specific strings in hash values (coding help)

I'm trying to write code that searches hash values for specific strings (input by the user) and returns the hash if the search query is present in that line.
I'm doing this just to learn Python a bit more, but it could be a real-world application used by an HR department to search a .csv resume database for specific words in each resume.
I'd like this program to look through a .csv file that has three entries per line (id#;applicant name;resume text).
I set it up so that it creates a hash, then created a string for the resume-text hash entry, and am trying to use the .find() function to return the entire hash for each instance.
What I'd like is that if the word "gpa" is used as a search query and it is found in s['resumetext'] for three applicants (rows in the .csv file), it prints the id, name, and resume for every row that has it (all three applicants).
As it is right now, my program prints the first row in the .csv file (print resume['id'], resume['name'], resume['resumetext']) no matter what the search query is, whether it's in the resumetext or not.
Lastly, are there better ways of doing this, such as searching Word documents, PDFs, and .txt files in a folder for specific words using Python? (I've just started reading about the re module and am wondering if this may be the route, rather than putting everything in a .csv file.)
def find_details(id2find):
    resumes_f=open("resume_data.csv")
    for each_line in resumes_f:
        s={}
        (s['id'], s['name'], s['resumetext']) = each_line.split(";")
        resumetext = str(s['resumetext'])
        if resumetext.find(id2find):
            return(s)
        else:
            print "No data matches your search query. Please try again"

searchquery = raw_input("please enter your search term")
resume = find_details(searchquery)
if resume:
    print resume['id'], resume['name'], resume['resumetext']
The line
resumetext = str(s['resumetext'])
is redundant, because s['resumetext'] is already a string (since it comes as one of the results from a .split call). So, you can merge this line and the next into
if id2find in s['resumetext']: ...
Your following else is misaligned -- with it placed like that, you'll print the message over and over again. You want to place it after the for loop (and the else isn't needed, though it would work), so I'd suggest:
for each_line in resumes_f:
    s = dict(zip('id name resumetext'.split(), each_line.split(";")))
    if id2find in s['resumetext']:
        return(s)
print "No data matches your search query. Please try again"
I've also shown an alternative way to build dict s, although yours is fine too.
What #Justin Peel said. Also, to be more Pythonic, I would change
if resumetext.find(id2find) != -1: to if id2find in resumetext:
A few more changes: you might want to lower case the comparison and user input so it matches GPA, gpa, Gpa, etc. You can do this by doing searchquery = raw_input("please enter your search term").lower() and resumetext = s['resumetext'].lower(). You'll note I removed the explicit cast around s['resumetext'] as it's not needed.
One change that I recommend for your code is changing
if resumetext.find(id2find):
to
if resumetext.find(id2find) != -1:
because find() returns -1 if id2find isn't in resumetext; otherwise it returns the index where id2find is first found, which could be 0. As #Personman commented, the original check gives a false positive because -1 is interpreted as True in Python.
I think the problem has something to do with the fact that find_details() only returns the first entry whose resumetext contains the search string. It might be better to make find_details() a generator instead; then you could iterate over it and print the found records one by one.
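A rough sketch of that generator idea (written with Python 3 print()/input(), unlike the Python 2 code above) might look like this:
def find_details(id2find, path="resume_data.csv"):
    """Yield every record whose resume text contains the search term."""
    with open(path) as resumes_f:
        for each_line in resumes_f:
            parts = each_line.strip().split(";", 2)  # at most 3 fields: id, name, resume text
            if len(parts) != 3:                      # skip blank or malformed lines
                continue
            s = dict(zip(('id', 'name', 'resumetext'), parts))
            if id2find.lower() in s['resumetext'].lower():
                yield s

searchquery = input("please enter your search term")
matches = list(find_details(searchquery))
if matches:
    for resume in matches:
        print(resume['id'], resume['name'], resume['resumetext'])
else:
    print("No data matches your search query. Please try again")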
