Saving stripped text from a csv file as a string object with Python

I'd like to be able to save text from a file that I had to retrieve from online and decompress (part of an assignment), in order to carry on with my next steps. Specifically, I'd like to save it as its own string object.
I can see the exact text I need when I print it in the following manner.
for line in seq:
    print(line.strip())
I just can't seem to figure out how to assign the stripped text to a variable.

You could use a list, and append each line of text to the list.
my_text = []
for line in seq:
    my_text.append(line.strip())
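If you need the text as a single string rather than a list (which is what the question asks for), one option is to join the stripped lines back together. A minimal sketch, assuming seq is the same iterable of lines as above:

# strip each line, then glue the pieces back together with newlines
my_text = "\n".join(line.strip() for line in seq)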

Related

Streamlit file objects creating very large line spacings when decoded to string, printed to file

I am writing a Streamlit app that takes in tensor output data from a .txt file, formats it, and both shows information on the data and prints the formatted data back to a new .txt file for later use.
After uploading the txt file to Streamlit and decoding it to a single long string, I alter the string and write it to a new txt file. When I open that txt file, the line spacing is huge; it looks as though extra newlines have been inserted, but when you highlight the text it is just large line spacing.
On top of this, when I use splitlines() on the string, the array that comes back is empty. This is the case even though the string is not empty and does contain newlines. I think it is related to the large line spacings, but I am not sure.
The program is split into modules, but the code that is meant to format the file is in just two functions. One adds delimiters and works like this (with Streamlit as st):
def delim(file):
    #read the selected file and write it to variable elems as a string
    elems = file.decode('utf-8')
    #replace the applicable parts of variable elems with the delimiters
    elems = elems.replace('e+002', 'e+002, ')
    elems = elems.replace('e+003', 'e+003, ')
    elems = elems.replace('e+004', 'e+004, ')
    elems = elems.replace('e+005', 'e+005, ')
    elems = elems.replace('e+006', 'e+006, ')
    elems = elems.replace('e+007', 'e+007, ')
    elems = elems.replace('e+008', 'e+008, ')
    elems = elems.replace('e+009', 'e+009, ')
    with open('final_file.txt', 'w') as magma_file:
        #write a txt file with the stored, altered text in variable elems
        magma_file.write(elems)
        #close the writeable file to be safe
        magma_file.close()
    st.success('Delimiters successfully added')
The second part, where I am getting the empty array, is in a second function. The whole function is not necessary to see the issue, but the part that is not working is here:
def addElem(file):
    #create counting variables
    counter = 0
    linecount = 1
    #put file as string in variable checks
    checks = file.decode('utf-8')
    checks.splitlines()
    #check to see if the start of the file is formatted correctly. This is the part giving me strife
    if checks[0].rstrip().endswith('5'):
        with open('final_file.txt', 'w') as ff:
            #iterate through the lines in the file
            for line in checks:
                counter += 1
                # and so on, not relevant to the problem
The variable checks does contain a string after decoding the file, but when I use splitlines() and then look inside checks[0], checks[1], etc., they are all empty. I tried commenting out other code and the conditional statement, removing the rstrip(), and just inspecting the checks array after splitting the string, but it was still empty. I tried changing splitlines() to split() with various delimiters, including \n, but the array remained empty.
This program logic worked perfectly when I ran it locally as a console application interacting directly with the file system, so the problem is probably something to do with how a Streamlit "file-like object" works. I read through the Streamlit docs, but they don't give much detail on this.
This program is not for my own use, so I can't keep it as a console app. I asked about this on the Streamlit community forum a month ago, but so far no one has answered, and I am not sure whether it is an unusual problem or just a terrible question.
I am wondering if there is a better way to decode the file to a string, but decoding to unicode doesn't explain the line spacings, so I think something else is going on.
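One hedged observation on the addElem snippet above: str.splitlines() returns a new list and does not modify the string in place, so after that call checks is still one long string and checks[0] is just its first character. A minimal sketch of what the fix might look like:

checks = file.decode('utf-8')
# splitlines() returns a list; its result has to be assigned
lines = checks.splitlines()
if lines and lines[0].rstrip().endswith('5'):
    # iterate over lines here, not over the original string
    ...

As for the large line spacings, one possibility (an assumption, not confirmed by the post) is Windows-style \r\n line endings surviving the decode and being expanded again on write; replacing '\r\n' with '\n' before writing is a cheap thing to try.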

How to match text in two different files and extract values

So I have two files: a yaml file that contains Tibetan words and their meanings (word: meaning), and a csv file that contains only a word and its POS tag. As below:
yaml file :
ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།
ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།
ད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།
ད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།
csv file :
ད་ཆུ PART
ད་གདོད DET
Desired output:
ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་གདོད DET ད་གཟོད་དང་དོན་འདྲ།
Any idea how to match the text from the csv file against the yaml file and extract the meaning into the csv?
From a functional point of view, you have:
a dictionary, meaning here a key: value mapping
a list of words to search in that dictionary, each lookup producing an output record
If everything fits in memory, you can first read the yaml file to produce a Python dictionary, and then read the words file one line at a time, using that dictionary to generate the expected line. If the yaml file is too large, you could use the dbm (or shelve) module as an on-disk dictionary, as sketched below.
As you have not shown any code, I cannot show much either... I can just say that you can process the second file as plain text, reading it one line at a time. For the first one, you can either look for a yaml module on PyPI or, if the syntax is always as simple as the lines you have shown, process it as text one line at a time and use split to extract the key and the value.
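As a sketch of the dbm/shelve idea mentioned above (file names here are illustrative, and the parsing assumes the simple "key: value" syntax shown in the question):

import shelve

# build an on-disk dictionary once, so a huge yaml file need not fit in memory
with shelve.open('meanings.db') as db:
    with open('dict.yaml', 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if ': ' in line:
                key, value = line.split(': ', 1)
                db[key] = value

# later lookups read from disk instead of RAM
with shelve.open('meanings.db') as db:
    print(db.get('ད་གདོད'))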
Assuming your files are called dict.yaml and input.csv.
You can start by turning the yaml file into a dictionary with
import yaml
with open('dict.yaml', 'r') as file:
    trans_dict = yaml.safe_load(file)
Which should give you
>>> trans_dict
{'ད་གདོད': 'ད་གཟོད་དང་དོན་འདྲ།',
 'ད་ཆུ': 'དངུལ་ཆུ་ཡི་མིང་གཞན།',
 'ད་ཕྲུག': 'དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།',
 'ད་བེར': 'སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།',
 'ད་མེ་དུམ་མེ': 'དམ་དུམ་ལ་ལྟོས།'}
Then, you can iterate over the lines in the CSV and use the dictionary to look up the definition. The word is the first whitespace-separated token on each line; the POS tag follows it:
outputs = []
with open('input.csv', 'r') as file:
    for line in file:
        entry = line.strip()
        if not entry:
            continue
        # look up the first token (the word); keep the whole line in the output
        word = entry.split()[0]
        definition = trans_dict.get(word)
        outputs.append(
            entry if definition is None
            else f"{entry} {definition}"
        )
From here, your outputs variable should contain ['ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།', 'ད་གདོད DET ད་གཟོད་དང་དོན་འདྲ།']. If you then wanted to write this out to a file, you could do
with open('output.txt', 'w') as file:
    file.write('\n'.join(outputs))
If your CSV has a different layout from the two-column one shown (unclear from your post), you only need to adjust how the word is extracted from each line; the same approach still applies.
The easiest solution that came to my mind is iterating over all lines of the YAML file and checking whether the word appears in the CSV file:
YAML_LINES = "ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།\nད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན\nད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན\nད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།\nད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།".split("\n")
CSV_LINES = "ད་ཆུ\nད་གདོད".split("\n")
for line in YAML_LINES:
    word, meaning = line.split(": ")
    if word in CSV_LINES:
        output = word + " " + meaning
        print(output)
The YAML_LINES and CSV_LINES lists are only to provide a quick and dirty example.

How does a for loop print all the lines of a text file without using readline()

I'm studying the file section and I'm confused by the code below.
def printAllLines(fileObject):
    for line in fileObject:
        print(line, end="")
In this case, does one iteration of a line equal one line of the original text file?
Is there any index in a text file?
Can I think of a pure text file as a list that contains multiple items?
And each item contains a line of text?
A file object created through the open() function in Python yields the lines of the file as you iterate over it. For a text file this is an io.TextIOWrapper (it satisfies the typing.TextIO interface); a file opened in binary mode satisfies typing.BinaryIO instead. The object is iterable but cannot be indexed, as it does not define __getitem__.
TL;DR: You can think of looping over a file with for as syntactic sugar for repeatedly calling readline(); it cuts down on lines of code, but don't overthink it.
To answer each of your questions:
Yes, each iteration of line in fileObject yields one line of the text file.
No, you cannot index a text file directly; fileObject[1] raises a TypeError. To index lines you must first read them into a list, for example with readlines().
Not exactly: the file is read lazily, line by line, rather than stored as a list of lines. But treating a text file as a sequence of lines is a useful mental model for iteration.
Hopefully this fully answers your question.
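If you do need line-by-line indexing, a small sketch (the file name is illustrative): read the lines into a real list first, then index that.

with open('example.txt') as f:
    lines = f.readlines()  # a real list, so indexing works

print(lines[0])  # first line of the file, trailing newline included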

Having trouble splitting a file of text into separate words

I have been trying to split a file of text into distinct words.
I tried using the iter method, the nltk module, and plain splits, but something doesn't add up when I try to append the outcome to a list.
Maybe there is some problem with how I am approaching the file.
txt = open(game_file)
print txt.read()
names = []
linestream = iter(txt.read())
for line in linestream:
    for word in line.split():
        names.append(word)
When I try to print the list names, I just get [].
Remove print txt.read(); after that call the file pointer is at the end, so you are iterating over an already-exhausted file.
Or store the contents in a new variable, text = txt.read(), and do your work on that.
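A minimal sketch of that second suggestion, keeping the question's variable names:

txt = open(game_file)
text = txt.read()      # read the whole file into a string exactly once
names = text.split()   # split on any whitespace to get the individual words
txt.close()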
Once you call txt.read(), you are already at the end of your file. When you try to read it again, the file pointer is still at the end and finds nothing.
Delete your 2nd line and it should work!
Also, you don't need iter(txt.read());
for line in txt: should work!
Creating "iter" object of _any_file_obj_.read() returns iter object which iterates over every single character present in the file. Which is surely you dont want to acheive here as you want to split file text into distinct words.
If you want to get the every word form the text file, then you can follow the following approach.
word_list = []
txt = open(any_file)  # create the file object
for line in txt.readlines():
    if line:
        word_list.extend(line.split())  # add every word on this line
txt.seek(0)
The last line, txt.seek(0), is important if you want to read the file again.
Your code was giving an empty list [] because, after one full iteration, the file's current position pointed at the end of the file (EOF). _file_obj_.seek() moves the current position back to wherever you want in the opened file.

Extracting N JSON objects contained in a single line from a text file in Python 2.7?

I have a huge text file that contains several JSON objects that I want to parse into a csv file. Because I am dealing with someone else's data, I cannot change the format it is delivered in.
Since I don't know how many JSON objects there are, I can't just create some set number of dictionaries, wrap them in a list, and json.loads() the list.
Also, since all the objects are on a single text line, I can't use a regex to separate each individual JSON object and put them in a list. (It's super complicated, sometimes triply nested, JSON at some points.)
Here's my current code:
def json_to_csv(text_file_name, desired_csv_name):
    #Cleans up a bit of the text file
    file = fileinput.FileInput(text_file_name, inplace=True)
    ile = fileinput.FileInput(text_file_name, inplace=True)
    for line in file:
        sys.stdout.write(line.replace(u'\'', u'"'))
    for line in ile:
        sys.stdout.write(re.sub(r'("[\s\w]*)"([\s\w]*")', r"\1\2", line))
    #try to load the text file to content var
    with open(text_file_name, "rb") as fin:
        content = json.load(fin)
    #Rest of the logic using the json data in content
    #that uses it for the desired csv format
This code gives ValueError: Extra data: line 1 column 159816 because there is more than one object there.
I have seen similar questions on Google and Stack Overflow, but none of those solutions work here, because everything is on one really long line in a text file and I don't know how many objects are in it.
If you are trying to split apart the highest-level braces, you could do something like
string = '{"NextToken": {"value": "...'
objects = eval("[" + string + "]")
and then parse each item in the list.
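Be aware that eval on data you don't control is dangerous, and it also fails on JSON literals such as true, false and null. A safer sketch, which should also work on Python 2.7, uses json.JSONDecoder.raw_decode to peel the objects off the long line one at a time:

import json

def split_json_objects(line):
    # parse N back-to-back (or comma-separated) JSON objects from one string
    decoder = json.JSONDecoder()
    objects = []
    idx = 0
    line = line.strip()
    while idx < len(line):
        obj, end = decoder.raw_decode(line, idx)
        objects.append(obj)
        idx = end
        # skip separators (commas, whitespace) between objects
        while idx < len(line) and line[idx] in ', \t':
            idx += 1
    return objects

Each element of the returned list is then a normal Python dict that can be flattened into csv rows.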
