I have a text file which has been made from an Excel file; in the Excel file, cell A2 contains the name 'Supplier A'. When I import the text file and use the following code:
filea = open("jag.txt").readlines()
lines = [x.split() for x in filea]
print(lines[0][1])
It returns just 'Supplier' and not 'Supplier A'; the 'A' is located in lines[0][2]. How do I import it and have it recognise the complete name? If I copy the text field back into Excel it copies properly, so the txt file definitely keeps the words together.
Excel regularly uses the tab character as the separator when saving in 'txt' format.
So you should try something like this:
with open('jag.txt') as f:
    lines = [line.split('\t') for line in f.read().splitlines()]
print(lines)
and you should get something like this:
[ ['A1', 'A2', ...], ['B1', 'B2'], ... ]
Why not just "f.readlines()"? Because with that, your last cell would also contain the newline character ('\n').
Why use the with statement? with closes the file at the end, which is a good choice in any case.
An alternative way to parse your text file is Python's built-in csv module. csv.reader can be a very convenient way to parse character-separated files:
import csv

with open('jag.txt') as f:
    lines = [line for line in csv.reader(f, delimiter='\t')]
It does so because str.split() without an argument splits on every run of whitespace: spaces, tabs and line breaks. You could use str.split('\t') as an alternative, but in fact you really want to use the csv module for tasks like this.
What character (space, tab, comma, etc.) separates the values on each line? Your current code splits the text at any whitespace, because split() is used without a separator character.
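A quick demonstration of the difference (the sample line is made up):
line = 'Supplier A\t100\n'
print(line.split())      # ['Supplier', 'A', '100']  - splits on ANY whitespace
print(line.split('\t'))  # ['Supplier A', '100\n']   - keeps the space in the name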
So I have two files. One YAML file that maps Tibetan words to their meanings, and one CSV file that contains only a word and its POS tag. As below:
YAML file:
ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།
ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།
ད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།
ད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།
CSV file:
ད་ཆུ PART
ད་གདོད DET
Desired output:
ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་གདོད DET ད་གཟོད་དང་དོན་འདྲ།
Any idea how to match the words in the CSV file against the YAML file and pull each word's meaning into the CSV?
From a functional point of view, you have:
a dictionary, meaning here a key: value thing
a list of words to search in that dictionary, and that will produce a record
If everything can fit in memory, you can first read the YAML file to produce a Python dictionary, and then read the words file one line at a time, using that dictionary to generate the expected line. If the YAML file is too large, you could use the dbm (or shelve) module as an on-disk dictionary.
As you have not shown any code, I cannot either... I can just say that you can simply process the second file as plain text, one line at a time. For the first one, you can either look for a YAML module on PyPI, or, if the syntax is always as simple as the lines you have shown, process it as text one line at a time and use split to extract the key and the value.
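A minimal sketch of the on-disk dictionary idea with shelve, assuming the simple key: value syntax shown above (the file names here are only placeholders):
import shelve

with shelve.open('meanings.db') as db:
    # build the on-disk dictionary once from the yaml-like file
    with open('dict.yaml', encoding='utf-8') as f:
        for line in f:
            if ': ' in line:
                key, value = line.rstrip('\n').split(': ', 1)
                db[key] = value
    # then stream the word list one line at a time
    with open('words.csv', encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue
            word = line.split()[0]  # first token is the word
            print(word, db.get(word, ''))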
Assuming your files are called dict.yaml and input.csv.
You can start by turning the yaml file into a dictionary with
import yaml
with open('dict.yaml', 'r') as file:
    trans_dict = yaml.safe_load(file)
Which should give you
>>> trans_dict
{'ད་གདོད': 'ད་གཟོད་དང་དོན་འདྲ།',
'ད་ཆུ': 'དངུལ་ཆུ་ཡི་མིང་གཞན།',
'ད་ཕྲུག': 'དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།',
'ད་བེར': 'སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།',
'ད་མེ་དུམ་མེ': 'དམ་དུམ་ལ་ལྟོས།'}
Then, you can iterate over the lines in the CSV and use the dictionary to get the definition:
outputs = []
with open('input.csv', 'r') as file:
    for line in file:
        term = line.strip()
        definition = trans_dict.get(term)
        outputs.append(
            term if definition is None
            else f"{term} {definition}"
        )
From here, your outputs variable should contain ['ད་ཆུ དངུལ་ཆུ་ཡི་མིང་གཞན།', 'ད་གདོད ད་གཟོད་དང་དོན་འདྲ།']. If you optionally wanted to write this out to a file, you could do
with open('output.txt', 'w') as file:
    file.write('\n'.join(outputs))
If you had more tokens on each line of the CSV (unclear from your post), you could iterate over those tokens within a line, but you'd be able to apply basically the same approach, as sketched below.
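For instance, here is a hedged sketch for the two-column layout actually shown in the question (word plus POS tag, whitespace-separated), reusing trans_dict and looking up only the first token:
outputs = []
with open('input.csv', 'r') as file:
    for line in file:
        term = line.strip()
        if not term:
            continue
        word = term.split()[0]  # first token is the word; the rest is the POS tag
        definition = trans_dict.get(word)
        outputs.append(term if definition is None else f"{term} {definition}")
This would yield lines like 'ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།', matching the desired output.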
The easiest solution that came to my mind would be iterating over all lines in the YAML file and checking whether the word is in the CSV file:
YAML_LINES = "ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།\nད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན\nད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན\nད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།\nད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།".split("\n")
CSV_LINES = "ད་ཆུ\nད་གདོད".split("\n")
for line in YAML_LINES:
    word, meaning = line.split(": ")
    if word in CSV_LINES:
        output = word + " " + meaning
        print(output)
The YAML_LINES and CSV_LINES lists are only there to provide a quick and dirty example; a sketch reading the real files follows.
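The same loop against actual files might look like this (the file names are assumptions, and only the first whitespace-separated token of each CSV line is used):
with open('input.csv', encoding='utf-8') as f:
    csv_words = {line.split()[0] for line in f if line.strip()}

with open('dict.yaml', encoding='utf-8') as f:
    for line in f:
        if ': ' in line:
            word, meaning = line.rstrip('\n').split(': ', 1)
            if word in csv_words:
                print(word + " " + meaning)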
I have spent 5 hours in the dark recesses of SO, so I am posting this question as a last resort, and I am genuinely hoping someone can point me in the right direction here:
Scenario:
I have some .csv files (UTF-8 CSVs: verified with the file -I command) from Google surveys that are in multiple languages. Output:
download.csv: application/csv; charset=utf-8
I have a "dictionary" file that has the translations for the questions and answers (one column is the $language and the other is English).
There are LOTS of special characters (umlauts, French accented letters, etc.) in the data from Google, because the surveys are in French, German, and Dutch
The dictionary file I built reads fine as UTF-8 including special characters and creates the find/replace keys accurately (verified with print commands)
The issue is that the Google files only read correctly (maintain the proper characters) using csv.reader in Python. However, the rows it yields do not have a .replace, so I can do one or the other:
read in the source file, make no replacements, and get a perfect copy (not what I need)
convert the csv files/rows to a fileinput/string (still UTF-8, mind) and get an utterly thrashed output file with missing replacements, because the data "loses" the encoding between the csv read and the string somehow?
The code (below) comes closest to working, except there is no .replace method on the rows from csv.reader:
import csv

# set source, output and dictionary file names
source = 'fr_to_trans.csv'
output = 'fr_translated.csv'
dictionary = 'frtrans.csv'

find = []
replace = []

# build the dictionary itself:
with open(dictionary, encoding='utf-8') as dict_file:
    for line in dict_file:
        temp_split = line.split(',')
        # "!!" stands in for a literal comma inside a field
        if "!!" in temp_split[0]:
            temp_split[0] = temp_split[0].replace("!!", ",")
        find.append(temp_split[0])
        if "!!" in temp_split[1]:
            temp_split[1] = temp_split[1].replace("!!", ",")
        replace.append(temp_split[1])

# set loop counter
check_each = len(find)

# Read in the file to parse
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    output_writer = csv.writer(t_file)
    for row in csv.reader(s_file):
        the_row = row
        print(the_row)  # THIS RETURNS THE CORRECT, FORMATTED, UTF-8 DATA
        i = 0
        # find and replace everything in the find array with its value in the replace array
        while i < check_each:
            print(find[i])
            print(replace[i])
            # THIS LINE DOES NOT WORK:
            the_row = the_row.replace(find[i], replace[i])
            i = i + 1
        output_writer.writerow(the_row)
I have to assume that even though the Google files say they are UTF-8, they are a special "Google branded UTF-8" or some such nonsense. The fact that the file opens correctly with csv.reader, but then you can do nothing to it is infuriating beyond measure.
Just to clarify what I have tried:
Treat files as text and let Python sort out the encoding (fails)
Treat files as UTF-8 text (fails)
Open file as UTF-8, replace strings, and write out using the csv.writer (fails)
Convert the_row to a string, then replace, then write out with csv.writer (fails)
Quick edit: tried utf-8-sig with strings. Better, but the output is still totally mangled, because it isn't being read as a CSV but as strings
I have not tried:
"cell by cell" comparison instead of the whole row (working on that while this percolates on SO)
Different encoding of the file (I can only get UTF-8 CSVs so would need some sort of utility?)
If these were ASCII text I would have been done ages ago, but this whole "UTF-8 that isn't but is" thing is driving me mad. Anyone got any ideas on this?
Each row yielded by csv.reader is a list of cell values like
['42', 'spam', 'eggs']
Thus the line
# THIS LINE DOES NOT WORK:
the_row = the_row.replace(find[i], replace[i])
cannot possibly work, because lists don't have a replace method.
What might work is to iterate over the row list and find/replace on each cell value (I'm assuming they are all strings):
the_row = [cell.replace(find[i], replace[i]) for cell in the_row]
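With that fix in place, the whole while loop from the question can be collapsed into something like this (a sketch reusing the question's find and replace lists):
for old, new in zip(find, replace):
    the_row = [cell.replace(old, new) for cell in the_row]
output_writer.writerow(the_row)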
However, if all you want to do is replace all instances of some characters in the file with some other characters, then it's simpler to open the file as a text file and replace without invoking any csv machinery:
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    text = s_file.read()
    for old, new in zip(find, replace):
        text = text.replace(old, new)
    t_file.write(text)
If the find/replace mapping is the same for all files, you can use str.translate to avoid the for loop. Note that str.maketrans built from a dict requires every key to be a single character (the replacement values may be longer strings), so this only works when you are replacing individual characters.
# Make a reusable translation table
trans_table = str.maketrans(dict(zip(find, replace)))

with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    text = s_file.read()
    text = text.translate(trans_table)
    t_file.write(text)
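For example, with a small hypothetical character mapping:
trans_table = str.maketrans({'é': 'e', 'ü': 'u'})
print('résumé über'.translate(trans_table))  # resume uber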
For clarity: CSVs are text files, only formatted so that their contents can be interpreted as rows and columns. If you want to manipulate their contents as pure text, it's fine to edit them as normal text files: as long as you don't change any of the characters used as delimiters or quote marks, they will still be usable as CSVs when you want to use them as such.
I am trying to import several text files into my script (in Spyder), and I want to add their contents to a list later on.
Why does
test1 = open("test1.txt")
result in test1 being a "TextIOWrapper"? How would I bring the contents over into my Python script?
Thanks in advance
You need to read the lines into your list after opening it. For example, the code should be:
with open('test1.txt') as f:
    test1 = f.readlines()
The above code will read the contents of your text file into the list test1. However, if the data in your text file spans multiple lines, each entry will end with the newline character '\n'.
To avoid this, use the refined code below:
with open('test1.txt') as f:
    test1 = [line.rstrip('\n') for line in f]
Using the Python open built-in function in this way:
with open('myfile.csv', mode='r') as rows:
    for r in rows:
        print(repr(r))
I obtain this output:
'col1,col2,col3\n'
'fst,snd,trd\n'
'1,2,3\n'
I don't want the \n character. Do you know an efficient way to remove that character (other than the obvious r.replace('\n', ''))?
If you are trying to read and parse a csv file, Python's csv module might serve better:
import csv
reader = csv.reader(open('myfile.csv', 'r'))
for row in reader:
    print(', '.join(row))
Although you cannot change the line terminator for reader here, it ends a row with either '\r' or '\n', which works for your case.
https://docs.python.org/3/library/csv.html#csv.Dialect.lineterminator
Again, for most cases, I don't think you need to parse a csv file manually. There are a few issues that make the csv module easier for you: a field containing the separator, a field containing a newline character, a field containing a quote character, etc.
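For example, a field quoted because it contains the separator comes back as a single value; a small self-contained sketch:
import csv
import io

sample = io.StringIO('a,"b,c",d\n')
print(next(csv.reader(sample)))  # ['a', 'b,c', 'd']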
You can use str.strip(), which (with no arguments) removes any whitespace from the start and end of a string:
for r in rows:
    print(r.strip())
If you want to remove only newlines, you can pass that character as an argument to strip:
for r in rows:
    print(r.strip('\n'))
For a clean solution, you could use a generator to wrap open, like this:
def open_no_newlines(*args, **kwargs):
    with open(*args, **kwargs) as f:
        for line in f:
            yield line.strip('\n')
You can then use open_no_newlines like this:
for line in open_no_newlines('myfile.csv', mode='r'):
    print(line)
I have a file that looks like this:
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
3333,CCC\nC,cccc\n
...
Where \n represents a newline.
When I read this line-by-line, it's read as:
1111,AAAA,aaaa\n
2222,BB\n
BB,bbbb\n
3333,CCC\n
C,cccc\n
...
This is a very large file. Is there a way to read a line until a specific number of delimiters, or remove the newline character within a column in Python?
I think after you read the line, you need to count the number of commas
aStr.count(',')
While the number of commas is too small (there can be more than one \n in the input), read the next line and concatenate the strings:
while aStr.count(',') < Num:
    another = file.readline()
    aStr = aStr + another
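A fuller, self-contained sketch of that idea, assuming each logical record contains exactly two commas (the file name is a placeholder):
NUM_COMMAS = 2
records = []
with open('data.txt', encoding='utf-8') as f:
    aStr = f.readline()
    while aStr:
        # keep appending physical lines until the record has enough commas
        while aStr.count(',') < NUM_COMMAS:
            another = f.readline()
            if not another:  # end of file
                break
            aStr = aStr + another
        records.append(aStr.replace('\n', '').strip())
        aStr = f.readline()
print(records)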
1111,AAAA,aaaa\n
2222,BB\nBB,bbbb\n
According to your file, the \n here is not actually a newline character; it is plain text.
For actually stripping newline characters you could use strip() or variations like rstrip() or lstrip().
If you work with large files you don't need to load the full content into memory. You can iterate line by line until some counter or anything else.
I think perhaps you are parsing a CSV file that has embedded newlines in some of the text fields. Further, I suppose that the program that created the file put quotation marks (") around those fields.
That is, I suppose that your text file actually looks like this:
1111,AAAA,aaaa
2222,"BB
BB",bbbb
3333,"CCC
C",cccc
If that is the case, you might want to use code with better CSV support than just line.split(','). Consider this program:
import csv

with open('foo.csv') as fp:
    reader = csv.reader(fp)
    for row in reader:
        print(row)
Which produces this output:
['1111', 'AAAA', 'aaaa']
['2222', 'BB\nBB', 'bbbb']
['3333', 'CCC\nC', 'cccc']
Notice how the five physical lines (delimited by newline characters) of the CSV file become three rows (some with embedded newline characters) in the parsed CSV data.
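If you then need to write the data back out, csv.writer quotes fields containing newlines, so a round trip preserves them; a short sketch (out.csv is a placeholder name):
import csv

with open('foo.csv', newline='') as fin, open('out.csv', 'w', newline='') as fout:
    csv.writer(fout).writerows(csv.reader(fin))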