I am trying to create a Python script that reads data from a text file and checks whether each line starts with a dot followed by two letters, which tells me it is a country code. I have tried split() and other methods but have not got it to work. Here is the code I have so far:
# Python program to
# demonstrate reading files
# using for loop
import re
file2 = open('contry.txt', 'w')
file3 = open('noncountry.txt', 'w')
# Opening file
file1 = open('myfile.txt', 'r')
count = 0
noncountrycount = 0
countrycounter = 0
# Using for loop
print("Using for loop")
for line in file1:
    count += 1
    pattern = re.compile(r'^\.\w{2}\s')
    if pattern.match(line):
        print(line)
        countrycounter += 1
    else:
        print("fail", line)
        noncountrycount += 1
print(noncountrycount)
print(countrycounter)
file1.close()
file2.close()
file3.close()
The txt file has this in it:
.aaa generic American Automobile Association, Inc.
.aarp generic AARP
.abarth generic Fiat Chrysler Automobiles N.V.
.abb generic ABB Ltd
.abbott generic Abbott Laboratories, Inc.
.abbvie generic AbbVie Inc.
.abc generic Disney Enterprises, Inc.
.able generic Able Inc.
.abogado generic Minds + Machines Group Limited
.abudhabi generic Abu Dhabi Systems and Information Centre
.ac country-code Internet Computer Bureau Limited
.academy generic Binky Moon, LLC
.accenture generic Accenture plc
.accountant generic dot Accountant Limited
.accountants generic Binky Moon, LLC
.aco generic ACO Severin Ahlmann GmbH & Co. KG
.active generic Not assigned
.actor generic United TLD Holdco Ltd.
.ad country-code Andorra Telecom
.adac generic Allgemeiner Deutscher Automobil-Club e.V. (ADAC)
.ads generic Charleston Road Registry Inc.
.adult generic ICM Registry AD LLC
.ae country-code Telecommunication Regulatory Authority (TRA)
.aeg generic Aktiebolaget Electrolux
.aero sponsored Societe Internationale de Telecommunications Aeronautique (SITA INC USA)
I am getting this error now
File "C:/Users/tyler/Desktop/Python Class/findcountrycodes/Test.py", line 15, in for line in file1: File "C:\Users\tyler\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032: character maps to
Is this something you were looking for?
with open('lorem.txt') as file:
    data = file.readlines()

for line in data:
    temp = line.split()[0]
    if len(temp) == 3:
        print(temp)
In short:
file.readlines() returns a list of all lines in the file; essentially it splits the file on \n.
Each of those lines is then split further on whitespace. Since the code you need comes first in the line, it is also first in the resulting list, so it only remains to check whether that first item is 3 characters long: given your consistent formatting, only a length of 3 (a dot plus two letters) indicates a country code.
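For example, with a line taken from the question's file:
>>> '.ac country-code Internet Computer Bureau Limited'.split()
['.ac', 'country-code', 'Internet', 'Computer', 'Bureau', 'Limited']
>>> len('.ac')
3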
It's usually not only an issue with the code itself, so we need all the context to reproduce, debug, and solve it.
Encoding error
The final hint was the console output (error, stacktrace) you pasted.
Read the stacktrace & research
This is how I read & analyze the error-output (Python's stacktrace):
... C:/Users/tyler/Desktop ...
... findcountrycodes/Test.py", line 15 ...
... Python36\lib\encodings\cp1252.py ...
... UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032:
From this output we can extract important contextual information to research & solve the issue:
you are using Windows
line 15 in your script Test.py points to the erroneous statement reading the file, which was opened as file1 = open('myfile.txt', 'r')
you are using Python 3.6 and the currently used encoding was Windows 1252 (cp-1252)
the root cause is a UnicodeDecodeError, a frequently occurring Python exception when reading files
You can now:
research Stackoverflow and the web for this exception: UnicodeDecodeError.
improve your question by adding this context (as keywords, tag, or dump as plain output)
Try a different encoding
One answer suggests using the nowadays common UTF-8:
open(filename, encoding="utf8")
Detect the file encoding
A methodical solution approach would be:
check the file's encoding or charset, e.g. using an editor; on Windows, Notepad or Notepad++
open the file in your Python code with the proper encoding
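Applied to the script in the question, a minimal sketch (assuming the editor reports UTF-8; substitute whatever encoding you actually find):
# Sketch: pass the detected encoding explicitly instead of relying on
# the Windows default (cp1252), which is what raised the UnicodeDecodeError.
file1 = open('myfile.txt', 'r', encoding='utf-8')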
See also:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
Get encoding of a file in Windows
Filtering lines for country-codes
So you want only the lines with country-codes.
Filtering expected
Then these 3 lines of your input file are expected to pass the filter:
.ad country-code Andorra Telecom
.ac country-code Internet Computer Bureau Limited
.ae country-code Telecommunication Regulatory Authority (TRA)
Solution using regex
As you already did, test each line of the file.
Test if the line starts with these 4 characters: .xx followed by a whitespace (where x can be any ASCII letter).
Regex explained
This regular expression tests for a valid two-letter country code:
^\.\w{2}\s
^ from the start of the string (line)
\. (first) letter should be a dot
\w{2} (followed by) any two word characters (⚠️ \w also matches digits and the underscore, so e.g. _0 would pass too)
\s (followed by) a single whitespace (blank, tab, etc.)
Python code
This is done in your code as follows (assuming line is populated from the lines read):
import re

line = '.ad '
pattern = re.compile(r'^\.\w{2}\s')
if pattern.match(line):
    print('found country-code')
Here is a runnable demo on IDEone
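Putting the pieces together with the question's two output files (a sketch: the filenames come from the question's code, and the UTF-8 encoding is an assumption; use whatever encoding the file actually has):
import re

# Sketch: compile the pattern once, then route each line to the matching
# output file from the question (note the question's own spelling 'contry.txt').
pattern = re.compile(r'^\.\w{2}\s')
with open('myfile.txt', 'r', encoding='utf-8') as infile, \
     open('contry.txt', 'w', encoding='utf-8') as country_file, \
     open('noncountry.txt', 'w', encoding='utf-8') as noncountry_file:
    for line in infile:
        if pattern.match(line):
            country_file.write(line)
        else:
            noncountry_file.write(line)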
Further Readings
Filter list with regex
Python 3 documentation: Regular Expression HOWTO
Bharath Sivakumar, on Medium (2020): Extracting Words from a string in Python using the “re” module
koenwoortman's blog (2020): Remove None values from a list in Python
You are splitting on three spaces, but the character codes are only followed by one space, so your logic is wrong.
>>> s = '.ac country-code    Internet Computer Bureau Limited'
>>> s.strip().split('   ')
['.ac country-code', ' Internet Computer Bureau Limited']
>>>
Check if the third character is not a space and the fourth character is a space.
>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
country code: .ac
>>> s = '.abogado generic Minds + Machines Group Limited'
>>> if s[2] != ' ' and s[3] == ' ':
...     print(f'country code: {s[:3]}')
... else: print('NO')
...
NO
>>>
I'm struggling to get readline() and split() to work together as I was expecting. I'm trying to use .split(')') to cut down some data from a text file and write some of that data to a new text file.
I have tried writing everything from the line.
I have tried [cnt % 2] to get what I expected.
line = fp.readline()
fw = open('output.txt', "w+")
cnt = 1
while line:
    print("Line {}: {}".format(cnt, line.strip()))
    line = fp.readline()
    line = line.split(')')[0]
    fw.write(line + "\n")
    cnt += 1
Example from the text file I'm reading from:
WELD 190 Manufacturing I Introduction to MasterCAM (3)
1½ hours lecture - 4½ hours laboratory
Note: Cross listed as DT 190/ENGR 190/IT 190
This course will introduce the students to MasterCAM and 2D and basic 3D
modeling. Students will receive instructions and drawings of parts requiring
2- or 3-axis machining. Students will design, model, program, set-up and run
their parts on various machines, including plasma cutters, water jet cutters and
milling machines.
WELD 197 Welding Technology Topics (.5 - 3)
I'm very far off from actually effectively scraping this data but I'm trying to get a start.
My goal is to extract only class name and number and remove descriptions.
Thanks as always!
I believe that to solve your current problem, if you're only attempting to parse one line at a time, you simply need to move your second line = fp.readline() to the end of the while loop. Currently you are actually starting the parsing from the second line, because you have already consumed the first line with the readline() at the top of your example code.
After the change it would look like this:
line = fp.readline()  # read in the first line
fw = open('output.txt', "w+")
cnt = 1
while line:
    print("Line {}: {}".format(cnt, line.strip()))
    line = line.split(')')[0]
    fw.write(line + "\n")
    cnt += 1
    line = fp.readline()  # read in next line after parsing done
Output for your example input text:
WELD 190 Manufacturing I Introduction to MasterCAM (3
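As a design note, iterating the file object directly sidesteps the readline() bookkeeping entirely. A minimal sketch (input.txt is a hypothetical filename, since the snippet never shows how fp was opened):
# Sketch: a for loop yields each line in turn, so there is no readline()
# call to misplace; enumerate supplies the line counter.
with open('input.txt') as fp, open('output.txt', 'w') as fw:
    for cnt, line in enumerate(fp, start=1):
        print("Line {}: {}".format(cnt, line.strip()))
        fw.write(line.split(')')[0] + "\n")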
Assuming your other class text blocks share the same structure as the one you showed, you might want to use a regular expression to extract the class name and number.
In the following I assume that every text block contains the information 'XX hours lecture' in the same order, where 'XX' stands for any kind of number (time frame). In the variable match_re I define a regular expression that matches only up to the defined spot 'XX hours lecture', and by using match.group(2) I restrict the result to the part captured by the inner pair of parentheses.
The matching expression below probably won't be complete for you yet, since I don't know your whole text file.
Below I extract the string: WELD 190 Manufacturing I Introduction to MasterCAM (3)
import re

string = "WELD 190 Manufacturing I Introduction to MasterCAM (3) 1½ hours lecture - 4½ hours laboratory Note: Cross listed as DT 190/ENGR 190/IT 190 This course will introduce the students to MasterCAM and 2D and basic 3D modeling. Students will receive instructions and drawings of parts requiring 2- or 3-axis machining. Students will design, model, program, set-up and run their parts on various machines, including plasma cutters, water jet cutters and milling machines. WELD 197 Welding Technology Topics (.5 - 3)"
match_re = r"(^(.*)\d.* hours lecture)"
match = re.search(match_re, string)
if match:
    print(match.group(2))
else:
    print("No match")
I am trying to make a new list from my existing CSV file (not using pandas).
Here is my code:
import csv

with open('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r') as case0:
    reader = csv.DictReader(case0)
    album = []
    for row in reader:
        album.append(row)

print("Number of albums is:", len(album))
The CSV file was downloaded from the Rolling Stone's Top 500 albums data set on data.world.
My logic is to create an empty list named album and collect all the records in this list. But it seems the line for row in reader has some issue.
The error message I got is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 1040: invalid continuation byte
Can anyone let me know what I did wrong?
You need to open the file in the correct codec; UTF-8 is not the correct one. The dataset doesn't specify it, but I have determined that the most likely codec is mac_roman:
with open ('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r', encoding='mac_roman') as case0:
The original Kaggle dataset doesn't bother to document it, and the various kernels that use the set all just clobber the encoding. It's clearly an 8-bit Latin variant (the majority of the data is ASCII with a few individual 8-bit codepoints).
So I analysed the data, and found there are just two such codepoints in 9 rows:
>>> import re
>>> eightbit = re.compile(rb'[\x80-\xff]')
>>> with open('albumlist.csv', 'rb') as bindata:
...     nonascii = [l for l in bindata if eightbit.search(l)]
...
>>> len(nonascii)
9
>>> {c for l in nonascii for c in eightbit.findall(l)}
{b'\x89', b'\xca'}
The 0x89 byte appears in just one line:
>>> sum(l.count(b'\x89') for l in nonascii)
1
>>> sum(l.count(b'\xca') for l in nonascii)
22
>>> next(l for l in nonascii if b'\x89' in l)
b'359,1972,Honky Ch\x89teau,Elton John,Rock,"Pop Rock,\xcaClassic Rock"\r\n'
That's clearly Elton John's 1972 Honky Château album, so the 0x89 byte must represent the U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX codepoint.
The 0xCA bytes all appear to represent an alternative space character; they all appear right after commas in the genre and subgenre columns (with one album exception):
>>> import csv
>>> for row in csv.reader((l.decode('ascii', 'backslashreplace') for l in nonascii)):
... for col in row:
... if '\\' in col: print(col)
...
Reggae,\xcaPop,\xcaFolk, World, & Country,\xcaStage & Screen
Reggae,\xcaRoots Reggae,\xcaRocksteady,\xcaContemporary,\xcaSoundtrack
Electronic,\xcaStage & Screen
Soundtrack,\xcaDisco
Rock,\xcaBlues
Blues Rock,\xcaElectric Blues,\xcaHarmonica Blues
Garage Rock,\xcaPsychedelic Rock
Honky Ch\x89teau
Pop Rock,\xcaClassic Rock
Funk / Soul,\xcaFolk, World, & Country
Rock,\xcaPop
Stan Getz\xca/\xcaJoao Gilberto\xcafeaturing\xcaAntonio Carlos Jobim
Bossa Nova,\xcaLatin Jazz
Lo-Fi,\xcaIndie Rock
These 0xCA bytes are almost certainly representing the U+00A0 NO-BREAK SPACE codepoint.
With these two mappings, you can try to determine what 8-bit codecs would make the same mapping. Rather than manually trying out all of Python's codecs, I used Tripleee's 8-bit codec mapping to see which codecs use these two mappings:
0x89 (â, U+00E2): mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish
0xca (no-break space, U+00A0): mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish
There are 6 encodings that are listed in both sets:
>>> set1 = set('mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set2 = set('mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set1 & set2
{'mac_turkish', 'mac_iceland', 'mac_romanian', 'mac_greek', 'mac_croatian', 'mac_roman'}
Of these, the Mac OS Roman mac_roman codec is probably the most likely to have been used, as Microsoft Excel for Mac used Mac Roman to create CSV files for a long time. However, it doesn't really matter; any of those 6 would work here.
You may want to replace those U+00A0 non-breaking spaces if you want to split out the genre and subgenre columns (really the genre and style columns if these were taken from Discogs).
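A minimal sketch combining both points, assuming mac_roman and reusing the question's DictReader loop:
import csv

# Sketch: decode with the detected codec, then normalise the stray
# U+00A0 no-break spaces to plain spaces in every column value.
with open('albumlist.csv', 'r', encoding='mac_roman') as case0:
    album = []
    for row in csv.DictReader(case0):
        album.append({k: v.replace('\u00a0', ' ') if v else v
                      for k, v in row.items()})

print("Number of albums is:", len(album))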
I am learning Python for data mining and I have a text file that contains a list of world cities and their coordinates. With my code, I am trying to find the coordinates of a list of cities. Everything works well until there is a city name with non-standard characters. I expected the program to skip that name and move on to the next, but instead it terminates. How can I make the program skip names it cannot find and continue to the next?
lst = ['Paris', 'London', 'Helsinki', 'Amsterdam', 'Sant Julià de Lòria',
       'New York', 'Dublin']
source = 'world.txt'
fh = open(source)
n = 0
for line in fh:
    line.rstrip()
    if lst[n] not in line:
        continue
    else:
        co = line.split(',')
        print lst[n], 'Lat: ', co[5], 'Long: ', co[6]
        if n < (len(lst)-1):
            n = n + 1
        else:
            break
The outcome of this run is:
>>>
Paris Lat: 33.180704 Long: 67.470836
London Lat: -11.758217 Long: 17.084013
Helsinki Lat: 60.175556 Long: 24.934167
Amsterdam Lat: 6.25 Long: -57.5166667
>>>
Your code has a number of issues. The following fixes most, if not all, of them, and it will not terminate when a city isn't found.
# -*- coding: iso-8859-1 -*-
from __future__ import print_function

cities = ['Paris', 'London', 'Helsinki', 'Amsterdam', 'Sant Julià de Lòria', 'New York',
          'Dublin']
SOURCE = 'world.txt'

for city in cities:
    with open(SOURCE) as fh:
        for line in fh:
            if city in line:
                fields = line.split(',')
                print(fields[0], 'Lat: ', fields[5], 'Long: ', fields[6])
                break
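If world.txt is large, a single pass that checks every city against each line avoids re-opening the file once per city. A sketch reusing the definitions (and the print_function import) from the code above:
# Sketch: read the file once and tick cities off as they are found;
# names that never match are simply skipped rather than crashing the loop.
remaining = set(cities)
with open(SOURCE) as fh:
    for line in fh:
        for city in list(remaining):
            if city in line:
                fields = line.split(',')
                print(fields[0], 'Lat: ', fields[5], 'Long: ', fields[6])
                remaining.discard(city)
        if not remaining:
            break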
It may be an encoding problem. You have to know which encoding the file "world.txt" is in.
If you do not know it, try the most commonly used encodings.
Replace the line:
fh = open(source)
with the lines:
import codecs
fh = codecs.open(source, 'r', 'utf-8')
If it still does not work, replace the 'utf-8' with 'cp1252', then with 'iso-8859-1'.
If none of these common encodings works, you have to find the encoding yourself. Try opening "world.txt" in Notepad++; this text editor is able to infer the encoding. (Not sure if Notepad++ is able to open a 3-million-line file, though.)
It is also good practice to know which encoding your own Python source file is in, and to declare it explicitly by adding a line like # -*- coding: utf-8 -*- at the beginning of your source file.
Of course, you have to specify the exact encoding your source file is actually saved in. Once again, you can determine it by opening the file in Notepad++.
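That trial-and-error can be scripted; a small sketch reusing the question's source variable (the candidate list mirrors the suggestions above, and since 'iso-8859-1' accepts every possible byte it always succeeds, so it must come last):
import codecs

# Sketch: try each candidate encoding until the whole file decodes.
for enc in ('utf-8', 'cp1252', 'iso-8859-1'):
    probe = codecs.open(source, 'r', enc)
    try:
        probe.read()
        print 'File decodes cleanly as', enc
        break
    except UnicodeDecodeError:
        pass
    finally:
        probe.close()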
I have a Python script that generates a CSV (data parsed from a website).
Here is an example of the CSV file:
File1.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
China;Beijing;BeiwaiOnline BFSU;;;
Italy;Curno;Bergamo, Anderson House;;Yes;
File2.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
This;Is;A;New;Line;;
Italy;Curno;Bergamo, Anderson House;;Yes;
As you can see,
China;Beijing;BeiwaiOnline BFSU;;; ==> this line from File1.csv is no longer present in File2.csv, and This;Is;A;New;Line;; ==> this line from File2.csv is new (it is not present in File1.csv).
I am looking for a way to compare these two CSV files (one important thing to know is that the order of the lines doesn't count ... they can be anywhere).
What I'd like to have is a script which can tell me:
- One new line: This;Is;A;New;Line;;
- One removed line: China;Beijing;BeiwaiOnline BFSU;;;
And so on!
I've tried but without any success:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
now = [row for row in c2]
past = [row for row in c1]
for row in now:
    #print row
    lol = past.index(row)
    print lol
f1.close()
f2.close()
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Any idea of the best way to proceed? Thank you so much in advance ;)
EDIT:
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
s1 = set(c1)
s2 = set(c2)
lol = s1 - s2
print type(lol)
print lol
This seems to be a good idea, but:
Traceback (most recent call last):
  File "compare.py", line 20, in <module>
    s1 = set(c1)
TypeError: unhashable type: 'list'
EDIT 2 (please don't care about what is above):
With your help, here is the script I'm writing:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import csv
### COMPARISON THING ###
x = 0
fichiers = os.listdir('/me/CSV')
for fichier in fichiers:
    if '.csv' in fichier:
        print('%s -----> %s' % (x, fichier))
        x = x + 1
choice = raw_input("Which file do you want to compare with the new output ? ->>>")
past_file = fichiers[int(choice)]
print 'We gonna compare %s to our output' % past_file

s_now = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/now.csv', 'r'), delimiter=';'))  ## OUR OUTPUT
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';'))  ## CHOOSEN ONE

added = [";".join(row) for row in s_now - s_past]  # in "now" but not in "past"
removed = [";".join(row) for row in s_past - s_now]  # in "past" but not in "now"

c = csv.writer(open("CHANGELOG.csv", "a"), delimiter=";")
line = ['AD']
for item_added in added:
    line.append(item_added)
    c.writerow(['AD', item_added])
line = ['RM']
for item_removed in removed:
    line.append(item_removed)
c.writerow(line)
Two kinds of errors:
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: line contains NULL byte
or
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: newline inside string
It was working a few minutes ago, but I've changed the CSV files to test with different data and here I am :-)
Sorry, last question!
If your data is not prohibitively large, loading each file into a set (or frozenset) will be an easy approach:
s_now = frozenset(tuple(row) for row in csv.reader(open('now.csv', 'r'), delimiter=';'))
s_past = frozenset(tuple(row) for row in csv.reader(open('past.csv', 'r'), delimiter=';'))
To get the list of entries that were added:
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
# Or, simply "added = list(s_now - s_past)" to keep them as tuples.
Similarly, the list of entries that were removed:
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
To address your updated question on why you're seeing TypeError: unhashable type: 'list': the csv reader returns each entry as a list when iterated. Lists are not hashable and therefore cannot be inserted into a set.
To address this, you'll need to convert the list entries into tuples before adding them to the set. See the previous section in my answer for an example of how this can be done.
To address the additional errors you're seeing, they are both due to the content of your CSV files.
_csv.Error: newline inside string
It looks like you have quote characters (") somewhere in the data, which confuses the parser. I'm not familiar enough with the CSV module to tell you exactly what has gone wrong, not without having a peek at your data anyway.
I did however manage to reproduce the error as such:
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";")]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: newline inside string
In this case, it can be fixed by instructing the reader not to do any special processing of quotes (see csv.QUOTE_NONE). (Do note that this will disable the handling of quoted data, whereby delimiters can appear within a quoted string without the string being split into separate entries.)
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";", quoting=csv.QUOTE_NONE)]
[['hello', 'wo', '"rld']]
_csv.Error: line contains NULL byte
I'm guessing this might be down to the encoding of your CSV files. See the following questions:
Python CSV error: line contains NULL byte
"Line contains NULL byte" in CSV reader (Python)
Read the csv files line by line into sets. Compare the sets.
>>> s1 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... China;Beijing;BeiwaiOnline BFSU;;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s2 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... This;Is;A;New;Line;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s1 - s2
set(['China;Beijing;BeiwaiOnline BFSU;;;'])
>>> s2 - s1
set(['This;Is;A;New;Line;;'])
I have to convert a number of large files (up to 2GB) encoded in EBCDIC 500 to Latin-1. Since I could only find EBCDIC-to-ASCII converters (dd, recode) and the files contain some additional proprietary character codes, I thought I'd write my own converter.
I have the character mapping so I'm interested in the technical aspects.
This is my approach so far:
import sys

# char mapping lookup table
EBCDIC_TO_LATIN1 = {
    0xC1: '41', # A
    0xC2: '42', # B
    # and so on...
}

BUFFER_SIZE = 1024 * 64
ebd_file = file(sys.argv[1], 'rb')
latin1_file = file(sys.argv[2], 'wb')
buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(ebd2latin1(buffer))
    buffer = ebd_file.read(BUFFER_SIZE)
ebd_file.close()
latin1_file.close()
This is the function that does the converting:
def ebd2latin1(ebcdic):
    result = []
    for ch in ebcdic:
        result.append(EBCDIC_TO_LATIN1[ord(ch)])
    return ''.join(result).decode('hex')
The question is whether or not this is a sensible approach from an engineering standpoint. Does it have some serious design issues? Is the buffer size OK? And so on...
As for the "proprietary characters" that some don't believe in: Each file contains a year's worth of patent documents in SGML format. The patent office has been using EBCDIC until they switched to Unicode in 2005. So there are thousands of documents within each file. They are separated by some hex values that are not part of any IBM specification. They were added by the patent office. Also, at the beginning of each file there are a few digits in ASCII that tell you about the length of the file. I don't really need that information but if I want to process the file so I have to deal with them.
Also:
$ recode IBM500/CR-LF..Latin1 file.ebc
recode: file.ebc failed: Ambiguous output in step `CR-LF..data'
Thanks for the help so far.
EBCDIC 500, aka Code Page 500, is among Python's encodings, although you link to cp1047, which isn't. Which one are you using, really? Anyway, this works for cp500 (or any other encoding that you have).
from __future__ import with_statement
import sys
from contextlib import nested
BUFFER_SIZE = 16384
with nested(open(sys.argv[1], 'rb'), open(sys.argv[2], 'wb')) as (infile, outfile):
    while True:
        buffer = infile.read(BUFFER_SIZE)
        if not buffer:
            break
        outfile.write(buffer.decode('cp500').encode('latin1'))
This way you shouldn't need to keep track of the mappings yourself.
If you set up the table correctly, then you just need to do:
translated_chars = ebcdic.translate(EBCDIC_TO_LATIN1)
where ebcdic contains EBCDIC characters and EBCDIC_TO_LATIN1 is a 256-char string which maps each EBCDIC character to its Latin-1 equivalent. The characters in EBCDIC_TO_LATIN1 are the actual binary values rather than their hex representations. For example, if you are using code page 500, the first 16 bytes of EBCDIC_TO_LATIN1 would be
'\x00\x01\x02\x03\x37\x2D\x2E\x2F\x16\x05\x25\x0B\x0C\x0D\x0E\x0F'
using this reference.
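Rather than typing all 256 bytes in by hand, the table can be derived from Python's own cp500 codec; a Python 2 sketch consistent with the translate() approach above:
# Sketch (Python 2): build the 256-byte translation table from the
# built-in cp500 codec; anything unmappable becomes '?' via 'replace'.
EBCDIC_TO_LATIN1 = ''.join(
    chr(i).decode('cp500').encode('latin1', 'replace') for i in range(256))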
While this might not help the original poster anymore, some time ago I released a package for Python 2.6+ and 3.2+ that adds most of the western 8-bit mainframe codecs, including CP1047 and CP1141: https://pypi.python.org/pypi/ebcdic. Simply import ebcdic to add the codecs and then use open(..., encoding='cp1047') to read or write files.
Answer 1:
Yet another silly question: what gave you the impression that recode produced only ASCII as output? AFAICT it will transcode ANY of its repertoire of charsets to ANY other, AND its repertoire includes IBM cp500 and cp1047, and OF COURSE latin1. Reading the comments, you will note that Lennaert and I have discovered that there aren't any "proprietary" codes in those two IBM character sets. So you may well be able to use recode after all, once you are certain which charset you've actually got.
Answer 2:
If you really need/want to transcode IBM cp1047 via Python, you might like to first get the mapping from an authoritative source, processing it via a script with some checks:
URL = "http://source.icu-project.org/repos/icu/data/trunk/charset/data/ucm/glibc-IBM1047-2.1.2.ucm"
"""
Sample lines:
<U0000> \x00 |0
<U0001> \x01 |0
<U0002> \x02 |0
<U0003> \x03 |0
<U0004> \x37 |0
<U0005> \x2D |0
"""
import urllib, re

text = urllib.urlopen(URL).read()
regex = r"<U([0-9a-fA-F]{4,4})>\s+\\x([0-9a-fA-F]{2,2})\s"
results = re.findall(regex, text)
wlist = [None] * 256
for result in results:
    unum, inum = [int(x, 16) for x in result]
    assert wlist[inum] is None
    assert 0 <= unum <= 255
    wlist[inum] = chr(unum)
assert not any(x is None for x in wlist)
print repr(''.join(wlist))
Then carefully copy/paste the output into your transcoding script for use with Vinay's buffer.translate(the_mapping) idea, with a buffer size perhaps a bit larger than 16KB and certainly a bit smaller than 2GB :-)
No crystal ball, no info from OP, so had a bit of a rummage in the EPO website. Found freely downloadable weekly patent info files, still available in cp500/SGML even though website says this to be replaced by utf8/XML in 2006 :-). Got the 2009 week 27 file. Is a zip containing 2 files s350927[ab].bin. "bin" means "not XML". Got the spec! Looks possible that "proprietary codes" are actually BINARY fields. Each record has a fixed 252-byte header. First 5 bytes are record length in EBCDIC e.g. hex F0F2F2F0F8 -> 2208 bytes. Last 2 bytes of the fixed header are the BINARY length (redundant) of the following variable part. In the middle are several text fields, two 2-byte binary fields, and one 4-byte binary field. The binary fields are serial numbers within groups, but all I saw are 1. The variable part is SGML.
Example (last record from s350927b.bin):
Record number: 7266
pprint of header text and binary slices:
['EPB102055619 TXT00000001',
1,
' 20090701200927 08013627.8 EP20090528NN ',
1,
1,
' T *lots of spaces snipped*']
Edited version of the rather long SGML:
<PATDOC FILE="08013627.8" CY=EP DNUM=2055619 KIND=B1 DATE=20090701 STATUS=N>
*snip*
<B541>DE<B542>Windschutzeinheit für ein Motorrad
<B541>EN<B542>Windshield unit for saddle-ride type vehicle
<B541>FR<B542>Unité pare-brise pour motocyclette</B540>
*snip*
</PATDOC>
There are no header or trailer records, just this one record format.
So: if the OP's annual files are anything like this, we might be able to help him out.
Update: Above was the "2 a.m. in my timezone" version. Here's a bit more info:
OP said: "at the beginning of each file there are a few digits in ASCII that tell you about the length of the file." ... translate that to "at the beginning of each record there are five digits in EBCDIC that tell you exactly the length of the record" and we have a (very fuzzy) match!
Here is the URL of the documentation page: http://docs.epoline.org/ebd/info.htm
The FIRST file mentioned is the spec.
Here is the URL of the download-weekly-data page: http://ebd2.epoline.org/jsp/ebdst35.jsp
An observation: The data that I looked at is in the ST.35 series. There is also available for download ST.32 which appears to be a parallel version containing only the SGML content (in "reduced cp437/850", one tag per line). This indicates that the fields in the fixed-length header of the ST.35 records may not be very interesting, and can thus be skipped over, which would greatly simplify the transcoding task.
For what it's worth, here is my (investigatory, written after midnight) code:
[Update 2: tidied up the code a little; no functionality changes]
from pprint import pprint as pp
import sys
from struct import unpack

HDRSZ = 252
T = '>s' # text
H = '>H' # binary 2 bytes
I = '>I' # binary 4 bytes
hdr_defn = [
    6, T,
    38, H,
    40, T,
    94, I,
    98, H,
    100, T,
    251, H, # length of following SGML text
    HDRSZ + 1,
    ]
# above positions as per spec, reduce to allow for counting from 1
for i in xrange(0, len(hdr_defn), 2):
    hdr_defn[i] -= 1

def records(fname, output_encoding='latin1', debug=False):
    xlator = ''.join(chr(i).decode('cp500').encode(output_encoding, 'replace') for i in range(256))
    # print repr(xlator)
    def xlate(ebcdic):
        return ebcdic.translate(xlator)
        # return ebcdic.decode('cp500') # use this if unicode output desired
    f = open(fname, 'rb')
    recnum = -1
    while True:
        # get header
        buff = f.read(HDRSZ)
        if not buff:
            return # EOF
        recnum += 1
        if debug: print "\nrecnum", recnum
        assert len(buff) == HDRSZ
        recsz = int(xlate(buff[:5]))
        if debug: print "recsz", recsz
        # split remainder of header into text and binary pieces
        fields = []
        for i in xrange(0, len(hdr_defn) - 2, 2):
            ty = hdr_defn[i + 1]
            piece = buff[hdr_defn[i]:hdr_defn[i+2]]
            if ty == T:
                fields.append(xlate(piece))
            else:
                fields.append(unpack(ty, piece)[0])
        if debug: pp(fields)
        sgmlsz = fields.pop()
        if debug: print "sgmlsz: %d; expected: %d - %d = %d" % (sgmlsz, recsz, HDRSZ, recsz - HDRSZ)
        assert sgmlsz == recsz - HDRSZ
        # get sgml part
        sgml = f.read(sgmlsz)
        assert len(sgml) == sgmlsz
        sgml = xlate(sgml)
        if debug: print "sgml", sgml
        yield recnum, fields, sgml

if __name__ == "__main__":
    maxrecs = int(sys.argv[1]) # dumping out the last `maxrecs` records in the file
    fname = sys.argv[2]
    keep = [None] * maxrecs
    for recnum, fields, sgml in records(fname):
        # do something useful here
        keep[recnum % maxrecs] = (recnum, fields, sgml)
    keep.sort()
    for k in keep:
        if k:
            recnum, fields, sgml = k
            print
            print recnum
            pp(fields)
            print sgml
Assuming cp500 contains all of your "additional proprietary characters", here is a more concise version based on Lennart's answer, using the codecs module:
import sys, codecs

BUFFER_SIZE = 64 * 1024
ebd_file = codecs.open(sys.argv[1], 'r', 'cp500')
latin1_file = codecs.open(sys.argv[2], 'w', 'latin1')
buffer = ebd_file.read(BUFFER_SIZE)
while buffer:
    latin1_file.write(buffer)
    buffer = ebd_file.read(BUFFER_SIZE)
ebd_file.close()
latin1_file.close()
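For what it's worth, on Python 3 the codecs module is unnecessary here, since the built-in open() accepts an encoding argument; a minimal sketch:
import sys

# Sketch (Python 3): open() decodes cp500 on read and encodes latin1 on write.
BUFFER_SIZE = 64 * 1024
with open(sys.argv[1], 'r', encoding='cp500') as ebd_file, \
     open(sys.argv[2], 'w', encoding='latin1') as latin1_file:
    while True:
        chunk = ebd_file.read(BUFFER_SIZE)
        if not chunk:
            break
        latin1_file.write(chunk)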