I am learning Python for data mining and I have a text file that contains a list of world cities and their coordinates. With my code, I am trying to find the coordinates of a list of cities. Everything works well until there is a city name with non-standard characters. I expect the program to skip that name and move on to the next, but instead it terminates. How can I make the program skip names it cannot find and continue to the next?
lst = ['Paris', 'London', 'Helsinki', 'Amsterdam', 'Sant Julià de Lòria',
       'New York', 'Dublin']
source = 'world.txt'
fh = open(source)
n = 0
for line in fh:
    line.rstrip()
    if lst[n] not in line:
        continue
    else:
        co = line.split(',')
        print lst[n], 'Lat: ', co[5], 'Long: ', co[6]
        if n < (len(lst)-1):
            n = n + 1
        else:
            break
The outcome of this run is:
>>>
Paris Lat: 33.180704 Long: 67.470836
London Lat: -11.758217 Long: 17.084013
Helsinki Lat: 60.175556 Long: 24.934167
Amsterdam Lat: 6.25 Long: -57.5166667
>>>
Your code has a number of issues. The following fixes most, if not all, of them — and should never terminate when a city isn't found.
# -*- coding: iso-8859-1 -*-
from __future__ import print_function

cities = ['Paris', 'London', 'Helsinki', 'Amsterdam', 'Sant Julià de Lòria',
          'New York', 'Dublin']
SOURCE = 'world.txt'

for city in cities:
    with open(SOURCE) as fh:
        for line in fh:
            if city in line:
                fields = line.split(',')
                print(fields[0], 'Lat: ', fields[5], 'Long: ', fields[6])
                break
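The break in the inner loop is what prevents the crash; if you also want to report cities that were not found, Python's for ... else fits naturally (the else branch runs only when the loop finishes without break). A sketch with made-up inline sample lines standing in for world.txt (same assumed field layout: name at index 0, latitude at 5, longitude at 6):

```python
# Hypothetical sample data in the same layout as world.txt.
sample = [
    'Paris,FR,x,x,x,48.8567,2.3508',
    'London,GB,x,x,x,51.5072,-0.1275',
]

for city in ['Paris', 'Oslo']:
    for line in sample:
        if city in line:
            fields = line.split(',')
            print(fields[0], 'Lat:', fields[5], 'Long:', fields[6])
            break
    else:
        # no break happened: the city was not found in any line
        print(city, 'not found, skipping')
```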
It may be an encoding problem. You have to know which encoding the file "world.txt" is in.
If you do not know, try the most commonly used encodings.
Replace the line :
fh = open(source)
With the lines :
import codecs
fh = codecs.open(source, 'r', 'utf-8')
If it still does not work, replace 'utf-8' with 'cp1252', then with 'iso-8859-1'.
If none of these common encodings works, you will have to determine the encoding yourself. Try opening "world.txt" in Notepad++; this text editor can infer the encoding. (Not sure whether Notepad++ can open a 3-million-line file, though.)
It is also good practice to know which encoding your own Python source file is in, and to declare it explicitly by adding a line like # -*- coding: utf-8 -*- at the beginning of the file.
Of course, you have to specify the encoding your source file is actually in. Once again, you can determine it by opening the file in Notepad++.
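The try-the-common-encodings advice can be automated. A sketch (Python 3 syntax; the candidate list is just the encodings named above, and iso-8859-1 goes last because it accepts any byte sequence and therefore never fails):

```python
def read_with_fallback(path, encodings=('utf-8', 'cp1252', 'iso-8859-1')):
    """Return (text, encoding) for the first candidate encoding that decodes
    the whole file without raising UnicodeDecodeError."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as fh:
                return fh.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit: %r' % (encodings,))
```

Note that this only proves the file *decodes* under that encoding, not that the result is the text the author intended; for ambiguous cases you still need to eyeball the output.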
Related
I am trying to create a Python script that reads data from a text file and checks whether each line has a dot followed by two letters, which tells me it is a country code. I have tried using split and other methods but have not got it to work. Here is the code I have so far:
# Python program to
# demonstrate reading files
# using for loop
import re

file2 = open('contry.txt', 'w')
file3 = open('noncountry.txt', 'w')

# Opening file
file1 = open('myfile.txt', 'r')
count = 0
noncountrycount = 0
countrycounter = 0

# Using for loop
print("Using for loop")
for line in file1:
    count += 1
    pattern = re.compile(r'^\.\w{2}\s')
    if pattern.match(line):
        print(line)
        countrycounter += 1
    else:
        print("fail", line)
        noncountrycount += 1

print(noncountrycount)
print(countrycounter)
file1.close()
file2.close()
file3.close()
The txt file has this in it
.aaa generic American Automobile Association, Inc.
.aarp generic AARP
.abarth generic Fiat Chrysler Automobiles N.V.
.abb generic ABB Ltd
.abbott generic Abbott Laboratories, Inc.
.abbvie generic AbbVie Inc.
.abc generic Disney Enterprises, Inc.
.able generic Able Inc.
.abogado generic Minds + Machines Group Limited
.abudhabi generic Abu Dhabi Systems and Information Centre
.ac country-code Internet Computer Bureau Limited
.academy generic Binky Moon, LLC
.accenture generic Accenture plc
.accountant generic dot Accountant Limited
.accountants generic Binky Moon, LLC
.aco generic ACO Severin Ahlmann GmbH & Co. KG
.active generic Not assigned
.actor generic United TLD Holdco Ltd.
.ad country-code Andorra Telecom
.adac generic Allgemeiner Deutscher Automobil-Club e.V. (ADAC)
.ads generic Charleston Road Registry Inc.
.adult generic ICM Registry AD LLC
.ae country-code Telecommunication Regulatory Authority (TRA)
.aeg generic Aktiebolaget Electrolux
.aero sponsored Societe Internationale de Telecommunications Aeronautique (SITA INC USA)
I am getting this error now
File "C:/Users/tyler/Desktop/Python Class/findcountrycodes/Test.py", line 15, in <module>
    for line in file1:
  File "C:\Users\tyler\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032: character maps to <undefined>
Is this something you were looking for?
with open('lorem.txt') as file:
    data = file.readlines()

for line in data:
    temp = line.split()[0]
    if len(temp) == 3:
        print(temp)
In short:
file.readlines() returns a list of all lines in the file; essentially it splits the file on \n.
Each of those lines is then split further on whitespace. Since the code you need comes first on the line, it is also first in the resulting list, so you check whether that first item is 3 characters long. Because your formatting seems consistent, only a country code will have a length of 3.
Problems like this are usually not only an issue with the code, so we need the full context to reproduce, debug and solve them.
Encoding error
The final hint was the console output (error, stacktrace) you pasted.
Read the stacktrace & research
This is how I read & analyze the error-output (Python's stacktrace):
... C:/Users/tyler/Desktop ...
... findcountrycodes/Test.py", line 15 ...
... Python36\lib\encodings\cp1252.py ...
... UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 8032: character maps to <undefined>
From this output we can extract important contextual information to research & solve the issue:
you are using Windows
line 15 in your script Test.py points to the erroneous statement reading the file, for line in file1: (the file was opened without an explicit encoding: file1 = open('myfile.txt', 'r'))
you are using Python 3.6 and the currently used encoding was Windows 1252 (cp-1252)
the root cause is UnicodeDecodeError, a frequently occurring Python exception when reading files
You can now:
research Stack Overflow and the web for this exception: UnicodeDecodeError
improve your question by adding this context (as keywords, tag, or dump as plain output)
Try a different encoding
One answer suggests using the nowadays-common UTF-8:
open(filename, encoding="utf8")
Detect the file encoding
A methodical solution approach would be:
check the file's encoding or charset, e.g. using an editor; on Windows, Notepad or Notepad++
open the file in your Python code with the proper encoding
See also:
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
Get encoding of a file in Windows
Filtering lines for country-codes
So you want only the lines with country-codes.
Filtering expected
Then these 3 lines of your input file are expected to pass the filter:
.ad country-code Andorra Telecom
.ac country-code Internet Computer Bureau Limited
.ae country-code Telecommunication Regulatory Authority (TRA)
Solution using regex
As you already did, test each line of the file.
Test if the line starts with these 4 characters: .xx plus a whitespace (where xx can be any ASCII letter).
Regex explained
This regular expression tests for a valid two-letter country code:
^\.\w{2}\s
^ from the start of the string (line)
\. (first) letter should be a dot
\w{2} (followed by) any two word-characters (⚠️ also matches _0)
\s (followed by) a single whitespace (blank, tab, etc.)
Python code
This is done in your code as follows (assuming line is populated from the lines read):
import re

line = '.ad '
pattern = re.compile(r'^\.\w{2}\s')
if pattern.match(line):
    print('found country-code')
Here is a runnable demo on IDEone
Further Readings
Filter list with regex
Python 3 documentation: Regular Expression HOWTO
Bharath Sivakumar, on Medium (2020): Extracting Words from a string in Python using the “re” module
koenwoortman's blog (2020): Remove None values from a list in Python
You are splitting on three spaces, but the country codes are only followed by one space, so your logic is wrong.
>>> s = '.ac country-code Internet Computer Bureau Limited'
>>> s.strip().split('   ')
['.ac country-code', ' Internet Computer Bureau Limited']
>>>
Check if the third character is not a space and the fourth character is a space.
>>> if s[2] != ' ' and s[3] == ' ':
... print(f'country code: {s[:3]}')
... else: print('NO')
...
country code: .ac
>>> s = '.abogado generic Minds + Machines Group Limited'
>>> if s[2] != ' ' and s[3] == ' ':
... print(f'country code: {s[:3]}')
... else: print('NO')
...
NO
>>>
I know this is a common beginner issue and there are a ton of questions like this here on Stack Exchange, and I've been searching through them, but I still can't figure this out. I have some data from a scrape that looks like this (about 1000 items in the list):
inputList = [[u'someplace', u'3901 West Millen Drive', u'Hobbs', u'NH',
              u'88240', u'37.751117', u'-103.187709999'],
             [u'\u0100lon someplace', u'3120 S Las Vegas Blvd', u'Las Duman',
              u'AL', u'89109', u'36.129066', u'-145.168791']]
I'm trying to write it to a csv file like this:
for i in inputList:
    for ii in i:
        ii.replace(" u'\u2019'", "")  # just trying to get rid of the offending character
        ii.encode("utf-8")

def csvWrite(inList, outFile):
    import csv
    destination = open(outFile, 'w')
    writer = csv.writer(destination, delimiter=',')
    data = inList
    writer.writerows(data)
    destination.close()

csvWrite(inputList, output)
but I keep getting this error on writer.writerows(data):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 5: ordinal not in range(128)
I've tried a bunch of different things to encode the data in the list but still always get the error. I'm open to just ignoring the characters that can't be encoded to ASCII. Can anyone point me in the right direction? I'm using Python 2.6.
This line seems strange: ii.replace(" u'\u2019'", ""). Did you mean ii.replace(u"\u2019", u"")?
If you just want to remove those bad characters, you could use this code instead:
for i in inputList:
    for ii in i:
        ii = "".join(list(filter((lambda x: ord(x) < 128), ii)))
        print ii
Output:
someplace
3901 West Millen Drive
Hobbs
NH
88240
37.751117
-103.187709999
lon someplace
3120 S Las Vegas Blvd
Las Duman
AL
89109
36.129066
-145.168791
The final code will look like this:
inputList = [[u'someplace', u'3901 West Millen Drive', u'Hobbs', u'NH',
              u'88240', u'37.751117', u'-103.187709999'],
             [u'\u0100lon someplace', u'3120 S Las Vegas Blvd', u'Las Duman',
              u'AL', u'89109', u'36.129066', u'-145.168791']]

cleared_inputList = []
for i in inputList:
    c_i = []
    for ii in i:
        ii = "".join(list(filter((lambda x: ord(x) < 128), ii)))
        c_i.append(ii)
    cleared_inputList.append(c_i)

def csvWrite(inList, outFile):
    import csv
    destination = open(outFile, 'w')
    writer = csv.writer(destination, delimiter=',')
    data = inList
    writer.writerows(data)
    destination.close()

csvWrite(cleared_inputList, output)
I'm trying to edit a CSV file using information from another one. That doesn't seem simple to me, as I have to filter on multiple things. Let me explain my problem.
I have two CSV files, let's say patch.csv and origin.csv. Output csv file should have the same pattern as origin.csv, but with corrected values.
I want to replace trip_headsign column fields in origin.csv using forward_line_name column in patch.csv if direction_id field in origin.csv row is 0, or using backward_line_name if direction_id is 1.
I want to do this only if the part of the line_id value in patch.csv between ":" and ":" symbols is the same as the part of route_id value in origin.csv before the ":" symbol.
I know how to replace a whole line, but not just parts of it, especially since I sometimes have to look at only part of a value.
Here is a sample of origin.csv:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id
210210109:001,2913,70405957139549,70405957,0,
210210109:001,2916,70405961139553,70405961,1,
and a sample of patch.csv:
line_id,line_code,line_name,forward_line_name,forward_direction,backward_line_name,backward_direction,line_color,line_sort,network_id,commercial_mode_id,contributor_id,geometry_id,line_opening_time,line_closing_time
OIF:100110010:10OIF439,10,Boulogne Pont de Saint-Cloud - Gare d'Austerlitz,BOULOGNE / PONT DE ST CLOUD - GARE D'AUSTERLITZ,OIF:SA:8754700,GARE D'AUSTERLITZ - BOULOGNE / PONT DE ST CLOUD,OIF:SA:59400,DFB039,91,OIF:439,metro,OIF,geometry:line:100110010:10,05:30:00,25:47:00
OIF:210210109:001OIF30,001,FFOURCHES LONGUEVILLE PROVINS,Place Mérot - GARE DE LONGUEVILLE,,GARE DE LONGUEVILLE - Place Mérot,OIF:SA:63:49,000000 1,OIF:30,bus,OIF,,05:39:00,19:50:00
Each file has hundred of lines I need to parse and edit this way.
Separator is comma in my csv files.
Based on mhopeng's answer to a previous question, I obtained this code:
#!/usr/bin/env python2
from __future__ import print_function
import fileinput
import sys

# first get the route info from patch.csv
f = open(sys.argv[1])
d = open(sys.argv[2])

# ignore header line
#line1 = f.readline()
#line2 = d.readline()

# get line of data
for line1 in f.readline():
    line1 = f.readline().split(',')
    route_id = line1[0].split(':')[1]  # '210210109'
    route_forward = line1[3]
    route_backward = line1[5]
    line_code = line1[1]

    # process origin.csv and replace lines in-place
    for line in fileinput.input(sys.argv[2], inplace=1):
        line2 = d.readline().split(',')
        num_route = line2[0].split(':')[0]
        # prevent lines with same route_id but different line_code to be considered as the same line
        if line.startswith(route_id) and (num_route == line_code):
            if line.startswith(route_id):
                newline = line.split(',')
                if newline[4] == 0:
                    newline[3] = route_backward
                else:
                    newline[3] = route_forward
                print('\t'.join(newline), end="")
        else:
            print(line, end="")
But unfortunately, that doesn't push the right forward or backward_line_name into trip_headsign (forward is always used), the condition comparing patch.csv's line_code to the end of origin.csv's route_id (after the ":") doesn't work, and the script finally raises this error before it finishes parsing the file:
Traceback (most recent call last):
  File "./GTFS_enhancer_headsigns.py", line 28, in <module>
    if newline[4] == 0:
IndexError: list index out of range
Could you please help me fixing these three problems?
Thanks for your help :)
You really should consider using the Python csv module instead of split().
From experience, everything is much easier when working with CSV files through the csv module.
This way you can iterate through the dataset in a structured way, without the risk of index-out-of-range errors.
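A minimal sketch of that structured iteration (Python 3 shown; the header and sample row are taken from origin.csv above, but the actual matching/patching logic is left out):

```python
import csv
import io

# Sketch only: csv.DictReader keys each row by the header line, so fields are
# accessed by name instead of fragile numeric indexes.
origin = io.StringIO(
    "route_id,service_id,trip_id,trip_headsign,direction_id,block_id\n"
    "210210109:001,2913,70405957139549,70405957,0,\n"
)
for row in csv.DictReader(origin):
    prefix = row['route_id'].split(':')[0]  # part of route_id before the ':'
    direction = row['direction_id']         # note: a string ('0'/'1'), not an int
    print(prefix, direction)
```

Comparing direction_id against the string '0' rather than the integer 0 also sidesteps the newline[4] == 0 bug in the question: fields read from a text file are strings, so the integer comparison is never true.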
I've tried to convert a CSV file encoded in UTF-16 (exported by another program) to a simple array in Python 2.7, with very little luck.
Here's the nearest solution I've found:
import csv
from io import BytesIO

with open('c:\\pfm\\bdh.txt', 'rb') as f:
    x = f.read().decode('UTF-16').encode('UTF-8')

for line in csv.reader(BytesIO(x)):
    print line
This code returns:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
['1\tnom1\tetq1\text1 ...
What I'm trying to get it's something like this:
[['','Nombre','Etiqueta','Extensión de archivo','Tamaño lógico','Categoría']
['1','nom1','etq1','ext1','123','cat1']
['2','nom2','etq2','ext2','456','cat2']]
So, I'd need to convert those hexadecimal escapes to Latin characters (such as á, é, í, ó, ú or ñ), and to split those tab-separated strings into array fields.
Do I really need to use dictionaries for the first part? I think there should be an easier solution, as I can see and type all these characters on my keyboard.
For the second part, I think the CSV library won't help in this case, as I read it can't manage UTF-16 yet.
Could you give me a hand? Thank you!
ITEM #1: The hexadecimal characters
You are getting the:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
output because you are printing a list. A list prints the representation (repr()) of each item; that is, it is the equivalent of:
print('[{0}]'.format(','.join([repr(item) for item in lst])))
If you use print(line[0]) you will get the output of the line.
ITEM #2: The output
The problem here is that the csv parser is not parsing the content as a tab-separated file, but as a comma-separated file. You can fix this by using:
for line in csv.reader(BytesIO(x), delimiter='\t'):
    print(line)
instead.
This will give you the desired result.
Processing a UTF-16 file with the csv module in Python 2 can indeed be a pain. Re-encoding to UTF-8 works, but you then still need to decode the resulting columns to produce unicode values.
Note also that your data appears to be tab delimited; the csv.reader() by default uses commas, not tabs, to separate columns. You'll need to configure it to use tabs instead by setting delimiter='\t' when constructing the reader.
Use io.open() to read UTF-16 and produce unicode lines. You can then use codecs.iterencode() to translate the decoded unicode values from the UTF-16 file to UTF-8.
To decode the rows back to unicode values, you could use an extra generator to do so as you iterate:
import csv
import codecs
import io

def row_decode(reader, encoding='utf8'):
    for row in reader:
        yield [col.decode('utf8') for col in row]

with io.open('c:\\pfm\\bdh.txt', encoding='utf16') as f:
    wrapped = codecs.iterencode(f, 'utf8')
    reader = csv.reader(wrapped, delimiter='\t')
    for row in row_decode(reader):
        print row
Each line will still use repr() on each contained value, which means that you'll see Python string literal syntax to represent strings. Any non-printable or non-ASCII codepoint will be represented by an escape code:
>>> [u'', u'Nombre', u'Etiqueta', u'Extensión de archivo', u'Tamaño lógico', u'Categoría']
[u'', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
This is normal; the output is meant to be useful as a debugging aid and can be pasted back into any Python session to reproduce the original value, without worrying about terminal encodings.
For example, ó is represented as \xf3, representing the Unicode codepoint U+00F3 LATIN SMALL LETTER O WITH ACUTE. If you were to print this one column, Python will encode the Unicode string to bytes matching your terminal encoding, resulting in your terminal showing you the correct string again:
>>> u'Extensi\xf3n de archivo'
u'Extensi\xf3n de archivo'
>>> print u'Extensi\xf3n de archivo'
Extensión de archivo
Demo:
>>> import csv, codecs, io
>>> io.open('/tmp/demo.csv', 'w', encoding='utf16').write(u'''\
... \tNombre\tEtiqueta\tExtensi\xf3n de archivo\tTama\xf1o l\xf3gico\tCategor\xeda
... ''')
63L
>>> def row_decode(reader, encoding='utf8'):
... for row in reader:
... yield [col.decode('utf8') for col in row]
...
>>> with io.open('/tmp/demo.csv', encoding='utf16') as f:
... wrapped = codecs.iterencode(f, 'utf8')
... reader = csv.reader(wrapped, delimiter='\t')
... for row in row_decode(reader):
... print row
...
[u' ', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
>>> # the row is displayed using repr() for each column; the values are correct:
...
>>> print row[3], row[4], row[5]
Extensión de archivo Tamaño lógico Categoría
I have a Python script that generates a CSV file (data parsed from a website).
Here is an example of the CSV file:
File1.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
China;Beijing;BeiwaiOnline BFSU;;;
Italy;Curno;Bergamo, Anderson House;;Yes;
File2.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
This;Is;A;New;Line;;
Italy;Curno;Bergamo, Anderson House;;Yes;
As you can see, the line China;Beijing;BeiwaiOnline BFSU;;; from File1.csv is no longer present in File2.csv, and the line This;Is;A;New;Line;; from File2.csv is new (it is not present in File1.csv).
I am looking for a way to compare these two CSV files (one important thing to know is that the order of the lines doesn't matter; they can be anywhere).
What I'd like to have is a script which can tell me:
- One new line : This;Is;A;New;Line;;
- One removed line : China;Beijing;BeiwaiOnline BFSU;;;
And so on ... !
I've tried this, but without any success:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import csv

f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)

now = [row for row in c2]
past = [row for row in c1]

for row in now:
    #print row
    lol = past.index(row)
    print lol

f1.close()
f2.close()
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Any idea of the best way to proceed? Thank you so much in advance ;)
EDIT:
import csv

f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)

s1 = set(c1)
s2 = set(c2)

lol = s1 - s2
print type(lol)
print lol
This seems to be a good idea, but:
Traceback (most recent call last):
  File "compare.py", line 20, in <module>
    s1 = set(c1)
TypeError: unhashable type: 'list'
EDIT 2 (please disregard what is above):
With your help, here is the script I'm writing:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import csv

### COMPARISON THING ###
x = 0
fichiers = os.listdir('/me/CSV')
for fichier in fichiers:
    if '.csv' in fichier:
        print('%s -----> %s' % (x, fichier))
        x = x + 1

choice = raw_input("Which file do you want to compare with the new output ? ->>>")
past_file = fichiers[int(choice)]
print 'We gonna compare %s to our output' % past_file

s_now = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/now.csv', 'r'), delimiter=';'))  ## OUR OUTPUT
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/' + past_file, 'r'), delimiter=';'))  ## CHOSEN ONE
added = [";".join(row) for row in s_now - s_past]  # in "now" but not in "past"
removed = [";".join(row) for row in s_past - s_now]  # in "past" but not in "now"

c = csv.writer(open("CHANGELOG.csv", "a"), delimiter=";")
line = ['AD']
for item_added in added:
    line.append(item_added)
    c.writerow(['AD', item_added])

line = ['RM']
for item_removed in removed:
    line.append(item_removed)
    c.writerow(line)
Two kinds of errors:
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: line contains NULL byte
or
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: newline inside string
It was working a few minutes ago, but I've changed the CSV files to test with different data, and here I am :-)
Sorry, last question!
If your data is not prohibitively large, loading it into a set (or frozenset) will be an easy approach:
s_now = frozenset(tuple(row) for row in csv.reader(open('now.csv', 'r'), delimiter=';'))
s_past = frozenset(tuple(row) for row in csv.reader(open('past.csv', 'r'), delimiter=';'))
To get the list of entries that were added:
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
# Or, simply "added = list(s_now - s_past)" to keep them as tuples.
similarly, list of entries that were removed:
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
To address your updated question on why you're seeing TypeError: unhashable type: 'list': the csv reader returns each entry as a list when iterated. Lists are not hashable and therefore cannot be inserted into a set.
To address this, you'll need to convert the list entries into tuples before adding them to the set. See the previous section of my answer for an example of how this can be done.
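A minimal illustration of the difference, independent of csv:

```python
# csv.reader yields a list per row; lists are mutable and therefore unhashable.
rows = [['a', 'b'], ['a', 'b'], ['c', 'd']]

# set(rows) would raise TypeError: unhashable type: 'list'.
# Converting each row to an (immutable, hashable) tuple fixes it:
unique = set(tuple(row) for row in rows)
assert unique == {('a', 'b'), ('c', 'd')}
```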
To address the additional errors you're seeing, they are both due to the content of your CSV files.
_csv.Error: newline inside string
It looks like you have quote characters (") somewhere in the data, which confuses the parser. I'm not familiar enough with the csv module to tell you exactly what has gone wrong, not without having a peek at your data anyway.
I did however manage to reproduce the error as such:
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";")]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: newline inside string
In this case, it can be fixed by instructing the reader not to do any special processing of quotes (see csv.QUOTE_NONE). (Do note that this disables the handling of quoted data, whereby delimiters can appear within a quoted string without the string being split into separate entries.)
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";", quoting=csv.QUOTE_NONE)]
[['hello', 'wo', '"rld']]
_csv.Error: line contains NULL byte
I'm guessing this might be down to the encoding of your CSV files. See the following questions:
Python CSV error: line contains NULL byte
"Line contains NULL byte" in CSV reader (Python)
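A NULL byte in a supposedly plain-text file often means the file is actually UTF-16-encoded (in UTF-16, every ASCII character is paired with a 0x00 byte). A quick heuristic check, offered as a sketch rather than a definitive diagnosis:

```python
def looks_utf16(path):
    """Heuristic: a UTF-16 BOM, or NUL bytes in the first kilobyte, suggest the
    file is UTF-16 rather than an 8-bit encoding."""
    with open(path, 'rb') as fh:
        head = fh.read(1024)
    return head.startswith((b'\xff\xfe', b'\xfe\xff')) or b'\x00' in head
```

If it returns True, decode the file explicitly (e.g. with codecs.open(path, 'r', 'utf-16')) before handing the lines to csv.reader.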
Read the csv files line by line into sets. Compare the sets.
>>> s1 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... China;Beijing;BeiwaiOnline BFSU;;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s2 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... This;Is;A;New;Line;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s1 - s2
set(['China;Beijing;BeiwaiOnline BFSU;;;'])
>>> s2 - s1
set(['This;Is;A;New;Line;;'])