I have some Twitter feeds loaded into a Pandas Series that I want to store in HDF5 format. Here's a sample:
>>> feeds[80:90]
80 BØR MAN STARTE en tweet med store bokstaver? F...
81 #NRKSigrid #audunlysbakken Har du husket Per S...
82 Lurer på om IS har fått med seg kaoset ved Eur...
83 synes han hørte på P3 at Opoku uttales Opoko. ...
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider #partiet...
86 MDG: Hasj for kjøtt. #valg2015
87 Grønt skifte.. https://t.co/OuM8quaMz0
88 Kinderegg https://t.co/AsECmw2sV9
89 MDG for honning, frukt og grønt. https://t.co/...
Name: feeds, dtype: object
Whenever I try to load the above data from a saved HDF5 file, some values are missing and replaced by ''... and the same values reappear when I change the indexing. For example, storing rows with index 84-85:
>>> store = pd.HDFStore('feed.hd5')
>>> store.append('feed', feeds[84:86], min_itemsize=200, encoding='utf-8')
>>> store.close()
When I read the file back, the value of the 84th row is missing:
>>> pd.read_hdf('feed.hd5', 'feed')
84
85 April 2014. Blir MDG det nye arbeider #partiet...
Name: feeds, dtype: object
I get the same output as above if I do it this way too:
>>> feeds[84:86].to_hdf('feed.hd5', 'feed', format='table', data_columns=True)
>>> pd.read_hdf('feed.hd5', 'feed')
But if I change the slice from [84:86] to, say, [84:87], the 84th row is now loaded:
>>> feeds[84:87].to_hdf('feed.hd5', 'feed', format='table', data_columns=True)
>>> res = pd.read_hdf('feed.hd5', 'feed')
>>> res
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider #partiet...
86 MDG: Hasj for kjøtt. #valg2015
Name: feeds, dtype: object
But now the loaded string is missing some characters compared with the original tweet. Here's the tweet from row 84:
>>> # Original tweet (Length: 140)
>>> print (feeds[84])
De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspørsmål.
>>> # Loaded tweet (Length: 134)
>>> print (res[84])
De statsbærende partiene Ap og Høyre må ta skylda for Miljøpartiets fremgang. Velgerne har sett at SV og V ikke vinner frem i miljøspø
I plan to use Python 3.3.x mainly for the Unicode column support in PyTables (am I wrong about that?) but could not store all the data successfully yet. Can anyone explain this and let me know how I can avoid it?
I am using OS: Mac OS X Yosemite, Pandas: 0.16.2, Python: 3.3.5, PyTables: 3.2.0
UPDATE: I confirmed with HDFView (http://www.hdfgroup.org/products/java/hdfview/) that the data is indeed always getting stored (although with some trailing characters missing), but I am not able to load it successfully every time.
Thanks.
See the doc-string here.
You need to provide encoding='utf-8', otherwise this will be stored with your default Python encoding (which may or may not work). Reading will use the written encoding.
The data
In [13]: df[84:86]
Out[13]:
tweet_id username tweet_time tweet
84 641437756275720192 #nicecap 2015-09-09T02:27:33+00:00 De statsbærende partiene Ap og Høyre må ta sky...
85 641434661391101952 #nicecap 2015-09-09T02:15:15+00:00 April 2014. Blir MDG det nye arbeider #partiet...
When appending, supply the encoding:
In [11]: store.append('feed',df[84:86],encoding='utf-8')
Supply the encoding when reading as well:
In [12]: store.select('feed',encoding='utf-8')
Out[12]:
tweet_id username tweet_time tweet
84 641437756275720192 #nicecap 2015-09-09T02:27:33+00:00 De statsbærende partiene Ap og Høyre må ta sky...
85 641434661391101952 #nicecap 2015-09-09T02:15:15+00:00 April 2014. Blir MDG det nye arbeider #partiet...
Here's how it's stored:
In [14]: store.get_storer('feed')
Out[14]: frame_table (typ->appendable,nrows->2,ncols->4,indexers->[index])
In [15]: store.get_storer('feed').attrs
Out[15]:
/feed._v_attrs (AttributeSet), 15 attributes:
[CLASS := 'GROUP',
TITLE := '',
VERSION := '1.0',
data_columns := [],
encoding := 'utf-8',
index_cols := [(0, 'index')],
info := {1: {'names': [None], 'type': 'Index'}, 'index': {}},
levels := 1,
metadata := [],
nan_rep := 'nan',
non_index_axes := [(1, ['tweet_id', 'username', 'tweet_time', 'tweet'])],
pandas_type := 'frame_table',
pandas_version := '0.15.2',
table_type := 'appendable_frame',
values_cols := ['values_block_0', 'values_block_1']]
So I suppose this is a bug, in that reading should by default use the stored encoding. I created an issue here.
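In the meantime, a workaround (a sketch; it just combines the attrs and select calls shown above) is to read the stored encoding back from the node's attributes and pass it explicitly on read:
import pandas as pd

# read the encoding recorded in the node's attributes, then pass it to select
with pd.HDFStore('feed.hd5') as store:
    enc = store.get_storer('feed').attrs.encoding  # 'utf-8', as stored above
    df = store.select('feed', encoding=enc)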
I have found the issue and was able to partly correct it. If someone tweets something close to 140 characters and another person retweets it, the retweet doesn't contain the full text, because some text like RT #username: is prepended for retweets. As a result, the tweet is now more than 140 characters, so it is stripped down to 140 and is delivered as such by the Twitter APIs for Python like tweepy or Python Twitter Tools (these are the two I have tested). Sometimes the last character of these tweets is '…', which has a length of 1 and an ordinal value of 8230 (try chr(8230) in Python 3.x or unichr(8230) in Python 2.x). When such tweets are stored in an HDF5 file and read back via pd.read_hdf, the read fails and pandas replaces the whole tweet with just ''.
This could be rectified as below:
>>> # replace that '…' character with '' (empty string)
>>> ch = chr(8230)
>>> feeds = feeds.str.replace(ch, '')  # assign back; str.replace returns a new Series
>>> # Store it in HDF5 now... # Not sure if it preserves the encoding...
>>> feeds.to_hdf('feed.h5', 'feed', format='table', append=True, encoding='utf-8', data_columns=True)
>>> # If you prefer this way
>>> with pd.HDFStore('feed.h5') as store:
...     store.append('feed', feeds, min_itemsize=200, encoding='utf-8')
>>> # Now read it safely
>>> pd.read_hdf('feed.h5', 'feed')
However, the problem still appears sometimes when there are certain Unicode characters... Giving the encoding='utf-8' option didn't really help, at least in my case. Any help in this regard is appreciated... :)
I'm trying to extract text from a PDF file that has address information, shown below:
CALIFORNIA EYE SPECIALISTS MED GRP INC
1900 W GARVEY AVE S # 335
WEST COVINA CA 91790
and I'm using the logic below to extract the data:
f = open('addressPath.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(f)
first_page = pdf_reader.getPage(0)
mytext = first_page.extractText().split('\n')
but I'm getting the output below; the logic is introducing additional spaces.
Any idea why this is happening?
C A L I F O RN IA E YE SP E C I A L I STS M ED GRP INC
19 00 W G A R V EY A VE S # 3 35
WE S T CO VI NA C A 91 7 90
For a long time, PyPDF2 did not handle spaces at all. In April 2022, I improved the situation with a very simple heuristic. It was too simple and got many cases wrong.
The contributor pubpub-zz changed that. Today, version 2.1.0 was released, which improves spacing a lot.
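With a current release, the extraction from the question would look roughly like this (a sketch; PyPDF2 >= 2.1.0, using the newer PdfReader API that replaces PdfFileReader/getPage/extractText):
import PyPDF2  # pip install --upgrade PyPDF2 (>= 2.1.0)

# PdfReader / pages / extract_text are the 2.x equivalents of
# PdfFileReader / getPage / extractText
reader = PyPDF2.PdfReader('addressPath.pdf')
mytext = reader.pages[0].extract_text().split('\n')
print(mytext)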
I have a fixed-width text file that I must convert to a .csv, where all numbers have to be converted to integers (no commas, dollar signs, quotes, etc.). I have currently parsed the text file using plain Python, but when it comes to using the right package I seem to be at an impasse.
With csv, I can use writer.writerows in place of my print statement to write the output into my csv file, but the problem is that I have more columns (such as the date and time) that I must add after these rows, which I cannot seem to do with csv. I also cannot seem to find a way to translate the blank columns in my text document into blank columns in the output; csv seems to write strictly in order.
I was reading the documentation for xlsxwriter and I see how you can write to individual columns with set formatting, but I am unsure whether it would work with my .csv requirement.
My input text has a series of random groupings throughout the 50k-line document but follows the format below:
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
* START ******************************************************************************************************************** START *
1--------------------
1ANTECR09 CHEK DPCK_R_009
TRANSIT EXTRACT SUB-SYSTEM
CURRENT DATE = 08/03/2017 JOURNAL REPORT PAGE 1
PROCESS DATE =
ID = 022000046-MNT
FILE HEADER = H080320171115
+____________________________________________________________________________________________________________________________________
R T SEQUENCE CR BT A RSN ITEM ITEM CHN USER REASO
NBR NBR OR PIC NBR DB NBR NBR COD AMOUNT SERIAL IND .......FIELD.. DESCR
5,556 01 7450282689 C 538196640 9835177743 15 $9,064.81 00 CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431 DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896 DR CR
5,559 01 7450282692 D 071108834 176885 38 $6,688.00 1454 DR CR
5,560 01 7450282693 D 031309123 1390001566241 38 $293.42 6878 DR CR
My code currently parses this document, pulls the date and time, and prints only the lines where the sequence number starts with 42 and the CR is "C":
lines = []
a = 'PRINT DATE:'
b = 'ARCHIVE'
c = 'PRINT TIME:'
with open(r'textfile.txt') as in_file:
    for line in in_file:
        values = line.split()
        if 'PRINT DATE:' in line:
            dtevalue = line.split(a, 1)[-1].split(b)[0]
            lines.append(dtevalue)
        elif 'PRINT TIME:' in line:
            timevalue = line.split(c, 1)[-1].split(b)[0]
            lines.append(timevalue)
        elif (len(values) >= 4 and values[3] == 'C'
              and len(values[2]) >= 2 and values[2][:2] == '41'):
            print(line)
print (lines[0])
print (lines[1])
What would be the cleanest way to achieve this result, and am I headed in the right direction by writing the parsing first, or should I have just done everything within a package from the start?
Any help is appreciated
Edit:
The header block (between 1---------- and +___________) is repeated throughout the document, as are different-sized groupings separated by --------------------:
--------------------
34,615 207 4100223726 C 538196620 9866597322 10 $645.49 00 CREDIT
34,616 207 4100223727 D 022000046 8891636675 31 $645.49 111583 DR ON-
--------------------
34,617 208 4100223728 C 538196620 11701364 10 $756.19 00 CREDIT
34,618 208 4100223729 D 071923828 00 54 $305.31 11384597 BAD AC
34,619 208 4100223730 D 071923828 35110011 30 $450.88 10913052 6 DR SEL
--------------------
I would recommend slicing fixed-width blocks to parse the fixed-width fields. Something like the following (incomplete) code:
data = """ 5,556 01 4250282689 C 538196640 9835177743 15 $9,064.81 00
CREDIT
5,557 01 7450282690 D 031301422 362313705 38 $592.35 43431
DR CR
5,558 01 7450282691 D 021309379 601298839 38 $1,491.04 44896
DR CR
"""
# list of data layout tuples (start_index, stop_index, field name)
# TODO add missing data layout tuples
data_layout = [(0, 12, 'r_nbr'), (12, 22, 't_nbr'), (22, 39, 'seq'), (39, 42, 'cr_db')]
for line in data.splitlines():
    # skip "separator" lines
    # NOTE this may be an iterative process to identify these
    if line.startswith('-----'):
        continue
    record = {}
    for start_index, stop_index, name in data_layout:
        record[name] = line[start_index:stop_index].strip()
    # your conditional (seems inconsistent with text)
    if record['seq'].startswith('42') and record['cr_db'] == 'C':
        # perform any special handling for each column
        record['r_nbr'] = record['r_nbr'].replace(',', '')
        # TODO other special handling (like $)
        print('{r_nbr},{t_nbr},{seq},{cr_db},...'.format(**record))
Output is:
5556,01,4250282689,C,...
Update based on seemingly spurious values in undefined columns
Based on the new information provided about the "spurious" columns/fields (they appear only rarely), this will likely be an iterative process.
My recommendation would be to narrow (appropriately!) the width of the desired fields. For example, if spurious data is in line[12:14] above, you could change the tuple for (12, 22, 't_nbr') to (14, 22, 't_nbr') to "skip" the spurious field.
An alternative is to add a "garbage" field to the list of tuples to handle those types of lines. Wherever the "spurious" fields appear, the "garbage" field would simply consume them.
If you need those fields, the same general "garbage" field approach still applies, but you save the data instead of discarding it.
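For example, a hypothetical layout with such a garbage field might look like this (the positions here are made up for illustration):
# hypothetical positions: a 2-character spurious field sits between r_nbr and t_nbr
data_layout = [
    (0, 12, 'r_nbr'),
    (12, 14, '_garbage'),  # consumes the spurious field when present
    (14, 22, 't_nbr'),
    (22, 39, 'seq'),
    (39, 42, 'cr_db'),
]
# after slicing, either drop record['_garbage'] or keep it if you need the data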
Update based on random separators
If they are relatively consistent, I'd simply add some logic (as I did above) to "detect" the separators and skip over them.
I work in a hotel. Here is a raw file from the reports I have. I need to extract data in order to have something like data['roomNumber'] = ('paxNumber', isbb,).
Here is a sample that concerns only two rooms, 10 and 12, so the data I need should be BreakfastData = {'10': ['2', 'BB'], '12': ['1', 'BB']}.
1) roomNumber: starts and ends with a number, or starts with a number followed by strictly one or more spaces and then a string
2) paxNumber: the two numbers just before the 'VA' string
3) isbb: defined by the 'BB' or 'HPDJ' occurrence, which can be found between two '/'. But sometimes the format is not good, so it can be '/HPDJ/' or '/ HPDJ /' or '/ HPDJ/' etc.
10 PxxxxD,David,Mme, Mr T- EXPEDIA TRAVEL
08.05.17 12.05.17 TP
SUP DBL / HPDJ / DEBIT CB AGENCE - NR
2 0 VA
NR
12
LxxxxSH,Claudia,Mrs
08.05.17 19.05.17 TP
1 0 VA
NR BB
SUP SGL / BB / EN ATTENTE DE VIREMENT- EVITER LA 66 -
.... etc
Edit: latest version
import re

data = {}
pax = ''
r = re.compile(r"(\d+)\W*(\d+)\W*VA")
r2 = re.compile(r"/\s*(BB|HPDJ)\s*/")
r3 = re.compile(r"\d+\n")
r4 = re.compile(r"\d+\s+\w")
PATH = "/home/ryms/regextest"
with open(PATH, 'r') as raw:  # text mode: the patterns are str, not bytes
    text = raw.read()
    #roomNumber = re.search(r4, text).group()
    #roomNumber2 = re.search(r3, text).group()
    roomNumber = re.search(r4, text).group().split()[0]
    roomNumber2 = re.search(r3, text).group().split()[0]
    pax = re.findall(r, text)
    adult = pax[0]; enfant = pax[1]
    # if enfant is '0':
    #     pax = adult
    # else:
    #     pax = (str(adult) + '+' + str(enfant))
    bb = re.findall(r2, text)  # look for BB or HPDJ
    data[roomNumber] = pax, bb
    print(data)
    print(roomNumber)
    print(roomNumber2)
This returns:
{'10': ([('2', '2'), ('1', '1')], ['HPDJ', 'BB'])}
10
12
[Finished in 0.1s]
How can I get both room numbers in my return?
I have a lot of trouble with the \n issue and read(), readline(), readlines(). What is the trick?
When I have all the raw data, how will I get the proper BreakfastData{}? Will I use zip()?
At the beginning I wanted to split the file and then parse it, but I tried so many things that I got lost. And for that I need a regex that matches both patterns.
For the first case, where you want to select two numbers followed by 'VA', you can do it like this:
r = re.compile(r"(\d+)\W*(\d+)\W*VA")
For the second case, you can get HPDJ or BB like this:
r = re.compile(r"/\s*(HPDJ|BB)\s*/")
This will handle all the cases you mentioned: '/HPDJ/', '/ HPDJ /', '/ HPDJ/', etc.
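A quick demonstration against a fragment of the sample data (a sketch, not the full parser):
import re

text = "SUP DBL / HPDJ / DEBIT CB AGENCE - NR\n2 0 VA"
print(re.findall(r"(\d+)\W*(\d+)\W*VA", text))  # [('2', '0')]
print(re.findall(r"/\s*(HPDJ|BB)\s*/", text))   # ['HPDJ']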
The regex expression to get the text before the VA is as follows:
r = re.compile(r"(.*) VA")
Then the "number" (which will be a string) will be stored in the first group of the search match object, once you run the search.
I am not quite sure what the room number even is, because your description is a bit unclear, so I cannot help with that unless you clarify.
I have a csv file that contains some strangely (incorrectly) encoded Danish characters (å, ø, æ). In my Django view I'm trying to grab a string from the first row and the date from the second row of the file. The file looks like this if I copy-paste it:
01,01,Project Name: SAM_LOGIK_rsm¿de_HD,,,Statistics as of: Sat Oct 01 17:09:16 2016
02,01,Project created: Tue Apr 12 09:10:16 2016,,,Last Session Started: Sat Oct 01 16:59:22 2016
The string SAM_LOGIK_rsm¿de_HD should be SAM_LOGIK_Årsmøde_HD, which is the value I want to store in the DB.
I am decoding the file with iso-8859-1 (otherwise I get an error).
with open(latest, 'rt', encoding='iso-8859-1') as csvfile:
    for i, row in enumerate(csvfile):
        if "Project Name:" in row:
            this = row.split(',')
            project_list.append(this[2][14:])  # gets the project name as is
            if i >= 1:
                break
        else:
            this = row.split(',')
            date = datetime.strptime(this[5][22:-1], '%c')  # datetime object
            project_list.append(date)
            if i >= 1:
                break  # break at row 2
csvfile.close()
This stores the string 'as is', and I'm not sure how to convert it back into Danish before I store it in the DB. The DB and Django are set up to work with Danish characters.
If I try to decode it as utf-8, I get a UnicodeDecodeError which reveals some more information:
01,01,Project Name: SAM_LOGIK_\x81rsm\xbfde_HD,,,Statistics as of: Sat Oct'
01 17:09:16 2016\r02,01,Project created: Tue Apr 12 09:10:16 2016,,,Last'
EDIT:
I found out that the strings in the csv are actually corrupted, and the application that created them (Avid Media Composer) at least consistently uses the same values for Å-å-Æ-æ-Ø-ø:
Å = \x81 unassigned in UTF8
å = Œ - u"\u0153" OE ligature
Æ = ® - chr(174)
æ = ¾ - chr(190)
Ø = » - chr(187)
ø = ¿ - chr(191)
I fixed it like this.
replacements = {'\x81': 'Å', 'Œ': 'å', '®': 'Æ', '¾': 'æ', '¿': 'ø', '»': 'Ø'}
with open(newest, 'rt', encoding='iso-8859-1') as csvfile:
    for i, row in enumerate(csvfile):
        if "Project Name:" in row:
            this = row.split(',')
            project_list.append("".join([replacements.get(c, c) for c in this[2][14:]]))
            if i >= 1:
                break
        else:
            this = row.split(',')
            date = datetime.strptime(this[5][22:-1], '%c')  # datetime object
            project_list.append(date)
            if i >= 1:
                break  # break at row 2
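The same mapping can also be precompiled with str.maketrans and applied with str.translate; this is just a tidier variant of the join above, not a different fix:
# build a translation table once, then translate each extracted string
replacements = str.maketrans({'\x81': 'Å', 'Œ': 'å', '®': 'Æ', '¾': 'æ', '¿': 'ø', '»': 'Ø'})
project_list.append(this[2][14:].translate(replacements))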
Try this:
row.decode('iso-8859-1').encode('utf-8')
And if you use the "with" statement, closing the file isn't necessary.
I have a Python script that generates a CSV (data parsed from a website).
Here is an example of the CSV files:
File1.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
China;Beijing;BeiwaiOnline BFSU;;;
Italy;Curno;Bergamo, Anderson House;;Yes;
File2.csv
China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
Italy;Bari;Bari, The British School;;Yes;
China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
This;Is;A;New;Line;;
Italy;Curno;Bergamo, Anderson House;;Yes;
As you can see,
China;Beijing;BeiwaiOnline BFSU;;; ==> this line from File1.csv is no longer present in File2.csv, and This;Is;A;New;Line;; ==> this line from File2.csv is new (it is not present in File1.csv).
I am looking for a way to compare these two CSV files (one important thing to know is that the order of the lines doesn't matter... they can be anywhere).
What I'd like to have is a script which can tell me:
- One new line : This;Is;A;New;Line;;
- One removed line : China;Beijing;BeiwaiOnline BFSU;;;
And so on...!
I've tried the following, but without any success:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
now = [row for row in c2]
past = [row for row in c1]
for row in now:
    #print row
    lol = past.index(row)
    print lol
f1.close()
f2.close()
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Any idea of the best way to proceed? Thank you so much in advance ;)
EDIT:
import csv
f1 = file('now.csv', 'r')
f2 = file('past.csv', 'r')
c1 = csv.reader(f1)
c2 = csv.reader(f2)
s1 = set(c1)
s2 = set(c2)
lol = s1 - s2
print type(lol)
print lol
This seems to be a good idea, but:
Traceback (most recent call last):
File "compare.py", line 20, in <module>
s1 = set(c1)
TypeError: unhashable type: 'list'
EDIT 2 (please disregard what is above):
With your help, here is the script I'm writing:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import csv
### COMPARISON THING ###
x=0
fichiers = os.listdir('/me/CSV')
for fichier in fichiers:
    if '.csv' in fichier:
        print('%s -----> %s' % (x, fichier))
        x = x + 1
choice = raw_input("Which file do you want to compare with the new output ? ->>>")
past_file = fichiers[int(choice)]
print 'We gonna compare %s to our output' % past_file
s_now = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/now.csv', 'r'), delimiter=';')) ## OUR OUTPUT
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
c = csv.writer(open("CHANGELOG.csv", "a"),delimiter=";" )
line = ['AD']
for item_added in added:
    line.append(item_added)
    c.writerow(['AD', item_added])
line = ['RM']
for item_removed in removed:
    line.append(item_removed)
c.writerow(line)
Two kinds of errors:
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: line contains NULL byte
or
File "programcompare.py", line 21, in <genexpr>
s_past = frozenset(tuple(row) for row in csv.reader(open('/me/CSV/'+past_file, 'r'), delimiter=';')) ## CHOOSEN ONE
_csv.Error: newline inside string
It was working a few minutes ago, but I've changed the CSV files to test with different data, and here I am :-)
Sorry, last question!
If your data is not prohibitively large, loading them into a set (or frozenset) will be an easy approach:
s_now = frozenset(tuple(row) for row in csv.reader(open('now.csv', 'r'), delimiter=';'))
s_past = frozenset(tuple(row) for row in csv.reader(open('past.csv', 'r'), delimiter=';'))
To get the list of entries that were added:
added = [";".join(row) for row in s_now - s_past] # in "now" but not in "past"
# Or, simply "added = list(s_now - s_past)" to keep them as tuples.
similarly, list of entries that were removed:
removed = [";".join(row) for row in s_past - s_now] # in "past" but not in "now"
To address your updated question on why you're seeing TypeError: unhashable type: 'list': the csv reader returns each entry as a list when iterated. Lists are not hashable and therefore cannot be inserted into a set.
To address this, you'll need to convert the list entries into tuples before adding them to the set. See the previous section of my answer for an example of how this can be done.
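For example:
>>> s = set()
>>> s.add(['a', 'b'])           # a list: not hashable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
>>> s.add(tuple(['a', 'b']))    # a tuple: hashable
>>> s
set([('a', 'b')])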
To address the additional errors you're seeing, they are both due to the content of your CSV files.
_csv.Error: newline inside string
It looks like you have quote characters (") somewhere in the data, which confuses the parser. I'm not familiar enough with the csv module to tell you exactly what has gone wrong, not without having a peek at your data anyway.
I did however manage to reproduce the error as such:
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";")]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
_csv.Error: newline inside string
In this case, it can be fixed by instructing the reader not to do any special processing of quotes (see csv.QUOTE_NONE). (Do note that this will disable the handling of quoted data, whereby delimiters can appear within a quoted string without the string being split into separate entries.)
>>> [e for e in csv.reader(['hello;wo;"rld'], delimiter=";", quoting=csv.QUOTE_NONE)]
[['hello', 'wo', '"rld']]
_csv.Error: line contains NULL byte
I'm guessing this might be down to the encoding of your CSV files. See the following questions:
Python CSV error: line contains NULL byte
"Line contains NULL byte" in CSV reader (Python)
Read the csv files line by line into sets. Compare the sets.
>>> s1 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... China;Beijing;BeiwaiOnline BFSU;;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s2 = set('''China;Beijing;Auralog Software Development (Deijing) Co. Ltd.;;;
... United Kingdom;Oxford;Azad University (Ir) In Oxford Ltd;;;
... Italy;Bari;Bari, The British School;;Yes;
... China;Beijing;Beijing Foreign Enterprise Service Group Co Ltd;;;
... China;Beijing;Beijing Ying Biao Human Resources Development Limited;;Yes;
... This;Is;A;New;Line;;
... Italy;Curno;Bergamo, Anderson House;;Yes;'''.split('\n'))
>>> s1 - s2
set(['China;Beijing;BeiwaiOnline BFSU;;;'])
>>> s2 - s1
set(['This;Is;A;New;Line;;'])
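Applied to the files themselves (a sketch; file names taken from the question), the same comparison is just:
s1 = set(line.strip() for line in open('File1.csv'))
s2 = set(line.strip() for line in open('File2.csv'))
print s1 - s2  # lines removed from File1.csv
print s2 - s1  # lines added in File2.csv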