Parsing .ics (iCalendar) files using Python

I have a .ics file in the following format. What is the best way to parse it? I need to retrieve the Summary, Description, and Time for each of the entries.
BEGIN:VCALENDAR
X-LOTUS-CHARSET:UTF-8
VERSION:2.0
PRODID:-//Lotus Development Corporation//NONSGML Notes 8.0//EN
METHOD:PUBLISH
BEGIN:VTIMEZONE
TZID:India
BEGIN:STANDARD
DTSTART:19500101T020000
TZOFFSETFROM:+0530
TZOFFSETTO:+0530
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID="India":20100615T111500
DTEND;TZID="India":20100615T121500
TRANSP:OPAQUE
DTSTAMP:20100713T071035Z
CLASS:PUBLIC
DESCRIPTION:Emails\nDarlene\n Murphy\nDr. Ferri\n
UID:12D3901F0AD9E83E65257743001F2C9A-Lotus_Notes_Generated
X-LOTUS-UPDATE-SEQ:1
X-LOTUS-UPDATE-WISL:$S:1;$L:1;$B:1;$R:1;$E:1;$W:1;$O:1;$M:1
X-LOTUS-NOTESVERSION:2
X-LOTUS-APPTTYPE:0
X-LOTUS-CHILD_UID:12D3901F0AD9E83E65257743001F2C9A
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID="India":20100628T130000
DTEND;TZID="India":20100628T133000
TRANSP:OPAQUE
DTSTAMP:20100628T055408Z
CLASS:PUBLIC
DESCRIPTION:
SUMMARY:smart energy management
LOCATION:8778/92050462
UID:07F96A3F1C9547366525775000203D96-Lotus_Notes_Generated
X-LOTUS-UPDATE-SEQ:1
X-LOTUS-UPDATE-WISL:$S:1;$L:1;$B:1;$R:1;$E:1;$W:1;$O:1;$M:1
X-LOTUS-NOTESVERSION:2
X-LOTUS-NOTICETYPE:A
X-LOTUS-APPTTYPE:3
X-LOTUS-CHILD_UID:07F96A3F1C9547366525775000203D96
END:VEVENT
BEGIN:VEVENT
DTSTART;TZID="India":20100629T110000
DTEND;TZID="India":20100629T120000
TRANSP:OPAQUE
DTSTAMP:20100713T071037Z
CLASS:PUBLIC
SUMMARY:meeting
UID:6011DDDD659E49D765257751001D2B4B-Lotus_Notes_Generated
X-LOTUS-UPDATE-SEQ:1
X-LOTUS-UPDATE-WISL:$S:1;$L:1;$B:1;$R:1;$E:1;$W:1;$O:1;$M:1
X-LOTUS-NOTESVERSION:2
X-LOTUS-APPTTYPE:0
X-LOTUS-CHILD_UID:6011DDDD659E49D765257751001D2B4B
END:VEVENT

The icalendar package looks nice.
For instance, to write a file:
from icalendar import Calendar, Event
from datetime import datetime
from pytz import UTC # timezone
cal = Calendar()
cal.add('prodid', '-//My calendar product//mxm.dk//')
cal.add('version', '2.0')
event = Event()
event.add('summary', 'Python meeting about calendaring')
event.add('dtstart', datetime(2005,4,4,8,0,0,tzinfo=UTC))
event.add('dtend', datetime(2005,4,4,10,0,0,tzinfo=UTC))
event.add('dtstamp', datetime(2005,4,4,0,10,0,tzinfo=UTC))
event['uid'] = '20050115T101010/27346262376#mxm.dk'
event.add('priority', 5)
cal.add_component(event)
f = open('example.ics', 'wb')
f.write(cal.to_ical())
f.close()
Tadaaa, you get this file:
BEGIN:VCALENDAR
PRODID:-//My calendar product//mxm.dk//
VERSION:2.0
BEGIN:VEVENT
DTEND;VALUE=DATE:20050404T100000Z
DTSTAMP;VALUE=DATE:20050404T001000Z
DTSTART;VALUE=DATE:20050404T080000Z
PRIORITY:5
SUMMARY:Python meeting about calendaring
UID:20050115T101010/27346262376#mxm.dk
END:VEVENT
END:VCALENDAR
But what lies in this file?
g = open('example.ics','rb')
gcal = Calendar.from_ical(g.read())
for component in gcal.walk():
    print(component.name)
g.close()
You can see it easily:
>>>
VCALENDAR
VEVENT
>>>
What about parsing the data for the events?
g = open('example.ics','rb')
gcal = Calendar.from_ical(g.read())
for component in gcal.walk():
    if component.name == "VEVENT":
        print(component.get('summary'))
        print(component.get('dtstart'))
        print(component.get('dtend'))
        print(component.get('dtstamp'))
g.close()
Now you get:
>>>
Python meeting about calendaring
20050404T080000Z
20050404T100000Z
20050404T001000Z
>>>
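Those are the raw iCalendar strings; if you want real datetime objects instead, icalendar can decode the properties. A small sketch using the same example.ics:
from icalendar import Calendar

g = open('example.ics', 'rb')
gcal = Calendar.from_ical(g.read())
for component in gcal.walk('VEVENT'):
    start = component.decoded('dtstart')  # a datetime.datetime, tz-aware here
    end = component.decoded('dtend')
    print(start.isoformat(), end.isoformat())
g.close()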

You could probably also use the vobject module for this: http://pypi.python.org/pypi/vobject
If you have a sample.ics file, you can read its contents like so:
import vobject

# read the data from the file
data = open("sample.ics").read()
# parse the top-level event with vobject
cal = vobject.readOne(data)
# Get Summary
print('Summary: ', cal.vevent.summary.valueRepr())
# Get Description
print('Description: ', cal.vevent.description.valueRepr())
# Get Time
print('Time (as a datetime object): ', cal.vevent.dtstart.value)
print('Time (as a string): ', cal.vevent.dtstart.valueRepr())

New to Python; the above answers were very helpful, so I wanted to post a more complete sample.
# ics to csv example
# dependency: https://pypi.org/project/vobject/
import vobject
import csv
with open('sample.csv', mode='w', newline='') as csv_out:
    csv_writer = csv.writer(csv_out, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerow(['WHAT', 'WHO', 'FROM', 'TO', 'DESCRIPTION'])
    # read the data from the file
    data = open("sample.ics").read()
    # iterate through the contents
    for cal in vobject.readComponents(data):
        for component in cal.components():
            if component.name == "VEVENT":
                # write to csv
                csv_writer.writerow([component.summary.valueRepr(),
                                     component.attendee.valueRepr(),
                                     component.dtstart.valueRepr(),
                                     component.dtend.valueRepr(),
                                     component.description.valueRepr()])
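One caveat: vobject raises AttributeError when a VEVENT lacks one of these properties (the first event in the question's file has no SUMMARY or ATTENDEE, for example). A hedged variant using a small helper (prop is a hypothetical name):
def prop(component, name):
    # return the property's printable value, or '' when the event lacks it
    attr = getattr(component, name, None)
    return attr.valueRepr() if attr is not None else ''

csv_writer.writerow([prop(component, n) for n in
                     ('summary', 'attendee', 'dtstart', 'dtend', 'description')])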

Four years later, and understanding the ICS format a bit better: if those were the only fields I needed, I'd just use native string methods:
import io
# Probably not a valid .ics file, but we don't really care for the example
# it works fine regardless
file = io.StringIO('''
BEGIN:VCALENDAR
X-LOTUS-CHARSET:UTF-8
VERSION:2.0
DESCRIPTION:Emails\nDarlene\n Murphy\nDr. Ferri\n
SUMMARY:smart energy management
LOCATION:8778/92050462
DTSTART;TZID="India":20100629T110000
DTEND;TZID="India":20100629T120000
TRANSP:OPAQUE
DTSTAMP:20100713T071037Z
CLASS:PUBLIC
SUMMARY:meeting
UID:6011DDDD659E49D765257751001D2B4B-Lotus_Notes_Generated
X-LOTUS-UPDATE-SEQ:1
X-LOTUS-UPDATE-WISL:$S:1;$L:1;$B:1;$R:1;$E:1;$W:1;$O:1;$M:1
X-LOTUS-NOTESVERSION:2
X-LOTUS-APPTTYPE:0
X-LOTUS-CHILD_UID:6011DDDD659E49D765257751001D2B4B
END:VEVENT
'''.strip())
parsing = False
for line in file:
    field, _, data = line.partition(':')
    if field in ('SUMMARY', 'DESCRIPTION', 'DTSTAMP'):
        parsing = True
        print(field)
        print('\t' + '\n\t'.join(data.split('\n')))
    elif parsing and not data:
        print('\t' + '\n\t'.join(field.split('\n')))
    else:
        parsing = False
Storing the data and parsing the datetime are left as an exercise for the reader (it's always UTC).
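For completeness, a sketch of that exercise, assuming DTSTAMP always looks like 20100713T071037Z (UTC):
from datetime import datetime, timezone

stamp = datetime.strptime('20100713T071037Z', '%Y%m%dT%H%M%SZ')
stamp = stamp.replace(tzinfo=timezone.utc)
print(stamp)  # 2010-07-13 07:10:37+00:00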
old answer below
You could use a regex:
import re
text = ...  # your .ics text, as one string
print(re.search("SUMMARY:.*?:", text, re.DOTALL).group())
print(re.search("DESCRIPTION:.*?:", text, re.DOTALL).group())
print(re.search("DTSTAMP:.*:?", text, re.DOTALL).group())
I'm sure it may be possible to skip the first and last words, I'm just not sure how to do it with regex. You could do it this way though:
print(' '.join(re.search("SUMMARY:.*?:", text, re.DOTALL).group().replace(':', ' ').split()[1:-1]))
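A capture group can skip those words directly; a sketch, assuming each field ends where the next NAME: or NAME; line begins:
import re

# `text` is assumed to hold the .ics contents, as above
m = re.search(r"SUMMARY:(.*?)\n[A-Z-]+[;:]", text, re.DOTALL)
if m:
    print(m.group(1))  # just the value, without SUMMARY: or the next field's name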

In case anyone else is looking at this, the ics package seems to be better maintained than the others mentioned in this thread. https://pypi.org/project/ics/
Here's some sample code I'm using:
from ics import Calendar

with open(in_file, 'r') as file:
    ics_text = file.read()

c = Calendar(ics_text)
for e in c.events:
    print(e.name)
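Each event also carries its times and description; a short extension, assuming the ics package's begin/end/description attributes:
for e in c.events:
    print(e.name, e.begin, e.end)  # begin/end are Arrow datetime objects
    print(e.description)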

I'd parse line by line, search for your terms, then take the match index and extract from there plus however many characters you think you'll need. Then parse that much smaller string into what you need.
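A rough sketch of that approach (the 100-character window is an arbitrary assumption):
text = open('sample.ics').read()
for term in ('SUMMARY:', 'DESCRIPTION:', 'DTSTART'):
    idx = text.find(term)
    if idx != -1:
        chunk = text[idx:idx + 100]     # grab a fixed window past the match
        print(chunk.split('\n', 1)[0])  # then trim it down, e.g. to one line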

Related

Does anyone know how to add row numbers?

I can open this file directly from the net, and I want to add row numbers to each line based on a rule: if the user wants the header row numbered, start from 1; if not, start numbering from the next line. This is my code; I tried a lot, but it doesn't work. Does anyone know how to solve this problem? Thanks in advance!
import sys
class Main:
    def task1(self):
        print('*' * 30, 'Task')
        import urllib.request
        # url
        url = 'http://www.born.nhely.hu/group_list.txt'
        # Initiate a request to get a response
        while True:
            try:
                response = urllib.request.urlopen(url)
            except Exception as e:
                print('An error has occurred, the request is being made again, the error message is as follows:', e)
            else:
                break
        # Print all student information
        content = response.read().decode('utf-8')
        # add row number
        header_row = input("Do you want to know header_row numbers? Y OR N?")
        if header_row == 'Y':
            for i, line in enumerate(content, start=1):
                print(f'{i},{line}')
        else:
            for i, line in enumerate(content, start=0):
                print('{},{}'.format(i, line.strip()))

    def start(self):
        self.task1()

Main().start()
Have a look at the data you are downloading:
Name;Short name;Email;Country;Other spoken languages
ABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?
AGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English
...
Now look at the results you are getting:
1,N
2,a
3,m
4,e
5,;
6,S
7,h
8,o
...
It should be apparent that you are looping character by character, not line by line.
When you have:
for i, line in enumerate(content, start=1):
    print(f'{i},{line}')
content is a string -- not a list of lines -- so you will loop over the string character by character with the for loop.
So to fix, do:
for i, line in enumerate(content.splitlines(), start=1):
    print(f'{i},{line}')
Or, you can change the method of reading from the server to reading lines instead of characters:
content = response.readlines()
You're absorbing the .txt content as one big string... if you use .readlines() instead of .read(), you can achieve what you want.
You should modify this:
# Print all student information
content = response.read().decode('utf-8')
To this:
# Print all student information
content = response.readlines()
You can use the repr() method to take a look at your data:
print(repr(content))
'Name;Short name;Email;Country;Other spoken languages\r\nABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?\r\nAGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English\r\nAMIN Asjad;?;;?;?\r\nATILA Arda Burak;Arda;arda_atila#hotmail.com;Turkey;English\r\nBELTRAN CASTRO Carlos Ricardo;Ricardo;crbeltrancas#gmail.com;Colombia;English, Chinese\r\nBhatti Muhammad Hasan;?;;?;?\r\nCAKIR Alp Hazar;Alp;alphazarc#gmail.com;Turkey;English\r\nDENG Zhihui;Deng;dzhfalcon0727#gmail.com;China;English\r\nDURUER Ahmet Enes;Ahmet / kahverengi;hello#ahmetduruer.com;Turkey;English\r\nENKHZAYA Jagar;Jager;japman2400#gmail.com;Mongolia;English\r\nGHAIBAH Sanaa;Sanaa;sanaagheibeh12#gmail.com;Syria;English\r\nGUO Ruizheng;?;ruizhengguo#gmail.com;China;English\r\nGURBANZADE Gurban;Qurban;gurbanzade01#gmail.com;Azeribaijan;English, Russian, Turkish\r\nHASNAIN Syed Muhammad;Hasnain;syedhasnainhijazy313#gmail.com;Pakistan;?\r\nISMAYILOV Firdovsi;Firi;firiisi#gmail.com;Azeribaijan ?;English,Russian,Turkish\r\nKINGRANI Muskan;Muskan;muskankingrani4#gmail.com;India;English\r\nKOKO Susan Kekeli Ruth;Susan;susankoko3#gmail.com;Ghana;N/A\r\nKOLA-OLALEYE Adeola Damilola;Adeola;inboxadeola#gmail.com;Nigeria;French\r\nLEWIS Madison Buse;?;madisonbuse#yahoo.com;Turkey;Turkish\r\nLI Ting;Ting;514053044#qq.com;China;English\r\nMARUSENKO Svetlana;Svetlana;svetlana.maru#gmail.com;Russia;English, German\r\nMOHANTY Cyrus;cyrus;cyrusmohanty5261#gmail.com;India;English\r\nMOTHOBI Thabo Emmanuel;thabo;thabomothobi#icloud.com;South Africa;English\r\nNayudu Yashmit Vinay;?;;?;?\r\nPurevsuren Davaadorj;?;Purevsuren.davaadorj99#gmail.com;Mongolia ?;English\r\nSAJID Anoosha;Anoosha;anooshasajid12#gmail.com;Pakistan;English\r\nSHANG Rongxiang;Xiang;1074482757#qq.com;China;English\r\nSU Haobo;Su;2483851740#qq.com;China;English\r\nTAKEUCHI ROSSMAN Elly;Elly;elliebanana10th#gmail.com;Japan;English\r\nULUSOY Nedim Can;Nedim;nedimcanulusoy#gmail.com;Turkey;English, Hungarian\r\nXuan Qijian;Xuan;xjwjadon#gmail.com;China ?;?\r\nYUAN Gaopeng;Yuan;1277237374#qq.com;China;English\r\n'
vs
print(repr(content))
[b'Name;Short name;Email;Country;Other spoken languages\r\n', b'ABOUELHASSAN Shehab Ibrahim Adbelazin;?;dwedar909#gmail.com;?;?\r\n', b'AGHAEI HOSSEIN ABADI Mohammad Mehdi;Matt;mahdiaghaei355#gmail.com;Iran;English\r\n', b'AMIN Asjad;?;;?;?\r\n', b'ATILA Arda Burak;Arda;arda_atila#hotmail.com;Turkey;English\r\n', b'BELTRAN CASTRO Carlos Ricardo;Ricardo;crbeltrancas#gmail.com;Colombia;English, Chinese\r\n', b'Bhatti Muhammad Hasan;?;;?;?\r\n', b'CAKIR Alp Hazar;Alp;alphazarc#gmail.com;Turkey;English\r\n', b'DENG Zhihui;Deng;dzhfalcon0727#gmail.com;China;English\r\n', b'DURUER Ahmet Enes;Ahmet / kahverengi;hello#ahmetduruer.com;Turkey;English\r\n', b'ENKHZAYA Jagar;Jager;japman2400#gmail.com;Mongolia;English\r\n', b'GHAIBAH Sanaa;Sanaa;sanaagheibeh12#gmail.com;Syria;English\r\n', b'GUO Ruizheng;?;ruizhengguo#gmail.com;China;English\r\n', b'GURBANZADE Gurban;Qurban;gurbanzade01#gmail.com;Azeribaijan;English, Russian, Turkish\r\n', b'HASNAIN Syed Muhammad;Hasnain;syedhasnainhijazy313#gmail.com;Pakistan;?\r\n', b'ISMAYILOV Firdovsi;Firi;firiisi#gmail.com;Azeribaijan ?;English,Russian,Turkish\r\n', b'KINGRANI Muskan;Muskan;muskankingrani4#gmail.com;India;English\r\n', b'KOKO Susan Kekeli Ruth;Susan;susankoko3#gmail.com;Ghana;N/A\r\n', b'KOLA-OLALEYE Adeola Damilola;Adeola;inboxadeola#gmail.com;Nigeria;French\r\n', b'LEWIS Madison Buse;?;madisonbuse#yahoo.com;Turkey;Turkish\r\n', b'LI Ting;Ting;514053044#qq.com;China;English\r\n', b'MARUSENKO Svetlana;Svetlana;svetlana.maru#gmail.com;Russia;English, German\r\n', b'MOHANTY Cyrus;cyrus;cyrusmohanty5261#gmail.com;India;English\r\n', b'MOTHOBI Thabo Emmanuel;thabo;thabomothobi#icloud.com;South Africa;English\r\n', b'Nayudu Yashmit Vinay;?;;?;?\r\n', b'Purevsuren Davaadorj;?;Purevsuren.davaadorj99#gmail.com;Mongolia ?;English\r\n', b'SAJID Anoosha;Anoosha;anooshasajid12#gmail.com;Pakistan;English\r\n', b'SHANG Rongxiang;Xiang;1074482757#qq.com;China;English\r\n', b'SU Haobo;Su;2483851740#qq.com;China;English\r\n', b'TAKEUCHI ROSSMAN Elly;Elly;elliebanana10th#gmail.com;Japan;English\r\n', b'ULUSOY Nedim Can;Nedim;nedimcanulusoy#gmail.com;Turkey;English, Hungarian\r\n', b'Xuan Qijian;Xuan;xjwjadon#gmail.com;China ?;?\r\n', b'YUAN Gaopeng;Yuan;1277237374#qq.com;China;English\r\n']
Also, instead of hard-coding the charset as utf-8, you can use response.headers.get_content_charset()
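A minimal sketch of that suggestion, falling back to utf-8 when the server does not declare a charset:
charset = response.headers.get_content_charset() or 'utf-8'
content = response.read().decode(charset)
for i, line in enumerate(content.splitlines(), start=1):
    print(f'{i},{line}')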

How to convert google-maps GeoJSON to GPX, retaining location names

I have exported my google-maps Points Of Interest (saved places / locations) via the takeout tool. How can I convert this to GPX, so that I can import it into OSMAnd?
I tried using gpsbabel:
gpsbabel -i geojson -f my-saved-locations.json -o gpx -F my-saved-locations_converted.gpx
But this did not retain the title/name of each point of interest, and instead just used names like WPT001, WPT002, etc.
In the end I solved this by creating a small Python script to convert between the formats.
This could be easily adapted for specific needs:
#!/usr/bin/env python3
import argparse
import json
import xml.etree.ElementTree as ET
from xml.dom import minidom
def ingestJson(geoJsonFilepath):
    poiList = []
    with open(geoJsonFilepath) as fileObj:
        data = json.load(fileObj)
        for f in data["features"]:
            poiList.append({'title': f["properties"]["Title"],
                            'lon': f["geometry"]["coordinates"][0],
                            'lat': f["geometry"]["coordinates"][1],
                            'link': f["properties"].get("Google Maps URL", ''),
                            'address': f["properties"]["Location"].get("Address", '')})
    return poiList

def dumpGpx(gpxFilePath, poiList):
    gpx = ET.Element("gpx", version="1.1", creator="", xmlns="http://www.topografix.com/GPX/1/1")
    for poi in poiList:
        wpt = ET.SubElement(gpx, "wpt", lat=str(poi["lat"]), lon=str(poi["lon"]))
        ET.SubElement(wpt, "name").text = poi["title"]
        ET.SubElement(wpt, "desc").text = poi["address"]
        ET.SubElement(wpt, "link").text = poi["link"]
    xmlstr = minidom.parseString(ET.tostring(gpx)).toprettyxml(encoding="utf-8", indent=" ")
    with open(gpxFilePath, "wb") as f:
        f.write(xmlstr)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--inputGeoJsonFilepath', required=True)
    parser.add_argument('--outputGpxFilepath', required=True)
    args = parser.parse_args()
    poiList = ingestJson(args.inputGeoJsonFilepath)
    dumpGpx(args.outputGpxFilepath, poiList=poiList)

if __name__ == "__main__":
    main()
It can be called like so:
./convert-googlemaps-geojson-to-gpx.py \
--inputGeoJsonFilepath my-saved-locations.json \
--outputGpxFilepath my-saved-locations_converted.gpx
There is also an npm package called "togpx":
https://github.com/tyrasd/togpx
I didn't try it, but it claims to keep as much information as possible.

How to add SYLT(synced lyrics) tag on ID3v2 mp3 file using python?

I want to add synced lyrics from vtt on my mp3 file using python. I tried using the mutagen module but it didn't work as intended.
from mutagen.id3 import ID3, USLT, SLT
import sys
import webvtt
lyrics = webvtt.read(sys.argv[2])
lyri = []
lyr = []
for lyric in lyrics:
    times = [int(x) for x in lyric.start.replace(".", ":").split(":")]
    ms = times[-1] + 1000*times[-2] + 1000*60*times[-3] + 1000*60*60*times[-4]
    lyri.append((lyric.text, ms))
    lyr.append(lyric.text)
fil = ID3(sys.argv[1])
tag = USLT(encoding=3, lang='kor', text="\n".join(lyr)) # this is unsynced lyrics
#tag = SLT(encoding=3, lang='kor', format=2, type=1, text=lyri) --- not working
print(tag)
fil.add(tag)
fil.save(v1=0)
How can I solve this problem?
I used mutagen to parse an mp3 file that already had SYLT data, and found the usage of SYLT:
from mutagen.id3 import ID3, SYLT, Encoding
tag = ID3(mp3path)
sync_lrc = [("Do you know what's worth fighting for", 17640),
            ("When it's not worth dying for?", 23640), ...]  # [(lrc, millisecond), ]
tag.setall("SYLT", [SYLT(encoding=Encoding.UTF8, lang='eng', format=2, type=1, text=sync_lrc)])
tag.save(v2_version=3)
But I couldn't figure out what format=2, type=1 means.
Check
https://id3.org/id3v2.3.0#Synchronised_lyrics.2Ftext
format 1: absolute time, 32 bits, using MPEG frames as unit
format 2: absolute time, 32 bits, using milliseconds as unit
type 0: other
type 1: lyrics
type 2: text transcription
type 3: movement/part name (e.g. "Adagio")
type 4: events (e.g. "Don Quijote enters the stage")
type 5: chord (e.g. "Bb F Fsus")
type 6: trivia/'pop up' information
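Putting the two answers together for the original question: the SLT import appears to be the problem, and the SYLT frame with the (text, milliseconds) pairs already built in the question's loop should work. A sketch (lyri is the [(text, ms), ...] list from the question):
from mutagen.id3 import ID3, SYLT, Encoding

tag = ID3(sys.argv[1])
tag.setall("SYLT", [SYLT(encoding=Encoding.UTF8, lang='kor',
                         format=2,     # absolute time, in milliseconds
                         type=1,       # content type: lyrics
                         text=lyri)])  # the [(text, ms), ...] list from the question
tag.save(v2_version=3)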

Python configparser reads comments in values

ConfigParser also reads comments. Why? Shouldn't ignoring inline comments be the default?
I reproduce my problem with the following script:
import configparser
config = configparser.ConfigParser()
config.read("C:\\_SVN\\BMO\\Source\\Server\\PythonExecutor\\Resources\\visionapplication.ini")
for section in config.sections():
    for item in config.items(section):
        print("{}={}".format(section, item))
The ini file looks as follows:
[LPI]
reference_size_mm_width = 30 ;mm
reference_size_mm_height = 40 ;mm
print_pixel_pitch_mm = 0.03525 ; mm
eye_cascade = "TBD\haarcascade_eye.xml" #
The output:
C:\_Temp>python read.py
LPI=('reference_size_mm_width', '30 ;mm')
LPI=('reference_size_mm_height', '40 ;mm')
LPI=('print_pixel_pitch_mm', '0.03525 ; mm')
LPI=('eye_cascade', '"TBD\\haarcascade_eye.xml" #')
I don't want to read '30 ;mm'; I want to read just the number 30.
What am I doing wrong?
PS: Python3.7
Use inline_comment_prefixes while creating the ConfigParser object; check the example below:
config = configparser.ConfigParser(inline_comment_prefixes = (";",))
Here is the detailed documentation: https://docs.python.org/3/library/configparser.html
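Putting it together with the ini file above, a short sketch (getfloat/getint also convert the now comment-free values):
import configparser

config = configparser.ConfigParser(inline_comment_prefixes=(';', '#'))
config.read('visionapplication.ini')
width = config.getfloat('LPI', 'reference_size_mm_width')
print(width)  # 30.0 -- the ';mm' comment is stripped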

How do I properly handle a possibly Chinese encoding in Python?

I am scraping the following link:
http://www.footballcornersta.com/en/league.php?select=all&league=%E8%8B%B1%E8%B6%85&year=2014&month=1&Submit=Submit
and the following string contains all the available options in a menu relevant to league:
ls_main = [['E','ENG PR','英超'],['E','ENG FAC','英足总杯'],['E','ENG Champ','英冠'],['E','ENG D1','英甲'],['I','ITA D1','意甲'],['I','ITA D2','意乙'],['S','SPA D1','西甲'],['S','SPA D2','西乙'],['G','GER D1','德甲'],['G','GER D2','德乙'],['F','FRA D1','法甲'],['F','FRA D2','法乙'],['S','SCO PR','苏超'],['R','RUS PR','俄超'],['T','TUR PR','土超'],['B','BRA D1','巴西甲'],['U','USA MLS','美职联'],['A','ARG D1','阿根甲'],['J','JP D1','日职业'],['J','JP D2','日职乙'],['A','AUS D1','澳A联'],['K','KOR D1','韩K联'],['C','CHN PR','中超'],['E','EURO Cup','欧洲杯'],['I','Italy Supe','意超杯'],['K','KOR K3','K3联'],['C','CHN D1','中甲'],['D','DEN D2-E','丹乙东'],['D','DEN D2-W','丹乙西'],['D','DEN D1','丹甲'],['D','DEN PR','丹超'],['U','UKR U21','乌克兰U21'],['U','UD2','乌克甲'],['U','UKR D1','乌克超'],['U','Uzber D1','乌兹超'],['U','URU D1','乌拉甲'],['U','UZB D2','乌茲甲'],['I','ISR D2','以色列乙'],['I','ISR D1','以色列甲'],['I','ISR PR','以色列超'],['I','Iraq L','伊拉联'],['I','Ira D1','伊朗甲'],['I','IRA P','伊朗联'],['R','RUS D2C','俄乙中'],['R','RUS D2U','俄乙乌'],['R','RUS D2S','俄乙南'],['R','RUS D2W','俄乙西'],['R','RUS RL','俄后赛'],['R','RUS D1','俄甲'],['R','RUS PR','俄超'],['B','BUL D1','保甲'],['C','CRO D1','克甲'],['I','ICE PR','冰岛超'],['G','GHA PL','加纳超'],['H','Hun U19','匈U19'],['H','HUN D2E','匈乙东'],['H','HUN D2W','匈乙西'],['H','HUN D1','匈甲'],['N','NIR IFAC','北爱冠'],['N','NIRE PR','北爱超'],['S','SAfrica D1','南非甲'],['S','SAfrica NSLP','南非超'],['L','LUX D1','卢森甲'],['I','IDN PR','印尼超'],['I','IND D1','印度甲'],['G','GUAT D1','危地甲'],['E','ECU D1','厄甲'],['F','Friendly','友谊赛'],['K','KAZ D1','哈萨超'],['C','COL D2','哥伦乙'],['C','COL C','哥伦杯'],['C','COL D1','哥伦甲'],['C','COS D1','哥斯甲'],['T','TUR U23','土A2青'],['T','TUR D3L1','土丙1'],['T','TUR D3L2','土丙2'],['T','TUR D3L3','土丙3'],['T','TUR2BK','土乙白'],['T','TUR2BB','土乙红'],['T','TUR D1','土甲'],['E','EGY PR','埃及超'],['S','Serbia D2','塞尔乙'],['S','Serbia 1','塞尔联'],['C','CYP D2','塞浦乙'],['C','CYP D1','塞浦甲'],['M','MEX U20','墨西U20'],['M','Mex D2','墨西乙'],['M','MEX D1','墨西联'],['A','AUT D3E','奥丙东'],['A','AUT D3C','奥丙中'],['A','AUT D3W','奥丙西'],['A','AUT D2','奥乙'],['A','AUT D1','奥甲'],['V','VEN D1','委超'],['W','WAL D2','威甲'],['W','WAL D2CA','威联盟'],['W','WAL D1','威超'],['A','Ang D1','安哥甲'],['N','NIG P','尼日超'],['P','PAR D1','巴拉甲'],['B','BRA D2','巴西乙'],['B','BRA CP','巴锦赛'],['G','GRE D3N','希丙北'],['G','GRE D3S','希丙南'],['G','GRE D2','希乙'],['G','GRE D1','希甲'],['G','GER U17','德U17'],['G','GER U19','德U19'],['G','GER D3','德丙'],['G','GER RN','德北联'],['G','GER RS','德南联'],['G','GER RW','德西联'],['I','ITA D3A','意丙A'],['I','ITA D3B','意丙B'],['I','ITA D3C1','意丙C1'],['I','ITA D3C2','意丙C2'],['I','ITA CP U20','意青U20'],['E','EST D3','愛沙丙'],['N','NOR D2-A','挪乙A'],['N','NOR D2-B','挪乙B'],['N','NOR D2-C','挪乙C'],['N','NOR D2-D','挪乙D'],['N','NORC','挪威杯'],['N','NOR D1','挪甲'],['N','NOR PR','挪超'],['C','CZE D3','捷丙'],['C','CZE MSFL','捷丙M'],['C','CZE D2','捷乙'],['C','CZE U19','捷克U19'],['C','CZE D1','捷克甲'],['M','Mol D2','摩尔乙'],['M','MOL D1','摩尔甲'],['M','MOR D2','摩洛哥乙'],['M','MOR D1','摩洛超'],['S','Slovakia D3E','斯丙東'],['S','Slovakia D3W','斯丙西'],['S','Slovakia D2','斯伐乙'],['S','Slovakia D1','斯伐甲'],['S','Slovenia D1','斯洛甲'],['S','SIN D1','新加联'],['J','JL3','日丙联'],['C','CHI D2','智乙'],['C','CHI D1','智甲'],['G','Geo','格鲁甲'],['G','GEO PR','格鲁超'],['U','UEFA CL','欧冠杯'],['U','UEFA SC','欧霸杯'],['B','BEL D3A','比丙A'],['B','BEL D3B','比丙B'],['B','BEL D2','比乙'],['B','BEL W1','比女甲'],['B','BEL C','比杯'],['B','BEL D1','比甲'],['S','SAU D2','沙地甲'],['S','SAU D1','沙地联'],['F','FRA D4A','法丁A'],['F','FRA D4B','法丁B'],['F','FRA D4C','法丁C'],['F','FRA D4D','法丁D'],['F','FRA D3','法丙'],['F','FRA U19','法国U19'],['F','FRA C','法国杯'],['P','POL D2E','波乙東'],['P','POL D2W','波乙西'],['P','POL 
D2','波兰乙'],['P','POL D1','波兰甲'],['B','BOS D1','波斯甲'],['P','POL YL','波青联'],['T','THA D1','泰甲'],['T','THA PL','泰超'],['H','HON D1','洪都甲'],['A','Aus BP','澳布超'],['E','EST D1','爱沙甲'],['I','IRE D1','爱甲'],['I','IRE PR','爱超'],['B','BOL D1','玻利甲'],['F','Friendly','球会赛'],['S','SWI D1','瑞士甲'],['S','SWI PR','瑞士超'],['S','SWE D2','瑞甲'],['S','SWE D1','瑞超'],['B','BLR D2','白俄甲'],['B','BLR D1','白俄超'],['P','Peru D1','秘鲁甲'],['T','TUN D2','突尼乙'],['T','Tun D1','突尼甲'],['R','ROM D2G1','罗乙1'],['R','ROM D2G2','罗乙2'],['R','ROM D1','罗甲'],['L','LIBERT C','自由杯'],['F','FIN D2','芬甲'],['F','FIN D1','芬超'],['S','SCO D3','苏丙'],['S','SUD PL','苏丹超'],['S','SCO D2','苏乙'],['S','SCO D1','苏甲'],['S','SCO HL','苏高联'],['E','ENG D2','英乙'],['E','ENG RyPR','英依超'],['E','ENG UP','英北超'],['E','ENG SP','英南超'],['E','ENG Trophy','英挑杯'],['E','ENG Con','英非'],['E','ENG CN','英非北'],['H','HOL D2','荷乙'],['H','HOL Yl','荷青甲'],['S','SV D1','萨尔超'],['P','POR U19','葡U19'],['P','POR D1','葡甲'],['P','POR PR','葡超'],['S','SPA D3B1','西丙1'],['S','SPA D3B2','西丙2'],['S','SPA D3B3','西丙3'],['S','SPA D3B4','西丙4'],['S','SPA Futsal','西內足'],['S','SPA W1','西女超'],['B','BRA CC','里州赛'],['A','Arg D2M1','阿乙M1'],['A','Arg D2M2','阿乙M2'],['A','Arg D2M3','阿乙M3'],['A','ALG D2','阿及乙'],['A','ALG D1','阿及甲'],['A','AZE D1','阿塞甲'],['A','ALB D1','阿巴超'],['A','ARG D2','阿根乙'],['U','UAE D2','阿联乙'],['K','KOR NL','韩联盟'],['F','FYRM D2','马其乙'],['M','MacedoniaFyr','马其甲'],['M','MAS D1','马来超'],['M','MON D2','黑山乙'],['M','MON D1','黑山甲'],['F','FCWC','世冠杯'],['W','World Cup','世界杯'],['F','FIFAWYC','世青杯'],['C','CWPL','中女超'],['C','CFC','中足协杯'],['D','DEN C','丹麦杯'],['A','Asia CL','亚冠杯'],['A','AFC','亚洲杯'],['R','Rus Cup','俄罗斯杯'],['H','HUN C','匈杯'],['N','NIR C','北爱杯'],['T','TUR C','土杯'],['T','Tenno Hai','天皇杯'],['W','WWC','女世杯'],['I','ITA Cup','意杯'],['G','GER C','德国杯'],['J','JPN LC','日联杯'],['S','SCO FAC','苏足总杯'],['E','ENG JPT','英锦赛'],['E','ENG FAC','足总杯'],['C','CAF NC','非洲杯'],['K','K-LC','韩联杯'],['H','HK D1','香港甲']];
The URL of the page I am scraping contains the third element of an entry (the Chinese name), but when I copy it, it becomes the percent-encoded link above.
I am not sure about the encoding.
import re
html = 'source of page'
matches = re.findall('ls_main = \[\[.*?;', html)[0]
matches = matches.decode('unknown encoding').encode('utf-8')
How can I put the original characters into the link string?
I use Python 2.7.
%XX encoding can be done using urllib.quote:
>>> import urllib
>>> urllib.quote('英冠')
'%E8%8B%B1%E5%86%A0'
>>> urllib.quote(u'英冠'.encode('utf-8')) # with explicit utf-8 encoding.
'%E8%8B%B1%E5%86%A0'
To get back the original string, use urllib.unquote:
>>> urllib.unquote('%E8%8B%B1%E5%86%A0')
'\xe8\x8b\xb1\xe5\x86\xa0'
>>> print(urllib.unquote('%E8%8B%B1%E5%86%A0'))
英冠
In Python 3.x, use urllib.parse.quote, urllib.parse.unquote:
>>> import urllib.parse
>>> urllib.parse.quote('英冠', encoding='utf-8')
'%E8%8B%B1%E5%86%A0'
>>> urllib.parse.unquote('%E8%8B%B1%E5%86%A0', encoding='utf-8')
'英冠'
