Issue with *.ics splitting strings with more than one line (Python)

I have tried every method I could find and always get the same result, but there must be a fix for this.
I am downloading an ICS file from a website, where one of the properties, SUMMARY, is split across two lines.
When I load this into a string, the two lines are automatically joined into one string, unless there is a "\n".
So I have tried to strip both "\n" and "\r", but that does not change anything.
Code
from icalendar import Calendar, Event
from datetime import datetime
import icalendar
import urllib.request
import re
from clear import clear_screen

cal = Calendar()

def download_ics():
    url = "https://www.pogdesign.co.uk/cat/download_ics/7d903a054695a48977d46683f29384de"
    file_name = "pogdesign.ics"
    urllib.request.urlretrieve(url, file_name)

def get_start_time(time):
    time = datetime.strftime(time, "%A - %H:%M")
    return time

def get_time(time):
    time = datetime.strftime(time, "%H:%M")
    return time

def check_Summary(text):
    #newline = re.sub('[\r\n]', '', text)
    newline = text.translate(str.maketrans("", "", "\r\n"))
    return newline

def main():
    download_ics()
    clear_screen()
    e = open('pogdesign.ics', 'rb')
    ecal = icalendar.Calendar.from_ical(e.read())
    for component in ecal.walk():
        if component.name == "VEVENT":
            summary = check_Summary(component.get("SUMMARY"))
            print(summary)
            print("\t Start : " + get_start_time(component.decoded("DTSTART")) + " - " + get_time(component.decoded("DTEND")))
            print()
    e.close()

if __name__ == "__main__":
    main()
output
Young Sheldon S06E11 - Ruthless, Toothless, and a Week ofBed Rest
Start : Friday - 02:00 - 02:30
The Good Doctor S06E11 - The Good Boy
Start : Tuesday - 04:00 - 05:00
National Treasure: Edge of History S01E08 - Family Tree
Start : Thursday - 05:59 - 06:59
National Treasure: Edge of History S01E09 - A Meeting withSalazar
Start : Thursday - 05:59 - 06:59
The Last of Us S01E03 - Long Long Time
Start : Monday - 03:00 - 04:00
The Last of Us S01E04 - Please Hold My Hand
Start : Monday - 03:00 - 04:00
Anne Rice's Mayfair Witches S01E04 - Curiouser and Curiouser
Start : Monday - 03:00 - 04:00
Anne Rice's Mayfair Witches S01E05 - The Thrall
Start : Monday - 03:00 - 04:00
The Ark S01E01 - Everyone Wanted to Be on This Ship
Start : Thursday - 04:00 - 05:00
I have looked through all kinds of solutions, like converting the text to "utf-8" and "ISO-8859-8".
I have tried some functions I found in the icalendar package.
I have even asked ChatGPT for help.
As you can see on the first line of the output:
Young Sheldon S06E11 - Ruthless, Toothless, and a Week ofBed Rest
and
National Treasure: Edge of History S01E09 - A Meeting withSalazar
In the downloaded ICS file, each of these two summaries is spread over two separate lines, and I cannot manage to keep them split, or to stop them from being joined at all...

So far as the icalendar.Calendar class is concerned, that ical is incorrectly formatted.
icalendar.Calendar.from_ical() calls icalendar.parser.Contentlines.from_ical(), which is
def from_ical(cls, ical, strict=False):
    """Unfold the content lines in an iCalendar into long content lines.
    """
    ical = to_unicode(ical)
    # a fold is carriage return followed by either a space or a tab
    return cls(uFOLD.sub('', ical), strict=strict)
where uFOLD is re.compile('(\r?\n)+[ \t]')
That means it's removing each series of newlines that is followed by one space or tab character – not replacing it with a space. The ical file you're retrieving has e.g.
SUMMARY:Young Sheldon S06E11 - \\nRuthless\\, Toothless\\, and a Week of\r\n Bed Rest\r\n
so when of\r\n Bed is matched it becomes ofBed.
This line-folding format is defined in RFC 2445 which gives the example
For example the line:
DESCRIPTION:This is a long description that exists on a long line.
Can be represented as:
DESCRIPTION:This is a lo
 ng description
  that exists on a long line.
which makes clear that the implementation in from_ical() is correct.
If you're quite sure that the source ical will always fold lines on words, you could adjust for that by adding a space after each line fold, like:
ecal = icalendar.Calendar.from_ical(e.read().replace(b'\r\n ', b'\r\n  '))
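To see the effect without downloading anything, here is a minimal sketch that applies the pattern quoted above to a shortened sample line. The unfold helper and the sample string are made up for illustration; only the regular expression is taken from the answer, and the real library works on the whole file rather than on one property.

```python
import re

# Same pattern the answer quotes from icalendar:
# one or more line breaks followed by a single space or tab.
FOLD = re.compile(r'(\r?\n)+[ \t]')

def unfold(raw: str) -> str:
    """Hypothetical helper: remove folds the way the quoted from_ical() does."""
    return FOLD.sub('', raw)

# Shortened sample: the fold falls between two words, as in the feed.
raw = "SUMMARY:A Week of\r\n Bed Rest\r\n"

# Plain unfolding drops the line break and the one fold character.
strict = unfold(raw)

# Workaround: add an extra space after each fold first, so one space survives.
patched = unfold(raw.replace("\r\n ", "\r\n  "))

print(repr(strict))
print(repr(patched))
```

The workaround only helps when the producer folds between words; a file that folds in the middle of a word, as the RFC example does, would gain unwanted spaces.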

Related

Process lines with different sizes to csv

I'm trying to convert a PDF bank statement to CSV. I'm fairly new to Python, but I managed to extract the text from the PDF. I ended up with something similar to this:
AMAZON 23/12/2019 15:40 -R$ 100,00 R$ 400,00 credit
Some Restaurant 23/12/2019 14:00 -R$ 10,00 R$ 500 credit
Received from John Doe 22/12/2019 15:00 R$ 510 R$ 500,00
03 Games 22/12/2019 15:00 R$ 10 R$ 10,00 debit
I want this output:
AMAZON;23/12/2019;-100,00
Some Restaurant;23/12/2019;-10,00
Received from John Doe;22/12/2019;510
03 Games;22/12/2019;10
The first field varies in length. I don't need the time or the currency symbol, and I don't need the last two fields.
I have this code so far (just extracting text from PDF):
import pdfplumber
import sys
url = sys.argv[1]
pdf = pdfplumber.open(url)
pdf_pages = len(pdf.pages)
for i in range(pdf_pages):
    page = pdf.pages[i]
    text = page.extract_text()
    print(text)
pdf.close()
Can anyone give some directions?
Try using the split method: split the text into lines, split each line into its separate parts, and then pick the parts you need.
The following link explains it very nicely.
https://www.w3schools.com/python/showpython.asp?filename=demo_ref_string_split
from typing import List

def get_date_index(entries: List[str]) -> int:
    # return the position of the first entry that looks like dd/mm/yyyy
    for index, entry in enumerate(entries):
        if len(entry) != 10:
            continue
        if entry[2] != "/" or entry[5] != "/":
            continue
        # here you could also check that the other characters are digits
        return index
    raise ValueError("No date found")

lines: List[str] = text.split("\n")
for line in lines:
    entries: List[str] = line.split()
    if not entries:
        continue  # skip blank lines
    date_index: int = get_date_index(entries)
    # everything before the date is the name, which may contain spaces
    name = " ".join(entries[:date_index])
    # after the date and time come the currency token ("R$" or "-R$") and the amount
    sign = "-" if entries[date_index + 2].startswith("-") else ""
    amount = sign + entries[date_index + 3]
    print(f"{name};{entries[date_index]};{amount}")
That should print it.
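As an alternative to splitting on whitespace, the same fields can be picked out with one regular expression. This is only a sketch under the assumption that every line looks like the samples in the question (name, dd/mm/yyyy date, hh:mm time, then an optional minus sign, R$ and the amount); LINE_RE and to_csv_row are hypothetical names, not part of pdfplumber.

```python
import re

# Assumed layout: <name> <dd/mm/yyyy> <hh:mm> [-]R$ <amount> ...
LINE_RE = re.compile(
    r'^(?P<name>.+?)\s+'            # name, possibly containing spaces
    r'(?P<date>\d{2}/\d{2}/\d{4})\s+'
    r'\d{2}:\d{2}\s+'               # time, not kept
    r'(?P<sign>-?)R\$\s*'           # optional minus sign before the currency
    r'(?P<amount>[\d.,]+)'          # first amount; later fields are ignored
)

def to_csv_row(line):
    """Return 'name;date;amount', or None when the line does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return "{};{};{}{}".format(
        m.group('name'), m.group('date'), m.group('sign'), m.group('amount'))

samples = [
    "AMAZON 23/12/2019 15:40 -R$ 100,00 R$ 400,00 credit",
    "Some Restaurant 23/12/2019 14:00 -R$ 10,00 R$ 500 credit",
    "Received from John Doe 22/12/2019 15:00 R$ 510 R$ 500,00",
    "03 Games 22/12/2019 15:00 R$ 10 R$ 10,00 debit",
]
rows = [to_csv_row(s) for s in samples]
for row in rows:
    print(row)
```

Returning None for lines that do not match makes it easy to skip page headers and other text that the PDF extraction may produce.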

How to format dates and times?

I'm trying to make a comments section for a website with the backend written in Python. As of now, everything works fine except that I cannot figure out how to format the date and time the way I want.
What I am trying to have at the time of posting is either of these:
Posted on Tue, 06/12/18 at - 11:20
or
Posted on 06/12/18 at - 11:21
Currently, what I have when the method is called is this:
import time
from datetime import *
time = ("Posted on " + str(datetime.now().day) + "/"
+ str(datetime.now().month) + " at: " + str(datetime.now().hour)
+ ":" + str(datetime.now().minute))
You can use datetime.datetime.strftime() to build any format you want:
import datetime
time = datetime.datetime.now().strftime("Posted on %d/%m/%y at - %H:%M")
print(time) # Posted on 07/12/17 at - 00:29
Read up on the format codes you can use when building your date string in the Python documentation section "strftime() and strptime() Behavior".
datetime.datetime.now().strftime('%d/%m/%y at - %H:%M')
'06/12/17 at - 16:29'
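The question also asks for a variant with the weekday in front. A small sketch, assuming the day/month/year order used in the answers and a fixed timestamp so the result is reproducible (a real comment would use datetime.now()): the %a code adds the abbreviated weekday name, which is in English under the default C locale.

```python
from datetime import datetime

# Fixed timestamp chosen for illustration only: 6 December 2018, 11:20.
stamp = datetime(2018, 12, 6, 11, 20)

# Without the weekday, as in the answer above.
short = stamp.strftime("Posted on %d/%m/%y at - %H:%M")

# With the abbreviated weekday name (%a) in front of the date.
long_form = stamp.strftime("Posted on %a, %d/%m/%y at - %H:%M")

print(short)      # Posted on 06/12/18 at - 11:20
print(long_form)  # Posted on Thu, 06/12/18 at - 11:20
```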

How do I distinguish two emails from one string in python using regex

I have a string (from a page source) that contains two emails:
<span class="inlinemeta">From: D Hui <dhui#tcmclinic.com>
Sent: Friday, June 18, 2010 12:57 PM
</span>
<span class="inlinemeta">To: 'pcox#medcoc.org'
Subject: New med approved?
What I need is to pull out the four attributes: SentFrom, SentTo, SentOn, Subject.
With help from Stack Overflow I am able to get SentOn; I am now stuck on how to distinguish the two emails.
The raw text to be parsed can differ slightly from one message to the next: From may include a display name before the address (in this case D Hui) or may not (like the second email), and the same goes for To, so I need a solution that is a bit flexible.
Thank you very much in advance. I just started Python a week ago, so please pardon me if the question is too simple or the solution is easy to find online.
In the meantime, I will certainly try to figure it out myself.
This is a more general solution that breaks the text into lines. It also uses split and strip to handle the date and subject without a regex.
import re
message_text = """
<span class="inlinemeta">From: D Hui <dhui#tcmclinic.com>
Sent: Friday, June 18, 2010 12:57 PM
</span>
<span class="inlinemeta">To: 'pcox#medcoc.org'
Subject: New med approved?
"""
email_regex = r"[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"
for line in message_text.split('\n'):
    line = line.rstrip()
    if 'From:' in line:
        e_from = re.search(email_regex, line).group(0)
    if 'Sent:' in line:
        # split only at the first colon so the time (12:57 PM) stays intact
        e_sent = line.split(':', 1)[1].strip()
    if 'To:' in line:
        e_to = re.search(email_regex, line).group(0)
    if 'Subject:' in line:
        e_subject = line.split(':', 1)[1].strip()
print("e_from = %s" % e_from)
print("e_sent = %s" % e_sent)
print("e_to = %s" % e_to)
print("e_subject = %s" % e_subject)
Output
e_from = dhui#tcmclinic.com
e_sent = Friday, June 18, 2010 12:57 PM
e_to = pcox#medcoc.org
e_subject = New med approved?
The email_regex comes from emailregex.com
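A colon can also appear inside a header value, as in the time 12:57 PM, so it matters where the line is cut. As a hedged alternative to split, str.partition cuts once at the first occurrence of the header name and leaves the rest untouched. The header_value helper below is a hypothetical name introduced for this sketch.

```python
def header_value(line, name):
    """Return the text after 'name:' in line, or None if the header is absent."""
    _, found, rest = line.partition(name + ":")
    return rest.strip() if found else None

sent = header_value("Sent: Friday, June 18, 2010 12:57 PM", "Sent")
subject = header_value("Subject: New med approved?", "Subject")
missing = header_value("</span>", "Sent")

print(sent)     # Friday, June 18, 2010 12:57 PM
print(subject)  # New med approved?
print(missing)  # None
```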

Searching and sorting in Python

I am writing a script in Python that searches for strings and is supposed to do different things depending on the strings it encounters.
import re, datetime
from datetime import *
f = open(raw_input('Name of file to search: '))
strToSearch = ''
for line in f:
    strToSearch += line
patFinder = re.compile('\d{2}\/\d{2}\/\d{4}\sA\d{3}\sB\d{3}')
findPat1 = re.findall(patFinder, strToSearch)
# search only dates
datFinder = re.compile('\d{2}\/\d{2}\/\d{4}')
findDat = re.findall(datFinder, strToSearch)
nowDate = date.today()
fileLst = open('cels.txt', 'w')
ntrdLst = open('not_ready.txt', 'w')
for i in findPat1:
    for Date in findDat:
        Date = datetime.strptime(Date, '%d/%m/%Y')
        Date = Date.date()
        endDate = Date + timedelta(days=731)
    if endDate < nowDate:
        fileLst.write(i)
    else:
        ntrdLst.write(i)
f.close()
fileLst.close()
ntrdLst.close()
toClose = raw_input('File was modified, press enter to close: ')
So basically it searches for strings with dates and numbers, then builds the same list with only the dates, converts the dates, adds 2 years to each and compares: if the resulting date is past today's date, the entry goes to ntrdLst; if not, to fileLst.
My problem is that it writes the same list (i) multiple times and doesn't do the sorting.
I am fairly new to Python and programming, so I am asking for your help. Thanks in advance.
edit: -----------------
the normal output was this (without the date and if statement)
27/01/2009 A448 B448
22/10/2001 A434 B434
06/09/2007 A825 B825
06/09/2007 A434 B434
06/05/2010 A826 B826
What I would like is that an entry whose date is after date.today(), say 27/01/2016, is written to another file. What I keep getting is the script printing this list 30 times, or not taking the if statement into account.
(Sorry, the if was indeed indented inside the last loop; I made a mistake while pasting it in here.)
You're computing endDate in a loop, once for each date... but not doing anything with it in the loop. So, after the loop is over, you have the very last endDate, and you use only that one to decide which file to write to.
I'm not sure what your logic is supposed to be, but I'm pretty sure you want to put the if statement with the writes inside the inner loop.
If you do that, then if you have, say, 100 pattern matches and 25 dates, you'll end up writing 2500 strings--some to one file, some to the other. Is that what you wanted?
SOLVED
I gave it a little (A LOT) of thought and put it all together in one piece. I knew there were too many for loops, but now I've got it. Thanks anyway to those who lent me a helping hand. I leave the code here for anyone having a similar problem.
nowDate = date.today()
for line in sourceFile:
    s = re.compile('(\d{2}\/\d{2}\/\d{4})\s(C\d{3}\sS\d{3})')
    s1 = re.search(s, line)
    if s1:
        date = s1.group(1)
        date = datetime.strptime(date, '%d/%m/%Y')
        date = date.date()
        endDate = date + timedelta(days=731)
        if endDate <= nowDate:
            fileLst.write(s1.group())
            fileLst.write('\n')
        else:
            print ('not ready: ', date.strftime('%d-%m-%Y'))
            ntrdLst.write(s1.group(1))
            ntrdLst.write('\n')
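For readers who want to try the single-pass idea without any files, here is a self-contained Python 3 sketch. It uses the A/B pattern and three sample lines from the question, returns two lists instead of writing files, and takes "today" as a parameter so the result is reproducible; split_by_age and the fixed date are assumptions made for this example.

```python
import re
from datetime import date, datetime, timedelta

# One pattern captures the date and the full entry in the same match,
# so each entry is compared against its own date only.
PATTERN = re.compile(r'(\d{2}/\d{2}/\d{4})\s(A\d{3}\sB\d{3})')

def split_by_age(lines, today, days=731):
    """Return (ready, not_ready) lists of matched entries."""
    ready, not_ready = [], []
    for line in lines:
        m = PATTERN.search(line)
        if not m:
            continue  # ignore lines without a dated entry
        start = datetime.strptime(m.group(1), '%d/%m/%Y').date()
        if start + timedelta(days=days) <= today:
            ready.append(m.group())
        else:
            not_ready.append(m.group())
    return ready, not_ready

sample = [
    "27/01/2009 A448 B448",
    "22/10/2001 A434 B434",
    "06/05/2010 A826 B826",
]
# Fixed reference date, chosen only so the example is deterministic.
ready, not_ready = split_by_age(sample, today=date(2010, 1, 1))
print(ready)      # ['22/10/2001 A434 B434']
print(not_ready)  # ['27/01/2009 A448 B448', '06/05/2010 A826 B826']
```

Because every entry is visited once, three input lines produce three results in total, rather than one write per combination of entry and date.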

Python - Cutting the first 5 characters from a string - how to get my new shorter string?

I'm very new to Python. I am working on an LCD Raspberry Pi project, displaying strings on an LCD.
I create a string from a command to show a radio track name (line 1 in my code), however this string always starts with 'Name:'. This reads directly from MPD (Music Player Daemon) so nothing I can do about that up front.
As it always starts with the same number of characters, I want to remove '5' characters from the start of this string, and have a new string to play with. Sounds simple to me...but I cannot make this work.
I'm trying:
print station[5:]
based on something I found while searching for the answer, but it appears to do nothing.
Here's the main block of my code: (again, the 5th line was intended to work...)
f=os.popen("echo 'currentsong' | nc localhost 6600 | grep -e '^Name: '")
station = ""
for i in f.readlines():
    station += i
    print station[5:]
str_pad = " " * 16
station = str_pad + station
for i in range (0, len(station)):
    lcd_byte(LCD_LINE_1, LCD_CMD)
    lcd_text = station[i:(i+16)]
    lcd_string(lcd_text,1)
    time.sleep(0.3)
lcd_byte(LCD_LINE_1, LCD_CMD)
lcd_string(str_pad,1)
lcd_byte(LCD_LINE_2, LCD_CMD)
lcd_string("**Playing**",2)
This just continues to show the entire line, such as "Name: Pink Floyd - Money"
If anyone can help, I'll be truly grateful.
Thanks
The problem is that when you do print station[5:] you're slicing the string and displaying it, but never saving the result. Remember, in Python, strings are immutable (they don't change). As a result, doing station[5:] will simply return a new string that is never saved.
Instead, replace line 5 with station = station[5:]. This will overwrite the station string with the new version that doesn't start with Name:.
Well, just printing station[5:] won't do what you want. You need to do:
station = station[5:]
Replace lines 4 and 5 with:
station += i[5:]
print station
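The point about immutability can be checked in a few lines. This is a small sketch in Python 3 syntax (the thread itself uses Python 2 print statements), with a made-up sample string standing in for the MPD output; strip() is an addition that removes the space after the colon and the trailing newline.

```python
# Sample value standing in for the line read from MPD (assumed format).
station = "Name: Pink Floyd - Money\n"

# Slicing returns a new string; the result here is simply thrown away.
station[5:]
unchanged = station

# Assigning the slice keeps it. strip() also drops the leading space
# and the trailing newline left over from the command output.
trimmed = station[5:].strip()

print(repr(unchanged))  # 'Name: Pink Floyd - Money\n'
print(repr(trimmed))    # 'Pink Floyd - Money'
```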
