Sorting by month-year groups by month instead - python

I have a curious Python problem.
The script takes two CSV files: one with a column of dates and a column of text snippets, and another with a list of names (substrings).
All the code does is step through both lists, building up a name-mentioned-per-month matrix.
FILE with dates and text: (Date, Snippet first column)
ENTRY 1 : Sun 21 nov 2014 etc, The release of the iphone 7 was...
-strings file
iphone 7
apple
apples
innovation etc.
The problem is that when I try to order the columns chronologically, e.g. Oct 2014, Nov 2014, Dec 2014 and so on, the output groups the same months together instead, which isn't what I want.
import csv
from datetime import datetime
file_1 = input('Enter first CSV name (one with the date and snippet): ')
file_2 = input('Enter second CSV name (one with the strings): ')
outp = input('Enter the output CSV name: ')
file_1_list = []
head = True
for row in csv.reader(open(file_1, encoding='utf-8', errors='ignore')):
    if head:
        head = False
        continue
    date = datetime.strptime(row[0].strip(), '%a %b %d %H:%M:%S %Z %Y')
    date_str = date.strftime('%b %Y')
    file_1_list.append([date_str, row[1].strip()])
file_2_dict = {}
for line in csv.reader(open(file_2, encoding='utf-8', errors='ignore')):
    s = line[0].strip()
    for d in file_1_list:
        if s.lower() in d[1].lower():
            if s in file_2_dict.keys():
                if d[0] in file_2_dict[s].keys():
                    file_2_dict[s][d[0]] += 1
                else:
                    file_2_dict[s][d[0]] = 1
            else:
                file_2_dict[s] = {
                    d[0]: 1
                }
months = []
for v in file_2_dict.values():
    for k in v.keys():
        if k not in months:
            months.append(k)
months.sort()
rows = [[''] + months]
for k in file_2_dict.keys():
    tmp = [k]
    for m in months:
        try:
            tmp.append(file_2_dict[k][m])
        except:
            tmp.append(0)
    rows.append(tmp)
print("still working on it be patient")
writer = csv.writer(open(outp, "w", encoding='utf-8', newline=''))
for r in rows:
    writer.writerow(r)
print('Done...')
From my understanding, months.sort() isn't doing what I expect it to?
I have looked here, where they apply another function to sort the data, using attrgetter:
from operator import attrgetter
>>> l = [date(2014, 4, 11), date(2014, 4, 2), date(2014, 4, 3), date(2014, 4, 8)]
and then
sorted(l, key=attrgetter('month'))
But I am not sure whether that would work for me?
From my understanding I parse the dates on lines 12-13. Am I missing a step that orders the data first, like
data = sorted(data, key = lambda row: datetime.strptime(row[0], "%b-%y"))
I have only just started learning Python, and so many things are new to me that I don't know what is right and what isn't.
What I want(of course with the correctly sorted data):

This took a while because you had so much unrelated stuff about reading csv files and finding and counting tags. But you already have all that, and it should have been completely excluded from the question to avoid confusing people.
It looks like your actual question is "How do I sort dates?"
Of course "Apr-16" comes before "Oct-14", didn't they teach you the alphabet in school? A is the first letter! I'm just being silly to emphasize a point -- it's because they are simple strings, not dates.
You need to convert the string to a date with the datetime class method strptime, as you already noticed. Because the class has the same name as the module, you need to pay attention to how it is imported. You then go back to a string later with the member method strftime on the actual datetime (or date) instance.
Here's an example:
from datetime import datetime
unsorted_strings = ['Oct-14', 'Dec-15', 'Apr-16']
unsorted_dates = [datetime.strptime(value, '%b-%y') for value in unsorted_strings]
sorted_dates = sorted(unsorted_dates)
sorted_strings = [value.strftime('%b-%y') for value in sorted_dates]
print(sorted_strings)
['Oct-14', 'Dec-15', 'Apr-16']
or skipping to the end
from datetime import datetime
unsorted_strings = ['Oct-14', 'Dec-15', 'Apr-16']
print (sorted(unsorted_strings, key = lambda x: datetime.strptime(x, '%b-%y')))
['Oct-14', 'Dec-15', 'Apr-16']
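Applied to the question's script, the same idea might look like the sketch below. It assumes the column labels are built with strftime('%b %Y') as in the question (for example 'Nov 2014') and that month names are in English. The sample labels are made up for illustration.

```python
from datetime import datetime

# Hypothetical month labels in the '%b %Y' form the question's script produces.
months = ['Nov 2014', 'Apr 2016', 'Oct 2014', 'Dec 2015', 'Jan 2015']

# Sort by the parsed date rather than by the label text,
# so the order is chronological instead of alphabetical.
months.sort(key=lambda label: datetime.strptime(label, '%b %Y'))

print(months)
# ['Oct 2014', 'Nov 2014', 'Jan 2015', 'Dec 2015', 'Apr 2016']
```

In the original script this would replace the plain months.sort() call. The labels stay strings, so the rest of the code that uses them as dictionary keys should not need to change.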

Related

How to analyze dates in the form of tuples from a database?

I have extracted a series of dates from a database I am storing within a list. Like this:
query_age = db.select([customer.columns.dob])
proxy_age = connection.execute(query_age)
result_age = proxy_age.fetchall()
date = []
for row in result_age:
    date.append(' '.join(row))
It comes down to looking like this: ['2002-11-03', '1993-08-25', '1998-01-30']
I have tried the following but it comes out very unpythonic:
ages = []
years = []
for row in y:
    ages.append(''.join(row))
for i in ages:
    years.append(int(i[:4]))
years_age = [int(x) for x in years]
print(years_age)
I figured with this I could just convert the given string to an integer and subtract from 2021 but it looks ugly to me.
I am trying to pass the items within the list to a function called 'age' which will determine the amount of elapsed time between the specific date and the present in years.
I have tried datetime.strptime() but cannot figure it out. I am a very new programmer. Any help would be appreciated.
Using the information from this post, you can create a function age like the one below. Note that fmt is simply the format of your date string
import datetime
from dateutil.relativedelta import relativedelta
def age(dob, fmt='%Y-%m-%d'):
    today = datetime.datetime.now()
    birthdate = datetime.datetime.strptime(dob, fmt)
    difference_in_years = relativedelta(today, birthdate).years
    return difference_in_years
Then, using the information from your post above:
DOB = ['2002-11-03', '1993-08-25', '1998-01-30']
for d in DOB:
    print("DOB:%s --> Age: %i" % (d, age(d)))
# DOB:2002-11-03 --> Age: 18
# DOB:1993-08-25 --> Age: 28
# DOB:1998-01-30 --> Age: 23
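If installing dateutil is not an option, a standard-library-only variant is possible. The sketch below is an alternative to the answer's function, not a copy of it: the name age_in_years and the today parameter are introduced here for illustration, and the fixed reference day only exists to make the printed ages reproducible.

```python
from datetime import date, datetime

def age_in_years(dob, today=None, fmt='%Y-%m-%d'):
    """Return the whole years elapsed between dob (a string) and today."""
    born = datetime.strptime(dob, fmt).date()
    if today is None:
        today = date.today()
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)

# Fixed reference day (an assumption) so the output does not change over time.
reference_day = date(2021, 9, 1)
for d in ['2002-11-03', '1993-08-25', '1998-01-30']:
    print(d, age_in_years(d, today=reference_day))
# 2002-11-03 18
# 1993-08-25 28
# 1998-01-30 23
```

Leaving out the today argument makes the function use the current date, which is what the question asks for.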

Need to output the remaining holidays in the txt file using python

happy holidays!
I am working on the project that needs to send reminders about public holidays 3 weeks in advance. I have completed this part and now need to add a function that will also send the remaining holidays for the year in addition to the upcoming holiday. Any tips or suggestions on how I can approach this will be greatly appreciated as I am new to coding!
Here is the code I have for now:
import datetime
from datetime import timedelta
import calendar
import time
import smtplib as smtp
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.message import EmailMessage
holidayFile = 'calendar.txt'
def run(input):
    checkTodaysHolidays()
def checkTodaysHolidays():
    file = open(holidayFile, 'r')
    date = (datetime.date.today() + datetime.timedelta(days=21)).strftime('%Y/%m/%d')
    publicHolidayName = ''
    for line in file:
        if date in line:
            publicHolidayName = " ".join(line.split()[1:])
Thank you.
I think the easiest way to do this would be to use the datetime module and the timedelta class you've already imported.
I would convert the data in your text file into a list, and then build a function that compares today's date to the holidays in this list.
holidayFile = open('calendar.txt', 'r')
holidayLines = holidayFile.readlines()
holidayFile.close()
holidayNames = []
holidayDates = []
for x in range(0, len(holidayLines)):
    # ... Get the date, first
    this = holidayLines[x].split()  # since we know they're formatted "YYYY/MM/DD Name of Holiday"
    rawdate = this[0]
    datechunks = rawdate.split("/")  # separate out the YYYY, MM, and DD for use
    # convert to integers, because datetime.datetime() does not accept strings
    newdate = (int(datechunks[0]), int(datechunks[1]), int(datechunks[2]))
    holidayDates.append(newdate)
    # ... then get the Name
    del this[0]  # remove the date from our split list
    name = " ".join(this)  # rejoin with spaces so multi-word names stay readable
    holidayNames.append(name)
So in the block before our function, I:
1: Open the file and store each line, then close it.
2: Iterate through each line and separate out the date, and store the tuples in a list.
3: Save the name to a separate array.
Then we do the comparison.
def CheckAllHolidays():
    returnNames = []  # a storage list for the names of each holiday
    returnDays = []  # a storage list for all the holidays that are expected in our response
    today = datetime.datetime.now()
    threeweeks = timedelta(weeks=3)
    for x in range(0, len(holidayDates)):
        doi = holidayDates[x]  # a tuple containing the date of interest
        year = int(doi[0])
        month = int(doi[1])
        day = int(doi[2])
        holiday = datetime.datetime(year, month, day)
        if holiday > today:
            # convert the holiday date to a date three weeks earlier using timedelta
            returnDays.append(holiday - threeweeks)
            returnNames.append(holidayNames[x])
        else:
            pass  # do nothing if date has passed
    return (returnDays, returnNames)
What I did here is I:
1: Create an array inside the function to store our holiday names.
2: Convert the date from the previous array into a datetime.datetime() object.
3: Compare two objects of like kind in an if block, and
4: Return a list of dates three-weeks before each holiday, with names for the holidays that a reminder should be set for.
Then you're all set. You could call
ReminderDates = CheckAllHolidays()[0]
ReminderNames = CheckAllHolidays()[1]
and then use those two lists to create your reminders! ReminderDates would be an array filled with datetime.datetime() objects, and ReminderNames would be an array filled with string values.
I'm sorry my response was kinda long, but I really hope I was able to help you with your issue! Happy Holidays <3
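For the "remaining holidays for the year" part of the question, a compact sketch could look like the one below. It assumes each line of calendar.txt has the form "YYYY/MM/DD Name of Holiday", as the answer above does. The function name remaining_holidays and the sample lines are hypothetical and only serve the illustration.

```python
import datetime

def remaining_holidays(lines, today=None):
    """Return sorted (date, name) pairs for holidays after today, same year only."""
    if today is None:
        today = datetime.date.today()
    remaining = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        holiday = datetime.datetime.strptime(parts[0], '%Y/%m/%d').date()
        if holiday > today and holiday.year == today.year:
            remaining.append((holiday, ' '.join(parts[1:])))
    return sorted(remaining)

# Made-up sample lines standing in for the contents of calendar.txt.
sample = [
    "2021/12/25 Christmas Day",
    "2021/01/01 New Year's Day",
    "2021/11/25 Thanksgiving Day",
    "2022/01/01 New Year's Day",
]

for day, name in remaining_holidays(sample, today=datetime.date(2021, 11, 1)):
    print(day.strftime('%Y/%m/%d'), name)
# 2021/11/25 Thanksgiving Day
# 2021/12/25 Christmas Day
```

With a real file, the lines could come from open('calendar.txt') directly, and the resulting list could be appended to the body of the reminder e-mail.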

Python Dates Hashtable

I am trying to make a hash table to speed up the process of finding the difference between a particular date to a holiday date (I have a list of 10 holiday dates).
holidays =['2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
'2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
'2013-11-28', '2013-12-25']
from datetime import datetime
holidaydate=[]
for i in range(10):
    holidaydate.append(datetime.strptime(holidays[i], '%Y-%m-%d'))
# the integers look like 20140101, so the format has no dashes
newdate=pd.to_datetime(df.YEAR*10000+df.MONTH*100+df.DAY_OF_MONTH,format='%Y%m%d')
#newdate contains all the 0.5 million of dates!
Now I want to use a hash table to calculate the difference between each of the 0.5 million dates in "newdate" and the closest holiday. I do NOT want to do the same calculation millions of times; that's why I want to use a hash table for this.
I tried searching for a solution on google but only found stuff such as:
keys = ['a', 'b', 'c']
values = [1, 2, 3]
hash = {k:v for k, v in zip(keys, values)}
And this does not work in my case.
Thanks for your help!
You need to create the table first. Like this.
import datetime
holidays =['2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
           '2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
           '2013-11-28', '2013-12-25']
hdates = []
def return_date(txt):
    # Convert a "yyyy-mm-dd" string to a datetime.date
    _t = txt.split("-")
    return datetime.date(int(_t[0]), int(_t[1]), int(_t[2]))
def find_closest(d):
    # Return the holiday nearest to d and the distance to it in days
    _d = min(hdates, key=lambda x:abs(x-d))
    _diff = abs(_d - d).days
    return _d, _diff
# Convert holidays to datetime.date
for h in holidays:
    hdates.append(return_date(h))
# Build the "hash" table
hash_table = {}
i_date = datetime.date(2013, 1, 1)
while i_date < datetime.date(2016,1,1):
    cd, cdiff = find_closest(i_date)
    hash_table[i_date] = {"date": cd, "difference": cdiff}
    i_date = i_date + datetime.timedelta(days=1)
print(hash_table[datetime.date(2014,10,15)])
This works on datetime.date objects instead of raw strings. The helper function return_date converts a "yyyy-mm-dd" string to a datetime.date.
This creates a hash table for all dates between 1/1/2013 and 31/12/2015 and then tests this with just one date. You would then loop your 0.5 million dates and match the result in this dictionary (key is datetime.date object but you can of course convert this back to string if you so desire).
Anyway, this should give you the idea how to do this.
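If the date range is not known in advance, one alternative is to sort the holidays once and use a binary search per lookup, then cache the result for each distinct date. The sketch below swaps in the standard bisect module for the precomputed table. The function name days_to_closest_holiday and the sample dates are assumptions, and date.fromisoformat needs Python 3.7 or newer.

```python
import bisect
import datetime

holidays = ['2014-01-01', '2014-01-20', '2014-02-17', '2014-05-26',
            '2014-07-04', '2014-09-01', '2014-10-13', '2013-11-11',
            '2013-11-28', '2013-12-25']

# Sort once so each lookup can use binary search.
hdates = sorted(datetime.date.fromisoformat(h) for h in holidays)

def days_to_closest_holiday(d):
    """Return the distance in days from d to the nearest holiday."""
    i = bisect.bisect_left(hdates, d)
    # Only the holiday just before and the one at/after d can be the closest.
    candidates = hdates[max(i - 1, 0):i + 1]
    return min(abs((h - d).days) for h in candidates)

# Hypothetical input with repeated dates; each distinct date is computed once.
dates = [datetime.date(2014, 10, 15), datetime.date(2014, 1, 1),
         datetime.date(2014, 10, 15)]
cache = {d: days_to_closest_holiday(d) for d in set(dates)}
print([cache[d] for d in dates])
# [2, 0, 2]
```

Because half a million rows usually contain far fewer distinct dates, the dictionary stays small and every repeated date becomes a plain lookup.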

How would I normalize dates in a csv file? python

I have a CSV file with a field named start_date that contains data in a variety of formats.
Some of the formats include e.g., June 23, 1912 or 5/11/1930 (month, day, year). But not all values are valid dates.
I want to add a start_date_description field adjacent to the start_date column to filter invalid date values into. Lastly, normalize all valid date values in start_date to ISO 8601 (i.e., YYYY-MM-DD).
So far I have only been able to load the start_date column from my file. I am stuck and would appreciate any help. Any solution, especially one that doesn't use a library, would be great!
import csv
date_column = ("start_date")
f = open("test.csv","r")
csv_reader = csv.reader(f)
headers = None
results = []
for row in csv_reader:
    if not headers:
        headers = []
        for i, col in enumerate(row):
            if col in date_column:
                headers.append(i)
    else:
        results.append(([row[i] for i in headers]))
print(results)
One way is to use dateutil module, you can parse data as follows:
from dateutil import parser
parser.parse('3/16/78')
parser.parse('4-Apr') # this will give current year i.e. 2017
Then parsing to your format can be done by
dt = parser.parse('3/16/78')
dt.strftime('%Y-%m-%d')
Suppose you have table in dataframe format, you can now define parsing function and apply to column as follows:
def parse_date(x):
    try:
        return parser.parse(x).strftime('%Y-%m-%d')
    except Exception:
        # anything dateutil cannot parse becomes an empty string
        return ''
df['parse_date'] = df.start_date.map(lambda x: parse_date(x))
Question: ... add a start_date_description ... normalize ... to ISO 8601
This reads the file test.csv and validates the date string in the start_date column against a list of date directive patterns, returning a dict {description, ISO}. The returned dict is used to update the current row dict, and the updated row dict is written to the file test_update.csv.
Put this in a NEW Python file and run it!
A missing valid date directive pattern can simply be added to the list.
Python » 3.6 Documentation: 8.1.8. strftime() and strptime() Behavior
import csv
import re
from datetime import datetime as dt
def validate(date):
    def _dict(desc, date):
        return {'start_date_description':desc, 'ISO':date}
    for format in [('%m/%d/%y','Valid'), ('%b-%y','Short, missing Day'), ('%d-%b-%y','Valid'),
                   ('%d-%b','Short, missing Year')]: #, ('%B %d. %Y','Valid')]:
        try:
            _dt = dt.strptime(date, format[0])
            return _dict(format[1], _dt.strftime('%Y-%m-%d'))
        except ValueError:
            # pattern did not match, try the next one
            continue
    if not re.search(r'\d+', date):
        return _dict('No Digit', None)
    return _dict('Unknown Pattern', None)
with open('test.csv') as fh_in, open('test_update.csv', 'w') as fh_out:
    csv_reader = csv.DictReader(fh_in)
    csv_writer = csv.DictWriter(fh_out,
                                fieldnames=csv_reader.fieldnames +
                                ['start_date_description', 'ISO'] )
    csv_writer.writeheader()
    for row, values in enumerate(csv_reader,2):
        values.update(validate(values['start_date']))
        # Show only Invalid Dates
        if any(w in values['start_date_description']
               for w in ['Unknown', 'No Digit', 'missing']):
            print('{:>3}: {v[start_date]:13.13} {v[start_date_description]:<22} {v[ISO]}'.
                  format(row, v=values))
        csv_writer.writerow(values)
Output:
start_date     start_date_description  ISO
June 23. 1912  Valid                   1912-06-23
12/31/91       Valid                   1991-12-31
Oct-84         Short, missing Day      1984-10-01
Feb-09         Short, missing Day      2009-02-01
10-Dec-80      Valid                   1980-12-10
10/7/81        Valid                   1981-10-07
Facere volupt  No Digit                None
... (omitted for brevity)
Tested with Python: 3.4.2
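For the two formats the question names explicitly (June 23, 1912 and 5/11/1930 as month/day/year), a standard-library-only sketch might look like this. The function name normalize and the KNOWN_FORMATS list are introduced here for illustration. Treating the description column as the place where unparseable values end up is one reading of the question, not something it states precisely.

```python
from datetime import datetime

# Formats named in the question; extend the list as new patterns turn up.
KNOWN_FORMATS = ['%B %d, %Y', '%m/%d/%Y']

def normalize(value):
    """Return (iso_date, description) for one start_date value.

    Valid dates come back as ('YYYY-MM-DD', ''); anything else is moved
    to the description and the date is left empty.
    """
    text = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat(), ''
        except ValueError:
            continue  # pattern did not match, try the next one
    return '', text

for raw in ['June 23, 1912', '5/11/1930', 'circa 1900']:
    print(normalize(raw))
# ('1912-06-23', '')
# ('1930-05-11', '')
# ('', 'circa 1900')
```

Inside a csv.DictReader loop, the two returned values could be assigned to row['start_date'] and row['start_date_description'] before writing the row out.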

Python: Split timestamp by date and hour

I have a list of timestamps in the following format:
1/1/2013 3:30
I began to learn python some weeks ago and I have no idea how to split the date and time. Can anyone of you help me?
The output should be one column including
1/1/2013
and one column including
3:30
I think that all you need is str.split ...
>>> s = '1/1/2013 3:30'
>>> s.split()
['1/1/2013', '3:30']
If it's in a list, you can do with a list-comprehension:
>>> lst = ['1/1/2013 3:30', '1/2/2013 3:30']
>>> [s.split() for s in lst]
[['1/1/2013', '3:30'], ['1/2/2013', '3:30']]
If you want to use this date and time further in your code, for example to compare dates, you can convert the timestamp to a datetime object. Refer to the documentation on the datetime module.
You can use the following code to convert your timestamp to datetime object.
>>> import datetime
>>> timestamp = datetime.datetime.strptime("1/1/2013 3:30", "%d/%m/%Y %H:%M")
>>> timestamp
datetime.datetime(2013, 1, 1, 3, 30)
>>> timestamp.date()
datetime.date(2013, 1, 1)
>>> timestamp.time()
datetime.time(3, 30)
If you just want to strip date and time to use them as strings, use method suggested by mgilson.
Here is pseudocode to accomplish what you had mentioned in your comment:
timestamp_column = 10
def get_updated_row(i, row):
    row = row.rstrip('\n').split(',')
    try:
        timestamp = row.pop(timestamp_column)  # remove column
        if i == 0:
            # header
            row.extend(["date", "time"])  # add columns
        else:
            # normal row: "1/1/2013 3:30" -> "1/1/2013" and "3:30"
            date, time = timestamp.split()
            row.extend([date, time])
    except (IndexError, ValueError):
        # row too short, or timestamp not in "date time" form
        print("ERROR: Unable to parse row {0}".format(i))
    return ','.join(row)
with open("path/to/file.csv", "r") as f:
    for i, row in enumerate(f):
        print(get_updated_row(i, row))  # write to file here instead if necessary
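Splitting lines on commas by hand breaks as soon as a field contains a quoted comma, so the csv module may be a safer basis. The sketch below assumes a column named timestamp and uses in-memory sample data in place of real files. The column names and sample rows are made up for illustration.

```python
import csv
import io

# In-memory stand-in for the input file (hypothetical data).
sample = io.StringIO("id,timestamp\n1,1/1/2013 3:30\n2,1/2/2013 14:05\n")
out = io.StringIO()

reader = csv.DictReader(sample)
writer = csv.DictWriter(out, fieldnames=['id', 'date', 'time'],
                        lineterminator='\n')
writer.writeheader()
for row in reader:
    # Replace the combined column with separate date and time columns.
    date_part, time_part = row.pop('timestamp').split()
    row['date'] = date_part
    row['time'] = time_part
    writer.writerow(row)

print(out.getvalue())
# id,date,time
# 1,1/1/2013,3:30
# 2,1/2/2013,14:05
```

With real files, the two StringIO objects would be replaced by open(..., newline='') handles for the input and output paths.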
