I have a dataset and only want to keep the rows inside a time range.
I put all the good rows in a Series object. But when I re-assign that object to the DataFrame object, I get NaT values:
code:
def get_tweets_from_range_in_csv():
    csvfile1 = "results_dataGOOGL050"
    df1 = temp(csvfile1)

def temp(csvfile):
    tweetdats = []
    d = pd.read_csv(csvfile + ".csv", encoding='latin-1')
    start = datetime.datetime.strptime("01-01-2018", "%d-%m-%Y")
    end = datetime.datetime.strptime("01-06-2018", "%d-%m-%Y")
    for index, current_tweet in d['Date'].iteritems():
        date_tw = datetime.datetime.strptime(current_tweet[:10], "%Y-%m-%d")
        if start <= date_tw <= end:
            tweetdats.append(date_tw)
        else:
            d.drop(index, inplace=True)
    d = d.drop("Likes", 1)
    d = d.drop("RTs", 1)
    d = d.drop("Sentiment", 1)
    d = d.drop("User", 1)
    d = d.drop("Followers", 1)
    df1['Date'] = pd.Series(tweetdats)
    return d
Output of tweetdats:
tweetdats
Out[340]:
[datetime.datetime(2018, 1, 30, 0, 0),
datetime.datetime(2018, 4, 1, 0, 0),
datetime.datetime(2018, 4, 1, 0, 0),
datetime.datetime(2018, 4, 1, 0, 0),
datetime.datetime(2018, 1, 5, 0, 0),
datetime.datetime(2018, 1, 5, 0, 0),
datetime.datetime(2018, 1, 8, 0, 0),
datetime.datetime(2018, 1, 20, 0, 0),
datetime.datetime(2018, 1, 22, 0, 0),
datetime.datetime(2018, 1, 5, 0, 0)]
You do not need to iterate through your dataframe with a for loop to select the rows inside the time range of interest.
Let us assume that your initial dataframe df has a 'Date' column containing the dates in datetime format; you can then simply create a new dataframe new_df:
new_df = df[(pd.to_datetime(df['Date']) > start) & (pd.to_datetime(df['Date']) < end)]
This way you do not have to copy and paste the "good" rows in a Series and then reassign them to a dataframe.
Your temp function would look like:
def temp(csvfile):
    df = pd.read_csv(csvfile + ".csv", encoding='latin-1')
    start = datetime.datetime.strptime("01-01-2018", "%d-%m-%Y")
    end = datetime.datetime.strptime("01-06-2018", "%d-%m-%Y")
    # Keep only the rows whose 'Date' falls strictly between start and end.
    dates = pd.to_datetime(df['Date'])
    new_df = df[(dates > start) & (dates < end)]
    return new_df
Hope this helps!
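As for why the NaT values appear in the first place: pandas aligns an assigned Series by index label, not by position. After rows are dropped, the frame's index has gaps, while a fresh pd.Series(tweetdats) is labelled 0..n-1, so some labels have no match. The sketch below illustrates this on a small made-up frame (the sample rows and the "Text" column are assumptions standing in for the CSV) and uses Series.between, which includes both endpoints by default, as one possible way to build the mask.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; only the 'Date' column matters here.
df = pd.DataFrame({
    "Date": ["2017-12-31 10:00:00", "2018-01-30 09:15:00",
             "2018-04-01 18:30:00", "2018-07-04 12:00:00"],
    "Text": ["a", "b", "c", "d"],
})

start = pd.Timestamp("2018-01-01")
end = pd.Timestamp("2018-06-01")

# Parse the first 10 characters (YYYY-MM-DD), as the question's loop does.
dates = pd.to_datetime(df["Date"].str[:10], format="%Y-%m-%d")

# Boolean mask of rows inside the range (both endpoints included).
mask = dates.between(start, end)
filtered = df.loc[mask].copy()

# This Series keeps the original index labels, so every row gets its date.
filtered["Date"] = dates[mask]

# A fresh Series is labelled 0..n-1; assignment aligns by label, so rows
# whose label is missing from the Series end up as NaT.
misaligned = filtered.copy()
misaligned["Date"] = pd.Series(list(dates[mask]))

print(filtered)
print(misaligned)
```

Here filtered keeps index labels 1 and 2, so the fresh Series (labels 0 and 1) only matches one of them and the other row becomes NaT.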
I have to write a function that counts the difference in days between two dates. If the number of days exceeds 30, the range should be split into n date ranges and saved in a list or dict. I have started on the function but cannot finish it. The function has to calculate the values dynamically.
For example
start_date = '2020-07-01'
end_date = '2020-09-15'
difference = (end_date - start_date).days
dateranges = []
dateranges.append(start_date)
if difference > 30:
    end_date = start_date + dt.timedelta(days=30)
    dateranges.append(end_date)
But I do not see how to make this loop, so that each pass takes a new start_date and end_date and calculates the difference between them. For example, I always add 30 days here, but sometimes fewer days have to be added.
If you want list of date ranges, between 2 dates, with maximum difference of 30 days, you can use timedelta to iterate over the range and split accordingly.
from datetime import datetime
from datetime import timedelta
def get_range(start_date, end_date, date_diff):
    start_date = datetime.strptime(start_date, "%Y-%m-%d")
    end_date = datetime.strptime(end_date, "%Y-%m-%d")
    if abs((end_date - start_date).days) <= date_diff:
        # Short span: one [start, end] pair, wrapped in a list so both branches return the same shape.
        return [[datetime.strftime(start_date, "%Y-%m-%d"), datetime.strftime(end_date, "%Y-%m-%d")]]
    else:
        result = []
        while True:
            d3 = start_date + timedelta(days=date_diff)
            if d3 >= end_date:
                # Last chunk: stop at end_date.
                result.append([datetime.strftime(start_date, "%Y-%m-%d"), datetime.strftime(end_date, "%Y-%m-%d")])
                break
            else:
                result.append([datetime.strftime(start_date, "%Y-%m-%d"), datetime.strftime(d3, "%Y-%m-%d")])
                # The next chunk starts the day after this one ends.
                start_date = d3 + timedelta(days=1)
        return result
print(get_range('2020-07-01', '2020-09-15',30))
Output
[['2020-07-01', '2020-07-31'], ['2020-08-01', '2020-08-31'], ['2020-09-01', '2020-09-15']]
Here I assume, by date range, you mean start and end date like [start,end].
Regarding "For example I always add 30 days here but it can be that fewer days have to be added": the datetime library's timedelta takes care of this, rolling over into the next month when needed.
from datetime import datetime
from datetime import timedelta
start_date = '2020-07-01'
end_date = '2020-09-15'
def date_difference(d1, d2):
    d1 = datetime.strptime(d1, "%Y-%m-%d")
    d2 = datetime.strptime(d2, "%Y-%m-%d")
    if abs((d2 - d1).days) > 30:
        dates = []
        # if you don't want to include start_date, use range(1, 30) instead.
        for i in range(0, 30):
            dates.append(d1 + timedelta(days=i))
        return dates
    return []
print(date_difference(start_date, end_date))
Output:
[datetime.datetime(2020, 7, 1, 0, 0), datetime.datetime(2020, 7, 2, 0, 0),
datetime.datetime(2020, 7, 3, 0, 0), datetime.datetime(2020, 7, 4, 0, 0),
datetime.datetime(2020, 7, 5, 0, 0), datetime.datetime(2020, 7, 6, 0, 0),
datetime.datetime(2020, 7, 7, 0, 0), datetime.datetime(2020, 7, 8, 0, 0),
datetime.datetime(2020, 7, 9, 0, 0), datetime.datetime(2020, 7, 10, 0, 0),
...
datetime.datetime(2020, 7, 30, 0, 0)]
from datetime import datetime, timedelta

def split_dates(prev_date, next_date, interval):
    # Collect the boundary dates, stepping `interval` days at a time.
    date_ranges = []
    if next_date > prev_date + timedelta(days=interval):
        while prev_date < next_date:
            # Never step past next_date.
            prev_date = min(prev_date + timedelta(days=interval), next_date)
            date_ranges.append(prev_date)
    return date_ranges
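The splitting logic from the answers above can also be written as a small generator over date objects. This is only a sketch: the name split_range and the max_days parameter are made up for illustration, and it assumes each chunk should start the day after the previous one ends, as in the first answer.

```python
from datetime import date, timedelta

def split_range(start, end, max_days=30):
    """Yield (chunk_start, chunk_end) pairs covering start..end inclusive.

    The two endpoints of a chunk are at most max_days days apart.
    """
    step = timedelta(days=max_days)
    one_day = timedelta(days=1)
    chunk_start = start
    while chunk_start <= end:
        # Clamp the last chunk so it never runs past the overall end date.
        chunk_end = min(chunk_start + step, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + one_day

chunks = list(split_range(date(2020, 7, 1), date(2020, 9, 15)))
for chunk_start, chunk_end in chunks:
    print(chunk_start.isoformat(), chunk_end.isoformat())
```

Because the last chunk is clamped with min(), the "fewer than 30 days" case needs no special handling, and a span shorter than max_days simply yields one pair.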
I have a few files that have a randomly generated number that corresponds with a date:
736815 = 01/05/2018
I need to create a function or process that applies logic to sequential numbers so that the next number equals the next calendar date.
Ideally I would need it in a key:value format, so that when I convert the file to a new format I can apply the date in place of the auto-generated file name.
Hopefully this makes sense; it is to be used to name a converted file.
I think the origin parameter can be used here; also pass unit='D' to to_datetime:
df = pd.DataFrame({'col':[5678, 5679, 5680]})
df['date'] = pd.to_datetime(df['col'] - 5678, unit='D', origin=pd.Timestamp('2020-01-01'))
print(df)
    col       date
0  5678 2020-01-01
1  5679 2020-01-02
2  5680 2020-01-03
A non-pandas solution in pure Python, using the same idea:
from datetime import datetime, timedelta
L = [5678, 5679, 5680]
a = [timedelta(x-5678) + datetime(2020,1,1) for x in L]
print (a)
[datetime.datetime(2020, 1, 1, 0, 0),
datetime.datetime(2020, 1, 2, 0, 0),
datetime.datetime(2020, 1, 3, 0, 0)]
The number doesn't need to translate into the date directly in any way. You just need to pick a start date and a number, and add another number either via simple addition or via a timedelta:
from datetime import date, timedelta
from random import randint
start_date = date.today()
start_int = randint(1000, 10000)
for i in range(10):
    print(start_int + i, start_date + timedelta(days=i))
6964 2020-01-29
6965 2020-01-30
6966 2020-01-31
6967 2020-02-01
6968 2020-02-02
6969 2020-02-03
6970 2020-02-04
6971 2020-02-05
6972 2020-02-06
6973 2020-02-07
If you're getting your list of numbers from somewhere else, add/subtract appropriately from a start int/date for the same effect.
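One thing that may be worth checking before picking an arbitrary start point: the example pair in the question, 736815 = 01/05/2018, matches Python's proleptic Gregorian ordinal if the date is read as day/month/year (1 May 2018). That the files actually use this numbering is an assumption, not something the question states, so verify it against a few more known pairs. If it holds, date.fromordinal does the conversion directly; the helper name number_to_date below is made up for illustration.

```python
from datetime import date

def number_to_date(number):
    # Assumes the number is a proleptic Gregorian ordinal (day 1 = 0001-01-01).
    return date.fromordinal(number)

# Key:value mapping from number to a dd/mm/YYYY string for a few sequential numbers.
mapping = {n: number_to_date(n).strftime("%d/%m/%Y") for n in range(736815, 736818)}
print(mapping)
```

The inverse direction is date.toordinal(), which is handy for checking other number/date pairs from the files.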
Another solution is to create an object, which encapsulates the date and the base number to count from. Each call to this object (implemented using the __call__ special method) will create a new date object using the time delta between the base number and the supplied number.
import datetime
class RelativeDate:
    def __init__(self, date, base):
        self.date = date
        self.base = base

    def __call__(self, number):
        delta = datetime.timedelta(days=number - self.base)
        return self.date + delta

def create_base_date(number, date):
    return RelativeDate(
        date=datetime.datetime.strptime(date, '%d/%m/%Y'),
        base=number,
    )

base_date = create_base_date(1, '03/01/2020')
base_date(3)
datetime.datetime(2020, 1, 5, 0, 0)
Example snippet:
base_date = create_base_date(1, '03/01/2020')
{i: base_date(i) for i in range(1, 10)}
Output:
{1: datetime.datetime(2020, 1, 3, 0, 0),
2: datetime.datetime(2020, 1, 4, 0, 0),
3: datetime.datetime(2020, 1, 5, 0, 0),
4: datetime.datetime(2020, 1, 6, 0, 0),
5: datetime.datetime(2020, 1, 7, 0, 0),
6: datetime.datetime(2020, 1, 8, 0, 0),
7: datetime.datetime(2020, 1, 9, 0, 0),
8: datetime.datetime(2020, 1, 10, 0, 0),
9: datetime.datetime(2020, 1, 11, 0, 0)}
I am trying to plot data from a temperature sensor against time. The time steps are in "hh:mm:ss" format, converted from strings to datetime objects. The first value in the list is "21:47:22" and the last one is "06:12:22" the next day. I have been trying to plot these values in list order, but Python automatically sorts them from "00:00:00" to "24:00:00" on the x axis.
Could you please advise how to solve this issue? My code is below:
import matplotlib.pyplot as plt
import datetime
data = []
sensor1 = []
sensor2 = []
time = []
with open("output.txt","r") as f:
    data = f.readlines()
first_sensor_len = len(data[0])
for var in data:
    if var[2:7] == "First" and len(var) == first_sensor_len:
        sensor1.append(var[28:33])
        sensor2.append(var[75:80])
        time.append(datetime.datetime.strptime(var[36:44], "%H:%M:%S"))
    elif var[2:8] == "Second" and len(var) == first_sensor_len:
        sensor2.append(var[29:34])
        sensor1.append(var[75:80])
        time.append(datetime.datetime.strptime(var[83:91], "%H:%M:%S"))
plt.plot(time, sensor1)
plt.show()
Suppose time looks like this:
timestr = ["21:47:22", "22:12:22", "23:12:22", "00:12:22", "01:12:22", "03:12:22", "06:12:22"]
time = [datetime.datetime.strptime(ts, "%H:%M:%S") for ts in timestr]
time
[datetime.datetime(1900, 1, 1, 21, 47, 22),
datetime.datetime(1900, 1, 1, 22, 12, 22),
datetime.datetime(1900, 1, 1, 23, 12, 22),
datetime.datetime(1900, 1, 1, 0, 12, 22),
datetime.datetime(1900, 1, 1, 1, 12, 22),
datetime.datetime(1900, 1, 1, 3, 12, 22),
datetime.datetime(1900, 1, 1, 6, 12, 22)]
You can use np.diff from numpy to mark every first time of a new day. If the difference of two consecutive time values is negative, there was midnight in between.
(An initial False is prepended to this boolean array, because the first time value never has a day offset and the result of np.diff is always one entry shorter than its input.)
import numpy as np
newday_marker = np.append(False, np.diff(time) < datetime.timedelta(0))
newday_marker
array([False, False, False, True, False, False, False], dtype=bool)
With np.cumsum this array can be transformed into an array of day offsets, one for each time value.
day_offset = np.cumsum(newday_marker)
day_offset
array([0, 0, 0, 1, 1, 1, 1], dtype=int32)
In the end this has to be converted to timedeltas and then can be added to the original list of time values:
date_offset = [datetime.timedelta(int(dt)) for dt in day_offset]
dtime = [t + dos for t, dos in zip(time, date_offset)]
dtime
[datetime.datetime(1900, 1, 1, 21, 47, 22),
datetime.datetime(1900, 1, 1, 22, 12, 22),
datetime.datetime(1900, 1, 1, 23, 12, 22),
datetime.datetime(1900, 1, 2, 0, 12, 22),
datetime.datetime(1900, 1, 2, 1, 12, 22),
datetime.datetime(1900, 1, 2, 3, 12, 22),
datetime.datetime(1900, 1, 2, 6, 12, 22)]
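The same idea can be sketched without numpy as a single pass over the values. This version assumes the readings are in chronological order and that consecutive readings are less than 24 hours apart, so any backwards jump in clock time means midnight was crossed. The function name unwrap_times is made up for illustration.

```python
from datetime import datetime, timedelta

def unwrap_times(time_strings, fmt="%H:%M:%S"):
    """Parse clock times and add one day each time the clock goes backwards."""
    result = []
    day_offset = 0
    previous = None
    for text in time_strings:
        current = datetime.strptime(text, fmt)
        # A smaller clock time than the previous reading means a new day started.
        if previous is not None and current < previous:
            day_offset += 1
        previous = current
        result.append(current + timedelta(days=day_offset))
    return result

timestr = ["21:47:22", "22:12:22", "23:12:22", "00:12:22",
           "01:12:22", "03:12:22", "06:12:22"]
dtime = unwrap_times(timestr)
print(dtime[0], dtime[-1])
```

The resulting list is strictly increasing, so passing it to plt.plot in place of the original time list should keep the points in recording order on the x axis.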
As the title says, I'm trying to generate a list of datetimes corresponding to the occurrences of a specific day of the month between two dates.
So given a start date, an end date, and a day of the month, I want to see every occurrence of that day of the month:
from datetime import datetime
end_date = datetime(2012, 9, 15, 0, 0)
start_date = datetime(2012, 6, 1, 0, 0)
day_of_month = 16
dates = "magic code goes here"
dates would then hold an array as such:
dates == [
datetime(2012, 6, 16, 0, 0),
datetime(2012, 7, 16, 0, 0),
datetime(2012, 8, 16, 0, 0)
]
The issue I'm running into is the number of checks I have to perform. First I have to check if it's the start year, if so, then I have to start at the beginning month, but if the day of the month is before the start date, then I have to skip that month. This same thing applies for the end of the period. Not to mention I have to check if the period starts and ends in the same year. All in all it's turning into quite a mess of nested if and for statements.
Here is my solution:
import numpy as np
for year in np.arange(start_date.year, end_date.year + 1):
    for month in np.arange(1, 13):
        date = datetime(year, month, day_of_month, 0, 0)
        if start_date < date < end_date:
            dates.append(date)
Is there a more Pythonic way to accomplish this?
Here's a quick and dirty (but reasonably efficient) solution:
import datetime
d = start_date
days = []
while d <= end_date:  # Change to < if you do not want the end_date
    if d.day == day_of_month:
        days.append(d)
    d += datetime.timedelta(1)
days
# [datetime.datetime(2012, 6, 16, 0, 0),
# datetime.datetime(2012, 7, 16, 0, 0),
# datetime.datetime(2012, 8, 16, 0, 0)]
Ideally, you want to use pandas for this.
This is a succinct, but not efficient, way using pandas.date_range.
from datetime import datetime
import pandas as pd
end_date = datetime(2012, 9, 15, 0, 0)
start_date = datetime(2012, 6, 1, 0, 0)
day_of_month = 16
rng = [i.to_pydatetime() for i in pd.date_range(start_date, end_date, freq='1D') if i.day == day_of_month]
# [datetime.datetime(2012, 6, 16, 0, 0),
# datetime.datetime(2012, 7, 16, 0, 0),
# datetime.datetime(2012, 8, 16, 0, 0)]
Here is a more efficient method using a generator for the date range, which does not rely on pandas:
from datetime import timedelta

def daterange(start_date, end_date):
    # Yield each day from start_date up to, but not including, end_date.
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)

rng = [i for i in daterange(start_date, end_date) if i.day == day_of_month]
# [datetime.datetime(2012, 6, 16, 0, 0),
# datetime.datetime(2012, 7, 16, 0, 0),
# datetime.datetime(2012, 8, 16, 0, 0)]
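All of the approaches above visit every day in the range. A possible alternative, sketched below, is to step month by month and build the candidate date directly; months that lack the requested day (for example day 31 in June) raise ValueError and are skipped. The function name monthly_occurrences is made up for illustration, and the bounds are treated as inclusive here, which is an assumption.

```python
from datetime import datetime

def monthly_occurrences(start, end, day_of_month):
    """Return every date with the given day of month between start and end (inclusive)."""
    dates = []
    year, month = start.year, start.month
    # Tuples compare element by element, so this walks (year, month) in order.
    while (year, month) <= (end.year, end.month):
        try:
            candidate = datetime(year, month, day_of_month)
        except ValueError:
            candidate = None  # this month has no such day
        if candidate is not None and start <= candidate <= end:
            dates.append(candidate)
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return dates

start_date = datetime(2012, 6, 1, 0, 0)
end_date = datetime(2012, 9, 15, 0, 0)
dates = monthly_occurrences(start_date, end_date, 16)
print(dates)
```

The range check on each candidate handles the first and last month, so no special cases are needed for the start year, the end year, or a period within a single year.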
I have a situation where I need to get the third latest date, i.e.
INPUT :
['14-04-2001', '29-12-2061', '21-10-2019',
'07-01-1973', '19-07-2014','11-03-1992','21-10-2019']
Also, INPUT
6
14-04-2001
29-12-2061
21-10-2019
07-01-1973
19-07-2014
11-03-1992
OUTPUT : 19-07-2014
import datetime
datelist = ['14-04-2001', '29-12-2061', '21-10-2019', '07-01-1973', '19-07-2014','11-03-1992','21-10-2019' ]
for d in datelist:
    x = datetime.datetime.strptime(d,'%d-%m-%Y')
    print(x)
How can I achieve this?
You can sort the list and take the 3rd element from it.
my_list = [datetime.datetime.strptime(d,'%d-%m-%Y') for d in datelist]
# [datetime.datetime(2001, 4, 14, 0, 0), datetime.datetime(2061, 12, 29, 0, 0), datetime.datetime(2019, 10, 21, 0, 0), datetime.datetime(1973, 1, 7, 0, 0), datetime.datetime(2014, 7, 19, 0, 0), datetime.datetime(1992, 3, 11, 0, 0), datetime.datetime(2019, 10, 21, 0, 0)]
my_list.sort(reverse=True)
my_list[2]
# datetime.datetime(2019, 10, 21, 0, 0)
Also, as per Kerorin's suggestion, if you don't need to sort in-place and just need the 3rd element always, you can simply do
sorted(my_list, reverse=True)[2]
Update
To remove the duplicates, taking inspiration from this answer, you can do the following -
import datetime
datelist = ['14-04-2001', '29-12-2061', '21-10-2019', '07-01-1973', '19-07-2014', '11-03-1992', '21-10-2019']
seen = set()
my_list = [datetime.datetime.strptime(d,'%d-%m-%Y')
           for d in datelist
           if d not in seen and not seen.add(d)]
my_list.sort(reverse=True)
my_list[2]
# datetime.datetime(2014, 7, 19, 0, 0)
You can use heapq.nlargest to do this.
import heapq
from datetime import datetime
datelist = [
'14-04-2001',
'29-12-2061',
'21-10-2019',
'07-01-1973',
'19-07-2014',
'11-03-1992',
'21-10-2019'
]
heapq.nlargest(3, {datetime.strptime(d, "%d-%m-%Y") for d in datelist})[-1]
This returns datetime.datetime(2014, 7, 19, 0, 0)
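To get the result back in the input's dd-mm-YYYY format, and to fail clearly when there are fewer than three distinct dates, the pieces above could be combined roughly as follows. This is a sketch; the function name nth_latest and its parameters are made up for illustration.

```python
from datetime import datetime

def nth_latest(date_strings, n=3, fmt="%d-%m-%Y"):
    """Return the n-th latest distinct date, formatted like the input."""
    # A set removes duplicate dates before sorting from newest to oldest.
    unique = sorted({datetime.strptime(s, fmt) for s in date_strings}, reverse=True)
    if len(unique) < n:
        raise ValueError("need at least %d distinct dates, got %d" % (n, len(unique)))
    return unique[n - 1].strftime(fmt)

datelist = ['14-04-2001', '29-12-2061', '21-10-2019', '07-01-1973',
            '19-07-2014', '11-03-1992', '21-10-2019']
print(nth_latest(datelist))
```

For the sample list this prints 19-07-2014, the output requested in the question.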