I am trying to plot data from a temperature sensor against time steps. The time steps are in "hh:mm:ss" format after conversion from string to datetime. The first value in the list is "21:47:22" and the last one is "06:12:22" the next day. I have been trying to plot these values in list order, but matplotlib automatically sorts the x axis from "00:00:00" to "24:00:00". Here is the image.
Could you please advise how to solve this issue? Below is my code:
import matplotlib.pyplot as plt
import datetime

data = []
sensor1 = []
sensor2 = []
time = []

with open("output.txt", "r") as f:
    data = f.readlines()

first_sensor_len = len(data[0])
for var in data:
    if var[2:7] == "First" and len(var) == first_sensor_len:
        sensor1.append(var[28:33])
        sensor2.append(var[75:80])
        time.append(datetime.datetime.strptime(var[36:44], "%H:%M:%S"))
    elif var[2:8] == "Second" and len(var) == first_sensor_len:
        sensor2.append(var[29:34])
        sensor1.append(var[75:80])
        time.append(datetime.datetime.strptime(var[83:91], "%H:%M:%S"))

plt.plot(time, sensor1)
plt.show()
Suppose the parsed time list looks like this:
timestr = ["21:47:22", "22:12:22", "23:12:22", "00:12:22", "01:12:22", "03:12:22", "06:12:22"]
time = [datetime.datetime.strptime(ts, "%H:%M:%S") for ts in timestr]
time
[datetime.datetime(1900, 1, 1, 21, 47, 22),
datetime.datetime(1900, 1, 1, 22, 12, 22),
datetime.datetime(1900, 1, 1, 23, 12, 22),
datetime.datetime(1900, 1, 1, 0, 12, 22),
datetime.datetime(1900, 1, 1, 1, 12, 22),
datetime.datetime(1900, 1, 1, 3, 12, 22),
datetime.datetime(1900, 1, 1, 6, 12, 22)]
You can use np.diff from NumPy to mark the first time value of each new day: if the difference of two consecutive time values is negative, midnight passed in between.
(The boolean array is prepended with one initial False, since the first time value never has a day offset; the result of np.diff is one entry shorter than its input.)
import numpy as np
newday_marker = np.append(False, np.diff(time) < datetime.timedelta(0))
newday_marker
array([False, False, False, True, False, False, False], dtype=bool)
With np.cumsum this array can be transformed into the array of dayoffsets for each time value.
day_offset = np.cumsum(newday_marker)
day_offset
array([0, 0, 0, 1, 1, 1, 1], dtype=int32)
In the end this has to be converted to timedeltas and then can be added to the original list of time values:
date_offset = [datetime.timedelta(int(dt)) for dt in day_offset]
dtime = [t + dos for t, dos in zip(time, date_offset)]
dtime
[datetime.datetime(1900, 1, 1, 21, 47, 22),
datetime.datetime(1900, 1, 1, 22, 12, 22),
datetime.datetime(1900, 1, 1, 23, 12, 22),
datetime.datetime(1900, 1, 2, 0, 12, 22),
datetime.datetime(1900, 1, 2, 1, 12, 22),
datetime.datetime(1900, 1, 2, 3, 12, 22),
datetime.datetime(1900, 1, 2, 6, 12, 22)]
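For completeness, the same rollover logic can be written without NumPy. A plain-Python sketch (the function name add_day_offsets is just for illustration):

```python
import datetime

def add_day_offsets(times):
    # Whenever the clock value "goes backwards", midnight passed in between,
    # so grow the running day offset by one and apply it to every later value.
    out = []
    offset = datetime.timedelta(0)
    prev = None
    for cur in times:
        if prev is not None and cur < prev:
            offset += datetime.timedelta(days=1)
        out.append(cur + offset)
        prev = cur
    return out
```

Applied to the time list above, the entries after midnight move to January 2nd, so matplotlib plots them in chronological order.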
Given today's date, what is an efficient way to retrieve the first and last date of each of the previous 3 months (i.e. '3/1/2020' and '3/31/2020'; '2/1/2020' and '2/29/2020'; '1/1/2020' and '1/31/2020')?
EDIT
For the previous month's first and last day, the following code works as expected. But I am not sure how to retrieve the first and last dates of the 2nd and 3rd previous months.
from datetime import date, timedelta

last_day_of_prev_month = date.today().replace(day=1) - timedelta(days=1)
start_day_of_prev_month = (date.today().replace(day=1)
                           - timedelta(days=last_day_of_prev_month.day))
# For printing results
print("First day of prev month:", start_day_of_prev_month)
print("Last day of prev month:", last_day_of_prev_month)
You may:
get the 3 previous months,
then build each range from day 1 to the month's last day (via calendar.monthrange).
from calendar import monthrange
from datetime import datetime

def before_month(month):
    v = [9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    return v[month:month + 3]

dd = datetime(2020, 4, 7)
dates = [[dd.replace(month=month, day=1),
          dd.replace(month=month, day=monthrange(dd.year, month)[1])]
         for month in before_month(dd.month)]
print(dates)
# [[datetime.datetime(2020, 1, 1, 0, 0), datetime.datetime(2020, 1, 31, 0, 0)],
# [datetime.datetime(2020, 2, 1, 0, 0), datetime.datetime(2020, 2, 29, 0, 0)],
# [datetime.datetime(2020, 3, 1, 0, 0), datetime.datetime(2020, 3, 31, 0, 0)]]
I did not find a nicer way to get the 3 previous months, but sometimes the easiest way is the one to use.
You can loop over the 3 previous months; just update the date to the first day of the current month at the end of every iteration:
from datetime import date, timedelta

d = date.today()
date_array = []
date_string_array = []
for month in range(1, 4):
    first_day_of_month = d.replace(day=1)
    last_day_of_previous_month = first_day_of_month - timedelta(days=1)
    first_day_of_previous_month = last_day_of_previous_month.replace(day=1)
    date_array.append((first_day_of_previous_month, last_day_of_previous_month))
    date_string_array.append((first_day_of_previous_month.strftime("%m/%d/%Y"),
                              last_day_of_previous_month.strftime("%m/%d/%Y")))
    d = first_day_of_previous_month

print(date_array)
print(date_string_array)
Results:
[(datetime.date(2020, 3, 1), datetime.date(2020, 3, 31)), (datetime.date(2020, 2, 1), datetime.date(2020, 2, 29)), (datetime.date(2020, 1, 1), datetime.date(2020, 1, 31))]
[('03/01/2020', '03/31/2020'), ('02/01/2020', '02/29/2020'), ('01/01/2020', '01/31/2020')]
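For reference, the month-walking idea can be packaged as a small reusable helper. This is a sketch; the name previous_months is just for illustration:

```python
from datetime import date, timedelta

def previous_months(today, n=3):
    """Return (first_day, last_day) pairs for the n months before `today`."""
    first = today.replace(day=1)
    result = []
    for _ in range(n):
        last_prev = first - timedelta(days=1)   # step back across the month boundary
        result.append((last_prev.replace(day=1), last_prev))
        first = last_prev.replace(day=1)
    return result
```

Stepping back one day from the first of the month always lands on the previous month's last day, so leap years and year boundaries are handled automatically.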
I have a dataset and only want to keep the rows inside a time range.
I put all the good rows in a Series object, but when I re-assign that object to the DataFrame, I get NaT values:
code:
def get_tweets_from_range_in_csv():
    csvfile1 = "results_dataGOOGL050"
    df1 = temp(csvfile1)

def temp(csvfile):
    tweetdats = []
    d = pd.read_csv(csvfile + ".csv", encoding='latin-1')
    start = datetime.datetime.strptime("01-01-2018", "%d-%m-%Y")
    end = datetime.datetime.strptime("01-06-2018", "%d-%m-%Y")
    for index, current_tweet in d['Date'].iteritems():
        date_tw = datetime.datetime.strptime(current_tweet[:10], "%Y-%m-%d")
        if start <= date_tw <= end:
            tweetdats.append(date_tw)
        else:
            d.drop(index, inplace=True)
    d = d.drop("Likes", 1)
    d = d.drop("RTs", 1)
    d = d.drop("Sentiment", 1)
    d = d.drop("User", 1)
    d = d.drop("Followers", 1)
    df1['Date'] = pd.Series(tweetdats)
    return d
Output of tweetdats:
tweetdats
Out[340]:
[datetime.datetime(2018, 1, 30, 0, 0),
datetime.datetime(2018, 4, 1, 0, 0),
datetime.datetime(2018, 4, 1, 0, 0),
datetime.datetime(2018, 4, 1, 0, 0),
datetime.datetime(2018, 1, 5, 0, 0),
datetime.datetime(2018, 1, 5, 0, 0),
datetime.datetime(2018, 1, 8, 0, 0),
datetime.datetime(2018, 1, 20, 0, 0),
datetime.datetime(2018, 1, 22, 0, 0),
datetime.datetime(2018, 1, 5, 0, 0)]
You do not need to iterate through your dataframe with a for loop to select the rows inside the time range of interest.
Let us assume that your initial dataframe df has a 'Date' column containing the dates in datetime format; you can then simply create a new dataframe new_df:
new_df = df[(pd.to_datetime(df.Date) > start) & (pd.to_datetime(df.Date) < end)]
This way you do not have to copy and paste the "good" rows in a Series and then reassign them to a dataframe.
Your temp function would look like:
def temp(csvfile):
    df = pd.read_csv(csvfile + ".csv", encoding='latin-1')
    start = datetime.datetime.strptime("01-01-2018", "%d-%m-%Y")
    end = datetime.datetime.strptime("01-06-2018", "%d-%m-%Y")
    new_df = df[(pd.to_datetime(df.Date) > start) & (pd.to_datetime(df.Date) < end)]
    return new_df
Hope this helps!
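To make the boolean-mask approach concrete, here is a small self-contained sketch with a toy dataframe standing in for the CSV (the column names and sample values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-12-30", "2018-02-14", "2018-05-20", "2018-07-01"],
    "Text": ["a", "b", "c", "d"],
})
start = pd.Timestamp("2018-01-01")
end = pd.Timestamp("2018-06-01")

# Build one boolean mask and select the rows in a single step -- no loop, no drop()
mask = (pd.to_datetime(df["Date"]) > start) & (pd.to_datetime(df["Date"]) < end)
new_df = df[mask]
print(new_df)  # keeps only the February and May rows
```

Because the mask selects rows from the original dataframe, the surviving rows keep all their columns and there is no index misalignment to produce NaT values.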
I have a situation where I need to get the third latest date, i.e.:
INPUT :
['14-04-2001', '29-12-2061', '21-10-2019',
'07-01-1973', '19-07-2014','11-03-1992','21-10-2019']
Alternatively, the input may be given as:
6
14-04-2001
29-12-2061
21-10-2019
07-01-1973
19-07-2014
11-03-1992
OUTPUT : 19-07-2014
import datetime

datelist = ['14-04-2001', '29-12-2061', '21-10-2019', '07-01-1973', '19-07-2014', '11-03-1992', '21-10-2019']
for d in datelist:
    x = datetime.datetime.strptime(d, '%d-%m-%Y')
    print(x)
How can I achieve this?
You can sort the list and take the 3rd element from it.
my_list = [datetime.datetime.strptime(d, '%d-%m-%Y') for d in datelist]
# [datetime.datetime(2001, 4, 14, 0, 0), datetime.datetime(2061, 12, 29, 0, 0), datetime.datetime(2019, 10, 21, 0, 0), datetime.datetime(1973, 1, 7, 0, 0), datetime.datetime(2014, 7, 19, 0, 0), datetime.datetime(1992, 3, 11, 0, 0), datetime.datetime(2019, 10, 21, 0, 0)]
my_list.sort(reverse=True)
my_list[2]
# datetime.datetime(2019, 10, 21, 0, 0)
Also, as per Kerorin's suggestion, if you don't need to sort in-place and just need the 3rd element always, you can simply do
sorted(my_list, reverse=True)[2]
Update
To remove the duplicates, taking inspiration from this answer, you can do the following -
import datetime
datelist = ['14-04-2001', '29-12-2061', '21-10-2019', '07-01-1973', '19-07-2014', '11-03-1992', '21-10-2019']
seen = set()
my_list = [datetime.datetime.strptime(d, '%d-%m-%Y')
           for d in datelist
           if d not in seen and not seen.add(d)]
my_list.sort(reverse=True)
You can use heapq.nlargest to do this.
import heapq
from datetime import datetime
datelist = [
'14-04-2001',
'29-12-2061',
'21-10-2019',
'07-01-1973',
'19-07-2014',
'11-03-1992',
'21-10-2019'
]
heapq.nlargest(3, {datetime.strptime(d, "%d-%m-%Y") for d in datelist})[-1]
This returns datetime.datetime(2014, 7, 19, 0, 0).
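An equivalent without heapq, for comparison: deduplicate with a set, sort descending, and index.

```python
from datetime import datetime

datelist = ['14-04-2001', '29-12-2061', '21-10-2019',
            '07-01-1973', '19-07-2014', '11-03-1992', '21-10-2019']

# A set comprehension removes the duplicate '21-10-2019' before sorting
unique = {datetime.strptime(d, "%d-%m-%Y") for d in datelist}
third_latest = sorted(unique, reverse=True)[2]
print(third_latest.strftime("%d-%m-%Y"))  # prints 19-07-2014
```

heapq.nlargest avoids sorting the whole collection, which only starts to matter for large inputs; for short lists like this, either approach is fine.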
I have two lists. One list, named 'data', has dates in it which are related to persons' birth dates.
data = [ datetime.datetime(1958, 3, 15, 0, 0), datetime.datetime(1958, 9, 15, 0, 0), datetime.datetime(1930, 10, 23, 0, 0), datetime.datetime(1928, 9, 15, 0, 0), datetime.datetime(1928, 1, 23, 0, 0), datetime.datetime(1925, 11, 15, 0, 0), datetime.datetime(1962, 7, 20, 0, 0),datetime.datetime(1960, 12, 14, 0, 0), datetime.datetime(1960, 5, 10, 0, 0),datetime.datetime(1963, 9, 7, 0, 0), datetime.datetime(1956, 3, 10, 0, 0), datetime.datetime(1955, 2, 15, 0, 0),datetime.datetime(1958, 11, 14, 0, 0),datetime.datetime(1956, 8, 24, 0, 0),datetime.datetime(1990, 4, 30, 0, 0)]
Now next list contains marriage dates.
marriage = [ datetime.datetime(1985, 5, 14, 0, 0),datetime.datetime(1945, 6, 15, 0, 0), datetime.datetime(1938, 6, 11, 0, 0), datetime.datetime(1995, 4, 5, 0, 0), datetime.datetime(1987, 2, 26, 0, 0), datetime.datetime(1983, 12, 13, 0, 0), datetime.datetime(1980, 9, 16, 0, 0), datetime.datetime(2011, 6, 19, 0, 0)]
Each date from the 'marriage' list is related to 2 dates from the 'data' list. Now, I want to compare one date from the marriage list to two dates from the data list, so that I can print "birth date is less than marriage date".
How can I accomplish this task using a loop? I'm confused by this one.
Please note that I used import datetime and import re to accomplish the date comparison.
for i in range(len(data)):
    if data[i] < marriage[i]:
        print("birthdate is lt marriage date")
    else:
        print("birthdate is gt or eq to marriage date")
I'm not sure what you are trying to accomplish here... Also you don't need re for date comparison, you can use normal < > == <= >= operators.
This also sounds like a job for a hash(dictionary)...
marriage = {
    'marriage1': {
        '1': <birthday>,
        '2': <birthday>,
        'marriage-date': <marriage-date>
    },
    'marriage2': {
        '1': <birthday>,
        '2': <birthday>,
        'marriage-date': <marriage-date>
    }
}
A hash(dictionary) will make comparisons much easier with lists that don't contain the same number of values.
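A concrete version of that sketch might look like this (the key names are illustrative, and the sample dates are taken from the question's lists):

```python
import datetime

marriages = {
    "marriage1": {
        "birthdays": [datetime.datetime(1958, 3, 15), datetime.datetime(1958, 9, 15)],
        "marriage-date": datetime.datetime(1985, 5, 14),
    },
    "marriage2": {
        "birthdays": [datetime.datetime(1930, 10, 23), datetime.datetime(1928, 9, 15)],
        "marriage-date": datetime.datetime(1945, 6, 15),
    },
}

# Plain comparison operators work directly on datetime objects -- no re needed
for name, info in marriages.items():
    if all(b < info["marriage-date"] for b in info["birthdays"]):
        print(name, "- both birth dates precede the marriage date")
```

Grouping each marriage with its two birth dates this way removes the need to keep two parallel lists aligned by index.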
This assumes that the marriage and birth dates are in the same order (i.e., the first two birth dates correspond to the first marriage date and the next 2 birth dates correspond to the second marriage date)
for i in range(len(marriage)):
    if marriage[i] > data[i*2] and marriage[i] > data[(i*2)+1]:
        print("Both birthdates less than marriage date")
I believe my assumption is correct because there are twice as many entries in the data list as there are in the marriage list.
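Under that same ordering assumption, slicing plus zip avoids the index arithmetic. A sketch using the first few dates from the question:

```python
import datetime

births = [datetime.datetime(1958, 3, 15), datetime.datetime(1958, 9, 15),
          datetime.datetime(1930, 10, 23), datetime.datetime(1928, 9, 15)]
weddings = [datetime.datetime(1985, 5, 14), datetime.datetime(1945, 6, 15)]

# births[0::2] / births[1::2] pick out each couple's two birth dates
for wed, b1, b2 in zip(weddings, births[0::2], births[1::2]):
    if wed > b1 and wed > b2:
        print("Both birthdates less than marriage date")
```

zip also stops at the shortest input, so a trailing unpaired birth date cannot raise an IndexError the way manual indexing can.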
Technically, I want to convert log data into time-series frequency counts in Spark. I've searched a lot, but didn't find a good way to deal with big data.
I know a pd.DataFrame can get counts for some feature, but my dataset is too big for a DataFrame,
which means I need to process each line with MapReduce.
And what I've tried is probably stupid....
I have a RDD, whose lines are lists of tuples, which looks like:
[(datetime.datetime(2015, 9, 1, 0, 4, 12), 1), (datetime.datetime(2015, 9, 2, 0, 4, 12), 1), (datetime.datetime(2015, 4, 1, 0, 4, 12), 1), (datetime.datetime(2015, 9, 1, 0, 4, 12), 1)]
[(datetime.datetime(2015, 10, 1, 0, 4, 12), 1),(datetime.datetime(2015, 7, 1, 0, 4, 12), 1)]
In each tuple, the first element is a date.
Can I write a map function in Spark with Python that fills a 3-d array with the counts of tuples sharing the same (month, day, hour), using the date's (month, day, hour) as the (x, y, z) coordinates?
here is what I've done:
def write_array(input_rdd, array):
    for item in input_rdd:
        requestTime = item[0]
        array[requestTime.month - 1, requestTime.day - 1, requestTime.hour] += 1

array_to_fill = np.zeros([12, 31, 24], dtype=np.int)
filled_array = RDD_to_fill.map(lambda s: write_array(s, array_to_fill)).collect()
with open("output.txt", 'w') as output:
    json.dump(traffic, output)
And the error is:
Traceback (most recent call last):
  File "traffic_count.py", line 67, in <module>
    main()
  File "traffic_count.py", line 58, in main
    traffic = organic_userList.Map(lambda s: write_array(s, traffic_array)) \
AttributeError: 'PipelinedRDD' object has no attribute 'Map'
I thought there must be some way to save the elements in each line of the RDD into an existing data structure..... Can someone help me?
Many Thanks!
If you can have the output data be a list of ((month, day, hour), count) values, the following should work:
from pyspark import SparkConf, SparkContext
import datetime
conf = SparkConf().setMaster("local[*]").setAppName("WriteDates")
sc = SparkContext(conf = conf)
RDD_to_fill = sc.parallelize([(datetime.datetime(2015, 9, 1, 0, 4, 12), 1),(datetime.datetime(2015, 9, 2, 0, 4, 12), 1),(datetime.datetime(2015, 4, 1, 0, 4, 12), 1),(datetime.datetime(2015, 9, 1, 0, 4, 12),1), (datetime.datetime(2015, 10, 1, 0, 4, 12), 1), (datetime.datetime(2015, 7, 1, 0, 4, 12), 1)])
def map_date(tup):
    return ((tup[0].month, tup[0].day, tup[0].hour), tup[1])
date_rdd = RDD_to_fill.map(map_date).reduceByKey(lambda x, y: x + y)
# create a tuple for every (month, day, hour) and set the value to 0
zeros = []
for month in range(1, 13):
    for day in range(1, 32):
        for hour in range(24):
            zeros.append(((month, day, hour), 0))
zeros_rdd = sc.parallelize(zeros)
# union the rdd with the date_rdd (dates with non-zero values) with the zeros_rdd (dates with all zero values)
# and then aggregate them together (via addition) by key (i.e., the date tuple)
filled_tups = date_rdd.union(zeros_rdd).reduceByKey(lambda x, y: x + y).collect()
Then, if you want to access the count for any (month, day, hour) period, you can easily do the following:
filled_dict = dict(filled_tups)
# get count for Sept 1 at 00:00
print(filled_dict[(9,1,0)]) # prints 2
Note that this code doesn't properly account for non-existent days such as Feb 30, Feb 31, April 31, June 31, etc.; those keys simply stay at zero.
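As a side note, the same (month, day, hour) aggregation can be prototyped locally with collections.Counter before scaling it up in Spark (the sample tuples below are taken from the question):

```python
import datetime
from collections import Counter

tuples = [(datetime.datetime(2015, 9, 1, 0, 4, 12), 1),
          (datetime.datetime(2015, 9, 2, 0, 4, 12), 1),
          (datetime.datetime(2015, 4, 1, 0, 4, 12), 1),
          (datetime.datetime(2015, 9, 1, 0, 4, 12), 1)]

# Same keying as map_date above: count events per (month, day, hour)
counts = Counter()
for dt, n in tuples:
    counts[(dt.month, dt.day, dt.hour)] += n

print(counts[(9, 1, 0)])  # prints 2
```

Once the keying logic looks right locally, map_date plus reduceByKey expresses the identical computation over a distributed RDD.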