How does Python handle time conversions?

Recently I have been working on a time-series data set and have written a script to automate some plotting. I assumed the pd.to_datetime function (provided with a specific format) would automatically convert every time entry to the appropriate format.
The raw data follows this format:
%d/%m/%YYYY HH:MM (HH:MM is irrelevant in this case so don't worry about it as we are only interested in the daily average)
However, Python seems to intermittently change the raw timestamps to the format:
%d-%m-%YYYY
Why is this the case and how can I make sure Python doesn't do this?
I receive the error below and can't work out why.
I have looked at the following SO but I don't have the same issue.
time data does not match format
The data itself is provided in the following CSV and is all in the %d/%m/%Y format.
My code for my function is attached in case there are any errors with how I've converted the timestamps.
def plotFunction(dataframe):
    for i in wellNames:
        my_list = dataframe["Date"].values
        DatesRev = []
        for j in my_list:
            a = j[0:10]
            DatesRev.append(a)
        # We now need to re-add the dates to our data frame
        df2 = pd.DataFrame(data=DatesRev)
        df2.columns = ["DatesRev"]
        dataframe["DatesRev"] = df2["DatesRev"]
        # print(dataframe)
        # df2 = pd.DataFrame(DatesRev)
        # df2.columns = ['DatesRev']
        # dataframe['DatesRev'] = df2['DatesRev']
        wellID = dataframe[dataframe['Well'] == i]
        wellID['DatesRev'] = pd.to_datetime(wellID['DatesRev'], format='%d/%m/%Y')
        print(i)
        # ax = wellID.set_index('DatesRev').plot()
        # xfmt = mdates.DateFormatter('%d-%m-%Y')
        # ax.xaxis.set_major_formatter(xfmt)
        # plt.xticks(rotation=90)
        # ax.legend(bbox_to_anchor=(1.04,1), loc="upper left")
        # plt.title(i)
        # plt.show()
        # plt.savefig(i + ".jpg", bbox_inches='tight')

The problem is that pd.to_datetime raises "time data does not match format" as soon as a single entry in the column fails to match the format string you pass. If some of your raw timestamps use - as the separator while the rest use /, the one format='%d/%m/%Y' cannot match every row. I came across this problem myself.
If the offending rows are the dash-separated ones, changing format='%d/%m/%Y' to format='%d-%m-%Y' in the wellID['DatesRev'] line will parse those instead; but a single format string still won't cover a mixed column, so either normalize the separator before parsing or handle the two formats separately.
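A minimal sketch of the normalization approach, assuming the column mixes / and - separators (the sample values below are made up):

```python
import pandas as pd

# Hypothetical sample mixing both separators the question describes.
s = pd.Series(["01/02/2020", "03-04-2020", "05/06/2020"])

# Normalize '-' to '/' so a single format string matches every entry.
dates = pd.to_datetime(s.str.replace("-", "/", regex=False), format="%d/%m/%Y")
print(dates.dt.strftime("%Y-%m-%d").tolist())  # → ['2020-02-01', '2020-04-03', '2020-06-05']
```

This keeps the parsing strict (a genuinely malformed entry still raises an error) while tolerating either separator.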

Related

How to make it so that all available data is being pulled instead of specifically typing out a date range for this script?

The available options dates are below. How can I write a code so that it pulls all those dates instead of having to type them all out in a separate row?
2022-03-11, 2022-03-18, 2022-03-25, 2022-04-01, 2022-04-08, 2022-04-14, 2022-04-22, 2022-05-20, 2022-06-17, 2022-07-15, 2022-10-21, 2023-01-20, 2024-01-19
import yfinance as yf
gme = yf.Ticker("gme")
opt = gme.option_chain('2022-03-11')
print(opt)
First of all, as these dates have no regular pattern, you should create a list of the dates.
list1=['2022-03-11', '2022-03-18', '2022-03-25', '2022-04-01', '2022-04-08', '2022-04-14', '2022-04-22', '2022-05-20', '2022-06-17', '2022-07-15', '2022-10-21', '2023-01-20', '2024-01-19']
After you have created the list, you can start your code as you have done:
import yfinance as yf
gme = yf.Ticker("gme")
But since you want everything printed out, and I assume you will want to save it to a file for a better view (having checked the output, I personally prefer CSV for yfinance data), you can do this:
for date in list1:
    df = gme.option_chain(date)
    df_call = df[0]
    df_put = df[1]
    df_call.to_csv(f'call_{date}.csv')
    df_put.to_csv(f'put_{date}.csv')
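If you would rather not hard-code the list at all, yfinance exposes the available expiration dates on the ticker's options attribute. A sketch of that approach (the file names here are just examples):

```python
import pandas as pd

def save_option_chains(ticker, dates):
    """Save the calls and puts for each expiration date to separate CSV files.

    `ticker` is assumed to be a yfinance Ticker; each option_chain(date)
    result exposes .calls and .puts DataFrames.
    """
    for date in dates:
        chain = ticker.option_chain(date)
        chain.calls.to_csv(f'call_{date}.csv', index=False)
        chain.puts.to_csv(f'put_{date}.csv', index=False)

# Usage (requires network access):
#   import yfinance as yf
#   gme = yf.Ticker("gme")
#   save_option_chains(gme, gme.options)  # .options lists every available expiration
```

This way the script automatically picks up new expirations as they are listed, instead of needing the list edited by hand.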

How to work with Rows/Columns from CSV files?

I have about 10 columns of data in a CSV file that I want to get statistics on using python. I am currently using the import csv module to open the file and read the contents. But I also want to look at 2 particular columns to compare data and get a percentage of accuracy based on the data.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if category != label:
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific row/columns for an entire excel sheet?
Convert both to pandas DataFrames and compare them, similarly to the example below. Whatever dataset you're working on, loading it with the pandas module (alongside any other relevant modules) and transforming the data into lists and DataFrames is the first step to working with it, in my opinion.
I've taken the time to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths in this example. I've tested the code below (Python 3.8) and it works.
With only slight adaptations it can be used for your specific data columns, objects and purposes.
import pandas as pd
import re  # for string matching & comparison

A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv')
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)

my_unknown_id = A['Unknown_sample_no'].tolist()
my_unknown_seq = A['Unknown_sample_seq'].tolist()
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist()

# Map each reference species to its sequence, and each unknown ID to its sequence
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1))
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)

filename = 'seq_match_compare2.csv'
f = open(filename, 'a')
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
# Search each unknown sequence against every reference sequence
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
I also tried it on your case, assuming your columns contain integers, and following your specifications as best I can. It's a first attempt, but you could use the code below as a skeleton for moving forward on your question. It imports the CSV with pandas, converts it to a DataFrame, works on the specific columns, adds a results column, prints the results alongside the original data in the terminal, and saves everything to a new CSV. It's still a work in progress, but it does in basic terms what was asked for:
import pandas as pd

A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
# Fill missing values so later operations don't trip over NaN
A["Category"].fillna("empty data - missing value", inplace=True)
print(A.columns)

# Convert the columns of interest to lists
MyCat = A['Category'].tolist()
MyLab = A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()

# Good to compare whole columns as a block
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)

print("Given Dataframe :\n", A)

# Difference between the two numeric columns
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of Category1 and Label1 :\n", A)

# You can do other matches, comparisons and calculations here and add them to the output

# Save the results, alongside the original data, to a new CSV
A.to_csv('some_name5523.csv')
Yes, I know it's by no means perfect, but I wanted to give you a heads-up about pandas and DataFrames for doing what you want moving forward.
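The accuracy comparison the question's pseudocode describes can also be done directly with a vectorized column comparison; a minimal sketch, using made-up values standing in for columns 8 and 10:

```python
import pandas as pd

# Hypothetical data standing in for the question's two CSV columns.
df = pd.DataFrame({
    "Category": ["a", "b", "c", "a"],
    "Label":    ["a", "b", "a", "a"],
})

# Element-wise comparison of the two columns gives a boolean Series;
# summing it counts the matching rows.
matches = int((df["Category"] == df["Label"]).sum())
total = len(df)
accuracy = matches / total
print(f"{matches}/{total} correct ({accuracy:.0%})")  # → 3/4 correct (75%)
```

This replaces the whole row-by-row loop with one comparison, which is both shorter and faster for large files.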

Python Pandas replacing part of a string

I'm trying to filter data that is stored in a .csv file that contains time and angle values and save filtered data in an output .csv file. I solved the filtering part, but the problem is that time is recorded in hh:mm:ss:msmsmsms (12:55:34:500) format and I want to change that to hhmmss (125534) or in other words remove the : and the millisecond part.
I tried using the .replace function but I keep getting the KeyError: 'time' error.
Input data:
time,angle
12:45:55,56
12:45:56,89
12:45:57,112
12:45:58,189
12:45:59,122
12:46:00,123
Code:
import pandas as pd
#define min and max angle values
alpha_min = 110
alpha_max = 125
#read input .csv file
data = pd.read_csv('test_csv3.csv', index_col=0)
#filter by angle size
data = data[(data['angle'] < alpha_max) & (data['angle'] > alpha_min)]
#replace ":" with "" in time values
data['time'] = data['time'].replace(':','')
#display results
print(data)
#write results
data.to_csv('test_csv3_output.csv')
That's because time is an index. You can do this and remove the index_col=0:
data = pd.read_csv('test_csv3.csv')
And change this line:
data['time'] = pd.to_datetime(data['time']).dt.strftime('%H%M%S')
Output:
time angle
2 124557 112
4 124559 122
5 124600 123
What would print(data.keys()) or print(data.head()) yield? It seems like you have a stray character before or after the time column name; this happens from time to time, depending on how the CSV was created versus how it was read (see this question).
If it's not a bigger project and you just want the data, a simple workaround is timeKeyString = list(data.columns.values)[0] (assuming time is the first column).
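If you don't need real datetimes at all, the string accessor does the whole job in one step; a sketch, with a couple of made-up rows in the question's hh:mm:ss:msmsmsms format:

```python
import pandas as pd

df = pd.DataFrame({"time": ["12:55:34:500", "12:45:55"], "angle": [56, 112]})

# Drop the colons, then keep only the first six digits (hhmmss),
# which also discards any millisecond part.
df["time"] = df["time"].str.replace(":", "", regex=False).str[:6]
print(df["time"].tolist())  # → ['125534', '124555']
```

Note that plain Series.replace (as in the question) only matches whole cell values; for substring replacement you need the .str accessor or regex=True.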

converting a 1d array to netcdf

I have a 1d array which is a time series hourly dataset encompassing 49090 points which needs to be converted to netcdf format.
In the code below, result_u2 is a 1d array which stores result from a for loop. It has 49090 datapoints.
nhours = 49091;#one added to no of datapoints
unout.units = 'hours since 2012-10-20 00:00:00'
unout.calendar = 'gregorian'
ncout = Dataset('output.nc','w','NETCDF3');
ncout.createDimension('time',nhours);
datesout = [datetime.datetime(2012,10,20,0,0,0)+n*timedelta(hours=1) for n in range(nhours)]; # create datevalues
timevar = ncout.createVariable('time','float64',('time'));timevar.setncattr('units',unout);timevar[:]=date2num(datesout,unout);
winds = ncout.createVariable('winds','float32',('time',));winds.setncattr('units','m/s');winds[:] = result_u2;
ncout.close()
I'm new to programming. The code I tried above should be able to write the nc file but while running the script no nc file is being created. Please help.
My suggestion would be to have a look at Python syntax in general if you want to use it and the netCDF4 package; for example, the trailing semicolons are legal in Python but unnecessary and unidiomatic.
Check out the API documentation - the tutorial you find there basically covers what you're asking. Then, your code could look like
import datetime
import netCDF4

# using "with" syntax so you don't have to do the cleanup:
with netCDF4.Dataset('output.nc', 'w', format='NETCDF3_CLASSIC') as ncout:
    # create time dimension
    nhours = 49091
    time = ncout.createDimension('time', nhours)
    # create the time variable
    times = ncout.createVariable('time', 'f8', ('time',))
    times.units = 'hours since 2012-10-20 00:00:00'
    times.calendar = 'gregorian'
    # fill time
    dates = [datetime.datetime(2012, 10, 20, 0, 0, 0) + n * datetime.timedelta(hours=1) for n in range(nhours)]
    times[:] = netCDF4.date2num(dates, units=times.units, calendar=times.calendar)
    # create variable 'wind', dependent on time
    wind = ncout.createVariable('wind', 'f8', ('time',))
    wind.units = 'm/s'
    # fill with data, using your 1d array here:
    wind[:] = result_u2

Remove timestamp from matplotlib.num2date date, and use date stamp in axis xlabel

I have an array of date values (T_date), imported from a CSV and converted to numbers using
T_date, T_price = np.loadtxt('TESC.csv', unpack = True, delimiter = ',',
skiprows=1, usecols=(0,4),
converters = {0: mdates.strpdate2num('%Y-%m-%d')})
I then create a new array starting from 0, with 0 being the earliest date using
T_date0=T_date-np.amin(T_date)
I wish to label the x axis of a pyplot plot as
plt.xlabel("Date, 0 = ", firstdate)
where firstdate=
firstdate = mdates.num2date(np.amin(T_date))
This creates a datetime value in the form Y-m-d time, but as my data has no time values I wish to remove it.
When I run this, I get the error
"AttributeError: 'datetime.datetime' object has no attribute 'iteritems'"
Any help much appreciated.
Just try:
plt.xlabel("Date, 0 = {}".format(firstdate.strftime("%Y-%m-%d")))
And for completeness: the error you're seeing is due to matplotlib treating the second positional argument you've passed to xlabel as a dictionary of font properties, per its definition:
Definition: plt.xlabel(s, *args, **kwargs)
As for the question asked only in your title ("Remove timestamp from matplotlib.num2date date"), you should look into the formatting options of the x-axis. This example from the matplotlib website sums it up very well. For the tick labels you'll need:
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
(ax.fmt_xdata = mdates.DateFormatter('%Y-%m-%d') formats only the interactive cursor readout, not the tick labels.)
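Pulling the pieces together, a minimal sketch (with made-up dates standing in for T_date) that gives date-only tick labels and the date-stamped xlabel:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime

# Hypothetical data standing in for T_date / T_price.
dates = [datetime.date(2020, 1, 1) + datetime.timedelta(days=d) for d in range(5)]
fig, ax = plt.subplots()
ax.plot(dates, range(5))

# Date-only tick labels (this is what removes the time part from the axis):
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
# Date-only xlabel, formatted once as a string:
ax.set_xlabel("Date, 0 = {}".format(dates[0].strftime("%Y-%m-%d")))
```

Because the datetime is formatted into a string before it reaches xlabel, only one positional argument is passed and the fontdict problem never arises.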
