I have some code and am wondering how I can properly format its output and save it to a CSV file. I haven't found a question that answers this in the way I want, and since I'm fairly new to Python I'm having trouble adapting code from similar questions to suit my needs.
It breaks down to the following:
assets = [asset_1, asset_2, asset_3]
for i in range(len(assets)):
    price = data.current(assets[i], "price")
Now if I
print price
it prints as
price 1
price 2
price 3
I can get it to print as price 1, price 2, price 3 without trouble, but it seems pretty janky to write something like print price[0], price[1], price[2], newline, especially since I hope to extend this to many more assets in the future. All the solutions I've come across that remove the newlines then produce price1, price2, price3, price1, price2..., whereas I want the newline specifically after price 3. I have also been able to export the data to a CSV without trouble, but the CSV has just one column, like the first print output. How can I get each of the 3 prices into a CSV row, followed by a newline?
I attempted to follow this answer and changed my print statements to match, but got a syntax error when trying to print prices[index] (index in my case being 0:2; I also tried 1:3, but it didn't work as expected).
Here is my actual code.
from catalyst.api import record, symbol, symbols
from datetime import datetime
import os, csv, pytz, sys
import catalyst
def initialize(context):
    # Portfolio assets list
    context.assets = [symbol("XMR_DASH"), symbol("BTC_XMR"), symbol("BTC_DASH")]
    '''
    # Creates a .CSV file with the same name as this script to store results
    context.csvfile = open(os.path.splitext(os.path.basename(__file__))[0]+'.csv', 'w+')
    context.csvwriter = csv.writer(context.csvfile)
    '''
def handle_data(context, data):
    date = context.blotter.current_dt  # Current time for each iteration in the simulation
    price = data.current(context.assets, "price")
    print price
    '''
    for i in range(0,3):
        price = data.current(context.assets[i], "price")
        print price[any index gives syntax error]
    '''
def analyze(context=None, results=None):
    pass
    '''
    # Close open file properly at the end
    context.csvfile.close()
    '''
start = datetime(2017, 7, 30, 0, 0, 0, 0, pytz.utc)
end = datetime(2017, 7, 31, 0, 0, 0, 0, pytz.utc)
results = catalyst.run_algorithm(start=start, end=end, initialize=initialize,
capital_base=10000, handle_data=handle_data,
bundle='poloniex', analyze=analyze, data_frequency='minute')
I don't need help with writing data to a csv, but do need help formatting the data that is written. Preferably an answer would also work with print statements so I can visualize what's going on before going to the csv.
You can simply print the prices this way:
assets = [asset_1, asset_2, asset_3]
print(",".join([str(data.current(a, "price")) for a in assets]))
",".join(values) joins the values with a comma between them; print then adds a single newline at the end of the concatenated string.
[str(data.current(a, "price")) for a in assets] builds the list of price strings you want to join.
Try changing the below:
for i in range(0,3):
    price = data.current(context.assets[i], "price")
    print price[any index gives syntax error]
to:
price = []
for i in range(0,3):
    price.append(data.current(context.assets[i], "price"))
print(','.join([str(x) for x in price]))
The issue you were facing, I believe, is that price is a single float/int and not a list of prices, so any index on price would fail. With this change we build a list of prices instead.
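If the end goal is a CSV rather than a print, the same list-of-prices approach drops straight into `csv.writer`, which takes care of the commas and the row-ending newline for you. A minimal sketch with made-up price values:

```python
import csv
import io

# Stand-in for the three prices collected in the loop above
prices = [0.0123, 0.0456, 0.0789]

buf = io.StringIO()      # use open('out.csv', 'w', newline='') for a real file
writer = csv.writer(buf)
writer.writerow(prices)  # one row: price1,price2,price3 followed by a newline

print(buf.getvalue())
```

Each call to `writerow` emits one CSV row, so calling it once per `handle_data` iteration gives exactly one line per timestep.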
I am scraping data with Python. I get a CSV file and can split it into columns in Excel later. But I am encountering an issue I have not been able to solve: sometimes the scraped items have two statuses and sometimes just one. The second status shifts the other values in the columns to the right, and as a result the dates are not all in the same column, which would be useful for sorting the rows.
Do you have any idea how to make the columns merge if there are two statuses, or any other solution?
Maybe it is also an issue that I still need to separate the values into columns manually in Excel.
Here is my code
#call packages
import random
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import pandas as pd
# define driver etc.
service_obj = Service("C:\\Users\\joerg\\PycharmProjects\\dynamic2\\chromedriver.exe")
browser = webdriver.Chrome(service=service_obj)
# create loop
initiative_list = []
for i in range(0, 2):
    url = 'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page='+str(i)
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
    initiatives = [item.text for item in initiative_item]
    initiative_list.extend(initiatives)
df = pd.DataFrame(initiative_list)
#create csv
print(df)
df.to_csv('Initiativen.csv')
df.columns = ['tosplit']
new_df = df['tosplit'].str.split('\n', expand=True)
print(new_df)
new_df.to_csv('Initiativennew.csv')
I tried to merge the columns if there are two statuses.
make the columns merge if there are two statuses for example or other solutions
If by "statuses" you mean the yellow labels ending in OPEN/UPCOMING/etc., then it should be taken care of by the following parts of the getDetails_iiaRow function below:
labels = cssSelect(iiaEl, 'div.field span.label')
and then
'labels': ', '.join([l.text.strip() for l in labels])
So, multiple labels will be separated by commas (or any other separator you apply .join to).
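As a quick illustration of that join behaviour, with invented label text:

```python
# Two scraped labels, with stray whitespace as element.text often returns
labels = ["  OPEN ", "UPCOMING"]

# Strip each label, then join with a comma separator into one cell value
joined = ", ".join(l.strip() for l in labels)
print(joined)  # OPEN, UPCOMING
```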
initiative_item = browser.find_elements(By.CSS_SELECTOR, "initivative-item")
initiatives = [item.text for item in initiative_item]
Instead of doing it like this and then having to split and clean things, you should consider extracting each item in a more specific manner and have each "row" be represented as a dictionary (with the column-names as the keys, so nothing gets mis-aligned later). If you wrap it as a function:
def cssSelect(el, sel): return el.find_elements(By.CSS_SELECTOR, sel)

def getDetails_iiaRow(iiaEl):
    title = cssSelect(iiaEl, 'div.search-result-title')
    labels = cssSelect(iiaEl, 'div.field span.label')
    iiarDets = {
        'title': title[0].text.strip() if title else None,
        'labels': ', '.join([l.text.strip() for l in labels])
    }
    cvSel = 'div[translate]+div:last-child'
    for c in cssSelect(iiaEl, f'div:has(>{cvSel})'):
        colName = cssSelect(c, 'div[translate]')[0].text.strip()
        iiarDets[colName] = cssSelect(c, cvSel)[0].text.strip()
    link = iiaEl.get_attribute('href')
    if link[:1] == '/':
        link = f'https://ec.europa.eu/{link}'
    iiarDets['link'] = link
    return iiarDets
then you can simply loop through the pages like:
initiative_list = []
for i in range(0, 2):
    url = f'https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives_de?page={i}'
    browser.get(url)
    time.sleep(random.randint(5, 10))
    initiative_list += [
        getDetails_iiaRow(iia) for iia in
        cssSelect(browser, 'initivative-item>article>a')
    ]
and since it's all cleaned already, you can save the data directly with
pd.DataFrame(initiative_list).to_csv('Initiativen.csv', index=False)
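This is also why the dictionary-per-row approach fixes the mis-alignment: pandas aligns rows by key, so an item missing a field just gets NaN in that column instead of shifting the rest over. A small demonstration with invented rows and an invented column name:

```python
import pandas as pd

rows = [
    {'title': 'A', 'labels': 'OPEN', 'date_range': '01-14 Jan'},
    {'title': 'B', 'labels': 'OPEN, UPCOMING'},  # this item has no date field
]

# Columns come from the union of keys; missing keys become NaN, never shifted
df = pd.DataFrame(rows)
print(df)
```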
The output I got for the first 3 pages came out with one aligned column per field.
I think it is worth working a little bit harder to get your data rationalised before putting it in the csv rather than trying to unpick the damage once ragged data has been exported.
A quick look at each record in the page suggests that there are five main items that you want to export and these correspond to the five top-level divs in the a element.
The complexity (as you note) comes because there are sometimes two statuses specified, and in that case there is sometimes a separate date range for each and sometimes a single date range.
I have therefore chosen to put the three ever-present fields as the first three columns, followed by the status + date range columns as pairs. Finally, I have removed the field names (these should effectively become the column headings) to leave only the variable data in the rows.
initiatives = [processDiv(item) for item in initiative_item]
def processDiv(item):
    divs = item.find_elements(By.XPATH, "./article/a/div")
    if "\n" in divs[0].text:
        statuses = divs[0].text.split("\n")
        if len(divs) > 5:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[5].text.split("\n")[1]]
        else:
            return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                    statuses[0], divs[4].text.split("\n")[1],
                    statuses[1], divs[4].text.split("\n")[1]]
    else:
        return [divs[1].text, divs[2].text.split("\n")[1], divs[3].text.split("\n")[1],
                divs[0].text, divs[4].text.split("\n")[1]]
The above approach sticks as close to yours as I can. You will clearly need to rework the pandas code to reflect the slightly altered data structure.
Personally, I would invest even more time in clearly identifying the best definitions for the fields that represent each piece of data you wish to retrieve (rather than simply divs 0-5), and extract the text directly from them (rather than messing around with split). That way you are far more likely to create robust code that can be maintained over time (perhaps not your goal).
I'm trying to subset and return a pandas df. The main obstacle is that I'm executing the subset from a df that is being continually updated.
I'm appending data on a timer that imports the same dataset every minute. I then want to subset this updated data and return it for a separate function. Specifically, the subset df will be emailed. I'm hoping to continually repeat this process.
I'll lay out each intended step below. I'm falling down on step 3.
Import the dataset over a 24 hour period
Continually update same dataset every minute
Subset the data frame by condition
If a new row is appended to df, execute email notification
Using below, data is imported from yahoo finance where the same data is pulled every minute.
I'm then aiming to subset specific rows from this updated dataframe and return the data to be emailed.
I only want to execute the email function when a new row of data has been appended.
The condition outlined below will return a new row at every minute (which is by design for testing purposes). My actual condition will return between 0-10 instances a day.
The example df outlined in df_out is an example that may be taken at a point throughout a day.
import pandas as pd
import yfinance as yf
import datetime
import pytz
from threading import Thread
from time import sleep
import numpy as np
import pandas as pd
import requests
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
import smtplib
def scheduled_update():
    # Step 1. import data for a 24 hour period
    my_date = datetime.datetime.now(pytz.timezone('Etc/GMT-5'))
    prev_24hrs = my_date - datetime.timedelta(hours = 25, minutes = 0)
    data = yf.download(tickers = 'EURUSD=X',
                       start = prev_24hrs,
                       end = my_date,
                       interval = '1m'
                       ).iloc[:-1]#.reset_index()
    # Step 2. update data every minute in 5 min blocks
    while True:
        sleep(60)
        upd_data = yf.download(tickers = 'EURUSD=X',
                               start = my_date - datetime.timedelta(hours = 0, minutes = 5),
                               end = datetime.datetime.now(pytz.timezone('Etc/GMT-5')),
                               interval = '1m')
        print(upd_data)
        if len(upd_data) != 0:
            # Here's the check to see if the data meets the desired condition.
            # The last row is again removed for the same reason as noted earlier.
            upd_data = upd_data.loc[upd_data['High'].lt(0.98000)].iloc[:-1]
            # Merges the two sets of data.
            data = pd.concat([data, upd_data])
            # For the first few minutes after the function starts running, there will be
            # duplicate timestamps, so this gets rid of those. This way is fastest, according to:
            # https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
            data = data[~data.index.duplicated(keep = 'first')]
            print(data)
        else:
            print('No new data')
    return data
thread = Thread(target = scheduled_update)
thread.start()
For argument's sake, let's say that during the day a new row has been appended, and we call the df df_out. When the new row has been appended, I want to execute the email notification.
# Step 3. return subset of data to be emailed
#df_out = scheduled_update()
# example df
df_out = pd.DataFrame({'Datetime' : ['2022-10-10 01:44:00+01:00','2022-10-10 01:45:00+01:00','2022-10-10 01:46:00+01:00','2022-10-10 01:47:00+01:00','2022-10-10 01:48:00+01:00'],
                       'Open' : [0.973899,0.973710,0.973615,0.973410,0.973799],
                       'High' : [0.973999,0.974110,0.973115,0.973210,0.973899],
                       'Low' : [0.973899,0.973710,0.973615,0.973710,0.973499],
                       'Close' : [0.973999,0.974110,0.973115,0.973410,0.973499],
                       'Adj Close' : [0.973999,0.974110,0.973115,0.973410,0.973499],
                       'Volume' : [0,0,0,0,0],
                       })
# Step 4. send notification containing df_out
def send_tradeNotification(send_to, subject, df):
    # google account and password
    send_from = 'xxxx1#gmail.com'
    password = 'password'
    # email message
    message = """\
    <p><strong>Notification </strong></p>
    <p>
    <br>
    </p>
    <p><strong>-
    </strong><br><strong>Regards </strong></p>
    """
    for receiver in send_to:
        multipart = MIMEMultipart()
        multipart['From'] = send_from
        multipart['To'] = receiver
        multipart['Subject'] = subject
        attachment = MIMEApplication(df.to_csv())
        attachment['Content-Disposition'] = 'attachment; filename="{}"'.format(f'{subject}.csv')
        multipart.attach(attachment)
        multipart.attach(MIMEText(message, 'html'))
        server = smtplib.SMTP('smtp.gmail.com', 587)
        server.starttls()
        server.login(multipart['From'], password)
        server.sendmail(multipart['From'], multipart['To'], multipart.as_string())
        server.quit()
#send_tradeNotification(['xxxx2#gmail.com'], 'Trade Setup', df_out)
It's a little unclear exactly what you're trying to do, but my assumptions are:
You want to have a 'baseline' dataset from the previous 24 hours.
You want to pull new data every minute after that.
If the new data matches a specific criteria, you want to append that data to the existing data.
Unless there's some specific reason you have the initial data outside of the scheduled_update() function, I think it's likely easiest just to keep everything in there. So for the first few lines of your function, I'd do the following:
def scheduled_update():
    # Using pandas time functions as they're a bit cleaner than datetime, but either will work
    my_date = pd.Timestamp.now('Etc/GMT-5')
    prev_24hrs = my_date - pd.Timedelta(hours = 25)
    data = yf.download(tickers = 'EURUSD=X',
                       start = prev_24hrs,
                       end = my_date,
                       interval = '1m'
                       ).iloc[:-1]
In my testing, I noticed that the last value in the downloaded data sometimes included a seconds value in it. Adding the .iloc[:-1] to the end of the DataFrame ensures that only whole minutes are included.
What was the original purpose of including the while datetime.datetime.now().minute % 1 != 0 lines? .minute % 1 will always be equal to 0 so it wasn't clear what that was doing.
As #Josh Friedlander indicated, upd_data is defined as a list outside of the function, so trying to use pandas methods on it won't work.
Instead, upd_data should be a new DataFrame that contains the recent data that you're importing. If the data is being updated every minute, it doesn't make sense to download a full 24 hours every time, so I changed that to the following so it only pulls the last 5 minutes:
while True:
    sleep(60)
    upd_data = yf.download(tickers = 'EURUSD=X',
                           start = pd.Timestamp.now('Etc/GMT-5') - pd.Timedelta(minutes = 5),
                           end = pd.Timestamp.now('Etc/GMT-5'),
                           interval = '1m')
You could probably get away with only pulling 2 or 3 minutes instead, but this ensures that there's some overlap in case there are delays or issues with the download.
As I'm writing this, Yahoo isn't updating the exchange rate since trading is apparently closed for the day, so there's no new data in the upd_data DataFrame.
When that's the case, it doesn't make sense to do any assessment, so the following chunk of code checks whether there were any updates. If there are, it updates the data DataFrame; if not, it just prints that there are no updates. (The whole else block can be deleted, otherwise it will print that out all night. Ideally you could set things up so that it doesn't even run overnight, but that's not critical at this point.)
if len(upd_data.index) > 0:
    # Here's the check to see if the data meets the desired condition.
    # The last row is again removed for the same reason as noted earlier.
    upd_data = upd_data.loc[upd_data['High'].gt(0.97000)].iloc[:-1]
    # Merges the two sets of data.
    data = pd.concat([data, upd_data])
    # For the first few minutes after the function starts running, there will be
    # duplicate timestamps, so this gets rid of those. This way is fastest, according to:
    # https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
    data = data[~data.index.duplicated(keep = 'first')]
    print(data)
else:
    print('No new data')
After this, it's not clear how you want to proceed. In your original code, the return will never be executed, since the while loop continues indefinitely. If you put the return inside the loop, though, it will break the loop when it reaches that line. There appear to be ways in the threading module to pass values out, but that's not something I'm familiar with, so I can't comment on it. There are likely other options as well, depending on your end goals!
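One common way to pass a value out of a thread is a `queue.Queue` that the worker writes to and the main thread reads from. A minimal sketch, with a plain number standing in for the DataFrame built inside `scheduled_update()`:

```python
import threading
import queue

def worker(out_q):
    result = 42        # stand-in for the DataFrame the real function would build
    out_q.put(result)  # hand the result back to the main thread

q = queue.Queue()
t = threading.Thread(target=worker, args=(q,))
t.start()
t.join()               # wait for the worker to finish
value = q.get()
print(value)           # 42
```

`Queue.get()` blocks until an item is available, so the main thread can also wait on results from a still-running loop rather than joining the thread first.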
EDIT based on updated question
Based on your update, the line in my code:
data = data[~data.index.duplicated(keep = 'first')]
can be changed to:
data = data[~data.index.duplicated(keep = False)]
This will get rid of all duplicate rows, leaving only the rows with new information.
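The difference between the two `keep` arguments, shown on a toy duplicated index:

```python
import pandas as pd

# '09:01' appears twice, as overlapping downloads would produce
s = pd.Series([1, 2, 3], index=['09:01', '09:01', '09:02'])

first_only = s[~s.index.duplicated(keep='first')]  # keeps the first '09:01'
no_dups    = s[~s.index.duplicated(keep=False)]    # drops every duplicated label

print(list(first_only.index))  # ['09:01', '09:02']
print(list(no_dups.index))     # ['09:02']
```

So `keep=False` leaves only rows whose timestamp is genuinely new, which is exactly what the email trigger needs.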
You should just include the email in this section as well. Put in another if statement to check if the length of data is greater than 0, and send an email if it is.
You also need to increase the line indent of the if/else from my code in your while loop. It shouldn't be at the same indentation level as the start of the while loop.
SECOND EDIT based on comments
Based on what you've written, just add the following if/else block to check for updates, and put your email function within it when the condition is met.
if len(upd_data.index) > 0:
    # Here's the check to see if the data meets the desired condition.
    # The last row is again removed for the same reason as noted earlier.
    upd_data = upd_data.loc[upd_data['High'].gt(0.97000)].iloc[:-1]
    # Merges the two sets of data.
    # Keeps all data values but drops duplicate entries.
    data = pd.concat([data, upd_data])
    data = data[~data.index.duplicated(keep = 'first')]
    # Note change here to False to delete all rows with duplicate timestamps
    # Also changed df name to send_data to keep it separate from full df.
    send_data = data[~data.index.duplicated(keep = False)]
    if len(send_data.index) != 0:  # If there is new data that meets the condition
        **run your whole email code here**
        print(send_data)
    else:  # If there is new data, but it doesn't meet the condition
        print('New rows retrieved, none meet condition')
else:  # If there is no new data (markets are closed, for example)
    print('No new data')
OK, I've got this. You want to append data every minute. You created a list and used its append function, but since you want to use the pandas library, you can do this in your function instead:
def scheduled_update():
    while datetime.datetime.now().minute % 1 != 0:
        sleep(1)
    x = pd.DataFrame(data)
    while True:
        sleep(60)
        df_out = x[x['High'] > 0.97000]
        print(df_out)
        return df_out
thread = Thread(target = scheduled_update)
thread.start()
Afterwards, if you want to append data to a pandas DataFrame, you can use:
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Call this DataFrame.append every minute (that is what the modulus check inside the first while loop is waiting for), and remove the upd_data list, which is useless now.
Logic:
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4],
                    "b": [5, 6, 7, 8]})
df2 = pd.DataFrame({"a": [1, 2, 3],
                    "b": [5, 6, 7]})
print(df1, "\n")
print(df2)
df1.append(df2, ignore_index = True)  # ignore_index=True because we don't want the indexes to collide
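Worth noting: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current pandas the same logic is written with `pd.concat`:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
df2 = pd.DataFrame({"a": [1, 2, 3], "b": [5, 6, 7]})

# ignore_index=True renumbers the rows, matching append(ignore_index=True)
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```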
I want to create something like a csv file database for a school project.
I have a list of rows with values that stay the same, like the UID, but also values that I want to keep track of, like timestamps. Simplified example below:
UID     Timestamps
First   Date1,Date2,Date3
Second  Date1,Date2,Date3,Date4,Date5
What I get right now is:
UID     Timestamps
First   Date1
Second  Date1
Is there an elegant way to do the following in pandas that is somewhat fast and doesn't require iteration?
If a specific UID has been found, append the timestamp to the associated cell.
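One iteration-free way to get the desired table, assuming the raw scrape produces one row per (UID, timestamp) pair, is a `groupby` with a string join (column names invented to match the example):

```python
import pandas as pd

# Raw data: one row each time a UID is observed
raw = pd.DataFrame({
    'UID': ['First', 'Second', 'First', 'Second', 'Second'],
    'Timestamp': ['Date1', 'Date1', 'Date2', 'Date2', 'Date3'],
})

# One row per UID, timestamps joined into a single comma-separated cell
collapsed = raw.groupby('UID', sort=False)['Timestamp'].agg(','.join).reset_index()
print(collapsed)
```

This avoids updating cells in place entirely: keep appending observation rows, and collapse to one-row-per-UID only when writing the CSV.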
EDIT:
I hope I can make my question more clear now (sorry, I am new to this). I want to scrape a list of links (UID) and add a timestamp every time the script detects a change in an HTML element associated with the link. The logic for checking whether a change has been made does not exist yet.
links_extended = []
datestamp_extended = []
while(True):
    try:
        a = driver.find_element_by_id('thermoplast').find_elements_by_tag_name("a")
        links = [x.get_attribute("href") for x in a]
        datestamp = []
        now = datetime.now()
        for date in links:
            date = now.strftime("%Y-%m-%d %H:%M:%S")
            datestamp.append(date)
        datestamp_extended.extend(datestamp)
        links_extended.extend(links)
    #if no next link is found the while loop ends
    except:
        hot_dict = {'Link': links_extended, 'Datestamps': datestamp_extended}
        hotlist_df = pd.DataFrame(data = hot_dict)
        break
Thank you very much! :)
I have about 10 columns of data in a CSV file that I want to get statistics on using Python. I am currently using the csv module to open the file and read the contents. But I also want to look at 2 particular columns, compare the data, and get a percentage of accuracy based on it.
Although I can open the file and parse through the rows I cannot figure out for example how to compare:
Row[i] Column[8] with Row[i] Column[10]
My pseudo code would be something like this:
category = Row[i] Column[8]
label = Row[i] Column[10]
if (category != label):
    difference += 1
    totalChecked += 1
else:
    correct += 1
    totalChecked += 1
The only thing I am able to do is to read the entire row. But I want to get the exact Row and Column of my 2 variables category and label and compare them.
How do I work with specific rows/columns for an entire Excel sheet?
Convert both to pandas DataFrames and compare, similarly to this example. Whatever dataset you're working on, using the pandas module (alongside any other relevant modules) and transforming the data into lists and DataFrames would be the first step to working with it, in my opinion.
I've taken the time and effort to delve into this myself, as it will be useful to me going forward. The columns don't have to have the same lengths in this example, which is good. I've tested the code below (Python 3.8) and it works.
With only slight adaptations it can be used for your specific data columns, objects, and purposes.
import pandas as pd
A = pd.read_csv(r'C:\Users\User\Documents\query_sequences.csv') #dropped the S fom _sequences
B = pd.read_csv(r'C:\Users\User\Documents\Sequence_reference.csv')
print(A.columns)
print(B.columns)
my_unknown_id = A['Unknown_sample_no'].tolist() #Unknown_sample_no
my_unknown_seq = A['Unknown_sample_seq'].tolist() #Unknown_sample_seq
Reference_Species1 = B['Reference_sequences_ID'].tolist()
Reference_Sequences1 = B['Reference_Sequences'].tolist() #it was Reference_sequences
Ref_dict = dict(zip(Reference_Species1, Reference_Sequences1)) #it was Reference_sequences
Unknown_dict = dict(zip(my_unknown_id, my_unknown_seq))
print(Ref_dict)
print(Unknown_dict)
import re
filename = 'seq_match_compare2.csv'
f = open(filename, 'a') #in his eg it was 'w'
headers = 'Query_ID, Query_Seq, Ref_species, Ref_seq, Match, Match start Position\n'
f.write(headers)
for ID, seq in Unknown_dict.items():
    for species, seq1 in Ref_dict.items():
        m = re.search(seq, seq1)
        if m:
            match = m.group()
            pos = m.start() + 1
            f.write(str(ID) + ',' + seq + ',' + species + ',' + seq1 + ',' + match + ',' + str(pos) + '\n')
f.close()
I also tried it myself, assuming your columns contain integers and following your specifications as best I can. It's a first attempt, but you can use the code below as a benchmark for how to move forward on your question.
Basically it does what you want in skeleton form: it imports the CSV with the pandas module, converts it to DataFrames, works on the specific columns in those DataFrames, makes new result columns, prints the results alongside the original data in the terminal, and saves everything to a new CSV. It's as messy as my Python is, but it works. There are some redundant lines left over from earlier attempts; they don't affect the program.
import pandas as pd
from pandas import DataFrame
import csv
import itertools  # redundant now?
A = pd.read_csv(r'C:\Users\User\Documents\book6 category labels.csv')
A["Category"].fillna("empty data - missing value", inplace = True)
#A["Blank1"].fillna("empty data - missing value", inplace = True)
# ...etc
print(A.columns)
MyCat=A['Category'].tolist()
MyLab=A['Label'].tolist()
My_Cats = A['Category1'].tolist()
My_Labs = A['Label1'].tolist()
#Ref_dict0 = zip(My_Labs, My_Cats)  # good to compare whole columns as a block; part of a later attempt to compare text fields, doesn't affect the program
Ref_dict = dict(zip(My_Labs, My_Cats))
Compareprep = dict(zip(My_Cats, My_Labs))
Ref_dict = dict(zip(My_Cats, My_Labs))
print(Ref_dict)
import re  # for string matching & comparison; redundant in this example, but you'll need it to compare tables of strings
#filename = 'CATS&LABS64.csv' # when i got to exporting part, this is redundant now
#csvfile = open(filename, 'a') #when i tried to export results/output it first time - redundant
print("Given Dataframe :\n", A)
A['Lab-Cat_diff'] = A['Category1'].sub(A['Label1'], axis=0)
print("\nDifference of score1 and score2 :\n", A)
#YOU CAN DO OTHER MATCHES, COMPARISONS AND CALCULTAIONS YOURSELF HERE AND ADD THEM TO THE OUTPUT
result = (print("\nDifference of score1 and score2 :\n", A))
result2 = print(A) and print(result)
def result22(result2):
    for aSentence in result2:
        df = pd.DataFrame(result2)
        print(str())
        return df
print(result2)
print(result22)  # printing the function itself produces nothing but its name, of course
output_df = DataFrame((result2),A)
output_df.to_csv('some_name5523.csv')
Yes, I know, it's by no means perfect at all, but I wanted to give you the heads up about pandas and DataFrames for doing what you want moving forward.
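For comparison, the category/label accuracy check the question describes collapses to a vectorised equality test in pandas, with no explicit row loop. A sketch with invented data (for the real file you would use `pd.read_csv` and pick the two columns by name, or with `df.iloc[:, 8]` and `df.iloc[:, 10]`):

```python
import pandas as pd

# Stand-in for the two columns to compare
df = pd.DataFrame({'category': ['a', 'b', 'a', 'c'],
                   'label':    ['a', 'b', 'c', 'c']})

matches = df['category'] == df['label']  # boolean Series, one value per row
correct = int(matches.sum())             # True counts as 1
total = len(df)
accuracy = correct / total
print(correct, total, accuracy)          # 3 4 0.75
```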
I am using a Python script with the xlrd module to read dates from an Excel file I have saved on my computer. Now, I already know that Excel saves dates as a serial date, which counts the number of days since Jan 0, 1900. I have been using the datetime function to convert this number into a date object with the form YYYY-MM-DD. For example:
v_created = sh.cell(r,created_col).value
if not v_created:
    v_created = 0
elif not isinstance(v_created, basestring):
    v_created = datetime.date(1900, 1, 1) + datetime.timedelta(int(v_created)-2)
print 'Create '
print v_created
This prints the following output:
Create
2013-09-26
On the other hand, the next block of code should do the exact same thing, but puts a float into the variable instead of a date:
v_updated = sh.cell(r,updated_col).value
if not v_updated:
    v_updated = 0
elif not isinstance(v_updated, basestring):
    v_upated = datetime.date(1900, 1, 1) + datetime.timedelta(int(v_updated)-2)
print 'Updated '
print v_updated
As far as I can tell, this block of code is identical to the first, but spits out the following output:
Updated
41543.5895833
I am sending these values to an Oracle database. When I run the query, I get the following error:
Traceback (most recent call last):
cursor.execute(query, values)
cx_Oracle.DatabaseError: ORA-00932: inconsistent datatypes: expected DATE got NUMBER
Why is one block of code outputting a date object while the very next block of code outputting a float? As far as I can tell, Excel is storing the dates in the exact same way with the same formatting options.
You have a typo.
You write:
v_upated = datetime.date(1900, 1, 1) + datetime.timedelta(int(v_updated)-2)
and then print v_updated.
The assignment goes to v_upated (missing the second "d"), so v_updated still holds the raw float from the cell. Once the typo is fixed, it works when I run it: 41543 is indeed the number of days since 1900, as you calculated.