LibreOffice Calc and Excel showing different values - python

I am trying to do some date parsing in Python, and while parsing I ran into this odd error:
time data 'nan' does not match format '%d/%m/%y'
When I checked my .csv file in LibreOffice Calc, everything looked fine, with no nan values whatsoever. However, when I checked it in Excel (the mobile version, since I don't want to pay), I saw a different value. The same cell was shown as follows in the two editors:
LibreOffice Calc - 11/09/93
Excel - ########
Here is a screenshot below:
How can I change this in LibreOffice or Python so that the cells are treated as the real values they should be, not as nan?
I don't have much knowledge of Excel or LibreOffice Calc, so any explanation that solves this simple issue would be welcome.
Here is the python code
import pandas as pd
from datetime import datetime as dt

loc = "C:/Data/"
season1993_94 = pd.read_csv(loc + '1993-94.csv')

def parse_date_type1(date):
    if date == '':
        return None
    return dt.strptime(date, '%d/%m/%y').date()

def parse_date_type2(date):
    if date == '':
        return None
    return dt.strptime(date, '%d/%m/%Y').date()

season1993_94.Date = season1993_94.Date.astype(str).apply(parse_date_type1)
Error:
<ipython-input-13-46ff7e1afe94> in <module>()
----> 1 season1993_94.Date = season1993_94.Date.astype(str).apply(parse_date_type1)
ValueError: time data 'nan' does not match format '%d/%m/%y'
PS: If the question seems inappropriate as per the context given, please feel free to edit it.

To see what is going on, use a text editor such as Notepad++. Viewing with Excel or Calc may not show the problem; at least, the problem cannot be seen from the images in the question.
The error occurs with a CSV file consisting of the following three lines.
Date,Place
28/08/93,Southampton
,Newcastle
Here is the solution, adapted from How to convert string to datetime with nulls - python, pandas?
season1993_94['Date'] = pd.to_datetime(season1993_94['Date'], errors='coerce')
The result:
>>> season1993_94
        Date        Place
0 1993-08-28  Southampton
1        NaT    Newcastle
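The 'nan' in the error message comes from astype(str): pandas reads the empty cell as a missing value (NaN), and astype(str) turns it into the literal string 'nan', which the `date == ''` check does not catch. As a minimal sketch, the conversion can also be given the day-first format explicitly, so that an ambiguous value such as 11/09/93 is read as 11 September rather than 9 November. The in-memory CSV and its third row are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV: the two rows from the answer plus one ambiguous date
csv_text = "Date,Place\n28/08/93,Southampton\n,Newcastle\n11/09/93,Leeds\n"
season = pd.read_csv(io.StringIO(csv_text))

# An explicit format avoids month/day guessing; empty cells become NaT
season["Date"] = pd.to_datetime(
    season["Date"], format="%d/%m/%y", errors="coerce"
)
print(season)
```

With errors='coerce', anything that does not match the format also becomes NaT, so it is worth checking season["Date"].isna().sum() against the number of blank cells you expect.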

Related

Texts in some cells update but not all cells using runs with python-docx

I'm trying to update the dates on an existing docx file. The file has one table with 3 columns and 25 rows. Each cell starts with a date in this format Aug. 17-18, 2021. My goal is to go through every cell and change the date to this year's, so that date should become Aug. 16-17, 2022.
Code:
from docx import Document

doc = Document("AP2021.docx")
table = doc.tables[0]
int_dates = []
ones = []
for i in range(10, 32):
    int_dates.append(i)
for i in range(2, 10):
    ones.append(i)
dates = [str(x) for x in int_dates]
ones_dates = [str(x) for x in ones]
for row in table.rows:
    for cell in row.cells:
        for date in dates:
            # Update 10's and 20's of date
            if date in cell.paragraphs[0].text:
                new_date = str(int(date) - 1)
                run = cell.paragraphs[0].runs
                for i in range(len(run)):
                    if date in run[i].text:
                        run[i].text = run[i].text.replace(date, new_date)
            else:
                # Update ones of date separately because other text in cell uses the same numbers
                for one in ones_dates:
                    if one in cell.paragraphs[0].text:
                        new_one = str(int(one) - 1)
                        run = cell.paragraphs[0].runs
                        for i in range(len(run)):
                            if one in run[i].text:
                                run[i].text = run[i].text.replace(one, new_one)
                            elif '1' in run[i].text:
                                run[i].text = run[i].text.replace('1', '31')
                        break;
            break;
        # one and two were used to check if 1920 and 1010 were detected. More two's were printed than one's.
        if '1920' or '1010' in cell.text:
            print("one")
            run = cell.paragraphs[0].runs
            for i in range(len(run)):
                if '1920' or '1010' in run[i].text:
                    print("two")
                    run[i].text = run[i].text.replace('1920' or '1010', '2022')
doc.save("AP2022.docx")
Originally, I was able to change the dates using paragraphs.text but that removed all the formatting (highlight, bold, comments, etc.) for the cell. Following a comment on one of the posts, I replaced .text with .runs. The format was kept but for some reason, some cells' dates were updated but with incorrect year, some had updated year but not the correct date, some had correct date and year, and some had incorrect date and year.
I tried putting the year update code within the for loop for updating the date, but that didn't help. One of the posts said that .runs is inconsistent and hard to tell what part of the paragraph is actually wrapped for editing. All the runs examples I found were used with adding new text which wasn't what I wanted.
Is there a way for all the dates to be changed? Thank you.
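One part of the code above can be checked without python-docx at all: the condition `'1920' or '1010' in cell.text` does not test both strings. A minimal sketch of the difference, using plain strings in place of run text (the sample strings and the helper name fix_year are made up for illustration):

```python
text = "Aug. 16-17, 1920"

# `'1920' or '1010' in s` parses as `'1920' or ('1010' in s)`.
# A non-empty string is truthy, so the result never depends on s.
always_true = bool('1920' or '1010' in "no year here")

# Testing each candidate separately gives the intended check.
wrong_years = ('1920', '1010')
has_wrong_year = any(year in text for year in wrong_years)

# Likewise `'1920' or '1010'` evaluates to '1920', so a single
# replace() call with that expression never looks for '1010'.
def fix_year(s, target='2022'):
    for year in wrong_years:
        s = s.replace(year, target)
    return s

print(always_true, has_wrong_year, fix_year("Aug. 9-10, 1010"))
```

This only addresses the year check. It does not cover the separate problem that a date may be split across several runs, in which case no single run contains the text being searched for.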

how do I extract date string "Mar 11, 2019 • 3:26AM" from a paragraph and convert it to date time format (dd/mm/yy) in python

I have a paragraph that contains details like dates and comments, which I need to extract into a separate column. The paragraph, stored in the column I am extracting the date from, looks like this:
'Story\nFAQ\nUpdates 2\nComments 35\nby Antaio Inc\nMar 11, 2019 • 3:26AM\n2 years ago\nThank you all for an amazing start!\nHi all,\nWe just want to thank you all for an awesome start! This is our first ever Indiegogo campaign and we are very grateful for your support that helped us achieve a successful campaign.\nIn the next little while, we will be dedicating our effort on production and shipping of the awesome A-Buds and A-Buds SE. We plan to ship them to you as promised in the coming month.\nWe will send out more updates as we are approaching the key production dates.\nStay tuned!\nBest regards,\nAntaio Team\nby Antaio Inc\nJan 31, 2019 • 5:15AM\nover 2 years ago\nPre-Production Update\nDear all,\nWe want to take this opportunity to thank all of you for being our early backers. You guys rock! :)\nAs you may have noticed, the A-Buds are already in production stage, which means we have already completed all development and testing, and are now working on pre-production. Not only will you receive fully tested and certified awesome A-Buds after the campaign, we are also giving you the promise to deliver them on time! We are truly excited to have these awesome true Bluetooth 5.0 earbuds in your hands. We are sure you will love them!\nSo here is a quick sneak peek:\nMore to come. Stay tuned! :)\nFrom: Antaio Team\nRead More'
This kind of paragraph is present in each row of the dataset, in a column called 'Project_Updates_Description'. I am trying to extract the first date in each entry.
The code I'm using so far is:
for i in df['Project_Updates_Description']:
    if type(i) == str:
        print(count)
        word = i.split('\n',7)
        count+=1
        if len(word) > 5:
            print(word[5])
            df['Date'] = word[5]
The issue I have right now is that when I extract the date from the paragraph, I get it as a string, but I need it in dd/mm/yyyy format. I tried methods like strptime, but that didn't work; the value is still stored as a string. Also, when I try to put it in the new 'Date' column, I keep getting the same date for every entry. Could someone tell me where I am going wrong?
Assuming you have a dataframe with a column called 'Project_Updates_Description' which contains the example text, and you want to extract the first date and generate a datetime stamp from it, you can do the following:
import re

import numpy as np
import pandas as pd

def findDate(txin):
    # Match a date such as "Mar 11, 2019" at the start of a line
    schptrn = r'^\w+ \d{1,2}, \d{4}'
    # Rows that are not text (e.g. NaN) have no date to extract
    if not isinstance(txin, str):
        return np.nan
    for line in txin.split('\n'):
        match = re.search(schptrn, line)
        if match:
            return pd.to_datetime(match.group(0))
    return np.nan

df['date'] = df['Project_Updates_Description'].apply(findDate)
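If the dd/mm/yyyy text form from the question is also needed, one possible variant does the extraction with pandas string methods and then formats the result. This is a sketch under assumptions: the two-row frame is a made-up stand-in for the real dataset, the regular expression expects an English three-letter month abbreviation, and the column name 'date_str' is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (second row has no text)
df = pd.DataFrame({
    "Project_Updates_Description": [
        "Story\nFAQ\nby Antaio Inc\nMar 11, 2019 • 3:26AM\n2 years ago",
        None,
    ]
})

# First "Mon D, YYYY" occurrence per row; NaN where nothing matches
first_date = df["Project_Updates_Description"].str.extract(
    r"([A-Z][a-z]{2} \d{1,2}, \d{4})", expand=False
)

# Real datetime column, plus a dd/mm/yyyy string column for display
df["date"] = pd.to_datetime(first_date, format="%b %d, %Y", errors="coerce")
df["date_str"] = df["date"].dt.strftime("%d/%m/%Y")
print(df[["date", "date_str"]])
```

Keeping the datetime column alongside the formatted string is deliberate: the string is only useful for display, while sorting and date arithmetic need the datetime values.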

Pandas remove rows based on timestamps

Thanks in advance for checking out my question and helping me out!
Basically, what I am trying to do is remove MAC addresses which haven't been detected in the past hour or longer.
I collect probe requests from my wifi network with timestamps for each MAC captured. The data is processed using Pandas. The dataframe has only 2 columns: 'MAC' and 'TIME' (in strftime format). Below is a screenshot of my dataframe.
As you can see, I only consider rows which have the same MAC address to be 'duplicates'. My problem is that I can't work out, for each duplicated MAC address, the time gap between its last entry and the one before last.
MAC csv
MAC csv2
What I have tried so far:
I tried to use groupby and tail(2) to group the data by MAC and take the last 2 entries; however, when there are several duplicated MACs in the dataframe this doesn't work, because this method seems to only work for the last two entries.
Here is the code I tried:
def CheckListCleaner(inputDF) -> pd.DataFrame:
    sleep(60 - time() % 60)
    cond1 = inputDF.groupby("MAC").count() > 1
    cond2 = inputDF.groupby("MAC").tail(2).diff() > 3600
    combined_cond = cond1.mul(cond2)
    combined_cond["M1"] = combined_cond.index
    combined_cond.rename({"T": "val"}, axis=1, inplace=True)
    out = inputDF.merge(combined_cond, left_on="MAC", right_on="M1")
    listToDel = out[~out["val"]]
    return listToDel
I am open to any new ideas. I am also wondering whether there are easier ways or libraries I can use to make this work without using a lot of groupby calls and conditions.
P.S. In case you wonder how I captured these MAC addresses: I am only interested in type 2 transmitter MACs. Below is the code I used to collect the MACs.
def PacketHandler(pkt):
    if pkt.haslayer(Dot11):
        if pkt.type == 0:
            allType2List.append((pkt.addr2, datetime.fromtimestamp(pkt.time).strftime('%H:%M:%S')))
        if pkt.type == 1 and pkt.addr2 != None:
            allType2List.append((pkt.addr2, datetime.fromtimestamp(pkt.time).strftime('%H:%M:%S')))
        if pkt.type == 2 and pkt.addr2 != None:
            allType2List.append((pkt.addr2, datetime.fromtimestamp(pkt.time).strftime('%H:%M:%S')))
        if pkt.type == 3 and pkt.addr2 != None:
            allType2List.append((pkt.addr2, datetime.fromtimestamp(pkt.time).strftime('%H:%M:%S')))
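For the stale-MAC filtering described in the question, one possible approach is to compute each MAC's most recent sighting with a grouped transform and compare it with the current time. This is only a sketch under assumptions: the sample data and the function name drop_stale are made up, and TIME is stored as full timestamps rather than '%H:%M:%S' strings, since time-of-day strings cannot be subtracted and are ambiguous across midnight.

```python
import pandas as pd

# Hypothetical sample; TIME holds full timestamps
df = pd.DataFrame({
    "MAC": ["aa:aa", "bb:bb", "aa:aa", "cc:cc"],
    "TIME": pd.to_datetime([
        "2021-01-01 10:00:00",
        "2021-01-01 10:05:00",
        "2021-01-01 11:30:00",
        "2021-01-01 11:45:00",
    ]),
})

def drop_stale(frame, now, max_age=pd.Timedelta(hours=1)):
    """Keep only rows whose MAC was last seen within max_age of now."""
    # transform("max") repeats each MAC's latest TIME on all of its rows
    last_seen = frame.groupby("MAC")["TIME"].transform("max")
    return frame[now - last_seen <= max_age]

now = pd.Timestamp("2021-01-01 12:00:00")
fresh = drop_stale(df, now)

# Gap between the last two sightings of each MAC (NaT if seen only once)
last_gap = df.sort_values("TIME").groupby("MAC")["TIME"].apply(
    lambda s: s.diff().iloc[-1]
)
print(fresh)
print(last_gap)
```

In the sample, bb:bb was last seen almost two hours before `now`, so its row is dropped, while both aa:aa rows survive because the latest one is only 30 minutes old.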

How to match asset price data from a csv file to another csv file with relevant news by date

I am researching the impact of news article sentiment related to a financial instrument and its potential effect on the instrument's price. I have tried to get the timestamp of each news item, truncate it to minute data (i.e. remove the second and microsecond components), and get the base share price of the instrument at that time and at several intervals after that time, in our case t+2. However, the program adds the twoM column but does not fill in any calculated price changes.
Previously, I used Reuters Eikon and its functions to conduct the research, described in the article below.
https://developers.refinitiv.com/article/introduction-news-sentiment-analysis-eikon-data-apis-python-example
However, instead of using data available from Eikon, I would like to use my own csv news file with my own price data from another csv file. I am trying to match the
excel_file = 'C:\\Users\\Artur\\PycharmProjects\\JRA\\sentimenteikonexcel.xlsx'
df = pd.read_excel(excel_file)
sentiment = df.Sentiment
print(sentiment)
start = df['GMT'].min().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
end = df['GMT'].max().replace(hour=0,minute=0,second=0,microsecond=0).strftime('%Y/%m/%d')
spot_data = 'C:\\Users\\Artur\\Desktop\\stocksss.csv'
spot_price_10 = pd.read_csv(spot_data)
print(spot_price_10)
df['twoM'] = np.nan
for idx, newsDate in enumerate(df['GMT'].values):
    sTime = df['GMT'][idx]
    sTime = sTime.replace(second=0, microsecond=0)
    try:
        t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime),2]
        df['twoM'][idx] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100)
    except:
        pass
print(df)
However, the program is not able to return the twoM price change values.
I assume that you got a SettingWithCopyWarning because you are trying to write through chained indexing. With two sets of [] (one for the column, one for the row), the assignment may land on a temporary copy rather than on df itself, so treat that form as read-only. Use loc or iloc to write a value:
...
    try:
        t0 = spot_price_10.iloc[spot_price_10.index.get_loc(sTime),2]
        df.loc[idx,'twoM'] = ((spot_price_10.iloc[spot_price_10.index.get_loc((sTime + datetime.timedelta(minutes=10))),3]/(t0)-1)*100)
    except:
        pass
...
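A self-contained sketch of the same pattern may help to test the idea in isolation. It makes two assumptions that are not in the question: the price table is indexed by timestamp (read_csv without index_col produces a plain integer index, on which get_loc with a timestamp raises KeyError), and only KeyError is caught, because a bare `except: pass` would silently hide exactly that failure. The frames and the column name 'close' are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the news and price tables
news = pd.DataFrame({
    "GMT": pd.to_datetime(["2020-01-02 09:30:45", "2020-01-02 15:00:10"])
})
prices = pd.DataFrame(
    {"close": [100.0, 101.0, 50.0]},
    index=pd.to_datetime(
        ["2020-01-02 09:30", "2020-01-02 09:32", "2020-01-02 15:00"]
    ),
)

news["twoM"] = np.nan
for idx in news.index:
    t0 = news.loc[idx, "GMT"].floor("min")   # drop seconds and below
    t1 = t0 + pd.Timedelta(minutes=2)
    try:
        base = prices.loc[t0, "close"]
        later = prices.loc[t1, "close"]
    except KeyError:
        continue  # no price at one of the two timestamps
    # A single .loc call writes into news itself, not into a copy
    news.loc[idx, "twoM"] = (later / base - 1) * 100

print(news)
```

The second news item stays NaN here because there is no price two minutes after 15:00, which is the kind of case worth counting rather than silently skipping.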

How do I import an Excel file to Pandas if the file's last modified date is today?

So I'm trying to write a Python script that imports an excel file into a Pandas dataframe (condition being if the file was modified on a certain date, ex. today):
import pandas as pd
import glob
import os
import datetime
def report():
    for x in glob.glob("../*.xlsx"):
        modx = os.path.getmtime(x)
        xmod = datetime.datetime.fromtimestamp(modx)
    if datetime.datetime.today() == xmod:
        return x
I've considered importing the excel file right from the function:
if datetime.datetime.today() == xmod:
    df = pd.read_excel(x)
    return df
An attempt to modify the dataframe (after attempting to import) yields this:
File "<ipython-input-56-e6fa18118137>", line 1, in <module>
df = pd.read_excel(report())
File "..\excel.py", line 151, in read_excel
return ExcelFile(io, engine=engine).parse(sheetname=sheetname, **kwds)
File "..\excel.py", line 196, in __init__
raise ValueError('Must explicitly set engine if not passing in'
ValueError: Must explicitly set engine if not passing in buffer or path for io.
Couldn't dig up much on that. I'm not even sure if I need to define a function.
Do I import into a Pandas DataFrame directly from the function, or keep it separate?
How would I go about setting the engine?
First off, compare dates rather than full datetimes: use datetime.today().date() and xmod.date(). Two full timestamps will practically never be equal down to the microsecond.
Second, that glob loop is kind of confusing. Are you sure you want the for and if to be at the same indentation level? Or should the if be inside the for loop? Either way, that needs to be fixed.
Also, try not passing the function call straight in. Explicitly pass either a path or a buffer, as the error implies (report() returns None whenever no file matches, which is neither). Otherwise, pandas will ask you which engine to use, which is more work than it's worth for what you're doing.
Beyond that, just follow general Python best practices.
In [1]: import pandas as pd
   ...: from datetime import datetime
   ...: import os

In [2]: def load_if_modified_today(xls):
   ...:     modx = os.path.getmtime(xls)
   ...:     xmod = datetime.fromtimestamp(modx)
   ...:     if datetime.today().date() == xmod.date():
   ...:         df = pd.read_excel(xls)
   ...:         return df

In [3]: e = 'trades.xlsx'
   ...: df = load_if_modified_today(e)
   ...: df.head()
Out[3]:
           System Name  Symbol  Unit  Position  Entry Date  Entry Time  \
0  TB Turtle FX Majors  EURUSD     1     Short  2014-12-02         Day
1  TB Turtle FX Majors  EURUSD     1     Short  2014-12-05         Day
2  TB Turtle FX Majors  EURUSD     1      Long  2014-12-10         Day
3  TB Turtle FX Majors  EURUSD     1      Long  2014-12-16         Day
4  TB Turtle FX Majors  EURUSD     1     Short  2014-12-17         Day

   Entry Order Price  Stop Price   Risk%  Quantity  Entry Fill
0             1.2420      1.2428  0.0024         6      1.2418
1             1.2284      1.2297  0.0022         4      1.2283
2             1.2447      1.2434  0.0017         3      1.2448
3             1.2496      1.2486  0.0026         5      1.2499
4             1.2385      1.2397  0.0022         4      1.2383
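To combine this with the glob loop from the question, one option is to keep file selection and loading separate: collect the paths modified today first, then decide what to pass to read_excel. A minimal sketch, in which the function name files_modified_today is made up for illustration:

```python
import glob
import os
from datetime import date, datetime

def files_modified_today(pattern):
    """Return the paths matching pattern whose modification date is today."""
    today = date.today()
    return [
        path
        for path in glob.glob(pattern)
        # Compare calendar dates only, not full timestamps
        if datetime.fromtimestamp(os.path.getmtime(path)).date() == today
    ]

paths = files_modified_today("../*.xlsx")
print(paths)
```

Each entry in paths can then be passed to pd.read_excel. Because the function returns a list, an empty result is easy to check for before anything reaches read_excel, which avoids handing it None.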
I came across same ValueError: Must explicitly set engine if not passing in buffer or path for io. After I altered my code to:
data = pd.read_excel(R"D:\\file.xlsx")
It worked.
I ran across a similar issue, and it turned out that different Python installations were conflicting with each other. I removed one of them (Anaconda), kept and updated plain Python 2.7, and all was fixed.
