using Pandas to read in excel file from URL - XLRDError - python

I am trying to read in excel files to Pandas from the following URLs:
url1 = 'https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls'
url2 = 'https://cib.societegenerale.com/fileadmin/indices_feeds/STTI_Historical.xls'
using the code:
pd.read_excel(url1)
However it doesn't work and I get the error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '2000/01/'
After searching on Google it seems that sometimes .xls files offered through URLs are actually held in a different file format behind the scenes such as html or xml.
When I manually download the Excel file and open it in Excel, I get this error message: "The file format and extension don't match. The file could be corrupted or unsafe. Unless you trust its source, don't open it."
When I do open it, it appears just like a normal excel file.
I came across a post online that suggested I open the file in a text editor to see if there is any additional info held as to proper file format but I don't see any additional info when opened using notepad++.
Could someone please help me read this "xls" file into a pandas DataFrame properly?

It seems you can use read_csv:
import pandas as pd
df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls',
sep='\t',
parse_dates=[0],
names=['a','b','c','d','e','f'])
print(df)
Then I check whether the last column f contains anything other than NaN:
print(df[df.f.notnull()])
Empty DataFrame
Columns: [a, b, c, d, e, f]
Index: []
The column contains only NaN, so you can drop the last column f with the usecols parameter:
import pandas as pd
df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls',
sep='\t',
parse_dates=[0],
names=['a','b','c','d','e','f'],
usecols=['a','b','c','d','e'])
print(df)
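Before choosing between read_excel and read_csv, it can help to look at the first bytes of the download. The sketch below is a rough heuristic and not part of the original answer: sniff_spreadsheet is a hypothetical helper name, and it only recognises the OLE2 signature of genuine binary .xls files, the zip signature of .xlsx files, and a few obvious text markers.

```python
# Hypothetical helper: guess what a downloaded "Excel" file really contains.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # signature of a genuine binary .xls
ZIP_MAGIC = b"PK\x03\x04"                          # .xlsx files are zip archives


def sniff_spreadsheet(head: bytes) -> str:
    """Classify leading bytes as 'xls', 'xlsx', 'markup', 'tsv' or 'unknown'."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    text = head.lstrip().lower()
    if text.startswith((b"<html", b"<!doctype", b"<?xml", b"<table")):
        return "markup"  # HTML or XML saved with an .xls extension
    if b"\t" in head:
        return "tsv"     # tab-separated text, a candidate for read_csv(sep='\t')
    return "unknown"


# The error in the question shows the file starting with '2000/01/',
# so a made-up first line in that style is used here for illustration.
sample = b"2000/01/03\t1000.0\t1000.0\n"
print(sniff_spreadsheet(sample))
```

In practice the bytes would come from the first few hundred bytes of the downloaded file. A result of 'tsv' corresponds to the read_csv approach used in this answer.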

If this helps someone: you can read an Excel file from Google Drive directly by URL into pandas without any login requirement. I tried it in Google Colab and it worked.
Upload an Excel file to Google Drive, or use one that is already uploaded.
Share the file with "Anyone with the link" (I don't know whether view-only access works; I tried with full access).
Copy the Link
You will get something like this.
share url: https://drive.google.com/file/d/---some--long--string/view?usp=sharing
Get the download url from attempting to download the file (copy the url from there)
It will be something like this: (it has got the same google file id as above)
download url: https://drive.google.com/u/0/uc?id=---some--long--string&export=download
Now go to Google Colab and paste the following code:
import pandas as pd
fileurl = r'https://drive.google.com/file/d/---some--long--string/view?usp=sharing'
filedlurl = r'https://drive.google.com/u/0/uc?id=---some--long--string&export=download'
df = pd.read_excel(filedlurl)
df
That's it.. the file is in your df.
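The manual step of copying the download URL can probably be replaced by rewriting the share link. This is only a sketch: drive_download_url is a hypothetical helper, the download URL pattern is copied from the example above, and Google may change it at any time.

```python
import re


def drive_download_url(share_url: str) -> str:
    """Turn a Drive share link into the download form shown in this answer."""
    match = re.search(r"/file/d/([^/?]+)", share_url)
    if match is None:
        raise ValueError("no file id found in: " + share_url)
    file_id = match.group(1)
    # URL pattern taken from the answer above; treat it as an assumption.
    return f"https://drive.google.com/u/0/uc?id={file_id}&export=download"


# 'FILE_ID_PLACEHOLDER' is a made-up id used only for illustration.
share = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
print(drive_download_url(share))
```

The returned string can then be passed to pd.read_excel in place of filedlurl.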

Related

How do you download urls of pdfs online and save it using python?

I have an excel file out of which in one of the columns urls are present. These URLs are basically PDF Files. Is there a way to download these PDF files without opening the links using python and save it in a specified folder on your machine?
I can think of a way to open the Excel file, something like pd.read_excel('excel_name.xlsx', usecols=['pdf links']), but I have no idea how to download the files one after another.
Some of the PDF links are also duplicated. In that case, how do I make sure only the first reported link is kept? Could I, for instance, use df.drop_duplicates()?
Here is a sample excel file: https://i.stack.imgur.com/DuPSs.png
Please help!!
Here's what I've tried:
import dload
import pandas as pd
import requests
df = pd.read_excel('examples.xlsx', sheet_name='Reports and URL')
#print(df['URL'])
df1 = df['URL'].to_numpy()
print(df1)
for urls in df1:
    pdfs = dload.save(urls, 'F:/technophile/proj/')
    print(pdfs)
print('saved!')
and this is the error I get: requests.exceptions.SSLError: HTTPSConnectionPool(host='www.incitecpivot.com.au', port=443)
EDIT 2: I removed the for loop and used dload.save_multi(), but it still gave the same error.
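One possible way to download the files sequentially while keeping only the first occurrence of each link is sketched below. It uses only the standard library instead of dload; the names unique_urls, fetch_bytes and download_all are hypothetical, and the sketch does not fix the SSL error itself, it only records per-URL failures so that one bad host does not stop the loop.

```python
from pathlib import Path
from urllib.parse import urlparse
import urllib.request


def unique_urls(urls):
    """Return URLs in their original order, keeping the first occurrence of each."""
    return list(dict.fromkeys(urls))


def fetch_bytes(url, timeout=30):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, dest_dir, fetch=fetch_bytes):
    """Save each unique URL into dest_dir, one after another."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for i, url in enumerate(unique_urls(urls)):
        # Use the last path segment as the file name, with a fallback.
        name = Path(urlparse(url).path).name or f"file_{i}.pdf"
        try:
            (dest / name).write_bytes(fetch(url))
            saved.append(name)
        except Exception as exc:  # e.g. an SSL error on one host
            failed.append((url, str(exc)))
    return saved, failed
```

With the DataFrame from the question, the call could look like download_all(df['URL'].dropna(), 'F:/technophile/proj/'). Note that two different URLs ending in the same file name would overwrite each other in this sketch.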

Reading XLSB (binary) file with Pandas read_excel using pyxlsb reads empty rows for some xlsb file

I'm trying to read binary Excel files using read_excel method in pandas with pyxlsb engine as below:
import pandas as pd
df = pd.read_excel('test.xlsb', engine='pyxlsb')
If the xlsb file is like this file (Right now, I'm sharing this file via WeTransfer, but if there is a better way to share files on StackOverflow, let me know), the returned dataframe is filled with NaN's. I suspected that it might be because the file was saved with active cell pointing at the empty cells after the data originally. So I tried this:
import pandas as pd
with open('test.xlsb', 'rb') as data:
    data.seek(0,0)
    df = pd.read_excel(data, engine='pyxlsb')
but it still doesn't seem to work. I also tried reading the data from byte number 0 (from the beginning), writing it into a new file, 'test_1.xlsb', and finally reading it with pandas, but that doesn't work.
with open('test.xlsb','rb') as data:
    data.seek(0,0)
    with open('test_1.xlsb','wb') as outfile:
        outfile.write(data.read())
df = pd.read_excel('test_1.xlsb', engine='pyxlsb')
If anyone has suggestion as to what might be going on and how to resolve it, I'd greatly appreciate the help.

I am trying to upload a csv file onto Python (Azure) but am running into file IO Error does not exist

My code is:
import pandas as pd
df = pd.read_csv('Project_Wind_Data.csv', usecols=['U100', 'V100'])
with open('Project_Wind_Data.csv', "r") as csvfile:
    ...  # the body of this block is not shown in the original post
I am trying to access certain columns within the CSV file. I receive an error message saying that the data file does not exist.
My data is in the following form:
This must be a trivial issue, but help would be much appreciated.
If your CSV file is in the same working directory as your .py script, you can read it directly:
import pandas as pd
df = pd.read_csv('Project_Wind_Data.csv', usecols=['U100', 'V100'])
If the file is in another directory, replace 'Project_Wind_Data.csv' with the full path to the file like c:User/Documents/file.txt
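To find out where Python is actually looking, one option is to resolve the path explicitly before reading. This is a minimal sketch; resolve_data_file is a hypothetical helper, not something pandas provides.

```python
from pathlib import Path


def resolve_data_file(name, base_dir=None):
    """Return the absolute path of a data file, or raise with a helpful message."""
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / name).resolve()
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} does not exist (current working directory: {Path.cwd()})"
        )
    return path
```

The result can be passed straight to pandas, for example pd.read_csv(resolve_data_file('Project_Wind_Data.csv'), usecols=['U100', 'V100']). If the file is missing, the error message shows both the path that was tried and the current working directory.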

Error (little-endian) reading a XLS file with python

I downloaded an XLS file from the web using selenium.
I tried many options I found on Stack Overflow and other websites to read the XLS file:
import pandas as pd
df = pd.read_excel('test.xls') # Read XLS file
Expected "little-endian" marker, found b'\xff\xfe'
And
df = pd.ExcelFile('test.xls').parse('Sheet1') # Read XLS file
Expected "little-endian" marker, found b'\xff\xfe'
And again
from xlrd import open_workbook
book = open_workbook('test.xls')
CompDocError: Expected "little-endian" marker, found b'\xff\xfe'
I have tried different encodings: utf-8, ANSI, utf_16_be, utf16
I have even tried to get the encoding of the file from notepad or other applications.
Type of file : Microsoft Excel 97-2003 Worksheet (.xls)
I can open the file with Excel without any issue.
What's frustrating is that if I open the file with Excel and just press save, I can then read the file with any of the previous Python commands.
I would be really grateful if someone could provide me other ideas I could try. I need to open this file with a python script only.
Thanks,
Max
Solution (somewhat messy but simple) that could potentially work for any type of Excel file:
I drive Excel from Python to open and save the file. Excel "cleans up" the file, and Python is then able to read it with any of the Excel-reading functions.
Solution inspired by the comments from @Serge Ballesta and @John Y.
## Open a file in Excel and save it to correct the encoding error
import win32com.client  # requires pywin32 and a local Excel installation
import pandas
downloadpath="c:\\firefox_downloads\\"
filename="myfile.xls"
xl=win32com.client.Dispatch("Excel.Application")
xl.Application.DisplayAlerts = False # disables Excel pop up message (for saving the file)
wb = xl.Workbooks.Open(Filename=downloadpath+filename)
wb.SaveAs(downloadpath+filename)
wb.Close()  # the parentheses are needed, otherwise the method is never called
xl.Application.DisplayAlerts = True # enables Excel pop up message for saving the file
df = pandas.ExcelFile(downloadpath+filename).parse('Sheet1') # Read XLS file
Thank you all!
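To see what the reader is objecting to, the file header can be inspected directly. The sketch below assumes the standard OLE2 compound file layout, where the 8-byte signature is followed by a byte-order field at offset 28 that is normally stored as the bytes FE FF; describe_ole2_header is a hypothetical helper and it only diagnoses, it does not repair the file.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # compound file signature


def describe_ole2_header(header: bytes) -> str:
    """Report whether the byte-order field of an OLE2 header looks standard."""
    if not header.startswith(OLE2_MAGIC):
        return "not an OLE2 file"
    if len(header) < 30:
        return "header too short"
    marker = header[28:30]
    if marker == b"\xfe\xff":
        return "byte-order marker looks standard"
    return f"unexpected byte-order marker {marker!r}"


# Made-up header reproducing the situation from the error message:
# a valid signature, 20 filler bytes, then FF FE instead of FE FF.
fake_header = OLE2_MAGIC + bytes(20) + b"\xff\xfe"
print(describe_ole2_header(fake_header))
```

For a real file the header would be read with open('test.xls', 'rb').read(30). Running the check before and after the Excel round trip shows whether saving the file changed these two bytes.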
What does pd mean?? What
pandas is made for data science. In my opinion, you have to use openpyxl (read and write only xlsx) or xlwt/xlrd (read xls... and write only xls).
from xlrd import open_workbook
book = open_workbook(<math file>)
sheet =....
There are several examples of this on the Internet.

How to read this kind of .csv file (html style content) by Python2.7?

I am practicing on a GitHub machine learning contest using Python. I started from someone else's submission, but I am stuck at the first step, using pandas to read the CSV file:
import pandas as pd
import numpy as np
filename = './facies_vectors.csv'
training_data = pd.read_csv(filename)
print(set(training_data["Well Name"]))
training_data.head()
This gave me the following error message:
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 104, saw 3
I do not understand why the .csv file describes itself with an HTML DOCTYPE. Please help.
Representative segments of the CSV content are attached. Thanks.
It turns out I had downloaded the CSV file the way one usually does on the web: right click and "save as", which saved the HTML page instead of the data. The right way is to open the item on GitHub and then open it from GitHub Desktop. I have the tables now. Working with HTML files from Python is definitely something I would like to learn more about. Thanks.
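Another option, not mentioned in the post, is to request the raw file instead of the HTML page that GitHub renders around it. The sketch below rewrites a GitHub "blob" page URL into the corresponding raw.githubusercontent.com URL; github_raw_url is a hypothetical helper, the owner and repository names in the example are made up, and branch names containing a slash are not handled.

```python
from urllib.parse import urlparse


def github_raw_url(blob_url: str) -> str:
    """Convert a GitHub blob page URL into the raw file URL."""
    parts = urlparse(blob_url).path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<branch>/<path...>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError("not a GitHub blob URL: " + blob_url)
    owner, repo, _, branch, *file_parts = parts
    return "https://raw.githubusercontent.com/" + "/".join(
        [owner, repo, branch, *file_parts]
    )


# 'someuser/somerepo' is a placeholder, not the real contest repository.
page = "https://github.com/someuser/somerepo/blob/master/data/facies_vectors.csv"
print(github_raw_url(page))
```

The resulting URL can be passed to pd.read_csv directly, which avoids saving an HTML page under a .csv name.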
