Reading CSV using Pandas - Python

I am attempting to read the following CSV so I can process it further, but I am getting a pandas.errors.ParserError. I would really appreciate any help on how I can read it. Can you help me identify what I am doing wrong?
My code:
import pandas as pd
logic_df = pd.read_csv("http://www.sharecsv.com/s/6c1b912f54d87d45f4728f8fb1510a5eb/random.csv")
I am not sure if there is something wrong with my CSV, because I ran it through a CSV lint and it said the file is fine, so I am not sure what the issue is.
I also tried to do the following
logic_df = pd.read_csv("http://www.sharecsv.com/s/6cb912f54d87d45f4728f81fb1510a5eb/random.csv", error_bad_lines=False)
with no luck.

Changing the url to the direct link of the table should work:
df = pd.read_csv("http://www.sharecsv.com/dl/6cb912f54d87d45f4728f8fb1510a5eb/random.csv")
The thing is, your URL points to an HTML page, not a CSV file per se. You can either use the URL above, or read your URL's source with pd.read_html, like this:
df = pd.read_html('http://www.sharecsv.com/s/6cb912f54d87d45f4728f8fb1510a5eb/random.csv', header=0)[0]
Hope it helps!
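Side note: error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0. If you need to skip malformed rows on a newer pandas, the replacement option is on_bad_lines. A minimal sketch, reusing the direct-download URL from above (which may no longer resolve):
import pandas as pd

# on_bad_lines='skip' replaces the removed error_bad_lines=False on pandas >= 1.3
df = pd.read_csv("http://www.sharecsv.com/dl/6cb912f54d87d45f4728f8fb1510a5eb/random.csv",
                 on_bad_lines='skip')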

Related

tabula 'pages' argument not specified, pages='all'

I'm trying to extract some data from a PDF file using tabula.
The issue I'm facing is that it's only extracting from one page, even though the pages argument is specified.
Not too sure what's going on; any insight would be greatly appreciated!
The code:
import tabula
tables = tabula.read_pdf("testfile.pdf", pages='all')
tabula.convert_into("testfile.pdf", "test_file_tables.csv")
THANK YOU!
After looking at the documentation I realised I forgot to specify the pages argument in the .convert_into call.
The correct code is:
tables = tabula.read_pdf("testfile.pdf", pages='all')
tabula.convert_into("testfile.pdf", "test_file_tables.csv", pages='all')
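Alternatively, since tabula.read_pdf already returns a list of DataFrames (one per detected table), you could skip convert_into and write the CSV with pandas yourself. A rough sketch along those lines, assuming the extracted tables share the same columns:
import pandas as pd
import tabula

# one DataFrame per table detected across all pages
tables = tabula.read_pdf("testfile.pdf", pages='all')

# stack them and write a single CSV
pd.concat(tables, ignore_index=True).to_csv("test_file_tables.csv", index=False)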

How to download xlsx file from URL and save in data frame via python

I would like the following code to download the xlsx file from the URL and save it to my drive.
I receive this error:
AttributeError: 'str' object has no attribute 'content'
Below is the code:
import requests
import xlrd
import pandas as pd
filed = 'https://www.icicipruamc.com/downloads/others/monthly-portfolio-disclosures/monthly-portfolio-disclosure-november19/Arbitrage.xlsx'
resp = requests.get(filed)
workbook = xlrd.open_workbook(file_contents = filed.content)
worksheet = workbook.sheet_by_index(0)
first_row = worksheet.row(0)
df = pd.DataFrame(first_row)
pandas already has a function that converts Excel directly into a pandas DataFrame (using xlrd):
import pandas as pd

MY_EXCEL_URL = "www.yes.com/xl.xlsx"
xl_df = pd.read_excel(MY_EXCEL_URL,
                      sheet_name='my_sheet',
                      skiprows=range(5),
                      skipfooter=0)
then you can handle/save the file using pd.DataFrame.to_excel.
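For example, a one-line sketch of writing the result back out (output.xlsx is just a placeholder name):
# drop the pandas index column when saving
xl_df.to_excel('output.xlsx', index=False)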
This function works; I tested the individual components. The ICICI website you linked seems to give me a 404, so make sure the website works and actually serves an Excel sheet before trying this out.
import requests
import pandas as pd

def excel_to_pandas(URL, local_path):
    # download the file and write the raw bytes to disk
    resp = requests.get(URL)
    with open(local_path, 'wb') as output:
        output.write(resp.content)
    # then let pandas parse the saved workbook
    df = pd.read_excel(local_path)
    return df

print(excel_to_pandas("www.websiteforxls.com", '~/Desktop/my_downloaded.xls'))
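If you would rather not touch the disk at all, a variant of the same idea is to hand the downloaded bytes to pandas through io.BytesIO. This also shows the fix for the original AttributeError, which came from calling .content on the URL string (filed) instead of on the response object (resp). A sketch, with a placeholder URL that you would need to swap for one that really serves an Excel file:
import io
import requests
import pandas as pd

URL = "https://example.com/some.xlsx"   # placeholder; the ICICI link above returned a 404
resp = requests.get(URL)
resp.raise_for_status()                 # fail early if the download didn't work

# .content lives on the response object, not on the URL string
df = pd.read_excel(io.BytesIO(resp.content))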
As a footnote, this was super simple, and I'm disappointed you couldn't do this on your own. I might not have been able to do this 5 years ago, and that's why I decided to help.
If you want to code, learn the basics, literally the basics: classes, functions, variables, types, OOP principles. That's all you need to start. Then you need to learn how to search, and how to make different components work together the way you require them to. And with SO, if you show some effort, we are happy to help. We are a community, not a place to solve your homework. Try harder next time.

How to Read a WebPage with Python and write to a flat file?

Very novice at Python here.
Trying to read the table presented at this page (w/ the current filters set as is) and then write it to a csv file.
http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB
I tried this next approach. It creates the csv file but does not fill it w/ the actual table contents.
Appreciate any help in advance. Thanks.
import requests
import pandas as pd
url = 'http://www65.myfantasyleague.com/2017/optionsL=47579&O=243&TEAM=DAL&POS=RB'
csv_file='DAL.RB.csv'
pd.read_html(requests.get(url).content)[-1].to_csv(csv_file)
Generally, try to describe your problem more precisely, try to debug, and don't put everything in one line. With that said, your specific problems here were the table index and the missing ? in the URL (after options):
import requests
import pandas as pd

url = 'http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB'
# ^ note the '?' after 'options', which was missing in your URL
csv_file = 'DAL.RB.csv'
pd.read_html(requests.get(url).content)[1].to_csv(csv_file)
# ^ index 1 selects the table that actually holds the data
This yields a CSV file with the table in it.
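If you are unsure which index to use, a quick way to find it is to look at how many tables read_html returns and at their shapes before picking one. A small sketch (the 2017 page may have changed since):
import requests
import pandas as pd

url = 'http://www65.myfantasyleague.com/2017/options?L=47579&O=243&TEAM=DAL&POS=RB'
tables = pd.read_html(requests.get(url).content)

# print every table's position and shape, then pick the index you need
for i, t in enumerate(tables):
    print(i, t.shape)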

using Pandas to read in excel file from URL - XLRDError

I am trying to read Excel files into pandas from the following URLs:
url1 = 'https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls'
url2 = 'https://cib.societegenerale.com/fileadmin/indices_feeds/STTI_Historical.xls'
using the code:
pd.read_excel(url1)
However it doesn't work and I get the error:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '2000/01/'
After searching on Google, it seems that .xls files offered through URLs are sometimes actually held in a different format behind the scenes, such as HTML or XML.
When I manually download the file and open it in Excel, I get an error message: "The file format and extension don't match. The file could be corrupted or unsafe. Unless you trust its source, don't open it."
When I do open it anyway, it looks just like a normal Excel file.
I came across a post online that suggested opening the file in a text editor to check for additional clues about the proper file format, but I don't see anything when I open it in Notepad++.
Could someone please help me read this "xls" file into a pandas DataFrame properly?
It seems you can use read_csv:
import pandas as pd
df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls',
                 sep='\t',
                 parse_dates=[0],
                 names=['a','b','c','d','e','f'])
print(df)
Then I check the last column f for any values other than NaN:
print(df[df.f.notnull()])

Empty DataFrame
Columns: [a, b, c, d, e, f]
Index: []
So it contains only NaN values, and you can drop the last column f with the usecols parameter:
import pandas as pd

df = pd.read_csv('https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls',
                 sep='\t',
                 parse_dates=[0],
                 names=['a','b','c','d','e','f'],
                 usecols=['a','b','c','d','e'])
print(df)
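To see for yourself why read_csv works here, you can peek at the first bytes the server returns: a genuine .xls file starts with a binary OLE/BOF header, while this one starts with plain text ('2000/01/...', as the error message showed). A small check along those lines, assuming the file is still served at that URL:
import requests

url = 'https://cib.societegenerale.com/fileadmin/indices_feeds/CTA_Historical.xls'
resp = requests.get(url)

# print the first bytes; tab-separated text here, not a binary Excel header
print(resp.content[:80])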
In case this helps someone: you can read an Excel file from Google Drive directly by URL, without any login requirements. I tried it in Google Colab and it worked.
1. Upload an Excel file to Google Drive, or use one that is already uploaded.
2. Share the file with "Anyone with the link" (I don't know if view-only works, but I tried with full access).
3. Copy the link. You will get something like this:
share url: https://drive.google.com/file/d/---some--long--string/view?usp=sharing
4. Get the download URL by attempting to download the file (copy the URL from there). It will be something like this (it has the same Google file id as above):
download url: https://drive.google.com/u/0/uc?id=---some--long--string&export=download
Now go to Google Colab and paste the following code:
import pandas as pd
fileurl = r'https://drive.google.com/file/d/---some--long--string/view?usp=sharing'
filedlurl = r'https://drive.google.com/u/0/uc?id=---some--long--string&export=download'
df = pd.read_excel(filedlurl)
df
That's it; the file is in your df.
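If you would rather not dig the download URL out of the browser, you can usually build it from the share URL yourself: the long string between /d/ and /view is the file id, and https://drive.google.com/uc?export=download&id=<file id> is the matching direct-download form (a common pattern, though Google could change it). A rough sketch:
import pandas as pd

fileurl = r'https://drive.google.com/file/d/---some--long--string/view?usp=sharing'

# pull the file id out of the share URL and build the direct-download URL from it
file_id = fileurl.split('/d/')[1].split('/')[0]
filedlurl = f'https://drive.google.com/uc?export=download&id={file_id}'

df = pd.read_excel(filedlurl)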

how to read a data file including "pandas.core.frame, numpy.core.multiarray"

I came across a .df file which is in a binary format, but when I open it in Vim I can still see strings like "pandas.core.frame" and "numpy.core.multiarray", so I guess it is related to Python. However, I know little about the Python language. Though I have tried the pandas and numpy modules, I failed to read the file. Could you give any suggestions on this issue? Thank you in advance. Here is the Dropbox link to the DF file: https://www.dropbox.com/s/b22lez3xysvzj7q/flux.df
Looks like a DataFrame stored with pickle; use read_pickle() to read it:
import pandas as pd
df = pd.read_pickle('flux.df')
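Once loaded it is an ordinary DataFrame, so you can inspect it as usual. One caveat: unpickling can execute arbitrary code, so only read pickle files you trust. A short follow-up sketch:
import pandas as pd

df = pd.read_pickle('flux.df')

# quick look at what the frame contains
print(df.head())
print(df.dtypes)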
