Scrape data in CSV format without downloading the CSV file - Python

I am trying to scrape data (using Python) from a Google Sheet, in CSV format, without downloading the CSV file. My aim is to update a MySQL database after getting the information from Google Sheets.
Downloading the CSV file would mean downloading it again and again, whenever the Google Sheet gets updated.
I have tried csv.reader(open('')) with the URL of the Google Sheet in CSV format (the link we get after publishing to the web in CSV format).
But it reports an invalid argument.
Can someone help me with this?
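One likely cause: open() expects a local file path, not a URL, which is why passing the published link fails. A minimal sketch of fetching and parsing the published CSV entirely in memory, assuming the requests package and a hypothetical published-sheet URL:

```python
import csv
import io

import requests

# The "publish to the web" CSV link for the sheet -- a hypothetical URL,
# substitute your own sheet's published link.
CSV_URL = "https://docs.google.com/spreadsheets/d/e/<sheet-id>/pub?output=csv"

response = requests.get(CSV_URL)
response.raise_for_status()

# Parse the CSV from memory -- nothing is written to disk.
reader = csv.reader(io.StringIO(response.text))
for row in reader:
    print(row)  # each row is a list of strings, ready for a MySQL INSERT
```

Each parsed row can then be fed into MySQL with a connector such as mysql.connector or PyMySQL; re-running the script always fetches the sheet's current state.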

Related

Creating an html report with columns and rows from csv

I have a question for you: I'm working on a new Jenkins instance, and as a result of the job I get a CSV file with the errors, if there were any during the test. I would like to generate an HTML report based on this CSV file, which would be more convenient than opening Excel and loading the CSV file to see the errors. I came across the HTML Publisher plugin, but unfortunately I don't know if it supports generating HTML reports based on CSV files. Alternatively, you could do something like this with a Python script and show the resulting HTML file in the artifacts. Do you have any ideas?
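A minimal sketch of such a script, assuming hypothetical file names (errors.csv produced by the job, report.html as the output):

```python
import csv
import html

# Hypothetical paths -- adjust to what your Jenkins job actually produces.
CSV_PATH = "errors.csv"
HTML_PATH = "report.html"

with open(CSV_PATH, newline="") as f:
    rows = list(csv.reader(f))

with open(HTML_PATH, "w") as out:
    out.write("<html><body><table border='1'>\n")
    for i, row in enumerate(rows):
        tag = "th" if i == 0 else "td"  # treat the first CSV row as the header
        cells = "".join(f"<{tag}>{html.escape(cell)}</{tag}>" for cell in row)
        out.write(f"<tr>{cells}</tr>\n")
    out.write("</table></body></html>\n")
```

The resulting report.html can be archived as a build artifact, or published with the HTML Publisher plugin, which can publish HTML files from the workspace.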

Automation using Python

Is it possible to automate extracting particular data (numbers) from a scanned PDF file into an Excel file?
Currently we need to go page by page to look for the particular data and then manually type it into an Excel sheet.
Thanks
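Scanned pages are images, so this needs OCR. A rough sketch of one possible pipeline, assuming Tesseract OCR and Poppler are installed and using the pdf2image, pytesseract, and openpyxl packages (the file names and the number pattern are placeholders):

```python
import re

import pytesseract
from openpyxl import Workbook
from pdf2image import convert_from_path

pages = convert_from_path("scanned.pdf")  # renders one PIL image per page

wb = Workbook()
ws = wb.active
ws.append(["page", "number"])

for page_no, image in enumerate(pages, start=1):
    text = pytesseract.image_to_string(image)  # OCR the page image
    # Grab anything that looks like a number; tighten this pattern to
    # match the specific figures you are looking for.
    for match in re.findall(r"\d+(?:\.\d+)?", text):
        ws.append([page_no, match])

wb.save("extracted.xlsx")
```

OCR accuracy depends heavily on scan quality, so spot-check the output against the source pages before trusting it.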

Accessing an Excel file in a .dtsx that was generated through the Xlwt library (Python) raises error CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER

I have a web scraper written in Python, fetching raw data from the HTML of a page and writing it onto a 97-2003 Workbook Excel file, using the Xlwt library. I then have a .dtsx file with some tasks, where one of them is an Excel Source task to fetch data from an Excel file. Later down the road, that data is inserted into a SQL Server table.
If I try to access my newly-generated Excel file with said task, I get an OLE DB error
External table is not in the expected format
And I cannot run my .dtsx. However, if I manually access the Excel file through File Explorer, open it and close it again (I don't even need to save it), suddenly my SSIS task works without a problem, fetching all the columns and all the info. What could possibly be causing this behavior?
External table is not in the expected format
The error above happens when the Excel file is corrupted and cannot be opened by the Access Database Engine (OLE DB provider), even if you can still open the file in Excel.
In general, the solution is to open the Excel file manually, which will auto-repair it. If the process is repeated many times, you can automate opening and repairing the Excel file with a C# script using the Interop.Excel library.
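Since the scraper in the question is written in Python, the same open-and-resave repair could be automated with pywin32 instead of C# (my substitution, not the answer's exact approach; requires Windows with Excel installed, and the path below is hypothetical):

```python
# Mimic the manual open/close that repairs the file, via Excel's COM API.
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
excel.DisplayAlerts = False  # suppress repair/compatibility prompts

wb = excel.Workbooks.Open(r"C:\data\scraped.xls")  # hypothetical path
wb.Save()   # re-saving writes a well-formed file the OLE DB provider accepts
wb.Close()
excel.Quit()
```

Running this between the scraping step and the SSIS package should leave the file in a state the Excel Source task can read.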
Additional Information
What .xlsx file format is this?
Getting "External table is not in the expected format." error while trying to import an Excel File in SSIS

BigQuery: loading excel file

Is there any way to load an Excel file directly into BigQuery, instead of converting it to CSV?
I get the files every day in Excel format and need to load them into BigQuery. Right now I am converting to CSV manually and loading it into BigQuery.
I am planning to schedule the job.
If it is not possible to load the Excel files directly into BigQuery, then I need to write a process (Python) to convert them to CSV before loading into BigQuery.
Please let me know if there are any better options.
Thanks,
I think you could achieve the above in a few clicks, without any code.
You need to use Google Drive and external (federated) tables.
1) Upload your Excel files to Google Drive manually, or synchronise them.
2) In the Google Drive settings, find
"**Convert uploads** [x] Convert uploaded files to Google Docs editor format"
and check it.
To access this option go to https://drive.google.com/drive/my-drive, click on the gear icon and then choose Settings.
Now your Excel files will be accessible by BigQuery.
3) Last part: https://cloud.google.com/bigquery/external-data-drive
You can reference the converted file by its Drive URI (https://cloud.google.com/bigquery/external-data-drive#drive-uri) and then create the table manually using that URI.
You could also do the last step through the API, as sketched below.
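A sketch with the google-cloud-bigquery client library, assuming credentials that carry both the BigQuery and Drive scopes, and hypothetical project, dataset, table, and file IDs:

```python
from google.cloud import bigquery

# Client credentials must include the Drive scope for Drive-backed tables.
client = bigquery.Client()

# The Drive upload was converted to a Google Sheet, so use GOOGLE_SHEETS.
external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = [
    "https://docs.google.com/spreadsheets/d/<file-id>"  # hypothetical Drive URI
]
external_config.options.skip_leading_rows = 1  # skip the header row

table = bigquery.Table("my-project.my_dataset.my_table")  # hypothetical IDs
table.external_data_configuration = external_config
client.create_table(table)
```

Because the table is federated, queries read the sheet live; replacing the file in Drive should update query results without any reload.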

How to make XLRD read hyperlinks in XLSX cells?

This is not a duplicate, although the issue has been raised in this forum before: in 2011 (Getting a hyperlink URL from an Excel document), 2013 (Extracting Hyperlinks From Excel (.xlsx) with Python) and 2014 (Getting the URL from Excel Sheet Hyper links in Python with xlrd); there is still no answer.
After some deep diving into the xlrd module, it seems the Data_sheet.hyperlink_map.get((row, col)) call trips because "xlrd cannot read the hyperlink without formatting_info, which is currently not supported for xlsx", per @alecxe at Extracting Hyperlinks From Excel (.xlsx) with Python.
Question: has anyone made progress with extracting URLs from hyperlinks stored in an Excel file? Say, among all the customer data there is a column of hyperlinks. I was toying with the idea of dumping the Excel sheet as an HTML page and proceeding with the usual scraping (with the file on a local drive), but that is not a production solution. Supplementary: is there any other module that can extract the URL from a .cell(row, col).value call on the hyperlink cell? Is there a solution in mechanize? Many thanks.
I had the same problem trying to get the hyperlinks from the cells of an .xlsx file. The workaround I came up with was simply converting the Excel sheet to .xls format, from which I could get the hyperlinks without any trouble, and once the editing was finished I converted it back to the original .xlsx format.
I don't know if this will work for your specific needs, or if the change of format has consequences I am not aware of, but I think it's worth a try.
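For completeness, a sketch of reading the hyperlink map from the converted file with xlrd (the file name is hypothetical; formatting_info=True only works for the legacy .xls format):

```python
import xlrd

# formatting_info=True is required for hyperlink_map, and is only
# supported for .xls workbooks, hence the conversion step above.
book = xlrd.open_workbook("customers.xls", formatting_info=True)
sheet = book.sheet_by_index(0)

# hyperlink_map maps (row, col) to a Hyperlink object.
for (row, col), link in sheet.hyperlink_map.items():
    print(row, col, link.url_or_path)
```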
I was able to read and use hyperlinks to copy files with openpyxl. It has cell_obj.hyperlink and cell_obj.hyperlink.target, which will grab the link value. I made a list of the (row, col) values of the cells that had hyperlinks, then looped through that list to move the linked files; see the sketch below.
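A sketch of that openpyxl approach, assuming a hypothetical workbook name:

```python
from openpyxl import load_workbook

wb = load_workbook("customers.xlsx")  # hypothetical workbook
ws = wb.active

# Collect (row, column, url) for every cell that carries a hyperlink.
links = []
for row in ws.iter_rows():
    for cell in row:
        if cell.hyperlink is not None:
            # hyperlink.target holds the URL (or file path) behind the cell
            links.append((cell.row, cell.column, cell.hyperlink.target))

for row_no, col_no, url in links:
    print(row_no, col_no, url)
```

Note that this works directly on .xlsx files, so no format conversion is needed.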
