BigQuery: loading an Excel file - Python

Is there any way to load an Excel file directly into BigQuery, instead of converting it to CSV first?
I get the files every day in Excel format and need to load them into BigQuery. Right now I convert them to CSV manually and load them into BigQuery.
I'm planning to schedule the job.
If it is not possible to load the Excel files directly into BigQuery, then I need to write a process (in Python) to convert them to CSV before loading into BigQuery.
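For context, the kind of scheduled conversion step I have in mind would look roughly like this (a sketch using pandas and the BigQuery Python client; file and table names are placeholders):

```python
import pandas as pd
from google.cloud import bigquery

# File and table names below are placeholders.
df = pd.read_excel("daily_report.xlsx")    # needs openpyxl (or xlrd for .xls)
df.to_csv("daily_report.csv", index=False)

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
with open("daily_report.csv", "rb") as f:
    job = client.load_table_from_file(
        f, "my_dataset.daily_report", job_config=job_config
    )
job.result()  # wait for the load job to finish
```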
Please let me know if there are any better options.
Thanks,

I think you could achieve the above in a few clicks, without any code.
You need to use Google Drive and external (federated) tables.
1) Upload your Excel files to Google Drive manually, or synchronise them.
2) In the Google Drive settings find
"**Convert uploads** [x] Convert uploaded files to Google Docs editor format"
and check it.
To access this option, go to https://drive.google.com/drive/my-drive, click the gear icon, and choose Settings.
Now your Excel files will be accessible to BigQuery.
3) Last part: https://cloud.google.com/bigquery/external-data-drive
You can reference your Excel file by its Drive URI (https://cloud.google.com/bigquery/external-data-drive#drive-uri) and create the table manually using that URI.
You can also do this last step via the API, as sketched below.
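For example, a minimal sketch with the BigQuery Python client (the dataset, table name, and file ID are placeholders, and the credentials are assumed to include the Drive scope):

```python
from google.cloud import bigquery

# Assumes application-default credentials that include the Drive scope, e.g.:
#   gcloud auth application-default login \
#     --scopes=https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/drive
client = bigquery.Client()

# The converted Excel file is now a Google Sheet, addressed by its Drive URI.
external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = ["https://drive.google.com/open?id=FILE_ID"]
external_config.autodetect = True

table = bigquery.Table(f"{client.project}.my_dataset.drive_excel")
table.external_data_configuration = external_config
client.create_table(table)  # creates the federated (external) table
```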

Related

Loading csv.gz files from a URL to BigQuery

I am trying to load all the csv.gz files from this url to Google BigQuery. What is the best way to do this?
I tried using PySpark to read the csv.gz files (as I need to perform some data cleaning on them), but I realized that PySpark doesn't support reading files directly from a URL. Would it make sense to load the cleaned versions of the csv.gz files into BigQuery, or should I dump the raw, original csv.gz files into BigQuery and perform my cleaning process in BigQuery itself?
I was reading the "Google BigQuery: The Definitive Guide" book, and it suggests loading the data into Google Cloud Storage. Do I have to load each csv.gz file into Google Cloud Storage individually, or is there an easier way to do this?
Thanks for your help!
As @Samuel mentioned, you can use the curl command to download the files from the URL and then copy them to a GCS bucket.
If you have heavy transformations to be done on the data, I would recommend Cloud Dataflow; otherwise you can go for a Cloud Dataprep workflow and finally export your clean data to a BigQuery table.
Choosing BigQuery for the transformations depends entirely on your use case, data size, and budget, i.e., if you have a high volume, then direct transformations could be costly.
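As a rough sketch of the download-then-load route in Python (the URL, bucket, and table names are placeholders; the requests library stands in for curl):

```python
import requests
from google.cloud import bigquery, storage

url = "https://example.com/data/file1.csv.gz"  # placeholder URL
bucket_name = "my-staging-bucket"              # placeholder bucket
blob_name = "raw/file1.csv.gz"

# 1) Download the file and stage it in Cloud Storage.
resp = requests.get(url)
resp.raise_for_status()
storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(resp.content)

# 2) Load straight from GCS; BigQuery decompresses gzipped CSV natively.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_uri(
    f"gs://{bucket_name}/{blob_name}", "my_dataset.raw_table", job_config=job_config
).result()
```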

Pass a list of Google Drive file URLs to a dataframe

I have a Google Sheet with a column of URLs to other Google Sheets, each of which could be either a native Google Sheet or an Excel file uploaded to Drive. All the files are stored in my work Google Drive, so the share link provides access for anyone within the company.
My research revealed several ways to access individual Google Drive files, or all files in a particular directory, but I'm hoping to find a way to access hundreds of other file URLs and read each of them (and their tabs) into a separate pandas dataframe.
I could go through the process of creating shortcuts for each of the files in a folder on my Drive and go the "pull in everything from this directory" route, but before I subject myself to that tedium I thought I'd put myself out there and ask.
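Roughly what I'm imagining is the following, assuming the share links resolve without an interactive login (for stricter permissions the Drive API with credentials would be needed instead; file names below are placeholders):

```python
import re
import pandas as pd

def sheets_url_to_frames(url):
    """Read every tab of one spreadsheet URL into a dict of DataFrames."""
    file_id = re.search(r"/d/([\w-]+)", url).group(1)
    # The export URL works for native Google Sheets; raw uploaded .xlsx files
    # may need https://drive.google.com/uc?id=<FILE_ID>&export=download instead.
    export_url = f"https://docs.google.com/spreadsheets/d/{file_id}/export?format=xlsx"
    return pd.read_excel(export_url, sheet_name=None)  # sheet_name=None -> all tabs

# "url_list.csv" and its "url" column are placeholders for the master sheet.
urls = pd.read_csv("url_list.csv")["url"]
frames = {url: sheets_url_to_frames(url) for url in urls}
```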

Accessing an Excel file in a .dtsx that was generated through the Xlwt library (Python) raises the error CANNOTACQUIRECONNECTIONFROMCONNECTIONMANAGER

I have a web scraper written in Python that fetches raw data from the HTML of a page and writes it to an Excel 97-2003 Workbook file using the Xlwt library. I then have a .dtsx file with some tasks, one of which is an Excel Source task that fetches data from an Excel file. Later down the road, that data is inserted into a SQL Server table.
If I try to access my newly generated Excel file with said task, I get an OLE DB error:
External table is not in the expected format
and I cannot run my .dtsx. However, if I manually access the Excel file through File Explorer, open it, and close it again (I don't even need to save it), suddenly my SSIS task works without a problem, fetching all the columns and all the info. What could possibly be causing this behavior?
External table is not in the expected format
The error above happens when the Excel file is corrupted and cannot be opened by the Access Database Engine (the OLE DB provider), even if you can open the file from Excel.
In general, the solution is to open the Excel file manually, which will auto-repair it. In a similar case, if the process is repeated many times, you can automate opening and repairing the file with a C# script using the Interop.Excel library.
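Since the scraper is already in Python, here is a minimal sketch of that automation with pywin32 standing in for the C# Interop library (it assumes Excel is installed on the machine; the path is a placeholder):

```python
import win32com.client  # pywin32 drives the same COM interface as Interop.Excel

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
excel.DisplayAlerts = False
try:
    # Opening and re-saving lets Excel repair the file, mirroring the manual fix.
    wb = excel.Workbooks.Open(r"C:\data\scraped.xls")  # placeholder path
    wb.Save()
    wb.Close()
finally:
    excel.Quit()
```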
Additional Information
What .xlsx file format is this?
Getting "External table is not in the expected format." error while trying to import an Excel File in SSIS

How can I store files like .xlsx in Elasticsearch, indexing just the path?

I have an Excel file, fille1.xlsx, that I want to index in Elasticsearch without storing all the information it contains; just the path of the file. The file is stored on my hard disk, and to access it afterwards I open the file itself.
Is this possible?
This solution would let me search for the files in the Elasticsearch database: for example, searching for 'fille' would return a list of files whose names begin with 'fille', without accessing the information the files contain, just the possibility of opening them with Excel.
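Something like the following is the behaviour I'm after (a sketch with the official Elasticsearch Python client; the cluster address, index name, and paths are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# Index only the file name and its path on disk, not the file contents.
es.index(index="files", document={
    "name": "fille1.xlsx",
    "path": "/home/me/documents/fille1.xlsx",  # placeholder path
})
es.indices.refresh(index="files")  # make the document searchable immediately

# Return every file whose name begins with "fille" (dynamic mapping gives
# string fields a .keyword subfield suitable for prefix matching).
hits = es.search(index="files", query={"prefix": {"name.keyword": "fille"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["path"])  # open this path locally with Excel
```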

Python code to load CSV data from Google Cloud Storage to BigQuery?

I am pretty new to this, so I wanted to see the code and process for loading data from a CSV file (placed in Google Cloud Storage) into a BigQuery table using Python and Dataflow.
Thanks in advance.
There are different BigQuery libraries depending on the language. For Python, you would use this one.
But if what you want is the exact piece of code to upload a CSV from Google Cloud Storage to BigQuery, this example might work for you: "Loading CSV data into a new table".
You can also see, on the same documentation page, "Appending to or overwriting a table with CSV data".
You can also go to GitHub to check all the methods available for Python.
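Adapted from that "Loading CSV data into a new table" example, a minimal sketch (the bucket, table, and schema are placeholders for your own):

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    schema=[  # placeholder schema; or set autodetect=True instead
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("value", "FLOAT"),
    ],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv", "my_dataset.my_table", job_config=job_config
)
load_job.result()  # block until the load job completes

table = client.get_table("my_dataset.my_table")
print(f"Loaded {table.num_rows} rows.")
```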
