Uploading a CSV from Google Cloud to BigQuery using Python

I'm trying to upload a CSV file to BigQuery via Google Cloud and am running into formatting issues. I have two date columns (date and cancel) that need to be converted to the DateTime format BigQuery expects, and I'm using this code for the conversion:
df['date'] = pd.to_datetime(df['date'])
This works fine for the "date" column but doesn't work for the "cancel" column. The "cancel" column has some empty rows; are empty rows the problem? Also, when I execute the code above, an additional column with random integer values is automatically added as the first column of the CSV. How do I get rid of these formatting issues?
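For reference, a minimal sketch of how these two symptoms are often handled in pandas. This assumes the empty "cancel" cells are what makes the conversion fail and that the extra integer column is the dataframe index written by to_csv:

import pandas as pd

df = pd.read_csv('input.csv')

# Convert both date columns; coerce the empty "cancel" cells to NaT instead of raising
df['date'] = pd.to_datetime(df['date'])
df['cancel'] = pd.to_datetime(df['cancel'], errors='coerce')

# Write without the pandas index so no extra integer column appears in the CSV
df.to_csv('output.csv', index=False)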

I use the ELT approach: first load all your data into BigQuery, then transform it there. That is, make every column a string and load it, so the load does not fail. You can then transform the data however you want inside BigQuery.
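A minimal sketch of that ELT pattern, assuming pandas-gbq is used for the load; the project, dataset, and table names below are placeholders:

import pandas as pd
import pandas_gbq

# Read everything as strings so nothing fails on format during the load
df = pd.read_csv('input.csv', dtype=str)

# Load the all-string dataframe into a staging table in BigQuery
pandas_gbq.to_gbq(df, 'my_dataset.staging_table', project_id='my-project', if_exists='replace')

# Then transform inside BigQuery, for example:
# CREATE OR REPLACE TABLE my_dataset.final_table AS
# SELECT SAFE_CAST(date AS DATETIME) AS date, SAFE_CAST(cancel AS DATETIME) AS cancel
# FROM my_dataset.staging_table;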

Related

Pandas Dataframe (with Date as Header Column) to MySQL

I have been trying to export a particular set of data from an Excel file, through Python, to MySQL.
The data from an excel file looks like the one in the screenshot shown below:
Data in Excel
After using 'iloc' and some other pandas functions, I converted it into the one below:
Data in Python Pandas
Now the problem is really with the DataFrame header column, which is a date. I want this data, when exported to MySQL, to look like:
Data in MySQL
I have tried converting the date to both a string and datetime.datetime, but so far I have not been able to export it to MySQL the way I want.
Any help would be very much appreciated.
Thanks.
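Since the screenshots are not available, here is only a hedged sketch of the reshaping usually needed when dates sit in the header: melt the date columns into rows, then write to MySQL with to_sql. All column, table, and connection names below are hypothetical:

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical frame: an 'item' column plus dates as header columns
df = pd.DataFrame({
    'item': ['A', 'B'],
    '2021-01-01': [10, 20],
    '2021-01-02': [30, 40],
})

# Turn the date headers into a 'date' column so MySQL gets one row per (item, date)
long_df = df.melt(id_vars=['item'], var_name='date', value_name='value')
long_df['date'] = pd.to_datetime(long_df['date'])

# Write to MySQL (connection string and driver are assumptions, e.g. pymysql)
engine = create_engine('mysql+pymysql://user:password@localhost/dbname')
long_df.to_sql('my_table', engine, if_exists='append', index=False)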

Pandas df `to_gbq` with nested data

I'm working in a limited Airflow environment in which I don't have access to google-cloud-bigquery but do have access to pandas-gbq. My goal is to load some JSON API data using some schema involving records into a BigQuery table. My strategy is to first read all the data into a pandas dataframe using a dictionary to represent the records: e.g.
uuid | metadata1 | ...
001 | {u'time_updated': u'', u'name': u'jeff'} | ...
Then, I've been trying to use pandas_gbq.to_gbq to load into BQ. The issue is I get
Error at Row: 0, Reason: invalid, Location: metadata1, Message: This field: metadata1 is not a record.
I realize this is because the Google Cloud documentation says that pandas-gbq "Converts the DataFrame to CSV format before sending to the API, which does not support nested or array values."
And so I won't be able to upload a dataframe with records to BQ in this way since again I can't use google-cloud-bigquery in my environment.
What would be the best strategy for me to upload my data to BQ (around 30k rows and 6 or so columns with 8ish nested fields each)?
I know this sounds like a very bad strategy, but I could upload a flattened version of all fields in a record as a single string to the BQ table and then run a query from my code to replace these flattened fields with their record-form versions. But this seems really bad since, for a time, the table would contain the wrong schema.
Any thoughts would be much appreciated. Thanks in advance.
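For what it's worth, a minimal sketch of the string-based workaround described in the question: serialize the nested dicts to JSON strings before pandas_gbq.to_gbq, then parse them later in BigQuery with its JSON functions. The project, dataset, and table names are placeholders:

import json
import pandas as pd
import pandas_gbq

df = pd.DataFrame({
    'uuid': ['001'],
    'metadata1': [{'time_updated': '', 'name': 'jeff'}],
})

# Serialize the nested dict to a JSON string so the CSV-based load accepts it
df['metadata1'] = df['metadata1'].apply(json.dumps)

# Load into a staging table where metadata1 is a plain STRING column
pandas_gbq.to_gbq(df, 'my_dataset.staging_table', project_id='my-project', if_exists='replace')

# Later, extract fields in BigQuery SQL, e.g.:
# SELECT uuid, JSON_EXTRACT_SCALAR(metadata1, '$.name') AS name
# FROM my_dataset.staging_table;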

Avoiding exponential values in CSV on loading to Database

I am using a Python script to structure a CSV file fetched from a website and store it as a CSV in an Azure container. The CSV file is then loaded into an Azure database using an Azure Data Factory Copy activity.
One of the amount columns in the CSV contains values in exponential notation. For example, it has the value 3.7E+08 (the actual value is 369719968). When the data is loaded into the database, the value is stored as 3.7E+08 in the staging table, whose column data type is varchar(30). I convert it to float and then to decimal to store it in the target table, but on conversion to float the value comes out as 370000000 in the database, which causes a big mismatch in the amount.
Kindly suggest how we can resolve this issue. We tried converting the amount column to a string in the CSV using Python, but the value still comes out in exponential notation.
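A hedged sketch of keeping the value out of scientific notation on the Python side, assuming the column is named amount and the full value is still present when the file is fetched; either keep the column as a string so pandas never converts it to float, or control the float format when writing:

import pandas as pd

# Option 1: read the column as a string so pandas never turns it into a float
df = pd.read_csv('input.csv', dtype={'amount': str})

# Option 2: if the column has already become a float, write it without scientific notation
# df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df.to_csv('output.csv', index=False, float_format='%.0f')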

Get JSON format for Google BigQuery Data using Pandas/Python

I am trying to query Google BigQuery using the Pandas/Python client interface. I am following the tutorial here: https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas. I was able to get it to work, but I want to get the data in the JSON format that can be downloaded directly from the web UI (see screenshot). Is there a way to download the data as the JSON structure pictured instead of converting it to a dataframe object?
I imagine the command would be somewhere around this part of the code from the tutorial:
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
)
Just add a .to_json(orient='records') call after converting to a dataframe:
json_data = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(bqstorage_client=bqstorageclient)
    .to_json(orient='records')
)
pandas docs
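If you need newline-delimited JSON (one object per line, the format BigQuery typically uses for its JSON exports) rather than a single JSON array, pandas also supports:
json_data = dataframe.to_json(orient='records', lines=True)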

Create Database using Python on Jupyter Notebook

So I am building a database for a larger program and do not have much experience in this area of coding (mostly embedded systems programming). My task is to import a large Excel file into Python. Since it is large, I'm assuming I must convert it to a CSV, then truncate it by parsing and partitioning, and then import it, to avoid crashing my computer. Once the file is imported, I must be able to extract/search specific information based on the column titles. There are other user-interactive aspects that are simply string based, so not very difficult. As for the rest, I am getting the picture but would like a more efficient and specific design. Can anyone offer me guidance on this?
An Excel or CSV file can be read into Python using pandas. The data is stored as rows and columns in a structure called a dataframe. To import data into such a structure, you need to import pandas first and then read the CSV or Excel file into a dataframe.
import pandas as pd
df1 = pd.read_csv('excelfilename.csv')
This dataframe structure is similar to a table, and you can perform operations such as joining different dataframes, grouping data, etc.
I am not sure if this is what you need, let me know if you need any further clarifications.
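If the file really is too large to read comfortably in one pass (the concern raised in the question), pandas can also read it in chunks; a small sketch, with the chunk size and column names purely illustrative:

import pandas as pd

chunks = []
for chunk in pd.read_csv('excelfilename.csv', chunksize=100000):
    # keep only the columns you actually need before holding them in memory
    chunks.append(chunk[['column_a', 'column_b']])

df1 = pd.concat(chunks, ignore_index=True)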
I would recommend actually loading it into a proper database such as MariaDB or PostgreSQL. This lets you access the data from other applications and takes the work of writing a database off of you. You can then use an ORM if you would like to interact with the data, or simply use plain SQL via Python.
Read the CSV:
import pandas as pd
import sqlite3

df = pd.read_csv('sample.csv')
Connect to a database:
conn = sqlite3.connect("Any_Database_Name.db")  # if the db does not exist, this creates an Any_Database_Name.db file in the current directory
Store your table in the database:
df.to_sql('Some_Table_Name', conn)
Read a SQL query out of your database and into a pandas dataframe:
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
