In our application we are migrating a huge volume of data from Teradata to Hive.
We need to validate the data between source and target. We are planning to do it using Python and pandas DataFrames.
My queries are:
1. Can a pandas DataFrame handle around 15 million rows of data?
2. Is there any other way to do it?
What is the best way to achieve the above using Python?
Thanks in advance
Related
I have about 45GB of data in my BigQuery that I want to transfer to Elasticsearch. Currently, I am fetching rows from my BigQuery table as JSON then indexing them into Elasticsearch. The whole process took about 2 weeks to complete. I just wanted to know if there is a better and more efficient way to do this.
I need to pull a table ABC from Salesforce having 4 million records with 250 columns. I'm using the Python simple-salesforce API to do it, but it is running out of memory, as I'm using an 8 GB RAM machine.
Is there any way to query such a large number of records in Salesforce using PySpark? If so, please suggest one.
If there are any other approaches using either Python or PySpark, please suggest them as well.
Is there a chance you are using query_all from simple_salesforce? If so, you might try query_more or query_all_iter instead, so that your script does not attempt to load all of the records into a single Python list.
Simple Salesforce documentation here.
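As a rough sketch (the object name Account, the fields, and the batch size are placeholders, not from the question), query_all_iter returns a generator, so records can be processed in batches instead of being held in memory all at once:

from simple_salesforce import Salesforce

# Placeholder credentials -- substitute your own.
sf = Salesforce(username="user@example.com", password="...", security_token="...")

# query_all_iter yields one record at a time instead of building a 4M-element list.
records = sf.query_all_iter("SELECT Id, Name FROM Account")

batch = []
for rec in records:
    batch.append(rec)
    if len(batch) >= 10_000:
        # write the batch out (CSV, database, etc.) and discard it
        batch.clear()
# handle whatever remains in the final partial batch the same way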
I'm new to Spark programming and have a question about how to read partitioned tables using PySpark.
Let us say we have a table partitioned as below:
~/$table_name/category=$category/year=$year/month=$month/day=$day
Now, I want to read data from all the categories, but restrict the data by time period. Is there any way to specify this with wildcards rather than writing out all the individual paths?
Something to the effect of:
table_path = ["~/$table_name/category=*/year=2019/month=03",
              "~/$table_name/category=*/year=2019/month=04"]
table_df_raw = spark.read.option(
    "basePath", "~/$table_name").parquet(*table_path)
Also, as a bonus, is there a more Pythonic way to specify time ranges that may fall in different years, rather than listing the paths individually?
Edit: To clarify a few things, I don't have access to the Hive metastore for this table and hence can't access it with just a SQL query. Also, the size of the data doesn't allow filtering after conversion to a DataFrame.
You can try this.
Wildcards can also be used to specify a range of months:
table_df_raw = spark.read \
    .option("basePath", "~/$table_name") \
    .parquet("~/$table_name/category=*/year=2019/month={3,4,8}")
Or
table_df_raw = spark.read \
    .option("basePath", "~/$table_name") \
    .parquet("~/$table_name/category=*/year=2019/month=[3-4]")
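For the bonus part of the question, one possible approach (a sketch only; it assumes the same $table_name layout, an existing SparkSession named spark, and unpadded month values in the partition paths) is to generate the path list from a date range, so a range spanning different years needs no manual listing:

import pandas as pd

# Month-level partitions for an arbitrary range, even across a year boundary.
start, end = "2018-11", "2019-04"   # example range, adjust as needed
months = pd.period_range(start, end, freq="M")

# Use f"{p.month:02d}" instead if your month partition values are zero-padded.
table_path = [
    f"~/$table_name/category=*/year={p.year}/month={p.month}"
    for p in months
]

table_df_raw = spark.read \
    .option("basePath", "~/$table_name") \
    .parquet(*table_path)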
Are you using a Hortonworks HDP cluster? If yes, try the HiveWarehouse connector. It allows Spark to access the Hive catalog. After this, you can perform any Spark SQL command over Hive tables: https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html
If you aren't using Hortonworks, I suggest you look at this link: https://acadgild.com/blog/how-to-access-hive-tables-to-spark-sql
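If you do have metastore access on your cluster, a minimal sketch of that route looks like the following; the database and table names are placeholders, and the partition columns are filtered in SQL so Spark only reads the matching directories:

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark resolve table definitions from the Hive metastore.
spark = SparkSession.builder \
    .appName("partitioned-read") \
    .enableHiveSupport() \
    .getOrCreate()

# my_db.my_table is a placeholder; the partition-column filters prune the scan.
df = spark.sql(
    "SELECT * FROM my_db.my_table WHERE year = 2019 AND month IN (3, 4)"
)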
So I am building a database for a larger program and do not have much experience in this area of coding (mostly embedded systems programming). My task is to import a large Excel file into Python. It is large, so I'm assuming I must convert it to a CSV, then truncate it by parsing and partitioning, and then import it to avoid my computer crashing. Once the file is imported, I must be able to extract/search specific information based on the column titles. There are other user-interactive aspects that are simply string based, so not very difficult. As for the rest, I am getting the picture but would like a more efficient and specific design. Can anyone offer me guidance on this?
An Excel or CSV file can be read into Python using pandas. The data is stored as rows and columns in a structure called a DataFrame. To import data into such a structure, you need to import pandas first and then read the CSV or Excel file into a DataFrame.
import pandas as pd

# read_csv loads the file into a DataFrame; for the original .xlsx you could use pd.read_excel instead
df1 = pd.read_csv('excelfilename.csv')
This DataFrame structure is similar to a table, and you can perform operations such as joining different DataFrames, grouping data, etc.
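For example (a sketch only; the column names category and value are made up, so substitute your own column titles):

# Placeholder column names -- use the titles from your own file.
subset = df1[["category", "value"]]               # select specific columns by title
matches = df1[df1["category"] == "sensor_a"]      # filter rows on a column value
totals = df1.groupby("category")["value"].sum()   # group and aggregate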
I am not sure if this is what you need, let me know if you need any further clarifications.
I would recommend actually loading it into a proper database such as MariaDB or PostgreSQL. This will allow you to access the data from other applications and takes the load of writing a database off of you. You can then use an ORM if you would like to interact with the data, or simply use plain SQL via Python.
Read the CSV:

import pandas as pd
import sqlite3

df = pd.read_csv('sample.csv')

Connect to a database:

conn = sqlite3.connect("Any_Database_Name.db")  # if the db does not exist, this creates an Any_Database_Name.db file in the current directory

Store your table in the database:

df.to_sql('Some_Table_Name', conn)

Read a SQL query out of your database and into a pandas DataFrame:

sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
New to Pandas & SQL. Haven't found an answer specific to this config, and not sure if standard SQL wisdom applies when introducing pandas to the mix.
Doing a school project that involves ~300 GB of data in ~6 GB .csv chunks.
The school advised syncing the data via Dropbox, but this seemed impractical for a 4-person team.
So, the current solution is an AWS EC2 + RDS instance (MySQL, I think it will be, with 1 table).
What I wanted to confirm before we start setting it up:
If multiple users are working with (and occasionally modifying) the data, can this arrangement manage conflicts? e.g., if user A uses pandas to construct a dataframe from a query, are the records in that query frozen if user B tries to work with them?
My assumption is that the data in the frame are in memory, and the records in the SQL database are free to be modified by others until the dataframe is written back to the db, but I'm hoping that either I'm wrong or there's a simple solution here (like a random sample query for each user or something).
A pandas DataFrame object does not interact directly with the db. Once you read it in, it sits in memory locally. You would have to use a method like DataFrame.to_sql to write your changes back to the MySQL DB. For more information on reading and writing to SQL tables, see the pandas documentation here.
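As a minimal sketch of that read/modify/write-back round trip (assuming SQLAlchemy with a MySQL driver; the connection string and table names are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string -- substitute your RDS endpoint and credentials.
engine = create_engine("mysql+pymysql://user:password@rds-endpoint:3306/school_db")

# This pulls a local, in-memory copy; other users can still modify the table meanwhile.
df = pd.read_sql("SELECT * FROM measurements LIMIT 100000", engine)

# ... modify df locally ...

# Nothing reaches MySQL until you explicitly write the frame back.
df.to_sql("measurements_user_a", engine, if_exists="replace", index=False)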