I have a dataframe with 100 rows [name, age, date, hour]. I need to partition this dataframe by the distinct values of date. Let's say there are 20 distinct date values in these 100 rows; then I need to spawn 20 parallel Hive queries, where each Hive QL joins one of these partitions with a Hive table. The Hive table [dept, course, date] is partitioned by the date field.
The Hive table is huge, so I need to optimize these joins into multiple smaller joins and then aggregate the results. Any recommendations on how I can achieve this?
You can do this in a single query. Partition both dataframes on date and join them, broadcasting the smaller table (~10 MB) during the join. Here is an example:
from pyspark.sql import functions as F

df3 = df1.repartition("date").join(
    F.broadcast(df2.repartition("date")),
    "date"
)
# df2 is your smaller dataframe; in your case it is the one with [name, age, date, hour].
# Now perform any operation on df3.
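To tie this back to the Hive side of the question, here is a minimal sketch of the same idea reading the partitioned Hive table directly; dept_table is an illustrative name and df2 is the small 100-row dataframe:
from pyspark.sql import functions as F

hive_df = spark.table("dept_table")            # huge Hive table partitioned by date (illustrative name)
df3 = hive_df.join(F.broadcast(df2), "date")   # broadcast the small side so the big table is not shuffled
# Aggregate df3 as needed instead of running 20 separate queries.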
I need to query 200+ tables in a database.
Using a spark.sql(f"select ...") statement I get col(0) (because the result of the query gives me specific information about the column I've retrieved) plus the result of the calculation for a particular table, like this:
col(0)
1
My goal is to have one CSV file with the name of each table and the result of the calculation:
Table name | Count
accounting | 3
sales      | 1
So far the main part of my code is:
list_tables = ['accounting', 'sales', ...]

for table in list_tables:
    df = spark.sql(
        f"""select distinct errors as counts from {database}.{table} where errors is not null""")
    df.repartition(1).write.mode("append").option("header", "true").csv(f"s3:.......")
    rename_part_file(dir, output, newdir)
I'm kinda new to PySpark and all the structures involved.
So far I'm confused, because I've heard that iterating over dataframes isn't the best idea.
Using the code above I get only one CSV containing the most recent record, not all the processed tables from my list_tables.
I'm stuck; I don't know whether it's possible to pack all of it into one dataframe, or whether I should union the dataframes.
Both of the options you mentioned lead to the same thing: you have to iterate over the list of tables (you can't read multiple tables at once), read each of them, execute the SQL statement and save the results into a DataFrame, then union all of the DataFrames and save them as a single CSV file. The sample code could look something like this:
from pyspark.sql.functions import lit
from functools import reduce

tables = ["tableA", "tableB", "tableC"]

dfs = []
for table in tables:
    # Run your SQL statement against each table and tag the result with the table name
    dfs.append(
        spark.sql(f"select count(*) as counts from {table}")  # replace with your own SQL statement
        .withColumn("TableName", lit(table))
    )

df = reduce(lambda df1, df2: df1.union(df2), dfs)  # Union all DFs
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("my_csv.csv")  # Write as a single file
Note: the union operation takes into account only the position of the columns, not their names. I assume that is the desired behaviour for your case, as you are only extracting a single statistic.
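If column names do matter (for example when the per-table queries return differently ordered columns), Spark 2.3+ offers unionByName, which matches columns by name instead of position; a minimal variation of the union step above:
df = reduce(lambda df1, df2: df1.unionByName(df2), dfs)  # match columns by name instead of position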
I have a database which contains 2 tables. I'd like to get all the data from both tables into a single dataframe. Both tables have a time column, on which I'd like to sort the data after combining them.
df1 = pd.read_sql("SELECT * FROM table1", conn)  # conn is your database connection
df2 = pd.read_sql("SELECT * FROM table2", conn)
What is the best way to combine df1 and df2 to a single dataframe ordered by time column?
Do you mean concat and sort_values?
print(pd.concat([df1, df2]).sort_values('time'))
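For completeness, an end-to-end sketch assuming a SQLite database file named mydb.sqlite (the file name and connection are illustrative) and that both tables have the time column:
import sqlite3
import pandas as pd

conn = sqlite3.connect("mydb.sqlite")  # illustrative connection; use whatever connection/engine you have

df1 = pd.read_sql("SELECT * FROM table1", conn)
df2 = pd.read_sql("SELECT * FROM table2", conn)

# Stack the rows of both tables and sort the combined frame by the shared time column
combined = pd.concat([df1, df2], ignore_index=True).sort_values("time")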
I have two PySpark dataframes.
I want to select all records from voutdf where its "hash" does not exist in vindf.tx_hash.
How can I do this using PySpark dataframes?
I tried a semi join, but I ended up with out-of-memory errors.
voutdf = sqlContext.createDataFrame(voutRDD,["hash", "value","n","pubkey"])
vindf = sqlContext.createDataFrame(vinRDD,["txid", "tx_hash","vout"])
You can do it with a left anti join:
df = voutdf.join(vindf.withColumnRenamed("tx_hash", "hash"), "hash", 'left_anti')
A left anti join takes all the rows from the left dataset that don't have a match in the right dataset.
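If you prefer not to rename the column, the same anti join can be written with an explicit join condition (using the column names from the question):
# Keeps only the voutdf rows whose hash has no matching tx_hash in vindf
df = voutdf.join(vindf, voutdf["hash"] == vindf["tx_hash"], "left_anti")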
I have two CSV files with 30 to 40 thousand records each.
I loaded the CSV files into two corresponding dataframes.
Now I want to perform this SQL operation on the dataframes instead of in sqlite:
update table1 set column1 = (select column1 from table2 where table1.Id == table2.Id), column2 = (select column2 from table2 where table1.Id == table2.Id) where column3 = 'some_value';
I tried to perform the update on the dataframe in 4 steps (a rough sketch follows the list):
1. merge the dataframes on the common Id
2. get the Ids from the dataframe where column3 has 'some_value'
3. filter the dataframe from step 1 based on the Ids received in step 2
4. use a lambda function to update the dataframe where the Id matches
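A rough pandas sketch of those four steps, using the column names from the SQL above; the CSV file names and the df1/df2 variables are illustrative:
import pandas as pd

df1 = pd.read_csv("table1.csv")  # illustrative file names
df2 = pd.read_csv("table2.csv")

# Step 1: merge on the common Id to bring df2's column1/column2 alongside df1's
merged = df1.merge(df2[["Id", "column1", "column2"]], on="Id", how="left", suffixes=("", "_new"))

# Steps 2-4: only rows where column3 == 'some_value' take the values coming from df2
mask = merged["column3"] == "some_value"
merged.loc[mask, "column1"] = merged.loc[mask, "column1_new"]
merged.loc[mask, "column2"] = merged.loc[mask, "column2_new"]

df1 = merged.drop(columns=["column1_new", "column2_new"])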
I just want to know other views on this approach and whether there are better solutions. One important thing is that the dataframes are quite large, so I feel like using sqlite might be better than pandas, since it gives the result in a single query and is much faster.
Should I use sqlite, or is there a better way to perform this operation on the dataframes?
Any views on this will be appreciated. Thank you.
There are two DataFrames. One, df1, contains events, and one of its columns is ID.
The other, df2, contains just IDs.
What would be the best way to create df3, which contains just the rows whose ID is not present in df2?
It looks like this type of query is not supported in Spark SQL:
sqlContext.sql(""" SELECT * FROM table_df1
WHERE ID NOT IN (SELECT ID FROM table_df2) """)
Spark SQL supports this type of subquery starting from Spark version 2.0 (more information is available on the Databricks blog).
A way to do this in older versions of Spark would be the following:
# df1 and df2 must be registered as temporary tables first,
# e.g. df1.registerTempTable("df1") and df2.registerTempTable("df2")
df3 = sqlContext.sql(
    """
    SELECT df1.*
    FROM df1 LEFT JOIN df2 ON df1.id = df2.id
    WHERE df2.id IS NULL
    """
)
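The same pattern can also be expressed directly with the DataFrame API, which avoids registering temporary tables; a minimal sketch, assuming the join column is named id in both df1 and df2:
# Left outer join, then keep only the rows with no match on the right side
df3 = (df1.join(df2, df1["id"] == df2["id"], "left_outer")
          .where(df2["id"].isNull())
          .select(df1["*"]))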