Iterate through database with PySpark DataFrame - python

I need to query 200+ tables in a database.
Using a spark.sql(f"select ...") statement I get a single value under a generic col(0) header (because the query retrieves one specific piece of information about a column) together with the result of the calculation for a particular table, like this:
col(0)
1
My goal is to have one CSV file with the name of each table and the result of the calculation:
Table name | Count
accounting | 3
sales      | 1
So far, this is the main part of my code:
list_tables = ['accounting', 'sales', ...]
for table in list_tables:
    df = spark.sql(
        f"""select distinct errors as counts from {database}.{table} where errors is not null""")
    df.repartition(1).write.mode("append").option("header", "true").csv(f"s3:.......")
    rename_part_file(dir, output, newdir)
I'm fairly new to PySpark and all the structures involved.
So far I'm confused because I've heard that iterating over DataFrames isn't the best idea.
With the code above I get only one CSV containing the most recent record, not all of the processed tables from my list_tables.
I'm stuck; I don't know whether it is possible to pack all of it into one DataFrame, or whether I should union the DataFrames.

I'm stuck; I don't know whether it is possible to pack all of it into one DataFrame, or whether I should union the DataFrames.
Both of the options you mentioned lead to the same thing: you have to iterate over the list of tables (you can't read multiple tables at once), read each of them, execute the SQL statement and save the result into a DataFrame, then union all of the DataFrames and write them out as a single CSV file. The sample code could look something like this:
from functools import reduce
from pyspark.sql.functions import lit

tables = ["tableA", "tableB", "tableC"]

dfs = []
for table in tables:
    # Run the per-table SQL statement and tag the result with the table name
    dfs.append(spark.sql(f"my sql statement against {table}").withColumn("TableName", lit(table)))

df = reduce(lambda df1, df2: df1.union(df2), dfs)  # Union all DFs
df.coalesce(1).write.mode("overwrite").csv("my_csv.csv")  # Combine and write as a single file
Note: the union operation matches columns only by position, not by name. I assume that is the desired behaviour in your case, as you are only extracting a single statistic.
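If you ever need the union to align columns by name instead of position, unionByName is a drop-in alternative. A small illustrative sketch (toy frames, assuming an active SparkSession named spark):
df1 = spark.createDataFrame([("1", "accounting")], ["counts", "TableName"])
df2 = spark.createDataFrame([("sales", "2")], ["TableName", "counts"])
df1.union(df2).show()        # matches columns by position, so counts and TableName get mixed
df1.unionByName(df2).show()  # matches columns by name, keeping the values aligned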

Related

How to Left Join a Dataframe into an Existing Table in Databricks

I have a delta table in Databricks that I am loading into a dataframe to perform a specific numpy operation:
import pyspark.pandas as ps
df = spark.sql("SELECT Id,Field_to_Transform FROM bronze.Table_A where Field_to_Transform is not null").toPandas()
The operation I perform is to remove special characters:
df['Transformed_Field'] = [''.join(e for e in x if e.isalnum()) for x in df['Field_to_Transform']]
df = df.drop(['Field_to_Transform'], axis=1)
So this leaves me with the dataframe "df" which has just the id and the Transformed_Field in it:
Id   | Transformed_Field
00A1 | 12345
00A2 | 123456
00A3 | 1234
00A4 | 1234568
Now I want to left join df back to bronze.Table_A in Databricks by simply joining on the Id field.
What is the most efficient way to join df back to bronze.Table_A?
Things I have tried so far:
Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.
Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.
Any help is much appreciated, thank you.
Option 1 - DataFrame API way
The first option is a modification of your first bullet point:
Saved the entire bronze.Table_A in a new dataframe df2, used df.merge to put them together, and then created a brand new table. This worked but it was way too excessive and I do not need a new table, just the transformed column joined back.
The merge operation is a pandas method, not PySpark, so the performance might not be optimal (docs here). Loading the whole table into a DataFrame is the correct approach; it just needs the built-in join method to perform the required left join, like this:
table_df = spark.read.table("bronze.Table_A")
# Join
merged_df = table_df.join(df, on="Id", how="left")
Option 2 - SQL way
The second option builds on your second bullet point:
Tried to use spark.sql to perform the left join in a SQL query but it seems I cannot access a df inside a spark.sql query.
You can temporarily register a DataFrame as a view and then query it with plain SQL, whether through the spark.sql method or any other way. Try doing this:
df.createOrReplaceTempView("transformed_df")
# Then join like this
spark.sql("SELECT * FROM bronze.Table_A ta LEFT JOIN transformed_df tdf ON ta.Id = tdf.Id")

how to perform sql update like operation on pandas dataframe?

I have two CSV files with 30 to 40 thousand records each.
I loaded the CSV files into two corresponding dataframes.
Now I want to perform this SQL operation on the dataframes instead of in sqlite: update table1 set column1 = (select column1 from table2 where table1.Id == table2.Id), column2 = (select column2 from table2 where table1.Id == table2.Id) where column3 = 'some_value';
I tried to perform the update on dataframe in 4 steps:
1. merging dataframes on common Id
2. getting Ids from dataframe where column 3 has 'some_value'
3. filtering the dataframe of 1st step based on Ids received in 2nd step.
4. using lambda function to insert in dataframe where Id matches.
I just want to know other views on this approach and whether there are any better solutions. One important thing is that the dataframes are quite large, so I feel like using sqlite would be better than pandas, as it gives the result in a single query and is much faster.
Should I use sqlite, or is there a better way to perform this operation on the dataframes?
Any views on this will be appreciated. Thank you.
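For reference, a minimal pandas sketch of this kind of update (toy data; the column names are taken from the SQL statement above) can map the replacement values by Id instead of using a per-row lambda:
import pandas as pd

table1 = pd.DataFrame({'Id': [1, 2, 3],
                       'column1': ['a', 'b', 'c'],
                       'column2': ['x', 'y', 'z'],
                       'column3': ['some_value', 'other', 'some_value']})
table2 = pd.DataFrame({'Id': [1, 3],
                       'column1': ['A', 'C'],
                       'column2': ['X', 'Z']})

mask = table1['column3'] == 'some_value'   # rows selected by the WHERE clause
lookup = table2.set_index('Id')            # lets us map Id -> replacement values

for col in ['column1', 'column2']:
    # like the correlated subquery: Ids missing from table2 become NaN
    table1.loc[mask, col] = table1.loc[mask, 'Id'].map(lookup[col])

print(table1)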

Splitting one large comma separated row into many rows after number of values

I'm rather new to MySQL, so apologies if this is an intuitive problem; I couldn't find anything too helpful on Stack Overflow. I currently have a rather large amount of financial data in one row, with each value separated by a comma. Twelve values make up one set of data, so I want to start a new row after every 12 values.
In other words, the data I have looks like this:
(open_time,open,high,low,close,volume,close_time,quotevol,trades,ignore1,ignore2,ignore3, ...repeat...)
And I'd like for it to look like:
Row1:(open_time,open,high,low,close,volume,close_time,quotevol,trades,ignore1,ignore2,ignore3)
Row2:(open_time2,open2,high2,low2,close2,volume2,close_time2,quotevol2,trades2,ignore4,ignore5,ignore6)
Row3:
...
The data is already a .sql file and I have it in a table too if that makes a difference.
To clarify, the table it is in has only one row and one column.
I don't doubt there is a way to do it in MySQL, but I would approach it by exporting the record as CSV:
1. Export to CSV.
2. Write a simple Python script using the csv module that shifts every x number of fields to a new row, using the comma as the delimiter (see the sketch below).
Afterward, you can reimport it back into MySQL.
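For reference, a minimal sketch of such a script (the file names are placeholders) could look like this:
import csv

FIELDS_PER_ROW = 12

with open("export.csv", newline="") as src:
    values = next(csv.reader(src))  # the single long comma-separated row

with open("reshaped.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for i in range(0, len(values), FIELDS_PER_ROW):
        writer.writerow(values[i:i + FIELDS_PER_ROW])  # one record per 12 fields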
If I understand correctly, you want to do the following:
1. Get the string from the database, which is located in the first row of the first column of the query results
2. Break the string into "rows" of 12 values each
3. Be able to use this data
The way I would go about this in Python is to:
1. Create a MySQL connection and cursor
2. Execute the query to pull the data from the database
3. Put the data from the single cell into a string
4. Split the string at each comma and add those values to a list
5. Break that list into chunks of 12 elements each
6. Put this data into a tabular form for easy consumption
Code:
import mysql.connector
import pandas as pd

query = '''this is your sql statement that returns everything into the first row of the first column in your query results'''

cnx = mysql.connector.connect('''enter relevant connection information here: user, password, host, and database''')
mycursor = cnx.cursor()
mycursor.execute(query)

tup = tuple(mycursor.fetchall()[0])
text = str(tup[0])

ls = text.split(',')  # converts the text into a list of values
n = 12
rows = [ls[i:i + n] for i in range(0, len(ls), n)]  # chunks of 12 values each

data = []
for row in rows:
    data.append(tuple(row))

labels = ['open_time', 'open', 'high', 'low', 'close', 'volume', 'close_time',
          'quotevol', 'trades', 'ignore1', 'ignore2', 'ignore3']
df = pd.DataFrame.from_records(data, columns=labels)
print(df)
The list comprehension code was taken from this. You did not specify exactly how you wanted your resultant dataset, but the pandas data frame should have each of your rows.
Without an actual string or dataset, I can't confirm that this works entirely. Would you be able to give us a Minimal, Complete, and Verifiable example?

Importing SQL query into Pandas results in only 1 column

I'm trying to import the results of a complex SQL query into a pandas dataframe. My query requires me to create several temporary tables since the final result table I want includes some aggregates.
My code looks like this:
import pyodbc
import pandas as pd

cnxn = pyodbc.connect(r'DRIVER=foo;SERVER=bar;etc')
cursor = cnxn.cursor()
cursor.execute('SQL QUERY HERE')
cursor.execute('SECONDARY SQL QUERY HERE')
...
df = pd.DataFrame(cursor.fetchall(), columns=[desc[0] for desc in cursor.description])
I get an error that tells me shapes aren't matching:
ValueError: Shape of passed values is (1,900000),indices imply (5,900000)
And indeed, the result of all the SQL queries should be a table with 5 columns rather than 1. I've run the SQL query in Microsoft SQL Server Management Studio, and it works and returns the 5-column table that I want. I've also tried not passing any column names into the dataframe, printed out the head of the dataframe, and found that pandas has put the information for all 5 columns into 1. The value in each row is a list of 5 items separated by commas, but pandas treats the entire list as one column. Why is pandas doing this? I've also tried going the pd.read_sql route, but I still get the same error.
EDIT:
I have done some more debugging, taking the comments into account. The issue doesn't appear to stem from the fact that my query is nested. I tried a simple (one line) query to return a 3 column table and I still got the same error. Printing out fetchall() looks like this:
[(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),
(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),...]
Use pd.DataFrame.from_records instead:
df = pd.DataFrame.from_records(cursor.fetchall(),
                               columns=[desc[0] for desc in cursor.description])
Simply adjust the pd.DataFrame() call, since cursor.fetchall() returns a list of pyodbc Row objects rather than plain tuples. Use tuple() or list() to map the child elements into their own columns:
df = pd.DataFrame([tuple(row) for row in cursor.fetchall()],
                  columns=[desc[0] for desc in cursor.description])

How to SELECT DISTINCT from a pandas hdf5store?

I have a large amount of data in an HDFStore (as a table), on the order of 80M rows with 1500 columns. Column A has integer values ranging between 1 and 40M or so. The values in column A are not unique and there may be between 1 and 30 rows with the same column A value. In addition, all rows which share a common value in column A will also have a common value in column B (not the same value as column A though).
I would like to do a select against the table to get a list of column A values and their corresponding column B values. The equivalent SQL statement would be something like SELECT DISTINCT ColA, ColB FROM someTable. What are some ways to achieve this? Can it be done so that the results of the query are stored directly in another table in the HDFStore?
Blocked Algorithms
One solution would be to look at dask.dataframe which implements a subset of the Pandas API with blocked algorithms.
import dask.dataframe as dd
df = dd.read_hdf('myfile.hdf5', '/my/data', columns=['A', 'B'])
result = df.drop_duplicates().compute()
In this particular case dd.DataFrame.drop_duplicates would pull out a medium-sized block of rows, perform the pd.DataFrame.drop_duplicates call and store the (hopefully smaller) result. It would do this for all blocks, concatenate them, and then perform a final pd.DataFrame.drop_duplicates on the concatenated intermediate result. You could also do this with just a for loop. Your case is a bit odd in that you also have a large number of unique elements. This might still be a challenge to compute even with blocked algorithms. Worth a shot though.
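For reference, the plain for-loop variant could look roughly like this (assuming the data was written in table format under the key '/my/data', as in the example above):
import pandas as pd

chunks = []
with pd.HDFStore('myfile.hdf5', mode='r') as store:
    # read only columns A and B, one block at a time, and de-duplicate each block
    for chunk in store.select('/my/data', columns=['A', 'B'], chunksize=1000000):
        chunks.append(chunk.drop_duplicates())

# concatenate the partial results and de-duplicate once more across blocks
result = pd.concat(chunks, ignore_index=True).drop_duplicates()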
Column Store
Alternatively you should consider looking into a storage format that can store your data as individual columns. This would let you collect just the two columns that you need, A and B, rather than having to wade through all of your data on disk. Arguably you should be able to fit 80 million rows into a single Pandas dataframe in memory. You could consider bcolz for this.
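A rough sketch of that idea with bcolz (the calls below are assumptions based on bcolz's ctable interface, and the directory name is a placeholder):
import bcolz
import pandas as pd

# toy frame standing in for the two columns of interest
demo = pd.DataFrame({'A': [1, 1, 2, 3], 'B': [10, 10, 20, 30]})

# one-off conversion: persist the columns as an on-disk bcolz ctable
ct = bcolz.ctable.fromdataframe(demo, rootdir='ab_columns.bcolz')

# later: open the store, pull back just A and B, and de-duplicate in memory
ct = bcolz.open('ab_columns.bcolz', mode='r')
pairs = pd.DataFrame({'A': ct['A'][:], 'B': ct['B'][:]}).drop_duplicates()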
To be clear, you tried something like this and it didn't work?
import pandas
import tables
import pandasql
check that your store is the type you think it is:
in: store
out: <class 'pandas.io.pytables.HDFStore'>
You can select a table from a store like this:
df = store.select('tablename')
Check that it worked:
in: type(df)
out: pandas.core.frame.DataFrame
Then you can do something like this:
q = """SELECT DISTINCT region, segment FROM tablename"""
distinct_df = (pandasql.sqldf(q, locals()))
(note that you will get deprecation warnings doing it this way, but it works)
