collect() slowing the process down, is there any alternative? - python

This is my code:
def fun(df, file):
    symbol = df.select(df.SMBL).distinct().collect()
    for i in symbol:
        csv_data = df.filter(df.SMBL == i.SMBL)
        # one output directory per symbol
        csv_data.write.csv('%s/%s' % (BUCKET_PATH, i.SMBL))
Using collect() slows the process down. How can I access the values of the column 'SMBL' without using collect()?

As far as I understand, you are trying to write files whose names are based on the SMBL dataframe column. I suggest writing the dataframe with partitionBy(), specifying that column. You may need a user-defined function based on the SMBL column to get the exact partition naming you want.
This way you don't need to call collect() before the write action.
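A minimal sketch of that approach, reusing BUCKET_PATH and the SMBL column from the question (the overwrite mode is an assumption, adjust as needed):
# single distributed write; Spark creates one sub-directory per distinct SMBL value
df.write \
    .partitionBy('SMBL') \
    .mode('overwrite') \
    .csv(BUCKET_PATH)
This produces paths like BUCKET_PATH/SMBL=<value>/part-....csv in one pass, without collect() or a Python-side loop.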

Related

Performance issue with Pandas when aggregating using a custom defined function

I have a database of over 500,000 records and 140 MB (when stored as CSV). Pandas takes about 1.5 seconds to load it, date parsing included, which is not a problem at all. Now, I have a Python program that is continuously creating more records, which I want to add to the database (I also remove older records, so the database has a fairly stable size). And I'm facing a performance issue, as adding the new records takes longer than the process that creates them.
For adding these new records, I basically merge the freshly obtained Dataframe with the one that contains the database, which is loaded from a CSV file, i.e.:
# read the database
old_df = pd.read_csv('database.csv',
                     index_col=False,
                     parse_dates=['date'],
                     dtype=dtypes)
# some process produces new_df
# I merge them by just concatenating
merged = pd.concat([old_df, new_df])
This step is fast too, so no problem so far. Perhaps it's worth noting that new_df is tiny compared to old_df: typically fewer than 10 new records are added each time.
Now, a particularity of this database is that some of the new records are supposed to replace their counterparts in the database, i.e. they do not just grow it but also update it. (The details are not important for the problem, but for a bit of context: the database keeps a memory of previous failures in the column type, which can be either 'success' or 'failed' and corresponds to attempts to get the file stored in the column file. This way, when a later attempt of the program succeeds, the record for the failure is replaced by the success.)
The replacement consists of grouping the database by the column file, so each file is unique. Once grouped, I need to aggregate to define a value for type, keeping just one record per file. And my problem is that the aggregation is done through a user-defined function that has become the bottleneck of the program.
This code:
merged = merged.groupby('file', as_index=False).agg({'type': 'last'})
runs in less than a second, whereas this:
def keep_success(x):
    """! Auxiliary function to keep `success` if it exists."""
    if (x == "success").any():
        return 'success'
    else:
        return x.iloc[-1]

merged = merged.groupby('file', as_index=False).agg({'type': keep_success})
takes more than a minute. So far I was using 'last', but a change in my program means that sometimes 'success' comes before 'failed', so I need to account for the unknown order of these two values.
TL;DR: I need a FAST way to aggregate records in a dataframe that share the same file value, keeping just the value 'success' for the column type if there is any occurrence of this value within the group, and 'failed' otherwise.
EDIT to add my guess:
I think the problem is in the string comparison. The program has to go through ALL the database making trivial/useless comparisons that systematically fail. To replace about 10 records, we need to check the equality of over 500,000 strings. Can I work around this by taking advantage of what I know, i.e. that most records, once grouped, are unique, so we do not need to do anything with them?
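For what it's worth, here is a hedged sketch of one vectorized alternative (not from the original post; it assumes the merged frame built above and uses numpy for the conditional replacement): flag the groups that contain at least one 'success' with a single vectorized comparison, overwrite type inside those groups, and then fall back to the cheap built-in 'last' aggregation.
import numpy as np

# True for every row whose file group contains at least one 'success'
has_success = (merged['type'] == 'success').groupby(merged['file']).transform('any')
# inside those groups every row becomes 'success', so 'last' then returns 'success'
merged['type'] = np.where(has_success, 'success', merged['type'])
merged = merged.groupby('file', as_index=False).agg({'type': 'last'})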

How to read 1 record from a json file using pandas

In this program I'd like to randomly choose one word from a json file
[My code]
I successfully opened the file but I don't know how to access only one record inside "randwords".
Thanks!!!
Try this:
randwords.sample(1)
See:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html
You can access a pandas dataframe (here called pd) with an operation called indexing. For example, pd['data'] will give you access to the data column. Refer to here for more information.
One specific function you can benefit from here is iloc. For example, pd.iloc[0] will give you access to the first row. Then you can specify which column you are interested in by calling the appropriate column name:
pd.iloc[0].data
This will return the data column of the first row. Using a random number instead of 0 will obviously give you a random row.
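Putting both suggestions together, a hedged sketch (the file name words.json is a guess, since the actual code was replaced by [My code] above, and the 'randwords' key comes from the question text):
import pandas as pd

randwords = pd.read_json('words.json')            # hypothetical file name
word = randwords['randwords'].sample(1).iloc[0]   # one random word
print(word)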

Pandas read_csv only returns the first column when column names are duplicated

I have OHLC data in a .csv file where the stock name is repeated in the header rows, like this:
M6A=F, M6A=F,M6A=F, M6A=F, M6A=F
Open, High, Low, Close, Volume
I am using pandas read_csv to load it, and I want to pass all (and only) the 'M6A=F' columns to FastAPI. So far nothing I do will get all the columns: I either get the first column if I filter with "usecols=", or the last column if I filter with "names=".
For speed, I don't want to load the entire .csv file and then dump the unwanted data, so I need to filter before extracting the data.
Here is my code example:
import json
import pandas as pd
from fastapi import FastAPI

app = FastAPI()
symbol = ['M6A=F']
df = pd.read_csv('myOHCLVdata.csv', skipinitialspace=True, usecols=lambda x: x in symbol)

def parse_csv(df):
    res = df.to_json(orient="records")
    parsed = json.loads(res)
    return parsed

@app.get("/test")
def historic():
    return parse_csv(df)
What I have done so far:
I checked the documentation for pandas.read_csv and it says "names=" will not allow duplicates.
I use a lambda in the above code to prevent the symbol from hanging FastAPI if it does not match a column.
My understanding from other Stack Overflow questions on this is that mangle_dupe_cols=True should be renaming the duplicates to M6A=F.1, M6A=F.2, M6A=F.3 etc. when pandas reads the file into a dataframe, but that isn't happening; I tried setting it to False, but it says that is not implemented yet.
And answers like the one I found in this Stack Overflow solution don't seem to tally with what is happening in my code, since I am only getting the first column returned, or the last column with the others overwritten. (I included the FastAPI code here as it might be related to the issue or to a workaround.)
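In case it helps, a hedged sketch of one way around the duplicate headers (not a verified fix; it assumes the two header rows shown in the sample above and reads them as a MultiIndex so the duplicates never collide):
import pandas as pd

symbol = 'M6A=F'
# read both header rows: level 0 is the symbol, level 1 is Open/High/Low/Close/Volume
df = pd.read_csv('myOHCLVdata.csv', header=[0, 1], skipinitialspace=True)
# keep every column whose symbol level matches, duplicates included
df = df.loc[:, df.columns.get_level_values(0).str.strip() == symbol]
# flatten back to plain Open/High/Low/Close/Volume column names
df.columns = df.columns.get_level_values(1)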

Using checkpointed dataframe to overwrite table fails with FileNotFoundException

I have some dataframe df in pySpark, which results from calling:
df = spark.sql("select A, B from org_table")
df = df.stuffIdo
I want to overwrite org_table at the end of my script.
Since overwriting input tables is forbidden, I checkpointed my data:
sparkContext.setCheckpointDir("hdfs:/directoryXYZ/PrePro_temp")
checkpointed = df.checkpoint(eager=True)
The lineage should be broken now, and I can also see my checkpointed data with checkpointed.show() (this works). What does not work is writing the table:
checkpointed.write.format('parquet') \
    .option("checkpointLocation", "hdfs:/directoryXYZ/PrePro_temp") \
    .mode('overwrite').saveAsTable('org_table')
This results in an error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://org_table_path/org_table/part-00081-4e9d12ea-be6a-4a01-8bcf-1e73658a54dd-c000.snappy.parquet
I have tried several things like refreshing the org_table before doing the writing etc., but I'm puzzled here. How can I solve this error?
I would be careful with such operations, where the transformed input is the new output. The reason is that you can lose your data in case of any error. Imagine that your transformation logic was buggy and you generated invalid data, but you only noticed it a day later. Moreover, to fix the bug you cannot use the data you've just transformed; you need the data from before the transformation. What do you do to make the data consistent again?
An alternative approach, sketched below, would be:
exposing a view
at each batch, writing a new table and, at the end, only replacing the view with this new table
after some days, also scheduling a cleaning job that deletes the tables from the last X days
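A hedged sketch of that view-based pattern (the table and view names and the timestamp suffix are illustrative, not from the answer):
from datetime import datetime

# write each batch to its own table ...
batch_table = 'org_table_%s' % datetime.now().strftime('%Y%m%d_%H%M%S')
df.write.format('parquet').saveAsTable(batch_table)
# ... then repoint the view that consumers read from
spark.sql('CREATE OR REPLACE VIEW org_table_view AS SELECT * FROM %s' % batch_table)
# a periodic cleanup job can later DROP TABLE the old org_table_* batches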
If you want to stay with your solution, why not simply do the following instead of dealing with checkpointing?
df.write.mode('overwrite').parquet("hdfs:/directoryXYZ/PrePro_temp")
spark.read.parquet("hdfs:/directoryXYZ/PrePro_temp").write.format('parquet').mode('overwrite').saveAsTable('org_table')
Of course you will read the data twice, but it looks less hacky than the version with checkpointing. Moreover, you could store your "intermediate" data in a different directory every time, and thanks to that you can address the issue I described at the beginning: even if you had a bug, you could still restore a valid version of the data by simply choosing a good directory and writing it to org_table with .write.format(...).

Best way to store BigQuery dataset location as variable - python

I currently have a function that reads an SQL file to execute a query on Google's BigQuery.
import pandas as pd
def func1(arg1, arg2):
    with open('query.sql', 'r') as sqlfile:
        sql_query = sqlfile.read()
    df = pd.read_gbq(sql_query.format(arg1=arg1, arg2=arg2))
    return df
query.sql
SELECT *
FROM bigquery.dataset
WHERE col1= {arg1}
AND col2 = {arg2}
The dataset location is hardcoded in the SQL file itself, which makes it hard to change (i.e. I would have to go to each SQL file individually and manually edit its FROM clause; since I have many SQL files, doing this by hand becomes cumbersome).
So my question is: what is the best way to make the dataset location dynamic?
Ideally, the dataset location should be a variable, but the question is where to place that variable. If it is a variable, is it better to pass it in as a function argument, i.e. so that func1 has one more argument, called dataset_loc?
import pandas as pd
def func1(arg1, arg2, dataset_loc):
    with open('query.sql', 'r') as sqlfile:
        sql_query = sqlfile.read()
    df = pd.read_gbq(sql_query.format(arg1=arg1, arg2=arg2, dataset_loc=dataset_loc))
    return df
query.sql
SELECT *
FROM {dataset_loc}
WHERE col1 = {arg1}
AND col2 = {arg2}
I would like to know the best way to go about doing this. Thank you.
If you are using the same functions to operate on different datasets, it is good practice to make the function "dataset agnostic", i.e. to pass the dataset as a parameter. For me, your second example is the right approach.
Also, keep in mind that your application might be small now, but you should prepare it for scaling up in the future. And you definitely don't want to have to write the same SQL query file for every one of your datasets.
It depends on your use case, but as a general rule it is recommended to manage the parameters of an application outside of the code. Config files are used for this, and as you are using Python, take a look at this Python library, which is useful for reading them.
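For illustration only, a hedged sketch with the standard-library configparser module (the file name settings.ini and the section/key names are made up, and configparser is just one option, not necessarily the library linked above):
import configparser

# settings.ini might contain:
# [bigquery]
# dataset = my_project.my_dataset
config = configparser.ConfigParser()
config.read('settings.ini')
dataset_loc = config['bigquery']['dataset']

df = func1('value1', 'value2', dataset_loc)   # reuses func1 from the second example above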
