Slow Pandas Series initialization from list of DataFrames - python

I found it to be extremely slow if we initialize a pandas Series object from a list of DataFrames. E.g. the following code:
import pandas as pd
import numpy as np
# creating a large (~8GB) list of DataFrames.
l = [pd.DataFrame(np.zeros((1000, 1000))) for i in range(1000)]
# This line executes extremely slow and takes almost extra ~10GB memory. Why?
# It is even much, much slower than the original list `l` construction.
s = pd.Series(l)
Initially I thought the Series initialization accidentally deep-copied the DataFrames which make it slow, but it turned out that it's just copy by reference as the usual = in python does.
On the other hand, if I just create a series and manually shallow copy elements over (in a for loop), it will be fast:
# This for loop is faster. Why?
s1 = pd.Series(data=None, index=range(1000), dtype=object)
for i in range(1000):
s1[i] = l[i]
What is happening here?
Real-life usage: I have a table loader which reads something on disk and returns a pandas DataFrame (a table). To expedite the reading, I use a parallel tool (from this answer) to execute multiple reads (each read is for one date for example), and it returns a list (of tables). Now I want to transform this list to a pandas Series object with a proper index (e.g. the date or file location used in the read), but the Series construction takes ridiculous amount of time (as the sample code shown above). I can of course write it as a for loop to solve the issue, but that'll be ugly. Besides I want to know what is really taking the time here. Any insights?

This is not a direct answer to the OP's question (what's causing the slow-down when constructing a series from a list of dataframes):
I might be missing an important advantage of using pd.Series to store a list of dataframes, however if that's not critical for downstream processes, then a better option might be to either store this as a dictionary of dataframes or to concatenate into a single dataframe.
For the dictionary of dataframes, one could use something like:
d = {n: df for n, df in enumerate(l)}
# can change the key to something more useful in downstream processes
For concatenation:
w = pd.concat(l, axis=1)
# note that when using with the snippet in this question
# the column names will be duplicated (because they have
# the same names) but if your actual list of dataframes
# contains unique column names, then the concatenated
# dataframe will act as a normal dataframe with unique
# column names

Related

How do you check if all the values in a column in a dataframe exist in another column in another dataframe using Vaex?

I have a dataframe with 160,000 rows and I need to know if these values exist in another column in another different dataframe that has over 7 million rows using Vaex.
I have tried doing this in pandas but it takes way too long to run.
Once I run this code I would like a list or a column that says either "True" or "False" about whether the value exists.
There are few tricks you can do.
Some ideas:
you can try inner join, and then get the list of unique values, which appear in both dataframes. Then you can use the isin method in the smaller dataframe and that list to get your answer.
Dunno if this will work out of the box, but it would be something like:
df_join = df_small.join(df_big, on='key', allow_duplicates=True)
common_samples = df_join[key].tolist()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
# If it is something you gonna reuse a lot, but be worth doing
df_small = df_small.materialize('is_in_df_big') # to put it in memory otherwise it will be lazily recomputed each time you need it.
Similar idea: instead of doing join do something like:
unique_samples = df_small.key.unique()
common_samples = df_big[df_big.key.isin(unique_samples)].key.unqiue()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
I dunno which one would be faster. I hope this at least will lead to some inspiration if not to the full solution.

Correct way of adding new columns/headers to a dataframe

I need to add new columns to a dataframe. Every column has a header and a value across all the rows (the value is the same for all the columns).
Right now im doing something like this:
array_of_new_headers = [...]
for column in array_of_new_headers:
df[column] = 0
As a result I'm getting this message:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many
times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
It tells me to use concat, but, I don't need to concatenate two dataframes really, should I use concat for better performance and better code? To me it doesn't really make sense unless I think of the arrays as also dataframes maybe.
You can pass an unpacked dictionary with keys as column names, and values as value for the columns to pandas.DataFrame.assign :
>>> array_of_new_headers = [...]
>>> df.assign(**{c:0 for c in array_of_new_headers})
But the operation is immutable, so make sure to assign it back to the required variable.
should I use concat for better performance
Beware so-called premature optimization, if your code does work rapidly enough for your needs then you might end simply wasting your time on trying to make it faster.

Converting for loop to numpy calculation for pandas dataframes

So I have a python script that compares two dataframes and works to find any rows that are not in both dataframes. It currently iterates through a for loop which is slow.
I want to improve the speed of the process, and know that iteration is the problem. However, I haven't been having much luck using various numpy methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
# if the row entry is new, we add it to our dataset
if not df_new[new_col_name][i] in set_current:
df_out.loc[len(df_out)] = df_new.iloc[i]
# if the row entry is a match, then we aren't going to do anything with it
else:
continue
# create a xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021||
|456|Knitted Shirt||35|69.99||apple|green|medium|2021||
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a unclear what you want without providing examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe
test_ids = df2['cola_id'].unique().tolist()
the filter the first dataframe for those IDs.
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works - was supplied to me by someone much smarter.
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]

How to compute multiple counts with different conditions on a pyspark DataFrame, fast?

Let's say I have this pyspark Dataframe:
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Function "count" can be replaced by "sum" with condition (Scala):
data.select(
sum(
when(length(col("Country")) === 2, 1).otherwise(0)
).alias("two_characters"),
sum(
when(length(col("Country")) > 2, 1).otherwise(0)
).alias("more_than_two_characters")
)
While one way is to combine multiple queries in to one, the other way is to cache the dataframe that is being queried again and again.
By caching the dataframe, we avoid the re-evaluation each time the count() is invoked.
data.cache()
Few things to keep in mind. If you are applying multiple actions on your dataframe and there are lot of transformations and you are reading that data from some external source then you should definitely cache that dataframe before you apply any single action on that dataframe.
The answer provided by #pasha701 works but you will have to keep on adding the columns based on different country code length value you want to analyse.
You can use the below code to get the count of different country codes all in one single dataframe.
//import statements
from pyspark.sql.functions import *
//sample Dataframe
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])
//adding additional column that gives the length of the country codes
data1 = data.withColumn("CountryLength",length(col('Country')))
//creating columns list schema for the final output
outputcolumns = ["CountryLength","RecordsCount"]
//selecting the countrylength column and converting that to rdd and performing map reduce operation to count the occurrences of the same length
countrieslength = data1.select("CountryLength").rdd.map(lambda word: (word, 1)).reduceByKey(lambda a,b:a +b).toDF(outputcolumns).select("CountryLength.CountryLength","RecordsCount")
//now you can do display or show on the dataframe to see the output
display(countrieslength)
please see the output snapshot that you might get as below :
If you want to apply multiple filter condition on this dataframe, then you can cache this dataframe and get the count of different combination of records based on the country code length.

What's the fastest way to do these tasks?

I originally have some time series data, which looks like this and have to do the following:
First import it as dataframe
Set date column as datetime index
Add some indicators such as moving average etc, as new columns
Do some rounding (values of the whole column)
Shift a column one row up or down (just to manipulate the data)
Then convert the df to list (because I need to loop it based on some conditions, it's a lot faster than looping a df because I need speed)
But now I want to convert df to dict instead of list because I want to keep the column names, it's more convenient
But now I found out that convert to dict takes a lot longer than list. Even I do it manually instead of using python built-in method.
My question is, is there a better way to do it? Maybe not to import as dataframe in the first place? And still able to do Point 2 to Point 5? At the end I need to convert to dict which allows me to do the loop, keep the column names as keys? THanks.
P.S. the dict should look something like this, the format is similar to df, each row is basically the date with the corresponding data.
On item #7: If you want to convert to a dictionary, you can use df.to_dict()
On item #6: You don't need to convert the df to a list or loop over it: Here are better options. Look for the second answer (it says DON'T)

Categories