I am currently performing multiple analysis steps on all the columns of my Pandas dataframe to get a good overview of the data quality and structure (e.g. number of unique values, number of missing values, counts of values per data type int/float/str, ...).
My approach appears rather memory-intensive and inefficient, especially for larger data sets. I would really appreciate your thoughts on how to optimize the current process.
I iterate over all the columns in my dataset and create two dictionaries per column (see below) that hold the relevant information. Since I am checking/testing each row item anyway, would it be reasonable to combine all the checks in one pass? If so, how would you approach the issue? Thank you very much in advance for your support.
data_column = input_dataframe.loc[:,"column_1"] # as example, first column of my dataframe
dictionary_column = {}
unique_values = data_column.nunique()
dictionary_column["unique_values"] = unique_values
na_values = data_column.isna().sum()
dictionary_column["na_values"] = na_values
zero_values = (data_column == 0).astype(int).sum()
dictionary_column["zero_values"] = zero_values
positive_values = (data_column > 0).astype(int).sum()
dictionary_column["positive_values"] = positive_values
negative_values = (data_column < 0).astype(int).sum()
dictionary_column["negative_values"] = negative_values
data_column.dropna(inplace=True)  # drop NaN, otherwise elements will be considered as float
info_dtypes = data_column.apply(lambda x: type(x).__name__).value_counts()
dictionary_data_types = {} # holds the count of the different data types (e.g. int, float, datetime, str, ...)
for index, value in info_dtypes.items():  # items() replaces the deprecated iteritems()
    dictionary_data_types[str(index)] = int(value)
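A hedged sketch of how all of the per-column checks could be collected in one helper and applied to every column in a single pass (the helper name profile_column and the final overview frame are my own; the > 0 / < 0 comparisons assume numeric columns, just like the original code):

import pandas as pd

def profile_column(data_column):
    # all counts are computed with vectorized Series operations
    summary = {
        "unique_values": data_column.nunique(),
        "na_values": int(data_column.isna().sum()),
        "zero_values": int((data_column == 0).sum()),
        "positive_values": int((data_column > 0).sum()),  # assumes numeric data, as in the original
        "negative_values": int((data_column < 0).sum()),
    }
    # count of the Python types among the non-missing values
    type_counts = data_column.dropna().map(lambda x: type(x).__name__).value_counts()
    summary.update({str(k): int(v) for k, v in type_counts.items()})
    return summary

# one row per column of the input dataframe
overview = pd.DataFrame({col: profile_column(input_dataframe[col])
                         for col in input_dataframe.columns}).T

This still visits each column once, but it avoids per-row Python loops except for the type counting, which has no vectorized equivalent.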
I have two sets of data frames split from a big data frame. For example,
import pandas as pd, numpy as np
np.random.seed([3,1415])
ind1 = ['A_p','B_p','C_p','D_p','E_p','F_p','N_p','M_p','O_p','Q_p']
col1 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df1 = pd.DataFrame(np.random.randint(10, size=(10, 7)), columns=col1,index=ind1)
ind2 = ['G_l','I_l','J_l','K_l','L_l','M_l','R_l','N_l']
col2 = ['sap1','luf','tur','sul','sul2','bmw','aud']
df2 = pd.DataFrame(np.random.randint(20, size=(8, 7)), columns=col2,index=ind2)
# Split the dataframes into two parts
pc_1,pc_2 = np.array_split(df1, 2)
lnc_1,lnc_2 = np.array_split(df2, 2)
And now I need to concatenate each split data frame from df1 (pc_1, pc_2) with each split data frame from df2 (lnc_1, lnc_2). Currently, I am doing it as follows:
# concatenate each split data frame pc1 with lnc1
pc1_lnc_1 = pd.concat([pc_1, lnc_1])
pc1_lnc_2 = pd.concat([pc_1, lnc_2])
pc2_lnc_1 = pd.concat([pc_2, lnc_1])
pc2_lnc_2 = pd.concat([pc_2, lnc_2])
On every concatenated data frame I need to run a correlation analysis function, for example,
correlation(pc1_lnc_1)
And I wanted to save the results separately, for example,
pc1_lnc1 = correlation(pc1_lnc_1)
pc1_lnc2 = correlation(pc1_lnc_2)
......
pc1_lnc1.to_csv(output, sep='\t')
The question is whether there is a way to automate the concatenation part with some sort of loop, rather than coding every combination by hand. At the moment I run the correlation function separately on every concatenated data frame, and I have a pretty long list of split data frames.
You can loop over the split dataframes:
for pc in np.array_split(df1, 2):
    for lnc in np.array_split(df2, 2):
        print(correlation(pd.concat([pc, lnc])))
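If the results also need to be written out, as in the question, the same loop can build a file name from the loop counters. A small sketch (it assumes correlation returns a dataframe; the file-name pattern is my own choice):

for i, pc in enumerate(np.array_split(df1, 2), start=1):
    for j, lnc in enumerate(np.array_split(df2, 2), start=1):
        result = correlation(pd.concat([pc, lnc]))
        result.to_csv(f"pc{i}_lnc{j}.tsv", sep="\t")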
Here is another thought,
def correlation(data):
    # do some complex operation..
    return data
# {"pc_1" : split_1, "pc_2" : split_2}
pc = {f"pc_{i + 1}": v for i, v in enumerate(np.array_split(df1, 2))}
lc = {f"lc_{i + 1}": v for i, v in enumerate(np.array_split(df2, 2))}
for pc_k, pc_v in pc.items():
    for lc_k, lc_v in lc.items():
        # (pc_1, lc_1), (pc_1, lc_2) ..
        correlation(pd.concat([pc_v, lc_v])) \
            .to_csv(f"{pc_k}_{lc_k}.csv", sep="\t", index=False)
# will create csv like pc_1_lc_1.csv, pc_1_lc_2.csv.. in the current working dir
If you don't have your individual dataframes in an array (and assuming you have a nontrivial number of dataframes), the easiest way (with minimal code modification) would be to throw an exec/eval into a loop.
Something like
for counter in range(1, n + 1):
    for counter2 in range(1, n + 1):
        exec("pc{}_lnc{}=correlation(pd.concat([pc_{},lnc_{}]))".format(counter, counter2, counter, counter2))
        eval("pc{}_lnc{}.to_csv(filename,sep='\t')".format(counter, counter2))
The standard disclaimer around eval does still apply (don't do it because it's lazy programming practice and unsafe inputs could cause all kinds of problems in your code).
See here for more details about why eval is bad
I am trying to concatenate multiple dataframes using the unionAll function in PySpark.
This is what I do:
from functools import reduce
from pyspark.ml.feature import Normalizer
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

df_list = []
for i in range(something):
    normalizer = Normalizer(inputCol="features", outputCol="norm", p=1)
    norm_df = normalizer.transform(some_df)
    norm_df = norm_df.repartition(320)
    data = index_df(norm_df)
    data.persist()
    mat = IndexedRowMatrix(
        data.select("id", "norm")
            .rdd.map(lambda row: IndexedRow(row.id, row.norm.toArray()))).toBlockMatrix()
    dot = mat.multiply(mat.transpose())
    df = dot.toIndexedRowMatrix().rows.toDF()
    df_list.append(df)

big_df = reduce(unionAll, df_list)
big_df.write.mode('append').parquet('some_path')
I want to do this because the writing part takes time, so it is much faster to write one big file than n small files in my case.
The problem is that when I write big_df and check the Spark UI, there are far too many tasks for the parquet write. My goal is to write ONE big dataframe, but it looks as if all the sub-dataframes are written individually.
Any guesses?
Spark is lazily evaluated.
The write operation is the action that triggers all the previous transformations, so those tasks belong to the transformations, not just to the parquet write.
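A minimal illustration of that behaviour (a sketch assuming an existing SparkSession named spark; the column names and output path are made up):

from pyspark.sql import functions as F

df = spark.range(1_000_000)                          # transformation: nothing runs yet
doubled = df.withColumn("doubled", F.col("id") * 2)  # still lazy
filtered = doubled.filter(F.col("doubled") > 10)     # still lazy
# only this action triggers the whole chain above, so its tasks include
# every upstream transformation, not just the file write itself
filtered.write.mode("overwrite").parquet("some_path")

If the goal is to separate the heavy computation from the write, one option (not part of the original answer) is to persist big_df and force it with a cheap action such as count() before calling write.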
Since I'm new to Python and still learning, I want to select the rows of a dataframe that match a logical condition and add a label to them; however, at the moment this takes several lines of code.
For example:
df = df[(df['this_col'] >= 10) & (df['anth_col'] < 100)]
result_df = df.copy()
result_df['label'] = 'medium'
I wonder if there is a way to do this in one line of code without applying a function. If it cannot be done in one line, why not?
Cheers!
query returns a copy, always.
result_df = df.query("this_col >= 10 and anth_col < 100").assign(label='medium')
Assuming your column names are valid Python identifiers, this will do.
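If the column names are not valid identifiers (spaces, punctuation, etc.), recent pandas versions let you backtick-quote them inside query. A small sketch with made-up column names:

result_df = (df.query("`this col` >= 10 and `anth col` < 100")
               .assign(label='medium'))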
I am attempting to perform multiple operations on a large dataframe (~3 million rows).
Using a small test set representative of my data, I've come up with a solution. However, the script runs extremely slowly when the large dataset is used as input.
Here is the main loop of the application:
def run():
    df = pd.DataFrame(columns=['CARD_NO','CUSTOMER_ID','MODIFIED_DATE','STATUS','LOYALTY_CARD_ENROLLED'])
    foo = input.groupby('CARD_NO', as_index=False, sort=False)
    for name, group in foo:
        if len(group) == 1:
            df = df.append(group)
        else:
            dates = group['MODIFIED_DATE'].values
            if all_same(dates):
                df = df.append(group[group.STATUS == '1'])
            else:
                df = df.append(group[group.MODIFIED_DATE == most_recent(dates)])
    path = ''
    df.to_csv(path, sep=',', index=False)
The logic is as follows:
For each CARD_NO:
- if there is only 1 row with that CARD_NO, add the row to the new dataframe
- if there are > 1 rows with the same CARD_NO, check MODIFIED_DATE:
  - if the MODIFIED_DATEs differ, take the row with the most recent date
  - if all MODIFIED_DATEs are equal, take whichever row has STATUS = 1
The slow-down occurs at each iteration around,
input.groupby('CARD_NO', as_index=False, sort=False)
I am currently trying to parallelize the loop by splitting the groups returned by the above statement, but I'm not sure if this is the correct approach...
Am I overlooking a core functionality of Pandas?
Is there a better, more Pandas-esque way of solving this problem?
Any help is greatly appreciated.
Thank you.
Two general tips:
For looping over a groupby object, you can try apply. For example,
grouped = input.groupby('CARD_NO', as_index=False, sort=False)
grouped.apply(example_function)
Here, example_function is called for each group in your groupby object. You could write example_function to append to a data structure yourself, or if it has a return value, pandas will try to concatenate the return values into one dataframe.
Appending rows to dataframes is slow. You might be better off building some other data structure with each iteration of the loop, and then building your dataframe at the end. For example, you could make a list of dicts.
data = []
grouped = input.groupby('CARD_NO', as_index=False, sort=False)
def example_function(row, data_list):
    row_dict = {}
    row_dict['length'] = len(row)
    row_dict['has_property_x'] = pandas.notnull(row['property_x'])
    data_list.append(row_dict)
grouped.apply(example_function, data_list=data)
pandas.DataFrame(data)
I ended up with a significant improvement in running time. Returning the selected rows from the function passed to apply lets pandas assemble the result for me, bringing the running time down to ~30 minutes for 3 million rows and avoiding the need for a secondary data structure.
def foo(df_of_grouped_data):
    group_length = len(df_of_grouped_data)
    if group_length == 1:
        return df_of_grouped_data
    else:
        dates = df_of_grouped_data['MODIFIED_DATE'].values
        if all_same(dates):
            return df_of_grouped_data[df_of_grouped_data.STATUS == '1']
        else:
            return df_of_grouped_data[df_of_grouped_data.MODIFIED_DATE == most_recent(dates)]

result = card_groups.apply(foo)
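For reference, the same selection can also be expressed without apply, using group-wise transforms to build a single boolean mask. This is a hedged sketch that assumes the same column names as above and replaces most_recent with a plain per-group maximum:

# input is the original dataframe, as in the question
sizes = input.groupby('CARD_NO')['CARD_NO'].transform('size')
n_dates = input.groupby('CARD_NO')['MODIFIED_DATE'].transform('nunique')
latest = input['MODIFIED_DATE'] == input.groupby('CARD_NO')['MODIFIED_DATE'].transform('max')

# keep single-row groups, the STATUS == '1' rows where all dates match,
# and the most recent rows where dates differ
keep = (sizes == 1) \
    | ((n_dates == 1) & (input['STATUS'] == '1')) \
    | ((n_dates > 1) & latest)

result = input[keep]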