Process columns in chunks and write to csv - python

I have a dataframe with around 2.5 million rows and more than 7000 columns(all categorical).
I iterate through each column, dummy the variables and do some processing and concatenate to a final dataframe.
The code is below:
cat_count = 0
df_final = pd.DataFrame()
for each_col in cat_cols:
df_temp = pd.DataFrame()
df_single_col_data = df_data[[each_col]]
cat_count += 1
# Calculate uniques and nulls in each column to display in log file.
uniques_in_column = len(df_single_col_data[each_col].unique())
nulls_in_column = df_single_col_data.isnull().sum()
print('%s has %s unique values and %s null values' %(each_col,uniques_in_column,nulls_in_column[0]))
#Convert into dummies
df_categorical_attribute = pd.get_dummies(df_single_col_data[each_col].astype(str), dummy_na=True, prefix=each_col)
df_categorical_attribute = df_categorical_attribute.loc[:, df_categorical_attribute.var() != 0.0]# Drop columns with 0 variance.
#//// Some data processing code://///
df_final = pd.concat([df_final,df_categorical_attribute],axis = 1)
print ('*'*10 + "\n Variable number %s processed!" %(cat_count))
# Write the final dataframe to a csv
df_final.to_csv('cat_processed.csv')
However, for such large data, df_final hogs up upto 75% of the memory on the server and I would like to reduce the memory footprint of this piece of code.
So what I am thinking is, I will process till the 300th column, write the results to csv. Then again process the next 300 columns, open the csv , write to it and close.
In that way, df_final at a time will hold the results of only 300 columns.
Can someone please help me with this?
Or, if there is any better way to deal with the issue, I would like to implement that too.
Below is some sample data to replicate:
df_data
rev_m1_Transform ov_m1_Transform ana_m1_Transform oov_m1_Transform
0_to_12.95 34.95_to_846.4 65_to_74.95 64.9_to_1239.51
13.95_to_116.55 14.95_to_19.95 45.05_to_60.05 34.9_to_39.95
12.95_to_13.95 19.95_to_29.95 89.95_to_9491.36 54.95_to_59.95
0_to_12.95 0_to_14.95 0_to_29.949999 64.9_to_1239.51
0_to_12.95 19.95_to_29.95 74.95_to_83.9 54.95_to_59.95
0_to_12.95 0_to_14.95 0_to_29.9499 0_to_34.9
0_to_12.95 14.95_to_19.95 45.05_to_60.05 39.95_to_44.9
0_to_12.95 0_to_14.95 0_to_29.949 0_to_34.9
0_to_12.95 19.95_to_29.95 89.95_to_9491.36 54.95_to_59.95
cat_cols is a list with all the column names in df_data
Thanks

Instead of going for columns go for a chunk of rows, then apply processing step on all the columns. For example:
CHUNKSIZE = 1000
for chunk in pd.read_csv("filename.csv", chunksize=CHUNKSIZE):
#Apply processing steps here
processed = process(chunk)
processed.to_csv("final.csv", mode="a")
Use CHUNKSIZE according to your physical memory size (RAM).

Related

How to reduce the time complexity of KS test python code?

I am currently working on a project where i need to compare whether two distributions are same or not. For that i have two data frame both contains numeric values only
db_df - which is from the db
2)data - which is user uploaded dataframe
I have to compare each and every columns from db_df with the data and find the similar columns from data and suggest it to user as suggestions for the db column
Dimensions of both the data frame is 100 rows,239 columns
`
from scipy.stats import kstest
row_list = []
suggestions = dict()
s = time.time()
db_data_columns = db_df.columns
data_columns = data.columns
for i in db_data_columns:
col_list = list()
for j in data_columns:
# perform Kolmogorov-Smirnov test
col_list.append(kstest(
df_db[i], data[j]
)[1])
row_list.append(col_list)
print(f"=== AFTER FOR TIME {time.time()-s}")
df = pd.DataFrame(row_list).T
df.columns = db_df.columns
df.index = data.columns
for i in df.columns:
sorted_df = df.sort_values(by=[i], ascending=False)
sorted_df = sorted_df[sorted_df > 0.05]
sorted_df = sorted_df[:3].loc[:, i:i]
sorted_df = sorted_df.dropna()
suggestions[sorted_df.columns[0]] = list(sorted_df.to_dict().values())[0]
`
After getting all the p-values for all the columns in db_df with the data i need select the top 3 columns from data for each column in db_df
**Overall time taken for this is 14 seconds which is very long. is there any chances to reduce the time less than 5 sec **

Assemble a dataframe from two csv files

I wrote the following code to form a data frame containing the energy consumption and the temperature. The data for each of the variables is collected from a different csv file:
def match_data():
pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
new_time = []
new_pwr = []
new_tmp = []
for i in range(1,len(pwr_data)):
for j in range(1,len(temp_data)):
if pwr_data['time'][i] == temp_data['Date'][j]:
time = pwr_data['time'][i]
pwr = pwr_data['watt_hour'][i]
tmp = temp_data['Temp'][j]
new_time.append(time)
new_pwr.append(pwr)
new_tmp.append(tmp)
return pd.DataFrame({'Time' : new_time,'watt_hour' : new_pwr,'Temp':new_tmp})
I was trying to collect data with matching time indices so that I can assemble them in a data frame.
The code works well but it takes time(43 seconds for around 1300 data points). At the moment I don't have much data but I was wondering if there was a more efficient and faster way to do so
Do the pwr_data['time'] and temp_data['Date] columns have the same granularity?
If so, you can pd.merge() the two dataframes after reading them.
# read data
pwr_data = pd.read_csv(r'C:\\Users\X\Energy consumption per hour-data-2022-03-16 17_50_56_Edited.csv')
temp_data = pd.read_csv(r'C:\\Users\X\temp.csv')
# merge data on time and Date columns
# you can set the how to be 'inner' or 'right' depending on your needs
df = pd.merge(pwr_data, temp_data, how='left', left_on='time', right_on='Date')
Just like #greco recommended this did the trick and in no time!
pd.merge(pwr_data,temp_data,how='inner',left_on='time',right_on='Date')
'time' and Date are the columns on which you want to base the merge.

Iterating through values of one column to get descriptive statistics for another column

I'm trying to get descriptive statistics for a column of data (the tb column which is a list of numbers) for every individual (i.e., each ID). Normally, I'd use a for i in range(len(list)) statement but since the ID is not a number I'm unsure of how to do that. Any tips would be helpful! The code included below gets me descriptive statistics for the entire tb column, instead of for tb data for each individual in the ID list.
df = pd.DataFrame(pd.read_csv("SurgeryTpref.csv")) #importing data
df.columns = ['date', 'time', 'tb', 'ID','before_after'] #column headers
df.to_numpy()
import pandas as pd
# read the data in with
df = pd.read_clipboard(sep=',')
# data
,date,time,tb,ID,before_after
0,6/29/20,4:15:33 PM,37.1,SR10,after
1,6/29/20,4:17:33 PM,38.1,SR10,after
2,6/29/20,4:19:33 PM,37.8,SR10,after
3,6/29/20,4:21:33 PM,37.5,SR10,after
4,6/29/20,4:23:33 PM,38.1,SR10,after
5,6/29/20,4:25:33 PM,38.5,SR10,after
6,6/29/20,4:27:33 PM,38.6,SR10,after
7,6/29/20,4:29:33 PM,37.6,SR10,after
8,6/29/20,4:31:33 PM,35.5,SR10,after
9,6/29/20,4:33:33 PM,34.7,SR10,after
summary=[]
for individual in (ID):
vals= df['tb'].describe()
summary.append(vals)
print(summary)

Efficiently load and manipulate csv using dask DataFrame

I am trying to manipulate the csv-file from https://www.kaggle.com/raymondsunartio/6000-nasdaq-stocks-historical-daily-prices using dask.dataframe. The original dataframe has columns 'date', 'ticker', 'open', 'close', etc...
My goal is to create a new data frame with index 'date' and columns as the closing price of each unique ticker.
The following code does the trick, but is quite slow, using almost a minute for N = 6. I suspect that dask tries to read the CSV-file multiple times in the for-loop, but I don't know how I would go about making this faster. My initial guess is that using df.groupby('ticker') somewhere would help, but I am not familiar enough with pandas.
import dask.dataframe as dd
from functools import reduce
def load_and_fix_csv(path: str, N: int, tickers: list = None) -> dd.DataFrame:
raw = dd.read_csv(path, parse_dates=["date"])
if tickers is None:
tickers = raw.ticker.unique().compute()[:N] # Get unique tickers
dfs = []
for tick in tickers:
tmp = raw[raw.ticker == tick][["date", "close"]] # Temporary dataframe from specific ticker with columns date, close
dfs.append(tmp)
df = reduce(lambda x, y: dd.merge(x, y, how="outer", on="date"), dfs) # Merge all dataframes on date
df = df.set_index("date").compute()
return df
Every kind of help is appreciated!
Thank you.
I'm pretty sure you're right that Dask is likely going "back to the well" for each loop; this is because Dask builds a graph of operations and attempts to defer computation until forced or necessary. One thing I like to do is to cut the reading operations of the graph with Client.persist:
from distributed import Client
client = Client()
def persist_load_and_fix_csv(path: str, N: int, tickers: list = None) -> dd.DataFrame:
raw = dd.read_csv(path, parse_dates=["date"])
# This "cuts the graph" prior operations (just the `read_csv` here)
raw = client.persist(raw)
if tickers is None:
tickers = raw.ticker.unique().compute()[:N] # Get unique tickers
dfs = []
for tick in tickers:
tmp = raw[raw.ticker == tick][["date", "close"]] # Temporary dataframe from specific ticker with columns date, close
dfs.append(tmp)
df = reduce(lambda x, y: dd.merge(x, y, how="outer", on="date"), dfs) # Merge all dataframes on date
df = df.set_index("date").compute()
return df
In a Kaggle session I tested both functions with persist_load_and_fix_csv(csv_path, N=3) and managed to cut the time in half. You'll also get better performance by only keeping the columns you end up using.
(Note: I've found that, at least for me and my code, if I start seeing .compute() crop up in functions that I should step back and reevaluate the code paths; I view it as a code smell)

Pandas append perfomance concat/append using "larger" DataFrames

The problem: I have data stored in csv file with the following columns data/id/value. I have 15 files each containing around 10-20mio rows. Each csv file covers a distinct period so the time indexes are non overlapping, but the columns are (new ids enter from time to time, old ones disappear). What I originally did was running the script without the pivot call, but then I run into memory issues on my local machine (only 8GB). Since there is lots of redundancy in each file, pivot seemd at first a nice way out (roughly 2/3 less data) but now perfomance kicks in. If I run the following script the concat function will run "forever" (I always interrupted manually so far after some time (2h>)). Concat/append seem to have limitations in terms of size (I have roughly 10000-20000 columns), or do I miss something here? Any suggestions?
import pandas as pd
path = 'D:\\'
data = pd.DataFrame()
#loop through list of raw file names
for file in raw_files:
data_tmp = pd.read_csv(path + file, engine='c',
compression='gzip',
low_memory=False,
usecols=['date', 'Value', 'ID'])
data_tmp = data_tmp.pivot(index='date', columns='ID',
values='Value')
data = pd.concat([data,data_tmp])
del data_tmp
EDIT I:To clarify, each csv file has about 10-20mio rows and three columns, after pivot is applied this reduces to about 2000 rows but leads to 10000 columns.
I can solve the memory issue by simply splitting the full-set of ids into subsets and run the needed calculations based on each subset as they are independent for each id. I know it makes me reload the same files n-times, where n is the number of subsets used, but this is still reasonable fast. I still wonder why append is not performing.
EDIT II: I have tried to recreate the file structure with a simulation, which is as close as possible to the actual data structure. I hope it is clear, I didn't spend to much time minimizing simulation-time, but it runs reasonable fast on my machine.
import string
import random
import pandas as pd
import numpy as np
import math
# Settings :-------------------------------
num_ids = 20000
start_ids = 4000
num_files = 10
id_interval = int((num_ids-start_ids)/num_files)
len_ids = 9
start_date = '1960-01-01'
end_date = '2014-12-31'
run_to_file = 2
# ------------------------------------------
# Simulation column IDs
id_list = []
# ensure unique elements are of size >num_ids
for x in range(num_ids + round(num_ids*0.1)):
id_list.append(''.join(
random.choice(string.ascii_uppercase + string.digits) for _
in range(len_ids)))
id_list = set(id_list)
id_list = list(id_list)[:num_ids]
time_index = pd.bdate_range(start_date,end_date,freq='D')
chunk_size = math.ceil(len(time_index)/num_files)
data = []
# Simulate files
for file in range(0, run_to_file):
tmp_time = time_index[file * chunk_size:(file + 1) * chunk_size]
# TODO not all cases cover, make sure ints are obtained
tmp_ids = id_list[file * id_interval:
start_ids + (file + 1) * id_interval]
tmp_data = pd.DataFrame(np.random.standard_normal(
(len(tmp_time), len(tmp_ids))), index=tmp_time,
columns=tmp_ids)
tmp_file = tmp_data.stack().sortlevel(1).reset_index()
# final simulated data structure of the parsed csv file
tmp_file = tmp_file.rename(columns={'level_0': 'Date', 'level_1':
'ID', 0: 'Value'})
# comment/uncomment if pivot takes place on aggregate level or not
tmp_file = tmp_file.pivot(index='Date', columns='ID',
values='Value')
data.append(tmp_file)
data = pd.concat(data)
# comment/uncomment if pivot takes place on aggregate level or not
# data = data.pivot(index='Date', columns='ID', values='Value')
Using your reproducible example code, I can indeed confirm that the concat of only two dataframes takes a very long time. However, if you first align them (make the column names equal), then concatting is very fast:
In [94]: df1, df2 = data[0], data[1]
In [95]: %timeit pd.concat([df1, df2])
1 loops, best of 3: 18min 8s per loop
In [99]: %%timeit
....: df1b, df2b = df1.align(df2, axis=1)
....: pd.concat([df1b, df2b])
....:
1 loops, best of 3: 686 ms per loop
The result of both approaches is the same.
The aligning is equivalent to:
common_columns = df1.columns.union(df2.columns)
df1b = df1.reindex(columns=common_columns)
df2b = df2.reindex(columns=common_columns)
So this is probably the easier way to use when having to deal with a full list of dataframes.
The reason that pd.concat is slower is because it does more. E.g. when the column names are not equal, it checks for every column if the dtype has to be upcasted or not to hold the NaN values (which get introduced by aligning the column names). By aligning yourself, you skip this. But in this case, where you are sure to have all the same dtype, this is no problem.
That it is so much slower surprises me as well, but I will raise an issue about that.
Summary, three key performance drivers depending on the set-up:
1) Make sure datatype are the same when concatenating two dataframes
2) Use integer based column names if possible
3) When using string based columns, make sure to use the align method before concat is called as suggested by joris
As #joris mentioned, you should append all of the pivot tables to a list and then concatenate them all in one go. Here is a proposed modification to your code:
dfs = []
for file in raw_files:
data_tmp = pd.read_csv(path + file, engine='c',
compression='gzip',
low_memory=False,
usecols=['date', 'Value', 'ID'])
data_tmp = data_tmp.pivot(index='date', columns='ID',
values='Value')
dfs.append(data_tmp)
del data_tmp
data = pd.concat(dfs)

Categories