Pandas dataframe to_csv - split into multiple output files - python

What is the best/easiest way to split a very large data frame (50GB) into multiple outputs (horizontally, i.e. row-wise)?
I thought about doing something like:
stepsize = int(1e8)
for id, i in enumerate(range(0, len(df), stepsize)):
    start = i
    end = i + stepsize  # iloc slicing is end-exclusive, so chunks don't overlap
    df.iloc[start:end].to_csv('/data/bs_' + str(id) + '.csv.out')
But I bet there is a smarter solution out there?
As noted by jakevdp, HDF5 is a better way to store huge amounts of numerical data; however, it doesn't meet my business requirements.

This answer brought me to a satisfying solution using:
numpy.array_split(object, number_of_chunks)
import numpy as np

for idx, chunk in enumerate(np.array_split(df, number_of_chunks)):
    chunk.to_csv(f'/data/bs_{idx}.csv')

Make sure the chunk index ends up in the filename; without it the snippet does not work as intended:
for id, df_i in enumerate(np.array_split(df, number_of_chunks)):
    df_i.to_csv('/data/bs_{id}.csv'.format(id=id))
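If you prefer the slicing approach from the question, here is a minimal sketch of the same idea using iloc (assuming df already fits in memory; chunk_size is whatever row count you want per output file):

import pandas as pd

chunk_size = 1_000_000  # rows per output file, adjust as needed
for i, start in enumerate(range(0, len(df), chunk_size)):
    df.iloc[start:start + chunk_size].to_csv(f'/data/bs_{i}.csv')

Unlike np.array_split, this keeps every chunk at exactly chunk_size rows except the last one.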

Related

Quickest way to access & compare huge data in Python

I am a newbie to Pandas, and somewhat of a newbie to Python.
I am looking at stock data, which I read in as CSV; the typical size is 500,000 rows.
Each row has the usual stock price fields (Open, High, Low, and so on) plus a datetime string.
I need to check the data against itself - the basic algorithm is a loop similar to:
Row = 0
x = get "Low" price in row ROW
y = CalculateSomething(x)
go through the rest of the data, compare against y
if (a):
    append ("A") at the end of row ROW  # in the dataframe
else:
    print ("B") at the end of row ROW
Row = Row + 1
On the next iteration, the data pointer should reset to row 1 and go through the same process again; each time, it adds a note to the dataframe at the ROW index.
I looked at Pandas and figured the way to try this would be to use two loops, copying the dataframe to maintain two separate instances.
The actual code looks like this (simplified):
import pandas as pd

df = pd.read_csv('data.csv')
calc1 = 1  # this part is confidential so set to something simple
calc2 = 2  # this part is confidential so set to something simple

def func3_df_index(df):
    dfouter = df.copy()
    for outerindex in dfouter.index:
        dfouter_openval = dfouter.at[outerindex, "Open"]
        for index in df.index:
            if df.at[index, "Low"] <= calc1 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message 1"
                break
            elif df.at[index, "High"] >= calc2 and index >= outerindex:
                dfouter.at[outerindex, 'notes'] = "message2"
                break
            else:
                dfouter.at[outerindex, 'notes'] = "message3"
This method is taking a long time (7+ minutes) per 5K rows, which will be far too long for 500,000 rows; some data may exceed 1 million rows.
I have tried the two-loop method with the following variants:
using iloc - e.g. df.iloc[index, 2]
using at - e.g. df.at[index, "low"]
using numpy & at - e.g. df.at[index, "low"] = np.where((df.at[index, "low"] < ..."
The data is floating point values plus a datetime string.
Is it better to use numpy? Or is there an alternative to using two loops?
Other methods - R, Mongo, some other database, etc., i.e. not Python at all - would also be useful; I just need the results and am not necessarily tied to Python.
Any help and constructs would be greatly appreciated.
Thanks in advance
You are copying the dataframe and manually looping over the indices. This will almost always be slower than vectorized operations.
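As a rough illustration only (the real conditions are confidential, so calc1/calc2 below are the same placeholders as in the question), the whole forward scan can be vectorized with numpy: for each row, find the first row at or after it where either condition fires and label accordingly:

import numpy as np
import pandas as pd

calc1, calc2 = 1, 2  # placeholders, as in the question

def annotate(df):
    low_hit = df['Low'].to_numpy() <= calc1
    high_hit = df['High'].to_numpy() >= calc2
    hit = low_hit | high_hit
    n = len(df)
    # position of the first hit at or after each row (n means "no hit")
    idx = np.where(hit, np.arange(n), n)
    first_hit = np.minimum.accumulate(idx[::-1])[::-1]
    safe = np.clip(first_hit, 0, n - 1)  # keep indexing valid where there is no hit
    out = df.copy()
    out['notes'] = np.where(first_hit == n, 'message3',
                            np.where(low_hit[safe], 'message 1', 'message2'))
    return out

This does a single pass over the data instead of a nested loop, so it should scale to the 500,000+ row case.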
If you only care about one row at a time, you can simply use the csv module.
numpy is not "better" here; pandas uses numpy internally.
Alternatively, load the data into a database - sqlite, mysql/mariadb, postgres, or maybe DuckDB - and run queries against that. This has the added advantage of type conversion from strings to floats, which makes numerical analysis easier.
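For example, a minimal sqlite sketch (the file name stocks.db, the table name prices, and the column names are assumptions based on the question):

import sqlite3
import pandas as pd

conn = sqlite3.connect('stocks.db')  # hypothetical database file
df = pd.read_csv('data.csv')
df.to_sql('prices', conn, if_exists='replace', index=False)

# example query against the stored data; adjust column names to your CSV
low_rows = pd.read_sql('SELECT * FROM prices WHERE Low <= 1', conn)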
If you really want to process the file in parallel directly from Python, you could move to Dask or PySpark, although Pandas should work with some tuning; Pandas' read_sql function would work better, for a start.
You could split the main dataset into smaller datasets, e.g. 50 sub-datasets with 10,000 rows each, to increase speed. Run your functions over each sub-dataset using threading or another form of concurrency, then combine the final results, roughly as sketched below.
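A rough sketch of that suggestion with concurrent.futures (process_part is a placeholder for your own per-chunk calculation; note this is only correct if the work on one chunk does not need to look at rows in other chunks):

from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

def process_part(part):
    # placeholder for the per-chunk analysis; must not depend on rows outside `part`
    return part

if __name__ == '__main__':
    df = pd.read_csv('data.csv')
    parts = np.array_split(df, 50)  # 50 sub-datasets, as in the suggestion
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(process_part, parts))
    final = pd.concat(results)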

Pandas Regex: Read specific columns only from csv with regex patterns

Given a large CSV file (large enough to exceed RAM), I want to read only specific columns following some patterns. The columns can be any of the following: S_0, S_1, ..., D_1, D_2, etc.
The regex pattern would be, for example, any column that starts with S: S_\d.*.
Now, how do I apply this with pd.read_csv(/path/, __) to read only the specific columns mentioned?
You can first read a few rows and use DataFrame.filter to get the matching columns:
cols = pd.read_csv('path', nrows=10).filter(regex=r'S_\d*').columns
df = pd.read_csv('path', usecols=cols)
Took the same approach (as of now) as mentioned in the comments. Here is the detailed piece I used:
import re

def extract_col_names(all_cols, pattern):
    result = []
    for col in all_cols:
        if re.match(pattern, col):
            result.append(col)
    return result

extract_col_names(cols, pattern=r"S_\d+")
And it works!
But without this work-around, even loading all the columns is heavy in itself. So, is there any method to apply regex patterns at the time of reading the CSV? This still remains a question.
Thanks for the response :)
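For the remaining question: pd.read_csv's usecols argument also accepts a callable that is evaluated against each column name, so the pattern can be applied while reading, without a separate header pass (a sketch reusing the S_\d pattern and the placeholder path from above):

import re
import pandas as pd

pattern = re.compile(r'S_\d')
df = pd.read_csv('path', usecols=lambda name: bool(pattern.match(name)))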

Efficiently filtering the oldest record in a dataframe according to each id

I have a dataframe with the following structure:
For each q_id there may be multiple ph_id values, each with a different ph_date.
I want to make a new dataframe out of it such that, for each q_id, there is just one ph_id: the oldest one (the one with the minimum ph_date).
I tried the following code but I think it is computationally slow:
def oldest_ph(q_id):
    return a.loc[a.ph_date == a[a['q_id'] == q_id].ph_date.min(), 'ph_date']

b['oldest_date'] = a['q_id'].apply(lambda x: oldest_ph(x))
Is there a better way to do this?
First, let's extract the oldest ph_id for each q_id, then use map:
s = df.sort_values('ph_date').drop_duplicates('q_id').set_index('q_id')
df['ph_id'] = df['q_id'].map(s['ph_id'])
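If you want the oldest row per q_id as its own dataframe rather than mapping back onto the original, a groupby/idxmin sketch (assuming ph_date is already a datetime column):

# one row per q_id: the one with the minimum ph_date
oldest = df.loc[df.groupby('q_id')['ph_date'].idxmin()]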

How to access the Nth chunk using Pandas?

I'm sure most of you might find this basic, but I'm somehow finding it very confusing to work out how to access a particular chunk in pandas and append it later. I know how to append the set, but I don't know how to identify the data belonging to a given chunk.
For example, imagine my table has 36,000 records and I chunk it by 1,200; now I want to access just the 3rd chunk. How do I achieve that in pandas? I googled extensively but found no good results.
for df in pd.read_sql_query('select id from table;', conn, chunksize=1200):
    print(df)
Pandas - Slice Large Dataframe in Chunks
Thank you for pointing out this link. The fix is pretty simple!
df = pd.read_sql_query('select * from x', conn)
chunksize = 100
new_df = [df[i: i + chunksize] for i in range(0, df.shape[0], chunksize)]
Now if you index into the list and print it, you can see each chunk:
new_df[0]  # prints the data in the 1st chunk
new_df[1]  # prints the data in the 2nd chunk
new_df[2]  # prints the data in the 3rd chunk
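If you would rather not load the whole table first, you can also pull just the Nth chunk straight from the chunked reader with itertools (a sketch; conn is the same connection as above):

from itertools import islice
import pandas as pd

chunks = pd.read_sql_query('select id from table;', conn, chunksize=1200)
third_chunk = next(islice(chunks, 2, 3))  # chunks are 0-indexed, so 2 is the 3rd chunk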

How to write a pandas DataFrame to a CSV by fixed size chunks

I need to output data from pandas into CSV files to interact with a 3rd party developed process.
The process requires that I pass it no more than 100,000 records in a file, or it will cause issues (slowness, perhaps a crash).
That said, how can I write something that takes a dataframe in pandas and splits it into 100,000-record frames? Nothing would be different other than the exported dataframes being subsets of the parent dataframe.
I assume I could do a loop with something like this, but I suspect it would be remarkably inefficient:
First, taking recordcount = len(df.index) to get the number of records, and then looping until I get there using something like
df1 = df[currentrecord:currentrecord + 100000]
and then exporting that to a CSV file.
There has to be an easier way.
You can try something like this:
def save_df(df, chunk_size=100000):
    df_size = len(df)
    for i, start in enumerate(range(0, df_size, chunk_size)):
        df[start:start + chunk_size].to_csv('df_name_{}.csv'.format(i))
You could add a column with a group number and then use groupby:
df1['Dummy'] = [a for b in zip(*[range(N)] * 100000) for a in b][:len(df1)]
where N is set to a value large enough, the minimum being:
N = int(np.ceil(len(df1) / 100000))
Then group by that column and apply to_csv() to each group:
def save_group(df):
    df.drop('Dummy', axis=1).to_csv("Dataframe_" + str(df['Dummy'].iloc[0]) + ".csv")

df1.groupby('Dummy').apply(save_group)
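The dummy column can also be built with integer division, which does the same grouping a bit more directly (a sketch, assuming the same 100,000-row limit):

import numpy as np

df1['Dummy'] = np.arange(len(df1)) // 100000  # 0 for the first 100k rows, 1 for the next, and so on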
