Best practice for loading large dataset and using dask.delayed - python

I have a csv file of 550,000 rows of text. I read it into a pandas dataframe, loop over it, and perform some operation on it. Here is some sample code:
import pandas as pd

def my_operation(row_str):
    # perform operation on row_str to create new_row_str
    return new_row_str

df = pd.read_csv('path/to/myfile.csv')
results_list = []
for ii in range(df.shape[0]):
    my_new_str = my_operation(df.iloc[ii, 0])
    results_list.append(my_new_str)
I started to implement dask.delayed but after reading the Delayed Best Practices section, I am not sure I am using dask.delayed in the most optimal way for this problem. Here is the same code with dask.delayed:
import pandas as pd
import dask

def my_operation(row_str):
    # perform operation on row_str to create new_row_str
    return new_row_str

df = pd.read_csv('path/to/myfile.csv')
results_list = []
for ii in range(df.shape[0]):
    my_new_str = dask.delayed(my_operation)(df.iloc[ii, 0])
    results_list.append(my_new_str)
results_list = dask.compute(*results_list)
I'm running this on a single machine with 8 cores. Is there a more optimal way to load this large dataset and perform the same operation over each of the rows?
Thanks in advance for your help and let me know what else I can provide!
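As an aside, here is a hedged sketch (not an answer from the thread) of two common ways to cut down the task count, since one delayed call per row means 550,000 tiny tasks: batch rows per delayed call, or let dask.dataframe handle the chunking. my_operation below is a stand-in, because the real operation isn't shown.
import pandas as pd
import dask
import dask.dataframe as dd

def my_operation(row_str):
    # stand-in for the real per-row operation
    return row_str.upper()

# Option 1: batch the rows so each delayed call handles ~1000 rows instead of one.
df = pd.read_csv('path/to/myfile.csv')
col = df.iloc[:, 0]

def process_batch(strings):
    return [my_operation(s) for s in strings]

batches = [col.iloc[i:i + 1000].tolist() for i in range(0, len(col), 1000)]
delayed_results = [dask.delayed(process_batch)(b) for b in batches]
results_list = [r for batch in dask.compute(*delayed_results) for r in batch]

# Option 2: let dask.dataframe do the chunked reading and apply per partition.
ddf = dd.read_csv('path/to/myfile.csv')
results = ddf.map_partitions(lambda part: part.iloc[:, 0].map(my_operation)).compute()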

Related

UCI dataset: How to extract features and change the data into usable format after reading the data on python

I am looking to apply some ml algorithms on the data set from https://archive.ics.uci.edu/ml/datasets/University.
I noticed that the data is unstructured. I want the features as the columns and the observations as the rows, so I need help parsing this dataset.
Any help will be appreciated. Thanks.
Below is what I have tried:
column_names = ["University-name"
,"State"
,"location"
,"Control"
,"number-of-students"
,"male:female (ratio)"
,"student:faculty (ratio)",
"sat-verbal"
,"sat-math"
,"expenses"
,"percent-financial-aid"
,"number-of-applicants"
,"percent-admittance"
,"percent-enrolled"
,"academics"
,"social"
,"quality-of-life"
,"academic-emphasis"]
data_list =[]
data = ['https://archive.ics.uci.edu/ml/machine-learning-
databases/university/university.data','https://archive.ics.uci.edu/ml/machine-
learning-databases/university/university.data',...]'
for file in in data:
df = pd.read_csv(file, names = column_names)
data_list.append(df)
The data is not structured in a way you can parse it using pandas. Do something like this:
import re
import requests
import pandas as pd

data = "https://archive.ics.uci.edu/ml/machine-learning-databases/university/university.data"
data = requests.get(data)
temp = data.text

fdic = {'def-instance': [], 'state': []}
for col in fdic.keys():
    fdic[col].extend(re.findall(f'\({col} ([^\\\n)]*)', temp))

pd.DataFrame(fdic)
The output is a DataFrame with the def-instance and state columns.
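If more attributes are needed, the same regex approach should extend by adding keys. Below is a rough sketch: the attribute names are assumptions based on the dataset description and should be checked against the raw file, and the padding only fixes unequal list lengths, it does not realign records that omit an attribute.
import re
import requests
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/university/university.data"
text = requests.get(url).text

# Attribute names are guesses from the dataset page; verify against the raw file.
attributes = ['def-instance', 'state', 'control', 'no-of-students', 'sat-verbal', 'sat-math']
fdic = {attr: re.findall(rf'\({attr} ([^\\\n)]*)', text) for attr in attributes}

# Records that omit an attribute make the lists unequal in length; pad so the
# DataFrame constructor accepts them (this does NOT realign rows across columns).
max_len = max(len(v) for v in fdic.values())
fdic = {k: v + [None] * (max_len - len(v)) for k, v in fdic.items()}
df = pd.DataFrame(fdic)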

Trouble merging dask dataframes

I have several .pcap files whose data I want to write into one large dask dataframe. Currently, my code initializes a dask dataframe using data from the first file. It is then supposed to process the rest of the pcap files and add them to that dask dataframe using merge/concat. However, when I check the number of rows of the merged dask dataframe, it doesn't increase. What is happening?
I am also not sure if I am using the right approach for my use case. I am trying to convert my entire dataset into a giant dask dataframe and write it out to an h5 file. My computer doesn't have enough memory to load the entire dataset, which is why I'm using dask. The idea is to build a dask dataframe that contains the entire dataset so I can do operations on it as a whole. I'm new to dask and I've read over some of the documentation, but I'm still fuzzy about how dask handles loading data from disk instead of memory. I'm also fuzzy about how partitions work in dask; specifically, I'm not sure how chunksize differs from partitions, so I'm having trouble properly partitioning this dataframe. Any tips and advice would be helpful.
As said before, I've read over the main parts of the documentation.
I've tried using dd.merge(dask_df, panda_df) as shown in the documentation. When I initialize the dask dataframe, it starts with 6 rows. When I use merge, the row count decreases to 1.
I've also tried using concat. Again, I have a count of 6 rows during initialization. However, after the concat operations the row count still remains at 6. I would expect the row count to increase.
Here is the initialization function
import os
import sys
import h5py
import pandas as pd
import dask.dataframe as dd
import gc
import pprint
from scapy.all import *

flags = {
    'R': 0,
    'A': 1,
    'S': 2,
    'DF': 3,
    'FA': 4,
    'SA': 5,
    'RA': 6,
    'PA': 7,
    'FPA': 8
}

def initialize(file):
    global flags
    data = {
        'time_delta': [0],
        'ttl': [],
        'len': [],
        'dataofs': [],
        'window': [],
        'seq_delta': [0],
        'ack_delta': [0],
        'flags': []
    }
    scap = sniff(offline=file, filter='tcp and ip')
    for packet in range(0, len(scap)):
        pkt = scap[packet]
        flag = flags[str(pkt['TCP'].flags)]
        data['ttl'].append(pkt['IP'].ttl)
        data['len'].append(pkt['IP'].len)
        data['dataofs'].append(pkt['TCP'].dataofs)
        data['window'].append(pkt['TCP'].window)
        data['flags'].append(flag)
        if packet != 0:
            lst_pkt = scap[packet - 1]
            data['time_delta'].append(pkt.time - lst_pkt.time)
            data['seq_delta'].append(pkt['TCP'].seq - lst_pkt['TCP'].seq)
            data['ack_delta'].append(pkt['TCP'].ack - lst_pkt['TCP'].ack)
    panda = pd.DataFrame(data=data)
    panda['ttl'] = panda['ttl'].astype('float16')
    panda['flags'] = panda['flags'].astype('float16')
    panda['dataofs'] = panda['dataofs'].astype('float16')
    panda['len'] = panda['len'].astype('float16')
    panda['window'] = panda['window'].astype('float32')
    panda['seq_delta'] = panda['seq_delta'].astype('float32')
    panda['ack_delta'] = panda['ack_delta'].astype('float32')
    df = dd.from_pandas(panda, npartitions=6)
    gc.collect()
    return df
Here is the concatenation function
def process(file):
    global flags
    global df
    data = {
        'time_delta': [0],
        'ttl': [],
        'len': [],
        'dataofs': [],
        'window': [],
        'seq_delta': [0],
        'ack_delta': [0],
        'flags': []
    }
    scap = sniff(offline=file, filter='tcp and ip')
    for packet in range(0, len(scap)):
        pkt = scap[packet]
        flag = flags[str(pkt['TCP'].flags)]
        data['ttl'].append(pkt['IP'].ttl)
        data['len'].append(pkt['IP'].len)
        data['dataofs'].append(pkt['TCP'].dataofs)
        data['window'].append(pkt['TCP'].window)
        data['flags'].append(flag)
        if packet != 0:
            lst_pkt = scap[packet - 1]
            data['time_delta'].append(pkt.time - lst_pkt.time)
            data['seq_delta'].append(pkt['TCP'].seq - lst_pkt['TCP'].seq)
            data['ack_delta'].append(pkt['TCP'].ack - lst_pkt['TCP'].ack)
    panda = pd.DataFrame(data=data)
    panda['ttl'] = panda['ttl'].astype('float16')
    panda['flags'] = panda['flags'].astype('float16')
    panda['dataofs'] = panda['dataofs'].astype('float16')
    panda['len'] = panda['len'].astype('float16')
    panda['window'] = panda['window'].astype('float32')
    panda['seq_delta'] = panda['seq_delta'].astype('float32')
    panda['ack_delta'] = panda['ack_delta'].astype('float32')
    # merge version: dd.merge(df, panda)
    dd.concat([df, dd.from_pandas(panda, npartitions=6)])
    gc.collect()
And here is the main program
directory = 'dev/streams/'
files = os.listdir(directory)
df = initialize(directory + files[0])
files.remove(files[0])
for file in files:
    process(directory + file)
print(len(df))
using merge:
print(len(df)) = 1
using concat:
print(len(df)) = 6
expected:
print(len(df)) > 10,000
Try explicitly assigning the result of your dask concat back to df:
df = dd.concat([df, dd.from_pandas(panda, npartitions=6)])
And don't duplicate the exact same blocks of code; encapsulate them in another function:
def process_panda(file_wpath, flags):
    data = {
    [...]
    panda['ack_delta'] = panda['ack_delta'].astype('float32')
    return panda
Then you just have to test if the file to process is the first, so your main code becomes:
import os
import sys
import h5py
import pandas as pd
import dask.dataframe as dd
import gc
import pprint
from scapy.all import *

flags = {
    'R': 0,
    'A': 1,
    'S': 2,
    'DF': 3,
    'FA': 4,
    'SA': 5,
    'RA': 6,
    'PA': 7,
    'FPA': 8
}

directory = 'dev/streams/'
files = os.listdir(directory)
for file in files:
    file_wpath = os.path.join(directory, file)
    panda = process_panda(file_wpath, flags)
    if file == files[0]:
        df = dd.from_pandas(panda, npartitions=6)
    else:
        df = dd.concat([df, dd.from_pandas(panda, npartitions=6)])
    gc.collect()
print(len(df))
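As a hedged alternative sketch building on the process_panda helper above (not part of the original answer): dask.delayed plus dd.from_delayed keeps each pcap's pandas frame lazy, so the full dataset never has to fit in memory at once.
import os
import dask
import dask.dataframe as dd

directory = 'dev/streams/'
files = [os.path.join(directory, f) for f in os.listdir(directory)]

# Build one lazy pandas frame per pcap file; nothing is parsed yet.
delayed_frames = [dask.delayed(process_panda)(f, flags) for f in files]
df = dd.from_delayed(delayed_frames)  # one partition per file; pass meta= to skip schema inference

print(len(df))                        # triggers the per-file work
df.to_hdf('dataset-*.h5', '/data')    # writes one HDF5 file per partition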

How to create dataframes from chunks

I have a huge CSV file (630 million rows) and my computer can't read it into one dataframe (out of memory). Afterwards I want to train a model on each dataframe. I split it into 630 chunks and want to create a dataframe from each chunk (that would be 630 dataframes). I can't find or understand any solution for this situation. Can someone help me, please? Maybe I'm thinking about this wrong in general and someone can offer a new perspective on the situation. Code:
import os
import pandas as pd

lol = 0

def load_csv():
    path = "D:\\mml\\"
    csv_path = os.path.join(path, "eartquaqe_train.csv")
    return pd.read_csv(csv_path, sep=',', chunksize=1000000)

dannie = load_csv()
for chunk in dannie:
    lol = lol + 1
print(lol)
630
Use the pandas.read_csv() method and specify either the chunksize parameter, or create an iterator over all your csv rows using skiprows, like:
import pandas as pd

path = 'D:\...'
a = list(range(0, 6300))
for line in range(0, 6300 - 630, 630):
    df = pd.read_csv(path, skiprows=a[0:line] + a[line + 630:])
    print(df)
OR
import pandas as pd

path = 'D:\...'
df = pd.read_csv(path, chunksize=6300)
for chunk in df:
    print(chunk)
Use -
for chunk in dannie:
    chunk.to_csv('{}.csv'.format(lol))
    lol += 1
Read the pandas documentation on chunked reading with read_csv for more info.
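A hedged alternative in the spirit of the rest of this page: dask.dataframe.read_csv splits the file into partitions by blocksize, and each partition can then be pulled out as a regular pandas dataframe one at a time. The path below is taken from the question; the blocksize is an arbitrary choice.
import dask.dataframe as dd

# Path from the question; blocksize is an arbitrary choice (~64 MB per partition).
ddf = dd.read_csv("D:\\mml\\eartquaqe_train.csv", blocksize="64MB")
print(ddf.npartitions)

# Materialize one partition at a time as a pandas DataFrame and train on it.
for i in range(ddf.npartitions):
    chunk_df = ddf.get_partition(i).compute()
    # fit / update your model on chunk_df here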

Using pandas to efficiently read in a large CSV file without crashing

I am trying to read a .csv file called ratings.csv from http://grouplens.org/datasets/movielens/20m/. The file is 533.4 MB on my computer.
This is what I am writing in a Jupyter notebook:
import pandas as pd
ratings = pd.read_csv('./movielens/ratings.csv', sep=',')
The problem is that the kernel breaks or dies and asks me to restart, and it keeps doing the same thing. There is no error message. Can you please suggest an alternative way of solving this? It is as if my computer is not capable of running this.
This works, but it keeps overwriting:
chunksize = 20000
for ratings in pd.read_csv('./movielens/ratings.csv', chunksize=chunksize):
    ratings.append(ratings)
ratings.head()
Only the last chunk is kept; the others are overwritten.
You should consider using the chunksize parameter in read_csv when reading in your dataframe, because it returns a TextFileReader object you can then pass to pd.concat to concatenate your chunks.
chunksize = 100000
tfr = pd.read_csv('./movielens/ratings.csv', chunksize=chunksize, iterator=True)
df = pd.concat(tfr, ignore_index=True)
If you just want to process each chunk individually, use:
chunksize = 20000
for chunk in pd.read_csv('./movielens/ratings.csv',
                         chunksize=chunksize,
                         iterator=True):
    do_something_with_chunk(chunk)
Try it like this: 1) load with dask and then 2) convert to pandas.
import pandas as pd
import dask.dataframe as dd
import time

t = time.perf_counter()  # time.clock() was removed in Python 3.8
df_train = dd.read_csv('../data/train.csv')
df_train = df_train.compute()
print("load train: ", time.perf_counter() - t)

How to efficiently create a sparse Adjacency matrix from adjacency list?

I am working with last.fm dataset from the Million song dataset.
The data is available as a set of json-encoded text files where the keys are: track_id, artist, title, timestamp, similars and tags.
Using the similars and track_id fields, I'm trying to create a sparse adjacency matrix so that I can do further tasks with the dataset. Following is my attempt. However, it's very slow (especially the to_sparse call, opening and loading all the json files, and slowest of all the apply function I've come up with, even after a few improvements). I'm new to pandas and this is already better than my very first attempt, but I'm sure some vectorisation or other methods would significantly boost the speed and efficiency.
import os
import json
import pandas as pd
import numpy as np

# Path to the dataset
path = "../lastfm_subset/"
# Getting list of all json files in dataset
all_files = [os.path.join(root, file) for root, dirs, files in os.walk(path) for file in files if file.endswith('.json')]
data_list = [json.load(open(file)) for file in all_files]

df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)
a = pd.DataFrame(0, columns=df.index, index=df.index).to_sparse()

def make_graph(adjacent):
    importance = 1 / len(adjacent['similars'])
    neighbors = list(filter(lambda x: x[1] > threshold, adjacent['similars']))
    if len(neighbors) == 0:
        return
    t_id, similarity_score = map(list, zip(*neighbors))
    a.loc[list(t_id), adjacent['track_id']] = importance

df[(df['similars'].str.len() > 0)].reset_index()[['track_id', 'similars']].apply(make_graph, axis=1)
I also believe that the way I read the dataset could be improved and written better.
So, we just need to read the data and then build a sparse adjacency matrix from the adjacency list in an efficient manner.
The similars key holds a list of lists; each inner list is 1x2, containing the track_id of a similar song and its similarity score.
As I am new to this subject, I am open to tips, suggestions and better methods available for any part of tasks like these.
UPDATE 1
After taking input from the comments, here is a slightly better version, though it's still far from acceptable speeds. The good part: the apply function works reasonably fast. However, the list comprehension that opens and loads the json files to build data_list is very slow. Moreover, to_sparse takes forever, so I worked without creating a sparse matrix.
import os
import json
import pandas as pd
import numpy as np

# Path to the dataset
path = "../lastfm_subset/"
# Getting list of all json files in dataset
all_files = [os.path.join(root, file) for root, dirs, files in os.walk(path) for file in files if file.endswith('.json')]
data_list = [json.load(open(file)) for file in all_files]

df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)
df.loc[(df['similars'].str.len() > 0), 'importance'] = 1 / len(df['similars'])  # Update 1
a = pd.DataFrame(df['importance'], columns=df.index, index=df.index)  # .to_sparse(fill_value=0)

def make_graph(row):
    neighbors = list(filter(lambda x: x[1] > threshold, row['similars']))
    if len(neighbors) == 0:
        return
    t_id, similarity_score = map(list, zip(*neighbors))
    a.loc[list(t_id), row['track_id']] = row['importance']

df[(df['similars'].str.len() > 0)].reset_index()[['track_id', 'similars', 'importance']].apply(make_graph, axis=1)
Update 2
Using generator comprehension instead of list comprehension.
data_list=(json.load(open(file)) for file in all_files)
I'm also using ujson to speed up parsing of the json files, as suggested in another question on json parsing performance:
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json
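For what it's worth, here is a hedged sketch (not an answer from the thread) of building the adjacency matrix directly as a scipy.sparse COO matrix, which avoids the row-by-row .loc assignment; it assumes df and threshold as defined in the question.
import numpy as np
from scipy import sparse

track_ids = df.index.tolist()
idx = {tid: i for i, tid in enumerate(track_ids)}   # track_id -> row/col position

rows, cols, vals = [], [], []
for col_tid, similars in df['similars'].items():
    if not similars:
        continue
    importance = 1 / len(similars)
    for sim_tid, score in similars:
        if score > threshold and sim_tid in idx:    # ignore tracks outside the subset
            rows.append(idx[sim_tid])
            cols.append(idx[col_tid])
            vals.append(importance)

adj = sparse.coo_matrix((vals, (rows, cols)), shape=(len(track_ids), len(track_ids))).tocsr()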
