I'm using the following code to import 2 of the 3 columns (trigger and amplitude) from 500 *.txt files:
from glob import glob
import numpy as np
import pandas

dataFileList = glob( '*.txt' )
nbDataSamplesFiles = len(dataFileList)
amplitudes = []
colnames = ['time','trigger','amplitude']
for dataFileName in dataFileList :
    #Method4
    data = pandas.read_csv( dataFileName, delim_whitespace=True, skipinitialspace=True, names = colnames ) #About 4.5 s for 500 files
    trigger1 = data['trigger'].tolist()
    amplitude1 = data.amplitude.tolist() #another way
    amplitudes.append( amplitude1 ) #list of lists
amplitudes = np.asarray( amplitudes ) #matrix nbFiles x nbSamples
It takes about 3.5 seconds to do the job.
I need it to be much faster; is there a way to do this with the same or another Python module, and how can I achieve it?
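For what it's worth, here is a rough sketch of one approach (not from the original post): read only the two columns you need with usecols and spread the per-file reads over a thread pool. It reuses the file list and column names from the snippet above; whether it actually helps depends on your disks and pandas version:

from concurrent.futures import ThreadPoolExecutor
from glob import glob

import numpy as np
import pandas

colnames = ['time', 'trigger', 'amplitude']

def read_amplitude(fname):
    # usecols skips parsing the unused 'time' column
    df = pandas.read_csv(fname, delim_whitespace=True, skipinitialspace=True,
                         names=colnames, usecols=['trigger', 'amplitude'])
    return df['amplitude'].to_numpy()

dataFileList = glob('*.txt')
with ThreadPoolExecutor() as pool:  # threads overlap file I/O; try a process pool if parsing dominates
    columns = list(pool.map(read_amplitude, dataFileList))
amplitudes = np.asarray(columns)    # matrix nbFiles x nbSamples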
UPDATE 1: Using dask
import dask.dataframe as dd

amplitudes = []
for dataFileName in dataFileList :
    df = dd.read_csv(urlpath = dataFileName, delim_whitespace=True, skipinitialspace=True, names = colnames )
    trigger1 = df.trigger.values
    amplitude1 = df.amplitude.values
    amplitudes.append( amplitude1 ) #list of arrays
I want to check the content of amplitude1:
ipdb> amplitude1[111:121]
*** ValueError: ('Arrays chunk sizes are unknown: %s', (nan,))
Any idea?
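That error usually means dask built the array lazily without knowing how many rows each chunk has, so positional slicing is refused. A small sketch of two ways around it (assuming a dask version that has Array.compute_chunk_sizes()):

amplitude1 = df.amplitude.values
amplitude1 = amplitude1.compute_chunk_sizes()  # chunk sizes are now known, so slicing is allowed (still lazy)
print(amplitude1[111:121].compute())

# or simply materialize the whole column as a plain numpy array:
amplitude1 = df.amplitude.values.compute()
print(amplitude1[111:121])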
Dask might be a good option to try for handling a large collection/directory of CSVs - go through the Dask docs for this specific use case.
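For example, a minimal sketch along those lines (reusing the *.txt files and column names from the question; include_path_column keeps track of which file each row came from):

import dask.dataframe as dd

colnames = ['time', 'trigger', 'amplitude']
ddf = dd.read_csv('*.txt', delim_whitespace=True, skipinitialspace=True,
                  names=colnames, include_path_column=True)
result = ddf.compute()                          # one pandas DataFrame covering all files
per_file = {path: g for path, g in result.groupby('path')}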
Related
I'm working on a program which analyses a lot of csv files.
Currently I'm declaring every item manually, but as you can see in my code, I actually just add +1 to my paths and to the variable names.
I guess I can simplify this with a loop; I just don't know how to do that with the path names.
My code:
import pandas as pd
import numpy as np
### declaration ###
df_primes1 = pd.DataFrame()
df_primes1 = np.array(df_primes1)
df_search1 = pd.DataFrame()
df_primes2 = pd.DataFrame()
df_primes2 = np.array(df_primes2)
df_search2 = pd.DataFrame()
df_primes3 = pd.DataFrame()
df_primes3 = np.array(df_primes3)
df_search3 = pd.DataFrame()
searchterm = '322'
### reads csv in numpy array ###
df_primes1 = pd.read_csv('1/1_Primes_32.csv', delimiter=';', header=None, names='1')
df_primes2 = pd.read_csv('1/2_Primes_32.csv', delimiter=';', header=None, names='2')
df_primes3 = pd.read_csv('1/3_Primes_32.csv', delimiter=';', header=None, names='3')
### sorts prime numbers ###
#df_sorted = df_primes1.sort_values(by='n')
#print(df_sorted)
### searches for number with "searchterm" as start value ###
df_search1 = df_primes1[df_primes1['1'].astype(str).str.startswith(searchterm)]['1']
df_search2 = df_primes2[df_primes2['2'].astype(str).str.startswith(searchterm)]['2']
df_search3 = df_primes3[df_primes3['3'].astype(str).str.startswith(searchterm)]['3']
print(df_search1)
print(df_search2)
print(df_search3)
The program works; I just want to know how I can simplify it, because there will be 20+ more files like this.
IIUC, we can use pathlib and a dict comprehension:
from pathlib import Path

import pandas as pd

p = 'Path/to/your_csv/'

dfs = {
    f"search_{i}": pd.read_csv(file, delimiter=";",
                               header=None,
                               names=str(i))
    for i, file in enumerate(Path(p).glob("*Prime*.csv"), 1)
}
To break down each item:
p is the target folder that holds your CSVs.
i is an enumerator used to label your files; you will most likely need to add a pre-step of sorting your CSVs to get the order you're after (see the sketch below).
file is each item returned from the generator object; we turn each one into a dataframe.
You can then grab each dataframe from the collection, i.e.
dfs['search_1']
This will return a dataframe.
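For the sorting pre-step, here is a rough sketch (not from the original answer) that assumes the file names start with a numeric prefix such as 1_Primes_32.csv, 2_Primes_32.csv, and so on:

import re
from pathlib import Path

import pandas as pd

p = 'Path/to/your_csv/'

def numeric_prefix(path):
    # '2_Primes_32.csv' -> 2; falls back to 0 if the name has no leading number
    m = re.match(r"(\d+)", path.stem)
    return int(m.group(1)) if m else 0

files = sorted(Path(p).glob("*Prime*.csv"), key=numeric_prefix)
dfs = {f"search_{i}": pd.read_csv(f, delimiter=";", header=None, names=str(i))
       for i, f in enumerate(files, 1)}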
I am merging several thousands of reasonably sized (~1 million rows) dataframes together on a fairly regular basis.
While I can get pandas to work with read_csv, it is a terrible solution due to the extremely large overhead.
I need a faster solution to this and dask apparently has this multiple csv functionality baked into their read_csv/read_table functions.
However, I haven't noticed much improvement in speed with these solutions.
Is there a way to increase the speed of the following type of process?
import io
import re

import numpy as np
import dask.bag as dbag
import dask.dataframe as ddf


def filter_data(fp, ix_col = 'index_here', val_col = 'some_value'):
    dask_frame = ddf.read_table(fp)
    # filter to only one column and index (like a series)
    series = dask_frame[[ix_col, val_col]].set_index(ix_col)
    # Rename it to be the filename / file_id
    file_id = re.match(r"file_(.+)\.txt", fp)[1]
    series.columns = [file_id]
    return series


def get_dataframe(file_paths):
    # Make a collection
    dasks_bag = dbag.from_sequence(file_paths)
    # Open the files as dask frames and filter each to a series-like frame
    filtered_dfs = dasks_bag.map(filter_data)
    # Compute pandas dataframe on each within the list
    filtered_dfs = filtered_dfs.compute()
    # concatenate them together
    df = ddf.concat(filtered_dfs, axis = 1)
    # Compute on the concatenated frame again, so it becomes a pandas dataframe
    return df.dropna(how = "all").compute()


# Just write some random files here
paths = ['file_120202021.txt', 'file_123.txt', 'file_12330.txt']
for fp in paths:
    with open(fp, 'w') as f:
        f.write('index_here\tsome_value\tother_cols\n')
        for row in range(0, 1000):
            for val, other_col in np.random.rand(1, 2):
                f.write(str(row)+'\t'+str(val)+'\t'+str(other_col)+'\n')

# Make a dataframe with dask
get_dataframe(paths)
Edit:
I have a small script here that shows the failure of dask:
The time required for dask on my machine is 1.87 seconds, while the time required for pandas is 0.29 seconds.
Clearly, I am doing this wrong, as dask was specifically made for more rapid computation on dataframes.
import io
import re
import time

import numpy as np
import pandas as pd
import dask.bag as dbag
import dask.dataframe as ddf


def get_dask_dataframe(file_paths, ix_col = 'index_here', val_col = 'some_value'):
    # Make a collection
    dasks_bag = dbag.from_sequence(file_paths)
    # read and filter to data of interest
    dask_frames = ddf.read_table(file_paths, include_path_column = True)[[ix_col, val_col, 'path']]
    # Make pandas dataframe
    df = dask_frames.compute()
    # Pivot since read_table puts path in one column
    df = df.pivot_table(values = val_col, index = ix_col, columns = 'path')
    return df.dropna(how = "all")


def get_pandas_dataframe(file_paths, ix_col = 'index_here', val_col = 'some_value'):
    # Make a collection
    l = []
    for f in file_paths:
        series = pd.read_csv(f, sep = '\t')[[ix_col, val_col]].set_index(ix_col)
        # Rename it to be the filename / file_id
        file_id = re.match(r"file_(.+)\.txt", f)[1]
        series.columns = [file_id]
        l += [series]
    # concatenate them together
    df = pd.concat(l, axis = 1)
    return df.dropna(how = "all")


# Just write a whole bunch of random files
paths = ['file_'+str(i)+'.txt' for i in range(0, 100)]
for fp in paths:
    with open(fp, 'w') as f:
        f.write('index_here\tsome_value\tother_cols\n')
        for row in range(0, 1000):
            for val, other_col in np.random.rand(1, 2):
                f.write(str(row)+'\t'+str(val)+'\t'+str(other_col)+'\n')

t0 = time.time()
# Make a dataframe with dask
df1 = get_dask_dataframe(paths)
t1 = time.time()
print(t1-t0)

t0 = time.time()
# Make a dataframe with pandas
df2 = get_pandas_dataframe(paths)
t1 = time.time()
print(t1-t0)
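Not from the original post, but for comparison, here is a rough sketch that keeps the per-file work in plain pandas and spreads it over worker processes (the file layout and regex are the ones used above). Whether it beats the simple loop depends on how large the files are, since process startup and pickling have their own overhead:

import re
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def load_one(fp, ix_col='index_here', val_col='some_value'):
    # same per-file work as get_pandas_dataframe, kept at module level so it can run in a worker process
    s = pd.read_csv(fp, sep='\t', usecols=[ix_col, val_col]).set_index(ix_col)
    s.columns = [re.match(r"file_(.+)\.txt", fp)[1]]
    return s

if __name__ == '__main__':
    paths = ['file_' + str(i) + '.txt' for i in range(0, 100)]
    with ProcessPoolExecutor() as pool:
        frames = list(pool.map(load_one, paths))
    df = pd.concat(frames, axis=1).dropna(how="all")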
I would like to make the following code faster; it exports to CSV from input files (average file size 800 MB) containing 100+ columns.
INPUT:
DATE_TIME; DATA1; DATA2
12.18.2018 00:00:00;XXXXXXXXXXXX;YYYYYYYY
12.18.2018 00:00:00;XXXXXXXXXXXX;YYYYYYYY
12.18.2018 00:00:00;XXXXXXXXXXXX;YYYYYYYY
12.18.2018 01:00:00;XXXXXXXXXXXX;YYYYYYYY
OUTPUT (no header) will be:
DATE, TIME, DATA1, DATA2
2018-12-18,00:00:00,XXXXXXXXXXXX,YYYYYYYY
2018-12-18,00:00:00,XXXXXXXXXXXX,YYYYYYYY
2018-12-18,00:00:00,XXXXXXXXXXXX,YYYYYYYY
2018-12-18,01:00:00,XXXXXXXXXXXX,YYYYYYYY
CODE
import glob

import pandas

dfraw = []
rawCSV = glob.glob(r'C:\RAW\*.csv')
rawCSV
for filename in rawCSV:
    data = pandas.read_csv(filename, delimiter=';')
    dfraw.append(data)
totalFile = len(dfraw)
x=0
dfclean=[]
for x in range(totalFile):
    tempdf = dfraw[x]["DATE_TIME"].str.split(" ", n = 1, expand = True)
    tempdf[0] = tempdf[0].str.replace('.', '-', regex=False)  # literal dot, not a regex
    dfraw[x].drop(columns =["DATE_TIME"], inplace = True)
    dfraw[x].insert(loc=0, column='DATE_ONLY', value=tempdf[0])
    dfraw[x].insert(loc=1, column='TIME_ONLY', value=tempdf[1])
    dfraw[x]['DATE_TIME'] = dfraw[x]['DATE_TIME'].astype('datetime64[ns]')
    dfclean.append(dfraw[x])
concatdf=pandas.concat(dfclean, axis=0)
#dfclean.columns = dfclean.iloc[0]
#dfclean = dfclean[1:]
concatdf.to_csv(r'C:\CLEAN\__result.csv', index=False , header=False)
This is probably about as fast as you can get; I think it should work. It writes each file out as it is read in, instead of piling everything up in memory until the end and doing a concat (which is a bit slow).
import glob

import pandas

rawCSV = glob.glob(r'C:\RAW\*.csv')
for filename in rawCSV:
    data = pandas.read_csv(filename, delimiter=';')
    date_time = data['DATE_TIME'].str.split(" ", n = 1, expand = True)
    data.drop(columns =["DATE_TIME"], inplace = True)
    data.insert(loc=0, column='DATE_ONLY', value=date_time[0].str.replace('.', '-', regex=False))
    data.insert(loc=1, column='TIME_ONLY', value=date_time[1])
    with open(r'C:\CLEAN\__result.csv', 'a') as fh:
        data.to_csv(fh, index=False , header=False)
It would probably be worth putting in some checks to make sure you're not appending to a file that's already there, and so on; for example:
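Here is a minimal sketch of such a check, using the output path assumed above:

import os

out_path = r'C:\CLEAN\__result.csv'
# start from a clean slate so repeated runs don't keep appending to an old result
if os.path.exists(out_path):
    os.remove(out_path)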
I'm fairly new to Python and pandas but am trying to get better with them for parsing and processing large data files. I'm currently working on a project that requires me to parse a few dozen large CSV CAN files at a time. The files have 9 columns of interest (1 ID and 8 data fields), have about 1-2 million rows, and are encoded in hex.
A sample bit of data looks like this:
id Flags DLC Data0 Data1 Data2 Data3 Data4 Data5 Data6 Data7
cf11505 4 1 ff
cf11505 4 1 ff
cf11505 4 1 ff
cf11a05 4 1 0
cf11505 4 1 ff
cf11505 4 1 ff
cf11505 4 1 ff
cf11005 4 8 ff ff ff ff ff ff ff ff
I need to decode the hex, and then extract a bunch of different variables from it depending on the CAN ID.
A colleague of mine wrote a script to parse these files that looks like this (henceforth known as Script #1):
import sys # imports the sys module
import csv
import itertools
import datetime
import time
import glob, os

for filename in glob.glob(sys.argv[1] + "/*.csv"):
    print('working on ' + filename +'...')
    #Initialize a bunch of variables
    csvInput = open(filename, 'r') # opens the csv file
    csvOutput = open((os.path.split(filename)[0] + os.path.split(filename)[1]), 'w', newline='')
    writer = csv.writer(csvOutput) #creates the writer object
    writer.writerow([var1, var2, var3, ...])
    try:
        reader = csv.reader(csvInput)
        data = list(reader)
        if (data[3][1] == 'HEX'): dataType = 16
        elif (data[3][1] == 'DEC'): dataType = 10
        else: print('Invalid Data Type')
        if (data[4][1] == 'HEX'): idType = 16
        elif (data[4][1] == 'DEC'): idType = 10
        else: print('Invalid ID Type')
        start_date = datetime.datetime.strptime(data[6][1],'%Y-%m-%d %H:%M:%S')
        for row in itertools.islice(data,8,None):
            try: ID = int(row[2],idType)
            except: ID = 0
            if (ID == 0xcf11005):
                for i in range(0,4): var1[i] = float((int(row[2*i+6],dataType)<<8)|
                #similar operations for a bunch of variables go here
                writer.writerow([var1[0], var2[1],.....])
    finally:
        csvInput.close()
        csvOutput.close()
print(end - start)
print('done')
It basically uses the CSV reader and writer to generate a processed CSV file line by line for each CSV. For a 2 million row CSV CAN file, it takes about 40 secs to fully run on my work desktop. Knowing that line by line iteration is much slower than performing vectorized operations on a pandas dataframe, I thought I could do better, so I wrote a script that looks like this (Script #2):
from timeit import default_timer as timer
import numpy as np
import pandas as pd
import os
import datetime
from tkinter import filedialog
from tkinter import Tk
Tk().withdraw()
filename = filedialog.askopenfile(title="Select .csv log file", filetypes=(("CSV files", "*.csv"), ("all files", "*.*")))
name = os.path.basename(filename.name)
##################################################
df = pd.read_csv(name, skiprows = 7, usecols = ['id', 'Data0', 'Data1', 'Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7'],
                 dtype = {'id':str, 'Data0':str, 'Data1':str, 'Data2':str, 'Data3':str, 'Data4':str, 'Data5':str, 'Data6':str, 'Data7':str})
log_cols = ['id', 'Data0', 'Data1', 'Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7']
for col in log_cols:
    df[col] = df[col].dropna().astype(str).apply(lambda x: int(x, 16)) #decode the hex strings
df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].fillna(method = 'ffill') #forward fill empty rows
df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].fillna(value = 0) #replace any remaining nans with 0
df['Data0'] = df['Data0'].astype(np.uint8)
df.loc[:, 'Data0':'Data7'] = df.loc[:, 'Data0':'Data7'].astype(np.uint8)
processed_df = pd.DataFrame(np.nan, index= range(0, len(df)), columns= ['var1', 'var2', 'var3', ...])
start_date = datetime.datetime.strptime('7/17/2018 14:12:48','%m/%d/%Y %H:%M:%S')
processed_df ['Time Since Start (s)'] = pd.read_csv(name, skiprows = 7, usecols = ['Time'], dtype = {'Time':np.float32}, engine = 'c')
processed_df['Date'] = pd.to_timedelta(processed_df['Time Since Start (s)'], unit = 's') + start_date
processed_df['id'] = df['id']
processed_df.loc[:, 'var1':'var37'] = processed_df.loc[:, 'var1':'var37'].astype(np.float32)
##################Data Processing###########################
processed_df.loc[processed_df.id == int(0xcf11005), 'var1'] = np.bitwise_or(np.left_shift(df['Data1'], 8), df['Data0'])/10
#a bunch of additional similar vectorized calculations go here to pull useful values
name_string = "Processed_" + name
processed_df.to_csv(name_string) #dump dataframe to CSV
The processing part was definitely faster, although not as much as I had hoped: it took about 13 seconds to process the 2 million row CSV file. There's probably more I could do to optimize script #2, but that's a topic for another post.
Anyway, my hopes that script #2 would actually be faster than script #1 were dashed when I tried to save the dataframe as a CSV. The .to_csv() method took 40 s alone! I tried playing around with a few parameters in the .to_csv() method, including chunk size and compression, as well as reducing the memory footprint of the dataframe, but even with these tweaks it still took 30 s to save the dataframe, and once you factor in the initial processing time, the entire script was slower than the original row-by-row script #1.
Is row by row iteration of a CSV file really the most computationally efficient way to parse these files?
The dask library might be worth a look. It implements a subset of the pandas DataFrame functionality, but processes the data lazily in chunks (out of core) rather than loading it all into memory, and it lets you use the DataFrame largely as if it were in memory. I believe it can even treat multiple files as a single DataFrame, among other things like using multiple machines to do work in parallel.
This was faster for me when I was dealing with a 6GB CSV with millions of rows.
https://dask.pydata.org/en/latest/
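As a tiny illustration of that multi-file behaviour (the path pattern here is made up, not from the question):

import dask.dataframe as dd

# every matching log becomes one or more partitions of a single lazy DataFrame
logs = dd.read_csv('CAN_logs/*.csv', dtype=str)
print(logs.npartitions)  # only a small sample is read up front; the rest is processed lazily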
Have you tried setting a chunksize, the number of rows to write at a time? As you can see in the pandas source, the default is 100,000 divided by the number of columns, and above 100,000 columns it is set to 1.
Another thing to consider is adding mode='a' (instead of the default 'w') for appending.
So I would suggest using:
processed_df.to_csv(name_string, mode='a', chunksize=100000)
I'd play with the chunksize until it suits your needs.
I want to use Python to reduce every two adjacent rows of data in a file into one row and write a new file. I use pandas and numpy for the processing, but pandas is very slow (it can even take a few hours), while numpy takes two or three minutes, for a total of more than 1 million rows.
Here is part of the data:
33,Jogging,49105962326000,-0.6946377,12.680544,0.50395286;
33,Jogging,49106062271000,5.012288,11.264028,0.95342433;
33,Jogging,49106112167000,4.903325,10.882658,-0.08172209;
My pandas code is as below:
import pandas as pd
import numpy as np
import time

time1 = time.time()
file = open('WISDM_ar_v1.1_raw.txt','r')
dataset = file.readlines()
list1 = []
for i in range(len(dataset)-1):
    dataset[i] = dataset[i].rstrip('\n')
    dataset[i] = dataset[i].rstrip(';')
    dataset[i] = dataset[i].split(",")
    if len(dataset[i])==6:
        #list1 holds the cleaned rows
        list1.append(dataset[i])
array1 = np.array(list1)
#newline: separator between rows; delimiter: separator between columns
np.savetxt("aa.txt", array1, fmt="%s", newline='\r\n', delimiter=",")

column_names = ['user-id', 'activity', 'timestamp', 'x-axis', 'y-axis', 'z-axis']
dataset1 = pd.read_csv('aa.txt', names=column_names, header=None)
df = pd.DataFrame(dataset1)
df1 = pd.DataFrame(columns=column_names)
for i in range(0, len(dataset1)-1):
    data = dataset1.loc[[i]]
    if dataset1.loc[i+1, 'activity'] == dataset1.loc[i, 'activity']:
        data.loc[i,'user-id'] = dataset1.loc[i,'user-id']
        data.loc[i,'x-axis'] = dataset1.loc[i+1,'x-axis'] - dataset1.loc[i,'x-axis']
        data.loc[i,'y-axis'] = dataset1.loc[i+1,'y-axis'] - dataset1.loc[i,'y-axis']
        data.loc[i,'z-axis'] = dataset1.loc[i+1,'z-axis'] - dataset1.loc[i,'z-axis']
        df1 = df1.append(data, ignore_index=True)
df1.to_csv('new_data.txt', mode='a', sep=',', header=False, index=False)
I want to know why this is the case. Is there any mistake in the pandas code I wrote? Thank you very much!
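Not part of the original post, but for what it's worth, here is a rough vectorized sketch of the same pairwise difference (it assumes the cleaned aa.txt and the column names above). It avoids growing a DataFrame row by row inside a loop, which is usually what makes code like this slow:

import pandas as pd

column_names = ['user-id', 'activity', 'timestamp', 'x-axis', 'y-axis', 'z-axis']
df = pd.read_csv('aa.txt', names=column_names, header=None)

nxt = df.shift(-1)                                # the following row, aligned with the current one
same_activity = nxt['activity'] == df['activity'] # last row compares against NaN and is dropped

out = df[same_activity].copy()
for axis in ['x-axis', 'y-axis', 'z-axis']:
    out[axis] = nxt.loc[same_activity, axis] - df.loc[same_activity, axis]

out.to_csv('new_data.txt', sep=',', header=False, index=False)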