This is my first time writing code to process files with a lot of data, so I am kind of stuck here.
What I'm trying to do is read a list of paths listing all of the CSV files that need to be read, retrieve the HEAD and TAIL of each file, and put them in a list.
I have 621 CSV files in total, each consisting of 5,800 rows and 251 columns.
Here is a sample of the data:
[LOGGING],RD81DL96_1,3,4,5,2,,,,
LOG01,,,,,,,,,
DATETIME,INDEX,SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0],SHORT[DEC.0]
TIME,INDEX,FF-1(1A) ,FF-1(1B) ,FF-1(1C) ,FF-1(2A),FF-2(1A) ,FF-2(1B) ,FF-2(1C),FF-2(2A)
47:29.6,1,172,0,139,1258,0,0,400,0
47:34.6,2,172,0,139,1258,0,0,400,0
47:39.6,3,172,0,139,1258,0,0,400,0
47:44.6,4,172,0,139,1263,0,0,400,0
47:49.6,5,172,0,139,1263,0,0,450,0
47:54.6,6,172,0,139,1263,0,0,450,0
The problem is that while it takes about 13 seconds to read all the files (still kind of slow, honestly), adding a single append line makes the process take far longer to finish, about 4 minutes.
Below is a snippet of the code:
import dask.dataframe as dd

# CsvList: [File Path, Change Date, File Size, File Name]
startEndList = []
timeColumn = ['TIME']
for x, file in enumerate(CsvList):
    df = dd.read_csv(file[0], sep=',', skiprows=3, encoding='CP932',
                     engine='python', usecols=timeColumn)
    # The process becomes slow when this line is added
    startEndList.append(list(df.head(1)) + list(df.tail(1)))
Why did that happen? I'm using dask.dataframe.
Currently, your code isn't really leveraging Dask's parallelizing capabilities because:
df.head and df.tail calls will trigger a "compute" (i.e., convert your Dask DataFrame into a pandas DataFrame -- which is what we try to minimize in lazy evaluations with Dask), and
the for-loop is running sequentially because you're creating Dask DataFrames and converting them to pandas DataFrames, all inside the loop.
So, your current example is similar to just using pandas within the for-loop, but with the added Dask-to-pandas-conversion overhead.
Since you need to work on each of your files, I'd suggest checking out Dask Delayed, which might be more elegant and useful here. The following (pseudo-code) will parallelize the pandas operation on each of your files:
import dask
import pandas as pd

result = []
for file in list_of_files:
    df = dask.delayed(pd.read_csv)(file)
    result.append(df.head(1) + df.tail(1))

dask.compute(*result)
The output of dask.visualize(*result) when I used 4 CSV files confirms parallelism.
If you really want to use Dask DataFrame here, you may try to (see the sketch after this list):
read all files into a single Dask DataFrame,
make sure each Dask "partition" corresponds to one file,
use Dask DataFrame apply to get the head and tail values and append them to a new list
call compute on the new list
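For illustration, here is a rough sketch of that idea; it assumes the file layout from the question (a TIME column and 3 metadata rows to skip) and a placeholder glob path, and it uses map_partitions rather than apply to grab each partition's first and last rows:
import dask.dataframe as dd
import pandas as pd

# blocksize=None keeps one file per partition instead of splitting files by byte size
ddf = dd.read_csv('data/*.csv', skiprows=3, encoding='CP932',
                  usecols=['TIME'], blocksize=None)

# For each partition (i.e. each file), keep only its first and last row
first_last = ddf.map_partitions(lambda pdf: pd.concat([pdf.head(1), pdf.tail(1)]))

# A single compute call, so Dask can parallelize the per-file work
result = first_last.compute()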
A first approach, using only plain Python as a starting point:
import io
import pathlib

import pandas as pd

def read_first_and_last_lines(filename):
    with open(filename, 'rb') as fp:
        # skip the first 4 rows (headers)
        [next(fp) for _ in range(4)]
        # first data line
        first_line = fp.readline()
        # jump to -2x the length of the first line from the end of the file
        # (assumes the last line is not much longer than the first data line)
        fp.seek(-2 * len(first_line), 2)
        # last line
        last_line = fp.readlines()[-1]
        return first_line + last_line

data = []
for filename in pathlib.Path('data').glob('*.csv'):
    data.append(read_first_and_last_lines(filename))

buf = io.BytesIO()
buf.writelines(data)
buf.seek(0)

df = pd.read_csv(buf, header=None, encoding='CP932')
I have a txt file with the following format:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],"values":[["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Testcustomer",null,null,null,null,-196,196,-196,null,null],["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Testcustomer",null,null,null,null,null,null,null,null,null],["2017-10-06T08:50:25.349Z",null,null,2596,null,null,null,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,80700],["2017-10-06T08:50:25.35Z",null,null,null,null,null,null,null,null,null,1956,"41762721","Testkunde",null,null,null,null,null,null,null,null,null],["2017-10-06T09:20:05.742Z",null,null,null,null,null,67.98999,null,null,null,null,"41762721","Testkunde",null,null,null,null,null,null,null,null,null]]}]}]}
...
So in the text file everything is saved on one line. A CSV file is not available.
I would like to have it as a DataFrame in pandas. When I use read_csv:
df = pd.read_csv('time-series-data.txt', sep = ",")
the output of print(df) is something like [0 rows x 3455.. columns]
So currently everything is read in as one line. However, I would like to have 22 columns (time, ActivePower0, CosPhi0, ...). Any tips would be appreciated, thank you very much.
Is a pandas DataFrame even suitable for this? The text files are up to 2 GB in size.
Here's an example which can read the file you posted.
Here's the test file, named test.json:
{"results":[{"statement_id":0,"series":[{"name":"datalogger","columns":["time","ActivePower0","CosPhi0","CurrentRms0","DcAnalog5","FiringAngle0","IrTemperature0","Lage 1_Angle1","Lage 1_Angle2","PotentioMeter0","Rotation0","SNR","TNR","Temperature0","Temperature3","Temperature_MAX31855_0","Temperature_MAX31855_3","Vibra0_X","Vibra0_Y","Vibra0_Z","VoltageAccu0","VoltageRms0"],
"values":[
["2017-10-06T08:50:25.347Z",null,null,null,null,null,null,null,null,null,null,"41762721","Test-customer",null,null,null,null,-196,196,-196,null,null],
["2017-10-06T08:50:25.348Z",null,null,null,null,null,null,346.2964,76.11179,null,null,"41762721","Test-customer",null,null,null,null,null,null,null,null,null]]}]}]}
Here's the python code used to read it in:
import json
import pandas as pd
# Read test file.
# This reads the entire file into memory at once. If this is not
# possible for you, you may want to look into something like ijson:
# https://pypi.org/project/ijson/
with open("test.json", "rb") as f:
    data = json.load(f)
# Get the first element of results list, and first element of series list
# You may need a loop here, if your real data has more than one of these.
subset = data['results'][0]['series'][0]
values = subset['values']
columns = subset['columns']
df = pd.DataFrame(values, columns=columns)
print(df)
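For files that are too large for json.load, a streaming parse with ijson (mentioned in the code comment above) might look roughly like this; it is an untested sketch that assumes the same results/series/values structure and the same file name:
import ijson
import pandas as pd

# The list of column names is small, so grab it in one pass.
with open("test.json", "rb") as f:
    columns = next(ijson.items(f, "results.item.series.item.columns"))

# Then stream the rows of "values" one at a time instead of loading the whole file.
rows = []
with open("test.json", "rb") as f:
    for row in ijson.items(f, "results.item.series.item.values.item"):
        rows.append(row)  # note: ijson may return numbers as Decimal

df = pd.DataFrame(rows, columns=columns)
print(df)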
We all know the question that comes up when you run into a memory error: Maximum size of pandas dataframe
I am also trying to read 4 large CSV files with the following command:
files = glob.glob("C:/.../rawdata/*.csv")
dfs = [pd.read_csv(f, sep="\t", encoding='unicode_escape') for f in files]
df = pd.concat(dfs,ignore_index=True)
The only message I receive is:
C:..\conda\conda\envs\DataLab\lib\site-packages\IPython\core\interactiveshell.py:3214:
DtypeWarning: Columns (22,25,56,60,71,74) have mixed types. Specify dtype option on import or set low_memory=False.
  if (yield from self.run_code(code, result)):
which should be no problem.
My total dataframe has a size of: (6639037, 84)
Could there be any data-size restriction that does not trigger a memory error? That would mean Python is automatically skipping some lines without telling me. I have had this with another program in the past; I don't think Python is that lazy, but you never know.
Later I save it as an SQLite file, but I also don't think this should be a problem:
conn = sqlite3.connect('C:/.../In.db')
df.to_sql(name='rawdata', con=conn, if_exists = 'replace', index=False)
conn.commit()
conn.close()
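As a quick sanity check on whether any rows were silently dropped, a sketch (reusing the dfs list and concatenated df from the code above) could compare the summed per-file row counts with the final frame:
rows_per_file = [len(d) for d in dfs]
print(sum(rows_per_file), len(df))  # the two numbers should match if nothing was skipped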
You can pass a generator expression to concat
dfs = (pd.read_csv(f, sep="\t", encoding='unicode_escape') for f in files)
so that you avoid building that huge list in memory. This might alleviate the problem with the memory limit.
Besides, you can make a special generator that contains a downcast for some columns.
Say, like this:
def downcaster(names):
    for name in names:
        x = pd.read_csv(name, sep="\t", encoding='unicode_escape')
        # 'some_column' and 'other_column' are placeholders for columns worth downcasting
        x['some_column'] = x['some_column'].astype('category')
        x['other_column'] = pd.to_numeric(x['other_column'], downcast='integer')
        yield x

dc = downcaster(files)
df = pd.concat(dc, ignore_index=True)
It turned out that there was an error in the file reading, so thanks @Oleg O for the help and the tricks to reduce memory usage.
For now I do not think there is any effect where Python automatically skips lines. It only happened because of incorrect reading code. You can find my example here: Pandas read csv skips some lines
I have a huge CSV file (14 GB) on disk that I need to "melt" (using pd.melt). I can import the file using pd.read_csv() without issue, but when I apply the melt function I max out my 32 GB of memory and hit a "memory error" limit. Can anyone suggest some solutions? The original file is the output of another script, therefore I cannot reduce it by importing only selected columns or removing rows. There are a few hundred columns and over 10 million rows. I tried something like this (in a much abbreviated version):
chunks = pd.read_csv('file.csv', chunksize=10000)
ids = list(set(list(chunks.columns.values)) - set(['1', '2', '3', '4', '5']))
out = []
for chunk in chunks:
    df = pd.melt(chunk, id_vars=ids, var_name='foo', value_name='bar')
    df['a_col'] = df['a_col'].fillna('not_na')
    out.append(df)
full_df = pd.concat(out, ignore_index=False)
df_grouped = pd.DataFrame(full_df.groupby(['id_col', 'foo'])['bar'].apply(lambda x: (x - min(x)) / (max(x) - min(x)) * 100))
df_grouped.columns = ['bar_grouped']
final_df = full_df.merge(df_grouped, how='inner', left_index=True, right_index=True)
final_df.to_csv('output.csv', sep='|', index=False)
Clearly this didn't work, as the columns.values attribute is not available on the chunked reader, since it is not a DataFrame. Any suggestions on how to re-code this so it works and avoids memory issues are greatly appreciated!
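For illustration only, a minimal sketch of getting the column names without loading any data (reusing the 'file.csv' name from the snippet above); it addresses the columns problem but not the memory question:
import pandas as pd

# nrows=0 reads just the header row, so the column names are available up front
all_columns = pd.read_csv('file.csv', nrows=0).columns.tolist()
ids = list(set(all_columns) - set(['1', '2', '3', '4', '5']))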
Thanks to Stack Overflow (so basically all of you) I've managed to solve almost all my issues regarding reading Excel data into a DataFrame, except one... My code goes like this:
df = pd.read_excel(
    fileName,
    sheetname=sheetName,
    header=None,
    skiprows=3,
    index_col=None,
    skip_footer=0,
    parse_cols='A:J,AB:CC,CE:DJ',
    na_values='')
The thing is that in the Excel files I'm parsing, the last row of data I want to load is in a different position every time. The only way I can identify the last row that interests me is to look for the word "SUMA" in the first column of each sheet; the last row I want to load into the df will be the n-1 row, the one just above the row containing "SUMA". The rows below SUMA also hold some information that is irrelevant to me, and there can be quite a lot of them, so I want to avoid loading them.
If you do it with generators, you could do something like this. It loads the complete DataFrame, but afterwards filters out the rows from 'SUMA' onwards, using the trick that True == 1, so you only keep the relevant info. You might need some work afterwards to get the dtypes correct.
def read_files(files):
    sheetname = 'my_sheet'
    for file in files:
        yield pd.read_excel(
            file,
            sheetname=sheetname,
            header=None,
            skiprows=3,
            index_col=None,
            skip_footer=0,
            parse_cols='A:J,AB:CC,CE:DJ',
            na_values='')
def clean_files(dataframes):
    summary_text = 'SUMA'
    for df in dataframes:
        # cumsum() stays at 0 until the first row whose first column starts with
        # 'SUMA' (True == 1), so keeping the rows where it is still 0 drops the
        # SUMA row and everything below it
        after_suma = df.iloc[:, 0].astype(str).str.startswith(summary_text).cumsum()
        yield df.loc[after_suma == 0, :]
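A possible way to wire the two generators together (the file names are hypothetical):
import pandas as pd

files = ['report_1.xlsx', 'report_2.xlsx']  # hypothetical input files
result = pd.concat(clean_files(read_files(files)), ignore_index=True)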