For me, working remotely means accessing big CSV files on a server, and they take a long time to download to my local hard drive.
I've tried to speed this process up with a bit of Python, reading in only the particular columns I require. Ideally, however, if I could read in data for those columns only after a certain date (e.g. > 2019-01-04), it would significantly reduce the amount of data.
My existing code reads in the whole file and then applies a date filter. I'm just wondering if it's possible to apply that date filter while reading the file in the first place. I appreciate this might not be possible.
Code e.g...
import pandas as pd
fields = ['a','b','c'...]
data1 = pd.read_csv(r'SomeForeignDrive.csv', error_bad_lines=False, usecols=fields)
data1['c'] = pd.to_datetime(data1['c'], errors='coerce')
data1 = data1.dropna()  # assign the result back; dropna() does not modify in place
data1 = data1[data1['c'] > '2019-01-04']
data1.to_csv(r'SomeLocalDrive.csv')
It's not possible to read files starting from a specific date, but you can use the following workaround: read only the column with dates and find the index of the row you want to start from, then read the whole file again and skip all data rows before that start index:
df = pd.read_csv('path', usecols=['date'])
df['date'] = pd.to_datetime(df['date'])
idx = df[df['date'] > '2019-01-04'].index[0]
# skip the data rows before the start index, but keep the header row
df = pd.read_csv('path', skiprows=range(1, idx + 1))
read_csv docs:
Using this parameter (usecols) results in much faster parsing time and
lower memory usage.
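Another option, which doesn't reduce how much data is parsed but does keep memory usage down and avoids reading the file twice, is to read the file in chunks and apply the date filter to each chunk as it arrives, so only the filtered rows are ever held in memory. A minimal sketch, reusing the column names and cutoff date from the question (the chunk size is an arbitrary example value):
import pandas as pd

fields = ['a', 'b', 'c']
filtered_chunks = []

# parse the remote file in manageable pieces instead of all at once
for chunk in pd.read_csv(r'SomeForeignDrive.csv', usecols=fields, chunksize=100000):
    chunk['c'] = pd.to_datetime(chunk['c'], errors='coerce')
    # keep only rows after the cutoff date before accumulating
    filtered_chunks.append(chunk[chunk['c'] > '2019-01-04'])

data1 = pd.concat(filtered_chunks, ignore_index=True)
data1.to_csv(r'SomeLocalDrive.csv', index=False)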
Related
I am currently facing a problem when loading data via the read_csv() function of pandas.
Here is an extract of a record from the CSV file:
2021-11-28T03:13:01+00:00,49.59,49.93,49.56,49.88
When I use pandas' to_csv(), my index column, which is a timestamp, systematically loses one hour on all records.
The previous example looks like this after going through pandas:
2021-11-28T02:13:01+00:00,49.59,49.93,49.56,49.88
Here is the Python code snippet:
df_mre_unpacked = pd.read_csv('mre_unpacked.csv', sep=',', encoding='utf-8', index_col='timestamp', decimal=".")
df_mre_unpacked = df_mre_unpacked[['ASG1.CPU10_XV_ACHSE1_ZR','ASG1.CPU11_XV_ACHSE2_ZR','ASG2.CPU10_XV_ACHSE1_ZR','ASG2.CPU11_XV_ACHSE2_ZR']]
Here is the original CSV:
Result of the pandas dataframe head() function:
As you can see, the records start at 03:13:01 in the CSV file, but the pandas dataframe begins with 02:13:01, a timestamp that does not even exist in my CSV file.
Has anyone had this problem before ?
UPDATE:
I have a better view of the problem now. Here is what I discovered.
When I do the data extraction via a portal, the time that is displayed is as follows:
Here is the same record read from the CSV file with Notepad++:
The problem comes from the extraction process that automatically removes one hour.
How do I add 1 hour to my timestamp column which is my index?
We just have to add this code snippet:
df_de18_EventData.index = pd.to_datetime(df_de18_EventData.index) + pd.Timedelta("1 hour")
This will add 1 hour to the index timestamp column.
It's not the best solution, but it works for now. An optimal solution would be to automatically take the time difference into account, the way Excel does.
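If the one-hour shift is actually a timezone or DST artifact, a more general variant (this is an assumption about the data, not something confirmed in the question) is to parse the index as timezone-aware UTC and convert it to the local zone, so the offset is applied automatically:
# parse the index as timezone-aware UTC, then convert to the local timezone
# 'Europe/Berlin' is only an assumed example timezone
df_mre_unpacked.index = pd.to_datetime(df_mre_unpacked.index, utc=True).tz_convert('Europe/Berlin')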
I have a significantly large volume of "raw data" stored in pickle files. First I have to read/load them into a pandas dataframe. Afterwards I do some analysis, update something, analyze again, and so on.
Every time I run the code, it reads the raw data from the pickle files. This takes quite some time, which I want to avoid. One dirty solution is to load the files once and then comment out the reading section. Of course, I have to remain careful not to %reset the namespace.
import pandas as pd
df_1 = pd.read_pickle('MyFile_1.pkl')
df_2 = pd.read_pickle('MyFile_2.pkl')
df_3 = pd.read_pickle('MyFile_3.pkl')
# Do some work on the loaded data ...
Is there some smart way to do it? Something like:
if MyFile_1.pkl is NOT loaded:
    df_1 = pd.read_pickle('MyFile_1.pkl')
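One way to get roughly that behaviour, sketched under the assumption that the code runs in a long-lived interactive session (e.g. IPython/Jupyter), is to keep the loaded frames in a small cache dictionary and only hit the disk on the first access; load_pickle_once is just an illustrative helper name:
import pandas as pd

_cache = {}  # filename -> DataFrame, reused for as long as the session is alive

def load_pickle_once(path):
    # read the pickle only if it has not been loaded before
    if path not in _cache:
        _cache[path] = pd.read_pickle(path)
    return _cache[path]

df_1 = load_pickle_once('MyFile_1.pkl')
df_2 = load_pickle_once('MyFile_2.pkl')
df_3 = load_pickle_once('MyFile_3.pkl')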
Due to some limitations of the consumer of my data, I need to "rewrite" some parquet files to convert timestamps that are in nanosecond precision to timestamps that are in millisecond precision.
I have implemented this and it works but I am not completely satisfied with it.
import pandas as pd
df = pd.read_parquet(f's3://{bucket}/{key}', engine='pyarrow')

for col_name in df.columns:
    if df[col_name].dtype == 'datetime64[ns]':
        df[col_name] = df[col_name].values.astype('datetime64[ms]')

df.to_parquet(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
              engine='pyarrow', index=False)
I'm currently running this job in Lambda for each file, but I can see this may be expensive and may not always work if the job takes longer than 15 minutes, as that is the maximum time a Lambda can run.
The files can be on the larger side (>500 MB).
Any ideas or other methods I could consider? I am unable to use pyspark as my dataset has unsigned integers in it.
You could try rewriting all columns at once. Maybe this would reduce some memory copies in pandas, thus speeding up the process if you have many columns:
df_datetimes = df.select_dtypes(include="datetime64[ns]")
df[df_datetimes.columns] = df_datetimes.astype("datetime64[ms]")
Add use_deprecated_int96_timestamps=True to df.to_parquet() when you first write the file, and it will save as a nanosecond timestamp. https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
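Another possibility, assuming the pyarrow engine is used for writing, is to let the parquet writer do the downcast itself via coerce_timestamps, which removes the explicit per-column conversion loop (allow_truncated_timestamps suppresses the error that would otherwise be raised when sub-millisecond precision is lost):
import pandas as pd

df = pd.read_parquet(f's3://{bucket}/{key}', engine='pyarrow')

# the writer truncates nanosecond timestamps to milliseconds on the way out
df.to_parquet(f's3://{outputBucket}/{outputPrefix}{additionalSuffix}',
              engine='pyarrow', index=False,
              coerce_timestamps='ms', allow_truncated_timestamps=True)
Whether this is actually faster than the astype loop would need to be measured; it mainly simplifies the code.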
I have a large table of data which needs to be merged and then split by group. The easy method is to use pandas, but the only problem is memory.
I have this code to merge dataframes:
import pandas as pd
from functools import reduce

large_df = pd.read_table('large_file.csv', sep=',')
This basically loads the whole data into memory.
# Then I could group the pandas dataframe by some column value (say "block")
df_by_block = large_df.groupby("block")

# and then write the data out block by block
for block_id, block_val in df_by_block:
    block_val.to_csv("df_" + str(block_id), sep="\t", index=False)
The only problem with the above code is memory allocation, which freezes my desktop. I tried to move this code to dask, but dask doesn't have a neat groupby implementation.
Note: I could have just sorted the file, then read the data line by line and split it as the "block" value changes. But the only problem is that "large_df.txt" is created upstream in the pipeline by merging several dataframes.
Any suggestions?
Thanks,
Update:
I tried the following approach, but it still seems to be memory heavy:
# find unique values in the column of interest (which is to be "grouped by")
large_df_contig = large_df['contig']
contig_list = list(large_df_contig.unique().compute())

# groupby the dataframe
large_df_grouped = large_df.set_index('contig')

# now, split dataframes
for items in contig_list:
    my_df = large_df_grouped.loc[items].compute().reset_index()
    my_df.to_csv('dask_output/my_df_' + str(items), sep='\t', index=False)
Everything is fine, but the code
my_df = large_df_grouped.loc[items].compute().reset_index()
seems to be pulling everything back into memory again.
Any way to improve this code?
but dask doesn't have a neat groupby implementation
Actually, dask does have groupby + user-defined functions with out-of-core reshuffling.
You can use
large_df.groupby(something).apply(write_to_disk)
where write_to_disk is some short function writing the block to the disk. By default, dask uses disk shuffling in these cases (as opposed to network shuffling). Note that this operation might be slow, and it can still fail if the size of a single group exceeds your memory.
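A minimal sketch of that pattern, assuming the data comes from 'large_file.csv' and is grouped by a "block" column as in the question (write_to_disk is a placeholder name):
import dask.dataframe as dd

large_df = dd.read_csv('large_file.csv')

def write_to_disk(block):
    # each call receives one complete group as an ordinary pandas DataFrame
    block_id = block['block'].iloc[0]
    block.to_csv('df_' + str(block_id), sep='\t', index=False)
    return 0  # apply needs something to return

# meta describes the dummy return value so dask can build the task graph
large_df.groupby('block').apply(write_to_disk, meta=('written', 'int64')).compute()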
I'm trying to use multiprocessing to read the CSV file faster than with a plain read_csv call.
df = pd.read_csv('review-1m.csv', chunksize=10000)
But the df I get is not a dataframe but an object of type pandas.io.parsers.TextFileReader, so I try to use
df = pd.concat(df, ignore_index=True)
to convert it into a dataframe. But this process takes a lot of time, so the result is not much better than using read_csv directly. Does anyone know how to make the conversion into a dataframe faster?
pd.read_csv() is likely going to give you the same read time as any other method. If you want a real performance increase, you should change the format in which you store your file.
http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
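For example, a one-off conversion to a binary format (a sketch, assuming the CSV fits in memory at least once) lets every later run skip CSV parsing entirely:
import pandas as pd

# one-off conversion: parse the CSV once and store it in a binary format
df = pd.read_csv('review-1m.csv')
df.to_pickle('review-1m.pkl')

# subsequent runs: loading the pickle is much faster than re-parsing the CSV
df = pd.read_pickle('review-1m.pkl')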