I would like to make the following code faster. It exports data into CSV files (average file size 800 MB) containing 100+ columns.
INPUT:
DATE_TIME; DATA1; DATA2
12.18.2018 00:00:00;XXXXXXXXXXXX;YYYYYYYY
12.18.2018 00:00:00;XXXXXXXXXXXX;YYYYYYYY
12.18.2018 00:00:00;XXXXXXXXXXXX;YYYYYYYY
12.18.2018 01:00:00;XXXXXXXXXXXX;YYYYYYYY
OUTPUT (no header will be written; the columns are DATE, TIME, DATA1, DATA2):
2018-12-18,00:00:00,XXXXXXXXXXXX,YYYYYYYY
2018-12-18,00:00:00,XXXXXXXXXXXX,YYYYYYYY
2018-12-18,00:00:00,XXXXXXXXXXXX,YYYYYYYY
2018-12-18,01:00:00,XXXXXXXXXXXX,YYYYYYYY
CODE
import pandas
import glob

dfraw = []
rawCSV = glob.glob('C:\RAW\*.csv')
for filename in rawCSV:
    data = pandas.read_csv(filename, delimiter=';')
    dfraw.append(data)

totalFile = len(dfraw)
dfclean = []
for x in range(totalFile):
    tempdf = dfraw[x]["DATE_TIME"].str.split(" ", n=1, expand=True)
    tempdf[0] = tempdf[0].str.replace('.', '-', regex=False)
    dfraw[x].drop(columns=["DATE_TIME"], inplace=True)
    dfraw[x].insert(loc=0, column='DATE_ONLY', value=tempdf[0])
    dfraw[x].insert(loc=1, column='TIME_ONLY', value=tempdf[1])
    dfclean.append(dfraw[x])

concatdf = pandas.concat(dfclean, axis=0)
#dfclean.columns = dfclean.iloc[0]
#dfclean = dfclean[1:]
concatdf.to_csv('C:\CLEAN\__result.csv', index=False, header=False)
This is probably the fastest you can get; I think it should work. It writes each file out as soon as it is read in, instead of piling everything up in memory and doing a single concat at the end (which is a bit slow).
import pandas
import glob

rawCSV = glob.glob('C:\RAW\*.csv')
for filename in rawCSV:
    data = pandas.read_csv(filename, delimiter=';')
    date_time = data['DATE_TIME'].str.split(" ", n=1, expand=True)
    data.drop(columns=["DATE_TIME"], inplace=True)
    data.insert(loc=0, column='DATE_ONLY', value=date_time[0].str.replace('.', '-', regex=False))
    data.insert(loc=1, column='TIME_ONLY', value=date_time[1])
    with open('C:\CLEAN\__result.csv', 'a') as fh:
        data.to_csv(fh, index=False, header=False)
It would probably be worth putting some checks in to make sure that you're not appending to a file that's already there and whatnot.
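For example, a minimal sketch of such a check (assuming the same output path as above, and simply removing any leftover file from a previous run before the loop starts):

import os

out_path = r'C:\CLEAN\__result.csv'
# Delete any output left over from a previous run so new rows aren't appended to stale data
if os.path.exists(out_path):
    os.remove(out_path)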
Related
I have the following code:
import glob
import pandas as pd
import os
import csv
myList = []
path = "/home/reallymemorable/Documents/git/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports_us/*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv', '')
    fileDate = pd.DataFrame({'Date': [dateFromFilename]})
    myList.append(row.join(fileDate))

concatList = pd.concat(myList, sort=True)
print(concatList)
concatList.to_csv('/home/reallymemorable/Documents/test.csv', index=False, header=True)
It goes through a folder of CSVs and grabs a specific row and puts it all in a CSV. The files themselves have names like 10-10-2020.csv. I have some code in there that gets the filename and removes the file extension, so I am left with the date alone.
I am trying to add another column called "Date" that contains the filename for each file.
The script almost works: it gives me a CSV of all the rows I pulled out of the various CSVs, but the Date column itself is empty.
If I do print(dateFromFilename), the date/filename prints as expected (e.g. 10-10-2020).
What am I doing wrong?
I believe join has how=left by default, and your fileDate dataframe has a different index than row, so you wouldn't get the date. Instead, do an assignment:
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv', '')
    myList.append(row.assign(Date=dateFromFilename))

concatList = pd.concat(myList, sort=True)
Another way is to store the dataframes as a dictionary, then concat:
myList = dict()
for fname in glob.glob(path):
    df = pd.read_csv(fname)
    row = df.loc[df['Province_State'] == 'Pennsylvania']
    dateFromFilename = os.path.basename(fname).replace('.csv', '')
    myList[dateFromFilename] = row

concatList = pd.concat(myList, sort=True)
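With the dict approach, concat puts the dates into the outer level of the index rather than into a column. If you still want a regular Date column, one possible follow-up, just a sketch, is to name that index level and reset it:

concatList = pd.concat(myList, names=['Date'], sort=True).reset_index(level='Date')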
I'm working on a program which analyses a lot of csv files.
Currently I'm declaring every item manually, but as you can see in my code, I effectively just add +1 to my paths and variable names.
I guess I can simplify this with a loop; I just don't know how to do that with the path names.
My code:
import pandas as pd
import numpy as np
### declaration ###
df_primes1 = pd.DataFrame()
df_primes1 = np.array(df_primes1)
df_search1 = pd.DataFrame()
df_primes2 = pd.DataFrame()
df_primes2 = np.array(df_primes2)
df_search2 = pd.DataFrame()
df_primes3 = pd.DataFrame()
df_primes3 = np.array(df_primes3)
df_search3 = pd.DataFrame()
searchterm = '322'
### reads csv in numpy array ###
df_primes1 = pd.read_csv('1/1_Primes_32.csv', delimiter=';', header=None, names='1')
df_primes2 = pd.read_csv('1/2_Primes_32.csv', delimiter=';', header=None, names='2')
df_primes3 = pd.read_csv('1/3_Primes_32.csv', delimiter=';', header=None, names='3')
### sorts prime numbers ###
#df_sorted = df_primes1.sort_values(by='n')
#print(df_sorted)
### searches for number with "searchterm" as start value ###
df_search1 = df_primes1[df_primes1['1'].astype(str).str.startswith(searchterm)]['1']
df_search2 = df_primes2[df_primes2['2'].astype(str).str.startswith(searchterm)]['2']
df_search3 = df_primes3[df_primes3['3'].astype(str).str.startswith(searchterm)]['3']
print(df_search1)
print(df_search2)
print(df_search3)
The program is working; I just want to know how I can simplify this, because there will be 20+ more files like this.
IIUC, we can use pathlib and a dict comprehension:

import pandas as pd
from pathlib import Path

p = 'Path/to/your_csv/'

dfs = {
    f"search_{i}": pd.read_csv(file, delimiter=";",
                               header=None,
                               names=str(i))
    for i, file in enumerate(Path(p).glob("*Prime*.csv"), 1)
}
To break down each item:
p is the target folder that holds your CSVs.
i is an enumerator counting over your files; you will most likely need to add a pre-step of sorting your CSVs to get the order you're after (see the sketch at the end of this answer).
file is each item returned from the generator object; we turn each value into a dataframe.
You can then look up each dataframe in the collection, i.e.
dfs['search_1']
This will return a dataframe.
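A minimal sketch of that sorting pre-step, assuming the leading number in each filename (e.g. 1_Primes_32.csv) determines the order you want:

import pandas as pd
from pathlib import Path

p = 'Path/to/your_csv/'
# Sort by the integer prefix before the first underscore, e.g. 3 in "3_Primes_32.csv"
files = sorted(Path(p).glob("*Prime*.csv"), key=lambda f: int(f.name.split("_")[0]))

dfs = {
    f"search_{i}": pd.read_csv(file, delimiter=";", header=None, names=str(i))
    for i, file in enumerate(files, 1)
}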
I am processing a large data set, at least 8 GB in size, using pandas.
I've encountered a problem reading the whole set at once, so I read the file chunk by chunk.
In my understanding, chunking the whole file creates many different dataframes, so my existing routine only removes duplicate values within each individual dataframe, not across the whole file.
I need to remove the duplicates on this whole data set based on the ['Unique Keys'] column.
I tried to use pd.concat but I also ran into memory problems, so instead I tried writing to a CSV file and appending the result of each dataframe to it.
After running the code, the file doesn't shrink much, so I think my assumption is right: the current routine is not removing duplicates across the whole data set.
I'm a newbie in Python, so it would really help if someone could point me in the right direction.
def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False, chunksize=CHUNK_SIZE,
                              low_memory=False)
    # new_df = pd.DataFrame()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        df.to_csv(join(file_path, output_name.replace(' Step-2', '') +
                       ' Step-3.csv'), mode='w', index=False, encoding='utf8')
If you can fit the set of unique keys in memory:
def removeduplicates(filename):
    CHUNK_SIZE = 250000
    df_iterator = pd.read_csv(filename, na_filter=False,
                              chunksize=CHUNK_SIZE,
                              low_memory=False)
    # create a set of (unique) ids
    all_ids = set()
    for df in df_iterator:
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        # Filter rows with key in all_ids
        df = df.loc[~df['Unique Keys'].isin(all_ids)]
        # Add new keys to the set
        all_ids = all_ids.union(set(df['Unique Keys'].unique()))
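Note that this loop only builds the filtered chunks in memory; you still need to write each one out. A rough end-to-end sketch of the same idea including that write step (filename_out is a hypothetical output path; the header is written only with the first chunk):

import pandas as pd

def removeduplicates(filename, filename_out, chunk_size=250000):
    all_ids = set()
    first_chunk = True
    for df in pd.read_csv(filename, na_filter=False, chunksize=chunk_size, low_memory=False):
        df = df.dropna(subset=['Unique Keys'])
        df = df.drop_duplicates(subset=['Unique Keys'], keep='first')
        # Keep only rows whose key has not been seen in any earlier chunk
        df = df.loc[~df['Unique Keys'].isin(all_ids)]
        all_ids.update(df['Unique Keys'])
        # Append the surviving rows to the output file (header only once)
        df.to_csv(filename_out, mode='w' if first_chunk else 'a',
                  index=False, header=first_chunk, encoding='utf8')
        first_chunk = False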
It's probably easier not to do it with pandas.
import csv

with open(input_csv_file) as fin:
    with open(output_csv_file, 'w', newline='') as fout:
        writer = csv.writer(fout)
        seen_keys = set()
        header = True
        for row in csv.reader(fin):
            if header:
                writer.writerow(row)
                header = False
                continue
            key = tuple(row[i] for i in key_indices)
            if not all(key):  # skip if key is empty
                continue
            if key not in seen_keys:
                writer.writerow(row)
                seen_keys.add(key)
I think this is a clear example of when you should use Dask or PySpark. Both allow you to read files that do not fit in your memory.
As an example with Dask you could do:
import dask.dataframe as dd
df = dd.read_csv(filename, na_filter=False)
df = df.dropna(subset=["Unique Keys"])
df = df.drop_duplicates(subset=["Unique Keys"])
df.to_csv(filename_out, index=False, encoding="utf8", single_file=True)
I want to use Python to reduce every two adjacent lines of data in a file into one and write the result to a new file. I use pandas and numpy for processing, but the pandas part takes very long (even a few hours), while the numpy part takes only two or three minutes. There are more than 1 million rows in total.
as part of the data:
33,Jogging,49105962326000,-0.6946377,12.680544,0.50395286;
33,Jogging,49106062271000,5.012288,11.264028,0.95342433;
33,Jogging,49106112167000,4.903325,10.882658,-0.08172209;
My pandas code is as follows:
import pandas as pd
import numpy as np
import time

time1 = time.time()
file = open('WISDM_ar_v1.1_raw.txt', 'r')
dataset = file.readlines()
list1 = []
for i in range(len(dataset)-1):
    dataset[i] = dataset[i].rstrip('\n')
    dataset[i] = dataset[i].rstrip(';')
    dataset[i] = dataset[i].split(",")
    if len(dataset[i]) == 6:
        # list1 holds the cleaned rows
        list1.append(dataset[i])
array1 = np.array(list1)
# newline is the separator between rows; delimiter is the separator between columns
np.savetxt("aa.txt", array1, fmt="%s", newline='\r\n', delimiter=",")

column_names = ['user-id', 'activity', 'timestamp', 'x-axis', 'y-axis', 'z-axis']
dataset1 = pd.read_csv('aa.txt', names=column_names, header=None)
df = pd.DataFrame(dataset1)
df1 = pd.DataFrame(columns=column_names)
for i in range(0, len(dataset1)-1):
    data = dataset1.loc[[i]]
    if dataset1.loc[i+1, 'activity'] == dataset1.loc[i, 'activity']:
        data.loc[i, 'user-id'] = dataset1.loc[i, 'user-id']
        data.loc[i, 'x-axis'] = dataset1.loc[i+1, 'x-axis'] - dataset1.loc[i, 'x-axis']
        data.loc[i, 'y-axis'] = dataset1.loc[i+1, 'y-axis'] - dataset1.loc[i, 'y-axis']
        data.loc[i, 'z-axis'] = dataset1.loc[i+1, 'z-axis'] - dataset1.loc[i, 'z-axis']
        df1 = df1.append(data, ignore_index=True)
df1.to_csv('new_data.txt', mode='a', sep=',', header=False, index=False)
I want to know why this is the case. Is there any mistake in the pandas code I wrote? Thank you very much!
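For what it's worth, the row-by-row loop with df1.append is usually the slow part. A vectorized sketch of what I read the loop as doing (keep a row only when the next row has the same activity, and replace the axis values with the difference to the next row) might look like this; it is only an illustration of the idea, not a drop-in replacement:

import pandas as pd

column_names = ['user-id', 'activity', 'timestamp', 'x-axis', 'y-axis', 'z-axis']
df = pd.read_csv('aa.txt', names=column_names, header=None)

# True for rows whose following row belongs to the same activity
same_activity = df['activity'].eq(df['activity'].shift(-1))

axes = ['x-axis', 'y-axis', 'z-axis']
# Difference between the next row and the current row for each axis column
deltas = df[axes].shift(-1) - df[axes]

out = df.loc[same_activity].copy()
out[axes] = deltas.loc[same_activity]
out.to_csv('new_data.txt', sep=',', header=False, index=False)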
My following code reads several CSV files in one folder, filters them according to the value in a column, and then appends the resulting dataframes to a CSV file. Given that there are about 410 files of 130 MB each, this code currently takes about 30 minutes. I was wondering if there is a quick way to make it faster by using a multiprocessing library. Could you offer me some tips on how to get started? Thank you.
import pandas as pd
import glob

path = r'C:\Users\\Documents\\'
allfiles = glob.glob(path + "*.csv")

with open('test.csv', 'w') as f:
    for i, file in enumerate(allfiles):
        df = pd.read_csv(file, index_col=None, header=0)
        df.sort_values(['A', 'B', 'C'], ascending=True, inplace=True)
        df['D'] = df.groupby(['A', 'B'])['C'].fillna(method='ffill')
        df[(df['D'] == 1) | (df['D'] == 0)].to_csv(f, header=False)
        print(i)
print("Done")