Python: creating a dataframe with filename and file last-modified time
I want to read the file names in a folder, which I already do using the file=glob.glob(...) call, and add each file's last-modified time in the 'file_last_mod_t' column.
The relevant part of my code:
df = pd.DataFrame(columns=['filename', 'file_last_mod_t', 'else'])
df.set_index('filename')
for file in glob.glob('folder_path'):  # inside this folder is file.txt
    file_name = os.path.basename(file)
    df.loc[file_name] = os.path.getmtime(file)  # assigns the scalar to every column
which gives me:
df:
filename,file_last_mod_t,else
file.txt,123456,123456   # 123456 is an example time value
I want to add this last-modified time only to the 'file_last_mod_t' column, not to every column.
What I want to receive:
df:
filename,file_last_mod_t,else
file.txt,123456,
Thanks in advance.
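A minimal sketch of the fix for this first version: df.loc with only a row label broadcasts the scalar into every column, so pass the column label as well and only that one cell is written:

file_name = os.path.basename(file)
df.loc[file_name, 'file_last_mod_t'] = os.path.getmtime(file)  # touches only this column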
After modifying the code:
df = pd.read_csv('C:/df.csv')
filename_list = pd.Series(result_from_other_definition)  # looks the same as the 'filename' column
df['filename'] = filename_list  # now I have a dataframe with 3 columns; the first column holds the file list
df.set_index('filename')
for file in glob.glob('folder_path'):  # inside this folder is file.txt
    df['file_last_mod_t'] = df['filename'].apply(lambda x: os.path.getmtime(x))  # how getmtime is presented does not matter for now; plain floats are fine
df.to_csv('C:/df.csv')
# printed samples:
First run:
df['filename'] = filename_list
print(df)
,'filename','file_last_mod_t','else'
0,file1.txt,NaN,NaN
1,file2.txt,NaN,NaN
The code above works fine on the first run, when df.csv is empty except for the headers.
On the next run, when df.csv already has some content and I manually change a timestamp value in the file, I get an error: TypeError: stat: path should be string, bytes, os.PathLike or integer, not float. The code should replace the manually modified cell with the correct timestamp. I think it's connected with apply.
I also don't know why an extra index column appears in df.
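A hedged guess at both symptoms: the TypeError usually means a NaN (which is a float) reached os.path.getmtime, for example when filename_list is shorter than the frame read back from df.csv; and the extra column appears because to_csv writes the row index by default. A minimal sketch of both fixes:

# only call getmtime on actual path strings; leave NaN cells untouched
df['file_last_mod_t'] = df['filename'].apply(
    lambda x: os.path.getmtime(x) if isinstance(x, str) else x)
# don't write the RangeIndex as an extra column
df.to_csv('C:/df.csv', index=False)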
**Solved**
Please see the comments in the code below:
import os
import pandas as pd
import datetime as dt
import glob

# this is the function that returns the file's modification time as a string
def getmtime(x):
    # format the file's mtime as "YYYY-MM-DD HH:MM:SS"
    return dt.datetime.fromtimestamp(os.path.getmtime(x)).strftime("%Y-%m-%d %H:%M:%S")
df = pd.DataFrame(columns=['filename', 'file_last_mod_t', 'else'])
df.set_index('filename')  # note: without reassignment this has no effect, see below
# I set the filename list into df['filename']
df['filename'] = pd.Series([file for file in glob.glob('*')])
# I apply the getmtime function to fill df['file_last_mod_t'] with each file's modification time
df['file_last_mod_t'] = df['filename'].apply(getmtime)
print(df)
The result is:
filename file_last_mod_t else
0 dataframe 2019-05-04 18:43:04 NaN
1 fer2013.csv 2018-05-26 12:18:26 NaN
2 file.txt 2019-05-04 18:49:04 NaN
3 file2.txt 2019-05-04 18:51:04 NaN
4 Untitled.ipynb 2019-05-04 17:41:04 NaN
5 Untitled1.ipynb 2019-05-04 20:51:04 NaN
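One note on the code above: set_index returns a new DataFrame rather than modifying df in place, which is why the printed result still shows the default 0..5 index. A minimal sketch, if an index by filename is actually wanted:

df = df.set_index('filename')            # reassign the result, or
df.set_index('filename', inplace=True)   # modify in place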
For the updated question, I started with a df.csv that has the following data:
filename,file_last_mod_t,else
file1.txt,,
I think you want to add new files, so I wrote the code as follows:
import os
import pandas as pd

df = pd.read_csv('df.csv')
# build a small frame holding the new files, then append it
df_adding = pd.DataFrame(columns=['filename', 'file_last_mod_t', 'else'])
df_adding['filename'] = pd.Series(['file2.txt'])
df = df.append(df_adding)
df = df.drop_duplicates('filename')
# how getmtime is presented does not matter here; plain floats are fine
df['file_last_mod_t'] = df['filename'].apply(os.path.getmtime)
df.to_csv('df.csv', index=False)
I created the df_adding dataframe for the new files and appended it to df, which was read from df.csv.
Finally, we can apply getmtime and save the result back to df.csv.
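Note that DataFrame.append was deprecated and later removed in pandas 2.0; a sketch of the same flow using pd.concat instead:

import os
import pandas as pd

df = pd.read_csv('df.csv')
df_adding = pd.DataFrame({'filename': ['file2.txt']})
# pd.concat replaces the removed df.append; missing columns fill with NaN
df = pd.concat([df, df_adding], ignore_index=True)
df = df.drop_duplicates('filename')
df['file_last_mod_t'] = df['filename'].apply(os.path.getmtime)
df.to_csv('df.csv', index=False)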
Related
Python - Write a row into an array into a text file
I have to work on a flat file (size > 500 MB) and I need to split it based on one criterion. My original file has this structure (simplified):

JournalCode|JournalLib|EcritureNum|EcritureDate|CompteNum|

I need to create two files depending on the first digit of 'CompteNum'. I have started my code as follows:

import sys
import pandas as pd
import numpy as np
import datetime

C_FILE_SEP = "|"

def main(fic):
    pd.options.display.float_format = '{:,.2f}'.format
    FileFec = pd.read_csv(fic, C_FILE_SEP, encoding='unicode_escape')

It seems OK; my concern is creating my 2 files based on the criterion. I have tried without success:

TargetFec = 'Target_' + fic + datetime.datetime.now().strftime("%Y%m%d-%H%M%S") + '.txt'
target = open(TargetFec, 'w')
FileFec = FileFec.astype(convert_dict)
for row in FileFec.iterrows():
    Fec_Cpt = str(FileFec['CompteNum'])
    nb = len(Fec_Cpt)
    if (nb > 7):
        target.write(str(row))
target.close()

The result in my target file is not what I expected:

(0, JournalCode OUVERT JournalLib JOURNAL D'OUVERTURE EcritureNum XXXXXXXXXX EcritureDate 20190101 CompteNum 101300 CompteLib CAPITAL SOUSCRIT CompAuxNum CompAuxLib PieceRef XXXXXXXXXX PieceDate 20190101 EcritureLib A NOUVEAU Debit 000000000000,00 Credit 000038188458,00 EcritureLet NaN DateLet NaN ValidDate 20190101 Montantdevise Idevise CodeEtbt 100 Unnamed: 19 NaN

I expected to get a line in my target file when CompteNum(0:1) > 7. I have read many posts over the past 2 days; some help would be perfect. There is a sample of my data available here. Philippe
Suiting the rules and the desired format, you can use logic like:

# criteria:
verify = df['CompteNum'].apply(lambda number: str(number)[0] == '8' or str(number)[0] == '9')
# saving the dataframe:
df[verify].to_csv('c:/users/jack/desktop/meets-criterios.csv', sep='|', index=False)

Original comment: As I understand it, you want to filter the imported dataframe according to some criteria. You can work directly on the pandas DataFrame you imported. Look:

# criteria:
verify = df['CompteNum'].apply(lambda number: len(str(number)) > 7)
# filtering the dataframe based on the given criteria:
df[verify]   # meets the criteria
df[~verify]  # does not meet the criteria
# saving the dataframes:
df[verify].to_csv('<your path>/meets-criterios.csv')
df[~verify].to_csv('<your path>/not-meets-criterios.csv')

Once you have the filtered dataframes, you can save them or convert them to other objects, such as dictionaries.
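A minimal vectorized sketch of the same first-digit test, assuming CompteNum can be cast to string; the .str accessor avoids the per-row lambda:

# boolean mask: first character of CompteNum is '8' or '9'
verify = df['CompteNum'].astype(str).str[0].isin(['8', '9'])
df[verify].to_csv('meets-criterios.csv', sep='|', index=False)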
convert comment (list) to dataframe, pandas
I have a big list of names that I want to keep in my interpreter, so I would rather not use CSV files. The only way I can store it in my interpreter as a variable, using copy-paste from my original file, is as a 'comment' (a triple-quoted string), so my input looks like this:

temp='''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''

My goal is to convert this 'comment' inside my interpreter to a dataframe. I tried df=pd.DataFrame([temp]) and also converting to a Series using only one column in the string, but without success. Any idea? My real data has hundreds of lines.
Use:

from io import StringIO
import pandas as pd

temp = u'''A,B,C
adam,dorothy,ben
luis,cristy,hoover'''

df = pd.read_csv(StringIO(temp))
print(df)

      A        B       C
0  adam  dorothy     ben
1  luis   cristy  hoover
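For the one-column case the question mentions, a minimal sketch under the same StringIO approach (the 'names' header here is just illustrative): read as usual, then squeeze the single-column frame into a Series:

one_col = u'''names
adam
dorothy'''
s = pd.read_csv(StringIO(one_col)).squeeze('columns')  # DataFrame -> Series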
Import multiple excel files, create a column and get values from excel file's name
I need to upload multiple Excel files, each named with a starting date, e.g. "20190114", and then append them into one DataFrame. For this, I use the following code:

all_data = pd.DataFrame()
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)

In fact, I do not need all the data, but only rows filtered on multiple columns. Then I would like to create an additional column ('from') with the file name (which is a date) for each respective file. Example: data from the Excel file named '20190101', and data from the Excel file named '20190115'. The final dataframe must have values in the 'price' column not equal to 0 and the 'code' column equal to 'r' (I do not know if it's possible to import the data already filtered, avoiding loading a huge volume of data?), and then I need to add a 'from' column with the respective date coming from the file's name. Dataframes for trial:

import pandas as pd

df1 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [0, 12.5, 17.5, 24.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
df2 = pd.DataFrame({'id': ['id_1', 'id_2', 'id_3', 'id_4', 'id_5'],
                    'price': [7.5, 24.5, 0, 149.5, 7.5],
                    'code': ['r', 'r', 'r', 'c', 'r']})
IIUC, you can filter the necessary rows and then concat; for the file name you can use os.path.split() and strip the extension with string slicing:

l = []
for f in glob.glob('C:\\path\\*.xlsx'):
    df = pd.read_excel(f)
    df['from'] = os.path.split(f)[1][:-5]  # file name without the '.xlsx' extension
    l.append(df[df['code'].eq('r') & df['price'].ne(0)])
pd.concat(l, ignore_index=True)

     id  price code      from
0  id_2   12.5    r  20190101
1  id_3   17.5    r  20190101
2  id_5    7.5    r  20190101
3  id_1    7.5    r  20190115
4  id_2   24.5    r  20190115
5  id_5    7.5    r  20190115
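A small sketch of the same name extraction with pathlib, which avoids the hard-coded [:-5] slice; Path.stem is the file name without its extension:

from pathlib import Path
import pandas as pd

frames = []
for f in Path('C:\\path').glob('*.xlsx'):
    df = pd.read_excel(f)
    df['from'] = f.stem  # e.g. '20190114' from '20190114.xlsx'
    frames.append(df[df['code'].eq('r') & df['price'].ne(0)])
result = pd.concat(frames, ignore_index=True)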
How to compile multiple excel files in numeric order (file1.xls, file2.xls, etc) into one python file?
I am trying to compile several .xls files together. I found some code that works, but it pulls the files in out of order. The files are named therm_sensor1.xls, therm_sensor2.xls, etc. I need the output to be in numeric order, but my current code seems to have them scrambled. I am very new to coding, so an explanation would be helpful :) Also, my current output has all the data except for the top 6 lines, and I have no idea why.

import pandas as pd
import glob

all_data = pd.DataFrame()
for f in glob.glob('therm_sensor*.xls'):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)
print(all_data.to_string())

Output:
6      1.739592e-05  0.30  NaN
7      2.024840e-05  0.35  NaN
8      2.309999e-05  0.40  NaN
...
502    2.949562e-10  0.95  NaN
503    3.113220e-10  1.00  NaN
I had a similar issue and eventually figured out a way, so I'll give you the solution that worked for me. One key thing I did was to name the columns before passing the data to the dataframe. See if this helps:

fileList = glob.glob("*.csv")
dfList = []
colnames = list(range(35))  # column names 0..34, as enumerated in the original
for filename in fileList:
    print(filename)
    df = pd.read_csv(filename, header=None)
    dfList.append(df)
concatDf = pd.concat(dfList, axis=0)
concatDf.columns = colnames
# concatDf.to_csv(outfile, index=None)  # you don't need this
The problem here is (probably) due to the difference in the way humans and computers tend to sort things. Take a list like this:

files = ['file10.xls', 'file2.xls', 'file1.xls']

The computer sorts this list in a way that looks unintuitive to humans (because lexicographic order goes 1, 10, 2):

>>> sorted(files)
['file1.xls', 'file10.xls', 'file2.xls']

But if you change the sort criteria you can get a more intuitive result. Here, that means isolating the part of the filename that contains the number and turning it into an integer so the computer can sort numerically:

>>> sorted(files, key=lambda s: int(s[4:-4]))
['file1.xls', 'file2.xls', 'file10.xls']

In your use case, this should do the trick:

sorted(glob.glob('therm_sensor*.xls'), key=lambda s: int(s[12:-4]))
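A slightly more general sketch of the same key idea using a regular expression, so the sort does not depend on hard-coded slice offsets (this assumes each file name contains one run of digits):

import glob
import re

# pull the first run of digits out of each name and sort on it numerically
files = sorted(glob.glob('therm_sensor*.xls'),
               key=lambda s: int(re.search(r'\d+', s).group()))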
saving the entire column in pandas via to_csv
I have a 2-column dataframe with the following values. Unfortunately, since the list of values is too long, it gets chopped with ", ...]" while saving to a csv with to_csv. How can I retain the entire list?

attribute_name  list_of_values
attribute1      [1320698, 1320699, 1096323, 1320690, 1839190, 1091359, 1325750, 1569072, 1829679, 142100, 1320163, 1829673, 588914, 418137, 757085, 588910, 1321158, 1073897, 1823533, 1823535, 1091363, 1383908, 1834826, 36191, 1829641, 767536, 1829597, 1829591, 1326727, 1834700, 1317721, 1317802, 1834838, 52799, 1383915, 1320042, 1829654, 1829655, 1829658, 647089, 1829581, 1829586, 1829587, 1321116, 1829585, 1829588, 1839799, 1588509, 1834471, 1793632, 1327850, 1793599, 1456968, 1315869, 1793605, 1321236, 1829579, 1829577, 1793609, 1829571, 1829570, 1320139, 777057, 1829671, 1829566, 1831047, 1829567, 588927, 60484, 1793596, 1829634, 1839580, 1829569, 1793615, 1323529, 1793619, 1834758, 1612974, 1320007, 1839780, 1291475, 1834835, 1834453, 1823663, 418112, 1092106, 1829689, 1829688, 1793606, 647050, 1834742, 1839551, 1839553, 1834746, 1839556, 1834745, 1575978, 1834749, 1320711, 1317910, ...]

df.to_csv(loc, index=False, header=False, sep='\t', mode='a', encoding='utf8')

I tried the display options from http://pandas.pydata.org/pandas-docs/dev/options.html, with pd.set_option('max_colwidth', 20000000000), but since that works only in display mode, not when dumping to a csv, it does not help. What else can I set to retain the contents of the entire list?

Edit: Try creating a dataframe with this original data; once you save it, you will get the distorted data pointed out above.

import pandas as pd

pd.options.display.multi_sparse = False
pd.set_option('max_colwidth', 2000000000000000000)
headers = ["attribute_name", "list_of_values"]
file_name = '/home/ekta/abcd.csv'
data = ['attribute1', ['1320698', '1320699', '1096323', '1320690', '1839190', '1091359', '1325750', '1569072', '1829679', '142100', '1320163', '1829673', '588914', '418137', '757085', '588910', '1321158', '1073897', '1823533', '1823535', '1091363', '1383908', '1834826', '36191', '1829641', '767536', '1829597', '1829591', '1326727', '1834700', '1317721', '1317802', '1834838', '52799', '1383915', '1320042', '1829654', '1829655', '1829658', '647089', '1829581', '1829586', '1829587', '1321116', '1829585', '1829588', '1839799', '1588509', '1834471', '1793632', '1327850', '1793599', '1456968', '1315869', '1793605', '1321236', '1829579', '1829577', '1793609', '1829571', '1829570', '1320139', '777057', '1829671', '1829566', '1831047', '1829567', '588927', '60484', '1793596', '1829634', '1839580', '1829569', '1793615', '1323529', '1793619', '1834758', '1612974', '1320007', '1839780', '1291475', '1834835', '1834453', '1823663', '418112', '1092106', '1829689', '1829688', '1793606', '647050', '1834742', '1839551', '1839553', '1834746', '1839556', '1834745', '1575978', '1834749', '1320711', '1317910', '1829700', '1839791', '1839796', '1320019', '1829494', '437131', '1829696', '1839576', '721318', '1829699', '1838874', '1315822', '647049', '1325775', '1320708', '133913', '835588', '1839564', '1320700', '1320707', '1839563', '1834737', '1834736', '1834734', '1823669', '1321159', '1320577', '1839768', '1823665', '1838602', '1823667', '1321099', '1753590', '1753593', '1320688', '1839583', '1326633', '1320681', '1793646', '1323683', '1091348', '982081', '1793648', '1478516', '1317650', '1829663', '1829667', '1829666', '1793640', '1839577', '1315855', '1317796', '1839775', '1321163', '1793642']]

def write_file(data, flag, headers, file_name):
    # open a df & write recursively
    print(" \n \n data", data)
    df = pd.DataFrame(data).T
    print("df \n", df)
    # write the df to file
    loc = file_name
    # loc = "%s%s_%s" % (args.base_path, args.merchant_domain, file_name)
    if flag == True:
        df.to_csv(loc, index=False, header=headers, sep='\t', mode='a', encoding='utf8')
        flag = False
    elif flag == False:
        df.to_csv(loc, index=False, header=False, sep='\t', mode='a', encoding='utf8')
    return loc

# I call the function above with this data & headers; I pass flag as True the 1st time,
# after which I write with flag=False.
write_file(data, True, headers, file_name)

Debug: the length of the original list is 155; the distorted list saved with to_csv has 100 data points. Purpose of loc & flag: loc is the file location, and flag indicates whether I am writing the 1st row or the 2nd row onwards, since I don't need to write the headers again once the 1st row has been written.

Here's how I solved it

The main diagnosis was that I couldn't store the entire list I was passing, even when I treated it as a dict object; maybe it is because of the way pandas treats column lengths, but this is only the diagnosis. I got the whole list back by writing the file not with to_csv (pandas) but as a simple file, and then reading it back with pandas, in which case I could get the entire list back.

import pandas as pd

# Note that I changed my headers from the initial list format
headers = "attribute_name\tlist_of_values"
loc = '/home/ekta/abcd.csv'
data = ['attribute1', ['1320698', '1320699', '1096323', '1320690', '1839190', '1091359', '1325750', '1569072', '1829679', '142100', '1320163', '1829673', '588914', '418137', '757085', '588910', '1321158', '1073897', '1823533', '1823535', '1091363', '1383908', '1834826', '36191', '1829641', '767536', '1829597', '1829591', '1326727', '1834700', '1317721', '1317802', '1834838', '52799', '1383915', '1320042', '1829654', '1829655', '1829658', '647089', '1829581', '1829586', '1829587', '1321116', '1829585', '1829588', '1839799', '1588509', '1834471', '1793632', '1327850', '1793599', '1456968', '1315869', '1793605', '1321236', '1829579', '1829577', '1793609', '1829571', '1829570', '1320139', '777057', '1829671', '1829566', '1831047', '1829567', '588927', '60484', '1793596', '1829634', '1839580', '1829569', '1793615', '1323529', '1793619', '1834758', '1612974', '1320007', '1839780', '1291475', '1834835', '1834453', '1823663', '418112', '1092106', '1829689', '1829688', '1793606', '647050', '1834742', '1839551', '1839553', '1834746', '1839556', '1834745', '1575978', '1834749', '1320711', '1317910', '1829700', '1839791', '1839796', '1320019', '1829494', '437131', '1829696', '1839576', '721318', '1829699', '1838874', '1315822', '647049', '1325775', '1320708', '133913', '835588', '1839564', '1320700', '1320707', '1839563', '1834737', '1834736', '1834734', '1823669', '1321159', '1320577', '1839768', '1823665', '1838602', '1823667', '1321099', '1753590', '1753593', '1320688', '1839583', '1326633', '1320681', '1793646', '1323683', '1091348', '982081', '1793648', '1478516', '1317650', '1829663', '1829667', '1829666', '1793640', '1839577', '1315855', '1317796', '1839775', '1321163', '1793642']]
flag = True

# write to a file
with open(loc, 'a') as f:
    if flag:
        f.write(headers + "\n")
        flag = False
    # explicitly write a tab-separated line
    f.write(str(data[0]) + "\t" + str(data[1]) + "\n")

# read the file & confirm
df = pd.read_csv(loc, sep='\t', header='infer')
print(df['list_of_values'].iloc[0])
print(len(df['list_of_values'].iloc[0]))  # Yah !!

155

Thanks to #paul, who diagnosed the problem and pointed me in this direction.
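A more direct sketch of round-tripping a list through one CSV cell, assuming the standard json module is acceptable: serialize the list to a JSON string before saving and parse it back after reading, so nothing depends on how pandas stringifies long lists:

import json
import pandas as pd

df = pd.DataFrame({'attribute_name': ['attribute1'],
                   'list_of_values': [['1320698', '1320699', '1096323']]})
# store the list as an explicit JSON string so no element is ever elided
df['list_of_values'] = df['list_of_values'].apply(json.dumps)
df.to_csv('abcd.csv', sep='\t', index=False)

# read back and restore real Python lists
df2 = pd.read_csv('abcd.csv', sep='\t')
df2['list_of_values'] = df2['list_of_values'].apply(json.loads)
print(len(df2['list_of_values'].iloc[0]))  # 3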