Function gives list, I need DataFrame so I can concatenate - python

I started off pulling all files in the folder and concatenating them. This version works:
warranty_list = []
warranty_files = glob.glob(os.path.join(qms, '*.csv'))
for file_ in warranty_files:
    df = pd.read_csv(file_,index_col=None, header=0)
    warranty_list.append(df)
warranty = pd.concat(warranty_list)
Then I had to write a function so I would only grab certain files and concatenate them, but this one is not working. I do not get an error but the last line is not being used, so I am not concatenating the files.
def get_warranty(years=5):
    warranty_list = [] #list for glob.glob()
    current_year = datetime.datetime.today().year #current year
    last_n_years = [str(current_year-i) for i in range(0,years+1)]
    for year in last_n_years:
        warranty = glob.glob(os.path.join(qms, "Warranty Detail%s.csv" % year))
        if warranty:
            for file_ in warranty:
                df = pd.read_csv(file_,index_col=None, header=0)
                warranty_list.append(df)
    warranty_df = pd.concat(warranty_list)
The last line presumably isn't working because pd.concat() is getting a list as input and won't do anything with it. I don't understand why it worked in the first set of code and not in this one.
I don't know how to change the function to get a data frame or how to change what I get at the end into a data frame.
Any suggestions?

I would suggest using append directly, because it does the same thing as concat.
So basically you start with an empty DataFrame:
warranty_df = pd.DataFrame()
Then you append each of the other DataFrames to it while reading the files.
Your function stays the same, but you need to delete the following line:
warranty_df = pd.concat(warranty_list)
And after the loop, you return warranty_df!
def get_warranty(years=5):
    warranty_df = pd.DataFrame()
    current_year = datetime.datetime.today().year #current year
    last_n_years = [str(current_year-i) for i in range(0,years+1)]
    for year in last_n_years:
        warranty = glob.glob(os.path.join(qms, "Warranty Detail%s.csv" % year))
        if warranty:
            for file_ in warranty:
                df = pd.read_csv(file_,index_col=None, header=0)
                # DataFrame.append was removed in pandas 2.0; there, use pd.concat([warranty_df, df])
                warranty_df = warranty_df.append(df)
    return warranty_df
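
For comparison, here is a minimal sketch of the list-then-concat version from the question with the missing return added, which is likely all the original function needed. The folder and current_year parameters are assumptions added here so the function does not depend on the global qms or on today's date; they are not in the original code.

```python
import datetime
import glob
import os

import pandas as pd


def get_warranty(folder, years=5, current_year=None):
    """Concatenate the 'Warranty Detail<year>.csv' files of the last `years` years."""
    if current_year is None:
        current_year = datetime.date.today().year

    frames = []  # one DataFrame per file that was actually found
    for year in range(current_year - years, current_year + 1):
        pattern = os.path.join(folder, "Warranty Detail%s.csv" % year)
        for file_ in glob.glob(pattern):
            frames.append(pd.read_csv(file_, index_col=None, header=0))

    if not frames:
        # pd.concat raises ValueError on an empty list, so hand back an empty frame
        return pd.DataFrame()
    # Concatenate once, after the loop, and return the result to the caller
    return pd.concat(frames, ignore_index=True)
```

Calling it as warranty = get_warranty(qms) gives the caller the combined DataFrame; without the return statement the result is discarded when the function ends.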

Related

How to save the sum of specific columns from different files

I have written some code that groups by the column I need to keep as it is and sums the targeted columns:
import pandas as pd
import glob as glob
import numpy as np
#Read excel and Create DF
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
    df = pd.read_excel(f,index_col=None, na_values=['NA'])
    df['filename'] = f
    data = all_data.append(df,ignore_index=True)
#Group and Sum
result = data.groupby(["Date"])["Families","Individuals"].agg([np.sum])
#Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
The problem is here:
#Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
The code gives me the result that I want; however, it only takes into account the last file that it iterates through. I need to save the sums from all the different files.
Thank you
Simply put, you never change all_data in the loop, since it is never reassigned. Each iteration appends to the empty data frame initialized outside the loop and stores the result in data, so only the very last file is retained. A quick (not recommended) fix would be:
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
    ...
    all_data = all_data.append(df, ignore_index=True) # CHANGE LAST LINE IN LOOP
# USE all_data (NOT data) aggregation
result = all_data.groupby(...)
However, reconsider growing a data frame inside a loop. As @unutbu warns us: never call DataFrame.append or pd.concat inside a for-loop, because it leads to quadratic copying. Instead, the recommended approach is to build a list of data frames and concatenate it once outside the loop. You can do that with a list comprehension, and even assign the filename column along the way:
# BUILD LIST OF DFs
df_list = [(pd.read_excel(f, index_col=None, na_values=['NA'])
              .assign(filename = f)
           ) for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx')]
# CONCATENATE ALL DFs
data = pd.concat(df_list, ignore_index=True)
# AGGREGATE DATA (several columns are selected with a list, hence the double brackets)
result = data.groupby(["Date"])[["Families", "Individuals"]].agg([np.sum])
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
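
To see the build-a-list-then-concatenate pattern run end to end, here is a small self-contained sketch. It uses two throwaway CSV files with made-up numbers instead of the Excel files from the question (an assumption, so that no Excel engine is required); the column names Date, Families and Individuals are taken from the question.

```python
import glob
import os
import tempfile

import pandas as pd

# Two small throwaway files with hypothetical data, standing in for the real workbooks
folder = tempfile.mkdtemp()
pd.DataFrame({"Date": ["2014-09-01", "2014-09-02"],
              "Families": [1, 2],
              "Individuals": [5, 9]}).to_csv(os.path.join(folder, "a.csv"), index=False)
pd.DataFrame({"Date": ["2014-09-01", "2014-09-02"],
              "Families": [3, 4],
              "Individuals": [10, 20]}).to_csv(os.path.join(folder, "b.csv"), index=False)

# Read every file into its own DataFrame and remember which file each row came from
frames = [pd.read_csv(f).assign(filename=os.path.basename(f))
          for f in sorted(glob.glob(os.path.join(folder, "*.csv")))]

# Concatenate once, outside any loop
data = pd.concat(frames, ignore_index=True)

# Sum the target columns per date across all files
result = data.groupby("Date")[["Families", "Individuals"]].sum()
print(result)
```

Because every file ends up in frames before the single pd.concat call, the sums cover all files and not only the last one read.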

Pandas reading multiple files from different folders

I have the same file with quarterly data saved in different folders corresponding to the quarter. In other words, there is a quarter 1 folder, a quarter 2 folder, a quarter 3 folder and a quarter 4 folder. This is the only difference in the file path. I am looking to read all four files in and concatenate them into one database. I can do this manually using a version of the simplified code below and changing the period each time.
period = 'Q1'
filepath = 'filepath/' + period
file = filepath + '/file.xls'
df = pd.read_excel(file)
I would like to automate this with some form of for loop (I assume) that loops through the four periods, reads each file into a database and then concatenates them. I have read other answers on how this can be done with files in the same folder, but I am struggling to do it when the file path changes. Manually putting the files into the same folder is not a desirable solution.
I tried making period a tuple and a list containing all 4 periods then a simple for loop but this didn’t work. I got the following error message.
TypeError: Can't convert 'list' object to str implicitly
Greatly appreciate any advice.
How about you first use list comprehension to get a list of all files:
periods= ["Q1", "Q2", "Q3", "Q4"]
files = ["filepath/"+ p + "/file.xls" for p in periods]
and then load them all into a list of data frames with
dfs = []
for f in files:
    df = pd.read_excel(f)
    dfs.append(df)
You can use these loops to create full file paths and to iterate over them to create one DataFrame containing all the files.
filepath = 'path/'
file = 'file.xlsx'
periods=['Q1','Q2','Q3','Q4']
files = []
for p in periods:
    files.append(filepath+p+'/'+file)
files
data = []
for f in files:
    data.append(pd.read_excel(f))
df = pd.concat(data)
You probably want something like this:
periods = ['Q1', 'Q2', 'Q3', 'Q4']
df = None
for period in periods:
    filepath = 'filepath/' + period
    file = filepath + '/file.xls'
    if df is None:
        df = pd.read_excel(file)
    else:
        # append returns a new DataFrame, so the result must be assigned back
        df = df.append(pd.read_excel(file))
You could try something like this:
complete_df = pd.DataFrame()
for i in range(1,5):
    quarter = 'Q'+str(i)
    filepath = 'filepath/' + quarter
    file = filepath + '/file.xls'
    df = pd.read_excel(file)
    complete_df = complete_df.append(df)
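
The answers above all come down to the same idea: build one path per quarter, read each file, and concatenate once. A minimal sketch of that idea as a reusable function follows. The name read_quarters, its parameters and the extra period column are assumptions made for illustration; the reader defaults to pd.read_csv only so the sketch can be tried without an Excel engine.

```python
import os

import pandas as pd


def read_quarters(base_dir, filename, periods=("Q1", "Q2", "Q3", "Q4"), reader=pd.read_csv):
    """Read base_dir/<period>/filename for every period and stack the results."""
    frames = []
    for period in periods:
        # Only the folder name changes from quarter to quarter
        path = os.path.join(base_dir, period, filename)
        # Tag each row with its quarter so the source stays visible after stacking
        frames.append(reader(path).assign(period=period))
    # Concatenate once, after all files have been read
    return pd.concat(frames, ignore_index=True)
```

For the layout in the question, the call would look like read_quarters('filepath', 'file.xls', reader=pd.read_excel). Note that each path is built from a single period string; adding the whole list of periods to a string is what produces the TypeError quoted in the question.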

Import multiple excel files into pandas and create a column based on name of file

I have multiple Excel files in one folder which I want to read and concatenate, but while concatenating them I want to add a column based on the name of each file:
'D:\\156667_Report.xls',
'D:\\192059_Report.xls',
'D:\\254787_Report.xls',
'D:\\263421_Report.xls',
'D:\\273554_Report.xls',
'D:\\280163_Report.xls',
'D:\\307928_Report.xls'
I can read these files in pandas with following script
path = 'D:\\' # use your path (a raw string cannot end in a backslash)
allFiles = glob.glob(path + "/*.xls")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_excel(file_,index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)
I want to add a column named Code to every file I read. Code should be the number from the filename, e.g. 156667, 192059.
Why not just match the digits with a regular expression
foo = re.search(r'(\d+)_Report', file_)
num = foo.group(1)
df['Code'] = num
inside your loop?
One way you could do this is by using join and isdigit inside a generator expression.
isdigit keeps only the digit characters of the file name, and join glues them back into a single string.
To be clear, you could change your for loop to this:
for file_ in allFiles:
    df = pd.read_excel(file_,index_col=None, header=0)
    df['Code'] = ''.join(str(i) for i in file_ if i.isdigit())
    list_.append(df)
which will add a column called Code to each df.
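
One caveat with the isdigit approach is that it collects digits from the whole path, so a folder name containing digits would leak into Code. A hedged alternative sketch: take the base name first and match only the digits in front of _Report. The helper name code_from_filename is hypothetical, and ntpath is used here (an assumption about the Windows-style paths shown in the question) so that the backslash-separated paths are split the same way on any operating system.

```python
import ntpath  # Windows path rules, importable on every platform
import re


def code_from_filename(path):
    """Return the digits that precede '_Report' in the file name, or None."""
    name = ntpath.basename(path)  # e.g. '156667_Report.xls'
    match = re.match(r"(\d+)_Report", name)
    return match.group(1) if match else None


print(code_from_filename("D:\\156667_Report.xls"))        # 156667
print(code_from_filename("D:\\2016\\192059_Report.xls"))  # 192059
```

Inside the loop this would be used as df['Code'] = code_from_filename(file_).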

Dataframe to .csv - is only writing last value - Python/Pandas

I'm trying to write a dataframe to a .csv using df.to_csv(). For some reason, it's only writing the last value (the data for the last ticker). The script reads through a list of tickers (turtle.csv, all tickers are in the first column) and spits out price data for each ticker. I can print all the data without a problem but can't seem to write it to .csv. Any idea why? Thanks
input_file = pd.read_csv("turtle.csv", header=None)
for ticker in input_file.iloc[:,0].tolist():
    data = web.DataReader(ticker, "yahoo", datetime(2011,6,1), datetime(2016,5,31))
    data['ymd'] = data.index
    year_month = data.index.to_period('M')
    data['year_month'] = year_month
    first_day_of_months = data.groupby(["year_month"])["ymd"].min()
    first_day_of_months = first_day_of_months.to_frame().reset_index(level=0)
    last_day_of_months = data.groupby(["year_month"])["ymd"].max()
    last_day_of_months = last_day_of_months.to_frame().reset_index(level=0)
    fday_open = data.merge(first_day_of_months,on=['ymd'])
    fday_open = fday_open[['year_month_x','Open']]
    lday_open = data.merge(last_day_of_months,on=['ymd'])
    lday_open = lday_open[['year_month_x','Open']]
    fday_lday = fday_open.merge(lday_open,on=['year_month_x'])
    monthly_changes = {i:MonthlyChange(i) for i in range(1,13)}
    for index,ym, openf,openl in fday_lday.itertuples():
        month = ym.strftime('%m')
        month = int(month)
        diff = (openf-openl)/openf
        monthly_changes[month].add_change(diff)
    changes_df = pd.DataFrame([monthly_changes[i].get_data() for i in monthly_changes],columns=["Month","Avg Inc.","Inc","Avg.Dec","Dec"])
    CSVdir = r"C:\Users\..."
    realCSVdir = os.path.realpath(CSVdir)
    if not os.path.exists(CSVdir):
        os.makedirs(CSVdir)
    new_file_name = os.path.join(realCSVdir,'PriceData.csv')
    new_file = open(new_file_name, 'wb')
    new_file.write(ticker)
    changes_df.to_csv(new_file)
Use 'a' (append mode) instead of 'wb', because 'wb' truncates the file and so overwrites the data on every iteration of the loop. For the different modes of opening a file, see the documentation of Python's built-in open().
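
A minimal sketch of the append-mode idea, using to_csv's own mode and header parameters so that no file handle has to be opened by hand. The results dictionary holds made-up frames standing in for the per-ticker changes_df, and the Ticker column and temporary output path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

out_path = os.path.join(tempfile.mkdtemp(), "PriceData.csv")

# Hypothetical per-ticker results, standing in for changes_df in the question
results = {
    "AAA": pd.DataFrame({"Month": [1, 2], "Inc": [3, 4]}),
    "BBB": pd.DataFrame({"Month": [1, 2], "Inc": [5, 6]}),
}

for ticker, changes_df in results.items():
    # Write the header only when the file does not exist yet
    first_write = not os.path.exists(out_path)
    # mode="a" adds rows to the end of the file and keeps what is already there
    changes_df.assign(Ticker=ticker).to_csv(
        out_path, mode="a", header=first_write, index=False
    )

print(pd.read_csv(out_path))
```

An alternative with the same outcome is to collect the per-ticker frames in a list and write a single pd.concat result after the loop.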

How to read multiple files from different folder in python

I have yearly data files in different folders. Each file contains daily data ranging from Jan 1 to Dec 31. A data file name looks like AS060419.67, where the last four digits represent the year (i.e. 1967) and 0604 is the folder name.
I tried to read these files using the code below, but it only reads the last year's data in the last folder.
def date_parser(doy, year):
    return dt.datetime.strptime(doy.zfill(3)+year, '%j%Y')
files = glob.glob('????/AS*')
files.sort()
files
STNS = {}
for f in files:
    stn_id, info = f.split('/')
    year = "".join(info[-5:].split('.'))
    #print (f,stn_id)
    with open(f) as fo:
        data = fo.readlines()[:-1]
        data = [d.strip() for d in data]
        data = '\n'.join(data)
    with open('data.dump', 'w') as dump:
        dump.write(data)
    parser = lambda date: date_parser(date, year=year)
    df = pd.read_table('data.dump', delim_whitespace=True,names=['date','prec'],
                       na_values='DNA', parse_dates=[0], date_parser=parser, index_col='date' )
    df.replace({'T': 0})
    df = df.apply(pd.to_numeric, args=('coerce',))
    df.name = stn_name
    df.sid = stn_id
    if stn_id not in STNS.keys():
        STNS[stn_name] = df
    else:
        STNS[stn_id] = STNS[stn_id].append(df)
        STNS[stn_id].name = df.name
        STNS[stn_id].sid = df.sid
    #outfile.write(line)
To make the plot:
for stn in STNS:
    STNS[stn_id].plot()
    plt.title('Precipitation for {0}'.format(STNS[stn].name))
The problem is that it reads only the last year's data in the last folder. Can anyone help me figure out this problem? Your help will be highly appreciated.
You can do it like this:
import os
import glob
import pandas as pd
import matplotlib.pyplot as plt
# file mask
fmask = r'./data/????/AS*.??'
# all RegEx replacements
replacements = {
    r'T': 0
}
# list of data files
flist = glob.glob(fmask)
def read_data(flist, date_col='date', **kwargs):
    dfs = []
    for f in flist:
        # parse year from the file name
        y = os.path.basename(f).replace('.', '')[-4:]
        df = pd.read_table(f, **kwargs)
        # replace day of year with a date
        df[date_col] = pd.to_datetime(y + df[date_col].astype(str).str.zfill(3), format='%Y%j')
        dfs.append(df)
    return pd.concat(dfs, ignore_index=True)
df = read_data(flist,
               date_col='date',
               sep=r'\s+',
               header=None,
               names=['date','prec'],
               engine='python',
               skipfooter=1,
              ) \
     .replace(replacements, regex=True) \
     .set_index('date') \
     .apply(pd.to_numeric, args=('coerce',))
df.plot()
plt.show()
I've downloaded only four files, so the corresponding data you can see on the plot...
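
The two small steps the answer relies on, pulling the year out of a name like AS060419.67 and turning a day-of-year number into a date, can be checked in isolation with the standard library. The helper names year_from_name and doy_to_date are hypothetical and only illustrate the same string handling as the code above.

```python
import datetime as dt


def year_from_name(filename):
    # "AS060419.67" -> "AS06041967" -> "1967": drop the dot, keep the last four characters
    return filename.replace(".", "")[-4:]


def doy_to_date(doy, year):
    # %j is the zero-padded day of the year (001-366), so pad the number to three digits
    return dt.datetime.strptime(str(year) + str(doy).zfill(3), "%Y%j")


year = year_from_name("AS060419.67")
print(year)                    # 1967
print(doy_to_date(1, year))    # 1967-01-01 00:00:00
print(doy_to_date(365, year))  # 1967-12-31 00:00:00
```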
You overwrite the same file over and over again.
Derive your target file name from your source file name.
Or use the append mode if you want it all in the same file.
How do you append to a file?
