I have a folder containing hundreds of files of comma-separated data; however, the files themselves have no file extensions (e.g., EPI or DXPX, not EPI.csv or DXPX.csv).
I am trying to create a loop that reads in only certain files that I need (15 to 20 files). I do not want to concat or append the dfs; I merely want to read each df into memory and be able to call it by name.
Even though there is no extension, I can read the file in as .csv
YRD = pd.read_csv('YRD', low_memory=False)
My expected result from the loop below is two dfs: one labeled YRD and another labeled HOUSE. However, I only get one df named raw_df, and it holds only the final file in the list. Sorry if this is a silly question, but I cannot figure out what I am missing.
df_list = ['YRD', 'HOUSE']
for raw_df in df_list:
    raw_df = pd.read_csv(raw_df, low_memory=False)
This is because you reassign the variable raw_df every time you encounter a new file, so only the last DataFrame survives the loop.
You should create new variables (or container entries), not reuse the loop variable:
mydfs = []
for raw_df in df_list:
    mydfs.append(pd.read_csv(raw_df, low_memory=False))
or you can put them into a dictionary:
mydfs = {}
for raw_df in df_list:
    mydfs[raw_df] = pd.read_csv(raw_df, low_memory=False)
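As a sketch of the dictionary approach (with in-memory buffers standing in for the extension-less files on disk), each frame is then callable by name:

```python
import io
import pandas as pd

# Stand-ins for the extension-less files; with real files you would pass
# the bare names ('YRD', 'HOUSE') straight to read_csv.
sources = {
    'YRD': io.StringIO("a,b\n1,2\n3,4"),
    'HOUSE': io.StringIO("a,b\n5,6"),
}

dfs = {name: pd.read_csv(src, low_memory=False) for name, src in sources.items()}

# Each DataFrame is now addressable by its name:
print(dfs['YRD'].shape)    # (2, 2)
print(dfs['HOUSE'].shape)  # (1, 2)
```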
I am trying to concatenate two Excel files with the same column names, but there seems to be a problem: new empty columns are being added to my new Excel file, and I don't know why.
I used the pd.concat() function, which was supposed to concatenate the two files into one sheet and write a new file, but when it appends the table in the second file to the first, extra columns appear in the merged result.
file_list = glob.glob(path + "/*.xlsx")
dfs = [pd.read_excel(p) for p in file_list]
print(dfs[0].shape)
res = pd.concat(dfs)
That is a snippet of my code
I also added a picture of what the result I am getting now looks like.
Concat respects the column names, so it is not a plain vector concatenation. Check whether the column names are identical across all your source files; if not, normalize or rename them, or move to a vector-based format such as NumPy arrays.
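A minimal reproduction of the symptom, assuming the headers differ by something as small as a stray space:

```python
import pandas as pd

# Two frames whose headers differ only by a trailing space -- a common
# cause of "extra empty columns" after concat.
a = pd.DataFrame({'Name': ['x'], 'Value': [1]})
b = pd.DataFrame({'Name ': ['y'], 'Value': [2]})  # note the trailing space

bad = pd.concat([a, b])
print(bad.columns.tolist())  # ['Name', 'Value', 'Name ']

# Normalizing the headers removes the phantom column:
b.columns = b.columns.str.strip()
good = pd.concat([a, b], ignore_index=True)
print(good.columns.tolist())  # ['Name', 'Value']
```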
First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several Excel files with the same column headers and data types that I am trying to read in with pandas. After reading them in, I want to compare the column 'Agreement Date' across all the DataFrames and create a yes/no column indicating whether they match. I then want to export the resulting DataFrame.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# read .xlsx file into a list
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
    # place all the pulled dataframe into a list
    list = [raw_excel]
From here, though, I am quite lost. I do not know how to join all of my files together on my id column and then compare the 'Agreement Date' column. Any help would be greatly appreciated!
THANKS!!
In your loop you need to hand the looped value (file), not the whole list, to read_excel.
You have to append to the list within the loop; otherwise only the last item will end up in it.
Do not overwrite Python builtins such as list, or you can run into behavior that is difficult to debug.
Here's what I would change:
import pandas as pd
import glob
xlpath = "/Users/myname/Documents/Python/"
# get file name list of .xlsx files in the directory
allfiles = glob.glob(xlpath + "*.xls")
# for loop to read in all files & place all the pulled dataframe into a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then combine the DataFrames like this:
merged_df = pd.concat(dataframes_list, ignore_index=True)
Note that DataFrame.append returned a new object rather than modifying merged_df in place (and it was removed in pandas 2.0), so calling merged_df.append(df) in a loop would silently discard every result; pd.concat is the way to go. Use ignore_index=True if the indexes overlap and cause problems; if they are already distinct and you want to keep them, set it to False.
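To get from there to the asker's goal of comparing 'Agreement Date', one sketch (assuming each file shares an 'ID' key column; the column values here are made up) merges two of the frames and flags matching dates:

```python
import pandas as pd

# Hypothetical stand-ins for two of the Excel files.
df1 = pd.DataFrame({'ID': [1, 2], 'Agreement Date': ['2020-01-01', '2020-02-01']})
df2 = pd.DataFrame({'ID': [1, 2], 'Agreement Date': ['2020-01-01', '2020-03-01']})

# Join on the id column, keeping both date columns apart via suffixes.
merged = df1.merge(df2, on='ID', suffixes=('_a', '_b'))
match = merged['Agreement Date_a'] == merged['Agreement Date_b']
merged['Dates Match'] = match.map({True: 'yes', False: 'no'})
print(merged['Dates Match'].tolist())  # ['yes', 'no']
```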
I have 10 files which I need to work on.
I need to import those files using pd.read_csv to turn them all into dataframes along with usecols as I only need the same two specific columns from each file.
I then need to search the two columns for a specific entry like 'abcd' and have Python return a new df that includes all the rows it appeared in, for each file.
Is there a way I could do this using a for loop? So far I've only got a list of the paths to the 10 files.
So far what I do for one file without the for loop is:
df = pd.read_csv(r'filepath', header=2, usecols=['Column1', 'Column2'])
search_df = df.loc[df['Column1'] == 'abcd']
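Putting that into a loop over the path list might look like this (in-memory buffers stand in for the 10 file paths, and the column names are taken from the snippet above):

```python
import io
import pandas as pd

# Stand-ins for the file paths; real code would iterate over the path list
# and hand each path to read_csv.
files = [
    io.StringIO("meta,info\nmore,info\nColumn1,Column2\nabcd,1\nxyz,2\n"),
    io.StringIO("meta,info\nmore,info\nColumn1,Column2\nabcd,3\n"),
]

matches = []
for f in files:
    df = pd.read_csv(f, header=2, usecols=['Column1', 'Column2'])
    matches.append(df.loc[df['Column1'] == 'abcd'])

# One frame holding every matching row from every file:
search_df = pd.concat(matches, ignore_index=True)
print(len(search_df))  # 2
```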
I am fairly new to Python and want some assistance with generating a new column called Ticker when reading in multiple csv files. As the Yahoo! Finance API is deprecated, I am reading in csv data from Yahoo! Finance for 'GOOG', 'IBM' and 'AAPL'. The following code reads the individual csv files into one DataFrame; however, it is hard to distinguish which stock is which.
path =
allFiles = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list_ = []
for file in allFiles:
    df = pd.read_csv(file, index_col=None, header=0)
    list_.append(df)
frame = pd.concat(list_)
frame.head()
Is it possible to create a column called Ticker that has the name of the csv file for each observation for each stock? Eg. GOOG.csv is the file name for Google, IBM.csv is the file name for IBM...
This would make it easier to identify which stock is which.
According to this previous post, I am led to believe that you have two clear options: either (1) include names=[] in the original read_csv command to specify the stock name, or (2) set the column names on the DataFrame after loading.
Approach (1) might involve replacing your current read with the following code snippet:
df=pd.read_csv(file,names=[file[len(path)+1:-4]],index_col=None)
Here I assumed I could get the string of the desired ticker by looking at all the characters after the one slash following path, and up to the .csv.
Approach (2) might be accomplished by adding the following line of code after reading the csv but before appending the dataframe:
df.columns=[file[len(path)+1:-4]]
I have assumed in this response that you only have/want one column of data for each csv, but if you wanted to put multiple columns in you would simply specify more than one name in the list of column names.
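A third option, not from the linked post, keeps the price columns untouched and instead adds a Ticker column derived from the file name; sketched here with a made-up frame and path:

```python
import os
import pandas as pd

# Hypothetical path and data; inside the loop, 'file' would come from glob.
file = 'GOOG.csv'
df = pd.DataFrame({'Close': [101.0, 102.5]})

# Tag every row with the ticker taken from the file name (minus '.csv').
df['Ticker'] = os.path.basename(file)[:-4]
print(df['Ticker'].unique())  # ['GOOG']
```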
This is the code I have so far:
import pandas as pd
import glob, os
os.chdir("L:/FMData/")
results = pd.DataFrame([])
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results = results.append(namedf)
results.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This, however, produces the new document as instructed but doesn't copy any data into it. I want to look up files in a certain location with names along the lines of FM001, combine them, skip the first 7 rows of each csv, and keep only columns 1 and 2 in the new file. Can anyone help with my code?
Thanks in advance!!!
To combine multiple csv files, you should create a list of dataframes. Then combine the dataframes within your list via pd.concat in a single step. This is much more efficient than appending to an existing dataframe.
In addition, you need to write your result to a file outside your for loop.
For example:
results = []
for counter, file in enumerate(glob.glob("F5331_FM001**")):
    namedf = pd.read_csv(file, skiprows=[1,2,3,4,5,6,7], index_col=[1], usecols=[1,2])
    results.append(namedf)
df = pd.concat(results, axis=0)
df.to_csv('L:/FMData/FM001_D/FM5331_FM001_D.csv')
This code works on my side (using Linux and Python 3); it populates a csv file with data.
Add a print just after the read_csv to see whether your csv file actually reads any data, else nothing will be written:
namedf = pd.read_csv(file)
print(namedf)
results = results.append(namedf)
It keeps row 1 (probably because it is considered the header), then skips 7 rows and continues. This is my result for a csv file containing the words one to eleven, one per row:
F5331_FM001.csv
one
0 nine
1 ten
2 eleven
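That skiprows behavior can be reproduced with an in-memory buffer (the words one to eleven, one per row, as above):

```python
import io
import pandas as pd

data = "\n".join(["one", "two", "three", "four", "five", "six",
                  "seven", "eight", "nine", "ten", "eleven"])

# Row 0 ("one") becomes the header; rows 1-7 ("two".."eight") are skipped,
# leaving nine, ten and eleven as the data rows.
namedf = pd.read_csv(io.StringIO(data), skiprows=[1, 2, 3, 4, 5, 6, 7])
print(namedf['one'].tolist())  # ['nine', 'ten', 'eleven']
```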
Addition:
If print(namedf) shows nothing, then check your input files.
The python program is looking in L:/FMData/ for your files. Are you sure your files are located in that directory? You can change the directory by adding the correct path with the os.chdir command.