I am reading a few XLS files via
import os
import pandas as pd

path = r'pathtofolder'
files = os.listdir(path=path)
dataframes = {}
for file in files:
    filepath = os.path.join(path, file)
    if filepath.endswith('.xls'):
        print(file)
        dataframes[file] = pd.read_excel(filepath)
For some reason, however, I can't access the dataframes inside the dictionary, as .head() doesn't seem to work:
for file, dataframe in dataframes.items():
    dataframe.head()
This code doesn't seem to do anything in Jupyter. However, when I check type(dataframe), I get pandas.core.frame.DataFrame, so head() should be working, right?
I haven't worked with pandas dataframes much, but your for loop won't give you any output this way: each head() is computed and then discarded, because Jupyter only auto-displays the value of the last expression in a cell, not values produced inside a loop. You can just use print() to see your output:
for file, dataframe in dataframes.items():
    print(dataframe.head())
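If you want Jupyter's rich table rendering inside the loop rather than plain text, IPython's display() also works, for example:

from IPython.display import display

for file, dataframe in dataframes.items():
    print(file)                # show which file the head belongs to
    display(dataframe.head())  # renders the usual HTML table in Jupyter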
Alternatively, collect the head() results in a reusable list, as shown below; you can enter the list name in the console at any time to view it later. (Pardon the boilerplate for building a dictionary of example dataframes.)
import pandas as pd
from sklearn import datasets

iris = pd.DataFrame(datasets.load_iris().data)
digits = pd.DataFrame(datasets.load_digits().data)
diabetes = pd.DataFrame(datasets.load_diabetes().data)

dataframes = {'a': iris, 'b': digits, 'c': diabetes}  # create a dictionary of dataframes

list_heads = []  # create a list of each dataframe's head()
for i in dataframes:
    list_heads.append(dataframes[i].head())

list_heads
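Equivalently, a dict comprehension keeps each head() tied to its file name (a small sketch reusing the dataframes dictionary from the question):

heads = {name: df.head() for name, df in dataframes.items()}
heads  # enter in the console to view all heads, keyed by file name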
Related
Hi, I am trying to create multiple CSV files from a single big CSV using Python. The original CSV holds 1-minute date/time data for multiple stocks, with Open, High, Low, Close, and Volume as the other columns.
Sample data from the original file is here.
At first, I tried to copy an individual Ticker and all its corresponding rows to a new file with the following code:
import pandas as pd

excel_file_path = r'C:\Users\mahan\Documents\test projects\01_07_APR_WEEKLY_expiry_data_VEGE_NF_AND_BNF_Options_Desktop_Vege.csv'
export_path = r"C:\Users\mahan\Documents\exportfiles\{output_file_name}_sheet.csv"

data = pd.read_csv(excel_file_path, index_col="Ticker")  # making a data frame from the csv file
rows = data.loc[['NIFTYWK17500CE']]  # retrieving rows by the loc method
output_file_name = "NIFTYWK17500CE_"

print(type(rows))
rows
rows.to_csv(export_path)
The result was something like this:
a file was saved with the literal name "{output_file_name}_sheet.csv"
So I failed at naming the file, but the data pertaining to all rows with the Ticker value 'NIFTYWK17500CE' was copied correctly.
Then I tried to create an array of the unique values in the "Ticker" column, create a dataframe from the original file with all the data, and use a for loop to select the rows matching each value in the 'Ticker' index and copy them to a new file, using the value in the exported CSV's file name. Code as below:
import pandas as pd

excel_file_path = r'C:\Users\mahan\Documents\test projects\01_07_APR_WEEKLY_expiry_data_VEGE_NF_AND_BNF_Options_Desktop_Vege.csv'
df2 = pd.read_csv(excel_file_path)
df2_uniques = df2['Ticker'].unique()
df2_counts = df2['Ticker'].value_counts()

for value in df2_uniques:
    value = value.replace(' ', '_')
    export_path = r"C:\Users\mahan\Documents\exportfiles\{value}__sheet.csv"
    df = pd.read_csv(excel_file_path, index_col="Ticker")
    rows = df.loc[['value']]
    print(type(rows))
    rows.to_csv(export_path)
Received an error:
KeyError: "None of [Index(['value'], dtype='object', name='Ticker')] are in the [index]"
Where did I go wrong?
1. In naming the file properly in the earlier code.
2. In the second code.
Any help is really appreciated. Thanks in advance.
SOLVED
What worked for me was the following, with comments:
import pandas as pd

excel_file_path = r'C:\Users\mahan\Documents\test projects\01_07_APR_WEEKLY_expiry_data_VEGE_NF_AND_BNF_Options_Desktop_Vege.csv'
df2 = pd.read_csv(excel_file_path)
df2_uniques = df2['Ticker'].unique()

for value in df2_uniques:
    value = value.replace(' ', '_')
    df = pd.read_csv(excel_file_path, index_col="Ticker")
    rows = df.loc[[value]]  # changed from 'value' to value
    print(type(rows))
    rows.to_csv(r'_' + value + '.csv')
# I removed export_path, as getting the filename and filepath to work together was giving me a hard time.
# The files get saved in the current working directory (for me, the same path as the imported file), so that'll do. Sharing just for reference.
I can't know for sure without seeing the dataframe, but the error indicates that there is no column named 'Ticker'. It appears that you set this column to be the index, so you can try df2_uniques = set(df2.index).
I changed
rows = df.loc[['value']]
to
rows = df.loc[[value]]
so that loc looks up the loop variable's contents rather than the literal string 'value'.
I also removed export_path, as the filename and filepath together were giving me a hard time to figure out. The files get saved in the current working directory (for me, the same path as the imported file), so that'll do. Sharing just for reference.
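As an aside, the naming problem in the first attempt was that the export path was a plain raw string, so {output_file_name} was never interpolated; Python wrote the braces out literally. An f-string substitutes the variable. A minimal sketch using the loop variable from the code above:

# note the f in rf"": the string is both raw (for the backslashes) and formatted
export_path = rf"C:\Users\mahan\Documents\exportfiles\{value}__sheet.csv"
rows.to_csv(export_path)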
Final code that worked looked like this:
import pandas as pd

excel_file_path = r'C:\Users\mahan\Documents\test projects\01_07_APR_WEEKLY_expiry_data_VEGE_NF_AND_BNF_Options_Desktop_Vege.csv'
df2 = pd.read_csv(excel_file_path)
df2_uniques = df2['Ticker'].unique()

for value in df2_uniques:
    value = value.replace(' ', '_')
    df = pd.read_csv(excel_file_path, index_col="Ticker")
    rows = df.loc[[value]]  # changed from 'value' to value
    print(type(rows))
    rows.to_csv(r'_' + value + '.csv')
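As a side note, pandas' groupby can do the whole split in one pass, without re-reading the CSV once per ticker. A sketch under the same assumptions about the input file:

import pandas as pd

df = pd.read_csv(excel_file_path)
for ticker, group in df.groupby('Ticker'):
    # one output file per unique ticker, spaces replaced as in the original code
    group.to_csv(f"{ticker.replace(' ', '_')}_sheet.csv", index=False)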
First-time poster here! I have perused these forums for a while and I am taken aback by how supportive this community is.
My problem involves several Excel files with the same naming scheme, column headers, and data types that I am trying to read in with pandas. After reading them in, I want to compare the 'Agreement Date' column across all the dataframes and create a yes/no column indicating whether they match. I then want to export the resulting dataframe.
I am still learning Python and Pandas so I am struggling with this task. This is my code so far:
import pandas as pd
import glob

xlpath = "/Users/myname/Documents/Python/"

# read .xlsx file into a list
allfiles = glob.glob(xlpath + "*.xls")

# for loop to read in all files
for excelfiles in allfiles:
    raw_excel = pd.read_excel(allfiles)
    # place all the pulled dataframes into a list
    list = [raw_excel]
From here, though, I am quite lost. How do I join all of my files together on my id column and then compare the 'Agreement Date' columns? Any help would be greatly appreciated!
THANKS!!
In your loop you need to hand the looped value, not the whole list, to read_excel.
You have to append to the list within the loop; otherwise only the last item will be in the list.
Do not overwrite Python builtins such as list, or you can encounter some difficult-to-debug behavior.
Here's what I would change:
import pandas as pd
import glob

xlpath = "/Users/myname/Documents/Python/"

# get the file name list of .xls files in the directory
allfiles = glob.glob(xlpath + "*.xls")

# read in all files and collect the resulting dataframes in a list
dataframes_list = []
for file in allfiles:
    dataframes_list.append(pd.read_excel(file))
You can then append the dataframes like this:

merged_df = dataframes_list[0]
for df in dataframes_list[1:]:
    merged_df = merged_df.append(df, ignore_index=True)  # append returns a new frame, so reassign

Use ignore_index if the indexes overlap and cause problems; if they are already distinct and you want to keep them, set it to False. (Note that DataFrame.append was removed in pandas 2.0; pd.concat(dataframes_list, ignore_index=True) does the same in one call.)
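To the actual question of comparing 'Agreement Date' across files, here is a hedged sketch, assuming each file has an 'ID' column to join on (that column name is a guess; substitute your real id column):

import pandas as pd

# merge every frame on the shared id column, suffixing the duplicate columns
merged = dataframes_list[0]
for i, df in enumerate(dataframes_list[1:], start=2):
    merged = merged.merge(df, on="ID", suffixes=("", f"_{i}"))

# a row matches when all of its 'Agreement Date' columns hold a single value
date_cols = [c for c in merged.columns if c.startswith("Agreement Date")]
merged["Dates Match"] = merged[date_cols].nunique(axis=1).eq(1).map(
    {True: "yes", False: "no"}
)
merged.to_csv("agreement_date_check.csv", index=False)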
I've searched for about an hour for an answer to this and none of the solutions I've found are working. I'm trying to get a folder full of CSVs into a single dataframe, to output to one big csv. Here's my current code:
import os
import pandas as pd

sourceLoc = "SOURCE"
destLoc = sourceLoc + "MasterData.csv"
masterDF = pd.DataFrame([])

for file in os.listdir(sourceLoc):
    workingDF = pd.read_csv(sourceLoc + file)
    print(workingDF)
    masterDF.append(workingDF)

print(masterDF)
SOURCE is a folder path, but I've had to remove it as it's a work network path. The loop is reading the CSVs into the workingDF variable, since running it prints the data to the console, but it also finds 349 rows for each file, and none of them have that many rows of data.
When I print masterDF, it prints Empty DataFrame Columns: [] Index: [].
My code is from this solution, but that example uses xlsx files, and I'm not sure what changes, if any, are needed to make it work with CSVs. The pandas documentation on .append and read_csv is quite limited and doesn't indicate anything specific I'm doing wrong.
Any help would be appreciated.
There are a couple of things wrong with your code, but the main thing is that DataFrame.append returns a new dataframe instead of modifying in place. So you would have to do:
masterDF = masterDF.append(workingDF)
I also like the approach taken by I_Al-thamary: concat will probably be faster.
One last thing I would suggest: instead of using glob, check out pathlib.
import pandas as pd
from pathlib import Path

path = Path("your path")
df = pd.concat(map(pd.read_csv, path.rglob("*.csv")))
You can use glob:
import glob
import pandas as pd
import os
path = "your path"
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(path,'*.csv'))))
print(df)
You may store them all in a list and pd.concat them at the end.

dfs = [
    pd.read_csv(os.path.join(sourceLoc, file))
    for file in os.listdir(sourceLoc)
]
masterDF = pd.concat(dfs)
I am currently working on importing and formatting a large number of Excel files (all with the same format/schema, but different values) with Python.
I have already read in and formatted one file, and everything has worked fine so far.
I would now like to do the same for all the other files and combine everything in one dataframe, i.e. read the first Excel file into a dataframe, append the second at the bottom, then the third, and so on until all the Excel files are in one dataframe.
So far my script looks something like this:
import pandas as pd
import numpy as np
import xlrd
import os

path = os.getcwd()
path = "path of the directory"
wbname = "name of the excel file"

files = os.listdir(path)
files

wb = xlrd.open_workbook(path + wbname)

# I only need the second sheet
df = pd.read_excel(path + wbname, sheet_name="sheet2", skiprows=2, header=None,
                   skipfooter=132)

# here is where all the formatting happens ...

df
So, "files" is a list with all file relevant names. Now I have to try to put one file after the other into a loop (?) so that they all eventually end up in df.
Has anyone ever done something like this or can help me here.
Something like this might work:
import os
import pandas as pd

list_dfs = []
for file in os.listdir('path_to_all_xlsx'):
    # os.listdir returns bare file names, so join them back onto the folder path
    df = pd.read_excel(os.path.join('path_to_all_xlsx', file))  # plus the rest of your parsing config
    list_dfs.append(df)

all_dfs = pd.concat(list_dfs)
You read all the dataframes and add them to a list, and then pd.concat joins them together into one big dataframe.
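Applied to the script above with the read_excel options from the question (a sketch; the .xls/.xlsx filter and the path value are assumptions):

import os
import pandas as pd

path = "path of the directory"
frames = []
for name in os.listdir(path):
    if name.endswith((".xls", ".xlsx")):  # skip anything that isn't an Excel file
        frames.append(pd.read_excel(
            os.path.join(path, name),
            sheet_name="sheet2", skiprows=2, header=None, skipfooter=132,
        ))
df = pd.concat(frames, ignore_index=True)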
So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd

for index, file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object and read the first item into df1, the 2nd into df2, etc. But upon trying this I realized: I have no idea how to change the variable being assigned while doing the assigning via an iteration!
That's what prompts my question. I managed to find another way to get my original job done, no problemo, but this issue of variable assignment over an iteration is something I haven't been able to find clear answers on!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib, which was added in Python 3.4, instead of os.
import pandas as pd
from pathlib import Path

csvs = Path.cwd().glob('*.csv')  # creates a generator of the matching paths
# swap Path.cwd() for Path(your_path) if the script lives in a different location

dfs = {}  # let's hold the csvs in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows (number of rows) to your spec

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This returns a dictionary of dataframes keyed by CSV file name; .stem gives the name without the extension.
Much like:

{
    'csv_1': dataframe,
    'csv_2': dataframe,
}
If you want to concat these, then do

df = pd.concat(dfs)

and the outer level of the resulting index will be the csv file name.
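For example, you can name that outer index level and turn it into a regular column (a small sketch; 'source_file' is just an illustrative name):

import pandas as pd

df = pd.concat(dfs, names=['source_file'])  # dict keys become the outer index level
df = df.reset_index(level='source_file')    # promote the file stem to a column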