Extracting and manipulating data from Excel worksheets with Python

Scenario: I am trying to come up with Python code that reads all the workbooks in a given folder and loads the data of each into its own data frame (each workbook becomes a dataframe, so I can manipulate them individually).
Issue 1: With this code, even though I am using the proper path and file types, I keep getting the error:
File "<ipython-input-3-2a450c707fbe>", line 14, in <module>
f = open(file,'r')
FileNotFoundError: [Errno 2] No such file or directory: '(1)Copy of
Preisanfrage_17112016.xlsx'
Issue 2: The reason for me to create different data frames is that each workbook has an individual format (rows are my identifiers and columns are dates). My problem is that some of these workbooks have data on a sheet named "Closing", or "Opening", or the name is not specified. So I will try to configure each data frame individually and then join them afterwards.
Issue 3: Considering the final output once the data frame data is already unified, my objective is to output it in a format like:
date 1 identifier 1 value
date 1 identifier 2 value
date 1 identifier 3 value
date 1 identifier 4 value
date 2 identifier 1 value
date 2 identifier 4 value
date 2 identifier 5 value
Obs 1: For the output, not all dates have the same array of identifiers.
Question 1: Any ideas why the code is yielding this error? Is there a better way to extract data from Excel?
Question 2: Is it possible to create a unique dataframe for each worksheet? Is this a good practice?
Question 3: Can I do this type of output using a loop? Is this a good practice?
Obs 2: I don't know how relevant this is, but I am using Python 3.6 with Anaconda.
Code so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob, os
import datetime as dt
from datetime import datetime
import matplotlib as mpl
directory = os.path.join("C:\\","Users\\Dgms\\Desktop\\final 2")
for root, dirs, files in os.walk(directory):
    for file in files:
        print(file)
        f = open(file, 'r')
        df1 = pd.read_excel(file)

I think you do not need your open. And I would store the frames in a list; you can then either use pd.concat(list_of_dfs) or make some manual changes.
list_of_dfs = []
for root, dirs, files in os.walk(directory):
    for file in files:
        f = os.path.join(root, file)
        print(f)
        list_of_dfs.append(pd.read_excel(f))
Or, using glob:

import glob

list_of_dfs = []
for file in glob.iglob(os.path.join(directory, '*.xlsx')):
    print(file)
    list_of_dfs.append(pd.read_excel(file))
Or, as jackie suggests, you can read specific sheets: list_of_dfs.append(pd.concat([pd.read_excel(file, 'Opening'), pd.read_excel(file, 'Closing')])). If only one of the two sheets is available, you could even change to:
try:
    list_of_dfs.append(pd.read_excel(file, 'Opening'))
except:
    pass
try:
    list_of_dfs.append(pd.read_excel(file, 'Closing'))
except:
    pass
(Of course, you should specify the exact error, but I can't test that at the moment.)
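Alternatively, instead of a bare except you could check which sheets a workbook actually contains before reading it; a minimal sketch using pd.ExcelFile, assuming the sheet names from the question:

xls = pd.ExcelFile(file)
# read only the sheets that actually exist in this workbook
wanted = [s for s in ('Opening', 'Closing') if s in xls.sheet_names]
if wanted:
    list_of_dfs.append(pd.concat([xls.parse(s) for s in wanted]))
else:
    list_of_dfs.append(xls.parse(0))  # sheet name unspecified: take the first sheet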

Issue 1: If you are using an IDE or Jupyter, put the absolute path to the file.
Or add the project folder to the system path (a workaround, not recommended).
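Finally, for the long-format output described in Issue 3 and Question 3, you would not need a loop: once each frame is cleaned up so that identifiers sit in the index and the columns are dates (as Issue 2 describes), a melt gives exactly that shape. A sketch under those assumptions:

long_df = (pd.concat(list_of_dfs)
             .rename_axis('identifier')
             .reset_index()
             .melt(id_vars='identifier', var_name='date', value_name='value')
             .dropna(subset=['value']))  # dates missing an identifier simply drop out
long_df = long_df[['date', 'identifier', 'value']].sort_values(['date', 'identifier'])

This also matches Obs 1: identifiers that are absent for a given date are dropped rather than filled.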

Related

Extracting a column from a collection of csv files and constructing a new table with said data

I'm a newbie when it comes to Python, with a bit more experience in MATLAB. I'm currently trying to write a script that loops through a folder to pick up all the .csv files, extracts column 14 from csv file 1 and adds it to column 1 of a new table, extracts column 14 from csv file 2 and adds it to column 2 of the new table, and so on, building up a table of column 14 from all the csv files in the folder. I'd ideally like the headers of the new table to show the respective filename that each column 14 has been extracted from.
I've considered that Python is base 0, so I've double-checked that it reads the desired column, but as my code stands I can only get it to print all the files' 14th columns in one array, and I'm not sure how to split it up to put it into a table. Perhaps via a dataframe, although I'm not entirely sure how they work.
Any help would be greatly appreciated!
Code attached below:
import os
import sys
import csv

pathName = "D:/GLaDOS-CAMPUS/data/TestData-AB/"
numFiles = []
fileNames = os.listdir(pathName)
for fileNames in fileNames:
    if fileNames.endswith(".csv"):
        numFiles.append(fileNames)
print(numFiles)

for i in numFiles:
    file = open(os.path.join(pathName, i), "rU")
    reader = csv.reader(file, delimiter=',')
    for column in reader:
        print(column[13])
Finding files.
I'm not sure whether your way of finding files is right or not, since I don't have a folder of csv files to test against. But I can say it is much better to use glob to get the list of files:
from glob import glob
files = glob("/Path/To/Files/*.csv")
This will return all csv files.
Reading CSV files
Now we need to find a way to read all the files and get the 14th column (index 13) from each. I don't know if it is overkill, but I prefer to use pandas and numpy to get the column.
To read a column of a csv file using pandas one can use:
pd.read_csv(file, usecols=[COL])
Now we can loop over the files and get the 14th column of each:
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
Notice we converted all values to numpy arrays.
Merging all columns
In columns we have each file's column as an element of a list, so technically these are rows, not columns.
Now we should take the transpose of the array so they become columns:
pd.DataFrame(np.transpose(columns))
The code
The whole code would look like:
from glob import glob
import pandas as pd
import numpy as np
files = glob("/Path/To/Files/*.csv")
columns = [pd.read_csv(file, usecols=[13]).values[:, 0] for file in files]
print(pd.DataFrame(np.transpose(columns)))
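If you also want the headers of the new table to show the respective filenames, as the question asks, here is a possible variant; using the base file name as the column label is an assumption on my part:

import os

# map each file's base name to its 14th column (index 13)
named = {os.path.basename(f): pd.read_csv(f, usecols=[13]).iloc[:, 0] for f in files}
print(pd.DataFrame(named))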

Repeating the same process for multiple csv files

I am a beginner at Python. I have about 1000 CSV files (1.csv, 2.csv, ..., 1000.csv). Each CSV file has about 3,000,000,000 rows and 14 variables. I would like to clean the data in each CSV file first, using the same process for every file:
sum variable A and variable B,
count C by sorting date, if the number of records in C for one day is greater than 50, then drop it.
Next, save the cleaned data into a new CSV file. Lastly, append all 1000 new CSV files into one CSV file.
I have some code as follows, but it imports all CSV files first and then processes them to clean the data, which is very inefficient. I would like to clean the data in each CSV file first, then append the new CSV files. Can anyone help me with this? Any help will be appreciated.
This is what I understand from your question. I read each file and add a new column for the summation. Then I sort the values and drop any row whose C is greater than 50. After that, I save the update. Before you do this you should copy your original files, or you can save the results under a different file name.
import glob
import os
import pandas as pd
path = "./data/"
all_files = glob.glob(os.path.join(path, "*.csv")) #make list of paths
for file in all_files:
    # get the file name without extension
    file_name = os.path.splitext(os.path.basename(file))[0]
    df = pd.read_csv(file)                # read the full path, not the bare name
    df['new_column'] = df['A'] + df['B']
    df = df.sort_values(by='C')           # sort_values returns a new frame
    df.drop(df.loc[df['C'] > 50].index, inplace=True)
    # save under a new name so the originals are kept
    df.to_csv(os.path.join(path, file_name + '_cleaned.csv'), index=False)
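The question's last step, appending all the new files into one CSV, is not shown above; a minimal sketch, assuming the cleaned files were saved with the _cleaned suffix used in the loop:

# gather the cleaned files and append them into a single CSV
cleaned = glob.glob(os.path.join(path, "*_cleaned.csv"))
combined = pd.concat((pd.read_csv(f) for f in cleaned), ignore_index=True)
combined.to_csv(os.path.join(path, "combined.csv"), index=False)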

Error when adding a new column to pandas dataframe

I am trying to modify .csv files in a folder. The files contain flight information from years 2011-2016.
However, year information cannot be found in the values.
I would like to solve this by using the filename of the .csv file which contains the year. I am adding a new 'year' column after reading it into a pandas dataframe. I will then export the modified file to a new .csv with only the year as its filename.
However, I am encountering this error:
ValueError: Length of values does not match length of index
Code below for your reference.
import pandas as pd
import glob
import re
import os
path = r'data_caap/'
all_files = glob.glob(os.path.join(path, "*.csv"))
for f in all_files:
    df = pd.read_csv(f)
    year = re.findall(r'\d{4}', f)
    df['year'] = year  # Error here
    df.to_csv(year)    # Error here
Found the cause of the error.
It must be df['year'] = year[0]; findall returns a list. – DyZ
Thanks a lot, @DyZ
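With that fix applied (and passing a string rather than a list to to_csv, so each output file is named after its year, as the question intends), the loop might look like this sketch:

for f in all_files:
    df = pd.read_csv(f)
    year = re.findall(r'\d{4}', f)[0]  # findall returns a list; take the first match
    df['year'] = year                  # a scalar broadcasts to every row
    df.to_csv(year + '.csv', index=False)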

Using Pandas read_table with list of files

I am pretty new to Python in general, but am trying to make a script that takes data from certain files in a folder and puts it into an Excel spreadsheet.
The code I have will find the file type that I want in my specified folder, and then make a list with the full file paths.
import os
file_paths = []
for folder, subs, files in os.walk('C://Users/Dir'):
    for filename in files:
        if filename.endswith(".log") or filename.endswith(".txt"):
            file_paths.append(os.path.abspath(os.path.join(folder, filename)))
It will also take a specific file path, pull data from the correct column, and put it into Excel in the correct cells.
import pandas as pd
import numpy
for i in range(len(file_paths)):
    fields = ['RDCR']
    data = pd.read_table(file_paths[i], sep=r"\s+", names=fields, usecols=[3])
Where I am having trouble is making read_table iterate through my list of files and put the data into an Excel sheet, moving over one column in the spreadsheet each time it reads a new file.
Ideally, the for loop would see how long the file_paths list is and use that as the range. It would then use file_paths[i] to feed the file names into read_table one by one.
What happens instead is that it finds the length of file_paths, but rather than iterating through the files one by one, it just inputs the data from the last file in the list.
Any help would be much appreciated! Thank you!
Try concatenating all of them at once and writing to Excel a single time.
from glob import glob
import pandas as pd
files = glob('C://Users/Dir/*.log') + glob('C://Users/Dir/*.txt')
def read_file(f):
    fields = ['RDCR']
    return pd.read_table(
        f, sep=r"\s+",
        names=fields, usecols=[3])

df = pd.concat([read_file(f) for f in files], axis=1)
df.to_excel('out.xlsx')
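If you want to tell the columns in the spreadsheet apart, you could label each one by its source file; using the base name as the header is my assumption, not something the question specifies:

import os

# one column per file, headed by the file's base name
df = pd.concat(
    [read_file(f).rename(columns={'RDCR': os.path.basename(f)}) for f in files],
    axis=1)
df.to_excel('out.xlsx')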

Read multiple .xlsx files from a directory into separate Pandas data frames based on file name

I want to load multiple xlsx files with varying structures from a directory and assign each its own data frame based on the file name. I have 30+ files with differing structures, but for brevity please consider the following:
3 Excel files: [wild_animals.xlsx, farm_animals.xlsx, domestic_animals.xlsx]
I want to assign each its own data frame, so if the file name contains 'wild' it is assigned to wild_df, if 'farm' then farm_df, and if 'domestic' then dom_df. This is just the first step in a process, as the actual files contain a lot of 'noise' that needs to be cleaned depending on file type etc.; the file names will also change on a weekly basis, with only a few key markers staying the same.
My assumption is that the glob module is the best way to begin, but in terms of taking very specific parts of the file name and using them to assign to a specific df I become a bit lost, so any help is appreciated.
I asked a similar question a while back but it was part of a wider question most of which I have now solved.
I would parse them into a dictionary of DataFrames:
import os
import glob
import pandas as pd
files = glob.glob('/path/to/*.xlsx')
dfs = {}
for f in files:
    dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)
then you can access them as normal dictionary elements:
dfs['wild_animals']
dfs['domestic_animals']
etc.
You need to get all the xlsx files; then, using a dict comprehension, you can access any element:
import pandas as pd
import os
import glob
path = 'Your_path'
extension = 'xlsx'
os.chdir(path)
result = [i for i in glob.glob('*.{}'.format(extension))]
excel_files = {elm: pd.ExcelFile(elm) for elm in result}
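Note that pd.ExcelFile only opens the workbook; call .parse() on it to get an actual DataFrame. For example, using one of the file names from the question:

wild_df = excel_files['wild_animals.xlsx'].parse(0)  # first sheet as a DataFrame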
For completeness I wanted to show the solution I ended up using. It is very close to Khelili's suggestion, with a few tweaks to suit my particular code, including not creating a DataFrame at this stage:
import os
import pandas as pd
import openpyxl as excel
import glob
#setting up path
path = 'data_inputs'
extension = 'xlsx'
os.chdir(path)
files = [i for i in glob.glob('*.{}'.format(extension))]
#Grouping files - brings multiple files of same type together in a list
wild_groups = ([s for s in files if "wild" in s])
domestic_groups = ([s for s in files if "domestic" in s])
#Sets up a dictionary associated with the file groupings to be called in another module
file_names = {"WILD":wild_groups, "DOMESTIC":domestic_groups}
...
