Error when adding a new column to pandas dataframe - python

I am trying to modify .csv files in a folder. The files contain flight information from years 2011-2016.
However, year information cannot be found in the values.
I would like to solve this by using the filename of the .csv file which contains the year. I am adding a new 'year' column after reading it into a pandas dataframe. I will then export the modified file to a new .csv with only the year as its filename.
However, I am encountering this error:
ValueError: Length of values does not match length of index
Code below for your reference.
import pandas as pd
import glob
import re
import os
path = r'data_caap/'
all_files = glob.glob(os.path.join(path, "*.csv"))
for f in all_files:
    df = pd.read_csv(f)
    year = re.findall(r'\d{4}', f)
    #Error here
    df['year'] = year
    #Error here
    df.to_csv(year)

Found the cause of the error.
It must be df['year'] = year[0], because re.findall returns a list. – DyZ
Thanks a lot, @DyZ.
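Putting the fix together, a minimal sketch of the corrected loop might look like the following. The helper name `add_year_column` and the `out_dir` parameter are hypothetical, and the code assumes that the first four-digit run in each filename is the year. It also passes a string filename to `to_csv`, since a list is not a valid path either.

```python
import glob
import os
import re

import pandas as pd


def add_year_column(src_dir, out_dir):
    """Add a 'year' column taken from each CSV filename and write <year>.csv to out_dir."""
    written = []
    for f in sorted(glob.glob(os.path.join(src_dir, "*.csv"))):
        # Search only the basename so digits in the directory path are ignored.
        match = re.search(r"\d{4}", os.path.basename(f))
        if match is None:
            continue  # no four-digit year in this filename; skip it
        year = match.group(0)  # a single string, not a list
        df = pd.read_csv(f)
        df["year"] = year  # a scalar is broadcast to every row
        out_path = os.path.join(out_dir, f"{year}.csv")
        df.to_csv(out_path, index=False)
        written.append(out_path)
    return written
```

Assigning a scalar avoids the length mismatch: a one-element list only fits a dataframe with exactly one row.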

Related

How to add filename as column to every file in a directory python

Hi there Stack Overflow community,
I have several CSV files in a folder and I need to add a column to each CSV containing the first 8 characters of its filename. After this step I want to save the dataframe, including the new column, back to the same file.
I get the right output, but the changes are not saved to the CSV file :/
Maybe someone has some inspiration for me. Thanks a lot!
from tkinter.messagebox import YES
import pandas as pd
import glob, os
import fnmatch
import os
files = glob.glob(r'path\*.csv')
for fp in files:
    df = pd.concat([pd.read_csv(fp).assign(date=os.path.basename(fp).split('.')[0][:8])])
    #for i in df('date'):
    #Decoder problem
    print(df)
Use df.to_csv, like this:
for fp in files:
    df = pd.concat([pd.read_csv(fp).assign(date=os.path.basename(fp).split('.')[0][:8])])
    df.to_csv(fp, index=False)  # index=False if you don't want to save the index as a new column in the csv
By the way, I think this also works and is more readable:
for fp in files:
    df = pd.read_csv(fp)
    df['date'] = os.path.basename(fp).split('.')[0][:8]
    df.to_csv(fp, index=False)
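The same idea can be written with pathlib. This is only a sketch, not the answerer's code: the function name `tag_with_filename_prefix` and its `n_chars` and `column` parameters are made up for illustration. It overwrites each file in place, so it is worth trying on a copy of the folder first.

```python
from pathlib import Path

import pandas as pd


def tag_with_filename_prefix(folder, n_chars=8, column="date"):
    """Add a column holding the first n_chars of each CSV's filename, then overwrite the file."""
    for fp in sorted(Path(folder).glob("*.csv")):
        df = pd.read_csv(fp)
        # stem is the filename without its final extension
        df[column] = fp.stem[:n_chars]
        # Without this call the new column only exists in memory.
        df.to_csv(fp, index=False)
```

Note that `Path.stem` drops only the last extension, whereas `split('.')[0]` cuts at the first dot, so the two differ for names such as `a.b.csv`.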

Pandas - Reading CSVs to dataframes in a FOR loop then appending to a master DF is returning a blank DF

I've searched for about an hour for an answer to this and none of the solutions I've found are working. I'm trying to get a folder full of CSVs into a single dataframe, to output to one big csv. Here's my current code:
import os
import pandas as pd
sourceLoc = "SOURCE"
destLoc = sourceLoc + "MasterData.csv"
masterDF = pd.DataFrame([])
for file in os.listdir(sourceLoc):
    workingDF = pd.read_csv(sourceLoc + file)
    print(workingDF)
    masterDF.append(workingDF)
print(masterDF)
SOURCE stands in for a folder path that I've had to remove because it's a work network path. The loop does read each CSV into the workingDF variable, since the data is printed to the console when I run it, but it reports 349 rows for every file, and none of them have that many rows of data in them.
When I print masterDF it prints Empty DataFrame Columns: [] Index: []
My code is from this solution but that example is using xlsx files and I'm not sure what changes, if any, are needed to get it to work with CSVs. The Pandas documentation on .append and read_csv is quite limited and doesn't indicate anything specific I'm doing wrong.
Any help would be appreciated.
There are a couple of things wrong with your code, but the main one is that DataFrame.append returns a new dataframe instead of modifying the original in place. So you would have to do:
masterDF = masterDF.append(workingDF)
I also like the approach taken by I_Al-thamary - concat will probably be faster.
One last suggestion: instead of using glob, check out pathlib.
import pandas as pd
from pathlib import Path
path = Path("your path")
df = pd.concat(map(pd.read_csv, path.rglob("*.csv")))
You can use glob:
import glob
import pandas as pd
import os
path = "your path"
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(path,'*.csv'))))
print(df)
You can store them all in a list and pd.concat them at the end.
dfs = [
    pd.read_csv(os.path.join(sourceLoc, file))
    for file in os.listdir(sourceLoc)
]
masterDF = pd.concat(dfs)
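As a hedged sketch that pulls the answers together: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so collecting the frames in a list and calling `pd.concat` once is the pattern that works on current versions. The function name `build_master` and the `master_name` parameter are hypothetical. Skipping the output file is an assumption, not something the question confirms: because `MasterData.csv` is written into the source folder, a later run would otherwise read it back in as input.

```python
import os

import pandas as pd


def build_master(source_dir, master_name="MasterData.csv"):
    """Concatenate every CSV in source_dir (except the master file) into one DataFrame."""
    frames = []
    for name in sorted(os.listdir(source_dir)):
        if not name.lower().endswith(".csv") or name == master_name:
            continue  # ignore non-CSV files and a previously written master file
        frames.append(pd.read_csv(os.path.join(source_dir, name)))
    if not frames:
        return pd.DataFrame()  # pd.concat raises ValueError on an empty list
    return pd.concat(frames, ignore_index=True)
```

The result can then be written out with `build_master(sourceLoc).to_csv(destLoc, index=False)`.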

Programmatically ingesting Excel files into pandas dataframes by reading the filename

I have a folder with 6 files, 4 are excel files that I would like to bring into pandas and 2 are just other files. I want to be able to use pathlib to work with the folder to automatically ingest the excel files I want into individual pandas dataframes. I would also like to be able to name each new dataframe with the name of the excel file (without the file extension)
for example.
import pandas as pd
import pathlib as pl
folder = pl.WindowsPath(r'C:\Users\username\project\output')
files = [e for e in folder.iterdir()]
for i in files:
    print(i)
['C:\Users\username\project\output\john.xlsx',
'C:\Users\username\project\output\paul.xlsx',
'C:\Users\username\project\output\random other file not for df.xlsx',
'C:\Users\username\project\output\george.xlsx',
'C:\Users\username\project\output\requirements for project.txt',
'C:\Users\username\project\output\ringo.xlsx' ]
From here, I'd like to be able to do something like
for i in files:
    if ' ' not in str(i.name):
        str(i.name.strip('.xlsx')) = pd.read_excel(i)  # pseudocode, not valid Python
That is: read the file name and, if it doesn't contain any spaces, take the name, remove the file extension and use that as the variable name for a pandas dataframe built from the Excel file.
If what I'm doing isn't possible then I have other ways to do it, but they repeat a lot of code.
Any help is appreciated.
Using pathlib and re, we can exclude any files that match a certain pattern in a dictionary comprehension, in this case any files with whitespace in the name.
from pathlib import Path
import re
import pandas as pd
pth = (r'C:\Users\username\project\output')
files = Path(pth).glob('*.xlsx')  # use `rglob` if you want to trawl subdirectories too
dfs = {file.stem: pd.read_excel(file) for file in
       files if not re.search(r'\s', file.stem)}
based on the above you'll get :
{'john': pandas.core.frame.DataFrame,
'paul': pandas.core.frame.DataFrame,
'george': pandas.core.frame.DataFrame,
'ringo': pandas.core.frame.DataFrame}
where each pandas.core.frame.DataFrame stands for the dataframe read from that file.
You can then access one with dfs['john'].
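One detail worth spelling out: `str.strip('.xlsx')` in the question does not remove a suffix. It removes any of the characters `.`, `x`, `l` and `s` from both ends of the string, so some names get mangled, which is one reason to prefer `Path.stem`. A small illustration using made-up filenames (`alex.xlsx` is not from the question):

```python
from pathlib import Path

names = ["john.xlsx", "alex.xlsx", "random other file not for df.xlsx"]

# str.strip treats its argument as a set of characters, not as a suffix,
# so "alex.xlsx" loses its trailing "x" as well and becomes "ale".
stripped = [n.strip(".xlsx") for n in names]

# Path.stem drops only the final extension.
stems = [Path(n).stem for n in names]

# Same filter as in the question: keep only names without spaces.
keys = [s for s in stems if " " not in s]

print(stripped)
print(stems)
print(keys)
```

Keeping the dataframes in a dictionary keyed by these stems also avoids creating variable names dynamically, which plain assignment cannot do.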

How to Include Source File in Column in Pandas Dataframe

I have a list of 50+ Excel files that I loop through and consolidate into one dataframe. However, I need to know the source of the data, since the data will repeat across these files.
Each file name is the date of the report. Since this data is time series data, I need to pull this date into the dataframe to do further manipulations.
import os
import glob
import pandas as pd
path = r"path"
extension = 'xls*'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
files_df = pd.concat([pd.read_excel(fp, usecols=[0,15], header=None) for fp in files], ignore_index=True)
I get the expected dataframe. I just do not know how to include the source file name as a third column. I thought there was an argument to do this in pd.read_excel(), but I couldn't find it.
For example, I have a list of the following files:
02-2019.xlsx
03-2011.xls
04-2014.xls
etc
I want to include those file names next to the data that comes from that file in the combined dataframe.
Maybe use the keys= parameter in pd.concat()?
import os
import glob
import pandas as pd
path = r"path"
extension = 'xls*'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))
# remove ignore_index=True otherwise keys parameter won't work
files_df = pd.concat([pd.read_excel(fp, usecols=[0,15], header=None) for fp in files], keys=[f"{fp.split('.')[0]}" for fp in files])
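You can then move the keys out of the index with reset_index() and convert them with to_datetime():
files_df = files_df.reset_index(level=0)  # the keys level becomes a column named 'level_0'
files_df['level_0'] = pd.to_datetime(files_df['level_0'], format='%m-%Y')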
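An alternative that avoids the MultiIndex altogether is to tag each frame with `assign` before concatenating. This is a sketch under assumptions: `combine_with_source` and its `reader` parameter are hypothetical names, the filenames are assumed to follow the `MM-YYYY` pattern shown in the question, and for the Excel files `reader` would be something like `lambda fp: pd.read_excel(fp, usecols=[0, 15], header=None)`.

```python
import os

import pandas as pd


def combine_with_source(paths, reader):
    """Read each file with `reader` and add 'source' and 'report_date' columns."""
    frames = []
    for fp in paths:
        # Filename without directory or extension, e.g. '02-2019'
        stem = os.path.splitext(os.path.basename(fp))[0]
        frames.append(
            reader(fp).assign(
                source=stem,
                report_date=pd.to_datetime(stem, format="%m-%Y"),
            )
        )
    return pd.concat(frames, ignore_index=True)
```

Because every row carries its own `source` value, `ignore_index=True` can stay in place here.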

Extracting and manipulating data from excel worksheet with python

Scenario: I am trying to come up with a python code that reads all the workbooks in a given folder, gets the data of each and puts it to a single data frame (each workbook becomes a dataframe, so I can manipulate them individually).
Issue1: With this code, even though I am using the proper path and file types, I keep getting the error:
File "<ipython-input-3-2a450c707fbe>", line 14, in <module>
f = open(file,'r')
FileNotFoundError: [Errno 2] No such file or directory: '(1)Copy of Preisanfrage_17112016.xlsx'
Issue2: The reason for me to create different data frames is that each workbook has an individual format (rows are my identifiers and columns are dates). My problem is that some of these workbooks have data on a sheet named "Closing", or "Opening", or the name is not specified. So I will try to configure each data frame individually and then join them afterwards.
Issue3: Considering the final output once the data frame data is already unified, my objective is to output them in a format like:
date 1 identifier 1 value
date 1 identifier 2 value
date 1 identifier 3 value
date 1 identifier 4 value
date 2 identifier 1 value
date 2 identifier 4 value
date 2 identifier 5 value
Obs1: For the output, not all dates have the same array of identifiers.
Question 1: Any ideas why the code is yielding this error? Is there a better way to extract data from excel?
Question 2: Is it possible to create a unique dataframe for each worksheet? Is this a good practice?
Question 3: Can I do this type of output using a loop? Is this a good practice?
Obs2: I don't know how relevant this is, but I am using Python 3.6 with Anaconda.
Code so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob, os
import datetime as dt
from datetime import datetime
import matplotlib as mpl
directory = os.path.join("C:\\","Users\\Dgms\\Desktop\\final 2")
for root,dirs,files in os.walk(directory):
    for file in files:
        print(file)
        f = open(file,'r')
        df1 = pd.read_excel(file)
I think you do not need the open() call, and the file name has to be joined with root to get a full path. I would store the dataframes in a list; you can then either use pd.concat(list_of_dfs) or combine them manually.
list_of_dfs = []
for root,dirs,files in os.walk(directory):
    for file in files:
        f = os.path.join(root, file)
        print(f)
        list_of_dfs.append(pd.read_excel(f))
or using glob:
import glob
list_of_dfs = []
for file in glob.iglob(os.path.join(directory, '*.xlsx')):
    print(file)
    list_of_dfs.append(pd.read_excel(file))
Or, as jackie suggests, you can read specific sheets: list_of_dfs.append(pd.concat([pd.read_excel(file, sheet_name='Opening'), pd.read_excel(file, sheet_name='Closing')])). If only one of the two sheets is available in a given workbook, you could even change it to
try:
    list_of_dfs.append(pd.read_excel(file, sheet_name='Opening'))
except Exception:  # ideally catch only the missing-sheet error
    pass
try:
    list_of_dfs.append(pd.read_excel(file, sheet_name='Closing'))
except Exception:  # ideally catch only the missing-sheet error
    pass
(Of course, you should specify the exact error, but can't test that atm)
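For Issue 3, the long date / identifier / value layout can be produced without an explicit loop by reshaping with `melt`. This is a sketch, assuming each workbook has already been read into a frame whose index holds the identifiers and whose columns are the dates. The function name `to_long` is invented for illustration.

```python
import pandas as pd


def to_long(wide):
    """Turn an identifiers-by-dates table into date / identifier / value rows."""
    long = (
        wide.rename_axis("identifier")  # name the index so it becomes a column
        .reset_index()
        .melt(id_vars="identifier", var_name="date", value_name="value")
        .dropna(subset=["value"])  # drop identifier/date pairs with no data
    )
    long = long.sort_values(["date", "identifier"])
    return long[["date", "identifier", "value"]].reset_index(drop=True)
```

Dropping the missing values is what lets different dates carry different sets of identifiers, as described in Obs1. Frames reshaped this way share the same three columns, so they can be combined with `pd.concat` afterwards.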
Issue 1: If you are using an IDE or Jupyter, pass the absolute path to the file.
Alternatively, add the project folder to the system path (a workaround, not recommended).
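To make the cause of Issue 1 concrete: `os.walk` yields bare filenames in `files`, so `open(file)` looks for them in the current working directory rather than in `root`. A small sketch that collects absolute paths instead follows. The helper name `find_workbooks` and its `extensions` parameter are hypothetical.

```python
import os


def find_workbooks(directory, extensions=(".xlsx", ".xls")):
    """Return the absolute path of every Excel workbook under directory."""
    paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.lower().endswith(extensions):
                # Join with root: name alone is relative to the folder being walked.
                paths.append(os.path.abspath(os.path.join(root, name)))
    return sorted(paths)
```

Each returned path can be passed straight to `pd.read_excel`, regardless of the directory the script or notebook was started from.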
