python: concatenating multiple pandas DataFrames within a for loop

I am working on a Python program that reads specific .SDF files from a given directory in a loop and then stores some information about each file in a pandas DataFrame. There is a specific function which accepts an .SDF file and returns a data frame containing all the required information about it. In the code below I've tried to apply this function (which works correctly!) to many .SDF files and then append all rows to a new data frame (which should contain the same number of rows as the number of processed files). How should this concatenation of separate DataFrames be done correctly within the for loop?
# imports assumed by this snippet (the question's code doesn't show them)
import os
import glob
import pandas as pd
from rdkit import Chem
from rdkit.Chem import PandasTools, Descriptors, rdMolDescriptors

def load_sdf_file(file, key):
    """
    Reads molecules from an SDF file and stores some of their properties in a data frame.
    """
    df = PandasTools.LoadSDF(file)
    df['Source'] = key
    df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
    df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
    df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
    df['LipinskyHBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
    df = df[['Source', 'LogP', 'MolWt', 'LipinskyHBA', 'LipinskyHBD']]
    return df
pwd = os.getcwd()
filles = 'sdf'
results = 'results'
# set directory to analyse
data = os.path.join(pwd, filles)
os.chdir(data)
dirlist = [os.path.basename(p) for p in glob.glob(data + '/*.sdf')]
# create a new data frame with the same columns as the df defined in the function
all = pd.DataFrame(columns=['Source', 'LogP', 'MolWt', 'LipinskyHBA', 'LipinskyHBD'])
for sdf in dirlist:
    try:
        sdf_name = sdf.rsplit(".", 1)[0]
        key = f'{sdf_name}'
        df = load_sdf_file(sdf, key)
        print(f'{sdf_name}.sdf has been processed')
        # this does not work!
        all.append(df)
    except:
        print(f'{sdf_name}.sdf has not been processed')

Try pandas.concat() and store the dataframes in a list:
import pandas as pd

list_of_df = []
for _ in range(10):
    list_of_df.append(pd.DataFrame({'col_a': [1, 1, 1], 'col_b': [2, 2, 2]}))
df = pd.concat(list_of_df)
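Applied to the question's loop, the same pattern looks like this: collect each file's frame in a list and concatenate once after the loop. (A minimal sketch reusing dirlist and load_sdf_file from above; note that the original all.append(df) silently did nothing useful because DataFrame.append returned a new frame instead of modifying all in place, and all also shadows the built-in of the same name.)

frames = []
for sdf in dirlist:
    try:
        sdf_name = sdf.rsplit(".", 1)[0]
        frames.append(load_sdf_file(sdf, sdf_name))
        print(f'{sdf_name}.sdf has been processed')
    except Exception:
        print(f'{sdf_name}.sdf has not been processed')

# one concatenation at the end: one block of rows per processed file
all_df = pd.concat(frames, ignore_index=True)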

Related

Iterate through directory and return DataFrame with number of lines per file

I have a directory containing several excel files. I want to create a DataFrame with a list of the filenames, a count of the number of rows in each file, and a min and max column.
(Example file 1, example file 2, and the desired result were shown in the original post but are not reproduced here.)
This is as far as I've gotten:
fileslist = os.listdir(folder)
for file in fileslist:
    str = file
    if not str.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file), header=0,
                           sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
From here, I can't figure out how to create the final DataFrame as shown in the above "Desired Result." I'm very new at this and would appreciate any nudge in the right direction.
You're using str wrong. It is a built-in type in Python, and you don't need it here at all; you can just write file.startswith directly. Now, to store the data, append to a list at each iteration. You can use dictionaries to create the data:
import os  # assumed import; needed for os.listdir

import pandas as pd

fileslist = os.listdir(folder)
data = []  # store the intermediate data in the loop
for file in fileslist:
    # no need to assign file to str
    if not file.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file), header=0,
                           sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
        data.append(
            {  # the dict keys will become pandas column names
                'Filename': file,  # you probably want to remove the extension here
                'Count': NameCount,
                'MinNumber': NumMin,
                'MaxNumber': NumMax
            })
df = pd.DataFrame(data)
From here, you just need to write the data frame to your excel file.
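For example (a minimal sketch; the output file name is a placeholder):

# hypothetical output path; pandas uses openpyxl as the .xlsx writer engine
df.to_excel(os.path.join(folder, 'summary.xlsx'), sheet_name='Summary', index=False)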
First of all, I would just like to point out that you shouldn't name any variable str, as you did here:
str = file
This can cause issues later if you ever try to convert an object to a string using str(object), because you are shadowing the built-in. The reassignment of file is also unnecessary, so you can just take it out. Something similar applies to file: it was a built-in name in Python 2, so a name like file_name is clearer.
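A quick illustration of the hazard (a minimal demo):

str = "some file name"  # shadows the built-in str type
str(42)                 # raises TypeError: 'str' object is not callable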
As for how to create the final dataframe, it is somewhat simple. I would recommend you use a list and dictionaries and add all the data to that, then create the dataframe. Like this:
fileslist = os.listdir(folder)
# temporary list to store data
data = []
for file_name in fileslist:
    if not file_name.startswith('~$'):
        df = pd.read_excel(os.path.join(folder, file_name), header=0,
                           sheet_name='Main', usecols=['Name', 'Number'])
        NumMax = max(df['Number'])
        NumMin = min(df['Number'])
        NameCount = df['Name'].count()
        # appending a row of data with appropriate column names
        data.append({'Filename': file_name, 'Count': NameCount,
                     'MinNumber': NumMin, 'MaxNumber': NumMax})
# creating the actual dataframe
df = pd.DataFrame(data)

Trying to take multiple excel spreadsheets, extract specific data, add them all to one dataframe and save it as a csv file

Very new to this, so please go easy on me :)
Trying to take multiple excel spreadsheets, extract data from specific cells, add them all to one dataframe and save it as a csv file.
The csv output only contains the data from the last excel file. Please could you help?
import pandas as pd
import os
from pathlib import Path

ip = "//NETWORKLOCATION/In"
op = "//NETWORKLOCATION/Out"

file_exist = False
dir_list = os.listdir(ip)
print(dir_list)

for xlfile in dir_list:
    if xlfile.endswith('.xlsx') or xlfile.endswith('.xls'):
        file_exist = True
        str_file = os.path.join(ip, xlfile)
        df1 = pd.read_excel(str_file)
        columns1 = {*VARIOUSDATA -*
        }
        # creates an empty dataframe for the data to all sequentially be added into
        df1a = pd.DataFrame([])
        # appends the array to the new dataframe df1a
        df1a = df1a.append(pd.DataFrame(columns1, columns=['*VARIOUS COLUMNS*']))
        if not file_exist:
            print('cannot find any valid excel file in the folder ' + ip)
            print(str_file)

df1a.to_csv('//NETWORKLOCATION/Out/Test.csv')
print(df1a)
print(df1a)
I think you should put:

# creates an empty dataframe for the data to all sequentially be added into
df1a = pd.DataFrame([])

before the for xlfile in dir_list: loop, not inside it. Otherwise df1a is recreated empty on each file iteration.
A couple of things. First, you'll never reach:

if not file_exist:
    print('cannot find any valid excel file in the folder ' + ip)
    print(str_file)

as written, because it is nested inside the if branch that has just set file_exist to True.
Second, you're creating df1a inside your for loop, so you keep resetting it back to empty.
Finally, why import Path and then use os.path and os.listdir? Why not just use Path(ip).glob('*.xls*')?
This would look like:
import pandas as pd
from pathlib import Path

ip = "//NETWORKLOCATION/In"
op = "//NETWORKLOCATION/Out"

# creates an empty dataframe for the data to all sequentially be added into
df1a = pd.DataFrame([])

for xlfile in Path(ip).glob('*.xls*'):
    df1 = pd.read_excel(xlfile)
    columns1 = {"VARIOUSDATA"}
    # appends the array to the new dataframe df1a
    df1a = df1a.append(pd.DataFrame(columns1, columns=['VARIOUS_COLUMNS']))

if df1a.empty:
    print('cannot find any valid excel file in the folder ' + ip)
else:
    df1a.to_csv(op + '/Test.csv')
    print(df1a)
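Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on a current pandas the same idea is usually written by collecting frames in a list and concatenating once. A sketch, with each file's frame appended whole since the original per-cell extraction is elided:

import pandas as pd
from pathlib import Path

ip = "//NETWORKLOCATION/In"
op = "//NETWORKLOCATION/Out"

frames = []
for xlfile in Path(ip).glob('*.xls*'):
    df1 = pd.read_excel(xlfile)
    # build the per-file frame from whatever cells you need
    frames.append(df1)

if not frames:
    print('cannot find any valid excel file in the folder ' + ip)
else:
    pd.concat(frames, ignore_index=True).to_csv(op + '/Test.csv')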
The csv output only contains the data from the last excel file.
You create the df1a DataFrame inside the for loop. Each time you read a new xlfile you create a new empty DataFrame.
You have to put df1a = pd.DataFrame([]) on the 9th line of your script before the loop.
Something like this should work for you.
import glob
import pandas as pd

all_data = pd.DataFrame()
for f in glob.glob("C:\\your_path\\*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)

print(type(all_data))
Check out this link.
https://pbpython.com/excel-file-combine.html

How To Print File Names Conditionally Based On Multiple Imported csv Files

I was wondering if there is a way to print out file names conditionally, based on multiple imported csv files. My procedure is:
1. Set my path.
2. Grab all the csv files in this path.
3. Import all these csv files, grabbing only the numbers in each file name and storing them in 'new_column'.
4. Check the number of columns of each file and exclude the files that do not have 10 columns (achieved using shape[1]).
5. Print out the actual file names that don't have 10 columns -> I am stuck here.
I have no problems up to number 4. However, I am stuck on 5. How do I achieve it?
# setting my path
path = r'my\path'

# grab all csv files in my path
all_files = glob.glob(path + "/*.csv")

# grab the numeric part of each file name
def get_numbers_from_filename(filename):
    return re.search(r'\d+', filename).group(0)
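# e.g. get_numbers_from_filename("report_123.csv") returns "123"
# (only the first run of digits in the name is captured)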
# import all the csv files and add a 'new_column' based on get_numbers_from_filename
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    df['new_column'] = get_numbers_from_filename(filename)
    li.append(df)

# check the frequency of column counts across files using a frequency table
result = []
for lis in li:
    result.append(lis.shape[1])

# make this a dataframe
result = pd.DataFrame(result, columns=['shape'])

# actual checking step
result['shape'].value_counts()

# grab only shape == 10 files to correctly concatenate
result = []
for lis in li:
    if lis.shape[1] == 10:
        result.append(lis)

## my solution for part 5:
# print and save all the paths of my directory
path = os.listdir(path)

# grab file names if column counts are not 10
result3 = []
for paths in path:
    for list in li:
        if lis.shape[1] != 10:
            result3.append(paths)
My solution gives an empty list [].
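A likely reason: the inner loop tests lis, which is just the leftover variable from the earlier for lis in li: loop, instead of the current loop variable, so the condition never changes per file; filenames and dataframes are also never paired up. Since li was built from all_files in order, one sketch for step 5 is to zip them together:

# file names whose csv does not have 10 columns
bad_files = [filename
             for filename, frame in zip(all_files, li)
             if frame.shape[1] != 10]
for name in bad_files:
    print(name)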

comparing data frames from multiple data frames to filter data and extract relevant features

I try to load all my data-set files in Python using pandas, but the results are not shown.
import os
import pandas as pd  # assumed import; the code below uses pd

print(os.listdir("C:/Users/Smile/.spyder-py3/datasets"))
# Any results you write to the current directory are saved as output.

data = ["name", "version", "tool_name", "wmc", "dit", "noc", "cbo", "rfc",
        "lcom", "ca", "ce", "npm", "lcom3", "loc", "dam", "moa", "mfa",
        "cam", "ic", "cbm", "amc", "max_cc", "avg_cc", "bug"]
data = pd.DataFrame()
for file in os.listdir():
    if file.endswith('.csv'):
        data = pd.read_csv(file)
        data.set_index('name', inplace=True)
        data = data.append(data, ignore_index=True)
print(data.head(5))
My output is given below:
Empty DataFrame
Columns: []
Index: []
You overwrite data each time you read a new CSV. Replace the data variable with a temp variable, like this:
data = pd.DataFrame()
for file in os.listdir():
    if file.endswith('.csv'):
        csv_data = pd.read_csv(file)
        csv_data.set_index('name', inplace=True)
        data = data.append(csv_data, ignore_index=True)
print(data.head(5))
By reading each new csv into data ('data = pd.read_csv(file)'), you overwrite the rows you already appended in the previous iterations. You need to keep the accumulated frame intact in order to keep appending to it, so each CSV must be read into a separate variable.
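On a current pandas (DataFrame.append was removed in 2.0), the same accumulation is usually a single pd.concat over the files. A sketch; note that ignore_index would discard the 'name' index, so it is set afterwards here:

import os
import pandas as pd

# read every csv in the current directory and stack the rows
data = pd.concat(
    (pd.read_csv(f) for f in os.listdir() if f.endswith('.csv')),
    ignore_index=True,
)
data.set_index('name', inplace=True)
print(data.head(5))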

Why is the cdc_list getting updated after calling the function read_csv() in total_list?

# Program to combine data from 2 csv files
The cdc_list gets updated after the second call of read_csv:
overall_list = []

def read_csv(filename):
    file_read = open(filename, "r").read()
    file_split = file_read.split("\n")
    string_list = file_split[1:len(file_split)]
    #final_list = []
    for item in string_list:
        int_fields = []
        string_fields = item.split(",")
        string_fields = [int(x) for x in string_fields]
        int_fields.append(string_fields)
        #final_list.append()
        overall_list.append(int_fields)
    return(overall_list)

cdc_list = read_csv("US_births_1994-2003_CDC_NCHS.csv")
print(len(cdc_list))    # 3652

total_list = read_csv("US_births_2000-2014_SSA.csv")
print(len(total_list))  # 9131
print(len(cdc_list))    # 9131
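For what it's worth, this behaviour follows from the posted code itself: read_csv appends to the module-level overall_list and returns that very list, so cdc_list and total_list end up as two names for one object. A minimal demonstration of the aliasing:

shared = []

def collect(items):
    shared.extend(items)  # mutates the module-level list
    return shared         # returns the same object, not a copy

a = collect([1, 2])
b = collect([3, 4])
print(a is b)   # True: both names point at the one shared list
print(len(a))   # 4, even though a was bound before the second call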
I don't think the code you pasted explains the issue you've had, at least not anywhere I can determine. It seems like there's a lot of code you did not include above that might be responsible.
However, if all you want to do is merge two CSVs (assuming they both have the same columns), you can use pandas' read_csv and the DataFrame methods append and to_csv to achieve this in 3 lines of code (not including imports):
import pandas as pd
# Read CSV file into a Pandas DataFrame object
df = pd.read_csv("first.csv")
# Read and append the 2nd CSV file to the same DataFrame object
df = df.append( pd.read_csv("second.csv") )
# Write merged DataFrame object (with both CSV's data) to file
df.to_csv("merged.csv")
