I am writing a function to pull data from an Excel file:
output = pd.DataFrame()
def get_help(sheet,head,file):
    tmp_df = pd.read_excel(file,sheet_name='Cover')
    tmp_df = tmp_df.dropna(how='all')
    tmp_df = tmp_df.dropna(axis=1,how='all')
    fnf = tmp_df.iloc[3,0]
    currency = tmp_df.iloc[5,0]
    q_date = tmp_df.iloc[7,0]
    tmp_df = pd.read_excel(file,sheet_name=sheet,header=head)
    tmp_df = tmp_df.loc[tmp_df['banana'] != '--- End of Sheet ---']
    tmp_df['banana'] = tmp_df['banana'].astype(str).str.replace('hi', '')
    tmp_df['FNF'] = fnf
    tmp_df['banana'] = tmp_df['FNF'] + ' ' + tmp_df['banana']
    global output
    output = pd.concat([tmp_df,output])
This code works perfectly until I define it as a function and call the function in a for loop.
I have tried changing the DataFrame names and using global to define a variable. I am very new to Python. Does this have something to do with scope and namespaces? Thank you for reading.
To clarify: I am expecting a DataFrame as the output, but the actual output is an empty DataFrame.
If I run the same code that I put inside the function, in the same order, I get the DataFrame I was expecting. However, when I use that same code inside the get_help() function, the DataFrame comes back empty.
I have a typical method that I use to pull data from an Excel file into a DataFrame:
import pandas as pd
import openpyxl as op
path = r'thisisafilepath\filename.xlsx'
book = op.load_workbook(filename=path, data_only=True)
tab = book['sheetname']
data = tab.values
columns = next(data)[0:]
df = pd.DataFrame(data, columns=columns)
I'm trying to define this method as a function to make the code simpler/more readable.
I have tried the following:
def openthis(path, sheet):
    book = op.load_workbook(filename=path, data_only=True)
    tab = book[sheet]
    data = tab.values
    columns = next(data)[0:]
    df = pd.DataFrame(data, columns=columns)
    return df
When I then call openthis() the output is a printed version of the DataFrame in my console, but no variable has actually been created for me to work with.
What am I missing? Also, is there a way to define what the DataFrame variable is called when it is produced?
You didn't show how you actually call the function, but I'm guessing that you didn't assign its return value to a variable.
Notice the return df at the end of your function.
This statement means that calling openthis() returns a value (the DataFrame). Unless you assign that value to a variable, it's gone forever.
Try this
df = openthis(some_arguments)
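A minimal sketch of the difference, assuming pandas is installed. To stay runnable without an Excel file, the hypothetical rows_to_frame() below stands in for openthis() and takes in-memory rows shaped like tab.values (header row first). The names rows_to_frame, sample_rows and sales_df are illustrative only.

```python
import pandas as pd


def rows_to_frame(rows):
    """Build a DataFrame from row tuples, with the header row first."""
    rows = list(rows)
    columns = list(rows[0])       # first row holds the column names
    return pd.DataFrame(rows[1:], columns=columns)


# Stand-in for tab.values from openpyxl (assumed shape: tuples, header first).
sample_rows = [("name", "qty"), ("apple", 3), ("pear", 5)]

rows_to_frame(sample_rows)             # return value is discarded
sales_df = rows_to_frame(sample_rows)  # return value is kept under a name you choose

print(sales_df.shape)
```

The name on the left of the = is whatever you want the DataFrame to be called; the function itself does not need to know it.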
I have a for loop that gets data from a website, and I would like to export it to an xlsx or csv file.
When I print the result inside the loop I get the whole list, but when I export it to an xlsx file I only get the last item. Where is the problem? Can you help?
for item1 in spec:
    spec2 = item1.find_all('th')
    expl2 = item1.find_all('td')
    spec2x = spec2[a].text
    expl2x = expl2[a].text
    yazim = spec2x + ': ' + expl2x
    cumle = yazim
    patern = r"(Brand|Series|Model|Operating System|CPU|Screen|MemoryStorage|Graphics Card|Video Memory|Dimensions|Screen Size|Touchscreen|Display Type|Resolution|GPU|Video Memory|Graphic Type|SSD|Bluetooth|USB)"
    if re.search(patern, cumle):
        speclist = translator.translate(cumle, lang_tgt='tr')
        specl = speclist
        #print(specl)
        import pandas as pd
        exp = [{ 'Prospec': specl,},]
        df = pd.DataFrame(exp, columns = ['Prospec',])
        df.to_excel('output1.xlsx',)
Create an empty list and, at each iteration in your for loop, append a data frame to the list. You will end up with a list of data frames. After the loop, use pd.concat() to create a new data frame by concatenating every element of your list. You can then save the resulting df to an excel file.
Your code would look something like this:
import pandas as pd
df_list = []
for item1 in spec:
    ......
    if re.search(patern, cumle):
        ....
        df_list.append(pd.DataFrame(.....))
df = pd.concat(df_list)
df.to_excel(.....)
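For reference, a self-contained sketch of the same pattern, assuming pandas is installed. The scraped and translated strings are replaced by a hypothetical list called lines, and the pattern is shortened; only the list-then-concat structure is the point.

```python
import re

import pandas as pd

# Hypothetical stand-ins for the scraped "spec: value" strings.
lines = ["Brand: Acme", "Weight: 2 kg", "CPU: 8 cores", "USB: 3 ports"]
pattern = r"(Brand|CPU|USB)"

df_list = []
for line in lines:
    if re.search(pattern, line):
        # One single-row frame per match; nothing is written to disk yet.
        df_list.append(pd.DataFrame([{"Prospec": line}]))

# Concatenate once, after the loop, then export once.
df = pd.concat(df_list, ignore_index=True)
print(df)
```

Building and exporting a one-row frame on each pass (or exporting only the last value after the loop) leaves just the final item in the file; exporting once after pd.concat() avoids that.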
I am working on a Python program that reads specific .SDF files from a given directory in a loop and then stores some information about each file in a pandas DataFrame. There is a function that accepts an .SDF file and returns a DataFrame containing one line with all the required information about it. In the code below I've tried to apply this function (which works correctly!) to many .SDF files and then append all the lines to a new DataFrame (which should contain as many lines as there are processed files). How should this concatenation of separate DataFrames be done correctly within a for loop?
def load_sdf_file(file, key):
    """
    Reads molecules from an SDF file and store some of its properties as data file
    """
    df = PandasTools.LoadSDF(file)
    df['Source'] = key
    df['LogP'] = df['ROMol'].apply(Chem.Descriptors.MolLogP)
    df['MolWt'] = df['ROMol'].apply(Chem.Descriptors.MolWt)
    df['LipinskyHBA'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBA)
    df['LipinskyHBD'] = df['ROMol'].apply(Chem.rdMolDescriptors.CalcNumLipinskiHBD)
    df = df[['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD']]
    return df

pwd = os.getcwd()
filles='sdf'
results='results'
#set directory to analyse
data = os.path.join(pwd,filles)
os.chdir(data)
dirlist = [os.path.basename(p) for p in glob.glob(data + '/*.sdf')]
# create a new data file with the same columns as it was in df defined in the function
all = pd.DataFrame(columns=['Source','LogP','MolWt','LipinskyHBA','LipinskyHBD'])
for sdf in dirlist:
    try:
        sdf_name=sdf.rsplit( ".", 1 )[ 0 ]
        key = f'{sdf_name}'
        df = load_sdf_file(sdf,key)
        print(f'{sdf_name}.sdf has been processed')
        # this does not work!
        all.append(df)
    except:
        print(f'{sdf_name}.sdf has not been processed')
Try pandas.concat() and store the dataframes in a list:
import pandas as pd
list_of_df = []
for _ in range(10):
    list_of_df.append(pd.DataFrame({'col_a':[1,1,1], 'col_b':[2,2,2]}))
df = pd.concat(list_of_df)
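Applied to the loop in the question, the change might look like the sketch below. It assumes pandas is installed; load_one() is a hypothetical stand-in for load_sdf_file() so the example runs without RDKit, and the file names are made up. Note that DataFrame.append never modified the frame in place (it returned a new one) and was removed in pandas 2.0, so all.append(df) on a line by itself could not have accumulated anything.

```python
import pandas as pd


def load_one(name):
    """Hypothetical loader: returns a one-row frame, or raises for bad input."""
    if name.startswith("bad"):
        raise ValueError(f"cannot parse {name}")
    return pd.DataFrame({"Source": [name], "MolWt": [100.0]})


frames = []
for name in ["a", "bad_b", "c"]:
    try:
        frames.append(load_one(name))
        print(f"{name}.sdf has been processed")
    except ValueError:
        # Skip inputs that fail instead of aborting the whole run.
        print(f"{name}.sdf has not been processed")

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Catching a specific exception type rather than using a bare except keeps unrelated errors (a typo in a variable name, for instance) from being silently reported as a file that could not be processed.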
I have a bunch of excel files which I am merging into a csv file. Once the files are merged, I need to add a few more columns in the beginning of the csv file (I am planning to populate those columns using parameters e.g. GLB_DM_VER to populate Global_DM_Version column).
The following script gives me an error:
AttributeError: 'NoneType' object has no attribute 'to_csv'
I am new to Python and would really appreciate any help on this issue. Thanks.
import glob
import pandas as pd
path= input("Enter the location of files ")
GLB_DM_VER = input("Enter global DM version")
file_list = glob.glob(path+"\*.xls")
excels = [pd.ExcelFile(name) for name in file_list]
frames = [x.parse(x.sheet_names[2], header=0,index_col=None) for x in excels]
combined = pd.concat(frames)
combined = combined.insert(loc=1, column = 'Global_DM_Version', value = GLB_DM_VER )
combined.to_csv("STAND_2.csv", header=['TARGET_DOMAIN','SOURCE_DOMAIN','DOMAIN_LABEL','SOURCE_VARIABLE','RAVE_LABEL','TYPE','VARIABLE_LENGTH','CONTROL_TYPE','CODELIST_OID','TARGET_VARIABLE','MANDATORY','RAVE_ORIGIN'], index=False)
It's the following line:
combined = combined.insert(loc=1, column = 'Global_DM_Version', value = GLB_DM_VER )
Just write this instead:
combined.insert(loc=1, column = 'Global_DM_Version', value = GLB_DM_VER)
Explanation:
pd.DataFrame.insert does not return a DataFrame; it modifies the DataFrame in place. If a function does not explicitly return anything in Python, it returns None, so combined ends up being None and the later call to combined.to_csv raises the error that you see.
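A quick way to see this, assuming pandas is installed (the column names A and B and the value "v1" are made up for illustration):

```python
import pandas as pd

combined = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# insert() changes `combined` itself and returns None.
result = combined.insert(loc=1, column="Global_DM_Version", value="v1")

print(result)                  # None
print(list(combined.columns))  # ['A', 'Global_DM_Version', 'B']
```

If the header= list passed to to_csv was written for the original columns, it will likely need an entry for the new column as well, since the number of names has to match the number of columns being written.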
I'm using pandas to handle some CSV files, but I'm having trouble storing a result in a variable and printing it out as a plain value.
This is the code that I have.
df = pd.read_csv('MY_FILE.csv', index_col=False, header=0)
df2 = df[(df['Name'])]
# Trying to get the result of Name to the variable
n = df2['Name']
print(n)
And the result that I get:
1 jake
Name: Name, dtype: object
My question:
Is it possible to have just "Jake" stored in a variable n so that I can use it whenever I need it?
E.g. print(n)
Result: Jake
This is the code that I have constructed
def name_search():
    list_to_open = input("Which list to open: ") + ".csv"
    # Raw string so that \U in the path is not read as a unicode escape
    directory = r"C:\Users\Jake Wong\PycharmProjects\box" "\\" + list_to_open
    if os.path.isfile(directory):
        # Search for NAME
        Name_id = input("Name to search for: ")
        df = pd.read_csv(directory, index_col=False, header=0)
        df2 = df[(df['Name'] == Name_id)]
        # Defining the name to save the file as
        n = df2['Name'].ix[1]
        print(n)
This is what is in the csv file
S/N,Name,Points,test1,test2,test3
s49,sing chun,5000,sc,90 sunrsie,4984365132
s49,Alice Suh,5000,jake,88 sunrsie,15641816
s1231,Alice Suhfds,5000,sw,54290 sunrsie,1561986153
s49,Jake Wong,5000,jake,88 sunrsie,15641816
The problem is that n = df2['Name'] is actually a Pandas Series:
type(df.loc[df.Name == 'Jake Wong'].Name)
pandas.core.series.Series
If you just want the value, you can use values[0] -- values is the underlying array behind the Pandas object, and in this case it's length 1, and you're just taking the first element.
n = df2['Name'].values[0]
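A small sketch of pulling the single value out, assuming pandas is installed and using two rows in the shape of the question's CSV, read from an in-memory string rather than a file. Positional access with .iloc[0] is shown alongside .values[0]; the .ix indexer used in the question was removed in pandas 1.0.

```python
import io

import pandas as pd

csv_text = """S/N,Name,Points
s49,Alice Suh,5000
s49,Jake Wong,5000
"""

df = pd.read_csv(io.StringIO(csv_text))
df2 = df[df["Name"] == "Jake Wong"]   # still a DataFrame (one row)

n = df2["Name"].values[0]             # first element of the underlying array
m = df2["Name"].iloc[0]               # same value, selected by position

print(n)
```

Both forms raise an IndexError when the filter matches no rows, so it may be worth checking df2.empty first if the name comes from user input.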
Also your CSV is not formatted properly: It's not enough to have things lined up in columns like that, you need to have a consistent delimiter (a comma or a tab usually) between columns, so the parser can know when one column ends and another one starts. Can you fix your csv to look like this?:
S/n,Name,points
s56,Alice Suh,5000
s49,Jake Wong,5000
Otherwise we can work on another solution for you but we will probably use regex rather than pandas.