I have a dataframe containing a path to an excel file, a sheet name and an id, each in one column:
import pandas as pd

df = pd.DataFrame([['00100', 'one.xlsx', 'sheet1'],
                   ['00100', 'two.xlsx', 'sheet2'],
                   ['05300', 'thr.xlsx', 'sheet3'],
                   ['95687', 'fou.xlsx', 'sheet4'],
                   ['05300', 'fiv.xlsx', 'sheet5']],
                  columns=['id', 'file', 'sheet'])
This dataframe looks like:
      id      file   sheet
0  00100  one.xlsx  sheet1
1  00100  two.xlsx  sheet2
2  05300  thr.xlsx  sheet3
3  95687  fou.xlsx  sheet4
4  05300  fiv.xlsx  sheet5
I made a function to use with apply, which will read each file and return a dataframe.
def getdata(row):
    file = row['file']
    sheet = row['sheet']
    id = row['id']
    tempdf = pd.ExcelFile(file)    # Used on purpose
    tempdf = tempdf.parse(sheet)   # Used on purpose
    tempdf['ID'] = id
    return tempdf
I then use apply over the initial dataframe so it returns a dataframe for each row. The problem is, I don't know how to store the dataframes created this way.
I tried to store the dataframes in a new column, but the column stores None:
df['data'] = df.apply(getdata, axis=1)
I tried to create a dictionary but the ways that came to my mind were plain useless:
results = {df.apply(getdata, axis=1)} # for this one, in the function I tried to return id, tempdf
In the end, I ended up converting the 'id' column to an index and iterating over it in the following way:
for id in df.index:
    df[id] = getdata(df.loc[id], id)
But I want to know if there was a way to store the resulting dataframes without using an iterator.
Thanks for your feedback.
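For what it's worth, one pattern that keeps every resulting DataFrame accessible is a dict keyed by row (a comprehension still iterates under the hood, but the bookkeeping stays in one expression). A minimal, self-contained sketch; the getdata below is a stand-in that builds a frame in memory instead of reading a real Excel file:

```python
import pandas as pd

df = pd.DataFrame([['00100', 'one.xlsx', 'sheet1'],
                   ['05300', 'thr.xlsx', 'sheet3']],
                  columns=['id', 'file', 'sheet'])

# Stand-in for the real getdata: builds a small DataFrame per row.
# (The real version would read the Excel file instead.)
def getdata(row):
    return pd.DataFrame({'ID': [row['id']], 'value': [len(row['file'])]})

# One DataFrame per row, keyed by row position.
results = {i: getdata(row) for i, row in df.iterrows()}

# Or stack everything into one long DataFrame:
combined = pd.concat(results.values(), ignore_index=True)
```

Each value in results is a full DataFrame, so nothing collapses to None the way a column assignment can.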
Related
I have a list of values; if they appear in the column 'Books', I would like that row to be returned.
I think I have achieved this with the below code:
def return_Row():
    file = 'TheFile.xls'
    df = pd.read_excel(file)
    listOfValues = ['A', 'B', 'C']
    return df.loc[df['Column'].isin(listOfValues)]
This currently only seems to work on the first worksheet. Since there are multiple worksheets in 'TheFile.xls', how would I go about looping through them to return any rows where listOfValues is found in the 'Books' column of all the other sheets?
Any help would be greatly appreciated.
Thank you
The thing is, pd.read_excel() returns a dataframe for the first sheet only when you don't specify the sheet_name argument. If you want to get all the sheets in the Excel file without specifying their names, you can pass None to sheet_name as follows:
df = pd.read_excel(file, sheet_name=None)
This gives you a dict with a separate dataframe for each sheet, which you can loop over and process however you want. For example, you can append the results you need to a list and return the list:
def return_Row():
    file = 'TheFile.xls'
    results = []
    dfs = pd.read_excel(file, sheet_name=None)
    listOfValues = ['A', 'B', 'C']
    for df in dfs.values():
        results.append(df.loc[df['Column'].isin(listOfValues)])
    return results
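If a single DataFrame is more convenient than a list, pd.concat can flatten the per-sheet matches. A small self-contained sketch (in-memory frames stand in for the dict that pd.read_excel(file, sheet_name=None) would return, and the 'Column'/'val' names are placeholders):

```python
import pandas as pd

# Stand-in for pd.read_excel(file, sheet_name=None), which returns
# a dict mapping sheet name -> DataFrame.
dfs = {
    'Sheet1': pd.DataFrame({'Column': ['A', 'X'], 'val': [1, 2]}),
    'Sheet2': pd.DataFrame({'Column': ['B', 'Y'], 'val': [3, 4]}),
}

listOfValues = ['A', 'B', 'C']

# Filter each sheet, then stack all matches into one DataFrame.
results = [df.loc[df['Column'].isin(listOfValues)] for df in dfs.values()]
matches = pd.concat(results, ignore_index=True)
```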
I am following Freddy's example in appending my csv file with unique values. Here is the code I am using:
header = ['user.username', 'user.id']
user_filename = f"{something}_users.csv"
if os.path.isfile(user_filename):  # file exists
    # Read in old data
    oldFrame = pd.read_csv(user_filename, header=0)
    # Concat and drop dups
    df_diff = pd.concat([oldFrame, df[['user.username', 'user.id']]], ignore_index=True).drop_duplicates()
    # Write new rows to csv file
    df_diff.to_csv(user_filename, header=False, index=False)
else:  # file doesn't exist yet, create it
    df.to_csv(user_filename, columns=header, header=['username', 'user_id'], index=False, mode='a')
Running this code for the first time returns the desired result: A csv file with two named columns (username and user_id) and the respective values. If I run it a second time, something odd happens: I still keep the old values and also the new values. But the new values appear below the old ones in two new (unnamed) columns like so:
username  user_id
user1     123
user2     456
                    user3    789
                    user4    124
The output I'm looking for is this:
username  user_id
user1     123
user2     456
user3     789
user4     124
The main issue with the code is one of naming conventions. Try this piece of code:
header = ['user.username', 'user.user_id']
user_filename = "users.csv"
if os.path.isfile(user_filename):  # file exists
    # Read in old data
    oldFrame = pd.read_csv(user_filename, header=0)
    # Concat and drop dups
    concat = pd.concat([oldFrame, df[['user.username', 'user.user_id']]], ignore_index=True)
    df_diff = concat.drop_duplicates()
    # Write new rows to csv file
    df_diff.to_csv(user_filename, header=['user.username', 'user.user_id'], index=False)
else:  # file doesn't exist yet, create it
    df.to_csv(user_filename, columns=header, header=['user.username', 'user.user_id'], index=False, mode='a')
What this code does differently is that the header names you read from the file are the same as the header names of the data you concatenate with them. You can use an interim dictionary to achieve this if you don't want to change your column names.
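That interim dictionary can be a simple rename mapping applied just before the concat, so the on-disk names and the in-memory names line up. A sketch with made-up frames (oldFrame stands in for the CSV contents, df for the scraped data):

```python
import pandas as pd

# oldFrame stands in for pd.read_csv(user_filename): it already uses
# the short header names written to disk.
oldFrame = pd.DataFrame({'username': ['user1'], 'user_id': [123]})

# df stands in for the fresh data with the dotted column names.
df = pd.DataFrame({'user.username': ['user1', 'user2'],
                   'user.id': [123, 456]})

# Interim mapping: align df's columns with the on-disk names.
rename_map = {'user.username': 'username', 'user.id': 'user_id'}

df_diff = (pd.concat([oldFrame, df.rename(columns=rename_map)],
                     ignore_index=True)
           .drop_duplicates())
```

With matching column names, concat stacks the rows instead of widening the frame, and drop_duplicates removes the overlap.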
The problem is caused by concatenating two dataframes with different column names. The imported dataframe already has the new column names ('username' and 'user_id'), the dataframe df still uses 'user.username' and 'user.id'.
To avoid the error, I changed this line
df_diff = pd.concat([oldFrame, df[['user.username', 'user.id']]],ignore_index=True).drop_duplicates()
to
df_diff = pd.concat(
    [oldFrame,
     df[['user.username', 'user.id']].rename(columns={"user.username": "username",
                                                      "user.id": "user_id"})],
    ignore_index=True
).drop_duplicates()
I am trying to write code that looks into a certain directory and selects files with the .xlsx extension. It appends the sheets of each file together and creates a single Excel sheet, then drops a row if FAMILYNAME and FIRSTNAME are null. It creates a CODE column for every file it works on (CODE = 1 for the first booklet it finishes, 2 for the second, and so on), and then a column called ID, which numbers the rows from 1 to the total number of rows in the data frame. I am facing a challenge on this one. Can anyone help? Thanks!
os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\example")

def read_excel_sheets(xls_path):
    """Read all sheets of an Excel workbook and return a single DataFrame"""
    print(f'Loading {xls_path} into pandas')
    xl = pd.ExcelFile(xls_path)
    df = pd.DataFrame()
    for idx, name in enumerate(xl.sheet_names):
        print(f'Reading sheet #{idx}: {name}')
        sheet = xl.parse(name, skiprows=1)
        # NOTE: saving the column names from the first sheet and reassigning
        # them to the later sheets is only needed if the worksheets have the
        # same number of columns in the same order.
        # Keep each sheet's existing index when appending
        df = pd.concat([df, sheet], ignore_index=False)
    return df
n = 0
for files in os.listdir():
    if files.endswith(".xlsx"):
        # Read every sheet of the workbook and stack them into one frame
        kim = pd.concat(pd.read_excel(files, sheet_name=None, header=0), ignore_index=True)
        kim = kim[kim['FAMILYNAME'].notna() & kim['FIRSTNAME'].notna()]
        kim = kim[kim.FAMILYNAME != 'FAMILYNAME']  # drop repeated header rows
        row, col = kim.shape
        print(kim.shape)
        # Automatically import the file names and append them
        stops = read_excel_sheets(files)
        n = n + 1
        stops['CODE'] = n
        ID = list(range(1, row + 1))
        stops['ID'] = pd.Series(ID)
        stops['GENDER'] = np.where(stops.GENDER == 'M', 'MALE', stops.GENDER)
        stops['GENDER'] = np.where(stops.GENDER == 'F', 'FEMALE', stops.GENDER)
        stops = stops[stops['FAMILYNAME'].notna() & stops['FIRSTNAME'].notna()]
        os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\TRIAL")
        stops.to_excel(files, index=False)
        os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\example")
This is what the code is doing after execution. I would love something like this, but I want the ID to increment automatically across all rows in the sheet.
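The per-file CODE and per-file running ID described above can be sketched without touching the filesystem. In this hedged sketch, the workbooks list stands in for the .xlsx files already read with pd.read_excel(path, sheet_name=None), and the column names mirror the question's:

```python
import pandas as pd

# Stand-ins for two .xlsx files, each a dict of sheet DataFrames
# as returned by pd.read_excel(path, sheet_name=None).
workbooks = [
    {'s1': pd.DataFrame({'FAMILYNAME': ['A', None], 'FIRSTNAME': ['B', 'C']})},
    {'s1': pd.DataFrame({'FAMILYNAME': ['D'], 'FIRSTNAME': ['E']})},
]

frames = []
for code, sheets in enumerate(workbooks, start=1):
    stops = pd.concat(sheets.values(), ignore_index=True)
    # Drop rows where either name is missing.
    stops = stops[stops['FAMILYNAME'].notna() & stops['FIRSTNAME'].notna()]
    stops = stops.reset_index(drop=True)
    stops['CODE'] = code                    # 1 for the first file, 2 for the next...
    stops['ID'] = range(1, len(stops) + 1)  # 1..n within each file
    frames.append(stops)
```

Computing ID after the null-rows are dropped (and after reset_index) is what keeps it contiguous from 1 to the number of surviving rows.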
I would like to convert an Excel file to a pandas dataframe. All the sheet names have spaces in them, for instance ' part 1 of 22', ' part 2 of 22', and so on. In addition, the first column is the same for all the sheets.
I would like to convert this Excel file into a single dataframe. However, I don't know what happens to the names in Python: the sheets are imported, but I do not know the names of the resulting data frames. After this I would like to use another 'for' loop and pd.merge() to create a single dataframe.
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    print(sheet_name.info())
Using only the code snippet you have shown, each sheet (each DataFrame) will be assigned to the variable sheet_name. Thus, this variable is overwritten on each iteration and you will only have the last sheet as a DataFrame assigned to that variable.
To achieve what you want to do you have to store each sheet, loaded as a DataFrame, somewhere, a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    all_my_sheets.append(sheet_name)
Or, even better, using list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)
You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filename=file_path, read_only=True)

# Assuming your sheets have the same headers and footers
records = []
min_row = 1
for ws in wb.worksheets:
    for row in ws.iter_rows(min_col=1,
                            min_row=min_row,
                            max_col=ws.max_column,
                            values_only=True):
        records.append(list(row))
    # Make sure you don't duplicate the header on the following sheets
    min_row = 2

# Set the column names from the first row
header = records.pop(0)

# Create your df
df = pd.DataFrame(records, columns=header)
It may be easiest to call read_excel() once and save the contents into a dict of DataFrames.
So, the first step would look like this (with file_path standing in for your workbook's path):
dfs = pd.read_excel(file_path, sheet_name=["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list should be the same as those in the Excel file; the result is a dict mapping each sheet name to its DataFrame. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(dfs.values(), axis=0)
Note that this solution would result in a final_df that includes column headers from all three sheets, so ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
I hope this helps!
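Since the question mentions that the first column is shared across all sheets, one possible merge step is a pairwise pd.merge folded over the sheet dict. This is a sketch under that assumption; the in-memory dfs dict stands in for the result of pd.read_excel(..., sheet_name=None), and 'key' is a placeholder for the shared first column:

```python
import pandas as pd
from functools import reduce

# Stand-in for pd.read_excel(file, sheet_name=None): sheets that
# share their first column, here called 'key'.
dfs = {
    ' part 1 of 22': pd.DataFrame({'key': [1, 2], 'a': [10, 20]}),
    ' part 2 of 22': pd.DataFrame({'key': [1, 2], 'b': [30, 40]}),
}

# Merge all sheets pairwise on the shared first column.
merged = reduce(lambda left, right: pd.merge(left, right, on='key'),
                dfs.values())
```

Each merge joins one more sheet's columns onto the accumulated frame, so the result is one wide DataFrame keyed by the shared column.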
I have an Excel workbook with multiple sheets containing some sales data. I am trying to sort them so that each customer has a separate sheet (in a different workbook) with the item details. I have created a dictionary with all customer names.
for name in cust_dict.keys():
    cust_dict[name] = pd.DataFrame(columns=cols)

for sheet in sheets:
    ws = sales_wb.sheet_by_name(sheet)
    code = ws.cell(4, 0).value  # This is the item code
    df = pd.read_excel(sales_wb, engine='xlrd', sheet_name=sheet, skiprows=7)
    df = df.fillna(0)
    count = 0
    for index, row in df.iterrows():
        print('rotation ' + str(count))
        count += 1
        if row['Particulars'] != 0 and row['Particulars'] not in no_cust:
            cust_name = row['Particulars']
            cust_dict[cust_name] = pd.concat(
                [cust_dict[cust_name], df.loc[df['Particulars'] == cust_name]],
                ignore_index=False)
            cust_dict[cust_name] = cust_dict[cust_name].drop_duplicates()
            cust_dict[cust_name]['Particulars'] = code
Right now I have to drop duplicates because 'Particulars' has the client name more than once, and hence the code appends the data, say, x number of times.
I would like to avoid this but I can't seem to figure out a good way to do it.
The second problem is that the item code changes to the code from the last sheet for all rows, but I want each row to keep the code of the sheet it was pulled from.
I can't seem to figure out a way around both the above problems.
Thanks
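For what it's worth, both symptoms (duplicate appends and the last sheet's code overwriting everything) tend to disappear if each sheet is grouped by customer once, instead of appending inside the row loop. A hedged sketch; the in-memory sheets dict replaces the workbook, and the names mirror the ones above (ITEMCODE is a made-up column so the per-sheet code is stored once and never overwritten):

```python
import pandas as pd

# Stand-ins for two sheets, each with its item code and sales rows.
sheets = {
    'sheet1': ('CODE-1', pd.DataFrame({'Particulars': ['custA', 'custB'],
                                       'qty': [5, 7]})),
    'sheet2': ('CODE-2', pd.DataFrame({'Particulars': ['custA'],
                                       'qty': [3]})),
}

cust_frames = {}
for sheet, (code, df) in sheets.items():
    df = df.assign(ITEMCODE=code)  # tag rows with *this* sheet's code
    # One pass per sheet: groupby hands over each customer's rows exactly
    # once, so nothing gets appended multiple times.
    for cust_name, grp in df.groupby('Particulars'):
        cust_frames.setdefault(cust_name, []).append(grp)

# One DataFrame per customer, ready to write to its own workbook.
cust_dict = {name: pd.concat(parts, ignore_index=True)
             for name, parts in cust_frames.items()}
```

Because the code is attached as its own column before the rows are split up, rows from different sheets keep their own codes even after concatenation.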