I have an Excel workbook with multiple sheets of sales data. I am trying to sort the data so that each customer gets a separate sheet (in a different workbook) containing the item details. I have created a dictionary keyed by all the customer names.
for name in cust_dict.keys():
    cust_dict[name] = pd.DataFrame(columns=cols)

for sheet in sheets:
    ws = sales_wb.sheet_by_name(sheet)
    code = ws.cell(4, 0).value  # this is the item code
    df = pd.read_excel(sales_wb, engine='xlrd', sheet_name=sheet, skiprows=7)
    df = df.fillna(0)
    count = 0
    for index, row in df.iterrows():
        print('rotation ' + str(count))
        count += 1
        if row['Particulars'] != 0 and row['Particulars'] not in no_cust:
            cust_name = row['Particulars']
            cust_dict[cust_name] = cust_dict[cust_name].append(
                df.loc[df['Particulars'] == cust_name], ignore_index=False)
            cust_dict[cust_name] = cust_dict[cust_name].drop_duplicates()
            cust_dict[cust_name]['Particulars'] = code
Right now I have to drop duplicates because Particulars contains the client name more than once, so the code appends the same data x number of times. I would like to avoid this but can't figure out a good way to do it.
The second problem is that the item code for all rows changes to the code from the last sheet, but I want each row to keep the code from the sheet it was pulled from.
I can't seem to figure out a way around either problem.
Thanks
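A sketch of one way to avoid both issues, assuming the same `no_cust` filter as above (the sample data, `sheet_code`, and the `Code` column name are made up for illustration): group each sheet's rows by customer once instead of re-appending per row, stamp the item code into its own column before collecting, and only concatenate at the end.

```python
import pandas as pd

# Hypothetical stand-ins for one parsed sheet and its item code;
# no_cust holds non-customer labels found in the Particulars column.
sheet_code = "ITEM-001"
no_cust = {"Total"}
df = pd.DataFrame({
    "Particulars": ["Acme", "Acme", "Total", "Beta"],
    "Qty": [5, 3, 8, 2],
})

cust_dict = {}
rows = df[df["Particulars"].notna() & ~df["Particulars"].isin(no_cust)]
for cust_name, group in rows.groupby("Particulars"):
    piece = group.copy()
    piece["Code"] = sheet_code  # keep the per-sheet code in its own column
    cust_dict.setdefault(cust_name, []).append(piece)

# After all sheets: one DataFrame per customer, no duplicate appends
result = {name: pd.concat(pieces, ignore_index=True)
          for name, pieces in cust_dict.items()}
```

Because each sheet contributes each customer's rows exactly once, `drop_duplicates` is no longer needed, and because the code goes into a separate column at collection time, later sheets can't overwrite it.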
I have some big Excel files like this (note: other variables are omitted for brevity):
and would need to build a corresponding Pandas DataFrame with the following structure.
I am trying to develop Pandas code for, at least, parsing the first column and transposing the id and full name of each user. Could you help with this?
The way I would tackle it (and I am assuming there are likely more efficient ways) is to import the Excel file into a dataframe, then iterate through it to grab the details you need for each line. Store that information in a dictionary, and append each formed line to a list. That list of dictionaries can then be used to create the final dataframe.
Please note, I made the following assumptions:
Your excel file is named 'data.xlsx' and in the current working directory
The index next to each person increments by one EVERY time
All people have a position described in brackets next to the name
I made up the column names, as none were provided
import pandas as pd

# import the excel file into a dataframe (df)
filename = 'data.xlsx'
df = pd.read_excel(filename, names=['col1', 'col2'])

# remove blank rows
df.dropna(inplace=True)

# reset the index of df
df.reset_index(drop=True, inplace=True)

# initialise the variables
counter = 1
name_pos = ''
name = ''
pos = ''
line_dict = {}
list_of_lines = []

# iterate through the dataframe
for i in range(len(df)):
    if df['col1'][i] == counter:
        # header row: "Name (Position)"
        name_pos = df['col2'][i].split(' (')
        name = name_pos[0]
        pos = name_pos[1].rstrip(')')  # drop the closing bracket
        p_index = counter
        counter += 1
    else:
        # detail row: date and amount
        date = df['col1'][i].strftime('%d/%m/%Y')
        amount = df['col2'][i]
        line_dict = {'p_index': p_index, 'name': name, 'position': pos,
                     'date': date, 'amount': amount}
        list_of_lines.append(line_dict)

final_df = pd.DataFrame(list_of_lines)
OUTPUT:
The code below creates three data frames, one per year. Each data frame is essentially the same, except each year has different stats for how players did. However, the header at the top of the data frame gets repeated within every 20 rows or so, and I'm trying to figure out how to get rid of it. I figured that if I search the "Player" column for every instance where "Player" is repeated, I could find the occurrences and delete the rows they appear in. At the end of my code, I ran a print to see how many times the header row occurs within the data, and it comes out to 20 times. I just can't figure out a way to delete those rows.
import pandas as pd

years = ["2018", "2019", "2020"]
url_template = "https://www.pro-football-reference.com/years/{}/fantasy.htm"

urlList = []
for season in years:
    url = url_template.format(season)
    urlList.append(url)

df2018 = pd.read_html(urlList[0], header=1)
df2019 = pd.read_html(urlList[1], header=1)
df2020 = pd.read_html(urlList[2], header=1)

print(df2020)
print(sum(df2020[0]["Player"] == "Player"))
P.S. I thought there was a way to reference a data frame column using attribute access, in the form dataframe.column_name?
This should work:
import pandas as pd

years = ["2018", "2019", "2020"]
url_template = "https://www.pro-football-reference.com/years/{}/fantasy.htm"

urlList = []
for season in years:
    urlList.append(url_template.format(season))

df2018 = pd.read_html(urlList[0], header=1)
df2019 = pd.read_html(urlList[1], header=1)
df2020 = pd.read_html(urlList[2], header=1)

# keep only rows where the Rk column is not the repeated header value "Rk"
df2020 = df2020[0]
df2020 = df2020[df2020['Rk'] != 'Rk']
print(df2020.head(50))
It filters the Rk column for the value "Rk", and excludes it when creating the new dataframe. I only ran the code for 2020, but you can repeat it for the other dataframes.
As a note, pd.read_html() returns a list of dataframes rather than a single dataframe, because an HTML page or file can contain multiple tables. That's why I included the line df2020 = df2020[0], which selects the first dataframe from the list.
If you need to reset the index, add this code to the end:
df2020 = df2020.reset_index(drop=True)
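The header-row filter itself is easy to check offline. A small sketch with made-up data standing in for one table returned by `pd.read_html` (the site repeats its header as a data row every ~20 rows):

```python
import pandas as pd

# Stand-in for one scraped table with a repeated header row in the middle
df = pd.DataFrame({
    "Rk": ["1", "2", "Rk", "3"],
    "Player": ["A. Smith", "B. Jones", "Player", "C. Brown"],
})

# drop every row whose Rk value is the literal header string "Rk"
clean = df[df["Rk"] != "Rk"].reset_index(drop=True)
```

The same one-liner works for each year's dataframe.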
I am trying to write code that looks into a certain directory, selects files with the .xlsx extension, appends each workbook's sheets together, and creates a single Excel sheet. It then drops a row if FAMILYNAME and FIRSTNAME are null, and creates a CODE column for every file it processes: CODE = 1 for the first booklet it has finished working on, 2 for the second, and so on. Finally it should create a column called ID with values from 1 to the total number of rows in the data frame, though I am facing a challenge on this one. Can anyone help? Thanks!
import os
import numpy as np
import pandas as pd

os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\example")

def read_excel_sheets(xls_path):
    """Read all sheets of an Excel workbook and return a single DataFrame"""
    print(f'Loading {xls_path} into pandas')
    xl = pd.ExcelFile(xls_path)
    df = pd.DataFrame()
    for idx, name in enumerate(xl.sheet_names):
        print(f'Reading sheet #{idx}: {name}')
        sheet = xl.parse(name, skiprows=1)
        # NOTE: if the worksheets all have the same columns in the same
        # order, you can save the column names from the first sheet and
        # reuse them for the remaining sheets before appending.
        df = pd.concat([df, sheet])
    return df

n = 0
for files in os.listdir():
    if files.endswith(".xlsx"):
        # combine all sheets of the workbook into one frame
        # (one-liner alternative:
        #  kim = pd.concat(pd.read_excel(files, sheet_name=None), ignore_index=True))
        kim = pd.read_excel(files, sheet_name=None, header=0)
        kim = pd.concat(kim, ignore_index=True)
        # drop rows with missing names, and drop repeated header rows
        kim = kim[kim['FAMILYNAME'].notna() & kim['FIRSTNAME'].notna()]
        kim = kim[kim.FAMILYNAME != 'FAMILYNAME']
        row, col = kim.shape
        print(kim.shape)

        # automatically import the file names and append them
        stops = read_excel_sheets(files)
        n = n + 1
        stops['CODE'] = n
        ID = list(range(1, row + 1))
        stops['ID'] = pd.Series(ID)
        stops['GENDER'] = np.where((stops.GENDER == 'M'), 'MALE', stops.GENDER)
        stops['GENDER'] = np.where((stops.GENDER == 'F'), 'FEMALE', stops.GENDER)
        stops = stops[stops['FAMILYNAME'].notna() & stops['FIRSTNAME'].notna()]
        os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\TRIAL")
        stops.to_excel(files, index=False)
        os.chdir(r"C:\Users\Kimanya\jupyter lessons\REMOTE WORK\New folder\example")
This is what the code produces after execution. I would love something like this, but I want the ID to automatically increment across all rows in the sheet.
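The CODE/ID logic can be sketched on in-memory frames (the sample frames below are hypothetical stand-ins for the per-file reads): assign CODE per file while reading, then number the combined result once so ID runs continuously from 1 to the total row count.

```python
import pandas as pd

# Hypothetical stand-ins for two parsed workbooks
frames = [
    pd.DataFrame({"FAMILYNAME": ["Doe", None], "FIRSTNAME": ["Jane", "Bob"]}),
    pd.DataFrame({"FAMILYNAME": ["Okoro"], "FIRSTNAME": ["Ada"]}),
]

labeled = []
for code, df in enumerate(frames, start=1):
    df = df[df["FAMILYNAME"].notna() & df["FIRSTNAME"].notna()].copy()
    df["CODE"] = code  # 1 for the first booklet, 2 for the next, ...
    labeled.append(df)

combined = pd.concat(labeled, ignore_index=True)
combined["ID"] = range(1, len(combined) + 1)  # 1..total number of rows
```

Numbering after the concat is what keeps ID incrementing across files instead of restarting per sheet.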
I'm working on a project and ran into a messy situation where I have to split a data frame based on its first column. The data frame comes from SQL queries and goes through a lot of manipulation, which is why I'm not posting that code here.
Target: the data frame I have is like the screenshot below, available as an xlsx file.
Output: I'm looking for output like the attached file here.
The thing is, I'm not able to work out the logic for how to get this done on the dataframe itself, as I'm a newbie in Python.
I think you can do this:
df = df.set_index('Placement# Name')
df['Date'] = df['Date'].dt.strftime('%m-%d-%Y')

# per-placement subtotals of the numeric columns
df_sub = df[['Delivered Impressions', 'Clicks', 'Conversion', 'Spend']]\
    .groupby(level=0).sum().assign(Date='Subtotal')
df_sub['CTR'] = df_sub['Clicks'] / df_sub['Delivered Impressions']
df_sub['eCPA'] = df_sub['Spend'] / df_sub['Conversion']

# interleave the subtotal rows with the detail rows
df_out = pd.concat([df, df_sub]).set_index('Date', append=True).sort_index(level=0)

# write each placement group to the sheet with a blank row between groups
startline = 0
writer = pd.ExcelWriter('testxls.xlsx', engine='openpyxl')
for n, g in df_out.groupby(level=0):
    g.to_excel(writer, startrow=startline, index=True)
    startline += len(g) + 2
writer.close()
Load the Excel file into a Pandas dataframe, then extract rows based on condition.
dframe = pandas.read_excel("sample.xlsx")
dframe = dframe.loc[dframe["Placement# Name"] == "Needed value"]
where "Needed value" is the value from one of those rows.
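To split on every distinct value of the first column at once, rather than one hard-coded value, `groupby` can be used. A sketch with made-up data standing in for the frame read from sample.xlsx:

```python
import pandas as pd

# Hypothetical stand-in for the frame read from sample.xlsx
dframe = pd.DataFrame({
    "Placement# Name": ["A", "A", "B"],
    "Clicks": [10, 20, 5],
})

# one sub-frame per distinct value of the first column
parts = {name: g.reset_index(drop=True)
         for name, g in dframe.groupby("Placement# Name")}
```

Each value in `parts` can then be written to its own sheet or file.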
Background: My first Excel-related script. Using openpyxl.
There is an Excel sheet with loads of different types of data about products in different columns.
My goal is to extract certain types of data from certain columns (e.g. price, barcode, status), assign those to the unique product code and then output product code, price, barcode and status to a new excel doc.
I have succeeded in extracting the data and putting it the following dictionary format:
productData = {'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'}}
My general thinking on getting this output to a new report is something like this (although I know that this is wrong):
newReport = openpyxl.Workbook()
newSheet = newReport.active
newSheet.title = 'Output'
newSheet['A1'].value = 'Product Code'
newSheet['B1'].value = 'Price'
newSheet['C1'].value = 'Barcode'
newSheet['D1'].value = 'Status'
for row in range(2, len(productData) + 1):
    newSheet['A' + str(row)].value = productData[productCode]
    newSheet['B' + str(row)].value = productPrice
    newSheet['C' + str(row)].value = productBarcode
    newSheet['D' + str(row)].value = productStatus
newReport.save('ihopethisworks.xlsx')
What do I actually need to do to output the data?
I would suggest using Pandas for that. It has the following syntax:
df = pd.read_excel('your_file.xlsx')
df['Column name you want'].to_excel('new_file.xlsx')
You can do a lot more with it. Openpyxl might not be the right tool for your task (Openpyxl is too general).
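For the dictionary format shown in the question, a minimal pandas sketch could look like this (the output filename is a placeholder):

```python
import pandas as pd

productData = {'AB123': {'barcode': 123456, 'price': 50, 'status': 'NEW'}}

# orient='index' turns the outer keys (product codes) into the row index
df = pd.DataFrame.from_dict(productData, orient='index')
df.index.name = 'Product Code'
# df.to_excel('output.xlsx')  # hypothetical output path
```

This gives one row per product code with barcode, price, and status as columns, ready to write with `to_excel`.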
P.S. I would leave this in the comments, but stackoverflow, in their wisdom, decided to let anyone leave answers but not comments.
The logic you use to extract the data is missing, but I suspect the best approach is to use it to loop over the two worksheets in parallel. You can then avoid using a dictionary entirely and just append rows to the new worksheet inside the loop.
Pseudocode:
ws1  # source worksheet
ws2  # new worksheet

for row in ws1:            # loop over the source rows
    code = ws1[…]          # some lookup
    barcode = ws1[…]
    price = ws1[…]
    status = ws1[…]
    ws2.append([code, price, barcode, status])
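A concrete version of that pseudocode might look like the following sketch. The source workbook is built in memory here so the example is self-contained; in practice `ws1` would come from `load_workbook(...)`, and the column order is an assumption to adjust to the real sheet layout.

```python
from openpyxl import Workbook

# Build a small source workbook in memory for the sketch
src = Workbook()
ws1 = src.active
ws1.append(['AB123', 123456, 50, 'NEW'])  # code, barcode, price, status

out = Workbook()
ws2 = out.active
ws2.append(['Product Code', 'Price', 'Barcode', 'Status'])
for code, barcode, price, status in ws1.iter_rows(values_only=True):
    # reorder the source columns to match the report header
    ws2.append([code, price, barcode, status])
```

`append` adds each list as the next row, so no cell coordinates need to be tracked by hand.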
Pandas will work best for this. Here are some examples:
import pandas as pd

# df columns: Date Open High Low Close Volume
# reading data from an excel file
df = pd.read_excel('GOOG-NYSE_SPY.xls')

# set the index to the column of your choice, in this case the date
df.set_index('Date', inplace=True)

# choose the columns you want for further manipulation
df = df[['Open', 'Close']]

# combine the two columns to get the % change
df = (df['Open'] - df['Close']) / df['Close'] * 100
print(df.head())