I have around 20 xlsx files that I would like to append using python. I can easily do that with pandas, the problem is that in the first column, I have hyperlinks and when I use pandas to append my xlsx files, I lose the hyperlink and get only the text in the column. Here is the code using pandas.
import pandas as pd
# files is the list of paths to the xlsx files to combine
excels = [pd.ExcelFile(name) for name in files]
frames = [x.parse(x.sheet_names[0], header=None,index_col=None) for x in excels]
frames[1:] = [df[1:] for df in frames[1:]]
combined = pd.concat(frames)
combined.to_excel("c.xlsx", header=False, index=False)
Is there any way that I can append my files while retaining the hyperlinks? Is there a particular library that can do this?
It depends on how the hyperlinks are written in the original Excel files and on the Excel writer you use. read_excel returns only the display text. For example, if you have a hyperlink to https://www.google.com and the display text is just google, then there's no way to retain the link with pandas, as you'll have just google in your dataframe.
If no separate display name is given (or the display name is identical with the hyperlink) and you use xlsxwriter (engine='xlsxwriter'), then the output of to_excel is automatically converted to hyperlinks (because it starts with 'http://' or any other scheme) (as of xlsxwriter version 1.1.5).
If you know that all your hyperlinks are 'http://' links with no authority and the display name (if different from the link) is just the url path, then you can prepend the 'http://' prefix and you'll get hyperlinks in the Excel file:
# Work on the first column by name: after concat the row labels repeat, so they cannot be reused as positions
first = combined.columns[0]
combined[first] = combined[first].map(lambda v: v if str(v).startswith('http') else 'http://' + str(v))
combined.to_excel("c.xlsx", header=False, index=False, engine='xlsxwriter')
A universal solution without pandas using openpyxl is shown in this answer to the same SO question where you took the pandas solution from. In order to copy hyperlinks too, you'll just have to add the following lines to the copySheet function:
if cell.hyperlink is not None:
    newCell.hyperlink = copy(cell.hyperlink)
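For readers who do not have the linked copySheet function at hand, here is a minimal self-contained sketch of the same idea. The function name append_sheet and its parameters are hypothetical, it copies only values and hyperlinks (no styles or column widths), and the two in-memory workbooks merely stand in for the real files.

```python
from copy import copy

from openpyxl import Workbook


def append_sheet(src_ws, dst_ws, start_row, skip_header=False):
    """Copy values and hyperlinks from src_ws into dst_ws, starting at start_row.

    Returns the next free row in dst_ws so calls can be chained.
    """
    next_row = start_row
    for row in src_ws.iter_rows(min_row=2 if skip_header else 1):
        for cell in row:
            new_cell = dst_ws.cell(row=next_row, column=cell.column, value=cell.value)
            if cell.hyperlink is not None:
                # Same line as in the answer above: copy the hyperlink object
                new_cell.hyperlink = copy(cell.hyperlink)
        next_row += 1
    return next_row


# Two small workbooks standing in for the real input files
wb_a = Workbook()
ws_a = wb_a.active
ws_a.append(["site", "rank"])
ws_a.append(["google", 1])
ws_a["A2"].hyperlink = "https://www.google.com"

wb_b = Workbook()
ws_b = wb_b.active
ws_b.append(["site", "rank"])
ws_b.append(["python", 2])
ws_b["A2"].hyperlink = "https://www.python.org"

combined_wb = Workbook()
combined_ws = combined_wb.active
row = append_sheet(ws_a, combined_ws, start_row=1)
row = append_sheet(ws_b, combined_ws, start_row=row, skip_header=True)
# combined_wb.save("c.xlsx") would write the result to disk
```

With real files, each source sheet would come from openpyxl.load_workbook(name).active instead of being built in memory.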
Related
Do I need to read a Google Sheet with read_excel to search its columns in Python?
I need to gather data from an entire Google Sheets file. I need to find a sheet by name first, then gather information by looking up values in its columns.
I started by looking at the two popular solutions on the internet.
The first one uses the gspread package. As it relies on service_account.json info, I will not use it.
The second one is appropriate for me, but it shows how to export a csv file, and I need the data as an xlsx file.
The code is below:
import pandas as pd
sheet_id=" url "
sheet_name="sample_1"
url=f"https://docs.google...d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
I have both the sheet_id and the sheet_name, but I need to export an xlsx file.
Here I see an example of how to read an Excel file. Is there a way to read a Google spreadsheet as an Excel file?
Using Pandas to pd.read_excel() for multiple worksheets of the same workbook
xls = pd.ExcelFile('excel_file_path.xls')
# Now you can list all sheets in the file
xls.sheet_names
# ['house', 'house_extra', ...]
# to read just one sheet to dataframe:
df = pd.read_excel(xls, sheet_name="house")
I have no problem reading a google sheet using the method I found here:
Python Read In Google Spreadsheet Using Pandas
import pandas as pd
spreadsheet_id = "<INSERT YOUR GOOGLE SHEET ID HERE>"
url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?format=csv"
df = pd.read_csv(url)
df.to_excel("my_sheet.xlsx")
You need to set the permissions of your sheet though. I found that setting it to "anyone with a link" worked.
UPDATE - based on comments below
If your spreadsheet has multiple tabs and you want to read anything other than the first sheet, you need to specify a sheetID as described here
spreadsheet_id = "<INSERT YOUR GOOGLE spreadsheetId HERE>"
sheet_id = "<INSERT YOUR GOOGLE sheetId HERE>"
url = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/export?gid={sheet_id}&format=csv"
df = pd.read_csv(url)
df.to_excel("my_sheet.xlsx")
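Since the question asks for xlsx rather than csv, the same export endpoint can be sketched with format=xlsx, which downloads the whole workbook so that tabs can be selected by name. This is an untested sketch: the helper names export_url and read_all_sheets are hypothetical, and it assumes the sheet is shared as "anyone with a link" and that openpyxl is installed for pandas to read xlsx.

```python
from urllib.parse import urlencode


def export_url(spreadsheet_id, fmt="xlsx", gid=None):
    """Build a Google Sheets export URL (hypothetical helper)."""
    params = {"format": fmt}
    if gid is not None:
        # gid selects a single tab, as in the csv example above
        params["gid"] = gid
    return (
        f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}"
        f"/export?{urlencode(params)}"
    )


def read_all_sheets(spreadsheet_id):
    """Return a dict mapping each tab name to a DataFrame."""
    import pandas as pd  # imported here so export_url works without pandas

    # sheet_name=None asks pandas to load every sheet of the workbook
    return pd.read_excel(export_url(spreadsheet_id), sheet_name=None)
```

Usage would look like sheets = read_all_sheets(spreadsheet_id), followed by df = sheets["sample_1"] to pick a tab by name.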
TLDR: How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves?
My code combines two .xlsx sheets together to generate emails for new org users.
The first .xlsx contains a formula that concatenates the user's name and our domain, while the other .xlsx contains the queried list of new users. When combined, the newly generated .xlsx, titled 'users.xlsx' includes the desired information - but the emails generated are done so using the formula, still - not values. If asked to read data_only via pandas, it doesn't seem to work at all and no emails are generated on this newly created 'users' xlsx sheet.
This is all fine and works well, but the final step is converting the .xlsx over to .csv
Because the emails are technically generated through the concatenating formula, the conversion doesn't preserve the user's emails.
How can I convert the .xlsx over to .csv and preserve the values of formulas, instead of the formulas themselves? Is this possible? Can I force the third .xlsx to preserve values only and then do the conversion?
Things I've tried (while they all successfully convert into a .csv, the data within formulas is lost):
Legend:
combined_xlsx_2
# The .xlsx product after combining two xlsx (user info + emails)
# This product is 'users.xlsx' - I need it converted to a .csv
Code 1:
# Read and store content
# of an excel file
read_file = pd.read_excel (combined_xlsx_2)
# Write the dataframe object
# into csv file
filedir = combined_xlsx_2.replace("users_2.xlsx","users.csv")
read_file.to_csv (filedir,
index = None,
header=True,
encoding='utf-8')
# read csv file and convert
# into a dataframe object
df = pd.DataFrame(pd.read_csv(filedir))
df
Code 2:
filename = (combined_xlsx_2)
filedir = (filename.replace("/users.xlsx",""))
path_to_excel_files = glob.glob(filedir)
for excel in path_to_excel_files:
    out = excel.split('.')[0]+'.csv'
    df = pd.read_excel(excel)
    df.to_csv(out)
Code 3:
wb = xlrd.open_workbook(combined_xlsx_2)
sh = wb.sheet_by_name('Sheet1')
your_csv_file = open(combined_xlsx_2.replace('.xlsx', '.csv'), 'w')
wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
for rownum in range(sh.nrows):
    wr.writerow(sh.row_values(rownum))
your_csv_file.close()
Thank you for your time and assistance!
UPDATE 1:
I was able to accomplish this using 'convert-api'
https://www.convertapi.com/xlsx-to-csv#snippet=python
While not what I had in mind, it will at least get me by. Still hoping there's a better solution for this. Just wanted to share this just in case anyone else had a similar question.
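One likely reason the values are lost: pandas and openpyxl never evaluate formulas. They can only read the result that Excel cached in the file when it last saved it, and a workbook generated by Python has no cached results until it has been opened and saved in Excel. The sketch below, with the hypothetical helper name xlsx_values_to_csv, converts a sheet to csv using openpyxl's data_only=True. Formula cells come out empty if the file was never saved by Excel, in which case building the email string in Python before writing the file is the more dependable route.

```python
import csv

from openpyxl import load_workbook


def xlsx_values_to_csv(xlsx_path, csv_path, sheet_name=None):
    """Write the cached cell values of one sheet to a csv file."""
    # data_only=True returns the result Excel stored for each formula,
    # or None when no result was ever cached in the file.
    wb = load_workbook(xlsx_path, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for row in ws.iter_rows(values_only=True):
            writer.writerow(row)  # None is written as an empty field
```

Usage would look like xlsx_values_to_csv(combined_xlsx_2, combined_xlsx_2.replace(".xlsx", ".csv")).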
I am trying to create a Hyperlink for each item in a column based on another column.
Here is an image to help you understand better:
Each title should hyperlink to the corresponding URL. (when you click apple, it should go to apple.com, when you click banana it should go to banana.com, so on) Is there a way to do this to a CSV file in python?
(Let's say my data is 2000 rows)
Thanks in advance
You can use (I used) the pandas library to read the data from csv and later write it to excel, and leverage the Excel function HYPERLINK to make the cell a, well, Hyperlink. The Excel function for HYPERLINK requires the url we are going to (with http:// or https:// at the beginning) and then the visible text, or friendly-name.
One thing the Excel file will not have when you open it is blue underlined text for the Hyperlinked cells. If that is needed, you can probably leverage this solution somewhat and also use the XlsxWriter module.
import pandas as pd
df = pd.read_csv(r"path\test.csv") #put your actual path here; the raw string stops "\t" from being read as a tab
# this will read the file and save it as a 'dataframe', basically a table.
df['title'] = '=HYPERLINK("https://' + df["url"] +'","' + df["title"]+'")'
"""the df that we originally pulled in has the columns 'title' and 'url'. This line re-writes the values in
'title' to use the HYPERLINK formula in Excel, where the link is the value from the URL column with 'https://'
added to the beginning, and then uses the value from 'title' as the text the user will see in Excel"""
df.to_excel(r"save_location\test2.xlsx",index=False) #this saves the new file as an Excel. index=False removes the index column that was created when you first make any dataframe
If you want your output to just be one column, the one they click, your final line will be slightly different:
df['title'].to_excel(r"save_location\test2.xlsx",index=False)
Did not work or probably you have included too many 's
import pandas as pd
df = pd.read_csv('test.csv')
df['URL'] = 'https://url/'+df['id'].astype(str)
keep_col = ['URL']
newurl = df[keep_col]
newurl.to_csv("newurl.csv", index=False)
This code is working but the output file does not show a clickable url
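A csv file is plain text, so it cannot carry a clickable link; the link has to live in an xlsx file. As an alternative to the HYPERLINK formula, here is a hedged sketch that uses openpyxl (not the XlsxWriter route mentioned above) to attach a real hyperlink to each title and apply the built-in "Hyperlink" cell style so the text shows blue and underlined. The function name csv_to_linked_workbook is hypothetical, and the column names title and url are assumed from the question.

```python
import csv
import io

from openpyxl import Workbook


def csv_to_linked_workbook(csv_text, title_col="title", url_col="url"):
    """Return a workbook whose first column holds clickable titles."""
    wb = Workbook()
    ws = wb.active
    ws.append([title_col])  # header row
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, record in enumerate(reader, start=2):
        url = record[url_col]
        if "://" not in url:
            # Assumption: bare domains such as apple.com should use https
            url = "https://" + url
        cell = ws.cell(row=row_number, column=1, value=record[title_col])
        cell.hyperlink = url
        cell.style = "Hyperlink"  # built-in style: blue, underlined
    return wb


sample = "title,url\napple,apple.com\nbanana,https://banana.com\n"
linked_wb = csv_to_linked_workbook(sample)
# linked_wb.save("test2.xlsx") would write the file to disk
```

For a file on disk, the text could be read with open("test.csv", newline="", encoding="utf-8").read() and passed in.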
I have a few hundred files with data and hyperlinks in them that I was trying to upload and append to a single DataFrame when I realized that Pandas was not reading any of the hyperlinks.
I then tried to use Openpyxl to read the hyperlinks in the input Excel files and write a new column into the excels with the text of the hyperlink that hopefully Pandas can read into my dataframe.
However, I am running into issues with my testing the openpyxl code. It is able to read and write some of the hyperlinks but not the others.
My sample file has three rows and looks like this:
My actual data has hyperlinks in the way that I have it for "Google" in my test data set.
The other two hyperlinks in my test data, I inserted by right clicking on the cell and pasting the link.
Sample Test file here: Text.xlsx
Here is the code I wrote to read the hyperlink and paste it in a new column. It works for the first two rows (India and China) but fails for the third row (Google). It's unfortunate because all of my actual data is of that type. Can someone please help me figure it out?
import openpyxl
wb = openpyxl.load_workbook('test.xlsx')
ws = wb.active
column_indices = [1]
max_col = ws.max_column
ws.cell(row=1,column = max_col+1).value = "Hyperlink Text"
for row in range(2,ws.max_row+1):
    for col in column_indices:
        print(ws.cell(row, column=1).hyperlink.target)
        ws.cell(column=max_col+1,row=row).value = ws.cell(row, column=1).hyperlink.target
wb.save('test.xlsx')
The cells where you are using the HYPERLINK function (like google.com) will not be of type hyperlink, so their hyperlink attribute is None. You will need to process the cells that use the HYPERLINK function with re or a similar function.
The values look like this:
>>> ws.cell(2,1).value
'China'
>>> ws.cell(3,1).value
'India'
>>> ws.cell(4,1).value
'=HYPERLINK("www.google.com","google")'
Suggested code to handle HYPERLINK :
val = ws.cell(row,column).value
if isinstance(val, str) and val.find("=HYPERLINK") >= 0 :
    hyplink = val # Or use re module for more robust check
Note : The second for loop to iterate over columns seems not required since you are always using column=1.
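The re-based check mentioned above could look like the following sketch. The helper names parse_hyperlink_formula and cell_link are hypothetical, and the pattern only handles formulas whose arguments are quoted string literals; a formula that points at another cell, such as =HYPERLINK(A1,"x"), is not covered.

```python
import re

# Matches =HYPERLINK("url") and =HYPERLINK("url","text") with literal strings
HYPERLINK_RE = re.compile(
    r'^=HYPERLINK\(\s*"([^"]*)"\s*(?:,\s*"([^"]*)"\s*)?\)$',
    re.IGNORECASE,
)


def parse_hyperlink_formula(value):
    """Return (url, text) for a HYPERLINK formula string, else None."""
    if not isinstance(value, str):
        return None
    match = HYPERLINK_RE.match(value.strip())
    if match is None:
        return None
    url, text = match.group(1), match.group(2)
    # Without a friendly name, the url itself is used as the text
    return url, text if text is not None else url


def cell_link(cell):
    """Return the URL behind an openpyxl cell, whichever way it was created."""
    if getattr(cell, "hyperlink", None) is not None:
        return cell.hyperlink.target  # link inserted through the Excel dialog
    parsed = parse_hyperlink_formula(cell.value)
    return parsed[0] if parsed else None  # link written as a formula, or None
```

In the loop from the question, ws.cell(row, column=1).hyperlink.target would then be replaced by cell_link(ws.cell(row, column=1)), which also avoids the error on cells that have no link at all.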
I have a text file that contains data like this. It is just a small example, but the real one is pretty similar.
I am wondering how to display such data in an "Excel Table" like this using Python?
The pandas library is wonderful for reading csv files (which is the file content in the image you linked). You can read in a csv or a txt file using the pandas library and output this to excel in 3 simple lines.
import pandas as pd
df = pd.read_csv('input.csv') # if your file is comma separated
or if your file is tab delimited '\t':
df = pd.read_csv('input.csv', sep='\t')
To save to excel file add the following:
df.to_excel('output.xlsx', sheet_name='Sheet1')
complete code:
import pandas as pd
df = pd.read_csv('input.csv') # can replace with df = pd.read_table('input.txt') for '\t'
df.to_excel('output.xlsx', sheet_name='Sheet1')
This will explicitly keep the index, so if your input file was:
A,B,C
1,2,3
4,5,6
7,8,9
Your output excel would look like this:
You can see your data has been shifted one column and your index axis has been kept. If you do not want this index column (because you have not assigned your df an index so it has the arbitrary one provided by pandas):
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
Your output will look like:
Here you can see the index has been dropped from the excel file.
You do not need python! Just rename your text file to CSV and voila, you get your desired output :)
If you want to rename using python then -
You can use os.rename function
os.rename(src, dst)
Where src is the source file and dst is the destination file
XLWT
I use the XLWT library. It produces native Excel files, which is much better than simply importing text files as CSV files. It is a bit of work, but provides most key Excel features, including setting column widths, cell colors, cell formatting, etc.
saving this is:
df.to_excel("testfile.xlsx")