Create a hyperlink for each item in a column in Python (CSV)

I am trying to create a hyperlink for each item in one column based on the URL in another column.
Each title should link to its corresponding URL: when you click apple it should go to apple.com, when you click banana it should go to banana.com, and so on. Is there a way to do this to a CSV file in Python?
(Let's say my data is 2000 rows.)
Thanks in advance

You can use (I used) the pandas library to read the data from the CSV and later write it to Excel, leveraging the Excel function HYPERLINK to make each cell a, well, hyperlink. The HYPERLINK function takes the URL we are going to (with http:// or https:// at the beginning) and then the visible text, or friendly name.
One thing the Excel file will not have when you open it is blue underlined text for the hyperlinked cells. If that is needed, you can probably adapt this solution and also use the XlsxWriter module.
import pandas as pd

df = pd.read_csv(r"path\test.csv")  # put your actual path here (raw string so the backslash isn't treated as an escape)
# this reads the file into a 'dataframe', basically a table
df['title'] = '=HYPERLINK("https://' + df["url"] + '","' + df["title"] + '")'
"""The df we pulled in has the columns 'title' and 'url'. This line rewrites the values in
'title' to use Excel's HYPERLINK formula, where the link is the value from the 'url' column with
'https://' prepended, and the value from 'title' is the text the user will see in Excel."""
df.to_excel(r"save_location\test2.xlsx", index=False)  # saves the new file as an Excel workbook;
# index=False removes the index column that is created when you first make any dataframe
If you want your output to just be one column, the one they click, your final line will be slightly different:
df['title'].to_excel(r"save_location\test2.xlsx", index=False)
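If you do want that blue underlined link styling, here is a minimal sketch that drives XlsxWriter through pandas (assuming the same 'title' and 'url' columns; write_url is the XlsxWriter call that creates a real, formatted hyperlink):

import pandas as pd

df = pd.read_csv("test.csv")  # assumes columns 'title' and 'url'

with pd.ExcelWriter("test2.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
    workbook = writer.book
    worksheet = writer.sheets["Sheet1"]
    link_format = workbook.add_format({"font_color": "blue", "underline": 1})
    # overwrite the 'title' column (column 0) with real hyperlink cells, row by row
    for i, (title, url) in enumerate(zip(df["title"], df["url"]), start=1):
        worksheet.write_url(i, 0, "https://" + url, link_format, string=title)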

It did not work; probably you have included too many 's. Here is what I tried:
import pandas as pd
df = pd.read_csv('test.csv')
df['URL'] = 'https://url/'+df['id'].astype(str)
keep_col = ['URL']
newurl = df[keep_col]
newurl.to_csv("newurl.csv", index=False)
This code is working, but the output file does not show a clickable URL. (A CSV is plain text, so it cannot store clickable links; for that, write to an Excel format such as .xlsx.)

Related

Ignore and Append tabs in worksheet

New to python/pandas... I get a profit and loss report from my trading brokerage as a PDF. I can convert it into a spreadsheet that splits each page of the PDF into a headings worksheet and a table worksheet. For now I do this simply and cleanly using export from Adobe PDF.
What I'm trying to figure out next is how to loop through the workbook and ignore tabs whose headings or data do not match the columns of the tabs I want to append. I'm able to use this tutorial https://pythoninoffice.com/use-python-to-combine-multiple-excel-files/ for the most part, but I don't know how to generalise the process: as the PDF report grows in pages, I don't want to have to tell pandas which tab numbers to append and which to ignore. I'd rather run a loop in Python that keeps going until it runs out of tabs/worksheets, appending only those that match certain headers.
Any help or pointers would be appreciated.
(Screenshots: "Ignore the tabs with this Tab header" / "Append tabs with this header".)
You could take advantage of the fact that you only want sheets with more than one row: read all the sheets, then concat only the ones with more than one row. Here goes:
import pandas as pd

# create sample data
df_many_lines = pd.DataFrame([[1, 2, 3], [1, 2, 3], [1, 2, 3]])
df_one_line = pd.DataFrame([10, 20, 30]).T

# create a sample file with ten one-row sheets and ten multi-row sheets
with pd.ExcelWriter('output.xlsx') as writer:
    for i in range(10):
        df_one_line.to_excel(writer, sheet_name=f'Sheet_name_{i}', index=False)
        df_many_lines.to_excel(writer, sheet_name=f'Sheet_name_1{i}', index=False)

# read all the sheets into a dict of dataframes
df_dict = pd.read_excel('output.xlsx', sheet_name=None)

# concat the ones with more than one row into one df
combined = pd.concat([df for df in df_dict.values() if df.shape[0] > 1]).reset_index(drop=True)
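Since your real goal is to keep only the tabs whose headers match, here is a hedged variant that keys off the column names instead of the row count (the column names are hypothetical; substitute the actual headers of your trade tabs):

expected_cols = {'Date', 'Symbol', 'Quantity', 'Price'}  # hypothetical trade-tab headers

df_dict = pd.read_excel('report.xlsx', sheet_name=None)  # read every sheet
trade_frames = [df for df in df_dict.values() if expected_cols.issubset(df.columns)]
combined = pd.concat(trade_frames, ignore_index=True)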
I'm the original author of that tutorial you quoted at https://pythoninoffice.com
That's a lot of trades that you made! If I understand your question correctly, you are trying to:
Ignore tabs that contain only the header row, and
Append tabs that contain actual trade information
Since this is a P&L report, I'm assuming that all the trade tabs (the green tabs you highlighted) have the same format? And that all the header (red) tabs have the same format, i.e. only headers?
You can probably read all tabs and check if there's actually anything inside the dataframe. If not, you know it's a header (red) tab, otherwise, if there's data, then it's a trade (green) tab.
import pandas as pd

frames = []
with pd.ExcelFile('file_path') as excel:
    for sheet in excel.sheet_names:
        df = excel.parse(sheet)
        if not df.empty:
            frames.append(df)
temp = pd.concat(frames)  # DataFrame.append is deprecated, so collect the frames and concat them instead
On a side note, to make your process more streamlined, you can try to use Python to convert a pdf to Excel instead of manually doing it in Adobe. https://pythoninoffice.com/pdf-to-excel-with-python/

Some Hyperlinks not opening with Openpyxl

I have a few hundred files with data and hyperlinks in them that I was trying to upload and append to a single DataFrame when I realized that Pandas was not reading any of the hyperlinks.
I then tried to use Openpyxl to read the hyperlinks in the input Excel files and write a new column into the excels with the text of the hyperlink that hopefully Pandas can read into my dataframe.
However, I am running into issues with my testing the openpyxl code. It is able to read and write some of the hyperlinks but not the others.
My sample file has three rows. (Screenshot of the test data omitted.)
My actual data has hyperlinks the way I have it for "Google" in my test data set, i.e. inserted with the HYPERLINK formula.
The other two hyperlinks in my test data I inserted by right-clicking on the cell and pasting the link.
Sample test file here: Text.xlsx
Here is the code I wrote to read the hyperlink and paste it in a new column. It works for the first two rows (India and China) but fails for the third row (Google). It's unfortunate because all of my actual data is of that type. Can someone please help me figure it out?
import openpyxl

wb = openpyxl.load_workbook('test.xlsx')
ws = wb.active
column_indices = [1]
max_col = ws.max_column
ws.cell(row=1, column=max_col + 1).value = "Hyperlink Text"
for row in range(2, ws.max_row + 1):
    for col in column_indices:
        print(ws.cell(row, column=1).hyperlink.target)
        ws.cell(column=max_col + 1, row=row).value = ws.cell(row, column=1).hyperlink.target
wb.save('test.xlsx')
The cells where you are using the HYPERLINK function (like google.com) will not be of type hyperlink, so .hyperlink is None for them. You will need to process the cells that contain the HYPERLINK formula yourself, using re or a similar approach.
The values look like this:
>>> ws.cell(2,1).value
'China'
>>> ws.cell(3,1).value
'India'
>>> ws.cell(4,1).value
'=HYPERLINK("www.google.com","google")'
Suggested code to handle HYPERLINK:
val = ws.cell(row=row, column=1).value
if val.find("=HYPERLINK") >= 0:
    hyplink = val  # or use the re module for a more robust check and extraction
Note: the second for loop, which iterates over columns, seems unnecessary, since you always use column=1.
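For example, a minimal sketch of that re-based extraction (assuming the formula always has the shape =HYPERLINK("url","friendly name")):

import re

def hyperlink_target(val):
    # pull the target URL out of an Excel HYPERLINK formula string
    match = re.match(r'=HYPERLINK\("([^"]+)"', val)
    return match.group(1) if match else None

print(hyperlink_target('=HYPERLINK("www.google.com","google")'))  # www.google.com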

Opening a Python file from another application without saving the file first (opening a Pandas table in Excel)

I am working with pandas, and I've just modified a table.
Now I would like to see my table in Excel, but it's just a quick look; I will have to modify the table again later, so I don't want to save it anywhere.
In other words, this solution:
import os
import pandas as pd

my_df = pd.DataFrame()
item_path = "my/path"
my_df.to_csv(item_path)
os.startfile(os.path.normpath(item_path))
is not what I want. I would like the same behavior without saving the DataFrame as a CSV first.
# Something like:
my_df = pd.DataFrame()
start_excel(table_to_load=my_df)  # opens Excel with a COPY of my_df
Note
To quickly explore a DataFrame, df.head() is the way, but I want to open my DataFrame from a Tkinter application, so I need an external program to open this temporary table.
You can have a quick look using
<dataframe_name>.head()
It will display the top 5 rows by default, or you can simply pass how many rows you want:
<dataframe_name>.head(<rows_you_want>)
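If you really do need Excel to open the frame without keeping a file around, the usual compromise is a temporary file (the data still touches disk briefly). A minimal sketch of the start_excel helper from the question, assuming Windows since os.startfile is Windows-only:

import os
import tempfile
import pandas as pd

def start_excel(table_to_load):
    # write a copy of the frame to a throwaway file and hand it to Excel
    tmp = tempfile.NamedTemporaryFile(suffix='.xlsx', delete=False)
    tmp.close()  # close our handle so Excel can open the file
    table_to_load.to_excel(tmp.name, index=False)
    os.startfile(tmp.name)

start_excel(pd.DataFrame({'a': [1, 2, 3]}))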

Append xlsx files retaining hyperlinks through python

I have around 20 xlsx files that I would like to append using python. I can easily do that with pandas; the problem is that the first column contains hyperlinks, and when I use pandas to append my xlsx files, I lose the hyperlink and get only the text in the column. Here is the code using pandas:
import pandas as pd

excels = [pd.ExcelFile(name) for name in files]  # 'files' is the list of xlsx file names
frames = [x.parse(x.sheet_names[0], header=None, index_col=None) for x in excels]
frames[1:] = [df[1:] for df in frames[1:]]  # drop the repeated header row from all but the first file
combined = pd.concat(frames)
combined.to_excel("c.xlsx", header=False, index=False)
Is there any way that I can append my files while retaining the hyperlinks? Is there a particular library that can do this?
It depends on how the hyperlinks are written in the original Excel files and on the Excel writer you use. read_excel returns the display text; e.g. if you have a hyperlink to https://www.google.com whose display text is just google, then there's no way to retain the link with pandas, because all you'll have in your dataframe is google.
If no separate display name is given (or the display name is identical to the hyperlink) and you use xlsxwriter (engine='xlsxwriter'), then strings written by to_excel are automatically converted to hyperlinks when they start with 'http://' or any other scheme (as of xlsxwriter version 1.1.5).
If you know that all your hyperlinks are 'http://' links with no authority and the display name (if different from the link) is just the URL path, then you can prepend the 'http://' prefix and you'll get hyperlinks in the Excel file:
import numpy as np

first_col = combined.columns[0]
needs_prefix = ~combined[first_col].str.startswith('http')
combined[first_col] = np.where(needs_prefix, 'http://' + combined[first_col], combined[first_col])
combined.to_excel("c.xlsx", header=False, index=False, engine='xlsxwriter')
A universal solution without pandas, using openpyxl, is shown in this answer to the same SO question where you took the pandas solution from. In order to copy the hyperlinks too, you'll just have to add the following lines to the copySheet function (with from copy import copy at the top):
if cell.hyperlink is not None:
    newCell.hyperlink = copy(cell.hyperlink)
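For reference, a minimal self-contained sketch of that openpyxl approach; copy_sheet here is a simplified stand-in for the copySheet function in the linked answer, and the input file names are hypothetical:

from copy import copy
import openpyxl

def copy_sheet(source_ws, target_ws):
    for row in source_ws.iter_rows():
        target_ws.append([cell.value for cell in row])
        for cell in row:
            if cell.hyperlink is not None:
                # carry the hyperlink object over to the freshly appended row
                new_cell = target_ws.cell(row=target_ws.max_row, column=cell.column)
                new_cell.hyperlink = copy(cell.hyperlink)

target_wb = openpyxl.Workbook()
target_ws = target_wb.active
for name in ['file1.xlsx', 'file2.xlsx']:  # hypothetical input files
    wb = openpyxl.load_workbook(name)
    copy_sheet(wb.active, target_ws)
target_wb.save('combined.xlsx')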

How to iterate over csv rows to extract text from URLS using pandas

I have a csv of a bunch of news articles, and I'm hoping to use the newspaper3k package to extract the body text from those articles and save them as txt files. I want to create a script that iterates over every row in the csv, extracts the URL, extracts the text from the URL, and then saves that as a uniquely named txt file. Does anyone know how I might do this? I'm a journalist who is new to Python, sorry if this is straightforward.
I only have the code below. Before figuring out how to save each body text as a txt file, I figured I should try and just get the script to print the text from each row in the csv.
import newspaper as newspaper
from newspaper import Article
import sys as sys
import pandas as pd

data = pd.read_csv('/Users/alexfrandsen14/Desktop/Projects/newspaper3k-scraper/candidate_coverage.csv')
data.head()

for index, row in data.iterrows():
    article_name = Article(url=['link'], language='en')
    article_name.download()
    article_name.parse()
    print(article_name.text)
Since all the urls are in the same column, it is easier to access that column directly with a for loop. I will go over some explanation here:
# to access your specific url column
from newspaper import Article
import sys as sys
import pandas as pd

data = pd.read_csv('/Users/alexfrandsen14/Desktop/Projects/newspaper3k-scraper/candidate_coverage.csv')

for x in data['url_column_name']:  # replace 'url_column_name' with the actual name in your df
    article = Article(x, language='en')  # x is the url in each row of the column
    article.download()
    article.parse()
    with open(article.title, 'w') as f:  # open a file named after the article title (could be long)
        f.write(article.text)
I have not tried this package before, but from reading the tutorial this should work. Generally, you access the url column in your dataframe with the line for x in data['url_column_name']:, replacing 'url_column_name' with the actual name of the column.
Then x will be the url in the first row, and you pass that to Article (you don't need brackets around x, judging by the tutorial). The loop downloads and parses this first x, opens a file named after the title of the article, writes the text to that file, then closes the file.
It then does the same thing for the second x, and the third x, all the way until you run out of urls.
I hope this helps!
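One caveat with using article.title as a file name: titles can contain characters that are illegal in file names (slashes, colons), and two articles might share a title. A small sketch of a hypothetical helper that sanitizes the title and guarantees a unique name:

import re

def safe_filename(title, index):
    # keep only word characters, hyphens and spaces, and cap the length
    stem = re.sub(r'[^\w\- ]', '', title).strip()[:50] or 'article'
    return f'{index:04d}_{stem}.txt'

# usage: enumerate the urls so each file gets a unique numeric prefix
# for i, x in enumerate(data['url_column_name']):
#     ...
#     with open(safe_filename(article.title, i), 'w') as f:
#         f.write(article.text)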
