How to Extract the result from python into a xls file - python

I'm a novice in Python and I need to extract references from scientific literature. This is the code I'm using:
from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')
print(references)
Please guide me on how to write this printed information to an XLS file. Thank you so much.

You could use the pandas library to write the references into excel.
from refextract import extract_references_from_url
import pandas as pd
references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')
print(references)
# convert to pandas dataframe
dfref = pd.DataFrame(references)
# write dataframe into excel
dfref.to_excel('./refs.xlsx')

You should have a look at xlsxwriter, a module for creating excel files.
Your code could then look like this:
import xlsxwriter
from refextract import extract_references_from_url
workbook = xlsxwriter.Workbook('References.xlsx')
worksheet = workbook.add_worksheet()
references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')
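row = 0
col = 0
# write one reference per row (assuming references is a list); each entry is stored as text
for reference in references:
    worksheet.write(row, col, str(reference))
    row += 1
workbook.close()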
(modified based upon https://xlsxwriter.readthedocs.io/tutorial01.html)

After going through the documentation of refextract here, I found that your variable references is a dictionary. To write such a dictionary to an Excel file, you can use pandas as follows:
import pandas as pd
# create a pandas dataframe using a dictionary
df = pd.DataFrame(data=references, index=[0])
# Take transpose of the dataframe
df = (df.T)
# write the dictionary to an excel file
df.to_excel('extracted_references.xlsx')
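The answers above disagree on the shape of references. Here is a minimal sketch, assuming refextract returns a list of dictionaries whose values are lists of strings. The sample data is made up for illustration, and references_to_frame and save_references are hypothetical helper names. Joining each list into one string gives one readable cell per field.

```python
import pandas as pd

# Made-up sample in the assumed shape: a list of dicts whose values are lists.
sample_references = [
    {"author": ["A. Author"], "year": ["2012"], "raw_ref": ["[1] A. Author, 2012."]},
    {"author": ["B. Writer", "C. Editor"], "raw_ref": ["[2] B. Writer and C. Editor."]},
]


def references_to_frame(references):
    """Build one row per reference, joining list values into a single string."""
    rows = []
    for reference in references:
        row = {}
        for key, value in reference.items():
            if isinstance(value, (list, tuple)):
                row[key] = "; ".join(str(item) for item in value)
            else:
                row[key] = value
        rows.append(row)
    # Fields missing from a reference become NaN, which is written as an empty cell.
    return pd.DataFrame(rows)


def save_references(frame, path):
    """Write the frame to an .xlsx file (needs openpyxl or xlsxwriter installed)."""
    frame.to_excel(path, index=False)


frame = references_to_frame(sample_references)
```

Calling save_references(frame, 'refs.xlsx') would then write the sheet. If the real output has a different shape, the joining step needs adjusting.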

Related

Treat everything as raw string (even formulas) when reading into pandas from excel

I am handling text responses from surveys, and it is common to have responses that start with -, for example: -I am sad today.
Excel interprets such a cell as a formula and shows #NAME?
So when I import the Excel file into pandas using read_excel, the cell shows up as NaN.
Is there any method to force Excel to keep these cells as raw strings instead of interpreting them as formulas?
I created a VBA macro that sets the entire column to text and clicks through all the cells in the column, which is slow when there are ten thousand or more rows.
I was hoping to do this at the Python level instead. Any ideas?
I hope this works for you: use openpyxl to extract the Excel data and then convert it into a pandas DataFrame.
from openpyxl import load_workbook
import pandas as pd
# load the workbook and take its active worksheet
ws = load_workbook(filename='./formula_contains_raw.xlsx').active
# ws.values yields one tuple of cell values per row; the first row holds the headers
rows = list(ws.values)
df = pd.DataFrame(rows[1:], columns=rows[0])
df.head()
It works for me using a CSV instead of excel file.
In the CSV file (opened in excel) I need to select the option Formulas/Show Formulas, then save the file.
pd.read_csv('draft.csv')
Output:
Col1
0 hello
1 =-hello
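The CSV route can also be sketched without the Show Formulas step, assuming the survey responses are available as CSV text (the inline sample below is made up). Passing dtype=str and keep_default_na=False to read_csv asks pandas to leave every cell as the literal string, so a leading - or = is kept as-is.

```python
import io
import pandas as pd

# Made-up CSV content standing in for an exported survey file.
csv_text = "response\n-I am sad today\n=-hello\nNA\n"

# dtype=str keeps every cell as text; keep_default_na=False stops pandas
# from turning strings such as "NA" into missing values.
df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)
responses = df["response"].tolist()
```

This only helps when the data reaches pandas as CSV. What an .xlsx file holds for a cell that Excel already treated as a formula is a separate question.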

Inserting Data into an Excel file using Pandas - Python

I have an excel file that contains the names of 60 datasets.
I'm trying to write a piece of code that "enters" the Excel file, accesses a specific dataset (whose name is in the Excel file), gathers and analyses some data and finally, creates a new column in the Excel file and inserts the information gathered beforehand.
I can do most of it, except for the part of adding a new column and entering the data.
I was trying to do something like this:
path_data = **the path to the excel file**
recap = pd.read_excel(os.path.join(path_data,'My_Excel.xlsx')) # where I access the Excel file
recap['New information Column'] = Some Value
Is this the correct way of doing this? And if not, can someone suggest a better way (one that works, ehehe)?
Thank you a lot!
You can import the excel file into python using pandas.
import pandas as pd
df = pd.read_excel (r'Path\Filename.xlsx')
print (df)
If you have many sheets, then you could do this:
import pandas as pd
df = pd.read_excel (r'Path\Filename.xlsx', sheet_name='sheetname')
print (df)
To add a new column you could do the following:
df['name of the new column'] = 'things to add'
Then when you're ready, you can export it as xlsx:
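# pandas needs openpyxl installed to write .xlsx files
# index=False keeps the DataFrame's row index out of the sheet
df.to_excel(r'Path\filename.xlsx', index=False)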
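Putting the steps together, here is a minimal sketch of filling a new column from a per-dataset computation. The column name Dataset, the sample names and the analyse function are all hypothetical stand-ins for the asker's own data and analysis. The frame is built in memory here, where the real code would call read_excel.

```python
import pandas as pd

# Stand-in for pd.read_excel(...): one row per dataset name.
recap = pd.DataFrame({"Dataset": ["alpha", "beta", "gamma"]})


def analyse(dataset_name):
    """Hypothetical analysis step; here it only returns the name's length."""
    return len(dataset_name)


# map() calls analyse once per row and aligns the results with the rows.
recap["New information Column"] = recap["Dataset"].map(analyse)

# To save the result, something like the following would write it back:
# recap.to_excel("My_Excel.xlsx", index=False)
```

Assigning a single value, as in the question, also works. map is only needed when each row gets its own result.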

Pandas Create Excel with Table formatted as a Table

I have a .csv file that I am converting to .xlsx using the following Python script. To make this useful, I need to create a table within the Excel file that holds the data (actually formatted as a table, as with Insert > Table). Is this possible within Python? I feel like it should be relatively easy, but I can't find anything on the internet.
The idea here is that the python takes the csv file, converts it to xlsx with a table embedded on sheet1, and then moves it to the correct folder.
import os
import shutil
import pandas as pd
src = r"C:\Users\xxxx\Python\filename.csv"
src2 = r"C:\Users\xxxx\Python\filename.xlsx"
read_file = pd.read_csv(src)  # read the CSV
read_file.to_excel(src2, index=None, header=True)  # convert to Excel
dest = r"C:\Users\xxxx\Python\repository"
destination = shutil.copy2(src2, dest)
Edit: I got sidetracked by the original MWE.
This should work, using xlsxwriter:
import pandas as pd
import xlsxwriter
#Dummy data
my_data={"list1":[1,2,3,4], "list2":"a b c d".split()}
df1=pd.DataFrame(my_data)
df1.to_csv("myfile.csv", index=False)
df2=pd.read_csv("myfile.csv")
#List of column name dictionaries
headers=[{"header" : i} for i in list(df2.columns)]
#Create and propagate workbook
workbook=xlsxwriter.Workbook('output.xlsx')
worksheet1=workbook.add_worksheet()
worksheet1.add_table(0, 0, len(df2), len(df2.columns)-1, {"columns":headers, "data":df2.values.tolist()})
workbook.close()
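A variant of the same idea lets pandas write the cells and then lays the table over them. This is a sketch only: write_as_table and table_options are hypothetical helper names, and it assumes the xlsxwriter package is installed so that pandas can use it as the ExcelWriter engine.

```python
import pandas as pd


def table_options(df):
    """Return the add_table options: one header dict per DataFrame column."""
    return {"columns": [{"header": str(name)} for name in df.columns]}


def write_as_table(df, path, sheet_name="Sheet1"):
    """Write df to path and wrap the written cells in an Excel table."""
    # engine="xlsxwriter" requires the xlsxwriter package to be installed.
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        # Leave row 0 free: add_table writes the header row itself.
        df.to_excel(writer, sheet_name=sheet_name, startrow=1, header=False, index=False)
        worksheet = writer.sheets[sheet_name]
        n_rows, n_cols = df.shape
        # The range covers the header row plus n_rows data rows.
        worksheet.add_table(0, 0, n_rows, n_cols - 1, table_options(df))


df = pd.DataFrame({"list1": [1, 2, 3, 4], "list2": list("abcd")})
options = table_options(df)
```

Calling write_as_table(df, 'output.xlsx') would produce the file. Compared with passing the data to add_table directly, this keeps pandas in charge of how each value is written.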

Find words with underscores in excel worksheet by using Python

Is it possible to search/ parse through two columns in excel (let's say columns C & D) and find only the fields with underscores by using python?
Maybe a code like this? Not too sure..:
Import xl.range
Columns = workbook.get("C:D"))
Extract = re.findall(r'\(._?)\', str(Columns)
Please let me know if my code can be further improved on! :)
For those who need an answer, I solved it using this code:
import os
import re
from os.path import join
from openpyxl import load_workbook
dict_folder = "C:/...../abc"
fieldset = set()  # collects the matched fields across all workbooks
for file in os.listdir(dict_folder):
    if file.endswith(".xlsx"):
        wb1 = load_workbook(join(dict_folder, file), data_only=True)
        ws = wb1.active
        # ws["C":"D"] yields the cells of column C, then those of column D
        for rowofcellobj in ws["C":"D"]:
            for cellobj in rowofcellobj:
                data = re.findall(r"\w+_.*?\w+", str(cellobj.value))
                if data != []:
                    fields = data[0]
                    fieldset.add(fields)
Yes, it is indeed possible. The main library you'd reach for is pandas. With it installed (instructions here), after installing Python of course, you could do something along the lines of:
import pandas as pd
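# Read only Excel columns C and D into a pandas.DataFrame
sheet_path = 'C:\\Path\\to\\excel\\sheet.xlsx'
df = pd.read_excel(sheet_path, usecols='C:D')
# Column labels come from the header row, so address the two columns by position;
# astype(str) lets empty or numeric cells be searched without errors
col_c, col_d = df.iloc[:, 0].astype(str), df.iloc[:, 1].astype(str)
# Keep the rows where either column contains an underscore
underscored = df[col_c.str.contains('_') | col_d.str.contains('_')]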
And that'd do it for columns C and D within your worksheet.
pandas has extensive documentation, but for what you're looking for, the read_excel function documentation (which has examples) will suffice, along with some more material on Python itself, if needed.
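To pull out the underscored words themselves rather than whole rows, a regular expression can be applied cell by cell. This is a minimal sketch: the column names and sample values are made up, and fields_with_underscores is a hypothetical helper, not part of pandas.

```python
import re
import pandas as pd

# Made-up stand-in for the two worksheet columns.
df = pd.DataFrame({
    "first": ["user_name is set", "nothing here", None],
    "second": ["plain", "see order_id and user_name", "x"],
})


def fields_with_underscores(frame, columns):
    """Collect every word containing an underscore from the given columns."""
    found = set()
    for column in columns:
        # dropna() skips empty cells; str() guards against numeric cells.
        for value in frame[column].dropna():
            found.update(re.findall(r"\w*_\w*", str(value)))
    return found


fields = fields_with_underscores(df, ["first", "second"])
```

A set is used so that a field appearing in several cells is reported once.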

Import data tables in Python

I am new to Python, coming from MATLAB. In MATLAB, I used to create a table variable (copied from Excel), save it as a .mat file, and import it whenever I needed the data using:
A = importdata('Filename.mat');
[Filename is 38x5 table, see the attached photo]
Is there a way I can do this in Python? I have to work with about 35 such tables, and loading them from Excel every time is not ideal.
In order to import excel tables into your python environment you have to install pandas.
Check out the detailed guideline.
import pandas as pd
xl = pd.ExcelFile('myFile.xlsx')
I hope this helps.
Use pandas:
import pandas as pd
dataframe = pd.read_csv("your_data.csv")
dataframe.head() # prints out first rows of your data
Or from Excel:
dataframe = pd.read_excel('your_excel_sheet.xlsx')
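For the save-once, load-many-times workflow described in the question, one possible analogue of a .mat file is a pickled DataFrame. This is a sketch under that assumption, using a made-up table and a temporary file path.

```python
import os
import tempfile
import pandas as pd

# Made-up table standing in for one of the Excel sheets.
table = pd.DataFrame({"x": [1.0, 2.0, 3.0], "label": ["a", "b", "c"]})

# Save once (after the first read from Excel), then reload from the pickle later.
path = os.path.join(tempfile.mkdtemp(), "table.pkl")
table.to_pickle(path)
loaded = pd.read_pickle(path)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.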
