I analyzed several .txt files using Python. Each file produced a set of results, which you can see in the attached image. I need to write all of the results into a single Excel file, and I also want each .txt file's name to appear alongside its results in the sheet. Can anyone help with this?
I presume that you are using pandas. pandas has a built-in method for exporting a DataFrame to Excel:
df_excel.to_excel("output.xlsx")
If you want to add the name of the text file, simply add it as an extra column on the corresponding rows.
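A minimal sketch of that idea, assuming each file's results are already in their own DataFrame. The helper name combine_results and the column name source_file are made up for illustration, not part of pandas:

```python
from pathlib import Path

import pandas as pd


def combine_results(results_by_file):
    """Stack per-file result frames and tag each row with its source file name.

    results_by_file maps a file path (str) to the DataFrame computed from it.
    """
    frames = []
    for filename, df in results_by_file.items():
        tagged = df.copy()  # leave the caller's frame untouched
        # "source_file" is a hypothetical column name; pick whatever suits you
        tagged.insert(0, "source_file", Path(filename).name)
        frames.append(tagged)
    return pd.concat(frames, ignore_index=True)
```

You could then call combine_results(results).to_excel("output.xlsx", index=False). Writing .xlsx files needs an Excel engine such as openpyxl to be installed.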
I am trying to develop a tool in Python in which user input strings are stored in an Excel .xlsx file. My script is supposed to read the cell contents of the Excel file and use them in Matplotlib plots (as the graph title, the Y label, etc.). However, reading cells that contain subscript or superscript strings (such as scientific notation or units) fails. I tried different approaches such as pandas, win32com and xlrd, but none of them worked because they read plain text only. How can this be achieved? If it can't be done with Excel, is there any other input file format that preserves the formatting of the strings?
I need to extract data from tables in multiple PDFs using Python. I have tested both camelot and tabula, but neither of them extracts the data accurately. The tables have some merged cells, cells with multiple lines of information, etc., so both libraries get confused. Is there a good way of approaching this issue?
There may be something wrong with the underlying structure of the table encoded in the PDF if that's the case.
You could use OCR and then do some string/regex manipulation to extract the column data from each row. github.com/cseas/ocr-table seems to work. Look at its input.pdf and output.txt to judge whether it fits your situation.
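As a rough illustration of the string/regex step, here is a sketch that splits OCR text into columns. It assumes the OCR output keeps columns apart by two or more spaces, which will not hold for every document. The function names are invented for this example, and nothing here depends on the format that ocr-table itself produces:

```python
import re


def split_row(line):
    """Split one OCR text line into cells on runs of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_table(ocr_text):
    """Turn multi-line OCR text into a list of rows, skipping blank lines."""
    return [split_row(line) for line in ocr_text.splitlines() if line.strip()]
```

Single spaces inside a cell (for example "Steel bolt") survive, because only wider gaps are treated as column boundaries.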
I have an xlsx that has two sheets: one has some data in G1:O25 (let's call this "data") and one has some images inserted into cells in G1:O25 (let's call this one "images").
My goal is to use Python to filter the data using the images. I want a popup that shows me the image from cell G1 along with a checkbox or something similar to include/exclude this data point. Then I want to create a new sheet ("filtered data") with the included data points.
I'm new to Python so bear with me, but I've figured out a couple things from searching:
I can load the data into a list.
xlsx files are actually zip files, so I can use zipfile and matplotlib to read the images from the subdirectories and display them.
It shouldn't be hard to add the checkbox thing and do the filtering.
The issues I am having:
Since openpyxl does not preserve the images when reading/writing a workbook, I would lose the images when I append my "filtered data" sheet. Maybe there is a workaround like saving to a separate sheet and using COM?
Although I can load the images using the zip method, I lose information on which cell they are associated with. They are in a logical order inside the xlsx/zip file, but sometimes there will be a missing image (say cell K11 does not have an image), so I cannot just assume that image1.jpeg corresponds to cell G1 and so on. I am not sure where in the Excel file I can find the information associating images with their respective cells in the spreadsheet.
Thank you in advance
As per "how to get the relative position of shapes within a worksheet", in the Excel object model you get the cell adjacent to an image from its .TopLeftCell property:
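import win32com.client

# Requires Windows with Excel installed (pywin32 drives Excel over COM)
x = win32com.client.Dispatch("Excel.Application")
wb = x.Workbooks.Open("<path_to.xlsx>")
ws = wb.Sheets("Sheet1")
for i in ws.Shapes:
    # TopLeftCell is the cell under the shape's top-left corner
    print(i.TopLeftCell.Address)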
prints:
$B$2
$B$5
$D$3
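If COM is not an option, the same association can usually be read straight from the zip. In a typical .xlsx, each drawing part (for example xl/drawings/drawing1.xml) holds one anchor element per picture, with a zero-based column and row for its top-left corner, and a matching .rels file maps the picture's relationship id to a file under xl/media. The following is only a sketch under those assumptions: the function names are made up, it handles a single drawing part, and it assumes you already know which drawing belongs to the "images" sheet.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
}
R_EMBED = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}embed"
REL_TAG = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def column_letter(index):
    """Convert a zero-based column index to Excel letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def image_anchors(xlsx, drawing="xl/drawings/drawing1.xml"):
    """Map each picture's top-left cell address to its media path in the zip."""
    folder, name = posixpath.split(drawing)
    rels_path = f"{folder}/_rels/{name}.rels"
    with zipfile.ZipFile(xlsx) as zf:
        rels = ET.fromstring(zf.read(rels_path))
        root = ET.fromstring(zf.read(drawing))
    # relationship id (e.g. rId1) -> target such as ../media/image1.jpeg
    targets = {rel.get("Id"): rel.get("Target") for rel in rels.iter(REL_TAG)}
    anchors = {}
    for tag in ("xdr:twoCellAnchor", "xdr:oneCellAnchor"):
        for anchor in root.findall(tag, NS):
            start = anchor.find("xdr:from", NS)
            blip = anchor.find(".//a:blip", NS)
            if start is None or blip is None:
                continue  # not a picture (e.g. a chart or plain shape)
            target = targets.get(blip.get(R_EMBED))
            if target is None:
                continue
            col = int(start.find("xdr:col", NS).text)
            row = int(start.find("xdr:row", NS).text)
            media = posixpath.normpath(posixpath.join(folder, target))
            anchors[f"{column_letter(col)}{row + 1}"] = media
    return anchors
```

Calling image_anchors("<path_to.xlsx>") returns a dict keyed by cell address, so a cell without an image is simply absent from the result instead of shifting the numbering.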
First of all, thank you for taking the time to help me!
I am currently working on a machine learning problem in Python where I have to extract several specific sections from a large text file in order to train a classification algorithm. The texts then have to be saved in CSV format together with their corresponding ID-num and label/category from an Excel sheet.
The CSV file should look like this: https://imgur.com/a/3cntJlL
The excel sheet contains a lot of columns where only the ID-number and label columns should be used.
Here you can see some of the excel sheet: https://imgur.com/a/AZlWdeE
IDNUM column is the ID-number which connects the excel sheet to a specific text.
The AType1 column is the corresponding label which also has to be saved.
Here you can see some of one of the text files: https://imgur.com/a/Yns8HAC
The text which should be extracted goes from the word "Text:" to where there are two "*" (stars) right after each other in two lines. The ID-num is placed above the section, as the picture shows.
I have been trying to split the document, but I can't seem to figure out how to make the CSV file containing information from both the Excel sheet and the text file. Ideally, a single script would do this in one run and could then loop through several large text files.
So, my problem is to create a script which can:
Match excel cell content (ID-number) with text
Extract a section of the text between two delimiters ("Text:" and "* \n *")
Save the text, ID-number and label in a CSV file.
I hope there is someone who can help me. I am on the beginner level of using python so making this kind of script is pretty challenging.
Looking forward to hearing your ideas!
// Rasmus
It would be good for you to familiarize yourself with the pandas library.
Pandas (https://pandas.pydata.org/docs/) will allow you to read a CSV file into what is called a dataframe and manipulate the data by column name and rows. You can also put your results into a pandas dataframe and write the results to a CSV file.
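To make that concrete for the question above, here is a rough sketch of the three steps. The text files are only shown as a screenshot, so the layout is an assumption: each section is taken to have a line such as "IDNUM: 12345" somewhere above "Text:", and to end at two consecutive lines that each contain a single "*". The column names IDNUM and AType1 come from the question; the function names and the ID line format are hypothetical and will need adjusting to the real files.

```python
import re

import pandas as pd

# Assumed layout: "IDNUM: <digits>" ... "Text:" <body> then two lines with one "*" each
SECTION_RE = re.compile(
    r"IDNUM:\s*(\d+).*?Text:(.*?)\n[ \t]*\*[ \t]*\n[ \t]*\*",
    re.DOTALL,
)


def extract_sections(raw_text):
    """Return one dict per section with its ID number and the extracted text."""
    return [
        {"IDNUM": int(idnum), "text": body.strip()}
        for idnum, body in SECTION_RE.findall(raw_text)
    ]


def build_dataset(sections, labels):
    """Attach the AType1 label from the Excel sheet to every extracted section."""
    texts = pd.DataFrame(sections, columns=["IDNUM", "text"])
    # Keep only the two sheet columns that are needed, then match on the ID number
    merged = texts.merge(labels[["IDNUM", "AType1"]], on="IDNUM", how="inner")
    return merged[["IDNUM", "AType1", "text"]]
```

The labels frame would typically come from pd.read_excel on the sheet, and build_dataset(...).to_csv("dataset.csv", index=False) writes the result. To handle several text files, call extract_sections on each file's contents and concatenate the lists before merging.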
I have a scanned PDF which has some random data in a tabular format and want to copy that into an Excel sheet.
I have played around with digital PDFs and used 'tabula' to extract tables, but scanned PDFs require OCR (from what I've seen on Google).
I know there is an OCR engine involved (Tesseract), but I do not know which approach I should take to solve the problem.
Take a look at Tesseract's TSV (Tab Separated Value) output format and see if Excel can read or import it. Some transformation may be needed to get it into a format consumable by Excel.
https://digi.bib.uni-mannheim.de/tesseract/manuals/tesseract.1.html
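One possible transformation, sketched with the standard library only. Tesseract's TSV has one record per detected element, with columns including page_num, block_num, par_num, line_num, left and text. The sketch groups the words of each text line and writes them out as one CSV row. Treating every word as its own cell is a simplification (a multi-word cell ends up split), and the function names are made up for the example:

```python
import csv
import io
from collections import defaultdict


def tsv_to_rows(tsv_text):
    """Group Tesseract TSV word records into rows, one per detected text line."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = defaultdict(list)
    for rec in reader:
        text = (rec.get("text") or "").strip()
        if not text:
            continue  # page, block, paragraph and line records carry no text
        key = (
            int(rec["page_num"]),
            int(rec["block_num"]),
            int(rec["par_num"]),
            int(rec["line_num"]),
        )
        lines[key].append((int(rec["left"]), text))
    # Order lines top to bottom by their ids, and words left to right by x position
    return [[word for _, word in sorted(words)] for _, words in sorted(lines.items())]


def rows_to_csv(rows):
    """Serialize the rows as CSV text that Excel can open."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

The TSV itself can be produced with a command along the lines of tesseract page.png page tsv, which writes page.tsv. Scanned PDFs have to be converted to images first.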