I have a CSV file as a source dataset. In the table there is a column that I would like to loop over with Python, extracting data from the string in each cell. For example, a cell contains:
"Quantity changed by 10, Price changed by 90."
I would like to use Python to extract "Quantity, Price" and "10, 90" and create a new table with those properties and values, then use a regular Power BI visual to build the visuals instead of a Python visual. How should I do that, and is it possible at all?
Edit: Due to all the confusion, I'm adding a screenshot of the column I'm working with.
I would like to go through all rows in the Properties column, get the data in each cell, and extract it into a new table. For example, in this case, the new table would look like:
Properties | Value
Unconnected Height | -2800
Area | -39.883
Regarding code, I have nothing yet, just the base Python snippet PowerBI created:
# The following code to create a dataframe and remove duplicated rows is always executed and acts as a preamble for your script:
# dataset = pandas.DataFrame(Properties)
# dataset = dataset.drop_duplicates()
# Paste or type your script code here:
So Power BI uses pandas and matplotlib to get and plot the data, but I only want to get the data and output it as a new table.
If Python is not possible, is there a way to do it with Power Query?
Extracting the data from the column is fairly easy if the data follows the same format:
s = "Quantity changed by 10, Price changed by 90."
l, sep, r = s.partition(',')
if sep:
quantity = int(l.split()[-1])
price = int(r.split()[-1][:-1])
print(quantity, price)
#10 90
The string contained in the cell is partitioned at the comma to separate the quantity and price phrases, then each phrase is processed to extract its value.
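To produce a whole new table inside Power BI rather than a single pair of values, the same idea can be applied to every row of the Properties column in Power Query's Run Python script step. This is only a sketch: it assumes the incoming dataframe is called dataset (as in the generated preamble) and that every phrase follows the "<Property> changed by <number>" pattern; adjust the pattern to the actual wording in your column.

import re
import pandas as pd

# assumed pattern: "<Property> changed by <number>"; tweak it for your real data
pattern = re.compile(r'([^,]+?)\s+changed by\s+(-?\d+(?:\.\d+)?)')

rows = []
for cell in dataset['Properties'].dropna():
    for prop, value in pattern.findall(str(cell)):
        rows.append({'Properties': prop.strip(), 'Value': float(value)})

result = pd.DataFrame(rows)
# Power Query's "Run Python script" step exposes dataframes such as result
# as tables you can load and use with ordinary Power BI visuals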
Related
In my GUI, made with Qt Designer, I have a table with 6 columns and 5 rows (headers not counted). The first column will hold dates in the format "DD/MM/YY". How can I read those dates and save them to a variable for later use in a PDF report? The dates will not be used in any operations, just copied from the table and passed to the function that builds the PDF report, so they can stay as str.
I tried this:
T = []
for i in range(self.ui.table_Level_N.rowCount()):
    T.append(self.ui.table_Level_N.item(i, 0))
but got some strange text:
<PyQt5.QtWidgets.QTableWidgetItem object at 0x0000019A24D903A0>
I assumed that it read the dates, but not in the right format. table_Level_N is my table.
You need to get the text from the QTableWidgetItem and append it to the list. Try this:
T = []
for i in range(self.ui.table_Level_N.rowCount()):
    item = self.ui.table_Level_N.item(i, 0)
    T.append(item.text())
This will give you the text of the cell in the format you want, which you can then use in your PDF report.
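One caveat: QTableWidget.item() returns None for cells that were never given an item, so if any first-column cell could be empty, a small guard avoids an AttributeError:

T = []
for i in range(self.ui.table_Level_N.rowCount()):
    item = self.ui.table_Level_N.item(i, 0)
    # item() is None for cells that were never set
    T.append(item.text() if item is not None else "")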
I'm kind of new to pandas and stuck on how to refer to a column that has the same name under different merged (multi-level) header columns. Here is an example of the problem: I want to get the Worker data for company C, but if I define this Excel sheet as df and write
dfcompanyAworker = df[Worker]
it won't work.
Is there a specific way to refer to a column like this when the column names are identical?
Here's the table:
https://i.stack.imgur.com/8Y6gp.png
Thanks!
First read the dataset that will be used, then set its shape. For example, I use the Excel format:
import pandas as pd

dfcompanyAworker = pd.read_excel('Worker', skiprows=1, header=[1, 2], index_col=0, skipfooter=7)
dfcompanyAworker
where:
skiprows=1 ignores the title row in the data
header=[1, 2] is a list because we have multi-level columns, namely the Category (Company) and the other data
index_col=0 makes the Date column the index for easier processing and analysis
skipfooter=7 ignores the footer rows at the end of the data
You can follow or try the steps as I did; a selection example is sketched below.
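Once the header is read as a two-level MultiIndex, you select a column with a tuple of (company, column) labels. This is only a sketch: 'Company C' and 'Worker' stand in for whatever the merged headers in your sheet actually say.

# select the Worker column that sits under the Company C header
company_c_worker = dfcompanyAworker[('Company C', 'Worker')]

# or take the Worker column from every company at once
all_workers = dfcompanyAworker.xs('Worker', axis=1, level=1)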
I have a codebook PDF file which consists of various tables describing the variables used in one of the datasets I am working with. Since the actual data consists of values that I need to look up, I will need to create multiple CSV output files from all the tables present in this codebook.
For instance, on page 15 of this PDF file, we have a table as below, from which I need to extract a pandas dataframe so that I can save it as a CSV file for later use. I do not care about the "Totals" in these tables, since I only need the value and the label fields.
I tried to solve this problem by using camelot library in Python -
import camelot
# try extracting table from 1 of the pages
tables = camelot.read_pdf('/Users/Downloads/TEDS-A-2018-DS0001-info-codebook_v1.pdf', pages = '12')
# check data
>>> type(tables)
<class 'camelot.core.TableList'>
>>> len(tables)
0
I am not sure why I do not get any tables in the output. Any help is highly appreciated.
Update - I have also tried the tabula library; however, I only get the odd rows of a table, not the even ones. Here is my code for this trial -
import tabula as tb

pdf_loc = 'csvs/TEDS-A-2018-DS0001-info-codebook_v1.pdf'
list_of_dataframs = tb.read_pdf(input_path=pdf_loc, pages='all')
number_of_dfs = len(list_of_dataframs)
print('first df in list')
list_of_dataframs[0]
Here is the output -
The PDF codebook can be found here
You can use Tabula and try a few of its parameters.
In your case, the structure of the table is similar throughout the PDF, so we can use the columns parameter of Tabula to define our own column structure. If we don't pass this parameter, tabula tries to guess the column structure on its own, and it sometimes fails to identify the right table structure.
import tabula

tables = tabula.read_pdf(filename, area=(0, 0, 800, 800), pages=15, columns=(95, 410, 490), pandas_options={'header': None})
After using that parameter I get the following output for page 15 of the PDF:
We can use this for all the pages, and of course we can also do pre-processing to remove unnecessary rows so that you end up with clean tabular data; a sketch of that is below. I'd be happy to help further, and I hope this works for you.
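As a rough sketch of processing the whole codebook in one go (the area, the column offsets, and the assumption that the "Total" text sits in the first extracted column are all things to adapt to the real file):

import pandas as pd
import tabula

# apply the same column layout to every page
dfs = tabula.read_pdf(filename, area=(0, 0, 800, 800), pages='all',
                      columns=(95, 410, 490), pandas_options={'header': None})

combined = pd.concat(dfs, ignore_index=True)
# keep only value/label rows, dropping the "Total" lines that are not needed
combined = combined[~combined[0].astype(str).str.contains('Total', na=False)]
combined.to_csv('codebook_tables.csv', index=False)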
I have implemented a program in VBA for excel to generate automatic communications based on user inputs (selections of cells).
This macro makes extensive use of VBA's ListObject feature,
i.e.
defining a table (list object)
Dim ClsSht As Worksheet
Set ClsSht = ThisWorkbook.Sheets("paragraph texts")
Dim ClsTbl As ListObject
Set ClsTbl = ClsSht.ListObjects(1)
accessing the table in the code in a very logical manner:
ClsTbl being now the table where I want to pick up data.
myvariable= ClsTbl.listcolumns("D1").databodyrange.item(34).value
Which means myvariable holds item (row) 34 of the data body of column "D1" of the table ClsTbl.
I decided to learn Python to "translate" all that code into Python and make a Django-based program accessible to anyone.
I am a beginner in Python and I am wondering what the Python equivalent of VBA's ListObject would be. This decision will shape my whole program from the beginning, and I am hesitating a lot over which structure to pick.
The main idea is to have a way to access table data in a readable way,
i.e. give me the value of column "text" where column "chapter" is 3 and column "paragraph" is "2". The values are unique, meaning there is only one value in the "text" column where that occurs.
Some observations:
I know everything can be done with lists in Python (lists can contain lists that contain lists...), but this is terrible for readability: mylist1[2][3] (assuming, for instance, that every row is a list of values and the whole table a list of those row lists).
I don't consider building a database an option. There are multiple relatively small tables (from 10 to 500 rows and from 3 to 15 columns) that are related, but not in a database manner. A database would force me to learn yet another language such as SQL, and I have more than enough with Python and Django.
The user modifies the structure of many tables (chapters being merged or split).
The data is 100% strings. The only integers are numbers used to sort the text. I don't perform any mathematical operations on the values; I simply join pieces of text together and make replacements in texts.
The tables will be loaded into Python as CSV text files.
Please let me know if anything in the question is unclear and I will complete it.
Would it be necessary to work with numpy? pandas?
i.e. give me the value of a cell.
A DataFrame using pandas should provide everything you need, i.e. conversion to strings, manipulation, import and export. As a start, try:
import pandas as pd
df = pd.read_csv('your_file.csv')
print(df)
print(df['text'])
The entries of the first row will be converted to labels of the DataFrame columns.
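For the lookup you describe (the value of "text" where "chapter" is 3 and "paragraph" is "2"), a boolean mask reads almost like the sentence itself. This is a sketch that assumes those column names exist in your CSV and that chapter is numeric while paragraph is a string:

match = df[(df['chapter'] == 3) & (df['paragraph'] == '2')]
value = match['text'].iloc[0]  # you said the combination is unique, so take the single hit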
I'm having a go at using Numpy instead of Matlab, but I'm relatively new to Python.
My current challenge is importing the data from multiple files in a sensible way so that I can use and plot it. The data is organized in columns (Temperature, Pressure, Time, etc., each file being a measurement period), and I decided pandas was probably the best way to import it. I was thinking of using a top-level descriptor for each file and sub-descriptors for each column. I thought of doing it something like this:
Reading Multiple CSV Files into Python Pandas Dataframe
The problem is I'd like to retain and use some of the data in the header (for plotting, for instance). There are no column titles, just general info on the data measurements, something like this:
Flight ID: XXXXXX
Date: 01-27-10 Time: 5:25:19
OWNER
Release Point: xx.304N xx.060E 11 m
Serial Number xxxxxx
Surface Data: 985.1 mb 1.0 C 100% 1.0 m/s # 308 deg.
I really don't know how to extract and store the data in a way that makes sense when combined with the data frame. Thought of perhaps a dictionary, but I'm not sure how to split the data efficiently since there's no consistent divider. Any ideas?
Looks like somebody is working with radiosondes...
When I pull in my radiosonde data I usually put it in a multi-level indexed dataframe. The levels could be of various forms and orders, but something like FLIGHT_NUM, DATE, ALTITUDE, etc. would make sense. Also, when working with sonde data I too want some additional information that does not necessarily need to be stored within the dataframe, so I store that as additional attributes. If I were to parse your file and then store it I would do something along the lines of this (yes, there are modifications that can be made to "improve" this):
import pandas as pd
with open("filename.csv",'r') as data:
header = data.read().split('\n')[:5] # change to match number of your header rows
data = pd.read_csv(data, skiprows=6, skipinitialspace=True, na_values=[-999,'Infinity','-Infinity'])
# now you can parse your header to get out the necessary information
# continue until you have all the header info you want/need; e.g.
flight = header[0].split(': ')[1]
date = header[1].split(': ')[1].split('')[0]
time = header[1].split(': ')[2]
# a lot of the header information will get stored as metadata for me.
# most likely you want more than flight number and date in your metadata, but you get the point.
data.metadata = {'flight':flight,
'date':date}
I presume you have a date/time column (call it "dates" here) within your file, so you can use that to re-index your dataframe. If you choose to use different variables within your multi-level index then the same method applies.
new_index = [(data.metadata['flight'],r) for r in data.dates]
data.index = pd.MultiIndex.from_tuples(new_index)
You now have a multi-level indexed dataframe.
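With that index in place, one flight's rows can be pulled straight out of the combined dataframe (a sketch, assuming the flight number is the first index level as built above):

# all rows belonging to a single flight
one_flight = data.loc[data.metadata['flight']]

# the same selection, spelled out for both index levels
one_flight = data.loc[(data.metadata['flight'], slice(None)), :]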
Now, regarding your "metadata". EdChum makes an excellent point that if you copy "data" you will NOT copy over the metadata dictionary. Also, if you save "data" to a file via data.to_pickle you will lose your metadata (more on this later). If you want to keep your metadata you have a couple of options.
1. Save the data on a flight-by-flight basis. This will allow you to store metadata for each individual flight's file.
2. Assuming you want to have multiple flights within one saved file: you can add additional columns within your dataframe that hold that information (i.e. another column for flight number, another column for surface temperature, etc.), though this will increase the size of your saved file.
3. Assuming you want to have multiple flights within one saved file (as in option 2): you can make your metadata dictionary "keyed" by flight number, e.g.
data.metadata = {FLIGHT1: {'date': date},
                 FLIGHT2: {'date': date}}
Now to store the metadata. Check out my IO class for storing additional attributes within an h5 file, posted here.
Your question was quite broad, so you got a broad answer. I hope this was helpful.