How to loop in tabula-py data format in python

I want to know how to extract a particular table column from a PDF file in Python.
My code so far:
import tabula.io as tb
from tabula.io import read_pdf
dfs = tb.read_pdf(pdf_path, pages='all')
print(len(dfs))  # displays 73
I can access an individual table column with print(dfs[2]['Section ID']).
I want to know how to search for a particular column across all the data frames using a for loop.
I want to do something like this:
for i in range(len(dfs)):
    if (dfs[i][2]) == 'Section ID '  # this gives invalid syntax
        print dfs[i]

If only one dataframe has a Section ID column (or you are interested only in the first dataframe with this column), you can iterate over the list returned by read_pdf, check for the column with 'Section ID' in df.columns, and break when a match is found.
import tabula.io as tb
df_list = tb.read_pdf(pdf_path, pages='all')
for df in df_list:
    if 'Section ID' in df.columns:
        # First table that has the column: stop searching
        break
else:
    # The loop finished without a break, so no table has the column
    df = None
print(df)
If several dataframes may have the Section ID column, you can filter with a list comprehension to get a list of the dataframes that have it.
dfs_section_id = [df for df in df_list if 'Section ID' in df.columns]
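As a rough illustration of the filtering step, the sketch below uses small hand-made DataFrames in place of the list returned by read_pdf (the table contents and the Title and Page columns are invented for the example). It then stacks the Section ID values from every matching table into one Series, which is one possible way to use the filtered list.

```python
import pandas as pd

# Stand-in for the list returned by tabula's read_pdf (contents are made up)
df_list = [
    pd.DataFrame({"Section ID": ["A1", "A2"], "Title": ["Intro", "Scope"]}),
    pd.DataFrame({"Page": [1, 2]}),
    pd.DataFrame({"Section ID": ["B1"], "Title": ["Summary"]}),
]

# Keep only the tables that have the column
dfs_section_id = [df for df in df_list if "Section ID" in df.columns]

# Stack the column from every matching table into a single Series
section_ids = pd.concat(
    [df["Section ID"] for df in dfs_section_id], ignore_index=True
)
print(section_ids.tolist())
```

Note that pd.concat raises an error on an empty list, so with real PDF output it may be worth checking that dfs_section_id is not empty first.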

Related

Extra column appears when appending selected row from one csv to another in Python

I have this code which appends a column of a csv file as a row to another csv file:
def append_pandas(s,d):
    import pandas as pd
    df = pd.read_csv(s, sep=';', header=None)
    df_t = df.T
    df_t.iloc[0:1, 0:1] = 'Time Point'
    df_t.at[1, 0] = 1
    df_t.columns = df_t.iloc[0]
    df_new = df_t.drop(0)
    pdb = pd.read_csv(d, sep=';')
    newpd = pdb.append(df_new)
    from pandas import DataFrame
    newpd.to_csv(d, sep=';')
The result is supposed to look like this:
Instead, every time the row is appended, an extra "Unnamed" column appears on the left:
How can I fix that?
My csv documents from which I select a column look like this:

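You have to pass index=False to your to_csv() call. By default, to_csv() writes the DataFrame index as an extra first column with no name, and the next read_csv() loads that column as "Unnamed: 0".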
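The effect can be reproduced without touching any files. The following is a minimal sketch that writes a small, made-up DataFrame to an in-memory buffer twice, once with the default settings and once with index=False, and reads it back each time.

```python
import io
import pandas as pd

# Made-up data, only for illustration
df = pd.DataFrame({"Time Point": [1, 2], "Value": [0.5, 0.7]})

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index, sep=";")
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index, sep=";").columns)

# index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, sep=";", index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index, sep=";").columns)

print(cols_with_index)     # ['Unnamed: 0', 'Time Point', 'Value']
print(cols_without_index)  # ['Time Point', 'Value']
```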

How can I filter a csv file based on its columns in python?

I have a CSV file with over 5,000,000 rows of data that looks like this (except that it is in Farsi):
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766140,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,1050000,7266.44,5,Concrete,13890108,5166884645
766146,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,100,700000,4844.29,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
770822,Mobayee,East Azar,Tabriz,Apartment,,Residential,144.5,50,500000,1730.1,5,Concrete,13890114,5166884645
I would like to write code that treats the first row as the header, extracts the rows for two specific cities (Kish and Qeshm), and saves them to a new CSV file. Something like this:
Contract Code,Contract Type,State,City,Property Type,Region,Usage Type,Area,Percentage,Price,Price per m2,Age,Frame Type,Contract Date,Postal Code
765720,Mobayee,East Azar,Kish,Apartment,,Residential,96,100,570000,5937.5,36,Metal,13890107,5169614658
766134,Mobayee,East Azar,Qeshm,Apartment,,Residential,144.5,100,1070000,7404.84,5,Concrete,13890108,5166884645
766147,Mobayee,East Azar,Kish,Apartment,,Residential,144.5,100,1625000,11245.67,5,Concrete,13890108,5166884645
It's worth mentioning that I'm very new to python.
I've written the following block to define the headers, but this is the furthest I've gotten so far.
import pandas as pd
path = '/Users/Desktop/sample.csv'
df = pd.read_csv(path, header=[0])
df.head()

You don't need to use header=... because the default is to treat the first row as the header, so
df = pd.read_csv(path)
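Then, to keep only the rows whose City is Kish or Qeshm:
df2 = df[df['City'].isin(['Kish', 'Qeshm'])]
And you can save it with to_csv(), passing index=False so that the row index is not written as an extra column:
df2.to_csv(another_path, index=False)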
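With around 5,000,000 rows the whole file may not fit comfortably in memory. One option, sketched below under that assumption, is to read the file in chunks with the chunksize argument of read_csv and filter each chunk. The in-memory sample and the chunk size of 2 are only for illustration; a real run would pass the file path and a much larger chunk size.

```python
import io
import pandas as pd

# Tiny stand-in for the large file (a subset of the question's columns)
csv_text = """Contract Code,City,Price
765720,Kish,570000
766134,Qeshm,1070000
766140,Tabriz,1050000
766146,Tabriz,700000
766147,Kish,1625000
"""

cities = ["Kish", "Qeshm"]
filtered_chunks = []
# With chunksize, read_csv yields DataFrames of at most that many rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    matches = chunk[chunk["City"].isin(cities)]
    if not matches.empty:
        filtered_chunks.append(matches)

df2 = pd.concat(filtered_chunks, ignore_index=True)
print(df2)
```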

How to skip rows while importing csv?

How can I skip rows based on a certain value in the first column of the dataset? For example, the first column has some unwanted content in the first few rows, and I want to skip those rows up to a trigger value. Please help me with importing such a CSV in Python.

You can achieve this with the skiprows argument of read_csv, which accepts the number of leading lines to skip, a list of line indices, or a callable.
Here is sample code to start with:
import pandas as pd
df = pd.read_csv('users.csv', skiprows=<the rows you want to skip>)
For a series of CSV files in a folder, you could loop over the files, read each one, and drop the rows that contain the unwanted string. Lastly, concatenate the results into df_overall.
Example:
import glob
import os
import pandas as pd
dir_path = 'Insert your directory path'
column = 0  # with header=None the columns are labelled 0, 1, 2, ...
frames = []
for file_name in glob.glob(os.path.join(dir_path, '*.csv')):
    df = pd.read_csv(file_name, header=None)
    # Drop the rows whose value in this column contains the string
    df = df[~df[column].astype(str).str.contains("<your_string>")]
    frames.append(df)
df_overall = pd.concat(frames, ignore_index=True)
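For the case in the question, where everything before a trigger value in the first column should be skipped, one possible sketch is to scan for the trigger first and then pass the resulting line count to skiprows. The helper name read_csv_from_trigger and the sample text are made up for illustration. The sketch assumes there are no blank lines or multi-line quoted fields before the trigger, so that the row count equals the line count.

```python
import csv
import io
import pandas as pd

def read_csv_from_trigger(text, trigger):
    """Parse CSV text, skipping every line before the first one whose
    first field equals `trigger` (that line becomes the header)."""
    for line_number, row in enumerate(csv.reader(io.StringIO(text))):
        if row and row[0] == trigger:
            return pd.read_csv(io.StringIO(text), skiprows=line_number)
    raise ValueError(f"trigger {trigger!r} not found in the first column")

# Two junk lines, then the real table
raw = "exported by some tool\nplease ignore this line\nid,name\n1,Ann\n2,Bob\n"
df = read_csv_from_trigger(raw, "id")
print(df)
```

For a file on disk, the same idea applies by opening the file once to find the line number and then calling read_csv on the path with that skiprows value.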

Convert an Excel file with many sheets (with spaces in the sheet names) into a pandas data frame

I would like to convert an Excel file to a pandas dataframe. All the sheet names contain spaces, for instance 'part 1 of 22', 'part 2 of 22', and so on. In addition, the first column is the same in all the sheets.
I would like to convert this Excel file into a single dataframe. However, I don't know what happens to the names in Python. I was able to import the sheets, but I do not know the names of the resulting data frames.
After this I would like to use another for loop with pd.merge() in order to create a single dataframe.
for sheet_name in Matrix.sheet_names:
    sheet_name = pd.read_excel(Matrix, sheet_name)
    print(sheet_name.info())

Using only the code snippet you have shown, each sheet (each DataFrame) will be assigned to the variable sheet_name. Thus, this variable is overwritten on each iteration and you will only have the last sheet as a DataFrame assigned to that variable.
To achieve what you want to do you have to store each sheet, loaded as a DataFrame, somewhere, a list for example. You can then merge or concatenate them, depending on your needs.
Try this:
all_my_sheets = []
for sheet_name in Matrix.sheet_names:
    df = pd.read_excel(Matrix, sheet_name)
    all_my_sheets.append(df)
Or, even better, using list comprehension:
all_my_sheets = [pd.read_excel(Matrix, sheet_name) for sheet_name in Matrix.sheet_names]
You can then concatenate them into one DataFrame like this:
final_df = pd.concat(all_my_sheets, sort=False)
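If the sheets should instead be merged side by side on their shared first column, as the question suggests, a sketch along the following lines may help. The dict below is a hand-made stand-in for the sheets (the sheet names, the id column, and the values are invented), and functools.reduce applies pd.merge() pairwise across all of them.

```python
from functools import reduce
import pandas as pd

# Stand-in for the loaded sheets, keyed by sheet name (spaces are fine in keys)
sheets = {
    "part 1 of 2": pd.DataFrame({"id": [1, 2], "a": [10, 20]}),
    "part 2 of 2": pd.DataFrame({"id": [1, 2], "b": [30, 40]}),
}

# Merge all sheets pairwise on the shared first column
merged = reduce(
    lambda left, right: pd.merge(left, right, on="id"),
    sheets.values(),
)
print(merged)
```

Keeping the sheets in a dict also answers the naming problem: each DataFrame stays reachable by its original sheet name, for example sheets["part 1 of 2"].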

You might consider using the openpyxl package:
from openpyxl import load_workbook
import pandas as pd
wb = load_workbook(filename=file_path, read_only=True)
# Assuming every sheet has the same header in its first row
header = None
records = []
for ws in wb.worksheets:
    rows = ws.iter_rows(values_only=True)
    sheet_header = next(rows, None)
    if sheet_header is None:
        continue  # empty sheet
    # Make sure you don't duplicate the header
    if header is None:
        header = list(sheet_header)
    records.extend(rows)
# Create your df
df = pd.DataFrame(records, columns=header)

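It may be easiest to call read_excel() once and let it load several sheets at the same time. When sheet_name is a list, read_excel() returns a dict that maps each sheet name to its DataFrame.
So, the first step would look like this (file_path stands for the path to your workbook):
dfs = pd.read_excel(file_path, sheet_name=["Sheet 1", "Sheet 2", "Sheet 3"])
Note that the sheet names you use in the list should be the same as those in the Excel file, spaces included. Then, if you wanted to vertically concatenate these sheets, you would just call:
final_df = pd.concat(list(dfs.values()), axis=0)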
Note that this solution would result in a final_df that includes column headers from all three sheets. So, ideally they would be the same. It sounds like you want to merge the information, which would be done differently; we can't help you with the merge without more information.
I hope this helps!

Comparing two Microsoft Excel files in Python

I have two Microsoft Excel files fileA.xlsx and fileB.xlsx
fileA.xlsx looks like this:
fileB.xlsx looks like this:
The Message section of a row can contain any type of character. For example: smileys, Arabic, Chinese, etc.
I would like to find and remove all rows from fileB which are already present in fileA. How can I do this in Python?

You can use pandas' merge to first find the rows that the two files have in common, then use them as a filter.
import pandas as pd
df_A = pd.read_excel("fileA.xlsx", dtype=str)
df_B = pd.read_excel("fileB.xlsx", dtype=str)
df_new = df_A.merge(df_B, on='ID', how='outer', indicator=True)
df_common = df_new[df_new['_merge'] == 'both']
df_A = df_A[~df_A.ID.isin(df_common.ID)]
df_B = df_B[~df_B.ID.isin(df_common.ID)]
df_A and df_B now contain the rows from fileA and fileB respectively, without the rows whose ID appears in both files.
Hope this helps.
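If only fileB needs to be cleaned, and a row should count as a duplicate only when all of its columns match, a left merge with indicator=True is one possible variation. The sketch below uses small made-up frames in place of the two files, and the ID and Message column names are assumptions.

```python
import pandas as pd

# Made-up stand-ins for fileA.xlsx and fileB.xlsx
df_a = pd.DataFrame({"ID": ["1", "2"], "Message": ["hi", "مرحبا"]})
df_b = pd.DataFrame({"ID": ["2", "3"], "Message": ["مرحبا", "你好"]})

# With no `on` argument, merge matches on every column the frames share
merged = df_b.merge(df_a.drop_duplicates(), how="left", indicator=True)

# Rows marked 'left_only' exist in B but not in A
only_in_b = merged.loc[merged["_merge"] == "left_only", df_b.columns]
print(only_in_b)
```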

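This approach uses pandas. Reading .xlsx files also requires an Excel engine such as openpyxl (xlrd 2.0 and later no longer reads .xlsx files).
It keeps the values from the second file that differ from the first file, then writes the result to an Excel file with the second file's name, which overwrites the second file. Note that b != a compares the two frames cell by cell, so it only works when both files have the same shape and labels: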
import pandas as pd
a = pd.read_excel('a.xlsx')
b = pd.read_excel('b.xlsx')
diff = b[b!=a].dropna()
diff.to_excel("b.xlsx",sheet_name='Sheet1',index=False)
