I've extracted the details of a docx file using this code:
from docx import Document

document = Document('136441742-Rental-Agreement-Format.pdf.docx')
for para in document.paragraphs:
    print(para.text)
The output contains numerical values, dates, and text fields. How can I extract the numerical values and dates?
Using the document output you shared in the comment, treating the data as a string, and assuming the date format is dd.mm.yyyy and does not change, I wrote the code below to get the date and the numerical values, and it works fine for me.
I am using a regular expression to extract the date and isdigit() to get the numerical values.
You could adapt the code below to work on your exact document output if needed.
import re
from datetime import datetime

text = "TENANCY AGREEMENT This Tenancy Agreement is made and executed at Bangalore on this 22.01.2013 by MR .P .RAJA SEKHAR AGED ABOUT 28 YRS S/0.MR.KRISHNA PARAMATMA PENTAKOTA R/at NESTER RAGA B-502, OPP MORE MEGA STORE BANGALORE-560 048 Hereinafter called the 'OWNER' of the One Part. AND MR.VENKATA BHYRAVA MURTHY MUTNURI & P/at NO.17-2-16, l/EERABHARAPURAM AGED ABOUT 26 YRS RAOAHMUNDRY ANDHRA PRADESH S/n.MR.RAGHAVENDRA RAO 533105"

# extract the date with an escaped-dot pattern and parse it
match = re.search(r'\d{2}\.\d{2}\.\d{4}', text)
date = datetime.strptime(match.group(), '%d.%m.%Y').date()
print(date)

# collect every digit character in the text
a = []
for i in text:
    if i.isdigit():
        a.append(i)
print(a)
Output:
2013-01-22
['2', '2', '0', '1', '2', '0', '1', '3', '2', '8', '0', '5', '0', '2', '5', '6', '0', '0', '4', '8', '1', '7', '2', '1', '6', '2', '6', '5', '3', '3', '1', '0', '5']
You can find the numbers using regex:
\d{2}\.\d{2}\.\d{4} will find the date
\d+-\d+-\d+ will find the plot number
\d{3} ?\d{3} will find the pincodes
\d+ will find all other numbers
To find underlined text you can use the python-docx run.underline property:
import re
from docx import Document

for para in Document('test.docx').paragraphs:
    # more specific alternatives first, then the catch-all \d+
    nums = re.findall(r'\d{2}\.\d{2}\.\d{4}|\d+-\d+-\d+|\d{3} ?\d{3}|\d+', para.text)
    underline_text = [run.text for run in para.runs if run.underline]
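As a quick check, here is what the combined pattern returns on a fragment of the sample text from the first answer (a sketch; your document's output will differ):
import re

pattern = r'\d{2}\.\d{2}\.\d{4}|\d+-\d+-\d+|\d{3} ?\d{3}|\d+'
sample = "executed at Bangalore on this 22.01.2013 ... BANGALORE-560 048 ... NO.17-2-16"
print(re.findall(pattern, sample))
# ['22.01.2013', '560 048', '17-2-16']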
I have a dataset for one year for all employees with individual-level data (e.g. age, gender, promotions, etc.). Each employee is in a team of a certain manager. I have some variables on the team- and manager-levels as well (e.g. manager's tenure, team diversity, etc.). I want to explain the termination of employees (binary: left the company or not). I am running a multilevel logistic regression, where employees are grouped by their managers, therefore they share the same team- and manager-level characteristics.
So, my model looks like:
"Termination ~ Age + Time in company + Promotions + Manager tenure + Manager age + Average age in team + % of women in team", data, groups=data['Manager_ID']
Dataset example:
import pandas as pd

data = {'Employee': ['ID1', 'ID2', 'ID3', 'ID4', 'ID5', 'ID6', 'ID7', 'ID8'],
        'Manager_ID': ['MID1', 'MID2', 'MID2', 'MID1', 'MID3', 'MID3', 'MID3', 'MID1'],
        'Termination': ['0', '0', '0', '0', '1', '1', '1', '0'],
        'Age': ['35', '40', '50', '24', '33', '46', '44', '31'],
        'TimeinCompany': ['1', '3', '10', '20', '4', '0', '4', '9'],
        'Promotions': ['1', '0', '0', '0', '1', '1', '1', '0'],
        'Manager_Tenure': ['10', '5', '5', '10', '8', '8', '8', '10'],
        'Manager_Age': ['40', '45', '45', '40', '38', '38', '38', '40'],
        'AverageAgeTeam': ['33', '30', '30', '33', '44', '44', '44', '33'],
        'PercentWomenTeam': ['40', '20', '20', '40', '49', '49', '49', '40']}
columns = ['Employee', 'Manager_ID', 'Termination', 'Age', 'TimeinCompany', 'Promotions', 'Manager_Tenure', 'Manager_Age', 'AverageAgeTeam', 'PercentWomenTeam']
data = pd.DataFrame(data, columns=columns)
I am using the pymer4 package to run a logistic mixed effects regression (lmer from R) in Python.
from pymer4.models import Lmer

# build the model
model = Lmer("Termination ~ Age + TimeinCompany + Promotions + Manager_Tenure + Manager_Age + AverageAgeTeam + PercentWomenTeam + (1|Manager_ID)",
             data=data, family='binomial')
print(model.fit())
However, I receive the error "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()".
I thought it was due to some managers only having one employee in the dataset. I excluded managers who have fewer than 5 / 20 / 50 employees, e.g.:
data['Count'] = data.groupby('Manager_ID')["Employee"].transform("count")
data1 = data[data['Count']>=50]
but the error message is the same.
I also tried transforming all variables into numeric:
import numpy as np

all_columns = list(data)
data[all_columns] = data[all_columns].astype(np.int64, errors='ignore')
Some variables are now int64, while others are float64. The error message is still the same.
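As a side note, pd.to_numeric makes the conversion explicit instead of relying on errors='ignore' silently leaving strings in place; a minimal sketch, assuming only the two ID columns should stay as strings:
num_cols = [c for c in data.columns if c not in ('Employee', 'Manager_ID')]
data[num_cols] = data[num_cols].apply(pd.to_numeric)  # raises on non-numeric values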
The dataset is also biased towards employees who did not leave the company, so the Termination variable has more 0s than 1s. The model also runs for a long time on the full sample before showing the error message.
I ran into the same error and was able to resolve it by updating my pymer4 version (see also this issue on GitHub):
conda install -c ejolly -c conda-forge -c defaults pymer4=0.7.8
Hope this helps!
Edit:
In case you have a pandas version >= 1.2 in your environment, you'll need to install pymer4 0.8.0 from the pre-release channel:
conda install -c ejolly/label/pre-release -c conda-forge -c defaults pymer4=0.8.0
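To confirm which version ends up active in your environment (a quick check; most packages, pymer4 included, expose __version__):
import pymer4
print(pymer4.__version__)  # e.g. 0.7.8 or 0.8.0 after the update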
How do I update all instances of a value within a TinyDB?
So, for example, if I have a db like
db = TinyDB('test.json') # assume empty
db.insert({'a':'1', 'b':'2', 'c':'1'})
db.insert({'a':'2', 'b':'3', 'c':'1'})
db.insert({'a':'4', 'b':'1', 'c':'3'})
How do I update all values of '1' to, say, '5'?
How do I update all values within, say, column 'c' only?
As an addition: if some of my columns contain arrays, e.g.
db = TinyDB('test.json') # assume empty
db.insert({'a':'1', 'b':['2', '5', '1'], 'c':'1'})
db.insert({'a':'2', 'b':['3', '4', '1'], 'c':'1'})
db.insert({'a':'4', 'b':['1', '4', '4'], 'c':'3'})
how could I perform the same updates as above?
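A sketch of one way to do this with TinyDB's Query conditions and callable updates (the replace_everywhere helper is a made-up name; adapt it to your schema):
from tinydb import TinyDB, Query

db = TinyDB('test.json')
q = Query()

# update within column 'c' only: set 'c' to '5' wherever it is '1'
db.update({'c': '5'}, q.c == '1')

# update every field of every document: pass a callable to update()
def replace_everywhere(old, new):
    def transform(doc):
        for key, value in doc.items():
            if value == old:
                doc[key] = new          # plain field
            elif isinstance(value, list):
                doc[key] = [new if v == old else v for v in value]  # array field
    return transform

db.update(replace_everywhere('1', '5'))  # no condition: applies to all documents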
I am trying to get the index number of a value in pandas. For example, in the table below, I want to bring the index numbers into the column "Found".
import pandas as pd

columns = ["hit/miss", "Value", "Found"]
data = {'hit/miss': ['0', '0', '1', '0', '0', '1', '0', '0', '0', '0'],
        'Value': ["Not found", "Not found", "Yes!", "Not found", "Not found", "Yes!", "Not found", "Not found", "Not found", "Not found"],
        'Found': ["", "", "", "2", "2", "2", "5", "5", "5", "5"]}
pd.DataFrame(data)
Does anyone have an idea how I can iterate from bottom to top, find the matching value, and get its index number?
Try this:
df = pd.DataFrame(data)

# boolean marker of the rows where a hit occurred
found = df['hit/miss'].eq('1')
# place each hit's index at its own row, shift down one row, then let
# .replace(False, None) pad-fill those index values over the rows below
# (passing value=None makes replace() fall back to forward filling)
df['Found'] = found.mask(found, found.index.to_list()).shift().replace(False, None)
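The same forward-fill idea can be spelled out more explicitly if the replace() trick feels too magic (a sketch; note the filled values come out as floats rather than strings here):
found = df['hit/miss'].eq('1')
# keep the index only at hit rows, shift down one, forward-fill, blank out the head
df['Found'] = found.index.to_series().where(found).shift().ffill().fillna('')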
I have an .SDL text file in the format
244455|199|6577888|20210401|138.61|0.78|83.16|0.00|0.00|221.77|6|0.00|17000
Is there any Python library to read and interpret such an .SDL text file?
I am assuming that there will be no multiline records in the file.
data.sdl
490797|C|64||BLAH BLAH BLAH||||0|190/0000/07|A|1998889|198666566|||8990900|BLAGHHH72|L78899|||0040|012|432565|012|435659||MBLAHAHAHAHASIE|2WES|ARGHKKHHHT|PRE||0002|012|432565|012|435659||MR. JOHN DOE|PO BOX 198898|SILUHHHHH||0052|661|13||82110|35000000|2|0|||||0|0||||Y||70877746414|R
Python script to extract the data into a list:
data_list = []
# with open('path/to/file.sdl') as file:
with open('data.sdl', 'r') as file:
    data = file.read()
    data_list = data.split('|')
data_list[-1] = data_list[-1].strip()        # drop the trailing newline
data_list = list(filter(None, data_list))    # drop empty fields
Output:
['490797', 'C', '64', 'BLAH BLAH BLAH', '0', '190/0000/07', 'A', '1998889', '198666566', '8990900', 'BLAGHHH72', 'L78899', '0040', '012', '432565', '012', '435659', 'MBLAHAHAHAHASIE', '2WES', 'ARGHKKHHHT', 'PRE', '0002', '012', '432565', '012', '435659', 'MR. JOHN DOE', 'PO BOX 198898', 'SILUHHHHH', '0052', '661', '13', '82110', '35000000', '2', '0', '0', '0', 'Y', '70877746414', 'R']
Please let me know if you need anything else.
Presuming there are more rows than you've provided in the same format, Pandas .read_csv() will be able to load this up for you!
import pandas as pd

# header=None because the file has no header row
df = pd.read_csv("my_path/whateverfilename.sdl", sep="|", header=None)
This will create a DataFrame object for you, which may be what you're after.
If you just wanted each row as a list, you can simply load the file and .split() each line, though this will probably be harder to work with overall:
split_lines = []
with open("my_path/whateverfilename.sdl") as fh:
    for line in fh:  # file-like objects are iterable by-line
        split_lines.append(line.rstrip("\n").split("|"))  # strip the newline before splitting
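A middle ground is the standard-library csv module, which handles the delimiter and line endings for you (a small sketch using the same hypothetical path):
import csv

with open("my_path/whateverfilename.sdl", newline="") as fh:
    rows = list(csv.reader(fh, delimiter="|"))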
Assuming that each line has the same number of columns:
File './path_to_data':
244455|199|6577888|20210401|138.61|0.78|83.16|0.00|0.00|221.77|6|0.00|17000
||||0||0|| , |C|64||
Data "reader":
import numpy as np

path = './path_to_data'
N_COLS = 13

# declare the data type of each column - in this case the generic Python object dtype ('O')
dts = np.dtype(', '.join(['O'] * N_COLS))
data = np.loadtxt(fname=path, delimiter='|', dtype=dts, unpack=False, skiprows=0, max_rows=None)

for i in data:
    print(i)
Output
('244455', '199', '6577888', '20210401', '138.61', '0.78', '83.16', '0.00', '0.00', '221.77', '6', '0.00', '17000')
('', '', '', '', '0', '', '0', '', ' , ', 'C', '64', '', '')
To get the data as columns, pass unpack=True.
To set from which line to start reading, use skiprows=0.
To stop reading at a given line, use max_rows=None; if None, read everything (the default).
Here is the documentation for np.loadtxt.
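As a small follow-up, the comma-separated dtype string above produces a structured array whose fields are auto-named f0 ... f12, so individual columns can be pulled out by name (a sketch reusing the data variable from above):
# fourth column (the 20210401 date field in the first sample row)
dates = data['f3']

# or positionally, row by row
first_col = [row[0] for row in data]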
I have panel data (repeated observations per ID at different points in time). The data is unbalanced (there are gaps). I need to check, and possibly adjust for, a change in a variable per person over the years.
I tried two versions. First, a for-loop setting, to access each person and each of their years. Second, a one-line combination with groupby. Groupby looks more elegant to me. Here the main issue is to identify the "next element". I assume in a loop I can solve this with a counter.
Here is my MWE panel data:
import pandas as pd
df = pd.DataFrame({'year': ['2003', '2004', '2005', '2006', '2007', '2008', '2009', '2003', '2004', '2005', '2006', '2007', '2008', '2009'],
                   'id': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2'],
                   'money': ['15', '15', '15', '16', '16', '16', '16', '17', '17', '17', '18', '17', '17', '17']}).astype(int)
df
Here is what a time series per person looks like:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

fig, ax = plt.subplots()
for i in df.id.unique():
    df[df['id'] == i].plot.line(x='year', y='money', ax=ax, label='id = %s' % i)
    df[df['id'] == i].plot.scatter(x='year', y='money', ax=ax)
plt.xticks(np.unique(df.year), rotation=45)
Here is what I want to achieve: for each person, compare the time series of values and flag every successor that differs from its precursor value (identify the red circles in the plot). Then I will try different strategies to handle it:
Drop (very iffy): if the successor differs, drop it
Smooth (absolute value): if the successor differs by (say) 1 unit, assign it its precursor's value
Smooth (relative value): if the successor differs by (say) 1 percent, assign it its precursor's value
Solution to drop:
# flag rows whose value changed relative to the previous row of the same id
df['money_difference'] = df['money'] - df.groupby('id')['money'].shift(1)
df_new = df.drop(df[df['money_difference'].abs() > 0].index)
Idea to smooth:
# keep track of the change of the variable by person and time
df['money_difference'] = df['money'] - df.groupby('id')['money'].shift(1)
# the first element has no precursor, so it will be NaN; replace this by 0
df = df.fillna(0)
# now: whenever change_of_variable exceeds a threshold, replace the value by its precursor - not working so far
df['money'] = np.where(abs(df['money_difference']) >= 1, df['money'].shift(1), df['money'])
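One reason the np.where attempt falls short is that it looks back at the raw, unsmoothed precursor, so consecutive deviations do not cascade, and the bare shift(1) also crosses id boundaries. A sketch of a per-person walk that propagates the already-smoothed value (the smooth helper is a made-up name; this is the absolute-threshold variant):
def smooth(series, threshold=1):
    # walk one person's series and carry the smoothed precursor forward
    values = series.to_list()
    for j in range(1, len(values)):
        if abs(values[j] - values[j - 1]) >= threshold:
            values[j] = values[j - 1]
    return pd.Series(values, index=series.index)

df['money_smoothed'] = df.groupby('id')['money'].transform(smooth)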
To get the next event in your database you can use a combination of groupby and shift, and then subtract the current value from the next one:
df['money_difference'] = df.groupby('id')['money'].shift(-1) - df['money']
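For what it's worth, this flags the same transitions as the precursor-based version in the question, just one row earlier: shift(-1) compares each value to its successor, while shift(1) compares it to its precursor. A small illustration (the two column names here are made up):
# next-minus-current: the row *before* a change is flagged
df['diff_next'] = df.groupby('id')['money'].shift(-1) - df['money']
# current-minus-previous: the row *at* the change is flagged
df['diff_prev'] = df['money'] - df.groupby('id')['money'].shift(1)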