pandas find the previous instance index number of a matching value - python

I am trying to get the index number of a value in pandas. For example, in the table below, I want to fill the "Found" column with the index of the most recent previous row where "hit/miss" is '1'.
import pandas as pd

data = {'hit/miss': ['0', '0', '1', '0', '0', '1', '0', '0', '0', '0'],
        'Value': ["Not found", "Not found", "Yes!", "Not found", "Not found", "Yes!",
                  "Not found", "Not found", "Not found", "Not found"],
        'Found': ["", "", "", "2", "2", "2", "5", "5", "5", "5"]}
df = pd.DataFrame(data, columns=["hit/miss", "Value", "Found"])
Does anyone have an idea how I can iterate from bottom to top, find the matching value, and get its index number?

Try this:
# Flag the hits, stamp each hit row with its own index, shift down one row,
# and let replace(False, None) forward-fill the stamps over the gaps.
found = df['hit/miss'].eq('1')
df['Found'] = found.mask(found, found.index.to_list()).shift().replace(False, None)
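Note that the final replace(False, None) step forward-fills only because pandas historically treated a scalar to_replace with value=None as a request to pad from the previous value; recent pandas versions deprecate that implicit fill. A more explicit sketch of the same idea:
# Stamp each hit row with its own index label, shift the stamp down one
# row so it marks the rows after the hit, then forward-fill explicitly.
hits = df['hit/miss'].eq('1')
df['Found'] = hits.index.to_series().where(hits).shift().ffill()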

Related

tinyDB update all occurrences of a value

How do I update all instances of a value within a tinyDB?
So, for example, if I have a db like
from tinydb import TinyDB

db = TinyDB('test.json')  # assume empty
db.insert({'a': '1', 'b': '2', 'c': '1'})
db.insert({'a': '2', 'b': '3', 'c': '1'})
db.insert({'a': '4', 'b': '1', 'c': '3'})
How do I update all values of '1' to, say, '5'?
And how do I update all values within, say, column 'c' only?
As an addition: if some of my columns contain arrays, e.g.
db = TinyDB('test.json')  # assume empty
db.insert({'a': '1', 'b': ['2', '5', '1'], 'c': '1'})
db.insert({'a': '2', 'b': ['3', '4', '1'], 'c': '1'})
db.insert({'a': '4', 'b': ['1', '4', '4'], 'c': '3'})
how could I perform the same updates in both cases?
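This question has no answer in the thread, but here is a minimal sketch of one approach using TinyDB's update API; replace_value is a hypothetical helper written for this example, not part of TinyDB:
from tinydb import TinyDB, Query

db = TinyDB('test.json')

# Column 'c' only: set 'c' to '5' wherever it currently equals '1'.
db.update({'c': '5'}, Query().c == '1')

# Every field, including list fields: pass a custom operation to update.
def replace_value(old, new):
    def transform(doc):
        for key, val in doc.items():
            if val == old:
                doc[key] = new
            elif isinstance(val, list):
                doc[key] = [new if v == old else v for v in val]
    return transform

db.update(replace_value('1', '5'))  # no condition, so it applies to all documents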

Unable to create a pandas dataframe from a json list due to presence of a colon in one of the values

The code below works fine with other list objects, but here, due to the presence of a colon in imageUrl, it's giving me an error. I have to load the data dynamically without inspecting any particular key-value pair. Please help.
import json
import pandas as pd

dt = [{'lineno': '3544', 'sku': 'B2039P015DP', 'status': 'Shipped', 'order_qty': '4', 'openQty': '0', 'wipQty': '0', 'shippedQty': '2', 'closedQty': '0', 'closed_date': '', 'returnedQty': '0', 'deliveredQty': '0', 'imageUrl': 'https://d2p3w.cloudfront.net/pub/media/catalog/product/b/2/b2039p010ds.jpg', 'itemName': 'Primo Brown Cube Box, 5Ply, (20"x10"x10"), Pack of 15', 'price': '1033.76000', 'udf1': None, 'udf2': None, 'udf3': None, 'udf4': None, 'udf5': None, 'internalLineNo': '1'}]
dummy = pd.read_json(json.dumps(dt), orient='records')
Use json.loads to load it rather than pd.read_json. With your input dt, this code works fine:
dummy = pd.DataFrame(json.loads(json.dumps(dt)))
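For what it's worth, dt is already a list of dicts, so the dumps/loads round trip is redundant; constructing the frame directly gives the same result:
dummy = pd.DataFrame(dt)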

Get numerical values from document in python

I've extracted the details of a docx file using this code
from docx import Document

document = Document('136441742-Rental-Agreement-Format.pdf.docx')
for para in document.paragraphs:
    print(para.text)
The output contains numerical values, dates, and text fields. How do I extract the numerical values and dates?
Using the document output you shared in the comment, treating the data as a string, and assuming the date format is dd.mm.yyyy and does not change, I wrote the code below to get the date and numerical values, and it works fine for me.
I am using a regular expression to extract the date and isdigit() to get the numerical values.
You could adapt the code below to work on your exact document output if needed.
import re
from datetime import datetime

text = "TENANCY AGREEMENT This Tenancy Agreement is made and executed at Bangalore on this 22.01.2013 by MR .P .RAJA SEKHAR AGED ABOUT 28 YRS S/0.MR.KRISHNA PARAMATMA PENTAKOTA R/at NESTER RAGA B-502, OPP MORE MEGA STORE BANGALORE-560 048 Hereinafter called the 'OWNER' of the One Part. AND MR.VENKATA BHYRAVA MURTHY MUTNURI & P/at NO.17-2-16, l/EERABHARAPURAM AGED ABOUT 26 YRS RAOAHMUNDRY ANDHRA PRADESH S/n.MR.RAGHAVENDRA RAO 533105"
a = []
# Escape the dots so they only match the literal date separators.
match = re.search(r'\d{2}\.\d{2}\.\d{4}', text)
date = datetime.strptime(match.group(), '%d.%m.%Y').date()
print(date)
for i in text:
    if i.isdigit():
        a.append(i)
print(a)
Output -
2013-01-22
['2', '2', '0', '1', '2', '0', '1', '3', '2', '8', '0', '5', '0', '2', '5', '6', '0', '0', '4', '8', '1', '7', '2', '1', '6', '2', '6', '5', '3', '3', '1', '0', '5']
You can find the numbers using regex:
\d{2}\.\d{2}\.\d{4} will find the date
\d+-\d+-\d+ will find the plot number
\d{3} ?\d{3} will find pincodes
\d+ will find all other numbers
To find underlined text you can use docx's run.underline property:
import re
from docx import Document

for para in Document('test.docx').paragraphs:
    nums = re.findall(r'\d{2}\.\d{2}\.\d{4}|\d+-\d+-\d+|\d{3} ?\d{3}|\d+', para.text)
    underline_text = [run.text for run in para.runs if run.underline]

Pandas: Create new column that displays the total number of rows of each group of another column

I would like to create a new column called TotalCountByCycle that displays the total number of rows in each group of the Cycle column and also appears in every row belonging to that Cycle group.
Here is an example of a simplified table:
import pandas as pd

raw_data = {'Reagent': ['H20', 'MWS', 'H20_1', 'H20', 'MWS', 'H20_1', 'H20_2', 'H20_3'],
            'Cycle': ['1', '1', '1', '2', '2', '2', '2', '2'],
            'Day': ['Mon', 'Tue', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun', 'Mon']}
df = pd.DataFrame(raw_data, columns=['Reagent', 'Cycle', 'Day'])
df
I am trying to achieve the TotalCountByCycle column shown in the image in the original post: 3 in every Cycle 1 row and 5 in every Cycle 2 row.
I tried the code below, but got the error, ValueError: Wrong number of items passed 2, placement implies 1.
df['new_col'] = df.groupby('Cycle').transform('count')
groupby on Cycle, then transform with the count, and assign the result to the new column:
df['TotalCountByCycle'] = df.groupby('Cycle')['Reagent'].transform('count')
Your original attempt raised the ValueError because transform on the whole GroupBy returns one transformed column per remaining column (Reagent and Day), two columns in total, which cannot be placed into a single new column; selecting one column first avoids this.
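As a side note (not from the original answer): count skips NaNs in the selected column, while size counts every row in the group, so the following equivalent does not depend on Reagent being free of missing values:
df['TotalCountByCycle'] = df.groupby('Cycle')['Cycle'].transform('size')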

How to identify changes in a variable per person per time (in panel data)?

I have panel data (repeated observations per ID at different points in time). The data is unbalanced (there are gaps). I need to check, and possibly adjust for, a change in a variable per person over the years.
I tried two approaches. First, a for-loop setting, to access each person and then each of their years. Second, a one-line combination with groupby. Groupby looks more elegant to me. The main issue is identifying the "next element"; in a loop I assume I can solve this with a counter.
Here is my MWE panel data:
import pandas as pd

df = pd.DataFrame({'year': ['2003', '2004', '2005', '2006', '2007', '2008', '2009',
                            '2003', '2004', '2005', '2006', '2007', '2008', '2009'],
                   'id': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2'],
                   'money': ['15', '15', '15', '16', '16', '16', '16', '17', '17', '17', '18', '17', '17', '17']}).astype(int)
df
Here is what a time series per person looks like:
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
for i in df.id.unique():
    # Note: the column is called 'money', not 'var', in the MWE above.
    df[df['id'] == i].plot.line(x='year', y='money', ax=ax, label='id = %s' % i)
    df[df['id'] == i].plot.scatter(x='year', y='money', ax=ax)
plt.xticks(np.unique(df.year), rotation=45)
Here is what I want to achieve: for each person, compare the time series of values and identify every successor that differs from its precursor value (the red circles in the original plot). Then I will try different strategies to handle them:
Drop (very iffy): if successor differs, drop it
Smooth (absolute value): if successor differs by (say) 1 unit, assign it its precursor value
Smooth (relative value): if successor differs by (say) 1 percent, assign it its precursor value
Solution to drop
df['money_difference'] = df['money'] - df.groupby('id')['money'].shift(1)
df_new = df.drop(df[df['money_difference'].abs() > 0].index)
Idea to smooth
# keep track of the change of the variable by person and time
df['money_difference'] = df['money'] - df.groupby('id')['money'].shift(1)
# the first element has no precursor and will be NaN; replace this with 0
df = df.fillna(0)
# now: whenever money_difference exceeds a threshold, replace the value by its precursor - not working so far
df['money'] = np.where(abs(df['money_difference']) >= 1, df['money'].shift(1), df['money'])
To get the next event per person you can combine groupby and shift, then subtract the current value to obtain the difference to the previous event. Group by id only: grouping by year and id would make every (year, id) group a single row, so shift would return only NaN:
df['money_difference'] = df.groupby('id')['money'].shift(-1) - df['money']
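The thread does not include a working smoother, but here is a minimal sketch of the cascading "smooth (absolute value)" strategy, comparing each value to the last kept value within each id; smooth_group is a hypothetical helper written for this example:
import pandas as pd

def smooth_group(s, tol=1):
    # Walk the series in time order; snap a value back to its
    # predecessor whenever the jump is within the tolerance.
    vals = s.tolist()
    for i in range(1, len(vals)):
        if abs(vals[i] - vals[i - 1]) <= tol:
            vals[i] = vals[i - 1]
    return pd.Series(vals, index=s.index)

# Apply per person so one id's values never influence another's.
df['money_smooth'] = df.groupby('id')['money'].transform(smooth_group)
With the MWE above, id 1's step from 15 to 16 and id 2's one-year spike to 18 are both snapped back to the precursor value.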
