Counting number of occurrences of string matched to another column - python

df = {'msg': ['i am so happy thank you',
              'sticker omitted',
              'sticker omitted',
              'thank you for your time!',
              'sticker omitted',
              'hello there'],
      'number_of_stickers': ['2', '0', '0', '1', '0', '0']}  # This column 'number_of_stickers' is what I am aiming to achieve. Currently, I don't have this column.
df = pd.DataFrame(data=df)
Above is what I am trying to achieve. I currently do not have the column 'number_of_stickers'; this column is my end goal.
I am trying to count the number of rows with "sticker omitted" and, on the row above each chain of "sticker omitted" rows, record the number of occurrences. I would like to write this into the new column 'number_of_stickers'.
To give you some context, I am analysing whatsapp text data, and I thought it would be useful to see how many stickers were sent right after a chat was sent. This also shows the tonality and sentiments of the conversation.
Update:
I have posted a solution below (credit to @JacoSolari) which works for the problem I'm solving. I added a couple of lines (an if statement) on top of his code so that we don't run into range issues at the end of the dataframe.

A common technique is to flag the rows that are not 'sticker omitted' and take the cumulative sum to identify the blocks:
omitted = df.msg.ne('sticker omitted').cumsum()
df['number_of_stickers'] = np.where(omitted.duplicated(), 0,
                                    omitted.groupby(omitted).transform('size') - 1)
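To see why this works, here are the intermediate values on the sample frame from the question:
import pandas as pd

df = pd.DataFrame({'msg': ['i am so happy thank you', 'sticker omitted',
                           'sticker omitted', 'thank you for your time!',
                           'sticker omitted', 'hello there']})
omitted = df.msg.ne('sticker omitted').cumsum()
print(omitted.tolist())
# [1, 1, 1, 2, 2, 3]  <- every non-sticker message starts a new block label
print(omitted.duplicated().tolist())
# [False, True, True, False, True, False]  <- True marks the sticker rows inside a block
print((omitted.groupby(omitted).transform('size') - 1).tolist())
# [2, 2, 2, 1, 1, 0]  <- block size minus the message itself = the sticker count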

This code should do the job. I could not find a solution that only uses pandas functions (it might be possible). Anyway, I left some comments in the code to describe my approach.
import pandas as pd

# create data
df_dict = {'msg': ['i am so happy thank you',
                   'sticker omitted',
                   'sticker omitted',
                   'thank you for your time!',
                   'sticker omitted']}
df = pd.DataFrame(data=df_dict)

# build column for sticker counts after each message
sticker_counts = []
for index, row in df.iterrows():  # iterating over df rows
    flag = True
    count = 0
    # when a sticker row is encountered, just put 0 in the count column
    # when a non-sticker row is encountered do the following
    if row['msg'] != 'sticker omitted':
        k = 1  # to check rows after the non-sticker row
        while flag:
            # if the index + k row is a sticker, increase the count for index and k
            if df.loc[index + k].msg == 'sticker omitted':
                count += 1
                k += 1
                # when the end of the dataframe is reached, break the loop
                if index + k + 1 > len(df):
                    flag = False
            else:
                flag = False
                k = 1
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)
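Running this on the five-row frame above prints:
                        msg  sticker_counts
0   i am so happy thank you               2
1           sticker omitted               0
2           sticker omitted               0
3  thank you for your time!               1
4           sticker omitted               0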

You've actually got it all right so far, and your data is well suited to an easy yet functional algorithm!
Here is a little piece of code I wrote for this problem:
# ss
df = {'msg': ['i am so happy thank you',
              'sticker omitted',
              'sticker omitted',
              'thank you for your time!',
              'sticker omitted'],
      'number_of_stickers': ['2', '0', '0', '1', '0']}
j = 0
newarr = []  # new array for use
for i in df["number_of_stickers"]:
    if not int(i) == 0:
        newarr.append([df["msg"][j], int(i)])  # will store each entry in the array
        # access the count of an entry with newarr[n][1] and the msg with newarr[n][0]
    j += 1
# se
# feel free to do whatever you want after ss to se
pd.DataFrame(data=df)
(se being snippet end and ss snippet start.)
Hope this helps! Just comment below if it doesn't!
Also note that you have to feed the new array back into the dict.
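For instance, a minimal sketch of that last step (the column names here are my own choice):
result = pd.DataFrame(newarr, columns=['msg', 'number_of_stickers'])
print(result)
#                         msg  number_of_stickers
# 0   i am so happy thank you                   2
# 1  thank you for your time!                   1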

I have edited @JacoSolari's code (with the help of a kind soul) to match the needs of the problem I'm trying to solve. I hope you find the code below useful.
sticker_counts = []
msg_index = 0
for index, row in df.iterrows():  # iterating over df rows
    flag = True
    count = 0
    # when a sticker row is encountered, just put 0 in the count column
    # when a non-sticker row is encountered do the following
    if row['msg'] != 'sticker omitted':
        k = 1  # to check rows after the non-sticker row
        while flag:
            print(f'i{msg_index} flag{flag} len{len(df)}')
            # if the index + k row is a sticker, increase the count for index and k
            msg_index = index + k
            if msg_index >= len(df):
                break
            if df.loc[msg_index].msg == 'sticker omitted':
                count += 1
                k += 1
                # when the end of the dataframe is reached, break the loop
                if msg_index + 1 > len(df):
                    flag = False
                    print(f'i{msg_index} flag{flag}')
            else:
                flag = False
                k = 1
    sticker_counts.append(count)
df['sticker_counts'] = sticker_counts
print(df)
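For reference, with the six-row sample frame from the question this ends with (debug prints omitted):
                        msg  sticker_counts
0   i am so happy thank you               2
1           sticker omitted               0
2           sticker omitted               0
3  thank you for your time!               1
4           sticker omitted               0
5               hello there               0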

Related

Pandas finding a text in row and assign a dummy variable value based on this

I have a data frame which contains a text column, i.e. df["input"].
I would like to create a new variable that checks whether df["input"] contains any of the words in a given list, and assigns a value of 1 only if the previous dummy variables are all 0. The logic is: 1) create a dummy variable that equals zero; 2) set it to one if the column contains any word in the given list and no word from the previous lists was matched.
# Example lists
listings = ["amazon listing", "ecommerce", "products"]
scripting = ["subtitle", "film", "dubbing"]
medical = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
which looks like:
input
amazon listing subtitle
medical
film biotechnology dentist
final dataset should look like:
input                       listings  scripting  medical
amazon listing subtitle            1          0        0
medical                            0          0        1
film biotechnology dentist         0          1        0
One possible implementation is to use str.contains in a loop to create the 3 columns, then use idxmax to get the column name (or the list name) of the first match, then create a dummy variable from these matches:
import numpy as np

d = {'listings': listings, 'scripting': scripting, 'medical': medical}
for k, v in d.items():
    df[k] = df['input'].str.contains('|'.join(v))

arr = df[list(d)].to_numpy()
# keep only the first match per row: place the row's max at the argmax position
tmp = np.zeros(arr.shape, dtype='int8')
tmp[np.arange(len(arr)), arr.argmax(axis=1)] = arr.max(axis=1)
out = pd.DataFrame(tmp, columns=list(d)).combine_first(df)
But in this case, it might be more efficient to use a nested for-loop:
import re

def get_dummy_vars(col, lsts):
    out = []
    len_lsts = len(lsts)
    for row in col:
        tmp = []
        # in the nested loop, we use the any function to check for the first match
        # if there's a match, break the loop and pad 0s since we don't care if there's another match
        for lst in lsts:
            tmp.append(int(any(True for x in lst if re.search(fr"\b{x}\b", row))))
            if tmp[-1]:
                break
        tmp += [0] * (len_lsts - len(tmp))
        out.append(tmp)
    return out

lsts = [listings, scripting, medical]
out = df.join(pd.DataFrame(get_dummy_vars(df['input'], lsts), columns=['listings', 'scripting', 'medical']))
Output:
input listings medical scripting
0 amazon listing subtitle 1 0 0
1 medical 0 1 0
2 film biotechnology dentist 0 0 1
Here is a simpler, more pandas-vector-style solution:
patterns = {}  # <-- dictionary
patterns["listings"] = ["amazon listing", "ecommerce", "products"]
patterns["scripting"] = ["subtitle", "film", "dubbing"]
patterns["medical"] = ["medical", "biotechnology", "dentist"]
df = pd.DataFrame({'input': ['amazon listing subtitle',
                             'medical',
                             'film biotechnology dentist']})
#---------------------------------------------------------------#
# step 1, for each column create a reg-expression
for col, items in patterns.items():
    # create a regex pattern (word1|word2|word3)
    pattern = f"({'|'.join(items)})"
    # find the pattern in the input column
    df[col] = df['input'].str.contains(pattern, regex=True).astype(int)

# step 2, if the value to the left is 1, change its value to 0
## 2.1 create a mask
## shift the rows to the right,
## --> if the left column contains the same value as the current column: True, otherwise False
mask = (df == df.shift(axis=1)).values
# subtract the mask from the df
## and clip the result --> negative values will become 0
df.iloc[:, 1:] = np.clip(df.iloc[:, 1:] - mask[:, 1:], 0, 1)
print(df)
Result
input listings scripting medical
0 amazon listing subtitle 1 0 0
1 medical 0 0 1
2 film biotechnology dentist 0 1 0
Great question and good answers (I somehow missed it yesterday)! Here's another variation with .str.extractall():
search = {"listings": listings, "scripting": scripting, "medical": medical, "dummy": []}
pattern = "|".join(
f"(?P<{column}>" + "|".join(r"\b" + s + r"\b" for s in strings) + ")"
for column, strings in search.items()
)
result = (
df["input"].str.extractall(pattern).assign(dummy=True).groupby(level=0).any()
.idxmax(axis=1).str.get_dummies().drop(columns="dummy")
)
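If I read the pattern right, the empty dummy group guarantees every row yields at least one (empty) match, so rows without any real hit still survive the groupby and are only dropped at the final drop(columns="dummy"). For the sample frame this should give:
print(result)
#    listings  medical  scripting
# 0         1        0          0
# 1         0        1          0
# 2         0        0          1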

python-Dataframe count rows with condition

I have Python code that collects information from a dataframe (df1) like this:
for ind, data in enumerate(df1.Link):
    print(data)
    result = getInformation(driver, links)
    for i in result['information']:
        df1.loc[ind, "numOfWorkers"] = i["numOfWorkers"]
The output is saved to a dataframe like the one shown in the photo.
Is there any way to update my code so that, before it returns the dataframe, it applies this condition:
once we have 2 links with numOfWorkers >= 30, the code breaks and returns the result.
Can someone help?
It would be better to put the logic in the code you already have. I would keep a count of the number of records matching the conditions, then exit out of the loop using break (rather than a while loop):
...
workers_threshold = 30
records_matching_threshold = 0
max_records_for_matching_records = 2
for i in result['information']:
    df1.loc[ind, "numOfWorkers"] = i["numOfWorkers"]
    if i["numOfWorkers"] >= workers_threshold:
        records_matching_threshold += 1
        if records_matching_threshold >= max_records_for_matching_records:
            break
Note that the variable names above are purposefully long to make their purposes clear in my example.
i = 2
while i != 0:
    if numOfWorkers >= 30:
        i -= 1
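Folded into the loop from the question, that counter idea would look something like this (a sketch; getInformation, driver, links, and df1 are as in the question):
remaining = 2  # stop after two links meet the condition
for ind, data in enumerate(df1.Link):
    result = getInformation(driver, links)
    for i in result['information']:
        df1.loc[ind, "numOfWorkers"] = i["numOfWorkers"]
        if i["numOfWorkers"] >= 30:
            remaining -= 1
    if remaining <= 0:
        break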

compare a long list with strings in dataframe and on the basis of matching populate the dataframe in Python

I have the AMiner dataset and I have to extract only the computer science terms out of it, so there is a list1 against which I have to compare my dataset for this task.
https://www.aminer.org/oag2019
list1 = ['document types', 'surveys and overviews', 'reference works', 'general conference proceedings', 'biographies', 'general literature', 'computing standards, rfcs and guidelines', 'cross-computing tools and techniques',......]
The total count of list1 is 2112 computer science terms from ACM.
The data frame column to which I have to compare list1 (string comparison) looks like this:
df_train14year['keywords'].head()
0 "nmr spectroscopy","mass spectrometry","nanost...
1 "plk1","cationic dialkyl histidine","crystal s...
2 "case-control","child","fuel","hydrocarbons","...
3 "Ca2+ handling","CaMKII","cardiomyocyte","cont...
4
Name: keywords, dtype: object
Each of these lists in the dataframe has at most 10 and at least 3 keywords, and there are millions of records in the dataframe.
So I have to compare each keyword with the original list1, and if more than 3 words match between the two lists, populate a dataframe with those rows; substring matches may also be needed.
How can I do this task in an efficient way in Python? What I've done so far uses for loops comparing each keyword to the whole list; there are three nested loops in it, so it is inefficient.
# for i in range(5):
#     df.loc[i] = ['<some value for first>', '<some value for second>', '<some value for third>']
count = 0
i = 0
for index, row in df_train14year.iterrows():
    # print("index", index)
    i = 1 + 1
    # if i == 50:
    #     break
    for outr in row['keywords'].split(","):
        # print(count)
        if count > 1:
            # print("found1")
            count = 0
            break
        for inr in computerList:
            # outr = outr.replace("[", "")  # I skip these lines because preprocessing already removed the [ ] from the data
            # outr = outr.replace("]", "")
            outr = outr.replace('"', "")
            # print("outr", outr, "inr", inr)
            if outr in inr:
                count = count + 1
                if count > 10:
                    # print("found2")
                    # df12.loc[i] = [index, row['keywords']]
                    # df12.insert(index, "keywords", row['keywords'])
                    df14_4_match = df14_4_match.append({'abstract': row['abstract'], 'keywords': row['keywords'], 'title': row['title'], 'year': row['year']}, ignore_index=True)
                    break
            # else:
            #     print('not found')
Rows in the data frame look like: keywords = ["nmr spectroscopy","mass spectrometry","nanos"]
Preprocessing:
dfk['list_keywords'] = [[x for x in j.split('[')[1].split(']')[0].split('"')[1:-1] if x not in [',', ' , ', ', ']] for j in dfk['keywords']]
After converting the original list to a set:
dfk['set_keywords'] = dfk['list_keywords'].map(lambda x: set(x))
We compare the intersection of the keyword set and the computer list (list1) mentioned in the question to get the number of keywords matched:
dfk['set_keywords'] = dfk['set_keywords'].map(lambda x: x.intersection(proceseedComputerList))
Get the length with this:
dfk['len_keywords'] = dfk['set_keywords'].map(lambda x: len(x))
Sort in ascending order:
dfk.head()
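Put together end to end, here is a minimal sketch on one made-up row (the sample keyword string and the contents of proceseedComputerList are my own assumptions):
import pandas as pd

dfk = pd.DataFrame({'keywords': ['["nmr spectroscopy","mass spectrometry","nanostructure"]']})
proceseedComputerList = {'nanostructure', 'machine learning'}  # stands in for the preprocessed list1

dfk['list_keywords'] = [[x for x in j.split('[')[1].split(']')[0].split('"')[1:-1]
                         if x not in [',', ' , ', ', ']] for j in dfk['keywords']]
dfk['set_keywords'] = dfk['list_keywords'].map(set)
dfk['set_keywords'] = dfk['set_keywords'].map(lambda x: x.intersection(proceseedComputerList))
dfk['len_keywords'] = dfk['set_keywords'].map(len)
print(dfk[['set_keywords', 'len_keywords']])
# len_keywords is 1 here: only 'nanostructure' matched the computer science list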

While loop on a dataframe column?

I have a small dataframe comprised of two columns, an ORG column and a percentage column. The dataframe is sorted largest to smallest based on the percentage column.
I'd like to create a while loop that adds up the values in the percentage column up until it hits a value of .80 (80%).
So far I've tried:
retail_pareto = 0
counter = 0
while retail_pareto < .80:
    retail_pareto += retailerDF[counter]['RETAILER_PCT_OF_CHANGE']
    counter += 1
This does not work: both the counter and the retail_pareto value remain at zero, with no real error message to help me troubleshoot what I'm doing incorrectly. Ideally, I'd like to end up with a list of the orgs with the largest percentages that together add up to 80%.
I'm not exactly sure what to try next. I've searched these forums, but haven't found anything similar in the forums yet.
Any advice or help is much appreciated. Thank you.
Example Dataframe:
ORG PCT
KST 0.582561
ISL 0.290904
BOV 0.254456
BRH 0.10824
GNT 0.0913631
DSH 0.023441
RDM -0.0119665
JBL -0.0348893
JBD -0.071883
WEG -0.232227
The output that I would expect would be something along the lines of:
ORG PCT
KST 0.582561
ISL 0.290904
Use:
df_filtered = df.loc[df['PCT'].shift(fill_value=0).cumsum().le(0.80),:]
# if you don't want to include the row where the cumsum first exceeds 0.80:
# df_filtered = df.loc[df['PCT'].cumsum().le(0.80), :]
print(df_filtered)
ORG PCT
0 KST 0.582561
1 ISL 0.290904
Can you use this example to help you?
import pandas as pd

retail_pareto = 0
orgs = []
for i, row in retailerDF.iterrows():
    if retail_pareto <= .80:
        retail_pareto += row['RETAILER_PCT_OF_CHANGE']
        orgs.append(row)
    else:
        break
new_df = pd.DataFrame(orgs)
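For the sample frame this collects the KST and ISL rows, matching the expected output above.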
Edit: made it more like your example and added the new DataFrame.
Instead of your loop, take a more pandasonic approach.
Start with computing an additional column containing cumulative sum
of RETAILER_PCT_OF_CHANGE:
df['pct_cum'] = df.RETAILER_PCT_OF_CHANGE.cumsum()
For your data, the result is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465
2 BOV 0.254456 1.127921
3 BRH 0.108240 1.236161
4 GNT 0.091363 1.327524
5 DSH 0.023441 1.350965
6 RDM -0.011967 1.338999
7 JBL -0.034889 1.304109
8 JBD -0.071883 1.232226
9 WEG -0.232227 0.999999
And now, to print rows which totally include 80 % of change,
ending on the first row above the limit, run:
df[df.pct_cum.shift(1).fillna(0) < 0.8]
The result, together with the cumulated sum, is:
ORG RETAILER_PCT_OF_CHANGE pct_cum
0 KST 0.582561 0.582561
1 ISL 0.290904 0.873465

Odd Column returns when scraping with lxml

I am learning python and trying to build a scraper to glean parts data from a suppliers site. My issue now is that I am getting different column counts from my parsed table rows where I know that each row has the same column count. The issue has to be something I am overlooking and after two days of trying different things I am asking for a few more sets of eyes on my code to locate my error. Not having much python coding experience is no doubt my biggest hurdle.
First, the data. Rather than paste the html I have stored in my database, I'll give you a link to the live site I have crawled and stored in my db. The first link is this one.
The issue is that I get mostly correct results. However, every so often I get the values skewed in the column count. I can't seem to locate the cause.
Here is an example of the flawed result:
----------------------------------------------------------------------------------
Record: 1 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1997 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:1 Row 6 Col 1 part Air Filter
Rec:1 Row 6 Col 2 2
Rec:1 Row 6 Col 3 part_no 46395
Rec:1 Row 6 Col 4 filter_loc
Rec:1 Row 6 Col 5 engine
Rec:1 Row 6 Col 6 vin_code V6 3.0L 2997cc
Rec:1 Row 6 Col 7 comment Engine Code J30A1
** Note that the engine value has been shifted to the vin_code field.
And proof it works some of the time:
Record: 2 Section:Passenger / Light Truck Make: ACURA SubMake:
Model: CL SubModel: Year: 1998 Engine: L4 1.6L 1590cc
----------------------------------------------------------------------------------
Rec:3 Row 4 Col 1 part Oil Filter
Rec:3 Row 4 Col 2 2
Rec:3 Row 4 Col 3 part_no 51334
Rec:3 Row 4 Col 4 filter_loc
Rec:3 Row 4 Col 5 engine L4 2.3L 2254cc
Rec:3 Row 4 Col 6 vin_code
Rec:3 Row 4 Col 7 comment Engine Code F23A1
** Note the fields line up in this record...
I suspect either there is something in the table cells my parser is not looking for or I have missed something trivial.
Here is the important portion of my code:
# Per Query
while records:
    # Per Query Loop
    #print str(records)
    for record in records:
        print 'Record Count:'+str(rec_cnt)
        items = ()
        item = {}
        source = record['doc']
        page = html.fromstring(source)
        for rows in page.xpath('//div/table'):
            #records = []
            item = {}
            cntx = 0
            for row in list(rows):
                cnty = 0       # Column Counter
                found_oil = 0  # Found oil filter record flag
                data = {}      # Data
                # Data fields
                field_data = {'part':'', 'part_no':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
                print
                print '----------------------------------------------------------------------------------'
                print 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
                print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
                print '----------------------------------------------------------------------------------'
                #
                # Rules for extracting data columns
                # 1. First column always has a link to the bullet image
                # 2. Second column is part name
                # 3. Third column always empty
                # 4. Fourth column is part number
                # 5. Fifth column is empty
                # 6. Sixth column is part location
                # 7. Seventh column is always empty
                # 8. Eighth column is engine size
                # 9. Ninth column is vin code
                # 10. Tenth column is Comment
                # 11. Eleventh column does not exist.
                #
                for column in row.xpath('./td[@class="blackmedium"][text()="0xa0"] | ./td[@class="blackmedium"][text()="\n"]/text() | ./td[@class="blackmeduim"]/img[@src]/text() | ./td[@class="blackmedium"][text()=""]/text() | ./td[@class="blackmedium"]/b/text() | ./td[@class="blackmedium"]/a/text() | ./td[@class="blackmedium"]/text() | ./td[@class="blackmedium"][text()=" "]/text() | ./td[@class="blackmedium"][text()="&#160"]/text() | ./td[@class="blackmedium"][text()=None]/text()'):
                    #' | ./td[position()>1]/a/text() | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
                    cnty += 1
                    if ('Oil Filter' == column.strip() or 'Air Filter' == column.strip()) and found_oil == 0:
                        found_oil = 1
                    if found_oil == 1:
                        print 'Rec:'+str(rec_cnt), 'Row '+str(cntx), 'Col '+str(cnty), _fields[cnty], column.strip()
                        #cnty += 1
                        #print
                    else:
                        print 'Rec: '+str(rec_cnt), 'Col: '+str(cnty)
                    field_data[ str(_fields[cnty]) ] = str(column.strip())
                    #cnty = cnty+1
                # Save data to db dest table
                if found_oil == 1:
                    data['source_id'] = record['id']
                    data['section_id'] = record['section_id']
                    data['section'] = record['section']
                    data['make_id'] = record['make_id']
                    data['make'] = record['make']
                    data['submake_id'] = record['submake_id']
                    data['submake'] = record['submake']
                    data['model_id'] = record['model_id']
                    data['model'] = record['model']
                    data['submodel_id'] = record['submodel_id']
                    data['submodel'] = record['submodel']
                    data['year_id'] = record['year_id']
                    data['year'] = record['year']
                    data['engine_id'] = record['engine_id']
                    data['engine'] = record['engine']
                    data['part'] = field_data['part']
                    data['part_no'] = field_data['part_no']
                    data['filter_loc'] = field_data['filter_loc']
                    data['vin_code'] = field_data['vin_code']
                    data['comment'] = conn.escape_string(field_data['comment'])
                    data['url'] = record['url']
                    save_data(data)
                print 'Filed Data:'
                print field_data
                cntx += 1
        rec_cnt += 1
    #End main per query loop
    delay() # delay if wait was passed on cmd line
    records = get_data()
    has_offset = 1
#End Queries
Thank you all for your help and your eyes...
Usually when I run into a problem like this, I do two things:
Break the problem down into smaller chunks. Use python functions or classes to perform subsets of functionality so that you can test the functions individually for correctness.
Use the Python Debugger to inspect the code as it runs to understand where it's failing. For example, in this case, I would add import pdb; pdb.set_trace() before the line that says cnty+=1.
Then, when the code runs, you'll get an interactive interpreter at that point and you can inspect the various variables and discover why you're not getting what you expect.
A couple of tips for using pdb:
Use c to allow the program to continue (until the next breakpoint or set_trace); Use n to step to the next line in the program. Use q to raise an Exception (and usually abort).
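As a minimal standalone illustration of that workflow (a toy function and data, not the scraper itself):
def count_columns(cells):
    cnty = 0
    for column in cells:
        import pdb; pdb.set_trace()  # execution pauses here on every iteration
        cnty += 1
    return cnty

count_columns(['Air Filter', '', '46395'])
# at the (Pdb) prompt, try: p column / p cnty to inspect, n to step, c to continue, q to quit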
Can you share the details of your scraping process? The intermittent failures could be related to the parsing of the HTML data.
The problem seems to be that your xpath expression searches for text nodes. No matches are found for empty cells, causing your code to "skip" columns. Try iterating over the td elements themselves, and then "look down" from the element to its contents. To get you started:
# just iterate over child elements of the row, which are always td
# use enumerate to easily get a counter for the columns
for col_no, td in enumerate(row, start=1):
# use the xpath function string() to get the string value for the element
# this will yield an empty string for empty elements
print col_no, td.xpath('string()')
Note that the use of the string() xpath function may in some cases be not enough/too simple for what you want. In your example, you may find something like <td><a>51334</a><sup>53</sup></td> (see the oil filter row). My example would give you "5133453", where you would seem to need "51334" (not sure if that was intentional or if you hadn't noticed the "missing" part). If you want only the text of the hyperlink, use td.findtext('a').
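A quick self-contained demo of that difference (using lxml.etree so the fragment parses as-is):
from lxml import etree

td = etree.fromstring('<td><a>51334</a><sup>53</sup></td>')
print(td.xpath('string()'))  # 5133453 - all descendant text concatenated
print(td.findtext('a'))      # 51334   - text of the <a> child only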
I want to thank everyone who has given aid to me these past few days. All your input has resulted in a working application that I am now using. I wanted to post the resulting changes to my code so those who look here may find an answer or at least information on how they may also tackle their issue. Below is the rewritten portion of my code that solved the issues I was having:
#
# get_column_index()
# returns a dict of column name / column number pairs
#
def get_column_index(row):
    index = {}
    col_no = 0
    td = None
    name = ''
    for col_no, td in enumerate(row, start=0):
        mystr = str(td.xpath('string()').encode('ascii', 'replace'))
        name = str.lower(mystr).replace(' ', '_')
        idx = name.replace('.', '')
        index[idx] = col_no
    if int(options.verbose) > 2:
        print 'Field Index:', str(index)
    return index

def run():
    global has_offset
    records = get_data()
    #print 'Records', records
    rec_cnt = 0
    # Per Query
    while records:
        # Per Query Loop
        #print str(records)
        for record in records:
            if int(options.verbose) > 0:
                print 'Record Count:'+str(rec_cnt)
            items = ()
            item = {}
            source = record['doc']
            page = html.fromstring(source)
            col_index = {}
            for rows in page.xpath('//div/table'):
                #records = []
                item = {}
                cntx = 0
                for row in list(rows):
                    data = {}      # Data
                    found_oil = 0  # found proper part flag
                    # Data fields
                    field_data = {'part':'', 'part_no':'', 'part_note':'', 'filter_loc':'', 'engine':'', 'vin_code':'', 'comment':'', 'year':''}
                    if int(options.verbose) > 0:
                        print
                        print '----------------------------------------------------------------------------------'
                        print 'Row'+str(cntx), 'Record: '+str(record['id']), 'Section:'+str(record['section']), 'Make: '+str(record['make']), 'SubMake: '+str(record['submake'])
                        print 'Model: '+str(record['model']), 'SubModel: '+str(record['submodel']), 'Year: '+str(record['year']), 'Engine: '+str(record['engine'])
                        print '----------------------------------------------------------------------------------'
                    # get column indexes
                    if cntx == 1:
                        col_index = get_column_index(row)
                    if col_index != None and cntx > 1:
                        found_oil = 0
                        for col_no, td in enumerate(row):
                            if ('part' in col_index) and (col_no == col_index['part']):
                                part = td.xpath('string()').strip()
                                if 'Oil Filter' == part or 'Air Filter' == part or 'Fuel Filter' == part or 'Transmission Filter' == part:
                                    found_oil = 1
                                field_data['part'] = td.xpath('string()').strip()
                            # Part Number
                            if ('part_no' in col_index) and (col_no == col_index['part_no']):
                                field_data['part_no'] = str(td.xpath('./a/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                                field_data['part_note'] = str(td.xpath('./sup/text()')).strip().replace('[', '').replace(']', '').replace("'", '')
                            # Filter Location
                            if ('filterloc' in col_index) and (col_no == col_index['filterloc']):
                                field_data['filter_loc'] = td.xpath('string()').strip()
                            # Engine
                            if ('engine' in col_index) and (col_no == col_index['engine']):
                                field_data['engine'] = td.xpath('string()').strip()
                            if ('vin_code' in col_index) and (col_no == col_index['vin_code']):
                                field_data['vin_code'] = td.xpath('string()').strip()
                            if ('comment' in col_index) and (col_no == col_index['comment']):
                                field_data['comment'] = td.xpath('string()').strip()
                            if int(options.verbose) == 0:
                                print ','
                        if int(options.verbose) > 0:
                            print 'Field Data: ', str(field_data)
                        elif int(options.verbose) == 0:
                            print '.'
                        # Save data to db dest table
                        if found_oil == 1:
                            data['source_id'] = record['id']
                            data['section_id'] = record['section_id']
                            data['section'] = record['section']
                            data['make_id'] = record['make_id']
                            data['make'] = record['make']
                            data['submake_id'] = record['submake_id']
                            data['submake'] = record['submake']
                            data['model_id'] = record['model_id']
                            data['model'] = record['model']
                            data['submodel_id'] = record['submodel_id']
                            data['submodel'] = record['submodel']
                            data['year_id'] = record['year_id']
                            data['year'] = record['year']
                            data['engine_id'] = record['engine_id']
                            data['engine'] = field_data['engine'] #record['engine']
                            data['part'] = field_data['part']
                            data['part_no'] = field_data['part_no']
                            data['part_note'] = field_data['part_note']
                            data['filter_loc'] = field_data['filter_loc']
                            data['vin_code'] = field_data['vin_code']
                            data['comment'] = conn.escape_string(field_data['comment'])
                            data['url'] = record['url']
                            save_data(data)
                            found_oil = 0
                        if int(options.verbose) > 2:
                            print 'Data:', str(data)
                    cntx += 1
            rec_cnt += 1
        #End main per query loop
        delay() # delay if wait was passed on cmd line
        records = get_data()
        has_offset = 1
    #End Queries
