Pandas Key Error When Searching For Keyword In "Cell" - python

I am iterating over some data in a pandas dataframe, searching for specific keywords; however, the regex search raises a KeyError: 19.
I've tried to pull the data out of the specific cell, place it in a string object, and search through that, but every time I attempt to point anything at the data in that column, I get a KeyError: 19.
To preface my code example, I have pulled out specific chunks of the dataframe and placed them in a list of lists. (Of these chunks, I have kept all of the columns that were in the original dataframe)
Here is an example of the iteration I am attempting:
for eachGroup in mainList:
    for lineItem in eachGroup:
        if re.search(r'( keyword )', lineItem[19], re.I):
            dostuff
As you might have guessed, the data I am searching for keywords in is column 19 which has data formatted like this:
3/23/2019 11:32:0 3/23/2019 11:32:0 3/23/2019 14:3:0 CSG CHG H6 27 1464D Random Random Random 81
Every other attempt at searching for keywords in different columns executes fine without any errors. Why would this case alone return a KeyError?
To add some more clarity, even the following code produces the same KeyError:
for eachGroup in mainList:
    for lineItem in eachGroup:
        text = lineItem[19]

Here's a WTF moment...
Instead of using python's smart for looping, I decided to be more granular and loop through with a while loop. Needless to say, it worked.
The code below fixes the issue, though why it does I have no clue:
bigCount = len(mainList)
count = 0
while count < bigCount:
    smallCount = 0
    while smallCount < len(mainList[count]):
        if re.search(r'( keyword )', mainList[count][smallCount][19], re.I):
            dostuff
        smallCount += 1
    count += 1

Try changing re.search(r'( keyword )', lineItem[19], re.I): to re.match('(.*)keyword(.*)', lineItem[19]):. Note that both re.search and re.match return a match object on success and None otherwise, so either can be used directly in an if statement; the difference is that re.match only matches at the beginning of the string, which is why the (.*) prefix and suffix are added to ignore any other characters to the left or right of the keyword. Hope it helps.
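As a side note on the KeyError itself: a likely cause (an assumption, since the question doesn't show how mainList was built) is that each lineItem is a pandas Series whose integer labels no longer include 19 after slicing, so lineItem[19] is a label lookup rather than a positional one. A minimal sketch with made-up labels:

```python
import re
import pandas as pd

# Hypothetical: a row taken from a sliced DataFrame keeps its original
# integer labels, so position 19 may no longer exist as a *label*.
row = pd.Series(['3/23/2019 11:32:0', 'CSG CHG', 'Random keyword 81'],
                index=[17, 18, 20])   # note: there is no label 19

try:
    row[19]                  # label-based lookup on an integer index
except KeyError as exc:
    print('KeyError:', exc)  # KeyError: 19

text = str(row.iloc[2])      # positional access works regardless of labels
print(bool(re.search(r'keyword', text, re.I)))  # True
```

With an integer index, plain `[...]` is always label-based, so `.iloc` is the safe way to address "the 20th column" by position.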

Related

What's the logic behind locating elements using letters in pandas?

I have a CSV file. I load it in pandas dataframe. Now, I am practicing the loc method. This CSV file contains a list of James bond movies and I am passing letters in the loc method. I could not interpret the result shown.
bond = pd.read_csv("jamesbond.csv", index_col = "Film")
bond.sort_index(inplace = True)
bond.head(3)
bond.loc["A": "I"]
The result for the above code is:
bond.loc["a": "i"]
And the result for the above code is:
What is happening here? I could not understand. Please someone help me to understand the properties of pandas.
Following is the file:
Your dataframe uses the first column ("Film") as an index when it is imported (because of the option index_col = "Film"). The column contains the name of each film stored as a string, and they all start with a capital letter. bond.loc["A":"I"] returns all films where the index is greater than or equal to "A" and less than or equal to "I" (pandas slices are upper-bound inclusive), which by the rules of string comparison in Python includes all films beginning with "A"-"H", and would also include a film called "I" if there was one. If you enter e.g. "A" <= "b" <= "I" in the python prompt you will see that lower-case letters are not within the range, because ord("b") > ord("I").
If you wrote bond.index = bond.index.str.lower() that would change the index to lower case and you could search films using e.g. bond["a":"i"] (but bond["A":"I"] would no longer return any films).
DataFrame.loc["A":"I"] returns the rows whose index labels fall within that range, from what I can see and have tried to reproduce. Could you attach the data?
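The lexicographic behaviour described above can be checked with a small sketch; the film titles here are made up, since the jamesbond.csv file isn't available:

```python
import pandas as pd

# Hypothetical data standing in for the jamesbond.csv index
bond = pd.DataFrame(
    {"Year": [1962, 1964, 1979, 1995]},
    index=["Dr. No", "Goldfinger", "Moonraker", "GoldenEye"],
)
bond.sort_index(inplace=True)

# .loc slices are label-based and upper-bound inclusive:
print(bond.loc["A":"I"].index.tolist())
# ['Dr. No', 'GoldenEye', 'Goldfinger']  ("Moonraker" sorts after "I")

# Capital letters sort before lowercase ones (ord("I") == 73 < ord("a") == 97),
# so a lowercase slice matches none of these capitalized titles:
print(bond.loc["a":"i"].empty)  # True
```

Because the index is sorted, pandas can resolve slice bounds like "A" and "I" even though neither is an actual label.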

How to strip a value from a delimited string

I have a list which I have joined using the following code:
patternCore = '|'.join(list(Broker['prime_broker_id']))
patternCore
'CITI|CS|DB|JPM|ML'
Not sure why I did it that way, but I used patternCore to filter multiple strings at the same time. Please note that Broker is a DataFrame.
Broker['prime_broker_id']
29 CITI
30 CS
31 DB
32 JPM
33 ML
Name: prime_broker_id, dtype: object
Now I am looking to strip one string. Say I would like to strip 'DB'. How can I do that please?
I tried this
patternCore.strip('DB')
'CITI|CS|DB|JPM|ML'
but nothing is stripped
Thank you
Since Broker is a Pandas dataframe, you can use loc with Boolean indexing, then use pd.Series.tolist:
mask = Broker['prime_broker_id'] != 'DB'
patternCore = '|'.join(Broker.loc[mask, 'prime_broker_id'].tolist())
A more generic solution, which works with objects other than Pandas dataframes, is to use a list comprehension with an if condition:
patternCore = '|'.join([x for x in Broker['prime_broker_id'] if x != 'DB'])
Without returning to your input series, using the same idea you can split and re-join:
patternCore = 'CITI|CS|DB|JPM|ML'
patternCore = '|'.join([x for x in patternCore.split('|') if x != 'DB'])
You should expect the last option to be relatively expensive, since it has to read every character of the input string.
I would like to mention some points which have not been touched upon till now.
I tried this
patternCore.strip('DB')
'CITI|CS|DB|JPM|ML'
but nothing is stripped
The reason why it didn't work is that strip() returns a copy of the string with the leading and trailing characters removed, not characters occurring somewhere in the middle.
NOTE:
The chars argument is not a prefix or suffix; rather, all combinations of its values are stripped.
Here you have specified the argument characters as 'DB', so only the characters 'D' and 'B' at either end of the string are candidates for removal. Had your string been something like 'CITI|CS|JPM|ML|DB', your code would have worked partially (the pipe at the end would remain).
But anyway, this is not good practice, because it would also strip something like
'DCITI|CS|JPM|MLB' to 'CITI|CS|JPM|ML', or 'CITI|CS|JPM|ML|BD' to 'CITI|CS|JPM|ML|'.
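A quick sketch confirming the character-set behaviour of strip() on the strings from this question:

```python
# strip() treats its argument as a *set* of characters to remove
# from both ends of the string, not as a substring.
print('CITI|CS|JPM|ML|DB'.strip('DB'))   # 'CITI|CS|JPM|ML|'  (trailing B, D removed)
print('DCITI|CS|JPM|MLB'.strip('DB'))    # 'CITI|CS|JPM|ML'   (one char from each end)
print('CITI|CS|DB|JPM|ML'.strip('DB'))   # unchanged: 'DB' sits in the middle
```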
I would like to strip 'DB'.
For this part, #jpp has already given a fine answer.

Key word search just in one column of the file and keeping 2 words before and after key word

Love Python, and I am new to Python as well. Here, with the help of the community (users like Antti Haapala), I was able to proceed to some extent. But I got stuck at the end. Please help. I have two tasks remaining before I get into my big data POC (planning to use this code on 1+ million records in a text file).
• Search for a keyword in a column (C#3) and keep the 2 words before and after that keyword.
• Divert the print output to a file.
• Here I don't want to touch C#1, C#2 for referential integrity purposes.
I really appreciate all your help.
My input file:
C #1 C # 2 C# 3 (these are headings of columns, I used just for clarity)
12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it
Desired output file: (only change in Column 3 or last column)
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
Code I am currently using:
s = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """
for line in s.splitlines():
    if not line.strip():
        continue
    fields = line.split(None, 2)
    joined = '|'.join(fields)
    print(joined)
BTW, if I use the keyword search, it also looks at my 1st and 2nd columns. My challenge is to keep the 1st and 2nd columns unchanged, search only the 3rd column, and keep the 2 words before/after the keyword(s).
First I need to warn you that using this code for 1 million records is dangerous. You are dealing with regular expressions, and this method is good only as long as the expressions really are regular; otherwise you may end up creating tons of cases to extract the data you want without also extracting data you don't want.
For 1 million rows you'll want pandas, as a plain for loop is too slow.
import pandas as pd
import re

df = pd.DataFrame({'C1': [12088, 12089],
                   'C2': ["CITA", "CITA"],
                   'C3': ["Hello very nice lists, better to keep those",
                          "This is great theme for lists keep it"]})
df["C3"] = df["C3"].map(lambda x: re.findall(
    r'(?<=Hello)[\w\s,]*(?=keep)|(?<=great)[\w\s,]*', str(x)))
df["C3"] = df["C3"].map(lambda x: x[0].strip())
which gives
df
C1 C2 C3
0 12088 CITA very nice lists, better to
1 12089 CITA theme for lists keep it
There are still some questions left about how exactly you strive to perform your keyword search. One obstacle is already contained in your example: how to deal with characters such as commas? Also, it is not clear what to do with lines that do not contain the keyword. Also, what to do if there are not two words before or two words after the keyword? I guess that you yourself are a little unsure about the exact requirements and did not think about all edge cases.
Nevertheless, I have made some "blind decisions" about these questions, and here is a naive example implementation that assumes that your keyword matching rules are rather simple. I have created the function findword(), and you can adjust it to whatever you like. So, maybe this example helps you find your own requirements.
KEYWORD = "lists"

S = """12088|CITA|{Hello very nice lists, better to keep those
12089|CITA|This is great theme for lists keep it """

def findword(words, keyword):
    """Return index of first occurrence of `keyword` in sequence
    `words`, otherwise return None.

    The current implementation searches for "keyword" as well as
    for "keyword," (with trailing comma).
    """
    for test in (keyword, "%s," % keyword):
        try:
            return words.index(test)
        except ValueError:
            pass
    return None

for line in S.splitlines():
    tokens = line.split("|")
    words = tokens[2].split()
    idx = findword(words, KEYWORD)
    if idx is None:
        # Keyword not found. Print line without change.
        print(line)
        continue
    # Clamp the 2-word window to the bounds of the word list.
    start = max(idx - 2, 0)
    end = min(idx + 3, len(words))
    tokens[2] = " ".join(words[start:end])
    print('|'.join(tokens))
Test:
$ python test.py
12088|CITA|very nice lists, better to
12089|CITA|theme for lists keep it
PS: I hope I got the indices right for slicing. You should check, nevertheless.

Remove Rows that Contain a specific Value anywhere in the row (Pandas, Python 3)

I am trying to remove all rows in a pandas DataFrame that contain the symbol "+" anywhere in the row. So ideally this:
Keyword
+John
Mary+Jim
David
would become
Keyword
David
I've tried doing something like this in my code but it doesn't seem to be working.
excluded = ('+')
removal2 = removal[~removal['Keyword'].isin(excluded)]
The problem is that sometimes the + is contained within a word, at the beginning of a word, or at the end. Any ideas how to help? Do I need to use an index function? Thank you!
Use the vectorised str method contains, passing the escaped pattern '\+' (+ is a regex metacharacter), and negate the boolean condition using ~:
In [29]:
df[~df.Keyword.str.contains(r'\+')]
Out[29]:
Keyword
2 David
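Putting the answer together as a runnable sketch (the DataFrame is reconstructed from the question; regex=False is an alternative to escaping that makes contains treat '+' as a literal character):

```python
import pandas as pd

# Reconstructing the example data from the question
removal = pd.DataFrame({'Keyword': ['+John', 'Mary+Jim', 'David']})

# regex=False treats '+' literally, so no escaping is needed
removal2 = removal[~removal['Keyword'].str.contains('+', regex=False)]
print(removal2['Keyword'].tolist())  # ['David']
```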

Accessing Data using df['foo'] missing data for pattern searching python

So I have this function which takes in one row from a dataframe, matches a pattern, and adds the result to the data. Since the pattern search needs its input to be a string, I am forcing it with str(). However, if I do that, it cuts off my url after a certain point.
I figured out that if I force it using the ix function
str(data.ix[0,'url'])
it does not cut anything off and gets me what I want. But if I use str(data.ix[:,'url']), it also cuts off after some point.
The problem is I cannot specify the index position inside the ix function, as I plan to iterate by row using the apply function. Any suggestions?
def foo(data):
    url = str(data['url'])
    m = re.search(r"model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)", url)
    if m:
        data['make'] = m.group("make")
        data['model'] = m.group("model")
    return data
Iterating row-by-row is a last resort. It's almost always slower, less readable, and less idiomatic.
Fortunately, there is an easy way to do what you want to do. Check out the Series.str.extract method, added in version 0.13 of pandas.
Something like this...
pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'
extracted_data = data['url'].str.extract(pattern)
The result, extracted_data, will be a new DataFrame with columns named 'model' and 'make', inferred from the named groups in your regex pattern.
Join it to your original DataFrame, and you're done.
data = data.join(extracted_data)
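A minimal end-to-end sketch of this approach, with made-up URLs (note that extract lives on the .str accessor of the url column, not on the DataFrame itself):

```python
import pandas as pd

# Hypothetical URLs standing in for the question's data
data = pd.DataFrame({'url': [
    'http://example.com/search?model=Civic&id=12&make=Honda',
    'http://example.com/search?model=Corolla&id=7&make=Toyota',
]})

pattern = r'model=(?P<model>\w+)&id=\d+&make=(?P<make>\w+)'

# str.extract returns one column per named group in the pattern
extracted_data = data['url'].str.extract(pattern)
data = data.join(extracted_data)
print(data[['make', 'model']])
```

This vectorised version replaces the row-by-row apply entirely, and sidesteps the string-truncation issue, which is only an artifact of how pandas abbreviates long values when printing.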
