pandas: exact match does not work in an if AND condition - python

I have two dataframes as follows:
import pandas as pd
data = {'First': [['First', 'value'], ['second', 'value'], ['third', 'value', 'is'], ['fourth', 'value', 'is']],
        'Second': ['noun', 'not noun', 'noun', 'not noun']}
df = pd.DataFrame(data, columns=['First', 'Second'])
and
data2 = {'example': ['First value is important', 'second value is important too', 'it us good to know',
                     'Firstap is also good', 'aplsecond is very good']}
df2 = pd.DataFrame(data2, columns=['example'])
I have written the following code to keep only the sentences from df2 whose first word matches the first word of a row in df, and only if the second column of that same row is exactly 'noun'. So there are two conditions that must hold together.
def checker():
    result = []
    for l in df2.example:
        df['first_unlist'] = [','.join(map(str, l)) for l in df.First]
        if df.first_unlist.str.match(pat=l.split(' ', 1)[0]).any() and df.Second.str.match('noun').any():
            result.append(l)
    return result
However, when I run the function I get ['First value is important', 'second value is important too'] as the output, which shows that the second condition (the 'noun' filter) does not work. My desired output is ['First value is important'].
I have also tried .str.contains() and .eq(), but I still got the same output.

I would suggest filtering df on the 'noun' condition first and only then matching the first word:
def checker():
    result = []
    # first words of the rows whose Second column is exactly 'noun'
    first_unlist = [x[0] for x in df.loc[df.Second == 'noun', 'First']]
    for l in df2.example:
        if l.split(' ')[0] in first_unlist:
            result.append(l)
    return result
checker()
['First value is important']
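The original condition fails because each `.any()` is evaluated over the whole column, so the first-word test and the 'noun' test are never tied to the same row. The following is a minimal vectorised sketch of the same filter, assuming the `df` and `df2` from the question; the names `noun_first_words` and `first_words` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'First': [['First', 'value'], ['second', 'value'],
              ['third', 'value', 'is'], ['fourth', 'value', 'is']],
    'Second': ['noun', 'not noun', 'noun', 'not noun'],
})
df2 = pd.DataFrame({'example': ['First value is important', 'second value is important too',
                                'it us good to know', 'Firstap is also good',
                                'aplsecond is very good']})

# First words of the rows whose Second column is exactly 'noun'
noun_first_words = set(df.loc[df['Second'] == 'noun', 'First'].str[0])

# First word of every sentence, compared for exact equality (no prefix matching)
first_words = df2['example'].str.split().str[0]
result = df2.loc[first_words.isin(noun_first_words), 'example'].tolist()
print(result)
```

Because `isin` compares whole strings, 'Firstap' is not treated as a match for 'First'.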

Related

Remove digits from a list of strings in pandas column

I have this pandas dataframe
0 Tokens
1: 'rice', 'XXX', '250g'
2: 'beer', 'XXX', '750cc'
All the tokens in a row ('rice', 'XXX' and '250g') are in the same list of strings, and the lists are in the same column.
I want to remove the digits, but because they are attached to other characters in the same token,
they are not removed.
I have tried this code:
def remove_digits(tokens):
    """
    Remove digits from a string
    """
    return [''.join([i for i in tokens if not i.isdigit()])]
df["Tokens"] = df.Tokens.apply(remove_digits)
df.head()
But it only joined the strings, which is not what I want.
My desired output:
0 Tokens
1: 'rice', 'XXX', 'g'
2: 'beer', 'XXX', 'cc'
This is possible using pandas methods, which are vectorised and so more efficient than looping.
import pandas as pd
df = pd.DataFrame({"Tokens": [["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]]})
col = "Tokens"
df[col] = (
    df[col]
    .explode()
    .str.replace(r"\d+", "", regex=True)  # raw string so \d reaches the regex engine unchanged
    .groupby(level=0)
    .agg(list)
)
# Tokens
# 0 [rice, XXX, g]
# 1 [beer, XXX, cc]
Here we use:
pandas.Series.explode to convert the Series of lists into rows
pandas.Series.str.replace to replace occurrences of \d+ (one or more digits 0-9) with "" (nothing)
pandas.Series.groupby to group the Series by index (level=0) and put them back into lists (.agg(list))
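To make the three steps above easier to follow, here is a small sketch that keeps each intermediate result in its own variable. The variable names (`tokens`, `exploded`, `cleaned`, `regrouped`) are illustrative only.

```python
import pandas as pd

tokens = pd.Series([["rice", "XXX", "250g"], ["beer", "XXX", "750cc"]])

# Step 1: one token per row; the original row label is repeated in the index
exploded = tokens.explode()

# Step 2: strip every run of digits from each token
cleaned = exploded.str.replace(r"\d+", "", regex=True)

# Step 3: group by the repeated index label and collect the tokens back into lists
regrouped = cleaned.groupby(level=0).agg(list)

print(exploded.index.tolist())
print(regrouped.tolist())
```

The repeated index produced by `explode` is what allows `groupby(level=0)` to reassemble the original rows.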
Here's a simple solution -
df = pd.DataFrame({'Tokens': [['rice', 'XXX', '250g'],
                              ['beer', 'XXX', '750cc']]})
def remove_digits_from_string(s):
    return ''.join([x for x in s if not x.isdigit()])
def remove_digits(l):
    return [remove_digits_from_string(s) for s in l]
df["Tokens"] = df.Tokens.apply(remove_digits)
You can use to_list + re.sub to update your original dataframe in place.
import re
for index, lst in enumerate(df['Tokens'].to_list()):
    lst = [re.sub(r'\d+', '', i) for i in lst]
    # .at writes a single cell, so the whole list is stored as one value
    df.at[index, 'Tokens'] = lst
print(df)
Output:
Tokens
0 [rice, XXX, g]
1 [beer, XXX, cc]

pandas: instead of applying the function to df get the result as a list from the function

I have a dataframe like the following:
df = pd.DataFrame({ 'text':['the weather is nice though', 'How are you today','the beautiful girl and the nice boy'],
'pos':["['DET', 'NOUN', 'VERB','ADJ', 'ADV']","['QUA', 'VERB', 'PRON', 'ADV']", "['DET', 'ADJ', 'NOUN','CON','DET', 'ADJ', 'NOUN' ]"]})
I have a function that outputs the corresponding word and its index wherever pos == 'ADJ', like the following:
import ast
import pandas as pd
def extract_words(row):
    word_pos = {}
    text_splited = row.text.split()
    pos = ast.literal_eval(row.pos)  # pos is stored as the string form of a list
    for i, p in enumerate(pos):
        if p == 'ADJ':
            word_pos[text_splited[i]] = i
    return word_pos
df['Third_column'] = ' '
df['Third_column'] = df.apply(extract_words, axis=1)
What I would like to do is refactor the function so that I do not have to apply it to df from outside, and can instead append the results to a list defined outside the function. So far I have tried this:
list_word_index = []
def extract_words(dataframe):
    for li in dataframe.text.str.split():
        for lis in dataframe.pos:
            for i, p in enumerate(ast.literal_eval(lis)):
                if p == 'nk':
                    ...
                    list_word_index.append(...)
extract_words(df)
I do not know how to fill in the ... part of the code.
Here's how you could use the function to get a list back, based on your DataFrame (note that pos holds real lists here, not strings):
from typing import List
import pandas as pd
df = pd.DataFrame({'text': ['the weather is nice though', 'How are you today', 'the beautiful girl and the nice boy'],
                   'pos': [['DET', 'NOUN', 'VERB', 'ADJ', 'ADV'], ['QUA', 'VERB', 'PRON', 'ADV'], ['DET', 'ADJ', 'NOUN', 'CON', 'DET', 'ADJ', 'NOUN']]})
def extract_words_to_list(df: pd.DataFrame) -> List:
    # iterate over dataframe row-wise
    tmp = []
    for _, row in df.iterrows():
        word_pos = {}
        text_splited = row.text.split()
        for i, p in enumerate(row.pos):
            if p == 'ADJ':
                word_pos[text_splited[i]] = i
        tmp.append(word_pos)
    return tmp
list_word_index = extract_words_to_list(df)
list_word_index # [{'nice': 3}, {}, {'beautiful': 1, 'nice': 5}]
Though you could also just use:
df['Third_column'] = df.apply(extract_words, axis=1)
df['Third_column'].tolist() # [{'nice': 3}, {}, {'beautiful': 1, 'nice': 5}]
To achieve the same thing.
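As a further variation, the same list can be built without `iterrows` or `apply` by zipping the two columns. This is only a sketch: the helper name `adjective_positions` is hypothetical, and it assumes `pos` is stored as strings, as in the question's DataFrame (real lists are accepted too).

```python
import ast
import pandas as pd

df = pd.DataFrame({
    'text': ['the weather is nice though', 'How are you today',
             'the beautiful girl and the nice boy'],
    'pos': ["['DET', 'NOUN', 'VERB','ADJ', 'ADV']",
            "['QUA', 'VERB', 'PRON', 'ADV']",
            "['DET', 'ADJ', 'NOUN','CON','DET', 'ADJ', 'NOUN' ]"],
})

def adjective_positions(text, pos):
    """Map each word tagged 'ADJ' to its position in the sentence."""
    # Parse the tag list if it is stored as a string, otherwise use it directly
    tags = ast.literal_eval(pos) if isinstance(pos, str) else pos
    words = text.split()
    # zip stops at the shorter sequence, so a length mismatch cannot raise IndexError
    return {word: i for i, (word, tag) in enumerate(zip(words, tags)) if tag == 'ADJ'}

list_word_index = [adjective_positions(t, p) for t, p in zip(df['text'], df['pos'])]
print(list_word_index)
```

As in the original function, a word that is tagged 'ADJ' twice in one sentence keeps only its last position, because dictionary keys are unique.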

How to access the elements between zero th index and date index from sublist in python3?

How can I access the elements between the zeroth index and the index of the date in each sublist?
1. Find the elements between the zeroth index and the date index.
2. Concatenate those elements into a single string.
3. Insert the concatenated string at index 1 of the sublist and remove the original, separate elements.
import re
nested_list = [["1","a","b","22/01/2014","variable"],["2","c","d"],
               ["3","e","f","23/01/2014","variable"]]
sub_list = []
for i in range(0, len(nested_list)):
    concat = ''
    date_index = ''  # reset for every sublist
    for j in range(0, len(nested_list[i])):
        temp = re.search(r"[\d]{1,2}/[\d]{1,2}/[\d]{4}", nested_list[i][j])
        if temp:
            date_index = j
    if date_index:
        for d in range(1, date_index):
            concat = concat + ' ' + nested_list[i][d]
        print(concat)
Expected Output:
nested_list =[["1","a b","22/01/2014","variable"],["2","c","d"],["3","e f","23/01/2014","variable"]]
I want the elements between the zeroth index and the date index; that is why the elements of ["2","c","d"], which contains no date, are not combined. (@Patrick Artner)
Here you go:
import re
nested_list = [["1","a","b","22/01/2014"],["2","c","d"], ["3","e","f","23/01/2014"]]
result = []
for inner in nested_list:
    if re.match(r"\d{1,2}/\d{1,2}/\d{4}", inner[-1]):  # simplified regex
        # list slicing to get the result
        result.append([inner[0]] + [' '.join(inner[1:-1])] + [inner[-1]])
    else:
        # add as is
        result.append(inner)
print(result)
Output:
[['1', 'a b', '22/01/2014'], ['2', 'c', 'd'], ['3', 'e f', '23/01/2014']]
Edit, because dates might also occur in the middle of a sublist, which was not covered by the original question's data:
import re
nested_list = [["1","a","b","22/01/2014"], ["2","c","d"],
               ["3","e","f","23/01/2014","e","f","23/01/2014"]]
result = []
for inner in nested_list:
    # get all date positions
    datepos = [idx for idx, value in enumerate(inner)
               if re.match(r"\d{1,2}/\d{1,2}/\d{4}", value)]
    if datepos:
        # add elem 0
        r = [inner[0]]
        # get tuple positions of where dates are
        for start, stop in zip([0] + datepos, datepos):
            # join between the positions
            r.append(' '.join(inner[start+1:stop]))
            # add the date
            r.append(inner[stop])
        result.append(r)
        # add anything _behind_ the last found date
        if datepos[-1] < len(inner):
            result[-1].extend(inner[datepos[-1]+1:])
    else:
        # add as is
        result.append(inner)
print(result)
Output:
[['1', 'a b', '22/01/2014'],
['2', 'c', 'd'],
['3', 'e f', '23/01/2014', 'e f', '23/01/2014']]
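The question's expected output also keeps the trailing "variable" element after the date. The following is a minimal sketch for that exact case, assuming only the first date in each sublist matters; the helper name `merge_before_date` and the constant `DATE_RE` are made up for illustration.

```python
import re

DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")

def merge_before_date(row):
    """Join the elements between index 0 and the first date into one string."""
    # Index of the first element that is entirely a date, or None if there is none
    date_idx = next((i for i, value in enumerate(row) if DATE_RE.fullmatch(value)), None)
    if date_idx is None or date_idx < 2:
        # No date, or nothing between index 0 and the date: keep the row unchanged
        return list(row)
    return [row[0], " ".join(row[1:date_idx])] + row[date_idx:]

nested_list = [["1", "a", "b", "22/01/2014", "variable"], ["2", "c", "d"],
               ["3", "e", "f", "23/01/2014", "variable"]]
result = [merge_before_date(row) for row in nested_list]
print(result)
```

Everything from the date onward is copied unchanged, so elements after the date survive without extra handling.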

Inverse of string format in python

In Python, we can use str.format to construct a string like this:
string_format + value_of_keys = formatted_string
E.g.:
FMT = '{name:} {age:} {gender}' # string_format
VoK = {'name':'Alice', 'age':10, 'gender':'F'} # value_of_keys
FoS = FMT.format(**VoK) # formatted_string
In this case, formatted_string = 'Alice 10 F'
I am wondering whether there is a way to recover value_of_keys from formatted_string and string_format. It should be a function Fun such that
VoK = Fun('{name:} {age:} {gender}', 'Alice 10 F')
# the value of VoK is expected to be {'name':'Alice', 'age':10, 'gender':'F'}
Is there any way to write this function Fun?
ADDED:
Note that '{name:} {age:} {gender}' and 'Alice 10 F' are just the simplest example. Realistic cases can be harder; for instance, the space delimiter may not exist.
Mathematically speaking, most cases are not reversible, such as:
FMT = '{key1:}{key2:}'
FoS = 'HelloWorld'
The VoK could be any of the following:
{'key1':'Hello','key2':'World'}
{'key1':'Hell','key2':'oWorld'}
....
So, to make this question well defined, I would like to add two conditions:
1. There is always a delimiter between two keys.
2. No delimiter occurs inside any of the values in value_of_keys.
In this case, the question is solvable (mathematically speaking) :)
Another example shown with input and expected output:
In '{k1:}+{k2:}={k3:}', '1+2=3' Out {'k1':1,'k2':2, 'k3':3}
In 'Hi, {k1:}, this is {k2:}', 'Hi, Alice, this is Bob' Out {'k1':'Alice', 'k2':'Bob'}
You can indeed do this, but with a slightly different format string, called regular expressions.
Here is how you do it:
import re
# this is how you write your "format"
regex = r"(?P<name>\w+) (?P<age>\d+) (?P<gender>[MF])"
test_str = "Alice 10 F"
groups = re.match(regex, test_str)
Now you can use groups to access all the components of the string:
>>> groups.group('name')
'Alice'
>>> groups.group('age')
'10'
>>> groups.group('gender')
'F'
Regex is a very cool thing. I suggest you learn more about it online.
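Such a regular expression can also be derived from the format string mechanically. The sketch below uses `string.Formatter().parse` from the standard library to split the format into literal text and field names. The function name `unformat` is hypothetical, and the sketch assumes that every field is named with a valid identifier and that delimiters never occur inside values (the two conditions added to the question). All values come back as strings.

```python
import re
from string import Formatter

def unformat(fmt, text):
    """Recover {field: value} from text produced by fmt.format(...), or None if no match."""
    parts = []
    for literal, field, _spec, _conversion in Formatter().parse(fmt):
        # Literal text between fields must match exactly
        parts.append(re.escape(literal))
        if field is not None:
            # Non-greedy named group: the value ends at the next delimiter
            parts.append(f"(?P<{field}>.+?)")
    match = re.fullmatch("".join(parts), text)
    return match.groupdict() if match else None

print(unformat('{name:} {age:} {gender}', 'Alice 10 F'))
print(unformat('Hi, {k1:}, this is {k2:}', 'Hi, Alice, this is Bob'))
```

Format specs such as `:3` are ignored here, so converting '10' back to an integer would have to be done by the caller.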
I wrote a function and it seems to work:
import re
def Fun(fmt, res):
    reg_keys = '{([^{}:]+)[^{}]*}'
    reg_fmts = '{[^{}:]+[^{}]*}'
    pat_keys = re.compile(reg_keys)
    pat_fmts = re.compile(reg_fmts)
    keys = pat_keys.findall(fmt)
    lmts = pat_fmts.split(fmt)
    temp = res
    values = []
    for lmt in lmts:
        if not len(lmt) == 0:
            value, temp = temp.split(lmt, 1)
            if len(value) > 0:
                values.append(value)
    if len(temp) > 0:
        values.append(temp)
    return dict(zip(keys, values))
Usage:
eg1:
fmt = '{k1:}+{k2:}={k:3}'
res = '1+1=2'
print(Fun(fmt, res))
>>>{'k1': '1', 'k2': '1', 'k': '2'}
eg2:
fmt = '{name:} {age:} {gender}'
res = 'Alice 10 F'
print(Fun(fmt, res))
>>>{'name': 'Alice', 'age': '10', 'gender': 'F'}
eg3:
fmt = 'Hi, {k1:}, this is {k2:}'
res = 'Hi, Alice, this is Bob'
print(Fun(fmt, res))
>>>{'k1': 'Alice', 'k2': 'Bob'}
There is no way for Python to determine how a formatted string was created once you only have the resulting string.
For example, if you fill the format "{something} {otherthing}" with values that contain spaces, you cannot tell from the result whether a given word belonged to {something} or to {otherthing}.
However, you may use some hacks if you know the format of the resulting string and the results are consistent.
For example, in your given example: if you are sure that you will have a word followed by a space, then a number, then another space and then a word, you may use the regex below to extract the values:
>>> import re
>>> my_str = 'Alice 10 F'
>>> re.findall(r'(\w+)\s(\d+)\s(\w+)', my_str)
[('Alice', '10', 'F')]
In order to get the desired dict from this, you may update the logic as:
>>> my_keys = ['name', 'age', 'gender']
>>> dict(zip(my_keys, re.findall(r'(\w+)\s(\d+)\s(\w+)', my_str)[0]))
{'name': 'Alice', 'age': '10', 'gender': 'F'}
I suggest another approach to this problem using **kwargs, such as:
def fun(**kwargs):
    result = '{'
    for key, value in kwargs.items():
        result += '{}:{} '.format(key, value)
    # strip the trailing space
    result = result[:-1]
    result += '}'
    return result
print(fun(name='Alice', age='10', gender='F'))
# outputs : {name:Alice age:10 gender:F}
NOTE: kwargs preserves the order of the keyword arguments only from Python 3.6 onward; in earlier versions it is a plain, unordered dict. If you need a stable order there, it is easy to build a workaround.
This code produces strings for all the values, but it does split the string into its constituent components. It depends on the delimiter being a space, and none of the values containing a space. If any of the values contains a space this becomes a much harder problem.
>>> delimiters = ' '
>>> d = {k: v for k,v in zip(('name', 'age', 'gender'), 'Alice 10 F'.split(delimiters))}
>>> d
{'name': 'Alice', 'age': '10', 'gender': 'F'}
For your requirement, I have a solution.
The concept of this solution is:
1. Change all delimiters to the same delimiter
2. Split the input string by that delimiter
3. Get the keys
4. Get the values
5. Zip the keys and values into a dict
import re
from collections import OrderedDict
def Func(data, delimiters, delimiter):
    # change all delimiters to delimiter
    for d in delimiters:
        data[0] = data[0].replace(d, delimiter)
        data[1] = data[1].replace(d, delimiter)
    # get keys with '{}'
    keys = data[0].split(delimiter)
    # if string starts with delimiter remove first empty element
    if keys[0] == '':
        keys = keys[1:]
    # get keys without '{}'
    p = re.compile(r'{([\w\d_]+):*.*}')
    keys = [p.match(x).group(1) for x in keys]
    # get values
    vals = data[1].split(delimiter)
    # if string starts with delimiter remove first empty element
    if vals[0] == '':
        vals = vals[1:]
    # pack to a dict
    result_1 = dict(zip(keys, vals))
    # if you need Ordered Dict
    result_2 = OrderedDict(zip(keys, vals))
    return result_1, result_2
The usage:
In_1 = ['{k1}+{k2:}={k3:}', '1+2=3']
delimiters_1 = ['+', '=']
result = Func(In_1, delimiters_1, delimiters_1[0])
# Out_1 = {'k1':1,'k2':2, 'k3':3}
print(result)
In_2 = ['Hi, {k1:}, this is {k2:}', 'Hi, Alice, this is Bob']
delimiters_2 = ['Hi, ', ', this is ']
result = Func(In_2, delimiters_2, delimiters_2[0])
# Out_2 = {'k1':'Alice', 'k2':'Bob'}
print(result)
The output:
({'k3': '3', 'k2': '2', 'k1': '1'},
OrderedDict([('k1', '1'), ('k2', '2'), ('k3', '3')]))
({'k2': 'Bob', 'k1': 'Alice'},
OrderedDict([('k1', 'Alice'), ('k2', 'Bob')]))
Try this:
import re
def fun():
    k = 'Alice 10 F'
    c = '{name:} {age:} {gender}'
    # drop the braces and colons so only the key names remain
    l = re.sub('[:}{]', '', c)
    d = {}
    for i, j in zip(k.split(), l.split()):
        d[j] = i
    print(d)
You can turn k and c into parameters of fun as you wish. It accepts the same strings you want to give and prints a dict like this:
{'name': 'Alice', 'age': '10', 'gender': 'F'}
I think the only right answer is that what you are looking for is not possible in general. You just don't have enough information. A good example is:
#python 3
a="12"
b="34"
c="56"
string=f"{a}{b}{c}"
dic = fun("{a}{b}{c}",string)
Now dic might be {"a":"12","b":"34","c":"56"}, but it might just as well be {"a":"1","b":"2","c":"3456"}. So any universal reverse-format function would ultimately fail because of this ambiguity. You could obviously force a delimiter between each variable, but that would defeat the purpose of the function.
I know this was already stated in the comments, but it should also be added as an answer for future visitors.
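To put a number on that ambiguity, the sketch below enumerates every way to cut '123456' into three consecutive pieces. The helper name `all_splits` is hypothetical, and it assumes each value is non-empty; allowing empty values would add even more candidates.

```python
from itertools import combinations

def all_splits(text, n_fields):
    """Yield every way to cut text into n_fields non-empty consecutive pieces."""
    # Choose n_fields - 1 cut positions strictly inside the string
    for cuts in combinations(range(1, len(text)), n_fields - 1):
        bounds = (0, *cuts, len(text))
        yield tuple(text[a:b] for a, b in zip(bounds, bounds[1:]))

candidates = list(all_splits("123456", 3))
print(len(candidates))
for candidate in candidates:
    print(dict(zip("abc", candidate)))
```

Every printed dict formats back to the same string with "{a}{b}{c}", so none of them can be preferred without extra information such as delimiters or field widths.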

searching through a list within lists

I want to write code that reads a file into a list, strips and splits each line, and finds the lines from Payments.txt that match a rule: the customer has a status of "A" AND still has money outstanding. I can do the first two steps and part of the third: I can find the customers with amounts outstanding, but not the ones with a status of "A". I have to use a list, not a dictionary, by the way.
This is my code below:
myList = []
Status = "A"
myFile = open("Payments.txt")
record = myFile.readlines()
for line in record:
    myList.append(line.strip().split(','))
myFile.close()
for z in record:
    details = [[x for x in myList if x[0] == Status], [x for x in myList if x[2] > x[4]]]  # This is where I am having trouble
    if details:
        print(details)
        break
And this is the result:
[[], [['E1234', '12/09/14', '440', 'A', '0'], ['E3431', '10/01/12', '320', 'N', '120'], ['E5322', '05/04/02', '503', 'A', '320'], ['E9422', '26/11/16', '124', 'N', '0']]]
Why am I getting an empty list at the start of the result? There isn't a blank line in Payments.txt.
The list structure is as follows:
['Customer number', 'Date when they joined', 'Total amount', 'Status', 'Amount paid']
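Try this:
[x for x in myList if len(x) == 5 and x[3] == Status and int(x[2]) > int(x[4])]
In len(x) == 5, 5 is the length of a well-formed record. It is preferable to replace it with a variable. The amounts are converted with int() because they are read from the file as strings, and strings compare character by character rather than numerically.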
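Try this:
details = [x for x in myList if x[3] == Status and int(x[2]) > int(x[4])]
I think you are getting the empty list because the two checks (status "A" and the greater-than comparison) are written as two separate list comprehensions instead of being combined with and. Note also that, according to your list structure, the status is at index 3, not index 0, so x[0] == Status never matches.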
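Putting both conditions together, a minimal self-contained sketch could look like the following. The sample lines are reconstructed from the output shown in the question (the real Payments.txt is not available), and the constant names and the helper `active_with_balance` are made up for illustration.

```python
# Column positions, following the list structure given in the question
TOTAL_IDX, STATUS_IDX, PAID_IDX = 2, 3, 4

# Assumed file contents, rebuilt from the printed result in the question
lines = [
    "E1234,12/09/14,440,A,0\n",
    "E3431,10/01/12,320,N,120\n",
    "E5322,05/04/02,503,A,320\n",
    "E9422,26/11/16,124,N,0\n",
]
records = [line.strip().split(',') for line in lines]

def active_with_balance(records, status="A"):
    """Return the records with the given status whose total exceeds the amount paid."""
    return [r for r in records
            if len(r) == 5
            and r[STATUS_IDX] == status
            and int(r[TOTAL_IDX]) > int(r[PAID_IDX])]  # compare as numbers, not strings

details = active_with_balance(records)
print(details)

# Why int() matters: as strings, '1000' sorts before '999'
print('1000' > '999', 1000 > 999)
```

Both conditions sit in one comprehension joined by `and`, so a record is kept only when it satisfies both at once.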
