Apply operation and a division operation in the same step using Python - python

I am trying to get proportion of nouns in my text using the code below and it is giving me an error. I am using a function that calculates the number of nouns in my text and I have the overall word count in a different column.
pos_family = {
'noun' : ['NN','NNS','NNP','NNPS']
}
def check_pos_tag(x, flag):
cnt = 0
try:
for tag,value in x.items():
if tag in pos_family[flag]:
cnt +=value
except:
pass
return cnt
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used nltk package to get the counts by PoS tags and I have the counts in a dictionary in PoS_Count column in my dataframe.
If I remove "/df2['word_count']" in the first run and get the noun count and include it again and run, it works fine but if I run it for the first time I get the below error.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!

As you have guessed, the problem is in the /df2['word_count'] bit.
df2['word_count'] is a pandas series, but you need to use a float or int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it.
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation alone.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']

Related

PySpark / Python Slicing and Indexing Issue

Can someone let me know how to pull out certain values from a Python output.
I would like the retrieve the value 'ocweeklyreports' from the the following output using either indexing or slicing:
'config': '{"hiveView":"ocweeklycur.ocweeklyreports"}
This should be relatively easy, however, I'm having problem defining the Slicing / Indexing configuation
The following will successfully give me 'ocweeklyreports'
myslice = config['hiveView'][12:30]
However, I need the indexing or slicing modified so that I will get any value after'ocweeklycur'
I'm not sure what output you're dealing with and how robust you're wanting it but if it's just a string you can do something similar to this (for a quick and dirty solution).
input = "Your input"
indexStart = input.index('.') + 1 # Get the index of the input at the . which is where you would like to start collecting it
finalResponse = input[indexStart:-2])
print(finalResponse) # Prints ocweeklyreports
Again, not the most elegant solution but hopefully it helps or at least offers a starting point. Another more robust solution would be to use regex but I'm not that skilled in regex at the moment.
You could almost all of it using regex.
See if this helps:
import re
def search_word(di):
st = di["config"]["hiveView"]
p = re.compile(r'^ocweeklycur.(?P<word>\w+)')
m = p.search(st)
return m.group('word')
if __name__=="__main__":
d = {'config': {"hiveView":"ocweeklycur.ocweeklyreports"}}
print(search_word(d))
The following worked best for me:
# Extract the value of the "hiveView" key
hive_view = config['hiveView']
# Split the string on the '.' character
parts = hive_view.split('.')
# The value you want is the second part of the split string
desired_value = parts[1]
print(desired_value) # Output: "ocweeklyreports"

Python: Find and increment a number in a string

I can't find a solution to this, so I'm asking here. I have a string that consists of several lines and in the string I want to increase exactly one number by one.
For example:
[CENTER]
[FONT=Courier New][COLOR=#00ffff][B][U][SIZE=4]{title}[/SIZE][/U][/B][/COLOR][/FONT]
[IMG]{cover}[/IMG]
[IMG]IMAGE[/IMG][/CENTER]
[QUOTE]
{description_de}
[/QUOTE]
[CENTER]
[IMG]IMAGE[/IMG]
[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]01/5
[IMG]IMAGE[/IMG]
[spoiler]
[spoiler=720p]
[CODE=rich][color=Turquoise]
{mediaInfo1}
[/color][/code]
[/spoiler]
[spoiler=1080p]
[CODE=rich][color=Turquoise]
{mediaInfo2}
[/color][/code]
[/spoiler]
[/spoiler]
[hide]
[IMG]IMAGE[/IMG]
[/hide]
[/CENTER]
I'm getting this string from a request and I want to increment the episode by 1. So from 01/5 to 02/5.
What is the best way to make this possible?
I tried to solve this via regex but failed miserably.
Assuming the number you want to change is always after a given pattern, e.g. "Episodes: [/B]", you can use this code:
def increment_episode_num(request_string, episode_pattern="Episodes: [/B]"):
idx = req_str.find(episode_pattern) + len(episode_pattern)
episode_count = int(request_string[idx:idx+2])
return request_string[:idx]+f"{(episode_count+1):0>2}"+request_string[idx+2:]
For example, given your string:
req_str = """[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]01/5
"""
res = increment_episode_num(req_str)
print(res)
which gives you the desired output:
[B]Duration: [/B]~5 min
[B]Genre: [/B]Action
[B]Subgenre: [/B]Mystery, Scifi
[B]Language: [/B]English
[B]Subtitles: [/B]German
[B]Episodes: [/B]02/5
As #Barmar suggested in Comments, and following the example from the documentation of re, also formatting to have the right amount of zeroes as padding:
pattern = r"(?<=Episodes: \[/B\])[\d]+?(?=/\d)"
def add_one(matchobj):
number = str(int(matchobj.group(0)) + 1)
return "{0:0>2}".format(number)
re.sub(pattern, add_one, request)
The pattern uses look-ahead and look-behind to capture only the number that corresponds to Episodes, and should work whether it's in the format 01/5 or 1/5, but always returns in the format 01/5. Of course, you can expand the function so it recognizes the format, or even so it can add different numbers instead of only 1.

How to search for multiple substrings using text.find

I'm a Python beginner, so please forgive me if I'm not using the right lingo and if my code includes blatant errors.
I have text data (i.e., job descriptions from job postings) in one column of my data frame. I want to determine which job ads contain any of the following strings: bachelor, ba/bs, bs/ba.
The function I wrote doesn't work because it produces an empty column (i.e., all zeros). It works fine if I just search for one substring at a time. Here it is:
def requires_bachelor(text):
if text.find('bachelor|ba/bs|bs/ba')>-1:
return True
else:
return False
df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).map({True:1, False:0})
Thanks so much to anyone who is willing to help!
Here's my approach. You were pretty close but you need to check for each of the items individually. If any of the available "Bachelor tags" exist, return true. Then instead of using map({true:1, false:0}), you can use map(bool) to make it a bit nicer. Good luck!
import pandas as pd
df_jobs = pd.DataFrame({"name":["bob", "sally"], "description":["bachelor", "ms"]})
def requires_bachelor(text):
return any(text.find(a) > -1 for a in ['bachelor', 'ba/bs','bs/ba']) # -1 if not found
df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).map(bool)
The | in search string does not work like or operator. You should divide it into three calls like this:
if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
You could try doing:
bachelors = ["bachelor", "ba/bs", "bs/ba"]
if any(bachelor in text for bachelor in bachelors):
return True
Instead of writing a custom function that requires .apply (which will be quite slow), you can use str.contains for this. Also, you don't need map to turn booleans into 1 and 0; try using astype(int) instead.
df_jobs = pd.DataFrame({'description': ['job ba/bs', 'job bachelor',
'job bs/ba', 'job ba']})
df_jobs['bachelor'] = df_jobs.description.str.contains(
'bachelor|ba/bs|bs/ba', regex=True).astype(int)
print(df_jobs)
description bachelor
0 job ba/bs 1
1 job bachelor 1
2 job bs/ba 1
3 job ba 0
# note that the pattern does not look for match on simply "ba"!
So, you are checking for a string bachelor|ba/bs|bs/ba in the list, Which I don't believe will exist in any case...
What I suggest you do is to check for all possible combinations in the IF, and join them with a or statement, as follows:
def requires_bachelor(text):
if text.find('bachelor')>-1 or text.find('ba/bs')>-1 or text.find('bs/ba')>-1:
return True
else:
return False
df_jobs['bachelor']=df_jobs['description'].apply(requires_bachelor).map({True:1, False:0})
It can all be done simply in one line in Pandas
df_jobs['bachelor'] = df_jobs['description'].str.contains(r'bachelor|bs|ba')

how to return string with re when formatting is different?

Introduction to the problem
I have inputs in a .txt file and I want to 'extract' the values when a velocity is given.
Inputs have the form: velocity\t\val1\t\val2...\tvaln
[...]
16\t1\t0\n
1.0000\t9.3465\t8.9406\t35.9604\n
2.0000\t10.4654\t9.9456\t36.9107\n
3.0000\t11.1235\t10.9378\t37.1578\n
[...]
What have I done
I have written a piece of code to return values when a velocity is requested:
def values(input,velocity):
return re.findall("\n"+str(velocity)+".*",input)[-1][1:]
It works "backwards" because I want to ignore the first row from the inputs (16\t1\t0\n), this way if I call:
>>>values('inputs.txt',16)
>>>16.0000\t0.5646\t14.3658\t1.4782\n
But it has a big problem: if I call the function for 1, it returns the value for 19.0000
Since I thought all inputs would be in the same format I made a litte fix:
def values(input,velocity):
if velocity <= 5: #Because velocity goes to 50
velocity = str(velocity)+'.0'
return re.findall("\n"+velocity+".*",input)[-1][1:]
And it works pretty well, maybe is not the most beautiful (or efficient) way of do it but I'm a beginner.
The problem
But with this code I have a problem and it is that sometimes inputs have this form:
[...]
16\t1\t0\n
1\t9.3465\t8.9406\t35.9604\n
2\t10.4654\t9.9456\t36.9107\n
3\t11.1235\t10.9378\t37.1578\n
[...]
And, of course my solution doesn't work
So, is there any pattern that fit both kinds of inputs?
Thank you for your help.
P.S. I have a solution using the function split('\n') and indexes but I would like to solve it with re library:
def values(input,velocity):
return input.split('\n)[velocity+1] #+1 to avoid first row
You could use a positive look ahead to check that after your velocity there is either a period or a tab. That will stop you picking up further numbers without hardcoding there must be .0. This means that velocity 1 will be able to match 1 or 1.xxxxx
import re
from typing import List
def find_by_velocity(velocity: int, data: str) -> List[str]:
return re.findall(r"\n" + str(velocity) + r"(?=\.|\t).*", data)
data = """16\t1\t0\n1\t9.3465\t8.9406\t35.9604\n2\t10.4654\t9.9456\t36.9107\n3\t11.1235\t10.9378\t37.1578\n16\t1\t0\n1.0000\t9.3465\t8.9406\t35.9604\n2.0000\t10.4654\t9.9456\t36.9107\n3.0000\t11.1235\t10.9378\t37.1578\n"""
print(find_by_velocity(1, data))
OUTPUT
['\n1\t9.3465\t8.9406\t35.9604', '\n1.0000\t9.3465\t8.9406\t35.9604']

selecting a value using a conditional statement based on a list of tuples

I have a list of tuples converted from a dictionary. I am looking to compare a conditional value against the list of tuples(values) whether it is higher or lower starting from the beginning on the list. When this conditional value is lower than a tuple's(value) I want to use that specific tuple for further coding.
Please can somebody give me an insight into how this is achieved?
I am relatively new to coding, self-learning and I am not 100% sure the example would run but for the sake of demonstrating I have tried my best.
`tuple_list = [(12:00:00, £55.50), (13:00:00, £65.50), (14:00:00, £75.50), (15:00:00, £45.50), (16:00:00, £55.50)]
conditional_value = £50
if conditional_value != for x in tuple_list.values()
y = 0
if conditional_value < tuple_list(y)
y++1
else
///"return the relevant value from the tuple_list to use for further coding. I would be
looking to work with £45.50"///`
Thank you.
Just form a new list with a condition:
tuple_list = [("12:00:00", 55.50), ("13:00:00", 65.50), ("14:00:00", 75.50), ("15:00:00", 45.50), ("16:00:00", 55.50)]
threshold = 50
below = [tpl for tpl in tuple_list if tpl[1] < threshold]
print(below)
Which yields
[('15:00:00', 45.5)]
Note that I added quotation marks and removed the currency sign to be able to compare the values. If you happen to have the £ in your actual values, you'll have to preprocess (stripping) them before.
If I'm understanding your question correctly, this should be what you're looking for:
for key, value in tuple_list:
if conditional_value < value:
continue # Skips to next in the list.
else:
# Do further coding.
You can use
tuple_list = [("12:00:00", 55.50), ("13:00:00", 65.50), ("14:00:00", 75.50), ("15:00:00", 45.50), ("16:00:00", 55.50)]
conditional_value = 50
new_tuple_list = list(filter(lambda x: x[1] > conditional_value, tuple_list))
This code will return a new_tuple_list with all items that there value us greater then the conditional_value.

Categories