Can someone let me know how to pull out certain values from a Python output?
I would like to retrieve the value 'ocweeklyreports' from the following output using either indexing or slicing:
'config': '{"hiveView":"ocweeklycur.ocweeklyreports"}'
This should be relatively easy; however, I'm having problems defining the slicing/indexing configuration.
The following will successfully give me 'ocweeklyreports'
myslice = config['hiveView'][12:30]
However, I need the indexing or slicing modified so that I will get whatever value comes after 'ocweeklycur.'.
I'm not sure what output you're dealing with, or how robust you need this to be, but if it's just a string you can do something like this (a quick and dirty solution):
text = '{"hiveView":"ocweeklycur.ocweeklyreports"}'  # the raw string you showed
index_start = text.index('.') + 1  # index just after the '.', where the value starts
final_response = text[index_start:-2]  # -2 drops the trailing '"}'
print(final_response)  # Prints ocweeklyreports
Again, not the most elegant solution but hopefully it helps or at least offers a starting point. Another more robust solution would be to use regex but I'm not that skilled in regex at the moment.
You could do almost all of it using regex.
See if this helps:
import re
def search_word(di):
    st = di["config"]["hiveView"]
    p = re.compile(r'^ocweeklycur\.(?P<word>\w+)')  # \. matches a literal dot
    m = p.search(st)
    return m.group('word')

if __name__ == "__main__":
    d = {'config': {"hiveView": "ocweeklycur.ocweeklyreports"}}
    print(search_word(d))
The following worked best for me:
# Extract the value of the "hiveView" key
hive_view = config['hiveView']
# Split the string on the '.' character
parts = hive_view.split('.')
# The value you want is the second part of the split string
desired_value = parts[1]
print(desired_value) # Output: "ocweeklyreports"
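If the separator might be missing, `str.partition` is a forgiving alternative to `split` (a quick sketch, using the value from the question in place of `config['hiveView']`):

```python
# str.partition splits on the first '.' only and never raises,
# even when no '.' is present (the last element is then '').
hive_view = "ocweeklycur.ocweeklyreports"  # stands in for config['hiveView']
prefix, sep, desired_value = hive_view.partition('.')
print(desired_value)  # ocweeklyreports
```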
I'm a Python beginner, so please forgive me if I'm not using the right lingo and if my code includes blatant errors.
I have text data (i.e., job descriptions from job postings) in one column of my data frame. I want to determine which job ads contain any of the following strings: bachelor, ba/bs, bs/ba.
The function I wrote doesn't work because it produces an empty column (i.e., all zeros). It works fine if I just search for one substring at a time. Here it is:
def requires_bachelor(text):
    if text.find('bachelor|ba/bs|bs/ba') > -1:
        return True
    else:
        return False

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).map({True: 1, False: 0})
Thanks so much to anyone who is willing to help!
Here's my approach. You were pretty close, but you need to check for each of the items individually: if any of the "bachelor tags" exists, return True. Then, instead of map({True: 1, False: 0}), you can use map(bool) to make it a bit nicer (in fact, since any() already returns a bool, the map is optional). Good luck!
import pandas as pd

df_jobs = pd.DataFrame({"name": ["bob", "sally"], "description": ["bachelor", "ms"]})

def requires_bachelor(text):
    return any(text.find(a) > -1 for a in ['bachelor', 'ba/bs', 'bs/ba'])  # find returns -1 if not found

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).map(bool)
The | in the search string does not work like an or operator; str.find looks for the literal text 'bachelor|ba/bs|bs/ba'. You should divide it into three calls like this:
if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
You could try doing:
bachelors = ["bachelor", "ba/bs", "bs/ba"]
if any(bachelor in text for bachelor in bachelors):
    return True
Instead of writing a custom function that requires .apply (which will be quite slow), you can use str.contains for this. Also, you don't need map to turn booleans into 1 and 0; try using astype(int) instead.
df_jobs = pd.DataFrame({'description': ['job ba/bs', 'job bachelor',
'job bs/ba', 'job ba']})
df_jobs['bachelor'] = df_jobs.description.str.contains(
'bachelor|ba/bs|bs/ba', regex=True).astype(int)
print(df_jobs)
description bachelor
0 job ba/bs 1
1 job bachelor 1
2 job bs/ba 1
3 job ba 0
# note that the pattern does not look for match on simply "ba"!
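One small extension that may matter for real job ads: descriptions often capitalize "Bachelor", and `str.contains` accepts a `case` parameter for that. A short sketch with a hypothetical test frame:

```python
import pandas as pd

df_jobs = pd.DataFrame({'description': ['job Bachelor', 'job BA/BS', 'job ba']})
# case=False makes the regex match case-insensitive
df_jobs['bachelor'] = df_jobs.description.str.contains(
    'bachelor|ba/bs|bs/ba', case=False, regex=True).astype(int)
print(df_jobs['bachelor'].tolist())  # [1, 1, 0]
```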
So, you are checking for the literal string 'bachelor|ba/bs|bs/ba' in the text, which I don't believe will ever be found...
What I suggest is to check for each possible substring in the if, joining the checks with or, as follows:
def requires_bachelor(text):
    if text.find('bachelor') > -1 or text.find('ba/bs') > -1 or text.find('bs/ba') > -1:
        return True
    else:
        return False

df_jobs['bachelor'] = df_jobs['description'].apply(requires_bachelor).map({True: 1, False: 0})
It can all be done in one line in Pandas:
df_jobs['bachelor'] = df_jobs['description'].str.contains(r'bachelor|bs|ba')
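One caveat with the simplified pattern above: the bare `bs` and `ba` alternatives also match inside longer words (e.g. "bat"). A word-boundary version closer to the question's three phrases might look like this (a sketch with a hypothetical test frame):

```python
import pandas as pd

df_jobs = pd.DataFrame({'description': ['job bachelor', 'job bat', 'job bs/ba']})
# \b word boundaries stop short alternatives matching inside words like 'bat';
# (?:...) is a non-capturing group, so str.contains emits no group warning
df_jobs['bachelor'] = df_jobs['description'].str.contains(r'\b(?:bachelor|ba/bs|bs/ba)\b')
print(df_jobs['bachelor'].tolist())  # [True, False, True]
```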
I am reading a cfg file, and receive a dictionary for each section. So, for example:
Config-File:
[General]
parameter1="Param1"
parameter2="Param2"
[FileList]
file001="file1.txt"
file002="file2.txt" ......
I have the FileList section stored in a dictionary called section. In this example, I can access "file1.txt" as test = section["file001"], so test == "file1.txt". To access every file of FileList one after the other, I could try the following:
for i in range(1, number_of_files + 1):
    access_key = "file00" + str(i)
    print(section[access_key])
This is my current solution, but I don't like it at all. First of all, it looks kind of messy in python, but I will also face problems when more than 9 files are listed in the config.
I could also do it like:
for i in range(1, number_of_files + 1):
    if i <= 9:
        access_key = "file00" + str(i)
    elif 9 < i < 100:
        access_key = "file0" + str(i)
    print(section[access_key])
But I don't want to start with that because it becomes even worse. So my question is: What would be a proper and relatively clean way to go through all the file names in order? I definitely need the loop because I need to perform some actions with every file.
Use zero padding to generate the file number (for e.g. see this SO question answer: https://stackoverflow.com/a/339013/3775361). That way you don’t have to write the logic of moving through digit rollover yourself—you can use built-in Python functionality to do it for you. If you’re using Python 3 I’d also recommend you try out f-strings (one of the suggested solutions at the link above). They’re awesome!
If we can assume the file number has three digits, then you can use any of the following to achieve zero padding. All of the below return "015".
i = 15
str(i).zfill(3)
# or
"%03d" % i
# or
"{:0>3}".format(i)
# or
f"{i:0>3}"
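Applied to the loop from the question, the f-string version might look like this (a sketch assuming keys of the form `file001`, as in the config excerpt, with a hypothetical section dict):

```python
# Hypothetical dict standing in for the parsed [FileList] section
section = {"file001": "file1.txt", "file002": "file2.txt", "file003": "file3.txt"}

files = []
for i in range(1, len(section) + 1):
    access_key = f"file{i:03d}"  # file001, file002, ... file009, file010, ...
    files.append(section[access_key])
print(files)  # ['file1.txt', 'file2.txt', 'file3.txt']
```

The `:03d` format spec handles the digit rollover past 9 (and past 99) automatically, so no if/elif ladder is needed.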
Start by looking at the keys you actually have instead of guessing what they might be. You need to filter out the ones that match your pattern, and sort according to the numerical portion.
keys = [key for key in section.keys() if key.startswith('file') and key[4:].isdigit()]
You can add additional conditions, like len(key) > 4, or drop the conditions entirely. You might also consider learning regular expressions to make the checking more elegant.
To sort the names without having to account for padding, you can do something like
keys = sorted(keys, key=lambda s: int(s[4:]))
You can also try a library like natsort, which will handle the custom sort key much more generally.
Now you can iterate over the keys and do whatever you want:
for key in sorted((k for k in section if k.startswith('file') and k[4:].isdigit()), key=lambda s: int(s[4:])):
print(section[key])
Here is what a solution equipped with re and natsort might look like:
import re
from natsort import natsorted
pattern = re.compile(r'file\d+')
for key in natsorted(k for k in section if pattern.fullmatch(k)):
print(section[key])
I'd like to simplify my code for getting the last word after the '/'.
Any suggestions?
def downloadRepo(repo):
    pos1 = repo[::-1].index("/")  # position of the last '/' in the reversed string
    salida = repo[::-1][:pos1]    # the last segment, still reversed
    print(salida[::-1])
downloadRepo("https://github.com/byt3bl33d3r/arpspoof")
Thanks in advance!
You can use str.rsplit and negative indexing:
"https://github.com/byt3bl33d3r/arpspoof".rsplit('/', 1)[-1]
# 'arpspoof'
You can also stick with indexes and use str.rfind:
s = "https://github.com/byt3bl33d3r/arpspoof"
index = s.rfind('/')
s[index+1:]
# 'arpspoof'
The latter is more memory efficient, since the split methods build in-memory lists which contain all the split tokens, including the spurious ones from the front that we don't use.
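Since the input is a URL, the standard library's `urllib.parse` can also isolate the path first, which keeps query strings and fragments out of the result (a sketch):

```python
from urllib.parse import urlparse
import posixpath

url = "https://github.com/byt3bl33d3r/arpspoof"
# Take the basename of the URL path, ignoring any ?query or #fragment part
last_part = posixpath.basename(urlparse(url).path)
print(last_part)  # arpspoof
```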
You may use
string = "https://github.com/byt3bl33d3r/arpspoof"
last_part = string.split("/")[-1]
print(last_part)
Which yields
arpspoof
Timing rsplit() vs split() yields (on my Macbook Air) the following results:
import timeit
def schwobaseggl():
return "https://github.com/byt3bl33d3r/arpspoof".rsplit('/', 1)[-1]
def jan():
return "https://github.com/byt3bl33d3r/arpspoof".split("/")[-1]
print(timeit.timeit(schwobaseggl, number=10**6))
print(timeit.timeit(jan, number=10**6))
# 0.347005844116
# 0.379151821136
So the rsplit alternative is indeed slightly faster (running each 1,000,000 times, that is).
My script cleans arrays of unwanted strings like "##$!" and similar junk.
The script works as intended, but it is extremely slow when the Excel file has many rows.
I tried using numpy to see if it could speed things up, but I'm not too familiar with it, so I might be using it incorrectly.
xls = pd.ExcelFile(path)
df = xls.parse("Sheet2")
TeleNum = np.array(df['telephone'].values)

def replace(orignstr):  # removes the unwanted strings from numbers
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    return orignstr

for UncleanNum in tqdm(TeleNum):
    newnum = replace(str(UncleanNum))  # call the replace function
    df['telephone'] = df['telephone'].replace(UncleanNum, newnum)  # store the string back in the data frame
I also tried removing the method to see if that would help, placing it all as one block of code, but the speed remained the same.
for UncleanNum in tqdm(TeleNum):
    orignstr = str(UncleanNum)
    for elem in badstr:
        if elem in orignstr:
            orignstr = orignstr.replace(elem, '')
    print(orignstr)
    df['telephone'] = df['telephone'].replace(UncleanNum, orignstr)
TeleNum = np.array(df['telephone'].values)
The current speed of the script on an Excel file of 200,000 rows is around 70 it/s, and it takes around an hour to finish. That is not great, since this is just one function of many.
I'm not too advanced in Python; I'm just learning as I script, so any pointers would be appreciated.
Edit:
Most of the array elements I'm dealing with are numbers, but some have strings mixed in. I'm trying to remove all the non-digit characters from each element.
Ex.
FD3459002912
*345*9002912$
If you are trying to clear everything that isn't a digit from the strings, you can use re.sub directly, like this:
import re

string = "FD3459002912"
regex_result = re.sub(r"\D", "", string)  # \D matches any non-digit character
print(regex_result)  # 3459002912
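For the data-frame case in the question, the same idea can be applied to the whole column at once with pandas' vectorized string methods, which avoids the slow row-by-row replace loop entirely (a sketch using the question's two sample values):

```python
import pandas as pd

df = pd.DataFrame({'telephone': ['FD3459002912', '*345*9002912$']})
# One vectorized pass over the column: strip every non-digit character
df['telephone'] = df['telephone'].astype(str).str.replace(r'\D', '', regex=True)
print(df['telephone'].tolist())  # ['3459002912', '3459002912']
```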
I am trying to get the proportion of nouns in my text using the code below, and it is giving me an error. I am using a function that calculates the number of nouns in my text, and I have the overall word count in a different column.
pos_family = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS']
}

def check_pos_tag(x, flag):
    cnt = 0
    try:
        for tag, value in x.items():
            if tag in pos_family[flag]:
                cnt += value
    except:
        pass
    return cnt
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used nltk package to get the counts by PoS tags and I have the counts in a dictionary in PoS_Count column in my dataframe.
If I remove "/df2['word_count']" on a first run to get just the noun count, then add it back and run again, it works fine; but when I run it with the division from the start, I get the error below.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
df2['word_count'] is a pandas series, but you need to use a float or int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it.
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation alone.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
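A minimal, self-contained illustration of why this fix works, with a hypothetical two-row frame and a simplified check_pos_tag: apply produces an integer Series, and pandas then divides it element-wise by the word_count Series.

```python
import pandas as pd

pos_family = {'noun': ['NN', 'NNS', 'NNP', 'NNPS']}

def check_pos_tag(x, flag):
    # x is a {tag: count} dict; sum the counts whose tag is in the family
    return sum(v for t, v in x.items() if t in pos_family[flag])

df2 = pd.DataFrame({
    'PoS_Count': [{'NN': 2, 'VB': 1}, {'NNS': 1, 'DT': 3}],
    'word_count': [3, 4],
})
# Series / Series divides element-wise, row by row
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
print(df2['noun_count'].tolist())  # noun fraction per row
```

Dividing inside the lambda instead would pit each row's int against the entire word_count Series, which is what triggered the original ValueError.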