Python / Pyspark Indexing and Slicing issue on Databricks

I'm not entirely sure if I need to index or slice to retrieve elements from an output in Python.
For example, the variable "Ancestor" produces the following output.
Out[30]: {'ancestorPath': '/mnt/lake/RAW/Internal/origination/dbo/xpd_opportunitystatushistory/1/Year=2022/Month=11/Day=29/Time=05-11',
'dfConfig': '{"sparkConfig":{"header":"true"}}',
'fileFormat': 'SQL'}
The element "xpd_opportunitystatushistory" is a table and I would like to retrieve "xpd_opportunitystatushistory" from the output.
I was thinking of something like:
table = Ancestor[:6]
But it fails.
Any thoughts?
I have been working on this while waiting for help.
The following:
Ancestor['ancestorPath']
gives me:
Out[17]: '/mnt/lake/RAW/Internal/origination/dbo/xpd_opportunitystatushistory/1/Year=2022/Month=11/Day=29/Time=05-11'
If someone could help with the remaining code to pull out 'xpd_opportunitystatushistory', that would be most helpful.
Ta

Ancestor is a dictionary (key-value pairs) and hence has to be accessed using a key, which in this case is ancestorPath. Slicing such as Ancestor[:6] fails because dictionaries do not support slices.
I assigned a value similar to yours and was able to retrieve ancestorPath, as you have figured out.
Now, to get xpd_opportunitystatushistory you can use the following code. Since the value of Ancestor['ancestorPath'] is a string, you can split it on "/" and then extract the required value from the resulting list (index 7, because the leading "/" produces an empty string at index 0):
req_array = Ancestor['ancestorPath'].split("/")
print(req_array)
print(req_array[7])
If you want to retrieve the complete path up to and including xpd_opportunitystatushistory, you can use the following instead:
req_array = Ancestor['ancestorPath'].split("/")
print(req_array)
print('/'.join(req_array[:8]))
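If the depth of the path can vary, a hard-coded index such as 7 becomes fragile. A minimal sketch of one alternative, assuming the table folder always sits directly after the schema folder "dbo" (that assumption and the helper name table_after are mine, not from the question):

```python
# Sample value copied from the question; in the notebook, Ancestor already exists.
Ancestor = {
    'ancestorPath': '/mnt/lake/RAW/Internal/origination/dbo/xpd_opportunitystatushistory/1/Year=2022/Month=11/Day=29/Time=05-11',
    'dfConfig': '{"sparkConfig":{"header":"true"}}',
    'fileFormat': 'SQL',
}


def table_after(path, marker="dbo"):
    """Hypothetical helper: return (table name, path up to and including the table).

    Assumes the table folder is the segment immediately after `marker`.
    Raises ValueError if `marker` is not one of the path segments.
    """
    parts = path.split("/")
    pos = parts.index(marker) + 1  # position of the segment after the marker
    return parts[pos], "/".join(parts[:pos + 1])


table, table_path = table_after(Ancestor['ancestorPath'])
print(table)
print(table_path)
```

This still relies on the folder layout, just on a name rather than a position, so it is only worth using if "dbo" (or whatever marker you pass) is guaranteed to appear in every path.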

Related

Is there a Python pandas function for retrieving a specific value of a dataframe based on its content?

I've got multiple excels and I need a specific value but in each excel, the cell with the value changes position slightly. However, this value is always preceded by a generic description of it which remains constant in all excels.
I was wondering if there was a way to ask Python to grab the value to the right of the element containing the string "xxx".

Try iterating over the Excel files (I guess you loaded each as a separate pandas object?),
something like for df in [dataframe1, dataframe2, ..., dataframeN].
Then you could pick the column you need (if the column stays constant), e.g. df['columnX'], and find which index the label has:
df.index[df['columnX']=="xxx"]. It may make sense to add .tolist() at the end, so that if "xxx" is a value that repeats more than once, you get all occurrences in a list.
The last step would be to take index+1 to get the value you want (this assumes the value sits in the row below the label; if it sits to the right, read the next column of the same row instead).
Hope it was helpful.
In general I would strongly suggest being more specific in your questions and providing code / examples.
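For the "value to the right of the cell containing xxx" case, here is a rough sketch that does not depend on the label staying in one column. The helper name value_right_of and the two sample frames are made up for illustration; real sheets would come from pd.read_excel(..., header=None).

```python
import pandas as pd


def value_right_of(df, label):
    """Hypothetical helper: return the cell to the right of the first cell containing `label`."""
    # Convert every cell to text, then flag the cells whose text contains the label.
    mask = df.astype(str).apply(lambda col: col.str.contains(label, regex=False))
    rows, cols = mask.to_numpy().nonzero()  # positions of matches, in row-major order
    if len(rows) == 0:
        return None  # label not found in this sheet
    r, c = rows[0], cols[0]
    if c + 1 >= df.shape[1]:
        return None  # label is in the last column, nothing to its right
    return df.iat[r, c + 1]


# Two stand-in "sheets" where the labelled value sits in different positions.
sheet_a = pd.DataFrame([["title", "", ""], ["xxx total", 42, ""]])
sheet_b = pd.DataFrame([["", "", ""], ["", "xxx total", 17], ["other", 1, 2]])

values = [value_right_of(df, "xxx") for df in (sheet_a, sheet_b)]
print(values)
```

Only the first match per sheet is used here; if the label can occur several times, loop over zip(rows, cols) instead.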

How to use each element in the list to check and locate the matching value from a data set?

I want to retrieve values from a data set that matches a certain value. The ".loc" method is working fine if I give one value at a time. But when trying to get the value from a list nothing is happening.
The script below works fine.
df.loc[df.domains=="IN"]
The script below does not. I want to use each item from the list to match and get the desired data frame from the data set.
list=[""AE","AU","BE","BR","CN","DE","EG","ES","FR","IN","IT","JP","MX","NL","PL","SE","SG","UK"]
for i in list:
    a=f'"{i}"'
    print(a)
    df.loc[df.domains==a]

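Instead of a=f'"{i}"', compare against i directly: df.loc[df.domains==i].
In for i in list, the variable i is already the list element (for example "IN"), not a position, so list[i] would raise a TypeError. Wrapping i in extra quotes produces the string '"IN"', quotes included, which never equals "IN", so nothing matches.
Note also that inside a loop the result of df.loc[...] is not displayed automatically; print it or store it. Avoid naming the variable list, since that shadows the built-in type.
Also, I noticed that your list has an extra " at the beginning, which will give you a syntax error.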
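A small sketch of matching a column against a list of values, both one value at a time and all at once with Series.isin. The frame contents and the visits column are invented for illustration; only the domains column name comes from the question.

```python
import pandas as pd

# Made-up data standing in for the question's data set.
df = pd.DataFrame({
    "domains": ["IN", "US", "AE", "IN", "ZZ"],
    "visits": [10, 20, 30, 40, 50],
})
domains = ["AE", "AU", "IN"]  # named `domains` to avoid shadowing the built-in `list`

# One filtered frame per value, kept in a dict so the results are not thrown away.
per_domain = {d: df.loc[df.domains == d] for d in domains}

# All matching rows in a single frame.
matched = df.loc[df.domains.isin(domains)]

print(per_domain["IN"])
print(matched)
```

The isin form is usually the simpler choice when you only need the combined rows; the dict form is useful when each value needs its own frame.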

BatchDataset not subscriptable when trying to format Python dictionary as table

I'm working through the TensorFlow Load pandas.DataFrame tutorial, and I'm trying to modify the output from a code snippet that creates the dictionary slices:
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)
for dict_slice in dict_slices.take(1):
    print(dict_slice)
I find the following output sloppy, and I want to put it into a more readable table format.
I tried to format the for loop, based on this recommendation
Which gave me the error that the BatchDataset was not subscriptable
Then I tried to use the range and len functions on dict_slices, so that i would be an integer index and not a slice
Which gave me the following error (as I understand, because the dict_slices is still an array, and each iteration is one vector of the array, not one index of the vector):

Refer here for the solution. To summarize, we need to use as_numpy_iterator:
example = list(dict_slices.as_numpy_iterator())
example[0]['age']

BatchDataset is a tf.data.Dataset instance that has been batched by calling its .batch(..) method. You cannot index a TensorFlow Dataset, and depending on the TensorFlow version and on whether the dataset size is known, calling len on it may fail as well. I suggest iterating through it like you did in the first code snippet.
However, in your dataset you are using .to_dict('list'), which means that each key in your dictionary is mapped to a list of values. Basically you have a "column" for every key rather than rows; is this what you want? This makes printing line by line (as in the table-printing example you linked) a lot more difficult, since you do not have the different features together in a row. It is also different from the example in the official TensorFlow code, where a datapoint consists of multiple features, not one feature with multiple values.
Combining the Tensorflow code and pretty printing:
columns = list(df.columns.values)+['target']
# batch = 1 because otherwise you will get multiple dict_slice - target pairs in one iteration below!
dict_slices = tf.data.Dataset.from_tensor_slices((df.values, target.values)).batch(1)
print(*columns, sep='\t')
for dict_slice, target in dict_slices.take(1):
    print(*dict_slice.numpy(), target.numpy(), sep='\t')
This needs a bit of formatting, because column widths are not equal.

How do I use the Pandas .loc() function when converting a column to DateTime?

I'm working on a small project to parse and create graphs based on DNC Primary data. It's all functional so far but I'm working on getting rid of the following SettingWithCopyWarning:
"SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
candidate['end_date'] = pd.to_datetime(candidate_state['end_date'])"
I've tried changing the referenced line to:
candidate.loc['end_date'] = pd.to_datetime(candidate_state['end_date'])
But that throws another error about some kind of comparison. Can anyone help me figure this out?
Thank you!

It is hard to confirm without seeing the full code, but this warning is usually a result of the dataframe you are working on (candidate in this case) being a "copy" of a filtered selection of a larger dataframe. In other words, when you created candidate you did something like:
candidate = df_larger_dataset[df_larger_dataset['some_column'] == 'some_value']
You get the warning because pandas cannot guarantee whether a selection like this is a view of df_larger_dataset or an independent copy, so when you assign to candidate it cannot tell whether you meant to change the original as well, and the change may or may not reach it. This may or may not matter in your context, but to avoid the warning, make candidate an explicit copy when you create it:
candidate = df_larger_dataset[df_larger_dataset['some_column'] == 'some_value'].copy()
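Putting this together with the date conversion from the question, a minimal sketch might look like the following. The column names and sample rows are invented; only the end_date column and the variable names candidate and df_larger_dataset come from the thread.

```python
import pandas as pd

# Made-up stand-in for the larger primary data set.
df_larger_dataset = pd.DataFrame({
    "candidate_name": ["A", "B", "A"],
    "end_date": ["2020-01-05", "2020-01-06", "2020-01-12"],
})

# Explicit copy: `candidate` is now independent of df_larger_dataset.
candidate = df_larger_dataset[df_larger_dataset["candidate_name"] == "A"].copy()

# Plain column assignment no longer triggers SettingWithCopyWarning.
candidate["end_date"] = pd.to_datetime(candidate["end_date"])

print(candidate.dtypes)
```

As for the attempt in the question: candidate.loc['end_date'] addresses a row labelled 'end_date', not the column, which is why it fails in a different way. The form the warning refers to takes both a row and a column indexer, as in .loc[row_indexer, col_indexer].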

Is there a way to find a substring in a DataFrame?

Well, I got this problem:
I have a pandas DataFrame and I'm trying to find the values that start with "THRL-" and delete that prefix. I've tried to make it a string using result = df.toString(), as follows (where result is a DataFrame):
a = result.replace('THRL-', '')
But it doesn't work; I still see the same THRL- prefix in the string that I'm returning.
Is there a better way to do it? I also tried with a dictionary, but it didn't seem to work because apparently the .to_dict() method returns a list instead of a dictionary.
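One possible explanation, assuming result really is a DataFrame: DataFrame.replace('THRL-', '') only replaces cells whose entire value equals 'THRL-', so a cell like 'THRL-001' is left untouched. Passing regex=True makes it a pattern replacement inside each string cell. A minimal sketch with made-up data (the code and qty columns are hypothetical):

```python
import pandas as pd

# Made-up frame; only the "THRL-" prefix comes from the question.
result = pd.DataFrame({
    "code": ["THRL-001", "ABC-002", "THRL-003"],
    "qty": [1, 2, 3],
})

# Whole-value matching: no cell equals "THRL-", so nothing changes.
unchanged = result.replace("THRL-", "")

# Pattern matching: strip the prefix where a string cell starts with it.
cleaned = result.replace(r"^THRL-", "", regex=True)

print(cleaned)
```

The ^ anchor restricts the replacement to the start of the value; drop it if the text should be removed wherever it occurs. To limit the change to one column, result["code"].str.replace("THRL-", "", regex=False) is an alternative.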
