Get a Spark dataframe field into a String value - python

I am currently trying to filter my dataframe inside an if statement and get the field's value returned into a variable.
Here is my code:
if df_table.filter(col(field).contains("val")):
    id_2 = df_table.select(another_field)
    print(id_2)
    # Recursive call with new variable
The problem is: it looks like the filtering in the if works, but id_2 gives me the column name and type, whereas I want the value itself from that field.
The output for this code is:
DataFrame[ID_1: bigint]
DataFrame[ID_2: bigint]
...
If I try collect, like this: id_2 = df_table.select(another_field).collect()
I get this: [Row(ID_1=3013848), Row(ID_1=319481), Row(ID_1=391948)...], which just lists every id in a list.
I thought of doing: id_2 = df_table.select(another_field).filter(col(field).contains("val"))
but I still get the same result as in the first attempt.
I would like my id_2, for each iteration of my loop, to take the value from the field I am filtering on, like:
3013848
319481
...
But not a list of every value of the matching fields from my dataframe.
Any idea how I could get that into my variable?
Thank you for helping.

In fact, dataFrame.select(colName) returns a DataFrame with only that one column, not the value of that column for a given row. I see in your comment that you want to do a recursive lookup in a Spark dataframe. The thing is, firstly, Spark, AFAIK, doesn't support recursive operations. If you have a deep recursive operation to do, you'd better collect the dataframe you have and do the work on your driver without Spark. You can then use whatever library you want, but you lose the advantage of processing the data in a distributed way.
Secondly, Spark isn't designed to iterate over each record. Try to achieve it with a join of dataframes, but that brings me back to my first point: if a later join depends on the result of a previous join in a recursive way, just forget Spark.
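That said, if what you need is the plain values rather than Row objects, you can collect the filtered column and unpack each Row. A minimal sketch, assuming df_table, field and another_field are defined as in the question:

from pyspark.sql.functions import col

# Collect only the matching rows of the one column you care about,
# then unpack each Row into a plain Python value
rows = df_table.filter(col(field).contains("val")).select(another_field).collect()
for row in rows:
    id_2 = row[0]  # the bigint value itself, e.g. 3013848
    print(id_2)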

How do you check if all the values in a column in a dataframe exist in another column in another dataframe using Vaex?

I have a dataframe with 160,000 rows, and I need to know whether these values exist in a column of another dataframe that has over 7 million rows, using Vaex.
I have tried doing this in pandas, but it takes way too long to run.
Once I run this code, I would like a list or a column that says either "True" or "False" for whether each value exists.
There are a few tricks you can do.
Some ideas:
You can try an inner join, and then get the list of unique values that appear in both dataframes. Then you can use the isin method on the smaller dataframe with that list to get your answer.
I don't know if this will work out of the box, but it would be something like:
df_join = df_small.join(df_big, on='key', allow_duplicates=True)
common_samples = df_join['key'].tolist()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
# If this is something you are going to reuse a lot, it may be worth doing:
df_small = df_small.materialize('is_in_df_big')  # put it in memory, otherwise it will be lazily recomputed each time you need it
Similar idea: instead of doing join do something like:
unique_samples = df_small.key.unique()
common_samples = df_big[df_big.key.isin(unique_samples)].key.unique()
df_small['is_in_df_big'] = df_small.key.isin(common_samples)
I don't know which one would be faster. I hope this at least leads to some inspiration, if not the full solution.

How to read 1 record from a json file using panda

In this program I'd like to randomly choose one word from a JSON file.
[My code]
I successfully opened the file, but I don't know how to access only one record inside "randwords".
Thanks!!!
Try this:
randwords.sample(1)
See:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html
You can access a pandas dataframe (here called pd) with an operation called indexing. For example, pd['data'] gives you access to the data column. Refer to here for more information.
One specific function you can benefit from here is iloc. For example, pd.iloc[0] gives you access to the first row. Then you can specify which column you are interested in by calling the appropriate column name:
pd.iloc[0].data
This will return the data column of the first row. Using a random number instead of 0 will obviously give you a random row.
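Putting both answers together, a minimal sketch, assuming the JSON file has been read into a pandas DataFrame with a "randwords" column (the filename below is hypothetical):

import pandas as pd

df = pd.read_json("words.json")           # hypothetical filename
word = df["randwords"].sample(1).iloc[0]  # pick one random entry as a plain value
print(word)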

Pandas: Dealing with missing column in input dataframe

I have Python code which performs mathematical calculations on multiple columns of the dataframe. The input comes from various sources, so there is a possibility that sometimes one column is missing.
This column is missing because it is insignificant, but I need to have at least a null column for the code to run without errors.
I can add a null column using an if statement, but there are around 120 columns and I do not want to slow down the code. Is there any other way for the code to check whether each column is present in the original dataframe and, if any column is not present, add a null column before starting the execution of the actual code?
If you know that the column name is the same for every dataframe, you could do something like this without having to loop over the column names:
if col_name not in df.columns:
    df[col_name] = ''  # or whatever value you want to set it to
If speed is a big concern, which I can't tell, you could always convert the columns to a set with set(df.columns) and reduce each membership check to O(1) time, because it will be a hashed lookup. You can read more detail on the efficiency of the in operator at this link: How efficient is Python's 'in' or 'not in' operators?
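To handle all ~120 columns at once, a minimal sketch, assuming expected_cols holds the list of column names your calculations need (the names below are hypothetical):

import numpy as np

expected_cols = ["col_a", "col_b", "col_c"]  # hypothetical list of the ~120 required columns

# Add a null column for anything that is missing, leaving existing columns untouched
for col_name in set(expected_cols) - set(df.columns):
    df[col_name] = np.nan

Alternatively, df = df.reindex(columns=df.columns.union(expected_cols)) gives you the same result in a single call, with any missing columns filled with NaN.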

How to access JSON values from PySpark dataframes with default values?

I have a Spark dataframe which has a JSON in one of its columns. My task is to turn this dataframe into a columnar type of dataframe. The problem is that the JSON is dynamic and its structure always changes. What I would like to do is take values from it and, in case it does not have them, return a default value. Is there an option for this in the dataframe? This is how I am taking values out of the JSON; the problem is that if one of the levels changes name or structure, this will not work.
columnar_df = df.select(
    col('json')['level1'].alias('json_level1'),
    col('json')['level1']['level2a'].alias('json_level1_level2a'),
    col('json')['level1']['level2b'].alias('json_levelb'),
)
You can do something like that with json_tuple:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.json_tuple
df.select(json_tuple(col("json"), << all_the_fields , _you_want >> ))
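json_tuple and get_json_object both return null for paths that are missing, so you can wrap them with coalesce to supply a default. A minimal sketch, assuming json is a string column; the paths and default values below are placeholders:

from pyspark.sql.functions import coalesce, col, get_json_object, lit

columnar_df = df.select(
    coalesce(get_json_object(col("json"), "$.level1.level2a"), lit("default_a")).alias("json_level1_level2a"),
    coalesce(get_json_object(col("json"), "$.level1.level2b"), lit("default_b")).alias("json_level1_level2b"),
)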

How does the isnull() method work to return all rows that are missing in my data frame?

I'm new to Python and just trying to figure out how this small bit of code works. Hoping this'll be easy to explain without an example data frame.
My data frame, called df_train, contains a column called Age. This column is NaN for 177 records.
I submit the following code...
df_train[df_train['Age'].isnull()]
... and it returns all the records where Age is missing.
Now if I submit df_train['Age'].isnull(), all I get is a Boolean List of values. How does the data frame object then work to convert this Boolean List to the rows we actually want?
I don't understand how passing the boolean list to the data frame again results in just the 177 records that we need - could someone please ELI5 for a newbie?
You will have to create subsets of the dataframe you want to use. Suppose you want to use only those rows where df_train['Age'] is not null. In that case, you have to select
df_train_to_use = df_train[df_train['Age'].isnull() == False]
Now, you may check any other column that you want to use and that may have nulls, like:
df_train['Column_name'].isnull().any()
If this returns True, you may go ahead and replace the nulls with default values, averages, zeros, or whatever method you prefer, as is usually done in machine learning programs.
Example
df_train = df_train.dropna(subset=['Column_name'])            # drop rows where the column is null
df_train['Column_name'] = df_train['Column_name'].fillna('')  # for strings
df_train['Column_name'] = df_train['Column_name'].fillna(0)   # for ints
df_train['Column_name'] = df_train['Column_name'].fillna(0.0) # for floats
Etc.
I hope this helps explain it.
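To answer the original question more directly: passing a boolean Series to df_train[...] acts as a mask, keeping only the rows for which the mask is True, aligned by index. A minimal sketch with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22, np.nan, 35, np.nan]})
mask = df["Age"].isnull()  # Boolean Series: [False, True, False, True]
print(df[mask])            # keeps only the two rows where Age is NaN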
