How can I get a single column out of a spark dataframe? - python

I would like to take a single column out of my Spark dataframe, and put the latitude in one variable and the longitude in another.
When I do this;
I only get the column name.

The best way to select a column is with the col() function, which lets Spark know you mean a column rather than a plain string and keeps the expression from being tied to a specific DataFrame (for example, df.select("name") can give issues if that DataFrame has been deleted):
df = df.select(F.col("some_column_name"))
Likewise, for a filter operation, use lit() to make Spark understand that the right-hand side is a string literal:
df = df.filter(F.col("some_column_name") == F.lit("a_string"))
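For context, a minimal self-contained sketch of both calls; the SparkSession setup, sample data, and column names here are assumptions for illustration, not from the original question:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical session and data, assumed only so the example runs end to end.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a_string", 52.37, 4.89), ("another", 48.85, 2.35)],
    ["some_column_name", "latitude", "longitude"],
)

selected = df.select(F.col("latitude"), F.col("longitude"))
filtered = df.filter(F.col("some_column_name") == F.lit("a_string"))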

Well, all you need to do is:
Lats = [row[0] for row in df.select('latitude').collect()]
print(Lats)
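The same pattern gives you the longitudes as well (assuming the column is named 'longitude', as the question suggests); collecting both columns in one pass avoids scanning the data twice:
# Collect latitude and longitude together, then split into two Python lists.
coords = df.select("latitude", "longitude").collect()
lats = [row["latitude"] for row in coords]
lons = [row["longitude"] for row in coords]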

Related

pyspark Drop rows in dataframe to only have X distinct values in one column

I have a dataframe with a column "Category" that has over 12k distinct values. For sampling purposes, I would like to get a small sample where there are only 1000 different values of this category column.
Before I was doing:
small_distinct = df.select("category").distinct().limit(1000).rdd.flatMap(lambda x: x).collect()
df = df.where(col("category").isin(small_distinct))
I know this is extremely inefficient, as I'm doing a distinct on the category column and then collecting it into a normal Python list so I can use the isin() filter.
Is there any "Spark" way of doing this? I thought maybe something with rolling windows could do the job, but I can't manage to solve it.
Thanks!
You can improve your code using a left_semi join:
small_distinct = df.select("category").distinct().limit(1000)
df = df.join(small_distinct, "category", "left_semi")
Using a left_semi join is a good way to filter a table using another table while keeping the same schema, in an efficient way.
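A self-contained sketch of the whole pattern; the toy data and the limit of 10 distinct values are placeholders for illustration:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real 12k-category DataFrame.
df = spark.createDataFrame(
    [(i % 50, "row_{}".format(i)) for i in range(1000)],
    ["category", "payload"],
)

# Keep only rows whose category is in a limited sample of distinct values.
small_distinct = df.select("category").distinct().limit(10)
sampled = df.join(small_distinct, "category", "left_semi")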

How to feed new columns every time in a loop to a spark dataframe?

I have a task of reading each column of a Cassandra table into a dataframe to perform some operations. I want to feed the data so that, if there are 5 columns in the table, I get:
the first column in the first iteration
the first and second columns in the second iteration, fed to the same dataframe
and so on.
I need generic code. Has anyone tried something similar to this? Please help me out with an example.
This will work:
import pandas as pd

df2 = pd.DataFrame()
for i in range(len(df.columns)):
    # Append the first i+1 columns; columns not yet present are filled with NaN.
    df2 = df2.append(df.iloc[:, 0:i + 1], sort=True)
Since the same column names are repeated in each iteration, df2 will not end up with duplicate columns; it will simply keep adding rows.
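For example, with a hypothetical three-column frame the loop stacks progressively wider row blocks; pd.concat is the modern equivalent of DataFrame.append (which was removed in pandas 2.0):
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
# Build the column prefixes and stack them once; missing columns become NaN.
pieces = [df.iloc[:, 0:i + 1] for i in range(len(df.columns))]
df2 = pd.concat(pieces, sort=True)
print(df2)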
You can extract the names from dataframe's schema and then access that particular column and use it the way you want to.
names = df.schema.names
columns = []
for name in names:
    columns.append(name)
# df[columns] -- use it the way you want
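If you want to stay in Spark rather than pandas, a hedged sketch of feeding a growing prefix of columns on each iteration (assuming df is the Spark DataFrame read from Cassandra) could look like this:
names = df.schema.names
for i in range(1, len(names) + 1):
    # Select the first i columns: col 1 on the first pass, cols 1-2 on the second, and so on.
    subset = df.select(names[:i])
    # ... perform the per-iteration operations on `subset` here ...
    subset.show()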

Pandas - merge/join/vlookup df and delete all rows that get a match

I am trying to reference a list of expired orders from one spreadsheet (df name = data2) and vlookup them on the new orders spreadsheet (df name = data) to delete all the rows that contain expired orders, then return a new spreadsheet (df name = results).
I am having trouble trying to mimic what I do in Excel with vlookup/sort/delete in pandas. Please view the pseudo code/steps as code:
Import simple.xls as a dataframe called 'data'
Import wo.xlsm, sheet name "T", as a dataframe called 'data2'
Do a vlookup, using Column "A" in 'data' as the values to be matched with any of the same values in Column "A" of 'data2' (they're both just Order IDs)
For all values that exist in Column A of 'data2' and also in Column "A" of 'data', group (if necessary) and delete the entire row (there are 26 columns) for each matched Order ID found in Column A of both datasets. To reiterate: delete the entire row for the matches found in the 'data' file. Save the smaller dataset as 'results'.
import pandas as pd
data = pd.read_excel("ors_simple.xlsx", encoding="ISO-8859-1", dtype=object)
data2 = pd.read_excel("wos.xlsm", sheet_name="T")
results = data.merge(data2, on='Work_Order')
writer = pd.ExcelWriter('vlookuped.xlsx', engine='xlsxwriter')
results.to_excel(writer, sheet_name='Sheet1')
writer.save()
I re-read your question and think I understand it correctly. You want to find out if any order in new_orders (you call it data) has expired, using expired_orders (you call it data2).
If you rephrase your question what you want to do is: 1) find out if a value in a column in a DataFrame is in a column in another DataFrame and then 2) drop the rows where the value exists in both.
Using pd.merge is one way to do this. But since you want to use expired_orders to filter new_orders, pd.merge seems a bit overkill.
Pandas actually has a method for doing this sort of thing and it's called isin() so let's use that! This method allows you to check if a value in one column exists in another column.
df_1['column_name'].isin(df_2['column_name'])
isin() returns a Series of True/False values that you can apply to filter your DataFrame by using it as a mask: df[bool_mask].
So how do you use this in your situation?
is_expired = new_orders['order_column'].isin(expired_orders['order_column'])
results = new_orders[~is_expired].copy() # Use copy to avoid SettingWithCopyError.
~ is equal to not, so ~is_expired means that the order wasn't expired.
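As a small, self-contained sketch of the same idea; the column name and sample data are made up for illustration:
import pandas as pd

# Hypothetical order data standing in for the real spreadsheets.
new_orders = pd.DataFrame({"Work_Order": [1, 2, 3, 4], "qty": [10, 20, 30, 40]})
expired_orders = pd.DataFrame({"Work_Order": [2, 4]})

is_expired = new_orders["Work_Order"].isin(expired_orders["Work_Order"])
results = new_orders[~is_expired].copy()
print(results)  # only work orders 1 and 3 remain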

How to add values to a new column in pandas dataframe?

I want to create a new named column in a Pandas dataframe, insert the first value into it, and then add more values to the same column:
Something like:
import pandas
df = pandas.DataFrame()
df['New column'].append('a')
df['New column'].append('b')
df['New column'].append('c')
etc.
How do I do that?
If I understand correctly, you want to append a value to an existing column in a pandas DataFrame. The thing with DataFrames is that you need to maintain a matrix-like shape, so the number of rows is equal for each column. What you can do is add a column with a default value and then update that value with:
for index, row in df.iterrows():
    df.at[index, 'new_column'] = new_value
Don't do it, because it's slow:
updating an empty frame a-single-row-at-a-time. I have seen this method used WAY too much. It is by far the slowest. It is probably commonplace (and reasonably fast for some python structures), but a DataFrame does a fair number of checks on indexing, so this will always be very slow to update a row at a time. Much better to create new structures and concat.
It is better to create a list of data and create the DataFrame by constructor:
vals = ['a','b','c']
df = pandas.DataFrame({'New column':vals})
If you need to add random values to the newly created column, you could also use:
import numpy as np
df['new_column'] = np.random.randint(1, 9, len(df))
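If you really do need to add more values to that column later, extending the frame once with pd.concat (rather than updating row by row) keeps things fast. A minimal sketch of that idea:
import pandas as pd

df = pd.DataFrame({"New column": ["a", "b", "c"]})

# Later, add more values by building a small frame and concatenating once.
more = pd.DataFrame({"New column": ["d", "e"]})
df = pd.concat([df, more], ignore_index=True)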

Accessing groups in Pandas lambda function

I have a Pandas dataframe with a multiindex. Level 0 is 'Strain' and level 1 is 'JGI library.' Each 'Strain' has several 'JGI library' columns associated with it. I would like to use a lambda function to apply a t-test to compare two different strains. To troubleshoot, I have been taking one row of my dataframe using the .iloc[0] command.
row = pvalDf.iloc[0]
parent = 'LL1004'
child = 'LL345'
ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1]
This works as expected. Now I try to apply it to my whole dataframe
parent = 'LL1004'
child = 'LL345'
pvalDf = countsDf4.apply(lambda row: ttest_ind(row.groupby(level='Strain').get_group(parent), row.groupby(level='Strain').get_group(child))[1])
Now I get an error message saying, "ValueError: ('level name Strain is not the name of the index', 'occurred at index (LL1004, BCHAC)')"
'LL1004' is a 'Strain,' but Pandas doesn't seem to be aware of this. It looks like maybe the multiindex was not passed to the lambda function correctly? Is there a better way to troubleshoot lambda functions than using .iloc[0]?
I put a copy of my Jupyter notebook and an excel file with the countsDf4 dataframe on Github https://github.com/danolson1/pandas_ttest
Thanks,
Dan
How about, more simply:
pvalDf = countsDf4.apply(lambda row: ttest_ind(row[parent], row[child])[1], axis=1)
I've tested it on your notebook and it works.
Your problem is that DataFrame.apply() by default applies the function to each column, not to each row. So, you need to specify the axis=1 parameter to override the default behavior and apply the function row by row.
Also, there's no reason to use row.groupby(level='Strain').get_group(x) when you could simply index the group of columns by row[x]. :)
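A minimal, self-contained illustration of the axis=1 pattern; the column MultiIndex, library names, and random data here are made up to mirror the question's setup:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical counts table whose columns are a MultiIndex of (Strain, JGI library).
cols = pd.MultiIndex.from_tuples(
    [("LL1004", "libA"), ("LL1004", "libB"), ("LL345", "libC"), ("LL345", "libD")],
    names=["Strain", "JGI library"],
)
countsDf4 = pd.DataFrame(np.random.rand(5, 4), columns=cols)

parent, child = "LL1004", "LL345"
# row[parent] selects all library columns for that strain, so each row yields one p-value.
pvalDf = countsDf4.apply(lambda row: ttest_ind(row[parent], row[child])[1], axis=1)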
