I have a PySpark dataframe (df) with a very large number of rows and two columns (col1, col2), and I want a dataframe containing only the top 30 rows. I've seen that I can display them with:
df.show(30)
But that isn't what I want; I need an actual dataframe with only these 30 rows. I've also seen that I can run:
df.head(30) or df.take(30)
But these don't return a dataframe; they return a list of the form [Row(col1=100, col2=200), Row(col1=300, col2=500), ...]
What I want to know is how to convert this list [Row(col1=100, col2=200), Row(col1=300, col2=500), ...] back into a dataframe. I have tried DataFrame and createDataFrame, but both raise errors. Specifically, I have tried:
df_list = df.head(30)
df_Columns = ["col1", "col2"]
new_df = DataFrame(df_list, df_Columns)
But this also raises errors. I am at a loss with this seemingly simple task, so I am looking for a straightforward way to convert the df.head(30) list back into a PySpark dataframe. Any guidance or insights are greatly appreciated.
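For what it's worth, here is a minimal sketch of two approaches (assuming an active SparkSession named spark; names other than df are illustrative):

# Option 1: limit() stays in Spark and returns a dataframe directly
top30_df = df.limit(30)

# Option 2: createDataFrame() accepts the list of Row objects from head()
df_list = df.head(30)
new_df = spark.createDataFrame(df_list)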
I have a dataframe that I created from a master table in SQL. That new dataframe is then grouped by type, since I want to find the outliers for each group in the master table.
The function finds the outliers, showing where in the group dataframe the outliers occur. How do I see these outliers as part of the original dataframe? Not just volume, but also location, SKU, group, etc.
dataframe: HOSIERY_df
Code:
##Sku Group Data Frames
grouped_skus = sku_volume.groupby('SKUGROUP')
HOSIERY_df = grouped_skus.get_group('HOSIERY')
hosiery_outliers = find_outliers_IQR(HOSIERY_df['VOLUME'])
hosiery_outliers
#.iloc[[hosiery_outliers]]
#hosiery_outliers
(Picture showing the code and its output omitted.)
I know enough to realize I need to find the rows based on the location of the index, like VLOOKUP in Excel, but I need to do it in Python. I'm not sure how to pull only the 5th, 6th, 7th, ..., 3888th, and 4482nd rows of the HOSIERY_df.
You can provide a list of index numbers as integers to iloc, which it looks like you have tried based on your commented-out code. So you may want to make sure that find_outliers_IQR returns a list of int so it works properly with iloc, or convert its output.
It looks like it's currently returning a DataFrame. You can get the index of that frame as a list like this:
hosiery_outliers.index.tolist()
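For illustration, a minimal sketch using the names from the question (this assumes find_outliers_IQR returns a Series or DataFrame whose index labels come from HOSIERY_df):

# Index labels of the outlier rows, as a list of ints
outlier_idx = hosiery_outliers.index.tolist()

# Select those rows with all their columns (volume, location, SKU, group, etc.);
# use loc for label-based lookup, or iloc if the index is the default 0..n-1
outlier_rows = HOSIERY_df.loc[outlier_idx]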
I have a dataframe with 5 columns (Participants, duration_1, duration_2, duration_3, duration_4). The "Participant" column has subjects with either IDCY or IDCO labels, for example: IDCY06, IDCO02, IDCY31, etc. I want to create two new dataframes: one with the IDCY subjects and one with the IDCO subjects. I have been using the code:
df[df["Participant"].str.contains("IDCY")]
and I keep getting a KeyError for Participants even though everything is spelled as it should be.
Is there another method to iterate over the rows and get a new dataframe containing only the participants whose ID has a given substring?
Thank you.
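As an aside, the column list at the top says "Participants" (plural) while the code uses "Participant" (singular); a mismatched column name is exactly what raises a KeyError. A minimal sketch assuming the column really is named "Participants":

# Filter on the substring; na=False guards against missing values
idcy_df = df[df["Participants"].str.contains("IDCY", na=False)]
idco_df = df[df["Participants"].str.contains("IDCO", na=False)]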
I've attempted to search the forum for this question, but I believe I may not be asking it correctly, so here it goes.
I have a large data set with many columns. Originally, I needed to sum, for each row, all the columns whose names match a pattern, grouped by two factor variables. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
By running this code, I receive a modified dataframe grouped by my 2 factor variables (group & id), with all the columns, and the final sum column I need. However, now I want to put that final sum column back into the original dataframe; the above code instead assigns the entire modified dataframe to my sum column. I know this is achievable in R by simply adding .$sum at the end of a piped chain. Any ideas on how to do this in pandas?
My hopeful output is just the addition of the final "sum" variable from the above lines of code to my original dataframe.
Edit: To clarify, the code above returns the entire modified dataframe; all I want returned is the final "sum" column (highlighted in yellow in the omitted screenshot).
Is this what you need?
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
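For context, transform('sum') returns per-group sums aligned to the original index, so the result assigns straight back into data without any reshaping. A toy sketch with made-up values (only the column names come from the question):

import pandas as pd

data = pd.DataFrame({
    'id':     [1, 1, 2, 2],
    'group':  ['a', 'a', 'b', 'b'],
    'x_name': [1, 2, 3, 4],
    'y_name': [10, 20, 30, 40],
})
cols = data.filter(regex=r'_name$').columns

# Per-group sums aligned row-by-row, then summed across the matched columns
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
print(data)  # every row in group (1, 'a') gets 3 + 30 = 33, and so on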
I just edited the question as maybe I didn't make myself clear.
I have two dataframes (MR and DT).
The column 'A' (the ID column) in dataframe DT is a subset of the column 'A' in dataframe MR; the two frames are only similar (not equal) in this ID column, and the rest of the columns differ, as does the number of rows.
How can I get the rows from dataframe MR whose MR['ID'] values are equal to values in DT['ID'], knowing that values in 'ID' can appear several times in the same column?
(DT has 1538 rows and MR has 2060 rows.)
I tried some lines proposed here: https://stackoverflow.com/questions/28901683/pandas-get-rows-which-are-not-in-other-dataframe, but I got bizarre results, as I don't fully understand the methods proposed there (and the goal is a little different).
Thanks!
Take a look at the pandas.Series.isin() method. In your case you'd want to use something like:
matching_id = MR.ID.isin(DT.ID) # This returns a boolean Series of whether values match or not
# Now filter your dataframe to keep only matching rows
new_df = MR.loc[matching_id, :]
Or, if you just want a new dataframe of combined records for the same ID, use merge():
new_df = pd.merge(MR, DT, on='ID')
This will create a new dataframe with columns from both original dfs, but only rows where the ID is the same.
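Since IDs can repeat in both frames, note that merge() pairs every matching row. A sketch with toy values (illustrative only) showing the difference between the two approaches:

import pandas as pd

MR = pd.DataFrame({'ID': [1, 1, 2, 3], 'mr_val': ['a', 'b', 'c', 'd']})
DT = pd.DataFrame({'ID': [1, 2, 2], 'dt_val': ['x', 'y', 'z']})

# isin(): keeps MR rows whose ID appears anywhere in DT (3 rows here)
print(MR.loc[MR.ID.isin(DT.ID)])

# merge(): each MR row pairs with every DT row sharing its ID,
# so ID 2 appears twice in the result (1 MR row x 2 DT rows)
print(pd.merge(MR, DT, on='ID'))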
I have a dataframe df with two columns, date and data. I want to take the first difference of the data column and add it as a new column.
It seems that df.set_index('date').shift() or df.set_index('date').diff() gives me the desired result. However, when I try to add it as a new column, I get NaN for all the rows.
How can I fix this command:
df['firstdiff'] = df.set_index('date').shift()
to make it work?
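In case it helps, the NaNs come from index misalignment: df.set_index('date') returns a frame indexed by date, so assigning it back into df (which still has its original integer index) matches no labels. A minimal sketch that keeps the original index (toy values; assumes df is already sorted by date):

import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=4),
    'data': [10, 12, 15, 11],
})

# diff() on the column alone preserves df's index, so the assignment aligns
df['firstdiff'] = df['data'].diff()
print(df)  # first row is NaN by construction; the rest are the differences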