Splitting a dataframe column into two separate dataframes using pandas - Python

I am using Python to code in a Jupyter notebook.
I'm trying to use pandas to split a dataframe, based on the values in a column called "PostTypeId", into two separate dataframes: one dataframe, to be called Questions, holds the rows where the column value is 1, and the second dataframe, to be called Answers, holds the rows where the column value is 2. I'm asked to do all this by defining it within a function called split_df.
Wondering how I would go about this.
Thanks so much :)

You can do it by filtering on the column (assuming your dataframe is named df):
Questions = df[df['PostTypeId'] == 1]
Answers = df[df['PostTypeId'] == 2]
When creating the function, use the filter values as the arguments, as in the sketch below.
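For example, a minimal sketch of split_df (the parameter names and defaults are assumptions based on the question, and the posts dataframe is assumed to be called posts):
import pandas as pd

def split_df(df, col_name='PostTypeId', q_value=1, a_value=2):
    # filter the rows where col_name matches each value
    questions = df[df[col_name] == q_value].copy()
    answers = df[df[col_name] == a_value].copy()
    return questions, answers

# usage:
# Questions, Answers = split_df(posts)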


Converting for loop to numpy calculation for pandas dataframes

So I have a Python script that compares two dataframes and finds any rows that are not in both. It currently iterates through a for loop, which is slow.
I want to improve the speed of the process, and I know that the iteration is the problem. However, I haven't had much luck using various pandas/numpy methods such as merge and where.
Couple of caveats:
The column names from my file sources aren't the same, so I set their names into variables and use the variable names to compare.
I want to only use the column names from one of the dataframes.
df_new represents new information to be checked against what is currently on file (df_current)
My current code:
set_current = set(df_current[current_col_name])
df_out = pd.DataFrame(columns=df_new.columns)
for i in range(len(df_new.index)):
    # if the row entry is new, we add it to our dataset
    if df_new[new_col_name][i] not in set_current:
        df_out.loc[len(df_out)] = df_new.iloc[i]
    # if the row entry is a match, then we aren't going to do anything with it
    else:
        continue
# create an xlsx file with the new items
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Here are some simple examples of dataframes I would be working with:
df_current
|partno|description|category|cost|price|upc|brand|color|size|year|
|:-----|:----------|:-------|:---|:----|:--|:----|:----|:---|:---|
|123|Logo T-Shirt||25|49.99||apple|red|large|2021|
|456|Knitted Shirt||35|69.99||apple|green|medium|2021|
df_new
|mfgr_num|desc|category|cost|msrp|upc|style|brand|color|size|year|
|:-------|:---|:-------|:---|:---|:--|:----|:----|:----|:---|:---|
|456|Knitted Shirt||35|69.99|||apple|green|medium|2021|
|789|Logo Vest||20|39.99|||apple|yellow|small|2022|
There are usually many more columns in the current sheet, but I wanted the table displayed to be somewhat readable. The key is that I would only want the columns in the "new" dataframe to be output.
I would want to match partno with mfgr_num since the spreadsheets will always have them, whereas some items don't have upc/gtin/ean.
It's still a bit unclear what you want without examples of each dataframe. But if you want to test unique IDs in differently named columns in two different dataframes, try an approach like this.
Find the IDs that exist in the second dataframe:
test_ids = df2['cola_id'].unique().tolist()
Then filter the first dataframe for those IDs:
df1[df1['keep_id'].isin(test_ids)]
Here is the answer that works; it was supplied to me by someone much smarter:
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]
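Putting it together with the original output step, a minimal sketch (variable names follow the question):
# keep only the rows of df_new whose key is absent from df_current;
# ~ negates the boolean mask returned by isin
df_out = df_new[~df_new[new_col_name].isin(df_current[current_col_name])]

# create an xlsx file with the new items, keeping only df_new's columns
df_out.to_excel("data/new_products_to_examine.xlsx", index=False)
Because isin is vectorized, this replaces the row-by-row loop entirely.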

Unable to update new column values in rows derived from an existing column with multiple values separated by ','

Original dataframe:
Converted dataframe using stack and split:
Adding a new column to the converted dataframe:
What I am trying to do is add a new column using np.select(condition, values), but it is not updating the two additional rows derived from H1; it returns 0 or NaN instead. Can someone please help me here?
Please note I have already done the reset_index, but it's still not helping.
I think using numpy in this situation is unnecessary.
You can use something like the following code; note that .loc is needed here, since chained indexing like df[df.State == 'CT']['H3'] = ... assigns to a copy and leaves the original dataframe unchanged:
df.loc[df.State == 'CT', 'H3'] = 4400000
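If you do want the np.select route the question mentions, a hedged sketch (H3 and State come from the question; the NY condition and both values are made-up illustrations):
import numpy as np

conditions = [df.State == 'CT', df.State == 'NY']
values = [4400000, 8800000]
# np.select evaluates the conditions for every row, so after reset_index
# the rows produced by stack/split get assigned values as well
df['H3'] = np.select(conditions, values, default=0)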

How to get column names after importing pickle file into Pandas

I am new to hands-on Python and to programming in general. I have imported a 6 GB pickle file into pandas and have been able to display the contents of the file. It doesn't look well ordered, however. My dataframe has varying rows and 842 columns.
My next tasks are to:
get the column names of all 842 columns so I can find columns that have similar features;
create a new column (Series) with data from (1) above;
"append" the new column to the original dataframe.
Thus far I have tried the "functions" column, col, and dataframe.columns to get the column names, but none of them works.
Please see what my program looks like: code and output
You can get a list of your dataframe's column names using this:
list(your_dataframe.columns)
For adding new columns, check this:
new-columns-in pandas
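Putting the three steps together, a minimal sketch assuming the dataframe is named df (the 'feature_' prefix and the summed column are made-up illustrations of steps (2) and (3)):
# (1) list all 842 column names
cols = list(df.columns)
print(cols)

# (2) build a new Series from columns with similar names;
# the 'feature_' prefix is hypothetical
similar = [c for c in cols if c.startswith('feature_')]
new_col = df[similar].sum(axis=1)

# (3) append the new column to the original dataframe
df['feature_sum'] = new_col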

I am trying to create a function that calculates z-scores by looping through a dataframe

I know I could use the z-score function from a stats package, but I am trying to practice creating functions and am stuck on this:
Create a function that takes a DataFrame, and a list of columns to be standardized (z-scores), as inputs and returns a DataFrame as output. Inside this function, loop through the DataFrame columns supplied as input and standardize the values of these columns. The columns must contain only numerical values. Inside your function, you may need to initialize an empty DataFrame and populate it with columns having standardized values. The returned DataFrame should contain only the columns supplied as input. Each column of the returned DataFrame should contain standardized values only.
Thanks for the help!!
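A minimal sketch of such a function, assuming the usual z-score definition (value minus column mean, divided by column standard deviation; the function name is an assumption):
import pandas as pd

def standardize(df, columns):
    out = pd.DataFrame()  # initialize an empty DataFrame, as the prompt suggests
    for col in columns:   # loop through the columns supplied as input
        out[col] = (df[col] - df[col].mean()) / df[col].std()
    return out            # only the requested columns, all standardized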

Spark - HashingTF inputCol accepts one column but I want more

I'm trying to use HashingTF in Spark, but I have one major problem.
If inputCol is a single column, like this:
HashingTF(inputCol="bla", outputCol="tf_features") it works fine.
But if I try to add more columns, I get the error message "Cannot convert list to string".
All I want to do is
HashingTF(inputCol=["a_col","b_col","c_col"], outputCol="tf_features").
Any ideas on how to fix it?
HashingTF takes a single column as input. If you want to use several columns, you can build an array from those columns using the array function and then flatten it with explode; you will end up with one column containing the values from all the columns. Finally, you can pass that column to HashingTF.
from pyspark.sql import functions as f
df2 = df.select(f.explode(f.array(f.col("a_col"), f.col("b_col"))).alias("newCol"))
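Alternatively, if a_col, b_col, and c_col each hold a token, one hedged sketch is to combine them into a single array column and feed that to HashingTF directly, since HashingTF expects an array of terms per row (no explode needed in this variant):
from pyspark.ml.feature import HashingTF
from pyspark.sql import functions as f

# combine the string columns into one array<string> column
df2 = df.withColumn('tokens', f.array('a_col', 'b_col', 'c_col'))

htf = HashingTF(inputCol='tokens', outputCol='tf_features')
result = htf.transform(df2)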
