Copy rows of a pandas DataFrame to a new DataFrame - Python

I have a dataframe that looks roughly like this one:
id  name  numOfppl
1   A     30
2   B     31
3   C     10
4   D     0
...
31  comp  52
These numbers come from a Python script. Once there are 5 rows where numOfppl >= 30, the code should stop and return all the remaining rows in a new dataframe.
My code so far:
df[df['numOfppl'] >= 30].iloc[:5]
If more rows are added, how can I copy them to a new DataFrame?

Once you have created a dataframe for the condition you mentioned, you need all the other rows in a new dataframe, right?
Please check the below:
df_1 = df[df['numOfppl'] >= 30].iloc[:5]
df_2 = df[~df.isin(df_1)].dropna()
Here df_1 will have the 5 rows matching the condition, and all the remaining rows will be copied to df_2.
Also, for newly added rows (later) you can directly copy them into dataframe df_2.
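As a quick check, here is a runnable sketch of that split on a small made-up dataframe (the names and numbers are hypothetical, not from the question):

```python
import pandas as pd

# Hypothetical sample data in the shape of the question's dataframe
df = pd.DataFrame({'id': range(1, 9),
                   'name': list('ABCDEFGH'),
                   'numOfppl': [30, 31, 10, 0, 45, 33, 50, 12]})

# First five rows meeting the condition
df_1 = df[df['numOfppl'] >= 30].iloc[:5]

# Everything else: mask out the rows already present in df_1, then drop them
df_2 = df[~df.isin(df_1)].dropna()
```

Note that `~df.isin(df_1)` relies on index alignment between df and df_1, and `.dropna()` may upcast integer columns to float.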

Related

Fill empty columns with values from another column of another row based on an identifier

I am trying to fill a dataframe, containing repeated elements, based on an identifier.
My DataFrame is as follows:
  Code Value
0 SJHV
1 SJIO  96B
2 SJHV  33C
3 CPO3  22A
4 CPO3  22A
5 SJHV  33C   # <-- numbers stored as strings
6 TOY
7 TOY         # <-- these aren't NaN, they are empty strings
I would like to remove the empty 'Value' rows only if a non-empty 'Value' row exists for the same Code. To be clear, I would want my output to look like:
  Code Value
0 SJHV  33C
1 SJIO  96B
2 CPO3  22A
3 TOY
My attempt was as follows:
df['Value'].replace('', np.nan, inplace=True)
df2 = df.dropna(subset=['Value']).drop_duplicates('Code')
As expected, this code also drops the 'TOY' Code. Any suggestions?
The empty strings will go to the bottom if you sort Value in descending order; then you can just drop duplicates.
import pandas as pd

df = pd.DataFrame({'Code': ['SJHV','SJIO','SJHV','CPO3','CPO3','SJHV','TOY','TOY'],
                   'Value': ['','96B','33C','22A','22A','33C','','']})
df = (
    df.sort_values(by=['Value'], ascending=False)
      .drop_duplicates(subset=['Code'], keep='first')
      .sort_index()
)
Output:
  Code Value
1 SJIO  96B
2 SJHV  33C
3 CPO3  22A
6 TOY
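An alternative sketch that does not depend on string sort order: treat empty strings as missing and take the first non-missing Value per Code (the final fillna('') keeps codes such as TOY that have no value at all):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Code': ['SJHV','SJIO','SJHV','CPO3','CPO3','SJHV','TOY','TOY'],
                   'Value': ['','96B','33C','22A','22A','33C','','']})

out = (df.replace('', np.nan)             # empty strings -> NaN
         .groupby('Code', sort=False)['Value']
         .first()                          # first non-NaN Value per Code (NaN if none)
         .fillna('')                       # codes with no Value keep an empty string
         .reset_index())
```

With sort=False the groups keep their first-appearance order, which matches the desired output.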

Appending only rows that are not yet in a pandas dataframe

I have the same dataset but over different weeks (so later weeks contain new rows). I want to append the new rows to the original dataframe to create one big dataframe with all unique rows and no duplicates. I can't just take the last week because some get deleted over the weeks.
I tried to use the following code but somehow my final_info dataframe still contains some non-unique values
final_info = data[list(data.keys())[-1]]['all_info']
for week in reversed(data.keys()):
    df_diff = pd.concat([data[week]['all_info'], final_info]).drop_duplicates(subset='project_slug', keep=False)
    final_info = final_info.append(df_diff).reset_index(drop=True)
Does somebody see where it goes wrong?
If I understand your question, you are just trying to add the unique rows from one dataframe to another dataframe. I don't think there is any need to iterate through the keys like you are doing. I'll try to walk through an example to make it clearer.
So if you have a dataframe A:
col1 col2
1 2
2 3
3 4
and a dataframe B:
col1 col2
1 2
2 3
6 4
These two dataframes have the same first two rows but different last rows. If you wanted to get all the unique rows into one dataframe, you could first get all the unique rows from just one of the dataframes. So for this example you could get the unique row in dataframe B; let's call it df_diff. The code to do this would be:
df_diff = B[~B.col1.isin(A.col1)]
Output:
col1 col2
6    4
This line builds what's called a boolean mask and then negates it with ~, so that you get all rows in dataframe B whose col1 value is not in dataframe A.
You could then merge this dataframe, df_diff, with the first dataframe A. We can call this df_full. This step is done with:
df_full = pd.concat([A, df_diff], ignore_index=True)
The ignore_index=True just resets the index of the resulting dataframe. This will give you:
col1 col2
1 2
2 3
3 4
6 4
Now the above dataframe has the new row from dataframe B plus the original rows from dataframe A.
I think this would work for your situation and may take fewer lines of code.
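Putting the two steps together as a runnable sketch:

```python
import pandas as pd

A = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
B = pd.DataFrame({'col1': [1, 2, 6], 'col2': [2, 3, 4]})

# Rows of B whose col1 value does not appear in A
df_diff = B[~B.col1.isin(A.col1)]

# Stack A and the new rows, with a fresh index
df_full = pd.concat([A, df_diff], ignore_index=True)
```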

Replace rows in Dataframe using index from another Dataframe

I have two dataframes with identical structures, df and df_a. df_a is a subset of df that I need to reintegrate into df. Essentially, df_a contains various rows (with varying indices) from df that have been manipulated.
Below is an example of the indices of df and df_a. Both have the same column structure, so all the columns are the same; it's only the rows and the index of the rows that differ.
>> df
index .. other_columns ..
0
1
2
3
...
9999
10000
10001
[10001 rows x 20 columns]
>> df_a
index .. other_columns ..
5
12
105
712
...
9824
9901
9997
[782 rows x 20 columns]
So, I want to overwrite only the rows in df that have the indices of df_a with the corresponding rows in df_a. I checked out Replace rows in a Pandas df with rows from another df and replace rows in a pandas data frame, but neither of those explains how to use the indices of another dataframe to replace the values in the rows.
Something along the lines of:
df.loc[df_a.index, :] = df_a[:]
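That line should do the job, since .loc assignment aligns on index labels and column names. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 0]})            # stand-in for the big frame
df_a = pd.DataFrame({'a': [10, 30]}, index=[1, 3])   # manipulated subset of rows

# Overwrite the rows of df at df_a's index labels with df_a's values
df.loc[df_a.index, :] = df_a
```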
I don't know if this is what you meant; for that you would need to be more specific. But if the first data frame was modified to be a new data frame with different indexes, then you can use this code to reset the indexes:
import pandas as pd
df_a = pd.DataFrame({'a':[1,2,3,4],'b':[5,4,2,7]}, index=[2,55,62,74])
df_a.reset_index(inplace=True, drop=True)
print(df_a)
Prints:
   a  b
0  1  5
1  2  4
2  3  2
3  4  7

Pandas Merge/Join

I have a dataframe called Bob with columns [A, B], where A has only unique values like a serial ID. Its shape is (100, 2).
I have another dataframe called Anna with columns [C, D, E, F], where C has the same values as A in Bob but with duplicates. Column D is a category (phone/laptop/ipad) that is determined by the serial ID found in C. The shape of Anna is (500, 4).
Example rows in Anna:
C     D       E   F
K103  phone   12  17
K103  phone   14  23
G221  laptop  25  6
I want to create a new dataframe that has columns A, B, D by searching for the value of A in Anna[C]. The final dataframe should have shape (100, 3).
I'm finding this difficult with pd.merge (I tried left/inner/right joins) because it keeps creating 2 rows in the new dataframe with the same values, i.e. K103 will show up twice in the new dataframe.
Tell me if this works; I'm thinking of this while typing it, so I couldn't actually check.
df = Bob.merge(Anna[['C','D']].drop_duplicates(keep='last'), how='left', left_on='A', right_on='C')
Let me know if it doesn't work, and I'll create a sample dataset and edit it with the correct code.
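A runnable sketch with made-up miniature versions of Bob and Anna. Note that passing subset='C' to drop_duplicates guarantees one row per serial ID even if the same serial ever carried two different categories:

```python
import pandas as pd

# Hypothetical miniature versions of the question's dataframes
Bob = pd.DataFrame({'A': ['K103', 'G221'], 'B': [7, 9]})
Anna = pd.DataFrame({'C': ['K103', 'K103', 'G221'],
                     'D': ['phone', 'phone', 'laptop'],
                     'E': [12, 14, 25],
                     'F': [17, 23, 6]})

# Deduplicate the lookup table before merging so each serial matches once
merged = Bob.merge(Anna[['C', 'D']].drop_duplicates(subset='C', keep='last'),
                   how='left', left_on='A', right_on='C')
result = merged[['A', 'B', 'D']]   # one row per serial ID
```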

Trying to multiply specific columns, by a portion of multiple rows in Pandas DataFrame (Python)

I am trying to multiply a few specific columns by a portion of multiple rows and to create a new column from every result. I could not really find an answer to my question in previous Stack Overflow questions or on Google, so maybe one of you can help.
I would like to point out that I am quite the beginner in Python, so apologies ahead for any obvious questions or strange code.
This is what my DataFrame currently looks like (screenshot omitted).
So, for the column Rank of Hospital by Doctor_1, I want to multiply all its numbers by the values of the first row of column Rank of Doctor by Hospital_1 through column Rank of Doctor by Hospital_10. This would result in:
1*1
2*1
3*1
4*4
...
and so on.
I want to do this for every Doctor_ column. So for Doctor_2 its values should be multiplied by the second row of all those ten Rank of Doctor by Hospital_ columns, Doctor_3 by the third row, etc.
So far, I have transposed the Rank of Doctor by Hospital_ columns into a new DataFrame (screenshot omitted)
and tried to multiply this by a DataFrame of the Rank of Hospital by Doctor_ columns. Here the first column of the first df should be multiplied by the first column of the second df (and second column by second column, etc.).
But my current formula
preferences_of_doctors_and_hospitals_doctors_ranking.mul(preferences_of_doctors_and_hospitals_hospitals_ranking_transposed)
is obviously not working.
Does anybody know what I am doing wrong and how I could fix this? Maybe I could write a for loop so that a new column is created for every multiplication of columns? So Multiplication_column_1 of DF3 = Column 1 of DF1 * Column 1 of DF2, and Multiplication_column_2 of DF3 = Column 2 of DF1 * Column 2 of DF2.
Thank you in advance!
Jeff
You can multiply the 2D arrays created by filtering the columns with filter and taking values first:
arr = df.filter(like='Rank of Hospital by').values * df.filter(like='Rank of Doctor by').values
Or:
arr = (preferences_of_doctors_and_hospitals_doctors_ranking.values *
       preferences_of_doctors_and_hospitals_hospitals_ranking_transposed.values)
Notice - this requires the same ordering of columns, the same number of columns, and matching indexes in both filtered DataFrames.
This gives a 2D array, so build a DataFrame with the constructor and join it to the original:
df = df.join(pd.DataFrame(arr, index=df.index).add_prefix('Multiplied '))
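A runnable sketch on a tiny frame with the same column-name pattern (two doctors/hospitals instead of ten; the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Rank of Hospital by Doctor_1': [1, 2, 3],
    'Rank of Hospital by Doctor_2': [2, 1, 3],
    'Rank of Doctor by Hospital_1': [1, 1, 4],
    'Rank of Doctor by Hospital_2': [3, 2, 1],
})

# Element-wise product of the two filtered blocks (column order must match)
arr = (df.filter(like='Rank of Hospital by').values
       * df.filter(like='Rank of Doctor by').values)

# Wrap the 2D array in a DataFrame and join it back to the original
df = df.join(pd.DataFrame(arr, index=df.index).add_prefix('Multiplied '))
```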
If I understood the question correctly, I think you way overcomplicated it.
You can just create another column and tell pandas to give it the value of the first column multiplied by the second column:
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [6, 7, 8, 9, 10]})
df["mul"] = df["A"] * df["B"]
print(df)
Output:
   A   B  mul
0  1   6    6
1  2   7   14
2  3   8   24
3  4   9   36
4  5  10   50
More similar to your specific case with more than 2 columns:
df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [6, 7, 8, 9, 10], "C": [11, 12, 13, 14, 15]})
df["mul"] = df["A"] * df["B"] * df["C"]
