Appending only rows that are not yet in a pandas dataframe - python

I have the same dataset but over different weeks (so later weeks contain new rows). I want to append the new rows to the original dataframe to create one big dataframe with all unique rows and no duplicates. I can't just take the last week because some get deleted over the weeks.
I tried to use the following code, but somehow my final_info dataframe still contains some non-unique values:
final_info = data[list(data.keys())[-1]]['all_info']
for week in reversed(data.keys()):
    df_diff = pd.concat([data[week]['all_info'], final_info]).drop_duplicates(subset='project_slug',
                                                                              keep=False)
    final_info = final_info.append(df_diff).reset_index(drop=True)
Does somebody see where it goes wrong?

If I understand your question correctly, you are trying to add the unique rows from one dataframe to another dataframe. I don't think you need to iterate through the keys the way you are doing. There is an example in another question that I think can help, and I think it is conceptually easier to follow [1]. I'll walk through an example to make it clearer.
So if you have a dataframe A:
col1 col2
1 2
2 3
3 4
and a dataframe B:
col1 col2
1 2
2 3
6 4
These two dataframes have the same first two rows but different last rows. To get all the unique rows into one dataframe, you can first pick out the rows that appear in only one of the dataframes. For this example, you could get the row that is unique to dataframe B; let's call it df_diff. The code to do this would be:
df_diff = B[~B.col1.isin(A.col1)]
Output:
col1 col2
6 4
The line above builds what is called a boolean mask and negates it with ~, so you get all rows in dataframe B whose col1 value is not in dataframe A.
You could then concatenate this dataframe, df_diff, with the first dataframe A. We can call the result df_full. This step is done with:
df_full = pd.concat([A, df_diff], ignore_index=True)
The ignore_index=True just resets the index of the resulting dataframe. This will give you:
col1 col2
1 2
2 3
3 4
6 4
The dataframe above now has the new row from dataframe B plus the original rows from dataframe A.
I think this would work for your situation and may take fewer lines of code.
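Applied to the weekly data from the question, the same idea could look like the sketch below. The contents of `data` and the `value` column are made up for illustration; only the `all_info` key and the `project_slug` column come from the question. One likely source of the duplicates in the original loop is that `keep=False` on the concatenation keeps every slug that occurs in exactly one of the two frames, including rows that are already in final_info, and those rows then get appended a second time.

```python
import pandas as pd

# Hypothetical stand-in for the asker's `data` dict: week -> {"all_info": DataFrame}
data = {
    "week1": {"all_info": pd.DataFrame({"project_slug": ["a", "b"], "value": [1, 2]})},
    "week2": {"all_info": pd.DataFrame({"project_slug": ["b", "c"], "value": [2, 3]})},
    "week3": {"all_info": pd.DataFrame({"project_slug": ["c", "d"], "value": [3, 4]})},
}

weeks = list(data.keys())
# Start from the first week and only ever add rows whose slug is not present yet
final_info = data[weeks[0]]["all_info"].copy()
for week in weeks[1:]:
    current = data[week]["all_info"]
    # Boolean mask: rows of this week whose project_slug is new
    new_rows = current[~current["project_slug"].isin(final_info["project_slug"])]
    final_info = pd.concat([final_info, new_rows], ignore_index=True)

print(final_info)
```

Because each week is filtered against everything collected so far, a slug that was deleted in a later week is still kept from the week in which it first appeared.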

Related

How to merge rows by same value in different columns using Python (Pandas)

I have a data frame, something like this:
Id Col1 Col2 Paired_Id
1 a A
2 c B
A b 1
B d 2
I would like to merge each row with its paired row to get output like the following, and delete the paired row after merging.
Id Col1 Col2 Paired_Id
1 a b A
2 c d B
Any hint?
So:
Merging rows (ID) with its Paired_ID entries.
Is this possible with Pandas?
Assuming NaNs in the empty cells, I would use a groupby.first with a frozenset of the two IDs as grouper:
group = df[['Id', 'Paired_Id']].apply(frozenset, axis=1)
out = df.groupby(group, as_index=False).first()
Output:
Id Col1 Col2 Paired_Id
0 1 a b A
1 2 c d B
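A self-contained variant of this idea is sketched below. It assumes both ID columns hold strings and that the empty cells are NaN, and it swaps the frozenset for a sorted, joined string as the grouping key, which is just one way to make the key independent of the order of the two IDs. The sample frame is reconstructed from the question.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; empty cells are assumed to be NaN
df = pd.DataFrame({
    "Id": ["1", "2", "A", "B"],
    "Col1": ["a", "c", np.nan, np.nan],
    "Col2": [np.nan, np.nan, "b", "d"],
    "Paired_Id": ["A", "B", "1", "2"],
})

# Order-independent key: ("1", "A") and ("A", "1") both become "1|A"
key = df[["Id", "Paired_Id"]].apply(lambda row: "|".join(sorted(row)), axis=1)

# first() takes the first non-null value per column within each pair
out = df.groupby(key, sort=False).first().reset_index(drop=True)
print(out)
```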
I don't have a lot of information about the structure of your dataframe, so I will assume a few things; please correct me if I'm wrong:
A line with an entry in Col1 will never have an entry in Col2.
Corresponding lines appear in the same sequence (lines 1,2,3... then corresponding lines 1,2,3...).
Every line has a corresponding second line later on in the dataframe.
If all those assumptions are correct, you could split your data into two dataframes, df_upperhalf containing the Col1, df_lowerhalf the Col2.
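# Split at the midpoint: the first half holds Col1, the second half holds Col2
df_upperhalf = df.iloc[:len(df.index) // 2]
df_lowerhalf = df.iloc[len(df.index) // 2:]
Then you can easily combine those values:
# Copy to avoid modifying a slice of df; to_numpy() bypasses index alignment
df_combined = df_upperhalf.copy()
df_combined['Col2'] = df_lowerhalf['Col2'].to_numpy()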
If some of my assumptions are incorrect, this will of course not produce the results you want.
There are also quite a few ways to do it in fewer lines of code, but I think this way you end up with nicer dataframes and the code should be easily readable.
Edit:
I think this would be quite a bit faster:
df_upperhalf = df.head(len(df.index) // 2)
df_lowerhalf = df.tail(len(df.index) // 2)

Pandas dataframe on python

I feel like this may be a really easy question, but I can't figure it out. I have a data frame that looks like this:
one two three
1 2 3
2 3 3
3 4 4
The third column has duplicates. I want to keep the first row but drop the second row, because its value in the third column is a duplicate. How would I do this?
Pandas DataFrame objects have a method for this; assuming df is your dataframe, df.drop_duplicates(subset='name_of_third_column') returns the dataframe with any rows containing duplicate values in the third column removed.
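As a quick check on the frame from the question, a minimal sketch (the variable name `deduped` is only illustrative):

```python
import pandas as pd

# The frame from the question
df = pd.DataFrame({"one": [1, 2, 3], "two": [2, 3, 4], "three": [3, 3, 4]})

# keep='first' is the default, so the first row of each duplicate group survives
deduped = df.drop_duplicates(subset="three")
print(deduped)
```

The original row labels are preserved, so the second row (label 1) is simply missing from the result; chain `.reset_index(drop=True)` if you want a fresh 0..n-1 index.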

Pandas: Loop through DataFrame columns and remove rows with variables that have less than i observations

Suppose I have the following data frame.
X = pd.DataFrame([["A","Z"],["A","Z"],["A","Z"],["B","Y"],["B","Y"]],columns=["COL1","COL2"])
COL1 contains 3 A's and 2 B's. COL2 contains 3 Z's and 2 Y's.
What I'm trying to do is search each column and find the rows where a value occurs fewer than i times (e.g. in this case, search each column and find the rows whose value has fewer than 3 entries).
In this case I have a bunch of duplicate entries but it's just presented like that for simplicity.
Link to my previous question:
Pandas: How do I loop through and remove rows where a column has a single entry
Please let me know if clarification is needed.
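You can use the subset and keep=False parameters of duplicated to keep only the rows that occur more than once:
X = X[X.duplicated(subset=list(X.columns), keep=False)]
Note that this only removes rows that occur exactly once; it does not apply a threshold i. In the sample frame every row has at least one duplicate, so all five rows are kept.
Output:
COL1 COL2
0 A Z
1 A Z
2 A Z
3 B Y
4 B Y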
You can do
i=3
X[X.groupby(X.columns.tolist()).COL1.transform('count')>=i]
COL1 COL2
0 A Z
1 A Z
2 A Z
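The groupby above counts combinations of values across all columns. If the threshold should instead be checked column by column, as the question's wording suggests, one possible sketch is to map each value to its per-column frequency. The names `counts` and `result` are illustrative, and this is only one way to read the requirement.

```python
import pandas as pd

X = pd.DataFrame(
    [["A", "Z"], ["A", "Z"], ["A", "Z"], ["B", "Y"], ["B", "Y"]],
    columns=["COL1", "COL2"],
)
i = 3

# For every cell, look up how often its value occurs within its own column
counts = X.apply(lambda col: col.map(col.value_counts()))

# Keep a row only if each of its values occurs at least i times in its column
result = X[(counts >= i).all(axis=1)]
print(result)
```

On the sample data both readings give the same three rows; they can differ once the columns are not perfectly correlated.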

Trying to multiply specific columns, by a portion of multiple rows in Pandas DataFrame (Python)

I am trying to multiply a few specific columns by a portion of multiple rows and creating a new column from every result. I could not really find an answer to my question in previous stackoverflow questions or on google, so maybe one of you can help.
I would like to point out that I am quite the beginner in Python, so apologies ahead for any obvious questions or strange code.
This is how my DataFrame currently looks like:
So, for the column Rank of Hospital by Doctor_1, I want to multiply all its numbers by the values of the first row of column Rank of Doctor by Hospital_1 until column Rank of Doctor by Hospital_10. Which would result in:
1*1
2*1
3*1
4*4
...
and so on.
I want to do this for every Doctor_ column. So for Doctor_2, its values should be multiplied by the second row of all those ten columns (Rank of Doctor by Hospital_), for Doctor_3 by the third row, etc.
So far, I have transposed the Rank of Doctor by Hospital_ columns in a new DataFrame:
and tried to multiply this by a DataFrame of the Rank of Hospital by Doctor_ columns. Here the first column of the first df should be multiplied by the first column of the second df. (and second column * second column, etc.):
But my current formula
preferences_of_doctors_and_hospitals_doctors_ranking.mul(preferences_of_doctors_and_hospitals_hospitals_ranking_transposed)
is obviously not working:
Does anybody know what I am doing wrong and how I could fix this? Maybe I could write a for loop so that a new column is created for every multiplication of columns? So Multiplication_column_1 of DF3 = Column 1 of DF1 * Column 1 of DF2 and Multiplication_column_2 of DF3 = Column 2 of DF1 * Column 2 of DF2.
Thank you in advance!
Jeff
You can multiply the 2d arrays that you get by selecting the columns with filter and then taking values:
arr = df.filter(like='Rank of Hospital by').values * df.filter(like='Rank of Doctor by').values
Or:
arr = (preferences_of_doctors_and_hospitals_doctors_ranking.values *
preferences_of_doctors_and_hospitals_hospitals_ranking_transposed.values)
Note that both filtered DataFrames need the same column ordering, the same number of columns, and the same number of rows.
The result is a 2d array, so create a DataFrame from it with the constructor and join it to the original:
df = df.join(pd.DataFrame(arr, index=df.index).add_prefix('Multiplied '))
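To see the mechanics on something small, here is a sketch with a made-up frame for two doctors and two hospitals. The column names follow the pattern in the question, but the values are invented. DataFrame.mul aligns on index and column labels, which is likely why the original attempt did not produce the expected numbers; working on the underlying arrays multiplies purely by position.

```python
import pandas as pd

# Invented data; only the column naming pattern comes from the question
df = pd.DataFrame({
    "Rank of Hospital by Doctor_1": [1, 2],
    "Rank of Hospital by Doctor_2": [2, 1],
    "Rank of Doctor by Hospital_1": [1, 2],
    "Rank of Doctor by Hospital_2": [2, 1],
})

# Position-wise product of the two column groups (labels are ignored)
arr = (df.filter(like="Rank of Hospital by").to_numpy()
       * df.filter(like="Rank of Doctor by").to_numpy())

# Wrap the array in a DataFrame with the same index and attach it to the original
out = df.join(pd.DataFrame(arr, index=df.index).add_prefix("Multiplied "))
print(out)
```

If one of the groups has to be transposed first, as described in the question, transpose that array (for example with `.T`) before multiplying.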
df = pd.DataFrame({"A":[1,2,3,4,5], "B":[6,7,8,9,10]})
df["mul"] = df["A"] * df["B"]
print(df)
Output:
A B mul
0 1 6 6
1 2 7 14
2 3 8 24
3 4 9 36
4 5 10 50
If I understood the question correctly, I think you overcomplicated it quite a bit.
You can just create another column and tell pandas to set it to the first column multiplied by the second column.
A case closer to yours, with more than two columns:
df = pd.DataFrame({"A":[1,2,3,4,5], "B":[6,7,8,9,10], "C":[11,12,13,14,15]})
df["mul"] = df["A"] * df["B"] * df["C"]

Pandas reindexing task based on a column value

I have a dataframe with millions of rows, unique indexes, and a column ('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the indexes ("old_index1,old_index2") where 'b' had duplicated values but remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged like in a keep=first strategy. Example below.
Input dataframe:
df = pd.DataFrame(data = [[1,"non_duplicated_1"],
[2,"duplicated"],
[2,"duplicated"],
[3,"non_duplicated_2"],
[4,"non_duplicated_3"]],
index=['one','two','three','four','five'],
columns=['a','b'])
desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult...Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b',inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
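Put together as a runnable sketch on the question's sample data (this version avoids inplace=True and assigns the joined labels through to_numpy(); both are stylistic choices rather than requirements):

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, "non_duplicated_1"],
          [2, "duplicated"],
          [2, "duplicated"],
          [3, "non_duplicated_2"],
          [4, "non_duplicated_3"]],
    index=["one", "two", "three", "four", "five"],
    columns=["a", "b"],
)

# Every row gets the comma-joined labels of all rows sharing its 'b' value
joined = df.reset_index().groupby("b")["index"].transform(",".join)
df.index = joined.to_numpy()

# Keep the first row of each 'b' group; its label already names all members
result = df.drop_duplicates("b")
print(result)
```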
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before using groupby, although it's unclear to me why you want this:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4
