I have been searching for an answer to my question for a while, and have not been able to find anything that produces my desired result.
The problem is this: I have a dataframe with two rows that I want to merge into a single row dataframe that has multi-level columns. Using my example below (which I drafted in excel to better visualize my desired output), I want the new DF to have a multicolumn index with the first level being based on the original columns A-C, then add a new column sub level based on the values from the original 'Name' column. It is quite possible i'm incorrectly using existing functions. If you could provide me with your simplest way of altering the dataframe, I would greatly appreciate it!
Code to construct current df:
import pandas as pd
df = pd.DataFrame([['Alex',1,2,3],['Bob',4,5,6]],columns='Name A B
C'.split())
Image of current df with desired output:
Using set_index + unstack
df.set_index('Name').unstack().to_frame().T
Out[198]:
A B C
Name Alex Bob Alex Bob Alex Bob
0 1 4 2 5 3 6
Related
I feel like this may be a really easy question but I can't figure it out I have a data frame that looks like this
one two three
1 2 3
2 3 3
3 4 4
The third column has duplicates if I want to keep the first row but drop the second row because there is a duplicate on row two how would I do this.
Pandas DataFrame objects have a method for this; assuming df is your dataframe, df.drop_duplicates(subset='name_of_third_column') returns the dataframe with any rows containing duplicate values in the third column removed.
I have the same dataset but over different weeks (so later weeks contain new rows). I want to append the new rows to the original dataframe to create one big dataframe with all unique rows and no duplicates. I can't just take the last week because some get deleted over the weeks.
I tried to use the following code but somehow my final_info dataframe still contains some non-unique values
final_info = data[list(data.keys())[-1]]['all_info']
for week in reversed(data.keys()):
df_diff = pd.concat([data[week]['all_info'],final_info]).drop_duplicates(subset='project_slug',
keep=False)
final_info = final_info.append(df_diff).reset_index(drop=True)
Does somebody see where it goes wrong?
if I understand your question, you are just trying to add the unique rows from one dataframe to another dataframe. I don't think there is any need to iterate through the keys like you are doing. There is an example on this question that I think can help you and i think it is conceptually easier to follow 1. I'll try to walk through an example to make it more clear.
So if you have a dataframe A:
col1 col2
1 2
2 3
3 4
and a dataframe B:
col1 col2
1 2
2 3
6 4
These two dataframes have the same first two rows but have different last rows. If you wanted to get all the unique rows into one dataframe you could first get all the unique rows from just one of the dataframes. So for this example you could get the unique row in dataframe B, lets call it df_diff in this example. The code to do this would be
df_diff = B[~B.col1.isin(A.col1)]
output: col1 col2
6 4
This above line of code makes whats called a boolean mask and then negates using ~ so that you get all rows in dataframe B where the col1 value is not in dataframe A.
You could then merge this dataframe, df_diff, with the first dataframe A. We can call this df_full. This step is done with:
df_full = pd.concat([A, df_diff], ignore_index=True)
The ignore_index=True just resets the index of the resulting dataframe. This will give you:
col1 col2
1 2
2 3
3 4
6 4
Now the above dataframe has the new row in dataframe B plus the original rows from dataframe A.
I think this would work for your situation and may be less lines of code.
I have a dataframe called Bob with Columns = [A,B] and A has only unique values like a serial ID. Shape is (100,2)
I have another dataframe called Anna with Columns [C,D,E,F] where C has the same values as A in bob but there are duplicates. Column D is a category (phone/laptop/ipad) that is defined by the serial ID found in C. Shape of Anna is (500,4).
Example of row in anna:
A B C D
K103 phone 12 17
K103 phone 14 23
G221 laptop 25 6
I want to create a new dataframe that has columns A,B,D by searching for value of A in anna[C]. The final dataframe should be shape (100,3)
I'm finding this difficult with pd.merge (i tried left/inner/right joins) because it keeps creating 2 rows in the new dataframe with same values i.e. K103 will show up 2x in the new dataframe.
Tell me if this works, I'm thinking of this while typing it, so I couldn't actually check.
df = Bob.merge(Anna[['C','D'].drop_duplicates(keep='last'),how='left',left_on='A',right_on='C']
Let me know if it doesn't work, I'll create a sample dataset and edit it with the correct code.
I am trying to multiply a few specific columns by a portion of multiple rows and creating a new column from every result. I could not really find an answer to my question in previous stackoverflow questions or on google, so maybe one of you can help.
I would like to point out that I am quite the beginner in Python, so apologies ahead for any obvious questions or strange code.
This is how my DataFrame currently looks like:
So, for the column Rank of Hospital by Doctor_1, I want to multiply all its numbers by the values of the first row of column Rank of Doctor by Hospital_1 until column Rank of Doctor by Hospital_10. Which would result in:
1*1
2*1
3*1
4*4
...
and so on.
I want to do this for every Doctor_ column. So for Doctor_2 its values should be multiplied by the second row of all those ten columns (Rank of Doctor by Hospital_. Doctor_3, multiplied by the third row etc.
So far, I have transposed the Rank of Doctor by Hospital_ columns in a new DataFrame:
and tried to multiply this by a DataFrame of the Rank of Hospital by Doctor_ columns. Here the first column of the first df should be multiplied by the first column of the second df. (and second column * second column, etc.):
But my current formula
preferences_of_doctors_and_hospitals_doctors_ranking.mul(preferences_of_doctors_and_hospitals_hospitals_ranking_transposed)
is obviously not working:
Does anybody know what I am doing wrong and how I could fix this? Maybe I could write a for loop so that a new column is created for every multiplication of columns? So Multiplication_column_1 of DF3 = Column 1 of DF1 * Column 1 of DF2 and Multiplication_column_2 of DF3 = Column 2 of DF1 * Column 2 of DF2.
Thank you in advance!
Jeff
You can multiple 2d arrays created by filtering column with filter and values first:
arr = df.filter(like='Rank of Hospital by').values * df.filter(like='Rank of Doctor by').values
Or:
arr = (preferences_of_doctors_and_hospitals_doctors_ranking.values *
preferences_of_doctors_and_hospitals_hospitals_ranking_transposed.values)
Notice - necessary is same ordering of columns, same length of columns names and index in both filtered DataFrames.
Get 2d array, so create DataFrame by constructor and join to original:
df = df.join(pd.DataFrame(arr, index=df.index).add_prefix('Multiplied '))
df = pd.DataFrame({"A":[1,2,3,4,5], "B":[6,7,8,9,10]})
df["mul"] = df["A"] * df["B"]
print(df)
Output:
A B mul
0 1 6 6
1 2 7 14
2 3 8 24
3 4 9 36
4 5 10 50
If I understood the question correctly I think you way over complicated it.
You can just create another column telling pandas to give it the value of first column multiplied by second column.
More similar to your specific case with more than 2 columns:
df = pd.DataFrame({"A":[1,2,3,4,5], "B":[6,7,8,9,10], "C":[11,12,13,14,15]})
df["mul"] = df["A"] * df["B"] * df["C"]
I have a dataframe with millions of rows with unique indexes and a column('b') that has several repeated values.
I would like to generate a dataframe without the duplicated data but I do not want to lose the index information. I want the new dataframe to have an index that is a concatenation of the indexes ("old_index1,old_index2") where 'b' had duplicated values but remains unchanged for rows where 'b' had unique values. The values of the 'b' column should remain unchanged like in a keep=first strategy. Example below.
Input dataframe:
df = pd.DataFrame(data = [[1,"non_duplicated_1"],
[2,"duplicated"],
[2,"duplicated"],
[3,"non_duplicated_2"],
[4,"non_duplicated_3"]],
index=['one','two','three','four','five'],
columns=['a','b'])
desired output:
a b
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
The actual dataframe is quite large so I would like to avoid non-vectorized operations.
I am finding this surprisingly difficult...Any ideas?
You can use transform on the index column (after you use reset_index). Then, drop duplicates in column b:
df.index = df.reset_index().groupby('b')['index'].transform(','.join)
df.drop_duplicates('b',inplace=True)
>>> df
a b
index
one 1 non_duplicated_1
two,three 2 duplicated
four 3 non_duplicated_2
five 4 non_duplicated_3
Setup
dct = {'index': ','.join, 'a': 'first'}
You can reset_index before using groupby, although it's unclear to me why you want this:
df.reset_index().groupby('b', as_index=False, sort=False).agg(dct).set_index('index')
b a
index
one non_duplicated_1 1
two,three duplicated 2
four non_duplicated_2 3
five non_duplicated_3 4