Pandas dataframe on python

I feel like this may be a really easy question, but I can't figure it out. I have a data frame that looks like this:
one  two  three
1    2    3
2    3    3
3    4    4
The third column has duplicates. If I want to keep the first row but drop the second row, because its value in the third column is a duplicate, how would I do this?

Pandas DataFrame objects have a method for this: assuming df is your dataframe, df.drop_duplicates(subset='name_of_third_column') returns the dataframe with later rows that repeat a third-column value removed (by default, keep='first', the first occurrence of each value is kept).
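For the frame in the question, a quick demonstration (a minimal sketch; the column names come from the example above):

import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [2, 3, 4], 'three': [3, 3, 4]})

# keep='first' is the default: the first occurrence of each value in 'three'
# survives, so the second row (the duplicate) is dropped
print(df.drop_duplicates(subset='three'))
#    one  two  three
# 0    1    2      3
# 2    3    4      4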

Related

Appending only rows that are not yet in a pandas dataframe

I have the same dataset but over different weeks (so later weeks contain new rows). I want to append the new rows to the original dataframe to create one big dataframe with all unique rows and no duplicates. I can't just take the last week because some get deleted over the weeks.
I tried to use the following code, but somehow my final_info dataframe still contains some non-unique values:
final_info = data[list(data.keys())[-1]]['all_info']
for week in reversed(data.keys()):
    df_diff = pd.concat([data[week]['all_info'], final_info]).drop_duplicates(
        subset='project_slug', keep=False)
    final_info = final_info.append(df_diff).reset_index(drop=True)
Does somebody see where it goes wrong?
If I understand your question, you are just trying to add the unique rows from one dataframe to another dataframe. I don't think there is any need to iterate through the keys the way you are doing. There is an example in a related question that I think can help you, and it is conceptually easier to follow. I'll try to walk through an example to make it clearer.
So if you have a dataframe A:
col1  col2
1     2
2     3
3     4
and a dataframe B:
col1  col2
1     2
2     3
6     4
These two dataframes have the same first two rows but different last rows. If you wanted to get all the unique rows into one dataframe, you could first get the unique rows from just one of them. For this example, take the unique row in dataframe B and call it df_diff. The code to do this is:
df_diff = B[~B.col1.isin(A.col1)]
output:
col1  col2
6     4
This line of code builds what's called a boolean mask and then negates it using ~, so you get all rows in dataframe B whose col1 value is not in dataframe A.
You could then merge this dataframe, df_diff, with the first dataframe A. We can call this df_full. This step is done with:
df_full = pd.concat([A, df_diff], ignore_index=True)
The ignore_index=True just resets the index of the resulting dataframe. This will give you:
col1  col2
1     2
2     3
3     4
6     4
Now the above dataframe has the new row from dataframe B plus the original rows from dataframe A.
I think this would work for your situation and may take fewer lines of code.
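Putting the steps together, a minimal end-to-end sketch of the approach described above:

import pandas as pd

A = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
B = pd.DataFrame({'col1': [1, 2, 6], 'col2': [2, 3, 4]})

# boolean mask, negated with ~: rows of B whose col1 value is absent from A
df_diff = B[~B.col1.isin(A.col1)]

# stack A and the new rows, renumbering the index from 0
df_full = pd.concat([A, df_diff], ignore_index=True)
print(df_full)
#    col1  col2
# 0     1     2
# 1     2     3
# 2     3     4
# 3     6     4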

I have to compare two large dataframes, how can I do it using multiprocessing in python?

One row of one dataframe should be compared with all rows of the other dataframe, and for each row of the second dataframe it should print the names of the columns whose values are equal.
e.g.:
a=[['apple','cotton','pineapple']]
b=[['apple','lemon','pineapple'],['apple','cotton','mango'],['grapes','cotton','pineapple']]
Consider a as a dataframe with one row and 3 columns, and b as a dataframe with 3 rows and 3 columns. My output while comparing the first row of a with b should be:
0 2
0 1
1 2
0 is the name of the first column, 1 is the name of the second column, and 2 is the name of the third column.
The actual problem has millions of rows, so how can I do it using multiprocessing?
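No answer is shown for this question, but one possible sketch: compare each row of a against all of b at once with NumPy broadcasting, and fan the rows of a out over a multiprocessing.Pool. Everything below (the compare_row helper, the exact output format) is an illustrative assumption, not code from the thread:

from multiprocessing import Pool

import numpy as np
import pandas as pd

a = pd.DataFrame([['apple', 'cotton', 'pineapple']])
b = pd.DataFrame([['apple', 'lemon', 'pineapple'],
                  ['apple', 'cotton', 'mango'],
                  ['grapes', 'cotton', 'pineapple']])

def compare_row(row_values):
    # Compare one row of `a` against every row of `b` at once; for each row
    # of `b`, collect the positions of the columns whose values are equal.
    matches = b.to_numpy() == row_values   # boolean matrix, shape (len(b), n_cols)
    return [np.flatnonzero(m).tolist() for m in matches]

if __name__ == '__main__':
    rows = [row.to_numpy() for _, row in a.iterrows()]
    with Pool() as pool:                   # one task per row of `a`
        results = pool.map(compare_row, rows)
    for match in results[0]:               # matches for the first row of `a`
        print(*match)                      # prints: 0 2 / 0 1 / 1 2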

How to add a pandas Series to a DataFrame ignoring indices?

I have a DataFrame with random, unsorted row indices, which is a result of removing some 'noise' from the original DataFrame.
row_index  col1  col2
2          1     2
19         3     4
432        4     1
I would like to add a pd.Series to this DataFrame. The Series has its indices sorted from 0 to n, where n is the number of rows, and it has the same number of rows as the DataFrame.
Having tried multiple ways of adding the Series to my DataFrame, I realized that the data from the Series gets mixed up because (I believe) pandas is matching records by their indices.
Is there a way I can add the Series to the Dataframe, ignoring the indices, so that my data doesn't get mixed up?
Convert the series into a data frame, then reset the index of the target data frame so the rows line up positionally rather than by label:
df = pd.DataFrame(s)  # s is the Series you want to add
result = pd.concat([df1.reset_index(drop=True), df], axis=1)
Here df1 is the data frame you want to add to, and df is the series you converted to a data frame.
Alternatively, assign the underlying values directly; .values strips the Series index, so no alignment takes place:
df['new_col'] = other_df['column'].values
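To see why the .values assignment sidesteps alignment, a quick demonstration with the frame from the question (the series contents are assumed for illustration):

import pandas as pd

df1 = pd.DataFrame({'col1': [1, 3, 4], 'col2': [2, 4, 1]}, index=[2, 19, 432])
s = pd.Series(['a', 'b', 'c'])  # indices 0..2, same length as df1

# .values strips the Series index, so the assignment is purely positional
df1['new_col'] = s.values
print(df1)
#      col1  col2 new_col
# 2       1     2       a
# 19      3     4       b
# 432     4     1       c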

Efficient method to append only the new rows from pandas dataframe to a table in database

Suppose I have a table X in MySQL as follows
A  B
1  2
3  4
5  6
and I have a dataframe df as follows
A  B
1  2
5  6
7  8
9  10
I want to append to X only the new rows from df (rows that are in df but not in X). The result should be:
A  B
1  2
3  4
5  6
7  8
9  10
Note that sorting does not matter to me. Currently, what I do is:
1. Read table X and store it in a dataframe called dfx.
2. Concat df and dfx.
3. Drop duplicate rows.
4. Insert the result back into table X with to_sql(if_exists='replace').
However, I think this is not good practice, particularly when table X is very large. May I have your suggestions for a better way? Thank you.
If you have a unique index in your table that prevents inserting duplicate records (the primary key should do the job), then using INSERT IGNORE instead of INSERT will be enough: duplicate records will be silently discarded instead of generating an error.
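With pandas, the INSERT IGNORE route can be wired into to_sql through its method hook. A minimal sketch, assuming table X already has a primary key covering the deduplicating columns; the insert_ignore name and the connection string are placeholders:

import pandas as pd
from sqlalchemy import create_engine

def insert_ignore(table, conn, keys, data_iter):
    # pandas hands this hook a SQLAlchemy table wrapper; prefixing the
    # INSERT with IGNORE makes MySQL silently skip duplicate-key rows
    stmt = table.table.insert().prefix_with('IGNORE')
    conn.execute(stmt, [dict(zip(keys, row)) for row in data_iter])

engine = create_engine('mysql+pymysql://user:password@localhost/dbname')
# df is the dataframe from the question
df.to_sql('X', engine, if_exists='append', index=False, method=insert_ignore)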
If your indices are unique (or one of the columns, say A, is), you could:
1. make a list of the indices (or of the unique column's values) from the dataframe,
2. query MySQL with the list and find the ones that do not already exist in the table,
3. subset the dataframe based on the new indices or column values, and insert.
You will have to use something like sqlalchemy for (2). (3) can be done easily using DataFrame.query, e.g. df.query("A == @list_of_new_values"), where list_of_new_values is a Python list with the new values for column A. A sketch of the whole procedure follows.
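A minimal sketch, assuming column A is unique and a SQLAlchemy engine is configured; step (2) is simplified here to reading the key column in full:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/dbname')  # placeholder DSN

# (2) fetch the keys that already exist in table X
existing = set(pd.read_sql_query('SELECT A FROM X', engine)['A'])

# (3) keep only the genuinely new rows and append them
list_of_new_values = [v for v in df['A'] if v not in existing]
df.query('A == @list_of_new_values').to_sql('X', engine, if_exists='append', index=False)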

PANDAS: Convert 2 row df to single row multilevel column df

I have been searching for an answer to my question for a while and have not been able to find anything that produces my desired result.
The problem is this: I have a dataframe with two rows that I want to merge into a single-row dataframe with multi-level columns. Using my example below (which I drafted in Excel to better visualize my desired output), I want the new DF to have a multi-level column index whose first level is based on the original columns A-C, with a new column sub-level based on the values from the original 'Name' column. It is quite possible I'm incorrectly using existing functions. If you could show me your simplest way of altering the dataframe, I would greatly appreciate it!
Code to construct current df:
import pandas as pd
df = pd.DataFrame([['Alex', 1, 2, 3], ['Bob', 4, 5, 6]],
                  columns='Name A B C'.split())
[Image of current df with desired output]
Using set_index + unstack:
df.set_index('Name').unstack().to_frame().T
Out[198]:
        A          B          C
Name Alex  Bob  Alex  Bob  Alex  Bob
0       1    4     2    5     3    6
