Combining data frames with the same columns - Python

I have a data frame that looks like this:
A B C
1 4 7
2 5 8
3 6 9
And also another data frame that looks like this:
A B C
2 1 7
4 3 9
6 5 8
How can I combine those two data frames to get a new data frame that looks like this?
A B C
1 4 7
2 5 8
3 6 9
2 1 7
4 3 9
6 5 8
Basically, the two data frames have the same column names and the same number of columns; I just want to combine all of the rows. I would prefer to use pandas for this.

Check with append. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so this only works on older versions:
df1 = df1.append(df2)
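On current pandas, pd.concat is the replacement. A minimal sketch, assuming df1 and df2 are the two frames above:
import pandas as pd

# Stack the rows of df2 under df1; ignore_index=True renumbers the
# result 0..n-1 instead of keeping each frame's original index.
df1 = pd.concat([df1, df2], ignore_index=True)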


Python: dataframe pivot with duplicate labels

I have a pandas DataFrame as below:
index ColumnName ColumnValue
0 A 1
1 B 2
2 C 3
3 A 4
4 B 5
5 C 6
6 A 7
7 B 8
8 C 9
I want output like below as a pandas DataFrame:
A B C
1 2 3
4 5 6
7 8 9
Can anyone suggest how I can achieve the desired output?
Regards,
Vipul
The first solution that came to my mind is to use a for loop over the unique values of ColumnName, as below. If you want to achieve it with a pivot method instead, see the sketch after the output.
columns = df['ColumnName'].unique()
data = {}
for column in columns:
    data[column] = list(df[df['ColumnName'] == column]['ColumnValue'])
pd.DataFrame(data)
which will give you the below output
A B C
0 1 2 3
1 4 5 6
2 7 8 9
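If you prefer the pivot route, here is a sketch under the same assumptions as the loop above (df has ColumnName and ColumnValue columns, and each name repeats the same number of times); the helper column name 'row' is just one I've introduced:
# Number each occurrence of a name: 0, 1, 2 for A, B and C here,
# then turn that counter into the row index of the pivoted frame.
df['row'] = df.groupby('ColumnName').cumcount()
out = df.pivot(index='row', columns='ColumnName', values='ColumnValue')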

How to concat two Python DataFrames where if the row already exists it doesn't add it; if not, append it

I'm pretty new to Python.
I am trying to concat two dataframes (df1, df2) where if a row already exists in df1 then it is not added; if not, it is added to df1.
I don't want to use .concat().drop_duplicates() because I don't want duplicates within the same DataFrame to be removed.
Backstory:
I have multiple CSV files that are exported from a piece of software in different locations. Once in a while I want to merge these into one file. The problem is that each exported file contains the same data as before, along with the new records made in that period of time, so I need to check whether each record is already there, since I will be executing the same code on every export.
For the sake of example:
import pandas as pd
main_df = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[2,5,4,5],[9,8,7,6],[8,5,6,7]])
df1 = pd.DataFrame([[1,2,3,4],[1,2,3,4],[4,2,5,1],[2,4,1,5],[1,5,4,8],[7,3,5,7],[4,3,8,5],[4,3,8,5]])
main_df
0 1 2 3
0 1 2 3 4 --duplicates I want to include--
1 1 2 3 4 --duplicates I want to include--
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
df1
0 1 2 3
0 1 2 3 4 --duplicates I want to exclude--
1 1 2 3 4 --duplicates I want to exclude--
2 4 2 5 1 --duplicates I want to exclude--
3 2 4 1 5 --duplicates I want to exclude--
4 1 5 4 8
5 7 3 5 7
6 4 3 8 5 --duplicates I want to include--
7 4 3 8 5 --duplicates I want to include--
I need the end result to be
main_df (after code execution)
0 1 2 3
0 1 2 3 4
1 1 2 3 4
2 4 2 5 1
3 2 4 1 5
4 2 5 4 5
5 9 8 7 6
6 8 5 6 7
7 1 5 4 8
8 7 3 5 7
9 4 3 8 5
10 4 3 8 5
I hope I have explained my issue in a clear way. Thank you.
Check for every row in df1 whether it exists in main_df using pandas apply, and turn that into a mask by negating it with the ~ operator. I like using functools.partial to make it explicit that we are comparing against main_df.
import pandas as pd
from functools import partial

main_df = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 2, 5, 1],
    [2, 4, 1, 5],
    [2, 5, 4, 5],
    [9, 8, 7, 6],
    [8, 5, 6, 7],
])
df1 = pd.DataFrame([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 2, 5, 1],
    [2, 4, 1, 5],
    [1, 5, 4, 8],
    [7, 3, 5, 7],
    [4, 3, 8, 5],
    [4, 3, 8, 5],
])

# True if any row of df equals the given row across all columns.
def has_row(df, row):
    return (df == row).all(axis=1).any()

main_df_has_row = partial(has_row, main_df)
duplicate_rows = df1.apply(main_df_has_row, axis=1)
df1_add = df1.loc[~duplicate_rows]
pd.concat([main_df, df1_add])
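As an alternative sketch that avoids the per-row apply, a left merge with indicator=True can mark which df1 rows already occur in main_df (this assumes rows should match on every column, as in the example):
# Deduplicate main_df first so the merge cannot multiply matching rows,
# then keep only the df1 rows that found no partner in main_df.
new_rows = (
    df1.merge(main_df.drop_duplicates(), how='left', indicator=True)
       .query("_merge == 'left_only'")
       .drop(columns='_merge')
)
pd.concat([main_df, new_rows], ignore_index=True)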

Transpose in Pandas

I imported the data from a CSV file with pandas. I want to split the column which includes 50 values (0 to 49) into 5 rows, each having ten values. Can anyone tell me how I can do this transpose in the form of a pandas DataFrame?
Let me rephrase what I said:
I attached the data that I have. I wanted to select the second column and split it into two rows, each having 10 values.
This is the code I have so far (I couldn't get a picture of all 50 rows, so I have only put 20 rows):
import numpy as np
import pandas as pd
df = pd.read_csv('...csv')
df.iloc[:50,:2]
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(dict(mycolumn=np.random.randint(10, size=50)))
Using NumPy and reshaping, ignoring the index:
pd.DataFrame(df.mycolumn.values.reshape(5, -1))
0 1 2 3 4 5 6 7 8 9
0 0 2 7 3 8 7 0 6 8 6
1 0 2 0 4 9 7 3 2 4 3
2 3 6 7 7 4 5 3 7 5 9
3 8 7 6 4 7 6 2 6 6 5
4 2 8 7 5 8 4 7 6 1 5
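If you'd rather fix the row length than the row count, -1 lets NumPy infer the other dimension; this assumes the column's length is an exact multiple of 10:
# Same result as above: 10 values per row, row count inferred.
pd.DataFrame(df.mycolumn.values.reshape(-1, 10))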

Deleting multiple DataFrame columns in Pandas

Say I want to delete a set of adjacent columns in a DataFrame and my code looks something like this currently:
del df['1'], df['2'], df['3'], df['4'], df['5'], df['6']
This works, but I was wondering if there was a more efficient, compact, or aesthetically pleasing way to do it, such as:
del df['1','6']
I think you need drop; to select the columns to remove you can use range or numpy.arange:
import numpy as np
import pandas as pd

df = pd.DataFrame({'1':[1,2,3],
                   '2':[4,5,6],
                   '3':[7,8,9],
                   '4':[1,3,5],
                   '5':[7,8,9],
                   '6':[1,3,5],
                   '7':[5,3,6],
                   '8':[5,3,6],
                   '9':[7,4,3]})
print (df)
1 2 3 4 5 6 7 8 9
0 1 4 7 1 7 1 5 5 7
1 2 5 8 3 8 3 3 3 4
2 3 6 9 5 9 5 6 6 3
print (np.arange(1,7))
[1 2 3 4 5 6]
print (range(1,7))
range(1, 7)
#convert string column names to int
df.columns = df.columns.astype(int)
df = df.drop(np.arange(1,7), axis=1)
#another solution with range
#df = df.drop(range(1,7), axis=1)
print (df)
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
You can do this without modifying the columns, by passing a slice object to drop:
In [29]:
df.drop(df.columns[slice(df.columns.tolist().index('1'),df.columns.tolist().index('6')+1)], axis=1)
Out[29]:
7 8 9
0 5 5 7
1 3 3 4
2 6 6 3
This looks up the ordinal positions of the lower and upper column end points and uses them to build a slice object against the columns array.
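A shorter label-based sketch along the same lines, assuming the column labels are unique and '1' comes before '6' in the column order:
# .loc label slicing on columns is inclusive of both end points,
# so this selects columns '1' through '6' and drops them.
df.drop(columns=df.loc[:, '1':'6'].columns)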

Comparing columns of different pandas dataframes

I'm new to Python and using pandas dataframes to store and work with a large dataset.
I'm interested in knowing whether it's possible to compare values between dataframes of similarly named columns. For example, the functionality I'm after would be similar to comparing the column 'A' in this dataframe:
A
0 9
1 9
2 5
3 8
4 7
5 9
6 2
7 2
8 5
9 7
to the column 'A' in this one:
A
0 6
1 3
2 7
3 8
4 2
5 5
6 1
7 8
8 4
9 9
Then, for each row I would determine which of the two 'A' values is smaller and add it to, say, a new column in the first dataframe called 'B':
A B
0 9 6
1 9 3
2 5 5
3 8 8
4 7 2
5 9 5
6 2 1
7 2 2
8 5 4
9 7 7
I'm aware of the pandas.DataFrame.min method, but as I understand it this will only locate the smallest value within one column and can't be used to compare columns of different dataframes. I'm not sure of any other ways in which this functionality could be achieved.
Any suggestions for solving this (probably) very simple question would be much appreciated! Thank you.
You can use numpy.minimum():
import numpy as np
df1['B'] = np.minimum(df1.A, df2.A)
Or use Series.where() to replace values:
df1['B'] = df1['A'].where(df1.A < df2.A, df2.A)
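If you ever need the row-wise minimum across more than two frames, a concat-based sketch generalizes this, assuming the frames share the same row index:
import pandas as pd

# Line the 'A' columns up side by side, then take the minimum per row.
df1['B'] = pd.concat([df1['A'], df2['A']], axis=1).min(axis=1)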
