Appending two dataframes with same columns, different order - python

I have two pandas dataframes.
import pandas as pd
from pandas import DataFrame

noclickDF = DataFrame([[0, 123, 321], [0, 1543, 432]],
                      columns=['click', 'id', 'location'])
clickDF = DataFrame([[1, 123, 421], [1, 1543, 436]],
                    columns=['click', 'location', 'id'])
I simply want to join such that the final DF will look like:
click  id    location
0      123   321
0      1543  432
1      421   123
1      436   1543
As you can see, the column names of both original DFs are the same, just not in the same order. Also, there is no key column to join on.

You could also use pd.concat:
In [36]: pd.concat([noclickDF, clickDF], ignore_index=True)
Out[36]:
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543
Under the hood, DataFrame.append calls pd.concat.
DataFrame.append has code for handling various types of input, such as Series, tuples, lists and dicts. If you pass it a DataFrame, it passes straight through to pd.concat, so using pd.concat is a bit more direct.
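As a quick illustration of that equivalence (a minimal sketch; the variable names follow the question, and the append call only runs on pandas < 2.0, where DataFrame.append still existed):
via_concat = pd.concat([noclickDF, clickDF], ignore_index=True)
via_append = noclickDF.append(clickDF, ignore_index=True)  # pandas < 2.0 only
print(via_concat.equals(via_append))  # True: both align columns by name before stacking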

For future users (pandas 0.23.0 and later):
You may also need to add sort=True to sort the non-concatenation axis when it is not already aligned (i.e. to retain the OP's desired concatenation behavior). The code contributed above raised a warning for me (see Python Pandas User Warning). The code below works and does not raise the warning.
In [36]: pd.concat([noclickDF, clickDF], ignore_index=True, sort=True)
Out[36]:
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543

You can use append for that (note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on newer versions use pd.concat as shown above):
df = noclickDF.append(clickDF)
print(df)
click id location
0 0 123 321
1 0 1543 432
0 1 421 123
1 1 436 1543
and if you need to, you can reset the index with
df = df.reset_index(drop=True)
print(df)
click id location
0 0 123 321
1 0 1543 432
2 1 421 123
3 1 436 1543

Related

Create a DataFrame from df1 and df2, taking the value from df2 when it is missing in df1

df1 = pd.DataFrame({'call_sign': ['ASD','BSD','CDSF','GFDFD','FHHH'],'frn':['123','124','','656','']})
df2 = pd.DataFrame({'call_sign': ['ASD','CDSF','BSD','GFDFD','FHHH'],'frn':['1234','','124','','765']})
I need to get a new df like:
df2 = pd.DataFrame({'call_sign': ['ASD','BSD','CDSF','GFDFD','FHHH'],'frn':['123','','124','656','765']})
I need to take frn from df2 when it's missing in df1 and create a new df.
Replace empty strings with missing values and use DataFrame.set_index with DataFrame.fillna; because the result needs to follow the ordering of df2.call_sign, add DataFrame.reindex:
import numpy as np

df = (df1.set_index('call_sign').replace('', np.nan)
         .fillna(df2.set_index('call_sign').replace('', np.nan))
         .reindex(df2['call_sign'])
         .reset_index())
print(df)
call_sign frn
0 ASD 123
1 CDSF NaN
2 BSD 124
3 GFDFD 656
4 FHHH 765
If you want to update df2 you can use boolean indexing:
# is frn empty string?
m = df2['frn'].eq('')
# update those rows from the value in df1
df2.loc[m, 'frn'] = df2.loc[m, 'call_sign'].map(df1.set_index('call_sign')['frn'])
Updated df2:
call_sign frn
0 ASD 1234
1 CDSF
2 BSD 124
3 GFDFD 656
4 FHHH 765
Another option is to merge and then pick the non-empty value:
# merge df1 and df2 on call_sign, then keep df1's frn unless it is an empty string
temp = df1.merge(df2, how='left', on='call_sign')
df1['frn'] = temp.frn_x.where(temp.frn_x != '', temp.frn_y)
call_sign frn
0 ASD 123
1 BSD
2 CDSF 124
3 GFDFD 656
4 FHHH 765

Python Groupby & Counting first in Series, sorting by month

I have a pandas dataframe (this is an example, actual dataframe is a lot larger):
data = [['345', 1, '2022_Jan'], ['678', 1, '2022_Jan'], ['123', 1, '2022_Feb'], ['123', 1, '2022_Feb'], ['345', 0, '2022_Mar'], ['678', 1, '2022_Mar'], ['901', 0, '2022_Mar'], ['678', 1, '2022_Mar']]
df = pd.DataFrame(data, columns = ['ID', 'Error Count', 'Year_Month'])
The question I want answered is: How many IDs have errors?
I want to get an output that groups by 'Year_Month' and counts 1 for each ID that occurs in each month. In other words, I want to count only 1 for each ID in a single month.
 
When I group by 'Year_Month' & 'ID' with df.groupby(['Year_Month', 'ID']).count()
it gives me the following output (current output linked below) with the total Error Count for each ID, but I only want to count each ID once. I also want the Year_Month to be ordered chronologically; I'm not sure why it isn't, since my original dataframe is ordered by month in the Year_Month column.
My current output and desired output (screenshots in the original post)
Are these actually duplicate records? Are you sure you don't want to record that user 123 had two errors in February?
If they really are duplicates, drop them first, then group by and sum over Error Count. Note that .count() simply counts non-null entries per column, which is not what you want here:
df.drop_duplicates(["ID", "Year_Month"]) \
.groupby(["Year_Month", "ID"])["Error Count"] \
.sum()
Output:
In [3]: counts = df.drop_duplicates(["ID", "Year_Month"]) \
...: .groupby(["Year_Month", "ID"])["Error Count"] \
...: .sum()
In [4]: counts
Out[4]:
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int64
As far as sorting, you'd want to convert "Year_Month" to a datetime object, because right now they're just being sorted as strings:
In [5]: "2022_Feb" < "2022_Jan"
Out[5]: True
Here's how you could do that:
In [6]: counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b"))
Out[6]:
Year_Month ID
2022_Jan 345 1
678 1
2022_Feb 123 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int64
Here's one way:
(df
 .groupby(['Year_Month', 'ID'])   # group by the two columns
 .sum()['Error Count']            # aggregate the sum over Error Count
 .apply(lambda x: int(bool(x)))   # convert to boolean and back to int
 .to_frame('Error Count')         # add the name back to the applied column
)
Error Count
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Here is another way to do that
Converting the sum to a boolean with astype(bool) to return True or False, based on values being 0 or non-zero, and then to 0 or 1 with astype(int)
df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)
Year_Month ID
2022_Feb 123 1
2022_Jan 345 1
678 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int32
To sort, assign the outcome to a variable and then apply ddejohn's solution above to sort the index.
counts = df.groupby(['Year_Month','ID'])['Error Count'].sum().astype(bool).astype(int)
counts.sort_index(level=0, key=lambda ym: pd.to_datetime(ym, format="%Y_%b")) # ddejohn: https://stackoverflow.com/a/71927886/3494754 answer above
Year_Month ID
2022_Jan 345 1
678 1
2022_Feb 123 1
2022_Mar 345 0
678 1
901 0
Name: Error Count, dtype: int32
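If you then want to answer the original question ("How many IDs have errors?") per month, you can sum this 0/1 indicator over the ID level; a minimal sketch, assuming counts is the Series built just above:
ids_with_errors = counts.groupby(level='Year_Month').sum()
print(ids_with_errors)
# Year_Month
# 2022_Feb    1
# 2022_Jan    2
# 2022_Mar    1
(The months still sort as strings here; apply the sort_index trick above if you need chronological order.)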

how to concat 2 dataframes, keeping rows with same index

I have 2 dataframes that I want to concat while keeping the column order of df1:
df1:
index unnamed:0 unnamed:1 unnamed:393 unnamed:395
0 nan 1 ... 394 396
1 0 BB CC DD
df2:
index Service
220 ABC
222 ABB
394 CC
396 DD
....
the output should be like:
df3:
index
0 394 396
1 394 396
2 CC DD
3 CC DD
If I simply do df3 = pd.concat([df1, df2]) it just appends df2 as a whole after df1.
It is almost the same when using df3 = pd.concat([df1, df2], axis=1, ignore_index=True).
I think there is some issue with multi-indexing, but I don't know what to type.
Thanks
You can use df1.join(df2), which aligns the two frames on their index.
(The outcome was shown as a screenshot in the original answer.)
See the pandas documentation for DataFrame.join, and the "Merge, join, concatenate and compare" user guide for more information about all the kinds of merging operations in pandas.
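As a minimal sketch of what DataFrame.join does (toy frames; the names and values here are made up rather than the OP's data): join aligns rows on the index and appends the other frame's columns.
import pandas as pd

left = pd.DataFrame({'a': [1, 2]}, index=[10, 20])
right = pd.DataFrame({'b': ['x', 'y']}, index=[20, 30])

print(left.join(right))               # left join on the index by default
#     a    b
# 10  1  NaN
# 20  2    x
print(left.join(right, how='inner'))  # keep only index labels present in both
#     a  b
# 20  2  x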

How to group by hierarchical data in pandas/sql?

I have a problem with hierarchical data. I have data like this:
id performance_rating parent_id level
111 8 null 0
122 3 null 0
123 9 null 0
254 5 111 1
265 8 111 1
298 7 122 1
220 6 123 1
305 5 298 2
395 8 220 2
... ... ... ...
654 4 562 5
id is the person's unique identifier.
performance_rating is their rating out of 10.
parent_id is the id of the person working directly above the corresponding id.
I need to find the average rating of each individual tree (rooted at 111, 122, 123).
What I tried is splitting the dataframe by level, then merging and grouping, but that gets quite long.
There will be a few different ways to do this; here's an ugly solution.
We use a while loop inside a for loop to "back-level" each row of the dataframe:
This requires that we first set 'id' as index and sort by 'level', descending. It also requires no duplicate IDs. Here goes:
df = df.set_index('id')
df = df.sort_values(by='level', ascending=False)
for i in df.index:
    while df.loc[i, 'level'] > 1:
        old_pid = df.loc[i, 'parent_id']
        df.loc[i, 'parent_id'] = df.loc[old_pid, 'parent_id']
        old_level = df.loc[i, 'level']
        df.loc[i, 'level'] = old_level - 1
This way, no matter how many levels there are, we are left with everything at level 1 of hierarchy and can then do:
grouped = df.groupby('parent_id').mean()
(or whatever variation of that you need)
I hope that helps!
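For reference, here is a self-contained sketch of the same idea using only the sample rows shown in the question (note that the roots themselves are excluded from their own tree's average, because their parent_id is null):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':                 [111, 122, 123, 254, 265, 298, 220, 305, 395],
    'performance_rating': [  8,   3,   9,   5,   8,   7,   6,   5,   8],
    'parent_id':          [np.nan, np.nan, np.nan, 111, 111, 122, 123, 298, 220],
    'level':              [  0,   0,   0,   1,   1,   1,   1,   2,   2],
})

df = df.set_index('id').sort_values(by='level', ascending=False)

# Walk each row up the tree until its parent_id points at a root (i.e. it is at level 1).
for i in df.index:
    while df.loc[i, 'level'] > 1:
        old_pid = int(df.loc[i, 'parent_id'])   # parent_id is float because of the NaN roots
        df.loc[i, 'parent_id'] = df.loc[old_pid, 'parent_id']
        df.loc[i, 'level'] -= 1

print(df.groupby('parent_id')['performance_rating'].mean())
# parent_id
# 111.0    6.5
# 122.0    6.0
# 123.0    7.0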

groupby pandas dataframe and create another dataframe which represents the groupby results horizontally

I have a pandas dataframe by the name usabledata with columns ['marker','action','id']
usabledata = pd.DataFrame(columns=['marker','action','id'])
I ran the following commands on the usabledata dataframe:
counts = usabledata.groupby(['marker','action']).count()
counts = counts.drop(['marker','action'])
print(counts)
id
marker action
1 A 377
B 224
C 9881
D 149946
2 A 481
B 397
C 7468
D 147581
3 A 538
B 458
D 145916
Now, I want to create a pandas dataframe with the following format:
Marker A B C D
1 377 224 9881 149946
2 481 397 7468 147581
3 538 458 0 145916
Is it possible to do this with a pandas dataframe in an IPython notebook?
Also, is it possible to delete a column, for example column 'C', after obtaining this desired output?
Another doubt in the same problem: after obtaining the desired output, how can I add another column 'Fraction' which is just a ratio of the columns 'A' and 'D'?
IIUC then you can call unstack with fillna:
In [124]:
counts.unstack().fillna(0)
Out[124]:
          id
action     A    B     C       D
marker
1        377  224  9881  149946
2        481  397  7468  147581
3        538  458     0  145916
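For the follow-up questions in the post, a minimal sketch (the name result is just for this sketch, and 'Fraction' is assumed to mean A divided by D, since the exact ratio was not specified):
result = counts.unstack().fillna(0)
result.columns = result.columns.droplevel(0)    # drop the leftover 'id' level, keeping A/B/C/D
result = result.drop(columns='C')               # delete column 'C'
result['Fraction'] = result['A'] / result['D']  # ratio of column 'A' to column 'D'
print(result)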
