I'm working in pandas 0.18.0 on python 2.7.9.
Take a sample DataFrame and group by a few columns, then sum over a different column for the result, like this:
>>> df = pandas.DataFrame([[1,2,3],[4,5,6],[1,2,9]], columns=['a','b','c'])
>>> print df
a b c
0 1 2 3
1 4 5 6
2 1 2 9
>>> df.groupby(['a','b'], as_index=False)['c'].sum()
a b c
0 1 2 12
1 4 5 6
That all looks great, but when the same operation is performed on an empty DataFrame, the columns are dropped from the result:
>>> empty = pandas.DataFrame(columns=['a','b','c'])
>>> print empty
Empty DataFrame
Columns: [a, b, c]
Index: []
>>> empty.groupby(['a','b'], as_index=False)['c'].sum()
Empty DataFrame
Columns: []
Index: []
Were someone to reference valid columns from the result later in the code, a KeyError would result. Is there a way to keep the columns?
I believe this is a standard result of groupby.sum() (see here http://pandas.pydata.org/pandas-docs/stable/missing_data.html).
The only way I can think of would be to write an if statement checking whether the dataframe is empty before grouping, e.g.:
if empty.empty:
    print "empty dataframe"
else:
    empty.groupby(['a','b'], as_index=False)['c'].sum()
This should keep your empty dataframe with its column headers.
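For example, a quick check of that pattern (a sketch; result is just an illustrative name):
>>> result = empty if empty.empty else empty.groupby(['a','b'], as_index=False)['c'].sum()
>>> result.columns.tolist()
['a', 'b', 'c']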
Hope this helps.
I'm trying to find the duplicates between 2 columns, where order is independent, but I need to keep the count of duplicates after dropping them.
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
This is my expected result
source target count
0 A B 2
1 D B 1
3 B C 2
I've already tried several approaches, but I can't come close to a solution.
It does not matter which combination is maintained. In the result example I kept the first.
The following approach creates a new column containing a set of the values in the columns specified. The advantage is that all other columns are preserved in the final result. Furthermore, the indices are preserved the same way as in the expected output you posted:
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
# Create column with set of both columns
df['tmp'] = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)
# Create count column based on new tmp column
df['count'] = df.groupby(['tmp'])['target'].transform('size')
# Drop duplicate rows based on new tmp column
df = df[~df.duplicated(subset='tmp', keep='first')]
# Remove tmp column
df = df.drop('tmp', axis=1)
df
Output:
source target count
0 A B 2
1 D B 1
3 B C 2
You can use df.duplicated() to see which rows are duplicated: the output is True if the row is a duplicate and False if it isn't. For more info and a practical example, check out the documentation.
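For instance, a minimal self-contained sketch (rebuilding the frame from the question, output shown as comments):
import pandas as pd
df = pd.DataFrame([['A','B'],['D','B'],['B','A'],['B','C'],['C','B']],
                  columns=['source', 'target'])
df['tmp'] = df.apply(lambda x: frozenset([x['source'], x['target']]), axis=1)
# True for every row whose unordered pair has already appeared above it
print(df.duplicated(subset='tmp'))
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# dtype: bool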
Create a summary based on applying a frozenset to your desired columns. Here we're using all columns.
summary = df.apply(frozenset, axis=1).value_counts()
This'll give you a Series of:
(A, B) 2
(C, B) 2
(B, D) 1
dtype: int64
You can then reconstruct a DataFrame by iterating over that Series, e.g.:
df2 = pd.DataFrame(((*idx, val) for idx, val in summary.items()), columns=[*df.columns, 'count'])
Which results in:
source target count
0 A B 2
1 C B 2
2 B D 1
This question already has an answer here:
Find unique column values out of two different Dataframes
(1 answer)
Closed 1 year ago.
I'm working in Python with pandas and I have 2 DataFrames:
df1:
1 'A'
2 'B'
df2:
1 'A'
2 'B'
3 'C'
4 'D'
and I want to return the difference:
1 'C'
2 'D'
You can concatenate two dataframes and drop duplicates:
pd.concat([df1, df2]).drop_duplicates(keep=False)
If your dataframe contains more columns you can add a certain column name as a subset:
pd.concat([df1, df2]).drop_duplicates(subset='col_name', keep=False)
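As a rough sketch with the two frames from the question (the column name N is assumed from your later edit):
import pandas as pd
df1 = pd.DataFrame({'N': ['A', 'B']})
df2 = pd.DataFrame({'N': ['A', 'B', 'C', 'D']})
# Rows appearing in both frames are removed entirely by keep=False,
# so only the values unique to one of the two frames remain
print(pd.concat([df1, df2]).drop_duplicates(keep=False))
#    N
# 2  C
# 3  D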
What I retrieve with pd.concat([df1, df2]).drop_duplicates(keep=False)
(N = name of the column)
df1:
N
0 A
1 B
2 C
df2:
N
0 A
1 B
2 C
df3:
N
0 A
1 B
2 C
0 A
1 B
2 C
The values in the dataframes are phone numbers without the '+' in them, so I can't show them. I import them with:
df1 = pd.DataFrame(ListResponse, columns=['33000000000'])
df2 = pd.read_csv('number.csv')
ListResponse returns a list of numbers, and number.csv is the ListResponse that I saved to CSV the last time I ran the script.
Edit:
(what I want in this case is "Empty DataFrame")
I just tested with new values:
df3:
N
0 A
1 B
2 C
3 D
0 B
1 C
2 D
Edit2: I think drop_duplicates is not working because my function inserts new values at index = 0 and not at index = length+1, as you can see just above. But when the same values are in both dataframes, it does not return an empty df...
I have a pandas dataframe as below. How can I drop any column which is a subset of any of the remaining columns? I would like to do this without using fillna.
df = pd.DataFrame([ [1,1,3,3], [np.NaN,2,np.NaN,4]], columns=['A','B','C','D'] )
df
A B C D
0 1.0 1 3.0 3
1 NaN 2 NaN 4
I can identify here that column A is subset of B and column C is a subset of D with something like this:
if all(df['A'][df['A'].notnull()].isin(df['B'])):
I could run a loop over all columns and drop the subset columns. But is there a more efficient way to accomplish this, so that I have the following result:
df
B D
0 1 3
1 2 4
Thanks.
It still requires iteration, but you can use this list comprehension (with an if statement similar to the one you provided) to get columns to keep:
keep_cols = [x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))]
# ['B', 'D']
And then use the result with filter:
df.filter(items=keep_cols)
# B D
# 0 1 3
# 1 2 4
This should be fast enough, since it still uses apply at its core, and seems to be safer/more efficient than dropping columns within a loop.
If you're keen on a one-line solution, of course assigning the list to a variable is an optional step:
df.filter(items=[x for x in df if not any(df.drop(x, axis=1).apply(lambda y: df[x].dropna().isin(y).all()))])
I had a problem and I found a solution, but I feel it's the wrong way to do it. Maybe there is a more 'canonical' way to do it.
I already had an answer for a really similar problem, but here the dataframes do not have the same number of rows. Sorry for the "double-post", but the first one is still valid, so I think it's better to make a new one.
Problem
I have two dataframes that I would like to merge without adding extra columns and without erasing existing info. Example:
Existing dataframe (df)
A A2 B
0 1 4 0
1 2 5 1
2 2 5 1
Dataframe to merge (df2)
A A2 B
0 1 4 2
1 3 5 2
I would like to update df with df2 where columns 'A' and 'A2' correspond.
The result would be :
A A2 B
0 1 4 2 <= Update value ONLY
1 2 5 1
2 2 5 1
Here is my solution, but I think it's not a really good one.
import pandas as pd
df = pd.DataFrame([[1,4,0],[2,5,1],[2,5,1]],columns=['A','A2','B'])
df2 = pd.DataFrame([[1,4,2],[3,5,2]],columns=['A','A2','B'])
df = df.merge(df2,on=['A', 'A2'],how='left')
df['B_y'].fillna(0, inplace=True)
df['B'] = df['B_x']+df['B_y']
df = df.drop(['B_x','B_y'], axis=1)
print(df)
I tried this solution:
rows = (df[['A','A2']] == df2[['A','A2']]).all(axis=1)
df.loc[rows,'B'] = df2.loc[rows,'B']
But I get this error because of the mismatched number of rows:
ValueError: Can only compare identically-labeled DataFrame objects
Does anyone have a better way to do this?
Thanks!
I think you can use DataFrame.isin to check which rows are the same in both DataFrames. Then create NaN values with the mask, fill them with combine_first, and last cast to int:
mask = df[['A', 'A2']].isin(df2[['A', 'A2']]).all(1)
print (mask)
0 True
1 False
2 False
dtype: bool
df.B = df.B.mask(mask).combine_first(df2.B).astype(int)
print (df)
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1
With a minor tweak in the way in which the boolean mask gets created, you can get it to work:
cols = ['A', 'A2']
# Slice df to match the shape of df2 and compare elementwise
rows = (df[cols].values[:df2.shape[0]] == df2[cols].values).all(1)
# Map the positional mask back onto df's index so the lengths match when assigning
df.loc[df.index[:df2.shape[0]][rows], 'B'] = df2.loc[rows, 'B'].values
df
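Which, provided the matching rows of df and df2 line up positionally as they do here, again gives:
A A2 B
0 1 4 2
1 2 5 1
2 2 5 1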
Suppose I have a Pandas DataFrame called df with columns a and b and what I want is the number of distinct values of b per each a. I would do:
distcounts = df.groupby('a')['b'].nunique()
which gives the desired result, but it is a Series object rather than another DataFrame. I'd like a DataFrame instead. In regular SQL, I'd do:
SELECT a, COUNT(DISTINCT b) FROM df GROUP BY a
and haven't been able to emulate this query in Pandas exactly. How to?
I think you need reset_index:
distcounts = df.groupby('a')['b'].nunique().reset_index()
Sample:
df = pd.DataFrame({'a':[7,8,8],
                   'b':[4,5,6]})
print (df)
a b
0 7 4
1 8 5
2 8 6
distcounts = df.groupby('a')['b'].nunique().reset_index()
print (distcounts)
a b
0 7 1
1 8 2
Another alternative using Groupby.agg instead:
df.groupby('a', as_index=False).agg({'b': 'nunique'})
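On the same sample frame this yields an equivalent DataFrame (a quick sketch reusing the df defined above):
print (df.groupby('a', as_index=False).agg({'b': 'nunique'}))
a b
0 7 1
1 8 2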