Stack the values of several columns into one column while keeping the ids - python

I have a DataFrame with 100 columns (though I show only three of them here), and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df['id'] = [1, 2, 3]
df['c1'] = [1, 5, 1]
df['c2'] = [-1, 6, 5]
df
I want to stack the values of all the columns for each id and put them in one column. For example, for id=1 I want to stack its values 1 and -1 in one column; the full desired DataFrame matches the output shown in the answer below.
Note: df.melt does not solve my question, since I want to keep the ids as well.
Note 2: I have already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df

You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id')
         .stack()
         .droplevel(1)
         .reset_index(name='c'))
Output:
   id  c
0   1  1
1   1 -1
2   2  5
3   2  6
4   3  1
5   3  5
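For completeness, melt can also keep the ids when they are passed as id_vars; a minimal sketch (assuming the row order should match the output above, hence the stable sort):
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'c1': [1, 5, 1], 'c2': [-1, 6, 5]})
# id_vars keeps the ids; a stable sort groups the rows per id,
# reproducing the order of the stack-based output above
out = (df.melt(id_vars='id', value_name='c')
         .sort_values('id', kind='mergesort')
         .drop(columns='variable')
         .reset_index(drop=True))
print(out)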

Related

Using the items of a df as a header of a different dataframe

I have 2 dataframes:
df1 =
                 0                2
1  _A1-Site_0_norm  _A1-Site_1_norm
and df2 =
          0         2
2  0.500000  0.012903
3  0.010870  0.013793
4  0.011494  0.016260
I want to use df1 as the header of df2, so that df1 becomes either the column header or the first row:
1  _A1-Site_0_norm  _A1-Site_1_norm
2  0.500000         0.012903
3  0.010870         0.013793
4  0.011494         0.016260
I have multiple columns, so it will not work to do
df2.columns = ["_A1-Site_0_norm", "_A1-Site_1_norm"]
I thought of making a list of all the items present in df1 and then passing that list to df2.columns, but I am having trouble converting the elements in row 1 of df1, one per column, into the items of a list.
I am not married to that approach; any alternative is welcome.
Many thanks
If I understood your question correctly, then this example should work for you:
import pandas as pd
d = {'A': [1], 'B': [2], 'C': [3]}
df = pd.DataFrame(data=d)
d2 = {'1': ['D'], '2': ['E'], '3': ['F']}
df2 = pd.DataFrame(data=d2)
df.columns = df2.values.tolist()[0]  # this is what you need to implement: the single row of df2 taken as a flat list of labels
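Applied to the frames in the question, the same idea would look roughly like this (a sketch; the df1/df2 literals are stand-ins reconstructed from the printout above):
import pandas as pd
# stand-ins for the asker's frames: df1 holds the single row of labels
df1 = pd.DataFrame([['_A1-Site_0_norm', '_A1-Site_1_norm']])
df2 = pd.DataFrame([[0.500000, 0.012903],
                    [0.010870, 0.013793],
                    [0.011494, 0.016260]])
df2.columns = df1.iloc[0].tolist()  # the first (only) row of df1 as a flat list of labels
print(df2)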

Remove rows of one Dataframe based on one column of another dataframe

I have two DataFrames and want to remove rows in df1 that have the same value in column 'a' as df2. Moreover, one common value in df2 should remove only one row.
df1 = pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2 = pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
result = pd.DataFrame({'a':[1,1,3,4],'b':[1,2,4,6],'c':[6,5,3,1]})
Use Series.isin + Series.duplicated to create a boolean mask and use this mask to filter the rows from df1:
m = df1['a'].isin(df2['a']) & ~df1['a'].duplicated()
df = df1[~m]
Result:
print(df)
   a  b  c
0  1  1  6
1  1  2  5
3  3  4  3
5  4  6  1
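Note that the mask above removes at most one row per distinct value. If each occurrence in df2['a'] (including the duplicated 2) should remove its own row, one possible generalization (a sketch, not part of the original answer) is to number the occurrences on both sides with cumcount and match the (value, occurrence) pairs:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 1, 2, 3, 4, 4], 'b': [1, 2, 3, 4, 5, 6], 'c': [6, 5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'a': [2, 4, 2], 'b': [1, 2, 3], 'c': [6, 5, 4]})
# the k-th duplicate of a value in df1 is removed only if df2
# also contains a k-th duplicate of that value
key1 = list(zip(df1['a'], df1.groupby('a').cumcount()))
key2 = set(zip(df2['a'], df2.groupby('a').cumcount()))
result = df1[[k not in key2 for k in key1]]
print(result)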
Try This:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 1, 2, 3, 4, 4], 'b': [1, 2, 3, 4, 5, 6], 'c': [6, 5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'a': [2, 4, 2], 'b': [1, 2, 3], 'c': [6, 5, 4]})
df2a = df2['a'].tolist()

def remove_df2_dup(x):
    # drop the row the first time x matches an unused value from df2a
    if x in df2a:
        df2a.remove(x)
        return False
    return True

df1[df1.a.apply(remove_df2_dup)]
It creates a list from df2['a'], then checks that list against each value of df1['a'], removing values from the list each time there's a match in df1.
Try this:
df1 = pd.DataFrame({'a': [1, 1, 2, 3, 4, 4], 'b': [1, 2, 3, 4, 5, 6], 'c': [6, 5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'a': [2, 4, 2], 'b': [1, 2, 3], 'c': [6, 5, 4]})
for x in df2.a:
    if x in df1.a.values:  # note: `x in df1.a` would test the index, not the values
        df1.drop(df1[df1.a == x].index[0], inplace=True)
print(df1)

How to re-index as multi-index pandas dataframe from index value that repeats

I have an index in a pandas dataframe which repeats the index value. I want to re-index as a multi-index where the repeated indexes are grouped.
The index repeats a value such as 112335586 across several consecutive rows (the question originally showed this as a screenshot), and I would like all the 112335586 index values to be grouped under the same index.
I have looked at this question, Create pandas dataframe by repeating one row with new multiindex, but there the index values can be pre-defined, which is not possible here as my dataframe is far too large to hard-code them.
I also looked at the multi-index documentation, but it also pre-defines the values for the index.
I believe you need:
s = pd.Series([1,2,3,4], index=[10,10,20,20])
s.index.name = 'EVENT_ID'
print (s)
EVENT_ID
10    1
10    2
20    3
20    4
dtype: int64
s1 = s.index.to_series()
s2 = s1.groupby(s1).cumcount()  # 0, 1, ... within each repeated index value
s.index = [s.index, s2]  # attach the counter as a second index level
print (s)
EVENT_ID
10  0    1
    1    2
20  0    3
    1    4
dtype: int64
Try this:
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()
df.set_index(['EVENT_ID','sub_idx'], inplace=True)
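For reference, a small runnable version of this answer (the EVENT_ID index name and values are assumed, mirroring the sample above):
import pandas as pd
df = pd.DataFrame({'val': [1, 2, 3, 4]},
                  index=pd.Index([10, 10, 20, 20], name='EVENT_ID'))
df.reset_index(inplace=True)
df['sub_idx'] = df.groupby('EVENT_ID').cumcount()  # 0, 1, ... within each repeated id
df.set_index(['EVENT_ID', 'sub_idx'], inplace=True)
print(df)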

How to count data in a column based on another column separately?

I have two dataframes like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count each of df1's two numbers separately in df2. The correct answer looks like:
No  Amount
1   3
2   2
Instead of:
No  Amount
1   5
2   5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
    df2[df2['a'].isin(df1['a'].unique())]['a']
    .value_counts()
    .reset_index()
)
result.columns = ['No', 'Amount']
>>> result
   No  Amount
0   1       3
1   2       2
In pandas 0.21.0 you can use set_axis to rename columns as a chained method. Here's a one-liner:
df2[df2.a.isin(df1.a)]\
   .squeeze()\
   .value_counts()\
   .reset_index()\
   .set_axis(['No', 'Amount'], axis=1, inplace=False)
Output:
   No  Amount
0   1       3
1   2       2
You can simply find the value_counts of the second df and map that onto the first df, i.e.
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output:
   No  Amount
0   1       3
1   2       2
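One caveat with the map approach (an observation, not part of the original answer): if a value of df1['a'] never occurs in df2, map yields NaN for it. A sketch that fills those gaps with zero:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 9]})  # 9 does not occur in df2
df2 = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8]})
df1['Amount'] = df1['a'].map(df2['a'].value_counts()).fillna(0).astype(int)
df1 = df1.rename(columns={'a': 'No'})
print(df1)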

Pandas distinct count as a DataFrame

Suppose I have a Pandas DataFrame called df with columns a and b, and what I want is the number of distinct values of b for each a. I would do:
distcounts = df.groupby('a')['b'].nunique()
which gives the desired result, but as a Series object rather than another DataFrame. I'd like a DataFrame instead. In regular SQL, I'd do:
SELECT a, COUNT(DISTINCT b) FROM df GROUP BY a
and I haven't been able to emulate this query in Pandas exactly. How to?
I think you need reset_index:
distcounts = df.groupby('a')['b'].nunique().reset_index()
Sample:
df = pd.DataFrame({'a': [7, 8, 8],
                   'b': [4, 5, 6]})
print(df)
   a  b
0  7  4
1  8  5
2  8  6
distcounts = df.groupby('a')['b'].nunique().reset_index()
print(distcounts)
   a  b
0  7  1
1  8  2
Another alternative using Groupby.agg instead:
df.groupby('a', as_index=False).agg({'b': 'nunique'})
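If the count column should also get a friendlier name in the same step, named aggregation (available since pandas 0.25) is one option; a sketch using the sample frame above:
import pandas as pd
df = pd.DataFrame({'a': [7, 8, 8], 'b': [4, 5, 6]})
# distinct_b mirrors the SQL COUNT(DISTINCT b) alias
distcounts = df.groupby('a', as_index=False).agg(distinct_b=('b', 'nunique'))
print(distcounts)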
