I have data in an Excel file. Here is some sample data.
In[1] import pandas as pd
df = pd.DataFrame({'T1': ['A', 'B', 'A'],
'T1_data': [3, 2, '3K'],
'T2': ['B', 'A', 'B'],
'T2_data': ["5,2K", 4, 2],
})
df
Out[1] T1 T1_data T2 T2_data
0 A 3 B 5,2K
1 B 2 A 4
2 A 3K B 2
Expected outputs:
I want this:
T1_count T1_data T2_count T2_data
A 2 3, 3k 1 4
B 1 2 2 5, 2k
and this
T12_count T12_data
A 3 3, 3K, 4
B 3 2, 5, 2.
I know the simple value_counts(), but I don't know how to do the things above. If anyone can help, it would be really appreciated.
df1 = df['T1'].value_counts()
df1
A 2
B 1
Name: T1, dtype: int64
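One possible way to get both tables is to group the data column by its key column and aggregate twice: once for the count and once for a joined string. This is only a sketch. The helper name summarize and the variable names are made up for illustration, and it assumes every value can be treated as an opaque string (so "5,2K" stays a single item rather than being split).

```python
import pandas as pd

df = pd.DataFrame({'T1': ['A', 'B', 'A'],
                   'T1_data': [3, 2, '3K'],
                   'T2': ['B', 'A', 'B'],
                   'T2_data': ["5,2K", 4, 2],
                   })

def summarize(keys, values, prefix):
    """Hypothetical helper: count and join `values` per distinct key."""
    grouped = values.astype(str).groupby(keys)
    return pd.DataFrame({
        f'{prefix}_count': grouped.size(),
        f'{prefix}_data': grouped.agg(', '.join),
    })

# First table: T1 and T2 summarized separately, then placed side by side.
side_by_side = pd.concat(
    [summarize(df['T1'], df['T1_data'], 'T1'),
     summarize(df['T2'], df['T2_data'], 'T2')],
    axis=1,
)

# Second table: stack both key/data pairs into one long pair first.
all_keys = pd.concat([df['T1'], df['T2']], ignore_index=True)
all_values = pd.concat([df['T1_data'], df['T2_data']], ignore_index=True)
combined = summarize(all_keys, all_values, 'T12')

print(side_by_side)
print(combined)
```

Converting with astype(str) is what lets ', '.join work on a column that mixes integers and strings.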
Hi there, I would like to join all strings within a group using Python datatable, in order to avoid pandas. Below is the pandas code I am currently using, which I would like to replicate in datatable.
Does anyone know how to do it? Thank you very much!
from datatable import dt, f, by
df = dt.Frame(group1=[1, 1, 1, 2, 2, 2], group2=[1, 1, 2, 2, 2, 3], text=['a', 'b', 'c', 'd', 'e', 'f'])
df = df.to_pandas()
df2 = df.groupby(['group1', 'group2'])['text'].apply(' '.join).reset_index() # replicate this with datatable
df:
group1 group2 text
0 1 1 a
1 1 1 b
2 1 2 c
3 2 2 d
4 2 2 e
5 2 3 f
df2
group1 group2 text
0 1 1 a b
1 1 2 c
2 2 2 d e
3 2 3 f
I have a dataframe where every two rows are related. I am trying to give every two rows a unique ID. I thought it would be much easier but I cannot figure it out. Let's say I have this dataframe:
df = pd.DataFrame({'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
Var1 Var2
A B
2 5
C D
7 9
I would like to add an ID that would result in a dataframe that looks like this:
df = pd.DataFrame({'ID' : [1,1,2,2],'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
print(df)
ID Var1 Var2
1 A B
1 2 5
2 C D
2 7 9
This is just a sample, but every two rows are related so just trying to count by 1, 1, 2, 2, 3, 3 etc in the ID column.
Thanks for any help!
You can create a sequence first and then divide it by 2 (integer division):
import numpy as np
df['ID'] = np.arange(len(df)) // 2 + 1
df
# Var1 Var2 ID
#0 A B 1
#1 2 5 1
#2 C D 2
#3 7 9 2
I don't think there is a native pandas way to do it, but this works:
import pandas as pd
df = pd.DataFrame({'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]})
df['ID'] = 1 + df.index // 2
df[['ID', 'Var1', 'Var2']]
Output:
ID Var1 Var2
0 1 A B
1 1 2 5
2 2 C D
3 2 7 9
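One caveat worth sketching: df.index // 2 relies on the default 0, 1, 2, ... index, while np.arange(len(df)) only depends on row position. The example below assumes a made-up non-default index and a group_size variable (both hypothetical, not from the question) to show the positional version, with the ID inserted as the first column.

```python
import numpy as np
import pandas as pd

# Hypothetical non-default index, to show the IDs depend only on row position.
df = pd.DataFrame({'Var1': ['A', 2, 'C', 7], 'Var2': ['B', 5, 'D', 9]},
                  index=[10, 20, 30, 40])

group_size = 2  # number of consecutive rows that share one ID

# Row positions 0, 1, 2, 3 -> integer-divide by the group size -> 0, 0, 1, 1,
# then shift by one so the IDs start at 1.
df.insert(0, 'ID', np.arange(len(df)) // group_size + 1)
print(df)
```

Changing group_size to 3 would label every three consecutive rows instead.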
I have a DataFrame in which I want to switch the order of the columns in only the third row while keeping the other rows the same.
In my project I have to switch the order under some condition, but here is a simplified example that has no real meaning.
Suppose the dataset is
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
'B': [5, 6, 7, 8, 9],
'C': ['a', 'b', 'c', 'd', 'e']})
df
out[1]:
A B C
0 0 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
I want to have the output:
A B C
0 0 5 a
1 1 6 b
2 **7 2** c
3 3 8 d
4 4 9 e
How do I do it?
I have tried:
new_order = [1, 0, 2] # specify new order of the third row
i = 2 # specify row number
df.iloc[i] = df[df.columns[new_order]].loc[i] # reorder the third row only and assign new values to df
I observed from the output of the right-hand side that the columns are reordering as I wanted:
df[df.columns[new_order]].loc[i]
Out[2]:
B 7
A 2
C c
Name: 2, dtype: object
But when assigned to df again, it did nothing. I guess it's because of the name matching.
Can someone help me? Thanks in advance!
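The guess about name matching looks right: when the right-hand side is a Series, pandas aligns it on the column labels, so B, A, C is put straight back into A, B, C order. A minimal sketch of one way around this, assuming it is acceptable to assign the raw values by position, is to strip the labels before assigning:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})

new_order = [1, 0, 2]  # column positions to read the row from: B, A, C
i = 2                  # position of the row to rearrange

# .to_numpy() drops the B/A/C labels, so the values are written by position
# instead of being re-aligned onto their original columns.
df.iloc[i] = df.iloc[i, new_order].to_numpy()
print(df)
```

On pandas versions that predate to_numpy(), the .values attribute serves the same purpose here.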
Python version: 3.5.2; Pandas version: 0.23.1
I am noticing unexpected behavior when I groupby using two indices but each row is unique on the first index. The code I am executing on my data frame with column c is:
df.c.groupby(df.index.names).min()
Everything works as expected when the rows are not unique on the first index. To make this clear, I've included two versions below. Edit: Now including three versions!
Version 1: Has the expected output
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 2, 4]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
Input:
c
a b
1 2 3
2 4
4 5 6
Output:
a b
1 2 3
4 5 6
Version 2: Has the unexpected output
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
Input:
c
a b
1 2 3
4 5 6
Output:
a 3
b 6
Expected Output:
a b
1 2 3
4 5 6
Version 3: Has expected output, but not expected with version 2 in mind.
df = pd.DataFrame([[1, 2, 3, 4], [4, 5, 6, 7]], columns=['a', 'b1', 'b2', 'c'])
df = df.set_index(['a','b1','b2']).sort_index()
Input:
c
a b1 b2
1 2 3 4
4 5 6 7
Output:
a b1 b2
1 2 3 4
4 5 6 7
Here is a peek into what is going on. Take a look at the name of the series that gets passed into the "applied" function, f.
In the first case (Expected Results):
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 2, 4]], columns=['a', 'b', 'c'])
df = df.set_index(['a','b']).sort_index()
def f(x):
    print(x)
    print('\n')
    print(min(x))
    print('\n')
    return min(x)
df.c.groupby(['a','b']).apply(f)
Output:
a b
1 2 3
2 4
Name: (1, 2), dtype: int64
3
a b
4 5 6
Name: (4, 5), dtype: int64
6
Out[292]:
a b
1 2 3
4 5 6
In the second case (unexpected results), note the name of the series passed in:
df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df1 = df1.set_index(['a','b']).sort_index()
def f(x):
    print(x)
    print('\n')
    print(min(x))
    print('\n')
    return min(x)
df1.c.groupby(['a','b']).apply(f)
Output:
a b
1 2 3
Name: a, dtype: int64
3
a b
4 5 6
Name: b, dtype: int64
6
Out[293]:
a 3
b 6
Name: c, dtype: int64
It uses these series to build the resulting dataframe. The naming of the series is the culprit, due to the nature of the data. Why? Well, we'll have to look into the code for that.
The idiomatic fix for this problem is to use this syntax:
df1.groupby(df1.index.names)['c'].min()
Output:
a b
1 2 3
4 5 6
Name: c, dtype: int64
You can use the level argument of groupby:
>>> df
c
a b
1 2 3
4 5 6
>>> df.c.groupby(level=[0,1]).min()
a b
1 2 3
4 5 6
Name: c, dtype: int64
From the docs
level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels
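As the quoted description says, level also accepts level names, which can read more clearly than positions. A small sketch on the two-row frame from Version 2 (the variable names by_position and by_name are just for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
df = df.set_index(['a', 'b']).sort_index()

# Group by index levels, given either as positions or as names.
by_position = df['c'].groupby(level=[0, 1]).min()
by_name = df['c'].groupby(level=['a', 'b']).min()

print(by_name)
print(by_position.equals(by_name))
```

Because the grouping is declared as index levels, the strings 'a' and 'b' are not open to being interpreted as anything else.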
This behavior has since changed in pandas. The output now matches the expected output in all cases.
I have a dataframe like this where the columns are the scores of some metrics:
A B C D
4 3 3 1
2 5 2 2
3 5 2 4
I want to create a new column to summarize which metrics each row scored over a set threshold in, using the column name as a string. So if the threshold was A > 2, B > 3, C > 1, D > 3, I would want the new column to look like this:
A B C D NewCol
4 3 3 1 AC
2 5 2 2 BC
3 5 2 4 ABCD
I tried using a series of np.where:
df['NewCol'] = np.where(df['A'] > 2, 'A', '')
df['NewCol'] = np.where(df['B'] > 3, 'B', '')
etc.
but realized the result was overwriting with the last metric any time all four metrics didn't meet the conditions, like so:
A B C D NewCol
4 3 3 1 C
2 5 2 2 C
3 5 2 4 ABCD
I am pretty sure there is an easier and correct way to do this.
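The overwriting happens because each assignment replaces the whole column rather than appending to it. A minimal sketch that keeps the np.where idea but collects one array per metric and joins them row by row afterwards (the thresholds dict and the parts list are names assumed here for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[4, 3, 3, 1],
                   [2, 5, 2, 2],
                   [3, 5, 2, 4]], columns=['A', 'B', 'C', 'D'])

# Assumed layout: one threshold per metric column.
thresholds = {'A': 2, 'B': 3, 'C': 1, 'D': 3}

# One array per metric: the column name where the row passes, '' otherwise.
parts = [np.where(df[col] > limit, col, '') for col, limit in thresholds.items()]

# zip(*parts) walks the arrays row by row; join concatenates the pieces.
df['NewCol'] = [''.join(row) for row in zip(*parts)]
print(df)
```

Nothing is overwritten because the column is assigned only once, after all conditions have been evaluated.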
You could do:
import pandas as pd
data = [[4, 3, 3, 1],
[2, 5, 2, 2],
[3, 5, 2, 4]]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C', 'D'])
th = {'A': 2, 'B': 3, 'C': 1, 'D': 3}
df['result'] = [''.join(k for k in df.columns if record[k] > th[k]) for record in df.to_dict('records')]
print(df)
Output
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD
Using dot:
s = pd.Series([2, 3, 1, 3], index=df.columns)
df.gt(s, axis=1).dot(df.columns)
Out[179]:
0 AC
1 BC
2 ABCD
dtype: object
#df['New']=df.gt(s,1).dot(df.columns)
Another option that operates in an array fashion. It would be interesting to compare performance.
import pandas as pd
import numpy as np
# Data to test.
data = pd.DataFrame(
[
[4, 3, 3, 1],
[2, 5, 2, 2],
[3, 5, 2, 4]
]
, columns = ['A', 'B', 'C', 'D']
)
# Series to hold the thresholds.
thresholds = pd.Series([2, 3, 1, 3], index = ['A', 'B', 'C', 'D'])
# Subtract the series from the data, broadcasting, and then use sum to concatenate the strings.
data['result'] = np.where(data - thresholds > 0, data.columns, '').sum(axis = 1)
print(data)
Gives:
A B C D result
0 4 3 3 1 AC
1 2 5 2 2 BC
2 3 5 2 4 ABCD