Join columns and reshape into rows - Python

I have data like this:
data = {'Host' : ['A','A','A','A','A','A','B','B','B'], 'Duration' : ['1','2',None,'4','5',None,'7','8',None], 'Predict' : [None,None,'3',None,None,'6',None,None,'9']}
df = pd.DataFrame(data)
It looks like:
  Host Duration Predict
0    A        1    None
1    A        2    None
2    A     None       3
3    A        4    None
4    A        5    None
5    A     None       6
6    B        7    None
7    B        8    None
8    B     None       9
What I expected to get:
A 1, 2, 3
A 4, 5, 6
B 7, 8, 9
I got what I wanted, but I don't like the way I did it:
def create_vector(group):
    result = []
    df_array = []
    # Walk Duration; where it is None, take the Predict value instead
    # and append -1 as a group separator.
    for index, item in enumerate(group.Duration.ravel()):
        if item is not None:
            result.append(item)
        else:
            result.append(group.Predict.ravel()[index])
            result.append(-1)
    result = np.array(list(map(int, result)))
    # Split on the -1 separators and join each chunk into a string.
    splitted = np.split(result, np.where(result == -1)[0] + 1)
    for arr in splitted:
        if len(arr) > 3:
            seq = ', '.join(str(e) for e in arr[:-1])
            df_array.append(seq)
    return pd.DataFrame(df_array, columns=['seq'])
The minimal length of arr must be one 'Duration' value plus one 'Predict' value.
df = df.groupby(['Host']).apply(create_vector)
df = df.reset_index().rename(columns={'level_1': 'Index'})
df = df.drop(columns=['Index'])
I would like to solve this problem using pandas. Any comments and advice are welcome.

I believe you can replace the missing values in Duration with the Predict column, which simplifies the solution:
df['new'] = df['Duration'].fillna(df['Predict']).astype(str)
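As a quick check on the sample data, the new column simply coalesces Duration and Predict:
print(df['new'].tolist())
# ['1', '2', '3', '4', '5', '6', '7', '8', '9']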
If you need groups of 3 values within each Host group:
g = df.groupby('Host').cumcount() // 3
Or, if you need to form groups from the Predict column with None as the separator (this requires the default index):
g = df.index.where(df['Predict'].notna()).to_series().bfill()
# if the values in the Predict column are always unique:
# g = df['Predict'].bfill()
df = (df.groupby(['Host', g])['new']
        .apply(', '.join)
        .reset_index(level=1, drop=True)
        .reset_index(name='Seq'))
print (df)
  Host      Seq
0    A  1, 2, 3
1    A  4, 5, 6
2    B  7, 8, 9
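As a quick sanity check (my sketch; run it against the original df from the question, since the snippet above reassigns df), both grouping keys mark out the same row blocks within each Host:
g1 = df.groupby('Host').cumcount() // 3
g2 = df.index.where(df['Predict'].notna()).to_series().bfill()
print(g1.tolist())  # [0, 0, 0, 1, 1, 1, 0, 0, 0]
print(g2.tolist())  # [2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0]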
Another solution reshapes with DataFrame.stack, which removes Nones/missing values by default, and again aggregates with join:
g = df.groupby('Host').cumcount() // 3
df = (df.set_index(['Host', g])
        .stack()
        .astype(str)
        .groupby(level=[0, 1])
        .apply(', '.join)
        .reset_index(level=1, drop=True)
        .reset_index(name='Seq'))
print (df)
  Host      Seq
0    A  1, 2, 3
1    A  4, 5, 6
2    B  7, 8, 9

One way would be to melt, then dropna to remove the missing values, then groupby and join the remaining values:
(df.melt(id_vars='Host')
   .dropna(subset=['value'])
   .groupby('Host').value
   .agg(', '.join)
   .reset_index())
  Host             value
0    A  1, 2, 4, 5, 3, 6
1    B           7, 8, 9
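Note that this joins all non-null values per Host rather than the asker's groups of three, and Duration values come before Predict values because of the melt ordering. A sketch of my own (reusing the cumcount key from the first answer) that restores the three-value groups:
g = df.groupby('Host').cumcount() // 3
(df.assign(g=g)
   .melt(id_vars=['Host', 'g'])
   .dropna(subset=['value'])
   .groupby(['Host', 'g']).value
   .agg(', '.join)
   .reset_index(level=1, drop=True)
   .reset_index())

  Host    value
0    A  1, 2, 3
1    A  4, 5, 6
2    B  7, 8, 9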

Related

Insert row in DataFrame at a certain place

I have the following Dataframe:
Now I want to insert an empty row after every row where the column "Zweck" equals 7.
So, for example, the third row should be an empty row.
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ren_dict = {i: df.columns[i] for i in range(len(df.columns))}
ind = df[df['f'] == 7].index
# Insert marker rows of 33 after each row where f == 7.
df = pd.DataFrame(np.insert(df.values, ind + 1, values=[33], axis=0))
df.rename(columns=ren_dict, inplace=True)
ind_empt = df['a'] == 33
df[ind_empt] = ''
print(df)
Output
   a  b  f
0  1  1  1
1  2  2  7
2
3  3  3  3
4  4  4  4
5  5  5  7
6
Here the dataframe is rebuilt in one go rather than appended to row by row, since repeated appends are resource-intensive. Marker rows with the value 33 are inserted first, because np.insert cannot insert string values into a numeric array. The columns are then renamed back to their original names with df.rename, and finally the rows where df['a'] == 33 are set to empty strings.
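For comparison, a pandas-only sketch of my own (not from the answer above) that avoids the numeric marker entirely by giving the empty rows fractional index labels and sorting:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5], 'f': [1, 7, 3, 4, 7]})
ind = df.index[df['f'] == 7]
# Labels like 1.5 and 4.5 sort directly after the matching rows.
empty = pd.DataFrame('', index=ind + 0.5, columns=df.columns)
out = pd.concat([df, empty]).sort_index().reset_index(drop=True)
print(out)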

How to compare and replace individual cell values in a dataframe according to a list? (pandas)

I have a dataframe containing numerical values. I want to replace all values in the dataframe by comparing individual cell values to the respective elements of a list. The length of the list equals the number of columns. Here's an example:
df = pd.DataFrame(np.array([[101, 2, 3], [4, 500, 6], [712, 8, 9]]),
                  columns=['a', 'b', 'c'])
Output
     a    b  c
0  101    2  3
1    4  500  6
2  712    8  9
list_numbers = [100, 100, 100]
I want to compare individual cell values to the respective elements of the list. So column 'a' will be compared to 100: if a value is greater than a hundred, I want to replace it with another number.
Here is my code so far:
df = pd.DataFrame(np.array([[101, 2, 3], [4, 500, 6], [712, 8, 9]]),
                  columns=['a', 'b', 'c'])
df_columns = df.columns
df_index = df.index
# Creating a new dataframe to store the values.
df1 = pd.DataFrame(index=df_index, columns=df_columns)
df1 = df1.fillna(0)
for index, value in enumerate(df.columns):
    # df.where replaces values where the condition is False
    df1[[value]] = df[[value]].where(df[[value]] > list_numbers[index], -1)
    df1[[value]] = df[[value]].where(df[[value]] < list_numbers[index], 1)
# I am getting something like NaN for column a and errors for the other columns.
# The output should look something like:
   a  b  c
0  1 -1 -1
1 -1  1 -1
2  1 -1 -1
Iterating over a DataFrame iterates over its column names. So you could simply do:
df1 = pd.DataFrame()
for i, c in enumerate(df):
    df1[c] = np.where(df[c] >= list_numbers[i], 1, -1)
You can avoid iterating over the columns and use numpy broadcasting instead, which is more efficient:
df1 = pd.DataFrame(
    np.where(df.values > np.array(list_numbers), 1, -1),
    columns=df.columns)
df1
Output:
   a  b  c
0  1 -1 -1
1 -1  1 -1
2  1 -1 -1
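The broadcasting aligns the list with the columns, so each column j is compared against list_numbers[j]; the boolean intermediate (for the df above) looks like this:
print(df.values > np.array(list_numbers))
# [[ True False False]
#  [False  True False]
#  [ True False False]]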

Build a dataframe from a dict with specified labels from a txt

I want to make a dataframe with defined labels, but I don't know how to tell pandas to take the labels from the list. I hope someone can help.
import numpy as np
import pandas as pd
thislist = ["A", "D"]
thisdict = {
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
    "D": [7, 8, 9]
}
df = pd.DataFrame(data=thisdict[thislist])  # <- here is my problem
I want to get this:
   A  D
0  1  7
1  2  8
2  3  9
Use:
df = pd.DataFrame(thisdict)[thislist]
print(df)
   A  D
0  1  7
1  2  8
2  3  9
We could also use DataFrame.drop:
df = pd.DataFrame(thisdict).drop(columns=['B', 'C'])
or DataFrame.reindex:
df = pd.DataFrame(thisdict).reindex(columns=thislist)
or DataFrame.filter:
df = pd.DataFrame(thisdict).filter(items=thislist)
We can also use filter to filter thisdict.items():
df = pd.DataFrame(dict(filter(lambda item: item[0] in thislist, thisdict.items())))
print(df)
   A  D
0  1  7
1  2  8
2  3  9
I think this answer is complemented by the solution from #anky_91.
Finally, I recommend reading up on how indexing works.
IIUC, use .loc[] with the dataframe constructor:
df = pd.DataFrame(thisdict).loc[:,thislist]
print(df)
   A  D
0  1  7
1  2  8
2  3  9
Use a dict comprehension to create a new dictionary that is a subset of your original so you only construct the DataFrame you care about.
pd.DataFrame({x: thisdict[x] for x in thislist})
   A  D
0  1  7
1  2  8
2  3  9
If you want to deal with the possibility of missing keys, add some logic so it behaves like reindex:
pd.DataFrame({x: thisdict[x] if x in thisdict.keys() else np.NaN for x in thislist})
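For example, with a hypothetical label "E" that is missing from thisdict:
pd.DataFrame({x: thisdict[x] if x in thisdict.keys() else np.NaN
              for x in ["A", "E"]})

   A   E
0  1 NaN
1  2 NaN
2  3 NaN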
df = pd.DataFrame(thisdict)
df[['A', 'D']]
Another alternative for your input:
thislist = ["A", "D"]
thisdict = {
    "A": [1, 2, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 9],
    "D": [7, 8, 9]
}
df = pd.DataFrame(thisdict)
and then simply remove the columns that are not in thislist (you can drop them directly from the df or collect them first):
remove_columns = []
for c in df.columns:
    if c not in thislist:
        remove_columns.append(c)
and remove them:
df.drop(columns=remove_columns, inplace=True)
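The same collection step can be written as a list comprehension (equivalent, just more compact):
remove_columns = [c for c in df.columns if c not in thislist]
df.drop(columns=remove_columns, inplace=True)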

How to define a Python lambda getting the first element?

import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [1, 3, 8, 10],
                   'C': ['alpha', 'bravo', 'charlie', 'delta']})
Now, I would like to group the data using my own lambdas, but they behave differently from what I expect. The lambda in the following example should return the first value of the column in each group:
df.groupby('A', as_index=False).agg({'B': 'mean',
                                     'C': lambda x: x[0]})
The code throws KeyError: 0, which is unclear to me, since ['alpha', 'bravo'][0] gives 'alpha'.
So overall the desired output:
   A  B          C
0  0  2    'alpha'
1  1  9  'charlie'
If you need to select the first value in each group, it is necessary to use Series.iat or Series.iloc to select by position:
df1 = df.groupby('A', as_index=False).agg({'B': 'mean', 'C': lambda x: x.iat[0]})
Another solution is to use GroupBy.first:
df1 = df.groupby('A', as_index=False).agg({'B': 'mean', 'C': 'first'})
print (df1)
   A  B        C
0  0  2    alpha
1  1  9  charlie
Can you add an explanation of why the lambda doesn't work?
The problem is that the second group's index starts at 2, not 0; x[0] tries to select by the index label 0, which does not exist in the second group, so it raises the error:
df1 = df.groupby('A', as_index=False).agg({'B': 'mean', 'C': lambda x: print(x[0])})
print (df1)
alpha <- prints the first value of the first group only, because 'alpha' has index 0
alpha
alpha
So if we set the index to 0 for the first row of each group, it works with this sample data:
df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [1, 3, 8, 10],
                   'C': ['alpha', 'bravo', 'charlie', 'delta']}, index=[0, 1, 0, 1])
print (df)
   A   B        C
0  0   1    alpha    <- index is 0
1  0   3    bravo
0  1   8  charlie    <- index is 0
1  1  10    delta
df1 = df.groupby('A', as_index=False).agg({'B': 'mean', 'C': lambda x: x[0]})
print (df1)
   A  B        C
0  0  2    alpha
1  1  9  charlie
A small explanation of why your lambda function won't work: when we use groupby, we get a GroupBy object back:
g = df.groupby('A')
print(g)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023AA1BB41D0>
When we access the elements in our groupby object, we get grouped dataframes back:
for idx, d in g:
    print(d, '\n')
   A  B      C
0  0  1  alpha
1  0  3  bravo
   A   B        C
2  1   8  charlie
3  1  10    delta
So that's why we need to treat these elements as DataFrames. As jezrael already pointed out in his answer, there are several ways to access the first value of your C column:
for idx, d in g:
    print(d['C'].iat[0])
    print(d['C'].iloc[0], '\n')
alpha
alpha
charlie
charlie
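Printing the index labels of each group makes the KeyError concrete (a quick check using the same g as above):
for idx, d in g:
    print(idx, d.index.tolist())
0 [0, 1]
1 [2, 3]
x[0] works for the first group because the label 0 exists there, but fails for the second group, whose labels are [2, 3].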

Find the minimum value of a column greater than another column value in Python Pandas

I'm working in Python. I have two dataframes df1 and df2:
d1 = {'timestamp1': [88148 , 5617900, 5622548, 5645748, 6603950, 6666502], 'col01': [1, 2, 3, 4, 5, 6]}
df1 = pd.DataFrame(d1)
d2 = {'timestamp2': [5629500, 5643050, 6578800, 6583150, 6611350], 'col02': [7, 8, 9, 10, 11], 'col03': [0, 1, 0, 0, 1]}
df2 = pd.DataFrame(d2)
I want to create a new column in df1 containing the minimum df2 timestamp that is greater than the current df1 timestamp, considering only rows where df2['col03'] is zero. This is the way I did it:
df1['colnew'] = np.nan
TSs = df1['timestamp1']
for TS in TSs:
    values = df2['timestamp2'][(df2['timestamp2'] > TS) & (df2['col03'] == 0)]
    if not values.empty:
        df1.loc[df1['timestamp1'] == TS, 'colnew'] = values.iloc[0]
It works, but I'd prefer not to use a for loop. Is there a better way to do this?
Use pandas.merge_asof with a forward direction:
pd.merge_asof(
    df1, df2.loc[df2.col03 == 0, ['timestamp2']],
    left_on='timestamp1', right_on='timestamp2', direction='forward'
).rename(columns=dict(timestamp2='colnew'))
   col01  timestamp1     colnew
0      1       88148  5629500.0
1      2     5617900  5629500.0
2      3     5622548  5629500.0
3      4     5645748  6578800.0
4      5     6603950        NaN
5      6     6666502        NaN
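One subtlety worth noting: the original loop uses a strict >, while merge_asof also accepts exact timestamp matches by default. Passing allow_exact_matches=False reproduces the strict comparison (it makes no difference on this sample data):
pd.merge_asof(
    df1, df2.loc[df2.col03 == 0, ['timestamp2']],
    left_on='timestamp1', right_on='timestamp2',
    direction='forward', allow_exact_matches=False
).rename(columns=dict(timestamp2='colnew'))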
Give the apply method a try.
def func(x):
    values = df2['timestamp2'][(df2['timestamp2'] > x) & (df2['col03'] == 0)]
    if not values.empty:
        return values.iloc[0]
    else:
        return np.nan

df1["timestamp1"].apply(func)
You can create a separate function to do what has to be done.
The output is your new column
0    5629500.0
1    5629500.0
2    5629500.0
3    6578800.0
4          NaN
5          NaN
Name: timestamp1, dtype: float64
It is not a one-line solution, but it helps keep things organised.
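To attach the result as the new column, assign it back:
df1['colnew'] = df1['timestamp1'].apply(func)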
