import pandas as pd
data = {'col': ['a11 aaa','a121 aaaa','a3333 adfdf']}
df = pd.DataFrame(data)
I want to set the index to something similar to ['a11', 'a121', 'a3333']:
print(df)
Option 1
import pandas as pd
data = {'col': ['a11 aaa','a121 aaaa','a3333 adfdf']}
df = pd.DataFrame(data)
df.index = df['col'].apply(lambda line: line.split(' ')[0]).tolist()
df
Option 2
import pandas as pd
data = {'col': ['a11 aaa','a121 aaaa','a3333 adfdf']}
indexes = list(map(lambda line: line.split(' ')[0], data['col']))
df = pd.DataFrame(data, index=indexes)
df
Output
| col
a11 | a11 aaa
a121 | a121 aaaa
a3333 | a3333 adfdf
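A third option, as a sketch: the same split can be done with pandas' vectorized string methods, avoiding the Python-level lambda entirely:

```python
import pandas as pd

data = {'col': ['a11 aaa', 'a121 aaaa', 'a3333 adfdf']}
df = pd.DataFrame(data)

# Split each string on whitespace and keep the first token as the index
df.index = df['col'].str.split().str[0]
df.index.name = None  # drop the index name inherited from 'col'
print(df)
```

This prints the same table as the two options above.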
Related
import pandas as pd
import numpy as np
column_names = [str(x) for x in range(1,4)]
df = pd.DataFrame(columns=column_names)
new_row = []
for i in range(3):
    new_row.append(i)
df = df.append(new_row, ignore_index=True)
print(df)
output:
1 2 3 0
0 NaN NaN NaN 0.0
1 NaN NaN NaN 1.0
2 NaN NaN NaN 2.0
Is there a way to apply the loop to column 1, column 2, and column 3?
I think this should be possible with simple code, but I've been thinking about it a lot and can't work out how.
I also tried .loc, but I couldn't apply the loop across the columns.
This is a supplementary explanation.
'column_names = [str(x) for x in range(1,4)]' creates columns '1' through '3'.
A loop should be applied to each column.
The "for" loop inserts 0 through 2 into column 1.
Therefore, 0, 1, 2 become the rows of column 1, and likewise for the other columns.
The result I want is below.
You can add the following code after all of your code above:
for col in df:
    df[col] = new_row
Result:
If you first run all of your code from above:
column_names = [str(x) for x in range(1,4)]
df = pd.DataFrame(columns=column_names)
new_row = []
for i in range(3):
    new_row.append(i)
df = df.append(new_row, ignore_index=True)
Then run the code here:
for col in df:
    df[col] = new_row
You should get:
print(df)
1 2 3 0
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
I know it's weird but you can use .loc to do that:
df.loc[len(df.index)+1] = new_row
>>> df
1 2 3
1 0 1 2
You can also use the column names, for example:
for col in column_names:
    df[col] = new_row
Assign the new row to the next index position in the dataframe using .loc.
import pandas as pd
import numpy as np
column_names = [str(x) for x in range(1,4)]
df = pd.DataFrame(columns=column_names)
new_row = []
for i in range(3):
    new_row.append(i)
df.loc[len(df)] = new_row
If you have multiple rows to add in a loop, len(df) in the .loc statement will ensure they're always added to the end.
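A quick sketch of that point (the column names just mirror the example above): appending several rows in a loop with df.loc[len(df)] grows the frame one row at a time:

```python
import pandas as pd

column_names = [str(x) for x in range(1, 4)]
df = pd.DataFrame(columns=column_names)

# Each iteration writes to index len(df), i.e. the current end of the frame
for start in range(3):
    df.loc[len(df)] = [start + j for j in range(3)]

print(df)
```

After the loop, row 0 holds [0, 1, 2], row 1 holds [1, 2, 3], and row 2 holds [2, 3, 4].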
Not 100% sure what you are trying to do - can you rephrase?
import pandas as pd
column_names = [str(x) for x in range(1,4)]
df = pd.DataFrame(columns=column_names)
new_row = []
for i in range(len(df.columns)):
    new_row.append(i)
df = df.append(new_row, ignore_index=True)
for i in df:
    df[i] = new_row
print(df)
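One caveat worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the append calls above need pd.concat instead. A sketch of the single-row case (unlike the list-append in the question, this keys the values to the named columns, so no stray 0 column appears):

```python
import pandas as pd

column_names = [str(x) for x in range(1, 4)]
df = pd.DataFrame(columns=column_names)

new_row = [i for i in range(3)]

# pd.concat replaces df.append: wrap the row in a one-row DataFrame first
row_df = pd.DataFrame([new_row], columns=column_names)
df = pd.concat([df, row_df], ignore_index=True)

print(df)
```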
I have data in an Excel file, loaded into a DataFrame df, that holds aggregated counts per ID. I want to break each aggregate down into its distinct count and create a new record for each unit.
Data
A B C
2 3 1
Desired
count ID
1 A01
1 A02
1 B01
1 B02
1 B03
1 C01
Doing:
import pandas as pd
from numpy.random import randint
df = pd.DataFrame(columns=['A', 'B', 'C'])
for i in range(5):
    df.loc[i] = ['ID' + str(i)] + list(randint(10, size=2))
I thought I could go about it this way; however, it does not stack all the necessary IDs consecutively.
Any suggestion or advice will be appreciated.
Let's try melt to reshape the data, reindex + repeat to duplicate the rows, and groupby + cumcount + zfill to create the suffixes:
import pandas as pd
df = pd.DataFrame({'A': {0: 2}, 'B': {0: 3}, 'C': {0: 1}})
# Melt Table Into New Form
df = df.melt(col_level=0, value_name='count', var_name='ID')
# Repeat Based on Count
df = df.reindex(df.index.repeat(df['count']))
# Set Count To 1
df['count'] = 1
# Add Suffix to Each ID
df['ID'] = df['ID'] + (
    (df.groupby('ID').cumcount() + 1)
    .astype(str)
    .str.zfill(2)
)
# Reorder Columns
df = df[['count', 'ID']]
print(df)
df:
count ID
0 1 A01
0 1 A02
1 1 B01
1 1 B02
1 1 B03
2 1 C01
Do you want this?
df = (pd.DataFrame([[f"{k}{str(i+1).zfill(2)}" for i in range(v)]
                    for k, v in df.to_dict('records')[0].items()])
        .stack()
        .reset_index(drop=True)
        .to_frame()
        .rename(columns={0: 'ID'}))
df['count'] = 1
Another option:
import numpy as np
df = df.melt()
new_df = (pd.DataFrame(np.repeat(df.variable, df.value))
          .assign(count=1))
new_df.variable = new_df.variable + (new_df.groupby('variable').cumcount() + 1).astype(str).str.zfill(2)
I have two text columns, A and B. I want to take the first non-empty string, or, if both A and B have values, take the value from A. C is the column I'm trying to create:
import pandas as pd
cols = ['A','B']
data = [['data', 'data'],
        ['', 'data'],
        ['', ''],
        ['data1', 'data2']]
df = pd.DataFrame.from_records(data=data, columns=cols)
A B
0 data data
1 data
2
3 data1 data2
My attempt:
df['C'] = df[cols].apply(lambda row: sorted([val if val else '' for val in row], reverse=True)[0], axis=1) #Reverse sort to avoid picking an empty string
A B C
0 data data data
1 data data
2
3 data1 data2 data2 #I want data1 here
Expected output:
A B C
0 data data data
1 data data
2
3 data1 data2 data1
I think I want the pandas equivalent of SQL coalesce.
You can also use numpy.where:
In [1022]: import numpy as np
In [1023]: df['C'] = np.where(df['A'].eq(''), df['B'], df['A'])
In [1024]: df
Out[1024]:
A B C
0 data data data
1 data data
2
3 data1 data2 data1
Let's try idxmax + lookup:
df['C'] = df.lookup(df.index, df.ne('').idxmax(1))
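Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; on current versions the same idxmax-based pick can be sketched with NumPy fancy indexing:

```python
import pandas as pd
import numpy as np

cols = ['A', 'B']
data = [['data', 'data'], ['', 'data'], ['', ''], ['data1', 'data2']]
df = pd.DataFrame.from_records(data=data, columns=cols)

# idxmax(axis=1) on the non-empty mask picks 'A' when it has a value, else 'B'
picked = df[cols].ne('').idxmax(axis=1)
df['C'] = df[cols].to_numpy()[np.arange(len(df)), df.columns.get_indexer(picked)]
print(df)
```

When a row is all-empty, idxmax falls back to the first column, so C correctly stays empty there.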
Alternatively you can use Series.where:
df['C'] = df['A'].where(lambda x: x.ne(''), df['B'])
A B C
0 data data data
1 data data
2
3 data1 data2 data1
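On the asker's note about SQL coalesce: the closest built-in analogue is Series.combine_first, which treats NaN (not empty strings) as missing, so one sketch is to map empty strings to NaN first and restore them afterwards:

```python
import pandas as pd
import numpy as np

cols = ['A', 'B']
data = [['data', 'data'], ['', 'data'], ['', ''], ['data1', 'data2']]
df = pd.DataFrame.from_records(data=data, columns=cols)

# Treat '' as missing, coalesce A with B, then restore '' for all-empty rows
df['C'] = (df['A'].replace('', np.nan)
                  .combine_first(df['B'].replace('', np.nan))
                  .fillna(''))
print(df)
```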
I have a Python dictionary:
adict = {
    'col1': [
        {'id': 1, 'tag': '#one#two'},
        {'id': 2, 'tag': '#two#'},
        {'id': 1, 'tag': '#one#three#'}
    ]
}
I want the result as follows:
Id tag
1 one,two,three
2 two
Could someone please tell me how to do this?
Try this
import pandas as pd

d = {'col1': [{'id': 1, 'tag': '#one#two'},
              {'id': 2, 'tag': '#two#'},
              {'id': 1, 'tag': '#one#three#'}]}
df = pd.DataFrame()
for i in d:
    for k in d[i]:
        t = pd.DataFrame.from_dict(k, orient='index').T
        t["tag"] = t["tag"].str.replace("#", ",")
        df = pd.concat([df, t])
tf = df.groupby(["id"])["tag"].apply(lambda x: ",".join(set(''.join(list(x)).strip(",").split(","))))
Here is some simple code:
import pandas as pd
d = {'col1':[{'id':1,'tag':'#one#two'},{'id':2,'tag':'#two#'},{'id':1,'tag':'#one#three#'}]}
df = pd.DataFrame(d)
df['Id'] = df.col1.apply(lambda x: x['id'])
df['tag'] = df.col1.apply(lambda x: ''.join(list(','.join(x['tag'].split('#')))[1:]))
df.drop(columns = 'col1', inplace = True)
Output:
   Id         tag
0   1     one,two
1   2        two,
2   1  one,three,
If the order of the tags is important, first remove the trailing # and split by #, then remove duplicates per group and join:
df = pd.DataFrame(d['col1'])
df['tag'] = df['tag'].str.strip('#').str.split('#')
f = lambda x: ','.join(dict.fromkeys([z for y in x for z in y]).keys())
df = df.groupby('id')['tag'].apply(f).reset_index()
print (df)
id tag
0 1 one,two,three
1 2 two
If the order of the tags is not important, use sets to remove duplicates:
df = pd.DataFrame(d['col1'])
df['tag'] = df['tag'].str.strip('#').str.split('#')
f = lambda x: ','.join(set([z for y in x for z in y]))
df = df.groupby('id')['tag'].apply(f).reset_index()
print (df)
id tag
0 1 three,one,two
1 2 two
I tried it as below:
import pandas as pd
a = {'col1': [{'id': 1, 'tag': '#one#two'},
              {'id': 2, 'tag': '#two#'},
              {'id': 1, 'tag': '#one#three#'}]}
df = pd.DataFrame(a)
df[["col1", "col2"]] = pd.DataFrame(df.col1.values.tolist(), index=df.index)
df['col1'] = df.col1.str.replace('#', ',')
df = df.groupby(["col2"])["col1"].apply(lambda x: ",".join(set(''.join(list(x)).strip(",").split(","))))
Output:
col2
1 one,two,three
2 two
import pandas as pd

dic = [{'col1': [{'id': 1, 'tag': '#one#two'},
                 {'id': 2, 'tag': '#two#'},
                 {'id': 1, 'tag': '#one#three#'}]}]
row = []
for key in dic:
    data = key['col1']
    for rows in data:
        row.append(rows)
df = pd.DataFrame(row)
print(df)
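As a sketch of yet another route, pandas.json_normalize flattens the list of dicts directly, after which the grouping idea from the answers above applies:

```python
import pandas as pd

adict = {'col1': [{'id': 1, 'tag': '#one#two'},
                  {'id': 2, 'tag': '#two#'},
                  {'id': 1, 'tag': '#one#three#'}]}

# json_normalize turns the list of dicts straight into id/tag columns
df = pd.json_normalize(adict['col1'])
df['tag'] = df['tag'].str.strip('#').str.split('#')

# Per id, flatten the tag lists and drop duplicates while keeping order
f = lambda lists: ','.join(dict.fromkeys(t for tags in lists for t in tags))
out = df.groupby('id')['tag'].apply(f).reset_index()
print(out)
```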
I have a dataframe that looks like:
import pandas as pd
import random
d = {'ID': ["x1", "x2", "x1"],
     'CUSIP': ['a', 'b', "#NULL"],
     'ISIN': ["#NULL", "#NULL", 'I']}
df = pd.DataFrame(data=d)
df
I am trying to replace each '#NULL' with a unique numeric suffix, so the output will look something like:
import pandas as pd
import random
d = {'ID': ["x1", "x2", "x1"],
     'CUSIP': ['a', 'b', "#NULL_1"],
     'ISIN': ["#NULL_2", "#NULL_3", 'I']}
df = pd.DataFrame(data=d)
df
Create a Series with unstack, assign new values at the filtered positions from a range, and reshape back at the end:
s = df.unstack()
m = s == '#NULL'
s.loc[m] = [f'#NULL_{x + 1}' for x in range(m.sum())]
df = s.unstack().T
print (df)
ID CUSIP ISIN
0 x1 a #NULL_2
1 x2 b #NULL_3
2 x1 #NULL_1 I
A simple solution would be to iterate through all the values:
count = 1
for i in range(len(df)):
    for c in df.columns:
        if df.loc[i, c] == "#NULL":
            df.loc[i, c] = "#NULL_" + str(count)
            count += 1
df
CUSIP ID ISIN
0 a x1 #NULL_1
1 b x2 #NULL_2
2 #NULL_3 x1 I
To obtain the other order:
count = 1
for c in df.columns:
    for i in range(len(df)):
        if df.loc[i, c] == "#NULL":
            df.loc[i, c] = "#NULL_" + str(count)
            count += 1
df
CUSIP ID ISIN
0 a x1 #NULL_2
1 b x2 #NULL_3
2 #NULL_1 x1 I
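For completeness, the row-by-row numbering of the first loop can be sketched without an explicit loop: stack() flattens in row-major order, so the same counter trick from the unstack answer applies:

```python
import pandas as pd

d = {'ID': ["x1", "x2", "x1"],
     'CUSIP': ['a', 'b', "#NULL"],
     'ISIN': ["#NULL", "#NULL", 'I']}
df = pd.DataFrame(data=d)

# stack() flattens row by row, so the running counter follows row order
s = df.stack()
m = s.eq('#NULL')
s[m] = [f'#NULL_{i + 1}' for i in range(m.sum())]
df = s.unstack()
print(df)
```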