How to set the index as characters in pandas - Python

I am trying to create a pandas df as in this post.
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=list('123'))
df
This code gives a frame with the default integer index 0, 1, 2 (and describe() shows the same).
Is there a way to set the name of each row (i.e. the index) in df to 'A', 'B', 'C' instead of 0, 1, 2?

Use df.index:
df.index = ['A', 'B', 'C']
print(df)
1 2 3
A 0 1 2
B 3 4 5
C 6 7 8
A more scalable and general solution is a list comprehension:
df.index = [chr(ord('A') + x) for x in df.index]
print(df)
1 2 3
A 0 1 2
B 3 4 5
C 6 7 8

Pass the index parameter to the DataFrame constructor:
df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  index=list('ABC'),
                  columns=list('123'))
print(df)
1 2 3
A 0 1 2
B 3 4 5
C 6 7 8
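If the frame already exists, DataFrame.rename with an index mapping is another option; a minimal sketch (the explicit mapping is illustrative):
# map each existing integer label to a letter; rename returns a new frame
df = df.rename(index={0: 'A', 1: 'B', 2: 'C'})
print(df)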

Related

Pandas values transfer from one df to another OverflowError

Is there a pandas way to copy values into column 'column_to_fill' from another df without iterating? The row and column indexes I need are stored in df_1's columns, and I need to fill df_1['column_to_fill'] with values from df_2.
df1 = pd.DataFrame(columns=['row_df2', 'column_df2'])
df1['row_df2'] = [1, 3, 5]
df1['column_df2'] = ['a', 'c', 'd']
index = np.arange(6)
columns = ['a', 'b', 'c', 'd']
df2 = pd.DataFrame(data=np.random.randint(10, size=(len(index), len(columns))),
                   index=index, columns=columns)
df1['column_to_fill'] = 0
for idx in df1.index:
    df1.loc[idx, 'column_to_fill'] = df2.loc[df1.loc[idx, 'row_df2'],
                                             df1.loc[idx, 'column_df2']].sum()
df1
row_df2 column_df2
0 1 a
1 3 c
2 5 d
df2
a b c d
0 2 3 5 2
1 8 3 9 3
2 4 6 0 1
3 3 8 0 8
4 3 4 5 0
5 2 5 4 0
df1
row_df2 column_df2 column_to_fill
0 1 a 8
1 3 c 0
2 5 d 0
I think you want to pick values from df_2 based on the row and column combinations stored in df_1 and assign them to a df_1 column. If that is the case, then check below.
df_1 = pd.DataFrame({'values_type_rows_df2': [0, 1, 0, 1], 1: [4, 5, 6, 7]})
df_2 = pd.DataFrame({0: ['a', 'b', 'c', 'd'], 1: ['e', 'a', 'b', 'c']})
df_1['column_to_fill'] = [df_2.loc[i, i] for i in df_1['values_type_rows_df2']]
Based on your modification of the question, here is the modified code:
df1['column_to_fill'] = [df2.loc[j['row_df2'], j['column_df2']]
                         for i, j in df1.loc[:, ['row_df2', 'column_df2']].iterrows()]
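On large frames a vectorized lookup is much faster than iterrows; here is a sketch of the same fill using the question's df1/df2, stacking df2 into a Series keyed by (row, column) pairs and reindexing it (note the old DataFrame.lookup idiom has been removed from pandas):
# build (row, column) pairs from df1 and look them up in the stacked df2
idx = pd.MultiIndex.from_frame(df1[['row_df2', 'column_df2']])
df1['column_to_fill'] = df2.stack().reindex(idx).to_numpy()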

Pandas row value string parsing (mixed string and float) [duplicate]

I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
then split again by '=', but this seems not so efficient when I have many rows, and especially when there are so many fields that they cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract, and in the end concat the 'ID' back. This assumes you always have 'A=', 'B=', and 'C=' in a line.
pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d);B=(?P<B>\d);C=(?P<C>\d)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
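Note that \d matches a single digit; if the values can be longer, \d+ is the safer pattern:
df['INFO'].str.extract(r'A=(?P<A>\d+);B=(?P<B>\d+);C=(?P<C>\d+)')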
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2' then we can split on ';' and partition on '='. pivot in the end to get to your desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
   .str.split(';', expand=True)
   .stack()
   .str.partition('=')
   .reset_index(-1, drop=True)
   .pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
Iterating over a Series is much faster than iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict(x.split('=') for x in t.split(';')) for t in df['INFO']],
             index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting the dataframe columns twice.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
Another solution is Series.str.findall to extract the values, then apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop(columns="INFO")
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
                   [2, "A=3;B=4;C=1"],
                   [3, "A=1;B=3;C=2"]],
                  columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop("INFO", 1)
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
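A related option, not shown above, is Series.str.extractall, which captures the keys together with the values and so avoids hard-coding the column names; a sketch assuming KEY=VALUE pairs:
# extract every key=value pair, then pivot the keys into columns
pairs = df['INFO'].str.extractall(r'(?P<key>\w+)=(?P<value>\d+)')
wide = pairs.reset_index(level='match', drop=True).pivot(columns='key', values='value')
df = df.drop(columns='INFO').join(wide)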
Another solution: split on ';', explode, then split again on '=' and pivot:
df_INFO = (df.INFO
             .str.split(';')
             .explode()
             .str.split('=', expand=True)
             .pivot(columns=0, values=1)
           )
pd.concat([df.ID, df_INFO], axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2

How to slice a pandas dataframe based on the position after a specified row

The following code:
import pandas as pd
df = pd.DataFrame({'col':['A', '1', '2', '3', 'B', '4', '5', 'C', '7', '8', '10']})
Produces the following dataframe:
col
0 A
1 1
2 2
3 3
4 B
5 4
6 5
7 C
8 7
9 8
10 10
I would like to come up with a good, pandas-friendly way of slicing the dataframe based on the occurrence of the letters 'A', 'B' or 'C'. The expected result is as follows:
col col2
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
8 7 C
9 8 C
10 10 C
How can I achieve this?
Create a mask that finds the rows to split on. To build col2, boolean-index with the mask, reindex to the full original index, and forward-fill the missing values. For col1, copy the original col. Then build the final df and filter it with the negation of the mask.
mask = df['col'].isin(['A', 'B', 'C']) # could use df['col'].str.isalpha() also
col2 = df['col'][mask].reindex(df.index).ffill()
col1 = df['col']
df = pd.DataFrame({'col1':col1, 'col2':col2})[~mask]
Result (df):
col1 col2
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
8 7 C
9 8 C
10 10 C
Just find the string/substring you are looking for in the column, then explode and ffill it. You can then keep only the rows where col and col2 differ.
df['col2'] = df['col'].str.findall('A|B|C').explode().ffill()
df[df['col']!=df['col2']]
col col2
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
8 7 C
9 8 C
10 10 C
One way is to form a mask with isalpha to mark the letters, and group each letter with "its digits" via cumsum. Transforming with "first" then gives col2, except that the letter rows repeat the letter themselves; those are dropped with the mask formed first:
mask = df.col.str.isalpha()
grouper = mask.cumsum()
new_df = df.assign(col2=df.groupby(grouper)['col'].transform("first"))[~mask]
to get
>>> new_df
col col2
1 1 A
2 2 A
3 3 A
5 4 B
6 5 B
8 7 C
9 8 C
10 10 C
Just needs a clever way of slicing:
Find all of the letters
Mask anything that is not a letter as NaN
Forward fill the column
Remove the original rows that were letters
is_letters = df["col"].str.isalpha()
new_df = df.where(is_letters).ffill().loc[~is_letters]
print(new_df)
col
1 A
2 A
3 A
5 B
6 B
8 C
9 C
10 C
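This keeps only the letters, though; to retain the original numbers alongside them, the same mask can feed an assign; a small sketch:
# col keeps the numbers, col2 carries the forward-filled letters
new_df = df.assign(col2=df['col'].where(is_letters).ffill()).loc[~is_letters]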

Getting the total for some columns (independently) in a data frame with python [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just sum and set param axis=1 to sum across the rows; pass numeric_only=True so non-numeric columns are ignored (older pandas versions skipped them silently):
In [91]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})
df['e'] = df.sum(axis=1, numeric_only=True)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1, numeric_only=True)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
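Rather than building the list by removal, you can also select exactly the columns you want to add up; a one-line sketch with the question's columns:
df['e'] = df[['a', 'b', 'd']].sum(axis=1)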
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up:
df['total'] = df.loc[:, list_name].sum(axis=1)
To restrict the sum to certain rows, replace the ':' with a row selection.
This is a simpler way, using iloc to select which columns to sum:
df['f'] = df.iloc[:, 0:2].sum(axis=1)
df['g'] = df.iloc[:, [0, 1]].sum(axis=1)
df['h'] = df.iloc[:, [0, 3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I couldn't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
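One combination that does work is numpy's np.r_ helper, which concatenates slices and scalars into a single position array; a sketch:
import numpy as np
# np.r_[0:2, 3] evaluates to [0, 1, 3], i.e. columns a, b and d
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)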
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe awards_frame with columns award_1, award_2 and award_3, and I want to create a new column that shows the sum of awards for each row.
Usage:
I simply pass my awards_frame into the function, specifying the name of the new column and a list of the column names to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1', 'award_2', 'award_3'])
The following syntax helped me when my columns are in sequence:
awards_frame.values[:, 1:4].sum(axis=1)
You can use the function agg (short for aggregate):
df[['a', 'b', 'd']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use df.eval, which returns a new frame with the added column:
df.eval('e = a + b + d')

Replace values in pandas datatable if in list

How can I replace the values in the DataFrame data with the corresponding entries of fillist whenever a value is in varlist?
import pandas as pd
data = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 10]})
varlist = (5,7,9,10)
fillist = ('a', 'b', 'c', 'd')
data[data.isin(varlist)==True] = 'is in varlist!'
Returns data as:
A B
0 is in varlist! 1
1 6 2
2 3 3
3 4 is in varlist!
But I want:
A B
0 a 1
1 6 2
2 3 3
3 4 d
Use the replace method of the dataframe; it returns a new frame, so assign the result back:
replace_map = dict(zip(varlist, fillist))
data = data.replace(replace_map)
this gives
A B
0 a 1
1 6 2
2 3 3
3 4 d
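If the replacement should be restricted to a single column rather than the whole frame, apply the same mapping per column; a sketch:
# only column 'B' is touched; everything else stays as-is
data['B'] = data['B'].replace(replace_map)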
See the DataFrame.replace documentation in case you want to use it in a different way.
