I have a dataframe df1 that looks like this:
Sample_names esv0 esv1 esv2 ... esv918 esv919 esv920 esv921
0 pr1gluc8NH1 2.1 3.5 6222 ... 0 0 0 0
1 pr1gluc8NH2 3189.0 75.0 9045 ... 0 0 0 0
2 pr1gluc8NHCR1 0.0 2152.0 12217 ... 0 0 0 0
3 pr1gluc8NHCR2 0.0 17411.0 1315 ... 0 1 0 0
4 pr1sdm8NH1 365.0 7.0 4117 ... 0 0 0 0
5 pr1sdm8NH2 4657.0 18.0 13520 ... 0 0 0 0
6 pr1sdm8NHCR1 0.0 139.0 3451 ... 0 0 0 0
7 pr1sdm8NHCR2 1130.0 1439.0 4163 ... 0 0 0 0
I want to perform some operations on the rows and replace them, via a for loop.
for i in range(len(df1)):
    x = df1.iloc[i].values  # gets all the values corresponding to the i-th row
    x = np.vstack(x[1:]).astype(float)  # converts object dtype to a regular 2D array for all row elements except the first, which is a string
    x = x / np.sum(x)  # normalize to 1
    df1.iloc[i, 1:] = x  # this is the step that should replace part of the old row with the new array
But with this I get the error "ValueError: Must have equal len keys and value when setting with an ndarray". x does have the same length as each row of df1 minus 1 (I don't want to replace the first column, Sample_names).
I also tried df1 = df1.replace(df1.iloc[i, 1:], x). This gives TypeError: value argument must be scalar, dict, or Series.
I would appreciate any ideas for how to do this.
Thanks.
You need to reshape the x array: np.vstack gives it shape (n, 1), where n is the number of your esv-like columns, while the assignment expects a flat array of length n. Change the line
df1.iloc[i, 1:] = x
to
df1.iloc[i, 1:] = x.squeeze()
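For what it's worth, the loop can also be avoided with a vectorized division; a minimal sketch, assuming the layout shown above (Sample_names first, only numeric esv columns after it):
import pandas as pd
# divide every esv column by its row sum in one step
vals = df1.iloc[:, 1:].astype(float)
df1.iloc[:, 1:] = vals.div(vals.sum(axis=1), axis=0)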
I have a dataframe like the one below. The column Mfr Number is a categorical data type. I'd like to perform get_dummies or one-hot encoding on it, but instead of filling the new columns with a 1 where the row matches, I want them to take the value from the quantity column. All the other new 'dummies' should remain 0 on that row. Is this possible?
Datetime Mfr Number quantity
0 2016-03-15 07:02:00 MWS0460MB 1
1 2016-03-15 07:03:00 TM-120-6X 3
2 2016-03-15 08:33:00 40.50699.0095 5
3 2016-03-15 08:42:00 40.50699.0100 1
4 2016-03-15 08:46:00 CXS-04T098-00-0703R-1025 10
Do it in two steps:
dummies = pd.get_dummies(df['Mfr Number'])
dummies.values[dummies != 0] = df['quantity']
Check with str.get_dummies and mul:
df['Mfr Number'].str.get_dummies().mul(df['quantity'], 0)
40.50699.0095 40.50699.0100 ... MWS0460MB TM-120-6X
0 0 0 ... 1 0
1 0 0 ... 0 3
2 5 0 ... 0 0
3 0 1 ... 0 0
4 0 0 ... 0 0
[5 rows x 5 columns]
df = pd.get_dummies(df, columns=['Mfr Number'])
for col in df.columns[2:]:
    df[col] = df[col] * df['quantity']
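All three approaches amount to multiplying the dummy matrix by quantity row-wise. A self-contained sketch of that idea (the values here just mirror the example above):
import pandas as pd
df = pd.DataFrame({'Mfr Number': ['MWS0460MB', 'TM-120-6X'],
                   'quantity': [1, 3]})
# each dummy column is multiplied by the row's quantity instead of keeping the 1
out = pd.get_dummies(df['Mfr Number']).mul(df['quantity'], axis=0)
print(out)
#    MWS0460MB  TM-120-6X
# 0          1          0
# 1          0          3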
I want to create a column in a pandas dataframe that adds up the values of the other columns (which are 0s or 1s). The column is called "sum".
The head of my pandas dataframe looks like:
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 0.0 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 0.0 0 0 0 .... 0 0 0
~00pr 0 0.0 0.0 0 0 0 .... 0 0 0
~00te 0 0.0 0.0 0 0 1 .... 0 0 1
Expected result (assuming there would be no more columns):
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 nan 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 0.0 0 0 0 .... 0 0 0
~00pr 0 0.0 0.0 0 0 0 .... 0 0 0
~00te 0 0.0 2 0 0 1 .... 0 0 1
As you see, the values of 'sum' are kept at 0 even though there are 1s in some columns.
What am I doing wrong?
The basics of the code are:
theMatrix=pd.DataFrame([datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['sum'] = 0
So far so good. Then I add all the values with loc, and then I want to add up the values with:
theMatrix.fillna(0, inplace=True)
# this being the key line:
theMatrix['sum'] = theMatrix.sum(axis=1)
theMatrix.sort_index(axis=0, ascending=True, inplace=True)
As you see in the result above, the sum remains 0.
I had a look at similar questions and at the pandas documentation, to no avail.
Actually, the expression:
theMatrix['sum'] = theMatrix.sum(axis=1)
is the one I took from there.
Changing this last line to:
theMatrix['sum'] = theMatrix[3:0].sum(axis=1)
in order to avoid summing the first three columns gives as result:
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 nan 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 nan 1 1 0 .... 0 0 0
~00pr 0 0.0 1.0 0 0 0 .... 0 0 1
~00te 0 0.0 0 0 0 0 .... 0 0 0
Please observe two things:
a) in row '~00c' the sum is nan even though there are 1s in that row.
b) before calculating the sum, theMatrix.fillna(0, inplace=True) should have changed every possible nan into 0, so the sum should never be nan, since in theory there are no nan values in any of the columns[3:].
It wouldn't work. Any ideas?
Thanks.
PS: Later edit, just in case you wondered how the dataframe is populated: by reading and parsing an XML, with the lines:
# myDocId being the name of the column
# concept being the index
theMatrix.loc[concept, myDocId] = 1
If I understand correctly, this can help you:
import pandas as pd
import datetime
#create dataframe following your example
theMatrix=pd.DataFrame([datetime.datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['col1'] = 1
theMatrix['col2'] = 1
# create 'sum' column with summed values from certain columns
theMatrix['sum'] = theMatrix['col1'] + theMatrix['col2']
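As an aside, theMatrix[3:0] in the question slices rows, not columns (and 3:0 is an empty slice), so it cannot skip the first three columns. To sum a dynamic set of columns, select them by position first; a minimal sketch, assuming the data columns come after 'Application', 'Ans' and 'sum':
cols_to_sum = theMatrix.columns[3:]  # everything after the first three columns
theMatrix['sum'] = theMatrix[cols_to_sum].sum(axis=1)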
Put whichever columns you want to sum into a list, then use that list to select them before calling sum with axis=1. This will give you the desired outcome. Here is a sample related to your data.
Sample File Data:
Date,a,b,c
bad, bad, bad, bad # Used to simulate your data better
2018-11-19,1,0,0
2018-11-20,1,0,0
2018-11-21,1,0,1
2018-11-23,1,nan,0 # Nan here is just to represent the missing data
2018-11-28,1,0,1
2018-11-30,1,nan,1 # Nan here is just to represent the missing data
2018-12-02,1,0,1
Code:
import pandas as pd
df = pd.read_csv('yourdata.csv')  # your method of loading the data
# cols_to_sum = ['a','b','c']  # the columns you wish to sum
cols_to_sum = df.columns[1:]  # alternate method: select all columns after the first
df = df.fillna(0)  # fills the NaN you were asking about
df['sum'] = df[cols_to_sum][1:].astype(int).sum(axis=1)  # [1:] skips the bad first row
# astype(int) is needed because the bad first row forces the columns to be read as strings; redefining the type here lets you sum appropriately.
print(df)
Output:
Date a b c sum
bad bad bad bad NaN
2018-11-19 1 0 0 1.0
2018-11-20 1 0 0 1.0
2018-11-21 1 0 1 2.0
2018-11-23 1 0 0 1.0
2018-11-28 1 0 1 2.0
2018-11-30 1 0 1 2.0
2018-12-02 1 0 1 2.0
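A possibly cleaner variant is to drop the bad row while reading, so the numeric columns parse directly; a sketch, assuming the sample data above is saved as yourdata.csv:
import pandas as pd
# skiprows=[1] drops the 'bad' line, comment='#' strips the inline notes,
# and skipinitialspace handles the stray spaces after commas
df = pd.read_csv('yourdata.csv', skiprows=[1], comment='#', skipinitialspace=True)
df = df.fillna(0)
df['sum'] = df[df.columns[1:]].sum(axis=1)
print(df)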
I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
['b',1,0,0,0,0,0,'j'],
['c',0,0,1,0,0,0,'k'],
['None',0,0,0,1,0,0,'l'],
['e',0,0,0,0,1,0,'m'],
['f',0,1,0,0,0,0,'n'],
['None',0,0,0,1,0,0,'o'],
['h',0,0,0,0,1,0,'p']]),
columns=[0,1,2,3,4,5,6,7],
index=[0,1,2,3,4,5,6,7])
I need to add up all rows that occur before each 'None' entry and move the aggregated rows to a new dataframe.
Your dataframe's dtypes are messed up because you built it from a single numpy array: an array accepts only one type, so all the ints were pushed to strings. We need to convert them back first.
df = df.apply(pd.to_numeric, errors='ignore')  # convert the string columns back to numeric where possible
df['newkey'] = df[0].eq('None').cumsum()  # use cumsum to create the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))  # then aggregate
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
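To see why the cumsum key works, here is what it produces on the sample data (a sketch; df[0] is the first column):
df[0].eq('None').tolist()           # [False, False, False, True, False, False, True, False]
df[0].eq('None').cumsum().tolist()  # [0, 0, 0, 1, 1, 1, 2, 2]
# once the 'None' rows are filtered out, keys 0, 1 and 2 delimit the three blocks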
You can also specify the agg funcs:
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
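Note that shift().fillna(0) here moves each group boundary one row down, so every 'None' row is counted into the block it closes; that is why column 4 (which is 1 on the 'None' rows) sums to 1 in the first two groups, unlike the previous answer, which drops the 'None' rows before aggregating.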
I have created a DataFrame full of zeros such as:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
...
n 0 0 0
I have a list containing names for the columns in unicode, such as:
list = [u'One', u'Two', u'Three']
The DataFrame of zeroes is known as a, and I am creating a new complete DataFrame with the list as column headers via:
final = pd.DataFrame(a, columns=[list])
However, the resulting DataFrame has column names that are no longer unicode (i.e. they do not show the u'' tag).
I am wondering why this is happening. Thanks!
There is no reason for the unicode to be lost; you can check it with:
print(df.columns.tolist())
Please never use reserved words like list, type, id... as variable names, because they mask the built-in functions. Also, it is necessary to add .values to convert the values to a numpy array:
a = pd.DataFrame(0, columns=range(3), index=range(3))
print (a)
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(a.values, columns=L)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
Without .values, the columns are not aligned and you get all NaNs:
final = pd.DataFrame(a, columns=L)
print (final)
One Two Three
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
I think the simplest is to use only the index of the a DataFrame, if all values are 0:
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(0, columns=L, index=a.index)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
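If you are building the zero-filled frame from scratch anyway, a numpy-based variant keeps the names intact as well; a minimal sketch, with n standing for whatever row count you need:
import numpy as np
import pandas as pd
n = 3
final = pd.DataFrame(np.zeros((n, 3), dtype=int), columns=[u'One', u'Two', u'Three'])
print(final)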
I have a N x 3 DataFrame called A that looks like this:
_Segment _Article Binaire
0 550 5568226 1
1 550 5612047 1
2 550 5909228 1
3 550 5924375 1
4 550 5924456 1
5 550 6096557 1
....
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing values (nan) with zeros:
B[np.isnan(B)]=0
and get:
Binaire \
_Article 2332299 2332329 2332337 2932377 2968223 3195643 3346080
_Segment
550 0 0 0 0 0 0 0
551 0 0 0 0 0 0 0
552 0 0 0 0 0 0 0
553 1 1 1 0 0 0 1
554 0 0 0 1 0 1 0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the sort order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the column _Article:
order_art = A['_Article']
In the pivot, add the "values" argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which would prevent reindex from working properly:
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
Then, as before, replace the nans with zeros:
B[np.isnan(B)]=0
and finally use reindex to restore the original order of variable _Article across columns:
B=B.reindex(columns=order_art)
Are there more elegant solutions?
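One more compact possibility, as a sketch under the same assumptions (chaining the steps, with unique() preserving the original order of _Article and fillna replacing the boolean-indexing step):
B = (A.pivot(index='_Segment', columns='_Article', values='Binaire')
       .reindex(columns=A['_Article'].unique())
       .fillna(0))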