I have a dataframe df1 that looks like this:
Sample_names esv0 esv1 esv2 ... esv918 esv919 esv920 esv921
0 pr1gluc8NH1 2.1 3.5 6222 ... 0 0 0 0
1 pr1gluc8NH2 3189.0 75.0 9045 ... 0 0 0 0
2 pr1gluc8NHCR1 0.0 2152.0 12217 ... 0 0 0 0
3 pr1gluc8NHCR2 0.0 17411.0 1315 ... 0 1 0 0
4 pr1sdm8NH1 365.0 7.0 4117 ... 0 0 0 0
5 pr1sdm8NH2 4657.0 18.0 13520 ... 0 0 0 0
6 pr1sdm8NHCR1 0.0 139.0 3451 ... 0 0 0 0
7 pr1sdm8NHCR2 1130.0 1439.0 4163 ... 0 0 0 0
I want to perform some operations on the rows and replace them, via a for loop.
for i in range(len(df1)):
    x = df1.iloc[i].values  # gets all the values corresponding to the i-th row
    x = np.vstack(x[1:]).astype(float)  # converts object dtype to a regular 2D array for all row elements except the first, which is a string
    x = x / np.sum(x)  # normalize to 1
    df1.iloc[i, 1:] = x  # this is the step that should replace part of the old row with the new array
But with this I get the error "ValueError: Must have equal len keys and value when setting with an ndarray". x does have the same length as each row of df1 minus 1 (I don't want to replace the first column, Sample_names).
I also tried df1 = df1.replace(df1.iloc[i, 1:], x). This gives TypeError: value argument must be scalar, dict, or Series.
I would appreciate any ideas for how to do this.
Thanks.
You need to reshape the x array: np.vstack gives it shape (n, 1), where n is the number of your esv-like columns, while the assignment expects a flat array of length n. Change the line
df1.iloc[i, 1:] = x
to
df1.iloc[i, 1:] = x.squeeze()
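For what it's worth, the loop can also be avoided with a vectorized division; a minimal sketch, assuming the layout shown above (Sample_names first, only numeric esv columns after it):
import pandas as pd
# divide every esv column by its row sum in one step
vals = df1.iloc[:, 1:].astype(float)
df1.iloc[:, 1:] = vals.div(vals.sum(axis=1), axis=0)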
I have a dataframe like the one below. The column Mfr Number is a categorical data type. I'd like to perform get_dummies or one-hot encoding on it, but instead of filling the new columns with a 1 where the row matches, I want them to take the value from the quantity column. All the other new 'dummies' should remain 0 on that row. Is this possible?
Datetime Mfr Number quantity
0 2016-03-15 07:02:00 MWS0460MB 1
1 2016-03-15 07:03:00 TM-120-6X 3
2 2016-03-15 08:33:00 40.50699.0095 5
3 2016-03-15 08:42:00 40.50699.0100 1
4 2016-03-15 08:46:00 CXS-04T098-00-0703R-1025 10
Do it in two steps:
dummies = pd.get_dummies(df['Mfr Number'])
dummies.values[dummies != 0] = df['quantity']
Check with str.get_dummies and mul:
df['Mfr Number'].str.get_dummies().mul(df['quantity'], 0)
40.50699.0095 40.50699.0100 ... MWS0460MB TM-120-6X
0 0 0 ... 1 0
1 0 0 ... 0 3
2 5 0 ... 0 0
3 0 1 ... 0 0
4 0 0 ... 0 0
[5 rows x 5 columns]
df = pd.get_dummies(df, columns=['Mfr Number'])
for col in df.columns[2:]:
    df[col] = df[col] * df['quantity']
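All three approaches amount to multiplying the dummy matrix by quantity row-wise. A self-contained sketch of that idea (the values here just mirror the example above):
import pandas as pd
df = pd.DataFrame({'Mfr Number': ['MWS0460MB', 'TM-120-6X'],
                   'quantity': [1, 3]})
# each dummy column is multiplied by the row's quantity instead of keeping the 1
out = pd.get_dummies(df['Mfr Number']).mul(df['quantity'], axis=0)
print(out)
#    MWS0460MB  TM-120-6X
# 0          1          0
# 1          0          3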
I want to create a column in a pandas dataframe that adds up the values of the other columns (which are 0s or 1s). The column is called "sum".
The head of my pandas dataframe looks like:
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 0.0 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 0.0 0 0 0 .... 0 0 0
~00pr 0 0.0 0.0 0 0 0 .... 0 0 0
~00te 0 0.0 0.0 0 0 1 .... 0 0 1
Expected result (assuming there would be no more columns):
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 nan 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 0.0 0 0 0 .... 0 0 0
~00pr 0 0.0 0.0 0 0 0 .... 0 0 0
~00te 0 0.0 2 0 0 1 .... 0 0 1
As you see, the values of 'sum' are kept at 0 even though there are 1s in some columns.
What am I doing wrong?
The basics of the code are:
theMatrix=pd.DataFrame([datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['sum'] = 0
So far so good. Then I add all the values with loc, and then I want to add up the values with:
theMatrix.fillna(0, inplace=True)
# this being the key line:
theMatrix['sum'] = theMatrix.sum(axis=1)
theMatrix.sort_index(axis=0, ascending=True, inplace=True)
As you see in the result above, the sum remains 0.
I had a look at similar questions and at the pandas documentation, to no avail.
Actually, the expression:
theMatrix['sum'] = theMatrix.sum(axis=1)
is the one I took from there.
Changing this last line to:
theMatrix['sum'] = theMatrix[3:0].sum(axis=1)
in order to avoid summing the first three columns gives as result:
Application AnsSr sum Col1 Col2 Col3 .... Col(n-2) Col(n-1) Col(n)
date 28-12-11 0.0 nan 28/12/11 .... ...Dates... 28/12/11
~00c 0 0.0 nan 1 1 0 .... 0 0 0
~00pr 0 0.0 1.0 0 0 0 .... 0 0 1
~00te 0 0.0 0 0 0 0 .... 0 0 0
Please observe two things:
a) in row '~00c' the sum is nan even though there are 1s in that row.
b) before calculating the sum, theMatrix.fillna(0, inplace=True) should have changed every possible nan into 0, so the sum should never be nan, since in theory there are no nan values in any of the columns[3:].
It wouldn't work. Any ideas?
Thanks.
PS: Later edit, just in case you wondered how the dataframe is populated: by reading and parsing an XML, with the lines:
# myDocId being the name of the column
# concept being the index
theMatrix.loc[concept, myDocId] = 1
If I understand correctly, this can help you:
import pandas as pd
import datetime
#create dataframe following your example
theMatrix=pd.DataFrame([datetime.datetime.today().strftime('%Y-%m-%d')],['Date'],['Application'])
theMatrix['Ans'] = 0
theMatrix['col1'] = 1
theMatrix['col2'] = 1
# create 'sum' column with summed values from certain columns
theMatrix['sum'] = theMatrix['col1'] + theMatrix['col2']
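As an aside, theMatrix[3:0] in the question slices rows, not columns (and 3:0 is an empty slice), so it cannot skip the first three columns. To sum a dynamic set of columns, select them by position first; a minimal sketch, assuming the data columns come after 'Application', 'Ans' and 'sum':
cols_to_sum = theMatrix.columns[3:]  # everything after the first three columns
theMatrix['sum'] = theMatrix[cols_to_sum].sum(axis=1)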
Put whichever columns you want to sum into a list, then use that list to select them before calling sum with axis=1. This will give you the desired outcome. Here is a sample related to your data.
Sample File Data:
Date,a,b,c
bad, bad, bad, bad # Used to simulate your data better
2018-11-19,1,0,0
2018-11-20,1,0,0
2018-11-21,1,0,1
2018-11-23,1,nan,0 # Nan here is just to represent the missing data
2018-11-28,1,0,1
2018-11-30,1,nan,1 # Nan here is just to represent the missing data
2018-12-02,1,0,1
Code:
import pandas as pd
df = pd.read_csv('yourdata.csv')  # your method of loading the data
# cols_to_sum = ['a','b','c']  # the columns you wish to sum
cols_to_sum = df.columns[1:]  # alternate method: select all columns after the first
df = df.fillna(0)  # fills the NaN you were asking about
df['sum'] = df[cols_to_sum][1:].astype(int).sum(axis=1)  # [1:] skips the bad first row
# astype(int) is needed because the bad first row forces the columns to be read as strings; redefining the type here lets you sum appropriately.
print(df)
Output:
Date a b c sum
bad bad bad bad NaN
2018-11-19 1 0 0 1.0
2018-11-20 1 0 0 1.0
2018-11-21 1 0 1 2.0
2018-11-23 1 0 0 1.0
2018-11-28 1 0 1 2.0
2018-11-30 1 0 1 2.0
2018-12-02 1 0 1 2.0
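A possibly cleaner variant is to drop the bad row while reading, so the numeric columns parse directly; a sketch, assuming the sample data above is saved as yourdata.csv:
import pandas as pd
# skiprows=[1] drops the 'bad' line, comment='#' strips the inline notes,
# and skipinitialspace handles the stray spaces after commas
df = pd.read_csv('yourdata.csv', skiprows=[1], comment='#', skipinitialspace=True)
df = df.fillna(0)
df['sum'] = df[df.columns[1:]].sum(axis=1)
print(df)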
I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
['b',1,0,0,0,0,0,'j'],
['c',0,0,1,0,0,0,'k'],
['None',0,0,0,1,0,0,'l'],
['e',0,0,0,0,1,0,'m'],
['f',0,1,0,0,0,0,'n'],
['None',0,0,0,1,0,0,'o'],
['h',0,0,0,0,1,0,'p']]),
columns=[0,1,2,3,4,5,6,7],
index=[0,1,2,3,4,5,6,7])
I need to add up all rows that occur before each 'None' entry and move the aggregated rows to a new dataframe.
Your dataframe's dtypes are messed up because you built it from a single numpy array: an array accepts only one type, so all the ints were pushed to strings. We need to convert them back first.
df = df.apply(pd.to_numeric, errors='ignore')  # convert the string columns back to numeric where possible
df['newkey'] = df[0].eq('None').cumsum()  # use cumsum to create the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))  # then aggregate
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
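To see why the cumsum key works, here is what it produces on the sample data (a sketch; df[0] is the first column):
df[0].eq('None').tolist()           # [False, False, False, True, False, False, True, False]
df[0].eq('None').cumsum().tolist()  # [0, 0, 0, 1, 1, 1, 2, 2]
# once the 'None' rows are filtered out, keys 0, 1 and 2 delimit the three blocks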
You can also specify the agg funcs:
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
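Note that shift().fillna(0) here moves each group boundary one row down, so every 'None' row is counted into the block it closes; that is why column 4 (which is 1 on the 'None' rows) sums to 1 in the first two groups, unlike the previous answer, which drops the 'None' rows before aggregating.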
I have created a DataFrame full of zeros such as:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
...
n 0 0 0
I have a list containing names for the columns in unicode, such as:
list = [u'One', u'Two', u'Three']
The DataFrame of zeroes is known as a, and I am creating a new complete DataFrame with the list as column headers via:
final = pd.DataFrame(a, columns=[list])
However, the resulting DataFrame has column names that are no longer unicode (i.e. they do not show the u'' tag).
I am wondering why this is happening. Thanks!
There is no reason for the unicode to be lost; you can check it with:
print(df.columns.tolist())
Please never use reserved words like list, type, id... as variable names, because they mask the built-in functions. Also, it is necessary to add .values to convert the values to a numpy array:
a = pd.DataFrame(0, columns=range(3), index=range(3))
print (a)
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(a.values, columns=L)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
Without .values, the columns are not aligned and you get all NaNs:
final = pd.DataFrame(a, columns=L)
print (final)
One Two Three
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
I think the simplest is to use only the index of the a DataFrame, if all values are 0:
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(0, columns=L, index=a.index)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
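If you are building the zero-filled frame from scratch anyway, a numpy-based variant keeps the names intact as well; a minimal sketch, with n standing for whatever row count you need:
import numpy as np
import pandas as pd
n = 3
final = pd.DataFrame(np.zeros((n, 3), dtype=int), columns=[u'One', u'Two', u'Three'])
print(final)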
I have a N x 3 DataFrame called A that looks like this:
_Segment _Article Binaire
0 550 5568226 1
1 550 5612047 1
2 550 5909228 1
3 550 5924375 1
4 550 5924456 1
5 550 6096557 1
....
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing values (nan) with zeros:
B[np.isnan(B)]=0
and get:
Binaire \
_Article 2332299 2332329 2332337 2932377 2968223 3195643 3346080
_Segment
550 0 0 0 0 0 0 0
551 0 0 0 0 0 0 0
552 0 0 0 0 0 0 0
553 1 1 1 0 0 0 1
554 0 0 0 1 0 1 0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the sort order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the column _Article:
order_art = A['_Article']
In the pivot, add the "values" argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which would prevent reindex from working properly:
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
Then, as before, replace the nans with zeros:
B[np.isnan(B)]=0
and finally use reindex to restore the original order of variable _Article across columns:
B=B.reindex(columns=order_art)
Are there more elegant solutions?
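One more compact possibility, as a sketch under the same assumptions (chaining the steps, with unique() preserving the original order of _Article and fillna replacing the boolean-indexing step):
B = (A.pivot(index='_Segment', columns='_Article', values='Binaire')
       .reindex(columns=A['_Article'].unique())
       .fillna(0))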