I want to start with an empty data frame and then add one row to it at a time.
I could even start with a zero-filled data frame, data=pd.DataFrame(np.zeros(shape=(10,2)), columns=["a","b"]), and then replace one row at a time.
How can I do that?
Use .loc for label-based selection. It is important that you understand how to slice properly: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label and why you should avoid chained assignment: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
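For example, a chained assignment such as data['a'][2] = 5 first selects a column and then assigns into the result, which may be a copy that is silently thrown away; a single data.loc[2, 'a'] = 5 call avoids that ambiguity.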
In [14]:
data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"])
data
Out[14]:
a b
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]
In [15]:
data.loc[2:2,'a':'b']=5,6
data
Out[15]:
a b
0 0 0
1 0 0
2 5 6
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]
If you are replacing the entire row then you can just use the row label and do not need row/column slices.
...
data.loc[2]=5,6
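If what you actually want is to grow the frame one row at a time, .loc also supports setting with enlargement: assigning to a label that does not exist yet appends a new row. A minimal sketch (growing a frame in a loop is slow for large data; prefer collecting rows in a list and constructing the frame once):

import pandas as pd

data = pd.DataFrame(columns=["a", "b"])
for i in range(3):
    data.loc[i] = i, i * 2   # new label -> row is appended (setting with enlargement)
print(data)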
I have the following dataframe with multi-level columns
In [1]: data = {('A', '10'):[1,3,0,1],
('A', '20'):[3,2,0,0],
('A', '30'):[0,0,3,0],
('B', '10'):[3,0,0,0],
('B', '20'):[0,5,0,0],
('B', '30'):[0,0,1,0],
('C', '10'):[0,0,0,2],
('C', '20'):[1,0,0,0],
('C', '30'):[0,0,0,0]
}
df = pd.DataFrame(data)
df
Out[1]:
A B C
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0
1 3 2 0 0 5 0 0 0 0
2 0 0 3 0 0 1 0 0 0
3 1 0 0 0 0 0 2 0 0
In a new column results I want to return the combined column name containing the maximum value within each subset (i.e., for each first-level column, the second-level column holding the max).
My desired output should look like the below
Out[2]:
A B C
10 20 30 10 20 30 10 20 30 results
0 1 3 0 3 0 0 0 1 0 A20&B10&C20
1 3 2 0 0 5 0 0 0 0 A10&B20
2 0 0 3 0 0 1 0 0 0 A30&B30
3 1 0 0 0 0 0 2 0 0 A10&C10
For example, the first row:
for column 'A' the max value is under column '20',
for column 'B' there is only one value, under '10',
for column 'C' there is also only one value, under '20',
so the result would be A20&B10&C20.
Edit: replaced "+" with "&" in the results column; apparently I was misunderstood and you thought I wanted the summation, while I need the column names joined by a separator.
Edit2:
The solution provided by @A.B below didn't work for me for some reason, although it works on his side and on the sample data in Google Colab.
Somehow the .idxmax(skipna=True) call is causing a ValueError: No axis named 1 for object type Series.
I found a workaround by transposing the data before this step and transposing it back afterwards.
map_res = lambda x: ",".join(list(filter(None,['' if isinstance(x[a], float) else (x[a][0]+x[a][1]) for a in x.keys()])))
df['results'] = (df.replace(0, np.nan)
                 .T                     # transpose here
                 .groupby(level=0)      # axis=1 no longer needed after the transpose
                 .idxmax(skipna=True)
                 .T                     # transpose back here
                 .apply(map_res, axis=1))
I am still interested to know why it does not work without the transpose, though.
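(My best guess, unverified: a pandas version difference. groupby(..., axis=1) was deprecated in pandas 2.1 and its reductions were reworked, so on newer versions idxmax can end up applied to each group as a Series, which has no axis 1; transposing first keeps everything on axis 0.)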
The idea is to replace 0 with NaN, so that DataFrame.stack removes all rows with NaNs. Then get the indices of the maxima with DataFrameGroupBy.idxmax, map the second and third tuple values with map, and aggregate with join into a new column per the first index level:
df['results'] = (df.replace(0, np.nan)
.stack([0,1])
.groupby(level=[0,1])
.idxmax()
.map(lambda x: f'{x[1]}{x[2]}')
.groupby(level=0)
.agg('&'.join))
print (df)
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 A20&B10&C20
1 3 2 0 0 5 0 0 0 0 A10&B20
2 0 0 3 0 0 1 0 0 0 A30&B30
3 1 0 0 0 0 0 2 0 0 A10&C10
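To see what each step produces on the sample frame, here is a sketch (run it before the results column is added; it assumes the older pandas this answer targets, where stack drops NaN rows by default):

tall = df.replace(0, np.nan).stack([0, 1])      # Series indexed by (row, letter, number), NaNs dropped
print(tall.head(4))
winners = tall.groupby(level=[0, 1]).idxmax()   # full index tuple of each group's max, e.g. (0, 'A', '20')
print(winners.map(lambda x: f'{x[1]}{x[2]}').groupby(level=0).agg('&'.join))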
Try (note: this answers the original "+" version of the question, summing the per-group maxima):
df["results"] = df.groupby(level=0, axis=1).max().sum(1)
print(df)
Prints:
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 7
1 3 2 0 0 5 0 0 0 0 8
2 0 0 3 0 0 1 0 0 0 4
3 1 0 0 0 0 0 2 0 0 3
Group by level 0 with axis=1.
Use idxmax to get the max sub-level indexes as tuples (while skipping NaNs).
Apply a function to rows (axis=1) to concatenate the names.
In the function applied to each row, iterate over the keys/columns and concatenate the column levels. NaNs (which have type float) are replaced with an empty string and filtered out later.
You won't need df.replace(0, np.nan) if you initially have NaNs and let them remain.
map_res = lambda x: ",".join(list(filter(None,['' if isinstance(x[a], float) else (x[a][0]+x[a][1]) for a in x.keys()])))
df['results'] = (df.replace(0, np.nan)
                 .groupby(level=0, axis=1)
                 .idxmax(skipna=True)
                 .apply(map_res, axis=1))
Here's the output:
A B C results
10 20 30 10 20 30 10 20 30
0 1 3 0 3 0 0 0 1 0 A20,B10,C20
1 3 2 0 0 5 0 0 0 0 A10,B20
2 0 0 3 0 0 1 0 0 0 A30,B30
3 1 0 0 0 0 0 2 0 0 A10,C10
I have a csv with data that I want to import into an ndarray so I can manipulate it. The csv data is formatted like this.
u i r c
1 1 5 1
2 2 5 1
3 3 1 0
4 4 1 1
I want to get all the elements with c = 1 in a row, and the ones with c = 0 in another one, like so, reducing the dimensionality.
1 1 1 5 2 2 5 4 4 1
0 3 3 1
However, different u and i can't be in the same column, hence the final result needs zero padding, like this. I want to keep the c variable column, since it represents a categorical variable, so I need to keep its value to make the correspondence between the information and the c value; I don't want to just separate the data according to the value of c.
1 1 1 5 2 2 5 0 0 0 4 4 1
0 0 0 0 0 0 0 3 3 1 0 0 0
So far, I'm reading the .csv file with df = pd.read_csv and creating a multidimensional array/tensor with arr = df.to_numpy(). After that, I'm permuting the order of the columns to make the c column the first one, getting this array [[ 1 1 1 5][ 1 2 2 5][ 0 3 3 1][ 1 4 4 1]].
I then do arr = arr.reshape(2,), since there are two possible values for c, and then delete all but the first c column according to the length of the tuples. So in this case, since there are 4 elements in each tuple and 16 elements in total, I'm doing arr = np.delete(arr, (4,8,12), axis=1).
Finally, I'm doing this to pad the array with zeros when the u doesn't match with both columns.
nomatch = 0
for j in range(1, cols, 3):
if arr[0][j] != arr[1][j]:
nomatch+=1
z = np.zeros(nomatch*3, dtype=arr.dtype)
h1 = np.split(arr, [0][0])
new0 = np.concatenate((arr[0],z))
new1 = np.concatenate((z,arr[1])) # problem
final = np.concatenate((new0, new1))
In the line with the comment, the problem is how to concatenate the arrays while keeping the first element in place. Instead of just appending, I'd like to be able to set a start and end index and patch the zeros only at those indexes. Using concatenate I don't get the expected result, since I'm altering the first element (the head of the array should stay untouched).
Additionally, I can't help but wonder if this is a good way to achieve the end result. For example, I tried to pad the array with np.resize() before reshaping, but it doesn't work: when I print the result the array is unchanged, no matter the dimensions I pass as arguments. A good solution would be one that adapts if there are 3 or more possible values for c, and that could handle multiple c-like columns, such as c1, c2..., that would become rows in the table. I appreciate all input and suggestions in advance.
Here is a compact numpy approach:
asnp = df.to_numpy()
(np.bitwise_xor.outer(np.arange(2),asnp[:,3:])*asnp[:,:3]).reshape(2,-1)
# array([[1, 1, 5, 2, 2, 5, 0, 0, 0, 4, 4, 1],
# [0, 0, 0, 0, 0, 0, 3, 3, 1, 0, 0, 0]])
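To unpack the one-liner, here is the same computation spelled out step by step (a sketch using the sample data from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'u': [1, 2, 3, 4], 'i': [1, 2, 3, 4],
                   'r': [5, 5, 1, 1], 'c': [1, 1, 0, 1]})
asnp = df.to_numpy()
c = asnp[:, 3:]                                # category column, shape (4, 1)
mask = np.bitwise_xor.outer(np.arange(2), c)   # shape (2, 4, 1); row 0 keeps c == 1, row 1 keeps c == 0
spread = mask * asnp[:, :3]                    # broadcasts to (2, 4, 3), zeroing non-matching rows
print(spread.reshape(2, -1))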
UPDATE: multiple categories:
Categories must be the last k columns and have column headers starting with "cat". We create a row for each unique combination of categories; this combination is prepended to the row.
Code:
import numpy as np
import pandas as pd
import itertools as it
def spreadcats(df):
    # number of trailing category columns (headers starting with "cat")
    cut = sum(map(str.startswith, df.columns, it.repeat("cat")))
    data = df.to_numpy()
    # unique category combinations and, per row, which combination it belongs to
    cats, idx = np.unique(data[:, -cut:], axis=0, return_inverse=True)
    m, n, k, _ = data.shape + cats.shape   # m rows, n columns, k combinations
    out = np.zeros((k, cut + (n - cut) * m), int)
    out[:, :cut] = cats                    # prepend the category combination
    # drop each row's data columns into its slot within the matching output row
    out[:, cut:].reshape(k, m, n - cut)[idx, np.arange(m)] = data[:, :-cut]
    return out
x = np.random.randint([1,1,1,0,0],[10,10,10,3,2],(10,5))
df = pd.DataFrame(x,columns=[f"data{i}" for i in "123"] + ["cat1","cat2"])
print(df)
print(spreadcats(df))
Sample run:
data1 data2 data3 cat1 cat2
0 9 5 1 1 1
1 7 4 2 2 0
2 3 9 8 1 0
3 3 9 1 1 0
4 9 1 7 2 1
5 1 3 7 2 0
6 2 8 2 1 0
7 1 4 9 0 1
8 8 7 3 1 1
9 3 6 9 0 1
[[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4 9 0 0 0 3 6 9]
[1 0 0 0 0 0 0 0 3 9 8 3 9 1 0 0 0 0 0 0 2 8 2 0 0 0 0 0 0 0 0 0]
[1 1 9 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 7 3 0 0 0]
[2 0 0 0 0 7 4 2 0 0 0 0 0 0 0 0 0 1 3 7 0 0 0 0 0 0 0 0 0 0 0 0]
[2 1 0 0 0 0 0 0 0 0 0 0 0 0 9 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
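Reading the output: the first cut columns of each row hold the category combination (e.g. [0 1] means cat1=0, cat2=1), followed by m blocks of the n-cut data columns, one block per original row; a row's data lands in the block matching its original position, and blocks belonging to rows with a different category combination stay zero.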
I have a pandas DataFrame with 2 million rows × 10 columns.
I want to delete every row whose values are all zero in every column except a single column (Time).
Example: my DataFrame looks like:
Index Time a b c d e
0 1 0 0 0 0 0
1 2 1 2 0 0 0
2 3 0 0 0 0 0
3 4 5 0 0 0 0
4 5 0 0 0 0 0
5 6 7 0 0 0 0
What I needed:
Index Time a b c d e
0 2 1 2 0 0 0
1 4 5 0 0 0 0
2 6 7 0 0 0 0
My requirements:
Requirement 1:
Leaving out the 1st column (Time), check for zero elements in every row. If all column values are zero, delete that particular row.
Requirement 2:
Finally, I want my index to be updated properly.
What I tried:
I have been looking at this link.
I understood the logic used but I wasn't able to reproduce the result for my requirement.
I hope there will be a simple method to do the operation...
Use iloc to select all columns except the first, compare for not equal to zero with ne, test for at least one True per row with any, filter by boolean indexing, and last reset_index:
df = df[df.iloc[:, 1:].ne(0).any(axis=1)].reset_index(drop=True)
Alternative with remove column Time:
df = df[df.drop('Time', axis=1).ne(0).any(axis=1)].reset_index(drop=True)
print (df)
Time a b c d e
0 2 1 2 0 0 0
1 4 5 0 0 0 0
2 6 7 0 0 0 0
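To inspect the intermediate boolean mask on the sample frame (a sketch of the same steps, run before filtering):

mask = df.iloc[:, 1:].ne(0).any(axis=1)   # True where at least one of a..e is non-zero
print(mask.tolist())                      # [False, True, False, True, False, True]
df = df[mask].reset_index(drop=True)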
I have df that is similar to the following:
value is_1 is_2 is_3
5 0 1 0
7 0 0 1
4 1 0 0
(it is guaranteed, that the sum of values from columns is_1 ... is_n is equal to 1 calculating by each row)
I need to get the following result:
is_1 is_2 is_3
0 5 0
0 0 7
4 0 0
(I should find the column is_k that is greater than 0, and fill it with the value from the "value" column)
What is the best way to achieve it?
I'd do it this way:
In [16]: df = df.mul(df.pop('value').values, axis=0)
In [17]: df
Out[17]:
is_1 is_2 is_3
0 0 5 0
1 0 0 7
2 4 0 0
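This works because the is_k columns are guaranteed to be 0/1 indicators summing to 1 per row, so the row-wise multiplication (axis=0 aligns the popped value Series with the rows) zeroes every column except the flagged one, and pop conveniently drops value from the frame in the same step.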
I know that if I have a DataFrame object in pandas, I can find out whether a row is a duplicate by using the .duplicated() method on the DataFrame. This returns a Series of True/False depending on whether the row is a duplicate. My question: is it then possible to index the original DataFrame with this object, so that I only get back the duplicates (so that I can visually inspect them)?
In [18]: df = pd.DataFrame(np.random.randint(0, 2, (10, 4)))
In [19]: df
Out[19]:
0 1 2 3
0 0 1 1 0
1 0 1 1 1
2 0 1 1 1
3 1 1 0 0
4 0 1 0 1
5 1 0 1 0
6 0 1 0 1
7 1 1 1 0
8 0 1 1 0
9 0 0 0 1
[10 rows x 4 columns]
In [20]: df[df.duplicated()]
Out[20]:
0 1 2 3
2 0 1 1 1
6 0 1 0 1
8 0 1 1 0
[3 rows x 4 columns]
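Note that by default .duplicated() marks only the second and later occurrences. If you want to inspect every member of each duplicate group, including the first occurrence, pass keep=False:

df[df.duplicated(keep=False)]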