Iterating through a Pandas dataframe to count and adjust a cell - python

I have a Pandas dataframe like so:
            colLabelA  colLabelB  ...  colLabelZ
rowLabelA          10          0              0
specialRow          0         10              0
rowLabelB          20          0             10
...
rowLabelZ           0          0             20
Essentially, I only know the row called specialRow. What I need is a way to iterate through the entire dataframe and check every column for 0 (zero).
If a column is all zeroes except for specialRow, then the cell at that column and specialRow needs to be set to zero as well. Otherwise, move on and check the next column.
So in the above example, only colLabelB is all zeroes apart from the specialRow, so it needs to be updated like so:
            colLabelA  colLabelB  ...  colLabelZ
rowLabelA          10          0              0
specialRow          0          0              0
rowLabelB          20          0             10
...
rowLabelZ           0          0             20
Is there a quick way to do this?
The dataframes aren't huge, but I don't want it to be super slow either.

Use drop to exclude the named row, check which columns are all zero with eq(0).all(), then update those cells with loc:
df.loc['specialRow', df.drop('specialRow').eq(0).all()] = 0
This also works with more than one special row:
specialRows = ['specialRow']
df.loc[specialRows, df.drop(specialRows).eq(0).all()] = 0
Output:
            colLabelA  colLabelB  colLabelZ
rowLabelA          10          0          0
specialRow          0          0          0
rowLabelB          20          0         10
rowLabelZ           0          0         20
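For reference, a self-contained sketch of that approach (the small frame below is just a stand-in for the example data in the question):
import pandas as pd
df = pd.DataFrame(
    {'colLabelA': [10, 0, 20, 0],
     'colLabelB': [0, 10, 0, 0],
     'colLabelZ': [0, 0, 10, 20]},
    index=['rowLabelA', 'specialRow', 'rowLabelB', 'rowLabelZ'])
# columns that are all zero once specialRow is excluded
all_zero_cols = df.drop('specialRow').eq(0).all()
df.loc['specialRow', all_zero_cols] = 0
print(df)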

For each column, exclude the special row's index, then check whether all remaining values in that column are zero; if so, assign 0 to that column:
for col in df:
    if df[df.index != 'specialRow'][col].eq(0).all():
        df[col] = 0
Output:
            colLabelA  colLabelB  colLabelZ
rowLabelA          10          0          0
specialRow          0          0          0
rowLabelB          20          0         10
rowLabelZ           0          0         20
In fact, df.index != 'specialRow' is the same for every column, so you can assign it to a variable once and reuse it for each column, as in the sketch below.
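A minimal sketch of that variant, reusing the example frame:
mask = df.index != 'specialRow'  # every row except specialRow
for col in df:
    if df.loc[mask, col].eq(0).all():
        df[col] = 0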

Drop the 'specialRow' row, then check whether the column's remaining values are all zero:
if (df.drop(['specialRow'])['colLabelB'] == 0).all():
    df['colLabelB'] = 0
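This only handles colLabelB; to cover every column, the same check can be wrapped in a loop (a sketch, reusing the example frame from the question):
dropped = df.drop(['specialRow'])
for col in df.columns:
    if (dropped[col] == 0).all():
        df[col] = 0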

Related

Extracting rows in R based on an ID number

I have a data frame in R which, when running data$data['rs146217251',1:10,] looks something like:
                 g=0  g=1  g=2
1389117_1389117    0    1   NA
2912943_2912943    0    0    1
3094358_3094358    0    0    1
5502557_5502557    0    0    1
2758547_2758547    0    0    1
3527892_3527892    0    1   NA
3490518_3490518    0    0    1
1569224_1569224    0    0    1
4247075_4247075    0    1   NA
4428814_4428814    0    0    1
The leftmost column contains participant identifiers. There are roughly 500,000 participants listed in this data frame, but I have a subset of them (about 5,000) listed by their identifiers. What is the best way to extract only the rows I care about, by identifier, in R or Python (or some other way)?
Assuming that the participant identifiers are row names, you can filter by the vector of the identifiers as below:
df <- data$data['rs146217251',1:10,]
#Assuming the vector of identifiers
id <- c("4428814_4428814", "3490518_3490518", "3094358_3094358")
filtered <- df[id,]
Output:
> filtered
                 g.0  g.1  g.2
4428814_4428814    0    0    1
3490518_3490518    0    0    1
3094358_3094358    0    0    1
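Since the question also allows Python, here is a pandas sketch of the same filtering (the small frame below is just a stand-in for the real data, whose index is assumed to hold the participant identifiers; loading the file is omitted):
import pandas as pd
# stand-in for the loaded genotype table; index holds participant identifiers
df = pd.DataFrame(
    {"g=0": [0, 0, 0], "g=1": [1, 0, 0], "g=2": [None, 1, 1]},
    index=["1389117_1389117", "3490518_3490518", "4428814_4428814"])
ids = ["4428814_4428814", "3490518_3490518", "3094358_3094358"]
filtered = df.loc[df.index.isin(ids)]  # keeps only the identifiers actually present
print(filtered)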

How can I find differences between two dataframe rows?

I have two data frames that I merged together on a common ID. I am trying to uncover when values in each row for a matching ID are different.
I merged the files so that I have the table below. I think I might be able to approach this with a series of if statements, but the actual data file has hundreds of columns, which doesn't seem efficient at all. I'm trying to determine if there's an easy way to do this.
x  Loan_ID  Trade_Quantity_x  Principal_x  Interest_x  Late_Fee_x  Trade_Quantity_y  Principal_y  Interest_y  Late_Fee_y
0        1                10           30           0           0                10           30           0           0
1        2                10            0           0           5                10            0           0           0
2        3                10            0          50           0                10            0           0           0
3        4                10            0           0           0                10            0           0           0
4        5                10          100          10           0                10          100          10           0
5        6                 9            0           0           0                 9            0           0           0
6        7                10            0           0           0                10            0           0           0
Expected output should be:
2. Late_Fee_y
3. Interest_y
I am assuming that what you are after is to compare two data frames with the same structure, i.e. the same list of columns and the same number of rows, where rows are identified by the values of Loan_ID.
The goal is to list all "cells" that differ between the two frames, where a cell is located by its Loan_ID value and its column name.
May I suggest reshaping (melting) the two frames first to get a long list of values, and then finding the differences either by scanning the melted frames or by applying a filter?
Example data (think of id as Loan_ID):
import pandas as pd
x = {'id': [1, 2], 'A': [0, 1], 'B': [2, 3]}
y = {'id': [1, 2], 'A': [0, 2], 'B': [2, 4]}
df_x = pd.DataFrame(x)
df_y = pd.DataFrame(y)
print(df_x)
print(df_y)
Melted:
df_xm = pd.melt(df_x, id_vars=['id'])
df_xm['source']='x'
df_ym = pd.melt(df_y, id_vars=['id'])
df_ym['source']='y'
print(df_xm)
print(df_ym)
Assuming that both melted frames are sorted correspondingly by id:
for i in df_xm.index:
    if df_xm['value'][i] != df_ym['value'][i]:
        print(f"{df_xm['id'][i]},{df_xm['variable'][i]}")
Second method:
merged = df_xm.merge(df_ym, left_on= ['id','variable'], right_on=['id','variable'])
print(merged)
filter_diff = merged['value_x'] != merged['value_y']
print('differences:')
print(merged[ filter_diff ])
I'm sure this can be improved for efficiency, but this is my general idea of how to tackle the "difference between two table snapshots" problem with general frame/table and filter operations.
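As a follow-up to the filter approach, the differing cells can be reported as (id, column) pairs directly from the merged frame, which matches the expected output format in the question (a sketch continuing from the merged and filter_diff variables above):
# each row of this result is one differing cell: its id and the melted column name ('variable')
diff_cells = merged.loc[filter_diff, ['id', 'variable']]
print(diff_cells)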

Reassign labels of columns from 0 after drop of few columns

I removed some duplicate columns by the following command.
columns = XY.columns[:-1].tolist()
XY1 = XY.drop_duplicates(subset=columns, keep='first')
The result is below:
Combined Series shape: (100, 4)
Combined Series:
   1  222  223            0
0  0    0    0  1998.850000
1  0    0    0     0.947361
2  0    0    0     0.947361
3  0    0    0     0.947361
4  0    0    0     0.947361
The columns are now labelled 1, 222, 223, 0 (the 0 label at the end comes from a concat with another df). I want the columns to be re-labelled from 0 onwards. How do I do that?
So first create a dictionary with the mapping you want (np.linspace requires numpy):
import numpy as np
trafo_dict = {x: y for x, y in zip([1, 222, 223, 0], np.linspace(0, 3, 4))}
Then you need to rename the columns. This can be done with pd.DataFrame.rename:
XY1 = XY1.rename(columns=trafo_dict)
Edit: If you want it in a more general fashion use:
np.linspace(0, XY1.shape[1] - 1, XY1.shape[1])
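Note that np.linspace produces floats, so the new labels come out as 0.0, 1.0, and so on. If plain integer labels starting at 0 are all that is needed, a simpler sketch (assuming XY1 is the deduplicated frame from the question) is:
# relabel the columns positionally as 0, 1, 2, ...
XY1.columns = range(XY1.shape[1])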

How to sample data by keeping at least two non-zero columns

I have a pandas data frame which is roughly 50K x 9.5K in size. My dataset is binary, that is, it contains only 1s and 0s, and it has a lot of zeros.
Think of it as user-item purchase data, where a cell is 1 if the user purchased the item and 0 otherwise. Users are rows and items are columns.
353 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
354 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
355 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
356 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
357 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
I want to split it into training, validation and test sets. However, it is not going to be just a normal split by rows.
What I want is that for the validation and test sets, I keep between 2 and 4 columns from the original data which are non-zero.
So basically, if my original data had 9.5K columns for each user, I first keep only, let's say, 1500 or so columns. Then I split this sampled data into train and test, keeping around 1495-1498 columns in train and 2-5 columns in test/validation. The columns which go into test are ONLY those which are non-zero; training can have both.
I also want to keep the item name/index corresponding to the columns which are retained in test/validation.
I don't want to run a loop to check each cell value and put it in the next table.
Any ideas?
EDIT 1:
So this is what I am trying to achieve.
So, by non-zero, I am guessing you mean those columns which only have ones in them. That is fairly easy to do. The best approach is probably to use sum, like so:
import numpy as np
sums = df.sum(axis=0)  # sum down each column; gives a Series with column names as the index and column sums as values
non_zero_cols = sums[sums == len(df)].index  # only the column names whose records are all non-zero (all ones)
# Now to split the data into training and testing
test_cols = np.random.choice(non_zero_cols, 2, replace=False)  # or 5, just randomly selecting columns
test_data = df[test_cols]
train_data = df.drop(columns=test_cols)
Is that what you are looking for?
IIUC:
threshold = 6
new_df = df.loc[df.sum(axis=1) >= threshold]
df.sum(axis=1) sums over each row. Since these are 1s and 0s, this is equivalent to counting.
df.sum(axis=1) >= threshold creates a Series of Trues and Falses, also referred to as a boolean mask.
df.loc happens to accept boolean masks as a way to slice.
df.loc[df.sum(axis=1) >= threshold] passes the boolean mask to df.loc and returns only those rows that have a corresponding True in the boolean mask.
Since the boolean mask only has Trues where the count of 1s is greater than or equal to threshold, this equates to returning a slice of the dataframe in which each row has at least threshold non-zero entries.
And then refer to this answer (or this answer) on how to split into test, train and validation sets; a basic row-wise split is sketched below.
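Since those linked answers are not reproduced here, here is a minimal sketch of a plain row-wise split using scikit-learn's train_test_split (this assumes scikit-learn is an acceptable dependency and does not implement the per-user column holdout described in the question):
from sklearn.model_selection import train_test_split
# 80% train, 10% validation, 10% test, split by rows of the filtered frame new_df
train_df, rest_df = train_test_split(new_df, test_size=0.2, random_state=42)
valid_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)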

Pandas - Get dummies for only certain values

I have a Pandas series of 10,000 rows, each populated with a single letter from A to Z.
However, I want to create dummies for only A, B, and C, using Pandas get_dummies.
How do I go about doing that?
I don't want to get dummies for all the values in the column and then select the specific columns, as the column contains other redundant data which eventually causes a MemoryError.
Try this:
import pandas as pd
# create mock dataframe
df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})
# use replace with a regex to set characters outside a-c to None,
# so get_dummies skips them (those rows get all-zero dummies)
pd.get_dummies(df.replace({'[^a-c]': None}, regex=True))
Output:
   alpha_a  alpha_b  alpha_c
0        1        0        0
1        1        0        0
2        0        1        0
3        0        1        0
4        0        0        1
5        0        0        0
6        0        0        0
7        0        0        0
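An alternative sketch that avoids the regex (not from the original answer, just a common idiom): mask the unwanted letters with isin/where before calling get_dummies, so only the a, b and c dummy columns are created:
import pandas as pd
df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})
wanted = ['a', 'b', 'c']
# values outside the wanted set become NaN and therefore get no dummy column
dummies = pd.get_dummies(df['alpha'].where(df['alpha'].isin(wanted)), prefix='alpha')
print(dummies)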
