Create columns in dataframe based on csv field - python

I have a pandas dataframe with the column "Values" that has comma separated values:
Row|Values
1|1,2,3,8
2|1,4
I want to create columns based on the CSV, and assign a boolean indicating if the row has that value, as follows:
Row|1,2,3,4,8
1|true,true,true,false,true
2|true,false,false,true,false
How can I accomplish that?
Thanks in advance

Use str.get_dummies, which splits each string on the separator and creates one indicator column per distinct value; astype(bool) then converts the 1/0 flags to True/False:
df.set_index('Row')['Values'].str.get_dummies(',').astype(bool)
Out[318]:
1 2 3 4 8
Row
1 True True True False True
2 True False False True False
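As a self-contained sketch, reconstructing the frame from the question:

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({"Row": [1, 2], "Values": ["1,2,3,8", "1,4"]})

# str.get_dummies splits each string on "," and creates one 0/1 indicator
# column per distinct value; astype(bool) converts the flags to True/False
result = df.set_index("Row")["Values"].str.get_dummies(",").astype(bool)
print(result)
```

Note that the column labels come back as strings ("1", "2", ...), since they originate from the split text.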

Related

How to count equal values in a row in a column

I would like to find, with Python, blocks of x or more equal values in a row in a column.
E.g. given a dataset, I would like to find blocks of three or more True values in a row in the column Value and create a new column with the result:
ID Value New column
1 True False
2 True False
3 False False
4 False False
5 False False
6 True True
7 True True
8 True True
9 False False
10 False False
11 True False
12 True False
13 False False
14 True True
. True True
. True True
.. True True
One way to solve your problem would be to use two passes. Note that this is quick and dirty code, not optimized for efficiency.
First pass - populate the data for all 3 columns, with a default value of False for the 3rd column.
Second pass - go through each row. If the value of the 3rd column is not already True, check the value of the 2nd column for that row and the next two rows, i.e. row(i), row(i+1), row(i+2). If the values are all True, update the 3rd column to True for that row and the next two rows.
For example, on row 6 the 3rd column is False; you check the 2nd column for rows 6, 7, 8, and since they are all True you set the 3rd column for rows 6, 7, 8 to True.
Since you only check rows where the 3rd column is False, rows 7 and 8 are then skipped (so they don't get overwritten back to False) and you land on row 9, where the 3rd column is again False.
for i in range(len(rows)):
    try:
        # Ignore any row where the third column is already set to True
        if not rows[i][2]:
            # all() takes a single iterable, so wrap the three values in a list
            if all([rows[i][1], rows[i + 1][1], rows[i + 2][1]]):
                rows[i][2] = True
                rows[i + 1][2] = True
                rows[i + 2][2] = True
    except IndexError:
        continue  # row(i+1) or row(i+2) doesn't exist near the end of the list
The try/except block handles the end of the list, where row(i+1) or row(i+2) doesn't exist.
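For larger tables, a vectorized pandas alternative avoids the explicit loops: label each run of consecutive equal values with shift/cumsum, then keep rows whose run is True and at least three long. A sketch, re-creating the Value column from the question:

```python
import pandas as pd

# The Value column from the question (IDs 1-16)
s = pd.Series([True, True, False, False, False, True, True, True,
               False, False, True, True, False, True, True, True])

# Each time the value changes from the previous row, start a new run label
run_id = (s != s.shift()).cumsum()

# Length of the run each row belongs to
run_len = s.groupby(run_id).transform("size")

# A row qualifies if its value is True and its run has 3 or more rows
new_column = s & (run_len >= 3)
print(new_column.tolist())
```

The same shift/cumsum trick generalizes to any "x or more in a row" threshold by changing the comparison.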

How to subset output of pandas contain statement to give all True values?

How to subset output of pandas contain statement to give all True values?
Code
df_2clean["p2_conf"].astype(str).str.contains(r'[^0-9+-:.\s]')
Output
0 False
1 False
2 False
3 False
4 True
The boolean Series returned by str.contains is already a mask, so comparing it == True is redundant. Try this:
df_2subset = df_2clean[df_2clean["p2_conf"].astype(str).str.contains(r'[^0-9+-:.\s]')]
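As a runnable sketch (the p2_conf values below are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for df_2clean; only "abc" contains a character
# outside the set the regex allows
df_2clean = pd.DataFrame({"p2_conf": ["0.95", "0.87", "1:30", "0.5", "abc"]})

# str.contains returns a boolean Series that can be used directly as a mask
mask = df_2clean["p2_conf"].astype(str).str.contains(r'[^0-9+-:.\s]')
df_2subset = df_2clean[mask]
print(df_2subset)
```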

Pandas - Find empty columns of the row and update in one column

I've below code to read excel values
import pandas as pd
import numpy as np
import os
df = pd.read_excel(os.path.join(os.getcwd(), 'TestData.xlsx'))
print(df)
Excel data is
Employee ID First Name Last Name Contact Technology Comment
0 1 KARTHICK RAJU 9500012345 .NET
1 2 TEST 9840112345 JAVA
2 3 TEST 9145612345 AWS
3 4 9123498765 Python
4 5 TEST TEST 9156478965
The code below returns True wherever a cell holds an empty value
print(df.isna())
like below
Employee ID First Name Last Name Contact Technology Comment
0 False False False False False True
1 False False True False False True
2 False True False False False True
3 False True True False False True
4 False False False False True True
I want to add the comment for each row like below
Comment
Last Name is empty
First Name is empty
First Name and Last Name are empty
Technology is empty
One way of doing this is iterating over each row to find the empty indexes, then updating the comment based on them.
If the table holds huge data, iteration may not be a good idea.
Is there a way to achieve this in a more pythonic way?
You can simplify the solution: instead of "is"/"are" wording, use a - suffix. Matrix multiplication via DataFrame.dot of the boolean mask with the suffixed column names builds each comment, and Series.str.rstrip removes the trailing separator:
#if column exist
df = df.drop('Comment', axis=1)
df['Comment'] = df.isna().dot(df.columns + '-empty, ').str.rstrip(', ')
print (df)
Employee ID First Name Last Name Contact Technology \
0 1 KARTHICK RAJU 9500012345 .NET
1 2 TEST NaN 9840112345 JAVA
2 3 NaN TEST 9145612345 AWS
3 4 NaN NaN 9123498765 Python
4 5 TEST TEST 9156478965 NaN
Comment
0
1 Last Name-empty
2 First Name-empty
3 First Name-empty, Last Name-empty
4 Technology-empty
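Put together as a self-contained sketch (the frame below reconstructs the Excel data from the question, so the exact values are assumptions):

```python
import numpy as np
import pandas as pd

# Reconstructed sample data; NaN marks the empty Excel cells
df = pd.DataFrame({
    "Employee ID": [1, 2, 3, 4, 5],
    "First Name": ["KARTHICK", "TEST", np.nan, np.nan, "TEST"],
    "Last Name": ["RAJU", np.nan, "TEST", np.nan, "TEST"],
    "Contact": [9500012345, 9840112345, 9145612345, 9123498765, 9156478965],
    "Technology": [".NET", "JAVA", "AWS", "Python", np.nan],
})

# isna() yields a boolean matrix; dot() with the suffixed column names
# concatenates the names of the True (empty) columns for each row,
# and rstrip trims the trailing separator
df["Comment"] = df.isna().dot(df.columns + "-empty, ").str.rstrip(", ")
print(df["Comment"].tolist())
```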

Use boolean series of different length to select rows from dataframe

I have a dataframe that looks like this:
df = pd.DataFrame({"piece": ["piece1", "piece2", "piece3", "piece4"], "No": [1, 1, 2, 3]})
No piece
0 1 piece1
1 1 piece2
2 2 piece3
3 3 piece4
I have a series with an index that corresponds to the "No"-column in the dataframe. It assigns boolean variables to the "No"-values, like so:
s = pd.Series([True, False, True, True])
0 True
1 False
2 True
3 True
dtype: bool
I would like to select those rows from the dataframe where in the series the "No"-value is True. This should result in
No piece
2 2 piece3
3 3 piece4
I've tried a lot of indexing with df["No"].isin(s), or something like df[s["No"] == True]... but nothing has worked so far.
I think you need to map the values in the No column to the True/False condition and use the result for subsetting:
df[df.No.map(s)]
# No piece
#2 2 piece3
#3 3 piece4
df.No.map(s)
# 0 False
# 1 False
# 2 True
# 3 True
# Name: No, dtype: bool
Another option is to index into s using df['No'], then use the result as a mask on df itself:
df[s[df['No']].values]
The final mask needs to be extracted as an array using values, because otherwise the duplicated labels in its index cause an alignment error.
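The map approach as a runnable sketch of the question's data:

```python
import pandas as pd

df = pd.DataFrame({"piece": ["piece1", "piece2", "piece3", "piece4"],
                   "No": [1, 1, 2, 3]})
s = pd.Series([True, False, True, True])

# map looks up each No value in s's index, producing a boolean mask
# aligned with df's own index
selected = df[df["No"].map(s)]
print(selected)
```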

Extracting all rows from pandas Dataframe that have certain value in a specific column

I am relatively new to Python/Pandas and am struggling with extracting the correct data from a pd.Dataframe. What I actually have is a Dataframe with 3 columns:
data =
Position Letter Value
1 a TRUE
2 f FALSE
3 c TRUE
4 d TRUE
5 k FALSE
What I want to do is put all of the TRUE rows into a new Dataframe so that the answer would be:
answer =
Position Letter Value
1 a TRUE
3 c TRUE
4 d TRUE
I know that you can access a particular column using
data['Value']
but how do I extract all of the TRUE rows?
Thanks for any help and advice,
Alex
You can test which Values are True:
In [11]: data['Value'] == True
Out[11]:
0 True
1 False
2 True
3 True
4 False
Name: Value, dtype: bool
and then use boolean indexing to pull out those rows:
In [12]: data[data['Value'] == True]
Out[12]:
Position Letter Value
0 1 a True
2 3 c True
3 4 d True
*Note: if the values are actually the strings 'TRUE' and 'FALSE' (they probably shouldn't be!) then use:
data['Value'] == 'TRUE'
You can wrap your value/values in a list and do the following:
new_df = df.loc[df['yourColumnName'].isin(['your', 'list', 'items'])]
This will return a new dataframe consisting of rows where your list items match your column name in df.
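Both answers combined in a runnable sketch (the frame below uses real booleans for Value, per the note above):

```python
import pandas as pd

data = pd.DataFrame({"Position": [1, 2, 3, 4, 5],
                     "Letter": ["a", "f", "c", "d", "k"],
                     "Value": [True, False, True, True, False]})

# With real booleans, the column itself is the mask
answer = data[data["Value"]]
print(answer)

# isin is handy when matching against several values at once
answer2 = data.loc[data["Letter"].isin(["a", "c", "d"])]
```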
