Splitting string value in given column (Pandas) - python

I have a dataframe with many rows like this:
ID  Variable
1   A1_1 - Red
2   A1_2 - Blue
3   A1_3 - Yellow
I'm trying to iterate over all rows so that every value in the second column changes to just "A1". The code I've come up with is:
for row in df.iterrows():
    current_response_id = row[1][0]
    columncount = 0
    for columncount in range(2):
        variable = row[1][1]
        row[1][1] = variable.split("_")[0].split(" -")[0]
        variable = row[1][1]
However, this isn't achieving the desired result. How could I go about this?

Try:
df["Variable"] = df["Variable"].str.split("_").str[0]
print(df)
Prints:
   ID Variable
0   1       A1
1   2       A1
2   3       A1
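If you prefer a regex, an equivalent vectorized alternative is str.extract; this is a minimal sketch, and the anchored pattern (capture everything before the first underscore) is an assumption about the data:
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "Variable": ["A1_1 - Red", "A1_2 - Blue", "A1_3 - Yellow"]})

# capture everything before the first underscore; expand=False returns a Series
df["Variable"] = df["Variable"].str.extract(r"^([^_]+)", expand=False)
print(df)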

Related

How to save multiple values on different rows as a variable or list in a CSV using Python Pandas

I'm currently trying to iterate through a dataframe/CSV and compare the dates of rows with the same ID. If the dates are different, or are a certain time-frame apart, I want to put a 1 in another column (not shown) to mark that ID and its row/s.
I'm looking to save the DATE values as variables and compare them against other DATE variables with the same ID. If the dates are a set amount of time apart, I'll put a 1 in another column on the same row.
ID  DATE
1   11/11/2011
1   11/11/2011
2   5/05/2011
2   20/06/2011
3   2/04/2011
3   10/08/2011
4   8/12/2011
4   1/02/2012
4   12/03/2012
For this post, I'm mainly looking to save the multiple values as variables or a list; I'm hoping to figure out the rest once this roadblock has been removed.
Here's what I have currently, but I don't think it'll be much help. It iterates through and converts the date strings to dates, which is what I want to happen AFTER getting a list of all the dates with the same ID value.
import pandas as pd
from datetime import *

filename = 'TestData.csv'
df = pd.read_csv(filename)
print (df.iloc[0,1])
x = 0
for i in df.iloc:
    FixDate = df.iloc[x, 1]
    d1, m1, y1 = FixDate.split('/')
    d1 = int(d1)
    m1 = int(m1)
    y1 = int(y1)
    finaldate = date(y1, m1, d1)
    print(finaldate)
    x = x + 1
Any help is appreciated, thank you!
In pandas, it is best to avoid loops for performance. If you need a new column that tests whether all DATE values are the same per group, use GroupBy.transform with DataFrameGroupBy.nunique and then compare the values to 1:
df = pd.read_csv(filename)
df['test'] = df.groupby('ID')['DATE'].transform('nunique').eq(1).astype(int)
print (df)
   ID        DATE  test
0   1  11/11/2011     1
1   1  11/11/2011     1
2   2   5/05/2011     0
3   2  20/06/2011     0
4   3   2/04/2011     0
5   3  10/08/2011     0
6   4   8/12/2011     0
7   4   1/02/2012     0
8   4  12/03/2012     0
If you need to filter the matched rows:
mask = df.groupby('ID')['DATE'].transform('nunique').eq(1)
df1 = df[mask]
print (df1)
   ID        DATE
0   1  11/11/2011
1   1  11/11/2011
In the last step, convert the values to a list:
IDlist = df1['ID'].tolist()
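For the follow-up goal of flagging IDs whose dates are a certain time-frame apart, here is a hedged sketch: parse DATE with pd.to_datetime using dayfirst=True (an assumption based on values like 20/06/2011) and compare each group's date span against a threshold (the 30-day cutoff is also an assumption):
import pandas as pd

df = pd.read_csv("TestData.csv")

# parse day-first dates such as 20/06/2011 (assumes all dates use this format)
df["DATE"] = pd.to_datetime(df["DATE"], dayfirst=True)

# span between the earliest and latest date per ID, broadcast to every row
span = df.groupby("ID")["DATE"].transform(lambda s: s.max() - s.min())

# flag rows whose ID's dates are more than 30 days apart (threshold assumed)
df["flag"] = (span > pd.Timedelta(days=30)).astype(int)
print(df)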

Put level of dataframe index at the same level of columns on a Multi-Index Dataframe

Context: I'd like to "bump" the index level of a multi-index dataframe up. In other words, I'd like to put the index name of the dataframe on the same level as the columns of a multi-indexed dataframe.
Let's say we have this dataframe:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.index.name = 'Index Column'
And we perform this change to add a multi-index level (like a label for the table):
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
Which results in this:
             Multi-Index Table Label
                                   A  B  C
Index Column
0                                  1  4  7
1                                  2  5  8
2                                  3  6  9
Desired Output: How can I make it so that the dataframe looks like this instead (notice the removal of the empty level on the dataframe/table):
             Multi-Index Table Label
Index Column                       A  B  C
0                                  1  4  7
1                                  2  5  8
2                                  3  6  9
Attempts: I was testing something out and you can essentially remove the index level by doing this:
tt.index.name = None
Which would result in:
  Multi-Index Table Label
                        A  B  C
0                       1  4  7
1                       2  5  8
2                       3  6  9
Essentially removing that extra level/empty line, but the thing is that I do want to keep the Index Column as it will give information about the type of data present on the index (which in this example are just 0,1,2 but can be years, dates, etc).
How could I do that?
Thank you all in advance :)
How about this:
tt = pd.DataFrame({'A':[1,2,3],'B':[4,5,6],'C':[7,8,9]})
tt.insert(loc=0, column='Index Column', value=tt.index)
tt = pd.concat([tt],keys=['Multi-Index Table Label'], axis=1)
tt = tt.style.hide_index()
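One caveat: Styler.hide_index was deprecated in pandas 1.4 in favor of Styler.hide, so on newer versions the last line would be:
tt = tt.style.hide(axis="index")  # pandas >= 1.4 replacement for hide_index()
Note that this only affects rendering; the underlying DataFrame still keeps its index.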

Removing pandas rows whose columns don't match, while keeping rows with an empty column

So I have these two pandas dfs (input and desired output below), and I know how to remove the rows whose columns don't match, but I also want to keep the rows where one of the columns is empty (I don't want those to be removed).
V1       V2  V3
hello    0   0
nice     0   1
meeting  1   1
you      1   0
hi       0
V1       V2  V3
hello    0   0
meeting  1   1
hi       0
What I've tried:
df = df[df.V2 == df.V3]
Problem: it removes the rows with one empty column
Some good answers in the comments, here's another alternative:
df[df.V3.fillna(df.V2).eq(df.V2)]
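If either V2 or V3 could be blank (an assumption; the sample only shows V3 empty), a symmetric sketch fills each column from the other before comparing:
import numpy as np
import pandas as pd

df = pd.DataFrame({"V1": ["hello", "nice", "meeting", "you", "hi"],
                   "V2": [0, 0, 1, 1, 0],
                   "V3": [0, 1, 1, 0, np.nan]})

# treat a missing value in either column as matching the other column
mask = df["V2"].fillna(df["V3"]).eq(df["V3"].fillna(df["V2"]))
print(df[mask])  # keeps hello, meeting and hi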

Python3 Pandas Filter by Columns with Unknown Column Names

Working with a data set comparing rosters with different dates. It goes through a pivot and we don't know the dates of when the rosters are pulled but the resulting data set is structured like this:
colA   ColB  colC  colD  Date:yymmdd  Date:yymmdd  Date:yymmdd
Bob    aa    aa    aa    0            0            1
Jack   bb    bb    bb    1            1            1
Steve  cc    cc    cc    0            1            1
Mary   dd    dd    dd    1            1            1
Abu    ee    ee    ee    1            1            0
I successfully did a fillna for every column after the first 4 columns (they are known).
df.iloc[:,4:] = df.iloc[:,4:].fillna(0) #<-- Fills blanks on every column after column 4.
Question: Now I'm trying to filter the df on the columns that have a zero. Is there a way to filter by the columns after the fourth? I tried:
df = df[(df.iloc[:,4:] == 0)] # error
df = df[(df.columns[:,4:] == 0)] # error
df = df[(df.columns.str.contains(':') == 0)] # unknown columns do have a ':', but didn't work.
Is there a better way to do this? Looking for a result that only shows the rows with a 0 in any column past #4.
The snippet below will give you a DataFrame containing True and False as the cell values of df:
df.iloc[:, 4:].eq(x)
If you want only the rows where x is present, chain an any() clause, as @jpp shows in his answer. In your case, it will be:
df[df.iloc[:, 4:].eq(0).any(axis=1)]
This gives you all the rows of the DataFrame that have at least one 0 as a data value.
If all values are 0 or bigger, use min (note that the column slice must go through iloc, not df.columns):
df[df.iloc[:, 4:].min(axis=1) == 0]
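A self-contained version of the any() approach, with column names and cell values assumed from the question's example (the concrete dates are placeholders):
import pandas as pd

df = pd.DataFrame({"colA": ["Bob", "Jack", "Steve", "Mary", "Abu"],
                   "ColB": ["aa", "bb", "cc", "dd", "ee"],
                   "colC": ["aa", "bb", "cc", "dd", "ee"],
                   "colD": ["aa", "bb", "cc", "dd", "ee"],
                   "Date:110101": [0, 1, 0, 1, 1],
                   "Date:110201": [0, 1, 1, 1, 1],
                   "Date:110301": [1, 1, 1, 1, 0]})

# keep rows with at least one 0 in any column past the fourth
print(df[df.iloc[:, 4:].eq(0).any(axis=1)])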

New column with incremental numbers based on a different column's value (pandas)

I want to add a column with incremental numbers for rows that share the same value in a defined column;
e.g. if I would have this df
df=pd.DataFrame([['a','b'],['a','c'],['c','b']])
and I want incremental numbers for the first column. It should look like this:
df=pd.DataFrame([['a','b',1],['a','c',2],['c','b',1]])
I found sql solutions but I'm working with ipython/pandas. Can someone help me?
Use cumcount; for the name of the new column, use the length of the original columns:
print (len(df.columns))
2
df[len(df.columns)] = df.groupby(0).cumcount() + 1
print (df)
   0  1  2
0  a  b  1
1  a  c  2
2  c  b  1
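If a descriptive label is preferred over the positional name 2 (an assumption about the desired output), the same groupby works with a string column name:
import pandas as pd

df = pd.DataFrame([['a', 'b'], ['a', 'c'], ['c', 'b']])

# per-group running counter over column 0, starting at 1
df["counter"] = df.groupby(0).cumcount() + 1
print(df)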
