Split output column with fixed & dynamic length in Python

I want to split a single DataFrame column into three columns. Sample input and output:
[(Col1=fix length), (Col2=dynamic length),( Col3= remaining part)]
import re
import pandas as pd
text='Raw Data'
out = re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}", text)
df = pd.DataFrame(out, columns = ["RIY"])
df["col1"] = df.RIY.str[0:15]
df["col2"] = df.RIY.str[15:24]  # need to split on a criterion: find the next '.' minus 2 chars
df["col3"] = df.RIY.str[24:]  # remaining text after splitting off the second column
#Output
[1]: https://i.stack.imgur.com/Lupcd.png
I tried to split with fixed lengths (the solution by Roy2012), which works perfectly only for the first part, [0:15]; the length varies for the remaining two columns. I want to split by finding the second dot ('.') and stepping back 2 characters (to avoid cutting off the 46).

Does this work for you?
df.RAW.str.extract(r"(.*)(\d\d\.\d+)(\d\d\.\d+)")
The output I get is:
0 1 2
0 RIY-OUHOMH-1002 24.534768 46.650127
1 RIY-OUHOHH-1017 24.51472 46.663988
2 RIY-OUHOMH-1004 24.532244 46.651758
3 RIY-OUHOHH-1007 24.529029 46.653571
4 RIY-OUHOHH-1006 24.530071 46.651934
5 RIY-OUHOHH-1005 24.531786 46.65279
6 RIY-OUHOMH-1001 24.535972 46.649456
7 RIY-DIRAHH-0151 24.495407 46.641877
8 RIY-DIRAHH-0152 24.494105 46.644253
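A sketch of wiring that extract back into the three requested columns. The sample strings below are reconstructed from the question's output, so treat them as assumptions:

```python
import pandas as pd

# Sample strings reconstructed from the question's output (an assumption,
# since the raw input itself isn't shown in the post).
df = pd.DataFrame({"RIY": ["RIY-OUHOMH-100224.53476846.650127",
                           "RIY-OUHOHH-101724.5147246.663988"]})

# The greedy (.*) swallows everything up to the last two "dd.digits" groups,
# which lands the split two characters before the second-to-last dot.
df[["col1", "col2", "col3"]] = df["RIY"].str.extract(r"(.*)(\d\d\.\d+)(\d\d\.\d+)")
print(df[["col1", "col2", "col3"]])
```

This avoids hard-coding the [15:24] slice entirely, since the regex finds the dot boundaries itself.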

Related

Apply if else condition in specific pandas column by location

I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:
data = {'Pop': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6
#I would like to subtract 1 from PopDF['Pop2'] column cells 0-remainder.
#The remaining cells in the column I would like to stay as is (retain original pop values).
PopDF['Pop2']= PopDF['Pop2'].iloc[:(remainder)]-1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]
The first line works to subtract 1 in the correct locations, however, the remaining cells become NaN. The second line of code does not work – the error is:
ValueError: Length of values (1) does not match length of index (8)
Instead of selecting the first N rows and subtracting them, subtract from the entire column and assign only the leading values (note that .loc[:remainder] is label-based and inclusive, so it covers rows 0 through 6):
df.loc[:remainder, 'Pop2'] = df['Pop2'] - 1
Output:
>>> df
Pop Pop2
0 728375 728374
1 733355 733354
2 695395 695394
3 734658 734657
4 732811 732810
5 789396 789395
6 727761 727760
7 751967 751967
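The same conditional update can be sketched in one step with numpy.where, keyed on the positional index:

```python
import numpy as np
import pandas as pd

data = {'Pop':  [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
        'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
df = pd.DataFrame(data)
remainder = 6

# Subtract 1 where the positional index is <= remainder; keep Pop2 as-is elsewhere.
df['Pop2'] = np.where(np.arange(len(df)) <= remainder, df['Pop2'] - 1, df['Pop2'])
```

This sidesteps the NaN problem in the question, because the whole column is computed at once rather than assigned from a shorter slice.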

How to replace values in only some columns in Python without it affecting the same values in other columns?

I have a Pandas data frame with different columns.
Some columns are just “yes” or “no” answers that I would need to replace with 1 and 0
Some columns are 1s and 2s, where 2 equals no - these 2 need to be replaced by 0
Other columns are Numerical categories, for example 1,2,3,4,5 where 1 = lion 2 = dog
Other columns are string categories, like: “A lot”, “A little” etc
The first 2 columns are the target variables
My problems:
If I just change all 2 to 0 in the data frame, it would end up changing the 2 in the target variable (which in this case act as a score rather than a “No”)
Another problem would be that columns with categories as numbers, will have their 2s changed to 0 as well
How can I clean this dataframe so that:
1. all columns with either yes or 1, and those with either no or 2, become 1s and 0s
2. the two target variables stay as scores from 1-5
3. all categorical variables remain unchanged until I do one-hot encoding with them?
These are the steps I took:
To change all the “yes” or “no” values to 1 and 0:
df.replace(('Yes', 'No'), (1, 0), inplace=True)
Now in order to replace all the 2s that act as “No”s with 0s -
without it affecting neither the “2” that act as a score in first two target columns
nor the “2” that act as a category value in columns that have more than 2 unique values, I think I would need to combine the following two lines of code, is that correct? I am trying different ways to combine them but I keep getting errors
df.loc[:, df.nunique() <= 2] or df[df.columns.difference(['target1', 'target2'])].replace(2, 0)
It would be better if you showed your code and a sample of the dataset here; I'm a bit confused. Here is what I gleaned.
First, here is a dummy dataset I created (data.csv, loaded below).
Here is the code that I think solves your two problems. If something is missing, it's because I didn't quite follow the explanation, as I said.
import pandas as pd
import numpy as np
import os
filename = os.path.join(os.path.dirname(__file__),'data.csv')
sample = pd.read_csv(filename)
# This solves your first problem. Here we create a new column using numeric values instead of yes/no string values
#with a function
def create_answers_column(sample, colname):
    def is_yes(a):
        if a == 'yes':
            return 1
        else:
            return 0
    return sample[colname].apply(is_yes)
sample['Answers Numeric'] = create_answers_column(sample, 'Answers')
#This solves your second problem
#Using replace()
sample['Numbers'] = sample.Numbers.replace({2:0})
print(sample)
And here's the output:
Answers Numbers Animals Quantifiers Answers Numeric
0 yes 1 1 a lot 1
1 yes 0 2 little 1
2 no 0 3 many 0
3 yes 1 4 some 1
4 no 1 5 several 0
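For the asker's remaining question, the two expressions can in fact be combined: first select the binary columns via nunique(), then remove the targets with Index.difference, and replace only in that subset. A minimal sketch on a hypothetical frame (column names and values assumed):

```python
import pandas as pd

# Hypothetical sample: target1/target2 are 1-5 scores, q1/q2 are 1/2 yes-no
# answers, animal is a multi-value numeric category.
df = pd.DataFrame({'target1': [2, 5, 3], 'target2': [1, 2, 4],
                   'q1': [1, 2, 2], 'q2': [2, 2, 1],
                   'animal': [1, 2, 3]})

# Binary columns (at most 2 unique values) that are not target columns
nun = df.nunique()
cols = nun[nun <= 2].index.difference(['target1', 'target2'])

# Replace 2 -> 0 only in those columns; targets and categories stay untouched.
df[cols] = df[cols].replace(2, 0)
```

The assignment back through df[cols] is what keeps the replacement from leaking into the score and category columns.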

Splitting row values and count unique's from a DataFrame

I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a DataFrame that contains the single Reference code plus a Count value, for example:
Reference  Count
ABS052     3
ADA010     0
ADD005     2
...        ...
WOO032     3
WOO045     4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to drop the trailing digits from the list in each row.
All I want now is to keep element [0] of the list in each row, if that makes sense; then I could just retrieve a value_counts from the 'Reference' column.
There seems to be something wrong with the expected result listed in the question.
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value this should do
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
You can use a regex to strip the slash-and-digits suffix like this:
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
df = df['a'].str.replace(r'/\d+$', '', regex=True).value_counts().reset_index()
Output:
    index  a
0  ADD005  6
1  ABS052  3
2  ADA010  1
You are almost there, you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
1
0
ABS052 3
ADA010 1
ADD005 6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 without a "/", i.e. None in the second column.
Output to df with column names
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
output
Reference Counts
0 ADD005 6
1 ABS052 3
2 ADA010 1
Explanation - The first line gives a clean series called 'Reference'. The second line gives a count of unique items and then resets the index and renames the columns.
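Run against the first ten rows of the question's sample data, those two lines give the same counts as the other answers (a quick sanity check):

```python
import pandas as pd

# First ten rows of the sample Reference column from the question
df = pd.DataFrame({'Reference': ['ABS052', 'ABS052/01', 'ABS052/02', 'ADA010/00',
                                 'ADD005', 'ADD005/01', 'ADD005/02', 'ADD005/03',
                                 'ADD005/04', 'ADD005/05']})

# Keep only the prefix before '/', then count occurrences of each prefix
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
```

Note that this counts every row, including the bare base codes, so the numbers differ from the expected table in the question (which, as noted above, appears inconsistent).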

Sort string columns with numbers in it in Pandas

I want to order my table by a column. The column is a string that has numbers in it, for example ASH11, ASH2, ASH1, etc. The problem is that sort_values does a "character" (lexicographic) sort, so the columns from the example will be ordered like this --> ASH1, ASH11, ASH2, and I want the order like this --> ASH1, ASH2, ASH11 (taking the trailing number into account).
I thought about taking the last characters of the string, but sometimes it would be only the last one and in other cases the last two. The other way around (taking characters from the beginning) doesn't work either, because the strings are not always the same length (e.g. in some cases the names are ASH1, ASGH22, ASHGT3, etc.).
Use the key parameter (new in pandas 1.1.0):
df.sort_values(by=['xxx'], key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2])))
Using list comprehension and regular expression:
>>> import pandas as pd
>>> import re #Regular expression
>>> a = pd.DataFrame({'label':['AS20H1','AS20H2','AS20H11','ASH1','ASGH22','ASHGT3']})
>>> a
label
0 AS20H1
1 AS20H2
2 AS20H11
3 ASH1
4 ASGH22
5 ASHGT3
r'(\d+)(?!.*\d)'
Matches the last number in a string
>>> a['sort_int'] = [ int(re.search(r'(\d+)(?!.*\d)',i).group(0)) for i in a['label']]
>>> a
label sort_int
0 AS20H1 1
1 AS20H2 2
2 AS20H11 11
3 ASH1 1
4 ASGH22 22
5 ASHGT3 3
>>> a.sort_values(by='sort_int',ascending=True)
label sort_int
0 AS20H1 1
3 ASH1 1
1 AS20H2 2
5 ASHGT3 3
2 AS20H11 11
4 ASGH22 22
You could also extract the trailing integer from your column and use it to sort your DataFrame (to_numeric keeps the column numeric rather than sorting digit strings lexicographically):
df["new_index"] = pd.to_numeric(df.yourColumn.str.extract(r'(\d+)$', expand=False))
df.sort_values(by=["new_index"], inplace=True)
In case you get some NA in your "new_index" column, you can use the na_position option of sort_values to choose where to put them (beginning or end).
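The key-parameter approach from the first answer can be sketched more compactly with a trailing-number extract, assuming every label ends in at least one digit (pandas >= 1.1):

```python
import pandas as pd

df = pd.DataFrame({'label': ['AS20H1', 'AS20H2', 'AS20H11', 'ASH1', 'ASGH22', 'ASHGT3']})

# key receives the whole column; extract the trailing digits and sort on them.
# kind='stable' keeps the original order for ties (AS20H1 and ASH1 both map to 1).
df_sorted = df.sort_values(
    by='label',
    key=lambda col: col.str.extract(r'(\d+)$', expand=False).astype(int),
    kind='stable')
```

This avoids creating (and later dropping) a helper column like sort_int.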

split one pandas column text to multiple columns

For example, I have one pandas column contain
text
A1V2
B2C7Z1
I want to split it into 26 (A-Z) columns holding the value that follows each letter; if a letter is missing, use -1.
So, it can be
text A B C D ... Z
A1V2 1 -1 -1 -1 ... -1
B2C7Z1 -1 2 7 -1 ... 1
Is there any fast way rather than using df.apply()?
Follow-up:
Thank you, Psidom, for the brilliant answer. When I ran the method on 4 million rows, it took an hour; I hope there's a way to make it faster. It seems str.extractall() is the most time-consuming part.
Try str.extractall with the regex (?P<key>[A-Z])(?P<value>[0-9]+), which extracts the key ([A-Z]) and value ([0-9]+) into separate columns; a long-to-wide transform should then get you there.
The regex matches a letter-plus-digits pattern, and the two named capture groups (the ?P<> syntax) become the columns key and value in the result.
And since extractall puts multiple matches into separate rows, you need to transform to wide format with unstack on the key column:
(df.text.str.extractall("(?P<key>[A-Z])(?P<value>[0-9]+)")
.reset_index('match', drop=True)
.set_index('key', append=True)
.value.unstack('key').fillna(-1))
#key A B C V Z
# 0 1 -1 -1 2 -1
# 1 -1 2 7 -1 1
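Since the question asks for all 26 A-Z columns with -1 for missing letters, a reindex over the full alphabet completes the sketch (astype(int) converts the extracted digit strings to integers):

```python
import string
import pandas as pd

df = pd.DataFrame({'text': ['A1V2', 'B2C7Z1']})

# Long format: one row per (letter, number) match, then pivot wide on the letter
wide = (df.text.str.extractall(r"(?P<key>[A-Z])(?P<value>[0-9]+)")
          .reset_index('match', drop=True)
          .set_index('key', append=True)
          .value.unstack('key'))

# Reindex over the full alphabet so absent letters become -1, as requested
wide = wide.reindex(columns=list(string.ascii_uppercase)).fillna(-1).astype(int)
```

The result has the original row index and exactly 26 columns A through Z.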
