split one pandas column text to multiple columns - python

For example, I have a pandas column that contains
text
A1V2
B2C7Z1
I want to split it into 26 columns (A-Z), where each letter's column holds the value that follows that letter, or -1 if the letter is missing.
So the result would be
text A B C D ... Z
A1V2 1 -1 -1 -1 ... -1
B2C7Z1 -1 2 7 -1 ... 1
Is there a fast way to do this without using df.apply()?
Followup:
Thanks to Psidom for the brilliant answer. When I ran the method on 4 million rows, it took 1 hour. I hope there is another way to make it faster. str.extractall() seems to be the most time-consuming step.

Try str.extractall with the regex (?P<key>[A-Z])(?P<value>[0-9]+), which extracts each key ([A-Z]) and value ([0-9]+) into separate columns; a long-to-wide transform then gets you there.
Here the regex (?P<key>[A-Z])(?P<value>[0-9]+) matches a letter followed by digits, and the two named capture groups (the ?P<name> syntax) become the columns key and value in the result.
Since extractall puts multiple matches into separate rows, you need to transform the result to wide format by unstacking the key column:
(df.text.str.extractall("(?P<key>[A-Z])(?P<value>[0-9]+)")
.reset_index('match', drop=True)
.set_index('key', append=True)
.value.unstack('key').fillna(-1))
#key A B C V Z
# 0 1 -1 -1 2 -1
# 1 -1 2 7 -1 1
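A minimal sketch of how the pipeline above could be completed to all 26 columns, using the two sample rows from the question. The reindex step and the integer conversion are additions that are not part of the original answer. The second variant, based on re.findall and a list comprehension, is only an assumption about what might help with the speed concern in the followup; it has not been timed here and would need to be benchmarked on real data.

```python
import re
import string

import pandas as pd

df = pd.DataFrame({"text": ["A1V2", "B2C7Z1"]})
letters = list(string.ascii_uppercase)

# Variant 1: extractall, then long-to-wide, then pad out to all 26 letters.
wide = (
    df["text"].str.extractall(r"(?P<key>[A-Z])(?P<value>[0-9]+)")
    .reset_index("match", drop=True)
    .set_index("key", append=True)["value"]
    .astype(int)
    .unstack("key")
    .reindex(index=df.index, columns=letters)  # add missing rows and letters
    .fillna(-1)
    .astype(int)
)

# Variant 2 (assumption, not from the answer): parse each string in plain
# Python and let the DataFrame constructor align the keys.
pattern = re.compile(r"([A-Z])([0-9]+)")
records = [{k: int(v) for k, v in pattern.findall(s)} for s in df["text"]]
fast = (
    pd.DataFrame(records, index=df.index)
    .reindex(columns=letters)
    .fillna(-1)
    .astype(int)
)

result = df.join(wide)
print(result[["text", "A", "B", "C", "V", "Z"]])
```

Both variants assume each letter occurs at most once per string; with repeated letters, unstack raises an error and the dict-based variant keeps the last value.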

Related

Splitting row values and count unique's from a DataFrame

I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a Dataframe that contains the single Reference code, plus a Count value, for example:
Reference Count
ABS052 3
ADA010 0
ADD005 2
... ...
WOO032 3
WOO045 4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to ditch the last two digits from the list in each row.
All I want now is to keep element [0] of the list in each row; then I could just call value_counts on the 'Reference' column.
There seems to be something wrong with the expected result listed in the question.
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value per prefix, this should do:
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
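To make the two variants above concrete, here is a small sketch run on a subset of the question's data. The sample rows are an abbreviated selection chosen for illustration, and the variable names parts, counts and highest are hypothetical.

```python
import pandas as pd

# Abbreviated sample based on the question's 'Reference' column.
df = pd.DataFrame({
    "Reference": [
        "ABS052", "ABS052/01", "ABS052/02",
        "ADA010/00",
        "ADD005", "ADD005/01", "ADD005/05",
    ]
})

# Column 0 holds the prefix, column 1 the suffix (missing when there is no "/").
parts = df["Reference"].str.split("/", expand=True)

# Variant 1: how often each prefix occurs.
counts = parts[0].value_counts()

# Variant 2: highest numeric suffix per prefix, treating a missing suffix as 00.
highest = parts.fillna("00").astype({0: str, 1: int}).groupby(0).max()

print(counts)
print(highest)
```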
You can just use regex to replace the last two digits like this:
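import pandas as pd
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df = df['a'].str.replace(r'/\d+$', '', regex=True).value_counts().reset_index()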
Output:
index a
0 ADD005 6
1 ABS052 3
2 ADA010 1
You are almost there, you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
1
0
ABS052 3
ADA010 1
ADD005 6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 without a "/", which would otherwise have None in the second column and be skipped by count().
Output to df with column names
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
output
Reference Counts
0 ADD005 6
1 ABS052 3
2 ADA010 1
Explanation - The first line gives a clean series called 'Reference'. The second line gives a count of unique items and then resets the index and renames the columns.

How to optimally update cells based on previous cell value / How to elegantly spread values of cell to other cells?

I have a "large" DataFrame table with index being country codes (alpha-3) and columns being years (1900 to 2000) imported via a pd.read_csv(...) [as I understand, these are actually string so I need to pass it as '1945' for example].
The values are 0,1,2,3.
I need to "spread" these values until the next non-0 for each row.
example : 0 0 1 0 0 3 0 0 2 1
becomes: 0 0 1 1 1 3 3 3 2 1
I understand that I should not iterate. My current implementation is something like the following; as you can see, it uses 2 loops, which is not optimal (I guess I could get rid of one by using apply(row)):
def spread_values(df):
    for idx in df.index:
        previous_v = 0
        for t_year in range(min_year, max_year):
            current_v = df.loc[idx, str(t_year)]
            if current_v == 0 and previous_v != 0:
                df.loc[idx, str(t_year)] = previous_v
            else:
                previous_v = current_v
However, I am told I should use apply(), vectorisation, or a list comprehension instead, because loops are not optimal.
The apply function, however, regardless of the axis, does not let me dynamically get the index/column (which I need in order to conditionally update the cell). I think the core reason I can't make the vectorised or list-comprehension options work is that I do not have a small fixed set of column names but a wide range of them (all the examples I see use a handful of named columns...).
What would be the more optimal / more elegant solution here?
OR are DataFrames not suited for my data at all? what should I use instead?
You can use df.replace(to_replace=0, method='ffill'). This fills every zero in your DataFrame (except zeros occurring at the start of a column) with the previous non-zero value in the same column.
Unfortunately, .replace() does not accept an axis argument, so to do it row-wise you can transpose your DataFrame, replace the zeros, and transpose it back: df.T.replace(0, method='ffill').T
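As far as I know, recent pandas releases deprecate the method argument of replace, so here is a sketch of an alternative that gives the same row-wise result without it: mask the zeros as missing values, forward-fill along the columns, and restore the leading zeros. The country code and the year range are made up for illustration; the values are the example row from the question.

```python
import pandas as pd

# One row per country, one string-labelled column per year (illustrative data).
years = [str(y) for y in range(1900, 1910)]
df = pd.DataFrame([[0, 0, 1, 0, 0, 3, 0, 0, 2, 1]], index=["FRA"], columns=years)

spread = (
    df.mask(df.eq(0))   # turn zeros into NaN so they count as gaps
      .ffill(axis=1)    # carry the last non-zero value forward along each row
      .fillna(0)        # leading zeros have nothing to carry, so restore them
      .astype(int)      # masking produced floats; go back to integers
)

print(spread.loc["FRA"].tolist())
```

This works on any number of columns because nothing refers to a column by name; it only relies on the columns being in chronological order.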

Sort string columns with numbers in it in Pandas

I want to sort my table by a column. The column holds strings that contain numbers, for example ASH11, ASH2, ASH1, etc. The problem is that sort_values sorts character by character, so the values from the example end up ordered like this --> ASH1, ASH11, ASH2. I want a numeric order like this --> AS20H1, AS20H2, AS20H11 (taking into account the last number).
I thought about taking the last characters of the string, but sometimes the number is only the last character and in other cases the last two. Going the other way (taking characters from the beginning) doesn't work either, because the strings are not always the same length (e.g. ASH1, ASGH22, ASHGT3, etc.).
Use the key parameter (new in pandas 1.1.0):
import re
df.sort_values(by=['xxx'], key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2])))
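A self-contained sketch of the key-based sort, assuming every label ends in a number. It uses str.extract instead of re.split, which is a substitution for illustration rather than the answer's exact expression; the column name label and the sample values are taken from the examples in this thread.

```python
import pandas as pd

df = pd.DataFrame({"label": ["AS20H11", "ASH2", "AS20H1", "ASGH22", "ASHGT3"]})

# The key function receives the whole column and must return a same-length
# Series; here it is the trailing number of each label, converted to int.
sorted_df = df.sort_values(
    by="label",
    key=lambda col: col.str.extract(r"(\d+)$", expand=False).astype(int),
)

print(sorted_df["label"].tolist())
```

Because the key is only used for ordering, no helper column is left behind in the DataFrame.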
Using list comprehension and regular expression:
>>> import pandas as pd
>>> import re #Regular expression
>>> a = pd.DataFrame({'label':['AS20H1','AS20H2','AS20H11','ASH1','ASGH22','ASHGT3']})
>>> a
label
0 AS20H1
1 AS20H2
2 AS20H11
3 ASH1
4 ASGH22
5 ASHGT3
r'(\d+)(?!.*\d)'
Matches the last number in a string
>>> a['sort_int'] = [ int(re.search(r'(\d+)(?!.*\d)',i).group(0)) for i in a['label']]
>>> a
label sort_int
0 AS20H1 1
1 AS20H2 2
2 AS20H11 11
3 ASH1 1
4 ASGH22 22
5 ASHGT3 3
>>> a.sort_values(by='sort_int',ascending=True)
label sort_int
0 AS20H1 1
3 ASH1 1
1 AS20H2 2
5 ASHGT3 3
2 AS20H11 11
4 ASGH22 22
You could extract the integers from your column and then use them to sort your DataFrame:
# take the trailing number and convert it, so the sort is numeric rather than lexicographic
df["new_index"] = df.yourColumn.str.extract(r'(\d+)$', expand=False).astype(float)
df.sort_values(by=["new_index"], inplace=True)
If you get some NA values in your "new_index" column, you can use the na_position option of sort_values to choose where to put them (beginning or end).

Python Split output column with fixed & dynamic length

I want to split a single DataFrame column into three columns. Sample input and output:
[(Col1=fix length), (Col2=dynamic length),( Col3= remaining part)]
import re
import pandas as pd
text='Raw Data'
out = re.findall(r"RIY-[A-Z]{6}-\d{6}\.\d{6,8}\.\d{5,7}", text)
df = pd.DataFrame(out, columns = ["RIY"])
df["col1"] = df.RIY.str[0:15]
df["col2"] = df.RIY.str[15:24]  # needs to split where the next '.' is found, less 2 characters
df["col3"] = df.RIY.str[24:]  # all remaining text after splitting off the second column
#Output
[1]: https://i.stack.imgur.com/Lupcd.png
I tried splitting at fixed positions (solution by Roy2012), which works perfectly only for the first part, [0:15]; the length varies for the remaining two columns. I want to split by finding the second dot ('.') and stepping back 2 characters (to avoid cutting off the 46).
Is this working for you?
df.RAW.str.extract(r"(.*)(\d\d\.\d+)(\d\d\.\d+)")
The output I get is:
0 1 2
0 RIY-OUHOMH-1002 24.534768 46.650127
1 RIY-OUHOHH-1017 24.51472 46.663988
2 RIY-OUHOMH-1004 24.532244 46.651758
3 RIY-OUHOHH-1007 24.529029 46.653571
4 RIY-OUHOHH-1006 24.530071 46.651934
5 RIY-OUHOHH-1005 24.531786 46.65279
6 RIY-OUHOMH-1001 24.535972 46.649456
7 RIY-DIRAHH-0151 24.495407 46.641877
8 RIY-DIRAHH-0152 24.494105 46.644253
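A runnable sketch of the same extract call with named groups, so the result has column names instead of 0, 1, 2. The two raw strings are reconstructed by concatenating the first two output rows above, which is an assumption about what the original input looked like; the column name RIY follows the question, and the group names col1, col2, col3 follow the question's intended layout.

```python
import pandas as pd

# Raw strings rebuilt from the output rows above (assumed input format).
df = pd.DataFrame({
    "RIY": [
        "RIY-OUHOMH-100224.53476846.650127",
        "RIY-OUHOHH-101724.5147246.663988",
    ]
})

# col1: everything before the first coordinate.
# col2, col3: two digits, a dot, then digits. The greedy first group leaves
# just enough at the end for the two coordinates to match.
parts = df["RIY"].str.extract(
    r"(?P<col1>.*)(?P<col2>\d\d\.\d+)(?P<col3>\d\d\.\d+)$"
)

# The coordinates come back as strings; convert them if numbers are needed.
parts[["col2", "col3"]] = parts[["col2", "col3"]].astype(float)

print(parts)
```

The pattern assumes both coordinates have exactly two digits before the decimal point, which holds for the sample rows but would need adjusting for other ranges.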

How to drop empty rows from a DataFrame when 'pd.notnull' does not work? Python

I have a DataFrame with two columns 'A' and 'B'. My goal is to delete rows where 'B' is empty. Others have recommended to use df[pd.notnull(df['B'])]. For example here: Python: How to drop a row whose particular column is empty/NaN?
However, somehow this does not work in this case. Why not and how to solve this?
A B
0 Lorema Ipsuma
1 Corpusa Dominusa
2 Loremb
3 Corpusc Dominusc
4 Loremd
5 Corpuse Dominuse
This is the desired result:
A B
0 Lorema Ipsuma
1 Corpusa Dominusa
2 Corpusc Dominusc
3 Corpuse Dominuse
Basically, you could have whitespaces, tabs or even a \n in these blank cells.
For all those cases, you can strip values first, and then remove the rows, i.e.
df[df.B.str.strip().ne("") & df.B.notnull()]
I believe this should cover all cases.
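A small sketch that reproduces the situation, assuming the "empty" cells in column B are a mix of whitespace-only strings and real missing values. Which cell holds which kind of blank is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["Lorema", "Corpusa", "Loremb", "Corpusc", "Loremd", "Corpuse"],
    # Row 2 holds whitespace only, row 4 holds a real NaN.
    "B": ["Ipsuma", "Dominusa", " \t\n", "Dominusc", np.nan, "Dominuse"],
})

# pd.notnull alone keeps row 2, because a whitespace string is not null.
only_notnull = df[pd.notnull(df["B"])]

# Strip first so whitespace-only cells become "", then also drop real NaN.
keep = df["B"].str.strip().ne("") & df["B"].notnull()
clean = df[keep].reset_index(drop=True)

print(clean)
```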
