Splitting row values and counting uniques from a DataFrame - python

I have the following data in a column titled Reference:
ABS052
ABS052/01
ABS052/02
ADA010/00
ADD005
ADD005/01
ADD005/02
ADD005/03
ADD005/04
ADD005/05
...
WOO032
WOO032/01
WOO032/02
WOO032/03
WOO045
WOO045/01
WOO045/02
WOO045/03
WOO045/04
I would like to know how to split the row values to create a Dataframe that contains the single Reference code, plus a Count value, for example:
Reference  Count
ABS052     3
ADA010     0
ADD005     2
...        ...
WOO032     3
WOO045     4
I have the following code:
df['Reference'] = df['Reference'].str.split('/')
Results in:
['ABS052'],
['ABS052','01'],
['ABS052','02'],
['ABS052','03'],
...
But I'm not sure how to ditch the last two digits from the list in each row.
All I want now is to keep the first element ([0]) of the list in each row, if that makes sense; then I could just retrieve a value_count from the 'Reference' column.

There seems to be something wrong with the expected result listed in the question.
Let's say you want to ditch the digits and count the prefix occurrences:
df.Reference.str.split("/", expand=True)[0].value_counts()
If instead the suffix means something and you want to keep the highest value, this should do:
df.Reference.str.split("/", expand=True).fillna("00").astype({0: str, 1: int}).groupby(0).max()
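For reference, a minimal, self-contained sketch of both options on a few of the sample values (the exact counts are only illustrative):
import pandas as pd

df = pd.DataFrame({'Reference': ['ABS052', 'ABS052/01', 'ABS052/02',
                                 'ADA010/00', 'ADD005', 'ADD005/01']})

# Option 1: count how often each prefix occurs
prefix_counts = df['Reference'].str.split('/', expand=True)[0].value_counts()
# ABS052    3
# ADD005    2
# ADA010    1

# Option 2: keep the highest suffix per prefix (rows without a suffix count as "00")
highest_suffix = (df['Reference'].str.split('/', expand=True)
                    .fillna('00')
                    .astype({0: str, 1: int})
                    .groupby(0)
                    .max())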

You can just use regex to replace the last two digits like this:
df = pd.DataFrame({'a':['ABS052','ABS052/01','ABS052/02','ADA010/00','ADD005','ADD005/01','ADD005/02','ADD005/03','ADD005/04','ADD005/05']})
df = df['a'].str.replace(r'\/\d+$', '', regex=True).value_counts().reset_index()
Output:
    index  a
0  ADD005  6
1  ABS052  3
2  ADA010  1

You are almost there; you can add expand=True to split and then use groupby:
df['Reference'].str.split("/", expand=True).fillna("--").groupby(0).count()
returns:
         1
0
ABS052   3
ADA010   1
ADD005   6
for the first couple of rows of your data.
The fillna("--") makes sure you also count lines like ABS052 that have no "/", i.e. where the second column is None.
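A self-contained reproduction of this approach on a few of the sample values, in case you want to try it (the "--" placeholder is arbitrary):
import pandas as pd

df = pd.DataFrame({'Reference': ['ABS052', 'ABS052/01', 'ABS052/02',
                                 'ADA010/00', 'ADD005', 'ADD005/01']})

counts = (df['Reference'].str.split('/', expand=True)
            .fillna('--')    # rows without a '/' get a placeholder in column 1
            .groupby(0)      # group on the prefix column
            .count())
#          1
# 0
# ABS052   3
# ADA010   1
# ADD005   2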

Output to a df with column names:
df['Reference'] = df['Reference'].str.split('/').str[0]
df_counts = df['Reference'].value_counts().rename_axis('Reference').reset_index(name='Counts')
output
  Reference  Counts
0    ADD005       6
1    ABS052       3
2    ADA010       1
Explanation - The first line gives a clean series called 'Reference'. The second line gives a count of unique items and then resets the index and renames the columns.

Related

Apply if else condition in specific pandas column by location

I am trying to apply a condition to a pandas column by location and am not quite sure how. Here is some sample data:
data = {'Pop':  [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
        'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6
#I would like to subtract 1 from PopDF['Pop2'] column cells 0-remainder.
#The remaining cells in the column I would like to stay as is (retain original pop values).
PopDF['Pop2']= PopDF['Pop2'].iloc[:(remainder)]-1
PopDF['Pop2'].iloc[(remainder):] = PopDF['Pop'].iloc[(remainder):]
The first line works to subtract 1 in the correct locations, however, the remaining cells become NaN. The second line of code does not work – the error is:
ValueError: Length of values (1) does not match length of index (8)
Instead of selecting the first N rows and subtracting from them, subtract the entire column and assign only its first values (rows 0 through remainder):
df.loc[:remainder, 'Pop2'] = df['Pop2'] - 1
Output:
>>> df
      Pop    Pop2
0  728375  728374
1  733355  733354
2  695395  695394
3  734658  734657
4  732811  732810
5  789396  789395
6  727761  727760
7  751967  751967
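For completeness, a runnable sketch using the question's PopDF (note that label-based .loc slicing is inclusive, so :remainder covers rows 0 through 6 here):
import pandas as pd

data = {'Pop':  [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967],
        'Pop2': [728375, 733355, 695395, 734658, 732811, 789396, 727761, 751967]}
PopDF = pd.DataFrame(data)
remainder = 6

# Subtract 1 only on the first rows; the remaining rows keep their original values.
PopDF.loc[:remainder, 'Pop2'] = PopDF['Pop2'] - 1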

extract values from column in dataframe

I have the following dataframe:
A
url/3gth33/item/PO151302
url/3jfj6/item/S474-3
url/dfhk34j/item/4964114989191
url/sdfkj3k4/place/9b81f6fd
url/as3f343d/thing/ecc539ec
I'm looking to extract anything with /item/ and its subsequent value.
The end result should be:
item
/item/PO151302
/item/S474-3
/item/4964114989191
Here is what I've tried:
df['A'] = df['A'].str.extract(r'(/item/\w+\D+\d+$)')
This is returning what I need, except for the integer-only values.
Based on the regex docs I'm reading this should grab all instances.
What am I missing here?
Use /item/.+ to match /item/ and anything after it. Also, if you put ?P<foo> at the beginning of a group, e.g. (?P<foo>...), the column for that matched group in the returned dataframe of captures will be named foo:
item = df['A'].str.extract('(?P<item>/item/.+)').dropna()
Output:
>>> item
item
0 /item/PO151302
1 /item/S474-3
2 /item/4964114989191
This is not a regex solution, but it could come in handy in some situations.
keyword = "/item/"
df["item"] = ((keyword + df["A"].str.split(keyword).str[-1]) *
              df["A"].str.contains(keyword))
which returns
                                A                 item
0       url/3gth33/item/PO151302       /item/PO151302
1          url/3jfj6/item/S474-3         /item/S474-3
2  url/dfhk34j/item/4964114989191  /item/4964114989191
3     url/sdfkj3k4/place/9b81f6fd
4     url/as3f343d/thing/ecc539ec
And in case you want only the rows where item is not empty, you could use:
df[df["item"].ne("")][["item"]]

Sort string columns with numbers in it in Pandas

I want to order my table by a column. The column is a string that has numbers in it, for example ASH11, ASH2, ASH1, etc. The problem is that sort_values does a "character" (lexicographic) sort, so the values from the example will be ordered like this --> ASH1, ASH11, ASH2. I want the order to be like this --> AS20H1, AS20H2, AS20H11 (taking into account the last number).
I thought about taking the last characters of the string, but sometimes it would be only the last one and in other cases the last two. Going the other way around (taking the characters from the beginning) doesn't work either, because the strings are not always of the same length (i.e. in some cases the name is ASH1, ASGH22, ASHGT3, etc.).
Use the key parameter (new in pandas 1.1.0):
df.sort_values(by=['xxx'], key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2])))
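A self-contained sketch of the key approach (requires pandas >= 1.1; the 'label' column name here is made up for the example):
import re
import pandas as pd

df = pd.DataFrame({'label': ['AS20H1', 'AS20H2', 'AS20H11', 'ASH1', 'ASGH22', 'ASHGT3']})

# Sort by the last number embedded in each string.
df_sorted = df.sort_values(
    by='label',
    key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2]))
)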
Using a list comprehension and a regular expression:
>>> import pandas as pd
>>> import re #Regular expression
>>> a = pd.DataFrame({'label':['AS20H1','AS20H2','AS20H11','ASH1','ASGH22','ASHGT3']})
>>> a
label
0 AS20H1
1 AS20H2
2 AS20H11
3 ASH1
4 ASGH22
5 ASHGT3
r'(\d+)(?!.*\d)'
Matches the last number in a string
>>> a['sort_int'] = [ int(re.search(r'(\d+)(?!.*\d)',i).group(0)) for i in a['label']]
>>> a
label sort_int
0 AS20H1 1
1 AS20H2 2
2 AS20H11 11
3 ASH1 1
4 ASGH22 22
5 ASHGT3 3
>>> a.sort_values(by='sort_int',ascending=True)
label sort_int
0 AS20H1 1
3 ASH1 1
1 AS20H2 2
5 ASHGT3 3
2 AS20H11 11
4 ASGH22 22
You could maybe extract the integers from your column and then use them to sort your DataFrame:
df["new_index"] = df.yourColumn.str.extract(r'(\d+)')
df.sort_values(by=["new_index"], inplace=True)
In case you get some NA in your "new_index" column, you can use the na_position option of sort_values to choose where to put them (beginning or end).
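If it is the last number in the string that should drive the ordering (as in the question), a hedged variant is to anchor the extraction at the end of the string and cast to int so the sort is numeric rather than lexicographic; again 'label' is a made-up column name:
import pandas as pd

df = pd.DataFrame({'label': ['AS20H1', 'AS20H2', 'AS20H11', 'ASH1', 'ASGH22', 'ASHGT3']})

# Extract the trailing digits as an integer, then sort on them.
df['new_index'] = df['label'].str.extract(r'(\d+)$', expand=False).astype(int)
df = df.sort_values(by='new_index')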

Filter Dataframe by using ~isin([list_of_substrings])

Given a dataframe full of emails, I want to filter out rows containing potentially blocked domain names or clearly fake emails. The dataframe below represents an example of my data.
>> print(df)
email number
1 fake#fake.com 2
2 real.email#gmail.com 1
3 no.email#email.com 5
4 real#yahoo.com 2
5 rich#money.com 1
I want to filter by two lists. The first list is fake_lst = ['noemail', 'noaddress', 'fake', ... 'no.email'].
The second list is just the set from disposable_email_domains import blocklist converted to a list (or kept as a set).
When I use df = df[~df['email'].str.contains('noemail')] it works fine and filters out that entry. Yet when I do df = df[~df['email'].str.contains(fake_lst)] I get TypeError: unhashable type: 'list'.
The obvious answer is to use df = df[~df['email'].isin(fake_lst)] as in many other stackoverflow questions, like Filter Pandas Dataframe based on List of substrings or pandas filtering using isin function but that ends up having no effect.
I suppose I could use str.contains('string') for each possible list entry, but that is ridiculously cumbersome.
Therefore, I need to filter this dataframe based on the substrings contained in the two lists, such that any row whose email contains one of those substrings is removed.
In the example above, the dataframe after filtering would be:
>> print(df)
email number
2 real.email#gmail.com 1
4 real#yahoo.com 2
5 rich#money.com 1
Here is a potential solution, assuming you have the following df and fake_lst:
df = pd.DataFrame({
    'email': ['fake#fake.com', 'real.email#gmail.com', 'no.email#email.com',
              'real#yahoo.com', 'rich#money.com'],
    'number': [2, 1, 5, 2, 1]
})
fake_lst = ['fake', 'money']
Option 1:
Filter out rows that have any of the fake_lst words in email with apply:
df.loc[
    ~df['email'].apply(lambda x: any([i in x for i in fake_lst]))
]
email number
1 real.email#gmail.com 1
2 no.email#email.com 5
3 real#yahoo.com 2
Option 2:
Filter out without apply
df.loc[
    [not any(i) for i in zip(*[df['email'].str.contains(word) for word in fake_lst])]
]
email number
1 real.email#gmail.com 1
2 no.email#email.com 5
3 real#yahoo.com 2
Use Series.isin to check whether each element in the Series is contained in values. Another issue is that your fake list contains only the name without the domain, so you need str.split to remove the part you are not matching against.
Note: str.contains tests whether a pattern or regex is contained within each string of a Series, hence df['email'].str.contains('noemail') works fine, but it does not accept a list.
df[~df['email'].str.split('#').str[0].isin(fake_lst)]
email number
0 fake#fake.com 2
1 real.email#gmail.com 1
3 real#yahoo.com 2
4 rich#money.com 1
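Another common idiom for the "list of substrings" case is to join the list into a single regex pattern for str.contains; a minimal sketch (re.escape guards against regex metacharacters such as the dot in 'no.email'):
import re
import pandas as pd

df = pd.DataFrame({
    'email': ['fake#fake.com', 'real.email#gmail.com', 'no.email#email.com',
              'real#yahoo.com', 'rich#money.com'],
    'number': [2, 1, 5, 2, 1]
})
fake_lst = ['noemail', 'noaddress', 'fake', 'no.email']

pattern = '|'.join(re.escape(word) for word in fake_lst)
filtered = df[~df['email'].str.contains(pattern)]
#                   email  number
# 1  real.email#gmail.com       1
# 3        real#yahoo.com       2
# 4        rich#money.com       1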

Combine paired rows after pandas groupby, give NaN value if ID didn't occur twice in df

I have a single dataframe containing an ID column id, and I know that the ID will exist in either exactly one row ('mismatched') or exactly two rows ('matched') in the dataframe.
In order to select the mismatched rows and the pairs of matched rows I can use a groupby on the ID column.
Now for each group, I want to take some columns from the second (pair) row, rename them, and copy them to the first row. I can then discard all the second rows and return a single dataframe containing all the modified first rows (for each and every group).
Where there is no second row (mismatched) - it's fine to put NaN in its place.
To illustrate this, see the table below: id=1 and id=3 are matched pairs, but id=2 is mismatched:
entity id partner value
A 1 B 200
B 1 A 300
A 2 B 600
B 3 C 350
C 3 B 200
The resulting transformation should leave me with the following:
entity id partner entity_value partner_value
A 1 B 200 300
A 2 B 600 NaN
B 3 C 350 200
What's baffling me is how to come up with a generic way of getting the matching partner_value from row 2, copied into row 1 after the groupby, in a way that also works when there is no matching id.
Solution (this was tricky):
dfg = df.groupby('id', sort=False)
# Create 'entity', 'id', 'partner', 'entity_value' from the first row...
df2 = dfg[['entity', 'id', 'partner', 'value']].first().rename(columns={'value': 'entity_value'})
# Now insert 'partner_value' from those groups that have a second row...
df2['partner_value'] = np.nan
df2['partner_value'] = dfg['value'].nth(n=1)
   entity  id partner  entity_value  partner_value
id
1       A   1       B           200          300.0
2       A   2       B           600            NaN
3       B   3       C           350          200.0
This was tricky to get working. The short answer is that although pd.groupby(...).agg(...) in principle allows you to specify a list of tuples of (column, aggregate_function), which you could then chain into a rename, that won't work here, since we're trying to do two separate aggregate operations on the same value column and rename both of their results (you get pandas.core.base.SpecificationError: Function names must be unique, found multiple named value).
Other complications:
We can't directly use groupby.nth(n) which sounds useful at first glance, except it's only on a DataFrame not a Series like df['value'], and also it silently drops groups which don't have an n'th element, not what we want. (But it does keep the index, so we can use it by first initializing the column as all-NaNs, then selectively inserting on that column, as above).
In any case the pd.groupby.agg() syntax won't even let you call nth() by just passing 'nth' as the agg_func name, since nth() is missing its n argument; you'd have to declare a lambda.
I tried defining the following function second_else_nan to use inside an agg() as above, but after much struggling I couldn't get it to work, for multiple reasons, only one of which is that you can't do two aggs on the same column:
Code:
def second_else_nan(v):
    if v.size == 2:
        return v[1]
    else:
        return np.nan
(i.e. the equivalent on a list of the dict.get(key, default) builtin)
Here is what I would do. First, get the first value of each group:
df_grouped = df.reset_index().groupby('id').agg("first")
Then retrieve the values that are duplicated and insert them:
df_grouped["partner_value"] = df.groupby("id")["value"].agg("last")
The only thing is that, where an id is not duplicated, you get the same value repeated (instead of a NaN).
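One way to turn that repeated value back into NaN, under the question's assumption that an id occurs at most twice, is to blank out the groups of size 1 afterwards; a sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'entity': ['A', 'B', 'A', 'B', 'C'],
                   'id': [1, 1, 2, 3, 3],
                   'partner': ['B', 'A', 'B', 'C', 'B'],
                   'value': [200, 300, 600, 350, 200]})

df_grouped = df.groupby('id').agg('first').rename(columns={'value': 'entity_value'})
df_grouped['partner_value'] = df.groupby('id')['value'].agg('last')

# Where an id occurred only once, 'last' equals 'first'; replace it with NaN.
sizes = df.groupby('id').size()
df_grouped.loc[sizes == 1, 'partner_value'] = np.nan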
What about something like this?
grouped = df.groupby("id")
first_values = grouped.agg("first")
sums = grouped.agg("sum")
first_values["partner_value"] = sums["value"] - first_values["value"]
first_values["partner_value"].replace(0, np.nan, inplace=True)
transformed_df = first_values.copy()
Group the data by id, take the first row of each group, take the sum of the 'value' column for each group, and subtract the first row's 'value' from that sum. Then replace 0's in the resulting column with np.nan (making the assumption here that the 'value' column is never 0).
