How to sort an alphanumeric field in pandas? - python

I have a dataframe and the first column contains ids. How do I sort the first column when it contains alphanumeric data, such as:
id = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N","6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399" ]
Expected result is
id = ["5566FT6N", "6LDFTLL9", "6P4EF7BB", "6RHSDD46", "6UVSCF4H", "6VPZ4T5P", "6YYPH399", "7SKDEZWE", "EHYXE34N", "N9RFERBG", "TT56GTN2" ]

You can use the list's .sort() method:
>>> id.sort()
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
This sorts the list in place. If you don't want to change the original id list, use the built-in sorted() function instead:
>>> sorted(id)
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
>>> id
['6LDFTLL9', 'N9RFERBG', '6RHSDD46', '6UVSCF4H', '7SKDEZWE', '5566FT6N', '6VPZ4T5P', 'EHYXE34N', '6P4EF7BB', 'TT56GTN2', '6YYPH399']
Notice, with this one, that id is unchanged.
For a DataFrame, you want to use sort_values():
df.sort_values(0, inplace=True)
Here 0 is the column label (the default integer label for an unnamed column); you can also pass the column name instead (e.g. 'id').
0
5 5566FT6N
0 6LDFTLL9
8 6P4EF7BB
2 6RHSDD46
3 6UVSCF4H
6 6VPZ4T5P
10 6YYPH399
4 7SKDEZWE
7 EHYXE34N
1 N9RFERBG
9 TT56GTN2
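For reference, a minimal runnable sketch of the DataFrame case (the column name 'id' is my assumption; the question doesn't name it):
import pandas as pd

ids = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N",
       "6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399"]
df = pd.DataFrame({"id": ids})  # 'id' is an assumed column name
# plain lexicographic sort; in ASCII, digits sort before letters,
# which matches the expected result above
df = df.sort_values("id").reset_index(drop=True)
print(df)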

Related

How to unlist a list in dataframe column?

I have a dataframe column codes as below:
codes
-----
[K70, X090a2, T8a981,X090a2]
[A70, X90a2, T8a91,A70,A70]
[B70, X09a2, T8a81]
[C70, X00a2, T8981,X00a2,C70]
I want output like this in a dataframe.
I need to check for duplicates, return only the unique values, and then unlist.
I used dict.fromkeys(z1['codes']) because dict keys cannot contain duplicates,
and I also tried a for loop with a count, but I didn't get the expected results.
output column:
codes
-----
K70 X090a2 T8a981
A70 X90a2 T8a91
B70 X09a2 T8a81
C70 X00a2 T8981
If the column contains lists, deduplicate with dict.fromkeys (which preserves order) and then join with whitespace:
#if values are strings
#z1['codes'] = z1['codes'].str.strip('[]').str.split(',\s*')
z1['codes'] = z1['codes'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
print (z1)
codes
0 K70 X090a2 T8a981
1 A70 X90a2 T8a91
2 B70 X09a2 T8a81
3 C70 X00a2 T8981
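For a self-contained run, a minimal sketch (building z1 this way is my reconstruction; the question only shows the column's contents):
import pandas as pd

z1 = pd.DataFrame({'codes': [['K70', 'X090a2', 'T8a981', 'X090a2'],
                             ['A70', 'X90a2', 'T8a91', 'A70', 'A70'],
                             ['B70', 'X09a2', 'T8a81'],
                             ['C70', 'X00a2', 'T8981', 'X00a2', 'C70']]})
# dict.fromkeys keeps the first occurrence of each code and preserves order
z1['codes'] = z1['codes'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
print(z1)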
set will remove duplicates from a list and join will unlist the list into a whitespace-separated string. Note that, unlike dict.fromkeys, a set does not guarantee the original order, and remember to assign the result back:
z1['codes'] = z1['codes'].apply(lambda code: " ".join(set(code)))
print (z1)
codes
0 K70 X090a2 T8a981
1 A70 X90a2 T8a91
2 B70 X09a2 T8a81
3 C70 X00a2 T8981

How to split DataFrame columns into multiple rows?

I am trying to convert multiple columns to multiple rows. Can someone please offer some advice?
I have DataFrame:
id . values
1,2,3,4 [('a','b'), ('as','bd'),'|',('ss','dd'), ('ws','ee'),'|',('rr','rt'), ('tt','yy'),'|',('yu','uu'), ('ii','oo')]
I need it to look like this:
ID Values
1 ('a','b'), ('as','bd')
2 ('ss','dd'), ('ws','ee')
3 ('rr','rt'), ('tt','yy')
4 ('yu','uu'), ('ii','oo')
I have tried groupby, split, izip. Maybe I am not doing it the right way?
I made a quick and dirty example of how you could parse this dataframe:
# example dataframe
df = [
    "1,2,3,4",
    [('a','b'), ('as','bd'), '|', ('ss','dd'), ('ws','ee'), '|', ('rr','rt'), ('tt','yy'), '|', ('yu','uu'), ('ii','oo')]
]
# split ids by comma
ids = df[0].split(",")
# init Id and Items as int and dict()
Id = 0
Items = dict()
# prepare an empty list for each id
for i in ids:
    Items[i] = []
# insert data: tuples belong to the current id, a '|' string advances to the next id
for i in df[1]:
    if isinstance(i, tuple):
        Items[ids[Id]].append(i)
    elif isinstance(i, str):
        Id += 1
# print data as written in the question
print("id .\tvalues")
for item in Items:
    print("{}\t{}".format(item, Items[item]))
I came up with a quite concise solution based on multi-level grouping, which in my opinion is largely pandasonic.
Start by defining the following function, which "splits" a Series taken from an individual values element into a sequence of list representations without the surrounding [ and ]; the split occurs at each '|' element:
def fn(grp1):
    grp2 = (grp1 == '|').cumsum()
    return grp1[grp1 != '|'].groupby(grp2).apply(lambda x: repr(list(x))[1:-1])
(it will be used a bit later).
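Throughout the steps below I assume the source DataFrame was built roughly like this (my reconstruction; the question doesn't show the construction code):
import pandas as pd

df = pd.DataFrame({
    'id': ["1,2,3,4"],
    'values': [[('a','b'), ('as','bd'), '|', ('ss','dd'), ('ws','ee'), '|',
                ('rr','rt'), ('tt','yy'), '|', ('yu','uu'), ('ii','oo')]],
})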
The first step of processing is to convert the id column into a Series:
sId = df.id.apply(lambda x: pd.Series(x.split(','))).stack().rename('ID')
For your data the result is:
0  0    1
   1    2
   2    3
   3    4
Name: ID, dtype: object
The first level of MultiIndex is the index of the source row and the second
level are consecutive numbers (within the current row).
Now it's time to perform similar conversion of values column:
sVal = pd.DataFrame(df['values'].values.tolist(), index= df.index)\
.stack().groupby(level=0).apply(fn).rename('Values')
The result is:
0  0    ('a', 'b'), ('as', 'bd')
   1    ('ss', 'dd'), ('ws', 'ee')
   2    ('rr', 'rt'), ('tt', 'yy')
   3    ('yu', 'uu'), ('ii', 'oo')
Name: Values, dtype: object
Note that the MultiIndex above has the same structure as in the case of sId.
And the last step is to concat both these partial results:
result = pd.concat([sId, sVal], axis=1).reset_index(drop=True)
The result is:
ID Values
0 1 ('a', 'b'), ('as', 'bd')
1 2 ('ss', 'dd'), ('ws', 'ee')
2 3 ('rr', 'rt'), ('tt', 'yy')
3 4 ('yu', 'uu'), ('ii', 'oo')

Pandas Group by sum of all the values of the group and another column as comma separated

I want to group by one column (tag) and sum the corresponding quantities (qty). The related reference numbers should be collected into a comma-separated column.
import pandas as pd
tag = ['PO_001045M100960','PO_001045M100960','PO_001045MSP2526','PO_001045M870191', 'PO_001045M870191', 'PO_001045M870191']
reference= ['PA_000003', 'PA_000005', 'PA_000001', 'PA_000002', 'PA_000004', 'PA_000009']
qty=[4,2,2,1,1,1]
df = pd.DataFrame({'tag' : tag, 'reference':reference, 'qty':qty})
tag reference qty
PO_001045M100960 PA_000003 4
PO_001045M100960 PA_000005 2
PO_001045MSP2526 PA_000001 2
PO_001045M870191 PA_000002 1
PO_001045M870191 PA_000004 1
PO_001045M870191 PA_000009 1
If I use df.groupby('tag')['qty'].sum().reset_index(), I am getting the following result.
tag qty
PO_001045M100960 6
PO_001045M870191 3
PO_001045MSP2526 2
I need an additional column where the reference numbers are collected under their respective tags, like:
tag qty reference
PO_001045M100960 6 PA_000003, PA_000005
PO_001045M870191 3 PA_000002, PA_000004, PA_000009
PO_001045MSP2526 2 PA_000001
How can I achieve this?
Thanks.
Use pandas.DataFrame.groupby.agg:
df.groupby('tag').agg({'qty': 'sum', 'reference': ', '.join})
Output:
reference qty
tag
PO_001045M100960 PA_000003, PA_000005 6
PO_001045M870191 PA_000002, PA_000004, PA_000009 3
PO_001045MSP2526 PA_000001 2
Note: if the reference column is numeric, ', '.join will not work. In that case, use lambda x: ', '.join(str(i) for i in x) instead.
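If you also want tag back as a regular column, with qty before reference as in the expected output, a small follow-up sketch:
out = df.groupby('tag', as_index=False).agg({'qty': 'sum', 'reference': ', '.join})
out = out[['tag', 'qty', 'reference']]  # enforce the column order from the question
print(out)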

Python pandas split column with NaN values

Hello my dear coders,
I'm new to coding and I've stumbled upon a problem. I want to split a column of a CSV file that I have imported via pandas in Python. The column is named CATEGORY and contains 1, 2, or 3 values separated by a comma (e.g. 2343, 3432, 4959). Now I want to split these values into separate columns named CATEGORY, SUBCATEGORY and SUBSUBCATEGORY.
I have tried this line of code:
products_combined[['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY']] = products_combined.pop('CATEGORY').str.split(expand=True)
But I get this error: ValueError: Columns must be same length as key
Would love to hear your feedback <3
You need:
pd.DataFrame(df.CATEGORY.str.split(',').tolist(), columns=['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY'])
Output:
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 2343 3432 4959
I think this could be accomplished by creating three new columns and assigning each to a lambda function applied to the CATEGORY column (note this assumes CATEGORY already holds lists, e.g. after .str.split(',')). Like so:
products_combined['SUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[1] if len(original) > 1 else None)
products_combined['SUBSUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[2] if len(original) > 2 else None)
products_combined['CATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original[0])
The apply() method called on a series returns a new series that contains the result of running the passed function (in this case, the lambda function) on each row of the original series.
IIUC, use split and then Series:
(
    df[0].apply(lambda x: pd.Series(x.split(",")))
    .rename(columns={0: "CATEGORY", 1: "SUBCATEGORY", 2: "SUBSUBCATEGORY"})
)
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 1 NaN NaN
2 44 55 NaN
Data:
d = [["2343,3432,4959"],["1"],["44,55"]]
df = pd.DataFrame(d)
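For what it's worth, the original one-liner also works once the split is told to use commas and to expand, assuming at least one row actually has three parts (shorter rows are padded with None):
products_combined[['CATEGORY', 'SUBCATEGORY', 'SUBSUBCATEGORY']] = (
    products_combined.pop('CATEGORY').str.split(',', expand=True)
)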

Pandas - Change value in column based on its relationship with another column

I am working with the sklearn.datasets.fetch_20newsgroups() dataset. Here, there are some documents that belong to more than one news group. I want to treat those documents as two different entities that each belong to one news group. To do this, I've brought the document IDs and group names into a dataframe.
import os
import pandas as pd
from sklearn import datasets

data = datasets.fetch_20newsgroups()
filepaths = data.filenames.astype(str)
keys = []
for path in filepaths:
    keys.append(os.path.split(path)[1])

groups = pd.DataFrame(keys, columns=['Document_ID'])
groups['Group'] = data.target
groups.head()
>> Document_ID Group
0 102994 7
1 51861 4
2 51879 4
3 38242 1
4 60880 14
print (len(groups))
>>11314
print (len(groups['Document_ID'].drop_duplicates()))
>>9840
print (len(groups['Group'].drop_duplicates()))
>>20
For each Document_ID, I want to change its value if it has more than one Group number assigned. Example,
groups[groups['Document_ID']=='76139']
>> Document_ID Group
5392 76139 6
5680 76139 17
I want this to become:
>> Document_ID Group
5392 76139 6
5680 12345 17
Here, 12345 is a random new ID that is not already in the keys list.
How can I do this?
You can find all the rows that contain a duplicate Document_ID after the first with the duplicated method. Then create a list of new ids beginning at one more than the max id. Use the loc indexing operator to overwrite the duplicate keys with the new ids.
groups['Document_ID'] = groups['Document_ID'].astype(int)
dupes = groups.Document_ID.duplicated(keep='first')
max_id = groups.Document_ID.max() + 1
new_id = range(max_id, max_id + dupes.sum())
groups.loc[dupes, 'Document_ID'] = new_id
Test case
groups.loc[[5392,5680]]
Document_ID Group
5392 76139 6
5680 179489 17
Ensure that no duplicates remain.
groups.Document_ID.duplicated(keep='first').any()
False
Kinda Hacky, but why not!
data = {"Document_ID": [102994,51861,51879,38242,60880,76139,76139],
"Group": [7,1,3,4,4,6,17],
}
groups = pd.DataFrame(data)
groupDict ={}
tempLst=[]
#Create a list of unique ID's
DocList = groups['Document_ID'].unique()
DocList.tolist()
#Build a dictionary and push all group ids to the correct doc id
DocDict = {}
for x in DocList:
DocDict[x] = []
for index, row in groups.iterrows():
DocDict[row['Document_ID']].append(row['Group'])
#For all doc Id's with multip entries create a new id with the group id as a decimal point.
groups['DupID'] = groups['Document_ID'].apply(lambda x: len(DocDict[x]))
groups["Document_ID"] = np.where(groups['DupID'] > 1, groups["Document_ID"] + groups["Group"]/10,groups["Document_ID"])
Hope that helps... (one caveat: Group/10 only yields a unique decimal suffix while group ids stay in single digits; with the 20 newsgroups, e.g. 76139 + 17/10 = 76140.7, the decimal spills into the integer part, so the duplicated-based answer above is the safer option).
