How to unlist a list in a dataframe column? - python

I have a dataframe column codes as below:
codes
-----
[K70, X090a2, T8a981,X090a2]
[A70, X90a2, T8a91,A70,A70]
[B70, X09a2, T8a81]
[C70, X00a2, T8981,X00a2,C70]
I want output like this in a dataframe:
I need to check for duplicates, keep only the unique values, and then unlist each list into a single string.
I used dict.fromkeys(z1['codes']) because dict keys cannot contain duplicates, and I also tried a for loop with a count, but neither gave the expected result.
output column:
codes
-----
K70 X090a2 T8a981
A70 X90a2 T8a91
B70 X09a2 T8a81
C70 X00a2 T8981

If the column contains lists, deduplicate each list with dict.fromkeys (which keeps the original order) and then join by whitespace:
# if values are strings, convert them to lists first
# z1['codes'] = z1['codes'].str.strip('[]').str.split(r',\s*')
z1['codes'] = z1['codes'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
print (z1)
codes
0 K70 X090a2 T8a981
1 A70 X90a2 T8a91
2 B70 X09a2 T8a81
3 C70 X00a2 T8981
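If the column actually holds strings such as "[K70, X090a2, T8a981,X090a2]" rather than lists, the commented line above does the conversion first. A minimal, self-contained sketch with the sample data (the z1 variable name is assumed from the question):
import pandas as pd

z1 = pd.DataFrame({'codes': ['[K70, X090a2, T8a981,X090a2]',
                             '[A70, X90a2, T8a91,A70,A70]',
                             '[B70, X09a2, T8a81]',
                             '[C70, X00a2, T8981,X00a2,C70]']})
# strip the brackets and split on commas with optional whitespace
z1['codes'] = z1['codes'].str.strip('[]').str.split(r',\s*')
# deduplicate while keeping order, then join with spaces
z1['codes'] = z1['codes'].apply(lambda x: ' '.join(dict.fromkeys(x)))
print(z1)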

set() will remove the duplicates from each list, and join will unlist the list into a whitespace-separated string. Note that a set does not preserve the original order, so the codes may come out in a different order than shown:
z1['codes'] = z1['codes'].apply(lambda code: " ".join(set(code)))
print (z1)
codes
0 K70 X090a2 T8a981
1 A70 X90a2 T8a91
2 B70 X09a2 T8a81
3 C70 X00a2 T8981

Related

How to iterate through rows which contain text and create bigrams using python

In an Excel file I have 5 columns and 20 rows, one of which contains text data as shown below.
df['Content'] contains:
0 this is the final call
1 hello how are you doing
2 this is me please say hi
..
.. and so on
I want to create bigrams while the text remains attached to its original table.
I tried applying the function below to iterate through the rows:
def find_bigrams(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append(input_list[1:])
    return bigram_list
And tried applying the result back to the table using:
df['Content'] = df['Content'].apply(find_bigrams)
But I am getting the following instead:
0 None
1 None
2 None
I am expecting the output as below
Company Code Content
0 xyz uh-11 (this,is),(is,the),(the,final),(final,call)
1 abc yh-21 (hello,how),(how,are),(are,you),(you,doing)
Your input_list is not actually a list, it's a string.
Try the function below:
def find_bigrams(input_text):
    input_list = input_text.split(" ")
    # pair each word with the next one; zip already yields tuples
    bigram_list = list(zip(input_list[:-1], input_list[1:]))
    return bigram_list
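Applying the corrected function back to the frame then produces the tuples from the question's expected output. A small reproduction (Company/Code values taken from the question's sample):
import pandas as pd

df = pd.DataFrame({
    'Company': ['xyz', 'abc'],
    'Code': ['uh-11', 'yh-21'],
    'Content': ['this is the final call', 'hello how are you doing'],
})
df['Content'] = df['Content'].apply(find_bigrams)  # uses find_bigrams defined above
print(df)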
Alternatively, you can use itertools.permutations(); the [::len(x)] stride picks out only the consecutive pairs from the full list of permutations:
import itertools
df['Content'].str.split().map(lambda x: list(itertools.permutations(x, 2))[::len(x)])

Pandas: remove duplicates within a list of values and identify ids that share the same values

I have a pandas dataframe.
It used to have duplicate test_no values, so I removed the duplicates with
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(x.split(','))))
but as you can see the duplicates are still there; I think it's due to extra spaces, and I want to clean them up.
Part 1:
my_id test_no
0 10000000000055910 461511, 461511
1 10000000000064510 528422
2 10000000000064222 528422,528422 , 528421
3 10000000000161538 433091.0, 433091.0
4 10000000000231708 nan,nan
Expected Output
my_id test_no
0 10000000000055910 461511
1 10000000000064510 528422
2 10000000000064222 528422, 528421
3 10000000000161538 433091.0
4 10000000000231708 nan
Part 2:
I also want to check if any of the "my_id" share any of the test_no ;
for example :
my_id matched_myid
10000000000064222 10000000000064510
You can use a regex to split:
import re
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(re.split(r',\s*', x))))
# or
df['test_no'] = [','.join(set(re.split(r',\s*', x))) for x in df['test_no']]
If you want to keep the original order use dict.fromkeys in place of set.
If the duplicates are successive you can also use:
df['test_no'] = df['test_no'].str.replace(r'([^,\s]+),\s*\1', r'\1', regex=True)
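For Part 2 (finding my_id values that share a test_no), one possible sketch, assuming df already holds the cleaned test_no strings from Part 1, is to explode the values and self-merge on them:
import re
import pandas as pd

# split the cleaned strings into one row per test_no value
exploded = (df.assign(test_no=df['test_no'].apply(lambda x: re.split(r',\s*', x)))
              .explode('test_no'))
exploded = exploded[exploded['test_no'] != 'nan']            # ignore missing values
# self-merge on test_no to find ids that share a value
pairs = exploded.merge(exploded, on='test_no', suffixes=('', '_matched'))
pairs = pairs[pairs['my_id'] != pairs['my_id_matched']]      # drop self-matches
print(pairs[['my_id', 'my_id_matched']].drop_duplicates())   # each pair appears in both directions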

Check if a string is present in multiple lists

I am trying to categorize a dataset based on the strings that contain the names of the different objects in the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category'], the Category and Sub_Category columns are empty.
For each row I would like to check in different lists of words if the name of the object contains at least one word in one of the list. Based on this first check I would like to attribute a value to the category column. If it finds more than 1 word in 2 different lists I would like to attribute 2 values to the object in the category column.
Moreover, I would like to be able to identify which word has been checked in which list in order to attribute a value to the sub_category column.
Until now, I have been able to do it with only one list, but I am not able to identify which word matched, and the code takes very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']):
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['Name'] = ['vitrine murale vintage', 'commode ancienne', 'lustre antique', 'solex',
              'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet', 'table', 'commode', 'lit']
vehicle_check = ['solex', 'voiture', 'moto', 'scooter']
art_check = ['tableau', 'scuplture', 'tapisserie']

for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name": ['vitrine murale vintage', 'commode ancienne', 'lustre antique', 'solex',
                            'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet', 'table', 'commode', 'lit']
vehicle_check = ['solex', 'voiture', 'moto', 'scooter']
art_check = ['tableau', 'scuplture', 'tapisserie']

# put categories into a dataframe
dfcat = pd.DataFrame([{"category": "furniture", "values": furniture_check},
                      {"category": "vechile", "values": vehicle_check},
                      {"category": "art", "values": art_check}])

# turn the space-delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             # explode the list so it can be used in a join; reset_index() keeps a copy of the original index
             .explode("name").reset_index()
             # merge the exploded names on both sides
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # where there are multiple categories, make them a list
             .groupby("index", as_index=False).agg({"category": lambda s: list(s)})
             # put the original index back
             .set_index("index")
             )
# a simple join gives the names and the list of associated categories
df.join(dfcatlist)
                     name       category
0  vitrine murale vintage            nan
1        commode ancienne  ['furniture']
2          lustre antique            nan
3                   solex    ['vechile']
4     sculpture médievale            nan
5           jante voiture    ['vechile']
6          lit et matelas  ['furniture']
7          turbine moteur            nan
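The merged frame already carries the matched word in its "values" column, so a possible extension of the same approach (a sketch, not part of the original answer) also collects that word for the Sub_Category column:
# reuses df and dfcat from the answer above
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             .explode("name").reset_index()
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # keep both the category and the word that matched
             .groupby("index", as_index=False).agg({"category": list, "values": list})
             .set_index("index")
             .rename(columns={"values": "matched_word"}))
print(df.join(dfcatlist))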

Pandas create a column for each element of a list in a dictionary value?

I have a dictionary where the values are lists like the following:
d= {u'2012-06-08': [list_element_0, list_element_1, list_element_2],
u'2012-06-09': [list_element_0, list_element_1, list_element_2],
u'2012-06-10': [list_element_0, list_element_1, list_element_2]}
I'd like to create a dataframe with 4 columns: [column_for_dict_keys, column_for_elements_in_list_at_index_0, column_for_elements_in_list_at_index_1, column_for_elements_in_list_at_index_2]
I found how to make a regular dictionary into a dataframe here, but I don't know how to modify it for my specific case
Let's try:
pd.DataFrame(d).T.reset_index()
Output:
index 0 1 2
0 2012-06-08 list_element_0 list_element_1 list_element_2
1 2012-06-09 list_element_0 list_element_1 list_element_2
2 2012-06-10 list_element_0 list_element_1 list_element_2
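To get the four named columns described in the question, you can rename them afterwards. A minimal sketch (the column names below are placeholders):
import pandas as pd

d = {u'2012-06-08': ['list_element_0', 'list_element_1', 'list_element_2'],
     u'2012-06-09': ['list_element_0', 'list_element_1', 'list_element_2'],
     u'2012-06-10': ['list_element_0', 'list_element_1', 'list_element_2']}

out = pd.DataFrame(d).T.reset_index()
out.columns = ['dict_key', 'element_0', 'element_1', 'element_2']  # placeholder names
print(out)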

How to sort an alphanumeric field in pandas?

I have a dataframe and the first column contains id. How do I sort the first column when it contains alphanumeric data, such as:
id = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N","6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399" ]
Expected result is
id = ["5566FT6N", "6LDFTLL9", "6P4EF7BB", "6RHSDD46", "6UVSCF4H", "6VPZ4T5P", "6YYPH399", "7SKDEZWE", "EHYXE34N", "N9RFERBG", "TT56GTN2" ]
You can utilize the .sort() method:
>>> id.sort()
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
This will sort the list in place. If you don't want to change the original id list, you can use the sorted() built-in function:
>>> sorted(id)
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
>>> id
['6LDFTLL9', 'N9RFERBG', '6RHSDD46', '6UVSCF4H', '7SKDEZWE', '5566FT6N', '6VPZ4T5P', 'EHYXE34N', '6P4EF7BB', 'TT56GTN2', '6YYPH399']
Notice, with this one, that id is unchanged.
For a DataFrame, you want to use sort_values().
df.sort_values(0, inplace=True)
Here 0 is the column label (the default label when the DataFrame is built from a plain list without column names); you can also pass the column name (e.g. 'id'), as in the sketch after the output below.
0
5 5566FT6N
0 6LDFTLL9
8 6P4EF7BB
2 6RHSDD46
3 6UVSCF4H
6 6VPZ4T5P
10 6YYPH399
4 7SKDEZWE
7 EHYXE34N
1 N9RFERBG
9 TT56GTN2
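A minimal sketch that builds the frame from the id list and sorts by a named column (assuming the column is called 'id'):
import pandas as pd

ids = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N",
       "6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399"]
df = pd.DataFrame({'id': ids})
df = df.sort_values('id').reset_index(drop=True)  # sort by the named column
print(df)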
