How to unlist a list in a dataframe column? - python

I have a dataframe column codes as below:
codes
-----
[K70, X090a2, T8a981,X090a2]
[A70, X90a2, T8a91,A70,A70]
[B70, X09a2, T8a81]
[C70, X00a2, T8981,X00a2,C70]
I want output like this in a dataframe:
I need to check for duplicates, keep only the unique values, and then unlist each list into a single string.
I used dict.fromkeys(z1['codes']) because dict keys cannot contain duplicates, and I also tried a for loop with a count, but neither gave the expected result.
output column:
codes
-----
K70 X090a2 T8a981
A70 X90a2 T8a91
B70 X09a2 T8a81
C70 X00a2 T8981

If the column contains lists, deduplicate each list with dict.fromkeys (which keeps the original order) and then join by whitespace:
# if values are strings, convert them to lists first
# z1['codes'] = z1['codes'].str.strip('[]').str.split(r',\s*')
z1['codes'] = z1['codes'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
print (z1)
codes
0 K70 X090a2 T8a981
1 A70 X90a2 T8a91
2 B70 X09a2 T8a81
3 C70 X00a2 T8981
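If the column actually holds strings such as "[K70, X090a2, T8a981,X090a2]" rather than lists, the commented line above does the conversion first. A minimal, self-contained sketch with the sample data (the z1 variable name is assumed from the question):
import pandas as pd

z1 = pd.DataFrame({'codes': ['[K70, X090a2, T8a981,X090a2]',
                             '[A70, X90a2, T8a91,A70,A70]',
                             '[B70, X09a2, T8a81]',
                             '[C70, X00a2, T8981,X00a2,C70]']})
# strip the brackets and split on commas with optional whitespace
z1['codes'] = z1['codes'].str.strip('[]').str.split(r',\s*')
# deduplicate while keeping order, then join with spaces
z1['codes'] = z1['codes'].apply(lambda x: ' '.join(dict.fromkeys(x)))
print(z1)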

set() will remove the duplicates from each list, and join will unlist the list into a whitespace-separated string. Note that a set does not preserve the original order, so the codes may come out in a different order than shown:
z1['codes'] = z1['codes'].apply(lambda code: " ".join(set(code)))
print (z1)
codes
0 K70 X090a2 T8a981
1 A70 X90a2 T8a91
2 B70 X09a2 T8a81
3 C70 X00a2 T8981

Related

How to iterate through rows which contain text and create bigrams using python

In an Excel file I have 5 columns and 20 rows, one of which contains text data as shown below.
df['Content'] contains:
0 this is the final call
1 hello how are you doing
2 this is me please say hi
..
.. and so on
I want to create bigrams while the text remains attached to its original table.
I tried applying the function below to iterate through the rows:
def find_bigrams(input_list):
    bigram_list = []
    for i in range(len(input_list)-1):
        bigram_list.append(input_list[1:])
    return bigram_list
And tried applying the result back to the table using:
df['Content'] = df['Content'].apply(find_bigrams)
But I am getting the following instead:
0 None
1 None
2 None
I am expecting the output as below
Company Code Content
0 xyz uh-11 (this,is),(is,the),(the,final),(final,call)
1 abc yh-21 (hello,how),(how,are),(are,you),(you,doing)
Your input_list is not actually a list, it's a string.
Try the function below:
def find_bigrams(input_text):
    input_list = input_text.split(" ")
    # pair each word with the next one; zip already yields tuples
    bigram_list = list(zip(input_list[:-1], input_list[1:]))
    return bigram_list
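Applying the corrected function back to the frame then produces the tuples from the question's expected output. A small reproduction (Company/Code values taken from the question's sample):
import pandas as pd

df = pd.DataFrame({
    'Company': ['xyz', 'abc'],
    'Code': ['uh-11', 'yh-21'],
    'Content': ['this is the final call', 'hello how are you doing'],
})
df['Content'] = df['Content'].apply(find_bigrams)  # uses find_bigrams defined above
print(df)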
Alternatively, you can use itertools.permutations(); the [::len(x)] stride picks out only the consecutive pairs from the full list of permutations:
import itertools
df['Content'].str.split().map(lambda x: list(itertools.permutations(x, 2))[::len(x)])

Pandas: remove duplicates within a list of values and identify ids that share the same values

I have a pandas dataframe.
It used to have duplicate test_no values, so I removed the duplicates with
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(x.split(','))))
but as you can see the duplicates are still there; I think it's due to extra spaces, and I want to clean them up.
Part 1:
my_id test_no
0 10000000000055910 461511, 461511
1 10000000000064510 528422
2 10000000000064222 528422,528422 , 528421
3 10000000000161538 433091.0, 433091.0
4 10000000000231708 nan,nan
Expected Output
my_id test_no
0 10000000000055910 461511
1 10000000000064510 528422
2 10000000000064222 528422, 528421
3 10000000000161538 433091.0
4 10000000000231708 nan
Part 2:
I also want to check if any of the "my_id" share any of the test_no ;
for example :
my_id matched_myid
10000000000064222 10000000000064510
You can use a regex to split:
import re
df['test_no'] = df['test_no'].apply(lambda x: ','.join(set(re.split(r',\s*', x))))
# or
df['test_no'] = [','.join(set(re.split(r',\s*', x))) for x in df['test_no']]
If you want to keep the original order use dict.fromkeys in place of set.
If the duplicates are successive you can also use:
df['test_no'] = df['test_no'].str.replace(r'([^,\s]+),\s*\1', r'\1', regex=True)
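For Part 2 (finding my_id values that share a test_no), one possible sketch, assuming df already holds the cleaned test_no strings from Part 1, is to explode the values and self-merge on them:
import re
import pandas as pd

# split the cleaned strings into one row per test_no value
exploded = (df.assign(test_no=df['test_no'].apply(lambda x: re.split(r',\s*', x)))
              .explode('test_no'))
exploded = exploded[exploded['test_no'] != 'nan']            # ignore missing values
# self-merge on test_no to find ids that share a value
pairs = exploded.merge(exploded, on='test_no', suffixes=('', '_matched'))
pairs = pairs[pairs['my_id'] != pairs['my_id_matched']]      # drop self-matches
print(pairs[['my_id', 'my_id_matched']].drop_duplicates())   # each pair appears in both directions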

Check if a string is present in multiple lists

I am trying to categorize a dataset based on the strings that contain the names of the different objects in the dataset.
The dataset is composed of 3 columns, df['Name'], df['Category'] and df['Sub_Category'], the Category and Sub_Category columns are empty.
For each row I would like to check in different lists of words if the name of the object contains at least one word in one of the list. Based on this first check I would like to attribute a value to the category column. If it finds more than 1 word in 2 different lists I would like to attribute 2 values to the object in the category column.
Moreover, I would like to be able to identify which word has been checked in which list in order to attribute a value to the sub_category column.
Until now, I have been able to do it with only one list, but I am not able to identify which word matched, and the code takes very long to run.
Here is my code (where I added an example of names found in my dataset as df['Name']):
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['Name'] = ['vitrine murale vintage', 'commode ancienne', 'lustre antique', 'solex',
              'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']
furniture_check = ['canape', 'chaise', 'buffet', 'table', 'commode', 'lit']
vehicle_check = ['solex', 'voiture', 'moto', 'scooter']
art_check = ['tableau', 'scuplture', 'tapisserie']

for idx, row in df.iterrows():
    for c in furniture_check:
        if c in row['Name']:
            df.loc[idx, 'Category'] = 'Meubles'
Any help would be appreciated
Here is an approach that expands lists, merges them and re-combines them.
df = pd.DataFrame({"name": ['vitrine murale vintage', 'commode ancienne', 'lustre antique', 'solex',
                            'sculpture médievale', 'jante voiture', 'lit et matelas', 'turbine moteur']})
furniture_check = ['canape', 'chaise', 'buffet', 'table', 'commode', 'lit']
vehicle_check = ['solex', 'voiture', 'moto', 'scooter']
art_check = ['tableau', 'scuplture', 'tapisserie']

# put categories into a dataframe
dfcat = pd.DataFrame([{"category": "furniture", "values": furniture_check},
                      {"category": "vechile", "values": vehicle_check},
                      {"category": "art", "values": art_check}])

# turn the space-delimited "name" column into a list
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             # explode the list so it can be used in a join; reset_index() keeps a copy of the original index
             .explode("name").reset_index()
             # merge the exploded names on both sides
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # where there are multiple categories, make them a list
             .groupby("index", as_index=False).agg({"category": lambda s: list(s)})
             # put the original index back
             .set_index("index")
             )
# a simple join gives the names and the list of associated categories
df.join(dfcatlist)
                     name       category
0  vitrine murale vintage            nan
1        commode ancienne  ['furniture']
2          lustre antique            nan
3                   solex    ['vechile']
4     sculpture médievale            nan
5           jante voiture    ['vechile']
6          lit et matelas  ['furniture']
7          turbine moteur            nan
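The merged frame already carries the matched word in its "values" column, so a possible extension of the same approach (a sketch, not part of the original answer) also collects that word for the Sub_Category column:
# reuses df and dfcat from the answer above
dfcatlist = (df.assign(name=df["name"].apply(lambda x: x.split(" ")))
             .explode("name").reset_index()
             .merge(dfcat.explode("values"), left_on="name", right_on="values")
             # keep both the category and the word that matched
             .groupby("index", as_index=False).agg({"category": list, "values": list})
             .set_index("index")
             .rename(columns={"values": "matched_word"}))
print(df.join(dfcatlist))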

Pandas create a column for each element of a list in a dictionary value?

I have a dictionary where the values are lists like the following:
d= {u'2012-06-08': [list_element_0, list_element_1, list_element_2],
u'2012-06-09': [list_element_0, list_element_1, list_element_2],
u'2012-06-10': [list_element_0, list_element_1, list_element_2]}
I'd like to create a dataframe with 4 columns: [column_for_dict_keys, column_for_elements_in_list_at_index_0, column_for_elements_in_list_at_index_1, column_for_elements_in_list_at_index_2]
I found how to make a regular dictionary into a dataframe here, but I don't know how to modify it for my specific case
Let's try:
pd.DataFrame(d).T.reset_index()
Output:
index 0 1 2
0 2012-06-08 list_element_0 list_element_1 list_element_2
1 2012-06-09 list_element_0 list_element_1 list_element_2
2 2012-06-10 list_element_0 list_element_1 list_element_2
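To get the four named columns described in the question, you can rename them afterwards. A minimal sketch (the column names below are placeholders):
import pandas as pd

d = {u'2012-06-08': ['list_element_0', 'list_element_1', 'list_element_2'],
     u'2012-06-09': ['list_element_0', 'list_element_1', 'list_element_2'],
     u'2012-06-10': ['list_element_0', 'list_element_1', 'list_element_2']}

out = pd.DataFrame(d).T.reset_index()
out.columns = ['dict_key', 'element_0', 'element_1', 'element_2']  # placeholder names
print(out)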

How to sort an alphanumeric field in pandas?

I have a dataframe and the first column contains id. How do I sort the first column when it contains alphanumeric data, such as:
id = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N","6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399" ]
Expected result is
id = ["5566FT6N", "6LDFTLL9", "6P4EF7BB", "6RHSDD46", "6UVSCF4H", "6VPZ4T5P", "6YYPH399", "7SKDEZWE", "EHYXE34N", "N9RFERBG", "TT56GTN2" ]
You can utilize the .sort() method:
>>> id.sort()
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
This will sort the list in place. If you don't want to change the original id list, you can use the sorted() built-in function:
>>> sorted(id)
['5566FT6N', '6LDFTLL9', '6P4EF7BB', '6RHSDD46', '6UVSCF4H', '6VPZ4T5P', '6YYPH399', '7SKDEZWE', 'EHYXE34N', 'N9RFERBG', 'TT56GTN2']
>>> id
['6LDFTLL9', 'N9RFERBG', '6RHSDD46', '6UVSCF4H', '7SKDEZWE', '5566FT6N', '6VPZ4T5P', 'EHYXE34N', '6P4EF7BB', 'TT56GTN2', '6YYPH399']
Notice, with this one, that id is unchanged.
For a DataFrame, you want to use sort_values().
df.sort_values(0, inplace=True)
Here 0 is the column label (the default label when the DataFrame is built from a plain list without column names); you can also pass the column name (e.g. 'id'), as in the sketch after the output below.
0
5 5566FT6N
0 6LDFTLL9
8 6P4EF7BB
2 6RHSDD46
3 6UVSCF4H
6 6VPZ4T5P
10 6YYPH399
4 7SKDEZWE
7 EHYXE34N
1 N9RFERBG
9 TT56GTN2
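A minimal sketch that builds the frame from the id list and sorts by a named column (assuming the column is called 'id'):
import pandas as pd

ids = ["6LDFTLL9", "N9RFERBG", "6RHSDD46", "6UVSCF4H", "7SKDEZWE", "5566FT6N",
       "6VPZ4T5P", "EHYXE34N", "6P4EF7BB", "TT56GTN2", "6YYPH399"]
df = pd.DataFrame({'id': ids})
df = df.sort_values('id').reset_index(drop=True)  # sort by the named column
print(df)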
