How to remove duplicate strings from each row in a column - python

I have a column that contains a bunch of 4-digit numbers separated by commas. Some rows contain duplicate sets of 4-digit numbers. For example, one row looks like this:
1400, 1400, 1400, 1455, 1455, 1455, 1670, 1670, 1670
I am trying to change that to this:
1400, 1455, 1670
I want to apply that to all rows within a column. I was able to get this from another question here.
df['ID'] = df['ID'].apply(lambda x: list(set(x)))
However, this just reduces each row to its unique single characters, like this:
1,4,0,5,6,7
How do I adjust the code to make this work?

You can use np.unique after extracting the numbers from the string (note that np.unique returns the unique values in sorted order):
import numpy as np

df['ID'] = df['ID'].str.findall(r'\d{4}').map(np.unique).str.join(', ')
print(df)
# Output
0 1400, 1455, 1670
Name: ID, dtype: object
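If you instead need to preserve the original left-to-right order of first appearance, a minimal sketch using dict.fromkeys (which keeps insertion order in Python 3.7+) could look like this, assuming the same single-column df:
import pandas as pd

df = pd.DataFrame({'ID': ['1400, 1400, 1400, 1455, 1455, 1455, 1670, 1670, 1670']})

# dict.fromkeys de-duplicates while preserving first-seen order
df['ID'] = df['ID'].str.findall(r'\d{4}').map(lambda nums: ', '.join(dict.fromkeys(nums)))
print(df)
#                  ID
# 0  1400, 1455, 1670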

One option is to use map with join. But before that, make sure to split:
df["ID"] = df["ID"].str.split(r"\s*,\s*").map(set).str.join(", ")
You can also modify your code by adding an str.split right before you call apply:
df["ID"] = df["ID"].str.split(r"\s*,\s*").apply(lambda x: ", ".join(set(x)))
NB: Neither approach guarantees/keeps the order of the numbers. So if the order is important, @Corralien's answer is what you're looking for.
Output :
print(df)
ID
0 1455, 1400, 1670

Related

Filtering columns in pandas by length

I have a column in a dataframe that contains IATA_Codes (abbreviations) for airports (such as LAX, SFO, ...). However, if I analyze the column values a little more (column.unique()), it says that there are also 4-digit numbers in it.
How can I filter the column so that my DataFrame will only consist of rows containing a real airport code?
My idea was to filter by length (the airport code length is always 3, while the number length is always 4), but I don't know how to implement this idea.
array(['LFT', 'HYS', 'ELP', 'DVL', 'ISP', 'BUR', 'DAB', 'DAY', 'GRK',
'GJT', 'BMI', 'LBE', 'ASE', 'RKS', 'GUM', 'TVC', 'ALO', 'IMT',
...
10170, 11577, 14709, 14711, 12255, 10165, 10918, 15401, 13970,
15497, 12265, 14254, 10581, 12016, 11503, 13459, 14222, 14025,
'10333', '14222', '14025', '13502', '15497', '12265'], dtype=object)
You can use .str.len to get the length of each value, and use the resulting boolean mask with df.loc:
df = df.loc[df['IATA_Codes'].astype(str).str.len() == 3]
Another possibility is to use a lambda expression:
df[df['IATA_Codes'].apply(lambda x : len(str(x))==3)]['IATA_Codes'].unique()
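A minimal, self-contained sketch of the length filter, assuming a hypothetical IATA_Codes column that mixes real codes with numeric IDs:
import pandas as pd

df = pd.DataFrame({'IATA_Codes': ['LFT', 'HYS', 10170, '14222', 'ELP']})

# Keep only rows whose value is exactly 3 characters long (real airport codes)
df = df.loc[df['IATA_Codes'].astype(str).str.len() == 3]
print(df['IATA_Codes'].unique())  # ['LFT' 'HYS' 'ELP']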

Count of values from list in column

I have a column
df['COL_1']
and a list of numbers
num_range = list(range(200,281, 5))
The column contains either words such as UNREADABLE or NOT_PASSIVE, or some of the values present in the list above (200, 205, 210, etc.), or nothing.
I am trying to get a sum of how many rows in that column contains a number in the range given.
What I have tried:
df['COL_1'].value_counts(num_range)
I am not sure what else to try, the various trials I have done similar to the above have all failed.
I'm new to python, any guidance is highly appreciated.
Python 2.7 and pandas 0.24.2
EDIT:
I was getting errors because, as other users have mentioned, my data was not numeric. I fixed this with .astype, or alternatively by redefining target_range like so:
target_range = map(str, range(200, 281, 5))
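For example, if the column is stored as strings, a one-line count under that assumption would be:
count = df['COL_1'].astype(str).isin([str(n) for n in range(200, 281, 5)]).sum()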
If you're after the whole sum and are not interested in the breakout of individual counts, you can use isin with sum:
target_range = range(200, 281, 5)
df["COL_1"].isin(target_range).sum()
Notice that you do not need to cast the range object to a list.
If you want the breakout of value counts, see @Corralien's answer.
Details: pandas.DataFrame.isin() is a method which returns a boolean mask.
>>> import pandas as pd
>>> # Data provided by Corralien
>>> df = pd.DataFrame({'COL_1': ['UNREADABLE', 200, 'NOT_PASSIVE', 205, 210, 200, '', 210, 180, 170, '']})
>>> target_range = range(200, 281, 5)
>>> df.isin(target_range)
COL_1
0 False
1 True
2 False
3 True
4 True
5 True
6 False
7 True
8 False
9 False
10 False
Notice I'm using df.isin() instead of df["COL_1"].isin(). If you have multiple columns in your DataFrame that you want to perform this operation on, select them first with a list of column names (e.g. df[['col_a', 'col_b']].isin(target_range)). If you want to perform this operation on your entire DataFrame, you can simply use df.isin().
The .isin() method returns a boolean mask. Since bool is a subtype of int, you can simply call sum() on the resulting DataFrame to sum up all the 1s and 0s to get your final tally of all the rows which match your criterion.
IIUC, you can try:
df = pd.DataFrame({'COL_1': ['UNREADABLE', 200, 'NOT_PASSIVE', 205, 210,
                             200, '', 210, 180, 170, '']})
out = df.loc[df['COL_1'].apply(pd.to_numeric, errors='coerce')
               .isin(num_range), 'COL_1'] \
        .value_counts()
>>> out
200 2
210 2
205 1
Name: COL_1, dtype: int64
>>> out.sum()
5
You have a column in a DataFrame and also have another list of values, and you want to count how many items from the DataFrame column exist in the list, right?
So use this:
count = 0
for i in df['COL_1']:
    if i in num_range:
        count += 1
On each iteration over your column, if the value exists in the list, the count variable is incremented by one.
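As a side note, for large frames the vectorized form shown in the first answer is usually preferable to a Python-level loop; the equivalent one-liner is:
count = df['COL_1'].isin(num_range).sum()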

Pandas fill column with multiple lists

I'm trying to fill two pandas columns with multiple lists of different sizes.
So for example I have the first column with a list like "angioplasty, aortic, artery" and the second one like "251, 2882, 401, 4019, 412"
First I tried to append each list like this:
matches.code_matches.append(code_series)
which resulted in this TypeError:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
So I tried converting the lists into a Series and appending them to the DataFrame with this code:
code_series = pd.Series( (v[0] for v in code_matches))
matches.code_matches.append(code_series)
However, after appending the series my DataFrame still ends up empty.
What is the best way to fill the columns of a DataFrame with different-sized lists? Note that I want to keep the lists and fill them into single fields instead of filling in each element separately (I want to assign the values to IDs).
Use:
In [704]: l1 = ['angioplasty', 'aortic', 'artery']
In [705]: l2 = [251, 2882, 401, 4019, 412]
In [707]: df = pd.DataFrame({'a': pd.Series(l1), 'b': pd.Series(l2)})
In [708]: df
Out[708]:
a b
0 angioplasty 251
1 aortic 2882
2 artery 401
3 NaN 4019
4 NaN 412
I used the .at method in the code below, and it seems to work:
import numpy as np
import pandas as pd

x = pd.DataFrame()
x['A'] = 1, 2, 3, 4, 5, 6
x['B'] = 1.0, 2, 3, 4, 5, 6
x['C'] = 9, 8, 7, 6, 5, 4
x['D'] = 0
x['D'] = x['D'].astype('object')  # object dtype so each cell can hold a list
for i in range(len(x)):
    x.at[i, 'D'] = list(np.random.rand(3))
You can replace the list(np.random.rand(3)) with your own data.
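Applied to the lists from the question, a small sketch (the matches frame, its single row, and the column names here are hypothetical) could look like:
import pandas as pd

words = ['angioplasty', 'aortic', 'artery']
codes = [251, 2882, 401, 4019, 412]

matches = pd.DataFrame(index=[0])      # hypothetical one-row frame
matches['word_matches'] = None         # None gives an object column, so cells can hold lists
matches['code_matches'] = None
matches.at[0, 'word_matches'] = words  # whole list stored in a single field
matches.at[0, 'code_matches'] = codes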

Creating series from a series of lists using positional list member

I have a data frame with a string column. I need to create a new column with the 3rd element of col1.split(' '). I tried
df['col1'].str.split(' ')[0]
but all I get is an error.
Actually, I need to turn col1 into multiple columns after splitting by " ".
What is the correct way to do this ?
Consider this df
df = pd.DataFrame({'col': ['Lets say 2000 is greater than 5']})
col
0 Lets say 2000 is greater than 5
You can split and use the str accessor to get elements at different positions:
df['third'] = df.col.str.split(' ').str[2]
df['fifth'] = df.col.str.split(' ').str[4]
df['last'] = df.col.str.split(' ').str[-1]
col third fifth last
0 Lets say 2000 is greater than 5 2000 greater 5
Another way is:
df["third"] = df['col1'].apply(lambda x: x.split()[2])

get both unique count and max in group-by of pandas dataframe

Using the pandas DataFrame group-by feature, I want to group by column c_b and (1) calculate the unique count for column c_a and column c_c, and (2) get the max value of column c_d. Is there any solution to write one line of group-by code that achieves both goals? I tried the following line of code, but it seems not correct.
sampleGroup = sample.groupby('c_b')(['c_a', 'c_d'].agg(pd.Series.nunique), ['c_d'].agg(pd.Series.max))
My expected results are,
Expected results,
c_b,c_a_unique_count,c_c_unique_count,c_d_max
python,2,2,1.0
c++,2,2,0.0
Thanks.
Input file,
c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,0.0
go,c++,std,0.0
Source code,
sample = pd.read_csv('123.csv', header=None, skiprows=1,
                     dtype={0: str, 1: str, 2: str, 3: float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
sampleGroup = sample.groupby('c_b')(['c_a', 'c_d'].agg(pd.Series.nunique), ['c_d'].agg(pd.Series.max))
results.to_csv(sampleGroup, index= False)
You can pass a dict to agg():
df.groupby('c_b').agg({'c_a':'nunique', 'c_c':'nunique', 'c_d':'max'})
If you don't want c_b as index, you can pass as_index=False to groupby:
df.groupby('c_b', as_index=False).agg({'c_a':'nunique', 'c_c':'nunique', 'c_d':'max'})
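If your pandas version supports named aggregation (0.25+), you can also produce the exact column names from the expected output in one step; a sketch, assuming the sample frame from the question:
out = sample.groupby('c_b', as_index=False).agg(
    c_a_unique_count=('c_a', 'nunique'),
    c_c_unique_count=('c_c', 'nunique'),
    c_d_max=('c_d', 'max'))
out.to_csv('results.csv', index=False)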
