Remove empty lists in pandas series - python

I have a long series like the following:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
In [151]: series
Out[151]:
0 [(1, 2)]
1 [(3, 5)]
2 []
3 [(3, 5)]
dtype: object
I want to remove all entries with an empty list. For some reason, boolean indexing does not work.
The following tests both give the same error:
series == [[(1,2)]]
series == [(1,2)]
ValueError: Arrays were different lengths: 4 vs 1
This is very strange, because in the simple example below, the same kind of comparison works as expected:
In [146]: pd.Series([1,2,3]) == [3]
Out[146]:
0 False
1 False
2 True
dtype: bool
P.S. ideally, I'd like to split the tuples in the series into a DataFrame of two columns also.

You could check to see if the lists are empty using str.len():
series.str.len() == 0
and then use this boolean series to remove the rows containing empty lists.
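For example, a minimal sketch of that filtering step:
mask = series.str.len() == 0
series = series[~mask]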
If each of your entries is a list containing a two-tuple (or else empty), you could create a two-column DataFrame by using the str accessor twice (once to select the first element of the list, then to access the elements of the tuple):
pd.DataFrame({'a': series.str[0].str[0], 'b': series.str[0].str[1]})
Missing entries default to NaN with this method.
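For the sample series above, this would yield (floats appear because of the NaN row):
     a    b
0  1.0  2.0
1  3.0  5.0
2  NaN  NaN
3  3.0  5.0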

Using the built in apply you can filter by the length of the list:
series = pd.Series([[(1,2)],[(3,5)],[],[(3,5)]])
series = series[series.apply(len) > 0]

Your series is in a bad state -- having a Series of lists of tuples of ints
buries the useful data, the ints, inside too many layers of containers.
However, to form the desired DataFrame, you could use
df = series.apply(lambda x: pd.Series(x[0]) if x else pd.Series(dtype=float)).dropna()
which yields
     0    1
0  1.0  2.0
1  3.0  5.0
3  3.0  5.0
A better way would be to avoid building the malformed series altogether and
form df directly from the data:
data = [[(1,2)],[(3,5)],[],[(3,5)]]
data = [pair for row in data for pair in row]
df = pd.DataFrame(data)
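If you want named columns here as well, a small variation (the names 'a' and 'b' are just illustrative):
df = pd.DataFrame(data, columns=['a', 'b'])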

How do I find the row # of a string index?

I have a dataframe where the indexes are not numbers but strings (specifically, name of countries) and they are all unique. Given the name of a country, how do I find its row number (the 'number' value of the index)?
I tried df[df.index == 'country_name'].index but this doesn't work.
We can use Index.get_indexer:
df.index.get_indexer(['Peru'])
array([3])
Or we can build a RangeIndex based on the size of the DataFrame then subset that instead:
pd.RangeIndex(len(df))[df.index == 'Peru']
Int64Index([3], dtype='int64')
Since we're only looking for a single label and the indexes are "all unique" we can also use Index.get_loc:
df.index.get_loc('Peru')
3
Sample DataFrame:
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5]
}, index=['Bahamas', 'Cameroon', 'Ecuador', 'Peru', 'Japan'])
df:
A
Bahamas 1
Cameroon 2
Ecuador 3
Peru 4
Japan 5
pd.Index.get_indexer
We can use pd.Index.get_indexer to get the integer index.
idx = df.index.get_indexer(list_of_target_labels)
# If you only have a single label you can use tuple unpacking here.
[idx] = df.index.get_indexer([country_name])
NB: pd.Index.get_indexer takes a list of target labels and returns an array of integer positions from 0 to n - 1, indicating where each target is found in the index. Targets missing from the index are marked by -1.
np.where
You could also use np.where here.
idx = np.where(df.index == country_name)[0]
list.index
We could also use list.index after converting the pd.Index to a list with pd.Index.tolist:
idx = df.index.tolist().index(country_name)
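A quick check of the last two approaches against the sample DataFrame above, as a sketch:
import numpy as np

idx = np.where(df.index == 'Peru')[0]    # array([3])
idx2 = df.index.tolist().index('Peru')   # 3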
Why not create the index with numbers instead of text? Your df can be sorted in many ways beyond alphabetical order, so the row positions can change and you lose the row count.
With a numbered index this wouldn't be a problem.
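For instance, a minimal sketch of that idea, moving the string index into an ordinary column so the index stays numeric:
df = df.reset_index()  # country names move into a column; the index becomes 0..n-1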

Check column for variable and get value from another column in matched row

How can I get the values of one column in a csv-file by matching attributes in another column?
CSV-file would look like that:
One,Two,Three
x,car,5
x,bus,7
x,car,9
x,car,6
I only want to get the values of column 3 if they have the value "car" in column 2. I do not want them added up; rather, I want them printed in a list, like this:
5
9
6
My approach looks like this, but it doesn't really work:
import pandas as pd
df = pd.read_csv(r"example.csv")
ITEMS = ['car']  # I will need more items, this is just an example
for item in df.Two:
    if item in ITEMS:
        print(df.Three)
How can I get the exact value for a matched item?
In one line you can do it like this:
print(df['Three'][df['Two']=='car'].values)
Output:
[5 9 6]
For multiple items try:
df = pd.DataFrame({'One': ['x','x','x','x','x'],
                   'Two': ['car','bus','car','car','jeep'],
                   'Three': [5,7,9,6,10]})
myitems = ['car', 'bus']
res_list = []
for item in myitems:
    res_list += df['Three'][df['Two']==item].values.tolist()
print(*sorted(res_list), sep='\n')
Output:
5
6
7
9
Explanation
df['Two']=='car' returns a boolean Series with True at the row positions where the value in column Two of df is car
.values gets these boolean values as a numpy.ndarray; for the original data the result would be [True False True True]
We can filter the values in column Three by using this boolean mask like so: df['Three'][<boolean_mask>]
To combine the resulting arrays we convert each numpy.ndarray to a Python list using tolist() and append it to res_list
Then we use sorted to sort res_list
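As a side note, a vectorized alternative to the loop would be Series.isin; a minimal sketch using the same df and myitems:
mask = df['Two'].isin(myitems)             # True where Two is 'car' or 'bus'
res_list = df.loc[mask, 'Three'].tolist()  # [5, 7, 9, 6]
print(*sorted(res_list), sep='\n')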

Compare values from one column to another all rows

I have a dataframe like:
df1
right left
[a,b] [c,d,e,f]
[b,c] [a,d,e,f]
[c,d,e,f] [a,b]
Lines 1 and 3 are basically the same, and I want to remove the duplicates.
Is there any way to do this?
The data is only available structured in this way.
I tried running the command below, which I found elsewhere, but since these are lists, it throws an error:
df1.duplicated(subset = ['right', 'left'], keep = False)
TypeError: unhashable type: 'list'
Create hashable tuples from both columns, sorting inside a list comprehension, and test for duplicates with Series.duplicated:
L = [tuple(map(tuple, sorted(x))) for x in df[['right','left']].to_numpy()]
m = pd.Series(L, index=df.index).duplicated(keep = False)
print (m)
0 True
1 False
2 True
dtype: bool
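If the goal is to actually drop the duplicated rows while keeping the first occurrence, a sketch building on the same mask:
m_first = pd.Series(L, index=df.index).duplicated(keep='first')
df = df[~m_first]  # keeps rows 0 and 1, drops row 2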

Why is it that Pandas to NumPy array conversion yields different results with the inclusion of brackets [duplicate]

I'm confused about the syntax regarding the following line of code:
x_values = dataframe[['Brains']]
The dataframe object consists of 2 columns (Brains and Bodies)
Brains Bodies
42 34
32 23
When I print x_values I get something like this:
Brains
0 42
1 32
I'm aware of the pandas documentation as far as attributes and methods of the dataframe object are concerned, but the double bracket syntax is confusing me.
Consider this:
Source DF:
In [79]: df
Out[79]:
Brains Bodies
0 42 34
1 32 23
Selecting one column - results in Pandas.Series:
In [80]: df['Brains']
Out[80]:
0 42
1 32
Name: Brains, dtype: int64
In [81]: type(df['Brains'])
Out[81]: pandas.core.series.Series
Selecting subset of DataFrame - results in DataFrame:
In [82]: df[['Brains']]
Out[82]:
Brains
0 42
1 32
In [83]: type(df[['Brains']])
Out[83]: pandas.core.frame.DataFrame
Conclusion: the second approach allows us to select multiple columns from the DataFrame. The first one is just for selecting a single column...
Demo:
In [84]: df = pd.DataFrame(np.random.rand(5,6), columns=list('abcdef'))
In [85]: df
Out[85]:
a b c d e f
0 0.065196 0.257422 0.273534 0.831993 0.487693 0.660252
1 0.641677 0.462979 0.207757 0.597599 0.117029 0.429324
2 0.345314 0.053551 0.634602 0.143417 0.946373 0.770590
3 0.860276 0.223166 0.001615 0.212880 0.907163 0.437295
4 0.670969 0.218909 0.382810 0.275696 0.012626 0.347549
In [86]: df[['e','a','c']]
Out[86]:
e a c
0 0.487693 0.065196 0.273534
1 0.117029 0.641677 0.207757
2 0.946373 0.345314 0.634602
3 0.907163 0.860276 0.001615
4 0.012626 0.670969 0.382810
and if we specify only one column in the list we will get a DataFrame with one column:
In [87]: df[['e']]
Out[87]:
e
0 0.487693
1 0.117029
2 0.946373
3 0.907163
4 0.012626
There is no special syntax in Python for [[ and ]]. Rather, a list is being created, and then that list is being passed as an argument to the DataFrame indexing function.
As per #MaxU's answer, if you index a DataFrame with a single string, a Series representing that one column is returned. If you index it with a list of strings, a DataFrame containing the given columns is returned.
So, when you do the following
# Print "Brains" column as Series
print(df['Brains'])
# Return a DataFrame with only one column called "Brains"
print(df[['Brains']])
It is equivalent to the following
# Print "Brains" column as Series
column_to_get = 'Brains'
print(df[column_to_get])
# Return a DataFrame with only one column called "Brains"
subset_of_columns_to_get = ['Brains']
print(df[subset_of_columns_to_get])
In both cases, the DataFrame is being indexed with the [] operator.
Python uses the [] operator for both indexing and for constructing list literals, and ultimately I believe this is your confusion. The outer [ and ] in df[['Brains']] is performing the indexing, and the inner is creating a list.
>>> some_list = ['Brains']
>>> some_list_of_lists = [['Brains']]
>>> ['Brains'] == [['Brains']][0]
True
>>> 'Brains' == [['Brains']][0][0] == [['Brains'][0]][0]
True
What I am illustrating above is that at no point does Python ever see [[ and interpret it specially. In the last convoluted example ([['Brains'][0]][0]) there is no special ][ operator or ]][ operator... what happens is
A single-element list is created (['Brains'])
The first element of that list is indexed (['Brains'][0] => 'Brains')
That is placed into another list ([['Brains'][0]] => ['Brains'])
And then the first element of that list is indexed ([['Brains'][0]][0] => 'Brains')
Other solutions demonstrate the difference between a Series and a DataFrame. For the mathematically minded, you may wish to consider the dimensions of your input and output. Here's a summary:
Object                     Series             DataFrame
Dimensions (obj.ndim)      1                  2
Syntax arg dim             0                  1
Syntax                     df['col']          df[['col']]
Max indexing dim           1                  2
Label indexing             df['col'].loc[x]   df.loc[x, 'col']
Label indexing (scalar)    df['col'].at[x]    df.at[x, 'col']
Integer indexing           df['col'].iloc[x]  df.iloc[x, 'col']
Integer indexing (scalar)  df['col'].iat[x]   df.iat[x, 'col']
When you specify a scalar or list argument to pd.DataFrame.__getitem__, for which [] is syntactic sugar, the dimension of your argument is one less than the dimension of your result. So a scalar (0-dimensional) gives a 1-dimensional series. A list (1-dimensional) gives a 2-dimensional dataframe. This makes sense since the additional dimension is the dataframe index, i.e. rows. This is the case even if your dataframe happens to have no rows.
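A quick check of that dimension rule, reusing the df from the demo above:
s = df['e']       # scalar key (0-dimensional argument) -> 1-dimensional Series
sub = df[['e']]   # list key (1-dimensional argument) -> 2-dimensional DataFrame
print(s.ndim, sub.ndim)  # 1 2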
[ ] and [[ ]] are also NumPy concepts.
Try to understand the basics of creating an np.array, use reshape, and check the result with ndim, and you'll understand.
Check my answer here: https://stackoverflow.com/a/70194733/7660981

Splitting a dataframe based on condition

I am trying to split my dataframe into two based on medical_plan_id: if it is empty, into df1; if not empty, into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How to handle such situation?
My df_with_medicalplanid looks like below:
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... None
1 UHC99806 ... None
Use ==, not is, to test equality
Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
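A tiny illustration of the difference:
x = [1, 2]
y = [1, 2]
x == y  # True: the two lists have equal contents
x is y  # False: they are two distinct objects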
Don't repeat mask calculations
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
Empty strings are different to null values
Equality versus empty strings can be tested via == '', but equality versus null values requires a specialized method: pd.Series.isnull. This is because null values are represented by np.nan in the NumPy arrays that Pandas uses, and np.nan != np.nan by design.
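A quick demonstration of why a plain equality check cannot find null values, as a sketch:
import numpy as np
import pandas as pd

np.nan == np.nan            # False, by design
s = pd.Series([np.nan, ''])
print(s == np.nan)          # False, False -- not usable as a null check
print(s.isnull())           # True, False -- the dedicated method works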
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Semi-final solution
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
Final solution: avoid extra variables
Creating additional variables is something, as a programmer, you should look to avoid. In this case, there's no need to create two new variables, you can use GroupBy with dict to give a dictionary of dataframes with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). A variant of the above, you can forego dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Example
Putting all the above in action:
import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
medical_plan_id values
2 2134.0 3
3 4325.0 4
4 6543.0 5
medical_plan_id values
0 NaN 1
1 NaN 2
5 NaN 6
6 NaN 7
Another variant is to unpack df.groupby, which returns an iterator of tuples (the first item being the group key and the second the corresponding sub-DataFrame).
Like this for instance:
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
_ is used in Python to mark variables one is not interested in keeping. I have split the code over two lines for readability.
Full example
import pandas as pd

df_with_medicalplanid = pd.DataFrame({
    'medical_plan_id': ['214212', '', '12251', '12421', ''],
    'value': 1
})
cond = df_with_medicalplanid['medical_plan_id'] == ''
(_, df1) , (_, df2) = df_with_medicalplanid.groupby(cond)
print(df1)
Returns:
medical_plan_id value
0 214212 1
2 12251 1
3 12421 1
