Pandas Series Split Value into Columns - python

I have a series that looks like this:
index
1 [{'id': 1, 'primary': True, 'source': None}, {'id': 2, 'primary': False, 'source': 'email'}]
2 [{'id': 2234, 'primary': True, 'source': None}, {'id': 234, 'primary': False, 'source': 'email'}]
3 [{'id': 32, 'primary': False, 'source': None}]
I want this to be a dataframe that looks like this:
index id primary source
1 1 True None
1 2 False email
2 2234 True None
2 234 False email
3 32 False None
I tried running this:
df_phone_numbers = df_phone_numbers.drop("phone_numbers", axis =1).join(pd.DataFrame(df_phone_numbers["phone_numbers"].to_dict()).T)
But I get an error message "All arrays must be of the same length"
Any advice?

Try converting the exploded Series (each list element becomes its own row, and the dicts then expand into columns):
k = s.explode()
pd.DataFrame(k.tolist(), index=k.index)
Output:
id primary source
1 1 True None
1 2 False email
2 2234 True None
2 234 False email
3 32 False None
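For the original DataFrame case, a minimal end-to-end sketch (assuming the list column is named phone_numbers, as in the question):
import pandas as pd

df_phone_numbers = pd.DataFrame({
    'phone_numbers': [
        [{'id': 1, 'primary': True, 'source': None},
         {'id': 2, 'primary': False, 'source': 'email'}],
        [{'id': 2234, 'primary': True, 'source': None},
         {'id': 234, 'primary': False, 'source': 'email'}],
        [{'id': 32, 'primary': False, 'source': None}],
    ]
}, index=[1, 2, 3])

# explode: one row per dict, with the original index repeated
k = df_phone_numbers['phone_numbers'].explode()
expanded = pd.DataFrame(k.tolist(), index=k.index)

# drop the list column and join the expanded columns back on the index
result = df_phone_numbers.drop('phone_numbers', axis=1).join(expanded)
print(result)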

Related

How can I filter dataframe based on null/not null using a column name as a variable?

I want to filter a dataframe to rows where a specific column is either null or not null. I have it working using:
df[df.Survive.notnull()] # contains no missing values
df[df.Survive.isnull()] #---> contains missing values
This works perfectly, but I want to make my code more dynamic and pass the column "Survive" as a variable. That's not working for me.
I tried:
variableToPredict = ['Survive']
df[df[variableToPredict].notnull()]
I get the error - ValueError: cannot reindex from a duplicate axis
I'm sure I'm making a silly mistake, what can I do to fix this?
The idea: the mask used for filtering must always be a Series, list, or 1d array.
If you want to test only one column, use a scalar:
variableToPredict = 'Survive'
df[df[variableToPredict].notnull()]
But if you add [], the output is a one-column DataFrame, so you need to reduce it row-wise with any (True if at least one value per row is non-missing, which makes sense for multiple columns) or all (True only if every value per row is non-missing):
variableToPredict = ['Survive']
df[df[variableToPredict].notnull().any(axis=1)]
variableToPredict = ['Survive', 'another column']
df[df[variableToPredict].notnull().any(axis=1)]
Sample:
df = pd.DataFrame({'Survive':[np.nan, 'A', 'B', 'B', np.nan],
'another column':[np.nan, np.nan, 'a','b','b']})
print (df)
Survive another column
0 NaN NaN
1 A NaN
2 B a
3 B b
4 NaN b
First, testing only one column:
variableToPredict = 'Survive'
df1 = df[df[variableToPredict].notnull()]
print (df1)
Survive another column
1 A NaN
2 B a
3 B b
print (type(df[variableToPredict]))
<class 'pandas.core.series.Series'>
#Series
print (df[variableToPredict])
0 NaN
1 A
2 B
3 B
4 NaN
Name: Survive, dtype: object
print (df[variableToPredict].isnull())
0 True
1 False
2 False
3 False
4 True
Name: Survive, dtype: bool
Using a list - here a one-element list:
variableToPredict = ['Survive']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>
#one element DataFrame
print (df[variableToPredict])
Survive
0 NaN
1 A
2 B
3 B
4 NaN
When testing row-wise, any and all give the same output for a single column:
print (df[variableToPredict].notnull().any(axis=1))
0 False
1 True
2 True
3 True
4 False
dtype: bool
print (df[variableToPredict].notnull().all(axis=1))
0 False
1 True
2 True
3 True
4 False
dtype: bool
Testing one or more columns passed as a list:
variableToPredict = ['Survive', 'another column']
print (type(df[variableToPredict]))
<class 'pandas.core.frame.DataFrame'>
print (df[variableToPredict])
Survive another column
0 NaN NaN
1 A NaN
2 B a
3 B b
4 NaN b
print (df[variableToPredict].notnull())
Survive another column
0 False False
1 True False
2 True True
3 True True
4 False True
#True if at least one value in the row is non-missing
print (df[variableToPredict].notnull().any(axis=1))
0 False
1 True
2 True
3 True
4 True
dtype: bool
#True only if all values in the row are non-missing
print (df[variableToPredict].notnull().all(axis=1))
0 False
1 False
2 True
3 True
4 False
dtype: bool
Adding all at the end:
df[df[variableToPredict].notnull().all(1)]
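A small reusable sketch of the same idea (filter_notnull is a hypothetical helper name), accepting either a single column name or a list of names:
import numpy as np
import pandas as pd

def filter_notnull(df, cols):
    # normalize a scalar column name to a list, then reduce the
    # notnull mask row-wise with all() so the mask is always 1d
    cols = [cols] if isinstance(cols, str) else cols
    return df[df[cols].notnull().all(axis=1)]

df = pd.DataFrame({'Survive': [np.nan, 'A', 'B', 'B', np.nan],
                   'another column': [np.nan, np.nan, 'a', 'b', 'b']})

print(filter_notnull(df, 'Survive'))
print(filter_notnull(df, ['Survive', 'another column']))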

How to differentiate string and Alphanumeric?

df:
company_name product Id rating
0 matrix mobile Id456 2.5
1 ins-faq alpha1 Id956 3.5
2 metric5 sounds-B Id-356 2.5
3 ingsaf digital Id856 4star
4 matrix win11p Idklm 2.0
5 4567 mobile 596 3.5
df2:
Col_name Datatype
0 company_name String #(pure string)
1 Product String #(pure string)
2 Id Alpha-Numeric #(must contain at least 1 digit and 1 letter)
3 rating Float or int
df is the main dataframe and df2 holds the expected datatype of each column of the main dataframe.
How can I check every column's values and extract those with the wrong datatype?
output:
row_num col_name current_value expected_dtype
0 2 company_name metric5 string
1 5 company_name 4567 string
2 1 Product alpha1 string
3 4 Product win11p string
4 4 Id Idklm Alpha-Numeric
5 5 Id 596 Alpha-Numeric
6 3 rating 4star Float or int
For columns that cannot contain numbers, you can find the exceptions with:
In [5]: df['product'].str.contains(r'[0-9]')
Out[5]:
0 False
1 True
2 False
3 False
4 True
5 False
Name: product, dtype: bool
For Alpha-Numeric columns, identify compliant values with:
In [7]: df['Id'].str.contains(r'(?:\d\D)|(?:\D\d)')
Out[7]:
0 True
1 True
2 True
3 True
4 False
5 False
Name: Id, dtype: bool
For int or float columns, find exceptions with:
In [8]: df['rating'].str.contains(r'[^0-9.+-]')
Out[8]:
0 False
1 False
2 False
3 True
4 False
5 False
That may be problematic: it won't catch values with multiple or misplaced plus, minus, or dot characters, like 9.4.1 or 6+3.-12. But you could use:
In [11]: def check(thing):
    ...:     # valid if it parses as a float; this also accepts "0",
    ...:     # which bool(float(thing)) alone would wrongly reject
    ...:     try:
    ...:         float(thing)
    ...:         return True
    ...:     except ValueError:
    ...:         return False
    ...:
In [12]: df['rating'].apply(check)
Out[12]:
0 True
1 True
2 True
3 False
4 True
5 True
Name: rating, dtype: bool
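Putting the pieces together, here is a sketch that builds the requested report (the per-column validators and the checks mapping are my own illustration of the rules in df2, not a fixed API):
import pandas as pd

df = pd.DataFrame({
    'company_name': ['matrix', 'ins-faq', 'metric5', 'ingsaf', 'matrix', '4567'],
    'product': ['mobile', 'alpha1', 'sounds-B', 'digital', 'win11p', 'mobile'],
    'Id': ['Id456', 'Id956', 'Id-356', 'Id856', 'Idklm', '596'],
    'rating': ['2.5', '3.5', '2.5', '4star', '2.0', '3.5'],
})

# each validator returns a boolean mask of *valid* values
checks = {
    'company_name': ('String', lambda s: ~s.str.contains(r'\d')),
    'product': ('String', lambda s: ~s.str.contains(r'\d')),
    'Id': ('Alpha-Numeric', lambda s: s.str.contains(r'\d') & s.str.contains(r'[A-Za-z]')),
    'rating': ('Float or int', lambda s: pd.to_numeric(s, errors='coerce').notna()),
}

rows = []
for col, (expected, valid) in checks.items():
    # keep only the rows that fail the column's validator
    for idx, val in df.loc[~valid(df[col]), col].items():
        rows.append({'row_num': idx, 'col_name': col,
                     'current_value': val, 'expected_dtype': expected})

print(pd.DataFrame(rows))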

Pandas incremental values when boolean changes

There is a dataframe containing a column in which boolean values alternate. I want to create an incremental value series based on those boolean changes: increment only when the boolean value differs from the previous one, and do it without a loop.
Example, Here's dataframe:
column
0 True
1 True
2 False
3 False
4 False
5 True
I want to get this:
column inc
0 True 1
1 True 1
2 False 2
3 False 2
4 False 2
5 True 3
Compare the column with its shifted version using ne (not equal), then take the cumulative sum:
df['inc'] = df['column'].ne(df['column'].shift()).cumsum()
print (df)
column inc
0 True 1
1 True 1
2 False 2
3 False 2
4 False 2
5 True 3
Detail:
print (df['column'].ne(df['column'].shift()))
0 True
1 False
2 True
3 False
4 False
5 True
Name: column, dtype: bool
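A self-contained sketch, plus one common follow-up: the inc column doubles as a run identifier, so you can aggregate consecutive runs with it (the run-length summary at the end is just an illustration):
import pandas as pd

df = pd.DataFrame({'column': [True, True, False, False, False, True]})

# True marks the start of a new run; cumsum numbers the runs 1, 2, 3, ...
df['inc'] = df['column'].ne(df['column'].shift()).cumsum()

# e.g. run-length encoding of the column
runs = df.groupby('inc')['column'].agg(value='first', length='size')
print(runs)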

List columns in data frame that have missing values as '?'

List the names of the columns of a data frame, along with the count of missing values, where missing values are coded as '?', using pandas and numpy.
import numpy as np
import pandas as pd
bridgeall = pd.read_excel('bridge.xlsx',sheet_name='Sheet1')
#print(bridgeall)
bridge_sep = bridgeall.iloc[:,0].str.split(',',-1,expand=True)
bridge_sep.columns = ['IDENTIF','RIVER', 'LOCATION', 'ERECTED', 'PURPOSE', 'LENGTH', 'LANES','CLEAR-G', 'T-OR-D',
'MATERIAL', 'SPAN', 'REL-L', 'TYPE']
print(bridge_sep)
Data: I am posting a snippet. It's actually [107 rows x 13 columns].
IDENTIF RIVER LOCATION ERECTED ... MATERIAL SPAN REL-L TYPE
0 E2 A ? CRAFTS ... WOOD SHORT ? WOOD
1 E3 A 39 CRAFTS ... WOOD ? S WOOD
2 E5 A ? CRAFTS ... WOOD SHORT S WOOD
Output required:
LOCATION 2
SPAN 1
REL-L 1
Compare all values with eq (==) and count the occurrences with sum (True is processed as 1), then remove the zero counts by boolean indexing:
s = df.eq('?').sum()
s = s[s != 0]
print (s)
LOCATION 2
SPAN 1
REL-L 1
dtype: int64
Finally, to get a DataFrame, add reset_index:
df1 = s.reset_index()
df1.columns = ['names','count']
print (df1)
names count
0 LOCATION 2
1 SPAN 1
2 REL-L 1
EDIT:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)))
print (df)
0 1 2 3 4
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
#compare with a Series of the same length
#whose index values match the DataFrame's index/columns
s = pd.Series(np.arange(5))
print (s)
0 0
1 1
2 2
3 3
4 4
dtype: int32
#compare columns
print (df.eq(s, axis=0))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True True False False False
3 False False False False False
4 True False False False True
#compare rows
print (df.eq(s, axis=1))
0 1 2 3 4
0 False False False False False
1 True False True False False
2 False False False False False
3 False False False False False
4 False True False True True
If your DataFrame is named df, try (df == '?').sum()
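An equivalent route, if you prefer working with real missing values: replace '?' with NaN first, then count with isna (a minimal sketch on toy data shaped like the snippet above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'LOCATION': ['?', '39', '?'],
                   'SPAN': ['SHORT', '?', 'SHORT'],
                   'REL-L': ['?', 'S', 'S'],
                   'TYPE': ['WOOD', 'WOOD', 'WOOD']})

# turn the '?' markers into NaN so pandas' own missing-value tools apply
s = df.replace('?', np.nan).isna().sum()
print(s[s != 0])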

pandas transform nunique on groupby object dealing with nan values

I have the following df,
inv_id cluster_id
793 2
2
789 3
789 3
4
4
I'd like to group by cluster_id and check how many unique values each group has:
df['same_inv_id'] = df.groupby('cluster_id')['inv_id'].transform('nunique') == 1
but I'd like to set same_inv_id = False whenever a cluster contains one or more empty/blank inv_id values (including clusters that are entirely blank), so the result will look like:
inv_id cluster_id same_inv_id
793 2 False
2 False
789 3 True
789 3 True
4 False
4 False
IIUC, build the condition, then use transform with all:
s1=df.inv_id.ne('').groupby(df.cluster_id).transform('all')
s1
Out[432]:
0 False
1 False
2 True
3 True
4 False
5 False
Name: inv_id, dtype: bool
s2 = df.groupby('cluster_id')['inv_id'].transform('nunique') == 1
df['same_inv_id'] = s1 & s2
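A self-contained sketch of the whole thing (the blank inv_id cells are assumed to be empty strings; swap in isna-based handling if they are NaN):
import pandas as pd

df = pd.DataFrame({'inv_id': ['793', '', '789', '789', '', ''],
                   'cluster_id': [2, 2, 3, 3, 4, 4]})

# all(): True only for clusters with no blank inv_id at all
s1 = df['inv_id'].ne('').groupby(df['cluster_id']).transform('all')
# nunique == 1: True for clusters whose inv_id values are all identical
s2 = df.groupby('cluster_id')['inv_id'].transform('nunique') == 1

df['same_inv_id'] = s1 & s2
print(df)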
