Splitting strings of tuples of different lengths to columns in Pandas DF - python

I have a dataframe that looks like this:

id  human_id
1   ('apples', '2022-12-04', 'a5ted')
2   ('bananas', '2012-2-14')
3   ('2012-2-14', 'reda21', 'ss')
..  ..
I would like a "pythonic" way to get this output:

id  human_id                           col1       col2        col3
1   ('apples', '2022-12-04', 'a5ted')  apples     2022-12-04  a5ted
2   ('bananas', '2012-2-14')           bananas    2012-2-14   np.NaN
3   ('2012-2-14', 'reda21', 'ss')      2012-2-14  reda21      ss
import pandas as pd
df['a'], df['b'], df['c'] = df.human_id.str
The code I have tried gives me this error:
ValueError: not enough values to unpack (expected 2, got 1)
How can I split the values in tuple to be in columns?
Thank you.

You can do it this way. It will just put None where it couldn't find a value. You can then append df1 to df.
import pandas as pd

d = {'id': [1, 2, 3],
     'human_id': ["('apples', '2022-12-04', 'a5ted')",
                  "('bananas', '2012-2-14')",
                  "('2012-2-14', 'reda21', 'ss')"
                  ]}
df = pd.DataFrame(data=d)

list_human_id = list(df['human_id'])
newList = []
for val in list_human_id:
    # Each string holds a tuple literal; eval turns it back into a tuple.
    newList.append(eval(val))
df1 = pd.DataFrame(newList, columns=['col1', 'col2', 'col3'])
print(df1)
Output:
        col1        col2   col3
0     apples  2022-12-04  a5ted
1    bananas   2012-2-14   None
2  2012-2-14      reda21     ss
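As a side note, ast.literal_eval from the standard library is a safer drop-in for eval when the strings are plain tuple literals, since it only parses Python literals and cannot execute arbitrary code. A minimal sketch of the same idea:
import ast
import pandas as pd

d = {'id': [1, 2, 3],
     'human_id': ["('apples', '2022-12-04', 'a5ted')",
                  "('bananas', '2012-2-14')",
                  "('2012-2-14', 'reda21', 'ss')"]}
df = pd.DataFrame(data=d)

# literal_eval parses each tuple literal without executing code.
newList = [ast.literal_eval(val) for val in df['human_id']]
df1 = pd.DataFrame(newList, columns=['col1', 'col2', 'col3'])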

You can do:
out = df.join(pd.DataFrame(df.human_id.tolist(), index=df.index, columns=['a', 'b', 'c']))
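Note this assumes human_id holds actual tuples rather than strings; pd.DataFrame pads the shorter tuples with None on its own. A minimal sketch:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'human_id': [('apples', '2022-12-04', 'a5ted'),
                                ('bananas', '2012-2-14'),
                                ('2012-2-14', 'reda21', 'ss')]})

# Tuples shorter than the column list come back padded with None/NaN.
out = df.join(pd.DataFrame(df.human_id.tolist(), index=df.index,
                           columns=['a', 'b', 'c']))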

This creates the columns dynamically, based on each tuple's length, using the same dataframe:
import pandas as pd

id = [1, 2, 3]
human_id = [('apples', '2022-12-04', 'a5ted'),
            ('bananas', '2012-2-14'),
            ('2012-2-14', 'reda21', 'ss')]
df = pd.DataFrame({'id': id, 'human_id': human_id})
print("*"*20, 'Dataframe', "*"*20)
print(df.to_string())
print()
print("*"*20, 'Split Data', "*"*20)
row = 0
for x in df['human_id']:
    col = 1
    for xx in x:
        # Write each tuple element into its own colN column.
        name_column = 'col' + str(col)
        df.loc[df.index[row], name_column] = str(xx)
        col += 1
    row += 1
print(df.to_string())
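For comparison, the same dynamic columns can be built without the manual loop. This is a sketch using apply(pd.Series), which creates one column per tuple position and pads the shorter tuples with NaN:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'human_id': [('apples', '2022-12-04', 'a5ted'),
                                ('bananas', '2012-2-14'),
                                ('2012-2-14', 'reda21', 'ss')]})

# One column per tuple position, renamed col1..colN.
cols = df['human_id'].apply(pd.Series)
cols.columns = ['col' + str(i + 1) for i in range(cols.shape[1])]
df = df.join(cols)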

Related

How to append to each column of empty pandas data frame different size of lists in a loop?

Hi guys, this is getting frustrating! After long hours of online browsing, I cannot find a single source that can help here. How do I append lists of different sizes to the columns of an empty pandas data frame? For instance, I have these three variables:
var1 = ['BBCL15', 'KL12TT', 'TMAA03', '1523FR']
var2 = [253, 452, 16]
var3 = ['23n2', 'akg_9', '12.3bl', '30x2', 'dd91']
And I would like to append them to an empty pandas data frame in a loop:
df = pd.DataFrame(columns=['col1', 'col2', 'col3'])
# something like this:
for x in var1:
    df['col1'].append(pd.Series(x), ignore_index=True)
for x in var2:
    df['col2'].append(pd.Series(x), ignore_index=True)
for x in var3:
    df['col3'].append(pd.Series(x), ignore_index=True)
Each variable corresponds to a single column, and the empty spaces should be filled with NaN since the variables' lengths are not the same.
Can someone help with this?
>>> cols = ['col1', 'col2', 'col3']
>>> df = pd.DataFrame(columns=cols)
>>> max_len = max([len(var1), len(var2), len(var3)])
>>> for col, var in zip(cols, [var1, var2, var3]):
...     df[col] = var + ([None] * (max_len - len(var)))
...
>>> df
     col1   col2    col3
0  BBCL15  253.0    23n2
1  KL12TT  452.0   akg_9
2  TMAA03   16.0  12.3bl
3  1523FR    NaN    30x2
4    None    NaN    dd91
Create a list of lists to use list comprehensions:
lists = [var1, var2, var3]
Get the length of the longest list:
longest_length = max([len(v) for v in lists])
Pad the lists as required:
padded_lists = [v + [float("NaN")]*(longest_length - len(v)) for v in lists]
Create the data frame:
pd.DataFrame(padded_lists).T
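Putting the steps together, a small sketch (the index here supplies the column names, and the transpose turns each padded list into a column):
import pandas as pd

var1 = ['BBCL15', 'KL12TT', 'TMAA03', '1523FR']
var2 = [253, 452, 16]
var3 = ['23n2', 'akg_9', '12.3bl', '30x2', 'dd91']

lists = [var1, var2, var3]
longest_length = max(len(v) for v in lists)
padded_lists = [v + [float("nan")] * (longest_length - len(v)) for v in lists]

# Each padded list is a row here, so transpose to get columns.
df = pd.DataFrame(padded_lists, index=['col1', 'col2', 'col3']).T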
Here is another solution using pd.concat:
var1 = ['BBCL15', 'KL12TT', 'TMAA03', '1523FR']
var2 = [253, 452, 16]
var3 = ['23n2', 'akg_9', '12.3bl', '30x2', 'dd91']
df = pd.DataFrame()
for i in [var1, var2, var3]:
    df = pd.concat([df, pd.Series(i)], axis=1, ignore_index=True)
df.columns = ['col1', 'col2', 'col3']
Note: avoid naming the data frame columns up front when using this solution; assign the names afterwards, as above.
This one is also a practical solution:
mydict = {'col1': var1, 'col2': var2, 'col3': var3}
df = pd.DataFrame({key:pd.Series(value) for key, value in mydict.items()})
df
     col1   col2    col3
0  BBCL15  253.0    23n2
1  KL12TT  452.0   akg_9
2  TMAA03   16.0  12.3bl
3  1523FR    NaN    30x2
4     NaN    NaN    dd91

Converting 3 Columns into 2 Columns - Python

I need to convert 3 columns into 2 columns using Python, doubling the rows:
col1 col2 col3
A    2    3
B    4    5
Desired output:
col1 col2
A    2
A    3
B    4
B    5
My code:
import csv

hdr = ['col1', 'col2']
final_output = []
# rows: the input records, e.g. a list of dicts read from the source CSV
for row in rows:
    output = {'col1': row.get('col1'), 'col2': row.get('col2')}
    output1 = {'col1': row.get('col1'), 'col2': row.get('col3')}
    final_output.append(output)
    final_output.append(output1)
with open('tgt_file.csv', 'w') as tgt_file:
    csv_writer = csv.DictWriter(tgt_file, fieldnames=hdr, delimiter=',')
    csv_writer.writeheader()
    csv_writer.writerows(final_output)
import pandas as pd

# this is the sample data
df = pd.DataFrame(data=[['A', 2, 3], ['B', 4, 5]],
                  columns=['col1', 'col2', 'col3'])

# this is the solution
ef = []  # create an empty list
for i, row in df.iterrows():
    ef.append([row[0], row[1]])  # append first column first
    ef.append([row[0], row[2]])  # append 2nd column second
df = pd.DataFrame(data=ef, columns=['col1', 'col2'])  # recreate the dataframe

Remark: there are more advanced solutions possible, but I think this is readable.
You can try using pd.melt:
df = pd.melt(df, id_vars=["col1"],value_name = 'col2').drop(['variable'],axis=1)
And then you can sort the dataframe on "col1".
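For example, a short sketch of the two steps combined:
import pandas as pd

df = pd.DataFrame(data=[['A', 2, 3], ['B', 4, 5]],
                  columns=['col1', 'col2', 'col3'])

# melt stacks col2 and col3 into a single column; sorting on col1
# brings each key's rows back together.
out = (pd.melt(df, id_vars=['col1'], value_name='col2')
         .drop(['variable'], axis=1)
         .sort_values('col1')
         .reset_index(drop=True))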

Pandas : if value in a dataframe contains string from another dataframe, append columns

Let's say I have two dataframes df1 and df2.
I want to append some columns of df2 to df1 if the value of a specific column of df1 contains the string in a specific column of df2, NaN if not.
A small example:
import pandas as pd
df1 = pd.DataFrame({'col': ['abc', 'def', 'abg', 'xyz']})
df2 = pd.DataFrame({'col1': ['ab', 'ef'], 'col2': ['match1', 'match2'], 'col3': [1, 2]})
df1:
   col
0  abc
1  def
2  abg
3  xyz

df2:
  col1    col2  col3
0   ab  match1     1
1   ef  match2     2
I want:
   col col2_match col3_match
0  abc     match1          1
1  def     match2          2
2  abg     match1          1
3  xyz        NaN        NaN
I managed to do it in a dirty and inefficient way, but in my case df1 contains something like 100K rows and it takes forever...
Thanks in advance!
EDIT
A bit dirty, but it gets the work done relatively quickly (I still think there is a smarter way, though...):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'col': ['abc', 'def', 'abg']})
df2 = pd.DataFrame({'col1': ['ab', 'ef'],
                    'col2': ['match1', 'match2'],
                    'col3': [1, 2]})

def return_nan(tup):
    return np.nan if len(tup[0]) == 0 else tup[0][0]

def get_indexes_match(l1, l2):
    return [return_nan(np.where([x in e for x in l2])) for e in l1]

def merge(df1, df2, left_on, right_on):
    df1.loc[:, 'idx'] = get_indexes_match(df1[left_on].values,
                                          df2[right_on].values)
    df2.loc[:, 'idx'] = np.arange(len(df2))
    return pd.merge(df1, df2, how='left', on='idx')

merge(df1, df2, left_on='col', right_on='col1')
You can use the Python difflib module for fuzzy matching, via difflib.get_close_matches:
import difflib

df1.col = df1.col.map(lambda x: difflib.get_close_matches(x, df2.col1)[0])
So now your df1 is:
  col
0  ab
1  ef
2  ab
You can call it df3 if you wish to keep df1 unaltered.
Now you can merge:
merged = df1.merge(df2, left_on='col', right_on='col1', how='outer').drop('col1', axis=1)
The merged dataframe looks like:
  col    col2  col3
0  ab  match1     1
1  ab  match1     1
2  ef  match2     2
EDIT:
In case of no match, as in the new example given, you just need a conditional in the lambda:
df1.col = df1.col.map(lambda x: difflib.get_close_matches(x, df2.col1)[0] if difflib.get_close_matches(x, df2.col1) else x)
Now after the merge you get:
   col    col2  col3
0   ab  match1     1
1   ab  match1     1
2   ef  match2     2
3  xyz     NaN   NaN
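If the requirement really is a literal substring match rather than fuzzy matching, a vectorized sketch for large frames builds one regex from df2.col1, extracts the matching substring with str.extract, and then merges; re.escape guards against regex metacharacters in the substrings:
import re
import pandas as pd

df1 = pd.DataFrame({'col': ['abc', 'def', 'abg', 'xyz']})
df2 = pd.DataFrame({'col1': ['ab', 'ef'],
                    'col2': ['match1', 'match2'],
                    'col3': [1, 2]})

# One alternation pattern that matches any substring from df2.col1.
pattern = '(' + '|'.join(map(re.escape, df2['col1'])) + ')'
df1['col1'] = df1['col'].str.extract(pattern, expand=False)

# Rows with no match get a NaN key, and thus NaN in the merged columns.
out = df1.merge(df2, on='col1', how='left').drop(columns='col1')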

Pandas dataframe: Group by two columns and then average over another column

Assuming that I have a dataframe with the following values:
df:
col1  col2  value
1     2     3
1     2     1
2     3     1
I want to first group my dataframe by the first two columns (col1 and col2) and then average over the values of the third column (value). The desired output would look like this:
col1  col2  avg-value
1     2     2
2     3     1
I am using the following code:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby('col1','col2').mean())
which gets the following error:
ValueError: No axis named col2 for object type <class 'pandas.core.frame.DataFrame'>
Any help would be much appreciated.
You need to pass a list of the columns to groupby; what you passed was interpreted as the axis param, which is why it raised an error:
In [30]:
columns = ['col1','col2','avg']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
print(df[['col1','col2','avg']].groupby(['col1','col2']).mean())
           avg
col1 col2
1    2       3
     3       3
If you want to group by multiple columns, you should put them in a list:
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).mean())
Or, slightly more verbose, for the sake of getting the word 'avg' into the aggregated dataframe, using named aggregation (the older nested-dict renaming spec has been removed from recent pandas):
columns = ['col1','col2','value']
df = pd.DataFrame(columns=columns)
df.loc[0] = [1,2,3]
df.loc[1] = [1,3,3]
df.loc[2] = [2,3,1]
print(df.groupby(['col1','col2']).agg(avg=('value', 'mean')))
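To get col1 and col2 back as ordinary columns with the avg-value name from the question, a small follow-up sketch (using the question's original value column):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2], 'col2': [2, 2, 3], 'value': [3, 1, 1]})

# as_index=False keeps the group keys as columns instead of an index.
out = (df.groupby(['col1', 'col2'], as_index=False)['value']
         .mean()
         .rename(columns={'value': 'avg-value'}))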

Remove rows where values appear in all columns in Pandas

Here is a very simple dataframe:
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [1, 3, 3]})
I'm trying to remove rows where there are duplicate values (e.g., row 3)
This doesn't work,
df = df[(df.col1 != 3 & df.col2 != 3)]
and the documentation is pretty clear about why, which makes sense.
But I still don't know how to delete that row.
Does anyone have any ideas? Thanks. Monica.
If I understand your question correctly, I think you were close.
Starting from your data:
In [20]: df
Out[20]:
   col1  col2
0     1     1
1     2     3
2     3     3
And doing this:
In [21]: df = df[df['col1'] != df['col2']]
Returns:
In [22]: df
Out[22]:
   col1  col2
1     2     3
What about:
In [43]: df = pd.DataFrame({'col1': [1, 2, 3],
   ....:                    'col2': [1, 3, 3]})

In [44]: df[df.max(axis=1) != df.min(axis=1)]
Out[44]:
   col1  col2
1     2     3

[1 rows x 2 columns]
We want to remove rows whose values show up in all columns; in other words, the values are all equal, so their minimum and maximum are equal. This method works on a DataFrame with any number of columns. Applying the above removes rows 0 and 2.
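An equivalent check, sketched here, counts the distinct values per row with nunique: a row whose values are all identical has exactly one unique value.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [1, 3, 3]})

# Keep only rows that contain more than one distinct value.
out = df[df.nunique(axis=1) > 1]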
Any row with all the same values will have a standard deviation of zero. One way to filter on that is as follows:
import pandas as pd
import numpy as np

df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [1, 3, 3, np.nan]})

>>> df.loc[df.std(axis=1, skipna=False) > 0]
   col1  col2
1   2.0   3.0
