Transform string to Pandas df - python

I have a string like this:
'key=IAfpK, age=58, key=WNVdi, age=64, key=jp9zt, age=47'
How can I transform it into a Pandas DataFrame like this?

   key  age
0  ...  ...
1  ...  ...
Thank you

Use:

import pandas as pd

s = 'key=IAfpK, age=58, key=WNVdi, age=64, key=jp9zt, age=47'
d = {}
for i in s.split(', '):
    ele, val = i.split('=')
    if ele in d:
        d[ele].append(val)
    else:
        d[ele] = [val]

df = pd.DataFrame(d)
print(df)

Output:

     key age
0  IAfpK  58
1  WNVdi  64
2  jp9zt  47

A quick and somewhat manual way would be to first create a list of dicts, appending one entry per record, and then convert that list to a DataFrame (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html):
import pandas as pd
keylist = []
keylist.append({"key": 'IAfpK', "age": '58'})
keylist.append({"key": 'WNVdi', "age": '64'})
keylist.append({"key": 'jp9zt', "age": '47'})
#convert the list of dictionaries into a df
key_df = pd.DataFrame(keylist, columns = ['key', 'age'])
However, this is only practical for the specific string you mentioned; if you need to work with a longer string or more data, building the list in a for loop would be better.
Although I think this answers your question, there are probably more optimal ways to go about it :)
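The loop-based version mentioned above can be sketched as follows, assuming the key/age tokens always alternate in this order (the pairing logic is an illustration, not part of the original answer):

```python
import pandas as pd

s = 'key=IAfpK, age=58, key=WNVdi, age=64, key=jp9zt, age=47'

# Split into [name, value] tokens, then pair each 'key' token with the 'age' token after it
tokens = [t.split('=') for t in s.split(', ')]
keylist = [{'key': k, 'age': a} for (_, k), (_, a) in zip(tokens[::2], tokens[1::2])]

key_df = pd.DataFrame(keylist, columns=['key', 'age'])
print(key_df)
```

This builds the same list of dicts as the manual version, but scales to any number of key/age pairs.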

Try:
s = "key=IAfpK, age=58, key=WNVdi, age=64, key=jp9zt, age=47"
x = (
    pd.Series(s)
    .str.extractall(r"key=(?P<key>.*?),\s*age=(?P<age>.*?)(?=,|\Z)")
    .reset_index(drop=True)
)
print(x)
Prints:
     key age
0  IAfpK  58
1  WNVdi  64
2  jp9zt  47
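A similar result can be sketched with re.findall, under the assumption that keys and ages are plain word characters (this variant is an illustration, not part of the answer above):

```python
import re
import pandas as pd

s = "key=IAfpK, age=58, key=WNVdi, age=64, key=jp9zt, age=47"

# Each match yields a (key, age) tuple because the pattern has two capture groups
pairs = re.findall(r"key=(\w+), age=(\w+)", s)
df = pd.DataFrame(pairs, columns=["key", "age"])
print(df)
```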


find the rows with more than 4 values in a list in a column

The dataframe I have, df:
name list
0 kfjh [[a,b,c],[d,f,h],[g,k,l]]
1 jhkg [[a,b,c],[d,f,h],[g,k,l],[f,k,j]]
2 khfg [[a,b,c],[g,k,l]]
3 khkjgr [[a,b,c],[d,f,h]]
4 kjrgjg [[d,f,h]]
5 jkdgr [[a,b,c],[d,f,h],[g,k,l, [g,j,l],[f,l,p]]
6 hgyr [[a,b,c],[d,kf,h],[g,k,l, [g,j,l],[f,l,p]]
7 jkgtjd [[f,l,p]]
8 nkjgrd [t,t,i]
If a row's list has more than 4 lists in it, then I would like to get df1.
The desired output, df1 :
name list
5 jkdgr [[a,b,c],[d,f,h],[g,k,l, [g,j,l],[f,l,p]]
6 hgyr [[a,b,c],[d,kf,h],[g,k,l, [g,j,l],[f,l,p]]
and, df2:
name list
0 kfjh [[a,b,c],[d,f,h],[g,k,l]]
1 jhkg [[a,b,c],[d,f,h],[g,k,l],[f,k,j]]
2 khfg [[a,b,c],[g,k,l]]
3 khkjgr [[a,b,c],[d,f,h]]
4 kjrgjg [[d,f,h]]
7 jkgtjd [[f,l,p]]
8 nkjgrd [t,t,i]
You can do something like this if the list column is a string. If it is an actual list of lists with string elements, you can drop the split and just compare the length of the list to 4.
import pandas as pd
data = {
    'name': ['kfjh', 'jhkg', 'khfg', 'khkjgr', 'kjrgjg', 'jkdgr', 'hgyr', 'jkgtjd', 'nkjgrd'],
    'list': ['[[a,b,c],[d,f,h],[g,k,l]]', '[[a,b,c],[d,f,h],[g,k,l],[f,k,j]]', '[[a,b,c],[g,k,l]]', '[[a,b,c],[d,f,h]]', '[[d,f,h]]', '[[a,b,c],[d,f,h],[g,k,l],[g,j,l],[f,l,p]]', '[[a,b,c],[d,f,h],[g,kf,l],[g,j,l],[f,l,p]]', '[[f,l,p]]', '[t,t,i]']
}
df = pd.DataFrame(data)
df['drop'] = df.apply(lambda row: 'no' if len(row['list'].split('[')) > 6 else 'yes', axis=1)
df1 = df.loc[df['drop'] == 'yes']
df2 = df.loc[df['drop'] == 'no']
df1 = df1.drop(columns=['drop'])
df2 = df2.drop(columns=['drop'])
print(df1)
print(df2)
Try this to turn the strings into real lists first (note that the result needs to be assigned back, and that it assumes the list literals are valid Python):
from ast import literal_eval
df['list'] = df['list'].apply(literal_eval)
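To see how this helps: once literal_eval has produced real lists, their lengths can be measured directly. A small self-contained sketch, using properly quoted list literals (the unquoted letters in the question's strings would not parse):

```python
import pandas as pd
from ast import literal_eval

# Hypothetical rows whose 'list' column holds string representations of lists
df = pd.DataFrame({
    'name': ['kfjh', 'khfg'],
    'list': ["[['a','b','c'],['d','f','h'],['g','k','l']]",
             "[['a','b','c'],['g','k','l']]"],
})

# literal_eval parses each string into an actual Python list of lists
df['list'] = df['list'].apply(literal_eval)
print(df['list'].map(len))
```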
You can use map(len) to give the number of elements in a List in a column. So you could use:
df1 = df[df['list'].map(len) > 4]
df2 = df[df['list'].map(len) <= 4]
which gives the two sets of results you present
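A runnable sketch of this approach, with a few made-up rows holding real list-of-lists values:

```python
import pandas as pd

# Made-up rows whose 'list' cells are real lists of lists
df = pd.DataFrame({
    'name': ['kfjh', 'jhkg', 'jkdgr'],
    'list': [
        [['a', 'b', 'c'], ['d', 'f', 'h'], ['g', 'k', 'l']],
        [['a', 'b', 'c'], ['d', 'f', 'h'], ['g', 'k', 'l'], ['f', 'k', 'j']],
        [['a', 'b', 'c'], ['d', 'f', 'h'], ['g', 'k', 'l'], ['g', 'j', 'l'], ['f', 'l', 'p']],
    ],
})

mask = df['list'].map(len) > 4   # True where a cell holds more than 4 sub-lists
df1 = df[mask]
df2 = df[~mask]
print(df1)
print(df2)
```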
Simply iterate through the first dataframe, get each list's length by counting nested lists with a recursive function, and add the corresponding rows to another dataframe (pd.concat is used for this, since DataFrame.append was removed in pandas 2.0):
import pandas as pd

def count_lists(l):
    return sum(1 + count_lists(i) for i in l if isinstance(i, list))

data = {'name': ['kfjh', 'jhkg', 'khfg', 'khkjgr', 'kjrgjg', 'jkdgr', 'hgyr', 'jkgtjd', 'nkjgrd'],
        'list': [[['a','b','c'],['d','f','h'],['g','k','l']], [['a','b','c'],['d','f','h'],['g','k','l'],['f','k','j']],
                 [['a','b','c'],['g','k','l']], [['a','b','c'],['d','f','h']], [['d','f','h']],
                 [['a','b','c'],['d','f','h'],['g','k','l', ['g','j','l'],['f','l','p']]],
                 [['a','b','c'], ['d','kf','h'],['g','k','l', ['g','j','l'], ['f','l','p']]], [['f','l','p']], ['t','t','i']]}
dframe = pd.DataFrame(data)
dframe1 = pd.DataFrame()
dframe2 = pd.DataFrame()
for i, j in dframe.iterrows():
    if count_lists(j) - 1 > 4:
        dframe2 = pd.concat([dframe2, dframe.iloc[[i]]])
    else:
        dframe1 = pd.concat([dframe1, dframe.iloc[[i]]])
print("Dataframe1:\n", dframe1, "\n")
print("Dataframe2:\n", dframe2)
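The recursive counter can be sanity-checked on its own; a quick example with made-up inputs:

```python
def count_lists(l):
    # Each nested list counts as 1 plus however many lists it contains itself
    return sum(1 + count_lists(i) for i in l if isinstance(i, list))

print(count_lists([['a', 'b'], ['c', ['d']]]))  # 2 top-level lists + 1 nested = 3
print(count_lists(['t', 't', 'i']))             # no nested lists at all = 0
```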

How to split one row into multiple rows in python

I have a pandas dataframe that has one long row as a result of a flattened json list.
I want to go from the example:
{'0_id': 1, '0_name': a, '0_address': USA, '1_id': 2, '1_name': b, '1_address': UK, '1_hobby': ski}
to a table like the following:
id  name  address  hobby
1   a     USA
2   b     UK       ski
Any help is greatly appreciated :)
There you go:
import json

json_data = '{"0_id": 1, "0_name": "a", "0_address": "USA", "1_id": 2, "1_name": "b", "1_address": "UK", "1_hobby": "ski"}'
arr = json.loads(json_data)
result = {}
for k in arr:
    kk = k.split("_")
    if int(kk[0]) not in result:
        result[int(kk[0])] = {"id": "", "name": "", "address": "", "hobby": ""}
    result[int(kk[0])][kk[1]] = arr[k]
for key in result:
    print("%s %s %s" % (key, result[key]["name"], result[key]["address"]))
If you want the fields to be more dynamic, you have two choices: either go through the whole array first, gather all possible names, and build the empty template from them, or just check whether a key exists in result when you return the results :)
This way only works if every column follows this pattern, but should otherwise be pretty robust.
import pandas as pd

data = {'0_id': '1', '0_name': 'a', '0_address': 'USA', '1_id': '2', '1_name': 'b', '1_address': 'UK', '1_hobby': 'ski'}
df = pd.DataFrame(data, index=[0])
indexes = set(x.split('_')[0] for x in df.columns)
to_concat = []
for i in indexes:
    target_columns = [col for col in df.columns if col.startswith(i)]
    df_slice = df[target_columns].copy()  # copy so renaming the columns does not warn
    df_slice.columns = [x.split('_')[1] for x in df_slice.columns]
    to_concat.append(df_slice)
new_df = pd.concat(to_concat)
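The same reshape can also be sketched with a plain dict-of-dicts and DataFrame.from_dict, assuming every key follows the idx_field naming pattern (this is an alternative illustration, not either answer's code):

```python
import pandas as pd

data = {'0_id': 1, '0_name': 'a', '0_address': 'USA',
        '1_id': 2, '1_name': 'b', '1_address': 'UK', '1_hobby': 'ski'}

rows = {}
for key, value in data.items():
    idx, field = key.split('_', 1)          # '0_id' -> ('0', 'id')
    rows.setdefault(int(idx), {})[field] = value

# orient='index' makes each inner dict one row; missing fields become NaN
df = pd.DataFrame.from_dict(rows, orient='index')
print(df)
```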

How to recode/ map shared columns for dataframe stored in a dictionary?

I want to recode the 'Flavor' field, which both data sets share.
I successfully stored the data as data frames in a dictionary, but the names assigned (for ex. 'df_Mike') are strings and not callable/ alterable objects.
Do let me know where I'm going wrong and explain why.
name = ['Mike', 'Sue']
d = {}
for n in name:
    url = f'https://raw.githubusercontent.com/steflangehennig/info4120/main/data/{n}.csv'
    response = urllib.request.urlopen(url)
    d[n] = pd.DataFrame(pd.read_csv(response))

flavor = {1: 'Choclate', 2: 'Vanilla', 3: 'Mixed'}
for df in d:
    df.map({'Flavor': flavor}, inplace = True)
Error code:
1 flavor = {1: 'Choclate', 2: 'Vanilla', 3:'Mixed'}
3 for df in d:
----> 4 df.map({'Flavor': flavor}, inplace = True)
AttributeError: 'str' object has no attribute 'map'
You are iterating over the keys in the dictionary, which are strings, hence the AttributeError. If you want to iterate over the DataFrames, use dict.values(). Note also that DataFrame.map has no inplace argument; map the 'Flavor' column and assign the result back. For example:
for df in d.values():
    df['Flavor'] = df['Flavor'].map(flavor)
It would be better to iterate over the dict items. So, use
for name, df in d.items():
You can try
for n in name:
    url = f'https://raw.githubusercontent.com/steflangehennig/info4120/main/data/{n}.csv'
    df = pd.read_csv(url)
    d[n] = df

for df in d.values():
    df['Flavor'] = df['Flavor'].map(flavor)

or do the mapping directly in the first loop:

for n in name:
    url = f'https://raw.githubusercontent.com/steflangehennig/info4120/main/data/{n}.csv'
    df = pd.read_csv(url)
    df['Flavor'] = df['Flavor'].map(flavor)
    d[n] = df
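A self-contained sketch of this pattern, with made-up frames standing in for the downloaded CSVs:

```python
import pandas as pd

flavor = {1: 'Choclate', 2: 'Vanilla', 3: 'Mixed'}

# Made-up frames standing in for the two downloaded CSVs
d = {
    'Mike': pd.DataFrame({'Flavor': [1, 2, 3]}),
    'Sue': pd.DataFrame({'Flavor': [3, 1]}),
}

# Iterating over values() gives the DataFrames themselves, not the key strings
for df in d.values():
    df['Flavor'] = df['Flavor'].map(flavor)

print(d['Mike'])
print(d['Sue'])
```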

How to check if any of elements in a dictionary value is in string?

I have a dataframe with strings and a dictionary which values are lists of strings.
I need to check if each string of the dataframe contains any element of every value in the dictionary. And if it does, I need to label it with the appropriate key from the dictionary. All I need to do is to categorize all the strings in the dataframe with keys from the dictionary.
For example.
df = pd.DataFrame({'a':['x1','x2','x3','x4']})
d = {'one':['1','aa'],'two':['2','bb']}
I would like to get something like this:
df = pd.DataFrame({
    'a': ['x1', 'x2', 'x3', 'x4'],
    'Category': ['one', 'two', 'x3', 'x4']})
I tried this, but it did not work:
df['Category'] = np.nan
for k, v in d.items():
    for l in v:
        df['Category'] = [k if l in str(x).lower() else x for x in df['a']]
Any ideas appreciated!
First, create a function that does the lookup for you:
def func(val):
    for x in range(0, len(d.values())):
        if val in list(d.values())[x]:
            return list(d.keys())[x]
Now make use of the split() and apply() methods:
df['Category'] = df['a'].str.split('', expand=True)[2].apply(func)
Finally use the fillna() method:
df['Category'] = df['Category'].fillna(df['a'])
Now if you print df you will get your expected output:
    a Category
0  x1      one
1  x2      two
2  x3       x3
3  x4       x4
Edit:
You can also do this by:
def func(val):
    for x in range(0, len(d.values())):
        if any(l in val for l in list(d.values())[x]):
            return list(d.keys())[x]
then:
df['Category'] = df['a'].apply(func)
Finally:
df['Category'] = df['Category'].fillna(df['a'])
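The lookup can also be sketched as a single helper using next(), which returns the first matching key and falls back to the original string (categorize is a hypothetical name, not from the answers above):

```python
import pandas as pd

df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}

def categorize(s):
    # First key whose substring list matches s; fall back to s itself
    return next((k for k, subs in d.items() if any(sub in s for sub in subs)), s)

df['Category'] = df['a'].apply(categorize)
print(df)
```

This folds the matching and the fillna fallback into one pass.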
I've come up with the following heuristic, which looks really dirty.
It outputs what you desire, albeit with some warnings, since I've used indices to append values to dataframe.
import pandas as pd
import numpy as np

def main():
    df = pd.DataFrame({'a': ['x1', 'x2', 'x3', 'x4']})
    d = {'one': ['1', 'aa'], 'two': ['2', 'bb']}
    found = False
    i = 0
    df['Category'] = np.nan
    for x in df['a']:
        for k, v in d.items():
            for item in v:
                if item in x:
                    df.loc[i, 'Category'] = k  # .loc avoids the chained-assignment warnings
                    found = True
                    break
                else:
                    df.loc[i, 'Category'] = x
            if found:
                found = False
                break
        i += 1
    print(df)

main()

Pandas use cell value as dict key to return dict value

My question relates to using the values in a dataframe column as dictionary keys in order to look up their respective values and run a conditional.
I have a dataframe, df, containing a column "count" that has integers from 1 to 8 and a column "category" that has values either "A", "B", or "C"
I have a dictionary, dct, containing pairs A:2, B:4, C:6
This is my (incorrect) code:
result = df[df["count"] >= dct.get(df["category"])]
So I want to return a dataframe where the "count" value for a given row is equal to or more than the value retrieved from the dictionary using the "category" letter in the same row.
So if there were count values of (1, 2, 6, 6) and category values of (A, B, C, A), the third and fourth rows would be returned in the resulting dataframe.
How do I modify the above code to achieve this?
A good way to go is to map your dictionary into the existing dataframe as a new column and then run a query on the result:
import pandas as pd
df = pd.DataFrame(data={'count': [4, 5, 6], 'category': ['A', 'B', 'C']})
dct = {'A':5, 'B':4, 'C':-1}
df['min_count'] = df['category'].map(dct)
df = df.query('count>min_count')
following your logic:
import pandas as pd
dct = {'A':2, 'B':4, 'C':6}
df = pd.DataFrame({'count': [1, 2, 5, 6],
                   'category': ['A', 'B', 'C', 'A']})
print('original dataframe')
print(df)
def process_row(x):
    return x['count'] >= dct[x['category']]
f = df.apply(lambda row: process_row(row), axis=1)
df = df[f]
print('final output')
print(df)
output:
original dataframe
   count category
0      1        A
1      2        B
2      5        C
3      6        A
final output
   count category
3      6        A
A small modification to your code:
result = df[df['count'] >= df['category'].apply(lambda x: dct[x])]
You cannot directly use dct.get(df['category']) because df['category'] is a Series, which is mutable and therefore unhashable, and dictionary keys must be hashable objects.
So, apply and lambda to the rescue! :)
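For completeness, the per-row lookup can also be done without apply by mapping the category column through the dict, which is a sketch of the fully vectorized form (with made-up data matching the question's example):

```python
import pandas as pd

df = pd.DataFrame({'count': [1, 2, 6, 6], 'category': ['A', 'B', 'C', 'A']})
dct = {'A': 2, 'B': 4, 'C': 6}

# Series.map looks every category up in the dict, giving a per-row threshold
result = df[df['count'] >= df['category'].map(dct)]
print(result)
```

This returns the third and fourth rows, as the question describes.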
