How to split one row into multiple rows in Python

I have a pandas DataFrame that has one long row as a result of a flattened JSON list.
I want to go from this example:
{'0_id': 1, '0_name': 'a', '0_address': 'USA', '1_id': 2, '1_name': 'b', '1_address': 'UK', '1_hobby': 'ski'}
to a table like the following:
id   name   address   hobby
1    a      USA
2    b      UK        ski
Any help is greatly appreciated :)

There you go:
import json

json_data = '{"0_id": 1, "0_name": "a", "0_address": "USA", "1_id": 2, "1_name": "b", "1_address": "UK", "1_hobby": "ski"}'
arr = json.loads(json_data)
result = {}
for k in arr:
    # split "0_name" into the row index ("0") and the field name ("name")
    kk = k.split("_")
    if int(kk[0]) not in result:
        result[int(kk[0])] = {"id": "", "name": "", "address": "", "hobby": ""}
    result[int(kk[0])][kk[1]] = arr[k]
for key in result:
    print("%s %s %s" % (key, result[key]["name"], result[key]["address"]))
If you want the fields to be more dynamic, you have two choices: either go through the whole array first and gather all possible field names before building the empty template, or just check whether a key exists in result when you return the results :)
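For example, a minimal sketch of the first option (a two-pass approach of my own, gathering the field names before building the rows; it is not the only way to do it):
import json

json_data = '{"0_id": 1, "0_name": "a", "0_address": "USA", "1_id": 2, "1_name": "b", "1_address": "UK", "1_hobby": "ski"}'
arr = json.loads(json_data)

# First pass: collect every field name that appears in any "<index>_<field>" key.
fields = sorted({k.split("_", 1)[1] for k in arr})

# Second pass: build one dict per row index, defaulting missing fields to "".
result = {}
for k, v in arr.items():
    idx, field = k.split("_", 1)
    result.setdefault(int(idx), {f: "" for f in fields})[field] = v

for key, row in result.items():
    print(key, [row[f] for f in fields])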

This way only works if every column follows this pattern, but should otherwise be pretty robust.
import pandas as pd

data = {'0_id': '1', '0_name': 'a', '0_address': 'USA', '1_id': '2', '1_name': 'b', '1_address': 'UK', '1_hobby': 'ski'}
df = pd.DataFrame(data, index=[0])

# The distinct row indexes encoded in the column names ('0', '1', ...).
indexes = set(x.split('_')[0] for x in df.columns)
to_concat = []
for i in indexes:
    # Match on the full "<index>_" prefix so '1' does not also match '10_', '11_', ...
    target_columns = [col for col in df.columns if col.startswith(i + '_')]
    df_slice = df[target_columns]
    df_slice.columns = [x.split('_')[1] for x in df_slice.columns]
    to_concat.append(df_slice)
new_df = pd.concat(to_concat)
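Note that the concatenated frame keeps the original index 0 on every row; if you would rather have a clean 0..n-1 index, pd.concat can renumber it for you:
new_df = pd.concat(to_concat, ignore_index=True)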

Related

Python apply function to each row of DataFrame

I have a DataFrame with two columns: Type and Name. The values in each cell are lists of equal length, i.e. we have pairs (Type, Name). I want to:
Group Name by its Type
Create a column for each Type containing the corresponding Name values
My current code is a for loop:
for idx, row in df.iterrows():
    for t in list(set(row["Type"])):
        df.at[idx, t] = [row["Name"][i] for i in range(len(row["Name"])) if row["Type"][i] == t]
but it works very slowly. How can I speed up this code?
EDIT Here is a code example which illustrates what I want to obtain, but I want it done in a faster way:
import pandas as pd

df = pd.DataFrame({"Type": [["1", "1", "2", "3"], ["2", "3"]], "Name": [["A", "B", "C", "D"], ["E", "F"]]})
# All distinct types across every row
unique = list(set(t for types in df["Type"] for t in types))
for t in unique:
    df[t] = None
    df[t] = df[t].astype('object')
for idx, row in df.iterrows():
    for t in unique:
        df.at[idx, t] = [row["Name"][i] for i in range(len(row["Name"])) if row["Type"][i] == t]
You could write a function my_function(param) and then do something like this:
df['type'] = df['name'].apply(lambda x: my_function(x))
There are likely better alternatives to using lambda functions, but lambdas are what I remember. If you post a simplified mock of your original data and what the desired output should look like, it may help you find the best answer to your question. I'm not certain I understand what you're trying to do. A literal group-by should be done with the DataFrame's groupby method.
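As a minimal sketch (my_function and its behaviour are hypothetical, only to show the pattern), the lambda can also be dropped and the function passed directly:
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob']})

def my_function(value):
    # Hypothetical transformation: here it just upper-cases the name.
    return value.upper()

# Equivalent to df['name'].apply(lambda x: my_function(x))
df['type'] = df['name'].apply(my_function)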
If I understand correctly your dataframe looks something like this:
df = pd.DataFrame({'Name':['a,b,c','d,e,f,g'], 'Type':['3,3,2','1,2,2,1']})
Name Type
0 a,b,c 3,3,2
1 d,e,f,g 1,2,2,1
where the elements are comma-separated strings.
Start with running:
df['Name:Type'] = (df['Name']+":"+df['Type']).map(process)
using:
def process(x):
    # x looks like "a,b,c:3,3,2" -> pair names with types -> "a:3,b:3,c:2"
    x_, y_ = x.split(':')
    x_ = x_.split(','); y_ = y_.split(',')
    s = zip(x_, y_)
    str_ = ','.join(':'.join(y) for y in s)
    return str_
Then you will get:
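      Name     Type        Name:Type
0    a,b,c    3,3,2      a:3,b:3,c:2
1  d,e,f,g  1,2,2,1  d:1,e:2,f:2,g:1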
This reduces the problem to a single column.
Finally produce the dataframe required by:
l = ','.join(df['Name:Type'].to_list()).split(',')
pd.DataFrame([i.split(':') for i in l], columns=['Name','Type'])
Giving:
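  Name Type
0    a    3
1    b    3
2    c    2
3    d    1
4    e    2
5    f    2
6    g    1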
Is it the result you want? (If not, add an example of the desired output to your question.)
res = df.explode(['Name','Type']).groupby('Type')['Name'].agg(list)
print(res)
'''
Type
1 [A, B]
2 [C, E]
3 [D, F]
Name: Name, dtype: object
'''
UPD
df1 = df.apply(lambda x: pd.Series(x['Name'],x['Type']).groupby(level=0).agg(list).T,1)
res = pd.concat([df,df1],axis=1)
print(res)
'''
           Type          Name       1    2    3
0  [1, 1, 2, 3]  [A, B, C, D]  [A, B]  [C]  [D]
1        [2, 3]        [E, F]     NaN  [E]  [F]
'''

How to recode/map shared columns for dataframes stored in a dictionary?

I want to recode the 'Flavor' field, which both data sets share.
I successfully stored the data as DataFrames in a dictionary, but the names assigned (e.g. 'df_Mike') are strings and not callable/alterable objects.
Do let me know where I'm going wrong and explain why.
name = ['Mike', 'Sue']
d = {}
for n in name:
    url = f'https://raw.githubusercontent.com/steflangehennig/info4120/main/data/{n}.csv'
    response = urllib.request.urlopen(url)
    d[n] = pd.DataFrame(pd.read_csv(response))

flavor = {1: 'Choclate', 2: 'Vanilla', 3: 'Mixed'}
for df in d:
    df.map({'Flavor': flavor}, inplace = True)
Error code:
1 flavor = {1: 'Choclate', 2: 'Vanilla', 3:'Mixed'}
3 for df in d:
----> 4 df.map({'Flavor': flavor}, inplace = True)
AttributeError: 'str' object has no attribute 'map'
You are iterating over the keys of the dictionary. If you want to iterate over the values, use dict.values(), and map the 'Flavor' Series rather than the DataFrame. For example:
for df in d.values():
    df['Flavor'] = df['Flavor'].map(flavor)
It would be better to iterate over the dict items. So, use
for name, df in d.items():
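A minimal sketch of what that loop could look like (assuming each CSV really has a numeric 'Flavor' column, as in the question):
for key, df in d.items():
    # Map the numeric codes to labels and keep the updated frame in the dict.
    df['Flavor'] = df['Flavor'].map(flavor)
    d[key] = df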
You can try
for n in name:
    url = f'https://raw.githubusercontent.com/steflangehennig/info4120/main/data/{n}.csv'
    df = pd.read_csv(url)
    d[n] = df

for df in d.values():
    df['Flavor'] = df['Flavor'].map(flavor)
or do the mapping directly in the first loop:
for n in name:
    url = f'https://raw.githubusercontent.com/steflangehennig/info4120/main/data/{n}.csv'
    df = pd.read_csv(url)
    df['Flavor'] = df['Flavor'].map(flavor)
    d[n] = df

Filtering a dataframe if the length of the words inside the series > 3

Community! I really appreciate all the support I'm receiving on my journey learning Python so far!
I got this following dataframe:
d = {'name': ['john', 'mary', 'james'], 'area':[['IT', 'Resources', 'Admin'], ['Software', 'ITS', 'Programming'], ['Teaching', 'Research', 'KS']]}
df = pd.DataFrame(data=d)
My goal is:
In other words, if the length of a word inside a list in the column 'area' is 3 characters or less, remove it.
I'm trying something like this but I'm really stuck.
What is the best way of approaching this situation?
Thanks again!!
Combine .map with a list comprehension:
df['area'] = df['area'].map(lambda x: [e for e in x if len(e)>3])
0 [Resources, Admin]
1 [Software, Programming]
2 [Teaching, Research]
Explanation:
x = ["Software", "ABC", "Programming"]
# return e for every element in x but only if length of element is larger than 3
[e for e in x if len(e)>3]
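# ['Software', 'Programming']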
You can expand all your lists, filter on str length and then put them back in lists by aggregating using list:
df = df.explode("area")
df = df[df["area"].str.len() > 3].groupby("name", as_index=False).agg(list)
# name area
# 0 james [Teaching, Research]
# 1 john [Resources, Admin]
# 2 mary [Software, Programming]
Before you build the dataframe.
One simple and efficient way is to rebuild the lists stored under the key "area" so that they only contain strings longer than 3 characters. For example:
import pandas as pd

d = {'name': ['john', 'mary', 'james'], 'area': [['IT', 'Resources', 'Admin'], ['Software', 'ITS', 'Programming'], ['Teaching', 'Research', 'KS']]}
# Retrieving the areas from d.
area_list = d['area']
# Keeping, in each inner list, only the values whose length is larger than 3.
filtered_area_list = [[a for a in areas if len(a) > 3] for areas in area_list]
# Replacing the old lists in the dictionary with the new ones.
d['area'] = filtered_area_list
# Creating the dataframe.
df = pd.DataFrame(data=d)
After you build the dataframe.
If your data is in a dataframe, then you can use the "map" function:
df['area'] = df['area'].map(lambda a: [e for e in a if len(e) > 3])

How to get row and column number in dataframe in Pandas?

How can I get the number of the row and the column in a dataframe that contains a certain value using Pandas? For example, I have the following dataframe:
For example, I need to know the row and column of "Smith" (row 1, column LastName).
Maybe this is a solution or a first step to a solution.
If you filter for the value you are looking for, all items which are not that value are replaced with NaN. Now you can drop all rows and columns where every value is NaN. This leaves a DataFrame containing only your item and its indices. Then you can ask for the index and the column name.
import numpy as np
import pandas as pd
df = pd.DataFrame({'LastName':['a', 'Smith', 'b'], 'other':[1,2,3]})
value = df[df=='Smith'].dropna(axis=0, how='all').dropna(axis=1, how='all')
print(value.index.values)
print(value.columns.values)
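For this example the two print calls output:
[1]
['LastName']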
But I think this can be improved.
Here's a one liner that efficiently gets the row and column of a value:
df = pd.DataFrame({"ClientID": [34, 67, 53], "LastName": ["Johnson", "Smith", "Brows"] })
result = next(x[1:] for x in ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False)) for j, v in zip(df.columns, row_tup)) if x[0] == "Smith")
print(result)
Output
(1, "LastName")
Unpacking that one liner
# This is a generator that unpacks the dataframe and gets the value, row number (i) and column name (j) for every value in the dataframe
item_generator = ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False)) for j, v in zip(df.columns, row_tup))
# This iterates through the generator until it finds a match
# It outputs just the row and column number by leaving off the first item in the tuple
next(x[1:] for x in item_generator if x[0] == "Smith")
Props to this answer for the second half of the solution.
I tried to simplify the code and make it more readable. This is my attempt:
df = pd.DataFrame({'points': [25, 12, 15, 14, 19],
                   'assists': [5, 7, 7, 9, 12],
                   'rebounds': [11, 8, 10, 6, 6]})
index = df.index      # Allows getting the row index
columns = df.columns  # Allows getting the column name
value_to_be_checked = 6
for i in index[df.isin([value_to_be_checked]).any(axis=1)].to_list():
    for j, e in enumerate(df.iloc[i]):
        if e == value_to_be_checked:
            print("(row {}, column {})".format(i, columns[j]))
Just to add another possible solution to the bucket. If you really need to search your whole DataFrame, you may consider using numpy.where, such as:
import numpy as np
value = 'Smith'
rows, cols = np.where(df.values == value)
where_are_you = [(df.index[row], df.columns[col]) for row, col in zip(rows, cols)]
So, if your DataFrame is like
   ClientID First Name LastName
0        34         Mr    Smith
1        67      Keanu   Reeves
2        53     Master     Yoda
3        99      Smith    Smith
4       100      Harry   Potter
The code output will be:
[(0, 'LastName'), (3, 'First Name'), (3, 'LastName')]
Edit: Just to satisfy everybody's curiosity, here is a benchmark of all the answers.
The code is written below. I removed the print statements to be fair, because they would make codes really slow for bigger dataframes.
val = 0

def setup(n=10):
    return pd.DataFrame(np.random.randint(-100, 100, (n, 3)))

def nested_for(df):
    index = df.index      # Allows getting the row index
    columns = df.columns  # Allows getting the column name
    value_to_be_checked = val
    for i in index[df.isin([value_to_be_checked]).any(axis=1)].to_list():
        for j, e in enumerate(df.iloc[i]):
            if e == value_to_be_checked:
                _ = "(row {}, column {})".format(i, columns[j])

def df_twin_dropna(df):
    value = df[df == val].dropna(axis=0, how='all').dropna(axis=1, how='all')
    return value.index.values, value.columns.values

def numpy_where(df):
    rows, cols = np.where(df.values == val)
    return [(df.index[row], df.columns[col]) for row, col in zip(rows, cols)]

def one_line_generator(df):
    return [x[1:] for x in ((v, i, j) for i, row_tup in enumerate(df.itertuples(index=False))
            for j, v in zip(df.columns, row_tup)) if x[0] == val]
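A minimal timing harness for these functions might look like this (the use of timeit below is my own sketch, not part of the original benchmark):
import timeit

df_bench = setup(n=1000)
for fn in (nested_for, df_twin_dropna, numpy_where, one_line_generator):
    # Time 10 repeated searches over the same random frame.
    elapsed = timeit.timeit(lambda: fn(df_bench), number=10)
    print(f"{fn.__name__}: {elapsed:.4f}s for 10 runs")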
You can do this by looping through all the columns and finding the matching rows. This will give you a list of all the cells that match your criteria.
Method 1 (without comprehension):
import pandas as pd

# assume this df and that we are looking for 'Smith'
df = pd.DataFrame({
    'clientid': [34, 67, 53],
    'lastname': ['Johnson', 'Smith', 'Brows']
})
Searchval = 'Smith'
l1 = []
# loop through all the columns
for col in df.columns:
    # finding the matching rows
    for i in range(len(df[col][df[col].eq(Searchval)].index)):
        # appending the output to the list
        l1.append((df[col][df[col].eq(Searchval)].index[i], col))
print(l1)
Method 2 (with comprehension):
import pandas as pd

df = pd.DataFrame({
    'clientid': [34, 67, 53],
    'lastname': ['Johnson', 'Smith', 'Brows']
})
# Value to search
Searchval = 'Smith'
# Using a list comprehension to find the rows in each column which match the criteria,
# saving them in a list in case we get multiple matches
l = [(df[col][df[col].eq(Searchval)].index[i], col) for col in df.columns
     for i in range(len(df[col][df[col].eq(Searchval)].index))]
print(l)
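Both methods print:
[(1, 'lastname')]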
Thanks for submitting your request. This is something you can find with a Google search. Please make some attempt to find answers before asking a new question.
You can find simple and excellent dataframe examples that include column and row selection here: https://studymachinelearning.com/python-pandas-dataframe/
You can also see the official documentation here: https://pandas.pydata.org/pandas-docs/stable/
Select a column by column name:
df['col']
select a row by index:
df.loc['b']

Pandas use cell value as dict key to return dict value

My question relates to using the values in a dataframe column as keys in order to return their respective dictionary values and run a conditional.
I have a dataframe, df, containing a column "count" that has integers from 1 to 8 and a column "category" that has values of either "A", "B", or "C".
I have a dictionary, dct, containing the pairs A:2, B:4, C:6.
This is my (incorrect) code:
result = df[df["count"] >= dct.get(df["category"])]
So I want to return a dataframe where the "count" value for a given row is equal to or more than the value retrieved from the dictionary using the "category" letter in the same row.
So if there were count values of (1, 2, 6, 6) and category values of (A, B, C, A), the third and fourth rows would be returned in the resultant dataframe.
How do I modify the above code to achieve this?
A good way to go is to add your dictionary into the existing dataframe and then apply a query on the new dataframe:
import pandas as pd

df = pd.DataFrame(data={'count': [4, 5, 6], 'category': ['A', 'B', 'C']})
dct = {'A': 5, 'B': 4, 'C': -1}
df['min_count'] = df['category'].map(dct)
df = df.query('count >= min_count')
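With the example data above this keeps rows 1 and 2:
   count category  min_count
1      5        B          4
2      6        C         -1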
Following your logic:
import pandas as pd

dct = {'A': 2, 'B': 4, 'C': 6}
df = pd.DataFrame({'count': [1, 2, 5, 6],
                   'category': ['A', 'B', 'C', 'A']})
print('original dataframe')
print(df)

def process_row(x):
    # True when the row's count reaches the threshold for its category
    return x['count'] >= dct[x['category']]

f = df.apply(lambda row: process_row(row), axis=1)
df = df[f]
print('final output')
print(df)
output:
original dataframe
   count category
0      1        A
1      2        B
2      5        C
3      6        A
final output
   count category
3      6        A
A small modification to your code:
result = df[df['count'] >= df['category'].apply(lambda x: dct[x])]
You cannot directly use dct.get(df['category']) because df['category'] is a whole Series, not a single key: a dictionary lookup needs one hashable key at a time, so the lookup has to be applied element-wise.
So, apply and lambda to the rescue! :)
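Equivalently, the element-wise lookup can be done with Series.map, the vectorised counterpart of the apply above:
result = df[df['count'] >= df['category'].map(dct)]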
