Skip item with more columns when creating Pandas DataFrame - python

I have a list of lists:
list = [
['Row 1','Value 1'],
['Row 2', 'Value 2'],
['Row 3', 'Value 3', 'Value 4']
]
And I have a list for the DataFrame header:
header_list = ['RowID', 'Value']
If I create the DataFrame using df = pd.DataFrame(list, columns=header_list), Python throws an error saying that Row 3 has more than 2 columns, which is inconsistent with header_list.
So how can I skip Row 3 when creating the DataFrame? And how can I do this "in place", i.e. without first building a new list by looping through the original list and appending only the items of length 2?
Thanks for the help!

First, rename the variable list to L, because list shadows the built-in Python type.
Then filter with a list comprehension:
L = [['Row 1','Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]
# omit all rows whose length != 2
df = pd.DataFrame([x for x in L if len(x) == 2], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
# keep only the last 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[-2:] for x in L], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Value 3 Value 4
Or:
# keep only the first 2 values if len != 2
df = pd.DataFrame([x if len(x) == 2 else x[:2] for x in L], columns = header_list)
print (df)
RowID Value
0 Row 1 Value 1
1 Row 2 Value 2
2 Row 3 Value 3
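Since pd.DataFrame accepts any iterable, a generator expression should also satisfy the "no new list" requirement from the question; a minimal sketch of the same filter (pandas still consumes the generator internally, but you avoid writing the intermediate list yourself):
df = pd.DataFrame((x for x in L if len(x) == 2), columns=header_list)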

Try the code below; it builds the DataFrame with all columns and then keeps only the first len(header_list) of them:
list1 = [['Row 1','Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3']]
dff = pd.DataFrame(list1)
dff = dff[[x for x in range(len(header_list))]]   # select columns by their default integer labels
dff.columns = header_list
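Note this behaves differently from skipping: with the original ragged list, pd.DataFrame creates a third column and fills None for the shorter rows, and the column selection then truncates the extra value instead of dropping Row 3. A sketch with the original data, assuming truncation is acceptable:
L = [['Row 1', 'Value 1'], ['Row 2', 'Value 2'], ['Row 3', 'Value 3', 'Value 4']]
dff = pd.DataFrame(L)                             # columns 0, 1, 2; short rows padded with None
dff = dff[[x for x in range(len(header_list))]]   # keep only the first two columns
dff.columns = header_list                         # Row 3 is kept, with 'Value 4' dropped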

Related

How to combine multiple columns of a pandas Dataframe into one column in JSON format

I have a sample dataframe as follows:
Main Key  Second  Column A  Column B  Column C  Column D  Column E
First     A       Value 1   Value 2   Value 3   Value 4   Value 5
Second    B       Value 6   Value 7   Value 8   Value 9   Value 10
Third     C       Value 11  Value 12  Value 13  Value 14  Value 15
Fourth    D       Value 16  Value 17  Value 18  Value 19  Value 20
I want to make a new column called 'Aggregated Data' that combines the values in Columns A to E as key-value pairs, in JSON format.
The expected output would look like this:
Main Key  Second  Aggregated Data
First     A       {"Column A":"Value 1","Column B":"Value 2","Column C":"Value 3","Column D":"Value 4","Column E":"Value 5"}
Second    B       {"Column A":"Value 6","Column B":"Value 7","Column C":"Value 8","Column D":"Value 9","Column E":"Value 10"}
Third     C       {"Column A":"Value 11","Column B":"Value 12","Column C":"Value 13","Column D":"Value 14","Column E":"Value 15"}
Fourth    D       {"Column A":"Value 16","Column B":"Value 17","Column C":"Value 18","Column D":"Value 19","Column E":"Value 20"}
Any idea how this can be achieved? Thanks
Via an intermediate pandas.DataFrame.to_dict call (with orient='records' to obtain a list like [{column -> value}, …, {column -> value}]):
df[['Main Key', 'Second']].assign(Aggregated_Data=df.set_index(['Main Key', 'Second']).to_dict(orient='records'))
Main Key Second Aggregated_Data
0 First A {'Column A': 'Value 1 ', 'Column B': 'Value 2 ...
1 Second B {'Column A': 'Value 6 ', 'Column B': 'Value 7 ...
2 Third C {'Column A': 'Value 11 ', 'Column B': 'Value 1...
3 Fourth D {'Column A': 'Value 16 ', 'Column B': 'Value 1...
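Note that to_dict(orient='records') yields Python dicts, not JSON strings. If 'Aggregated Data' must contain real JSON text as in the expected output, a sketch serializing each record with json.dumps:
import json

records = df.set_index(['Main Key', 'Second']).to_dict(orient='records')
out = df[['Main Key', 'Second']].assign(Aggregated_Data=[json.dumps(r) for r in records])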
Just skip the first two columns and call to_json:
out = (df[["Main Key", "Second"]]
       .assign(Aggregated_Data=df.iloc[:, 2:]
                                 .apply(lambda x: x.to_json(), axis=1)))
Alternatively, use a dict/list comprehension:
df["Aggregated_Data"] = [{k: v for k, v in zip(df.columns[2:], v)}
                         for v in df.iloc[:, 2:].to_numpy()]
Output:
print(out)
Main Key Second Aggregated_Data
0 First A {"Column A":"Value 1","Column B":"Value 2","Co...
1 Second B {"Column A":"Value 6","Column B":"Value 7","Co...
2 Third C {"Column A":"Value 11","Column B":"Value 12","...
3 Fourth D {"Column A":"Value 16","Column B":"Value 17","...

Nested Dictionary using Pandas DataFrame

I have some data with duplicates that looks like this:
WEBPAGE    ID    VALUE
Webpage 1  ID 1  Value 1
Webpage 1  ID 1  Value 2
Webpage 1  ID 1  Value 3
Webpage 1  ID 2  Value 4
Webpage 1  ID 2  Value 5
Each webpage can have more than 1 ID associated with it and each ID can have more than one value associated with it.
I'd like to ideally have a nested dictionary with lists to handle the multiple IDs and multiple values:
{WEBPAGE: {ID 1: [value 1, value 2, value 3], ID 2: [value 4, value 5]}}
I've tried using to_dict and groupby but I can't seem to find the right syntax to create a nested dictionary within those.
Try:
out = {}
for _, x in df.iterrows():
    out.setdefault(x["WEBPAGE"], {}).setdefault(x["ID"], []).append(x["VALUE"])
print(out)
Prints:
{
"Webpage 1": {
"ID 1": ["Value 1", "Value 2", "Value 3"],
"ID 2": ["Value 4", "Value 5"],
}
}
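An equivalent sketch using collections.defaultdict instead of the chained setdefault calls:
from collections import defaultdict

out = defaultdict(lambda: defaultdict(list))
for _, x in df.iterrows():
    out[x["WEBPAGE"]][x["ID"]].append(x["VALUE"])
out = {k: dict(v) for k, v in out.items()}   # back to plain nested dicts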
For a pandas approach, you just need to use a nested groupby:
d = (df.groupby('WEBPAGE')
       .apply(lambda g: g.groupby('ID')['VALUE'].agg(list).to_dict())
       .to_dict())
output:
{'Webpage 1': {'ID 1': ['Value 1', 'Value 2', 'Value 3'],
'ID 2': ['Value 4', 'Value 5']}}
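A related sketch that avoids the nested apply by grouping on both keys at once, then nesting in plain Python:
s = df.groupby(['WEBPAGE', 'ID'])['VALUE'].agg(list)
d = {}
for (webpage, id_), values in s.items():
    d.setdefault(webpage, {})[id_] = values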
Another possible solution, using dictionary comprehension:
{x: {y: [z for z in df.VALUE[(df.WEBPAGE == x) & (df.ID == y)]]
     for y in df.ID[df.WEBPAGE == x]} for x in df.WEBPAGE}
Output:
{'Webpage 1': {'ID 1': ['Value 1', 'Value 2', 'Value 3'],
'ID 2': ['Value 4', 'Value 5']}}

Transforming my concatenated tuple into a pandas DataFrame [duplicate]

This question already has answers here:
How to unnest (explode) a column in a pandas DataFrame, into multiple rows
(16 answers)
Closed last year.
I wrote the following code (not sure if this is the best approach); just know that the data is divided into two separate lists, in the correct order: Z[0] is the steps, and Z[1] is the lists of user ids.
for i, z in enumerate(zip(steps, userids_list)):
    print(z)
This results in the following tuple values:
# SAMPLE
(('Step 1 string', [list of userid of that step]),
('Step 2 string', [list of userid of that step]),
('Step 3 string', [list of userid of that step]),
('Step n string', [list of userids of that step]))
My goal is to transform that style of data into the following pandas DataFrame.
Column 1  Column 2
Step 1    User id
Step 1    User id
Step 2    User id
Step 2    User id
Step 3    User id
Step 3    User id
Unfortunately I couldn't find a way to transform the data into what I want. Any ideas on what I could try to do?
explode is perfect for this. Load your data into a dataframe and then explode the column containing the lists:
df = pd.DataFrame({
    'Column 1': Z[0],
    'Column 2': Z[1],
})
df = df.explode('Column 2')
For example:
steps = ['Step 1', 'Step 2', 'Step 3']
user_ids = [
    ['user a', 'user b'],
    ['user a', 'user b', 'user c'],
    ['user c'],
]
df = pd.DataFrame({
    'step': steps,
    'user_id': user_ids,
})
df = df.explode('user_id').reset_index(drop=True)
print(df)
Output:
step user_id
0 Step 1 user a
1 Step 1 user b
2 Step 2 user a
3 Step 2 user b
4 Step 2 user c
5 Step 3 user c
data = (('Step 1 string', [list of userid of that step]),
        ('Step 2 string', [list of userid of that step]),
        ('Step 3 string', [list of userid of that step]),
        ('Step n string', [list of userids of that step]))
df = pd.DataFrame(data, columns=['Column 1', 'Column 2'])
This should do the job.
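Note that this leaves the user-id lists nested inside 'Column 2'; to reach the desired one-row-per-user shape you would still explode it, as in the first answer:
df = df.explode('Column 2').reset_index(drop=True)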

Convert list of strings to columns with custom rules pandas

I have lists of strings within my dataframe columns:
data = [{'column A': '3 item X; 4 item Y; item E of size 7', 'column B': 'item I of size 10; item X has 5 specificities; characteristic W'},
{'column A': '13 item X; item F of size 0; 9 item Y', 'column B': 'item J of size 11; item Y has 8 specificities'}]
df = pd.DataFrame(data)
I want to extract numerical information from strings that contains integers, for each row.
For instance, I need to create a new column named Size item E that takes the value 7 for the first row of df in column A, since the list contains item E of size 7.
If a value in the list of strings does not contain a number, I just want to encode it as 1 or 0 according to whether it is present in the original list.
Here is a summary of my desired output:
This is what I have coded so far, applying only 1 rule:
import pandas
import re

def hasNumbers(inputString):
    return any(char.isdigit() for char in inputString)

def transform(df):
    columns = ['column A', 'column B']
    for col in columns:
        temp = df[col].apply(lambda x: str(x).split(';'))
        tokens = set([l for j in temp for l in j])
        for token in tokens:
            try:
                integer = int(re.search(r'\d+', token).group())
            except:
                pass
            if token[0].isdigit():
                df['Nb ' + token.replace('{} '.format(integer), '')] = integer
            # if ...:
            #     ...other rules
            elif hasNumbers(token) == False:
                df[token] = df[col].apply(lambda x: 1 if token in str(x) else 0)
        df = df.drop(col, axis=1)
    return df

df3 = transform(df)
This returns the following dataframe:
As you can see, I cannot apply my feature extraction row by row; it updates the whole pandas Series. Is there any way to update the new column values for each row, step by step?
Don't go for complex functions; pandas has great string manipulation functions.
Check this code to get the desired output:
data = [{'column A': '3 item X; 4 item Y; item E of size 7', 'column B': 'item I of size 10; item X has 5 specificities; characteristic W'},
{'column A': '13 item X; item F of size 0; 9 item Y', 'column B': 'item J of size 11; item Y has 8 specificities'}]
df = pd.DataFrame(data)
#joining 2 columns with ';'
df['All Columns joined'] = df[['column A','column B']].apply(lambda x: ';'.join(x), axis=1)
#creating empty dataframe
df_new = pd.DataFrame([])
#Desired output logic using string extract function
df_new['Nb item X'] = df['All Columns joined'].str.extract(r'([0-9]+) item X',expand = False)
df_new['Nb item Y'] = df['All Columns joined'].str.extract(r'([0-9]+) item Y',expand = False)
df_new['Nb specificities item X'] = df['All Columns joined'].str.extract(r'item X has ([0-9]+) specificities',expand = False)
df_new['Nb specificities item Y'] = df['All Columns joined'].str.extract(r'item Y has ([0-9]+) specificities',expand = False)
df_new['Size item E'] = df['All Columns joined'].str.extract(r'item E of size ([0-9]+)',expand = False)
df_new['Size item F'] = df['All Columns joined'].str.extract(r'item F of size ([0-9]+)',expand = False)
df_new['Size item I'] = df['All Columns joined'].str.extract(r'item I of size ([0-9]+)',expand = False)
df_new['Size item J'] = df['All Columns joined'].str.extract(r'item J of size ([0-9]+)',expand = False)
df_new['characteristic W'] = df['All Columns joined'].str.extract(r'(characteristic W)',expand = False).notnull().astype(int)
df_new
   Nb item X Nb item Y Nb specificities item X Nb specificities item Y Size item E Size item F Size item I Size item J  characteristic W
0          3         4                       5                     NaN           7         NaN          10         NaN                 1
1         13         9                     NaN                       8         NaN           0         NaN          11                 0
Output of the df_new dataframe.
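The hard-coded extract calls can also be generalized with a loop; a sketch under the assumption that the three phrase patterns and the item letters seen in the sample data are exhaustive:
patterns = {
    'Nb item {0}': r'([0-9]+) item {0}',
    'Nb specificities item {0}': r'item {0} has ([0-9]+) specificities',
    'Size item {0}': r'item {0} of size ([0-9]+)',
}
for item in ['X', 'Y', 'E', 'F', 'I', 'J']:
    for name_tpl, pat_tpl in patterns.items():
        extracted = df['All Columns joined'].str.extract(pat_tpl.format(item), expand=False)
        if extracted.notna().any():          # only keep columns that matched at least once
            df_new[name_tpl.format(item)] = extracted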

Filtering Pandas dataframe on two criteria where one column is a list

I have a Pandas Dataframe with columns Project Type and Parts. I would like to know how many part As are used in projects of Project Type 1. I am trying to use .count(), but it doesn't return just a single number.
import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
print (parts_df[(parts_df['Project Type'] == 'Type 1') & ('A' in parts_df['Parts'])]).count()
Output:
Project Type 0
Parts 0
dtype: int64
Desired Output:
1
You can try something like this:
sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
Sample:
In[32]: parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['A']], ['Type 1', ['C']]], columns=['Project Type', 'Parts'])
In[33]: sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
Out[33]: 1
IIUC you want the following:
In [13]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A'))
Out[13]:
0 1
Name: Parts, dtype: int64
If you want the scalar value rather than a series then you can call .values attribute and index into the np array:
In [15]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A')).values[0]
Out[15]:
1
You could just add a column that counts the 'A' parts:
In [17]:
parts_df['A count'] = parts_df['Parts'].apply(lambda x: x.count('A'))
parts_df
Out[17]:
Project Type Parts A count
0 Type 1 [A, B] 1
1 Type 2 [B] 0
you can then filter:
In [18]:
parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['A count'] > 0)]
Out[18]:
Project Type Parts A count
0 Type 1 [A, B] 1
Change the 'A' in parts_df['Parts'] test to a lambda:
import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
res = (parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['Parts'].apply(lambda x: 'A' in x))]).count()
res.max()
Result:
1
You can spend a second to re-format the columns, and make life a little easier:
parts_df.Parts = parts_df.Parts.map(lambda x: ' '.join(x))
# Project type Parts
#0 Type 1 A B
#1 Type 2 B
Now you can use the Series.str.get_dummies method:
dummies = parts_df.Parts.str.get_dummies( sep=' ')
# A B
#0 1 1
#1 0 1
which shows the presence or absence of each "Part" using either a 1 or 0 respectively. Use this dummies frame to create a dataframe that can easily be manipulated using all of the standard pandas methods (pandas doesn't like lists in columns):
new_parts_df = pandas.concat( (parts_df['Project Type'], dummies), axis=1)
# Project type A B
#0 Type 1 1 1
#1 Type 2 0 1
You can now easily count groups in several ways. The most efficient thing to do would be to use pandas.DataFrame.query, but the unfortunate white space in your column name "Project Type" makes this difficult. I would avoid white spaces in column names whenever possible. Try this:
new_parts_df.rename( columns={'Project Type': 'Project_Type'}, inplace=True)
print(len(new_parts_df.query( 'Project_Type=="Type 1" and A==1')))
# 1
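With the reshaped frame, per-type part counts also fall out of a plain groupby; a minimal sketch:
print(new_parts_df.groupby('Project_Type')['A'].sum())
# Project_Type
# Type 1    1
# Type 2    0
# Name: A, dtype: int64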
