I'm trying to use data from two dataframes to create a new dataframe.
import pandas as pd

lookup_data = [
    {'item': 'apple',  'attribute_1': 3,   'attribute_2': 2,  'attribute_3': 10, 'attribute_4': 0},
    {'item': 'orange', 'attribute_1': 0.4, 'attribute_2': 20, 'attribute_3': 1,  'attribute_4': 9},
    {'item': 'pear',   'attribute_1': 0,   'attribute_2': 0,  'attribute_3': 30, 'attribute_4': 0},
    {'item': 'peach',  'attribute_1': 2,   'attribute_2': 2,  'attribute_3': 3,  'attribute_4': 6},
]
df_lookup_data = pd.DataFrame(lookup_data,dtype=float)
df_lookup_data.set_index('item', inplace=True, drop=True)
collected_data = [
    {'item': 'apple',  'qnt': 4},
    {'item': 'orange', 'qnt': 2},
    {'item': 'pear',   'qnt': 7},
]
df_collected_data = pd.DataFrame(collected_data,dtype=float)
df_collected_data.set_index('item', inplace=True, drop=True)
df_result = pd.DataFrame(
    .... first column is item type
    .... second column is qnt*attribute_1
    .... third column is qnt*attribute_2
    .... fourth column is qnt*attribute_3
    .... fifth column is qnt*attribute_4
)
df_result.columns = ['item', 'attribute_1', 'attribute_2', 'attribute_3', 'attribute_4']
print(df_result)
The result should print:
item attribute_1 attribute_2 attribute_3 attribute_4
0 apple 12 8 40 0
1 orange 0.8 40 2 18
2 pear 0 0 210 0
but I'm really not sure how to get the data from these two dataframes and build this new one.
No need to merge or concat here. Since the indexes match, simply mul across axis=0:
>>> df_lookup_data.mul(df_collected_data.qnt, axis=0)
attribute_1 attribute_2 attribute_3 attribute_4
item
apple 12.0 8.0 40.0 0.0
orange 0.8 40.0 2.0 18.0
peach NaN NaN NaN NaN
pear 0.0 0.0 210.0 0.0
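Note that peach picks up NaN because it has no collected quantity. If you want the result without that row and with item back as a column, one option (a small sketch, staying with the index-aligned approach) is:
df_result = (
    df_lookup_data
    .reindex(df_collected_data.index)       # drop items with no collected qnt (peach)
    .mul(df_collected_data['qnt'], axis=0)  # row-wise multiply by quantity
    .reset_index()                          # turn 'item' back into a column
)
print(df_result)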
Or use:
df_lookup_data = pd.DataFrame(lookup_data,dtype=float)
items = [i['item'] for i in collected_data]
qnts = [i['qnt'] for i in collected_data]
print(df_lookup_data[df_lookup_data['item'].isin(items)].set_index('item').mul(qnts, axis=0))
Output:
attribute_1 attribute_2 attribute_3 attribute_4
item
apple 12.0 8.0 40.0 0.0
orange 0.8 40.0 2.0 18.0
pear 0.0 0.0 210.0 0.0
def test_lprun():
data = {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'],
'Age':[20, 21, 19, 18],
'Height' : [6.1, 5.9, 6.0, 6.1]
}
df = pd.DataFrame(data)
df=df.assign(A=123,
B=lambda x:x.Age+x.Height,
C=lambda x:x.Name.str.upper(),
D=lambda x:x.Name.str.lower()
)
return df
In [8]: %lprun -f test_lprun test_lprun()
Timer unit: 1e-07 s
Total time: 0.0044901 s
File: <ipython-input-7-eaf21639fb5f>
Function: test_lprun at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def test_lprun():
2 1 21.0 21.0 0.0 data = {'Name':['Tom', 'Brad', 'Kyle', 'Jerry'],
3 1 13.0 13.0 0.0 'Age':[20, 21, 19, 18],
4 1 15.0 15.0 0.0 'Height' : [6.1, 5.9, 6.0, 6.1]
5 }
6 1 8651.0 8651.0 19.3 df = pd.DataFrame(data)
7 1 19.0 19.0 0.0 df=df.assign(A=123,
8 1 11.0 11.0 0.0 B=lambda x:x.Age+x.Height,
9 1 10.0 10.0 0.0 C=lambda x:x.Name.str.upper(),
10 1 36147.0 36147.0 80.5 D=lambda x:x.Name.str.lower()
11 )
12 1 14.0 14.0 0.0 return df
When using pandas assign, line_profiler cannot tell which assigned column takes the most time; it only reports the whole assign call as a single result.
Goal: have line_profiler report a per-line result inside the pandas assign call, e.g. Line 6 %Time is 10, Line 7 %Time is 30, and so on.
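One workaround (a sketch, not from the original post): line_profiler attributes time per line only inside functions registered with -f, and the whole assign call is a single statement, so pull each lambda out into a named function and register those too:
import pandas as pd

def add_b(x):
    return x.Age + x.Height

def add_c(x):
    return x.Name.str.upper()

def add_d(x):
    return x.Name.str.lower()

def test_lprun():
    data = {'Name': ['Tom', 'Brad', 'Kyle', 'Jerry'],
            'Age': [20, 21, 19, 18],
            'Height': [6.1, 5.9, 6.0, 6.1]}
    df = pd.DataFrame(data)
    # assign accepts any callable, so the named functions drop in for the lambdas
    return df.assign(A=123, B=add_b, C=add_c, D=add_d)
Then register every function of interest: %lprun -f test_lprun -f add_b -f add_c -f add_d test_lprun() gives each helper its own per-line report.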
I have a dataframe:
import pandas as pd
df = pd.DataFrame(
{
"Qty": [1,2,2,4,5,4,3],
"Date": ['2020-12-16', '2020-12-17', '2020-12-18', '2020-12-19', '2020-12-20', '2020-12-21', '2020-12-22'],
"Item": ['22-A', 'R-22-A', '33-CDE', 'R-33-CDE', '55-A', '22-AB', '55-AB'],
"Price": [1.1, 2.2, 2.2, 4.4, 5.5, 4.4, 3.3]
})
I'm trying to duplicate each row where the Item suffix has 2 or more characters, and then change the value of the Item. For example, the row containing '22-AB' will become two rows: in the first row the Item will be '22-A', and in the second it will be '22-B'.
All this should be done only if the item number (without suffix) is in a 'clean' list.
Here is the pseudocode for what I'm trying to achieve:
Clean list of items = ['11', '22', '33']
For each row, check if substring of df["Item"] is in clean list.
if no:
skip row and leave it as it is
if yes:
check if len(suffix) >= 2
if no:
skip row and leave it as it is
if yes:
separate the item (11, 22, or 33) and the suffix
for char in suffix:
newitem = concat item + char
duplicate the row, replacing the old item with newitem
if number started with R-, prepend the R- again
The desired output:
df2 = pd.DataFrame(
{
"Qty": [1,2,2,2,2,4,4,4,5,4,4,3,3],
"Date": ['2020-12-16', '2020-12-17', '2020-12-18', '2020-12-18', '2020-12-18', '2020-12-19', '2020-12-19', '2020-12-19', '2020-12-20', '2020-12-21', '2020-12-21', '2020-12-22', '2020-12-22'],
"Item": ['22-A', 'R-22-A', '33-C', '33-D', '33-E', 'R-33-C', 'R-33-D', 'R-33-E', '55-A', '22-A', '22-B', '55-A', '55-B'],
"Price": [1.1, 2.2, 2.2, 2.2, 2.2, 4.4, 4.4, 4.4, 5.5, 4.4, 4.4, 3.3, 3.3]
})
What I have come up with so far:
import re

mains = ['11', '22', '33']
iptrn = re.compile(r'\d{2}')          # the item number
optrn = re.compile(r'(?<=[0-9]-).*')  # the suffix after the number
for i in df["Item"]:
    item = iptrn.search(i).group(0)
    option = optrn.search(i).group(0)
    if item in mains:
        for o in option:
            combo = item + "-" + o
            print(combo)
I can't figure out the last step of actually duplicating the row. I've tried this: df = df.loc[df.index.repeat(1)].assign(Item=combo, num=len(option)-1).reset_index(drop=True), but it doesn't replace the Item correctly
You can use pandas operations to do the work here.
It seems like the first step is to separate the two parts of the item code with pandas string methods (here, use extract with expand=True):
>>> item_code = df['Item'].str.extract(r'(?P<ic1>R?-?\d+)-+(?P<ic2>\w+)', expand=True)
>>> item_code
ic1 ic2
0 22 A
1 R-22 A
2 33 CDE
3 R-33 CDE
4 55 A
5 22 AB
6 55 AB
You can add these columns directly to df - I just included that snippet above to show you the output from the extract operation.
>>> df = df.join(df['Item'].str.extract(r'(?P<ic1>R?-?\d+)-+(?P<ic2>\w+)', expand=True))
>>> df
Qty Date Item Price ic1 ic2
0 1 2020-12-16 22-A 1.1 22 A
1 2 2020-12-17 R-22-A 2.2 R-22 A
2 2 2020-12-18 33-CDE 2.2 33 CDE
3 4 2020-12-19 R-33-CDE 4.4 R-33 CDE
4 5 2020-12-20 55-A 5.5 55 A
5 4 2020-12-21 22-AB 4.4 22 AB
6 3 2020-12-22 55-AB 3.3 55 AB
Next, I would build up a python data structure and convert it to a dataframe at the end rather than trying to insert rows or change existing rows.
data = []
for row in df.itertuples(index=False):
for character in row.ic2:
data.append({
'Date': row.Date,
'Qty': row.Qty,
'Price': row.Price,
'Item': f'{row.ic1}-{character}'
})
newdf = pd.DataFrame(data)
The new dataframe looks like this
>>> newdf
Date Qty Price Item
0 2020-12-16 1 1.1 22-A
1 2020-12-17 2 2.2 R-22-A
2 2020-12-18 2 2.2 33-C
3 2020-12-18 2 2.2 33-D
4 2020-12-18 2 2.2 33-E
5 2020-12-19 4 4.4 R-33-C
6 2020-12-19 4 4.4 R-33-D
7 2020-12-19 4 4.4 R-33-E
8 2020-12-20 5 5.5 55-A
9 2020-12-21 4 4.4 22-A
10 2020-12-21 4 4.4 22-B
11 2020-12-22 3 3.3 55-A
12 2020-12-22 3 3.3 55-B
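If you want the question's original column order back, reindex the columns at the end (assuming the four original column names):
newdf = newdf[['Qty', 'Date', 'Item', 'Price']]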
Let's say that I have this dataframe with three columns: "Name", "Account" and "Ccy".
import pandas as pd
Name = ['Dan', 'Mike', 'Dan', 'Dan', 'Sara', 'Charles', 'Mike', 'Karl']
Account = ['100', '30', '50', '200', '90', '20', '65', '230']
Ccy = ['EUR','EUR','USD','USD','','CHF', '','DKN']
df = pd.DataFrame({'Name':Name, 'Account' : Account, 'Ccy' : Ccy})
Name Account Ccy
0 Dan 100 EUR
1 Mike 30 EUR
2 Dan 50 USD
3 Dan 200 USD
4 Sara 90
5 Charles 20 CHF
6 Mike 65
7 Karl 230 DKN
I would like to represent this data differently. I would like to write a script that finds all the duplicates in the Name column and regroups them with their different accounts, and, if there is a currency "Ccy", adds a new column next to it with the associated currencies.
So something like that:
Dan Ccy1 Mike Ccy2 Sara Charles Ccy3 Karl Ccy4
0 100 EUR 30 EUR 90 20 CHF 230 DKN
1 50 USD 65
2 200 USD
I don't really know how to start! So I simplified the problem to do it step by step. I tried to regroup the duplicates by name with a list, however it did not identify the duplicates.
x_len, y_len = df.shape
new_data = []
for i in range(x_len) :
if df.iloc[i,0] not in new_data :
print(str(df.iloc[i,0]) + '\t'+ str(df.iloc[i,1])+ '\t' + str(bool(df.iloc[i,0] not in new_data)))
new_data.append([df.iloc[i,0],df.iloc[i,1]])
else:
new_data[str(df.iloc[i,0])].append(df.iloc[i,1])
Then I thought that it was easier to use a dictionary. So I tried this loop, but there is an error, and maybe it is not the best way to get to the expected final result.
from collections import defaultdict
dico=defaultdict(list)
x_len, y_len = df.shape
for i in range(x_len) :
if df.iloc[i,0] not in dico :
print(str(df.iloc[i,0]) + '\t'+ str(df.iloc[i,1])+ '\t' + str(bool(df.iloc[i,0] not in dico)))
dico[str(df.iloc[i,0])] = df.iloc[i,1]
print(dico)
else :
dico[df.iloc[i,0]].append(df.iloc[i,1])
Does anyone have an idea how to start, or how to write the code if it is simple?
Thank you
Use GroupBy.cumcount for a counter, reshape by DataFrame.set_index and DataFrame.unstack, and last flatten the column names. (The counter g numbers each occurrence of a name, so Dan gets 0, 1, 2 and Mike gets 0, 1; after unstack it becomes the new row index.)
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
Account_Charles Ccy_Charles Account_Dan Ccy_Dan Account_Karl Ccy_Karl \
0 20 CHF 100 EUR 230 DKN
1 NaN NaN 50 USD NaN NaN
2 NaN NaN 200 USD NaN NaN
Account_Mike Ccy_Mike Account_Sara Ccy_Sara
0 30 EUR 90
1 65 NaN NaN
2 NaN NaN NaN NaN
If you need custom column names, use if-else in a list comprehension:
g = df.groupby(['Name']).cumcount()
df = df.set_index([g,'Name']).unstack().sort_index(level=1, axis=1)
L = [b if a == 'Account' else f'{a}{i // 2}' for i, (a, b) in enumerate(df.columns)]
df.columns = L
print (df)
Charles Ccy0 Dan Ccy1 Karl Ccy2 Mike Ccy3 Sara Ccy4
0 20 CHF 100 EUR 230 DKN 30 EUR 90
1 NaN NaN 50 USD NaN NaN 65 NaN NaN
2 NaN NaN 200 USD NaN NaN NaN NaN NaN NaN
I have a df with a column, Critic_Score, that has NaN values. I am trying to replace them with the average of the Critic Scores from the same platform. This question has been asked on Stack Overflow several times, and I used 4 suggestions that did not give me the desired output. Please tell me how to fix this.
This is a subset of the df:
x[['Platform','Critic_Score']].head()
Platform Critic_Score
0 wii 76.0
1 nes NaN
2 wii 82.0
3 wii 80.0
4 gb NaN
More information on the original df:
x.head().to_dict('list')
{'Name': ['wii sports',
'super mario bros.',
'mario kart wii',
'wii sports resort',
'pokemon red/pokemon blue'],
'Platform': ['wii', 'nes', 'wii', 'wii', 'gb'],
'Year_of_Release': [2006.0, 1985.0, 2008.0, 2009.0, 1996.0],
'Genre': ['sports', 'platform', 'racing', 'sports', 'role-playing'],
'NA_sales': [41.36, 29.08, 15.68, 15.61, 11.27],
'EU_sales': [28.96, 3.58, 12.76, 10.93, 8.89],
'JP_sales': [3.77, 6.81, 3.79, 3.28, 10.22],
'Other_sales': [8.45, 0.77, 3.29, 2.95, 1.0],
'Critic_Score': [76.0, nan, 82.0, 80.0, nan],
'User_Score': ['8', nan, '8.3', '8', nan],
'Rating': ['E', nan, 'E', 'E', nan]}
These are the statements I tried, followed by their output:
1.
x['Critic_Score'] = x['Critic_Score'].fillna(x.groupby('Platform')['Critic_Score'].transform('mean'), inplace = True)
0 None
1 None
2 None
3 None
4 None
Name: Critic_Score, dtype: object
2.
x.loc[x.Critic_Score.isnull(), 'Critic_Score'] = x.groupby('Platform').Critic_Score.transform('mean')
#no change in column
0 76.0
1 NaN
2 82.0
3 80.0
4 NaN
3.
x['Critic_Score'] = x.groupby('Platform')['Critic_Score']\
.transform(lambda y: y.fillna(y.mean()))
#no change in column
0 76.0
1 NaN
2 82.0
3 80.0
4 NaN
Name: Critic_Score, dtype: float64
4.
x['Critic_Score'] = x.groupby('Platform')['Critic_Score'].apply(lambda y: y.fillna(y.mean()))
x['Critic_Score'].head()
Out[73]:
0 76.0
1 NaN
2 82.0
3 80.0
4 NaN
Name: Critic_Score, dtype: float64
x.update(
x.groupby('Platform').Critic_Score.transform('mean'),
overwrite=False)
First you create a new Series with the same number of rows but with the platform average on every row, then use that to update the original.
(Attempt 1 failed for a different reason: fillna(..., inplace=True) returns None, so assigning its result back filled the column with None.)
Bear in mind your sample has only one row of nes and another of gb, both with a NaN score, so there is nothing to average; that is why attempts 2-4 show no change on this subset.
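For completeness, attempt 1 works as soon as the inplace flag is dropped (a sketch of the one-line fix, same grouping logic):
x['Critic_Score'] = x['Critic_Score'].fillna(
    x.groupby('Platform')['Critic_Score'].transform('mean'))  # without inplace, fillna returns the filled Series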
I want to implement machine learning with a dataset that is a bit complex. I want to work with pandas and then use some of the built-in models in scikit-learn.
The data is given in a JSON file; a sample looks like the one below:
{
"demo_Profile": {
"sex": "male",
"age": 98,
"height": 160,
"weight": 139,
"bmi": 5,
"someinfo1": [
"some_more_info1"
],
"someinfo2": [
"some_more_inf2"
],
"someinfo3": [
"some_more_info3"
]
},
"event": {
"info_personal": {
"info1": 219.59,
"info2": 129.18,
"info3": 41.15,
"info4": 94.19,
},
"symptoms": [
{
"name": "name1",
"socrates": {
"associations": [
"associations1"
],
"onsetType": "onsetType1",
"timeCourse": "timeCourse1"
}
},
{
"name": "name2",
"socrates": {
"timeCourse": "timeCourse2"
}
},
{
"name": "name3",
"socrates": {
"onsetType": "onsetType2"
}
},
{
"name": "name4",
"socrates": {
"onsetType": "onsetType3"
}
},
{
"name": "name5",
"socrates": {
"associations": [
"associations2"
]
}
}
],
"labs": [
{
"name": "name1 ",
"value": "valuelab"
}
]
}
}
I want to create a pandas data frame that handles this kind of "nested data", but I don't know how to build a data frame which takes "nested parameters" into account alongside "single parameters".
For example, I don't know how to merge "demo_Profile", which contains single parameters, with "symptoms", which is a list of dictionaries of single values in some cases and lists in others.
Does anybody know a way to deal with this issue?
EDIT*********
The JSON shown above is just one example; in other cases the number of values in the lists will be different, as will the number of symptoms. So the example shown above is not fixed for every case.
Consider pandas' json_normalize. However, because there are even deeper nests, consider processing the data in pieces separately, then concatenating the pieces together with a forward fill on the "normalized" columns.
import json
import pandas as pd
from pandas.io.json import json_normalize  # in pandas >= 1.0, available as pd.json_normalize
with open('myfile.json', 'r') as f:
data = json.loads(f.read())
final_df = pd.concat([json_normalize(data['demo_Profile']),
json_normalize(data['event']['symptoms']),
json_normalize(data['event']['info_personal']),
json_normalize(data['event']['labs'])], axis=1)
# FLATTEN NESTED LISTS
n_list = ['someinfo1', 'someinfo2', 'someinfo3', 'socrates.associations']
final_df[n_list] = final_df[n_list].apply(lambda col:
col.apply(lambda x: x if pd.isnull(x) else x[0]))
# FILLING FORWARD
norm_list = ['age', 'bmi', 'height', 'weight', 'sex', 'someinfo1', 'someinfo2', 'someinfo3',
'info1', 'info2', 'info3', 'info4', 'name', 'value']
final_df[norm_list] = final_df[norm_list].ffill()
Output
print(final_df)
# age bmi height sex someinfo1 someinfo2 someinfo3 weight name socrates.associations socrates.onsetType socrates.timeCourse info1 info2 info3 info4 name value
# 0 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name1 associations1 onsetType1 timeCourse1 219.59 129.18 41.15 94.19 name1 valuelab
# 1 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name2 NaN NaN timeCourse2 219.59 129.18 41.15 94.19 name1 valuelab
# 2 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name3 NaN onsetType2 NaN 219.59 129.18 41.15 94.19 name1 valuelab
# 3 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name4 NaN onsetType3 NaN 219.59 129.18 41.15 94.19 name1 valuelab
# 4 98.0 5.0 160.0 male some_more_info1 some_more_inf2 some_more_info3 139.0 name5 associations2 NaN NaN 219.59 129.18 41.15 94.19 name1 valuelab
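As a side note (not part of the original answer), newer pandas can also walk the symptoms list directly with record_path and repeat surrounding scalars with meta; a sketch, assuming the same file layout:
symptoms_df = pd.json_normalize(
    data['event'],                      # start inside the 'event' object
    record_path='symptoms',             # one output row per symptom
    meta=[['info_personal', 'info1']],  # repeat this nested scalar on every row
)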
A quick and easy way to flatten your JSON data is to use the flatten_json package, which can be installed via pip:
pip install flatten_json
I expect that you have a list of many entries which look like the one you have provided. Therefore the following code will give you the desired result:
import pandas as pd
from flatten_json import flatten
json_data = [{...patient1...}, {patient2...}, ...]
flattened = (flatten(entry) for entry in json_data)
df = pd.DataFrame(flattened)
In the flattened data, the list entries get suffixed with numbers (I added another patient with an additional entry in the "labs" list):
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| index demo_Profile_age demo_Profile_bmi demo_Profile_height demo_Profile_sex demo_Profile_someinfo1_0 demo_Profile_someinfo2_0 demo_Profile_someinfo3_0 demo_Profile_weight event_info_personal_info1 event_info_personal_info2 event_info_personal_info3 event_info_personal_info4 event_labs_0_name event_labs_0_value event_labs_1_name event_labs_1_value event_symptoms_0_name event_symptoms_0_socrates_associations_0 event_symptoms_0_socrates_onsetType event_symptoms_0_socrates_timeCourse event_symptoms_1_name event_symptoms_1_socrates_timeCourse event_symptoms_2_name event_symptoms_2_socrates_onsetType event_symptoms_3_name event_symptoms_3_socrates_onsetType event_symptoms_4_name event_symptoms_4_socrates_associations_0 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 0 98 5 160 male some_more_info1 some_more_inf2 some_more_info3 139 219.59 129.18 41.15 94.19 name1 valuelab NaN NaN name1 associations1 onsetType1 timeCourse1 name2 timeCourse2 name3 onsetType2 name4 onsetType3 name5 associations2 |
| 1 98 5 160 male some_more_info1 some_more_inf2 some_more_info3 139 219.59 129.18 41.15 94.19 name1 valuelab name2 valuelabr2 name1 associations1 onsetType1 timeCourse1 name2 timeCourse2 name3 onsetType2 name4 onsetType3 name5 associations2 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The flatten method contains additional parameters to remove unwanted columns or prefixes.
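For example (a sketch, assuming flatten_json's documented keyword arguments), you can change the key separator and skip a top-level key entirely:
flattened = (flatten(entry, separator='.', root_keys_to_ignore={'demo_Profile'})
             for entry in json_data)
df = pd.DataFrame(flattened)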
Note: While this method gives you a flattened DataFrame as desired, I expect that you will run into other problems when feeding the dataset into a machine learning algorithm, depending on what will be your prediction target and how you want to encode the data as features.