this is my code:
DF['CustomerId'] = DF['CustomerId'].apply(str)
print(DF.dtypes)
for index, row in merged.iterrows():
    DF.loc[DF['CustomerId'] == str(row['CustomerId']), 'CustomerId'] = row['code']
My goal is to do this:
if DF['CustomerId'] is equal to row['CustomerId'], then change the value of DF['CustomerId'] to row['code'];
else leave it as it is.
Both row['CustomerId'] and DF['CustomerId'] should be strings. I know that .loc doesn't seem to work with strings here, but how can I do this with the string type?
thanks
You can approach this without looping by merging the two dataframes on the common CustomerId column using .merge(), then updating the CustomerId column with the code column from the merged dataframe using .update(), as follows:
df_out = DF.merge(merged, on='CustomerId', how='left')
df_out['CustomerId'].update(df_out['code'])
Demo
Data Preparation:
data = {'CustomerId': ['11111', '22222', '33333', '44444'],
'CustomerInfo': ['Albert', 'Betty', 'Charles', 'Dicky']}
DF = pd.DataFrame(data)
print(DF)
CustomerId CustomerInfo
0 11111 Albert
1 22222 Betty
2 33333 Charles
3 44444 Dicky
data = {'CustomerId': ['11111', '22222', '44444'],
'code': ['A1011111', 'A1022222', 'A1044444']}
merged = pd.DataFrame(data)
print(merged)
CustomerId code
0 11111 A1011111
1 22222 A1022222
2 44444 A1044444
Run New Code
# ensure the CustomerId columns are strings, as you did
DF['CustomerId'] = DF['CustomerId'].astype(str)
merged['CustomerId'] = merged['CustomerId'].astype(str)
df_out = DF.merge(merged, on='CustomerId', how='left')
print(df_out)
CustomerId CustomerInfo code
0 11111 Albert A1011111
1 22222 Betty A1022222
2 33333 Charles NaN
3 44444 Dicky A1044444
df_out['CustomerId'].update(df_out['code'])
print(df_out)
# `CustomerId` column updated as required if there are corresponding entries in dataframe `merged`
CustomerId CustomerInfo code
0 A1011111 Albert A1011111
1 A1022222 Betty A1022222
2 33333 Charles NaN
3 A1044444 Dicky A1044444
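If the helper code column is not needed afterwards, it can be dropped; a small optional cleanup step, not part of the original answer:
# drop the merge helper column once CustomerId has been updated
df_out = df_out.drop(columns='code')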
Related
I have a JSON with some nested/array items like the one below.
I'm looking at flattening it before saving it to a CSV:
[{'SKU':'SKU1','name':'test name 1',
'ItemSalesPrices':[{'SourceNumber': 'OEM', 'AssetNumber': 'TEST1A', 'UnitPrice': 1600}, {'SourceNumber': 'RRP', 'AssetNumber': 'TEST1B', 'UnitPrice': 1500}],
},
{'SKU':'SKU2','name':'test name 2',
'ItemSalesPrices':[{'SourceNumber': 'RRP', 'AssetNumber': 'TEST2', 'UnitPrice': 1500}],
}
]
I have attempted the good solution here: flattern nested JSON and retain columns (or Panda json_normalize), but got nowhere, so I'm hoping to get some tips from the community. The desired output is:
SKU   Name         ItemSalesPrices_OEM_UnitPrice  ItemSalesPrices_OEM_AssetNumber  ItemSalesPrices_RRP_UnitPrice  ItemSalesPrices_RRP_AssetNumber
SKU1  test name 1  1600                           TEST1A                           1500                           TEST1B
SKU2  test name 2                                                                  1500                           TEST2
Thank you
Use json_normalize:
first = ['SKU', 'name']
# L is the list of records shown in the question
df = pd.json_normalize(L, 'ItemSalesPrices', first)
print (df)
SourceNumber AssetNumber UnitPrice SKU name
0 OEM TEST1A 1600 SKU1 test name 1
1 RRP TEST1B 1500 SKU1 test name 1
2 RRP TEST2 1500 SKU2 test name 2
Then you can pivot the values - if numeric, aggregate with sum; if strings, join them:
import numpy as np

f = lambda x: x.sum() if np.issubdtype(x.dtype, np.number) else ','.join(x)
df1 = (df.pivot_table(index=first,
columns='SourceNumber',
aggfunc=f))
# flatten the MultiIndex columns into single names like 'UnitPrice_OEM'
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
df1 = df1.rename_axis(None, axis=1).reset_index()
print (df1)
SKU name AssetNumber_OEM AssetNumber_RRP UnitPrice_OEM \
0 SKU1 test name 1 TEST1A TEST1B 1600.0
1 SKU2 test name 2 NaN TEST2 NaN
UnitPrice_RRP
0 1500.0
1 1500.0
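Since the stated goal is to save the flattened result to a CSV, a minimal follow-up sketch (the filename flattened_items.csv is a placeholder):
# write the flattened frame to disk without the index column
df1.to_csv('flattened_items.csv', index=False)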
Let's say, for example, we have two DataFrames, df1 and df2:
df1 = pd.DataFrame({'id': ['A01', 'A02'],
'Name': ['ABC', 'PQR']})
df2 = pd.DataFrame({'id': ['B05', 'B06'],
'Name': ['XYZ', 'TUV']})
I want to merge the two and label each dataframe, so it appears like this.
So basically, I want to concatenate two dataframes into a new dataframe and create a third column that labels each of those dataframes. As seen in the table above, there is a third column named 'class', and its values group each of the merged dataframes: the first two rows are data from df1 and are all labelled 1, grouping them as one.
I'm trying to make sure it doesn't appear like this one below;
in this case, the label appears on each line. I prefer the label to apply to the whole df as a single entity, as shown in the first table.
This is what I have tried;
df1['class'] = 1
df2['class'] = 2
df_merge = pd.concat([df1,df2])
and I got a result like this.
But this is not what I was expecting; I expect the result to look like the first table, grouping each df as one and adding the third column.
You can concat using the keys and names parameters, then reset_index:
(pd.concat([df1, df2], keys=[1, 2], names=['class', None])
.reset_index('class')
)
Output:
class id Name
0 1 A01 ABC
1 1 A02 PQR
0 2 B05 XYZ
1 2 B06 TUV
Or without reset_index to get a MultiIndex:
pd.concat([df1, df2], keys=[1, 2], names=['class', None])
id Name
class
1 0 A01 ABC
1 A02 PQR
2 0 B05 XYZ
1 B06 TUV
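A side benefit of keeping the MultiIndex: each original frame can be pulled back out by its key with .loc, e.g. (a small usage sketch):
# select all rows that came from df1 (key 1 on the 'class' level)
combined = pd.concat([df1, df2], keys=[1, 2], names=['class', None])
df1_rows = combined.loc[1]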
Hiding the "duplicated" class values:
(pd.concat([df1, df2], keys=[1, 2], names=['class', None])
.reset_index('class')
.assign(**{'class': lambda d: d['class'].mask(d['class'].duplicated(), '')})
)
Output:
class id Name
0 1 A01 ABC
1 A02 PQR
0 2 B05 XYZ
1 B06 TUV
I am trying to get a value from another row which is "next day" data for each person. Let's say I have this example dataset:
import pandas as pd
data= {'date' : [20210701, 20210703, 20210704, 20210703, 20210705, 20210705],
'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
'a' : [1,0,1,1,1,0]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
I am trying to create another column with the value of column 'a' from the next day.
So, I created a 'next_day' column with:
df['next_date'] = df['date'] + pd.Timedelta(days=1)
but I am stuck on the next step.
The final data frame should look like this:
import pandas as pd
import numpy as np
data= {'date' : [20210701, 20210703, 20210704, 20210703, 20210704, 20210705],
'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
'a' : [1,0,1,1,1,0],
'new_column' : [np.nan, 1, np.nan, 1, np.nan, np.nan ]}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
As you can see, the new column takes the value from the next day for each person, and is NaN where there is no next-day data.
You can utilize numpy.where to check for the wanted conditions and df.shift to grab next-row values:
df['new_column'] = np.where(((df['name'].shift(-1)==df['name']) &
(df['next_date']==df['date'].shift(-1))), df['a'].shift(-1), np.nan)
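A possible alternative, assuming at most one row per (name, date) pair: self-merge on the name and the shifted date, which does not rely on the rows being sorted:
# build a lookup of (name, date) -> a, renamed so it joins on next_date
lookup = df[['name', 'date', 'a']].rename(columns={'date': 'next_date', 'a': 'new_column'})
# left join: rows without a next-day match for the same person get NaN
df = df.merge(lookup, on=['name', 'next_date'], how='left')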
Another approach that seems to work, starting from the dataframe with the next_date column:
date name a next_date
0 2021-07-01 Dave 1 2021-07-02
1 2021-07-03 Dave 0 2021-07-04
2 2021-07-04 Dave 1 2021-07-05
3 2021-07-03 Sue 1 2021-07-04
4 2021-07-05 Sue 1 2021-07-06
5 2021-07-05 Ann 0 2021-07-06
df['next_date'] = df['next_date'].apply(lambda x:df.loc[df.date==x, 'a'])
date name a next_date
0 2021-07-01 Dave 1 NaN
1 2021-07-03 Dave 0 1.0
2 2021-07-04 Dave 1 NaN
3 2021-07-03 Sue 1 1.0
4 2021-07-05 Sue 1 NaN
5 2021-07-05 Ann 0 NaN
Update: Taking 'name' into account
Here is a solution: in order to account for the name as well, we can apply a function to the dataframe as a whole. As it's more complex, define it first,
def get_next_a(x):
# get the relevant rows
values = df.loc[(df['name']==x['name']) & (df.date==x.next_date), 'a']
# return the first truthy value or np.nan if no match is found
return next((v for v in values), np.nan)
and apply it afterwards:
df['new_column'] = df.apply(get_next_a, axis=1)
I would like to create a DataFrame from a DataFrame I already have in Python.
The DataFrame I have looks like below:
Nome Dept
Maria A1
Joao A2
Anna A1
Jorge A3
The DataFrame I want to create is like the below:
Dept Funcionario 1 Funcionario 2
A1 Maria Anna
A2 Joao
I tried the below code:
df_func.merge(df_dept, how='inner', on='Dept')
But I got the error: TypeError: merge() got multiple values for argument 'how'
Would anyone know how I can do this?
Thank you in Advance! :)
Even if you get that to work, you will not get the right answer; in fact, the key is going to be duplicated four times:
d = {'Name': ['maria', 'joao', 'anna', 'jorge'], 'dept': [1, 2, 1, 3]}
df = pd.DataFrame(d)
df.merge(df, how='inner', on='dept')
Out[8]:
Name_x dept Name_y
0 maria 1 maria
1 maria 1 anna
2 anna 1 maria
3 anna 1 anna
4 joao 2 joao
5 jorge 3 jorge
The best way around is to use groupby:
dd = df.groupby('dept').agg(list)
Out[10]:
Name
dept
1 [maria, anna]
2 [joao]
3 [jorge]
Then apply pd.Series to expand the lists into columns:
dd['Name'].apply(pd.Series)
Out[21]:
0 1
dept
1 maria anna
2 joao NaN
3 jorge NaN
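To match the headers in the desired output, a small follow-up sketch (the 'Funcionario' labels come from the question):
out = dd['Name'].apply(pd.Series)
# rename the positional columns 0, 1, ... to 'Funcionario 1', 'Funcionario 2', ...
out.columns = [f'Funcionario {i + 1}' for i in out.columns]
out = out.reset_index()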
This is how I have merged two data frames recently.
rpt_data = connect_to_presto() # returned data from a db
df_rpt = pd.DataFrame(rpt_data, columns=["domain", "revenue"])
""" adding sellers.json seller {} into a panads df """
sj_data = data # returned response from requests module
df_sj = pd.json_normalize(sj_data, record_path="sellers", errors="ignore")
""" merging both dataframes """
df_merged = df_rpt.merge(df_sj, how="inner", on="domain", indicator=True)
Notice how I have stored the data in a variable each time, then created a DataFrame from it? Then merged them like so:
df_merged = df_rpt.merge(df_sj, how="inner", on="domain", indicator=True)
This may not be the best approach but it works.
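A note on indicator=True: it adds a _merge column marking each row as 'left_only', 'right_only', or 'both'. With an inner join every surviving row is 'both', so it is mainly useful with a left or outer join, e.g. (a hypothetical check):
# outer-join instead, then inspect the rows that did not match on domain
df_check = df_rpt.merge(df_sj, how="outer", on="domain", indicator=True)
unmatched = df_check[df_check["_merge"] != "both"]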
I have multiple datasets in CSV format that I would like to import by appending. Each dataset has the same column names (fields), but different values and lengths.
For example:
df1
date name surname age address
...
df2
date name surname age address
...
I would like to have df = df1 + df2 with this shape:
date  name  surname  age  address  dataset
(df1 row)                          1
...                                1
(df2 row)                          2
...                                2
i.e. I would like to add a new column that is an identifier for the dataset (where the rows come from, whether from dataset 1 or dataset 2).
How can I do it?
Is this what you're looking for?
Note: The example has fewer columns than yours, but the method is the same.
import pandas as pd
df1 = pd.DataFrame({
'name': [f'Name{i}' for i in range(5)],
'age': range(10, 15)
})
df2 = pd.DataFrame({
'name': [f'Name{i}' for i in range(20, 22)],
'age': range(20, 22)
})
combined = pd.concat([df1, df2])
combined['dataset'] = [1] * len(df1) + [2] * len(df2)
print(combined)
Output
name age dataset
0 Name0 10 1
1 Name1 11 1
2 Name2 12 1
3 Name3 13 1
4 Name4 14 1
0 Name20 20 2
1 Name21 21 2
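If you prefer a clean 0..n-1 index instead of the repeated one above, pd.concat accepts ignore_index=True (a small variation on the same code):
# same concat, but renumber the rows sequentially
combined = pd.concat([df1, df2], ignore_index=True)
combined['dataset'] = [1] * len(df1) + [2] * len(df2)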
We also have the keys parameter in concat:
combined = pd.concat([df1, df2], keys=[1, 2], names=['dataset', None]).reset_index(level=0)
In Spark with Scala, I would do something like this:
import org.apache.spark.sql.functions._
val df1 = sparkSession.read
.option("inferSchema", "true")
.json("/home/shredder/Desktop/data1.json")
val df2 = sparkSession.read
.option("inferSchema", "true")
.json("/home/shredder/Desktop/data2.json")
val df1New = df1.withColumn("dataset",lit(1))
val df2New = df2.withColumn("dataset",lit(2))
val df3 = df1New.union(df2New)
df3.show()