My dataset looks similar to this (but with a couple more rows):
The aim is to get this:
What I tried to do is:
# Identify names that are in the dataset
names = df['name'].unique().tolist()
# Define dataframe with first name
df1 = pd.DataFrame()
df1 = df[(df == names[0]).any(axis=1)]
df1 = df1.drop(['name'], axis=1)
df1 = df1.rename({'color':'color_'+str(names[0]), 'number':'number_'+str(names[0])}, axis=1)
# Make dataframes with other names and their corresponding color and number, add them to df1
df_merged = pd.DataFrame()
for i in range(1, len(names)):
    df2 = pd.DataFrame()
    df2 = df[(df == names[i]).any(axis=1)]
    df2 = df2.drop(['name'], axis=1)
    df2 = df2.rename({'color':'color_'+str(names[i]), 'number':'number_'+str(names[i])}, axis=1)
    df_merged = df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')
In the end I get this result for df_merged:
As you can see, the columns color_Donald and number_Donald are missing. Does anyone know why, and how to improve the code? It seems as if the loop somehow skips or overwrites Donald.
Thanks in advance!
sample df
import pandas as pd
data = {'name': {'2020-01-01 00:00:00': 'Justin', '2020-01-02 00:00:00': 'Justin', '2020-01-03 00:00:00': 'Donald'}, 'color': {'2020-01-01 00:00:00': 'blue', '2020-01-02 00:00:00': 'red', '2020-01-03 00:00:00': 'green'}, 'number': {'2020-01-01 00:00:00': 1, '2020-01-02 00:00:00': 2, '2020-01-03 00:00:00': 9}}
df = pd.DataFrame(data)
print(f"{df}\n")
                       name  color  number
2020-01-01 00:00:00  Justin   blue       1
2020-01-02 00:00:00  Justin    red       2
2020-01-03 00:00:00  Donald  green       9
final df
df = (
    df
    .reset_index(names="date")
    .pivot(index="date", columns="name", values=["color", "number"])
    .fillna("")
)
df.columns = ["_".join(x) for x in df.columns.values]
print(df)
                    color_Donald color_Justin number_Donald number_Justin
date
2020-01-01 00:00:00                      blue                           1
2020-01-02 00:00:00                       red                           2
2020-01-03 00:00:00        green                          9
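One compatibility note: reset_index(names="date") needs pandas 1.5 or newer. On older versions you can get the same result by naming the index first; a minimal sketch using the same sample df:

df = (
    df
    .rename_axis("date")   # name the index, so reset_index() turns it into a "date" column
    .reset_index()
    .pivot(index="date", columns="name", values=["color", "number"])
    .fillna("")
)
df.columns = ["_".join(x) for x in df.columns.values]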
The problem is the line:
df_merged = df1.join(df2, lsuffix="_left", rsuffix="_right", how='left')
where df_merged is reassigned on every pass of the loop to the join of df1 with the current df2.
After the loop, df_merged therefore only holds the join of df1 with the last df2, and Donald gets lost along the way.
To fix this, start df_merged as a copy of df1 before the loop and then join each df2 onto df_merged inside the loop (use how='outer' instead of how='left' if you also want to keep dates that occur only for the other names).
Here is the full code with the changes (not tested):
# Identify names that are in the dataset
names = df['name'].unique().tolist()
# Define dataframe with first name
df1 = pd.DataFrame()
df1 = df[(df == names[0]).any(axis=1)]
df1 = df1.drop(['name'], axis=1)
df1 = df1.rename({'color':'color_'+str(names[0]), 'number':'number_'+str(names[0])}, axis=1)
# Make dataframes with other names and their corresponding color and number, add them to df1
df_merged = df1.copy()
for i in range(1, len(names)):
    df2 = pd.DataFrame()
    df2 = df[(df == names[i]).any(axis=1)]
    df2 = df2.drop(['name'], axis=1)
    df2 = df2.rename({'color':'color_'+str(names[i]), 'number':'number_'+str(names[i])}, axis=1)
    # join the current df2 to df_merged:
    df_merged = df_merged.join(df2, lsuffix="_left", rsuffix="_right", how='left')
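As a side note, the whole loop can also be written by collecting one sub-frame per name in a list and letting pd.concat align them; a sketch (untested against your full dataset):

parts = []
for name in df['name'].unique():
    part = (df[df['name'] == name]
            .drop(columns='name')
            .rename(columns={'color': 'color_' + str(name),
                             'number': 'number_' + str(name)}))
    parts.append(part)

# axis=1 aligns the pieces on their shared date index; combinations
# that do not occur simply come out as NaN.
df_merged = pd.concat(parts, axis=1)

This keeps every date (including Donald's) without any suffix bookkeeping.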
I want to left join df2 onto df1 and keep the row that matches by group; if there is no matching group, I would like to keep the first row of that group instead, in order to achieve df3 (the desired result). I was hoping you guys could help me find the optimal solution.
Here is my code to create the two dataframes and the required result.
import pandas as pd
import numpy as np
market = ['SP', 'SP', 'SP']
underlying = ['TSLA', 'GOOG', 'MSFT']
# DF1
df = pd.DataFrame(list(zip(market, underlying)),
                  columns=['market', 'underlying'])
market2 = ['SP', 'SP', 'SP', 'SP', 'SP']
underlying2 = [None, 'TSLA', 'GBX', 'GBM', 'GBS']
client2 = [17, 12, 100, 21, 10]
# DF2
df2 = pd.DataFrame(list(zip(market2, underlying2, client2)),
                   columns=['market', 'underlying', 'client'])
market3 = ['SP', 'SP', 'SP']
underlying3 = ['TSLA', 'GOOG', 'MSFT']
client3 = [12, 17, 17]
# Desired
df3 = pd.DataFrame(list(zip(market3, underlying3, client3)),
                   columns=['market', 'underlying', 'client'])
# This works but feels sub optimal
df3 = pd.merge(df,
               df2,
               how='left',
               on=['market', 'underlying'])
df3 = pd.merge(df3,
               df2,
               how='left',
               on=['market'])
df3 = df3.drop_duplicates(['market', 'underlying_x'])
df3['client'] = df3['client_x'].combine_first(df3['client_y'])
df3 = df3.drop(labels=['underlying_y', 'client_x', 'client_y'], axis=1)
df3 = df3.rename(columns={'underlying_x': 'underlying'})
Hope you guys can help, thank you so much!
Store the first value per market (the groupby might not be necessary if every entry in market is 'SP'), then merge and fill the missing clients with that value:
fill_value = df2.groupby('market').client.first()
# if you are interested in filtering for None instead:
fill_value = df2.set_index('market').loc[lambda df: df.underlying.isna(), 'client']

(df
 .merge(
     df2,
     on=['market', 'underlying'],
     how='left')
 .set_index('market')
 .fillna({'client': fill_value}, downcast='infer')
)
       underlying  client
market
SP           TSLA      12
SP           GOOG      17
SP           MSFT      17
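For what it's worth, the .set_index('market') before fillna is what makes the fill line up: when the value supplied for a column is a Series, fillna aligns it on the index, so each row picks up the fill_value entry for its own market. Chain .reset_index() at the end if you want market back as a regular column.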
I have two dataframes:
df1 with columns 'state', 'date', 'number'
df2 with columns 'state', 'specificDate' (one specificDate for one state, each state is mentioned just once)
In the end, I want to have a dataset with columns 'state', 'specificDate', 'number'. Also, I would like to add 14 days to each specific date and get numbers for those dates too.
I tried this
df = df1.merge(df2, left_on='state', right_on='state')
df['newcolumn'] = np.where((df.state == df.state) & (df.date == df.specificDate), df.numbers)
df['newcolumn'] = np.where((df.state == df.state) & (df.date == df.specificDate + datetime.timedelta(days=14)), df.numbers)
but I got this error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
When I add .all() it still gives me the same error.
I feel that my logic is not correct. How else can I insert those values into my dataset?
I think you want to use df2 as the left side of the join. You can use pd.DateOffset to add 14 days.
# create dataset with specificDate and specificDate + 14 days
df2_14 = df2.set_index('state')['specificDate'].apply(pd.DateOffset(14)).reset_index()
df = pd.concat([df2, df2_14])

# now join the values from df1
df = df.join(df1.set_index(['state', 'date']),
             how='left',
             on=['state', 'specificDate'])
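Because df2 (plus its shifted copy) is the left side, every (state, specificDate) and (state, specificDate + 14) pair survives the join, and number simply comes back as NaN wherever df1 has no row for that state and date.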
You can declare an empty DataFrame and insert the filtered data into it.
To filter the data, iterate through all rows of df2 and build a mask covering the dates between the specificDate column and specificDate+14 for the same state name.
I have created two DataFrames df1 and df2 with several values from your DataFrames and tested the above procedure.
import pandas as pd
import datetime

data1 = {
    "state": ["Alabama", "Alabama", "Alabama"],
    "date": ["3/12/20", "3/13/20", "3/14/20"],
    "number": [0, 5, 7]
}
data2 = {
    "state": ["Alabama", "Alaska"],
    "specificDate": ["03.13.2020", "03.11.2020"]
}

df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df2 = pd.DataFrame(data2)
df2['specificDate'] = pd.to_datetime(df2['specificDate'])

final_df = pd.DataFrame()
for index, row in df2.iterrows():
    begin_date = row["specificDate"]
    end_date = begin_date + datetime.timedelta(days=14)
    mask = (df1['date'] >= begin_date) & (df1['date'] <= end_date) & (df1['state'] == row['state'])
    filtered_data = df1.loc[mask]
    if not filtered_data.empty:
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        final_df = pd.concat([final_df, filtered_data], ignore_index=True)

print(final_df)
Output:
     state       date  number
0  Alabama 2020-03-13       5
1  Alabama 2020-03-14       7
Updated Answer:
To return only the rows from df1 for the specific date and the date 14 days later, we update the mask in the above code snippet.
import pandas as pd
import datetime

data1 = {
    "state": ["Alabama", "Alabama", "Alabama", "Alabama", "Alabama"],
    "date": ["3/12/20", "3/13/20", "3/14/20", "3/27/20", "3/28/20"],
    "number": [0, 5, 7, 9, 3]
}
data2 = {
    "state": ["Alabama", "Alaska"],
    "specificDate": ["03.13.2020", "03.11.2020"]
}

df1 = pd.DataFrame(data1)
df1['date'] = pd.to_datetime(df1['date'])
df2 = pd.DataFrame(data2)
df2['specificDate'] = pd.to_datetime(df2['specificDate'])

final_df = pd.DataFrame()
for index, row in df2.iterrows():
    first_date = row["specificDate"]
    last_date = first_date + datetime.timedelta(days=14)
    mask = ((df1['date'] == first_date) | (df1['date'] == last_date)) & (df1['state'] == row['state'])
    filtered_data = df1.loc[mask]
    if not filtered_data.empty:
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        final_df = pd.concat([final_df, filtered_data], ignore_index=True)

print(final_df)
Output:
     state       date  number
0  Alabama 2020-03-13       5
1  Alabama 2020-03-27       9
Just a slight tweak on the first line of Eric's answer to make it a little simpler, as I was confused why he used set_index and reset_index.
df2_14 = df2.copy()
df2_14['specificDate'] = df2['specificDate'].apply(pd.DateOffset(14))
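One caveat: DateOffset objects are no longer callable in recent pandas versions, so the .apply(pd.DateOffset(14)) pattern can raise a TypeError there. Adding the offset to the Series directly is a safer spelling (a sketch using the same df2 as above):

df2_14 = df2.copy()
df2_14['specificDate'] = df2['specificDate'] + pd.DateOffset(14)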
I have two data frames, say df1 and df2, each with the two columns ['Name', 'Marks'].
I want to find the difference between the two dfs for corresponding Name values.
Eg:
df = pd.DataFrame([["Shivi",70],["Alex",40]],columns=['Names', 'Value'])
df2 = pd.DataFrame([["Shivi",40],["Andrew",40]],columns=['Names', 'Value'])
For df1-df2 I want
pd.DataFrame([["Shivi",30],["Alex",40],["Andrew",40]],columns=['Names', 'Value'])
You can use:
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
So a complete program will look like this:
import pandas as pd
data1 = {'Name': ["Ashley", "Tom"], 'Marks': [40, 50]}
data2 = {'Name': ["Ashley", "Stan"], 'Marks': [80, 90]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0)
print(diff)
Output:
        Marks
Name
Ashley  -40.0
Stan    -90.0
Tom      50.0
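Note that fill_value=0 treats a Name missing from one frame as 0 before subtracting, which is why Stan (only in df2) comes out negative. If you want the magnitude of the difference regardless of which frame the name came from, as in your desired output, chain .abs():

diff = df1.set_index("Name").subtract(df2.set_index("Name"), fill_value=0).abs()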
I have 10,000 data points that I'm sorting into a dictionary and then exporting to a csv using pandas. I'm sorting temperatures, pressures and flows, each associated with a key. But when doing this I get: https://imgur.com/a/aNX7RHf
but I want something like this: https://imgur.com/a/ZxJgPv4
I'm transposing my dataframe so the index can be rows, but in this case I want only 3 rows (1, 2 and 3), with all the data populating those rows.
flow_dictionary = {'200:P1F1': [5.5, 5.5, 5.5]}

pres_dictionary = {'200:PT02': [200, 200, 200],
                   '200:PT03': [200, 200, 200],
                   '200:PT06': [66, 66, 66],
                   '200:PT07': [66, 66, 66]}

temp_dictionary = {'200:TE02': [27, 27, 27],
                   '200:TE03': [79, 79, 79],
                   '200:TE06': [113, 113, 113],
                   '200:TE07': [32, 32, 32]}
df = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T
df = df.append(df2, ignore_index=False, sort=True)
df = df.append(df3, ignore_index=False, sort=True)
df.to_csv('processedSegmentedData.csv')
SOLUTION:
df1 = pd.DataFrame.from_dict(temp_dictionary, orient='index').T
df2 = pd.DataFrame.from_dict(pres_dictionary, orient='index').T
df3 = pd.DataFrame.from_dict(flow_dictionary, orient='index').T
df4 = pd.concat([df1,df2,df3], axis=1)
df4.to_csv('processedSegmentedData.csv')
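The reason this works is that the three transposed frames all carry the same default RangeIndex 0, 1, 2, so pd.concat(..., axis=1) aligns them row by row into exactly three rows, instead of stacking them vertically the way append did.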
I have 3 DataFrames:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
df2 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
df3 = pd.DataFrame(np.random.randn(100, 4), index=pd.date_range('1/1/2010', periods=100), columns=["A", "B", "C", "D"]).T.sort_index()
I concatenate them creating a DataFrame with multi levels:
df_c = pd.concat([df1, df2, df3], axis = 1, keys = ["df1", "df2", "df3"])
Swap levels and sort:
df_c.columns = df_c.columns.swaplevel(0,1)
df_c = df_c.reindex(sorted(df_c.columns), axis=1)
ipdb> df_c
          2010-01-01                          2010-01-02
                 df1       df2       df3             df1       df2       df3
A          -0.798407  0.124091  0.271089        0.754759 -0.575769  1.501942
B           0.602091 -0.415828  0.152780        0.530525  0.118447  0.057240
C          -0.440619 -1.074837 -0.618084        0.627520 -1.298814  1.029443
D          -0.242851 -0.738948 -1.312393        0.559021  0.196936 -1.074277
I would like to slice it to get the values for individual rows, but so far I have only achieved this degree of slicing:
cols = df_c.T.index.get_level_values(0)
ipdb> df_c.xs(cols[0], axis = 1, level = 0)
        df1       df2       df3
A -0.798407  0.124091  0.271089
B  0.602091 -0.415828  0.152780
C -0.440619 -1.074837 -0.618084
D -0.242851 -0.738948 -1.312393
The only way I found to get the values for each row is to define a new dataframe,
slcd_df = df_c.xs(cols[0], axis = 1, level = 0)
and then select the row using the usual procedure:
ipdb> slcd_df.loc["A", :]
df1 -0.798407
df2 0.124091
df3 0.271089
But I was wondering whether there is a better (meaning faster and more elegant) way to slice multilevel DataFrames.
You can use pd.IndexSlice:
idx = pd.IndexSlice
sliced = df_c.loc["A", idx["2010-01-01", :]]
print(sliced)
2010-01-01  df1    0.199332
            df2    0.887018
            df3   -0.346778
Name: A, dtype: float64
Or you may also use slice(None):
print(df_c.loc["A", ("2010-01-01", slice(None))])
2010-01-01  df1    0.199332
            df2    0.887018
            df3   -0.346778
Name: A, dtype: float64
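If you need several dates at once, the same idx also accepts a label slice on the date level, assuming the column MultiIndex is lexsorted (which the reindex above guarantees); a sketch:

idx = pd.IndexSlice
print(df_c.loc["A", idx["2010-01-01":"2010-01-02", :]])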