Create a new column in a data frame from existing columns using Pandas - python

I am not sure how to build the data frame here, but I am looking for a way to take the values from multiple columns and combine them into one column, not as a sum but as a joined value.
Ex. MB|Val|34567|W123 -> MB|Val|34567|W123|MB_Val_34567_W123.
What I have tried so far is creating a conditions variable that checks whether a particular column equals a given value:
conditions = [(Groupings_df['GroupingCriteria1'] == 'MB')]
then a values variable that holds what I want in the new column:
values = ['MB_Val_34567_W123']
and lastly assigning the result to the new column:
Groupings_df['GroupingColumn'] = np.select(conditions, values)
This works for one row, but it would be inefficient to keep manually changing the number in the values variable (34567) across a df with thousands of rows.

IIUC, you want to create a new column as a concatenation of each row:
df = pd.DataFrame({'GC1': ['MB'], 'GC2': ['Val'], 'GC3': [34567], 'GC4': ['W123'],
                   'Dummy': [10], 'Other': ['Hello']})
df['GC'] = df.filter(like='GC').astype(str).apply(lambda x: '_'.join(x), axis=1)
print(df)
# Output
  GC1  GC2    GC3   GC4  Dummy  Other                 GC
0  MB  Val  34567  W123     10  Hello  MB_Val_34567_W123
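If the columns to join do not share a common substring, the same idea can be sketched with an explicit column list. This is only an illustration: GroupingCriteria1 and GroupingColumn come from the question, while the other column names and the second row are made up for the example.

```python
import pandas as pd

# Hypothetical data: only 'GroupingCriteria1' is named in the question,
# the remaining column names and values are assumptions.
Groupings_df = pd.DataFrame({
    "GroupingCriteria1": ["MB", "MB"],
    "GroupingCriteria2": ["Val", "Val"],
    "GroupingCriteria3": [34567, 98765],
    "GroupingCriteria4": ["W123", "W456"],
})

cols = ["GroupingCriteria1", "GroupingCriteria2",
        "GroupingCriteria3", "GroupingCriteria4"]

# Convert every selected column to str, then join each row with '_'.
Groupings_df["GroupingColumn"] = (
    Groupings_df[cols].astype(str).apply("_".join, axis=1)
)
print(Groupings_df["GroupingColumn"].tolist())
```

Because the join runs once per row, no value has to be typed in by hand, whatever the number of rows.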

Related

Loop over DataFrame and use its column values to create columns in a new DataFrame

I have a dataframe sorted_df with the columns CarNumber, Sightcode, Sightdate, and currentcity. I want to loop over this DF and use its column values to create different columns in a new DF. I am trying to do something like this (this is not the complete snippet):
df_new = pd.DataFrame()
for index, row in sorted_df.iterrows():
    if row['CarNumber'] == 'GACX470159' and row['Sightcode'] == 'ACTUAL_PLACEMENT':
        df_new['AP_TS'] = row['Sightdate']
    elif row['CarNumber'] == 'GACX470159' and row['Sightcode'] == 'PULL':
        df_new['PULL_TS'] = row['Sightdate']
        df_new['PULL_CITY'] = row['currentcity']
    elif row['CarNumber'] == 'GACX470159' and row['Sightcode'] == 'ARRIVAL_AT_DESTINATION':
        df_new['ARRIVAL_TS'] = row['Sightdate']
        df_new['ARRIVAL_CITY'] = row['currentcity']
The idea is that for one car number I want one row in df_new, with multiple columns depending on the sight code.
If Sightcode is equal to ACTUAL_PLACEMENT, a new column df_new['AP_TS'] is filled with the value of Sightdate.
If Sightcode is equal to PULL, a new column df_new['PULL_TS'] is filled with the value of Sightdate.
Once this works, I will extend the logic to loop through all the car numbers in sorted_df.
Can someone please help me rewrite this?
The loop is meant to create new timestamp columns in df_new from the Sightdate values, but it doesn't work: in the end the desired columns exist in df_new, but it has 0 rows.
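Assigning a scalar to a column of an empty DataFrame creates the column without adding any rows, which would explain the 0-row result. One possible way to avoid the loop altogether is to reshape with pivot_table. The sketch below assumes the column names from the question; the sample rows, the second car number, and the generated column names (for example PULL_Sightdate rather than PULL_TS) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
sorted_df = pd.DataFrame({
    "CarNumber": ["GACX470159", "GACX470159", "GACX470159", "GACX000001"],
    "Sightcode": ["ACTUAL_PLACEMENT", "PULL", "ARRIVAL_AT_DESTINATION", "PULL"],
    "Sightdate": pd.to_datetime(
        ["2021-01-01", "2021-01-03", "2021-01-07", "2021-02-01"]),
    "currentcity": ["Houston", "Dallas", "Chicago", "Austin"],
})

# One row per car, one column per (value column, Sightcode) pair.
# aggfunc="first" keeps the first sighting if a code repeats for a car.
wide = sorted_df.pivot_table(
    index="CarNumber",
    columns="Sightcode",
    values=["Sightdate", "currentcity"],
    aggfunc="first",
)

# Flatten the two-level column index into names like 'PULL_Sightdate'.
wide.columns = [f"{code}_{value}" for value, code in wide.columns]
print(wide)
```

Cars that lack a given sight code simply get a missing value in the corresponding columns, and a final rename can map the generated names to AP_TS, PULL_TS, and so on.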

How to create two dataframes from a given dataframe?

Assume I have the following data frame:
I want to create two data frames such that, for any row, if column Actual is equal to column Predicted, the values in both columns go into one data frame; otherwise both go into another data frame.
For example, rows 0, 1, 2 go into a dataframe named correct_df and rows 245, 247 go into a dataframe named incorrect_df.
Use boolean indexing:
m = df['Actual'] == df['Predicted']
correct_df = df.loc[m]
incorrect_df = df.loc[~m]
You can use this:
df_cor = df.loc[df['Actual'] == df['Predicted']]
df_incor = df.loc[df['Actual'] != df['Predicted']]
And use reset_index if you want a new index.
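Since the original data frame is not shown, here is a minimal sketch of the boolean-indexing approach on made-up data, including the optional reset_index step. Only the column names Actual and Predicted come from the question.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({"Actual": [1, 0, 1, 1], "Predicted": [1, 0, 0, 1]})

# True where the prediction matches the actual value.
m = df["Actual"] == df["Predicted"]

# drop=True discards the old index instead of keeping it as a column.
correct_df = df.loc[m].reset_index(drop=True)
incorrect_df = df.loc[~m].reset_index(drop=True)

print(correct_df)
print(incorrect_df)
```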

Creating a function which creates a new column based on values in two columns?

I have a data frame like this:
ID  Min_Value  Max_Value
1   0          0.10562
2   0.10563    0.50641
3   0.50642    1.0
I have another data frame that contains Value as a column. I want to create a new column in the second data frame that returns the ID whose Min_Value and Max_Value bracket the Value, as in the data frame above. I can use if-else conditions, but the number of IDs is large and the code becomes too bulky. Is there an efficient way to do this?
If I understand correctly, you can join/merge the two into one DataFrame and then use the "between" method to select the matching rows of the second DataFrame.
import pandas as pd
data = {"Min_Value": [0, 0.10563, 0.50642],
        "Max_Value": [0.10562, 0.50641, 1.0]}
df = pd.DataFrame(data, index=[1, 2, 3])
df2 = pd.DataFrame({"Value": [0, 0.1, 0.58]}, index=[1, 2, 3])
# Align the two frames on their shared index
df = df.join(df2)
# inclusive="both" keeps values equal to a bound, such as 0 in row 1
mask_between_values = df['Value'].between(df['Min_Value'], df['Max_Value'], inclusive="both")
# This is the result
df2[mask_between_values]
   Value
1   0.00
3   0.58
Suppose you have two dataframes, df and new_df, and you want to add a column 'new_column' to new_df wherever 'Value' lies between 'Min_Value' and 'Max_Value' from df. Then this code may help. Note that it compares the two frames row by row, so both need the same default integer index.
for i in range(0, len(df)):
    if df.loc[i, 'Max_Value'] > new_df.loc[i, 'Value'] and df.loc[i, 'Min_Value'] < new_df.loc[i, 'Value']:
        new_df.loc[i, 'new_column'] = df.loc[i, 'ID']
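Both answers above compare row i of one frame with row i of the other. If each Value should instead be looked up against every range, one possible sketch uses pd.IntervalIndex. It assumes the ranges do not overlap and that both bounds are inclusive; the frame names ranges and new_df and the sample values are hypothetical.

```python
import pandas as pd

# The lookup table from the question.
ranges = pd.DataFrame({
    "ID": [1, 2, 3],
    "Min_Value": [0, 0.10563, 0.50642],
    "Max_Value": [0.10562, 0.50641, 1.0],
})

# Hypothetical second frame; 2.0 falls outside every range.
new_df = pd.DataFrame({"Value": [0.05, 0.3, 0.9, 2.0]})

# Build closed intervals [Min_Value, Max_Value]; they must not overlap.
intervals = pd.IntervalIndex.from_arrays(
    ranges["Min_Value"], ranges["Max_Value"], closed="both")

# Position of the interval containing each value, or -1 if none does.
pos = intervals.get_indexer(new_df["Value"].to_numpy())

# Map positions back to IDs; -1 is not a valid label, so it becomes NaN.
new_df["ID"] = ranges["ID"].reindex(pos).to_numpy()
print(new_df)
```

The ID column comes out as float because unmatched values are stored as NaN.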

Find the difference between data frames based on specific columns and output the entire record

I want to compare two CSV files (A and B) and find the rows that are present in B but not in A, based only on specific columns.
I found a few answers for that, but they still do not give the result I expect.
Answer 1 :
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for single column but not for multiple.
Answer 2 :
df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)
This takes as an input specific columns and also outputs specific columns. I want to print the whole record and not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
# Names of the columns you want to compare on
columns = ['column1', 'column2']
new = pd.merge(A, B, how='right', on=columns)
# Placeholder: any column from the first DataFrame that isn't in the columns list.
# You will probably have to add an '_x' at the end of the column name.
marker = 'some_column_x'
col = new[marker].dropna()
# Keep only the rows where the column from A is missing, i.e. rows found only in B
new = new[~new[marker].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
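Another way to get the full records from B that have no match in A is a left merge with indicator=True. The sketch below uses made-up frames; column1 and column2 are the key names used in the question, and the column named other is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
A = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"], "other": [10, 20]})
B = pd.DataFrame({"column1": [1, 3], "column2": ["a", "c"], "other": [10, 30]})

keys = ["column1", "column2"]

# Merge B against only the key columns of A so that B's other columns keep
# their names; indicator=True adds a '_merge' column describing the match.
merged = B.merge(A[keys].drop_duplicates(), on=keys, how="left", indicator=True)

# 'left_only' marks rows of B whose key combination never occurs in A.
only_in_B = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_B)
```

Because only the key columns of A take part in the merge, no '_x' or '_y' suffixes appear and the whole record from B is kept.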

Creating a function to describe a set of dataframes

I want to create a function that returns the output of df.describe for all the dataframes passed to it as an argument.
My idea was to store the names of all the dataframes I need to describe as columns in a separate dataframe (x) and then pass this to the function.
Here is what I have made, and the output:
The problem is that it only shows the description of one dataframe.
def des(df):
    columns = df.columns
    for column in columns:
        column = pd.read_csv('SKUs\\' + column + '.csv')
        column['Date'] = pd.to_datetime(column['Date'].astype(str), dayfirst=True, format='%d&m%y', infer_datetime_format=True)
        column.dropna(inplace=True)
    return(column.describe())
data = {'UGCAA':[],'FAPG1':[],'ACSO5':[],'LGHF2':[],'LGMP8':[],'GGAF1':[]}
df=pd.DataFrame(data)
df
des(df)
             Sales
count   948.000000
mean    876.415612
std     874.373236
min       1.000000
25%     298.750000
50%     619.500000
75%    1148.500000
max    7345.000000
I believe you can create a list of DataFrames and concat them together at the end:
def des(df):
    dfs = []
    for column in df.columns:
        df1 = pd.read_csv('SKUs\\' + column + '.csv')
        df1['Date'] = pd.to_datetime(df1['Date'].astype(str),
                                     format='%d%m%y', infer_datetime_format=True)
        df1.dropna(inplace=True)
        dfs.append(df1.describe())
    return pd.concat(dfs, axis=1, keys=df.columns)
It's because you are looping over the columns and resetting column each time, while only returning one result. To just visualize it, you can print the describe output in each iteration, or store the results together in one variable and handle them after the loop.
def des(df):
    columns = df.columns
    for column in columns:
        column = pd.read_csv('SKUs\\' + column + '.csv')
        column['Date'] = pd.to_datetime(column['Date'].astype(str), dayfirst=True, format='%d%m%y', infer_datetime_format=True)
        column.dropna(inplace=True)
        print(column.describe())
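The "store them together" idea can also be sketched without reading files, by passing the dataframes themselves in a dict. The function name describe_all and the sample data are hypothetical; only the SKU labels and the Sales column name come from the question.

```python
import pandas as pd

def describe_all(frames):
    """Describe every DataFrame in a dict of label -> DataFrame."""
    # Concatenating a dict uses its keys as the outer column level.
    return pd.concat(
        {name: frame.describe() for name, frame in frames.items()},
        axis=1,
    )

# Made-up data standing in for the per-SKU CSV files.
frames = {
    "UGCAA": pd.DataFrame({"Sales": [1, 2, 3]}),
    "FAPG1": pd.DataFrame({"Sales": [10, 20, 30, 40]}),
}

summary = describe_all(frames)
print(summary)
```

Each dataframe's statistics end up side by side under its own label, so nothing is overwritten between iterations.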
