Combine pandas data columns into a single column - python

I have 3 columns of IDs that I want to combine into a single column, as in the example below. The goal is simply to replace all 0s in the main column with the values in either ID_1 or ID_2, while keeping the SCORE column on the far right.
Note that the Main ID column also has cases where a value is already present, as shown in row 3; in that case nothing needs to be done. Ultimately I am trying to get a single column as shown in the desired output. I tried an iterative loop, but it was not a pythonic approach.
Data Table
Main ID ID_1 ID_2 SCORE
0 0 121231 212
0 54453 0 199
12123 12123 0 185
343434 0 343434 34
2121 0 0 66
0 0 11 565
Desired output:
MAIN ID SCORE
121231 212
54453 199
12123 185
343434 34
2121 66
11 565
Update: applying the bfill method changed all the 'MAIN ID' numbers into scientific notation, like 3.43559e+06.

This one works for me. It's simple but functional :D
import pandas as pd

d = {'MAIN ID': [0, 0, 12123, 343434, 2121, 0],
     'ID_1': [0, 54453, 12123, 0, 0, 0],
     'ID_2': [121231, 0, 0, 343434, 0, 11],
     'SCORE': [212, 199, 185, 34, 66, 565]}
df = pd.DataFrame(data=d)

for i in range(len(df)):
    if df.loc[i, 'MAIN ID'] == 0:
        # chained indexing like df.iloc[i]['MAIN ID'] = ... does not
        # write back to df, so assign through .loc instead
        if df.loc[i, 'ID_1'] != 0:
            df.loc[i, 'MAIN ID'] = df.loc[i, 'ID_1']
        else:
            df.loc[i, 'MAIN ID'] = df.loc[i, 'ID_2']

df = df.drop(['ID_1', 'ID_2'], axis=1)

Try bfill with mask
out = df.mask(df.eq(0)).bfill(axis=1)[['Main ID']]
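About the scientific-notation update: masking the zeros with NaN forces the columns to float, which is why the numbers print like 3.43559e+06. Casting back to integers after the fill restores the original look (a small sketch, assuming every row keeps at least one non-zero ID so no NaN remains):
out = df.mask(df.eq(0)).bfill(axis=1)[['Main ID']].astype(int)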

Related

How to call a column by combining a string and another variable in a python dataframe?

Imagine I have a dataframe with these variables and values:
ID  Weight  LR Weight  UR Weight  Age  LS Age  US Age  Height  LS Height  US Height
1   63      50         80         20   18      21      165     160        175
2   75      50         80         22   18      21      172     160        170
3   49      45         80         17   18      21      180     160        180
I want to create the additional following variables:
ID  Flag_Weight  Flag_Age  Flag_Height
1   1            1         1
2   1            0         0
3   1            0         1
These flags symbolize that the main variable values (e.g. Weight, Age and Height) are between the corresponding lower and upper limits. The limit columns may start with two different characters (in this dataframe I gave four examples: LR, UR, LS, US, but in my real dataframe I have more), and the limit values sometimes differ from ID to ID.
Can you help me create these flags, please?
Thank you in advance.
You can use reshaping using a temporary MultiIndex:
(df.set_index('ID')
   .pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
             d.columns.str.extract(r'(^[LU]?).*?\s*(\S+)$')),
         axis=1))
   .stack()
   .assign(flag=lambda d: d[''].between(d['L'], d['U']).astype(int))
   ['flag'].unstack().add_prefix('Flag_').reset_index()
)
Output:
ID Flag_Age Flag_Height Flag_Weight
0 1 1 1 1
1 2 0 0 1
2 3 0 1 1
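For reference, the regex splits each column name into an optional leading L/U and the trailing word: 'LR Weight' maps to ('L', 'Weight'), 'US Age' to ('U', 'Age'), and a plain 'Weight' to ('', 'Weight'). After stacking, each variable's value sits next to its L and U limits, so between can compare them row by row.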
So, if I understood correctly, you want to add columns with these new variables. The simplest solution for this would be df.insert().
You could make it something like this:
df.insert(loc, column, value), where loc is the integer position at which to insert the new column, column is its name, and value holds its values.
You can make up the new values in pretty much every way you can imagine. So just copying a column or simple mathematical operations like +, -, *, / can be performed. But you can also apply a whole function, which returns the flags based on your conditions, as the values of the new column; see the sketch below.
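A minimal sketch of that idea on the Weight columns from the question (column names and data assumed exactly as shown there):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Weight': [63, 75, 49],
                   'LR Weight': [50, 50, 45],
                   'UR Weight': [80, 80, 80]})

# insert Flag_Weight right after ID (position 1); between() is inclusive
df.insert(1, 'Flag_Weight',
          df['Weight'].between(df['LR Weight'], df['UR Weight']).astype(int))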
If the new columns can just be appended, you can even just make up a new column like this:
df['new column name'] = any values you want
I hope this helped.

How to remove Header Row from DataFrame Output

When I want to find the null values in a dataset, here is what I get:
df.isnull().sum()
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
dtype: int64
Next, when I create a dataframe from this output, I get the below.
df2 = pd.DataFrame(df.isnull().sum())
df2.set_index(0)
df2.index.name = None
0
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
Why is that extra row coming in the output, and how can I remove it? I tried this on a normal test dataframe, where using set_index(0) and df.index.name = None did remove the extra row, but that does not work on the created dataframe df2.
As you may already know, that extra zero appearing as an "extra row" in your output is actually the header for the column name(s). When you create the DataFrame, try passing a column name if you want something more descriptive than the default "0" for column name:
df2 = pd.DataFrame(df.isnull().sum(), columns=["Null_Counts"])
Same as the difference you would get from these two variants:
print(pd.DataFrame([0,1,2,3,4,5]))
0
0 0
1 1
2 2
3 3
4 4
5 5
vs
print(pd.DataFrame([0,1,2,3,4,5], columns=["My_Column"]))
My_Column
0 0
1 1
2 2
3 3
4 4
5 5
And, if you just don't want the header row to show up in your output, which seems to be the intent of your question, then you could do something like this to just use the index values and the count values to create whatever output format you want:
df1 = pd.DataFrame([0,1,2,3,4,5], columns=["My_Column"])
for tpl in zip(df1.index.values, df1["My_Column"].values):
    print("{}\t{}".format(tpl[0], tpl[1]))
Output:
0 0
1 1
2 2
3 3
4 4
5 5
And you can also use the DataFrame function to_csv() and pass header=False if you just want to print or save the CSV output somewhere without the header row:
print(df1.to_csv(header=False))
0,0
1,1
2,2
3,3
4,4
5,5
And you can also pass sep="\t" to the to_csv function call if you prefer tab- instead of comma-delimited output.
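For instance, the same output tab-delimited:
print(df1.to_csv(header=False, sep="\t"))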

Identify increasing features in a data frame

I have a data frame in which some features hold cumulative values. I need to identify those features in order to revert the cumulative values.
This is how my dataset looks (plus about 50 variables):
a b
346 17
76 52
459 70
680 96
679 167
246 180
What I wish to achieve is:
a b
346 17
76 35
459 18
680 26
679 71
246 13
I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around? First identify the features and then revert the values?
Finding cumulative features in dataframe?
What I do at the moment is run the following code in order to get the names of the features with cumulative values:
def accmulate_col(value):
    count = 0
    count_1 = False
    name = []
    for i in range(len(value)-1):
        if value[i+1] - value[i] >= 0:
            count += 1
            if value[i+1] - value[i] > 0:
                count_1 = True
    name.append(1) if count == len(value)-1 and count_1 else name.append(0)
    return name
df.apply(accmulate_col)
Afterwards, I save these feature names manually in a list called cum_features and revert the values, creating the desired dataset:
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))
Is there a better way to solve my problem?
To identify which columns have increasing* values throughout the whole column, you will need to apply conditions on all the values. So in that sense, you have to use the values first to figure out what columns fit the conditions.
With that out of the way, given a dataframe such as:
import pandas as pd
d = {'a': [1, 2, 3, 4],
     'b': [4, 3, 2, 1]}
df = pd.DataFrame(d)
#Output:
a b
0 1 4
1 2 3
2 3 2
3 4 1
Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.
That can be written as:
out = (df.diff().dropna()>0).all()
#Output:
a True
b False
dtype: bool
Then, you can just use the column names to select only those with True in them
new_df = df[df.columns[out]]
#Output:
a
0 1
1 2
2 3
3 4
*(The term cumulative doesn't really describe the condition you used. Did you want it to be cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing just means that the value in the current row/index is greater than the previous one.)
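Putting both steps together for the original goal, here is a sketch that uses a non-strict comparison (>= 0, so columns with repeated values still count) and then reverts with np.diff as in the question:
import numpy as np

# columns whose values never decrease are treated as cumulative
cum_features = df.columns[(df.diff().dropna() >= 0).all()]
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(
    lambda col: np.diff(col, prepend=0))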

keep a random lowest value per row in a Python Pandas dataset

I have a dataframe where every line is ranked on several attributes vs. all the other rows. A single line can have the same rank in 2 attributes (meaning a row can be the best in a few attributes), as shown in rows 2 and 3 below:
att_1 att_2 att_3 att_4
ID
984 5 3 1 46
794 1 1 99 34
6471 20 2 3 2
Per line, I want to keep the index (ID) and the cell with the lowest value; in case more than one cell ties for the lowest value, I have to select a random one to keep a normal distribution.
I managed to convert the df into a numpy array and run the following:
idx = np.argmin(h_data.values, axis=1)
But it gives me the first minimum every time, because argmin returns only the first occurrence on ties.
Desired output:
ID MIN
984 att_3
794 att_2
6471 att_1
Thank you!
Use list comprehension with numpy.random.choice:
df['MIN'] = [np.random.choice(df.columns[x == x.min()], 1)[0] for x in df.values]
print (df)
att_1 att_2 att_3 att_4 MIN
ID
984 5 3 1 46 att_3
794 1 1 99 34 att_1
6471 20 2 3 2 att_2
If you want to do something for each row (or column), you should try the .apply method:
df.apply(np.argmin, axis=1) #row wise
df.apply(np.argmin, axis=0) #column wise
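Note that np.argmin returns positional indices and always picks the first minimum on ties, which is exactly what the question wants to avoid for tied rows. If the column label of the (first) minimum is enough, pandas has this built in; a small sketch on the ranking columns:
df['MIN'] = df[['att_1', 'att_2', 'att_3', 'att_4']].idxmin(axis=1)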

Pandas: replace values in dataframe from pivot_table

I have a dataframe and a pivot table, and I need to replace some values in the dataframe with values from the pivot table's columns.
Dataframe:
access_code ID cat1 cat2 cat3
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1 2
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 1 2
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 1 2
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 1
g0ifck11dix7avgu 0d254a81dca0ff716753b67a50c41fd7 1 2 3
Pivot Table:
type 1 2 \
access_code ID member_id
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1045794 1023 923 1 122
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 768656 203 243 1 169
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 604095 392 919 1 35
g06q0itlmkqmz5cv f4a3b3f2fca77c443cd4286a4c91eedc 1457307 243 1
g074qx58cmuc1a2f 13f2674f6d5abc888d416ea6049b57b9 5637836 1
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 5732738 111 2343 1
Desired output:
access_code ID cat1 cat2 cat3
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1023 923
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 111 2343
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 392 919
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 1
g0ifck11dix7avgu 0d254a81dca0ff716753b67a50c41fd7 1 2 3
If I use
df.ix[df.cat1 == 1] = pivot_table['1']
It returns the error ValueError: cannot set using a list-like indexer with a different length than the value.
As long as your dataframe is not exceedingly large, you can make it happen in some really ugly ways. I am sure someone else will provide you with a more elegant solution, but in the meantime this duct tape might point you in the right direction.
Keep in mind that in this case I did this with 2 dataframes instead of 1 dataframe and 1 pivot table, as I already had enough trouble formatting the dataframes from the textual data.
As there are empty fields in your data, and my dataframes did not like this, first convert the empty fields to zeros:
df = df.replace(r'\s+', 0, regex=True)
Now ensure that your data is actually floats, else the comparisons will fail:
df[['cat1', 'cat2', 'cat3']] = df[['cat1', 'cat2', 'cat3']].astype(float)
And for the fizzly fireworks:
df.cat1.loc[df.cat1 == 1] = piv['1'].loc[df.loc[df.cat1 == 1].index].dropna()
df.cat1 = df.cat1.fillna(1)
df.cat2.loc[df.cat2 == 2] = piv['2'].loc[df.loc[df.cat2 == 2].index].dropna()
df.cat2 = df.cat2.fillna(2)
df = df.replace(0, ' ')
The fillna is just to recreate your intended output, in which you clearly did not process some lines yet. I guess this column-by-column NaN-filling will not happen in your actual use.
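A possibly tidier variant of the same idea, sketched under the assumption that df and piv are both indexed on ['access_code', 'ID'] and that piv's value columns are labelled '1' and '2' as printed above: rename the pivot's columns to match and let DataFrame.update overwrite the aligned cells.
# assumes matching indexes on both frames (hypothetical setup)
piv_renamed = piv[['1', '2']].rename(columns={'1': 'cat1', '2': 'cat2'})
df.update(piv_renamed)  # overwrites df only where piv_renamed is non-NaN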
