When I check a dataframe for null values in a dataset, here is what I get:
df.isnull().sum()
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
dtype: int64
Next, when I create a dataframe from this df, I get the output below.
df2 = pd.DataFrame(df.isnull().sum())
df2.set_index(0)
df2.index.name = None
0
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
Why is that extra row appearing in the output, and how can I remove it? In a normal test with a df I was able to remove the extra row using set_index(0) and df.index.name = None, but that does not work on the created dataframe df2.
As you may already know, the extra zero appearing as an "extra row" in your output is actually the column header. When you create the DataFrame, pass a column name if you want something more descriptive than the default "0":
df2 = pd.DataFrame(df.isnull().sum(), columns=["Null_Counts"])
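Equivalently, if you prefer, the Series method to_frame() accepts a name and gives the same result:
df2 = df.isnull().sum().to_frame("Null_Counts")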
Same as the difference you would get from these two variants:
print(pd.DataFrame([0,1,2,3,4,5]))
0
0 0
1 1
2 2
3 3
4 4
5 5
vs
print(pd.DataFrame([0,1,2,3,4,5], columns=["My_Column"]))
My_Column
0 0
1 1
2 2
3 3
4 4
5 5
And if you just don't want the header row to show up in your output, which seems to be the intent of your question, you can iterate over the index values and the count values to build whatever output format you want:
df1 = pd.DataFrame([0,1,2,3,4,5], columns=["My_Column"])
for tpl in zip(df1.index.values, df1["My_Column"].values):
    print("{}\t{}".format(tpl[0], tpl[1]))
Output:
0 0
1 1
2 2
3 3
4 4
5 5
You can also use the DataFrame method to_csv() and pass header=False if you just want to print or save the CSV output somewhere without the header row:
print(df1.to_csv(header=False))
0,0
1,1
2,2
3,3
4,4
5,5
And you can also pass sep="\t" to the to_csv function call if you prefer tab- instead of comma-delimited output.
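For example, with the same df1 as above:
print(df1.to_csv(header=False, sep="\t"))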
Imagine I have a dataframe with these variables and values:
ID  Weight  LR Weight  UR Weight  Age  LS Age  US Age  Height  LS Height  US Height
1   63      50         80         20   18      21      165     160        175
2   75      50         80         22   18      21      172     160        170
3   49      45         80         17   18      21      180     160        180
I want to create the additional following variables:
ID  Flag_Weight  Flag_Age  Flag_Height
1   1            1         1
2   1            0         0
3   1            0         1
These flags indicate whether the main variable values (e.g. Weight, Age and Height) fall between the corresponding lower and upper limits. The limit columns may start with different two-letter prefixes (in this dataframe I gave four examples: LR, UR, LS, US, but in my real dataframe I have more), and their values sometimes differ from ID to ID.
Can you help me create these flags, please?
Thank you in advance.
You can reshape with a temporary MultiIndex:
(df.set_index('ID')
   # split each column name into (prefix, measure), e.g. 'LR Weight' -> ('L', 'Weight')
   .pipe(lambda d: d.set_axis(pd.MultiIndex.from_frame(
       d.columns.str.extract(r'(^[LU]?).*?\s*(\S+)$')),
       axis=1)
    )
   # stack the measure level: one row per (ID, measure) with columns '', 'L', 'U'
   .stack()
   # flag = 1 if the value lies between its lower and upper limits
   .assign(flag=lambda d: d[''].between(d['L'], d['U']).astype(int))
   ['flag'].unstack().add_prefix('Flag_').reset_index()
)
Output:
ID Flag_Age Flag_Height Flag_Weight
0 1 1 1 1
1 2 0 0 1
2 3 0 1 1
So, if I understood correctly, you want to add columns with these new variables. The simplest solution to this would be df.insert().
You could make it something like this:
df.insert(position at which to insert the new column, name of the new column, values of the new column)
You can build the new values in pretty much any way you can imagine: copying a column, simple arithmetic (+, -, *, /), or a whole function that returns the flags based on your conditions as the values of the new column.
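As a rough sketch of that idea, using the column names and values from your example, with between() as one possible way to compute the flag values:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'Weight': [63, 75, 49], 'LR Weight': [50, 50, 45], 'UR Weight': [80, 80, 80],
    'Age': [20, 22, 17], 'LS Age': [18, 18, 18], 'US Age': [21, 21, 21],
    'Height': [165, 172, 180], 'LS Height': [160, 160, 160], 'US Height': [175, 170, 180],
})

# insert Flag_Weight right after the ID column (position 1):
# 1 if Weight lies between its lower and upper limits, else 0
df.insert(1, 'Flag_Weight',
          df['Weight'].between(df['LR Weight'], df['UR Weight']).astype(int))

The same pattern repeats for Age and Height with their own limit columns.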
If the new columns can just be appended, you can even just create a new column like this:
df['new column name'] = any values you want
I hope this helped.
My csv file looks like this.
,timestamp,side,size,price,tickDirection,grossValue,homeNotional,foreignNotional
0,1569974396.557895,1,11668,8319.5,1,140248813.0,11668,1.40248813
1,1569974394.78865,0,5000,8319.0,0,60103377.0,5000,0.60103377
2,1569974392.355395,0,564,8319.0,0,6779660.999999999,564,0.06779661
3,1569974383.797042,0,100,8319.0,0,1202067.0,100,0.01202067
4,1569974382.944569,0,3,8319.0,0,36062.0,3,0.00036062
5,1569974382.944569,0,7412,8319.0,-1,89097247.0,7412,0.89097247
There's a nameless index column. I want to remove this column.
When I read this in pandas, it just interprets it as an index and moves on.
The problem is, when you now use df[::-1], it flips the indexes as well. So df[::-1]['timestamp'][0] is the same as df['timestamp'][0] if the file was read with indexes, but not if it was read without.
How do I make it actually ignore the index column so that df[::-1] doesn't flip my indexes?
I tried usecols in read_csv, but it doesn't matter, it reads the indexes as well as the columns specified. I tried del df[''], but it doesn't work because it doesn't interpret the index column as column '', even though that's what it is.
Just use index_col=0
df = pd.read_csv('data.csv', index_col=0)
print(df)
# Output
timestamp side size price tickDirection grossValue homeNotional foreignNotional
0 1.569974e+09 1 11668 8319.5 1 140248813.0 11668 1.402488
1 1.569974e+09 0 5000 8319.0 0 60103377.0 5000 0.601034
2 1.569974e+09 0 564 8319.0 0 6779661.0 564 0.067797
3 1.569974e+09 0 100 8319.0 0 1202067.0 100 0.012021
4 1.569974e+09 0 3 8319.0 0 36062.0 3 0.000361
5 1.569974e+09 0 7412 8319.0 -1 89097247.0 7412 0.890972
If I understand your issue correctly, you can just set timestamp as your index:
df = df.set_index('timestamp', drop=True)
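And if the underlying goal is just that reversing should not keep the original row labels (an assumption about the intent here), you can drop them after reversing:
df = pd.read_csv('data.csv', index_col=0)
reversed_df = df[::-1].reset_index(drop=True)  # row 0 is now the last row of the file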
I have 3 columns of ID's that I want to combine into a single column like the example below. The goal here is to simply replace all 0's in the main column with the values in either ID1 or ID2 AND maintain the score column to the far right.
Note: the Main ID column also has cases where there is already a value, as shown in row 3; in that case nothing needs to be done. Ultimately I am trying to get a single column as shown in the desired output. I tried an iterative loop, but it was not a pythonic approach.
Data Table
Main ID ID_1 ID_2 SCORE
0 0 121231 212
0 54453 0 199
12123 12123 0 185
343434 0 343434 34
2121 0 0 66
0 0 11 565
Desired output:
MAIN ID SCORE
121231 212
54453 199
12123 185
343434 34
2121 66
11 565
Update: applying the bfill method changed all the 'MAIN_ID' numbers into scientific notation, like 3.43559e+06.
This one works for me. It's simple but functional :D
import pandas as pd

d = {'MAIN ID': [0, 0, 12123, 343434, 2121, 0],
     'ID_1': [0, 54453, 12123, 0, 0, 0],
     'ID_2': [121231, 0, 0, 343434, 0, 11]}
df = pd.DataFrame(data=d)

for i in range(len(df)):
    # use .loc[row, column] so the assignment actually writes back into df
    if df.loc[i, 'MAIN ID'] == 0:
        if df.loc[i, 'ID_1'] != 0:
            df.loc[i, 'MAIN ID'] = df.loc[i, 'ID_1']
        else:
            df.loc[i, 'MAIN ID'] = df.loc[i, 'ID_2']

df = df.drop(['ID_1', 'ID_2'], axis=1)
Try bfill with mask:
out = df.mask(df.eq(0)).bfill(axis=1)[['Main ID']]
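About the update on scientific notation: mask() replaces the zeros with NaN, which turns the column into floats. As long as every row ends up with a value, casting back restores plain integers:
out = df.mask(df.eq(0)).bfill(axis=1)[['Main ID']].astype(int)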
I have a dataframe where every line is ranked on several attributes vs. all the other rows. A single line can have the same rank in 2 attributes (meaning a row can be the best in a few attributes), as shown in rows 2 and 3 below:
att_1 att_2 att_3 att_4
ID
984 5 3 1 46
794 1 1 99 34
6471 20 2 3 2
Per line, I want to keep the index (ID) and the cell with the lowest value - in case there is more than 1 cell, I have to select a random one to keep a normal distribution.
I managed to convert the df into a numpy array and run the following:
idx = np.argmin(h_data.values, axis=1)
But I get the first (left-most) column every time...
Desired output:
ID MIN
984 att_3
794 att_2
6471 att_1
Thank you!
Use list comprehension with numpy.random.choice:
df['MIN'] = [np.random.choice(df.columns[x == x.min()], 1)[0] for x in df.values]
print (df)
att_1 att_2 att_3 att_4 MIN
ID
984 5 3 1 46 att_3
794 1 1 99 34 att_1
6471 20 2 3 2 att_2
If you want to do something for each row (or column), you should try the .apply method:
df.apply(np.argmin, axis=1) #row wise
df.apply(np.argmin, axis=0) #column wise
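Note that in current pandas apply(np.argmin) typically gives integer positions rather than column names; if labels like the ones in the desired output are enough (taking the first column on ties instead of a random one), idxmin returns them directly:
df.idxmin(axis=1)  # label of the first column holding each row's minimum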
I have a dataframe and a pivot table, and I need to replace some values in the dataframe using the pivot table's columns.
Dataframe:
access_code ID cat1 cat2 cat3
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1 2
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 1 2
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 1 2
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 1
g0ifck11dix7avgu 0d254a81dca0ff716753b67a50c41fd7 1 2 3
Pivot Table:
type 1 2 \
access_code ID member_id
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1045794 1023 923 1 122
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 768656 203 243 1 169
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 604095 392 919 1 35
g06q0itlmkqmz5cv f4a3b3f2fca77c443cd4286a4c91eedc 1457307 243 1
g074qx58cmuc1a2f 13f2674f6d5abc888d416ea6049b57b9 5637836 1
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 5732738 111 2343 1
Desired output:
access_code ID cat1 cat2 cat3
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1023 923
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 111 2343
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 392 919
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 1
g0ifck11dix7avgu 0d254a81dca0ff716753b67a50c41fd7 1 2 3
If I use
df.ix[df.cat1 == 1] = pivot_table['1']
It returns the error ValueError: cannot set using a list-like indexer with a different length than the value.
As long as your dataframe is not exceedingly large, you can make it happen in some really ugly ways. I am sure someone else will provide you with a more elegant solution, but in the meantime this duct tape might point you in the right direction.
Keep in mind that in this case I did this with 2 dataframes instead of 1 dataframe and 1 pivot table, as I already had enough trouble formatting the dataframes from the textual data.
As there are empty fields in your data and my dataframes did not like this, first convert the empty fields to zeros.
df = df.replace(r'\s+', 0, regex=True)
Now ensure that your data is actually floats, else the comparisons will fail
df[['cat1', 'cat2', 'cat3']] = df[['cat1', 'cat2', 'cat3']].astype(float)
And for the fizzly fireworks:
# assign with df.loc[row_mask, column] so the values are written back into df
df.loc[df.cat1 == 1, 'cat1'] = piv['1'].loc[df.loc[df.cat1 == 1].index].dropna()
df.cat1 = df.cat1.fillna(1)
df.loc[df.cat2 == 2, 'cat2'] = piv['2'].loc[df.loc[df.cat2 == 2].index].dropna()
df.cat2 = df.cat2.fillna(2)
df = df.replace(0, ' ')
The fillna is just there to recreate your intended output, in which some lines clearly have not been processed yet. I guess this column-by-column NaN-filling will not be needed in your actual use.