How to remove data from a DataFrame permanently - Python

After reading a CSV data file with:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.shape)
I get a DataFrame that is 99 rows long:
(99, 2)
To clean up the DataFrame I apply the dropna() method, which reduces it to 33 rows:
df = df.dropna()
print(df.shape)
which prints:
(33, 2)
Now when I iterate over a column it prints out all 99 rows as if they weren't dropped:
for index, value in df['column1'].items():  # .iteritems() in older pandas; removed in 2.0
    print(index)
which gives me this:
0
1
2
.
.
.
97
98
99
It appears that dropna() simply made the data "hidden", and that hidden data comes back when I iterate over the DataFrame. How do I make sure the dropped data is actually removed from the DataFrame rather than just hidden?

You're being confused by the fact that the row labels have been preserved, so the last row label is still 99.
Example:
In [2]:
import numpy as np
df = pd.DataFrame({'a': [0, 1, np.nan, np.nan, 4]})
df
Out[2]:
     a
0  0.0
1  1.0
2  NaN
3  NaN
4  4.0
After calling dropna the index row labels are preserved:
In [3]:
df = df.dropna()
df
Out[3]:
     a
0  0.0
1  1.0
4  4.0
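A quick way to convince yourself that the rows really are gone and only the labels were preserved:
print(len(df))            # 3 -> the NaN rows were removed
print(df.index.tolist())  # [0, 1, 4] -> the old labels were kept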
If you want to reset the labels so that they are contiguous, call reset_index(drop=True) to assign a new index:
In [4]:
df = df.reset_index(drop=True)
df
Out[4]:
     a
0  0.0
1  1.0
2  4.0

Or you can adjust the parameters instead; note that dropna(inplace=True) modifies df in place and returns None, so do not assign its result:
df.dropna(inplace=True)
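For completeness, a minimal sketch of the in-place variant combined with the index reset (same toy frame as above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, np.nan, np.nan, 4]})
df.dropna(inplace=True)                  # mutates df, returns None
df.reset_index(drop=True, inplace=True)  # renumber the rows
print(df.index)                          # RangeIndex(start=0, stop=3, step=1)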

Related

Stack the columns based on one column, keeping the ids

I have a DataFrame with 100 columns (though I only show three of them here), and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df['id'] = [1,2,3]
df['c1'] = [1,5,1]
df['c2'] = [-1,6,5]
df
I want to stack the values of all the columns for each id into one column. For example, for id=1 I want to stack its values 1 and -1 into one column; the desired frame is the one shown as the answer's output below.
Note: df.melt does not solve my question, since I want to keep the ids as well.
Note 2: I already tried stack and reset_index, and it does not help:
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id", then stack + reset_index:
out = (df.set_index('id').stack()
         .droplevel(1).reset_index(name='c'))
Output:
   id  c
0   1  1
1   1 -1
2   2  5
3   2  6
4   3  1
5   3  5
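As a side note on the asker's first remark: melt does keep the ids if you pass id_vars. A minimal alternative sketch (same df as above; a stable sort keeps the row order within each id):
out = (df.melt(id_vars='id', value_name='c')  # keeps the id column
         .drop(columns='variable')            # drop the old column names
         .sort_values('id', kind='stable', ignore_index=True))
print(out)
which prints the same frame as above.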

DataFrame becomes larger than it should be after a join operation in pandas

I have an Excel file that I read into a dataframe, which I am trying to populate with fields from another Excel file (read into conv), like so:
df = pd.read_excel("file1.xlsx")
df_new = df.join(conv.set_index('id'), on='id', how='inner')
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: "" if x==0 else x) # if id==0, it's the same as nan
df_new = df_new.dropna() # drop nan
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: str(int(x))) # convert id to string
df_new = df_new.drop_duplicates() # drop duplicates, if any
It is clear that df_new should be a subset of df. However, when I run the following code:
len(df[df['id'].isin(df_new['id'].values)]) # should equal len(df_new)
len(df_new)
I get different results (there are 6 more rows in df_new than in df). How can that be? I have checked all the dataframes for duplicates and none of them contain any. Interestingly, the following code does give the expected results:
len(df_new[df_new['id'].isin(df['id'].values)])
len(df_new)
Both of these print the same number.
Edit:
I have also tried the following: others = df[~df['id'].isin(df_new['id'].values)], and checking whether others has the same length as len(df) - len(df_new); but again, the dataframe others has 6 more rows than expected.
The problem comes from your conv dataframe. Assume that the df that comes from file1 is:
   id  PersonalN
   0   1
and conv is:
   id  other_col
   0   'abc'
   0   'def'
After the join you will get:
   id  PersonalN  other_col
   0   1          'abc'
   0   1          'def'
The size of df_new is larger than that of df, and drop_duplicates() or dropna() will not help you reduce the shape of the resulting dataframe.
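A quick, hedged way to confirm this diagnosis on the real data is to test the join key for duplicates before joining (df and conv are the asker's frames):
print(df['id'].duplicated().any())    # False if the key is unique in df
print(conv['id'].duplicated().any())  # True means duplicated ids, which explains the extra rows
# inspect the offending keys
print(conv.loc[conv['id'].duplicated(keep=False), 'id'].unique())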
It's hard to know without the data, but even if there are no duplicates in either of the dataframes, the result of an inner join can be larger than the original dataframe. Consider the following example:
df1 = pd.DataFrame(range(10), columns=["id_"])
df2 = pd.DataFrame({"id_": list(range(10)) + [1] * 3, "something": range(13)})
df2.drop_duplicates(inplace=True)
print(len(df1), len(df2))
==> 10 13
df_new = df1.join(df2.set_index("id_"), on="id_")
len(df_new)
==> 13
print(df_new)
id_ something
0 0 0
1 1 1
1 1 10
1 1 11
1 1 12
2 2 2
...
The reason is, of course, that the ids in the other dataframe are not unique, and a single id in the original dataframe (df1 in my example) is joined to several rows in the other dataframe (df2 in my example, conv in yours).
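If you want pandas to catch this for you, merge accepts a validate argument (DataFrame.join gained one in pandas 1.5) that raises a MergeError when the expected key relationship is violated. Sketched on the frames from this answer:
import pandas as pd

try:
    df_new = df1.merge(df2, on='id_', validate='one_to_one')
except pd.errors.MergeError as e:
    print(e)  # complains that the right merge keys are not unique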

Pandas set index or reindex without changing the order of the data frame

Hello, I have a dataframe that I sorted, so the index is no longer in order. I want to renumber the index so that the sorted values have a sequential index. I have not been able to figure this out: should I remove the index, or is there a way to set it? When I reindex, it sorts by the index again, which undoes my sort.
Solution
I made some dummy data to show this. I hope this answers your question. Leave comments if you have any questions.
import pandas as pd
df = pd.DataFrame({'x': [1,2,3], 'y': [120, 8, 32]})
df = df.reset_index(drop=False).rename(columns={'index': 'ID'})
df = df.sort_values(by='y', ascending=True)
# After Sorting
print(df)
print("-----------------------")
# After Recovering
print(df.reindex(df.ID.to_list()).drop(columns='ID'))
Output:
   ID  x    y
1   1  2    8
2   2  3   32
0   0  1  120
-----------------------
   x    y
1  2    8
2  3   32
0  1  120
df = df.reset_index(drop=True)? – ansev
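The comment above is the most direct answer: reset_index(drop=True) renumbers the rows sequentially without touching their current (sorted) order. A minimal sketch with the same dummy data:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [120, 8, 32]})
df = df.sort_values(by='y')
df = df.reset_index(drop=True)  # sequential index, sorted order kept
print(df)
#    x    y
# 0  2    8
# 1  3   32
# 2  1  120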

Pandas: How to transpose a row to a column?

I have a csv file that I get from a specific piece of software. In the csv file there are 196 rows, and each row has a different number of values. The values are separated by a semicolon.
I want to have all the values of the dataframe in one column. How do I do it?
dftest = pd.read_csv("test.csv", sep=';', header=None)
dftest
0
0 14,0;14,0;13,9;13,9;13,8;14,0;13,9;13,9;13,8;1...
1 14,0;14,0;13,9;14,0;14,0;13,9;14,0;14,0;13,8;1...
2 13,8;13,9;14,0;13,9;13,9;14,6;14,0;14,0;13,9;1...
3 14,5;14,4;14,2;14,1;13,9;14,1;14,1;14,2;14,1;1...
4 14,1;14,0;14,1;14,2;14,0;14,3;13,9;14,2;13,7;1...
5 14,5;14,1;14,1;14,1;14,5;14,1;13,9;14,0;14,1;1...
6 14,1;14,7;14,0;13,9;14,2;13,8;13,8;13,9;14,8;1...
7 14,7;13,9;14,2;14,7;15,0;14,5;14,0;14,3;14,0;1...
8 13,9;13,8;15,1;14,1;13,8;14,3;14,1;14,8;14,0;1...
9 15,0;14,4;14,4;13,7;15,0;13,8;14,1;15,0;15,0;1...
10 14,3;13,8;13,9;14,8;14,3;14,0;14,5;14,1;14,0;1...
11 14,5;15,5;14,0;14,1;14,0;13,8;14,2;14,0;15,9;1...
The output looks like the above; I want to have all the values in one column, like this:
0 14,0
1 14,0
2 13,9
.
.
.
If there is only one column 0 with values split by ;, use Series.str.split with DataFrame.stack:
df = dftest[0].str.split(';', expand=True).stack().reset_index(drop=True)
You can also use numpy's ravel to flatten the values into a 1D array:
df = pd.read_csv("test.csv", sep=';', header=None)
# read_csv pads shorter rows with NaN, so consider .dropna() on the result
df = pd.DataFrame(df.values.ravel(), columns=['Name'])
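One caveat either way: the values use a decimal comma ('14,0'), so they come out as strings. A hedged follow-up sketch that converts the stacked column to floats (building on the split/stack result from the first answer):
s = (dftest[0].str.split(';', expand=True)  # one column per value
              .stack()                      # drops the padding NaNs
              .reset_index(drop=True))
values = s.str.replace(',', '.', regex=False).astype(float)
print(values.head())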

Pandas dataframe how to replace row with one with additional attributes

I have a method that adds additional attributes to a given pandas Series, and I want to update a row in the df with the returned Series.
Let's say I have a simple dataframe:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
a b
0 1 3
1 2 4
and now I want to replace a row with one that has additional attributes; all other rows will show NaN for that column. For example:
subdf = df.loc[1]
subdf["newVal"] = "foo"
# subdf is created externally and returned. Now it must be updated.
df.loc[1] = subdf #or something
df would look like:
   a  b newVal
0  1  3    NaN
1  2  4    foo
Without loss of generality, first reindex the columns and then assign with (i)loc:
df = df.reindex(subdf.index, axis=1)  # align the columns to the Series' index, adding 'newVal'
df.iloc[-1] = subdf
df
   a  b newVal
0  1  3    NaN
1  2  4    foo
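An alternative sketch that avoids positional reindexing: create the new column first, then assign the row by label, since .loc aligns the Series on its index (pd.NA as the fill value is an assumption; np.nan works too):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
subdf = df.loc[1].copy()
subdf['newVal'] = 'foo'

df['newVal'] = pd.NA  # create the column, NA everywhere
df.loc[1] = subdf     # aligned on subdf's index ('a', 'b', 'newVal')
print(df)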
