I have a DataFrame with 100 columns (only three are shown here) and I want to build a new DataFrame with two columns. Here is the DataFrame:
import pandas as pd
df = pd.DataFrame()
df ['id'] = [1,2,3]
df ['c1'] = [1,5,1]
df ['c2'] = [-1,6,5]
df
I want to stack the values of all columns for each id into a single column. For example, for id=1 I want 1 and -1 in one column. The DataFrame that I want has two columns, id and c, with one row per value.
Note: df.melt does not solve my problem, since I also want to keep the ids.
Note 2: I already tried stack and reset_index, and it did not help.
df = df.stack().reset_index()
df.columns = ['id','c']
df
You could first set_index with "id"; then stack + reset_index:
out = (df.set_index('id').stack()
.droplevel(1).reset_index(name='c'))
Output:
id c
0 1 1
1 1 -1
2 2 5
3 2 6
4 3 1
5 3 5
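For reference, here is a self-contained sketch of the approach above on the question's sample data. It also shows that melt can keep the ids when id_vars is passed; the melt variant is my own addition and not part of the answer, and the names stacked and melted are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "c1": [1, 5, 1], "c2": [-1, 6, 5]})

# Approach from the answer: make id the index, stack the remaining columns,
# drop the column-name level and turn the id index back into a column.
stacked = df.set_index("id").stack().droplevel(1).reset_index(name="c")

# Alternative (an assumption, not from the answer): melt with id_vars keeps
# the ids; a stable sort groups the rows by id without reordering within an id.
melted = (
    df.melt(id_vars="id", value_name="c")
      .sort_values("id", kind="mergesort")
      .drop(columns="variable")
      .reset_index(drop=True)
)

print(stacked)
print(melted)
```

With 100 columns nothing changes, since both variants treat every column other than id as a value column.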
I have a DataFrame loaded from an Excel file, which I am trying to populate with fields from another Excel file like so:
df = pd.read_excel("file1.xlsx")
df_new = df.join(conv.set_index('id'), on='id', how='inner')
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: "" if x==0 else x) # if id==0, its same as nan
df_new = df_new.dropna() # drop nan
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: str(int(x))) # convert id to string
df_new = df_new.drop_duplicates() # drop duplicates, if any
It is clear that df_new should be a subset of df; however, when I run the following code:
len(df[df['id'].isin(df_new['id'].values)]) # length of this should be same as len(df_new)
len(df_new)
I get different results (there are 6 more rows in df_new than in df). How can that be? I have checked all dataframes for duplicates and none of them contain any. Interestingly, the following code does give the expected results:
len(df_new[df_new['id'].isin(df['id'].values)])
len(df_new)
Both of these print the same number.
Edit:
I have also tried the following: others = df[~df['id'].isin(df_new['id'].values)], and checked whether others has the same length as len(df) - len(df_new), but again the dataframe others has 6 more rows than expected.
The problem comes from your conv dataframe. Assume that your df that comes from file1 is
id PersonalN
0 1
And conv is
id other_col
0 'abc'
0 'def'
After the join you will get:
id PersonalN other_col
0 1 'abc'
0 1 'def'
The size of df_new is larger than that of df, and drop_duplicates() or dropna() will not reduce the shape of the resulting dataframe.
It's hard to know without the data, but even if there are no duplicate rows in either of the dataframes, the result of an inner join can be larger than the original dataframe. Consider the following example:
df1 = pd.DataFrame(range(10), columns=["id_"])
df2 = pd.DataFrame({"id_": list(range(10)) + [1] * 3, "something": range(13)})
df2.drop_duplicates(inplace = True)
print(len(df1), len(df2))
==> 10 13
df_new = df1.join(df2.set_index("id_"), on = "id_")
len(df_new)
==> 13
print(df_new)
id_ something
0 0 0
1 1 1
1 1 10
1 1 11
1 1 12
2 2 2
...
The reason is of course that the ids of the other dataframe are not unique, and a single id in the original dataframe (df1 in my example) is joined to several rows on the other dataframe (df2 in my example, conv in yours).
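To check whether this is what happens with your data, a sketch along these lines may help. The two small frames are made-up stand-ins for your df and conv, and other_col is a hypothetical column name. It lists the ids that repeat in conv, reproduces the length mismatch from the question, and uses the validate argument of merge, which raises a MergeError when the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical stand-ins for the question's df and conv.
df = pd.DataFrame({"id": [0, 1], "PersonalN": [1, 2]})
conv = pd.DataFrame({"id": [0, 0, 1], "other_col": ["abc", "def", "ghi"]})

# ids that occur more than once in conv; each one multiplies rows in the join.
dup_ids = conv.loc[conv["id"].duplicated(keep=False), "id"].unique().tolist()

joined = df.join(conv.set_index("id"), on="id", how="inner")
# The isin check counts rows of df, so it stays smaller than len(joined).
matched_in_df = len(df[df["id"].isin(joined["id"].values)])

# validate="many_to_one" raises MergeError when the right-hand keys repeat.
try:
    df.merge(conv, on="id", how="inner", validate="many_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

print(dup_ids, len(joined), matched_in_df, keys_unique)
```

Note that drop_duplicates() without arguments only removes rows that are identical in every column, which is why conv can pass a duplicate check and still contain repeated ids. conv.drop_duplicates(subset='id') would keep one row per id, if discarding the others is acceptable for your data.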
Hello, I have a dataframe that I sorted, so the index is no longer in order. I want to reorder the index so that the sorted values have a sequential index, but I have not been able to figure this out. Should I remove the index, or is there a way to set it? When I reindex, it sorts by the index, which undoes my sort.
Solution
I made some dummy data to show this. I hope this answers your question. Leave comments if you have any questions.
import pandas as pd
df = pd.DataFrame({'x': [1,2,3], 'y': [120, 8, 32]})
df = df.reset_index(drop=False).rename(columns={'index': 'ID'})
df = df.sort_values(by='y', ascending=True)
# After Sorting
print(df)
print("-----------------------")
# After Recovering
print(df.reindex(df.ID.to_list()).drop(columns='ID'))
Output:
ID x y
1 1 2 8
2 2 3 32
0 0 1 120
-----------------------
x y
1 2 8
2 3 32
0 1 120
df = df.reset_index(drop=True)? – ansev
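The suggestion in the comment can be sketched as follows, reusing the dummy data from the answer. reset_index(drop=True) discards the old index and numbers the rows 0, 1, 2, ... in their current (sorted) order. Recent pandas versions also accept ignore_index=True in sort_values, which should give the same result in one step.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [120, 8, 32]})

# Sort, then replace the shuffled index with a fresh sequential one.
sorted_df = df.sort_values(by="y").reset_index(drop=True)

# Same idea in one call (needs a pandas version that supports ignore_index).
sorted_df2 = df.sort_values(by="y", ignore_index=True)

print(sorted_df)
```

Without drop=True, the old index would be kept as an extra column named index.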
I have a csv file that I get from a specific software. The csv file has 196 rows, and each row has a different number of values. The values are separated by semicolons.
I want to have all values of the dataframe in one column, how to do it?
dftest = pd.read_csv("test.csv", sep=';', header=None)
dftest
0
0 14,0;14,0;13,9;13,9;13,8;14,0;13,9;13,9;13,8;1...
1 14,0;14,0;13,9;14,0;14,0;13,9;14,0;14,0;13,8;1...
2 13,8;13,9;14,0;13,9;13,9;14,6;14,0;14,0;13,9;1...
3 14,5;14,4;14,2;14,1;13,9;14,1;14,1;14,2;14,1;1...
4 14,1;14,0;14,1;14,2;14,0;14,3;13,9;14,2;13,7;1...
5 14,5;14,1;14,1;14,1;14,5;14,1;13,9;14,0;14,1;1...
6 14,1;14,7;14,0;13,9;14,2;13,8;13,8;13,9;14,8;1...
7 14,7;13,9;14,2;14,7;15,0;14,5;14,0;14,3;14,0;1...
8 13,9;13,8;15,1;14,1;13,8;14,3;14,1;14,8;14,0;1...
9 15,0;14,4;14,4;13,7;15,0;13,8;14,1;15,0;15,0;1...
10 14,3;13,8;13,9;14,8;14,3;14,0;14,5;14,1;14,0;1...
11 14,5;15,5;14,0;14,1;14,0;13,8;14,2;14,0;15,9;1...
That is the current output. I would like all values in one column, like this:
0 14,0
1 14,0
2 13,9
.
.
.
If there is only one column 0 with values split by ;, use Series.str.split with DataFrame.stack:
df = dftest[0].str.split(';', expand=True).stack().reset_index(drop=True)
You can also use NumPy's ravel to flatten the values into a 1D array:
df = pd.read_csv("test.csv", sep=';', header=None)
df = pd.DataFrame(df.values.ravel(), columns=['Name'])
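Here is a small runnable sketch of the same idea on made-up data with rows of different lengths. It uses Series.explode instead of stack (a swapped-in variant, not the code from the answers above), which avoids having to deal with the padding that expand=True adds for shorter rows. The conversion from decimal commas to floats is an extra step that assumes you want numbers rather than strings.

```python
import pandas as pd

# Made-up rows standing in for the single column 0 shown in the question.
dftest = pd.DataFrame({0: ["14,0;14,0;13,9", "14,5;14,4", "13,8"]})

# Split each row into a list, then give every list element its own row.
values = dftest[0].str.split(";").explode().reset_index(drop=True)

# Optional: turn the decimal commas into dots and convert to float.
numeric = values.str.replace(",", ".", regex=False).astype(float)

print(values)
print(numeric)
```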
I have a method that adds additional attributes to a given pandas series and I want to update a row in the df with the returned series.
Let's say I have a simple dataframe:
df = pd.DataFrame({'a':[1, 2], 'b':[3, 4]})
a b
0 1 3
1 2 4
and now I want to replace a row with one that has additional attributes; all other rows should show NaN for the new column. For example:
subdf = df.loc[1]
subdf["newVal"] = "foo"
# subdf is created externally and returned. Now it must be updated.
df.loc[1] = subdf #or something
df would look like:
a b newVal
0 1 3 NaN
1 2 4 foo
Without loss of generality, first reindex and then assign with (i)loc:
df = df.reindex(subdf.index, axis=1)
df.iloc[-1] = subdf
df
a b newVal
0 1 3 NaN
1 2 4 foo
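A slightly more defensive variant of the same idea is sketched below. It differs from the snippet above in three ways, all of which are my own assumptions about what you need: it takes a copy of the row before modifying it, it appends only the missing columns instead of reindexing to subdf.index (so existing columns are never dropped), and it writes back to the row label stored in subdf.name rather than to the last row. The new columns are cast to object first so that a string like "foo" fits without a dtype conflict.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Work on a copy of the row so the original frame is not touched by accident.
subdf = df.loc[1].copy()
subdf["newVal"] = "foo"

# Add any columns the returned Series has but the frame lacks (filled with NaN).
new_cols = [c for c in subdf.index if c not in df.columns]
df = df.reindex(columns=list(df.columns) + new_cols)
# Object dtype lets the new columns hold arbitrary values such as strings.
df = df.astype({c: "object" for c in new_cols})

# subdf.name is the label of the row the Series was taken from (1 here).
df.loc[subdf.name] = subdf

print(df)
```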