I removed some duplicate rows (comparing all but the last column) with the following commands:
columns = XY.columns[:-1].tolist()
XY1 = XY.drop_duplicates(subset=columns, keep='first')
The result is below:
Combined Series shape : (100, 4)
Combined Series: 1 222 223 0
0 0 0 0 1998.850000
1 0 0 0 0.947361
2 0 0 0 0.947361
3 0 0 0 0.947361
4 0 0 0 0.947361
Now the columns are labelled 1, 222, 223, 0 (the 0 label at the end comes from a concat with another df). I want the columns to be
relabelled from 0 onwards. How do I do it?
So first create a dictionary with the mapping you want:
import numpy as np
trafo_dict = {x: y for x, y in zip([1, 222, 223, 0], np.linspace(0, 3, 4))}
Then you need to rename columns. This can be done with pd.DataFrame.rename:
XY1 = XY1.rename(columns=trafo_dict)
Edit: If you want it in a more general fashion use:
np.linspace(0,XY1.shape[1]-1,XY1.shape[1])
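Putting the general version together (a minimal sketch; the frame here is a hypothetical stand-in for XY1, and note that np.linspace yields float labels 0.0, 1.0, ..., whereas range(XY1.shape[1]) would give plain integers):
import numpy as np
import pandas as pd

# hypothetical stand-in for the deduplicated frame
XY1 = pd.DataFrame([[0, 0, 0, 1998.85]], columns=[1, 222, 223, 0])

# map each existing label to its positional index
trafo_dict = dict(zip(XY1.columns, np.linspace(0, XY1.shape[1] - 1, XY1.shape[1])))
XY1 = XY1.rename(columns=trafo_dict)
print(XY1.columns.tolist())  # [0.0, 1.0, 2.0, 3.0]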
I am working on an NLP problem and ended up with a big features dataset:
dfMethod
Out[2]:
c0000167 c0000294 c0000545 ... c4721555 c4759703 c4759772
0 0 0 0 ... 0 0 0
1 0 0 0 ... 0 0 0
2 0 0 0 ... 0 0 0
3 0 0 0 ... 0 0 0
4 0 0 0 ... 0 0 0
... ... ... ... ... ... ...
3995 0 0 0 ... 0 0 0
3996 0 0 0 ... 0 0 0
3997 0 0 0 ... 0 0 0
3998 0 0 0 ... 0 0 0
3999 0 0 0 ... 0 0 0
[4000 rows x 14317 columns]
I want to remove the columns with the fewest occurrences (i.e. the columns with the smallest sum over all records),
so if my column sums looked like this:
Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628
In the end, I want to keep only the top 5000 columns based on the sum of each column.
How can I do that?
Let's say you have a big dataframe big_df. You can get the top N columns with the following:
N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]
Breaking this down:
big_df.sum() # Gives the sums you mentioned
.sort_values(ascending=False) # Sort the sums in descending order
.index # because .sum() defaults to axis=0, the index is your columns
[:N] # grab first N items
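As a quick sanity check on a small mock frame (the names and data here are hypothetical):
import pandas as pd

big_df = pd.DataFrame({'a': [1, 1, 1], 'b': [0, 1, 0], 'c': [1, 1, 0]})
N = 2
top = big_df[big_df.sum().sort_values(ascending=False).index[:N]]
print(top.columns.tolist())  # ['a', 'c']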
Edited after author comment.
Let's consider df a pandas DataFrame. Prepare the filter by selecting the 5000 columns with the largest sums:
df_sum = df.sum()  # avoid repeating df.sum() on the next line
# sort (column, sum) pairs by sum, descending; this keeps the top 5000 columns,
# not the columns whose sum exceeds 5000
co = sorted(zip(df_sum.index, df_sum.values), key=lambda row: row[1], reverse=True)[:5000]
co = [row[0] for row in co]  # keep only the column names of interest
Then filter the columns in co:
df = df.filter(items = co)
df
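A more compact equivalent of the same filter, using Series.nlargest (a sketch, same df as above):
df = df[df.sum().nlargest(5000).index]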
I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the fifth character of the column name (which is the numeric part), so the output should be:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However, the same pattern works when I use the full column names, so I am not sure which part is wrong:
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can actually just multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
You could use string multiplication on a boolean mask to place the strings where the condition holds, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))
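And for completeness, the OP's replace approach does work once the replacement mapping is keyed by the column names rather than by the extracted digits (a sketch using the df defined above; with a dict as the value, replace substitutes a per-column value):
repl = {c: int(c[5]) for c in df.columns}
print(df.replace(1, repl))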
I have created a DataFrame full of zeros such as:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
...
n 0 0 0
I have a list containing names for the columns in unicode, such as:
list = [u'One', u'Two', u'Three']
The DataFrame of zeroes is known as a, and I am creating a new complete DataFrame with the list as column headers via:
final = pd.DataFrame(a, columns=[list])
However, the resulting DataFrame has column names that are no longer unicode (i.e. they do not show the u'' tag).
I am wondering why this is happening. Thanks!
There is no reason for the unicode to be lost; you can check it with:
print(df.columns.tolist())
Please never use names of built-ins like list, type, id... as variable names, because doing so masks the built-in functions. It is also necessary to add .values to convert the data to a numpy array:
a = pd.DataFrame(0, columns=range(3), index=range(3))
print (a)
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(a.values, columns=L)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
Without .values, the columns are not aligned and you get all NaNs:
final = pd.DataFrame(a, columns=L)
print (final)
One Two Three
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
I think the simplest approach is to use only the index of a, if all values are 0:
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(0, columns=L, index=a.index)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
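To confirm nothing was lost, check the labels directly (under Python 2, where the u'' prefix appears in the list repr):
print(final.columns.tolist())
# [u'One', u'Two', u'Three']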
I have a Pandas series of 10000 rows, each populated with a single letter from A to Z.
However, I want to create dummy columns for only A, B, and C, using Pandas get_dummies.
How do I go about doing that?
I don't want to get dummies for all the row values in the column and then select the specific columns, as the column contains other redundant data which eventually causes a Memory Error.
Try this:
import pandas as pd

# create mock dataframe
df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})
# use replace with a regex to set characters outside a-c to None
pd.get_dummies(df.replace({'[^a-c]': None}, regex=True))
output:
alpha_a alpha_b alpha_c
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 0
6 0 0 0
7 0 0 0
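An alternative sketch that avoids the replace pass: cast the column to a Categorical whose categories are only the wanted letters, so get_dummies emits exactly those columns (same mock df as above):
import pandas as pd

df = pd.DataFrame({'alpha': ['a', 'a', 'b', 'b', 'c', 'e', 'f', 'g']})
# values outside the declared categories become NaN and are ignored by get_dummies
cat = pd.Categorical(df['alpha'], categories=['a', 'b', 'c'])
print(pd.get_dummies(cat))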
I have a N x 3 DataFrame called A that looks like this:
_Segment _Article Binaire
0 550 5568226 1
1 550 5612047 1
2 550 5909228 1
3 550 5924375 1
4 550 5924456 1
5 550 6096557 1
....
The variable _Article is uniquely defined in A (there are N unique values of _Article in A).
I do a pivot:
B=A.pivot(index='_Segment', columns='_Article')
then replace the missing NaN values with zeros:
B[np.isnan(B)] = 0
and get:
Binaire \
_Article 2332299 2332329 2332337 2932377 2968223 3195643 3346080
_Segment
550 0 0 0 0 0 0 0
551 0 0 0 0 0 0 0
552 0 0 0 0 0 0 0
553 1 1 1 0 0 0 1
554 0 0 0 1 0 1 0
where columns were sorted lexicographically during the pivot.
My question is: how do I retain the sort order of _Article in A in the columns of B?
Thanks!
I think I got it. This works:
First, store the column _Article:
order_art = A['_Article']
In the pivot, add the values argument to avoid hierarchical columns (see http://pandas.pydata.org/pandas-docs/stable/reshaping.html), which would prevent reindex from working properly:
B = A.pivot(index='_Segment', columns='_Article', values='Binaire')
Then, as before, replace the NaNs with zeros:
B[np.isnan(B)] = 0
and finally use reindex to restore the original order of variable _Article across columns:
B = B.reindex(columns=order_art)
Are there more elegant solutions?
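For what it's worth, the whole recipe can be folded into one chain (a sketch with a small hypothetical stand-in for A; fillna replaces the np.isnan step):
import pandas as pd

A = pd.DataFrame({'_Segment': [550, 550, 551],
                  '_Article': [5612047, 5568226, 5909228],
                  'Binaire': [1, 1, 1]})

order_art = A['_Article']
B = (A.pivot(index='_Segment', columns='_Article', values='Binaire')
       .fillna(0)
       .reindex(columns=order_art))
print(B.columns.tolist())  # [5612047, 5568226, 5909228]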