How is index modified by dataframe.sort_values() - python

I have a DataFrame that I want to sort on one of its columns (a date).
However, I am running a loop over the index (while i < df.shape[0]), and I need the loop to walk the DataFrame in date order once it is sorted.
Is the current index updated to match the sort, or should I use df.reset_index()?

Maybe I'm not understanding the question, but a simple check shows that sort_values moves the index labels along with the rows; it does not renumber them:
df = pd.DataFrame({'x':['a','c','b'], 'y':[1,3,2]})
df = df.sort_values(by = 'x')
Yields:
   x  y
0  a  1
2  b  2
1  c  3
And a subsequent:
df = df.reset_index(drop = True)
Yields:
   x  y
0  a  1
1  b  2
2  c  3
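For the loop in the question, the practical consequence can be sketched as follows. This is a minimal illustration, not the asker's actual loop: the variable names are made up for the example, and ignore_index=True assumes pandas 1.0 or later.

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'c', 'b'], 'y': [1, 3, 2]})
sorted_df = df.sort_values(by='x')

# The original labels travel with their rows.
labels = sorted_df.index.tolist()

# Positional access (iloc) follows the sorted order...
by_position = [sorted_df.iloc[i]['x'] for i in range(len(sorted_df))]

# ...while label access (loc) still follows the old labels.
by_label = [sorted_df.loc[i, 'x'] for i in range(len(sorted_df))]

# Sorting and renumbering in one step (assumes pandas >= 1.0).
renumbered = df.sort_values(by='x', ignore_index=True)
```

So a counter-based loop either needs iloc, or a renumbered index via reset_index(drop=True) or ignore_index=True.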

Duplicating Pandas Dataframe rows based on string split, without iteration

I have a dataframe with a multiindex, where one of the index levels holds multiple values separated by a "|", like this:
            value
left right
x    a|b        2
y    b|c|d     -1
I want to duplicate the rows based on the "right" level, to get something like this:
            value
left right
x    a          2
x    b          2
y    b         -1
y    c         -1
y    d         -1
The solution I have to this feels wrong and runs slow, because it's based on iteration:
df2 = df.iloc[:0]
for index, row in df.iterrows():
    stgs = index[1].split("|")
    for s in stgs:
        row.name = (index[0], s)
        df2 = df2.append(row)
Is there a more vectorized way to do this?
pandas Series have a dedicated string method, str.split, for this operation.
str.split works only on a Series, so isolate the column you want:
SO = df['right']
Now three steps at once: str.split returns a Series of lists, apply(pd.Series, 1) expands each list into columns, and stack stacks those columns into a single column. (The example splits on ','; for the data in the question, split on '|' instead.)
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
The only issue is that you now have a MultiIndex, so drop the level you don't need:
S1.index = S1.index.droplevel(-1)
Full example
SO = pd.Series(data=["a,b", "b,c,d"])
S1 = SO.str.split(',').apply(pd.Series, 1).stack()
S1
Out[4]:
0  0    a
   1    b
1  0    b
   1    c
   2    d
S1.index = S1.index.droplevel(-1)
S1
Out[5]:
0    a
0    b
1    b
1    c
1    d
Building on xNoK's answer, I am adding the extra step needed to put the result back into the original DataFrame.
We have this data:
arrays = [['x', 'y'], ['a|b', 'b|c|d']]
midx = pd.MultiIndex.from_arrays(arrays, names=['left', 'right'])
df = pd.DataFrame(index=midx, data=[2, -1], columns=['value'])
df
Out[17]:
            value
left right
x    a|b        2
y    b|c|d     -1
First, let's generate the values for the right index level, as xNoK suggested. Take the index level we want to work on with index.levels[1], convert it to a Series so that we can call str.split(), and finally stack() it to get the result we want.
new_multi_idx_val = df.index.levels[1].to_series().str.split('|').apply(pd.Series).stack()
new_multi_idx_val
Out[18]:
right
a|b    0    a
       1    b
b|c|d  0    b
       1    c
       2    d
dtype: object
Now we want to put these values into the original DataFrame df. To do that, we first reshape df so that the result from the previous step can be copied in.
To do so, repeat each row (including its index) once per value in the right level of the MultiIndex. df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)) gives the number of times each row should be repeated. We pass this to index.repeat() and fetch the rows at those index values to create a new DataFrame, df_repeted.
df_repeted = df.loc[df.index.repeat(df.index.levels[1].to_series().str.split('|').apply(lambda x: len(x)))]
df_repeted
Out[19]:
            value
left right
x    a|b        2
     a|b        2
y    b|c|d     -1
     b|c|d     -1
     b|c|d     -1
Now df_repeted DataFrame is in a shape where we could change the index to get the answer we want.
Replace the index of df_repeted with desired values as following:
df_repeted.index = [df_repeted.index.droplevel(1), new_multi_idx_val]
df_repeted.index.rename(names=['left', 'right'], inplace=True)
df_repeted
Out[20]:
            value
left right
x    a          2
     b          2
y    b         -1
     c         -1
     d         -1
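On newer pandas versions the same reshaping can be sketched more compactly. This is an alternative under stated assumptions, not what either answer above proposes: it assumes pandas 0.25 or later, where DataFrame.explode is available. It also avoids relying on index.levels, which stores the unique values of a level rather than one entry per row.

```python
import pandas as pd

midx = pd.MultiIndex.from_arrays(
    [['x', 'y'], ['a|b', 'b|c|d']], names=['left', 'right']
)
df = pd.DataFrame({'value': [2, -1]}, index=midx)

out = (
    df.reset_index()                                        # index levels become columns
      .assign(right=lambda d: d['right'].str.split('|'))    # each cell becomes a list
      .explode('right')                                     # one row per list element
      .set_index(['left', 'right'])                         # rebuild the MultiIndex
)
```

Each row of the original frame is repeated once per piece of its "right" label, and the value column is carried along unchanged.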

How to process column names and create new columns

This is my pandas DataFrame with original column names.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
1              3           0               0
2              1           1               5
Firstly I want to extract all unique variations of cm, e.g. in this case cm1 and cm2.
After this I want to create a new column per each unique cm. In this example there should be 2 new columns.
Finally in each new column I should store the total count of non-zero original column values, i.e.
old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
1              3           0               0       2    0
2              1           1               5       2    1
I implemented the first step as follows:
cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns
How do I proceed with steps 2 and 3 so that the solution is automatic? I don't want to define the column names cm1 and cm2 manually, because the original data set might have many cm variations.
You can use:
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
0              1           3               0       0
1              2           1               1       5
First, filter the columns whose names contain the string cm, so columns without cm are removed.
df1 = df.filter(regex='cm')
Now you can rename the columns to new values like cm1, cm2, cm3.
print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']
df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
   cm1  cm1  cm2
0    1    3    0
1    2    1    1
Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum it, since True is converted to 1 and False to 0. You need counts per unique column name, so group by the columns and sum the values.
df1 = df1.astype(bool)
print(df1)
    cm1   cm1    cm2
0  True  True  False
1  True  True   True
print(df1.groupby(df1.columns, axis=1).sum())
   cm1  cm2
0    2    0
1    2    1
You need the unique column names, which are added to the original df:
print(df1.columns.unique())
['cm1' 'cm2']
Lastly, add the new columns df[['cm1','cm2']] from the groupby result:
df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1
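The three steps can also be sketched as one self-contained Python 3 script. This is a variation on the approach above rather than the answerer's exact code: it assumes every cm label has the form "cm" followed by digits, and it groups on the transposed frame instead of using groupby(..., axis=1). The names cm_cols, labels and counts are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'old_gt': [0, 5],
})

# Step 1: keep only the columns whose name contains "cm".
cm_cols = df.filter(regex='cm')

# Step 2: pull the "cmN" token out of each column name (assumed pattern: cm + digits).
labels = cm_cols.columns.str.extract(r'(cm\d+)', expand=False)

# Step 3: count non-zero cells per label, row by row.
# Transposing lets groupby work on rows, then we transpose back.
counts = cm_cols.ne(0).T.groupby(labels.to_numpy()).sum().T

df = df.join(counts)
```

Nothing here names cm1 or cm2 explicitly, so additional cm variations in the column names produce additional count columns.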
Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:
col_map = {c:'cm'+c[c.index('cm') + len('cm')] for c in ind}
# ^ if you are hard coding this in you might as well use 2
so that instead of the string after cm it is cm and the character directly following, in this case it would be:
{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
Then add the new columns to the DataFrame by iterating over the dict:
for col,new_col in col_map.items():
    if new_col not in df:
        df[new_col] =[int(a!=0) for a in df[col]]
    else:
        df[new_col]+=[int(a!=0) for a in df[col]]
Note that int(a!=0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is that the dict's order is not guaranteed to match the order you want for the new columns, so it may be preferable to add them sorted by their target names (like the answer here):
import operator
for col,new_col in sorted(col_map.items(),key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col]+=[int(a!=0) for a in df[col]]
    else:
        df[new_col] =[int(a!=0) for a in df[col]]
This ensures the new columns are inserted in order.

Drop rows if value in a specific column is not an integer in pandas dataframe

If I have a dataframe and want to drop any rows where the value in one column is not an integer how would I do this?
The alternative is to drop rows if the value is not within the range 0-2, but since I am not sure how to do either of them I was hoping someone else might.
Here is what I tried, but it didn't work and I'm not sure why:
df = df[(df['entrytype'] != 0) | (df['entrytype'] !=1) | (df['entrytype'] != 2)].all(1)
There are 2 approaches I propose:
In [212]:
df = pd.DataFrame({'entrytype':[0,1,np.nan, 'asdas',2]})
df
Out[212]:
  entrytype
0         0
1         1
2       NaN
3     asdas
4         2
If the range of values is as restricted as you say then using isin will be the fastest method:
In [216]:
df[df['entrytype'].isin([0,1,2])]
Out[216]:
  entrytype
0         0
1         1
4         2
Otherwise we could cast to a str and then call .isdigit()
In [215]:
df[df['entrytype'].apply(lambda x: str(x).isdigit())]
Out[215]:
  entrytype
0         0
1         1
4         2
Note that str("-1").isdigit() is False.
str("-1").lstrip("-").isdigit() works but is not nice.
A regular expression can allow an optional sign:
df.loc[df['Feature'].str.match(r'^[+-]?\d+$')]
For your question, the reverse set (the rows that are not integers) is:
df.loc[~(df['Feature'].str.match(r'^[+-]?\d+$'))]
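Another way to express the integer check, sketched here as an alternative that none of the answers above use, is to coerce the column to numbers and keep the rows whose value has no fractional part. The sample data extends the earlier example with -1 and 1.5, and the names nums, is_int and result are made up for the illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'entrytype': [0, 1, np.nan, 'asdas', 2, -1, 1.5]})

# Anything that cannot be parsed as a number becomes NaN.
nums = pd.to_numeric(df['entrytype'], errors='coerce')

# Keep rows that parsed successfully and have no fractional part.
is_int = nums.notna() & (nums % 1 == 0)
result = df[is_int]
```

Unlike the isdigit() approach, this keeps negative integers; whether a float such as 2.0 should count as an integer is a judgment call this sketch answers with yes.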
We have multiple ways to do the same, but I found this method easy and efficient.
Quick Examples
# Using drop() to delete rows based on a column value
df.drop(df[df['Fee'] >= 24000].index, inplace = True)
# Keep only the rows that match a condition
df2 = df[df.Fee >= 24000]
# If the column name contains a space,
# specify the column name within quotes
df2 = df[df['column name']]
# Using loc
df2 = df.loc[df["Fee"] >= 24000 ]
# Select rows based on multiple column values
df2 = df[ (df['Fee'] >= 22000) & (df['Discount'] == 2300)]
# Drop rows with None/NaN
df2 = df[df.Discount.notnull()]

add columns different length pandas

I have a problem with adding columns in pandas.
I have a DataFrame of dimension n x k. During processing I will need to add columns of dimension m x 1, where m is in [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
the result is:
AssertionError: Length of values does not match length of index
Can I add columns of a different length?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the original poster.
To keep the column names, use pandas.concat but don't pass ignore_index (its default value is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age':[10, 12, 13],
    'Gender':['M','F','F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
#    Age Gender      Name
# 0   10      M    Nate A
# 1   12      F  Jessie A
# 2   13      F  Daniel H
# 3  NaN    NaN    John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
   b
0  0
1  1
2  2
3  3
Out[38]:
   a
0  0
1  1
2  2
3  3
4  4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
   0    1
0  0    0
1  1    1
2  2    2
3  3    3
4  4  NaN
We can add lists of different sizes to a DataFrame by padding them to a common length.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of every list:
la,lb,lc = len(a),len(b),len(c)
# now find the max
max_len = max(la,lb,lc)
Resize every list to the determined max length (not needed for b in this example):
if not max_len == la:
    a.extend(['']*(max_len-la))
if not max_len == lb:
    b.extend(['']*(max_len-lb))
if not max_len == lc:
    c.extend(['']*(max_len-lc))
Now all the lists have the same length, so create the DataFrame:
pd.DataFrame({'A':a,'B':b,'C':c})
The final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
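The manual padding can also be left to pandas. As a minimal sketch (a variation on the answer above, not its author's code), wrapping each list in a Series lets the DataFrame constructor align them on their positions and fill the gaps with NaN instead of empty strings:

```python
import pandas as pd

a = [0, 1, 2, 3]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [0, 1]

# Each Series gets a default 0..len-1 index; the constructor takes the
# union of those indexes and fills missing positions with NaN.
df = pd.DataFrame({'A': pd.Series(a), 'B': pd.Series(b), 'C': pd.Series(c)})
```

One trade-off worth noting: columns that receive NaN are stored as floats, whereas padding with '' keeps the values but turns the column into generic objects.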
I had the same issue, two different dataframes and without a common column. I just needed to put them beside each other in a csv file.
Merge:
In this case, "merge" does not work, even when adding a temporary column to both dfs and then dropping it, because this method forces both dfs to the same length: it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
If somebody would like to replace a specific column with one of a different size instead of adding it:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values)) # map each position to its value
    dataFrame_asDict = dataframe.to_dict() # get the DataFrame as a dict of column dicts
    dataFrame_asDict[column] = dict_from_list # assign the specific column
    return pd.DataFrame.from_dict(dataFrame_asDict, orient='index').T # build a new DataFrame from the dict and return it

Create new columns and fill with calculated values from same dataframe

Here is a simplified example of my df:
ds = pd.DataFrame(np.abs(np.random.randn(3, 4)), index=[1,2,3], columns=['A','B','C','D'])
ds['sum'] = ds.sum(axis=1)
which looks like
          A         B         C         D       sum
1  0.095389  0.556978  1.646888  1.959295  4.258550
2  1.076190  2.668270  0.825116  1.477040  6.046616
3  0.245034  1.066285  0.967124  0.791606  3.070049
I would like to create 4 new columns and calculate the percentage value from the total (sum) in every row. So first value in the first new column should be (0.095389/4.258550), first value in the second new column (0.556978/4.258550)...and so on.
You can do this easily manually for each column like this:
ds['A_perc'] = ds['A']/ds['sum']
If you want to do this in one step for all columns, you can use the div method (http://pandas.pydata.org/pandas-docs/stable/basics.html#matching-broadcasting-behavior):
ds.div(ds['sum'], axis=0)
And if you want this in one step added to the same dataframe:
>>> ds.join(ds.div(ds['sum'], axis=0), rsuffix='_perc')
          A         B         C         D       sum    A_perc    B_perc  \
1  0.151722  0.935917  1.033526  0.941962  3.063127  0.049532  0.305543
2  0.033761  1.087302  1.110695  1.401260  3.633017  0.009293  0.299283
3  0.761368  0.484268  0.026837  1.276130  2.548603  0.298739  0.190013
     C_perc    D_perc  sum_perc
1  0.337409  0.307517         1
2  0.305722  0.385701         1
3  0.010530  0.500718         1
In [56]: df = pd.DataFrame(np.abs(np.random.randn(3, 4)), index=[1,2,3], columns=['A','B','C','D'])
In [57]: df.divide(df.sum(axis=1), axis=0)
Out[57]:
          A         B         C         D
1  0.319124  0.296653  0.138206  0.246017
2  0.376994  0.326481  0.230464  0.066062
3  0.036134  0.192954  0.430341  0.340571
You can convert the sum column into a NumPy column array and broadcast the division:
new_df = df / df[['sum']].values # note the double-brackets around 'sum'
To add the percentages as new columns:
df[df.columns.drop('sum') + '_perc'] = df.drop(columns='sum') / df[['sum']].values
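Putting the pieces together, here is a small self-contained sketch of the div-based approach. The seeded generator and the names totals, perc and result are only there to make the example repeatable and are not part of the original answers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
ds = pd.DataFrame(
    np.abs(rng.standard_normal((3, 4))),
    index=[1, 2, 3],
    columns=['A', 'B', 'C', 'D'],
)

totals = ds.sum(axis=1)

# Divide every column by the row total (axis=0 aligns totals with the rows),
# then rename A -> A_perc, B -> B_perc, and so on.
perc = ds.div(totals, axis=0).add_suffix('_perc')

# Original columns, the sum, and the new fraction columns side by side.
result = ds.assign(sum=totals).join(perc)
```

Computing the fractions before the sum column is attached avoids the redundant sum_perc column that appears in the join example above.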
