Custom ordering of mixed integer and strings in dataframe index - python

I have a pandas DataFrame that I need to reindex in a specific fashion. There are several numbered indices, but the last one is a string. Without the string, the index sorts in numerical order, 1-20, just fine.
However, as soon as I include the string index, the order switches to alphanumeric (1, 11, 12... 18, 19, 2, 20, 3, 4, etc.). Is there any way I can keep the list in proper numerical order and then add the string index on the end without changing how the list is organized?
[EDIT]:
I realized a shortcoming on my own part: I should have mentioned that the dataframe is converted to an HTML-safe table (DataTable) after construction and displayed on a web page. It is possible this is causing the issue I am having, though any insights into the matter are welcome.
An example of the kind of data frame I am looking at:
Column 1
0 Value 1
1 Value 2
2 Value 3
3 Value 4
...
18 Value 19
19 Value 20
string Value 21

Something along these lines should work:
new_index = list(df.index)
new_index[-1] = 'string'
df.index = new_index
For example:
df = pd.DataFrame(np.random.random(5))
>>> df
0
0 0.665922
1 0.591298
2 0.274722
3 0.561243
4 0.382927
new_index = list(df.index)
new_index[-1] = 'string'
df.index = new_index
returns a re-indexed df:
>>> df
0
0 0.665922
1 0.591298
2 0.274722
3 0.561243
string 0.382927

Here is one way. You can separate out numeric and non-numeric indices and sort them independently.
df = pd.DataFrame({1: ['Val1', 'Val2', 'Val3', 'Val4', 'Val5']},
                  index=['0', '1', '11', '2', 'string'])
order1 = sorted((x for x in df.index if x.isdigit()), key=lambda i: int(i))
order2 = sorted(x for x in df.index if not x.isdigit())
df = df.loc[order1 + order2]
# 1
# 0 Val1
# 1 Val2
# 2 Val3
# 11 Val4
# string Val5
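The same grouping can also be expressed as a single sort key (a sketch, assuming the index values are all strings, as above):
# digit-strings sort first, by numeric value; everything else sorts after, alphabetically
order = sorted(df.index, key=lambda x: (0, int(x)) if x.isdigit() else (1, x))
df = df.loc[order]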

Related

Create a column based on the aggregation of values from multiple columns at multiple row indexes

I'm trying to translate a technical analysis operator from a proprietary language to Python using dataframes, but I got stuck on a problem that seems rather simple yet I can't manage to solve it the pandas way. To simplify the problem, take this example dataframe:
d = {'value1': [0,1,2,3], 'value2': [4,5,6,7]}
df = pd.DataFrame(data=d)
which results in a dataframe with the columns value1 and value2. What I want to achieve is an additional result column, which in pseudocode I would compute the following way:
value1 = [0, 1, 2, 3]
value2 = [4, 5, 6, 7]
result = []
for i in range(len(value1)):
    calculation = value1[i] * value2[i]
    lookback = value1[i]
    for j in range(lookback):
        calculation -= value2[j]
    result.append(calculation)
How would I tackle this in a dataframe context? The only similar approach I found in the documentation is https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html but there is no mention of interacting with or manipulating the series contained in the columns/rows.
df['result'] = df.value1 * df.value2 - (df.value2.cumsum() - df.value2)
df
Output
value1 value2 result
0 0 4 0
1 1 5 1
2 2 6 3
3 3 7 6
Explanation
We calculate the cumulative sum of value2 and subtract the current value2 from it; that total is then subtracted from the product of value1 and value2.
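A quick sanity check (a sketch, not part of the original answer) confirms the one-liner matches the pseudocode for this data. Note that it relies on value1 being 0, 1, 2, ..., so that the lookback equals the row position:
import pandas as pd

df = pd.DataFrame({'value1': [0, 1, 2, 3], 'value2': [4, 5, 6, 7]})
df['result'] = df.value1 * df.value2 - (df.value2.cumsum() - df.value2)
expected = []
for i in range(len(df)):
    calculation = df.value1[i] * df.value2[i]
    for j in range(df.value1[i]):  # lookback = value1[i]
        calculation -= df.value2[j]
    expected.append(calculation)
assert df['result'].tolist() == expected  # both give [0, 1, 3, 6]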
The solution below should work even if the first column value1 contains arbitrary integers rather than the increasing integers 0, 1, 2, ..., and it follows the pseudocode provided by the OP.
You just have to ensure that every value in value1 is a valid lookback for the dataframe (that is, no integer greater than the number of rows, which the pseudocode also requires).
import pandas as pd

d = {'value1': [0, 1, 2, 3], 'value2': [4, 5, 6, 7]}
df = pd.DataFrame(data=d)
csum2 = df["value2"].cumsum().shift(fill_value=0)  # csum2[i] = sum of value2[:i]
df["sum2"] = [csum2[i] for i in df["value1"]]      # inner loop: sum of value2[:lookback]
df["result"] = df["value1"] * df["value2"] - df["sum2"]
df.drop("sum2", axis=1, inplace=True)
To explain: I save the result of the inner loop of the pseudocode (for j in range(lookback):) in an additional column "sum2", so that I can then perform the main operation to get the "result" column.
At the end df is:
value1 value2 result
0 0 4 0
1 1 5 1
2 2 6 3
3 3 7 6
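A quick check of the claim above (a sketch, not part of the original answer), with value1 holding arbitrary valid lookbacks:
d = {'value1': [2, 0, 3, 1], 'value2': [4, 5, 6, 7]}
df = pd.DataFrame(data=d)
csum2 = df["value2"].cumsum().shift(fill_value=0)
df["result"] = df["value1"] * df["value2"] - [csum2[i] for i in df["value1"]]
# result is [-1, 0, 3, 3], matching the pseudocode loop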

Convert a list of stringified lists into a dataframe whilst maintaining index

I have the following data frame coming from an API source. I'm trying to wrangle the data without massively changing my original dataframe (essentially, I don't want to end up doing a cartesian product).
data = ["[['Key','Metric','Value'],['foo','bar','4'],['foo2','bar2','55.21']]",
"[['Key','Metric','Value'],['foo','bar','5']]",
"[['Key','Metric','Value'],['foo','bar','6'],['foo1','bar1',''],['foo2','bar2','57.75']]"]
df = pd.DataFrame({'id' : [0,1,2],'arr' : data})
print(df)
id arr
0 0 [['Key','Metric','Value'],['foo','bar','4'],['...
1 1 [['Key','Metric','Value'],['foo','bar','5']]
2 2 [['Key','Metric','Value'],['foo','bar','6'],['...
The first nested list (Key, Metric, Value) gives the order of the fields within each array. What I'm trying to do is arrange the data in a dictionary fashion of {key: value}, where the key is the Key and Metric fields joined and the value is the last (-1 index) element of the nested list.
The source data comes via Excel and the MS Graph API. I don't envisage that it will change, but it may, so I'm trying to come up with a dynamic solution.
my target dataframe is :
target_df = pd.DataFrame({'id': [0, 1, 2],
                          'foo_bar': [4, 5, 6],
                          'foo1_bar1': [np.nan, np.nan, ''],
                          'foo2_bar2': [55.21, np.nan, 57.75]})
print(target_df)
id foo_bar foo1_bar1 foo2_bar2
0 0 4 NaN 55.21
1 1 5 NaN NaN
2 2 6 57.75
My own attempts have been to use literal_eval from the ast library to get the first list, which will always be the Key, Metric & Value columns - there may in future be a Key, Metric, Metric2, Value layout - hence my desire to keep things dynamic.
There will always be a single Key & Value field.
My own attempt:
from ast import literal_eval
literal_eval(df['arr'][0])[0]
# ['Key', 'Metric', 'Value']
With this I removed the list characters and split on the commas, then converted the result to a dataframe:
df['arr'].str.replace(r'\[|\]', '', regex=True).str.split(',', expand=True)
However, after this I haven't made much headway, and I'm wondering if I'm going about this the wrong way?
Try:
df2=df["arr"].map(eval).apply(lambda x: pd.Series({f"{el[0]}_{el[1]}": el[2] for el in x[1:]}))
df2["id"]=df["id"]
Output:
foo_bar foo2_bar2 foo1_bar1 id
0 4 55.21 NaN 0
1 5 NaN NaN 1
2 6 57.75 2
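If the strings come from an untrusted source, ast.literal_eval is a safer drop-in for eval here, since it only parses Python literals instead of executing arbitrary code (a sketch of the same approach):
from ast import literal_eval

df2 = df["arr"].map(literal_eval).apply(
    lambda x: pd.Series({f"{el[0]}_{el[1]}": el[2] for el in x[1:]}))
df2["id"] = df["id"]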
IIUC, you can loop over each row and use literal_eval, create dataframes, set_index the first two columns and transpose. Then concat, rename the columns, and create the column id:
from ast import literal_eval
df_target = pd.concat([pd.DataFrame.from_records(literal_eval(x)).drop(0).set_index([0, 1]).T
                       for x in df.arr.to_numpy()],
                      ignore_index=True)
# rename the columns as wanted
df_target.columns = ['{}_{}'.format(*col) for col in df_target.columns]
# add the ids as a column (row order matches df, so the fresh 0..n-1 index is the id)
df_target = df_target.reset_index().rename(columns={'index': 'id'})
print(df_target)
print (df_target)
id foo_bar foo1_bar1 foo2_bar2
0 0 4 NaN 55.21
1 1 5 NaN NaN
2 2 6 57.75
I'm still not entirely sure I understand every aspect of the question, but here's what I have so far.
import ast
import pandas as pd
data = ["[['Key','Metric','Value'],['foo','bar','4'],['foo2','bar2','55.21']]",
"[['Key','Metric','Value'],['foo','bar','5']]",
"[['Key','Metric','Value'],['foo','bar','6'],['foo1','bar1',''],['foo2','bar2','57.75']]"]
nested_lists = [ast.literal_eval(elem)[1:] for elem in data]
row_dicts = [{'_'.join([key, metric]): value for key, metric, value in curr_list} for curr_list in nested_lists]
df = pd.DataFrame(data=row_dicts)
print(df)
Output:
foo_bar foo2_bar2 foo1_bar1
0 4 55.21 NaN
1 5 NaN NaN
2 6 57.75
nested_lists and row_dicts are built with list comprehensions since that makes debugging easier, but you can of course turn them into generator expressions.
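For example (a sketch of the generator-expression variant; nothing is materialized until the rows are consumed to build the DataFrame):
nested_lists = (ast.literal_eval(elem)[1:] for elem in data)
row_dicts = ({'_'.join([key, metric]): value for key, metric, value in curr_list}
             for curr_list in nested_lists)
df = pd.DataFrame(data=list(row_dicts))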

Find index of cell in dataframe

I would like to modify a cell's value based on its size.
If the dataframe is as below:
A B C
25802523 X1 2
M25JK0010 Y1 1
K25JK0010 Y2 1
I would like to modify column 'A' and insert the result into another column.
For example, if the size of the first cell in column 'A' is 8, I would like to break it up and keep the last 5 values; similarly for the others, depending on the size of each cell.
Is there any way I'm able to do this?
You can do this:
t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445'],
                  'size': [2, 1, 8]})
Define a dictionary mapping each size to your desired final length. Here, if the size is 8 I will take the last 5 characters:
size_dict = {8: 5, 2: 3, 1: 4}
Then use a simple pandas apply
t['A_bis'] = t.apply(lambda x: x['A'][len(x['A']) - size_dict[x['size']]:], axis=1)
The result is
0 523 >> 3 last characters (key 2)
1 0010 >> 4 last characters (key 1)
2 R4445 >> 5 last characters (key 8)
Another approach to do this:
Sample df:
t = pd.DataFrame({'A': ['25802523', 'M25JK00010', 'KRJOJR4445']})
Get the count of each elements of A:
t['Count'] = t['A'].apply(len)
Then write a condition to replace:
t.loc[t.Count == 8, 'Number'] = t['A'].str[-5:]
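To handle every size at once (a sketch; the count-to-length mapping below is hypothetical), map each count to a slice length and slice row by row:
slice_len = t['Count'].map({8: 5, 10: 4})  # hypothetical: keep 5 chars of 8-long cells, 4 of 10-long ones
t['Number'] = [s[-n:] for s, n in zip(t['A'], slice_len)]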

How to remove blanks/NA's from dataframe and shift the values up

I have a huge dataframe which has values and blanks/NAs in it. I want to remove the blanks from the dataframe and move the next values up in each column. Consider the sample dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
df
0 1 2 3
0 1.857476 NaN -0.462941 -0.600606
1 0.000267 -0.540645 NaN 0.492480
2 NaN NaN -0.803889 0.527973
3 0.566922 0.036393 -1.584926 2.278294
4 -0.243182 -0.221294 1.403478 1.574097
I want my output to be as below
0 1 2 3
0 1.857476 -0.540645 -0.462941 -0.600606
1 0.000267 0.036393 -0.803889 0.492480
2 0.566922 -0.221294 -1.584926 0.527973
3 -0.243182 1.403478 2.278294
4 1.574097
I want the NaNs to be removed and the next values to move up. df.shift was not helpful. I tried with multiple loops and if statements and achieved the desired result, but is there any better way to get it done?
You can use apply with dropna:
np.random.seed(100)
df = pd.DataFrame(np.random.randn(5,4))
df.iloc[1,2] = np.NaN
df.iloc[0,1] = np.NaN
df.iloc[2,1] = np.NaN
df.iloc[2,0] = np.NaN
print (df)
0 1 2 3
0 -1.749765 NaN 1.153036 -0.252436
1 0.981321 0.514219 NaN -1.070043
2 NaN NaN -0.458027 0.435163
3 -0.583595 0.816847 0.672721 -0.104411
4 -0.531280 1.029733 -0.438136 -1.118318
df1 = df.apply(lambda x: pd.Series(x.dropna().values))
print (df1)
0 1 2 3
0 -1.749765 0.514219 1.153036 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.029733 0.672721 0.435163
3 -0.531280 NaN -0.438136 -0.104411
4 NaN NaN NaN -1.118318
And then, if you need to replace the NaNs with empty spaces, note that this creates mixed values - strings with numerics - so some functions can break:
df1 = df.apply(lambda x: pd.Series(x.dropna().values)).fillna('')
print (df1)
0 1 2 3
0 -1.74977 0.514219 1.15304 -0.252436
1 0.981321 0.816847 -0.458027 -1.070043
2 -0.583595 1.02973 0.672721 0.435163
3 -0.53128 -0.438136 -0.104411
4 -1.118318
A numpy approach
The idea is to sort each column by np.isnan so that the np.nans are put last. I use kind='mergesort' to preserve the order within the non-nan values. Finally, I slice the array and reassign it, and follow this up with a fillna.
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
v[:] = v[a, i]
print(df.fillna(''))
0 1 2 3
0 1.85748 -0.540645 -0.462941 -0.600606
1 0.000267 0.036393 -0.803889 0.492480
2 0.566922 -0.221294 -1.58493 0.527973
3 -0.243182 1.40348 2.278294
4 1.574097
If you didn't want to alter the dataframe in place
v = df.values
i = np.arange(v.shape[1])
a = np.isnan(v).argsort(0, kind='mergesort')
pd.DataFrame(v[a, i], df.index, df.columns).fillna('')
The point of this is to leverage numpy's speed.
Naive time test (timing figure omitted):
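A minimal harness to run such a comparison yourself (a sketch; the frame size and NaN density are arbitrary):
import timeit
import numpy as np
import pandas as pd

big = pd.DataFrame(np.random.randn(1000, 100))
big = big.mask(np.random.rand(*big.shape) < 0.1)  # sprinkle in roughly 10% NaNs

def pandas_way(df):
    return df.apply(lambda x: pd.Series(x.dropna().values))

def numpy_way(df):
    v = df.values.copy()
    i = np.arange(v.shape[1])
    a = np.isnan(v).argsort(0, kind='mergesort')
    return pd.DataFrame(v[a, i], df.index, df.columns)

print(timeit.timeit(lambda: pandas_way(big), number=10))
print(timeit.timeit(lambda: numpy_way(big), number=10))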
Adding on to the solution by piRSquared:
This shifts all the values to the left instead of up.
If not all values are numbers, use pd.isnull instead of np.isnan:
v = df.values
a = [[n]*v.shape[1] for n in range(v.shape[0])]
b = pd.isnull(v).argsort(axis=1, kind = 'mergesort')
# a is a matrix used to reference the row index,
# b is a matrix used to reference the column index
# taking an entry from a and the respective entry from b (Same index),
# we have a position that references an entry in v
v[a, b]
A bit of explanation:
a is a list of length v.shape[0], and it looks something like this:
[[0, 0, 0, 0],
[1, 1, 1, 1],
[2, 2, 2, 2],
[3, 3, 3, 3],
[4, 4, 4, 4],
...
What happens here is that v is m x n, and I have made both a and b m x n as well. We pair up every entry i,j in a and b, fetching the element of v at the row given by the entry in a and the column given by the corresponding entry in b. So if a and b both looked like the matrix above, then v[a, b] would return a matrix whose first row contains n copies of v[0][0], whose second row contains n copies of v[1][1], and so on.
In piRSquared's solution, his i is a 1-D array, not a matrix, so it is reused for all v.shape[0] rows, i.e. once per row. Similarly, we could have done:
a = [[n] for n in range(v.shape[0])]
# which looks like
# [[0],[1],[2],[3]...]
# since we are trying to indicate the row indices of the matrix v as opposed to
# [0, 1, 2, 3, ...] which refers to column indices
Let me know if anything is unclear,
Thanks :)
As a pandas beginner I wasn't immediately able to follow the reasoning behind @jezrael's
df.apply(lambda x: pd.Series(x.dropna().values))
but I figured out that it works by resetting the index of the column. df.apply (by default) works column-by-column, treating each column as a series. Using df.dropna() removes NaNs but doesn't change the index of the remaining numbers, so when this column is added back to the dataframe the numbers go back to their original positions as their indices are still the same, and the empty spaces are filled with NaN, recreating the original dataframe and achieving nothing.
By resetting the index of the column, in this case by changing the series to an array (using .values) and back to a series (using pd.Series), only the empty spaces after all the numbers (i.e. at the bottom of the column) are filled with NaN. The same can be accomplished by
df.apply(lambda x: x.dropna().reset_index(drop = True))
Passing drop=True to reset_index keeps the old index from becoming a new column.
I would have posted this as a comment on @jezrael's answer but my rep isn't high enough!

Add columns of different lengths in pandas

I have a problem with adding columns in pandas.
I have a DataFrame with dimensions n x k, and during processing I need to add columns with dimensions m x 1, where m = [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
the result is:
AssertionError: Length of values does not match length of index
Can I add columns with a different length?
If you use accepted answer, you'll lose your column names, as shown in the accepted answer example, and described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To save column names, use pandas.concat but don't pass ignore_index (its default value is False, so you can omit the argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age': [10, 12, 13],
    'Gender': ['M', 'F', 'F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
# Age Gender Name
#0 10 M Nate A
#1 12 F Jessie A
#2 13 F Daniel H
#3 NaN NaN John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0,1,2,3]
b = [0,1,2,3,4,5,6,7,8,9]
c = [0,1]
Find the length of each list:
la, lb, lc = len(a), len(b), len(c)
# now find the max
max_len = max(la, lb, lc)
Resize all lists according to the determined max length, padding with empty strings:
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists are the same length, so create the dataframe:
pd.DataFrame({'A': a, 'B': b, 'C': c})
Final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
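The padding logic above can be wrapped in a small reusable helper (a sketch; pad_to_max is a hypothetical name, and it assumes plain lists as input):
import pandas as pd

def pad_to_max(**columns):
    # pad every list with '' until they all reach the longest length
    max_len = max(len(v) for v in columns.values())
    return pd.DataFrame({k: v + [''] * (max_len - len(v))
                         for k, v in columns.items()})

pad_to_max(A=a, B=b, C=c)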
I had the same issue: two different dataframes without a common column, which I just needed to put beside each other in a csv file.
Merge:
In this case, merge does not work, even when adding a temporary column to both dfs and then dropping it. Because this method makes both dfs the same length, it repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index(drop=True)  # drop=True avoids adding the old index as a column
df2 = df2.reset_index(drop=True)
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs beside each other (column-wise), each with its own length.
If somebody would like to replace a specific column of a different size instead of adding one:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the corresponding line can be omitted.
import pandas as pd

def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # turn the list into {row index: value}
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    # create a new DataFrame from the dict and return it
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T
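Hypothetical usage of the helper above (the missing rows of the shorter column come back as NaN):
df = pd.DataFrame({'A': [1, 2, 3]})
df = fill_column(df, ['x', 'y'], 'B')  # column B gets only 2 values
print(df)
#    A    B
# 0  1    x
# 1  2    y
# 2  3  NaN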
