According to this post, np.argsort() should be the function I am looking for.
However, it is not giving me my desired result.
Below is the R code that I am trying to convert to Python and my current Python code.
R Code
data.frame %>% select(order(colnames(.)))
Python Code
dataframe.iloc[numpy.array(dataframe.columns).argsort()]
The dataframe I am working with has 1,000,000+ rows and 42 columns, so I cannot exactly re-create the output.
But I believe I can re-create the order() outputs.
From my understanding, each number represents the original position in the columns list.
order(colnames(data.frame)) returns
3,2,5,6,8,4,7,10,9,11,12,13,14,15,16,17,18,19,23,20,21,22,1,25,26,28,24,27,38,29,34,33,36,30,31,32,35,41,42,39,40,37
numpy.array(dataframe.columns).argsort() returns
2,4,5,7,3,6,9,8,10,11,12,13,14,15,16,17,18,22,19,20,21,0,24,25,27,23,26,37,28,33,32,35,29,30,31,34,40,41,38,39,36,1
I know R does not use 0-based indexing like Python, so R's first number, 3, and Python's first, 2, refer to the same column.
I am looking for Python code that could return the same ordering as the R code.
Do you have mixed case? Mixed case is handled differently in Python and R.
R:
order(c('a', 'b', 'B', 'A', 'c'))
# [1] 1 4 2 3 5
x <- c('a', 'b', 'B', 'A', 'c')
x[order(c('a', 'b', 'B', 'A', 'c'))]
# [1] "a" "A" "b" "B" "c"
Python:
np.argsort(['a', 'b', 'B', 'A', 'c'])+1
# array([4, 3, 1, 2, 5])
x = np.array(['a', 'b', 'B', 'A', 'c'])
x[np.argsort(x)]
# array(['A', 'B', 'a', 'b', 'c'], dtype='<U1')
You can mimic R's behavior using numpy.lexsort, sorting by the lowercased strings first and breaking ties with the case-swapped original array:
x = np.array(['a', 'b', 'B', 'A', 'c'])
x[np.lexsort([np.char.swapcase(x), np.char.lower(x)])]
# array(['a', 'A', 'b', 'B', 'c'], dtype='<U1')
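Applied to the original question, here is a sketch that reorders the DataFrame's columns with the same key (assuming dataframe is the frame from the question and its column labels are plain strings):
import numpy as np
cols = dataframe.columns.to_numpy().astype(str)
# Primary key: lowercase; tie-breaker: swapped case, as in the trick above.
order = np.lexsort([np.char.swapcase(cols), np.char.lower(cols)])
dataframe = dataframe.iloc[:, order]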
np.argsort is the same thing as R's order.
Just experiment:
> x=c(1,2,3,10,20,30,5,15,25,35)
> x
[1] 1 2 3 10 20 30 5 15 25 35
> order(x)
[1] 1 2 3 7 4 8 5 9 6 10
>>> x=np.array([1,2,3,10,20,30,5,15,25,35])
>>> x
array([ 1, 2, 3, 10, 20, 30, 5, 15, 25, 35])
>>> x.argsort()+1
array([ 1, 2, 3, 7, 4, 8, 5, 9, 6, 10])
The +1 here is just to have the indices start at 1, since the output of argsort is 0-based indices.
So the problem probably comes from your columns (a shot in the dark: you have 2-D arrays and are passing rows to R and columns to Python, or something like that).
But np.argsort is R's order.
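One more thing worth checking in the snippet from the question: dataframe.iloc[...] with a single indexer selects rows, not columns. A sketch of reordering the columns instead (same dataframe name as in the question):
import numpy as np
order = np.argsort(dataframe.columns.to_numpy())
dataframe = dataframe.iloc[:, order]  # note the leading ':' to select columns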
I have a dataset, and I am trying to iterate over each group and, based on each group, update the original groups:
import pandas as pd
import numpy as np
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr
for group_name, b in df.groupby("grain"):
    if group_name == "A":
        ...  # do some processing
    if group_name == "B":
        ...  # do another processing
I expect to see the original df updated. Is there any way to do it?
Here is a way to change the original data; this example requires a non-duplicated index. I am not sure what the benefit of this approach would be compared to classical pandas operations.
import pandas as pd
import numpy as np
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr
for g_name, g_df in df.groupby("grain"):
    if g_name == "A":
        df.loc[g_df.index, 'target'] *= 10
    if g_name == "B":
        df.loc[g_df.index, 'target'] *= -1
Output:
>>> df
grain target
0 A 10
1 B -2
2 A 40
3 B -7
4 A 110
5 B -16
6 A 220
7 B -29
8 A 370
9 B -46
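For comparison, a vectorized sketch that produces the same result without the loop, by mapping each group label to a multiplier (assuming the same two-group logic as above):
import numpy as np
import pandas as pd
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B'] * 5, 'target': arr})
# Build a per-row multiplier aligned with the index, then apply it in one step.
df['target'] *= df['grain'].map({'A': 10, 'B': -1})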
I have a dataframe from a Stata file and I would like to add a new column to it which has a numeric list as its entry for each row. How can one accomplish this? I have been trying assignment, but it complains about index size.
I tried initiating a new column of strings (I also tried integers) and tried something like this, but it didn't work:
testdf['new_col'] = '0'
testdf['new_col'] = testdf['new_col'].map(lambda x : list(range(100)))
Here is a toy example resembling what I have:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
testdf = pd.DataFrame.from_dict(data)
This is what I would like to have:
data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15], 'list' : [[1,2,3],[7,8,9,10,11],[9,10,11,12],[10,11,12,13,14,15]]}
testdf2 = pd.DataFrame.from_dict(data2)
My final goal is to use explode on that "list" column to duplicate the rows appropriately.
Try this bit of code:
testdf['list'] = pd.Series(np.arange(i, j) for i, j in zip(testdf['start_val'],
                                                           testdf['end_val'] + 1))
testdf
Output:
col_1 col_2 start_val end_val list
0 3 a 1 3 [1, 2, 3]
1 2 b 7 11 [7, 8, 9, 10, 11]
2 1 c 9 12 [9, 10, 11, 12]
3 0 d 10 15 [10, 11, 12, 13, 14, 15]
This uses a generator expression and zip with the pd.Series constructor and np.arange to create the lists.
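Since the stated end goal is to explode that column, a quick follow-up on the result above (ignore_index needs pandas >= 1.1; drop it to keep the original row labels):
# Each list element becomes its own row; the scalar columns are repeated.
exploded = testdf.explode('list', ignore_index=True)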
If you'd rather stick to using the apply function:
import pandas as pd
import numpy as np
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
df = pd.DataFrame.from_dict(data)
df['range'] = df.apply(lambda row: np.arange(row['start_val'], row['end_val']+1), axis=1)
print(df)
Output:
col_1 col_2 start_val end_val range
0 3 a 1 3 [1, 2, 3]
1 2 b 7 11 [7, 8, 9, 10, 11]
2 1 c 9 12 [9, 10, 11, 12]
3 0 d 10 15 [10, 11, 12, 13, 14, 15]
I have a pandas DataFrame and I would like to save it in a tab-separated file format with a pound (#) symbol at the beginning of the header.
Here is my demo code:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'
df.to_csv(file_name, sep='\t', index=False)
The above code creates a dataframe and saves it in a tab-separated value format that looks like:
a b c
1 2 3
4 5 6
7 8 9
But how can I add a pound symbol to the header while saving the DataFrame?
I want the output to be like below:
#a b c
1 2 3
4 5 6
7 8 9
I hope the question is clear; thanks in advance for the help.
Note: I would like to keep the DataFrame header definition the same.
Using your code, just modify the a column to be #a, like below:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['#a', 'b', 'c'])
file_name = 'test.tsv'
df.to_csv(file_name, sep='\t', index=False)
Edit
If you don't want to adjust the starting dataframe, use .rename before sending to csv:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
file_name = 'test.tsv'
df.rename(columns={
    'a': '#a'
}).to_csv(file_name, sep='\t', index=False)
Use the header argument to create aliases for the columns.
df.to_csv(file_name, sep='\t', index=False,
          header=[f'#{x}' if x == df.columns[0] else x for x in df.columns])
#a b c
1 2 3
4 5 6
7 8 9
Here's another way to get your column aliases:
from itertools import zip_longest
header = [''.join(x) for x in zip_longest('#', df.columns, fillvalue='')]
#['#a', 'b', 'c']
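Either way you build the aliases, pass them through header= when writing (a sketch reusing the names above):
df.to_csv(file_name, sep='\t', index=False, header=header)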
I have a column within a dataset, regarding categorical company sizes, in which lone hyphens ('-') currently represent missing data.
I want to change the '-' missing values to nulls so I can analyse the missing data. However, when I use the pandas replace method (see the following code) with a None value, it seems to also mangle the genuine entries, as they contain hyphens too (e.g. 51-200).
df['Company Size'].replace({'-': None}, inplace=True, regex=True)
How can I replace only lone-standing hyphens and leave the other entries untouched?
You don't need regex=True. Without it, replace only matches cells whose entire value is '-', so entries like 51-200 are left untouched.
df['Company Size'].replace({'-': None}, inplace=True)
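If you do want a regex (for example, to also catch hyphens padded with whitespace), anchor the pattern so 51-200 stays intact (a sketch; the column name comes from the question):
import numpy as np
df['Company Size'] = df['Company Size'].replace(r'^\s*-\s*$', np.nan, regex=True)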
You could also just do:
df['column_name'] = df['column_name'].replace('-', np.nan)
(Note that replacing with the string 'None' would store text, not a real null.)
import numpy as np
df.replace('-', np.nan, inplace=True)
This code worked for me.
You can do it like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': [5, 6, 7, 8, 9],
                   'C': ['a', '-', 'c--', 'd', 'e']})
df['C'] = df['C'].replace('-', np.nan)
df = df.where((pd.notnull(df)), None)
# can also use: df['C'] = df['C'].where(pd.notnull(df['C']), None)
print(df)
output:
A B C
0 0 5 a
1 1 6 None
2 2 7 c--
3 3 8 d
4 4 9 e
another example:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': ['5-5', '-', 7, 8, 9],
                   'C': ['a', 'b', 'c--', 'd', 'e']})
df['B'] = df['B'].replace('-', np.nan)
df = df.where((pd.notnull(df)), None)
print(df)
output:
A B C
0 0 5-5 a
1 1 None b
2 2 7 c--
3 3 8 d
4 4 9 e