I have a dataframe with column 'A' as a string. Within 'A' there are values like
name1-L89783
nametwo-L33009
I would like to make a new column 'B' such that the '-Lxxxx' is removed and all that remains is 'name1' and 'nametwo'.
Use the vectorised str.split for this, and then use .str again to access the element of interest, in this case the first element of the split:
In [10]:
df[1] = df[0].str.split('-').str[0]
df
Out[10]:
0 1
0 name1-L89783 name1
1 nametwo-L33009 nametwo
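Applied to the question's own column names, the same idea gives column 'B' directly (a minimal sketch, assuming the frame starts with just column 'A'):
import pandas as pd

df = pd.DataFrame({'A': ['name1-L89783', 'nametwo-L33009']})
# keep only the part before the first '-'
df['B'] = df['A'].str.split('-').str[0]
df
A B
0 name1-L89783 name1
1 nametwo-L33009 nametwo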
Initialize DataFrame.
df = pd.DataFrame(['name1-L89783','nametwo-L33009'],columns=['A',])
>>> df
A
0 name1-L89783
1 nametwo-L33009
Apply function over rows and put the result in a new column.
df['B'] = df['A'].apply(lambda x: x.split('-')[0])
>>> df
A B
0 name1-L89783 name1
1 nametwo-L33009 nametwo
In [182]: colname
Out[182]: 'col1'
In [183]: x= 'df_' + colname
In [184]: x
Out[184]: 'df_col1'
How can I create a new pandas dataframe using x, such that the new dataframe's name would be df_col1?
Another way is to initialize a dictionary, and add keys containing dataframe as follows:
import pandas as pd
x = "your_column_name"
df_dict = {}
df_dict[x] = pd.DataFrame()
x = "your_new_column_name"
df_dict[x] = pd.DataFrame()
You can then change x at any time and use the same idea to add more dataframes to the dictionary. To fetch a dataframe back, simply look it up in the dictionary by its key.
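A minimal sketch of that lookup, continuing the snippet above (the names on the left are just illustrative):
first_df = df_dict["your_column_name"]        # the first empty frame stored above
second_df = df_dict["your_new_column_name"]   # the second one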
You can use the locals() function, as shown below:
>>> mydf
col_A col_B
0 1 4
1 2 5
2 3 6
>>> colname = 'col1'
>>> locals()[f'df_{colname}'] = mydf.col_A
>>> df_col1
0 1
1 2
2 3
Name: col_A, dtype: int64
The exec function lets you execute dynamically built statements. Use:
df_=pd.DataFrame({'a':[1,2,3]})
col_name = 'col1'
exec(f"df_{col_name}= df_")
Output:
>>> df_col1
a
0 1
1 2
2 3
This is a basic question. I've got a square array with the rows and columns summed up. Eg:
df = pd.DataFrame([[0,0,1,0], [0,0,1,0], [1,0,0,0], [0,1,0,0]], index = ["a","b","c","d"], columns = ["a","b","c","d"])
df["sumRows"] = df.sum(axis = 1)
df.loc["sumCols"] = df.sum()
This returns:
In [100]: df
Out[100]:
a b c d sumRows
a 0 0 1 0 1
b 0 0 1 0 1
c 1 0 0 0 1
d 0 1 0 0 1
sumCols 1 1 2 0 4
I need to find the column labels for which the sumCols row is 0. At the moment I am doing this:
df.loc["sumCols"][df.loc["sumCols"] == 0].index
But this returns a strange Index-type object. All I want is a list of the values that match this criterion, i.e. ['d'] in this case.
There are two ways (the Index object can be converted to an iterable like a list).
Do that with the columns:
columns = df.columns[df.sum()==0]
columns = list(columns)
Or you can rotate the Dataframe and treat columns as rows:
list(df.T[df.T.sumCols == 0].index)
You can use a lambda expression to filter the series, and if you want a list instead of an Index as the result, you can call .tolist() on the index object:
(df.loc['sumCols'] == 0)[lambda x: x].index.tolist()
# ['d']
Or:
df.loc['sumCols'][lambda x: x == 0].index.tolist()
# ['d']
Without explicitly creating the sumCols and if you want to check which column has sum of zero, you can do:
df.sum()[lambda x: x == 0].index.tolist()
# ['d']
Check rows:
df.sum(axis = 1)[lambda x: x == 0].index.tolist()
# []
Note: The lambda expression approach is about as fast as the vectorised boolean-mask subsetting, reads in a functional style, and can easily be written as a one-liner if you prefer.
Here's a simple method using query after transposing:
df.T.query('sumCols == 0').index.tolist()
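Applied to the summed-up df built in the question, this returns the same list:
df.T.query('sumCols == 0').index.tolist()
# ['d']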
I'm trying to replace and add to some values in a pandas dataframe. I have the following code:
import pandas as pd

df = pd.DataFrame({'A': ["va-lue", "value-%", "value"], 'B': [4, 5, 6]})
print(df)

df['A'] = df['A'].str.replace('%', '_0')
print(df)

df['A'] = df['A'].str.replace('-', '')
print(df)

# almost there?
df.A[df['A'].str.contains('-')] + "_0"
How can I find the cell values in column A which contain a '-' sign, replace the '-' with '', and add a trailing '_0' to those values? The resulting data set should look like this:
A B
0 value_0 4
1 value_0 5
2 value 6
You can first keep track of the rows whose A values need the trailing string appended, then perform the operations in two steps:
mask = df['A'].str.contains('-')
df['A'] = df['A'].str.replace('-|%', '', regex=True)
df.loc[mask, 'A'] += '_0'
print(df)
Output:
A B
0 value_0 4
1 value_0 5
2 value 6
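A roughly equivalent one-pass sketch using numpy.where instead of the .loc assignment (rebuilding the same df; cleaned and mask are just illustrative names):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ["va-lue", "value-%", "value"], 'B': [4, 5, 6]})
mask = df['A'].str.contains('-')
cleaned = df['A'].str.replace('-|%', '', regex=True)
df['A'] = np.where(mask, cleaned + '_0', cleaned)
print(df)
A B
0 value_0 4
1 value_0 5
2 value 6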
I have a list that follows this format:
a=['date name','10150425010245 name1','10150425020245 name2']
I am trying to convert this to Pandas df:
newlist=[]
for item in a:
    newlist.append(item.split(' '))
Now, convert this to df:
pd.DataFrame(newlist)
which results in
0 1
0 date name
1 10150425010245 name1
2 10150425020245 name2
I want to have 'date' and 'name' as headers, but I can't manage to do that. Is there a more efficient way to automatically convert a list of strings into a dataframe than this?
Here's one approach.
Use list comprehensions instead of loops.
In [160]: data = [x.split(' ') for x in a]
In [161]: data
Out[161]: [['date', 'name'], ['10150425010245', 'name1'], ['10150425020245', 'name2']]
Then use data[1:] as values and data[0] as column names.
In [162]: pd.DataFrame(data[1:], columns=data[0])
Out[162]:
date name
0 10150425010245 name1
1 10150425020245 name2
You were on the right track. With a slight modification, your code works fine.
import pandas as pd
a=['date name','10150425010245 name1','10150425020245 name2']
newlist=[]
for item in a:
    newlist.append(item.split(' '))
newlist2=pd.DataFrame(newlist,columns=["date","name"])[1:]
newlist2
date name
1 10150425010245 name1
2 10150425020245 name2
Tempted to summarise the answers already given in one line:
a=['date name','10150425010245 name1','10150425020245 name2']
pd.DataFrame(
    list(map(str.split, a))[1:],
    columns=a[0].split(),
)
Output:
Out[8]:
date name
0 10150425010245 name1
1 10150425020245 name2
I am using .size() on a groupby result in order to count how many items are in each group.
I would like the result to be saved to a new column name without manually editing the column names array. How can this be done?
This is what I have tried:
grpd = df.groupby(['A','B'])
grpd['size'] = grpd.size()
grpd
and the error I got:
TypeError: 'DataFrameGroupBy' object does not support item assignment
(on the second line)
The .size() built-in method of DataFrameGroupBy objects actually returns a Series object with the group sizes and not a DataFrame. If you want a DataFrame whose column is the group sizes, indexed by the groups, with a custom name, you can use the .to_frame() method and use the desired column name as its argument.
grpd = df.groupby(['A','B']).size().to_frame('size')
If you wanted the groups to be columns again you could add a .reset_index() at the end.
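A minimal sketch of both variants, assuming a small df with columns 'A' and 'B':
import pandas as pd

df = pd.DataFrame({'A': ['x', 'x', 'y'], 'B': ['a', 'a', 'b']})

grpd = df.groupby(['A', 'B']).size().to_frame('size')
#      size
# A B
# x a     2
# y b     1

grpd = df.groupby(['A', 'B']).size().to_frame('size').reset_index()
#    A  B  size
# 0  x  a     2
# 1  y  b     1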
You need to transform with 'size'; the length of df stays the same as before:
Notice:
Here it is necessary to select one column after the groupby, otherwise you get an error. Because GroupBy.size counts NaNs too, it does not matter which column is selected; all columns work the same.
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df['size'] = df.groupby(['A', 'B'])['A'].transform('size')
print (df)
A B size
0 x a 1
1 x c 2
2 x c 2
3 y b 2
4 y b 2
If you need to set the column name while aggregating, the length of df is obviously NOT the same as before:
import pandas as pd
df = pd.DataFrame({'A': ['x', 'x', 'x','y','y']
, 'B': ['a', 'c', 'c','b','b']})
print (df)
A B
0 x a
1 x c
2 x c
3 y b
4 y b
df = df.groupby(['A', 'B']).size().reset_index(name='Size')
print (df)
A B Size
0 x a 1
1 x c 2
2 y b 2
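On newer pandas versions (0.25+), named aggregation is another way to get a custom column name while aggregating (a sketch, assuming the same df as above):
df.groupby(['A', 'B']).agg(Size=('A', 'size')).reset_index()
#    A  B  Size
# 0  x  a     1
# 1  x  c     2
# 2  y  b     2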
The result of df.groupby(...) is not a DataFrame. To get a DataFrame back, you have to apply a function to each group, transform each element of a group, or filter the groups.
It seems like you want a DataFrame that contains (1) all your original data in df and (2) the count of how much data is in each group. These things have different lengths, so if they need to go into the same DataFrame, you'll need to list the size redundantly, i.e., for each row in each group.
import numpy as np

# np.size is applied group-wise to the remaining column(s) of df
df['size'] = df.groupby(['A', 'B']).transform(np.size)
(Aside: It's helpful if you can show succinct sample input and expected results.)
You can set the as_index parameter in groupby to False to get a DataFrame instead of a Series:
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b'], 'B': [1, 2, 2, 2]})
df.groupby(['A', 'B'], as_index=False).size()
Output:
A B size
0 a 1 1
1 a 2 1
2 b 2 2
Let's say n is the name of the dataframe and cst is the column whose repeated items you want to count.
The code below puts the count in a new column:
from collections import Counter
import pandas as pd

cstn = Counter(n.cst)
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns = ['name', 'cnt']
n['cnt'] = n['cst'].map(cstlist.loc[:, ['name', 'cnt']].set_index('name').iloc[:, 0].to_dict())
Hope this works.
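A minimal worked sketch of the same idea, using a small illustrative frame n with a 'cst' column:
from collections import Counter
import pandas as pd

n = pd.DataFrame({'cst': ['x', 'y', 'x', 'x']})

cstn = Counter(n.cst)          # Counter({'x': 3, 'y': 1})
cstlist = pd.DataFrame.from_dict(cstn, orient='index').reset_index()
cstlist.columns = ['name', 'cnt']
n['cnt'] = n['cst'].map(cstlist.loc[:, ['name', 'cnt']].set_index('name').iloc[:, 0].to_dict())
print(n)
#   cst  cnt
# 0   x    3
# 1   y    1
# 2   x    3
# 3   x    3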