Creating new pandas DataFrame from existing DataFrame and index - python

I have a DataFrame like this:
a b
A 1 0
B 0 1
and I have an array ["A","B","C"].
From these, I want to create a new DataFrame like this:
a b
A 1 0
B 0 1
C NaN NaN
How can I do this?

Assuming I understand what you're after (setting aside weird duplicated-index cases), one way is to use loc to index into your frame:
>>> df = pd.DataFrame({'a': {'A': 1, 'B': 0}, 'b': {'A': 0, 'B': 1}})
>>> arr = ["A", "B", "C"]
>>> df
a b
A 1 0
B 0 1
>>> df.loc[arr]
a b
A 1 0
B 0 1
C NaN NaN
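Note that passing a list containing labels missing from the index to .loc was deprecated in pandas 0.21 and raises a KeyError from pandas 1.0 onward. On those versions, reindex gives the same result (a and b become float because NaN cannot be stored in an integer column):
>>> df.reindex(arr)
     a    b
A  1.0  0.0
B  0.0  1.0
C  NaN  NaN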

Create a DataFrame with only index=['C'] and concat:
df = pd.DataFrame({'a': {'A': 1, 'B': 0}, 'b': {'A': 0, 'B': 1}})
df = pd.concat([df, pd.DataFrame(index=['C'])])
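This should give the same frame; here too, a and b are upcast to float because of the all-NaN C row:
     a    b
A  1.0  0.0
B  0.0  1.0
C  NaN  NaN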

Related

Remove string from JSON row

I'm taking several columns from a DataFrame and combining them into a new column.
A B C
1 3 6
1 2 4
4 5 0
df['D'] = df.apply(lambda x: x[['C', 'B']].to_json(), axis=1)
I'm then creating a new DataFrame that holds the unique values of df['A']:
df2 = pd.DataFrame({'A': df.A.unique()})
Finally, I'm creating a new column in df2 that lists the values of df['B'] and df['C']:
df2['E'] = [list(set(df['D'].loc[df['A'] == x['A']]))
            for _, x in df2.iterrows()]
but this leaves each object as a string:
A B C D
1 3 6 ['{"B":"3","C":"6"}', '{"B":"2","C":"4"}']
furthermore, when I dump this to JSON:
payload = json.dumps(data)
I get this result:
["{\"B\":\"3\",\"C\":\"6\"}", "{\"B\":\"2\",\"C\":\"4\"}"]
but I'm ultimately looking to remove the string wrapping from the objects and have this as the output:
[{"B":"3","C":"6"}, {"B":"2","C":"4"}]
Any guidance will be greatly appreciated.
In your case, use groupby with to_dict:
out = (df.groupby('A')
         .apply(lambda x: x[['B', 'C']].to_dict('records'))
         .to_frame('E')
         .reset_index())
out
Out[198]:
A E
0 1 [{'B': 3, 'C': 6}, {'B': 2, 'C': 4}]
1 4 [{'B': 5, 'C': 0}]
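Because E now holds real lists of dicts rather than JSON strings, dumping a cell with json.dumps produces the output you're after:
import json
payload = json.dumps(out['E'].iloc[0])
print(payload)
# [{"B": 3, "C": 6}, {"B": 2, "C": 4}]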

Assigning column names while creating dataframe results in nan values

I have a list of dicts which is being converted to a DataFrame. When I attempt to pass the columns argument, the output values are all NaN.
# This code does not result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
pd.DataFrame(l, columns=['c', 'd'])
c d
0 NaN NaN
1 NaN NaN
# This code does result in desired output
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame(l)
df.columns = ['c', 'd']
df
c d
0 1 2
1 3 4
Why is this happening?
Because when you pass a list of dictionaries, the DataFrame constructor creates the new column names from the dictionary keys:
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
print (pd.DataFrame(l))
a b
0 1 2
1 3 4
If you pass a columns parameter with some values that do not exist among the dictionary keys, the existing keys are filtered down to those columns, columns of missing values are created for the names that are not keys, and the column order follows the list of column names:
# changed order works, because keys a and b exist in at least one dictionary
print (pd.DataFrame(l, columns=['b', 'a']))
b a
0 2 1
1 4 3
# a is kept; d is filled with missing values because d is not a key in any dictionary
print (pd.DataFrame(l, columns=['a', 'd']))
a d
0 1 NaN
1 3 NaN
# b is kept; c is filled with missing values because c is not a key in any dictionary
print (pd.DataFrame(l, columns=['c', 'b']))
c b
0 NaN 2
1 NaN 4
# a and b are kept; c and d are filled with missing values because they are not keys in any dictionary
print (pd.DataFrame(l, columns=['c', 'd','a','b']))
c d a b
0 NaN NaN 1 2
1 NaN NaN 3 4
So if you want different column names, you need to rename them or set new ones, as in your second code.
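A compact alternative is to rename the key-derived columns in one step:
l = [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
df = pd.DataFrame(l).rename(columns={'a': 'c', 'b': 'd'})
print(df)
#    c  d
# 0  1  2
# 1  3  4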

How to define a python lambda getting the first element?

import pandas as pd

df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [1, 3, 8, 10],
                   'C': ['alpha', 'bravo', 'charlie', 'delta']})
Now, I would like to group the data using my own lambdas, but they behave differently from what I expect. The lambda in the following example should return the first value of the column in each group:
df.groupby('A', as_index=False).agg({'B': 'mean',
                                     'C': lambda x: x[0]})
The code throws KeyError: 0, which is unclear to me, since ['alpha', 'bravo'][0] gives 'alpha'.
So overall the desired output:
A B C
0 0 2 'alpha'
1 1 9 'charlie'
If you need to select the first value in each group, it is necessary to use Series.iat or Series.iloc to select by position:
df1 = df.groupby('A', as_index = False).agg({'B':'mean', 'C': lambda x: x.iat[0]})
Another solution is use GroupBy.first:
df1 = df.groupby('A', as_index = False).agg({'B':'mean', 'C': 'first'})
print (df1)
A B C
0 0 2 alpha
1 1 9 charlie
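On pandas 0.25 or newer, the same aggregation can also be written with named aggregation, which sidesteps the dict-of-lambdas style; a minimal equivalent sketch:
df1 = df.groupby('A', as_index=False).agg(B=('B', 'mean'), C=('C', 'first'))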
Can you add an explanation of why the lambda doesn't work?
The problem is that for the second group the index is not 0 but 2, which raises the error: x[0] tries to select by index label 0, and that label does not exist in the second group:
df1 = df.groupby('A', as_index=False).agg({'B': 'mean', 'C': lambda x: print(x[0])})
print (df1)
alpha <- returns the first value of the first group only, because 'alpha' has index label 0
alpha
alpha
So if you set index 0 for the first row of each group, it works with this sample data:
df = pd.DataFrame({'A': [0, 0, 1, 1],
                   'B': [1, 3, 8, 10],
                   'C': ['alpha', 'bravo', 'charlie', 'delta']}, index=[0, 1, 0, 1])
print (df)
A B C
0 0 1 alpha <- index is 0
1 0 3 bravo
0 1 8 charlie <- index is 0
1 1 10 delta
df1 = df.groupby('A', as_index = False).agg({'B':'mean', 'C': lambda x: x[0]})
print (df1)
A B C
0 0 2 alpha
1 1 9 charlie
A small explanation of why your lambda function won't work.
When we use groupby, we get a groupby object back:
g = df.groupby('A')
print(g)
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000023AA1BB41D0>
When we access the elements in our groupby object, we get grouped dataframes back:
for idx, d in g:
    print(d, '\n')
A B C
0 0 1 alpha
1 0 3 bravo
A B C
2 1 8 charlie
3 1 10 delta
So that's why we need to treat these elements as DataFrames. As jezrael already pointed out in his answer, there are several ways to access the first value in your C column:
for idx, d in g:
    print(d['C'].iat[0])
    print(d['C'].iloc[0], '\n')
alpha
alpha
charlie
charlie
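If you specifically want x[0] to keep working no matter what the original index looks like, one option (a sketch, not necessarily the cleanest fix) is to reset the index inside the lambda so every group is 0-based:
df1 = df.groupby('A', as_index=False).agg({'B': 'mean',
                                           'C': lambda x: x.reset_index(drop=True)[0]})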

How to use Pandas to replace column entries in DataFrame and create dictionary new-old values

I have a file which contains data as the following:
x y
z w
a b
a x
w y
I want to create the following replacements dictionary, which has a unique replacement number for each string, determined by the order in which strings first appear in the file when read left-to-right and top-to-bottom (note that it should be created, it is not supplied):
{'x':1, 'y':2, 'z':3, 'w':4 , 'a':5, 'b':6}
and the output file would be:
1 2
3 4
5 6
5 1
4 2
Is there any efficient way to create both the processed file and the dictionary with Pandas?
I thought of creating the dictionary with the following approach:
import collections

_counter = 0
def counter():
    global _counter
    _counter += 1
    return _counter

replacements_dict = collections.defaultdict(counter)
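For reference, that defaultdict idea does work as a plain-Python baseline; a minimal sketch (the file names infile/outfile are placeholders):
with open('infile') as f_in, open('outfile', 'w') as f_out:
    for line in f_in:
        # a missing token triggers counter() and gets the next number
        codes = [str(replacements_dict[tok]) for tok in line.split()]
        f_out.write(' '.join(codes) + '\n')
print(dict(replacements_dict))
# {'x': 1, 'y': 2, 'z': 3, 'w': 4, 'a': 5, 'b': 6}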
You can use factorize with the MultiIndex Series created by stack, then unstack, and last write to the file with to_csv:
df = pd.read_csv(file, sep="\s+", header=None)
print (df)
0 1
0 x y
1 z w
2 a b
3 a x
4 w y
s = df.stack()
fact = pd.factorize(s)
# indexing is necessary: it maps each stacked value back to its 1-based code
d = dict(zip(fact[1].values[fact[0]], fact[0] + 1))
print (d)
{'x': 1, 'y': 2, 'z': 3, 'w': 4, 'a': 5, 'b': 6}
For new file:
# values separated by ,
pd.Series(d).to_csv('dict.csv')
#read Series from file, convert to dict
d = pd.read_csv('dict.csv', index_col=[0], squeeze=True, header=None).to_dict()
print (d)
{'x': 1, 'y': 2, 'z': 3, 'w': 4, 'a': 5, 'b': 6}
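A side note: the squeeze= parameter of read_csv was deprecated in pandas 1.4 and removed in 2.0; on newer versions the same round-trip can be written with DataFrame.squeeze:
d = pd.read_csv('dict.csv', index_col=[0], header=None).squeeze('columns').to_dict()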
df = pd.Series(fact[0] + 1, index=s.index).unstack()
print (df)
0 1
0 1 2
1 3 4
2 5 6
3 5 1
4 4 2
df.to_csv('out', index=False, header=None)
I assume you want dictionary d in such a way that the values assigned to the keys correspond to the order in which the keys appear, row by row:
d = {'col1': ['x', 'y', 'a', 'a', 'w'], 'col2': ['z', 'w', 'b', 'x', 'y']}
df = pd.DataFrame(d)
print(df)
Output:
col1 col2
0 x z
1 y w
2 a b
3 a x
4 w y
=================================
Using itertools:
import itertools

raw_list = list(itertools.chain(*[df.iloc[i].tolist() for i in range(df.shape[0])]))
d = dict()
counter = 1
for k in raw_list:
    try:
        _ = d[k]
    except KeyError:
        d[k] = counter
        counter += 1
then:
d
Output:
{'a': 5, 'b': 6, 'w': 4, 'x': 1, 'y': 3, 'z': 2}
I hope it helps!
===========================================
Using factorize:
s = df.stack()
codes, uniques = pd.factorize(s)
# uniques are ordered by first appearance, so the code of uniques[i] is i
d = {}
for i, u in enumerate(uniques):
    d[u] = i + 1
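With the df used in this answer, that yields:
print(d)
# {'x': 1, 'z': 2, 'y': 3, 'w': 4, 'a': 5, 'b': 6}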

Merging multiple dataframe lines into aggregate lines

For the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': {0: "A", 1: "A", 2: "A", 3: "B"},
                   'Spec1': {0: '1', 1: '3', 2: '5', 3: '1'},
                   'Spec2': {0: '2a', 1: np.nan, 2: np.nan, 3: np.nan}
                   }, columns=['Name', 'Spec1', 'Spec2'])
Name Spec1 Spec2
0 A 1 2a
1 A 3 NaN
2 A 5 NaN
3 B 1 NaN
I would like to aggregate the columns into:
Name Spec
0 A 1,3,5,2a
1 B 1
Is there a more "pandas" way of doing this than just looping and keeping track of the values?
Or use melt (melt stacks all of Spec1 before Spec2, which is exactly what keeps the 1,3,5,2a order):
(df.melt('Name')
   .groupby('Name').value
   .apply(lambda x: ','.join(pd.Series(x).dropna()))
   .reset_index()
   .rename(columns={'value': 'spec'}))
Out[2226]:
Name spec
0 A 1,3,5,2a
1 B 1
Another way:
In [966]: (df.set_index('Name').unstack()
             .dropna().reset_index()
             .groupby('Name')[0].apply(','.join))
Out[966]:
Name
A 1,3,5,2a
B 1
Name: 0, dtype: object
Group rows by name, combine column values as a list, dropping NaN:
df = df.groupby('Name').agg(lambda x: list(x.dropna()))
Spec1 Spec2
Name
A [1, 3, 5] [2a]
B [1] []
Now merge Spec1 and Spec2 lists. Bring Name back as a column. Name the new Spec column.
df = (df.Spec1 + df.Spec2).reset_index().rename(columns={0:"Spec"})
Name Spec
0 A [1, 3, 5, 2a]
1 B [1]
Finally, convert Spec lists to string representations:
df.Spec = df.Spec.apply(','.join)
Name Spec
0 A 1,3,5,2a
1 B 1
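If you prefer, the three steps above can also be chained into a single expression; a sketch reusing the same operations:
out = (df.groupby('Name')
         .agg(lambda x: list(x.dropna()))          # lists per column, NaN dropped
         .pipe(lambda g: g.Spec1 + g.Spec2)        # concatenate the lists
         .apply(','.join)                          # join to strings
         .reset_index(name='Spec'))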
