Pandas: join a row of data on specific index number - python

import pandas
a=[['Date', 'letters', 'numbers', 'mixed'], ['1/2/2014', 'a', '6', 'z1'], ['1/2/2014', 'a', '3', 'z1'], ['1/3/2014', 'c', '1', 'x3']]
df = pandas.DataFrame.from_records(a[1:],columns=a[0])
b= [['a', 'b', 'c'], ['a', 'b', 'c']]
df2 = pandas.DataFrame.from_records(b[1:],columns=b[0])
I want to join df2 on df so it looks like this:
       Date letters numbers mixed    a    b    c
0  1/2/2014       a       6    z1
1  1/2/2014       a       3    z1    a    b    c
2  1/3/2014       c       1    x3
Looking through the docs, I got as close as df=df.join(df2,how='outer')
which gives you this:
       Date letters numbers mixed    a    b    c
0  1/2/2014       a       6    z1    a    b    c
1  1/2/2014       a       3    z1  NaN  NaN  NaN
2  1/3/2014       c       1    x3  NaN  NaN  NaN
I want something like df=df.join(df2,how='outer', on_index = 1)

The join already matches on a specific index; it just happens that the only row in df2 has index 0, so the join places 'a', 'b' and 'c' in row 0.
import pandas
a=[['Date', 'letters', 'numbers', 'mixed'], ['1/2/2014', 'a', '6', 'z1'], ['1/2/2014', 'a', '3', 'z1'], ['1/3/2014', 'c', '1', 'x3']]
df = pandas.DataFrame.from_records(a[1:],columns=a[0])
b= [['a', 'b', 'c'], ['a', 'b', 'c']]
df2 = pandas.DataFrame.from_records(b[1:],columns=b[0], index=[1])
df=df.join(df2,how='outer')
print(df)
#        Date letters numbers mixed    a    b    c
# 0  1/2/2014       a       6    z1  NaN  NaN  NaN
# 1  1/2/2014       a       3    z1    a    b    c
# 2  1/3/2014       c       1    x3  NaN  NaN  NaN
In this code I have set the index of df2 with the keyword argument index=[1]. If you cannot use that keyword argument, you can change the index (in this particular example) with df2.index = [1]; this must be done before joining the two DataFrames.
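As a minimal end-to-end sketch of the df2.index = [1] alternative mentioned above (same data as the question):

```python
import pandas

a = [['Date', 'letters', 'numbers', 'mixed'],
     ['1/2/2014', 'a', '6', 'z1'],
     ['1/2/2014', 'a', '3', 'z1'],
     ['1/3/2014', 'c', '1', 'x3']]
df = pandas.DataFrame.from_records(a[1:], columns=a[0])

b = [['a', 'b', 'c'], ['a', 'b', 'c']]
df2 = pandas.DataFrame.from_records(b[1:], columns=b[0])

# Relabel df2's single row so the join lines it up with row 1 of df
df2.index = [1]
joined = df.join(df2, how='outer')
```

Either construction-time index=[1] or a later df2.index = [1] produces the same alignment.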

Related

Pandas split dataframe on grouped index

Given a dataframe like
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'b'],
    'B': ['x', 'x', 'y'],
    'C': [1, 2, 3]
})
agg = df.groupby(['A', 'B']).agg('sum')
I get
     C
A B
a x  1
b x  2
b y  3
Now I would like to transform this to:
   C_x  C_y
A
a    1  NaN
b    2    3
How can I split agg into columns on the second index?
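One possible approach (a sketch, not from an answer in this thread; it assumes the column suffixes should come from the B level) is to unstack the inner index level and flatten the resulting MultiIndex columns:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'b'],
    'B': ['x', 'x', 'y'],
    'C': [1, 2, 3],
})
agg = df.groupby(['A', 'B']).agg('sum')

# Pivot index level B into columns, then flatten the resulting
# MultiIndex columns into names like C_x, C_y
wide = agg.unstack('B')
wide.columns = [f'{col}_{b}' for col, b in wide.columns]
```

Combinations that never occur (here a/y) come out as NaN, which matches the desired output.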

How do I find how many members from one group are also members of another group?

I am trying to find the number of people of a certain group who appear in other groups. For instance, here is the Pandas dataframe:
d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'], 'group': ['a', 'b', 'c', 'b', 'b', 'a']}
Which looks like this:
Ash: a
Psyduck: b
Pikachu: c
Charizard: b
Ash: b
Psyduck: a
I am trying to create a cross tabulation that looks like the following:
   a  b  c
a  2  2  0
b  2  3  0
c  0  0  1
Essentially, this cross tab shows how many members of group x are also members of group y. For example, two people are in both group a and group b, so there is a 2 at the intersection of those rows and columns.
I have used Pandas cross tab function but it doesn't give the result that I am looking for.
import pandas as pd
d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'], 'group': ['a', 'b', 'c', 'b', 'b', 'a']}
df = pd.DataFrame(d)
df = df.merge(df, on='name')
print(
    pd.crosstab(df.group_x, df.group_y)
)
Output:
group_y  a  b  c
group_x
a        2  2  0
b        2  3  0
c        0  0  1
Demo: https://repl.it/#alexmojaki/TragicFrigidConditions
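An alternative sketch of the same idea (not from the linked demo): build a name-by-group membership table with crosstab, then multiply it by its own transpose. The diagonal holds group sizes and each off-diagonal entry counts shared members.

```python
import pandas as pd

d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'],
     'group': ['a', 'b', 'c', 'b', 'b', 'a']}
df = pd.DataFrame(d)

# name-by-group indicator table; clip guards against a name being
# listed in the same group more than once
ct = pd.crosstab(df['name'], df['group']).clip(upper=1)

# group-by-group co-membership counts
overlap = ct.T @ ct
```

This avoids the self-merge, at the cost of being a little less obvious to read.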

How to select list of list elements and make different columns in a single dataframe?

List1 = [[1,A,!,a],[2,B,#,b],[7,C,&,c],[1,B,#,c],[4,D,#,p]]
Output should be like this:
Each different column should contain 1 value of each sublist elements
for example
column1:[1,2,7,1,4]
column2:[A,B,C,B,D]
column3:[!,#,&,#,#]
column4:[a,b,c,c,p]
in the same dataframe
Assuming that you actually meant for List1 to be this (all elements are strings):
list1 = [["1","A","!","a"],["2","B","#","b"],["7","C","&","c"],["1","B","#","c"],["4","D","#","p"]]
I don't think you need to do anything except pass list1 to the DataFrame constructor. There are several ways to pass information to a DataFrame; using a list of lists constructs unnamed columns.
print(pd.DataFrame(list1))
   0  1  2  3
0  1  A  !  a
1  2  B  #  b
2  7  C  &  c
3  1  B  #  c
4  4  D  #  p
Given the below list file:
l = [['1', 'A', '!', 'a'], ['2', 'B', '#', 'b'], ['7', 'C', '&', 'c'], ['1', 'B', '#', 'c'], ['4', 'D', '#', 'p']]
You can use pandas.DataFrame to convert it as below:
import pandas as pd
pd.DataFrame(l, columns=['c1', 'c2', 'c3', 'c4'])
# columns parameter for passing customized column names
Result:
  c1 c2 c3 c4
0  1  A  !  a
1  2  B  #  b
2  7  C  &  c
3  1  B  #  c
4  4  D  #  p
As commented (and illustrated by John L.'s answer), pandas.DataFrame should be sufficient. If what you actually want is a transposed dataframe, transpose manually:
import pandas as pd
df = pd.DataFrame(List1).T
Or beforehand using zip:
df = pd.DataFrame(list(zip(*List1)))
Both of which return:
   0  1  2  3  4
0  1  2  7  1  4
1  A  B  C  B  D
2  !  #  &  #  #
3  a  b  c  c  p

How to retain null/nan in one of the groupby columns while performing df.groupby

Let's say I have a dataframe that looks like this:
import numpy as np
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', np.nan, 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))
  Group1 Group2 Group3   Value
0      A      B      C   54.34
1      A      B    NaN   61.34
2      B      A      C  514.50
3      B      A      A  765.40
4      A      B      D  765.40
When I group by these 3 columns, the row containing NaN somehow gets dropped.
Ideally, I would want the combination (A, B and NaN in this case) to be retained as a separate row in the output, but it gets dropped.
df2 = df.groupby(['Group1', 'Group2', 'Group3'],as_index=False).sum()
  Group1 Group2 Group3   Value
0      A      B      C   54.34
1      A      B      D  765.40
2      B      A      A  765.40
3      B      A      C  514.50
As a workaround, I can fillna with some value and then group by, so that the row shows up, but that does not feel like an ideal solution.
How can I retain the NaN row?
Here is one way: fillna before the groupby, since groupby automatically removes NaN grouping keys.
df.fillna('NaN',inplace=True)
df2 = df.groupby(['Group1', 'Group2', 'Group3'],as_index=False).sum()
df2
  Group1 Group2 Group3   Value
0      A      B      C   54.34
1      A      B      D  765.40
2      A      B    NaN   61.34
3      B      A      A  765.40
4      B      A      C  514.50
From the docs: http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
> NA and NaT group handling
If there are any NaN or NaT values in the
grouping key, these will be automatically excluded. In other words,
there will never be an “NA group” or “NaT group”. This was not the
case in older versions of pandas, but users were generally discarding
the NA group anyway (and supporting it was an implementation
headache).
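Since pandas 1.1 there is also a direct option that makes the fillna workaround unnecessary: groupby(..., dropna=False) keeps NaN keys as their own group. A sketch with the question's data (assuming a recent pandas version):

```python
import numpy as np
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', np.nan, 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=group_cols + ['Value'])

# dropna=False (pandas >= 1.1) retains the NaN key as its own group
df2 = df.groupby(group_cols, as_index=False, dropna=False)['Value'].sum()
```

Unlike the fillna approach, Group3 keeps a real NaN in the result rather than the string 'NaN'.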

Python - Group multiple values from a column to create "Other" values

I have this dataset:
Field
A
A
A
B
C
C
C
D
C
C
C
A
This has been read into pandas through the following code:
import pandas as pd

data = pd.read_csv('data.csv', header=None)
print(data.describe())
How can I transform the column to get the below result?
Field
A
A
A
Others
C
C
C
Others
C
C
C
A
I want to transform values B and D, since they have low frequency, to an aggregate value "Others".
Here is one way:
import pandas as pd
df = pd.DataFrame({'Field': ['A', 'A', 'A', 'B', 'C', 'C', 'C',
                             'D', 'C', 'C', 'C', 'C', 'A']})
n = 2
counts = df['Field'].value_counts()
others = set(counts[counts < n].index)
df['Field'] = df['Field'].replace(list(others), 'Others')
Result
    Field
0       A
1       A
2       A
3  Others
4       C
5       C
6       C
7  Others
8       C
9       C
10      C
11      C
12      A
Explanation
First get the counts of each value in Field via value_counts.
Filter for values which occur less than n times. n is user-configurable.
Finally replace those values with 'Others'.
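A variant of the same idea that skips building the set: map each value to its count and use where to keep only values that occur at least n times (the threshold n = 2 is the same assumption as above):

```python
import pandas as pd

df = pd.DataFrame({'Field': ['A', 'A', 'A', 'B', 'C', 'C', 'C',
                             'D', 'C', 'C', 'C', 'C', 'A']})
n = 2

# Map each value to how often it occurs in the column, keep values
# seen at least n times, and replace the rest with 'Others'
counts = df['Field'].map(df['Field'].value_counts())
df['Field'] = df['Field'].where(counts >= n, 'Others')
```

The result is the same; where/mask avoids the intermediate set and the list conversion.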
