import pandas
a=[['Date', 'letters', 'numbers', 'mixed'], ['1/2/2014', 'a', '6', 'z1'], ['1/2/2014', 'a', '3', 'z1'], ['1/3/2014', 'c', '1', 'x3']]
df = pandas.DataFrame.from_records(a[1:],columns=a[0])
b= [['a', 'b', 'c'], ['a', 'b', 'c']]
df2 = pandas.DataFrame.from_records(b[1:],columns=b[0])
I want to join df2 onto df so the result looks like this:
Date letters numbers mixed a b c
0 1/2/2014 a 6 z1
1 1/2/2014 a 3 z1 a b c
2 1/3/2014 c 1 x3
Looking through the docs, I got as close as df=df.join(df2,how='outer')
which gives you this:
Date letters numbers mixed a b c
0 1/2/2014 a 6 z1 a b c
1 1/2/2014 a 3 z1 NaN NaN NaN
2 1/3/2014 c 1 x3 NaN NaN NaN
I want something like df=df.join(df2,how='outer', on_index = 1)
It already does join on a specific index; it just so happens that the index of df2 is 0, so the join places the 'a', 'b', 'c' values at index 0.
import pandas
a=[['Date', 'letters', 'numbers', 'mixed'], ['1/2/2014', 'a', '6', 'z1'], ['1/2/2014', 'a', '3', 'z1'], ['1/3/2014', 'c', '1', 'x3']]
df = pandas.DataFrame.from_records(a[1:],columns=a[0])
b= [['a', 'b', 'c'], ['a', 'b', 'c']]
df2 = pandas.DataFrame.from_records(b[1:],columns=b[0], index=[1])
df=df.join(df2,how='outer')
print(df)
# Date letters numbers mixed a b c
# 0 1/2/2014 a 6 z1 NaN NaN NaN
# 1 1/2/2014 a 3 z1 a b c
# 2 1/3/2014 c 1 x3 NaN NaN NaN
In this code I have set the index of df2 with the keyword argument index=[1]. If you cannot use this keyword argument, you can change the index (in this particular example) with df2.index = [1]; this should be done before joining the two DataFrames.
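For completeness, here is a sketch of that second variant, reassigning df2.index after construction rather than passing index=[1] to the constructor:

```python
import pandas

a = [['Date', 'letters', 'numbers', 'mixed'],
     ['1/2/2014', 'a', '6', 'z1'],
     ['1/2/2014', 'a', '3', 'z1'],
     ['1/3/2014', 'c', '1', 'x3']]
df = pandas.DataFrame.from_records(a[1:], columns=a[0])

b = [['a', 'b', 'c'], ['a', 'b', 'c']]
df2 = pandas.DataFrame.from_records(b[1:], columns=b[0])

# Reassign the index after construction, before the join
df2.index = [1]

df = df.join(df2, how='outer')
print(df)
```

Either way, the join aligns df2's single row with index label 1 in df, leaving NaN in rows 0 and 2.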
Related
Given a dataframe like
df = pd.DataFrame({
'A': ['a', 'b', 'b'],
'B': ['x', 'x', 'y'],
'C': [1, 2, 3]
})
agg = df.groupby(['A', 'B']).agg('sum')
I get
C
A B
a x 1
b x 2
b y 3
Now I would like to transform this to:
   C_x  C_y
A
a    1  NaN
b    2    3
How can I split agg into columns on the second index?
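One way to do this (a sketch, not from the original thread) is unstack, which pivots the second index level into the columns, followed by flattening the resulting column MultiIndex into names like C_x and C_y:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'b'],
    'B': ['x', 'x', 'y'],
    'C': [1, 2, 3]
})
agg = df.groupby(['A', 'B']).agg('sum')

# Move the second index level ('B') into the columns
wide = agg.unstack(level='B')

# Flatten the MultiIndex columns [('C', 'x'), ('C', 'y')] to ['C_x', 'C_y']
wide.columns = [f'{col}_{lvl}' for col, lvl in wide.columns]
print(wide)
```

Missing combinations (here, the 'a'/'y' cell) come out as NaN, which also forces the remaining values to float.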
I am trying to find the number of people of a certain group who appear in other groups. For instance, here is the Pandas dataframe:
d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'], 'group': ['a', 'b', 'c', 'b', 'b', 'a']}
Which looks like this:
Ash: a
Psyduck: b
Pikachu: c
Charizard: b
Ash: b
Psyduck: a
I am trying to create a cross tabulation that looks like the following:
a b c
a 2 2 0
b 2 3 0
c 0 0 1
Essentially, this cross tab shows how many members of group x are also members of group y. For example, there are 2 people who are in both group a and group b, so there is a 2 at the intersection of that row and column.
I have used pandas' crosstab function, but it doesn't give the result I am looking for.
import pandas as pd
d = {'name': ['ash', 'psyduck', 'pikachu', 'charizard', 'ash', 'psyduck'], 'group': ['a', 'b', 'c', 'b', 'b', 'a']}
df = pd.DataFrame(d)
df = df.merge(df, on='name')
print(
pd.crosstab(df.group_x, df.group_y)
)
Output:
group_y a b c
group_x
a 2 2 0
b 2 3 0
c 0 0 1
Demo: https://repl.it/#alexmojaki/TragicFrigidConditions
List1 = [[1,A,!,a],[2,B,#,b],[7,C,&,c],[1,B,#,c],[4,D,#,p]]
Output should be like this:
Each different column should contain 1 value of each sublist elements
for example
column1:[1,2,7,1,4]
column2:[A,B,C,B,D]
column3:[!,#,&,#,#]
column4:[a,b,c,c,p]
in the same dataframe
Assuming that you actually meant for List1 to be this (all elements are strings):
list1 = [["1","A","!","a"],["2","B","#","b"],["7","C","&","c"],["1","B","#","c"],["4","D","#","p"]]
I don't think that you need to do anything except pass List1 to the DataFrame constructor. There are several ways to pass information to a DataFrame. Using lists of lists constructs un-named columns.
print(pd.DataFrame(list1))
0 1 2 3
0 1 A ! a
1 2 B # b
2 7 C & c
3 1 B # c
4 4 D # p
Given the below list file:
l = [['1', 'A', '!', 'a'], ['2', 'B', '#', 'b'], ['7', 'C', '&', 'c'], ['1', 'B', '#', 'c'], ['4', 'D', '#', 'p']]
You can use pandas.DataFrame to convert it, as below:
import pandas as pd
pd.DataFrame(l, columns=['c1', 'c2', 'c3', 'c4'])
# columns parameter for passing customized column names
Result:
c1 c2 c3 c4
0 1 A ! a
1 2 B # b
2 7 C & c
3 1 B # c
4 4 D # p
As commented (and illustrated by John L.'s answer), pandas.DataFrame should be sufficient. If what you actually want is a transposed dataframe, try transpose manually:
import pandas as pd
df = pd.DataFrame(List1).T
Or beforehand using zip:
df = pd.DataFrame(list(zip(*List1)))
Both of which return:
0 1 2 3 4
0 1 2 7 1 4
1 A B C B D
2 ! # & # #
3 a b c c p
Let's say I have a dataframe that looks like this:
group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
['A', 'B', np.nan, 61.34],
['B', 'A', 'C', 514.5],
['B', 'A', 'A', 765.4],
['A', 'B', 'D', 765.4]],
columns=(group_cols+['Value']))
Group1 Group2 Group3 Value
A B C 54.34
A B nan 61.34
B A C 514.5
B A A 765.4
A B D 765.4
When I group by these 3 columns, the NaN row somehow gets dropped.
Ideally, I would want the combination (A, B, and NaN in this case) to be retained, so a separate row should appear in my output. However, it gets dropped.
df2 = df.groupby(['Group1', 'Group2', 'Group3'],as_index=False).sum()
Group1 Group2 Group3 Value
A B C 54.34
A B D 765.4
B A A 765.4
B A C 514.5
As a workaround, I can fillna with some value and then group by, so the row shows up, but that does not feel like an ideal solution.
How can I retain the NaN row?
Here is one way: fillna before the groupby, since groupby automatically excludes NaN grouping keys.
df.fillna('NaN',inplace=True)
df2 = df.groupby(['Group1', 'Group2', 'Group3'],as_index=False).sum()
df2
Group1 Group2 Group3 Value
0 A B C 54.34
1 A B D 765.40
2 A B NaN 61.34
3 B A A 765.40
4 B A C 514.50
From the docs: http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
> NA and NaT group handling
If there are any NaN or NaT values in the
grouping key, these will be automatically excluded. In other words,
there will never be an “NA group” or “NaT group”. This was not the
case in older versions of pandas, but users were generally discarding
the NA group anyway (and supporting it was an implementation
headache).
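Since pandas 1.1, groupby also accepts dropna=False, which keeps NaN keys as their own group and avoids the fillna workaround entirely:

```python
import numpy as np
import pandas as pd

group_cols = ['Group1', 'Group2', 'Group3']
df = pd.DataFrame([['A', 'B', 'C', 54.34],
                   ['A', 'B', np.nan, 61.34],
                   ['B', 'A', 'C', 514.5],
                   ['B', 'A', 'A', 765.4],
                   ['A', 'B', 'D', 765.4]],
                  columns=(group_cols + ['Value']))

# dropna=False (pandas 1.1+) retains the (A, B, NaN) group
df2 = df.groupby(group_cols, as_index=False, dropna=False).sum()
print(df2)
```

Unlike the fillna approach, the key stays a real NaN in the result, so downstream code can still treat it as missing.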
I have this dataset:
Field
A
A
A
B
C
C
C
D
C
C
C
A
This has been read into pandas through the following code:
import pandas as pd
data = pd.read_csv('data.csv', header=None)
print(data.describe())
How can I transform the column to get the below result?
Field
A
A
A
Others
C
C
C
Others
C
C
C
A
I want to transform values B and D, since they have low frequency, to an aggregate value "Others".
Here is one way:
import pandas as pd
df = pd.DataFrame({'Field': ['A', 'A', 'A', 'B', 'C', 'C', 'C',
'D', 'C', 'C', 'C', 'C', 'A']})
n = 2
counts = df['Field'].value_counts()
others = set(counts[counts < n].index)
df['Field'] = df['Field'].replace(list(others), 'Others')
Result
Field
0 A
1 A
2 A
3 Others
4 C
5 C
6 C
7 Others
8 C
9 C
10 C
11 C
12 A
Explanation
First get the counts of each value in Field via value_counts.
Filter for values which occur less than n times. n is user-configurable.
Finally replace those values with 'Others'.
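The same three steps can also be collapsed into a one-liner (a variant, not from the original answer) that uses map to look up each value's count and where to mask the rare ones:

```python
import pandas as pd

df = pd.DataFrame({'Field': ['A', 'A', 'A', 'B', 'C', 'C', 'C',
                             'D', 'C', 'C', 'C', 'C', 'A']})
n = 2
counts = df['Field'].value_counts()

# Keep values that occur at least n times; replace the rest with 'Others'
df['Field'] = df['Field'].where(df['Field'].map(counts) >= n, 'Others')
print(df)
```

map translates each entry to its frequency, and where keeps the original value wherever the condition holds, substituting 'Others' elsewhere.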