I have 3 dataframes A, B and C:

import numpy as np
import pandas as pd

A = pd.DataFrame({"id": [1, 2],
                  "connected_to_B_id1": ["A", "B"],
                  "connected_to_B_id2": ["B", "C"],
                  "connected_to_B_id3": ["C", np.nan],
                  # an entry can have multiple ids from B
                  })
B = pd.DataFrame({"id": ["A", "B", "C"],
                  "connected_to_C_id1": [1, 1, 2],
                  "connected_to_C_id2": [2, 2, np.nan],
                  # an entry can have multiple ids from C
                  })
C = pd.DataFrame({"id": [1, 2],
                  "name": ["a", "b"],
                  })
# Output should be D:
D = pd.DataFrame({"id_A": [1, 1, 1, 1, 1, 2, 2, 2],
                  "id_B": ["A", "A", "B", "B", "C", "B", "B", "C"],
                  "id_C": [1, 2, 1, 2, 2, 1, 2, 2],
                  "name": ["a", "b", "a", "b", "b", "a", "b", "b"]
                  })
I want to use the IDs stored in the "connected_to_X" columns of each dataframe to create a single dataframe that contains all relationships recorded in the three individual dataframes.
What is the most elegant way to combine A, B and C into D?
Currently I am using dicts, lists and for loops, and it's messy and complicated.
D:
|idx |id_A|id_B|id_C|name|
|---:|--:|--:|--:|--:|
| 0 | 1 | A | 1 | a |
| 1 | 1 | A | 2 | b |
| 2 | 1 | B | 1 | a |
| 3 | 1 | B | 2 | b |
| 4 | 1 | C | 2 | b |
| 5 | 2 | B | 1 | a |
| 6 | 2 | B | 2 | b |
| 7 | 2 | C | 2 | b |
You just need to unpivot A and B, then you can join the tables up.
(A
 .melt(id_vars='id')                              # one row per (A id, B id) link
 .merge(B.melt(id_vars='id'),                     # one row per (B id, C id) link
        left_on='value', right_on='id', how='left')
 .merge(C, left_on='value_y', right_on='id')      # attach the names from C
 .drop(columns=['variable_x', 'variable_y', 'value_x'])
 .sort_values(['id_x', 'id_y'])
 .reset_index(drop=True)
 .reset_index()
)
index id_x id_y value_y id name
0 0 1 A 1.0 1 a
1 1 1 A 2.0 2 b
2 2 1 B 1.0 1 a
3 3 1 B 2.0 2 b
4 4 1 C 2.0 2 b
5 5 2 B 1.0 1 a
6 6 2 B 2.0 2 b
7 7 2 C 2.0 2 b
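If you also want the column names from the desired D, a minimal follow-up sketch (my addition, reusing the same chain and just dropping the redundant value_y column and renaming the rest):

out = (A
       .melt(id_vars='id')
       .merge(B.melt(id_vars='id'), left_on='value', right_on='id', how='left')
       .merge(C, left_on='value_y', right_on='id')
       .drop(columns=['variable_x', 'variable_y', 'value_x', 'value_y'])
       .rename(columns={'id_x': 'id_A', 'id_y': 'id_B', 'id': 'id_C'})
       .sort_values(['id_A', 'id_B', 'id_C'])
       .reset_index(drop=True))
print(out)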
I have data like this:
+---+---+---+
| A | B | C |
+---+---+---+
| 1 | 2 | 7 |
| 2 | 2 | 7 |
| 3 | 2 | 1 |
| 3 | 2 | 1 |
| 3 | 2 | 1 |
+---+---+---+
I need to count the occurrences of each unique value in every column and report it like below:
+---+---+---+
| A | 3 | 3 |
| A | 2 | 1 |
| A | 1 | 1 |
| B | 2 | 5 |
| C | 1 | 3 |
| C | 7 | 2 |
+---+---+---+
I have no issue when the number of columns is limited and I can name them manually, but when the input file is big this becomes hard. I need a simple way to produce this output.
Here is the code I have:
import pandas as pd
df=pd.read_csv('1.csv')
A=df['A']
B=df['B']
C=df['C']
df1=A.value_counts()
df2=B.value_counts()
df3=C.value_counts()
all = {'A': df1,'B': df2,'C': df3}
result = pd.concat(all)
result.to_csv('out.csv')
Use DataFrame.stack with SeriesGroupBy.value_counts, and then convert the Series to a DataFrame with Series.rename_axis and Series.reset_index:
df = pd.read_csv('1.csv')
result = (df.stack()
            .groupby(level=1)   # level 1 of the stacked index holds the original column names
            .value_counts()
            .rename_axis(['X', 'Y'])
            .reset_index(name='Z'))
print(result)
X Y Z
0 A 3 3
1 A 1 1
2 A 2 1
3 B 2 5
4 C 1 3
5 C 7 2
result.to_csv('out.csv', index=False)
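If you want the rows ordered exactly as in the question (per column, by count descending, ties broken by value descending), a small follow-up sketch; the sort keys are my assumption about the intended ordering:

result = result.sort_values(['X', 'Z', 'Y'], ascending=[True, False, False])
result.to_csv('out.csv', index=False)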
You can loop over the columns and insert them into a dictionary.
You can initialise the dictionary with all={}. To be scalable, read the column names with colm=df.columns; this gives you all the columns in your df.
Try this code:
import pandas as pd

df = pd.read_csv('1.csv')
all = {}
colm = df.columns
for i in colm:
    all.update({i: df[i].value_counts()})
result = pd.concat(all)
result.to_csv('out.csv')
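A small follow-up sketch (my addition): pd.concat on a dict of Series returns a Series with a (column, value) MultiIndex, so it can be flattened into the same X / Y / Z layout as the other answer:

result = pd.concat(all)
result = result.rename_axis(['X', 'Y']).reset_index(name='Z')
result.to_csv('out.csv', index=False)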
To find the unique values of a column in a dataframe:
df.A.unique()
To know the count of the unique values:
len(df.A.unique())
unique() creates an array; to find the count, use the len() function.
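A minimal sketch of doing the same for every column at once (my addition; pandas also provides DataFrame.nunique for exactly this):

for col in df.columns:
    print(col, len(df[col].unique()))

# built-in equivalent
print(df.nunique())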
I have a dataframe like this:
| a | b | c |
0 | 0 | 0 | 0 |
1 | 5 | 5 | 5 |
I have a dataframe row (or series) like this:
| a | b | c |
0 | 1 | 2 | 3 |
I want to subtract the row from the entire dataframe to obtain this:
| a | b | c |
0 | 1 | 2 | 3 |
1 | 6 | 7 | 8 |
Any help is appreciated, thanks.
Use DataFrame.add or DataFrame.sub and convert the one-row DataFrame to a Series, e.g. with DataFrame.iloc for the first row:
df = df1.add(df2.iloc[0])
#alternative select by row label
#df = df1.add(df2.loc[0])
print (df)
a b c
0 1 2 3
1 6 7 8
Detail:
print (df2.iloc[0])
a 1
b 2
c 3
Name: 0, dtype: int64
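Since the question's text says subtract, a minimal sketch of the subtraction variant (my addition; it simply swaps add for sub):

df = df1.sub(df2.iloc[0])
print(df)
#    a  b  c
# 0 -1 -2 -3
# 1  4  3  2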
You can convert the second dataframe to a NumPy array:
df1 + df2.values
Output:
a b c
0 1 2 3
1 6 7 8
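A short note on the difference (my addition): df2.values drops the labels, so the row is broadcast purely by position, whereas the add/iloc approach aligns on column labels. With matching column order both give the same result:

print(df1 + df2.values)     # positional broadcast
print(df1 + df2.iloc[0])    # label-aligned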
I have a DataFrame df1:
| A | B | C | D |
-----------------
| 0 | 1 | 3 | 4 |
| 2 | 1 | 8 | 4 |
| 0 | 2 | 3 | 1 |
and a DataFrame df2:
| A | D |
---------
| 2 | 2 |
| 3 | 2 |
| 1 | 9 |
I want to replace columns A and D of df1 with the equivalent columns of df2.
Surely I could do something like
df1['A'] = df2['A']
df1['D'] = df2['D']
But I need a solution for doing this automatically since I have thousands of columns.
You can use combine_first:
df2.combine_first(df1)
# A B C D
#0 2 1.0 3.0 2
#1 3 1.0 8.0 2
#2 1 2.0 3.0 9
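One caveat worth noting (my addition): combine_first prefers df2's value wherever it is not NaN, so it only acts as a straight column replacement when df2 has no missing values, and it upcasts the untouched columns to float, as the 1.0/3.0 in B and C show. A minimal sketch for restoring df1's dtypes afterwards:

out = df2.combine_first(df1).astype(df1.dtypes.to_dict())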
The way to do this is with pd.DataFrame.update.
update modifies a dataframe in place, using the non-NA values from another dataframe.
df1.update(df2)
The advantage of this is that your dtypes in df1 are preserved.
df1
A B C D
0 2 1 3 2
1 3 1 8 2
2 1 2 3 9
Another way to do this without updating in place is to use pd.DataFrame.assign with dictionary unpacking over pd.DataFrame.iteritems. However, this would also add any columns that exist only in df2.
df1.assign(**dict(df2.iteritems()))
A B C D
0 2 1 3 2
1 3 1 8 2
2 1 2 3 9
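A small version note (my addition): pd.DataFrame.iteritems was deprecated in pandas 1.5 and removed in 2.0, so on newer versions the same idea should work with items():

df1.assign(**dict(df2.items()))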
A simple for loop should suffice:
for c in df2.columns:
    df1[c] = df2[c]
Or, to replace only the columns that df1 and df2 have in common:
for col in df1.columns:
    if col in df2.columns.tolist():
        df1[col] = df2[col]
I have the following sample dataframe in Python pandas:
+---+------+------+------+
| | col1 | col2 | col3 |
+---+------+------+------+
| 0 | a | d | b |
+---+------+------+------+
| 1 | a | c | b |
+---+------+------+------+
| 2 | c | b | c |
+---+------+------+------+
| 3 | b | b | c |
+---+------+------+------+
| 4 | a | a | d |
+---+------+------+------+
I would like to perform a count of all the 'a,' 'b,' 'c,' and 'd' values across columns 1-3 so that I would end up with a dataframe like this:
+---+--------+-------+
| | letter | count |
+---+--------+-------+
| 0 | a | 4 |
+---+--------+-------+
| 1 | b | 5 |
+---+--------+-------+
| 2 | c | 4 |
+---+--------+-------+
| 3 | d | 2 |
+---+--------+-------+
One way I can do this is to stack the columns on top of each other and THEN do a groupby count, but I feel like there has to be a better way. Can someone help me with this?
You can stack() the dataframe to put all columns into rows and then do value_counts:
df.stack().value_counts()
b 5
c 4
a 4
d 2
dtype: int64
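If you want the exact letter / count dataframe from the question rather than a Series, a minimal follow-up sketch (the sort is just to match the requested alphabetical order):

counts = (df.stack()
            .value_counts()
            .rename_axis('letter')
            .reset_index(name='count')
            .sort_values('letter')
            .reset_index(drop=True))
print(counts)
#   letter  count
# 0      a      4
# 1      b      5
# 2      c      4
# 3      d      2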
You can apply value_counts per column and then sum across the columns:
print (df.apply(pd.value_counts))
col1 col2 col3
a 3.0 1 NaN
b 1.0 2 2.0
c 1.0 1 2.0
d NaN 1 1.0
df1 = df.apply(pd.value_counts).sum(1).reset_index()
df1.columns = ['letter','count']
df1['count'] = df1['count'].astype(int)
print (df1)
letter count
0 a 4
1 b 5
2 c 4
3 d 2