I want to append 2 dataframes:
data1:
a
1 a
2 b
3 c
4 d
5 e
data2:
b
1 f
2 g
3 h
4 i
5 j
output:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
Currently I am using:
all_data = data1.append(data2, ignore_index=True)
This gives me the result:
a b
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
i.e. in different columns.
How can I get them into the same column?
I also tried converting the dataframes into lists and then trying to append them, but it gave me this error:
TypeError: append() takes no keyword arguments
Also, is there any other function to remove duplicates from a dataframe of strings? The drop_duplicates() function does not work in my case; the data still has duplicates.
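On the drop_duplicates side question: it returns a new frame rather than working in place, and stray whitespace in string data often hides duplicates. A minimal hedged sketch, assuming a frame all_data with a string column 'a':
all_data['a'] = all_data['a'].str.strip()   # whitespace often hides duplicates
all_data = all_data.drop_duplicates()       # returns a copy; must be assigned back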
You need to change one column name so that append can detect what you want to do:
data2.columns = ["a"]
or
data1.columns = ["b"]
And then, after using data2.columns = ["a"]:
all_data = data1.append(data2, ignore_index=True)
all_data
a
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
The resulting column is named after data1's column, and you can rename it if you want:
all_data.columns = ["Foo"]
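Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions pd.concat is the replacement. A minimal sketch, assuming the same data1/data2 as in the question:
import pandas as pd

data1 = pd.DataFrame({'a': list('abcde')})
data2 = pd.DataFrame({'b': list('fghij')})

data2.columns = ['a']   # align the column names first
all_data = pd.concat([data1, data2], ignore_index=True)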
merge and concat work on keys, and in this case there are no common columns. However, why not use numpy's append and build the dataframe from the values?
In [68]: pd.DataFrame(pd.np.append(data1.values, data2.values), columns=['A'])
Out[68]:
A
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
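pd.np was only an alias for numpy and has been removed in pandas 2.0; on current versions import numpy directly. A sketch under that assumption:
import numpy as np

# np.append flattens by default, so the two (5, 1) value arrays
# collapse into a single 10-element vector
pd.DataFrame(np.append(data1.values, data2.values), columns=['A'])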
Alternatively, rename one column so that both frames share it (here df1 and df2 stand for data1 and data2):
df1.columns = ['b']
Out[78]:
b
0 a
1 b
2 c
3 d
4 e
pd.concat([df1 , df2] , ignore_index=True)
Out[80]:
b
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
9 j
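The renaming step can also be skipped by concatenating the underlying Series and converting back to a frame. A minimal sketch, assuming the single-column frames from the question:
all_data = pd.concat([data1['a'], data2['b']], ignore_index=True).to_frame('a')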
I'm trying to count values in two columns and then put the results in the same table.
data = { "before": list("ABCDEFABDCFEFF"),
         "after" : list("FABFCFFEEDEBFF") }
df = pd.DataFrame(data)
Output
before after
0 A F
1 B A
2 C B
3 D F
4 E C
5 F F
6 A F
7 B E
8 D E
9 C D
10 F E
11 E B
12 F F
13 F F
I've achieved something close to what I want, but this looks messy, and I'm hoping for a "smoother" solution.
df.melt().groupby("variable")["value"].value_counts().to_frame().unstack()
Output:
value
value A B C D E F
variable
after 1 2 1 1 3 6
before 2 2 2 2 2 4
df.apply(lambda x: x.value_counts())
If you want to have before and after as the row indexes as shown in your current output, you should use the following.
df.apply(lambda x: x.value_counts()).transpose()
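If some letter ever appears in only one of the columns, value_counts leaves NaN in the other row; chaining fillna keeps the table integer-valued. A hedged sketch using the sample frame from the question:
counts = df.apply(lambda x: x.value_counts()).transpose().fillna(0).astype(int)
print(counts)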
A different way with melt using pivot_table:
>>> df.melt().assign(count=1).pivot_table('count', 'variable', 'value', aggfunc='count')
value A B C D E F
variable
after 1 2 1 1 3 6
before 2 2 2 2 2 4
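pd.crosstab expresses the same pivot over the melted frame a little more directly; a sketch with the same data:
m = df.melt()
print(pd.crosstab(m['variable'], m['value']))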
I need to apply a function to all rows of a dataframe.
I have this function, which returns a list of column names where the value is 1:
def find_column(x):
    a = []
    for column in df.columns:
        if df.loc[x, column] == 1:
            a = a + [column]
    return a
It works if I just pass in an index, for example:
print(find_column(1))
but:
df['new_col'] = df.apply(find_column,axis=1)
does not work.
Any ideas?
Thanks!
With apply(axis=1) the function receives each row, so x is a Series whose index matches the column names; that makes it possible to filter the index by the matching values and convert it to a list:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,1,4,5,5,1],
'C':[7,1,9,4,2,3],
'D':[1,1,5,7,1,1],
'E':[5,1,6,9,1,4],
'F':list('aaabbb')
})
def find_column(x):
    return x.index[x == 1].tolist()
df['new'] = df.apply(find_column,axis=1)
print (df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
Another idea is to use DataFrame.dot with a boolean mask from DataFrame.eq, then strip the trailing separator and split with Series.str.split:
df['new'] = df.eq(1).dot(df.columns + ',').str.rstrip(',').str.split(',')
print (df)
A B C D E F new
0 a 4 7 1 5 a [D]
1 b 1 1 1 1 a [B, C, D, E]
2 c 4 9 5 6 a []
3 d 5 4 7 9 b []
4 e 5 2 1 1 b [D, E]
5 f 1 3 1 4 b [B, D]
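One caveat with the dot approach: rows with no 1 at all yield an empty string, and ''.split(',') returns [''] rather than []. If that distinction matters, a hedged workaround (assuming the original frame, before the new column is added):
joined = df.eq(1).dot(df.columns + ',').str.rstrip(',')
df['new'] = [x.split(',') if x else [] for x in joined]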
I have to copy columns from one DataFrame A to another DataFrame B. The column names in A and B do not match.
What is the best way to do it? There are several columns like this. Do I need to write B["SO"] = A["Sales Order"] etc. for each column?
I would use pd.concat:
combined_df = pd.concat([df1, df2[['column_a', 'column_b']]], axis=1)
It also gives you the power to concat different-sized dataframes, do outer joins, etc.
Use:
df1 = pd.DataFrame({
'SO':list('abcdef'),
'RI':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
})
print (df1)
SO RI C
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
df2 = pd.DataFrame({
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
})
print (df2)
D E F
0 1 5 a
1 3 3 a
2 5 6 a
3 7 9 b
4 1 2 b
5 0 4 b
Create a dictionary for renaming, select the matching columns, rename them with the dict, and DataFrame.join back to the original (the DataFrames are matched by index values):
d = {'SO':'Sales Order',
'RI':'Retail Invoices'}
df11 = df1[d.keys()].rename(columns=d)
print (df11)
Sales Order Retail Invoices
0 a 4
1 b 5
2 c 4
3 d 5
4 e 5
5 f 4
df = df2.join(df11)
print (df)
D E F Sales Order Retail Invoices
0 1 5 a a 4
1 3 3 a b 5
2 5 6 a c 4
3 7 9 b d 5
4 1 2 b e 5
5 0 4 b f 4
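join aligns on the index, so this pairing works because df11 kept df1's row labels. If the frames carried unrelated indexes, resetting both first would keep the pairing positional; a sketch of that variant:
df = df2.reset_index(drop=True).join(df11.reset_index(drop=True))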
Make a dictionary of abbreviations and try this code. For example:
full_form_dict = {'SO':'Sales Order',
'RI':'Retail Invoices',}
A_col = list(A.columns)
B_col = [v for k,v in full_form_dict.items() if k in A_col]
# to loop over A_col
# B_col = [v for col in A_col for k,v in full_form_dict.items() if k == col]
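The comprehension only collects the renamed labels; the copy itself still has to assign column by column, which a loop over the mapping can automate. A hedged sketch, assuming (as in the question) that A carries the full names, B gets the abbreviations, and both share row order:
for short, full in full_form_dict.items():
    if full in A.columns:
        B[short] = A[full].values   # .values sidesteps index alignment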
Sorry if this seems simple, but I have been struggling to find an answer to this.
I have a large dataframe of the format in the picture:
Each row can be uniquely identified by the multi-index built from the columns "trip_id", "direction_id", "stop_sequence".
I would like a loop-based method to create a Python dictionary of dataframes, where each dataframe is the subset of the large dataframe containing all the rows for one "trip_id" + "direction_id" multi-index combination.
At the end of the loop I would like to access each dataframe with a simple key, either an integer such as 0 to 10,000 or the combination of trip_id and direction_id.
E.g. for the image above, I would like all the rows where the trip_id is "17067064.T0.2-EPP-F-mjp-1.8.R" and the direction_id is "1" to be in one dataframe of this dictionary collection.
Thank you for your help.
Use groupby with a dictionary comprehension:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,5,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
}).set_index(['F','B','C'])
print (df)
A D E
F B C
a 4 7 a 1 5
5 8 b 3 3
9 c 5 6
b 5 4 d 7 9
2 e 1 2
4 3 f 0 4
#python 3.6+
dfs = {f'{a}_{b}':v for (a, b), v in df.groupby(level=['F','B'])}
#python below 3.6
#dfs = {'{}_{}'.format(a,b):v for (a, b), v in df.groupby(level=['F','B'])}
print (dfs)
{'a_4': A D E
F B C
a 4 7 a 1 5, 'a_5': A D E
F B C
a 5 8 b 3 3
9 c 5 6, 'b_4': A D E
F B C
b 4 3 f 0 4, 'b_5': A D E
F B C
b 5 4 d 7 9
2 e 1 2}
print (dfs['a_4'])
A D E
F B C
a 4 7 a 1 5
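If tuple keys are acceptable, the string formatting can be dropped entirely: materialising the groupby with dict gives a {(F, B): frame} mapping directly. A sketch:
dfs = dict(tuple(df.groupby(level=['F','B'])))
print(dfs[('a', 4)])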
In a pandas dataframe I have a column that looks like:
0 M
1 E
2 L
3 M.1
4 M.2
5 M.3
6 E.1
7 E.2
8 E.3
9 E.4
10 L.1
11 L.2
12 M.1.a
13 M.1.b
14 M.1.c
15 M.2.a
16 M.3.a
17 E.1.a
18 E.1.b
19 E.1.c
20 E.2.a
21 E.3.a
22 E.3.b
23 E.4.a
I need to group all the values whose first element is E, M, or L; then, for each group, I need to create a subgroup keyed on the second element (1, 2, or 3), which will contain a record for each lowercase letter (a, b, c, ...).
Ideally the solution should work for any number of concatenated levels (in this case the number of levels is 3, e.g. A.1.a). The desired output is:
0 1 2
E 1 a
b
c
2 a
3 a
b
4 a
L 1
2
M 1 a
b
c
2 a
3 a
I tried with:
df.groupby([0,1,2]).count()
But the result is missing the L group because it has no records at the last sub-level.
A workaround is to add a dummy variable and then remove it ... like:
df[2][(df[0]=='L') & (df[2].isnull()) & (df[1].notnull())]='x'
df = df.replace(np.nan,' ', regex=True)
df.sort_values(0, ascending=False, inplace=True)
newdf = df.groupby([0,1,2]).count()
which gives:
0 1 2
E 1 a
b
c
2 a
3 a
b
4 a
L 1 x
2 x
M 1 a
b
c
2 a
3 a
I then deal with the dummy entry x later in my code ...
How can I avoid this hackish way of using groupby?
Assuming the column under consideration is represented by s, we can:
Split on the "." delimiter with expand=True to produce an expanded DF.
Define fnc: it checks whether all elements of the grouped frame are None; if so it replaces them with a dummy entry "", built via a list comprehension. A Series constructor is then called on the filtered list, and any remaining Nones are removed with dropna.
Perform groupby w.r.t. the 0 and 1 column names and apply fnc to column 2.
split_str = s.str.split(".", expand=True)
fnc = lambda g: pd.Series(["" if all(x is None for x in g) else x for x in g]).dropna()
split_str.groupby([0, 1])[2].apply(fnc)
produces:
0 1
E 1 1 a
2 b
3 c
2 1 a
3 1 a
2 b
4 1 a
L 1 0
2 0
M 1 1 a
2 b
3 c
2 1 a
3 1 a
Name: 2, dtype: object
To obtain a flattened DF, reset the indices same as the levels used to group the DF before:
split_str.groupby([0, 1])[2].apply(fnc).reset_index(level=[0, 1]).reset_index(drop=True)
produces:
0 1 2
0 E 1 a
1 E 1 b
2 E 1 c
3 E 2 a
4 E 3 a
5 E 3 b
6 E 4 a
7 L 1
8 L 2
9 M 1 a
10 M 1 b
11 M 1 c
12 M 2 a
13 M 3 a
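On pandas >= 1.1 the NaN-dropping behaviour that motivated the dummy 'x' entry can also be switched off directly, avoiding the workaround from the question altogether. A hedged sketch, assuming df already holds the three split levels as columns 0, 1, 2:
newdf = df.groupby([0, 1, 2], dropna=False).count()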
Maybe you can find a way with a regex.
import pandas as pd
df = pd.read_clipboard(header=None).iloc[:, 1]
df2 = df.str.extract(r'([A-Z])\.?([0-9]?)\.?([a-z]?)')
print(df2.set_index([0, 1]))
and the result is:
2
0 1
M
E
L
M 1
2
3
E 1
2
3
4
L 1
2
M 1 a
1 b
1 c
2 a
3 a
E 1 a
1 b
1 c
2 a
3 a
3 b
4 a
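From there the same kind of count as in the question can be taken over the extracted levels; note that extract fills non-matching optional groups with empty strings rather than NaN, so no groups get dropped. A sketch assuming df2 from above:
print(df2.groupby([0, 1]).size())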