I have a df with more than 300 rows and more than 4000 columns. A sample of the df looks like this:
   AB  BC  DA  DC  FF
0  40  50   4  10  60
1  10  20  10   5  20
I want to create another df by dividing every observation in each row by that row's value in column DC, so that I get a df that looks like this:
   AB  BC   DA  DC  FF
0   4   5  0.4   1   6
1   2   4  2.0   1   4
An idea that came to my mind was iterrows, but I could not find my way around it. Any better suggestions on how to do this?
This should get you what you want:
for column in df.columns:
    if column == 'DC':
        continue  # leave the divisor column untouched
    df[column] = df[column] / df['DC']
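If you'd rather avoid the Python-level loop, the same operation can be done vectorized in one call; a minimal sketch (this builds a new frame rather than modifying df, and DC itself becomes 1, matching the desired output):
# divide every column, row by row, by that row's DC value
df2 = df.div(df['DC'], axis=0)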
I have a df that looks like this:
df:
   col1       col2
0    28  24 and 24
1    11  .1 for .1
2     3         43
I want logic that I can apply to a list of columns (not all the columns in the dataframe) such that, if a cell contains a mix of numbers and text, that particular value is replaced with an empty string, without dropping the entire row.
The new df should look like this:
   col1 col2
0    28
1    11
2     3   43
How would I do this?
Use to_numeric:
l = ['col2']
# anything that can't be parsed as a number becomes NaN, then NaN -> ''
df[l] = df[l].apply(pd.to_numeric, errors='coerce').fillna('')
df
Out[32]:
   col1 col2
0    28
1    11
2     3   43
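The key is errors='coerce': pd.to_numeric turns any value it cannot parse (like '24 and 24') into NaN, and fillna('') then blanks those cells while leaving clean numbers such as 43 intact. The same pattern extends to several columns at once; a sketch, assuming a hypothetical second messy column col3:
l = ['col2', 'col3']
df[l] = df[l].apply(pd.to_numeric, errors='coerce').fillna('')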
I'm trying to update one dataframe with data from another, for one specific column called 'Data'. Both dataframes have a unique ID column called 'ID', and both have a 'Data' column. I want the 'Data' values from df2 to overwrite the corresponding entries in df1's 'Data', for only the rows that exist in df1. Where there is no corresponding 'ID' in df2, the df1 entry should remain.
import pandas as pd
from io import StringIO
data1 = '''\
ID Data Data1
1 AA BB
2 AB BF
3 AC BK
4 AD BL'''
data2 = '''\
ID Data
1 AAB
3 AAL
4 MNL
5 AAP
6 MNX
8 DLP
9 POW'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
Expected outcome for the new df3:
ID Data Data1
1 AAB BB
2 AB BF
3 AAL BK
4 MNL BL
df2 is a master list of values which never changes and has thousands of entries, whereas df1 sometimes only has a few hundred entries.
I have looked at pd.merge and combine_first, but I can't seem to get the right combination:
df3 = pd.merge(df1, df2, on='ID', how='left')
Any help much appreciated.
Create a new dataframe
Here is one way, making use of update:
df3 = df1.set_index('ID')
df3.update(df2.set_index('ID')[['Data']])  # align on ID, update only 'Data'
df3.reset_index(inplace=True)
Or we could use maps/dicts and reassign (Python >= 3.5):
m = {**df1.set_index('ID')['Data'], **df2.set_index('ID')['Data']}
df3 = df1.assign(Data=df1['ID'].map(m))
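In {**a, **b} the right-hand mapping wins on duplicate keys, so overlapping IDs take df2's 'Data' value while df1-only IDs keep their original one.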
Python < 3.5:
m = df1.set_index('ID')['Data']
m.update(df2.set_index('ID')['Data'])
df3 = df1.assign(Data=df1['ID'].map(m))
Update df1
Are you open to updating df1 in place? If 'ID' is the index of both frames:
df1.update(df2)
Or, if 'ID' is not the index:
m = df2.set_index('ID')['Data']
df1.loc[df1['ID'].isin(df2['ID']), 'Data'] = df1['ID'].map(m)
Or:
df1.set_index('ID',inplace=True)
df1.update(df2.set_index('ID'))
df1.reset_index(inplace=True)
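Since you already tried pd.merge: a left merge works too if you let the merged column fall back to the original where df2 has no match. A sketch along those lines (the '_new' suffix is just an arbitrary choice):
df3 = df1.merge(df2, on='ID', how='left', suffixes=('', '_new'))
df3['Data'] = df3['Data_new'].fillna(df3['Data'])
df3 = df3.drop(columns='Data_new')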
Note: There might be something that makes more sense :)
Full example:
import pandas as pd
from io import StringIO
data1 = '''\
ID Data Data1
1 AA BB
2 AB BF
3 AC BK
4 AD BL'''
data2 = '''\
ID Data
1 AAB
3 AAL
4 MNL
5 AAP
6 MNX
8 DLP
9 POW'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+')
df2 = pd.read_csv(StringIO(data2), sep=r'\s+')
m = {**df1.set_index('ID')['Data'], **df2.set_index('ID')['Data']}
df3 = df1[:].assign(Data=df1['ID'].map(m))
print(df3)
Returns:
ID Data Data1
0 1 AAB BB
1 2 AB BF
2 3 AAL BK
3 4 MNL BL
I have a huge data set in a pandas data frame. It looks something like this:
df = pd.DataFrame([[1,2,3,4],[31,14,13,11],[115,613,1313,1]], columns=['c1','c1','c2','c2'])
Here the first two columns have the same name, so they should be concatenated into a single column so that the values sit one below another. The dataframe should then look something like this:
df1 = pd.DataFrame([[1,3],[31,13],[115,1313],[2,4],[14,11],[613,1]], columns=['c1','c2'])
Note: My original dataframe has many columns, so I cannot use a simple concat call to stack them; I also tried the stack function, without luck. What can I do?
Use groupby + cumcount to create a pd.MultiIndex, reassign the columns with the new pd.MultiIndex, and stack:
df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])

df1 = df.copy()
# pair each column name with a running count of how often it has appeared
df1.columns = [df.columns, df.columns.to_series().groupby(level=0).cumcount()]
print(df1.stack().reset_index(drop=True))
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
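stack moves that innermost column level (the running count) down into the row index, which is what places same-named columns below one another; reset_index(drop=True) then discards the helper level.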
Or, with a bit of creativity, in one line:
df.T.set_index(
    df.T.groupby([df.columns]).cumcount(),
    append=True
).unstack().T.reset_index(drop=True)
c1 c2
0 1 3
1 2 4
2 31 13
3 14 11
4 115 1313
5 613 1
You could melt the dataframe, count entries within each column to use as an index for the new dataframe, and then unstack it back, like this:
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])

df1 = (pd.melt(df, var_name='column')
         .assign(n=lambda x: x.groupby('column').cumcount())
         .set_index(['n', 'column'])
         .unstack())
df1.columns = df1.columns.get_level_values(1)
print(df1)
Which produces
column c1 c2
n
0 1 3
1 31 13
2 115 1313
3 2 4
4 14 11
5 613 1
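If every duplicated name repeats the same number of times, a compact alternative is to ravel each block of same-named columns in column-major order (order='F' walks down one duplicate column before moving to the next); a sketch under that assumption:
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [31, 14, 13, 11], [115, 613, 1313, 1]],
    columns=['c1', 'c1', 'c2', 'c2'])

# df[name] selects every column sharing that name as a sub-frame
df1 = pd.DataFrame({name: df[name].to_numpy().ravel(order='F')
                    for name in df.columns.unique()})
print(df1)
This reproduces the order asked for in the question: all of the first c1 column's values, then the second's.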
I have a "1000 rows * 4 columns" DataFrame:
a b c d
1 aa 93 4
2 bb 32 3
...
1000 nn 78 2
[1283 rows x 4 columns]
and I use groupby to group them based on 3 of the columns:
df = df.groupby(['a','b','c']).sum()
print(df)
a b c d
1 aa 93 12
2 bb 32 53
...
1000 nn 78 38
[1283 rows x 1 columns]
However, the result gives me a "1000 rows * 1 columns" DataFrame. So my question is: does groupby concatenate the columns into one column? If so, how can I prevent that? I want to plot my data after grouping it, but I can't, since it only sees one column instead of all four.
Edit: when I call the columns I only get the last column, which means it can't read 'a', 'b', 'c' as columns. Why is that, and how can I mark them as columns again?
df.columns
Index([u'd'], dtype='object')
You can do it this way:
df.groupby(['a','b','c'], as_index=False).sum()
or:
df.groupby(['a','b','c']).sum().reset_index()
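By default, groupby moves the grouping keys into the (Multi)Index, so only 'd' survives as a real column; both variants above keep 'a', 'b', 'c' as ordinary columns, which is what plotting needs. A quick check on the same frame:
grouped = df.groupby(['a','b','c'], as_index=False).sum()
print(grouped.columns)  # Index(['a', 'b', 'c', 'd'], dtype='object')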