I have two DataFrames, df1 and df2 (same index and number of rows), and I would like to create a new DataFrame whose columns are the sums of all combinations of one column from df1 and one column from df2. For example:
Input:
import pandas as pd
df1 = pd.DataFrame([[10,20]])
df2 = pd.DataFrame([[1,2]])
Output:
import pandas as pd
df3 = pd.DataFrame([[11,12,21,22]])
Use MultiIndex.from_product for all combinations, then align both DataFrames to that index with DataFrame.reindex (which repeats each column's values along the matching level) and add them:
mux = pd.MultiIndex.from_product([df1.columns, df2.columns])
df = df1.reindex(mux, level=0, axis=1) + df2.reindex(mux, level=1, axis=1)
df.columns = range(len(df.columns))
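To make the mechanism concrete, here is a minimal sketch (using the example frames from the question) of what the two reindex calls produce before the addition:
import pandas as pd

df1 = pd.DataFrame([[10, 20]])
df2 = pd.DataFrame([[1, 2]])
mux = pd.MultiIndex.from_product([df1.columns, df2.columns])

# df1 broadcast over level 0: columns (0,0) (0,1) (1,0) (1,1) -> 10 10 20 20
left = df1.reindex(mux, level=0, axis=1)
# df2 broadcast over level 1: columns (0,0) (0,1) (1,0) (1,1) ->  1  2  1  2
right = df2.reindex(mux, level=1, axis=1)
print(left + right)  # 11 12 21 22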
IIUC you can do this with numpy.
>>> import numpy as np
>>> n = df1.shape[1]
>>> pd.DataFrame(df1.values.repeat(n) + np.tile(df2.values, n))
0 1 2 3
0 11 12 21 22
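Note that using n for both the repeat and the tile only works because df1 and df2 happen to have the same number of columns; a sketch of a generalization for differing column counts and multiple rows (hypothetical frames for illustration):
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[10, 20], [30, 40]])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Repeat each df1 column once per df2 column, tile df2 once per df1 column
out = pd.DataFrame(df1.values.repeat(df2.shape[1], axis=1)
                   + np.tile(df2.values, df1.shape[1]))
print(out)
#     0   1   2   3   4   5
# 0  11  12  13  21  22  23
# 1  34  35  36  44  45  46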
I have data in an Excel file, loaded into a DataFrame df, that holds aggregated counts per ID. I am looking to break this down into its distinct counts and create a new record for each.
Data
A B C
2 3 1
Desired
count ID
1 A01
1 A02
1 B01
1 B02
1 B03
1 C01
Doing:
import pandas as pd
from numpy.random import randint
df = pd.DataFrame(columns=['A', 'B', 'C'])
for i in range(5):
    df.loc[i] = ['ID' + str(i)] + list(randint(10, size=2))
I am thinking I can go about it this way; however, this is not stacking all the necessary IDs consecutively.
Any suggestion or advice will be appreciated.
Let's try melt to reshape the data, reindex + repeat to duplicate the rows, and groupby + cumcount + zfill to create the suffixes:
import pandas as pd
df = pd.DataFrame({'A': {0: 2}, 'B': {0: 3}, 'C': {0: 1}})
# Melt Table Into New Form
df = df.melt(col_level=0, value_name='count', var_name='ID')
# Repeat Based on Count
df = df.reindex(df.index.repeat(df['count']))
# Set Count To 1
df['count'] = 1
# Add Suffix to Each ID
df['ID'] = df['ID'] + (
    (df.groupby('ID').cumcount() + 1)
    .astype(str)
    .str.zfill(2)
)
# Reorder Columns
df = df[['count', 'ID']]
print(df)
df:
count ID
0 1 A01
0 1 A02
1 1 B01
1 1 B02
1 1 B03
2 1 C01
Do you want this?
df = pd.DataFrame([[f"{k}{str(i+1).zfill(2)}" for i in range(v)]
for k, v in df.to_dict('records')[0].items()]).stack().reset_index(drop=True).to_frame().rename(columns = {0:'ID'})
df['count'] = 1
Another option:
import numpy as np
df = df.melt()
new_df = (pd.DataFrame(np.repeat(df.variable, df.value))
            .assign(count=1))
new_df.variable = new_df.variable + (new_df.groupby('variable').cumcount() + 1).astype(str).str.zfill(2)
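Run against the one-row frame from the question, this yields (result worked out by hand from the steps above):
print(new_df)
#   variable  count
# 0      A01      1
# 0      A02      1
# 1      B01      1
# 1      B02      1
# 1      B03      1
# 2      C01      1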
I have two text columns, A and B. I want to take the first non-empty string, or, if both A and B have values, take the value from A. C is the column I'm trying to create:
import pandas as pd
cols = ['A','B']
data = [['data','data'],
['','data'],
['',''],
['data1','data2']]
df = pd.DataFrame.from_records(data=data, columns=cols)
A B
0 data data
1 data
2
3 data1 data2
My attempt:
df['C'] = df[cols].apply(lambda row: sorted([val if val else '' for val in row], reverse=True)[0], axis=1) #Reverse sort to avoid picking an empty string
A B C
0 data data data
1 data data
2
3 data1 data2 data2 #I want data1 here
Expected output:
A B C
0 data data data
1 data data
2
3 data1 data2 data1
I think I want the pandas equivalent of SQL's COALESCE.
You can also use numpy.where:
In [1022]: import numpy as np
In [1023]: df['C'] = np.where(df['A'].eq(''), df['B'], df['A'])
In [1024]: df
Out[1024]:
A B C
0 data data data
1 data data
2
3 data1 data2 data1
Let's try idxmax + lookup:
df['C'] = df.lookup(df.index, df.ne('').idxmax(1))
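DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on recent versions the same idxmax idea can be expressed with plain numpy indexing (a minimal sketch, positions instead of labels):
import numpy as np

# Column position of the first non-empty value per row (all-empty rows fall back to A)
first = df[['A', 'B']].ne('').to_numpy().argmax(axis=1)
df['C'] = df[['A', 'B']].to_numpy()[np.arange(len(df)), first]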
Alternatively you can use Series.where:
df['C'] = df['A'].where(lambda x: x.ne(''), df['B'])
A B C
0 data data data
1 data data
2
3 data1 data2 data1
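Equivalently, Series.mask expresses the same thing with the condition inverted (replace A wherever it is empty):
df['C'] = df['A'].mask(df['A'].eq(''), df['B'])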
I have a pandas df:
import pandas as pd
df = pd.DataFrame({'col_a' : ['a','a', 'b'], 'col_b': [1,2,3]})
df.index = [4,5,6]
On this df I apply a query:
df_subset = df.query('col_a == "b"')
Now I have a second dataframe which looks like this:
import numpy as np
df_numpy = pd.DataFrame(np.array([0.1,0.2,0.3]))
which is like the original df but without the "identification" column (col_a), and with the values transformed in some way (in this toy example, divided by 10).
I would like to select from df_numpy the same rows that are selected from df after applying the query; in this toy example, the 3rd row.
EDIT
The tricky part is that the index values between df_numpy and df are not the same.
Is there a way to do that?
If the index values are the same, use:
print (df_numpy[df_numpy.index.isin(df_subset.index)])
0
2 0.3
EDIT: One idea is to create the same index values in both, which is possible because they have the same length:
df = pd.DataFrame({'col_a' : ['a','a', 'b'], 'col_b': [1,2,3]})
df.index = [4,5,6]
df_subset = df.reset_index(drop=True).query('col_a == "b"')
df_numpy = pd.DataFrame(np.array([0.1,0.2,0.3]))
print (df_numpy[df_numpy.reset_index(drop=True).index.isin(df_subset.index)])
0
2 0.3
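Since the two frames line up by row order rather than by label, positional indexing is another option; a small sketch assuming the same toy frames:
import numpy as np

# Row positions matched by the query condition, applied positionally to df_numpy
pos = np.flatnonzero((df['col_a'] == 'b').to_numpy())
print(df_numpy.iloc[pos])
#      0
# 2  0.3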
I'm struggling with properly loading a CSV that has a multi-line header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is a DataFrame with a two-level column index, where A and B remain addressable as plain columns and C and D each group the sub-columns X, Y, Z (see the expected output in the first answer below).
When I try to load with pd.read_csv(file, header=[0,1], sep=','), the blank cells in the first header row come back as Unnamed: ... placeholder labels instead of being grouped under C and D.
Is there a way to get the desired result?
Note: alternatively, I would also accept a layout with A and B as the row index.
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First,
pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].fillna(method='ffill')
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
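A side note for recent pandas versions: fillna(method='ffill') is deprecated there, so the forward-fill step would instead read:
columns[0] = columns[0].ffill()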
There is no magical way of making pandas aware of how you want your index to look; the closest you can get is by specifying a lot yourself, like this:
names = ['A', 'B',
         ('C', 'X'), ('C', 'Y'), ('C', 'Z'),
         ('D', 'X'), ('D', 'Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])
Gives:
      C        D
      X  Y  Z  X  Y  Z
A B
1 2   3  4  5  6  7  8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
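A minimal sketch of that dynamic approach (assuming the sample is saved as file.csv; treating blank top-level labels as part of the preceding group is an assumption about the intended layout):
import pandas as pd

df = pd.read_csv('file.csv', header=[0, 1])

new_cols = []
last_top = None
for top, sub in df.columns:
    if top.startswith('Unnamed:'):
        # Blank cell in the first header row: a plain column if no group
        # has started yet, otherwise part of the previous group
        new_cols.append((sub, '') if last_top is None else (last_top, sub))
    else:
        last_top = top
        new_cols.append((top, sub))

df.columns = pd.MultiIndex.from_tuples(new_cols)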
You can read it using (note that the tupleize_cols argument was removed in later versions of pandas):
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)
Load the DataFrame with a MultiIndex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    # Forward-fill group labels over 'Unnamed' placeholders that follow them
    # (start at 1 so a leading 'Unnamed' never picks up the last column's label)
    for i in range(1, len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    # Any 'Unnamed' still left (leading columns like A, B) gets its label
    # promoted from the second header row, leaving the second level empty
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            l[i][0] = l[i][1]
            l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten the multi-index columns into a single level. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
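For example, on a two-level header this collapses ('C', 'X') into 'C_X' (a small self-contained sketch):
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=pd.MultiIndex.from_tuples([('C', 'X'), ('C', 'Y')]))
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df.columns.tolist())  # ['C_X', 'C_Y']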
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then, you can iterate over each column header, clean it up, assign it to a tuple, then re-assign the DataFrame columns using pd.MultiIndex.from_tuples(list_of_tuples):
df.columns = pd.MultiIndex.from_tuples(
    [tuple(['' if y.find('Unnamed') == 0 else y for y in x]) for x in df.columns]
)
This is the quick one-liner I was looking for when trying to figure this out.