I have two DataFrames, df1 and df2 (same index and number of rows), and I would like to create a new DataFrame whose columns are the sums of all combinations of one column from df1 and one column from df2. For example:
Input:
import pandas as pd
df1 = pd.DataFrame([[10,20]])
df2 = pd.DataFrame([[1,2]])
Output:
import pandas as pd
df3 = pd.DataFrame([[11,12,21,22]])
Use MultiIndex.from_product for all combinations, then align both DataFrames to that index with DataFrame.reindex (which repeats each column's values along the matching level) and add them:
mux = pd.MultiIndex.from_product([df1.columns, df2.columns])
df = df1.reindex(mux, level=0, axis=1) + df2.reindex(mux, level=1, axis=1)
df.columns = range(len(df.columns))
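To make the mechanism concrete, here is a minimal sketch (using the example frames from the question) of what the two reindex calls produce before the addition:
import pandas as pd

df1 = pd.DataFrame([[10, 20]])
df2 = pd.DataFrame([[1, 2]])
mux = pd.MultiIndex.from_product([df1.columns, df2.columns])

# df1 broadcast over level 0: columns (0,0) (0,1) (1,0) (1,1) -> 10 10 20 20
left = df1.reindex(mux, level=0, axis=1)
# df2 broadcast over level 1: columns (0,0) (0,1) (1,0) (1,1) ->  1  2  1  2
right = df2.reindex(mux, level=1, axis=1)
print(left + right)  # 11 12 21 22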
IIUC you can do this with numpy.
>>> import numpy as np
>>> n = df1.shape[1]
>>> pd.DataFrame(df1.values.repeat(n) + np.tile(df2.values, n))
0 1 2 3
0 11 12 21 22
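Note that using n for both the repeat and the tile only works because df1 and df2 happen to have the same number of columns; a sketch of a generalization for differing column counts and multiple rows (hypothetical frames for illustration):
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[10, 20], [30, 40]])
df2 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Repeat each df1 column once per df2 column, tile df2 once per df1 column
out = pd.DataFrame(df1.values.repeat(df2.shape[1], axis=1)
                   + np.tile(df2.values, df1.shape[1]))
print(out)
#     0   1   2   3   4   5
# 0  11  12  13  21  22  23
# 1  34  35  36  44  45  46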
I have data in an Excel file, loaded into a DataFrame df, that holds aggregated counts per ID. I am looking to break this down into its distinct counts and create a new record for each.
Data
A B C
2 3 1
Desired
count ID
1 A01
1 A02
1 B01
1 B02
1 B03
1 C01
Doing:
import pandas as pd
from numpy.random import randint
df = pd.DataFrame(columns=['A', 'B', 'C'])
for i in range(5):
    df.loc[i] = ['ID' + str(i)] + list(randint(10, size=2))
I am thinking I can go about it this way; however, this is not stacking all the necessary IDs consecutively.
Any suggestion or advice will be appreciated.
Let's try melt to reshape the data, reindex + repeat to duplicate the rows, and groupby + cumcount + zfill to create the suffixes:
import pandas as pd
df = pd.DataFrame({'A': {0: 2}, 'B': {0: 3}, 'C': {0: 1}})
# Melt Table Into New Form
df = df.melt(col_level=0, value_name='count', var_name='ID')
# Repeat Based on Count
df = df.reindex(df.index.repeat(df['count']))
# Set Count To 1
df['count'] = 1
# Add Suffix to Each ID
df['ID'] = df['ID'] + (
    (df.groupby('ID').cumcount() + 1)
    .astype(str)
    .str.zfill(2)
)
# Reorder Columns
df = df[['count', 'ID']]
print(df)
df:
count ID
0 1 A01
0 1 A02
1 1 B01
1 1 B02
1 1 B03
2 1 C01
Do you want this?
df = pd.DataFrame([[f"{k}{str(i+1).zfill(2)}" for i in range(v)]
for k, v in df.to_dict('records')[0].items()]).stack().reset_index(drop=True).to_frame().rename(columns = {0:'ID'})
df['count'] = 1
Another option:
import numpy as np
df = df.melt()
new_df = (pd.DataFrame(np.repeat(df.variable, df.value))
            .assign(count=1))
new_df.variable = new_df.variable + (new_df.groupby('variable').cumcount() + 1).astype(str).str.zfill(2)
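Run against the one-row frame from the question, this yields (result worked out by hand from the steps above):
print(new_df)
#   variable  count
# 0      A01      1
# 0      A02      1
# 1      B01      1
# 1      B02      1
# 1      B03      1
# 2      C01      1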
I have two text columns, A and B. I want to take the first non-empty string, or, if both A and B have values, take the value from A. C is the column I'm trying to create:
import pandas as pd
cols = ['A','B']
data = [['data','data'],
['','data'],
['',''],
['data1','data2']]
df = pd.DataFrame.from_records(data=data, columns=cols)
A B
0 data data
1 data
2
3 data1 data2
My attempt:
df['C'] = df[cols].apply(lambda row: sorted([val if val else '' for val in row], reverse=True)[0], axis=1) #Reverse sort to avoid picking an empty string
A B C
0 data data data
1 data data
2
3 data1 data2 data2 #I want data1 here
Expected output:
A B C
0 data data data
1 data data
2
3 data1 data2 data1
I think I want the pandas equivalent of SQL's COALESCE.
You can also use numpy.where:
In [1022]: import numpy as np
In [1023]: df['C'] = np.where(df['A'].eq(''), df['B'], df['A'])
In [1024]: df
Out[1024]:
A B C
0 data data data
1 data data
2
3 data1 data2 data1
Let's try idxmax + lookup:
df['C'] = df.lookup(df.index, df.ne('').idxmax(1))
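DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on recent versions the same idxmax idea can be expressed with plain numpy indexing (a minimal sketch, positions instead of labels):
import numpy as np

# Column position of the first non-empty value per row (all-empty rows fall back to A)
first = df[['A', 'B']].ne('').to_numpy().argmax(axis=1)
df['C'] = df[['A', 'B']].to_numpy()[np.arange(len(df)), first]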
Alternatively you can use Series.where:
df['C'] = df['A'].where(lambda x: x.ne(''), df['B'])
A B C
0 data data data
1 data data
2
3 data1 data2 data1
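Equivalently, Series.mask expresses the same thing with the condition inverted (replace A wherever it is empty):
df['C'] = df['A'].mask(df['A'].eq(''), df['B'])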
I have a pandas df:
import pandas as pd
df = pd.DataFrame({'col_a' : ['a','a', 'b'], 'col_b': [1,2,3]})
df.index = [4,5,6]
On this df I apply a query:
df_subset = df.query('col_a == "b"')
Now I have a second dataframe which looks like this:
import numpy as np
df_numpy = pd.DataFrame(np.array([0.1,0.2,0.3]))
which is like the original df but without the "identification" column (col_a), and with the values transformed in some way (in this toy example, divided by 10).
I would like to select from df_numpy the same rows that are selected from df after applying the query; in this toy example, the 3rd row.
EDIT
The tricky part is that the index values between df_numpy and df are not the same.
Is there a way to do that?
If the index values are the same, use:
print (df_numpy[df_numpy.index.isin(df_subset.index)])
0
2 0.3
EDIT: One idea is to create the same index values in both, which is possible because they have the same length:
df = pd.DataFrame({'col_a' : ['a','a', 'b'], 'col_b': [1,2,3]})
df.index = [4,5,6]
df_subset = df.reset_index(drop=True).query('col_a == "b"')
df_numpy = pd.DataFrame(np.array([0.1,0.2,0.3]))
print (df_numpy[df_numpy.reset_index(drop=True).index.isin(df_subset.index)])
0
2 0.3
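Since the two frames line up by row order rather than by label, positional indexing is another option; a small sketch assuming the same toy frames:
import numpy as np

# Row positions matched by the query condition, applied positionally to df_numpy
pos = np.flatnonzero((df['col_a'] == 'b').to_numpy())
print(df_numpy.iloc[pos])
#      0
# 2  0.3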
I'm struggling with properly loading a CSV that has a multi-line header with blanks. The CSV looks like this:
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
What I would like to get is a DataFrame with a two-level column index, where A and B remain addressable as plain columns and C and D each group the sub-columns X, Y, Z (see the expected output in the first answer below).
When I try to load with pd.read_csv(file, header=[0,1], sep=','), the blank cells in the first header row come back as Unnamed: ... placeholder labels instead of being grouped under C and D.
Is there a way to get the desired result?
Note: alternatively, I would also accept a layout with A and B as the row index.
Versions used:
Python: 2.7.8
Pandas 0.16.0
Here is an automated way to fix the column index. First,
pull the column level values into a DataFrame:
columns = pd.DataFrame(df.columns.tolist())
then rename the Unnamed: columns to NaN:
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
and then forward-fill the NaNs:
columns[0] = columns[0].fillna(method='ffill')
so that columns now looks like
In [314]: columns
Out[314]:
0 1
0 NaN A
1 NaN B
2 C X
3 C Y
4 C Z
5 D X
6 D Y
7 D Z
Now we can find the remaining NaNs and fill them with empty strings:
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
Now you can build a new MultiIndex and assign it to df.columns:
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
Putting it all together, if data is
,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
then
import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)
yields
A B C D
X Y Z X Y Z
0 1 2 3 4 5 6 7 8
1 3 4 5 6 7 8 9 0
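A side note for recent pandas versions: fillna(method='ffill') is deprecated there, so the forward-fill step would instead read:
columns[0] = columns[0].ffill()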
There is no magical way of making pandas aware of how you want your index to look; the closest you can get is by specifying a lot yourself, like this:
names = ['A', 'B',
         ('C', 'X'), ('C', 'Y'), ('C', 'Z'),
         ('D', 'X'), ('D', 'Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])
Gives:
      C        D
      X  Y  Z  X  Y  Z
A B
1 2   3  4  5  6  7  8
To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.
pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])
Then access the columns and loop to create your header.
Again, not a very clean solution, but should work.
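A minimal sketch of that dynamic approach (assuming the sample is saved as file.csv; treating blank top-level labels as part of the preceding group is an assumption about the intended layout):
import pandas as pd

df = pd.read_csv('file.csv', header=[0, 1])

new_cols = []
last_top = None
for top, sub in df.columns:
    if top.startswith('Unnamed:'):
        # Blank cell in the first header row: a plain column if no group
        # has started yet, otherwise part of the previous group
        new_cols.append((sub, '') if last_top is None else (last_top, sub))
    else:
        last_top = top
        new_cols.append((top, sub))

df.columns = pd.MultiIndex.from_tuples(new_cols)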
You can read it using (note that the tupleize_cols argument was removed in later versions of pandas):
df = pd.read_csv('file.csv', header=[0, 1], skipinitialspace=True, tupleize_cols=True)
and then
df.columns = pd.MultiIndex.from_tuples(df.columns)
Load the DataFrame with a MultiIndex:
df = pd.read_csv(filelist,header=[0,1], sep=',')
Write a function to replace the index:
def replace_index(df):
    arr = df.columns.values
    l = [list(x) for x in arr]
    # Forward-fill group labels over 'Unnamed' placeholders that follow them
    # (start at 1 so a leading 'Unnamed' never picks up the last column's label)
    for i in range(1, len(l)):
        if l[i][0][:7] == 'Unnamed':
            if l[i-1][0][:7] != 'Unnamed':
                l[i][0] = l[i-1][0]
    # Any 'Unnamed' still left (leading columns like A, B) gets its label
    # promoted from the second header row, leaving the second level empty
    for i in range(len(l)):
        if l[i][0][:7] == 'Unnamed':
            l[i][0] = l[i][1]
            l[i][1] = ''
    index = pd.MultiIndex.from_tuples(l)
    df.columns = index
    return df
Return the new dataframe properly indexed:
replace_index(df)
I used a technique to flatten the multi-index columns into a single level. It works well for me.
your_df.columns = ['_'.join(col).strip() for col in your_df.columns.values]
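For example, on a two-level header this collapses ('C', 'X') into 'C_X' (a small self-contained sketch):
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=pd.MultiIndex.from_tuples([('C', 'X'), ('C', 'Y')]))
df.columns = ['_'.join(col).strip() for col in df.columns.values]
print(df.columns.tolist())  # ['C_X', 'C_Y']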
Import your csv file providing the header row indexes:
df = pd.read_csv('file.csv', header=[0, 1, 2])
Then, you can iterate over each column header, clean it up, assign it to a tuple, then re-assign the DataFrame columns using pd.MultiIndex.from_tuples(list_of_tuples):
df.columns = pd.MultiIndex.from_tuples(
    [tuple(['' if y.find('Unnamed') == 0 else y for y in x]) for x in df.columns]
)
This is the quick one-liner I was looking for when trying to figure this out.