This question already has answers here:
How to implement sql coalesce in pandas
(5 answers)
Closed 1 year ago.
I am trying to create a column in my dataframe which searches each column and checks if the value of at specific row is null or not, if it is not the new column will contain this value, otherwise it will skip it. It is not possible that two columns contains a non null value.
For example:
A B C D E
NaN NaN NaN NaN a
b NaN NaN NaN NaN
NaN NaN NaN NaN NaN
My expected output:
A B C D E new_column
NaN NaN NaN NaN a a
b NaN NaN NaN NaN b
NaN NaN NaN NaN NaN NaN
You can bfill horizontally and then select the first column:
df['new_column'] = df.bfill(axis=1).iloc[:, 0]
Output:
>>> df
A B C D E new_column
0 NaN NaN NaN NaN a a
1 b NaN NaN NaN NaN b
2 NaN NaN NaN NaN NaN NaN
Related
I have a df that a columns with some ID for companies. How can I split this ID in columns?
In this column the values can be 0(NaN) to more than 5 IDs, how to divide each one of them in separate columns?
Here is an example of the column:
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
The division would be at each comma, I imagine an output like this:
columnA
columnB
columnC
4773300
Nan
Nan
NaN
Nan
Nan
6201501
6319400
6202300
8230001
Nan
Nan
And so on depending on the number of IDs
You can use the .str.split method to perform this type of transformation quite readily. The trick is to pass the expand=True parameter so your results are put into a DataFrame instead of a Series containing list objects.
>>> df
ID
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
>>> df['ID'].str.split(',', expand=True)
0 1 2 3 4 5
0 4773300 None None None None None
1 NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 None None None
3 8230001 None None None None None
4 NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470
You can also clean up the output a little for better aesthetics
replace None for NaN
alphabetic column names (though I would opt to not do this as you'll hit errors if a given entry in the ID column has > 26 ids in it.)
join back to original DataFrame
>>> import pandas as pd
>>> from string import ascii_uppercase
>>> (
df['ID'].str.split(',', expand=True)
.replace({None: float('nan')})
.pipe(lambda d:
d.set_axis(
pd.Series(list(ascii_uppercase))[d.columns],
axis=1
)
)
.add_prefix("column")
.join(df)
)
columnA columnB columnC columnD columnE columnF ID
0 4773300 NaN NaN NaN NaN NaN 4773300
1 NaN NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 NaN NaN NaN 6201501,6319400,6202300
3 8230001 NaN NaN NaN NaN NaN 8230001
4 NaN NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470 4742300,4744004,4744003,7319002,4729699,475470
Consider each entry as a string, and parse the string to get to individual values.
from ast import literal_eval
df = pd.read_csv('sample.csv', converters={'company': literal_eval})
words = []
for items in df['company']:
for word in items:
words.append(word)
FYI, This is a good starting point. I do not know what is intended output format needed as of now, since your question is kind of incomplete.
I'm analyzing excel files generated by an organization who publishes yearly reports in Excel files. Each year, the column names (Year, A1, B1, C1, etc) remain identical. But each year the organization publishes those column names that start at different row numbers and column numbers.
Each year I manually search for the starting row and column, but it's tedious work given the number of years of reports to wade through.
So I'd like something like this:
...
df = pd.read_excel('test.xlsx')
start_row,start_col = df.find_columns('Year','A1','B1')
...
Thanks.
Let's say you have three .xlsx files on your desktop prefixed with Yearly_Report that when combined in python look like this after reading into one dataframe with something like: df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]):
0 1 2 3 4 5 6 7 8 9 10
0 A B C NaN NaN NaN NaN NaN NaN NaN NaN
1 1 2 3 NaN NaN NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN A B C NaN NaN NaN NaN NaN NaN
4 NaN NaN 4 5 6 NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN A B C
2 NaN NaN NaN NaN NaN NaN NaN NaN 4 5 6
As you can see, the columns and values are scattered across various columns and rows. The following steps would get you the desired result. First, you need to pd.concat the files and .dropna rows. Then, transpose the dataframe with .T before removing all cells with NaN values. Next, revert the dataframe back with another transpose .T. Finally, simply name the columns and drop rows that are equal to the column headers.
import glob, os
import pandas as pd
main_folder = 'Desktop/'
yearly_files = glob.glob(f'{main_folder}Yearly_Report*.xlsx')
df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]) \
.dropna(how='all').T \
.apply(lambda x: pd.Series(x.dropna().values)).T
df.columns = ['A','B','C']
df = df[df['A'] != 'A']
df
output:
A B C
1 1 2 3
4 4 5 6
2 4 5 6
Soething Like this not totally sure what you are looking for
df = pd.read_excel('test.xlsx')
for i in df.index:
print(df.loc[i,'Year'])
print(df.loc[i, 'A1'])
print(df.loc[i, "B1"])
I am trying to insert a list of data into a multi-level pandas dataframe.
It seems to work just fine, but when I view the entire dataframe, the new sub-row is not there.
Here is an example:
Create an empty multi-index dataframe:
ind = pd.MultiIndex.from_product([['A','B','C'], ['a', 'b','c']]) #set up index
df = pd.DataFrame(columns=['col1'], index=ind) #create empty df with multi-level nested index
print(df)
col1
A a NaN
b NaN
c NaN
B a NaN
b NaN
c NaN
C a NaN
b NaN
c NaN
Inserting a new column works fine:
newcol = 'col2' #new column name
df[newcol] = np.nan #fill new column with nans
print(df)
col1 col2
A a NaN NaN
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
Inserting data into an existing sub-row works with point data but not with a list:
df[newcol]['A','a'] = 1 #works with point data but not with list
print(df)
col1 col2
A a NaN 1.0
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
Inserting into new sub-row looks OK when viewing just the one column:
df[newcol]['A','d'] = [1,2,3] #insert into new sub-row 'd'
print(df[newcol]) #view just new column
A a 1
b NaN
c NaN
B a NaN
b NaN
c NaN
C a NaN
b NaN
c NaN
A d [1, 2, 3]
Name: col2, dtype: object
But it's not visible when viewing the entire dataframe - why?
print(df)
col1 col2
A a NaN 1.0
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
Also, when I try different methods of inserting the data, I run into issues:
Using df.loc[] works perfectly for a single data point, but not for lists:
df.loc[('A','f'), newcol] = 1 #create new row at [(row,sub-row),column] & insert point data
print(df) #works fine
col1 col2
A a NaN 1.0
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
A f NaN 1.0
Same method but inserting a list returns an error:
df.loc[('A','f'), newcol] = [1,2,3] #create new row at [(row,sub-row),column] & insert list data
TypeError: object of type 'numpy.float64' has no len()
Using df.at[] returns error with both point and list data:
data.at[('A','f'), newcol] = [1,2,3] #insert into existing sub-row 'f'
KeyError: ('A', 'f')
when you do df[newcol]['A','d'] = [1,2,3], it is chained-indexing assignment, so the result is unpredictable. Pandas doesn't guarantee correct behaviors when you do chained-indexing. When you run that command, pandas executes with a warning. This warning even includes the link to the full explanation in case you want to know. I don't go into the detail because the link in the warning explains very well on this chained-indexing.
On assigning list to a cell, it is always a pain. However, it is doable. I guess your issue with df.loc[('A','f'), newcol] = [1,2,3] because col2 is dtype float, so pandas doesn't consider [1,2,3] as a single object list. It considers [1,2,3] as a list of multiple numeric values, so it failed. I don't know whether it is a bug or intentional.
To solve your issue with .loc, convert col2 to dtype object and do assignment
df['col2'] = df['col2'].astype('O')
df.loc[('A','f'), 'col2'] = [1,2,3]
print(df)
Out[1911]:
col1 col2
A a NaN NaN
b NaN NaN
c NaN NaN
B a NaN NaN
b NaN NaN
c NaN NaN
C a NaN NaN
b NaN NaN
c NaN NaN
A f NaN [1, 2, 3]
print(df['col2'])
Out[1912]:
A a NaN
b NaN
c NaN
B a NaN
b NaN
c NaN
C a NaN
b NaN
c NaN
A f [1, 2, 3]
Name: col2, dtype: object
Say there is a csv file as follows:
# data.csv
0,1,2,3,4
a,3.0,3.0,3.0,3.0,3.0
b,3.0,3.0,3.0,3.0,3.0
c,3.0,3.0,3.0,3.0,3.0
d,3.0,3.0,3.0,3.0,3.0
Now I create two dataframes: one from the csv file, another using DataFrame().
I expect both DataFrame to be equal.
# Read the csv file into a pandas.DataFrame
A = pandas.read_csv('data.csv')
# Create (same?) dataframe by hand
B = pandas.DataFrame(3*numpy.ones((4,5)), index=['a', 'b', 'c', 'd'])
However, if I substract them, I obtain:
print(A-B)
0 1 2 3 4 0 1 2 3 4
a NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
b NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
c NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
d NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Any idea(s) why?
DataFrames are not equal, because in A are columns names strings, in B are integers.
So need convert integers columns to integers:
A = pandas.read_csv('data.csv').rename(columns=int)
Or convert B columns to strings:
B = pandas.DataFrame(3*numpy.ones((4,5)), index=['a', 'b', 'c', 'd']).rename(columns=str)
This is to go further from the following thread:
How to do join of multiindex dataframe with a single index dataframe?
The multi-indices of df1 are sublevel indices of df2.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: import itertools
In [4]: inner = ('a','b')
In [5]: outer = ((10,20), (1,2))
In [6]: cols = ('one','two','three','four')
In [7]: sngl = pd.DataFrame(np.random.randn(2,4), index=inner, columns=cols)
In [8]: index_tups = list(itertools.product(*(outer + (inner,))))
In [9]: index_mult = pd.MultiIndex.from_tuples(index_tups)
In [10]: mult = pd.DataFrame(index=index_mult, columns=cols)
In [11]: sngl
Out[11]:
one two three four
a 2.946876 -0.751171 2.306766 0.323146
b 0.192558 0.928031 1.230475 -0.256739
In [12]: mult
Out[12]:
one two three four
10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
In [13]: mult.ix[(10,1)] = sngl
In [14]: mult
Out[14]:
one two three four
10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
# the new dataframes
sng2=pd.concat([sng1,sng1],keys=['X','Y'])
mult2=pd.concat([mult,mult],keys=['X','Y'])
In [110]:
sng2
Out[110]:
one two three four
X a 0.206810 -1.056264 -0.572809 -0.314475
b 0.514873 -0.941380 0.132694 -0.682903
Y a 0.206810 -1.056264 -0.572809 -0.314475
b 0.514873 -0.941380 0.132694 -0.682903
In [121]: mult2
Out[121]:
one two three four
X 10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
Y 10 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
20 1 a NaN NaN NaN NaN
b NaN NaN NaN NaN
2 a NaN NaN NaN NaN
b NaN NaN NaN NaN
the code above is long, please scroll
The two multilevel indices of sng2 share the 1st and 4th indices of mul2. ('X','a') for example.
#DSM proposed a solution to work with a multiindex df2 and single index df1
mult[:] = sngl.loc[mult.index.get_level_values(2)].values
BUt DataFrame.index.get_level_values(2) can only work for one level of index.
It's not clear from the question which index levels the data frames share. I think you need to revise the set-up code as it gives an error at the definition of sngl. Anyway, suppose mult shares the first and second level with sngl you can just drop the second level from the index of mult and index in:
mult[:] = sngl.loc[mult.index.droplevel(2)].values
On a side note, you can construct a multi index from a product directly using pd.MultiIndex.from_product rather than using itertools