I have a df that has a column with some IDs for companies. How can I split these IDs into columns?
In this column each value can hold anywhere from 0 IDs (NaN) to more than 5 IDs. How can I divide each one of them into separate columns?
Here is an example of the column:
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
The division would be at each comma, I imagine an output like this:
columnA columnB columnC
4773300 NaN NaN
NaN NaN NaN
6201501 6319400 6202300
8230001 NaN NaN
And so on depending on the number of IDs
You can use the .str.split method to perform this type of transformation quite readily. The trick is to pass the expand=True parameter so your results are put into a DataFrame instead of a Series containing list objects.
>>> df
ID
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
>>> df['ID'].str.split(',', expand=True)
0 1 2 3 4 5
0 4773300 None None None None None
1 NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 None None None
3 8230001 None None None None None
4 NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470
You can also clean up the output a little for better aesthetics:
- replace None with NaN
- use alphabetic column names (though I would opt not to do this, as you'll hit errors if a given entry in the ID column has more than 26 IDs in it)
- join back to the original DataFrame
>>> import pandas as pd
>>> from string import ascii_uppercase
>>> (
...     df['ID'].str.split(',', expand=True)
...     .replace({None: float('nan')})
...     .pipe(lambda d:
...         d.set_axis(
...             pd.Series(list(ascii_uppercase))[d.columns],
...             axis=1
...         )
...     )
...     .add_prefix("column")
...     .join(df)
... )
columnA columnB columnC columnD columnE columnF ID
0 4773300 NaN NaN NaN NaN NaN 4773300
1 NaN NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 NaN NaN NaN 6201501,6319400,6202300
3 8230001 NaN NaN NaN NaN NaN 8230001
4 NaN NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470 4742300,4744004,4744003,7319002,4729699,475470
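A minimal, self-contained sketch of the same split, using numbered column names instead of letters so that the 26-letter limit noted above does not apply. The small sample frame and the id_ prefix are assumptions chosen for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Assumed sample data mirroring the question's ID column.
df = pd.DataFrame({"ID": ["4773300", np.nan, "6201501,6319400,6202300"]})

# expand=True returns one column per comma-separated piece.
parts = df["ID"].str.split(",", expand=True)

# Numbered names (id_1, id_2, ...) work for any number of IDs.
parts.columns = [f"id_{i + 1}" for i in parts.columns]

# Keep the original column next to the split columns.
result = df.join(parts)
print(result)
```

Rows with fewer IDs than the widest row are padded with missing values, so pd.isna can be used to test for absent IDs.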
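Consider each entry as a string, and parse the string to get the individual values.
import pandas as pd
from ast import literal_eval
# literal_eval turns a cell such as "6201501,6319400" into a tuple of IDs.
# It assumes every cell holds a comma-separated sequence: a lone ID parses to an int, and an empty cell fails to parse.
df = pd.read_csv('sample.csv', converters={'company': literal_eval})
# Flatten the parsed cells into one list of IDs
words = []
for items in df['company']:
    for word in items:
        words.append(word)
FYI, this is only a starting point. I don't know what output format you need, since your question is somewhat incomplete.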
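If a flat list of all IDs is the goal, as in the loop above, one possible sketch uses str.split followed by explode. This swaps literal_eval for plain string splitting, and the sample frame is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Assumed sample data; the 'company' column name follows the answer above.
df = pd.DataFrame({"company": ["4773300", np.nan, "6201501,6319400,6202300"]})

# Drop missing cells, split each string on commas, then put one ID per row.
words = df["company"].dropna().str.split(",").explode().tolist()
print(words)
```

The IDs stay strings here; convert them afterwards if numeric values are needed.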
I am trying to use np.where to compare whether the values from two columns are equal, but I am getting inconsistent results.
df['compare'] = np.where(df['a'] == df['b'], '0', '1')
Output:
a b compare
1B NaN 1
NaN NaN 1
NaN NaN 1
32X NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN 321 1
NaN Z51 1
NaN 3Y 1
It seemed strange that the command would return pairs of NaN as non-matches. I confirmed that column 'a' and column 'b' are both string data types.
I double-checked the original CSV files. Using the IF formula in Excel, I found several additional pairs of non-matches. The NaN pairs were not identified as matches in Excel.
Any tips on troubleshooting this issue?
NaN is a special value that is not equal to itself, so it should not be used in an equality test. You need to fill the df with comparable values beforehand:
df_ = df.fillna(0)
df['compare'] = np.where(df_['a'] == df_['b'], '0', '1')
a b compare
0 1B NaN 1
1 NaN NaN 0
2 NaN NaN 0
3 32X NaN 1
4 NaN NaN 0
5 NaN NaN 0
6 NaN NaN 0
7 NaN NaN 0
8 NaN NaN 0
9 NaN NaN 0
10 NaN 321 1
11 NaN Z51 1
12 NaN 3Y 1
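As a hedged alternative that avoids picking a fill value, two missing values can be treated as a match explicitly with isna. The three-row frame below is an assumed sample, not the asker's data.

```python
import numpy as np
import pandas as pd

# Assumed sample: one value vs NaN, NaN vs NaN, NaN vs one value.
df = pd.DataFrame(
    {"a": ["1B", np.nan, np.nan], "b": [np.nan, np.nan, "321"]},
    dtype=object,
)

# NaN never equals NaN, so a plain == reports the NaN/NaN row as a mismatch.
plain = df["a"] == df["b"]

# Count a row as equal if the values match or if both are missing.
same = plain | (df["a"].isna() & df["b"].isna())
df["compare"] = np.where(same, "0", "1")
print(df)
```

With this approach a real value can never collide with the placeholder, which is a possible risk of fillna when the data could contain the fill value.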
I have a dataframe that looks like this, e.g.:
ID Value
0x3000 nan
nan 1
nan 2
nan 3
0x4252 nan
nan 10
nan 12
Now I'm looking for a way to get these two groups out of this dataframe, like so:
ID Value
0x3000 nan
nan 1
nan 2
nan 3
and
ID Value
0x4252 nan
nan 10
nan 12
So a group basically starts on a hex value and contains its connected values all the way until the next occurrence of a valid hex value.
How can this be done effectively in pandas without manually looping through the rows and collecting row by row, until the condition (valid hex value) is met?
You can use groupby with a custom group to generate a list of DataFrames:
l = [g for _,g in df.groupby(df['ID'].notna().cumsum())]
output:
[ ID Value
0 0x3000 NaN
1 NaN 1.0
2 NaN 2.0
3 NaN 3.0,
ID Value
4 0x4252 NaN
5 NaN 10.0
6 NaN 12.0]
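To make the grouping key visible, here is a small self-contained sketch. The frame is rebuilt from the question's example, and keying the groups by their hex ID in a dict is an illustrative choice rather than part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["0x3000", np.nan, np.nan, np.nan, "0x4252", np.nan, np.nan],
    "Value": [np.nan, 1, 2, 3, np.nan, 10, 12],
})

# notna() marks the rows that start a group; cumsum() turns those marks
# into a running group number that the following NaN rows inherit.
key = df["ID"].notna().cumsum()
print(key.tolist())

# Store each group under the hex ID found in its first row.
groups = {g["ID"].iloc[0]: g for _, g in df.groupby(key)}
print(groups["0x4252"])
```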
I'm having some problems iteratively filling a pandas DataFrame with two different types of values. As a simple example, please consider the following initialization:
IN:
df = pd.DataFrame(data=np.nan,
                  index=range(5),
                  columns=['date', 'price'])
df
OUT:
date price
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
When I try to fill one row of the DataFrame, it won't adjust the value in the date column. Example:
IN:
df.iloc[0]['date'] = '2022-05-06'
df.iloc[0]['price'] = 100
df
OUT:
date price
0 NaN 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I'm suspecting it has something to do with the fact that the default np.nan value cannot be replaced by a str type value, but I'm not sure how to solve it. Please note that changing the date column's type to str does not seem to make a difference.
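This doesn't work reliably because df.iloc[0]['date'] = ... is chained indexing: df.iloc[0] returns an intermediate Series that may be a copy, and the assignment then updates that Series rather than the original DataFrame.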
If you need to mix positional and label indexing you can use:
df.loc[df.index[0], 'date'] = '2022-05-06'
df.loc[df.index[0], 'price'] = 100
output:
date price
0 2022-05-06 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
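A minimal sketch of single-step assignment, under one added assumption: the date column is created with object dtype up front, because recent pandas versions warn about (and may reject) writing a string into a float64 column. The explicit dtypes and the extra iloc line are illustrative additions.

```python
import numpy as np
import pandas as pd

# Assumption: 'date' gets object dtype so a string fits without upcasting.
df = pd.DataFrame({
    "date": pd.Series([None] * 5, dtype="object"),
    "price": pd.Series([np.nan] * 5, dtype="float64"),
})

# One indexing call per assignment, so pandas writes into df itself.
df.loc[df.index[0], "date"] = "2022-05-06"
df.loc[df.index[0], "price"] = 100

# Purely positional alternative: translate the column label to a position.
df.iloc[1, df.columns.get_loc("price")] = 200
print(df)
```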
Using .loc as shown below may work better:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.nan,
                  index=range(5),
                  columns=['date', 'price'])
print(df)
df.loc[0, 'date'] = '2022-05-06'
df.loc[0, 'price'] = 100
print(df)
Output:
date price
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
date price
0 2022-05-06 100.0
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
I am trying to create a column in my dataframe that looks across the columns of each row and checks whether each value is null. If a value is not null, the new column should contain it; otherwise it is skipped. No row can have a non-null value in more than one column.
For example:
A B C D E
NaN NaN NaN NaN a
b NaN NaN NaN NaN
NaN NaN NaN NaN NaN
My expected output:
A B C D E new_column
NaN NaN NaN NaN a a
b NaN NaN NaN NaN b
NaN NaN NaN NaN NaN NaN
You can bfill horizontally and then select the first column:
df['new_column'] = df.bfill(axis=1).iloc[:, 0]
Output:
>>> df
A B C D E new_column
0 NaN NaN NaN NaN a a
1 b NaN NaN NaN NaN b
2 NaN NaN NaN NaN NaN NaN
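A self-contained sketch of the horizontal bfill, with the frame rebuilt from the question's example. Using object dtype for every column is an assumption made here so that strings and NaN can sit side by side.

```python
import numpy as np
import pandas as pd

# Rebuilt from the question: at most one non-null value per row.
df = pd.DataFrame(
    {
        "A": [np.nan, "b", np.nan],
        "B": [np.nan, np.nan, np.nan],
        "C": [np.nan, np.nan, np.nan],
        "D": [np.nan, np.nan, np.nan],
        "E": ["a", np.nan, np.nan],
    },
    dtype=object,
)

# bfill(axis=1) pulls the next non-null value leftwards along each row,
# so the first column ends up holding the first non-null value of the row.
df["new_column"] = df.bfill(axis=1).iloc[:, 0]
print(df)
```

A row with no values at all keeps NaN in the new column.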
I'm analyzing yearly reports that an organization publishes as Excel files. Each year the column names (Year, A1, B1, C1, etc.) remain identical, but the header starts at a different row and column number.
Each year I manually search for the starting row and column, but it's tedious work given the number of years of reports to wade through.
So I'd like something like this:
...
df = pd.read_excel('test.xlsx')
start_row,start_col = df.find_columns('Year','A1','B1')
...
Thanks.
Let's say you have three .xlsx files on your desktop prefixed with Yearly_Report. Read into one dataframe with something like df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]), they look like this:
0 1 2 3 4 5 6 7 8 9 10
0 A B C NaN NaN NaN NaN NaN NaN NaN NaN
1 1 2 3 NaN NaN NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN A B C NaN NaN NaN NaN NaN NaN
4 NaN NaN 4 5 6 NaN NaN NaN NaN NaN NaN
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN A B C
2 NaN NaN NaN NaN NaN NaN NaN NaN 4 5 6
As you can see, the columns and values are scattered across various columns and rows. The following steps would get you the desired result. First, you need to pd.concat the files and .dropna rows. Then, transpose the dataframe with .T before removing all cells with NaN values. Next, revert the dataframe back with another transpose .T. Finally, simply name the columns and drop rows that are equal to the column headers.
import glob
import pandas as pd
main_folder = 'Desktop/'
yearly_files = glob.glob(f'{main_folder}Yearly_Report*.xlsx')
# Stack the files, drop empty rows, then left-align each row by dropping its NaN cells
df = pd.concat([pd.read_excel(f, header=None) for f in yearly_files]) \
    .dropna(how='all').T \
    .apply(lambda x: pd.Series(x.dropna().values)).T
df.columns = ['A','B','C']
# Remove the repeated header rows
df = df[df['A'] != 'A']
df
output:
A B C
1 1 2 3
4 4 5 6
2 4 5 6
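For the literal request in the question (locating the starting row and column), a possible sketch is a small search over a sheet read with header=None. The find_header helper is hypothetical, not a pandas function, and the in-memory frame stands in for pd.read_excel('test.xlsx', header=None).

```python
import numpy as np
import pandas as pd

def find_header(raw, names):
    """Hypothetical helper: return (row, col) where `names` sit side by side."""
    values = raw.to_numpy(dtype=object)
    n_rows, n_cols = values.shape
    width = len(names)
    for r in range(n_rows):
        for c in range(n_cols - width + 1):
            if list(values[r, c:c + width]) == list(names):
                return r, c
    return None

# Assumed stand-in for a sheet whose table starts at row 1, column 1.
raw = pd.DataFrame(
    [
        [np.nan, np.nan, np.nan, np.nan],
        [np.nan, "Year", "A1", "B1"],
        [np.nan, 2020, 1, 2],
        [np.nan, 2021, 3, 4],
    ],
    dtype=object,
)

names = ["Year", "A1", "B1"]
start_row, start_col = find_header(raw, names)

# Slice out the data below the header and attach the column names.
table = raw.iloc[start_row + 1:, start_col:start_col + len(names)].copy()
table.columns = names
table = table.reset_index(drop=True)
print(start_row, start_col)
print(table)
```

The search is a plain double loop, which should be adequate for report-sized sheets; it returns None when the header is not found, so the caller should check for that before unpacking.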
Something like this? I'm not totally sure what you are looking for.
import pandas as pd
df = pd.read_excel('test.xlsx')
for i in df.index:
    print(df.loc[i, 'Year'])
    print(df.loc[i, 'A1'])
    print(df.loc[i, 'B1'])