Pandas capture connected rows - python

I have a dataframe that looks like the following, e.g.:
ID Value
0x3000 nan
nan 1
nan 2
nan 3
0x4252 nan
nan 10
nan 12
now, I'm looking for a way to get these two groups out of this dataframe, like so:
ID Value
0x3000 nan
nan 1
nan 2
nan 3
and
ID Value
0x4252 nan
nan 10
nan 12
So, a group basically starts at a hex value and contains its connected values all the way until the next occurrence of a valid hex value.
How can this be done efficiently in pandas without manually looping through the rows and collecting them row by row until the condition (a valid hex value) is met?

You can use groupby with a custom group to generate a list of DataFrames:
l = [g for _,g in df.groupby(df['ID'].notna().cumsum())]
output:
[ ID Value
0 0x3000 NaN
1 NaN 1.0
2 NaN 2.0
3 NaN 3.0,
ID Value
4 0x4252 NaN
5 NaN 10.0
6 NaN 12.0]
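For reference, a minimal runnable sketch of this approach, rebuilding the example frame from the question (column names ID and Value are taken from the question; any non-null ID is treated as the start of a group):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'ID': ['0x3000', np.nan, np.nan, np.nan, '0x4252', np.nan, np.nan],
    'Value': [np.nan, 1, 2, 3, np.nan, 10, 12],
})
# Each non-null ID starts a new group; cumsum() assigns every row the
# number of IDs seen so far, which serves as the group key.
groups = [g for _, g in df.groupby(df['ID'].notna().cumsum())]
for g in groups:
    print(g, end='\n\n')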

Related

np.where to compare whether two columns are identical is inconsistent with results from Excel

I am trying to use np.where to compare whether the values from two columns are equal, but I am getting inconsistent results.
df['compare'] = np.where(df['a'] == df['b'], '0', '1')
Output:
a b compare
1B NaN 1
NaN NaN 1
NaN NaN 1
32X NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN NaN 1
NaN 321 1
NaN Z51 1
NaN 3Y 1
It seemed strange that the command would return pairs of NaN as non-matches. I confirmed that column 'a' and column 'b' are both string data types.
I double-checked the original CSV files. Using the IF formula in Excel, I found several additional pairs of non-matches. The NaN pairs were not identified as matches in Excel.
Any tips on troubleshooting this issue?
NaN is a special value that is not equal to itself and should not be used in equality tests. You need to fill the df with comparable values beforehand:
df_ = df.fillna(0)
df['compare'] = np.where(df_['a'] == df_['b'], '0', '1')
a b compare
0 1B NaN 1
1 NaN NaN 0
2 NaN NaN 0
3 32X NaN 1
4 NaN NaN 0
5 NaN NaN 0
6 NaN NaN 0
7 NaN NaN 0
8 NaN NaN 0
9 NaN NaN 0
10 NaN 321 1
11 NaN Z51 1
12 NaN 3Y 1
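A minimal alternative sketch, assuming the intent is to count a NaN/NaN pair as a match: compare with boolean logic instead of filling, which avoids the fill placeholder 0 colliding with a real value of 0 in the data (df here is the same frame as in the question, with string columns 'a' and 'b'):
import numpy as np
import pandas as pd
# Assumed: df has columns 'a' and 'b' as in the question.
# Rows match if the values are equal, or if both are NaN.
both_nan = df['a'].isna() & df['b'].isna()
df['compare'] = np.where(df['a'].eq(df['b']) | both_nan, '0', '1')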

Python search for string in whole pandas dataframe and make that first column

So I have an irregular dataframe with unnamed columns which looks something like this:
Unnamed:0 Unnamed:1 Unnamed:2 Unnamed:3 Unnamed:4
nan nan nan 2022-01-01 nan
nan nan nan nan nan
nan nan String Name Currency
nan nan nan nan nan
nan nan nan nan nan
nan nan String nan nan
nan nan xx A CAD
nan nan yy B USD
nan nan nan nan nan
Basically what I want to do is to find in which column and row the 'String' name is and start the dataframe from there, creating:
String Name Currency
String nan nan
xx A CAD
yy B USD
nan nan nan
My initial thought has been to use
locate_row = df.apply(lambda row: row.astype(str).str.contains('String').any(), axis=1) combined with
locate_col = df.apply(lambda column: column.astype(str).str.contains('String').any(), axis=0)
This gives me Series indicating which rows and which columns contain the string. My main problem is solving this without hardcoding, e.g. iloc[6:, 2:]. Any help getting to the desired dataframe without hardcoding would be greatly appreciated.
In your example you can drop the columns that are entirely null, then drop rows with any null values. The result is the slice you are looking for. You can then promote the first row to headers.
df = df.dropna(axis=1,how='all').dropna().reset_index(drop=True)
df = df.rename(columns=df.iloc[0]).drop(df.index[0])
Output
String Name Currency
1 xx A CAD
2 yy B USD
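If the block of interest is not surrounded by entirely-null rows and columns, a hedged sketch of the locate-and-slice idea from the question (find the first cell equal to 'String' and slice from there; df is the frame from the question):
import numpy as np
# Boolean mask of cells equal to 'String'; argwhere gives the first match.
mask = df.astype(str).eq('String').to_numpy()
row, col = np.argwhere(mask)[0]
# Slice from that cell onward and promote the first row to headers.
out = df.iloc[row:, col:]
out.columns = out.iloc[0]
out = out.iloc[1:].reset_index(drop=True)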
Hey, you would need to iterate over the dataframe and then search for equality in the strings.
Iteration is described in this article: How to iterate over rows in a DataFrame in Pandas
With this you can check for equality:
if s1 == s2:
print('s1 and s2 are equal.')

Comparing 2 dataframes

I have 2 dataframes that contain 3 account indicators per account number. The account numbers are like for like in the column labelled "account". I am trying to modify dataframe 2 so it matches dataframe 1 in terms of having the same NaN values for each column.
Dataframe 1:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1234567890,1,np.nan,'G'],
[7854567890,np.nan,100,np.nan],
[7854567899,np.nan,np.nan,np.nan],
[7854567893,np.nan,100,np.nan],
[7854567893,np.nan,np.nan,np.nan],
[9632587415,np.nan,np.nan,'B']],
columns = ['account','ind_1','ind_2','ind_3'])
df
Output:
account ind_1 ind_2 ind_3
0 1234567890 1.0 NaN G
1 7854567890 NaN 100.0 NaN
2 7854567899 NaN NaN NaN
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 NaN NaN B
Dataframe 2:
df2 = pd.DataFrame([[1234567890,5,np.nan,'GG'],
[7854567890,1,106,np.nan],
[7854567899,np.nan,100,'N'],
[7854567893,np.nan,100,np.nan],
[7854567893,np.nan,np.nan,np.nan],
[9632587415,3,np.nan,'B']],
columns = ['account','ind_1','ind_2','ind_3'])
df2
Output:
account ind_1 ind_2 ind_3
0 1234567890 5.0 NaN GG
1 7854567890 1.0 106.0 NaN
2 7854567899 NaN 100.0 N
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 3.0 NaN B
Problem:
I need to change dataframe 2 so that it is matching the same NaN values in dataframe 1.
For example: column ind_1 has values at index 0, 1 and 5 in df2, whereas in df1 it only has a value at index 0. I need to replace the values at index 1 and 5 in df2 with NaNs to match the NaN pattern in df1. The same logic applies to the other 2 columns.
Expected outcome:
account ind_1 ind_2 ind_3
0 1234567890 5.0 NaN GG
1 7854567890 NaN 106.0 NaN
2 7854567899 NaN NaN NaN
3 7854567893 NaN 100.0 NaN
4 7854567893 NaN NaN NaN
5 9632587415 NaN NaN B
Is there any easy way to achieve this?
Thanks in advance
Alan
Try this:
df2[~df.isna()]
df.isna() checks where df has NaN values and creates a mask; ~ inverts it, and then you slice df2 with the inverted mask.
You return a DataFrame with True values where there are NaN values in your first dataframe using df.isna(). You can use this as a Boolean mask to set the correct cells to NaN in your second DataFrame:
df2[df.isna()] = None
Be careful: I assume that you also need to make sure that these NaNs are associated with rows that have the same value for account. This solution does not ensure that this will happen, i.e. it assumes that the values in the account column are in exactly the same order in both dataframes.
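A minimal sketch of the masking approach, using the df and df2 definitions from the question; df2.where(df.notna()) is an equivalent spelling of the boolean-mask slice above.
# Keep a value in df2 only where the corresponding cell in df is not NaN.
df2_masked = df2.where(df.notna())
print(df2_masked)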

Splitting values in a pandas dataframe column?

I have a df that has a column with some IDs for companies. How can I split these IDs into columns?
In this column the values range from 0 (NaN) to more than 5 IDs; how do I divide each of them into separate columns?
Here is an example of the column:
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
The division would be at each comma, I imagine an output like this:
columnA   columnB   columnC
4773300   NaN       NaN
NaN       NaN       NaN
6201501   6319400   6202300
8230001   NaN       NaN
And so on depending on the number of IDs
You can use the .str.split method to perform this type of transformation quite readily. The trick is to pass the expand=True parameter so your results are put into a DataFrame instead of a Series containing list objects.
>>> df
ID
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
>>> df['ID'].str.split(',', expand=True)
0 1 2 3 4 5
0 4773300 None None None None None
1 NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 None None None
3 8230001 None None None None None
4 NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470
You can also clean up the output a little for better aesthetics:
replace None with NaN
use alphabetic column names (though I would opt not to do this, as you'll hit errors if a given entry in the ID column has > 26 IDs in it)
join back to the original DataFrame
>>> import pandas as pd
>>> from string import ascii_uppercase
>>> (
df['ID'].str.split(',', expand=True)
.replace({None: float('nan')})
.pipe(lambda d:
d.set_axis(
pd.Series(list(ascii_uppercase))[d.columns],
axis=1
)
)
.add_prefix("column")
.join(df)
)
columnA columnB columnC columnD columnE columnF ID
0 4773300 NaN NaN NaN NaN NaN 4773300
1 NaN NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 NaN NaN NaN 6201501,6319400,6202300
3 8230001 NaN NaN NaN NaN NaN 8230001
4 NaN NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470 4742300,4744004,4744003,7319002,4729699,475470
Consider each entry as a string, and parse the string to get to individual values.
import pandas as pd
from ast import literal_eval
# Parse each 'company' entry into a Python literal (e.g. a tuple of IDs).
df = pd.read_csv('sample.csv', converters={'company': literal_eval})
# Collect every individual value into a flat list.
words = []
for items in df['company']:
    for word in items:
        words.append(word)
FYI, this is a good starting point. I do not know what the intended output format should be, since your question is somewhat incomplete.
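If a flat list of the individual IDs is all that is needed, a hedged alternative sketch built on the split approach from the first answer (assuming the column is named ID as shown above):
# Split on commas, drop missing entries, and flatten into one list of ID strings.
words = df['ID'].dropna().str.split(',').explode().tolist()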

create column by looking not null values in other columns [duplicate]

This question already has answers here:
How to implement sql coalesce in pandas
(5 answers)
Closed 1 year ago.
I am trying to create a column in my dataframe which searches each column and checks whether the value at a specific row is null or not; if it is not, the new column will contain that value, otherwise it will skip it. It is not possible for two columns to contain non-null values in the same row.
For example:
A B C D E
NaN NaN NaN NaN a
b NaN NaN NaN NaN
NaN NaN NaN NaN NaN
My expected output:
A B C D E new_column
NaN NaN NaN NaN a a
b NaN NaN NaN NaN b
NaN NaN NaN NaN NaN NaN
You can bfill horizontally and then select the first column:
df['new_column'] = df.bfill(axis=1).iloc[:, 0]
Output:
>>> df
A B C D E new_column
0 NaN NaN NaN NaN a a
1 b NaN NaN NaN NaN b
2 NaN NaN NaN NaN NaN NaN
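For reference, a minimal runnable sketch of the answer using the example frame from the question:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'A': [np.nan, 'b', np.nan],
    'B': [np.nan] * 3,
    'C': [np.nan] * 3,
    'D': [np.nan] * 3,
    'E': ['a', np.nan, np.nan],
})
# Backfill across columns so the first non-null value in each row lands in
# column A, then take that column as the coalesced result.
df['new_column'] = df.bfill(axis=1).iloc[:, 0]
print(df)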
