I have this Excel formula in A2: =IF(B1=1;CONCAT(D1:Z1);"Null")
All cells hold strings or integers, but some are empty. I filled the empty ones with "null".
I've tried to translate it to pandas, and so far I've written this:
import pandas as pd

df = pd.read_table('C:/*path*/001.txt', sep=';', header=0, dtype=str)

rowcount = 0
for row in df:
    rowcount += 1
n = rowcount
m = len(df)

df['A'] = ""
for i in range(1, n):
    if df[i-1]["B"] == 1:
        for k in range(2, m):
            if df[i][k] != "Null":
                df[i]['A'] += df[i][k]
I can't find anything close enough to my problem in existing questions. Can anyone help?
I'm not sure this is exactly what you're expecting. If you need to fill empty cells in the dataframe with the string 'null', you can use this:
df.fillna('null', inplace=True)
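For example, a minimal sketch combined with the read from your question (same placeholder path and separator):
import pandas as pd

# read everything as strings, then mark the empty cells explicitly
df = pd.read_table('C:/*path*/001.txt', sep=';', header=0, dtype=str)
df.fillna('null', inplace=True)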
If you provide the expected output along with your input file, it will be helpful for contributors.
Test dataframe:
df = pd.DataFrame({
    "b": [1, 0, 1],
    "c": ["dummy", "dummy", "dummy"],
    "d": ["red", "green", "blue"],
    "e": ["-a", "-b", "-c"]
})
First step: add a new column and fill with NULL.
df["concatenated"] = "NULL"
Second step: filter to the rows where column b is 1, then set the new column to the concatenation of the relevant columns (D to Z in the spreadsheet; d and e here).
sub_columns = ["d", "e"]
df.loc[df["b"] == 1, "concatenated"] = df[sub_columns].sum(axis=1)
df
Output:
   b      c      d   e concatenated
0  1  dummy    red  -a        red-a
1  0  dummy  green  -b         NULL
2  1  dummy   blue  -c       blue-c
EDIT: I notice there is an offset in your Excel formula (A2 reads from row 1). Not sure if this is deliberate, but if so, experiment with df.shift(-1).
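For example, a minimal sketch of that experiment; whether shift(-1) or shift(1) is the right direction depends on how the A2-versus-row-1 offset maps onto your dataframe rows:
# shift the finished column by one row to reproduce the spreadsheet offset
df["concatenated"] = df["concatenated"].shift(-1)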
There's a lot to unpack here.
Firstly, len(df) gives us the row count. In your code, n and m will be one and the same.
Secondly, please never use chained indexing in pandas unless you absolutely have to. There are a number of reasons not to, one of them being that it's easy to make a mistake; another is that assignment can silently fail (a short illustration follows after these points). Here we have a default range index, so we can use df.loc[i-1, 'B'] in place of df[i - 1]["B"].
Thirdly, the dtype is str, so please use =='1' rather than ==1.
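A quick illustration of the chained-indexing pitfall on a throwaway frame (the frame and values here are only for the illustration):
import pandas as pd

tmp = pd.DataFrame({'B': ['1', '0'], 'A': ['', '']})

# chained indexing: the boolean selection returns a copy, so this assignment
# may never reach the original frame (pandas emits SettingWithCopyWarning)
tmp[tmp['B'] == '1']['A'] = 'x'

# a single .loc call is unambiguous and the assignment sticks
tmp.loc[tmp['B'] == '1', 'A'] = 'x'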
If I understand your problem correctly, the following code should help you:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({
        'B': ['1', '2', '0', '1'],
        'C': ['first', 'second', 'third', 'fourth'],
        'D': ['-1', '-2', '-3', '-4']
    })
In [3]: RELEVANT_COLUMNS = ['C', 'D'] # You can also extract them in any automatic way you please
In [4]: df["A"] = ""
In [5]: df.loc[df['B'] == '1', 'A'] = df.loc[df['B'] == '1', RELEVANT_COLUMNS].sum(axis=1)
In [6]: df
Out[6]:
   B       C   D         A
0  1   first  -1   first-1
1  2  second  -2
2  0   third  -3
3  1  fourth  -4  fourth-4
We note which columns to concatenate (In [3]); we do not want to make the mistake of adding a column later on and accidentally including it. Here, including 'A' wouldn't hurt because it's full of empty strings, but it's more manageable to save the columns we concatenate explicitly.
We then add the column with empty strings (In [4]) (if we skip this step, we'll get NaNs instead of empty strings for the records where B does not equal 1).
In [5] uses pandas' boolean indexing (via the Series-to-scalar equality operator) to limit our scope to the rows where column B equals '1'; we then pull up the columns to concatenate and do just that, using an axis-reducing sum operation.
I have a dataframe (very simplified version below):
d = {'col1': [1, '', 2], 'col2': ['', '', 3], 'col3': [4, 5, 6]}
df = pd.DataFrame(data=d)
I need to loop through the dataframe and check how many columns are populated per row. If the row has just one column populated, then I can continue onto the next row. If, however, the row has more than one non-NaN value, I need to make all the columns NaN apart from one, based on some hierarchy.
For example, let's say the hierarchy is:
col1 is the most important
col2 second etc.
Therefore, if there were two or more columns with data and one of them happened to be column 1, I would drop all other column values; otherwise I would defer to checking whether col2 has a value, etc., and then repeat for the next row.
I have something like this as an idea:
nrows = df.shape[0]
for index in range(0, nrows):
    print(index)
    #check if the row has only one column populated
    if (df.iloc[[index]].notna().sum() == 1):
        continue
    #check if more than one column is populated for that row
    elif (df.iloc[[index]].notna().sum() >= 1):
        if (index['col1'].notna() == True):
            df.loc[:, df.columns != 'col1'] == 'NaN'
            #continue down the hierarchy
but this is not correct, as it gives True/False for every column and I cannot read it the way I need.
Any suggestions are very welcome! I was thinking of creating some sort of key, but I feel there may be a simpler way to get there with the code I already have.
Edit:
Another important point which I should have included is that my index is not made of integers; it consists of unique identifiers which look something like this: '123XYZ', which is why I used range(0, n) and reshaped the df.
I haven't tested this thoroughly against the example dataframe you gave, but something like this should work:
import numpy as np

hierarchy = ['col1', 'col2', 'col3']

# rows with more than one populated (non-NaN) value need fixing
inds = df.notna().sum(axis=1)
inds = inds[inds >= 2].index

for i in inds:
    for col in hierarchy:
        if not pd.isna(df.loc[i, col]):
            tmp = df.loc[i, col]
            df.loc[i, :] = np.nan   # blank the whole row...
            df.loc[i, col] = tmp    # ...then restore the highest-priority value
            break
Note that I'm assuming you actually mean NaN and not the empty string as in your example. If you want to look for empty strings, then inds and the if statement above would change; see the sketch below.
I also think this should be faster than what you have above, since it only loops over the rows with more than one populated value.
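For reference, a minimal sketch of that empty-string variant, assuming '' is the only marker for an empty cell (as in the toy frame above):
# reuse df and hierarchy from above; '' marks an empty cell here
populated = df.ne('')              # True where a cell holds an actual value
inds = populated.sum(axis=1)
inds = inds[inds >= 2].index       # rows with more than one populated cell

for i in inds:
    for col in hierarchy:
        if populated.loc[i, col]:
            tmp = df.loc[i, col]
            df.loc[i, :] = ''      # blank the whole row...
            df.loc[i, col] = tmp   # ...then restore the highest-priority value
            break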
I'm new to Python and especially to pandas, so I don't really know what I'm doing. I have 10 columns and 100,000 rows of 4-letter strings. I need to filter out the rows that don't contain 'DDD' in all of the columns.
I tried to do it with iloc and loc, but it doesn't work:
import pandas as pd
df = pd.read_csv("data_3.csv", delimiter = '!')
df.iloc[:,10:20].str.contains('DDD', regex= False, na = False)
df.head()
It returns an error: 'DataFrame' object has no attribute 'str'
I suggest doing it without a for loop like this:
df[df.apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only string columns
df[df.select_dtypes(include='object').apply(lambda x: x.str.contains('DDD')).all(axis=1)]
To select only some string columns
selected_cols = ['A','B']
df[df[selected_cols].apply(lambda x: x.str.contains('DDD')).all(axis=1)]
You can also do this, but only if all of your columns are of string type:
for column in df.columns:
    df = df[df[column].str.contains('DDD')]
You can use str.contains, but only on a Series, not on a DataFrame. So to use it, we look at each column (which is a Series) one by one by looping over them:
>>> import pandas as pd
>>> df = pd.DataFrame([['DDDA', 'DDDB', 'DDDC', 'DDDD'],
...                    ['DDDE', 'DDDF', 'DDDG', 'DHDD'],
...                    ['DDDI', 'DDDJ', 'DDDK', 'DDDL'],
...                    ['DMDD', 'DNDN', 'DDOD', 'DDDP']],
...                   columns=['A', 'B', 'C', 'D'])
>>> for column in df.columns:
...     df = df[df[column].str.contains('DDD')]
In our for loop we're overwriting the DataFrame df with df where the column contains 'DDD'. By looping over each column we cut out rows that don't contain 'DDD' in that column until we've looked in all of our columns, leaving only rows that contain 'DDD' in every column.
This gives you:
>>> print(df)
      A     B     C     D
0  DDDA  DDDB  DDDC  DDDD
2  DDDI  DDDJ  DDDK  DDDL
As you're only looping over 10 columns this shouldn't be too slow.
Edit: You should probably do it without a for loop as explained by Christian Sloper as it's likely to be faster, but I'll leave this up as it's slightly easier to understand without knowledge of lambda functions.
I had originally asked this question here, and I believe it was incorrectly marked as a duplicate. I will do my best here to clarify my question and how I believe it is unique.
Given the following example MultiIndex dataframe:
import pandas as pd
import numpy as np
first = ['A', 'B', 'C']
second = ['a', 'b', 'c', 'd']
third = ['1', '2', '3']
indices = [first, second, third]
index = pd.MultiIndex.from_product(indices, names=['first', 'second', 'third'])
df = pd.DataFrame(np.random.randint(10, size=(len(first)*len(second)*len(third), 4)), index=index, columns=['Val1', 'Val2', 'Val3', 'Val4'])
Goal: I would like to retain a specific level=1 index (such as 'a') if the value of column 'Val2' corresponding to index value '1' in level=2 is greater than 5 for that level=1 index. If this criterion is not met (i.e. column 'Val2' is less than or equal to 5 for index '1' in level=2), the corresponding level=1 index would be removed from the dataframe. If none of the level=1 indices meet the criterion for a given level=0 index, that level=0 index would also be removed. My previous post contains my expected output (I can add it here, but I wanted this post to be as succinct as possible for clarity).
Here is my current solution, the performance of which I'm sure can be improved:
grouped = df.groupby(level=0)
output = pd.concat([grouped.get_group(key).groupby(level=1).filter(lambda x: (x.loc[pd.IndexSlice[:, :, '1'], 'Val2']>5).any()) for key, group in grouped])
This does produce my desired output, but for a dataframe with hundreds of thousands of rows the performance is rather poor. Is there something obvious I am missing here to better utilize pandas' under-the-hood optimizations?
I got the same result as your example solution by doing the following:
df.loc[df.xs('1', level=2)['Val2'] > 5]
Comparing time performance, this is ~15x faster (on my machine your example takes 36ms while this takes 2ms).
I have a pandas data frame, and I would like to add a column that holds the row-to-row difference of another column, computed within groups given by a third column. Here is a toy example:
import pandas as pd
import numpy as np
d = {'one' : pd.Series(range(4), index=['a', 'b', 'c', 'd']),
     'two' : pd.Series(range(4), index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
df['three'] = [2, 2, 3, 3]
four = []
for i in set(df['three']):
    for j in range(len(df) - 1):
        four.append(df[df['three'] == i]['two'][j + 1] - df[df['three'] == i]['two'][j])
    four.append(0)
df['four'] = four
The final column should be [1, 1, 1, NaN], since that is the difference between each of the rows in the 'two' column.
This makes more sense in the context of my original code: my data frame is organized by some IDs and then by time, and when I take the subset of the data frame for one ID, I'm left with the time-series evolution of the variables for that individual ID. However, I keep either receiving a key error or ending up editing a copy of the original data frame. What is the right way to go about this?
You could replace df[df['three'] == i] with a groupby on column 'three', and perhaps replace ['two'][j + 1] - ['two'][j] with df['two'].shift(-1) - df['two'].
I think that would be identical to what you are doing now within the nested loop. How you implement this depends a bit on what format you want as a result. One way would be:
df.groupby('three').apply(lambda grp: pd.Series(grp['two'].shift(-1) - grp['two']))
Which would result in:
two      a    b
three
2        1  NaN
3        1  NaN
The column names become a bit meaningless after this operation.
If all you want to do is get the difference between the rows of column 'two', you can use the shift method.
df['four'] = df.two.shift(-1) - df.two
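And if, as in your real data, the difference should be computed within each group (each ID, here each value of 'three') but stored back on the original frame, a minimal sketch of the same shift idea, grouped:
# per-group shifted difference, aligned with the original index
# (gives NaN on the last row of each group)
df['four'] = df.groupby('three')['two'].shift(-1) - df['two']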
I am sure there is an obvious way to do this, but I can't think of anything slick right now.
Basically, instead of raising an exception, I would like to get True or False to see whether a value exists in a pandas df index.
import pandas as pd
df = pd.DataFrame({'test':[1,2,3,4]}, index=['a','b','c','d'])
df.loc['g'] # (should give False)
What I have working now is the following
sum(df.index == 'g')
This should do the trick
'g' in df.index
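For example, with the frame from the question:
'a' in df.index   # True
'g' in df.index   # False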
A MultiIndex works a little differently from a single index. Here are some methods for a multi-indexed dataframe.
df = pd.DataFrame({'col1': ['a', 'b','c', 'd'], 'col2': ['X','X','Y', 'Y'], 'col3': [1, 2, 3, 4]}, columns=['col1', 'col2', 'col3'])
df = df.set_index(['col1', 'col2'])
in df.index works only for the first level when checking a single index value.
'a' in df.index # True
'X' in df.index # False
Check df.index.levels for other levels.
'a' in df.index.levels[0] # True
'X' in df.index.levels[1] # True
Use in df.index with a tuple to check a full index combination.
('a', 'X') in df.index # True
('a', 'Y') in df.index # False
Just for reference, since it was something I was looking for: you can test for presence within the values or the index by appending ".values", e.g.
g in df.<your selected field>.values
g in df.index.values
I find that adding ".values" to get a plain ndarray out makes existence ("in") checks run more smoothly with the other Python tools. Just thought I'd toss that out there for people.
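For example, with the frame from this question (note the quotes around the literal 'g'):
'g' in df.index.values      # False
'a' in df.index.values      # True
1 in df['test'].values      # True  -- membership in the column's values
1 in df['test']             # False -- without .values this checks the index labels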
The code below does not print a boolean, but it allows for dataframe subsetting by index. I understand this is likely not the most efficient way to solve the problem, but (1) I like the way it reads and (2) you can easily subset to where the df1 index exists in df2:
df3 = df1[df1.index.isin(df2.index)]
or where df1 index does not exist in df2...
df3 = df1[~df1.index.isin(df2.index)]
With this DataFrame df_data:
>>> df_data
  id   name  value
0  a  ampha      1
1  b   beta      2
2  c     ce      3
I tried:
>>> getattr(df_data, 'value').isin([1]).any()
True
>>> getattr(df_data, 'value').isin(['1']).any()
True
but:
>>> 1 in getattr(df_data, 'value')
True
>>> '1' in getattr(df_data, 'value')
False
So fun :D
import pandas
df = pandas.DataFrame({'g': [1]}, index=['isStop'])
#df.loc['g']
if 'g' in df.index:
    print("find g")
if 'isStop' in df.index:
    print("find isStop")
I like to use:
if 'value' in df.index.get_level_values(0):
    print(True)
The get_level_values method is good because it lets you check for a value in the index regardless of whether your index is simple or composite.
Use 0 (zero) if you have a single index in your dataframe, or if you want to check the first level of a MultiIndex. Use 1 for the second level, and so on.
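For instance, with the two-level col1/col2 index built in the earlier answer above:
'a' in df.index.get_level_values(0)   # True  -- first level (col1)
'X' in df.index.get_level_values(1)   # True  -- second level (col2)
'X' in df.index.get_level_values(0)   # False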