I have a pandas DataFrame read from a file. Some of its columns contain strings, and some of those strings in turn contain substrings separated by semicolons. My goal is to turn the semicolon-separated substrings into lists of strings and put those back into the DataFrame.
When I use df.iloc[-1][-1] = df.iloc[-1][-1].split(';') on a cell that contains a string with semicolons, there is no error, but the value of df.iloc[-1][-1] is not changed.
When I use
newval = df.iloc[-1,-1]; newval
newval = df.iloc[-1,-1].split( ';' ); newval
df.iloc[-1][-1] = newval; df.iloc[-1][-1]
It shows the original string for the first line and the list of substrings for the second, but then the original string again for the third. It looks as if nothing has been assigned -- but there was no error message either.
My first guess was that it was not allowed to put a list of strings in a cell that contains strings but a quick test showed me that that is OK:
>>> df = pd.DataFrame([["a", "a;b"], ["a;A", "a;b;A;B"]], index=[1, 2], columns=['A', 'B'])
>>> df
     A        B
1    a      a;b
2  a;A  a;b;A;B
>>> for row in range(df.shape[0]):
...     for col in range(df.shape[1]):
...         value = df.iloc[row][col]
...         if type(value) == str:
...             value = value.split(';')
...         df.iloc[row][col] = value
>>> df
        A             B
1     [a]        [a, b]
2  [a, A]  [a, b, A, B]
So I'm puzzled: (i) why does the assignment work in the example but not for my CSV-imported dataframe, and (ii) why doesn't Python give an error message?
Honestly, you can simplify your code by avoiding the loops and using a simple applymap. Loops should generally be avoided with pandas. Here applymap won't necessarily be faster, but it's definitely much easier to read and understand.
out = df.applymap(lambda x: x.split(';'))
output:
        A             B
1     [a]        [a, b]
2  [a, A]  [a, b, A, B]
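Note that on recent pandas (2.1 and later) applymap is deprecated and renamed to DataFrame.map, which takes the same function:
out = df.map(lambda x: x.split(';'))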
Why your approach failed
You're using df.iloc[row][col] = value, which is chained indexing: df.iloc[row] can return a temporary copy, so the assignment may set the value on that copy instead of on df. Use a single indexing call, df.iloc[row, col] = value, instead. Did you get a SettingWithCopyWarning?
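For illustration, a minimal sketch of a loop that writes reliably into the original frame, using the toy frame from the question; .iat addresses a single cell positionally, and casting to object ensures the cells can hold list objects:
df = df.astype(object)  # object cells can store lists
for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        value = df.iat[row, col]  # one positional lookup, no chaining
        if isinstance(value, str):
            df.iat[row, col] = value.split(';')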
Also note that if not all values are strings, you should guard the split:
df.applymap(lambda x: x.split(';') if isinstance(x, str) else x)
Example:
df = pd.DataFrame([["a", 2], ["a;A", "a;b;A;B"]], index=[1, 2], columns=['A', 'B'])
df.applymap(lambda x: x.split(';') if isinstance(x, str) else x)
        A             B
1     [a]             2
2  [a, A]  [a, b, A, B]
Related
In the first example below, I am iterating over a list of DataFrames. The for loop creates column 'c'. Printing each df shows that both elements in the list were updated.
In the second example, I am iterating over a list of variables. The for loop applies some math to each element, but when printing, the list does not reflect the changes made in the for loop.
Please help me understand why the elements in the second example are not affected by the for loop like they are in the first example.
import pandas as pd
df1 = pd.DataFrame([[1,2],[3,4]], columns=['a', 'b'])
df2 = pd.DataFrame([[3,4],[5,6]], columns=['a', 'b'])
dfs = [df1, df2]
for df in dfs:
    df['c'] = df['a'] + df['b']
print(df1)
print(df2)
result:
   a  b  c
0  1  2  3
1  3  4  7
   a  b  c
0  3  4  7
1  5  6  11
Second example:
a, b = 2, 3
test = [a, b]
for x in test:
    x = x * 2
print(test)
result: [2, 3]
expected result: [4, 6]
In your second example, test is a list of ints, which are immutable. If you want an effect similar to your first snippet, you will have to store something mutable in your list:
a, b = 2, 3
test = [[a], [b]]
for x in test:
    x[0] = x[0] * 2
print(test)
Output: [[4], [6]]
When you iterate over a list like this, x takes the value at the current position:
for x in test:
    x = x * 2
When you try to assign a new value to x, you are not changing the element in the list; you are changing what the variable x refers to.
To change the actual value in the list iterate by index:
for i in range(len(test)):
    test[i] = test[i] * 2
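An equivalent, more idiomatic spelling uses enumerate to get the index and the value together:
for i, x in enumerate(test):
    test[i] = x * 2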
I have two dataframes. The first one (let's call it A) has a column (let's call it 'col1') whose elements are lists of strings. The other one (let's call it B) has a column (let's call it 'col2') whose elements are strings. I want to do a join between these two dataframes where B.col2 is in the list in A.col1. This is a one-to-many join.
Also, I need the solution to be scalable, since I want to join two dataframes with hundreds of thousands of rows.
I have tried concatenating the values in A.col1 and creating a new column (let's call it 'col3') and joining with this condition: A.col3.contains(B.col2). However, my understanding is that this condition triggers a cartesian product between the two dataframes which I cannot afford considering the size of the dataframes.
from pyspark.sql.functions import udf

def joinIds(IdList):
    return "__".join(IdList)

joinIds_udf = udf(joinIds)
pnr_corr = pnr_corr.withColumn('joinedIds', joinIds_udf(pnr_corr.pnrCorrelations.correlationPnrSchedule.scheduleIds))
pnr_corr_skd = pnr_corr.join(skd, pnr_corr.joinedIds.contains(skd.id), how='inner')
This is a sample of the join that I have in mind:
dataframe A:
listColumn
["a","b","c"]
["a","b"]
["d","e"]
dataframe B:
valueColumn
a
b
d
output:
listColumn valueColumn
["a","b","c"] a
["a","b","c"] b
["a","b"] a
["a","b"] b
["d","e"] d
I don't know if there is an efficient way to do it, but this gives the correct output:
import pandas as pd
from itertools import chain
df1 = pd.Series([["a","b","c"],["a","b"],["d","e"]])
df2 = pd.Series(["a","b","d"])
result = [[[el2, list1] for el2 in df2.values if el2 in list1]
          for list1 in df1.values]
result_flat = list(chain(*result))
result_df = pd.DataFrame(result_flat)
You get:
In [26]: result_df
Out[26]:
   0          1
0  a  [a, b, c]
1  b  [a, b, c]
2  a     [a, b]
3  b     [a, b]
4  d     [d, e]
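If the nested scan over both Series gets too slow at hundreds of thousands of rows, one possible refinement (only a sketch, assuming plain membership tests as above) is to build an inverted index from each value to the lists that contain it, so each value of df2 is looked up in constant time instead of scanning every list:
from collections import defaultdict

inverted = defaultdict(list)  # value -> positions of the lists containing it
for pos, lst in enumerate(df1):
    for el in lst:
        inverted[el].append(pos)

result_flat = [[val, df1.iloc[pos]] for val in df2 for pos in inverted[val]]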
Another approach is to use the new explode() method from pandas>=0.25 and merge like this:
import pandas as pd
df1 = pd.DataFrame({'col1': [["a","b","c"],["a","b"],["d","e"]]})
df2 = pd.DataFrame({'col2': ["a","b","d"]})
df1_flat = df1.col1.explode().reset_index()
df_merged = pd.merge(df1_flat,df2,left_on='col1',right_on='col2')
df_merged['col2'] = df1.loc[df_merged['index']].values
df_merged.drop('index',axis=1, inplace=True)
This gives the same result:
  col1       col2
0    a  [a, b, c]
1    a     [a, b]
2    b  [a, b, c]
3    b     [a, b]
4    d     [d, e]
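Since the attempt in the question was PySpark, it may be worth noting that the same explode-then-join pattern works there as well and lets Spark run an ordinary equi-join instead of a cartesian product. A sketch, assuming DataFrames df_a and df_b carrying the listColumn/valueColumn from the example:
from pyspark.sql import functions as F

# flatten the list column into one row per element, then join on equality
a_flat = df_a.withColumn('valueColumn', F.explode('listColumn'))
joined = a_flat.join(df_b, on='valueColumn', how='inner')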
How about:
df['col1'] = [df['col1'].values[i] + [df['col2'].values[i]] for i in range(len(df))]
where 'col1' holds the lists of strings and 'col2' the strings.
You can also drop 'col2' if you don't want it anymore with:
df = df.drop('col2',axis=1)
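A slightly tidier equivalent uses zip instead of positional indexing:
df['col1'] = [lst + [val] for lst, val in zip(df['col1'], df['col2'])]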
I am looping over the rows of a DataFrame, in which each row contains a string in 'column_A' and a tuple in 'column_B'. If the tuple in 'column_B' meets a certain condition, an operation is performed on the string in 'column_A' and the result is to be stored as a string in 'column_C'. If the condition is not met, nothing is to be stored in 'column_C'. This is my DataFrame:
                   column_A   column_B  column_C
0         This is a string.  [A, B, C]
1        And this a string.  [A, B, D]
2       Yet another string.  [A, B, C]
3  For the love of strings.  [A, B, C]
My script looks like this.
import pandas as pd
a = pd.read_pickle('dataframe.pkl')
condition = [('A', 'B', 'C')]
def operation(j):
    # some operations here
    return j

df_list = []
for index, row in a.iterrows():
    b = tuple(row['column_B'])
    if b in condition:
        lst = []
        c = a['column_A'].apply(operation)  # run function 'operation' above
        lst.append(c)
        df = pd.DataFrame([lst])  # Here things go wrong. Only first string from column_A is added
        df_list.append(df)

df = pd.concat(df_list)
df = df.reset_index(drop=True)
a.insert(3, 'Column_C', df)
Based on the script, I expect to get my desired result:
                   column_A   column_B                  column_C
0         This is a string.  [A, B, C]         This is a string!
1        And this a string.  [A, B, D]
2       Yet another string.  [A, B, C]       Yet another string!
3  For the love of strings.  [A, B, C]  For the love of strings!
However, I get the following result:
                   column_A   column_B             column_C
0         This is a string.  [A, B, C]  0 This is a string!
1        And this a string.  [A, B, D]  0 This is a string!
2       Yet another string.  [A, B, C]  0 This is a string!
3  For the love of strings.  [A, B, C]  0 This is a string!
It is very unclear to me why each string is preceded by a '0', and why the processed version of only the first string appears in every row. Any suggestions as to why this happens and how to change the script to get the desired result?
A better option than iterating over rows is to use pd.DataFrame.apply.
This avoids the expensive process of creating a dataframe for each row and concatenating them.
df = pd.DataFrame({'column_A': ['This is a string.', 'And this a string.',
'Yet another string.', 'For the love of strings.'],
'column_B': [['A', 'B', 'C'], ['A', 'B', 'D'],
['A', 'B', 'C'], ['A', 'B', 'C']]})
def func(row):
    if row['column_B'] in [['A', 'B', 'C']]:
        return row['column_A']
    else:
        return ''
df['column_C'] = df.apply(func, axis=1)
#                    column_A   column_B                  column_C
# 0         This is a string.  [A, B, C]         This is a string.
# 1        And this a string.  [A, B, D]
# 2       Yet another string.  [A, B, C]       Yet another string.
# 3  For the love of strings.  [A, B, C]  For the love of strings.
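If the condition really is just membership in a set of tuples, a variant (a sketch on the same df) builds a boolean mask once and uses Series.where, which keeps column_A where the mask is True and '' elsewhere:
condition = {('A', 'B', 'C')}
mask = df['column_B'].apply(tuple).isin(condition)  # True where column_B matches
df['column_C'] = df['column_A'].where(mask, '')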
Take a look at this minimal example:
import io
import pandas as pd
s1 = """A,B
a,1
b,1
"""
df = pd.read_csv(io.StringIO(s1))
print(df.apply(lambda x: 1 * [x.A],axis=1))
print("==================")
print(df.apply(lambda x: 2 * [x.A],axis=1))
print("==================")
print(df.apply(lambda x: 3 * [x.A],axis=1))
The print statements yield:
0    [a]
1    [b]
dtype: object
==================
   A  B
0  a  a
1  b  b
==================
0    [a, a, a]
1    [b, b, b]
dtype: object
As you can see, when the number of list elements equals the number of columns in the initial dataframe, the lists get matched to those columns; in all other cases the result is just a Series containing the lists as elements.
I can work around this by checking the dimensions of the dataframe and, if necessary, adding an empty dummy column so that the number of columns no longer matches the length of the lists, but I would like to know if there is a direct way of controlling that matching behaviour.
EDIT: The specific way I'm creating the lists in my example is only for simplicity; the lists could also be the output of e.g. a numpy function such as linreg or polyfit.
EDIT 2: I want my 2nd case to look like this:
0    [a, a]
1    [b, b]
EDIT 3: My real application is to have two columns, each holding an array or list, and then to use numpy polyfit on them, which yields an array whose length depends on the degree of the polynomial.
df["polyfit"] = df.apply(lambda x: list(np.polyfit(x[x_name],x[y_name],degree)),axis=1)
Try a different approach:
In [9]: df['A'].apply(list) * 1
Out[9]:
0    [a]
1    [b]
Name: A, dtype: object

In [10]: df['A'].apply(list) * 2
Out[10]:
0    [a, a]
1    [b, b]
Name: A, dtype: object

In [11]: df['A'].apply(list) * 3
Out[11]:
0    [a, a, a]
1    [b, b, b]
Name: A, dtype: object
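If you do need the row-wise apply (e.g. for np.polyfit across several columns, as in EDIT 3), pandas 0.23+ also gives direct control over the matching behaviour: result_type='reduce' tells apply to always return a Series of the raw objects instead of expanding list-like results into columns:
df.apply(lambda x: 2 * [x.A], axis=1, result_type='reduce')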
I am trying to parse text data in a pandas DataFrame based on certain tags and values in another column's fields, and store the results in their own columns. For example, if I create this dataframe, df:
import re
import pandas as pd

df = pd.DataFrame([[1, 2], ['A: this is a value B: this is the b val C: and here is c.', 'A: and heres another a. C: and another c']])
df = df.T
df.columns = ['col1', 'col2']
df['tags'] = df['col2'].apply(lambda x: re.findall(r'(?:\s|)(\w*)(?::)', x))
all_tags = []
for val in df['tags']:
    all_tags = all_tags + val
all_tags = list(set(all_tags))

for val in all_tags:
    df[val] = ''
df:
   col1                                               col2       tags  A  C  B
0     1  A: this is a value B: this is the b val C: and...  [A, B, C]
1     2          A: and heres another a. C: and another c      [A, C]
How would I populate each of the new "tag" columns with their values from col2 so I get this df:
   col1                                               col2       tags  \
0     1  A: this is a value B: this is the b val C: and...  [A, B, C]
1     2          A: and heres another a. C: and another c      [A, C]

                      A               C                  B
0       this is a value  and here is c.  this is the b val
1  and heres another a.   and another c
Another option using str.extractall with regex (?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$):
The regex captures the key (?P<key>\w+) before the colon and the value after it (?P<val>[^:]*) as two separate columns, key and val. The val part matches non-colon characters until it reaches the next key/value pair, enforced by the lookahead (?=\w+:|$). This assumes the key is always a single word, since anything else would be ambiguous:
import re

pat = re.compile(r"(?P<key>\w+):(?P<val>[^:]*)(?=\w+:|$)")

pd.concat([
    df,
    (
        df.col2.str.extractall(pat)
        .reset_index('match', drop=True)
        .set_index('key', append=True)
        .val.unstack('key')
    )
], axis=1).fillna('')
Here df.col2.str.extractall(pat) gives a long frame with one row per key/value match, indexed by the original row and the match number. You then pivot that result (set_index('key', append=True) plus unstack('key')) and concatenate it with the original dataframe.
Here's one way
In [683]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
.apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x])))
)
Out[683]:
                      A                  B               C
0       this is a value  this is the b val  and here is c.
1  and heres another a.                NaN   and another c
You could append the results back using join:
In [690]: df.join(df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
.apply(lambda x: pd.Series(dict([v.split(':', 1) for v in x]))))
Out[690]:
   col1                                               col2       tags  \
0     1  A: this is a value B: this is the b val C: and...  [A, B, C]
1     2          A: and heres another a. C: and another c      [A, C]

                      A                  B               C
0       this is a value  this is the b val  and here is c.
1  and heres another a.                NaN   and another c
In fact, you could get df['tags'] with a string method as well:
In [688]: df.col2.str.findall('(?:\s|)(\w*)(?::)')
Out[688]:
0    [A, B, C]
1       [A, C]
Name: col2, dtype: object
Details:
Split groups into lists
In [684]: df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
Out[684]:
0    [A: this is a value, B: this is the b val, C: ...
1          [A: and heres another a., C: and another c]
Name: col2, dtype: object
Now, split each group into lists of key/value pairs:
In [685]: (df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
.apply(lambda x: [v.split(':', 1) for v in x]))
Out[685]:
0    [[A, this is a value], [B, this is the b val...
1     [[A, and heres another a.], [C, and another c]]
Name: col2, dtype: object
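The remaining step is to turn each pair list into a dict; wrapping that dict in pd.Series (as in the full one-liner above) is what makes pandas align the keys A/B/C into columns and fill the missing B with NaN:
(df.col2.str.findall('[\S]+(?:\s(?!\S+:)\S+)+')
   .apply(lambda x: dict(v.split(':', 1) for v in x)))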