In some columns of a Dask DataFrame, I have strings like this:
column_name_1  column_name_2
a^b^c          j
e^f^g          k^l
h^i            m
I need to split these strings into columns in the same data frame, like this
column_name_1  column_name_2  column_name_1_1  column_name_1_2  column_name_1_3  column_name_2_1  column_name_2_2
a^b^c          j              a                b                c                j
e^f^g          k^l            e                f                g                k                l
h^i            m              h                i                                 m
I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there are in the data. Also, there are tens of columns in the DataFrame that are to be left alone, so I need to be able to specify which columns to split like this.
My best effort includes something like
df[["column_name_1_1","column_name_1_2 ","column_name_1_3"]] = df["column_name_1"].str.split('^',n=2, expand=True)
But it fails with a
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here are two solutions that work without stack, using a loop over the selected column names:
cols = ['column_name_1','column_name_2']
for c in cols:
    df = df.join(df[c].str.split('^', n=2, expand=True).add_prefix(f'{c}_').fillna(''))

print(df)
  column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2  \
0         a^b^c             j               a               b               c
1         e^f^g           k^l               e               f               g
2           h^i             m               h               i

  column_name_2_0 column_name_2_1
0               j
1               k               l
2               m
Or, alternatively, build the split frames with a list comprehension and concatenate them:
cols = ['column_name_1','column_name_2']
dfs = [df[c].str.split('^',n=2, expand=True).add_prefix(f'{c}_').fillna('') for c in cols]
df = pd.concat([df] + dfs, axis=1)
print (df)
  column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2  \
0         a^b^c             j               a               b               c
1         e^f^g           k^l               e               f               g
2           h^i             m               h               i

  column_name_2_0 column_name_2_1
0               j
1               k               l
2               m
Unfortunately, using dask.dataframe.Series.str.split with expand=True and an unknown number of splits is not yet supported in Dask; the following returns a NotImplementedError:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

# returns NotImplementedError
ddf['column_name_1'].str.split('^', expand=True).compute()
Usually when a pandas equivalent has not yet been implemented in Dask, map_partitions can be used to apply a Python function to each DataFrame partition. In this case, however, Dask would still need to know how many columns to expect in order to lazily produce a Dask DataFrame, which is what the meta argument provides. This makes using Dask for this task challenging. Relatedly, the ValueError occurs because column_name_2 requires only 1 split and so returns a Dask DataFrame with 2 columns, but Dask is expecting a DataFrame with 3 columns.
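For illustration only, here is a minimal sketch of what that map_partitions route could look like. The split_and_pad helper and the hard-coded maximum of 2 splits are assumptions for this example, not part of the Dask API; the key points are that every partition must be padded to the same number of pieces and that meta declares the output columns up front.
import dask.dataframe as dd
import pandas as pd

def split_and_pad(pdf, col, n_splits):
    # hypothetical helper: split one column and pad with '' so that every
    # partition produces exactly n_splits + 1 new columns
    parts = pdf[col].str.split('^', n=n_splits, expand=True)
    parts = parts.reindex(columns=range(n_splits + 1)).fillna('')
    return pdf.join(parts.add_prefix(f'{col}_'))

ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

# meta lists the column names and dtypes the result will have
meta = {
    'column_name_1': 'object', 'column_name_2': 'object',
    'column_name_1_0': 'object', 'column_name_1_1': 'object', 'column_name_1_2': 'object',
}
ddf.map_partitions(split_and_pad, 'column_name_1', 2, meta=meta).compute()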
Here is one solution (building from @Fontanka16's answer) if you do know the number of splits ahead of time:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

ddf_list = []
num_split_dict = {'column_name_1': 2, 'column_name_2': 1}
for col, num_splits in num_split_dict.items():
    split_df = ddf[col].str.split('^', n=num_splits, expand=True).add_prefix(f'{col}_')
    ddf_list.append(split_df)

new_ddf = dd.concat([ddf] + ddf_list, axis=1)
new_ddf.compute()
Related
I am importing a .txt file via read_table and get a DataFrame similar to
d = ['89278 5857', '1.000e-02', '1.591184e-02', '2.100053e-02', '89300 5857', '4.038443e-01', '4.037924e-01', '4.037336e-01']
df = pd.DataFrame(data = d)
and would like to reorganize it into
r = {'89278 5857': [1.000e-02, 1.591184e-02, 2.100053e-02], '89300 5857': [4.038443e-01, 4.037924e-01, 4.037336e-01]}
rf = pd.DataFrame(data = r)
The .txt file is typically 50k+ rows with an unknown number of '89278 5857' type values.
Thanks!
You can use itertools.groupby:
from itertools import groupby

data, cur_group = {}, None
for v, g in groupby(df[0], lambda k: " " in k):
    if v:
        cur_group = []
        data[next(g)] = cur_group
    else:
        cur_group.extend(g)

df = pd.DataFrame(data)
print(df)
Prints:
89278 5857 89300 5857
0 1.000e-02 4.038443e-01
1 1.591184e-02 4.037924e-01
2 2.100053e-02 4.037336e-01
Assuming what delineates the start of the next group is a space, here is what I would do (note that the chain below refers to the single column as value, so it assumes the column has been named accordingly; see the setup sketch after the output):
df.assign(
    key=lambda df: numpy.where(
        df['value'].str.contains(' '),  # what defines each group
        df['value'],
        numpy.nan
    ),
).fillna(
    method='ffill'  # copy the group label down until the next group starts
).loc[
    lambda df: df['value'] != df['key']  # remove the rows that kicked off each group
].assign(
    idx=lambda df: df.groupby('key').cumcount()  # get a row number for each group
).pivot(
    index='idx',  # pivot into the wide format
    columns='key',
    values='value'
).astype(float)  # turn values into numbers instead of strings
And I get:
key 89278 5857 89300 5857
idx
0 0.010000 0.403844
1 0.015912 0.403792
2 0.021001 0.403734
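For completeness, here is a small setup sketch for the frame from the question; it simply names the single column value so that the chain above applies as written (and on pandas 2.x the fillna(method='ffill') step is better spelled .ffill()):
import pandas as pd

d = ['89278 5857', '1.000e-02', '1.591184e-02', '2.100053e-02',
     '89300 5857', '4.038443e-01', '4.037924e-01', '4.037336e-01']
df = pd.DataFrame(data=d, columns=['value'])  # name the single column 'value' up front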
I have a dataframe that contains text separated by a comma
1 a,b,c,d
2 a,b,e,f
3 a,b,e,f
I am trying to have an output that prints the top 2 most common combinations of 2 letters + the # of occurrences among the entire dataframe. So based on the above dataframe the output would be
(a,b,3) (e,f,2)
The combination of a and b occurs 3 times, and the combination of e and f occurs 2 times. (Yes there are more combos that occur 2 times but we can just cut it off here to keep it simple) I am really stumped on just how to even start this. I was thinking of maybe looping through each row and somehow storing all combinations, and at the end we can print out the top n combinations and how many times they occurred in the dataframe.
Below is what I have so far according to what I have in mind.
import pandas as pd
from io import StringIO

StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")

df = pd.read_csv(StringData, sep=";")
for index, row in df.iterrows():
    # somehow get and store all possible 2 word combos?
    pass
You can do it this way:
import numpy as np
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep =";")
df['Date'] = df['Date'].apply(lambda x: x.split(','))
df['combinations'] = df['Date'].apply(lambda x: [(x[i], x[i+1]) for i in range(len(x)-1)])
df = df.explode('combinations')
df = df.groupby('combinations').agg('count').reset_index()
df.sort_values('Date', inplace=True, ascending=False)
df['combinations'] = df.values.tolist()
df.drop('Date', axis=1, inplace=True)
df['combinations'] = df['combinations'].apply(np.hstack)
print(df.iloc[:2, :])
Output:
combinations
0 [a, b, 3]
2 [b, e, 2]
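If you want counts of every unordered pair within each row (not just adjacent pairs), a minimal alternative sketch with itertools.combinations and collections.Counter could be (counts is just an illustrative name):
from collections import Counter
from io import StringIO
from itertools import combinations

import pandas as pd

StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep=";")

counts = Counter()
for letters in df['Date'].str.split(','):
    counts.update(combinations(letters, 2))

# the two most common pairs with their counts, e.g. [(('a', 'b'), 3), (('a', 'e'), 2)]
print(counts.most_common(2))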
I have a dataframe as follows:
print(df)
SAS_a1 SAS2_a1 SAS3_a1 FDF_b1 FDF2_b1
0 0.673114 0.745755 0.989468 0.498920 0.837440
1 0.811218 0.392196 0.505301 0.615603 0.946847
2 0.252856 0.709125 0.321580 0.826123 0.224813
3 0.566833 0.738661 0.626808 0.815460 0.003738
4 0.102995 0.171741 0.246565 0.784519 0.980965
I am aiming to compute pairwise correlations with pearsonr, but only between the columns ending in a1 and the columns ending in b1. The final result should look like this:
PCC p-value
SAS_a1__FDF_b1 -0.293373 0.631895
SAS_a1__FDF2_b1 -0.947724 0.014235
SAS2_a1__FDF_b1 0.771389 0.126618
SAS2_a1__FDF2_b1 0.132380 0.831942
SAS3_a1__FDF_b1 0.422249 0.478808
SAS3_a1__FDF2_b1 0.346411 0.567923
Any suggestions would be great!
Here is what I tried:
columns = df.columns.tolist()
for col_a, col_b in itertools.combinations(columns, 2):
    correlations[col_a + '__' + col_b] = pearsonr(df.loc[:, col_a], df.loc[:, col_b])

results = DataFrame.from_dict(correlations, orient='index')
results.columns = ['PCC', 'p-value']
I don't know if it's the most elegant solution, but you can use a list comprehension to select the relevant columns:
import pandas as pd
from scipy.stats import pearsonr

result = pd.DataFrame()
for a1 in [column for column in df.columns if 'a1' in column]:
    for b1 in [column for column in df.columns if 'b1' in column]:
        result = result.append(
            pd.Series(
                pearsonr(df[a1], df[b1]),
                index=['PCC', 'p-value'],
                name=a1 + '__' + b1
            ))
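Note that DataFrame.append was removed in pandas 2.0, so on newer versions a comparable sketch (same selection logic, just collected into a dict first, much like the attempt in the question) could be:
import pandas as pd
from scipy.stats import pearsonr

correlations = {}
for a1 in [c for c in df.columns if 'a1' in c]:
    for b1 in [c for c in df.columns if 'b1' in c]:
        stat, p = pearsonr(df[a1], df[b1])
        correlations[a1 + '__' + b1] = (stat, p)

result = pd.DataFrame.from_dict(correlations, orient='index', columns=['PCC', 'p-value'])
print(result)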
PS: It would be great if you would include your imports in your next question (so that people answering don't have to google them).
After running some commands I have a pandas dataframe, eg.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, eg.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-built command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
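If the goal is the literal source string rather than just a rebuilt frame, a minimal sketch using those same three pieces could be (cmd is only an illustrative name):
cmd = 'DataFrame({0}, columns={1}, index={2})'.format(
    df.values.tolist(), list(df.columns), list(df.index))
print(cmd)
# DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])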
Based on @Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format(
        [[xx for xx in x] for x in df.values],
        pandas_as,
        index_cmd,
        [c for c in df.columns])

# attach as a method on DataFrame (in Python 3, assigning the function to the
# class is enough; the original three-argument types.MethodType call is Python 2 only)
DataFrame.command = _gencmd
I have only tested it on a few cases so far and would love a more general solution.
I'm trying to merge/join two dataframes, each with three keys (Age, Gender and Signed_In). Both dataframes have the same parent and were created by groupby, but have unique value columns.
It seems like the merge/join should be painless given the unique combined keys are shared across both dataframes. Thinking there must be some simple error with my attempt at 'merge' and 'join' but can't for the life of me resolve it.
times = pd.read_csv('nytimes.csv')
# Produces times_mean table consisting of two value columns, avg_impressions and avg_clicks
times_mean = times.groupby(['Age','Gender','Signed_In']).mean()
times_mean.columns = ['avg_impressions', 'avg_clicks']
# Produces times_max table consisting of two value columns, max_impressions and max_clicks
times_max = times.groupby(['Age','Gender','Signed_In']).max()
times_max.columns = ['max_impressions', 'max_clicks']
# Following intended to produce combined table with four value columns
times_join = times_mean.join(times_max, on = ['Age', 'Gender', 'Signed_In'])
times_join2 = pd.merge(times_mean, times_max, on=['Age', 'Gender', 'Signed_In'])
You don't need the on kwarg when joining on an equivalently structured MultiIndex.
Here's an example demonstrating this:
import numpy as np
import pandas
a = np.random.normal(size=10)
b = a + 10
index = pandas.MultiIndex.from_product([['A', 'B'], list('abcde')])
df_a = pandas.DataFrame(a, index=index, columns=['colA'])
df_b = pandas.DataFrame(b, index=index, columns=['colB'])
df_a.join(df_b)
Which gives me:
colA colB
A a -1.525376 8.474624
b 0.778333 10.778333
c 1.153172 11.153172
d 0.966560 10.966560
e 0.089765 10.089765
B a 0.717717 10.717717
b 0.305545 10.305545
c 0.123548 10.123548
d -1.018660 8.981340
e -0.635103 9.364897
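Applied to the frames in the question, that presumably reduces to dropping the on argument, since both groupby results already share the same ('Age', 'Gender', 'Signed_In') MultiIndex:
# join aligns on the shared MultiIndex automatically
times_join = times_mean.join(times_max)

# with merge, align on the index explicitly instead of passing `on`
times_join2 = pd.merge(times_mean, times_max, left_index=True, right_index=True)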