I have a dataframe that contains text separated by a comma
1 a,b,c,d
2 a,b,e,f
3 a,b,e,f
I am trying to produce output that prints the top 2 most common combinations of 2 letters, plus the number of times each occurs across the entire dataframe. Based on the above dataframe, the output would be
(a,b,3) (e,f,2)
The combination of a and b occurs 3 times, and the combination of e and f occurs 2 times. (Yes, there are more combos that occur 2 times, but we can cut it off here to keep it simple.) I am really stumped on how to even start this. I was thinking of looping through each row, somehow storing all combinations, and at the end printing out the top n combinations and how many times they occurred in the dataframe.
Below is what I have so far according to what I have in mind.
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep=";")
for index, row in df.iterrows():
    # somehow get and store all possible 2 word combos?
You can do it this way:
import numpy as np
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep=";")
df['Date'] = df['Date'].apply(lambda x: x.split(','))  # turn each row into a list of letters
df['combinations'] = df['Date'].apply(lambda x: [(x[i], x[i+1]) for i in range(len(x)-1)])  # adjacent letter pairs
df = df.explode('combinations')  # one pair per row
df = df.groupby('combinations').agg('count').reset_index()  # count how often each pair occurs
df.sort_values('Date', inplace=True, ascending=False)
df['combinations'] = df.values.tolist()  # each row becomes [(letter, letter), count]
df.drop('Date', axis=1, inplace=True)
df['combinations'] = df['combinations'].apply(np.hstack)  # flatten to [letter, letter, count]
print(df.iloc[:2, :])
Output:
combinations
0 [a, b, 3]
2 [b, e, 2]
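Note that the pair-building step above only looks at adjacent letters. If you want every 2-letter combination within a row, a minimal sketch using collections.Counter and itertools.combinations (assuming the same single 'Date' column as above) could look like this:
from collections import Counter
from io import StringIO
from itertools import combinations
import pandas as pd

StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep=";")

counts = Counter()
for letters in df['Date']:
    # count every unordered pair of letters in the row, not just adjacent ones
    counts.update(combinations(letters.split(','), 2))

print(counts.most_common(2))  # e.g. [(('a', 'b'), 3), (('a', 'e'), 2)] (ties at 2 are cut off arbitrarily)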
import numpy as np
import pandas as pd
df = pd.read_csv('test_python.csv')
print(df.groupby('fifth').sum())
This is my data, and I am summing the first three columns for every word in fifth.
The result is this, and it is correct.
The next thing I want to do is take those results and sum them together, for example:
buy = 6
cheese = 8
file = 12
...
word = 13
How can I do this? How can I use the results?
Also, I now want to use the column second as a new column named second2, with the results as data. How can I do that?
For summing, you can use apply with a lambda:
df = pd.DataFrame({"first":[1]*14,
"second":np.arange(1,15),
"third":[0]*14,
"forth":["one","two","three","four"]*3+["one","two"],
"fifth":["hello","no","hello","hi","buy","hello","cheese","water","hi","juice","file","word","hi","red"]})
df1 = df.groupby(['fifth'])['first','second','third'].agg('sum').reset_index()
df1["sum_3_Col"] = df1.apply(lambda x: x["first"] + x["second"] + x["third"],axis=1)
df1.rename(columns={'second':'second2'}, inplace=True)
Output of df1:
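As a side note, the same row-wise total can be computed without apply by summing across the selected columns directly; a minimal equivalent using the df1 built above:
# vectorized alternative to the apply/lambda above (run before the rename to second2)
df1["sum_3_Col"] = df1[["first", "second", "third"]].sum(axis=1)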
In a column in a Dask Dataframe, I have strings like this:
column_name_1   column_name_2
a^b^c           j
e^f^g           k^l
h^i             m
I need to split these strings into columns in the same data frame, like this
column_name_1   column_name_2   column_name_1_1   column_name_1_2   column_name_1_3   column_name_2_1   column_name_2_2
a^b^c           j               a                 b                 c                 j
e^f^g           k^l             e                 f                 g                 k                 l
h^i             m               h                 i                                   m
I cannot figure out how to do this without knowing in advance how many occurrences of the delimiter there are in the data. Also, there are tens of columns in the Dataframe that are to be left alone, so I need to be able to specify which columns to split like this.
My best effort includes something like
df[["column_name_1_1", "column_name_1_2", "column_name_1_3"]] = df["column_name_1"].str.split('^', n=2, expand=True)
But it fails with a
ValueError: The columns in the computed data do not match the columns in the provided metadata
Here are two solutions that work without stack, using a loop over the selected column names:
cols = ['column_name_1','column_name_2']
for c in cols:
    # split each selected column and join the new columns back, prefixed with the column name
    df = df.join(df[c].str.split('^', n=2, expand=True).add_prefix(f'{c}_').fillna(''))

print(df)
column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2 \
0 a^b^c j a b c
1 e^f^g k^l e f g
2 h^i m h i
column_name_2_0 column_name_2_1
0 j
1 k l
2 m
Or, modifying another solution:
cols = ['column_name_1','column_name_2']
dfs = [df[c].str.split('^',n=2, expand=True).add_prefix(f'{c}_').fillna('') for c in cols]
df = pd.concat([df] + dfs, axis=1)
print (df)
column_name_1 column_name_2 column_name_1_0 column_name_1_1 column_name_1_2 \
0 a^b^c j a b c
1 e^f^g k^l e f g
2 h^i m h i
column_name_2_0 column_name_2_1
0 j
1 k l
2 m
Unfortunately, using dask.dataframe.Series.str.split with expand=True and an unknown number of splits is not yet supported in Dask; the following returns a NotImplementedError:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)
# returns NotImplementedError
ddf['column_name_1'].str.split('^', expand=True).compute()
Usually when a pandas equivalent has not yet been implemented in Dask, map_partitions can be used to apply a Python function on each DataFrame partition. In this case, however, Dask would still need to know how many columns to expect in order to lazily produce a Dask DataFrame, as provided with a meta argument. This makes using Dask for this task challenging. Relatedly, the ValueError occurs because column_name_2 requires only 1 split, and returns a Dask DataFrame with 2 columns, but Dask is expecting a DataFrame with 3 columns.
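For illustration, here is a rough map_partitions sketch under the assumption that you are willing to fix an upper bound on the number of parts (3 here), so that meta can be declared up front; split_col and max_parts are made-up names:
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)

max_parts = 3  # assumed upper bound on the number of '^'-separated parts

def split_col(pdf):
    # split within one pandas partition and pad out to a fixed number of columns
    parts = pdf['column_name_1'].str.split('^', expand=True).reindex(columns=range(max_parts))
    parts.columns = [f'column_name_1_{i}' for i in range(max_parts)]
    return parts

meta = pd.DataFrame({f'column_name_1_{i}': pd.Series(dtype='object') for i in range(max_parts)})
print(ddf.map_partitions(split_col, meta=meta).compute())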
Here is one solution (building from #Fontanka16's answer) if you do know the number of splits ahead of time:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame({
        'column_name_1': ['a^b^c', 'e^f^g', 'h^i'], 'column_name_2': ['j', 'k^l', 'm']
    }), npartitions=2
)
ddf_list = []
num_split_dict = {'column_name_1': 2, 'column_name_2': 1}
for col, num_splits in num_split_dict.items():
    split_df = ddf[col].str.split('^', n=num_splits, expand=True).add_prefix(f'{col}_')
    ddf_list.append(split_df)
new_ddf = dd.concat([ddf] + ddf_list, axis=1)
new_ddf.compute()
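If the number of splits is not known ahead of time, one option (at the cost of an extra pass over the data) is to compute it first. A sketch for building num_split_dict, assuming the Dask string accessor supports count like pandas does:
# count the maximum number of '^' delimiters per column (str.count takes a regex, hence the escape)
num_split_dict = {
    col: int(ddf[col].str.count(r'\^').max().compute())
    for col in ['column_name_1', 'column_name_2']
}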
I am importing a .txt file via read_table and get a DataFrame similar to
d = ['89278 5857', '1.000e-02', '1.591184e-02', '2.100053e-02', '89300 5857', '4.038443e-01', '4.037924e-01', '4.037336e-01']
df = pd.DataFrame(data = d)
and would like to reorganize it into
r = {'89278 5857': [1.000e-02, 1.591184e-02, 2.100053e-02], '89300 5857': [4.038443e-01, 4.037924e-01, 4.037336e-01]}
rf = pd.DataFrame(data = r)
The .txt file is typically 50k+ rows with an unknown number of '89278 5857' type values.
Thanks!
You can use itertools.groupby:
from itertools import groupby

data, cur_group = {}, None
for v, g in groupby(df[0], lambda k: " " in k):
    if v:
        # a row containing a space is a key and starts a new group
        cur_group = []
        data[next(g)] = cur_group
    else:
        # numeric rows are appended to the current group
        cur_group.extend(g)

df = pd.DataFrame(data)
print(df)
Prints:
89278 5857 89300 5857
0 1.000e-02 4.038443e-01
1 1.591184e-02 4.037924e-01
2 2.100053e-02 4.037336e-01
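Note that the values here are still the strings read from the file; if you need numbers, you would presumably convert afterwards:
df = df.astype(float)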
Assuming what delineates the start of the next group is a space, here is what I would do:
import numpy

# assumes the single column has been named 'value', e.g. df = pd.DataFrame(d, columns=['value'])
df.assign(
    key=lambda df: numpy.where(
        df['value'].str.contains(' '),  # what defines each group
        df['value'],
        numpy.nan
    ),
).fillna(
    method='ffill'  # copy the group label down until the next group starts
).loc[
    lambda df: df['value'] != df['key']  # remove the rows that kicked off each group
].assign(
    idx=lambda df: df.groupby('key').cumcount()  # get a row number for each group
).pivot(
    index='idx',  # pivot into the wide format
    columns='key',
    values='value'
).astype(float)  # turn values into numbers instead of strings
And I get:
key 89278 5857 89300 5857
idx
0 0.010000 0.403844
1 0.015912 0.403792
2 0.021001 0.403734
I have two dataframes, df1 and df2, constructed in the code below.
I am trying to match any term from df1 against the text in df2.
My code:
import sys,os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import csv
import re
# data
data1 = {'termID': [1,55,341,41,5685], 'term':['Cardic Arrest','Headache','Chest Pain','Muscle Pain', 'Knee Pain']}
data2 = {'textID': [25,12,52,35], 'text':['Hello Mike, Good Morning!!',
                                          'Oops!! My Knee pains!!',
                                          'Stop Music!! my head pains',
                                          'Arrest Innocent!!'
                                          ]}
#Dataframes
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Matching logic
matchList = []
for index_b, row_b in df2.iterrows():
    for index_a, row_a in df1.iterrows():
        # match only when the full term appears in the text (case-insensitive)
        if row_a.term.lower() in row_b.text.lower():
            # print(row_b.text, row_a.term)
            matchList.append([row_b.textID, row_b.text, row_a.term, row_a.termID])

cols = ['textID', 'text', 'term', 'termID']
d = pd.DataFrame(matchList, columns=cols)
print(d)
This gave me only a single row of output.
I have two issues to fix:
1. I am not sure how to get output for partial matches.
2. df1 and df2 have around 0.4M and 13M records respectively.
What are optimal ways to fix these two issues?
I've a quick fix for problem 1 but not an optimisation.
You only get one match because "Knee Pain" is the only term from df1 that appears in full in the df2 texts.
I've modified the if statement to split the text from df2 and check if there are any matches from the list.
Agree with #jakub that there are libraries that will do this quicker.
# Matching logic
matchList = []
for index_b, row_b in df2.iterrows():
    for index_a, row_a in df1.iterrows():
        # match if any word from the text appears inside the term
        if any(word in row_a.term.lower() for word in row_b.text.lower().split()):
            # print(row_b.text, row_a.term)
            matchList.append([row_b.textID, row_b.text, row_a.term, row_a.termID])

cols = ['textID', 'text', 'term', 'termID']
d = pd.DataFrame(matchList, columns=cols)
print(d)
Output
   textID                        text           term  termID
0      12      Oops!! My Knee pains!!      Knee Pain    5685
1      52  Stop Music!! my head pains       Headache      55
2      35           Arrest Innocent!!  Cardic Arrest       1
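For the scale issue, one possible (unbenchmarked) direction is to avoid the nested iterrows loops and express the word-level match as a merge. Note this sketch only finds exact word matches, so unlike the substring check above it would not map 'head' to 'Headache'; term_words and text_words are made-up intermediate names:
# explode terms and texts into lowercase words, then join on the word
term_words = df1.assign(word=df1['term'].str.lower().str.split()).explode('word')
text_words = df2.assign(word=df2['text'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True).str.split()).explode('word')

matches = (text_words.merge(term_words, on='word')
                     .drop_duplicates(['textID', 'termID'])
                     [['textID', 'text', 'term', 'termID']])
print(matches)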
After running some commands I have a pandas dataframe, eg.:
>>> print df
B A
1 2 1
2 3 2
3 4 3
4 5 4
I would like to print this out so that it produces simple code that would recreate it, e.g.:
DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
I tried pulling out each of the three pieces (data, columns and rows):
[[e for e in row] for row in df.iterrows()]
[c for c in df.columns]
[r for r in df.index]
but the first line fails because e is not a value but a Series.
Is there a pre-build command to do this, and if not, how do I do it? Thanks.
You can get the values of the data frame in array format by calling df.values:
df = pd.DataFrame([[2,1],[3,2],[4,3],[5,4]],columns=['B','A'],index=[1,2,3,4])
arrays = df.values
cols = df.columns
index = df.index
df2 = pd.DataFrame(arrays, columns = cols, index = index)
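To turn those three pieces into the requested string, a rough sketch (it does not handle a MultiIndex or index names) could be:
# build the reconstruction command from the pieces above
cmd = 'DataFrame({0}, columns={1}, index={2})'.format(
    df.values.tolist(), list(df.columns), list(df.index))
print(cmd)
# DataFrame([[2, 1], [3, 2], [4, 3], [5, 4]], columns=['B', 'A'], index=[1, 2, 3, 4])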
Based on #Woody Pride's approach, here is the full solution I am using. It handles hierarchical indices and index names.
from pandas import DataFrame, MultiIndex

def _gencmd(df, pandas_as='pd'):
    """
    With this addition to DataFrame's methods, you can use:
        df.command()
    to get the command required to regenerate the dataframe df.
    """
    if pandas_as:
        pandas_as += '.'
    index_cmd = df.index.__class__.__name__
    if type(df.index) == MultiIndex:
        index_cmd += '.from_tuples({0}, names={1})'.format([i for i in df.index], df.index.names)
    else:
        index_cmd += "({0}, name='{1}')".format([i for i in df.index], df.index.name)
    return 'DataFrame({0}, index={1}{2}, columns={3})'.format([[xx for xx in x] for x in df.values],
                                                              pandas_as,
                                                              index_cmd,
                                                              [c for c in df.columns])

# attach as a method (in Python 2 this was done with MethodType(_gencmd, None, DataFrame))
DataFrame.command = _gencmd
I have only tested it on a few cases so far and would love a more general solution.