I have a large text file over 1GB of chat data (chat.txt) in the following format:
john|12-02-1999|hello#,there#,how#,are#,you#,tom$
tom|12-02-1999|hey#,john$,hows#, it#, goin#
mary|12-03-1999|hello#,boys#,fancy#,meetin#,ya'll#,here#
...
...
john|12-02-2000|well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$
mary|12-03-2000|catch#,you#,on#,the#,flipside#,tom$,and#,john$
I want to process this text and summarize the word counts for certain keywords (say 500 words: hello, nice, like, ... dinner, no) for each user separately. This also involves removing the trailing special characters from each word.
The output would look like this:
user hello nice like ..... dinner No
Tom 10000 500 300 ..... 6000 0
John 6000 1200 200 ..... 3000 5
Mary 23 9000 10000 ..... 100 9000
This is my current pythonic solution:
import pandas as pd
from collections import Counter

chat_data = pd.read_csv("chat.txt", sep="|", names=["user", "date", "words"])
user_lst = chat_data.user.unique()
user_grouped_data = pd.DataFrame(columns=["user", "words"])
user_grouped_data['user'] = user_lst
for i, row in user_grouped_data.iterrows():
    id = row["user"]
    temp = chat_data[chat_data["user"] == id]
    user_grouped_data.loc[i, "words"] = ",".join(temp["words"].tolist())
result = pd.DataFrame(columns=["user", "hello", "nice", "like", "...500 other keywords...", "dinner", "no"])
result["user"] = user_lst
for i, row in result.iterrows():
    id = row["user"]
    temp = user_grouped_data[user_grouped_data["user"] == id]
    words = temp.values.tolist()[0][1]
    word_lst = words.split(",")
    word_lst = [item[0:-1] for item in word_lst]
    t_dict = Counter(word_lst)
    keys = t_dict.keys()
    for word in keys:
        result.at[i, word] = t_dict.get(word)
result.to_csv("user_word_counts.csv")
This works fine for small data, but once chat.txt grows beyond 1 GB this solution becomes very slow and unusable.
Is there any part below that I can improve to process the data faster?
grouping textual data by user
cleaning textual data in each row by removing trailing special characters
counting words and assigning the word count to the right column
You can split the comma-separated column into a list, explode the dataframe by that column of lists, group by name and by the values from the exploded list, unstack or pivot_table the dataframe into your desired format, and do some final cleaning on the multi-index columns with droplevel(), reset_index(), etc.
All of the below uses vectorized pandas methods, so it should be quick. Note: the three columns are [0, 1, 2] in the code below because I read from the clipboard and passed header=None.
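To read the actual chat.txt the same way (my assumption; the answer itself reads from the clipboard), something like the following should give the same integer column labels:
import pandas as pd

df = pd.read_csv("chat.txt", sep="|", header=None)  # columns become 0, 1, 2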
Input:
df = pd.DataFrame({0: {0: 'john', 1: 'tom', 2: 'mary', 3: 'john', 4: 'mary'},
                   1: {0: '12-02-1999',
                       1: '12-02-1999',
                       2: '12-03-1999',
                       3: '12-02-2000',
                       4: '12-03-2000'},
                   2: {0: 'hello#,there#,how#,are#,you#,tom$ ',
                       1: 'hey#,john$,hows#, it#, goin#',
                       2: "hello#,boys#,fancy#,meetin#,ya'll#,here#",
                       3: 'well#,its#,been#,nice#,catching#,up#,with#,you#,and#, mary$',
                       4: 'catch#,you#,on#,the#,flipside#,tom$,and#,john$'}})
Code:
df[2] = df[2].replace([r'#', r'\$'], '', regex=True).str.split(',')
df = (df.explode(2)
        .groupby([0, 2])[2].count()
        .rename('Count')
        .reset_index()
        .set_index([0, 2])
        .unstack(1)
        .fillna(0))
df.columns = df.columns.droplevel()
df = df.reset_index()
df
Out[1]:
2 0 goin it mary and are been boys catch catching ... on \
0 john 0.0 0.0 1.0 1.0 1.0 1.0 0.0 0.0 1.0 ... 0.0
1 mary 0.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 ... 1.0
2 tom 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0
2 the there tom tom up well with ya'll you
0 0.0 1.0 0.0 1.0 1.0 1.0 1.0 0.0 2.0
1 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

[3 rows x 31 columns]
You could also use .pivot_table instead of .unstack(), which saves you this line of code: df.columns = df.columns.droplevel():
df[2] = df[2].replace([r'#', r'\$'], '', regex=True).str.split(',')
df = (df.explode(2)
        .groupby([0, 2])[2].count()
        .rename('Count')
        .reset_index()
        .pivot_table(index=0, columns=2, values='Count')
        .fillna(0)
        .astype(int)
        .reset_index())
df
Out[45]:
2 0 goin it mary and are been boys catch catching ... on \
0 john 0 0 1 1 1 1 0 0 1 ... 0
1 mary 0 0 0 1 0 0 1 1 0 ... 1
2 tom 1 1 0 0 0 0 0 0 0 ... 0
2 the there tom tom up well with ya'll you
0 0 1 0 1 1 1 1 0 2
1 1 0 1 0 0 0 0 1 1
2 0 0 0 0 0 0 0 0 0
[3 rows x 31 columns]
If you are able to use scikit-learn, it is very easy with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

s = df['words'].str.replace(r"#|\$|\s+", "", regex=True)
model = CountVectorizer(tokenizer=lambda x: x.split(','))
df_final = pd.DataFrame(model.fit_transform(s).toarray(),
                        columns=model.get_feature_names_out(),  # get_feature_names() on older scikit-learn
                        index=df.user).groupby(level=0).sum()   # .sum(level=0) on older pandas
Out[279]:
and are been boys catch catching fancy flipside goin hello \
user
john 1 1 1 0 0 1 0 0 0 1
tom 0 0 0 0 0 0 0 0 1 0
mary 1 0 0 1 1 0 1 1 0 1
here hey how hows it its john mary meetin nice on the there \
user
john 0 0 1 0 0 1 0 1 0 1 0 0 1
tom 0 1 0 1 1 0 1 0 0 0 0 0 0
mary 1 0 0 0 0 0 1 0 1 0 1 1 0
tom up well with ya'll you
user
john 1 1 1 1 0 2
tom 0 0 0 0 0 0
mary 1 0 0 0 1 1
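As a side note (my addition, not part of the original answer): since only a fixed keyword list is needed, CountVectorizer can be given that list as its vocabulary, so the output is restricted to exactly those columns. A minimal sketch, where keywords is a hypothetical name standing in for your full ~500-word list:
keywords = ["hello", "nice", "like", "dinner", "no"]   # hypothetical, stands in for the full list
model = CountVectorizer(tokenizer=lambda x: x.split(','), vocabulary=keywords)
df_final = pd.DataFrame(model.fit_transform(s).toarray(),
                        columns=model.get_feature_names_out(),
                        index=df.user).groupby(level=0).sum()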
I am not sure how much faster this approach is on a large DataFrame, but you can give it a try. First, remove the special characters and split the strings into lists of words, forming another column:
from itertools import chain
from collections import Counter

df['lists'] = df['words'].str.replace(r"#|\$", "", regex=True).str.split(",")
Now, group by the user, chain the lists into one, and count the occurrences with Counter:
df.groupby('user')['lists'].apply(chain.from_iterable)\
                           .apply(Counter)\
                           .apply(pd.Series)\
                           .fillna(0).astype(int)
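Whichever of the approaches above you use, the wide result can afterwards be cut down to just the keywords of interest. A minimal sketch, where counts and keywords are hypothetical names for the per-user count frame and your ~500-word list:
counts = counts.reindex(columns=keywords, fill_value=0)   # keep only the keyword columns, 0 where a user never used a word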
I have data which I pivoted using the pivot_table method; now the data looks like this:
rule_id a b c
50211 8 0 0
50249 16 0 3
50378 0 2 0
50402 12 9 6
I have set 'rule_id' as the index. Now I compare each column to the next one and create another column with the result. The idea is: if the first column has a value other than 0 and the second column (the one it is compared to) has 0, then 100 should go into the newly created column, but in the opposite situation Null should be entered. If both columns are 0, Null should also be entered. For the last column, Null should be entered if its value is 0, and 100 otherwise. But if both columns have values other than 0 (like in the last row of my data), then the comparison should be, for columns a and b:
value_of_b / value_of_a * 50 + 50
and for columns b and c:
value_of_c / value_of_b * 25 + 25
and similarly, if there are more columns, the multiplication and addition value should be 12.5, and so on.
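For example, applying these rules to the last row (a=12, b=9, c=6, all non-zero) gives:
comp1 = 9 / 12 * 50 + 50   # b/a * 50 + 50 = 87.5
comp2 = 6 / 9 * 25 + 25    # c/b * 25 + 25 = 41.67
comp3 = 100                # c is not 0, so 100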
I was able to achieve all of the above apart from the last part, the division and multiplication. I used this code:
m = df.eq(df.shift(-1, axis=1))
arr = np.select([df == 0, m], [np.nan, df], 100)
df2 = pd.DataFrame(arr, index=df.index).rename(columns=lambda x: f'comp{x+1}')
df3 = df.join(df2)
df is the dataframe that stores the pivoted data I mentioned at the start. After using this code, my data looks like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 100 100 100
But I want the data to look like this:
rule_id a b c comp1 comp2 comp3
50211 8 0 0 100 NaN NaN
50249 16 0 3 100 NaN 100
50378 0 2 0 NaN 100 NaN
50402 12 9 6 87.5 41.67 100
If you can help me get the desired data, I would greatly appreciate it.
The problem is that the coefficient to use when building each new compx column does not depend only on the column position. In fact, in each row it is reset to its maximum of 50 after each 0 value and is half of the previous one after a non-zero value. Such resettable series are hard to vectorize in pandas, especially along rows. Here I would build a companion dataframe holding only those coefficients and compute them as efficiently as possible on the underlying values. The code could be:
import numpy as np
import pandas as pd

# transpose the dataframe to process columns instead of rows
coeff = df.T

# compute the coefficients
for name, s in coeff.items():
    top = 100                 # start at 100
    r = []
    for i, v in enumerate(s):
        if v == 0:            # reset to 100 on a 0 value
            top = 100
        else:
            top = top / 2     # else halve the previous value
        r.append(top)
    coeff.loc[:, name] = r    # set the whole column in one operation

# transpose back to have a companion dataframe for df
coeff = coeff.T

# build a new column from 2 consecutive ones, using the coeff dataframe
def build_comp(col1, col2, i):
    df['comp{}'.format(i)] = np.where(df[col1] == 0, np.nan,
                                      np.where(df[col2] == 0, 100,
                                               df[col2] / df[col1] * coeff[col1]
                                               + coeff[col1]))

old = df.columns[0]           # store the name of the first column

# enumerate all the columns (except the first one)
for i, col in enumerate(df.columns[1:], 1):
    build_comp(old, col, i)
    old = col                 # keep the current column name for the next iteration

# special processing for the last comp column
df['comp{}'.format(i + 1)] = np.where(df[col] == 0, np.nan, 100)
With this initial dataframe:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32
rule_id
50402 0 0 9 0
51121 0 1 0 0
51147 0 1 0 0
51183 2 0 0 0
51283 0 12 9 6
51684 0 1 0 0
52035 0 4 3 2
it gives as expected:
date 2019-04-25 15:08:23 2019-04-25 16:14:14 2019-04-25 16:29:05 2019-04-25 16:36:32 comp1 comp2 comp3 comp4
rule_id
50402 0 0 9 0 NaN NaN 100.000000 NaN
51121 0 1 0 0 NaN 100.0 NaN NaN
51147 0 1 0 0 NaN 100.0 NaN NaN
51183 2 0 0 0 100.0 NaN NaN NaN
51283 0 12 9 6 NaN 87.5 41.666667 100.0
51684 0 1 0 0 NaN 100.0 NaN NaN
52035 0 4 3 2 NaN 87.5 41.666667 100.0
Ok, I think you can iterate over your dataframe df and use some if-else to get the desired output.
for i in range(len(df.index)):
    if df.iloc[i, 1] != 0 and df.iloc[i, 2] == 0:   # columns start from index 0,
        df.loc[i, 'colname'] = 'whatever you want'  # so rule_id is column 0
    elif ...:                                        # and so on for the remaining cases
        ...
I have a dataframe df1 that looks like this:
Sample_names esv0 esv1 esv2 ... esv918 esv919 esv920 esv921
0 pr1gluc8NH1 2.1 3.5 6222 ... 0 0 0 0
1 pr1gluc8NH2 3189.0 75.0 9045 ... 0 0 0 0
2 pr1gluc8NHCR1 0.0 2152.0 12217 ... 0 0 0 0
3 pr1gluc8NHCR2 0.0 17411.0 1315 ... 0 1 0 0
4 pr1sdm8NH1 365.0 7.0 4117 ... 0 0 0 0
5 pr1sdm8NH2 4657.0 18.0 13520 ... 0 0 0 0
6 pr1sdm8NHCR1 0.0 139.0 3451 ... 0 0 0 0
7 pr1sdm8NHCR2 1130.0 1439.0 4163 ... 0 0 0 0
I want to perform some operations on the rows and replace them, via a for loop.
import numpy as np

for i in range(len(df1)):
    x = df1.iloc[i].values              # gets all the values corresponding to each row
    x = np.vstack(x[1:]).astype(float)  # converts the object values to a regular 2D array for all row elements except the first, which is a string
    x = x / np.sum(x)                   # normalize to 1
    df1.iloc[i, 1:] = x                 # this is the step that should replace part of the old row with the new array
But with this I get the error "ValueError: Must have equal len keys and value when setting with an ndarray", even though x has the same length as each row of df1 minus 1 (I don't want to replace the first column, Sample_names).
I also tried df1 = df1.replace(df1.iloc[i, 1:], x). This gives TypeError: value argument must be scalar, dict, or Series.
I would appreciate any ideas for how to do this.
Thanks.
You need to reshape the x array, as its shape is (n, 1), where n is the number of your esv-like columns.
Change the line: df1.iloc[i, 1:] = x to
df1.iloc[i, 1:] = x.squeeze()
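A minimal sketch of why this works (the sample numbers are just illustrative):
import numpy as np

x = np.vstack([2.1, 3.5, 6222.0]).astype(float)  # np.vstack produces a column vector, shape (3, 1)
x = x / np.sum(x)
x.squeeze().shape                                # (3,) -- now it matches the 1-D slice df1.iloc[i, 1:]
As a side note (my addition, not part of the original answer), the loop can also be avoided entirely with a row-wise division such as df1.iloc[:, 1:] = df1.iloc[:, 1:].div(df1.iloc[:, 1:].sum(axis=1), axis=0).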
I have created a DataFrame full of zeros such as:
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
...
n 0 0 0
I have a list containing names for the columns in unicode, such as:
list = [u'One', u'Two', u'Three']
The DataFrame of zeroes is known as a, and I am creating a new complete DataFrame with the list as column headers via:
final = pd.DataFrame(a, columns=[list])
However, the resulting DataFrame has column names that are no longer unicode (i.e. they do not show the u'' tag).
I am wondering why this is happening. Thanks!
There is no reason for the unicode to be lost; you can check it with:
print(final.columns.tolist())
Please never use built-in names like list, type, id... as variable names, because doing so masks the built-in functions. It is also necessary to add .values to convert the data to a numpy array:
a = pd.DataFrame(0, columns=range(3), index=range(3))
print (a)
0 1 2
0 0 0 0
1 0 0 0
2 0 0 0
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(a.values, columns=L)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
Otherwise the columns are not aligned and you get all NaNs:
final = pd.DataFrame(a, columns=L)
print (final)
One Two Three
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
I think the simplest approach is to use only the index of the a DataFrame, if all values are 0:
L = [u'One', u'Two', u'Three']
final = pd.DataFrame(0, columns=L, index=a.index)
print (final)
One Two Three
0 0 0 0
1 0 0 0
2 0 0 0
I would like to calculate the number of instances in which two criteria are fulfilled in a Pandas DataFrame at different index values. A snippet of the DataFrame is:
GDP USRECQ
DATE
1947-01-01 NaN 0
1947-04-01 NaN 0
1947-07-01 NaN 0
1947-10-01 NaN 0
1948-01-01 0.095023 0
1948-04-01 0.107998 0
1948-07-01 0.117553 0
1948-10-01 0.078371 0
1949-01-01 0.034560 1
1949-04-01 -0.004397 1
I would like to count the number of observations for which USRECQ[DATE+1] == 1 and GDP[DATE] > a, provided GDP[DATE] is not NaN.
By DATE+1 and DATE I mean that the value of USRECQ should be checked at the date following the one at which the value of GDP is examined. Unfortunately, I do not know how to deal with the different time indices in my selection. Can someone kindly advise me on how to count the number of instances properly?
One way of achieving this is to create a new column showing the next value of 'USRECQ':
>>> df['USRECQ NEXT'] = df['USRECQ'].shift(-1)
>>> df
DATE GDP USRECQ USRECQ NEXT
0 1947-01-01 NaN 0 0
1 1947-04-01 NaN 0 0
2 1947-07-01 NaN 0 0
3 1947-10-01 NaN 0 0
4 1948-01-01 0.095023 0 0
5 1948-04-01 0.107998 0 0
6 1948-07-01 0.117553 0 0
7 1948-10-01 0.078371 0 1
8 1949-01-01 0.034560 1 1
9 1949-04-01 -0.004397 1 NaN
Then you could filter your DataFrame according to your requirements as follows:
>>> a = 0.01
>>> df[(df['USRECQ NEXT'] == 1) & (df['GDP'] > a) & (pd.notnull(df['GDP']))]
DATE GDP USRECQ USRECQ NEXT
7 1948-10-01 0.078371 0 1
8 1949-01-01 0.034560 1 1
To count the number of rows in a DataFrame, you can just use the built-in function len.
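For example, reusing the filter above:
count = len(df[(df['USRECQ NEXT'] == 1) & (df['GDP'] > a) & (pd.notnull(df['GDP']))])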
I think the DataFrame.shift method is the key to what you seek in terms of looking at the next index.
And Numpy's logical expressions can come in really handy for these sorts of things.
So if df is your dataframe then I think what you're looking for is something like
count = df[np.logical_and(df.shift(-1)['USRECQ'] == 1, df.GDP > -0.1)]
The example I used to test this is on github.