extracting the text after a certain value in pandas - python

I am trying to extract values from a column that contains text data like this:
create date:1953/01/01 | first author:REAGAN RL
How can I extract the author name from this column and store it in a new column?
I tried the following ways:
df.str.extract("first author:(.*?)")
and
authorname = df['EntrezUID'].apply(lambda x: x.split("first author:"))
The second one worked. How can I use regular expressions to achieve the same thing?

You can do:
## sample data
df = pd.DataFrame({'dd':['create date:1953/01/01 | first author:REAGAN RL','create date:1953/01/01 | first author:MEGAN RL']})
## output
df['names'] = df['dd'].str.extract(r'author\:(.*)')
print(df)
                                                 dd      names
0  create date:1953/01/01 | first author:REAGAN RL  REAGAN RL
1   create date:1953/01/01 | first author:MEGAN RL   MEGAN RL
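For completeness, the reason the original attempt returned nothing useful is that a trailing lazy quantifier matches as little as possible, which here is the empty string. A minimal sketch of the difference (the sample Series is built from the question's data):
import pandas as pd

s = pd.Series(['create date:1953/01/01 | first author:REAGAN RL'])
# Lazy (.*?) at the end of the pattern captures zero characters -> empty result
print(s.str.extract(r'first author:(.*?)'))
# Greedy (.*) captures everything after the marker -> 'REAGAN RL'
print(s.str.extract(r'first author:(.*)'))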

Related

Split data frame of comments into multiple rows

I have a data frame with long comments and I want to split them into individual sentences using spaCy's sentencizer.
Comments = pd.read_excel('Comments.xlsx', sheet_name = 'Sheet1')
Comments
>>>
reviews
0 One of the rare films where every discussion leaving the theater is about how much you
just had, instead of an analysis of its quotients.
1 Gorgeous cinematography, insane flying action sequences, thrilling, emotionally moving,
and a sequel that absolutely surpasses its predecessor. Well-paced, executed & has that
re-watchability factor.
I loaded the model like this
import spacy
nlp = spacy.load("en_core_news_sm")
And I am using the sentencizer like this:
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
Data = Comments.reviews.apply(lambda x : list( nlp(x).sents))
But when I check the result, the sentences are still all in just one row, like this:
[One of the rare films where every discussion leaving the theater is about how much you just had.,
Instead of an analysis of its quotients.]
Thanks a lot for any help. I'm new to using NLP tools with DataFrames.
Currently, Data is a Series whose rows are lists of sentences, or more precisely, lists of spaCy Span objects. You probably want to obtain the text of these sentences and put each sentence on a separate row.
comments = [{'reviews': 'This is the first sentence of the first review. And this is the second.'},
{'reviews': 'This is the first sentence of the second review. And this is the second.'}]
comments = pd.DataFrame(comments) # building your input DataFrame
+----+--------------------------------------------------------------------------+
| | reviews |
|----+--------------------------------------------------------------------------|
| 0 | This is the first sentence of the first review. And this is the second. |
| 1 | This is the first sentence of the second review. And this is the second. |
+----+--------------------------------------------------------------------------+
Now let's define a function which, given a string, returns the list of its sentences as texts (strings).
def obtain_sentences(s):
    doc = nlp(s)
    sents = [sent.text for sent in doc.sents]
    return sents
The function can be applied to the comments DataFrame to produce a new DataFrame containing sentences.
data = comments.copy()
data['reviews'] = comments.apply(lambda x: obtain_sentences(x['reviews']), axis=1)
data = data.explode('reviews').reset_index(drop=True)
data
I used explode to transform the elements of the lists of sentences into rows.
And this is the obtained output!
+----+--------------------------------------------------+
| | reviews |
|----+--------------------------------------------------|
| 0 | This is the first sentence of the first review. |
| 1 | And this is the second. |
| 2 | This is the first sentence of the second review. |
| 3 | And this is the second. |
+----+--------------------------------------------------+
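As a side note, the same result can be obtained a bit more compactly by exploding the Series directly. This is only a sketch and assumes the nlp pipeline defined in the question:
data = (comments['reviews']
        .apply(lambda s: [sent.text for sent in nlp(s).sents])
        .explode()
        .reset_index(drop=True)
        .to_frame('reviews'))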

How to transform multiple pandas dataframes to an array under memory constraints?

The given problem:
I have folders named from folder1 to folder999. In each folder there are parquet files, named from 1.parquet to 999.parquet. Each parquet file contains a pandas DataFrame with the following structure:
id  | title | a
1   | abc   | 1
1   | abc   | 3
1   | abc   | 2
2   | abc   | 1
... | def   | ...
where column a can take a value in the range a1 to a3.
The partial (intermediate) step is to obtain this structure:
id | title | a1 | a2 | a3
1  | abc   | 1  | 1  | 1
2  | abc   | 1  | 0  | 0
...
in order to obtain the final form:
       title
id   | abc | def | ...
1    | 3   | ... |
2    | 1   | ... |
where the value in column abc is the sum of columns a1, a2 and a3.
The goal is to obtain final form calculated on all the parquet files in all the folders.
Now, the situation I am in looks like this: I do know how to get from the partial step to the final form, e.g. by using sparse.coo_matrix() as explained in How to make full matrix from dense pandas dataframe.
The problem is: due to memory limitations I cannot simply read all the parquets at once.
I have three questions:
How do I get there efficiently if I have a lot of data (assume each parquet file is about 500 MB)?
Can I transform each parquet to final form separately and THEN merge them somehow? If yes, how could I do that?
Is there any way to skip the partial step?
For every dataframe in the files, you seem to:
group the data by the columns id, title, and
sum the data in column a for each group.
Creating a full matrix for this task is not necessary, and neither is the partial step.
I am not sure how many unique combinations of id, title exist in a single file, or across all of them. A safe approach would be to process the files in batches, save their results, and later combine all the results.
That could look like this:
import pandas as pd
import numpy as np
import string

def gen_random_data(N, M):
    # N = 100
    # M = 10
    titles = np.apply_along_axis(lambda x: ''.join(x), 1,
                                 np.random.choice(list(string.ascii_lowercase), 3 * M).reshape(-1, 3))
    titles = np.random.choice(titles, N)
    _id = np.random.choice(np.arange(M) + 1, N)
    val = np.random.randint(M, size=(N,))
    df = pd.DataFrame(np.vstack((_id, titles, val)).T, columns=['id', 'title', 'a'])
    df = df.astype({'id': np.int64, 'title': str, 'a': np.int64})
    return df
def combine_results(dflist):
    # stitch into one dataframe
    comb_df = pd.concat(dflist, axis=1)
    # Sum over common axes i.e. id, titles
    comb_df = comb_df.apply(lambda row: np.nansum(row), axis=1)
    # Return a data frame with sum of a's
    return comb_df.to_frame('sum_of_a')
totalfiles = 10
batch = 2
filelist = []
for counter, start in enumerate(range(0, totalfiles, batch)):
    # Read a batch of files (here: generate random data instead)
    dflist = [gen_random_data(100, 2) for _ in range(batch)]
    # Process the data in memory
    dflist = [_.groupby(['id', 'title']).agg(['sum']) for _ in dflist]
    collection = combine_results(dflist)
    # write intermediate results to file and repeat the process for the rest of the files
    intermediate_result_file_name = f'resfile_{counter}'
    collection.to_parquet(intermediate_result_file_name, index=True)
    filelist.append(intermediate_result_file_name)

# Combining result files.
collection = [pd.read_parquet(file) for file in filelist]
totalresult = combine_results(collection)
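If you also need the final wide form (one column per title), a possible last step is a pivot over the combined result. This is only a sketch and assumes totalresult keeps the (id, title) MultiIndex produced by combine_results:
final = (totalresult
         .reset_index()
         .pivot_table(index='id', columns='title',
                      values='sum_of_a', aggfunc='sum', fill_value=0))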

Pandas: Why are my headers being inserted into the first row of my dataframe?

I have a script that collates sets of tags from other dataframes, converts them into comma-separated string and adds all of this to a new dataframe. If I use pd.read_csv to generate the dataframe, the first entry is what I expect it to be. However, if I use the df_empty script (below), then I get a copy of the headers in that first row instead of the data I want. The only difference I have made is generating a new dataframe instead of loading one.
The resultData = pd.read_csv() reads a .csv file with the following headers and no additional information:
Sheet, Cause, Initiator, Group, Effects
The df_empty script is as follows:
def df_empty(columns, dtypes, index=None):
    assert len(columns) == len(dtypes)
    df = pd.DataFrame(index=index)
    for c, d in zip(columns, dtypes):
        df[c] = pd.Series(dtype=d)
    return df
# https://stackoverflow.com/a/48374031
# Usage: df = df_empty(['a', 'b'], dtypes=[np.int64, np.int64])
My script contains the following line to create the dataframe:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],[np.str,np.int64,np.str,np.str,np.str])
I've also used the following with no differences:
resultData = df_empty(['Sheet','Cause','Initiator','Group','Effects'],['object','int64','object','object','object'])
My script to collate the data and add it to my dataframe is as follows:
data = {'Sheet': sheetNum, 'Cause': causeNum, 'Initiator': initTag, 'Group': grp, 'Effects': effectStr}
count = len(resultData)
resultData.at[count,:] = data
When I run display(data), I get the following in Jupyter:
{'Sheet': '0001',
'Cause': 1,
'Initiator': 'Tag_I1',
'Group': 'DIG',
'Effects': 'Tag_O1, Tag_O2,...'}
What I want to see with both options / what I get when reading the csv:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| 0001 | 1 | Tag_I1 | DIG | Tag_O1, Tag_O2,... |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
What I see when generating a dataframe with df_empty:
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
+-------+-------+-----------+-------+--------------------+
| Sheet | Cause | Initiator | Group | Effects |
| 0001 | 2 | Tag_I2 | DIG | Tag_O2, Tag_04,... |
+-------+-------+-----------+-------+--------------------+
Any ideas on what might be causing the generated dataframe to copy my headers into the first row, and whether it is possible for me to not have to read an otherwise empty csv?
Thanks!
Why? Because you've inserted the first row as data. The magic behaviour of using the first row as the header belongs to read_csv(); if you create your dataframe without read_csv, the first row is not treated specially.
Solution? Skip the first row when inserting into the data frame generated by df_empty.
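To illustrate the difference, here is a minimal sketch (the data is made up): read_csv promotes the first line of the file to column labels, while a hand-built frame keeps every row you insert, header strings included.
import io
import pandas as pd

csv_text = "Sheet,Cause\n0001,1\n"

# read_csv uses the first line as the header -> one data row
df_csv = pd.read_csv(io.StringIO(csv_text))

# a manually built frame treats no row specially -> inserting the header
# strings as a row puts them straight into the data
df_manual = pd.DataFrame(columns=['Sheet', 'Cause'])
df_manual.loc[0] = ['Sheet', 'Cause']   # this row shows up as data
df_manual.loc[1] = ['0001', 1]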

Rename Columns in Python using Regular Expressions

I have a data set that has columns for number of units sold in a given month - the problem being that the monthly units columns are named in MM/yyyy format, meaning that I have 12 columns of units information per record.
So for instance, my data looks like:
ProductID | CustomerID | 04/2018 | 03/2018 | 02/2018 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
What causes this to be problematic is that a new file comes in every month, with the same file name, but different column headers for the units information based on the last 12 months.
What I would like to do, is rename the monthly units columns to Month1, Month2, Month3... based on a simple regex such as ([0-9]*)/([0-9]*) that will result in the output:
ProductID | CustomerID | Month1 | Month2 | Month3 | FileDate |
a1032 | c1576 | 36 | 12 | 19 | 04/20/2018 |
I know that this should be possible using Python, but as I have never used Python before (I am an old .Net developer) I honestly have no idea how to achieve this.
I have done a bit of research on renaming columns in Python, but none of the examples mentioned pattern matching to rename a column, e.g.:
df = df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'})
UPDATE: The data that I am showing in my example is only a subset of the columns; in total my data set has 120 columns, only 12 of which need to be renamed, which is why I thought that regex might be the simplest way to go.
import re

# regex pattern
pattern = re.compile("([0-9]*)/([0-9]*)")

# get headers as list
headers = list(df)

# apply regex
months = 1
for index, header in enumerate(headers):
    if pattern.match(header):
        headers[index] = 'Month{}'.format(months)
        months += 1

# set new list as column headers
df.columns = headers
If you have some set of names that you want to convert to, then rather than using rename, it might be easier to just pass a new list to the df.columns attribute:
df.columns = ['ProductID','CustomerID']+['Month{}'.format(i) for i in range(12)]+['FileDate']
If you want to use rename and you can write a function find_new_name that does the conversion you want for a single name, you can rename an entire list old_names with
df.rename(columns={old_name: find_new_name(old_name) for old_name in old_names})
Or if you have a function that takes a new name and figures out what old name corresponds to it, then it would be
df.rename(columns = {find_old_name(new_name):new_name for new_name in new_names})
You can also do
for new_name in new_names:
    old_name = find_old_name(new_name)
    df[new_name] = df[old_name]
This will copy the data into new columns with the new names rather than renaming, so you can then subset to just the columns you want.
Since rename can take a function as a mapper, we can define a customized function which returns a new column name in the new format if the old column name matches the regex, and otherwise returns the column name unchanged. For example:
import re

def mapper(old_name):
    match = re.match(r'([0-9]*)/([0-9]*)', old_name)
    if match:
        return 'Month{}'.format(int(match.group(1)))
    return old_name

df = df.rename(columns=mapper)
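A quick usage example (the column names are taken from the question; the DataFrame itself is just a stand-in):
import pandas as pd

df = pd.DataFrame(columns=['ProductID', 'CustomerID', '04/2018', '03/2018', '02/2018', 'FileDate'])
df = df.rename(columns=mapper)
print(list(df.columns))
# ['ProductID', 'CustomerID', 'Month4', 'Month3', 'Month2', 'FileDate']
Note that this numbers the columns by the month taken from the original header rather than sequentially; the loop-based answer above produces Month1, Month2, ... in column order instead.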

Min Values in each row of selected columns in Python

Hi, I have a rather simple task, but none of the online help seems to work.
I have a data set like this:
ID    | Px_1       | Px_2
theta | 106.013676 | 102.8024788702673
Rho   | 100.002818 | 102.62640389123405
gamma | 105.360589 | 107.21999706084836
Beta  | 106.133046 | 115.40449479551263
alpha | 106.821119 | 110.54312246081719
I want to find the row-wise minimum of these columns and put it in a fourth column, so that the output for theta, for example, is 102.802, because that is the smaller of its Px_1 and Px_2 values.
I tried this but it doesn't work; I constantly get the max value:
df_subset = read.set_index('ID')[['Px_1','Px_2']]
d = df_subset.min( axis=1)
Thanks
You can try this
df["min"] = df[["Px_1", "Px_2"]].min(axis=1)
Select the columns needed, here ["Px_1", "Px_2"], and take the minimum along axis=1 (row-wise).
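Applied to the sample data from the question (reconstructed here as a DataFrame), this gives the expected row-wise minimum:
import pandas as pd

df = pd.DataFrame({
    'ID':   ['theta', 'Rho', 'gamma', 'Beta', 'alpha'],
    'Px_1': [106.013676, 100.002818, 105.360589, 106.133046, 106.821119],
    'Px_2': [102.8024788702673, 102.62640389123405, 107.21999706084836,
             115.40449479551263, 110.54312246081719],
}).set_index('ID')

df['min'] = df[['Px_1', 'Px_2']].min(axis=1)
print(df['min'])
# theta    102.802479
# Rho      100.002818
# gamma    105.360589
# Beta     106.133046
# alpha    106.821119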
