I have a data frame, that looks like this:
print(df)
Text
0|This is a text
1|This is also text
What I wish: I would like to do a for loop over the Text column for the data frame, and create a new column with the derived information to be like this:
Text | Derived_text
0|This is a text | Something
1|This is also text| Something
Code: I have written the following code (Im using Spacy btw):
for i in df['Text'].tolist():
doc = nlp(i)
resolved = [(doc._.coref_resolved) for docs in doc.ents]
df = df.append(pd.Series(resolved), ignore_index=True)
Problem: The problem is that the appended series gets misplaced/mismatched, so it looks like this:
Text | Derived_text
0|This is a text | NaN
1|This is also text| NaN
2|NaN | Something
3|NaN | Something
I have also tried to just save it into a list, but the list does not include NaN values, which can occur doing the derived for loop. I need the NaN values to be kept, so I can match the original text with the derived text using the index position.
It appears that you want to add a column, which can be done using pandas concat method using the axis argument like pd.concat([df, new_columns], axis = 1).
However I think you shouldn't use for loops while using pandas. What probably should do is use it's pandas's apply function, which would look something like:
# define you DataFrame
df = pd.DataFrame(data = [range(6), range(1, 7)], columns = ['a', 'b'])
# create the new column from one of them
df['a_squared'] = df['a'].apply(lambda x: x ** 2)
Maybe you should also look into lambda expressions.
Also, look into this stackoverflow question.
Hope this helped! Happy coding!
Related
I want to subtract or remove the words in one dataframe from another dataframe in each row.
This is the main table/columns of a pyspark dataframe.
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i want to go|
|2020-09-02|i need a line hold |
|2020-09-02|i have the 60 packs|
|2020-09-02|hello want you teach|
Below is another pyspark dataframe. The words in this dataframe needs to be removed from the above main table in column cust_text wherever the words occur in each row. For example, 'want' will be removed from every row wherever it shows up in 1st dataframe.
+-------+
|column1|
+-------+
| want|
|because|
| need|
| hello|
| a|
| have|
| go|
+-------+
This can be done in pyspark or pandas. I have tried googling the solution using Python, Pyspark, pandas, but still not able to remove the words from the main table based on a single column table.
The result should look like this:
+----------+--------------------+
| event_dt| cust_text|
+----------+--------------------+
|2020-09-02|hi fine i to |
|2020-09-02|i line hold |
|2020-09-02|i the 60 packs |
|2020-09-02|you teach |
+----------+--------------------+
If you want to remove just the word in the corresponding line of df2, you could do that as follows, but it will probably be slow for large data sets, because it only can partially can use fast C implementations:
# define your helper function to remove the string
def remove_string(ser_row):
return ser_row['cust_text'].replace(ser_row['remove'], '')
# create a temporary column with the string to remove in the first dataframe
df1['remove']= df2['column1']
df1= df1.apply(remove_string, axis='columns')
# drop the temporary column afterwards
df1.drop(columns=['remove'], inplace=True)
The result looks like:
Out[145]:
0 hi fine i to go
1 i need lines hold
2 i have the 60 packs
3 can you teach
dtype: object
If however, you want to remove all words in your df2 column from every column, you need to do it differntly. Unfortunately str.replace does not help here with regular strings, unless you want to call it for every line in your second dataframe.
So if your second dataframe is not too large, you can create a regular expression to make use of str.replace.
import re
replace=re.compile(r'\b(' + ('|'.join(df2['column1'])) + r')\b')
df1['cust_text'].str.replace(replace, '')
The output is:
Out[184]:
0 hi fine i to
1 i lines hold
2 i the 60 packs
3 can you teach
Name: cust_text, dtype: object
If you don't like the repeated spaces, that remain, you can just perform something like:
df1['cust_text'].str.replace(replace, '').str.replace(re.compile('\s{2,}'), ' ')
Addition: what, if not only the text without the words is relevant, but the words themselves as well. How can we get the words, which were replaced. Here is one attempt, which would work, if one character can be identified, which will not appear in the text. Let's assume this character is a #, then you could do (on the original column value without replacement):
# enclose each keywords in #
ser_matched= df1['cust_text'].replace({replace: r'#\1#'}, regex=True)
# now remove the rest of the line, which is unmatched
# this is the part of the string after the last occurance
# of a #
ser_matched= ser_matched.replace({r'^(.*)#.*$': r'\1', '^#': ''}, regex=True)
# and if you like your keywords to be in a list, rather than a string
# you can split the string at last
ser_matched.str.split(r'#+')
This solution would be specific to pandas. If I understand your challenge correctly, you want to remove all words from column cust_text that occur in column1 of the second DataFrame. Lets give the corresponding DataFrames the names: df1 and df2. This is how you would do this:
for i in range(len(df1)):
sentence = df1.loc[i, "cust_text"]
for j in range(len(df2)):
delete_word = df2.loc[j, "column1"]
if delete_word in sentence:
sentence = sentence.replace(delete_word, "")
df1.loc[i, "cust_text"] = sentence
I have assigned variables to certain data points in these dataframes (sentence and delete_word), but that is just for the sake of understanding. You can easily condense this code to a few lines shorter by not doing that.
I have a dataframe like as given below
test1 = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
'flag' : ['','','T1','T1','T1','T1','T1','T1','T1','T1','','','T1','T1','T1','T1','T1','T1','T1','T1']
})
It looks like as shown below
As per the rule/logic, T1 can appear in flag field only after 5 days/records from it's first occurrence. For example if T1 had occurred on 3rd index, it can then only occur on 9th index and more..Anything before that are invalid and has to be removed.
I tried the below. Though this works, it doesn't look elegant and not suitable for all subjects.
a = test1[test1['flag']=='T1'].index.min()
test1.loc[a+1:a+6, 'flag'] = ''
How can I do this check indvidually for all the subjects? Each subject and its flag should follow this rule
I expect my output to be like as shown below. You can see the invalid flags are removed
We can do
s=test1['flag'].eq('T1').groupby(test1['subject_id']).transform('idxmax')
test1.loc[~((test1.index==s)|(test1.index>(s+5))),'flag']=''
Here is a slightly different way to do it, in a single piped statement. For clarity, I'm creating additional columns for the cumsum and the condition and then sub-setting the dataframe.
test1.\
assign(cum_sum=lambda x: x.flag.eq('T1').groupby(x.subject_id).cumsum()).\
assign(condition=lambda x: (x.flag=='') | (x.cum_sum==1) | (x.cum_sum >=5)).\
loc[lambda x: x.condition]
Hope this helps.
I know that there are a few questions about nested dictionaries to dataframe but their solutions do not work for me. I have a dataframe, which is contained in a dictionary, which is contained in another dictionary, like this:
df1 = pd.DataFrame({'2019-01-01':[38],'2019-01-02':[43]},index = [1,2])
df2 = pd.DataFrame({'2019-01-01':[108],'2019-01-02':[313]},index = [1,2])
da = {}
da['ES']={}
da['ES']['TV']=df1
da['ES']['WEB']=df2
What I want to obtain is the following:
df_final = pd.DataFrame({'market':['ES','ES','ES','ES'],'device':['TV','TV','WEB','WEB'],
'ds':['2019-01-01','2019-01-02','2019-01-01','2019-01-02'],
'yhat':[43,38,423,138]})
Getting the code from another SO question I have tried this:
market_ids = []
frames = []
for market_id,d in da.items():
market_ids.append(market_id)
frames.append(pd.DataFrame.from_dict(da,orient = 'index'))
df = pd.concat(frames, keys=market_ids)
Which gives me a dataframe with multiple indexes and the devices as column names.
Thank you
The code below works well and gives the desired output:
t1=da['ES']['TV'].melt(var_name='ds', value_name='yhat')
t1['market']='ES'
t1['device']='TV'
t2=da['ES']['WEB'].melt(var_name='ds', value_name='yhat')
t2['market']='ES'
t2['device']='WEB'
m = pd.concat([t1,t2]).reset_index().drop(columns={'index'})
print(m)
And the output is:
ds yhat market device
0 2019-01-01 38 ES TV
1 2019-01-02 43 ES TV
2 2019-01-01 108 ES WEB
3 2019-01-02 313 ES WEB
The main takeaway here is melt function, which if you read about isn't that difficult to understand what's it doing here. Now as I mentioned in the comment above, this can be done iteratively over whole da named dictionary, but to perform that I'd need replicated form of the actual data. What I intended to do was to take this first t1 as the initial dataframe and then keep on concatinating others to it, which should be really easy. But I don't know how your actual values are. But I am sure you can figure out on your own from above how to put this under a loop.
The pseudo code for that loop thing I am talking about would be like this:
real=t1
for a in da['ES'].keys():
if a!='TV':
p=da['ES'][a].melt(var_name='ds', value_name='yhat')
p['market']='ES'
p['device']=a
real = pd.concat([real,p],axis=0,sort=True)
real.reset_index().drop(columns={'index'})
I'm sure I'm making an obviously mistake, but can't see it.
I have a df that looks like this:
id year plan grade prior_grade
21 2017 text A B
56 2015 text B B
43 2016 text A C
and want to create a new df with only those rows where prior_grade = c. I'm using this to do so:
prior_c = (df.loc[(df['prior_grade']=='C')])
which returns an empty df (column names print but no rows when calling prior_c.head())
Again, I'm sure I'm making an obvious mistake, but just can't see it.
edit: also tried with less parens and got the same result:
prior_c = df.loc[df['prior_grade']=='C']
This should work, although I believe you should make a copy of the dataframe. This serves to explicitly make the new dataframe a copy (rather than a view of the original) so as to avoid unintentionally changing your original df. I would recommend the following:
prior_c = df.loc[df['prior_grade']=='C'].copy()
I have some experimental data which looks like this - http://paste2.org/YzJL4e1b (too long to post here). The blocks which are separated by field name lines are different trials of the same experiment - I would like to read everything in a pandas dataframe but have it bin together certain trials (for instance 0,1,6,7 taken together - and 2,3,4,5 taken together in another group). This is because different trials have slightly different conditions and I would like to analyze the results difference between these conditions. I have a list of numbers for different conditions from another file.
Currently I am doing this:
tracker_data = pd.DataFrame
tracker_data = tracker_data.from_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
tracker_data['GazePointXLeft'] = tracker_data['GazePointXLeft'].astype(np.float64)
but this of course just reads everything in one go (including the field name lines) - it would be great if I could nest the blocks somehow which allows me to easily access them via numeric indices...
Do you have any ideas how I could best do this?
You should use read_csv rather than from_csv*:
tracker_data = pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4)
If you want to join a list of DataFrames like this you could use concat:
trackers = (pd.read_csv(bhpath+i+'_wmet.tsv', sep='\t', header=4) for i in range(?))
df = pd.concat(trackers)
* which I think is deprecated.
I haven't quite got it working, but I think that's because of how I copy/pasted the data. Try this, let me know if it doesn't work.
Using some inspiration from this question
pat = "TimeStamp\tGazePointXLeft\tGazePointYLeft\tValidityLeft\tGazePointXRight\tGazePointYRight\tValidityRight\tGazePointX\tGazePointY\tEvent\n"
with open('rec.txt') as infile:
header, names, tail = infile.read().partition(pat)
names = names.split() # get rid of the tabs here
all_data = tail.split(pat)
res = [pd.read_csv(StringIO(x), sep='\t', names=names) for x in all_data]
We read in the whole file so this won't work for huge files, and then partition it based on the known line giving the column names. tail is just a string with the rest of the data so we can split that, again based on the names. There may be a better way than using StringIO, but this should work.
I'm note sure how you want to join the separate blocks together, but this leaves them as a list. You can concat from there however you desire.
For larger files you might want to write a generator to read until you hit the column names and write a new file until you hit them again. Then read those in separately using something like Andy's answer.
A separate question from how to work with the multiple blocks. Assuming you've got the list of Dataframes, which I've called res, you can use pandas' concat to join them together into a single DataFrame with a MultiIndex (also see the link Andy posted).
In [122]: df = pd.concat(res, axis=1, keys=['a', 'b', 'c']) # Use whatever makes sense for the keys
In [123]: df.xs('TimeStamp', level=1, axis=1)
Out[123]:
a b c
0 NaN NaN NaN
1 0.0 0.0 0.0
2 3.3 3.3 3.3
3 6.6 6.6 6.6
I ended up doing it iteratively. very very iteratively. Nothing else seems to work.
pat = 'TimeStamp GazePointXLeft GazePointYLeft ValidityLeft GazePointXRight GazePointYRight ValidityRight GazePointX GazePointY Event'
with open(bhpath+fileid+'_wmet.tsv') as infile:
eye_data = infile.read().split(pat)
eye_data = [trial.split('\r\n') for trial in eye_data] # split at '\r'
for idx, trial in enumerate(eye_data):
trial = [row.split('\t') for row in trial]
eye_data[idx] = trial