How to Combine Rows of Text in Pandas - python

I have a table with two columns, and I want to combine the text rows that share the same id.
import pandas as pd
df = pd.DataFrame({'id': [101453, 101465, 101478, 101453, 101465, 101465], 'text': ['this', 'is', 'a', 'test', 'string', 'one']})
I need a result like this:
df = pd.DataFrame({'id': [101453, 101465, 101478], 'text': ['this test', 'is string one', 'a']})

Use groupby with apply and str.join:
print(df.groupby('id')['text'].apply(' '.join).reset_index())
       id           text
0  101453      this test
1  101465  is string one
2  101478              a
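A minimal variant of the same idea, assuming the df from the question: GroupBy.agg accepts the same callable, and casting the text column to str first guards against non-string values such as NaN.
import pandas as pd

df = pd.DataFrame({'id': [101453, 101465, 101478, 101453, 101465, 101465],
                   'text': ['this', 'is', 'a', 'test', 'string', 'one']})
# agg(' '.join) does the same reduction as apply(' '.join);
# astype(str) avoids a TypeError if a cell is not a string
result = (df.assign(text=df['text'].astype(str))
            .groupby('id')['text']
            .agg(' '.join)
            .reset_index())
print(result)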

ids = sorted(set(df['id']))
set() removes duplicate elements, and sorted() turns the result back into an ordered list. Note that this only yields the unique ids: it does not combine the text, and assigning it back to df['id'] would fail here because the original column has more rows.
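If the goal is just the deduplicated ids, a minimal sketch using pandas' own drop_duplicates:
import pandas as pd

ids = pd.Series([101453, 101465, 101478, 101453, 101465, 101465])
unique_ids = ids.drop_duplicates().sort_values().tolist()
print(unique_ids)  # [101453, 101465, 101478]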

Related

Python how to filter a csv based on a column value and get the row count

I want to do data inspection and print the count of rows that match a certain value in one of the columns. Below is my code:
import numpy as np
import pandas as pd
data = pd.read_csv("census.csv")
The census.csv has a column "income" which has 3 values: '<=50K', '=50K' and '>50K',
and I want to print the number of rows whose income value is '<=50K'.
I was trying the following:
count = data['income']='<=50K'
That does not work though.
Sum the Boolean selection:
(data['income'].eq('<=50K')).sum()
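A hedged alternative, assuming census.csv as in the question: value_counts() tallies every category at once, so you can read off any of them.
import pandas as pd

data = pd.read_csv("census.csv")
counts = data['income'].value_counts()  # one row per distinct income value
print(counts)
print(counts.get('<=50K', 0))           # count for '<=50K', 0 if absent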
The key is to learn how to filter pandas rows.
Quick answer:
import pandas as pd
data = pd.read_csv("census.csv")
df2 = data[data['income']=='<=50K']
print(df2)
print(len(df2))
Slightly longer answer:
import pandas as pd
data = pd.read_csv("census.csv")
mask = data['income']=='<=50K'  # 'mask' avoids shadowing the built-in filter()
print(mask) # notice the boolean Series based on the filter criteria
df2 = data[mask] # next we use that boolean Series to filter the data
print(df2)
print(len(df2))

Pandas dataframe filter out rows with non-english text

I have a pandas df which has 6 columns, the last one is input_text. I want to remove from df all rows that have non-english text in that column. I would like to use langdetect's detect function.
Some template
from langdetect import detect
import pandas as pd
def filter_nonenglish(df):
    new_df = None # Do some magical operations here to create the filtered df
    return new_df
df = pd.read_csv('somecsv.csv')
df_new = filter_nonenglish(df)
print('New df is: ', df_new)
Note! It doesn't matter what the other 5 columns are.
Also note: using detect is as simple as:
t = 'I am very cool!'
print(detect(t))
Output is:
en
You can do it as below on your df and get all the rows with english text in the input_text column:
df_new = df[df.input_text.apply(detect).eq('en')]
So basically just apply the langdetect.detect function to the values in input_text column and get all those rows for which text is detected as "en".
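One caveat worth hedging: detect() raises a LangDetectException on strings it cannot classify (empty text, digits-only cells). A sketch that completes the question's template with a safe wrapper, assuming the langdetect package is installed (safe_detect is a hypothetical helper name):
import pandas as pd
from langdetect import detect, LangDetectException

def safe_detect(text):
    # treat undetectable cells (empty, numeric, ...) as non-English
    try:
        return detect(str(text))
    except LangDetectException:
        return 'unknown'

def filter_nonenglish(df):
    return df[df['input_text'].apply(safe_detect).eq('en')]

df = pd.read_csv('somecsv.csv')
print('New df is: ', filter_nonenglish(df))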

How to query a Pandas Dataframe based on column values

I have a dataframe:
ID  Name
1   A
2   B
3   C
I defined a list:
mylist =[A,C]
If I want to extract only the rows where Name is equal to A or C (namely, the values in mylist), I am trying to use the following code:
df_new = df[(df['Name'].isin(mylist))]
>>> df_new
As result, I get an empty table.
Any suggestion regarding why I get this empty result?
You can drop the redundant parentheses around the isin call (they are harmless but unnecessary):
df_new = df[df['Name'].isin(mylist)]
Found the solution: the problem was the format of the list, which caused the empty table.
The format of the list should be:
mylist = ['A','C']
instead of
mylist = [A,C]
(the latter refers to Python variables named A and C rather than the strings 'A' and 'C').
You could use .loc with a lambda, as it's more readable:
import pandas as pd
dataf = pd.DataFrame({'ID':[1,2,3],'Name':['A','B','C']})
names = ['A','C']
# select rows where column Name is in names
df = dataf.loc[lambda d: d['Name'].isin(names)]
print(df)
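A hedged alternative with the same result: DataFrame.query reads close to SQL, and the @ prefix references local Python variables.
import pandas as pd

dataf = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['A', 'B', 'C']})
names = ['A', 'C']
# rows whose Name appears in the names list
print(dataf.query("Name in @names"))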

Loop over multiple columns to find strings in a numerical column?

The following code finds any strings in column B. Is it possible to loop over multiple columns of a dataframe, outputting the cells containing strings for each column?
import pandas as pd
for i in df:
    print(df[df[i].str.contains(r'^[a-zA-Z]+$')])
Link to code above
https://stackoverflow.com/a/65410078/12801962
Here is how to loop through columns
import pandas as pd
colList = ['ColB', 'Some_other', 'ColC']
for col in colList:
    subdf = df[df[col].str.contains(r'^[a-zA-Z]+$')]
    # do something with the sub-DataFrame
or do it in one long test and get all the problem rows in one dataframe
import pandas as pd
subdf = df[((df['ColB'].str.contains(r'^[a-zA-Z]+$')) |
(df['Some_other'].str.contains(r'^[a-zA-Z]+$')) |
(df['ColC'].str.contains(r'^[a-zA-Z]+$')))]
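A more compact equivalent of the chained ORs above, sketched under the assumption that every listed column should get the same test (the small df here is made up for illustration; astype(str) keeps numeric cells from raising):
import pandas as pd

df = pd.DataFrame({'ColB': ['abc', 123], 'Some_other': [1, 'xy'], 'ColC': [5, 6]})
colList = ['ColB', 'Some_other', 'ColC']
# one boolean column per tested column, then keep rows where any matches
mask = (df[colList].astype(str)
        .apply(lambda s: s.str.contains(r'^[a-zA-Z]+$'))
        .any(axis=1))
print(df[mask])  # rows with at least one letters-only cell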
Not sure if it's what you are intending to do
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['ColA'] = ['ABC', 'DEF', 12345, 23456]
df['ColB'] = ['abc', 12345, 'def', 23456]
all_trues = pd.Series(np.ones(df.shape[0], dtype=bool))  # np.bool was removed in recent NumPy
for col in df:
    # na=False treats non-string cells as non-matching instead of NaN
    all_trues &= df[col].str.contains(r'^[a-zA-Z]+$', na=False)
df[all_trues]
Which will give the result:
ColA ColB
0 ABC abc
Try:
for k, s in df.astype(str).items():
    print(s.loc[s.str.contains(r'^[a-zA-Z]+$')])
Or, for the values only (no index nor column information):
for k, s in df.astype(str).items():
    print(s.loc[s.str.contains(r'^[a-zA-Z]+$')].values)
Note, both of the above only work because you just want to print the matching values in the columns, not return a new structure with filtered entries.
If you tried to make a new DataFrame with cells filtered by the condition, that would lead to ragged arrays, which are not implemented (you could replace non-matching cells with a marker of your choice, but you cannot cut them away). Another possibility is to select rows where any or all of the cells meet the condition you are testing for; that way the result is a homogeneous array, not a ragged one (see the sketch below).
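A minimal sketch of that row-wise idea, reusing the example frame from the earlier answer: build the boolean matrix once, then reduce it with any() or all() across the columns.
import pandas as pd

df = pd.DataFrame({'ColA': ['ABC', 'DEF', 12345, 23456],
                   'ColB': ['abc', 12345, 'def', 23456]})
matches = df.astype(str).apply(lambda s: s.str.contains(r'^[a-zA-Z]+$'))
print(df[matches.any(axis=1)])  # rows with at least one matching cell
print(df[matches.all(axis=1)])  # rows where every cell matches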
Yet another option would be to return a list of Series, each representing a column, or a dict of colname: Series:
{k: s.loc[s.str.contains(r'^[a-zA-Z]+$')] for k, s in df.astype(str).items()}

How to select all columns that start with "durations" or "shape"?

How do I select all columns whose header names start with "durations" or "shape" (instead of defining a long list of column names)? I need to select these columns and substitute blank fields with 0.
column_names = ['durations.blockMinutes_x',
'durations.scheduledBlockMinutes_y']
data[column_names] = data[column_names].fillna(0)
You could use the str.startswith method on the columns:
df = data[data.columns[data.columns.str.startswith('durations') | data.columns.str.startswith('shape')]]
df = df.fillna(0)
Or you could use the contains method:
df = data.iloc[:, data.columns.str.contains('^(?:durations|shape)')]
df = df.fillna(0)
I would use the select method (note that DataFrame.select was deprecated in pandas 0.21 and removed in 1.0):
df.select(lambda c: c.startswith('durations') or c.startswith('shape'), axis=1)
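Since select is gone from current pandas, a sketch of the equivalent with .loc and a callable indexer (the column names here are made up for illustration):
import pandas as pd
import numpy as np

data = pd.DataFrame({'durations.blockMinutes_x': [1.0, np.nan],
                     'shape.area': [np.nan, 2.0],
                     'carrier': ['AA', 'BB']})
# the callable receives the frame and returns a boolean mask over its columns
df = data.loc[:, lambda d: d.columns.str.startswith('durations')
                           | d.columns.str.startswith('shape')]
print(df.fillna(0))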
Use my_dataframe.columns.values.tolist() to get the column names (based on Get list from pandas DataFrame column headers):
column_names = [x for x in data.columns.values.tolist() if x.startswith("durations") or x.startswith("shape")]
A simple and easy way:
data.filter(regex='durations|shape').fillna(0)
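One hedged refinement: 'durations|shape' also matches those words in the middle of a name, so anchoring the pattern and writing the filled values back might look like this (the column names are made up for illustration):
import pandas as pd
import numpy as np

data = pd.DataFrame({'durations.blockMinutes_x': [1.0, np.nan],
                     'shape.len': [np.nan, 2.0],
                     'reshape_flag': [np.nan, 3.0]})
cols = data.filter(regex=r'^(?:durations|shape)').columns
data[cols] = data[cols].fillna(0)
print(data)  # reshape_flag keeps its NaN; the anchored columns are filled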
