I have a given text string:
text = """Alice has two apples and bananas. Apples are very healty."""
and a dataframe:
word
apples
bananas
company
I would like to add a column "frequency" which will count occurrences of each word in column "word" in the text.
So the output should be as below:
word      frequency
apples    2
bananas   1
company   0
import pandas as pd

df = pd.DataFrame(['apples', 'bananas', 'company'], columns=['word'])
para = "Alice has two apples and bananas. Apples are very healty.".lower()

# count how many times each (lowercased) word appears in the lowercased text
df['frequency'] = df['word'].apply(lambda x: para.count(x.lower()))
word frequency
0 apples 2
1 bananas 1
2 company 0
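One caveat: str.count counts substring matches, so for example "apples" inside "pineapples" would also be counted. If whole-word matches are needed, a word-boundary regex is a safer option; a small sketch using the same sample data:

import re
import pandas as pd

df = pd.DataFrame(['apples', 'bananas', 'company'], columns=['word'])
para = "Alice has two apples and bananas. Apples are very healty.".lower()

# count only whole-word occurrences of each word in the lowercased text
df['frequency'] = df['word'].apply(
    lambda w: len(re.findall(r'\b' + re.escape(w.lower()) + r'\b', para)))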
Convert the text to lowercase and then use a regex to split it into a list of words. You might check out this page for learning purposes.
Loop through each row in the dataframe and use a lambda function to count the specific value in the previously created list.
# Import and create the data
import pandas as pd
import re
text = """Alice has two apples and bananas. Apples are very healty."""
df = pd.DataFrame(data={'word':['apples','bananas','company']})
# Solution
words_list = re.findall(r'\w+', text.lower())
df['Frequency'] = df['word'].apply(lambda x: words_list.count(x))
I am trying to re-arrange a file to match a BACS bank format. In order for it to work, the columns in the CSV below need to be of a specific length. I have figured out the abcdabcd column as it's a repeating pattern (as are a couple more in the file), but several columns have random numbers that I cannot easily target.
Is there a way for me to target either (ideally) a specific column based on its header, or alternatively target everything up to a comma, so I can butcher together something that could work?
In my example file below, you'll see three columns where the value changes. If targeting everything up to a specific character is the solution, I was thinking of using .ljust to fill the column up to the specified length (and then sorting it out manually in Excel).
Original File
a,b,c,d,e,f,g,h,i,j,k
12345,1234567,0,11,123456,12345678,1234567,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,12345678,abcdabcd,A ABCD
123456,1234567,0,11,123456,12345678,12345,abcdabcd,A ABCD
12345,1234567,0,11,123456,12345678,1234567,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456789,abcdabcd,A ABCD
Ideal output
a,b,c,d,e,f,g,h,i,j,k
123450,12345670,0,11,123456,12345678,123456700,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456780,abcdabcd,A ABCD
123456,12345670,0,11,123456,12345678,123450000,abcdabcd,A ABCD
123450,12345670,0,11,123456,12345678,123456700,abcdabcd,A ABCD
123456,12345678,0,11,123456,12345678,123456789,abcdabcd,A ABCD
Code
with open('file.txt', 'r') as file:
    filedata = file.read()

filedata = filedata.replace('12345', '12345'.ljust(6, '0'))

with open('file.txt', 'w') as file:
    file.write(filedata)
EDIT:
Something similar to this: Python - How to add zeros to an integer/string?, but while either targeting a specific column by its header, or at least the first one.
EDIT2:
I am using the below to rearrange my columns, could this be modified to work with string lengths?
import pandas as pd
## Read csv / tab-delimited in this example
df = pd.read_csv('test.txt', sep='\t')
## Reorder columns
df = df[['h','i','c','g','a','b','e','d','f','j','k']]
## Write csv / tab-delimited
df.to_csv('test', sep='\t')
Using pandas, you can convert the column to str and then use .str.pad. You can make a dict with the requested lengths:
lengths = {
    "a": 6,
    "b": 8,
    "c": 3,
    "d": 6,
    "e": 8,
}
and use it like this:
result = pd.DataFrame(
    {
        column_name: column.str.pad(
            lengths.get(column_name, 0), side="right", fillchar="0"
        )
        for column_name, column in df.astype(str).items()
    }
)
If the fillchar is different per column, you can get that from a dict as well:
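For example, a minimal sketch assuming a second dict called fillchars (not part of the original answer), falling back to "0" for columns that are not listed:

fillchars = {"a": "0", "b": "0", "h": " "}  # hypothetical per-column fill characters

result = pd.DataFrame(
    {
        column_name: column.str.pad(
            lengths.get(column_name, 0),
            side="right",
            fillchar=fillchars.get(column_name, "0"),
        )
        for column_name, column in df.astype(str).items()
    }
)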
>>> '{:0>5}'.format(4)
'00004'
>>> '{:0<5}'.format(4)
'40000'
>>> '{:0^5}'.format(4)
'00400'
Example:
#--------------DEFs------------------
def number_zero_right(number, len_number):
    # pad the value on the right with zeros, up to len_number characters
    return ('{:0<' + str(len_number) + '}').format(number)
#--------------MAIN------------------
a = 12345
b = 1234567
c = 0
d = 11
e = 123456
f = 12345678
g = 1234567
h = 'abcdabcd'
i = 'A'
j = 'ABCD'
print(a,b,c,d,e,f,g,h,i,j)
# > 12345 1234567 0 11 123456 12345678 1234567 abcdabcd A ABCD
a = number_zero_right(a,6)
b = number_zero_right(b,8)
c = number_zero_right(c,1)
d = number_zero_right(d,2)
e = number_zero_right(e,6)
f = number_zero_right(f,8)
g = number_zero_right(g,9)
print(a,b,c,d,e,f,g,h,i,j)
#> 123450 12345670 0 11 123456 12345678 123456700 abcdabcd A ABCD
Managed to get there, so thought I'd post in case someone has a similar issue. This only works on one column, but that's enough for me now.
# import pandas
import pandas as pd

# open file and convert data to str
data = pd.read_csv('Test.CSV', dtype=str)

# width of output string
width = 6

# fillchar
char = "_"

# change the contents of the column named ColumnID
data["ColumnID"] = data["ColumnID"].str.ljust(width, char)

# print output
print(data)
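If more than one column needs padding later, the same idea could be extended with a dict of target widths; a rough sketch (the column names and widths below are made up for illustration):

import pandas as pd

# read the file, keeping everything as strings
data = pd.read_csv('Test.CSV', dtype=str)

# hypothetical per-column target widths
widths = {"a": 6, "b": 8, "g": 9}

# left-justify each listed column and pad it with zeros to its target width
for col, target in widths.items():
    if col in data.columns:
        data[col] = data[col].str.ljust(target, "0")

print(data)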
I'm trying to update the strings in a .csv file that I am reading using Pandas. The .csv contains a column named 'About' which holds the rows of data I want to manipulate.
I've already used .str methods to update the values, but the changes are not reflected in the exported .csv file. Some of my code can be seen below.
import pandas as pd
df = pd.read_csv('data.csv')
df.About.str.lower() #About is the column I am trying to update
df.About.str.replace('[^a-zA-Z ]', '')
df.to_csv('newdata.csv')
You need to assign the output back to the column. It is also possible to chain both operations together, since they work with the same column About; and because the values are converted to lowercase first, the regex can be changed to replace everything that is not lowercase:
df = pd.read_csv('data.csv')
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
df.to_csv('newdata.csv', index=False)
Sample:
df = pd.DataFrame({'About':['AaSD14%', 'SDD Aa']})
df.About = df.About.str.lower().str.replace('[^a-z ]', '', regex=True)
print(df)
About
0 aasd
1 sdd aa
import pandas as pd
import numpy as np
columns = ['About']
data = ["ALPHA","OMEGA","ALpHOmGA"]
df = pd.DataFrame(data, columns=columns)
df.About = df.About.str.lower().str.replace('[^a-zA-Z ]', '', regex=True)
print(df)
OUTPUT:
      About
0     alpha
1     omega
2  alphomga
Example Dataframe:
>>> df
About
0 JOHN23
1 PINKO22
2 MERRY jen
3 Soojan San
4 Remo55
Solution: another way, using a compiled regex with flags
>>> import re
>>> regex_pat = re.compile(r'[^a-z\s]+', flags=re.IGNORECASE)  # assumed definition; the original did not show it
>>> df.About.str.lower().str.replace(regex_pat, '', regex=True)
0 john
1 pinko
2 merry jen
3 soojan san
4 remo
Name: About, dtype: object
Explanation:
[^a-z\s] matches any single character that is not a lowercase letter or whitespace (the re.IGNORECASE flag extends this to A-Z as well).
+ is a greedy quantifier: it matches the preceding character class one or more times, as many times as possible.
Using Canopy and Pandas, I have data frame a which is defined by:
import pandas as pd

a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]
text.txt is a single-column file that contains a list of strings made up of text, numbers and punctuation.
Assuming df looks like:
test
%hgh&12
abc123!!!
porkyfries
I want my results to be:
test
hgh12
abc123
porkyfries
Effort so far:
from string import punctuation  # import the punctuation list from Python itself

a = pd.read_csv('text.txt')
df = pd.DataFrame(a)
df.columns = ["test"]  # define the dataframe

for p in list(punctuation):
    df2 = df.test.str.replace(p, '')
    df2 = pd.DataFrame(df2)
df2
The code above basically just returns the same data set. I'd appreciate any leads.
Edit: The reason why I am using Pandas is that the data is huge, spanning about 1M rows, and future usage of the code will be applied to lists that go up to 30M rows.
Long story short, I need to clean the data in a very efficient manner for big data sets.
Using replace with the correct regex would be easier:
In [41]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
df = pd.DataFrame({'text':['test','%hgh&12','abc123!!!','porkyfries']})
df
Out[41]:
text
0 test
1 %hgh&12
2 abc123!!!
3 porkyfries
[4 rows x 1 columns]
Use a regex with the pattern [^\w\s], which means "not alphanumeric/whitespace":
In [49]:
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True)
df
Out[49]:
text
0 test
1 hgh12
2 abc123
3 porkyfries
[4 rows x 1 columns]
For removing punctuation from a text column in your dataframe:
In:
import re
import string
rem = string.punctuation
pattern = r"[{}]".format(rem)
pattern
Out:
'[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~]'
In:
df = pd.DataFrame({'text':['book...regh', 'book...', 'boo,', 'book. ', 'ball, ', 'ballnroll"', '"rope"', 'rick % ']})
df
Out:
text
0 book...regh
1 book...
2 boo,
3 book.
4 ball,
5 ballnroll"
6 "rope"
7 rick %
In:
df['text'] = df['text'].str.replace(pattern, '', regex=True)
df
You can replace the matched pattern with any character you like, e.g. replace(pattern, '$', regex=True).
Out:
text
0 bookregh
1 book
2 boo
3 book
4 ball
5 ballnroll
6 rope
7 rick
Translate is often considered the cleanest and fastest way to remove punctuation (source)
import string

# Python 3 version: build a translation table that deletes every punctuation character except the double quote
text = text.translate(str.maketrans('', '', string.punctuation.replace('"', '')))
You may find that it works better to remove punctuation in 'a' before loading it into pandas.
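A minimal sketch of that idea, assuming the single-column file from the question (the file name and column name are taken from the example above): strip the punctuation with translate first, and only then hand the cleaned text to pandas.

import string
from io import StringIO
import pandas as pd

# build a translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)

# clean the raw text before pandas parses it
with open('text.txt') as fh:
    cleaned = fh.read().translate(table)

df = pd.read_csv(StringIO(cleaned), header=None, names=['test'])
print(df)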