string (file1.txt) search from file2.txt - python

file1.txt contains usernames, i.e.
tony
peter
john
...
file2.txt contains user details, one line per user, i.e.
alice 20160102 1101 abc
john 20120212 1110 zjc9
mary 20140405 0100 few3
peter 20140405 0001 io90
tango 19090114 0011 n4-8
tony 20150405 1001 ewdf
zoe 20000211 0111 jn09
...
I want to get a shortlist of user details from file2.txt based on the users provided in file1.txt, i.e.
john 20120212 1110 zjc9
peter 20140405 0001 io90
tony 20150405 1001 ewdf
How to use python to do this?

You can use .split(' '), assuming that there will always be a space between the name and the other information in file2.txt.
Here's an example:
UserList = []
with open("file1.txt", "r") as fuser:
    UserLine = fuser.readline()
    while UserLine != '':
        UserList.append(UserLine.split("\n")[0])  # Strip the trailing newline from the user name.
        UserLine = fuser.readline()

InfoUserList = []
InfoList = []
with open("file2.txt", "r") as finfo:
    InfoLine = finfo.readline()
    while InfoLine != '':
        InfoList.append(InfoLine)
        line1 = InfoLine.split(' ')
        InfoUserList.append(line1[0])  # Keep just the user name for the later comparison.
        InfoLine = finfo.readline()

for user in UserList:
    for i in range(len(InfoUserList)):
        if user == InfoUserList[i]:
            print(InfoList[i])
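For comparison, the same filtering can be written more compactly with a set lookup; this is a sketch using in-memory strings standing in for the two files:

```python
# Demo data standing in for file1.txt / file2.txt (contents taken from the question).
users = "tony\npeter\njohn\n"
details = (
    "alice 20160102 1101 abc\n"
    "john 20120212 1110 zjc9\n"
    "peter 20140405 0001 io90\n"
    "tony 20150405 1001 ewdf\n"
)

# Build a set of wanted usernames, then keep only the detail lines
# whose first word is in that set.
wanted = {line.strip() for line in users.splitlines() if line.strip()}
shortlist = [line for line in details.splitlines()
             if line.split(" ", 1)[0] in wanted]
for line in shortlist:
    print(line)
```

Set membership is O(1), so this avoids the nested loop over both lists.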

You can use pandas:
import pandas as pd
file1 = pd.read_csv('file1.txt', sep=' ', header=None)
file2 = pd.read_csv('file2.txt', sep=' ', header=None)
shortlist = file2.loc[file2[0].isin(file1.values.T[0])]
it will give you the following result:
0 1 2 3
1 john 20120212 1110 zjc9
3 peter 20140405 1 io90
5 tony 20150405 1001 ewdf
The result is a DataFrame; to convert it back to an array, just use shortlist.values.
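One caveat with the output above: read_csv infers integer dtype for numeric-looking fields, which is why peter's `0001` comes back as `1`. If the leading zeros matter, passing `dtype=str` keeps every field as text; a minimal sketch, using an in-memory buffer in place of file2.txt:

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for file2.txt
file2 = StringIO("john 20120212 1110 zjc9\npeter 20140405 0001 io90\n")
df = pd.read_csv(file2, sep=" ", header=None, dtype=str)
print(df.loc[1, 2])  # "0001" survives as a string
```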

import pandas as pd
df1 = pd.read_csv('df1.txt', header=None)
df2 = pd.read_csv('df2.txt', header=None)
df1[0] = df1[0].str.strip() # remove the whitespace around the field
df2 = df2[0].str[0:-2].str.split(' ').apply(pd.Series) # split the word and remove whitespace
df = df1.merge(df2)
Out[26]:
0 1 2 3
0 tony 20150405 1001 ewdf
1 peter 20140405 0001 io90
2 john 20120212 1110 zjc9

Related

Create new column using str.contains and based on if-else condition

I have a list of names 'pattern' that I wish to match with strings in column 'url_text'. If there is a match (i.e. True), the name should be written to a new column 'pol_names_block'; if False, leave the row empty.
pattern = '|'.join(pol_names_list)
print(pattern)
'Jon Kyl|Doug Jones|Tim Kaine|Lindsey Graham|Cory Booker|Kamala Harris|Orrin Hatch|Bernie Sanders|Thom Tillis|Jerry Moran|Shelly Moore Capito|Maggie Hassan|Tom Carper|Martin Heinrich|Steve Daines|Pat Toomey|Todd Young|Bill Nelson|John Barrasso|Chris Murphy|Mike Rounds|Mike Crapo|John Thune|John. McCain|Susan Collins|Patty Murray|Dianne Feinstein|Claire McCaskill|Lamar Alexander|Jack Reed|Chuck Grassley|Catherine Masto|Pat Roberts|Ben Cardin|Dean Heller|Ron Wyden|Dick Durbin|Jeanne Shaheen|Tammy Duckworth|Sheldon Whitehouse|Tom Cotton|Sherrod Brown|Bob Corker|Tom Udall|Mitch McConnell|James Lankford|Ted Cruz|Mike Enzi|Gary Peters|Jeff Flake|Johnny Isakson|Jim Inhofe|Lindsey Graham|Marco Rubio|Angus King|Kirsten Gillibrand|Bob Casey|Chris Van Hollen|Thad Cochran|Richard Burr|Rob Portman|Jon Tester|Bob Menendez|John Boozman|Mazie Hirono|Joe Manchin|Deb Fischer|Michael Bennet|Debbie Stabenow|Ben Sasse|Brian Schatz|Jim Risch|Mike Lee|Elizabeth Warren|Richard Blumenthal|David Perdue|Al Franken|Bill Cassidy|Cory Gardner|Lisa Murkowski|Maria Cantwell|Tammy Baldwin|Joe Donnelly|Roger Wicker|Amy Klobuchar|Joel Heitkamp|Joni Ernst|Chris Coons|Mark Warner|John Cornyn|Ron Johnson|Patrick Leahy|Chuck Schumer|John Kennedy|Jeff Merkley|Roy Blunt|Richard Shelby|John Hoeven|Rand Paul|Dan Sullivan|Tim Scott|Ed Markey'
I am using the following code df['url_text'].str.contains(pattern) which results in True in case a name in 'pattern' is present in a row in column 'url_text' and False otherwise. With that I have tried the following code:
df['pol_name_block'] = df.apply(
    lambda row: pol_names_list if df['url_text'].str.contains(pattern) in row['url_text'] else ' ',
    axis=1
)
I get the error:
TypeError: 'in <string>' requires string as left operand, not Series
From this toy Dataframe :
>>> import pandas as pd
>>> from io import StringIO
>>> df = pd.read_csv(StringIO("""
... id,url_text
... 1,Tim Kaine
... 2,Tim Kain
... 3,Tim
... 4,Lindsey Graham.com
... """), sep=',')
>>> df
id url_text
0 1 Tim Kaine
1 2 Tim Kain
2 3 Tim
3 4 Lindsey Graham.com
From pol_names_list, we build the pattern by formatting it like so:
patterns = '(%s)' % '|'.join(pol_names_list)
Then, we can use the extract method to fill the column pol_name_block and get the expected result:
df['pol_name_block'] = df['url_text'].str.extract(patterns)
Output :
id url_text pol_name_block
0 1 Tim Kaine Tim Kaine
1 2 Tim Kain NaN
2 3 Tim NaN
3 4 Lindsey Graham.com Lindsey Graham
Change your pattern to enclose it in a capture group () and use extract:
pattern = fr"({'|'.join(pol_names_list)})"
df['pol_name_block'] = df['url_text'].str.extract(pattern)
print(df)
# Output (with the sample from tlentali's answer)
id url_text pol_name_block
0 1 Tim Kaine Tim Kaine
1 2 Tim Kain NaN
2 3 Tim NaN
3 4 Lindsey Graham.com Lindsey Graham
Important: extract returns only the first match, even if there are multiple matches. If you want to extract all of them you have to use findall or extractall (only the output format changes):
# New sample, same pattern
>>> df
id url_text
0 1 Tim Kaine and Lindsey Graham
1 2 Tim Kain
2 3 Tim
3 4 Lindsey Graham
# findall
>>> df['url_text'].str.findall(pattern)
0 [Tim Kaine, Lindsey Graham]
1 []
2 []
3 [Lindsey Graham]
Name: url_text, dtype: object
# extractall
>>> df['url_text'].str.extractall(pattern)
0
match
0 0 Tim Kaine
1 Lindsey Graham
3 0 Lindsey Graham

How to add a new row in csv using Python pandas

Hello, this is my csv data:
Age Name
0 22 George
1 33 lucas
2 22 Nick
3 12 Leo
4 32 Adriano
5 53 Bram
6 11 David
7 32 Andrei
8 22 Sergio
I want to use an if/else statement: for example, if George is an adult, create a new row and insert +.
I mean:
Age Name Adul
22 George +
What is the best way?
This is the code I am using to read the data from the csv:
import pandas as pd
produtos = pd.read_csv('User.csv', nrows=9)
print(produtos)
for i, produto in produtos.iterrows():
    print(i, produto['Age'], produto['Name'])
IIUC, you want to create a new column (not row) called "Adul". You can do this with numpy.where:
import numpy as np
produtos["Adul"] = np.where(produtos["Age"].ge(18), "+", np.nan)
Edit:
To only do this for a specific name, you could use:
name = input("Name")
if name in produtos["Name"].tolist():
    # .item() assumes the name appears exactly once; comparing a bare Series
    # inside `if` would raise "The truth value of a Series is ambiguous".
    if produtos.loc[produtos["Name"] == name, "Age"].item() >= 18:
        produtos.loc[produtos["Name"] == name, "Adul"] = "+"
You can do this (np.where needs numpy imported):
import numpy as np

produtos["Adul"] = np.where(produtos["Age"] >= 18, "+", np.nan)
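A minimal end-to-end sketch of the np.where approach, with a made-up two-row frame standing in for User.csv:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV from the question.
produtos = pd.DataFrame({"Age": [22, 12], "Name": ["George", "Leo"]})

# "+" where Age >= 18, NaN otherwise.
produtos["Adul"] = np.where(produtos["Age"] >= 18, "+", np.nan)
print(produtos)
```

George (22) gets "+"; Leo (12) does not.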

Reading file with delimiter using pandas

I have data in a file and I don't know whether it is delimited by spaces or tabs.
Data In:
id Name year Age Score
123456 ALEX BROWNNIS VND 0 19 115
123457 MARIA BROWNNIS VND 0 57 170
123458 jORDAN BROWNNIS VND 0 27 191
I read the data with read_csv using the tab delimiter:
df = pd.read_csv('data.txt', sep='\t')
out:
id Name year Age Score
0 123456 ALEX BROWNNIS VND ... 0 19 115
1 123457 MARIA BROWNNIS VND ... 0 57 170
2 123458 jORDAN BROWNNIS VND ... 0 27 191
There is a lot of white space between the columns. Am I using the delimiter correctly? When I try to access a column by name, I get a KeyError, so I suspect the fault is the use of \t.
What are the possible ways to fix this problem?
Since the Name column can contain a variable number of words, you need to read the data as a regular file and then join the middle words back into the name:
id = []
Name = []
year = []
Age = []
Score = []
with open('data.txt') as f:
    text = f.read()
lines = text.split('\n')
for line in lines:
    if len(line) < 3:
        continue
    words = line.split()
    id.append(words[0])
    Name.append(' '.join(words[1:-3]))
    year.append(words[-3])
    Age.append(words[-2])
    Score.append(words[-1])
df = pd.DataFrame.from_dict({'id': id, 'Name': Name,
                             'year': year, 'Age': Age, 'Score': Score})
Edit: you'd posted the overall data, so I'll change my answer to fit it.
You can use the skipinitialspace parameter, as in the following example (note that delimiter is just an alias for sep, so only one of them should be passed):
df2 = pd.read_csv('data.txt', sep='\t', encoding="utf-8", skipinitialspace=True)
Pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Problem Solved:
df = pd.read_csv('data.txt', sep='\t',engine="python")
I added this line of code to remove the whitespace around the column names, and it works:
df.columns = df.columns.str.strip()
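A minimal reproduction of that fix, using an in-memory buffer in place of data.txt and assuming the stray whitespace sits in the header row (which is what produces the KeyError):

```python
import pandas as pd
from io import StringIO

# Hypothetical tab-separated data; note the trailing spaces in the header names.
raw = "id \tName \tyear\n123456\tALEX BROWNNIS VND\t0\n"
df = pd.read_csv(StringIO(raw), sep="\t")

df.columns = df.columns.str.strip()  # "id " -> "id", so df["id"] no longer raises KeyError
print(df["id"].iloc[0])
```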

Is there a way in python pandas to do "Text to Columns" by location (not by a delimiter) like in excel?

I'm using Vote History data from the Secretary of State, however the .txt file they gave me is 7 million rows, where each row is a string with 27 characters. The first 3 characters are a code for the county. The next 8 characters are the registration ID, the next 8 characters are the date voted, etc. I can't do text to columns in excel because the file is too big. Is there a way to separate this file into columns in python pandas?
Example
Currently I have:
0010000413707312012026R
0010000413708212012027R
0010000413711062012029
0010004535307312012026D
I want to have columns:
001 00004137 07312012 026 R
001 00004137 08212012 027 R
001 00004137 11062012 029
001 00045353 07312012 026 D
Where each space separates a new column. Any suggestions? Thanks.
Simplest I can make it:
import pandas as pd

sample_lines = ['0010000413707312012026R', '0010000413708212012027R', '0010000413711062012029', '0010004535307312012026D']
COLUMN_NAMES = ['A', 'B', 'C', 'D', 'E']
df = pd.DataFrame(columns=COLUMN_NAMES)
for line in sample_lines:
    row = [line[0:3], line[3:11], line[11:19], line[19:22], line[22:23]]
    df.loc[len(df)] = row
print(df)
Outputs:
A B C D E
0 001 00004137 07312012 026 R
1 001 00004137 08212012 027 R
2 001 00004137 11062012 029
3 001 00045353 07312012 026 D
Try this. I think you don't have an issue reading from the txt file; a simplified case would look like this:
a = ['0010000413707312012026R', '0010000413708212012027R', '0010000413711062012029', '0010004535307312012026D']
area = []
date = []
e1 = []
e2 = []
e3 = []
# 001 00004137 07312012 026 R
for i in range(0, len(a)):
    area.append(a[i][0:3])
    date.append(a[i][3:11])
    e1.append(a[i][11:19])
    e2.append(a[i][19:22])
    e3.append(a[i][22:23])
all_list = pd.DataFrame(
    {'area': area,
     'date': date,
     'e1': e1,
     'e2': e2,
     'e3': e3
     })
print(all_list)
# save as CSV file
all_list.to_csv('all.csv')
Since the file is too big, it's better to read and write line by line instead of reading the entire file into memory. Opening the output file once (rather than re-opening it in append mode on every iteration) also avoids needless overhead:
with open('temp.csv') as f, open('modified.csv', 'w') as f2:
    for line in f:
        code = line[0:3]
        registration = line[3:11]
        date = line[11:19]
        second_code = line[19:22]
        letter = line[22:]  # keeps the trailing newline
        f2.write(' '.join([code, registration, date, second_code, letter]))
You can also read the content from the txt file and use extract to split it into DataFrame columns:
df = pd.read_csv('temp.csv', header=None)
df
# 0
# 0 0010000413707312012026R
# 1 0010000413708212012027R
# 2 0010000413711062012029
# 3 0010004535307312012026D
df = df[df.columns[0]].str.extract('(.{3})(.{8})(.{8})(.{3})(.*)')
df
# 0 1 2 3 4
# 0 001 00004137 07312012 026 R
# 1 001 00004137 08212012 027 R
# 2 001 00004137 11062012 029
# 3 001 00045353 07312012 026 D
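pandas also ships a reader built for exactly this fixed-width task, read_fwf, which takes explicit column boundaries via colspecs; a sketch over two of the sample rows (dtype=str preserves the leading zeros):

```python
import pandas as pd
from io import StringIO

# Two sample rows from the question, in place of the 7-million-row file.
raw = "0010000413707312012026R\n0010004535307312012026D\n"

# (start, end) character positions for county, registration ID, date, code, letter.
colspecs = [(0, 3), (3, 11), (11, 19), (19, 22), (22, 23)]
df = pd.read_fwf(StringIO(raw), colspecs=colspecs, header=None, dtype=str)
print(df)
```

For a 7-million-row file this is far faster than appending rows one at a time with df.loc.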

Read structured file in python

I have a file with data similar to this:
[START]
Name = Peter
Sex = Male
Age = 34
Income[2020] = 40000
Income[2019] = 38500
[END]
[START]
Name = Maria
Sex = Female
Age = 28
Income[2020] = 43000
Income[2019] = 42500
Income[2018] = 40000
[END]
[START]
Name = Jane
Sex = Female
Age = 41
Income[2020] = 60500
Income[2019] = 57500
Income[2018] = 54000
[END]
I want to read this data into a pandas dataframe so that at the end it is similar to this
Name Sex Age Income[2020] Income[2019] Income[2018]
Peter Male 34 40000 38500 NaN
Maria Female 28 43000 42500 40000
Jane Female 41 60500 57500 54000
So far, I wasn't able to figure out if this is a standard data file format (it has some similarities to JSON but is still very different).
Is there an elegant and fast way to read this data to a dataframe?
Elegant I do not know, but easy way, yes. Python is very good at parsing simple formatted text.
Here, [START] starts a new record, [END] ends it, and inside a record, you have key = value lines. You can easily build a custom parser to generate a list of records to feed into a pandas DataFrame:
inblock = False
fieldnames = []
data = []
for line in open(filename):
    if inblock:
        if line.strip() == '[END]':
            inblock = False
        elif '=' in line:
            k, v = (i.strip() for i in line.split('=', 1))
            record[k] = v
            if k not in fieldnames:
                fieldnames.append(k)
    else:
        if line.strip() == '[START]':
            inblock = True
            record = {}
            data.append(record)
df = pd.DataFrame(data, columns=fieldnames)
df is as expected:
Name Sex Age Income[2020] Income[2019] Income[2018]
0 Peter Male 34 40000 38500 NaN
1 Maria Female 28 43000 42500 40000
2 Jane Female 41 60500 57500 54000
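An equivalent sketch that splits the text into [START]/[END] blocks instead of tracking state line by line (same key = value format assumed, with a shortened inline sample):

```python
import pandas as pd

# Shortened inline sample in place of reading the file.
text = """[START]
Name = Peter
Age = 34
[END]
[START]
Name = Maria
Age = 28
[END]
"""

records = []
for block in text.split("[START]")[1:]:
    body = block.split("[END]")[0]          # text between [START] and [END]
    rec = {}
    for line in body.strip().splitlines():
        k, v = (p.strip() for p in line.split("=", 1))
        rec[k] = v
    records.append(rec)

df = pd.DataFrame(records)
print(df)
```

pd.DataFrame(records) aligns the keys into columns and fills missing keys with NaN, just as in the line-by-line version.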
