How to read data without specific symbol in python? [closed] - python

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
My dataset looks like the following. I am trying to read the numbers in the "per" column without reading the "%" symbol. Being a beginner in Python, I was wondering if this can be done in Python. Also, an explanation would be great!
State Year per
A 1990 6.10%
A 1989 4.50%
B 1990 3.4%
B 1989 1.25%
Thanks in advance,

In case it is a csv file, this should help (or there might be another way to get a dataframe):
import pandas as pd
data = pd.read_csv("somefile.csv")
data["per"] = pd.to_numeric(data["per"].str.replace("%", "", regex=False))

This works on any plain-text file and needs no modules. It takes the last word of each row, then splits on the percent sign to drop it.
def readFile(filename):
    percents = []
    with open(filename, "r") as f:
        next(f, None)  # skip the header line ("State Year per")
        for row in f:
            words = row.split()  # split the line into words
            if not words:  # skip blank lines
                continue
            percent = words[-1].split("%")[0]  # last word, with the percent sign removed
            percents.append(percent)  # for numbers instead of strings: percents.append(float(percent))
    return percents
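For illustration, here is a minimal sketch of the same idea applied to the sample rows from the question, held in a string so that it runs without a file. The names `SAMPLE` and `parse_rows` are made up for this example, and it assumes every data row has exactly three whitespace-separated fields.

```python
# Sample rows copied from the question: whitespace-separated, with a header line.
SAMPLE = """State Year per
A 1990 6.10%
A 1989 4.50%
B 1990 3.4%
B 1989 1.25%
"""


def parse_rows(text):
    """Return (state, year, per) tuples, with per converted to a float."""
    rows = []
    for line in text.splitlines()[1:]:  # [1:] skips the header line
        if not line.strip():  # ignore blank lines
            continue
        state, year, per = line.split()  # assumes exactly three fields per row
        rows.append((state, int(year), float(per.rstrip("%"))))
    return rows


rows = parse_rows(SAMPLE)
print(rows)
```

`str.rstrip("%")` only removes trailing percent signs, so a value without one passes through unchanged.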

Related

Extraction of data from a delimited string in Python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I have a string variable which has some data as shown below:
'From\tTo\nA0A3Q8IUE6\t13392634\nA4I9M8\t5072523\nE9BQL4\t13392634\nQ4Q3E9\t5654813\nE9B4M7\t13452251\nA0A088S7I8\t22574266\nA4HAG8\t5414882\nA0A3P3Z499\t5414882'
The data basically has two columns 'From' and 'To'. How do I extract the entries from the 'To' column in python?
You can use split, and then extract the data from the odd indexes, like so:
data = 'From\tTo\nA0A3Q8IUE6\t13392634\nA4I9M8\t5072523\nE9BQL4\t13392634\nQ4Q3E9\t5654813\nE9B4M7\t13452251\nA0A088S7I8\t22574266\nA4HAG8\t5414882\nA0A3P3Z499\t5414882'
print(data)
data = data.split()
to = [data[i] for i in range(3, len(data), 2)]
print(to)
In Python you can split a string at specific characters; in your case \n delimits the rows and \t delimits the columns.
Something like this should work:
string='From\tTo\nA0A3Q8IUE6\t13392634\nA4I9M8\t5072523\nE9BQL4\t13392634\nQ4Q3E9\t5654813\nE9B4M7\t13452251\nA0A088S7I8\t22574266\nA4HAG8\t5414882\nA0A3P3Z499\t5414882'
f=[]
t=[]
for row in string.split("\n")[1:]:  # [1:] skips the header row
    fr,to=row.split("\t")
    f.append(fr)
    t.append(to)
print(f,t)
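Since the string is tab-separated text with a header row, the standard library's csv module is another option. This is only a sketch: it uses the same string as the question, and `to_values` is a name chosen for this example.

```python
import csv
import io

data = 'From\tTo\nA0A3Q8IUE6\t13392634\nA4I9M8\t5072523\nE9BQL4\t13392634\nQ4Q3E9\t5654813\nE9B4M7\t13452251\nA0A088S7I8\t22574266\nA4HAG8\t5414882\nA0A3P3Z499\t5414882'

# DictReader takes the first row ("From", "To") as the column names.
reader = csv.DictReader(io.StringIO(data), delimiter="\t")
to_values = [row["To"] for row in reader]
print(to_values)
```

Looking the column up by name keeps the code working if the column order changes.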

String Manipulation in Dataframe [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
Hey guys, I have a quick question regarding string manipulation in a pandas dataframe.
Suppose we have 2 columns that look like this:
Question:
How can I keep only the string part of each cell and delete the [' ']?
Thank you so much for your help! I am looking forward to hearing your brilliant idea!
Use a regex to replace every character that is not a letter, digit, or underscore:
print(df)
State City
0 ['AK'] ['Yakutat']
1 ['AK'] ['Apache']
Solution
df=df.replace(regex=r'[^\w]',value='')
print(df)
State City
0 AK Yakutat
1 AK Apache
It depends on whether the values in your cells are strings with brackets, "['AK']", or actual lists, ['AK'].
If they are strings with brackets on either side, we can strip the bracket and quote characters from both sides:
df["State"] = df["State"].str.strip("[]'")
df["City"] = df["City"].str.strip("[]'")
If they are lists, you can join their elements with a comma to turn them into a string:
df["State"] = df["State"].str.join(", ")
df["City"] = df["City"].str.join(", ")
If the cells are strings such as "['AK']", you can slice off the first two and last two characters:
df['City']=df['City'].apply(lambda x: x[2:-2])
df['State']=df['State'].apply(lambda x: x[2:-2])
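If it is unclear whether the cells hold strings or real lists, one helper can cover both cases. The sketch below is plain Python so that it runs on its own; `unwrap` is a name invented for this example, and it assumes string cells look like Python list literals such as "['AK']".

```python
import ast


def unwrap(cell):
    """Return 'AK' for either the string "['AK']" or the list ['AK']."""
    value = cell
    if isinstance(cell, str):
        try:
            value = ast.literal_eval(cell)  # "['AK']" -> ['AK']
        except (ValueError, SyntaxError):
            return cell  # not a Python literal, so leave it unchanged
    if isinstance(value, (list, tuple)):
        return ", ".join(str(item) for item in value)
    return cell  # anything else is returned as it came in


print(unwrap("['AK']"))
print(unwrap(["Yakutat"]))
```

With pandas it could then be applied per column, for example `df["State"] = df["State"].apply(unwrap)`.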

Competition Problem: Finding numbers containing "2020" from 1 to N which is a multiple of 2020 faster than O(N/2020) [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I got stuck on the following competition problem:
Given an integer N, find the total count of positive integers from 1 to N
that are multiples of 2020 and contain the string "2020".
For example: 2020, 20200, 1012020 are what we want to find.
I've got an O(N/2020) solution: starting from 2020, it checks whether each multiple of 2020 contains "2020", stepping by 2020 each time (it checks 2020, then 4040, then 6060, and so on).
In other words,
I started checking from 2020, is "2020" in it? Yes, add one to the total count.
Then check 4040, is "2020" in it? No, keep searching.
Check 6060, is "2020" in it? No, keep searching.
...
That got me a TLE, that is, a Time Limit Exceeded error in the competition.
So I would like to know if anyone has a better, faster solution?
If you get stuck, here's a solution. Note that it checks every multiple of 2020 up to N, so it is still O(N/2020):
baseInt = 2020  # the int to find within the other int, also used as the step
randomInt = 1012020  # the int you're going to count up to (N)

def checkBaseInt(baseInt, theIteratorInt):
    theNumString = str(theIteratorInt)  # convert to strings to search for the substring
    baseIntString = str(baseInt)
    if baseIntString in theNumString:  # baseInt appears anywhere in the number
        return theNumString  # return the number that includes baseInt
    return None

for i in range(baseInt, randomInt + 1, baseInt):  # multiples of baseInt up to and including randomInt
    found = checkBaseInt(baseInt, i)
    if found is not None:
        print(found)  # print each multiple that contains 2020, or your baseInt
text = "20"
if "2020" in text:
    print(text)
You can use the in operator, as above, to test whether a string contains "2020".
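Neither answer above is faster than checking each multiple. One standard technique for this kind of counting is a digit DP: walk the decimal digits of N from the left while tracking the remainder modulo 2020 and how much of "2020" has been matched so far. The sketch below is not taken from the thread; `count_multiples_containing` and its parameters are names made up for this example, and it assumes the pattern does not start with "0" so that leading zeros cannot start a match.

```python
from functools import lru_cache


def count_multiples_containing(n, pattern="2020", modulus=2020):
    """Count integers in 1..n divisible by `modulus` whose digits contain `pattern`."""
    if n < 1:
        return 0
    digits = str(n)
    m = len(pattern)

    def step(state, ch):
        # state = number of pattern characters matched so far; m means "found".
        if state == m:
            return m
        s = pattern[:state] + ch
        for k in range(min(m, len(s)), 0, -1):
            if s.endswith(pattern[:k]):
                return k
        return 0

    # Precomputed matching automaton: trans[state][digit] -> next state.
    trans = [[step(s, str(d)) for d in range(10)] for s in range(m + 1)]

    @lru_cache(maxsize=None)
    def go(pos, rem, state, tight):
        # pos: index of the next digit; rem: value so far modulo `modulus`;
        # tight: True while the prefix built so far equals the prefix of n.
        if pos == len(digits):
            return 1 if rem == 0 and state == m else 0
        limit = int(digits[pos]) if tight else 9
        total = 0
        for d in range(limit + 1):
            total += go(pos + 1, (rem * 10 + d) % modulus,
                        trans[state][d], tight and d == limit)
        return total

    return go(0, 0, 0, True)


print(count_multiples_containing(20200))
```

The number of states is roughly (digits of N) x 2020 x 5 x 2, each trying at most 10 digits, so the work grows with the number of digits of N rather than with N itself.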

Make an edge list from data [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have a huge dataset, and this picture shows a sample of it:
I want to make an edge list. If two rows have the same values in column1, column2, column3, column4 and column6, there is a relationship (edge) between their column 5 values. The result should look like the picture below:
Is there a way to do this? Can PostgreSQL, Python, or R do that for me?
If I understand correctly, you want a self join:
select t1.col5 as vertex_1, t2.col5 as vertex_2
from t t1
join t t2
  on t1.col1 = t2.col1 and t1.col2 = t2.col2 and t1.col3 = t2.col3
 and t1.col4 = t2.col4 and t1.col6 = t2.col6
 and t1.col5 <> t2.col5;
I cannot tell if you want undirected or directed edges. If undirected, then change the last condition to: t1.col5 < t2.col5.
What you want is unique rows in your result list. Look at the SQL keywords "unique" and "distinct"; they can probably be used to generate unique rows.
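Since the question also asks about Python, the same self-join idea can be sketched there: group the rows by the five matching columns, then pair up the distinct column 5 values within each group. The rows below are hypothetical, because the real data was only shown as a picture, and `edge_list` is a name invented for this example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical rows in the order (col1, col2, col3, col4, col5, col6).
rows = [
    ("x", 1, "a", 10, "v1", "k"),
    ("x", 1, "a", 10, "v2", "k"),
    ("x", 1, "a", 10, "v3", "k"),
    ("y", 2, "b", 20, "v4", "k"),
]


def edge_list(rows):
    """Return sorted undirected edges between col5 values that share col1-4 and col6."""
    groups = defaultdict(set)
    for c1, c2, c3, c4, c5, c6 in rows:
        groups[(c1, c2, c3, c4, c6)].add(c5)  # a set drops duplicate col5 values
    edges = []
    for vertices in groups.values():
        # Every unordered pair inside a group is one edge.
        edges.extend(combinations(sorted(vertices), 2))
    return sorted(edges)


edges = edge_list(rows)
print(edges)
```

Each pair appears once, which corresponds to the undirected variant of the join (the `t1.col5 < t2.col5` condition).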

Get specific words from a string in python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I want to extract some information from a data file. The following is the format I have in my data file:
44 2.463181s> (G) GET_NBI: 0x00002aaa ecc00e90 <- (4,0x00002aab 4c000c00) (256 bytes)
From this line, I want to extract 256 which is the last number and 4 which is the first number from
(4,0x00002aab 4c000c00)
Could you please recommend some functions which will be useful for my case?
You should use str.split().
What it does is split the string every place there is a space, so you would get a list of strings like so:
n = '44 2.463181s> (G) GET_NBI: 0x00002aaa ecc00e90 <- (4,0x00002aab 4c000c00) (256 bytes)'
o = n.split()
print(o)
Output:
['44', '2.463181s>', '(G)', 'GET_NBI:', '0x00002aaa', 'ecc00e90', '<-', '(4,0x00002aab', '4c000c00)', '(256', 'bytes)']
Then simply take the second-to-last element: o[-2] -> '(256'
Remove the extra parenthesis: '(256'[1:] -> '256', and, if you want, turn it into an integer: int('256') -> 256
You could also use regular expressions, which in this case might be a bit clearer.
import re
txt = "44 2.463181s> (G) GET_NBI: 0x00002aaa ecc00e90 <- (4,0x00002aab 4c000c00) (256 bytes)"
results = re.findall(r"\((\d+)", txt)
# ["4", "256"]
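If other parenthesised numbers might appear on the line, a stricter pattern can tie the two values to their positions. This is a sketch that assumes every line has the shape "(N,0x<hex> <hex>) (M bytes)" shown in the question; the names `pattern`, `first`, and `size` are chosen for this example.

```python
import re

txt = "44 2.463181s> (G) GET_NBI: 0x00002aaa ecc00e90 <- (4,0x00002aab 4c000c00) (256 bytes)"

# Group 1: the number right after "(" in "(4,0x...)"; group 2: the byte count.
pattern = re.compile(r"\((\d+),0x[0-9a-fA-F]+ [0-9a-fA-F]+\)\s+\((\d+) bytes\)")

match = pattern.search(txt)
if match:
    first, size = int(match.group(1)), int(match.group(2))
    print(first, size)
```

`pattern.search` returns None for lines that do not fit, so malformed lines can be skipped instead of raising an error.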
