I have a sample data set:
ID sequence
H100 ATTCCT
H231 CTGGGA
H2002 CCCCCCA
I simply want to add a ">" in front of each ID:
ID sequence
>H100 ATTCCT
>H231 CTGGGA
>H2002 CCCCCCA
From the post "Append string to the start of each value in a said column of a pandas dataframe (elegantly)" I got this code:
df["ID"] = '>' + df["ID"].astype(str)
However, this warning message came up:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
so I tried:
df.loc[: , "ID"] = '>'
The same warning came up.
How should I correct this?
Thanks
Give this a shot - works for me in Python 3.5:
df['ID'] = ('>' + df['ID'])
If that doesn't do it, you may have to refer to df.iloc[:, 1], for example (type it in the terminal first to make sure you've grabbed the right field, i.e. where ID is located).
The other problem you may be experiencing is that your dataframe was created as a slice of another dataframe. Try converting your "slice" to its own dataframe:
dataframename = pandas.DataFrame(dataframename)
Then do the code snip I posted.
Best - Matt
Not sure why I'm losing reputation points for trying to answer questions for people with actually verified answers... kind of wondering what the point of this forum is at the moment.
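For reference, a minimal sketch of the sequence described above, using .copy() (the more common idiom for making a slice independent of its parent); the name source_df is assumed here for the frame being sliced:

import pandas as pd

# Taking an explicit copy makes the slice an independent DataFrame,
# so the assignment below no longer triggers SettingWithCopyWarning.
df = source_df[["ID", "sequence"]].copy()
df["ID"] = '>' + df["ID"].astype(str)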
I am working on a project in Jupyter Notebook; I'll use an oversimplified example here.
I have a url, let's say
url = www.instagram.com/alex
I need to build a url column adjacent to the names by using a replace function on this template URL.
And I have a pandas data frame
Names
John
Cherry
nancy
Results wanted using function
Names url
John www.instagram.com/john
Cherry www.instagram.com/cherry
nancy www.instagram.com/nancy
What I am doing is:
data["url"] = url
w = data.names.values
def replace()
for i in w,data.iteritems:
for j in range(len(data.url),data.iteritems:
data["url"]=url.replace("alex",i(j))
return data
It throws an error saying that I cannot use range as indices... so I tried many things to use integers, but it still doesn't give me the results unless I manually put i(0) or i(1) or i(3).
If I try to add another for line like
for w in range(len(data.url)):
and do i(w)...
then it changes everything to i(0), which in this example would be www.instagram.com/john.
I have used an oversimplified example for my problem; in my project it is very important to create a function, because the url is too big and the names are input (the user selects them), so that is why I need to create a function.
Please check below:
df['url'] = df['Names'].apply(lambda x: url.replace('alex', x.lower()))
data["url"] = "www.instagram.com/" + data["Names"].str.lower()
Here is my code snippet to get the data I need from the CSV:
pathName = 'pathName'
export = pd.read_csv(pathName, skiprows=[0], header=None)
# pathName: find the correct path for the file
# skiprows: the first row is occupied by the title, we don't need that
omsList = export.values.T[1].tolist()  # transpose the matrix + get the second column
for omsID in omsList:
    productOMS = omsID
Here is how I'm yielding said item:
item['productOMS'] = productOMS
yield item
Here is the column I am trying to get data from
When I run my spider I get nan as the output for omsID, which after some research I found out means "not a number". It would make sense why I'm getting that, since I think these values should be treated as strings. How would I adjust my program to recognize these data fields as strings, rather than reading them in as ints?
You need to use Python's type conversion / casting, i.e. int(my_numerical_string) tells Python to interpret the text as an integer. You can also use type(my_var) to find out the type of your variable.
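If, on the other hand, the goal is to stop pandas from parsing the ID column as numbers in the first place, read_csv accepts a dtype argument; this is only a sketch, with the path and column position assumed from the question:

import pandas as pd

# dtype=str keeps every column as text, so the IDs are not coerced to numbers or NaN.
export = pd.read_csv(pathName, skiprows=[0], header=None, dtype=str)
omsList = export[1].tolist()  # second column, already strings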
This was a silly problem that I did not see coming: I had to increase the width of the target column in Excel so the values could actually be read in.
I have a json file for tweet data. The data that I want to look at is the text of the tweet. For some reason, some of the tweets are too long to put into the normal text part of the dictionary.
It seems like there is a dictionary within another dictionary and I can't figure out how to access it very well.
Basically, what I want in the end is one column of a data frame that will have all of the text from each individual tweet. Here is a link to a small sample of the data that contains a problem tweet.
Here is the code I have so far:
import json
import pandas as pd
tweets = []
# This reads the json file so that I can work with it. This part works correctly.
with open("filelocation.txt") as source:
    for line in source:
        if line.strip():
            tweets.append(json.loads(line))
print(len(tweets))
df = pd.DataFrame.from_dict(tweets)
df.info()
When looking at the info you can see that there will be a column called extended_tweet that only encompasses one of the two sample tweets. Within this column, there seems to be another dictionary with one of those keys being full_text.
I want to add another column to the dataframe that just has this information along with the normal text column when the full_text is null.
My first thought was to try and read that specific column of the dataframe as a dictionary again using:
d = pd.DataFrame.from_dict(tweets['extended_tweet']['full_text'])
But this doesn't work. I don't really understand why that doesn't work as that is how I read the data the first time.
My guess is that I can't look at the specific names because I am going back to the list and it would have to read all or none. The error it gives me says "KeyError: 'full_text' "
I also tried using the recommendation provided by this website. But this gave me a None value no matter what.
Thanks in advance!
I tried to do what @Dan D. suggested; however, this still gave me errors. But it gave me the idea to try this:
tweets[0]['extended_tweet']['full_text']
This works and gives me the value that I am looking for. But I need to run through the whole thing. So I tried this:
df['full'] = [tweets[i]['extended_tweet']['full_text'] for i in range(len(tweets))]
This gives me "Key Error: 'extended_tweet' "
Does it seem like I am on the right track?
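(For the DataFrame route specifically, the nested dictionaries can be pulled out row by row; this is only a sketch of that idea, assuming df was built as above:)

# Use full_text from extended_tweet when it is a dict, otherwise fall back to text.
df['full'] = df.apply(
    lambda row: row['extended_tweet']['full_text']
    if isinstance(row.get('extended_tweet'), dict)
    else row['text'],
    axis=1,
)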
I would suggest flattening out the dictionaries like this:
tweet = json.loads(line)
tweet['full_text'] = tweet['extended_tweet']['full_text']
tweets.append(tweet)
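A slightly more defensive variant of that idea (my sketch, not part of the answer above) only reaches into extended_tweet when the key is present:

import json

tweets = []
with open("filelocation.txt") as source:
    for line in source:
        if line.strip():
            tweet = json.loads(line)
            # Prefer the untruncated text when Twitter provides it.
            if "extended_tweet" in tweet:
                tweet["full_text"] = tweet["extended_tweet"]["full_text"]
            else:
                tweet["full_text"] = tweet.get("text")
            tweets.append(tweet)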
I don't know if the answer suggested earlier works. I never got that successfully. But I did figure out something else that works well for me.
What I really needed was a way to display the full text of a tweet. I first loaded the tweets from the json with what I posted above. Then I noticed that in the data file, there is something called truncated. If this value is true, the tweet is cut short and the full tweet is placed within the
tweets[i]['extended_tweet']['full_text']
In order to access it, I used this:
tweet_list = []
for i in range(len(tweets)):
    if tweets[i]['truncated'] == 'True':
        tweet_list.append(tweets[i]['extended_tweet']['full_text'])
    else:
        tweet_list.append(tweets[i]['text'])
Then I can work with the data using the whole text from each tweet.
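(One caveat worth flagging: json.loads usually turns the JSON boolean true into Python's True rather than the string 'True', so if the comparison above ever stops matching, a plain truthiness check is a safer sketch:)

tweet_list = []
for t in tweets:
    # Fall back to the normal text field when there is no extended_tweet.
    if t.get('truncated') and 'extended_tweet' in t:
        tweet_list.append(t['extended_tweet']['full_text'])
    else:
        tweet_list.append(t['text'])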
I am currently writing a program which uses the Companies House API to return a json file containing information about a certain company.
I am able to retrieve the data easily using the following commands:
r = requests.get('https://api.companieshouse.gov.uk/company/COMPANY-NO/filing-history', auth=('API-KEY', ''))
data = r.json()
With that information I can do an awful lot; however, I've run into a problem which I was hoping you could possibly help me with. What I aim to do is go through every nested entry in the json file and check whether the values of certain keys match certain criteria; if the values of 2 keys match the criteria, then other code is executed.
One of the keys is the date of an entry, and I would like to ignore results that are older than a certain date, I have attempted to do this with the following:
date_threshold = datetime.date.today() - datetime.timedelta(days=30)
for each in data["items"]:
    date = ['date']
    type = ['type']
    if date < date_threshold and type is "RM01":
        print("wwwwww")
In case it isn't clear, what I'm attempting to do (albeit very badly) is assign each of the entries to a variable, which then gets tested against certain criteria.
This doesn't work, though; Python spits out a type mismatch error:
TypeError: unorderable types: list() < datetime.date()
Which makes me think the date is being stored as a string, and so I can't compare it to the datetime value set earlier, but when I check the API documentation (https://developer.companieshouse.gov.uk/api/docs/company/company_number/filing-history/filingHistoryItem-resource.html), it says clearly that the 'date' entry is returned as a date type.
What am I doing wrong? It's very clear that I'm extremely new to Python, given what I presume is the atrocity of my code, but in my head it seems to make at least a little sense. In case none of this is clear, I basically want to go through all the entries in the json file, and if the date and type match a certain description, then other code can be executed (in this case I have just used random text).
Any help is greatly appreciated! Let me know if you need anything cleared up.
:)
EDIT
After tweaking my code to the below:
for each in data["items"]:
    date = each['date']
    type = each['type']
    if date is '2016-09-15' and type is "RM01":
        print("wwwwww")
The code executes without any errors, but the words aren't printed, even though I know there is an entry in the json file with that exact date and that exact type. Any thoughts?
SOLUTION:
Thanks to everyone for helping me out. I had made a couple of very basic errors; the code that works as expected is below:
for each in data["items"]:
    date = each['date']
    typevariable = each['type']
    if date == '2016-09-15' and typevariable == "RM01":
        print("wwwwww")
This prints the word "wwwwww" 3 times, which is correct seeing as there are 3 entries in the JSON that fulfil those criteria.
You need to first convert your date variable to a datetime type using datetime.strptime().
You are comparing a list-type variable (date) with a datetime-type variable (date_threshold).
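Putting both points together, a minimal sketch (assuming the dates come back as 'YYYY-MM-DD' strings, as in the example above):

import datetime

date_threshold = datetime.date.today() - datetime.timedelta(days=30)

for each in data["items"]:
    # Parse the ISO-formatted date string so it can be compared to a datetime.date.
    item_date = datetime.datetime.strptime(each['date'], '%Y-%m-%d').date()
    if item_date < date_threshold and each['type'] == "RM01":
        print("wwwwww")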
I am aware that there are several other posts on Stack Overflow regarding this same issue, however, not a single solution found on those posts, or any other post I've found online for that matter, has worked. I have followed numerous tutorials, videos, books, and Stack Overflow posts on pandas and all mentioned solutions have failed.
The frustrating thing is that all the solutions I have found are correct, or at least they should be; I am fairly new to pandas so my only conclusion is that I am probably doing something wrong.
Here is the pandas documentation that I started with: Pandas to_json Doc. I can't seem to get pandas to_json to convert a pandas DataFrame to a json object or json string.
Basically, I want to convert a csv string into a DataFrame, then convert that DataFrame into a json object or json string (I don't care which one). Then, once I have my json data structure, I'm going to bind it to a D3.js bar chart.
Here is an example of what I am trying to do:
# Declare my csv string (Works):
csvStr = '"pid","dos","facility","a1c_val"\n"123456","2013-01-01 13:37:00","UOFU",5.4\n"65432","2014-01-01 14:32:00","UOFU",5.8\n"65432","2013-01-01 13:01:00","UOFU",6.4'
print (csvStr) # Just checking the variables contents
# Read csv and convert to DataFrame (Works):
csvDf = pandas.read_csv(StringIO.StringIO(csvStr))
print (csvDf) # Just checking the variables contents
# Convert DataFrame to json (Three of the ways I tried - None of them work):
myJSON = csvDf.to_json(path_or_buf = None, orient = 'record', date_format = 'epoch', double_precision = 10, force_ascii = True, date_unit = 'ms', default_handler = None) # Attempt 1
print (myJSON) # Just checking the variables contents
myJSON = csvDf.to_json() # Attempt 2
print (myJSON) # Just checking the variables contents
myJSON = pandas.io.json.to_json(csvDf)
print (myJSON) # Just checking the variables contents
The error that I am getting is:
argument 1 must be string or read-only character buffer, not DataFrame
Which is misleading because the documentation says "A Series or DataFrame can be converted to a valid JSON string."
Regardless, I tried giving it a string anyway, and it resulted in the exact same error.
I have tried creating a test scenario, following the exact steps from books and other tutorials and/or posts, and it just results in the same error. At this point, I need a simple solution asap. I am open to suggestions, but I must emphasize that I do not have time to waste on learning a completely new library.
For your first attempt, the correct string is 'records', not 'record'. This worked for me:
myJSON = csvDf.to_json(path_or_buf = None, orient = 'records', date_format = 'epoch', double_precision = 10, force_ascii = True, date_unit = 'ms', default_handler = None) # Attempt 1
Printing gives:
[{"pid":123456,"dos":"2013-01-01 13:37:00","facility":"UOFU","a1c_val":5.4},
{"pid":65432,"dos":"2014-01-01 14:32:00","facility":"UOFU","a1c_val":5.8},
{"pid":65432,"dos":"2013-01-01 13:01:00","facility":"UOFU","a1c_val":6.4}]
It turns out that the problem was because of my own stupid mistake. While testing my use of to_json, I copied and pasted an example into my code and went from there. Thinking I had commented out that code, I proceeded to try using to_json with my test data. It turns out the error I was receiving was being thrown from the example code that I had copied and pasted. Once I deleted everything and re-wrote it using my test data, it worked.
However, as user667648 (Bair) pointed out, there was another mistake in my code: the orient param was supposed to be orient = 'records' and NOT orient = 'record'.
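For completeness, a short end-to-end sketch with the corrected orient value (using io.StringIO, which is where StringIO lives in Python 3):

import io
import pandas as pd

csvStr = '"pid","dos","facility","a1c_val"\n"123456","2013-01-01 13:37:00","UOFU",5.4\n"65432","2014-01-01 14:32:00","UOFU",5.8\n"65432","2013-01-01 13:01:00","UOFU",6.4'

csvDf = pd.read_csv(io.StringIO(csvStr))
myJSON = csvDf.to_json(orient='records')  # note: 'records', not 'record'
print(myJSON)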