Write a dataframe with only integer values - python

The title of this question might not be appropriate...
So let's suppose I have the following input.csv :
Division,id,name
1,3870,name1
1,4537,name2
1,5690,name3
I need to do some processing based on the id column, fetching data like this:
>>> get_data(3870)
[{"matchId": 42, comment: "Awesome match"}, {"matchId": 43, comment: "StackOverflow is quite good"}]
My objective is to output a csv that is a join between the first one and the related data retrieved through get_data:
Division,id,name,matchId,comment
1,3870,name1,42,Awesome match
1,3870,name1,43,StackOverflow is quite good
1,4537,name2,90,Random value
1,4537,name2,91,Still a random value
1,5690,name3,10,Guess what it is
1,5690,name3,11,A random value
However, for some reason, somewhere in the process the integer data are converted into floats:
Division,id,name,matchId,comment
1.0,3870.0,name1,42.0,Awesome match
1.0,3870.0,name1,43.0,StackOverflow is quite good
1.0,4537.0,name2,90.0,Random value
1.0,4537.0,name2,91.0,Still a random value
1.0,5690.0,name3,10.0,Guess what it is
1.0,5690.0,name3,11.0,A random value
Here is a short version of my code; I think I missed something...
import pandas as pd

input_df = pd.read_csv(INPUT_FILE)
output_df = pd.DataFrame()
for index, row in input_df.iterrows():
    matches = get_data(row)
    rdict = dict(row)
    for m in matches:
        m.update(rdict)
        output_df = output_df.append(m, ignore_index=True)

# FIXME: this was an attempt to solve the problem
output_df["id"] = output_df["id"].astype(int)
output_df["matchId"] = output_df["matchId"].astype(int)

output_df.to_csv(OUTPUT_FILE, index=False)
How can I convert every float column to integers?

The first solution is to add the parameter float_format='%.0f' to to_csv:
print(output_df.to_csv(index=False, float_format='%.0f'))
Division,comment,id,matchId,name
1,StackOverflow is quite good,3870,43,name1
1,StackOverflow is quite good,4537,43,name2
1,StackOverflow is quite good,5690,43,name3
A second possible solution is to apply a function convert_to_int instead of using astype:
print(output_df)
Division comment id matchId name
0 1 StackOverflow is quite good 3870 43 name1
1 1 StackOverflow is quite good 4537 43 name2
2 1 StackOverflow is quite good 5690 43 name3
print(output_df.dtypes)
Division float64
comment object
id float64
matchId float64
name object
dtype: object
def convert_to_int(x):
    try:
        return x.astype(int)
    except:
        return x

output_df = output_df.apply(convert_to_int)
print(output_df)
Division comment id matchId name
0 1 StackOverflow is quite good 3870 43 name1
1 1 StackOverflow is quite good 4537 43 name2
2 1 StackOverflow is quite good 5690 43 name3
print(output_df.dtypes)
Division int32
comment object
id int32
matchId int32
name object
dtype: object
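As a side note, the floats appear in the first place because the row-by-row append forces pandas to re-infer dtypes on every concatenation. A minimal sketch of an alternative that usually avoids the problem by collecting the rows in a list and building the DataFrame once (assuming the same get_data, INPUT_FILE and OUTPUT_FILE as in the question):
import pandas as pd

input_df = pd.read_csv(INPUT_FILE)

rows = []
for _, row in input_df.iterrows():
    # assumes get_data takes the id, as in the question's example call
    for m in get_data(row["id"]):
        rows.append({**dict(row), **m})

# Built in one go, "id" and "matchId" stay integer columns,
# so no astype/float_format fix-up is needed before to_csv.
output_df = pd.DataFrame(rows)
output_df.to_csv(OUTPUT_FILE, index=False)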

Related

Pandas remove every entry with a specific value

I would like to go through every row (entry) in my df and remove every entry that has the value "" (which, yes, is an empty string).
So if my data set is:
Name Gender Age
Jack 5
Anna F 6
Carl M 7
Jake M 7
Therefore Jack would be removed from the dataset.
On another note, I would also like to remove entries that have the values "Unspecified" and "Undetermined".
Eg:
Name Gender Age Address
Jack 5 *address*
Anna F 6 *address*
Carl M 7 Undetermined
Jake M 7 Unspecified
Now,
Jack will be removed due to empty field.
Carl will be removed due to the value Undetermined present in a column.
Jake will be removed due to the value Unspecified present in a column.
For now, this has been my approach but I keep getting a TypeError.
list = []
for i in df.columns:
    if df[i] == "":
        # every time there is an empty string, add 1 to list
        list.append(1)

# count list to see how many entries there are with empty string
len(list)
Please help me with this. I would prefer a for loop, since there are about 22 columns and 9000+ rows in my actual dataset.
Note - I do understand that there are other questions like this; it's just that none of them apply to my situation, since most of them only work for a few columns and I do not wish to hardcode all 22 columns.
Edit - Thank you for all your feedback, you all have been incredibly helpful.
To delete rows based on a condition, use the following:
df = df.drop(df[condition].index)
For example:
df = df.drop(df[df.Age == 5].index) will drop the rows where Age is 5.
I've come across a post on the same topic dating back to 2017; it should help you understand it more clearly.
Regarding question 2, here's how to remove rows with the specified values in a given column:
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Let's assume we have a Pandas DataFrame object df.
To remove every row matching your conditions, simply do:
df = df[~((df.Gender == "") | (df.Age == "") | df.Address.isin(["", "Undetermined", "Unspecified"]))]
If the unspecified fields are NaN, you can also do:
df = df.dropna(how="any", axis = 0)
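Building on that: when the offending values are strings rather than NaN, one option is to normalize them to NaN first and then drop, which covers all 22 columns without naming any of them. A rough sketch (the placeholder list is an assumption based on the values mentioned in the question):
import numpy as np

placeholders = ["", " ", "Undetermined", "Unspecified"]
df = df.replace(placeholders, np.nan).dropna(how="any", axis=0)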
The answers from @ThatCSFresher or @Bence will help you remove rows based on a single column... which is great!
However, I think there are multiple conditions in your query that need to be checked across multiple columns at once. So apply with a lambda can probably do the job; try the following code:
df = pd.DataFrame({"Name": ["Jack", "Anna", "Carl", "Jake"],
                   "Gender": ["", "F", "M", "M"],
                   "Age": [5, 6, 7, 7],
                   "Address": ["address", "address", "Undetermined", "Unspecified"]})

df["Noise_Tag"] = df.apply(lambda x: "Noise" if ("" in list(x)) or ("Undetermined" in list(x)) or ("Unspecified" in list(x)) else "No Noise", axis=1)
df1 = df[df["Noise_Tag"] == "No Noise"]
del df1["Noise_Tag"]
# Output of df;
Name Gender Age Address Noise_Tag
0 Jack 5 address Noise
1 Anna F 6 address No Noise
2 Carl M 7 Undetermined Noise
3 Jake M 7 Unspecified Noise
# Output of df1;
Name Gender Age Address
1 Anna F 6 address
Well, the OP actually wants to delete any row that has an "empty" string in any column.
df = df[~(df=="").any(axis=1)] # deletes all rows that have empty string in any column.
If you want to filter specifically on the Address column, then you can just use
df = df[~df["Address"].isin(("Undetermined", "Unspecified"))]
Or, to drop rows where any column contains Undetermined or Unspecified, do the same as the first solution in my post, just replacing the empty string with Undetermined or Unspecified.
df = df[~((df=="Undetermined") | (df=="Unspecified")).any(axis=1)]
You can build masks and then filter the df according to them:
m1 = df.eq('').any(axis=1)
# m1 is True if any cell in a row has an empty string
m2 = df['Address'].isin(['Undetermined', 'Unspecified'])
# m2 is True if a row has one of the values in the list in column 'Address'
out = df[~m1 & ~m2] # invert both conditions and get the desired output
print(out)
Output:
Name Gender Age Address
1 Anna F 6 *address*
Used Input:
df = pd.DataFrame({'Name': ['Jack', 'Anna', 'Carl', 'Jake'],
'Gender': ['', 'F', 'M', 'M'],
'Age': [5, 6, 7, 7],
'Address': ['*address*', '*address*', 'Undetermined', 'Unspecified']}
)
Using a lambda function:
Code:
df[df.apply(lambda x: False if (x.Address in ['Undetermined', 'Unspecified'] or '' in list(x)) else True, axis=1)]
Output:
Name Gender Age Address
1 Anna F 6 *address*

Python pandas split column with NaN values

Hello my dear coders,
I'm new to coding and I've stumbled upon a problem. I want to split a column of a csv file that I have imported via pandas in Python. The column is named CATEGORY and contains 1, 2 or 3 values separated by a comma (e.g. 2343, 3432, 4959). Now I want to split these values into separate columns named CATEGORY, SUBCATEGORY and SUBSUBCATEGORY.
I have tried this line of code:
products_combined[['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY']] = products_combined.pop('CATEGORY').str.split(expand=True)
But I get this error: ValueError: Columns must be same length as key
Would love to hear your feedback <3
You need:
pd.DataFrame(df.CATEGORY.str.split(',').tolist(), columns=['CATEGORY','SUBCATEGORY', 'SUBSUBCATEGORY'])
Output:
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 2343 3432 4959
I think this could be accomplished by creating three new columns and assigning each the result of a lambda function applied to the 'CATEGORY' column. Like so:
products_combined['SUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original.split(',')[1] if len(original.split(',')) > 1 else None)
products_combined['SUBSUBCATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original.split(',')[2] if len(original.split(',')) > 2 else None)
products_combined['CATEGORY'] = products_combined['CATEGORY'].apply(lambda original: original.split(',')[0])
The apply() method called on a series returns a new series that contains the result of running the passed function (in this case, the lambda function) on each element of the original series.
IIUC, use split and then Series:
(
    df[0].apply(lambda x: pd.Series(x.split(",")))
      .rename(columns={0: "CATEGORY", 1: "SUBCATEGORY", 2: "SUBSUBCATEGORY"})
)
CATEGORY SUBCATEGORY SUBSUBCATEGORY
0 2343 3432 4959
1 1 NaN NaN
2 44 55 NaN
Data:
d = [["2343,3432,4959"],["1"],["44,55"]]
df = pd.DataFrame(d)
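For what it's worth, the ValueError in the question usually means the split produced a different number of columns than the three names being assigned; .str.split() without an explicit ',' splits on whitespace, for example. A sketch that forces exactly three columns, assuming the products_combined frame from the question:
parts = products_combined['CATEGORY'].str.split(',', expand=True)
parts = parts.reindex(columns=range(3))  # pad (or trim) to exactly three columns
parts.columns = ['CATEGORY', 'SUBCATEGORY', 'SUBSUBCATEGORY']
products_combined = products_combined.drop(columns='CATEGORY').join(parts)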

Scalar error when swapping out a hard coded string for a variable with pandas

df1 looks something like this:
name age
1 Bobby 17
2 Sally 23
3 John 19
df2 looks like this:
name city state
1 Bobby Lakeside MN
2 Sally Carlstown MS
3 John Wallsburg UT
I am looping through a DataFrame, df1, like this:
for row in df1.itertuples(name='Pandas', index=True):
    name = getattr(row, "name")
    print(type(name))
    print(name)
and I will get (as expected):
<type 'str'>
Bobby
<type 'str'>
Sally
<type 'str'>
John
Then I am searching a second dataframe, df2, and getting its row location (index) number, so I can get additional information.
i = df2[(df2['name'] == "Bobby")].index.item()
i is now the integer... it worked like a champ. It found Bobby in the other DataFrame, df2, and voilà! Gave me the index number.
However... if I try swapping out the hard coded string "Bobby" to the variable like this...
for row in df1.itertuples(name='Pandas', index=True):
    name = getattr(row, "name")
    i = df2[(df2['name'] == name)].index.item()
then it explodes and dies.
for row in df1.itertuples(name='Pandas', index=True):
    name = getattr(row, "name")
    i = df2[(df2['name'] == str(name))].index.item()
I get the following exception:
ValueError: can only convert an array of size 1 to a Python scalar
I am at a complete loss. Help, and thank you!
Your logic seems overcomplicated. You can create a name to age mapping from df1 and iterate df2.iterrows. There is no need to access indices, unless you have repeated names. In the latter case, you can use the index.
s = df1.set_index('name')['age']

for _, row in df2.iterrows():
    print('{0} who is {1} lives in {2}'.format(row['name'], s.get(row['name']), row.city))
Bobby who is 17 lives in Lakeside
Sally who is 23 lives in Carlstown
John who is 19 lives in Wallsburg
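If the goal is just to line up the extra columns, a plain merge avoids the index lookup entirely. A small sketch using the df1/df2 shown in the question:
# one row per name, with age, city and state side by side
merged = df1.merge(df2, on='name', how='left')
print(merged)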

Pandas - Retrieve Value from df.loc

Using pandas I have a result (here aresult) from a df.loc lookup that python is telling me is a 'Timeseries'.
sample of predictions.csv:
prediction id
1 593960337793155072
0 991960332793155071
....
code to retrieve one prediction
predictionsfile = pandas.read_csv('predictions.csv')
idtest = 593960337793155072
result = (predictionsfile.loc[predictionsfile['id'] == idtest])
aresult = result['prediction']
aresult comes back in a format that cannot be keyed:
In: print(aresult)
11 1
Name: prediction, dtype: int64
I just need the prediction, which in this case is 1. I've tried aresult['result'], aresult[0] and aresult[1], all to no avail. Before I do something awful like converting it to a string and stripping it out, I thought I'd ask here.
A one-element Series requires .item() to retrieve its scalar value:
print(aresult.item())
1
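If the lookup could ever match zero rows or more than one, .item() will raise, so a position-based pick is sometimes safer. A small sketch, assuming the same aresult Series:
# take the first matching prediction, or None if nothing matched
value = aresult.iloc[0] if not aresult.empty else None
print(value)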

Replace WhiteSpace with a 0 in Pandas (Python 3)

Simple question here -- how do I replace all of the whitespace values in a column with a zero?
For example:
Name Age
John 12
Mary
Tim 15
into
Name Age
John 12
Mary 0
Tim 15
I've been trying something like this, but I am unsure how Pandas actually reads whitespace:
merged['Age'].replace(" ", 0).bfill()
Any ideas?
merged['Age'] = merged['Age'].apply(lambda x: 0 if x == ' ' else x)
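A regex-based replace is another short way to catch any amount of whitespace, not just a single space; a sketch on the same column (the pattern is an assumption about what the blanks look like):
merged['Age'] = merged['Age'].replace(r'^\s*$', 0, regex=True)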
Use the built-in method convert_objects and set the parameter convert_numeric=True:
In [12]:
# convert objects will handle multiple whitespace, this will convert them to NaN
# we then call fillna to convert those to 0
df.Age = df[['Age']].convert_objects(convert_numeric=True).fillna(0)
df
Out[12]:
Name Age
0 John 12
1 Mary 0
2 Tim 15
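Note that convert_objects was deprecated and later removed from pandas; pd.to_numeric with errors='coerce' gives the same effect in current versions. A sketch on the same df:
# non-numeric entries (including whitespace) become NaN, then fill with 0
df['Age'] = pd.to_numeric(df['Age'], errors='coerce').fillna(0).astype(int)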
Here's an answer modified from this more thorough question. I'll make it a little bit more Pythonic and resolve the basestring issue.
def ws_to_zero(maybe_ws):
    try:
        if maybe_ws.isspace():
            return 0
        else:
            return maybe_ws
    except AttributeError:
        return maybe_ws

d.applymap(ws_to_zero)
where d is your dataframe.
If you want to use NumPy, you can use the snippet below:
import numpy as np
df['column_of_interest'] = np.where(df['column_of_interest'] == ' ', 0, df['column_of_interest']).astype(float)
While Paulo's response is excellent, my snippet above may be useful when multiple criteria are required during advanced data manipulation.
