Could not drop NaN values using Pandas [closed] - python

I am trying to drop NaN values using the dropna() method provided by pandas. I've read the documentation and looked at other Stack Overflow posts, but I still could not fix the error.
My code first reads an Excel file. If a cell has the value "-", it is converted to a NaN value. After that, I use dropna() to drop the NaN values and assign the result to a new variable called mydf2. Below are my code and screenshots.
import pandas as pd

mydf = pd.read_excel('pandas lab datasets/singstats_maritalstatus.xlsx',
                     na_values='-')
mydf = mydf.set_index(['Variables'])
print(mydf.head(5))  # original data
mydf2 = mydf.dropna()
print(mydf2)

dropna() has worked correctly. You have two print statements: the first one printed five rows, as asked for by print(mydf.head(5)).
The output of your second print statement, print(mydf2), is an empty dataframe [0 rows x 37 columns] because you apparently have a NaN in every single row (see the bottom of your screenshot).
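To illustrate with a toy frame (column names made up for the example): when every row contains at least one NaN, dropna() removes everything, while dropna(axis=1) drops the offending columns instead.
import numpy as np
import pandas as pd

df = pd.DataFrame({'1980': [1, 2], '1981': [np.nan, np.nan]})
print(df.dropna())        # empty dataframe: every row has a NaN
print(df.dropna(axis=1))  # keeps the fully populated '1980' column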

It sounds like the '-' placeholder is coming through as a string rather than being converted to NaN, so replace it explicitly and then drop:
import numpy as np

mydf2 = mydf.replace('-', np.nan).dropna()

I wrote a piece of code that works fine with my data, so try this out. It collects the positions of all rows containing "-" and then drops them:
mydf = pd.read_excel('pandas lab datasets/singstats_maritalstatus.xlsx')
to_del = []
for i in range(mydf.shape[0]):
    if "-" in list(mydf.iloc[i]):  # row contains the '-' placeholder
        to_del.append(i)
out_df = mydf.drop(to_del, axis=0)

As you have not posted your data, I'm not sure whether every row has NaN values. If so, df.dropna() will simply drop every row. For example, the 1981 and 1982 columns are all NaN values in your image; using df.dropna(axis=1) will drop those columns instead of returning an empty df.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Variables': ['Total', 'Single', 'Married', 'Widowed', 'Divorced/Separated'],
                   '1980': range(5),
                   '1981': [np.nan] * 5})
df = df.set_index('Variables')
df = df.dropna(axis=1)
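With that example, printing the result shows only the fully populated 1980 column surviving:
print(df)
#                     1980
# Variables
# Total                  0
# Single                 1
# Married                2
# Widowed                3
# Divorced/Separated     4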

Related

Is it possible to move the last character of a column to the first character of another column in excel or using python code? [closed]

Good afternoon. I was wondering if it is possible to move the last character of column A to the first character of column B, using Excel or maybe even Python. I know how to remove the last character in Excel, but I don't know how to go about adding it to another column.
Ex.
Column A = ABCD1, Column B = 234567
Desired result:
Column A = ABCD, Column B = 1234567
In Excel:
Define column C with the following formula:
C1 = LEFT(A1, LEN(A1)-1)
Fill it down the entire column.
Define column D with the following formula:
D1 = RIGHT(A1, 1) & B1
Fill this down column D in the same way.
Copy columns C and D, then use Paste Special > Paste Values over columns A and B.
You can now delete the temporary columns C and D.
In column C, just use the formula:
=LEFT(A1,LEN(A1)-1)
and in D:
=RIGHT(A1,1)&B1
Then copy/paste columns C and D as values to "lock in" columns C and D.
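Since the question also mentions Python, here is a minimal pandas sketch of the same move; the column names 'A' and 'B' and the sample values are assumptions for illustration:
import pandas as pd

df = pd.DataFrame({'A': ['ABCD1'], 'B': ['234567']})
df['B'] = df['A'].str[-1] + df['B']  # prepend the last character of A to B
df['A'] = df['A'].str[:-1]           # then drop that character from A
print(df)  # A: ABCD, B: 1234567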

Pandas: slicing the dataframe using index values [closed]

I have a dataframe (shown below) which contains the same set of banks twice. I need to slice the data from the 0th index, which contains a bank name, up to the index that contains that same bank name again (here, DEUTSCH BANK AG). I need to apply the same logic to any dataframe of this kind. Thank you.
I tried df25.iloc[0,1] == df25[1].any(), but it only returns True, not the index position.
DataFrame: https://i.stack.imgur.com/iJ1hJ.png, https://i.stack.imgur.com/J2aDX.png
You need to get the indices of all the rows that have the value you are looking for (in this case the bank name), and then slice the dataframe using those indices.
Example:
import pandas as pd

df = pd.DataFrame({'Col1': list('abcdeafgbfhi')})
search_str = 'b'
idx_list = list(df[df['Col1'] == search_str].index.values)
print(df[idx_list[0]:idx_list[1]])
Output:
  Col1
1    b
2    c
3    d
4    e
5    a
6    f
7    g
Note that the assumption is that there will be only 2 rows with the same value. If there are more than 2, you have to play with the index list values and get what you need. Hope this helps.
Keep in mind that posting a sample data set will always help you get more answers. People tend to move on to another question when they see images or screenshots, because reproducing the issue then involves additional steps.
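Applied to the bank scenario, a hedged sketch (the 'Bank' column name and its values are made up to stand in for the screenshot):
import pandas as pd

df25 = pd.DataFrame({'Bank': ['DEUTSCH BANK AG', 'HSBC', 'UOB',
                              'DEUTSCH BANK AG', 'HSBC', 'UOB']})
matches = df25.index[df25['Bank'] == df25.loc[0, 'Bank']]
print(df25.loc[matches[0]:matches[1] - 1])  # first block, before the repeat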

How to Read A CSV With A Variable Number of Columns? [closed]

My csv file looks like this:
5783,145v
g656,4589,3243,tt56
6579
How do I read this with pandas (or otherwise)?
(the table should contain empty cells)
You could pass a dummy separator, and then use str.split (by ",") with expand=True:
import pandas as pd

df = pd.read_csv('path/to/file.csv', sep=" ", header=None)  # one column of raw lines
df = df[0].str.split(",", expand=True).fillna("")
print(df)
Output
      0     1     2     3
0  5783  145v
1  g656  4589  3243  tt56
2  6579
I think the solution proposed by @researchnewbie is good. If you need to replace the NaN values with, say, zero, you could add this line after the read:
dataFrame.fillna(0, inplace=True)
Try doing the following. Note that a plain pd.read_csv(filename) would fail here with a tokenizing error, because later rows have more fields than the first; naming enough columns up front lets pandas pad the short rows instead:
import pandas as pd

dataFrame = pd.read_csv(filename, header=None, names=range(4))
Your empty cells should then contain the NaN value, which is essentially null.

How to add minutes to datetime64 object in a Pandas Dataframe [closed]

I want to add a list of minutes to a datetime64 column and store the result in a new df column.
I tried using datetime.timedelta(minutes=x) in a for loop. But as a result, it is adding a constant value to all of my rows. How do I resolve this?
for x in wait_min:
    data['New_datetime'] = data['Date'] + datetime.timedelta(minutes=x)
I expect to iterate through the list and add corresponding minutes, but this is adding a constant value of 16 minutes to each row.
Your loop assigns to the whole column on every pass, so only the last value of wait_min sticks. Let us try a vectorized version instead:
data['New_datetime'] = data['Date'] + pd.to_timedelta(wait_min, unit='m')
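A minimal sketch with made-up values, assuming wait_min lines up row-for-row with the dataframe:
import pandas as pd

data = pd.DataFrame({'Date': pd.to_datetime(['2021-01-01 10:00',
                                             '2021-01-01 11:00'])})
wait_min = [15, 30]
data['New_datetime'] = data['Date'] + pd.to_timedelta(wait_min, unit='m')
print(data)  # each row gets its own offset: 15 min, then 30 min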
pandas adds two Series element-wise, aligning them on the index. All you need to do is create a Series of timedelta objects.
So if wait_min is a list of minutes whose length equals the number of rows in your dataframe (and the dataframe has a default integer index), this will do:
data['New_datetime'] = data['Date'] + pd.Series([datetime.timedelta(minutes=x) for x in wait_min])
The following changes worked for me:
for i, x in enumerate(wait_min):
    data['New_Datetime'].iloc[i] = data['Date'].iloc[i] + datetime.timedelta(minutes=x)
This might not be the best solution, but it works for what I was trying to do.

Importing CSV in Python and manipulating the data [closed]

I'm new to programming and I need to do some (maybe very basic) stuff, but I'm really struggling with it.
I have some CSV files; when opened in Excel, each consists of roughly 1500 rows and 500 columns, and it is all numbers except for the first element of the first row (some kind of header). I need to do things like averaging over the elements of the first 60 rows, and adding and subtracting complete rows.
I'm having a bit of trouble with importing the files. When I just use the csv reader and add the rows to an empty dataset row by row, I get the desired format (a list of rows?), but all the elements are strings instead of floats (maybe because the first element in the file is a string?), and I can't get them to convert to floats, so maybe you can help me out a little.
Another thing: how do I actually manipulate a certain part of the data, like a loop going through a certain number of rows? I can't really figure it out, since mathematical operations on strings don't work.
Thanks in advance for your help and comments!
I use the following and it works fine:
import numpy
csv = numpy.loadtxt('something.csv', delimiter=',')
Since your first row contains a string header, loadtxt will choke on it, so skip it like this:
csv = numpy.loadtxt('something.csv', delimiter=',', skiprows=1)
And if you want to operate on the first 60 rows:
X = csv[:60, :]
Then you just use X for what you want.
Hope it helps
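As a follow-up to the numpy snippet above, the column-wise average of those first 60 rows is then a single call:
averages = X.mean(axis=0)  # per-column mean over the first 60 rows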
What you need is read_csv from the pandas library.
The following code will automatically recognize your header row and use it for the column names:
import pandas as pd

data = pd.read_csv('Your file name.csv')
Regarding your problem with the string format of the data, there is no way to help you without some sample data.
I need to do stuff like averaging over the elements of the first 60 rows and adding and subtracting complete rows.
For averaging the first 60 rows, you can do something like this:
import pandas as pd
lst1 = range(100)
lst2 = range(100,200)
lst3 = range(200,300)
data = pd.DataFrame({'a': lst1,'b': lst2,'c': lst3})
data_avrg = data[:60].mean()
In [20]: data_avrg
Out[20]:
a     29.5
b    129.5
c    229.5
dtype: float64
If you want to add or subtract the average of the first 60 rows to or from complete rows, like all rows in column a, you can do this:
data['a_add'] = data.a + data_avrg.a
data['a_subtract'] = data.a - data_avrg.a
The fact that the first cell is a string doesn't mean the whole column is of string type; that may just be the label of that column. Try accessing the data from the second row onwards, or explicitly name the columns.
For example:
df = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})
print(df)
output
   $a  $b
0   1  10
1   2  20
You can change the names of the columns with
df.columns = ['a', 'b']
output
   a   b
0  1  10
1  2  20
and after changing the names you can access the columns as df['a'] or df['b'].
