How to read a group of elements in pandas? - python

I am trying to write code that splits all my data into groups, each stored as its own DataFrame.
For example, I have 4000 rows in my data, and I want to read the first 100 rows and create a DataFrame from them, then read the next 100 and create another DataFrame, and so on until the end of the file.
I started with the code below, but it only collects all the data into a single list:
Row_list = []
for index, rows in df_.iterrows():
    # Take the Temperatura value of the current row
    my_list = rows.Temperatura
    # Append the value to the final list
    Row_list.append(my_list)

You can use numpy's array_split:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(list(range(1000)))
In [4]: np.array_split(df, len(df) // 100)
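If you would rather stay within pandas, plain positional slicing with iloc gives the same kind of grouping. The following is a minimal sketch, assuming a made-up 4000-row frame with a single Temperatura column; the names chunk_size and chunks are only illustrative.

```python
import pandas as pd

# Illustrative data: 4000 rows with a single "Temperatura" column.
df = pd.DataFrame({"Temperatura": range(4000)})

chunk_size = 100  # rows per group (assumed value from the question)

# Slice the frame in steps of chunk_size. The last chunk is simply shorter
# when the row count is not a multiple of chunk_size.
chunks = [
    df.iloc[start:start + chunk_size]
    for start in range(0, len(df), chunk_size)
]

print(len(chunks))       # number of DataFrames produced
print(chunks[1].head())  # first rows of the second group
```

Each element of chunks is itself a DataFrame that keeps the original index labels, so you can loop over the list and process one group at a time.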

Related

Python how to filter a csv based on a column value and get the row count

I want to do some data inspection and print the count of rows that match a certain value in one of the columns. Below is my code:
import numpy as np
import pandas as pd
data = pd.read_csv("census.csv")
census.csv has a column "income" with 3 values: '<=50K', '=50K' and '>50K'.
I want to print the number of rows that have the income value '<=50K'.
I was trying the following:
count = data['income']='<=50K'
That does not work though.
Sum a Boolean selection:
(data['income'].eq('<=50K')).sum()
The key is to learn how to filter pandas rows.
Quick answer:
import pandas as pd
data = pd.read_csv("census.csv")
df2 = data[data['income']=='<=50K']
print(df2)
print(len(df2))
Slightly longer answer:
import pandas as pd
data = pd.read_csv("census.csv")
mask = data['income']=='<=50K'
print(mask) # notice the boolean Series produced by the filter criterion
df2 = data[mask] # next we use that boolean Series to filter data
print(df2)
print(len(df2))
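To see both approaches side by side without the real file, here is a small sketch. The five-row frame is a made-up stand-in for census.csv, and value_counts is shown as one extra way to get the count for every income value at once.

```python
import pandas as pd

# Hypothetical stand-in for census.csv with only the "income" column.
data = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]})

# Boolean Series: True where the row matches the value.
mask = data["income"] == "<=50K"

# True counts as 1, so summing the mask gives the number of matching rows.
count = int(mask.sum())

# Counts for every distinct value in the column.
per_value = data["income"].value_counts()

print(count)
print(per_value)
```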

Extracting rows with a specific column value using pandas, no headers on the columns

So, I run this code:
import pandas as pd
df = pd.read_csv(filename, delim_whitespace=True, header=None)
My file is huge so I have isolated the first 14 rows and all 10 columns for clarity.
X = df.iloc[0:14, 0:10].values
When printed, X comes out like this (as said in the title, no column headers):
So far, so good.
Now, I want to isolate the rows which have a 'CYT' string in the 9th column.
Assuming 0th index, I want to isolate rows 5, 9 and 12. Next, I want to put these three rows into a matrix for later use. How do I do that?
I am very new to python so I will welcome any guidance.
Thanks!
Shreeman
[EDIT]
Correct Code [THANKS PAVEL!]:
import pandas as pd
df = pd.read_csv(filename, delim_whitespace=True, header=None)
X_CYT = df.loc[df.iloc[:, 9] == 'CYT']
X_CYT = X_CYT.values # This converts it to a numpy array
You shouldn't try to isolate those rows by their index positions; use a condition on the DataFrame instead (X is already a numpy array, so it has no .loc):
X_CYT = df.loc[df.iloc[:, 9] == 'CYT']
X_CYT = X_CYT.values # This converts it to a numpy array
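A self-contained sketch of the same idea follows, assuming a made-up four-row sample in place of the real file. With header=None the columns are labelled 0 to 9, so the last column can be selected as df[9]. The separator is written as sep=r"\s+", which splits on whitespace.

```python
import io

import pandas as pd

# Made-up sample: 10 whitespace-separated columns, no header row.
raw = io.StringIO(
    "A1 0.58 0.61 0.47 0.13 0.50 0.00 0.48 0.22 MIT\n"
    "A2 0.43 0.67 0.48 0.27 0.50 0.00 0.53 0.22 CYT\n"
    "A3 0.64 0.62 0.49 0.15 0.50 0.00 0.53 0.22 NUC\n"
    "A4 0.58 0.44 0.57 0.13 0.50 0.00 0.54 0.22 CYT\n"
)
df = pd.read_csv(raw, sep=r"\s+", header=None)

# Boolean mask over the column labelled 9 (the last of the ten columns).
is_cyt = df[9] == "CYT"

# Matching rows as a numpy array, plus the row labels they came from.
X_CYT = df.loc[is_cyt].to_numpy()
matching_rows = df.index[is_cyt].tolist()

print(matching_rows)
print(X_CYT)
```

Because the rows mix strings and numbers, the resulting array has object dtype; select only the numeric columns first if you need a float matrix.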

Is there any function to assign values in a Pandas Dataframe

I am trying to assign values to some rows using pandas dataframe. Is there any function to do this?
For a whole column:
df = df.assign(column=value)
... where column is the name of the column.
For a specific column of a specific row:
df.at[row, column] = value
... where row is the index of the row, and column is the name of the column.
The latter changes the dataframe "in place".
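Since the question asks about assigning values to some rows, df.loc with a boolean condition is also worth knowing. This is a minimal sketch on a made-up frame; the column names A and B are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

# Set column "B" to 0 for every row where "A" is greater than 2.
df.loc[df["A"] > 2, "B"] = 0

# Set several columns of a single row (index label 0) at once.
df.loc[0, ["A", "B"]] = [100, 200]

print(df)
```

Like df.at, this modifies the dataframe in place.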
There is a good tutorial here.
Basically, try this:
import pandas as pd
import numpy as np
# Creating a dataframe
# Setting the seed value to re-generate the result.
np.random.seed(25)
df = pd.DataFrame(np.random.rand(10, 3), columns =['A', 'B', 'C'])
# np.random.rand(10, 3) has generated a
# random 2-Dimensional array of shape 10 * 3
# which is then converted to a dataframe
df
You will get something like this:

Adding a new column to a df each cycle of a for loop

I am doing some modifications to a dataframe with a for loop. I am adding a new column every cycle of the for loop, however, I also drop this column at the end of the cycle. I would like to know if it is possible to store the values of this column per cycle, and create a new dataframe that is made of each of these columns that were generated per cycle. I am using the following code:
import numpy as np
import pandas as pd
newdf = np.zeros([1000,5])
df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000,1000)))
for i in range(0, 10):
    df['sum']= df.iloc[:, -1000:].sum(axis=1)
    newdf[:,i] = df['sum']
    df = df.drop('sum', axis=1)
However, I get the following error:
index 5 is out of bounds for axis 1 with size 5
Thanks
The issue occurs not because of anything that has to do with df, but because when i = 5, newdf[:, i] refers to the sixth column of a NumPy array containing only five columns. If, instead, you initialize newdf through newdf = np.zeros([1000, 10]), or loop only over range(5), then your code runs without errors.
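Another option that avoids sizing a NumPy array in advance is to collect each cycle's column in a dict and build the new dataframe once at the end. This is only a sketch under the question's setup; the names n_cycles and the sum_0, sum_1, ... column labels are made up for illustration.

```python
import numpy as np
import pandas as pd

n_cycles = 10  # assumed number of loop cycles, as in the question
df = pd.DataFrame(np.random.choice([0.0, 0.05], size=(1000, 1000)))

columns = {}
for i in range(n_cycles):
    # ... any per-cycle modifications to df would go here ...
    # Store the row sums for this cycle without adding a column to df.
    columns[f"sum_{i}"] = df.sum(axis=1)

# One column per cycle, sharing df's row index.
newdf = pd.DataFrame(columns)

print(newdf.shape)
```

Since the sums are never added to df, there is no temporary column to drop at the end of each cycle.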

Converting numpy array into dataframe column?

How do I convert a numpy array into a dataframe column? Let's say I have created an empty dataframe, df, and I loop through code to create 5 numpy arrays. On each iteration of my for loop, I want to convert the numpy array I have created in that iteration into a column in my dataframe. Just to clarify, I do not want to create a new dataframe on every iteration of my loop; I only want to add a column to the existing one. The code below is sketchy and not syntactically correct, but it illustrates my point.
df = pd.DataFrame()
for i in range(5):
    arr = create_numpy_arr(blah) # creates a numpy array
    df[i] = # convert arr to df column
This is the simplest way:
df['column_name']=pd.Series(arr)
Since you want to create a column and not an entire DataFrame from your array, you could do
import pandas as pd
import numpy as np
column_series = pd.Series(np.array([0, 1, 2, 3]))
To assign that column to an existing DataFrame:
df = df.assign(column_name=column_series)
The above will add a column named column_name into df.
If, instead, you don't have any DataFrame to assign those values to, you can pass a dict to the constructor to create a named column from your numpy array:
df = pd.DataFrame({ 'column_name': np.array([0, 1, 2, 3]) })
This will work:
import pandas as pd
import numpy as np
df = pd.DataFrame()
for i in range(5):
    arr = np.random.rand(10)
    df[i] = arr
A simpler way may be to generate the whole array at once and pass it to the constructor:
arr = np.random.rand(10, 5)
df = pd.DataFrame(arr)
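If the arrays really do have to be produced one at a time, you can still collect them in a list and build the dataframe once at the end. In this sketch create_numpy_arr is a hypothetical stand-in for the function in the question, and the col_0 to col_4 names are made up.

```python
import numpy as np
import pandas as pd


def create_numpy_arr(seed):
    """Hypothetical stand-in for the asker's array-producing function."""
    rng = np.random.default_rng(seed)
    return rng.random(10)


# Collect one equal-length array per iteration.
arrays = [create_numpy_arr(i) for i in range(5)]

# Stack the 1-D arrays side by side so each becomes one column.
df = pd.DataFrame(
    np.column_stack(arrays),
    columns=[f"col_{i}" for i in range(5)],
)

print(df.shape)
```

This assumes all the arrays have the same length, which np.column_stack requires.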
