Find unique elements of column with chunksize pandas - python

Given a sample(!) data frame:
test =
time  clock
   1      1
   1      1
   2      2
   2      2
   3      3
   3      3
I was trying to do some operations with pandas chunksize:
for df in pd.read_csv("...path...", chunksize=10):
    time_spam = df.time.unique()
    detector_list = df.clock.unique()
But this only operates within each chunk: with chunksize = 10 it only ever sees 10 rows at a time, so the unique values are computed per chunk rather than over the whole file.
P.S. It is sample data

Please try:
for df in pd.read_csv("...path...", chunksize=10, iterator=True):
    time_spam = df.time.unique()
    detector_list = df.clock.unique()
You need to use the iterator flag as described here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-chunking
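For reference, the linked docs describe both options: chunksize on its own already returns an iterable reader, while iterator=True also lets you pull chunks manually with get_chunk. A small sketch of the latter (the path is a placeholder):
reader = pd.read_csv("...path...", iterator=True)
first_chunk = reader.get_chunk(10)  # read the next 10 rows as a DataFrame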

Here's how you can build lists of unique elements while parsing the chunks:
# Initialize lists
time_spam = []
detector_list = []

# Cycle over each chunk
for df in pd.read_csv("...path...", chunksize=10):
    # Add elements only if they are not already in the list
    time_spam += [t for t in df['time'].unique() if t not in time_spam]
    detector_list += [c for c in df['clock'].unique() if c not in detector_list]
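If the file is large, a set may be preferable to the list-membership checks above, since x not in list is a linear scan. A minimal sketch along the same lines (the path is a placeholder):
import pandas as pd

time_spam = set()
detector_list = set()

for df in pd.read_csv("...path...", chunksize=10):
    # set.update only adds values that are not already present
    time_spam.update(df['time'].unique())
    detector_list.update(df['clock'].unique())

# Convert back to lists if list output is needed
time_spam = list(time_spam)
detector_list = list(detector_list)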

File test.csv:
col1,col2
1,1
1,2
1,3
1,4
2,1
2,2
2,3
2,4
Code:
col1, col2 = [], []
for df in pd.read_csv('test.csv', chunksize=3):
    col1.append(df.col1)
    col2.append(df.col2)
Results:
print(pd.concat(col1).unique())
[1 2]
print(pd.concat(col2).unique())
[1 2 3 4]

Related

How to count the instances of unique values in Python Dataframe

I have a dataframe like the one below with 2 million rows. The sample data can be found here.
The list of matches in every row can contain any number between 1 and 761. I want to count the occurrences of every number between 1 and 761 across the whole matches column. For example, for the above data the result would be:
If a particular id is not found, then its count should be 0 in the output. I tried a for-loop approach, but it is quite slow.
def readData():
    df = pd.read_excel(file_path)
    pattern_match_count = [0] * 761
    for index, row in df.iterrows():
        matches = row["matches"]
        for pattern_id in range(1, 762):
            if pattern_id in matches:
                pattern_match_count[pattern_id - 1] += 1
Is there any better approach with pandas to make the implementation faster?
You can use the .explode() method to "explode" the lists into new rows.
def readData():
    df = pd.read_excel(file_path)
    return df.loc[:, "matches"].explode().value_counts()
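As a self-contained sketch (made-up data, and assuming the column of lists is called matches as in the question), reindex can also fill in zeros for the ids that never appear:
import pandas as pd

df = pd.DataFrame({"matches": [[1, 2, 3], [1, 3, 3, 4]]})

counts = (
    df["matches"]
    .explode()                              # one row per list element
    .value_counts()                         # count each id
    .reindex(range(1, 762), fill_value=0)   # ids 1..761, zero if absent
)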
You can use collections.Counter:
df = pd.DataFrame({"matches": [[1,2,3],[1,3,3,4]]})
# df:
#         matches
# 0     [1, 2, 3]
# 1  [1, 3, 3, 4]
from collections import Counter
C = Counter([i for sl in df.matches for i in sl])
# C:
# Counter({1: 2, 2: 1, 3: 3, 4: 1})
pd.DataFrame(C.items(), columns=["match_id", "counts"])
#    match_id  counts
# 0         1       2
# 1         2       1
# 2         3       3
# 3         4       1
If you want zeros for match_ids that aren't in any of the matches, then you can update the Counter object C:
for i in range(1, 762):
    if i not in C:
        C[i] = 0

pd.DataFrame(C.items(), columns=["match_id", "counts"])
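As a variation on the same idea (just a sketch), you can build a Series from the Counter and reindex it, which fills the missing match_ids with zero without the explicit loop:
import pandas as pd

counts = pd.Series(C).reindex(range(1, 762), fill_value=0)   # C is the Counter built above
result = counts.rename_axis("match_id").reset_index(name="counts")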

How to split/group a list of dataframes by the length of each data frame

For example, I have a list of 100 data frames; some have a column length of 8, others 10, others 12. I want to be able to split these into groups based on their column length. I have tried dictionaries but couldn't get them to append properly in a loop.
Previously tried code:
col_count = [8, 10, 12]
d = dict.fromkeys(col_count, [])

for df in df_lst:
    for i in col_count:
        if i == len(df.columns):
            d[i] = df
but this just seems to replace the values in the dict each time. I have tried .append also, but that seems to append to all keys.
Instead of assigning a df to d[column_count], you should append it.
You initialized d with d = dict.fromkeys(col_count, []), so d is a dictionary of empty lists.
When you do d[i] = df you replace the empty list by a DataFrame, so d becomes a dictionary of DataFrames. If you do d[i].append(df) you will have a dictionary of lists of DataFrames, which is what you want AFAIU. One caveat: dict.fromkeys(col_count, []) reuses the same list object for every key, which is why .append seemed to add to all keys; create a separate list per key instead, e.g. d = {k: [] for k in col_count}.
Also, I'm not sure that you need the col_count variable. You could just do d[len(df.columns)].append(df).
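A minimal sketch of that approach (df_lst stands in for your list of DataFrames), using a defaultdict so each key gets its own independent list:
from collections import defaultdict

d = defaultdict(list)   # one independent list per column count

for df in df_lst:
    d[len(df.columns)].append(df)

# d[8], d[10] and d[12] now hold the DataFrames with 8, 10 and 12 columns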
I think this should suffice for you. Think of how to dynamically solve your problems to make better use of Python.
In [2]: import pandas as pd
In [3]: for i in range(1, 6):
   ...:     exec(f"df{i} = pd.DataFrame(0, index=range({i}), columns=list('ABCD'))")  # making my own testing list of dataframes with variable length
   ...:
In [4]: df1  # one-row df
Out[4]:
   A  B  C  D
0  0  0  0  0

In [5]: df2  # two-row df
Out[5]:
   A  B  C  D
0  0  0  0  0
1  0  0  0  0

In [6]: df3  # three-row df
Out[6]:
   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
In [7]: L = [df1, df2, df3, df4, df5]  # I assume all your dataframes are already in some container like this, which is the problem

In [13]: my_3_length_shape_dfs = []  # you need some sort of container per length (or an additional exec as above)

In [14]: for i in L:
    ...:     if i.shape[0] == 3:  # add more of these if needed; you mentioned your lengths are known [8, 10, 12]
    ...:         my_3_length_shape_dfs.append(i)  # add the df to the specified container, grouping any dfs whose row length equals 3
    ...:         print(i)
    ...:
   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0

populating empty dataframe from a list in python

I need to populate a dataframe from the list below.
lst = [1, "name1", 10, 2, "name2", 2, "name2", 20, 3]
df = pd.DataFrame(columns=['a','b','c'])

j = 0
for i in range(len(list(df.columns)) - 1):
    for t, v in enumerate(lst):
        col_index = j % 3
        df.iloc[i, col_index] = lst[t]
        j = j + 1
The above code is giving me an error.
I want df to be the following:
a      b      c
1  name1     10
2  name2     20
3    NaN    NaN
I have tried this, but it is giving me the following error:
IndexError: single positional indexer is out-of-bounds
Create a list of dictionaries: [{key: value, key: value}, {key: value, key: value}, {key: value, key: value}].
Add this straight to a dataframe. You can also control what is added this way by making a function and passing data to it as each dictionary is built.
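A minimal sketch of that idea, assuming the flat list is meant to repeat in a, b, c order and actually reads 1, "name1", 10, 2, "name2", 20, 3 (matching the desired output, with the last group left incomplete):
import pandas as pd

lst = [1, "name1", 10, 2, "name2", 20, 3]
cols = ['a', 'b', 'c']

# Build one dictionary per group of three values; the last, shorter group
# simply omits 'b' and 'c', which then show up as NaN in the DataFrame.
records = [dict(zip(cols, lst[i:i + 3])) for i in range(0, len(lst), 3)]
df = pd.DataFrame(records, columns=cols)
#    a      b     c
# 0  1  name1  10.0
# 1  2  name2  20.0
# 2  3    NaN   NaN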
You can achieve this using itertools.cycle, if the values always arrive in the right order relative to the columns.
I assume that 3, "name3", 30 got mangled in your list, and that the list you actually want looks like this:
cols = ['a','b','c']
rows = [1, "name1", 10, 2,"name2", 20, 3, "name3", 30]
And using the power of itertools
https://docs.python.org/3/library/itertools.html#itertools.cycle
cycle('abc') --> a b c a b c a b c a b c ...
I think this code can help you.
import itertools
import pandas as pd

def parse_data(data):
    if data:
        pass  # do something
    return data

cols = ['a','b','c']
rows = [1, "name1", 10, 2, "name2", 20, 3, "name3", 30]

d = []  # temp list that will hold the dictionaries of data for the dataframe
e = {}  # temp dict filled with cols/values for each cycle

for x, y in zip(itertools.cycle(cols), rows):  # cycle through the cols, but not the rows
    y = parse_data(y)  # do any filtering or removals here
    if x == cols[0]:   # the first col triggers the append and the reset of the dictionary
        e = {x: y}     # re-init the temp dictionary
        d.append(e)    # append it to the temp list
    else:
        e.update({x: y})  # add the other elements
    print(e)

print(d)
df = pd.DataFrame(d)  # create the dataframe
print(df)
"""
   a      b   c
0  1  name1  10
1  2  name2  20
2  3  name3  30
"""

For each item in list L, find all of the corresponding items in a dataframe

I'm looking for a fast solution to this Python problem:
For each item in list L, find all of the corresponding items in a dataframe column (df['col1']).
The catch is that both L and df['col1'] may contain duplicate values, and all duplicates should be returned.
For example:
L = [1,4,1]
d = {'col1': [1,2,3,4,1,4,4], 'col2': ['a','b','c','d','e','f','g']}
df = pd.DataFrame(data=d)
The desired output would be a new DataFrame where df['col1'] contains the values:
[1,1,1,1,4,4,4]
and rows are duplicated accordingly. Note that 1 appears 4 times (twice in L * twice in df)
I have found that the obvious solutions like .isin() don't work because they drop duplicates.
A list comprehension does work, but it is too slow for my real-life problem, where len(df) = 16 million and len(L) = 150,000:
idx = [y for x in L for y in df[df['col1'].values == x].index]
res = df.loc[idx].reset_index(drop=True)
This is basically just a problem of comparing two lists (with a bit of dataframe indexing difficulty tacked on), and a clever and very fast solution by Mad Physicist almost works for this, except that duplicates in L are dropped (it returns [1, 4, 1, 4, 4] in the example above; i.e., it finds the duplicates in df but ignores the duplicates in L).
train = np.array([...]) # my df['col1']
keep = np.array([...]) # my list L
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
I'd be grateful for any ideas.
Initial data:
L = [1,4,1]
df = pd.DataFrame({'col':[1,2,3,4,1,4,4] })
You can create a dataframe from L:
df2 = pd.DataFrame({'col': L})
and merge it with the initial dataframe:
result = df.merge(df2, how='inner', on='col')
print(result)
Result:
   col
0    1
1    1
2    1
3    1
4    4
5    4
6    4
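The same merge also works on the original two-column frame from the question, so the other columns come along for the ride; a sketch with the question's data:
import pandas as pd

L = [1, 4, 1]
df = pd.DataFrame({'col1': [1, 2, 3, 4, 1, 4, 4],
                   'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})

# Every occurrence in L matches every occurrence in df, so duplicates on
# both sides are kept: 1 appears 2 x 2 = 4 times, 4 appears 3 x 1 = 3 times.
result = df.merge(pd.DataFrame({'col1': L}), on='col1')
#    col1 col2
# 0     1    a
# 1     1    a
# 2     1    e
# 3     1    e
# 4     4    d
# 5     4    f
# 6     4    g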
IIUC try:
L = [1,4,1]
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0)
(Not sure how you want to handle the indexes; the above returns a slightly raw format.)
Output:
0 1
4 1
3 4
5 4
6 4
0 1
4 1
Name: col, dtype: int64
Reindexed:
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0).reset_index(drop=True)
#output:
0 1
1 1
2 4
3 4
4 4
5 1
6 1
Name: col, dtype: int64

Selecting rows from pandas DataFrame using a list

I have a list of lists as below
[[1, 2], [1, 3]]
The DataFrame is similar to
A B C
0 1 2 4
1 0 1 2
2 1 3 0
I would like to keep a row in the resulting DataFrame if the value in column A is equal to the first element of any of the nested lists and the value in column B of that same row is equal to the second element of that same nested list.
Thus the resulting DataFrame should be
A B C
0 1 2 4
2 1 3 0
The code below does what you need:
import pandas

tmp_filter = pandas.DataFrame(None)  # the dataframe you want

# Create your list and your dataframe
tmp_list = [[1, 2], [1, 3]]
tmp_df = pandas.DataFrame([[1,2,4],[0,1,2],[1,3,0]], columns=['A','B','C'])

# This function goes over the df column by column and
# only keeps the rows with the values you want
def pass_true_df(df, cond):
    for i, c in enumerate(cond):
        df = df[df.iloc[:, i] == c]
    return df

# Pass through your list and add the rows you want to keep
for i in tmp_list:
    tmp_filter = pandas.concat([tmp_filter, pass_true_df(tmp_df, i)])
import pandas

df = pandas.DataFrame([[1,2,4],[0,1,2],[1,3,0],[0,2,5],[1,4,0]],
                      columns=['A','B','C'])
filt = pandas.DataFrame([[1, 2], [1, 3], [0, 2]],
                        columns=['A','B'])

accum = []
# grouped to-filter
data_g = df.groupby('A')
for k2, v2 in data_g:
    accum.append(v2[v2.B.isin(filt.B[filt.A == k2])])

print(pandas.concat(accum))
result:
   A  B  C
3  0  2  5
0  1  2  4
2  1  3  0
(I made the data and filter a little more complicated as a test.)
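Another way to express the same pairing (not from the answers above, just a sketch): turn the list of lists into a small DataFrame and merge on both columns, which keeps exactly the rows whose (A, B) pair appears in the list (note that the merge resets the index):
import pandas

df = pandas.DataFrame([[1, 2, 4], [0, 1, 2], [1, 3, 0]], columns=['A', 'B', 'C'])
pairs = pandas.DataFrame([[1, 2], [1, 3]], columns=['A', 'B'])

# Inner merge on A and B keeps only the rows whose pair occurs in `pairs`
result = df.merge(pairs, on=['A', 'B'])
#    A  B  C
# 0  1  2  4
# 1  1  3  0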
