I have this massive dataset and I need to subset the data by using criteria. This is for illustration:
| Group | Name | Value |
|--------------------|-------------|-----------------|
| A | Bill| 256 |
| A | Jack| 268 |
| A | Melissa| 489 |
| B | Amanda | 787 |
| B | Eric| 485 |
| C | Matt| 1236 |
| C | Lisa| 1485 |
| D | Ben | 785 |
| D | Andrew| 985 |
| D | Cathy| 1025 |
| D | Suzanne| 1256 |
| D | Jim| 1520 |
I know how to handle this problem manually, such as:
import pandas as pd
df=pd.read_csv('Test.csv')
A=df[df.Group =="A "].to_numpy()
B=df[df.Group =="B "].to_numpy()
C=df[df.Group =="C "].to_numpy()
D=df[df.Group =="D "].to_numpy()
But considering the size of the data, it will take a lot of time if I handle it in this way.
With that in mind, I would like to know if it is possible to build an iteration with an IF statement that can look at the values in column “Group”(table above) . I was thinking, IF statement to see if the first value is the same with one the below if so, group them and create a new array/ dataframe.
I have a data frame of rows of more than 1,000,000 and 15 columns.
I have to make new columns and assign the value to the columns w.r.t the other string values in the other columns via matching them either with regex or exact character match.
For example, if a column called FIle path is there. I have to make a column as a feature that will be assigned values with the input of the folder path (Full | partial) and match it with the file path and update the feature column.
I thought about using the iteration with for loop but it is so much time taking and while using pandas for this I think iterating would consume more time if looping components increase in the future.
Is there an efficient way for the pandas to do this type of operation
Please help me with this.
Example:
I have a df as:
| ID | File |
| -------- | -------------- |
| 1 | SWE_Toot |
| 2 | SWE_Thun |
| 3 | IDH_Toet |
| 4 | SDF_Then |
| 5 | SWE_Toot |
| 6 | SWE_Thun |
| 7 | SEH_Toot |
| 8 | SFD_Thun |
I will get components in other tables as
| ID | File |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
| |*/IDH_Toet/*.c |
| |*/SFD_Toto/*.c |
second as:
| ID | File |
| -------- | -------------- |
| Wire | */SDF_Then/*.h |
| |*/SFD_Thun/*.c |
| |*/SFD_Toto/*.c |
etc., will me around like 1000000 files and 278 components are received
I want as
| ID | File |Component|
| -------- | -------------- |---------|
| 1 | SWE_Toot |Software |
| 2 | SWE_Thun |Other |
| 3 | IDH_Toet |Software |
| 4 | SDF_Then |Wire |
| 5 | SWE_Toto |Various |
| 6 | SWE_Thun |Other |
| 7 | SEH_Toto |Various |
| 8 | SFD_Thun |Wire |
Other - will be filled at last once all the fields and regex are checked and do not belong to any component.
Various - It may belong to more than one (or) we can give a list of components it belong to.
I was able to read the components tables and create a regex and if I want to create the component column then I have to write for loops for all the 278 columns and I have to loop the same table with the component.
Is there a way to do this with the pandas easier
Because the date will be very large
inside my application i have a dataframe that looks similiar to this:
Example:
id | address | code_a | code_b | code_c | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | 1345wqdwqe | ....
2 | parkdrive 1 | 012ah8 | 012ah8a | dwqd4646 | ....
3 | parkdrive 2 | 852fhz | 852fhza | fewf6465 | ....
4 | parkdrive 3 | 456se1 | 456se1a | 856fewf13 | ....
5 | parkdrive 3 | 456se1 | 456se1a | gth8596s | ....
6 | parkdrive 3 | 456se1 | 456se1a | a48qsgg | ....
7 | parkdrive 4 | tg8596 | tg8596a | 134568a | ....
As you may see, every address can contain multiple entrys inside my dataframe, the code_a and code_b are following a certain pattern and only code_c is unqiue.
What I'm trying to obtain is a dataframe where the column code_c is ignored, dropped or whatever and the whole dataframe is reduced to only one entry for each address...something like this:
id | address | code_a | code_b | more columns
1 | parkdrive 1 | 012ah8 | 012ah8a | ...
3 | parkdrive 2 | 852fhz | 852fhza | ...
4 | parkdrive 3 | 456se1 | 456se1a | ...
7 | parkdrive 4 | tg8596 | tg8596a | ...
I tried the groupby-function, but this doesn't seemed to work - or is this even the right function?
Thanks for your help and good day to all of you!
You can drop_duplicates to do this
df.drop_duplicates(subset=[‘address’], inplace=True)
This will keep only a single entry per address
I think what you are looking for is
# in this way you are looking for all the duplicates rows in all columns except for 'code_c'
df.drop_duplicates(subset=df.columns.difference(['code_c']))
# in this way you are looking for all the duplicates rows ONLY based on column 'address'
df.drop_duplicates(subset='address')
I notice in your example data, if you drop columnC then all the entries with address "parkdrive 1" for example, are just duplicates.
you should drop the column c:
df.drop('code_c',axis=1,inplace=True)
Then you can drop the duplicates:
df_clean = df.drop_duplicates()
What is the best way to compare 2 dataframes w/ the same column names, row by row, if a cell is different have the Before & After value and which cellis different in that dataframe.
I know this question has been asked a lot, but none of the applications fit my use case. Speed is important. There is a package called datacompy but it is not good if I have to compare 5000 dataframes in a loop (i'm only comparing 2 at a time, but around 10,000 total, and 5000 times).
I don't want to join the dataframes on a column. I want to compare them row by row. Row 1 with row 1. Etc. If a column in row 1 is different, I only need to know the column name, the before, and the after. Perhaps if it is numeric I could also add a column w/ the abs val. of the dif.
The problem is, there is sometimes an edge case where rows are out of order (only by 1 entry), and don’t want these to come up as false positives.
Example:
These dataframes would be created when I pass in race # (there are 5,000 race numbers)
df1
+-----+-------+--+------+--+----------+----------+-------------+--+
| Id | Speed | | Name | | Distance | | Location | |
+-----+-------+--+------+--+----------+----------+-------------+--+
| 181 | 10.3 | | Joe | | 2 | | New York | |
| 192 | 9.1 | | Rob | | 1 | | Chicago | |
| 910 | 1.0 | | Fred | | 5 | | Los Angeles | |
| 97 | 1.8 | | Bob | | 8 | | New York | |
| 88 | 1.2 | | Ken | | 7 | | Miami | |
| 99 | 1.1 | | Mark | | 6 | | Austin | |
+-----+-------+--+------+--+----------+----------+-------------+--+
df2:
+-----+-------+--+------+--+----------+----------+-------------+--+
| Id | Speed | | Name | | Distance | | | Location |
+-----+-------+--+------+--+----------+----------+-------------+--+
| 181 | 10.3 | | Joe | | 2 | | New York | |
| 192 | 9.4 | | Rob | | 1 | | Chicago | |
| 910 | 1.0 | | Fred | | 5 | | Los Angeles | |
| 97 | 1.5 | | Bob | | 8 | | New York | |
| 99 | 1.1 | | Mark | | 6 | | Austin | |
| 88 | 1.2 | | Ken | | 7 | | Miami | |
+-----+-------+--+------+--+----------+----------+-------------+--+
diff:
+-------+----------+--------+-------+
| Race# | Diff_col | Before | After |
+-------+----------+--------+-------+
| 123 | Speed | 9.1 | 9.4 |
| 123 | Speed | 1.8 | 1.5 |
An example of a false positive is with the last 2 rows, Ken + Mark.
I could summarize the differences in one line per race, but if the dataframe has 3000 records and there are 1,000 differences (unlikely, but possible) than I will have tons of columns. I figured this was was easier as I could export to excel and then sort by race #, see all the differences, or by diff_col, see which columns are different.
def DiffCol2(df1, df2, race_num):
is_diff = False
diff_cols_list = []
row_coords, col_coords = np.where(df1 != df2)
diffDf = []
alldiffDf = []
for y in set(col_coords):
col_df1 = df1.iloc[:,y].name
col_df2 = df2.iloc[:,y].name
for index, row in df1.iterrows():
if df1.loc[index, col_df1] != df2.loc[index, col_df2]:
col_name = col_df1
if col_df1 != col_df2: col_name = (col_df1, col_df2)
diffDf.append({‘Race #’: race_num,'Column Name': col_name, 'Before: df2.loc[index, col_df2], ‘After’: df1.loc[index, col_df1]})
try:
check_edge_case = df1.loc[index, col_df1] == df2.loc[index+1, col_df1]
except:
check_edge_case = False
try:
check_edge_case_two = df1.loc[index, col_df1] == df2.loc[index-1, col_df1]
except:
check_edge_case_two = False
if not (check_edge_case or check_edge_case_two):
col_name = col_df1
if col_df1 != col_df2:
col_name = (col_df1, col_df2) #if for some reason column name isn’t the same, which should never happen but in case, I want to know both col names
is_diff = True
diffDf.append({‘Race #’: race_num,'Column Name': col_name, 'Before: df2.loc[index, col_df2], ‘After’: df1.loc[index, col_df1]})
return diffDf, alldiffDf, is_diff
[apologies in advance for weirdly formatted tables, i did my best given how annoying pasting tables into s/o is]
The code below works if dataframes have the same number and names of columns and the same number of rows, so comparing only values in the tables
Not sure where you want to get Race# from
df1 = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
df2 = df1.copy(deep=True)
df2['B'][5] = 100 # Creating difference
df2['C'][6] = 100 # Creating difference
dif=[]
for col in df1.columns:
for bef, aft in zip(df1[col], df2[col]):
if bef!=aft:
dif.append([col, bef, aft])
print(dif)
Results below
Alternative solution without loops
df = df1.melt()
df.columns=['Column', 'Before']
df.insert(2, 'After', df2.melt().value)
df[df.Before!=df.After]
Im trying to use the streaming insert_all method to insert data to a table using the google-api-client gem in ruby.
So I start with creating a new table in Bigquery (read and write priveleges are correct)
with the following contents:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1 | 1 | ABCD |
| 2 | 2 | EFGH |
| 3 | 3 | IJKL |
+-----+-----------+-------------+
This is my code in ruby: (I discovered earlier today that tabledata.insert_all is ruby for tabledata.insertAll - google docs / example need to be updated)
def streaming_insert_data_in_table(table, dataset=DATASET)
body = {"rows"=>[
{"json"=> {"person_id"=>10,"person_name"=>"george"}},
{"json"=> {"person_id"=>11,"person_name"=>"washington"}}
]}
result = #client.execute(
:api_method=> #bigquery.tabledata.insert_all,
:parameters=> {
:projectId=> #project_id.to_s,
:datasetId=> dataset,
:tableId=>table},
:body_object=>body,
)
puts result.body
end
So I run my code the first time and all appears fine. I see this in the table on Bigquery:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1 | 1 | ABCD |
| 2 | 2 | EFGH |
| 3 | 3 | IJKL |
| 4 | 10 | george |
| 5 | 11 | washington |
+-----+-----------+-------------+
Then I change the data in the method to:
body = {"rows"=>[
{"json"=> {"person_id"=>5,"person_name"=>"john"}},
{"json"=> {"person_id"=>6,"person_name"=>"kennedy"}}
]}
Run the method and get this in Bigquery:
+-----+-----------+-------------+
| Row | person_id | person_name |
+-----+-----------+-------------+
| 1 | 1 | ABCD |
| 2 | 2 | EFGH |
| 3 | 3 | IJKL |
| 4 | 10 | george |
| 5 | 6 | kennedy |
+-----+-----------+-------------+
So, what gives? I've lost data.... (ids 11 and id 5 have vanished) The responses for the request do not have errors either.
Could someone tell me if Im doing something incorrectly or why this is happening please?
Any help is much appreciated.
Thanks and have a great day.
Discovered this appears something to do with the ui (row count doesn't populate for a while and trying to extract the data in the table results in an error "Unexpected. Please try again."). However data is actually stored and can be queried. Thanks for the help Jordan