I have a dataframe and a pivot table, and I need to replace some values in the dataframe with values from the pivot table's columns.
Dataframe:
access_code ID cat1 cat2 cat3
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1 2
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 1 2
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 1 2
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 1
g0ifck11dix7avgu 0d254a81dca0ff716753b67a50c41fd7 1 2 3
Pivot Table:
type 1 2 \
access_code ID member_id
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1045794 1023 923 1 122
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 768656 203 243 1 169
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 604095 392 919 1 35
g06q0itlmkqmz5cv f4a3b3f2fca77c443cd4286a4c91eedc 1457307 243 1
g074qx58cmuc1a2f 13f2674f6d5abc888d416ea6049b57b9 5637836 1
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 5732738 111 2343 1
Desired output:
access_code ID cat1 cat2 cat3
g1gw8bzwelo83mhb 0433a3d29339a4b295b486e85874ec66 1023 923
g0dgzfg4wpo3jytg 04467d3ae60fed134077a26ae33e0eae 111 2343
g1gwui6r2ep471ht 06e3395c0b64a3168fbeab6a50cd8f18 392 919
g05ooypre5l87jkd 089c81ebeff5184e6563c90115186325 1
g0ifck11dix7avgu 0d254a81dca0ff716753b67a50c41fd7 1 2 3
If I use
df.ix[df.cat1 == 1] = pivot_table['1']
it returns this error: ValueError: cannot set using a list-like indexer with a different length than the value
As long as your dataframe is not exceedingly large, you can make this work in some fairly ugly ways. I am sure someone else will provide a more elegant solution, but in the meantime this duct tape might point you in the right direction.
Keep in mind that in this case I did this with two dataframes instead of one dataframe and one pivot table, as I already had enough trouble reconstructing the dataframes from the textual data.
Since there are empty fields in your data, and my dataframes did not like that, first convert the empty fields to zeros:
df = df.replace(r'\s+', 0, regex=True)
Now make sure the category columns are actually floats, or the comparisons will fail:
df[['cat1', 'cat2', 'cat3']] = df[['cat1', 'cat2', 'cat3']].astype(float)
And for the fizzly fireworks:
# where cat1 holds the placeholder 1, pull the value from pivot column '1' at the same index
df.cat1.loc[df.cat1 == 1] = piv['1'].loc[df.loc[df.cat1 == 1].index].dropna()
df.cat1 = df.cat1.fillna(1)  # rows without a match keep the placeholder
df.cat2.loc[df.cat2 == 2] = piv['2'].loc[df.loc[df.cat2 == 2].index].dropna()
df.cat2 = df.cat2.fillna(2)
df = df.replace(0, ' ')
The fillna is only there to recreate your intended output, in which you clearly have not processed some lines yet. I assume this column-by-column NaN-filling will not be needed in your actual use.
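For reference, a possibly less ugly route is to align on the shared keys instead of on positional index. This is only a sketch under assumptions not stated in the question: both frames can be keyed uniquely on access_code and ID, the pivot table is called piv, and its columns are the strings '1' and '2'.
# key both frames the same way, then overwrite the placeholder 1/2 values
# with the pivot's values wherever a matching key exists
lookup = piv.reset_index().set_index(['access_code', 'ID'])
keyed = df.set_index(['access_code', 'ID'])
keyed['cat1'] = keyed['cat1'].mask(keyed['cat1'] == 1, lookup['1'])
keyed['cat2'] = keyed['cat2'].mask(keyed['cat2'] == 2, lookup['2'])
# rows without a match become NaN here; fillna(1)/fillna(2) restores the placeholders as above
df = keyed.reset_index()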
Related
I ran into this specific problem where I have a dataframe of ID numbers, and some of these account numbers have dropped their leading zeros.
The dataframe is df:
ID
345
345
543
000922
000345
000345
000543
What I'm trying to do is create a generalized way to check whether we have dropped leading zeros. In my real data set there would be millions of rows, so I want to use a pandas method that, where a shortened ID matches an ID that still has its zeros, puts those rows into another dataframe so I can examine them further.
I do that like this:
new_df = df.loc[df['ID'].isin(df['ID'])]
My reasoning for this is that I want to filter that dataset to find if any of the IDs are inside the full IDs.
Now I have
ID
345
345
543
000345
000345
000543
I can use a .unique() to get a series of each unique combo.
ID
345
543
000345
000543
This is fine for a small dataset. But for rows of millions, I am wondering how I can make it easier to do this check.
I'm trying to find a way to create a dictionary where the keys are the 3-digit IDs and the values are their full IDs, or vice versa.
Any tips on that would be appreciated.
If anyone also has tips on a different way to check for dropped zeros, other than the dictionary approach, that would be helpful too.
Note: It is not always 3 digits. Could be 4567 for example, where the real value would be 004567.
One option is to strip leading "0"s:
out = df['ID'].str.lstrip('0').unique()
Output:
array(['345', '543', '922'], dtype=object)
or prepend "0"s:
out = df['ID'].str.zfill(df['ID'].str.len().max()).unique()
Output:
array(['000345', '000543', '000922'], dtype=object)
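If the dictionary mentioned in the question is still wanted, the two forms can simply be zipped together. A small sketch, assuming each stripped value corresponds to a single padded form:
stripped = df['ID'].str.lstrip('0')
padded = df['ID'].str.zfill(df['ID'].str.len().max())
mapping = dict(zip(stripped, padded))
# {'345': '000345', '543': '000543', '922': '000922'}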
Use:
print (df)
ID
0 345
1 345
2 540
3 2922
4 002922
5 000344
6 000345
7 000543
# filter IDs starting with 0 into a Series
d = df.loc[df['ID'].str.startswith('0'), 'ID']
# build the Series index by stripping the leading zeros
d.index = d.str.lstrip('0')
print (d)
ID
2922 002922
344 000344
345 000345
543 000543
Name: ID, dtype: object
# dict of all possible values
print (d.to_dict())
{'2922': '002922', '344': '000344', '345': '000345', '543': '000543'}
# keep only the indices that exist in the original ID column and build the dict
d = d[d.index.isin(df['ID'])].to_dict()
print (d)
{'2922': '002922', '345': '000345', '543': '000543'}
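One possible use of that dict afterwards, sketched here as an assumption about the end goal (the ID_fixed column name is made up):
# pad the short IDs back using the dict built above; unmatched IDs stay unchanged
df['ID_fixed'] = df['ID'].map(d).fillna(df['ID'])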
Create a dictionary for finding potentially affected records.
# Creates a dummy dataframe.
df = pd.DataFrame(['00456', '0000456', '567', '00567'], columns=['ID'])
df['stripped'] = pd.to_numeric(df['ID'])  # numeric value, i.e. leading zeros removed
df['affected_id'] = df.ID.str.len() == df.stripped.astype(str).str.len()  # True if the stored ID has no leading zeros
df
ID stripped affected_id
0 00456 456 False
1 0000456 456 False
2 567 567 True
3 00567 567 False
# Creates a dictionary of potentially affected records.
d = dict()
for i in df[df.affected_id == True].stripped.unique():
    d[i] = df[(df.stripped == i) & (df.ID != str(i))].ID.unique().tolist()
d
{567: ['00567']}
If you want to include the stripped records into the list, then:
for i in df[df.affected_id == True].stripped.unique():
    d[i] = df[df.stripped == i].ID.unique().tolist()
d
{567: ['567', '00567']}
You can convert the column to int and compare it back against the original strings:
m = df['ID'].ne(df['ID'].astype(int).astype(str))
print(m)
0 False
1 False
2 False
3 True
4 True
5 True
Name: ID, dtype: bool
print(df[m])
ID
3 000345
4 000345
5 000543
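One caveat: astype(int) raises if any ID is not purely numeric. A guarded variant, sketched under the assumption that such values may exist, coerces them instead:
# non-numeric IDs become NaN instead of raising an error
nums = pd.to_numeric(df['ID'], errors='coerce')
m = nums.notna() & df['ID'].ne(nums.astype('Int64').astype(str))
print(df[m])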
My CSV file looks like this:
,timestamp,side,size,price,tickDirection,grossValue,homeNotional,foreignNotional
0,1569974396.557895,1,11668,8319.5,1,140248813.0,11668,1.40248813
1,1569974394.78865,0,5000,8319.0,0,60103377.0,5000,0.60103377
2,1569974392.355395,0,564,8319.0,0,6779660.999999999,564,0.06779661
3,1569974383.797042,0,100,8319.0,0,1202067.0,100,0.01202067
4,1569974382.944569,0,3,8319.0,0,36062.0,3,0.00036062
5,1569974382.944569,0,7412,8319.0,-1,89097247.0,7412,0.89097247
There's a nameless index column. I want to remove this column.
When I read this in pandas, it just interprets it as an index and moves on.
The problem is, when you now use df[::-1], it flips the indexes as well. So df[::-1]['timestamp'][0] is the same as df['timestamp'][0] if the file was read with indexes, but not if it was read without.
How do I make it actually ignore the index column so that df[::-1] doesn't flip my indexes?
I tried usecols in read_csv, but it makes no difference: it reads the index as well as the specified columns. I also tried del df[''], but that fails because the index column is not interpreted as a column named '', even though that is what it is.
Just use index_col=0
df = pd.read_csv('data.csv', index_col=0)
print(df)
# Output
timestamp side size price tickDirection grossValue homeNotional foreignNotional
0 1.569974e+09 1 11668 8319.5 1 140248813.0 11668 1.402488
1 1.569974e+09 0 5000 8319.0 0 60103377.0 5000 0.601034
2 1.569974e+09 0 564 8319.0 0 6779661.0 564 0.067797
3 1.569974e+09 0 100 8319.0 0 1202067.0 100 0.012021
4 1.569974e+09 0 3 8319.0 0 36062.0 3 0.000361
5 1.569974e+09 0 7412 8319.0 -1 89097247.0 7412 0.890972
If I understand your issue correctly, you can just set timestamp as your index:
df = df.set_index('timestamp', drop=True)
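Either way, if the underlying goal is positional access after reversing, so that the old labels no longer matter, a small sketch:
rev = df[::-1].reset_index(drop=True)   # discard the old labels after reversing
first_ts = rev['timestamp'][0]          # row 0 is now the last row of the file
# or leave the index alone and index by position instead of by label
first_ts = df[::-1]['timestamp'].iloc[0]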
When I want to see the null values in a dataset as a df, here is what I get:
df.isnull().sum()
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
dtype: int64
Next, when I create a dataframe from this df, I get the output below.
df2 = pd.DataFrame(df.isnull().sum())
df2.set_index(0)
df2.index.name = None
0
BAD 0
LOAN 0
MORTDUE 518
VALUE 112
REASON 252
JOB 279
YOJ 515
DEROG 708
DELINQ 580
CLAGE 308
NINQ 510
CLNO 222
DEBTINC 1267
Why is that extra row appearing in the output, and how can I remove it? In a normal test with a df I was able to remove the extra row (using set_index(0) and df.index.name = None), but that does not work on the created dataframe df2.
As you may already know, that extra zero appearing as an "extra row" in your output is actually the header row with the column name. When you create the DataFrame, try passing a column name if you want something more descriptive than the default 0:
df2 = pd.DataFrame(df.isnull().sum(), columns=["Null_Counts"])
Same as the difference you would get from these two variants:
print(pd.DataFrame([0,1,2,3,4,5]))
0
0 0
1 1
2 2
3 3
4 4
5 5
vs
print(pd.DataFrame([0,1,2,3,4,5], columns=["My_Column"]))
My_Column
0 0
1 1
2 2
3 3
4 4
5 5
And if you just don't want the header row to show up in your output, which seems to be the intent of your question, you can use the index values and the count values directly to build whatever output format you want:
df1 = pd.DataFrame([0,1,2,3,4,5], columns=["My_Column"])
for tpl in zip(df1.index.values, df1["My_Column"].values):
    print("{}\t{}".format(tpl[0], tpl[1]))
Output:
0 0
1 1
2 2
3 3
4 4
5 5
And you can also use the DataFrame function to_csv() and pass header=False if you just want to print or save the CSV output somewhere without the header row:
print(df1.to_csv(header=False))
0,0
1,1
2,2
3,3
4,4
5,5
And you can also pass sep="\t" to the to_csv function call if you prefer tab- instead of comma-delimited output.
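For example, combining the two options on the same toy frame:
# tab-delimited output with no header row
print(df1.to_csv(header=False, sep="\t"))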
I have 3 columns of ID's that I want to combine into a single column like the example below. The goal here is to simply replace all 0's in the main column with the values in either ID1 or ID2 AND maintain the score column to the far right.
Note: the Main ID column also has cases where there is already a value, as shown in row 3; in that case nothing needs to be done. Ultimately I am trying to get a single column as shown in the desired output. I tried an iterative loop, but it was not a pythonic approach.
Data Table
Main ID ID_1 ID_2 SCORE
0 0 121231 212
0 54453 0 199
12123 12123 0 185
343434 0 343434 34
2121 0 0 66
0 0 11 565
Desired output:
MAIN ID SCORE
121231 212
54453 199
12123 185
343434 34
2121 66
11 565
Update: applying the bfill method changed all the 'MAIN_ID' numbers into scientific notation, like 3.43559e+06.
This one works for me. It's simple but functional :D
import pandas as pd
d = {'MAIN ID' : [0,0,12123,343434,2121,0], 'ID_1': [0,54453,12123,0,0,0],'ID_2':[121231,0,0,343434,0,11]}
df = pd.DataFrame(data=d)
for i in range(len(df)):
    # .loc assignment avoids chained indexing, which may silently fail to write back
    if df.loc[i, 'MAIN ID'] == 0:
        if df.loc[i, 'ID_1'] != 0:
            df.loc[i, 'MAIN ID'] = df.loc[i, 'ID_1']
        else:
            df.loc[i, 'MAIN ID'] = df.loc[i, 'ID_2']
df = df.drop(['ID_1', 'ID_2'], axis=1)
Try bfill with mask:
out = df.mask(df.eq(0)).bfill(axis=1)[['Main ID']]
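Regarding the update about scientific notation: masking zeros to NaN upcasts the ID columns to float, which is what produces values like 3.43559e+06. A sketch that casts back to integers and keeps SCORE, assuming every row has at least one non-zero ID (otherwise the cast would fail):
id_cols = ['Main ID', 'ID_1', 'ID_2']
filled = df[id_cols].mask(df[id_cols].eq(0)).bfill(axis=1)['Main ID']
out = pd.DataFrame({'MAIN ID': filled.astype('int64'),  # undo the NaN-induced float upcast
                    'SCORE': df['SCORE']})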
I have a data frame that presents some features with cumulative values. I need to identify those features in order to revert the cumulative values.
This is how my dataset looks (plus about 50 variables):
a b
346 17
76 52
459 70
680 96
679 167
246 180
What I wish to achieve is:
a b
346 17
76 35
459 18
680 26
679 71
246 13
I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around: first identify the features and then revert the values?
Finding cumulative features in dataframe?
What I do at the moment is run the following code to get the names of the features with cumulative values:
def accmulate_col(value):
    count = 0
    count_1 = False
    name = []
    for i in range(len(value)-1):
        if value[i+1]-value[i] >= 0:
            count += 1
        if value[i+1]-value[i] > 0:
            count_1 = True
    name.append(1) if count == len(value)-1 and count_1 else name.append(0)
    return name
df.apply(accmulate_col)
Afterwards, I manually save these feature names in a list called cum_features and revert the values, creating the desired dataset:
df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))
Is there a better way to solve my problem?
To identify which columns have increasing* values throughout the whole column, you will need to apply conditions on all the values. So in that sense, you have to use the values first to figure out what columns fit the conditions.
With that out of the way, given a dataframe such as:
import pandas as pd
d = {'a': [1,2,3,4],
'b': [4,3,2,1]
}
df = pd.DataFrame(d)
#Output:
a b
0 1 4
1 2 3
2 3 2
3 4 1
Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.
That can be written as:
out = (df.diff().dropna()>0).all()
#Output:
a True
b False
dtype: bool
Then, you can just use the column names to select only those with True in them
new_df = df[df.columns[out]]
#Output:
a
0 1
1 2
2 3
3 4
*(The term cumulative doesn't really describe the conditions you used. Did you want cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing just means that the value in the current row/index is greater than the previous one.)
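To tie this back to reverting the cumulative columns, a minimal sketch that reuses the boolean mask out from above (column names are those of the toy frame):
cum_cols = df.columns[out]   # the columns flagged as increasing
df_clean = df.copy()
# diff() leaves NaN in the first row; filling it with the original value
# matches np.diff(col, prepend=0) from the question
df_clean[cum_cols] = df_clean[cum_cols].diff().fillna(df[cum_cols])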