How to get dataframe column index by cell value - Python

I have a dataframe that looks like this:
(image of the dataframe omitted)
How can I get the index of the column whose cells contain ".xml"? In my dataframe, that column has index 4.
Here is my code:
df = fileserver
for index in df:
    df1 = df[index].str.contains(".xml")
    print(df1)
This prints True/False for each column, but I don't know how to get the index of the column where the result is True. Please help, and thank you everyone.

You can track which column contains the pattern ".xml" using the following code. In the example below, the column of interest has index 2.
First, create a sample dataframe for your example.
import pandas as pd
>>> df = pd.DataFrame([[1, 2, "3.xml"], [4, 5, "6.xml"]])
>>> df
   0  1      2
0  1  2  3.xml
1  4  5  6.xml
Now, check for your pattern among all cells of the dataframe.
>>> df_filtered = df.applymap(lambda cell: str(cell).endswith(".xml"))
>>> df_filtered
       0      1     2
0  False  False  True
1  False  False  True
Finally, keep columns if at least one cell was found with the desired pattern.
>>> [column for column, count in df_filtered.sum().to_dict().items()
...  if count > 0]
[2]
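A slightly more direct variant of the same idea, sketched on the same two-row example frame: DataFrame.any() collapses the per-cell boolean mask to one boolean per column, and indexing df.columns with it yields the matching labels.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "3.xml"], [4, 5, "6.xml"]])

# One boolean per cell, then one boolean per column, then the labels.
mask = df.applymap(lambda cell: str(cell).endswith(".xml"))
matching_columns = df.columns[mask.any()].tolist()
print(matching_columns)  # → [2]
```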

Related

Pandas find the first occurrence of a specific value in a row within multiple columns and return column index

For a dataframe:
df = pd.DataFrame({"A":[0,0],"B":[0,1],"C":[1,2],"D":[2,2]})
How can I obtain the column name (or column index) where the value is 2, or some other given value,
and put it in a new column of df, say df["TAG"]:
df = pd.DataFrame({"A":[0,0],"B":[0,1],"C":[1,2],"D":[2,2],"TAG":["D","C"]})
I tried
df["TAG"] = np.where(df[cols] >= 2, df.columns, '')
where cols is the list of df columns.
So far I have only found how to get the row index when matching a value in Pandas.
In Excel we can take a similar approach using MATCH(TRUE,INDEX($A:$D>=2,0),) and apply it to multiple rows.
Any help or hints are appreciated.
Thank you so much in advance.
Try idxmax:
>>> df['TAG'] = df.ge(2).T.idxmax()
>>> df
   A  B  C  D TAG
0  0  0  1  2   D
1  0  1  2  2   C
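The transpose can be avoided: as a sketch on the same frame, idxmax(axis=1) scans each row and returns the first column label where the condition holds (True sorts above False).

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0], "B": [0, 1], "C": [1, 2], "D": [2, 2]})

# Per row, the label of the first column whose value is >= 2.
df["TAG"] = df.ge(2).idxmax(axis=1)
print(df["TAG"].tolist())  # → ['D', 'C']
```

Note that idxmax still returns the first column label for a row where nothing is >= 2, so this assumes every row has at least one match, just like the answer above.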

Python: How to remove rows which are in other dataframe?

I have two dataframes:
First:
tif_pobrany
0 65926_504019_N-33-127-B-d-3-4.tif
1 65926_504618_N-33-139-D-b-1-3.tif
2 65926_504670_N-33-140-A-a-2-3.tif
3 66533_595038_N-33-79-C-b-3-3.tif
4 66533_595135_N-33-79-D-d-3-4.tif
Second:
url godlo ... row_num nazwa_tifa
0 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-1-2 ... 48004 73231_904142_M-34-68-C-a-1-2.tif
1 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-3-1 ... 48011 73231_904127_M-34-68-C-a-3-1.tif
2 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-3-2 ... 48012 73231_904336_M-34-68-C-a-3-2.tif
3 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-3-3 ... 48013 73231_904286_M-34-68-C-a-3-3.tif
4 https://opendata.geoportal.gov.pl/ortofotomapa... M-34-68-C-a-4-2 ... 48016 73231_904263_M-34-68-C-a-4-2.tif
How can I delete the rows in the second dataframe whose 'nazwa_tifa' value appears in the first dataframe's 'tif_pobrany' column?
Something like this:
for index, row in second.iterrows():
    for index2, row2 in first.iterrows():
        if row['nazwa_tifa'] == row2['tif_pobrany']:
            del row
but it didn't work.
Try this with your data:
import pandas as pd
df1 = pd.DataFrame({"col1":[1,2,3,4,5]})
df2 = pd.DataFrame({"col2":[1,3,4,9,8]})
df1.drop(df1[df1.col1.isin(df2.col2)].index, inplace = True)
print(df1)
output:
   col1
1     2
4     5
Considering df1 and df2 are the names of your dataframes respectively:
df2 = df2[~df2['nazwa_tifa'].isin(df1['tif_pobrany'])]
How does it work?
The isin function checks whether each value of one Pandas series is present in another Pandas series.
The ~ negates that mask, so indexing df2 with it keeps only the rows whose 'nazwa_tifa' does not appear in df1.
Finally, the assignment replaces df2 with the filtered dataframe.
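Put together on tiny stand-in frames (the filenames below are made up for illustration, only the two column names come from the question), the whole filter is:

```python
import pandas as pd

# Stand-in frames mirroring the question's columns; filenames are invented.
first = pd.DataFrame({"tif_pobrany": ["a.tif", "b.tif"]})
second = pd.DataFrame({"nazwa_tifa": ["a.tif", "c.tif", "b.tif", "d.tif"],
                       "row_num": [1, 2, 3, 4]})

# Keep only the rows whose filename is NOT present in the first frame.
second = second[~second["nazwa_tifa"].isin(first["tif_pobrany"])]
print(second["nazwa_tifa"].tolist())  # → ['c.tif', 'd.tif']
```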

Pandas: drop values in columns but keep the columns

As the title says, I would like to find a way to drop (erase) the row values from a given column to the end of the dataframe while keeping the columns themselves, but I can't find a way to do so.
I would like to start with
A  B  C
-----------
1  1  1
1  1  1
1  1  1
and get
A  B  C
-----------
1
1
1
I was trying with
df.drop(df.loc[:, 'B':].columns, axis = 1, inplace = True)
But this deletes the columns themselves too:
A
-
1
1
1
Am I missing something?
If you only know the column name that you want to keep:
import pandas as pd
new_df = pd.DataFrame(df["A"])
If you only know the column names that you want to drop:
new_df = df.drop(["B", "C"], axis=1)
For your case, to keep all the columns but remove the content of some of them, one possible way is:
new_df = pd.DataFrame(df["A"], columns=df.columns)
The resulting dataframe contains all three columns, but "B" and "C" hold no values (NaN instead).
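If the goal is literally to keep every column and blank out the values from "B" onwards, another possible way, sketched on a small frame matching the question's layout, is to assign NaN over a label slice in place:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1], "B": [1, 1, 1], "C": [1, 1, 1]})

# Clear every column from "B" to the end; the columns themselves remain.
df.loc[:, "B":] = np.nan
print(df)
```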

Remove rows of one Dataframe based on one column of another dataframe

I have two DataFrames and want to remove the rows in df1 whose value in column 'a' also appears in column 'a' of df2. Moreover, each common value in df2 should remove only one row.
df1 = pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2 = pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
result = pd.DataFrame({'a':[1,1,3,4],'b':[1,2,4,6],'c':[6,5,3,1]})
Use Series.isin + Series.duplicated to create a boolean mask, and use this mask to filter the rows of df1:
m = df1['a'].isin(df2['a']) & ~df1['a'].duplicated()
df = df1[~m]
Result:
print(df)
   a  b  c
0  1  1  6
1  1  2  5
3  3  4  3
5  4  6  1
Try This:
import pandas as pd
df1 = pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2 = pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
df2a = df2['a'].tolist()

def remove_df2_dup(x):
    # Remove x from the list so each value in df2 only drops one row.
    if x in df2a:
        df2a.remove(x)
        return False
    return True

df1[df1.a.apply(remove_df2_dup)]
It creates a list from df2['a'], then checks each value of df1['a'] against that list, removing a value from the list on every match, so each occurrence in df2 removes at most one row from df1.
Try this:
df1 = pd.DataFrame({'a':[1,1,2,3,4,4],'b':[1,2,3,4,5,6],'c':[6,5,4,3,2,1]})
df2 = pd.DataFrame({'a':[2,4,2],'b':[1,2,3],'c':[6,5,4]})
for x in df2.a:
    if x in df1.a.values:  # membership test on the values, not the index
        df1.drop(df1[df1.a == x].index[0], inplace=True)
print(df1)

Dataframe becomes larger than it should be after join operation in pandas

I have an Excel dataframe that I am trying to populate with fields from another Excel file, like so:
df = pd.read_excel("file1.xlsx")
df_new = df.join(conv.set_index('id'), on='id', how='inner')
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: "" if x==0 else x) # if id==0, its same as nan
df_new = df_new.dropna() # drop nan
df_new['PersonalN'] = df_new['PersonalN'].apply(lambda x: str(int(x))) # convert id to string
df_new = df_new.drop_duplicates() # drop duplicates, if any
It is clear that df_new should be a subset of df. However, when I run the following code:
len(df[df['id'].isin(df_new['id'].values)]) # length of this should be same as len(df_new)
len(df_new)
I get different results (there are 6 more rows in df_new than in df). How can that be? I have checked all dataframes for duplicates and none of them contain any. Interestingly, following code does give expected results:
len(df_new[df_new['id'].isin(df['id'].values)])
len(df_new)
These both print the same numbers.
Edit:
I have also tried the following: others = df[~df['id'].isin(df_new['id'].values)], then checked whether others has the same length as len(df) - len(df_new); but again, dataframe others has 6 more rows than expected.
The problem comes from your conv dataframe. Assume that your df from file1 is

id  PersonalN
 0          1

and conv is

id  other_col
 0      'abc'
 0      'def'

After the join you will get:

id  PersonalN  other_col
 0          1      'abc'
 0          1      'def'

The size of df_new is larger than that of df, and drop_duplicates() or dropna() will not reduce the shape of the resulting dataframe, because the joined rows differ in other_col.
It's hard to know without the data, but even if there are no duplicates in either of the dataframe, the size of the result of an inner join can be larger than the original dataframe size. Consider the following example:
df1 = pd.DataFrame(range(10), columns=["id_"])
df2 = pd.DataFrame({"id_": list(range(10)) + [1] * 3, "something": range(13)})
df2.drop_duplicates(inplace=True)
print(len(df1), len(df2))
==> 10 13
df_new = df1.join(df2.set_index("id_"), on="id_")
len(df_new)
==> 13
print(df_new)
   id_  something
0    0          0
1    1          1
1    1         10
1    1         11
1    1         12
2    2          2
...
The reason is, of course, that the ids of the other dataframe are not unique, so a single id in the original dataframe (df1 in my example) is joined to several rows of the other dataframe (df2 in my example, conv in yours).
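One way to surface this early, sketched on the example frames above, is to use merge with its validate argument, which raises when the right-hand keys are not unique instead of silently growing the result:

```python
import pandas as pd

df1 = pd.DataFrame(range(10), columns=["id_"])
df2 = pd.DataFrame({"id_": list(range(10)) + [1] * 3, "something": range(13)})

# validate="many_to_one" requires the right-hand keys to be unique and
# raises MergeError otherwise, catching silent row growth at join time.
try:
    df_new = df1.merge(df2, on="id_", validate="many_to_one")
except pd.errors.MergeError:
    print("right-hand keys are not unique")
```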
