Cleaning my dataframe (similar lines and \xc3\x28 in a field) - Python

I am working on dataframes with Python.
In my first dataframe df1 I have:
+-----+-------------------+--------------+-----------------+
| ID  | PUBLICATION TITLE | DATE         | JOURNAL         |
+-----+-------------------+--------------+-----------------+
| 1   | "a"               | "01/10/2000" | "book1"         |
| 2   | "b"               | "09/03/2005" | NaN             |
| NaN | "b"               | "09/03/2005" | "book2"         |
| 5   | "z"               | "21/08/1995" | "book4"         |
| 6   | "n"               | "15/04/1993" | "book9\xc3\x28" |
+-----+-------------------+--------------+-----------------+
Here I would like to clean my dataframe, but I don't know how to do it in this case.
There are two points that block me.
The first is that lines 2 and 3 seem to be the same record, because the publication title is the same and I think a publication title is unique to a journal.
The second concerns the last line, which ends with \xc3\x28.
How can I clean my dataframe in a smart way, so that I can reuse this code for other dataframes if possible?

First, remove the row with ID = NaN:
df1 = df1[df1['ID'].notna()]
Then fill in the journal of the second row (position 1):
df1.iloc[1, df1.columns.get_loc('JOURNAL')] = 'book2'
Finally, fix the 'book9\xc3\x28' entry. Note that after the drop only four rows remain, so that entry is now at position 3, not 4:
df1.iloc[3, df1.columns.get_loc('JOURNAL')] = 'book9'
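Since the question asks for something reusable on other dataframes, here is a more generic sketch. It assumes that duplicate rows share the same title and date, and that cutting a string at the first non-ASCII character is an acceptable fix for the bad bytes; the mock data below just mirrors the table in the question.

```python
import numpy as np
import pandas as pd

# mock of df1 from the question; '\xc3(' stands in for the bad bytes
df1 = pd.DataFrame({
    'ID': [1, 2, np.nan, 5, 6],
    'PUBLICATION TITLE': ['a', 'b', 'b', 'z', 'n'],
    'DATE': ['01/10/2000', '09/03/2005', '09/03/2005', '21/08/1995', '15/04/1993'],
    'JOURNAL': ['book1', np.nan, 'book2', 'book4', 'book9\xc3('],
})

key = ['PUBLICATION TITLE', 'DATE']
# 1. let rows with the same title + date fill each other's gaps
df1[['ID', 'JOURNAL']] = (df1.groupby(key)[['ID', 'JOURNAL']]
                             .transform(lambda s: s.ffill().bfill()))
# 2. keep only one row per title + date
df1 = df1.drop_duplicates(subset=key)
# 3. cut each string at the first non-ASCII character (drops mojibake)
df1['JOURNAL'] = df1['JOURNAL'].str.replace(r'[^\x00-\x7F].*', '', regex=True)
```

The transform step lets the two duplicate rows fill each other's missing values before deduplication, so no information is lost when one of them is dropped.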

What type of encoding are you using?
I recommend using "utf-8" encoding for this purpose.
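If the file was read with the wrong (or a broken) encoding, one option is to read the raw bytes and decode them explicitly, telling Python what to do with invalid sequences. A minimal sketch, where the byte string just stands in for a bad field:

```python
# b'\xc3\x28' is an invalid UTF-8 sequence: 0xc3 announces a two-byte
# character, but 0x28 ('(') is not a valid continuation byte
raw = b'book9\xc3\x28'

clean = raw.decode('utf-8', errors='ignore')    # drop invalid bytes
marked = raw.decode('utf-8', errors='replace')  # or mark them with U+FFFD
print(clean)   # 'book9('
print(marked)  # 'book9\ufffd('
```

For CSV files, pandas' read_csv accepts encoding='utf-8', and recent versions (1.3+) also expose an encoding_errors parameter with the same semantics as above.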


Replace multiple values in Python or Excel

Seeking help. Hi guys, I haven't written any code yet because I think I need some ideas on how to access the CSV and its rows. Technically, I want to replace the text with the corresponding id in the CSV file.
import pandas as pd
df = pd.read_csv('replace.csv')
print(df)
Please kindly view the photo. There are 3 columns; I want to replace each value in the Replace column (column D) that equals a name (column A) with the corresponding id (column B). Seeking an idea of what the first step or approach should be. Thanks.
In the photo:
name  | id | Replace
james | 5  | James,James,Tom
tom   | 2  | Tom,James,James
jerry | 10 | Tom,Tom,Tom
Expected result:
name  | id | Replace
james | 5  | 5,5,2
tom   | 2  | 2,5,5
jerry | 10 | 2,2,2
Excel 365:
As per my comment, if it's OK to get the data in a new column and you are on Microsoft 365, try:
Formula in E2:
=MAP(C2:C4,LAMBDA(x,TEXTJOIN(",",,XLOOKUP(TEXTSPLIT(x,","),A2:A4,B2:B4,"",0))))
Or, if all values will be present anyways:
=MAP(C2:C4,LAMBDA(x,TEXTJOIN(",",,VLOOKUP(TEXTSPLIT(x,","),A2:B4,2,0))))
Google-Sheets:
The Google-Sheets equivalent, as per your request, could be:
=MAP(C2:C4,LAMBDA(x,INDEX(TEXTJOIN(",",,VLOOKUP(SPLIT(x,","),A2:B4,2,0)))))
Python/Pandas:
After some trial and error I came up with:
import pandas as pd
df = pd.read_csv('replace.csv', sep=';')
df['Replace'] = df['Replace'].replace(pd.Series(dict(zip(df.name, df.id))).astype(str), regex=True)
print(df)
Prints:
name id Replace
0 James 5 5,5,2
1 Tom 2 2,5,5
2 Jerry 10 2,2,2
Note: I used the semicolon as the separator in the read_csv call that opens the CSV.
Nested =SUBSTITUTE functions would make this easy:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(D2,A2,B2),A3,B3),A4,B4)

How can I copy values from one dataframe column to another based on the difference between the values

I have two mirrored CSV files generated by two different servers. Both files have the same number of lines and should have exactly the same Unix timestamp column. However, due to some clock issues, some records in one file may have a small difference of a nanosecond from their counterpart records in the other CSV file; see the example below. The difference is always 1:
dataframe_A dataframe_B
| | ts_ns | | | ts_ns |
| -------- | ------------------ | | -------- | ------------------ |
| 1 | 1661773636777407794| | 1 | 1661773636777407793|
| 2 | 1661773636786474677| | 2 | 1661773636786474677|
| 3 | 1661773636787956823| | 3 | 1661773636787956823|
| 4 | 1661773636794333099| | 4 | 1661773636794333100|
Since these are huge files with millions of lines, I use pandas and dask to process them, but before I do, I need to ensure they have the same timestamp column.
I need to check the difference between column ts_ns in A and B and if there is a difference of 1 or -1 I need to replace the value in B with the corresponding ts_ns value in A so I can finally have the same ts_ns value in both files for corresponding records.
How can I do this in a decent way using pandas/dask?
If you're sure that the timestamps should be identical, why don't you simply use the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?
Why even check whether the difference is there or not?
You can use the pandas merge_asof function for this, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html . The tolerance parameter accepts an int or timedelta and should be set to 1 for your example, with direction set to nearest.
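A sketch of that merge_asof idea on the sample values from the question; the ts_ns_a rename is only there so A's value survives the merge and can replace B's:

```python
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# merge_asof requires sorted keys; tolerance=1 only pairs values
# that are at most 1 apart, direction='nearest' looks both ways
fixed = pd.merge_asof(df_b.sort_values('ts_ns'),
                      df_a.sort_values('ts_ns').rename(columns={'ts_ns': 'ts_ns_a'}),
                      left_on='ts_ns', right_on='ts_ns_a',
                      tolerance=1, direction='nearest')
# where a close match in A exists, take A's value; otherwise keep B's
fixed['ts_ns'] = fixed['ts_ns_a'].fillna(fixed['ts_ns']).astype('int64')
fixed = fixed.drop(columns='ts_ns_a')
```

After this, fixed['ts_ns'] carries A's timestamps wherever B was off by one, and B's own value where no match within the tolerance was found.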
Assuming your files are identical apart from the ts_ns column, you can perform a .merge on the indices.
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})
df_b = (df_b
    .merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
    .assign(ts_ns=lambda df_: np.where(abs(df_.ts_ns - df_.ts_ns_a) <= 1, df_.ts_ns_a, df_.ts_ns))
    .loc[:, ['ts_ns']]
)
But I agree with @ManEngel: just overwrite all the values if you know they are identical.

How to check whether first names and last names are English?

I have a CSV file with two columns and about 9,000 rows. Column 1 contains the first name of a survey respondent, and column 2 contains the last name, so each row is one observation.
These surveys were conducted in a very diverse place. I am trying to find a way to tell whether a respondent's first name is of English (British or American) origin or not, and the same for the last name.
This task is very far from my area of expertise. After reading interesting discussions online here and here, I have thought about three ways:
1- Take a dataset of the most common triplets (groups of 3 letters often found together in English) or quadruplets (groups of 4 letters often found together in English) and check, for each first name and last name, whether it contains these groups of letters.
2- Use a dataset of British names (say the X most common names in the UK in the early XX century) and match these names against my dataset based on proximity. These datasets could be good, I think: data1, data2, data3.
3- Use Python and some interface to detect what is (most likely) English from what is not.
If anyone has advice on this, or can share experience, that would be great!
I am attaching an example of the data (I made up the names) and of the expected output.
NB: Please note that I am perfectly aware that classifying names according to an English/non-English dichotomy is not without drawbacks and semantic issues.
I built something a while back that is quite similar. Summary below.
Created 2 source lists: a first-name list and a last-name list
Created 4+ comparison lists (English first-name list, English last-name list, et al.)
Then used an in_array function to compare a source first name to a comparison first name
Then I used a big if statement to check the lists against each other: Eng.First vs Src.First, American.First vs Src.First, Irish.First vs Src.First, and so on.
If you are thinking of using your first bullet as an option (e.g. parts and pieces of a name), I wrote a paper which includes some source code that may be able to help:
Ordered Match Ratio as a Method for Detecting Program Abuse / Fraud
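In Python, the in_array comparison described above boils down to set membership. A toy sketch; the reference lists here are made-up placeholders, and real ones would come from name datasets such as those linked in the question:

```python
# hypothetical reference lists; real ones would be loaded from name datasets
english_first = {'james', 'mary', 'john', 'morris'}
english_last = {'brown', 'smith', 'taylor'}

def classify(first, last):
    """Return (first_is_english, last_is_english) as 0/1 flags."""
    return (int(first.lower() in english_first),
            int(last.lower() in english_last))

print(classify('James', 'Brown'))  # (1, 1)
print(classify('Musa', 'Bemba'))   # (0, 0)
```

Exact membership only catches names present in the lists; the fuzzy "proximity" matching from the question's second idea would need something like edit distance on top of this.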
Although the best solution would probably be to train a classification model on top of BERT or a similar language model, a crude solution would be to use zero-shot classification. The example below uses transformers. It does a fairly decent job, although you see some semantic issues pop up: the classification of the name Black, for example, is likely distorted due to it also being a color.
import pandas as pd
from transformers import pipeline
data = [['James', 'Brown'], ['Gerhard', 'Schroeder'], ['Musa', 'Bemba'], ['Morris D.', 'Kemba'], ['Evelyne', 'Fontaine'], ['Max D.', 'Kpali Jr.'], ['Musa', 'Black']]
df = pd.DataFrame(data, columns=['firstname', 'name'])
classifier = pipeline("zero-shot-classification")
firstnames = df['firstname'].tolist()
lastnames = df['name'].tolist()
candidate_labels = ["English or American", "not English or American"]
hypothesis_template = "This name is {}."
results_firstnames = classifier(firstnames, candidate_labels, hypothesis_template=hypothesis_template)
results_lastnames = classifier(lastnames, candidate_labels, hypothesis_template=hypothesis_template)
df['f_english'] = [1 if i['labels'][0] == 'English or American' else 0 for i in results_firstnames ]
df['n_english'] = [1 if i['labels'][0] == 'English or American' else 0 for i in results_lastnames]
df
Output:
| | firstname | name | f_english | n_english |
|---:|:------------|:----------|------------:|------------:|
| 0 | James | Brown | 1 | 1 |
| 1 | Gerhard | Schroeder | 0 | 0 |
| 2 | Musa | Bemba | 0 | 0 |
| 3 | Morris D. | Kemba | 1 | 0 |
| 4 | Evelyne | Fontaine | 1 | 0 |
| 5 | Max D. | Kpali Jr. | 1 | 0 |
| 6 | Musa | Black | 0 | 0 |

Use one data-frame (used as a dictionary) to fill in the main data-frame (Python, Pandas)

I have a central DataFrame called "cases" (5,000,000 rows × 5 columns) and a secondary DataFrame called "relevant information", which acts as a kind of dictionary for the central DataFrame (300 rows × 6 columns).
I am trying to fill in the central DataFrame based on a common column called "verdict_type".
If a value does not appear in the secondary DataFrame, "not_relevant" should be filled in for all the rows that are added.
I have tried all sorts of approaches without success.
I would love to get a good direction.
The DataFrames
import pandas as pd

# this is a mockup of the raw data
cases = [
    [1, "1", "v1"],
    [2, "2", "v2"],
    [3, "3", "v3"],
]
relevant_info = [
    ["v1", "info1"],
    ["v3", "info3"],
]

# these are the data from the screenshot
df_cases = pd.DataFrame(cases, columns=["id", "verdict_name", "verdict_type"]).set_index("id")
df_relevant_info = pd.DataFrame(relevant_info, columns=["verdict_type", "features"])
Input:
df_cases <-- note here the index marked as 'id'
df_relevant_info
# first, flatten the index of the cases ( this is probably what you were missing )
df_cases = df_cases.reset_index()
# then, merge the two sets on the verdict_type
df_merge = pd.merge(df_cases, df_relevant_info, on="verdict_type", how="outer")
# finally, mark missing values as non relevant
df_merge["features"] = df_merge["features"].fillna(value="not_relevant")
Output:
merged set:
+----+------+----------------+----------------+--------------+
| | id | verdict_name | verdict_type | features |
|----+------+----------------+----------------+--------------|
| 0 | 1 | 1 | v1 | info1 |
| 1 | 2 | 2 | v2 | not_relevant |
| 2 | 3 | 3 | v3 | info3 |
+----+------+----------------+----------------+--------------+

How can I make the groups in a specific column appear when the groupby method is used in order to merge?

x = (df.groupby('id_gamer')[['sucess', 'nb_games']]
     .shift(periods=1)
     .cumsum()
     .apply(lambda row: row.sucess / row.nb_games, axis=1))
In the code above, I group a pandas.DataFrame in order to obtain a shifted column of results expressed as a ratio, for each gamer and each game: his success rate given the number of games he has played.
It returns a pandas.core.series.Series object as:
+---------------+----------------+
| Index | Computed_ratio |
+---------------+----------------+
| id_game_date | NaN |
| id_game2_date | 0.30 |
| id_game3_date | 0.40 |
| id_game_date | NaN |
| id_game4_date | 0.50 |
| ... | ... |
+---------------+----------------+
So, you may see the NaN values as delimiters between gamers. As you can see, the first gamer and the second one met in one game: id_game_date. This is why I would prefer the gamer column from id_gamer to appear, in order to merge the result with the dataframe the data come from.
To be honest, I have an idea for a solution: just do not use the game ids as the index; then each row would be indexed correctly and there would be no conflict when I perform a merge, I guess. But I would like to know if it is possible with the current pattern shown here.
NB: I already tried the solutions presented in this topic, but none of them work, certainly because the functions shown there are aggregations and mine, cumsum(), is not. If I use an aggregating function like sum() (with a different pattern of code; do not try it with the one I gave you or it will return an error), the id_gamer appears, but it does not correspond to my expectations.
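For what it's worth, one way to keep id_gamer visible is to compute the shifted cumulative sums per group with transform, which preserves the original row alignment, and assign the ratio straight back onto the dataframe. A sketch with made-up values, using the column names from the question:

```python
import pandas as pd

# mock data: two gamers, three and two games respectively
df = pd.DataFrame({
    'id_gamer': ['g1', 'g1', 'g1', 'g2', 'g2'],
    'sucess':   [3, 1, 2, 5, 4],
    'nb_games': [10, 10, 10, 10, 10],
})

g = df.groupby('id_gamer')
# shift then cumsum within each gamer's rows; transform keeps the row
# order, so the result aligns with the original dataframe
cum = g[['sucess', 'nb_games']].transform(lambda s: s.shift(1).cumsum())
df['ratio'] = cum['sucess'] / cum['nb_games']
```

Because transform returns a result aligned with the original rows, id_gamer (and any other column, such as the game id) stays available for a later merge instead of being hidden in the index.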
