I've been looking all over Google for a solution.
I am pulling data using requests.get(), which returns a lengthy JSON response.
I've been able to normalize it and load it into a pandas DataFrame.
The problem I am having is taking the URLs in columns X, Y, and Z and removing the percent-encoding. I'd be okay with removing it from the entire dataframe.
stat.get.url
0
1
2 entrance%7Cstate%2Fgoogle
3 entrance%7Cstate%2Fyahoo
4 entrance%7Cstate%2Fmsn
I've tried this code:
df.replace('%7C', '|', regex=True)
But that doesn't replace anything in the dataframe.
How can I replace the percent-encoded characters and have the result saved back to the dataframe?
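For what it's worth, a likely cause is that df.replace returns a new DataFrame by default rather than modifying it in place, so the result has to be assigned back. A minimal sketch (assuming the column is named stat.get.url as in the output above) that also decodes every percent-escape with urllib.parse.unquote:
from urllib.parse import unquote
# replace() is not in-place by default, so assign the result back
df = df.replace('%7C', '|', regex=True)
# to decode all percent-escapes at once, unquote each string value;
# non-string cells (e.g. NaN) are passed through unchanged
df['stat.get.url'] = df['stat.get.url'].map(
    lambda v: unquote(v) if isinstance(v, str) else v)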
I have the following string in a column within a row in a pandas dataframe. You could just treat it as a string.
;2;613;12;1;Ajc hw EEE;13;.387639;1;EXP;13;2;128;12;1;NNN XX Ajc;13;.208966;1;SGX;13;..
It goes on like that.
I want to convert it into a table, using the semicolon (;) as a delimiter. The problem is there is no newline delimiter, so I have to assume a row break every 10 items.
So, it should look something like this.
;2;613;12;1;Ajc hw EEE;13;.387639;1;EXP;13;
2;128;12;1;NNN XX Ajc;13;.208966;1;SGX;13;..
How do I convert that string into a new dataframe in pandas? After every 10 semicolon-delimited values, a new row should be created.
I have no idea how to do this; any help in terms of tools or ideas would be greatly appreciated.
This should work:
import pandas as pd
# drop the first character as it's a semicolon
data = ';2;613;12;1;Ajc hw EEE;13;.387639;1;EXP;13;2;128;12;1;NNN XX Ajc;13;.208966;1;SGX;13;'[1:]
data = data.split(';')
# floor-divide to count the complete rows of 10
row_count = len(data) // 10
# slice the flat list into rows of 10 items each
data = [data[x * 10:(x + 1) * 10] for x in range(row_count)]
pd.DataFrame(data)
I used floor division (//) so that row_count stays an integer; a single slash would produce a float, which range() cannot accept.
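A more compact variant of the same idea (a sketch, assuming the item count is an exact multiple of 10) is to strip the outer semicolons and let numpy do the reshaping:
import numpy as np
import pandas as pd
raw = ';2;613;12;1;Ajc hw EEE;13;.387639;1;EXP;13;2;128;12;1;NNN XX Ajc;13;.208966;1;SGX;13;'
# strip the outer semicolons, split, then reshape the flat list into rows of 10
values = raw.strip(';').split(';')
df = pd.DataFrame(np.array(values).reshape(-1, 10))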
Issue:
Pandas appears to be swapping the column data in the dataframe when saving to CSV. What is going on?
# Code
myDF.to_csv('./myDF.csv')
print(myDF)
# Print Output
dd-3 dd-4
5346177884_triplet+ 3 3
5346177884_dublet- 5 5
5346177884_dublet+ 3 3
...
1434120345_triplet+ NaN 1
1434120345_singlet+ NaN 3
# CSV File
,dd-3,dd-4
5346177884_triplet+,3.0,3
5346177884_dublet-,5.0,5
5346177884_dublet+,3.0,3
...
1434120345_triplet+,,1
1434120345_singlet+,,3
Anyone seen anything like this before?
Be sure to check the raw CSV file, to make sure it is not the tool you are using to display the CSV that is interpreting the file incorrectly. For instance, pandas writes NaN as an empty field in a CSV file, while LibreOffice Calc can be set on import to merge repeated delimiters (useful for space-separated files containing runs of spaces). If you accidentally leave that feature on when importing a CSV with blanks between delimiters, you may see an effect similar to what you have reported.
Issue:
# CSV Format
,h1,h2,h3
obj,v1,v2,v3
# PD handling NAN for v1 & v2
,h1,h2,h3
obj,,,v3
# Merge delimiter interpretation
,h1,h2,h3
obj,v3
# Resulting View
     h1  h2  h3
obj  v3
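A minimal sketch that reproduces the behaviour (column names borrowed from the question); passing na_rep gives NaN an explicit marker, so a spreadsheet import cannot collapse the empty fields:
import numpy as np
import pandas as pd
df = pd.DataFrame({'dd-3': [3.0, np.nan], 'dd-4': [3, 1]},
                  index=['5346177884_triplet+', '1434120345_triplet+'])
print(df.to_csv())              # NaN becomes an empty field by default
print(df.to_csv(na_rep='NaN'))  # write an explicit marker instead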
So I am venturing into Python scripts. I am a beginner, but I have been tasked with translating an Excel formula into Python code. I have a text file that contains 3+ million rows; each row has three columns and is tab-delimited.
All rows come in as strings, and the first two columns cause no problems. The problem with the third column is that numerical values get padded with leading 0s to 18 characters when downloaded.
The same column also has values that contain spaces in between, like 00 5372. Some are fully text, identified by letters or other characters, like ABC3400 or 00-ab-fu-99 or 00-33-44-66.
A1    B1   Values              Output
AA00  dd   000000000000056484  56484
AB00  dd   00 564842           00 564842
AC00  dd   00-563554-f         00-563554-f
AD00  dd   STO45642            STO45642
AE00  dd   45632               45632
I need to clean these codes so the output text is clean, while:
1) Leaving the spaces in between,
2) Cleaning the leading and trailing spaces,
3) Cleaning the value if it is padded with 0's in front.
In Excel, I do this for a limited amount of data using the following formula.
=TRIM(IFERROR(IF((FIND(" ";A2))>=1;A2);TEXT(A2;0)))
*Semicolons are due to regional language settings.
For large files, I use the following Power Query step.
= Table.ReplaceValue(#"Trimmed Text", each [Values], each if Text.Contains([Values]," ") then [Values] else if Number.From([Values]) is number then Text.TrimStart([Values],"0") else [Values],Replacer.ReplaceValue,{"Values"})
First trim, then replace values. This does the job very well in Power Query. Now I would like to do it with a Python script, but as a noob I am stuck at the very beginning. Can anyone help me with the library and code?
My end target is to get the data saved in txt/csv with cleaned values.
*Edited to correct point 1 (leaving, not removing, the spaces) and to add further clarification with data.
Try the code below (replace column3 with the actual name of your third column and set the variable file_name to the file's path; if the Python script and the data file are saved in the same location, the file name alone is enough):
import pandas as pd
# the file is tab-delimited text, so read_csv is the right reader here
df = pd.read_csv(file_name, sep='\t', dtype=str)
# trim leading/trailing whitespace but keep the spaces inside values
df['column3'] = df['column3'].str.strip()
# strip leading zeros only from purely numeric values
mask = df['column3'].str.isdigit()
df.loc[mask, 'column3'] = df.loc[mask, 'column3'].str.lstrip('0')
df.to_csv('output.csv', index=False)
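A regex variant of the same idea (a sketch; it strips leading zeros only when the rest of the value is all digits, so values like 00 564842 or 00-563554-f stay untouched):
df['column3'] = df['column3'].str.strip().str.replace(r'^0+(?=\d+$)', '', regex=True)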
I am using the code below to remove all non-English characters:
df.text.replace({r'[^\x00-\x7F]+':''}, regex=True, inplace=True)
where df has a column called text with text in it like below:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。\n
¡Hola miguel! Lamento mucho la confusión cau
expected output:
text
hi what are you saying?
okay let me know
sounds great, mikey
ok.
right
For the rows where my code removes characters, I want to delete those rows from the df completely. That is, if the code replaces any non-English characters in a row, I want to drop that row entirely, to avoid keeping rows left with zero characters, or a few meaningless ones, after being altered by the code above.
You can use
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
Pandas test:
import pandas as pd
df = pd.DataFrame({'text': ['hi what are you saying?', 'ご承知のとおり、残念ながら悪質な詐欺が増加しているようですのでお気を付けください。'], 'another_col':['demo 1', 'demo 2']})
df[~df['text'].str.contains(r'[^\x00-\x7F]')]
# text another_col
# 0 hi what are you saying? demo 1
Notes:
df['text'].str.contains(r'[^\x00-\x7F]') finds all values in the text column that contain a non-ASCII character (it is our "mask")
df[~...] only keeps those rows that did not match the regex.
str.contains() returns a Series of booleans that we can use to index our frame
patternDel = r"[^\x00-\x7F]"
filter = df['Event Name'].str.contains(patternDel)
I tend to keep the things we want, as opposed to deleting rows. Since filter represents the things we want to delete, we use ~ to get all the rows that don't match, and keep them:
df = df[~filter]
I read pickled data and put it into a dataframe for further processing, and I found an issue when it comes to selecting rows that contain a certain string.
Below are the first few lines of the dataframe.
parent_pid \
0 UXXY-C240-M4L
2 UXXZ-B200-M5-U
4 UXXZ-B200-M5-U
6 UXXZ-B200-M5-U
8 UXXZ-B200-M5-U
pid
0 UXXY-F-H19001,UXX-SD480G...
2 UXX-SD-32G-S,UXX-ML-X64G...
4 UXX-SD-32G-S,UXX-SD-32G-...
6 UXX-SD-32G-S,UXX-MR-X32G...
8 UXX-SD-32G-S,UXX-MR-X32G...
When it comes to searching for rows that contain "UXXZ-B200-M5-U", I used the code below.
df.query('parent_pid == "UXXZ-B200-M5-U"')
And below is what it returns.
Empty DataFrame
Columns: [parent_pid, pid]
Index: []
I tried many different ways to search for rows with this string; they all return the same.
Whitespace in the columns doesn't seem to matter.
df[df["parent_pid"].isin(["UXXZ-B200-M5-U"])]
df.filter(like="UXXZ-B200-M5-U").columns
Does anyone know what the issue is here?
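For what it's worth, a common cause of an exact match failing while the values look identical is hidden characters (trailing whitespace, non-breaking spaces) in the data. A hedged diagnostic sketch:
# show the raw value, including any hidden characters
print(repr(df['parent_pid'].iloc[1]))
# a substring match tolerates surrounding whitespace
print(df[df['parent_pid'].str.contains('UXXZ-B200-M5-U', na=False)])
# if stray whitespace is the culprit, strip it and retry the exact match
df['parent_pid'] = df['parent_pid'].str.strip()
print(df.query('parent_pid == "UXXZ-B200-M5-U"'))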