In my work, I frequently end up doing a lot of data analysis. This usually involves taking multiple sources of data, like spreadsheets, JSON or XML files, and pulling them into a database. This data is often grouped together in unusable ways, and needs to be broken up before it can be used for analysis.

Depending on what needs to happen, I'll either write up a quick PHP script to iterate through the data, or I'll go right to MySQL and break it apart there.

In this post, I figured I'd go through a few common things that I end up doing in SQL scripts. It's important to note that MSSQL and MySQL handle regex a bit differently, so these functions may not work with MSSQL.

Cleaning up phone numbers

You've got a completely bonkers list of phone numbers, and you need to clean them.

Note - this data was randomly generated using Faker.

The first thing we do is clean out all of the extra characters from the data.

Step 3 - Wherever the phone number is longer than 10 digits, take any digits after the 10th and put them in the extension column

UPDATE users SET extension = SUBSTR(phone_number, 11, LENGTH(phone_number) - 10) WHERE LENGTH(phone_number) > 10

Step 4 - Clean the extension off the end of the phone number column

UPDATE users SET phone_number = SUBSTR(phone_number, 1, 10)

Finally, after all this data cleansing and breakdown is done, you can reformat the phone number into a readable format. Pick whichever one of these matches the format you want:

UPDATE users SET phone_number = CONCAT(SUBSTR(phone_number,1,3), '-', SUBSTR(phone_number,4,3), '-', SUBSTR(phone_number,7,4))

UPDATE users SET phone_number = CONCAT(SUBSTR(phone_number,1,3), '.', SUBSTR(phone_number,4,3), '.', SUBSTR(phone_number,7,4))

UPDATE users SET phone_number = CONCAT('(', SUBSTR(phone_number,1,3), ') ', SUBSTR(phone_number,4,3), '-', SUBSTR(phone_number,7,4))

In the first example, the use case was really about using regex to clean the data. In this example, we're going to use regex to test the data first.

Here is our sample data, and you can see that the address is all bunched up into one column.

address
Apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana
2353 Cathryn Pass, West Jenniferview, Illinois
92467 Satterfield Locks, Wendyshire, Washington
179, 4769 Evangeline Garden, New Carlosstad, Vermont
78700 Vernon Ford, North Rudolph, New Jersey
45 837 Franecki Meadows, Schmelerfurt, Oklahoma
Unit 99, 499 Beahan Harbors, Annestad, Rhode Island
87673 Williamson Bridge, Port Nedra, Michigan
37 422 Wilson Streets, Nienowmouth, Arkansas

Right away, we should add columns for unit_number, street_number, street, city and state to the table.

Step 1 - Copy the dirty unit numbers over to the unit_number column

REGEXP_SUBSTR() uses a regex pattern to find the first occurrence of that pattern in a string.

Step 2 - Now we're going to remove the unit number from the address field

This query just snips the length of the unit number off the front of the address column.

UPDATE users SET address = SUBSTR(address, LENGTH(unit_number)+1, LENGTH(address) - LENGTH(unit_number)) WHERE unit_number IS NOT NULL

Step 3 - Now we clean up the unit number field

In this one, we're just looking for instances of grouped digits.

UPDATE users SET unit_number = REGEXP_SUBSTR(unit_number, '[0-9]+') WHERE unit_number IS NOT NULL

When we do step 2, it often leaves some junk at the beginning of the string.

Eg: "Apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana"
Turns into ", 81001 Klocko Crossroad, Haleyhaven, Indiana"

REGEXP_INSTR() will return the position of the first regex match. So we look for the next number and chop a bit more off the address field if there is any junk left. The + 1 on the length matters: without it, SUBSTR() stops one character short and clips the last letter off the state.

UPDATE users SET address = SUBSTR(address, REGEXP_INSTR(address, '[0-9]+'), LENGTH(address) - REGEXP_INSTR(address, '[0-9]+') + 1)

So at this point, our data looks like this:

address
81001 Klocko Crossroad, Haleyhaven, Indiana
2353 Cathryn Pass, West Jenniferview, Illinois
92467 Satterfield Locks, Wendyshire, Washington
4769 Evangeline Garden, New Carlosstad, Vermont
78700 Vernon Ford, North Rudolph, New Jersey
837 Franecki Meadows, Schmelerfurt, Oklahoma
499 Beahan Harbors, Annestad, Rhode Island
87673 Williamson Bridge, Port Nedra, Michigan
422 Wilson Streets, Nienowmouth, Arkansas

Now we're going to pull the street number out.

Step 5 - We basically run the step 3 function again on the address to build the street_number column

UPDATE users SET street_number = REGEXP_SUBSTR(address, '[0-9]+')

Step 6 - Same function as step 2, just to clean the street number off of the address field

UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(street_number)+1, LENGTH(address) - LENGTH(street_number)))
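If you'd rather do this kind of pass in a quick script than in MySQL, here is a rough Python sketch of the same ideas: strip a phone number down to digits, split off anything past the 10th digit as an extension, and peel the unit number and street number off the front of an address. Treat it as a sketch under assumptions, not as the queries above translated one for one. The function names, the sample phone string and the UNIT_RE pattern (a unit number is assumed to be a group of digits that is followed by a second group of digits) are all made up for illustration.

```python
import re

# Assumption: an address has a unit number when its first group of digits is
# followed, after commas/spaces, by another group of digits.
UNIT_RE = re.compile(r"^\D*(\d+)[,\s]+(?=\d)")


def split_phone(raw):
    """Return (phone_number, extension) from a messy phone string."""
    digits = re.sub(r"[^0-9]", "", raw)   # keep digits only
    phone = digits[:10]                   # first 10 digits are the number
    extension = digits[10:] or None       # anything left over is the extension
    return phone, extension


def format_phone(phone, sep="-"):
    """Format a 10-digit string as 3-3-4 groups joined by sep."""
    return sep.join((phone[:3], phone[3:6], phone[6:10]))


def split_address(address):
    """Return (unit_number, street_number, rest_of_address)."""
    unit = None
    match = UNIT_RE.match(address)
    if match:
        unit = match.group(1)
        address = address[match.end():]   # snip the unit off the front
    match = re.search(r"\d+", address)    # next group of digits is the street number
    if match is None:
        return unit, None, address.strip()
    return unit, match.group(0), address[match.end():].strip()


if __name__ == "__main__":
    phone, extension = split_phone("(555) 123-4567 x89")  # hypothetical input
    print(format_phone(phone), extension)
    print(split_address("Apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana"))
    print(split_address("2353 Cathryn Pass, West Jenniferview, Illinois"))
```

Like the SQL version, this assumes every phone number is a plain 10-digit number with no country code in front, so it is worth eyeballing the results before trusting them.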