This file is available as both an .Rmd and .html file. If you already have access to the rstudio server at https://rstudio-ztrax.demog.berkeley.edu, then the most recent vresions of these files are in your home directory and you should be able to simply click on QuickStart.Rmd or QuickStart.html from Rstudio’s “Files” pane to get here… Perhaps you already did. The HTML version of the file provides an easier to look at experience and includes the output. The .Rmd version provides that interactive experience.

Quick Start

The ZTRAX data come in three “branches” :

  1. Zillow Transactions - Transactions (ZtranTran)
  2. Zillow Transactions - Assessor (ZtranZasmt)
  3. Zillow Assessor (Zasmt)

The first two branches appear to hold data related to real estate transactions from the point of view of the buyer/seller and from the point of view of the county Assessor. The third branch is mysterious and quite a bit smaller than then other two.

Each of these branches is further divided into 51 state level sub directories each holding a set of related files. So for example the “Zillow Transactions - Transactions” branch has 51 sub directories one of which holds 21 files related to Real Estate transactions that took place in California.

The complete file structure is shown below, but for the moment, you need not engage with it because a locally written package will allow you to read a state’s worth of data without knowing exactly where it is.

The ZTAX organization scheme is very conducive to constructing a data set for a single small state; making sure that it works as expected; then as a BATCH job –creating a big data set for the entire nation and running your code on it.

Heads up on Rstudio and large data sets

  1. Unless the machine crashes, your session will stay up and ready to use for two days after you stop doing stuff. (BUT…)
  2. AFTER 48 hours of idleness, Rstudio will suspend your session to disk and if it’s big it could take a long time to resuscitate – See the section “Acts of Desperation” below for what you might do in this case.
  3. While sessions are idle but not yet suspended, they will be consuming resources that other scientists could be using. So at a minimum, you should feel guilty about this, but better yet – save and exit when you’re done working for the day.
  4. Unsaved changes in your idle session will be lost when for whatever reason, your session dies.

Best practice is to save your work constantly and to organize your work so that you can read your data easily (if not quickly). Experiment with fwrite() and fread() for writing .csv files to disk and with save() to write R objects directly to binary format files on disk. I recommend against storing your data in a single 20GB+ .RData file – those are also dog slow to load back.

To disable these to features go to Tools -> Global Options.

nice -n 15 nohup Rscript filename.R

to launch the job whose instructions are in filename.R – with a priority that will be “nice” to your colleagues. The “nohup” means the R job will continue after you hangup, i.e. exit the shell from which your launched the job.

Acts of Desperation

When you are operating at the cutting edge of human knowledge, things will go wrong from time to time. When things go wrong in Rstudio, you will often begin to see “504 warning messages” this is very general message that essentially means that your R session is not behaving as your Rstudio session thinks it should – most likely R is just not responding. Since R and Rstudio are distinct processes, the latter has no way of knowing if the former is just being slow (because your data set is humongous) or if it’s gone off the rails. Don’t pannic at the first instance of a 504 warning but after a suficient interval, there are three acts of desperation that you may wish to contemplate:

  1. Terminate and restart your R session: Look under the “Session” menu for some options here.
  2. Terminate and restart your Rstudio session:
  1. Send email to

Example: Reading and exploring a single state’s worth of transaction data

In this example we use the locally written demogztrax package to read in Zillow data on transactions in one small state.

#  load the locally written demogztrax package
library(demogztrax)
## Loading required package: data.table
# read in a bunch of files
vermont <- ztrax_fread(branch="ZtranTran",state="VT")
## [1] "Reading data on  Vermont ..."
## [1] ""
## [1] "reading  ZtranTran : Vermont : BKManagedSpecific.txt ..."
## [1] "reading  ZtranTran : Vermont : BorrowerMailAddress.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf35d21e2db' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : BorrowerNameDescriptionCode.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf32c95a41f' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : BorrowerName.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf361f03c7e' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : BuyerMailAddress.txt ..."
## [1] "reading  ZtranTran : Vermont : BuyerNameDescriptionCode.txt ..."
## [1] "reading  ZtranTran : Vermont : BuyerName.txt ..."
## [1] "reading  ZtranTran : Vermont : ForeclosureNameAddress.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf34be5fd3e' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : ForeclosureNODNOSDoc.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf37d0624f6' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : ForeclosureOriginalLoan.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf312a691e6' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : HawaiiBorrowerNotes.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf365827867' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : HawaiiGranteeNotes.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf36e5b37eb' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : Legacy.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf36176abe3' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : Main.txt ..."
## [1] "reading  ZtranTran : Vermont : Modification.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf3206ab4e8' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : PropertyInfo.txt ..."
## [1] "reading  ZtranTran : Vermont : RegionSpecific.txt ..."
## Warning in fread(cmd = paste("tr <", f, "-dc '[:print:]\n' "), quote =
## "", : File '/tmp/RtmpUYu8E6/filefcf3aa784db' has size 0. Returning a NULL
## data.table.
## [1] "reading  ZtranTran : Vermont : SellerMailAddress.txt ..."
## [1] "reading  ZtranTran : Vermont : SellerNameDescriptionCode.txt ..."
## [1] "reading  ZtranTran : Vermont : SellerName.txt ..."

It takes a minute or two to read the 3GB or so of ZtranTran data for Vermont into RAM. The server we are running on has about 512GB of RAM so this is not a problem – when we build a data set of the entire US, we’ll be much more careful about discarding or not reading in the variables that we don’t need.

Of the twenty ZtranTran files, in the case of Vermont, several are empty. These generate a slightly cryptic warning message and empty element in the “vermont” list object. HawaiiGranteeNotes.txt is one such empty file.

The “vermont” object is a list each element of which is a data.table – an enhanced version of the data.frame. Data.tables are particularly well suited to the Ztrax data because data.tables can do SQL like operations very efficiently and the ZTRAX data are dumps from a relational (SQL) data base.

names(vermont)
##  [1] "BKManagedSpecific.txt"           "BorrowerMailAddress.txt"        
##  [3] "BorrowerNameDescriptionCode.txt" "BorrowerName.txt"               
##  [5] "BuyerMailAddress.txt"            "BuyerNameDescriptionCode.txt"   
##  [7] "BuyerName.txt"                   "ForeclosureNameAddress.txt"     
##  [9] "ForeclosureNODNOSDoc.txt"        "ForeclosureOriginalLoan.txt"    
## [11] "HawaiiBorrowerNotes.txt"         "HawaiiGranteeNotes.txt"         
## [13] "Legacy.txt"                      "Main.txt"                       
## [15] "Modification.txt"                "PropertyInfo.txt"               
## [17] "RegionSpecific.txt"              "SellerMailAddress.txt"          
## [19] "SellerNameDescriptionCode.txt"   "SellerName.txt"

Each data.table (element) of the vermont (list object) is organized as you would expect. Each row is an observation, each column is a variable – same as any data.frame or as any relational database table. lapply provides a handy way to operate sequentially on each element of a list.

lapply() is used here to show the variable (column) names for each table in the vermont object.

# List the variables in each file pertaining to ZtranTran (Transactions) in Vermont.
lapply(vermont,function(x){names(x)})
## $BKManagedSpecific.txt
## [1] "TransID"        "Loan1TransType" "Loan2TransType" "DeedTransType" 
## [5] "BKFSPID"        "FIPS"           "BatchID"       
## 
## $BorrowerMailAddress.txt
## character(0)
## 
## $BorrowerNameDescriptionCode.txt
## character(0)
## 
## $BorrowerName.txt
## character(0)
## 
## $BuyerMailAddress.txt
##  [1] "TransId"                            
##  [2] "BuyerMailSequenceNumber"            
##  [3] "BuyerMailCareOfName"                
##  [4] "BuyerMailHouseNumber"               
##  [5] "BuyerMailHouseNumberExt"            
##  [6] "BuyerMailStreetPreDirectional"      
##  [7] "BuyerMailStreetName"                
##  [8] "BuyerMailStreetSuffix"              
##  [9] "BuyerMailStreetPostDirectional"     
## [10] "BuyerMailBuildingName"              
## [11] "BuyerMailBuildingNumber"            
## [12] "BuyerMailFullStreetAddress"         
## [13] "BuyerMailCity"                      
## [14] "BuyerMailState"                     
## [15] "BuyerMailZip"                       
## [16] "BuyerMailZip4"                      
## [17] "OriginalBuyerMailFullStreetAddress" 
## [18] "OriginalBuyerMailAddressLastline"   
## [19] "BuyerMailAddressStndCode"           
## [20] "BuyerMailAddressMatchCode"          
## [21] "BuyerMailAddressUnitDesignatorCode" 
## [22] "BuyerMailAddressUnitNumber"         
## [23] "BuyerMailAddressCarrierRoute"       
## [24] "BuyerMailAddressFIPSCode"           
## [25] "BuyerMailAddressLatitude"           
## [26] "BuyerMailAddressLongitude"          
## [27] "BuyerMailAddressCensusTractAndBlock"
## [28] "BuyerMailAddressConfidenceScore"    
## [29] "BuyerMailAddressCBSACode"           
## [30] "BuyerMailAddressCBSADivisionCode"   
## [31] "BuyerMailAddressMatchType"          
## [32] "BuyerMailAddressDPV"                
## [33] "BuyerMailAddressGeocodeQualityCode" 
## [34] "BuyerMailAddressQualityCode"        
## [35] "LoadID"                             
## [36] "FIPS"                               
## [37] "BatchID"                            
## 
## $BuyerNameDescriptionCode.txt
## [1] "TransId"                  "BuyerDescriptionStndCode"
## [3] "BuyerDescSequenceNumber"  "BuyerNameSequenceNumber" 
## [5] "FIPS"                     "BatchID"                 
## 
## $BuyerName.txt
##  [1] "TransId"                 "BuyerFirstMiddleName"   
##  [3] "BuyerLastName"           "BuyerIndividualFullName"
##  [5] "BuyerNonIndividualName"  "BuyerNameSequenceNumber"
##  [7] "BuyerMailSequenceNumber" "LoadID"                 
##  [9] "FIPS"                    "BatchID"                
## 
## $ForeclosureNameAddress.txt
## character(0)
## 
## $ForeclosureNODNOSDoc.txt
## character(0)
## 
## $ForeclosureOriginalLoan.txt
## character(0)
## 
## $HawaiiBorrowerNotes.txt
## character(0)
## 
## $HawaiiGranteeNotes.txt
## character(0)
## 
## $Legacy.txt
## character(0)
## 
## $Main.txt
##   [1] "TransId"                         "FIPS"                           
##   [3] "State"                           "County"                         
##   [5] "DataClassStndCode"               "RecordTypeStndCode"             
##   [7] "RecordingDate"                   "RecordingDocumentNumber"        
##   [9] "RecordingBookNumber"             "RecordingPageNumber"            
##  [11] "ReRecordedCorrectionStndCode"    "PriorRecordingDate"             
##  [13] "PriorDocumentDate"               "PriorDocumentNumber"            
##  [15] "PriorBookNumber"                 "PriorPageNumber"                
##  [17] "DocumentTypeStndCode"            "DocumentDate"                   
##  [19] "SignatureDate"                   "EffectiveDate"                  
##  [21] "BuyerVestingStndCode"            "BuyerMultiVestingFlag"          
##  [23] "PartialInterestTransferStndCode" "PartialInterestTransferPercent" 
##  [25] "SalesPriceAmount"                "SalesPriceAmountStndCode"       
##  [27] "CityTransferTax"                 "CountyTransferTax"              
##  [29] "StateTransferTax"                "TotalTransferTax"               
##  [31] "IntraFamilyTransferFlag"         "TransferTaxExemptFlag"          
##  [33] "PropertyUseStndCode"             "AssessmentLandUseStndCode"      
##  [35] "OccupancyStatusStndCode"         "LegalStndCode"                  
##  [37] "BorrowerVestingStndCode"         "LenderName"                     
##  [39] "LenderTypeStndCode"              "LenderIDStndCode"               
##  [41] "LenderDBAName"                   "DBALenderTypeStndCode"          
##  [43] "DBALenderIDStndCode"             "LenderMailCareOfName"           
##  [45] "LenderMailHouseNumber"           "LenderMailHouseNumberExt"       
##  [47] "LenderMailStreetPreDirectional"  "LenderMailStreetName"           
##  [49] "LenderMailStreetSuffix"          "LenderMailStreetPostDirectional"
##  [51] "LenderMailFullStreetAddress"     "LenderMailBuildingName"         
##  [53] "LenderMailBuildingNumber"        "LenderMailUnitDesignator"       
##  [55] "LenderMailUnit"                  "LenderMailCity"                 
##  [57] "LenderMailState"                 "LenderMailZip"                  
##  [59] "LenderMailZip4"                  "LoanAmount"                     
##  [61] "LoanAmountStndCode"              "MaximumLoanAmount"              
##  [63] "LoanTypeStndCode"                "LoanTypeClosedOpenEndStndCode"  
##  [65] "LoanTypeFutureAdvanceFlag"       "LoanTypeProgramStndCode"        
##  [67] "LoanRateTypeStndCode"            "LoanDueDate"                    
##  [69] "LoanTermMonths"                  "LoanTermYears"                  
##  [71] "InitialInterestRate"             "ARMFirstAdjustmentDate"         
##  [73] "ARMFirstAdjustmentMaxRate"       "ARMFirstAdjustmentMinRate"      
##  [75] "ARMIndexStndCode"                "ARMAdjustmentFrequencyStndCode" 
##  [77] "ARMMargin"                       "ARMInitialCap"                  
##  [79] "ARMPeriodicCap"                  "ARMLifetimeCap"                 
##  [81] "ARMMaxInterestRate"              "ARMMinInterestRate"             
##  [83] "InterestOnlyFlag"                "InterestOnlyTerm"               
##  [85] "PrepaymentPenaltyFlag"           "PrepaymentPenaltyTerm"          
##  [87] "BiWeeklyPaymentFlag"             "AssumabilityRiderFlag"          
##  [89] "BalloonRiderFlag"                "CondominiumRiderFlag"           
##  [91] "PlannedUnitDevelopmentRiderFlag" "SecondHomeRiderFlag"            
##  [93] "OneToFourFamilyRiderFlag"        "ConcurrentMtgeDocOrBkPg"        
##  [95] "LoanNumber"                      "MERSMINNumber"                  
##  [97] "CaseNumber"                      "MERSFlag"                       
##  [99] "TitleCompanyName"                "TitleCompanyIDStndCode"         
## [101] "AccommodationRecordingFlag"      "UnpaidBalance"                  
## [103] "InstallmentAmount"               "InstallmentDueDate"             
## [105] "TotalDelinquentAmount"           "DelinquentAsOfDate"             
## [107] "CurrentLender"                   "CurrentLenderTypeStndCode"      
## [109] "CurrentLenderIDStndCode"         "TrusteeSaleNumber"              
## [111] "AttorneyFileNumber"              "AuctionDate"                    
## [113] "AuctionTime"                     "AuctionFullStreetAddress"       
## [115] "AuctionCityName"                 "StartingBid"                    
## [117] "KeyedDate"                       "KeyerID"                        
## [119] "SubVendorStndCode"               "ImageFileName"                  
## [121] "BuilderFlag"                     "MatchStndCode"                  
## [123] "REOStndCode"                     "UpdateOwnershipFlag"            
## [125] "LoadID"                          "StatusInd"                      
## [127] "TransactionTypeStndCode"         "BatchID"                        
## [129] "BKFSPID"                         "ZVendorStndCode"                
## [131] "SourceChkSum"                   
## 
## $Modification.txt
## character(0)
## 
## $PropertyInfo.txt
##  [1] "TransId"                           
##  [2] "AssessorParcelNumber"              
##  [3] "APNIndicatorStndCode"              
##  [4] "TaxIDNumber"                       
##  [5] "TaxIDIndicatorStndCode"            
##  [6] "UnformattedAssessorParcelNumber"   
##  [7] "AlternateParcelNumber"             
##  [8] "HawaiiCondoCPRCode"                
##  [9] "PropertyHouseNumber"               
## [10] "PropertyHouseNumberExt"            
## [11] "PropertyStreetPreDirectional"      
## [12] "PropertyStreetName"                
## [13] "PropertyStreetSuffix"              
## [14] "PropertyStreetPostDirectional"     
## [15] "PropertyBuildingNumber"            
## [16] "PropertyFullStreetAddress"         
## [17] "PropertyCity"                      
## [18] "PropertyState"                     
## [19] "PropertyZip"                       
## [20] "PropertyZip4"                      
## [21] "OriginalPropertyFullStreetAddress" 
## [22] "OriginalPropertyAddressLastline"   
## [23] "PropertyAddressStndCode"           
## [24] "LegalLot"                          
## [25] "LegalOtherLot"                     
## [26] "LegalLotCode"                      
## [27] "LegalBlock"                        
## [28] "LegalSubdivisionName"              
## [29] "LegalCondoProjectPUDDevName"       
## [30] "LegalBuildingNumber"               
## [31] "LegalUnit"                         
## [32] "LegalSection"                      
## [33] "LegalPhase"                        
## [34] "LegalTract"                        
## [35] "LegalDistrict"                     
## [36] "LegalMunicipality"                 
## [37] "LegalCity"                         
## [38] "LegalTownship"                     
## [39] "LegalSTRSection"                   
## [40] "LegalSTRTownship"                  
## [41] "LegalSTRRange"                     
## [42] "LegalSTRMeridian"                  
## [43] "LegalSecTwnRngMer"                 
## [44] "LegalRecordersMapReference"        
## [45] "LegalDescription"                  
## [46] "LegalLotSize"                      
## [47] "PropertySequenceNumber"            
## [48] "PropertyAddressMatchcode"          
## [49] "PropertyAddressUnitDesignator"     
## [50] "PropertyAddressUnitNumber"         
## [51] "PropertyAddressCarrierRoute"       
## [52] "PropertyAddressGeoCodeMatchCode"   
## [53] "PropertyAddressLatitude"           
## [54] "PropertyAddressLongitude"          
## [55] "PropertyAddressCensusTractAndBlock"
## [56] "PropertyAddressConfidenceScore"    
## [57] "PropertyAddressCBSACode"           
## [58] "PropertyAddressCBSADivisionCode"   
## [59] "PropertyAddressMatchType"          
## [60] "PropertyAddressDPV"                
## [61] "PropertyGeocodeQualityCode"        
## [62] "PropertyAddressQualityCode"        
## [63] "FIPS"                              
## [64] "LoadID"                            
## [65] "ImportParcelID"                    
## [66] "BKFSPID"                           
## [67] "AssessmentRecordMatchFlag"         
## [68] "BatchID"                           
## 
## $RegionSpecific.txt
## character(0)
## 
## $SellerMailAddress.txt
##  [1] "TransId"                             
##  [2] "SellerMailSequenceNumber"            
##  [3] "SellerMailAddressStndCode"           
##  [4] "SellerMailCareOfName"                
##  [5] "SellerMailHouseNumber"               
##  [6] "SellerMailHouseNumberExt"            
##  [7] "SellerMailStreetPreDirectional"      
##  [8] "SellerMailStreetName"                
##  [9] "SellerMailStreetSuffix"              
## [10] "SellerMailStreetPostDirectional"     
## [11] "SellerMailBuildingName"              
## [12] "SellerMailBuildingNumber"            
## [13] "SellerMailFullStreetAddress"         
## [14] "SellerMailCity"                      
## [15] "SellerMailState"                     
## [16] "SellerMailZip"                       
## [17] "SellerMailZip4"                      
## [18] "OriginalSellerMailFullStreetAddress" 
## [19] "OriginalSellerMailAddressLastline"   
## [20] "SellerMailAddressMatchCode"          
## [21] "SellerMailAddressUnitDesignatorCode" 
## [22] "SellerMailAddressUnitNumber"         
## [23] "SellerMailAddressCarrierRoute"       
## [24] "SellerMailAddressFIPSCode"           
## [25] "SellerMailAddressLatitude"           
## [26] "SellerMailAddressLongitude"          
## [27] "SellerMailAddressCensusTractAndBlock"
## [28] "SellerMailAddressConfidenceScore"    
## [29] "SellerMailAddressCBSACode"           
## [30] "SellerMailAddressCBSADivisionCode"   
## [31] "SellerMailAddressMatchType"          
## [32] "SellerMailAddressDPV"                
## [33] "SellerMailAddressGeocodeQualityCode" 
## [34] "SellerMailAddressQualityCode"        
## [35] "LoadID"                              
## [36] "FIPS"                                
## [37] "BatchID"                             
## 
## $SellerNameDescriptionCode.txt
## [1] "TransId"                   "SellerDescriptionStndCode"
## [3] "SellerDescSequenceNumber"  "SellerNameSequenceNumber" 
## [5] "FIPS"                      "BatchID"                  
## 
## $SellerName.txt
##  [1] "TransId"                  "SellerFirstMiddleName"   
##  [3] "SellerLastName"           "SellerIndividualFullName"
##  [5] "SellerNonIndividualName"  "SellerNameSequenceNumber"
##  [7] "SellerMailSequenceNumber" "LoadID"                  
##  [9] "FIPS"                     "BatchID"

Example: Average sale price by buyer’s state of residence

To find average sale price of a Vermont property by buyer’s state of residence we need two tables: Main.txt which has the “SalesPriceAmount” variable, and BuyerMailAddress.txt which contains the “BuyerMailState” column. TransId is apparently the common variable so we’ll use it to do an “inner join” of the two tables with the information we are looking for. See https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html for more details on how joins work.

# join two of the twenty vermont tables along the common value of TransId
# use setkey to define the variable to use as index
setkey(vermont$Main.txt, TransId)
setkey(vermont$BuyerMailAddress.txt,TransId)
# create a new table by merging (joining) the two tables
mainBuyer<- merge(vermont$Main.txt,vermont$BuyerMailAddress.txt,all=FALSE)
## inspect dimensions of results
dim(vermont$Main.txt)
## [1] 797811    131
dim(vermont$BuyerMailAddress.txt)
## [1] 797811     37
dim(mainBuyer)
## [1] 797811    167

Note that this join took very little time. Also note that from the dim() statements it looks like this join did what we expected – both input tables have the same number of rows (not necessary for inner joins, but implies that there is one observation per property in each table) and the output table has the sum of the number of columns in the input tables minus 1 – because the index variable TransId is in both tables.

Once the reading and sql-like manipulations are complete, the case for using data.table rather than say tidyverse or base R is weaker, we’ll switch to tidyverse for variety in the next step – taking the mean of the price by state of residence of the buyer.

require(tidyverse)
## Loading required package: tidyverse
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::between()   masks data.table::between()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::first()     masks data.table::first()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::last()      masks data.table::last()
## ✖ purrr::transpose() masks data.table::transpose()
require(ggplot2)
mainBuyer %>% group_by(BuyerMailState)  %>% summarise(meanPrice=mean(SalesPriceAmount)) %>% ggplot() + geom_point(aes(x=reorder(BuyerMailState,meanPrice), y=meanPrice)) + coord_flip()  + theme(axis.text=element_text(size=5,face='bold'))

Curiously, there are a few state’s that are not among the 50 that we know about. NA, VZ, 01 for example.

Example : Building a national level data set

Let’s start by reminding ourselves of how many tables and variables we are dealing with. This time we’ll look at the Assessor data rather than the Transaction data but it’s still huge. We could, as we did above, simply read in a state’s worth of data using ztrax_fread() in order to observe the tables and variable names. But a couple of other demogztrax package functions can get that information for us without reading in all those GB.

for (f in ztrax_path_to(branch="ZTranAsmt",state="CA",asFiles = TRUE)){
  print(f)
  print(ztrax_varnames(f))
}
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/AdditionalPropertyAddress.txt"
##       tabname                       vars                                
##  [1,] "utAdditionalPropertyAddress" "RowID"                             
##  [2,] "utAdditionalPropertyAddress" "PropertyAddressSequenceNumber"     
##  [3,] "utAdditionalPropertyAddress" "PropertyHouseNumber"               
##  [4,] "utAdditionalPropertyAddress" "PropertyHouseNumberExt"            
##  [5,] "utAdditionalPropertyAddress" "PropertyStreetPreDirectional"      
##  [6,] "utAdditionalPropertyAddress" "PropertyStreetName"                
##  [7,] "utAdditionalPropertyAddress" "PropertyStreetSuffix"              
##  [8,] "utAdditionalPropertyAddress" "PropertyStreetPostDirectional"     
##  [9,] "utAdditionalPropertyAddress" "PropertyBuildingNumber"            
## [10,] "utAdditionalPropertyAddress" "PropertyFullStreetAddress"         
## [11,] "utAdditionalPropertyAddress" "PropertyCity"                      
## [12,] "utAdditionalPropertyAddress" "PropertyState"                     
## [13,] "utAdditionalPropertyAddress" "PropertyZip"                       
## [14,] "utAdditionalPropertyAddress" "PropertyZip4"                      
## [15,] "utAdditionalPropertyAddress" "OriginalPropertyFullStreetAddress" 
## [16,] "utAdditionalPropertyAddress" "OriginalPropertyAddressLastline"   
## [17,] "utAdditionalPropertyAddress" "PropertyAddressCensusTract"        
## [18,] "utAdditionalPropertyAddress" "PropertyAddressMatchcode"          
## [19,] "utAdditionalPropertyAddress" "PropertyAddressUnitDesignator"     
## [20,] "utAdditionalPropertyAddress" "PropertyAddressUnitNumber"         
## [21,] "utAdditionalPropertyAddress" "PropertyAddressCarrierRoute"       
## [22,] "utAdditionalPropertyAddress" "PropertyAddressGeoCodeMatchCode"   
## [23,] "utAdditionalPropertyAddress" "PropertyAddressLatitude"           
## [24,] "utAdditionalPropertyAddress" "PropertyAddressLongitude"          
## [25,] "utAdditionalPropertyAddress" "PropertyAddressCensusTractAndBlock"
## [26,] "utAdditionalPropertyAddress" "PropertyAddressConfidenceScore"    
## [27,] "utAdditionalPropertyAddress" "PropertyAddressCBSACode"           
## [28,] "utAdditionalPropertyAddress" "PropertyAddressCBSADivisionCode"   
## [29,] "utAdditionalPropertyAddress" "PropertyAddressMatchType"          
## [30,] "utAdditionalPropertyAddress" "PropertyAddressDPV"                
## [31,] "utAdditionalPropertyAddress" "PropertyGeocodeQualityCode"        
## [32,] "utAdditionalPropertyAddress" "PropertyAddressQualityCode"        
## [33,] "utAdditionalPropertyAddress" "FIPS"                              
## [34,] "utAdditionalPropertyAddress" "BatchID"                           
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/BKManagedSpecific.txt"
##      tabname               vars           
## [1,] "utBKManagedSpecific" "RowID"        
## [2,] "utBKManagedSpecific" "LotSizeUnit"  
## [3,] "utBKManagedSpecific" "LotSizeorArea"
## [4,] "utBKManagedSpecific" "BKFSPID"      
## [5,] "utBKManagedSpecific" "BKFSLoadDate" 
## [6,] "utBKManagedSpecific" "FIPS"         
## [7,] "utBKManagedSpecific" "BatchID"      
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/BuildingAreas.txt"
##      tabname           vars                         
## [1,] "utBuildingAreas" "RowID"                      
## [2,] "utBuildingAreas" "BuildingOrImprovementNumber"
## [3,] "utBuildingAreas" "BuildingAreaSequenceNumber" 
## [4,] "utBuildingAreas" "BuildingAreaStndCode"       
## [5,] "utBuildingAreas" "BuildingAreaSqFt"           
## [6,] "utBuildingAreas" "FIPS"                       
## [7,] "utBuildingAreas" "BatchID"                    
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/Building.txt"
##       tabname      vars                                 
##  [1,] "utBuilding" "RowID"                              
##  [2,] "utBuilding" "NoOfUnits"                          
##  [3,] "utBuilding" "OccupancyStatusStndCode"            
##  [4,] "utBuilding" "PropertyCountyLandUseDescription"   
##  [5,] "utBuilding" "PropertyCountyLandUseCode"          
##  [6,] "utBuilding" "PropertyLandUseStndCode"            
##  [7,] "utBuilding" "PropertyStateLandUseDescription"    
##  [8,] "utBuilding" "PropertyStateLandUseCode"           
##  [9,] "utBuilding" "BuildingOrImprovementNumber"        
## [10,] "utBuilding" "BuildingClassStndCode"              
## [11,] "utBuilding" "BuildingQualityStndCode"            
## [12,] "utBuilding" "BuildingQualityStndCodeOriginal"    
## [13,] "utBuilding" "BuildingConditionStndCode"          
## [14,] "utBuilding" "ArchitecturalStyleStndCode"         
## [15,] "utBuilding" "YearBuilt"                          
## [16,] "utBuilding" "EffectiveYearBuilt"                 
## [17,] "utBuilding" "YearRemodeled"                      
## [18,] "utBuilding" "NoOfStories"                        
## [19,] "utBuilding" "TotalRooms"                         
## [20,] "utBuilding" "TotalBedrooms"                      
## [21,] "utBuilding" "TotalKitchens"                      
## [22,] "utBuilding" "FullBath"                           
## [23,] "utBuilding" "ThreeQuarterBath"                   
## [24,] "utBuilding" "HalfBath"                           
## [25,] "utBuilding" "QuarterBath"                        
## [26,] "utBuilding" "TotalCalculatedBathCount"           
## [27,] "utBuilding" "TotalActualBathCount"               
## [28,] "utBuilding" "BathSourceStndCode"                 
## [29,] "utBuilding" "TotalBathPlumbingFixtures"          
## [30,] "utBuilding" "RoofCoverStndCode"                  
## [31,] "utBuilding" "RoofStructureTypeStndCode"          
## [32,] "utBuilding" "HeatingTypeorSystemStndCode"        
## [33,] "utBuilding" "AirConditioningTypeorSystemStndCode"
## [34,] "utBuilding" "FoundationTypeStndCode"             
## [35,] "utBuilding" "ElevatorStndCode"                   
## [36,] "utBuilding" "FireplaceFlag"                      
## [37,] "utBuilding" "FirePlaceTypeStndCode"              
## [38,] "utBuilding" "FireplaceNumber"                    
## [39,] "utBuilding" "WaterStndCode"                      
## [40,] "utBuilding" "SewerStndCode"                      
## [41,] "utBuilding" "MortgageLenderName"                 
## [42,] "utBuilding" "TimeshareStndCode"                  
## [43,] "utBuilding" "Comments"                           
## [44,] "utBuilding" "LoadID"                             
## [45,] "utBuilding" "StoryTypeStndCode"                  
## [46,] "utBuilding" "FIPS"                               
## [47,] "utBuilding" "BatchID"                            
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/CareOfName.txt"
##      tabname        vars                   
## [1,] "utCareOfName" "RowID"                
## [2,] "utCareOfName" "NameSequenceNumber"   
## [3,] "utCareOfName" "NameTypeStndCode"     
## [4,] "utCareOfName" "MailCareOfName"       
## [5,] "utCareOfName" "MailCOPatternStndCode"
## [6,] "utCareOfName" "FIPS"                 
## [7,] "utCareOfName" "BatchID"              
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/ExteriorWall.txt"
##      tabname          vars                         
## [1,] "utExteriorWall" "RowID"                      
## [2,] "utExteriorWall" "BuildingOrImprovementNumber"
## [3,] "utExteriorWall" "ExteriorWallStndCode"       
## [4,] "utExteriorWall" "ExteriorWallPercent"        
## [5,] "utExteriorWall" "FIPS"                       
## [6,] "utExteriorWall" "BatchID"                    
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/ExtraFeature.txt"
##      tabname          vars                         
## [1,] "utExtraFeature" "RowID"                      
## [2,] "utExtraFeature" "BuildingOrImprovementNumber"
## [3,] "utExtraFeature" "ExtraFeatureSequenceNumber" 
## [4,] "utExtraFeature" "ExtraFeaturesStndCode"      
## [5,] "utExtraFeature" "ExtraFeaturesSqFt"          
## [6,] "utExtraFeature" "ExtraFeaturesLinearFt"      
## [7,] "utExtraFeature" "FIPS"                       
## [8,] "utExtraFeature" "BatchID"                    
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/Garage.txt"
##      tabname    vars                         
## [1,] "utGarage" "RowID"                      
## [2,] "utGarage" "BuildingOrImprovementNumber"
## [3,] "utGarage" "GarageSequenceNumber"       
## [4,] "utGarage" "GarageStndCode"             
## [5,] "utGarage" "GarageAreaSqFt"             
## [6,] "utGarage" "GarageNoOfCars"             
## [7,] "utGarage" "FIPS"                       
## [8,] "utGarage" "BatchID"                    
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/InteriorFlooring.txt"
##      tabname              vars                          
## [1,] "utInteriorFlooring" "RowID"                       
## [2,] "utInteriorFlooring" "BuildingOrImprovementNumber" 
## [3,] "utInteriorFlooring" "InteriorFlooringTypeStndCode"
## [4,] "utInteriorFlooring" "FlooringPercent"             
## [5,] "utInteriorFlooring" "FIPS"                        
## [6,] "utInteriorFlooring" "BatchID"                     
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/InteriorWall.txt"
##      tabname          vars                         
## [1,] "utInteriorWall" "RowID"                      
## [2,] "utInteriorWall" "BuildingOrImprovementNumber"
## [3,] "utInteriorWall" "InteriorWallStndCode"       
## [4,] "utInteriorWall" "InteriorWallPercent"        
## [5,] "utInteriorWall" "FIPS"                       
## [6,] "utInteriorWall" "BatchID"                    
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/LotSiteAppeal.txt"
##      tabname           vars                   
## [1,] "utLotSiteAppeal" "RowID"                
## [2,] "utLotSiteAppeal" "LotSiteAppealStndCode"
## [3,] "utLotSiteAppeal" "FIPS"                 
## [4,] "utLotSiteAppeal" "BatchID"              
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/MailAddress.txt"
##       tabname         vars                            
##  [1,] "utMailAddress" "RowID"                         
##  [2,] "utMailAddress" "MailAddressTypeStndCode"       
##  [3,] "utMailAddress" "MailHouseNumber"               
##  [4,] "utMailAddress" "MailHouseNumberExt"            
##  [5,] "utMailAddress" "MailStreetPreDirectional"      
##  [6,] "utMailAddress" "MailStreetName"                
##  [7,] "utMailAddress" "MailStreetSuffix"              
##  [8,] "utMailAddress" "MailStreetPostDirectional"     
##  [9,] "utMailAddress" "MailBuildingName"              
## [10,] "utMailAddress" "MailBuildingNumber"            
## [11,] "utMailAddress" "MailFullStreetAddress"         
## [12,] "utMailAddress" "MailCity"                      
## [13,] "utMailAddress" "MailState"                     
## [14,] "utMailAddress" "MailZip"                       
## [15,] "utMailAddress" "MailZip4"                      
## [16,] "utMailAddress" "OriginalMailFullStreetAddress" 
## [17,] "utMailAddress" "OriginalMailAddressLastline"   
## [18,] "utMailAddress" "MailCityStateZip"              
## [19,] "utMailAddress" "MailInternationalAddressFlag"  
## [20,] "utMailAddress" "NameTypeStndCode"              
## [21,] "utMailAddress" "MailAddressMatchCode"          
## [22,] "utMailAddress" "MailAddressUnitDesignatorCode" 
## [23,] "utMailAddress" "MailAddressUnitNumber"         
## [24,] "utMailAddress" "MailAddressCarrierRoute"       
## [25,] "utMailAddress" "MailAddressFIPSCode"           
## [26,] "utMailAddress" "MailAddressLatitude"           
## [27,] "utMailAddress" "MailAddressLongitude"          
## [28,] "utMailAddress" "MailAddressCensusTractAndBlock"
## [29,] "utMailAddress" "MailAddressConfidenceScore"    
## [30,] "utMailAddress" "MailAddressCBSACode"           
## [31,] "utMailAddress" "MailAddressCBSADivisionCode"   
## [32,] "utMailAddress" "MailAddressMatchType"          
## [33,] "utMailAddress" "MailAddressDPV"                
## [34,] "utMailAddress" "MailAddressGeocodeQualityCode" 
## [35,] "utMailAddress" "MailAddressQualityCode"        
## [36,] "utMailAddress" "FIPS"                          
## [37,] "utMailAddress" "BatchID"                       
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/Main.txt"
##       tabname  vars                                
##  [1,] "utMain" "RowID"                             
##  [2,] "utMain" "ImportParcelID"                    
##  [3,] "utMain" "FIPS"                              
##  [4,] "utMain" "State"                             
##  [5,] "utMain" "County"                            
##  [6,] "utMain" "ValueCertDate"                     
##  [7,] "utMain" "ExtractDate"                       
##  [8,] "utMain" "Edition"                           
##  [9,] "utMain" "ZVendorStndCode"                   
## [10,] "utMain" "AssessorParcelNumber"              
## [11,] "utMain" "DupAPN"                            
## [12,] "utMain" "UnformattedAssessorParcelNumber"   
## [13,] "utMain" "ParcelSequenceNumber"              
## [14,] "utMain" "AlternateParcelNumber"             
## [15,] "utMain" "OldParcelNumber"                   
## [16,] "utMain" "ParcelNumberTypeStndCode"          
## [17,] "utMain" "RecordSourceStndCode"              
## [18,] "utMain" "RecordTypeStndCode"                
## [19,] "utMain" "ConfidentialRecordFlag"            
## [20,] "utMain" "PropertyAddressSourceStndCode"     
## [21,] "utMain" "PropertyHouseNumber"               
## [22,] "utMain" "PropertyHouseNumberExt"            
## [23,] "utMain" "PropertyStreetPreDirectional"      
## [24,] "utMain" "PropertyStreetName"                
## [25,] "utMain" "PropertyStreetSuffix"              
## [26,] "utMain" "PropertyStreetPostDirectional"     
## [27,] "utMain" "PropertyFullStreetAddress"         
## [28,] "utMain" "PropertyCity"                      
## [29,] "utMain" "PropertyState"                     
## [30,] "utMain" "PropertyZip"                       
## [31,] "utMain" "PropertyZip4"                      
## [32,] "utMain" "OriginalPropertyFullStreetAddress" 
## [33,] "utMain" "OriginalPropertyAddressLastline"   
## [34,] "utMain" "PropertyBuildingNumber"            
## [35,] "utMain" "PropertyZoningDescription"         
## [36,] "utMain" "PropertyZoningSourceCode"          
## [37,] "utMain" "CensusTract"                       
## [38,] "utMain" "TaxIDNumber"                       
## [39,] "utMain" "TaxAmount"                         
## [40,] "utMain" "TaxYear"                           
## [41,] "utMain" "TaxDelinquencyFlag"                
## [42,] "utMain" "TaxDelinquencyAmount"              
## [43,] "utMain" "TaxDelinquencyYear"                
## [44,] "utMain" "TaxRateCodeArea"                   
## [45,] "utMain" "LegalLot"                          
## [46,] "utMain" "LegalLotStndCode"                  
## [47,] "utMain" "LegalOtherLot"                     
## [48,] "utMain" "LegalBlock"                        
## [49,] "utMain" "LegalSubdivisionCode"              
## [50,] "utMain" "LegalSubdivisionName"              
## [51,] "utMain" "LegalCondoProjectPUDDevName"       
## [52,] "utMain" "LegalBuildingNumber"               
## [53,] "utMain" "LegalUnit"                         
## [54,] "utMain" "LegalSection"                      
## [55,] "utMain" "LegalPhase"                        
## [56,] "utMain" "LegalTract"                        
## [57,] "utMain" "LegalDistrict"                     
## [58,] "utMain" "LegalMunicipality"                 
## [59,] "utMain" "LegalCity"                         
## [60,] "utMain" "LegalTownship"                     
## [61,] "utMain" "LegalSTRSection"                   
## [62,] "utMain" "LegalSTRTownship"                  
## [63,] "utMain" "LegalSTRRange"                     
## [64,] "utMain" "LegalSTRMeridian"                  
## [65,] "utMain" "LegalSecTwnRngMer"                 
## [66,] "utMain" "LegalRecordersMapReference"        
## [67,] "utMain" "LegalDescription"                  
## [68,] "utMain" "LegalNeighborhoodSourceCode"       
## [69,] "utMain" "NoOfBuildings"                     
## [70,] "utMain" "LotSizeAcres"                      
## [71,] "utMain" "LotSizeSquareFeet"                 
## [72,] "utMain" "LotSizeFrontageFeet"               
## [73,] "utMain" "LotSizeDepthFeet"                  
## [74,] "utMain" "LotSizeIRR"                        
## [75,] "utMain" "LotSiteTopographyStndCode"         
## [76,] "utMain" "LoadID"                            
## [77,] "utMain" "PropertyAddressMatchcode"          
## [78,] "utMain" "PropertyAddressUnitDesignator"     
## [79,] "utMain" "PropertyAddressUnitNumber"         
## [80,] "utMain" "PropertyAddressCarrierRoute"       
## [81,] "utMain" "PropertyAddressGeoCodeMatchCode"   
## [82,] "utMain" "PropertyAddressLatitude"           
## [83,] "utMain" "PropertyAddressLongitude"          
## [84,] "utMain" "PropertyAddressCensusTractAndBlock"
## [85,] "utMain" "PropertyAddressConfidenceScore"    
## [86,] "utMain" "PropertyAddressCBSACode"           
## [87,] "utMain" "PropertyAddressCBSADivisionCode"   
## [88,] "utMain" "PropertyAddressMatchType"          
## [89,] "utMain" "PropertyAddressDPV"                
## [90,] "utMain" "PropertyGeocodeQualityCode"        
## [91,] "utMain" "PropertyAddressQualityCode"        
## [92,] "utMain" "SubEdition"                        
## [93,] "utMain" "BatchID"                           
## [94,] "utMain" "BKFSPID"                           
## [95,] "utMain" "SourceChkSum"                      
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/Name.txt"
##       tabname  vars                     
##  [1,] "utName" "RowID"                  
##  [2,] "utName" "Name"                   
##  [3,] "utName" "NameETStndCode"         
##  [4,] "utName" "NameDescriptionStndCode"
##  [5,] "utName" "NameSequenceNumber"     
##  [6,] "utName" "NameTypeStndCode"       
##  [7,] "utName" "NamePatternStndCode"    
##  [8,] "utName" "NameClassStndCode"      
##  [9,] "utName" "FIPS"                   
## [10,] "utName" "BatchID"                
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/Oby.txt"
##      tabname vars                         
## [1,] "utOby" "RowID"                      
## [2,] "utOby" "BuildingOrImprovementNumber"
## [3,] "utOby" "OBYSequenceNumber"          
## [4,] "utOby" "OBYStndCode"                
## [5,] "utOby" "OBYAreaSqFt"                
## [6,] "utOby" "FIPS"                       
## [7,] "utOby" "BatchID"                    
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/Pool.txt"
##      tabname  vars                         
## [1,] "utPool" "RowID"                      
## [2,] "utPool" "BuildingOrImprovementNumber"
## [3,] "utPool" "PoolStndCode"               
## [4,] "utPool" "PoolSize"                   
## [5,] "utPool" "FIPS"                       
## [6,] "utPool" "BatchID"                    
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/SaleData.txt"
##       tabname      vars                           
##  [1,] "utSaleData" "RowID"                        
##  [2,] "utSaleData" "SaleSeqNum"                   
##  [3,] "utSaleData" "SellerFullName"               
##  [4,] "utSaleData" "BuyerFullName"                
##  [5,] "utSaleData" "RecordingDate"                
##  [6,] "utSaleData" "DocumentDate"                 
##  [7,] "utSaleData" "RecordingDocumentNumber"      
##  [8,] "utSaleData" "RecordingBookNumber"          
##  [9,] "utSaleData" "RecordingPageNumber"          
## [10,] "utSaleData" "DocumentTypeCountyDescription"
## [11,] "utSaleData" "DocumentTypeStndCode"         
## [12,] "utSaleData" "SalesPriceAmount"             
## [13,] "utSaleData" "SalesPriceAmountStndCode"     
## [14,] "utSaleData" "FIPS"                         
## [15,] "utSaleData" "BatchID"                      
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/TaxDistrict.txt"
##      tabname         vars                 
## [1,] "utTaxDistrict" "RowID"              
## [2,] "utTaxDistrict" "TaxDistrictStndCode"
## [3,] "utTaxDistrict" "TaxDistrictName"    
## [4,] "utTaxDistrict" "TaxDistrictAmount"  
## [5,] "utTaxDistrict" "FIPS"               
## [6,] "utTaxDistrict" "BatchID"            
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/TaxExemption.txt"
##      tabname          vars                  
## [1,] "utTaxExemption" "RowID"               
## [2,] "utTaxExemption" "TaxExemptionStndCode"
## [3,] "utTaxExemption" "FIPS"                
## [4,] "utTaxExemption" "BatchID"             
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/TypeConstruction.txt"
##      tabname              vars                         
## [1,] "utTypeConstruction" "RowID"                      
## [2,] "utTypeConstruction" "BuildingOrImprovementNumber"
## [3,] "utTypeConstruction" "TypeConstructionStndCode"   
## [4,] "utTypeConstruction" "TypeConstructionPercent"    
## [5,] "utTypeConstruction" "FIPS"                       
## [6,] "utTypeConstruction" "BatchID"                    
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/Value.txt"
##       tabname   vars                       
##  [1,] "utValue" "RowID"                    
##  [2,] "utValue" "LandAssessedValue"        
##  [3,] "utValue" "ImprovementAssessedValue" 
##  [4,] "utValue" "TotalAssessedValue"       
##  [5,] "utValue" "AssessmentYear"           
##  [6,] "utValue" "LandMarketValue"          
##  [7,] "utValue" "ImprovementMarketValue"   
##  [8,] "utValue" "TotalMarketValue"         
##  [9,] "utValue" "MarketValueYear"          
## [10,] "utValue" "LandAppraisalValue"       
## [11,] "utValue" "ImprovementAppraisalValue"
## [12,] "utValue" "TotalAppraisalValue"      
## [13,] "utValue" "AppraisalValueYear"       
## [14,] "utValue" "FIPS"                     
## [15,] "utValue" "BatchID"                  
## [1] "/ztrax-data/raw/2019/Zillow_Transaction/06/ZAsmt/VestingCodes.txt"
##      tabname          vars                  
## [1,] "utVestingCodes" "RowID"               
## [2,] "utVestingCodes" "NameSequenceNumber"  
## [3,] "utVestingCodes" "NameTypeStndCode"    
## [4,] "utVestingCodes" "OwnerVestingStndCode"
## [5,] "utVestingCodes" "FIPS"                
## [6,] "utVestingCodes" "BatchID"

Since our goal this time is to read data from all 51 states, it’s going to be a lot more efficient if we read only the data that we need. Suppose we want assemble a data set that includes

  1. “Main” “State”
  2. “Main” “County”
  3. “Main” “LotSizeSquareFeet”
  4. “Value” “LandAssessedValue”
  5. “SaleData” “SalesPriceAmount”

That’s five variables from 3 tables. A reasonable approach will be to loop through all 51 states reading in ONLY the three tables – or better yet only the 5 columns of the 3 tables plus RowID so that we can join the tables. We can do the join, saving just what we need; joining the tables into a single data.table for each state; and then maybe discard the tables we read since we now have two copies of the data in RAM – then we can go on to the next state and do the same thing.

If one is reading very large amounts of data, it is useful run the gc() command after deleting large objects. gc() stands for “garbage collection” which is the process by which R returns disused RAM to the operating system. R can be very slow if left to itself to do this, explicitly running gc() can keep the world tidier and happier for everyone.

The next two cells present a reasonably good way of looping through the states to construct a national level data set.

###
# Step on Create a function that will read and process one state's worth of data
###
require(demogztrax)

getOneState<-function(cstate){
  # specify the varialbes that we want to keep make probably want to include RowId or 
  # TransId
  v2keep <- c("RowID","State","County","LotSizeSquareFeet","LandAssessedValue","SalePriceAmount")
  # note that the docs specify the table names as utSomething but the corresponding file is called Something.txt  ztrax_fread() is flexible
  tab2keep <- c("utMain","Value.txt","SaleData")
  
  print(cstate)
  cstate.dt <- ztrax_fread(branch = "ZTranAsmt", state= cstate,
                           keeptabs = tab2keep,
                           keepvars = v2keep)
  
  # combine the tables we just read into a single table by joining on RowID
  setkey(cstate.dt$Main.txt,RowID)
  setkey(cstate.dt$SaleData.txt,RowID)
  setkey(cstate.dt$Value.txt,RowID)
  res<-merge(cstate.dt$Main.txt,cstate.dt$SaleData.txt,all = FALSE)
  setkey(res,RowID)
  res<-merge(res,cstate.dt$Value.txt,all = FALSE)
  # remove the data that we read because we have anoter copy of it
  rm(cstate.dt)
  # force garbage collection - a cheap act of altruism
  gc()
  return(res)   
}
# test the function
system.time(wisconsin <- getOneState("WI"))
## [1] "WI"
## [1] "Reading data on  Wisconsin ..."
## [1] ""
## [1] "reading  ZTranAsmt : Wisconsin : Main.txt ..."
## [1] "Variables: RowID"             "Variables: State"            
## [3] "Variables: County"            "Variables: LotSizeSquareFeet"
## [1] "reading  ZTranAsmt : Wisconsin : SaleData.txt ..."
## [1] "Variables: RowID"
## [1] "reading  ZTranAsmt : Wisconsin : Value.txt ..."
## [1] "Variables: RowID"             "Variables: LandAssessedValue"
##    user  system elapsed 
##  60.722  17.779  43.018

Under a minute for a medium sized state.

Next we wrap the getOneState() function in an lapply() to loop through all the states. In the example below, we’ll just do 3 states because running through all the states takes a long time.

###
# Step 2 : wrap the getOneState() function in lapply() to do all the states
system.time(
  ## NOTE the [1:3] in the line below limits the job to the AL,AK and AZ remove those
  ## characters to do the entire US -- which takes about 1.5 hours
  allstates<- lapply(fipdat$stusps[1:3],
                    function(cstate){getOneState(cstate)}
                     ))
## [1] "AL"
## [1] "Reading data on  Alabama ..."
## [1] ""
## [1] "reading  ZTranAsmt : Alabama : Main.txt ..."
## [1] "Variables: RowID"             "Variables: State"            
## [3] "Variables: County"            "Variables: LotSizeSquareFeet"
## [1] "reading  ZTranAsmt : Alabama : SaleData.txt ..."
## [1] "Variables: RowID"
## [1] "reading  ZTranAsmt : Alabama : Value.txt ..."
## [1] "Variables: RowID"             "Variables: LandAssessedValue"
## [1] "AK"
## [1] "Reading data on  Alaska ..."
## [1] ""
## [1] "reading  ZTranAsmt : Alaska : Main.txt ..."
## [1] "Variables: RowID"             "Variables: State"            
## [3] "Variables: County"            "Variables: LotSizeSquareFeet"
## [1] "reading  ZTranAsmt : Alaska : SaleData.txt ..."
## [1] "Variables: RowID"
## [1] "reading  ZTranAsmt : Alaska : Value.txt ..."
## [1] "Variables: RowID"             "Variables: LandAssessedValue"
## [1] "AZ"
## [1] "Reading data on  Arizona ..."
## [1] ""
## [1] "reading  ZTranAsmt : Arizona : Main.txt ..."
## [1] "Variables: RowID"             "Variables: State"            
## [3] "Variables: County"            "Variables: LotSizeSquareFeet"
## [1] "reading  ZTranAsmt : Arizona : SaleData.txt ..."
## [1] "Variables: RowID"
## [1] "reading  ZTranAsmt : Arizona : Value.txt ..."
## [1] "Variables: RowID"             "Variables: LandAssessedValue"
##    user  system elapsed 
## 151.689  54.312 113.054

Doing it for just 3 states takes about 117 seconds. If you modify the code to do the entire US, it will take About 96 minutes and result in an object that takes up about 15GB of RAM. It’s easy to do, just remove the “[1:3]” from the above code.

Clever readers may see the opportunity to use doParallel to parallel process. Please be cautious. You can speed the process, by using a small number or cores, (<3). But disks can only spin so fast before they catch fire. Using too many cores will not reduce the overall time it takes to read all that data, but it will cause bottlenecks that affect other users – and it will be obvious who is responsible.

The above code results in a list object each element being an data.table with the same number and position of columns. To convert that to a single data.table, we can use the code below:

allstates.dt <- dplyr::bind_rows(allstates)

lapply(allstates,function(x){dim(x)})
## [[1]]
## [1] 3034826       5
## 
## [[2]]
## [1] 1170683       5
## 
## [[3]]
## [1] 2991898       5
dim(allstates.dt)
## [1] 7197407       5

Notes on rolling your own read routines.

On a good day, with a tail wind, the demogztrax::ztrax.fread() function is all you need to read the Zillow data that you need. But if your needs are complicated, you can use other means to read and process the data. All the data files have ‘.txt’ suffix and are “pipe”(|) delimited. There are two suboptimalities about them:

/ztrax-data/raw/2019/Zillow_Assessor/LayoutAsmtHistory.xlsx\ /ztrax-data/raw/2019/Zillow_Transaction/Layout.xlsx\ /ztrax-data/clean/2019/Zillow_Assessor/LayoutAsmtHistory.xlsx\

File structure of 2019 Ztrax data

/data-ztrax  (or ~/ZTRAX)
└── raw
    └── 2019
        ├── Zillow_Assessor
        │   ├── 01
        │   │   └── ZAsmt
        │   │       ├── BuildingAreas.txt
        │   │       ├── Building.txt
        │   │       ├── Garage.txt
        │   │       ├── LotSiteAppeal.txt
        │   │       ├── Main.txt
        │   │       └── Value.txt
        │   ├── 02
        │   ├── 03
        │    .
        │    .
        │   ├── 66
    
    
        ├── Zillow_Transaction
        │   ├── 01
        │   │   ├── ZAsmt
        │   │   │   ├── AdditionalPropertyAddress.txt
        │   │   │   ├── BKManagedSpecific.txt
        │   │   │   ├── BuildingAreas.txt
        │   │   │   ├── Building.txt
        │   │   │   ├── CareOfName.txt
        │   │   │   ├── ExteriorWall.txt
        │   │   │   ├── ExtraFeature.txt
        │   │   │   ├── Garage.txt
        │   │   │   ├── InteriorFlooring.txt
        │   │   │   ├── InteriorWall.txt
        │   │   │   ├── LotSiteAppeal.txt
        │   │   │   ├── MailAddress.txt
        │   │   │   ├── Main.txt
        │   │   │   ├── Name2.txt
        │   │   │   ├── Name.txt
        │   │   │   ├── Oby.txt
        │   │   │   ├── Pool.txt
        │   │   │   ├── SaleData.txt
        │   │   │   ├── TaxDistrict.txt
        │   │   │   ├── TaxExemption.txt
        │   │   │   ├── TypeConstruction.txt
        │   │   │   ├── Value.txt
        │   │   │   └── VestingCodes.txt
        │   │   └── ZTrans
        │   │       ├── BKManagedSpecific.txt
        │   │       ├── BorrowerMailAddress.txt
        │   │       ├── BorrowerNameDescriptionCode.txt
        │   │       ├── BorrowerName.txt
        │   │       ├── BuyerMailAddress.txt
        │   │       ├── BuyerNameDescriptionCode.txt
        │   │       ├── BuyerName.txt
        │   │       ├── ForeclosureNameAddress.txt
        │   │       ├── ForeclosureNODNOSDoc.txt
        │   │       ├── ForeclosureOriginalLoan.txt
        │   │       ├── HawaiiBorrowerNotes.txt
        │   │       ├── HawaiiGranteeNotes.txt
        │   │       ├── Legacy.txt
        │   │       ├── Main.txt
        │   │       ├── Modification.txt
        │   │       ├── PropertyInfo.txt
        │   │       ├── RegionSpecific.txt
        │   │       ├── SellerMailAddress.txt
        │   │       ├── SellerNameDescriptionCode.txt
        │   │       └── SellerName.txt
        │   ├── 01.zip
        │   ├── 02
        │   ├── 03
        │    .
        │    .
        │   ├── 66