Learning how to clean and build your datasets to get the best models is something we pick up by doing multiple competitions. Jeremy covers some of this in lessons 2 and 3.

But if you really want to get started on this competition, think about why Jeremy took the log of the sale price in his example. My take: it simplified the explanation and the Python notebook.

In reality, the log price is needed only when we calculate the RMSLE on our validation set. So instead of using the RMSE function:

def rmse(x,y): return math.sqrt(((x-y)**2).mean())

where x and y are log prices.

You can actually do:

def rmsle(x,y): return math.sqrt(((np.log1p(x)-np.log1p(y))**2).mean())

where x and y are normal (no log) prices.
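To check that the two approaches give the same number, here is a minimal sketch. The prices are made up for illustration, and it assumes the log prices passed to rmse were produced with np.log1p (if they were produced with plain np.log, the two results differ slightly).

```python
import math
import numpy as np

def rmse(x, y):
    # Root mean squared error; x and y are expected to be log prices here
    return math.sqrt(((x - y) ** 2).mean())

def rmsle(x, y):
    # Root mean squared log error; x and y are raw (no log) prices
    return math.sqrt(((np.log1p(x) - np.log1p(y)) ** 2).mean())

# Made-up predictions and actual prices, for illustration only
preds = np.array([10000.0, 25000.0, 40000.0])
actuals = np.array([12000.0, 20000.0, 45000.0])

score_from_logs = rmse(np.log1p(preds), np.log1p(actuals))
score_from_raw = rmsle(preds, actuals)

# Both routes should agree up to floating point error
print(math.isclose(score_from_logs, score_from_raw))
```

So the log can live inside the metric rather than in the data, as long as the same log function is used in both places.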

Similarly, in the Google competition you need the log of each user's transactions, but only during validation. So run your random forest as-is, without logs. When you need to check your predictions, compare the log of the sum of each user's transactions in the test and validation sets.
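A rough sketch of that per-user check might look like the following. The column names (user_id, revenue, pred) and the numbers are hypothetical, not taken from the competition data, and it assumes the score uses np.log1p on the per-user sums, in line with the rmsle function above.

```python
import math
import numpy as np
import pandas as pd

def rmse(x, y):
    return math.sqrt(((x - y) ** 2).mean())

# Hypothetical validation rows, one per visit: actual revenue and the
# raw (no log) prediction from the model
valid = pd.DataFrame({
    "user_id": ["a", "a", "b", "c"],
    "revenue": [10.0, 5.0, 0.0, 20.0],
    "pred": [8.0, 6.0, 1.0, 18.0],
})

# Sum actuals and predictions per user first, then take the log
per_user = valid.groupby("user_id")[["revenue", "pred"]].sum()

# Apply the log only at scoring time
score = rmse(np.log1p(per_user["pred"]), np.log1p(per_user["revenue"]))
print(score)
```

The order matters here: the log of a sum is not the sum of the logs, so the aggregation per user has to happen before the log is applied.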