Why does a many to many LSTM model only work well with a sequence of length 1 at test time and fail otherwise?

Hello everyone, a noob here. I'd be grateful if you could help me understand why this is the case and how I can solve this issue.
Basically, I followed Udacity's PyTorch IPython notebooks here and worked through the Character RNN example. I wrote everything out and it all works fine there. However, today I noticed we have different RNN/LSTM types, which are as follows:

  1. many to many

  2. many to one

  3. one to many

  4. one to one

and apparently taking text as input and outputting text makes this a many to many, or sequence to sequence, type!
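For context, the same nn.LSTM module in PyTorch handles all of these cases; what changes is the sequence length of the input and how much of the output you use. A minimal sketch of the shapes (hidden_size and num_layers here are my own assumptions, just to illustrate):

import torch
import torch.nn as nn

# toy 2-layer LSTM over an 83-character vocabulary (sizes are assumptions)
lstm = nn.LSTM(input_size=83, hidden_size=256, num_layers=2, batch_first=True)

x_train = torch.zeros(1, 100, 83)   # many to many: 100 one-hot steps in
out, hidden = lstm(x_train)         # out: (1, 100, 256), one vector per step

x_gen = torch.zeros(1, 1, 83)       # generation: a single step in
out, hidden = lstm(x_gen, hidden)   # out: (1, 1, 256); hidden carries the context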

I noticed that during training we simply feed a sequence of 100 characters and get an output sequence of 100 characters. So far so good! But when it comes to generating text ourselves, I noticed the author used a single input (one character long!) and, using that, she generated a lot of text that looked good. By looking good, I mean the words and punctuation were mostly correct; there were actual words and phrases, not something random!

However, I tried to see whether feeding multiple characters at once would generate the same output, or at least something similar. To my surprise, the output was garbage!

Here is what the original sampling functions look like:

def predict(model, input_char, hidden_states, char2int, int2char, length, topk, device):
    # 1. convert the char into an int, then one-hot encode it
    input_int = np.array([char2int[input_char]]).reshape(1,-1)
    input_one_hot = one_hot_encode(input_int, length)
    input_tensor = torch.from_numpy(input_one_hot).to(device)

    output, hidden_states = model(input_tensor, hidden_states)
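    # detach the hidden state from the autograd graph so the next call
    # doesn't drag the whole generation history along with it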
    hidden_states = tuple(h.data for h in hidden_states)

    output = torch.nn.functional.softmax(output, dim=1).data

    if topk is None:
        # no topk given: keep the full distribution over all `length` characters
        probs, top_characters = output.topk(length)
    else:
        probs, top_characters = output.topk(topk)

    top_characters = top_characters.cpu().numpy().squeeze()
    probs = probs.cpu().numpy().squeeze()

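    # sample the next character from the renormalized top-k probabilities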
    char = np.random.choice(top_characters, p=probs/probs.sum())

    return int2char[char], hidden_states

def sample(model, size, string_prime, topk, device):
    model.eval() 
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
    chars = [c for c in string_prime]
    w = next(model.parameters()).data
    h = (w.new_zeros(model.num_layers, 1, model.hidden_size).to(device),
         w.new_zeros(model.num_layers, 1, model.hidden_size).to(device))

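    # run the whole prime through first to warm up the hidden state;
    # only the prediction after the last prime character is kept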
    for c in string_prime:
        char, h = predict(model, c, h, model.char2int, model.int2char, 83, 5, device)
    chars.append(char)

    for _ in range(size):
        char, h = predict(model, chars[-1], h, model.char2int, model.int2char, 83, 5, device)
        chars.append(char)

    return ''.join(chars)
    
sample(model, 1000,'The time', 5, device)
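For completeness, the one_hot_encode helper both functions rely on comes from the notebook; as far as I remember it looks roughly like this:

import numpy as np

def one_hot_encode(arr, n_labels):
    # one row of length n_labels per integer in arr
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.0
    # restore the original (batch, seq_len) layout, with a one-hot axis added
    return one_hot.reshape((*arr.shape, n_labels))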

And this is my version, which produces garbage output on the very same model for which the functions above produce very good results:

def predict2(model, input_string, hidden_states, char2int, int2char, length, topk, device):
    # 1. convert the string into ints, then one-hot encode it
    input_int = np.array([char2int[ch] for ch in input_string]).reshape(1, -1)
    input_one_hot = one_hot_encode(input_int, length)
    input_tensor = torch.from_numpy(input_one_hot).to(device)

    output, hidden_states = model(input_tensor, hidden_states)
    # detach the hidden state from the graph (honestly, I still don't fully understand this part!)
    hidden_states = tuple(h.data for h in hidden_states)

    # the output is a vector of scores and we want a probability
    # distribution, so we apply softmax; we also use the .data attribute
    # since we only need the values, not the grads
    output = torch.nn.functional.softmax(output, dim=1).data

    # now let's take the topk most probable characters and sample among them
    if topk is None:
        # no topk given: keep the full distribution over all `length` characters
        probs, top_characters = output.topk(length)
    else:
        probs, top_characters = output.topk(topk)

    top_characters = top_characters.cpu().numpy().squeeze()
    probs = probs.cpu().numpy().squeeze()
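    # sample one character independently from each time step's distribution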
    chars = []
    for i in range(probs.shape[0]):
        char = np.random.choice(top_characters[i], p=probs[i]/probs[i].sum())
        chars.append(char)

    return [int2char[char] for char in chars], hidden_states

def sample2(model, size, string_prime, topk, device):
    model.eval() 
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 
    # the output starts with the prime text; each iteration we feed the
    # current string in full and append whatever characters come back
    w = next(model.parameters()).data
    h = (w.new_zeros(model.num_layers, 1, model.hidden_size).to(device),
         w.new_zeros(model.num_layers, 1, model.hidden_size).to(device))

    chars = [string_prime]

    for _ in range(size):
        string_prime, h = predict2(model, string_prime, h, model.char2int, model.int2char, 83, 5, device)
        chars.append(''.join(string_prime))
    return ''.join(chars) 

print('sample 2: ') 
sample2(model, 20,'The time', 5, device)
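To make the difference between the two concrete: predict consumes one character per forward pass, while predict2 consumes the whole current string per pass and samples a character from every time step's distribution at once. Roughly (the flattened (seq_len, n_chars) output shape is my assumption of what the model returns):

# shapes per call for the 8-character prime 'The time' and 83 characters,
# assuming the model flattens batch and time into (batch * seq_len, n_chars):
# predict:  input (1, 1, 83) -> softmax over (1, 83) -> sample 1 char
# predict2: input (1, 8, 83) -> softmax over (8, 83) -> sample 8 chars,
#           one from each row, all in a single forward pass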

Example outputs:

Sample 1:

print('sample 1: ')
sample(model, 100,'The time', 5, device)

outputs:

sample 1:
‘The time, and no\none in a child her moist sly cress swing on his eyes as though some of Varenka had\ncranced f’

Sample 2:

print('sample 2: ')
sample2(model, 20,'The time', 5, device)

outputs:

sample 2:
‘The timerirohme ,air adbidtsfsmatei\nuianednsetldt e,.uy.ebx _o. ues u\nHn\n pon\neigmesscithad ,eeisn-p s,n, lrdi s toessepco s\n lshrb\nimaceaimnpiisttyeynl ii d eywnmmiy’

My question is: isn't it supposed to work either way? Why does it fail with multi-character input and only work with single-character input?
Clearly the loss decreases and the network learns something, so why can it only generate with sequences of length 1? Do I need to do something else in PyTorch to get this to work?

Thank you very much in advance