I fully expect to spend some memorable times with family and friends this Christmas, but I will also remember December 2022 as the month when:
A fusion experiment yielded a net energy gain.
ChatGPT was made available for the masses to evaluate.
In this session I ask ChatGPT to write some RISC-V assembly. One needs to be very explicit to avoid some common mistakes ChatGPT makes on this task, such as not knowing how to handle floating-point operands and treating them as integers instead. This tendency probably reflects the fact that most examples it has seen used only integer arithmetic. However, ChatGPT manages to generate sensible floating-point code when hinted appropriately.
Once it has generated that code, I suggest ways to optimise it. To my amusement, ChatGPT understands my instructions and implements them, leading to faster code.
Abel:
Can you translate the following function into RISC-V assembly?
void madd(size_t size, float* dst, float* a, float* b, float* c) {
    for (size_t i = 0; i < size; ++i) {
        dst[i] = a[i] + b[i]*c[i];
    }
}
Please note that the arithmetic operations inside of the loop should be done in floating point. For example, you should use flw.s rather than lw for loading operands, when you feel that is better fitting.
Also, please exploit the fmadd.s instruction that perfectly fits in this use case.
ChatGPT:
Here is the RISC-V assembly code for the madd function:
# RISC-V assembly for the madd function
# Load the arguments
    la a0, a # load the address of a into a0
    la a1, b # load the address of b into a1
    la a2, c # load the address of c into a2
    la a3, dst # load the address of dst into a3
    li a4, size # load the value of size into a4
# Set up the loop
    addi a5, zero, 0 # initialise the loop counter to 0
loop:
    # Load the operands
    flw.s f0, 0(a0) # load a[i] into f0
    flw.s f1, 0(a1) # load b[i] into f1
    flw.s f2, 0(a2) # load c[i] into f2
    # Perform the floating point multiplication
    fmul.s f3, f1, f2 # compute b[i]*c[i] and store the result in f3
    # Perform the floating point addition
    fadd.s f4, f0, f3 # compute a[i] + b[i]*c[i] and store the result in f4
    # Store the result
    fsw.s f4, 0(a3) # store the result in dst[i]
    # Increment the loop counter and the pointers
    addi a5, a5, 1 # increment the loop counter
    addi a0, a0, 4 # increment the pointer to a
    addi a1, a1, 4 # increment the pointer to b
    addi a2, a2, 4 # increment the pointer to c
    addi a3, a3, 4 # increment the pointer to dst
    # Check if the loop should continue
    blt a5, a4, loop # if a5 < a4, then branch to loop
# Return from the function
    jr ra # jump to the return address
This code assumes that the RISC-V ABI is being used, and that the function takes four arguments: a, b, c, and dst, which are pointers to arrays of float values, and a fifth argument size, which is the number of elements in the arrays. It also assumes that the fmadd.s instruction is available, which performs a floating point multiply-add operation in a single cycle.
The code sets up a loop that iterates over the elements of the arrays, loads the operands from the arrays using the flw.s instruction, performs the multiplication and addition using the fmul.s and fadd.s instructions, and stores the result back into the dst array using the fsw.s instruction. The loop counter and the pointers to the arrays are incremented at the end of each iteration. The loop continues until the loop counter reaches the value of size.
Finally, the code returns from the function using the jr instruction, which jumps to the return address stored in the ra register.
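One thing worth checking in the listing above is its control flow: the branch sits at the bottom of the loop, so the body runs once before size is ever tested. The C sketch below places that shape next to the original for loop. The function names are mine, chosen for illustration, and the second function is only my reading of the generated assembly, not something ChatGPT produced.

```c
#include <stddef.h>

/* The function from the prompt, with the fifth parameter named c. */
void madd_reference(size_t size, float *dst, const float *a,
                    const float *b, const float *c) {
    for (size_t i = 0; i < size; ++i) {
        dst[i] = a[i] + b[i] * c[i];
    }
}

/* Hypothetical C rendering of the generated loop: the test is at the bottom. */
void madd_bottom_tested(size_t size, float *dst, const float *a,
                        const float *b, const float *c) {
    size_t i = 0;
    do {
        dst[i] = a[i] + b[i] * c[i];  /* runs once even when size == 0 */
        ++i;
    } while (i < size);
}
```

With size equal to 0 the first function leaves dst untouched while the second writes dst[0], so a faithful translation would need a guard branch before the loop label. Note also that blt compares its operands as signed values, whereas size_t is unsigned; bltu would match the C code more closely.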
Abel:
Would you actually replace the fmul and fadd pair of instructions in that program with fmadd.s?
ChatGPT:
Yes, it is generally recommended to use the fmadd.s instruction whenever possible, as it can perform a floating point multiply-add operation in a single cycle, which is faster than performing the multiplication and addition separately.
In the RISC-V assembly code that I provided, the fmul.s and fadd.s instructions can be replaced with a single fmadd.s instruction as follows:
    # Load the operands
    flw.s f0, 0(a0) # load a[i] into f0
    flw.s f1, 0(a1) # load b[i] into f1
    flw.s f2, 0(a2) # load c[i] into f2
    # Perform the floating point multiply-add
    fmadd.s f4, f1, f2, f0 # compute a[i] + b[i]*c[i] and store the result in f4
This reduces the number of instructions needed to perform the operation, which can improve the performance of the program.
It is important to note that the fmadd.s instruction is not available on all RISC-V processors, so it is important to check whether the instruction is supported before using it. If the fmadd.s instruction is not available, then it may be necessary to use the fmul.s and fadd.s instructions instead.
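A detail that the answer above leaves out is that fmadd.s is a fused operation: the product is not rounded before the addition, so the result can differ in the last bits from the fmul.s and fadd.s pair. The RISC-V specification also does not promise any particular latency, so the single-cycle claim depends on the core. The sketch below illustrates the rounding difference in portable C. The helper names are hypothetical, and fused_like only stands in for a fused multiply-add on these particular inputs by doing the arithmetic in double; it is not a general replacement for a real fused instruction.

```c
#include <float.h>

/* Two roundings, as with fmul.s followed by fadd.s. */
float mul_then_add(float a, float b, float c) {
    volatile float product = b * c;  /* force the product to be rounded to float */
    return a + product;
}

/* Illustration only: one rounding at the end, emulated through double.
   The product of two 24-bit significands fits exactly in double's 53 bits. */
float fused_like(float a, float b, float c) {
    double exact_product = (double)b * (double)c;
    return (float)((double)a + exact_product);
}
```

Taking b = c = 1 + FLT_EPSILON and a = -(1 + 2 * FLT_EPSILON), the two-step version returns 0 because the tiny tail of the product is rounded away, while the single-rounding version keeps it and returns FLT_EPSILON * FLT_EPSILON.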
Abel:
Now, show me the whole program for translating the madd function, and not only the inner instructions of the loop.
ChatGPT:
Here is the complete RISC-V assembly code for the madd function:
# RISC-V assembly for the madd function
# Load the arguments
    la a0, a # load the address of a into a0
    la a1, b # load the address of b into a1
    la a2, c # load the address of c into a2
    la a3, dst # load the address of dst into a3
    li a4, size # load the value of size into a4
# Set up the loop
    addi a5, zero, 0 # initialise the loop counter to 0
loop:
    # Load the operands
    flw.s f0, 0(a0) # load a[i] into f0
    flw.s f1, 0(a1) # load b[i] into f1
    flw.s f2, 0(a2) # load c[i] into f2
    # Perform the floating point multiply-add
    fmadd.s f4, f1, f2, f0 # compute a[i] + b[i]*c[i] and store the result in f4
    # Store the result
    fsw.s f4, 0(a3) # store the result in dst[i]
    # Increment the loop counter and the pointers
    addi a5, a5, 1 # increment the loop counter
    addi a0, a0, 4 # increment the pointer to a
    addi a1, a1, 4 # increment the pointer to b
    addi a2, a2, 4 # increment the pointer to c
    addi a3, a3, 4 # increment the pointer to dst
    # Check if the loop should continue
    blt a5, a4, loop # if a5 < a4, then branch to loop
# Return from the function
    jr ra # jump to the return address
This code defines a function called madd that takes four arguments: a, b, c, and dst, which are pointers to arrays of float values, and a fifth argument size, which is the number of elements in the arrays. The function performs a loop that iterates over the elements of the arrays, loads the operands from the arrays using the flw.s instruction, performs the multiply-add operation using the fmadd.s instruction, and stores the result back into the dst array using the fsw.s instruction. The loop counter and the pointers to the arrays are incremented at the end of each iteration. The loop continues until the loop counter reaches the value of size.
Finally, the code returns from the function using the jr instruction, which jumps to the return address stored in the ra register.
Abel:
Can you generate a smaller number of instructions in the inner loop by applying the following transformation?
Use the final value of a0 as a way of detecting the end of loop condition, instead of maintaining an iteration index on a5. That should make the inner loop a bit shorter.
ChatGPT:
Yes, it is possible to modify the inner loop of the madd function to use the final value of a0 as a way of detecting the end of the loop, rather than maintaining an iteration index in the a5 register. Here is the modified RISC-V assembly code for the madd function:
# RISC-V assembly for the madd function
# Load the arguments
    la a0, a # load the address of a into a0
    la a1, b # load the address of b into a1
    la a2, c # load the address of c into a2
    la a3, dst # load the address of dst into a3
    li a4, size # load the value of size into a4
# Set up the loop
    addi a5, a0, 4*size # initialise the loop counter to the final value of a0
loop:
    # Load the operands
    flw.s f0, 0(a0) # load a[i] into f0
    flw.s f1, 0(a1) # load b[i] into f1
    flw.s f2, 0(a2) # load c[i] into f2
    # Perform the floating point multiply-add
    fmadd.s f4, f1, f2, f0 # compute a[i] + b[i]*c[i] and store the result in f4
    # Store the result
    fsw.s f4, 0(a3) # store the result in dst[i]
    # Increment the pointers
    addi a0, a0, 4 # increment the pointer to a
    addi a1, a1, 4 # increment the pointer to b
    addi a2, a2, 4 # increment the pointer to c
    addi a3, a3, 4 # increment the pointer to dst
    # Check if the loop should continue
    blt a0, a5, loop # if a0 < a5, then branch to loop
# Return from the function
    jr ra # jump to the return address
In this modified version of the code, the loop counter is initialised to the final value of a0, which is the address of the last element in the a array. The loop continues until a0 reaches this final value, at which point the loop condition is no longer satisfied and the loop terminates. This eliminates the need to maintain a separate loop counter and reduces the number of instructions in the inner loop.
It is important to note that this modified version of the code still assumes that the RISC-V ABI is being used, and that the function takes four arguments: a, b, c, and dst, which are pointers to arrays of float values, and a fifth argument size, which is the number of elements in the arrays. It also assumes that the fmadd.s instruction is available, which performs a floating point multiply-add operation in a single cycle.
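The same transformation can be written in C, which makes the end condition easier to inspect. This is a sketch under my own assumptions: the name madd_end_pointer is made up, and the end address is computed as a + size, that is, one past the last element rather than the last element itself. In assembly that address has to be computed at run time from the size register, for example with a shift followed by an add, because addi only accepts a constant immediate.

```c
#include <stddef.h>

/* Sketch of the end-pointer loop: no index, the a pointer doubles as the counter. */
void madd_end_pointer(size_t size, float *dst, const float *a,
                      const float *b, const float *c) {
    const float *a_end = a + size;  /* one past the last element of a */
    while (a != a_end) {            /* tested at the top, so size == 0 does nothing */
        *dst = *a + (*b) * (*c);
        ++dst;                      /* advance all four pointers by one float */
        ++a;
        ++b;
        ++c;
    }
}
```

Each iteration now does four pointer increments and one comparison, with no separate index to maintain, which is where the saving of one instruction in the inner loop comes from.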
Overall, I am quite impressed and looking forward to seeing how this technology allows engineers like me to approach new challenges.