Little is known about how children under 5 years of age respond to electronic texts. Traditional methods of transcription can record spoken language and paralinguistic features, but not the relations between children’s non-verbal behaviour (e.g. gaze, gesture, facial expressions) and the visual elements that are the focus of their attention. In this paper, drawing on naturalistic videotaped data from four children aged 4–5 years interacting with I Spy CD-ROMs (Scholastic), we offer an innovative method of transcription that can help us understand children’s responses in depth. The method captures each child’s language, body posture, facial expressions and gestures in relation to the visual image and game sounds to which they are attending. Our detailed observations suggest that the manner in which young children engage with e-games varies according to the social context, the textual features of the e-game and the children’s proficiency in using computer hardware and software. We then discuss several implications for educators, including the need for teachers to be sensitive to the affordances offered by various kinds of software and different genres. The composition of the social group engaging with multimodal texts is another important consideration for educators who wish to support children’s ‘multiliteracy’ development.